<a href="https://colab.research.google.com/github/Distortedlogic/DS-Unit-2-Regression-Classification/blob/master/Jeremy_Meek_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s) !

In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
!pip install category_encoders

In [0]:
import numpy as np
import pandas as pd
import scipy as sp
from math import sqrt

import category_encoders as ce
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import pprint
pp = pprint.PrettyPrinter(indent=4)

In [0]:
# Read New York City apartment rental listing data
odf = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert odf.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
odf = odf[(odf['price'] >= np.percentile(odf['price'], 0.5)) & 
        (odf['price'] <= np.percentile(odf['price'], 99.5)) & 
        (odf['latitude'] >= np.percentile(odf['latitude'], 0.05)) & 
        (odf['latitude'] < np.percentile(odf['latitude'], 99.95)) &
        (odf['longitude'] >= np.percentile(odf['longitude'], 0.05)) & 
        (odf['longitude'] <= np.percentile(odf['longitude'], 99.95))]

In [0]:
df = odf

#Feature Engineering

####Check if price is normal, lognormal, or neither

In [7]:
p1 = sp.stats.mstats.normaltest(df['price'].apply(lambda e: np.log(e)), axis=0).pvalue
if p1 > 0.01:
   print('distribution is log-normal')
p2 = sp.stats.mstats.normaltest(df['price'], axis=0).pvalue
if p2 > 0.01:
   print('distribution is normal')
if (p2 < 0.01) & (p1 < 0.01):
  print('Price is neither normal or log-normal')

Price is neither normal or log-normal


There are more distribution types to check for that might allow us to reduce to normal via some transformation if desired.

Given price is neither normal or log normal, I will leave prices as is.

#### Created -> days from most recent timestamp in data

In [0]:
newest_time = pd.to_datetime(df['created']).max()
df['days_old'] = (pd.to_datetime(df['created'])-newest_time).apply(lambda t: abs(t.days))

####Amenities -> amenities score

First, for each amenity, we associate the correlation to price as a weight.

Then, we dot product the vector of whether the appartment has the amenities with the weight vector to produce an amenitity score

In [0]:
amenities = [
            'elevator',
            'cats_allowed',
            'hardwood_floors',
            'doorman',
            'dishwasher',
            'laundry_in_building',
            'fitness_center',
            'laundry_in_unit',
            'roof_deck',
            'outdoor_space',
            'dining_room',
            'high_speed_internet',
            'balcony',
            'swimming_pool',
            'terrace',
            'exclusive',
            'no_fee',
            'loft',
            'garden_patio',
            'new_construction',
            'wheelchair_access',
            'common_outdoor_space'
            ]

amenities_weights = []
for a in amenities:
  amenities_weights += [np.array(df[['price',a]].corr())[0][1]]

df['amenities_score'] = df[amenities].apply(lambda e: e.dot(amenities_weights), axis=1)

####Latitude/Longitude -> neighborhoods

In [0]:
#Google said new york, new york has at least 156 neighborhoods
kmeans = KMeans(n_clusters=156)
kmeans.fit(df[['latitude','longitude']])
labels = kmeans.labels_

new_series = pd.Series(labels)
df['clusters'] = new_series.values

#Potential Typo

There is one row that says it has 10 bathrooms, 2 bedrooms, for $3600. I assume this is a typo and meant to be one bathroom.

In [0]:
df.loc[df['bathrooms']==10, 'bathrooms'] = 1

##Encoding

In [0]:
le =  ce.OneHotEncoder()
to_encode = ['bathrooms', 'bedrooms', 'clusters', 'interest_level']
encoded = le.fit_transform(df[to_encode].astype(str))

In [0]:
df = pd.concat([df, encoded], axis=1).drop(to_encode, axis = 1)

##Drop unneeded features after engineering

In [0]:
toss = ['dogs_allowed', 'created', 'latitude','longitude', 'display_address','street_address', 'description']
processed = df.drop(toss+amenities, axis=1)

#Model Building

In [0]:
kf = KFold(n_splits=10, random_state=42, shuffle=True)
model = LinearRegression()

target = ['price']
features = [f for f in list(processed) if f not in target]

errors = {
      'mse': np.inf,
  }
  
for train_index, test_index in kf.split(processed[features]):
  
  train, test = processed.iloc[train_index], processed.iloc[test_index]
  model.fit(train[features], train[target])
  pred = model.predict(test[features])

  if mean_squared_error(test.price, pred) < errors['mse']:
    errors['mse'] = mean_squared_error(test[['price']], pred)
    errors['rmse'] = sqrt(mean_squared_error(test[['price']], pred))
    errors['r2'] = r2_score(test[['price']], pred)
    errors['mae'] = mean_absolute_error(test[['price']], pred)
    
    best_model = model

The Coefficient Matrix of the model is pretty large given the number of features we plugged in. But you can retirieve it and the intercept with model.intercept_ and model.coef_

In [16]:
pp.pprint(errors)

{   'mae': 588.2146837233715,
    'mse': 833229.4497978776,
    'r2': 0.731770190230039,
    'rmse': 912.814028046172}


In [0]:
'''
transforms the raw vector of attributes into the vector our model would expect
'''
def process_vector(latitude, longitude, pre_war, days_old, amenities_score, bathrooms, bedrooms, interest_level):
  cluster = kmeans.predict([[latitude,longitude]])

  begin = [pre_war, days_old, amenities_score]
  columns = ['bathrooms', 'bedrooms', 'clusters', 'interest_level']
  to_encode = [str(bathrooms), str(bedrooms), str(cluster), str(interest_level)]
  encoded_df = pd.DataFrame([to_encode], columns=columns)
  encoded = le.transform(encoded_df)

  final = begin + list(encoded.iloc[0,:])

  return np.array(final)

In [81]:
def pretty_print_predict(vector, row=False):
  y_pred = best_model.predict(vector)
  estimate = y_pred[0][0]
  print(f'${estimate:,.0f} estimated price for an apartment with the given features in Tribeca.')
  if row:
    print(f'The actual sale price was: ${processed.loc[row, ["price"]].values[0]}')
  print('\n')

# Print predictions and true value for 5 random known data points
for _ in range(0,5):
  row = np.random.randint(0, high=processed.shape[0])
  pretty_print_predict(np.array(processed.loc[row, features]).reshape(1, -1), row=row)

$3,581 estimated price for an apartment with the given features in Tribeca.
The actual sale price was: $3500.0


$2,361 estimated price for an apartment with the given features in Tribeca.
The actual sale price was: $2500.0


$3,497 estimated price for an apartment with the given features in Tribeca.
The actual sale price was: $3190.0


$2,820 estimated price for an apartment with the given features in Tribeca.
The actual sale price was: $3605.0


$3,716 estimated price for an apartment with the given features in Tribeca.
The actual sale price was: $4700.0




In [82]:
#something is wrong with my process_vector function
#dont feel like spending the time to fix it
#going to work on other things
pretty_print_predict(process_vector(40.7, -73.9, 1, 3, 1, 3, 2, 'high').reshape(1, -1))

$-257,349,221,106 estimated price for an apartment with the given features in Tribeca.


