Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
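
A few of the ideas above can be sketched on a toy frame. This is only an illustration: the rows are made up, and the new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`) are hypothetical choices, not names from the dataset.

```python
import pandas as pd

# Toy rows that mimic a few renthop columns (values are made up for illustration).
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as absent.
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Pets: "or" and "and" give two different features.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

A beds-to-baths ratio would need a decision about listings with zero bathrooms, so it is left out of this sketch.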

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

In [2]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [3]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
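
The percentile filter above repeats the same pattern for each column. A minimal sketch of that pattern as a helper follows; `trim_extremes` is a hypothetical name, and it keeps both ends inclusive, which differs slightly from the strict `<` used for the upper latitude bound above.

```python
import numpy as np
import pandas as pd

def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies between two percentiles (inclusive)."""
    low, high = np.percentile(frame[column], [lower_pct, upper_pct])
    return frame[frame[column].between(low, high)]

# Toy data: prices 1..1000, trimmed to the middle 99%.
toy = pd.DataFrame({'price': np.arange(1, 1001)})
trimmed = trim_extremes(toy, 'price', 0.5, 99.5)
print(len(toy), len(trimmed))
```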

In [4]:
# Import block
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [5]:
# Despite filtering the most extreme prices, we still have the apartment with 10 bathrooms.
df[df['bathrooms'] >= 7]

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
1990,10.0,2,2016-04-09 04:34:31,***The building?s well-attended lobby welcomes...,W 52 St.,40.7633,-73.9849,3600,260 W 52 St.,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [6]:
# Let's add a little more filtering, on bedrooms and bathrooms.
df = df.query('bedrooms <= 7 and bathrooms <= 5')
print(df.shape)
df.head()

(48814, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
# Making a month feature to split data on
df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)

In [8]:
df['month'] = df['created'].dt.month

In [9]:
df['interest_level'].value_counts()

low       33943
medium    11181
high       3690
Name: interest_level, dtype: int64

In [10]:
# Mapping interest level to digits
interest_dict = {
    'low': 1,
    'medium': 2,
    'high': 3
}

df['interest_level'] = df['interest_level'].map(interest_dict)
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,3,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4


In [11]:
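# Adding features - first, the amenities feature we used last time, a sum of the amenity flags.
# Caution: 'month' was appended above, so after deleting the first 10 column names it is
# still in boolfeatures and gets added into the sum (row 0 below has no amenities, yet amenities == 6).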
boolfeatures = df.columns.tolist()
del boolfeatures[:10]
df['amenities'] = df[boolfeatures].sum(axis=1)
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,amenities
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,6
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,11
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,3,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,7
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,6
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,5
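
In the table above, `amenities` equals the count of amenity flags plus the value of `month`, because `month` sits after the first ten columns. A minimal sketch of the difference between slicing by position and excluding non-amenity columns by name, on a made-up frame (the `amenities_with_month` column exists only for comparison):

```python
import pandas as pd

# Toy frame: two non-amenity columns, three amenity flags, then 'month'.
toy = pd.DataFrame({
    'price': [3000, 5465],
    'interest_level': [2, 1],
    'elevator': [0, 1],
    'cats_allowed': [0, 1],
    'doorman': [0, 1],
    'month': [6, 6],
})

# Positional slice: everything after the first two columns, which includes 'month'.
by_position = toy.columns.tolist()[2:]

# Same slice, with the non-amenity column removed by name.
amenity_cols = [c for c in by_position if c != 'month']

toy['amenities_with_month'] = toy[by_position].sum(axis=1)
toy['amenities'] = toy[amenity_cols].sum(axis=1)
print(toy[['amenities_with_month', 'amenities']])
```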


In [41]:
# Second feature - a literal "Manhattan norm": straight-line distance from the Empire State Building.
# Despite the name, this is the Euclidean (L2) distance in degrees, not the L1 / taxicab norm.
df['Manhattan_norm'] = (((df['latitude']-40.7484)**2)+((df['longitude']-(-73.9857))**2))**0.5
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,amenities,Manhattan_norm
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,6,0.054913
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,11,0.050047
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,3,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,7,0.018745
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,6,0.018822
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,5,0.083997
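
The `Manhattan_norm` column is the square root of summed squared differences in latitude and longitude, measured in degrees. For comparison, here is a sketch of that distance next to the taxicab (L1) distance and a great-circle distance in kilometres. The function names and the 6371 km mean Earth radius are assumptions of this sketch; the reference point is the one used in the cell above.

```python
import math

ESB_LAT, ESB_LON = 40.7484, -73.9857  # reference point used in the notebook

def euclidean_degrees(lat, lon, ref_lat=ESB_LAT, ref_lon=ESB_LON):
    """L2 distance in raw degrees (what Manhattan_norm computes)."""
    return math.sqrt((lat - ref_lat) ** 2 + (lon - ref_lon) ** 2)

def taxicab_degrees(lat, lon, ref_lat=ESB_LAT, ref_lon=ESB_LON):
    """L1 (taxicab) distance in raw degrees."""
    return abs(lat - ref_lat) + abs(lon - ref_lon)

def haversine_km(lat, lon, ref_lat=ESB_LAT, ref_lon=ESB_LON, radius_km=6371.0):
    """Great-circle distance in km, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat), math.radians(ref_lat)
    dphi = phi2 - phi1
    dlmb = math.radians(ref_lon - lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# First row of the table above: 792 Metropolitan Avenue.
lat, lon = 40.7145, -73.9425
print(euclidean_degrees(lat, lon))  # matches the 0.054913 shown for row 0
print(taxicab_degrees(lat, lon))
print(haversine_km(lat, lon))
```

One degree of longitude is shorter than one degree of latitude at New York's latitude, so the degree-based distances weight east-west offsets more heavily than a kilometre-based distance would.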


In [42]:
# Train-test split.
train = df[(df['month'] == 4) | (df['month'] == 5)]
test = df[df['month'] == 6]

In [43]:
# Define our features and target.
features = ['bathrooms','bedrooms','interest_level','amenities','Manhattan_norm']
target = 'price'

# Instantiate X and y for train and test.
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# Instantiate and fit model
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [44]:
model.predict([[1,2,2,5,0.050]])

array([2770.23521837])

In [45]:
# Get coefficients and intercept
print(model.coef_)
print(model.intercept_)

[  1819.74931961    466.73432016   -450.83203797     54.58363743
 -13952.07925892]
1343.367110195189
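
A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. As a quick check, the sketch below plugs the printed coefficients and intercept into that formula for the row passed to `model.predict` earlier. The values are copied from the outputs above, so they carry only the printed precision.

```python
# Coefficients in feature order:
# bathrooms, bedrooms, interest_level, amenities, Manhattan_norm
coefficients = [1819.74931961, 466.73432016, -450.83203797, 54.58363743, -13952.07925892]
intercept = 1343.367110195189

# Same row that was passed to model.predict above.
row = [1, 2, 2, 5, 0.050]

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, row))
print(round(prediction, 2))  # 2770.24, in line with the model.predict output
```

Read this way, each coefficient is the change in predicted rent for a one-unit change in that feature with the others held fixed. The large `Manhattan_norm` coefficient reflects that the feature is measured in degrees, where a unit is a very long distance.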


In [46]:
# Predictions for train and test
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [47]:
# Error metrics - MAE, RMSE, R^2

# MAE
print('MAE')
print('train',mean_absolute_error(y_train, y_pred_train))
print('test',mean_absolute_error(y_test, y_pred_test))
print('_________________')

# RMSE
print('RMSE')
print('train',np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('test',np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('__________________')

#R^2
print('R^2')
print('train',r2_score(y_train, y_pred_train))
print('test',r2_score(y_test, y_pred_test))

MAE
train 688.7518339254773
test 701.6494662996475
_________________
RMSE
train 1081.0791345269295
test 1067.512908940707
__________________
R^2
train 0.623611987288516
test 0.6330395580230169
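
To show what the three metrics measure, here is a standard-library sketch of their textbook definitions on three made-up rents. `regression_metrics` is a hypothetical helper, not part of scikit-learn.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up rents and predictions.
mae, rmse, r2 = regression_metrics([3000, 2500, 4000], [2800, 2500, 4400])
print(mae, rmse, r2)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and grows faster when a few predictions are far off, which fits the gap between the two in the output above.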


In [48]:
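# Some feature optimization? Why not.
# Two one-dimensional searches, one for lat and one for long. This is interval halving
# (compare the error at both ends, move the worse end to the midpoint), not gradient descent.
# Note that the error being minimized is the test MAE, so the tuned test score is optimistic.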
df2 = df.copy()
latmin = df2['latitude'].min()
latmax = df2['latitude'].max()
bounddict = {
    'lowbound': latmin,
    'midbound': ((latmin+latmax)/2),
    'upbound': latmax
}

for i in range(1,100):
    bounderrors={}
    # Set the central value according to the bounds
    bounddict['midbound'] = (bounddict['lowbound']+bounddict['upbound'])/2
    for key,value in bounddict.items():
        # Set the feature's parameter to the bound we're testing
        df2['Manhattan_norm'] = (((df2['latitude']-value)**2)+((df2['longitude']-(-73.9857))**2))**0.5
        
        # Split the data
        train2 = df2[(df2['month'] == 4) | (df2['month'] == 5)]
        test2 = df2[df2['month'] == 6]
        
        # Instantiate X and y for train and test.
        X_train2 = train2[features]
        y_train2 = train2[target]
        X_test2 = test2[features]
        y_test2 = test2[target]

        # Instantiate and fit model
        model2 = LinearRegression()
        model2.fit(X_train2,y_train2)
        
        # Predictions for train and test
        y_pred_train2 = model2.predict(X_train2)
        y_pred_test2 = model2.predict(X_test2)
        
        # Get the error for the value
        bounderrors[key] = mean_absolute_error(y_test2, y_pred_test2)
    #Eliminate whichever extremal bound is worse
    if bounderrors['lowbound'] > bounderrors['upbound']:
        bounddict['lowbound'] = bounddict['midbound']
    else:
        bounddict['upbound'] = bounddict['midbound']

In [49]:
print(bounddict)
print(bounderrors)

{'lowbound': 40.73083749999999, 'midbound': 40.73083749999999, 'upbound': 40.7308375}
{'lowbound': 696.324375108245, 'midbound': 696.324375108245, 'upbound': 696.3243751082398}
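
The loop above shrinks the interval by comparing the error at its two ends. A related textbook technique for a one-dimensional function with a single minimum is ternary search, which compares two interior points instead. The sketch below is a swapped-in illustration, not the notebook's method, and `toy_error` is a made-up stand-in for "fit the model and return the MAE".

```python
def ternary_search_min(f, low, high, tol=1e-9, max_iter=200):
    """Locate the minimizer of a unimodal function f on [low, high]."""
    for _ in range(max_iter):
        if high - low < tol:
            break
        third = (high - low) / 3
        left, right = low + third, high - third
        if f(left) < f(right):
            high = right   # the minimum cannot lie in (right, high]
        else:
            low = left     # the minimum cannot lie in [low, left)
    return (low + high) / 2

def toy_error(x):
    # Hypothetical error curve with its minimum at 40.73.
    return abs(x - 40.73) + 696.0

best = ternary_search_min(toy_error, 40.0, 41.0)
print(best)
```

Both approaches rely on the error having a single minimum inside the interval. If it does not, the search can settle on an endpoint or a local minimum.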


In [50]:
# The bounds have converged to about 40.7308375 (the top bound's error is marginally lower),
# so we'll set the latitude parameter in Manhattan_norm accordingly.
df2['Manhattan_norm'] = (((df2['latitude']-40.7308375)**2)+((df2['longitude']-(-73.9857))**2))**0.5

In [51]:
# Same story for the longitude.

longmin = df2['longitude'].min()
longmax = df2['longitude'].max()
bounddict = {
    'lowbound': longmin,
    'midbound': ((longmin+longmax)/2),
    'upbound': longmax
}

for i in range(1,100):
    bounderrors={}
    # Set the central value according to the bounds
    bounddict['midbound'] = (bounddict['lowbound']+bounddict['upbound'])/2
    for key,value in bounddict.items():
        # Set the feature's parameter to the bound we're testing
        df2['Manhattan_norm'] = (((df2['latitude']-40.7308375)**2)+((df2['longitude']-(value))**2))**0.5
        
        # Split the data
        train2 = df2[(df2['month'] == 4) | (df2['month'] == 5)]
        test2 = df2[df2['month'] == 6]
        
        # Instantiate X and y for train and test.
        X_train2 = train2[features]
        y_train2 = train2[target]
        X_test2 = test2[features]
        y_test2 = test2[target]

        # Instantiate and fit model
        model2 = LinearRegression()
        model2.fit(X_train2,y_train2)
        
        # Predictions for train and test
        y_pred_train2 = model2.predict(X_train2)
        y_pred_test2 = model2.predict(X_test2)
        
        # Get the error for the value
        bounderrors[key] = mean_absolute_error(y_test2, y_pred_test2)
    #Eliminate whichever extremal bound is worse
    if bounderrors['lowbound'] > bounderrors['upbound']:
        bounddict['lowbound'] = bounddict['midbound']
    else:
        bounddict['upbound'] = bounddict['midbound']

In [52]:
print(bounddict)
print(bounderrors)

{'lowbound': -74.0147, 'midbound': -74.0147, 'upbound': -74.0147}
{'lowbound': 684.8271985947569, 'midbound': 684.8271985947569, 'upbound': 684.8271985947569}


In [53]:
# Nice convergence: all three bounds agree at -74.0147.

In [54]:
# Take it from the top with our original dataframe.

# Rebuild the feature with the tuned reference point (latitude 40.7308375, longitude -74.0147)
df['Manhattan_norm'] = (((df['latitude']-40.7308375)**2)+((df['longitude']-(-74.0147))**2))**0.5
        
# Split the data
train = df[(df['month'] == 4) | (df['month'] == 5)]
test = df[df['month'] == 6]
        
# Instantiate X and y for train and test.
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# Instantiate and fit model
model = LinearRegression()
model.fit(X_train,y_train)
        
# Predictions for train and test
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [57]:
# Error metrics - MAE, RMSE, R^2

# MAE
print('MAE')
print('train',mean_absolute_error(y_train, y_pred_train))
print('test',mean_absolute_error(y_test, y_pred_test))
print('_________________')

# RMSE
print('RMSE')
print('train',np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('test',np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('__________________')

#R^2
print('R^2')
print('train',r2_score(y_train, y_pred_train))
print('test',r2_score(y_test, y_pred_test))

MAE
train 670.3370340741288
test 684.8271985947569
_________________
RMSE
train 1067.6795354888275
test 1055.3217303580102
__________________
R^2
train 0.6328845602209141
test 0.6413731999858721
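
The reference point was chosen by minimizing the June (test) MAE, so the test score above has already been used for tuning. One common alternative, sketched here as an assumption rather than as part of the assignment, is a three-way split by month: fit on April, tune on May, and look at June only once at the end. The rows below are made up.

```python
import pandas as pd

# Made-up listings spanning the three months in the dataset.
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-03', '2016-04-20', '2016-05-11',
                               '2016-06-02', '2016-06-24']),
    'price': [2850, 3275, 3100, 5465, 3000],
})
toy['month'] = toy['created'].dt.month

train = toy[toy['month'] == 4]   # fit the model here
val = toy[toy['month'] == 5]     # compare feature parameters here
test = toy[toy['month'] == 6]    # report the final score here, once
print(len(train), len(val), len(test))
```

After tuning, the model can be refit on April and May together before the final June evaluation, which keeps the assignment's original train/test split for the reported score.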
