<a href="https://colab.research.google.com/github/shengjiyang/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
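
A few of the ideas above can be sketched on a toy frame. This is only an illustration: the rows are made up, `perk_cols` lists just two of the amenity columns, and the new column names (`has_description`, `description_length`, `perk_count`, `cats_or_dogs`, `cats_and_dogs`) are hypothetical choices, not part of the dataset.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "description": ["Sunny two bedroom", None, "   "],
    "cats_allowed": [1, 0, 1],
    "dogs_allowed": [0, 0, 1],
    "elevator": [1, 0, 1],
    "doorman": [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description".
desc = toy["description"].fillna("").str.strip()
toy["has_description"] = (desc != "").astype(int)
toy["description_length"] = desc.str.len()

# Total perks: sum the 0/1 amenity columns (only two are listed here).
perk_cols = ["elevator", "doorman"]
toy["perk_count"] = toy[perk_cols].sum(axis=1)

# Cats OR dogs versus cats AND dogs.
cats = toy["cats_allowed"] == 1
dogs = toy["dogs_allowed"] == 1
toy["cats_or_dogs"] = (cats | dogs).astype(int)
toy["cats_and_dogs"] = (cats & dogs).astype(int)

print(toy[["has_description", "description_length", "perk_count",
           "cats_or_dogs", "cats_and_dogs"]])
```

On the real data, `perk_cols` would presumably hold every 0/1 amenity column from `elevator` through `common_outdoor_space`.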

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
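
The percentile filter above repeats the same pattern three times. A minimal sketch of that pattern as a helper follows; `trim_by_percentile` is a hypothetical name, the toy prices are made up, and the helper is inclusive at both ends (the cell above uses a strict `<` for the upper latitude bound).

```python
import numpy as np
import pandas as pd

def trim_by_percentile(frame, column, lower, upper):
    """Keep rows whose `column` value lies between the given percentiles (inclusive)."""
    lo, hi = np.percentile(frame[column], [lower, upper])
    return frame[frame[column].between(lo, hi)]

# Toy data: prices 1..100, made up for illustration.
toy = pd.DataFrame({"price": np.arange(1, 101)})

# Drop the most extreme 1% of prices: 0.5% from each tail.
trimmed = trim_by_percentile(toy, "price", 0.5, 99.5)
print(len(toy), "->", len(trimmed))
```

With 100 evenly spaced values, the 0.5th and 99.5th percentiles fall just inside the smallest and largest value, so exactly those two rows are removed.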

In [39]:
print(df.shape)
print(df.columns)
df.head()

(48817, 34)
Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [40]:
print(df.bathrooms.isnull().value_counts())
df.bedrooms.isnull().value_counts()

False    48817
Name: bathrooms, dtype: int64


False    48817
Name: bedrooms, dtype: int64

In [41]:
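# Engineer the new features before the train/test split,
# so that both sets receive the new columns.

# For my two new features,
# I will calculate the total number of rooms
# and the bed-to-bath ratio.
# (Listings with 0 bathrooms produce inf or NaN ratios; these are filtered out below.)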

df["total rooms"] = df["bathrooms"].values + df["bedrooms"].values
df["bed-to-bath-ratio"] = df["bedrooms"].values / df["bathrooms"].values
df.head()



Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total rooms,bed-to-bath-ratio
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,2.0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,2.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,4.0


In [42]:
df["bed-to-bath-ratio"].isnull().value_counts()

False    48666
True       151
Name: bed-to-bath-ratio, dtype: int64

In [50]:
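# Remove rows with NaN values; otherwise they cause problems when fitting the model.
# Note: dropna() drops rows with a NaN in any column, not only bed-to-bath-ratio,
# which is why the row count falls to 47109 rather than 48666.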

df = df.dropna()

print(df["bed-to-bath-ratio"].isnull().value_counts())
print(df["latitude"].isnull().value_counts())
print(df["longitude"].isnull().value_counts())

False    47109
Name: bed-to-bath-ratio, dtype: int64
False    47109
Name: latitude, dtype: int64
False    47109
Name: longitude, dtype: int64


In [94]:
# inf shows up in bed-to-bath-ratio wherever bathrooms is 0 and bedrooms is not
# (division by zero). I'll just remove all rows containing it.

df["bed-to-bath-ratio"].value_counts()

1.000000    18240
2.000000    11954
0.000000     8852
3.000000     3517
1.500000     2722
4.000000      352
1.333333      340
0.500000      195
0.666667      183
inf           153
1.200000      128
2.500000      120
0.800000       83
2.666667       60
1.600000       38
1.666667       33
1.250000       30
1.142857       27
0.857143       24
0.333333       14
0.750000       11
5.000000       10
0.888889        7
3.333333        5
1.428571        3
0.400000        2
2.400000        1
0.222222        1
6.000000        1
0.200000        1
2.333333        1
0.571429        1
Name: bed-to-bath-ratio, dtype: int64
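
The `inf` and `NaN` values come from floating-point division by zero. A small sketch on made-up rows shows both cases and one way to drop them with a single mask; `np.isfinite` is swapped in here as an alternative to the notebook's own filtering.

```python
import numpy as np
import pandas as pd

# Made-up listings: two of them have 0 bathrooms.
toy = pd.DataFrame({
    "bedrooms":  [2, 1, 0, 3],
    "bathrooms": [1.0, 0.0, 0.0, 1.5],
})

# Floating-point division: x / 0 is inf for x > 0, and 0 / 0 is NaN.
# errstate silences the RuntimeWarning NumPy would otherwise emit.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = toy["bedrooms"].values / toy["bathrooms"].values
toy["bed-to-bath-ratio"] = ratio
print(ratio)

# np.isfinite is False for both inf and NaN, so one mask removes both kinds of row.
clean = toy[np.isfinite(toy["bed-to-bath-ratio"])]
print(clean)
```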

In [99]:
# Keep only rows with a finite ratio (this drops the inf values)
df = df[np.isfinite(df['bed-to-bath-ratio'])]

df["bed-to-bath-ratio"].value_counts()



1.000000    18240
2.000000    11954
0.000000     8852
3.000000     3517
1.500000     2722
4.000000      352
1.333333      340
0.500000      195
0.666667      183
1.200000      128
2.500000      120
0.800000       83
2.666667       60
1.600000       38
1.666667       33
1.250000       30
1.142857       27
0.857143       24
0.333333       14
0.750000       11
5.000000       10
0.888889        7
3.333333        5
1.428571        3
0.400000        2
2.400000        1
0.222222        1
6.000000        1
2.333333        1
0.571429        1
0.200000        1
Name: bed-to-bath-ratio, dtype: int64

In [100]:
# Selecting data based on date to form the training and test sets:

train = df[df.created.str.contains("2016-04|2016-05")]
train.created.value_counts()

2016-05-14 05:23:52    3
2016-05-14 01:11:03    3
2016-05-27 03:59:28    3
2016-04-08 01:14:27    3
2016-04-15 02:24:25    3
                      ..
2016-05-23 02:54:14    1
2016-04-12 06:19:17    1
2016-05-20 06:42:00    1
2016-04-06 01:30:20    1
2016-04-13 03:31:11    1
Name: created, Length: 30251, dtype: int64

In [101]:
test = df[df.created.str.contains("2016-06")]
test.created.value_counts()

2016-06-11 01:20:36    3
2016-06-12 13:20:45    3
2016-06-16 04:08:35    3
2016-06-12 12:30:28    3
2016-06-21 04:44:43    3
                      ..
2016-06-28 02:44:04    1
2016-06-15 02:11:36    1
2016-06-16 05:50:16    1
2016-06-21 02:24:49    1
2016-06-22 05:34:10    1
Name: created, Length: 16110, dtype: int64
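
An alternative to matching date strings is to parse `created` and compare against a cutoff. This is a sketch on made-up rows; it assumes, as the assignment states, that the data only covers April through June 2016, so everything before June 1 is training data.

```python
import pandas as pd

# Made-up listings using the dataset's timestamp format.
toy = pd.DataFrame({
    "created": ["2016-04-17 03:26:41", "2016-05-02 10:00:00", "2016-06-24 07:54:24"],
    "price": [2850, 3100, 3000],
})

# Parse the strings once, then split on a single cutoff date.
created = pd.to_datetime(toy["created"])
cutoff = pd.Timestamp("2016-06-01")

train_toy = toy[created < cutoff]    # April and May
test_toy = toy[created >= cutoff]    # June

print(len(train_toy), len(test_toy))
```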

In [0]:
# Since we are expected to make two new features, it seems reasonable to use
# at least one of them, either as a feature or as the target.

features = ['latitude', 'longitude']
target = ['bed-to-bath-ratio']


X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

In [103]:
# LinearRegression().fit() also accepts a 1D target, but the cells below index
# model.intercept_[0] and model.coef_[0], which assumes a 2D target of shape (n, 1).
# Reshape the target column accordingly.

y_train = train['bed-to-bath-ratio'].values.reshape(-1, 1)
print(len(y_train))
y_train

30615


array([[1.],
       [1.],
       [4.],
       ...,
       [1.],
       [0.],
       [2.]])

In [104]:
# Same reshape for the test target: a 2D array of shape (n, 1)
y_test = test['bed-to-bath-ratio'].values.reshape(-1, 1)
print(len(y_test))
y_test

16341


array([[2.],
       [2.],
       [1.],
       ...,
       [1.],
       [2.],
       [2.]])
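
The shape of the target only changes the shape of the fitted attributes, not the fit itself. A small sketch with made-up numbers compares a 1D and a 2D target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: four rows, two features.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y_1d = np.array([1.0, 2.0, 4.0, 9.0])
y_2d = y_1d.reshape(-1, 1)

model_1d = LinearRegression().fit(X, y_1d)
model_2d = LinearRegression().fit(X, y_2d)

# 1D target: coef_ has shape (n_features,) and intercept_ is a scalar.
print(model_1d.coef_.shape, np.ndim(model_1d.intercept_))

# 2D target of shape (n, 1): coef_ has shape (1, n_features), intercept_ has shape (1,).
print(model_2d.coef_.shape, model_2d.intercept_.shape)
```

This is why `model.coef_[0]` and `model.intercept_[0]` are the right way to unpack the coefficients when the target has shape `(n, 1)`.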

In [109]:
from sklearn.linear_model import LinearRegression

# Straight-Line Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [115]:
beta0 = model.intercept_[0]
beta1, beta2 = model.coef_[0]

print(f'Bed to Bath Ratio = {beta1:.2f}latitude + {beta2:.2f}longitude + {beta0:.2f}')

Bed to Bath Ratio = -0.80latitude + 2.68longitude + 231.86


In [126]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae)
print("Testing Error:  ", test_mae)

Mean Absolute Error
-----------------------------------
Training Error:  0.701518415556419
Testing Error:   0.7056159999674124


In [127]:
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print("Mean Squared Error")
print("-" * 35)
print("Training Error: ", train_mse)
print("Testing Error:  ", test_mse)

Mean Squared Error
-----------------------------------
Training Error:  0.7589865898167637
Testing Error:   0.7650216629347462


In [129]:
from math import sqrt

print("Root Mean Squared Error")
print("-" * 34)
print("Training Error: ", sqrt(train_mse))
print("Testing Error:  ", sqrt(test_mse))

Root Mean Squared Error
----------------------------------
Training Error:  0.87119836421837
Testing Error:   0.874655168014656
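
To see how MAE, MSE, and RMSE relate, here is a worked example by hand on made-up numbers with one large miss:

```python
import numpy as np

# Made-up values: three perfect predictions and one that is off by 4.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # (0 + 0 + 0 + 4) / 4
mse = np.mean(errors ** 2)      # (0 + 0 + 0 + 16) / 4
rmse = np.sqrt(mse)             # back in the target's units

print(mae, mse, rmse)  # prints: 1.0 4.0 2.0
```

RMSE (2.0) is larger than MAE (1.0) here because squaring weights the single large error more heavily.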


In [135]:
print("R^2 Score")
print("-" * 37)
print("Training Score: ", r2_score(y_train, model.predict(X_train)))
print("Testing Score:  ", r2_score(y_test, model.predict(X_test)))

print('\nJudged by R^2, this straight-line model explains almost none')
print('of the variation in the target.')
print('\nLess than 1 percent of the variance in the bed-to-bath ratio')
print('is explained by latitude and longitude.')

R^2 Score
-------------------------------------
Training Score:  0.007087137928739384
Testing Score:   0.006330131635596703

Judged by R^2, this straight-line model explains almost none
of the variation in the target.

Less than 1 percent of the variance in the bed-to-bath ratio
is explained by latitude and longitude.
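
For context on what an R^2 near zero means: R^2 compares the model's squared error with that of always predicting the mean. A sketch on made-up numbers, with `r2_by_hand` as a hypothetical helper name:

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)        # squared error of the predictions
    ss_tot = np.sum((y - y.mean()) ** 2)     # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions.
y_true = np.array([0.0, 1.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 1.0, 2.0, 2.5])

# Baseline: always predict the mean of the target.
baseline = np.full_like(y_true, y_true.mean())

print(r2_by_hand(y_true, baseline))                          # mean baseline scores 0
print(r2_by_hand(y_true, y_pred), r2_score(y_true, y_pred))  # the two values agree
```

A score of about 0.007 therefore says the model does barely better than predicting the average ratio for every listing.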


In [145]:
# I will refit the model, this time using PolynomialFeatures,
# to try to reduce the Mean Absolute Error

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 4th Degree Polynomial Model

model_2 = make_pipeline(PolynomialFeatures(degree = 4), LinearRegression())
model_2.fit(X_train, y_train)

train_mae_2 = mean_absolute_error(y_train, model_2.predict(X_train))
test_mae_2 = mean_absolute_error(y_test, model_2.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae_2)
print("Testing Error:  ", test_mae_2)

Mean Absolute Error
-----------------------------------
Training Error:  0.6847626959708843
Testing Error:   0.6881887890841311


In [152]:
# 7th Degree Polynomial Model

model_3 = make_pipeline(PolynomialFeatures(degree = 7), LinearRegression())
model_3.fit(X_train, y_train)

train_mae_3 = mean_absolute_error(y_train, model_3.predict(X_train))
test_mae_3 = mean_absolute_error(y_test, model_3.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae_3)
print("Testing Error:  ", test_mae_3)

Mean Absolute Error
-----------------------------------
Training Error:  0.6758782939090959
Testing Error:   0.6824377049189707


In [161]:
# 16th Degree Polynomial Model

model_4 = make_pipeline(PolynomialFeatures(degree = 16), LinearRegression())
model_4.fit(X_train, y_train)

train_mae_4 = mean_absolute_error(y_train, model_4.predict(X_train))
test_mae_4 = mean_absolute_error(y_test, model_4.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae_4)
print("Testing Error:  ", test_mae_4)

Mean Absolute Error
-----------------------------------
Training Error:  0.6735421800208979
Testing Error:   0.6792371109496452


In [173]:
# 27th Degree Polynomial Model

model_5 = make_pipeline(PolynomialFeatures(degree = 27), LinearRegression())
model_5.fit(X_train, y_train)

train_mae_5 = mean_absolute_error(y_train, model_5.predict(X_train))
test_mae_5 = mean_absolute_error(y_test, model_5.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae_5)
print("Testing Error:  ", test_mae_5)

Mean Absolute Error
-----------------------------------
Training Error:  0.6715238321799952
Testing Error:   0.6775537655898127


In [197]:
# 50th Degree Polynomial Model

model_6 = make_pipeline(PolynomialFeatures(degree = 50), LinearRegression())
model_6.fit(X_train, y_train)

train_mae_6 = mean_absolute_error(y_train, model_6.predict(X_train))
test_mae_6 = mean_absolute_error(y_test, model_6.predict(X_test))

print("Mean Absolute Error")
print("-" * 35)
print("Training Error: ", train_mae_6)
print("Testing Error:  ", test_mae_6)

Mean Absolute Error
-----------------------------------
Training Error:  0.6790141634063459
Testing Error:   0.6846041598034545


In [0]:
# I have decided to cap my search at degree 50.

# The 27th-degree polynomial model produced the lowest
# test MAE of the degrees tried: 0.6775537655898127.

# This means its predictions of an apartment's bed-to-bath ratio
# from latitude and longitude are off by about 0.6776 on average.

# That is only a small improvement over the straight-line model's
# test MAE of 0.7056, so location alone has limited predictive power here.
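
Picking the degree by its test MAE means the test set is no longer an unbiased check. One common alternative is to hold out a validation split from the training data and choose the degree there. The sketch below uses synthetic data, not the rental data; the `StandardScaler` step before `PolynomialFeatures` is my own assumption, added to keep high powers numerically tame.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a noisy quadratic in one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

# Hold out part of the training data for choosing the degree.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

val_mae = {}
for degree in (1, 2, 3, 5):
    candidate = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )
    candidate.fit(X_fit, y_fit)
    val_mae[degree] = mean_absolute_error(y_val, candidate.predict(X_val))

# Choose the degree with the lowest validation MAE; score the test set only once, afterwards.
best_degree = min(val_mae, key=val_mae.get)
print(val_mae, best_degree)
```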

In [203]:
# Keri's stretch goal:
# Compare the coefficients fit on the original data with those fit on standardized data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse its mean and scale for the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

standard_model = LinearRegression()
standard_model.fit(X_train_scaled, y_train)

print('Intercept:  ', standard_model.intercept_[0])
coefficients = pd.Series(standard_model.coef_[0], features)
print(coefficients.to_string())

Intercept:   1.2653432931776962
latitude    -0.030849
longitude    0.077659


In [212]:
# In Equation Format

beta0 = standard_model.intercept_[0]
beta1, beta2 = standard_model.coef_[0]

print('Standardized:')
print('-' * 56)
print(f'Bed to Bath Ratio = {beta1:.2f}latitude + {beta2:.2f}longitude + {beta0:.2f}')

Standardized:
--------------------------------------------------------
Bed to Bath Ratio = -0.03latitude + 0.08longitude + 1.27


In [214]:
# The Equation of the Original Model from above for Reference

beta0 = model.intercept_[0]
beta1, beta2 = model.coef_[0]

print('Original:')
print('-' * 58)
print(f'Bed to Bath Ratio = {beta1:.2f}latitude + {beta2:.2f}longitude + {beta0:.2f}')

Original:
----------------------------------------------------------
Bed to Bath Ratio = -0.80latitude + 2.68longitude + 231.86
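
The two equations describe the same fitted plane in different units. For ordinary least squares, each standardized coefficient equals the original coefficient times that feature's standard deviation, and the standardized intercept equals the mean of the training target. A sketch on synthetic data (the numbers only loosely imitate NYC coordinates and are not from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features with small spreads around NYC-like coordinates.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(40.75, 0.04, size=200),    # stand-in for latitude
    rng.normal(-73.97, 0.03, size=200),   # stand-in for longitude
])
y = 1.0 - 0.8 * (X[:, 0] - 40.75) + 2.7 * (X[:, 1] + 73.97) + rng.normal(scale=0.5, size=200)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
scaled = LinearRegression().fit(scaler.transform(X), y)

# Standardized coefficient = raw coefficient * feature standard deviation.
print(scaled.coef_, raw.coef_ * scaler.scale_)

# With centered features, the intercept is the mean of the target.
print(scaled.intercept_, y.mean())
```

Standardized coefficients are per standard deviation rather than per degree, which is why they come out so much smaller than the original ones when the features vary by only a few hundredths of a degree.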
