Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
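
Two of the ideas above (cats _or_ dogs, cats _and_ dogs) can be built directly from the existing binary columns. The following is a minimal sketch on toy data; the frame `toy` and the new column names `cats_or_dogs` and `cats_and_dogs` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy listings using the two pet columns that exist in the renthop data
toy = pd.DataFrame({
    'cats_allowed': [1, 1, 0, 0],
    'dogs_allowed': [1, 0, 1, 0],
})

# 1 if at least one kind of pet is allowed (hypothetical column name)
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)

# 1 only if both kinds of pet are allowed (hypothetical column name)
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)

print(toy)
```

The same two lines should work on `df` once it is loaded, since both source columns are 0/1 flags.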

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [68]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [69]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
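
To see what percentile trimming does, here is a small sketch on made-up prices. The values and the 10th/90th percentile bounds are assumptions chosen so that the toy example actually drops rows; the cell above uses much tighter bounds (0.5 and 99.5 for price).

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value at each end
prices = pd.Series([1, 2000, 2500, 3000, 3500, 4000, 1_000_000])

# Lower and upper bounds from the 10th and 90th percentiles
low, high = np.percentile(prices, [10, 90])

# Keep only the rows that fall inside the bounds
trimmed = prices[(prices >= low) & (prices <= high)]
print(low, high)
print(trimmed.tolist())
```

Only the two extreme values fall outside the bounds, so the five ordinary prices survive.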

In [70]:
print(df.shape)
df.head()

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Engineer at least two new features. (See below for explanation & ideas.)

### Does the apartment have a description?

In [71]:
df.iloc[1]['description']

'        '

In [72]:
# 0 when the description is missing or is the eight-space placeholder seen above, else 1
df['has_description'] = np.where(df['description'].isnull() | (df['description'] == '        '), 0, 1)

In [73]:
df['has_description'].value_counts()

1    45765
0     3052
Name: has_description, dtype: int64
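
The comparison above only catches descriptions that are exactly eight spaces. A possible variation, sketched here on toy data, treats any empty or whitespace-only description as missing. The series `desc` is made up, and applying this to `df` could give counts that differ from the ones shown above.

```python
import pandas as pd

# Hypothetical descriptions: real text, missing, and several blank variants
desc = pd.Series(['Sunny 2BR', None, '        ', '   ', ''])

# Missing values become '', then surrounding whitespace is stripped
cleaned = desc.fillna('').str.strip()

# 1 when any non-whitespace text remains, else 0
has_description = (cleaned != '').astype(int)

# Length of the stripped text; blank descriptions get 0 automatically
description_length = cleaned.str.len()

print(has_description.tolist())
print(description_length.tolist())
```

A side effect of stripping first is that the description length no longer needs a separate branch for listings without a description.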

### How long is the description?

In [74]:
# Character count of each description; 0 where has_description is 0
description_length = (df['description'].str.len()
                      .where(df['has_description'] == 1, 0)
                      .fillna(0)
                      .astype(int))

In [75]:
df['description_length'] = description_length

In [76]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,has_description,description_length
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,588
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,691
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,492
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,479


In [77]:
df.shape

(48817, 36)

### How many total perks does each apartment have?

In [78]:
perks = df.iloc[:,10:34].columns
perks

Index(['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [79]:
df.iloc[0][perks]

elevator                0
cats_allowed            0
hardwood_floors         0
dogs_allowed            0
doorman                 0
dishwasher              0
no_fee                  0
laundry_in_building     0
fitness_center          0
pre-war                 0
laundry_in_unit         0
roof_deck               0
outdoor_space           0
dining_room             0
high_speed_internet     0
balcony                 0
swimming_pool           0
new_construction        0
terrace                 0
exclusive               0
loft                    0
garden_patio            0
wheelchair_access       0
common_outdoor_space    0
Name: 0, dtype: object

In [80]:
# Row-wise sum of the 24 binary perk columns
perks_total = df[perks].sum(axis=1)

In [81]:
df['perks_total'] = perks_total

In [82]:
# Number of listings with no perks at all
(df['perks_total'] == 0).sum()

3729

In [83]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,has_description,description_length,perks_total
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,588,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,691,3
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,492,2
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,479,1


### Total number of rooms (beds + baths)

In [84]:
df['rooms_total'] = df['bedrooms'] + df['bathrooms']

In [85]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,has_description,description_length,perks_total,rooms_total
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,588,0,4.5
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,3.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,691,3,2.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,492,2,2.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,479,1,5.0


### Ratio of beds to baths

In [86]:
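# Any listing with 0 bathrooms yields inf (or NaN for 0/0) here;
# those values are replaced with 0 before the train/test split
df['bed_to_baths'] = df['bedrooms'] / df['bathrooms']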

In [87]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,has_description,description_length,perks_total,rooms_total,bed_to_baths
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,588,0,4.5,2.0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,3.0,2.0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,691,3,2.0,1.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,492,2,2.0,1.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,479,1,5.0,4.0
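
The ratio is undefined when a listing has zero bathrooms, and the later `replace([np.inf, -np.inf], np.nan)` step suggests that this case occurs in the data. The sketch below, on a made-up three-row frame, shows what pandas produces for such rows and one way to clean it up. Filling with 0 mirrors the choice made later in this notebook; it is one option, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: a normal one, one with 0 baths, one with 0 beds and 0 baths
toy = pd.DataFrame({'bedrooms': [2, 1, 0], 'bathrooms': [1.0, 0.0, 0.0]})

# pandas does not raise on division by zero: 1/0 gives inf and 0/0 gives NaN
ratio = toy['bedrooms'] / toy['bathrooms']
print(ratio.tolist())

# Turn infinities into NaN, then fill every NaN with 0
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())
```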


### What's the neighborhood, based on address or latitude & longitude?

In [88]:
!pip install dbfread
from dbfread import DBF



In [89]:
table = DBF('/content/datasets-1694-2965-ZillowNeighborhoods-NY.dbf')

In [90]:
table = pd.DataFrame(table)
table.head()

Unnamed: 0,State,County,City,Name,RegionID
0,NY,Suffolk,Town of Islip,Bohemia,3736
1,NY,Albany,Town of Coeymans,Ravena,6687
2,NY,Queens,New York,Rego Park,6719
3,NY,Suffolk,Town of Islip,Saltaire,6912
4,NY,Albany,Guilderland,Westmere,9545


In [91]:
table = table[table['City'] == 'New York']
table['County']


2        Queens
5        Queens
10     New York
13       Queens
16       Queens
         ...   
558      Queens
559      Queens
570       Kings
571      Queens
572      Queens
Name: County, Length: 278, dtype: object

In [92]:
table = pd.read_csv('https://raw.githubusercontent.com/nastyalolpro/data/master/nybb.csv')

In [None]:
table['the_geom'][0]

In [97]:
# Build a "longitude latitude" string for every listing in one vectorized step
coordinates = df['longitude'].astype(str) + ' ' + df['latitude'].astype(str)

In [98]:
df['coordinates'] = coordinates

In [None]:
df['coordinates']

In [None]:
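from geopy.geocoders import GoogleV3

# google_key must hold a valid Google Maps API key; it is not defined in this notebook
geolocator = GoogleV3(api_key=google_key)
# The query is "latitude, longitude"; exactly_one=False returns a list of candidates
locations = geolocator.reverse("22.5757344, 88.4048656", exactly_one=False)
if locations:
    print(locations[0].address)  # select first location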
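
The reverse-geocoding attempt above is not carried through to a feature. As a swapped-in alternative that needs no API key, the sketch below bins latitude and longitude into a coarse grid and uses the grid cell as a rough stand-in for a neighborhood. The column name `geo_cell` and the two-decimal resolution (cells on the order of 1 km) are assumptions for illustration; the five coordinates are the ones shown in the `df.head()` output.

```python
import pandas as pd

# Coordinates of the first five listings from df.head()
toy = pd.DataFrame({
    'latitude':  [40.7145, 40.7947, 40.7388, 40.7539, 40.8241],
    'longitude': [-73.9425, -73.9667, -74.0018, -73.9677, -73.9493],
})

# Round to 2 decimals and join into one label per grid cell (hypothetical column)
toy['geo_cell'] = (toy['latitude'].round(2).astype(str)
                   + '_'
                   + toy['longitude'].round(2).astype(str))

# One-hot encode the labels so a linear model can use them
cell_dummies = pd.get_dummies(toy['geo_cell'], prefix='cell')

print(toy['geo_cell'].tolist())
print(cell_dummies.shape)
```

On the full dataset this would create one column per occupied cell, so a coarser grid or grouping rare cells may be needed to keep the feature count manageable.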

## Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [142]:
df = df.replace([np.inf, -np.inf], np.nan)

In [143]:
df.fillna(0, inplace=True)

In [144]:
df.isnull().sum()

bathrooms               0
bedrooms                0
created                 0
description             0
display_address         0
latitude                0
longitude               0
price                   0
street_address          0
interest_level          0
elevator                0
cats_allowed            0
hardwood_floors         0
dogs_allowed            0
doorman                 0
dishwasher              0
no_fee                  0
laundry_in_building     0
fitness_center          0
pre-war                 0
laundry_in_unit         0
roof_deck               0
outdoor_space           0
dining_room             0
high_speed_internet     0
balcony                 0
swimming_pool           0
new_construction        0
terrace                 0
exclusive               0
loft                    0
garden_patio            0
wheelchair_access       0
common_outdoor_space    0
has_description         0
description_length      0
perks_total             0
rooms_total             0
bed_to_baths

In [145]:
df['created'] = pd.to_datetime(df['created'])

In [146]:
df['month'] = df['created'].dt.month

In [147]:
april = df[df['month']==4]
may = df[df['month']==5]
june = df[df['month']==6]

In [148]:
train = pd.concat([april, may])
test = june

In [149]:
train.shape, test.shape

((31844, 41), (16973, 41))
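
Filtering on the month number works here because the assignment's data covers April through June 2016 only. If the data spanned more than one year, a timestamp cutoff would be safer. A minimal sketch on toy rows follows; three timestamps and prices come from the `df.head()` output, and the May row is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'created': ['2016-04-17 03:26:41', '2016-05-02 10:00:00',   # May row is hypothetical
                '2016-06-12 12:19:27', '2016-06-24 07:54:24'],
    'price': [2850, 3100, 5465, 3000],
})
toy['created'] = pd.to_datetime(toy['created'])

# Everything before June 1 is training data; June onward is test data
cutoff = pd.Timestamp('2016-06-01')
train_toy = toy[toy['created'] < cutoff]
test_toy = toy[toy['created'] >= cutoff]

print(train_toy.shape, test_toy.shape)
```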

## Fit a linear regression model with at least two features.

### MAE for mean baseline with 0 features

In [32]:
# Start with a baseline: how far off would we be on average
# if we guessed the mean price for every listing?
train['price'].mean()

3575.604007034292

In [127]:
# arrange target vectors
target = 'price'
y_train = train[target]
y_test = test[target]

In [128]:
# get mean baseline
print('Mean baseline with 0 features')
guess = y_train.mean()
print(guess)

Mean baseline with 0 features
3575.604007034292


In [129]:
# We would be off by about $1,202 on average
errors = guess - y_train 
errors.abs().mean()

1201.8811133683944

In [130]:
# Same train error, computed with scikit-learn
from sklearn.metrics import mean_absolute_error

y_pred = [guess] * len(y_train)

mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (april-may prices): {mae:.2f}$')

Train Error (april-may prices): 1201.88$


In [131]:
# Test Error
y_pred = [guess] * len(y_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (june prices): {mae:.2f}$')

Test Error (june prices): 1197.71$
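
The baseline above guesses the mean. When the metric is MAE, the median is also worth checking, because among all constant guesses the median is the one that minimizes mean absolute error. The sketch below uses made-up prices with one expensive outlier to show the gap; the numbers are not from the renthop data.

```python
import numpy as np

# Hypothetical prices with one expensive outlier
y = np.array([2000, 2500, 3000, 3500, 9000])

# MAE when guessing the mean for every listing
mae_mean = np.abs(y - y.mean()).mean()

# MAE when guessing the median for every listing
mae_median = np.abs(y - np.median(y)).mean()

print(f'Mean baseline MAE:   {mae_mean:.2f}')
print(f'Median baseline MAE: {mae_median:.2f}')
```

On the real data, the equivalent check would be `y_train.median()` in place of `y_train.mean()`.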


### Fit model with multiple features

In [150]:
#import
from sklearn.linear_model import LinearRegression

In [151]:
# instantiate the model
model = LinearRegression()

In [152]:
train.head(1)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,has_description,description_length,perks_total,rooms_total,bed_to_baths,coordinates,month
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,691,3,2.0,1.0,-74.0018 40.7388,4


In [162]:
# arrange features matrix and target vector
features = ['has_description', 'description_length', 'perks_total', 'rooms_total', 'bed_to_baths']

X_train = train[features]
X_test = test[features]

In [163]:
# fit the model to the training data
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### MAE with multiple features

In [164]:
y_pred = model.predict(X_train)

In [165]:
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} $')

Train Error: 794.88 $


In [166]:
y_pred = model.predict(X_test)

In [167]:
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f} $')

Test Error: 797.13 $


## Get the model's coefficients and intercept.

In [170]:
beta0 = model.intercept_
beta0

1591.2050569209641

In [172]:
beta1, beta2, beta3, beta4, beta5 = model.coef_
beta1, beta2, beta3, beta4, beta5

(-529.5447518148695,
 0.13031791199893195,
 79.16824182574574,
 1065.4654706308345,
 -702.6224963928721)

In [174]:
# The + format flag prints the sign of every coefficient, so negative terms read correctly
print(f'y = {beta0:.2f} {beta1:+.2f}*x1 {beta2:+.2f}*x2 {beta3:+.2f}*x3 {beta4:+.2f}*x4 {beta5:+.2f}*x5')

y = 1591.21 -529.54*x1 +0.13*x2 +79.17*x3 +1065.47*x4 -702.62*x5
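
To make the equation concrete, the sketch below applies it by hand to one listing. The intercept and coefficients are copied from the `model.intercept_` and `model.coef_` outputs above, in the order of the `features` list. The feature values are those of the listing at index 2 (241 W 13 Street) as shown in the `df.head()` output. The dictionary names are illustrative.

```python
# Intercept and coefficients copied from the fitted model's output
intercept = 1591.2050569209641
coefs = {
    'has_description': -529.5447518148695,
    'description_length': 0.13031791199893195,
    'perks_total': 79.16824182574574,
    'rooms_total': 1065.4654706308345,
    'bed_to_baths': -702.6224963928721,
}

# Feature values for the listing at index 2 (listed at $2,850)
listing = {
    'has_description': 1,
    'description_length': 691,
    'perks_total': 3,
    'rooms_total': 2.0,
    'bed_to_baths': 1.0,
}

# Linear regression prediction: intercept plus the sum of coefficient * feature value
prediction = intercept + sum(coefs[name] * listing[name] for name in coefs)
print(f'Predicted price: {prediction:.2f}')
```

The result comes out near 2817.52, about $32 below the listed price of $2,850 for this apartment.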


## Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

In [175]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### RMSE and $R^2$ for training data

In [184]:
#mean squared error for train
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
mse

1456270.539984902

In [185]:
#rmse 
np.sqrt(mse)

1206.7603490274703

In [187]:
#r2
r2 = r2_score(y_train, y_pred)
r2

0.5309960368173665

### RMSE and $R^2$ for testing data

In [188]:
#mean squared error for test
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mse

1425040.9090226043

In [189]:
#rmse
np.sqrt(mse)

1193.7507734123585

In [190]:
#r2
r2 = r2_score(y_test, y_pred)
r2

0.541495766530834
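
The three metrics are computed in separate cells above. As a sketch of what they measure, the helper below derives MAE, RMSE, and $R^2$ from their definitions using only NumPy. The function name `regression_metrics` and the toy inputs are assumptions for illustration; on real data it should agree with the scikit-learn functions used in this notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.abs(residuals).mean()            # mean absolute error
    rmse = np.sqrt((residuals ** 2).mean())   # square root of the mean squared error

    ss_res = (residuals ** 2).sum()                  # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares around the mean
    r2 = 1 - ss_res / ss_tot

    return {'MAE': mae, 'RMSE': rmse, 'R2': r2}

# Toy example: errors are 1, 0, and -2
metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics)
```

Calling it once with `(y_train, model.predict(X_train))` and once with `(y_test, model.predict(X_test))` would give all six numbers for the assignment in two lines.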