
Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
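
The three metrics named in the checklist can be computed with scikit-learn. The following is a minimal sketch on made-up numbers: the arrays `y_true` and `y_pred` are hypothetical and are not taken from the renthop data. RMSE is computed here as the square root of `mean_squared_error`, which should work across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical rents and predictions, for illustration only
y_true = np.array([3000.0, 2500.0, 4000.0, 3500.0])
y_pred = np.array([3100.0, 2400.0, 3800.0, 3600.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(f'MAE:  ${mae:,.0f}')
print(f'RMSE: ${rmse:,.0f}')
print(f'R^2:  {r2:.3f}')
```

Running the same three calls once on the train predictions and once on the test predictions gives the train/test comparison the checklist asks for.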


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
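
Several of the ideas above can be sketched with a few lines of pandas. This is a minimal illustration on a tiny hypothetical frame named `listings`. The column names follow the renthop data, but the rows and the new feature names (`has_description`, `perk_count`, and so on) are assumptions made for the example.

```python
import pandas as pd

# Tiny hypothetical listings frame; column names follow the renthop data
listings = pd.DataFrame({
    'description': ['Sunny one bedroom', None, '   '],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as empty
desc = listings['description'].fillna('').str.strip()
listings['has_description'] = (desc != '').astype(int)
listings['description_length'] = desc.str.len()

# Perk columns are 0/1 flags, so a row-wise sum counts the perks
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
listings['perk_count'] = listings[perk_cols].sum(axis=1)

cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']
```

On the real data, `perk_cols` would list all of the 0/1 amenity columns rather than the four used here.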

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
df['created'] = pd.to_datetime(df['created'])

In [0]:
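# Note: '2016-05-31' is parsed as midnight at the start of May 31, so listings
# created later that day fall into neither train nor test
train = df[df['created'] <= '2016-05-31']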

In [0]:
test = df[df['created'] >= '2016-06-01']
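
When a datetime column is compared with a date-only string, pandas reads the string as midnight at the start of that day. The sketch below uses a hypothetical four-row frame (`demo`) to show the effect at the May/June boundary. It also shows a half-open cutoff as one way to keep all of May in train.

```python
import pandas as pd

# Hypothetical timestamps around the May/June boundary
demo = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-15 10:00:00',
        '2016-05-31 00:00:00',
        '2016-05-31 18:30:00',
        '2016-06-01 00:00:00',
    ])
})

# '2016-05-31' means 2016-05-31 00:00:00, so the 18:30 row is excluded
upper_inclusive = demo[demo['created'] <= '2016-05-31']

# A half-open cutoff puts every May row in train and every June row in test
cutoff = pd.Timestamp('2016-06-01')
train_demo = demo[demo['created'] < cutoff]
test_demo = demo[demo['created'] >= cutoff]

print(len(upper_inclusive), len(train_demo), len(test_demo))
```

With a single cutoff, every row lands in exactly one of the two sets.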

In [0]:
test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0


In [0]:
# Feature idea: ratio of bathrooms to bedrooms.
# Listings with 0 bedrooms produce inf (or NaN for 0/0); those rows are filtered out further down.
df['bath_bed_ratio'] = df['bathrooms'] / df['bedrooms']
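
Dividing by a bedroom count of zero does not raise an error in pandas. It yields `inf`, or `NaN` for 0/0. The sketch below uses three hypothetical rows to show this. Note that `isnull()` flags only the `NaN`, while `np.isfinite` catches both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a two-bedroom, a studio (0 bedrooms), and a 0/0 case
rooms = pd.DataFrame({'bathrooms': [1.0, 1.0, 0.0], 'bedrooms': [2, 0, 0]})

# Float division follows IEEE 754: 1/0 -> inf and 0/0 -> NaN, without raising
ratio = rooms['bathrooms'] / rooms['bedrooms']

null_count = int(ratio.isnull().sum())  # counts NaN only, not inf
finite_mask = np.isfinite(ratio)        # False for both inf and NaN

print(ratio.tolist())
print(null_count, finite_mask.tolist())
```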

In [0]:
df.nlargest(1, 'latitude')

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,bath_bed_ratio
36456,1.0,2,2016-06-10 03:30:10,2 b/r with sliding glass doors off dining area...,Warburton\r,40.9894,-73.8832,2400,384 Warburton\r,low,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0.5


In [0]:
df.nsmallest(1, 'latitude')

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,bath_bed_ratio
112,2.0,2,2016-04-11 03:36:06,,Ocean Pkwy,40.5757,-73.9686,3250,3111 Ocean Pkwy,low,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0


In [0]:
# Latitude min and max are 40.5757, 40.9894
# Longitude min and max are -74.0873, -73.7001
# Only the last expression in a cell is displayed, so the nlargest result is not shown
df.nlargest(1, 'longitude')
df.nsmallest(1, 'longitude')

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,bath_bed_ratio
33944,1.0,2,2016-05-17 03:56:03,Newly Renovated apartment in Staten Is...,733 targee st,40.6094,-74.0873,1700,733 targee st,low,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.5


In [0]:
# Bin latitude into 20 equal-width bands labeled A-T (this overwrites the numeric column)
df['latitude'] = pd.cut(df['latitude'], 20, labels=list('ABCDEFGHIJKLMNOPQRST'))

In [0]:
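# Bin longitude into 30 equal-width bands (this overwrites the numeric column).
# Note: the labels '2,' and '5,' carry a stray comma, which is why names such as 'B5,' appear in later cells.
df['longitude'] = pd.cut(df['longitude'], 30, labels=['1','2,','3','4','5,','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30'])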

In [0]:
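# Categorical columns do not support +, so this raises a TypeError; casting both to str first works
df['latitude'] + df['longitude']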

TypeError: ignored

In [0]:
df['neighborhoods'] = df['latitude'].astype(str) + df['longitude'].astype(str)

In [0]:
df['neighborhoods'] = df['neighborhoods'].astype('category')

In [0]:
df['neighborhoods'].value_counts()

J11    5291
I9     3785
I8     3130
H8     3119
H9     2889
       ... 
P19       1
C16       1
C15       1
M4        1
T16       1
Name: neighborhoods, Length: 213, dtype: int64
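
The binning steps above can be sketched on a few hypothetical coordinates. This variant is an alternative to the hand-typed label lists, not the approach used in the cells above. It passes `labels=False` so that `pd.cut` returns integer bin indices, and the frame `coords` and the column name `grid_cell` are assumptions for the example.

```python
import pandas as pd

# Hypothetical coordinates, roughly within the NYC range seen above
coords = pd.DataFrame({
    'latitude': [40.58, 40.71, 40.75, 40.98],
    'longitude': [-74.08, -73.94, -73.97, -73.71],
})

# labels=False returns 0-based integer bin indices instead of Interval labels
lat_bin = pd.cut(coords['latitude'], bins=20, labels=False)
lon_bin = pd.cut(coords['longitude'], bins=30, labels=False)

# Combine the two indices into one grid-cell id, e.g. '0_0' for the first row
coords['grid_cell'] = lat_bin.astype(str) + '_' + lon_bin.astype(str)
print(coords['grid_cell'].tolist())
```

The separator keeps ids unambiguous: without it, bins (1, 12) and (11, 2) would both read as `112`.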

In [0]:
import plotly.express as px

px.scatter(train, x='bath_bed_ratio', y='price', trendline='ols')

In [0]:
df['bath_bed_ratio'].isnull().sum()

0

In [0]:
# Drop rows where bath_bed_ratio is not finite (inf or NaN, which arise when
# bedrooms is 0) so a model can be fit on this feature. This removes every
# listing with 0 bedrooms, so it is worth checking how many rows are lost.
df = df[np.isfinite(df['bath_bed_ratio'])]

In [0]:
np.isinf(df['bath_bed_ratio']).sum()

0

In [0]:
np.isinf(df['price']).sum()

0

In [0]:
df['bath_bed_ratio'].isnull().sum()

0

In [0]:
df['price'].isnull().sum()

0

In [0]:
df['bathrooms'].isnull().sum()

0

In [0]:
df['bedrooms'].isnull().sum()

0

In [0]:
(df['bedrooms'] == 0).sum()

In [0]:
(df['bathrooms'] == 0).sum()

153

In [0]:
# Re-split now that df carries the engineered columns (same cutoffs as before)
train = df[df['created'] <= '2016-05-31']

In [0]:
test = df[df['created'] >= '2016-06-01']

In [0]:
train.shape, test.shape

((25667, 36), (13828, 36))

In [0]:
target = 'price'
y_train = train[target]
y_test = test[target]
y_train.shape, y_test.shape

((25667,), (13828,))

In [0]:
train['price'].mean()

3839.0142985156035

In [0]:
#Baseline
guess = y_train.mean()
print(guess)

3839.0142985156035


In [0]:
# Baseline train error: predict the mean training price for every listing
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (April and May 2016): ${mae:,.0f}')

Train Error (April and May 2016): $1,263


In [0]:
# Baseline test error: the same training-mean guess applied to the test set
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (June 2016): ${mae:,.0f}')

Test Error (June 2016): $1,264


In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
model = LinearRegression()

In [0]:
features = ['bath_bed_ratio']
target = 'price'
X_train = train[features]
X_test = test[features]
X_test.shape, X_train.shape

((13828, 1), (25667, 1))

In [0]:
#Fit model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: ${mae:.2f}')

Train Error: $1263.05


In [0]:
#Apply model to new data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: ${mae:.2f}')

Test Error: $1263.96
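
The checklist also asks for the model's coefficients and intercept. After `fit`, scikit-learn exposes them as the `coef_` and `intercept_` attributes. Below is a minimal sketch on hypothetical data generated from a known linear rule, so the recovered values can be checked by eye. The name `demo_model` and the numbers are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from price = 1000 + 800*bedrooms + 500*bathrooms
X = np.array([[1, 1.0], [2, 1.0], [2, 2.0], [3, 2.0], [0, 1.0]])
y = 1000 + 800 * X[:, 0] + 500 * X[:, 1]

demo_model = LinearRegression().fit(X, y)

# coef_ holds one coefficient per feature, in column order
for name, coef in zip(['bedrooms', 'bathrooms'], demo_model.coef_):
    print(f'{name}: {coef:,.2f}')
print(f'intercept: {demo_model.intercept_:,.2f}')
```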


In [0]:
# One-hot encode the neighborhood grid cells (one column per cell)
dfjk = pd.get_dummies(df['neighborhoods'])
dfjk.shape

(39500, 213)

In [0]:
df.shape

(39500, 36)

In [0]:
# Append the dummy columns to the main frame
dfth = pd.concat([df, dfjk], axis=1)

In [0]:
dfth.shape

(39500, 249)

In [0]:
trainth = dfth[dfth['created'] <= '2016-05-31']

In [0]:
testh = dfth[dfth['created'] >= '2016-06-01']

In [0]:
target = 'price'
y_trainth = trainth[target]
y_testh = testh[target]
y_trainth.shape, y_testh.shape

((25667,), (13828,))

In [0]:
trainth['price'].mean()

3839.0142985156035

In [0]:
pd.set_option('display.max_seq_items', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 1000)
dfth.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address', 'latitude', 'longitude', 'price', 'street_address', 'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'bath_bed_ratio', 'neighborhoods', 'A10', 'A11', 'A12', 'A13', 'A20', 'A21', 'A22', 'A24', 'A26', 'A6', 'A8', 'A9', 'B1', 'B10', 'B11', 'B27', 'B3', 'B4', 'B5,', 'B6', 'B7', 'B8', 'B9', 'C1', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C4', 'C5,', 'C6', 'C7', 'C8', 'C9', 'D1', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'D5,', 'D6', 'D7', 'D8', 'D9', 'E10', 'E11', 'E12', 'E13', 'E14', 'E15', 'E16', 'E17', 'E18', 'E19', 'E26', 'E27', 'E30', 'E6', 'E7', 'E8', 'E9',
   

In [0]:
# Multi-feature regression: use the one-hot neighborhood columns as features
features = ['A10', 'A11', 'A12', 'A13', 'A20', 'A21', 'A22', 'A24', 'A26', 'A6', 'A8', 'A9', 'B1', 'B10', 'B11', 'B27', 'B3', 'B4', 'B5,', 'B6', 'B7', 'B8', 'B9', 'C1', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C4', 'C5,', 'C6', 'C7', 'C8', 'C9', 'D1', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'D5,', 'D6', 'D7', 'D8', 'D9', 'E10', 'E11', 'E12', 'E13', 'E14', 'E15', 'E16', 'E17', 'E18', 'E19', 'E26', 'E27', 'E30', 'E6', 'E7', 'E8', 'E9',
       'F10', 'F11', 'F12', 'F13', 'F14', 'F15', 'F16', 'F18', 'F19', 'F20', 'F21', 'F7', 'F8', 'F9', 'G10', 'G11', 'G12', 'G13', 'G14', 'G15', 'G16', 'G17', 'G18', 'G19', 'G20', 'G21', 'G22', 'G23', 'G24', 'G25', 'G26', 'G27', 'G4', 'G5,', 'G6', 'G7', 'G8', 'G9', 'H10', 'H11', 'H12', 'H13', 'H14', 'H15', 'H16', 'H17', 'H18', 'H19', 'H20', 'H21', 'H22', 'H26', 'H28', 'H4', 'H5,', 'H6', 'H7', 'H8', 'H9', 'I10', 'I11', 'I12', 'I13', 'I14', 'I15', 'I16', 'I17', 'I20', 'I21', 'I22', 'I24', 'I26', 'I29', 'I5,', 'I6', 'I7', 'I8', 'I9', 'J10', 'J11', 'J12', 'J13', 'J14', 'J15', 'J16', 'J21', 'J22', 'J23', 'J24', 'J6', 'J7', 'J8', 'J9', 'K10', 'K11', 'K12', 'K13', 'K16', 'K18', 'K23', 'K6', 'K7', 'K9', 'L10', 'L11', 'L12', 'L13', 'L14', 'L21', 'L22', 'L9', 'M11', 'M12', 'M13', 'M14', 'M16', 'M17', 'M19', 'M20', 'M21', 'M4', 'N12', 'N13', 'N14', 'N15', 'N16', 'N17', 'N18', 'N19', 'N21', 'N9', 'O13', 'O14', 'O15', 'O16', 'O17', 'O18', 'O19', 'P14', 'P15', 'P17', 'P18', 'P19', 'Q15', 'Q18',
       'Q24', 'T16']

X_trainth = trainth[features]
X_testth = testh[features]

In [0]:
# Fit the model on the neighborhood dummy columns
model.fit(X_trainth, y_trainth)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
y_pred = model.predict(X_trainth)
mae = mean_absolute_error(y_trainth, y_pred)
print(f'Train Error: ${mae:,.0f}')

Train Error: $1,117


In [0]:
# Apply the model to the test data
y_pred = model.predict(X_testth)
mae = mean_absolute_error(y_testh, y_pred)
print(f'Test Error: ${mae:,.0f}')

Test Error: $823,198,670,823


In [0]:
(y_pred - y_testh).abs().mean()

823198670823.2113
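
A test MAE this large, next to a train MAE of $1,117, suggests that some predictions are far outside the range of real rents. One plausible explanation, which is an assumption and has not been verified on this notebook's data, is that some grid cells occur only in the June rows. Their dummy columns are then all zero in train, so the fit cannot estimate them. The sketch below shows one way to check for such columns, using hypothetical labels and hypothetical frames `train_demo` and `test_demo`.

```python
import pandas as pd

# Hypothetical grid-cell labels; 'Z9' appears only in the test rows
train_demo = pd.DataFrame({'neighborhoods': ['J11', 'I9', 'J11', 'I8']})
test_demo = pd.DataFrame({'neighborhoods': ['J11', 'Z9', 'I9']})

# Encode train and test together so both share the same dummy columns
combined = pd.concat([train_demo, test_demo], keys=['train', 'test'])
dummies = pd.get_dummies(combined['neighborhoods'], dtype=int)
X_train_demo = dummies.loc['train']
X_test_demo = dummies.loc['test']

# Columns that are all zero in train but used in test cannot be estimated
train_counts = X_train_demo.sum()
test_counts = X_test_demo.sum()
unseen = train_counts[(train_counts == 0) & (test_counts > 0)].index.tolist()

# Number of test rows that fall in a cell never observed in train
affected_rows = int(X_test_demo[unseen].sum(axis=1).gt(0).sum())
print(unseen, affected_rows)
```

If the same check on the real `trainth` and `testh` frames finds such columns, the affected test rows would be a natural place to look for the extreme predictions.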