<a href="https://colab.research.google.com/github/accarter/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
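
A few of the ideas above can be sketched on a tiny made-up table. The `toy` DataFrame and the new column names (`has_description`, `description_length`, `perk_count`, `cats_and_dogs`) are illustrative assumptions, not part of the renthop data; only the source column names mirror the real dataset.

```python
import pandas as pd

# Toy listings that mimic a few renthop columns (values are made up).
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only text as "no description".
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Count perks by summing the 0/1 amenity columns.
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Both pets allowed: bitwise AND of the two flags (use | for "either").
toy['cats_and_dogs'] = toy['cats_allowed'] & toy['dogs_allowed']
```

On the real data, `perk_cols` would be the full list of 0/1 amenity columns.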

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
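
The percentile filter above can be illustrated on a small example. This is a minimal sketch: `trim_extremes` is a hypothetical helper, the toy prices are made up, and it keeps both bounds inclusive (the cell above uses a strict `<` for the upper latitude bound).

```python
import numpy as np
import pandas as pd

def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` lies within the given percentiles (inclusive)."""
    low = np.percentile(frame[column], lower_pct)
    high = np.percentile(frame[column], upper_pct)
    return frame[frame[column].between(low, high)]

# 100 ordinary prices plus one absurd outlier.
toy = pd.DataFrame({'price': list(range(1, 101)) + [1_000_000]})

# Drop the lowest 0.5% and the highest 0.5% of prices.
trimmed = trim_extremes(toy, 'price', 0.5, 99.5)
```

Here the filter drops the single smallest and single largest price and keeps the other 99 rows.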

## Split train/test 

Train: April & May 2016

Test: June 2016 and after

In [None]:
df.shape

(48817, 34)

In [None]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [None]:
df['bedrooms'] = df['bedrooms'].astype('float64')
df['bathrooms'] = df['bathrooms'].astype('float64')
df['price'] = df['price'].astype('float64')

In [None]:
df['created_datetime'] = pd.to_datetime(df['created'])
df['created_datetime'].head()

0   2016-06-24 07:54:24
1   2016-06-12 12:19:27
2   2016-04-17 03:26:41
3   2016-04-18 02:22:02
4   2016-04-28 01:32:41
Name: created_datetime, dtype: datetime64[ns]

In [None]:
boundary = pd.to_datetime('20160601', format='%Y%m%d')
train = df[df['created_datetime'] < boundary]
test = df[df['created_datetime'] >= boundary]
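
A quick sanity check of a date-based split might look like the following. It is a sketch on four made-up timestamps (the `toy` names are assumptions), showing that every row lands in exactly one split and that the boundary row goes to the test set.

```python
import pandas as pd

toy = pd.DataFrame({
    'created': ['2016-04-17 03:26:41', '2016-05-31 23:59:59',
                '2016-06-01 00:00:00', '2016-06-24 07:54:24'],
})
toy['created_datetime'] = pd.to_datetime(toy['created'])

boundary = pd.Timestamp('2016-06-01')
toy_train = toy[toy['created_datetime'] < boundary]
toy_test = toy[toy['created_datetime'] >= boundary]

# Every row lands in exactly one split, and train strictly precedes test.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train['created_datetime'].max() < toy_test['created_datetime'].min()

# Months present in each split.
train_months = set(toy_train['created_datetime'].dt.month)
test_months = set(toy_test['created_datetime'].dt.month)
```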

## Engineer at least two new features

### Total number of rooms (bedrooms and bathrooms)

In [None]:
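# Note: train and test were sliced from df before this point, so columns
# added to df from here on do not appear in them
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['total_rooms'].value_counts()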

2.0     15286
3.0     11083
1.0      9200
4.0      6940
5.0      3160
6.0      1605
4.5       291
7.0       277
3.5       210
5.5       190
2.5       154
0.0       151
8.0        83
6.5        67
9.0        49
7.5        32
10.0       16
8.5        11
1.5         9
12.0        2
11.0        1
Name: total_rooms, dtype: int64

### Are cats or dogs allowed?

In [None]:
df['cats_or_dogs'] = df['cats_allowed'] | df['dogs_allowed']
df['cats_or_dogs'].value_counts()

0    25433
1    23384
Name: cats_or_dogs, dtype: int64

## Fit a linear regression model with at least two features

In [None]:
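# Restrict to numeric columns so corr() also works on recent pandas versions
df.select_dtypes('number').corr()['price'].sort_values()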

longitude              -0.251004
latitude               -0.036286
pre-war                -0.029122
laundry_in_building    -0.019417
exclusive              -0.013251
loft                    0.007100
common_outdoor_space    0.011517
cats_or_dogs            0.050989
cats_allowed            0.051453
dogs_allowed            0.060401
new_construction        0.071431
wheelchair_access       0.072517
high_speed_internet     0.090269
hardwood_floors         0.101503
garden_patio            0.103672
roof_deck               0.122929
no_fee                  0.132240
swimming_pool           0.134513
balcony                 0.139140
outdoor_space           0.142146
terrace                 0.145973
elevator                0.207169
dishwasher              0.223899
fitness_center          0.228775
dining_room             0.242911
laundry_in_unit         0.271195
doorman                 0.276215
bedrooms                0.535503
total_rooms             0.649097
bathrooms               0.687296
price     

In [None]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [None]:
# 2. Instantiate this class
model = LinearRegression()

In [None]:
# 3. Arrange y and X features matrices
target = ['price']
y_train = train[target]
y_test = test[target]
print(y_train.shape)
print(y_test.shape)

features = ['bathrooms', 'bedrooms']
X_train = train[features]
X_test = test[features]

X_train.shape, X_test.shape

(31844, 1)
(16973, 1)


((31844, 2), (16973, 2))

In [None]:
# 4. Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae_train = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae_train:.2f} dollars')

Train Error: 818.53 dollars


In [None]:
# Compare to baseline - the model is an improvement!
guess = y_train.mean()
y_pred = [guess] * train.shape[0]
mae = mean_absolute_error(y_train, y_pred)
mae

1201.8811133682555
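
To put the comparison in numbers, the two training MAE values reported in this notebook can be compared directly. The variable names below are illustrative; the values are copied from the outputs above.

```python
baseline_mae = 1201.8811133682555  # mean-price baseline on the training data
model_mae = 818.5310213271714      # bathrooms + bedrooms model on the training data

absolute_gain = baseline_mae - model_mae
relative_gain = absolute_gain / baseline_mae
print(f'MAE reduced by {absolute_gain:.2f} dollars ({relative_gain:.1%})')
```

The two-feature model cuts the training MAE by roughly a third relative to always guessing the mean price.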

In [None]:
# 5. Apply the model to new data
y_pred = model.predict(X_test)
mae_test = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae_test:.2f} dollars')


## Get the model's coefficients and intercept

In [None]:
model.coef_, model.intercept_

(array([[2072.61011639,  389.3248959 ]]), array([485.71869002]))
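
One way to read these numbers: each additional bathroom adds about 2,073 dollars to the predicted rent, each additional bedroom about 389 dollars, on top of an intercept of about 486 dollars. The sketch below reproduces a single prediction by hand; `predict_rent` is a hypothetical helper, and the coefficient order (bathrooms, then bedrooms) follows the `features` list used to fit the model.

```python
import numpy as np

# Values copied from the fitted model's output.
coef = np.array([2072.61011639, 389.3248959])  # order: bathrooms, bedrooms
intercept = 485.71869002

def predict_rent(bathrooms, bedrooms):
    """Linear prediction: intercept + sum of coefficient * feature value."""
    return intercept + coef @ np.array([bathrooms, bedrooms])

estimate = predict_rent(bathrooms=1, bedrooms=2)
print(f'{estimate:,.2f} dollars')
```

For a one-bathroom, two-bedroom apartment this works out to roughly 3,337 dollars.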

## Get regression metrics RMSE, MAE, and $R^2$ for both the train and test data

### Mean Absolute Error (MAE) for the train and test data is computed above

### Root Mean Squared Error (RMSE)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# RMSE for training data
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
rmse_train = np.sqrt(mse)
rmse_train

1232.0225917223484

In [None]:
# RMSE for test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse_test = np.sqrt(mse)
rmse_test

1219.719357233823

### $R^2$ Score

In [None]:
# R2 score for training data
y_pred = model.predict(X_train)
r2_score(y_train, y_pred)

0.5111543084316607

In [None]:
# R2 score for test data
y_pred = model.predict(X_test)
r2_score(y_test, y_pred)

0.5213303957090345
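
For reference, all three metrics can be written out with NumPy alone. This is a sketch of the standard definitions, not a replacement for the scikit-learn functions used above; `regression_metrics` and the toy values are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))         # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large misses more heavily
    ss_res = np.sum(errors ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot              # share of variance explained
    return {'mae': mae, 'rmse': rmse, 'r2': r2}

toy_true = [3000, 2500, 4000, 3500]
toy_pred = [3100, 2400, 3800, 3500]
metrics = regression_metrics(toy_true, toy_pred)
```

Calling the same helper once with the train predictions and once with the test predictions gives the paired metrics the assignment asks for.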

### The Best MAE I can get

In [None]:
from itertools import combinations

In [None]:
# Numeric columns: floats first, then ints
dtypes = df.dtypes
all_num_features = (list(dtypes[dtypes.values == np.dtype('float64')].index)
                    + list(dtypes[dtypes.values == np.dtype('int64')].index))
# Drop the target and the engineered columns (added to df after the split, so absent from train/test)
for col in ['total_rooms', 'cats_or_dogs', 'price']:
    all_num_features.remove(col)

In [None]:
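def find_best_train_mae(df, n):
    """Return (lowest train MAE, features) over all n-feature combinations.

    `df` is unused; the function reads the global `train` and `model`.
    """
    best_mae = None
    best_features = None
    y_train = train['price']
    for features in combinations(all_num_features, n):
        X_train = train[list(features)]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_train)
        mae = mean_absolute_error(y_train, y_pred)
        # Compare against None explicitly so an MAE of 0.0 is not treated as unset
        if best_mae is None or mae < best_mae:
            best_mae = mae
            best_features = features
    return (best_mae, best_features)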


In [None]:
find_best_train_mae(df, 2)

(818.5310213271714, ('bathrooms', 'bedrooms'))

### But is the MAE for these two features low for the test data, too?

In [None]:
# Fit on the same two features and score on the matching test columns
features = ['bathrooms', 'bedrooms']
y_train = train['price']
X_train = train[features]
X_test = test[features]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mae

In [None]:
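def find_best_test_mae(df, n):
    """Return (lowest test MAE, features) over all n-feature combinations.

    `df` is unused; the function reads the global `train`, `test`, and `model`.
    """
    best_mae = None
    best_features = None
    y_train = train['price']
    y_test = test['price']
    for features in combinations(all_num_features, n):
        X_train = train[list(features)]
        X_test = test[list(features)]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        # Compare against None explicitly so an MAE of 0.0 is not treated as unset
        if best_mae is None or mae < best_mae:
            best_mae = mae
            best_features = features
    return (best_mae, best_features)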

In [None]:
# Too computationally expensive to run for all combinations of 4 features
for i in range(2,4):
  mae, features = find_best_test_mae(df, i)
  print('features: {}\tmae: {:.2g}'.format(str(features), mae))

features: ('bathrooms', 'longitude')	mae: 8.2e+02
features: ('bathrooms', 'bedrooms', 'longitude')	mae: 7.4e+02
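
When exhaustive search gets too slow, one common alternative (not what this notebook does) is greedy forward selection: add one feature at a time, keeping whichever lowers the error most. The sketch below uses plain NumPy least squares on synthetic data and scores on a held-out validation slice; the function names, the synthetic data, and the validation split are all assumptions for illustration.

```python
import numpy as np

def holdout_mae(X_fit, y_fit, X_val, y_val):
    """Fit ordinary least squares with an intercept; return MAE on the validation rows."""
    A_fit = np.column_stack([np.ones(len(X_fit)), X_fit])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    beta, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    return np.mean(np.abs(y_val - A_val @ beta))

def forward_select(X_fit, y_fit, X_val, y_val, k):
    """Greedily add the column that lowers validation MAE the most, k times."""
    chosen, remaining = [], list(range(X_fit.shape[1]))
    for _ in range(k):
        scores = {j: holdout_mae(X_fit[:, chosen + [j]], y_fit,
                                 X_val[:, chosen + [j]], y_val)
                  for j in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic data: only columns 1 and 4 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.1, size=200)

selected = forward_select(X[:100], y[:100], X[100:], y[100:], k=2)
```

With 28 candidate features, picking 4 this way takes 28 + 27 + 26 + 25 = 106 model fits instead of 20,475, at the cost of possibly missing the best combination.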


In [None]:
len(list(combinations(all_num_features, 4))) # that's a lot of training/testing

20475
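
The growth in the number of combinations can be counted without building the list. This sketch assumes 28 candidate features, which is consistent with the count of 20,475 four-feature combinations shown above.

```python
from math import comb

n_features = 28  # assumed size of all_num_features, consistent with comb(28, 4) == 20475

# Number of distinct feature subsets for each subset size k.
counts = {k: comb(n_features, k) for k in range(2, 6)}
for k, count in counts.items():
    print(f'{k} features: {count:,} combinations')
```

Each combination requires one model fit, so the cost of exhaustive search rises quickly with the subset size.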