Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
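
A few of the ideas above can be sketched on a tiny made-up sample. The `sample` frame and the new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`) are assumptions chosen for illustration; only `description`, `cats_allowed`, and `dogs_allowed` come from the renthop data.

```python
import pandas as pd

# Hypothetical three-row sample that reuses column names from the renthop data
sample = pd.DataFrame({
    'description': ['Sunny two bedroom near the park', None, '   '],
    'cats_allowed': [1, 0, 0],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing or whitespace-only text as "no description"
text = sample['description'].fillna('').str.strip()
sample['has_description'] = (text != '').astype(int)
sample['description_length'] = text.str.len()

# Pets: at least one kind allowed vs. both kinds allowed
cats = sample['cats_allowed'] == 1
dogs = sample['dogs_allowed'] == 1
sample['cats_or_dogs'] = (cats | dogs).astype(int)
sample['cats_and_dogs'] = (cats & dogs).astype(int)

print(sample[['has_description', 'description_length',
              'cats_or_dogs', 'cats_and_dogs']])
```

Stripping whitespace before measuring length is a design choice, not something the dataset requires; it keeps blank descriptions from counting as real text.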

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
# Take a look at the data
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Engineer new features

In [4]:
# Create the feature `beds_to_baths`, which is the ratio of bedrooms to bathrooms
df['beds_to_baths'] = df['bedrooms']/df['bathrooms']
df['beds_to_baths']

0        2.0
1        2.0
2        1.0
3        1.0
4        4.0
        ... 
49347    2.0
49348    1.0
49349    1.0
49350    0.0
49351    2.0
Name: beds_to_baths, Length: 48817, dtype: float64

In [0]:
# Some apartments list zero bathrooms, so bedrooms / bathrooms is infinite
# (or NaN when bedrooms is also zero)
# Replace infinite values with NaN
df = df.replace([np.inf, -np.inf], np.nan)
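
To see why the replacement step is needed, here is a minimal sketch of how pandas handles division by zero. The `beds` and `baths` Series are made-up values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two listings with zero bathrooms, one ordinary listing
beds = pd.Series([2, 0, 3])
baths = pd.Series([0.0, 0.0, 1.5])

# pandas does not raise on division by zero:
# a nonzero number / 0 gives inf, and 0 / 0 gives NaN
ratio = beds / baths

# Map the infinite values to NaN so one dropna call can handle both cases
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```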

In [6]:
# Get the total of null values
df.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
beds_to_baths            304
dtype: int64

In [7]:
# `beds_to_baths` has 304 null values, caused by dividing by zero bathrooms
# Drop these rows
df = df.dropna(axis=0, subset=['beds_to_baths'])
df.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
beds_to_baths              0
dtype: int64

In [8]:
# Create the feature `rooms`, which is the total number of rooms in the apartment
df['rooms'] = df['bedrooms'] + df['bathrooms']
df['rooms']

0        4.5
1        3.0
2        2.0
3        2.0
4        5.0
        ... 
49347    3.0
49348    2.0
49349    2.0
49350    1.0
49351    3.0
Name: rooms, Length: 48513, dtype: float64

In [9]:
# Create the feature `perks`, which is the total number of perks
# Each perk column is 0 or 1, so summing the perk columns across a row counts that apartment's perks
df['perks'] = df.iloc[:, 10:34].sum(axis=1)
df['perks']

0        0
1        5
2        3
3        2
4        1
        ..
49347    5
49348    9
49349    5
49350    5
49351    1
Name: perks, Length: 48513, dtype: int64
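
The positional slice `10:34` depends on the column order staying fixed. As a hedged alternative, the perk columns can be selected by name. The `sample` frame and the shortened `perk_columns` list below are assumptions for illustration; the real data has 24 perk columns.

```python
import pandas as pd

# Hypothetical two-row sample with a few of the real perk column names
sample = pd.DataFrame({
    'price': [3000, 5465],
    'elevator': [0, 1],
    'cats_allowed': [0, 1],
    'doorman': [0, 1],
})

# Name the perk columns explicitly so added or reordered columns
# cannot silently change which values are summed
perk_columns = ['elevator', 'cats_allowed', 'doorman']
sample['perks'] = sample[perk_columns].sum(axis=1)
print(sample['perks'].tolist())
```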

### Train/test split

Use data from April & May 2016 to train. Use data from June 2016 to test.

In [10]:
# Check the type of column `created`
df['created'].dtypes

dtype('O')

In [11]:
# Convert string to datetime format
df['created'] = pd.to_datetime(df['created'])
df['created'].dtypes

dtype('<M8[ns]')

In [12]:
# Check what years we have in the dataset
pd.unique(df['created'].dt.year)

array([2016])

In [13]:
# The dataset only covers 2016, so the year cannot be used to make the split
# Extract the month from the date instead
df['month_created'] = df['created'].dt.month
df['month_created']

0        6
1        6
2        4
3        4
4        4
        ..
49347    6
49348    4
49349    4
49350    4
49351    4
Name: month_created, Length: 48513, dtype: int64

In [14]:
# Get unique values for month
pd.unique(df['month_created'])

array([6, 4, 5])

In [45]:
# Use data from April and May to train
train = df[df['month_created'] != 6]
pd.unique(train['month_created'])

array([4, 5])

In [16]:
# Use data from June to test
test = df[df['month_created'] == 6]
pd.unique(test['month_created'])

array([6])

In [17]:
# Get the number of observations in the train and test datasets
train.shape, test.shape

((31653, 38), (16860, 38))
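
The same split can also be expressed with a date cutoff instead of a month column. This is a minimal sketch on made-up rows; `listings`, `cutoff`, `train_part`, and `test_part` are hypothetical names.

```python
import pandas as pd

# Hypothetical listings spanning the end of May and the start of June 2016
listings = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-17 03:26:41', '2016-05-30 23:59:59',
                               '2016-06-01 00:00:00', '2016-06-24 07:54:24']),
    'price': [2850, 3100, 2500, 3000],
})

# Everything before June 1 trains the model; June onward tests it
cutoff = pd.Timestamp('2016-06-01')
train_part = listings[listings['created'] < cutoff]
test_part = listings[listings['created'] >= cutoff]

print(train_part.shape, test_part.shape)
```

Comparing against a timestamp makes the time ordering explicit: every training row is older than every test row.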

### Fit a linear regression model with at least two features

In [0]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

In [0]:
# 2. Instantiate this class
model = LinearRegression()

In [0]:
# 3. Arrange the X features matrices and y target vectors

# X features matrices
features = ['beds_to_baths', 'rooms', 'perks']
X_train = train[features]
X_test = test[features]

# y target vectors
target = 'price'
y_train = train[target]
y_test = test[target]

In [21]:
# Fit the model
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [22]:
# Train error (MAE is in the target's units: dollars of monthly rent)
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: ${mae:.2f}')

Train Error: $796.76


In [23]:
# Test error (MAE is in the target's units: dollars of monthly rent)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: ${mae:.2f}')

Test Error: $800.20
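
An MAE is easier to judge next to a baseline. One common choice, sketched here on made-up prices, is to always predict the mean training price. The arrays and variable names are assumptions, and no baseline score for the real data is implied.

```python
import numpy as np

# Hypothetical rents, not values from the renthop data
y_train_demo = np.array([2000, 3000, 4000, 3000])
y_test_demo = np.array([2500, 3500])

# Baseline: predict the mean training price for every test listing
baseline = y_train_demo.mean()

# MAE of the baseline on the test prices
baseline_mae = np.mean(np.abs(y_test_demo - baseline))
print(f'Baseline prediction: ${baseline:.2f}, baseline MAE: ${baseline_mae:.2f}')
```

A model is only adding value if its test MAE is lower than the baseline's.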


### The model's coefficients and intercept

In [24]:
# Get the equation of the model
beta0 = model.intercept_
beta1, beta2, beta3 = model.coef_
print(f'y = {beta0} + {beta1}*x1 + {beta2}*x2 + {beta3}*x3')

y = 1139.5054190856285 + -704.3343107763574*x1 + 1075.086984291165*x2 + 79.52807687635406*x3


In [25]:
# Get the model's coefficients and intercept
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept 1139.5054190856285
beds_to_baths    -704.334311
rooms            1075.086984
perks              79.528077


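Interpretation (each coefficient describes the change in predicted rent when the other two features are held constant):

- Every additional room is associated with about \$1,075 more in rent.
- Every additional perk is associated with about \$80 more in rent.
- Every one-unit increase in the ratio of bedrooms to bathrooms is associated with about \$704 less in rent. For the same total number of rooms, apartments with more bedrooms per bathroom tend to be cheaper.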
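
As a worked example, the fitted equation can be applied by hand to a hypothetical apartment. The coefficients are rounded from the model output above; `predict_rent` and the example apartments are assumptions for illustration.

```python
# Coefficients rounded from the fitted model output above
intercept = 1139.51
coef = {'beds_to_baths': -704.33, 'rooms': 1075.09, 'perks': 79.53}

def predict_rent(bedrooms, bathrooms, perks):
    """Apply the fitted linear equation to one hypothetical apartment."""
    features = {
        'beds_to_baths': bedrooms / bathrooms,
        'rooms': bedrooms + bathrooms,
        'perks': perks,
    }
    return intercept + sum(coef[name] * value for name, value in features.items())

two_bed = predict_rent(bedrooms=2, bathrooms=1, perks=3)
three_bed = predict_rent(bedrooms=3, bathrooms=1, perks=3)

print(f'2 bed / 1 bath: ${two_bed:.2f}')
print(f'3 bed / 1 bath: ${three_bed:.2f}')
print(f'Difference: ${three_bed - two_bed:.2f}')
```

Adding a bedroom changes both `rooms` and `beds_to_baths`, so in this sketch the net change is the sum of the two coefficients (about \$371), not the `rooms` coefficient alone.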

### Regression metrics 
RMSE, MAE, and $R^2$ for both the train and test data.

In [26]:
# Regression metrics for train data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred) # Mean squared Error
rmse = np.sqrt(mse) # Root Mean squared Error
mae = mean_absolute_error(y_train, y_pred) # Mean Absolute Error
r2 = r2_score(y_train, y_pred)
print('Root Mean Squared Error: ', rmse)
print('Mean Absolute Error: ', mae)
print('R2: ', r2)

Root Mean Squared Error:  1204.3873514262723
Mean Absolute Error:  796.7579758688394
R2:  0.5318096111041295


In [27]:
# Regression metrics for test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred) # Mean squared Error
rmse = np.sqrt(mse) # Root Mean squared Error
mae = mean_absolute_error(y_test, y_pred) # Mean Absolute Error
r2 = r2_score(y_test, y_pred)
print('Root Mean Squared Error: ', rmse)
print('Mean Absolute Error: ', mae)
print('R2: ', r2)

Root Mean Squared Error:  1195.0065275174468
Mean Absolute Error:  800.1970457615961
R2:  0.54065700505824
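
For reference, the three metrics can be computed directly from their definitions. This sketch uses made-up prices and predictions; the array and variable names are hypothetical.

```python
import numpy as np

# Hypothetical true rents and predictions
y_true = np.array([3000.0, 2500.0, 4000.0, 3500.0])
y_hat = np.array([2800.0, 2700.0, 4100.0, 3300.0])

errors = y_true - y_hat

# MAE: average size of the errors, in dollars
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses more
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'MAE: {mae_manual:.2f}, RMSE: {rmse_manual:.2f}, R2: {r2_manual:.3f}')
```

RMSE is never smaller than MAE on the same errors, which matches the train and test results above.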
