Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
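
The checklist above can be sketched end to end on a few made-up rows. This is a minimal illustration, not the assignment solution: the `toy` DataFrame, its values, and the `regression_metrics` helper are hypothetical, and only the column names `created`, `bedrooms`, `bathrooms`, and `price` are taken from the renthop data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy listings; the real data comes from renthop-nyc.csv.
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-03', '2016-04-17', '2016-05-02', '2016-05-20',
        '2016-05-28', '2016-06-04', '2016-06-15', '2016-06-27',
    ]),
    'bedrooms':  [1, 2, 0, 3, 2, 1, 2, 3],
    'bathrooms': [1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.5, 2.0],
    'price':     [2600, 3400, 2100, 5200, 4300, 2700, 3800, 5000],
})

# Time-based split: April & May 2016 to train, June 2016 to test.
cutoff = pd.Timestamp('2016-06-01')
train = toy[toy['created'] < cutoff]
test = toy[toy['created'] >= cutoff]

features = ['bedrooms', 'bathrooms']
target = 'price'

# Fit ordinary least squares on the training rows only.
model = LinearRegression()
model.fit(train[features], train[target])

# Coefficients and intercept.
print('Intercept:', model.intercept_)
print(pd.Series(model.coef_, index=features).to_string())


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'RMSE': np.sqrt(mse),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
    }


train_metrics = regression_metrics(train[target], model.predict(train[features]))
test_metrics = regression_metrics(test[target], model.predict(test[features]))
print('Train:', train_metrics)
print('Test: ', test_metrics)
```

Splitting on the `created` date rather than at random keeps the test set strictly later than the training set, which is what the assignment asks for.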


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
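
Several of the ideas above could be sketched as follows. The rows are made up, the new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`, `perk_count`, `total_rooms`, `beds_per_bath`) are hypothetical, and the sketch assumes the data has a free-text `description` column alongside the 0/1 perk columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the renthop listings.
toy = pd.DataFrame({
    'description': ['Sunny one bedroom near the park', '   ', None],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 0.0, 1.0],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as empty.
description = toy['description'].fillna('').str.strip()
toy['has_description'] = (description != '').astype(int)
toy['description_length'] = description.str.len()

# Cats OR dogs allowed vs. cats AND dogs allowed.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total perks: sum over an explicit list of 0/1 columns.
perk_columns = ['elevator', 'doorman', 'cats_allowed', 'dogs_allowed']
toy['perk_count'] = toy[perk_columns].sum(axis=1)

# Total number of rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

# Ratio of beds to baths; a listing with zero bathrooms has no defined
# ratio, so it is left as NaN here instead of becoming inf.
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy.drop(columns='description').to_string())
```

`LinearRegression` does not accept NaN values, so a column like `beds_per_bath` would still need those rows imputed or dropped before fitting.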

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [3]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

Collecting pandas-profiling
  Downloading https://files.pythonhosted.org/packages/2c/2f/aae19e2173c10a9bb7fee5f5cad35dbe53a393960fc91abc477dcc4661e8/pandas-profiling-2.3.0.tar.gz (127kB)
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/70/19/8437e22c84083a6d5d8a3c80f4edc73c9dcbb89261d07e6bd13b48752bbd/plotly-4.1.1-py2.py3-none-any.whl (7.1MB)
Collecting htmlmin>=0.1.12 (from pandas-profiling)
  Downloading https://files.pythonhosted.org/packages/b3/e7/fcd59e12169de19f0131ff2812077f964c6b960e7c09804d30a7bf2ab461/htmlmin-0.1.12.tar.gz
Collecting phik>=0.9.8 (from pandas-profiling)
  Downloading https://files.pythonhosted.org/packages/45/ad/24a16fa4ba612fb96a3c4bb115a5b9741483f53b66d3d3afd987f20fa227/phik-0.9.8-py3-none-any.whl (606kB)
Collecting confuse>=1.0.0 (f

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 94, done.
remote: Total 94 (delta 0), reused 0 (delta 0), pack-reused 94
Unpacking objects: 100% (94/94), done.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master


In [37]:
#Organising the code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] <= np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
df['bath_bed']=df['bathrooms']+df['bedrooms'] #Number of bathrooms plus bedrooms
df['bathandbed']=df['bathrooms']*df['bedrooms'] #Interaction term: bathrooms times bedrooms
df['cats_dogs']=df['cats_allowed']+df['dogs_allowed'] #0 = neither, 1 = cats or dogs only, 2 = both
#Converting interest level into an ordinal integer (low < medium < high)
df['interest_level'] = df['interest_level'].replace({'low':1,'medium':2,'high':3}).astype('int')
#Counting the outdoor/open-space perks of each listing
df['openspaces']=df['roof_deck']+df['outdoor_space']+df['balcony']+df['terrace']+df['loft']+df['garden_patio']
#Getting the bed to bath ratio
df['bed_per_bathrooms']=df['bedrooms']/df['bathrooms']
df['bed_per_bathrooms'] = df['bed_per_bathrooms'].fillna(0)
df['bed_per_bathrooms'] = df['bed_per_bathrooms'].replace(np.inf,9999)
#Getting the bath to bed ratio
df['bath_per_bed']=df['bathrooms']/df['bedrooms']
df['bath_per_bed'] = df['bath_per_bed'].fillna(0)
df['bath_per_bed'] = df['bath_per_bed'].replace(np.inf,9999)
#Converting the created column to datetime
df['created'] = pd.to_datetime(df['created'],infer_datetime_format=True)
#Getting the boolean columns and adding them to perk columns
perks_columns = []
for col in df:
  if(df[col].nunique()==2):
    perks_columns.append(col)
#Getting the feature of number of perks for a place
df['no_of_perks']=df[perks_columns].sum(axis=1)
#Using every float64/int64 column as a feature
features = [col for col in df if df[col].dtype == 'float64' or df[col].dtype == 'int64']
#Removing price from the features as it is the target
features.remove('price')
#features.remove('latitude')
#features.remove('longitude')
train = df[df['created']<'2016-06-01']
test= df[df['created']>='2016-06-01']
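#Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2;
#on newer versions, drop it and use LinearRegression()
model = LinearRegression(normalize=True)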
#Fitting the data on train data
#Taking the log of the price
model.fit(train[features],np.log(train['price']))
#print(f'The intercept of the model is {model.intercept_:,.0f}')
coefficients = pd.Series(model.coef_,features).sort_values(ascending=False)
print(coefficients.to_string())
#Predicting the price for both train and test data with the fitted model
#Taking the exp of the predictions to bring the log price back to the original scale
y_train = np.exp(model.predict(train[features]))
y_test = np.exp(model.predict(test[features]))
train_price = train['price']
test_price = test['price']
rmse_train = np.sqrt(mean_squared_error(train_price,y_train))
rmse_test = np.sqrt(mean_squared_error(test_price,y_test))
#MAE for both train and test data
print(f'The MAE for Train :{mean_absolute_error(train_price,y_train):.2f} and Test:{mean_absolute_error(test_price,y_test):.2f}')
print(f'The R2 score for Train :{r2_score(train_price,y_train):.2f} and Test:{r2_score(test_price,y_test):.2f}')
print(f'The RMSE score for Train :{rmse_train:.2f} and Test:{rmse_test:.2f}')


terrace                 4.320753e+11
loft                    4.320753e+11
garden_patio            4.320753e+11
outdoor_space           4.320753e+11
balcony                 4.320753e+11
roof_deck               4.320753e+11
bath_bed                4.224688e+11
cats_dogs               1.044645e+11
doorman                 1.984994e+10
laundry_in_unit         1.984994e+10
elevator                1.984994e+10
dishwasher              1.984994e+10
wheelchair_access       1.984994e+10
dining_room             1.984994e+10
fitness_center          1.984994e+10
exclusive               1.984994e+10
swimming_pool           1.984994e+10
no_fee                  1.984994e+10
laundry_in_building     1.984994e+10
pre-war                 1.984994e+10
hardwood_floors         1.984994e+10
new_construction        1.984994e+10
common_outdoor_space    1.984994e+10
high_speed_internet     1.984994e+10
latitude                2.026356e-01
bed_per_bathrooms       3.908025e-05
bath_per_bed           -7.044169e-06
b

In [0]:
# Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
# Engineer at least two new features. (See below for explanation & ideas.)
# Fit a linear regression model with at least two features.
# Get the model's coefficients and intercept.
# Get regression metrics RMSE, MAE, and  R2 , for both the train and test data.
# What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
# As always, commit your notebook to your fork of the GitHub repo.


bathrooms               0
bedrooms                0
latitude                0
longitude               0
interest_level          0
elevator                0
cats_allowed            0
hardwood_floors         0
dogs_allowed            0
doorman                 0
dishwasher              0
no_fee                  0
laundry_in_building     0
fitness_center          0
pre-war                 0
laundry_in_unit         0
roof_deck               0
outdoor_space           0
dining_room             0
high_speed_internet     0
balcony                 0
swimming_pool           0
new_construction        0
terrace                 0
exclusive               0
loft                    0
garden_patio            0
wheelchair_access       0
common_outdoor_space    0
bath_bed                0
no_of_perks             0
dtype: int64