Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
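
A few of the ideas above can be sketched in pandas. This is a minimal illustration on a made-up three-row DataFrame. The column names `description`, `cats_allowed`, and `dogs_allowed` are assumptions about how the listing data is laid out, so check `df.columns` before reusing them.

```python
import pandas as pd

# Toy listings; the column names are assumed, not taken from the real file
listings = pd.DataFrame({
    'description': ['Sunny 2BR near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
})

# Treat missing or whitespace-only descriptions as empty text
text = listings['description'].fillna('').str.strip()
listings['has_description'] = text != ''
listings['description_length'] = text.str.len()

# Cats OR dogs allowed vs. cats AND dogs allowed
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = cats | dogs
listings['cats_and_dogs'] = cats & dogs

# Total number of rooms (beds + baths)
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']

print(listings[['has_description', 'description_length',
                'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

Boolean features like these can be passed to a linear regression directly, since pandas and scikit-learn treat `True`/`False` as 1/0.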

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
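# Parse the 'created' index as datetimes so sorting and date comparisons work on real timestamps
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv', index_col='created', parse_dates=True)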
#assert df.shape == (49352, 34)



# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

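# Sort by creation time so a date-based train/test split is straightforward
df = df.sort_index()

# Only the last expression in a cell is displayed, so preview the data here
df.head()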

In [None]:
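# astype returns a new Series, so assign it back to keep the conversion
df['bedrooms'] = df['bedrooms'].astype(float)
# Drop rows with a missing value in any column, then confirm none remain
df = df.dropna()
df.isnull().sum()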

In [None]:
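# Engineer at least two new features.

# Bathrooms per bedroom (listings with 0 bedrooms produce inf, handled below)
df['bedbathRatio'] = df['bathrooms'] / df['bedrooms']
# Total number of rooms (beds + baths)
df['total'] = df['bedrooms'] + df['bathrooms']

# Division by zero yields inf (or NaN for 0/0); convert inf to NaN and drop those rows.
# Note: this removes every listing with 0 bedrooms from the data.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()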

In [None]:
# Train/test split 
# df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)
# df['created'].describe()
# df.dtypes

# train = df[df.created.dt.month < 6]
# test  = df[df.created.dt.month == 6]
# train.shape, test.shape

y = df['price']  # Target vector: one-dimensional
X = df[['total', 'bedbathRatio']]  # Feature matrix: two-dimensional


In [None]:
cutoff = '2016-06-01'
mask = X.index < cutoff

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

y_train.tail()

created
2016-05-31 22:07:36    3100
2016-05-31 22:39:35    1900
2016-05-31 22:46:46    3000
2016-05-31 22:46:47    2525
2016-05-31 23:10:48    3095
Name: price, dtype: int64

In [None]:
from sklearn.metrics import mean_absolute_error

y_pred = [y_train.mean()] * len(y_train)
print('price mean', y_train.mean())
print('Baseline MAE:', mean_absolute_error(y_train, y_pred))

price mean 3836.7355757624528
Baseline MAE: 1262.5709323846613


In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# y_train.isnull().sum()


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
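
The assignment checklist also asks for the model's coefficients and intercept. A fitted scikit-learn `LinearRegression` exposes them as the `coef_` and `intercept_` attributes. The sketch below fits a separate toy model on made-up numbers (the values and the names `toy_X`, `toy_y`, and `toy_model` are for illustration only) so that it runs on its own; in this notebook the same attributes would be read from `model`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data generated exactly from: price = 1000 + 500*total + 200*bedbathRatio
toy_X = pd.DataFrame({
    'total': [2.0, 3.0, 4.0, 5.0, 3.0],
    'bedbathRatio': [1.0, 0.5, 1.0, 0.5, 2.0],
})
toy_y = 1000 + 500 * toy_X['total'] + 200 * toy_X['bedbathRatio']

toy_model = LinearRegression()
toy_model.fit(toy_X, toy_y)

# intercept_ is a single number; coef_ has one entry per feature column, in column order
print('Intercept:', toy_model.intercept_)
for name, coef in zip(toy_X.columns, toy_model.coef_):
    print(f'Coefficient for {name}: {coef}')
```

Each coefficient is the change in predicted price for a one-unit increase in that feature with the other feature held fixed, and the intercept is the prediction when every feature equals zero.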

In [None]:
print('Training MAE:', mean_absolute_error(y_train, model.predict(X_train)))
print('Test MAE:', mean_absolute_error(y_test, model.predict(X_test)))

Training MAE: 899.5418536899451
Test MAE: 906.6641190970572


In [None]:
from sklearn.metrics import mean_squared_error

# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
print('Training RMSE:', np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

Training RMSE: 1337.8854663681518
Test RMSE: 1326.8776646639346


In [None]:
from sklearn.metrics import r2_score

print('Training R^2:', r2_score(y_train, model.predict(X_train)))
print('Test R^2:', r2_score(y_test, model.predict(X_test)))

Training R^2: 0.4691633211269619
Test R^2: 0.4812829886125394
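
To make the three metrics concrete, here is a small sketch that computes MAE, RMSE, and $R^2$ directly from their definitions with NumPy. The prices and predictions are made-up numbers, and the helper name `regression_metrics` is hypothetical, not part of scikit-learn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error

    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up rents and predictions, for illustration only
actual = [3000, 2500, 4000, 3500]
predicted = [2800, 2700, 4100, 3500]

mae, rmse, r2 = regression_metrics(actual, predicted)
print('MAE:', mae)
print('RMSE:', rmse)
print('R^2:', r2)
```

MAE and RMSE are in the same units as the target (dollars of rent here), while $R^2$ is unitless. Predicting the mean of `y_true` for every row gives an $R^2$ of 0 on that same data, which is why the mean baseline is a useful point of comparison.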
