<a href="https://colab.research.google.com/github/owenburton/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
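
A few of these ideas can be sketched on a tiny made-up DataFrame. This is only an illustration: the column names `description`, `cats_allowed`, and `dogs_allowed` are assumptions about the dataset (check `df.columns` first), and `toy` is a hypothetical stand-in for the real listings.

```python
import pandas as pd

# Made-up listings; the column names are assumptions, not confirmed by the assignment.
toy = pd.DataFrame({
    'description': ['Sunny one bedroom near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'bedrooms': [1, 0, 2],
    'bathrooms': [1.0, 1.0, 2.0],
})

# Does the apartment have a (non-blank) description, and how long is it?
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Are cats or dogs allowed? Are cats and dogs allowed?
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

Treating a whitespace-only description as missing is a judgment call; drop the `.str.strip()` if blank strings should count.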

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
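
To see what trimming at the 0.5th and 99.5th percentiles does, here is a minimal sketch on made-up prices. `toy_prices` is hypothetical, and `Series.between` is swapped in for the chained comparisons used above.

```python
import numpy as np
import pandas as pd

# Made-up prices 1..1000, so the effect of the cut is easy to count.
toy_prices = pd.Series(np.arange(1, 1001))

# The 0.5th and 99.5th percentiles bracket the middle 99% of values.
low, high = np.percentile(toy_prices, [0.5, 99.5])

# between() is inclusive on both ends by default.
kept = toy_prices[toy_prices.between(low, high)]

print(len(toy_prices) - len(kept), 'rows removed')
```

Ten of the 1,000 rows are dropped here, which is the "most extreme 1%": half from each tail.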

## Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [0]:
# Make sure the dataset only includes observations from April, May, and June 2016.
assert df['created'].str.contains('2016-04|2016-05|2016-06').all()

In [0]:
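# Training data includes observations from April & May.
# .copy() makes an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
train = df[df['created'].str.contains('2016-04|2016-05')].copy()

# Test data includes observations from June.
test = df[df['created'].str.contains('2016-06')].copy()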
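
An alternative to string matching is to parse the timestamps and split on the calendar month. This is a sketch on made-up rows; the exact `created` timestamp format shown is an assumption, and `toy`, `toy_train`, and `toy_test` are hypothetical names.

```python
import pandas as pd

# Made-up rows; the timestamp format is an assumption about the 'created' column.
toy = pd.DataFrame({
    'created': ['2016-04-03 10:15:00', '2016-05-20 08:00:00', '2016-06-01 12:30:00'],
    'price': [3000, 2500, 4100],
})

# Parse the strings once, then split on the month number.
created = pd.to_datetime(toy['created'])
toy_train = toy[created.dt.month.isin([4, 5])].copy()  # April & May
toy_test = toy[created.dt.month == 6].copy()           # June

print(len(toy_train), len(toy_test))
```

Splitting on the month alone only works because every observation falls in 2016; with multiple years, the year would need to be checked too.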

## Establish a baseline.

In [130]:
from sklearn.metrics import mean_absolute_error

# Arrange y target vectors.
target = 'price'
y_train = train[target]
y_test = test[target]

# Get mean baseline.
guess = y_train.mean()
print(f'Mean baseline (using zero features): {guess:.2f}')

# Show train error. The target is monthly rent, so MAE is in dollars.
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train error (April & May): {mae:.2f} dollars')

# Show test error.
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error (June): {mae:.2f} dollars')

Mean baseline (using zero features): 3575.60
Train error (April & May): 1201.88 dollars
Test error (June): 1197.71 dollars


^ Yikes. Guessing the mean rent for every listing is off by about $1,200 on average.
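
To make the baseline concrete, here is a small worked example with made-up rents (the `toy_` names are hypothetical). It computes the same quantity that `mean_absolute_error` returns for a constant guess.

```python
import numpy as np

# Made-up monthly rents, in dollars.
toy_y_train = np.array([2000, 3000, 4000])
toy_y_test = np.array([2500, 5000])

# The baseline predicts the training mean for every listing.
toy_guess = toy_y_train.mean()

# MAE is the average absolute distance between each rent and the guess.
toy_mae = np.abs(toy_y_test - toy_guess).mean()

print(toy_guess, toy_mae)
```

Here the guess is 3000, the test errors are 500 and 2000, and their average is 1250.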

## Engineer at least two new features.

I'm going to use two new features. The first is the total number of bedrooms and bathrooms. The second is the ratio of bathrooms to bedrooms.

In [0]:
# Create new feature for the total number of rooms for an apartment.
train['total_rooms'] = train['bathrooms'] + train['bedrooms']
test['total_rooms'] = test['bathrooms'] + test['bedrooms']

In [0]:
# Create new feature for the ratio of bathrooms to bedrooms.
train['bath_bed_ratio'] = train['bathrooms'] / train['bedrooms']
test['bath_bed_ratio'] = test['bathrooms'] / test['bedrooms']

In [0]:
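# Replace infinite and NaN values with the mean of the remaining ratios.
# (Dividing by zero bedrooms gives inf; 0 bathrooms / 0 bedrooms gives NaN.)
train_no_inf = train[~train.isin([np.inf]).any(axis=1)]
test_no_inf = test[~test.isin([np.inf]).any(axis=1)]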

train_ratio_mean = train_no_inf['bath_bed_ratio'].mean()
test_ratio_mean = test_no_inf['bath_bed_ratio'].mean()

train['bath_bed_ratio'] = train['bath_bed_ratio'].replace(np.inf, train_ratio_mean).round(2)
test['bath_bed_ratio'] = test['bath_bed_ratio'].replace(np.inf, test_ratio_mean).round(2)

train['bath_bed_ratio'] = train['bath_bed_ratio'].replace(np.nan, train_ratio_mean).round(2)
test['bath_bed_ratio'] = test['bath_bed_ratio'].replace(np.nan, test_ratio_mean).round(2)
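
The cleanup above is needed because studios have zero bedrooms. A more compact way to get a similar result is sketched below on made-up rows. It is not the exact code used in this notebook, and `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up listings: the second and third rows have zero bedrooms.
toy = pd.DataFrame({'bathrooms': [1.0, 1.0, 0.0, 2.0],
                    'bedrooms':  [2,   0,   0,   1]})

# In pandas, x / 0 gives inf and 0 / 0 gives NaN; no exception is raised.
ratio = toy['bathrooms'] / toy['bedrooms']

# Turn inf into NaN so both cases are handled in one step,
# then fill with the mean of the finite ratios (mean() skips NaN by default).
ratio = ratio.replace([np.inf, -np.inf], np.nan)
toy['bath_bed_ratio'] = ratio.fillna(ratio.mean())

print(toy)
```

One design choice worth considering: filling the test set with the training mean, rather than its own mean, keeps test information out of the preprocessing step.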

## Fit a linear regression model with at least two features.

In [134]:
# 1. Import the appropriate estimator class from Scikit-Learn.
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class.
model = LinearRegression()

# 3. Arrange X features matrices (already did y target vector).
features = ['total_rooms', 'bath_bed_ratio', 'latitude', 'longitude']
X_train = train[features]
X_test = test[features]
print(f'Linear Regression, dependent on: {features}')

# 4. Fit the model. The target is monthly rent, so MAE is in dollars.
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} dollars')

# 5. Apply the model to new data.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f} dollars')

Linear Regression, dependent on: ['total_rooms', 'bath_bed_ratio', 'latitude', 'longitude']
Train Error: 772.47 dollars
Test Error: 784.67 dollars


## Get the model's coefficients and intercept.

In [136]:
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept -1358407.953726254
total_rooms         913.507831
bath_bed_ratio     1524.830255
latitude           2092.005789
longitude        -17210.149025
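
To read these numbers, recall that a linear regression prediction is the intercept plus each feature multiplied by its coefficient. The sketch below checks that on made-up data with a known relationship; `X_toy`, `y_toy`, and `toy_model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 100 + 10*x1 + 5*x2 exactly.
X_toy = np.array([[1, 2], [2, 1], [3, 5], [4, 3]], dtype=float)
y_toy = 100 + 10 * X_toy[:, 0] + 5 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Reproduce a prediction by hand: intercept + sum(coefficient * feature).
x_new = np.array([5.0, 4.0])
by_hand = toy_model.intercept_ + np.dot(toy_model.coef_, x_new)
by_model = toy_model.predict(x_new.reshape(1, -1))[0]

print(toy_model.intercept_, toy_model.coef_, by_hand, by_model)
```

The intercept is the prediction when every feature equals zero. For latitude and longitude, zero is nowhere near NYC, so a very large intercept like the one above is not alarming on its own.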


## Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

In [146]:
# For the training data.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y = train[target]
y_pred = model.predict(X_train)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('Mean Absolute Error:', mae)
print('R^2:', r2)

Mean Squared Error: 1414524.4328843853
Root Mean Squared Error: 1189.3378127699402
Mean Absolute Error: 772.4702662614842
R^2: 0.5444407156321917


In [147]:
# For the test data.
y = test[target]
y_pred = model.predict(X_test)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('Mean Absolute Error:', mae)
print('R^2:', r2)

Mean Squared Error: 1393065.2940971793
Root Mean Squared Error: 1180.2818706127698
Mean Absolute Error: 784.6718381468669
R^2: 0.5517838605204604
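
Since the train and test cells repeat the same four calculations, they could be wrapped in a small helper. This is a sketch, not part of the assignment: `regression_metrics` is a hypothetical name, and the values it is checked on are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 in a dict (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MSE': mse,
        'RMSE': np.sqrt(mse),  # same units as the target
        'MAE': mean_absolute_error(y_true, y_pred),
        'R^2': r2_score(y_true, y_pred),
    }

# Made-up values: the errors are 10, -10, and 30.
toy_true = np.array([100.0, 200.0, 300.0])
toy_pred = np.array([110.0, 190.0, 330.0])
toy_metrics = regression_metrics(toy_true, toy_pred)

for name, value in toy_metrics.items():
    print(f'{name}: {value:.3f}')
```

With the notebook's variables, the calls would look like `regression_metrics(y_train, model.predict(X_train))` and `regression_metrics(y_test, model.predict(X_test))`.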
