Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
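
One way to do the date-based split in the first item is sketched below. It assumes the listing timestamp lives in a column named `created`. The helper name `split_by_month` and the toy rows are hypothetical and only illustrate the idea; the real data may need different handling.

```python
import pandas as pd

def split_by_month(df, date_col="created"):
    """Return (train, test): April-May 2016 rows and June 2016 rows.

    Hypothetical helper for illustration; not part of the assignment repo.
    """
    created = pd.to_datetime(df[date_col])
    train_mask = (created >= pd.Timestamp("2016-04-01")) & (created < pd.Timestamp("2016-06-01"))
    test_mask = (created >= pd.Timestamp("2016-06-01")) & (created < pd.Timestamp("2016-07-01"))
    return df[train_mask], df[test_mask]

# Toy rows standing in for the real listings (values are made up)
toy = pd.DataFrame({
    "created": ["2016-04-03 10:00:00", "2016-05-31 23:59:59",
                "2016-06-01 00:00:00", "2016-06-30 12:00:00"],
    "price": [2500, 3100, 2900, 4000],
})
train, test = split_by_month(toy)
print(len(train), len(test))
```

Parsing the column with `pd.to_datetime` makes the month boundaries explicit, so a listing created at midnight on June 1 falls in the test set rather than the training set.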


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
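
A few of these ideas can be sketched in pandas as follows. This is a minimal illustration, assuming the dataframe has a free-text `description` column alongside `bedrooms`, `bathrooms`, `cats_allowed`, and `dogs_allowed`. The helper name `add_example_features` and the toy rows are hypothetical.

```python
import pandas as pd

def add_example_features(df):
    """Return a copy of df with a few engineered columns (illustrative only)."""
    out = df.copy()

    # Treat missing or whitespace-only descriptions as "no description"
    desc = out["description"].fillna("").str.strip()
    out["has_description"] = (desc != "").astype(int)
    out["description_length"] = desc.str.len()

    # Total number of rooms (beds + baths)
    out["total_rooms"] = out["bedrooms"] + out["bathrooms"]

    # Pet policy: either pet allowed vs. both pets allowed
    cats = out["cats_allowed"] == 1
    dogs = out["dogs_allowed"] == 1
    out["cats_or_dogs"] = (cats | dogs).astype(int)
    out["cats_and_dogs"] = (cats & dogs).astype(int)
    return out

# Toy rows standing in for real listings (values are made up)
toy = pd.DataFrame({
    "description": ["Sunny loft", None, "   "],
    "bedrooms": [2, 0, 1],
    "bathrooms": [1.0, 1.0, 1.5],
    "cats_allowed": [1, 0, 1],
    "dogs_allowed": [0, 0, 1],
})
features = add_example_features(toy)
print(features[["has_description", "description_length", "total_rooms"]])
```

The beds-to-baths ratio is left out of the sketch because listings with zero bathrooms would need a rule for division by zero.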

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
# Engineer three new features
df_work = df.copy()

# "Building convenience": number of building-level perks each listing has
df_work['convenience_building'] = (df_work['elevator']
                                   + df_work['doorman']
                                   + df_work['laundry_in_building']
                                   + df_work['fitness_center']
                                   + df_work['high_speed_internet']
                                   + df_work['swimming_pool']
                                   + df_work['wheelchair_access']
                                   + df_work['common_outdoor_space'])

# "Unit convenience": number of unit-level perks each listing has
df_work['convenience_unit'] = df_work['dishwasher'] + df_work['laundry_in_unit']

# "Pet convenience": 0 if no pets are allowed, 1 if only cats or only dogs, 2 if both
df_work['convenience_pet'] = df_work['cats_allowed'] + df_work['dogs_allowed']

In [0]:
# Keep only the columns needed for the split and the regression analysis
columns = ['created', 'price', 'bedrooms', 'bathrooms', 'convenience_building', 'convenience_unit', 'convenience_pet']
df_analyze = df_work[columns]

In [0]:
# Construct training and testing dataframes.
# read_csv leaves `created` as an ISO-formatted string, so string comparison orders rows chronologically.
# drop() returns a new dataframe, which avoids pandas' SettingWithCopyWarning from in-place edits on a slice.
df_train = df_analyze.query('created >= "2016-04-01" and created < "2016-06-01"')  # April and May 2016
df_train_X = df_train.drop(columns=['created', 'price']).reset_index(drop=True)
df_train_y = df_train[['price']].reset_index(drop=True)

df_test = df_analyze.query('created >= "2016-06-01" and created < "2016-07-01"')  # June 2016
df_test_X = df_test.drop(columns=['created', 'price']).reset_index(drop=True)
df_test_y = df_test[['price']].reset_index(drop=True)


In [182]:
# Let's create a baseline "dummy" estimate (regressor)
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Generate an instance of the dummy regressor model
dummy_mean = DummyRegressor(strategy='mean')

# "Train" the dummy regressor model
dummy_mean.fit(df_train_X, df_train_y)

# Generate the dummy model's predictions on the test data
y_pred_dummy = dummy_mean.predict(df_test_X)

# Generate the mean absolute error of the dummy regressor
mae_dummy = mean_absolute_error(df_test_y, y_pred_dummy)

print(f'The dummy model\'s prediction error (mae) is: ${round(mae_dummy, 2)}')

The dummy model's prediction error (mae) is: $1197.71


In [183]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a linear regression model
mdl = LinearRegression()

# Fit the model using the training data subset
mdl.fit(df_train_X, df_train_y)

# Print out the regression expression
print(f'REGRESSION EQUATION: rent = {round(mdl.coef_[0][0], 2)}*(# bedrooms) + {round(mdl.coef_[0][1], 2)}*(# bathrooms) + {round(mdl.coef_[0][2], 2)}*(# building perks) + {round(mdl.coef_[0][3], 2)}*(# unit perks) + {round(mdl.coef_[0][4], 2)}*(# dog and/or cat perks) + {round(mdl.intercept_[0], 2)}\n')

# Generate model price predictions using the training data subset
y_pred_train = mdl.predict(df_train_X)
train_mae = mean_absolute_error(df_train_y, y_pred_train)
train_r2  = r2_score(df_train_y, y_pred_train)
train_mse = mean_squared_error(df_train_y, y_pred_train)
train_rmse = np.sqrt(train_mse)

# Generate model price predictions using the testing data subset
y_pred_test = mdl.predict(df_test_X)
test_mae = mean_absolute_error(df_test_y, y_pred_test)
test_r2  = r2_score(df_test_y, y_pred_test)
test_mse = mean_squared_error(df_test_y, y_pred_test)
test_rmse = np.sqrt(test_mse)

out_data = [{'data': 'training', 'MAE': round(train_mae, 2), 'R2': train_r2, 'RMSE': round(train_rmse, 2)}, {'data': 'test', 'MAE': round(test_mae, 2), 'R2': test_r2, 'RMSE': round(test_rmse, 2)}] 
df_output = pd.DataFrame(out_data)

df_output



REGRESSION EQUATION: rent = 405.37*(# bedrooms) + 1888.92*(# bathrooms) + 152.96*(# building perks) + 215.01*(# unit perks) + 55.5*(# dog and/or cat perks) + 275.67



       data     MAE        R2     RMSE
0  training  777.87  0.546223  1187.01
1      test  784.94  0.554289  1176.98
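
To put the test MAE in context, it can be compared with the mean baseline reported earlier. The short sketch below reuses the two test MAE values printed above; the variable names are illustrative only.

```python
# Test MAE values copied from the outputs above
baseline_mae = 1197.71  # DummyRegressor(strategy='mean')
model_mae = 784.94      # LinearRegression with five features

reduction = baseline_mae - model_mae
relative = reduction / baseline_mae
print(f"MAE reduction vs. baseline: ${reduction:.2f} ({relative:.1%})")
```

On these numbers the linear model cuts the baseline's average error by roughly a third.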
