<a href="https://colab.research.google.com/github/chrishuskey/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/Assignment_DS_212_Regression_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [✓] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [✓] Engineer at least two new features. (See below for explanation & ideas.)
- [✓] Fit a linear regression model with at least two features.
- [✓] Get the model's coefficients and intercept.
- [✓] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [✓] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [✓] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
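
A few of the ideas above can be sketched on a toy DataFrame. This is a minimal illustration, not the assignment's solution: the `listings` frame, the `description` column name, and every new feature name below are assumptions made for the example, and the real dataset may need extra cleaning.

```python
import pandas as pd

# Toy listings (hypothetical values; 'description' is an assumed column name).
listings = pd.DataFrame({
    'description': ['Sunny 2BR near park', '', None],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 2.0],
})

# Treat missing descriptions as empty strings and ignore surrounding whitespace.
description = listings['description'].fillna('').str.strip()
listings['has_description'] = (description != '').astype(int)
listings['description_length'] = description.str.len()

# Pet policy: either pet allowed vs. both pets allowed.
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

# Room totals and ratio; a bathroom count of 0 becomes NaN to avoid dividing by zero.
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']
listings['beds_per_bath'] = listings['bedrooms'] / listings['bathrooms'].replace(0, float('nan'))
```

Any NaN produced by the ratio would need to be filled or dropped before fitting, since scikit-learn's `LinearRegression` does not accept missing values.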

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
# Import libraries:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read New York City apartment rental listing data
rent_data = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert rent_data.shape == (49352, 34)

In [0]:
# Trim outliers: keep only listings whose price, latitude, and longitude 
# each fall between the 0.1st and 99.9th percentiles (i.e., drop the lowest 
# and highest 0.1% of each):
rent_data = rent_data[(rent_data['price'] >= rent_data['price'].quantile(0.001)) & 
        (rent_data['price'] <= rent_data['price'].quantile(0.999)) & 
        (rent_data['latitude'] >= rent_data['latitude'].quantile(0.001)) & 
        (rent_data['latitude'] < rent_data['latitude'].quantile(0.999)) &
        (rent_data['longitude'] >= rent_data['longitude'].quantile(0.001)) & 
        (rent_data['longitude'] <= rent_data['longitude'].quantile(0.999))]
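
The same trimming pattern can be wrapped in a small helper. The sketch below is one possible way to do it: `trim_to_quantiles` and the toy data are hypothetical names invented for this example. Note that `Series.between` is inclusive on both ends by default, whereas the cell above uses a strict `<` for the upper latitude bound, so results on the real data could differ by a few rows.

```python
import pandas as pd

def trim_to_quantiles(df, columns, lower=0.001, upper=0.999):
    """Keep rows whose value in every listed column lies within
    that column's [lower, upper] quantile range (inclusive)."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        low_value, high_value = df[col].quantile([lower, upper])
        keep &= df[col].between(low_value, high_value)
    # Return an independent copy so new columns can be added safely later.
    return df[keep].copy()

# Toy data with one implausibly low and one implausibly high price.
toy = pd.DataFrame({'price': [1, 2000, 2500, 3000, 1_000_000]})

# Wide quantile bounds are used here only because the toy frame is tiny.
trimmed = trim_to_quantiles(toy, ['price'], lower=0.25, upper=0.75)
print(trimmed['price'].tolist())
```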

In [0]:
# Change to the right data types:
# (infer_datetime_format=True could speed up parsing in older pandas, but it is 
# deprecated since pandas 2.0, where consistent-format inference is the default, 
# so it is omitted here.)
rent_data['created'] = pd.to_datetime(rent_data['created'])

In [0]:
# Split into training and test data:
# Training data: listings from April and May 2016
# Test data: listings from June 2016
train = rent_data[rent_data['created'].dt.month.isin([4, 5])]
test = rent_data[rent_data['created'].dt.month == 6]

# Check to make sure the resulting datasets have the right numbers of 
# observations (and that we got all of them) and features:
assert (train.shape[0] + test.shape[0] == rent_data.shape[0]) & (train.shape[1] == rent_data.shape[1]) &  (test.shape[1] == rent_data.shape[1])
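
A time-based split can also be written with explicit timestamp bounds, which keeps the year in the condition as well as the month. This is a sketch on toy data: the `toy`, `train_toy`, and `test_toy` names and the boundary variables are assumptions for illustration only.

```python
import pandas as pd

# Toy listings spanning the three months in the assignment.
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-03', '2016-05-20', '2016-06-01', '2016-06-15']),
    'price': [2400, 3100, 2800, 3500],
})

# Hypothetical boundaries: train on [April 1, June 1), test on [June 1, July 1).
train_start = pd.Timestamp('2016-04-01')
test_start = pd.Timestamp('2016-06-01')
test_end = pd.Timestamp('2016-07-01')

train_toy = toy[(toy['created'] >= train_start) & (toy['created'] < test_start)].copy()
test_toy = toy[(toy['created'] >= test_start) & (toy['created'] < test_end)].copy()

# Every row lands in exactly one split, and all training rows precede all test rows.
assert len(train_toy) + len(test_toy) == len(toy)
assert train_toy['created'].max() < test_toy['created'].min()
```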

In [0]:
# Add new feature:  Total # of bedrooms and bathrooms:
train['bedrooms+bathrooms'] = train['bedrooms'] + train['bathrooms']
test['bedrooms+bathrooms'] = test['bedrooms'] + test['bathrooms']

# Add new feature:  Perks by price tiers:
# The current features for amenities are all binary 0/1 indicators.  Each 
# could be used in a linear regression on its own, but one way to condense 
# them (and possibly improve price-predicting power) is to group perks by 
# corresponding price level (based on rarity in the data set, with some 
# manual adjustments based on intuition / "domain knowledge"):
# (1) Level 1 perks: slight price-boosters
# (2) Level 2 perks: higher end perks indicative of higher-rent apartments
# (3) Level 3 perks: luxury perks indicative of very expensive apartments
train['L1_price_boost_perks'] = train['elevator'] + (train['cats_allowed'] & train['dogs_allowed']) + train['laundry_in_building']
train['L2_high_end_perks'] = train['hardwood_floors'] + train['doorman'] + train['dishwasher'] + train['fitness_center'] + train['pre-war'] + train['roof_deck'] + train['high_speed_internet']
train['L3_luxury_perks'] = train['swimming_pool'] + train['laundry_in_unit'] + train['terrace'] + train['balcony'] + train['new_construction'] + train['loft']

test['L1_price_boost_perks'] = test['elevator'] + (test['cats_allowed'] & test['dogs_allowed']) + test['laundry_in_building']
test['L2_high_end_perks'] = test['hardwood_floors'] + test['doorman'] + test['dishwasher'] + test['fitness_center'] + test['pre-war'] + test['roof_deck'] + test['high_speed_internet']
test['L3_luxury_perks'] = test['swimming_pool'] + test['laundry_in_unit'] + test['terrace'] + test['balcony'] + test['new_construction'] + test['loft']

# Convert "interest_level" to a simple 1/2/3 numerical representation of this 
# ordinal categorical variable, so we can use it as a feature in the linear 
# regression model required here:
train['interest_level'] = train['interest_level'].replace({'low': 1, 'medium': 2, 'high': 3})
test['interest_level'] = test['interest_level'].replace({'low': 1, 'medium': 2, 'high': 3})

# ------------------------------------------------------------------------------

# [?? TO DO all below just for practice with aspects of Pandas/Python -- these 
# are all things I'm not 100% sure how to do, but should know how to do! ??]

# [?? What to do about the warnings below?  I'm getting the same warning x2 
# when I use .loc instead to do the same as the above... ??]

# [?? Luxury:
# contains "luxury" in description (but this would be a 0/1 binary feature... 
# only useful for classification, or can we use with regression too?) ??]

# [?? For the perks feature above, what is a better way to do this?  How would 
# a top tier data scientist frame this problem, if constrained to only using a 
# linear regression model and this starter dataset ??]

# [?? Column selection based on conditions/criteria:  Sort perks into medium/price-boost, high-end/premium perks and luxury 
# perks automatically, by sorting the column names based on conditions: 
# median as 1 or 0, 75% percentile as 1 or 0, 90% quantile as 1 or 0.  
# Not sure how to work with columns this way, only rows!!... --> need to learn ??]

# df_a = train.copy()
# a = df_a.median() == 0
# a = pd.DataFrame(a)
# a = a.reset_index()
# a.columns = ['index', 'criterion']

# b = a[a['criterion'] == True]
# b

# # for i in a:
# #   if a.loc[i, criterion] == True:
# #     print(a[i].index)

# # # pd.DataFrame(data=train, column=list)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
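
On the warning question raised in the to-do notes: pandas emits this message when a column is assigned on a DataFrame that was itself produced by filtering another DataFrame, because it cannot tell whether the assignment is meant to reach the original. Switching to `.loc` on the subset does not help, since the subset is still a derived object. One common remedy is to take an explicit `.copy()` when creating the subset. The sketch below uses toy data and hypothetical names (`toy`, `train_toy`, `test_toy`).

```python
import pandas as pd

toy = pd.DataFrame({
    'bedrooms': [1, 2, 3],
    'bathrooms': [1.0, 1.0, 2.0],
    'month': [4, 5, 6],
})

# .copy() makes each subset an independent DataFrame, so later column
# assignments are unambiguous and do not touch `toy`.
train_toy = toy[toy['month'] <= 5].copy()
test_toy = toy[toy['month'] == 6].copy()

train_toy['bedrooms+bathrooms'] = train_toy['bedrooms'] + train_toy['bathrooms']
test_toy['bedrooms+bathrooms'] = test_toy['bedrooms'] + test_toy['bathrooms']

# The original frame is unchanged.
print('bedrooms+bathrooms' in toy.columns)
```

An alternative is to engineer the features on the full DataFrame first and split afterwards, which also avoids repeating every feature definition for train and test.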

In [0]:
# [?? To Do:  Improve new feature #2 by first weighting the following features 
# by rarity and likely level of demand, rather than just adding up all of the 
# 1's (has/doesn't have x amenity) ??]

# Categorize as:  Price premium features:
# Median and up:
# 'elevator',
# ('cats_allowed' & 'dogs_allowed'),
# ('laundry_in_building' or 'laundry_in_unit'),

# Categorize as:  High-end perks:
# 75th percentile and up:
# 'hardwood_floors',
# 'doorman',
# 'dishwasher',
# 'fitness_center',
# 'roof_deck',
# 'high_speed_internet',
# 'pre-war',

# Categorize as:  Luxury perks:
# Higher percentiles only (value is still 0 at the 75th percentile):
# 'swimming_pool',
# 'laundry_in_unit',  # means it's more likely to be a larger apt. --> higher price
# 'balcony',
# 'terrace',
# 'new_construction',
# 'loft',
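
The column-selection question in the notes (sorting perk columns into tiers automatically) can be approached by aggregating down each column and then filtering the resulting Series, whose index holds the column names. The sketch below is one possible approach on toy data: for a 0/1 column the mean is the share of listings that have the perk, and the 0.5 and 0.25 thresholds are assumptions chosen to roughly mirror "median and up" and "75th percentile and up".

```python
import pandas as pd

# Toy amenity indicators: 70%, 40%, and 10% of listings have each perk.
toy = pd.DataFrame({
    'elevator':      [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    'dishwasher':    [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    'swimming_pool': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})
perk_columns = ['elevator', 'dishwasher', 'swimming_pool']

# Column-wise mean: a Series indexed by column name.
share = toy[perk_columns].mean()

# Boolean masks on that Series select column names instead of rows.
price_boost = share[share >= 0.5].index.tolist()
high_end = share[(share >= 0.25) & (share < 0.5)].index.tolist()
luxury = share[share < 0.25].index.tolist()

# Each tier feature is the row-wise count of perks in that tier.
toy['L1_price_boost_perks'] = toy[price_boost].sum(axis=1)
toy['L2_high_end_perks'] = toy[high_end].sum(axis=1)
toy['L3_luxury_perks'] = toy[luxury].sum(axis=1)
```

Thresholds derived purely from rarity would still benefit from the manual, domain-knowledge adjustments described in the notes.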

In [0]:
# Multiple Linear Regression Model for the above NYC apartment rent data:

# Import model class:
from sklearn.linear_model import LinearRegression

# Instantiate model:
model = LinearRegression()

# Features matrix and target vector:
features = ['bedrooms+bathrooms', 'interest_level', 'L1_price_boost_perks', 'L2_high_end_perks', 'L3_luxury_perks']
target = 'price'

# Fit the model to our training data:
model.fit(train[features], train[target])

# Error on training set:
y_true_train = train[target]
y_pred_train = model.predict(train[features])
print('Training Set: Model Error:')
print(f'MAE: {mean_absolute_error(y_true_train, y_pred_train):.1f}')
mse_train = mean_squared_error(y_true_train, y_pred_train)
print(f'MSE: {mse_train:.1f}')
print(f'RMSE: {sqrt(mse_train):.1f}')
print(f'R^2 score: {r2_score(y_true_train, y_pred_train):.2f}\n')

# Error on new data: our test set:
y_true_test = test[target]
y_pred_test = model.predict(test[features])
print('Test Data (New):  Model Error:')
print(f'MAE: {mean_absolute_error(y_true_test, y_pred_test):.1f}')
mse_test = mean_squared_error(y_true_test, y_pred_test)
print(f'MSE: {mse_test:.1f}')
print(f'RMSE: {sqrt(mse_test):.1f}')
print(f'R^2 score: {r2_score(y_true_test, y_pred_test):.2f}')

Training Set: Model Error:
MAE: 893.6
MSE: 2253903.8
RMSE: 1501.3
R^2 score: 0.48

Test Data (New):  Model Error:
MAE: 900.4
MSE: 2320432.3
RMSE: 1523.3
R^2 score: 0.48
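
The assignment also asks for the model's coefficients and intercept. A fitted scikit-learn `LinearRegression` exposes them as the `coef_` and `intercept_` attributes. The sketch below shows the pattern on toy data constructed so the true values are known; `toy`, `toy_features`, and `toy_model` are hypothetical names, and applying the same two attributes to the notebook's fitted `model` and `features` should work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data built so that price = 1000 + 800 * rooms + 300 * perks exactly.
toy = pd.DataFrame({
    'rooms': [1, 2, 3, 4, 2, 3],
    'perks': [0, 1, 0, 2, 2, 1],
})
toy['price'] = 1000 + 800 * toy['rooms'] + 300 * toy['perks']

toy_features = ['rooms', 'perks']
toy_model = LinearRegression().fit(toy[toy_features], toy['price'])

# intercept_ is a scalar; coef_ has one entry per feature, in column order.
print(f'Intercept: {toy_model.intercept_:.1f}')
coefficients = pd.Series(toy_model.coef_, index=toy_features)
print(coefficients.round(1).to_string())
```

Pairing `coef_` with the feature names in a Series makes each coefficient readable as the estimated change in price per one-unit increase in that feature, holding the others fixed.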
