<a href="https://colab.research.google.com/github/jasimrashid/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/module2-regression-2/LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
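
As a reference for the metrics named in the checklist, here is a minimal sketch that computes MAE, RMSE, and $R^2$ from their textbook definitions in plain Python. The helper name `regression_metrics` and the rent values are made up for illustration; in the notebook itself, scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` do the same job.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mae, rmse, r2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average size of the errors, in the target's units
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 minus (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return mae, rmse, r2

# Hypothetical rents and predictions, for illustration only
y_true = [2000, 3000, 4000, 5000]
y_pred = [2500, 3000, 3500, 5000]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f'MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.2f}')
```

RMSE is never smaller than MAE on the same predictions, because squaring weights large errors more heavily.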


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
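
Several of the ideas above can be sketched with a few pandas expressions. The example below runs on a tiny made-up DataFrame (`toy`) whose column names follow the renthop data; the new column names (`has_description`, `perk_count`, and so on) and the choice of which columns count as perks are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the renthop data
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', None, '   '],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 0.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [1, 0, 0],
})

# Treat missing or whitespace-only descriptions as "no description"
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Count the 0/1 perk columns (extend this list with the other amenity columns)
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Pets: either allowed vs. both allowed
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Room features; zero bathrooms would divide by zero, so map those to NaN
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy[['has_description', 'perk_count', 'cats_or_dogs', 'total_rooms']])
```

`LinearRegression` does not accept missing values, so a ratio column like `beds_per_bath` would need its NaNs filled or dropped before fitting.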

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# import pandas as pd
# from google.colab import files
# uploaded = files.upload()

# neighborhoods = pd.read_csv('neighborhoods-coordinates.csv',sep=',')
# neighborhoods.head()

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [4]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

**1.** Train/test split

In [88]:
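# Re-run this cell after engineering features on df, so that train and test
# include the new column (that is why the shapes show 35 columns, not 34).
train = df[df['created'].str.contains('2016-05|2016-04')]
test = df[df['created'].str.contains('2016-06')]
train.shape, test.shape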

((31844, 35), (16973, 35))
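
An alternative to matching substrings in `created` is to parse the column as datetimes and compare against a cutoff date. The sketch below uses a small made-up frame (`toy`, `train_toy`, and `test_toy` are illustrative names) and assumes the data spans only April through June 2016, as the assignment describes.

```python
import pandas as pd

# Made-up listings with timestamps in the same style as the 'created' column
toy = pd.DataFrame({
    'created': ['2016-04-03 10:00:00', '2016-05-20 08:30:00',
                '2016-06-01 00:00:00', '2016-06-29 23:59:59'],
    'price': [3000, 2500, 4100, 3300],
})

# Parse once, then split on a single cutoff date
created = pd.to_datetime(toy['created'])
cutoff = pd.Timestamp('2016-06-01')

train_toy = toy[created < cutoff]    # April and May
test_toy = toy[created >= cutoff]    # June

print(train_toy.shape, test_toy.shape)
```

Comparing dates avoids accidental matches elsewhere in the string and makes the cutoff explicit.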

In [0]:
df.head()

**2.** Engineer at least two features

In [124]:
# yuppie magnet: gym-tan-laundry + doorman for amazon delivery
df['yuppie_magnet'] = df['fitness_center'] + df['doorman'] + df['roof_deck'] + (df['laundry_in_building']|df['laundry_in_unit'])
df['yuppie_magnet']

0        0
1        2
2        1
3        0
4        0
        ..
49347    1
49348    3
49349    1
49350    1
49351    0
Name: yuppie_magnet, Length: 48817, dtype: int64

**3.** Fit a linear regression model with 2+ features

In [90]:
from sklearn.metrics import mean_absolute_error

# Baseline: always predict the mean training price
y_train = train['price']
guess = y_train.mean()

# Train error of the baseline
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f}')

Train Error: 1201.88
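
The baseline above is scored on the training data only. A minimal sketch of the same idea scored on both splits follows, using made-up prices and plain Python; the helper name `mae_of_constant` is hypothetical. The key point is that the guess comes from the training data and is then reused unchanged on the test data.

```python
from statistics import mean

# Made-up prices, for illustration only
train_prices = [2000, 3000, 4000]
test_prices = [2500, 5000]

# The baseline guess is learned from the training split only
guess = mean(train_prices)

def mae_of_constant(y_true, constant):
    """Mean absolute error when every prediction equals `constant`."""
    return mean([abs(y - constant) for y in y_true])

train_mae = mae_of_constant(train_prices, guess)
test_mae = mae_of_constant(test_prices, guess)
print(f'Baseline train MAE: {train_mae:.2f}')
print(f'Baseline test MAE: {test_mae:.2f}')
```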


In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def linear_reg(features):
  """Fit a linear regression on `features` and print train/test MAE, RMSE, and R2."""
  target = 'price'
  y_train = train[target]
  y_test = test[target]
  X_train = train[features]
  X_test = test[features]

  model = LinearRegression()
  model.fit(X_train, y_train)

  # Metrics on the training data
  y_pred = model.predict(X_train)
  mae = mean_absolute_error(y_train, y_pred)
  rmse = np.sqrt(mean_squared_error(y_train, y_pred))
  r2 = r2_score(y_train, y_pred)
  print(f'Linear Regression, dependent on: {features}')
  print(f'Train MAE: {mae:.2f}')
  print(f'Train RMSE: {rmse:.2f}')
  print(f'Train R2: {r2:.2f}')

  # Metrics on the held-out test data
  y_pred = model.predict(X_test)
  mae = mean_absolute_error(y_test, y_pred)
  rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  r2 = r2_score(y_test, y_pred)
  print()
  print(f'Test MAE: {mae:.2f}')
  print(f'Test RMSE: {rmse:.2f}')
  print(f'Test R2: {r2:.2f}')


In [92]:
features = ['laundry_in_unit', 'bedrooms']
linear_reg(features)


Linear Regression, dependent on: ['laundry_in_unit', 'bedrooms']
Train MAE: 955.28
Train RMSE: 1445.08
Train R2: 0.33

Test MAE: 976.67
Test RMSE: 1457.59
Test R2: 0.32


In [63]:
features = ['bedrooms']
linear_reg(features)

Linear Regression, dependent on: ['bedrooms']
Train MAE: 969.88
Train RMSE: 1487.04
Train R2: 0.29

Test MAE: 988.73
Test RMSE: 1491.01
Test R2: 0.28


In [0]:
# features = ['bedrooms','laundry_in_unit','pre-war','fitness_center']
# linear_reg(features)

In [97]:
features = ['bedrooms','yuppie_magnet','exclusive']
linear_reg(features)

Linear Regression, dependent on: ['bedrooms', 'yuppie_magnet', 'exclusive']
Train MAE: 909.27
Train RMSE: 1411.24
Train R2: 0.36

Test MAE: 919.07
Test RMSE: 1409.89
Test R2: 0.36


In [78]:
features = ['bedrooms','exclusive','bathrooms']
linear_reg(features)

Linear Regression, dependent on: ['bedrooms', 'exclusive', 'bathrooms']
Train MAE: 818.44
Train RMSE: 1231.96
Train R2: 0.51

Test MAE: 825.86
Test RMSE: 1219.57
Test R2: 0.52


In [121]:
features = ['bedrooms','laundry_in_unit','fitness_center','dogs_allowed','longitude','exclusive','bathrooms']
linear_reg(features)

Linear Regression, dependent on: ['bedrooms', 'laundry_in_unit', 'fitness_center', 'dogs_allowed', 'longitude', 'exclusive', 'bathrooms']
Train MAE: 720.91
Train RMSE: 1128.25
Train R2: 0.59

Test MAE: 728.83
Test RMSE: 1118.66
Test R2: 0.60
[   424.70018398    421.18903612    283.44113374     74.38386805
 -13680.9966226     134.74115761   1899.48916997]


**4.** Model's coefficients and intercept

In [119]:
from sklearn.linear_model import LinearRegression

features = ['bedrooms','laundry_in_unit','fitness_center','dogs_allowed','longitude','exclusive','bathrooms']
target = 'price'
X_train = train[features]
y_train = train[target]

# linear_reg() keeps its model local, so fit a fresh one here
model = LinearRegression()
model.fit(X_train, y_train)
print('Intercept: ', model.intercept_)

# Label each coefficient with its feature name
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept:  -1011571.2057281791
bedrooms             424.700184
laundry_in_unit      421.189036
fitness_center       283.441134
dogs_allowed          74.383868
longitude         -13680.996623
exclusive            134.741158
bathrooms           1899.489170
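
To see how the intercept and coefficients combine into a prediction, the sketch below plugs the values printed above into the linear formula for one hypothetical listing (the `listing` values are made up for illustration). It also shows why the intercept is so large and negative: longitude in NYC is about -74, so the longitude term alone contributes roughly +1,012,000, and the intercept offsets it.

```python
# Intercept and coefficients copied from the model output above
intercept = -1011571.2057281791
coefficients = {
    'bedrooms': 424.700184,
    'laundry_in_unit': 421.189036,
    'fitness_center': 283.441134,
    'dogs_allowed': 74.383868,
    'longitude': -13680.996623,
    'exclusive': 134.741158,
    'bathrooms': 1899.489170,
}

# Hypothetical listing: 2 bed, 1 bath, no listed amenities
listing = {
    'bedrooms': 2,
    'laundry_in_unit': 0,
    'fitness_center': 0,
    'dogs_allowed': 0,
    'longitude': -73.98,
    'exclusive': 0,
    'bathrooms': 1,
}

# Prediction = intercept + sum of (coefficient * feature value)
longitude_term = coefficients['longitude'] * listing['longitude']
predicted = intercept + sum(coefficients[name] * listing[name] for name in coefficients)

print(f'Longitude term: {longitude_term:,.2f}')
print(f'Predicted rent: ${predicted:,.2f}')
```

Each coefficient reads as the change in predicted rent for a one-unit change in that feature with the others held fixed; for example, one more bathroom adds about $1,899 under this model.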


**5.** Best MAE

The best test MAE among the runs above is 728.83 (train MAE 720.91), using these features:

In [0]:
# features = ['bedrooms','laundry_in_unit','fitness_center','dogs_allowed','longitude','exclusive','bathrooms']
# Train MAE = 720.91, Test MAE = 728.83