<a href="https://colab.research.google.com/github/mefrem/DS-Unit-2-Regression-Classification/blob/master/module_2_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
# Looking at data features. It seems like a number of one-hot encoded features, and some categorical
# in addition to string descriptions
# df.describe(exclude='number')
# df.describe()

Gonna create two categories based on GPS coordinates, indicating "is it in Chelsea?" and "is it in Tribeca?". I'll find the central long,lat coordinates at the center of both neighborhoods and take a roughly kilometer square area around that point, indicated by a difference of one-half a hundredth place (.5 * .01) margin around both the long and lat coordinates of that central coordinate.


Tribeca coordinates
40.7163° N, 74.0086° W

Chelsea coordinates
40.7465° N, 74.0014° W


In [0]:
# Creating features 'chelsea' and 'tribeca'

# coordinates from google
chelsea_coordinates = [40.7163,-74.0086]
tribeca_coordinates = [40.7465,-74.0014]

# Rough square-kilometer zone around central Chelsea GPS coordinates
chelsea_w = (df['latitude'] < (chelsea_coordinates[0] + .005))
chelsea_e = (df['latitude'] > (chelsea_coordinates[0] - .005))
chelsea_n = (df['longitude'] < (chelsea_coordinates[1] + .005))
chelsea_s = (df['longitude'] > (chelsea_coordinates[1] - .005))

mask_chelsea = chelsea_w & chelsea_e & chelsea_n & chelsea_s

# Creating empty column, then filling
df['chelsea'] = 0
df.loc[(mask_chelsea), 'chelsea'] = 1

len(df.loc[(mask_chelsea)])

495

In [0]:
# Rough square-kilometer zone around central Tribeca GPS coordinates
tribeca_w = (df['latitude'] < (tribeca_coordinates[0] + .005))
tribeca_e = (df['latitude'] > (tribeca_coordinates[0] - .005))
tribeca_n = (df['longitude'] < (tribeca_coordinates[1] + .005))
tribeca_s = (df['longitude'] > (tribeca_coordinates[1] - .005))

mask_tribeca = tribeca_w & tribeca_e & tribeca_n & tribeca_s

# Creating empty column, then filling
df['tribeca'] = 0
df.loc[(mask_tribeca), 'tribeca'] = 1

# Confirming that no listing is in both neighborhoods
df.loc[(mask_tribeca) & (mask_chelsea)] # returns empty dataframe

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,chelsea,tribeca


In [0]:
# I want to make "interest level" a dummy variable
dummy_interestlevel = pd.get_dummies(df['interest_level'])
df = pd.concat([df, dummy_interestlevel], axis=1)

In [0]:
train = df[df['created'].str.contains('2016-04|2016-05')]
test = df[df['created'].str.contains('2016-06')]
train.head(1)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,chelsea,tribeca,high,low,medium
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
# Let's model, regressing bedrooms and Tribeca on price
model_2vars = LinearRegression()
target = 'price'
features = ['bedrooms','tribeca']
model_2vars.fit(train[features],train[target])
y_pred = model_2vars.predict(test[features])

# Let's measure our error!
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(test['price'],y_pred)
print(f"Our model's error: ${mae:,.0f}")

print(f'In this model, Tribeca-ness accounts for ${round(model_2vars.coef_[1],2)} more a month in rent')
print(f'and the y-intercept is ',model_2vars.intercept_)

Our model's error: $985
In this model, Tribeca-ness accounts for $600.29 more a month in rent
and the y-intercept is  2260.013902648964


In [0]:
from sklearn.metrics import r2_score
r2_score(test['price'],y_pred) 
# Not the best

0.2876697844720504

In [0]:
import plotly.express as px
px.scatter_3d(
    train,
    x='bedrooms', 
    y='tribeca', 
    z='price', 
    text='price', 
    title='Shizz')

In [0]:
# Let's model all the features!

from sklearn.linear_model import LinearRegression

model = LinearRegression()

target = 'price'
features = ['bathrooms',
            'bedrooms',
            'latitude',
            'longitude',
            'elevator',
            'cats_allowed',
            'hardwood_floors',
            'dogs_allowed',
            'doorman',
            'dishwasher',
            'no_fee',
            'laundry_in_building',
            'fitness_center',
            'pre-war',
            'laundry_in_unit',
            'roof_deck',
            'outdoor_space',
            'dining_room',
            'high_speed_internet',
            'balcony',
            'swimming_pool',
            'new_construction',
            'terrace',
            'exclusive',
            'loft',
            'garden_patio',
            'wheelchair_access',
            'common_outdoor_space',
            'chelsea',
            'tribeca',
            'high',
            'low',
            'medium']

# Fitting
model.fit(train[features],train[target])

# Using test data
y_pred = model.predict(test[features])

In [0]:
# Let's measure our error!
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(test['price'],y_pred)
print(f"Our model's error: ${mae:,.0f}")

Our model's error: $674


In [0]:
from sklearn.metrics import r2_score
r2_score(test['price'],y_pred) 
# Better!

0.6495894017749219