Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
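
The date-based split in the first checklist item can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame (the rows, the `toy` name, and the `*_toy` variables are assumptions, not part of the assignment data); only the `created` and `price` column names come from the renthop dataset.

```python
import pandas as pd

# Toy stand-in for the renthop data; these rows are invented for illustration.
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-17 03:26:41',
        '2016-05-31 23:59:59',
        '2016-06-01 03:11:01',
        '2016-06-24 07:54:24',
    ]),
    'bedrooms': [1, 2, 1, 3],
    'price': [2850, 3100, 3050, 3000],
})

# Listings created before June 1 (April & May) train the model;
# listings created on or after June 1 are held out for testing.
cutoff = pd.Timestamp('2016-06-01')
train = toy[toy['created'] < cutoff]
test = toy[toy['created'] >= cutoff]

# Separate features from the target after the row split so they stay aligned.
target = 'price'
X_train_toy = train.drop(columns=[target, 'created'])
y_train_toy = train[target]
X_test_toy = test.drop(columns=[target, 'created'])
y_test_toy = test[target]

print(len(train), len(test))
```

Splitting rows first and only then separating the target keeps `X` and `y` aligned without any index matching.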


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
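
A few of the ideas above can be sketched with pandas. This is a rough illustration on a made-up toy DataFrame, not a reference solution: the rows and the new column names (`description_length`, `has_description`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`, `perk_count`) are assumptions, and `perk_cols` lists only a subset of the dataset's amenity columns.

```python
import pandas as pd

# Toy stand-in for the renthop data; these rows are invented for illustration.
toy = pd.DataFrame({
    'description': ['Sunny pre-war one bedroom', None, '   '],
    'bedrooms': [1, 2, 0],
    'bathrooms': [1.0, 1.5, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Description features: missing or whitespace-only text counts as no description.
desc = toy['description'].fillna('').str.strip()
toy['description_length'] = desc.str.len()
toy['has_description'] = (toy['description_length'] > 0).astype(int)

# Pet features: "or" and "and" are different questions.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

# Perk count: row-wise sum of the binary amenity columns (subset shown here).
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

print(toy[['has_description', 'cats_or_dogs', 'cats_and_dogs',
           'total_rooms', 'perk_count']])
```

The beds-to-baths ratio is left out of the sketch because some listings may have zero bathrooms, so that feature needs a decision about how to handle division by zero.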

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
df.head()

| | bathrooms | bedrooms | created | description | display_address | latitude | longitude | price | street_address | interest_level | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 3 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | 40.7145 | -73.9425 | 3000 | 792 Metropolitan Avenue | medium | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2 | 2016-06-12 12:19:27 | | Columbus Avenue | 40.7947 | -73.9667 | 5465 | 808 Columbus Avenue | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 1 | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | 40.7388 | -74.0018 | 2850 | 241 W 13 Street | high | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1 | 2016-04-18 02:22:02 | Building Amenities - Garage - Garden - fitness... | East 49th Street | 40.7539 | -73.9677 | 3275 | 333 East 49th Street | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 4 | 2016-04-28 01:32:41 | Beautifully renovated 3 bedroom flex 4 bedroom... | West 143rd Street | 40.8241 | -73.9493 | 3350 | 500 West 143rd Street | low | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# Parse the 'created' strings into datetimes so listings can be compared by date
df['created'] = pd.to_datetime(df['created'])

In [5]:
df['created'].iloc[0]

Timestamp('2016-06-24 07:54:24')

In [6]:
for col in df:
    print(col)

bathrooms
bedrooms
created
description
display_address
latitude
longitude
price
street_address
interest_level
elevator
cats_allowed
hardwood_floors
dogs_allowed
doorman
dishwasher
no_fee
laundry_in_building
fitness_center
pre-war
laundry_in_unit
roof_deck
outdoor_space
dining_room
high_speed_internet
balcony
swimming_pool
new_construction
terrace
exclusive
loft
garden_patio
wheelchair_access
common_outdoor_space


In [7]:
df.describe(exclude="number")

| | created | description | display_address | street_address | interest_level |
|---|---|---|---|---|---|
| count | 48817 | 47392 | 48684 | 48807 | 48817 |
| unique | 48148 | 37853 | 8674 | 15135 | 3 |
| top | 2016-05-14 01:11:03 | | Broadway | 3333 Broadway | low |
| freq | 3 | 1627 | 435 | 174 | 33946 |
| first | 2016-04-01 22:12:41 | | | | |
| last | 2016-06-29 21:41:47 | | | | |


In [8]:
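# Encode interest_level as an ordinal number (low < medium < high).
# Mapping the single column avoids touching matching strings elsewhere in df.
df['interest_level'] = df['interest_level'].map({'low': 1, 'medium': 2, 'high': 3})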

In [9]:
df = df.drop(columns=['description', 'display_address', 'street_address'])

In [10]:
df.head()

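| | bathrooms | bedrooms | created | latitude | longitude | price | interest_level | elevator | cats_allowed | hardwood_floors | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 3 | 2016-06-24 07:54:24 | 40.7145 | -73.9425 | 3000 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2 | 2016-06-12 12:19:27 | 40.7947 | -73.9667 | 5465 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 1 | 2016-04-17 03:26:41 | 40.7388 | -74.0018 | 2850 | 3 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1 | 2016-04-18 02:22:02 | 40.7539 | -73.9677 | 3275 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 4 | 2016-04-28 01:32:41 | 40.8241 | -73.9493 | 3350 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |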


In [11]:
y = df['price']
X = df.drop(columns=['price'])

In [12]:
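# NOTE: as written, June 2016 becomes the training set and April & May 2016
# the test set, which is the reverse of the split the assignment asks for.
# The outputs below reflect this split.
X_train = X[X['created'] > '2016-06-01 00:00:01']
X_test = X[X['created'] <= '2016-06-01 00:00:01']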

In [13]:
X_train.head()

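| | bathrooms | bedrooms | created | latitude | longitude | interest_level | elevator | cats_allowed | hardwood_floors | dogs_allowed | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 3 | 2016-06-24 07:54:24 | 40.7145 | -73.9425 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2 | 2016-06-12 12:19:27 | 40.7947 | -73.9667 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 1.0 | 1 | 2016-06-03 03:21:22 | 40.8448 | -73.9396 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 1.0 | 1 | 2016-06-01 03:11:01 | 40.7584 | -73.9648 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 2.0 | 4 | 2016-06-07 04:39:56 | 40.7391 | -73.9936 | 2 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |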


In [14]:
# Select the target values for the same rows as each feature set
y_train = y.loc[X_train.index]
y_test = y.loc[X_test.index]

In [15]:
y_train.head()

0     3000
1     5465
11    1675
14    3050
24    7400
Name: price, dtype: int64

In [16]:
# 'created' was only needed for the date split; drop it before fitting.
# y_train and y_test are Series holding only 'price', so they need no drop.
X_train = X_train.drop(columns=['created'])
X_test = X_test.drop(columns=['created'])

In [17]:
print(type(X_train))
print(type(y_train))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


***With the data cleaned and split above, the cells below fit the model.***

In [18]:
y_train.mean()

3587.0546750721733

In [19]:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()

In [20]:
LR.fit(X_train, y_train)

LinearRegression()

In [21]:
from sklearn.metrics import mean_absolute_error

print('Train MAE:', mean_absolute_error(y_train, LR.predict(X_train)))
print('Test MAE:', mean_absolute_error(y_test, LR.predict(X_test)))

Train MAE: 680.6647225999974
Test MAE: 683.1040903789151


**Compared with yesterday's project, this model's MAE is significantly lower. The cells below get the coefficients, the intercept, and the mean squared error.**

In [22]:
type(LR.coef_)

numpy.ndarray

In [23]:
for col in X_train:
    print(col)

bathrooms
bedrooms
latitude
longitude
interest_level
elevator
cats_allowed
hardwood_floors
dogs_allowed
doorman
dishwasher
no_fee
laundry_in_building
fitness_center
pre-war
laundry_in_unit
roof_deck
outdoor_space
dining_room
high_speed_internet
balcony
swimming_pool
new_construction
terrace
exclusive
loft
garden_patio
wheelchair_access
common_outdoor_space


In [24]:
# Take the feature names straight from the training columns so they
# always match the order of LR.coef_
features = X_train.columns.tolist()

In [25]:
coefs = pd.Series(LR.coef_, features)

In [26]:
coefs

bathrooms                1765.377513
bedrooms                  494.394379
latitude                 1350.237997
longitude              -12981.416522
interest_level           -433.630944
elevator                  126.694839
cats_allowed              -33.801244
hardwood_floors          -223.873271
dogs_allowed               79.493667
doorman                   459.631371
dishwasher                 47.213051
no_fee                    -91.978257
laundry_in_building      -135.633241
fitness_center            132.455704
pre-war                   -71.606310
laundry_in_unit           364.218885
roof_deck                -179.689650
outdoor_space            -123.554496
dining_room               258.128660
high_speed_internet      -281.518594
balcony                    89.022559
swimming_pool              99.972143
new_construction         -173.193017
terrace                    73.839328
exclusive                  94.127095
loft                      228.774590
garden_patio              245.720307
w

In [31]:
print('Intercept:', LR.intercept_)

Intercept: -1014223.6050025919


In [34]:
y_pred = LR.predict(X_test)

In [35]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
mse

1136925.036141478

In [36]:
import math

math.sqrt(mse)

1066.2668691005447
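
The assignment also asks for $R^2$ alongside MAE and RMSE, for both the train and test data. As a rough sketch of what the three metrics compute, here they are in plain Python on made-up numbers. The helper name `regression_metrics` and the toy values are assumptions for illustration; in the notebook itself, `sklearn.metrics.r2_score` can be used together with the `mean_absolute_error` and `mean_squared_error` calls above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences of numbers.

    Hypothetical helper for illustration. Assumes y_true is not constant,
    otherwise the R^2 denominator would be zero.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    ss_res = sum(e ** 2 for e in errors)    # residual sum of squares
    rmse = math.sqrt(ss_res / n)            # root of the mean squared error

    # R^2 compares the model's squared error with that of always
    # predicting the mean of y_true.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {'MAE': mae, 'RMSE': rmse, 'R2': r2}

# Made-up prices and predictions, for illustration only.
y_true_toy = [3000, 5465, 2850, 3275]
y_pred_toy = [3200, 5000, 2900, 3300]
metrics = regression_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Calling the same helper once with the training targets and predictions and once with the test ones gives the train and test versions of each metric.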