<a href="https://colab.research.google.com/github/AmyBeisel/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/212_assignment_AMY_BEISEL_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
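
A few of the ideas above can be sketched in pandas. This is only an illustration: the rows in `toy` are made up, and only the column names (`description`, `cats_allowed`, `dogs_allowed`, `bedrooms`, `bathrooms`) come from the dataset. Treating whitespace-only descriptions as missing, and turning zero bathrooms into `NaN` for the ratio, are assumptions rather than requirements of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names match the renthop dataset.
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 0.0],
})

# Assumption: missing or whitespace-only descriptions count as "no description".
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Pets: either one allowed vs. both allowed.
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)

# Ratio of beds to baths; assumption: zero bathrooms gives NaN instead of inf.
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)
```

Any `NaN` values produced this way would need to be filled or dropped before fitting a scikit-learn linear model.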

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
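
The percentile filter above could also be wrapped in a small helper. The function name `trim_percentiles` and the toy prices are hypothetical. Note two differences from the cell above: the helper is inclusive on both ends (the cell uses `<` for the upper latitude bound), and calling it once per column computes each percentile on already-filtered rows, whereas the cell computes every percentile on the full DataFrame.

```python
import pandas as pd

def trim_percentiles(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies between the two percentiles (inclusive)."""
    low = frame[column].quantile(lower_pct / 100)
    high = frame[column].quantile(upper_pct / 100)
    return frame[frame[column].between(low, high)]

# Hypothetical prices: 1..99 plus one extreme listing.
toy = pd.DataFrame({'price': list(range(1, 100)) + [10_000]})
trimmed = trim_percentiles(toy, 'price', 0.5, 99.5)
print(len(toy), len(trimmed))
```

On this toy data the lowest and the highest price fall outside the 0.5th and 99.5th percentiles, so both are dropped.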

In [0]:
#take a look at the data
print(df.shape)
df.sample(10)

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
39742,2.0,3,2016-06-23 18:37:24,"3 Bedrooms with 2 Full Baths, a Washer & Dryer...",84th,40.7768,-73.9537,6500,228 E 84th,medium,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
38342,1.0,1,2016-05-05 06:49:26,Spacious studio convertible into a 1 bedroom! ...,East 82nd Street,40.7753,-73.954,2850,240 East 82nd Street,low,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
14745,1.0,2,2016-06-26 02:58:54,Spacious and newly Renovated 2 bedroom Feature...,West 107th Street,40.8012,-73.966,3000,210 West 107th Street,low,0,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
20431,1.5,1,2016-06-10 05:59:01,<p><a website_redacted,Thompson Street,40.724,-74.0035,8895,55 Thompson Street,low,1,1,1,1,1,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
39999,1.0,1,2016-05-06 01:22:38,West 57th Street gem: one bedroom pre-war buil...,West 57th Street,40.7682,-73.987,2400,424 West 57th Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22854,1.0,1,2016-06-11 05:58:26,Newly renovated 1 bedroom in the beautiful Upp...,Broadway,40.7878,-73.9763,2750,2350 Broadway,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12154,1.0,2,2016-06-13 14:54:09,What a great location to be right by Central P...,West 112 St.,40.8002,-73.9539,2500,140 West 112 St.,low,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
21372,1.0,1,2016-06-07 02:27:38,Spacious 1 bedroom 1 Bathroom in heart of Asto...,36th St,40.7674,-73.9151,2200,25-65 36th St,low,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
39888,1.0,0,2016-05-26 01:36:26,New to market. Spacious studio in Midtown East...,East 58th Street,40.7585,-73.9618,2100,414 East 58th Street,low,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
41382,1.0,1,2016-05-22 05:33:25,*DON'T BE FOOLED by brokers promising NO FEE d...,East 102nd Street,40.7897,-73.9488,2350,120 East 102nd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
df.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
month         

### Train/test Split

In [0]:
# Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
print(df['created'].dtype)
df[['created']]

object


| | created |
|---|---|
| 0 | 2016-06-24 07:54:24 |
| 1 | 2016-06-12 12:19:27 |
| 2 | 2016-04-17 03:26:41 |
| 3 | 2016-04-18 02:22:02 |
| 4 | 2016-04-28 01:32:41 |
| ... | ... |
| 49347 | 2016-06-02 05:41:05 |
| 49348 | 2016-04-04 18:22:34 |
| 49349 | 2016-04-16 02:13:40 |
| 49350 | 2016-04-08 02:13:33 |


In [0]:
# Convert the date strings to datetime format
df['created'] = pd.to_datetime(df['created'])
df['created'].dtypes

dtype('<M8[ns]')

In [0]:
#check what years are in the dataset
pd.unique(df['created'].dt.year)

array([2016])

In [0]:
#We only have one year 2016, so will only extract the month.
df['month']=df['created'].dt.month
df[['created','month']].sample(10)

| | created | month |
|---|---|---|
| 3134 | 2016-04-20 01:22:46 | 4 |
| 40688 | 2016-05-29 03:02:34 | 5 |
| 8209 | 2016-04-22 04:43:04 | 4 |
| 24174 | 2016-06-11 02:20:27 | 6 |
| 4925 | 2016-04-06 03:34:27 | 4 |
| 29624 | 2016-05-05 04:30:47 | 5 |
| 21117 | 2016-06-18 02:24:28 | 6 |
| 44354 | 2016-04-27 02:49:15 | 4 |
| 30488 | 2016-05-19 02:49:32 | 5 |
| 24277 | 2016-05-12 03:45:52 | 5 |


In [0]:
#now I have a new column with only month
print(df.shape)
df.columns

(48817, 35)


Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'month'],
      dtype='object')

In [0]:
#get the unique values for month
#df['month'].value_counts()
pd.unique(df['month'])

array([6, 4, 5])

In [0]:
# Use the data from April and May to train
# Use the data from June to test
train = df.query('month <= 5')
test = df.query('month == 6')

In [0]:
#what are the shapes
train.shape, test.shape

((31844, 37), (16973, 37))
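
An alternative to the `month` column is to split directly on a timestamp cutoff, which would still behave correctly if the data ever spanned more than one year. This is a sketch on made-up rows; the names `toy`, `train_toy`, `test_toy`, and `cutoff` are hypothetical.

```python
import pandas as pd

# Hypothetical listings around the May/June boundary.
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-17 03:26:41', '2016-05-31 23:59:59',
                               '2016-06-01 00:00:00', '2016-06-24 07:54:24']),
    'price': [2500, 3100, 2800, 4000],
})

# Everything before June 1 goes to train, everything from June 1 on goes to test.
cutoff = pd.Timestamp('2016-06-01')
train_toy = toy[toy['created'] < cutoff]
test_toy = toy[toy['created'] >= cutoff]
print(train_toy.shape, test_toy.shape)
```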

### Engineer new features

In [0]:
# Engineer at least two new features
# First feature: how many total perks does each apartment have?
# The value of each perk is 1 or 0, so we can sum the perk columns for each apartment.
# iloc[StartRow:EndRow, StartCol:EndCol]; columns 10 through 33 are the perk columns
# (axis=1 also works in place of axis='columns').
df['perks'] = df.iloc[:, 10:34].sum(axis='columns')
df['perks']

0        0
1        5
2        3
3        2
4        1
        ..
49347    5
49348    9
49349    5
49350    5
49351    1
Name: perks, Length: 48817, dtype: int64
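
Positional slicing with `iloc` depends on the exact column order, so a label-based slice may be a more robust way to select the perk columns. The sketch below uses a made-up `toy` frame with only five of the perk columns; the column names themselves are from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the perk column names come from the dataset.
toy = pd.DataFrame({
    'price': [3000, 2500],
    'elevator': [1, 0],
    'cats_allowed': [1, 0],
    'hardwood_floors': [0, 1],
    'dogs_allowed': [1, 0],
    'doorman': [0, 1],
    'month': [4, 6],
})

# Label-based slice: both endpoints are included, and 'month' stays out of the sum.
toy['perks'] = toy.loc[:, 'elevator':'doorman'].sum(axis='columns')
print(toy['perks'].tolist())
```

On the full dataset the corresponding slice would be `df.loc[:, 'elevator':'common_outdoor_space']`, which, given the column order printed by `df.columns`, covers the same 24 columns as `iloc[:, 10:34]`.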

In [0]:
#I have a new column with total number of perks for each apartment.
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'month',
       'perks'],
      dtype='object')

In [0]:
df[['perks']]

| | perks |
|---|---|
| 0 | 0 |
| 1 | 5 |
| 2 | 3 |
| 3 | 2 |
| 4 | 1 |
| ... | ... |
| 49347 | 5 |
| 49348 | 9 |
| 49349 | 5 |
| 49350 | 5 |


In [0]:
#second feature = Total number of rooms (beds + baths)
#create a new column of total number of rooms
df['rooms']=df['bathrooms'] + df['bedrooms']
df['rooms']


0        4.5
1        3.0
2        2.0
3        2.0
4        5.0
        ... 
49347    3.0
49348    2.0
49349    2.0
49350    1.0
49351    3.0
Name: rooms, Length: 48817, dtype: float64

In [0]:
#check the columns
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'month',
       'perks', 'rooms'],
      dtype='object')

In [0]:
df[['perks', 'rooms']]


| | perks | rooms |
|---|---|---|
| 0 | 0 | 4.5 |
| 1 | 5 | 3.0 |
| 2 | 3 | 2.0 |
| 3 | 2 | 2.0 |
| 4 | 1 | 5.0 |
| ... | ... | ... |
| 49347 | 5 | 3.0 |
| 49348 | 9 | 2.0 |
| 49349 | 5 | 2.0 |
| 49350 | 5 | 1.0 |


### Fit a linear regression model with at least two features.

In [0]:
# 5-step process
# 1. Import the appropriate estimator class from scikit-learn
from sklearn.linear_model import LinearRegression

In [0]:
#2. instantiate this class
model = LinearRegression()

In [0]:
# 3. Arrange the X features matrix
# Because I split train/test before I added the new features, I re-run the split here
# so that train and test have the new columns 'perks' and 'rooms'.
train = df.query('month <= 5')
test = df.query('month == 6')

features = ['perks', 'rooms']
X_train = train[features]
X_test = test[features]

# y target vector
target = 'price'
y_train = train[target]
y_test = test[target]
print(f'Linear Regression, dependent on: {features}')
print(X_train.shape, X_test.shape)

Linear Regression, dependent on: ['perks', 'rooms']
(31844, 2) (16973, 2)


In [0]:
# 4. fit the model
model.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
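
scikit-learn documents `LinearRegression` as ordinary least squares. As a rough sketch of what the fit computes, the same kind of solution can be obtained with NumPy's `np.linalg.lstsq` after adding a column of ones for the intercept. The data here is made up and generated from a known equation with no noise, so the recovered parameters can be checked by eye; it is not the notebook's data.

```python
import numpy as np

# Hypothetical data generated from price = 1000 + 100*perks + 750*rooms (no noise).
X = np.array([[0, 1.0], [2, 2.0], [5, 3.0], [1, 4.5], [3, 2.0]])
y = 1000 + 100 * X[:, 0] + 750 * X[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept.
A = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = params[0], params[1:]
print(intercept, coef)
```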

In [0]:
# 5. Train error (MAE is in the same units as price: dollars)
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} dollars')

Train Error: 856.62 dollars


In [0]:
# Also 5. Test error (in dollars)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f} dollars')

Test Error: 868.00 dollars


### Get the model's coefficients and intercept.

In [0]:
model.intercept_, model.coef_

(998.1946341143803, array([101.76435155, 767.95662272]))

In [0]:
#equation for the model
beta0 = model.intercept_
beta1, beta2 = model.coef_
print(f'y = {beta0} + {beta1}x1 + {beta2}x2')


y = 998.1946341143803 + 101.76435154698721x1 + 767.956622719957x2
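
To make the equation concrete, here is a small worked example that plugs one apartment into it. The coefficients are copied from the output above; the apartment itself (5 perks, 2 bedrooms plus 1 bathroom) and the helper name `predict_price` are made up for illustration.

```python
# Coefficients copied from the fitted model's output above.
intercept = 998.1946341143803
coef_perks = 101.76435154698721
coef_rooms = 767.956622719957

def predict_price(perks, rooms):
    """Apply the fitted linear equation to one apartment."""
    return intercept + coef_perks * perks + coef_rooms * rooms

# Hypothetical apartment: 5 perks, 2 bedrooms + 1 bathroom = 3 rooms.
example = predict_price(perks=5, rooms=3)
print(f'{example:.2f}')
```

Under this model the apartment is predicted to rent for roughly 3,811 dollars. Read the coefficients the same way: holding the other feature fixed, each additional perk adds about 102 dollars to the prediction and each additional room adds about 768 dollars.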


In [0]:
# Model's coefficients and intercept
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept 998.1946341143803
perks    101.764352
rooms    767.956623


### Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

In [0]:
# Regression metrics for the train data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error
mae = mean_absolute_error(y_train, y_pred)  # mean absolute error
r2 = r2_score(y_train, y_pred)

print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('Mean Absolute Error:', mae)
print('R^2:', r2)

Mean Squared Error: 1678877.2727735792
Root Mean Squared Error: 1295.7149658677172
Mean Absolute Error: 856.6168448694106
R^2: 0.45930369872329946


In [0]:
# Regression metrics for the test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error
mae = mean_absolute_error(y_test, y_pred)  # mean absolute error
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('Mean Absolute Error:', mae)
print('R^2:', r2)

Mean Squared Error: 1672432.715272962
Root Mean Squared Error: 1293.2257015977382
Mean Absolute Error: 867.9967359241992
R^2: 0.46189791795456303
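
To see what these numbers measure, here is a plain-Python sketch of the same metrics using their standard definitions. The helper name `regression_metrics` and the toy rents are made up; scikit-learn's functions remain the ones used for the results above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n       # average absolute error
    mse = sum(e * e for e in errors) / n        # average squared error
    rmse = math.sqrt(mse)                       # back in the target's units
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)         # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2}

# Hypothetical rents and predictions.
y_true = [2000, 3000, 4000, 5000]
y_pred = [2500, 3000, 3500, 5000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is never smaller than MAE, and the gap grows when a few errors are much larger than the rest. In the results above, RMSE (about 1,293 on test) is well above MAE (about 868), which suggests that some listings are predicted far worse than the typical one.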
