
Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [x] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
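
A few of the ideas above could be sketched roughly as follows. This is only an illustration: `add_features`, `PERK_COLUMNS`, and the tiny `sample` frame are hypothetical names made up for the example, and the perk list is a small subset of the binary columns in the renthop data. Stripping whitespace before measuring the description is also an assumption, not something the assignment requires.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the binary perk columns in the renthop data
PERK_COLUMNS = ['elevator', 'cats_allowed', 'dogs_allowed', 'doorman']

def add_features(df):
    """Return a copy of df with a few engineered columns added."""
    out = df.copy()

    # Treat missing or whitespace-only descriptions as empty
    description = out['description'].fillna('').str.strip()
    out['has_description'] = (description != '').astype(int)
    out['description_length'] = description.str.len()

    # Total number of perks per listing
    out['perk_count'] = out[PERK_COLUMNS].sum(axis=1)

    # Pet policy: either pet allowed, and both pets allowed
    cats = out['cats_allowed'] == 1
    dogs = out['dogs_allowed'] == 1
    out['cats_or_dogs'] = (cats | dogs).astype(int)
    out['cats_and_dogs'] = (cats & dogs).astype(int)

    # Room totals and ratio; 0 bathrooms becomes NaN to avoid dividing by zero
    out['total_rooms'] = out['bedrooms'] + out['bathrooms']
    out['bed_bath_ratio'] = out['bedrooms'] / out['bathrooms'].replace(0, np.nan)
    return out

# Made-up rows, only to show the function running
sample = pd.DataFrame({
    'description': ['Sunny two bedroom', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 0.0],
    'elevator': [1, 0, 0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'doorman': [0, 0, 0],
})
featured = add_features(sample)
print(featured[['has_description', 'perk_count', 'cats_or_dogs', 'total_rooms']])
```

Returning a copy keeps the original frame untouched, so the same function can be applied to the train and test sets separately.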

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [x] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [x] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [132]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

""" START ASSIGNMENT HERE """

' START ASSIGNMENT HERE '

In [133]:
""" Engineering two features """

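# Create a feature for the string length of 'description'.
# Missing descriptions become a single space, so their length is 1.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
df['description'] = df['description'].replace({np.nan: ' '})
df['description_length'] = df['description'].str.len()

# No NaN lengths remain after the fill above; this replace is only a safeguard
df['description_length'] = df['description_length'].replace({np.nan: 0})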
df['description_length'].head(5)

0    588
1      8
2    691
3    492
4    479
Name: description_length, dtype: int64

In [134]:
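# Create a feature for ratio of beds to baths ('bedrooms' / 'bathrooms').
# `train` and `test` come from the train/test split cell further down, so run that cell first.
# Listings with 0 bathrooms yield inf (or NaN when bedrooms is also 0).
train['bed_bath_ratio'] = train['bedrooms'] / train['bathrooms']
test['bed_bath_ratio'] = test['bedrooms'] / test['bathrooms']
train.head(5)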

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,description_length,bed_bath_ratio
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,691,1.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,492,1.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,479,4.0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,2.0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,579,2.0


In [135]:
""" Casting 'created' to a datetime object """

# 'created' is of type object
df['created'].dtype

dtype('O')

In [0]:
# Casting the 'created' column to a pandas datetime object
df['created'] = pd.to_datetime(df['created'])

In [137]:
""" Splitting the dataframe into training and testing """

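# TRAINING
# Note: each date string parses as midnight, so `<= end_date` drops any listings
# created later on May 31 (consistent with the printed max of May 30).
# Using `< '06-01-2016'` as the upper bound would keep them.
start_date = '04-01-2016'
end_date = '05-31-2016'
train = df[(df['created'] > start_date) & (df['created'] <= end_date)]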

# Confirming 
print('Training')
print(train['created'].min())
print(train['created'].max())



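# TESTING
# Note: '06-30-2016' also parses as midnight, so any listings created later on
# June 30 are dropped. Using `< '07-01-2016'` as the upper bound would keep them.
start_date = '06-01-2016'
end_date = '06-30-2016'
test = df[(df['created'] > start_date) & (df['created'] <= end_date)]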

# Confirming 
print('Testing')
print(test['created'].min())
print(test['created'].max())

Training
2016-04-01 22:12:41
2016-05-30 20:46:36
Testing
2016-06-01 01:10:37
2016-06-29 21:41:47
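
One way to keep the last day of each month is a half-open interval: include the start date and stop strictly before the first day of the following month. The sketch below uses a tiny made-up `listings` frame (not the renthop data), and the names `train_sketch` and `test_sketch` are hypothetical.

```python
import pandas as pd

# Made-up listings, including timestamps late on the last day of each month
listings = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-01 00:00:00',
        '2016-05-31 18:30:00',
        '2016-06-01 00:00:00',
        '2016-06-30 23:59:59',
    ]),
    'price': [3000, 2500, 4000, 3500],
})

# Half-open intervals: start <= created < first day of the next period
train_start = pd.Timestamp('2016-04-01')
test_start = pd.Timestamp('2016-06-01')
test_end = pd.Timestamp('2016-07-01')

train_mask = (listings['created'] >= train_start) & (listings['created'] < test_start)
test_mask = (listings['created'] >= test_start) & (listings['created'] < test_end)

train_sketch = listings[train_mask]
test_sketch = listings[test_mask]

print(train_sketch['created'].max())  # latest training timestamp
print(test_sketch['created'].max())   # latest testing timestamp
```

Because the two intervals share the boundary `test_start`, every listing falls into exactly one of the two sets.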


In [0]:
from sklearn.linear_model import LinearRegression

# Instantiate the linear regression model
model = LinearRegression()

In [139]:
# Specify the features and target

# Features in matrix format and the target in vector format
features = ['description_length', 'bedrooms']
target = 'price'

# Train
X_train = train[features]
y_train = train[target]

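# Test
X_test = test[features]
y_test = test[target]

# Confirm the feature matrix is a DataFrame and the target vector is a Series
print(type(X_train))
print(type(y_train))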

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [140]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [141]:
# The model's intercept and coefficients (in the order of `features`)
model.intercept_, model.coef_

(2011.3144672967646, array([4.65153917e-01, 8.38848047e+02]))
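
To read these numbers: the fitted model is price ≈ intercept + coefficient × description_length + coefficient × bedrooms, with the coefficients in the same order as `features`. The sketch below reproduces one prediction by hand from the values printed above. `predict_price` is a hypothetical helper and the example listing (500 characters, 2 bedrooms) is made up.

```python
# Values copied from the output above
intercept = 2011.3144672967646
coef_description_length = 4.65153917e-01
coef_bedrooms = 8.38848047e+02

def predict_price(description_length, bedrooms):
    """Apply the fitted linear equation by hand (hypothetical helper)."""
    return (intercept
            + coef_description_length * description_length
            + coef_bedrooms * bedrooms)

# Made-up listing: a 500-character description and 2 bedrooms
example = predict_price(500, 2)
print(round(example, 2))

# Holding description length fixed, one extra bedroom adds coef_bedrooms dollars
extra_bedroom = predict_price(500, 3) - predict_price(500, 2)
print(round(extra_bedroom, 2))
```

With these coefficients, each additional bedroom is associated with roughly $839 more in predicted rent, and each additional character of description with about $0.47.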

In [148]:
# -------------------------------------------------------------
# Get regression metrics RMSE, MAE, and R2 for both the train and test data.
# RMSE: root mean squared error
# MAE: mean absolute error
# R2: proportion of the variance in the target that the features explain
# -------------------------------------------------------------
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

"""MAE"""
y_pred_train = model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_pred_train)

y_pred_test = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred_test)

print("Training mean absolute error:", train_mae)
print("Testing mean absolute error:", test_mae)



"""RMSE"""
train_mse = mean_squared_error(y_train, y_pred_train)
train_rmse = np.sqrt(train_mse)

test_mse = mean_squared_error(y_test, y_pred_test)
test_rmse = np.sqrt(test_mse)

print('\n')
print("Training root mean square error:", train_rmse)
print("Testing root mean square error:", test_rmse)



"""R2"""
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print('\n')
print("Training R2:", train_r2)
print("Testing R2:", test_r2)

Training mean absolute error: 962.4696476831961
Testing mean absolute error: 976.1577566196978


Training root mean square error: 1475.8408407790305
Testing root mean square error: 1477.407690415826


Training R2: 0.2985996996759941
Testing R2: 0.29770960505429545


In [0]:
"""
I've been reading to understand the math better. Passing more features to scikit-learn's LinearRegression would be trivial; the hard part is creating new features that have predictive power.
"""