<a href="https://colab.research.google.com/github/YinmiAlas/DS-Unit-2-Linear-Models/blob/master/LS_DS_212_Regrssion2_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
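
Several of the ideas above can be built with a few lines of pandas. The following is a minimal sketch on a made-up toy frame, not the assignment's solution: the rows are invented, and `perk_cols` is an assumed subset of the dataset's perk columns chosen only for illustration.

```python
import pandas as pd

# Toy rows that mimic a few renthop columns (values are made up).
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', None, '   '],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 1.5],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'balcony': [0, 0, 1],
})

# Assumed subset of the 0/1 perk columns, for illustration only.
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'balcony']

# Treat missing or whitespace-only descriptions as empty.
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Count perks and combine the pet flags.
toy['perk_count'] = toy[perk_cols].sum(axis=1)
toy['cats_or_dogs'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)

# Total number of rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length', 'perk_count',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```

The ratio of beds to baths is left out on purpose: listings with zero bathrooms would need a decision about how to handle division by zero.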

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
# Let's see the dataset
df.head(1)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
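
The outlier filter in this cell keeps only rows that fall between a low and a high percentile of a column. As a rough sketch of the same idea on a toy Series (the values and the 10th/90th percentile cutoffs are made up; the notebook itself uses 0.5/99.5 for price and 0.05/99.95 for the coordinates), `Series.between` expresses the two-sided bound in one call:

```python
import numpy as np
import pandas as pd

# Made-up prices with one extreme value at each end.
prices = pd.Series([100, 2000, 2500, 3000, 3500, 4000, 90000])

# Compute both cutoffs in one call, then keep values inside them.
# Series.between includes both endpoints by default.
low, high = np.percentile(prices, [10, 90])
trimmed = prices[prices.between(low, high)]

print(low, high)
print(trimmed.tolist())
```

Because `between` is inclusive on both ends, it treats the lower and upper bounds symmetrically.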


In [None]:
# Strip the time from the 'created' column so the data is easier
# to split by date (the column holds date-time strings)
df['created'] = df['created'].apply(lambda x: pd.Timestamp(x).strftime('%Y-%m-%d'))
df.head(1)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Turn the 'created' column into a datetime index
# so rows can be selected by date range
df['created'] = pd.to_datetime(df['created'])
df = df.set_index(df['created'])
df = df.sort_index()
df.head(1)

created,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
2016-04-01,1.0,1,2016-04-01,Reduced Fee!! Priced To Rent!\rLarge Newly Upd...,West End Ave,40.7939,-73.9738,2745,700 West End Ave,medium,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1


In [None]:
# Split the data: April & May 2016 for training, June 2016 for testing
rent_date_train = df.loc['2016-04-01':'2016-05-31']

rent_date_test = df.loc['2016-06-01':'2016-06-30']

rent_date_train.shape, rent_date_test.shape

((31844, 34), (16973, 34))
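
A date-based split can also be done with boolean masks on a datetime column, which avoids stripping the time of day or setting an index. This is an alternative sketch rather than the approach taken in this notebook; the rows are made up, and it assumes the data holds nothing earlier than April 2016, so everything before the June cutoff counts as training data.

```python
import pandas as pd

# Made-up listings with full date-time strings, as in the raw 'created' column.
toy = pd.DataFrame({
    'created': ['2016-04-03 10:15:00', '2016-05-31 23:59:59',
                '2016-06-01 00:00:00', '2016-06-24 07:54:24'],
    'price': [2745, 3100, 2900, 3000],
})
toy['created'] = pd.to_datetime(toy['created'])

# Rows before June 1 go to train; rows on or after June 1 go to test.
cutoff = pd.Timestamp('2016-06-01')
train = toy[toy['created'] < cutoff]
test = toy[toy['created'] >= cutoff]

print(train.shape, test.shape)
```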

In [None]:
# Does the apartment have a balcony and a fitness center?
# Use these two existing columns as features
features = ['balcony', 'fitness_center']
target = 'price'

In [None]:
# Import the estimator class and instantiate the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()


In [None]:
# Arrange y target vectors
y_train = rent_date_train[target]
y_test = rent_date_test[target]

y_train.shape, y_test.shape

((31844,), (16973,))

In [None]:
# Arrange X feature matrices
x_train = rent_date_train[features]
x_test = rent_date_test[features]

x_train.shape, x_test.shape

((31844, 2), (16973, 2))

In [None]:
# Fit a linear regression model with two features
model.fit(x_train, y_train)
# Predict on the training data
y_pred = model.predict(x_train)
y_pred

array([4149.80597965, 3302.2928455 , 3302.2928455 , ..., 3302.2928455 ,
       3302.2928455 , 3302.2928455 ])

In [None]:
# Get the model's coefficients and intercept

model.coef_, model.intercept_

(array([766.11231225, 847.51313416]), 3302.2928454955377)
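
A linear model's prediction is the intercept plus each coefficient times its feature value. The short check below copies the coefficients and intercept from the output above and applies them to a hypothetical listing with a fitness center and no balcony; the feature order `[balcony, fitness_center]` follows the `features` list defined earlier.

```python
import numpy as np

# Values copied from the fitted model's output above.
coef = np.array([766.11231225, 847.51313416])   # [balcony, fitness_center]
intercept = 3302.2928454955377

# Hypothetical listing: no balcony (0), fitness center (1).
x = np.array([0, 1])
prediction = intercept + coef @ x

print(round(prediction, 2))
```

The result agrees with the first value in the prediction array shown earlier (4149.80597965). A listing with neither perk gets exactly the intercept, which is the 3302.2928455 that fills most of that array.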

In [None]:
# Get regression metrics RMSE, MAE, and R^2 for both the train and test data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# Train metrics
MSE_train = mean_squared_error(y_train, y_pred)
RMSE_train = np.sqrt(MSE_train)
MAE_train = mean_absolute_error(y_train, y_pred)
R2_train = r2_score(y_train, y_pred)
print('MSE:', MSE_train, 'RMSE:', RMSE_train, 'MAE:', MAE_train, 'R2', R2_train)

MSE: 2913302.537084948 RMSE: 1706.839927200248 MAE: 1161.2397486079312 R2 0.06174684007751263
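
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy on made-up numbers. It follows the usual definitions (MAE as the mean absolute error, RMSE as the square root of the mean squared error, and $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares); the arrays are invented and unrelated to the rental data.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3000.0, 2500.0, 4000.0, 3500.0])
y_hat = np.array([3200.0, 2400.0, 3500.0, 3900.0])

errors = y_true - y_hat

# Mean absolute error: average size of the miss, in the target's units.
mae = np.mean(np.abs(errors))

# Root mean squared error: like MAE, but penalizes large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

Read this way, the train $R^2$ of about 0.06 above means the two features explain roughly 6% of the variance in price.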


In [None]:
# Evaluate on the test data using the model that was fit on the training data.
# Do not call model.fit(x_test, y_test) here: refitting on the test set would
# turn these metrics into in-sample scores instead of a measure of generalization.
y_pred_test = model.predict(x_test)

# Test metrics
MSE_test = mean_squared_error(y_test, y_pred_test)
RMSE_test = np.sqrt(MSE_test)
MAE_test = mean_absolute_error(y_test, y_pred_test)
R2_test = r2_score(y_test, y_pred_test)
print('MSE:', MSE_test, 'RMSE:', RMSE_test, 'MAE:', MAE_test, 'R2', R2_test)


In [None]:
# What's the best test MAE you can get?
print('Best test MAE so far:', MAE_test)

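
Test metrics are meant to show how a model fit on the training rows performs on rows it has not seen. A minimal sketch of that pattern on synthetic data follows; the generated features, the "true" coefficients, and the 150/50 split are all assumptions made for illustration and have nothing to do with the rental dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in: two binary features and a price-like target with noise.
X = rng.integers(0, 2, size=(200, 2))
y = 3000 + 800 * X[:, 0] + 900 * X[:, 1] + rng.normal(0, 100, size=200)

# Hold out the last 50 rows as the test set.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression()
model.fit(X_train, y_train)  # fit once, on training rows only

# Score both splits with the same fitted model; no refit for the test rows.
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))

print(mae_train, mae_test)
```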