<a href="https://colab.research.google.com/github/eyvonne/DS-Unit-2-Regression-Classification/blob/master/module2/Eyve_Geo_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
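
The three regression metrics in the checklist can be computed by hand, which makes it clearer what each one measures. The sketch below uses made-up rents and predictions (not values from the renthop data), and `regression_metrics` is a hypothetical helper name rather than a library function.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # RMSE: square root of the mean squared error, in the same units as the target
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    # MAE: mean of the absolute errors
    mae = sum(abs(e) for e in errors) / n
    # R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Illustrative numbers only
y_true = [2000, 3000, 4000, 5000]
y_pred = [2500, 2500, 4500, 4500]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(rmse, mae, r2)
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does; when every error has the same size, as in this toy case, the two agree.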


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
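
Several of these ideas take only a line or two of pandas each. The following is a minimal sketch on a tiny invented table; the column names mirror the renthop data, but the rows, the new feature names, and the choice of which columns count as "perks" are all assumptions made for illustration.

```python
import pandas as pd

# Toy listings, invented for illustration
listings = pd.DataFrame({
    'description': ['Sunny two bedroom near the park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description"
text = listings['description'].fillna('').str.strip()
listings['has_description'] = (text != '').astype(int)
listings['description_length'] = text.str.len()

# Total perks: sum of the 0/1 amenity columns (assumed subset)
perk_columns = ['elevator', 'doorman', 'cats_allowed', 'dogs_allowed']
listings['perk_count'] = listings[perk_columns].sum(axis=1)

# Cats OR dogs versus cats AND dogs
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths)
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']
```

A beds-to-baths ratio would need extra care for listings with zero bathrooms, which is one reason a difference or a sum can be easier to work with.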

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [60]:
# Parse the listing timestamps and pull out the month each listing was created
df = df.copy()  # work on a copy so the filtered frame can be modified safely
df['created'] = pd.to_datetime(df['created'])
df['monthCreated'] = df['created'].dt.month

# This splits the data, but the split isn't actually wanted until all the feature
# engineering is done, so it is repeated further down.
# Train on April & May 2016, test on June 2016.
trainDF = df[df['monthCreated'].isin([4, 5])]
testDF = df[df['monthCreated'] == 6]

# List the available columns
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'monthCreated', 'PCA1', 'PCA2', 'PCA3'],
      dtype='object')

In [74]:
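# Feature one: how many more bedrooms than bathrooms the apartment has
df['bedBathDiff'] = df['bedrooms'] - df['bathrooms']
# Feature two: 1 only when both cats and dogs are allowed, otherwise 0
df['catsandDogs'] = ((df['cats_allowed'] == 1) & (df['dogs_allowed'] == 1)).astype(int)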


   cats_allowed  dogs_allowed  catsandDogs
0             0             0            0
1             1             1            1
2             0             0            0
3             0             0            0
4             0             0            0
5             0             0            0
6             1             1            1
7             0             0            0
8             1             1            1
9             0             0            0


In [0]:
# Run a PCA on latitude/longitude and bedrooms/bathrooms
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scale = StandardScaler()
pca = PCA()

feats = ['latitude', 'longitude', 'bedrooms', 'bathrooms']
X = df[feats].values
# Standardize first so no single column dominates the components.
# Note: the scaler and PCA are fit on all rows, June test rows included.
z = scale.fit_transform(X)
components = pca.fit_transform(z)

# Goal: at least 80% of explained variance; the first three components give almost 90%
pca.explained_variance_ratio_[0:3].sum()

# Keep the first three principal components as new columns
df['PCA1'] = components[:, 0]
df['PCA2'] = components[:, 1]
df['PCA3'] = components[:, 2]
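
One caveat with the cell above: the scaler and PCA are fit on every row, so the June listings influence the transformation that is later applied to the training data. A common alternative is to learn the scaling and the components from the training rows only, then apply them unchanged to the test rows. The sketch below shows that idea with plain NumPy on random stand-in data; `fit_pca` and `apply_pca` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn standardization statistics and principal axes from training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    Z = (X_train - mean) / std
    # Rows of Vt are the principal axes, ordered by explained variance
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return mean, std, Vt[:n_components], explained[:n_components]

def apply_pca(X, mean, std, components):
    """Project rows onto previously learned axes, reusing the training statistics."""
    return ((X - mean) / std) @ components.T

# Random stand-in data with four columns, in place of lat/lon/beds/baths
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_test = rng.normal(size=(50, 4))

mean, std, components, explained = fit_pca(X_train, n_components=3)
train_scores = apply_pca(X_train, mean, std, components)
test_scores = apply_pca(X_test, mean, std, components)
```

With scikit-learn the equivalent pattern would be to call `fit` (or `fit_transform`) on the training rows and only `transform` on the test rows.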

In [0]:
#bring back the train test split
trainDF=df[df['monthCreated'] != 6]
testDF=df[df['monthCreated']==6]

In [104]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Instantiate the model
model = LinearRegression()

# Separate out the features and target.
# The results below come from the four raw columns rather than PCA1-PCA3.
features = ['latitude', 'longitude', 'bedrooms', 'bathrooms']
target = 'price'
X_train = trainDF[features]
y_train = trainDF[target]
X_test = testDF[features]
y_test = testDF[target]

model.fit(X_train, y_train)
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

print('Train R2:', model.score(X_train, y_train))
print('Test R2:', model.score(X_test, y_test))
print('Train RMSE:', np.sqrt(mean_squared_error(y_train, train_pred)))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, test_pred)))
print('Train MAE:', mean_absolute_error(y_train, train_pred))
print('Test MAE:', mean_absolute_error(y_test, test_pred))

Coefficients: [  1863.91519176 -16328.42647587    428.45152883   2002.79660997]
Intercept: -1283306.4906903051
Train R2: 0.5769090842021488
Test R2: 0.5882903576478172
Train RMSE: 1146.1715544741874
Test RMSE: 1131.1950868928648
Train MAE: 739.1822083746687
Test MAE: 744.9752658120702
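
The very large negative intercept can look alarming, but it is only the model's prediction at latitude 0 and longitude 0 with no rooms, a point far outside the data. A prediction for a real listing is the intercept plus each coefficient times its feature value. The sketch below rebuilds one prediction from the printed numbers, assuming the coefficients follow the column order latitude, longitude, bedrooms, bathrooms; the apartment itself is hypothetical, and `predict_rent` is an illustrative helper name.

```python
# Coefficients and intercept copied from the printed results above.
# Assumption: they follow the column order latitude, longitude, bedrooms, bathrooms.
coefficients = {
    'latitude': 1863.91519176,
    'longitude': -16328.42647587,
    'bedrooms': 428.45152883,
    'bathrooms': 2002.79660997,
}
intercept = -1283306.4906903051

def predict_rent(listing):
    """Linear prediction: intercept plus coefficient times value for each feature."""
    return intercept + sum(coefficients[name] * listing[name] for name in coefficients)

# Hypothetical apartment, not a row from the dataset
listing = {'latitude': 40.75, 'longitude': -73.98, 'bedrooms': 2, 'bathrooms': 1}
base_rent = predict_rent(listing)

# Adding one bedroom, with everything else fixed, shifts the prediction
# by exactly the bedrooms coefficient
extra_bedroom = predict_rent({**listing, 'bedrooms': 3}) - base_rent
print(round(base_rent, 2), round(extra_bedroom, 2))
```

Read this way, each coefficient is the change in predicted monthly rent for a one-unit change in that feature while the others are held fixed.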


In [0]:
def linreg(train, test, features, target='price'):
  """Fit a linear regression on `features`, print train/test MAE, and return the test MAE."""
  X_train = train[features]
  y_train = train[target]
  X_test = test[features]
  y_test = test[target]
  # Use a fresh model per call instead of refitting the shared global `model`
  reg = LinearRegression()
  reg.fit(X_train, y_train)
  trainMAE = mean_absolute_error(y_train, reg.predict(X_train))
  testMAE = mean_absolute_error(y_test, reg.predict(X_test))

  print('Train MAE:', trainMAE)
  print('test MAE:', testMAE)
  return testMAE
  

In [144]:
linreg(trainDF, testDF,['bedrooms','bathrooms','latitude','longitude','balcony', 'loft','no_fee', 'pre-war'])


Train MAE: 738.6042212525542
test MAE: 742.917960337658


In [0]:
longlist=df.columns.to_list()[-25:-6]
longlist.append('bedrooms')
longlist.append('bathrooms')

In [0]:
longlist.append('latitude')
longlist.append('longitude')

In [152]:
from itertools import combinations

# Try every pair of candidate features and report the pairs whose test MAE beats 740
for first, second in combinations(longlist, 2):
  pair = [first, second]
  pair_model = LinearRegression().fit(trainDF[pair], trainDF['price'])
  pair_mae = mean_absolute_error(testDF['price'], pair_model.predict(testDF[pair]))
  if pair_mae < 740:
    print(first, second, pair_mae)