<a href="https://colab.research.google.com/github/scottwmwork/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
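
For reference, the three metrics in the checklist can be computed directly from their definitions. The sketch below is a minimal NumPy version; the helper name `regression_metrics` is hypothetical and not part of the assignment, and `sklearn.metrics` offers equivalent functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
    mae = np.mean(np.abs(errors))                   # mean absolute error
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy values, made up for illustration
rmse, mae, r2 = regression_metrics([1, 2, 3], [1, 2, 4])
print(rmse, mae, r2)
```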


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
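
A few of these ideas could be sketched as follows. This is a minimal sketch on a toy DataFrame, not the assignment's solution: the helper name `add_idea_features`, the `description` column, and the choice of perk columns are assumptions to check against the actual dataset.

```python
import numpy as np
import pandas as pd

def add_idea_features(df, perk_columns):
    """Return a copy of df with a few of the engineered features listed above."""
    out = df.copy()
    # Treat missing descriptions as empty strings before measuring them
    description = out['description'].fillna('').str.strip()
    out['has_description'] = (description != '').astype(int)
    out['description_length'] = description.str.len()
    # Perk columns are assumed to be 0/1 indicators
    out['perk_count'] = out[perk_columns].sum(axis=1)
    out['cats_or_dogs'] = ((out['cats_allowed'] == 1) | (out['dogs_allowed'] == 1)).astype(int)
    out['cats_and_dogs'] = ((out['cats_allowed'] == 1) & (out['dogs_allowed'] == 1)).astype(int)
    out['rooms'] = out['bedrooms'] + out['bathrooms']
    # Avoid dividing by zero: listings with 0 bathrooms get NaN for the ratio
    out['beds_per_bath'] = out['bedrooms'] / out['bathrooms'].replace(0, np.nan)
    return out

# Toy listings, made up for illustration
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [1, 0, 0],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 0.0],
})
features = add_idea_features(toy, perk_columns=['elevator', 'doorman'])
print(features[['has_description', 'description_length', 'perk_count',
                'cats_or_dogs', 'cats_and_dogs', 'rooms', 'beds_per_bath']])
```

Returning a copy rather than modifying the input keeps the helper safe to call on a filtered slice of a larger DataFrame.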

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))] 

In [15]:
df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)


# Train on listings created before June (April & May 2016); test on June 2016
train = df[df.created.dt.month < 6]
test  = df[df.created.dt.month == 6]
train.shape, test.shape

((31844, 34), (16973, 34))
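
The month comparison above relies on the data containing only April through June 2016. A slightly more explicit variant, sketched here on toy data under that assumption, selects each period by date range and takes `.copy()` so that later column assignments act on independent DataFrames. The toy frame and the names `listings`, `train_toy`, and `test_toy` are illustrative only.

```python
import pandas as pd

# Toy listings, made up for illustration
listings = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-03 10:00:00', '2016-05-20 08:30:00',
                               '2016-06-01 00:00:00', '2016-06-29 23:59:59']),
    'price': [3000, 2500, 4100, 3600],
})

# Half-open intervals: [April 1, June 1) for training, [June 1, July 1) for testing
train_mask = (listings['created'] >= '2016-04-01') & (listings['created'] < '2016-06-01')
test_mask = (listings['created'] >= '2016-06-01') & (listings['created'] < '2016-07-01')

train_toy = listings[train_mask].copy()
test_toy = listings[test_mask].copy()

print(train_toy.shape, test_toy.shape)
```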

In [16]:
import plotly.express as px
px.scatter(train, x='latitude', y='price', trendline = 'ols', trendline_color_override = 'red')

In [22]:
train.describe()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,cluster
count,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0,31844.0
mean,1.203728,1.824583,40.750743,-73.972867,3575.604007,0.53043,0.477139,0.480907,0.445861,0.430725,0.418666,0.369834,0.057311,0.267586,0.185938,0.1757,0.133777,0.143983,0.10429,0.08862,0.060734,0.055929,0.05147,0.047733,0.042269,0.044216,0.039222,0.028388,0.029048,4.460338
std,0.472447,0.825018,0.038658,0.02891,1762.136694,0.499081,0.499485,0.499643,0.497068,0.495185,0.493348,0.482767,0.232439,0.442707,0.389062,0.380571,0.340418,0.351078,0.305641,0.284198,0.238845,0.229788,0.220957,0.213203,0.201204,0.205577,0.194127,0.166082,0.167943,2.602488
min,0.0,1.0,40.5757,-74.0873,1375.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,40.7285,-73.9918,2500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,1.0,1.528357,40.7517,-73.9781,3150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,1.0,2.0,40.7736,-73.955,4095.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
max,10.0,7.0,40.9102,-73.7001,15500.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0


In [0]:
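# Still way too much data based off of plot...
# Replace bedroom counts of 0 with the training-set mean bedroom count.
# Note: only `train` is modified here, not `test`.
train['bedrooms'] = train['bedrooms'].replace(0, train['bedrooms'].mean())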

In [18]:
import plotly.express as px
px.scatter(train, x='longitude', y='latitude', color='price')

In [19]:
# Cluster the locations
from sklearn.cluster import KMeans
# Note: the n_jobs argument was removed from KMeans in scikit-learn 1.0
kmeans = KMeans(n_clusters=10)
train['cluster'] = kmeans.fit_predict(train[['longitude', 'latitude']])
test['cluster'] = kmeans.predict(test[['longitude', 'latitude']])
px.scatter(train, x='longitude', y='latitude', color='cluster')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
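
The warnings above appear because `train` and `test` were created by filtering `df`, so pandas flags the new column assignment as possibly acting on a copy of a slice. One common way to avoid them, sketched below on synthetic coordinates, is to work on explicit copies. The sketch also fixes `random_state` so that cluster labels are repeatable between runs. The synthetic data and variable names are assumptions for illustration, and `n_init=10` is set explicitly because its default differs across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic coordinates, made up for illustration
rng = np.random.default_rng(0)
coords = pd.DataFrame({
    'longitude': rng.normal(-73.97, 0.03, size=200),
    'latitude': rng.normal(40.75, 0.04, size=200),
})

# Explicit copies, so assigning a new column does not touch a slice of `coords`
train_xy = coords.iloc[:150].copy()
test_xy = coords.iloc[150:].copy()

# Fit the clusters on training locations only, then label the test locations
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
train_xy['cluster'] = kmeans.fit_predict(train_xy[['longitude', 'latitude']])
test_xy['cluster'] = kmeans.predict(test_xy[['longitude', 'latitude']])

print(train_xy['cluster'].value_counts().sort_index())
```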



In [0]:
# Engineer features
def make_features(df):
  # One-hot encode the location clusters
  clusters = pd.get_dummies(df['cluster'], prefix='cluster')
  for col in clusters:
    df[col] = clusters[col]

  # One-hot encode high-speed internet
  high_speed_internet = pd.get_dummies(df['high_speed_internet'], prefix='high_speed_internet')
  for col in high_speed_internet:
    df[col] = high_speed_internet[col]

  # Are cats or dogs allowed?
  df['cats_or_dogs'] = (df['cats_allowed'] == 1) | (df['dogs_allowed'] == 1)

  # Total number of rooms (beds + baths)
  df['rooms'] = df['bedrooms'] + df['bathrooms']

  # Ratio of baths to beds
  df['ratio_baths_beds'] = df['bathrooms'] / df['bedrooms']

  return df

In [41]:
import warnings
warnings.filterwarnings("ignore")

train = make_features(train)
test = make_features(test)
print(train.shape)
print(test.shape)

(31844, 50)
(16973, 50)
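
One caveat with calling `pd.get_dummies` separately on `train` and `test`: if a category appears in only one of them, the two frames end up with different columns. Both frames have 50 columns here, which is consistent with every category appearing in each, but a defensive variant could reindex the test dummies to the training columns. A minimal sketch on toy data (the frames and names below are illustrative assumptions):

```python
import pandas as pd

train_toy = pd.DataFrame({'cluster': [0, 1, 2, 1]})
test_toy = pd.DataFrame({'cluster': [0, 0, 1]})  # cluster 2 never appears

train_dummies = pd.get_dummies(train_toy['cluster'], prefix='cluster')
test_dummies = pd.get_dummies(test_toy['cluster'], prefix='cluster')

# Give the test frame exactly the training columns, filling absent ones with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```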


In [45]:
#Get Baselines
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_train = train['price']
y_test = test['price']
y_pred_train = [y_train.mean()] * len(y_train)
y_pred_test  = [y_train.mean()] * len(y_test)

print("Mean Baseline:\n")
print("--------Train--------")
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('Train Mean Absolute Error:', mean_absolute_error(y_train, y_pred_train))
print('Train R^2 Score:', r2_score(y_train, y_pred_train))

print("\n--------Test---------")

print('Test Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('Test Mean Absolute Error:', mean_absolute_error(y_test, y_pred_test))
print('Test R^2 Score:', r2_score(y_test, y_pred_test))

Mean Baseline:

--------Train--------
Train Root Mean Squared Error: 1762.1090255404863
Train Mean Absolute Error: 1201.8811133682555
Train R^2 Score: 0.0

--------Test---------
Test Root Mean Squared Error: 1762.9952880399528
Test Mean Absolute Error: 1197.7088871089013
Test R^2 Score: -4.218690517676649e-05
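
The two $R^2$ values follow from the definition $R^2 = 1 - SS_{res}/SS_{tot}$. Predicting the training mean on the training set makes $SS_{res}$ equal to $SS_{tot}$, so the score is exactly 0. On the test set the same constant is slightly off from the test mean, which pushes the score just below 0. The toy numbers below are made up purely to illustrate this; they are not from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

y_train_toy = np.array([1.0, 2.0, 3.0])
y_test_toy = np.array([2.0, 3.0, 7.0])
baseline = y_train_toy.mean()  # constant prediction learned from training data

r2_train = r2_score(y_train_toy, np.full_like(y_train_toy, baseline))
r2_test = r2_score(y_test_toy, np.full_like(y_test_toy, baseline))

# For a constant prediction c, R^2 = -n * (mean(y) - c)^2 / SS_tot
ss_tot = np.sum((y_test_toy - y_test_toy.mean()) ** 2)
r2_test_formula = -len(y_test_toy) * (y_test_toy.mean() - baseline) ** 2 / ss_tot

print(r2_train, r2_test, r2_test_formula)
```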


In [48]:
#Linear Regression Fit

from sklearn.linear_model import LinearRegression

target = 'price'
features = ['high_speed_internet', 'longitude']


X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
    
print(f'Linear Regression with {len(features)} features')
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())
    
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_pred_train)))
print('Test Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('Train Mean Absolute Error:', mean_absolute_error(y_train, y_pred_train))
print('Test Mean Absolute Error:', mean_absolute_error(y_test, y_pred_test))
print('Train R^2 Score:', r2_score(y_train, y_pred_train))
print('Test R^2 Score:', r2_score(y_test, y_pred_test))

Linear Regression with 2 features
Intercept -1088716.3991749058
high_speed_internet      378.517721
longitude             -14765.663499
Train Root Mean Squared Error: 1702.7745631413275
Test Root Mean Squared Error: 1702.8780111773021
Train Mean Absolute Error: 1140.9673685377127
Test Mean Absolute Error: 1135.5132643154102
Train R^2 Score: 0.06621099409838094
Test R^2 Score: 0.06699688610730448
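
To see how the fitted numbers combine, a prediction is the intercept plus each coefficient times its feature value. The sketch below plugs the intercept and coefficients printed above into that formula for a hypothetical listing at longitude -73.97, an arbitrary value chosen for illustration because it is close to the mean longitude reported by `train.describe()`. The function name `predict_price` is also hypothetical.

```python
# Values copied from the printed model output above
intercept = -1088716.3991749058
coef_high_speed_internet = 378.517721
coef_longitude = -14765.663499

def predict_price(high_speed_internet, longitude):
    """Linear model: intercept + sum of coefficient * feature value."""
    return (intercept
            + coef_high_speed_internet * high_speed_internet
            + coef_longitude * longitude)

without_internet = predict_price(high_speed_internet=0, longitude=-73.97)
with_internet = predict_price(high_speed_internet=1, longitude=-73.97)
print(round(without_internet, 2), round(with_internet, 2))
```

The very large intercept is a consequence of longitude sitting near -74: the intercept and the longitude term nearly cancel, leaving a price in the low thousands. The two predictions differ by exactly the `high_speed_internet` coefficient.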
