<a href="https://colab.research.google.com/github/lineality/DS-Unit-2-Regression-Classification/blob/master/module2/GGA_2_1_2_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] 1 Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] 2 Engineer at least two new features. (See below for explanation & ideas.)
- [ ] 3 Fit a linear regression model with at least two features.
- [ ] 4 Get the model's coefficients and intercept.
- [ ] 5 Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] 6 What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] 7 As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
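
Several of the ideas above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up three-row frame, not the assignment solution: the column names mirror the renthop data, but the `toy` frame, its values, and the new column names (`has_description`, `perk_count`, and so on) are assumptions chosen for the example.

```python
import pandas as pd

# Toy listings: column names mirror the renthop data, values are made up.
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as absent
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Perk columns are 0/1 flags, so a row-wise sum counts the perks
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# "or" vs. "and" for pets
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Room totals and ratio (a listing with 0 bathrooms would give inf here)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms']

print(toy[['has_description', 'perk_count', 'cats_or_dogs', 'cats_and_dogs']])
```

On the real data the perk list would cover all of the 0/1 amenity columns, and the bed-to-bath ratio needs a decision about listings that report zero bathrooms.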

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
#Import Libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import sklearn
import scipy.stats as stats
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, t, ttest_1samp

## The next few cells are pre-loading code, compliments of the chef. Enjoy. We'll be back for your drink orders below.

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# GGA Code Begins Here

# The Plan...

After selecting the target Y (price) and engineering two new features, x1 and x2:

For our train/test split, we will use the data from April & May 2016 to train and the data from June 2016 to test.

Then we fit the model and compute the regression metrics from the assignment list above (MAE, RMSE, $R^2$) to see how it is performing.

# Select Y = price

In [0]:
#there are NaN in 3 columns...note.
df.isna().sum()

In [24]:
df.shape

(48817, 34)

In [25]:
df.head(2)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Feature engineering:

Here we will be making two new features by combining or comparing existing features. But wait there's...actually there isn't any more than that here.


In [0]:
df["new_rooms"] =  df["bathrooms"] + df["bedrooms"]

In [0]:
df["luxuries"] =  df["hardwood_floors"] + df["doorman"] + df["fitness_center"] + df["swimming_pool"]

## Time Based Split:

### Note: I got stuck for a long time because I wasn't putting the date in quotes. I need to ask about how to take a break, e.g. when brownbag runs straight through lunch. I cannot think straight after 11 hours without breaks.

In [0]:
df["created"] = pd.to_datetime(df["created"], infer_datetime_format=True)

In [0]:
train = df[df['created'] < '2016-06-01 00:00:00']
test = df[df['created'] >= '2016-06-01 00:00:00']

Let's examine our new train and test sets. The split is roughly 2:1 (about 65% train, 35% test).

In [30]:
len(train), len(test)

(31844, 16973)
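
The time-based split can be checked on a tiny example. This is a sketch under assumptions: the `toy` frame and its timestamps are made up, and `cutoff` is a name introduced here. It compares against a `pd.Timestamp` rather than a quoted string, which sidesteps the quoting mistake mentioned above, and it asserts that no training row is later than any test row.

```python
import pandas as pd

# Made-up listings that straddle the June 1 boundary
toy = pd.DataFrame({
    'created': ['2016-04-15 10:00:00', '2016-05-31 23:59:59',
                '2016-06-01 00:00:00', '2016-06-24 07:54:24'],
    'price': [3000, 2500, 4100, 3300],
})
toy['created'] = pd.to_datetime(toy['created'])

# Everything before the cutoff trains, everything from the cutoff on tests
cutoff = pd.Timestamp('2016-06-01')
toy_train = toy[toy['created'] < cutoff]
toy_test = toy[toy['created'] >= cutoff]

# Sanity check: the two sets must not overlap in time
assert toy_train['created'].max() < toy_test['created'].min()
print(len(toy_train), len(toy_test))
```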

# Baseline

Let us look at the baseline for our train and test sets: Y vs. the mean of Y.

The two means are rather similar, within about $11 of each other.

In [69]:
train_mean = train['price'].mean()
train_mean

3575.604007034292

In [70]:
test_mean = test['price'].mean()
test_mean

3587.0546750721733

# Mean Absolute Error
## What is the MAE of our baseline assumption?

In [68]:
from sklearn.metrics import mean_absolute_error

# Arrange y target vectors
target = 'price'
y_train = train[target]
y_test = test[target]

# Get mean baseline
print('Mean Baseline (using 0 features)')
guess = y_train.mean()

# Train Error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
#MSE = 
print(f'Train Error (April & May 2016): {mae:.2f} USD $')

# Test Error
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
#MSE = 

print(f'Test Error (from June 2016): {mae:.2f} USD $')

Mean Baseline (using 0 features)
Train Error (April & May 2016): 1201.88 USD $
Test Error (from June 2016): 1197.71 USD $
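
As an alternative to repeating the mean by hand, scikit-learn ships a `DummyRegressor` that always predicts the training mean. The sketch below uses made-up prices (`y_tr`, `y_te`) and placeholder feature matrices, all assumptions for illustration; it is not the cell that produced the numbers above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices, only to show the mechanics
y_tr = np.array([2000.0, 3000.0, 4000.0])
y_te = np.array([2500.0, 5000.0])

# DummyRegressor ignores the features; X only needs the right number of rows
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

# strategy='mean' predicts the mean of the training target for every row
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)

baseline_test_mae = mean_absolute_error(y_te, baseline.predict(X_te))
print(baseline_test_mae)  # 1250.0
```

Because it has the same `fit`/`predict` interface as `LinearRegression`, the baseline can be scored with the same code as the real model.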


In [0]:
from sklearn.metrics import mean_squared_error

# mean_squared_error needs one prediction per row, so repeat the baseline guess
mean_squared_error(y_test, [guess] * len(y_test))

## Visual exploration using a 3D graph...

Note: this causes Colab problems, so I am not running it for now.

In [16]:
# Disabled for now (see note above); uncomment to run.
# #this one actually runs even though the text-til
# import pandas as pd
# import plotly.express as px
#
# px.scatter_3d(
#     train,
#     x='new_rooms',
#     y='luxuries',
#     z='price',
#     text='price',
#     title='price'
# )

In [17]:
# Disabled for now (see note above); uncomment to run.
# import pandas as pd
# import plotly.express as px
#
# px.scatter_3d(
#     train,
#     x='new_rooms',
#     y='luxuries',
#     z='price',
#     text='created',
#     title='display_address'
# )

# 2 Feature Multiple Linear Regression
## And MAE

Using scikit-learn to fit a multiple linear regression on our two fresh-off-the-press engineered features, `new_rooms` and `luxuries`.

In [76]:
from sklearn.linear_model import LinearRegression

# Arrange X feature matrices
features = ['new_rooms', 
            'luxuries']
print(f'Linear Regression, dependent on: {features}')
X_train = train[features]
X_test = test[features]

# Fit the model on the training set and score it there
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
print('The Training-Set Mean Absolute Error is:', train_mae)

# Apply the model to new data
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print('The Test-Set Mean Absolute Error is:', test_mae)


Linear Regression, dependent on: ['new_rooms', 'luxuries']
The Training-Set Mean Absolute Error is: 849.2777927783169
The Test-Set Mean Absolute Error is: 863.2246406118427


## Beta, Coefficients, Intercept

In [54]:
model.intercept_, model.coef_

(1019.6977061533298, array([785.43623215, 331.96723609]))

In [55]:
beta0 = model.intercept_
beta1, beta2 = model.coef_
print(f'y = {beta0} + {beta1}x1 + {beta2}x2')

y = 1019.6977061533298 + 785.4362321539645x1 + 331.9672360861988x2
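
To see what the fitted equation means in practice, it can be evaluated by hand. The sketch below copies the intercept and coefficients from the output above (rounded to four decimals) and applies them to a hypothetical apartment; the apartment itself (3 rooms, 2 luxuries) is an assumption made up for the example.

```python
import numpy as np

# Values copied from the model output above, rounded to four decimals
intercept = 1019.6977
coefs = np.array([785.4362, 331.9672])  # order: new_rooms, luxuries

# Hypothetical apartment: 2 bedrooms + 1 bathroom = 3 rooms, and 2 luxuries
x = np.array([3, 2])

# y = beta0 + beta1*x1 + beta2*x2
predicted_rent = intercept + coefs @ x
print(round(predicted_rent, 2))  # 4039.94
```

Read this way, each extra room adds about $785 to the predicted rent and each extra luxury about $332, holding the other feature fixed.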


In [72]:
# This is easier to read
print('Intercept:', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept: 1019.6977061533298
new_rooms    785.436232
luxuries     331.967236


Next: get the regression metrics for both the train and test data, using calls of the form `mean_squared_error(y_true, y_pred)`:

- RMSE
- MAE
- $R^2$
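
It can help to compute these metrics once by hand and confirm they match scikit-learn. This is a small sketch on made-up numbers: `y_true` and `y_hat` are illustrative arrays, not values from the renthop model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted rents
y_true = np.array([3000.0, 5465.0, 2850.0, 3275.0])
y_hat = np.array([3200.0, 5000.0, 3000.0, 3100.0])

resid = y_true - y_hat

mae = np.mean(np.abs(resid))   # average absolute miss, in dollars
mse = np.mean(resid ** 2)      # average squared miss, in dollars squared
rmse = np.sqrt(mse)            # back in dollars; penalizes big misses more
# R^2 = 1 - (sum of squared residuals) / (sum of squares around the mean)
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-rolled values agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(mae, rmse, r2)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two hints at a few very large errors.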



In [0]:
#mean_squared_error(df['price'],y_pred) 

In [0]:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [0]:
def squared_errors(df, feature, target, m, b):
    """Plot the line y = m*x + b over a scatter of feature vs. target,
    then print regression metrics for that line.

    m (slope) and b (intercept) are values you choose by hand.
    """
    # Plot data
    fig = plt.figure(figsize=(7,7))
    ax = plt.axes()
    df.plot.scatter(feature, target, ax=ax)
    
    # Make predictions
    x = df[feature]
    y = df[target]
    y_pred = m*x + b
    
    # Plot predictions
    ax.plot(x, y_pred)
    
    '''
    # Plot squared errors
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    scale = (xmax-xmin)/(ymax-ymin)
    for x, y1, y2 in zip(x, y, y_pred):
        bottom_left = (x, min(y1, y2))
        height = abs(y1 - y2)
        width = height * scale
        ax.add_patch(Rectangle(xy=bottom_left, width=width, height=height, alpha=0.1))
    '''
    
    # Print regression metrics
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    print('Mean Squared Error:', mse)
    print('Root Mean Squared Error:', rmse)
    print('Mean Absolute Error:', mae)
    print('R^2:', r2)

# RMSE

In [78]:
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
train_rmse

1284.9188631947177

In [81]:
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
test_rmse

1283.1787005784092

# MSE

In [82]:
test_MSE = mean_squared_error(y_test, model.predict(X_test))
test_MSE

1646547.5776180949

In [84]:
train_MSE = mean_squared_error(y_train, model.predict(X_train))
train_MSE

1651016.4849936059

# $R^2$

In [93]:
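from sklearn.metrics import r2_score

# Score the model's test predictions. The original cell passed y_pred, which
# still held the constant mean-baseline guesses, so it returned
# -4.218690517676649e-05: the baseline's R^2, not the model's.
r2_test = r2_score(y_test, model.predict(X_test))
r2_test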

In [91]:
from sklearn.metrics import r2_score

r2_train = r2_score(y_train,model.predict(X_train))
r2_train

0.46827649569159724
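
A test-set $R^2$ within a hair of zero, such as the -4.2e-05 recorded in this notebook, is the signature of a constant mean prediction rather than a fitted model. The sketch below shows why, using made-up prices (`y_tr`, `y_te` are assumptions for illustration): predicting the training mean for every test row gives an $R^2$ of exactly $-(\bar{y}_{test} - \text{guess})^2 / \mathrm{Var}(y_{test})$, which is zero or slightly negative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices, only to show the mechanics
y_tr = np.array([2000.0, 3000.0, 4000.0])
y_te = np.array([2500.0, 3100.0, 3700.0])

# Baseline: predict the training mean for every test row
guess = y_tr.mean()
constant_pred = np.full(len(y_te), guess)
r2_constant = r2_score(y_te, constant_pred)

# For a constant prediction, R^2 reduces to this closed form
# (np.var defaults to the population variance, which is what is needed here)
expected = -((y_te.mean() - guess) ** 2) / y_te.var()

assert np.isclose(r2_constant, expected)
assert r2_constant <= 0
print(r2_constant)
```

So when a model's $R^2$ on the test set looks like the baseline's, it is worth checking which predictions were actually passed to `r2_score`.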