<a href="https://colab.research.google.com/github/Tclack88/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [X] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [X] Engineer at least two new features. (See below for explanation & ideas.)
- [X] Fit a linear regression model with at least two features.
- [X] Get the model's coefficients and intercept.
- [X] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [X] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [X] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
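
A few of the ideas above can be sketched in pandas. This is a minimal illustration on a made-up three-row frame, not the notebook's actual feature code: the frame `listings`, the list `perk_columns`, and the new column names are assumptions, while `description`, `cats_allowed`, `dogs_allowed`, `bedrooms`, `bathrooms`, `elevator` and `dishwasher` are columns of the renthop data.

```python
import pandas as pd

# Toy stand-in for the renthop data (values are made up for illustration).
listings = pd.DataFrame({
    "description": ["Sunny 2BR near the park", None, "   "],
    "cats_allowed": [1, 0, 0],
    "dogs_allowed": [0, 0, 1],
    "bedrooms": [2, 1, 0],
    "bathrooms": [1.0, 1.0, 1.0],
    "elevator": [1, 0, 1],
    "dishwasher": [1, 0, 0],
})

# Treat missing or whitespace-only descriptions as "no description".
desc = listings["description"].fillna("").str.strip()
listings["has_description"] = (desc != "").astype(int)
listings["description_length"] = desc.str.len()

# Pet policy: either pet allowed vs. both pets allowed.
cats = listings["cats_allowed"] == 1
dogs = listings["dogs_allowed"] == 1
listings["cats_or_dogs"] = (cats | dogs).astype(int)
listings["cats_and_dogs"] = (cats & dogs).astype(int)

# Total rooms and a simple perk count over the 0/1 amenity columns.
listings["total_rooms"] = listings["bedrooms"] + listings["bathrooms"]
perk_columns = ["elevator", "dishwasher", "cats_allowed", "dogs_allowed"]  # hypothetical subset
listings["perk_count"] = listings[perk_columns].sum(axis=1)
```

On the real data, `perk_columns` would presumably list every binary amenity column rather than this small subset.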

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [X] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [X] Add your own stretch goal(s)!

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] <= np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
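
The percentile trimming above repeats the same pattern for each column, so it could be factored into a small helper. The sketch below assumes a hypothetical function `trim_extremes` and a toy `prices` frame; it relies only on `np.percentile` and `Series.between`, which is inclusive on both ends by default.

```python
import numpy as np
import pandas as pd

def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies within the given percentile range (inclusive)."""
    low, high = np.percentile(frame[column], [lower_pct, upper_pct])
    return frame[frame[column].between(low, high)]

# Toy data: prices 1..1000, trimmed to the middle 99%.
prices = pd.DataFrame({"price": np.arange(1, 1001)})
trimmed = trim_extremes(prices, "price", 0.5, 99.5)
```

Note that chaining this helper column by column would recompute the percentiles on already-trimmed data, whereas the cell above evaluates every percentile on the full frame, so the two approaches can keep slightly different rows.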

In [103]:
import pandas_profiling
df.profile_report()



In [32]:
df.created = pd.to_datetime(df.created)

df['month'] = df.created.dt.month  # month number of each listing, used for the train/test split

# Features (created in yesterday's assignment)

# A high number of bedrooms is important, but it's less desirable if the ratio of bathrooms to bedrooms is low.
# This feature takes that into account: it gives a high value to more bedrooms
# plus "bonus points" for a higher bathroom-to-bedroom ratio.
# The +1 in the denominator avoids division by 0; with 0 bedrooms the entry is just the sum of bedrooms and bathrooms.
#df['ideal_bathroom_bedroom_ratio'] = df.bedrooms + (df.bathrooms / (df.bedrooms + 1))
df['bed_and_bath'] = df.bedrooms + df.bathrooms

conveniences = ['elevator','laundry_in_building','laundry_in_unit','dishwasher','dining_room','high_speed_internet','wheelchair_access']
look = ['hardwood_floors','roof_deck','outdoor_space','balcony','loft','garden_patio','common_outdoor_space']
lux = ['fitness_center','terrace','pre-war','swimming_pool','exclusive','doorman']

df['conveniences'] = df[conveniences].sum(axis=1)
df['look'] = df[look].sum(axis=1)
df['lux'] = df[lux].sum(axis=1)

# df.interest_level.unique() returns low, medium and high, so map them to an ordinal scale
interest_convert = {'low':1,'medium':2,'high':3}
df.interest_level = df.interest_level.map(interest_convert)


df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,bed_and_bath,conveniences,look,lux
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,4.5,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,3.0,1,0,2
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,3,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,2.0,2,1,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,2.0,0,1,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,5.0,0,0,1


In [33]:
df.month.value_counts()
# No NaN's and only April, May and June

6    16973
4    16217
5    15627
Name: month, dtype: int64

In [0]:
#features = ['ideal_bathroom_bedroom_ratio','interest_level','conveniences','look','lux']
features = ['bed_and_bath','interest_level','conveniences','look','lux']

label = ['price']

train_df = df[df.month != 6]
test_df = df[df.month == 6]

X_train = train_df[features]
y_train = train_df[label]
X_test = test_df[features]
y_test = test_df[label]
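
The same time-based split can also be written with a timestamp cutoff instead of a month column. This is a small sketch on four rows taken from the `df.head()` output; the names `listings` and `cutoff` are illustrative assumptions.

```python
import pandas as pd

# Four listings copied from the head of the dataset (created, price).
listings = pd.DataFrame({
    "created": pd.to_datetime([
        "2016-06-24 07:54:24",
        "2016-06-12 12:19:27",
        "2016-04-17 03:26:41",
        "2016-04-18 02:22:02",
    ]),
    "price": [3000, 5465, 2850, 3275],
})

# Everything before June 2016 trains the model; June 2016 is held out for testing.
cutoff = pd.Timestamp("2016-06-01")
train = listings[listings["created"] < cutoff]
test = listings[listings["created"] >= cutoff]
```

A cutoff makes the intent explicit (train on the past, test on the future) and still works if the data later spans more than one year, where month numbers alone would be ambiguous.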

In [0]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

In [96]:
print("Unit change in price based on feature:")
print(pd.Series(np.round(model.coef_[0]),features).to_string())
print("\nIntercept(base price with none of those features):")
print(f"${model.intercept_[0]:,.0f}")

Unit change in price based on feature:
bed_and_bath      768.0
interest_level   -560.0
conveniences      210.0
look              -92.0
lux               257.0

Intercept(base price with none of those features):
$1,778
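
To read these numbers, recall that a linear regression predicts the intercept plus each coefficient times its feature value. The sketch below plugs the rounded values printed above into that formula for a hypothetical listing (1 bedroom and 1 bathroom, low interest, one convenience, no "look" or "lux" perks); because the coefficients are rounded, the result is only approximate.

```python
import numpy as np

# Rounded values copied from the output above, in feature order:
# bed_and_bath, interest_level, conveniences, look, lux
coefficients = np.array([768.0, -560.0, 210.0, -92.0, 257.0])
intercept = 1778.0

# Hypothetical listing: 1 bed + 1 bath, interest_level 1 (low), 1 convenience.
listing = np.array([2.0, 1.0, 1.0, 0.0, 0.0])

# prediction = intercept + sum(coefficient_i * feature_i)
predicted_price = intercept + coefficients @ listing
print(f"${predicted_price:,.0f}")  # $2,964
```

The arithmetic is 1778 + 768·2 − 560·1 + 210·1 = 2964.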


In [0]:
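from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def show_error(prediction, actual):
  # sklearn metrics take (y_true, y_pred). RMSE and MAE are symmetric in their
  # arguments, but R² is not, so the actual values must be passed first.
  # The R² values in the recorded outputs came from the swapped order and will differ on re-run.
  rmse = mean_squared_error(actual, prediction)**.5
  mae = mean_absolute_error(actual, prediction)
  r2 = r2_score(actual, prediction)
  print(f"Root Mean Squared Error: ${rmse:,.0f}")
  print(f"Mean Absolute Error:\t ${mae:,.0f}")
  print(f"R\u00b2: \t\t {r2}")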

In [102]:
show_error(y_pred,y_test)

Root Mean Squared Error: $1,211
Mean Absolute Error:	 $805
R²: 		 0.08187963257831421
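
The assignment asks for these metrics on both the train and the test data. The following is a sketch of one way to do that, using synthetic data rather than the renthop set; the helper `regression_metrics` and all variable names are assumptions. scikit-learn's metric functions expect `(y_true, y_pred)` in that order, which matters for R² because it is not symmetric in its arguments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2, passing the true values first as sklearn expects."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Synthetic "rent" data: two features plus noise (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = 1500 + 700 * X[:, 0] + 200 * X[:, 1] + rng.normal(0, 300, size=200)

X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
model = LinearRegression().fit(X_tr, y_tr)

train_scores = regression_metrics(y_tr, model.predict(X_tr))
test_scores = regression_metrics(y_te, model.predict(X_te))

# Swapping the arguments changes R^2 (but not RMSE or MAE).
swapped_r2 = float(r2_score(model.predict(X_tr), y_tr))
```

Comparing `train_scores` with `test_scores` shows how well the fit generalizes: a large gap between the two would point to overfitting.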


## Compare with yesterday's attempt where the train/test split was random and not based on month

In [101]:
from sklearn.model_selection import train_test_split

X_train2,X_test2,y_train2,y_test2 = train_test_split(df[features],df[label])

model2 = LinearRegression()

model2.fit(X_train2,y_train2)
y_pred2 = model2.predict(X_test2)


print("Unit change in price based on feature:")
print(pd.Series(np.round(model2.coef_[0]),features).to_string())
print("\nIntercept(base price with none of those features):")
print(f"${model2.intercept_[0]:,.0f}\n\n")


show_error(y_pred2,y_test2)

Unit change in price based on feature:
bed_and_bath      776.0
interest_level   -582.0
conveniences      189.0
look              -87.0
lux               280.0

Intercept(base price with none of those features):
$1,788


Root Mean Squared Error: $1,221
Mean Absolute Error:	 $806
R²: 		 0.08624705905052932


### The models did about the same
The random split did almost exactly as well as the month-based split: test MAE was 806 vs. 805 dollars, and RMSE was 1,221 vs. 1,211 dollars. This makes me believe there's not much value in framing this as a future-prediction problem for this type of dataset. The rent for an apartment depends on its physical features (whether or not there's a pool, high-speed internet, etc.), not on when the listing was posted (i.e. the `created` column).