<a href="https://colab.research.google.com/github/jonathanmendoza-tx/DS-Unit-2-Regression-Classification/blob/master/module2/Jonathan_Mendoza_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
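
The coefficient, intercept, and metric items in the checklist above can be illustrated with a minimal NumPy-only sketch. It is not the assignment's solution: the toy data and the helper `regression_metrics` are made up for illustration, and the fit uses ordinary least squares via `np.linalg.lstsq` so the formulas stay visible. In practice, scikit-learn's `LinearRegression`, `mean_absolute_error`, `mean_squared_error`, and `r2_score` report the same quantities.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {'rmse': rmse, 'mae': mae, 'r2': r2}

# Made-up, noise-free data: price = 1000 + 800 * bedrooms + 500 * bathrooms
X_toy = np.array([[1, 1], [2, 1], [3, 2], [2, 2], [4, 2]], dtype=float)
y_toy = 1000 + 800 * X_toy[:, 0] + 500 * X_toy[:, 1]

# Ordinary least squares with an explicit intercept column of ones
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
intercept, coefs = beta[0], beta[1:]

metrics = regression_metrics(y_toy, A @ beta)
print(intercept, coefs, metrics)
```

Computing the same metrics on both the train and test predictions makes it easy to see how much worse the model does on data it has not seen.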


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
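
Several of the ideas above can be sketched in pandas. The rows below are made up, and the new column names (`has_description`, `description_length`, `perk_count`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`, `beds_per_bath`) are hypothetical. The sketch also assumes the listing text lives in a `description` column and that perks are 0/1 indicator columns.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the rental listings (all values are made up)
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 0.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Assumed subset of the 0/1 perk columns; extend to the full list as needed
perk_cols = ['cats_allowed', 'dogs_allowed', 'elevator', 'doorman']

# Treat missing or whitespace-only descriptions as "no description"
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Total perks per listing
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Pets: either allowed vs. both allowed
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Room counts; the ratio is left as NaN when there are 0 bathrooms
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy.drop(columns='description'))
```

Leaving the ratio as `NaN` for listings with 0 bathrooms is one choice among several. A linear model cannot train on missing values, so those rows would need to be dropped or imputed before fitting.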

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

## Load

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into the directory for this module
    os.chdir('module2')


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

## Explore

In [0]:
import matplotlib.pyplot as plt
import datetime
df['created'] = pd.to_datetime(df['created'])

In [0]:
# Extract the month number from the listing timestamp (4 = April, 5 = May, 6 = June)
df['month'] = df['created'].dt.month

In [26]:
df['month'].value_counts()

6    16973
4    16217
5    15627
Name: month, dtype: int64

In [0]:
# Train on April & May 2016 listings; test on June 2016 listings
train = df[df['month'] < 6]
test = df[df['month'] == 6]

In [29]:
train.shape, test.shape, df.shape

((31844, 35), (16973, 35), (48817, 35))
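
The month-based split above works because the data covers only April through June 2016. A more general alternative is to split on an explicit cutoff date, sketched here on made-up rows. The names `toy`, `cutoff`, `toy_train`, and `toy_test` are hypothetical and do not touch the notebook's `train` and `test`.

```python
import pandas as pd

# Made-up listings straddling the end of May 2016
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-03 10:00:00',
        '2016-05-31 23:59:59',
        '2016-06-01 00:00:00',
        '2016-06-29 18:30:00',
    ]),
    'price': [3000, 2500, 4100, 3600],
})

# Everything before June 1 is training data; June 1 onward is test data
cutoff = pd.Timestamp('2016-06-01')
toy_train = toy[toy['created'] < cutoff]
toy_test = toy[toy['created'] >= cutoff]

# Sanity checks: no rows lost, and no training row is newer than any test row
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train['created'].max() < toy_test['created'].min()
print(toy_train.shape, toy_test.shape)
```

The second check is what makes this a time-based split: the model is always evaluated on listings created after the ones it was trained on.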

In [36]:
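# Pairwise correlations between numeric columns only
# (text columns such as the description cannot be correlated)
train.select_dtypes(include='number').corr()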

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month
bathrooms,1.0,0.526102,0.012872,-0.019719,0.684137,0.128704,0.0221,0.095403,0.02507,0.154125,0.173795,0.129592,-0.014814,0.1479,-0.014654,0.211802,0.103708,0.135735,0.229534,0.090885,0.128851,0.113007,0.065157,0.133543,-0.000402,0.007647,0.090583,0.067371,-0.009281,0.011215
bedrooms,0.526102,1.0,0.00465,0.055544,0.5365,-0.030263,-0.008355,0.096108,-0.006896,-0.046827,0.15662,0.16256,0.000825,0.015655,-0.002732,0.153591,0.044011,0.118787,0.190639,0.058695,0.098536,0.03438,-0.002055,0.099822,-0.01556,-0.10975,0.073061,0.011869,-0.005031,0.012243
latitude,0.012872,0.00465,1.0,0.329175,-0.039129,-0.016379,-0.035711,0.019477,-0.038169,-0.042532,-0.02418,-0.018046,-0.055495,-0.107993,0.02826,-0.047701,-0.062222,-0.084935,0.016496,-0.033745,0.019896,0.028281,-0.054351,0.004899,-0.05391,-0.016551,-0.002173,-0.072748,-0.124857,0.031554
longitude,-0.019719,0.055544,0.329175,1.0,-0.250091,-0.190341,-0.064892,-0.106368,-0.077785,-0.274412,-0.162148,-0.087616,-0.058856,-0.256357,0.002466,-0.123731,-0.158661,-0.107538,-0.01746,-0.128477,-0.03596,-0.071829,-0.107124,-0.052616,0.048941,-0.058037,-0.029241,-0.064151,-0.115252,0.015911
price,0.684137,0.5365,-0.039129,-0.250091,1.0,0.204558,0.052167,0.105506,0.060905,0.272624,0.227775,0.135182,-0.020344,0.226138,-0.029749,0.279104,0.123536,0.136653,0.239696,0.092171,0.130938,0.132301,0.071246,0.142655,-0.010897,0.000185,0.092367,0.07306,0.006269,0.017114
elevator,0.128704,-0.030263,-0.016379,-0.190341,0.204558,1.0,0.039135,0.267743,0.038985,0.617265,0.342497,0.230979,0.14428,0.431833,-0.096178,0.128841,0.330937,0.212922,0.197731,0.276124,0.172863,0.183826,0.187847,0.14218,0.024406,0.052694,0.088001,0.158554,0.123025,0.008934
cats_allowed,0.0221,-0.008355,-0.035711,-0.064892,0.052167,0.039135,1.0,-0.165084,0.936082,0.098092,-0.039279,-0.022044,0.106905,0.135813,0.045873,0.000729,0.034979,0.081362,-0.020895,0.083517,0.021379,0.011551,0.056047,0.011132,0.032739,-0.036948,0.010058,0.043787,0.112179,0.010659
hardwood_floors,0.095403,0.096108,0.019477,-0.106368,0.105506,0.267743,-0.165084,1.0,-0.173728,0.191773,0.634526,0.347272,-0.147549,0.160174,0.011397,0.353154,0.272214,0.18387,0.316261,0.241037,0.175504,0.167263,0.185692,0.179247,-0.192835,0.116145,0.16297,0.12385,-0.125314,-0.011709
dogs_allowed,0.02507,-0.006896,-0.038169,-0.077785,0.060905,0.038985,0.936082,-0.173728,1.0,0.104055,-0.036654,-0.010719,0.09141,0.139972,0.051973,0.011688,0.041319,0.081291,-0.017096,0.097098,0.024257,0.010426,0.069262,0.009273,0.032616,-0.041726,0.013057,0.053235,0.114579,0.008155
doorman,0.154125,-0.046827,-0.042532,-0.274412,0.272624,0.617265,0.098092,0.191773,0.104055,1.0,0.299848,0.260018,0.086744,0.605068,-0.054007,0.154492,0.390499,0.21372,0.188729,0.317876,0.163822,0.262984,0.221016,0.134242,-0.072733,0.006645,0.079066,0.17169,0.13654,0.005209


In [48]:
train['price'].describe()

count    31844.000000
mean      3575.604007
std       1762.136694
min       1375.000000
25%       2500.000000
50%       3150.000000
75%       4095.000000
max      15500.000000
Name: price, dtype: float64

In [53]:
train.query('bedrooms<1').shape

(6172, 35)

In [55]:
train.query('bathrooms<1').shape

(191, 35)

In [62]:
# Keep training listings with at least one bedroom (drops studios),
# a nonzero bathroom count, and a price under $10,000
train = train.query('bedrooms > 0 and bathrooms > 0 and price < 10000')
train.shape

(25114, 35)

In [63]:
train['price'].describe()

count    25114.000000
mean      3682.767819
std       1425.626872
min       1375.000000
25%       2700.000000
50%       3395.000000
75%       4395.000000
max       9999.000000
Name: price, dtype: float64