
Lambda School Data Science

*Unit 2, Sprint 1, Module 1*

---

# Regression 1

## Assignment

You'll use another **New York City** real estate dataset. 

But now you'll **predict how much it costs to rent an apartment**, instead of how much it costs to buy a condo.

The data comes from renthop.com, an apartment listing website.

- [ ] Look at the data. Choose a feature, and plot its relationship with the target.
- [ ] Use scikit-learn for linear regression with one feature. You can follow the [5-step process from Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API).
- [ ] Define a function to make new predictions and explain the model coefficient.
- [ ] Organize and comment your code.
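
The five-step scikit-learn process linked above can be sketched on a small synthetic dataset. This is only an illustration under stated assumptions: the data below is generated (rent = 2000 + 900 × bedrooms plus noise), not taken from renthop.com, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # step 1: choose a model class

# Synthetic stand-in data (an assumption, not the renthop listings):
# rent = 2000 + 900 * bedrooms + noise
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=200)
rent = 2000 + 900 * bedrooms + rng.normal(0, 50, size=200)

model = LinearRegression()       # step 2: instantiate with hyperparameters
X = bedrooms.reshape(-1, 1)      # step 3: features matrix (2-D) ...
y = rent                         #         ... and target vector (1-D)
model.fit(X, y)                  # step 4: fit the model to the data
y_pred = model.predict([[2]])    # step 5: apply the model to new data

slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.0f}, intercept={intercept:.0f}, "
      f"prediction for 2 bedrooms={y_pred[0]:.0f}")
```

The fitted slope and intercept should land close to the 900 and 2000 used to generate the data, which is a quick sanity check that the steps are wired together correctly.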

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.

If your **Plotly** visualizations aren't working:
- You must have JavaScript enabled in your browser
- You probably want to use Chrome or Firefox
- You may need to turn off ad blockers
- [If you're using Jupyter Lab locally, you need to install some "extensions"](https://plot.ly/python/getting-started/#jupyterlab-support-python-35)

## Stretch Goals
- [ ] Do linear regression with two or more features.
- [ ] Read [The Discovery of Statistical Regression](https://priceonomics.com/the-discovery-of-statistical-regression/)
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 2.1: What Is Statistical Learning?

In [1]:
import sys
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

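# If you're on Colab, read the data straight from GitHub:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
# Otherwise fall back to a local copy (this path is an assumption; adjust it to your setup):
else:
    DATA_PATH = '../data/'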

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH + 'apartments/renthop-nyc.csv',
                 parse_dates=['created'])
assert df.shape == (49352, 34)

In [3]:
# Remove outliers: 
# the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= 1375) & (df['price'] <= 15500) & 
        (df['latitude'] >= 40.57) & (df['latitude'] < 40.99) &
        (df['longitude'] >= -74.1) & (df['longitude'] <= -73.38)]
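
The hard-coded bounds in the cell above are described as percentile cutoffs. A minimal sketch of how such cutoffs could be computed rather than typed in, assuming "most extreme 1%" means 0.5% from each tail (the notebook does not say). The `toy` DataFrame is hypothetical random data, not the renthop listings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings (not the real renthop data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"price": rng.lognormal(mean=8.1, sigma=0.5, size=10_000)})

# Keep the central 99%: drop the lowest 0.5% and the highest 0.5% of prices.
low, high = toy["price"].quantile([0.005, 0.995])
trimmed = toy[toy["price"].between(low, high)]

kept_fraction = len(trimmed) / len(toy)
print(f"cutoffs: {low:,.0f} to {high:,.0f}; kept {kept_fraction:.1%} of rows")
```

Computing the cutoffs from the data keeps the filter valid if the dataset is refreshed, at the cost of making the thresholds less readable than fixed numbers.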

In [4]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48818 entries, 0 to 49351
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   bathrooms             48818 non-null  float64       
 1   bedrooms              48818 non-null  int64         
 2   created               48818 non-null  datetime64[ns]
 3   description           47393 non-null  object        
 4   display_address       48685 non-null  object        
 5   latitude              48818 non-null  float64       
 6   longitude             48818 non-null  float64       
 7   price                 48818 non-null  int64         
 8   street_address        48808 non-null  object        
 9   interest_level        48818 non-null  object        
 10  elevator              48818 non-null  int64         
 11  cats_allowed          48818 non-null  int64         
 12  hardwood_floors       48818 non-null  int64         
 13  dogs_allowed    

In [6]:
# Multi-feature model first; the single-feature model follows further down.

factors = df

In [7]:
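# Drop the target (price) plus the text, categorical, location, and timestamp columns,
# leaving 26 numeric features
factors = factors.drop(columns=['description', 'display_address', 'latitude', 'longitude', 'street_address', 'interest_level', 'price', 'created'])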

In [8]:
y = df['price']
X = factors

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [10]:
print(X_train.shape)
print(X_test.shape)

(39054, 26)
(9764, 26)


In [11]:
lr = LinearRegression()

In [12]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [13]:
print('training MAE', mean_absolute_error(y_train, lr.predict(X_train)))
print('test MAE', mean_absolute_error(y_test, lr.predict(X_test)))

training MAE 752.2105364646997
test MAE 758.1757558189008
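
An MAE of roughly $750 is easier to judge next to a naive baseline that always predicts the mean training rent. The helper below is a hypothetical sketch (the name `mean_baseline_mae` and the toy rents are not from the notebook); on the real data it would be called as `mean_baseline_mae(y_train, y_test)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def mean_baseline_mae(y_train, y_test):
    """MAE on y_test when every prediction is the mean of y_train."""
    guess = np.full(len(y_test), np.mean(y_train))
    return mean_absolute_error(y_test, guess)


# Toy rents (hypothetical values, not taken from the dataset)
toy_train = [2000, 3000, 4000]
toy_test = [2500, 3500]
baseline = mean_baseline_mae(toy_train, toy_test)
print(baseline)
```

If the regression's test MAE is not clearly below the baseline's, the features are adding little beyond the average rent.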


In [14]:
lr.coef_

array([1798.20682369,  462.30520288,  162.72663422,  -91.51782591,
       -217.02257665,  184.14959337,  618.13944939,   97.86863374,
       -187.69495481, -290.73593437,  225.08055656,  -60.59923833,
        517.31539517, -136.49475494, -148.34439368,  188.43195388,
       -312.1623678 ,  -69.38910306,   38.11661186, -146.43941989,
        202.7205623 ,   36.41851566,  248.61776435,   71.20658326,
        162.74588634,  107.61238727])
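
The raw array is hard to read because the coefficients are listed in column order without names (the first two correspond to `bathrooms` and `bedrooms`, the first columns left after the drop). One way to label them is to wrap them in a pandas Series indexed by the feature columns. The sketch below uses a hypothetical toy frame; in this notebook the equivalent would be `pd.Series(lr.coef_, index=X_train.columns)`, run before the estimator is refit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features (hypothetical values) with an exact linear relationship:
# price = 1000 + 500 * bedrooms + 200 * doorman
toy_X = pd.DataFrame({
    "bedrooms": [0, 1, 2, 3, 1, 2],
    "doorman":  [0, 0, 1, 1, 1, 0],
})
toy_y = 1000 + 500 * toy_X["bedrooms"] + 200 * toy_X["doorman"]

toy_model = LinearRegression().fit(toy_X, toy_y)

# Pair each coefficient with the column it belongs to, largest first.
labeled = pd.Series(toy_model.coef_, index=toy_X.columns).sort_values(ascending=False)
print(labeled.round(2))
```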

In [15]:
# Single-feature model: predict price from bedrooms alone

y1 = df['price']
X1 = df[['bedrooms']]

In [16]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.2, random_state=42)

In [17]:
# Refit the same estimator on bedrooms only; this overwrites the multi-feature fit from In [12]
lr.fit(X1_train, y1_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [18]:
def predict(bedrooms):
    """Return a sentence with the estimated rent and the bedroom coefficient."""
    # scikit-learn expects a 2-D input: one row, one feature
    y_pred = lr.predict([[bedrooms]])
    estimate = y_pred[0]
    coefficient = lr.coef_[0]
    result = f'${estimate:,.0f} estimated rent for {bedrooms:,.0f} bedrooms in NY.'
    explanation = f'In this linear regression, each additional bedroom adds ${coefficient:,.0f}'
    return result + ' ' + explanation

In [19]:
predict(2)

'$3,972 estimated rent for 2 bedrooms in NY. In this linear regression, each additional bedroom adds $855'

In [20]:
lr.coef_

array([855.30531892])
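
With one feature, the model is just a line: predicted rent = intercept + coefficient × bedrooms. The printed output above implies an intercept of roughly $2,261 (3,972 − 2 × 855.31), which is approximate because the estimate was rounded. The sketch below checks the same identity on hypothetical toy data; `predict_by_hand` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): rent = 2200 + 850 * bedrooms exactly
toy_bedrooms = np.array([[0], [1], [2], [3]])
toy_rent = 2200 + 850 * toy_bedrooms.ravel()

toy_lr = LinearRegression().fit(toy_bedrooms, toy_rent)


def predict_by_hand(model, bedrooms):
    """Evaluate intercept + slope * bedrooms without calling model.predict."""
    return model.intercept_ + model.coef_[0] * bedrooms


by_hand = predict_by_hand(toy_lr, 2)
from_sklearn = toy_lr.predict([[2]])[0]
print(round(by_hand, 2), round(from_sklearn, 2))
```

In the notebook itself, `lr.intercept_` holds the exact intercept, so `lr.intercept_ + lr.coef_[0] * 2` should reproduce the value returned by `predict(2)`.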