<a href="https://colab.research.google.com/github/austiezr/DS-Unit-2-Linear-Models/blob/master/module1-regression-1/LS_DS_211_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 1*

---

# Regression 1

## Assignment

You'll use another **New York City** real estate dataset. 

But now you'll **predict how much it costs to rent an apartment**, instead of how much it costs to buy a condo.

The data comes from renthop.com, an apartment listing website.

- [x] Look at the data. Choose a feature, and plot its relationship with the target.
- [x] Use scikit-learn for linear regression with one feature. You can follow the [5-step process from Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API).
- [x] Define a function to make new predictions and explain the model coefficient.
- [x] Organize and comment your code.
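
As a rough illustration of the 5-step process linked above, here is a minimal sketch on made-up data. The `bedrooms` and `rent` arrays are hypothetical stand-ins, not the renthop listings, and the rent formula is an assumption chosen so the fit is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Step 1: choose a model class

# Synthetic stand-in data (hypothetical, not the renthop listings):
# rent = 2000 + 900 * bedrooms, with no noise so the fit is exact.
bedrooms = np.array([0, 1, 2, 3, 4])
rent = 2000 + 900 * bedrooms

model = LinearRegression()         # Step 2: instantiate the model (default hyperparameters)
X = bedrooms.reshape(-1, 1)        # Step 3: features matrix (2-D) ...
y = rent                           #         ... and target vector (1-D)
model.fit(X, y)                    # Step 4: fit the model to the data
prediction = model.predict([[5]])  # Step 5: apply the model to new data

print(model.coef_[0], model.intercept_, prediction[0])
```

The features matrix must be two-dimensional even with one feature, which is why the notebook below selects columns with double brackets such as `df[['bedrooms']]`.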

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.

If your **Plotly** visualizations aren't working:
- You must have JavaScript enabled in your browser
- You probably want to use Chrome or Firefox
- You may need to turn off ad blockers
- [If you're using Jupyter Lab locally, you need to install some "extensions"](https://plot.ly/python/getting-started/#jupyterlab-support-python-35)

## Stretch Goals
- [ ] Do linear regression with two or more features.
- [x] Read [The Discovery of Statistical Regression](https://priceonomics.com/the-discovery-of-statistical-regression/)
- [x] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 2.1: What Is Statistical Learning?
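
For the first stretch goal, a regression with two features might look like the sketch below. The data is synthetic and the coefficients (800 per bedroom, 1200 per bathroom) are assumptions made up for the example, not estimates from the rental data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical):
# rent = 1500 + 800 * bedrooms + 1200 * bathrooms, with no noise.
# Columns of X: bedrooms, bathrooms.
X = np.array([[0, 1], [1, 1], [2, 1], [2, 2], [3, 2], [4, 3]], dtype=float)
y = 1500 + 800 * X[:, 0] + 1200 * X[:, 1]

model = LinearRegression().fit(X, y)

# One coefficient per feature column, in column order.
print(model.coef_, model.intercept_)
```

With several features, each coefficient describes the change in the prediction for a one-unit change in that feature while the other features are held fixed.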

In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read New York City apartment rental listing data
import pandas as pd
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

In [0]:
# Remove outliers:
# the most extreme 1% of prices,
# the most extreme 0.1% of latitudes, and
# the most extreme 0.1% of longitudes
df = df[(df['price'] >= 1375) & (df['price'] <= 15500) &
        (df['latitude'] >= 40.57) & (df['latitude'] < 40.99) &
        (df['longitude'] >= -74.1) & (df['longitude'] <= -73.38)]
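
The cutoffs above are hard-coded. One way such thresholds could be derived is from quantiles, sketched below on synthetic prices. Whether the original cutoffs were computed this way, and whether the 1% was split evenly between the two tails, are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in prices (hypothetical, not the renthop data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=8.1, sigma=0.4, size=10_000), name='price')

# Keep the middle 99%: drop 0.5% from each tail (an assumed split).
low, high = prices.quantile([0.005, 0.995])
trimmed = prices[(prices >= low) & (prices <= high)]

print(f'kept {len(trimmed) / len(prices):.1%} of rows')
```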

### Initial regression

In [0]:
# Imports, model setup, and an ordered 50/50 split (first half = test, second half = train)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import plotly.graph_objects as go

reg = LinearRegression()

test, train = np.array_split(df, 2)

testX = test[['bedrooms']]
trainX = train[['bedrooms']]
testY = test['price']
trainY = train['price']

In [387]:
# Training and Fitting for bedrooms

reg.fit(trainX, trainY)

predY = reg.predict(testX)

print(f'Coefficients: {reg.coef_}')
print(f'Mean squared error: {mean_squared_error(testY, predY)}')
print(f'Coefficient of determination: {r2_score(testY, predY)}')

Coefficients: [867.05850907]
Mean squared error: 2189513.514811866
Coefficient of determination: 0.28070790720979255
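
To make the two metrics above concrete, here is a small sketch that computes them by hand using the standard definitions. The numbers are toy values chosen for easy arithmetic, not results from the model.

```python
# Toy values (hypothetical), chosen so the arithmetic is easy to follow.
y_true = [3000, 4000, 5000, 6000]
y_pred = [3200, 3900, 5100, 5800]

n = len(y_true)
mean_y = sum(y_true) / n

# Mean squared error: the average squared residual.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mse = ss_res / n

# R^2: 1 minus the residual sum of squares over the total sum of squares.
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

An R² of 0.28 therefore means the one-feature model removes about 28% of the squared error that a constant prediction at the mean of the test prices would make.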


In [388]:
# Plotting bedrooms 

fig = go.Figure(data=go.Scatter(x=test['bedrooms'], y=test['price'], mode='markers'))
fig.add_trace(go.Scatter(x=test['bedrooms'], y=predY))

fig.show()

### Encoding New Features

In [0]:
# map interest level to an ordinal numeric scale (low=1, medium=2, high=3)

df['interest_level'] = df['interest_level'].map({'low': 1, 'medium': 2, 'high': 3})
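
One thing to watch with `Series.map` and a dict: any value that is not a key becomes `NaN`. The sketch below shows this on a tiny hypothetical series, including what happens if a mapping cell like the one above is run twice on the same column.

```python
import pandas as pd

mapping = {'low': 1, 'medium': 2, 'high': 3}

# 'unknown' is a hypothetical stray label, not a value taken from the dataset.
levels = pd.Series(['low', 'high', 'medium', 'unknown'])

encoded = levels.map(mapping)          # labels missing from the dict become NaN
encoded_twice = encoded.map(mapping)   # numbers are not keys, so every row becomes NaN

print(encoded.tolist())
print(encoded_twice.isna().all())
```

Since the assignment overwrites the column in place, re-running that cell without reloading the data would blank out the column.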

In [0]:
# total number of bedrooms and bathrooms

df['bed_bath'] = df['bedrooms'] + df['bathrooms']

In [0]:
# inventing a value based on 'desirable' features tempered by interest level:
# the sum of the amenity flags plus bed_bath, divided by the numeric interest level

luxury_columns = ['elevator', 'hardwood_floors', 'doorman', 'dishwasher', 'no_fee',
                  'fitness_center', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
                  'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
                  'new_construction', 'terrace', 'loft', 'garden_patio',
                  'bed_bath', 'cats_allowed', 'dogs_allowed']

# skipna=False keeps the same NaN behaviour as adding the columns one by one
df['arbitrary_luxury_value'] = df[luxury_columns].sum(axis=1, skipna=False) / df['interest_level']

### Experimenting training with different features

In [0]:
# Function to train and plot easily with different features/targets

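def linear_regression(features, df=df, target='price'):
  # `features` is a single column name. The first half of df is held out for
  # testing and the second half is used for training.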
  reg = LinearRegression()
  
  test, train = np.array_split(df, 2)
  testX = test[[features]]
  trainX = train[[features]]
  testY = test[target]
  trainY = train[target]
  
  reg.fit(trainX, trainY)

  predY = reg.predict(testX)

  print(f'Coefficients: {reg.coef_}')
  print(f'Mean squared error: {mean_squared_error(testY, predY)}')
  print(f'Coefficient of determination: {r2_score(testY, predY)}')

  fig = go.Figure(data=go.Scatter(x=test[features], y=testY, mode='markers'))
  fig.add_trace(go.Scatter(x=test[features], y=predY))
  fig.show()
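
The function above splits the frame in row order. If the rows happen to be sorted in some way, an ordered split can leave the two halves with different distributions. As an alternative (not what this notebook does), scikit-learn's `train_test_split` shuffles before splitting. The sketch below uses a deliberately sorted toy frame; whether the renthop file is ordered in any meaningful way is not something this example shows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in frame (hypothetical), deliberately sorted by bedrooms.
toy = pd.DataFrame({'bedrooms': np.repeat([0, 1, 2, 3], 25),
                    'price': np.repeat([2000, 3000, 4000, 5000], 25)})

# Ordered split: the first half only ever contains 0 and 1 bedrooms.
ordered_half = toy.iloc[:len(toy) // 2]

# Shuffled split: rows are drawn at random, reproducibly via random_state.
shuffled_train, shuffled_test = train_test_split(toy, test_size=0.5, random_state=42)

print(sorted(set(ordered_half['bedrooms'])))
print(sorted(set(shuffled_train['bedrooms'])))
```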

In [393]:
linear_regression('bathrooms')

Coefficients: [2619.764562]
Mean squared error: 1650609.2966387523
Coefficient of determination: 0.4577470258454758


In [394]:
linear_regression('bed_bath')

Coefficients: [822.50269689]
Mean squared error: 1789448.034449351
Coefficient of determination: 0.41213616041598444


In [395]:
linear_regression('arbitrary_luxury_value')

Coefficients: [273.11281008]
Mean squared error: 2112317.2144357883
Coefficient of determination: 0.30606819298904775


In [396]:
linear_regression('bedrooms')

Coefficients: [867.05850907]
Mean squared error: 2189513.514811866
Coefficient of determination: 0.28070790720979255
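
Mean squared error is in squared dollars, which is hard to read. Taking the square root gives RMSE, which is in dollars like the target. The sketch below does this for the four MSE values printed above and sorts the features from best to worst.

```python
import math

# Test-set MSE values as printed in the cells above.
mse_by_feature = {
    'bathrooms': 1650609.2966387523,
    'bed_bath': 1789448.034449351,
    'arbitrary_luxury_value': 2112317.2144357883,
    'bedrooms': 2189513.514811866,
}

# RMSE is the square root of MSE, so it is in dollars.
rmse_by_feature = {name: math.sqrt(mse) for name, mse in mse_by_feature.items()}

# Print features from lowest to highest error.
for name, rmse in sorted(rmse_by_feature.items(), key=lambda item: item[1]):
    print(f'{name:<24} RMSE ${rmse:,.0f}')
```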


### Prediction function

In [397]:
def predict(feature, featureValue, df=df, target='price'):
  reg = LinearRegression()

  test, train = np.array_split(df, 2)
  testX = test[[feature]]
  trainX = train[[feature]]
  testY = test[target]
  trainY = train[target]
  
  reg.fit(trainX, trainY)

  predY = reg.predict([[featureValue]])

  estimate = predY[0]
  coefficient = reg.coef_[0]
  result = f'${estimate:,.0f} estimated {target} for {featureValue:g} {feature}.'
  explanation = f'In this linear regression, each one-unit increase in {feature} adds ${coefficient:,.0f} to the predicted {target}.'
  print(f'{result} \n {explanation}')

predict('bathrooms', 1.5)

$4,360 estimated price for 1.5 bathrooms. 
 In this linear regression, each one-unit increase in bathrooms adds $2,620 to the predicted price.
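
To show where the coefficient and the prediction come from, here is a sketch of one-feature ordinary least squares written out by hand. The toy data and the `predict_rent` helper are hypothetical. scikit-learn's `LinearRegression` also fits by ordinary least squares, but this is not its actual implementation.

```python
# Toy data (hypothetical): rent rises by exactly $2,000 per bathroom.
bathrooms = [1.0, 1.0, 2.0, 2.0, 3.0]
rent = [3000.0, 3000.0, 5000.0, 5000.0, 7000.0]

n = len(bathrooms)
mean_x = sum(bathrooms) / n
mean_y = sum(rent) / n

# Ordinary least squares for one feature:
# slope = covariance(x, y) / variance(x); intercept = mean_y - slope * mean_x
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(bathrooms, rent))
         / sum((x - mean_x) ** 2 for x in bathrooms))
intercept = mean_y - slope * mean_x

def predict_rent(value):
    """Hypothetical helper: evaluate the fitted line at one feature value."""
    return intercept + slope * value

# The gap between predictions one unit apart equals the slope.
print(predict_rent(1.5), predict_rent(2.5) - predict_rent(1.5))
```

The coefficient is the slope of the fitted line, so it describes the change in the prediction per one-unit change in the feature. It is an association in the data, not a guarantee that adding a bathroom raises the rent by that amount.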
