Lambda School Data Science

*Unit 2, Sprint 1, Module 1*

---

# Regression 1

## Assignment

You'll use another **New York City** real estate dataset. 

But now you'll **predict how much it costs to rent an apartment**, instead of how much it costs to buy a condo.

The data comes from renthop.com, an apartment listing website.

- [ ] Look at the data. Choose a feature, and plot its relationship with the target.
- [ ] Use scikit-learn for linear regression with one feature. You can follow the [5-step process from Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API).
- [ ] Define a function to make new predictions and explain the model coefficient.
- [ ] Organize and comment your code.

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.

If your **Plotly** visualizations aren't working:
- You must have JavaScript enabled in your browser
- You probably want to use Chrome or Firefox
- You may need to turn off ad blockers
- [If you're using Jupyter Lab locally, you need to install some "extensions"](https://plot.ly/python/getting-started/#jupyterlab-support-python-35)

## Stretch Goals
- [ ] Do linear regression with two or more features.
- [ ] Read [The Discovery of Statistical Regression](https://priceonomics.com/the-discovery-of-statistical-regression/)
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 2.1: What Is Statistical Learning?

In [0]:
import sys
# Choose the data path depending on whether this runs on Colab or locally
# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read New York City apartment rental listing data
import pandas as pd
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

In [0]:
# Remove outliers: 
# the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= 1375) & (df['price'] <= 15500) & 
        (df['latitude'] >= 40.57) & (df['latitude'] < 40.99) &
        (df['longitude'] >= -74.1) & (df['longitude'] <= -73.38)]
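
The cutoffs above are hard-coded. A minimal sketch of how similar cutoffs could be derived from percentiles, assuming "the most extreme 1%" means dropping the lowest 0.5% and the highest 0.5%. The `demo` frame is synthetic and only stands in for the real listings:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the listings (assumption: not the renthop data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"price": rng.lognormal(mean=8.1, sigma=0.5, size=10_000)})

# Keep the middle 99% of prices: drop the lowest 0.5% and the highest 0.5%
low, high = demo["price"].quantile([0.005, 0.995])
trimmed = demo[demo["price"].between(low, high)]

print(f"Kept {len(trimmed):,} of {len(demo):,} rows")
```

Computing the bounds from the data keeps the filter reproducible if the dataset changes, at the cost of the round, readable numbers used in the cell above.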

In [44]:
print(df.shape)
df.head(2)

(48818, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# Count the null values in each column
df.isnull().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
dtype: int64

In [0]:
# Finding the mean baseline
theMean = df["price"].mean()
theMean

3579.5609816051456

In [0]:
# Errors from guessing the mean baseline for every listing (actual price minus the mean)
myGuesses = df["price"] - theMean

In [0]:
print(f"If we just used the mean (${theMean:,.0f}) to guess the price for the apartments,")
print(f"then we would be off, on average, by ${myGuesses.abs().mean():,.0f} on each guess")

If we just used the mean ($3,580) to guess the price for the apartments,
then we would be off, on average, by $1,202 on each guess
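
The average miss printed above is the mean absolute error (MAE) of the mean baseline. As a small sketch, the same quantity can be computed with scikit-learn's `mean_absolute_error`. The `prices` array below is hypothetical and is not taken from the renthop data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical monthly rents (assumption: illustrative values only)
prices = np.array([2000, 2500, 3000, 3500, 6000])

# The baseline guesses the mean price for every listing
baseline = np.full_like(prices, prices.mean(), dtype=float)

# Manual MAE: average of the absolute errors
mae_manual = np.abs(prices - baseline).mean()

# Same metric via scikit-learn
mae_sklearn = mean_absolute_error(prices, baseline)

print(f"Baseline MAE: ${mae_sklearn:,.0f}")
```

Any model fitted later should beat this number to be worth using.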


In [0]:
# importing plotly
import plotly.express as px

In [0]:
# Plotting the feature (bedrooms) on x against the target (price) on y
px.scatter(df, x='bedrooms', y='price', trendline="ols")

In [18]:
# Trying another feature: new_construction (x) against price (y)
px.scatter(df, x="new_construction", y="price", trendline="ols")

In [21]:
# One last scatter plot with a different feature: bathrooms (x) against price (y)
px.scatter(df, x='bathrooms', y='price', trendline="ols")

In [0]:
# Importing the linear regression estimator from scikit-learn
from sklearn.linear_model import LinearRegression

In [0]:
# Instantiating the model
linRegModel = LinearRegression()

In [60]:
# The training data for the Model 
X = df[['bedrooms']]
y = df['price']
print(X.shape, y.shape)

(48818, 1) (48818,)


In [75]:
# Fitting the model, using the number of bedrooms as the training feature
linRegModel.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
x_test = [[3]]

In [93]:
# Using the fitted model to make a prediction
y_prediction = linRegModel.predict(x_test)
y_prediction

array([4827.73665176])

In [85]:
# Printing the prediction as a readable sentence
print(f"My prediction for the price of an apartment that has {x_test[0][0]} bedrooms is: ${y_prediction[0]:,.0f}")

My prediction for the price of an apartment that has 3 bedrooms is: $4,828
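
A one-feature linear model predicts the intercept plus the coefficient times the feature value. The sketch below checks this by hand on hypothetical data constructed so that price = 2000 + 800 × bedrooms. These numbers are assumptions for illustration, not the fitted values from this notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows price = 2000 + 800 * bedrooms exactly
bedrooms = np.array([[0], [1], [2], [3], [4]])
price = 2000 + 800 * bedrooms.ravel()

model = LinearRegression().fit(bedrooms, price)

# A prediction is the intercept plus the coefficient times the feature value
by_hand = model.intercept_ + model.coef_[0] * 3
by_model = model.predict([[3]])[0]

print(f"By hand: {by_hand:,.0f}, from predict(): {by_model:,.0f}")
```

Read this way, the coefficient is the change in predicted price for one extra bedroom, and the intercept is the predicted price at zero bedrooms.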


In [0]:
# Creating a function that predicts the monthly rent of an apartment
def myPredict(numBedrooms):
  '''
    Print the predicted monthly rent for an apartment with the given
    number of bedrooms, along with an explanation of the model coefficient.
  '''
  y_prediction = linRegModel.predict([[numBedrooms]])
  # getting the single prediction out of the returned array
  thePrediction = y_prediction[0]
  # getting the coefficient (slope) for the bedrooms feature
  theCoef = linRegModel.coef_[0]
  result = f"With {numBedrooms} bedrooms, we predict the price to be ${thePrediction:,.0f}"
  coefExp = f"For each additional bedroom, the predicted price increases by ${theCoef:,.0f}"

  # print() returns None, so print the message instead of returning it
  print(result + "\n" + coefExp)

In [99]:
# Trying out the "myPredict" function
myPredict(4)

With 4 bedrooms, we predict the price to be $5,681
For each additional bedroom, the predicted price increases by $853


In [100]:
myPredict(6)

With 6 bedrooms, we predict the price to be $7,387
For each additional bedroom, the predicted price increases by $853


In [103]:
from ipywidgets import interact
interact(myPredict, numBedrooms=(0, 40));


In [109]:
# Making a prediction with more than one feature
features = ['bedrooms', 'bathrooms']
target = 'price'
df[features].shape

(48818, 2)

In [0]:
linModel = LinearRegression()

In [107]:
# Fitting the model
linModel.fit(df[features], df[target])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [108]:
# Making the predictions with some new data
# Using 3 bedrooms and 2 bathrooms
y_predict = linModel.predict([[3,2]])
y_predict

array([5818.43891708])
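
With more than one feature, `coef_` holds one coefficient per column, in the same order as the columns passed to `fit`. A minimal sketch of labelling each coefficient with its feature name, using a hypothetical `demo` frame built from price = 1500 + 400 × bedrooms + 2000 × bathrooms (assumed values, not the ones fitted above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical listings built from a known linear rule
demo = pd.DataFrame({
    "bedrooms":  [0, 1, 2, 3, 2, 4],
    "bathrooms": [1, 1, 1, 2, 2, 3],
})
demo["price"] = 1500 + 400 * demo["bedrooms"] + 2000 * demo["bathrooms"]

features = ["bedrooms", "bathrooms"]
model = LinearRegression().fit(demo[features], demo["price"])

# Pair each coefficient with its feature name so the order is never ambiguous
coefficients = pd.Series(model.coef_, index=features)
print(coefficients.round(0))
```

Each coefficient is the change in predicted price for a one-unit increase in that feature while the other feature is held constant.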

In [0]:
# Changing the function slightly to handle the two features
def myNewPredictions(numbaths, numBedrooms):
  # making the prediction; the column order must match `features`
  # (bedrooms first, then bathrooms)
  y_pred = linModel.predict([[numBedrooms, numbaths]])
  # getting the coefficient for bedrooms (the first feature)
  theCoef = linModel.coef_[0]
  # getting the single prediction out of the returned array
  thePred = y_pred[0]

  result = f"The predicted cost of the apartment is ${thePred:,.0f}"
  explan = f"Each additional bedroom adds ${theCoef:,.0f} to the prediction, holding bathrooms constant"
  print(result + "\n" + explan)

In [118]:
myNewPredictions(3,7)

The predicted cost of the apartment is $9,458
Each additional bedroom adds $385 to the prediction, holding bathrooms constant


In [120]:
interact(myNewPredictions, numbaths=(0,10), numBedrooms=(0,20));
