Lambda School Data Science

*Unit 2, Sprint 1, Module 1*

---

# Regression & Classification, Module 1

- Clean your data and remove outliers
- Plot relationship between feature and target
- Begin with baselines for regression
- Use scikit-learn to fit a linear regression
- Make new predictions and explain coefficients

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).


In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module1')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

# Explore your data and remove outliers

## Overview

### Predict how much a NYC condo costs 🏠💸

[Amateurs & Experts Guess How Much a NYC Condo With a Private Terrace Costs](https://www.youtube.com/watch?v=JQCctBOgH9I)

> Real Estate Agent Leonard Steinberg just sold a pre-war condo in New York City's Tribeca neighborhood. We challenged three people - an apartment renter, an apartment owner and a real estate expert - to try to guess how much the apartment sold for. Leonard reveals more and more details to them as they refine their guesses.


The condo is 1,497 square feet.

Here are the final guesses:

- Apartment Renter: \$15 million
- Apartment Buyer: \$2.2 million
- Real Estate Expert: \$2.2 million

Let's see how we compare!

First, we need data:

- [Kaggle has NYC property sales data](https://www.kaggle.com/new-york-city/nyc-property-sales), but it's not up-to-date.
- The data comes from the [New York City Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page). There's also a glossary of property sales terms and NYC Building Class Code Descriptions.
- The data can also be found on the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.

## Follow Along

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv('../data/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# Get Pandas Profiling Report
df.profile_report()

In [0]:
# Does this data include the Tribeca neighborhood?
mask = df['NEIGHBORHOOD'].str.contains('TRIBECA')
df[mask]

In [0]:
# Keep this subset
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df = df[mask].copy()

# Down from > 20k rows to < 150
df.shape

In [0]:
# Why so few property sales in Tribeca?
# Check the date range
# It's just the first 4 months of 2019
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df['SALE_DATE'].describe()

In [0]:
# See the Pandas Profiling Report:
# SALE_PRICE was read as strings
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # regex=False: treat '$' as a literal character
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [0]:
# SALE_PRICE is now a number, but hard to read in scientific notation
# Format numbers: Comma separator for thousands. Zero decimals
pd.options.display.float_format = '{:,.0f}'.format
df.describe()

In [0]:
# Max SALE_PRICE is $260 million!
# Look at this observation.
# Equivalent to: df[df['SALE_PRICE']==260000000]
df.query('SALE_PRICE == 260000000')

In [0]:
# How many property sales were for multiple units?
df['TOTAL_UNITS'].value_counts()

In [0]:
# Keep only the single units

# Would this be better though?
# df = df[(df.RESIDENTIAL_UNITS==1) & (df.COMMERCIAL_UNITS==0)]

df = df.query('TOTAL_UNITS==1')
df.describe()

In [0]:
# Now max sales price is $39 million
# Look at this observation
# It's huge, over 8,000 square feet
# Maybe it's legit, just a huge, expensive condo
df.query('SALE_PRICE == 39285000')

In [0]:
# The min sales price is $0

# There's a glossary here: 
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

# It says:

# A $0 sale indicates that there was a transfer of ownership without a 
# cash consideration. There can be a number of reasons for a $0 sale including 
# transfers of ownership from parents to children. 

# How often did $0 sales occur in this subset of the data?
len(df.query('SALE_PRICE == 0'))

In [0]:
# 75th percentile for gross square feet is 2,500
# But the max is over 39,000 square feet.

# Look at property sales for > 5,000 square feet

# We see the same condo we looked at earlier,
# Plus two property sales at 39,567 square feet.
# TOTAL_UNITS = 1, but BUILDING_CLASS_CATEGORY = 2-10 UNIT RESIDENTIAL
# Is this dirty data?

df.query('GROSS_SQUARE_FEET > 5000')

In [0]:
# What are the building class categories?
# How frequently does each occur?

df['BUILDING_CLASS_CATEGORY'].value_counts()

In [0]:
# Keep subset of rows:
# Sale price more than $0, 
# Building class category = Condos - Elevator Apartments

mask = (df['SALE_PRICE'] > 0) & (df['BUILDING_CLASS_CATEGORY'] == '13 CONDOS - ELEVATOR APARTMENTS')
df = df[mask]

# Down to 106 rows
df.shape

## Challenge

In your assignment, you'll use another New York City real estate dataset. You'll predict how much it costs to rent an apartment, instead of how much it costs to buy a condo.

The data comes from renthop.com, an apartment listing website. When you do the assignment, look at the data. What's the distribution of the target, `price`, and features such as `longitude` and `latitude`? Remove outliers.
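
As a rough sketch of what "look at the distribution, then remove outliers" can look like, the example below uses a tiny, made-up table with `price`, `latitude`, and `longitude` columns. The values and the cutoffs are assumptions chosen by eye for this toy data, not recommendations for the real assignment data.

```python
import pandas as pd

# Hypothetical toy listings; the real assignment data will differ.
listings = pd.DataFrame({
    'price': [2500, 3100, 2800, 4500000, 43, 3600, 2950, 3300],
    'latitude': [40.72, 40.75, 40.71, 40.73, 40.74, 0.0, 40.76, 40.70],
    'longitude': [-73.99, -73.98, -74.00, -73.97, -73.99, 0.0, -73.96, -73.98],
})

# Look at the distribution of the target and features first
print(listings.describe())

# Assumed cutoffs for this toy data: keep plausible rents and
# coordinates that fall roughly around New York City
mask = (
    listings['price'].between(1000, 15000) &
    listings['latitude'].between(40.5, 41.0) &
    listings['longitude'].between(-74.3, -73.7)
)
trimmed = listings[mask]

# Compare the number of rows before and after
print(listings.shape, trimmed.shape)
```

Where to draw the cutoffs is a judgment call, just like the sale price threshold in the lesson; percentiles from `describe()` are one reasonable starting point.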

# Plot relationship between feature and target

## Overview

- [Plotly Express](https://plot.ly/python/plotly-express/) examples
- [plotly_express.scatter](https://www.plotly.express/plotly_express/#plotly_express.scatter) docs

## Follow Along

In [0]:
# Scatter plot
import plotly.express as px
px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE')

In [0]:
# With OLS (Ordinary Least Squares) trendline,
# The outliers influence the "line of best fit"
px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE', trendline='ols')

In [0]:
# Look at sales for more than $35 million

# All are at 70 Vestry Street
# All but one have the same SALE_PRICE & SALE_DATE
# Was the SALE_PRICE for each? Or in total?
# Is this dirty data?

df.query('SALE_PRICE > 35000000')

In [0]:
# Make a judgment call:
# Keep rows where sale price was < $35 million
df = df.query('SALE_PRICE < 35000000')

# Down to 90 rows
df.shape

## Challenge

In your assignment, you will choose a feature, and plot its relationship with the target.

# Begin with baselines for regression

## Overview

Now that I've subset my data and removed what I consider to be outliers, how would I _describe_ the sale price before I build a _predictive_ model?

Before we _predict_ it, we just want to be able to _describe_ it: with visualizations, like we did before, and with pandas methods like the ones you've used throughout Lambda School.

So I can do `df['SALE_PRICE'].describe()` and see that the mean sale price in my dataset is now \$3.9 million, with those outliers excluded.

The mean is still higher than the median, but it would have been much higher if I hadn't excluded those outliers. So it all depends on our assumptions.

But here's the interesting thing: **we could use this one number as a "prediction", if we didn't have any data except for sales price...**

Imagine we didn't have any other information about condos: only sale prices like these, and none of the other columns. What would you tell somebody who asked, "How much do you think a condo in Tribeca costs?"

You could say, "Well, I've got 90 sales prices here, and I see that on average they cost \$3.9 million."

We do this all the time in the real world: we use descriptive statistics for prediction. And that's not wrong or bad. In fact, **that's where you should start. This is called the _mean baseline_.**
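
To make the idea concrete, here is a minimal sketch of a mean baseline on a handful of made-up sale prices (these are not the Tribeca numbers). It also computes the average absolute error of that single guess, which is one possible way to judge how far off the baseline tends to be.

```python
import pandas as pd

# Hypothetical sale prices, for illustration only
prices = pd.Series([1_000_000, 2_000_000, 3_000_000, 6_000_000])

# The mean baseline: guess the same number for every property
baseline = prices.mean()

# How far off is that guess, on average?
errors = prices - baseline
mean_absolute_error = errors.abs().mean()

print(f'Mean baseline: {baseline:,.0f}')
print(f'Mean absolute error of the baseline: {mean_absolute_error:,.0f}')
```

Any model we fit later should beat this number to be worth the extra complexity.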

## Follow Along

In [0]:
# The Mean Sale Price is $3.9 million.
# If we didn't use any data except sales prices,
# we could "predict" that Tribeca condos cost about $3.9 million.
# This is called the "mean baseline."
df['SALE_PRICE'].describe()

In [0]:
# But we *can* use more data beyond the sales price.
# Re-plot the relationship between gross square feet and sale price.
px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE', trendline='ols')

## Challenge

In your assignment, after you remove outliers, what is the mean price in your subset of the data?

# Use scikit-learn to fit a linear regression

## Overview

#### Jake VanderPlas, [_Python Data Science Handbook_, Chapter 5.2: Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API)

The best way to think about data within Scikit-Learn is in terms of tables of data. 

![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.02-samples-features.png)

The features matrix is often stored in a variable named `X`. The features matrix is assumed to be two-dimensional, with shape `[n_samples, n_features]`, and is most often contained in a NumPy array or a Pandas `DataFrame`.

We also generally work with a label or target array, which by convention we will usually call `y`. The target array is usually one dimensional, with length `n_samples`, and is generally contained in a NumPy array or Pandas `Series`. The target array may have continuous numerical values, or discrete classes/labels. 

The target array is the quantity we want to _predict from the data_: in statistical terms, it is the dependent variable. 
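
The shapes described above can be checked directly in pandas. This is a small sketch on a made-up table; the column names `square_feet` and `price` are assumptions for illustration. Selecting with a list of column names gives a two-dimensional `DataFrame`, while selecting a single column name gives a one-dimensional `Series`.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration
df_toy = pd.DataFrame({
    'square_feet': [800, 1200, 1500],
    'price': [900_000, 1_400_000, 2_000_000],
})

X = df_toy[['square_feet']]  # list of columns -> 2-D features matrix
y = df_toy['price']          # single column -> 1-D target vector

print(X.shape)  # (3, 1): n_samples rows, n_features columns
print(y.shape)  # (3,): n_samples values
```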

Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.

Most commonly, the steps in using the Scikit-Learn estimator API are as follows:

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the `fit()` method of the model instance.
5. Apply the Model to new data: For supervised learning, often we predict labels for unknown data using the `predict()` method.

## Follow Along

In [0]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class
model = LinearRegression()

# 3. Arrange X features matrix & y target vector
features = ['GROSS_SQUARE_FEET']
target = 'SALE_PRICE'
X = df[features]
y = df[target]

# 4. Fit the model
model.fit(X, y)

# 5. Apply the model
y_pred = model.predict(X)

## Challenge

In your assignment, you will use scikit-learn for linear regression with one feature. For a stretch goal, you can do linear regression with two or more features.

# Make new predictions and explain coefficients

## Overview


Now that we've generated our predictions, we can look at a plot of actual prices versus predicted prices.



## Follow Along

In [0]:
# plot actual vs predicted
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.scatter(X, y_pred);

In [0]:
# Interpolate
plt.scatter(X, y)
plt.plot(X, y_pred);

**What pattern did the model "learn", about the relationship between square feet & price?**

To help answer this question, we'll look at the  `coef_` and `intercept_` attributes of the `LinearRegression` object, which are documented here: 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [0]:
model.coef_[0]

In [0]:
model.intercept_

In [0]:
# Equations for a line
print('y = mx + b')
print(f'y = {model.coef_[0]}*x + {model.intercept_}')
print(f'price = {model.coef_[0]}*square_feet + {model.intercept_}')
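
One way to convince yourself that `coef_` and `intercept_` really are the slope and intercept of the fitted line is to compute a prediction by hand and compare it with `predict()`. The sketch below uses made-up points that lie exactly on `y = 2x + 1`; the names `toy_model`, `X_toy`, and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points that lie exactly on the line y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression().fit(X_toy, y_toy)
m = toy_model.coef_[0]     # slope
b = toy_model.intercept_   # intercept

# Compute a prediction by hand and with the model
x_new = 10.0
by_hand = m * x_new + b
by_model = toy_model.predict([[x_new]])[0]

print(f'slope = {m:.2f}, intercept = {b:.2f}')
print(f'by hand: {by_hand:.2f}, by model: {by_model:.2f}')
```

With one feature, the two numbers should agree up to floating point rounding.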

In [0]:
# 5. Apply the model to *new/unknown* data
def predict(square_feet):
    y_pred = model.predict([[square_feet]])
    return y_pred[0]

predict(1497)

In [0]:
# Give nicer output
def predict(square_feet):
    y_pred = model.predict([[square_feet]])
    estimate = y_pred[0]
    coefficient = model.coef_[0]
    
    # Format with $ and comma separators. No decimals.
    result = f'${estimate:,.0f} estimated price for {square_feet:,.0f} square foot condo in Tribeca.'
    explanation = f' Each additional square foot is associated with a ${coefficient:,.0f} increase in this model.'
    return result + explanation

predict(1497)

In [0]:
# What does the model predict for low square footage?
predict(500)

In [0]:
# For high square footage?
predict(10000)

In [0]:
# 500 and 10,000 square feet are outside the min & max of the data the model was fit on,
# but predictive models assume future data will have a similar distribution.
df['SALE_PRICE'].describe()

In [0]:
df['GROSS_SQUARE_FEET'].describe()

In [0]:
predict(630)
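
Since predictions outside the range of the training data deserve extra suspicion, it can help to flag them. Below is a small sketch of one possible check; the function name `is_extrapolating` and the toy square footage values are hypothetical, not part of the lesson's code.

```python
import pandas as pd

# Hypothetical square footage values a model might have been fit on
train_sqft = pd.Series([630, 900, 1500, 2400, 5000])

def is_extrapolating(square_feet, train_values=train_sqft):
    """Return True when square_feet falls outside the range seen in training."""
    return not (train_values.min() <= square_feet <= train_values.max())

print(is_extrapolating(500))    # True: below the smallest training value
print(is_extrapolating(1497))   # False: inside the training range
print(is_extrapolating(10000))  # True: above the largest training value
```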

In [0]:
# Re-run the prediction function interactively
from ipywidgets import interact
interact(predict, square_feet=(630, 5000)); # (min, max)

### Organize & comment code

Putting it all together!

In [0]:
from ipywidgets import interact
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read New York City property sales data, from first 4 months of 2019.
# Dataset has 23040 rows, 21 columns.
df = pd.read_csv('../data/NYC_Citywide_Rolling_Calendar_Sales.csv')
assert df.shape == (23040, 21)

# Change column names. Replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# Remove symbols from SALE_PRICE string, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # regex=False: treat '$' as a literal character
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

# Keep subset of rows:
# Tribeca neighborhood, Condos - Elevator Apartments, 
# 1 unit, sale price more than $0, less than $35 million
mask = (
    (df['NEIGHBORHOOD'].str.contains('TRIBECA')) & 
    (df['BUILDING_CLASS_CATEGORY'] == '13 CONDOS - ELEVATOR APARTMENTS') &
    (df['TOTAL_UNITS'] == 1) & 
    (df['SALE_PRICE'] > 0) & 
    (df['SALE_PRICE'] < 35000000)
)
df = df[mask].copy()  # copy so the SALE_DATE assignment below doesn't warn

# Data now has 90 rows, 21 columns
assert df.shape == (90, 21)

# Convert SALE_DATE to datetime
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

# Arrange X features matrix & y target vector
features = ['GROSS_SQUARE_FEET']
target = 'SALE_PRICE'
X = df[features]
y = df[target]

# Fit model
model = LinearRegression()
model.fit(X, y)

# Predict
def predict(square_feet):
    y_pred = model.predict([[square_feet]])
    estimate = y_pred[0]
    coefficient = model.coef_[0]
    
    # Format with $ and comma separators. No decimals.
    result = f'${estimate:,.0f} estimated price for {square_feet:,.0f} square foot condo in Tribeca.'
    explanation = f' Each additional square foot adds about ${coefficient:,.0f}'
    return result + explanation

interact(predict, square_feet=(630, 5000)); # (min, max)

### How'd we do? ...

[Amateurs & Experts Guess How Much a NYC Condo With a Private Terrace Costs](https://www.youtube.com/watch?v=JQCctBOgH9I)

> Real Estate Agent Leonard Steinberg just sold a pre-war condo in New York City's Tribeca neighborhood. We challenged three people - an apartment renter, an apartment owner and a real estate expert - to try to guess how much the apartment sold for. Leonard reveals more and more details to them as they refine their guesses.


The condo is 1,497 square feet.

Here are the final guesses from the video:

- Apartment Renter: \$15 million
- Apartment Buyer: \$2.2 million
- Real Estate Expert: \$2.2 million

Here's our prediction from our model:

In [0]:
predict(1497)

___The final answer is revealed in [the video at 12:28](https://youtu.be/JQCctBOgH9I?t=748)!___

What was the error for our prediction, versus the people in the video?

## Challenge

In your assignment, you will define a function to make new predictions and explain the model coefficient.

# Review


You'll practice these objectives when you do your assignment:

- Clean your data and remove outliers
- Plot relationship between feature and target
- Begin with baselines for regression
- Use scikit-learn to fit a linear regression
- Make new predictions and explain coefficients


You'll use another New York City real estate dataset. You'll predict how much it costs to rent an apartment, instead of how much it costs to buy a condo.

You've been provided with a separate notebook for your assignment, which has all the instructions and stretch goals. Good luck and have fun!

# Sources

#### Plotly Express
- [Plotly Express](https://plot.ly/python/plotly-express/) examples
- [plotly_express.scatter](https://www.plotly.express/plotly_express/#plotly_express.scatter) docs

#### NYC Real Estate
- [Amateurs & Experts Guess How Much a NYC Condo With a Private Terrace Costs](https://www.youtube.com/watch?v=JQCctBOgH9I)
- [Kaggle has NYC property sales data](https://www.kaggle.com/new-york-city/nyc-property-sales), but it's not up-to-date.
- The data comes from the [New York City Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page). There's also a glossary of property sales terms and NYC Building Class Code Descriptions.
- The data can also be found on the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.

#### Scikit-Learn
- Jake VanderPlas, [_Python Data Science Handbook_, Chapter 5.2: Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API)
- [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)