Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 3

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`) using a subset of the data where the **sale price was more than \\$100 thousand and less than \\$2 million.**

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


- [ ] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do exploratory visualizations with Seaborn.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a linear regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
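
As a rough sketch of how the feature selection, model fitting, and scoring steps in the checklist could fit together, here is a small example on synthetic data (not the NYC dataset). The array sizes, the coefficients, and `k=2` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 5 numeric features, only the first two drive the target
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 5))
X_test = rng.normal(size=(50, 5))
coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y_train = X_train @ coef + rng.normal(scale=0.1, size=200)
y_test = X_test @ coef + rng.normal(scale=0.1, size=50)

# Fit the selector on the training set only, then reuse it on the test set
selector = SelectKBest(score_func=f_regression, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Fit a linear regression on the selected features and score it on the test set
model = LinearRegression()
model.fit(X_train_selected, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_selected))

selected_columns = selector.get_support(indices=True)
print('Selected feature indices:', selected_columns)
print('Test MAE:', mae)
```

Fitting the selector on the training rows alone keeps information from the test month out of the feature choice.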


## Stretch Goals
- [ ] Add your own stretch goal(s)!
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way (without an excessive amount of formulas or academic prerequisites). (That book is good regardless of whether your cultural worldview is inferential statistics or predictive machine learning.)
- [ ] Read Leo Breiman's paper, ["Statistical Modeling: The Two Cultures"](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
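
The points quoted above can be sketched with a small pipeline that chains scaling, `SelectKBest`, and linear regression, then grid-searches over `k`. This is only an illustration on synthetic data; the data shape, the coefficients, and the candidate values of `k` are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 features, the first three carry signal
rng = np.random.RandomState(42)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=300)

# One object that scales, selects features, and fits the regression
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression),
    LinearRegression(),
)

# make_pipeline names each step after its lowercased class name,
# so the selector's k is addressed as 'selectkbest__k'
param_grid = {'selectkbest__k': [1, 2, 3, 4, 5, 6]}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring='neg_mean_absolute_error')
search.fit(X, y)

best_k = search.best_params_['selectkbest__k']
print('Best k:', best_k)
print('Cross-validated MAE:', -search.best_score_)
```

Because the scaler and selector live inside the pipeline, each cross-validation fold refits them on that fold's training rows only.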

In [51]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module3')

Requirement already up-to-date: category_encoders in /usr/local/lib/python3.6/dist-packages (2.0.0)
Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.3.0)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)
Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [53]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv('../data/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

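# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character rather than a regex anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
# Strip thousands separators from LAND_SQUARE_FEET.
# Note: the column is still stored as strings after this step.
df['LAND_SQUARE_FEET'] = df['LAND_SQUARE_FEET'].str.replace(',', '', regex=False)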
df.head(8)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,CHELSEA,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,FASHION,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,FASHION,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,GREENWICH VILLAGE-WEST,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019
5,1,UPPER EAST SIDE (79-96),07 RENTALS - WALKUP APARTMENTS,2B,1551,131,,C4,354 EAST 89TH STREET,,10128.0,10.0,0.0,10.0,2013,6570.0,1920.0,2,C4,0,01/01/2019
6,1,UPPER WEST SIDE (96-116),07 RENTALS - WALKUP APARTMENTS,2B,1891,159,,C4,304 WEST 106 STREET,,10025.0,10.0,0.0,10.0,1716,5810.0,1900.0,2,C4,0,01/01/2019
7,2,MORRIS PARK/VAN NEST,01 ONE FAMILY DWELLINGS,1,4090,37,,A1,1193 SACKET AVENUE,,10461.0,1.0,0.0,1.0,3404,1328.0,1925.0,1,A1,0,01/01/2019
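
The cleaning cell above removes commas from `LAND_SQUARE_FEET` but leaves the values as strings. A minimal sketch of one way to finish the conversion with `pd.to_numeric` follows. The raw values here are hypothetical examples in the style of the two columns, not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw values in the style of SALE_PRICE and LAND_SQUARE_FEET
raw = pd.DataFrame({
    'SALE_PRICE': ['$550,000', '$0', '$1,200,000'],
    'LAND_SQUARE_FEET': ['6,800', '4,000', None],
})

clean = raw.copy()

# Strip literal '$' and ',' characters, then convert to integers
clean['SALE_PRICE'] = (
    clean['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

# errors='coerce' turns anything unparseable into NaN instead of raising
clean['LAND_SQUARE_FEET'] = pd.to_numeric(
    clean['LAND_SQUARE_FEET'].str.replace(',', '', regex=False),
    errors='coerce',
)

print(clean.dtypes)
```

With a numeric column, a downstream encoder has no reason to treat each distinct lot size as its own category.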


In [54]:
print(df.shape)

(23040, 21)


In [55]:
#df.loc[df['LAND_SQUARE_FEET'].isnull()]['LAND_SQUARE_FEET']
df.head(10)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,CHELSEA,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,FASHION,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,FASHION,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,GREENWICH VILLAGE-WEST,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019
5,1,UPPER EAST SIDE (79-96),07 RENTALS - WALKUP APARTMENTS,2B,1551,131,,C4,354 EAST 89TH STREET,,10128.0,10.0,0.0,10.0,2013,6570.0,1920.0,2,C4,0,01/01/2019
6,1,UPPER WEST SIDE (96-116),07 RENTALS - WALKUP APARTMENTS,2B,1891,159,,C4,304 WEST 106 STREET,,10025.0,10.0,0.0,10.0,1716,5810.0,1900.0,2,C4,0,01/01/2019
7,2,MORRIS PARK/VAN NEST,01 ONE FAMILY DWELLINGS,1,4090,37,,A1,1193 SACKET AVENUE,,10461.0,1.0,0.0,1.0,3404,1328.0,1925.0,1,A1,0,01/01/2019
8,2,MORRIS PARK/VAN NEST,01 ONE FAMILY DWELLINGS,1,4120,18,,A5,1215 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019
9,2,MORRIS PARK/VAN NEST,01 ONE FAMILY DWELLINGS,1,4120,20,,A5,1211 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019


In [0]:
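# Note: dropna(inplace=True) on a selected column acts on that Series only;
# it does not remove rows from df, so the shape printed below is unchanged.
df['LAND_SQUARE_FEET'].dropna(inplace=True)
df['SALE_PRICE'].dropna(inplace=True)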

In [57]:
print(df.shape)

(23040, 21)
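
Calling `dropna(inplace=True)` on a single selected column works on that Series, so it cannot remove rows from `df`, which is consistent with the unchanged shape above. If the intent is to drop rows with missing values in specific columns, `DataFrame.dropna` with `subset=` is one option. The toy frame below is a made-up example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing LAND_SQUARE_FEET value
toy = pd.DataFrame({
    'LAND_SQUARE_FEET': [6800.0, np.nan, 3500.0],
    'SALE_PRICE': [550000, 200000, 810000],
})

# dropna on the DataFrame with subset= removes whole rows and returns a new frame
toy_clean = toy.dropna(subset=['LAND_SQUARE_FEET', 'SALE_PRICE'])

print(toy.shape)        # the original frame is untouched
print(toy_clean.shape)  # the row with the missing value is gone
```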


In [58]:
print(df['LAND_SQUARE_FEET'].isnull().sum())
print(df['SALE_PRICE'].isnull().sum())

0
0


In [59]:
print(df)

       BOROUGH              NEIGHBORHOOD  ... SALE_PRICE   SALE_DATE
0            1                   CHELSEA  ...          0  01/01/2019
1            1                   FASHION  ...          0  01/01/2019
2            1                   FASHION  ...          0  01/01/2019
3            1    GREENWICH VILLAGE-WEST  ...          0  01/01/2019
4            1   UPPER EAST SIDE (59-79)  ...          0  01/01/2019
5            1   UPPER EAST SIDE (79-96)  ...          0  01/01/2019
6            1  UPPER WEST SIDE (96-116)  ...          0  01/01/2019
7            2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
8            2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
9            2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
10           2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
11           2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
12           2      MORRIS PARK/VAN NEST  ...          0  01/01/2019
13           2      MORRIS PARK/VA

In [60]:
# Convert SALE_DATE to pandas datetime
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df.head(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,CHELSEA,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,2019-01-01
1,1,FASHION,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,2019-01-01
2,1,FASHION,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,2019-01-01
3,1,GREENWICH VILLAGE-WEST,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,2019-01-01
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,2019-01-01


In [0]:
import numpy as np

# Keep only one-family dwellings
classmask = df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS'
df = df[classmask]

# Keep sales priced above $100 thousand and below $2 million
pricemask = (df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000)
df = df[pricemask]

# Split by month of sale: April is the test set, every other month is the training set
is_april = df['SALE_DATE'].dt.month == 4
dftest = df[is_april]
dftrain = df[~is_april]

In [62]:
dftrain.head(3)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OCEAN PARKWAY-NORTH,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,QUEENS VILLAGE,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-01-01
78,2,PELHAM PARKWAY SOUTH,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,2019-01-02
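
The split above puts every non-April sale in the training set, which matches "January through March" only if the file holds no sales from later months. A sketch of a split with explicit date bounds is shown here on made-up rows. The cutoff timestamps and the `sales` frame are assumptions for illustration.

```python
import pandas as pd

# Made-up sales spanning January to April 2019
sales = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(['2019-01-15', '2019-02-20', '2019-03-31',
                                 '2019-04-01', '2019-04-30']),
    'SALE_PRICE': [550000, 200000, 810000, 430000, 990000],
})

train_start = pd.Timestamp('2019-01-01')
test_start = pd.Timestamp('2019-04-01')
test_end = pd.Timestamp('2019-05-01')

# Train on January through March, test on April
train = sales[(sales['SALE_DATE'] >= train_start) & (sales['SALE_DATE'] < test_start)]
test = sales[(sales['SALE_DATE'] >= test_start) & (sales['SALE_DATE'] < test_end)]

print(len(train), len(test))
```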


In [0]:
# More imports
import plotly.express as px
from sklearn.linear_model import LinearRegression
import category_encoders as ce

# Cast year and borough to integers.
# Note: dftrain and dftest were created earlier, so they do not pick up these casts.
df['YEAR_BUILT'] = df['YEAR_BUILT'].astype(int)
df['BOROUGH'] = df['BOROUGH'].astype(int)

In [64]:
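# SALE_PRICE has no missing values (see the null counts above), so this fill changes nothing.
dftrain['SALE_PRICE'].fillna((dftrain['SALE_PRICE'].mean()), inplace=True)
# These columns are dropped from df only; dftrain and dftest keep them.
df = df.drop(columns=['EASE-MENT', 'APARTMENT_NUMBER'])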



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [0]:
# Instantiate a one-hot encoder (not used by the pipeline below, which uses OrdinalEncoder)
onehotshot = ce.OneHotEncoder()
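
An encoder has to be instantiated and fitted before it can transform anything. As a hedged illustration of one-hot encoding, the sketch below swaps in scikit-learn's `OneHotEncoder` in place of `category_encoders`. The category values are hypothetical, and `handle_unknown='ignore'` is one possible choice for categories that appear only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values; '2' never appears in the training rows
train_cat = pd.DataFrame({'TAX_CLASS_AT_PRESENT': ['1', '1D', '1', '1D']})
test_cat = pd.DataFrame({'TAX_CLASS_AT_PRESENT': ['1', '2']})

# Learn the categories from the training rows only
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train_cat).toarray()

# Unseen categories become all-zero rows instead of raising an error
test_encoded = encoder.transform(test_cat).toarray()

print(encoder.categories_)
print(test_encoded)
```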

In [0]:
features = ['LAND_SQUARE_FEET', 'BOROUGH', 'TAX_CLASS_AT_PRESENT',
            'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE',
            'TOTAL_UNITS', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS']
target = 'SALE_PRICE'
X = dftrain[features]
y = dftrain[target]
testX = dftest[features]
testy = dftest[target]

In [0]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline

In [68]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    IterativeImputer(), 
    LinearRegression()
)
pipeline.fit(X,y)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['LAND_SQUARE_FEET',
                                      'TAX_CLASS_AT_PRESENT'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'LAND_SQUARE_FEET',
                                          'data_type': dtype('O'),
                                          'mapping': 6800      1
4000      2
3500      3
1710      4
2000      5
3000      6
1800      7
5000      8
2400      9
7000     10
3700     11
2500     12
2435     13
760      14
3920     15
2626     16
1383     17
2200     18
3570     19
8...
                 IterativeImputer(add_indicator=False, estimator=None,
                                  imputation_order='ascending',
                                  initial_strategy='mean', max_iter=10,
                                  max_value=None, min_value

In [0]:
y_pred = pipeline.predict(testX)

In [0]:
preddf = dftest.copy()
preddf['Prediction'] = y_pred

In [75]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
r2_score(preddf['SALE_PRICE'],preddf['Prediction'])

0.22563200966447328

In [74]:
mean_absolute_error(preddf['SALE_PRICE'],preddf['Prediction'])

184428.85436058466
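
A mean absolute error is easier to judge next to a baseline. One common reference point is a model that always predicts the mean training price. The sketch below uses hypothetical prices and predictions, not values from this notebook, to show the comparison.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, not taken from the dataset
y_train = np.array([400000, 550000, 620000, 810000])
y_test = np.array([500000, 700000])
model_pred = np.array([520000, 650000])  # stand-in for pipeline.predict(testX)

# Baseline: predict the mean training price for every test row
baseline_pred = np.full(len(y_test), y_train.mean())

baseline_mae = mean_absolute_error(y_test, baseline_pred)
model_mae = mean_absolute_error(y_test, model_pred)

print('Baseline MAE:', baseline_mae)
print('Model MAE:', model_mae)
```

If the model's error is not clearly below the baseline's, the chosen features are adding little.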

In [84]:
fig = px.scatter(preddf, x="GROSS_SQUARE_FEET", y="Prediction",
                 title='Price Predictions',color="YEAR_BUILT",
                 color_continuous_scale=px.colors.sequential.Viridis
                )
fig.show()