<a href="https://colab.research.google.com/github/scottwmwork/DS-Unit-2-Regression-Classification/blob/master/module3/assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 3

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`) using a subset of the data where the **sale price was more than \\$100 thousand and less than $2 million.** 
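
The subset described above can be expressed as a boolean mask. A minimal sketch on a toy DataFrame follows; the rows are invented for illustration, and only the column names and the one-family category label come from the assignment.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    'BUILDING_CLASS_CATEGORY': [
        '01 ONE FAMILY DWELLINGS',
        '01 ONE FAMILY DWELLINGS',
        '02 TWO FAMILY DWELLINGS',
        '01 ONE FAMILY DWELLINGS',
    ],
    'SALE_PRICE': [550_000, 0, 800_000, 2_500_000],
})

# Keep one-family dwellings priced strictly between 100 thousand and 2 million
mask = (
    (toy['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')
    & (toy['SALE_PRICE'] > 100_000)
    & (toy['SALE_PRICE'] < 2_000_000)
)
subset = toy[mask]
print(subset)
```

The same mask could be applied to `train` and `test` after the date split, or to `df` before it.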

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do exploratory visualizations with Seaborn.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a linear regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.


## Stretch Goals
- [ ] Add your own stretch goal(s)!
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory presented in an accessible, readable way (without an excessive number of formulas or academic prerequisites). That book is good regardless of whether your cultural worldview is inferential statistics or predictive machine learning.
- [ ] Read Leo Breiman's paper, ["Statistical Modeling: The Two Cultures"](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
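
The chaining described in the quote can be sketched as follows. This is a minimal example on synthetic data, not the assignment's solution: the array shapes, coefficients, and `k=2` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, only the first two drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Scaling, feature selection, and regression are fit together,
# using statistics from the training rows only
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=2),
    LinearRegression(),
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test MAE: {mae:.3f}')
```

A single `fit` call replaces the separate `fit_transform` / `transform` steps for each stage, which is the "convenience and encapsulation" point in the quote.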

In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module3')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv('../data/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

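# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character, not a regex end-of-string anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)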

# df = df[df['LAND_SQUARE_FEET'].isna() == False]

# df['LAND_SQUARE_FEET'] = (
#     df['LAND_SQUARE_FEET']
#     .str.replace(',','')
#     .str.replace('########','0')
#     .astype(int)
# )

# df = df[df['LAND_SQUARE_FEET'] > 0]


In [0]:
# Convert the SALE_DATE column to pandas datetime
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

In [37]:
# Train/test split: train on January through March 2019, test on April 2019.
# Note: the one-family / price-range subset from the assignment is not applied here.
sale_year = df['SALE_DATE'].dt.year
sale_month = df['SALE_DATE'].dt.month

# .copy() so later column assignments do not trigger SettingWithCopyWarning
train = df[(sale_year == 2019) & (sale_month <= 3)].copy()
test = df[(sale_year == 2019) & (sale_month == 4)].copy()

print("train shape:\n", train.shape)

print("\ntest shape:\n", test.shape)

train shape:
 (18167, 21)

test shape:
 (4873, 21)


In [38]:
#Show Features
train.describe()

| | BOROUGH | BLOCK | LOT | EASE-MENT | ZIP_CODE | RESIDENTIAL_UNITS | COMMERCIAL_UNITS | TOTAL_UNITS | GROSS_SQUARE_FEET | YEAR_BUILT | TAX_CLASS_AT_TIME_OF_SALE | SALE_PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18167.0 | 18167.0 | 18167.0 | 0.0 | 18167.0 | 18167.0 | 18167.0 | 18167.0 | 18167.0 | 18162.0 | 18167.0 | 18167.0 |
| mean | 3.016018 | 4447.262344 | 343.641548 | NaN | 10782.699015 | 1.721418 | 0.298949 | 2.172235 | 3214.913 | 1822.192765 | 1.617053 | 1217331.0 |
| std | 1.268013 | 3679.405576 | 606.189463 | NaN | 1121.115406 | 9.381721 | 6.087744 | 11.663443 | 21558.29 | 483.641156 | 0.807349 | 10921220.0 |
| min | 1.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 2.0 | 1343.0 | 21.0 | NaN | 10306.0 | 0.0 | 0.0 | 1.0 | 528.0 | 1920.0 | 1.0 | 0.0 |
| 50% | 3.0 | 3569.0 | 49.0 | NaN | 11210.0 | 1.0 | 0.0 | 1.0 | 1368.0 | 1940.0 | 1.0 | 430000.0 |
| 75% | 4.0 | 6656.0 | 286.0 | NaN | 11360.0 | 2.0 | 0.0 | 2.0 | 2273.5 | 1965.0 | 2.0 | 840056.0 |
| max | 5.0 | 16350.0 | 9022.0 | NaN | 11697.0 | 750.0 | 570.0 | 755.0 | 1303935.0 | 2019.0 | 4.0 | 850000000.0 |


In [39]:
#Explore data with plotly
import plotly.express as px
px.scatter(train, x = 'GROSS_SQUARE_FEET', y = 'SALE_PRICE', color = 'SALE_PRICE', trendline = 'ols')

In [40]:
# Forward-fill missing values in the test set before plotting
test = test.ffill()
px.scatter(test, x = 'GROSS_SQUARE_FEET', y = 'SALE_PRICE', color = 'SALE_PRICE', trendline = 'ols')

In [41]:
#Do one-hot encoding of categorical features

#What features are non-numeric?
train.describe(exclude ='number')

| | NEIGHBORHOOD | BUILDING_CLASS_CATEGORY | TAX_CLASS_AT_PRESENT | BUILDING_CLASS_AT_PRESENT | ADDRESS | APARTMENT_NUMBER | LAND_SQUARE_FEET | BUILDING_CLASS_AT_TIME_OF_SALE | SALE_DATE |
|---|---|---|---|---|---|---|---|---|---|
| count | 18167 | 18167 | 18167.0 | 18167 | 18167 | 3972.0 | 18139.0 | 18167 | 18167 |
| unique | 251 | 43 | 10.0 | 140 | 17926 | 1450.0 | 3207.0 | 140 | 90 |
| top | FLUSHING-NORTH | 01 ONE FAMILY DWELLINGS | 1.0 | D4 | N/A ROCKAWAY BOULEVARD | 4.0 | 0.0 | D4 | 2019-01-24 00:00:00 |
| freq | 549 | 4094 | 8911.0 | 2640 | 5 | 81.0 | 5784.0 | 2640 | 480 |
| first | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019-01-01 00:00:00 |
| last | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019-03-31 00:00:00 |


In [0]:
# Strip the letter suffix (A-D) from the tax class so it can be cast to int
train['TAX_CLASS_AT_PRESENT'] = train['TAX_CLASS_AT_PRESENT'].str.replace('[A-D]', '', regex=True)
test['TAX_CLASS_AT_PRESENT'] = test['TAX_CLASS_AT_PRESENT'].str.replace('[A-D]', '', regex=True)

In [0]:
categorical_columns = ['TAX_CLASS_AT_PRESENT']
for col in categorical_columns:
  train[col] = train[col].astype(int)
  test[col] = test[col].astype(int)

In [45]:
train[categorical_columns].describe()

| | TAX_CLASS_AT_PRESENT |
|---|---|
| count | 18167.0 |
| mean | 1.616998 |
| std | 0.807357 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 1.0 |
| 75% | 2.0 |
| max | 4.0 |


In [51]:
train['TAX_CLASS_AT_PRESENT'].value_counts(normalize=True)

1    0.521495
2    0.409259
4    0.069246
Name: TAX_CLASS_AT_PRESENT, dtype: float64
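
The tax class values (1, 2, 4) are labels rather than quantities, so one option is to one-hot encode the column instead of treating it as an integer. A minimal sketch on toy values follows. The notebook itself imports `category_encoders`; scikit-learn's `OneHotEncoder` is swapped in here only to keep the example self-contained, and the unseen class `'3'` is a hypothetical test value.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy values, invented for illustration only
toy_train = pd.DataFrame({'TAX_CLASS_AT_PRESENT': ['1', '2', '4', '1']})
toy_test = pd.DataFrame({'TAX_CLASS_AT_PRESENT': ['2', '3']})  # '3' is unseen in training

# Learn the categories from the training rows only;
# unseen test categories become an all-zero row instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(toy_train).toarray()
test_encoded = encoder.transform(toy_test).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

Fitting the encoder on the training data and only transforming the test data keeps both matrices on the same set of columns.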

In [56]:
import category_encoders as ce
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Choose features: exclude the target plus the columns reported by describe(exclude='number')
target = 'SALE_PRICE'
high_cardinality = [
    "NEIGHBORHOOD", "BUILDING_CLASS_CATEGORY", "TAX_CLASS_AT_PRESENT",
    "BUILDING_CLASS_AT_PRESENT", "ADDRESS", "APARTMENT_NUMBER",
    "LAND_SQUARE_FEET", "BUILDING_CLASS_AT_TIME_OF_SALE", "SALE_DATE",
]
features = train.columns.drop([target] + high_cardinality)

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

for k in range(1, len(X_train_encoded.columns)+1):
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = selector.transform(X_test_scaled)
    
    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

1 features



invalid value encountered in true_divide


Degrees of freedom <= 0 for slice.



ValueError: ignored
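
The warnings and the `ValueError` above are consistent with missing values reaching `SelectKBest`: in the earlier `describe()` output, `EASE-MENT` has a count of 0 and `YEAR_BUILT` has fewer non-null rows than the other columns. That diagnosis is an assumption, since the traceback is truncated. A minimal sketch of one possible fix on synthetic data (column names reused for illustration, values invented) drops all-empty columns, then imputes the remaining gaps with the median before selecting features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer

# Synthetic stand-in for the training features (values invented)
rng = np.random.default_rng(0)
n = 100
X_toy = pd.DataFrame({
    'GROSS_SQUARE_FEET': rng.uniform(500, 3000, size=n),
    'YEAR_BUILT': rng.integers(1900, 2019, size=n).astype(float),
    'EASE-MENT': np.nan,  # entirely missing
})
X_toy.loc[:4, 'YEAR_BUILT'] = np.nan  # a few missing values
y_toy = 300 * X_toy['GROSS_SQUARE_FEET'] + rng.normal(scale=1e4, size=n)

# Drop columns that contain no observed values at all
X_clean = X_toy.dropna(axis='columns', how='all')

# Fill the remaining gaps with each column's median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_clean)

# Feature selection now receives a matrix without NaN
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X_imputed, y_toy)
selected = X_clean.columns[selector.get_support()].tolist()
print(selected)
```

In the notebook, the imputer would be fit on `X_train` and then applied to `X_test` with `transform`, mirroring how the scaler is used.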