<a href="https://colab.research.google.com/github/bofori-tech/DS-Unit-2-Linear-Models/blob/master/Copy_of_LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
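
The scaling, ridge, and error steps in the checklist above can be sketched as follows. This is a minimal illustration on synthetic data, not the assignment solution: the `*_demo` arrays, `true_coefs`, and `alpha=1.0` are all assumptions made for the example. It uses a separate scaler because, as far as I know, recent scikit-learn releases no longer accept `normalize=True` on `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT the NYC sales data):
# three features on very different scales
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(200, 3)) * [1, 100, 10_000]
X_test_demo = rng.normal(size=(50, 3)) * [1, 100, 10_000]
true_coefs = np.array([5.0, 0.2, 0.001])
y_train_demo = X_train_demo @ true_coefs + rng.normal(size=200)
y_test_demo = X_test_demo @ true_coefs + rng.normal(size=50)

# Learn the scaling from the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

# alpha=1.0 is an arbitrary starting point
model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train_demo)

mae = mean_absolute_error(y_test_demo, model.predict(X_test_scaled))
print(f'Test MAE: {mae:.2f}')
```

Calling `fit_transform` on the test set instead would let test-set statistics leak into the preprocessing, which is why the checklist separates the two methods.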

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
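
For the `RidgeCV` and pipeline stretch goals, one possible shape is sketched below. The synthetic data, the choice of `k=5`, and the list of candidate `alphas` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 20 features, only the first three carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = (
    3 * X_demo[:, 0] - 2 * X_demo[:, 1] + X_demo[:, 2]
    + rng.normal(scale=0.5, size=300)
)

# Scaling, feature selection, and the model are fit together on training rows only
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]),
)
pipeline.fit(X_demo[:200], y_demo[:200])

chosen_alpha = pipeline.named_steps['ridgecv'].alpha_
held_out_r2 = pipeline.score(X_demo[200:], y_demo[200:])
print('Chosen alpha:', chosen_alpha)
print('Held-out R^2:', held_out_r2)
```

Wrapping the steps in one pipeline means a single `fit` call applies each transformer's `fit_transform` to the training data and a single `predict` or `score` call applies `transform` to new data.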

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # '$' is a regex anchor, so match it literally
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [None]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [None]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [None]:
# Do train/test split
# Use data from January — March 2019 to train
# Use data from April 2019 to test
# (infer_datetime_format is deprecated in pandas 2.x, where format inference is the default)
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
cutoff = pd.to_datetime('2019-04-01')
train = df[df.SALE_DATE < cutoff]
test  = df[df.SALE_DATE >= cutoff]
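
The assignment also asks for a subset of one-family dwellings priced between 100 thousand and 2 million. A minimal sketch of that filter combined with the date split, on a hypothetical five-row frame (the `demo` data is invented for the example):

```python
import pandas as pd

# Tiny made-up frame with the three columns the filter and split rely on
demo = pd.DataFrame({
    'BUILDING_CLASS_CATEGORY': [
        '01 ONE FAMILY DWELLINGS', '01 ONE FAMILY DWELLINGS',
        '02 TWO FAMILY DWELLINGS', '01 ONE FAMILY DWELLINGS',
        '01 ONE FAMILY DWELLINGS',
    ],
    'SALE_PRICE': [450_000, 0, 600_000, 2_500_000, 800_000],
    'SALE_DATE': pd.to_datetime(
        ['2019-01-15', '2019-02-01', '2019-03-10', '2019-03-20', '2019-04-05']
    ),
})

# Combine the three conditions with & (each wrapped in parentheses)
mask = (
    (demo['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')
    & (demo['SALE_PRICE'] > 100_000)
    & (demo['SALE_PRICE'] < 2_000_000)
)
subset = demo[mask]

# Split by date: before April 2019 trains, April 2019 onward tests
cutoff = pd.to_datetime('2019-04-01')
train_demo = subset[subset['SALE_DATE'] < cutoff]
test_demo = subset[subset['SALE_DATE'] >= cutoff]

print(len(train_demo), len(test_demo))  # → 1 1
```

Applying the filter before the split keeps the train and test sets drawn from the same kind of property.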

In [None]:
# Summary statistics for the numeric columns
train.describe(include='number')

| | BLOCK | LOT | EASE-MENT | ZIP_CODE | RESIDENTIAL_UNITS | COMMERCIAL_UNITS | TOTAL_UNITS | GROSS_SQUARE_FEET | YEAR_BUILT | TAX_CLASS_AT_TIME_OF_SALE | SALE_PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18167.0 | 18167.0 | 0.0 | 18167.0 | 18167.0 | 18167.0 | 18167.0 | 18167.0 | 18162.0 | 18167.0 | 18167.0 |
| mean | 4447.262344 | 343.641548 | NaN | 10782.699015 | 1.721418 | 0.298949 | 2.172235 | 3214.913 | 1822.192765 | 1.617053 | 1217331.0 |
| std | 3679.405576 | 606.189463 | NaN | 1121.115406 | 9.381721 | 6.087744 | 11.663443 | 21558.29 | 483.641156 | 0.807349 | 10921220.0 |
| min | 1.0 | 1.0 | NaN | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 1343.0 | 21.0 | NaN | 10306.0 | 0.0 | 0.0 | 1.0 | 528.0 | 1920.0 | 1.0 | 0.0 |
| 50% | 3569.0 | 49.0 | NaN | 11210.0 | 1.0 | 0.0 | 1.0 | 1368.0 | 1940.0 | 1.0 | 430000.0 |
| 75% | 6656.0 | 286.0 | NaN | 11360.0 | 2.0 | 0.0 | 2.0 | 2273.5 | 1965.0 | 2.0 | 840056.0 |
| max | 16350.0 | 9022.0 | NaN | 11697.0 | 750.0 | 570.0 | 755.0 | 1303935.0 | 2019.0 | 4.0 | 850000000.0 |


In [None]:
train.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [None]:
# Summary of the non-numeric columns, sorted by cardinality
train.describe(exclude='number').T.sort_values(by='unique')

| | count | unique | top | freq | first | last |
|---|---|---|---|---|---|---|
| BOROUGH | 18167 | 5 | 4 | 5883 | NaT | NaT |
| TAX_CLASS_AT_PRESENT | 18167 | 10 | 1 | 8911 | NaT | NaT |
| NEIGHBORHOOD | 18167 | 11 | OTHER | 15034 | NaT | NaT |
| BUILDING_CLASS_CATEGORY | 18167 | 43 | 01 ONE FAMILY DWELLINGS | 4094 | NaT | NaT |
| SALE_DATE | 18167 | 90 | 2019-01-24 00:00:00 | 480 | 2019-01-01 | 2019-03-31 |
| BUILDING_CLASS_AT_PRESENT | 18167 | 140 | D4 | 2640 | NaT | NaT |
| BUILDING_CLASS_AT_TIME_OF_SALE | 18167 | 140 | D4 | 2640 | NaT | NaT |
| APARTMENT_NUMBER | 3972 | 1450 | 4 | 81 | NaT | NaT |
| LAND_SQUARE_FEET | 18139 | 3207 | 0 | 5784 | NaT | NaT |
| ADDRESS | 18167 | 17926 | N/A ROCKAWAY BOULEVARD | 5 | NaT | NaT |


Next, look at the relationship between building class category and sale price.


In [None]:
train['BUILDING_CLASS_CATEGORY'].value_counts()

01 ONE FAMILY DWELLINGS                       4094
02 TWO FAMILY DWELLINGS                       3675
10 COOPS - ELEVATOR APARTMENTS                2686
13 CONDOS - ELEVATOR APARTMENTS               2551
03 THREE FAMILY DWELLINGS                     1137
07 RENTALS - WALKUP APARTMENTS                 649
09 COOPS - WALKUP APARTMENTS                   513
15 CONDOS - 2-10 UNIT RESIDENTIAL              325
04 TAX CLASS 1 CONDOS                          320
44 CONDO PARKING                               285
17 CONDO COOPS                                 237
22 STORE BUILDINGS                             228
05 TAX CLASS 1 VACANT LAND                     214
12 CONDOS - WALKUP APARTMENTS                  196
14 RENTALS - 4-10 UNIT                         165
29 COMMERCIAL GARAGES                          113
08 RENTALS - ELEVATOR APARTMENTS                82
30 WAREHOUSES                                   81
21 OFFICE BUILDINGS                             80
43 CONDO OFFICE BUILDINGS      

In [None]:
train['SALE_PRICE'].mean()

1217331.1143832223

In [None]:
train.groupby('BUILDING_CLASS_CATEGORY')['SALE_PRICE'].mean().sort_values()

BUILDING_CLASS_CATEGORY
48 CONDO TERRACES/GARDENS/CABANAS             5.600000e+04
42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC     4.150000e+05
09 COOPS - WALKUP APARTMENTS                  4.239359e+05
04 TAX CLASS 1 CONDOS                         4.284021e+05
01 ONE FAMILY DWELLINGS                       4.579418e+05
12 CONDOS - WALKUP APARTMENTS                 4.642084e+05
02 TWO FAMILY DWELLINGS                       5.038693e+05
03 THREE FAMILY DWELLINGS                     5.572495e+05
06 TAX CLASS 1 - OTHER                        5.920303e+05
10 COOPS - ELEVATOR APARTMENTS                6.964924e+05
05 TAX CLASS 1 VACANT LAND                    7.525440e+05
17 CONDO COOPS                                7.640789e+05
15 CONDOS - 2-10 UNIT RESIDENTIAL             1.034268e+06
49 CONDO WAREHOUSES/FACTORY/INDUS             1.124250e+06
14 RENTALS - 4-10 UNIT                        1.144187e+06
39 TRANSPORTATION FACILITIES                  1.225000e+06
07 RENTALS - WALKUP APARTMENTS  

In [None]:
# Choose the target, and exclude high-cardinality columns from the features
target = 'SALE_PRICE'
high_cardinality = ['APARTMENT_NUMBER', 'LAND_SQUARE_FEET', 'ADDRESS', 'SALE_DATE']
features = train.columns.drop([target] + high_cardinality)

In [None]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [None]:
# One-hot encode the categorical features (fit the encoder on the training set only)
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)

In [None]:
X_test = encoder.transform(X_test)
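
The fit-on-train, transform-on-test pattern used above matters most when the test set contains a category the encoder never saw. The sketch below illustrates this with scikit-learn's `OneHotEncoder` rather than `category_encoders` (a deliberate substitution so the example depends on fewer packages); the two small frames are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up frames: borough '5' appears only in the test data
train_demo = pd.DataFrame({'BOROUGH': ['1', '2', '3', '2']})
test_demo = pd.DataFrame({'BOROUGH': ['2', '5']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
encoder_demo = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder_demo.fit_transform(train_demo).toarray()
test_encoded = encoder_demo.transform(test_demo).toarray()

print(encoder_demo.categories_[0])
print(test_encoded)
```

Because the columns are fixed at fit time, the train and test matrices always have the same width, which the later feature selection and model steps rely on.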

In [None]:
# How many features do we have currently?
features = X_train.columns
n = len(features)
n

359

In [None]:
# How many ways to choose 1 to n features?
from math import factorial

def n_choose_k(n, k):
    return factorial(n)/(factorial(k)*factorial(n-k))

combinations = sum(n_choose_k(n,k) for k in range(1,n+1))
print(f'{combinations:,.0f}')

1,174,271,291,386,916,874,685,345,269,208,887,556,048,889,390,854,564,549,941,599,875,005,944,593,983,281,739,092,655,868,899,482,742,990,831,616
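
The count above is the number of non-empty subsets of `n` features, which equals 2^n − 1. Because `n_choose_k` uses `/`, the sum is a float, so only roughly the first 16 digits of the printed value are reliable. A small sketch of an exact integer version, assuming Python 3.8+ for `math.comb`:

```python
from math import comb

n = 359  # number of one-hot encoded features from the cell above

# math.comb returns exact integers, so the sum has no floating-point rounding
exact = sum(comb(n, k) for k in range(1, n + 1))

# Sanity check against the closed form: every non-empty subset of n items
assert exact == 2**n - 1

print(f'{exact:,}')
```

Either way, the number is far too large to search exhaustively, which is the motivation for a scoring-based selector such as `SelectKBest`.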


In [None]:
# TODO: Select the 15 features that best correlate with the target
# (15 is an arbitrary starting point here)
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=15)
X_train_selected = selector.fit_transform(X_train, y_train)

ValueError: ignored
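
The traceback is truncated, so the exact cause is not shown. A likely culprit is missing values: the summary statistics earlier report a count of 0 for `EASE-MENT` and fewer rows for `YEAR_BUILT` than for the other columns, and `f_regression` does not accept NaN. The sketch below reproduces that situation on invented data and shows one possible workaround (dropping all-empty columns, then median imputation); both the data and the choice of imputation are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Made-up frame mimicking the missing-value pattern described above
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    'GROSS_SQUARE_FEET': rng.uniform(500, 3000, size=100),
    'YEAR_BUILT': rng.integers(1900, 2019, size=100).astype(float),
    'EASE-MENT': np.nan,  # entirely missing
})
X_demo.loc[:4, 'YEAR_BUILT'] = np.nan  # a few missing values
y_demo = 200 * X_demo['GROSS_SQUARE_FEET'] + rng.normal(scale=10_000, size=100)

# Drop columns with no data at all, then fill remaining gaps with the column median
X_clean = X_demo.dropna(axis='columns', how='all')
X_clean = X_clean.fillna(X_clean.median())

selector_demo = SelectKBest(score_func=f_regression, k=1)
selector_demo.fit(X_clean, y_demo)
selected = X_clean.columns[selector_demo.get_support()].tolist()
print(selected)
```

In the notebook itself, any fill values would need to be computed from `X_train` and then applied to `X_test`, for the same reason the encoder is fit on the training set only.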