<a href="https://colab.research.google.com/github/Bayaniblues/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/Copy_of_LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand: use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set).
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you're interested in a more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

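# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character rather than the
# regex end-of-string anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)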

In [3]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [4]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

Use a subset of the data where BUILDING_CLASS_CATEGORY == '01 ONE FAMILY DWELLINGS' and the sale price was more than 100 thousand and less than 2 million.


In [5]:
# Return the one-family dwellings whose sale price is strictly
# between $100,000 and $2,000,000.
def get_subset():
  category = '01 ONE FAMILY DWELLINGS'
  subset = df[df['BUILDING_CLASS_CATEGORY'] == category]
  target = "SALE_PRICE"
  a = subset[target] > 100000
  b = subset[target] < 2000000
  mask = a & b
  return subset[mask]

get_subset().head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019


 Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.


In [6]:
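# Return the rows of `subset` whose SALE_DATE falls in [date_a, date_b).
# SALE_DATE is still an MM/DD/YYYY string here, so the comparison is
# lexicographic; it orders correctly only while all dates share one year.
def filter_date(subset, date_a, date_b):
  a = subset["SALE_DATE"] >= date_a
  b = subset["SALE_DATE"] < date_b
  mask = a & b
  # Index the frame that was passed in, not a fresh get_subset() call
  return subset.loc[mask]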



In [7]:
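# Note: these cutoffs select January-February for training and March for
# testing. The assignment asks for January-March (train) and April (test),
# i.e. '01/01/2019' to '04/01/2019' and '04/01/2019' to '05/01/2019'.
train = filter_date(get_subset(), '01/01/2019', '03/01/2019')
test = filter_date(get_subset(), '03/01/2019', '04/01/2019')
train.shape, test.shape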

((1708, 21), (799, 21))
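
As a minimal sketch of the split the assignment describes (January through March to train, April to test), the dates can be parsed with `pd.to_datetime` so the comparison is chronological rather than string-based. The `sales` frame below is a small synthetic stand-in, not the real dataset, and the names `sale_date`, `train_demo`, and `test_demo` are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for the sales data (assumption: same MM/DD/YYYY format)
sales = pd.DataFrame({
    "SALE_DATE": ["01/15/2019", "02/28/2019", "03/31/2019",
                  "04/01/2019", "04/30/2019"],
    "SALE_PRICE": [550000, 200000, 810000, 125000, 620000],
})

# Parse the strings once so comparisons are chronological
sale_date = pd.to_datetime(sales["SALE_DATE"], format="%m/%d/%Y")

april_start = pd.Timestamp("2019-04-01")
may_start = pd.Timestamp("2019-05-01")

train_demo = sales[sale_date < april_start]                              # Jan-Mar
test_demo = sales[(sale_date >= april_start) & (sale_date < may_start)]  # April

print(train_demo.shape, test_demo.shape)
```

Parsing first also keeps the filter correct if the data ever spans more than one calendar year.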

In [8]:
train, test

(      BOROUGH NEIGHBORHOOD  ... SALE_PRICE   SALE_DATE
 44          3        OTHER  ...     550000  01/01/2019
 61          4        OTHER  ...     200000  01/01/2019
 78          2        OTHER  ...     810000  01/02/2019
 108         3        OTHER  ...     125000  01/02/2019
 111         3        OTHER  ...     620000  01/02/2019
 ...       ...          ...  ...        ...         ...
 12045       5        OTHER  ...     460000  02/28/2019
 12051       5        OTHER  ...     870000  02/28/2019
 12053       5        OTHER  ...     380000  02/28/2019
 12054       5        OTHER  ...     520000  02/28/2019
 12055       5        OTHER  ...     584284  02/28/2019
 
 [1708 rows x 21 columns],
       BOROUGH NEIGHBORHOOD  ... SALE_PRICE   SALE_DATE
 12115       2        OTHER  ...     515000  03/01/2019
 12116       2        OTHER  ...     555000  03/01/2019
 12124       2        OTHER  ...     571000  03/01/2019
 12127       2        OTHER  ...     580000  03/01/2019
 12130       2     

In [12]:
train.describe(exclude="number")

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,1708,1708,1708,1708,1708,1708,1,1708,1708,1708
unique,4,6,1,2,13,1704,1,674,11,44
top,4,OTHER,01 ONE FAMILY DWELLINGS,1,A1,216-29 114TH ROAD,RP.,4000,A1,01/31/2019
freq,816,1625,1708,1689,635,2,1,145,635,78


In [13]:
features = [
    "BOROUGH",
    "NEIGHBORHOOD",
    "BUILDING_CLASS_CATEGORY",
    "TAX_CLASS_AT_PRESENT",
    "BUILDING_CLASS_AT_PRESENT",
    "ADDRESS",
    "APARTMENT_NUMBER",
    "LAND_SQUARE_FEET",
    "BUILDING_CLASS_AT_TIME_OF_SALE",
    "SALE_DATE",
]
target = "SALE_PRICE"
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

 Do one-hot encoding of categorical features.


In [14]:
print('Mean Baseline')
guess = y_train.mean()
print(guess)

Mean Baseline
623486.388173302


In [15]:
X_train.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,A9,4832 BAY PARKWAY,,6800,A9,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,A1,80-23 232ND STREET,,4000,A1,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,A1,1260 RHINELANDER AVE,,3500,A1,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,A1,469 E 25TH ST,,4000,A1,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,A5,5521 WHITTY LANE,,1710,A5,01/02/2019


In [20]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X_train[['NEIGHBORHOOD']])
train_trans = ohe.transform(X_train[['NEIGHBORHOOD']]).toarray()


In [21]:
print(train_trans)

[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1.]
 ...
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1.]]


In [40]:
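import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
# Caution: this helper refits the encoder on whatever it is given, so
# calling it on the test set learns the test set's own categories instead
# of reusing the encoding learned from the train set.
onehot = lambda a: encoder.fit_transform(a)
X_train2 = encoder.fit_transform(X_train)
X_train2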

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,ADDRESS_4832 BAY PARKWAY,ADDRESS_80-23 232ND STREET,ADDRESS_1260 RHINELANDER AVE,ADDRESS_469 E 25TH ST,ADDRESS_5521 WHITTY LANE,ADDRESS_1747 EAST 23RD STREET,ADDRESS_1582 EAST 15TH STREET,ADDRESS_201-08 50TH AVENUE,ADDRESS_85-11 57 ROAD,ADDRESS_53-19 198TH STREET,ADDRESS_208-03 HOLLIS AVENUE,ADDRESS_157-43 82ND STREET,ADDRESS_102-33 164TH ROAD,ADDRESS_24-27 92ND STREET,...,SALE_DATE_01/07/2019,SALE_DATE_01/08/2019,SALE_DATE_01/09/2019,SALE_DATE_01/10/2019,SALE_DATE_01/11/2019,SALE_DATE_01/14/2019,SALE_DATE_01/15/2019,SALE_DATE_01/16/2019,SALE_DATE_01/17/2019,SALE_DATE_01/18/2019,SALE_DATE_01/21/2019,SALE_DATE_01/22/2019,SALE_DATE_01/23/2019,SALE_DATE_01/24/2019,SALE_DATE_01/25/2019,SALE_DATE_01/28/2019,SALE_DATE_01/29/2019,SALE_DATE_01/30/2019,SALE_DATE_01/31/2019,SALE_DATE_02/01/2019,SALE_DATE_02/04/2019,SALE_DATE_02/05/2019,SALE_DATE_02/06/2019,SALE_DATE_02/07/2019,SALE_DATE_02/08/2019,SALE_DATE_02/11/2019,SALE_DATE_02/12/2019,SALE_DATE_02/13/2019,SALE_DATE_02/14/2019,SALE_DATE_02/15/2019,SALE_DATE_02/17/2019,SALE_DATE_02/18/2019,SALE_DATE_02/19/2019,SALE_DATE_02/20/2019,SALE_DATE_02/21/2019,SALE_DATE_02/22/2019,SALE_DATE_02/25/2019,SALE_DATE_02/26/2019,SALE_DATE_02/27/2019,SALE_DATE_02/28/2019
44,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
61,0,1,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
78,0,0,1,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
108,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
111,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12045,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12051,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12053,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12054,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
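
A hedged sketch of the fit-on-train, transform-on-test pattern for one-hot encoding, using scikit-learn's `OneHotEncoder` on a tiny synthetic `NEIGHBORHOOD` column. The frames and the names `ohe_demo`, `train_encoded`, and `test_encoded` are made up for illustration; `handle_unknown="ignore"` is one possible choice for categories that appear only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Synthetic example data (not the real dataset)
train_demo = pd.DataFrame({"NEIGHBORHOOD": ["OTHER", "ASTORIA", "OTHER"]})
test_demo = pd.DataFrame({"NEIGHBORHOOD": ["ASTORIA", "FLUSHING-NORTH"]})

# Learn the categories from the train set only
ohe_demo = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe_demo.fit_transform(train_demo).toarray()

# Reuse the same columns for the test set; unseen categories become all zeros
test_encoded = ohe_demo.transform(test_demo).toarray()

print(ohe_demo.categories_)
print(test_encoded)
```

Because the encoder is fit once, the train and test matrices end up with the same columns in the same order, which a downstream model needs.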


 Do feature selection with SelectKBest.

In [41]:
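from sklearn.feature_selection import SelectKBest, chi2

# k='all' keeps every column, so no features are actually dropped here.
# chi2 is meant for classification targets; f_regression is the usual
# score function for a continuous target such as SALE_PRICE.
X_new = SelectKBest(chi2, k='all').fit_transform(X_train2, y_train)
X_new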

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [44]:
from sklearn.feature_selection import SelectKBest

def selectk(X_train, X_test, y_train):
  # Keep numeric columns only
  X_train = X_train.select_dtypes(include='number')
  X_test = X_test.select_dtypes(include='number')

  # k='all' keeps every feature; the default score function is f_classif
  selector = SelectKBest(k='all')

  X_train_selected = selector.fit_transform(X_train, y_train)
  # X_test_selected = selector.transform(X_test)
  return X_train_selected

selectk(onehot(X_train), onehot(X_test), y_train)

   82   87   88   93  100  107  108  110  117  121  126  127  140  146
  149  160  169  171  172  173  175  180  182  183  185  193  202  204
  211  220  226  228  231  234  243  246  249  251  252  257  258  259
  262  274  278  295  302  306  309  310  322  325  333  339  343  346
  350  356  359  361  363  365  373  374  376  377  378  381  382  383
  385  386  395  404  407  408  411  412  425  427  429  431  433  437
  438  444  448  452  453  457  459  461  473  474  479  486  487  492
  493  502  506  510  514  515  518  520  522  526  527  530  531  532
  541  546  555  558  560  563  565  568  583  584  588  591  593  595
  599  600  603  610  612  618  620  621  627  632  634  635  636  638
  639  640  642  645  647  649  651  654  656  663  664  670  678  683
  684  688  690  692  693  694  696  703  704  706  708  709  710  718
  721  725  726  735  747  749  750  760  763  773  774  777  778  779
  781  784  785  792  793  801  819  821  828  830  834  845  852  857
  858 

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])
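
For comparison, here is a small sketch of `SelectKBest` actually dropping features for a regression target. It assumes `f_regression` as the score function and `k=2`, both illustrative choices, and uses synthetic NumPy data in which only column 2 drives the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 5 features, only column 2 is related to the target
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 5))
X_te = rng.normal(size=(50, 5))
y_tr = 3.0 * X_tr[:, 2] + rng.normal(scale=0.1, size=200)

# Fit the selector on the train set, then apply the same mask to the test set
selector = SelectKBest(score_func=f_regression, k=2)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

print(selector.get_support())      # boolean mask of the kept columns
print(X_tr_sel.shape, X_te_sel.shape)
```

With `k` smaller than the number of columns, `get_support()` shows which features survived, and the test set is reduced with `transform` rather than refit.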

 Fit a ridge regression model with multiple features. Use the normalize=True parameter (or do feature scaling beforehand: use the scaler's fit_transform method with the train set, and the scaler's transform method with the test set)


 Get mean absolute error for the test set.


In [64]:
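from sklearn.metrics import mean_absolute_error

# Mean absolute error of the mean baseline: predict `guess` (the training
# mean) for every row and compare it against the true sale prices.
def test_error(y_true, header):
  y_pred = [guess] * len(y_true)
  mae = mean_absolute_error(y_true, y_pred)
  print(f'{header} is {mae:.2f}')

test_error(y_train, "Train error")

test_error(y_test, "Test error")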

Train error is 214345.38
Test error is 216849.11


Did a stretch goal: made a pipeline with ridge regression.

In [63]:
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

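pipeline = make_pipeline(
    ce.OneHotEncoder(),
    # normalize=True was deprecated in scikit-learn 1.0 and removed in 1.2;
    # on newer versions, put a scaler in the pipeline instead.
    Ridge(normalize=True)
)

# fit on train
pipeline.fit(X_train, y_train)
# For a regressor, score() returns R^2 (coefficient of determination),
# not classification accuracy.
print('Training Accuracy:', pipeline.score(X_train, y_train))
print('test Accuracy:', pipeline.score(X_test, y_test))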

Training Accuracy: 0.8470139499655494
test Accuracy: 0.17995445264702525
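
The scores above are R^2 values, while the assignment asks for mean absolute error on the test set. A minimal sketch of that step follows, assuming a `StandardScaler` in the pipeline in place of `normalize=True` (the two scale features differently, so results need not match exactly). The data is synthetic and the names `model`, `ridge_mae`, and `baseline_mae` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with features on very different scales
rng = np.random.default_rng(42)
scales = np.array([1.0, 10.0, 100.0, 1000.0])
coef = np.array([5.0, 0.5, 0.05, 0.005])
X_tr = rng.normal(size=(300, 4)) * scales
X_te = rng.normal(size=(100, 4)) * scales
y_tr = X_tr @ coef + rng.normal(scale=0.5, size=300)
y_te = X_te @ coef + rng.normal(scale=0.5, size=100)

# The pipeline fits the scaler on the train set and reuses it at predict time
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

# Compare the model's test MAE against the mean baseline's test MAE
ridge_mae = mean_absolute_error(y_te, model.predict(X_te))
baseline_mae = mean_absolute_error(y_te, np.full(len(y_te), y_tr.mean()))

print(f"Ridge test MAE: {ridge_mae:.2f}")
print(f"Baseline test MAE: {baseline_mae:.2f}")
```

Reporting MAE in the same units as the target makes it directly comparable to the mean-baseline errors computed earlier in the notebook.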
