<a href="https://colab.research.google.com/github/ethanmjansen/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS10_assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

#Pre-work stuff

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [0]:
#Ignore Warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning, module='sklearn')

#Making a subset of the Data 

In [139]:
#Just looking at what I'm working with
df

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23035,4,OTHER,01 ONE FAMILY DWELLINGS,1,10965,276,,A5,111-17 FRANCIS LEWIS BLVD,,11429.0,1.0,0.0,1.0,1800,1224.0,1945.0,1,A5,510000,04/30/2019
23036,4,OTHER,09 COOPS - WALKUP APARTMENTS,2,169,29,,C6,"45-14 43RD STREET, 3C",,11104.0,0.0,0.0,0.0,0,0.0,1929.0,2,C6,355000,04/30/2019
23037,4,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,131,4,,D4,"50-05 43RD AVENUE, 3M",,11377.0,0.0,0.0,0.0,0,0.0,1932.0,2,D4,375000,04/30/2019
23038,4,OTHER,02 TWO FAMILY DWELLINGS,1,8932,18,,S2,91-10 JAMAICA AVE,,11421.0,2.0,1.0,3.0,2078,2200.0,1931.0,1,S2,1100000,04/30/2019


In [0]:
#Making the subset
dfsub = df[(df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS') &
           (df['SALE_PRICE'] >= 100000) & 
           (df['SALE_PRICE'] <= 2000000)]

In [141]:
#Checking the subset
dfsub.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BLOCK,3164.0,6908.597029,3964.333893,21.0,4003.25,6269.5,10206.25,16350.0
LOT,3164.0,75.847029,160.779187,1.0,21.0,42.0,69.0,2720.0
EASE-MENT,0.0,,,,,,,
ZIP_CODE,3164.0,11027.442162,482.591574,10030.0,10461.0,11235.0,11413.0,11697.0
RESIDENTIAL_UNITS,3164.0,0.987358,0.114537,0.0,1.0,1.0,1.0,2.0
COMMERCIAL_UNITS,3164.0,0.015803,0.127241,0.0,0.0,0.0,0.0,2.0
TOTAL_UNITS,3164.0,1.003161,0.172362,0.0,1.0,1.0,1.0,3.0
GROSS_SQUARE_FEET,3164.0,1469.718394,586.645088,0.0,1144.0,1360.0,1683.0,7875.0
YEAR_BUILT,3164.0,1943.639697,26.679176,1890.0,1925.0,1938.0,1955.0,2018.0
TAX_CLASS_AT_TIME_OF_SALE,3164.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


#Making a Train/Test Split

In [142]:
#Changing created object to datetime 
dfsub['SALE_DATE'] = pd.to_datetime(dfsub['SALE_DATE'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [0]:
#Split into train/test
train = dfsub[(dfsub['SALE_DATE'] >= '2019-01') & (dfsub['SALE_DATE'] < '2019-04')]
test = dfsub[dfsub['SALE_DATE'] >= '2019-04']

#One-Hot Encoding

In [144]:
dfsub.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq,first,last
BUILDING_CLASS_CATEGORY,3164,1,01 ONE FAMILY DWELLINGS,3164,NaT,NaT
APARTMENT_NUMBER,1,1,RP.,1,NaT,NaT
TAX_CLASS_AT_PRESENT,3164,2,1,3123,NaT,NaT
BOROUGH,3164,5,4,1585,NaT,NaT
NEIGHBORHOOD,3164,7,OTHER,2970,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,3164,11,A1,1189,NaT,NaT
BUILDING_CLASS_AT_PRESENT,3164,13,A1,1188,NaT,NaT
SALE_DATE,3164,91,2019-01-31 00:00:00,78,2019-01-01,2019-04-30
LAND_SQUARE_FEET,3164,1037,4000,290,NaT,NaT
ADDRESS,3164,3148,267 DECKER AVENUE,2,NaT,NaT


In [0]:
#Arrange X feature Matrix and y target vector
target = 'SALE_PRICE'
high_cardinality = ['ADDRESS',
                    'LAND_SQUARE_FEET',
                    'SALE_DATE',
                    'BUILDING_CLASS_AT_PRESENT', 
                    'BUILDING_CLASS_AT_TIME_OF_SALE', 
                    'TAX_CLASS_AT_PRESENT', 
                    'EASE-MENT']
features = train.columns.drop([target] + high_cardinality)

In [0]:
#X Matrix and y vector for train and test
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [0]:
#import category encoder and fit to train and test
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [148]:
X_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BOROUGH_3,2517.0,0.159317,0.366044,0.0,0.0,0.0,0.0,1.0
BOROUGH_4,2517.0,0.480334,0.499712,0.0,0.0,0.0,1.0,1.0
BOROUGH_2,2517.0,0.096146,0.29485,0.0,0.0,0.0,0.0,1.0
BOROUGH_5,2517.0,0.263806,0.440783,0.0,0.0,0.0,1.0,1.0
BOROUGH_1,2517.0,0.000397,0.019932,0.0,0.0,0.0,0.0,1.0
NEIGHBORHOOD_OTHER,2517.0,0.940803,0.236041,0.0,1.0,1.0,1.0,1.0
NEIGHBORHOOD_FLUSHING-NORTH,2517.0,0.030989,0.173323,0.0,0.0,0.0,0.0,1.0
NEIGHBORHOOD_EAST NEW YORK,2517.0,0.008741,0.0931,0.0,0.0,0.0,0.0,1.0
NEIGHBORHOOD_BEDFORD STUYVESANT,2517.0,0.003576,0.059702,0.0,0.0,0.0,0.0,1.0
NEIGHBORHOOD_FOREST HILLS,2517.0,0.006754,0.081921,0.0,0.0,0.0,0.0,1.0


#Feature Selection with SelectKBest

In [0]:
#Import SelectKBest and f_regression
from sklearn.feature_selection import SelectKBest, f_regression

In [158]:
#Figuring out how many features should be selected
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns)+1):
    print(f'{k} features')

    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)

    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

1 features
Test MAE: $185,788 

2 features
Test MAE: $186,656 

3 features
Test MAE: $185,097 

4 features
Test MAE: $179,647 

5 features
Test MAE: $180,190 

6 features
Test MAE: $179,774 

7 features
Test MAE: $174,930 

8 features
Test MAE: $166,934 

9 features
Test MAE: $166,405 

10 features
Test MAE: $165,528 

11 features
Test MAE: $167,066 

12 features
Test MAE: $162,853 

13 features
Test MAE: $162,853 

14 features
Test MAE: $163,513 

15 features
Test MAE: $163,546 

16 features


  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x)

Test MAE: $163,543 

17 features
Test MAE: $163,578 

18 features
Test MAE: $163,629 

19 features
Test MAE: $163,632 

20 features
Test MAE: $163,632 

21 features
Test MAE: $163,632 

22 features
Test MAE: $163,632 

23 features
Test MAE: $163,632 

24 features
Test MAE: $163,632 



  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


In [0]:
#Create the selector
selector = SelectKBest(score_func=f_regression, k=15)

In [160]:
#fit_transform on the train set
#transform on test set
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


In [161]:
#Which features were selected and which were not
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print('\n')
print('Features not selected:')
for name in unselected_names:
    print(name)

Features selected:
BOROUGH_3
BOROUGH_4
BOROUGH_2
BOROUGH_5
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_BOROUGH PARK
NEIGHBORHOOD_ASTORIA
BLOCK
ZIP_CODE
RESIDENTIAL_UNITS
COMMERCIAL_UNITS
TOTAL_UNITS
GROSS_SQUARE_FEET


Features not selected:
BOROUGH_1
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_BEDFORD STUYVESANT
BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS
LOT
APARTMENT_NUMBER_nan
APARTMENT_NUMBER_RP.
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE


#Ridge Regression 

In [0]:
#Import estimator
from sklearn.linear_model import RidgeCV

In [0]:
#Arrange X features matrix & y target vector
features = ['BOROUGH_3',
            'BOROUGH_4',
            'BOROUGH_2', 
            'BOROUGH_5',
            'NEIGHBORHOOD_OTHER',
            'NEIGHBORHOOD_FLUSHING-NORTH', 
            'NEIGHBORHOOD_FOREST HILLS',
            'NEIGHBORHOOD_BOROUGH PARK',
            'NEIGHBORHOOD_ASTORIA',
            'BLOCK',
            'ZIP_CODE',
            'RESIDENTIAL_UNITS',
            'COMMERCIAL_UNITS',
            'TOTAL_UNITS',
            'GROSS_SQUARE_FEET']

target = 'SALE_PRICE'

In [189]:
# Fit model to train

ridge = RidgeCV(alphas=(0.1, 1.0, 10), normalize=True)
ridge.fit(X_train, y_train)
ridge.alpha_

0.1

In [190]:
# Fit model to test
ridge.fit(X_test, y_test)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
        gcv_mode=None, normalize=True, scoring=None, store_cv_values=False)

#MAE for Test set

In [194]:
#Finding MAE for X_test
y_pred = ridge.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mae

158924.65283110898