<a href="https://colab.research.google.com/github/ameralhomdy/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 3

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`). 

Use a subset of the data where the **sale price was more than \\$100 thousand and less than $2 million.** 

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.

- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Fit a ridge regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.


## Stretch Goals
- [ ] Add your own stretch goal(s) !
- [ ] Instead of `RidgeRegression`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `RidgeRegression`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module3')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv('../data/condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [6]:
import pandas_profiling
pandas_profiling.ProfileReport(df)



In [7]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [0]:
# dropping Sale Price values under 100,000

df = df[(df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000)]

In [0]:
# High NaN values

df = df.drop(columns=['EASE-MENT', 'APARTMENT_NUMBER'])

In [10]:
df.isnull().sum()

BOROUGH                            0
NEIGHBORHOOD                       0
BUILDING_CLASS_CATEGORY            0
TAX_CLASS_AT_PRESENT               0
BLOCK                              0
LOT                                0
BUILDING_CLASS_AT_PRESENT          0
ADDRESS                            0
ZIP_CODE                           0
RESIDENTIAL_UNITS                  0
COMMERCIAL_UNITS                   0
TOTAL_UNITS                        0
LAND_SQUARE_FEET                  37
GROSS_SQUARE_FEET                  0
YEAR_BUILT                        22
TAX_CLASS_AT_TIME_OF_SALE          0
BUILDING_CLASS_AT_TIME_OF_SALE     0
SALE_PRICE                         0
SALE_DATE                          0
dtype: int64

In [0]:
df = df.dropna()

In [0]:
df['LAND_SQUARE_FEET'] = df['LAND_SQUARE_FEET'].str.replace(',', '').astype(int)

In [0]:
df = df[df['LAND_SQUARE_FEET'] != 0]

In [0]:
# splitting the dataframe into test and train

df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)
cutoff = pd.to_datetime('2019-04-01')
train = df[df['SALE_DATE'] < cutoff]
test  = df[df['SALE_DATE'] >= cutoff]

In [0]:
# Excluding the non-numeric columns from the features with high cardinality for now

target = 'SALE_PRICE'
high_cardinality = ['ADDRESS', 'BUILDING_CLASS_AT_PRESENT', 'BUILDING_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_CATEGORY', 'SALE_DATE']
features = train.columns.drop([target] + high_cardinality)

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [0]:
# OneHotEncoding

import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.fit_transform(X_test)

In [17]:
X_train.head()

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_UPPER WEST SIDE (79-96),NEIGHBORHOOD_UPPER WEST SIDE (59-79),NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_GRAMERCY,NEIGHBORHOOD_UPPER EAST SIDE (59-79),NEIGHBORHOOD_UPPER EAST SIDE (79-96),TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_4,TAX_CLASS_AT_PRESENT_2,TAX_CLASS_AT_PRESENT_2A,TAX_CLASS_AT_PRESENT_1B,TAX_CLASS_AT_PRESENT_1A,TAX_CLASS_AT_PRESENT_2C,TAX_CLASS_AT_PRESENT_2B,TAX_CLASS_AT_PRESENT_1C,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,5495,801,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1
61,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,7918,72,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1
74,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3815,41,10472.0,2.0,0.0,2.0,4129,2112.0,1930.0,1
78,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,4210,19,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1
79,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3772,77,10472.0,3.0,0.0,3.0,2500,3600.0,1935.0,1


In [18]:
X_train.describe()

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_UPPER WEST SIDE (79-96),NEIGHBORHOOD_UPPER WEST SIDE (59-79),NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_GRAMERCY,NEIGHBORHOOD_UPPER EAST SIDE (59-79),NEIGHBORHOOD_UPPER EAST SIDE (79-96),TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_4,TAX_CLASS_AT_PRESENT_2,TAX_CLASS_AT_PRESENT_2A,TAX_CLASS_AT_PRESENT_1B,TAX_CLASS_AT_PRESENT_1A,TAX_CLASS_AT_PRESENT_2C,TAX_CLASS_AT_PRESENT_2B,TAX_CLASS_AT_PRESENT_1C,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
count,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0,6355.0
mean,0.273171,0.387254,0.1262,0.171518,0.041857,0.907946,0.021086,0.030212,0.001574,0.001574,0.014319,0.012746,0.003777,0.003619,0.000944,0.002203,0.778757,0.038238,0.086861,0.033202,0.015106,0.021558,0.020614,0.003619,0.002046,5579.845948,224.551062,10875.61605,1.727301,0.177498,2.175452,7774.220299,2446.591345,1879.456176,1.259166
std,0.445623,0.487161,0.332101,0.376991,0.200278,0.289124,0.143682,0.171185,0.03964,0.03964,0.118813,0.112185,0.061342,0.060056,0.030715,0.046888,0.415117,0.191784,0.281653,0.179178,0.121985,0.145246,0.142098,0.060056,0.045186,3813.738666,449.828489,1189.986399,3.948129,1.669204,6.576843,26870.767164,8347.297717,360.84188,0.649223
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,1.0,0.0,0.0,-1.0,0.0,58.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2569.0,22.0,10461.0,1.0,0.0,1.0,2000.0,1188.0,1920.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4951.0,47.0,11219.0,1.0,0.0,1.0,2592.0,1624.0,1935.0,1.0
75%,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7942.0,101.0,11373.0,2.0,0.0,2.0,4000.0,2310.0,1965.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,16312.0,5026.0,11694.0,155.0,50.0,156.0,484555.0,244619.0,2018.0,4.0


In [19]:
# feature selection with SelectKBest
from sklearn.feature_selection import f_regression, SelectKBest

selector = SelectKBest(score_func=f_regression, k=15)

# IMPORTANT!
# .fit_transform on the train set
# .transform on test set

X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
X_train_selected.shape, X_test_selected.shape

((6355, 15), (1695, 15))

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns)+1):
    
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

1 features
Test MAE: $286,104 

2 features
Test MAE: $272,377 

3 features
Test MAE: $319,003 

4 features
Test MAE: $297,512 

5 features
Test MAE: $304,495 

6 features
Test MAE: $302,067 

7 features
Test MAE: $305,864 

8 features
Test MAE: $317,536 

9 features
Test MAE: $377,184 

10 features
Test MAE: $380,773 

11 features
Test MAE: $397,012 

12 features
Test MAE: $397,503 

13 features
Test MAE: $389,714 

14 features
Test MAE: $430,576 

15 features
Test MAE: $435,210 

16 features
Test MAE: $435,944 

17 features
Test MAE: $371,879 

18 features
Test MAE: $379,478 

19 features
Test MAE: $378,499 

20 features
Test MAE: $378,571 

21 features
Test MAE: $378,760 

22 features
Test MAE: $416,037 

23 features
Test MAE: $415,117 

24 features
Test MAE: $416,489 

25 features
Test MAE: $416,489 

26 features
Test MAE: $416,881 

27 features
Test MAE: $416,834 

28 features
Test MAE: $416,285 

29 features
Test MAE: $416,285 

30 features
Test MAE: $398,659 

31 features
Test MA

In [24]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

for k in range(1, len(X_train_encoded.columns)+1):
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = selector.transform(X_test_scaled)
    
    model = RidgeCV()
    model.fit(X_train_selected, y_train)
    
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

NameError: ignored

In [21]:
# Correlation matrix!
X_train.corr()

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_UPPER WEST SIDE (79-96),NEIGHBORHOOD_UPPER WEST SIDE (59-79),NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_GRAMERCY,NEIGHBORHOOD_UPPER EAST SIDE (59-79),NEIGHBORHOOD_UPPER EAST SIDE (79-96),TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_4,TAX_CLASS_AT_PRESENT_2,TAX_CLASS_AT_PRESENT_2A,TAX_CLASS_AT_PRESENT_1B,TAX_CLASS_AT_PRESENT_1A,TAX_CLASS_AT_PRESENT_2C,TAX_CLASS_AT_PRESENT_2B,TAX_CLASS_AT_PRESENT_1C,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
BOROUGH_3,1.0,-0.48737,-0.232983,-0.278943,-0.128135,-0.079637,0.239398,-0.108207,-0.024338,-0.024338,0.196605,-0.069658,-0.037746,-0.036948,-0.018846,-0.028806,-0.076503,0.01219,-0.047386,0.073641,-0.003545,-0.001032,0.184453,0.015978,0.066035,-0.099426,0.031224,0.127584,0.048967,0.072543,0.088357,-0.079994,0.054696,-0.018596,0.052815
BOROUGH_4,-0.48737,1.0,-0.302121,-0.361719,-0.16616,-0.078726,-0.116676,0.222022,-0.03156,-0.03156,-0.095819,0.142926,0.077448,-0.047913,-0.024439,-0.037355,0.120221,-0.032178,-0.03414,-0.049962,-0.008412,-0.049052,-0.090326,-0.042533,-0.035993,0.487129,-0.082305,0.277924,-0.047859,-0.040416,-0.052156,-0.021181,-0.064598,0.025923,-0.081014
BOROUGH_2,-0.232983,-0.302121,1.0,-0.172917,-0.079431,0.121008,-0.055776,-0.067078,-0.015087,-0.015087,-0.045806,-0.043181,-0.023399,-0.022904,-0.011683,-0.017857,0.055296,0.037889,-0.10375,0.069749,-0.000447,-0.049885,-0.055135,0.040223,-0.017206,-0.139054,-0.112608,-0.16042,0.016049,-0.014863,-0.009779,-0.058735,-0.00757,0.005014,-0.000621
BOROUGH_5,-0.278943,-0.361719,-0.172917,1.0,-0.0951,0.144879,-0.066778,-0.08031,-0.018063,-0.018063,-0.054841,-0.051699,-0.028015,-0.027423,-0.013987,-0.02138,0.087648,-0.016715,-0.109206,-0.07034,0.029206,0.125033,-0.066011,-0.020471,-0.0206,-0.262917,-0.048721,-0.285759,-0.054958,-0.038884,-0.061586,0.12939,-0.041986,0.030331,-0.098055
BOROUGH_1,-0.128135,-0.16616,-0.079431,-0.0951,1.0,-0.104677,-0.030675,-0.036891,0.18994,0.18994,-0.025192,-0.023749,-0.012869,0.288354,0.147081,0.224811,-0.378882,0.019785,0.566079,-0.025576,-0.025885,-0.031024,0.024978,0.039742,0.007928,-0.2382,0.409163,-0.156005,0.084299,0.034736,0.062409,0.083349,0.127015,-0.087086,0.265151
NEIGHBORHOOD_OTHER,-0.079637,-0.078726,0.121008,0.144879,-0.104677,1.0,-0.460927,-0.554325,-0.124679,-0.124679,-0.378534,-0.356846,-0.193366,-0.189279,-0.096546,-0.147569,0.193511,-0.021659,-0.214885,0.004324,0.01266,0.039768,-0.149172,0.01919,-0.081957,0.186573,-0.19926,-0.046575,0.014817,0.016904,0.024634,0.004149,0.010729,0.009157,-0.142022
NEIGHBORHOOD_BEDFORD STUYVESANT,0.239398,-0.116676,-0.055776,-0.066778,-0.030675,-0.460927,1.0,-0.025905,-0.005826,-0.005826,-0.01769,-0.016676,-0.009036,-0.008845,-0.004512,-0.006896,-0.082731,-0.000707,0.005292,0.015594,-0.009197,-0.021785,0.240792,-0.008845,0.017596,-0.146572,0.066372,0.03208,0.014299,-0.009702,8.2e-05,-0.028086,0.032154,0.014992,0.057822
NEIGHBORHOOD_FLUSHING-NORTH,-0.108207,0.222022,-0.067078,-0.08031,-0.036891,-0.554325,-0.025905,1.0,-0.007007,-0.007007,-0.021274,-0.020055,-0.010867,-0.010638,-0.005426,-0.008294,-0.089743,0.031919,0.160998,-0.022447,-0.014323,-0.026199,-0.025607,-0.010638,-0.007991,-0.013635,0.117716,0.062563,-0.02623,-0.006653,-0.024699,-0.003543,-0.025141,0.014246,0.085306
NEIGHBORHOOD_UPPER WEST SIDE (79-96),-0.024338,-0.03156,-0.015087,-0.018063,0.18994,-0.124679,-0.005826,-0.007007,1.0,-0.001576,-0.004785,-0.004511,-0.002444,-0.002393,-0.00122,-0.001865,-0.074482,-0.007916,0.128718,-0.007357,-0.004917,-0.005893,-0.00576,-0.002393,-0.001797,-0.045334,0.075618,-0.028397,-0.000275,0.005292,-0.000455,0.018803,-0.002482,0.00829,0.045305
NEIGHBORHOOD_UPPER WEST SIDE (59-79),-0.024338,-0.03156,-0.015087,-0.018063,0.18994,-0.124679,-0.005826,-0.007007,-0.001576,1.0,-0.004785,-0.004511,-0.002444,-0.002393,-0.00122,-0.001865,-0.074482,-0.007916,0.128718,-0.007357,-0.004917,-0.005893,-0.00576,-0.002393,-0.001797,-0.046209,0.084135,-0.028447,-0.007314,-0.004222,-0.007096,0.023246,-0.00731,-0.12053,0.045305


In [26]:
# Now with regularization via ridge regression
from sklearn.linear_model import Ridge

ridge_reg = Ridge().fit(X, y)
mean_squared_error(y, ridge_reg.predict(X))

NameError: ignored

In [29]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for k in range(1, len(X_train.columns)+1):
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = selector.transform(X_test_scaled)
    
    model = RidgeCV()
    model.fit(X_train_selected, y_train)
    
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

1 features
Test MAE: $286,108 

2 features
Test MAE: $272,382 

3 features
Test MAE: $318,791 

4 features
Test MAE: $297,424 

5 features
Test MAE: $304,440 

6 features
Test MAE: $302,010 

7 features
Test MAE: $305,806 

8 features
Test MAE: $317,192 

9 features
Test MAE: $376,081 

10 features
Test MAE: $379,632 

11 features
Test MAE: $395,728 

12 features
Test MAE: $396,215 

13 features
Test MAE: $388,282 

14 features
Test MAE: $429,284 

15 features
Test MAE: $433,884 

16 features
Test MAE: $434,581 

17 features
Test MAE: $375,285 

18 features
Test MAE: $383,102 

19 features
Test MAE: $382,126 

20 features
Test MAE: $382,197 

21 features
Test MAE: $382,386 

22 features
Test MAE: $420,834 

23 features
Test MAE: $419,971 

24 features
Test MAE: $420,817 

25 features
Test MAE: $420,778 

26 features
Test MAE: $421,190 

27 features
Test MAE: $421,200 

28 features
Test MAE: $420,656 

29 features
Test MAE: $420,790 

30 features
Test MAE: $404,191 

31 features
Test MA

In [31]:
k = 27
selector = SelectKBest(score_func=f_regression, k=k)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)

all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print('\nFeatures not selected:')
for name in unselected_names:
    print(name)

Features selected:
BOROUGH_3
BOROUGH_2
BOROUGH_5
BOROUGH_1
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_UPPER WEST SIDE (59-79)
NEIGHBORHOOD_BOROUGH PARK
NEIGHBORHOOD_ASTORIA
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_GRAMERCY
NEIGHBORHOOD_UPPER EAST SIDE (79-96)
TAX_CLASS_AT_PRESENT_1
TAX_CLASS_AT_PRESENT_4
TAX_CLASS_AT_PRESENT_2
TAX_CLASS_AT_PRESENT_2A
TAX_CLASS_AT_PRESENT_1A
TAX_CLASS_AT_PRESENT_2C
TAX_CLASS_AT_PRESENT_2B
TAX_CLASS_AT_PRESENT_1C
BLOCK
ZIP_CODE
RESIDENTIAL_UNITS
LAND_SQUARE_FEET
GROSS_SQUARE_FEET
TAX_CLASS_AT_TIME_OF_SALE

Features not selected:
BOROUGH_4
NEIGHBORHOOD_UPPER WEST SIDE (79-96)
NEIGHBORHOOD_UPPER EAST SIDE (59-79)
TAX_CLASS_AT_PRESENT_1B
LOT
COMMERCIAL_UNITS
TOTAL_UNITS
YEAR_BUILT
