<a href="https://colab.research.google.com/github/maiali13/DS-Unit-2-Linear-Models/blob/master/M_Ali_DS13_U2_S1_Regression_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand: use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set).
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you're interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
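
As a rough illustration of the `RidgeCV` stretch goal, the sketch below lets scikit-learn choose the regularization strength from a list of candidates. The synthetic data and the candidate `alphas` are assumptions made up for the example, not values tuned for the NYC sales data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)

# Candidate regularization strengths; RidgeCV picks one by cross-validation
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model_cv = RidgeCV(alphas=alphas)
model_cv.fit(X, y)

print('chosen alpha:', model_cv.alpha_)
```

On the real data, the same call would take the scaled training features and `SALE_PRICE` in place of `X` and `y`.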

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # '$' is a regex anchor, so match it literally
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
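
The cleaning step above depends on `$` being treated as a literal character rather than as the regex end-of-string anchor. A minimal sketch on made-up price strings (the sample values are assumptions, not rows from the CSV) shows the explicit `regex=False` spelling, which behaves the same across pandas versions:

```python
import pandas as pd

# Hypothetical price strings, not taken from the real file
prices = pd.Series(['$550,000', '$1,100,000', '$0'])

cleaned = (
    prices
    .str.replace('$', '', regex=False)  # literal dollar sign
    .str.replace(',', '', regex=False)  # thousands separator
    .astype(int)
)
print(cleaned.tolist())
```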

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
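
To see the relabeling on a small scale, here is a sketch on a made-up column. The toy values and the use of a top 2 instead of a top 10 are assumptions for illustration only.

```python
import pandas as pd

# Toy neighborhood column: 'A' and 'B' are frequent, 'C' and 'D' are rare
toy = pd.DataFrame({'NEIGHBORHOOD': ['A', 'A', 'A', 'B', 'B', 'C', 'D']})

# Labels of the 2 most frequent values (value_counts sorts descending)
top2 = toy['NEIGHBORHOOD'].value_counts().index[:2]

# Everything outside the top 2 collapses into a single 'OTHER' level
toy.loc[~toy['NEIGHBORHOOD'].isin(top2), 'NEIGHBORHOOD'] = 'OTHER'
print(toy['NEIGHBORHOOD'].value_counts())
```

Collapsing rare levels this way keeps the later one-hot encoding from producing one column per neighborhood.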

In [5]:
df.tail()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
23035,4,OTHER,01 ONE FAMILY DWELLINGS,1,10965,276,,A5,111-17 FRANCIS LEWIS BLVD,,11429.0,1.0,0.0,1.0,1800,1224.0,1945.0,1,A5,510000,04/30/2019
23036,4,OTHER,09 COOPS - WALKUP APARTMENTS,2,169,29,,C6,"45-14 43RD STREET, 3C",,11104.0,0.0,0.0,0.0,0,0.0,1929.0,2,C6,355000,04/30/2019
23037,4,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,131,4,,D4,"50-05 43RD AVENUE, 3M",,11377.0,0.0,0.0,0.0,0,0.0,1932.0,2,D4,375000,04/30/2019
23038,4,OTHER,02 TWO FAMILY DWELLINGS,1,8932,18,,S2,91-10 JAMAICA AVE,,11421.0,2.0,1.0,3.0,2078,2200.0,1931.0,1,S2,1100000,04/30/2019
23039,4,OTHER,12 CONDOS - WALKUP APARTMENTS,2,1216,1161,,R2,"61-05 39TH AVENUE, F5",F5,11377.0,1.0,0.0,85.0,15151,854.0,1927.0,2,R2,569202,04/30/2019


In [6]:
#needs more data cleaning in land sq feet column 
df['LAND_SQUARE_FEET'] = df['LAND_SQUARE_FEET'].str.replace(",", "")
df['LAND_SQUARE_FEET'].isnull().value_counts()

False    22987
True        53
Name: LAND_SQUARE_FEET, dtype: int64

In [0]:
df['LAND_SQUARE_FEET'] = pd.to_numeric(df['LAND_SQUARE_FEET'], errors='coerce')

In [0]:
#make datetime
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)

Use a subset of the data where BUILDING_CLASS_CATEGORY == '01 ONE FAMILY DWELLINGS' and the sale price was more than 100 thousand and less than 2 million.

In [9]:
#new df for single family homes
sfam = df[(df['SALE_PRICE'] > 100000) & (df['SALE_PRICE']< 2000000) & (df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')]
sfam.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800.0,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000.0,2001.0,1940.0,1,A1,200000,2019-01-01
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500.0,2043.0,1925.0,1,A1,810000,2019-01-02
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000.0,2680.0,1899.0,1,A1,125000,2019-01-02
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710.0,1872.0,1940.0,1,A5,620000,2019-01-02


In [10]:
print(sfam.shape)
#sfam.describe()

(3151, 21)


Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.

In [11]:
#train Jan-March
train = sfam[sfam.SALE_DATE.dt.month < 4]
#test only april
test = sfam[sfam.SALE_DATE.dt.month == 4]
train.shape,test.shape

((2507, 21), (644, 21))
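
Filtering on `dt.month` works as long as the file covers a single year, which the assignment implies. A slightly more defensive sketch splits on the full date instead; the cutoff variable and the toy rows below are assumptions for illustration.

```python
import pandas as pd

# Made-up sales, two on each side of the cutoff
toy = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(['2019-01-15', '2019-03-31', '2019-04-01', '2019-04-30']),
    'SALE_PRICE': [500000, 620000, 710000, 455000],
})

cutoff = pd.Timestamp('2019-04-01')  # hypothetical name for the split date

train_toy = toy[toy['SALE_DATE'] < cutoff]    # January through March
test_toy = toy[toy['SALE_DATE'] >= cutoff]    # April onward
print(train_toy.shape, test_toy.shape)
```

Comparing against a timestamp keeps the split correct even if a later version of the file includes the same months from another year.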

In [12]:
train['SALE_DATE'].dt.month.value_counts()

1    947
3    799
2    761
Name: SALE_DATE, dtype: int64

In [13]:
train.describe()

Unnamed: 0,BLOCK,LOT,EASE-MENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
count,2507.0,2507.0,0.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0
mean,6758.303949,75.778221,,10993.398484,0.987635,0.016354,1.003989,3146.051057,1473.744715,1944.766653,1.0,621573.7
std,3975.909029,157.531138,,494.291462,0.110532,0.129966,0.171794,1798.714872,599.217635,27.059337,0.0,291607.2
min,21.0,1.0,,10301.0,0.0,0.0,0.0,0.0,0.0,1890.0,1.0,104000.0
25%,3837.5,21.0,,10314.0,1.0,0.0,1.0,2000.0,1144.0,1925.0,1.0,440500.0
50%,6022.0,42.0,,11234.0,1.0,0.0,1.0,2600.0,1368.0,1940.0,1.0,560000.0
75%,9888.5,70.0,,11413.0,1.0,0.0,1.0,4000.0,1683.0,1960.0,1.0,750000.0
max,16350.0,2720.0,,11697.0,1.0,2.0,3.0,18906.0,7875.0,2018.0,1.0,1955000.0


In [14]:
test.describe(include='object')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,BUILDING_CLASS_AT_TIME_OF_SALE
count,644,644,644,644,644,644,0.0,644
unique,5,7,1,2,11,643,0.0,10
top,4,OTHER,01 ONE FAMILY DWELLINGS,1,A1,46-12 30TH ROAD,,A1
freq,376,599,644,635,266,2,,267


In [0]:
target = 'SALE_PRICE'

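#drop columns that can't be used as features:
#ADDRESS is nearly unique per row, APARTMENT_NUMBER and EASE-MENT are mostly NaN,
#and SALE_DATE is a datetime, which the model can't take directly
high_cardinality = ['ADDRESS', 'APARTMENT_NUMBER', 'EASE-MENT', 'SALE_DATE']

#features are every remaining column except the target
features = train.columns.drop([target] + high_cardinality)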


In [0]:
#assign train and test features
x_train = train[features]
y_train = train[target]
x_test = test[features]
y_test = test[target]

In [30]:
sfam.shape, x_train.shape, x_test.shape
#we dropped 5 columns 

((3151, 21), (2507, 16), (644, 16))

 Do one-hot encoding of categorical features.

In [0]:
import category_encoders as ce
#link https://contrib.scikit-learn.org/categorical-encoding/onehot.html
encoder = ce.OneHotEncoder(use_cat_names=True)
x_train = encoder.fit_transform(x_train)
x_test = encoder.transform(x_test)

Do feature selection with SelectKBest.

In [0]:
#Squelch annoying warnings
import warnings
warnings.simplefilter('ignore')

In [0]:
from sklearn.feature_selection import SelectKBest, f_regression
#keep the 15 best-scoring features out of the 49 encoded columns
selector = SelectKBest(score_func=f_regression, k=15)

#fit_transform training data, then .transform testing data
x_train_selected = selector.fit_transform(x_train, y_train)
x_train_selected.shape
x_test_selected = selector.transform(x_test)
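
`SelectKBest` with `f_regression` scores each column by its univariate linear relationship with the target and keeps the top `k`. A small sketch on synthetic data (the column names and coefficients are assumptions made up for the example) makes the scoring visible:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal': rng.normal(size=200),   # drives the target
    'noise_a': rng.normal(size=200),  # unrelated to the target
    'noise_b': rng.normal(size=200),  # unrelated to the target
})
y = 3 * X['signal'] + rng.normal(scale=0.1, size=200)

selector_toy = SelectKBest(score_func=f_regression, k=1)
selector_toy.fit(X, y)

# get_support() is a boolean mask over the input columns
selected = X.columns[selector_toy.get_support()].tolist()
print(selected)
print(dict(zip(X.columns, selector_toy.scores_.round(1))))
```

Because each column is scored on its own, two highly correlated columns can both be selected even though the second adds little.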

In [40]:
#select the features
all_names = x_train.columns
s_mask = selector.get_support()
s_names = all_names[s_mask]
uns_names = all_names[~s_mask]

print('15 Selected Features:')
for name in s_names:
  print(name)
print('\n')
print('Remaining Unselected Features:')
for name in uns_names:
  print(name)

15 Selected Features:
BOROUGH_3
BOROUGH_2
BOROUGH_5
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
BLOCK
BUILDING_CLASS_AT_PRESENT_A5
BUILDING_CLASS_AT_PRESENT_A3
ZIP_CODE
COMMERCIAL_UNITS
TOTAL_UNITS
LAND_SQUARE_FEET
GROSS_SQUARE_FEET
BUILDING_CLASS_AT_TIME_OF_SALE_A3


Remaining Unselected Features:
BOROUGH_4
BOROUGH_1
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_BOROUGH PARK
NEIGHBORHOOD_ASTORIA
BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS
TAX_CLASS_AT_PRESENT_1
TAX_CLASS_AT_PRESENT_1D
LOT
BUILDING_CLASS_AT_PRESENT_A9
BUILDING_CLASS_AT_PRESENT_A1
BUILDING_CLASS_AT_PRESENT_A0
BUILDING_CLASS_AT_PRESENT_A2
BUILDING_CLASS_AT_PRESENT_S1
BUILDING_CLASS_AT_PRESENT_A4
BUILDING_CLASS_AT_PRESENT_A6
BUILDING_CLASS_AT_PRESENT_A8
BUILDING_CLASS_AT_PRESENT_B2
BUILDING_CLASS_AT_PRESENT_S0
BUILDING_CLASS_AT_PRESENT_B3
RESIDENTIAL_UNITS
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE
BUILDING_CLASS_AT_TIME_OF_SALE_A9
BUILDING_CLASS_AT_TIME_OF_SALE_A1
BUIL

Fit a ridge regression model with multiple features. Use the normalize=True parameter (or do feature scaling beforehand: use the scaler's fit_transform method with the train set, and the scaler's transform method with the test set)

 Get mean absolute error for the test set.
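
A minimal sketch of these two steps, assuming the feature-scaling route with `StandardScaler` (the `normalize` parameter was removed from scikit-learn's linear models in version 1.2). The data here is synthetic and `alpha=1.0` is an arbitrary choice; on the notebook's data, `x_train_selected`, `x_test_selected`, `y_train`, and `y_test` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features and a price-like target
rng = np.random.default_rng(42)
X_tr = rng.normal(loc=[1500, 3000], scale=[500, 1500], size=(300, 2))
X_te = rng.normal(loc=[1500, 3000], scale=[500, 1500], size=(100, 2))
true_coef = np.array([300.0, 20.0])
y_tr = X_tr @ true_coef + rng.normal(scale=50000, size=300)
y_te = X_te @ true_coef + rng.normal(scale=50000, size=100)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from train only
X_te_scaled = scaler.transform(X_te)      # reuse the train statistics on test

ridge = Ridge(alpha=1.0)
ridge.fit(X_tr_scaled, y_tr)

mae = mean_absolute_error(y_te, ridge.predict(X_te_scaled))
print(f'Test MAE: ${mae:,.0f}')
```

Scaling matters for ridge because the penalty acts on coefficient size, so features measured in large units would otherwise be penalized differently from features measured in small ones.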

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [45]:
for k in range(1, len(x_train.columns)+1):
  print(f'{k} features')

#do SelectKBest fit and transform
selector = SelectKBest(score_func=f_regression, k=k)
x_train_selected = selector.fit_transform(x_train, y_train)
x_test_selected = selector.transform(x_test)

#now lin regression on the selected dfs
model = LinearRegression()
model.fit(x_train_selected, y_train)

#now predict the y (price) using the selected x_test features
y_pred = model.predict(x_test_selected)
error = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: ${error:,.0f} \n")
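#the number of features (k) with the lowest mean absolute error is the best k to use for this model
#only the print call above is indented, so it is the entire loop body;
#the selection, fit, and MAE steps run once after the loop, with k at its final value (49)
#indent them under the for loop to get an MAE for each k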

1 features
2 features
3 features
4 features
5 features
6 features
7 features
8 features
9 features
10 features
11 features
12 features
13 features
14 features
15 features
16 features
17 features
18 features
19 features
20 features
21 features
22 features
23 features
24 features
25 features
26 features
27 features
28 features
29 features
30 features
31 features
32 features
33 features
34 features
35 features
36 features
37 features
38 features
39 features
40 features
41 features
42 features
43 features
44 features
45 features
46 features
47 features
48 features
49 features
Test MAE: $154,922 
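
The output above shows a single MAE because the modeling steps sat outside the loop. A hedged sketch of the per-`k` version follows, with the whole body indented and `Ridge` in place of `LinearRegression`, as the assignment asks. It uses synthetic data and a scikit-learn pipeline, both assumptions for illustration; on the notebook's data, `x_train`, `x_test`, `y_train`, and `y_test` would replace the toy arrays.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six columns carry signal
rng = np.random.default_rng(0)
n_features = 6
X_tr = rng.normal(size=(300, n_features))
X_te = rng.normal(size=(100, n_features))
coef = np.array([5.0, 3.0, 0.0, 0.0, 0.0, 0.0])
y_tr = X_tr @ coef + rng.normal(size=300)
y_te = X_te @ coef + rng.normal(size=100)

maes = {}
for k in range(1, n_features + 1):
    # Everything in this block is indented, so it runs once per k
    pipe = make_pipeline(
        StandardScaler(),                           # scale using train statistics
        SelectKBest(score_func=f_regression, k=k),  # keep the k best columns
        Ridge(alpha=1.0),                           # alpha chosen arbitrarily
    )
    pipe.fit(X_tr, y_tr)
    maes[k] = mean_absolute_error(y_te, pipe.predict(X_te))
    print(f'{k} features, test MAE: {maes[k]:.3f}')
```

Picking `k` by test MAE reuses the test set for tuning, so a validation split or cross-validation on the training months would give a cleaner estimate.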

