# Feature Selection Techniques - House Prices

Home values are influenced by many factors, which fall into two broad groups:

 1. Environmental information, including location, local economy, school district, air quality, etc.
 2. Property characteristics, such as lot size, house size and age, number of rooms, heating / AC systems, garage, and so on.

When people consider buying a home, the location is usually already constrained to a certain area, for example not too far from the workplace.

**With the location largely fixed, the property characteristics weigh more heavily on the price.**

Many features describe the condition of a house, and they do not weigh equally in determining its value. I present a feature selection process to examine the key features affecting home values.


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore') 

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [None]:
train.shape

(1460, 81)

In [None]:
train = train.set_index("Id")

In [None]:
test = test.set_index("Id")
test

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,3,2010,WD,Normal
1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2006,WD,Normal
2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Abnorml
2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2006,WD,Abnorml
2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


### **Combine the two datasets**

In [None]:
df = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                test.loc[:,'MSSubClass':'SaleCondition']), ignore_index=False)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2006,WD,Normal
2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Abnorml
2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2006,WD,Abnorml
2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


**Preprocessing**

In [None]:
# Drop columns that are missing for most houses
df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'], inplace=True)

In [None]:
df.loc[df.MasVnrType.isnull(), 'MasVnrType'] = 'None'
df.loc[df.MasVnrType == 'None', 'MasVnrArea'] = 0
df.loc[df.LotFrontage.isnull(), 'LotFrontage'] = df.LotFrontage.median()
df.loc[df.MasVnrArea.isnull(), 'MasVnrArea'] = 0
df.loc[df.BsmtQual.isnull(), 'BsmtQual'] = 'NoBsmt'
df.loc[df.BsmtCond.isnull(), 'BsmtCond'] = 'NoBsmt'
df.loc[df.BsmtExposure.isnull(), 'BsmtExposure'] = 'NoBsmt'
df.loc[df.BsmtFinType1.isnull(), 'BsmtFinType1'] = 'NoBsmt'
df.loc[df.BsmtFinType2.isnull(), 'BsmtFinType2'] = 'NoBsmt'
df.loc[df.BsmtFinType1=='NoBsmt', 'BsmtFinSF1'] = 0
df.loc[df.BsmtFinType2=='NoBsmt', 'BsmtFinSF2'] = 0
df.loc[df.BsmtFinSF1.isnull(), 'BsmtFinSF1'] = df.BsmtFinSF1.median()
df.loc[df.BsmtQual=='NoBsmt', 'BsmtUnfSF'] = 0
df.loc[df.BsmtUnfSF.isnull(), 'BsmtUnfSF'] = df.BsmtUnfSF.median()
df.loc[df.BsmtQual=='NoBsmt', 'TotalBsmtSF'] = 0
df.loc[df.FireplaceQu.isnull(), 'FireplaceQu'] = 'NoFireplace'
df.loc[df.GarageType.isnull(), 'GarageType'] = 'NoGarage'
df.loc[df.GarageFinish.isnull(), 'GarageFinish'] = 'NoGarage'
df.loc[df.GarageQual.isnull(), 'GarageQual'] = 'NoGarage'
df.loc[df.GarageCond.isnull(), 'GarageCond'] = 'NoGarage'
df.loc[df.BsmtFullBath.isnull(), 'BsmtFullBath'] = 0
df.loc[df.BsmtHalfBath.isnull(), 'BsmtHalfBath'] = 0
df.loc[df.KitchenQual.isnull(), 'KitchenQual'] = 'TA'
df.loc[df.MSZoning.isnull(), 'MSZoning'] = 'RL'
df.loc[df.Utilities.isnull(), 'Utilities'] = 'AllPub'
df.loc[df.Exterior1st.isnull(), 'Exterior1st'] = 'VinylSd'
df.loc[df.Exterior2nd.isnull(), 'Exterior2nd'] = 'VinylSd'
df.loc[df.Functional.isnull(), 'Functional'] = 'Typ'
df.loc[df.SaleCondition.isnull(), 'SaleCondition'] = 'Normal'
df.loc[df['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
df.loc[df['SaleType'].isnull(), 'SaleType'] = 'NoSale'

df.loc[df.GarageYrBlt.isnull(), 'GarageYrBlt'] = df.GarageYrBlt.median()

df.loc[df['GarageArea'].isnull(), 'GarageArea'] = df.loc[df['GarageType']=='Detchd', 'GarageArea'].mean()
df.loc[df['GarageCars'].isnull(), 'GarageCars'] = df.loc[df['GarageType']=='Detchd', 'GarageCars'].median()

**Handling categorical data**

In [None]:
# Encode the binary CentralAir flag as 1/0
central_air_mapping = {'Y': 1, 'N': 0}
df['CentralAir'] = df['CentralAir'].map(central_air_mapping)

In [None]:
df = pd.get_dummies(df)

In [None]:
df.shape

(2919, 285)

In [None]:
#log transform the target:
train["SalePrice"] = np.log(train["SalePrice"])
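
Training on the log of `SalePrice` means the RMSE values reported later are in log units, so they measure relative rather than absolute error, and any prediction has to be exponentiated to get back to a price. A minimal sketch with made-up prices (the numbers and the helper name `rmse` are illustrative assumptions, not part of the dataset or the notebook):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical sale prices and predictions, each prediction 10% too high.
prices = [100_000.0, 200_000.0, 400_000.0]
preds = [p * 1.10 for p in prices]

# On the raw scale the most expensive house dominates the error.
raw_rmse = rmse(prices, preds)

# On the log scale every house contributes the same error, log(1.10).
log_rmse = rmse([math.log(p) for p in prices], [math.log(p) for p in preds])

# Predictions made on the log scale are mapped back to prices with exp.
recovered = [math.exp(math.log(p)) for p in preds]
```

An RMSE of roughly 0.15 on the log target therefore corresponds to a typical multiplicative error on the order of 15%, whatever the price level.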

In [None]:
#creating matrices for sklearn:
X_train = df[:train.shape[0]]
X_test = df[train.shape[0]:]

**Train and Test**

In [None]:
y = train.SalePrice.values

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(42)  # train_test_split uses NumPy's global RNG when random_state is not set
X = df[:train.shape[0]].values  # training rows only

X_train_main, X_test_main, y_train_main, y_test_main = train_test_split(X, y, test_size=0.3)

**Features on the same scale**

For each element, subtract the column mean and divide by the column standard deviation, so that every feature ends up with zero mean and unit variance:
$$ z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} $$
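
As a quick check of the standardization described above, the sketch below applies it by hand to the `LotArea` values of the first five training rows shown earlier. The helper name `standardize` is an assumption for illustration; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

def standardize(column):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

# LotArea of the first five training rows.
lot_area = [8450.0, 9600.0, 11250.0, 9550.0, 14260.0]
z = standardize(lot_area)

# After scaling, the column has mean 0 and standard deviation 1.
z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)
```

Scaling changes the units of a feature but not the shape of its distribution: a skewed column stays skewed after standardization.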

In [None]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std_main = stdsc.fit_transform(X_train_main)
X_test_std_main = stdsc.transform(X_test_main)

In [None]:
X_train_std_main.shape

(1022, 285)

### **Baseline Model**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
np.random.seed(9)

result = cross_val_score(RandomForestRegressor(), X_train_std_main, y_train_main, scoring='neg_root_mean_squared_error', cv=3)

print("RMSE:" ,-result.mean())

RMSE: 0.15123954413930024


### **Feature importance with random forests**

Using a random forest, we can measure feature importance as the impurity decrease averaged over all decision trees in the forest, without assuming that the relationship between the features and the target is linear.
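
To make the idea concrete, here is a minimal sketch on synthetic data (not the house-price data) in which only one column drives the target. The names `X_demo`, `y_demo` and `forest` are illustrative assumptions; `feature_importances_` is the scikit-learn attribute that holds the normalized impurity-based importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Only column 0 drives the target; columns 1-3 are pure noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_  # non-negative, sums to 1
ranking = np.argsort(importances)[::-1]    # most important column first
```

Impurity-based importances are computed on the training data and tend to favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure.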


In [None]:
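from sklearn.ensemble import RandomForestRegressor

# X_std_main is not defined earlier in this notebook; it is assumed here to be
# the standardized full training matrix.
X_std_main = StandardScaler().fit_transform(X)

model = RandomForestRegressor()
model.fit(X_std_main, y)

# get the top 50 features (the `_20` suffix in the variable name is historical)
feature_imp = pd.DataFrame(model.feature_importances_, index=df.columns, columns=["importance"])
feat_imp_20 = feature_imp.sort_values("importance", ascending=False).head(50).index
feat_imp_20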

Index(['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF',
       'BsmtFinSF1', 'GarageArea', 'YearBuilt', 'CentralAir', 'LotArea',
       'OverallCond', 'YearRemodAdd', '2ndFlrSF', 'LotFrontage', 'BsmtUnfSF',
       'MSZoning_RM', 'GarageYrBlt', 'GarageType_Detchd', 'OpenPorchSF',
       'MSZoning_C (all)', 'MoSold', 'WoodDeckSF', 'Fireplaces',
       'FireplaceQu_NoFireplace', 'MasVnrArea', 'FullBath', 'TotRmsAbvGrd',
       'BsmtQual_Gd', 'GarageType_Attchd', 'MSSubClass', 'EnclosedPorch',
       'GarageFinish_Unf', 'BedroomAbvGr', 'KitchenAbvGr', 'YrSold',
       'KitchenQual_Gd', 'KitchenQual_Ex', 'MSZoning_RL', 'BsmtQual_Ex',
       'ExterCond_Fa', 'BsmtFullBath', 'LotShape_Reg', 'KitchenQual_TA',
       'Neighborhood_Edwards', 'GarageQual_TA', 'GarageCond_TA',
       'SaleCondition_Abnorml', 'BsmtQual_TA', 'ExterQual_TA', 'HalfBath'],
      dtype='object')

In [None]:
np.random.seed(9)
df_f_imp = df[:train.shape[0]].filter(items=feat_imp_20)

# numpy
X_f_imp = df_f_imp.values

# scaling
X_train_std = stdsc.fit_transform(X_f_imp)

result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error', cv=3)
print("RMSE:", -result.mean())

RMSE: 0.1471065192445131


### **Feature selector that removes all low-variance features.**

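This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. For a Boolean feature that equals 1 with probability $p$, the variance is

$$ Var[X] = p(1-p)$$

so a variance cutoff of $p(1-p)$ with $p = 0.85$ removes one-hot columns that take the same value in more than 85% of the samples.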
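
A small worked example of this cutoff, using two hypothetical one-hot columns (the column names and row counts are made up for illustration):

```python
import statistics

threshold = 0.85
cutoff = threshold * (1 - threshold)  # p(1 - p) = 0.1275

# Two hypothetical one-hot columns with 20 rows each.
rare_flag = [1] * 2 + [0] * 18      # 10% ones -> variance 0.1 * 0.9 = 0.09
common_flag = [1] * 10 + [0] * 10   # 50% ones -> variance 0.5 * 0.5 = 0.25

variances = {
    "rare_flag": statistics.pvariance(rare_flag),
    "common_flag": statistics.pvariance(common_flag),
}

# Keep only the columns whose variance exceeds the cutoff.
kept = [name for name, var in variances.items() if var > cutoff]
```

The rule is only meaningful for unscaled Boolean columns: after standardization every column has variance 1, and for continuous features the variance depends on the units.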


In [None]:
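from sklearn.feature_selection import VarianceThreshold

# Keep features whose variance exceeds p(1 - p) with p = 0.85; for a binary
# feature this means neither value covers more than 85% of the rows.
threshold = 0.85
# X_main is not defined earlier in this notebook; the unscaled training matrix X is assumed here.
vt = VarianceThreshold().fit(X)

# Find feature names
feat_var_threshold = df.columns[vt.variances_ > threshold * (1-threshold)]

# show the first 50 surviving features (in column order, not ranked)
feat_var_threshold[0:50]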

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'MSZoning_RL',
       'LotShape_IR1', 'LotShape_Reg', 'LotConfig_Corner', 'LotConfig_Inside',
       'Neighborhood_NAmes', 'BldgType_1Fam', 'HouseStyle_1Story',
       'HouseStyle_2Story', 'RoofStyle_Gable', 'RoofStyle_Hip',
       'Exterior1st_HdBoard', 'Exterior1st_MetalSd', 'Exterior1st_VinylSd',
       'Exterior2nd_VinylSd', 'MasVnrType_BrkFace'],
      dtype='object')

In [None]:
np.random.seed(9)
df_var_t = df[:train.shape[0]].filter(items=feat_var_threshold[0:50])

X_vt = df_var_t.values

X_train_std = stdsc.fit_transform(X_vt)
                             
result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error', cv=3)
print("RMSE:",-result.mean())

RMSE: 0.1507205044578203


### **Univariate feature selection - SelectKBest/f_regression**

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method.
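
As an illustration of what a univariate test scores, the sketch below computes an F-statistic per column from its Pearson correlation with the target, F = r² / (1 − r²) · (n − 2), which is the kind of score `f_regression` is built on. It runs on synthetic data, and the helper name `f_scores` is an assumption, not a scikit-learn function.

```python
import numpy as np

def f_scores(X, y):
    """Univariate F-statistic of each column of X against y."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between each column and the target.
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return r ** 2 / (1 - r ** 2) * (n - 2)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Only column 1 is related to the target.
y_demo = 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

scores = f_scores(X_demo, y_demo)
top_k = np.argsort(scores)[::-1][:1]  # indices of the k=1 best-scoring features
```

Because each column is scored on its own, this approach only captures linear, one-at-a-time relationships and will happily select several features that carry the same information.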

In [None]:
from sklearn.feature_selection import f_regression, SelectKBest

X_scored = SelectKBest(score_func=f_regression, k=50).fit(X_std_main, y)
feature_scoring = pd.DataFrame({
        'feature': df.columns,
        'score': X_scored.scores_
    })

feat_scored_20 = feature_scoring.sort_values('score', ascending=False).head(50)['feature'].values
feat_scored_20

array(['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
       'TotalBsmtSF', '1stFlrSF', 'ExterQual_TA', 'FullBath', 'YearBuilt',
       'YearRemodAdd', 'KitchenQual_TA', 'TotRmsAbvGrd',
       'Foundation_PConc', 'FireplaceQu_NoFireplace', 'ExterQual_Gd',
       'GarageYrBlt', 'Fireplaces', 'BsmtQual_TA', 'HeatingQC_Ex',
       'BsmtQual_Ex', 'BsmtFinType1_GLQ', 'GarageFinish_Unf',
       'MasVnrArea', 'GarageFinish_Fin', 'GarageType_Attchd',
       'KitchenQual_Ex', 'KitchenQual_Gd', 'GarageType_Detchd',
       'MasVnrType_None', 'BsmtFinSF1', 'GarageCond_TA', 'ExterQual_Ex',
       'Neighborhood_NridgHt', 'CentralAir', 'MSZoning_RM',
       'FireplaceQu_Gd', 'Foundation_CBlock', 'Exterior2nd_VinylSd',
       'Exterior1st_VinylSd', 'HeatingQC_TA', 'LotFrontage',
       'BsmtQual_Gd', 'WoodDeckSF', 'GarageQual_TA', 'SaleType_New',
       'SaleCondition_Partial', 'GarageType_NoGarage',
       'GarageQual_NoGarage', 'GarageFinish_NoGarage',
       'GarageCond_NoGarage'], dtype=object)

In [None]:
np.random.seed(9)
df_kbest = df[:train.shape[0]].filter(items=feat_scored_20)

X_kb = df_kbest.values

X_train_std = stdsc.fit_transform(X_kb)

result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error', cv=3)
print("RMSE:",-result.mean())

RMSE: 0.14997347805846542


### **Recursive Feature Elimination - Random Forest Regressor**

Recursively eliminate the feature with the lowest importance.
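
The elimination loop can be sketched on a small synthetic problem. To keep it fast, this example swaps in `LinearRegression` for the random forest (a substitution made only for illustration) and uses made-up data in which columns 0 and 3 carry the signal.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Columns 0 and 3 carry the signal; the remaining four are noise.
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Drop one feature per iteration (step=1) until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_)  # column indices that survived
ranks = selector.ranking_                      # 1 = kept, larger = dropped earlier
```

With 285 columns and a random forest refitted at every step, `step=1` means hundreds of fits; a larger `step` removes several features per iteration and shortens the run at the cost of a coarser ranking.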

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(RandomForestRegressor(), n_features_to_select=50, verbose=2)
rfe.fit(X_train_std_main, y_train_main)

feature_rfe_scoring = pd.DataFrame({
        'feature': df.columns,
        'score': rfe.ranking_
    })

feat_rfe_20 = feature_rfe_scoring[feature_rfe_scoring['score'] == 1]['feature'].values
feat_rfe_20

Fitting estimator with 285 features.
Fitting estimator with 284 features.
Fitting estimator with 283 features.


KeyboardInterrupt: 

In [None]:
np.random.seed(9)
df_rfe = df[:train.shape[0]].filter(items=feat_rfe_20)

X_rfe = df_rfe.values

X_train_std = stdsc.fit_transform(X_rfe)


result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error', cv=3)
print("RMSE:",-result.mean())

RMSE: 0.14733246231422417


### **Recursive Feature elimination - Lasso** 

In [None]:
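from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Note: with the default alpha=1.0, Lasso can shrink every coefficient to zero on a
# log-scaled target, which makes the elimination order arbitrary; a smaller alpha may be needed.
rfe_lasso = RFE(Lasso(), n_features_to_select=50, verbose=2)
rfe_lasso.fit(X_std_main, y)

feature_rfe_scoring = pd.DataFrame({
        'feature': df.columns,
        'score': rfe_lasso.ranking_  # ranking from the Lasso-based selector (1 = selected)
    })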

feat_rfe_20_lasso = feature_rfe_scoring[feature_rfe_scoring['score'] == 1]['feature'].values
feat_rfe_20_lasso

Fitting estimator with 285 features.
Fitting estimator with 284 features.
Fitting estimator with 283 features.
Fitting estimator with 282 features.
Fitting estimator with 281 features.
Fitting estimator with 280 features.
Fitting estimator with 279 features.
Fitting estimator with 278 features.
Fitting estimator with 277 features.
Fitting estimator with 276 features.
Fitting estimator with 275 features.
Fitting estimator with 274 features.
Fitting estimator with 273 features.
Fitting estimator with 272 features.
Fitting estimator with 271 features.
Fitting estimator with 270 features.
Fitting estimator with 269 features.
Fitting estimator with 268 features.
Fitting estimator with 267 features.
Fitting estimator with 266 features.
Fitting estimator with 265 features.
Fitting estimator with 264 features.
Fitting estimator with 263 features.
Fitting estimator with 262 features.
Fitting estimator with 261 features.
Fitting estimator with 260 features.
Fitting estimator with 259 features.
F

AttributeError: 'RFE' object has no attribute 'ranking_'

In [None]:
np.random.seed(9)
df_rfe_lasso = df[:train.shape[0]].filter(items=feat_rfe_20_lasso)

X_rfe_lasso = df_rfe_lasso.values

X_train_std = stdsc.fit_transform(X_rfe_lasso)

result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error', cv=3)
print("RMSE:",-result.mean())

RMSE: 0.14733246231422417


### **Stack of algorithms**
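
Stacking here means taking the union of the shortlists produced by the individual selectors. A minimal sketch of that idea, with three short hypothetical shortlists built from column names seen above; the vote count is an extra illustration and is not used in the notebook:

```python
from collections import Counter

# Hypothetical shortlists from three selectors.
by_importance = ["OverallQual", "GrLivArea", "GarageCars"]
by_variance = ["LotArea", "GrLivArea", "OverallQual"]
by_f_score = ["OverallQual", "GarageCars", "FullBath"]

shortlists = [by_importance, by_variance, by_f_score]

# Union of all shortlists, sorted the way np.unique would return it.
stacked = sorted(set().union(*shortlists))

# How many selectors voted for each feature.
votes = Counter(name for shortlist in shortlists for name in shortlist)
unanimous = [name for name in stacked if votes[name] == len(shortlists)]
```

A union keeps every feature that any selector liked, so the stacked set grows quickly; requiring a minimum number of votes is one way to keep it tighter.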

In [None]:
features = np.hstack([
        feat_var_threshold[0:50], 
        feat_imp_20,
        feat_scored_20,
#         feat_rfe_20,
        feat_rfe_20_lasso
    ])

features_stack = np.unique(features)
print('Final features set:\n')
print(len(features_stack),"features:")
for f in features_stack:
    print("\t-{}".format(f))

Final features set:

87 features:
	-1stFlrSF
	-2ndFlrSF
	-3SsnPorch
	-BedroomAbvGr
	-BldgType_1Fam
	-BsmtFinSF1
	-BsmtFinSF2
	-BsmtFinType1_GLQ
	-BsmtFullBath
	-BsmtQual_Ex
	-BsmtQual_Gd
	-BsmtQual_TA
	-BsmtUnfSF
	-CentralAir
	-EnclosedPorch
	-ExterCond_Fa
	-ExterQual_Ex
	-ExterQual_Gd
	-ExterQual_TA
	-Exterior1st_HdBoard
	-Exterior1st_MetalSd
	-Exterior1st_VinylSd
	-Exterior2nd_VinylSd
	-FireplaceQu_Gd
	-FireplaceQu_NoFireplace
	-Fireplaces
	-Foundation_CBlock
	-Foundation_PConc
	-FullBath
	-GarageArea
	-GarageCars
	-GarageCond_NoGarage
	-GarageCond_TA
	-GarageFinish_Fin
	-GarageFinish_NoGarage
	-GarageFinish_Unf
	-GarageQual_NoGarage
	-GarageQual_TA
	-GarageType_Attchd
	-GarageType_Detchd
	-GarageType_NoGarage
	-GarageYrBlt
	-GrLivArea
	-HalfBath
	-HeatingQC_Ex
	-HeatingQC_TA
	-HouseStyle_1Story
	-HouseStyle_2Story
	-KitchenAbvGr
	-KitchenQual_Ex
	-KitchenQual_Gd
	-KitchenQual_TA
	-LotArea
	-LotConfig_Corner
	-LotConfig_Inside
	-LotFrontage
	-LotShape_IR1
	-LotShape_Reg
	-LowQualFinS

In [None]:
np.random.seed(9)
df_stack = df[:train.shape[0]].filter(items=features_stack)

X_stack = df_stack.values

X_train_std = stdsc.fit_transform(X_stack)

result = cross_val_score(RandomForestRegressor(), X_train_std, y, scoring='neg_root_mean_squared_error',cv = 3)
print("RMSE:",-result.mean())

RMSE: 0.14743557488252831
