# Feature selection

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import RFE, SelectKBest, f_regression

Data reading:

In [2]:
data = pd.read_excel('data/data_ford_price.xlsx') 
data.head()

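|   | price | year | condition | cylinders | odometer | title_status | transmission | drive | size | lat | long | weather |
|---|-------|------|-----------|-----------|----------|--------------|--------------|-------|------|-----|------|---------|
| 0 | 43900 | 2016 | 4 | 6 | 43500 | clean | automatic | 4wd | full-size | 36.4715 | -82.4834 | 59.0 |
| 1 | 15490 | 2009 | 2 | 8 | 98131 | clean | automatic | 4wd | full-size | 40.468826 | -74.281734 | 52.0 |
| 2 | 2495 | 2002 | 2 | 8 | 201803 | clean | automatic | 4wd | full-size | 42.477134 | -82.949564 | 45.0 |
| 3 | 1300 | 2000 | 1 | 8 | 170305 | rebuilt | automatic | 4wd | full-size | 40.764373 | -82.349503 | 49.0 |
| 4 | 13865 | 2010 | 3 | 8 | 166062 | clean | automatic | 4wd | NaN | 49.210949 | -123.11472 | NaN |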


Remove the text features and drop records containing NaN. Note that `cylinders` is treated as a numeric feature here. Strictly speaking, it is a nominal categorical feature, so one-hot encoding should be performed.

In [3]:
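data = data[['price', 'year', 'cylinders', 'odometer', 'lat', 'long', 'weather']]
# Reassign instead of using inplace=True: the frame above is a slice of the
# original, and in-place modification can trigger SettingWithCopyWarning.
data = data.dropna()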
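
The note about `cylinders` can be illustrated with a minimal sketch of one-hot encoding. The frame `toy` below is a small stand-in with illustrative values, not the notebook's `data`, and `drop_first=True` is one possible choice (an assumption here) to avoid perfectly collinear indicator columns in OLS.

```python
import pandas as pd

# Toy stand-in for the notebook's `data`; the values are illustrative only.
toy = pd.DataFrame({
    "price": [43900, 15490, 2495, 1300],
    "year": [2016, 2009, 2002, 2000],
    "cylinders": [6, 8, 8, 4],
})

# Expand `cylinders` into indicator columns, one per level.
# drop_first=True drops the lowest level (4) so the remaining
# indicators are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=["cylinders"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The rest of the notebook keeps `cylinders` numeric, so the results below are not affected by this sketch.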

Create X, y, then split them into train and test samples:

In [4]:
y = data['price']
X = data.drop(columns='price')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)

OLS model with no feature selection:

In [5]:
estimator_no_select = LinearRegression()
estimator_no_select.fit(X_train, y_train)
y_pred_test_no_select = estimator_no_select.predict(X_test)
print(f'Test MAE, no selection: {mean_absolute_error(y_test, y_pred_test_no_select): .3f}')

Test MAE, no selection:  4682.957


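Recursive feature elimination (RFE): the model is refitted repeatedly and the weakest feature is dropped at each step until three features remain.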

In [6]:
# RFE algorithm: most important features

estimator = LinearRegression()
rfe_selector = RFE(estimator, n_features_to_select=3, step=1)
rfe_feat_selected = rfe_selector.fit(X_train, y_train)
rfe_feat_selected.get_feature_names_out()

array(['year', 'cylinders', 'lat'], dtype=object)
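
Besides the selected names, a fitted `RFE` object also exposes `support_` (a boolean mask) and `ranking_` (1 for kept features, larger numbers for features eliminated earlier). The sketch below shows how these could be inspected; it uses synthetic data (`X_demo`, `y_demo` are hypothetical names) rather than the Ford dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features a, b and c drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y_demo = (5 * X_demo["a"] - 3 * X_demo["b"] + 2 * X_demo["c"]
          + rng.normal(scale=0.1, size=200))

rfe_demo = RFE(LinearRegression(), n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# ranking_ == 1 marks a kept feature; higher values were dropped earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
selected = X_demo.columns[rfe_demo.support_].tolist()
print(ranking)
print(selected)
```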

In [7]:
# RFE. Train and test samples
rfe_feats = list(rfe_feat_selected.get_feature_names_out())

X_train_rfe = X_train[rfe_feats]
X_test_rfe = X_test[rfe_feats]

OLS model trained on the RFE-selected features, evaluated by test MAE:

In [8]:
estimator_rfe = LinearRegression()
estimator_rfe.fit(X_train_rfe, y_train)
y_pred_test_rfe = estimator_rfe.predict(X_test_rfe)
print(f'Test MAE, RFE: {mean_absolute_error(y_test, y_pred_test_rfe): .3f}')

Test MAE, RFE:  5096.570
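
One possible explanation for the weaker RFE result, offered here as a hypothesis rather than a finding: with `LinearRegression`, RFE ranks features by coefficient magnitude, and that magnitude depends on the units of each feature. A feature measured in large units (such as an odometer reading) gets a tiny coefficient even when it matters. The sketch below illustrates the effect on synthetic data; all names and numbers are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X_toy = pd.DataFrame({
    "big_scale": rng.normal(100_000, 50_000, n),  # odometer-like units
    "small_scale": rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
# big_scale contributes most of the variance in y, but its raw coefficient is tiny.
y_toy = (-0.05 * X_toy["big_scale"] + 1000 * X_toy["small_scale"]
         + rng.normal(0, 100, n))

def rfe_pick(X, y):
    """Return the single feature RFE keeps for an OLS estimator."""
    rfe = RFE(LinearRegression(), n_features_to_select=1, step=1).fit(X, y)
    return X.columns[rfe.support_].tolist()

# Standardize so that coefficients become comparable across features.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_toy),
                        columns=X_toy.columns)

picked_raw = rfe_pick(X_toy, y_toy)
picked_scaled = rfe_pick(X_scaled, y_toy)
print("raw features:   ", picked_raw)
print("scaled features:", picked_scaled)
```

If the same effect is at work in this notebook, standardizing `X_train` before running RFE might change which features are selected. That has not been checked here.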


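SelectKBest: most important features. Both the inputs and the target are numeric, so a score based on Pearson correlation is a natural choice. `f_regression` provides one: it converts each feature's Pearson correlation with the target into an F-statistic.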

In [9]:
kbest_selector = SelectKBest(f_regression, k=3)
kbest_selector.fit(X_train, y_train)
 
kbest_selector.get_feature_names_out()

array(['year', 'cylinders', 'odometer'], dtype=object)
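
The link between `f_regression` and the Pearson correlation mentioned above can be checked numerically. For a single feature with correlation r and n samples, the F-statistic is r² / (1 − r²) · (n − 2). The sketch below verifies this on synthetic data; the array names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 100
X_syn = rng.normal(size=(n, 3))
# Feature 0 is strongly related to y, feature 1 weakly, feature 2 not at all.
y_syn = 2 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(size=n)

# Scores as computed by scikit-learn (univariate F-statistic and p-value).
f_scores, p_values = f_regression(X_syn, y_syn)

# The same scores rebuilt by hand from the Pearson correlation.
r = np.array([np.corrcoef(X_syn[:, j], y_syn)[0, 1] for j in range(X_syn.shape[1])])
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_scores, f_from_r))
```

Because F grows monotonically with |r|, ranking features by `f_regression` is equivalent to ranking them by absolute Pearson correlation with the target.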

In [10]:
kbest_feats = list(kbest_selector.get_feature_names_out())

X_train_kbest = X_train[kbest_feats]
X_test_kbest = X_test[kbest_feats]

OLS model trained on the SelectKBest features, evaluated by test MAE:

In [11]:
estimator_kbest = LinearRegression()
estimator_kbest.fit(X_train_kbest, y_train)
y_pred_test_kbest = estimator_kbest.predict(X_test_kbest)
print(f'Test MAE, SelectKBest: {mean_absolute_error(y_test, y_pred_test_kbest): .3f}')

Test MAE, SelectKBest:  4708.946


*The model with no feature selection gives the best result (test MAE 4682.957), since it is trained on all numeric features. SelectKBest raises the MAE only slightly (4708.946) while halving the number of features, which should shorten training. In terms of the training-time/MAE trade-off, SelectKBest is therefore the best feature selection method so far. RFE gives the worst result: its test MAE rises noticeably, to 5096.570.*
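
The comparison above rests on a single train/test split, and training time is not measured in the notebook. One way to check both, sketched here under the assumption that cross-validation is acceptable for this task, is to wrap each selector in a pipeline and let `cross_validate` report MAE and fit time. The data below is synthetic (`make_regression`), so the numbers it prints say nothing about the Ford dataset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 6 features, 3 of them informative.
X_syn, y_syn = make_regression(n_samples=500, n_features=6, n_informative=3,
                               noise=10.0, random_state=40)

candidates = {
    "no selection": make_pipeline(LinearRegression()),
    "RFE(3)": make_pipeline(RFE(LinearRegression(), n_features_to_select=3),
                            LinearRegression()),
    "SelectKBest(3)": make_pipeline(SelectKBest(f_regression, k=3),
                                    LinearRegression()),
}

summary = {}
for name, model in candidates.items():
    # Selection happens inside each fold, so the held-out fold never
    # influences which features are chosen.
    cv = cross_validate(model, X_syn, y_syn, cv=5,
                        scoring="neg_mean_absolute_error")
    summary[name] = {
        "mae": -cv["test_score"].mean(),
        "fit_time_s": cv["fit_time"].mean(),
    }
    print(f"{name:>15}: MAE={summary[name]['mae']:.3f}, "
          f"fit time={summary[name]['fit_time_s']:.4f}s")
```

Note that `fit_time` includes the cost of the selection step itself, which for RFE means several extra model fits.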