# (Extra) Linear Regression with Feature Selection

In this notebook, we fit a `LinearRegression` model on a subset of the features. `SequentialFeatureSelector` picks the numerical features, and the categorical features are chosen by hand.

In [26]:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Prepare data

In [27]:
# Load the train data
train_data = pd.read_csv('../data/houses_train.csv', index_col=0)

In [28]:
# Split data into features and labels.
X_data = train_data.drop(columns='price')
y_data = train_data['price']

In [29]:
# Split features and labels into train (X_train, y_train) and validation set (X_val, y_val).
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, stratify=X_data['object_type_name'], test_size=0.1)

# Find best features

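We use `SequentialFeatureSelector` to find the best three numerical features. By default it performs forward selection: it starts with no features and repeatedly adds the one that most improves the cross-validated score of the estimator.

The categorical features are selected by hand.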

External resource: https://scikit-learn.org/stable/modules/feature_selection.html
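
To make the selection step more concrete, here is a minimal sketch on synthetic data. The column names `f0` to `f5` and the target formula are made up for illustration and have nothing to do with the house dataset. Because the target depends only on `f0`, `f2` and `f4`, forward selection should recover those three columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, not the house dataset):
# the target depends only on columns f0, f2 and f4.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = 3.0 * X["f0"] - 2.0 * X["f2"] + 1.5 * X["f4"] + 0.1 * rng.normal(size=200)

# Forward selection (the default direction): start with no features and
# repeatedly add the one that gives the best cross-validated score.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3)
sfs.fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = list(X.columns[sfs.get_support()])
print("Features selected:", selected)
```

The boolean mask from `get_support()` is what the notebook uses below to index `X_train_numeric.columns`.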

In [30]:
categorical_features = ['zipcode', 'municipality_name', 'object_type_name']
selected_categorical_features = ['zipcode', 'object_type_name']

X_train_numeric = X_train.drop(columns=categorical_features)

In [39]:
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3)

# Fit the feature selector on the numerical training features.
_ = sfs.fit(X_train_numeric, y_train)

support_columns = list(X_train_numeric.columns[sfs.get_support()]) + selected_categorical_features

print("Features selected: ", support_columns)

# Define model: one-hot encode the categorical columns, pass the numerical ones through.
model = Pipeline([
    ('pre', make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), selected_categorical_features),
        remainder='passthrough',
    )),
    ('reg', LinearRegression())
])

_ = model.fit(X_train[support_columns], y_train)

Features selected:  ['living_area', 'num_rooms', 'travel_time_public_transport', 'zipcode', 'object_type_name']


# Predict and evaluate prices for the validation set

In [32]:
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
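
As a quick sanity check of this metric, the following sketch evaluates it on three made-up prices and compares it with scikit-learn's built-in `mean_absolute_percentage_error`, which returns a fraction rather than a percentage. The numbers are purely illustrative. Note that the metric is only meaningful when no true value is zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error as sk_mape

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Made-up prices for illustration only.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Relative errors are 10 %, 10 % and 0 %, so the mean is about 6.67 %.
mape_percent = mean_absolute_percentage_error(y_true, y_pred)

# scikit-learn's built-in metric returns a fraction instead of a percentage.
mape_fraction = sk_mape(y_true, y_pred)

print(mape_percent, mape_fraction * 100)
```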

In [33]:
# Predict prices for the validation data.
y_val_pred = model.predict(X_val[support_columns])

In [34]:
# How good are we on the validation data?
print(mean_absolute_percentage_error(y_val, y_val_pred))

33.349669965456066


Wait, the performance is worse than with all features! Why is that?

In this example we have a lot of data (20,000 samples) and we are only using a linear model. Therefore, the model is **underfitting the data**. Fewer features make the model even less complex, so it underfits even more.

**In another problem with less data, the model may perform better with fewer features.**

Note that even with the drop in performance, a model with fewer features can still be the better choice for this example: it is **easier** to explain to a customer, and in a product a user has to **enter fewer inputs** to get a prediction.
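
The two situations described above can be reproduced with a small sketch on synthetic data. Everything here (the helper names `make_data` and `validation_mae`, the sample sizes, the number of features) is an assumption chosen for illustration, not part of the house price experiment. In the first case there is plenty of data and every feature is informative, so dropping features should hurt. In the second case there are only 30 training rows and most features are noise, so dropping features should help.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def make_data(n_rows, n_features, n_informative):
    """Synthetic regression data: only the first n_informative columns matter."""
    X = rng.normal(size=(n_rows, n_features))
    y = X[:, :n_informative].sum(axis=1) + rng.normal(size=n_rows)
    return X, y

def validation_mae(X_train, y_train, X_val, y_val, keep):
    """Fit on the first `keep` columns and return the validation MAE."""
    model = LinearRegression().fit(X_train[:, :keep], y_train)
    return mean_absolute_error(y_val, model.predict(X_val[:, :keep]))

# Case 1: plenty of data, every feature is informative.
X, y = make_data(7000, 6, 6)
big_all = validation_mae(X[:5000], y[:5000], X[5000:], y[5000:], keep=6)
big_sub = validation_mae(X[:5000], y[:5000], X[5000:], y[5000:], keep=3)

# Case 2: very little data, most features are pure noise.
X, y = make_data(2030, 25, 3)
small_all = validation_mae(X[:30], y[:30], X[30:], y[30:], keep=25)
small_sub = validation_mae(X[:30], y[:30], X[30:], y[30:], keep=3)

print(f"large data: all={big_all:.2f}, subset={big_sub:.2f}")
print(f"small data: all={small_all:.2f}, subset={small_sub:.2f}")
```

With a large data set the linear model has no trouble estimating all coefficients, so removing informative features only removes signal. With a tiny data set the extra coefficients are fitted to noise, and the smaller model generalizes better.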

# Predict prices for test set

In [35]:
# Load the test set
test_data = pd.read_csv('../data/houses_test.csv', index_col=0)

In [36]:
# Split data into features and labels.
X_test = test_data.drop(columns='price')
y_test = test_data['price']

In [37]:
# Keep only the selected features, then predict and report the MAPE (in percent) on the test set.
X_test_sub = X_test[support_columns]

y_test_pred = model.predict(X_test_sub)

print(mean_absolute_percentage_error(y_test, y_test_pred))

31.89291408638914
