<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

Moving beyond basic feature selection methods, this lab introduces forward feature selection. Starting from an empty model, we add one feature at a time and keep a feature only if it improves the model's adjusted R-squared score. By evaluating the impact of each candidate feature in turn, we aim to build a regression model that uses only the predictors that measurably improve the fit.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new variable no longer improves the model's performance.

Create a regression model using Forward Feature Selection by looping over all the features, adding one at a time until there is no further improvement in the prediction metric ($R^2$ and adjusted $R^2$ in this case).

#### 5.1 Load Wine Data & Define Predictor and Target

In [16]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv('/Users/francescafelizardo/Documents/Francesca/IOD - UTS - Data Analytics and AI Program/Modules/DATA/winequality_merged.csv')

# define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']
# Build the predictor data frame X from those columns
X = pd.DataFrame(wine, columns = predictor_columns)

In [18]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#### 5.2 Overview of the code below

The outer `while` loop runs until a full pass over the remaining features yields no improvement. This is tracked by the flag `changed`: the loop exits when `changed` stays `False`.
The inner `for` loop goes over each feature not yet included in the model, fits a model with that feature added, and calculates $R^2$ and adjusted $R^2$ on the training data. If a candidate improves on the previous best adjusted $R^2$, the record in `best` is updated.
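The control flow described above can be sketched independently of any regression model. This is a minimal illustration, not the lab's implementation: `forward_select`, `toy_score`, and the `gains` values are hypothetical names and numbers, with a simple scoring function standing in for adjusted $R^2$.

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection; score_fn(list_of_features) returns a float to maximise."""
    included = []
    best_score = 0.0
    while True:
        changed = False
        best_feature = None
        # Try each feature that is not yet in the model
        for feature in candidates:
            if feature in included:
                continue
            score = score_fn(included + [feature])
            if score > best_score:
                best_score = score
                best_feature = feature
                changed = True
        if not changed:
            break  # no candidate improved the score, so stop
        included.append(best_feature)
    return included, best_score


# Hypothetical per-feature gains, with a fixed penalty for every feature used
gains = {"a": 0.30, "b": 0.20, "c": 0.01}

def toy_score(features):
    return sum(gains[f] for f in features) - 0.05 * len(features)

selected, score = forward_select(list(gains), toy_score)
print(selected, round(score, 2))
```

Here feature `c` is never added because its gain is smaller than the per-feature penalty, which mirrors how adjusted $R^2$ can reject a predictor that raises $R^2$ only slightly.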

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary to keep record of the best model found at any stage; starts 'empty'.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()`
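As a quick illustration of these two methods, the sketch below fits a model and then scores it on a tiny synthetic dataset. The names `X_demo`, `y_demo`, and `demo_model` are made up for this example and are not part of the lab data. For a regressor, `score()` returns $R^2$ on the data passed to it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1 (illustrative only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)               # estimate slope and intercept
demo_r2 = demo_model.score(X_demo, y_demo)   # R^2 on the same data

print(demo_model.coef_, demo_model.intercept_, demo_r2)
```

Because the synthetic points are perfectly linear, the fitted line reproduces them and $R^2$ is 1 up to floating-point error. Real data such as the wine dataset will score well below that.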

#### Adjusted $R^2$ formula
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of predictors in the model.
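The formula can be wrapped in a small helper to see how the penalty grows with the number of predictors. This is a sketch with made-up inputs: the function name `adjusted_r2` and the values of `r2`, `n`, and `k` are illustrative and are not results from the wine data.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same R^2 and sample size, different numbers of predictors
one_feature = adjusted_r2(0.50, n=101, k=1)
ten_features = adjusted_r2(0.50, n=101, k=10)

print(f"k=1:  {one_feature:.4f}")
print(f"k=10: {ten_features:.4f}")
```

With $R^2$ held fixed, the model with ten predictors gets a lower adjusted $R^2$ than the model with one. This is why adjusted $R^2$ works as a stopping criterion for forward selection.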

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [20]:
## Flag intermediate output

show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

# start with no predictors
included = []
# keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}
model = LinearRegression()
n = X_train.shape[0]

while True:
    changed = False

    if show_steps:
        print('')

    excluded = list(set(X.columns) - set(included))

    if show_steps:
        print(f"(Step) Excluded = {', '.join(excluded)}")

    for new_column in excluded:

        if show_steps:
            print(f"(Step) Trying {new_column}...")
            print(f"(Step) - Features = {', '.join(included + [new_column])}")

        # Select current set of features
        X_train_subset = X_train[included + [new_column]]

        # Fit the model
        model.fit(X_train_subset, y_train)
        y_pred = model.predict(X_train_subset)
        r2 = r2_score(y_train, y_pred)

        # Number of predictors
        k = len(included) + 1

        # Adjusted R^2 formula
        adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

        if show_steps:
            print(f"(Step) - Adjusted R^2: This = {adjusted_r2:.3f}; Best = {best['a_r2']:.3f}")

        # If this model improves adjusted R^2
        if adjusted_r2 > best['a_r2']:
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            changed = True
            if show_steps:
                print(f"(Step) - New Best!   : Feature = {best['feature']}; R^2 = {best['r2']:.3f}; Adjusted R^2 = {best['a_r2']:.3f}")

    # End for

    if changed:
        included.append(best['feature'])
        # Wrap the name in a set literal; set(best['feature']) would split the string into characters
        excluded = list(set(excluded) - {best['feature']})
        print(f"Added feature {best['feature']} with R^2 = {best['r2']:.3f} and adjusted R^2 = {best['a_r2']:.3f}")
    else:
        break

print('\nResulting features:')
print(', '.join(included))



(Step) Excluded = citric acid, total sulfur dioxide, density, red_wine, residual sugar, fixed acidity, sulphates, volatile acidity, free sulfur dioxide, alcohol, chlorides, pH
(Step) Trying citric acid...
(Step) - Features = citric acid
(Step) - Adjusted R^2: This = 0.008; Best = 0.000
(Step) - New Best!   : Feature = citric acid; R^2 = 0.008; Adjusted R^2 = 0.008
(Step) Trying total sulfur dioxide...
(Step) - Features = total sulfur dioxide
(Step) - Adjusted R^2: This = 0.001; Best = 0.008
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.096; Best = 0.008
(Step) - New Best!   : Feature = density; R^2 = 0.096; Adjusted R^2 = 0.096
(Step) Trying red_wine...
(Step) - Features = red_wine
(Step) - Adjusted R^2: This = 0.014; Best = 0.096
(Step) Trying residual sugar...
(Step) - Features = residual sugar
(Step) - Adjusted R^2: This = 0.001; Best = 0.096
(Step) Trying fixed acidity...
(Step) - Features = fixed acidity
(Step) - Adjusted R^2: This = 0.006; 



---



© 2024 Institute of Data

