<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

This lab moves beyond basic feature selection methods and introduces forward feature selection. Starting from an empty model, we add one feature at a time, keeping an addition only when it raises the model's adjusted R-squared score. The aim is a regression model that captures the underlying patterns in the data using only the features that improve the fit.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new variable no longer improves performance.

Create a regression model using forward feature selection: loop over the features, adding one at a time, until the prediction metric ($R^2$ and adjusted $R^2$ in this case) stops improving.
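The procedure described above can be sketched as a small function. This is a minimal illustration, not the lab's own implementation: it uses NumPy least squares instead of `LinearRegression`, and the names `r_squared`, `forward_select`, `X_demo` and `y_demo`, as well as the synthetic data, are assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, names):
    """Greedy forward selection driven by adjusted R^2 (hypothetical helper)."""
    n = len(y)
    included, best_score = [], 0.0
    while True:
        best_candidate = None
        for j in range(X.shape[1]):
            if j in included:
                continue
            cols = included + [j]
            k = len(cols)
            r2 = r_squared(X[:, cols], y)
            adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
            if adj > best_score:  # candidate beats the best model so far
                best_score, best_candidate = adj, j
        if best_candidate is None:  # no candidate improved the score: stop
            break
        included.append(best_candidate)
    return [names[j] for j in included], best_score

# Synthetic data: only x0 and x2 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

selected, score = forward_select(X_demo, y_demo, ['x0', 'x1', 'x2', 'x3'])
print(selected, round(score, 3))
```

In this setup the two informative features should be picked first, strongest first. A noise feature can still be added afterwards if it happens to nudge adjusted $R^2$ upward by chance.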

#### 5.1 Load Wine Data & Define Predictor and Target

In [5]:
## Load the wine quality dataset

# Load the wine dataset from csv
wine = pd.read_csv(r'C:\Users\lytton\Downloads\DATA\winequality_merged.csv')

# define the target variable (dependent variable) as y
y = wine['quality']

# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']

# Build the predictor matrix X from the predictor columns
X = wine[predictor_columns]

In [7]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
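
Because the split above is random, the features selected later and their scores can differ from run to run. A minimal sketch of how a fixed `random_state` makes a split repeatable, assuming toy arrays `X_demo` and `y_demo` and an arbitrary seed of 42 (neither comes from the lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 columns (assumed for illustration only).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed produce identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 80% of the rows go to training, 20% to testing

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```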

#### 5.2 Overview of the code below

The outer `while` loop runs until a full pass over the remaining features produces no improvement to the model. This is tracked by the flag `changed`: the loop ends when `changed` is still `False` after a pass.
The inner `for` loop fits a model for each feature not yet included and calculates its $R^2$ and adjusted $R^2$. If a candidate improves on the best adjusted $R^2$ found so far, the record in `best` is updated.

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary that records the best model found at any stage; starts with an empty feature name and scores of 0.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()` (called as `fit.score()` in the code below, where `fit` is the fitted model returned by `model.fit()`)
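
A quick way to investigate these two methods is to try them on data where the answer is known. The sketch below assumes a toy dataset following $y = 2x + 1$ exactly; the names `x_demo`, `y_demo`, `demo_model` and `fitted` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on an exact line y = 2x + 1 (assumed for illustration).
x_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * x_demo[:, 0] + 1.0

demo_model = LinearRegression()
fitted = demo_model.fit(x_demo, y_demo)   # fit() returns the estimator itself
r2_demo = fitted.score(x_demo, y_demo)    # score() returns R^2 on the given data

print(fitted is demo_model)
print(round(fitted.coef_[0], 3), round(fitted.intercept_, 3), round(r2_demo, 3))
```

Since `fit()` returns the same estimator object, `fit.score(...)` in the lab code is equivalent to calling `model.score(...)` after fitting.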

#### Adjusted $R^2$ formula
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of predictors in the model.
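
As a worked example of the formula, with $n$ the number of observations and $k$ the number of predictors, the sketch below compares two hypothetical models. The function name `adjusted_r2` and the numbers are assumptions chosen for illustration, not results from the wine data.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model A: 5 predictors, R^2 = 0.500 on 101 observations.
base = adjusted_r2(0.500, n=101, k=5)

# Model B: one extra predictor that lifts R^2 only slightly, to 0.501.
more = adjusted_r2(0.501, n=101, k=6)

print(round(base, 4), round(more, 4))
```

Model B has the higher $R^2$ but the lower adjusted $R^2$, which is why the selection loop compares adjusted $R^2$ rather than raw $R^2$: a feature has to earn its place.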

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [19]:
show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [25]:
included = []  # Start with no features in the model.
best = {'feature': '', 'r2': 0, 'a_r2': 0}  # Best candidate found so far, with its R^2 and adjusted R^2.
model = LinearRegression()  # Linear regression model with default parameters.
n = X_train.shape[0]  # Number of rows (observations) in the training data.

while True:  # Keep trying to add features until no candidate improves the model.
    changed = False  # Reset the improvement flag at the start of each round.

    if show_steps:
        print('')  # Visual separator: prints a blank line only if show_steps is True.

    excluded = list(set(X.columns) - set(included)) #Calculates which features haven’t been added to the model yet.

    if show_steps:
        print(f"(Step) Excluded = {', '.join(excluded)}") #Prints pool of remaining features to be tested.

    for new_column in excluded: #Loops through each unused feature.

        if show_steps:
            print(f'(Step) Trying {new_column}...') #Flags new variable it's testing.
            print(f"(Step) - Features = {', '.join(included + [new_column])}") #Flags new variable in conjunction with previous best variables.

        fit = model.fit(X_train[included + [new_column]], y_train)  # Train the model on the current features plus one new feature.
        r2 = fit.score(X_train[included + [new_column]], y_train)  # R^2 on the training data.
        k = len(included + [new_column])  # Number of predictors in this model.
        adjusted_r2 = 1 - ( ( (1 - r2) * (n - 1) ) / (n - k - 1) )  # Adjusted R^2 penalises model size: it falls when a feature adds too little to R^2.

        if show_steps:
            print(f"(Step) - Adjusted R^2: This = {adjusted_r2:.3f}; Best = {best['a_r2']:.3f}") #Prints adjusted R^2 with the new feature and best BEFORE testing the new feature to 3 decimal places. 

        if adjusted_r2 > best['a_r2']: #DECISION POINT: if the new feature improves adjusted R^2,
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2} #Record new parameters.
            changed = True  #Flag that a better model is found.
            if show_steps:
                print(f"(Step) - New Best!   : Feature = {best['feature']}; R^2 = {best['r2']:.3f}; Adjusted R^2 = {best['a_r2']:.3f}")

    if changed:  # A better model was found after testing all remaining features.
        included.append(best['feature'])  # Add the best new feature to the included list.
        excluded = list(set(excluded) - {best['feature']})  # Update excluded list; set(best['feature']) would split the name into characters.
        print(f"Added feature {best['feature']} with R^2 = {best['r2']:.3f} and adjusted R^2 = {best['a_r2']:.3f}")

    else:
        print('*'*50)  # Visual separator: prints 50 asterisks.
        break

print('')
print('Resulting features:')
print(', '.join(included))


(Step) Excluded = total sulfur dioxide, red_wine, density, alcohol, citric acid, sulphates, chlorides, volatile acidity, pH, free sulfur dioxide, residual sugar, fixed acidity
(Step) Trying total sulfur dioxide...
(Step) - Features = total sulfur dioxide
(Step) - Adjusted R^2: This = 0.002; Best = 0.000
(Step) - New Best!   : Feature = total sulfur dioxide; R^2 = 0.002; Adjusted R^2 = 0.002
(Step) Trying red_wine...
(Step) - Features = red_wine
(Step) - Adjusted R^2: This = 0.014; Best = 0.002
(Step) - New Best!   : Feature = red_wine; R^2 = 0.014; Adjusted R^2 = 0.014
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.093; Best = 0.014
(Step) - New Best!   : Feature = density; R^2 = 0.093; Adjusted R^2 = 0.093
(Step) Trying alcohol...
(Step) - Features = alcohol
(Step) - Adjusted R^2: This = 0.198; Best = 0.093
(Step) - New Best!   : Feature = alcohol; R^2 = 0.198; Adjusted R^2 = 0.198
(Step) Trying citric acid...
(Step) - Features = citric acid
(Ste

In [17]:
# Data scientists are not expected to be able to write this code from scratch.
# Clear documentation, for repeatability and for communicating results to the business, is now a key part of the role.



---



> > > > > > > > > © 2024 Institute of Data


---



