<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.2: Feature Selection

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### 5. Forward Feature Selection

> Forward Selection: Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new variable no longer improves performance.

Create a regression model using Forward Feature Selection: loop over the features, adding one at a time, until the prediction metric stops improving ($R^2$ and adjusted $R^2$ in this case).
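For comparison, scikit-learn ships a built-in forward selector, `SequentialFeatureSelector`. The sketch below is not the method this lab builds by hand: it ranks candidates by cross-validated score rather than adjusted $R^2$, and it stops at a fixed number of features. It runs on synthetic data (the shapes, coefficients, seed and the names `X_demo`, `y_demo` are assumptions chosen for illustration) so that it does not depend on the wine CSV.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only columns 0 and 3 drive y
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Forward selection: start with no features and greedily add two,
# scoring each candidate with 5-fold cross-validated R^2
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='forward',
    scoring='r2',
    cv=5,
)
selector.fit(X_demo, y_demo)

# Column indices of the selected features
selected = selector.get_support(indices=True).tolist()
print(selected)
```

Because the two informative columns dominate the noise, the selector should recover them. On real data the result depends on the scoring metric and the cross-validation split.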

#### 5.1 Load Wine Data & Define Predictor and Target

In [42]:
## Load the wine quality dataset

# Load the wine dataset from csv
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
wine_csv = r'D:\IOD\Data\Datasets 2\winequality_merged.csv'
wine = pd.read_csv(wine_csv)

# define the target variable (dependent variable) as y
y = wine['quality']
y

0       5
1       5
2       5
3       6
4       5
       ..
6492    6
6493    5
6494    6
6495    7
6496    6
Name: quality, Length: 6497, dtype: int64

In [43]:
# Take all columns except target as predictor columns
predictor_columns = [c for c in wine.columns if c != 'quality']
predictor_columns

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'red_wine']

In [44]:
# Load the dataset as a pandas data frame, define X
X = pd.DataFrame(wine, columns = predictor_columns)
X

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,red_wine
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,1
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,1
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,0
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,0
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,0
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,0


In [45]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [46]:
## Create training and testing subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=42)
X_train

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,red_wine
1916,6.6,0.240,0.35,7.70,0.031,36.0,135.0,0.99380,3.19,0.37,10.5,0
947,8.3,0.280,0.48,2.10,0.093,6.0,12.0,0.99408,3.26,0.62,12.4,1
877,7.7,0.715,0.01,2.10,0.064,31.0,43.0,0.99371,3.41,0.57,11.8,1
2927,5.2,0.370,0.33,1.20,0.028,13.0,81.0,0.99020,3.37,0.38,11.7,0
6063,6.6,0.260,0.56,15.40,0.053,32.0,141.0,0.99810,3.11,0.49,9.3,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3772,7.6,0.320,0.58,16.75,0.050,43.0,163.0,0.99990,3.15,0.54,9.2,0
5191,5.6,0.280,0.27,3.90,0.043,52.0,158.0,0.99202,3.35,0.44,10.7,0
5226,6.4,0.370,0.20,5.60,0.117,61.0,183.0,0.99459,3.24,0.43,9.5,0
5390,6.5,0.260,0.50,8.00,0.051,46.0,197.0,0.99536,3.18,0.47,9.5,0


In [47]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

0.30294471018798774

In [48]:
lr.score(X_test, y_test)

0.26715748512455784

#### 5.2 Overview of the code below

The outer `while` loop repeats until a full pass over the remaining features yields no improvement, which is tracked by the flag `changed` (the loop stops when nothing has changed).
The inner `for` loop goes over each feature not yet included in the model, fits a model with that feature added, and calculates $R^2$ and adjusted $R^2$. Whenever a candidate model improves on the previous best adjusted $R^2$, the record of the best model is updated.
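The control flow described above can be sketched independently of any modelling library. In this minimal sketch, `forward_select`, `toy_score` and the `gains` values are hypothetical names and made-up numbers used only to show the two loops; in the lab itself the score is the adjusted $R^2$ of a fitted `LinearRegression`.

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection: add the best-scoring feature each round."""
    included = []
    best_score = 0.0
    while True:                      # outer loop: one round per added feature
        best_feature = None
        for feature in candidates:   # inner loop: try each remaining feature
            if feature in included:
                continue
            score = score_fn(included + [feature])
            if score > best_score:
                best_score, best_feature = score, feature
        if best_feature is None:     # no candidate improved the score: stop
            break
        included.append(best_feature)
    return included, best_score


# Made-up per-feature gains and a per-feature penalty, standing in for adjusted R^2
gains = {'alcohol': 0.2, 'density': 0.1, 'pH': 0.01}

def toy_score(features):
    return sum(gains[f] for f in features) - 0.05 * len(features)

chosen, final_score = forward_select(list(gains), toy_score)
print(chosen)
```

With these toy numbers, `alcohol` is added first and `density` second. Adding `pH` would lower the penalised score, so the loop stops with two features.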

#### Code variables
- `included`: list of the features (predictors) that were included in the model; starts empty.
- `excluded`: list of features that have **not** been included in the model; starts as the full list of features.
- `best`: dictionary to keep record of the best model found at any stage; starts 'empty'.
- `model`: object of class LinearRegression, with default values for all parameters.

#### Methods of the `LinearRegression` object to investigate
- `fit()`
- `score()`
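A minimal sketch of these two methods on toy data (the arrays `X_toy` and `y_toy` are assumptions for illustration, not part of the wine dataset). `fit()` returns the fitted estimator itself, which is why the result of `fit()` can be scored directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (an assumption for illustration): y = 2x + 1 exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
fitted = model.fit(X_toy, y_toy)   # fit() estimates the coefficients and returns the model itself
r2 = fitted.score(X_toy, y_toy)    # score() returns R^2 on the data passed in

print(fitted is model)
print(model.coef_, model.intercept_)
print(r2)
```

Since the toy data lie exactly on a line, the fitted slope and intercept match the generating values and $R^2$ is 1 up to floating-point error.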

#### Adjusted $R^2$ formula
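$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations in the training data and $k$ is the number of predictors in the model.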
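As a worked example, the formula can be wrapped in a small helper. The function name `adjusted_r2` and its argument check are assumptions for this sketch. The inputs are the values reported elsewhere in this lab: the training $R^2$ of the model with all 12 predictors and the 5197 training rows.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError('need n > k + 1')
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Training R^2 of the 12-predictor model, fitted on 5197 training rows
full_model = adjusted_r2(0.30294471018798774, n=5197, k=12)
print(round(full_model, 4))
```

The result is roughly 0.301, slightly below the raw $R^2$. With thousands of rows and only 12 predictors the penalty is small. It grows as $k$ approaches $n$.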

#### Linear Regression [reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [34]:
## Flag intermediate output

show_steps = True   # for testing/debugging
# show_steps = False  # without showing steps

In [35]:
## Use Forward Feature Selection to pick a good model

# start with no predictors
included_list = []
# keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}
# create a model object to hold the modelling parameters
model = LinearRegression() # create a model for Linear Regression
# get the number of cases in the training data
n = X_train.shape[0]

while True:
    changed = False
    
    if show_steps:
        print('') 

    # list the features to be evaluated
    excluded_list = list(set(X.columns) - set(included_list))
    
    if show_steps:
        print('(Step) Excluded = %s' % ', '.join(excluded_list))  

    # for each remaining feature to be evaluated
    for new_column in excluded_list:
        
        if show_steps:
            print('(Step) Trying %s...' % new_column)
            print('(Step) - Features = %s' % ', '.join(included_list + [new_column]))

        # fit the model with the Training data
        fit = model.fit(X_train[included_list + [new_column]], y_train) # fit a model; consider which predictors should be included
        # calculate the score (R^2 for Regression)
        r2 = fit.score(X_train[included_list + [new_column]],y_train) # calculate the score
        # number of predictors in this model
        k = len(included_list + [new_column])
        # calculate the adjusted R^2
        adjusted_r2 = 1 - ((1 -r2) * (n-1)) / (n - k - 1)# calculate the Adjusted R^2

        if show_steps:
            print('(Step) - Adjusted R^2: This = %.3f; Best = %.3f' % 
                  (adjusted_r2, best['a_r2']))

        # if model improves
        if adjusted_r2 > best['a_r2']:
            # record new parameters
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            # flag that found a better model
            changed = True
            if show_steps:
                print('(Step) - New Best!   : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % 
                      (best['feature'], best['r2'], best['a_r2']))
    # END for

    # if found a better model after testing all remaining features
    if changed:
        # update control details
        included_list.append(best['feature'])
        # wrap the name in a set literal: set() on a string would split it into characters
        excluded_list = list(set(excluded_list) - {best['feature']})
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % (best['feature'], best['r2'], best['a_r2']))
    else:
        # terminate if no better model
        break

print('')
print('Resulting features:')
print(', '.join(included_list))


(Step) Excluded = total sulfur dioxide, chlorides, residual sugar, fixed acidity, density, citric acid, free sulfur dioxide, alcohol, sulphates, volatile acidity, red_wine, pH
(Step) Trying total sulfur dioxide...
(Step) - Features = total sulfur dioxide
(Step) - Adjusted R^2: This = 0.002; Best = 0.000
(Step) - New Best!   : Feature = total sulfur dioxide; R^2 = 0.002; Adjusted R^2 = 0.002
(Step) Trying chlorides...
(Step) - Features = chlorides
(Step) - Adjusted R^2: This = 0.037; Best = 0.002
(Step) - New Best!   : Feature = chlorides; R^2 = 0.037; Adjusted R^2 = 0.037
(Step) Trying residual sugar...
(Step) - Features = residual sugar
(Step) - Adjusted R^2: This = 0.002; Best = 0.037
(Step) Trying fixed acidity...
(Step) - Features = fixed acidity
(Step) - Adjusted R^2: This = 0.004; Best = 0.037
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.091; Best = 0.037
(Step) - New Best!   : Feature = density; R^2 = 0.091; Adjusted R^2 = 0.091
(Step) Tr

In [36]:
N = X_train.shape[0]
N

5197

In [37]:
X.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'red_wine'],
      dtype='object')

In [38]:
set(X.columns)

{'alcohol',
 'chlorides',
 'citric acid',
 'density',
 'fixed acidity',
 'free sulfur dioxide',
 'pH',
 'red_wine',
 'residual sugar',
 'sulphates',
 'total sulfur dioxide',
 'volatile acidity'}
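One detail worth checking when doing set arithmetic on column names: calling `set()` on a single string yields its characters, not a one-element set. The short demonstration below uses a made-up three-item list (`excluded` here is illustrative data, not the lab's variable).

```python
excluded = ['pH', 'alcohol', 'density']

# set() on a string iterates over its characters
chars = set('pH')

# Subtracting the character set removes nothing, because no feature is named 'p' or 'H'
unchanged = set(excluded) - set('pH')

# Wrapping the name in a set literal removes the feature itself
removed = set(excluded) - {'pH'}

print(sorted(unchanged))
print(sorted(removed))
```

Sets are also unordered, so wrapping the result in `sorted()` gives a reproducible order when the features are printed or iterated.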

In [39]:
show_steps = True 

In [40]:
## Use Forward Feature Selection to pick a good model

# start with no predictors
included = []
# keep track of model and parameters
best = {'feature': '', 'r2': 0, 'a_r2': 0}
# create a model object to hold the modelling parameters
model = LinearRegression()
# get the number of cases in the train data
n = X_train.shape[0]

while True:
    changed = False
    
    if show_steps:
        print('') 

    # list the features to be evaluated
    excluded = list(set(X.columns) - set(included))
    
    if show_steps:
        print('(Step) Excluded = %s' % ', '.join(excluded))  

    # for each remaining feature to be evaluated
    for new_column in excluded:
        
        if show_steps:
            print('(Step) Trying %s...' % new_column)
            print('(Step) - Features = %s' % ', '.join(included + [new_column]))

        # fit the model with the Training data
        fit = model.fit(X_train[included + [new_column]], y_train)
        # calculate the score (R^2 for Regression)
        r2 = fit.score(X_train[included + [new_column]], y_train)
        # number of predictors in this model
        k = len(included + [new_column])
        # calculate the adjusted R^2
        adjusted_r2 = 1 - ( ( (1 - r2) * (n - 1) ) / (n - k - 1) )

        if show_steps:
            print('(Step) - Adjusted R^2: This = %.3f; Best = %.3f' % 
                  (adjusted_r2, best['a_r2']))

        # if model improves
        if adjusted_r2 > best['a_r2']:
            # record new parameters
            best = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            # flag that found a better model
            changed = True
            if show_steps:
                print('(Step) - New Best!   : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % 
                      (best['feature'], best['r2'], best['a_r2']))
    # END for

    # if found a better model after testing all remaining features
    if changed:
        # update control details
        included.append(best['feature'])
        # wrap the name in a set literal: set() on a string would split it into characters
        excluded = list(set(excluded) - {best['feature']})
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % 
              (best['feature'], best['r2'], best['a_r2']))
    else:
        # terminate if no better model
        print('*'*50)
        break

print('')
print('Resulting features:')
print(', '.join(included))


(Step) Excluded = total sulfur dioxide, chlorides, residual sugar, fixed acidity, density, citric acid, free sulfur dioxide, alcohol, sulphates, volatile acidity, red_wine, pH
(Step) Trying total sulfur dioxide...
(Step) - Features = total sulfur dioxide
(Step) - Adjusted R^2: This = 0.002; Best = 0.000
(Step) - New Best!   : Feature = total sulfur dioxide; R^2 = 0.002; Adjusted R^2 = 0.002
(Step) Trying chlorides...
(Step) - Features = chlorides
(Step) - Adjusted R^2: This = 0.037; Best = 0.002
(Step) - New Best!   : Feature = chlorides; R^2 = 0.037; Adjusted R^2 = 0.037
(Step) Trying residual sugar...
(Step) - Features = residual sugar
(Step) - Adjusted R^2: This = 0.002; Best = 0.037
(Step) Trying fixed acidity...
(Step) - Features = fixed acidity
(Step) - Adjusted R^2: This = 0.004; Best = 0.037
(Step) Trying density...
(Step) - Features = density
(Step) - Adjusted R^2: This = 0.091; Best = 0.037
(Step) - New Best!   : Feature = density; R^2 = 0.091; Adjusted R^2 = 0.091
(Step) Tr

show_details = True
# set parameters
predictors_list = []
# keep track of the best model found so far
best_para = {'feature': '', 'r2': 0, 'a_r2': 0}
slg_model = LinearRegression()  # create a model for linear regression
N = X_train.shape[0]  # number of cases in the training data

while True:
    changed = False

    if show_details:
        print('')

    # features not yet in the model
    excld_predic_list = list(set(X.columns) - set(predictors_list))

    if show_details:
        print('(Step) Excluded = %s' % ', '.join(excld_predic_list))

    for new_column in excld_predic_list:

        if show_details:
            print('(Step) Trying %s ...' % new_column)
            print('(Step) - Features = %s' % ', '.join(predictors_list + [new_column]))

        # fit and score on the training data with the candidate feature added
        fit = slg_model.fit(X_train[predictors_list + [new_column]], y_train)
        r2 = fit.score(X_train[predictors_list + [new_column]], y_train)
        k = len(predictors_list + [new_column])

        adjusted_r2 = 1 - (((1 - r2) * (N - 1)) / (N - k - 1))

        if show_details:
            print('(Step) - Adjusted R^2 : This = %.3f; Best = %.3f' % (adjusted_r2, best_para['a_r2']))

        # keep the candidate if it beats the best adjusted R^2 so far
        if adjusted_r2 > best_para['a_r2']:
            best_para = {'feature': new_column, 'r2': r2, 'a_r2': adjusted_r2}
            changed = True
            if show_details:
                print('(Step) - New Best! : Feature = %s; R^2 = %.3f; Adjusted R^2 = %.3f' % (best_para['feature'], best_para['r2'], best_para['a_r2']))

    if changed:
        predictors_list.append(best_para['feature'])
        # set literal, so the feature name is not split into characters
        excld_predic_list = list(set(excld_predic_list) - {best_para['feature']})
        print('Added feature %-4s with R^2 = %.3f and adjusted R^2 = %.3f' % (best_para['feature'], best_para['r2'], best_para['a_r2']))
    else:
        break

print('')
print('Resulting features:')
print(', '.join(predictors_list))



---



> > > > > > > > > © 2022 Institute of Data

