# Importing Libraries/Data and Instantiating a Sample Model

In [1]:
import pandas as pd # dataframe/data cleaning/manipulation
import numpy as np # array computations
from matplotlib import pyplot as plt # plotting/graphing
import seaborn as sns # additional visualizations
import matplotlib.patches as mpatches
from sklearn.tree import DecisionTreeClassifier # Decision tree algorithm
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score # classification, confusion matrix, and ROC AUC accuracy metrics along with display functions
from sklearn.model_selection import train_test_split, cross_val_score # train test split and cross validation accuracy function
from sklearn.feature_selection import SelectKBest, mutual_info_classif # filter method functions
from sklearn.feature_selection import SequentialFeatureSelector as SFS, RFE # wrapper method functions

Note: If you are using Google Colab, you must upload the training and testing CSVs from Canvas by doing the following:

* On the left-side bar, click the folder icon.
* Click the 'Upload to session storage' button.
* Upload the two CSV files; they will appear below the 'sample_data' folder.

**Unfortunately, this process must be done every time the runtime is disconnected - just a quirk with Google Colab.**

If you are using Jupyter Notebook, just make sure the training and testing CSV files are in the same folder as this .ipynb file.

In [2]:
training_df = pd.read_csv('training_data.csv',index_col=0)
testing_df = pd.read_csv('testing_data.csv',index_col=0)

As before, we are working with the British Bank Dataset (~600 records), aiming to predict whether or not someone will buy a personal equity plan (PEP) based on other data such as age, sex, region, and income.

Our main goal is to explore the forms of feature selection discussed in class. As in the previous two Python notebooks, we will instantiate a sample model to work with.


In [3]:
X = training_df.drop(columns = ['pep'])
y = training_df.pep
model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, random_state = 3).fit(X, y)

# Decision Tree Pre-Pruning

Pre-pruning involves stopping the decision tree's growth before it has finished classifying the training set, as a method of preventing overfitting. It is a greedy approach: by stopping early, we might skip a split that looks unhelpful on its own but would have enabled an extremely valuable subsequent split. However, it is not as computationally expensive as post-pruning.

Pre-pruning a decision tree can be tackled a few different ways in Scikit-Learn.

**One way is to limit the depth of the tree using the `max_depth` parameter.**
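As a minimal sketch of what `max_depth` does, the snippet below fits an unrestricted tree and a depth-limited tree and compares their sizes. It assumes a synthetic dataset from `make_classification` purely for illustration (the `X_demo`/`y_demo` names are hypothetical and this is not the bank data).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=13, n_informative=5, random_state=0
)

# Unrestricted tree: grows until the training data is fully classified.
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_demo, y_demo)

# Pre-pruned tree: growth is stopped once depth 3 is reached.
pruned_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0).fit(X_demo, y_demo)

print("Unrestricted tree: depth", full_tree.get_depth(), "| leaves", full_tree.get_n_leaves())
print("max_depth=3 tree:  depth", pruned_tree.get_depth(), "| leaves", pruned_tree.get_n_leaves())
```

A binary tree of depth 3 can have at most 8 leaves, so the depth cap directly bounds how finely the tree can partition the training data.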

To experiment, we will again split the training data further using the `train_test_split()` function as shown in the Model_Evaluation_Final.ipynb file.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

From here, we will run a for-loop in which the tree's maximum depth varies from 10 down to 2. The idea is that we are restricting the tree's depth during its initial creation rather than letting it grow fully. Pruning the model after it has been created would be a form of post-pruning.

In [5]:
for depth in reversed(range(2,11)):
   temp_model = DecisionTreeClassifier(criterion = 'entropy', max_depth=depth).fit(X_train, y_train)
   print(f"For a max_depth value of {depth}: Test Classification Accuracy: {round(accuracy_score(y_test, temp_model.predict(X_test)),2)*100}%, 10-Fold CV Accuracy: {round((cross_val_score(temp_model, X, y, cv = 10)).mean(),2)*100}%")

For a max_depth value of 10: Test Classification Accuracy: 85.0%, 10-Fold CV Accuracy: 83.0%
For a max_depth value of 9: Test Classification Accuracy: 86.0%, 10-Fold CV Accuracy: 83.0%
For a max_depth value of 8: Test Classification Accuracy: 89.0%, 10-Fold CV Accuracy: 85.0%
For a max_depth value of 7: Test Classification Accuracy: 89.0%, 10-Fold CV Accuracy: 86.0%
For a max_depth value of 6: Test Classification Accuracy: 88.0%, 10-Fold CV Accuracy: 87.0%
For a max_depth value of 5: Test Classification Accuracy: 83.0%, 10-Fold CV Accuracy: 84.0%
For a max_depth value of 4: Test Classification Accuracy: 84.0%, 10-Fold CV Accuracy: 83.0%
For a max_depth value of 3: Test Classification Accuracy: 74.0%, 10-Fold CV Accuracy: 74.0%
For a max_depth value of 2: Test Classification Accuracy: 57.99999999999999%, 10-Fold CV Accuracy: 59.0%


We can apply this same concept to the `min_samples_leaf` parameter, which will not allow splits that result in leaf nodes with fewer than the specified number of samples.



In [6]:
for leaf in reversed(range(2,11)):
   temp_model = DecisionTreeClassifier(criterion = 'entropy', min_samples_leaf=leaf).fit(X_train, y_train)
   print(f"For a min_samples_leaf value of {leaf}: Test Classification Accuracy: {round(accuracy_score(y_test, temp_model.predict(X_test)),2)*100}%, 10-Fold CV Accuracy: {round((cross_val_score(temp_model, X, y, cv = 10)).mean(),2)*100}%")

For a min_samples_leaf value of 10: Test Classification Accuracy: 87.0%, 10-Fold CV Accuracy: 88.0%
For a min_samples_leaf value of 9: Test Classification Accuracy: 87.0%, 10-Fold CV Accuracy: 88.0%
For a min_samples_leaf value of 8: Test Classification Accuracy: 87.0%, 10-Fold CV Accuracy: 88.0%
For a min_samples_leaf value of 7: Test Classification Accuracy: 84.0%, 10-Fold CV Accuracy: 86.0%
For a min_samples_leaf value of 6: Test Classification Accuracy: 85.0%, 10-Fold CV Accuracy: 86.0%
For a min_samples_leaf value of 5: Test Classification Accuracy: 87.0%, 10-Fold CV Accuracy: 86.0%
For a min_samples_leaf value of 4: Test Classification Accuracy: 90.0%, 10-Fold CV Accuracy: 84.0%
For a min_samples_leaf value of 3: Test Classification Accuracy: 87.0%, 10-Fold CV Accuracy: 85.0%
For a min_samples_leaf value of 2: Test Classification Accuracy: 86.0%, 10-Fold CV Accuracy: 85.0%


By increasing the minimum number of samples required at a leaf, the tree might become more generalized, reducing the risk of overfitting. On the other hand, setting this value too high might lead to underfitting where the tree does not capture sufficient patterns from the data.
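To make this trade-off concrete, here is a small sketch that inspects the fitted tree's leaves for a few `min_samples_leaf` values. It assumes synthetic data from `make_classification` (not the bank dataset), and the values 1, 10, and 100 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=13, n_informative=5, random_state=0
)

smallest_leaf = {}
for leaf in (1, 10, 100):
    tree = DecisionTreeClassifier(
        criterion='entropy', min_samples_leaf=leaf, random_state=0
    ).fit(X_demo, y_demo)

    # In the low-level tree structure, leaf nodes have no children (marked by -1).
    is_leaf = tree.tree_.children_left == -1
    leaf_sizes = tree.tree_.n_node_samples[is_leaf]
    smallest_leaf[leaf] = int(leaf_sizes.min())

    print(f"min_samples_leaf={leaf}: {tree.get_n_leaves()} leaves, "
          f"smallest leaf holds {smallest_leaf[leaf]} samples, "
          f"training accuracy {tree.score(X_demo, y_demo):.2f}")
```

Larger values force coarser leaves, so training accuracy typically falls; whether that helps or hurts on unseen data is what the cross-validation figures above are meant to reveal.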

# Filter Method

The filter selection method involves selecting features based on their statistical properties and relevance to the target variable (in our case, PEP), independent of any machine learning algorithm. It differs from the aforementioned concept of pre-pruning in that it is a pre-processing step applied before the algorithm is run and can be applied to almost all types of models.

There are various statistical measures we can use to "score" each feature, including but not limited to:

- Correlation coefficients (via a correlation matrix)
- Chi-square test
- Mutual information
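As a rough sketch of the first two measures, the snippet below scores two made-up binary features against a made-up binary target. The toy data and the column names `informative` and `noise` are assumptions for illustration only; `chi2` is scikit-learn's chi-square scoring function, which expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 400
target = rng.integers(0, 2, size=n)

# Hypothetical toy features: one agrees with the target ~90% of the time, one is pure noise.
toy = pd.DataFrame({
    "informative": np.where(rng.random(n) < 0.9, target, 1 - target),
    "noise": rng.integers(0, 2, size=n),
})

# Chi-square statistic (and p-value) of each feature against the target.
chi2_stats, p_values = chi2(toy, target)

# Pearson correlation of each feature with the target.
corr = toy.corrwith(pd.Series(target))

for name, stat, p in zip(toy.columns, chi2_stats, p_values):
    print(f"{name}: chi2={stat:.2f}, p-value={p:.4f}, correlation={corr[name]:.2f}")
```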

Based on these scores, we can then sequentially assess and remove each feature IF the model's performance improves by doing so.

The main advantages of the filter method are that it is not computationally expensive and is not tailored to a specific model type. However, it does not necessarily consider interactions between features.
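The interaction caveat can be illustrated with a deliberately extreme, made-up example: a target that is the XOR of two binary features. This is a sketch on toy data (the names `x1`, `x2`, and `target` are hypothetical), using Pearson correlation and scikit-learn's `mutual_info_score` for discrete variables as the per-feature scores.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy data: every combination of two binary features appears equally often.
x1 = np.array([0, 0, 1, 1] * 50)
x2 = np.array([0, 1, 0, 1] * 50)
target = x1 ^ x2  # the target depends on the two features only jointly

# Scored one at a time, each feature looks useless.
corr_x1 = np.corrcoef(x1, target)[0, 1]
mi_x1 = mutual_info_score(x1, target)
mi_x2 = mutual_info_score(x2, target)

# Scored together (encoded as a single 4-level variable), they determine the target.
mi_joint = mutual_info_score(2 * x1 + x2, target)

print(f"correlation(x1, target) = {corr_x1:.3f}")
print(f"MI(x1; target) = {mi_x1:.3f}, MI(x2; target) = {mi_x2:.3f}")
print(f"MI((x1, x2); target) = {mi_joint:.3f}")
```

A filter that scores features individually would discard both features here, even though a model using them together could predict the target perfectly.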





Let's explore one of the aforementioned statistical measures, mutual information, in more detail.

## Mutual Information

Mutual information is a measure of the mutual dependence between two variables. It quantifies how much information a feature provides about the target variable.

For each feature in the dataset, mutual information calculates how much knowing the value of that feature reduces uncertainty about the target variable.

A higher value of mutual information indicates that the feature is more informative regarding the target variable. Conversely, a mutual information value close to zero suggests that the feature is less relevant for predicting the target variable.
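For discrete variables, the quantity being described is I(X; Y) = Σ p(x, y) · log( p(x, y) / (p(x) p(y)) ). The sketch below computes it by hand on toy data and compares the result with scikit-learn's `mutual_info_score`. The helper `mutual_information` and the toy arrays are hypothetical names introduced for illustration; note that `mutual_info_classif` uses a nearest-neighbour-based estimator for continuous features, so its values will not generally match a hand count like this.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete arrays, from empirical frequencies."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy > 0:
                p_x, p_y = np.mean(x == xv), np.mean(y == yv)  # marginals
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

target = np.array([0, 1] * 50)
feature_copy = target.copy()                  # knowing it removes all uncertainty
feature_indep = np.array([0, 0, 1, 1] * 25)   # statistically independent of the target

mi_copy = mutual_information(feature_copy, target)
mi_indep = mutual_information(feature_indep, target)

print(f"Perfectly informative feature: {mi_copy:.4f} (sklearn: {mutual_info_score(target, feature_copy):.4f})")
print(f"Independent feature:           {mi_indep:.4f} (sklearn: {mutual_info_score(target, feature_indep):.4f})")
```

The perfectly informative feature scores ln 2 (about 0.693 nats), which is the full entropy of a balanced binary target, while the independent feature scores zero.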

To compute and identify these values in Scikit-Learn, we will utilize the [SelectKBest()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) and the [mutual_info_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) functions.

`SelectKBest` is able to select the top 'k' features based on a scoring function, which in this case is `mutual_info_classif`. A similar function exists for chi-squared.

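In this case, we are interested in the mutual information values for all of our features. The fitted `scores_` attribute holds a score for every feature regardless of 'k' (which defaults to 10), so we will leave 'k' at its default.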



In [7]:
selector = SelectKBest(mutual_info_classif).fit(X,y)
# By calling .fit(X,y), the mutual information score for each feature in 'X' is calculated with respect to our target variable PEP ('y').
scores = selector.scores_

In [8]:
# If we call the scores object by itself, we don't know which values correspond to which features in X.
print(scores)

[0.         0.00189369 0.         0.08229865 0.         0.02233835
 0.03679827 0.         0.         0.01662535 0.01089824 0.01010141
 0.        ]


In [9]:
# Technically, each value corresponds to each column in the 'X' dataframe from left to right. We can use Python/Pandas to make our results more interpretable.

# Get the names of all features
feature_names=list(X.columns)

# Print the scores for each feature
for i, score in enumerate(selector.scores_):
    print("Feature %s: %f" % (feature_names[i], score))

Feature age: 0.000000
Feature income: 0.001894
Feature married: 0.000000
Feature children: 0.082299
Feature car: 0.000000
Feature save_act: 0.022338
Feature current_act: 0.036798
Feature mortgage: 0.000000
Feature sex_FEMALE: 0.000000
Feature region_INNER_CITY: 0.016625
Feature region_RURAL: 0.010898
Feature region_SUBURBAN: 0.010101
Feature region_TOWN: 0.000000


What takeaways can we make from these results?

Note: The results will vary each time you run SelectKBest with mutual_info_classif, due to the inherent randomness within the mutual_info_classif function in scikit-learn. You can define a function to set the random state for the mutual_info_classif function, but the results in the next section should not change regardless.
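One possible way to fix the random state, sketched below, is to wrap `mutual_info_classif` with `functools.partial` before handing it to `SelectKBest`. The seed value 42, the name `seeded_mi`, and the synthetic data are arbitrary assumptions for illustration.

```python
from functools import partial
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Pre-fill the random_state argument so every call uses the same seed.
seeded_mi = partial(mutual_info_classif, random_state=42)

# k='all' keeps every feature; we only want the scores.
scores_a = SelectKBest(seeded_mi, k='all').fit(X_demo, y_demo).scores_
scores_b = SelectKBest(seeded_mi, k='all').fit(X_demo, y_demo).scores_

print(scores_a)
print("Identical across runs:", (scores_a == scores_b).all())
```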

## Applying Mutual Information to the Filter Method

Now that we have the scores for each feature based on mutual information, we can use the filter method to understand the impact of each feature on the model's performance.

To start, we can create a score dictionary to map each feature name to its mutual information score and then sort the dictionary in descending order.

In [10]:
# Get the indices of the selected features
selected_features_indices = selector.get_support(indices=True)

# Create a dictionary that maps feature names to their scores
score_dict = dict(zip(feature_names, scores))

# Sort the dictionary by scores in descending order
sorted_dict = sorted(score_dict.items(), key=lambda x: x[1], reverse=True)
print(sorted_dict)

[('children', 0.08229865136087411), ('current_act', 0.03679827428447746), ('save_act', 0.022338347637050893), ('region_INNER_CITY', 0.01662534536154836), ('region_RURAL', 0.010898236141792417), ('region_SUBURBAN', 0.010101406337609475), ('income', 0.0018936945902052749), ('age', 0.0), ('married', 0.0), ('car', 0.0), ('mortgage', 0.0), ('sex_FEMALE', 0.0), ('region_TOWN', 0.0)]


In order to assess whether each feature we remove either improves or deteriorates model performance, we need a baseline metric to compare against for our model. In this case, we will use the 10-Fold Cross Validation ROC AUC % on the training data.

In [11]:
mycv = cross_val_score(model, X, y, scoring='roc_auc', cv = 10).mean()
print('Decision-Tree 10-Fold Cross Validation ROC AUC:', round(mycv*100,2),'%')

Decision-Tree 10-Fold Cross Validation ROC AUC: 86.63 %


Now that we have our baseline, we can instantiate a for-loop that iteratively removes each feature from 'X', constructs a new model, and reports the 10-Fold Cross Validation ROC AUC % for that model.

In [12]:
# List to hold features that should be removed.
features_to_remove = []

for feature, _ in sorted_dict:

    # Drop one feature at a time
    X_temp = X.drop(columns=[feature])

    # Construct a new model with the removed feature
    model_temp = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=3)

    # Calculate the 10-Fold Cross Validation ROC AUC % on the new model.
    auc_score = cross_val_score(model_temp, X_temp, y, scoring='roc_auc', cv=10).mean()

    print('Decision-Tree 10-Fold Cross Validation ROC AUC after removing', feature,':', round(auc_score*100, 2), '%')

    if auc_score >= mycv:
        features_to_remove.append(feature)

print('\nFeatures to Remove:',features_to_remove)

Decision-Tree 10-Fold Cross Validation ROC AUC after removing children : 63.68 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing current_act : 86.13 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing save_act : 87.02 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing region_INNER_CITY : 86.88 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing region_RURAL : 86.95 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing region_SUBURBAN : 87.24 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing income : 84.79 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing age : 87.6 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing married : 83.29 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing car : 86.7 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing mortgage : 86.83 %
Decision-Tree 10-Fold Cross Validation ROC AUC after removing sex_FEMALE : 87.29 %
Decision-Tree 10-F

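Based on the results, it appears there are several features we should remove. Do these results surprise you, given that some of the features flagged for removal (such as save_act and the region indicators) ranked relatively highly in terms of mutual information, while removing a zero-scoring feature like married hurt performance? How is this possible?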

We can now apply the results from the filter method by removing the specified attributes and see how our model performs in terms of 10-Fold Cross Validation ROC AUC %.

In [13]:
X_filter = X.drop(columns=features_to_remove)
model_filter = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, random_state = 3).fit(X_filter, y)

In [14]:
mycv_filter = cross_val_score(model_filter, X_filter, y, scoring='roc_auc',cv = 10).mean()
print('Decision-Tree 10-Fold Cross Validation ROC AUC after Filtering:', round(mycv_filter*100,2),'%')

Decision-Tree 10-Fold Cross Validation ROC AUC after Filtering: 87.16 %


# Wrapper Method

The wrapper method uses a specific machine learning model to evaluate the effectiveness of subsets of features. The selection of features is "wrapped" around this model, and hence the optimal set of features is specific to the model being used.

- For example, the features selected for a decision tree model might not be the best for another model, such as a logistic regression.

Essentially, the wrapper method involves training a model on different combinations of features and evaluating their performance, aiming to find the subset of features that results in the best model performance. A search algorithm (like forward selection, backward elimination, or recursive feature elimination) iteratively adds or removes features, evaluates the model, and then decides the next set of features to try. In other words, the criterion for feature selection is the performance of a specific machine learning model, as opposed to the general statistical measures used in the filter method.

**The wrapper method is the most computationally costly of all the feature selection methods covered in this class, as it involves training models multiple times for different subsets of features.** This is important to note as you will notice Google Colab will take significantly longer to run the code, especially in the context of HW4.


Firstly, as with the filter method, we will need a baseline metric to compare against for our model. Again, we can use the 10-Fold Cross Validation ROC AUC % on the training data as our metric of interest. This is our initial model performance before we perform the wrapper method.

In [15]:
mycv = cross_val_score(model, X, y, scoring='roc_auc',cv = 10).mean()
print('Decision-Tree 10-Fold Cross Validation ROC AUC:', round(mycv*100,2),'%')

Decision-Tree 10-Fold Cross Validation ROC AUC: 86.63 %


In Scikit-Learn, one of the wrapper method techniques is the [SequentialFeatureSelector (SFS)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) function, which allows us to specify if we want to perform forward selection or backward elimination.

**Forward selection starts with no features in the model.** In each step, it adds the feature that provides the most significant improvement to the model performance. This process is repeated, adding one feature at a time, until adding new features no longer improves the model performance significantly or a predefined number of features is reached.

- Forward selection is useful when you have a large number of features, as it allows you to start building the model gradually. Additionally, it is less computationally intensive.
- However, it might miss out on important feature interactions that only become apparent when more features are included in the model.

**Backward elimination starts with all the features in the model.** In each step, it removes the least significant feature (the one whose removal causes the least deterioration in the model performance). This process is repeated, removing one feature at a time, until removing more features significantly worsens the model performance or a predefined number of features is reached.

- Since backward elimination starts with all features, it considers all feature interactions from the beginning. It can also be more accurate when the number of features is not too large, as each feature is judged in the context of all the others still in the model.
- However, it can be computationally intensive, especially if the initial number of features is very high.
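To get a feel for the cost, the arithmetic below counts model fits for our setting of 13 features, 6 selected features, and 10-fold cross-validation. It is a back-of-the-envelope sketch that assumes a greedy search scoring every remaining candidate at each step with one fit per fold; the helper name `greedy_subsets_evaluated` is hypothetical.

```python
def greedy_subsets_evaluated(n_features, n_selected, direction):
    """Count the candidate subsets scored by a greedy one-feature-at-a-time search."""
    if direction == 'forward':
        # Step i chooses among the features that have not been added yet.
        return sum(n_features - i for i in range(n_selected))
    # Backward: each step chooses among the features still in the model.
    return sum(n_features - i for i in range(n_features - n_selected))

n_features, n_selected, cv_folds = 13, 6, 10

forward_fits = greedy_subsets_evaluated(n_features, n_selected, 'forward') * cv_folds
backward_fits = greedy_subsets_evaluated(n_features, n_selected, 'backward') * cv_folds
exhaustive_fits = (2 ** n_features - 1) * cv_folds  # every non-empty subset

print("Forward selection fits:   ", forward_fits)
print("Backward elimination fits:", backward_fits)
print("Exhaustive search fits:   ", exhaustive_fits)
```

Under these assumptions, forward selection needs 630 fits and backward elimination 700, compared with 81,910 for trying every non-empty subset, which is why the greedy searches are used in practice.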

To use the SFS function, we need to specify:

- The model that will be used to evaluate the importance of different feature subsets. This will be our sample model.
- How many features to select; setting this to `'auto'` with the default `tol=None` tells the SFS function to keep half of the features. Supplying a `tol` value instead makes it stop once the change in score between steps falls below that tolerance.
- The direction methodology (forward or backward selection) to be used. In this case, we will use backward elimination as we don't have a large number of features.
- The scoring metric used to assess the various models being trained. We will use the ROC AUC score to be consistent with our baseline metric.
- The number of folds to use in cross-validation for model assessment (this is an optional but highly recommended parameter). We will use 10-Fold Cross validation to be consistent with our baseline metric.

Once we call `.fit()` on the SFS function, it will employ the backward elimination method to identify the most important features as per the specified scoring method and cross-validation strategy.

In [17]:
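# 'auto' with the default tol=None keeps half of the features ('auto' is available from scikit-learn 1.1; newer releases reject None).
sfs = SFS(model, n_features_to_select='auto', direction='backward', scoring='roc_auc', cv=10).fit(X,y)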

# Used to transform the dataset X by selecting only the features that SFS determined to be most important.
X_selected = sfs.transform(X)

# sfs.get_support() returns a boolean array indicating which features were selected.
# By zipping this array with X.columns and iterating through them, the code creates a list of feature names that corresponds to True in the boolean array (selected),
  # indicating these features were selected by SFS.
selected_feature_names = [name for name, selected in zip(X.columns, sfs.get_support()) if selected]

print("Selected feature names:", selected_feature_names)

Selected feature names: ['income', 'married', 'children', 'save_act', 'current_act', 'mortgage']


Based on the results, we can then evaluate our model performance after feature selection by only using the subset of features selected by the SFS wrapper method.

In [18]:
model.fit(X_selected, y)
selected_features_auc = cross_val_score(model, X_selected, y, cv=10, scoring='roc_auc').mean()
print("ROC AUC Score After Wrapping with SFS:", round(selected_features_auc,2)*100,'%')

ROC AUC Score After Wrapping with SFS: 88.0 %
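For comparison, forward selection can be sketched with the same class by switching `direction`. The example below assumes synthetic data from `make_classification` and an arbitrary target of three features with 5-fold cross-validation to keep it quick; it is an illustration rather than a result for the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=3)

# Start from no features and greedily add the one that most improves CV ROC AUC.
forward_sfs = SequentialFeatureSelector(
    tree, n_features_to_select=3, direction='forward', scoring='roc_auc', cv=5
).fit(X_demo, y_demo)

selected = forward_sfs.get_support(indices=True)
print("Column indices chosen by forward selection:", selected)
```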


The SFS wrapper method is not the only methodology we could use. Another method is known as [Recursive Feature Elimination (RFE)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html).

RFE is an iterative process to rank features by recursively removing the least important features. It starts by training the model on all available features and then measuring the importance of each feature (usually based on model-specific metrics like feature weights or coefficients).

While similar to backward elimination, one key difference is that RFE ranks features using the model's internal importance measure, whereas SFS lets us specify how feature subsets are evaluated (mainly the scoring metric and the number of cross-validation folds).
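The elimination idea can be sketched by hand for a decision tree, whose internal importance measure is exposed as `feature_importances_`. This is a simplified illustration on synthetic data, not scikit-learn's exact RFE implementation; the variable names and the target of five features are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play

while len(remaining) > 5:
    # Refit on the surviving columns and read the tree's own importance scores.
    tree = DecisionTreeClassifier(
        criterion='entropy', max_depth=5, random_state=3
    ).fit(X_demo[:, remaining], y_demo)

    weakest = int(np.argmin(tree.feature_importances_))  # position within `remaining`
    print(f"Dropping column {remaining[weakest]} "
          f"(importance {tree.feature_importances_[weakest]:.4f})")
    del remaining[weakest]

print("Columns kept:", remaining)
```

Note that no cross-validation or scoring metric appears anywhere in this loop; the importances are computed from the fit on the training data alone.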

In [19]:
# Here, let's say we are interested in selecting the five most important features. There is no 'auto' setting for RFE.
# RFE will iteratively remove features, retrain the model, and evaluate which features are the most significant until only five features are left.
selector_rank = RFE(model, n_features_to_select=5).fit(X, y)

# Print the ranking of each feature; RFE still ranks all the features in the dataset even though we only specify five.
for i, rank in enumerate(selector_rank.ranking_): # The number of features to select determines how many features will be listed as 'Rank 1'.
    print("Feature %s: Rank %d" % (feature_names[i], rank))

Feature age: Rank 2
Feature income: Rank 1
Feature married: Rank 1
Feature children: Rank 1
Feature car: Rank 9
Feature save_act: Rank 1
Feature current_act: Rank 8
Feature mortgage: Rank 1
Feature sex_FEMALE: Rank 7
Feature region_INNER_CITY: Rank 3
Feature region_RURAL: Rank 6
Feature region_SUBURBAN: Rank 5
Feature region_TOWN: Rank 4


We can then apply the results from the RFE wrapper method in the same manner as before.

In [20]:
# Select only the features that RFE determined to be most important.
X_selected_RFE = selector_rank.transform(X)
selected_feature_names = [name for name, selected in zip(X.columns, selector_rank.get_support()) if selected]
print("Selected feature names:", selected_feature_names)

Selected feature names: ['income', 'married', 'children', 'save_act', 'mortgage']


In [21]:
model.fit(X_selected_RFE, y)
selected_features_auc_RFE = cross_val_score(model, X_selected_RFE, y, cv=10, scoring='roc_auc').mean()
# Report the RFE score (selected_features_auc_RFE), not the earlier SFS score.
print("ROC AUC Score After Wrapping with RFE:", round(selected_features_auc_RFE,2)*100,'%')

ROC AUC Score After Wrapping with RFE: 88.0 %
