Load the dataset and store it in the variable `df`

In [1]:
import sklearn
from sklearn import datasets, model_selection, ensemble

import numpy as np
import pandas as pd

df = pd.read_excel('bank-additional-full.xlsx')

Preview the dataset; it is divided into training and testing subsets further below, after preprocessing

In [2]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93994.0,-36.4,4857.0,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93994.0,-36.4,4857.0,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93994.0,-36.4,4857.0,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93994.0,-36.4,4857.0,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93994.0,-36.4,4857.0,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94767.0,-50.8,1028.0,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94767.0,-50.8,1028.0,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94767.0,-50.8,1028.0,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94767.0,-50.8,1028.0,4963.6,yes


In [3]:
df['y'] = df['y'].map({'no': 0, 'yes': 1})

### **Feature Engineering:** `Feature Selection`

**Definition**: Feature Selection is the process of selecting a subset of relevant features for use in machine learning model building. 

More data does not always give a better result. Including irrelevant features (those that do not help the prediction) and redundant features (those that add no information once other features are present) burdens the learning process and makes overfitting more likely.

#### **Why Does Feature Selection Matter?**

Feature selection offers:

- simpler models that are easier to interpret
- shorter training times and lower computational cost
- lower data collection cost
- less exposure to the curse of dimensionality
- better generalization through reduced overfitting

Keep in mind that different feature subsets give optimal performance for different algorithms, so feature selection is not a step separate from model training. If we are selecting features for a linear model, it is better to use procedures targeted at such models, like importance by regression coefficient or Lasso. If we are selecting features for tree-based models, it is better to use tree-derived importance.

#### **1. Filter Method**

Filter methods select features based on a performance measure regardless of the ML algorithm later employed.

Univariate filters evaluate and rank each feature individually according to a certain criterion, while multivariate filters evaluate the entire feature space. Filter methods:

- select variables regardless of the model
- are less computationally expensive
- usually give lower prediction performance

As a result, filter methods are suited to a quick first-pass screen that removes irrelevant features.
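As an illustration of a multivariate filter used as a quick screen, the sketch below drops one feature from every highly correlated pair. The helper `drop_correlated`, the 0.9 threshold and the toy columns are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return the columns to drop so that no remaining pair exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),  # near-duplicate of "a"
    "b": rng.normal(size=500),                       # independent feature
})

to_drop = drop_correlated(toy)
print(to_drop)
```

No model is trained here, which is why such a screen is cheap enough to run before any wrapper or embedded method.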

**Note**: 
* When using the chi-square test or other univariate selection methods on very big datasets, most features will show a small p-value and therefore look highly predictive. This is an effect of the sample size, so take care when selecting features with these procedures. A tiny p-value does not by itself indicate a highly important feature; it may simply reflect the large number of samples.

* Correlated features do not necessarily hurt model performance (for tree-based models, for example), but high dimensionality does, and too many features hurt model interpretability. So it is usually better to remove highly correlated features.
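The sample-size effect described in the note can be checked on synthetic data. In this sketch the helper `p_value_for`, the 0.05 standard-deviation shift and the two sample sizes are assumptions chosen for illustration: the same weak feature is scored with `f_classif` on a small and on a very large sample.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def p_value_for(n_samples: int, shift: float = 0.05, seed: int = 0) -> float:
    """p-value of one weak feature whose class means differ by `shift`."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_samples // 2)
    feature = rng.normal(size=n_samples) + shift * labels
    _, p_values = f_classif(feature.reshape(-1, 1), labels)
    return float(p_values[0])

p_small = p_value_for(200)
p_large = p_value_for(200_000)
print(f"n = 200:     p = {p_small:.3g}")
print(f"n = 200000:  p = {p_large:.3g}")
```

The effect size is identical in both runs; only the number of samples changes, yet the p-value collapses on the large sample. Ranking features by the score (or by an effect size) rather than thresholding the p-value is one way to sidestep this.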

---


In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, RobustScaler
from sklearn.impute import SimpleImputer

# Separate features (X) and target variable (y)
X = df.drop(columns=['y'])
y = df['y']

# Identify categorical and numerical columns
categorical_columns = X.select_dtypes(include=['object']).columns
numerical_columns = X.select_dtypes(exclude=['object']).columns

# Create transformers for encoding and scaling
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot_encoder', OneHotEncoder(drop='first'))  # Use drop='first' to avoid the dummy variable trap
])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', RobustScaler())
])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

# Apply transformations to X
X_transformed = preprocessor.fit_transform(X)

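# Rebuild a DataFrame with readable column names: numerical columns first,
# then the one-hot encoded columns (same order as in the ColumnTransformer)
one_hot_encoder = preprocessor.named_transformers_['cat'].named_steps['one_hot_encoder']
encoded_columns = one_hot_encoder.get_feature_names_out(categorical_columns).tolist()
df_encoded = pd.DataFrame(X_transformed, columns=numerical_columns.tolist() + encoded_columns)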

# You can now use X_transformed for further analysis or modeling


---

In [5]:
df_encoded

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success
0,1.200000,0.373272,-0.5,0.0,0.0,0.0000,0.222525,0.857143,0.000272,0.000000,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,1.266667,-0.142857,-0.5,0.0,0.0,0.0000,0.222525,0.857143,0.000272,0.000000,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.066667,0.211982,-0.5,0.0,0.0,0.0000,0.222525,0.857143,0.000272,0.000000,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.133333,-0.133641,-0.5,0.0,0.0,0.0000,0.222525,0.857143,0.000272,0.000000,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,1.200000,0.585253,-0.5,0.0,0.0,0.0000,0.222525,0.857143,0.000272,0.000000,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,2.333333,0.709677,-0.5,0.0,0.0,-0.6875,0.924614,-1.428571,-1.040217,-1.762791,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
41184,0.533333,0.935484,-0.5,0.0,0.0,-0.6875,0.924614,-1.428571,-1.040217,-1.762791,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
41185,1.200000,0.041475,0.0,0.0,0.0,-0.6875,0.924614,-1.428571,-1.040217,-1.762791,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
41186,0.400000,1.207373,-0.5,0.0,0.0,-0.6875,0.924614,-1.428571,-1.040217,-1.762791,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [6]:
X = df_encoded

In [7]:
y

0        0
1        0
2        0
3        0
4        0
        ..
41183    1
41184    0
41185    0
41186    1
41187    0
Name: y, Length: 41188, dtype: int64

In [8]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split, cross_val_score

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

In [10]:
from sklearn.feature_selection import SelectKBest

**1. F-Statistics**

Compute the ANOVA F-value between each feature and the target labels to assess the statistical significance of their relationship. The F-test compares the difference in feature means between the class groups with the variation within the groups, so it captures only linear dependency between the feature and the target.

In [11]:
selector = SelectKBest(score_func=sklearn.feature_selection.f_classif, k=5)

selector.fit(X_train, y_train)

X_train.columns[selector.get_support()]

Index(['duration', 'pdays', 'emp.var.rate', 'nr.employed', 'poutcome_success'], dtype='object')
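To make the score behind `f_classif` concrete, the sketch below recomputes the one-way ANOVA F-value by hand (between-group variance divided by within-group variance) and compares it with scikit-learn. The helper `anova_f` and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def anova_f(x: np.ndarray, labels: np.ndarray) -> float:
    """One-way ANOVA F-value of a single feature against class labels."""
    classes = np.unique(labels)
    grand_mean = x.mean()
    # Variation of the class means around the grand mean.
    ss_between = sum(
        (labels == c).sum() * (x[labels == c].mean() - grand_mean) ** 2
        for c in classes
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        ((x[labels == c] - x[labels == c].mean()) ** 2).sum() for c in classes
    )
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)
x_toy = rng.normal(size=300) + 0.8 * y_toy  # class 1 is shifted upwards

f_manual = anova_f(x_toy, y_toy)
f_sklearn = f_classif(x_toy.reshape(-1, 1), y_toy)[0][0]
print(f_manual, f_sklearn)
```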

**2. Mutual Information (MI)**

Mutual information (MI) between two random variables is a non-negative value that measures any kind of statistical dependency between them; it is zero if and only if the variables are independent. Because the estimators are nonparametric, they require more samples for accurate estimation.

In [12]:
selector = SelectKBest(score_func=sklearn.feature_selection.mutual_info_classif, k=5)

selector.fit(X_train, y_train)

X_train.columns[selector.get_support()]

Index(['duration', 'cons.price.idx', 'cons.conf.idx', 'euribor3m',
       'nr.employed'],
      dtype='object')
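The difference between the two filters shows up when the dependency is not linear. In the sketch below (synthetic data; all names are assumptions for illustration) the label depends only on the absolute value of a feature, so the class means coincide and the F-test sees little, while mutual information still detects the relationship.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x_nonlinear = rng.uniform(-1, 1, size=2000)
x_noise = rng.uniform(-1, 1, size=2000)

# The label is 1 when the feature is far from zero, on either side.
y_toy = (np.abs(x_nonlinear) > 0.5).astype(int)
X_toy = np.column_stack([x_nonlinear, x_noise])

f_scores, _ = f_classif(X_toy, y_toy)
mi_scores = mutual_info_classif(X_toy, y_toy, random_state=0)

print("F scores: ", f_scores)
print("MI scores:", mi_scores)
```

The first column should receive a clearly higher MI score than the pure-noise column, even though its F score stays in the range expected for an uninformative feature.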

____

#### **2. Wrapper Method**

Wrappers use a search strategy to explore the space of possible feature subsets and evaluate each subset by the performance of an ML algorithm trained on it. Practically any combination of search strategy and algorithm can be used as a wrapper. Wrapper methods:

- use ML models to score each feature subset
- train a new model on each subset
- are very computationally expensive
- usually provide the best-performing subset for a given ML algorithm, but probably not for another
- need an arbitrarily defined stopping criterion

The most common **search strategy** group is sequential search, which includes Forward Selection and Backward Elimination; Exhaustive Search, which evaluates every possible subset, is the brute-force alternative. Randomized search is another popular choice, including evolutionary computation algorithms (such as genetic algorithms) and simulated annealing.

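Another key element in wrappers is the **stopping criterion**: when should the search stop? In general there are three:

- the performance increase from adding a feature falls below a threshold
- the performance decrease from removing a feature exceeds a threshold
- a predefined number of features is reached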

In [13]:
from mlxtend.feature_selection import SequentialFeatureSelector

**1. Forward Selection**

Step forward feature selection starts by evaluating all features individually and selects the one that generates the best performing algorithm, according to a pre-set evaluation criterion. In the second step, it evaluates all possible combinations of the selected feature and a second feature, and selects the pair that produces the best performing algorithm based on the same criterion. It continues adding one feature at a time until a stopping criterion is met.

This selection procedure is called greedy because at each step it commits to the locally best addition and never revisits earlier choices. It still has to train a model for every remaining candidate at every step, so it is quite computationally expensive and, if the feature space is big, sometimes even infeasible.
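The greedy loop can be written in a few lines. This is a minimal sketch, not the mlxtend implementation used in this notebook: the function `forward_select`, its `min_gain` stopping threshold and the synthetic dataset are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, max_features, min_gain=1e-3, cv=3):
    """Greedy forward selection; returns (selected column indices, CV score)."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate that extends the current subset by one feature.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        # Stopping criterion: the best addition no longer improves the score enough.
        if scores[candidate] - best_score < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
chosen, score = forward_select(
    LogisticRegression(max_iter=1000), X_toy, y_toy, max_features=4
)
print(chosen, round(score, 3))
```

Each pass trains one model per remaining feature and per fold, which is where the cost of wrapper methods comes from.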

In [16]:
from imblearn.over_sampling import RandomOverSampler
oversampler = RandomOverSampler(random_state=42)


In [17]:
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

In [18]:
from xgboost import XGBClassifier

# Instantiate the XGBoost classifier (only 5 trees, to keep the search fast)
model = XGBClassifier(n_estimators=5)

forward = SequentialFeatureSelector(
    estimator=model,
    k_features='best',
    forward=True, 
    verbose=0,
    scoring='precision',
    cv=3
)

forward.fit(X_train_resampled, y_train_resampled)

pd.DataFrame.from_dict(forward.get_metric_dict()).T.head(5)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,feature_idx,cv_scores,avg_score,feature_names,ci_bound,std_dev,std_err
1,"(52,)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(poutcome_success,)",0.0047,0.002089,0.001477
2,"(10, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, poutcome_success)",0.0047,0.002089,0.001477
3,"(10, 11, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, poutcome_s...",0.0047,0.002089,0.001477
4,"(10, 11, 12, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
5,"(10, 11, 12, 13, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477


In [20]:

model_bench_sort = pd.DataFrame.from_dict(forward.get_metric_dict()).T.sort_values(by='avg_score',ascending=False)
model_bench_sort.head(20)

Unnamed: 0,feature_idx,cv_scores,avg_score,feature_names,ci_bound,std_dev,std_err
1,"(52,)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(poutcome_success,)",0.0047,0.002089,0.001477
3,"(10, 11, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, poutcome_s...",0.0047,0.002089,0.001477
4,"(10, 11, 12, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
5,"(10, 11, 12, 13, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
6,"(10, 11, 12, 13, 15, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
7,"(10, 11, 12, 13, 15, 20, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
8,"(10, 11, 12, 13, 15, 20, 24, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
9,"(10, 11, 12, 13, 15, 20, 24, 26, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
10,"(10, 11, 12, 13, 15, 20, 24, 26, 32, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, job_entrepreneur, job_housem...",0.0047,0.002089,0.001477
2,"(10, 52)","[0.9373996789727127, 0.9322834645669291, 0.934...",0.934852,"(job_blue-collar, poutcome_success)",0.0047,0.002089,0.001477


In [None]:
forward.k_feature_names_

('nr.employed', 'job_self-employed', 'marital_single', 'poutcome_success')

**2. Backward Elimination**

Step backward feature selection starts by fitting a model using all features. Then it removes one feature: the one whose removal produces the best performing algorithm for a certain evaluation criterion, that is, the feature that contributes least. In the second step, it removes a second feature, again the one whose removal produces the best performing algorithm. It proceeds, removing feature after feature, until a stopping criterion is met.
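As a small runnable illustration of backward elimination, the sketch below uses scikit-learn's own `SequentialFeatureSelector` with `direction="backward"` on synthetic data. Note that this is a different class from the mlxtend selector used elsewhere in this notebook (hence the `SklearnSFS` alias), and the dataset, estimator and target of 3 features are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector as SklearnSFS
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Start from all 6 features and remove one at a time until 3 remain.
backward_toy = SklearnSFS(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="backward",
    scoring="accuracy",
    cv=3,
)
backward_toy.fit(X_toy, y_toy)

kept = backward_toy.get_support(indices=True)
print(kept)
```

Here the stopping criterion is the predefined number of features; `transform` then returns only the retained columns.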

In [None]:
# model = sklearn.ensemble.RandomForestClassifier(n_estimators=5)

# backward = SequentialFeatureSelector(
#     estimator=model,
#     k_features='best', 
#     forward=False, 
#     verbose=0,
#     scoring='precision',
#     cv=3
# )

# backward.fit(X_train, y_train)

# pd.DataFrame.from_dict(backward.get_metric_dict()).T.head(10)

Unnamed: 0,feature_idx,cv_scores,avg_score,feature_names,ci_bound,std_dev,std_err
53,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.5948660714285714, 0.5901132852729145, 0.555...",0.580139,"(age, duration, campaign, pdays, previous, emp...",0.039547,0.017574,0.012426
52,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.6097560975609756, 0.5865490628445424, 0.575...",0.590698,"(age, duration, campaign, pdays, previous, emp...",0.031896,0.014174,0.010022
51,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.591116173120729, 0.5949506037321625, 0.5823...",0.589457,"(age, duration, campaign, pdays, previous, emp...",0.011914,0.005294,0.003744
50,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.6091101694915254, 0.588421052631579, 0.5726...",0.590066,"(age, duration, campaign, pdays, previous, emp...",0.033581,0.014923,0.010552
49,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.5993265993265994, 0.5916114790286976, 0.576...",0.589064,"(age, duration, campaign, pdays, previous, emp...",0.021581,0.00959,0.006781
48,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.6016073478760046, 0.5932203389830508, 0.581...",0.592168,"(age, duration, campaign, pdays, previous, emp...",0.018386,0.00817,0.005777
47,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.6123595505617978, 0.5989417989417989, 0.557...",0.589456,"(age, duration, campaign, pdays, previous, emp...",0.052991,0.023548,0.016651
46,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[0.5976608187134503, 0.5962732919254659, 0.576...",0.590299,"(age, duration, campaign, pdays, previous, emp...",0.02126,0.009447,0.00668
45,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14,...","[0.5955307262569832, 0.5993303571428571, 0.578...",0.591215,"(age, duration, campaign, pdays, previous, emp...",0.020085,0.008926,0.006311
44,"(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14,...","[0.59375, 0.5919477693144722, 0.5878023133543638]",0.591167,"(age, duration, campaign, pdays, previous, emp...",0.005604,0.00249,0.001761



In [None]:
# backward.k_feature_names_

('mean texture',
 'mean perimeter',
 'mean area',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension',
 'compactness error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst radius',
 'worst texture',
 'worst area',
 'worst compactness',
 'worst concavity',
 'worst symmetry',
 'worst fractal dimension')

____ 
#### **3. Embedded Method**

Embedded methods combine the advantages of filter and wrapper methods, offering fair computational cost and reliable performance. They are algorithm-based: the learning algorithm itself tracks which features are relevant, according to certain criteria, and identifies the most contributing features during the training phase. Common embedded methods include Lasso and various tree-based algorithms. Embedded methods:

- perform feature selection as part of the model building process
- consider interactions between features
- are less computationally expensive than wrappers, as they train the model only once
- usually provide the best-performing subset for a given ML algorithm, but probably not for another

In [None]:
# from sklearn.feature_selection import SelectFromModel

**1. LASSO (L1) Regularization**

It penalizes irrelevant parameters by shrinking their weights or coefficients to zero. Features whose coefficients reach zero are removed from the model, which both discards extraneous features and helps prevent overfitting. A fuller explanation of regularization is available [here](https://www.enjoyalgorithms.com/blog/regularization-in-machine-learning).

In [None]:
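# import sklearn.linear_model
# The L1 penalty must be requested explicitly; LogisticRegression defaults to L2
# model = sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear')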

# embed_lasso = SelectFromModel(
#     estimator=model,
#     max_features=5
# )
# embed_lasso.fit(X_train, y_train)

# X_train.columns[(embed_lasso.get_support())]

Index(['mean radius', 'texture error', 'worst radius', 'worst compactness',
       'worst concavity'],
      dtype='object')
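The shrink-to-zero behaviour that makes L1 usable for selection can be observed directly. In this sketch (synthetic data; the value `C=0.05` is an assumption chosen to make the penalty strong) the same logistic regression is fitted with an L1 and an L2 penalty, and the exactly-zero coefficients are counted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)  # penalties are scale-sensitive

# Smaller C means stronger regularization.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_toy, y_toy)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X_toy, y_toy)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print(f"zero coefficients with L1: {n_zero_l1}, with L2: {n_zero_l2}")
```

L2 only shrinks coefficients towards zero, so it typically leaves none exactly at zero; L1 removes features outright, which is what `SelectFromModel` relies on when wrapped around an L1-penalized estimator.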

**2. Random Forest Feature Importance**

The tree-based approach naturally ranks the features of a dataset by how much each one improves node purity. Impurity is measured with either the Gini impurity or entropy (information gain is the resulting decrease in entropy). When training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

In [None]:
# model = sklearn.ensemble.RandomForestClassifier()

# embed_rf = SelectFromModel(
#     estimator=model,
#     max_features=5
# )
# embed_rf.fit(X_train, y_train)

# X_train.columns[(embed_rf.get_support())]

Index(['mean concave points', 'worst radius', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')
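The averaging across trees can be verified on a small forest. The sketch below uses synthetic data with `shuffle=False`, so that the 3 informative features occupy the first three columns; the dataset and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importance reported by the forest.
importances = forest.feature_importances_
# The same quantity, obtained by averaging the per-tree importances.
per_tree_mean = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)
ranking = np.argsort(importances)[::-1]  # most important feature first

print(np.round(importances, 3))
print(ranking)
```

The importances are normalized to sum to 1, so they express relative rather than absolute contribution.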