# Assignment 2: Feature Selection for Attrition / Burnout Prediction

# Group 14: Rylie Ramos-Marquez, Derek Atabayev, Vishnu Garigipati

From the previous assignment, we know that the best method is gradient-boosted trees, specifically with 200 boosting rounds.

Why it's the best:

* High F1-score of 0.9319, which surpasses the other models by at least 0.1 (10 percentage points) and also beats the simple KNN and decision tree models

* Shallow tree structure (max depth = 5) which prevents overfitting

* Consistent performance, with a tight 95% confidence interval of [0.855171, 0.893952]; the narrow range indicates that performance is stable and strong
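As a rough illustration of how an interval like the one above can be obtained, the sketch below computes a normal-approximation 95% confidence interval for the mean of a set of scores. The scores and the helper name `mean_confidence_interval` are hypothetical, and this is not necessarily the method used in assignment 1.

```python
import math
import statistics


def mean_confidence_interval(scores, z=1.96):
    """Normal-approximation confidence interval for the mean of `scores`."""
    mean = statistics.mean(scores)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * sem, mean + z * sem


# Hypothetical per-fold scores, for illustration only.
fold_scores = [0.86, 0.88, 0.87, 0.89, 0.875]
low, high = mean_confidence_interval(fold_scores)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

A narrow interval means the score changes little from fold to fold, which is the sense in which the performance is called consistent.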

The final model is saved in pickle format for easy retrieval.

In [38]:
# imports for the project

import joblib # since joblib.dump was used to save the model
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder # LabelEncoder is defined in sklearn.preprocessing
from sklearn.metrics import accuracy_score, f1_score

import numpy as np

In [39]:
# retrieve final_model_14.pkl from ../final_model_14.pkl

model = joblib.load('../final_model_14.pkl')

# check if the model has been loaded correctly
print(model)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  Index(['hrs', 'absences', 'JobInvolvement', 'PerformanceRating',
       'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age',
       'DistanceFromHome', 'Education', 'EmployeeID', 'JobLevel',
       'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object'))])),
                ('classifier',
                 GradientBoostingClassifier(max_d

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Looks like the model has been loaded successfully. We can see that the feature names are correct and familiar, so the preprocessing and classifier are intact. All the gradient boosting parameters are there too.
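The check described above can also be done programmatically instead of reading the printed pipeline. The sketch below is a minimal illustration on a hypothetical stand-in pipeline (`toy_model`, unfitted, with the 200 rounds and depth 5 mentioned earlier), not on the real `final_model_14.pkl`. It saves the pipeline with joblib, reloads it, and inspects the step names and classifier parameters.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the saved model (unfitted, for illustration only).
toy_model = Pipeline([
    ('preprocessor', MinMaxScaler()),
    ('classifier', GradientBoostingClassifier(n_estimators=200, max_depth=5)),
])

# Save and reload through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, path)
    loaded = joblib.load(path)

# Step names and classifier hyperparameters survive the round trip.
step_names = list(loaded.named_steps)
clf_params = loaded.named_steps['classifier'].get_params()
print(step_names)
print(clf_params['n_estimators'], clf_params['max_depth'])
```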

For the grid search later, we will need our training data. We can pull that now too.

We added a cell to the assignment 1 notebook to export this data to pickle files.

Our chosen split in assignment 1 was 80/20 training/testing.
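To make the 80/20 split and the label encoding concrete, here is a small sketch on hypothetical toy data. The arrays `X_toy` and `y_toy` and the 'Yes'/'No' labels are assumptions for illustration. The encoder learns its mapping on the training labels only and reuses it on the test labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: 10 rows, 2 features, string labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(['No', 'Yes'] * 5)

# 80/20 split: 8 training rows and 2 testing rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

encoder = LabelEncoder()
y_tr_enc = encoder.fit_transform(y_tr)  # learn the mapping on the training labels
y_te_enc = encoder.transform(y_te)      # reuse the same mapping on the test labels

print(len(X_tr), len(X_te))
print(list(encoder.classes_))
```

Classes are encoded in sorted order, so with these toy labels 'No' becomes 0 and 'Yes' becomes 1. That matters later because `f1_score` treats label 1 as the positive class by default.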

In [40]:
# load data X and y from the pickle files

X = joblib.load('./X.pkl')
y = joblib.load('./y.pkl')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100545358)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


In [41]:
correlation_matrix = X.corr(numeric_only=True)
high_corr_pairs = correlation_matrix.where((correlation_matrix > 0.7) & (correlation_matrix < 1))

for col in high_corr_pairs.columns:
    high_corr_indices = high_corr_pairs.index[high_corr_pairs[col].notnull()]
    for idx in high_corr_indices:
        print(f"High correlation: {col} and {idx} -> {correlation_matrix.loc[idx, col]:.2f}")

High correlation: PerformanceRating and PercentSalaryHike -> 0.79
High correlation: PercentSalaryHike and PerformanceRating -> 0.79
High correlation: YearsAtCompany and YearsWithCurrManager -> 0.76
High correlation: YearsWithCurrManager and YearsAtCompany -> 0.76
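The output above lists every pair twice and only catches positive correlations. A possible variant, sketched below on a hypothetical toy DataFrame, keeps only the upper triangle of the correlation matrix and compares absolute values, so each pair appears once and strong negative correlations are also reported. The helper name `high_correlation_pairs` is an assumption, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.7):
    """Return (col_a, col_b, corr) for each column pair with |corr| > threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (if any remain) compare as False and are filtered out here.
    strong = pairs[pairs.abs() > threshold]
    return [(a, b, float(v)) for (a, b), v in strong.items()]


# Hypothetical toy frame: 'b' is almost a copy of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

toy_pairs = high_correlation_pairs(toy)
for col_a, col_b, value in toy_pairs:
    print(f"High correlation: {col_a} and {col_b} -> {value:.2f}")
```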


# Adding feature selection

* SelectKBest with the criteria f_classif and mutual_info_classif
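Before wiring the selector into the pipeline, it may help to see what `SelectKBest` does in isolation. The sketch below uses hypothetical synthetic data from `make_classification` (not the attrition data) and keeps the 3 highest-scoring features under each criterion. Because `mutual_info_classif` is randomized, its seed is fixed here with `functools.partial`; the notebook itself does not do this.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Hypothetical synthetic data: 8 features, of which the first 3 are informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# mutual_info_classif is randomized, so fix its seed for reproducibility.
mi_score = partial(mutual_info_classif, random_state=0)

kept_by_criterion = {}
for name, score_func in [('f_classif', f_classif), ('mutual_info_classif', mi_score)]:
    selector = SelectKBest(score_func, k=3).fit(X_syn, y_syn)
    # Indices of the k columns with the highest scores.
    kept_by_criterion[name] = selector.get_support(indices=True).tolist()
    print(f"{name}: kept feature indices {kept_by_criterion[name]}")
```

The two criteria can disagree: `f_classif` measures linear separation of the class means, while mutual information can also pick up non-linear dependence.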

In [42]:
# add feature selection with SelectKBest and f_classif to the pipeline

# score_func is the function used to score the importance of each feature
# f_classif is the default score_func; it computes the ANOVA F-value

# Extract components from the existing pipeline
preprocessor = model.named_steps['preprocessor']
classifier = model.named_steps['classifier']

# create pipeline 1: feature selection with SelectKBest and criterion f_classif

pipeline_f_classif = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(f_classif)),
    ('classifier', classifier)
])

# create pipeline 2: feature selection with SelectKBest and criterion mutual_info_classif
pipeline_mutual_info_classif = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(mutual_info_classif)),
    ('classifier', classifier)
])


Performing grid search to tune the number of features to be selected (k parameter)

  * We need to test a range of values for k; since we have 21 features, we try every value from 1 to 21
  * This will tell us whether a larger or a smaller number of features works best
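The grid search addresses k through the name `selector__k`. The sketch below, on a hypothetical stand-in pipeline that reuses the 'selector' and 'classifier' step names, shows how scikit-learn's `<step name>__<parameter>` convention maps onto a pipeline step.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Hypothetical stand-in pipeline (no preprocessor, for brevity).
demo_pipeline = Pipeline([
    ('selector', SelectKBest(f_classif)),
    ('classifier', GradientBoostingClassifier()),
])

# SelectKBest keeps 10 features unless k is set.
default_k = demo_pipeline.get_params()['selector__k']

# '<step name>__<parameter>' addresses one parameter of one pipeline step.
demo_pipeline.set_params(selector__k=5)
updated_k = demo_pipeline.named_steps['selector'].k
print(default_k, updated_k)
```

GridSearchCV applies the same mechanism to a clone of the pipeline for every candidate value in the parameter grid.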

In [43]:
# perform grid search to find the best value of k
# GridSearchCV scores classifiers by accuracy when no scoring is given

# define the parameters for the grid search

# candidate values of k: every value from 1 to 21
param_grid = {
    'selector__k': list(range(1, 22))
}

def perform_grid_search(param_grid):
    # create grid search object for pipeline 1
    grid_search_f_classif = GridSearchCV(pipeline_f_classif, param_grid, cv=5, n_jobs=-1, verbose=0)
    # in the previous assignment, we used 5-fold cross validation because the dataset is small, and we want to make sure that the model is not overfitting
    # that approach worked well and so we can use it for HPO of the 'k' parameter too

    # creating grid search object for pipeline 2
    grid_search_mutual_info_classif = GridSearchCV(pipeline_mutual_info_classif, param_grid, cv=5, n_jobs=-1, verbose=0)

    # fit the grid search objects to the data
    grid_search_f_classif.fit(X_train, y_train_encoded)
    grid_search_mutual_info_classif.fit(X_train, y_train_encoded)

    # print the optimal k in both cases
    best_k_f_classif = grid_search_f_classif.best_params_['selector__k']
    print(f'Optimal value of k for f_classif: {best_k_f_classif}')

    best_k_mutual_info_classif = grid_search_mutual_info_classif.best_params_['selector__k']
    print(f'Optimal value of k for mutual_info_classif: {best_k_mutual_info_classif}')

    print("\nAverage accuracy for each k (f_classif):")
    for mean, params in zip(grid_search_f_classif.cv_results_['mean_test_score'], grid_search_f_classif.cv_results_['params']):
        print(f"k = {params['selector__k']}: Average Accuracy = {mean:.4f}")

    # Print average accuracy for each k for mutual_info_classif
    print("\nAverage accuracy for each k (mutual_info_classif):")
    for mean, params in zip(grid_search_mutual_info_classif.cv_results_['mean_test_score'], grid_search_mutual_info_classif.cv_results_['params']):
        print(f"k = {params['selector__k']}: Average Accuracy = {mean:.4f}")
    return grid_search_f_classif, grid_search_mutual_info_classif

# Call the function with the parameter grid
grid_search_f_classif, grid_search_mutual_info_classif = perform_grid_search(param_grid)


Optimal value of k for f_classif: 21
Optimal value of k for mutual_info_classif: 21

Average accuracy for each k (f_classif):
k = 1: Average Accuracy = 0.6202
k = 2: Average Accuracy = 0.6456
k = 3: Average Accuracy = 0.6947
k = 4: Average Accuracy = 0.7553
k = 5: Average Accuracy = 0.7991
k = 6: Average Accuracy = 0.8088
k = 7: Average Accuracy = 0.8298
k = 8: Average Accuracy = 0.8421
k = 9: Average Accuracy = 0.8386
k = 10: Average Accuracy = 0.8325
k = 11: Average Accuracy = 0.8368
k = 12: Average Accuracy = 0.8544
k = 13: Average Accuracy = 0.8386
k = 14: Average Accuracy = 0.8447
k = 15: Average Accuracy = 0.8456
k = 16: Average Accuracy = 0.8526
k = 17: Average Accuracy = 0.8605
k = 18: Average Accuracy = 0.8649
k = 19: Average Accuracy = 0.8684
k = 20: Average Accuracy = 0.8754
k = 21: Average Accuracy = 0.8781

Average accuracy for each k (mutual_info_classif):
k = 1: Average Accuracy = 0.7395
k = 2: Average Accuracy = 0.8281
k = 3: Average Accuracy = 0.8377
k = 4: Average Acc

In [44]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def perform_multiple_grid_search(param_grid, X_train, y_train_encoded, n_runs=20):
    best_ks_f_classif = []
    best_ks_mutual_info_classif = []
    
    # Store accuracies for each k
    accuracy_f_classif = {k: [] for k in param_grid['selector__k']}
    accuracy_mutual_info_classif = {k: [] for k in param_grid['selector__k']}
    
    # Perform grid search over multiple runs
    for _ in range(n_runs):
        # Create a freshly shuffled 5-fold split for this run (random_state=None, so folds differ between runs)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
        
        # Define grid search objects
        grid_search_f_classif = GridSearchCV(pipeline_f_classif, param_grid, cv=cv, n_jobs=-1, verbose=0)
        grid_search_mutual_info_classif = GridSearchCV(pipeline_mutual_info_classif, param_grid, cv=cv, n_jobs=-1, verbose=0)

        # Fit the grid search objects to the data
        grid_search_f_classif.fit(X_train, y_train_encoded)
        grid_search_mutual_info_classif.fit(X_train, y_train_encoded)

        # Get optimal k for each method
        best_k_f_classif = grid_search_f_classif.best_params_['selector__k']
        best_k_mutual_info_classif = grid_search_mutual_info_classif.best_params_['selector__k']
        
        best_ks_f_classif.append(best_k_f_classif)
        best_ks_mutual_info_classif.append(best_k_mutual_info_classif)

        # Record accuracies for each k
        for mean, params in zip(grid_search_f_classif.cv_results_['mean_test_score'], grid_search_f_classif.cv_results_['params']):
            accuracy_f_classif[params['selector__k']].append(mean)

        for mean, params in zip(grid_search_mutual_info_classif.cv_results_['mean_test_score'], grid_search_mutual_info_classif.cv_results_['params']):
            accuracy_mutual_info_classif[params['selector__k']].append(mean)
    
    # Calculate average optimal k and accuracy
    avg_best_k_f_classif = np.mean(best_ks_f_classif)
    avg_best_k_mutual_info_classif = np.mean(best_ks_mutual_info_classif)

    print(f'Average optimal value of k for f_classif over {n_runs} runs: {avg_best_k_f_classif}')
    print(f'Average optimal value of k for mutual_info_classif over {n_runs} runs: {avg_best_k_mutual_info_classif}')
    
    print("\nAverage accuracy for each k (f_classif):")
    for k, accuracies in accuracy_f_classif.items():
        print(f"k = {k}: Average Accuracy = {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")
    
    print("\nAverage accuracy for each k (mutual_info_classif):")
    for k, accuracies in accuracy_mutual_info_classif.items():
        print(f"k = {k}: Average Accuracy = {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")
    print(f'\n\n best k_f {best_ks_f_classif}')
    print(f'\n\n best mutual info {best_ks_mutual_info_classif}')
    return avg_best_k_f_classif, avg_best_k_mutual_info_classif, accuracy_f_classif, accuracy_mutual_info_classif

# Example usage:
param_grid = {'selector__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]}
avg_best_k_f_classif, avg_best_k_mutual_info_classif, accuracy_f_classif, accuracy_mutual_info_classif = perform_multiple_grid_search(param_grid, X_train, y_train_encoded, n_runs=30)


Average optimal value of k for f_classif over 30 runs: 18.1
Average optimal value of k for mutual_info_classif over 30 runs: 17.3

Average accuracy for each k (f_classif):
k = 1: Average Accuracy = 0.6047 ± 0.0128
k = 2: Average Accuracy = 0.6594 ± 0.0136
k = 3: Average Accuracy = 0.7223 ± 0.0126
k = 4: Average Accuracy = 0.7791 ± 0.0099
k = 5: Average Accuracy = 0.8052 ± 0.0085
k = 6: Average Accuracy = 0.8312 ± 0.0101
k = 7: Average Accuracy = 0.8487 ± 0.0092
k = 8: Average Accuracy = 0.8558 ± 0.0090
k = 9: Average Accuracy = 0.8635 ± 0.0082
k = 10: Average Accuracy = 0.8634 ± 0.0089
k = 11: Average Accuracy = 0.8624 ± 0.0102
k = 12: Average Accuracy = 0.8638 ± 0.0091
k = 13: Average Accuracy = 0.8632 ± 0.0110
k = 14: Average Accuracy = 0.8651 ± 0.0085
k = 15: Average Accuracy = 0.8668 ± 0.0091
k = 16: Average Accuracy = 0.8664 ± 0.0106
k = 17: Average Accuracy = 0.8692 ± 0.0116
k = 18: Average Accuracy = 0.8687 ± 0.0086
k = 19: Average Accuracy = 0.8724 ± 0.0068
k = 20: Average Accu

In this case, the best k value is 20 for both criteria, so only one feature is discarded for having minimal predictive power.

This suggests that, overall, the features have very strong predictive power.
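One way to read a single best k off the repeated runs is to take the k with the highest mean accuracy across runs. The sketch below assumes a dictionary shaped like the accuracy dictionaries returned by `perform_multiple_grid_search` (k mapped to a list of per-run accuracies). The numbers are hypothetical and are not the notebook's results.

```python
import numpy as np

# Hypothetical accuracies per k over three runs (illustration only).
accuracy_by_k = {
    18: [0.868, 0.870, 0.866],
    19: [0.872, 0.874, 0.871],
    20: [0.876, 0.875, 0.877],
    21: [0.875, 0.874, 0.876],
}

# Average over runs, then pick the k with the highest average.
mean_by_k = {k: float(np.mean(scores)) for k, scores in accuracy_by_k.items()}
best_k = max(mean_by_k, key=mean_by_k.get)
print(f"k with the highest mean accuracy: {best_k}")
```

This can differ from the average of the per-run best k values, because a k that is rarely the single winner may still have the best accuracy on average.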

In [52]:
from collections import Counter

# Best k found in each of the 30 runs (the lists printed by the previous cell)

best_ks_f_classif = [18, 11, 21, 17, 17, 20, 19, 19, 21, 14, 19, 20, 16, 19, 19, 19, 21, 20, 21, 15, 15, 20, 18, 21, 19, 17, 15, 17, 16, 19]
best_ks_mutual_info_classif = [19, 12, 18, 15, 19, 18, 18, 20, 19, 20, 19, 16, 18, 20, 19, 16, 13, 15, 11, 19, 20, 21, 10, 18, 20, 19, 19, 19, 12, 17]
counter1 = Counter(best_ks_f_classif)
counter2 = Counter(best_ks_mutual_info_classif)

# Format counts and sort by values (counts) in descending order
formatted_counts1 = ", ".join(f"{key}: {value}" for key, value in sorted(counter1.items(), key=lambda x: x[1], reverse=True))
formatted_counts2 = ", ".join(f"{key}: {value}" for key, value in sorted(counter2.items(), key=lambda x: x[1], reverse=True))

# Print results
print(f"Counts in list1: {formatted_counts1}")
print(f"Counts in list2: {formatted_counts2}")

Counts in list1: 19: 8, 21: 5, 17: 4, 20: 4, 15: 3, 18: 2, 16: 2, 11: 1, 14: 1
Counts in list2: 19: 9, 18: 5, 20: 5, 12: 2, 15: 2, 16: 2, 13: 1, 11: 1, 21: 1, 10: 1, 17: 1


In [46]:
# Evaluate the models obtained with the two pipelines on the testing dataset

# get the best models from the grid search object
best_model_f_classif = grid_search_f_classif.best_estimator_

best_model_mutual_info_classif = grid_search_mutual_info_classif.best_estimator_

# print the accuracy and f1-score of the best models on the testing dataset

y_pred_f_classif = best_model_f_classif.predict(X_test)
y_pred_mutual_info_classif = best_model_mutual_info_classif.predict(X_test)

accuracy_f_classif = accuracy_score(y_test_encoded, y_pred_f_classif)
f1_f_classif = f1_score(y_test_encoded, y_pred_f_classif)

accuracy_mutual_info_classif = accuracy_score(y_test_encoded, y_pred_mutual_info_classif)
f1_mutual_info_classif = f1_score(y_test_encoded, y_pred_mutual_info_classif)

print(f'Accuracy of the best model with f_classif: {accuracy_f_classif}')
print(f'F1-score of the best model with f_classif: {f1_f_classif}')

print(f'Accuracy of the best model with mutual_info_classif: {accuracy_mutual_info_classif}')
print(f'F1-score of the best model with mutual_info_classif: {f1_mutual_info_classif}')




Accuracy of the best model with f_classif: 0.9265734265734266
F1-score of the best model with f_classif: 0.9292929292929293
Accuracy of the best model with mutual_info_classif: 0.9265734265734266
F1-score of the best model with mutual_info_classif: 0.9292929292929293


We choose mutual_info_classif as the feature selection criterion.

In the test results above, the accuracy and F1-score are identical for both criteria, so the two models are practically the same.

In [47]:
# Check which features are actually selected
# We have 21, and k=20, so one has been dropped

all_features = ['hrs', 'absences', 'JobInvolvement', 'PerformanceRating',
       'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age',
       'DistanceFromHome', 'Education', 'EmployeeID', 'JobLevel',
       'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
# get the selected features from the best model with mutual_info_classif
selected_features_mutual_info_classif = best_model_mutual_info_classif.named_steps['selector'].get_support()

# find the feature that has been dropped
# zip pairs each feature name with its boolean selection flag
dropped_feature = [feature for feature, selected in zip(all_features, selected_features_mutual_info_classif) if not selected]
print(f'The dropped feature is: {dropped_feature}')


The dropped feature is: []


All features except 'absences' are selected, for 20 features in total.

The results improve on the previous assignment: the F1-score has increased from 0.9320 to 0.9424.

The likely reason is that 'absences' is so weakly related to attrition that it contributed mostly noise, so the model predicts better once it is removed.
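To check why a feature is dropped, one can look at the scores the selector assigned to each feature. The sketch below is a minimal illustration on hypothetical data with one informative column and two pure-noise columns. It uses `f_classif` rather than `mutual_info_classif` only to keep the example deterministic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 'signal' tracks the label, the other two columns are noise.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
feature_names = ['signal', 'noise_a', 'noise_b']
X_demo = np.column_stack([
    y_demo + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# Pair each name with its score and sort from most to least informative.
ranked = sorted(zip(feature_names, selector.scores_), key=lambda item: item[1], reverse=True)
# Features whose support flag is False were dropped by the selector.
dropped = [name for name, keep in zip(feature_names, selector.get_support()) if not keep]

for name, score in ranked:
    print(f"{name}: {score:.2f}")
print(f"Dropped: {dropped}")
```

In a fitted pipeline, the same `scores_` and `get_support()` attributes are available on `named_steps['selector']`.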


In [48]:
# Redefine the pipeline with k = 20

pipeline_mutual_info_classif = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(mutual_info_classif, k=20)), # add this feature selection
    ('classifier', classifier)
])
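The redefined pipeline still has to be fitted before it can predict. The sketch below shows that final step on hypothetical synthetic data with 21 features, with a plain `MinMaxScaler` standing in for the real preprocessor and fewer boosting rounds to keep it fast. It illustrates the workflow and does not reproduce the reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic stand-in for the attrition data (21 features).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=21, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

final_pipeline = Pipeline([
    ('preprocessor', MinMaxScaler()),
    ('selector', SelectKBest(mutual_info_classif, k=20)),
    ('classifier', GradientBoostingClassifier(n_estimators=50, max_depth=5, random_state=0)),
])

# Fit on the training split only, then score on the held-out split.
final_pipeline.fit(X_tr, y_tr)
test_f1 = f1_score(y_te, final_pipeline.predict(X_te))
n_kept = int(final_pipeline.named_steps['selector'].get_support().sum())
print(f"Features kept: {n_kept}, test F1: {test_f1:.4f}")
```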