# Assignment 2: Feature Selection for Attrition / Burnout Prediction

# Group 14: Rylie Ramos-Marquez, Derek Atabayev, Vishnu Garigipati

From the previous assignment, we know that the mest method is gradient boosting of trees, specifically with 200 boosting rounds.

Why it's the best:

* High F1-score = 0.9319 which far surpasses other models (better by at least 0.1 = 10%), and is also better than simple KNN/Decision trees model

* Shallow tree structure (max depth = 5) which prevents overfitting

* Consistent performance with a tight 95% confidence interval of [0.855171, 0.893952] which is a very small range, meaning the performance is consistent and strong

The final model is saved as a Pickle format for easy retrieval

In [1]:
# imports for the project

import joblib # since joblib.dump was used to save the model
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.calibration import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score






In [2]:
# retrieve final_model_14.pkl from ../final_model_14.pkl

model = joblib.load('../final_model_14.pkl')

# check if the model has been loaded correctly
print(model)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  Index(['hrs', 'absences', 'JobInvolvement', 'PerformanceRating',
       'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age',
       'DistanceFromHome', 'Education', 'EmployeeID', 'JobLevel',
       'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object'))])),
                ('classifier',
                 GradientBoostingClassifier(max_d

Looks like the model has been loaded successfully. We can see that the feature names are correct and familiar, so the preprocessing and classifier are intact. All the gradient boosting parameters are there too.

For the grid search later, we will need our training data. We can pull that now too.

We added a cell to the notebook from assignment 1 to export this data to pickle files

Our ideal split from assignment one was 80/20 training/testing

In [3]:
# save data X and y from pickle

X = joblib.load('./X.pkl')
y = joblib.load('./y.pkl')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100545358)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


# Adding feature selection

* SelectKBest and criterion f_classif

In [4]:
# add feature selection with SelectKBest and f_classif to the pipeline

# score_func is the function used to evaluate the importance of each feature
# f_classif is a default value for the score_func parameter, calculates ANOVA F-Value

# Extract components from the existing pipeline
preprocessor = model.named_steps['preprocessor']
classifier = model.named_steps['classifier']

# create pipeline 1: feature selection with SelectKBest and criterion f_classif

pipeline_f_classif = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(f_classif, k=10)),
    ('classifier', classifier)
])
    

# create pipline 2: feature selection with SelectKBest and criterion mutual_info_classif

pipeline_mutual_info_classif = Pipeline([
    ('model', model),
    ('selector', SelectKBest(mutual_info_classif, k=10)),
    ('classifier', classifier)
])


Performing grid search to tune the number of features to be selected (k parameter)

  * We need to use an array to test different values of k, since we have 21 features, we can check 5, 10, 15, 20, 21 possibilities
  * This will tell us whether the higher side or lower side is best

In [5]:
# performing grid search to find the best parameters for the model, best metric is accuracy

# define the parameters for the grid search

# parameters for choosing a value of k
param_grid = {
    'selector__k': [5,10,15,20,21]
}

def perform_grid_search(param_grid):
    # create grid search object for pipeline 1
    grid_search_f_classif = GridSearchCV(pipeline_f_classif, param_grid, cv=5, n_jobs=-1, verbose=1)
    # in the previous assignment, we used 5-fold cross validation because the dataset is small, and we want to make sure that the model is not overfitting
    # that approach worked well and so we can use it for HPO of the 'k' parameter too

    # creating grid search object for pipeline 2
    grid_search_mutual_info_classif = GridSearchCV(pipeline_mutual_info_classif, param_grid, cv=5, n_jobs=-1, verbose=1)

    # fit the grid search objects to the data
    grid_search_f_classif.fit(X_train, y_train_encoded)
    grid_search_mutual_info_classif.fit(X_train, y_train_encoded)

    # print the optimal k in both cases
    best_k_f_classif = grid_search_f_classif.best_params_['selector__k']
    print(f'Optimal value of k for f_classif: {best_k_f_classif}')

    best_k_mutual_info_classif = grid_search_mutual_info_classif.best_params_['selector__k']
    print(f'Optimal value of k for mutual_info_classif: {best_k_mutual_info_classif}')

    return grid_search_f_classif, grid_search_mutual_info_classif

# Call the function with the parameter grid
grid_search_f_classif, grid_search_mutual_info_classif = perform_grid_search(param_grid)


Fitting 5 folds for each of 5 candidates, totalling 25 fits


Fitting 5 folds for each of 5 candidates, totalling 25 fits


ValueError: 
All the 25 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 472, in fit
    Xt = self._fit(X, y, routed_params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 389, in _fit
    self._validate_steps()
  File "c:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 259, in _validate_steps
    raise TypeError(
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   MinMaxScaler())]),
                                                  Index(['hrs', 'absences', 'JobInvolvement', 'PerformanceRating',
       'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age',
       'DistanceFromHome', 'Education', 'EmployeeID', 'JobLevel',
       'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object'))])),
                ('classifier',
                 GradientBoostingClassifier(max_depth=5, min_samples_split=5,
                                            n_estimators=200,
                                            random_state=100545358,
                                            subsample=0.8))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't


In this case, the best k value is 20 in both cases; only one k-value is being discarded for minimal predictive power.

This suggests that overall, the features havevery strong predictive power.

In [None]:
# Evaluate the models obtained with the two pipelines on the testing dataset

# get the best models from the grid search object
best_model_f_classif = grid_search_f_classif.best_estimator_

best_model_mutual_info_classif = grid_search_mutual_info_classif.best_estimator_

# print the accuracy and f1-score of the best models on the testing dataset

y_pred_f_classif = best_model_f_classif.predict(X_test)
y_pred_mutual_info_classif = best_model_mutual_info_classif.predict(X_test)

accuracy_f_classif = accuracy_score(y_test_encoded, y_pred_f_classif)
f1_f_classif = f1_score(y_test_encoded, y_pred_f_classif)

accuracy_mutual_info_classif = accuracy_score(y_test_encoded, y_pred_mutual_info_classif)
f1_mutual_info_classif = f1_score(y_test_encoded, y_pred_mutual_info_classif)

print(f'Accuracy of the best model with f_classif: {accuracy_f_classif}')
print(f'F1-score of the best model with f_classif: {f1_f_classif}')

print(f'Accuracy of the best model with mutual_info_classif: {accuracy_mutual_info_classif}')
print(f'F1-score of the best model with mutual_info_classif: {f1_mutual_info_classif}')




Accuracy of the best model with f_classif: 0.9195804195804196
F1-score of the best model with f_classif: 0.9225589225589226
Accuracy of the best model with mutual_info_classif: 0.9300699300699301
F1-score of the best model with mutual_info_classif: 0.9319727891156463


The best feature selection method is with the criterion: mutual_info_classif

The accuracy and f1-scores in this case are slightly better, but both models are pretty much the same.

In [103]:
# Check which features are actually selected
# We have 21, and k=20, so one has been dropped

all_features = ['hrs', 'absences', 'JobInvolvement', 'PerformanceRating',
       'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age',
       'DistanceFromHome', 'Education', 'EmployeeID', 'JobLevel',
       'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
# get the selected features from the best model with mutual_info_classif
selected_features_mutual_info_classif = best_model_mutual_info_classif.named_steps['selector'].get_support()

# find the feature that has been dropped
dropped_feature = [feature for feature, selected in zip(all_features, selected_features_mutual_info_classif) if not selected] # zip is used to iterate over two lists at the same time
print(f'The dropped feature is: {dropped_feature}')


The dropped feature is: []


All features, besides 'absences' are selected. 20 total features

The results are improved as compared to the previous assignment, F1 Score has increased from 0.9320 to 0.9424

Reason is likely that 'absences' might have been so poorly correlated to attrition that it contributed noise, which means the model can improve its predictions after removing it


In [104]:
# Redefine the pipeline with k = 20

pipeline_mutual_info_classif = Pipeline([
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(mutual_info_classif, k=20)), # add this feature selection
    ('classifier', classifier)
])