# Classification Model and Evaluation

## Objectives

Answer Business Requirement 2:

- The client is interested in using employee data to predict whether an employee is at risk of leaving the company.

- Fit and evaluate a classification model to predict if an employee will leave (attrition) or stay with the company.


### Inputs

- outputs/datasets/cleaned/TrainSetCleaned.csv
- outputs/datasets/cleaned/TestSetCleaned.csv
- Instructions on data cleaning and feature engineering from the relevant notebooks

### Outputs

- Data cleaning, feature engineering, and modeling pipelines
- Feature importance plot
- Model evaluation metrics for employee attrition prediction



---

## Change Working Directory

We need to ensure that the working directory is correctly set:

In [None]:
import os

current_dir = os.getcwd()
current_dir

# Set the working directory to the parent directory
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

# Confirm the new current directory
current_dir = os.getcwd()
current_dir
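
Re-running the cell above moves one directory further up each time. A minimal sketch of a more defensive alternative, assuming the project root can be recognised by an `outputs` folder (the helper name `find_project_root` and the `marker` argument are illustrative and not part of the original notebook):

```python
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Return the closest ancestor of `start` (or `start` itself) containing `marker`.

    `marker` is an assumed folder name used to recognise the project root.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# Hypothetical usage: os.chdir(find_project_root(os.getcwd()))
```

Because the function searches for the marker instead of always moving up one level, calling it repeatedly yields the same directory.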

---

## Load Data


Load the cleaned training and test datasets that have been prepared with imputed values for MonthlyIncome, TotalWorkingYears, and any other necessary features, ready for the ML pipeline.

In [None]:
import pandas as pd

# Load the train set
train_set_df = pd.read_csv("outputs/datasets/cleaned/TrainSetCleaned.csv")
train_set_df.head(3)

# Load the test set
test_set_df = pd.read_csv("outputs/datasets/cleaned/TestSetCleaned.csv")
test_set_df.head(3)


---

## Classification ML Pipeline

### Pipeline for Data Cleaning and Feature Engineering

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import OrdinalEncoder

def DataCleaningandFeatEngPipeline():
    pipeline = Pipeline([
        ("median_imputation", MeanMedianImputer(imputation_method="median", 
                                                variables=["MonthlyIncome", "TotalWorkingYears"])),
        ("frequent_imputation", CategoricalImputer(imputation_method="frequent", 
                                                   variables=["JobRole", "MaritalStatus"])),
        ("ordinal_encoding", OrdinalEncoder(encoding_method="arbitrary", 
                                            variables=["BusinessTravel", "Department", "EducationField"]))
    ])
    return pipeline


---

## Pipeline for Modeling

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def ClassificationPipeline(model):
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model)
    ])
    return pipeline
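
The step names chosen in the pipeline matter later, because grid-search parameters are addressed as `<step name>__<parameter>`. A small sketch on synthetic data (the `X_toy`, `y_toy`, and `toy_pipeline` names are illustrative stand-ins, not the HR dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook uses the cleaned HR dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=0)),
])
toy_pipeline.fit(X_toy, y_toy)

# Parameters of a step are reached through the "<step>__<parameter>" name.
toy_pipeline.set_params(model__C=0.5)
print(toy_pipeline.get_params()["model__C"])
```

This is why the parameter grids further down use keys such as `model__n_estimators`.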


---

## Split Data into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_set_df.drop("Attrition", axis=1),
    train_set_df["Attrition"],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
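
The split above does not stratify on the target. If preserving the class proportions in both subsets is desired, `train_test_split` accepts a `stratify` argument. A sketch on toy data (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 20% positives (illustrative, not the real data).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Both subsets keep the original share of positives.
print(y_tr.mean(), y_te.mean())
```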


---


## Target Distribution and Oversampling

Check target distribution in the training set to evaluate balance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Check target distribution of the train set
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()
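
A bar chart shows the balance visually; the same check can be expressed in numbers. A minimal sketch with hypothetical labels standing in for `y_train`:

```python
import pandas as pd

# Hypothetical labels standing in for y_train.
y_example = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

# Share of each class as a fraction of all rows.
class_share = y_example.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class size.
counts = y_example.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```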




- The target looks relatively balanced, but we will still oversample the minority class so that the model is not biased towards the majority class.

- In order to do this, we first need to clean and encode the data.



In [None]:
# Construct the data cleaning and feature engineering pipeline
data_cleaning_feat_eng_pipeline = DataCleaningandFeatEngPipeline()

# Apply the pipeline to clean and encode the training and test sets
X_train = data_cleaning_feat_eng_pipeline.fit_transform(X_train, y_train)
X_test = data_cleaning_feat_eng_pipeline.transform(X_test)

# Display the shapes of the transformed datasets
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
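
The pipeline is fitted on the train set and only applied to the test set, so that statistics such as medians are learned from training data alone. A sketch of that idea, using scikit-learn's `SimpleImputer` as a stand-in for feature_engine's `MeanMedianImputer` and made-up income values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values for illustration only.
train_income = np.array([[1000.0], [2000.0], [np.nan], [9000.0]])
test_income = np.array([[np.nan], [500.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_income)  # learns the train median
test_filled = imputer.transform(test_income)        # reuses the train median

print(imputer.statistics_)
print(test_filled)
```

The missing test value is filled with the median of the train values, not of the test values.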


We ensure all features are numeric, as SMOTE requires numeric input.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Identify any columns that are still non-numeric after the pipeline
non_numeric_columns = X_train.select_dtypes(include=['object', 'category']).columns

if len(non_numeric_columns) > 0:
    encoder = OrdinalEncoder()
    X_train[non_numeric_columns] = encoder.fit_transform(X_train[non_numeric_columns])
    X_test[non_numeric_columns] = encoder.transform(X_test[non_numeric_columns])


We apply SMOTE to the training set only, so that the test set keeps its original class distribution.

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training set
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)

# Display the shapes of the oversampled datasets
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
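
To make the oversampling step less of a black box, the sketch below illustrates the interpolation idea behind SMOTE with plain NumPy. It is a simplified illustration, not imblearn's implementation; the function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))  # random minority sample
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)


minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points.shape)
```

Each synthetic point lies on the line segment between two existing minority samples, which is why all features must be numeric.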


Now we will verify the target distribution in the training set after applying SMOTE to confirm that the class imbalance has been addressed.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Convert y_train back to a pandas Series to use value_counts()
y_train_series = pd.Series(y_train)

# Check target distribution after oversampling
y_train_series.value_counts().plot(kind='bar', title='Train Set Target Distribution After SMOTE')
plt.show()


---

## Hyperparameter Optimization

Load the custom hyperparameter optimisation class from CodeInstitute.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score
import numpy as np
import pandas as pd

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = ClassificationPipeline(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
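
The `score_summary` method relies on the per-fold entries that `GridSearchCV` stores in `cv_results_`. A small sketch on synthetic data shows their layout (the names `toy_search` and `fold_scores` and the parameter grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=150, n_features=4, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(random_state=0),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="recall",
)
toy_search.fit(X_toy, y_toy)

# One array per fold, each holding one score per parameter combination.
fold_scores = np.column_stack(
    [toy_search.cv_results_[f"split{i}_test_score"] for i in range(5)]
)
print(fold_scores.shape)         # 3 candidates x 5 folds
print(fold_scores.mean(axis=1))  # agrees with cv_results_["mean_test_score"]
```

The class above computes its `min_score`, `max_score`, `mean_score`, and `std_score` columns from rows of exactly this kind.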



---

### Define Models and Parameters


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

In [None]:
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=0),
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "LogisticRegression": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}


Using the HyperparameterOptimizationSearch class to search for the best model using default parameters:


We use default hyperparameters to find the best algorithm, scored by recall, in line with Business Requirement 2.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_columns = X_train.select_dtypes(include=['object', 'category']).columns

if len(non_numeric_columns) > 0:
    encoder = OrdinalEncoder()
    X_train[non_numeric_columns] = encoder.fit_transform(X_train[non_numeric_columns])
    X_test[non_numeric_columns] = encoder.transform(X_test[non_numeric_columns])



In [None]:
from sklearn.preprocessing import LabelEncoder

# Convert target labels to numerical values
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)


In [None]:
from sklearn.metrics import make_scorer, recall_score

# Run the hyperparameter optimization
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, pos_label=1),  # Use numerical labels
           n_jobs=-1, cv=5)

# Summarize the results
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary
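
Recall on the positive class is the share of actual leavers that the model catches. A hand-rolled sketch of the calculation, assuming the `Attrition` values are "No"/"Yes" so that `LabelEncoder` maps them to 0/1 (the function name and toy labels are illustrative):

```python
def recall_for_label(y_true, y_pred, pos_label=1):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    false_neg = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if true_pos + false_neg == 0:
        return 0.0  # no positive examples; treated as 0 here by assumption
    return true_pos / (true_pos + false_neg)


# 1 = leaves, 0 = stays (toy labels).
actual = [1, 1, 1, 0, 0]
predicted = [1, 0, 1, 1, 0]
print(recall_for_label(actual, predicted))  # 2 of the 3 leavers are caught
```

False positives (the fourth employee) do not lower recall; they would lower precision instead.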


---

### Refining the Search with Specific Hyperparameters

Based on the results, we will select the top-performing models and refine the hyperparameter search:

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score

# Define the models to search
models_search = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(random_state=0),
}

params_search = {
    "AdaBoostClassifier": {
        "model__n_estimators": [50, 100, 200, 300],
        "model__learning_rate": [0.001, 0.01, 0.1, 1.0],
        "model__algorithm": ["SAMME", "SAMME.R"],
    },
    "LogisticRegression": {
        # The default lbfgs solver supports only "l2" and None;
        # "l1" and "elasticnet" would fail to fit and yield NaN scores.
        "model__penalty": ["l2", None],
        "model__C": [10, 2, 1.0, 0.5, 0.1],
        "model__tol": [1e-3, 1e-4, 1e-5],
    },
}

# Run the hyperparameter optimization
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, pos_label=1),  # Use numerical labels
           n_jobs=-1, cv=5)

extensive_grid_search_summary, extensive_grid_search_pipelines = search.score_summary(sort_by='mean_score')
print(extensive_grid_search_summary)
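
A grid search fits one model per parameter combination per fold, so the cost grows quickly. The sketch below counts the fits for the AdaBoost grid used above (the variable names are illustrative):

```python
from itertools import product

adaboost_grid = {
    "model__n_estimators": [50, 100, 200, 300],
    "model__learning_rate": [0.001, 0.01, 0.1, 1.0],
    "model__algorithm": ["SAMME", "SAMME.R"],
}

# Every combination of one value per parameter.
candidates = [
    dict(zip(adaboost_grid, values)) for values in product(*adaboost_grid.values())
]
n_folds = 5

print(len(candidates))            # 4 * 4 * 2 = 32 combinations
print(len(candidates) * n_folds)  # 160 model fits for AdaBoost alone
```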


---


Based on these results, we will use AdaBoostClassifier.

### Select and Save the Best Model

Identify and save the best model and its parameters:

In [None]:
best_model = extensive_grid_search_summary.iloc[0, 0]
best_parameters = extensive_grid_search_pipelines[best_model].best_params_

print("Best Model:", best_model)
print("Best Parameters:", best_parameters)


We identified the best model as AdaBoostClassifier and rebuild it with the best parameters found:

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Rebuild the AdaBoostClassifier with the best parameters
classification_pipeline = AdaBoostClassifier(
    algorithm='SAMME.R',
    learning_rate=1.0,
    n_estimators=300,
    random_state=0  # Keep random_state for reproducibility
)
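
The values above are typed in by hand. One way to avoid copying them is to derive the keyword arguments from the `best_params_` dictionary by removing the pipeline step prefix. A minimal sketch (the helper `strip_step_prefix` is hypothetical):

```python
def strip_step_prefix(best_params, step="model"):
    """Turn {'model__n_estimators': 300} into {'n_estimators': 300}."""
    prefix = f"{step}__"
    return {
        name[len(prefix):]: value
        for name, value in best_params.items()
        if name.startswith(prefix)
    }


# Example values copied from the cell above; in the notebook they would
# come from `best_parameters`.
example_best_params = {
    "model__algorithm": "SAMME.R",
    "model__learning_rate": 1.0,
    "model__n_estimators": 300,
}
model_kwargs = strip_step_prefix(example_best_params)
print(model_kwargs)
# Hypothetical usage: AdaBoostClassifier(random_state=0, **model_kwargs)
```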


We fit the model using the training data:

In [None]:
classification_pipeline.fit(X_train, y_train)

Once the model is fitted, we can save it for later use:

In [None]:
import joblib

# Save the fitted model to a file
joblib.dump(classification_pipeline, "best_classification_pipeline.pkl")


We evaluate the model right away:

In [None]:
from sklearn.metrics import classification_report

y_pred = classification_pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=['Stay', 'Leave']))


---

### Feature Importance


- Extracted feature importance from the final model (e.g., AdaBoostClassifier).

- Key features: **Employee Number**, **Monthly Income**, and **Age**.

- Lesser impact: **Education**, **Employee Count**.

- Visual representation helps in understanding the model's focus.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Check if the model has the `feature_importances_` attribute
if hasattr(classification_pipeline, 'feature_importances_'):
    df_feature_importance = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': classification_pipeline.feature_importances_
    }).sort_values(by='Importance', ascending=False)
    
    # List of best features based on importance
    best_features = df_feature_importance["Feature"].to_list()
    
    print(f"* These are the {len(best_features)} most important features in descending order.\n"
          f"* The model was trained using these features: {best_features}")
    
    # Plotting the feature importances
    df_feature_importance.plot(kind="bar", x="Feature", y="Importance")
    plt.title("Feature Importance in Employee Retention Analysis")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.show()
else:
    print("The selected model does not have a feature_importances_ attribute.")
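
Note that `best_features` as built above contains every column, only sorted by importance. If a subset is wanted, the ranking can be cut at a chosen size. A sketch with made-up importances (the helper `top_k_features` and the value of `k` are assumptions):

```python
import pandas as pd


def top_k_features(feature_names, importances, k=5):
    """Return the k feature names with the largest importance, best first."""
    ranked = pd.DataFrame(
        {"Feature": feature_names, "Importance": importances}
    ).sort_values(by="Importance", ascending=False)
    return ranked["Feature"].head(k).to_list()


# Made-up importances purely for illustration.
names = ["Age", "Education", "MonthlyIncome", "EmployeeCount", "DailyRate"]
scores = [0.20, 0.02, 0.35, 0.00, 0.15]
print(top_k_features(names, scores, k=3))
```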


---

### Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X, y, pipeline, label_map):
    prediction = pipeline.predict(X)

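    print('---  Confusion Matrix  ---')
    # The predictions are passed as y_true on purpose: this transposes the
    # matrix so that rows are predictions and columns are actual values.
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))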
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)

# Update the label_map to match the classes in the employee retention dataset
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=classification_pipeline,
                label_map=["Stayed", "Left"]
                )



To assess the performance of our classification model on employee retention, we performed evaluations on both the training and test sets. The key metrics include:

- **Confusion Matrix**: Shows the actual vs. predicted values.
- **Classification Report**: Provides precision, recall, and F1-score for the classes "Stayed" and "Left."

### Key Points:
- **Train Set**: Used to evaluate the model’s performance on the data it was trained on.
- **Test Set**: Used to assess how well the model generalizes to unseen data.
- **Metrics**: Recall and precision were measured for "Stayed" and "Left" categories to ensure the model meets business requirements.
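
In scikit-learn's `confusion_matrix`, rows follow the first argument and columns the second. The sketch below shows the standard orientation on toy labels, together with the two metrics this notebook focuses on (all values are made up):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 0 = stayed, 1 = left.
y_actual = [0, 0, 0, 1, 1, 1]
y_predicted = [0, 0, 1, 1, 1, 0]

# Rows follow y_true, columns follow y_pred.
cm = confusion_matrix(y_true=y_actual, y_pred=y_predicted)
print(cm)

recall_left = recall_score(y_actual, y_predicted, pos_label=1)
precision_stayed = precision_score(y_actual, y_predicted, pos_label=0)
print(recall_left, precision_stayed)
```

Here two of the three leavers are detected, and two of the three "stayed" predictions are correct.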


---

## Refitting the ML Pipeline

We refit the machine learning pipeline using the most important features identified from the feature importance analysis: EmployeeNumber, MonthlyIncome, Age, DailyRate, and YearsAtCompany. 

### Re-writing the ML Pipelines

To ensure optimal performance:
- **Data Cleaning and Feature Engineering Pipeline**: Focused on necessary preprocessing steps for the selected features.
- **Modeling Pipeline**: Included steps for scaling the features and applying the best model.

This refit aims to achieve a model that is both effective and efficient, focusing only on the most critical features for predicting employee retention.


In [None]:

print(X_train.columns)

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OrdinalEncoder

# Updated Data Cleaning and Feature Engineering Pipeline
def DataCleaningandFeatEngPipeline():
    pipeline = Pipeline([
        ("median_imputation", MeanMedianImputer(imputation_method="median", 
                                                variables=["MonthlyIncome", "Age", "DistanceFromHome"])),
        
        ("ordinal_encoding", OrdinalEncoder(encoding_method="arbitrary", 
                                            variables=["OverTime", "JobSatisfaction", "BusinessTravel"])),
    ])
    return pipeline


---

## Split Train and Test Sets Using Only Most Important Features


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_set_df.drop(["Attrition"], axis=1),  # Dropping the target variable
    train_set_df["Attrition"],  # Defining the target variable
    test_size=0.2,  # Allocating 20% of data to the test set
    random_state=0  # Ensuring reproducibility
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

Filter the data to include only the most important features:

In [None]:
X_train = X_train.filter(best_features)  # Filtering train set
X_test = X_test.filter(best_features)  # Filtering test set

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train.head(3)  # Display the first few rows of the training data


---

## Handle Target Imbalance

To address the imbalance in the target variable (Attrition), we'll first clean and engineer the data, then apply SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class.

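In [None]:
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt

# Clean and encode the data first, since SMOTE requires numeric input
data_cleaning_feat_eng_pipeline = DataCleaningandFeatEngPipeline()
X_train = data_cleaning_feat_eng_pipeline.fit_transform(X_train, y_train)
X_test = data_cleaning_feat_eng_pipeline.transform(X_test)

# Apply SMOTE for oversampling the minority class
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)

# Check the distribution after oversampling
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()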


---

### Cross-Validation

We start by defining the model (AdaBoostClassifier) and the hyperparameters that will be used in the cross-validation process.

In [None]:
models_search = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

We store the hyperparameter values in a dictionary:

In [None]:

# Best hyperparameters for AdaBoostClassifier
params_search = {
    "AdaBoostClassifier": {
        "model__algorithm": ["SAMME.R"],
        "model__learning_rate": [1.0],
        "model__n_estimators": [300],
    },
}


We use the HyperparameterOptimizationSearch class to run cross-validation on the defined model and parameters.

In [None]:
# Instantiate the search with the AdaBoostClassifier model and parameters defined above
quick_search = HyperparameterOptimizationSearch(models=models_search, params=params_search)


In [None]:
from sklearn.metrics import make_scorer, recall_score

# Execute cross-validation with the correct pos_label
quick_search.fit(X_train, y_train,
                 scoring=make_scorer(recall_score, pos_label='Yes'),  # Use 'Yes' as pos_label
                 n_jobs=-1, cv=5)


Check the results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary


Defining the best classification pipeline

In [None]:
best_params = grid_search_pipelines[best_model].best_params_

best_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", AdaBoostClassifier(**{
        'algorithm': best_params['model__algorithm'],
        'learning_rate': best_params['model__learning_rate'],
        'n_estimators': best_params['model__n_estimators'],
        'random_state': 0  # Ensure reproducibility
    }))
])

best_pipeline.fit(X_train, y_train)

classification_pipeline = best_pipeline


---

### Evaluation of Pipeline on Train and Test Sets

ML Business Case Metrics:

- Recall on "Yes" (Employee Attrition): 75%
- Precision on "No" (No Employee Attrition): 70%

In [None]:
# Evaluate the performance of the classification pipeline
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=classification_pipeline,
                label_map=["No Attrition", "Yes Attrition"]
                )


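Using the selected features, recall for "Yes Attrition" was 97% on the train set and 67% on the test set, and precision for "No Attrition" was 97% on the train set and 91% on the test set. Test-set precision therefore exceeds the 70% target, while test-set recall falls short of the 75% target; the gap between train and test recall suggests some overfitting.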
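
The comparison against the business case metrics can also be made explicit in code. A minimal sketch using the figures stated in this section (the dictionary keys and the helper `check_targets` are hypothetical):

```python
# Targets and test-set results as stated in this section.
targets = {"recall_yes": 0.75, "precision_no": 0.70}
test_results = {"recall_yes": 0.67, "precision_no": 0.91}


def check_targets(results, targets):
    """Map each metric name to True when the result meets or exceeds its target."""
    return {name: results[name] >= target for name, target in targets.items()}


print(check_targets(test_results, targets))
```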

---

### Push Files to Repo

The following files will be generated and saved:

- Train set
- Test set
- Data cleaning and feature engineering pipeline
- Modeling pipeline
- Feature importance plot

In [None]:
import joblib
import os

version = "v2"
file_path = f"outputs/ml_pipeline/classification_model/{version}"

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)
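
The `try`/`except` above mainly guards against the folder already existing. As an alternative sketch, `pathlib` can express the same intent without catching every exception (the helper name `ensure_output_dir` is illustrative):

```python
from pathlib import Path


def ensure_output_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Hypothetical usage: ensure_output_dir(file_path)
```

With this approach, other errors such as missing permissions still surface instead of being printed and ignored.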

### Save Train Set

Save the train set with variables already encoded (and after oversampling).

In [None]:
print(X_train.shape)
X_train.head(3)

X_train.to_csv(f"{file_path}/X_train.csv", index=False)
y_train.to_csv(f"{file_path}/y_train.csv", index=False)


### Save Test Set

Save the test set with variables already encoded.

In [None]:
print(X_test.shape)
X_test.head(3)

X_test.to_csv(f"{file_path}/X_test.csv", index=False)
y_test.to_csv(f"{file_path}/y_test.csv", index=False)


### Save ML Pipelines

Two pipelines will be saved: one for data cleaning and feature engineering, and another for modeling.

- When predicting live data, both pipelines will be required.


- When predicting on the train and test sets, only the modeling pipeline is required as the data has already been processed.

In [None]:
# Save the data cleaning and feature engineering pipeline
joblib.dump(value=data_cleaning_feat_eng_pipeline,
            filename=f"{file_path}/data_cleaning_and_feat_engineering_pipeline.pkl")

# Save the classification pipeline
joblib.dump(value=classification_pipeline,
            filename=f"{file_path}/classification_pipeline.pkl")
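
For live data, the two saved pipelines are applied in order: cleaning first, then the model. The sketch below shows the round trip with small stand-in objects saved to a temporary folder; the file names, the toy data, and the use of `SimpleImputer` and `LogisticRegression` are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Stand-ins for the two saved pipelines (the real ones are fitted above).
X_raw = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0], [10.0]])
y_raw = np.array([0, 0, 0, 1, 1, 1])
cleaning = SimpleImputer(strategy="median").fit(X_raw)
model = LogisticRegression().fit(cleaning.transform(X_raw), y_raw)

with tempfile.TemporaryDirectory() as tmp_dir:
    joblib.dump(cleaning, Path(tmp_dir) / "cleaning.pkl")
    joblib.dump(model, Path(tmp_dir) / "model.pkl")

    # Later, at prediction time: load both and apply them in order.
    loaded_cleaning = joblib.load(Path(tmp_dir) / "cleaning.pkl")
    loaded_model = joblib.load(Path(tmp_dir) / "model.pkl")

live_rows = np.array([[np.nan], [9.5]])
live_predictions = loaded_model.predict(loaded_cleaning.transform(live_rows))
print(live_predictions)
```

The reloaded objects give the same predictions as the in-memory ones, which is the property the saved files are meant to guarantee.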


### Save Feature Importance Plot

Save the plot of feature importances.

In [None]:
df_feature_importance.plot(kind="bar", x="Feature", y="Importance")
plt.savefig(f"{file_path}/features_importance.png", bbox_inches="tight")
plt.show()
