<a href="https://colab.research.google.com/github/briangeorg/Machine_Learning_Portfolio/blob/main/Classification_Model_Pipeline_and_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**What's Our Project?**

We're working cross-functionally with HR to evaluate employee data; the project is to predict which employees are most likely to leave the company, enabling the organization to implement a retention program.

**Getting Started**

Below we'll ingest our raw data from Kaggle and wrangle our data into a format that's usable across our intended model types.

In [2]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "WA_Fn-UseC_-HR-Employee-Attrition.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "pavansubhasht/ibm-hr-analytics-attrition-dataset",
  file_path
)
# View csv data below for a basic understanding:
print("First 5 records:", df.head())

Using Colab cache for faster access to the 'ibm-hr-analytics-attrition-dataset' dataset.
First 5 records:    Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical    

**Data Prep**

First, we'll do three things:

1. Remove variables we identified as non-meaningful in our Exploratory Data Analysis workbook (EmployeeCount, StandardHours, and Over18, each of which has zero variance)

2. Min-max scale our numeric factors

3. Recode (one-hot encode) our categorical variables
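
As a quick illustration of the scaling and encoding steps, here is a minimal sketch on a made-up three-row table. The `toy` DataFrame and its values are assumptions for demonstration only and are not drawn from the HR data set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, not taken from the HR dataset.
toy = pd.DataFrame({
    "Age": [20, 30, 40],
    "OverTime": ["Yes", "No", "Yes"],
})

# Min-max scaling maps each numeric column onto [0, 1]:
#   x_scaled = (x - min) / (max - min)
toy["Age"] = MinMaxScaler().fit_transform(toy[["Age"]]).ravel()

# One-hot encoding turns each category level into its own 0/1 column.
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded)
```

For a two-level variable such as `OverTime`, the `_No` and `_Yes` columns carry the same information, which is why one of them can be dropped afterwards.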

In [4]:
# Remove non-meaningful factors (due to zero variance):

columns_to_remove = ['EmployeeCount', 'StandardHours', 'Over18']
df_clean = df.drop(columns=columns_to_remove)

# Use sklearn's MinMaxScaler function to scale:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

numeric_cols = df_clean.select_dtypes(include=np.number).columns

scaler = MinMaxScaler()

df_clean[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])

# Use pandas' get_dummies function to one-hot encode our categorical variables:

df_encoded = pd.get_dummies(df_clean, dtype=int)

print(df_encoded.head())

# Finally: we have two factors where we one-hot encoded a binary string.
# These were our target (Attrition) and another predictor (OverTime); for these,
# we want to drop the 'No' variant of each, as it is redundant with the
# 'Yes' variant and should not be included.

columns_to_remove = ['Attrition_No', 'OverTime_No']
df_final = df_encoded.drop(columns=columns_to_remove)


        Age  DailyRate  DistanceFromHome  Education  EmployeeNumber  \
0  0.547619   0.715820          0.000000       0.25        0.000000   
1  0.738095   0.126700          0.250000       0.00        0.000484   
2  0.452381   0.909807          0.035714       0.25        0.001451   
3  0.357143   0.923407          0.071429       0.75        0.001935   
4  0.214286   0.350036          0.035714       0.00        0.002903   

   EnvironmentSatisfaction  HourlyRate  JobInvolvement  JobLevel  \
0                 0.333333    0.914286        0.666667      0.25   
1                 0.666667    0.442857        0.333333      0.25   
2                 1.000000    0.885714        0.333333      0.00   
3                 1.000000    0.371429        0.666667      0.00   
4                 0.000000    0.142857        0.666667      0.00   

   JobSatisfaction  ...  JobRole_Research Director  \
0         1.000000  ...                          0   
1         0.333333  ...                          0   
2 

**Splitting Our Data into Training and Test Sets**:

It's important to note here that, normally, we would also hold out a validation
set so that we could tune hyperparameters without touching our test data!

However, in this use case, we find ourselves with a (very) small data set. Splitting out a separate validation set would leave less data for training, and the tradeoff might be a substantial decrease in our models' predictive power. Instead, we will rely on k-fold cross-validation to estimate how well our models generalize to unseen data.
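
To make the cross-validation idea concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions rather than this notebook's pipeline: the data comes from `make_classification`, the class weights are arbitrary, and it uses `StratifiedKFold` (which keeps the class ratio similar in every fold) where the modeling cell uses plain `KFold`. The scaler sits inside the pipeline so that it is re-fit on the training portion of each fold only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the HR dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=42
)

# The scaler is part of the pipeline, so each fold fits it on that
# fold's training rows and only applies it to the held-out rows.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Five shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", fold_scores.mean())
```

Every row is used for testing exactly once and for training in the remaining folds, which is what makes this approach attractive when data is scarce.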

In [None]:
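# Next, we'll separate our predictors from our target so that we're
# ready to model. The training and test partitions themselves are
# generated fold by fold by the KFold splitter in the modeling cell.
# Assumption: this cell was left empty in the original notebook; the
# definitions below are inferred from how X and y are used downstream.

target_col = 'Attrition_Yes'
# Convert to NumPy arrays so that X[train_idx] selects rows by position:
X = df_final.drop(columns=[target_col]).to_numpy()
y = df_final[target_col].to_numpy()
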
**Model Training**:

Here we'll implement three contenders (well, two serious contenders and a benchmark model):

1. A logistic regression model: this serves as a benchmark, or baseline of comparison, for our other proposed models. We don't just want to know how well they perform, but how much better they are than a simple, linear decision boundary.

2. LightGBM (Light Gradient Boosting Machine): an implementation of gradient-boosted decision trees.

3. A deep neural network (DNN) built with TensorFlow.
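
When accuracy is the yardstick, it can also help to know the score a model would get by always predicting the most common class. The sketch below shows one way to compute that floor with scikit-learn's `DummyClassifier`. The labels are hypothetical and chosen only to illustrate an imbalanced target; this is an optional extra reference point, not one of the three contenders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = left the company, 0 = stayed.
y_toy = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
# Placeholder features; the "most_frequent" strategy ignores them.
X_toy = np.zeros((len(y_toy), 1))

# Always predict whichever class is most common in the training labels.
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_toy, y_toy)
floor_accuracy = accuracy_score(y_toy, majority.predict(X_toy))

print("Majority-class accuracy:", floor_accuracy)
```

With eight of ten labels equal to 0, the floor here is 0.8, so a model scoring near that value has learned little beyond the class balance.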

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score  # imported up front: every loop below uses it
import numpy as np

# ***Important Note: altering the KFold arguments below will affect whether
#    your CV results are reproducible (they decide how your CV partitions
#    are set)***
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Baseline/Benchmark Logistic Regression Model:

from sklearn.linear_model import LogisticRegression

logit_scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    logit_scores.append(accuracy_score(y_test, preds))

# Light GBM Model:

import lightgbm as lgb
from sklearn.metrics import accuracy_score

lightgbm_scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

    params = {
        'objective': 'binary',
        'metric': 'binary_error',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'verbose': -1
    }

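    # Early stopping is passed as a callback: the early_stopping_rounds and
    # verbose_eval keyword arguments were removed from lgb.train in LightGBM 4.0.
    # Note: stopping on the test fold lets that fold influence the tree count.
    model = lgb.train(
        params,
        train_data,
        valid_sets=[test_data],
        callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)],
    )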
    preds = model.predict(X_test)
    preds_binary = (preds > 0.5).astype(int)
    lightgbm_scores.append(accuracy_score(y_test, preds_binary))

# Deep Neural Network (DNN) using TensorFlow

import tensorflow as tf

dnn_scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    dnn_scores.append(accuracy)

# Finally, we'll compare the models' performance, focusing on accuracy:

print("LightGBM Average Accuracy:", np.mean(lightgbm_scores))
print("Logistic Regression Average Accuracy:", np.mean(logit_scores))
print("DNN Average Accuracy:", np.mean(dnn_scores))
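
A mean alone can hide how much the scores vary from fold to fold. As a minimal sketch, the helper below (a hypothetical `summarize_scores` function, not part of the notebook) reports the mean together with the sample standard deviation. The example scores are made up for illustration and are not results from the models above.

```python
import numpy as np

def summarize_scores(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

# Hypothetical per-fold accuracies, for illustration only.
example_scores = {
    "Model A": [0.80, 0.82, 0.78, 0.81, 0.79],
    "Model B": [0.85, 0.70, 0.90, 0.75, 0.80],
}

for name, scores in example_scores.items():
    mean, std = summarize_scores(scores)
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Both hypothetical models share the same mean, but the second varies far more across folds. The same helper could be applied to `logit_scores`, `lightgbm_scores`, and `dnn_scores`.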