## Problem Statement
Develop a predictive model using employee data to classify individuals as likely to stay or leave the company. This classification will assist in making informed decisions about employee retention strategies and workplace improvements.

## Overview
The dataset loaded below contains 1350 rows and 16 columns: 15 employee metrics plus the target. The data aims to reflect realistic scenarios in a corporate setting, covering both professional and personal employee metrics. Below is a brief overview of the feature columns (in the CSV file the names use underscores, for example Job_Satisfaction):

1. JobSatisfaction: Employee's job satisfaction level.

2. PerformanceRating: Performance rating given by the company.

3. YearsAtCompany: Total number of years the employee has been with the company.

4. WorkLifeBalance: Rating of how well the employee feels they balance work and personal life.

5. DistanceFromHome: Distance from the employee's home to the workplace.

6. MonthlyIncome: The monthly income of the employee.

7. EducationLevel: The highest level of education attained by the employee.

8. Age: The age of the employee.

9. NumCompaniesWorked: The number of companies the employee has worked at before joining the current company.

10. EmployeeRole: The role or position of the employee within the company.

11. AnnualBonus: Annual bonus received by the employee.

12. TrainingHours: Number of hours spent in training programs.

13. Department: Department in which the employee works.

14. AnnualBonus_Squared: Square of the annual bonus (a polynomial feature).

15. AnnualBonus_TrainingHours_Interaction: Interaction term between annual bonus and training hours.


The target column is Employee_Turnover, which is not among the 15 features listed above.

It is a binary outcome variable, with '1' indicating the employee is likely to leave (turnover) and '0' indicating the employee is likely to stay. Understanding the factors that lead to turnover is crucial for the company to develop effective employee retention strategies and improve overall workplace satisfaction.

Let us begin by importing the necessary libraries and reading the data.

In [1]:
# Necessary library imports for data processing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, f1_score

Now load the data.

In [2]:
# Load the dataset
df = pd.read_csv('modified_employee_turnover.csv')
df.head()

Unnamed: 0,Job_Satisfaction,Performance_Rating,Years_At_Company,Work_Life_Balance,Distance_From_Home,Monthly_Income,Education_Level,Age,Num_Companies_Worked,Employee_Role,Annual_Bonus,Training_Hours,Department,Annual_Bonus_Squared,Annual_Bonus_Training_Hours_Interaction,Employee_Turnover
0,0.562326,0.141129,0.123989,0.347583,0.330353,0.328853,0.600933,0.31599,0.768736,0.090671,0.324786,0.669193,0.602932,0.105486,0.217344,0
1,0.017041,0.559047,0.511203,0.793908,0.42355,0.55345,0.742009,0.897146,0.380035,0.601633,0.694611,0.043271,0.800761,0.482484,0.030056,0
2,0.774699,0.604371,0.798174,0.2605,0.804034,0.1318,0.775178,0.830947,0.218726,0.972936,0.153476,0.701336,0.705275,0.023555,0.107638,1
3,0.628174,0.385249,0.230104,0.516809,0.272248,0.589249,0.482409,0.090507,0.402746,0.132842,0.305973,0.549688,0.600531,0.09362,0.16819,0
4,0.799183,0.199967,0.839029,0.247927,0.341934,0.076818,0.055356,0.68086,0.923341,0.493017,0.844094,0.793751,0.664679,0.712494,0.67,0


In [3]:
df.shape

(1350, 16)

This shows that the dataset has 1350 observations and 16 columns: 15 features plus the target.

## Let's separate the features (X) and the target variable (y)

The target variable 'Employee_Turnover' is what the model aims to predict, and it is separated from the input features to ensure the model is trained correctly.

In [4]:
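# List the columns so the target can be separated from the features
# (the target must not be among the inputs, otherwise the label leaks into the model)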
# Write your code below
# your code here
df.columns

Index(['Job_Satisfaction', 'Performance_Rating', 'Years_At_Company',
       'Work_Life_Balance', 'Distance_From_Home', 'Monthly_Income',
       'Education_Level', 'Age', 'Num_Companies_Worked', 'Employee_Role',
       'Annual_Bonus', 'Training_Hours', 'Department', 'Annual_Bonus_Squared',
       'Annual_Bonus_Training_Hours_Interaction', 'Employee_Turnover'],
      dtype='object')

In [5]:
X = df[['Job_Satisfaction', 'Performance_Rating', 'Years_At_Company',
       'Work_Life_Balance', 'Distance_From_Home', 'Monthly_Income',
       'Education_Level', 'Age', 'Num_Companies_Worked', 'Employee_Role',
       'Annual_Bonus', 'Training_Hours', 'Department', 'Annual_Bonus_Squared',
       'Annual_Bonus_Training_Hours_Interaction']]
y = df['Employee_Turnover']

In [6]:
X.shape, y.shape

((1350, 15), (1350,))
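An equivalent way to build X without listing every feature is to drop the target column. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy_df = pd.DataFrame({
    "Job_Satisfaction": [0.56, 0.02, 0.77, 0.63],
    "Annual_Bonus": [0.32, 0.69, 0.15, 0.31],
    "Employee_Turnover": [0, 0, 1, 0],
})

target = "Employee_Turnover"
X_toy = toy_df.drop(columns=[target])  # every column except the target
y_toy = toy_df[target]                 # the target as a Series

print(X_toy.shape, y_toy.shape)
```

Applied to the notebook's `df`, the same two lines should give the (1350, 15) and (1350,) shapes shown above.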

# Splitting Dataset into Train and Test Sets


Splitting the data is a standard step before training a model. It sets aside a separate dataset for evaluation, which helps in assessing how well the model will perform on unseen data. Split the data in a 70:30 ratio and name the variables X_train, X_test, y_train, y_test.

Use random_state=42.


In [7]:
# Splitting the dataset into training and testing sets
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
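If the two classes are imbalanced, passing `stratify=y` keeps the class proportions the same in both splits. The notebook does not do this. The sketch below uses synthetic data from `make_classification` (an assumption, not the employee dataset) to illustrate the effect.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (roughly 80% class 0, 20% class 1)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, weights=[0.8, 0.2], random_state=42
)

# stratify=y_demo preserves the class ratio in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Whether stratification matters for the employee data depends on how balanced Employee_Turnover is, which `y.value_counts(normalize=True)` would show.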

# Apply Logistic Regression without Regularization 


With the data split, it's time to fit our model in 'logistic_no_reg'. We first evaluate the base model, which has no regularization. Set max_iter to 10000 and n_jobs to -1.

In [8]:
# Logistic regression without regularization
# Write your code below
# your code here
# penalty=None is the spelling for scikit-learn >= 1.2; older versions expect penalty="none"
logistic_no_reg = LogisticRegression(penalty=None, max_iter=10000, n_jobs=-1)

In [9]:
logistic_no_reg.fit(X_train,y_train)

In [10]:
# Evaluate the model without regularization for train data
y_train_pred_no_reg = logistic_no_reg.predict(X_train)
f1_no_reg_train = f1_score(y_train, y_train_pred_no_reg)
print("F1 Score without Regularization:", f1_no_reg_train)

F1 Score without Regularization: 0.8667366211962225


In [11]:
# Evaluate the model without regularization for test data
y_test_pred_no_reg = logistic_no_reg.predict(X_test)
f1_no_reg_test = f1_score(y_test, y_test_pred_no_reg)
print("F1 Score without Regularization:", f1_no_reg_test)

F1 Score without Regularization: 0.8541666666666667


In [12]:
assert f1_no_reg_train > 0.86, "F1 score for train data does not match the expected value"
assert f1_no_reg_test > 0.84, "F1 score for test data does not match the expected value"


Let's move forward and apply L1 regularization for automatic feature selection.

# Apply Logistic Regression with L1 Regularization 

It's time to fit our model in 'logistic_l1_cv'.

In [13]:
# Defining a range for Cs
Cs = np.linspace(0.001,10,20)

In [14]:
# Logistic regression with L1 regularization using cross-validation to find the best C
# Write your code below
# your code here
logistic_l1_cv = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear', random_state=42)

In [15]:
logistic_l1_cv.fit(X_train,y_train.values.ravel())

### Let's now examine the Regularization Strengths

In [16]:
logistic_l1_cv.Cs_

array([1.00000000e-03, 5.27263158e-01, 1.05352632e+00, 1.57978947e+00,
       2.10605263e+00, 2.63231579e+00, 3.15857895e+00, 3.68484211e+00,
       4.21110526e+00, 4.73736842e+00, 5.26363158e+00, 5.78989474e+00,
       6.31615789e+00, 6.84242105e+00, 7.36868421e+00, 7.89494737e+00,
       8.42121053e+00, 8.94747368e+00, 9.47373684e+00, 1.00000000e+01])

In [17]:
best_C = logistic_l1_cv.C_
print(f"The best Cs value is: {best_C}")

The best Cs value is: [1.57978947]
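The selected `C_` is the grid value with the best mean cross-validated score (`LogisticRegressionCV` scores by accuracy unless `scoring` is set). The sketch below reproduces that selection by hand with `cross_val_score`. It runs on synthetic data, and the names `Cs_demo` and `mean_scores` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the employee data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=5,
    n_clusters_per_class=1, random_state=42
)

Cs_demo = np.linspace(0.001, 10, 20)
mean_scores = []
for C in Cs_demo:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=42)
    # Mean accuracy over 5 folds for this value of C
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

best_idx = int(np.argmax(mean_scores))
print("best C:", Cs_demo[best_idx], "mean CV accuracy:", mean_scores[best_idx])
```

Plotting `mean_scores` against `Cs_demo` gives the curve whose maximum corresponds to the chosen C. Remember that C is the inverse of the regularization strength, so smaller values regularize more.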


## The best C is approximately 1.58

Next, apply logistic regression with L1 regularization in 'logistic_l1_cv' using cross-validation (cv=5) over the same grid of C values, which selects C ≈ 1.58 as shown above.
The penalty remains L1 and the solver remains liblinear; max_iter=10000 can also be passed if the solver reports convergence warnings.

In [18]:
# Logistic regression with L1 regularization using cross-validation
# Write your code below
# your code here
logistic_l1_cv = LogisticRegressionCV(Cs=Cs,cv=5, penalty='l1', solver='liblinear', random_state=42)

In [19]:
logistic_l1_cv.fit(X_train,y_train.values.ravel())

In [20]:
# Evaluate the model with L1 regularization on train data
y_train_pred_l1 = logistic_l1_cv.predict(X_train)
f1_l1_train = f1_score(y_train, y_train_pred_l1)
print("F1 Score with L1 Regularization on train data :", f1_l1_train)

F1 Score with L1 Regularization on train data : 0.8718487394957983


In [21]:
# Evaluate the model with L1 regularization on test data
y_test_pred_l1 = logistic_l1_cv.predict(X_test)
f1_l1_test = f1_score(y_test, y_test_pred_l1)
print("F1 Score with L1 Regularization on test data :", f1_l1_test)

F1 Score with L1 Regularization on test data : 0.8511749347258485


In [22]:
assert f1_l1_train > 0.87, "F1 score for train data does not match the expected value"
assert f1_l1_test > 0.85, "F1 score for test data does not match the expected value"

An F1 score of about 0.872 on the training set and 0.851 on the test set with L1 regularization suggests that the logistic regression model performs well.

The model fits the training data well and generalizes adequately to new, unseen data. These scores are close to those of the unregularized model (0.867 on train, 0.854 on test), so on this dataset the main benefit of L1 regularization is likely its ability to drop irrelevant features rather than a gain in test performance.
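To see the feature-selection effect of L1 directly, one can count how many coefficients are exactly zero. The following is a sketch on synthetic data in which only 3 of 15 features carry signal; the data and the `zero_counts` name are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 15 features, of which only 3 are informative; the rest are noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

zero_counts = {}
for C in (0.01, 0.1, 1.0):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=42)
    clf.fit(X_demo, y_demo)
    # L1 drives some coefficients to exactly zero, which removes those features
    zero_counts[C] = int(np.sum(clf.coef_ == 0))
    print(f"C={C}: {zero_counts[C]} of {clf.coef_.size} coefficients are exactly zero")
```

On the fitted notebook model, the same check would be `np.sum(logistic_l1_cv.coef_ == 0)`, and the surviving features can be read off `X.columns`.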

# Apply Logistic Regression with L2 Regularization on train data

Next, apply logistic regression with L2 regularization in 'logistic_l2_cv' using cross-validation (cv=5) over the same grid of C values.

In [23]:
# Defining a range for Cs
Cs = np.linspace(0.001,10,20)

In [24]:
# Logistic regression with L2 regularization using cross-validation to find the best C
# Write your code below
logistic_l2_cv = LogisticRegressionCV(Cs=Cs, 
                                      penalty='l2',
                                      solver= 'liblinear',
                                      cv=5, 
                                      max_iter=10000,
                                      n_jobs=-1)
logistic_l2_cv.fit(X_train, y_train)

In [25]:
logistic_l2_cv.Cs_

array([1.00000000e-03, 5.27263158e-01, 1.05352632e+00, 1.57978947e+00,
       2.10605263e+00, 2.63231579e+00, 3.15857895e+00, 3.68484211e+00,
       4.21110526e+00, 4.73736842e+00, 5.26363158e+00, 5.78989474e+00,
       6.31615789e+00, 6.84242105e+00, 7.36868421e+00, 7.89494737e+00,
       8.42121053e+00, 8.94747368e+00, 9.47373684e+00, 1.00000000e+01])

In [26]:
logistic_l2_cv.C_

array([3.15857895])

## The best C is approximately 3.16

Apply logistic regression with L2 regularization in 'logistic_l2_cv' using cross-validation over the same grid, which selects C ≈ 3.16 as shown above.

In [27]:
# Logistic regression with L2 regularization using cross-validation to find the best C
#Write your code below
# your code here
logistic_l2_cv = LogisticRegressionCV(Cs=Cs, penalty='l2', solver= 'liblinear', cv=5, max_iter=10000, n_jobs=-1)

In [28]:
logistic_l2_cv.fit(X_train,y_train)

In [29]:
# Evaluate the model with L2 regularization on train data
y_train_pred_l2 = logistic_l2_cv.predict(X_train)
f1_l2_train = f1_score(y_train, y_train_pred_l2)
print("F1 Score with L2 Regularization:", f1_l2_train)

F1 Score with L2 Regularization: 0.8654848800834203


In [30]:
# Evaluate the model with L2 regularization on test data
y_test_pred_l2 = logistic_l2_cv.predict(X_test)
f1_l2_test = f1_score(y_test, y_test_pred_l2)
print("F1 Score with L2 Regularization:", f1_l2_test)

F1 Score with L2 Regularization: 0.8475452196382429


In [31]:
assert f1_l2_train > 0.86, "F1 score for train data does not match the expected value"
assert f1_l2_test > 0.84, "F1 score for test data does not match the expected value"

An F1 score of about 0.865 on the training set and 0.848 on the test set with L2 regularization indicates that the logistic regression model performs well both on the data it was trained on and on new, unseen data. The scores suggest that the model maintains a balance between precision and recall. They are close to the unregularized scores (0.867 and 0.854), so L2 regularization changes generalization very little on this dataset.
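The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A small worked check on made-up labels (the `y_true_demo` and `y_pred_demo` lists are illustrative) shows that the formula agrees with `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 3 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 3 / 4
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual, f1_score(y_true_demo, y_pred_demo))
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F1 requires both precision and recall to be high.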

## ElasticNet Regularization

Apply logistic regression with ElasticNet regularization in 'logistic_en_cv' using cross-validation.

In [32]:
# Defining a range for Cs
Cs = np.linspace(0.001,10,20)

In [33]:
# Logistic regression with ElasticNet regularization using cross-validation
# Write your code below
# your code here
logistic_en_cv = LogisticRegressionCV(Cs=Cs, penalty='elasticnet', solver='saga', cv=5, max_iter=10000, n_jobs=-1, 
                                      l1_ratios= [0.0001, 0.001, 0.01, 0.05, 0.1, 0.4, 0.5, 0.7, 1])

In [34]:
logistic_en_cv.fit(X_train, y_train)

In [35]:
# Evaluate the model

y_train_pred_elastic = logistic_en_cv.predict(X_train)
f1_elastic_train = f1_score(y_train, y_train_pred_elastic)
print("F1 Score with Elastic Net Regularization on Train Set:", f1_elastic_train)

F1 Score with Elastic Net Regularization on Train Set: 0.8724973656480506


In [36]:
y_test_pred_elastic = logistic_en_cv.predict(X_test)
f1_elastic_test = f1_score(y_test, y_test_pred_elastic)
print("F1 Score with Elastic Net Regularization on Test Set:", f1_elastic_test)

F1 Score with Elastic Net Regularization on Test Set: 0.8443271767810026


In [37]:
assert f1_elastic_train > 0.87, "F1 score for train data does not match the expected value"
assert f1_elastic_test > 0.84, "F1 score for test data does not match the expected value"
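With ElasticNet, cross-validation selects both C and the L1/L2 mix. The sketch below shows how the chosen pair could be read back from the fitted estimator. It uses synthetic data and a deliberately small grid (`Cs_demo` and `l1_ratios_demo` are assumptions) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the employee data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=42
)

Cs_demo = [0.1, 1.0, 10.0]
l1_ratios_demo = [0.1, 0.5, 0.9]  # 0 is pure L2, 1 is pure L1

en_cv = LogisticRegressionCV(
    Cs=Cs_demo, penalty="elasticnet", solver="saga",
    l1_ratios=l1_ratios_demo, cv=5, max_iter=5000, random_state=42
)
en_cv.fit(X_demo, y_demo)

# np.ravel handles both scalar and one-element array attributes
best_C = float(np.ravel(en_cv.C_)[0])
best_l1_ratio = float(np.ravel(en_cv.l1_ratio_)[0])
print("selected C:", best_C, "selected l1_ratio:", best_l1_ratio)
```

For the notebook model, `logistic_en_cv.C_` and `logistic_en_cv.l1_ratio_` hold the corresponding values. An `l1_ratio_` near 0 would indicate behavior close to L2, and one near 1 behavior close to L1.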

## What This Means for the Business

In a business context, choosing the right model depends on the specific needs and constraints of the business problem. For instance:

1. If feature interpretability is important (i.e., understanding which features are driving the predictions), L1 regularization might be preferred, even though its train-test gap in this run (about 0.021) is slightly larger than that of the unregularized model (about 0.013).

2. If the business needs a more generalized model that is robust to various types of data, the model with L2 regularization or the one without any regularization might be more appropriate.

3. If there's a need for a balance between feature selection and model complexity, Elastic Net could be the way to go.

In all cases, the relatively high F1 scores indicate that logistic regression is a competent approach for the classification task at hand, capable of providing reliable insights for informed decision-making in the business. Regular monitoring and validation on new data are recommended to ensure continued model accuracy and relevance.
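To compare the four models side by side, the F1 scores printed in the cells above can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals; the `results` frame and the `gap` column are illustrative names.

```python
import pandas as pd

# F1 scores copied from the cell outputs above, rounded to four decimals
results = pd.DataFrame(
    {
        "train_f1": [0.8667, 0.8718, 0.8655, 0.8725],
        "test_f1": [0.8542, 0.8512, 0.8475, 0.8443],
    },
    index=["No regularization", "L1", "L2", "ElasticNet"],
)

# A larger train-test gap hints at more overfitting
results["gap"] = results["train_f1"] - results["test_f1"]
print(results.sort_values("test_f1", ascending=False))
```

The test scores differ by only about 0.01 across models. With a 405-row test set, the ranking may well change under a different random split, so it is safer to treat the four models as roughly equivalent in accuracy and to choose among them on interpretability.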