<a href="https://colab.research.google.com/github/YuvarajCU/Employee-Attrition-Prediction/blob/main/Employee_Attrition_Prediciton_Using_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Employee Attrition Prediction:**
In the current era, companies are rapidly growing and seeking highly experienced professionals to meet their demands. Such experienced individuals are considered valuable assets to the company, and losing them can be costly. Companies may try to retain these employees by offering them better compensation, or they may choose to hire new employees altogether. Accurately predicting employee turnover can save companies significant amounts of money and time. Furthermore, it can help management control project pipelines more effectively, enabling them to manage their workforce in a flexible manner.

# What is Attrition?

Attrition occurs when an employee leaves the company, either voluntarily or involuntarily. The attrition rate is the number of employees who left during a given period divided by the average number of employees over the same period, expressed as a percentage.



```
# Attrition Rate = ((Number of employees who left during a given time period) / (Average total number of employees during the same time period)) x 100
```
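As a quick illustration of the formula above, here is a minimal sketch in Python. The function name `attrition_rate` and the headcount figures are hypothetical and are not taken from the dataset.

```python
def attrition_rate(employees_left: int, avg_headcount: float) -> float:
    """Return the attrition rate in percent for a single time period."""
    if avg_headcount <= 0:
        raise ValueError("avg_headcount must be positive")
    return 100 * employees_left / avg_headcount


# Hypothetical figures: 30 leavers, headcount of 190 at the start
# and 210 at the end of the period.
avg_headcount = (190 + 210) / 2
rate = attrition_rate(30, avg_headcount)
print(f"Attrition rate: {rate:.1f}%")  # Attrition rate: 15.0%
```

Here the average headcount is approximated as the mean of the opening and closing headcount, which is one common convention rather than the only one.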

The ideal employee attrition rate should be below 10%, while a rate exceeding 20% is concerning for any company. High attrition rates may be due to various reasons, such as poor management, lack of recognition, toxic work environment, and limited career growth opportunities. As we delve deeper into the data, we may discover additional factors contributing to attrition. In this blog, we aim to build an attrition prediction model that can forecast employees' likelihood to leave the organization in the future. Additionally, we will provide insights and feedback to the HR and talent acquisition departments, which can assist in reducing attrition rates in some areas. Let's begin by analyzing the data.



# Data Analysis

Our objective is to investigate the interesting trends that result in employee turnover. Once the analysis is complete, we aim to build a machine learning model that predicts the likelihood of employees leaving the company.

In [None]:
import pandas as pd
pd.set_option("display.max_columns", 100)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn import datasets

import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

In [None]:
# To upload the data from google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path='/content/drive/MyDrive/Hero Vired/Employee Atrrition Prediction/Dataset/general_data.csv'
df=pd.read_csv(path)
df1=pd.read_csv('/content/drive/MyDrive/Hero Vired/Employee Atrrition Prediction/Dataset/employee_survey_data.csv')
df2=pd.read_csv('/content/drive/MyDrive/Hero Vired/Employee Atrrition Prediction/Dataset/manager_survey_data.csv')
df = pd.merge(df, df1, on='EmployeeID', how='left')
df = pd.merge(df, df2, on='EmployeeID', how='left')

## Data Health Review

In [None]:
df.sample(5)

In [None]:
df.info()

In [None]:
(df.isnull().mean()*100).sort_values(ascending = False)

In [None]:
df = df.fillna(df.select_dtypes(include='number').mean())

In [None]:
(df.isnull().mean()*100).sort_values(ascending = False)

In [None]:
df['NumCompaniesWorked'] = df['NumCompaniesWorked'].astype(int)
df['TotalWorkingYears'] = df['TotalWorkingYears'].astype(int)
df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].astype(int)
df['JobSatisfaction'] = df['JobSatisfaction'].astype(int)
df['WorkLifeBalance'] = df['WorkLifeBalance'].astype(int)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.drop(columns=['EmployeeCount','EmployeeID','Over18','StandardHours'],inplace=True)

In [None]:
col = df.columns
for i in col:
  print(i, 'percent :' , (len(df[i].unique())/len(df[i])) * 100, df[i].nunique())

In [None]:
df.sample(5)

Here we ran some checks on the data and made the following changes to the dataset:
1. A few variables, such as the survey ratings, have missing values. These are replaced with the mean of the corresponding column.
2. A few float variables are converted to int for ease of analysis. Note that this cast truncates any fractional imputed values.
3. We dropped the variables that have only one class ('EmployeeCount', 'Over18', 'StandardHours') or extremely high cardinality ('EmployeeID').
4. The 'Attrition' feature is our dependent variable, and the remaining features are independent variables.
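The imputation cell above fills missing numeric values with the column mean, and a later cell casts some of those columns to int. The sketch below uses made-up ratings (not rows from the dataset) to show what that cast does to an imputed value. The median is shown only as one possible alternative for ordinal ratings, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy survey ratings on a 1-4 scale with one missing answer (hypothetical values).
ratings = pd.Series([1.0, 4.0, 4.0, np.nan, 4.0], name="JobSatisfaction")

# Mean imputation: the mean of the observed values is 3.25,
# and astype(int) truncates it to 3.
mean_filled = ratings.fillna(ratings.mean()).astype(int)

# Median imputation: the median of the observed values is 4.0,
# which is already a valid rating.
median_filled = ratings.fillna(ratings.median()).astype(int)

print(mean_filled.tolist())
print(median_filled.tolist())
```

Either way, the imputed value is a guess, so it is worth checking how many rows are affected before choosing a strategy.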

Let's visualize the histograms:

## Histograms

In [None]:
df.hist(figsize=(15,15))
plt.tight_layout()
plt.show()

*   Most distributions are right-skewed (Monthly Income, Total Working Years, Years at Company, Distance From Home, etc.).
*   They are also heavy-tailed (the tails are not exponentially bounded).
*   The Age feature is slightly right-skewed, and most employees are between 25 and 40 years old.
*   These features are naturally skewed, and this may not affect the prediction.
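Skewness can also be checked numerically rather than only by eye. The sketch below uses a small made-up income series (not the dataset's `MonthlyIncome` column): a positive value from `Series.skew()` and a mean well above the median both point to a right-skewed distribution.

```python
import pandas as pd

# Hypothetical incomes: most values are modest, a few are very large.
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 120, 200])

skewness = income.skew()  # sample skewness; positive means right-skewed
print("skewness:", round(skewness, 2))
print("mean:", income.mean(), "median:", income.median())
```

On the real data, `df.skew(numeric_only=True)` would give the same measure for every numeric column at once.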










## Bivariate Analysis

### Personal data vs Attrition

In [None]:
sns.kdeplot(df.loc[df['Attrition']=='No','Age'],label='Active Employee')
sns.kdeplot(df.loc[df['Attrition']=='Yes','Age'],label='Ex-Employee')
plt.legend()
plt.show()

sns.countplot(x='Gender', hue='Attrition', data=df)
plt.show()

plt.figure(figsize=(10, 8))
sns.countplot(x='DistanceFromHome', hue='Attrition', data=df)
plt.show()

sns.countplot(x='MaritalStatus', hue='Attrition',data=df)
plt.show()

### Survey data vs Attrition

In [None]:
# count plots Survey data vs Attrition

sns.countplot(x='EnvironmentSatisfaction', hue='Attrition', data=df)
plt.show()

sns.countplot(x='JobSatisfaction', hue='Attrition', data=df)
plt.show()

sns.countplot(x='WorkLifeBalance', hue='Attrition', data=df)
plt.show()

df['OverallEmployeeRating']=df[['EnvironmentSatisfaction','JobSatisfaction','WorkLifeBalance']].mean(axis=1).round()
sns.countplot(x='OverallEmployeeRating', hue='Attrition', data=df)
plt.show()

sns.countplot(x='JobInvolvement', hue='Attrition', data=df)
plt.show()

sns.countplot(x='PerformanceRating', hue='Attrition', data=df)
plt.show()

df['OverallManagerRating']=df[['JobInvolvement','PerformanceRating']].mean(axis=1).round()
sns.countplot(x='OverallManagerRating', hue='Attrition', data=df)
plt.show()



### Career info vs Attrition

In [None]:
sns.countplot(x='Department', hue='Attrition',data=df)
plt.show()

sns.countplot(x='Education', hue='Attrition', data=df)
plt.show()

plt.figure(figsize=(10, 8))
sns.countplot(x='EducationField', hue='Attrition',data=df)
plt.show()

sns.countplot(x='JobLevel', hue='Attrition',data=df)
plt.show()

ax=sns.countplot(x='JobRole', hue='Attrition',data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()

sns.countplot(x="BusinessTravel", hue="Attrition", data=df)
plt.show()

sns.violinplot(x='MonthlyIncome',y='Attrition',data=df)
plt.show()

sns.countplot(x='NumCompaniesWorked', hue='Attrition',data=df)
plt.show()

sns.countplot(x='PercentSalaryHike', hue='Attrition', data=df)
plt.show()

plt.figure(figsize=(10, 8))
sns.countplot(x='TotalWorkingYears', hue='Attrition', data=df)
plt.show()

sns.countplot(x='TrainingTimesLastYear', hue='Attrition', data=df)
plt.show()

plt.figure(figsize=(10, 8))
sns.countplot(x='YearsAtCompany', hue='Attrition', data=df)
plt.show()

sns.countplot(x='YearsSinceLastPromotion', hue='Attrition', data=df)
plt.show()

sns.countplot(x='YearsWithCurrManager', hue='Attrition', data=df)
plt.show()

sns.countplot(x='StockOptionLevel', hue='Attrition', data=df)
plt.show()

## Trivariate Analysis

### Analysis on Job Role and Monthly Income

In [None]:
plt.figure(figsize=(10, 8))
ax=sns.boxplot(y=df["MonthlyIncome"],x=df['JobRole'],hue=df["Attrition"])
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)

plt.grid(True, alpha=1)
plt.tight_layout()
plt.show()

In [None]:
sns.catplot(x="Gender", hue="Attrition", col="MaritalStatus",
            data=df, kind="count", height=4, aspect=.7)

## Summary on the Data Analysis:

1. Ex-employees have an average age of 33.6 years, while current employees average 37.5 years.
2. Younger employees are more likely to leave the company, and the education and marital status plots appear to support this.
3. Single employees have left the company more often than married or divorced employees.
4. MonthlyIncome doesn't seem to have any impact on the attrition rate, although it may contribute negatively for employees in senior roles.
5. Lower stock option levels show comparatively higher attrition.
6. The surveys do not give a clear picture. In percentage terms, lower ratings go with a higher attrition rate within a class, but we can't rely on this analysis alone.
7. Delays in promotion reduce employees' interest.
8. The number of employees in a particular domain and job role is directly proportional to attrition, probably due to the competitive population.
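Several of the points above compare attrition in percentage terms rather than raw counts. A minimal sketch of how such per-class rates could be computed is shown below. The toy frame and its values are hypothetical, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy data with hypothetical values; only the column names follow the dataset.
toy = pd.DataFrame({
    "MaritalStatus": ["Single"] * 4 + ["Married"] * 5 + ["Divorced"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No",
                  "Yes", "No", "No", "No", "No",
                  "Yes", "No", "No", "No"],
})

# Share of leavers within each class, in percent.
left = toy["Attrition"].eq("Yes")
rate_by_status = (left.groupby(toy["MaritalStatus"]).mean() * 100).round(1)
print(rate_by_status.sort_values(ascending=False))
```

Comparing rates instead of counts keeps large classes from dominating the picture, which is the main weakness of the count plots.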

# Data Cleaning

Since every categorical variable has fewer than 10 unique values, we map them to integer codes manually so that we do not produce additional features and burden the model.
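As a small sketch of this kind of manual mapping, the example below encodes a toy series with `Series.map` and compares the feature count with one-hot encoding via `pd.get_dummies`. The `"Unknown"` label is a hypothetical value added only to show that unmapped categories become `NaN`.

```python
import pandas as pd

# Toy series; "Unknown" is a hypothetical label that is absent from the mapping.
travel = pd.Series(["Travel_Rarely", "Non-Travel", "Travel_Frequently", "Unknown"])
codes = {"Travel_Rarely": 0, "Travel_Frequently": 1, "Non-Travel": 2}

# Manual mapping keeps a single column; labels missing from the dict become NaN.
encoded = travel.map(codes)
print(encoded.tolist())
print("unmapped rows:", int(encoded.isna().sum()))

# One-hot encoding would instead create one column per category.
one_hot = pd.get_dummies(travel.iloc[:3])
print("one-hot columns:", one_hot.shape[1])
```

One caveat of integer codes is that a linear model treats them as ordered numbers, so the choice between the two encodings is a trade-off rather than a clear win.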

## Attrition

In [None]:
map_dict = {'Yes': 1, 'No': 0}
df['Attrition_New'] = df['Attrition'].map(map_dict)
df['Attrition_New'].unique()

## Business Travel

In [None]:
BusinessTravel_dict = df["BusinessTravel"].value_counts()
print(BusinessTravel_dict)

In [None]:
BusinessTravel_dict_new = {
    'Travel_Rarely':     0,
    'Travel_Frequently': 1,
    'Non-Travel':        2,
}
print(BusinessTravel_dict_new)

In [None]:
def BusinessTravel(x):
    if str(x) in BusinessTravel_dict_new.keys():
        return BusinessTravel_dict_new[str(x)]
df['New BusinessTravel'] = df["BusinessTravel"].apply(BusinessTravel)
df.sample(5)

## Department

In [None]:
Department_dict = df["Department"].value_counts()
print(Department_dict)

In [None]:
Department_dict_new = {
    'Research & Development': 0,
    'Sales':                  1,
    'Human Resources':        2,
}
print(Department_dict_new)

In [None]:
def Department(x):
    if str(x) in Department_dict_new.keys():
        return Department_dict_new[str(x)]
df['New Department'] = df["Department"].apply(Department)
df.sample(5)

## Education Field

In [None]:
EducationField_dict = df["EducationField"].value_counts()
print(EducationField_dict)

In [None]:
EducationField_dict_new = {
    'Life Sciences':    0,
    'Medical':          1,
    'Marketing':        2,
    'Technical Degree': 3,
    'Other' :           4,
    'Human Resources':  5

}
print(EducationField_dict_new)

In [None]:
def EducationField(x):
    if str(x) in EducationField_dict_new.keys():
        return EducationField_dict_new[str(x)]
df['New EducationField'] = df["EducationField"].apply(EducationField)
df.sample(5)

## Gender

In [None]:
map_dict = {'Male': 0, 'Female': 1}
df['Gender_new'] = df['Gender'].map(map_dict)
df['Gender_new'].unique()

## Job Role

In [None]:
JobRole_dict = df["JobRole"].value_counts()
print(JobRole_dict)

In [None]:
JobRole_dict_new = {
    'Sales Executive':            0,
    'Research Scientist':         1,
    'Laboratory Technician':      2,
    'Manufacturing Director':     3,
    'Healthcare Representative' : 4,
    'Sales Representative':       5,
    'Research Director':          6,
    'Human Resources':            7,
    'Manager':                    8

}
print(JobRole_dict_new)

In [None]:
def JobRole(x):
    if str(x) in JobRole_dict_new.keys():
        return JobRole_dict_new[str(x)]
df['New JobRole'] = df["JobRole"].apply(JobRole)
df.sample(5)

## Marital Status

In [None]:
MaritalStatus_dict = df["MaritalStatus"].value_counts()
print(MaritalStatus_dict)

In [None]:
MaritalStatus_dict_new = {
    'Married':  0,
    'Single':   1,
    'Divorced': 2
}
print(MaritalStatus_dict_new)

In [None]:
def MaritalStatus(x):
    if str(x) in MaritalStatus_dict_new.keys():
        return MaritalStatus_dict_new[str(x)]
df['New MaritalStatus'] = df["MaritalStatus"].apply(MaritalStatus)
df.sample(5)

In [None]:
# Drop the original text columns now that their encoded copies exist
df = df.drop(columns=df.select_dtypes('object').columns)

# Model Building

In [None]:
df.sample(5)

## Correlation and Logit Summary

In [None]:
corr = df.corr()
print(corr)

plt.figure(figsize=(20, 10))
colormap = sns.color_palette("YlGnBu")
sns.heatmap(df.corr(), annot=True, cmap=colormap).set_title('Correlation Heatmap', fontdict={'fontsize':14})
plt.show()

In [None]:
corr_attrition = df.corr()['Attrition_New']
print(corr_attrition.sort_values(ascending=True))

In the correlation results, the target variable (Attrition) is negatively correlated with Total Working Years, Age, Years With Current Manager, Overall Employee Rating, Years at Company, Environment Satisfaction, Job Satisfaction, and Work Life Balance. On the other hand, comparatively fewer variables are positively correlated with attrition, such as Department, Number of Companies Worked, Percent Salary Hike, and a few others.

However, this doesn't mean that these features are not significant. Even a little distinction might help classify the attrition, so removing any of the attributes from the analysis is not recommended.

In [None]:
X = df.drop(['Attrition_New'], axis=1)
y = df['Attrition_New']
X = sm.add_constant(X)

model = sm.Logit(y, X)
result = model.fit()
print(result.summary())

p_values = result.pvalues
print(p_values.sort_values(ascending=False))

## Model 1: Normal Logistic Regression

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

X = df.drop(columns=['Attrition_New'],axis=1)
y = df['Attrition_New']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

logreg = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print(precision_score(y_test, y_pred, average='macro', zero_division=1))
print(recall_score(y_test, y_pred, average='macro', zero_division=1))
print(f1_score(y_test, y_pred, average='macro', zero_division=1))
print(classification_report(y_test, y_pred, zero_division=1))


## Model 2: Without Monthly Income

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

X = df.drop(columns=['Attrition_New','MonthlyIncome'],axis=1)
y = df['Attrition_New']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

logreg = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print(precision_score(y_test, y_pred, average='macro', zero_division=1))
print(recall_score(y_test, y_pred, average='macro', zero_division=1))
print(f1_score(y_test, y_pred, average='macro', zero_division=1))
print(classification_report(y_test, y_pred, zero_division=1))

## Model 3: Using RFE

In [None]:
from sklearn.feature_selection import RFE

X = df.drop(columns=['Attrition_New'],axis=1)
y = df['Attrition_New']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000, random_state=42)

rfe = RFE(lr, n_features_to_select=20)
rfe.fit(X_train, y_train)

X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

lr.fit(X_train_selected, y_train)
y_pred = lr.predict(X_test_selected)

print(precision_score(y_test, y_pred, average='macro', zero_division=1))
print(recall_score(y_test, y_pred, average='macro', zero_division=1))
print(f1_score(y_test, y_pred, average='macro', zero_division=1))
print(classification_report(y_test, y_pred, zero_division=1))


## Models Using RFE and 3 Different Sampling Techniques

In [None]:
# Define the feature and target variables
X = df.drop(columns=['Attrition_New'])
y = df['Attrition_New']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the logistic regression model
lr_model = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')

# Select 10 features with RFE, fitted on the training data only
rfe = RFE(lr_model, n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]

X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

# Define the sampling methods (seeded so that the results are reproducible)
over_sampler = RandomOverSampler(sampling_strategy='minority', random_state=42)
under_sampler = RandomUnderSampler(sampling_strategy='majority', random_state=42)
smote = SMOTE(random_state=42)

# Resample the training data only; resampling the full data would leak test rows into training
train_sets = {
    'no resampling': (X_train_selected, y_train),
    'random oversampling': over_sampler.fit_resample(X_train_selected, y_train),
    'random undersampling': under_sampler.fit_resample(X_train_selected, y_train),
    'SMOTE': smote.fit_resample(X_train_selected, y_train),
}

# Train and test the logistic regression model with each sampling method
for name, (X_train_resampled, y_train_resampled) in train_sets.items():
    lr_model.fit(X_train_resampled, y_train_resampled)
    y_pred = lr_model.predict(X_test_selected)
    print(precision_score(y_test, y_pred, average='macro', zero_division=1))
    print(recall_score(y_test, y_pred, average='macro', zero_division=1))
    print(f1_score(y_test, y_pred, average='macro', zero_division=1))

    print(f'Classification report for {name}:')
    print(classification_report(y_test, y_pred, zero_division=1))

    # Absolute coefficients of the selected features as a rough importance measure
    importances = pd.DataFrame({'feature': selected_features, 'importance': np.abs(lr_model.coef_[0])})
    importances = importances.sort_values('importance', ascending=False).set_index('feature')
    print(importances)

## Model Using SMOTE and Grid Search

In [None]:
# Load data
X = df.drop(columns=['Attrition_New'])
y = df['Attrition_New']

# Split into train and test sets first, so the test set contains only real observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Data Normalization (fitted on the training data only)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Resample the training data using SMOTE
# (named smote rather than sm so the statsmodels alias is not overwritten)
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Define hyperparameters for tuning
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'solver': ['lbfgs', 'liblinear', 'sag']}

# Perform GridSearchCV for hyperparameter tuning
# (max_iter is raised so that every solver has room to converge)
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), params, scoring='roc_auc', cv=10)
grid_search.fit(X_train, y_train)

# Best Performing Parameter
print('=' * 20)
print("best estimator: " + str(grid_search.best_estimator_))
print("best params: " + str(grid_search.best_params_))
print('best score:', grid_search.best_score_)
print('=' * 20)

# Fit logistic regression model using best hyperparameters
lr = LogisticRegression(C=grid_search.best_params_['C'], solver=grid_search.best_params_['solver'], max_iter=1000)
lr.fit(X_train, y_train)

# Make predictions on test set
y_pred = lr.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))


## Model 1 using Grid Search

In [None]:
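# Removing some observations whose class is in majority
# This is an important step to balance the dataset
# A seeded copy is used so that re-running the cell does not shrink df any further
rng = np.random.default_rng(42)
df_balanced = df[(df['Attrition_New'] != 0) | (rng.random(len(df)) < .33)]

X = df_balanced.drop(columns=['Attrition_New'])
y = df_balanced['Attrition_New']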

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, stratify=y)

# Data Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Defining parameters for hyper-parameter tuning
params = {'solver': ['newton-cg', 'liblinear'],
          'penalty': ['l2'],
          'C': np.logspace(-4.5, 4.5, 50),
          'class_weight': ['balanced'],
          'max_iter': [1000, 5000, 10000],
          'tol': [0.0001, 0.001, 0.01, 0.1],
          'fit_intercept': [True, False],
          'intercept_scaling': [1, 2, 3]}

# Initializing Grid Search with Logistic Regression and keeping roc_auc as the performance metrics!
grid_search = GridSearchCV(estimator=LogisticRegression(),
                           param_grid=params,
                           cv=10,
                           n_jobs=-1,
                           verbose=0,
                           scoring="roc_auc",
                           return_train_score=True)

# Training
grid_search.fit(X_train, y_train)

# Best Performing Parameter
print('=' * 20)
print("best params: " + str(grid_search.best_estimator_))
print("best params: " + str(grid_search.best_params_))
print('best score:', grid_search.best_score_)
print('=' * 20)
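The parameter grid in the cell above is large, so it helps to estimate the cost before running it. The sketch below counts the candidate combinations with plain Python. The list sizes are copied from that cell, and the `C` values are rebuilt by hand to match `np.logspace(-4.5, 4.5, 50)`.

```python
from math import prod

# Same grid shape as the cell above.
params = {
    "solver": ["newton-cg", "liblinear"],
    "penalty": ["l2"],
    "C": [10 ** (-4.5 + 9 * i / 49) for i in range(50)],
    "class_weight": ["balanced"],
    "max_iter": [1000, 5000, 10000],
    "tol": [0.0001, 0.001, 0.01, 0.1],
    "fit_intercept": [True, False],
    "intercept_scaling": [1, 2, 3],
}

n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * 10  # cv=10 folds per candidate

print("candidates:", n_candidates)
print("model fits:", n_fits)
```

In scikit-learn, `intercept_scaling` only has an effect with the `liblinear` solver, so part of this grid repeats equivalent fits. Trimming such parameters is one way to shorten the search.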

## Model 2 Using Grid Search

In [None]:
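# Removing some observations whose class is in majority
# This is an important step to balance the dataset
# A seeded copy is used so that re-running the cell does not shrink df any further
rng = np.random.default_rng(42)
df_balanced = df[(df['Attrition_New'] != 0) | (rng.random(len(df)) < .33)]

X = df_balanced.drop(columns=['Attrition_New'])
y = df_balanced['Attrition_New']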

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, stratify=y)

# Data Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Defining parameters for hyper-parameter tuning
params = {'solver': ['newton-cg','lbfgs'],
          'penalty': ['l2'],
          'C': np.logspace(-4.5, 4.5, 50),
          'class_weight': ['balanced'],
          'max_iter': [1000, 5000, 10000],
          'tol': [0.0001, 0.001, 0.01, 0.1],
          'fit_intercept': [True, False],
          'intercept_scaling': [1, 2, 3]}

# Initializing Grid Search with Logistic Regression and keeping roc_auc as the performance metrics!
grid_search = GridSearchCV(estimator=LogisticRegression(),
                           param_grid=params,
                           cv=10,
                           n_jobs=-1,
                           verbose=0,
                           scoring="roc_auc",
                           return_train_score=True)

# Training
grid_search.fit(X_train, y_train)

# Best Performing Parameter
print('=' * 20)
print("best params: " + str(grid_search.best_estimator_))
print("best params: " + str(grid_search.best_params_))
print('best score:', grid_search.best_score_)
print('=' * 20)

## Model 3 Using Grid Search

In [None]:
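# Removing some observations whose class is in majority
# This is an important step to balance the dataset
# A seeded copy is used so that re-running the cell does not shrink df any further
rng = np.random.default_rng(42)
df_balanced = df[(df['Attrition_New'] != 0) | (rng.random(len(df)) < .33)]

X = df_balanced.drop(columns=['Attrition_New'])
y = df_balanced['Attrition_New']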

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, stratify=y)

# Data Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Defining parameters for hyper-parameter tuning
params = {'solver': ['newton-cg', 'liblinear'],
          'penalty': ['l2'],
          'C': np.logspace(-4.5, 4.5, 50),
          'class_weight': ['balanced'],
          'max_iter': [1000, 5000, 10000],
          'tol': [0.0001, 0.001, 0.01, 0.1],
          'fit_intercept': [True, False],
          'intercept_scaling': [1, 2, 3]}

# Initializing Grid Search with Logistic Regression and keeping roc_auc as the performance metrics!
grid_search = GridSearchCV(estimator=LogisticRegression(),
                           param_grid=params,
                           cv=10,
                           n_jobs=-1,
                           verbose=0,
                           scoring="roc_auc",
                           return_train_score=True)

# Training
grid_search.fit(X_train, y_train)

# Best Performing Parameter
print('=' * 20)
print("best params: " + str(grid_search.best_estimator_))
print("best params: " + str(grid_search.best_params_))
print('best score:', grid_search.best_score_)
print('=' * 20)

In [None]:
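# Removing some observations whose class is in majority
# This is an important step to balance the dataset
# A seeded copy is used so that re-running the cell does not shrink df any further
rng = np.random.default_rng(42)
df_balanced = df[(df['Attrition_New'] != 0) | (rng.random(len(df)) < .33)]

X = df_balanced.drop(columns=['Attrition_New'])
y = df_balanced['Attrition_New']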

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, stratify=y)

# Data Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Defining parameters for hyper-parameter tuning
params = {'solver': ['newton-cg'],
          'penalty': ['l2'],
          'C': np.logspace(-4.5, 4.5, 50),
          'class_weight': ['balanced'],
          'max_iter': [1000, 5000, 10000],
          'tol': [0.0001, 0.001, 0.01, 0.1],
          'fit_intercept': [True, False],
          'intercept_scaling': [1, 2, 3]}

# Initializing Grid Search with Logistic Regression and keeping roc_auc as the performance metrics!
grid_search = GridSearchCV(estimator=LogisticRegression(),
                           param_grid=params,
                           cv=10,
                           n_jobs=-1,
                           verbose=0,
                           scoring="roc_auc",
                           return_train_score=True)

# Training
grid_search.fit(X_train, y_train)

# Best Performing Parameter
print('=' * 20)
print("best params: " + str(grid_search.best_estimator_))
print("best params: " + str(grid_search.best_params_))
print('best score:', grid_search.best_score_)
print('=' * 20)

In [None]:
!pip install plot-metric

In [None]:
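# Let's evaluate the performance of the model over the testing dataset:
from plot_metric.functions import BinaryClassification as BC
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
# The ROC curve needs predicted probabilities of the positive class, not hard 0/1 labels
y_proba = best_model.predict_proba(X_test)[:, 1]
bc = BC(y_test, y_proba, labels=[0, 1])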

# Plotting AUC_ROC Curve
plt.figure(figsize=(8, 6))
bc.plot_roc_curve()
plt.show()

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay

print("The accuracy is {:.2f}".format(accuracy_score(y_test, y_pred)))
print("The balanced accuracy is {:.2f}".format(balanced_accuracy_score(y_test, y_pred)))
print("The recall is {:.2f}".format(recall_score(y_test, y_pred)))
print("The precision is {:.2f}".format(precision_score(y_test, y_pred)))
print("The F1 Score is {:.2f}".format(f1_score(y_test, y_pred)))
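# AUC is computed from predicted probabilities rather than hard 0/1 labels
print("The AUC ROC Score is {:.2f}".format(roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])))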

cm = confusion_matrix(y_test, best_model.predict(X_test))
classes = ['Not_Attrition', 'Attrition']
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)

fig, ax = plt.subplots(figsize=(7, 7))
plt.title("Confusion Matrix")
disp = disp.plot(ax=ax)
plt.grid(None)
plt.show()

The evaluation metrics of Accuracy, Balanced Accuracy, Recall, and AUC_ROC seem promising for our model. However, as our data is imbalanced, accuracy alone cannot be relied upon, and it may give misleading results. In this case, Recall and AUC_ROC metrics indicate a good fit for our data, but Precision is relatively low. The confusion matrix shows the presence of false-positive cases, which affects the precision of the model. However, false negatives are low, which improves the recall metric.
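To make the precision and recall trade-off described above concrete, here is a small worked example. The confusion-matrix counts are made up for illustration and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
tn, fp, fn, tp = 150, 60, 10, 80

# Precision: of the employees flagged as leaving, how many actually left.
precision = tp / (tp + fp)

# Recall: of the employees who actually left, how many were flagged.
recall = tp / (tp + fn)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

With many false positives and few false negatives, recall stays high while precision drops, which is the pattern the paragraph above describes.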

In conclusion, we successfully used logistic regression to model attrition in the dataset. To improve the model, we tried various sampling methods and compared their results to find the most suitable sampling technique for this problem. There is still room for improvement, but our final model using Grid Search achieved decent Recall and AUC ROC scores, suggesting a good fit to the data.

# Summary

1. Data Preprocessing:

* Data cleaning and handling missing values
* Dropping irrelevant features
* Encoding categorical features using manual integer mappings
* Train-Test Split: Splitting the data into training and testing sets with a ratio of 80:20

2. Sampling:

* Handling class imbalance by resampling (random oversampling, random undersampling, and SMOTE)

3. Feature Engineering:

* Scaling the features using MinMaxScaler
* Feature selection using Recursive Feature Elimination (RFE)
* Building the binary logistic regression model using the selected features

4. Model Training:

* Fitting the binary logistic regression model on the training data

5. Performance comparison between Train and Test:

* Evaluating the performance of the model on both the training and testing sets

6. Cross-Validation:

* Applying 10-fold cross-validation to get a more reliable estimate of the model's performance

7. Hyperparameter Tuning:

* Fine-tuning the model by tuning the hyperparameters using GridSearchCV

8. Final Model:

* Building the final binary logistic regression model with the best hyperparameters
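The steps above can be condensed into a short end-to-end sketch. It runs on synthetic data from `make_classification` rather than the attrition dataset, assumes roughly 16% positives, uses a deliberately small grid, and handles imbalance with `class_weight='balanced'` instead of resampling. Treat it as an illustration of the workflow, not as the notebook's exact model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the attrition data (an assumption).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84, 0.16], random_state=42)

# 80:20 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Putting the scaler inside the pipeline means it is refit on each
# cross-validation training fold, so no validation data leaks into it.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Small illustrative grid, tuned with 10-fold cross-validation on ROC AUC.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=10)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the untouched test set.
y_proba = search.predict_proba(X_test)[:, 1]
y_pred = search.predict(X_test)
test_auc = roc_auc_score(y_test, y_proba)
test_recall = recall_score(y_test, y_pred)

print("best C:", search.best_params_["clf__C"])
print("test ROC AUC:", round(test_auc, 3))
print("test recall:", round(test_recall, 3))
```

Keeping preprocessing inside the pipeline is a design choice that avoids fitting the scaler on data the model is later scored on.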



