# MACHINE LEARNING PROJECT 

## TITLE: Attrition Unmasked - Why Employees Leave

![Employee Attrition](https://github.com/eshagarwal/ML_Gisma_Project/blob/main/employee-attrition.jpg?raw=true)

#### An Interesting Quote I Found:  
*"Managers tend to blame their turnover problems on everything under the sun, while ignoring the crux of the matter: people don't leave jobs; they leave managers."*  
-- *Travis Bradberry*  


## What is Attrition and What Determines It?  
**Attrition:** The rate at which employees leave an organization over a given period, also known as employee turnover.  

### This can happen for many reasons:  
- Employees looking for better opportunities.  
- A negative working environment.  
- Bad management.  
- Sickness of an employee (or even death).  
- Excessive working hours.  

## Structure of the Project  
This project will be structured in the following way:  

- **Questions:** Questions will be asked prior to the visualization to ensure that the visualizations shown in this project are insightful.  
- **Summary:** After each section, a summary will be provided to understand what we learned from the visualizations.  
- **Recommendations:** Suggestions will be made to the organization to help reduce the **attrition rate**.  


### **Table of Contents**  

#### I. **General Information**  
- [Summary of our Data](#summary-of-our-data)  
- [Distribution of our Labels](#distribution-of-our-labels)  

#### II. **Gender Analysis**  
- [Age Distribution by Gender](#age-distribution-by-gender)  
- [Job Satisfaction Distribution by Gender](#job-satisfaction-distribution-by-gender)  
- [Monthly Income by Gender](#monthly-income-by-gender)  
- [Presence by Department](#presence-by-department)  

#### III. **Analysis by Education**  
- [Understanding Attrition by Education](#understanding-attrition-by-education)  

#### IV. **The Impact of Income Towards Attrition**  
- [Average Income by Department](#average-income-by-department)  
- [Determining Satisfaction by Income](#determining-satisfaction-by-income)  
- [Income and the Levels of Attrition](#income-and-the-levels-of-attrition)  
- [Level of Attrition by Overtime](#level-of-attrition-by-overtime)  

#### V. **Working Environment**  
- [Average Environment Satisfaction](#average-environment-satisfaction)  

#### VI. **Other Factors**  
- [Other Factors that Could Influence Attrition](#other-factors-that-could-influence-attrition)  

#### VII. **Feature Engineering**  
- [Mapping Categorical Values to Numerical Values for Correlation Matrix](#mapping_categorical_to_numerical)  
- [Dropping all of the `object` dtype Columns for Correlation Matrix](#dropping_objects)  
- [Plotting the Correlation Matrix](#correlation_matrix)
- [Checking the fields correlated to attrition](#checking_fields_correlated_to_attrition)

#### VIII. **Data Preprocessing**
- [Defining Features and Target Variable for Model Training](#defining-features-and-target-variable-for-model-training)
- [Splitting the Data into Training and Testing Sets](#splitting-the-data-into-training-and-testing-sets)
- [Balancing the Training Data using SMOTE](#balancing-the-training-data-using-smote)
- [Calculating Class Weights](#calculating-class-weights)

#### IX. **Analysis and Models** 
- [Defining the Models for Training](#defining-the-models-for-training)
- [Training and Evaluating the Models](#training-and-evaluating-the-models)
- [Model Evaluation Results](#model-evaluation-results)
- [Confusion Matrices](#confusion-matrices)

#### X. **Fine-Tuning**
- [Fine-tuning Random Forest Model with Cross-Validation](#fine-tuning-random-forest-model-with-cross-validation)
- [Evaluating the Performance of the Fine-Tuned Random Forest Model](#evaluating-the-performance-of-model)
- [Visualizing Feature Importances of the Fine-Tuned Random Forest Model](#visualizing-feature-importances)


#### XI. **Conclusion**  
- [Top Reasons Why Employees Leave the Organization](#top-reasons-why-employees-leave-the-organization)  



### Importing libraries


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import sklearn

In [2]:
import plotly.io as pio
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
pio.renderers.keys()
# pio.renderers.default = 'notebook'

dict_keys(['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode', 'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab', 'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg', 'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe', 'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png'])

In [3]:
df = pd.read_csv("./Data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

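|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |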


## Summary of our Data  
<a id="summary-of-our-data"></a>  

Before we get into the deeper visualizations, we want to see what our data looks like.  


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [5]:
df.describe()

|  | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |


In [6]:
print(df.columns)

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')


In [7]:
df.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   

In [8]:
df.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

In [9]:
df.isna().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

### Summary  

- **Dataset Structure:** 1470 observations (rows), 35 features (variables).  
- **Missing Data:** Luckily, there is no missing data! This will make it easier to work with the dataset.  
- **Data Type:** We only have two data types in this dataset: strings (`object`) and integers (`int64`).  
- **Label:** *Attrition* is the label in our dataset, and we would like to find out why employees are leaving the organization!  
- **Imbalanced Dataset:** 1233 employees (84% of cases) did not leave the organization, while 237 (16% of cases) did leave. This makes our dataset **imbalanced**, since far more people stay in the organization than leave.

### Distribution of our Labels  
<a id="distribution-of-our-labels"></a>  

An important aspect, discussed further below, is that we are dealing with an **imbalanced dataset**: **83.9%** of employees did not quit the organization, while **16.1%** did leave.  

Knowing that we are dealing with an imbalanced dataset will help us determine the best approach to implement our predictive model.  


In [10]:
attrition_counts = df["Attrition"].value_counts()
yes_percent = attrition_counts["Yes"] / (
    attrition_counts["Yes"] + attrition_counts["No"]
)
no_percent = attrition_counts["No"] / (attrition_counts["Yes"] + attrition_counts["No"])

fig = go.Figure(
    data=[
        go.Pie(
            labels=["Yes", "No"],
            values=[yes_percent, no_percent],
        ),
    ]
)
fig.show()
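
As a quick cross-check of the label split, the class shares can be read directly from `value_counts(normalize=True)`. The sketch below does not load the CSV. It rebuilds the label column from the figures reported in this notebook (237 leavers out of 1470 rows), so `labels` is only a stand-in for `df["Attrition"]`, and `imbalance_ratio` is an illustrative name.

```python
import pandas as pd

# Stand-in for df["Attrition"], rebuilt from the reported figures
# (assumption: 237 "Yes" out of 1470 rows, so 1233 "No").
labels = pd.Series(["No"] * (1470 - 237) + ["Yes"] * 237, name="Attrition")

# Share of each class as a fraction of all rows.
class_share = labels.value_counts(normalize=True)

# How many "No" rows exist for every "Yes" row.
imbalance_ratio = (labels == "No").sum() / (labels == "Yes").sum()

print(class_share.round(3))
print(f"Imbalance ratio (No : Yes) = {imbalance_ratio:.2f} : 1")
```

With the real data, replacing `labels` with `df["Attrition"]` should give the same shares as the pie chart.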

## Gender Analysis  

We will try to see if there are any discrepancies between males and females in the organization. Also, we will look at other basic information such as age, level of job satisfaction, and average salary by gender.  

### Questions to ask Ourselves:  
- What is the age distribution of males and females? Are there any significant **discrepancies**?  
- What is the average job satisfaction? Is one gender more dissatisfied than the other?  
- What is the average salary by gender? What is the number of employees by gender in each department?  


#### **Age Distribution by Gender**  
<a id="age-distribution-by-gender"></a>

In [11]:
average_age_by_gender = df.groupby("Gender")["Age"].mean()

print("\nAverage age by Gender:")
print("================================")
print(average_age_by_gender)


Average age by Gender:
Gender
Female    37.329932
Male      36.653061
Name: Age, dtype: float64


In [12]:
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=("Female Employees", "Male Employees", "Overall Employees"),
)

# Female employees
female_df = df[df["Gender"] == "Female"]
female_hist = px.histogram(female_df, x="Age")
female_hist.update_traces(showlegend=True)
mean_age_female = female_df["Age"].mean()

# Male employees
male_df = df[df["Gender"] == "Male"]
male_hist = px.histogram(male_df, x="Age")
male_hist.update_traces(showlegend=True)
mean_age_male = male_df["Age"].mean()

# Overall employees
overall_hist = px.histogram(df, x="Age")
overall_hist.update_traces(showlegend=True)
mean_age_overall = df["Age"].mean()

for trace in female_hist.data:
    fig.add_trace(trace, row=1, col=1)
    fig.update_xaxes(title_text="Age", row=1, col=1)
    fig.update_yaxes(title_text="Count", row=1, col=1)
    fig.add_vline(
        x=mean_age_female,
        line_width=2,
        line_dash="dash",
        line_color="red",
        annotation_text="Mean Age",
        annotation_position="top right",
        row=1,
        col=1,
    )

for trace in male_hist.data:
    fig.add_trace(trace, row=1, col=2)
    fig.update_xaxes(title_text="Age", row=1, col=2)
    fig.update_yaxes(title_text="Count", row=1, col=2)
    fig.add_vline(
        x=mean_age_male,
        line_width=2,
        line_dash="dash",
        line_color="red",
        annotation_text="Mean Age",
        annotation_position="top right",
        row=1,
        col=2,
    )

for trace in overall_hist.data:
    fig.add_trace(trace, row=2, col=1)
    fig.update_xaxes(title_text="Age", row=2, col=1)
    fig.update_yaxes(title_text="Count", row=2, col=1)
    fig.add_vline(
        x=mean_age_overall,
        line_width=2,
        line_dash="dash",
        line_color="red",
        annotation_text="Mean Age",
        annotation_position="top right",
        row=2,
        col=1,
    )

fig.update_layout(title_text="Age Distribution of Employees", height=800)

fig.show()

#### **Distribution of Job Satisfaction**  
<a id="job-satisfaction-distribution-by-gender"></a>

In [13]:
job_satisfaction_by_gender = df.groupby("Gender")["JobSatisfaction"].mean()

print("\nMean Job Satisfaction by Gender:")
print("================================")
print(job_satisfaction_by_gender)


Mean Job Satisfaction by Gender:
Gender
Female    2.683673
Male      2.758503
Name: JobSatisfaction, dtype: float64


In [14]:
grouped_df = (
    df.groupby(["Gender", "JobSatisfaction"])["JobSatisfaction"]
    .count()
    .reset_index(name="Count")
)

fig = px.bar(
    grouped_df,
    x="JobSatisfaction",
    y="Count",
    color="Gender",
    barmode="group",
    title="Job Satisfaction by Gender",
)

fig.show()

#### **Monthly Income by Gender**  
<a id="monthly-income-by-gender"></a>

In [15]:
average_salary_by_gender = df.groupby("Gender")["MonthlyIncome"].mean()

print("\nAverage Salary by Gender:")
print("================================")
print(average_salary_by_gender)


Average Salary by Gender:
Gender
Female    6686.566327
Male      6380.507937
Name: MonthlyIncome, dtype: float64


In [16]:
fig = px.strip(
    df,
    x="Gender",
    y="MonthlyIncome",
    title="Monthly Income by Gender",
    hover_data=["MonthlyIncome", "JobSatisfaction"],
    color="Gender",
)
fig.show()

#### **Presence by Department**  
<a id="presence-by-department"></a>

In [17]:
grouped_df = (
    df.groupby(["Gender", "Department"])["Department"]
    .count()
    .reset_index(name="Count")
)

print("\nEmployee Count by Gender and Department:")
print("=======================================")
print(grouped_df.to_string(index=False))

print("\nTotal Employees per Department:")
print("============================")
dept_totals = grouped_df.groupby("Department")["Count"].sum()
print(dept_totals)


Employee Count by Gender and Department:
Gender             Department  Count
Female        Human Resources     20
Female Research & Development    379
Female                  Sales    189
  Male        Human Resources     43
  Male Research & Development    582
  Male                  Sales    257

Total Employees per Department:
Department
Human Resources            63
Research & Development    961
Sales                     446
Name: Count, dtype: int64


In [18]:
department_counts = df["Department"].value_counts().reset_index()
department_counts.columns = ["Department", "Count"]

fig = px.bar_polar(
    department_counts,
    r="Count",
    theta="Department",
    color="Department",
    title="Total Employees per Department",
)

fig.show()

### Summary:  
- **Age by Gender:** The average age is 37.33 for females and 36.65 for males, and both distributions are **similar**.  
- **Job Satisfaction by Gender:** Females reported a slightly lower average satisfaction level than males (2.68 vs. 2.76).  
- **Salaries:** The average monthly incomes are close, with **females** averaging 6686.57 and **males** 6380.51, a gap of about 306 in favor of females.  
- **Departments:** Males outnumber females in all three departments. The largest group of female employees (379) works in the **Research & Development** department.  
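
To compare the gender split across departments of very different sizes, the head counts printed earlier can be converted into shares. This is a small sketch that uses only the six counts from the "Employee Count by Gender and Department" output. The names `counts` and `female_share` are illustrative.

```python
import pandas as pd

# Head counts copied from the "Employee Count by Gender and Department" output.
counts = pd.DataFrame(
    {
        "Female": [20, 379, 189],
        "Male": [43, 582, 257],
    },
    index=["Human Resources", "Research & Development", "Sales"],
)

# Fraction of each department's employees who are female.
female_share = counts["Female"] / counts.sum(axis=1)

print(female_share.round(3))
```

Read this way, females are a minority in every department, with the largest share in Sales and the largest absolute number in Research & Development.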

## Analysis by Education and Attrition by Level of Education
<a id="understanding-attrition-by-education"></a>

In [19]:
df["Education"].value_counts()

Education
3    572
4    398
2    282
1    170
5     48
Name: count, dtype: int64

In [20]:
df["EducationLevel"] = df["Education"].map(
    {1: "School", 2: "College", 3: "Bachelor", 4: "Master", 5: "PhD"}
)

education_counts = df["EducationLevel"].value_counts().reset_index()
education_counts.columns = ["EducationLevel", "Count"]

attrition_counts = (
    df.groupby(["EducationLevel", "Attrition"]).size().reset_index(name="Count")
)

# Creating subplots
fig = make_subplots(
    rows=2,
    cols=1,
    subplot_titles=(
        "Education Level Distribution - Treemap",
        "Attrition Count by Education Level - Bar Chart",
    ),
    specs=[[{"type": "treemap"}], [{"type": "xy"}]],
)

# Treemap
fig_treemap = px.treemap(education_counts, path=["EducationLevel"], values="Count")
fig_treemap.update_traces(textinfo="label+value")
for trace in fig_treemap.data:
    fig.add_trace(trace, row=1, col=1)

# Bar Chart
fig_bar = px.bar(
    attrition_counts,
    x="EducationLevel",
    y="Count",
    color="Attrition",
    barmode="group",
    color_discrete_map={"Yes": "red", "No": "blue"},
    text="Attrition",
)
fig_bar.update_traces(textposition="outside")
for trace in fig_bar.data:
    fig.add_trace(trace, row=2, col=1)

fig.update_layout(height=1000, title_text="Education Level and Attrition Analysis")

fig.show()

### Summary: 

**Attrition by Level of Education:** 
The level of education may play a role in employee attrition. Employees with higher education levels may have greater career mobility and be more likely to seek opportunities that match their qualifications. In this dataset, employees with a bachelor's degree account for the largest number of leavers. Note that they are also the largest group (572 of 1470 employees) and that the bar chart shows counts rather than rates, so this alone does not prove that their turnover rate is the highest. The result would fit the often-cited view that millennials (who are typically more educated) change jobs more readily, but that link is not tested here.  

- **Bachelor's Degree:** Employees with a bachelor's degree show the highest attrition count, possibly due to higher career expectations and the job opportunities available at this level of education.  
- **Master's Degree:** Employees with a master's degree show lower attrition, which may indicate that they are more likely to stay in positions that match their qualifications, or that they have greater job security.  
- **PhD or Doctorate:** Employees with a PhD or doctorate show the lowest attrition, but they are also the smallest group (48 employees). They often occupy specialized roles and may be more committed to a long-term career path.  

This suggests that organizations could focus on retaining employees with a bachelor's degree, for example by offering more growth opportunities or adjusting compensation to meet their expectations.  
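
Because the education groups differ a lot in size, a per-group attrition rate is a fairer comparison than raw counts. The following is a minimal sketch of that calculation. The helper `attrition_rate_by` and the `toy` frame are hypothetical, and the toy values are made up for illustration; they are not the real dataset.

```python
import pandas as pd


def attrition_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of rows with Attrition == 'Yes' for each group."""
    left = frame["Attrition"].eq("Yes")
    return left.groupby(frame[column]).mean().sort_values(ascending=False)


# Toy data for illustration only; these are NOT the real dataset values.
toy = pd.DataFrame(
    {
        "EducationLevel": ["Bachelor"] * 4 + ["Master"] * 2 + ["PhD"] * 2,
        "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    }
)

rates = attrition_rate_by(toy, "EducationLevel")
print(rates)
```

In the toy data the Bachelor and Master groups have the same number of leavers, yet the Master group has the higher rate because it is smaller. Calling `attrition_rate_by(df, "EducationLevel")` on the real frame would show whether the same effect applies here.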


## The Impact of Income towards Attrition  

I wonder how much importance each employee gives to the income they earn in the organization. Here, we will find out if it is true that money is really everything!  

### Questions to Ask Ourselves  
- What is the average monthly income by **department**? Are there any significant differences between individuals who quit and didn't quit?  
- Are there significant changes in the **level of income by Job Satisfaction**? Are individuals with a **lower satisfaction** getting much less income than the ones who are more satisfied?  
- Do employees who **quit the organization** have a much lower income than people who **didn't quit the organization**?  
- Do employees with a higher performance rating earn more than those with a lower performance rating? Is the difference significant by Attrition status?  


### Average Income by Department and Attrition Status
<a id="average-income-by-department"></a>


In [21]:
# Determine the average monthly income by department
average_income_by_department = (
    df.groupby("Department")["MonthlyIncome"].mean().reset_index(name="MeanIncome")
)

# Determine the average monthly income by department and attrition status
average_income_by_department_attrition = (
    df.groupby(["Department", "Attrition"])["MonthlyIncome"]
    .mean()
    .reset_index(name="MeanIncome")
)

fig = make_subplots(
    rows=2,
    cols=1,
    subplot_titles=(
        "Average Monthly Income by Department",
        "Average Monthly Income by Department and Attrition Status",
    ),
)

fig1 = px.bar(
    average_income_by_department,
    x="Department",
    y="MeanIncome",
    title="Average Monthly Income by Department",
    color="Department",
)

fig2 = px.bar(
    average_income_by_department_attrition,
    x="Department",
    y="MeanIncome",
    color="Attrition",
    barmode="group",
    title="Average Monthly Income by Department and Attrition Status",
    color_discrete_map={"Yes": "red", "No": "blue"},
    text="Attrition",
)
fig2.update_traces(textposition="outside")

for trace in fig1.data:
    fig.add_trace(trace, row=1, col=1)

for trace in fig2.data:
    fig.add_trace(trace, row=2, col=1)

fig.update_layout(height=800, title_text="Average Monthly Income Analysis")

fig.show()

### Determining Satisfaction by Income
<a id="determining-satisfaction-by-income"></a>


In [22]:
fig = px.box(
    df,
    x="Attrition",
    y="MonthlyIncome",
    color="JobSatisfaction",
    title="Distribution of Monthly Income by Job Satisfaction and Attrition",
)
fig.show()

### Income and its Impact on Attrition
<a id="income-and-the-levels-of-attrition"></a>


In [23]:
# Group by Attrition and calculate average MonthlyIncome
df_grouped = df.groupby("Attrition", as_index=False)["MonthlyIncome"].mean()

fig = px.bar(
    df_grouped,
    x="Attrition",
    y="MonthlyIncome",
    title="Income and its Impact on Attrition",
    labels={"MonthlyIncome": "Average Monthly Income"},
    color="Attrition",
)
fig.show()

### Level of Attrition by Overtime Status
<a id="level-of-attrition-by-overtime"></a>


In [24]:
# Count of employees by OverTime and Attrition
df_grouped = df.groupby(["OverTime", "Attrition"], as_index=False).size()

fig = px.bar(
    df_grouped,
    x="OverTime",
    y="size",
    color="Attrition",
    title="Attrition Count by Overtime Status",
    labels={"size": "Number of Employees", "OverTime": "Overtime Status"},
    barmode="group",
    text=df_grouped["size"],
)

fig.show()

### Summary:  
- **Income by Departments:** Wow! We can see large differences in average income by **attrition status** in each department.  
- **Income by Job Satisfaction:** Hmm. It seems that the lower the job satisfaction, the **wider the gap** in income levels by attrition status.  
- **Attrition Sample Population:** I would say that most employees in the attrition group have had a **salary increase** of less than 15% and a **monthly income** of less than 7,000.  
- **Exhaustion at Work:** About 54% of workers who left the organization worked **overtime**! Could this be a reason why employees are leaving?   
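
The overtime figure can be read in two directions: the share of leavers who worked overtime, or the share of overtime workers who left. A normalised `pd.crosstab` gives both. The sketch below runs on a made-up `toy` frame (an assumption for illustration, not the real data); with the notebook's data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data for illustration only; replace `toy` with the real `df`.
toy = pd.DataFrame(
    {
        "OverTime": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
        "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    }
)

# Row-normalised: within each Attrition group, what share worked overtime?
overtime_given_attrition = pd.crosstab(
    toy["Attrition"], toy["OverTime"], normalize="index"
)

# Column-normalised: within each OverTime group, what share left?
attrition_given_overtime = pd.crosstab(
    toy["Attrition"], toy["OverTime"], normalize="columns"
)

print(overtime_given_attrition)
print(attrition_given_overtime)
```

The second table is usually the more useful one for judging whether overtime is associated with leaving, because it compares attrition rates between the overtime and non-overtime groups.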

## Working Environment  
<a id="average-environment-satisfaction"></a>

In this section, we will explore the working environment of the organization.

### Question to ask Ourselves   
- **Work-Life Balance:** How does attrition vary with the employees' work-life balance rating?  

In [25]:
df.WorkLifeBalance.value_counts()

WorkLifeBalance
3    893
2    344
4    153
1     80
Name: count, dtype: int64

In [26]:
df["WorkLifeBalanceLevel"] = df.WorkLifeBalance.map(
    {1: "Bad", 2: "Good", 3: "Better", 4: "Best"}
)
df["WorkLifeBalanceLevel"].value_counts()

WorkLifeBalanceLevel
Better    893
Good      344
Best      153
Bad        80
Name: count, dtype: int64

In [27]:
# Group by WorkLifeBalance and Attrition
df_grouped = (
    df.groupby(["WorkLifeBalanceLevel", "WorkLifeBalance", "Attrition"], as_index=False)
    .size()
    .sort_values(by="WorkLifeBalance")
)

fig = px.bar(
    df_grouped,
    x="WorkLifeBalanceLevel",
    y="size",
    color="Attrition",
    title="Is there a Work Life Balance Environment?",
    labels={
        "size": "Number of Employees",
        "WorkLifeBalanceLevel": "Work-Life Balance Rating",
    },
    barmode="group",
    text=df_grouped["size"],
)

fig.show()

### Summary:  
- Employees with a **"Better"** work-life balance have the largest number of leavers (127) but the lowest attrition rate (127 of 893, about 14%), because this is by far the largest group.
- Employees with a **"Bad"** work-life balance have the fewest leavers (25) but the highest attrition rate (25 of 80, about 31%).
- Employees in the **"Good"** and **"Best"** categories show moderate attrition rates (58 of 344, about 17%, and 27 of 153, about 18%, respectively).
- The **"Best"** work-life balance group has a slightly higher attrition rate than the **"Better"** group (about 18% vs. about 14%).
- A poor work-life balance is associated with a clearly higher share of employees leaving the organization, while the differences among the other three groups are small.
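
The rates behind this summary follow from dividing the number of leavers in each group by the group size. The sketch below uses the group sizes from the `WorkLifeBalanceLevel` value counts and the leaver counts quoted in the summary. The frame name `wlb` and its column names are illustrative.

```python
import pandas as pd

# Group sizes come from the WorkLifeBalanceLevel value counts; leaver counts
# are the ones quoted in the summary (read from the bar chart).
wlb = pd.DataFrame(
    {
        "Employees": [80, 344, 893, 153],
        "Left": [25, 58, 127, 27],
    },
    index=["Bad", "Good", "Better", "Best"],
)

# Attrition rate = leavers divided by group size.
wlb["AttritionRate"] = wlb["Left"] / wlb["Employees"]

print(wlb.round(3))
```

The leaver counts add up to 237, which matches the overall number of employees who left.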



## Other Factors that could Influence Attrition
<a id="other-factors-that-could-influence-attrition"></a>
In this section we will analyze other external factors that could influence individuals leaving the organization.
One of these factors is:
- Home Distance from Work

### Question to Ask Ourselves:
- **Distance from Work:** Is distance from work a huge factor in terms of quitting the organization?


In [28]:
# Group by DistanceFromHome and Attrition, and count employees
df_grouped = df.groupby(["DistanceFromHome", "Attrition"], as_index=False).size()

fig = px.bar(
    df_grouped,
    x="DistanceFromHome",
    y="size",
    color="Attrition",
    title="Attrition Rate by Distance from Home",
    labels={"size": "Number of Employees"},
)

fig.show()

In [29]:
fig = px.histogram(df[df['DistanceFromHome'] < 5], x='JobLevel', color='Attrition',
                   title='Attrition Rate by Job Level for Nearby Employees',
                   labels={'JobLevel': 'Job Level'},
                   barmode='group')

fig.show()

### Summary:

- **Attrition Rate by Distance from Home:**
  - The largest numbers of leavers are among employees living close to the workplace (DistanceFromHome ≤ 5).
  - As the distance from home increases, the number of employees leaving the organization decreases. Because the chart shows counts, this may simply reflect that fewer employees live far away (the median distance is 7); a rate per distance group is needed to compare groups fairly.

- **Attrition Rate by Job Level for Nearby Employees:**
  - Among employees with short commutes (DistanceFromHome < 5, the filter used in the code), junior employees (JobLevel 1) have the highest attrition, with 51 employees quitting.
  - Employees at higher job levels (JobLevel 2 to 5) with short commutes show lower attrition, suggesting that career growth may be a bigger factor for junior employees.

Junior employees with shorter commutes leave the organization more frequently, possibly because they are seeking better career opportunities.
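
One way to check whether distance matters once group size is accounted for is to bin `DistanceFromHome` with `pd.cut` and compute a rate per bin. This is a sketch under stated assumptions: the `toy` frame is made up, and the bin edges and labels are hypothetical choices (the only fact taken from the data is that `DistanceFromHome` ranges from 1 to 29).

```python
import pandas as pd

# Toy data for illustration only; replace `toy` with the real `df`.
toy = pd.DataFrame(
    {
        "DistanceFromHome": [1, 2, 3, 4, 8, 9, 15, 22, 25, 29],
        "Attrition": ["Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
    }
)

# Hypothetical bin edges; intervals are right-inclusive, e.g. (0, 5].
bins = [0, 5, 10, 20, 29]
labels = ["1-5", "6-10", "11-20", "21-29"]
toy["DistanceBand"] = pd.cut(toy["DistanceFromHome"], bins=bins, labels=labels)

# Per band: number of employees, number of leavers, and attrition rate.
summary = (
    toy.assign(Left=toy["Attrition"].eq("Yes"))
    .groupby("DistanceBand", observed=False)["Left"]
    .agg(Employees="count", Leavers="sum", Rate="mean")
)

print(summary)
```

Comparing the `Rate` column across bands, rather than the `Leavers` column, avoids mistaking a large group for a high-risk group.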


## Feature Engineering
<a id="feature-engineering"></a>

In [30]:
df.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   

In [31]:
df.Department.value_counts()

Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64

### Mapping Categorical Values to Numerical Values for Correlation Matrix
<a id="mapping_categorical_to_numerical"></a>

In [32]:
df['DepartmentValue'] = df['Department'].map({'Sales': 1, 'Research & Development': 2, 'Human Resources': 3})

In [33]:
df['GenderValue'] = df.Gender.map({'Male': True, 'Female': False})
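
A caveat on the department mapping: the codes 1, 2, 3 impose an order on `Department` that the categories do not have, and a correlation coefficient will treat that order as meaningful. One common alternative, which is not what this notebook does, is one-hot encoding with `pd.get_dummies`. The sketch below uses a four-row stand-in series; the `Dept` prefix is an arbitrary choice.

```python
import pandas as pd

# Small stand-in for df["Department"], using the three department names.
departments = pd.Series(
    ["Sales", "Research & Development", "Human Resources", "Sales"],
    name="Department",
)

# One indicator column per department, so no ordering is implied.
department_dummies = pd.get_dummies(departments, prefix="Dept", dtype=int)

print(department_dummies)
```

Each indicator column can then be correlated with attrition separately, at the cost of a wider correlation matrix.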

In [34]:
df.select_dtypes(include=["object"]).columns

Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime', 'EducationLevel',
       'WorkLifeBalanceLevel'],
      dtype='object')

In [35]:
df.MaritalStatus.value_counts()

MaritalStatus
Married     673
Single      470
Divorced    327
Name: count, dtype: int64

In [36]:
df.select_dtypes(exclude=["object"]).columns

Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'DepartmentValue',
       'GenderValue'],
      dtype='object')

In [37]:
df.BusinessTravel.value_counts()

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

In [38]:
df['BusinessTravelValue'] = df.BusinessTravel.map({'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2})

In [39]:
df['AttritionValue'] = df['Attrition'].map({'Yes': True, 'No': False})

In [40]:
df['OverTimeValue'] = df['OverTime'].map({'Yes': True, 'No': False})

In [41]:
df.EducationLevel.value_counts()

EducationLevel
Bachelor    572
Master      398
College     282
School      170
PhD          48
Name: count, dtype: int64

In [42]:
df.StandardHours.value_counts()

StandardHours
80    1470
Name: count, dtype: int64

### Dropping `object`-dtype and Other Unneeded Columns for the Correlation Matrix
<a id="dropping_objects"></a>

In [43]:
correlation = df.drop(
    [
        "EmployeeCount",
        "EmployeeNumber",
        "Over18",
        "HourlyRate",
        "MaritalStatus",
        "Attrition",
        "EducationField",
        "Department",
        "Gender",
        'OverTime',
        'EducationLevel',
        'WorkLifeBalanceLevel',
        "JobRole",
        "BusinessTravel",
        'StandardHours'
    ], axis='columns'
).corr()
correlation

| | Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | ... | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | DepartmentValue | GenderValue | BusinessTravelValue | AttritionValue | OverTimeValue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.010661 | -0.001686 | 0.208034 | 0.010146 | 0.02982 | 0.509604 | -0.004892 | 0.497855 | 0.028051 | ... | -0.02149 | 0.311309 | 0.212901 | 0.216513 | 0.202089 | 0.031882 | -0.036311 | -0.011807 | -0.159205 | 0.028062 |
| DailyRate | 0.010661 | 1.0 | -0.004985 | -0.016806 | 0.018355 | 0.046135 | 0.002966 | 0.030571 | 0.007707 | -0.032182 | ... | -0.037848 | -0.034055 | 0.009932 | -0.033229 | -0.026363 | -0.007109 | -0.011716 | -0.015539 | -0.056652 | 0.009135 |
| DistanceFromHome | -0.001686 | -0.004985 | 1.0 | 0.021042 | -0.016075 | 0.008783 | 0.005303 | -0.003669 | -0.017014 | 0.027473 | ... | -0.026556 | 0.009508 | 0.018845 | 0.010029 | 0.014406 | -0.017225 | -0.001851 | -0.009696 | 0.077924 | 0.025514 |
| Education | 0.208034 | -0.016806 | 0.021042 | 1.0 | -0.027128 | 0.042438 | 0.101589 | -0.011296 | 0.094961 | -0.026084 | ... | 0.009819 | 0.069114 | 0.060236 | 0.054254 | 0.069065 | -0.007996 | -0.016547 | -0.00867 | -0.031373 | -0.020322 |
| EnvironmentSatisfaction | 0.010146 | 0.018355 | -0.016075 | -0.027128 | 1.0 | -0.008278 | 0.001212 | -0.006784 | -0.006259 | 0.0376 | ... | 0.027627 | 0.001458 | 0.018007 | 0.016194 | -0.004999 | 0.019395 | 0.000508 | -0.01131 | -0.103369 | 0.070132 |
| JobInvolvement | 0.02982 | 0.046135 | 0.008783 | 0.042438 | -0.008278 | 1.0 | -0.01263 | -0.021476 | -0.015271 | -0.016322 | ... | -0.014617 | -0.021355 | 0.008717 | -0.024184 | 0.025976 | 0.024586 | 0.01796 | 0.0293 | -0.130016 | -0.003507 |
| JobLevel | 0.509604 | 0.002966 | 0.005303 | 0.101589 | 0.001212 | -0.01263 | 1.0 | -0.001944 | 0.9503 | 0.039563 | ... | 0.037818 | 0.534739 | 0.389447 | 0.353885 | 0.375281 | -0.101963 | -0.039403 | -0.011696 | -0.169105 | 0.000544 |
| JobSatisfaction | -0.004892 | 0.030571 | -0.003669 | -0.011296 | -0.006784 | -0.021476 | -0.001944 | 1.0 | -0.007157 | 0.000644 | ... | -0.019459 | -0.003803 | -0.002305 | -0.018214 | -0.027656 | -0.021001 | 0.033252 | 0.008666 | -0.103481 | 0.024539 |
| MonthlyIncome | 0.497855 | 0.007707 | -0.017014 | 0.094961 | -0.006259 | -0.015271 | 0.9503 | -0.007157 | 1.0 | 0.034814 | ... | 0.030683 | 0.514285 | 0.363818 | 0.344978 | 0.344079 | -0.05313 | -0.031858 | -0.01345 | -0.15984 | 0.006089 |
| MonthlyRate | 0.028051 | -0.032182 | 0.027473 | -0.026084 | 0.0376 | -0.016322 | 0.039563 | 0.000644 | 0.034814 | 1.0 | ... | 0.007963 | -0.023655 | -0.012815 | 0.001567 | -0.036746 | -0.023642 | -0.041482 | -0.00844 | 0.01517 | 0.021431 |


### Plotting the Correlation Matrix
<a id="correlation_matrix"></a>

In [44]:
fig = px.imshow(
    correlation,
    color_continuous_scale="Viridis",
    title="Correlation Heatmap",
)
fig.show()

### Summary:
- Employees with more **total working years** tend to have a **higher monthly income**.
- A **larger salary increase percentage** is typically associated with a **higher performance rating**.
- Employees who have been with their **current manager for a longer period** generally have **more years since their last promotion**.
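- **Older employees** generally earn a **higher monthly income** (r ≈ 0.50).
- **Job level** and **monthly income** are almost collinear (r ≈ 0.95), so the two features carry largely the same information.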
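
Reading strongly correlated pairs off a heatmap is error-prone, so it can help to list them programmatically. The following is a minimal sketch on synthetic data (the frame `toy` and its columns `a`, `b`, `c` are placeholders); in the notebook, the `correlation` frame could be used in place of `corr`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; column names are placeholders.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                      # independent noise
})
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest.head(3))
```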


### Checking the fields correlated to attrition
<a id="checking_fields_correlated_to_attrition"></a>

#### These are the features we will use to predict the Attrition value

In [45]:
correlation['AttritionValue'].sort_values(ascending=False).drop('AttritionValue')

OverTimeValue               0.246118
BusinessTravelValue         0.127006
DistanceFromHome            0.077924
NumCompaniesWorked          0.043494
GenderValue                 0.029453
MonthlyRate                 0.015170
PerformanceRating           0.002889
PercentSalaryHike          -0.013478
Education                  -0.031373
YearsSinceLastPromotion    -0.033019
RelationshipSatisfaction   -0.045872
DailyRate                  -0.056652
TrainingTimesLastYear      -0.059478
WorkLifeBalance            -0.063939
DepartmentValue            -0.063991
EnvironmentSatisfaction    -0.103369
JobSatisfaction            -0.103481
JobInvolvement             -0.130016
YearsAtCompany             -0.134392
StockOptionLevel           -0.137145
YearsWithCurrManager       -0.156199
Age                        -0.159205
MonthlyIncome              -0.159840
YearsInCurrentRole         -0.160545
JobLevel                   -0.169105
TotalWorkingYears          -0.171063
Name: AttritionValue, dtype: float64
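
The list above is sorted by signed value, which pushes the strongest negative correlations to the bottom. For judging how strongly a feature is related to attrition, the absolute value matters more. A small sketch using a handful of values copied from the output above (the Series name `attrition_corr` is illustrative):

```python
import pandas as pd

# A few values copied from the output above.
attrition_corr = pd.Series({
    "OverTimeValue": 0.246118,
    "BusinessTravelValue": 0.127006,
    "PerformanceRating": 0.002889,
    "JobLevel": -0.169105,
    "TotalWorkingYears": -0.171063,
})

# Rank by strength of association, ignoring the sign.
ranked = attrition_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Even the strongest feature has |r| of only about 0.25, so no single feature is a strong linear predictor of attrition on its own.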

## Data Preprocessing

### Importing the Libraries

In [46]:
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

### Defining Features and Target Variable for Model Training
<a id="defining-features-and-target-variable-for-model-training"></a>

In [47]:
X = df[
    [
        "OverTimeValue",
        "BusinessTravelValue",
        "DistanceFromHome",
        "NumCompaniesWorked",
        "GenderValue",
        "MonthlyRate",
        "PerformanceRating",
        "PercentSalaryHike",
        "Education",
        "YearsSinceLastPromotion",
        "RelationshipSatisfaction",
        "DailyRate",
        "TrainingTimesLastYear",
        "WorkLifeBalance",
        "DepartmentValue",
        "EnvironmentSatisfaction",
        "JobSatisfaction",
        "JobInvolvement",
        "YearsAtCompany",
        "StockOptionLevel",
        "YearsWithCurrManager",
        "Age",
        "MonthlyIncome",
        "YearsInCurrentRole",
        "JobLevel",
        "TotalWorkingYears"
    ]
]
y = df["Attrition"]
X.shape, y.shape

((1470, 26), (1470,))

### Splitting the Data into Training and Testing Sets
<a id="splitting-the-data-into-training-and-testing-sets"></a>


In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=500)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1176, 26), (294, 26), (1176,), (294,))
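
The split above is random without stratification. With only about 16% of rows in the positive class, the share of leavers can drift between the training and test sets. A hedged sketch of a stratified split on synthetic labels (`X_demo` and `y_demo` are stand-ins, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same imbalance as the attrition labels (16% positive).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=500, stratify=y_demo
)

# With stratify, the positive share is preserved in both parts.
print(y_tr.mean(), y_te.mean())
```

Passing `stratify=y` to the notebook's own call would have the same effect, although it would change the reported numbers.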

### Balancing the Training Data using SMOTE
<a id="balancing-the-training-data-using-smote"></a>

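**SMOTE (Synthetic Minority Over-sampling Technique)** is applied because the dataset is imbalanced: only 189 of the 1,176 training rows (about 16%) belong to employees who left the company (Attrition = Yes). SMOTE generates synthetic samples for the minority class until both classes have the same size, which helps the model learn to predict both classes. The cell below only inspects the effect of resampling; the modelling pipeline further down applies SMOTE itself to `X_train`, so `X_train_balanced` is not used for training.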

In [49]:
smote = SMOTE(random_state=69)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
X_train_balanced.shape, y_train_balanced.shape

((1974, 26), (1974,))

In [50]:
print("After SMOTE:")
print(pd.Series(y_train_balanced).value_counts())

After SMOTE:
Attrition
Yes    987
No     987
Name: count, dtype: int64
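
In the output above, the minority class grows from 1176 − 987 = 189 rows to 987, so 798 synthetic rows were added. Each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbours: `x_new = x_i + lam * (x_nn - x_i)` with `lam` drawn from [0, 1). The sketch below illustrates that idea with NumPy only; it is not imblearn's implementation, and the function name `smote_like_sample` is made up for this example.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(len(minority))
    # Euclidean distances from sample i to every minority sample.
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sorted order is the sample itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Four toy minority points on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_like_sample(minority, rng=np.random.default_rng(69))
print(new_point)
```

Because every new point lies on a line segment between two existing minority points, SMOTE fills in the minority region rather than duplicating rows.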


### Preprocessing the Data and Calculating Class Weights
<a id="calculating-class-weights"></a>


In [51]:
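# Class weights from the 'balanced' heuristic: n_samples / (n_classes * class_count).
# np.unique(y) sorts to ['No', 'Yes'], which lines up with the encoded labels 0 and 1.
class_weights = dict(zip(
    [0, 1],
    compute_class_weight('balanced', classes=np.unique(y), y=y)
))

# Preprocessing: scale numeric columns, one-hot encode any remaining object columns.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), X.select_dtypes(exclude=["object"]).columns),
        ("cat", OneHotEncoder(drop='first'), X.select_dtypes(include=["object"]).columns),
    ]
).set_output(transform='pandas')

# Encode the target ('No' -> 0, 'Yes' -> 1). Fit on the training labels only and
# reuse the same encoder for the test labels so the two mappings cannot differ.
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)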
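
The `'balanced'` option of `compute_class_weight` uses the formula `n_samples / (n_classes * class_count)`. The arithmetic can be reproduced by hand from the class counts implied by the outputs in this notebook (the variable names below are illustrative):

```python
# Class counts implied by the outputs above:
# "No"  = 987 (train) + 246 (test), "Yes" = 189 (train) + 48 (test).
counts = {0: 987 + 246, 1: 189 + 48}
n_samples = sum(counts.values())
n_classes = len(counts)

# The 'balanced' heuristic: n_samples / (n_classes * count_of_class).
class_weights_demo = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
print(class_weights_demo)
```

A leaver therefore weighs roughly five times as much as a stayer. Since the pipeline further down also applies SMOTE, the minority class is compensated twice for the models that accept `class_weight`, which may contribute to their high recall and low precision.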

### Defining the Models for Training
<a id="defining-the-models-for-training"></a>

In [52]:
models = {
    "Logistic Regression": LogisticRegression(class_weight=class_weights),
    "Random Forest": RandomForestClassifier(class_weight=class_weights),
    "Gradient Boosting": GradientBoostingClassifier(random_state=500),
    "SVM": SVC(class_weight=class_weights),
    "Decision Tree": DecisionTreeClassifier(class_weight=class_weights),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

### Training and Evaluating the Models
<a id="training-and-evaluating-the-models"></a>

In [53]:
results = {}
confusion_matrix_results = {}
for name, model in models.items():
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("sampler", SMOTE(random_state=69)),
        ("classifier", model),
    ]).set_output(transform='pandas')
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    results[name] = {
        "Recall": recall_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }
    confusion_matrix_results[name] = confusion_matrix(y_test, y_pred)
    print(f"Model: {name}")
    print(classification_report(y_test, y_pred))
    print("=====================================")

Model: Logistic Regression
              precision    recall  f1-score   support

           0       0.93      0.49      0.64       246
           1       0.24      0.81      0.37        48

    accuracy                           0.54       294
   macro avg       0.58      0.65      0.50       294
weighted avg       0.82      0.54      0.60       294

Model: Random Forest
              precision    recall  f1-score   support

           0       0.89      0.96      0.92       246
           1       0.62      0.38      0.47        48

    accuracy                           0.86       294
   macro avg       0.75      0.67      0.69       294
weighted avg       0.84      0.86      0.85       294

Model: Gradient Boosting
              precision    recall  f1-score   support

           0       0.89      0.95      0.92       246
           1       0.61      0.40      0.48        48

    accuracy                           0.86       294
   macro avg       0.75      0.67      0.70       294
[output truncated]

### Model Evaluation Results
<a id="model-evaluation-results"></a>

In [54]:
for model, result in results.items():
    print(f"Model: {model}")
    for metric, value in result.items():
        print(f"{metric}: {value}")
    print("\n")


Model: Logistic Regression
Recall: 0.8125
Precision: 0.23636363636363636
F1 Score: 0.36619718309859156


Model: Random Forest
Recall: 0.375
Precision: 0.6206896551724138
F1 Score: 0.4675324675324675


Model: Gradient Boosting
Recall: 0.3958333333333333
Precision: 0.6129032258064516
F1 Score: 0.4810126582278481


Model: SVM
Recall: 0.6041666666666666
Precision: 0.37662337662337664
F1 Score: 0.464


Model: Decision Tree
Recall: 0.3958333333333333
Precision: 0.3114754098360656
F1 Score: 0.3486238532110092


Model: KNN
Recall: 0.7083333333333334
Precision: 0.3008849557522124
F1 Score: 0.422360248447205


Model: Naive Bayes
Recall: 0.7083333333333334
Precision: 0.23448275862068965
F1 Score: 0.35233160621761656




In [55]:
fig = go.Figure()

for model, result in results.items():
    fig.add_trace(
        go.Bar(
            name=model,
            x=list(result.keys()),
            y=list(result.values()),
        )
    )

fig.update_layout(
    title="Model Comparison",
    xaxis_title="Metrics",
    yaxis_title="Score",
    barmode="group",
)

fig.show()

### Model Evaluation Summary

All scores below refer to the positive class (Attrition = Yes) on the 294-row test set, which contains 48 leavers.

- **Logistic Regression** had the highest recall (0.81) but the second-lowest precision (0.24), giving an F1 score of 0.37. It detects most leavers, but roughly three out of four employees it flags actually stayed.

- **Random Forest** had the highest precision (0.62) but the lowest recall (0.38), with an F1 score of 0.47. Its attrition predictions are the most reliable, but it misses about five of every eight leavers.

- **Gradient Boosting** behaved much like Random Forest (precision 0.61, recall 0.40) and achieved the highest F1 score of all models (0.48). Its main weakness is likewise the number of false negatives.

- **SVM** reached a recall of 0.60 with a precision of 0.38 (F1 0.46). It offers the most even trade-off between the two metrics, though its precision would need to improve to reduce false positives.

- **Decision Tree** had the lowest F1 score (0.35), with both recall (0.40) and precision (0.31) on the low side. It misses most leavers and also flags many stayers.

- **KNN** achieved a high recall (0.71) but a low precision (0.30), for an F1 score of 0.42. It finds most leavers at the cost of a large number of false positives.

- **Naive Bayes** matched KNN's recall (0.71) but had the lowest precision of all models (0.23), for an F1 score of 0.35. It often misclassifies stayers as leavers.

### Conclusion:
The models trade recall against precision in different ways. **Logistic Regression**, **KNN**, and **Naive Bayes** catch most leavers (recall 0.71 to 0.81) but raise many false alarms (precision 0.23 to 0.30). **Random Forest** and **Gradient Boosting** have the highest precision (about 0.6) and the highest F1 scores (0.47 and 0.48), but their recall is only about 0.4, so the main room for improvement lies in catching more leavers.


### The confusion matrices for each model are summarized below:
<a id="confusion-matrices"></a>

In [56]:
for model, cm in confusion_matrix_results.items():
    print(f"Confusion Matrix for {model}:")
    print(cm)
    print("\n")


Confusion Matrix for Logistic Regression:
[[120 126]
 [  9  39]]


Confusion Matrix for Random Forest:
[[235  11]
 [ 30  18]]


Confusion Matrix for Gradient Boosting:
[[234  12]
 [ 29  19]]


Confusion Matrix for SVM:
[[198  48]
 [ 19  29]]


Confusion Matrix for Decision Tree:
[[204  42]
 [ 29  19]]


Confusion Matrix for KNN:
[[167  79]
 [ 14  34]]


Confusion Matrix for Naive Bayes:
[[135 111]
 [ 14  34]]
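
The recall, precision, and F1 values reported earlier follow directly from these matrices, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. A short worked example for Logistic Regression (the helper `scores_from_confusion` is defined here for illustration and is not part of the notebook):

```python
def scores_from_confusion(cm):
    """Return (recall, precision, f1) for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)        # share of actual leavers that were found
    precision = tp / (tp + fp)     # share of flagged employees who actually left
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Logistic Regression matrix copied from the output above.
recall, precision, f1 = scores_from_confusion([[120, 126], [9, 39]])
print(recall, precision, f1)
```

The result reproduces the 0.8125 recall, 0.236 precision, and 0.366 F1 score printed in the evaluation results.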




In [57]:
fig = make_subplots(rows=2, cols=4, subplot_titles=list(confusion_matrix_results.keys()))

for i, (model, cm) in enumerate(confusion_matrix_results.items()):
    fig.add_trace(
        go.Heatmap(
            z=cm,
            x=["No", "Yes"],
            y=["No", "Yes"],
            colorscale="Viridis",
            showscale=False,
        ),
        row=(i // 4) + 1,
        col=(i % 4) + 1,
    )

fig.update_layout(title="Confusion Matrix for all Models", height=800)
fig.show()

### Confusion Matrix Summary

Each matrix is laid out as `[[TN, FP], [FN, TP]]`; the test set contains 246 stayers and 48 leavers.

- **Logistic Regression** produces 126 false positives, more than half of all stayers, but only 9 false negatives, the fewest of any model. It is good at finding leavers but misclassifies a large number of stayers.

- **Random Forest** has the fewest false positives (11) but the most false negatives (30). It rarely raises a false alarm, yet it finds only 18 of the 48 leavers.

- **Gradient Boosting** is nearly identical to Random Forest, with 12 false positives and 29 false negatives. It finds 19 of the 48 leavers.

- **SVM** sits between the two groups, with 48 false positives and 19 false negatives. It finds 29 of the 48 leavers while flagging far fewer stayers than the high-recall models.

- **Decision Tree** finds the same 19 leavers as Gradient Boosting but with 42 false positives instead of 12, so it is weaker on both error types.

- **KNN** finds 34 of the 48 leavers at the cost of 79 false positives. Its false positive rate (79/246, about 32%) is slightly higher than its false negative rate (14/48, about 29%).

- **Naive Bayes** finds the same 34 leavers as KNN but produces 111 false positives, so it has more trouble avoiding false alarms.

### Conclusion:
**Random Forest** and **Gradient Boosting** make the fewest errors overall (41 each), but most of those errors are missed leavers. **Logistic Regression**, **KNN**, and **Naive Bayes** miss few leavers but flag between 79 and 126 stayers. Which profile is preferable depends on the relative cost of a missed leaver versus a false alarm.
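
The error rates behind this comparison can be computed directly from the printed matrices. The sketch below copies the seven matrices from the output above and derives the false positive rate (share of stayers wrongly flagged) and the false negative rate (share of leavers missed); the dictionary names are illustrative.

```python
# Matrices copied from the output above, laid out as [[TN, FP], [FN, TP]].
matrices = {
    "Logistic Regression": [[120, 126], [9, 39]],
    "Random Forest": [[235, 11], [30, 18]],
    "Gradient Boosting": [[234, 12], [29, 19]],
    "SVM": [[198, 48], [19, 29]],
    "Decision Tree": [[204, 42], [29, 19]],
    "KNN": [[167, 79], [14, 34]],
    "Naive Bayes": [[135, 111], [14, 34]],
}

rates = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    fpr = fp / (fp + tn)  # share of stayers wrongly flagged as leavers
    fnr = fn / (fn + tp)  # share of leavers the model missed
    rates[name] = (fpr, fnr)
    print(f"{name:<20} FPR={fpr:.2f}  FNR={fnr:.2f}")
```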


#### Since the Random Forest model gave the highest precision (0.62) and the second-highest F1 score (0.47), classifying negatives reliably while still capturing some positives, I am proceeding with fine-tuning the **RandomForestClassifier** to try to improve its recall and overall performance.


### Fine-tuning Random Forest Model with Cross-Validation
<a id="fine-tuning-random-forest-model-with-cross-validation"></a>

In [58]:
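from sklearn.model_selection import cross_val_score

# Hand-picked configuration: more trees, capped depth, built-in class balancing.
rf_model = RandomForestClassifier(
    class_weight='balanced',
    n_estimators=200,
    max_depth=10,
    random_state=42
)

# 5-fold cross-validated F1 on the raw training data
# (no scaling or SMOTE here, unlike the pipeline used above).
scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='f1')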

### Evaluating the Performance of the Fine-Tuned Random Forest Model
<a id="evaluating-the-performance-of-model"></a>


In [59]:
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

print(f"Recall: new {recall_score(y_test, y_pred)}, old: {results['Random Forest']['Recall']}")
print(f"Precision: new: {precision_score(y_test, y_pred)}, old: {results['Random Forest']['Precision']}")
print(f"F1 Score: new: {f1_score(y_test, y_pred)}, old: {results['Random Forest']['F1 Score']}")
print(f"Cross-Validation F1 Score: {scores.mean()}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Recall: new 0.20833333333333334, old: 0.375
Precision: new: 0.6666666666666666, old: 0.6206896551724138
F1 Score: new: 0.31746031746031744, old: 0.4675324675324675
Cross-Validation F1 Score: 0.24756150994270304
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.98      0.92       246
           1       0.67      0.21      0.32        48

    accuracy                           0.85       294
   macro avg       0.77      0.59      0.62       294
weighted avg       0.83      0.85      0.82       294

Confusion Matrix:
[[241   5]
 [ 38  10]]
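
The cell above evaluates a single hand-picked configuration. A systematic search over several configurations is one way to do the tuning; the following is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The grid values and the names `X_demo`, `y_demo`, and `param_grid` are placeholders for illustration, not recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Hypothetical grid; the values are placeholders, not tuned recommendations.
param_grid = {"max_depth": [3, 10, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # same metric as the cross-validation above
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Passing the full preprocessing and SMOTE pipeline as the estimator, instead of the bare classifier, would keep the tuned model comparable with the baseline results.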


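## Analysis of Updated Model Results

### Random Forest (Best Precision)
- The baseline pipeline had the **best precision** (0.62) and the second-best F1 score (0.47).
- **High performance** in correctly classifying the negative class (235 of 246 stayers), but it finds only 18 of the 48 leavers.
- The hand-tuned configuration did not improve on this: precision rose slightly to 0.67, but recall fell from 0.38 to 0.21 and the F1 score from 0.47 to 0.32 (cross-validated F1: 0.25). The tuned model was also fitted on the raw `X_train`, without the scaling and SMOTE steps of the pipeline, so the two runs are not directly comparable.

### Gradient Boosting (Best F1 Score)
- **Similar performance** to Random Forest (precision 0.61, recall 0.40), with the highest F1 score of all models (0.48).
- Shares Random Forest's weakness: most leavers are missed.

### High Recall but Low Precision:
- **Logistic Regression**:
  - **High recall** (0.81) but **low precision** (0.24), meaning it identifies most positive cases but incorrectly classifies many negatives as positives.
- **KNN** and **Naive Bayes**:
  - Both reach a recall of 0.71 at the cost of **low precision** (0.30 and 0.23), resulting in many false positives.

### Balanced Performance:
- **SVM**:
  - Shows the **most even distribution** between precision (0.38) and recall (0.60), with an F1 score of 0.46.

### Poor Performance:
- **Decision Tree** and **Naive Bayes**:
  - Both have the lowest F1 scores (0.35).

## **Best Model Choice**: 
   - **Random Forest** or **Gradient Boosting** (in their baseline pipeline versions)
   - Both provide the highest F1 scores and precision on this imbalanced dataset. If catching as many leavers as possible matters more than avoiding false alarms, **Logistic Regression** or **SVM** offers higher recall.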


## Visualizing Feature Importances of the Fine-Tuned Random Forest Model
<a id="visualizing-feature-importances"></a>


In [60]:
# Create DataFrame with feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

fig = px.bar(
    importances,
    x='feature',
    y='importance',
    title='Feature Importances',
    color='feature',
)

fig.update_layout(
    xaxis_title='Feature',
    yaxis_title='Importance',
)

fig.show()
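
The impurity-based `feature_importances_` plotted above are unsigned, are computed on the training data, and can favour features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; in the notebook, `rf_model`, `X_test`, and `y_test` would take the place of `model`, `X_te`, and `y_te`, which are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the resulting drop in F1.
result = permutation_importance(
    model, X_te, y_te, scoring="f1", n_repeats=5, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

Neither kind of importance shows the direction of an effect, so the sign has to come from elsewhere, for example from the correlations computed earlier.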

## Conclusion
### Top Reasons Why Employees Leave the Organization
<a id="top-reasons-why-employees-leave-the-organization"></a>

- **Overtime**: Overtime is the feature most strongly correlated with attrition (r ≈ 0.25), and the correlation is positive: employees who work overtime are more likely to leave, not less. Excessive working hours are a plausible explanation.

- **Monthly Income**: As expected, income is a major factor. Lower monthly income is associated with higher attrition (r ≈ -0.16), consistent with employees leaving in search of a better salary.

- **Age**: Age is negatively correlated with attrition (r ≈ -0.16), so it is younger employees, rather than those approaching retirement, who are more likely to leave.

Knowing the factors most strongly associated with attrition can help the organization take targeted action and reduce its attrition rate.
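
Feature importances do not say in which direction a feature acts, so a grouped attrition rate is a simple way to check the direction. The rows below are made up purely to show the pattern; on the real data the same `groupby` could be run on `df` with the `OverTime` and `AttritionValue` columns.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "AttritionValue": [True, True, False, False, False, True, False, False],
})

# The mean of a boolean column is the share of True values,
# i.e. the attrition rate within each group.
rate_by_overtime = toy.groupby("OverTime")["AttritionValue"].mean()
print(rate_by_overtime)
```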
