

# **Employee Retention Analysis**

The title fits because the dataset describes individual employees, and the goal is to **predict whether an employee will leave (`left = "Yes"`) or stay (`left = "No"`)**.

---

## **Dataset Overview**

The dataset contains information about employees' performance, working hours, promotions, salaries, and whether they eventually left the company.

### **Feature-by-Feature Explanation**

| **Column Name**             | **Description**                                       | **Data Type**            | **Example**             | **Business Meaning**                                                |
| --------------------------- | ----------------------------------------------------- | ------------------------ | ----------------------- | ------------------------------------------------------------------- |
| **satisfaction\_level**     | Employee's job satisfaction level                     | Float (0 to 1)           | 0.38                    | Lower values mean dissatisfaction, higher values mean satisfaction. |
| **last\_evaluation**        | Last performance evaluation score                     | Float (0 to 1)           | 0.53                    | Measures how well the employee performed in the last review.        |
| **number\_project**         | Number of projects the employee worked on             | Integer                  | 2                       | Higher values → more workload & responsibility.                     |
| **average\_montly\_hours**  | Average monthly working hours                         | Integer                  | 157                     | Helps identify overworked or underworked employees.                 |
| **time\_spend\_company**    | Number of years spent in the company                  | Integer                  | 3                       | Measures employee tenure.                                           |
| **Work\_accident**          | Whether the employee had a workplace accident         | Binary (0 = No, 1 = Yes) | 0                       | Used to see if accidents affect attrition.                          |
| **left**                    | **Target Column** → Did the employee leave?           | Binary (“Yes” / “No”)    | “Yes”                   | **This is what we predict.**                                        |
| **promotion\_last\_5years** | Whether the employee got promoted in the last 5 years | Binary                   | 0                       | Promotions affect employee motivation.                              |
| **Departments**             | Department name                                       | Categorical              | “sales”                 | Shows which team/department the employee belongs to.                |
| **salary**                  | Salary level                                          | Categorical              | “low”, “medium”, “high” | Used to analyze the effect of salary on retention.                  |

---

## **Target Variable**

* **Column:** `left`
* **Values:**

  * `"No"` → Employee stayed in the company
  * `"Yes"` → Employee left the company
* This is what we want to **predict** using machine learning.
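
In the loaded data, the `left` column holds the strings `"Yes"` and `"No"` (its dtype is `object`), so most estimators would need it encoded as numbers first. A minimal sketch of one way to do that, using made-up rows; the `left_flag` column name is hypothetical and not part of the dataset:

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"left": ["No", "Yes", "No", "Yes", "No"]})

# Map the string labels to integers: 1 = left, 0 = stayed.
# `left_flag` is a hypothetical column name.
toy["left_flag"] = toy["left"].map({"No": 0, "Yes": 1})

print(toy)
```

Keeping the original `left` column alongside the encoded one leaves the later `"Yes"` / `"No"` filters in this notebook unaffected.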

---

## **Business Goal**

The goal of this analysis is to:

* Identify the **key factors** that make employees leave.
* Quantify **how strongly each factor is associated with leaving or staying**, so **HR can target interventions**.
* Help HR improve **employee satisfaction** and **reduce turnover**.


---

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv("Employee_Retention_Dataset.csv")
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary
0,0.79,0.54,5,212,4,0,No,0,marketing,medium
1,0.17,0.60,5,144,6,0,No,0,technical,medium
2,0.91,0.77,4,167,3,1,No,0,IT,medium
3,0.65,1.00,4,195,3,0,No,0,IT,low
4,0.87,0.76,4,218,2,0,No,0,sales,low
...,...,...,...,...,...,...,...,...,...,...
14994,0.73,0.81,4,245,2,0,No,0,sales,low
14995,0.67,0.59,3,205,5,0,No,0,support,medium
14996,0.98,0.98,4,170,10,0,No,0,IT,medium
14997,0.73,0.99,6,206,5,0,Yes,0,sales,medium


In [None]:
# General information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  object 
 7   promotion_last_5years  14999 non-null  int64  
 8   Departments            14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(5), object(3)
memory usage: 1.1+ MB


In [None]:
# Dataset Shape
print("\nDataset Shape:", df.shape)


Dataset Shape: (14999, 10)


In [None]:
# Check Missing Values
df.isnull().sum()

satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
Departments              0
salary                   0
dtype: int64

In [None]:
df["left"].value_counts()

left
No     11428
Yes     3571
Name: count, dtype: int64

In [None]:
# Calculate Percentage Distribution
df["left"].value_counts(normalize=True) * 100

left
No     76.191746
Yes    23.808254
Name: proportion, dtype: float64

In [None]:
# Analyze the distribution of satisfaction_level
df["satisfaction_level"].value_counts(normalize=True) * 100

satisfaction_level
0.10    2.386826
0.11    2.233482
0.74    1.713448
0.77    1.680112
0.84    1.646776
          ...   
0.25    0.226682
0.28    0.206680
0.27    0.200013
0.12    0.200013
0.26    0.200013
Name: proportion, Length: 92, dtype: float64

In [None]:
# calculate the average of the satisfaction levels
df["satisfaction_level"].mean()

np.float64(0.6128335222348156)

In [None]:
# Label employees above the mean satisfaction (about 0.61) as "satisfied"
df["satisfy"] = np.where(df["satisfaction_level"] > 0.61, "satisfied", "not satisfied")
df["satisfy"]

0            satisfied
1        not satisfied
2            satisfied
3            satisfied
4            satisfied
             ...      
14994        satisfied
14995        satisfied
14996        satisfied
14997        satisfied
14998        satisfied
Name: satisfy, Length: 14999, dtype: object

In [None]:
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary,satisfy
0,0.79,0.54,5,212,4,0,No,0,marketing,medium,satisfied
1,0.17,0.60,5,144,6,0,No,0,technical,medium,not satisfied
2,0.91,0.77,4,167,3,1,No,0,IT,medium,satisfied
3,0.65,1.00,4,195,3,0,No,0,IT,low,satisfied
4,0.87,0.76,4,218,2,0,No,0,sales,low,satisfied
...,...,...,...,...,...,...,...,...,...,...,...
14994,0.73,0.81,4,245,2,0,No,0,sales,low,satisfied
14995,0.67,0.59,3,205,5,0,No,0,support,medium,satisfied
14996,0.98,0.98,4,170,10,0,No,0,IT,medium,satisfied
14997,0.73,0.99,6,206,5,0,Yes,0,sales,medium,satisfied


In [None]:
df.groupby(["left", "satisfy"]).size().unstack()

satisfy,not satisfied,satisfied
left,Unnamed: 1_level_1,Unnamed: 2_level_1
No,4331,7097
Yes,2606,965


In [None]:
# Separate "Yes" and "No" cases
df_left_yes = df[df["left"] == "Yes"]
df_left_no = df[df["left"] == "No"]

In [None]:
# Shape of the subset of employees who left
df_left_yes.shape

(3571, 11)

In [None]:
df_left_yes.head(3)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary,satisfy
9,0.41,0.48,2,155,3,0,Yes,0,marketing,low,not satisfied
12,0.74,0.97,4,228,5,0,Yes,0,hr,low,satisfied
13,0.4,0.5,2,129,3,0,Yes,0,IT,medium,not satisfied


In [None]:
# Shape of the subset of employees who stayed
df_left_no.shape

(11428, 11)

In [None]:
df_left_no.head(3)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary,satisfy
0,0.79,0.54,5,212,4,0,No,0,marketing,medium,satisfied
1,0.17,0.6,5,144,6,0,No,0,technical,medium,not satisfied
2,0.91,0.77,4,167,3,1,No,0,IT,medium,satisfied


In [None]:
# Randomly sample 3571 rows from "No" cases
df_no_sampled = df_left_no.sample(n=3571, random_state=17)  # random_state for reproducibility


In [None]:
df_no_sampled.shape

(3571, 11)

In [None]:
# Combine "Yes" cases with the reduced "No" cases
df_balanced = pd.concat([df_left_yes, df_no_sampled])

In [None]:
df_balanced["left"].head(3)

9     Yes
12    Yes
13    Yes
Name: left, dtype: object

In [None]:
df_balanced["left"].tail(3)

7516     No
9134     No
10781    No
Name: left, dtype: object

In [None]:
# Shuffle the rows so the two classes are interleaved, then rebuild the index
df_balanced = df_balanced.sample(frac=1, random_state=17).reset_index(drop=True)

In [None]:
df_balanced["left"].head(15)

0      No
1      No
2     Yes
3      No
4      No
5     Yes
6     Yes
7      No
8     Yes
9     Yes
10     No
11     No
12     No
13    Yes
14     No
Name: left, dtype: object

In [None]:
# Check new distribution
print(df_balanced["left"].value_counts())
print("New shape:", df_balanced.shape)

left
No     3571
Yes    3571
Name: count, dtype: int64
New shape: (7142, 11)


In [None]:
df_balanced["left"].value_counts(normalize=True) * 100

left
No     50.0
Yes    50.0
Name: proportion, dtype: float64

In [None]:
# Mean monthly hours, computed on the full dataset (df) rather than the balanced sample
monthly_hours = df["average_montly_hours"].mean()
monthly_hours

np.float64(201.0503366891126)

In [None]:
df_balanced["environment"] = np.where(df_balanced["average_montly_hours"] > monthly_hours, "bad", "good")
df_balanced[["environment", "average_montly_hours"]]

Unnamed: 0,environment,average_montly_hours
0,good,152
1,good,99
2,good,130
3,bad,244
4,good,126
...,...,...
7137,bad,265
7138,bad,232
7139,bad,263
7140,good,154


In [None]:
def categorize_leaving(status):
    """Map the environment label to a naive stay/leave guess."""
    if status == "good":
        return "will stay"
    elif status == "bad":
        return "will leave"

df_balanced["monthly_hours_and_environment"] = df_balanced["environment"].apply(categorize_leaving)

In [None]:
df_balanced

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary,satisfy,environment,monthly_hours_and_environment
0,0.47,0.46,2,152,2,0,No,0,IT,medium,not satisfied,good,will stay
1,0.61,0.39,3,99,2,0,No,0,support,low,not satisfied,good,will stay
2,0.43,0.51,2,130,3,0,Yes,0,IT,low,not satisfied,good,will stay
3,0.88,0.72,5,244,2,0,No,0,technical,low,satisfied,bad,will leave
4,0.79,0.83,3,126,10,1,No,0,support,low,satisfied,good,will stay
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7137,0.91,0.84,5,265,5,0,Yes,0,technical,medium,satisfied,bad,will leave
7138,0.90,1.00,5,232,5,0,Yes,0,technical,medium,satisfied,bad,will leave
7139,0.90,0.93,4,263,3,1,No,0,support,medium,satisfied,bad,will leave
7140,0.37,0.50,2,154,3,0,Yes,0,hr,medium,not satisfied,good,will stay


In [None]:
df_balanced["monthly_hours_and_environment"].value_counts()

monthly_hours_and_environment
will leave    3634
will stay     3508
Name: count, dtype: int64

In [None]:
# Calculate Percentage Distribution
df_balanced["monthly_hours_and_environment"].value_counts(normalize=True) * 100

monthly_hours_and_environment
will leave    50.882106
will stay     49.117894
Name: proportion, dtype: float64

In [None]:
df_balanced.groupby(["left", "monthly_hours_and_environment"]).size().unstack()

monthly_hours_and_environment,will leave,will stay
left,Unnamed: 1_level_1,Unnamed: 2_level_1
No,1719,1852
Yes,1915,1656


In [None]:
df_balanced["Work_accident"].value_counts()

Work_accident
0    6365
1     777
Name: count, dtype: int64

In [None]:
# Calculate Percentage Distribution
df_balanced["Work_accident"].value_counts(normalize=True) * 100

Work_accident
0    89.120694
1    10.879306
Name: proportion, dtype: float64

In [None]:
df_balanced.groupby(["left", "Work_accident"]).size().unstack()

Work_accident,0,1
left,Unnamed: 1_level_1,Unnamed: 2_level_1
No,2963,608
Yes,3402,169


In [None]:
df_balanced["time_spend_company"].value_counts()

time_spend_company
3     3091
4     1403
2     1076
5     1020
6      370
10      68
7       63
8       51
Name: count, dtype: int64

In [None]:
df_balanced["experience"] = pd.qcut(df_balanced["time_spend_company"],
                             q=3,
                             labels=["junior","mid_level","senior"])

In [None]:
df_balanced

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary,satisfy,environment,monthly_hours_and_environment,experience
0,0.47,0.46,2,152,2,0,No,0,IT,medium,not satisfied,good,will stay,junior
1,0.61,0.39,3,99,2,0,No,0,support,low,not satisfied,good,will stay,junior
2,0.43,0.51,2,130,3,0,Yes,0,IT,low,not satisfied,good,will stay,junior
3,0.88,0.72,5,244,2,0,No,0,technical,low,satisfied,bad,will leave,junior
4,0.79,0.83,3,126,10,1,No,0,support,low,satisfied,good,will stay,senior
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7137,0.91,0.84,5,265,5,0,Yes,0,technical,medium,satisfied,bad,will leave,senior
7138,0.90,1.00,5,232,5,0,Yes,0,technical,medium,satisfied,bad,will leave,senior
7139,0.90,0.93,4,263,3,1,No,0,support,medium,satisfied,bad,will leave,junior
7140,0.37,0.50,2,154,3,0,Yes,0,hr,medium,not satisfied,good,will stay,junior


In [None]:
df_balanced["experience"].value_counts()

experience
junior       4167
senior       1572
mid_level    1403
Name: count, dtype: int64
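
To see which tenure values end up in each experience band, the bin edges chosen by `pd.qcut` can be inspected with `retbins=True`. This sketch rebuilds the tenure column from the value counts printed above instead of loading the CSV, so the `tenure` series is a reconstruction for illustration:

```python
import numpy as np
import pandas as pd

# Tenure value counts copied from the notebook output for df_balanced.
tenure_counts = {2: 1076, 3: 3091, 4: 1403, 5: 1020, 6: 370, 7: 63, 8: 51, 10: 68}
tenure = pd.Series(
    np.repeat(list(tenure_counts.keys()), list(tenure_counts.values())),
    name="time_spend_company",
)

# retbins=True also returns the bin edges that qcut derived from the quantiles.
experience, edges = pd.qcut(
    tenure, q=3, labels=["junior", "mid_level", "senior"], retbins=True
)

print(edges)
print(experience.value_counts())
```

With these counts the edges should come out as 2, 3, 4, and 10, meaning `junior` covers 2 to 3 years, `mid_level` is exactly 4 years, and `senior` is 5 years or more. The bands are far from equal in size because so many employees share the same tenure value.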

In [None]:
# Calculate Percentage Distribution
df_balanced["experience"].value_counts(normalize=True) * 100

experience
junior       58.345001
senior       22.010641
mid_level    19.644357
Name: proportion, dtype: float64

In [None]:
df_balanced.groupby(["salary", "experience"]).size().unstack()


experience,junior,mid_level,senior
salary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
high,296,65,91
low,2208,773,832
medium,1663,565,649


In [None]:
df_balanced.groupby(["left", "experience"]).size().unstack()


experience,junior,mid_level,senior
left,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
No,2528,513,530
Yes,1639,890,1042


In [None]:
# General information
df_balanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7142 entries, 0 to 7141
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   satisfaction_level             7142 non-null   float64 
 1   last_evaluation                7142 non-null   float64 
 2   number_project                 7142 non-null   int64   
 3   average_montly_hours           7142 non-null   int64   
 4   time_spend_company             7142 non-null   int64   
 5   Work_accident                  7142 non-null   int64   
 6   left                           7142 non-null   object  
 7   promotion_last_5years          7142 non-null   int64   
 8   Departments                    7142 non-null   object  
 9   salary                         7142 non-null   object  
 10  satisfy                        7142 non-null   object  
 11  environment                    7142 non-null   object  
 12  monthly_hours_and_environment  714

In [None]:
# Check Missing Values
df_balanced.isnull().sum()

satisfaction_level               0
last_evaluation                  0
number_project                   0
average_montly_hours             0
time_spend_company               0
Work_accident                    0
left                             0
promotion_last_5years            0
Departments                      0
salary                           0
satisfy                          0
environment                      0
monthly_hours_and_environment    0
experience                       0
dtype: int64

# Insights

##### Main Dataset
- No missing values in the dataset.  
- Employees who **left**: **3571**, while those who **stayed**: **11428**.  
- Percentage of employees who **left**: **23.8%**, while those who **stayed**: **76.2%**.  
---
- The average satisfaction level: **0.61** → **61%**.  
---
- Of the **6937** **not satisfied** employees, **2606** left → **37.6%**.  
- Of the **8062** **satisfied** employees, **965** left → **12.0%**.  
- Of the **6937** **not satisfied** employees, **4331** stayed → **62.4%**.  
- Of the **8062** **satisfied** employees, **7097** stayed → **88.0%**.  
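
The shares within each satisfaction group can be recomputed from the `left` by `satisfy` count table. A small sketch that re-enters those four counts by hand rather than loading the CSV:

```python
import pandas as pd

# Counts copied from the left-by-satisfy table produced earlier in the notebook.
counts = pd.DataFrame(
    {"not satisfied": [4331, 2606], "satisfied": [7097, 965]},
    index=pd.Index(["No", "Yes"], name="left"),
)

# Divide each column by its total so every satisfaction group sums to 100%.
share = (counts / counts.sum(axis=0) * 100).round(1)
print(share)
```

On the full frame, `pd.crosstab(df["left"], df["satisfy"], normalize="columns")` should give the same proportions without the intermediate count table.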

---

##### After Down-Sampling
- Employees who **left**: **3571** out of **7142** → **50%**.  
- Employees who **stayed**: **3571** out of **7142** → **50%**.  
---
- Average monthly working hours: **201** (computed on the full dataset).  
- If monthly working hours are above the average (about **201**) → environment is considered **Bad**, else → **Good**.  
---
- Based on **Monthly Hours / Environment** prediction:  
  - Predicted to leave: **3634** out of **7142** → **50.88%**.  
  - Predicted to stay: **3508** out of **7142** → **49.12%**.  
---
- Validation:  
  - Actually left from predicted-to-leave: **1915** out of **3634** → **52.7%**.  
  - Actually stayed from predicted-to-stay: **1852** out of **3508** → **52.8%**.  
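
The two validation rates can be combined into a single accuracy figure for the monthly-hours rule. This is plain arithmetic on the counts from the `left` by `monthly_hours_and_environment` table; the variable names are illustrative:

```python
# Counts copied from the left-by-rule table produced earlier in the notebook.
left_and_flagged = 1915     # left == "Yes" and the rule said "will leave"
stayed_and_cleared = 1852   # left == "No" and the rule said "will stay"
total = 7142                # rows in the balanced sample

# Accuracy = correct guesses / all rows.
accuracy = (left_and_flagged + stayed_and_cleared) / total
print(f"Rule accuracy: {accuracy:.1%}")
```

On a 50/50 balanced sample, always guessing the same class already scores 50%, so a result only a few points above that suggests monthly hours alone, split at the mean, separates leavers from stayers weakly.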

---

##### Work Accident
- Work Accident **True**: **777** out of **7142** → **10.88%**.  
- Work Accident **False**: **6365** out of **7142** → **89.12%**.  
---
- Work Accident **True** and left: **169** out of **777** → **21.75%**.  
- Work Accident **True** and stayed: **608** out of **777** → **78.25%**.  
- Work Accident **False** and left: **3402** out of **6365** → **53.4%**.  
- Work Accident **False** and stayed: **2963** out of **6365** → **46.6%**.  
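
On row-level data, per-group attrition rates like these can be computed directly, without building a count table first. A minimal sketch on made-up rows (the values below are not from the dataset):

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Work_accident": [0, 0, 0, 0, 1, 1],
    "left": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the attrition rate (in percent) within each accident group.
rate = toy["left"].eq("Yes").groupby(toy["Work_accident"]).mean() * 100
print(rate)
```

The same pattern should work for any grouping column in `df_balanced`, such as `salary` or `experience`.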

---

##### Experience Levels
- From **7142** employees:  
  - **Juniors**: **4167** → **58.3%**.  
  - **Mid-Levels**: **1403** → **19.6%**.  
  - **Seniors**: **1572** → **22%**.  

---

##### Salary Distribution
- **Juniors (4167 total)**:  
  - Low Salary: **2208** → **53%**.  
  - Medium Salary: **1663** → **39.9%**.  
  - High Salary: **296** → **7.1%**.  

- **Mid-Levels (1403 total)**:  
  - Low Salary: **773** → **55.1%**.  
  - Medium Salary: **565** → **40.3%**.  
  - High Salary: **65** → **4.6%**.  

- **Seniors (1572 total)**:  
  - Low Salary: **832** → **52.9%**.  
  - Medium Salary: **649** → **41.3%**.  
  - High Salary: **91** → **5.8%**.  

---

##### Attrition by Experience Level
- **Juniors** who left: **1639** out of **4167** → **39.3%**.  
- **Mid-Levels** who left: **890** out of **1403** → **63.4%**.  
- **Seniors** who left: **1042** out of **1572** → **66.3%**.  