<a href="https://colab.research.google.com/github/f-flavia/HR-IBM/blob/main/HR_Analytics_and_Prediction_of_Employee_Attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

## Employee Retention Strategies: Analyzing Key Drivers of Employee Attrition

Employee retention strategies are crucial for the long-term success and well-being of any organization. High employee attrition can have a negative impact on productivity, morale, and ultimately, the bottom line. In this case study, we will delve into the key drivers of employee attrition by analyzing IBM's HR Analytics dataset. This dataset contains valuable information about nearly 1,500 current and former employees, including their job satisfaction, work-life balance, tenure, experience, salary, and demographic data. By exploring this dataset, we aim to gain insights that can help organizations develop effective employee retention strategies.

Before delving into the analysis, let's first examine some summary statistics and provide an overview of the dataset. The dataset offers a comprehensive view of employees' experiences and characteristics, enabling us to identify patterns and potential factors contributing to attrition.

The IBM HR Analytics dataset provides a wealth of information that can be used to understand the underlying causes of attrition. Here are some key features of the dataset:

- **Job Satisfaction:** This variable measures employees' satisfaction with their current roles and responsibilities. Higher job satisfaction is often associated with better retention rates.
- **Work-Life Balance:** Work-life balance is a critical factor in employee satisfaction and well-being. This variable helps us assess whether employees perceive a healthy balance between their personal and professional lives.
- **Tenure:** Tenure refers to the length of time an employee has been with the company. Longer tenures are generally indicative of higher employee loyalty and lower attrition rates.
- **Experience:** The dataset includes information about employees' overall experience in years. Experience plays a role in job satisfaction and may influence attrition rates.
- **Salary:** Compensation is a crucial aspect of employee satisfaction and retention. Analyzing salary data can help identify if there is a correlation between pay and attrition.
- **Demographic Data:** The dataset includes demographic information such as age, gender, and education level. Exploring these factors can provide insights into whether certain demographics are more prone to attrition.

By examining these variables in the dataset, we can uncover trends and patterns that shed light on the drivers of employee attrition. This knowledge can then be leveraged to design effective strategies to improve retention rates and create a more engaged and satisfied workforce.

In the subsequent sections of this case study, we will conduct a detailed analysis of the dataset, exploring relationships between various factors and employee attrition. By identifying significant drivers of attrition, organizations can proactively address these issues and implement targeted retention strategies.

Employee retention is a continuous process that requires ongoing attention and effort. By understanding the factors that contribute to attrition, organizations can create a supportive and engaging work environment, leading to higher employee satisfaction and retention. In the following analysis, we will uncover valuable insights that can help organizations thrive in today's competitive job market.

#### Attrition

Employee attrition refers to a company’s strategic decision not to replace an employee who leaves voluntarily.

#### Turnover

Turnover is more accurately described as the rate at which an organization replaces departing employees over a given period, whether the departures are voluntary or involuntary. Unlike attrition, a high employee turnover rate can indicate problems within an organization.

#### Churn

Employee churn refers to the combination of an organization’s attrition rate and turnover rate. Churn can include both voluntary and involuntary departures.

In [41]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

In [42]:
hr = pd.read_csv('hr_analy_pred.csv')

print("There are {:,} rows and {} columns in the data.".format(hr.shape[0], hr.shape[1]))
print("There are {} missing values in the data.".format(hr.isnull().sum().sum()))

There are 1,470 rows and 35 columns in the data.
There are 0 missing values in the data.


In [43]:
hr.head()

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## Summary statistics of numeric variables

In [44]:
hr.describe()

,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,2.912925,1.0,1024.865306,2.721769,65.891156,2.729932,2.063946,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,1.024165,0.0,602.024335,1.093082,20.329428,0.711561,1.10694,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


## Summary statistics of categorical variables

In [45]:
# Categorical Variables

cat_cols=hr.select_dtypes(include=object).columns.tolist() # tolist() is used to convert a series to a list.
cat_df=pd.DataFrame(hr[cat_cols].melt(var_name='column', value_name='value') # melt() unpivots a wide format DataFrame to long format, optionally leaving the identifier variables set.
                    .value_counts()).rename(columns={0: 'count'}).sort_values(by=['column', 'count'])

display(hr.select_dtypes(include=object).describe())
display(cat_df)

,Attrition,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
count,1470,1470,1470,1470,1470,1470,1470,1470,1470
unique,2,3,3,6,2,9,3,1,2
top,No,Travel_Rarely,Research & Development,Life Sciences,Male,Sales Executive,Married,Y,No
freq,1233,1043,961,606,882,326,673,1470,1054


column,value,count
Attrition,Yes,237
Attrition,No,1233
BusinessTravel,Non-Travel,150
BusinessTravel,Travel_Frequently,277
BusinessTravel,Travel_Rarely,1043
Department,Human Resources,63
Department,Sales,446
Department,Research & Development,961
EducationField,Human Resources,27
EducationField,Other,82
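
The long-format count table above comes from unpivoting the categorical columns and counting each (column, value) pair. A minimal sketch of that idea on a small made-up frame follows; the `toy` data is hypothetical, and `groupby(...).size()` is swapped in for `value_counts()` here so that the name of the count column does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the categorical columns of hr.
toy = pd.DataFrame({
    'Gender':   ['Male', 'Female', 'Male'],
    'OverTime': ['No', 'No', 'Yes'],
})

# melt() stacks every column into two columns: the original column name and its value.
long_df = toy.melt(var_name='column', value_name='value')

# Count each (column, value) pair and sort within each column by frequency.
counts = (long_df.groupby(['column', 'value']).size()
                 .rename('count').reset_index()
                 .sort_values(['column', 'count']))
print(counts)
```

Each original column contributes one row per observation to `long_df`, so a frame with 3 rows and 2 columns melts into 6 rows before counting.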


# Exploratory Data Analysis

In [46]:
# Quantity of attrition

display(hr['Attrition'].value_counts())

No     1233
Yes     237
Name: Attrition, dtype: int64

In [47]:
# Plotting Employee Attrition

plot_df=hr['Attrition'].value_counts(normalize=True) # normalize=True - is used to turn the counts into proportions of the total
plot_df=plot_df.mul(100).rename('Percent').reset_index().sort_values('Percent')
plot_df.rename(columns={'index':'Attrition'}, inplace=True)
plot_df['Attrition']=['Former Employees' if i == 'Yes' else 'Current Employees' for i in plot_df['Attrition']]
x=plot_df['Attrition']
y=plot_df['Percent']
fig = px.bar(plot_df, x=x, y=y, text=y,opacity=.8, title= 'Employee Attrition', width=800, height=550)
fig.update_traces(texttemplate='%{text:.2s}%', textposition='outside',
                  marker_line=dict(width=1, color='#1F0202'), marker_color=['lightskyblue','gold'])
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='gray')
fig.update_layout(yaxis_ticksuffix = '%')
fig.show()

Of the 1,470 employees, 16% left the company.
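
As a quick sanity check, the percentage can be recomputed directly from the two counts printed by `value_counts()` above. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
stayed = 1233
left = 237

total = stayed + left
attrition_rate = left / total * 100

print(f"{left} of {total:,} employees left: {attrition_rate:.1f}%")
```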

In [48]:
# Quantity of male and female

display(hr['Gender'].value_counts())

Male      882
Female    588
Name: Gender, dtype: int64

In [49]:
# Plotting percent of Employees by Gender

plot_df=hr['Gender'].value_counts(normalize=True)
plot_df=plot_df.mul(100).rename('Percent').reset_index().sort_values('Percent')
plot_df.rename(columns={'index':'Gender'}, inplace=True)
fig = px.bar(plot_df, x='Gender', y='Percent', opacity=.8, title= 'Employee by Gender', text=plot_df['Percent'], width=800, height=550)
fig.update_traces(texttemplate='%{text:.0s}%', textposition='outside',
                  marker_line=dict(width=1, color='#1F0202'), marker_color=['#C02B34','#CDBBA7'])
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='gray')
fig.update_layout(xaxis_title='Gender', yaxis_ticksuffix = '%')
fig.show()


Men make up 60% of employees and women 40%, a gap of 20 percentage points.
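
The size of the gap depends on how it is measured: as a difference between the two shares (percentage points) or relative to the number of women. A small arithmetic sketch using the counts printed above (882 and 588); the variable names are illustrative.

```python
# Counts copied from the Gender value_counts() output above.
male, female = 882, 588
total = male + female

male_share = male / total * 100
female_share = female / total * 100

# Difference between the two shares, in percentage points.
point_gap = male_share - female_share
# How many more men there are, relative to the number of women.
relative_gap = (male - female) / female * 100

print(f"Shares: {male_share:.0f}% male, {female_share:.0f}% female")
print(f"Gap: {point_gap:.0f} percentage points, {relative_gap:.0f}% more men than women")
```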

### Features distribution

In [10]:
# Plotting Attrition by Gender

plot_df = hr.groupby(['Gender'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('Gender')
fig = px.bar(plot_df, x="Gender", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Gender', bargap=.09,font_color='#28221D',
                  xaxis_title='Gender',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Attrition is slightly higher among men: 17% of them left the company, compared with 15% of women.

In [50]:
# Transforming numerical data "Age" into categorical data "AgeGroup"

bins=[0, 20, 30, 40, 50, 60]
labels=['18 to 20', '20 to 30', '30 to 40', '40 to 50', '50 to 60']
hr['AgeGroup'] = pd.cut(hr['Age'], bins, labels=labels)
hr['AgeGroup']=hr['AgeGroup'].astype('str')

In [12]:
# Plotting Attrition by Employee Age

plot_df = hr.groupby(['AgeGroup'])['Attrition'].value_counts()
plot_df = plot_df.rename('Quantity').reset_index()
fig = px.bar(plot_df, x="AgeGroup", y="Quantity", color="Attrition", barmode="group", text_auto=True,
             category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=400)
fig.update_traces(textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='#1F0202')
fig.update_layout(title_text='Attrition by Employee Age',height=750,font_color='#28221D',
                  paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0', xaxis_title='Employee Age')
fig.show()

Of the 28 employees aged 18 to 20, 16 left the company. Almost a quarter of employees in the 20 to 30 group left as well.
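
Because the group labels share their boundary ages ('18 to 20' and '20 to 30'), it helps to see where a boundary value actually lands. `pd.cut` uses right-inclusive intervals by default, so with the bins defined earlier an employee aged exactly 20 falls in '18 to 20' and one aged 30 in '20 to 30'. A minimal sketch with a few hypothetical ages:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([18, 20, 21, 30, 31, 60])

# Same bins and labels as in the notebook.
bins = [0, 20, 30, 40, 50, 60]
labels = ['18 to 20', '20 to 30', '30 to 40', '40 to 50', '50 to 60']

# right=True is the default: each interval includes its upper edge, e.g. (20, 30].
groups = pd.cut(ages, bins, labels=labels)
print(pd.DataFrame({'Age': ages, 'AgeGroup': groups}))
```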

In [13]:
# Plotting Attrition by Employee Age and Gender

plot_df = hr.groupby(['AgeGroup','Gender'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index()
fig = px.bar(plot_df, x='AgeGroup', y='Percent', color='Attrition',
             facet_row='Gender', text='Percent', opacity=0.75, barmode='group',
             category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'}, width=800, height=400)
fig.update_traces(texttemplate='%{text:.2s}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#1F0202',ticksuffix = '%')
fig.update_layout(title_text='Attrition Rates by Employee Age and Gender',height=750,font_color='#28221D',
                  xaxis_title='Employee Age', paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0',
                  xaxis = dict(tickmode = 'array'))
fig.show()

For both men and women, attrition is highest among employees aged 18 to 30.

In [14]:
# Plotting Attrition by Number Companies Worked

plot_df = hr.groupby(['Attrition'])['NumCompaniesWorked'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('NumCompaniesWorked')
fig = px.bar(plot_df, x="NumCompaniesWorked", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=450)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Number Companies Worked', bargap=.09,font_color='#28221D',
                  xaxis_title='Number Companies Worked',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()


Among the employees who left, over half had worked for 0 or 1 company.
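
The plots in this section normalize in two different directions, depending on which column is passed to `groupby`. Grouping by the feature gives the attrition rate within each feature value; grouping by `Attrition` gives the make-up of leavers and stayers. A minimal sketch on a small made-up frame (the `toy` data is hypothetical) shows that the two views give different percentages:

```python
import pandas as pd

# Hypothetical data: 4 employees with 1 previous company, 6 with 4.
toy = pd.DataFrame({
    'NumCompaniesWorked': [1, 1, 1, 1, 4, 4, 4, 4, 4, 4],
    'Attrition': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No'],
})

# Attrition rate within each feature value: "what share of this group left?"
rate_by_feature = (toy.groupby('NumCompaniesWorked')['Attrition']
                      .value_counts(normalize=True).mul(100).rename('Percent'))

# Composition of each attrition group: "what share of leavers had this value?"
share_within_attrition = (toy.groupby('Attrition')['NumCompaniesWorked']
                             .value_counts(normalize=True).mul(100).rename('Percent'))

print(rate_by_feature.round(1))
print(share_within_attrition.round(1))
```

In this toy frame, half of the employees with 1 previous company left, while two thirds of all leavers had 1 previous company. Statements beginning "Among employees who left" correspond to the second view.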


In [15]:
# Plotting Attrition by Total Working Years

plot_df = hr.groupby(['TotalWorkingYears'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('TotalWorkingYears')
fig = px.bar(plot_df, x="TotalWorkingYears", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=1000, height=400)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Total Working Years', bargap=.09,font_color='#28221D',
                  xaxis_title='Total Working Years',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Attrition is highest among employees with 1 year or less of total working experience: almost half of them left the company.

In [16]:
# Plotting Attrition by Training Times Last Year

plot_df = hr.groupby(['TrainingTimesLastYear'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('TrainingTimesLastYear')
fig = px.bar(plot_df, x="TrainingTimesLastYear", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=430)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Training Times Last Year', bargap=.09,font_color='#28221D',
                  xaxis_title='Training Times Last Year',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Almost a third of employees who received no training last year left the company.

In [17]:
# Plotting Attrition by Years At Company

plot_df = hr.groupby(['Attrition'])['YearsAtCompany'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('YearsAtCompany')
fig = px.bar(plot_df, x="YearsAtCompany", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=1000, height=400)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Years at Company', bargap=.09,font_color='#28221D',
                  xaxis_title='Years at Company',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Over a third of employees who had spent 0 or 1 year at the company left. Attrition is very low after 11 years at the company.

In [18]:
# Plotting Attrition by Years In Current Role

plot_df = hr.groupby(['YearsInCurrentRole'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('YearsInCurrentRole')
fig = px.bar(plot_df, x="YearsInCurrentRole", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=1000, height=400)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Years in Current Role', bargap=.09,font_color='#28221D',
                  xaxis_title='Years in Current Role',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

30% of employees with 0 years in their current role left, as did nearly 20% of those with 1 or 2 years in their current role.

In [19]:
# Plotting Attrition by Years Since Last Promotion

plot_df = hr.groupby(['Attrition'])['YearsSinceLastPromotion'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('YearsSinceLastPromotion')
fig = px.bar(plot_df, x="YearsSinceLastPromotion", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=400)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Years Since Last Promotion', bargap=.09,font_color='#28221D',
                  xaxis_title='Years Since Last Promotion',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among the employees who left the company, 67% had their last promotion 0 or 1 year ago.

In [20]:
#Plotting Attrition by Years With Current Manager

plot_df = hr.groupby(['YearsWithCurrManager'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('YearsWithCurrManager')
fig = px.bar(plot_df, x="YearsWithCurrManager", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=1000, height=400)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Years With Current Manager', bargap=.09,font_color='#28221D',
                  xaxis_title='Years With Current Manager',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Over 30% of employees that worked 0 years with the current manager left the company.

In [21]:
#Plotting Attrition by Over Time

plot_df = hr.groupby(['Attrition'])['OverTime'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('OverTime')
fig = px.bar(plot_df, x="OverTime", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Over Time', bargap=.09,font_color='#28221D',
                  xaxis_title='Over Time',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, 54% worked overtime.

In [22]:
#Plotting Attrition by Marital Status

plot_df = hr.groupby(['MaritalStatus'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('MaritalStatus')
fig = px.bar(plot_df, x="MaritalStatus", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Marital Status', bargap=.09,font_color='#28221D',
                  xaxis_title='Marital Status',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Attrition is highest among single employees: over a quarter of them left the company.

In [23]:
#Plotting Attrition by Education Field

plot_df = hr.groupby(['EducationField'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('EducationField')
fig = px.bar(plot_df, x="EducationField", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=450)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Education Field', bargap=.09,font_color='#28221D',
                  xaxis_title='Education Field',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

About 25% of employees whose education field is Human Resources, Marketing, or Technical Degree left the company.

In [24]:
#Plotting Attrition by Education Level

plot_df = hr.groupby(['Attrition'])['Education'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('Education')
fig = px.bar(plot_df, x="Education", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=450)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Education Level', bargap=.09,font_color='#28221D',
                  xaxis_title='Education Level',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, 42% had Education level 3.

In [25]:
#Plotting Attrition by Environment Satisfaction

plot_df = hr.groupby(['Attrition'])['EnvironmentSatisfaction'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('EnvironmentSatisfaction')
plot_df.EnvironmentSatisfaction=pd.Categorical(plot_df.EnvironmentSatisfaction).rename_categories(
    {1:'Poor', 2:'Neutral', 3:'Good', 4:'Excellent'})
fig = px.bar(plot_df, x="EnvironmentSatisfaction", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Environment Satisfaction', bargap=.09,font_color='#28221D',
                  xaxis_title='Environment Satisfaction',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, a slim majority were satisfied with their work environment, with over 51% rating their environment satisfaction as good or excellent, while 30.3% gave it the lowest rating (poor).
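
The rating names on the x-axis come from mapping the numeric survey codes 1 to 4 onto labels. A minimal sketch of that mapping on a few hypothetical codes, followed by the share of each label (the `codes` series is made up for illustration):

```python
import pandas as pd

# Hypothetical survey codes; the notebook applies the same mapping to its satisfaction columns.
codes = pd.Series([1, 2, 3, 4, 1, 3])
rating_names = {1: 'Poor', 2: 'Neutral', 3: 'Good', 4: 'Excellent'}

# Convert to a categorical and swap each numeric category for its label.
labeled = pd.Categorical(codes).rename_categories(rating_names)

# Percentage of responses per label.
share = pd.Series(labeled).value_counts(normalize=True).mul(100)
print(share.round(1))
```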

In [26]:
#Plotting Attrition by Job Involvement

plot_df = hr.groupby(['Attrition'])['JobInvolvement'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('JobInvolvement')
plot_df.JobInvolvement=pd.Categorical(plot_df.JobInvolvement).rename_categories(
    {1:'Poor', 2:'Neutral', 3:'Good', 4:'Excellent'})
fig = px.bar(plot_df, x="JobInvolvement", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Job Involvement', bargap=.09,font_color='#28221D',
                  xaxis_title='Job Involvement',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, the majority reported positive job involvement, with 52.7% rating it as good and 5.4% as excellent, while 11.8% rated their job involvement as poor.

In [27]:
#Plotting Attrition by Performance Rating

plot_df = hr.groupby(['Attrition'])['PerformanceRating'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('PerformanceRating')
fig = px.bar(plot_df, x="PerformanceRating", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Performance Rating', bargap=.09,font_color='#28221D',
                  xaxis_title='Performance Rating',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()


Among the employees who left, 84.3% had a performance rating of 3.

In [28]:
#Plotting Attrition by Relationship Satisfaction

plot_df = hr.groupby(['Attrition'])['RelationshipSatisfaction'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('RelationshipSatisfaction')
plot_df.RelationshipSatisfaction=pd.Categorical(plot_df.RelationshipSatisfaction).rename_categories(
    {1:'Poor', 2:'Neutral', 3:'Good', 4:'Excellent'})
fig = px.bar(plot_df, x="RelationshipSatisfaction", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Relationship Satisfaction', bargap=.09,font_color='#28221D',
                  xaxis_title='Relationship Satisfaction',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, the majority were satisfied with their workplace relationships, with 30% rating their relationship satisfaction as good and 27% as excellent, while 24% rated it as poor.

In [29]:
#Plotting Attrition by Job Satisfaction

plot_df = hr.groupby(['Attrition'])['JobSatisfaction'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('JobSatisfaction')
plot_df.JobSatisfaction=pd.Categorical(plot_df.JobSatisfaction).rename_categories(
    {1:'Poor', 2:'Neutral', 3:'Good', 4:'Excellent'})
fig = px.bar(plot_df, x="JobSatisfaction", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Job Satisfaction', bargap=.09,font_color='#28221D',
                  xaxis_title='Job Satisfaction',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who left, a slim majority were satisfied with their job, with 31% rating their job satisfaction as good and 22% as excellent, while 28% gave it the lowest rating (poor).

In [30]:
#Plotting Attrition by Job Roles

plot_df = hr.groupby(['JobRole'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('JobRole')
fig = px.bar(plot_df, x="JobRole", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=500)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Job Role', bargap=.09,font_color='#28221D',
                  xaxis_title='Job Role',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

40% of Sales Representatives left the company, as did nearly a quarter of Human Resources staff and Laboratory Technicians, while only 3% of Research Directors left.

In [31]:
# Plotting Attrition by Business Travel

plot_df = hr.groupby(['BusinessTravel'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index().sort_values('BusinessTravel')
fig = px.bar(plot_df, x="BusinessTravel", y="Percent", color="Attrition", barmode="group", text_auto=True, category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#214D5C','No': '#ACBCC2'}, width=800, height=450)
fig.update_traces(textposition='outside', marker_line=dict(width=1, color='#28221D'),texttemplate = "%{y:.0f}%")
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#28221D', ticksuffix='%')
fig.update_layout(title_text='Attrition Rates by Business Travel', bargap=.09,font_color='#28221D',
                  xaxis_title='Business Travel',paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Among employees who travel frequently, 25% left the company.

In [32]:
# Plotting Median Salaries by Department and Attrition Status

plot_df = hr.groupby(['Department', 'Attrition', 'Gender'])['MonthlyIncome'].median()
plot_df = plot_df.mul(12).rename('Salary').reset_index().sort_values('Salary', ascending=False).sort_values('Gender') # multiply monthly income by 12 to get annual salary
fig = px.bar(plot_df, x='Department', y='Salary', color='Gender', text='Salary',
             barmode='group', opacity=0.75, color_discrete_map={'Female': '#214D5C','Male': '#ACBCC2'},
             facet_col='Attrition', category_orders={'Attrition': ['Yes', 'No']}, width=800, height=400)
fig.update_traces(texttemplate='$%{text:,.0f}', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='#28221D')
fig.update_layout(title_text='Median Salaries by Department and Attrition Status', font_color='#28221D',
                  yaxis=dict(title='Salary',tickprefix='$',range=(0,79900)),width=950,height=500,
                  paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

In comparison to current employees, former employees had lower median salaries across all three departments. In Human Resources, women tend to have higher median salaries than men.

In [33]:
# Plotting Attrition by Work Life Balance and Gender

plot_df = hr.groupby(['WorkLifeBalance','Gender'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index()
fig = px.bar(plot_df, x='WorkLifeBalance', y='Percent', color='Attrition',
             facet_row='Gender', text='Percent', opacity=0.75, barmode='group',
             category_orders={'Attrition': ['Yes', 'No']},
             color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'}, width=800, height=400)
fig.update_traces(texttemplate='%{text:.2s}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(title="",zeroline=True, zerolinewidth=1, zerolinecolor='#1F0202',ticksuffix = '%')
fig.update_layout(title_text='Attrition Rates by Work Life Balance and Gender',height=750,font_color='#28221D',
                  xaxis_title='Work Life Balance', paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0',
                  xaxis = dict(tickmode = 'array', tickvals = [1, 2, 3, 4],
                               ticktext = ['Poor', 'Fair', 'Good', 'Excellent']))
fig.show()


Among women with the highest-rated work-life balance, 1 out of every 4 left the company. For men, attrition was highest among those with the lowest-rated work-life balance.

In [34]:
# Plotting Attrition by Gender and Department

plot_df = hr.groupby(['Gender','Department'])['Attrition'].value_counts(normalize=True)
plot_df = plot_df.mul(100).rename('Percent').reset_index()
fig = px.bar(plot_df, x="Department", y="Percent", color="Attrition", barmode="group",
            text='Percent', opacity=.75, facet_col="Gender", category_orders={'Attrition': ['Yes', 'No']},
            color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'}, width=800, height=400)
fig.update_traces(texttemplate='%{text:.0f}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'),  width=.4)
fig.update_layout(title_text='Attrition Rates by Department and Gender', yaxis_ticksuffix = '%',
                  paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0',font_color='#28221D',
                  height=500, xaxis=dict(tickangle=30))
fig.update_xaxes(showticklabels=True,tickangle=30,col=2)
fig.update_yaxes(title = "", zeroline=True, zerolinewidth=1, zerolinecolor='#28221D')
fig.show()

Women in Human Resources had the highest turnover rate, with nearly 1 out of every 3 women in HR leaving the company. For men, the highest turnover occurred in the Sales department.

In [35]:
# Plotting Average Salaries by Department and Gender

plot_df = hr.groupby(['Department', 'Gender'])['MonthlyIncome'].mean()
plot_df = plot_df.mul(12).rename('Salary').reset_index().sort_values('Salary', ascending=False)
fig = px.bar(plot_df, x='Department', y='Salary', color='Gender', text='Salary',
             barmode='group', opacity=0.75, color_discrete_map={'Female': '#214D5C','Male': '#ACBCC2'}, width=800, height=450)
fig.update_traces(texttemplate='$%{text:,.0f}', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='#28221D')
fig.update_layout(title_text='Average Salaries by Department & Gender', font_color='#28221D',
                  yaxis=dict(title='Salary',tickprefix='$'), paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')

fig.show()

Across each department, women on average have higher salaries than men.
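
The comparison can also be read off a table rather than the chart. The sketch below pivots average annual salary by department and gender and adds the difference as a column. The toy data and the `Gap` column name are made up for illustration; only the column names `Department`, `Gender` and `MonthlyIncome` come from the notebook.

```python
import pandas as pd

# Toy data (made up for illustration, not the IBM dataset)
toy = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Sales', 'Sales', 'HR', 'HR'],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'MonthlyIncome': [6000, 5000, 7000, 6000, 4000, 3500],
})

# Mean monthly income per department and gender, scaled to a yearly figure
annual = toy.pivot_table(index='Department', columns='Gender',
                         values='MonthlyIncome', aggfunc='mean').mul(12)

# Positive values mean women earn more on average in that department
annual['Gap'] = annual['Female'] - annual['Male']
print(annual)
```

Replacing `toy` with `hr` would produce the same figures as the bar labels above, with the gap made explicit.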

In [36]:
# Plotting Average Annual Salary by Job Role.

plot_df = hr.groupby('JobRole')['MonthlyIncome'].mean()
plot_df = plot_df.mul(12).rename('Salary').reset_index().sort_values('Salary', ascending=False)
fig = px.bar(plot_df, x='JobRole', y='Salary', text='Salary', opacity=0.7, width=800)
fig.update_traces(texttemplate='$%{text:,.0f}', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'), marker_color='#3A5F53')
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='#28221D')
fig.update_layout(title_text='Average Salaries by Job Role', font_color='#28221D',
                  yaxis=dict(title='Salary',tickprefix='$'), height=500,
                  xaxis_title='', paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Managers and Research Directors have the highest average salaries, while Laboratory Technicians and Sales Representatives have the lowest.

In [37]:
# Plotting Monthly income by Attrition Status.

plot_df=hr.sort_values(by="Attrition")
fig=px.histogram(plot_df, x='MonthlyIncome', color='Attrition',
                 opacity=0.8, barmode='overlay', marginal='box',
                 color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'}, width=800, height=550)
fig.update_layout(title_text='Distribution of Monthly Income by Attrition Status',
                  xaxis_title='Monthly Income, $', yaxis_title='Count',font_color='#28221D',
                  paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0', legend_traceorder='reversed')
fig.show()

The distribution of monthly income for both current and former employees is positively skewed and lower overall among staff who left. Former employees had a median monthly income more than $2,000 less than current employees.
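
The median gap and the direction of skew can be checked numerically as well as visually. This is a minimal sketch on made-up toy data; the helper name `median_income_gap` is hypothetical, and it assumes `Attrition` is coded 'Yes'/'No' as above.

```python
import pandas as pd

def median_income_gap(df: pd.DataFrame) -> float:
    """Median monthly income of current staff minus that of former staff."""
    medians = df.groupby('Attrition')['MonthlyIncome'].median()
    return float(medians['No'] - medians['Yes'])

# Toy data (made up for illustration, not the IBM dataset)
toy = pd.DataFrame({
    'Attrition': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'MonthlyIncome': [3000, 5000, 12000, 2000, 3000, 9000],
})

gap = median_income_gap(toy)
# Sample skewness per group; values above zero indicate a long right tail
skew = toy.groupby('Attrition')['MonthlyIncome'].skew()
print(gap)
print(skew)
```

Calling `median_income_gap(hr)` on the real data would return the difference quoted above.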

In [38]:
# Plotting Monthly Income by Work Life Balance and Gender

fig=go.Figure()
colors=['#214D5C','#91ABB4']
for i, j in enumerate(hr['Gender'].unique()):
    df_plot=hr[hr['Gender']==j]
    fig.add_trace(go.Box(x=df_plot['WorkLifeBalance'], y=df_plot['MonthlyIncome'],
                         notched=True, line=dict(color=colors[i]),name=j))
fig.update_layout(title='Distribution of Monthly Income by Work Life Balance', width=800, height=400,
                  xaxis_title='Work Life Balance', boxmode='group', font_color='#28221D',
                  xaxis = dict(tickmode = 'array', tickvals = [1, 2, 3, 4],
                               ticktext = ['Poor', 'Fair', 'Good', 'Excellent']),
                  paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Women with the lowest-rated work life balance have the highest median salary out of all of the groups at $5,400/month.

In [39]:
# Plotting Monthly Income by Total Working Years and Job Level

plot_df = hr.copy()
plot_df['JobLevel'] = pd.Categorical(
    plot_df['JobLevel']).rename_categories(
    ['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive'])
col=['#73AF8E', '#4F909B', '#707BAD', '#A89DB7','#C99193']
fig = px.scatter(plot_df, x='TotalWorkingYears', y='MonthlyIncome',
                 color='JobLevel', size='MonthlyIncome',
                 color_discrete_sequence=col,
                 category_orders={'JobLevel': ['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive']}, width=800, height=400)
fig.update_layout(legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
                  title='Monthly income increases with total number of years worked and job level <br>',
                  xaxis_title='Total Working Years', yaxis=dict(title='Income',tickprefix='$'),
                  legend_title='', font_color='#28221D',
                  margin=dict(l=40, r=30, b=80, t=120),paper_bgcolor='#F4F2F0', plot_bgcolor='#F4F2F0')
fig.show()

Based on the scatterplot above, monthly income is positively correlated with total number of years worked, and there is a strong association between an employee's earnings and their job level.
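
Both observations can be put into numbers. The sketch below computes the Pearson correlation between experience and income and the median income per job level. The toy data is made up for illustration, so the values it prints say nothing about the IBM dataset.

```python
import pandas as pd

# Toy data (made up for illustration, not the IBM dataset)
toy = pd.DataFrame({
    'TotalWorkingYears': [1, 3, 6, 10, 15, 22],
    'JobLevel':          [1, 1, 2, 3, 4, 5],
    'MonthlyIncome':     [2500, 3000, 5500, 9000, 14000, 19000],
})

# Pearson correlation between years of experience and monthly income
r = toy['TotalWorkingYears'].corr(toy['MonthlyIncome'])

# Median income per job level; a steady rise supports the job-level association
median_by_level = toy.groupby('JobLevel')['MonthlyIncome'].median()

print(f"Pearson r = {r:.2f}")
print(median_by_level)
```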

In [40]:
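# Plotting the Correlation Matrix of Continuous Variables

# Treat text columns and columns with at most 5 distinct values as categorical
cat_cols = [c for c in hr.columns if hr[c].nunique() <= 5 or hr[c].dtype == object]
# Keep only continuous variables; EmployeeNumber is an identifier, not a feature
df = hr.drop(columns=cat_cols + ['EmployeeNumber'])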
corr=df.corr().round(2)
x=corr.index.tolist()
y=corr.columns.tolist()
z=corr.to_numpy()
fig = ff.create_annotated_heatmap(z=z, x=x, y=y, annotation_text=z, name='',
                                  hovertemplate="Correlation between %{x} and %{y}= %{z}",
                                  colorscale='GnBu')
fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Correlation Matrix of Employee Attrition",
                  font_color='#28221D',margin=dict(t=180),height=600)
fig.show()

Confirming our findings in the scatterplot above, MonthlyIncome has a strong positive correlation with TotalWorkingYears (correlation = 0.77). Additionally, YearsAtCompany has a strong positive association with YearsWithCurrManager (correlation = 0.77), as well as with YearsInCurrentRole (correlation = 0.76). No pair of variables has a correlation above 0.8, so the matrix does not indicate a serious collinearity issue.
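
Rather than scanning the heatmap by eye, the threshold check can be automated. This is a minimal sketch: the helper name `high_corr_pairs` is hypothetical, the 0.8 cutoff is the rule of thumb used in the paragraph above, and the toy matrix is made up for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return variable pairs whose absolute correlation is at least `threshold`."""
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy correlation matrix (made up for illustration, not the IBM dataset)
toy = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=list('abc'), columns=list('abc'),
)
flagged = high_corr_pairs(toy, threshold=0.8)
print(flagged)
```

Applied to the matrix above as `high_corr_pairs(corr)`, an empty result would match the statement that no pair exceeds 0.8.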

# Conclusion

This analysis explores employee attrition and salary factors in a dataset of 1,470 employees. Overall, 16% of the workforce left the company, with slightly higher attrition rates among male employees. Younger employees, particularly those aged 18-30, showed higher attrition rates. Limited prior work experience, short tenure, lack of training, and the absence of recent promotions or long-term managers were all associated with higher attrition. Overtime work, single marital status, and department affiliation were associated with attrition as well. Job satisfaction varied among former employees, with a significant number leaving despite high performance ratings. Attrition rates also varied across job roles, with Sales Representatives experiencing the highest turnover, and employees who travel frequently left at a higher rate than the workforce overall. Former employees had lower median salaries than current employees in every department, and women in the Human Resources department earned higher median salaries than men. These findings underline the importance of understanding these factors when developing strategies that improve employee retention and work-life balance and address salary discrepancies, ultimately fostering a satisfied and engaged workforce.