# Importing, Understanding, and Cleaning the Data

In [1]:
import pandas as pd 
import plotly.express as px
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
df = pd.read_excel("../assets/HR/HR-Data.xlsx", sheet_name="Data")
df.head(10)

Unnamed: 0,Attrition,Business Travel,CF_age band,CF_attrition label,Department,Education Field,emp no,Employee Number,Gender,Job Role,...,Performance Rating,Relationship Satisfaction,Standard Hours,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager
0,Yes,Travel_Rarely,35 - 44,Ex-Employees,Sales,Life Sciences,STAFF-1,1,Female,Sales Executive,...,3,1,80,0,8,1,6,4,0,5
1,No,Travel_Frequently,45 - 54,Current Employees,R&D,Life Sciences,STAFF-2,2,Male,Research Scientist,...,4,4,80,1,10,3,10,7,1,7
2,Yes,Travel_Rarely,35 - 44,Ex-Employees,R&D,Other,STAFF-4,4,Male,Laboratory Technician,...,3,2,80,0,7,3,0,0,0,0
3,No,Travel_Frequently,25 - 34,Current Employees,R&D,Life Sciences,STAFF-5,5,Female,Research Scientist,...,3,3,80,0,8,3,8,7,3,0
4,No,Travel_Rarely,25 - 34,Current Employees,R&D,Medical,STAFF-7,7,Male,Laboratory Technician,...,3,4,80,1,6,3,2,2,2,2
5,No,Travel_Frequently,25 - 34,Current Employees,R&D,Life Sciences,STAFF-8,8,Male,Laboratory Technician,...,3,3,80,0,8,2,7,7,3,6
6,No,Travel_Rarely,Over 55,Current Employees,R&D,Medical,STAFF-10,10,Female,Laboratory Technician,...,4,1,80,3,12,2,1,0,0,0
7,No,Travel_Rarely,25 - 34,Current Employees,R&D,Life Sciences,STAFF-11,11,Male,Laboratory Technician,...,4,2,80,1,1,3,1,0,0,0
8,No,Travel_Frequently,35 - 44,Current Employees,R&D,Life Sciences,STAFF-12,12,Male,Manufacturing Director,...,4,2,80,0,10,3,9,7,1,8
9,No,Travel_Rarely,35 - 44,Current Employees,R&D,Medical,STAFF-13,13,Male,Healthcare Representative,...,3,2,80,2,17,2,7,7,7,7


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 44 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Attrition                   1470 non-null   object 
 1   Business Travel             1470 non-null   object 
 2   CF_age band                 1470 non-null   object 
 3   CF_attrition label          1470 non-null   object 
 4   Department                  1470 non-null   object 
 5   Education Field             1470 non-null   object 
 6   emp no                      1470 non-null   object 
 7   Employee Number             1470 non-null   int64  
 8   Gender                      1470 non-null   object 
 9   Job Role                    1470 non-null   object 
 10  Marital Status              1470 non-null   object 
 11  Over Time                   1470 non-null   object 
 12  Over18                      1470 non-null   object 
 13  Training Times Last Year    1470 

In [4]:
df['CF_age band'].value_counts()

CF_age band
25 - 34     554
35 - 44     505
45 - 54     245
Under 25     97
Over 55      69
Name: count, dtype: int64

In [5]:
cols_to_drop = [
    'emp no', 'Employee Number', 'Employee Count', 'Over18',
    'Standard Hours', 'CF_attrition count', 'CF_attrition counts',
    'CF_attrition rate', 'CF_current Employee', 'CF_attrition label',
    '-2', '0'
]

df_cleaned = df.drop(columns=cols_to_drop, errors='ignore')

print(f"Columns before: {df.shape[1]}")
print(f"Columns after: {df_cleaned.shape[1]}")
df_cleaned.head()


Columns before: 44
Columns after: 32


Unnamed: 0,Attrition,Business Travel,CF_age band,Department,Education Field,Gender,Job Role,Marital Status,Over Time,Training Times Last Year,...,Percent Salary Hike,Performance Rating,Relationship Satisfaction,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager
0,Yes,Travel_Rarely,35 - 44,Sales,Life Sciences,Female,Sales Executive,Single,Yes,0,...,11,3,1,0,8,1,6,4,0,5
1,No,Travel_Frequently,45 - 54,R&D,Life Sciences,Male,Research Scientist,Married,No,3,...,23,4,4,1,10,3,10,7,1,7
2,Yes,Travel_Rarely,35 - 44,R&D,Other,Male,Laboratory Technician,Single,Yes,3,...,15,3,2,0,7,3,0,0,0,0
3,No,Travel_Frequently,25 - 34,R&D,Life Sciences,Female,Research Scientist,Married,Yes,3,...,11,3,3,0,8,3,8,7,3,0
4,No,Travel_Rarely,25 - 34,R&D,Medical,Male,Laboratory Technician,Married,No,3,...,12,3,4,1,6,3,2,2,2,2


## Dropped Columns and Reasons

| Column | Reason |
|:--------|:--------|
| **emp no** | Employee identifier; carries no analytical signal. |
| **Employee Number** | Numeric form of the same identifier. |
| **Employee Count** | Same value in every row. |
| **Over18** | Every employee is over 18, so the column is constant. |
| **Standard Hours** | Same value for every employee, so it cannot explain attrition. |
| **CF_attrition count** | Calculated field, not an employee attribute. |
| **CF_attrition counts** | Many missing values and not needed. |
| **CF_attrition rate** | Calculated field, not an employee attribute. |
| **CF_current Employee** | Carries the same information as the `Attrition` column. |
| **CF_attrition label** | Duplicate of `Attrition` with different labels (`Ex-Employees` / `Current Employees`). |
| **-2** | Column with no identifiable meaning. |
| **0** | Same as above: unclear and not useful. |


Removing these 12 columns reduces the dataset from 44 to 32 columns, leaving only fields that describe individual employees.
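
Several of the reasons above come down to "same value in every row". A minimal sketch of how such columns could be found programmatically, assuming a small hypothetical frame `toy` (its values are made up and only stand in for the HR data):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "R&D", "R&D", "HR"],
    "Standard Hours": [80, 80, 80, 80],   # constant column
    "Over18": ["Y", "Y", "Y", "Y"],       # constant column
    "Age": [41, 49, 37, 33],
})

# A column with at most one distinct value cannot explain any outcome.
unique_counts = toy.nunique(dropna=False)
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

print(constant_cols)
```

Running the same check on the full DataFrame before dropping anything is one way to confirm that a column really is constant rather than relying on the first few rows.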


_____________

# Data Exploration and Analysis

In [6]:
df_cleaned['Attrition'] = df_cleaned['Attrition'].map({'Yes': 1, 'No': 0})

df_cleaned['Attrition'].value_counts()

Attrition
0    1233
1     237
Name: count, dtype: int64

In [7]:
df = df_cleaned

### Converting Attrition to Numeric Values

The `Attrition` column originally contains text values (`Yes` or `No`).  
To make it easier to use in analysis and visualizations, we replace:
- `Yes` → `1` (employee left the company)  
- `No` → `0` (employee stayed)

This allows us to calculate averages, correlations, and create clearer plots based on attrition rates.
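
Two properties of this encoding are worth checking. The sketch below uses hypothetical label series (`labels`, `clean`) to show that `map` silently turns unmapped values into `NaN`, and that the mean of a 0/1 column is the attrition rate. The final line reuses the counts printed above (1233 stayed, 237 left).

```python
import pandas as pd

mapping = {"Yes": 1, "No": 0}

# Hypothetical labels; the last one is deliberately malformed.
labels = pd.Series(["Yes", "No", "No", "yes"])
encoded = labels.map(mapping)

# Values missing from the mapping become NaN instead of raising an error.
unmapped = int(encoded.isna().sum())

# With a clean 0/1 column, the mean is the share of 1s, i.e. the attrition rate.
clean = pd.Series(["Yes", "No", "No", "No"]).map(mapping)
rate_pct = clean.mean() * 100

# Overall rate implied by the value counts shown above.
overall_pct = 237 / (1233 + 237) * 100

print(unmapped, rate_pct, round(overall_pct, 1))
```

A quick `isna().sum()` after mapping guards against stray spellings or whitespace in the source column.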


In [8]:
print("Percentage of Attrition by Gender:")

gender_attrition = df.groupby('Gender')['Attrition'].mean().mul(100).reset_index()

display(gender_attrition)



fig = px.bar(
    gender_attrition,
    x='Gender',
    y='Attrition',
    color='Gender',
    text=gender_attrition['Attrition'].round(1).astype(str) + '%',
    title='Attrition Percentage by Gender'
)
fig.update_traces(textposition='outside')
fig.update_layout(yaxis_title='Attrition (%)', xaxis_title='Gender', showlegend=False)
fig.show()


Percentage of Attrition by Gender:


Unnamed: 0,Gender,Attrition
0,Female,14.795918
1,Male,17.006803


## Insights Gender:

- **Gender:** Male employees have a slightly higher attrition rate (≈17.0%) than female employees (≈14.8%).  
  → Men leave the company somewhat more often than women, though the gap is only about 2 percentage points.  
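
Whether a gap of about two percentage points is larger than sampling noise can be checked with a two-proportion z-test. The sketch below is a minimal pure-Python version. The function name `two_proportion_z` and the counts passed to it are hypothetical, since the notebook reports rates by gender but not group sizes.

```python
import math

def two_proportion_z(leavers_a, n_a, leavers_b, n_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    p_a, p_b = leavers_a / n_a, leavers_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (leavers_a + leavers_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, NOT taken from the dataset.
z, p = two_proportion_z(leavers_a=17, n_a=100, leavers_b=15, n_b=100)
print(round(z, 2), round(p, 2))
```

To apply it to the real data, the counts could come from `df.groupby('Gender')['Attrition'].agg(['sum', 'count'])`.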


_________________

In [9]:
print("\nPercentage of Attrition by age Group (CF_age band):")

age_attrition = df.groupby('CF_age band')['Attrition'].mean().mul(100).reset_index()

display(age_attrition)



fig = px.bar(
    age_attrition,
    x='CF_age band',
    y='Attrition',
    color='CF_age band',
    text=age_attrition['Attrition'].round(1).astype(str) + '%',
    title='Attrition Percentage by Age Group'
)
fig.update_traces(textposition='outside')
fig.update_layout(yaxis_title='Attrition (%)', xaxis_title='Age Group', showlegend=False)
fig.show()



Percentage of Attrition by age Group (CF_age band):


Unnamed: 0,CF_age band,Attrition
0,25 - 34,20.216606
1,35 - 44,10.09901
2,45 - 54,10.204082
3,Over 55,15.942029
4,Under 25,39.175258


## Insights Age Group:

  - The **highest attrition** is among employees **under 25 (≈39%)**, followed by those aged **25–34 (≈20%)**.

  - Attrition **falls to about 10% for ages 35–54**, then rises again to **≈16% for employees over 55**. Younger employees are therefore the most likely to leave, typically early in their careers.  
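
The grouped table lists the age bands alphabetically, which puts `Under 25` last. A minimal sketch of one way to restore the natural order, using an ordered categorical. The rates are copied from the table above, and `age_order` is a name introduced here for illustration.

```python
import pandas as pd

# Rates copied from the output above, in the alphabetical order pandas produced.
age_attrition = pd.DataFrame({
    "CF_age band": ["25 - 34", "35 - 44", "45 - 54", "Over 55", "Under 25"],
    "Attrition": [20.216606, 10.09901, 10.204082, 15.942029, 39.175258],
})

# Youngest to oldest.
age_order = ["Under 25", "25 - 34", "35 - 44", "45 - 54", "Over 55"]

age_attrition["CF_age band"] = pd.Categorical(
    age_attrition["CF_age band"], categories=age_order, ordered=True
)
age_sorted = age_attrition.sort_values("CF_age band").reset_index(drop=True)

print(age_sorted)
```

For the chart itself, `px.bar` also accepts `category_orders={'CF_age band': age_order}`, which orders the bars without changing the DataFrame.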


____________

In [10]:
print("\nPercentage of Attrition by Department:")

dept_attrition = df.groupby('Department')['Attrition'].mean().mul(100).reset_index()

display(dept_attrition)

fig = px.bar(
    dept_attrition,
    x='Attrition',
    y='Department',
    color='Department',
    orientation='h',
    text=dept_attrition['Attrition'].round(1).astype(str) + '%',
    title='Attrition Percentage by Department'
)
fig.update_traces(textposition='outside')
fig.update_layout(xaxis_title='Attrition (%)', yaxis_title='Department', showlegend=False)
fig.show()



Percentage of Attrition by Department:


Unnamed: 0,Department,Attrition
0,HR,19.047619
1,R&D,13.83975
2,Sales,20.627803


## Insights Department:

  - **Sales (≈20.6%)** and **HR (≈19%)** show higher attrition than **R&D (≈13.8%)**.  
  ### → This could point to more stress, fewer growth opportunities, or job-hopping tendencies in those departments.  


______

In [11]:
fig = px.histogram(
    df,
    x='Department',
    y='Attrition',
    color='CF_age band',
    barmode='group',
    histfunc='avg',
    title='Attrition Percentage by Department and Age Group'
)


fig.update_traces(texttemplate='%{y:.1%}', textposition='outside')
fig.update_layout(yaxis_title='Attrition (%)', yaxis_tickformat='.0%', legend_title='Age Group')

fig.show()


## Insights Department × Age Group:

- Across all departments, **younger employees (Under 25)** have the **highest attrition rates**, especially in **HR (≈66.7%)** and **Sales (≈51.9%)**.  
- In contrast, employees aged **35–44** and **45–54** show much lower attrition levels across all departments (mostly below 22%).  
- **R&D** maintains the lowest overall attrition across age groups, with all rates below 20%.  
### → This suggests that younger staff, particularly in HR and Sales, are more likely to leave early, possibly due to entry-level roles, limited experience, or a search for faster career growth.
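
Rates for a department and age band combination can rest on very few employees, so it helps to read each rate next to its group size. The sketch below assumes a small hypothetical frame `toy` with the same column names as the notebook; in the notebook itself the same `groupby` could be run on `df`.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({
    "Department": ["HR", "HR", "HR", "Sales", "Sales", "Sales", "Sales"],
    "CF_age band": ["Under 25", "Under 25", "35 - 44",
                    "Under 25", "35 - 44", "35 - 44", "35 - 44"],
    "Attrition": [1, 1, 0, 1, 0, 0, 1],
})

# Attrition rate and headcount for every department / age band pair.
summary = (
    toy.groupby(["Department", "CF_age band"])["Attrition"]
    .agg(rate="mean", n="count")
    .reset_index()
)
summary["rate"] = summary["rate"] * 100

print(summary)
```

A cell with only a handful of employees can show an extreme rate from one or two departures, so small `n` values deserve caution.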


___

In [12]:
print("Percentage of Attrition by Business Travel")

trvl_attrition = df.groupby('Business Travel')['Attrition'].mean().mul(100).reset_index()

display(trvl_attrition)

fig= px.bar(
    trvl_attrition,
    x='Business Travel',
    y='Attrition',
    color= 'Business Travel',
    orientation='v',
    text= trvl_attrition['Attrition'].round(1).astype(str)+'%',
    title= 'Percentage of Attrition by Business Travel'
)
fig.update_traces(textposition='outside')
fig.update_layout(xaxis_title='Business Travel', yaxis_title='Attrition (%)')
fig.show()

Percentage of Attrition by Business Travel


Unnamed: 0,Business Travel,Attrition
0,Non-Travel,8.0
1,Travel_Frequently,24.909747
2,Travel_Rarely,14.956855


## Insights Business Travel:

- Employees who **travel frequently (≈24.9%)** have the **highest attrition rate**, followed by those who **travel rarely (≈15%)**.  
- **Non-traveling employees (≈8%)** are the least likely to leave the company.  
### → This suggests that frequent business travel may lead to higher stress or work-life imbalance, increasing the likelihood of employees leaving their jobs.


__________________

In [13]:
df["Education Field"].value_counts()

Education Field
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: count, dtype: int64

In [14]:
print("Percentage of Attrition by Education Field")

edu_attrition = df.groupby('Education Field')['Attrition'].mean().mul(100).reset_index()

display(edu_attrition)

fig= px.bar(
    edu_attrition,
    x='Education Field',
    y='Attrition',
    color='Education Field',
    text= edu_attrition['Attrition'].round(1).astype(str)+'%',
    title= 'Percentage of Attrition by Education Field'
)
fig.update_traces(textposition='outside')
fig.update_layout(xaxis_title='Education Field', yaxis_title='Attrition (%)')
fig.show()

Percentage of Attrition by Education Field


Unnamed: 0,Education Field,Attrition
0,Human Resources,25.925926
1,Life Sciences,14.686469
2,Marketing,22.012579
3,Medical,13.577586
4,Other,13.414634
5,Technical Degree,24.242424


## Insights Education Field:

- Employees with backgrounds in **Human Resources (≈25.9%)**, **Technical Degrees (≈24.2%)**, and **Marketing (≈22%)** show the **highest attrition rates**.  
- Those from **Life Sciences (≈14.7%)**, **Medical (≈13.6%)**, and **Other fields (≈13.4%)** have significantly **lower attrition**.  
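### → This suggests that employees with HR, technical, or marketing educational backgrounds may face higher workload pressure or find more external opportunities, leading to higher turnover. Note that the Human Resources group is small (27 employees), so its rate is less reliable.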
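
Group sizes differ widely here (606 in Life Sciences, 27 in Human Resources), so the same rate carries very different uncertainty. The sketch below computes an approximate 95% Wilson score interval. The helper name `wilson_interval` is introduced for illustration, and the 7 leavers are inferred from 27 employees at ≈25.93% attrition.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Human Resources: 27 employees at 25.93% attrition implies 7 leavers.
low, high = wilson_interval(7, 27)
print(round(low * 100, 1), round(high * 100, 1))
```

The interval is wide for a group this small, which is a reason to treat its ranking among education fields as tentative.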


______________________

In [15]:
df['Job Satisfaction']

0       4
1       2
2       3
3       3
4       2
       ..
1465    1
1466    4
1467    4
1468    2
1469    4
Name: Job Satisfaction, Length: 1470, dtype: int64

In [16]:
print("Percentage of Attrition by Job Satisfaction")

job_attrition = df.groupby('Job Satisfaction')['Attrition'].mean().mul(100).reset_index()

display(job_attrition)


fig = px.bar(
    job_attrition,
    x='Job Satisfaction',
    y='Attrition',
    color='Job Satisfaction',
    text=job_attrition['Attrition'].round(1).astype(str) + '%',
    title='Percentage of Attrition by Job Satisfaction'
)


fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title='Job Satisfaction Level',
    yaxis_title='Attrition (%)',
    showlegend=False
)

fig.show()


Percentage of Attrition by Job Satisfaction


Unnamed: 0,Job Satisfaction,Attrition
0,1,23.214286
1,2,14.438503
2,3,14.385151
3,4,14.545455


## Insights Job Satisfaction:

- Employees with the **lowest job satisfaction (Level 1)** have the **highest attrition rate (≈23.2%)**.  
- For satisfaction levels **2, 3, and 4**, attrition remains almost the same (around **14–15%**).  
### → This suggests that **very low job satisfaction is associated with markedly higher turnover**, while attrition is nearly identical across satisfaction levels 2 to 4.


____________________

In [17]:
# Histogram of leavers only: bar heights are counts of departures, not attrition rates.
fig = px.histogram(
    df[df['Attrition'] == 1],
    x='Total Working Years',
    nbins=30,
    title='Attrition Distribution by Total Working Years'
)

fig.show()


## Insights Total Working Years:

- Employees with **fewer than 10 total working years** account for **most departures**, especially in the early career stage (0–5 years).  
- The number of leavers **declines after around 10 years** of total experience and becomes very low beyond 15 years.  
### → This suggests that **newer or less experienced employees are more likely to leave**, possibly seeking career growth, better pay, or new opportunities, while more experienced employees tend to stay longer in the company.

Because the histogram counts leavers only, part of this pattern may reflect how many employees fall into each experience range rather than a higher rate of leaving.
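
The histogram above counts leavers only, so a tall bar may simply mean that many employees have that much experience. A minimal sketch of computing an attrition rate per experience band with `pd.cut`, assuming a hypothetical frame `toy`; the bin edges and labels are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
toy = pd.DataFrame({
    "Total Working Years": [1, 3, 4, 7, 8, 12, 14, 20, 25, 30],
    "Attrition":           [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Illustrative bands: [0, 5], (5, 10], (10, 15], (15, 40].
bins = [0, 5, 10, 15, 40]
labels = ["0-5", "6-10", "11-15", "16+"]
toy["experience_band"] = pd.cut(
    toy["Total Working Years"], bins=bins, labels=labels, include_lowest=True
)

# Leavers, headcount, and rate side by side for each band.
band_summary = (
    toy.groupby("experience_band", observed=True)["Attrition"]
    .agg(leavers="sum", headcount="count", rate="mean")
)
band_summary["rate"] = band_summary["rate"] * 100

print(band_summary)
```

Comparing the `leavers` and `rate` columns shows whether a band stands out because of its size or because of its attrition rate.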


_____________________

In [18]:
num_columns = df.select_dtypes(exclude = 'object')
corr = num_columns.corr()['Attrition'].drop('Attrition').sort_values(ascending=False)

fig = px.bar(
    x=corr.index,
    y=corr.values,
    title="Correlation with Attrition",
    labels={'x': 'Features', 'y': 'Correlation'},
    color=corr.values,
    color_continuous_scale='RdYlGn'
)

fig.show()

## Insights Correlation with Attrition:

The correlations are relatively weak overall (all within roughly ±0.17), which is typical in HR data, but they still reveal patterns in what is associated with employee attrition.

#### Positive Correlations (Higher Values → Higher Attrition)
- **Distance From Home (≈ +0.078)** → Employees living farther from the office are slightly more likely to leave, possibly due to long commutes and reduced work-life balance.  
- **Num Companies Worked (≈ +0.043)** → Employees who have changed jobs more often are marginally more likely to leave again, a pattern common among job-hoppers.  
- **Monthly Rate (≈ +0.015)** and **Performance Rating (≈ +0.002)** → Very weak positive relations; these likely have minimal real impact on attrition.

#### Negative Correlations (Higher Values → Lower Attrition)
- **Total Working Years (≈ -0.171)**, **Job Level (≈ -0.169)**, **Years In Current Role (≈ -0.166)**, **Monthly Income (≈ -0.160)**, **Age (≈ -0.156)**, and **Years With Current Manager (≈ -0.152)**  
  → These are the **strongest negative relationships**.  
  Longer tenure, higher seniority, higher income, and older age all go with greater stability and lower turnover, suggesting that **experience and career progression are associated with reduced attrition**.  
- **Stock Option Level (≈ -0.130)** and **Years At Company (≈ -0.134)** → Employees with stock incentives and longer company tenure are less likely to leave, suggesting that **financial and emotional commitment** helps retention.  
- **Job Involvement (≈ -0.108)** and **Environment Satisfaction (≈ -0.089)** → Lower engagement and poor work environment correspond to higher turnover, underlining the importance of workplace satisfaction.  
- **Work Life Balance (≈ -0.060)**, **Job Satisfaction (≈ -0.066)**, **Training Times Last Year (≈ -0.059)** → These weaker but meaningful negative correlations show that a better balance, satisfaction, and development opportunities help retain employees.  
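- **Daily Rate (≈ -0.056)**, **Relationship Satisfaction (≈ -0.054)**, and **Years Since Last Promotion (≈ -0.033)** → Minimal influence. The negative sign on Years Since Last Promotion means a longer gap since the last promotion goes with slightly lower attrition, which may reflect longer tenure rather than any effect of promotions.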
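
Since `Attrition` is coded 0/1, each value above is a Pearson correlation between a binary column and a numeric one. The sketch below spells out that calculation in plain Python; the helper `pearson` and both lists are hypothetical and only illustrate how a negative sign arises when leavers have lower values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: the two leavers have the fewest working years.
attrition = [1, 1, 0, 0, 0, 0]
total_years = [1, 3, 6, 10, 15, 20]

r = pearson(attrition, total_years)
print(round(r, 2))
```

A correlation of this kind measures only linear association with the 0/1 flag, so it can understate effects that are concentrated in one range, such as the jump at the lowest job satisfaction level.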


_________________________

# Summary & Overall Conclusion

### - Employee attrition is most strongly associated with **youth, short tenure, very low job satisfaction, and frequent travel**, while **experience, career level, income, age, and engagement** are the strongest retention factors.  

### - In essence, **younger and less experienced employees are more likely to leave**, often due to limited growth or work-life balance challenges.  
Meanwhile, **experienced, well-compensated, and satisfied employees** demonstrate higher loyalty and stability.  

### - To reduce turnover, organizations should focus on **early-career development**, **clear career progression paths**, **competitive compensation**, and a healthy **work-life balance** that fosters long-term employee engagement.


________

In [19]:
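# index=False keeps the row index out of the file, so re-reading it does not add an "Unnamed: 0" column.
df.to_csv('HR_cleaned.csv', index=False)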
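
By default, `to_csv` writes the row index as an extra first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with an in-memory buffer instead of a file; the frame `toy` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row frame.
toy = pd.DataFrame({"Attrition": [1, 0], "Department": ["Sales", "R&D"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```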