## **IBM HR Employee Attrition Analysis**
This notebook uses the IBM HR Analytics Employee Attrition dataset to test whether there is a relationship between JobSatisfaction and Attrition (leaving the company).

In [None]:
# Importing essential libraries
import pandas as pd
from scipy import stats

In [None]:
# Loading the dataset
df = pd.read_csv('HR-Employee-Attrition.csv')

In [26]:
df.shape

(1470, 35)

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [28]:
df.describe()

| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |


In [None]:
# Creating Contingency Table
table = pd.crosstab(df['Attrition'],df['JobSatisfaction'])
table

| Attrition \ JobSatisfaction | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| No | 223 | 234 | 369 | 407 |
| Yes | 66 | 46 | 73 | 52 |
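
Raw counts are hard to compare because the satisfaction levels have different group sizes, so it can help to express the table as an attrition rate per level. The sketch below rebuilds the table from the counts printed above (so it runs without the CSV); the variable name `attrition_rate` is introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the contingency table output above
table = pd.DataFrame(
    {1: [223, 66], 2: [234, 46], 3: [369, 73], 4: [407, 52]},
    index=pd.Index(["No", "Yes"], name="Attrition"),
)
table.columns.name = "JobSatisfaction"

# Share of employees who left within each satisfaction level
attrition_rate = table.loc["Yes"] / table.sum(axis=0)
print(attrition_rate.round(3))
```

With the full DataFrame loaded, `pd.crosstab(df['Attrition'], df['JobSatisfaction'], normalize='columns')` should give the same proportions directly. On these counts the rate falls from about 22.8% at Level 1 to about 11.3% at Level 4.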


### **Statistical Testing**

**Hypothesis:** There is a relationship between JobSatisfaction and Attrition.

**Null Hypothesis (H0):** JobSatisfaction and Attrition are independent (there is no association between the variables).

**Alternate Hypothesis (H1):** JobSatisfaction and Attrition are not independent (there is an association between the variables).
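
For reference, the chi-square test of independence compares the observed counts with the counts expected if H0 were true. The notation below ($O_{ij}$, $E_{ij}$, $R_i$, $C_j$, $N$, $r$, $c$) is introduced here for illustration: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the total number of employees.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i\,C_j}{N},
\qquad \text{dof} = (r-1)(c-1)
$$

For the $2 \times 4$ table of Attrition by JobSatisfaction, this gives $(2-1)(4-1) = 3$ degrees of freedom.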

In [23]:
# Performing Chi-Square Test for Independence

chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(f'Chi2 : {chi2}')
print(f'P-Value : {p_value}')
print('')

if p_value < 0.05:
    print('Reject Null Hypothesis : There is significant association between the variables')
else:
    print('Fail to Reject Null Hypothesis : There is no significant association between the variables')

Chi2 : 17.505077010348
P-Value : 0.0005563004510387556

Reject Null Hypothesis : There is significant association between the variables
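
The test cell unpacks `dof` and `expected` but never displays them. A minimal sketch of how they could be inspected follows, again rebuilding the table from the printed counts; `expected_df` and `min_expected` are names introduced here. It also checks the common rule of thumb that every expected count should be at least 5 for the chi-square approximation to be reasonable.

```python
import pandas as pd
from scipy import stats

# Counts copied from the contingency table output above
table = pd.DataFrame(
    {1: [223, 66], 2: [234, 46], 3: [369, 73], 4: [407, 52]},
    index=pd.Index(["No", "Yes"], name="Attrition"),
)

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Expected counts under independence, labelled like the observed table
expected_df = pd.DataFrame(expected, index=table.index, columns=table.columns)
print(f"Degrees of freedom : {dof}")
print(expected_df.round(2))

# Rule of thumb: the approximation is usually trusted when all expected counts are >= 5
min_expected = expected.min()
print(f"Smallest expected count : {min_expected:.2f}")
```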


- Statistically Significant Association: The low p-value ($0.000556$, well below the $0.05$ threshold) is strong evidence that job satisfaction and employee attrition are not independent. The test shows that an association exists; it does not by itself measure how strong the association is or how well satisfaction predicts whether an individual employee will stay or leave.

- Low Satisfaction Is Associated with Higher Turnover: Employees reporting the lowest satisfaction (Level 1) had an observed attrition count of 66, well above the count of ~46.6 expected under independence. Attrition is therefore over-represented among employees with "Low" satisfaction.

- High Satisfaction Is Associated with Retention: Conversely, employees with the highest satisfaction (Level 4) showed lower attrition than expected (52 observed vs. ~74 expected). This is consistent with higher job satisfaction reducing turnover, although an observational test like this one cannot establish causation.
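
The observed-versus-expected comparisons above can be made systematic by looking at each cell's contribution to the chi-square statistic. The sketch below is one way to do that, assuming the table is rebuilt from the printed counts; `residuals` and `contributions` are names introduced here, and the Pearson residual is defined as (observed - expected) / sqrt(expected).

```python
import pandas as pd
from scipy import stats

# Counts copied from the contingency table output above
table = pd.DataFrame(
    {1: [223, 66], 2: [234, 46], 3: [369, 73], 4: [407, 52]},
    index=pd.Index(["No", "Yes"], name="Attrition"),
)

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Pearson residuals: positive means more cases than expected, negative means fewer
residuals = (table - expected) / expected ** 0.5

# Squared residuals are the per-cell contributions; they add up to the chi-square statistic
contributions = residuals ** 2

print(residuals.round(2))
print(contributions.round(2))
```

On these counts the largest contributions come from the "Yes" row at Level 1 (positive residual) and Level 4 (negative residual), which matches the two cells discussed in the bullets.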