### Performing EDA

#### Problem Statement

The HR Director of a company has observed a recent rise in employee attrition and wants to understand the underlying trends and patterns. A staff survey has been conducted, and the resulting data has been shared for analysis.

The objective is to determine the overall attrition rate (percentage of employees who have left the organization) and to analyze whether factors such as age, years at the company, and income influence an employee’s likelihood of leaving.
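The attrition rate described above is the share of employees whose `Attrition` value is `Yes`, expressed as a percentage. A minimal sketch of that calculation follows. The helper name `attrition_rate` and the four-row sample are made up for illustration and are not part of the survey data.

```python
import pandas as pd


def attrition_rate(df: pd.DataFrame, column: str = "Attrition") -> float:
    """Return the percentage of rows whose attrition flag equals 'Yes'."""
    # Comparing to 'Yes' gives a boolean Series; its mean is the fraction of True values.
    return float((df[column] == "Yes").mean() * 100)


# Made-up sample for illustration only (not the survey data)
sample = pd.DataFrame({"Attrition": ["Yes", "No", "No", "No"]})
rate = attrition_rate(sample)
print(f"Attrition rate: {rate:.1f}%")
```

The same function could be applied to `hr_data` once it has been loaded, assuming the column keeps the `Yes`/`No` coding shown in the preview.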



#### Import the data

In [95]:
import pandas as pd

In [97]:
hr_data = pd.read_csv('2. Dataset/Hr Analytics.csv')
hr_data.head(3)

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0


#### Getting basic info of the data

In [100]:
### Get the shape of the data
hr_data.shape

(1470, 35)

In [102]:
### Checking for nulls: True if any cell in the DataFrame is missing
hr_data.isnull().values.any()

False

The result is False, so no column in the dataset contains null or missing values.
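When a check like this does return True, a per-column count shows where the gaps are. The sketch below uses a made-up three-row frame with one deliberately missing value, because the survey data itself has none.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only
sample = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "Attrition": ["Yes", "No", "Yes"],
})

null_counts = sample.isnull().sum()             # missing values per column
cols_with_nulls = null_counts[null_counts > 0]  # keep only the affected columns
has_nulls = bool(sample.isnull().values.any())  # single True/False answer

print(cols_with_nulls)
print(has_nulls)
```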

#### Column description

In [106]:
### Column names
hr_data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [108]:
# Count numerical columns
num_cols = hr_data.select_dtypes(include=['int64', 'float64']).shape[1]

# Count categorical columns
cat_cols = hr_data.select_dtypes(include=['object', 'category']).shape[1]

print(f"Numerical columns: {num_cols}")
print(f"Categorical columns: {cat_cols}")

Numerical columns: 26
Categorical columns: 9


In [110]:
info_table = pd.DataFrame({
    'Column Name': hr_data.columns,
    'Pandas Data Type': hr_data.dtypes.values
})

# Map pandas dtypes to categorical / numerical
info_table['Column Type'] = info_table['Pandas Data Type'].apply(
    lambda x: 'Numerical' if x in ['int64', 'float64'] else 'Categorical'
)

info_table

,Column Name,Pandas Data Type,Column Type
0,Age,int64,Numerical
1,Attrition,object,Categorical
2,BusinessTravel,object,Categorical
3,DailyRate,int64,Numerical
4,Department,object,Categorical
5,DistanceFromHome,int64,Numerical
6,Education,int64,Numerical
7,EducationField,object,Categorical
8,EmployeeCount,int64,Numerical
9,EmployeeNumber,int64,Numerical
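Classifying by dtype alone labels every integer column as numerical, including columns such as Education, whose values run only from 1 to 5 in the Column Statistics section and may therefore be ordinal codes rather than measured quantities. One possible way to flag such columns for manual review is sketched below. The helper name `low_cardinality_numeric`, the threshold of 5 distinct values, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def low_cardinality_numeric(df: pd.DataFrame, max_unique: int = 5) -> list:
    """List numeric columns with at most `max_unique` distinct values."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].nunique() <= max_unique]


# Made-up sample for illustration only (not the survey data)
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32, 59],
    "Education": [2, 1, 2, 4, 1, 2, 3],
    "Department": ["Sales", "R&D", "R&D", "R&D", "R&D", "R&D", "R&D"],
})

candidates = low_cardinality_numeric(sample)
print(candidates)
```

Columns flagged this way are only candidates. Whether to treat them as categorical in the later analysis is a judgment call that depends on what the codes mean.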


In [112]:
info_table['Description'] = [
    'Age of employee',
    'Did the employee exit the company',
    'Does the employee travel for business',
    'Employee wages per day',
    'Department employee works in',
    'Distance between home and workplace',
    'Education level of employee',
    'Field of education',
    'Employee headcount indicator',
    'Unique employee identification number',
    'Satisfaction with work environment',
    'Gender of employee',
    'Hourly pay rate',
    'Level of job involvement',
    'Job seniority level',
    'Employee job role',
    'Satisfaction with job',
    'Marital status of employee',
    'Monthly income of employee',
    'Monthly pay rate',
    'Number of companies worked at previously',
    'Whether employee is over 18 years old',
    'Whether employee works overtime',
    'Percentage increase in salary',
    'Performance rating of employee',
    'Satisfaction with personal relationships at work',
    'Standard working hours',
    'Stock option level assigned',
    'Total years of work experience',
    'Training sessions attended last year',
    'Work-life balance rating',
    'Years spent at the company',
    'Years in current role',
    'Years since last promotion',
    'Years working with current manager'
]


#### Data Dictionary

In [115]:
info_table

,Column Name,Pandas Data Type,Column Type,Description
0,Age,int64,Numerical,Age of employee
1,Attrition,object,Categorical,Did the employee exit the company
2,BusinessTravel,object,Categorical,Does the employee travel for business
3,DailyRate,int64,Numerical,Employee wages per day
4,Department,object,Categorical,Department employee works in
5,DistanceFromHome,int64,Numerical,Distance between home and workplace
6,Education,int64,Numerical,Education level of employee
7,EducationField,object,Categorical,Field of education
8,EmployeeCount,int64,Numerical,Employee headcount indicator
9,EmployeeNumber,int64,Numerical,Unique employee identification number


#### Column Statistics

In [39]:
hr_data.describe().T[['mean', 'min', 'max']].round()

,mean,min,max
Age,37.0,18.0,60.0
DailyRate,802.0,102.0,1499.0
DistanceFromHome,9.0,1.0,29.0
Education,3.0,1.0,5.0
EmployeeCount,1.0,1.0,1.0
EmployeeNumber,1025.0,1.0,2068.0
EnvironmentSatisfaction,3.0,1.0,4.0
HourlyRate,66.0,30.0,100.0
JobInvolvement,3.0,1.0,4.0
JobLevel,2.0,1.0,5.0


#### Next Steps

After performing initial EDA, we can conclude that the data is in good shape: it has 1,470 rows and 35 columns, with no missing values. It can now be imported into Tableau to carry out the analysis and address the questions in our Problem Statement.
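If the data is handed to Tableau as a CSV file, writing it without the pandas index keeps an extra unnamed column out of the file. The sketch below assumes that workflow. The file name `hr_data_clean.csv`, the temporary directory, and the two-row sample are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up sample for illustration only (not the survey data)
sample = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

# Hypothetical output location; replace with the folder Tableau will read from
out_path = os.path.join(tempfile.mkdtemp(), "hr_data_clean.csv")

# index=False stops pandas from writing the row index as a first column
sample.to_csv(out_path, index=False)

# Reading the file back confirms that only the real columns were written
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```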