# About the Dataset

- The dataset comes from Kaggle.
- The Employee Attrition Prediction Dataset contains data for 1,000 employees and is designed for predictive modeling and analysis of employee attrition. It includes a variety of demographic, job-related, and performance metrics to help understand the factors contributing to employee turnover.
- It is available for download at **[Employee Attrition Prediction Dataset](https://www.kaggle.com/datasets/ziya07/employee-attrition-prediction-dataset)**.


## Key Features:

- Employee_ID: Unique identifier for each employee.
- Age: Age of the employee.
- Gender: Gender of the employee.
- Marital_Status: Marital status of the employee (Single, Married, Divorced).
- Department: Department the employee works in (e.g., HR, IT, Sales, Marketing).
- Job_Role: Specific role within the department (e.g., Manager, Analyst).
- Job_Level: Level in the organizational hierarchy.
- Monthly_Income: Monthly salary of the employee.
- Hourly_Rate: Rate per hour for hourly employees.
- Years_at_Company: Number of years the employee has been with the company.
- Years_in_Current_Role: Number of years the employee has been in their current role.
- Years_Since_Last_Promotion: Time since the employee’s last promotion.
- Work_Life_Balance: Rating of work-life balance.
- Job_Satisfaction: Rating of job satisfaction (1-5 scale).
- Performance_Rating: Performance rating (1-5 scale).
- Training_Hours_Last_Year: Number of training hours completed in the past year.
- Overtime: Whether the employee works overtime (Yes/No).
- Project_Count: Number of projects managed by the employee.
- Average_Hours_Worked_Per_Week: Average working hours per week.
- Absenteeism: Number of days the employee was absent in the past year.
- Work_Environment_Satisfaction: Rating of work environment satisfaction.
- Relationship_with_Manager: Rating of the relationship with the manager.
- Job_Involvement: Rating of job involvement.
- Distance_From_Home: Distance from home to the workplace (in kilometers).
- Number_of_Companies_Worked: Total number of companies the employee has worked for.
- Attrition: The target column (Yes/No) indicating whether the employee left the company.

### Importing Libraries

In [3]:
import pandas as pd
import numpy as np

### Read Data using pandas

In [5]:
employee_data = pd.read_csv("employee_attrition_dataset.csv")

employee_data

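|  | Employee_ID | Age | Gender | Marital_Status | Department | Job_Role | Job_Level | Monthly_Income | Hourly_Rate | Years_at_Company | ... | Overtime | Project_Count | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Relationship_with_Manager | Job_Involvement | Distance_From_Home | Number_of_Companies_Worked | Attrition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | Female | Married | IT | Manager | 1 | 15488 | 28 | 15 | ... | No | 6 | 54 | 17 | 4 | 4 | 4 | 20 | 3 | No |
| 1 | 2 | 48 | Female | Married | Sales | Assistant | 5 | 13079 | 28 | 6 | ... | Yes | 2 | 45 | 1 | 4 | 1 | 2 | 25 | 2 | No |
| 2 | 3 | 34 | Male | Married | Marketing | Assistant | 1 | 13744 | 24 | 24 | ... | Yes | 6 | 34 | 2 | 3 | 4 | 4 | 45 | 3 | No |
| 3 | 4 | 27 | Female | Divorced | Marketing | Manager | 1 | 6809 | 26 | 10 | ... | No | 9 | 48 | 18 | 2 | 3 | 1 | 35 | 3 | No |
| 4 | 5 | 40 | Male | Divorced | Marketing | Executive | 1 | 10206 | 52 | 29 | ... | No | 3 | 33 | 0 | 4 | 1 | 3 | 44 | 3 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | 27 | Female | Divorced | HR | Analyst | 2 | 4172 | 76 | 24 | ... | No | 4 | 46 | 10 | 3 | 1 | 4 | 24 | 4 | No |
| 996 | 997 | 47 | Male | Single | IT | Manager | 4 | 11007 | 71 | 19 | ... | Yes | 7 | 36 | 16 | 3 | 2 | 4 | 39 | 3 | Yes |
| 997 | 998 | 50 | Female | Divorced | IT | Executive | 1 | 4641 | 43 | 25 | ... | Yes | 1 | 46 | 9 | 2 | 3 | 3 | 33 | 2 | No |
| 998 | 999 | 28 | Female | Married | HR | Executive | 4 | 19855 | 92 | 13 | ... | No | 4 | 52 | 17 | 4 | 1 | 4 | 41 | 4 | No |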


## 1. Data Exploration

**Viewing the first and last five observations**

In [8]:
# viewing the dataset

employee_data.head()

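|  | Employee_ID | Age | Gender | Marital_Status | Department | Job_Role | Job_Level | Monthly_Income | Hourly_Rate | Years_at_Company | ... | Overtime | Project_Count | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Relationship_with_Manager | Job_Involvement | Distance_From_Home | Number_of_Companies_Worked | Attrition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 58 | Female | Married | IT | Manager | 1 | 15488 | 28 | 15 | ... | No | 6 | 54 | 17 | 4 | 4 | 4 | 20 | 3 | No |
| 1 | 2 | 48 | Female | Married | Sales | Assistant | 5 | 13079 | 28 | 6 | ... | Yes | 2 | 45 | 1 | 4 | 1 | 2 | 25 | 2 | No |
| 2 | 3 | 34 | Male | Married | Marketing | Assistant | 1 | 13744 | 24 | 24 | ... | Yes | 6 | 34 | 2 | 3 | 4 | 4 | 45 | 3 | No |
| 3 | 4 | 27 | Female | Divorced | Marketing | Manager | 1 | 6809 | 26 | 10 | ... | No | 9 | 48 | 18 | 2 | 3 | 1 | 35 | 3 | No |
| 4 | 5 | 40 | Male | Divorced | Marketing | Executive | 1 | 10206 | 52 | 29 | ... | No | 3 | 33 | 0 | 4 | 1 | 3 | 44 | 3 | No |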


In [9]:
employee_data.tail()

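|  | Employee_ID | Age | Gender | Marital_Status | Department | Job_Role | Job_Level | Monthly_Income | Hourly_Rate | Years_at_Company | ... | Overtime | Project_Count | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Relationship_with_Manager | Job_Involvement | Distance_From_Home | Number_of_Companies_Worked | Attrition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 995 | 996 | 27 | Female | Divorced | HR | Analyst | 2 | 4172 | 76 | 24 | ... | No | 4 | 46 | 10 | 3 | 1 | 4 | 24 | 4 | No |
| 996 | 997 | 47 | Male | Single | IT | Manager | 4 | 11007 | 71 | 19 | ... | Yes | 7 | 36 | 16 | 3 | 2 | 4 | 39 | 3 | Yes |
| 997 | 998 | 50 | Female | Divorced | IT | Executive | 1 | 4641 | 43 | 25 | ... | Yes | 1 | 46 | 9 | 2 | 3 | 3 | 33 | 2 | No |
| 998 | 999 | 28 | Female | Married | HR | Executive | 4 | 19855 | 92 | 13 | ... | No | 4 | 52 | 17 | 4 | 1 | 4 | 41 | 4 | No |
| 999 | 1000 | 48 | Female | Divorced | IT | Analyst | 2 | 11738 | 39 | 1 | ... | Yes | 2 | 59 | 5 | 1 | 4 | 3 | 43 | 2 | No |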
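
The `Attrition` column in the rows above is the Yes/No target. A minimal sketch of turning it into a numeric flag is shown here, using only the ten `Attrition` values printed by `head()` and `tail()`. The column name `Attrition_Flag` is a hypothetical choice for illustration, and the resulting rate describes this ten-row sample only, not the full dataset.

```python
import pandas as pd

# Attrition values copied from the ten rows printed by head() and tail();
# this is a tiny illustration, not the full 1,000-row dataset.
sample = pd.DataFrame({
    "Employee_ID": [1, 2, 3, 4, 5, 996, 997, 998, 999, 1000],
    "Attrition": ["No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No"],
})

# Map the Yes/No labels to 1/0 so they can be averaged or passed to a model.
# "Attrition_Flag" is a hypothetical column name.
sample["Attrition_Flag"] = sample["Attrition"].map({"Yes": 1, "No": 0})

# Share of employees who left, within this sample only.
attrition_rate = sample["Attrition_Flag"].mean()
print(attrition_rate)
```

On the full data, the same two lines could be applied to `employee_data` directly.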


**Checking the data types of each column**

In [11]:
employee_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Employee_ID                    1000 non-null   int64 
 1   Age                            1000 non-null   int64 
 2   Gender                         1000 non-null   object
 3   Marital_Status                 1000 non-null   object
 4   Department                     1000 non-null   object
 5   Job_Role                       1000 non-null   object
 6   Job_Level                      1000 non-null   int64 
 7   Monthly_Income                 1000 non-null   int64 
 8   Hourly_Rate                    1000 non-null   int64 
 9   Years_at_Company               1000 non-null   int64 
 10  Years_in_Current_Role          1000 non-null   int64 
 11  Years_Since_Last_Promotion     1000 non-null   int64 
 12  Work_Life_Balance              1000 non-null   int64 
 13  Job_
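
The `info()` output reports non-null counts and dtypes together. As a rough sketch, the same two questions (are there missing values, and which columns are non-numeric) can be answered separately. The example below uses a few columns from the five `head()` rows rather than the full CSV, so the variable names are illustrative only.

```python
import pandas as pd

# A few columns from the five head() rows shown earlier (not the full dataset).
sample = pd.DataFrame({
    "Age": [58, 48, 34, 27, 40],
    "Gender": ["Female", "Female", "Male", "Female", "Male"],
    "Department": ["IT", "Sales", "Marketing", "Marketing", "Marketing"],
    "Monthly_Income": [15488, 13079, 13744, 6809, 10206],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Split the columns into numeric and non-numeric groups.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(numeric_columns)
print(text_columns)
```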

**Getting Summary Statistics**

In [13]:
employee_data.describe()

|  | Employee_ID | Age | Job_Level | Monthly_Income | Hourly_Rate | Years_at_Company | Years_in_Current_Role | Years_Since_Last_Promotion | Work_Life_Balance | Job_Satisfaction | Performance_Rating | Training_Hours_Last_Year | Project_Count | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Relationship_with_Manager | Job_Involvement | Distance_From_Home | Number_of_Companies_Worked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 39.991 | 3.055 | 11499.899 | 57.837 | 14.922 | 7.539 | 4.408 | 2.495 | 3.151 | 2.527 | 50.043 | 4.877 | 44.553 | 9.524 | 2.494 | 2.519 | 2.503 | 24.507 | 2.484 |
| std | 288.819436 | 11.780055 | 1.399977 | 4920.529231 | 24.702037 | 8.350548 | 4.001061 | 2.99508 | 1.105077 | 1.426967 | 1.13073 | 28.204657 | 2.546833 | 8.704192 | 5.973534 | 1.110494 | 1.106736 | 1.099636 | 14.138099 | 1.111296 |
| min | 1.0 | 20.0 | 1.0 | 3001.0 | 15.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 30.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 250.75 | 30.0 | 2.0 | 7395.75 | 36.0 | 8.0 | 4.0 | 2.0 | 2.0 | 2.0 | 2.0 | 26.0 | 3.0 | 37.0 | 4.0 | 2.0 | 2.0 | 2.0 | 12.0 | 1.0 |
| 50% | 500.5 | 41.0 | 3.0 | 11256.0 | 58.0 | 15.0 | 8.0 | 4.0 | 3.0 | 3.0 | 3.0 | 50.0 | 5.0 | 45.0 | 9.0 | 2.0 | 3.0 | 3.0 | 24.0 | 2.0 |
| 75% | 750.25 | 50.25 | 4.0 | 15855.0 | 80.0 | 22.0 | 11.0 | 7.0 | 3.0 | 4.0 | 4.0 | 75.25 | 7.0 | 52.0 | 15.0 | 4.0 | 4.0 | 3.0 | 37.0 | 3.0 |
| max | 1000.0 | 59.0 | 5.0 | 19999.0 | 99.0 | 29.0 | 14.0 | 9.0 | 4.0 | 5.0 | 4.0 | 99.0 | 9.0 | 59.0 | 19.0 | 4.0 | 4.0 | 4.0 | 49.0 | 4.0 |


In [14]:
# Summary statistics for a few selected columns:

employee_data[["Age", "Monthly_Income", "Number_of_Companies_Worked", "Distance_From_Home"]].describe()

|  | Age | Monthly_Income | Number_of_Companies_Worked | Distance_From_Home |
| --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 39.991 | 11499.899 | 2.484 | 24.507 |
| std | 11.780055 | 4920.529231 | 1.111296 | 14.138099 |
| min | 20.0 | 3001.0 | 1.0 | 1.0 |
| 25% | 30.0 | 7395.75 | 1.0 | 12.0 |
| 50% | 41.0 | 11256.0 | 2.0 | 24.0 |
| 75% | 50.25 | 15855.0 | 3.0 | 37.0 |
| max | 59.0 | 19999.0 | 4.0 | 49.0 |


## 2. Calculating Basic Statistics

- Calculating the mean, median, mode and standard deviation for these columns: **Age, Monthly_Income, Number_of_Companies_Worked and Distance_From_Home**. Note that `describe()` reports the median as the `50%` row and does not report the mode.

In [17]:
employee_data[["Age", "Monthly_Income", "Number_of_Companies_Worked", "Distance_From_Home"]].describe()

|  | Age | Monthly_Income | Number_of_Companies_Worked | Distance_From_Home |
| --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 39.991 | 11499.899 | 2.484 | 24.507 |
| std | 11.780055 | 4920.529231 | 1.111296 | 14.138099 |
| min | 20.0 | 3001.0 | 1.0 | 1.0 |
| 25% | 30.0 | 7395.75 | 1.0 | 12.0 |
| 50% | 41.0 | 11256.0 | 2.0 | 24.0 |
| 75% | 50.25 | 15855.0 | 3.0 | 37.0 |
| max | 59.0 | 19999.0 | 4.0 | 49.0 |
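
Because `describe()` leaves out the mode, one possible way to get all four statistics in a single table is sketched below. It runs on the five `head()` rows only, so its numbers will differ from the full-dataset figures above. The helper name `basic_stats` is hypothetical, and when a column has several modes this sketch keeps only the smallest one.

```python
import pandas as pd

# Five rows copied from the head() output; the full CSV is not loaded here.
sample = pd.DataFrame({
    "Age": [58, 48, 34, 27, 40],
    "Monthly_Income": [15488, 13079, 13744, 6809, 10206],
    "Number_of_Companies_Worked": [3, 2, 3, 3, 3],
    "Distance_From_Home": [20, 25, 45, 35, 44],
})


def basic_stats(df):
    """Return mean, median, mode and standard deviation for each column."""
    stats = df.agg(["mean", "median", "std"])
    # mode() can return several values per column (sorted ascending);
    # the first row holds the smallest mode of each column.
    stats.loc["mode"] = df.mode().iloc[0]
    return stats.loc[["mean", "median", "mode", "std"]]


summary = basic_stats(sample)
print(summary)
```

With the real data, `basic_stats(employee_data[[...]])` could be called with the same four column names.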


- Calculating the correlation between different features:

##### Correlation between the Monthly_Income, Average_Hours_Worked_Per_Week, Absenteeism, Work_Environment_Satisfaction, Distance_From_Home and Job_Satisfaction columns:

In [20]:
employee_data[['Monthly_Income', 'Average_Hours_Worked_Per_Week', 'Absenteeism', 
               'Work_Environment_Satisfaction', 'Distance_From_Home', 'Job_Satisfaction']].corr()

|  | Monthly_Income | Average_Hours_Worked_Per_Week | Absenteeism | Work_Environment_Satisfaction | Distance_From_Home | Job_Satisfaction |
| --- | --- | --- | --- | --- | --- | --- |
| Monthly_Income | 1.0 | -0.045374 | -0.009032 | -0.010991 | -0.050925 | 0.011808 |
| Average_Hours_Worked_Per_Week | -0.045374 | 1.0 | 0.018371 | -0.032951 | 0.004113 | -0.023332 |
| Absenteeism | -0.009032 | 0.018371 | 1.0 | -0.046153 | 0.030145 | 0.059172 |
| Work_Environment_Satisfaction | -0.010991 | -0.032951 | -0.046153 | 1.0 | 0.007749 | -0.030696 |
| Distance_From_Home | -0.050925 | 0.004113 | 0.030145 | 0.007749 | 1.0 | 0.022995 |
| Job_Satisfaction | 0.011808 | -0.023332 | 0.059172 | -0.030696 | 0.022995 | 1.0 |
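
A correlation matrix is easier to read once the pairs are ranked by strength. The sketch below re-enters the coefficients from the output above, so it does not need the CSV, and sorts the fifteen distinct pairs by absolute value. The helper name `rank_pairs` is hypothetical.

```python
from itertools import combinations

import pandas as pd

columns = [
    "Monthly_Income", "Average_Hours_Worked_Per_Week", "Absenteeism",
    "Work_Environment_Satisfaction", "Distance_From_Home", "Job_Satisfaction",
]

# Coefficients copied from the .corr() output above.
values = [
    [1.0, -0.045374, -0.009032, -0.010991, -0.050925, 0.011808],
    [-0.045374, 1.0, 0.018371, -0.032951, 0.004113, -0.023332],
    [-0.009032, 0.018371, 1.0, -0.046153, 0.030145, 0.059172],
    [-0.010991, -0.032951, -0.046153, 1.0, 0.007749, -0.030696],
    [-0.050925, 0.004113, 0.030145, 0.007749, 1.0, 0.022995],
    [0.011808, -0.023332, 0.059172, -0.030696, 0.022995, 1.0],
]
corr = pd.DataFrame(values, index=columns, columns=columns)


def rank_pairs(corr_matrix):
    """List each distinct column pair once, strongest correlation first."""
    pairs = [
        (a, b, float(corr_matrix.loc[a, b]))
        for a, b in combinations(corr_matrix.columns, 2)
    ]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


ranked = rank_pairs(corr)
for first, second, coefficient in ranked[:3]:
    print(f"{first} vs {second}: {coefficient:+.6f}")
```

The strongest pair is Absenteeism and Job_Satisfaction at 0.059172, and every coefficient is below 0.06 in absolute value, so none of these pairs shows a meaningful linear relationship.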
