# **EDA of Employee Data to Predict Attrition Trends**
# **By Amit Kharche**
**Follow me** on [LinkedIn](https://www.linkedin.com/in/amit-kharche) and [Medium](https://medium.com/@amitkharche14) for more insights on **Data Science** and **AI**.

<center><img width=20% src="https://th.bing.com/th/id/R.6810463b19119cfef654450c8c3d242f?rik=HNwkiOJCEeq7Ew&riu=http%3a%2f%2feastcoast-trading.com%2fwp-content%2fuploads%2frevslider%2fgrid_slider_7%2fpeople.jpg&ehk=4x5Oztx4%2fNCK5XoKfdTph0iPQEU1cRK2EQGxOFhhTZ4%3d&risl=&pid=ImgRaw&r=0"></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
**4.** [**Data Acquisition & Description**](#Section4)<br>
**5.** [**Data Pre-Profiling**](#Section5)<br>
**6.** [**Data Pre-Processing**](#Section6)<br>
**7.** [**Data Post-Profiling**](#Section7)<br>
**8.** [**Exploratory Data Analysis**](#Section8)<br>
**9.** [**Summarization**](#Section9)<br>

---
<a name = Section1></a>

# **1. Introduction**

Employee retention is a critical concern for any organization, especially in the fast-paced and competitive environment of the software industry. High attrition rates not only disrupt team dynamics but also increase recruitment and training costs. Traditionally, the HR department at this software company has relied on **exit interviews** to understand why employees choose to leave. While these interviews offer valuable insights, they are inherently **reactive**—conducted only after the decision to exit has been made.

Recognizing these limitations, the HR team is initiating a **proactive, data-driven approach** to enhance employee retention. The goal is to analyze historical data to uncover trends and signals indicative of potential attrition.

This EDA focuses on **permanent employees**—the company's long-term talent. The dataset includes demographics, job roles, performance, and employment status.

### Objectives:
- Assess data structure and quality.
- Identify attrition-related patterns.
- Generate insights to inform HR strategies.

This foundation may later support predictive modeling.


---
<a name = Section2></a>

# **2. Problem Statement**

The HR department of a software company currently depends on **exit interviews** to determine why employees leave. However, this **reactive method** has several drawbacks:

- Insights vary in quality based on who conducts the interview.
- Data from interviews is hard to consolidate and analyze at scale.
- Crucially, feedback arrives too late—after the employee has left.

To overcome these limitations, HR seeks a **proactive, data-driven strategy** by examining historical employee data. The focus is on uncovering **patterns and drivers** of attrition among **permanent employees**.

This project centers on conducting **Exploratory Data Analysis (EDA)** using a dataset that includes attributes of both current and former employees.

### Objectives:
- Discover features correlated with attrition.
- Analyze demographic and organizational factors influencing retention.
- Derive actionable insights to guide HR interventions.

Ultimately, this EDA aims to support a forward-looking HR strategy and enable early engagement with at-risk employees.


---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [7]:
!pip install -q datascience         # Data-science utilities package
!pip install -q pandas-profiling    # Legacy name of the profiling package
!pip install -q ydata-profiling     # Current name of the profiling package (generates data profiling reports)
!pip install -q folium==0.5.0       # For interactive map visualizations
!pip install -q sweetviz            # For visual exploratory data analysis
!pip install -q xverse              # Feature engineering and selection helpers

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, **restart the runtime** so that the upgraded versions are the ones that get loaded.

- After restarting, do not re-run the install cell (3.1) or the upgrade cell in this section (3.2).

In [None]:
!pip install -q --upgrade datascience
!pip install -q --upgrade ydata-profiling

<a name = Section33></a>
### **3.3 Importing Libraries**

In [10]:
import numpy as np
from numpy import isnan
import pandas as pd
from ydata_profiling import ProfileReport  # pandas-profiling was renamed to ydata-profiling
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import datetime

In [12]:
import folium                                          # Importing folium for interactive map visualizations
import pandas as pd                                    # Importing for panel data analysis
from ydata_profiling import ProfileReport              # Importing the profiling report (pandas-profiling was renamed to ydata-profiling)
pd.set_option('display.max_columns', None)             # Show every column instead of truncating wide frames
pd.set_option('display.max_colwidth', None)            # Show full cell contents for better clarity
pd.set_option('display.max_rows', None)                # Show every row instead of truncating long frames
pd.set_option('mode.chained_assignment', None)         # Suppress the pandas warning on chained assignment
pd.set_option('display.float_format', lambda x: '%.5f' % x) # Show floats with 5 decimals instead of scientific notation
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                      # Importing NumPy (Numerical Python)
np.set_printoptions(precision = 4)                      # Display array values up to 4 decimal places
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                          # Importing pyplot interface of matplotlib
plt.style.use('seaborn-v0_8-whitegrid')                  # Seaborn whitegrid style; named 'seaborn-whitegrid' before matplotlib 3.6
%matplotlib inline
import seaborn as sns                                    # Importing seaborn for statistical visualization
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)   # Adjust seaborn settings for plots
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.graph_objs as go                            # For interactive graphs
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                           # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                         # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---


- The first dataset is `department_data`.
- It contains information about each department. The schema of the dataset is as follows:


| ID | Feature Name | Description of the feature |
| :-- | :--| :--|
|01| **dept_id**   | Unique Department Code|
|02| **dept_name**      | Name of the Department|
|03| **dept_head**        |Name of the Head of the Department|



- The second dataset is `employee_details_data`.
- It contains each employee's ID, age, gender, and marital status. The schema of this dataset is as follows:



| ID | Feature Name | Description of the feature |
| :-- | :--| :--|
|01| **employee_id**      | Unique ID Number for each employee|
|02| **age**      | Age of the employee|
|03| **gender**        |Gender of the employee|
|04| **marital_status**        |Marital Status of the employee|

- The third dataset is `employee_data`.
- It contains each employee’s administrative information, workload information, mutual evaluation information, and status. The schema of this dataset is as follows:




| ID | Feature Name | Description of the feature |
| :-- | :--| :--|
|01| **status**      | Current employment status (Employed / Left)|
|02| **department**      | Department to which the employee belongs or belonged|
|03| **salary**        | Salary level relative to the rest of the employee's department|
|04| **tenure**        | Number of years at the company|
|05| **recently_promoted**      | Was the employee promoted in the last 3 years?|
|06| **employee_id**      | Unique ID Number for each employee|
|07| **n_projects**        | Number of projects employee has worked on|
|08| **avg_monthly_hrs**        | Average number of hours worked per month|
|09| **satisfaction**      |Score for employee’s satisfaction with the company (higher is better)|
|10| **last_evaluation**      |Score for most recent evaluation of employee (higher is better)|
|11| **filed_complaint**        |Has the employee filed a formal complaint in the last 3 years?|
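
The three schemas above share keys, so the tables can be combined into one analysis frame. The sketch below is illustrative only: the rows are made-up sample values, and it assumes that `department` in `employee_data` stores the department code found in `dept_id`, which should be checked against the real data.

```python
import pandas as pd

# Hypothetical sample rows that follow the three schemas; all values are invented.
department_data = pd.DataFrame({
    "dept_id": ["D00-IT", "D00-SS"],
    "dept_name": ["IT", "Sales"],
    "dept_head": ["Head A", "Head B"],
})
employee_details_data = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "age": [29, 41, 35],
    "gender": ["Female", "Male", "Female"],
    "marital_status": ["Married", "Unmarried", "Married"],
})
employee_data = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "department": ["D00-IT", "D00-SS", "D00-IT"],
    "status": ["Employed", "Left", "Employed"],
    "tenure": [3, 5, 2],
})

# Assumption: employee_data["department"] holds the department code (dept_id).
# validate= makes pandas raise if the key relationship is not what we expect.
merged = (
    employee_data
    .merge(employee_details_data, on="employee_id", how="left", validate="one_to_one")
    .merge(department_data, left_on="department", right_on="dept_id",
           how="left", validate="many_to_one")
)
print(merged.shape)
```

Left joins keep every row of `employee_data`, so employees with no matching details or department show up as missing values rather than disappearing silently.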



In [None]:
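# NOTE: this URL points to a wine-quality CSV, not the employee datasets described above; substitute the employee data source before running.
data = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/winequality.csv')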
print('Data Shape:', data.shape)
data.head()

### **Data Description**

- For a quick statistical summary of the data, use the `describe` method of a pandas DataFrame.
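
A minimal sketch of `describe` on a few hypothetical rows; the column names come from the `employee_data` schema, while the values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows using column names from the employee_data schema.
sample = pd.DataFrame({
    "tenure": [2, 3, 5, 3],
    "n_projects": [3, 4, 6, 2],
    "avg_monthly_hrs": [160.0, 210.0, 280.0, 150.0],
    "status": ["Employed", "Employed", "Left", "Employed"],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = sample.describe()

# A text column: count, number of distinct values, most frequent value and its frequency.
status_summary = sample["status"].describe()

print(numeric_summary)
print(status_summary)
```

By default `describe` on a mixed DataFrame reports only the numeric columns, which is why the `status` column is summarized separately here.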

### **Data Information**
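
As a sketch of what this step could look like, `DataFrame.info` lists each column with its non-null count and dtype. The sample frame below is hypothetical and only borrows column names from the schemas above.

```python
import io
import pandas as pd

# Hypothetical rows; satisfaction has one missing value on purpose.
sample = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "satisfaction": [0.82, None, 0.41],
    "status": ["Employed", "Left", "Employed"],
})

# info() prints to stdout by default; passing a buffer captures the text instead.
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# The same non-null counts, as a Series that can be used programmatically.
non_null_counts = sample.notna().sum()
```

A column whose non-null count is lower than the number of rows contains missing values and is a candidate for the pre-processing step.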

---
<a name = Section5></a>
# **5. Data Pre-Profiling**
---

- This section focuses on generating a profiling report of the raw data.

- Run pandas profiling on the data and record the observations it surfaces.
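
A full profiling report needs the profiling package installed earlier. As a lighter stand-in, the sketch below collects a few of the same per-column facts with pandas alone. The helper name `quick_profile` and the sample rows are hypothetical and are not part of any library.

```python
import numpy as np
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing percentage, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })

# Hypothetical rows: one missing satisfaction, one missing department,
# and the first row repeated at the end as an exact duplicate.
sample = pd.DataFrame({
    "employee_id": [101, 102, 103, 101],
    "satisfaction": [0.82, np.nan, 0.41, 0.82],
    "department": ["D00-IT", "D00-SS", None, "D00-IT"],
})

profile = quick_profile(sample)
n_duplicate_rows = int(sample.duplicated().sum())

print(profile)
print("Duplicate rows:", n_duplicate_rows)
```

The observations to write down are the same ones a profiling report would highlight: which columns have missing values, which have suspiciously few or many distinct values, and how many rows are exact duplicates.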

---
<a name = Section6></a>
# **6. Data Pre-Processing**
---

- This section focuses on cleaning and restructuring the raw data for further processing and analysis.

- To turn unstructured data into structured data, verify and repair the integrity of the data by:
  - Handling missing data
  - Handling redundant data
  - Handling inconsistent data
  - Handling outliers
  - Handling typos
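
The steps listed above can be sketched on a small hypothetical frame. The fill values, the `"Unknown"` label, and the 1.5 × IQR rule are illustrative assumptions, not choices taken from this project; the right treatment depends on what the profiling report shows.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows with one of each problem.
raw = pd.DataFrame({
    "employee_id": [101, 102, 103, 103, 104, 105],
    "department": ["D00-IT", "d00-it ", "D00-SS", "D00-SS", None, "D00-SS"],
    "satisfaction": [0.80, 0.60, 0.50, 0.50, np.nan, 0.70],
    "avg_monthly_hrs": [160.0, 170.0, 180.0, 180.0, 175.0, 900.0],
})
clean = raw.copy()

# Inconsistent data and typos: normalize case and stray whitespace in labels.
clean["department"] = clean["department"].str.strip().str.upper()

# Redundant data: drop exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Missing data: median for a numeric score, an explicit label for a category.
clean["satisfaction"] = clean["satisfaction"].fillna(clean["satisfaction"].median())
clean["department"] = clean["department"].fillna("Unknown")

# Outliers: flag values outside 1.5 * IQR rather than silently changing them.
q1 = clean["avg_monthly_hrs"].quantile(0.25)
q3 = clean["avg_monthly_hrs"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean["hrs_outlier"] = ~clean["avg_monthly_hrs"].between(lower, upper)

print(clean)
```

Flagging outliers instead of dropping or capping them keeps the decision visible, so it can be revisited once the domain meaning of an extreme value is understood.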

---
<a name = Section7></a>
# **7. Data Post-Profiling**
---

- This section focuses on generating a profiling report of the data after pre-processing.

- The cleaned data may show new changes, so compare the report with the earlier one and record accurate observations.
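
One way to keep those changes under check is a side-by-side summary of the data before and after cleaning. The helper `compare_profiles` and the sample frames below are hypothetical, written only to illustrate the idea.

```python
import numpy as np
import pandas as pd

def compare_profiles(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Missing and distinct-value counts per column, before and after cleaning."""
    return pd.DataFrame({
        "missing_before": before.isna().sum(),
        "missing_after": after.isna().sum(),
        "unique_before": before.nunique(),
        "unique_after": after.nunique(),
    })

# Hypothetical data: one missing score and an inconsistently cased salary label.
before = pd.DataFrame({
    "satisfaction": [0.8, np.nan, 0.4, 0.4],
    "salary": ["low", "Low", "medium", "medium"],
})

# Illustrative cleaning steps: lower-case the labels, fill the gap with the median.
after = before.copy()
after["salary"] = after["salary"].str.lower()
after["satisfaction"] = after["satisfaction"].fillna(after["satisfaction"].median())

comparison = compare_profiles(before, after)
print(comparison)
print("Rows before/after:", len(before), len(after))
```

A drop in distinct values for a categorical column usually means label cleanup worked, while an unexpected drop in row count is worth investigating before moving on.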

---
<a name = Section8></a>
# **8. Exploratory Data Analysis**
---

- This section focuses on asking the right questions and answering them through analysis of the data.

- There is no limit to how deep you can go, but take care not to drift off track.
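
Two typical questions here are "what share of employees left?" and "which features move together with leaving?". The sketch below answers both on a few invented rows; the column names follow the `employee_data` schema, and the binary `left` column is a helper introduced only for this example.

```python
import pandas as pd

# Hypothetical rows using column names from the employee_data schema.
employees = pd.DataFrame({
    "status": ["Left", "Employed", "Employed", "Left", "Employed", "Employed"],
    "salary": ["low", "low", "medium", "low", "high", "medium"],
    "satisfaction": [0.20, 0.75, 0.80, 0.35, 0.90, 0.60],
    "avg_monthly_hrs": [270.0, 180.0, 170.0, 260.0, 160.0, 190.0],
})

# Helper column (an assumption of this sketch): 1 if the employee left, else 0.
employees["left"] = (employees["status"] == "Left").astype(int)

# Overall attrition rate and attrition rate per salary level.
overall_rate = employees["left"].mean()
rate_by_salary = (
    employees.groupby("salary")["left"].mean().sort_values(ascending=False)
)

# Pearson correlation of each numeric feature with the helper column.
corr_with_left = (
    employees[["satisfaction", "avg_monthly_hrs", "left"]].corr()["left"].drop("left")
)

print(f"Overall attrition rate: {overall_rate:.2%}")
print(rate_by_salary)
print(corr_with_left)
```

The same `groupby` pattern applies to `department`, `tenure`, or `recently_promoted`. Correlation only indicates association, so any pattern found this way is a lead to examine further, not evidence of cause.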

---
<a name = Section9></a>
# **9. Summarization**
---

<a name = Section91></a>
### **9.1 Conclusion**

- In this part you need to provide a conclusion about your overall analysis.

- Write down some short points that you have observed so far.

<a name = Section92></a>
### **9.2 Actionable Insights**

- This is a crucial part, where you present your actionable insights.
- Suggest what could be applied and what should not be.
- Keep these suggestions short and to the point; ultimately, they are a catalyst for the business.