# HR Analytics: Understanding Employee Attrition
This notebook analyzes trends and patterns in employee attrition using an HR analytics dataset from Kaggle that covers Atlas Lab employees.

## Objectives
- Understand the distribution of employee attributes (age, department, role, etc.)
- Explore patterns in employee attrition
- Analyze job satisfaction and performance metrics
- Identify factors correlated with attrition

## Import Libraries

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

## Load and Merge Data

In [None]:
# Load Data
df_employee = pd.read_csv('Employee.csv')
df_performance = pd.read_csv('PerformanceRating.csv')

# Merge Data
df_combined = pd.merge(df_employee, df_performance, on='EmployeeID')
print(df_combined.shape)
df_combined.head()
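The merge above joins one employee table to one performance table on `EmployeeID`. If an employee can have several performance reviews, the merged frame has one row per review rather than one row per employee. Below is a minimal sketch of how that could be checked, using small made-up tables (the column values and the `employees`/`reviews` names are illustrative assumptions, not taken from the real files). It relies on the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy tables standing in for Employee.csv and PerformanceRating.csv.
employees = pd.DataFrame({
    "EmployeeID": ["E1", "E2", "E3"],
    "Attrition": ["No", "Yes", "No"],
})
reviews = pd.DataFrame({
    "PerformanceID": ["P1", "P2", "P3", "P4"],
    "EmployeeID": ["E1", "E1", "E2", "E9"],  # E9 has no employee record
    "ManagerRating": [4, 3, 2, 5],
})

# validate raises a MergeError if EmployeeID is duplicated in the left table;
# indicator adds a _merge column recording which table(s) each row came from.
merged = pd.merge(
    employees, reviews,
    on="EmployeeID", how="outer",
    validate="one_to_many", indicator=True,
)
print(merged["_merge"].value_counts())

# Rows matched in both tables versus the number of distinct employees behind them.
matched = merged[merged["_merge"] == "both"]
n_rows = len(matched)
n_employees = matched["EmployeeID"].nunique()
print(f"matched rows: {n_rows}, unique employees: {n_employees}")
```

With the default inner join used in the notebook, only the matched rows would be kept, so comparing `df_combined.shape[0]` with `df_combined['EmployeeID'].nunique()` gives the same information.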

## Data Cleaning
- Map binary and ordinal values
- Fix column types
- Drop unnecessary columns

In [None]:
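# Inspect column types and summary statistics
df_combined.info()
# describe() is not the last expression in the cell, so print it explicitly
print(df_combined.describe(include='all'))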

# Missing values
print(df_combined.isnull().sum())

# Map binary values
df_combined['Attrition'] = df_combined['Attrition'].map({'Yes':1, 'No': 0})
df_combined['OverTime'] = df_combined['OverTime'].map({'Yes': 1, 'No':0})


# Convert ordinal values
from pandas.api.types import CategoricalDtype
levels = CategoricalDtype(
    categories=[1,2,3,4,5],
    ordered=True
)

cols_to_convert = [
    'EnvironmentSatisfaction',
    'JobSatisfaction',
    'RelationshipSatisfaction',
    'WorkLifeBalance',
    'SelfRating',
    'ManagerRating',
    'Education',
]
for col in cols_to_convert:
    df_combined[col] = df_combined[col].astype(levels)

# convert remaining categorical columns
categorical_cols = ['BusinessTravel', 'Department', 'State', 'Ethnicity', 'EducationField', 'JobRole', 
                    'MaritalStatus', 'StockOptionLevel'
                    ]
for col in categorical_cols:
    df_combined[col] = df_combined[col].astype('category')

# Date columns
df_combined['HireDate'] = pd.to_datetime(df_combined['HireDate'])
df_combined['ReviewDate'] = pd.to_datetime(df_combined['ReviewDate'])

# Drop unused columns
drop_columns = ['FirstName', 'LastName', 'PerformanceID']

df_combined = df_combined.drop(columns=drop_columns)

# One-hot encode categorical variables (done after parsing dates and dropping
# name/ID columns so that those are not expanded into dummy columns)
df_encoded = pd.get_dummies(df_combined, drop_first=True)

# Save cleaned data
df_combined.to_csv('cleaned_employee_data.csv', index=False)
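Two of the cleaning steps above can fail silently: `.map()` returns `NaN` for any value that is not a key in the dictionary, and the ordered dtype only pays off if comparisons are actually used. The following is a small sketch of both points on made-up values (the `toy` frame and the stray label `"Y"` are assumptions for illustration).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy data with one unexpected OverTime label ("Y").
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "Y"],
    "JobSatisfaction": [2, 5, 3],
})

# .map() leaves NaN wherever the original value is missing from the dictionary,
# so list the offending labels before overwriting the column.
mapped = toy["OverTime"].map({"Yes": 1, "No": 0})
unmapped = toy.loc[mapped.isna(), "OverTime"].tolist()
print("Unmapped labels:", unmapped)

# An ordered categorical supports comparisons and min/max on the 1-5 scale.
levels = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
rating = toy["JobSatisfaction"].astype(levels)
high = (rating >= 4).tolist()
print("Rating >= 4:", high, "| max rating:", rating.max())
```

Note that `to_csv` does not store dtypes, so the categorical and datetime columns would need to be converted again after reloading `cleaned_employee_data.csv`.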


## 1. Exploratory Data Analysis - Overview of Dataset

In [None]:
# Set visual style
sns.set(style='whitegrid', palette='dark')
plt.rcParams['figure.figsize'] = (10, 6)

# Work on a copy of the cleaned data
df = df_combined.copy()

print(df.shape)
df.info()
df.describe()

Define helper functions that print summary statistics for a single column.

In [None]:
# Create function to print out basic data stats 
# for qualitative variables
def qual_summary(df, column):
    """
    Prints the count and percentage of each category in the specified column.
    Parameters: 
    df (DataFrame): The DataFrame containing the data.
    column (str): The name of the column to summarize.
    """
    total = len(df)
    counts = df[column].value_counts()
    percentages = counts / total * 100

    print(f"\nColumn: {column}")
    for category, count in counts.items():
        print(f"{category}: {count} ({percentages[category]:.2f}%)")

# Same concept but for quantitative variables
def quant_summary(df, column):
    """
    Prints summary stats for a numeric column:
    (mean, median, mode, min, max, and range)

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    column (str): The name of the numeric column to summarize.
    """
    series = df[column].dropna() # for missing values
    mean = series.mean()
    median = series.median()
    mode = series.mode().values[0]
    min_val = series.min()
    max_val = series.max()
    range_val = max_val - min_val
    print(f"\nSummary of '{column}':")
    print(f"Mean: {mean:.2f}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print(f"Min: {min_val}")
    print(f"Max: {max_val}")
    print(f"Range: {range_val}")
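One detail of `qual_summary` is worth knowing: it divides by `len(df)`, which counts rows where the column is missing, while `value_counts()` drops missing values by default, so the printed percentages may not sum to 100% when a column has gaps. A possible variant is sketched below; `qual_table` is a hypothetical helper, not part of the original notebook, that returns a DataFrame and shows the missing share explicitly.

```python
import pandas as pd

def qual_table(df, column):
    """Return counts and percentages per category, including missing values.

    Hypothetical helper for illustration; mirrors qual_summary but returns
    a DataFrame instead of printing.
    """
    # dropna=False keeps a row for missing values so the shares sum to 100%.
    table = df[column].value_counts(dropna=False).to_frame("count")
    table["percent"] = (table["count"] / len(df) * 100).round(2)
    return table

# Made-up example data with one missing department.
toy = pd.DataFrame({"Department": ["Technology", "Technology", "Sales", None]})
table = qual_table(toy, "Department")
print(table)
```

Returning a table rather than printing also makes it easier to reuse the numbers in plots or later cells.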

## 2. Univariate Analysis
- Count plots: Attrition, Gender, Department
- Histograms: Age

In [None]:
# Attrition Count
sns.countplot(x='Attrition', data=df)
plt.title('Employee Attrition Count')
plt.show()
qual_summary(df, 'Attrition')

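Of the 6,709 rows in the merged data, 4,448 (66.3%) belong to employees who stayed and 2,261 (33.7%) to employees who left. These counts are over rows of the merged table, so an employee with more than one performance review is counted once per review; the figures describe records rather than unique employees. The data contains no external benchmark, so whether this attrition rate is high or low cannot be judged from these counts alone.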
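The percentages follow directly from the counts, and the same calculation can be repeated per unique employee. Here is a brief sketch, assuming `EmployeeID` identifies an employee across rows and that `Attrition` is constant for each employee; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Shares implied by the counts reported in the text.
stayed, left = 4448, 2261
total = stayed + left
retention_pct = round(stayed / total * 100, 1)
attrition_pct = round(left / total * 100, 1)
print(total, retention_pct, attrition_pct)

# Hypothetical merged rows: E1 has three reviews, E2 and E3 have one each.
toy = pd.DataFrame({
    "EmployeeID": ["E1", "E1", "E1", "E2", "E3"],
    "Attrition": [0, 0, 0, 1, 1],
})

# Row-level rate: every review row counts once.
row_rate = toy["Attrition"].mean()

# Employee-level rate: keep one row per employee before averaging.
employee_rate = toy.drop_duplicates("EmployeeID")["Attrition"].mean()

print(f"row-level: {row_rate:.1%}, employee-level: {employee_rate:.1%}")
```

In the toy data the two rates differ because the employee who stayed has more reviews than those who left; the same effect could shift the row-level figures in the real data.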

In [None]:
# Distribution of Age
sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.show()
quant_summary(df, 'Age')

### Key Observations:
- The age distribution peaks between ages 25 and 30, with the largest number of employees around age 25.
- After age 30, the number of employees gradually decreases, resulting in a more even spread across older age groups.
- The youngest employees are 18 and the oldest are 51.
- This suggests that the company tends to hire younger adults, many of whom may be early in their careers, rather than older, more experienced employees.
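Reading a peak off a histogram depends on the bin edges that seaborn picks. One way to back up the observation with numbers is to bin ages explicitly with `pd.cut`. The sketch below uses made-up ages and five-year bins chosen for illustration; in the notebook, `df['Age']` would replace the `ages` series.

```python
import pandas as pd

# Hypothetical ages; replace with df['Age'] in the notebook.
ages = pd.Series([18, 22, 25, 25, 26, 27, 29, 31, 38, 44, 51])

# Five-year bins, closed on the left: [15, 20), [20, 25), ..., [50, 55).
bins = list(range(15, 60, 5))
binned = pd.cut(ages, bins=bins, right=False)

# Count per bin in age order (empty bins are kept with a count of 0).
counts = binned.value_counts().sort_index()
print(counts)

# The bin with the most employees.
peak = counts.idxmax()
print("Most common age band:", peak)
```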

In [None]:
# Gender Count
sns.countplot(x='Gender', data=df)
plt.title('Employee Gender Count')
plt.show()
qual_summary(df, 'Gender')

### Key Observations: 
- 46% of the employees are female, while 44.24% are male.
- 8.7% identify as non-binary.
- Only 0.91% of employees chose not to disclose their gender.
- The large majority of employees (about 90%) identify as either female or male.
- Female representation is slightly higher than male, so the two groups are close to balanced, though neither forms an outright majority.

In [None]:
# Department Count
sns.countplot(x='Department', data=df)
plt.title('Department Count')
plt.show()
qual_summary(df, 'Department')

### Key Observations:
- 63.45% of employees work in Technology.
- 32.03% of employees work in Sales. 
- 4.52% of employees work in Human Resources.
- The workforce is heavily concentrated in Technology roles, suggesting that this is a tech-focused company, which is consistent with the name Atlas Lab.