# HR Analytics: Understanding Employee Attrition
This notebook analyzes trends and patterns in employee attrition using an HR analytics dataset from Kaggle that covers Atlas Lab employees.

## Objectives
- Understand the distribution of employee attributes (age, department, role, etc.)
- Explore patterns in employee attrition
- Analyze job satisfaction and performance metrics
- Identify factors correlated with attrition

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis, normaltest, probplot

## Load and Merge Data

In [None]:
# Load Data
df_employee = pd.read_csv('Employee.csv')
df_performance = pd.read_csv('PerformanceRating.csv')

# Merge Data
df_combined = pd.merge(df_employee, df_performance, on='EmployeeID')
print(df_combined.shape)
df_combined.head()
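
Since `PerformanceRating.csv` presumably holds several reviews per employee, the merged frame may contain more than one row per `EmployeeID`, and the default inner join drops any employee with no review. The sketch below shows one way this could be checked, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and their values are invented for illustration and are not taken from the Kaggle files.

```python
import pandas as pd

# Toy stand-ins for Employee.csv and PerformanceRating.csv (values are invented).
employees = pd.DataFrame({"EmployeeID": ["A1", "B2", "C3"], "Age": [25, 31, 40]})
reviews = pd.DataFrame({
    "EmployeeID": ["A1", "A1", "B2"],
    "PerformanceID": ["PR01", "PR02", "PR03"],
})

# validate raises MergeError if EmployeeID is duplicated on the employee side;
# how='left' with indicator=True keeps employees with no review and labels them.
merged = pd.merge(
    employees, reviews, on="EmployeeID",
    how="left", validate="one_to_many", indicator=True,
)

rows_per_employee = merged.groupby("EmployeeID").size()
no_review = merged.loc[merged["_merge"] == "left_only", "EmployeeID"].tolist()

print(rows_per_employee)
print("Employees without a review:", no_review)
```

If employees do appear on several rows, counts and percentages computed on the merged frame describe review records rather than unique employees. Applying `drop_duplicates(subset='EmployeeID')` before summarizing employee-level attributes would be one way to handle that.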

## Data Cleaning
- dropping unnecessary columns 
- fixing column types
- mapping binary and ordinal values

In [None]:
# Missing values
print(df_combined.isnull().sum())

# Map binary values
df_combined['Attrition'] = df_combined['Attrition'].map({'Yes':1, 'No': 0})
df_combined['OverTime'] = df_combined['OverTime'].map({'Yes': 1, 'No':0})


# Convert ordinal values
from pandas.api.types import CategoricalDtype
levels = CategoricalDtype(
    categories=[1,2,3,4,5],
    ordered=True
)

cols_to_convert = [
    'EnvironmentSatisfaction',
    'JobSatisfaction',
    'RelationshipSatisfaction',
    'WorkLifeBalance',
    'SelfRating',
    'ManagerRating',
    'Education',
]
for col in cols_to_convert:
    df_combined[col] = df_combined[col].astype(levels)

# convert remaining categorical columns
categorical_cols = ['BusinessTravel', 'Department', 'State', 'Ethnicity', 'EducationField', 'JobRole', 
                    'MaritalStatus', 'StockOptionLevel', 'TrainingOpportunitiesWithinYear',
                    'TrainingOpportunitiesTaken'
                    ]
for col in categorical_cols:
    df_combined[col] = df_combined[col].astype('category')

# Date columns
df_combined['HireDate'] = pd.to_datetime(df_combined['HireDate'])
df_combined['ReviewDate'] = pd.to_datetime(df_combined['ReviewDate'])

# Drop unused columns
drop_columns = ['FirstName', 'LastName', 'PerformanceID']

df_combined = df_combined.drop(drop_columns, axis=1)

# Save cleaned data
df_combined.to_csv('cleaned_employee_data.csv', index=False)
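
Two details of the cleaning step are easy to overlook. `Series.map` returns NaN for any label missing from the dictionary, and the missing-value check above runs before the mapping. The following sketch, on invented toy data, shows how unmapped labels could be detected and what the ordered categorical type enables.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data (invented); the lowercase 'yes' is a deliberately unexpected label.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "yes"],
    "JobSatisfaction": [1, 4, 5],
})

# .map() returns NaN for any value that is missing from the dictionary.
mapped = toy["OverTime"].map({"Yes": 1, "No": 0})
unmapped = toy.loc[mapped.isna() & toy["OverTime"].notna(), "OverTime"].unique().tolist()
print("Unmapped labels:", unmapped)

# An ordered categorical keeps the 1-5 ranking, so comparisons and min/max work.
levels = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
satisfaction = toy["JobSatisfaction"].astype(levels)
high = (satisfaction >= 4).tolist()
print("Satisfaction of 4 or higher:", high)
```

Re-running `isnull().sum()` after the mapping would surface the same problem on the real data.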


## 1. Exploratory Data Analysis - Overview of Dataset

In [None]:
# Set visual style
sns.set(style='whitegrid', palette='dark')
plt.rcParams['figure.figsize'] = (10, 6)

# Work on a copy so the cleaned frame stays untouched
df = df_combined.copy()

print(df.shape)
df.info()
df.describe()

## Function 1: ql_stats
- a function to handle qualitative columns
- runs summary (counts & percentages) for the column parameter
- displays how many unique responses a category has, the most occurring response, and how many employees chose that response.
- plots countplot for visualization.

In [None]:
def ql_stats(df, col):
    """
    Prints the summary and percentage of each category 
    in the specified column w/count plots.

    Parameters: 
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the column to summarize.
    """
    print(f"\n--- Categorical Summary: {col} ---")
    counts = df[col].value_counts(dropna=False)
    percentages = df[col].value_counts(normalize=True, dropna=False) * 100

    summary = pd.DataFrame({
        'Count': counts,
        'Percentages': percentages.round(2)
    })
    print(summary)
    print(f"Unique categories: {df[col].nunique(dropna=False)}")
    print(f"Most frequent: {df[col].mode()[0]}")

    sns.countplot(x=col, data=df)
    plt.title(f'Distribution of {col}')
    plt.show()
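
One way to sanity-check the summary table printed by `ql_stats` is to rebuild it on a small column where the counts can be verified by hand. In the sketch below, the helper name `category_summary` is hypothetical, the toy values are invented, and the plot is left out.

```python
import pandas as pd

def category_summary(series):
    """Return counts and percentages per category, keeping NaN as its own row."""
    counts = series.value_counts(dropna=False)
    percentages = series.value_counts(normalize=True, dropna=False) * 100
    return pd.DataFrame({"Count": counts, "Percentages": percentages.round(2)})

# Toy column (invented) with one missing value.
toy = pd.Series(["Sales", "Sales", "HR", None], name="Department")
summary = category_summary(toy)
print(summary)
```

Because `dropna=False` is used, missing values get their own row and the percentages are shares of all rows, not only of the non-missing ones.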

## Function 2: qn_stats
- a function to handle quantitative data
- displays a general summary for the column parameter
- skewness: measures how asymmetric the distribution is around its mean (a positive value means a longer right tail)
- kurtosis: measures how heavy the tails are compared with a normal distribution
- plots histogram + kde
- runs a normality test to check whether the data is normally distributed
- plots a QQ-plot to visualize how closely the data follows the normal distribution

In [None]:
def qn_stats(df, col):
    """
    Prints out summary for quantitative columns and creates 
    histogram + KDE plots.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the numeric column to summarize.
    """
    print(f"\n--- Numerical Summary: {col} ---")
    desc = df[col].describe()
    print(desc)

    print(f"Mode: {df[col].mode()[0]}")
    print(f"Skewness: {skew(df[col].dropna()):.2f}")
    print(f"Kurtosis: {kurtosis(df[col].dropna()):.2f}")
    
    # Histogram + KDE
    sns.histplot(df[col], kde=True, bins=20)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

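    # Normality Test (drop NaN first: normaltest propagates NaN by default)
    stat, p = normaltest(df[col].dropna())
    print("\nD'Agostino and Pearson Test:")
    print(f" Statistic = {stat:.4f}, p-value = {p:.4f}")
    if p < 0.05:
        print("Data is not normally distributed.")
    else:
        # A large p-value does not prove normality; it only fails to reject it
        print("No significant departure from normality detected.")

    # QQ Plot (NaN dropped for the same reason as above)
    plt.figure(figsize=(6,6))
    probplot(df[col].dropna(), dist='norm', plot=plt)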
    plt.title(f'QQ-Plot of {col}')
    plt.xlabel("Theoretical Quantiles")
    plt.ylabel('Sample Quantiles')
    plt.grid(True)
    plt.show()
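
`normaltest` propagates NaN by default (`nan_policy='propagate'`), so a column with missing values needs them removed before testing. A minimal sketch of the difference, on an invented sample:

```python
import numpy as np
from scipy.stats import normaltest

# Toy sample (invented): 200 normal draws with one missing value appended.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=30, scale=8, size=200), np.nan)

# With the default nan_policy='propagate', a single NaN makes the result NaN.
stat_raw, p_raw = normaltest(values)

# Dropping NaN beforehand gives a usable statistic and p-value.
clean = values[~np.isnan(values)]
stat_ok, p_ok = normaltest(clean)

print(f"with NaN:    statistic={stat_raw}, p={p_raw}")
print(f"NaN removed: statistic={stat_ok:.4f}, p={p_ok:.4f}")
```

On a pandas column, `df[col].dropna()` does the same job as the boolean mask used here.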

## Function 3: dt_stats
- properly handles datetime data type variables
- summarizes the values under each datetime variable.
- visualizes monthly and yearly time series plots.

In [None]:
def dt_stats(df, col):
    """
    Summarizes datetime columns and plots time series trends.
    
    Parameters:
    df (DataFrame): The DataFrame containing the data.
    col (str): The name of the datetime column to summarize.
    """
    print(f"\n--- Datetime Summary: {col} ---")
    print(f"Min date: {df[col].min()}")
    print(f"Max date: {df[col].max()}")
    print(f"Unique dates: {df[col].nunique(dropna=False)}")

    # Monthly counts ('M' is the month-end alias; pandas 2.2+ prefers 'ME')
    dt_monthly = df.set_index(col).resample('M').size()
    plt.figure(figsize=(14,7))
    dt_monthly.plot(marker='o')
    plt.title(f"Monthly count of {col}")
    plt.xlabel("Month")
    plt.ylabel("Count")
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # Yearly counts ('Y' is the year-end alias; pandas 2.2+ prefers 'YE')
    dt_yearly = df.set_index(col).resample('Y').size()
    plt.figure(figsize=(14, 7))
    dt_yearly.plot(marker='o')
    plt.title(f"Yearly Count of {col}")
    plt.xlabel("Year")
    plt.ylabel("Count")
    plt.grid(True)
    plt.tight_layout()
    plt.show()
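
The monthly and yearly counts can also be built without `resample`, which may be convenient when the column contains missing dates. The sketch below is an alternative to the approach in `dt_stats`, not a replacement for it, and the toy dates are invented.

```python
import pandas as pd

# Toy hire dates (invented), including one missing value.
toy = pd.DataFrame({
    "HireDate": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-03-11", "2022-07-01", None]
    )
})

dates = toy["HireDate"].dropna()

# Count rows per calendar month (as Period labels) and per calendar year.
monthly = dates.dt.to_period("M").value_counts().sort_index()
yearly = dates.dt.year.value_counts().sort_index()

print(monthly)
print(yearly)
```

Unlike `resample`, this version lists only the months that actually occur, so empty months (February 2021 in the toy data) do not appear as zero counts.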

## Using a for loop to go through all the columns

In [None]:
# Drop Employee ID from the columns
df = df.drop(columns=['EmployeeID'])

# Temporarily cast Attrition and OverTime to categories so they are summarized by ql_stats
df['Attrition'] = df['Attrition'].astype('category')
df['OverTime'] = df['OverTime'].astype('category')

# Loop through all columns and run stats, tests, and plots
for col in df.columns:
    if pd.api.types.is_datetime64_any_dtype(df[col]):
        dt_stats(df, col)
    elif pd.api.types.is_numeric_dtype(df[col]):
        qn_stats(df, col)
    else:
        ql_stats(df, col)
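
The loop routes columns by dtype, so integer-coded columns that were cast to `category` are treated as qualitative rather than numeric. A small sketch of that routing on an invented frame, where the helper name `pick_handler` is hypothetical:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Toy frame (invented) with one column of each kind handled by the loop.
toy = pd.DataFrame({
    "HireDate": pd.to_datetime(["2020-01-01", "2021-06-15"]),
    "Salary": [52000, 61000],
    "Attrition": pd.Series([0, 1]).astype("category"),
    "Gender": ["Female", "Male"],
})

def pick_handler(series):
    """Return the name of the summary function the loop would call."""
    if is_datetime64_any_dtype(series):
        return "dt_stats"
    if is_numeric_dtype(series):
        return "qn_stats"
    return "ql_stats"

routing = {col: pick_handler(toy[col]) for col in toy.columns}
print(routing)
```

This is why `Attrition` and `OverTime` are cast to `category` before the loop: left as 0/1 integers, they would be sent to `qn_stats`.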

## Key Observations:
### Gender
--- Categorical Summary: Gender ---
                   Count  Percentages
Female              3091        46.07
Male                2968        44.24
Non-Binary           589         8.78
Prefer Not To Say     61         0.91
Unique categories: 4
Most frequent: Female
- 46.07% of employees identify as female
- 44.24% of employees identify as male
- 8.78% of employees identify as non-binary
- 0.91% of employees preferred not to state their gender
- Female is the most common gender identity, narrowly ahead of male.
- There is little else to note here, so this variable may be skipped in later analysis.

### Age
--- Numerical Summary: Age ---
count    6709.000000
mean       30.776718
std         7.928774
min        18.000000
25%        25.000000
50%        28.000000
75%        37.000000
max        51.000000
- The average age of employees is about 31.
- The standard deviation of employees' ages is 7.93.
- The youngest employee is 18 years old.
- 25% of employees are aged 25 or younger.
- The median age is 28: half of employees are 28 or younger.
- 75% of employees are aged 37 or younger.
- The oldest employee is 51 years old.

Mode: 25
Skewness: 0.67
Kurtosis: -0.72
- The most common age among employees is 25.
- Since the skewness (0.67) is greater than 0, the age distribution is skewed to the right.
- `scipy.stats.kurtosis` returns excess kurtosis by default, for which a normal distribution scores 0. Since the kurtosis (-0.72) is less than 0, the age distribution is platykurtic: flatter than a normal distribution, with lighter tails.
- The "Distribution of Age" plot shows most of the values on the low end, which suggests that the workforce skews young.
- The histogram also shows a peak at 25, the most common age in this company.
- After age 30, the spread of employees becomes more even.
- The plot also shows that the distribution of employee ages is not normal.
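
SciPy's `kurtosis` uses the Fisher definition by default (`fisher=True`), which subtracts 3 so that a normal distribution scores 0. Passing `fisher=False` gives the Pearson definition, where a normal distribution scores 3. The sketch below compares the two on an invented uniform sample, chosen because a uniform distribution is flat and light-tailed.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Toy sample (invented): uniform draws, a flat and light-tailed distribution.
rng = np.random.default_rng(42)
sample = rng.uniform(low=18, high=51, size=10_000)

excess = kurtosis(sample)                 # default fisher=True: normal -> 0
pearson = kurtosis(sample, fisher=False)  # Pearson definition: normal -> 3

print(f"Skewness:         {skew(sample):.2f}")
print(f"Excess kurtosis:  {excess:.2f}")
print(f"Pearson kurtosis: {pearson:.2f}")
```

The two values always differ by exactly 3, so a value from `qn_stats` should be compared against 0, not 3.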

D'Agostino and Pearson Test:
 Statistic = 793.6930, p-value = 0.0000
Data is not normally distributed.
- Since the p-value is far below 0.05 (it rounds to 0.0000), we reject the null hypothesis of normality: the distribution of age is not normal.
- The QQ plot shows that the values deviate from the normal distribution.

### BusinessTravel







