# HR Dashboard - Data Cleaning
## Leave & Attendance Dataset

This notebook performs data cleaning, wrangling, and feature engineering on the Leave_Attendance dataset.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Load Raw Data

In [2]:
# Load the Leave & Attendance dataset
df = pd.read_csv('../data/raw/Leave_Attendance.csv')
print(f"Original dataset shape: {df.shape}")
df.head(10)

Original dataset shape: (10000, 6)


| | LeaveID | EmployeeID | LeaveType | StartDate | EndDate | Days |
|---|---|---|---|---|---|---|
| 0 | LEAVE-1 | PNR-1001 | Maternity/Paternity | 2021-08-05 | 2021-08-07 | 3.0 |
| 1 | LEAVE-2 | PNR-1001 | Vacation | 2022-07-20 | 2022-07-22 | 3.0 |
| 2 | LEAVE-3 | PNR-1001 | Personal Leave | 2018-09-22 | 2018-10-01 | 10.0 |
| 3 | LEAVE-3 | PNR-1001 | Personal Leave | 2018-09-22 | 2018-10-01 | 10.0 |
| 4 | LEAVE-5 | PNR-1001 | Sick Leave | 2022-03-16 | 2022-03-16 | 1.0 |
| 5 | LEAVE-6 | PNR-1002 | Sick Leave | 2023-12-02 | 2023-12-06 | 5.0 |
| 6 | LEAVE-7 | PNR-1002 | Personal Leave | 2018-01-30 | 2018-02-03 | 5.0 |
| 7 | LEAVE-8 | PNR-1002 | Maternity/Paternity | 2023-04-19 | 2023-04-20 | 2.0 |
| 8 | LEAVE-9 | PNR-1002 | Personal Leave | 2022-04-21 | 2022-04-22 | 2.0 |
| 9 | LEAVE-10 | PNR-1002 | Sick Leave | 2020-03-07 | 2020-03-07 | 1.0 |


### Initial Data Exploration

In [3]:
# Display basic information
print("Dataset Info:")
df.info()
print("\n" + "="*50)
print("\nBasic Statistics:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   LeaveID     9659 non-null   object 
 1   EmployeeID  9632 non-null   object 
 2   LeaveType   9627 non-null   object 
 3   StartDate   9637 non-null   object 
 4   EndDate     9601 non-null   object 
 5   Days        9718 non-null   float64
dtypes: float64(1), object(5)
memory usage: 468.9+ KB


Basic Statistics:


| | Days |
|---|---|
| count | 9718.0 |
| mean | 4.177403 |
| std | 3.182814 |
| min | 1.0 |
| 25% | 2.0 |
| 50% | 3.0 |
| 75% | 5.0 |
| max | 10.0 |


In [4]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Missing_Count': missing, 'Percentage': missing_percent})
print(missing_df)

Missing Values:
            Missing_Count  Percentage
LeaveID               341        3.41
EmployeeID            368        3.68
LeaveType             373        3.73
StartDate             363        3.63
EndDate               399        3.99
Days                  282        2.82


In [5]:
# Check for duplicates
print(f"Total duplicate rows: {df.duplicated().sum()}")

Total duplicate rows: 300
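
Before dropping duplicates it can help to look at the rows involved. The sketch below uses a small toy frame (the `sample` name and its contents are illustrative, modelled on the repeated LEAVE-3 row shown earlier) to contrast the default `duplicated()` count with `keep=False`, which marks every member of a duplicate group.

```python
import pandas as pd

# Toy frame modelled on the first rows of the dataset (LEAVE-3 appears twice).
sample = pd.DataFrame({
    'LeaveID': ['LEAVE-1', 'LEAVE-2', 'LEAVE-3', 'LEAVE-3'],
    'EmployeeID': ['PNR-1001'] * 4,
    'Days': [3.0, 3.0, 10.0, 10.0],
})

# By default duplicated() flags only the later copies of a repeated row.
later_copies = sample.duplicated().sum()

# keep=False flags every copy, which is handy for inspecting the pairs side by side.
all_copies = sample[sample.duplicated(keep=False)]

print(f"Later copies: {later_copies}")
print(all_copies)
```

The count printed by the notebook corresponds to the first form, so it equals the number of rows that `drop_duplicates()` removes.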


---
## Data Cleaning Steps

### Step 1: Delete Redundant Columns

In [6]:
# Check for redundant columns
print("All columns are relevant for analysis.")
print(f"Columns: {list(df.columns)}")

All columns are relevant for analysis.
Columns: ['LeaveID', 'EmployeeID', 'LeaveType', 'StartDate', 'EndDate', 'Days']


### Step 2: Drop / Rename the Columns

In [7]:
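# Lowercase column names and replace spaces with underscores.
# The source names contain no spaces, so 'LeaveID' becomes 'leaveid', not 'leave_id'.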
df.columns = df.columns.str.lower().str.replace(' ', '_')
print("Renamed columns:")
print(list(df.columns))

Renamed columns:
['leaveid', 'employeeid', 'leavetype', 'startdate', 'enddate', 'days']
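
The rename above lowercases the names but inserts no underscores, because the source names contain no spaces. If true snake_case were wanted, one possible approach is to split on lower-to-upper case boundaries first. The `to_snake_case` helper below is a hypothetical illustration and is not used by the rest of this notebook, which keeps the lowercased names.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: add '_' at lower-to-upper boundaries, then lowercase."""
    with_underscores = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name.strip())
    return with_underscores.replace(' ', '_').lower()

original = ['LeaveID', 'EmployeeID', 'LeaveType', 'StartDate', 'EndDate', 'Days']
renamed = [to_snake_case(col) for col in original]
print(renamed)
```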


### Step 3: Remove Duplicates

In [8]:
# Remove duplicate rows
print(f"Rows before removing duplicates: {len(df)}")
df = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df)}")

Rows before removing duplicates: 10000
Rows after removing duplicates: 9700


### Step 4: Handle Missing Values

In [9]:
# Check missing values before cleaning
print("Missing values before cleaning:")
print(df.isnull().sum())
print(f"\nTotal rows: {len(df)}")

Missing values before cleaning:
leaveid       341
employeeid    368
leavetype     373
startdate     363
enddate       399
days          282
dtype: int64

Total rows: 9700


In [10]:
# Remove rows where critical columns are missing
df = df.dropna(subset=['employeeid', 'startdate', 'enddate'])
print(f"Rows after removing missing critical fields: {len(df)}")

Rows after removing missing critical fields: 8608


In [11]:
# Fill missing LeaveType with 'Unknown'
df['leavetype'] = df['leavetype'].fillna('Unknown')
print("\nMissing values after cleaning:")
print(df.isnull().sum())


Missing values after cleaning:
leaveid       297
employeeid      0
leavetype       0
startdate       0
enddate         0
days          260
dtype: int64


### Step 5: Clean Individual Columns

#### 5.1 Clean LeaveID Column

In [12]:
# Check LeaveID
print("Sample LeaveID values:")
print(df['leaveid'].head(20))
print(f"\nMissing LeaveID: {df['leaveid'].isnull().sum()}")

Sample LeaveID values:
0        LEAVE-1
1        LEAVE-2
2        LEAVE-3
4        LEAVE-5
5        LEAVE-6
6        LEAVE-7
7        LEAVE-8
8        LEAVE-9
9       LEAVE-10
10      LEAVE-11
11     LEAVE-12 
12      LEAVE-13
13      LEAVE-14
14      LEAVE-15
15      LEAVE-16
16      LEAVE-17
17      LEAVE-18
19      LEAVE-20
20      LEAVE-21
22           NaN
Name: leaveid, dtype: object

Missing LeaveID: 297


In [13]:
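# Generate missing LeaveIDs, continuing from the highest existing number
df['leaveid'] = df['leaveid'].str.strip()  # some IDs carry trailing whitespace, e.g. 'LEAVE-12 '
existing_ids = df['leaveid'].dropna()
max_id = existing_ids.str.extract(r'(\d+)')[0].astype(float).max()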
missing_count = df['leaveid'].isnull().sum()
if missing_count > 0:
    new_ids = [f"LEAVE-{int(max_id) + i + 1}" for i in range(missing_count)]
    df.loc[df['leaveid'].isnull(), 'leaveid'] = new_ids
print(f"Generated {missing_count} new LeaveIDs")

Generated 297 new LeaveIDs
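
The ID generation can be checked on a small example. The sketch below applies the same idea to a toy Series (the values are illustrative) and confirms that the resulting IDs are unique. It also strips whitespace first, since the sample output shows at least one ID with a trailing space.

```python
import pandas as pd

# Toy IDs: one has trailing whitespace and two are missing.
ids = pd.Series(['LEAVE-1', 'LEAVE-12 ', None, 'LEAVE-20', None], name='leaveid')

# Strip stray whitespace so 'LEAVE-12 ' and 'LEAVE-12' are treated as the same ID.
ids = ids.str.strip()

# Highest existing numeric suffix; new IDs continue from there.
max_id = int(ids.dropna().str.extract(r'(\d+)')[0].astype(int).max())

missing = ids.isnull()
ids.loc[missing] = [f"LEAVE-{max_id + i + 1}" for i in range(int(missing.sum()))]

print(ids.tolist())
print(f"All IDs unique: {ids.is_unique}")
```

Because new IDs start above the current maximum, they cannot collide with existing ones, provided the existing IDs all follow the `LEAVE-<number>` pattern.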


#### 5.2 Clean EmployeeID Column

In [14]:
# Clean EmployeeID
df['employeeid'] = df['employeeid'].str.strip()
print("EmployeeID cleaned")
print(f"Unique employees: {df['employeeid'].nunique()}")

EmployeeID cleaned
Unique employees: 2000


#### 5.3 Clean LeaveType Column

In [15]:
# Check LeaveType values
print("LeaveType value counts (before):")
print(df['leavetype'].value_counts())

LeaveType value counts (before):
leavetype
Vacation                 2027
Personal Leave           1978
Sick Leave               1940
Maternity/Paternity      1897
Unknown                   332
 Maternity/Paternity      113
 Personal Leave           110
 Vacation                 107
 Sick Leave               104
Name: count, dtype: int64


In [16]:
# Clean LeaveType
df['leavetype'] = df['leavetype'].str.strip().str.title()
print("\nLeaveType value counts (after):")
print(df['leavetype'].value_counts())


LeaveType value counts (after):
leavetype
Vacation               2134
Personal Leave         2088
Sick Leave             2044
Maternity/Paternity    2010
Unknown                 332
Name: count, dtype: int64


#### 5.4 Clean Date Columns

In [17]:
# Convert StartDate to datetime
df['startdate'] = df['startdate'].str.strip()
df['startdate'] = pd.to_datetime(df['startdate'], errors='coerce')
print("StartDate converted to datetime")
print(f"Date range: {df['startdate'].min()} to {df['startdate'].max()}")

StartDate converted to datetime
Date range: 2018-01-01 00:00:00 to 2025-10-28 00:00:00


In [18]:
# Convert EndDate to datetime
df['enddate'] = df['enddate'].str.strip()
df['enddate'] = pd.to_datetime(df['enddate'], errors='coerce')
print("EndDate converted to datetime")
print(f"Date range: {df['enddate'].min()} to {df['enddate'].max()}")

EndDate converted to datetime
Date range: 2018-01-01 00:00:00 to 2025-11-06 00:00:00


In [19]:
# Remove invalid dates
rows_before = len(df)
df = df.dropna(subset=['startdate', 'enddate'])
print(f"Removed {rows_before - len(df)} rows with invalid dates")

Removed 0 rows with invalid dates
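
`errors='coerce'` only catches strings that cannot be parsed. A pair of valid dates can still be out of order, so a further check that might be worth running is whether any end date precedes its start date. The sketch below shows that check on toy data. The second row is a made-up example of a reversed range and does not come from the dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    'startdate': pd.to_datetime(['2021-08-05', '2022-07-20', '2023-04-19']),
    # The middle end date is deliberately earlier than its start date (hypothetical).
    'enddate': pd.to_datetime(['2021-08-07', '2022-07-18', '2023-04-20']),
})

# Boolean mask of rows whose date range is reversed.
reversed_range = sample['enddate'] < sample['startdate']
print(f"Rows where enddate precedes startdate: {reversed_range.sum()}")
```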


In [20]:
# Extract date features
df['leave_year'] = df['startdate'].dt.year
df['leave_month'] = df['startdate'].dt.month
df['leave_quarter'] = df['startdate'].dt.quarter
df['leave_day_of_week'] = df['startdate'].dt.day_name()
df['leave_month_name'] = df['startdate'].dt.strftime('%B')
print("Extracted date features")

Extracted date features


#### 5.5 Clean Days Column

In [21]:
# Check Days column
print("Days statistics:")
print(df['days'].describe())
print(f"\nMissing Days: {df['days'].isnull().sum()}")

Days statistics:
count    8348.000000
mean        4.163512
std         3.191087
min         1.000000
25%         2.000000
50%         3.000000
75%         5.000000
max        10.000000
Name: days, dtype: float64

Missing Days: 260


In [22]:
# Calculate days if missing
df['calculated_days'] = (df['enddate'] - df['startdate']).dt.days + 1
df['days'] = df['days'].fillna(df['calculated_days'])
print(f"\nMissing Days after calculation: {df['days'].isnull().sum()}")


Missing Days after calculation: 0
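
The fill above only touches missing values, so recorded `days` that disagree with the date range pass through unchanged. A minimal sketch of how such mismatches could be counted before filling is shown below. The toy data is illustrative, and the last row is a made-up inconsistency.

```python
import pandas as pd

sample = pd.DataFrame({
    'startdate': pd.to_datetime(['2021-08-05', '2018-09-22', '2022-03-16']),
    'enddate': pd.to_datetime(['2021-08-07', '2018-10-01', '2022-03-16']),
    'days': [3.0, None, 2.0],  # None = missing; 2.0 is a hypothetical bad record
})

# Inclusive day count: a leave that starts and ends on the same date lasts 1 day.
sample['calculated_days'] = (sample['enddate'] - sample['startdate']).dt.days + 1

# Recorded values that disagree with the date range (missing values are not mismatches).
mismatch = sample['days'].notna() & (sample['days'] != sample['calculated_days'])

# Fill only the gaps, as in the cell above.
sample['days'] = sample['days'].fillna(sample['calculated_days'])

print(sample[['days', 'calculated_days']])
print(f"Mismatched records: {mismatch.sum()}")
```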


In [23]:
# Validate days (should be positive and reasonable)
print(f"\nNegative or zero days: {(df['days'] <= 0).sum()}")
print(f"Days > 365: {(df['days'] > 365).sum()}")
# Remove invalid
df = df[(df['days'] > 0) & (df['days'] <= 365)]
print(f"Rows after removing invalid days: {len(df)}")


Negative or zero days: 0
Days > 365: 0
Rows after removing invalid days: 8608


### Step 6: Additional Transformations (Feature Engineering)

#### 6.1 Create Leave Duration Categories

In [24]:
# Create leave duration categories
df['leave_duration_category'] = pd.cut(
    df['days'],
    bins=[0, 1, 3, 7, 14, 365],
    labels=['Single Day', 'Short (2-3 days)', 'Medium (4-7 days)', 'Long (8-14 days)', 'Extended (15+ days)']
)
print("Created 'leave_duration_category' feature")
print("\nLeave duration distribution:")
print(df['leave_duration_category'].value_counts().sort_index())

Created 'leave_duration_category' feature

Leave duration distribution:
leave_duration_category
Single Day             1783
Short (2-3 days)       3445
Medium (4-7 days)      1665
Long (8-14 days)       1715
Extended (15+ days)       0
Name: count, dtype: int64
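
`pd.cut` uses right-closed intervals by default, so the bins above are (0, 1], (1, 3], (3, 7], (7, 14] and (14, 365]. The short sketch below runs the same bins and labels over a few boundary values to show where each one lands.

```python
import pandas as pd

days = pd.Series([1, 2, 3, 4, 7, 8, 14, 15])
labels = ['Single Day', 'Short (2-3 days)', 'Medium (4-7 days)',
          'Long (8-14 days)', 'Extended (15+ days)']

# Same bins as the cell above; each interval includes its right edge.
categories = pd.cut(days, bins=[0, 1, 3, 7, 14, 365], labels=labels)

for d, c in zip(days, categories):
    print(d, '->', c)
```

Since the maximum value of `days` in this dataset is 10, the empty "Extended (15+ days)" bucket in the output above is expected.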


#### 6.2 Create Leave Type Binary Flags

In [25]:
# Create binary flags for leave types
df['is_sick_leave'] = (df['leavetype'] == 'Sick Leave').astype(int)
df['is_vacation'] = (df['leavetype'] == 'Vacation').astype(int)
df['is_personal'] = (df['leavetype'] == 'Personal Leave').astype(int)
df['is_maternity_paternity'] = (df['leavetype'] == 'Maternity/Paternity').astype(int)
print("Created leave type binary flags")

Created leave type binary flags


#### 6.3 Create Employee-Level Aggregations

In [26]:
# Calculate total leave days per employee
df['total_leave_days'] = df.groupby('employeeid')['days'].transform('sum')
df['leave_count'] = df.groupby('employeeid')['employeeid'].transform('count')
df['avg_leave_duration'] = df.groupby('employeeid')['days'].transform('mean').round(1)
print("Created employee-level leave metrics")

Created employee-level leave metrics


In [27]:
# Count leave types per employee
df['sick_leave_count'] = df.groupby('employeeid')['is_sick_leave'].transform('sum')
df['vacation_count'] = df.groupby('employeeid')['is_vacation'].transform('sum')
df['personal_leave_count'] = df.groupby('employeeid')['is_personal'].transform('sum')
df['maternity_paternity_count'] = df.groupby('employeeid')['is_maternity_paternity'].transform('sum')
print("Created leave type counts per employee")

Created leave type counts per employee
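
`transform` repeats each employee's aggregate on every one of that employee's rows, which keeps the table at one row per leave. When a one-row-per-employee table is more convenient, `agg` can produce it. The sketch below contrasts the two on toy data (the `sample` and `per_employee` names are illustrative).

```python
import pandas as pd

sample = pd.DataFrame({
    'employeeid': ['PNR-1001', 'PNR-1001', 'PNR-1002'],
    'days': [3.0, 10.0, 5.0],
})

# transform: one value per original row, broadcast within each employee's group.
sample['total_leave_days'] = sample.groupby('employeeid')['days'].transform('sum')

# agg: one row per employee, using named aggregation.
per_employee = sample.groupby('employeeid').agg(
    total_leave_days=('days', 'sum'),
    leave_count=('days', 'count'),
    avg_leave_duration=('days', 'mean'),
)

print(sample)
print(per_employee)
```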


#### 6.4 Create Time-Based Features

In [28]:
# Calculate time between leaves
df = df.sort_values(['employeeid', 'startdate'])
df['prev_leave_end'] = df.groupby('employeeid')['enddate'].shift(1)
df['days_since_last_leave'] = (df['startdate'] - df['prev_leave_end']).dt.days
print("Created 'days_since_last_leave' feature")

Created 'days_since_last_leave' feature
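
The grouped `shift(1)` is easier to follow on a handful of rows. The sketch below repeats the calculation on three PNR-1001 leaves and one PNR-1002 leave taken from the sample rows shown earlier. Each employee's first leave has no predecessor, so its gap is `NaN`.

```python
import pandas as pd

sample = pd.DataFrame({
    'employeeid': ['PNR-1001', 'PNR-1001', 'PNR-1001', 'PNR-1002'],
    'startdate': pd.to_datetime(['2021-08-05', '2018-09-22', '2022-03-16', '2018-01-30']),
    'enddate': pd.to_datetime(['2021-08-07', '2018-10-01', '2022-03-16', '2018-02-03']),
})

# Sort first: shift(1) takes the previous row within each group in row order.
sample = sample.sort_values(['employeeid', 'startdate'])
sample['prev_leave_end'] = sample.groupby('employeeid')['enddate'].shift(1)
sample['days_since_last_leave'] = (sample['startdate'] - sample['prev_leave_end']).dt.days

print(sample[['employeeid', 'startdate', 'days_since_last_leave']])
```

A negative value in this column would mean a leave started before the previous one ended, which may point to overlapping records.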


In [29]:
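# Number each employee's leaves in chronological order (1 = first leave);
# relies on df already being sorted by employeeid and startdate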
df['leave_sequence'] = df.groupby('employeeid').cumcount() + 1
print("Created 'leave_sequence' feature")

Created 'leave_sequence' feature


#### 6.5 Create Seasonal Features

In [30]:
# Map month to meteorological season (Northern Hemisphere convention)
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['leave_month'].apply(get_season)
print("Created 'season' feature")
print("\nLeave distribution by season:")
print(df['season'].value_counts())

Created 'season' feature

Leave distribution by season:
season
Spring    2257
Summer    2165
Fall      2126
Winter    2060
Name: count, dtype: int64
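
As an alternative to `apply` with a Python function, the same month-to-season mapping can be written as a dictionary lookup with `Series.map`. The sketch below is one possible equivalent of `get_season` (the `SEASON_BY_MONTH` name is illustrative).

```python
import pandas as pd

# Lookup table giving the same result as get_season for months 1-12.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

months = pd.Series([1, 4, 7, 9, 12])
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

One difference from `get_season`: a month value outside 1-12 maps to `NaN` here, whereas the function's `else` branch would label it 'Fall'.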


#### 6.6 Create Weekend/Weekday Flags

In [31]:
# Flag leaves that start on a weekend (Saturday/Sunday), a Monday, or a Friday
df['starts_on_weekend'] = df['startdate'].dt.dayofweek.isin([5, 6]).astype(int)
df['starts_on_monday'] = (df['startdate'].dt.dayofweek == 0).astype(int)
df['starts_on_friday'] = (df['startdate'].dt.dayofweek == 4).astype(int)
print("Created weekend/weekday flags")

Created weekend/weekday flags


#### 6.7 Create Leave Pattern Features

In [32]:
# Flag employee-years with more than 10 leave days.
# Note: the flag is stored in the separate annual_leave table and is not merged into df.
annual_leave = df.groupby(['employeeid', 'leave_year'])['days'].sum().reset_index()
annual_leave['high_leave_usage'] = (annual_leave['days'] > 10).astype(int)
print("Annual leave summary:")
print(annual_leave.groupby('leave_year')['days'].describe())

Annual leave summary:
            count      mean       std  min  25%  50%   75%   max
leave_year                                                      
2018        890.0  5.031461  4.086126  1.0  2.0  3.0   7.0  25.0
2019        846.0  5.343972  4.202691  1.0  2.0  4.0  10.0  20.0
2020        878.0  5.448747  4.352642  1.0  2.0  4.0  10.0  30.0
2021        879.0  5.281001  4.146738  1.0  2.0  4.0  10.0  23.0
2022        882.0  5.462585  4.221353  1.0  2.0  5.0  10.0  25.0
2023        865.0  5.497110  4.078263  1.0  2.0  5.0  10.0  22.0
2024        875.0  4.924571  4.054410  1.0  2.0  3.0   8.0  25.0
2025        725.0  4.946207  3.830833  1.0  2.0  3.0   8.0  20.0
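
The cell above computes `high_leave_usage` on the separate `annual_leave` table, and the final column list does not include it. If the flag were wanted on every leave record, one option is a left merge on employee and year. The sketch below shows this on toy data. The `days` values and the `annual_leave_days` column name are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'employeeid': ['PNR-1001', 'PNR-1001', 'PNR-1001', 'PNR-1002'],
    'leave_year': [2022, 2022, 2021, 2022],
    'days': [10.0, 3.0, 3.0, 5.0],  # hypothetical values
})

# Total leave days per employee per year.
annual_leave = (
    sample.groupby(['employeeid', 'leave_year'])['days']
    .sum()
    .reset_index(name='annual_leave_days')
)
annual_leave['high_leave_usage'] = (annual_leave['annual_leave_days'] > 10).astype(int)

# Attach the yearly total and flag to each leave record.
sample = sample.merge(annual_leave, on=['employeeid', 'leave_year'], how='left')
print(sample)
```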


#### 6.8 Sort Data

In [33]:
# Sort by EmployeeID and StartDate
df = df.sort_values(['employeeid', 'startdate']).reset_index(drop=True)
print("Data sorted by EmployeeID and StartDate")
df.head(10)

Data sorted by EmployeeID and StartDate


| | leaveid | employeeid | leavetype | startdate | enddate | days | leave_year | leave_month | leave_quarter | leave_day_of_week | ... | vacation_count | personal_leave_count | maternity_paternity_count | prev_leave_end | days_since_last_leave | leave_sequence | season | starts_on_weekend | starts_on_monday | starts_on_friday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LEAVE-3 | PNR-1001 | Personal Leave | 2018-09-22 | 2018-10-01 | 10.0 | 2018 | 9 | 3 | Saturday | ... | 1 | 1 | 1 | NaT | NaN | 1 | Fall | 1 | 0 | 0 |
| 1 | LEAVE-1 | PNR-1001 | Maternity/Paternity | 2021-08-05 | 2021-08-07 | 3.0 | 2021 | 8 | 3 | Thursday | ... | 1 | 1 | 1 | 2018-10-01 | 1039.0 | 2 | Summer | 0 | 0 | 0 |
| 2 | LEAVE-5 | PNR-1001 | Sick Leave | 2022-03-16 | 2022-03-16 | 1.0 | 2022 | 3 | 1 | Wednesday | ... | 1 | 1 | 1 | 2021-08-07 | 221.0 | 3 | Spring | 0 | 0 | 0 |
| 3 | LEAVE-2 | PNR-1001 | Vacation | 2022-07-20 | 2022-07-22 | 3.0 | 2022 | 7 | 3 | Wednesday | ... | 1 | 1 | 1 | 2022-03-16 | 126.0 | 4 | Summer | 0 | 0 | 0 |
| 4 | LEAVE-7 | PNR-1002 | Personal Leave | 2018-01-30 | 2018-02-03 | 5.0 | 2018 | 1 | 1 | Tuesday | ... | 0 | 2 | 1 | NaT | NaN | 1 | Winter | 0 | 0 | 0 |
| 5 | LEAVE-10 | PNR-1002 | Sick Leave | 2020-03-07 | 2020-03-07 | 1.0 | 2020 | 3 | 1 | Saturday | ... | 0 | 2 | 1 | 2018-02-03 | 763.0 | 2 | Spring | 1 | 0 | 0 |
| 6 | LEAVE-9 | PNR-1002 | Personal Leave | 2022-04-21 | 2022-04-22 | 2.0 | 2022 | 4 | 2 | Thursday | ... | 0 | 2 | 1 | 2020-03-07 | 775.0 | 3 | Spring | 0 | 0 | 0 |
| 7 | LEAVE-8 | PNR-1002 | Maternity/Paternity | 2023-04-19 | 2023-04-20 | 2.0 | 2023 | 4 | 2 | Wednesday | ... | 0 | 2 | 1 | 2022-04-22 | 362.0 | 4 | Spring | 0 | 0 | 0 |
| 8 | LEAVE-6 | PNR-1002 | Sick Leave | 2023-12-02 | 2023-12-06 | 5.0 | 2023 | 12 | 4 | Saturday | ... | 0 | 2 | 1 | 2023-04-20 | 226.0 | 5 | Winter | 1 | 0 | 0 |
| 9 | LEAVE-14 | PNR-1003 | Maternity/Paternity | 2018-07-09 | 2018-07-10 | 2.0 | 2018 | 7 | 3 | Monday | ... | 1 | 0 | 3 | NaT | NaN | 1 | Summer | 0 | 1 | 0 |


#### 6.9 Final Data Quality Check

In [34]:
# Final summary
print("=" * 60)
print("FINAL DATA QUALITY SUMMARY")
print("=" * 60)
print(f"\nFinal shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMissing values:")
print(df.isnull().sum())
print(f"\nDuplicates: {df.duplicated().sum()}")

FINAL DATA QUALITY SUMMARY

Final shape: (8608, 31)

Columns: ['leaveid', 'employeeid', 'leavetype', 'startdate', 'enddate', 'days', 'leave_year', 'leave_month', 'leave_quarter', 'leave_day_of_week', 'leave_month_name', 'calculated_days', 'leave_duration_category', 'is_sick_leave', 'is_vacation', 'is_personal', 'is_maternity_paternity', 'total_leave_days', 'leave_count', 'avg_leave_duration', 'sick_leave_count', 'vacation_count', 'personal_leave_count', 'maternity_paternity_count', 'prev_leave_end', 'days_since_last_leave', 'leave_sequence', 'season', 'starts_on_weekend', 'starts_on_monday', 'starts_on_friday']

Missing values:
leaveid                         0
employeeid                      0
leavetype                       0
startdate                       0
enddate                         0
days                            0
leave_year                      0
leave_month                     0
leave_quarter                   0
leave_day_of_week               0
leave_month_name        

### Save Cleaned Data

In [35]:
# Save to processed folder
output_path = '../data/processed/Leave_Attendance_cleaned.csv'
df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")
print(f"Total records: {len(df)}")

Cleaned data saved to: ../data/processed/Leave_Attendance_cleaned.csv
Total records: 8608
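
CSV stores everything as text, so datetime columns come back as plain strings unless they are parsed again on load. The sketch below illustrates this with an in-memory buffer standing in for the saved file (the toy frame is illustrative). `parse_dates` is the `pd.read_csv` parameter that restores the datetime dtype.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'leaveid': ['LEAVE-1', 'LEAVE-2'],
    'startdate': pd.to_datetime(['2021-08-05', '2022-07-20']),
    'days': [3.0, 3.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Plain reload: startdate comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with parse_dates: startdate is datetime again.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['startdate'])

print(plain.dtypes)
print(parsed.dtypes)
```

The same applies to the categorical `leave_duration_category` column, which would also be read back as text from the CSV.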


---
## Summary

### Data Cleaning & Feature Engineering Completed!

**Steps Performed:**
1. Checked for redundant columns
2. Renamed columns to lowercase
3. Removed duplicates
4. Handled missing values
5. Cleaned all columns:
   - LeaveID (generated missing)
   - EmployeeID, LeaveType
   - StartDate, EndDate (datetime)
   - Days (calculated and validated)
6. **Feature Engineering:**
   - **Duration categories**: Single day to Extended (15+ days)
   - **Leave type flags**: Binary indicators for each type
   - **Employee aggregations**: Total days, counts, averages
   - **Time-based**: Days since last leave, leave sequence
   - **Seasonal**: Season classification
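   - **Pattern flags**: Weekend, Monday, and Friday starts (the high-usage flag is computed in the separate `annual_leave` table and is not part of the saved dataset)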

**Feature Summary:**
- Original columns: 6
- Final columns: 31
- New features: 25