# HR Dashboard - Data Cleaning
## Recruitment Funnel Dataset

This notebook performs data cleaning, wrangling, and feature engineering on the Recruitment_Funnel dataset.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Load Raw Data

In [2]:
# Load the Recruitment Funnel dataset
df = pd.read_csv('../data/raw/Recruitment_Funnel.csv')
print(f"Original dataset shape: {df.shape}")
df.head(10)

Original dataset shape: (24008, 7)


|   | CandidateID | PositionID | JobTitle | ApplicationDate | Source | Stage | DateatStage |
|---|---|---|---|---|---|---|---|
| 0 | CAND-1 | POS-1 | NaN | 2025-08-18 | Indeed | Rejected | 2025-08-18 |
| 1 | CAND-2 | POS-1 | Regional Sales | 2025-10-21 | Employee Referral | Rejected | NaN |
| 2 | CAND-3 | POS-1 | NaN | 2025-10-25 | LinkedIn | Applied | 2025-10-25 |
| 3 | CAND-4 | POS-1 | Regional Sales | 2025-10-25 | Employee Referral | Screened | 2025-10-29 |
| 4 | CAND-5 | POS-1 | Regional Sales | 2025-10-25 | Job Fair | Interview 1 | 2025-11-04 |
| 5 | CAND-6 | POS-1 | Regional Sales | NaN | Company Website | Interview 2 | NaN |
| 6 | CAND-6 | POS-1 | Regional Sales | 2025-10-25 | Company Website | Interview 2 | 2025-11-06 |
| 7 | CAND-8 | POS-1 | Regional Sales | 2025-10-25 | LinkedIn | Hired | 2025-11-13 |
| 8 | CAND-9 | POS-1 | Regional Sales | 2025-09-25 | NaN | Rejected | 2025-09-25 |
| 9 | CAND-10 | POS-1 | Regional Sales | 2025-10-17 | Indeed | NaN | 2025-10-17 |


### Initial Data Exploration

In [3]:
# Display basic information
print("Dataset Info:")
df.info()
print("\n" + "="*50)
print("\nBasic Statistics:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24008 entries, 0 to 24007
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   CandidateID      21408 non-null  object
 1   PositionID       21385 non-null  object
 2   JobTitle         21401 non-null  object
 3   ApplicationDate  21315 non-null  object
 4   Source           21359 non-null  object
 5   Stage            21370 non-null  object
 6   DateatStage      21413 non-null  object
dtypes: object(7)
memory usage: 1.3+ MB


Basic Statistics:


|   | CandidateID | PositionID | JobTitle | ApplicationDate | Source | Stage | DateatStage |
|---|---|---|---|---|---|---|---|
| count | 21408 | 21385 | 21401 | 21315 | 21359 | 21370 | 21413 |
| unique | 19148 | 1127 | 14 | 237 | 10 | 14 | 285 |
| top | CAND-16663 | POS-118 | QC | 2025-09-28 | LinkedIn | Rejected | 2025-10-17 |
| freq | 2 | 59 | 3747 | 350 | 3824 | 5286 | 333 |


In [4]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Missing_Count': missing, 'Percentage': missing_percent})
print(missing_df)

Missing Values:
                 Missing_Count  Percentage
CandidateID               2600   10.829723
PositionID                2623   10.925525
JobTitle                  2607   10.858880
ApplicationDate           2693   11.217094
Source                    2649   11.033822
Stage                     2638   10.988004
DateatStage               2595   10.808897


In [5]:
# Check for duplicates
print(f"Total duplicate rows: {df.duplicated().sum()}")

Total duplicate rows: 376


---
## Data Cleaning Steps

### Step 1: Delete Redundant Columns

In [6]:
# Check for redundant columns
print("All columns are relevant for analysis.")
print(f"Columns: {list(df.columns)}")

All columns are relevant for analysis.
Columns: ['CandidateID', 'PositionID', 'JobTitle', 'ApplicationDate', 'Source', 'Stage', 'DateatStage']


### Step 2: Drop / Rename the Columns

In [7]:
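# Lowercase column names and replace spaces with underscores.
# The raw names contain no spaces, so the result is all-lowercase rather than true snake_case.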
df.columns = df.columns.str.lower().str.replace(' ', '_')
print("Renamed columns:")
print(list(df.columns))

Renamed columns:
['candidateid', 'positionid', 'jobtitle', 'applicationdate', 'source', 'stage', 'dateatstage']


### Step 3: Remove Duplicates

In [8]:
# Remove duplicate rows
print(f"Rows before removing duplicates: {len(df)}")
df = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df)}")

Rows before removing duplicates: 24008
Rows after removing duplicates: 23632
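The duplicate check above compares raw strings, so two rows that differ only by stray whitespace are not treated as duplicates. The Source and Stage counts in Step 5 show values with a leading space, and the sorted preview near the end shows two CAND-16 rows that look identical in the displayed columns, which would be consistent with this. The following is a minimal sketch on a made-up frame, not the notebook's method; `strip_strings` and `toy` are hypothetical names, and the helper assumes text columns hold only strings or missing values.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def strip_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with leading/trailing whitespace removed from text columns."""
    out = frame.copy()
    for col in out.columns:
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out


# Made-up rows: the first two differ only by a leading space in 'source'.
toy = pd.DataFrame({
    "candidateid": ["CAND-1", "CAND-1", "CAND-2"],
    "source": ["Indeed", " Indeed", "LinkedIn"],
})

raw_dupes = int(toy.duplicated().sum())                    # duplicates seen on raw strings
clean_dupes = int(strip_strings(toy).duplicated().sum())   # duplicates seen after stripping
deduped = strip_strings(toy).drop_duplicates().reset_index(drop=True)

print(f"raw: {raw_dupes}, after stripping: {clean_dupes}, rows kept: {len(deduped)}")
```

In this notebook's ordering, the equivalent would be either stripping whitespace before Step 3 or calling `drop_duplicates()` again once Step 5 has finished.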


### Step 4: Remove the NaN Values from the Dataset

In [9]:
# Check missing values before cleaning
print("Missing values before cleaning:")
print(df.isnull().sum())
print(f"\nTotal rows: {len(df)}")

Missing values before cleaning:
candidateid        2600
positionid         2623
jobtitle           2607
applicationdate    2693
source             2649
stage              2638
dateatstage        2595
dtype: int64

Total rows: 23632


In [10]:
# Remove rows where critical columns are missing
df = df.dropna(subset=['candidateid', 'positionid', 'applicationdate'])
print(f"Rows after removing missing critical fields: {len(df)}")

Rows after removing missing critical fields: 16668
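This step removes 6,964 of 23,632 rows (roughly 29%), so it can help to see how much each critical column contributes before dropping anything. A small sketch on made-up data follows; the frame `toy` and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up rows with gaps in the three critical columns.
toy = pd.DataFrame({
    "candidateid": ["CAND-1", None, "CAND-3", "CAND-4"],
    "positionid": ["POS-1", "POS-1", None, "POS-2"],
    "applicationdate": ["2025-08-18", "2025-08-19", None, "2025-08-20"],
})
critical = ["candidateid", "positionid", "applicationdate"]

# Missing count per critical column.
missing_per_column = toy[critical].isna().sum()

# Rows that dropna(subset=critical) would remove: any critical field missing.
rows_dropped = int(toy[critical].isna().any(axis=1).sum())
share_dropped = rows_dropped / len(toy)

print(missing_per_column)
print(f"{rows_dropped} of {len(toy)} rows ({share_dropped:.0%}) lack a critical field")
```

The per-column counts can sum to more than `rows_dropped` because one row may be missing several fields at once.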


In [11]:
# Fill missing values
df['jobtitle'] = df['jobtitle'].fillna('Unknown')
df['source'] = df['source'].fillna('Unknown')
df['stage'] = df['stage'].fillna('Unknown')
print("\nMissing values after cleaning:")
print(df.isnull().sum())


Missing values after cleaning:
candidateid           0
positionid            0
jobtitle              0
applicationdate       0
source                0
stage                 0
dateatstage        1686
dtype: int64


### Step 5: Clean Individual Columns

#### 5.1 Clean CandidateID Column

In [12]:
# Clean CandidateID
df['candidateid'] = df['candidateid'].str.strip()
print("CandidateID cleaned")
print(f"Unique candidates: {df['candidateid'].nunique()}")

CandidateID cleaned
Unique candidates: 15003


#### 5.2 Clean PositionID Column

In [13]:
# Clean PositionID
df['positionid'] = df['positionid'].str.strip()
print("PositionID cleaned")
print(f"Unique positions: {df['positionid'].nunique()}")

PositionID cleaned
Unique positions: 568


#### 5.3 Clean JobTitle Column

In [14]:
# Clean JobTitle
df['jobtitle'] = df['jobtitle'].str.strip().str.title()
print("JobTitle cleaned")
print("\nTop 10 job titles:")
print(df['jobtitle'].value_counts().head(10))

JobTitle cleaned

Top 10 job titles:
jobtitle
Qc                3015
Operations        2355
It                2023
Marketing         1995
Regional Sales    1907
Research          1845
Account           1813
Unknown           1715
Name: count, dtype: int64
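Note that `.str.title()` turns acronyms into mixed case, which is why `QC` and `IT` appear as `Qc` and `It` above. One possible alternative, sketched under the assumption that a short list of known acronyms is available: `ACRONYMS` and `smart_title` are hypothetical names, and the function expects strings only (missing titles were already filled with 'Unknown' in Step 4).

```python
import pandas as pd

# Hypothetical acronym list; extend it to match the real job titles.
ACRONYMS = {"qc": "QC", "it": "IT"}


def smart_title(value: str) -> str:
    """Title-case each word, except words listed in ACRONYMS."""
    words = value.strip().split()
    return " ".join(ACRONYMS.get(word.lower(), word.title()) for word in words)


titles = pd.Series([" QC", "IT ", "regional sales", "Unknown"])
cleaned = titles.map(smart_title)
print(cleaned.tolist())
```

A word-level lookup like this would also uppercase the ordinary word "it" inside a longer title, so it only suits short, controlled vocabularies such as this one.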


#### 5.4 Clean Date Columns

In [15]:
# Convert ApplicationDate to datetime
df['applicationdate'] = df['applicationdate'].str.strip()
df['applicationdate'] = pd.to_datetime(df['applicationdate'], errors='coerce')
print("ApplicationDate converted to datetime")
print(f"Date range: {df['applicationdate'].min()} to {df['applicationdate'].max()}")

ApplicationDate converted to datetime
Date range: 2025-07-01 00:00:00 to 2025-10-29 00:00:00


In [16]:
# Convert DateatStage to datetime
df['dateatstage'] = df['dateatstage'].str.strip()
df['dateatstage'] = pd.to_datetime(df['dateatstage'], errors='coerce')
print("DateatStage converted to datetime")
print(f"Date range: {df['dateatstage'].min()} to {df['dateatstage'].max()}")

DateatStage converted to datetime
Date range: 2025-07-01 00:00:00 to 2025-11-25 00:00:00


In [17]:
# Remove invalid dates
rows_before = len(df)
df = df.dropna(subset=['applicationdate'])
print(f"Removed {rows_before - len(df)} rows with invalid application dates")

Removed 0 rows with invalid application dates


In [18]:
# Extract date features
df['application_year'] = df['applicationdate'].dt.year
df['application_month'] = df['applicationdate'].dt.month
df['application_quarter'] = df['applicationdate'].dt.quarter
df['application_month_name'] = df['applicationdate'].dt.strftime('%B')
df['application_day_of_week'] = df['applicationdate'].dt.day_name()
print("Extracted application date features")

Extracted application date features


#### 5.5 Clean Source Column

In [19]:
# Check Source values
print("Source value counts (before):")
print(df['source'].value_counts())

Source value counts (before):
source
LinkedIn               2669
Employee Referral      2659
Job Fair               2636
Indeed                 2624
Company Website        2539
Unknown                1748
 Company Website        379
 Employee Referral      375
 Indeed                 364
 Job Fair               352
 LinkedIn               323
Name: count, dtype: int64


In [20]:
# Clean Source
df['source'] = df['source'].str.strip().str.title()
print("\nSource value counts (after):")
print(df['source'].value_counts())


Source value counts (after):
source
Employee Referral    3034
Linkedin             2992
Indeed               2988
Job Fair             2988
Company Website      2918
Unknown              1748
Name: count, dtype: int64


#### 5.6 Clean Stage Column

In [21]:
# Check Stage values
print("Stage value counts (before):")
print(df['stage'].value_counts())

Stage value counts (before):
stage
Rejected         3633
Applied          3123
Screened         2222
Unknown          1764
Interview 1      1520
Interview 2      1087
Hired             783
Offer             756
 Rejected         470
 Applied          457
 Screened         307
 Interview 1      206
 Interview 2      126
 Hired            109
 Offer            105
Name: count, dtype: int64


In [22]:
# Clean Stage
df['stage'] = df['stage'].str.strip().str.title()
print("\nStage value counts (after):")
print(df['stage'].value_counts())


Stage value counts (after):
stage
Rejected       4103
Applied        3580
Screened       2529
Unknown        1764
Interview 1    1726
Interview 2    1213
Hired           892
Offer           861
Name: count, dtype: int64


### Step 6: Additional Transformations and Feature Engineering

#### 6.1 Create Stage Hierarchy Features

In [23]:
# Define stage order
stage_order = {
    'Applied': 1,
    'Screened': 2,
    'Interview 1': 3,
    'Interview 2': 4,
    'Offer': 5,
    'Hired': 6,
    'Rejected': 7,
    'Unknown': 0
}
df['stage_number'] = df['stage'].map(stage_order)
print("Created 'stage_number' feature")

Created 'stage_number' feature
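The mapping above ranks 'Rejected' (7) above 'Hired' (6), so any `max()` over `stage_number` reports a rejection as the furthest point reached. One alternative, sketched here on made-up data, is an ordered categorical that covers only the funnel steps and tracks rejection separately. `FUNNEL`, `funnel_position`, and `is_terminal_reject` are illustrative names, not columns from the notebook.

```python
import pandas as pd

# Funnel steps in order; 'Rejected' and 'Unknown' are deliberately left out.
FUNNEL = ["Applied", "Screened", "Interview 1", "Interview 2", "Offer", "Hired"]

stage = pd.Series(["Applied", "Hired", "Rejected", "Unknown", "Interview 1"])

# Values outside FUNNEL become missing and get the category code -1.
funnel_stage = pd.Categorical(stage, categories=FUNNEL, ordered=True)

# Shift codes so funnel steps are 1..6 and non-funnel values are 0.
funnel_position = pd.Series(funnel_stage.codes, index=stage.index) + 1

# Rejection is kept as its own flag instead of a position in the ordering.
is_terminal_reject = stage.eq("Rejected")

print(funnel_position.tolist())
print(is_terminal_reject.tolist())
```

With this encoding, a per-candidate maximum reflects progress through the funnel, and rejection can be analysed as a separate outcome.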


#### 6.2 Create Stage Binary Flags

In [24]:
# Create binary flags for stages
df['is_applied'] = (df['stage'] == 'Applied').astype(int)
df['is_screened'] = (df['stage'] == 'Screened').astype(int)
df['is_interview1'] = (df['stage'] == 'Interview 1').astype(int)
df['is_interview2'] = (df['stage'] == 'Interview 2').astype(int)
df['is_offer'] = (df['stage'] == 'Offer').astype(int)
df['is_hired'] = (df['stage'] == 'Hired').astype(int)
df['is_rejected'] = (df['stage'] == 'Rejected').astype(int)
print("Created stage binary flags")

Created stage binary flags
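The seven assignments above can also be generated in one call. The sketch below uses `pandas.get_dummies` on a made-up Series; the column renaming is an assumption chosen to match the notebook's `is_...` naming.

```python
import pandas as pd

stage = pd.Series(["Applied", "Interview 1", "Hired"], name="stage")

# One 0/1 column per distinct stage value, prefixed with "is_".
flags = pd.get_dummies(stage, prefix="is", dtype=int)

# Normalise names: "is_Interview 1" -> "is_interview1".
flags.columns = flags.columns.str.lower().str.replace(" ", "", regex=False)

print(flags)
```

Unlike the explicit version, `get_dummies` only creates columns for values that actually occur, and it would also create an `is_unknown` column for rows filled with 'Unknown'.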


#### 6.3 Create Source Binary Flags

In [25]:
# Create binary flags for sources
df['is_linkedin'] = (df['source'] == 'Linkedin').astype(int)
df['is_indeed'] = (df['source'] == 'Indeed').astype(int)
df['is_referral'] = (df['source'] == 'Employee Referral').astype(int)
df['is_company_website'] = (df['source'] == 'Company Website').astype(int)
df['is_job_fair'] = (df['source'] == 'Job Fair').astype(int)
print("Created source binary flags")

Created source binary flags


#### 6.4 Create Time-Based Features

In [26]:
# Calculate time in recruitment process
df['days_in_process'] = (df['dateatstage'] - df['applicationdate']).dt.days
df['days_in_process'] = df['days_in_process'].fillna(0)
print("Created 'days_in_process' feature")
print("\nDays in process statistics:")
print(df['days_in_process'].describe())

Created 'days_in_process' feature

Days in process statistics:
count    16668.000000
mean         6.157187
std          7.129098
min          0.000000
25%          0.000000
50%          4.000000
75%         10.000000
max         33.000000
Name: days_in_process, dtype: float64
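Because `fillna(0)` is applied, the 1,686 rows with no `dateatstage` are counted as zero days, which lowers the mean and quartiles shown above. A small sketch of the effect on made-up dates follows; leaving the gap as NaN and adding a flag such as `has_stage_date` is one possible alternative, and that column name is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "applicationdate": pd.to_datetime(["2025-10-01", "2025-10-01", "2025-10-01"]),
    "dateatstage": pd.to_datetime(["2025-10-05", "2025-10-11", None]),
})

# Elapsed days; the row without a stage date stays NaN.
days = (toy["dateatstage"] - toy["applicationdate"]).dt.days

# Same column with missing durations replaced by zero, as in the cell above.
filled = days.fillna(0)

# Flag that records whether a duration could be computed at all.
toy["has_stage_date"] = toy["dateatstage"].notna()

print(f"mean ignoring missing: {days.mean():.2f}")
print(f"mean with zeros:       {filled.mean():.2f}")
```

pandas statistics skip NaN by default, so keeping the gap does not break `describe()` or `mean()`.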


#### 6.5 Create Position-Level Aggregations

In [27]:
# Calculate position metrics
df['applications_per_position'] = df.groupby('positionid')['candidateid'].transform('count')
df['hires_per_position'] = df.groupby('positionid')['is_hired'].transform('sum')
df['rejections_per_position'] = df.groupby('positionid')['is_rejected'].transform('sum')
print("Created position-level metrics")

Created position-level metrics


#### 6.6 Create Conversion Metrics

In [28]:
# Calculate conversion rates at position level
position_stats = df.groupby('positionid').agg({
    'candidateid': 'count',
    'is_screened': 'sum',
    'is_interview1': 'sum',
    'is_interview2': 'sum',
    'is_offer': 'sum',
    'is_hired': 'sum'
}).rename(columns={'candidateid': 'total_applications'})

position_stats['screening_rate'] = (position_stats['is_screened'] / position_stats['total_applications'] * 100).round(2)
position_stats['interview1_rate'] = (position_stats['is_interview1'] / position_stats['total_applications'] * 100).round(2)
position_stats['hire_rate'] = (position_stats['is_hired'] / position_stats['total_applications'] * 100).round(2)

print("Position-level conversion rates:")
print(position_stats.head(10))

Position-level conversion rates:
            total_applications  is_screened  is_interview1  is_interview2  \
positionid                                                                  
POS-1                       21            2              1              1   
POS-10                      35            8              5              2   
POS-100                     32            2              4              1   
POS-101                     25            2              5              2   
POS-102                     18            0              0              0   
POS-103                     44            8              2              6   
POS-104                     49            6              5              4   
POS-105                     26            2              3              0   
POS-106                     28            4              3              2   
POS-107                     29            8              2              2   

            is_offer  is_hired  screening_
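The rates in this cell divide the number of rows currently at a stage by all rows for the position. If each row records a candidate's latest stage (an assumption about the data, not something the notebook states), a candidate now at Interview 1 has also passed screening, so a "reached at least" count may be closer to a funnel conversion. The sketch below uses made-up rows; the `thresholds` dictionary and `reached_*` column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "positionid": ["POS-1"] * 5,
    "stage_number": [1, 2, 3, 6, 7],  # 7 = Rejected in the notebook's mapping
})

# Keep funnel stages only; for Rejected (7) and Unknown (0) the last step reached is not known.
in_funnel = toy[toy["stage_number"].between(1, 6)]

# Minimum stage_number that counts as having reached each step.
thresholds = {"screened": 2, "interview1": 3, "hired": 6}

reached = pd.DataFrame({
    f"reached_{name}": in_funnel["stage_number"].ge(level)
                                                .groupby(in_funnel["positionid"])
                                                .sum()
    for name, level in thresholds.items()
})
reached["in_funnel"] = in_funnel.groupby("positionid").size()

print(reached)
```

Rejected candidates are excluded here because the data does not show the stage at which they were rejected, so these counts describe active and hired candidates only.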

#### 6.7 Create Source Effectiveness Metrics

In [29]:
# Calculate source effectiveness
source_stats = df.groupby('source').agg({
    'candidateid': 'count',
    'is_hired': 'sum',
    'is_rejected': 'sum'
}).rename(columns={'candidateid': 'total_candidates'})

source_stats['hire_rate'] = (source_stats['is_hired'] / source_stats['total_candidates'] * 100).round(2)
source_stats['rejection_rate'] = (source_stats['is_rejected'] / source_stats['total_candidates'] * 100).round(2)

print("\nSource effectiveness:")
print(source_stats.sort_values('hire_rate', ascending=False))


Source effectiveness:
                   total_candidates  is_hired  is_rejected  hire_rate  \
source                                                                  
Company Website                2918       171          729       5.86   
Employee Referral              3034       161          781       5.31   
Linkedin                       2992       158          691       5.28   
Unknown                        1748        92          415       5.26   
Indeed                         2988       156          735       5.22   
Job Fair                       2988       154          752       5.15   

                   rejection_rate  
source                             
Company Website             24.98  
Employee Referral           25.74  
Linkedin                    23.09  
Unknown                     23.74  
Indeed                      24.60  
Job Fair                    25.17  


#### 6.8 Create Candidate Journey Features

In [30]:
# Count stages per candidate
candidate_stages = df.groupby('candidateid').agg({
    'stage': 'count',
    'stage_number': 'max'
}).rename(columns={'stage': 'stage_count', 'stage_number': 'furthest_stage'})

df = df.merge(candidate_stages, on='candidateid', how='left')
print("Created candidate journey features")

Created candidate journey features
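The aggregate-then-merge pattern above can also be written with `groupby(...).transform`, which returns values already aligned to the original rows and keeps the existing index. This is a sketch on made-up data rather than a drop-in replacement.

```python
import pandas as pd

toy = pd.DataFrame({
    "candidateid": ["CAND-16", "CAND-16", "CAND-19"],
    "stage_number": [1, 2, 2],
})

by_candidate = toy.groupby("candidateid")["stage_number"]

# Number of rows recorded for each candidate, repeated on every row of that candidate.
toy["stage_count"] = by_candidate.transform("count")

# Highest stage_number seen for each candidate.
toy["furthest_stage"] = by_candidate.transform("max")

print(toy)
```

Both versions count rows rather than distinct stages, so any duplicate rows still present would raise `stage_count`.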


#### 6.9 Create Seasonal Features

In [31]:
# Create season feature
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['application_season'] = df['application_month'].apply(get_season)
print("Created 'application_season' feature")
print("\nApplications by season:")
print(df['application_season'].value_counts())

Created 'application_season' feature

Applications by season:
application_season
Fall      13063
Summer     3605
Name: count, dtype: int64
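Only Summer and Fall appear because the application dates run from 2025-07-01 to 2025-10-29. The same classification can be expressed as a dictionary lookup, which avoids calling a Python function per row. A minimal sketch, assuming Northern Hemisphere meteorological seasons as in `get_season`; `SEASON_BY_MONTH` is an illustrative name.

```python
import pandas as pd

# Month number -> season, mirroring the branches of get_season.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

months = pd.Series([7, 8, 9, 10, 12])
season = months.map(SEASON_BY_MONTH)

print(season.tolist())
```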


#### 6.10 Create Success Indicators

In [32]:
# Create success category
def categorize_outcome(row):
    if row['is_hired'] == 1:
        return 'Hired'
    elif row['is_rejected'] == 1:
        return 'Rejected'
    elif row['is_offer'] == 1:
        return 'Offer Stage'
    elif row['stage_number'] >= 3:
        return 'In Progress - Advanced'
    elif row['stage_number'] >= 1:
        return 'In Progress - Early'
    else:
        return 'Unknown'

df['recruitment_outcome'] = df.apply(categorize_outcome, axis=1)
print("Created 'recruitment_outcome' feature")
print("\nRecruitment outcome distribution:")
print(df['recruitment_outcome'].value_counts())

Created 'recruitment_outcome' feature

Recruitment outcome distribution:
recruitment_outcome
In Progress - Early       6109
Rejected                  4103
In Progress - Advanced    2939
Unknown                   1764
Hired                      892
Offer Stage                861
Name: count, dtype: int64
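`df.apply(..., axis=1)` calls the function once per row, which can become slow on larger frames. The same decision order can be expressed with `numpy.select`, where the first matching condition wins. The sketch below runs on made-up rows and is meant to mirror `categorize_outcome`, not replace it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "stage": ["Hired", "Rejected", "Offer", "Interview 2", "Applied", "Unknown"],
    "stage_number": [6, 7, 5, 4, 1, 0],
})

# Conditions are checked in order, matching the if/elif chain.
conditions = [
    toy["stage"].eq("Hired"),
    toy["stage"].eq("Rejected"),
    toy["stage"].eq("Offer"),
    toy["stage_number"].ge(3),
    toy["stage_number"].ge(1),
]
labels = [
    "Hired",
    "Rejected",
    "Offer Stage",
    "In Progress - Advanced",
    "In Progress - Early",
]

toy["recruitment_outcome"] = np.select(conditions, labels, default="Unknown")
print(toy)
```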


#### 6.11 Sort Data

In [33]:
# Sort by PositionID and ApplicationDate
df = df.sort_values(['positionid', 'applicationdate']).reset_index(drop=True)
print("Data sorted by PositionID and ApplicationDate")
df.head(10)

Data sorted by PositionID and ApplicationDate


|   | candidateid | positionid | jobtitle | applicationdate | source | stage | dateatstage | application_year | application_month | application_quarter | ... | is_company_website | is_job_fair | days_in_process | applications_per_position | hires_per_position | rejections_per_position | stage_count | furthest_stage | application_season | recruitment_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAND-1 | POS-1 | Unknown | 2025-08-18 | Indeed | Rejected | 2025-08-18 | 2025 | 8 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 1 | 7 | Summer | Rejected |
| 1 | CAND-15 | POS-1 | Regional Sales | 2025-09-07 | Employee Referral | Unknown | 2025-09-07 | 2025 | 9 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 1 | 0 | Fall | Unknown |
| 2 | CAND-13 | POS-1 | Unknown | 2025-09-14 | Job Fair | Rejected | 2025-09-14 | 2025 | 9 | 3 | ... | 0 | 1 | 0.0 | 21 | 1 | 8 | 1 | 7 | Fall | Rejected |
| 3 | CAND-16 | POS-1 | Regional Sales | 2025-09-15 | Employee Referral | Applied | 2025-09-15 | 2025 | 9 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 2 | 1 | Fall | In Progress - Early |
| 4 | CAND-16 | POS-1 | Regional Sales | 2025-09-15 | Employee Referral | Applied | 2025-09-15 | 2025 | 9 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 2 | 1 | Fall | In Progress - Early |
| 5 | CAND-14 | POS-1 | Regional Sales | 2025-09-20 | Indeed | Rejected | 2025-09-20 | 2025 | 9 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 1 | 7 | Fall | Rejected |
| 6 | CAND-9 | POS-1 | Regional Sales | 2025-09-25 | Unknown | Rejected | 2025-09-25 | 2025 | 9 | 3 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 1 | 7 | Fall | Rejected |
| 7 | CAND-22 | POS-1 | Regional Sales | 2025-10-02 | Indeed | Unknown | 2025-10-02 | 2025 | 10 | 4 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 2 | 1 | Fall | Unknown |
| 8 | CAND-22 | POS-1 | Regional Sales | 2025-10-02 | Indeed | Applied | 2025-10-02 | 2025 | 10 | 4 | ... | 0 | 0 | 0.0 | 21 | 1 | 8 | 2 | 1 | Fall | In Progress - Early |
| 9 | CAND-19 | POS-1 | Regional Sales | 2025-10-03 | Company Website | Screened | 2025-10-08 | 2025 | 10 | 4 | ... | 1 | 0 | 5.0 | 21 | 1 | 8 | 1 | 2 | Fall | In Progress - Early |


#### 6.12 Final Data Quality Check

In [34]:
# Final summary
print("=" * 60)
print("FINAL DATA QUALITY SUMMARY")
print("=" * 60)
print(f"\nFinal shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print("\nMissing values:")
print(df.isnull().sum())
print(f"\nDuplicates: {df.duplicated().sum()}")

FINAL DATA QUALITY SUMMARY

Final shape: (16668, 33)

Columns: ['candidateid', 'positionid', 'jobtitle', 'applicationdate', 'source', 'stage', 'dateatstage', 'application_year', 'application_month', 'application_quarter', 'application_month_name', 'application_day_of_week', 'stage_number', 'is_applied', 'is_screened', 'is_interview1', 'is_interview2', 'is_offer', 'is_hired', 'is_rejected', 'is_linkedin', 'is_indeed', 'is_referral', 'is_company_website', 'is_job_fair', 'days_in_process', 'applications_per_position', 'hires_per_position', 'rejections_per_position', 'stage_count', 'furthest_stage', 'application_season', 'recruitment_outcome']

Missing values:
candidateid                     0
positionid                      0
jobtitle                        0
applicationdate                 0
source                          0
stage                           0
dateatstage                  1686
application_year                0
application_month               0
application_quarter          

### Save Cleaned Data

In [35]:
# Save to processed folder
output_path = '../data/processed/Recruitment_Funnel_cleaned.csv'
df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")
print(f"Total records: {len(df)}")

Cleaned data saved to: ../data/processed/Recruitment_Funnel_cleaned.csv
Total records: 16668
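CSV stores dates as plain text, so `applicationdate` and `dateatstage` come back as strings unless the reader is told to parse them. The sketch below shows the round trip on a one-row made-up frame, using an in-memory buffer in place of the real file path.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "candidateid": ["CAND-1"],
    "applicationdate": pd.to_datetime(["2025-08-18"]),
    "dateatstage": pd.to_datetime(["2025-08-18"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: date columns come back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: date columns are restored to datetime64.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["applicationdate", "dateatstage"])

print(plain.dtypes)
print(parsed.dtypes)
```

Any downstream notebook that loads `Recruitment_Funnel_cleaned.csv` would need the same `parse_dates` argument to use the `.dt` accessor on these columns.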


---
## Summary

### Data Cleaning & Feature Engineering Completed!

**Steps Performed:**
1. Checked for redundant columns
2. Lowercased column names
3. Removed duplicates
4. Handled missing values
5. Cleaned all columns:
   - CandidateID, PositionID, JobTitle
   - ApplicationDate, DateatStage (datetime)
   - Source, Stage (standardized)
6. **Feature Engineering:**
   - **Stage hierarchy**: Stage numbers and ordering
   - **Binary flags**: Stage and source indicators
   - **Time-based**: Days in recruitment process
   - **Position metrics**: Applications, hires, rejections per position
   - **Conversion rates**: Screening, interview, hire rates
   - **Source effectiveness**: Hire and rejection rates by source
   - **Candidate journey**: Stage count, furthest stage reached
   - **Seasonal**: Application season classification
   - **Outcome categories**: Hired, Rejected, In Progress, etc.

**Feature Summary:**
- Original columns: 7
- Final columns: 33
- New features: 26