# HR Dashboard - Data Cleaning
## Employee Master Dataset

This notebook performs data cleaning and wrangling on the Employee_Master dataset.

### Import Libraries

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Load Raw Data

In [44]:
# Load the Employee Master dataset
df = pd.read_csv('../data/raw/Employee_Master.csv')
print(f"Original dataset shape: {df.shape}")
df.head(10)

Original dataset shape: (2000, 12)


| | EmployeeID | FullName | Department | JobTitle | Email | Gender | HireDate | TerminationDate | Salary | ManagerID | PerformanceRating | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PNR-1001 | Aarav Taylor | IT Support | Help Desk Specialist | employee.1001@pnrao.com | | 2021-05-08 | | 79378.0 | | | Active |
| 1 | PNR-1002 | Emma Jones | Quality Control | | employee.1002@pnrao.com | Female | 2019-11-13 | | 69570.0 | PNR-1714 | 4.0 | Active |
| 2 | PNR-1002 | Emma Jones | Quality Control | QC Inspector | employee.1002@pnrao.com | Female | 2019-11-13 | | 69570.0 | PNR-1714 | 4.0 | Active |
| 3 | PNR-1004 | Rohan Jones | Human Resources | HR Coordinator | employee.1004@pnrao.com | Female | 2020-02-11 | | 61572.0 | | 4.0 | |
| 4 | PNR-1004 | Rohan Jones | Human Resources | HR Coordinator | employee.1004@pnrao.com | Female | 2020-02-11 | | 61572.0 | PNR-1925 | 4.0 | Active |
| 5 | PNR-1006 | | Logistics | Warehouse Associate | employee.1006@pnrao.com | | 2025-02-09 | | 60796.0 | PNR-1690 | 2.0 | Active |
| 6 | PNR-1007 | John Smith | Marketing | Marketing Coordinator | employee.1007@pnrao.com | Female | 2023-08-21 | | 76107.0 | PNR-2219 | | Active |
| 7 | PNR-1008 | Chris Jones | Research & Development | Lab Technician | employee.1008@pnrao.com | | 2023-07-27 | | 79117.0 | PNR-1441 | 3.0 | Active |
| 8 | PNR-1009 | | Human Resources | HR Generalist | employee.1009@pnrao.com | Non-binary | 2022-03-17 | | 59818.0 | PNR-1847 | 5.0 | Active |
| 9 | PNR-1010 | Anika Sharma | Marketing | Marketing Manager | employee.1010@pnrao.com | Non-binary | | 2028-05-06 | 133925.0 | | 3.0 | Terminated |


### Initial Data Exploration

In [45]:
# Display basic information
print("Dataset Info:")
df.info()
print("\n" + "="*50)
print("\nBasic Statistics:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         1778 non-null   object 
 1   FullName           1759 non-null   object 
 2   Department         1768 non-null   object 
 3   JobTitle           1793 non-null   object 
 4   Email              1779 non-null   object 
 5   Gender             1789 non-null   object 
 6   HireDate           1800 non-null   object 
 7   TerminationDate    454 non-null    object 
 8   Salary             1820 non-null   float64
 9   ManagerID          1407 non-null   object 
 10  PerformanceRating  1844 non-null   float64
 11  Status             1767 non-null   object 
dtypes: float64(2), object(10)
memory usage: 187.6+ KB


Basic Statistics:


| | Salary | PerformanceRating |
|---|---|---|
| count | 1820.0 | 1844.0 |
| mean | 91273.081319 | 3.523861 |
| std | 36164.134219 | 1.119053 |
| min | 55060.0 | 2.0 |
| 25% | 65727.5 | 3.0 |
| 50% | 76243.0 | 4.0 |
| 75% | 120660.25 | 5.0 |
| max | 179521.0 | 5.0 |


In [46]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Missing_Count': missing, 'Percentage': missing_percent})
print(missing_df)

Missing Values:
                   Missing_Count  Percentage
EmployeeID                   222       11.10
FullName                     241       12.05
Department                   232       11.60
JobTitle                     207       10.35
Email                        221       11.05
Gender                       211       10.55
HireDate                     200       10.00
TerminationDate             1546       77.30
Salary                       180        9.00
ManagerID                    593       29.65
PerformanceRating            156        7.80
Status                       233       11.65


In [47]:
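# Check for duplicates
# Note: Series.duplicated() treats missing values as equal, so missing EmployeeIDs count as duplicates of one another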
print(f"Total duplicate rows: {df.duplicated().sum()}")
print(f"Duplicate EmployeeID values: {df['EmployeeID'].duplicated().sum()}")

Total duplicate rows: 9
Duplicate EmployeeID values: 416
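
The duplicate-ID count above includes missing values, because `duplicated()` treats NaN entries as equal to each other. With 222 missing IDs in this dataset, a large share of the 416 probably comes from missing values rather than repeated keys. The sketch below shows the difference on a small illustrative Series (the values are made up for the example):

```python
import numpy as np
import pandas as pd

ids = pd.Series(['PNR-1001', 'PNR-1002', 'PNR-1002', np.nan, np.nan, np.nan])

# duplicated() treats NaN values as equal, so two of the three NaNs are flagged
all_flagged = int(ids.duplicated().sum())

# Counting only non-missing IDs isolates the genuinely repeated keys
real_duplicates = int(ids.dropna().duplicated().sum())

print(all_flagged, real_duplicates)
```

Applied to this notebook, the equivalent check would be `df['EmployeeID'].dropna().duplicated().sum()`.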


---
## Data Cleaning Steps

### Step 1: Delete Redundant Columns

In [48]:
# Check for redundant columns
# For Employee_Master, all columns appear to be relevant
print("All columns are relevant for analysis.")
print(f"Columns: {list(df.columns)}")

All columns are relevant for analysis.
Columns: ['EmployeeID', 'FullName', 'Department', 'JobTitle', 'Email', 'Gender', 'HireDate', 'TerminationDate', 'Salary', 'ManagerID', 'PerformanceRating', 'Status']


### Step 2: Drop / Rename the Columns

In [49]:
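# Lowercase column names and replace spaces with underscores
# Note: the source names are CamelCase with no spaces, so this yields
# 'employeeid' rather than true snake_case such as 'employee_id'
df.columns = df.columns.str.lower().str.replace(' ', '_')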
print("Renamed columns:")
print(list(df.columns))

Renamed columns:
['employeeid', 'fullname', 'department', 'jobtitle', 'email', 'gender', 'hiredate', 'terminationdate', 'salary', 'managerid', 'performancerating', 'status']
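
Because the source column names are CamelCase and contain no spaces, the rename above only lowercases them. If names such as `employee_id` are preferred, a minimal sketch follows. `to_snake_case` is a hypothetical helper, not part of this notebook, and adopting it would mean updating every later column reference to match.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a CamelCase or spaced column name to snake_case (illustrative helper)."""
    # Insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter
    with_breaks = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())
    return with_breaks.replace(' ', '_').lower()

original = ['EmployeeID', 'FullName', 'HireDate', 'PerformanceRating', 'Status']
renamed = [to_snake_case(col) for col in original]
print(renamed)
```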


### Step 3: Remove Duplicates

In [50]:
# Remove fully duplicated rows
rows_before = len(df)
print(f"Rows before removing duplicates: {rows_before}")
df = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df)}")
print(f"Duplicates removed: {rows_before - len(df)}")

Rows before removing duplicates: 2000
Rows after removing duplicates: 1991
Duplicates removed: 9


### Step 4: Remove the NaN Values from the Dataset

In [51]:
# Check missing values before cleaning
print("Missing values before cleaning:")
print(df.isnull().sum())
print(f"\nTotal rows: {len(df)}")

Missing values before cleaning:
employeeid            222
fullname              241
department            232
jobtitle              207
email                 221
gender                211
hiredate              200
terminationdate      1537
salary                180
managerid             590
performancerating     156
status                233
dtype: int64

Total rows: 1991


In [52]:
# Remove rows where critical columns are missing
# Critical columns: EmployeeID (primary key)
df = df.dropna(subset=['employeeid'])
print(f"Rows after removing missing EmployeeID: {len(df)}")

Rows after removing missing EmployeeID: 1769


In [53]:
# Handle missing values in other columns
# TerminationDate: Missing values are valid (active employees)
# ManagerID: Missing values are valid (top-level executives)
# PerformanceRating: Missing values are valid (new employees)

# Remove rows missing essential employee information
essential_cols = ['fullname', 'department', 'jobtitle', 'hiredate', 'status']
df = df.dropna(subset=essential_cols)
print(f"Rows after removing missing essential information: {len(df)}")

Rows after removing missing essential information: 1019


In [54]:
# Fill missing Gender with 'Unknown'
df['gender'] = df['gender'].fillna('Unknown')
print("\nMissing values after initial cleaning:")
print(df.isnull().sum())


Missing values after initial cleaning:
employeeid             0
fullname               0
department             0
jobtitle               0
email                 84
gender                 0
hiredate               0
terminationdate      791
salary                79
managerid            288
performancerating     65
status                 0
dtype: int64


### Step 5: Clean Individual Columns

#### 5.1 Clean EmployeeID Column

In [55]:
# Check EmployeeID format
print("Sample EmployeeID values:")
print(df['employeeid'].head(20))
print(f"\nUnique employees: {df['employeeid'].nunique()}")

Sample EmployeeID values:
0       PNR-1001
2       PNR-1002
4       PNR-1004
6      PNR-1007 
7       PNR-1008
12      PNR-1013
16      PNR-1017
17      PNR-1017
18      PNR-1018
19      PNR-1020
20      PNR-1021
25      PNR-1025
26      PNR-1026
28      PNR-1028
32      PNR-1033
33      PNR-1033
34      PNR-1035
36      PNR-1036
39      PNR-1040
42      PNR-1043
Name: employeeid, dtype: object

Unique employees: 938


In [56]:
# Remove whitespace and check for duplicates
df['employeeid'] = df['employeeid'].str.strip()
duplicate_count = df['employeeid'].duplicated().sum()
print(f"Duplicate EmployeeIDs: {duplicate_count}")
if duplicate_count > 0:
    df = df.drop_duplicates(subset=['employeeid'], keep='last')
    print(f"Rows after removing duplicates: {len(df)}")

Duplicate EmployeeIDs: 100
Rows after removing duplicates: 919
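
`keep='last'` retains whichever duplicate happens to appear last, which may not be the most complete record. One possible alternative, sketched below on illustrative data, keeps the row with the most populated fields for each employee. The helper column `_filled` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    'employeeid': ['PNR-1017', 'PNR-1017', 'PNR-1018'],
    'salary': [83620.0, np.nan, 78927.0],
    'managerid': ['PNR-1533', np.nan, 'PNR-2098'],
})

# Count populated fields per row, then keep the best-populated row per employee
filled = records.notna().sum(axis=1)
deduped = (
    records.assign(_filled=filled)
    .sort_values('_filled', kind='mergesort')   # stable sort keeps original order on ties
    .drop_duplicates(subset='employeeid', keep='last')
    .drop(columns='_filled')
    .sort_index()
)
print(deduped)
```

Here the first `PNR-1017` row survives because it has more populated fields, whereas `keep='last'` alone would have kept the sparser second row.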


#### 5.2 Clean FullName Column

In [57]:
# Clean FullName
df['fullname'] = df['fullname'].str.strip().str.title()
print("FullName cleaned")

FullName cleaned


#### 5.3 Clean Department Column

In [58]:
# Check Department values
print("Department value counts (before):")
print(df['department'].value_counts())

Department value counts (before):
department
Logistics                   104
Quality Control             101
Production                   98
Human Resources              94
Marketing                    91
Research & Development       88
IT Support                   87
Sales                        86
Finance                      81
 Sales                       18
 Production                  14
 Human Resources             12
 IT Support                  11
 Research & Development      10
 Quality Control              7
 Logistics                    7
 Finance                      6
 Marketing                    4
Name: count, dtype: int64


In [59]:
# Clean Department
# Note: str.title() also lowercases acronyms, so 'IT Support' becomes 'It Support'
df['department'] = df['department'].str.strip().str.title()
print("\nDepartment value counts (after):")
print(df['department'].value_counts())


Department value counts (after):
department
Production                112
Logistics                 111
Quality Control           108
Human Resources           106
Sales                     104
It Support                 98
Research & Development     98
Marketing                  95
Finance                    87
Name: count, dtype: int64
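
The output above shows `It Support`, a side effect of `str.title()` lowercasing every letter after the first in each word. A minimal sketch of one way to preserve acronyms follows. The `ACRONYMS` mapping and the `title_keep_acronyms` helper are assumptions for illustration, and the list would need to match the real vocabulary.

```python
import pandas as pd

# Assumed acronym list; extend it to match the real department and job-title vocabulary
ACRONYMS = {'It': 'IT', 'Hr': 'HR', 'Qc': 'QC'}

def title_keep_acronyms(text: str) -> str:
    """Title-case a label, then restore known acronyms (illustrative helper)."""
    words = text.strip().title().split()
    return ' '.join(ACRONYMS.get(word, word) for word in words)

departments = pd.Series([' IT Support', 'IT Support', 'human resources', 'Research & Development'])
cleaned = departments.map(title_keep_acronyms)
print(cleaned.tolist())
```

The same helper could be applied to job titles, where `QC Inspector` and `HR Coordinator` are affected in the same way. Note that `split()` also collapses repeated internal spaces.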


#### 5.4 Clean JobTitle Column

In [60]:
# Clean JobTitle
# Note: str.title() also lowercases acronyms, so 'QC Inspector' becomes 'Qc Inspector'
df['jobtitle'] = df['jobtitle'].str.strip().str.title()
print("JobTitle cleaned")

JobTitle cleaned


#### 5.5 Clean Email Column

In [61]:
# Clean Email
df['email'] = df['email'].str.strip().str.lower()

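# Generate missing emails from the employee's name
# Note: the first.last@company.com pattern differs from the existing
# employee.<number>@pnrao.com addresses, and employees who share a name get the same address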
def generate_email(row):
    if pd.isnull(row['email']):
        name_parts = row['fullname'].lower().split()
        if len(name_parts) >= 2:
            return f"{name_parts[0]}.{name_parts[-1]}@company.com"
        return f"{row['employeeid'].lower()}@company.com"
    return row['email']

df['email'] = df.apply(generate_email, axis=1)
print(f"Missing emails: {df['email'].isnull().sum()}")

Missing emails: 0
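
The existing addresses in the preview appear to follow the pattern `employee.<number>@pnrao.com`, where the number matches the numeric part of the EmployeeID. Assuming that pattern holds across the dataset, missing emails could be derived from the ID instead of the name, which also guarantees uniqueness. `email_from_id` is a hypothetical helper for this sketch.

```python
import pandas as pd

def email_from_id(employee_id: str, domain: str = 'pnrao.com') -> str:
    """Build an address from the numeric part of an ID such as 'PNR-1001' (assumed pattern)."""
    number = employee_id.strip().split('-')[-1]
    return f"employee.{number}@{domain}"

staff = pd.DataFrame({
    'employeeid': ['PNR-1001', 'PNR-1002'],
    'email': ['employee.1001@pnrao.com', None],
})

# Fill only the missing addresses; existing ones are left untouched
generated = staff['employeeid'].map(email_from_id)
staff['email'] = staff['email'].fillna(generated)
print(staff['email'].tolist())
```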


#### 5.6 Clean Gender Column

In [62]:
# Clean Gender
df['gender'] = df['gender'].str.strip().str.title()
print("Gender value counts:")
print(df['gender'].value_counts())

Gender value counts:
gender
Female        279
Male          279
Non-Binary    277
Unknown        84
Name: count, dtype: int64


#### 5.7 Clean HireDate Column

In [63]:
# Convert HireDate to datetime
df['hiredate'] = df['hiredate'].str.strip()
df['hiredate'] = pd.to_datetime(df['hiredate'], errors='coerce')
print(f"Date range: {df['hiredate'].min()} to {df['hiredate'].max()}")
print(f"Invalid dates: {df['hiredate'].isnull().sum()}")

Date range: 2018-01-02 00:00:00 to 2025-10-27 00:00:00
Invalid dates: 0


In [64]:
# Remove invalid hire dates
rows_before = len(df)
df = df.dropna(subset=['hiredate'])
print(f"Removed {rows_before - len(df)} rows with invalid dates")

Removed 0 rows with invalid dates


In [65]:
# Extract date features
df['hire_year'] = df['hiredate'].dt.year
df['hire_month'] = df['hiredate'].dt.month
df['hire_quarter'] = df['hiredate'].dt.quarter
print("Extracted hire date features")

Extracted hire date features


#### 5.8 Clean TerminationDate Column

In [66]:
# Convert TerminationDate (missing values are valid)
df['terminationdate'] = df['terminationdate'].str.strip()
df['terminationdate'] = pd.to_datetime(df['terminationdate'], errors='coerce')
print(f"Terminated employees: {df['terminationdate'].notnull().sum()}")

Terminated employees: 122
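
Parsing alone does not confirm that the dates are plausible: the raw preview includes a termination date of 2028-05-06. A minimal sketch of two sanity checks follows, using illustrative rows and a fixed reference date `as_of`, which is an assumption for reproducibility.

```python
import pandas as pd

# Fixed reference date so the check is reproducible; an assumption for illustration
as_of = pd.Timestamp('2025-11-01')

dates = pd.DataFrame({
    'hiredate': pd.to_datetime(['2021-05-08', '2023-08-21', '2024-03-01']),
    'terminationdate': pd.to_datetime(['2028-05-06', None, '2023-12-31']),
})

# NaT comparisons evaluate to False, so rows without a termination date are never flagged
dates['termination_in_future'] = dates['terminationdate'] > as_of
dates['termination_before_hire'] = dates['terminationdate'] < dates['hiredate']
print(dates)
```

Flagged rows can then be reviewed or excluded before tenure is calculated, since either condition distorts `tenure_days`.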


In [67]:
# Extract termination features
df['termination_year'] = df['terminationdate'].dt.year
df['termination_month'] = df['terminationdate'].dt.month
df['termination_quarter'] = df['terminationdate'].dt.quarter
print("Extracted termination date features")

Extracted termination date features


#### 5.9 Clean Salary Column

In [68]:
# Check Salary
print("Salary statistics:")
print(df['salary'].describe())
print(f"\nNegative/zero salaries: {((df['salary'] <= 0) & df['salary'].notnull()).sum()}")

Salary statistics:
count       850.000000
mean      91374.548235
std       36012.977513
min       55110.000000
25%       65668.000000
50%       76384.000000
75%      120616.250000
max      179381.000000
Name: salary, dtype: float64

Negative/zero salaries: 0


In [69]:
# Remove invalid salaries
df = df[(df['salary'] > 0) | (df['salary'].isnull())]
print(f"Rows after removing invalid salaries: {len(df)}")

Rows after removing invalid salaries: 919


In [70]:
# Fill missing salaries with median by department and job
if df['salary'].isnull().sum() > 0:
    df['salary'] = df.groupby(['department', 'jobtitle'])['salary'].transform(
        lambda x: x.fillna(x.median())
    )
    df['salary'] = df['salary'].fillna(df['salary'].median())
print(f"Missing salaries: {df['salary'].isnull().sum()}")

Missing salaries: 0
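
To make the two-stage imputation concrete, here is a small worked example on illustrative data. The Finance accountant with a missing salary receives the median of that group, while the Sales row, whose group has no known salary, falls back to the overall median.

```python
import numpy as np
import pandas as pd

pay = pd.DataFrame({
    'department': ['Finance', 'Finance', 'Finance', 'Sales'],
    'jobtitle': ['Accountant', 'Accountant', 'Accountant', 'Account Manager'],
    'salary': [60000.0, 80000.0, np.nan, np.nan],
})

# First pass: median of the employee's own department and job title
group_median = pay.groupby(['department', 'jobtitle'])['salary'].transform('median')
pay['salary'] = pay['salary'].fillna(group_median)

# Second pass: groups with no known salary fall back to the overall median
pay['salary'] = pay['salary'].fillna(pay['salary'].median())
print(pay['salary'].tolist())
```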


#### 5.10 Clean ManagerID Column

In [71]:
# Clean ManagerID (missing is valid for executives)
# Note: str.strip() leaves whitespace-only IDs as empty strings, which are not counted as missing
df['managerid'] = df['managerid'].str.strip()
print(f"Missing ManagerID: {df['managerid'].isnull().sum()} (valid for executives)")

Missing ManagerID: 261 (valid for executives)


#### 5.11 Clean PerformanceRating Column

In [72]:
# Check PerformanceRating
print("PerformanceRating statistics:")
print(df['performancerating'].describe())
print(f"\nMissing: {df['performancerating'].isnull().sum()} (valid for new employees)")

PerformanceRating statistics:
count    864.000000
mean       3.506944
std        1.126916
min        2.000000
25%        2.000000
50%        3.500000
75%        5.000000
max        5.000000
Name: performancerating, dtype: float64

Missing: 55 (valid for new employees)


In [73]:
# Remove invalid ratings (outside 1-5)
invalid = ((df['performancerating'] < 1) | (df['performancerating'] > 5)) & df['performancerating'].notnull()
print(f"Invalid ratings: {invalid.sum()}")
if invalid.sum() > 0:
    df = df[~invalid]
    print(f"Rows after removing invalid ratings: {len(df)}")

Invalid ratings: 0


#### 5.12 Clean Status Column

In [74]:
# Check Status
print("Status (before):")
print(df['status'].value_counts(dropna=False))

Status (before):
status
Active          710
Terminated      123
 Active          74
 Terminated      12
Name: count, dtype: int64


In [75]:
# Clean Status
df['status'] = df['status'].str.strip().str.title()
print("\nStatus (after):")
print(df['status'].value_counts())


Status (after):
status
Active        784
Terminated    135
Name: count, dtype: int64


### Step 6: Additional Transformations

#### 6.1 Create Calculated Columns

In [76]:
# Calculate tenure (vectorized): use the termination date when present, otherwise today
current_date = pd.Timestamp.now()

tenure_end = df['terminationdate'].fillna(current_date)
df['tenure_days'] = (tenure_end - df['hiredate']).dt.days
df['tenure_years'] = (df['tenure_days'] / 365.25).round(2)
print("Tenure statistics:")
print(df['tenure_years'].describe())

Tenure statistics:
count    919.000000
mean       3.811991
std        2.227046
min        0.010000
25%        1.875000
50%        3.740000
75%        5.695000
max        7.810000
Name: tenure_years, dtype: float64


In [77]:
# Create tenure bands
df['tenure_band'] = pd.cut(
    df['tenure_years'],
    bins=[0, 1, 3, 5, 10, 100],
    labels=['0-1 years', '1-3 years', '3-5 years', '5-10 years', '10+ years']
)
print("\nTenure band distribution:")
print(df['tenure_band'].value_counts().sort_index())


Tenure band distribution:
tenure_band
0-1 years     123
1-3 years     250
3-5 years     237
5-10 years    309
10+ years       0
Name: count, dtype: int64
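
By default `pd.cut` builds right-closed intervals such as (0, 1] and (1, 3], so a tenure of exactly 0 falls outside every band and a tenure of exactly 1.0 lands in '0-1 years'. The sketch below compares the default with `right=False` on a few illustrative values:

```python
import pandas as pd

tenure = pd.Series([0.0, 1.0, 1.01, 12.0])
edges = [0, 1, 3, 5, 10, 100]
labels = ['0-1 years', '1-3 years', '3-5 years', '5-10 years', '10+ years']

# Default: intervals are (0, 1], (1, 3], ... so a tenure of exactly 0 gets no band
default_bands = pd.cut(tenure, bins=edges, labels=labels)

# right=False: intervals are [0, 1), [1, 3), ... so 0 is banded and 1.0 moves up a band
left_closed_bands = pd.cut(tenure, bins=edges, labels=labels, right=False)

print(default_bands.tolist())
print(left_closed_bands.tolist())
```

Passing `include_lowest=True`, as the performance and seniority cells do, is another way to keep the lowest edge in the first band.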


In [78]:
# Create performance rating categories
df['performance_category'] = pd.cut(
    df['performancerating'],
    bins=[0, 2, 3, 4, 5],
    labels=['Below Average', 'Average', 'Good', 'Excellent'],
    include_lowest=True
)
print("Created 'performance_category' feature")
print("\nPerformance category distribution:")
print(df['performance_category'].value_counts().sort_index())

Created 'performance_category' feature

Performance category distribution:
performance_category
Below Average    217
Average          215
Good             209
Excellent        223
Name: count, dtype: int64


In [79]:
# Calculate department size (headcount per department)
df['department_size'] = df.groupby('department')['employeeid'].transform('count')
print("Created 'department_size' feature")
print("\nDepartment sizes:")
print(df.groupby('department')['department_size'].first().sort_values(ascending=False))

Created 'department_size' feature

Department sizes:
department
Production                112
Logistics                 111
Quality Control           108
Human Resources           106
Sales                     104
It Support                 98
Research & Development     98
Marketing                  95
Finance                    87
Name: department_size, dtype: int64


In [80]:
# Calculate manager span of control (direct reports per manager)
manager_span = df[df['managerid'].notnull()].groupby('managerid').size()
df['manager_span_of_control'] = df['managerid'].map(manager_span)
df['manager_span_of_control'] = df['manager_span_of_control'].fillna(0).astype(int)

print("Created 'manager_span_of_control' feature")
print(f"\nManagers with largest teams:")
print(manager_span.sort_values(ascending=False).head(10))

Created 'manager_span_of_control' feature

Managers with largest teams:
managerid
            18
PNR-1134     5
PNR-1942     5
PNR-2421     5
PNR-1519     5
PNR-2765     4
PNR-1912     4
PNR-1410     4
PNR-2437     4
PNR-1339     4
dtype: int64
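
The first entry in the output above has a blank `managerid` with 18 reports. This suggests that whitespace-only IDs became empty strings after `str.strip()` and were then grouped as if they were a manager. A minimal sketch of converting blanks to missing values before counting, on illustrative data:

```python
import numpy as np
import pandas as pd

manager_ids = pd.Series(['PNR-1134', ' PNR-1134', '   ', np.nan, 'PNR-1942'])

# Strip whitespace, then convert empty strings to NaN so they count as missing
cleaned = manager_ids.str.strip().replace('', np.nan)

# Direct reports per real manager ID; blanks no longer form a group of their own
span = cleaned.dropna().value_counts()
print(span)
```

Applying the same `replace('', np.nan)` step in section 5.10 would also make `has_manager` and the missing-ManagerID count reflect these blank entries.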


In [81]:
# Create binary flags for analysis
df['is_active'] = (df['status'] == 'Active').astype(int)
df['is_terminated'] = (df['status'] == 'Terminated').astype(int)
df['has_manager'] = df['managerid'].notnull().astype(int)
df['has_performance_rating'] = df['performancerating'].notnull().astype(int)

print("Created binary flag features:")
print("- is_active, is_terminated")
print("- has_manager, has_performance_rating")
print(f"\nActive employees: {df['is_active'].sum()}")
print(f"Terminated employees: {df['is_terminated'].sum()}")

Created binary flag features:
- is_active, is_terminated
- has_manager, has_performance_rating

Active employees: 784
Terminated employees: 135


In [82]:
# Calculate average salary by department and job title
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean').round(2)
df['job_avg_salary'] = df.groupby('jobtitle')['salary'].transform('mean').round(2)

# Calculate salary position relative to peers
df['salary_vs_dept_avg'] = ((df['salary'] - df['dept_avg_salary']) / df['dept_avg_salary'] * 100).round(2)
df['salary_vs_job_avg'] = ((df['salary'] - df['job_avg_salary']) / df['job_avg_salary'] * 100).round(2)

print("Created salary comparison features:")
print("- dept_avg_salary, job_avg_salary")
print("- salary_vs_dept_avg, salary_vs_job_avg")
print(f"\nSalary vs department average statistics:")
print(df['salary_vs_dept_avg'].describe())

Created salary comparison features:
- dept_avg_salary, job_avg_salary
- salary_vs_dept_avg, salary_vs_job_avg

Salary vs department average statistics:
count    919.000000
mean      -0.000033
std       30.522451
min      -54.550000
25%      -23.595000
50%       -3.780000
75%       19.305000
max       91.940000
Name: salary_vs_dept_avg, dtype: float64


In [83]:
# Create seniority level based on tenure
df['seniority_level'] = pd.cut(
    df['tenure_years'],
    bins=[0, 2, 5, 10, 100],
    labels=['Junior', 'Mid-Level', 'Senior', 'Executive'],
    include_lowest=True
)
print("Created 'seniority_level' feature")
print("\nSeniority level distribution:")
print(df['seniority_level'].value_counts().sort_index())

Created 'seniority_level' feature

Seniority level distribution:
seniority_level
Junior       241
Mid-Level    369
Senior       309
Executive      0
Name: count, dtype: int64


In [84]:
# Create attrition risk indicators (for terminated employees)
if df['is_terminated'].sum() > 0:
    df['tenure_at_termination'] = df.apply(
        lambda row: (row['terminationdate'] - row['hiredate']).days / 365.25 
        if pd.notnull(row['terminationdate']) else None,
        axis=1
    )
    
    # Flag early attrition (left within first year)
    df['early_attrition'] = ((df['tenure_at_termination'] < 1) & (df['is_terminated'] == 1)).astype(int)
    
    print("Created attrition features:")
    print("- tenure_at_termination")
    print("- early_attrition (left within 1 year)")
    print(f"\nEarly attrition count: {df['early_attrition'].sum()}")
else:
    print("No terminated employees in dataset")

Created attrition features:
- tenure_at_termination
- early_attrition (left within 1 year)

Early attrition count: 11


In [85]:
# Create salary bands
df['salary_band'] = pd.cut(
    df['salary'],
    bins=[0, 60000, 80000, 100000, 150000, 300000],
    labels=['<60K', '60K-80K', '80K-100K', '100K-150K', '150K+']
)
print("\nSalary band distribution:")
print(df['salary_band'].value_counts().sort_index())


Salary band distribution:
salary_band
<60K          94
60K-80K      465
80K-100K     110
100K-150K    132
150K+        118
Name: count, dtype: int64


#### 6.2 Data Validation

In [86]:
# Validate status vs termination date
terminated_no_date = df[(df['status'] == 'Terminated') & (df['terminationdate'].isnull())]
active_with_date = df[(df['status'] == 'Active') & (df['terminationdate'].notnull())]
print(f"Terminated without date: {len(terminated_no_date)}")
print(f"Active with date: {len(active_with_date)}")

Terminated without date: 13
Active with date: 0


In [87]:
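# Fix inconsistencies: let the presence of a termination date decide the status
df.loc[(df['status'] == 'Terminated') & (df['terminationdate'].isnull()), 'status'] = 'Active'
df.loc[(df['status'] == 'Active') & (df['terminationdate'].notnull()), 'status'] = 'Terminated'
# Recompute the binary flags created earlier so they match the corrected status
df['is_active'] = (df['status'] == 'Active').astype(int)
df['is_terminated'] = (df['status'] == 'Terminated').astype(int)
print("\nFixed inconsistencies")
print("\nFinal Status:")
print(df['status'].value_counts())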


Fixed inconsistencies

Final Status:
status
Active        797
Terminated    122
Name: count, dtype: int64


#### 6.3 Sort Data

In [88]:
# Sort by EmployeeID
df = df.sort_values('employeeid').reset_index(drop=True)
print("Data sorted by EmployeeID")
df.head(10)

Data sorted by EmployeeID


| | employeeid | fullname | department | jobtitle | email | gender | hiredate | terminationdate | salary | managerid | ... | has_manager | has_performance_rating | dept_avg_salary | job_avg_salary | salary_vs_dept_avg | salary_vs_job_avg | seniority_level | tenure_at_termination | early_attrition | salary_band |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PNR-1001 | Aarav Taylor | It Support | Help Desk Specialist | employee.1001@pnrao.com | Unknown | 2021-05-08 | NaT | 79378.0 | | ... | 0 | 0 | 96490.2 | 70990.91 | -17.73 | 11.81 | Mid-Level | | 0 | 60K-80K |
| 1 | PNR-1002 | Emma Jones | Quality Control | Qc Inspector | employee.1002@pnrao.com | Female | 2019-11-13 | NaT | 69570.0 | PNR-1714 | ... | 1 | 1 | 109570.66 | 70667.88 | -36.51 | -1.55 | Senior | | 0 | 60K-80K |
| 2 | PNR-1004 | Rohan Jones | Human Resources | Hr Coordinator | employee.1004@pnrao.com | Female | 2020-02-11 | NaT | 61572.0 | PNR-1925 | ... | 1 | 1 | 68801.09 | 68340.05 | -10.51 | -9.9 | Senior | | 0 | 60K-80K |
| 3 | PNR-1007 | John Smith | Marketing | Marketing Coordinator | employee.1007@pnrao.com | Female | 2023-08-21 | NaT | 76107.0 | PNR-2219 | ... | 1 | 0 | 93037.96 | 72400.67 | -18.2 | 5.12 | Mid-Level | | 0 | 60K-80K |
| 4 | PNR-1008 | Chris Jones | Research & Development | Lab Technician | employee.1008@pnrao.com | Unknown | 2023-07-27 | NaT | 79117.0 | PNR-1441 | ... | 1 | 1 | 85614.79 | 69766.0 | -7.59 | 13.4 | Mid-Level | | 0 | 60K-80K |
| 5 | PNR-1013 | Chris Singh | Sales | Account Manager | employee.1013@pnrao.com | Male | 2020-11-24 | NaT | 176073.0 | | ... | 0 | 1 | 121525.81 | 151582.0 | 44.89 | 16.16 | Mid-Level | | 0 | 150K+ |
| 6 | PNR-1017 | Saanvi Sharma | Finance | Accountant | employee.1017@pnrao.com | Female | 2021-02-26 | NaT | 83620.0 | PNR-1533 | ... | 1 | 1 | 69535.21 | 69549.67 | 20.26 | 20.23 | Mid-Level | | 0 | 80K-100K |
| 7 | PNR-1018 | Olivia Jones | Finance | Controller | employee.1018@pnrao.com | Male | 2023-08-14 | NaT | 78927.0 | PNR-2098 | ... | 1 | 1 | 69535.21 | 68783.8 | 13.51 | 14.75 | Mid-Level | | 0 | 60K-80K |
| 8 | PNR-1020 | Aditya Kumar | Production | Line Supervisor | employee.1020@pnrao.com | Unknown | 2022-02-25 | NaT | 68866.0 | | ... | 0 | 1 | 102837.83 | 69824.22 | -33.03 | -1.37 | Mid-Level | | 0 | 60K-80K |
| 9 | PNR-1021 | Olivia Williams | It Support | It Technician | employee.1021@pnrao.com | Female | 2022-11-30 | NaT | 81467.0 | PNR-1947 | ... | 1 | 1 | 96490.2 | 71783.83 | -15.57 | 13.49 | Mid-Level | | 0 | 80K-100K |


#### 6.4 Final Data Quality Check

In [89]:
# Final summary
print("=" * 60)
print("FINAL DATA QUALITY SUMMARY")
print("=" * 60)
print(f"\nFinal shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMissing values:")
print(df.isnull().sum())
print(f"\nDuplicates: {df.duplicated().sum()}")
print(f"\nData types:")
print(df.dtypes)

FINAL DATA QUALITY SUMMARY

Final shape: (919, 36)

Columns: ['employeeid', 'fullname', 'department', 'jobtitle', 'email', 'gender', 'hiredate', 'terminationdate', 'salary', 'managerid', 'performancerating', 'status', 'hire_year', 'hire_month', 'hire_quarter', 'termination_year', 'termination_month', 'termination_quarter', 'tenure_days', 'tenure_years', 'tenure_band', 'performance_category', 'department_size', 'manager_span_of_control', 'is_active', 'is_terminated', 'has_manager', 'has_performance_rating', 'dept_avg_salary', 'job_avg_salary', 'salary_vs_dept_avg', 'salary_vs_job_avg', 'seniority_level', 'tenure_at_termination', 'early_attrition', 'salary_band']

Missing values:
employeeid                   0
fullname                     0
department                   0
jobtitle                     0
email                        0
gender                       0
hiredate                     0
terminationdate            797
salary                       0
managerid                  261
per


### Save Cleaned Data

In [91]:
# Save to processed folder
output_path = '../data/processed/Employee_Master_cleaned.csv'
df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")
print(f"Total records: {len(df)}")

Cleaned data saved to: ../data/processed/Employee_Master_cleaned.csv
Total records: 919
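
CSV does not store column types, so the datetime columns and the categorical bands created above come back as plain text when the file is reloaded. A minimal sketch of restoring the date columns with `parse_dates` follows, using an in-memory buffer and illustrative rows (the `EMP-` IDs are placeholders, not real records):

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'employeeid': ['EMP-1', 'EMP-2'],
    'hiredate': pd.to_datetime(['2021-05-08', '2019-03-04']),
    'terminationdate': pd.to_datetime([None, '2024-01-15']),
})

# to_csv() without a path returns the CSV text, which stands in for the saved file
csv_text = sample.to_csv(index=False)

# Without parse_dates the date columns come back as plain strings
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates restores datetime columns; empty cells become NaT
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['hiredate', 'terminationdate'])
print(plain.dtypes)
print(parsed.dtypes)
```

When reloading the saved file, the same `parse_dates` argument would apply to `hiredate` and `terminationdate`.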


---
## Summary

### Data Cleaning & Feature Engineering Completed Successfully!

**Steps Performed:**
1. ✅ Checked for redundant columns (none found)
2. ✅ Lowercased column names
3. ✅ Removed duplicates (including duplicate EmployeeIDs)
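4. ✅ Handled missing values: dropped rows missing EmployeeID or essential fields (FullName, Department, JobTitle, HireDate, Status); filled missing Gender with 'Unknown'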
5. ✅ Cleaned all individual columns:
   - EmployeeID, FullName, Department, JobTitle
   - Email (generated missing emails)
   - Gender, HireDate, TerminationDate
   - Salary (imputed missing values), ManagerID
   - PerformanceRating, Status
6. ✅ **Created transformations & feature engineering:**
   - **Date features**: hire/termination year, month, quarter
   - **Tenure calculations**: days, years, bands (0-1, 1-3, 3-5, 5-10, 10+ years)
   - **Salary features**: bands (<60K to 150K+), department/job averages, salary vs peers
   - **Performance categories**: Below Average, Average, Good, Excellent
   - **Department metrics**: department size (headcount)
   - **Manager metrics**: span of control (direct reports)
   - **Binary flags**: is_active, is_terminated, has_manager, has_performance_rating
   - **Seniority levels**: Junior, Mid-Level, Senior, Executive
   - **Attrition features**: tenure at termination, early attrition flag
   - **Data validation**: fixed status/termination date inconsistencies

**Feature Summary:**
- Original columns: 12
- Final columns: 36
- New features created: 24

**Next Steps:**
- Clean remaining datasets (Headcount, Leave_Attendance, Recruitment_Funnel)
- Merge with Compensation_History for comprehensive employee analysis
- Perform exploratory data analysis
- Create Power BI dashboard