# Data Cleaning

**Objective:** Clean and prepare the IBM HR data for analysis

**Tasks:**
1. Handle any data quality issues
2. Convert data types where needed
3. Remove unnecessary columns
4. Create clean dataset for EDA

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported!")

Libraries imported!


In [3]:
# Load data
print("Loading data from previous step...")

df = pd.read_csv('../data/external/enriched_data.csv')

print("Data loaded!")
print(f"Shape: {df.shape}")
print(f"{df.shape[0]} rows × {df.shape[1]} columns")

Loading data from previous step...
Data loaded!
Shape: (1470, 35)
1470 rows × 35 columns


In [3]:
# Check current data types
print("\nCurrent Data Types:")
print("="*80)
print(df.dtypes)


Current Data Types:
Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64

In [4]:
# Identify columns that should be categorical
print("\nConverting to proper data types...")

# Categorical columns
categorical_columns = [
    'Attrition', 'BusinessTravel', 'Department', 'EducationField',
    'Gender', 'JobRole', 'MaritalStatus', 'OverTime'
]

for col in categorical_columns:
    if col in df.columns:
        df[col] = df[col].astype('category')
        print(f"✓ Converted {col} to category")

print("\nData type conversion complete!")


Converting to proper data types...
✓ Converted Attrition to category
✓ Converted BusinessTravel to category
✓ Converted Department to category
✓ Converted EducationField to category
✓ Converted Gender to category
✓ Converted JobRole to category
✓ Converted MaritalStatus to category
✓ Converted OverTime to category

Data type conversion complete!
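
To see what the conversion to `category` buys, here is a minimal sketch on a small made-up frame. The values are illustrative and not taken from the HR file, and `toy` is a hypothetical name. It compares memory use before and after the conversion and lists the stored categories.

```python
import pandas as pd

# Toy stand-in for a low-cardinality text column (illustrative values only)
toy = pd.DataFrame({"OverTime": ["Yes", "No", "No", "Yes"] * 250})

# Memory used while the column still holds plain strings
before = toy["OverTime"].memory_usage(deep=True)

# Same conversion as in the cell above
toy["OverTime"] = toy["OverTime"].astype("category")
after = toy["OverTime"].memory_usage(deep=True)

print(f"as text: {before} bytes, as category: {after} bytes")

# The distinct labels are stored once; rows hold small integer codes
categories = list(toy["OverTime"].cat.categories)
print(categories)
```

The saving grows with the number of rows and shrinks as the number of distinct labels approaches the row count, so the conversion is most useful for columns with few distinct values.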


In [5]:
# Check for columns that don't vary (constant values)
print("\nChecking for constant columns (no variation)...")

constant_cols = []
for col in df.columns:
    if df[col].nunique() == 1:
        constant_cols.append(col)
        print(f"{col} has only 1 unique value: {df[col].unique()[0]}")

if constant_cols:
    print(f"\nFound {len(constant_cols)} constant column(s).")
    print("These columns don't help prediction and should be removed.")
else:
    print("No constant columns found.")


Checking for constant columns (no variation)...
EmployeeCount has only 1 unique value: 1
Over18 has only 1 unique value: Y
StandardHours has only 1 unique value: 80

Found 3 constant column(s).
These columns don't help prediction and should be removed.
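
The loop above can also be written without iterating column by column. The sketch below is one possible variant on a made-up frame (`toy` and the `AllMissing` column are hypothetical). It also shows a corner case: `nunique()` ignores missing values by default, so a column that is entirely missing reports 0 unique values and would not be caught by an `== 1` check.

```python
import numpy as np
import pandas as pd

# Illustrative data only; AllMissing is a hypothetical column
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "AllMissing": [np.nan, np.nan, np.nan],
})

# One call computes the distinct-value count for every column (NaN excluded)
n_unique = toy.nunique()

# Columns with exactly one distinct value are constant
constant = n_unique[n_unique == 1].index.tolist()

# Columns with zero distinct values contain only missing data
empty = n_unique[n_unique == 0].index.tolist()

print("Constant:", constant)
print("Entirely missing:", empty)
```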


In [6]:
# Remove constant columns
print("\nRemoving constant columns...")

if constant_cols:
    df = df.drop(columns=constant_cols)
    print(f"Removed {len(constant_cols)} columns: {constant_cols}")
else:
    print("No columns to remove.")

print(f"\nNew shape: {df.shape}")


Removing constant columns...
Removed 3 columns: ['EmployeeCount', 'Over18', 'StandardHours']

New shape: (1470, 32)


In [7]:
# Count numeric features (candidates for a later correlation check)
print("\nChecking numeric features...")

numeric_cols = df.select_dtypes(include=[np.number]).columns
print(f"Found {len(numeric_cols)} numeric features")


Checking numeric features...
Found 24 numeric features
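
The cell above only counts the numeric columns. If a correlation check is wanted at this stage, it could look roughly like the sketch below. It runs on synthetic data, the column names are placeholders, and the 0.8 cutoff is an assumed value rather than one used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: x_scaled is built to track x closely, noise is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

threshold = 0.8  # assumed cutoff, adjust to taste

# Absolute Pearson correlations between all numeric columns
corr = toy.select_dtypes(include=[np.number]).corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the cutoff (NaN entries never pass the comparison)
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

On the real data, `toy` would be replaced by `df`; whether to drop one column of each flagged pair is a modeling decision left to later steps.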


In [8]:
# Final data quality check
print("\nFINAL DATA QUALITY REPORT")
print("="*80)

print(f"Total Rows: {len(df)}")
print(f"Total Columns: {len(df.columns)}")
print(f"Missing Values: {df.isnull().sum().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")

print("\nData Types Summary:")
print(df.dtypes.value_counts())


FINAL DATA QUALITY REPORT
Total Rows: 1470
Total Columns: 32
Missing Values: 0
Duplicates: 0

Data Types Summary:
int64       24
category     2
category     1
category     1
category     1
category     1
category     1
category     1
Name: count, dtype: int64
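
The repeated `category` rows in the summary above are most likely because pandas treats categorical dtypes with different category sets as different dtypes, so `value_counts()` groups them separately. The sketch below uses a small made-up frame (`toy` is hypothetical) to show the effect and one way to get a single count per dtype name.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Attrition": pd.Categorical(["Yes", "No"]),
    "OverTime": pd.Categorical(["No", "Yes"]),
    "Gender": pd.Categorical(["Male", "Female"]),
    "Age": [41, 49],
})

# Same category set, so the dtypes compare equal
same = toy["Attrition"].dtype == toy["OverTime"].dtype

# Different category sets, so the dtypes differ
different = toy["Attrition"].dtype == toy["Gender"].dtype

# Converting each dtype to its name groups all categoricals together
by_name = toy.dtypes.astype(str).value_counts()
print(by_name)
```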


In [9]:
# Save cleaned data
print("\nSaving cleaned dataset...")

df.to_csv('../data/processed/cleaned_data.csv', index=False)

print("Cleaned data saved to: data/processed/cleaned_data.csv")
print(f"Final shape: {df.shape}")
print("\nData cleaning complete! Ready for EDA.")


Saving cleaned dataset...
Cleaned data saved to: data/processed/cleaned_data.csv
Final shape: (1470, 32)

Data cleaning complete! Ready for EDA.
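
One caveat about the save step: CSV stores values only, not dtypes, so the `category` conversions are not carried over when the file is read back. The sketch below demonstrates this with an in-memory buffer and a made-up frame (`toy` is hypothetical), and shows one way to restore the type on load by passing `dtype` to `pd.read_csv`.

```python
import io

import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Attrition": pd.Categorical(["Yes", "No", "No"]),
    "Age": [41, 49, 37],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read does not bring the categorical dtype back
buffer.seek(0)
plain = pd.read_csv(buffer)

# Declaring the dtype on load restores it
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Attrition": "category"})

print(plain.dtypes)
print(restored.dtypes)
```

In the next notebook, the same `dtype` mapping could be built from the `categorical_columns` list defined earlier.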


---

## Data Cleaning Summary

**Removed Columns:**
- EmployeeCount (constant)
- Over18 (constant)
- StandardHours (constant)

**Data Type Conversions:**
- Converted 8 text columns to categorical type
- Left the remaining 24 numeric columns unchanged (the dropped EmployeeCount and StandardHours were numeric)

**Data Quality:**
- No missing values
- No duplicates
- Clean dataset ready for analysis

**Output:** `data/processed/cleaned_data.csv`

---

**Next Step:** Proceed to `03_eda.ipynb` for Exploratory Data Analysis