# Module 2: Data Preprocessing and Cleaning

**Author:** Chinmay Nadgir  
**Date:** October 2025  
**Purpose:** Demonstrate professional data preprocessing techniques for production-ready analysis

---

## Table of Contents
1. [Introduction](#intro)
2. [Setup & Data Loading](#setup)
3. [Data Quality Assessment](#assessment)
4. [Handling Missing Values](#missing)
5. [Data Type Conversions](#datatypes)
6. [Outlier Detection and Treatment](#outliers)
7. [Feature Engineering](#features)
8. [Data Normalization and Scaling](#scaling)
9. [Encoding Categorical Variables](#encoding)
10. [Final Validation](#validation)
11. [Summary](#summary)

<a id='intro'></a>
## 1. Introduction

Data preprocessing is the critical foundation of any data science project. Raw data typically contains inconsistencies, missing values, and outliers that must be addressed before analysis.

**Learning Objectives:**
- Assess data quality systematically
- Handle missing data with appropriate strategies
- Detect and treat outliers
- Engineer new features from existing data
- Normalize and scale numeric variables
- Encode categorical variables properly

**Key Principle:** Document all preprocessing decisions for reproducibility and transparency.

<a id='setup'></a>
## 2. Setup & Data Loading

In [None]:
# Standard library imports
import warnings
from pathlib import Path
from typing import List, Tuple

# Third-party imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Display versions
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"matplotlib: {plt.matplotlib.__version__}")
print(f"seaborn: {sns.__version__}")

In [None]:
# Load the sample dataset from Module 1
data_dir = Path('data')

# Create sample dataset if Module 1 wasn't run
if not (data_dir / 'sample_data.csv').exists():
    data_dir.mkdir(exist_ok=True)
    np.random.seed(42)
    df = pd.DataFrame({
        'customer_id': range(1, 101),
        'name': [f'Customer_{i}' for i in range(1, 101)],
        'age': np.random.randint(18, 70, 100),
        'purchase_amount': np.random.uniform(10, 1000, 100).round(2),
        'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], 100),
        'date': pd.date_range('2024-01-01', periods=100, freq='D'),
        'loyalty_member': np.random.choice([True, False], 100)
    })
    # Introduce missing values
    df.loc[5:8, 'age'] = np.nan
    df.loc[15:17, 'purchase_amount'] = np.nan
    df.to_csv(data_dir / 'sample_data.csv', index=False)

# Load data
df = pd.read_csv(data_dir / 'sample_data.csv', parse_dates=['date'])
df_original = df.copy()  # Preserve original for comparison

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nFirst 5 rows:")
display(df.head())

<a id='assessment'></a>
## 3. Data Quality Assessment

Begin with a comprehensive assessment of data quality issues.

In [None]:
def assess_data_quality(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate a comprehensive data quality report.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to assess
    
    Returns:
    --------
    pd.DataFrame
        Quality report with metrics per column
    """
    quality_report = pd.DataFrame({
        'Data_Type': df.dtypes,
        'Non_Null_Count': df.count(),
        'Null_Count': df.isnull().sum(),
        'Null_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
        'Unique_Values': df.nunique(),
        'Duplicate_Rows': [df.duplicated().sum()] * len(df.columns)
    })
    
    return quality_report

quality_report = assess_data_quality(df)
print("\nData Quality Report:")
print("=" * 80)
display(quality_report)

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    df = df.drop_duplicates()
    print(f"New shape: {df.shape}")
else:
    print("No duplicate rows found.")

In [None]:
# Visualize missing data
fig, ax = plt.subplots(figsize=(10, 6))
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)

if len(missing_data) > 0:
    missing_data.plot(kind='bar', ax=ax, color='coral')
    ax.set_title('Missing Values by Column', fontsize=14, fontweight='bold')
    ax.set_xlabel('Column', fontsize=12)
    ax.set_ylabel('Count of Missing Values', fontsize=12)
    ax.grid(axis='y', alpha=0.3)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values to visualize.")

<a id='missing'></a>
## 4. Handling Missing Values

Missing data strategies depend on the amount and pattern of missingness.

**Strategies:**
- Drop columns with >50% missing
- Impute numeric: mean, median, or forward-fill
- Impute categorical: mode or 'Unknown'
- Document reasoning for each decision
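
To make the trade-offs between these strategies concrete, the sketch below applies each one to a small toy example. The values and the names `amounts` and `segments` are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing entry and one extreme value
amounts = pd.Series([10.0, 12.0, np.nan, 11.0, 500.0])
segments = pd.Series(["A", "B", None, "A", "A"])

# Mean imputation: the fill value is pulled toward the extreme value 500
mean_filled = amounts.fillna(amounts.mean())

# Median imputation: the fill value stays near the typical values
median_filled = amounts.fillna(amounts.median())

# Forward-fill: copies the previous observation (suited to ordered data)
ffilled = amounts.ffill()

# Categorical: most frequent value, or an explicit placeholder label
mode_filled = segments.fillna(segments.mode()[0])
unknown_filled = segments.fillna("Unknown")

print("mean fill:  ", mean_filled[2])
print("median fill:", median_filled[2])
print("ffill:      ", ffilled[2])
print("mode fill:  ", mode_filled[2])
```

Here the mean fill (133.25) is dragged upward by the single extreme value, while the median fill (11.5) stays close to the typical values. That difference is the reason the cells below impute numeric columns with the median.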

In [None]:
# Identify columns with >50% missing values
missing_threshold = 0.5
missing_pct = df.isnull().sum() / len(df)
cols_to_drop = missing_pct[missing_pct > missing_threshold].index.tolist()

if cols_to_drop:
    print(f"Dropping columns with >{missing_threshold*100}% missing: {cols_to_drop}")
    df = df.drop(columns=cols_to_drop)
else:
    print(f"No columns exceed {missing_threshold*100}% missing threshold.")

In [None]:
# Impute numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        missing_count = df[col].isnull().sum()
        # Use median for robustness to outliers
        median_value = df[col].median()
        # Assign the result back; fillna(inplace=True) on a column selection is deprecated
        df[col] = df[col].fillna(median_value)
        print(f"Imputed {missing_count} missing values in '{col}' with median: {median_value:.2f}")

In [None]:
# Impute categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        missing_count = df[col].isnull().sum()
        # Use mode (most frequent value)
        mode_value = df[col].mode()[0]
        # Assign the result back; fillna(inplace=True) on a column selection is deprecated
        df[col] = df[col].fillna(mode_value)
        print(f"Imputed {missing_count} missing values in '{col}' with mode: {mode_value}")

In [None]:
# Verify no missing values remain
remaining_missing = df.isnull().sum().sum()
print(f"\nTotal missing values after imputation: {remaining_missing}")

if remaining_missing == 0:
    print("All missing values have been handled successfully.")

<a id='datatypes'></a>
## 5. Data Type Conversions

Correct data types improve memory efficiency and enable proper operations.
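
As a rough illustration of the memory claim, the sketch below stores the same low-cardinality labels once as Python objects and once as `category`. The series `labels` is a made-up example, and exact byte counts vary with the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct labels repeated 250 times each
labels = pd.Series(["Electronics", "Clothing", "Food", "Books"] * 250)

as_object = labels.astype(object)        # one Python string reference per row
as_category = labels.astype("category")  # small integer codes plus 4 category labels

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from replacing each repeated string with a small integer code, so it shrinks as the share of unique values grows.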

In [None]:
print("Data types before conversion:")
print(df.dtypes)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

In [None]:
# Convert categorical columns with low cardinality to category dtype
for col in categorical_cols:
    if df[col].nunique() / len(df) < 0.5:  # Less than 50% unique values
        df[col] = df[col].astype('category')
        print(f"Converted '{col}' to category dtype")

In [None]:
# Convert boolean columns
bool_cols = ['loyalty_member']
for col in bool_cols:
    if col in df.columns:
        df[col] = df[col].astype('bool')
        print(f"Converted '{col}' to bool dtype")

In [None]:
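print("\nData types after conversion:")
print(df.dtypes)

# Measure the change against the data as loaded instead of assuming a reduction
memory_before_kb = df_original.memory_usage(deep=True).sum() / 1024
memory_after_kb = df.memory_usage(deep=True).sum() / 1024
print(f"\nMemory usage: {memory_after_kb:.2f} KB (as loaded: {memory_before_kb:.2f} KB)")
print(f"Change: {(memory_after_kb - memory_before_kb) / memory_before_kb:+.1%}")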

<a id='outliers'></a>
## 6. Outlier Detection and Treatment

Outliers can skew analysis. This notebook detects them with the IQR method; the Z-score is a common alternative when a column is roughly normally distributed.

**IQR Method:** Values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR` are outliers
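
The rule can be checked by hand on a small sample. The sketch below uses made-up values and, for comparison, also computes Z-scores with the conventional cutoff of 3. Both the sample and the cutoff are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample with one extreme value
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], dtype=float)

# IQR rule: fences sit 1.5 * IQR outside the first and third quartiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_flags = (values < lower) | (values > upper)

# Z-score rule: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std(ddof=0)
z_flags = z_scores.abs() > 3

print(f"IQR fences: [{lower}, {upper}]")
print("Flagged by IQR:    ", values[iqr_flags].tolist())
print("Flagged by Z-score:", values[z_flags].tolist())
```

In this sample the fences are [7.125, 22.125], so the IQR rule flags only the value 100. The Z-score of 100 is about 2.64, below the cutoff, because the extreme value inflates both the mean and the standard deviation it is measured against. This masking effect is one reason quartile-based rules are often preferred for small samples.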

In [None]:
def detect_outliers_iqr(df: pd.DataFrame, column: str) -> Tuple[pd.Series, float, float]:
    """
    Detect outliers using IQR method.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame
    column : str
        Column name
    
    Returns:
    --------
    Tuple[pd.Series, float, float]
        Boolean series of outliers, lower bound, upper bound
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
    return outliers, lower_bound, upper_bound

# Detect outliers in numeric columns
outlier_summary = {}

for col in numeric_cols:
    if col != 'customer_id':  # Skip ID columns
        outliers, lower, upper = detect_outliers_iqr(df, col)
        outlier_count = outliers.sum()
        outlier_summary[col] = {
            'count': outlier_count,
            'percentage': (outlier_count / len(df) * 100).round(2),
            'lower_bound': lower,
            'upper_bound': upper
        }

print("\nOutlier Detection Summary:")
print("=" * 80)
for col, info in outlier_summary.items():
    print(f"{col}:")
    print(f"  Count: {info['count']} ({info['percentage']}%)")
    print(f"  Bounds: [{info['lower_bound']:.2f}, {info['upper_bound']:.2f}]")

In [None]:
# Visualize outliers with boxplots
numeric_cols_to_plot = [col for col in numeric_cols if col != 'customer_id']

if numeric_cols_to_plot:
    fig, axes = plt.subplots(1, len(numeric_cols_to_plot), figsize=(14, 5))
    if len(numeric_cols_to_plot) == 1:
        axes = [axes]
    
    for idx, col in enumerate(numeric_cols_to_plot):
        axes[idx].boxplot(df[col].dropna(), vert=True)
        axes[idx].set_title(f'{col}', fontweight='bold')
        axes[idx].set_ylabel('Value')
        axes[idx].grid(axis='y', alpha=0.3)
    
    plt.suptitle('Outlier Detection - Boxplots', fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

In [None]:
# Cap outliers using IQR bounds
for col in numeric_cols:
    if col != 'customer_id':
        outliers, lower, upper = detect_outliers_iqr(df, col)
        if outliers.sum() > 0:
            original_outliers = outliers.sum()
            df[col] = df[col].clip(lower=lower, upper=upper)
            print(f"Capped {original_outliers} outliers in '{col}' to [{lower:.2f}, {upper:.2f}]")

<a id='features'></a>
## 7. Feature Engineering

Create new features from existing data to enhance analysis.
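
Binning with `pd.cut`, one of the techniques used in this section, has edge behavior worth seeing before relying on it. The sketch below uses hypothetical ages with the same bin edges as the `age_group` cell further down; by default each bin includes its right edge, and values outside the outermost edges become `NaN`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or beyond the bin edges
ages = pd.Series([18, 25, 26, 50, 51, 120])

age_group = pd.cut(
    ages,
    bins=[0, 25, 35, 50, 100],  # intervals (0, 25], (25, 35], (35, 50], (50, 100]
    labels=["18-25", "26-35", "36-50", "51+"],
)

print(pd.DataFrame({"age": ages, "age_group": age_group}))
print("Unbinned values:", int(age_group.isna().sum()))
```

An age of exactly 25 lands in `18-25` and 50 lands in `36-50`, while 120 falls outside every bin and is left missing. Checking for new missing values after binning catches such cases early.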

In [None]:
# Extract datetime features if date column exists
if 'date' in df.columns:
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day_of_week'] = df['date'].dt.dayofweek
    df['day_name'] = df['date'].dt.day_name()
    df['quarter'] = df['date'].dt.quarter
    print("Extracted datetime features: year, month, day_of_week, day_name, quarter")

In [None]:
# Create binned features for continuous variables
if 'age' in df.columns:
    df['age_group'] = pd.cut(
        df['age'],
        bins=[0, 25, 35, 50, 100],
        labels=['18-25', '26-35', '36-50', '51+']
    )
    print("Created 'age_group' feature")

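if 'purchase_amount' in df.columns:
    # pd.cut returns NaN for values outside the bin edges, so the top bin is
    # open-ended and include_lowest=True keeps an amount of exactly 0
    df['purchase_category'] = pd.cut(
        df['purchase_amount'],
        bins=[0, 100, 300, 500, np.inf],
        labels=['Low', 'Medium', 'High', 'Very High'],
        include_lowest=True
    )
    print("Created 'purchase_category' feature")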

In [None]:
print(f"\nDataset shape after feature engineering: {df.shape}")
print(f"New columns: {[col for col in df.columns if col not in df_original.columns]}")

<a id='scaling'></a>
## 8. Data Normalization and Scaling

Scaling brings features to similar ranges, which matters for distance-based algorithms such as k-nearest neighbors and k-means.

**Methods:**
- StandardScaler: Rescales to mean 0 and standard deviation 1 (does not require normal data, but is sensitive to outliers)
- MinMaxScaler: Scale to [0,1] range
- RobustScaler: Uses median and IQR (robust to outliers)
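
The arithmetic behind the three methods is short enough to write out with NumPy. The sketch below is a hand-rolled illustration on a made-up array `x`, not the scikit-learn implementation; it is intended to match the scalers' default settings (population standard deviation, 25th to 75th percentile range).

```python
import numpy as np

# Hypothetical feature with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the population std (ddof=0)
standardized = (x - x.mean()) / x.std()

# Min-max: map the observed minimum to 0 and the observed maximum to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Robust: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print("standardized:", standardized.round(3))
print("min-max:     ", min_max.round(3))
print("robust:      ", robust.round(3))
```

With the extreme value present, min-max scaling squeezes the four ordinary values into the bottom 3% of the range, whereas robust scaling keeps them spread between -1 and 0.5 and leaves the extreme value far away at 48.5.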

In [None]:
# Select numeric columns for scaling (exclude IDs and engineered categorical features)
cols_to_scale = ['age', 'purchase_amount']
cols_to_scale = [col for col in cols_to_scale if col in df.columns]

if cols_to_scale:
    # Create copies for comparison
    df_standard = df.copy()
    df_minmax = df.copy()
    df_robust = df.copy()
    
    # StandardScaler
    scaler_standard = StandardScaler()
    df_standard[[f'{col}_standard' for col in cols_to_scale]] = scaler_standard.fit_transform(df[cols_to_scale])
    
    # MinMaxScaler
    scaler_minmax = MinMaxScaler()
    df_minmax[[f'{col}_minmax' for col in cols_to_scale]] = scaler_minmax.fit_transform(df[cols_to_scale])
    
    # RobustScaler
    scaler_robust = RobustScaler()
    df_robust[[f'{col}_robust' for col in cols_to_scale]] = scaler_robust.fit_transform(df[cols_to_scale])
    
    print("Scaling applied to:", cols_to_scale)
    print("\nScaled columns created with suffixes: _standard, _minmax, _robust")

In [None]:
# Compare scaling methods
if cols_to_scale:
    comparison_col = cols_to_scale[0]
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Original
    axes[0, 0].hist(df[comparison_col], bins=20, color='steelblue', edgecolor='black')
    axes[0, 0].set_title(f'Original: {comparison_col}', fontweight='bold')
    axes[0, 0].set_xlabel('Value')
    axes[0, 0].set_ylabel('Frequency')
    
    # StandardScaler
    axes[0, 1].hist(df_standard[f'{comparison_col}_standard'], bins=20, color='coral', edgecolor='black')
    axes[0, 1].set_title('StandardScaler (Z-score)', fontweight='bold')
    axes[0, 1].set_xlabel('Value')
    axes[0, 1].set_ylabel('Frequency')
    
    # MinMaxScaler
    axes[1, 0].hist(df_minmax[f'{comparison_col}_minmax'], bins=20, color='lightgreen', edgecolor='black')
    axes[1, 0].set_title('MinMaxScaler [0,1]', fontweight='bold')
    axes[1, 0].set_xlabel('Value')
    axes[1, 0].set_ylabel('Frequency')
    
    # RobustScaler
    axes[1, 1].hist(df_robust[f'{comparison_col}_robust'], bins=20, color='plum', edgecolor='black')
    axes[1, 1].set_title('RobustScaler (Median/IQR)', fontweight='bold')
    axes[1, 1].set_xlabel('Value')
    axes[1, 1].set_ylabel('Frequency')
    
    plt.suptitle('Comparison of Scaling Methods', fontsize=14, fontweight='bold', y=1.00)
    plt.tight_layout()
    plt.show()
    
    # Add scaled columns to main dataframe (using StandardScaler as default)
    for col in cols_to_scale:
        df[f'{col}_scaled'] = df_standard[f'{col}_standard']
    print(f"\nAdded scaled versions of {cols_to_scale} to main dataframe.")

<a id='encoding'></a>
## 9. Encoding Categorical Variables

Convert categorical variables to numeric for analysis.

**Methods:**
- One-Hot Encoding: For nominal variables (no order)
- Label (Ordinal) Encoding: For ordinal variables, using an explicit mapping that preserves the order
- Frequency Encoding: Useful for high-cardinality features
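
The three methods can be compared side by side on a toy frame. The column names `city` and `tier` and their values are hypothetical, and the frequency encoding shown (share of rows per category) is one common variant rather than the only definition.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi", "Pune", "Mumbai"],
    "tier": ["Low", "High", "Medium", "Low", "High", "Medium"],
})

# One-hot: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["city"], prefix="city")

# Ordinal: integer codes that follow an explicitly stated order
tier_order = ["Low", "Medium", "High"]
toy["tier_code"] = pd.Categorical(
    toy["tier"], categories=tier_order, ordered=True
).codes

# Frequency: replace each category with its share of the rows
city_share = toy["city"].value_counts(normalize=True)
toy["city_freq"] = toy["city"].map(city_share)

print(one_hot.columns.tolist())
print(toy)
```

Stating the order explicitly matters: an encoder that sorts labels alphabetically would rank `High` below `Low`, which reverses the meaning of the codes.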

In [None]:
# One-hot encoding for nominal categorical variables
nominal_cols = ['category']
nominal_cols = [col for col in nominal_cols if col in df.columns]

if nominal_cols:
    df_encoded = pd.get_dummies(df, columns=nominal_cols, prefix=nominal_cols, drop_first=False)
    print(f"Applied one-hot encoding to: {nominal_cols}")
    print(f"Shape after encoding: {df_encoded.shape}")
    print(f"\nNew encoded columns:")
    new_cols = [col for col in df_encoded.columns if col not in df.columns]
    for col in new_cols:
        print(f"  - {col}")

In [None]:
# Label encoding for ordinal variables
ordinal_mapping = {
    'Low': 1,
    'Medium': 2,
    'High': 3,
    'Very High': 4
}

if 'purchase_category' in df.columns:
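    # Map from object dtype: mapping a categorical column returns another categorical, not numbers
    df['purchase_category_encoded'] = df['purchase_category'].astype(object).map(ordinal_mapping)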
    print("\nApplied label encoding to 'purchase_category'")
    print(f"Mapping: {ordinal_mapping}")

<a id='validation'></a>
## 10. Final Validation

Verify data is clean and ready for analysis.
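
Printed summaries are easy to skim past, so the checks can also be collected programmatically. The helper `validate_clean` below is a hypothetical sketch, shown on two made-up frames; it is not part of this notebook's pipeline.

```python
from typing import List

import numpy as np
import pandas as pd


def validate_clean(frame: pd.DataFrame, required_columns: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    missing_cols = [c for c in required_columns if c not in frame.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")

    null_total = int(frame.isnull().sum().sum())
    if null_total:
        problems.append(f"{null_total} missing values remain")

    dup_total = int(frame.duplicated().sum())
    if dup_total:
        problems.append(f"{dup_total} duplicate rows remain")

    return problems


# Hypothetical frames: one clean, one with a missing value and a repeated row
clean = pd.DataFrame({"age": [30, 41], "purchase_amount": [19.99, 250.0]})
dirty = pd.DataFrame({"age": [30, np.nan, 30], "purchase_amount": [19.99, 5.0, 19.99]})

clean_problems = validate_clean(clean, ["age", "purchase_amount"])
dirty_problems = validate_clean(dirty, ["age", "purchase_amount"])

print("clean:", clean_problems)
print("dirty:", dirty_problems)
```

Returning the problems as a list, rather than raising on the first failure, lets a single run report every issue at once.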

In [None]:
print("Final Data Validation:")
print("=" * 80)
print(f"Shape: {df.shape}")
print(f"Total missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")
print(f"\nData types:")
print(df.dtypes.value_counts())
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

In [None]:
# Summary statistics
print("\nSummary Statistics (Numeric Columns):")
display(df.describe())

In [None]:
# Save cleaned data
output_path = data_dir / 'cleaned_data.csv'
df.to_csv(output_path, index=False)
print(f"\nCleaned data saved to: {output_path}")

# Save encoded version if it exists
if 'df_encoded' in locals():
    encoded_path = data_dir / 'encoded_data.csv'
    df_encoded.to_csv(encoded_path, index=False)
    print(f"Encoded data saved to: {encoded_path}")

<a id='summary'></a>
## 11. Summary

### Preprocessing Steps Completed

1. **Data Quality Assessment:** Identified missing values, duplicates, and data type issues
2. **Missing Value Handling:** Imputed numeric columns with median, categorical with mode
3. **Data Type Optimization:** Converted to appropriate types for memory efficiency
4. **Outlier Treatment:** Detected using IQR method and capped extreme values
5. **Feature Engineering:** Created datetime features and binned continuous variables
6. **Normalization:** Compared StandardScaler, MinMaxScaler, and RobustScaler; kept the StandardScaler output
7. **Encoding:** One-hot encoded nominal variables, label encoded ordinal variables
8. **Validation:** Verified clean data with no missing values or duplicates

### Key Decisions Documented

- Used median imputation for numeric features (robust to outliers)
- Applied IQR method with 1.5 multiplier for outlier detection
- Converted low-cardinality string columns to category dtype
- Created age groups: 18-25, 26-35, 36-50, 51+
- Created purchase categories: Low, Medium, High, Very High
- Used StandardScaler for final scaled versions

### Next Steps

Data is now ready for:
- Statistical analysis (Module 3)
- Visualization (Module 4)
- Exploratory Data Analysis (Module 5)