# Section 3: Data Cleaning and Preprocessing

## Goal
Real-world datasets are rarely clean. This section focuses on **detecting, handling, and transforming messy data** using Pandas. You will learn to:
- Detect and handle missing or inconsistent data.
- Standardize column names and datatypes.
- Remove duplicates and outliers.
- Apply transformations and mappings.
- Prepare data for downstream analysis or machine learning.

## Subtopics
1. Identifying and handling missing values (`isna`, `fillna`, `dropna`).
2. Removing duplicates and inconsistent rows.
3. Data type conversions (`astype`, `to_datetime`).
4. Feature creation and conditional column transformations.
5. Outlier detection and trimming.
6. End-to-end cleaning workflow example.

In [None]:
import pandas as pd
import numpy as np
np.random.seed(42)

# Simulate a messy e-commerce dataset
data = {
    'CustomerID': [201, 202, 203, 204, 205, 205, None],
    'Gender': ['Male', 'female', 'FEMALE', 'Male', np.nan, 'male', 'unknown'],
    'Age': [25, 34, np.nan, 29, 999, 28, 30],
    'PurchaseAmount': [250.0, np.nan, 100.5, 400.0, 500.0, 400.0, np.nan],
    'JoinDate': ['2023-01-01', '2023/01/15', '01-02-2023', '2023-01-22', 'not_a_date', '2023-01-22', '2023-02-10']
}
df = pd.DataFrame(data)
df

## 1️⃣ Detecting Missing Data

Pandas provides vectorized utilities for detecting missing data. Use:
- `df.isna()` → boolean mask
- `df.isna().sum()` → missing count per column
- `df.dropna()` to remove, or `df.fillna()` to impute, missing entries

In [None]:
print('Missing values per column:')
print(df.isna().sum(), '\n')

# Drop rows with missing CustomerID
df = df.dropna(subset=['CustomerID']).reset_index(drop=True)
df.head()
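
Before choosing between dropping and filling, it can help to look at the share of missing values rather than the raw count. The sketch below uses a hypothetical toy frame (`toy` and its columns are illustrative, not part of the dataset above) and shows `isna().mean()` together with the `thresh` argument of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the e-commerce dataset used in this section
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'age': [25, np.nan, 31, np.nan],
    'city': ['Pune', None, 'Delhi', None],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = toy.isna().mean()

# Keep only rows that have at least 2 non-missing values
mostly_complete = toy.dropna(thresh=2)

print(missing_share)
print(mostly_complete)
```

Rows that carry nothing but an identifier are removed here, while rows with a single gap would survive and can be imputed later.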

## 2️⃣ Handling Duplicates

Duplicates often arise from repeated merges or from events being logged twice. Use:
- `df.duplicated()` to detect.
- `df.drop_duplicates()` to remove.

In [None]:
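# duplicated() compares every column, so only fully identical rows are flagged.
# The two CustomerID 205 rows differ in other columns and are not flagged here.
print('Duplicate rows:')
print(df[df.duplicated()])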

df = df.drop_duplicates()
df
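
When records share a key but differ elsewhere (as the two `CustomerID` 205 rows do in the sample data), the default check finds nothing. A minimal sketch of key-based detection follows, on a hypothetical toy frame; keeping the last record per key is an assumption made for illustration, and the right choice depends on the data source.

```python
import pandas as pd

# Hypothetical toy frame with a repeated key but different values
toy = pd.DataFrame({
    'CustomerID': [205, 205, 206],
    'PurchaseAmount': [500.0, 400.0, 120.0],
})

# No row is an exact copy of another, so the default check finds nothing
full_dupes = toy.duplicated().sum()

# keep=False flags every row that shares a CustomerID with another row
same_id = toy[toy.duplicated(subset=['CustomerID'], keep=False)]

# Assumption: the last record per CustomerID is the one to keep
deduped = toy.drop_duplicates(subset=['CustomerID'], keep='last')

print(full_dupes)
print(same_id)
print(deduped)
```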

## 3️⃣ Standardizing Columns and Data Types

Column names often contain inconsistent casing or stray whitespace. Use `.str.strip()` and `.str.lower()` to standardize them, and `.str.replace()` for embedded spaces.
We also convert date-like and numeric fields to proper dtypes.

In [None]:
# Normalize column names
df.columns = df.columns.str.strip().str.lower()

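# Convert JoinDate to datetime; unparseable values become NaT.
# pandas >= 2.0 infers a single format from the first value, so this
# mixed-format column needs format='mixed' (omit it on older versions).
# Ambiguous strings such as '01-02-2023' are read month-first by default.
df['joindate'] = pd.to_datetime(df['joindate'], format='mixed', errors='coerce')

# Convert Age to numeric, then treat unrealistic values (> 120) as missing
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df.loc[df['age'] > 120, 'age'] = np.nan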

df.info()
df
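
Because `errors='coerce'` hides parsing failures, it is worth checking which raw values were lost. The sketch below assumes a hypothetical series `raw` in a single known format and lists the entries that were present but could not be parsed.

```python
import pandas as pd

# Hypothetical raw date strings in one known format, plus bad entries
raw = pd.Series(['2023-01-01', '2023-01-22', 'not_a_date', None])

# An explicit format avoids relying on format inference
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Values that existed in the raw data but failed to parse
failed = raw[parsed.isna() & raw.notna()]

print(parsed)
print(failed)
```

Reviewing `failed` before moving on makes it clear whether the `NaT` values come from genuinely missing data or from a format the parser did not expect.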

## 4️⃣ Handling Categorical and Text Data

Standardize text categories using `.str.lower()`, `.replace()`, and `.map()`.
For example, unify `Gender` values such as `'male'`, `'Male'`, and `'FEMALE'` into `'Male'` / `'Female'`.

In [None]:
df['gender'] = df['gender'].str.lower().replace({
    'female': 'Female',
    'male': 'Male',
    'unknown': np.nan
})
df
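
`.replace()` leaves unlisted values untouched, whereas `.map()` turns anything missing from the dictionary into `NaN`. A small sketch of the `.map()` variant follows; the series `gender` and the `mapping` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical labels with stray whitespace, abbreviations, and junk
gender = pd.Series(['Male', ' female', 'FEMALE', 'M', 'n/a', np.nan])

# Illustrative mapping; extend it with whatever labels the data contains
mapping = {'male': 'Male', 'm': 'Male', 'female': 'Female', 'f': 'Female'}

# Labels absent from the mapping ('n/a') become NaN, as does the original NaN
cleaned = gender.str.strip().str.lower().map(mapping)

print(cleaned)
```

This strictness is useful when unknown labels should surface as missing values instead of passing through silently.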

## 5️⃣ Filling Missing Values and Creating Derived Columns

- Numeric columns → fill with mean, median, or domain logic.
- Categorical columns → fill with mode or forward-fill.
- Derived columns → computed from others (e.g., spending category).

In [None]:
df['age'] = df['age'].fillna(df['age'].median())
df['purchaseamount'] = df['purchaseamount'].fillna(df['purchaseamount'].median())

# Create spending category
df['spend_level'] = pd.cut(df['purchaseamount'], bins=[0, 200, 400, 600], labels=['Low', 'Medium', 'High'])
df
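
`pd.cut` covers numeric binning; for rules that combine conditions, `np.where` and `np.select` are common choices. The sketch below is illustrative only: the frame `toy`, the thresholds, and the new column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    'age': [22, 35, 67],
    'purchaseamount': [90.0, 450.0, 300.0],
})

# Two-way rule: np.where(condition, value_if_true, value_if_false)
toy['is_big_spender'] = np.where(toy['purchaseamount'] > 400, 'yes', 'no')

# Multi-way rule: the first matching condition wins, otherwise the default
conditions = [toy['age'] < 30, toy['age'] < 60]
choices = ['young', 'middle']
toy['age_group'] = np.select(conditions, choices, default='senior')

print(toy)
```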

## 6️⃣ Outlier Detection (Simple Example)

Detect outliers using IQR or z-score thresholds. Here we apply the IQR method to `purchaseamount`: values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR` are flagged.

In [None]:
Q1 = df['purchaseamount'].quantile(0.25)
Q3 = df['purchaseamount'].quantile(0.75)
IQR = Q3 - Q1

outlier_mask = (df['purchaseamount'] < (Q1 - 1.5 * IQR)) | (df['purchaseamount'] > (Q3 + 1.5 * IQR))
df_outliers = df[outlier_mask]

print('Detected outliers:')
df_outliers
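
The z-score alternative mentioned above can be sketched as follows. The data and the 2.5 cutoff are assumptions chosen for illustration: in very small samples a single value cannot reach a z-score of 3, so the usual threshold may need adjusting.

```python
import pandas as pd

# Hypothetical amounts: nine typical values and one extreme value
amounts = pd.Series([100.0, 102.0, 98.0, 101.0, 99.0,
                     103.0, 97.0, 100.0, 101.0, 500.0])

# z-score: distance from the mean in units of the sample standard deviation
z = (amounts - amounts.mean()) / amounts.std()

# Assumed cutoff of 2.5
flagged = amounts[z.abs() > 2.5]

print(flagged)
```

Unlike the IQR rule, the z-score uses the mean and standard deviation, which are themselves pulled by extreme values, so the two methods can disagree on skewed data.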

### Clean final dataset after trimming outliers
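The cell below drops the rows flagged by the IQR rule. Trimming a fixed share of each tail (for example, the top and bottom 1%) is a common alternative that can improve model stability.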

In [None]:
df_clean = df[~outlier_mask].reset_index(drop=True)
df_clean
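
Percentile-based trimming can be sketched as follows, next to clipping, which caps extreme values instead of dropping rows. The series `amounts` and the 1%/99% bounds are illustrative assumptions.

```python
import pandas as pd

# Hypothetical amounts: 1.0, 2.0, ..., 100.0
amounts = pd.Series(range(1, 101), dtype='float')

# 1st and 99th percentile bounds
low, high = amounts.quantile([0.01, 0.99])

# Trimming: drop rows outside the band (bounds inclusive)
trimmed = amounts[amounts.between(low, high)]

# Clipping: keep every row but cap the values at the bounds
clipped = amounts.clip(lower=low, upper=high)

print(len(trimmed), len(clipped))
```

Trimming changes the row count, which matters if the rows must stay aligned with other tables; clipping preserves the rows at the cost of distorting the tails.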

## ✅ Real-World Problem 1: Cleaning Survey Data

**Scenario:** You receive a CSV with inconsistent gender labels, missing ages, and invalid entries like 0 or 999. 
You must clean and standardize it for demographic analysis.

**Steps to demonstrate:**
1. Identify invalid/missing entries.
2. Standardize case and category names.
3. Mark unrealistic ages (such as 0 or 999) as missing.
4. Impute missing ages with the median of the remaining valid values.

In [None]:
survey = pd.DataFrame({
    'RespondentID': [1,2,3,4,5,6],
    'Gender': ['male', 'Female', 'FEMALE', 'Unknown', 'male', 'None'],
    'Age': [22, 0, 30, 27, 999, np.nan]
})

# Steps 1 & 2: normalize case, then map labels; placeholder strings become NaN
survey['Gender'] = survey['Gender'].str.lower().replace({
    'female': 'Female', 'male': 'Male', 'unknown': np.nan, 'none': np.nan
})

# Steps 3 & 4: blank out impossible ages first so they do not skew the median
survey.loc[(survey['Age'] <= 0) | (survey['Age'] > 120), 'Age'] = np.nan
survey['Age'] = survey['Age'].fillna(survey['Age'].median())

survey

## ✅ Real-World Problem 2: Transaction Data Preprocessing for ML

**Goal:** Prepare transaction data for machine learning model input.

1. Impute missing numerical features.
2. Encode categories.
3. Scale numerical columns (simple min-max normalization).

In [None]:
transactions = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105],
    'Gender': ['Male', 'Female', np.nan, 'Male', 'Female'],
    'SpendingScore': [45, np.nan, 70, 30, 100]
})

# Fill missing categorical and numeric values
transactions['Gender'] = transactions['Gender'].fillna(transactions['Gender'].mode()[0])
transactions['SpendingScore'] = transactions['SpendingScore'].fillna(transactions['SpendingScore'].median())

# Encode categories
transactions['Gender'] = transactions['Gender'].map({'Male': 0, 'Female': 1})

# Normalize SpendingScore to [0, 1] with min-max scaling (assumes max > min)
score = transactions['SpendingScore']
transactions['SpendingScore_scaled'] = (score - score.min()) / (score.max() - score.min())

transactions
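
Mapping categories to integers works for a two-level column like `Gender`. For columns with more than two unordered levels, one-hot encoding avoids implying an order. The sketch below uses a hypothetical `Channel` column; `dtype=int` is passed so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical categorical column with three unordered levels
toy = pd.DataFrame({'Channel': ['web', 'store', 'app', 'web']})

# One indicator column per level, named <column>_<level>
encoded = pd.get_dummies(toy, columns=['Channel'], dtype=int)

print(encoded)
```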

## ⚙️ Under the Hood: Pandas Cleaning Internals
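- `fillna()` and `dropna()` are vectorized: they operate on whole columns at once rather than row by row, and by default they return a new object instead of modifying the original.
- String operations (`.str`) on the default `object` dtype mostly apply Python string methods element by element inside a Cython loop, so they are convenient but slower than numeric operations.
- `to_datetime(errors='coerce')` converts unparseable dates to `NaT` instead of raising, so pipelines keep running; count the resulting `NaT` values to see what was lost.
- `astype()` performs an explicit cast to the requested dtype and raises if a value cannot be converted; for columns with mixed or dirty values, use `pd.to_numeric(errors='coerce')` or `pd.to_datetime(errors='coerce')` instead.
- `cut` and `qcut` assign bins with a binary search over the sorted bin edges (`searchsorted`), which keeps binning efficient.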
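
Two of these points can be checked directly. The sketch below shows that `fillna` leaves its input untouched by default, and that bin codes from `pd.cut` match a manual `np.searchsorted` lookup for right-closed bins; the values and bin edges are illustrative.

```python
import numpy as np
import pandas as pd

# fillna returns a new Series; the original still contains NaN
s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0.0)

# Right-closed bins (0, 200], (200, 400], (400, 600]
bins = np.array([0, 200, 400, 600])
values = np.array([100.5, 250.0, 400.0, 500.0])

# labels=False returns integer bin codes instead of interval labels
codes_cut = pd.cut(values, bins=bins, labels=False)

# Equivalent lookup: index of the first edge >= value, minus one
codes_search = np.searchsorted(bins, values, side='left') - 1

print(s.isna().sum(), filled.isna().sum())
print(codes_cut, codes_search)
```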

## 💡 Best Practices / Common Pitfalls
- Always inspect missing data with `df.isna().sum()` before cleaning.
- Prefer reassignment (`df = df.fillna(...)`) over `inplace=True`: in-place calls return `None`, so they cannot be chained, and they often do not avoid a copy anyway.
- When imputing, use domain knowledge (e.g., median per group).
- Keep a log of data-cleaning rules for reproducibility.
- Normalize text case early to avoid duplicates in grouping or joining.
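
The "median per group" idea can be sketched with `groupby(...).transform`, which returns a result aligned to the original rows. The frame `toy` and its `segment` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: ages differ strongly between segments
toy = pd.DataFrame({
    'segment': ['student', 'student', 'student',
                'retired', 'retired', 'retired'],
    'age': [20, np.nan, 24, 68, 72, np.nan],
})

# Median age of each row's own segment, aligned to the original index
group_median = toy.groupby('segment')['age'].transform('median')

# Fill each gap with the median of its segment, not the global median
toy['age_filled'] = toy['age'].fillna(group_median)

print(toy)
```

A global median would assign the same age to the missing student and the missing retiree; the group-wise fill keeps them in line with their own segment.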

## 🧩 Challenge Exercise

You're given a dataset of customers with columns: `ID`, `Gender`, `Age`, `Income`, and `SignupDate`. The dataset contains missing, invalid, and inconsistent entries.

**Tasks:**
1. Clean missing and invalid `Age` and `Income` values.
2. Normalize gender values (`male`, `M`, `female`, `F`, etc.).
3. Convert `SignupDate` into proper datetime format.
4. Create a new column `IncomeLevel` using quantile binning.
5. Drop duplicates and output final clean dataset.

_Hint:_ Use `pd.qcut()` for quantile-based income bins and `replace()` for category mapping.
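
As a starting point for the hint, here is a minimal `pd.qcut` sketch on made-up numbers. The series `income` and the three labels are assumptions, and the rest of the exercise is left to you.

```python
import pandas as pd

# Hypothetical incomes; the challenge dataset will differ
income = pd.Series([18_000, 25_000, 32_000, 41_000, 58_000, 120_000])

# q=3 splits at the 1/3 and 2/3 quantiles, giving roughly equal-sized groups
levels = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'])

print(levels.value_counts())
```

Unlike `pd.cut`, the bin edges come from the data, so the extreme value of 120,000 does not leave the top bin nearly empty.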

# --- End of Section 3: Continue to Section 4 ---