# Module 01 — Mathematical & Programming Foundations
## 1-03: Pandas for Tabular Data

**Objective:** Master Pandas for exploratory data analysis — the essential first step before any ML modeling.

**Prerequisites:** 1-01 (Python, NumPy & Tensor Speed)

---
## Part 0 — Setup & Prerequisites

This notebook covers the complete Pandas EDA workflow: summary statistics, missing-value handling,
categorical encoding, and merge/join operations. We build every imputation and encoding strategy
from scratch before comparing against library implementations.

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────
import sys
import warnings
warnings.filterwarnings("ignore")

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")

In [None]:
# ── Reproducibility ─────────────────────────────────────────────────────────
SEED = 1103
random.seed(SEED)
np.random.seed(SEED)

### Data Loading & EDA

We load two classic sklearn datasets:
- **Iris** — small classification dataset (150 samples, 4 features, 3 classes).
- **California Housing** — larger regression dataset (~20k samples, 8 features).

Both are used throughout this notebook to demonstrate Pandas operations on real tabular data.

In [None]:
# ── Load Iris ────────────────────────────────────────────────────────────────
iris_bunch = load_iris()
iris_df = pd.DataFrame(iris_bunch.data, columns=iris_bunch.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris_bunch.target, iris_bunch.target_names)

print("=== Iris Dataset ===")
print(f"Shape: {iris_df.shape}")
print(f"\nDtypes:\n{iris_df.dtypes}")
print(f"\nFirst 5 rows:")
iris_df.head()

In [None]:
# ── Load California Housing ─────────────────────────────────────────────────
housing_bunch = fetch_california_housing()
housing_df = pd.DataFrame(housing_bunch.data, columns=housing_bunch.feature_names)
housing_df["MedHouseVal"] = housing_bunch.target

print("=== California Housing Dataset ===")
print(f"Shape: {housing_df.shape}")
print(f"\nDtypes:\n{housing_df.dtypes}")
print(f"\nFirst 5 rows:")
housing_df.head()

In [None]:
# ── Basic Statistics ─────────────────────────────────────────────────────────
print("=== Iris — Summary Statistics ===")
iris_df.describe()

In [None]:
print("=== California Housing — Summary Statistics ===")
housing_df.describe()

---
## Part 1 — Pandas Operations from Scratch

We explore the core Pandas operations that form the backbone of any EDA workflow:
DataFrames/Series basics, summary statistics, missing-value handling, categorical encoding,
and merge/join operations.

### 1.1 DataFrames and Series Basics

A **Series** is a one-dimensional labeled array. A **DataFrame** is a two-dimensional
labeled table — essentially a dictionary of Series sharing the same index. DataFrames can be
created from dictionaries, NumPy arrays, or lists of dictionaries.

In [None]:
# ── Creating DataFrames from different sources ────────────────────────────────

# From a dictionary
dict_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "score": [88.5, 92.3, 79.1, 95.0],
})
print("DataFrame from dict:")
print(dict_df)

# From a NumPy array
array_data = np.random.randn(4, 3)
array_df = pd.DataFrame(array_data, columns=["feature_a", "feature_b", "feature_c"])
print("\nDataFrame from NumPy array:")
print(array_df)

# From a list of dicts (CSV-like row-oriented data)
rows = [
    {"city": "NYC", "population": 8336817, "area_sq_mi": 302.6},
    {"city": "LA", "population": 3979576, "area_sq_mi": 468.7},
    {"city": "Chicago", "population": 2693976, "area_sq_mi": 227.3},
]
cities_df = pd.DataFrame(rows)
print("\nDataFrame from list of dicts:")
print(cities_df)

#### Series Operations

A Series supports element-wise arithmetic, boolean masking, and integer/label-based indexing.
All operations are vectorized — no explicit loops needed.

In [None]:
# ── Series: indexing, slicing, operations ─────────────────────────────────────
ages = dict_df["age"]
print(f"Type: {type(ages)}")
print(f"First element (ages.iloc[0]):  {ages.iloc[0]}")
print(f"Slice (ages[1:3]):\n{ages[1:3]}")
print(f"\nVectorized ops — ages + 10:\n{ages + 10}")
print(f"\nBoolean mask — ages > 28:\n{ages[ages > 28]}")

#### DataFrame Selection: `.loc` vs `.iloc`

Pandas provides two main indexers:
- **`.loc`** — label-based: selects by row/column *names*. Inclusive on both ends.
- **`.iloc`** — integer-position-based: selects by *position*. Exclusive on the end (like Python slicing).
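
On a default `RangeIndex` the labels and positions coincide, which can hide the difference between the two indexers. A minimal sketch with a hypothetical string-labeled Series (the labels `a` to `d` are illustrative and not part of the notebook's datasets) makes the inclusive/exclusive behavior visible:

```python
import pandas as pd

# Hypothetical toy Series with string labels, so labels and positions differ.
scores = pd.Series([88.5, 92.3, 79.1, 95.0], index=["a", "b", "c", "d"])

# .loc slices by label and includes the end label "c".
by_label = scores.loc["a":"c"]

# .iloc slices by position and excludes the end position 2.
by_position = scores.iloc[0:2]

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```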

In [None]:
# ── DataFrame selection: columns, rows, .loc vs .iloc ────────────────────────

# Select a single column (returns Series)
print("Single column (iris_df['species']):\n", iris_df["species"].head(), "\n")

# Select multiple columns (returns DataFrame)
print("Multiple columns:")
print(iris_df[["sepal length (cm)", "species"]].head(), "\n")

# Filter rows with boolean condition
setosa = iris_df[iris_df["species"] == "setosa"]
print(f"Filtered rows (setosa only): {len(setosa)} rows\n")

# .loc — label-based indexing
print(".loc[0:2, 'sepal length (cm)':'petal length (cm)']:")
print(iris_df.loc[0:2, "sepal length (cm)":"petal length (cm)"], "\n")

# .iloc — integer-position-based indexing
print(".iloc[0:3, 0:2]:")
print(iris_df.iloc[0:3, 0:2])

### 1.2 EDA Workflow

A solid EDA workflow answers three questions:
1. **What does each feature look like?** (`describe`, `value_counts`)
2. **How do features relate to each other?** (`corr`, `groupby`)
3. **Are there issues to fix?** (missing values, outliers, class imbalance)

In [None]:
# ── .describe() — summary statistics ─────────────────────────────────────────
print("=== Iris — Numeric Summary ===")
print(iris_df.describe().round(2))

print("\n=== Iris — Categorical Summary ===")
print(iris_df.describe(include=["category"]))

In [None]:
# ── .value_counts() — categorical distributions ──────────────────────────────
print("Species distribution:")
print(iris_df["species"].value_counts())
print(f"\nBalanced: {iris_df['species'].value_counts().nunique() == 1}")

#### GroupBy Aggregations

`groupby()` splits a DataFrame by unique values of a column, applies an aggregation
function to each group, and combines the results. This is the Pandas equivalent of
SQL's `GROUP BY` clause.
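
The split-apply-combine steps can be sketched by hand and checked against `groupby()`. The toy frame and its column names below are assumptions made for illustration only:

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 6.0, 5.0],
})

# Split: one subset per unique key. Apply: mean of each. Combine: a Series.
manual = pd.Series({
    key: toy.loc[toy["group"] == key, "value"].mean()
    for key in sorted(toy["group"].unique())
})

# The same computation expressed with groupby.
grouped = toy.groupby("group")["value"].mean()

print(manual.to_dict())   # {'a': 3.0, 'b': 4.0}
print(grouped.to_dict())  # {'a': 3.0, 'b': 4.0}
```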

In [None]:
# ── .groupby() — per-group aggregations on Iris ──────────────────────────────
iris_grouped = iris_df.groupby("species").agg(["mean", "std", "count"])
print("=== Per-Species Statistics ===")
iris_grouped

#### Correlation Analysis

The Pearson correlation coefficient measures the strength and direction of the linear
relationship between two variables. Values close to $+1$ or $-1$ indicate strong linear
correlation; values near $0$ indicate little or no linear relationship, although a
nonlinear dependence may still exist.
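
For reference, the coefficient is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. The sketch below computes it directly from that definition on hypothetical toy data; `pearson_r` is an illustrative helper, not a Pandas function:

```python
import numpy as np
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation computed directly from the definition."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2.0 * x + 1.0    # exact increasing linear relationship
y_curved = (x - 3.0) ** 2   # fully determined by x, but not linearly

print(pearson_r(x, y_linear))  # approximately 1
print(pearson_r(x, y_curved))  # 0 for this symmetric example
```

The second case shows why a coefficient near zero only rules out a *linear* relationship.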

In [None]:
# ── .corr() — correlation matrix for Iris ────────────────────────────────────
iris_numeric = iris_df.select_dtypes(include=[np.number])
corr_matrix = iris_numeric.corr()

fig, ax = plt.subplots(figsize=(8, 5))
cax = ax.matshow(corr_matrix, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(cax)
ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_yticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns, rotation=45, ha="left")
ax.set_yticklabels(corr_matrix.columns)
ax.set_title("Iris Feature Correlation Matrix", pad=20)
plt.tight_layout()
plt.show()

#### California Housing — Extended EDA

California Housing is a larger dataset with more features, making it better suited
for demonstrating skewness, kurtosis, and multi-feature correlation analysis.
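
As a quick intuition check before reading the housing statistics, the sketch below compares a symmetric sample with a right-skewed one. The synthetic samples and the local generator seed are assumptions for illustration; they are unrelated to the notebook's `SEED`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # local generator; does not touch the global seed
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.lognormal(size=10_000))

# skew() is near 0 for symmetric data and clearly positive for a long right tail.
# kurtosis() reports excess kurtosis, so a normal sample is also near 0.
print(symmetric.skew(), symmetric.kurtosis())
print(right_skewed.skew(), right_skewed.kurtosis())
```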

In [None]:
# ── California Housing — feature distributions ───────────────────────────────
housing_numeric = housing_df.select_dtypes(include=[np.number])
housing_stats = housing_numeric.describe().T
housing_stats["skewness"] = housing_numeric.skew()
housing_stats["kurtosis"] = housing_numeric.kurtosis()
print("=== California Housing — Extended Statistics ===")
housing_stats.round(3)

In [None]:
# ── California Housing — correlation heatmap ─────────────────────────────────
housing_corr = housing_numeric.corr()

fig, ax = plt.subplots(figsize=(8, 6))
cax = ax.matshow(housing_corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(cax)
ax.set_xticks(range(len(housing_corr.columns)))
ax.set_yticks(range(len(housing_corr.columns)))
ax.set_xticklabels(housing_corr.columns, rotation=45, ha="left", fontsize=8)
ax.set_yticklabels(housing_corr.columns, fontsize=8)
ax.set_title("California Housing Correlation Matrix", pad=20)
plt.tight_layout()
plt.show()

### 1.3 Handling Missing Values

Real-world datasets almost always have missing entries. The strategy for imputation
depends on the data distribution and the downstream model:

- **Mean imputation:** Best for roughly symmetric continuous features.
- **Median imputation:** Robust to outliers and skewed distributions.
- **Mode imputation:** Standard choice for categorical features.
- **Forward-fill:** Appropriate for time-ordered data where the last known value carries forward.

We first inject synthetic missing values into copies of our datasets, then implement
each strategy from scratch.

In [None]:
# ── Inject synthetic missing values ──────────────────────────────────────────
MISSING_FRACTION = 0.10  # 10% missing values


def inject_missing_values(
    df: pd.DataFrame,
    fraction: float,
    columns: list[str] | None = None,
) -> pd.DataFrame:
    """Inject NaN values randomly into specified columns of a DataFrame.

    Args:
        df: Input DataFrame (not modified in place).
        fraction: Fraction of values to set to NaN per column (0.0 to 1.0).
        columns: List of column names to inject NaNs into. If None, uses
            all numeric columns.

    Returns:
        A new DataFrame with NaN values injected.
    """
    df_missing = df.copy()
    if columns is None:
        columns = df_missing.select_dtypes(include=[np.number]).columns.tolist()
    num_rows = len(df_missing)
    num_to_remove = int(num_rows * fraction)
    for col in columns:
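        missing_positions = np.random.choice(num_rows, size=num_to_remove, replace=False)
        # Index by position so this also works when the index is not 0..n-1
        df_missing.iloc[missing_positions, df_missing.columns.get_loc(col)] = np.nan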
    return df_missing


iris_missing = inject_missing_values(iris_df, MISSING_FRACTION,
                                     columns=["sepal length (cm)", "petal width (cm)"])
housing_missing = inject_missing_values(housing_df, MISSING_FRACTION,
                                        columns=["MedInc", "AveRooms", "AveOccup"])

print("=== Iris — Missing Values ===")
print(iris_missing.isnull().sum())
print(f"\n=== Housing — Missing Values ===")
print(housing_missing.isnull().sum())

#### Strategy 1: Mean Imputation

Replace each missing value with the arithmetic mean of the non-missing values in that
column. This preserves the overall mean but reduces variance (all imputed values
cluster at the center).
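
The variance reduction is easy to verify on a small column. The values below are a hypothetical toy example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
col = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])
filled = col.fillna(col.mean())

print(col.mean(), filled.mean())  # both 5.0: the mean is unchanged
print(col.var(), filled.var())    # the variance shrinks after imputation
```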

In [None]:
# ── Strategy 1: Mean Imputation (from scratch) ───────────────────────────────


def impute_mean(series: pd.Series) -> pd.Series:
    """Impute missing values with the column mean.

    Args:
        series: A pandas Series with potential NaN values.

    Returns:
        A new Series with NaN values replaced by the column mean.
    """
    col_mean = series.dropna().mean()
    return series.fillna(col_mean)


# Apply to iris
iris_mean_imputed = iris_missing.copy()
iris_mean_imputed["sepal length (cm)"] = impute_mean(iris_missing["sepal length (cm)"])
iris_mean_imputed["petal width (cm)"] = impute_mean(iris_missing["petal width (cm)"])

print("Mean-imputed Iris — remaining NaNs:")
print(iris_mean_imputed.isnull().sum())

#### Strategy 2: Median Imputation

Replace missing values with the median. Unlike the mean, the median is robust to
outliers — a handful of extreme values will not distort the imputed value.
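
A small sketch of that robustness, using a hypothetical column in which one reading is corrupted:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 500.0])  # hypothetical bad reading

print(values.mean(), values.median())              # 3.0 3.0
print(with_outlier.mean(), with_outlier.median())  # 102.0 3.0
```

The single extreme value moves the mean from 3.0 to 102.0, while the median stays at 3.0.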

In [None]:
# ── Strategy 2: Median Imputation (from scratch) ─────────────────────────────


def impute_median(series: pd.Series) -> pd.Series:
    """Impute missing values with the column median.

    Args:
        series: A pandas Series with potential NaN values.

    Returns:
        A new Series with NaN values replaced by the column median.
    """
    col_median = series.dropna().median()
    return series.fillna(col_median)


iris_median_imputed = iris_missing.copy()
iris_median_imputed["sepal length (cm)"] = impute_median(iris_missing["sepal length (cm)"])
iris_median_imputed["petal width (cm)"] = impute_median(iris_missing["petal width (cm)"])

print("Median-imputed Iris — remaining NaNs:")
print(iris_median_imputed.isnull().sum())

#### Strategy 3: Mode Imputation

Replace missing values with the most frequent value. This is the standard choice for
categorical features where mean and median are not defined.
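
For comparison with the from-scratch version, Pandas also exposes the mode directly through `Series.mode()`. The sketch below uses a hypothetical color column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing label.
colors = pd.Series(["red", "blue", "red", np.nan, "green", "red"])

# Series.mode() ignores NaN by default and returns a Series, because ties
# can produce several modes; .iloc[0] takes the first of them.
most_frequent = colors.mode().iloc[0]
filled = colors.fillna(most_frequent)

print(most_frequent)    # red
print(filled.tolist())  # ['red', 'blue', 'red', 'red', 'green', 'red']
```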

In [None]:
# ── Strategy 3: Mode Imputation (from scratch) ───────────────────────────────


def impute_mode(series: pd.Series) -> pd.Series:
    """Impute missing values with the column mode (most frequent value).

    Suitable for categorical or discrete features.

    Args:
        series: A pandas Series with potential NaN values.

    Returns:
        A new Series with NaN values replaced by the most frequent value.
    """
    value_counts = series.dropna().value_counts()
    mode_value = value_counts.index[0]
    return series.fillna(mode_value)


# Demonstrate mode imputation on a categorical-like column
# Inject NaN into species (simulate missing labels)
iris_cat_missing = iris_df.copy()
cat_missing_idx = np.random.choice(len(iris_cat_missing), size=15, replace=False)
iris_cat_missing.loc[cat_missing_idx, "species"] = np.nan

print(f"Missing species before mode imputation: {iris_cat_missing['species'].isnull().sum()}")
iris_cat_missing["species"] = impute_mode(iris_cat_missing["species"])
print(f"Missing species after mode imputation:  {iris_cat_missing['species'].isnull().sum()}")

#### Strategy 4: Forward-Fill

Carry the last observed value forward. This is most useful for time-series or
sequentially-ordered data where the previous observation is a reasonable proxy for the
missing one. If leading values are NaN, we backfill them from the first non-NaN value.
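
The same two-pass behavior is available through the built-in `Series.ffill()` and `Series.bfill()`. A minimal sketch on hypothetical ordered readings:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a leading gap and interior gaps.
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()           # the leading NaN has nothing to copy from
forward_then_back = forward.bfill()  # backfill only touches that leading NaN

print(forward.tolist())            # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(forward_then_back.tolist())  # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```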

In [None]:
# ── Strategy 4: Forward-Fill (from scratch) ──────────────────────────────────


def impute_forward_fill(series: pd.Series) -> pd.Series:
    """Impute missing values using forward-fill (last observation carried forward).

    Leading NaNs have no prior observation to carry forward, so a second,
    backward pass fills them from the first non-NaN value.

    Args:
        series: A pandas Series with potential NaN values.

    Returns:
        A new Series in which each NaN is replaced by the most recent non-NaN
        value, or by the first non-NaN value when no earlier observation exists.
    """
    result = series.copy()
    values = result.values.copy()
    for idx in range(1, len(values)):
        if pd.isna(values[idx]):
            values[idx] = values[idx - 1]
    # Handle leading NaNs with backfill
    for idx in range(len(values) - 2, -1, -1):
        if pd.isna(values[idx]):
            values[idx] = values[idx + 1]
    return pd.Series(values, index=series.index, name=series.name)


iris_ffill_imputed = iris_missing.copy()
iris_ffill_imputed["sepal length (cm)"] = impute_forward_fill(iris_missing["sepal length (cm)"])
iris_ffill_imputed["petal width (cm)"] = impute_forward_fill(iris_missing["petal width (cm)"])

print("Forward-fill imputed Iris — remaining NaNs:")
print(iris_ffill_imputed.isnull().sum())

#### Visual Comparison of Imputation Strategies

Overlaying the imputed distributions against the original reveals how each strategy
distorts the data. Mean imputation creates a spike at the mean; median imputation
creates a spike at the median; forward-fill distributes imputed values more broadly
but depends on data ordering.

In [None]:
# ── Compare imputed distributions with histograms ────────────────────────────
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

feature = "sepal length (cm)"
bins = 15

axes[0, 0].hist(iris_df[feature], bins=bins, alpha=0.6, label="Original", color="steelblue")
axes[0, 0].hist(iris_mean_imputed[feature], bins=bins, alpha=0.4, label="Mean", color="orange")
axes[0, 0].set_title("Mean Imputation")
axes[0, 0].set_xlabel(feature)
axes[0, 0].set_ylabel("Count")
axes[0, 0].legend()

axes[0, 1].hist(iris_df[feature], bins=bins, alpha=0.6, label="Original", color="steelblue")
axes[0, 1].hist(iris_median_imputed[feature], bins=bins, alpha=0.4, label="Median", color="green")
axes[0, 1].set_title("Median Imputation")
axes[0, 1].set_xlabel(feature)
axes[0, 1].set_ylabel("Count")
axes[0, 1].legend()

axes[1, 0].hist(iris_df[feature], bins=bins, alpha=0.6, label="Original", color="steelblue")
axes[1, 0].hist(iris_ffill_imputed[feature], bins=bins, alpha=0.4, label="Forward-Fill", color="red")
axes[1, 0].set_title("Forward-Fill Imputation")
axes[1, 0].set_xlabel(feature)
axes[1, 0].set_ylabel("Count")
axes[1, 0].legend()

# Compare all strategies on a single plot
axes[1, 1].hist(iris_df[feature], bins=bins, alpha=0.5, label="Original", color="steelblue")
axes[1, 1].hist(iris_mean_imputed[feature], bins=bins, alpha=0.3, label="Mean", color="orange")
axes[1, 1].hist(iris_median_imputed[feature], bins=bins, alpha=0.3, label="Median", color="green")
axes[1, 1].hist(iris_ffill_imputed[feature], bins=bins, alpha=0.3, label="Forward-Fill", color="red")
axes[1, 1].set_title("All Strategies Compared")
axes[1, 1].set_xlabel(feature)
axes[1, 1].set_ylabel("Count")
axes[1, 1].legend()

fig.suptitle("Imputation Strategy Comparison — sepal length (cm)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 1.4 Categorical Encoding

Most machine learning models operate on numeric arrays, not strings, so categorical
features must be converted into numeric representations. Three common strategies:

| Encoding | Description | When to use |
|----------|-------------|-------------|
| **One-hot** | Binary column per category | Nominal categories (no ordering), few unique values |
| **Ordinal** | Integer mapping preserving order | Ordered categories (e.g., low/medium/high) |
| **Target** | Replace category with mean of target | High cardinality, tree-based models |

In [None]:
# ── One-Hot Encoding (from scratch with NumPy) ───────────────────────────────


def one_hot_encode_scratch(
    series: pd.Series,
) -> pd.DataFrame:
    """One-hot encode a categorical Series using NumPy.

    Creates a binary column for each unique category. Values are 0 or 1.

    Args:
        series: A pandas Series containing categorical values.

    Returns:
        DataFrame with one binary column per unique category.
    """
    categories = sorted(series.dropna().unique())
    num_samples = len(series)
    num_categories = len(categories)
    encoded = np.zeros((num_samples, num_categories), dtype=int)
    cat_to_idx = {cat: idx for idx, cat in enumerate(categories)}
    for row_idx, value in enumerate(series):
        if pd.notna(value) and value in cat_to_idx:
            encoded[row_idx, cat_to_idx[value]] = 1
    column_names = [f"{series.name}_{cat}" for cat in categories]
    return pd.DataFrame(encoded, columns=column_names, index=series.index)


# Apply to Iris species
iris_onehot_scratch = one_hot_encode_scratch(iris_df["species"])
print("=== One-Hot Encoding (from scratch) ===")
print(iris_onehot_scratch.head(10))

# Compare with pd.get_dummies
iris_onehot_pandas = pd.get_dummies(iris_df["species"], prefix="species", dtype=int)
print("\n=== One-Hot Encoding (pd.get_dummies) ===")
print(iris_onehot_pandas.head(10))

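# Verify equivalence element by element (same values, same column order)
assert np.array_equal(iris_onehot_scratch.values, iris_onehot_pandas.values), "Mismatch!"
assert list(iris_onehot_scratch.columns) == list(iris_onehot_pandas.columns), "Column mismatch!"
print(f"\nEncodings match: {iris_onehot_scratch.shape} == {iris_onehot_pandas.shape}")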

#### Ordinal Encoding

Maps each category to an integer based on a specified ordering. Unlike one-hot encoding,
ordinal encoding produces a single column — but it introduces an artificial numerical
relationship between categories that may not exist.
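
When the categories do have a natural order, an ordered `pd.Categorical` yields the same kind of integer codes. The sketch below uses a hypothetical low/medium/high feature:

```python
import pandas as pd

# Hypothetical ordered feature; the level names are illustrative only.
sizes = pd.Series(["low", "high", "medium", "low"])
order = ["low", "medium", "high"]

# An ordered Categorical stores one integer code per value, following `order`.
codes = pd.Categorical(sizes, categories=order, ordered=True).codes

print(codes.tolist())  # [0, 2, 1, 0]
```

Values that are absent from `order` receive the code `-1`, so it is worth checking for that sentinel before modeling.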

In [None]:
# ── Ordinal Encoding (from scratch) ──────────────────────────────────────────


def ordinal_encode_scratch(
    series: pd.Series,
    order: list[str],
) -> pd.Series:
    """Ordinal encode a categorical Series given a specific ordering.

    Maps each category to an integer based on the specified order.

    Args:
        series: A pandas Series containing categorical values.
        order: Ordered list of categories from lowest to highest.

    Returns:
        A new Series with integer-encoded values.
    """
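    mapping = {cat: idx for idx, cat in enumerate(order)}
    # Cast to object first: mapping a categorical Series would otherwise
    # return another categorical rather than a plain numeric column.
    return series.astype(object).map(mapping).rename(f"{series.name}_ordinal")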


# Apply to Iris species (arbitrary order for demonstration)
species_order = ["setosa", "versicolor", "virginica"]
iris_ordinal_scratch = ordinal_encode_scratch(iris_df["species"], species_order)
print("=== Ordinal Encoding (from scratch) ===")
print(iris_ordinal_scratch.value_counts().sort_index())

# Compare with sklearn OrdinalEncoder
sklearn_ordinal = OrdinalEncoder(categories=[species_order])
iris_ordinal_sklearn = sklearn_ordinal.fit_transform(
    iris_df[["species"]].astype(str)
).ravel()
print(f"\nScratch vs sklearn match: {np.allclose(iris_ordinal_scratch.values, iris_ordinal_sklearn)}")

#### Target Encoding

Replace each category with the mean of the target variable for that category.
This produces a single numeric column that captures the category's relationship
with the target. **Warning:** target encoding must be computed on training data only —
using the full dataset leaks information from the test set.
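
A leakage-free version fits the category means on the training split and only looks them up for the test split. In the sketch below, the toy frames, the column names, and the global-mean fallback for unseen categories are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy split; the column names are illustrative only.
train = pd.DataFrame({
    "city": ["a", "a", "b", "b"],
    "price": [10.0, 20.0, 30.0, 50.0],
})
test = pd.DataFrame({"city": ["a", "b", "c"]})  # "c" never appears in train

# Fit: per-category target means and a global fallback, from train only.
category_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Transform: look up the learned means; unseen categories get the fallback.
test_encoded = test["city"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())  # [15.0, 40.0, 27.5]
```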

In [None]:
# ── Target Encoding (from scratch) ───────────────────────────────────────────


def target_encode_scratch(
    series: pd.Series,
    target: pd.Series,
) -> tuple[pd.Series, dict[str, float]]:
    """Target encode a categorical Series using the mean of the target per category.

    Each category value is replaced by the mean of the target variable for
    samples in that category. This must be fitted on training data only
    to avoid data leakage.

    Args:
        series: A pandas Series containing categorical values.
        target: A pandas Series containing the numeric target variable.

    Returns:
        Tuple of (encoded Series, mapping dict from category to mean target).
    """
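    # Work on object dtype so a categorical input yields a plain float column
    # and unobserved categories do not produce NaN entries in the map.
    categories = series.astype(object)
    combined = pd.DataFrame({"category": categories, "target": target})
    encoding_map = combined.groupby("category")["target"].mean().to_dict()
    encoded = categories.map(encoding_map).rename(f"{series.name}_target_encoded")
    return encoded, encoding_map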


# Demonstrate on Iris: encode species using sepal length as a proxy target
iris_target_encoded, target_map = target_encode_scratch(
    iris_df["species"], iris_df["sepal length (cm)"]
)
print("=== Target Encoding Map ===")
for category, mean_val in target_map.items():
    print(f"  {category}: {mean_val:.4f}")
print(f"\nEncoded values (first 5): {iris_target_encoded.head().tolist()}")

In [None]:
# ── Visualize encoding strategies side by side ────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# One-hot: show as heatmap for first 15 samples
axes[0].imshow(iris_onehot_scratch.iloc[:15].values, cmap="viridis", aspect="auto")
axes[0].set_xticks(range(3))
axes[0].set_xticklabels(["setosa", "versicolor", "virginica"], rotation=45, ha="right")
axes[0].set_ylabel("Sample Index")
axes[0].set_title("One-Hot Encoding")

# Ordinal: bar chart of encoded values
sample_indices = range(0, 150, 10)
axes[1].bar(range(len(sample_indices)), iris_ordinal_scratch.iloc[list(sample_indices)].values,
            color="steelblue")
axes[1].set_xlabel("Sample (every 10th)")
axes[1].set_ylabel("Ordinal Value")
axes[1].set_title("Ordinal Encoding")

# Target: scatter of encoded values
axes[2].scatter(range(len(iris_target_encoded)), iris_target_encoded.values,
                c=iris_ordinal_scratch.values, cmap="viridis", alpha=0.6, s=10)
axes[2].set_xlabel("Sample Index")
axes[2].set_ylabel("Target-Encoded Value")
axes[2].set_title("Target Encoding")

plt.tight_layout()
plt.show()

### 1.5 Merge and Join

In practice, data lives in multiple tables. Pandas provides `merge()` for SQL-style
joins and `concat()` for stacking DataFrames.

| Join Type | Keeps |
|-----------|-------|
| **Inner** | Only rows with matching keys in both tables |
| **Left** | All rows from the left table, matching rows from the right |
| **Right** | All rows from the right table, matching rows from the left |
| **Outer** | All rows from both tables |
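
The right join in the table can be sketched on two small hypothetical tables. The `indicator=True` argument of `pd.merge()` adds a `_merge` column that records where each row came from, which helps when checking a join:

```python
import pandas as pd

# Hypothetical toy tables; "a" exists only on the left, "d" only on the right.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# how="right" keeps every row of `right`; rows without a match in `left`
# get NaN in the left-hand columns and "right_only" in `_merge`.
right_merged = pd.merge(left, right, on="key", how="right", indicator=True)

print(right_merged)
```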

In [None]:
# ── Create related DataFrames for merge demonstration ────────────────────────
# Feature table: per-species measurements
species_features = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "avg_sepal_length": [5.006, 5.936, 6.588],
    "avg_petal_length": [1.462, 4.260, 5.552],
})

# Metadata table: additional info (note: "unknown" species not in features)
species_metadata = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "unknown"],
    "petal_color": ["white", "purple", "violet", "N/A"],
    "native_region": ["North America", "Eastern North America", "Eastern North America", "Unknown"],
})

print("=== Features Table ===")
print(species_features)
print("\n=== Metadata Table ===")
print(species_metadata)

In [None]:
# ── Inner Join: only matching keys ───────────────────────────────────────────
inner_merged = pd.merge(species_features, species_metadata, on="species", how="inner")
print("=== Inner Join ===")
print(inner_merged)
print(f"Rows: {len(inner_merged)} (dropped 'unknown' — no match in features)")

In [None]:
# ── Left Join: all rows from left table ──────────────────────────────────────
left_merged = pd.merge(species_features, species_metadata, on="species", how="left")
print("=== Left Join ===")
print(left_merged)
print(f"Rows: {len(left_merged)} (keeps all from features, drops unmatched metadata)")

In [None]:
# ── Outer Join: all rows from both tables ────────────────────────────────────
outer_merged = pd.merge(species_features, species_metadata, on="species", how="outer")
print("=== Outer Join ===")
print(outer_merged)
print(f"Rows: {len(outer_merged)} (keeps everything, NaN where no match)")

#### Concatenation with `pd.concat()`

`pd.concat()` stacks DataFrames either vertically (row-wise, `axis=0`) or
horizontally (column-wise, `axis=1`). Unlike `merge()`, it does not match on key
columns: it aligns the inputs on their column labels (`axis=0`) or index (`axis=1`)
and glues them together.
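
One detail worth seeing in action is what happens when the indexes do not line up. The sketch below uses two hypothetical frames whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
left = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"y": [10, 20, 30]}, index=[1, 2, 3])

# axis=1 aligns rows on the index (outer join by default), so labels present
# on only one side are padded with NaN.
side_by_side = pd.concat([left, right], axis=1)

print(side_by_side)
```

This is why horizontal concatenation is safest when both inputs share the same index, as `iris_df` and a frame derived from it do.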

In [None]:
# ── pd.concat for stacking DataFrames ────────────────────────────────────────
iris_first_half = iris_df.iloc[:75]
iris_second_half = iris_df.iloc[75:]

# Vertical concatenation (stacking rows)
iris_recombined = pd.concat([iris_first_half, iris_second_half], axis=0, ignore_index=True)
print(f"Original shape: {iris_df.shape}")
print(f"Recombined shape: {iris_recombined.shape}")
assert iris_recombined.shape == iris_df.shape, "Shape mismatch after concat!"

# Horizontal concatenation (adding columns)
extra_features = pd.DataFrame({
    "sepal_area": iris_df["sepal length (cm)"] * iris_df["sepal width (cm)"],
    "petal_area": iris_df["petal length (cm)"] * iris_df["petal width (cm)"],
})
iris_extended = pd.concat([iris_df, extra_features], axis=1)
print(f"\nExtended shape: {iris_extended.shape} (added sepal_area, petal_area)")
iris_extended.head()

---
## Part 2 — Putting It All Together

We combine the individual operations from Part 1 into two reusable classes:
1. **`EDAReport`** — generates a structured exploratory data analysis report.
2. **`DataPreprocessor`** — handles missing values and categorical encoding with a
   scikit-learn-style `fit()`/`transform()` API.

### EDAReport Class

The `EDAReport` class takes a DataFrame and provides four methods:
- `summary()` — extended descriptive statistics including skewness and kurtosis.
- `missing_report()` — counts and percentages of missing values per column.
- `correlation_plot()` — renders a correlation heatmap.
- `distribution_plots()` — renders histograms for all numeric columns.

In [None]:
class EDAReport:
    """Generate a structured exploratory data analysis report for a DataFrame.

    Attributes:
        df: The input DataFrame to analyze.
        numeric_cols: List of numeric column names.
        categorical_cols: List of categorical column names.
    """

    def __init__(self, df: pd.DataFrame) -> None:
        """Initialize EDAReport with a DataFrame.

        Args:
            df: The DataFrame to analyze.
        """
        self.df = df
        self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
        self.categorical_cols = df.select_dtypes(
            include=["object", "category"]
        ).columns.tolist()

    def summary(self) -> pd.DataFrame:
        """Generate summary statistics for all numeric columns.

        Returns:
            DataFrame with count, mean, std, min, quartiles, max, skewness,
            and kurtosis per column.
        """
        stats = self.df[self.numeric_cols].describe().T
        stats["skewness"] = self.df[self.numeric_cols].skew()
        stats["kurtosis"] = self.df[self.numeric_cols].kurtosis()
        return stats.round(4)

    def missing_report(self) -> pd.DataFrame:
        """Generate a report of missing values per column.

        Returns:
            DataFrame with count and percentage of missing values per column.
        """
        missing_count = self.df.isnull().sum()
        missing_pct = (missing_count / len(self.df)) * 100
        report = pd.DataFrame({
            "missing_count": missing_count,
            "missing_pct": missing_pct.round(2),
        })
        return report[report["missing_count"] > 0].sort_values(
            "missing_count", ascending=False
        )

    def correlation_plot(self) -> None:
        """Display a correlation heatmap for all numeric columns."""
        corr = self.df[self.numeric_cols].corr()
        fig, ax = plt.subplots(figsize=(8, 5))
        cax = ax.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
        fig.colorbar(cax)
        ax.set_xticks(range(len(corr.columns)))
        ax.set_yticks(range(len(corr.columns)))
        ax.set_xticklabels(corr.columns, rotation=45, ha="left", fontsize=8)
        ax.set_yticklabels(corr.columns, fontsize=8)
        ax.set_title("Correlation Heatmap", pad=20)
        plt.tight_layout()
        plt.show()

    def distribution_plots(self, max_cols: int = 8) -> None:
        """Plot histograms for numeric columns.

        Args:
            max_cols: Maximum number of columns to plot.
        """
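        cols_to_plot = self.numeric_cols[:max_cols]
        num_plots = len(cols_to_plot)
        if num_plots == 0:
            # Avoid a division by zero below when there is nothing to plot
            print("No numeric columns to plot.")
            return
        ncols = min(4, num_plots)
        nrows = (num_plots + ncols - 1) // ncols
        # squeeze=False always returns a 2-D array of axes, even for one plot
        fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                                 squeeze=False)
        axes_flat = axes.flatten()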
        for idx, col in enumerate(cols_to_plot):
            axes_flat[idx].hist(self.df[col].dropna(), bins=20, color="steelblue",
                                edgecolor="white", alpha=0.8)
            axes_flat[idx].set_title(col, fontsize=9)
            axes_flat[idx].set_xlabel("Value")
            axes_flat[idx].set_ylabel("Count")
        # Hide unused subplots
        for idx in range(num_plots, len(axes_flat)):
            axes_flat[idx].set_visible(False)
        fig.suptitle("Feature Distributions", fontsize=14)
        plt.tight_layout()
        plt.show()

### DataPreprocessor Class

The `DataPreprocessor` follows the scikit-learn convention:
- **`fit(df)`** — learns imputation values (means or medians) and encoding maps from training data.
- **`transform(df)`** — applies the learned parameters to any DataFrame (train or test).

This separation ensures we never leak test-set information into the preprocessing step.

In [None]:
class DataPreprocessor:
    """Preprocess tabular data with missing value imputation and categorical encoding.

    Follows the scikit-learn fit/transform pattern: fit() learns parameters
    from training data, transform() applies them to any data split.

    Attributes:
        impute_strategy: Strategy for numeric imputation ('mean' or 'median').
        encode_strategy: Strategy for categorical encoding ('onehot' or 'ordinal').
        numeric_fill_values: Dict mapping column name to imputation value (learned in fit).
        categorical_maps: Dict mapping column name to encoding map (learned in fit).
        fitted: Whether fit() has been called.
    """

    def __init__(
        self,
        impute_strategy: str = "mean",
        encode_strategy: str = "onehot",
    ) -> None:
        """Initialize the DataPreprocessor.

        Args:
            impute_strategy: Strategy for imputing numeric missing values.
                One of 'mean' or 'median'.
            encode_strategy: Strategy for encoding categorical columns.
                One of 'onehot' or 'ordinal'.
        """
        self.impute_strategy = impute_strategy
        self.encode_strategy = encode_strategy
        self.numeric_fill_values: dict[str, float] = {}
        self.categorical_maps: dict[str, dict] = {}
        self.fitted: bool = False

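    def fit(self, df: pd.DataFrame) -> "DataPreprocessor":
        """Learn imputation values and encoding maps from the training data.

        Args:
            df: Training DataFrame to learn parameters from.

        Returns:
            Self, for method chaining.

        Raises:
            ValueError: If impute_strategy is not 'mean' or 'median'.
        """
        if self.impute_strategy not in ("mean", "median"):
            raise ValueError(f"Unknown impute_strategy: {self.impute_strategy!r}")

        # Reset learned state so a second fit() does not keep stale parameters
        self.numeric_fill_values = {}
        self.categorical_maps = {}

        # Learn a fill value for every numeric column, not only those with NaNs
        # in the training data: the same column may contain NaNs at transform time.
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if self.impute_strategy == "mean":
                self.numeric_fill_values[col] = df[col].mean()
            else:
                self.numeric_fill_values[col] = df[col].median()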

        # Learn categorical encoding maps. Both strategies share the same
        # category -> index map: one-hot uses the index as a column position,
        # ordinal uses it as the encoded value.
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns
        for col in categorical_cols:
            unique_vals = sorted(df[col].dropna().unique())
            self.categorical_maps[col] = {
                val: idx for idx, val in enumerate(unique_vals)
            }

        self.fitted = True
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply learned imputation and encoding to a DataFrame.

        Args:
            df: DataFrame to transform. Must have the same columns as the
                DataFrame used in fit().

        Returns:
            Transformed DataFrame with imputed values and encoded categoricals.

        Raises:
            RuntimeError: If transform() is called before fit().
        """
        if not self.fitted:
            raise RuntimeError("Call fit() before transform().")

        result = df.copy()

        # Apply numeric imputation
        for col, fill_value in self.numeric_fill_values.items():
            if col in result.columns:
                result[col] = result[col].fillna(fill_value)

        # Apply categorical encoding
        encoded_frames = []
        cols_to_drop = []
        for col, mapping in self.categorical_maps.items():
            if col not in result.columns:
                continue
            if self.encode_strategy == "onehot":
                num_categories = len(mapping)
                encoded_array = np.zeros((len(result), num_categories), dtype=int)
                for row_idx, value in enumerate(result[col]):
                    if pd.notna(value) and value in mapping:
                        encoded_array[row_idx, mapping[value]] = 1
                col_names = [f"{col}_{cat}" for cat in mapping.keys()]
                encoded_frames.append(
                    pd.DataFrame(encoded_array, columns=col_names, index=result.index)
                )
                cols_to_drop.append(col)
            elif self.encode_strategy == "ordinal":
                result[col] = result[col].map(mapping)

        # Drop original categorical columns and add one-hot columns
        if cols_to_drop:
            result = result.drop(columns=cols_to_drop)
        if encoded_frames:
            result = pd.concat([result] + encoded_frames, axis=1)

        return result

### Sanity Check

Before applying our classes to real data, we verify them on a small toy DataFrame
with known missing values and categorical columns.

In [None]:
# ── Sanity Check on a toy DataFrame ──────────────────────────────────────────
toy_df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, np.nan],
    "weight": [65.0, 70.0, np.nan, 80.0, 75.0],
    "color": ["red", "blue", "red", "green", "blue"],
})
print("=== Toy DataFrame (before) ===")
print(toy_df)

preprocessor = DataPreprocessor(impute_strategy="mean", encode_strategy="onehot")
preprocessor.fit(toy_df)
toy_transformed = preprocessor.transform(toy_df)

print("\n=== Toy DataFrame (after transform) ===")
print(toy_transformed)

# Verify: no missing values remain
assert toy_transformed.isnull().sum().sum() == 0, "Missing values remain!"
print(f"\nNo missing values: {toy_transformed.isnull().sum().sum() == 0}")
print(f"Learned fill values: {preprocessor.numeric_fill_values}")
print(f"Learned encoding maps: {preprocessor.categorical_maps}")

---
## Part 3 — Application on Real Data

We apply our `EDAReport` and `DataPreprocessor` classes to the California Housing
dataset, then compare our from-scratch preprocessing against an equivalent sklearn
pipeline. We also demonstrate the critical principle of fitting on training data
only, then transforming test data.

In [None]:
# ── EDAReport on California Housing ──────────────────────────────────────────
housing_report = EDAReport(housing_missing)

print("=== Summary Statistics ===")
housing_report.summary()

In [None]:
print("=== Missing Values Report ===")
housing_report.missing_report()

In [None]:
housing_report.correlation_plot()

In [None]:
housing_report.distribution_plots()

### Train/Test Split and Preprocessing

A critical rule: **always fit preprocessors on training data only, then transform both
train and test**. If we compute the mean from the full dataset (including test), the
imputed values in the training set would be contaminated by test-set information —
this is a form of data leakage.
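
To make the leakage concrete, here is a minimal sketch on a hypothetical six-row column (the values and the 4/2 split are assumptions chosen for illustration, not taken from the housing data). It contrasts a fill value computed on all rows with one computed on the training rows only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: the last two rows play the role of the test split.
values = pd.Series([1.0, 2.0, np.nan, 3.0, 100.0, np.nan], name="feature")
train, test = values.iloc[:4], values.iloc[4:]

leaky_fill = values.mean()  # also sees the test-set value 100.0
clean_fill = train.mean()   # computed from training rows only

train_clean = train.fillna(clean_fill)
test_clean = test.fillna(clean_fill)  # test reuses the training statistic

print(f"Fill value from full data:  {leaky_fill:.2f}")
print(f"Fill value from train only: {clean_fill:.2f}")
```

In this toy case a single extreme test value moves the fill value from 2.00 to 26.50, so training rows imputed with the full-data mean would carry information about the test split.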

In [None]:
# ── Train/Test Split (fit on train, transform on test) ───────────────────────
# Separate features and target
housing_features = housing_missing.drop(columns=["MedHouseVal"])
housing_target = housing_missing["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(
    housing_features, housing_target, test_size=0.2, random_state=SEED
)
print(f"Train shape: {X_train.shape}")
print(f"Test shape:  {X_test.shape}")
print(f"\nTrain NaNs:\n{X_train.isnull().sum()[X_train.isnull().sum() > 0]}")
print(f"\nTest NaNs:\n{X_test.isnull().sum()[X_test.isnull().sum() > 0]}")

In [None]:
# ── From-Scratch Preprocessing ───────────────────────────────────────────────
preprocessor_scratch = DataPreprocessor(impute_strategy="mean", encode_strategy="onehot")
preprocessor_scratch.fit(X_train)  # Fit on training data ONLY

X_train_scratch = preprocessor_scratch.transform(X_train)
X_test_scratch = preprocessor_scratch.transform(X_test)

print("=== From-Scratch Preprocessing ===")
print(f"Train NaNs after transform: {X_train_scratch.isnull().sum().sum()}")
print(f"Test NaNs after transform:  {X_test_scratch.isnull().sum().sum()}")
print(f"Train shape: {X_train_scratch.shape}")
print(f"Test shape:  {X_test_scratch.shape}")
print(f"\nLearned fill values:")
for col, val in preprocessor_scratch.numeric_fill_values.items():
    print(f"  {col}: {val:.4f}")

### Library Comparison

We build an equivalent sklearn pipeline using `SimpleImputer` with the `mean` strategy
and verify that our from-scratch implementation produces numerically identical results.

In [None]:
# ── sklearn Pipeline Comparison ──────────────────────────────────────────────
# Build an equivalent sklearn pipeline with SimpleImputer (mean strategy)
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

sklearn_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
])

X_train_sklearn = sklearn_pipeline.fit_transform(X_train[numeric_cols])
X_test_sklearn = sklearn_pipeline.transform(X_test[numeric_cols])

# Convert to DataFrame for comparison
X_train_sklearn_df = pd.DataFrame(X_train_sklearn, columns=numeric_cols, index=X_train.index)
X_test_sklearn_df = pd.DataFrame(X_test_sklearn, columns=numeric_cols, index=X_test.index)

print("=== sklearn Pipeline Preprocessing ===")
print(f"Train NaNs after transform: {np.isnan(X_train_sklearn).sum()}")
print(f"Test NaNs after transform:  {np.isnan(X_test_sklearn).sum()}")
print(f"Train shape: {X_train_sklearn.shape}")
print(f"Test shape:  {X_test_sklearn.shape}")

In [None]:
# ── Verify Numerical Equivalence ─────────────────────────────────────────────


def compare_preprocessing_results(
    scratch_df: pd.DataFrame,
    sklearn_df: pd.DataFrame,
    label: str,
) -> pd.DataFrame:
    """Compare from-scratch and sklearn preprocessing results column by column.

    Args:
        scratch_df: DataFrame from the from-scratch preprocessor.
        sklearn_df: DataFrame from the sklearn pipeline.
        label: Label for the comparison (e.g., 'Train' or 'Test').

    Returns:
        DataFrame showing max absolute difference per column.
    """
    common_cols = [col for col in scratch_df.columns if col in sklearn_df.columns]
    diffs = []
    for col in common_cols:
        max_diff = np.abs(
            scratch_df[col].values - sklearn_df[col].values
        ).max()
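        # Record the split label so the otherwise unused argument appears in the output
        diffs.append({"split": label, "column": col, "max_abs_diff": max_diff})
    result = pd.DataFrame(diffs, columns=["split", "column", "max_abs_diff"])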
    result["match"] = result["max_abs_diff"] < 1e-10
    return result


train_comparison = compare_preprocessing_results(
    X_train_scratch[numeric_cols], X_train_sklearn_df, "Train"
)
test_comparison = compare_preprocessing_results(
    X_test_scratch[numeric_cols], X_test_sklearn_df, "Test"
)

print("=== Train Set — From-Scratch vs sklearn ===")
print(train_comparison.to_string(index=False))
print(f"\nAll columns match: {train_comparison['match'].all()}")

print("\n=== Test Set — From-Scratch vs sklearn ===")
print(test_comparison.to_string(index=False))
print(f"\nAll columns match: {test_comparison['match'].all()}")

---
## Part 4 — Evaluation & Analysis

We compare imputation strategies to see which preserves the original data
distribution best, analyze encoding strategies for their impact on
feature dimensionality and sparsity, and produce a summary comparison table.

### Imputation Strategy Analysis

A good imputation strategy preserves the original distribution as closely as possible.
We measure the shift in mean, standard deviation, and skewness introduced by each strategy.
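
As a rough illustration of why these statistics shift, the sketch below applies mean imputation to synthetic data (the normal distribution, the seed, and the 20% missing rate are assumptions made for this example). Filling with the observed mean leaves the mean unchanged but adds points with zero deviation, so the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
complete = pd.Series(rng.normal(loc=5.0, scale=1.0, size=200))

# Hide 40 of the 200 values at random (hypothetical missingness pattern).
masked = complete.copy()
missing_idx = rng.choice(complete.index, size=40, replace=False)
masked.loc[missing_idx] = np.nan

# Fill the gaps with the mean of the observed values.
mean_filled = masked.fillna(masked.mean())

print(f"Observed mean:     {masked.mean():.4f}")
print(f"Mean-imputed mean: {mean_filled.mean():.4f}")
print(f"Observed std:      {masked.std():.4f}")
print(f"Mean-imputed std:  {mean_filled.std():.4f}")
```

The sum of squared deviations is the same before and after filling, while the sample size grows from 160 to 200, so the imputed standard deviation equals the observed one scaled by sqrt(159/199).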

In [None]:
# ── Compare Imputation Strategies: Distribution Preservation ──────────────────


def compute_distribution_shift(
    original: pd.Series,
    imputed: pd.Series,
) -> dict[str, float]:
    """Compute statistics measuring how imputation shifted the distribution.

    Args:
        original: The original Series (no missing values).
        imputed: The imputed Series (missing values filled).

    Returns:
        Dictionary with mean_diff, std_diff, and skew_diff.
    """
    return {
        "mean_diff": abs(original.mean() - imputed.mean()),
        "std_diff": abs(original.std() - imputed.std()),
        "skew_diff": abs(original.skew() - imputed.skew()),
    }


feature_name = "sepal length (cm)"
original_series = iris_df[feature_name]

strategies = {
    "Mean": iris_mean_imputed[feature_name],
    "Median": iris_median_imputed[feature_name],
    "Forward-Fill": iris_ffill_imputed[feature_name],
}

shift_results = []
for strategy_name, imputed_series in strategies.items():
    shift = compute_distribution_shift(original_series, imputed_series)
    shift["strategy"] = strategy_name
    shift_results.append(shift)

shift_df = pd.DataFrame(shift_results)[["strategy", "mean_diff", "std_diff", "skew_diff"]]
print("=== Distribution Shift by Imputation Strategy ===")
print(shift_df.to_string(index=False))

In [None]:
# ── Histogram overlay: all strategies vs original ─────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for idx, (strategy_name, imputed_series) in enumerate(strategies.items()):
    axes[idx].hist(original_series, bins=15, alpha=0.6, label="Original",
                   color="steelblue", edgecolor="white")
    axes[idx].hist(imputed_series, bins=15, alpha=0.4, label=strategy_name,
                   color="orange", edgecolor="white")
    axes[idx].set_title(f"{strategy_name} vs Original")
    axes[idx].set_xlabel(feature_name)
    axes[idx].set_ylabel("Count")
    axes[idx].legend()

plt.tight_layout()
plt.show()

### Encoding Strategy Analysis

Different encoding strategies produce different output shapes and sparsity levels.
One-hot encoding increases dimensionality (one new column per category) but avoids
imposing artificial ordering. Ordinal and target encoding keep a single column, at a cost:
ordinal encoding implies an order among categories, and target encoding uses the label,
so it can leak target information unless it is computed on training data only.
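
The trade-off can be seen on a small hypothetical column (the `colors` and `target` values below are invented for illustration). This sketch uses `pd.get_dummies` and `Series.map` rather than the notebook's from-scratch encoders, so treat it as a shape comparison, not a reference implementation.

```python
import pandas as pd

# Hypothetical nominal column with three categories and a binary target.
colors = pd.Series(["red", "blue", "red", "green", "blue"], name="color")
target = pd.Series([1, 0, 1, 0, 1], name="label")

# One-hot: one 0/1 column per category.
onehot = pd.get_dummies(colors, prefix="color", dtype=int)
sparsity = (onehot.to_numpy() == 0).mean()  # fraction of zero entries

# Ordinal: a single integer column; the alphabetical order is arbitrary.
categories = sorted(colors.unique())
ordinal = colors.map({cat: idx for idx, cat in enumerate(categories)})

# Target: a single column holding the mean target per category.
# On real data, compute these means on the training split only.
target_encoded = colors.map(target.groupby(colors).mean())

print(onehot.shape)             # (5, 3)
print(ordinal.tolist())         # [2, 0, 2, 1, 0]
print(target_encoded.tolist())  # [1.0, 0.5, 1.0, 0.0, 0.5]
```

One-hot encoding turns one column into three and leaves two thirds of the entries at zero, while the other two encodings stay at one column each.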

In [None]:
# ── Compare Encoding Strategies: Dimensionality & Sparsity ────────────────────


def analyze_encoding(
    name: str,
    encoded_data: np.ndarray,
    num_original_features: int,
) -> dict[str, float | int | str]:
    """Analyze the dimensionality and sparsity of an encoding.

    Args:
        name: Name of the encoding strategy.
        encoded_data: The encoded data as a NumPy array.
        num_original_features: Number of features before encoding.

    Returns:
        Dictionary with encoding name, dimensions, and sparsity metrics.
    """
    total_elements = encoded_data.size
    zero_elements = np.sum(encoded_data == 0)
    sparsity = zero_elements / total_elements
    return {
        "encoding": name,
        "original_features": num_original_features,
        "encoded_features": encoded_data.shape[1],
        "dimensionality_increase": encoded_data.shape[1] - num_original_features,
        "sparsity_pct": round(sparsity * 100, 2),
    }


# Prepare encoded versions of Iris (numeric features + encoded species)
iris_numeric_features = iris_df[iris_df.select_dtypes(include=[np.number]).columns]
num_original = iris_numeric_features.shape[1]  # 4 numeric features; species is counted via "+ 1" below

# One-hot encoded
onehot_full = pd.concat([iris_numeric_features, iris_onehot_scratch], axis=1)
# Ordinal encoded
ordinal_full = iris_numeric_features.copy()
ordinal_full["species_ordinal"] = iris_ordinal_scratch.values
# Target encoded
target_full = iris_numeric_features.copy()
target_full["species_target"] = iris_target_encoded.values

encoding_analysis = pd.DataFrame([
    analyze_encoding("One-Hot", onehot_full.values, num_original + 1),
    analyze_encoding("Ordinal", ordinal_full.values, num_original + 1),
    analyze_encoding("Target", target_full.values, num_original + 1),
])

print("=== Encoding Strategy Comparison ===")
encoding_analysis

In [None]:
# ── Correlation heatmap of processed features ────────────────────────────────
# constrained_layout handles a colorbar shared across several axes;
# tight_layout is not compatible with that arrangement.
fig, axes = plt.subplots(1, 3, figsize=(16, 4), constrained_layout=True)

for idx, (name, df_encoded) in enumerate([
    ("One-Hot", onehot_full),
    ("Ordinal", ordinal_full),
    ("Target", target_full),
]):
    corr = df_encoded.corr()
    cax = axes[idx].matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    axes[idx].set_title(f"{name} Encoding", pad=15)
    axes[idx].set_xticks(range(len(corr.columns)))
    axes[idx].set_yticks(range(len(corr.columns)))
    axes[idx].set_xticklabels(corr.columns, rotation=90, fontsize=6)
    axes[idx].set_yticklabels(corr.columns, fontsize=6)

fig.colorbar(cax, ax=axes, shrink=0.6)
plt.show()

### Summary Comparison Table

The table below consolidates all preprocessing approaches covered in this notebook,
including their best use cases, distribution preservation characteristics, and
computational complexity.

In [None]:
# ── Summary Table: All Preprocessing Approaches ──────────────────────────────
summary_data = [
    {
        "approach": "Mean Imputation",
        "type": "Imputation",
        "best_for": "Symmetric continuous features",
        "preserves_distribution": "Moderately (inflates center)",
        "complexity": "O(n)",
    },
    {
        "approach": "Median Imputation",
        "type": "Imputation",
        "best_for": "Skewed continuous features",
        "preserves_distribution": "Better (robust to outliers)",
        "complexity": "O(n log n) by sorting; O(n) expected by selection",
    },
    {
        "approach": "Mode Imputation",
        "type": "Imputation",
        "best_for": "Categorical features",
        "preserves_distribution": "Inflates most common class",
        "complexity": "O(n)",
    },
    {
        "approach": "Forward-Fill",
        "type": "Imputation",
        "best_for": "Time-ordered data",
        "preserves_distribution": "Depends on ordering",
        "complexity": "O(n)",
    },
    {
        "approach": "One-Hot Encoding",
        "type": "Encoding",
        "best_for": "Nominal categories, few unique values",
        "preserves_distribution": "N/A",
        "complexity": "O(n * k)",
    },
    {
        "approach": "Ordinal Encoding",
        "type": "Encoding",
        "best_for": "Ordered categories",
        "preserves_distribution": "N/A",
        "complexity": "O(n)",
    },
    {
        "approach": "Target Encoding",
        "type": "Encoding",
        "best_for": "High cardinality, tree models",
        "preserves_distribution": "N/A (risk of data leakage)",
        "complexity": "O(n)",
    },
]

summary_df = pd.DataFrame(summary_data)
print("=== Preprocessing Approaches — Summary ===")
summary_df

---
## Part 5 — Summary & Lessons Learned

### Key Takeaways

1. **Always explore data before modeling** — `.describe()`, `.value_counts()`, and `.corr()` reveal dataset characteristics such as class balance, feature ranges, and correlations that inform all downstream decisions.

2. **Missing value strategy depends on the data:** mean imputation works for roughly symmetric continuous features, median is robust for skewed distributions, and mode is standard for categorical columns. Forward-fill is appropriate only for time-ordered data.

3. **One-hot encoding is safe but increases dimensionality;** ordinal encoding is compact but implies ordering; target encoding is compact but risks data leakage if not computed strictly on training data.

4. **Always fit preprocessors on training data only, then transform test data** — this prevents data leakage and ensures the model sees genuinely unseen data during evaluation.

5. **Column-wise Pandas operations are vectorized and fast:** avoid iterating rows with `for` loops or row-wise `.apply(axis=1)`, which still run Python code once per row. Prefer vectorized arithmetic, built-in aggregations, and `.groupby()`.
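
As a small sketch of the last takeaway, the snippet below computes the same column two ways on a hypothetical three-row DataFrame (the `width` and `height` columns are invented for this example). Both give identical values, but the row-wise version calls a Python function per row, which typically becomes the bottleneck on large frames.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]})

# Row-wise apply: one Python function call per row.
area_apply = df.apply(lambda row: row["width"] * row["height"], axis=1)

# Vectorized: a single operation over whole columns.
area_vectorized = df["width"] * df["height"]

print(np.allclose(area_apply, area_vectorized))  # True
```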

### What's Next

→ **1-04 (Visualization with Matplotlib)** teaches the plotting skills to create publication-quality EDA visualizations. These Pandas skills are used extensively in **Module 2 (Supervised Learning)** and **Module 4 (ML Theory & Evaluation)**.