In [1]:
# %% [markdown]
# # Dimensionality Reduction Workflow
# 
# This notebook demonstrates the essential data science workflow: **Cleaning -> Encoding -> Scaling -> PCA**.
# 
# **Why?**
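# 1. **Encoding (OHE):** Models need numeric input. One-hot encoding gives each category its own 0/1 column, so `gender='M'` becomes `[0, 1]` across the columns (`gender_F`, `gender_M`).
# 2. **Scaling:** PCA follows variance, so features on a large scale (like Age=90) would dominate small binary flags (0 or 1). We standardize the numeric features to mean 0 and variance 1 to prevent that.
# 3. **PCA:** We project the expanded feature set onto a small number of components that capture most of the variance.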
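
# %% [markdown]
# A minimal sketch of point 2, using synthetic data rather than the notebook's tables. The names `age`, `flag`, and `first_pc_share` are illustrative assumptions, not part of the original workflow. It compares how much of the total variance the first principal component captures before and after standardizing.

```python
# %%
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed so the sketch is repeatable
n = 1000
age = rng.normal(60, 18, size=n)                  # synthetic ages, NOT the notebook's data
flag = rng.integers(0, 2, size=n).astype(float)   # synthetic 0/1 indicator


def first_pc_share(X):
    """Share of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))
    return eigvals.max() / eigvals.sum()


raw = np.column_stack([age, flag])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # mean 0, variance 1 per column

raw_share = first_pc_share(raw)        # close to 1: age alone drives the first component
scaled_share = first_pc_share(scaled)  # close to 0.5: both features contribute
print(f"raw: {raw_share:.3f}, standardized: {scaled_share:.3f}")
```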

# %%
import sys
sys.path.insert(0, '..')
from fs_thesis import sql, show

import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# %% [markdown]
# ## 1. Load Data
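# We select demographic features (categorical) and age (numerical). The join returns one row per admission, so a patient with several admissions appears several times. `LIMIT 5000` without `ORDER BY` returns an arbitrary 5,000 rows.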

# %%
df = sql("""
    SELECT 
        p.subject_id,
        p.gender,
        p.anchor_age,
        a.race,
        a.insurance,
        a.marital_status
    FROM hosp.patients p
    JOIN hosp.admissions a ON p.subject_id = a.subject_id
    LIMIT 5000
""")
show(df.head())


Unnamed: 0,subject_id,gender,anchor_age,race,insurance,marital_status
0,10000032,F,52,WHITE,Medicaid,WIDOWED
1,10000032,F,52,WHITE,Medicaid,WIDOWED
2,10000032,F,52,WHITE,Medicaid,WIDOWED
3,10000032,F,52,WHITE,Medicaid,WIDOWED
4,10000068,F,19,WHITE,,SINGLE
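
# %% [markdown]
# The preview repeats `subject_id` 10000032 because each admission produces its own row. If a per-patient analysis is intended (an assumption; the notebook itself keeps every row), one option is to keep a single row per `subject_id`. The sketch below rebuilds the previewed rows by hand to show the effect; `preview` and `per_patient` are illustrative names.

```python
# %%
import numpy as np
import pandas as pd

# Rows copied from the preview above; the missing insurance value is written as NaN
preview = pd.DataFrame({
    "subject_id": [10000032, 10000032, 10000032, 10000032, 10000068],
    "gender": ["F"] * 5,
    "anchor_age": [52, 52, 52, 52, 19],
    "race": ["WHITE"] * 5,
    "insurance": ["Medicaid"] * 4 + [np.nan],
    "marital_status": ["WIDOWED"] * 4 + ["SINGLE"],
})

# Keep the first row seen for each patient.
# Assumption: the selected attributes do not differ across a patient's admissions.
per_patient = preview.drop_duplicates(subset="subject_id").reset_index(drop=True)
print(per_patient)
```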


In [2]:


# %% [markdown]
# ## 2. The Preprocessing Pipeline
# We use `ColumnTransformer` to apply a different transformer to each group of columns:
# *   **Categorical:** `OneHotEncoder` (creates one new 0/1 column per unique value)
# *   **Numerical:** `StandardScaler` (rescales to mean 0 and variance 1)

# %%
categorical_features = ['gender', 'race', 'insurance', 'marital_status']
numerical_features = ['anchor_age']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
    ]
)

# Apply the transformation
X_processed = preprocessor.fit_transform(df)

print(f"Original Feature Count: {len(categorical_features) + len(numerical_features)}")
print(f"Processed Feature Count (after OHE): {X_processed.shape[1]}")

# %% [markdown]
# ## 3. Apply PCA
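# `X_processed` is now a purely numeric matrix, so we can fit PCA to find the principal components. Calling `PCA()` without `n_components` keeps every component, which lets us inspect the full explained-variance curve.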
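
# %% [markdown]
# As a rough illustration of what "finding principal components" means, the sketch below does the computation by hand with NumPy on synthetic data: center the matrix, take its SVD, and read the directions and variance shares off the result. scikit-learn's `PCA` is also based on an SVD of the centered data, but this is a simplified sketch, not its actual implementation. `X_toy` and the other names are illustrative.

```python
# %%
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three independent columns with very different spreads
X_toy = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.2])

X_centered = X_toy - X_toy.mean(axis=0)   # PCA works on centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

directions = Vt                    # each row is one principal direction
scores = X_centered @ Vt.T         # the data expressed in the new coordinates
var_ratio = S**2 / np.sum(S**2)    # share of total variance per component

print(var_ratio.round(3))
```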

# %%
pca = PCA()
components = pca.fit_transform(X_processed)

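# Cumulative share of variance explained by the first k components
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

fig_var = px.area(
    x=range(1, len(exp_var_cumul) + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Cumulative Explained Variance"},
    title="Cumulative Explained Variance by Number of Components"
)
fig_var.show()  # explicit show: a bare expression only renders as the last line of a cell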
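
# %% [markdown]
# The cumulative curve is usually read as "how many components do I need to reach a target share of variance?". A minimal sketch of that lookup follows; the helper `n_components_for` and the example ratios are hypothetical, not values from this dataset. scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 to make this selection itself.

```python
# %%
import numpy as np


def n_components_for(cumulative, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative = np.asarray(cumulative, dtype=float)
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    if idx >= len(cumulative):
        raise ValueError("threshold is never reached")
    return idx + 1  # convert 0-based index to a component count


# Made-up variance ratios; cumulative sums are roughly 0.60, 0.85, 0.95, 1.00
example_cumul = np.cumsum([0.6, 0.25, 0.1, 0.05])

k_90 = n_components_for(example_cumul, threshold=0.9)
k_50 = n_components_for(example_cumul, threshold=0.5)
print(k_90, k_50)
```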

# %% [markdown]
# ## 4. Visualization (2D)
# A scatter matrix of the first four principal components: each panel is a 2D projection of the high-dimensional data onto one pair of components.

# %%
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["gender"],  # Color by original label to see if patterns emerge
    title="First 4 Principal Components"
)
fig.update_traces(diagonal_visible=False)
fig.show()
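
# %% [markdown]
# When a pattern does show up in the scatter matrix, a common follow-up is to check which input columns drive each component by looking at the loadings in `pca.components_`. The sketch below uses synthetic columns `a`, `b`, `c` (assumptions for illustration). In this notebook the row labels would presumably come from `preprocessor.get_feature_names_out()` instead.

```python
# %%
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(scale=0.1, size=200),         # small independent noise
})

toy_pca = PCA(n_components=2).fit(toy)

# Rows: input columns. Columns: components. Entries: weight of each input in that component.
loadings = pd.DataFrame(
    toy_pca.components_.T,
    index=toy.columns,
    columns=["PC 1", "PC 2"],
)
print(loadings.round(2))  # the sign of a component is arbitrary, so compare magnitudes
```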


Original Feature Count: 5
Processed Feature Count (after OHE): 47
