# Data Preprocessing & Splitting — Full Breakdown

## Overview 📊

When preparing data for machine learning, especially clinical or bio data, it's critical to handle preprocessing steps correctly to avoid **data leakage** and ensure your model generalizes well.

---

## Steps

### 1 Drop missing values

We remove rows with missing `gene_expression` to avoid contaminating model inputs.
* If dropping would discard too many rows, consider imputation instead.

```
df_clean = df.dropna(subset=['gene_expression'])
```
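
Before dropping rows, it can help to measure how much data would be lost, and to know what the imputation alternative looks like. The sketch below is one possible approach, not part of the pipeline in this notebook: the toy values, the 6/2 row split, and the choice of median imputation with `SimpleImputer` are all assumptions made for illustration. The fill value is learned from the training rows only, for the same leakage reasons covered in the scaling step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with the same column names as the mock data in this notebook
df = pd.DataFrame({
    "gene_expression": [2.5, 3.2, np.nan, 2.9, 3.8, 2.2, 4.1, np.nan],
    "age": [45, 52, 39, 61, 47, 50, 55, 62],
})

# How much would dropna() cost us?
missing_fraction = df["gene_expression"].isna().mean()
print(f"Rows missing gene_expression: {missing_fraction:.0%}")

# Stand-in for a real train/test split (first 6 rows train, last 2 test)
train_df, test_df = df.iloc[:6].copy(), df.iloc[6:].copy()

# Learn the fill value from the training rows only, then reuse it on test
imputer = SimpleImputer(strategy="median")
imputer.fit(train_df[["gene_expression"]])
train_df["gene_expression"] = imputer.transform(train_df[["gene_expression"]]).ravel()
test_df["gene_expression"] = imputer.transform(test_df[["gene_expression"]]).ravel()

print("Fill value learned from train:", imputer.statistics_[0])
```

Whether to drop or impute depends on how many rows are affected and on why the values are missing; the code only shows the mechanics.
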
### 2 Split into features and target

Separate features (`X`) from target (`y`).

```
X = df_clean[['gene_expression', 'age']]
y = df_clean['response']
```

### 3 Train-test split with stratification
We use `train_test_split` to split data into training and testing sets.

* **`stratify`**: Ensures that both sets maintain the same proportion of classes as the original data.

* **`random_state`**: Guarantees reproducibility of the split.

```
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
```
### Why stratify?

When dealing with imbalanced datasets (like many clinical studies), class distributions can easily get skewed if you split randomly.

For example, suppose your original dataset looks like this:

| Response | Count | Percentage |
|-----------|-------|------------|
| 0         | 90    | 90%        |
| 1         | 10    | 10%        |

If you do a random 80/20 split **without stratification**, you might accidentally end up with:

- Training set (80 rows): about 91% class 0, 9% class 1 (73 vs 7)
- Test set (20 rows): 85% class 0, 15% class 1 (17 vs 3)

This imbalance can mess up your model's ability to learn minority classes and skew evaluation metrics.

When you use `stratify=y`, you preserve the original class distribution:

| Response | Original % | After Stratify (Train/Test) % |
|-----------|------------|-------------------------------|
| 0         | 90%       | ~90%                        |
| 1         | 10%       | ~10%                        |
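
The table above can be checked directly on synthetic labels. This is a small sketch with made-up data (90 zeros, 10 ones, and a placeholder feature column), using the same `test_size` and `random_state` as the split in step 3.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the example table: 90 non-responders, 10 responders
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not meaningful

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The mean of a 0/1 label vector is the share of class 1
print("Train share of class 1:", y_train.mean())
print("Test share of class 1: ", y_test.mean())
```

With counts that divide this cleanly, both shares equal the original 10%. With awkward counts, `stratify` gets as close as whole rows allow.
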

---

**Key point:**
> "Stratification ensures both train and test sets reflect the true class distribution, so your model can generalize better and evaluation is fair."

### 4 Standardize features
We fit the scaler on X_train only to learn its mean and standard deviation, and then transform both train and test sets using these parameters.
```
scaler = StandardScaler()
scaler.fit(X_train[['gene_expression']])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled['gene_expression'] = scaler.transform(X_train[['gene_expression']])
X_test_scaled['gene_expression'] = scaler.transform(X_test[['gene_expression']])
```

### ⚖️ Fit vs Transform

| Step         | Purpose                                       | Applied to    | Mean ≈ 0, SD ≈ 1?                 |
|---------------|-----------------------------------------------|---------------|----------------------------------|
| `fit()`      | Learn mean and std from training data only    | Train data    | No change yet (just stores stats) |
| `transform()`| Apply learned stats to scale data            | Train & Test | Yes for train, approx. for test |
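
To make the table concrete, here is a minimal sketch of what `fit()` stores and what `transform()` computes. The values are the `gene_expression` readings from the mock data used later in this notebook, hard-coded here so the block runs on its own. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`). That is why the mock run later prints a train std of about 1.08 rather than exactly 1 with only 7 training rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"gene_expression": [3.8, 2.5, 2.9, 3.0, 1.8, 1.5, 4.1]})
test = pd.DataFrame({"gene_expression": [2.2, 3.2]})

# fit(): nothing is scaled yet, the scaler only stores train statistics
scaler = StandardScaler().fit(train)
print("mean_:", scaler.mean_, "scale_:", scaler.scale_)

# transform(): (x - mean_) / scale_, always with the stored train statistics
manual_test = (test["gene_expression"].to_numpy() - scaler.mean_[0]) / scaler.scale_[0]
sk_test = scaler.transform(test)[:, 0]

train_scaled = scaler.transform(train)[:, 0]
print("Train std, ddof=0 (what the scaler targets):", train_scaled.std(ddof=0))
print("Train std, ddof=1 (pandas default):         ", train_scaled.std(ddof=1))
print("Test mean after scaling:", sk_test.mean())
```
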

---

#### 🌋Knowledge bombs

- **Why only `fit()` on train?**
  > We only learn scaling parameters (mean and std) from training data to prevent data leakage. If you include test data when fitting, you let the model "peek" at future data, which inflates performance and ruins real-world validity.

- **Why `transform()` on both?**
  > Once trained on train data, the scaler should apply the exact same transformation to any new data (test or future real-world samples). This ensures consistency.

- **Why test data ≠ exact zero mean & unit std?**
  > Test data is transformed using the train stats, so it usually ends up *close* to zero mean and unit std, but not exactly, since its distribution might differ slightly.

---

#### Analogy

> "Fit is like learning the recipe in your own kitchen. Transform is applying that same recipe to your friend's food — no adjustments allowed. Never fit on test, or you basically sneak a taste and tweak their plate, and that’s cheating."

---

**One-liner mantra:**
> "Fit on train, transform on train, transform on test — never fit on test."

### ☠️Data Leakage
* We **do not** fit on test data to prevent leakage of future information into training.

* Transforming test data uses only parameters from training data.
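
One way to make these two rules hard to break is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below assumes the breast cancer dataset and a plain logistic regression purely as stand-ins; the point is that `fit` on the pipeline fits the scaler on exactly the rows passed in, and cross-validation re-fits it inside each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaler + model as one estimator: the scaler only ever sees the data given to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pipe.fit(X_train, y_train)
held_out_accuracy = pipe.score(X_test, y_test)
print("Held-out accuracy:", held_out_accuracy)

# In cross-validation the scaler is re-fit on the training part of every fold
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print("CV F1 per fold:", cv_scores)
```

The manual `fit`/`transform` steps used elsewhere in this notebook are equivalent for a single split; the pipeline mainly pays off once cross-validation or hyperparameter search enters the picture.
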

### Clinical context
In clinical data, preserving the true data distribution and preventing leakage is crucial for:

* Accurate model evaluation

* Regulatory compliance (Remember: on AWS 'Config is Code for Compliance')

* Generalization to real-world cohorts

### Analogy
* **Fit on train**: Like tasting your sauce at home to get the seasoning right.

* **Transform on train**: Apply that balance to your own dishes.

* **Transform on test**: You use your sauce on a guest dish — but their dish might taste a bit different since you can't adjust it further.

* **Never fit on test**: Don't taste and fix guest's dish at their table — that's cheating.

### Summary mantra
> "Fit on train. Transform on train. Transform on test. Never fit on test. Stratify to keep class balance."



## Let's make some mock data called `mock_patients.csv`

In [1]:
data = """patient_id,gene_expression,age,response
1,2.5,45,0
2,3.2,52,1
3,1.5,39,0
4,2.9,61,0
5,3.8,47,1
6,,50,0
7,2.2,55,0
8,4.1,62,1
9,3.0,48,0
10,1.8,44,0
"""

with open("mock_patients.csv", "w") as f:
    f.write(data)


In [4]:
import pandas as pd
df = pd.read_csv("mock_patients.csv")
df.head()

   patient_id  gene_expression  age  response
0           1              2.5   45         0
1           2              3.2   52         1
2           3              1.5   39         0
3           4              2.9   61         0
4           5              3.8   47         1


### Read CSV & run our pipeline

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv("mock_patients.csv")

# Drop missing
df_clean = df.dropna(subset=['gene_expression'])

# Split X and y
X = df_clean[['gene_expression', 'age']]
y = df_clean['response']

# Split with stratify
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Fit scaler on train
scaler = StandardScaler()
scaler.fit(X_train[['gene_expression']])

# Transform
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled['gene_expression'] = scaler.transform(X_train[['gene_expression']])
X_test_scaled['gene_expression'] = scaler.transform(X_test[['gene_expression']])

# Check means and stds
print("Train gene_expression mean:", X_train_scaled['gene_expression'].mean())
print("Train gene_expression std:", X_train_scaled['gene_expression'].std())

print("Test gene_expression mean:", X_test_scaled['gene_expression'].mean())
print("Test gene_expression std:", X_test_scaled['gene_expression'].std())

print("\nX_train_scaled:")
print(X_train_scaled)

print("\nX_test_scaled:")
print(X_test_scaled)


Train gene_expression mean: -3.806478941571965e-16
Train gene_expression std: 1.0801234497346435
Test gene_expression mean: -0.11261065411536281
Test gene_expression std: 0.7962775715882578

X_train_scaled:
   gene_expression  age
4         1.126107   47
0        -0.337832   45
3         0.112611   61
8         0.225221   48
9        -1.126107   44
2        -1.463939   39
7         1.463939   62

X_test_scaled:
   gene_expression  age
6        -0.675664   55
1         0.450443   52


# ⚖️ Handling Class Imbalance (On our "Dummy-Data")

## Problem

When one class dominates (e.g., 90% non-responders, 10% responders), the model might just always predict the majority class to get high "accuracy," but fail to actually detect the minority (important) class.

---

## Solutions

### Resampling

- **Oversample** minority class (e.g., using SMOTE or random oversampling).
- **Undersample** majority class.
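
As a rough sketch of random oversampling, the block below uses `sklearn.utils.resample` on a hypothetical training split (72 non-responders, 8 responders; the feature values are placeholders). Resampling is applied to the training rows only, after the split, so duplicated minority rows cannot end up in the test set. SMOTE is not shown here; it lives in the separate `imbalanced-learn` package.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training split: 72 non-responders, 8 responders
X_train = np.arange(80, dtype=float).reshape(-1, 1)  # placeholder feature
y_train = np.array([0] * 72 + [1] * 8)

minority = y_train == 1
n_majority = int((~minority).sum())

# Draw minority rows with replacement until they match the majority count
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=42,
)

X_balanced = np.vstack([X_train[~minority], X_min_up])
y_balanced = np.concatenate([y_train[~minority], y_min_up])
print("Class counts after oversampling:", np.bincount(y_balanced))
```
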

### Class weighting

- Pass `class_weight='balanced'` to some models (e.g., logistic regression, random forest).
- Adjusts loss function to penalize mistakes on minority class more.
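
To see what `'balanced'` actually does, the weights can be computed by hand. The sketch below uses synthetic 90/10 labels and scikit-learn's `compute_class_weight`; the formula is `n_samples / (n_classes * count_of_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # synthetic 90/10 labels
classes = np.unique(y)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same numbers from the formula n_samples / (n_classes * count_of_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Here an error on a responder counts about nine times as much as an error on a non-responder, which mirrors the 9:1 class ratio.
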

### Evaluation metrics

- Use metrics that highlight minority class performance:
  - Precision, recall
  - F1-score
  - ROC-AUC
  - PR-AUC
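
A quick way to see why accuracy is not enough: score a "model" that always predicts the majority class. The labels below are synthetic (90 non-responders, 10 responders), and `zero_division=0` is passed so the undefined precision is reported as 0 instead of raising a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f} f1={f1:.2f}")
```

Accuracy looks respectable while recall on the responder class is zero, which is exactly the failure the metrics above are meant to expose.
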

---

## 🥷 Key points

> "Accuracy alone is a trap on imbalanced data. You need to focus on recall & precision to understand real-world clinical utility."


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Dummy example
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train_scaled[['gene_expression']], y_train)

y_pred = model.predict(X_test_scaled[['gene_expression']])

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





## Dummy dataset is all well and good, but let's check out the sample Breast Cancer dataset from `scikit-learn`

In [7]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print(df.head())
print("\nClass distribution:")
print(df['target'].value_counts())


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# X and y
X = df[data.feature_names]
y = df['target']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Scale
scaler = StandardScaler()
scaler.fit(X_train)

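# Keep the original row index so the scaled frames still line up with y_train / y_test
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)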

# Model
model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=5000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.91      0.98      0.94        42
           1       0.99      0.94      0.96        72

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114



# Experimental Plan: Mess Up the Pipeline to See Why It Matters

## 💣 Hypothesis

If we **don't stratify** during train/test split, or if we **fit the scaler on the whole data before splitting**, we might:

- Cause **data leakage** (test info leaks into train).
- End up with **skewed class distributions** in train/test.
- Get **inflated or misleading model performance**.

---

## Controlled experiment

### Setup

- Use the scikit-learn breast cancer dataset.
- Two experimental "mistakes":

#### Mistake 1: No stratify during split

- Without stratify, we risk imbalanced splits and underrepresenting classes.

#### ⚠️ Mistake 2: Fit scaler on entire data before split

- By fitting scaler on all data before splitting, we "peek" at test set statistics.

---

## Expected outcome

- Train and test distributions will be different (class balance off).
- Model metrics may look artificially better or worse.
- Final metrics won’t reflect true generalization performance.

---

## Goal

- Show why proper preprocessing (split with `stratify`, fit on train only, then transform both train and test with the train-fitted scaler) is critical to avoid misleading results in clinical ML studies.


In [9]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# ☠️Mistake 1: NO stratify
X = df[data.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ☢️Mistake 2: Fit on ALL data BEFORE split
bad_scaler = StandardScaler()
bad_scaler.fit(X)  # ⧮ fitting on all data

X_scaled = pd.DataFrame(bad_scaler.transform(X), columns=X.columns, index=X.index)  # keep index for .loc below

# Now split "scaled data" (already contaminated)
X_train_scaled = X_scaled.loc[X_train.index]
X_test_scaled = X_scaled.loc[X_test.index]

# Train model
model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=5000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print(classification_report(y_test, y_pred))

# Check class balance
print("\nClass balance (train):")
print(y_train.value_counts(normalize=True))
print("\nClass balance (test):")
print(y_test.value_counts(normalize=True))


              precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.99      0.99      0.99        71

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114


Class balance (train):
target
1    0.628571
0    0.371429
Name: proportion, dtype: float64

Class balance (test):
target
1    0.622807
0    0.377193
Name: proportion, dtype: float64


# Lessons Learned from Messing Up the Pipeline

## What we did wrong

- **Did not use `stratify`** → Risks unbalanced train/test splits.
- **Fit scaler on entire data before split** → Caused data leakage.

---

## Consequences

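- Higher precision, recall, and F1-score than the correct pipeline (0.98 vs 0.96 accuracy). Because the unstratified split also produced a different test set (43/71 vs 42/72 samples per class), this gap cannot be attributed to the scaler leakage alone.
- Misleading evaluation metrics, especially dangerous in healthcare.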

---

## Correct approach

- Always **split first**, using `stratify=y` to maintain class balance.
- Then **fit scaler only on training data**, transform both train and test after.
- Use evaluation metrics beyond accuracy (e.g., recall, F1) to understand real-world performance.

---

> "Fit on train, transform on train, transform on test — never fit on test."


## 🚨 What is Data Leakage?

**Data leakage** occurs when information from outside the training dataset is used to create the model. This artificially inflates performance metrics during development, but the model fails in production because that information won't be available for new, unseen data.

**The core problem:** Your model learns patterns that won't exist in real-world deployment.

## Types of Leakage in Clinical ML

### 1. **Scaling Leakage** (what we're preventing)
```python
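# ❌ WRONG - Leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fits on ALL data, including the future test rows
X_train, X_test = train_test_split(X_scaled, random_state=42)

# ✅ CORRECT - No leakage
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler().fit(X_train)     # Only learns from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # Reuses the train statistics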
```

**Why this matters:** When you fit the scaler on all data, the mean and standard deviation it learns are computed partly from the test rows. The scaled training features, and therefore the model, indirectly carry information about the test data's distribution.

### 2. **Feature Engineering Leakage**
```python
# ❌ WRONG - the z-score uses the mean/std of ALL rows, test rows included
df['gene_zscore'] = (df['gene_expression'] - df['gene_expression'].mean()) / df['gene_expression'].std()
X_train, X_test = train_test_split(df)

# ✅ CORRECT - split first, then compute the statistics on train only
X_train, X_test = train_test_split(df)
X_train, X_test = X_train.copy(), X_test.copy()  # avoid SettingWithCopyWarning on assignment
train_mean = X_train['gene_expression'].mean()
train_std = X_train['gene_expression'].std()
X_train['gene_zscore'] = (X_train['gene_expression'] - train_mean) / train_std
X_test['gene_zscore'] = (X_test['gene_expression'] - train_mean) / train_std
```

### 3. **Target Leakage** (especially dangerous in clinical data)
```python
# ❌ WRONG - Using future information
# Example: predicting treatment response, but including
# "days_until_response" as a feature
# This info wouldn't exist at prediction time!

X = df[['gene_expression', 'age', 'days_until_response']]  # ❌

# ✅ CORRECT - Only use info available at prediction time
X = df[['gene_expression', 'age', 'baseline_symptoms']]  # ✅
```
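
Target leakage can be surprisingly direct. In the sketch below, all column names and values are hypothetical: `days_until_response` is only recorded for patients who responded, so the mere presence of a value gives away the label. A simple safeguard is an explicit allow-list of features known at prediction time.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'days_until_response' is only filled in for responders
df = pd.DataFrame({
    "gene_expression": [2.5, 3.2, 1.5, 2.9, 3.8, 2.2],
    "age": [45, 52, 39, 61, 47, 55],
    "days_until_response": [np.nan, 30.0, np.nan, np.nan, 21.0, np.nan],
    "response": [0, 1, 0, 0, 1, 0],
})

# A "prediction" built only from whether the leaky column has a value
leaky_guess = df["days_until_response"].notna().astype(int)
print("Leaky guess matches target on every row:", (leaky_guess == df["response"]).all())

# Keep an explicit allow-list of features that exist at prediction time
AVAILABLE_AT_PREDICTION = ["gene_expression", "age"]  # assumed list for this example
X = df[AVAILABLE_AT_PREDICTION]
y = df["response"]
```
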

## 🕰️ The Time Machine Analogy

Imagine you're developing a model in 2024 to predict patient outcomes:

**Data leakage is like having a time machine:**
- You travel to 2025, see which patients responded to treatment
- You come back to 2024 and use subtle hints from that future knowledge
- Your model looks amazing in testing!
- But when you deploy it in the *real* 2025, it fails because it doesn't have the time machine anymore

**In our scaling example:**
- Fitting scaler on all data = using the time machine to peek at test patients
- The scaler learns "future" statistics (test set mean/std)
- Model gets subtle advantages it won't have on truly new patients

## 🏥 Why Leakage is Especially Dangerous in Clinical ML

In healthcare and bioinformatics, data leakage can have serious consequences:

1. **Regulatory Failure**
   - Regulators such as the FDA are unlikely to accept performance claims from a pipeline with leakage
   - Cannot claim true prospective validation

2. **Patient Harm**
   - Model performs well in testing (e.g., 90% accuracy)
   - Deployed model performs poorly (e.g., 60% accuracy)
   - Wrong treatments given, delayed diagnoses

3. **Resource Waste**
   - Clinical trials designed around flawed models
   - Expensive sequencing/tests ordered based on bad predictions
   - Time and funding lost

4. **Publication Retraction**
   - Leakage in ML pipelines is a recognized cause of irreproducible results and can lead to corrections or retractions
   - Damages scientific credibility

**Illustrative example:** A gene expression model reports 95% accuracy predicting cancer subtype, but the normalizer was fitted on all data. Accuracy on truly new samples could turn out far lower (say, 68%).

## 📊 Leakage Visualization

### Without Leakage (Correct):
```
[All Data]
    ↓
Split (stratify=y)
    ↓
[Train Data]  [Test Data]
    ↓              ↓
Fit Scaler        (Wait)
    ↓              ↓
Transform         Transform (using train stats)
    ↓              ↓
Train Model       Evaluate
```

### With Leakage (Wrong):
```
[All Data]
    ↓
Fit Scaler  ← 🚨 Test data statistics leak here!
    ↓
Transform All
    ↓
Split
    ↓
[Train Data]  [Test Data]
    ↓              ↓
Train Model       Evaluate (Falsely optimistic!)
```

## Quantifying the Leakage Impact

**What happened in our "bad" pipeline:**

1. Test set had 114 samples with certain statistical properties
2. When we fit the scaler on all 569 samples, it learned a mean that incorporated those 114 test samples
3. The model training indirectly benefited from knowing the test distribution
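4. Metrics looked better (0.98 F1 vs 0.96 in the correct pipeline), but the two runs also used different test splits, so not all of that gap is necessarily due to leakage

**A difference of 2 percentage points might seem small, but:**
- In a study of 1000 patients, that's about 20 cases classified differently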
- Could be life-or-death decisions in cancer diagnosis
- Compounds with other modeling choices
- Won't replicate in prospective validation
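
Because the "bad" run changed two things at once (no stratification and a scaler fit on all rows), its test set is not the same as the correct run's. A rough sketch of a cleaner comparison is to hold the split fixed and vary only how the scaler is fit. The helper name `held_out_f1` is made up for this example. No particular result is claimed here: the gap caused by scaler leakage alone may well be small on a dataset like this.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def held_out_f1(scaler):
    """Train and score on the SAME split, using an already-fitted scaler."""
    model = LogisticRegression(class_weight="balanced", max_iter=5000, random_state=42)
    model.fit(scaler.transform(X_train), y_train)
    return f1_score(y_test, model.predict(scaler.transform(X_test)))

clean_scaler = StandardScaler().fit(X_train)  # train rows only
leaky_scaler = StandardScaler().fit(X)        # all rows, including test

f1_clean = held_out_f1(clean_scaler)
f1_leaky = held_out_f1(leaky_scaler)
mean_shift = np.abs(clean_scaler.mean_ - leaky_scaler.mean_).max()

print("F1, scaler fit on train only:", f1_clean)
print("F1, scaler fit on all data:  ", f1_leaky)
print("Largest shift in learned means:", mean_shift)
```

Repeating this over several `random_state` values would give a better sense of how much of any difference is leakage and how much is split-to-split noise.
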

In [1]:
!pip install pandas-stubs==2.3.2.250926


# ✅ CORRECT approach:
```
scaler = StandardScaler()
scaler.fit(X_train)  # Learn mean & std from TRAINING data only

X_train_scaled = scaler.transform(X_train)  # Scale train data
X_test_scaled = scaler.transform(X_test)    # Scale test data with TRAIN stats
```

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# ============================================
# STEP 1: Load the data
# ============================================
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

print("=" * 60)
print("ORIGINAL DATA")
print("=" * 60)
print(f"Total samples: {len(X)}")
print(f"\nFirst few rows of features:")
print(X.head())
print(f"\nClass distribution:")
print(y.value_counts())

# ============================================
# STEP 2: Split FIRST (with stratify)
# ============================================
# 🔑 KEY: Split happens BEFORE any scaling
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,      # Maintains class balance
    random_state=42
)

print("\n" + "=" * 60)
print("AFTER SPLITTING")
print("=" * 60)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nTrain class distribution:")
print(y_train.value_counts(normalize=True))
print(f"\nTest class distribution:")
print(y_test.value_counts(normalize=True))

# Check original statistics BEFORE scaling
print("\n" + "=" * 60)
print("BEFORE SCALING - Original Statistics")
print("=" * 60)
print(f"Train 'mean radius' - mean: {X_train['mean radius'].mean():.2f}, std: {X_train['mean radius'].std():.2f}")
print(f"Test 'mean radius' - mean: {X_test['mean radius'].mean():.2f}, std: {X_test['mean radius'].std():.2f}")

# ============================================
# STEP 3: Fit scaler ONLY on training data
# ============================================
# 🔑 KEY: Scaler only learns from X_train
scaler = StandardScaler()
scaler.fit(X_train)  # ✅ Only fit on training data

print("\n" + "=" * 60)
print("SCALER LEARNED FROM TRAINING DATA")
print("=" * 60)
print(f"Scaler learned mean for 'mean radius': {scaler.mean_[0]:.2f}")
print(f"Scaler learned std for 'mean radius': {scaler.scale_[0]:.2f}")

# ============================================
# STEP 4: Transform BOTH train and test
# ============================================
# 🔑 KEY: Use the SAME scaler (with train stats) for both
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    columns=X.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X.columns,
    index=X_test.index
)

print("\n" + "=" * 60)
print("AFTER SCALING")
print("=" * 60)
print(f"Train 'mean radius' - mean: {X_train_scaled['mean radius'].mean():.6f}, std: {X_train_scaled['mean radius'].std():.2f}")
print(f"Test 'mean radius' - mean: {X_test_scaled['mean radius'].mean():.6f}, std: {X_test_scaled['mean radius'].std():.2f}")
print("\nNote: Train mean = 0 and std = 1 up to rounding (pandas .std() uses ddof=1, so it reads slightly above 1)")
print("      Test mean ≈ 0, std ≈ 1 (approximate, uses train's stats)")

# ============================================
# STEP 5: Train model on scaled training data
# ============================================
model = LogisticRegression(
    class_weight='balanced',
    random_state=42,
    max_iter=5000
)
model.fit(X_train_scaled, y_train)

# ============================================
# STEP 6: Evaluate on scaled test data
# ============================================
y_pred = model.predict(X_test_scaled)

print("\n" + "=" * 60)
print("MODEL PERFORMANCE (NO LEAKAGE)")
print("=" * 60)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\n" + "=" * 60)
print("✅ SUMMARY: CORRECT PIPELINE")
print("=" * 60)
print("1. Split data first (with stratify)")
print("2. Fit scaler on X_train only")
print("3. Transform X_train using fitted scaler")
print("4. Transform X_test using SAME fitted scaler (train stats)")
print("5. Train model on X_train_scaled")
print("6. Evaluate on X_test_scaled")
print("\n🎯 No data leakage - model never saw test statistics!")

ORIGINAL DATA
Total samples: 569

First few rows of features:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...

```python
# ❌ WORST: Leakage + No stratify
scaler.fit(X)  # Leakage!
X_train, X_test, y_train, y_test = train_test_split(X, y)  # Imbalanced splits possible

# ⚠️ BETTER: No leakage, but might have imbalanced splits
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler.fit(X_train)  # No leakage!
# But train/test might have different class distributions

# ✅ BEST: No leakage + Balanced splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
scaler.fit(X_train)  # No leakage + balanced classes!
```

## In Clinical/Bioinformatics Context

**Without stratify:**
```
Original: 90% non-responders, 10% responders
Train: 91% non-responders, 9% responders  ← Model undertrained on responders
Test: 86% non-responders, 14% responders  ← Evaluation skewed
```

## What Stratify Does
`stratify` ensures your train and test sets have the same class proportions as the original data:

Original data:
- Class 0: 70%
- Class 1: 30%

**Without stratify, you might get unlucky:**
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)`
- Train might be: 67.5% class 0, 32.5% class 1
- Test might be: 80% class 0, 20% class 1 ← Imbalanced!

**With stratify, the balance is preserved:**
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)`
- Train: 70% class 0, 30% class 1
- Test: 70% class 0, 30% class 1 ← Same as original!