<a href="https://colab.research.google.com/github/awsdevguru/PearsonMLFoundations/blob/dev/2_4_04_Train_Test_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train/Test Split

## 0) Objective
* Demonstrate why train/test splitting is critical for fair model evaluation.
* Learn about validation sets, cross-validation, and best practices to avoid data leakage.
* Practice creating reproducible and stratified splits.

## 1) Setup

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

## 2) Load a Sample Dataset

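Use scikit-learn's built-in breast cancer dataset, which is clean and ready to use: 569 samples, 30 numeric features, and a binary target.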

In [None]:
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target
X.shape, y.value_counts(normalize=True)


## 3) Perform a Basic Train/Test Split

* An 80/20 split is a common default; `test_size=0.2` holds out 20% of the rows for testing.
* **Key point:** The test set simulates unseen data. Never train or tune on it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
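
The role of `random_state` in reproducibility can be checked directly. The following is a minimal sketch, not part of the original notebook: it repeats the split with the same seed and with a different one (the second seed, `7`, is an arbitrary choice for illustration) and compares which rows land in the test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Two splits with the same seed select exactly the same test rows.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_seed_match = X_test_a.index.equals(X_test_b.index)

# A different seed (arbitrary value, for illustration) selects different rows.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
diff_seed_match = X_test_a.index.equals(X_test_c.index)

print("Same seed, same test rows:", same_seed_match)
print("Different seed, same test rows:", diff_seed_match)
```

Fixing the seed makes results comparable across runs, which matters whenever two models are evaluated on what is supposed to be the same test set.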

## 4) Stratified Split for Classification

Stratification preserves class balance: passing `stratify=y` keeps the class ratios nearly identical in the training and test sets.

In [None]:
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
y.value_counts(normalize=True), y_train_s.value_counts(normalize=True)
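
To see what stratification changes, it can help to put the test-set class ratios side by side. This sketch is an addition for illustration; the column names in the `ratios` table are hypothetical labels, not anything defined by scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same seed and test size; the only difference is stratify=y.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class proportions in the full data and in each test set.
ratios = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "test_plain": y_test_plain.value_counts(normalize=True),
    "test_stratified": y_test_strat.value_counts(normalize=True),
})
print(ratios.round(3))
```

The stratified column should track the full-data column closely, while the plain split is free to drift, more so on small or heavily imbalanced datasets.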

## 5) Train and Evaluate Model

Keep the test set untouched until the final evaluation.

In [None]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train_s, y_train_s)
y_pred = model.predict(X_test_s)
print("Accuracy:", accuracy_score(y_test_s, y_pred))
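
One way to see why a held-out test set gives a fairer estimate is to compare training and test accuracy for a model that can memorize its training data. The sketch below swaps in an unpruned `DecisionTreeClassifier` purely for illustration; that model choice is an assumption and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# An unpruned tree can fit the training rows almost perfectly
# (illustrative model choice, not the notebook's LogisticRegression).
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_s, y_train_s)

# Score the same model on data it has seen and on data it has not.
train_acc = accuracy_score(y_train_s, tree.predict(X_train_s))
test_acc = accuracy_score(y_test_s, tree.predict(X_test_s))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

Accuracy measured on the training rows tends to be optimistic; the test-set figure is the one that approximates performance on unseen data.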

## 6) Add a Validation Split

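* Use a three-way split: train = 60%, validation = 20%, test = 20%.
* Split off 40% first, then halve that holdout into validation and test sets.
* Train on the training set -> tune on the validation set -> evaluate once on the test set.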

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
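
The train, tune, then evaluate-once workflow could look roughly like the sketch below. The grid of `C` values is a hypothetical choice for illustration, and tuning only the regularization strength of `LogisticRegression` is an assumption rather than something the notebook prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 60/20/20 split: hold out 40%, then halve the holdout.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Hypothetical grid of regularization strengths to compare.
candidate_C = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidate_C:
    clf = LogisticRegression(C=C, max_iter=10000)
    clf.fit(X_train, y_train)
    # Tuning decisions look only at the validation set.
    val_scores[C] = accuracy_score(y_val, clf.predict(X_val))

best_C = max(val_scores, key=val_scores.get)

# The test set is used exactly once, after the choice is made.
final_model = LogisticRegression(C=best_C, max_iter=10000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("Validation accuracy by C:", val_scores)
print("Chosen C:", best_C, "| test accuracy:", round(test_acc, 3))
```

Because the test set plays no part in choosing `C`, the final accuracy is not biased by the tuning step.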

## 7) Cross-Validation Demonstration

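Cross-validation gives a more reliable estimate of model performance than a single split, because every sample is used for testing exactly once across the folds.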

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print("CV mean accuracy:", scores.mean(), "±", scores.std())
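
Cross-validation is also where data leakage can creep in: any preprocessing fitted on the full dataset before splitting has already seen the held-out folds. A minimal sketch of one way to avoid this, assuming a `StandardScaler` step (an addition here, since the notebook itself applies no preprocessing), is to wrap the steps in a pipeline so the scaler is refit on the training portion of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# The scaler lives inside the pipeline, so cross_val_score fits it
# only on each fold's training rows, never on the held-out rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same pattern applies to imputation, feature selection, or any other step that learns parameters from the data.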

## 8) Key Takeaways

* Always test on unseen data.
* Stratify classification splits, especially when classes are imbalanced.
* Introduce a validation set for tuning.
* Use cross-validation when the dataset is small and a single split would give a noisy estimate.
* Control randomness and avoid leakage for reproducible, trustworthy models.