# 🧠 Session 4: Train-Test Split & Hands-On Activity

## 🕒 00:00–00:10 — Why Train-Test Split Matters
In this part of the session, we explore why it's important to divide data into training and testing (and optionally validation) sets.

**Key Concepts:**
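- Training Set: The data the model is fitted on
- Test Set: Held back until the end to estimate how well the model generalizes to unseen data
- Validation Set (optional): Used to compare models and tune hyperparameters without touching the test set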
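
One common way to get all three sets is to call `train_test_split()` twice. The sketch below is only an illustration: the 60/20/20 proportions and the `X_temp`/`X_val` names are assumptions, not something this session prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Step 1: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```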

**Why not train and test on the same data?**
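- **Overfitting** goes undetected: the model can simply memorize the rows it is scored on
- No evidence of how well the model generalizes to unseen data
- Artificially high (overly optimistic) accuracy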
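
To make these risks concrete, here is a small sketch that scores the same model on the rows it was trained on and on held-out rows. The choice of `DecisionTreeClassifier` is an assumption made for illustration; any flexible model would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unpruned decision tree can fit its training rows almost perfectly.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Scoring on the training rows mostly measures memorization.
train_acc = model.score(X_train, y_train)
# Scoring on held-out rows estimates performance on new data.
test_acc = model.score(X_test, y_test)

print(f'Accuracy on training rows: {train_acc:.3f}')
print(f'Accuracy on held-out rows: {test_acc:.3f}')
```

The training score tends to sit at or near 100% because the tree has already seen those rows. Only the held-out score tells us anything about new data.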

_Discuss with a partner: What could go wrong if we skip this step?_

## 🕒 00:10–00:20 — `train_test_split()` in scikit-learn
We’ll now use scikit-learn’s `train_test_split()` to split our dataset.

**Typical Syntax:**
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
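- `test_size` = share of the data reserved for testing (a float between 0 and 1; an integer is read as an absolute number of samples)
- `random_state` = seed for the shuffle; fixing it makes the split reproducible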

_Try changing `test_size` to 0.2 or 0.5 and see how it affects the split._
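
If you want to check your results, the sketch below loops over a few `test_size` values and also shows what a fixed `random_state` buys you. The `sizes` dictionary and the `X_test_a`/`X_test_b` names are introduced here purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Effect of test_size: record (train rows, test rows) for each setting.
sizes = {}
for test_size in (0.2, 0.3, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    sizes[test_size] = (len(X_train), len(X_test))
    print(f'test_size={test_size}: train={len(X_train)}, test={len(X_test)}')

# Effect of random_state: the same seed selects the same test rows every time.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print('Same test rows with the same seed:', X_test_a.index.equals(X_test_b.index))
```

With the 150-row Iris data, the three settings give 120/30, 105/45 and 75/75 train/test rows.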

In [None]:
# Load dataset and preview
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame
df.head()

In [None]:
# Explore dataset
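df.info()  # prints column dtypes and non-null counts
print(df.describe())  # summary statistics; a cell only auto-displays its last expression
df['target'].value_counts()  # number of samples per class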

In [None]:
# Prepare features and labels, then split the data
from sklearn.model_selection import train_test_split

X = df.drop(columns='target')
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print(f'Training size: {X_train.shape}, Testing size: {X_test.shape}')
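
A plain random split does not guarantee that each class appears in the test set in the same proportion as in the full data. As an optional check (not part of the original exercise), the sketch below compares the split above with one that passes `stratify=y`. The `y_test_plain` and `y_test_strat` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

df = load_iris(as_frame=True).frame
X = df.drop(columns='target')
y = df['target']

# Plain random split: class counts in the test set can drift from 1:1:1.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Stratified split: class proportions are preserved in both parts.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print('Plain split:')
print(y_test_plain.value_counts().sort_index())
print('Stratified split:')
print(y_test_strat.value_counts().sort_index())
```

Iris has 50 samples per class, so the stratified test set of 45 rows contains 15 of each. Stratification matters more when classes are imbalanced.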

## 🕒 00:20–00:45 — 🧪 Hands-On Task
**Instructions:**
1. Load and explore the Iris dataset
2. Perform a train-test split
3. Optionally visualize features and distributions

In [None]:
# Optional: Visualize pairwise features
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='target')
plt.show()

## ✅ Summary & Reflection
- Why is splitting data essential for model evaluation?
- What happens if you don’t set `random_state`?
- How might data leakage still happen, even after splitting?
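
As a starting point for the leakage question, one frequent source is preprocessing that is fitted on the full dataset before the split. The sketch below shows one way to avoid that, assuming a `StandardScaler` plus `LogisticRegression` pipeline. Neither is used elsewhere in this session; they are only here to illustrate the pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Leaky pattern (avoid): StandardScaler().fit(X) would compute means and
# standard deviations from rows that later end up in the test set.

# Safer pattern: the pipeline fits the scaler on the training rows only,
# then reuses those training statistics to transform the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f'Held-out accuracy: {test_accuracy:.3f}')
```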

_Reflection prompt: What did you find easiest/hardest about this step?_