# ✂️ Splitting Data into Train/Test Sets in Machine Learning

## 🎯 What is Train/Test Split?

Splitting the dataset is a critical step in machine learning to evaluate how well your model generalizes to **unseen data**.

- The **Training Set** is used to train the model.
- The **Test Set** is used to evaluate model performance on unseen data.

---

## 🧠 Why is it Important?

If you train and test your model on the same data:
- The model may **memorize** the data (overfitting).
- You get an **over-optimistic** accuracy score.
- You cannot evaluate how well your model will perform in production.

To prevent this, we split the data.
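The gap between the two scores can be seen in a minimal sketch. The synthetic data, the pure-noise labels, and the choice of `DecisionTreeClassifier` are assumptions made only for illustration. An unconstrained tree memorizes its training data, so the training score looks perfect while the held-out score stays near chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption): labels are pure noise,
# so there is no real pattern for the model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree can memorize every training sample.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on data the model has seen
test_acc = model.score(X_test, y_test)     # scored on held-out data

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Reporting only the training accuracy here would badly overstate how the model behaves on new data.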

---

## 🔢 Common Ratios

| Split        | Purpose         | Typical Proportion |
|--------------|------------------|---------------------|
| Training Set | Model training   | 70–80%              |
| Test Set     | Final evaluation | 20–30%              |

When hyperparameters need tuning, a third **Validation Set** is held out from the training data so that the test set stays untouched until the final evaluation.

---

## 🧮 Mathematical Notation

Let the dataset be:
\[
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}
\]

Split into:
- Training set \( D_{train} \)
- Test set \( D_{test} \)

\[
D_{train} \cup D_{test} = D,\quad D_{train} \cap D_{test} = \emptyset
\]
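The two conditions above (the sets cover the whole dataset and share no sample) can be checked directly on sample indices. This is a small sketch: the index array stands in for the dataset, and the sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 10
indices = np.arange(n)  # stand-in for the samples of D (0-based indices)

# Splitting a single array returns its train part and its test part.
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

train_set = set(train_idx.tolist())
test_set = set(test_idx.tolist())

assert train_set | test_set == set(range(n))  # union recovers D
assert train_set & test_set == set()          # intersection is empty

print(len(train_set), len(test_set))  # 8 2
```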

---

## 🛠️ How to Split the Data (Python)

```python
from sklearn.model_selection import train_test_split

# `df` is assumed to be a pandas DataFrame with a 'target' column
X = df.drop('target', axis=1)  # Features: every column except the target
y = df['target']               # Target

# Split into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

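### Parameters:
- `test_size=0.2`: 20% of the samples go to the test set, the remaining 80% to the training set
- `random_state`: fixes the seed of the shuffle, so the same split is produced on every run
- `shuffle=True`: data is shuffled before splitting (default)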
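The effect of these parameters can be seen in a short sketch on a toy array (the array and variable names are illustrative, not part of the example above).

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# Same random_state -> identical split on every call.
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
same_split = np.array_equal(a_test, b_test)

# shuffle=False keeps the original order: the last 20% becomes the test set.
c_train, c_test = train_test_split(data, test_size=0.2, shuffle=False)

print(same_split)        # True
print(c_train.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(c_test.tolist())   # [8, 9]
```

Disabling the shuffle is mainly useful for ordered data such as time series, where the test set should come after the training set.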



## 🧪 With Stratification (for Classification)
Use a stratified split (`stratify=y`) so that the train and test sets keep approximately the same class proportions as the full dataset.

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
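As a quick check, the sketch below uses made-up imbalanced labels (90 samples of class 0, 10 of class 1) and counts the minority class on each side of the split. The toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (illustrative): 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification, both parts keep the 10% minority share.
print(int(y_train.sum()), len(y_train))  # 8 80
print(int(y_test.sum()), len(y_test))    # 2 20
```

Without `stratify=y`, a random split of the same data could leave the test set with very few or even zero minority samples.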

## 🧪 Train/Validation/Test Split (Optional)
```python
# First split into temp (80%) and test (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split temp into train (60%) and validation (20%):
# test_size=0.25 of the remaining 80% equals 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
```

✅ Now:

- 60% Training
- 20% Validation
- 20% Test
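The resulting proportions can be confirmed on a dummy dataset of 1000 rows (the data here is a placeholder; only the sizes matter).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples, one feature.
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# 20% of the full data is held out as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 25% of the remaining 800 rows (200 rows) becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```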

## 📌 Best Practices
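- Always split before fitting any preprocessing (scaling, encoding): fit the transformers on the training set only and apply them to the test set, otherwise test-set statistics leak into training.
- Use `stratify` for imbalanced classification problems.
- Keep a fixed `random_state` to make results reproducible.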
- Don’t peek at the test set until final evaluation.
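The first practice can be sketched with `StandardScaler`. The synthetic data is an assumption for illustration, and any other transformer would follow the same fit-on-train, transform-on-test pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Learn the scaling statistics from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse those statistics on the test set; never refit on test data.
X_test_scaled = scaler.transform(X_test)

print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
```

The scaled test set will generally not have a mean of exactly zero, which is expected: it was scaled with the training statistics, just as new data would be in production.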



## 📋 Summary

| Set        | Usage                                 |
| ---------- | ------------------------------------- |
| Training   | Fit the model                         |
| Validation | Tune hyperparameters (optional)       |
| Test       | Final evaluation of model performance |
