# 📂 Train-Test Split

Splitting your dataset is a critical step in building reliable machine learning models. The goal is to evaluate how well a model generalizes to **unseen data**, not just how well it memorizes the training data.

## 🔧 Basic Split: Train/Test

- **Training Set**:  
  This subset is used to **train** the model – the model adjusts its internal parameters based on this data.

- **Test Set**:  
  This separate subset is used to **evaluate** the final model. It simulates new data the model hasn’t seen before, helping us assess its generalization performance.

📏 **Typical Split Ratios**:  
- 80% training / 20% testing  
- 90% training / 10% testing

> ⚠️ Important: Split at random (unless you are working with time series, where order matters). In classification tasks, consider **stratified sampling** so that the training and test sets keep the same label distribution as the full dataset.
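
A minimal sketch of a random split and a stratified split, using only the Python standard library. The function names, the `seed` parameter, and the toy spam/ham labels are assumptions made for illustration; they are not part of any library.

```python
import random
from collections import defaultdict

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and take the first test_fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)

def stratified_split_indices(labels, test_fraction=0.2, seed=0):
    """Split each class separately so both subsets keep roughly the same label mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        n_test = round(len(class_indices) * test_fraction)
        test.extend(class_indices[:n_test])
        train.extend(class_indices[n_test:])
    return train, test

# Hypothetical imbalanced labels: 20% "spam", 80% "ham".
labels = ["spam"] * 20 + ["ham"] * 80
train_idx, test_idx = stratified_split_indices(labels, test_fraction=0.2)
spam_share_in_test = sum(labels[i] == "spam" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), spam_share_in_test)
```

In practice a library helper is usually the better choice; scikit-learn's `train_test_split`, for example, accepts a `stratify` argument for the same purpose.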

---

## ⚠️ Risk of Overfitting to the Test Set

If model selection or hyperparameter tuning is done using the **test set**, we risk **overfitting** to it.  
The final evaluation then becomes **unreliable**, and usually too optimistic, because we’ve essentially "peeked" at the data that was supposed to stand in for unseen examples.
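
The effect can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a benchmark: the labels are coin flips, and the 200 "models" are random lookup tables (all names here are hypothetical), so no model can truly beat 50% accuracy. Picking the best one on the test set still tends to produce a score above 50%.

```python
import random

rng = random.Random(0)
N_INPUTS = 256  # each sample's feature is an integer in [0, 256)

def make_data(n):
    """Labels are coin flips, independent of the feature: there is nothing to learn."""
    return [(rng.randrange(N_INPUTS), rng.randint(0, 1)) for _ in range(n)]

def accuracy(model, data):
    """Fraction of samples where the lookup-table prediction matches the label."""
    return sum(model[x] == y for x, y in data) / len(data)

# 200 candidate "models": random lookup tables, each with a true accuracy of 50%.
models = [[rng.randint(0, 1) for _ in range(N_INPUTS)] for _ in range(200)]

test_set = make_data(100)
# Model selection performed ON the test set: this is the mistake being illustrated.
best = max(models, key=lambda m: accuracy(m, test_set))

test_acc = accuracy(best, test_set)
fresh_acc = accuracy(best, make_data(10_000))  # data the selection never saw
print(f"accuracy on the test set used for selection: {test_acc:.2f}")
print(f"accuracy on fresh data:                      {fresh_acc:.2f}")
```

The gap between the two printed numbers is the optimistic bias that comes from choosing a model on the same data used to report its score.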

---

## 🧪📊 Train / Validation / Test Split

To avoid overfitting to the test set, it’s best to split the data into **three** parts:

- **Training Set**: Used to fit the model (optimize internal parameters).  
- **Validation Set**: Used for **model selection**, **hyperparameter tuning**, and **early stopping**.  
- **Test Set**: Used **only once**, after model tuning is complete, for final unbiased evaluation.

📏 **Typical Ratios**:
- 75% training  
- 15% validation  
- 10% test
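
The ratios above can be turned into a three-way split as follows. This is a minimal sketch; `train_val_test_split` and its parameters are hypothetical names chosen for the example.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.10, seed=0):
    """Shuffle once, then carve out the test set, the validation set, and the rest for training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 1000 toy samples, identified by their index.
train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))
```

Tune on `val` as often as needed, and touch `test` only once, after all tuning decisions are final.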

---

## 🧠 Key Notes

- For **small datasets**, use **cross-validation** instead of a fixed validation set, so that every sample is used for both fitting and validation across the folds.
- For **time series data**, avoid random splits; respect chronological order (e.g., use forward chaining).
- A good data split does not improve the model by itself, but it makes the evaluation trustworthy, so the reported score is a realistic estimate of real-world performance.
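
The first two notes can be sketched as index generators. Both functions below are illustrative assumptions, not a library API: `k_fold_indices` implements plain k-fold cross-validation, and `forward_chaining_indices` implements one simple form of forward chaining, in which the training window grows and the validation block always lies later in time.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists; every sample is validated exactly once."""
    indices = list(range(n_samples))  # shuffle first if the data is not already in random order
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def forward_chaining_indices(n_samples, n_splits=3):
    """Yield (train, validation) index lists where validation always follows training in time."""
    block = n_samples // (n_splits + 1)  # leftover samples at the end are not used
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        val = list(range(i * block, (i + 1) * block))
        yield train, val

folds = list(k_fold_indices(10, k=5))
chains = list(forward_chaining_indices(8, n_splits=3))
for train, val in chains:
    print("train:", train, "validate:", val)
```

scikit-learn provides `KFold` and `TimeSeriesSplit` in `sklearn.model_selection` for the same two patterns.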

---

## ✅ Summary

Data splitting is a foundational step for any machine learning workflow.  
Using a **train/validation/test split** keeps fitting, tuning, and final evaluation on separate data, so the reported test score is an honest estimate of performance on unseen data.