## Model selection
Model selection in **machine learning** is the process of choosing the best model (or model configuration) from a set of candidate models for a given task, based on performance criteria.

It’s essentially answering:

> *“Out of all the models I could use, which one should I trust to perform best on unseen data?”*

---

## Why Model Selection Matters

Even the most advanced algorithm can underperform if it is not a good fit for your data. Model selection helps to:

* **Avoid overfitting** (model fits training data too closely, failing on new data)
* **Avoid underfitting** (model is too simple to capture the data patterns)
* **Maximize generalization** (good performance on unseen data)
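
The gap between training and validation scores is one common way to spot these failure modes. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification` with 10% label noise), and decision-tree depth stands in for "model complexity". Neither choice comes from the notes above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (an assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Map each max_depth to (training accuracy, validation accuracy).
scores = {}
for depth in (1, 5, None):  # very simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    scores[depth] = (train_acc, val_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, val={val_acc:.2f}")
```

A depth-1 tree tends to score poorly on both sets (underfitting), while the unrestricted tree memorizes the training set yet scores lower on validation data (overfitting). A moderate depth often lands in between, though the exact numbers depend on the data.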

---

## Common Scenarios for Model Selection

* Choosing **between algorithms** (e.g., logistic regression vs. random forest vs. neural network)
* Choosing **hyperparameters** (e.g., number of trees in a random forest, learning rate in a neural net)
* Selecting **feature sets** or **data preprocessing methods** that yield the best performance
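
The three scenarios above could be sketched with scikit-learn roughly as follows. This is a minimal example, not a recommended setup: the synthetic dataset, the candidate models, and the parameter grids are all assumptions chosen to keep it small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scenario 1: compare algorithms by mean 5-fold cross-validated F1.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_f1 = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
print(cv_f1)

# Scenarios 2 and 3: search over a hyperparameter (C) and a preprocessing
# choice (scale the features or pass them through unchanged) in one grid.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "scaler": [StandardScaler(), "passthrough"],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Putting preprocessing inside the `Pipeline` means each cross-validation fold fits the scaler on its own training portion only, which avoids leaking information from the held-out fold.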

---


## Example: Choosing a Classification Model

Let’s say you have 10,000 labeled emails (spam / not spam).

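1. **Split data**: 70% train (7,000 emails), 15% validation (1,500), 15% test (1,500).
2. **Train models**: Logistic Regression, Random Forest, XGBoost.
3. **Hyperparameter tuning**: Tune each model with grid search and 5-fold cross-validation on the training set.
4. **Compare** the tuned models on the validation set: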

   | Model               | Validation F1 |
   | ------------------- | ------------- |
   | Logistic Regression | 0.89          |
   | Random Forest       | 0.91          |
   | XGBoost             | 0.93 ✅        |
5. **Select XGBoost** and evaluate it once on the held-out test set → F1 = 0.92, close to its validation score of 0.93, which suggests good generalization.
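
The five steps above might look like this in scikit-learn. Treat it as a sketch under several assumptions: a smaller synthetic dataset (2,000 rows) stands in for the 10,000 emails, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the example needs no extra package, and the parameter grids are hypothetical. The scores it prints will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labeled emails (1 = spam, 0 = not spam).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)

# Step 1: 70% train / 15% validation / 15% test (1400 / 300 / 300 rows).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=600, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=300, stratify=y_rest, random_state=42
)

# Steps 2-3: candidate models with small hypothetical grids,
# tuned by 5-fold cross-validation on the training set only.
candidates = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "Random Forest": (
        RandomForestClassifier(n_estimators=50, random_state=42),
        {"max_depth": [5, None]}),
    "Gradient Boosting": (
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        {"learning_rate": [0.05, 0.1]}),
}
fitted, val_f1 = {}, {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_
    val_f1[name] = f1_score(y_val, fitted[name].predict(X_val))
    print(f"{name}: validation F1 = {val_f1[name]:.3f}")

# Step 4: pick the model with the highest validation F1.
best_name = max(val_f1, key=val_f1.get)

# Step 5: score the selected model once on the untouched test set.
test_f1 = f1_score(y_test, fitted[best_name].predict(X_test))
print(f"Selected {best_name}: test F1 = {test_f1:.3f}")
```

The test set is used only in the last step. Consulting it while choosing between models would make the final score an optimistic estimate of performance on unseen data.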

---



## Choosing the Right Estimator


<img src='images/ml_map.png' alt='scikit-learn machine learning map for choosing an estimator'>

Refs: [1](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

## Overfitting and Underfitting

<img src='images/overfitting_underfitting.svg'>

Refs: [1](https://towardsdatascience.com/a-short-introduction-to-model-selection-bb1bb9c73376)