## Model Selection

Model selection refers to picking the best model architecture or set of features for your prediction task. In practical data science, this usually means deciding which columns (features or attributes) of your data actually matter for the prediction you want to make. The right combination of features helps your model make accurate predictions and avoids problems like overfitting or underfitting.

- **Feature selection** is closely tied to model selection, and the two terms are often used interchangeably in this setting, since the chosen features directly determine your model’s structure and complexity.
- The process commonly involves examining training and validation error, and sometimes applies statistical or algorithmic approaches to systematically select features.

#### Polynomial Features Explained

Polynomial features involve expanding your dataset by creating new columns derived from existing features, raised to powers or multiplied together. This technique is often used for linear models when you suspect that relationships between variables aren’t strictly linear.

- If you start with a dataset containing just one feature (e.g., horsepower), applying a polynomial transformation of degree 2 creates these features: horsepower, horsepower squared.
- With multiple features (say, horsepower and weight), polynomial transformation yields: horsepower, weight, horsepower squared, horsepower * weight, weight squared.
- The number of polynomial features grows quickly as you increase the degree or the number of initial features. The extra features make the model more flexible, but they also raise the risk of overfitting.
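The expansion described above can be sketched in plain Python. This is only an illustration: the helper name `polynomial_feature_names` is hypothetical, and the constant (bias) term is deliberately left out so the output matches the lists above.

```python
from itertools import combinations_with_replacement

def polynomial_feature_names(features, degree):
    """Return names of all monomials of total degree 1..degree (no constant term)."""
    names = []
    for d in range(1, degree + 1):
        # Each combination with replacement is one product of d base features.
        for combo in combinations_with_replacement(features, d):
            parts = []
            for f in dict.fromkeys(combo):  # unique features, original order
                power = combo.count(f)
                parts.append(f if power == 1 else f"{f}^{power}")
            names.append(" * ".join(parts))
    return names

print(polynomial_feature_names(["horsepower"], 2))
# ['horsepower', 'horsepower^2']
print(polynomial_feature_names(["horsepower", "weight"], 2))
# ['horsepower', 'weight', 'horsepower^2', 'horsepower * weight', 'weight^2']
```

In practice a library transformer (for example scikit-learn's `PolynomialFeatures`) would usually do this work; the sketch only shows which columns get created.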


#### Model Complexity vs. Error

As you add complexity (by increasing polynomial degree or adding more features), something interesting happens:
- **Training error usually decreases** (the model fits your training data better).
- **Validation/test error can decrease at first but may later increase** (the model overfits your training data and struggles with new/unseen data).
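- This pattern reflects the **bias-variance tradeoff**: models that are too simple underfit (high bias), while models that are too complex overfit (high variance). The goal is the level of complexity where validation error is lowest.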
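A small experiment can make this pattern visible. The sketch below assumes NumPy is available and uses purely synthetic data (a noisy sine curve chosen for illustration, not taken from any real dataset); it fits polynomials of increasing degree and records training and validation error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic, illustrative data: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(30)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_errors, val_errors = [], []
for degree in range(1, 9):
    # Least-squares polynomial fit on the training set only.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_errors.append(mse(coeffs, x_train, y_train))
    val_errors.append(mse(coeffs, x_val, y_val))

for degree, (tr, va) in enumerate(zip(train_errors, val_errors), start=1):
    print(f"degree {degree}: train MSE = {tr:.4f}, validation MSE = {va:.4f}")
```

Because each higher-degree fit contains the lower-degree one as a special case, the training error cannot go up as the degree grows. The validation error has no such guarantee, and where it bottoms out depends on the data and the random seed.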

#### The Challenge: Too Many Feature Combinations

When dealing with multiple features and polynomial expansions, the number of possible feature combinations explodes.
- Example: with 5 initial features and a degree 3 polynomial transformation, you end up with 55 generated features (not counting the constant term).
- To find the best subset of features, you could, in theory, try every combination (exhaustive search), but that’s rarely practical. For 55 features, there are $2^{55}$ (about $3.6 \times 10^{16}$) possible subsets, far more than you could ever test.
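Both counts can be checked with a few lines of standard-library Python. The helper name `n_polynomial_features` is hypothetical; it uses the standard result that $n$ variables have $\binom{n+d}{d}$ monomials of total degree at most $d$, and subtracts one for the constant term.

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Monomials of total degree 1..degree in n_features variables (constant excluded)."""
    return comb(n_features + degree, degree) - 1

n_generated = n_polynomial_features(5, 3)
n_subsets = 2 ** n_generated  # each feature is either included or left out

print(n_generated)  # 55
print(n_subsets)    # 36028797018963968
```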


### Sequential Feature Selection

This is a systematic way to build models by **incrementally adding or removing features**:

- **Forward selection**: Start with no features, then add them one by one, always choosing the feature that most improves performance until adding new features doesn’t help.
- **Backward elimination**: Start with all features, then remove them one by one, always choosing the feature whose removal least hurts performance, stopping when further removal worsens model accuracy.
- **Recursive feature elimination**: Start with all features, fit a model, rank the features by importance (for example, by coefficient magnitude), drop the weakest, and repeat until the desired number of features remains.

These methods are greedy: they help you find a good subset of features without searching every combination, though the result is not guaranteed to be the best possible subset.
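As a rough sketch of forward selection, the code below greedily adds whichever feature most reduces validation error for an ordinary least-squares model, and stops once the improvement drops below a small tolerance. It assumes NumPy; the function names, the `tol` parameter, and the synthetic data (where only columns 0 and 2 influence the target) are illustrative choices rather than a standard API.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares with an intercept on `cols`; return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ w - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, tol=1e-4):
    selected = []
    remaining = list(range(X_tr.shape[1]))
    # Baseline: predict the training mean (a model with no features).
    best = float(np.mean((y_va - y_tr.mean()) ** 2))
    while remaining:
        # Score every candidate feature when added to the current set.
        scores = {
            j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j])
            for j in remaining
        }
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] <= tol:
            break  # no candidate improves validation error enough
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic data: only columns 0 and 2 actually affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=400)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

selected, val_mse = forward_selection(X_tr, y_tr, X_va, y_va)
print("selected columns:", selected, "validation MSE:", round(val_mse, 4))
```

Each round fits at most one model per remaining feature, so the total number of fits grows roughly quadratically with the number of features instead of exponentially. Backward elimination follows the same loop in reverse, starting from the full set and removing one feature at a time.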

### Regularization

Regularization is a set of techniques designed to prevent overfitting by adding a penalty for complexity to your model’s training process.

- Common types include **L1 regularization (Lasso)** and **L2 regularization (Ridge)**.
- By penalizing large weights in your regression model, regularization encourages simpler relationships, which usually leads to better generalization on new data.
- Regularization is particularly powerful when you have many features. Both penalties "shrink" the weights of less useful features toward zero; L1 (Lasso) can set them to exactly zero, effectively removing those features from the model, while L2 (Ridge) typically keeps every feature with a smaller weight.
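The difference between the two penalties can be illustrated with a small NumPy sketch. Everything here is an assumption made for the example: the data is synthetic (only columns 0 and 2 matter), there is no intercept term, ridge uses its closed-form solution, and lasso is solved with a basic proximal-gradient loop (often called ISTA) rather than a production solver.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge: minimize ||y - Xw||^2 + alpha * ||w||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries smaller than t become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fit(X, y, alpha, n_iter=1000):
    """Proximal gradient for (1/2n) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# Synthetic data: only columns 0 and 2 actually affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

ridge_norms = []
for alpha in [0.0, 10.0, 100.0, 1000.0]:
    w = ridge_fit(X, y, alpha)
    ridge_norms.append(float(np.linalg.norm(w)))
    print(f"ridge alpha={alpha:>6}: weights = {np.round(w, 3)}")

w_lasso = lasso_fit(X, y, alpha=1.0)
print("lasso alpha=1.0: weights =", np.round(w_lasso, 3))
```

As the ridge penalty grows, all weights get smaller but generally stay nonzero. With the lasso penalty used here, the weights on the three irrelevant columns end up at exactly zero, while the two informative columns keep reduced but nonzero weights.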


#### Limitations of Exhaustive Search

- For a small number of features (say, fewer than 10), it’s possible to fit a model for every combination and pick the best one; 10 features give only $2^{10} = 1024$ subsets.
- For larger feature sets, exhaustive search becomes computationally infeasible: "at a millisecond per model, it would take around a million years" to test every possibility with 55 features, since $2^{55}$ milliseconds is roughly $3.6 \times 10^{13}$ seconds.
- Sequential selection and regularization make model selection practical and scalable.
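The arithmetic behind these limits is easy to reproduce. The sketch below enumerates every subset of a tiny, three-feature set (the feature names are just labels for illustration) and then estimates the time needed for 55 features at the one-millisecond-per-model rate quoted above.

```python
from itertools import combinations

# Exhaustive search is fine for tiny feature sets: 3 features -> 2**3 subsets.
features = ["horsepower", "weight", "displacement"]
all_subsets = [
    subset
    for k in range(len(features) + 1)
    for subset in combinations(features, k)
]
print(len(all_subsets))  # 8, including the empty subset

# With 55 features the same approach needs 2**55 model fits.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
n_models = 2 ** 55
years = n_models * 1e-3 / SECONDS_PER_YEAR  # one millisecond per model
print(f"about {years:,.0f} years")  # a little over one million years
```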


### Real-World Example: Vehicle Dataset

Suppose you’re predicting car price based on horsepower, weight, and displacement.
- Using polynomial features of degree 2 gives you nine features: each original variable, each squared, and all pairwise products.
- Sequential selection or regularization can help you figure out if only a few of these features actually matter, instead of blindly using them all.