## Sequential Feature Selection

Sequential feature selection is a method for **automating the process of choosing a good subset of features** (columns or variables) from a larger set. It finds a smaller group of features that improves model performance without trying every possible combination, which becomes computationally infeasible as the number of features grows.


### Why Do We Need Sequential Feature Selection?

- In machine learning, having too many features (inputs) can cause the model to **overfit**—it memorizes the training data but performs poorly on unseen data.
- For example, a model built with 55 polynomial features might have very low training error (it fits the training data well) but high error on validation or new data, which indicates overfitting.
- Exhaustively testing every combination of those 55 features ($2^{55} \approx 3.6 \times 10^{16}$ possible models) to find the best subset is infeasible, even on powerful computers.
- **Sequential feature selection finds good features without needing to test every possible combination.**

### How Does Sequential Feature Selection Work?

Sequential feature selection works by either:

1. **Forward selection:** starting with no selected features and adding one feature at a time.
2. **Backward selection:** starting with all features and removing one feature at a time.

The goal is to reach a predetermined number of features $k$.
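The rest of this section walks through forward selection only, so here is a minimal sketch of the backward variant for comparison. The function name `backward_select` and the `dev_error` callable are assumptions made for illustration; `dev_error` stands in for "fit a model on this subset and return its validation error", and the toy scorer below is made up.

```python
from typing import Callable, FrozenSet, List, Sequence


def backward_select(
    features: Sequence[str],
    k: int,
    dev_error: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedy backward selection: drop one feature per round until k remain."""
    selected = list(features)  # start with every feature
    while len(selected) > k:
        # Score each subset obtained by removing exactly one feature and
        # drop the feature whose removal leaves the lowest error.
        to_drop = min(
            selected,
            key=lambda f: dev_error(frozenset(selected) - {f}),
        )
        selected.remove(to_drop)
    return selected


# Hypothetical scorer: pretend "a" and "c" are the useful features; every
# missing useful feature or extra useless feature costs 1.0.
USEFUL = {"a", "c"}


def toy_error(subset: FrozenSet[str]) -> float:
    return float(len(USEFUL ^ subset))


kept = backward_select(["a", "b", "c", "d"], k=2, dev_error=toy_error)
print(kept)  # ['a', 'c']
```

In a real setting `dev_error` would train and evaluate a model, so each round costs one fit per feature still in the set.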

### Forward Sequential Feature Selection Step-by-Step

1. **Start:** 
   - No features are selected initially (an empty set).
   - All features are candidates for selection.

2. **Iteration 1 - Adding the First Feature:**
   - Try each candidate feature individually.
   - Fit a model with that single feature.
   - Compute the **development (or validation) set error** for each model.
   - Select the feature that results in the *lowest* error.
   - This feature is added permanently to the selected set (called `phi_selected`).

3. **Iteration 2 - Adding Subsequent Features:**
   - Keep the feature(s) already selected.
   - For each remaining candidate feature (in `phi_remaining`), fit a model that uses all selected features plus that one candidate.
   - Compute the development set error for each model.
   - Select the candidate whose model has the lowest error.
   - Add this feature permanently to `phi_selected`.

4. **Repeat** the previous step until $k$ features have been selected.
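The steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: `forward_select` and `dev_error` are hypothetical names, and `dev_error` is assumed to fit a model on the given feature subset and return its development set error. The toy scorer only exists so the sketch runs end to end.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    k: int,
    dev_error: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedy forward selection: add the best remaining feature until k are chosen."""
    phi_selected: List[str] = []          # Start: nothing selected
    phi_remaining = list(candidates)      # Start: everything is a candidate
    while len(phi_selected) < k and phi_remaining:
        # Try each candidate together with the already-selected features
        # and keep the one with the lowest development set error.
        best = min(
            phi_remaining,
            key=lambda f: dev_error(frozenset(phi_selected + [f])),
        )
        phi_selected.append(best)         # added permanently
        phi_remaining.remove(best)
    return phi_selected


# Hypothetical scorer: each feature lowers a base error of 10 by a fixed,
# made-up amount. A real scorer would fit and evaluate a model instead.
GAINS = {"x1": 3.0, "x2": 5.0, "x3": 1.0}


def toy_error(subset: FrozenSet[str]) -> float:
    return 10.0 - sum(GAINS[f] for f in subset)


chosen = forward_select(["x1", "x2", "x3"], k=2, dev_error=toy_error)
print(chosen)  # ['x2', 'x1']
```

The returned list preserves the order in which features were added, which is often useful for seeing which features mattered first.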

### Illustration Example (Using 5 Features)

Suppose your original features are:

- horsepower
- weight
- horsepower squared
- horsepower times weight
- weight squared

#### Step 1
- Evaluate models with each single feature:
  - Model 1: horsepower only
  - Model 2: weight only
  - Model 3: horsepower squared only
  - Model 4: horsepower times weight only
  - Model 5: weight squared only

- Suppose horsepower gives the lowest validation error.
- Add **horsepower** to selected features.

#### Step 2
- Now try adding each of the remaining four one by one along with horsepower:
  - horsepower + weight
  - horsepower + horsepower squared
  - horsepower + horsepower times weight
  - horsepower + weight squared

- Suppose "horsepower + weight" gives the lowest error.
- Add **weight** to selected features.

#### Step 3
- Try adding one more feature with the two already selected (horsepower and weight):
  - horsepower + weight + horsepower squared
  - horsepower + weight + horsepower times weight
  - horsepower + weight + weight squared

- Suppose "horsepower + weight + horsepower squared" gives the lowest error.
- Add **horsepower squared**. If the target was $k = 3$ features, stop here.
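The walkthrough above can be reproduced numerically. The sketch below uses NumPy with synthetic data, since no dataset is given here: the value ranges, the target formula, the train/dev split, and the choice of ordinary least squares are all assumptions. Which features get picked depends on that made-up data, so the selections will not necessarily match the "suppose" choices in the walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins (assumptions): horsepower in hundreds, weight in thousands.
hp = rng.uniform(0.5, 2.0, n)
wt = rng.uniform(1.5, 4.5, n)
# Made-up target, loosely "fuel efficiency", plus noise.
y = 40.0 - 8.0 * hp - 4.0 * wt + rng.normal(0.0, 1.0, n)

features = {
    "horsepower": hp,
    "weight": wt,
    "horsepower squared": hp**2,
    "horsepower times weight": hp * wt,
    "weight squared": wt**2,
}
train, dev = slice(0, 150), slice(150, n)


def dev_error(names):
    """Fit least squares (with intercept) on the train rows; return dev-set MSE."""
    X = np.column_stack([np.ones(n)] + [features[name] for name in names])
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return float(np.mean((X[dev] @ coef - y[dev]) ** 2))


selected = []
for step in range(1, 4):  # stop at k = 3 features
    remaining = [name for name in features if name not in selected]
    # One model per remaining candidate, each including all selected features.
    scores = {name: dev_error(selected + [name]) for name in remaining}
    best = min(scores, key=scores.get)
    selected.append(best)
    print(f"Step {step}: added {best!r} (dev MSE {scores[best]:.3f})")
```

Step 1 fits five models, step 2 fits four, and step 3 fits three, for twelve fits in total.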

### Key Points

- Once selected, features remain **permanently in the set**.
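- The method **fits only a limited number of models** during the search: selecting $k$ of $n$ features takes $n + (n-1) + \dots + (n-k+1)$ fits, which is at most $n(n+1)/2$, or 1,540 for the 55-feature example, compared with $2^{55}$ subsets for an exhaustive search.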
- This technique is **greedy**: at each step, it picks the feature that looks best right now, which might not always lead to the absolute best combination but works well in practice.
- It is commonly used in scenarios with many features to **automate and streamline feature selection**.
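To make the cost difference concrete, the short calculation below counts model fits for the 55-feature example. The helper name `forward_fit_count` is made up for this illustration, and the choice of $k = 10$ is arbitrary.

```python
def forward_fit_count(n_features: int, k: int) -> int:
    """Model fits needed by forward selection to pick k of n_features."""
    # Round i (0-based) tries every feature not yet selected: n_features - i fits.
    return sum(n_features - i for i in range(k))


n = 55
print(2**n)                       # 36028797018963968 subsets for brute force
print(forward_fit_count(n, 10))   # 505 fits to select 10 features
print(forward_fit_count(n, n))    # 1540 fits even if all 55 are selected
```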

### Comparison with Brute-force Approach

| Brute-force Search                             | Sequential Feature Selection                           |
|-----------------------------------------------|------------------------------------------------------|
| Tests every subset of features (exponentially many) | Adds features one at a time, reducing search space dramatically |
| Guarantees finding the absolute best subset    | Approximates best subset but much faster              |
| Impractical for large feature sets              | Scales to large feature sets                          |
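The second row of the table can be seen on a tiny example. The error values below are hypothetical, chosen so that the two strongest features together beat any pair containing the single best feature; with such errors, the greedy search settles on a different pair than the exhaustive one.

```python
from itertools import combinations

# Hypothetical development set errors for subsets of three features.
errors = {
    frozenset("a"): 1.0, frozenset("b"): 2.0, frozenset("c"): 2.5,
    frozenset("ab"): 0.9, frozenset("ac"): 0.8, frozenset("bc"): 0.3,
}
features = ["a", "b", "c"]
k = 2

# Brute force: score every subset of size k and keep the best.
best_subset = min(
    (frozenset(s) for s in combinations(features, k)),
    key=errors.__getitem__,
)

# Forward selection: add the best single feature, then the best partner for it.
selected = []
while len(selected) < k:
    remaining = [f for f in features if f not in selected]
    best = min(remaining, key=lambda f: errors[frozenset(selected + [f])])
    selected.append(best)

print(sorted(best_subset), errors[best_subset])    # ['b', 'c'] 0.3
print(selected, errors[frozenset(selected)])       # ['a', 'c'] 0.8
```

Forward selection commits to `a` in the first round because it is the best single feature, so it never evaluates the pair `b` and `c`.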

### Why Use This Method?

Sequential feature selection searches efficiently for a **desirable middle ground**: a small, powerful feature subset that performs well without overfitting.