### What is feature selection

Feature selection is the process of choosing a subset of relevant input variables (features) from the original set, while discarding irrelevant, redundant, or noisy features, without transforming the features themselves.

### Feature selection vs feature engineering vs feature extraction

##### Feature selection

In feature selection, we select columns without changing their values. The goal is to keep features that are informative and non-redundant, and to remove those that are noisy or do not contribute meaningful information to the task.

To do this, we use techniques such as LASSO, which drives the coefficients of uninformative features to zero (or near zero) so that those features can be removed. Importantly, the original feature values remain unchanged; we only decide whether a column is kept or discarded.

By removing features, dimensionality is reduced, but dimensionality reduction is not the primary goal. The reduction may be small (removing one or two variables) or large (removing many variables), depending on the data. The main objective is relevance and non-redundancy, not compression.

##### Feature extraction

In feature extraction, we transform the data into a new feature space. Methods such as PCA and SVD are used to reshape the original variables into new features that better explain the structure or variance of the data.

These methods are explicit dimensionality reduction algorithms. The resulting features are combinations of the original variables, and therefore feature values change, and the original columns are no longer preserved.

Thus, dimensionality reduction methods are a subset of feature extraction methods, where the objective is to represent the data using fewer, transformed variables.
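
The contrast between selection and extraction can be sketched with NumPy. This is only an illustration on randomly generated data: the array `X`, the kept column indices, and the choice of two components are assumptions made for the example. PCA is computed here through an SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # hypothetical data: 100 samples, 4 features

# Feature selection: keep columns 0 and 2; their values are untouched.
X_selected = X[:, [0, 2]]

# Feature extraction (PCA via SVD): project the centered data onto the
# top-2 principal directions. Each new column mixes all four originals.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:2].T

print(X_selected.shape, X_extracted.shape)
```

Both results have two columns, but only `X_selected` still contains original measurements; the columns of `X_extracted` are new variables.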

##### Feature engineering

In feature engineering, we create new variables that did not originally exist in the dataset, typically using domain knowledge.

For example, given weight and height, we can create BMI as a new variable. These engineered features may improve model performance or interpretability and can coexist with the original features.
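
A minimal sketch of the BMI example, assuming weight is recorded in kilograms and height in metres. The records and the field names (`weight_kg`, `height_m`, `bmi`) are hypothetical.

```python
# Hypothetical records; weight in kilograms, height in metres.
people = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]

def add_bmi(record):
    """Return a copy of the record with an engineered 'bmi' field added."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return {**record, "bmi": round(bmi, 1)}

engineered = [add_bmi(p) for p in people]
print(engineered[0])
```

The original `weight_kg` and `height_m` fields are kept alongside the new one, so the engineered feature coexists with the features it was derived from.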

##### Dimensionality reduction

Dimensionality reduction is the process of reducing the number of variables (dimensions) in a dataset by transforming the data into a lower-dimensional space.

The primary goal is to reduce the dimension itself, typically to:

- lower computational cost,

- mitigate the curse of dimensionality,

- remove noise,

- enable visualization or faster learning.

To achieve this, dimensionality reduction methods change the feature values by constructing new variables that summarize or combine the original ones. As a result, the original features are not preserved.


#### Why feature selection is needed

1. Improves generalization

   - Irrelevant features increase variance

   - Removing them reduces overfitting

2. Removes redundancy

   - Highly correlated features provide duplicate information

   - Selection keeps one representative feature

3. Improves interpretability

   - Models become easier to explain and trust

   - Critical in healthcare, finance, and scientific studies

4. Reduces computational cost

   - Faster training and inference

   - Lower memory usage

5. Improves model stability

   - Fewer noisy features → more stable coefficients and predictions

6. Works well with limited data

   - When n ≪ d (far fewer samples than features), feature selection is often essential


#### Feature selection methods

Feature selection methods can be divided into three main categories, based on how the selection is performed and how much the learning model is involved.

- Filter methods
- Wrapper methods
- Embedded methods


#### 1. Filter methods

Filter methods select features independently of any machine learning model.
They rely on statistical properties of the data to measure the relevance of each feature.

- Features are ranked or scored using statistical criteria.

- Model training is not involved.

- Fast and scalable to high-dimensional data.

**Examples**

- Correlation coefficients

- Mutual information

- Chi-square test

- Variance threshold

**Pros**

- Computationally efficient

- Model-**agnostic** (_the method does not depend on any specific learning algorithm, so the selected features can be used with any model_)

- Good for initial screening

**Cons**

- Ignore feature interactions

- May select redundant features
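
A small filter can be sketched with the standard library alone: a variance threshold followed by ranking on absolute Pearson correlation with the target. The column names, the toy values, and the helper names `pearson` and `filter_select` are hypothetical, chosen only for illustration.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_select(columns, target, min_variance=1e-8, top_k=2):
    """Drop near-constant columns, then keep the top_k by |correlation|."""
    varying = {name: vals for name, vals in columns.items()
               if pvariance(vals) > min_variance}
    scores = {name: abs(pearson(vals, target))
              for name, vals in varying.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Hypothetical toy data.
columns = {
    "age":      [25, 32, 47, 51, 62],
    "constant": [1, 1, 1, 1, 1],
    "noise":    [3, 1, 4, 1, 5],
    "income":   [30, 42, 58, 61, 75],
}
target = [1.0, 1.4, 2.1, 2.2, 2.9]

selected, scores = filter_select(columns, target)
print(selected)
```

No model is trained at any point. In this toy data `age` and `income` are themselves strongly correlated, yet both are kept, which illustrates the redundancy drawback listed above: each feature is scored on its own.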


#### 2. Wrapper methods

Wrapper methods evaluate feature subsets by training and testing a model repeatedly. Feature selection is treated as a search problem over subsets. **Most of them use a greedy search**, because evaluating every possible subset is rarely feasible; exhaustive search is the exception.

- Model performance directly guides feature selection.

- Can capture feature interactions.

- Computationally expensive.

**Examples**

- Forward selection (add features one-by-one)

- Backward elimination (starts with all features and removes one or more each time)

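  - Simple backward elimination
    - uses a p-value criterion (the least significant feature is dropped at each step)
    - not tied to one learning algorithm, but needs a model that provides p-values (e.g., linear or logistic regression)
  - Recursive Feature Elimination (RFE)
    - ranks features by model-based importance (coefficients or feature importances)
    - model-dependent (requires a trained model)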

- Exhaustive feature search (evaluates every combination of features)

**Pros**

- Often higher predictive performance

- Accounts for feature interactions

**Cons**

- High computational cost

- Risk of overfitting on small datasets

- The selected features are tuned to the model used in the search and may not perform well with other ML models

**Stopping Criteria**

1. Predefined number of features

   - Stop when a specific number of features remains.

2. Performance threshold

   - Stop when model performance reaches a desired level.

   Example: Stop if validation accuracy ≥ 95%.

3. No further improvement

   - Stop when removing or adding more features does not improve performance.

   - Common when RFE is combined with validation: continue until removing the next feature degrades the validation score.

4. Maximum iterations

   - Stop after a fixed number of steps to prevent excessive computation.

5. Minimum improvement delta

   - Stop if the change in performance between iterations is below a small threshold (e.g., Δ accuracy < 0.001).
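
The following sketch shows greedy forward selection with two of the stopping criteria above: a predefined maximum number of features and a minimum improvement delta. It assumes ordinary least squares as the wrapped model and validation R² as the score; the helper names `validation_r2` and `forward_selection`, the synthetic data, and the default thresholds are all hypothetical.

```python
import numpy as np

def validation_r2(X_tr, y_tr, X_va, y_va, features):
    """Fit least squares on the chosen columns; return R^2 on validation."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, features]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, features]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    resid = y_va - A_va @ coef
    return 1.0 - (resid @ resid) / np.sum((y_va - y_va.mean()) ** 2)

def forward_selection(X_tr, y_tr, X_va, y_va, max_features=3, min_delta=1e-3):
    """Greedily add the feature that most improves the validation score."""
    selected, best = [], -np.inf
    remaining = list(range(X_tr.shape[1]))
    while remaining and len(selected) < max_features:   # predefined count
        scores = {j: validation_r2(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_delta:            # minimum delta
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic data: only columns 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.1 * rng.normal(size=300)

selected, score = forward_selection(X[:200], y[:200], X[200:], y[200:])
print(selected, round(float(score), 3))
```

Every candidate requires a model fit, so one pass over `d` remaining features costs `d` fits. This is where the computational cost of wrapper methods comes from.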


#### 3. Embedded methods

Embedded methods perform feature selection as part of the model training process itself.

- Selection happens during optimization.

- Balance efficiency and performance.

- Common in modern ML pipelines.

**Examples**

- LASSO (L1 regularization)

- Elastic Net

- Tree-based models (feature importance)

**Pros**

- Computationally efficient

- Model-aware selection

- Widely used in industry

**Cons**

- Model-dependent

- Selection tied to model assumptions
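
To make "selection happens during optimization" concrete, here is a minimal LASSO sketch using coordinate descent with soft-thresholding. It minimizes `(1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1` on centered data without an intercept. The function names, the synthetic data, and `alpha=0.1` are assumptions for illustration; in practice a library implementation such as scikit-learn's `Lasso` would normally be used instead.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Cyclic coordinate descent for the LASSO objective described above."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic data: only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)
X = X - X.mean(axis=0)
y = y - y.mean()

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = np.flatnonzero(w).tolist()   # features with nonzero coefficients
print(selected, np.round(w, 2))
```

The L1 penalty sets the coefficients of the uninformative columns to exactly zero while the model is being fitted, so the set of nonzero coefficients is the selected feature set. No separate search over subsets is needed.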
