# Dimensionality Reduction

**Avoiding Overfitting**

To prevent overfitting, several techniques are employed:
- Structural Risk Minimization
- Regularization
- Cross-Validation
    - Model Selection
- **Feature Selection**
- **Feature Extraction**

**Feature Selection vs. Feature Extraction**

- Feature **Selection:**
    - Selects a **subset** of the original feature set
    - Filter methods, wrapper methods, embedded methods

- Feature **Extraction:**
    - Applies a **transformation** to project features into a lower-dimensional space
    - PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), etc.

# Feature Selection

**Challenges in Real-World Data**

Datasets frequently contain:
- **Irrelevant features:**  
    Variables that provide no predictive signal (e.g., noise).

- **Redundant features:**  
    Highly correlated or duplicate measurements.

- **High-dimensionality:**  
    Number of features $d$ is very large (perhaps $d \gg n$), common in:
    - Text classification (e.g., bag-of-words representations).
    - Genomics (e.g., gene expression microarrays).

**Benefits of Feature Selection**

**Feature Selection** is a way to find **more accurate**, **faster**, and **easier to understand** classifiers.

**Effects:**
- **Improve Model Performance**
    - Reduces overfitting by eliminating noise
    - Mitigates the *curse of dimensionality*
    - Fewer features improve the sample-to-parameter ratio, enhancing generalization

- **Efficiency:**
    - Lowers memory usage, training time, and inference cost
    
- **Interpretability:**
    - Simplifies model explanations
    - Critical in domains like healthcare or finance

- **Noise Reduction:**
    - Directly improves test accuracy by removing non-informative features

<div style="text-align:center">
  <img src="../assets/noise_feature.png" alt="Noise Feature Example">
</div>

**Supervised Feature Selection:**

Given:
- A feature matrix $X \in \mathbb{R}^{N \times d}$ *(where $N$ is the number of samples and $d$ is the number of features)*
- A label vector $Y \in \mathbb{R}^N$.

Feature selection outputs a subset of indices $\{i_1, i_2, \ldots, i_{d'}\}$ (with $d' \ll d$) representing the most discriminative features.

$$
X = \begin{bmatrix} 
x_1^{(1)} & \cdots & x_d^{(1)} \\ 
\vdots & \ddots & \vdots \\ 
x_1^{(N)} & \cdots & x_d^{(N)} 
\end{bmatrix}, 
\quad 
Y = \begin{bmatrix} 
y^{(1)} \\ 
\vdots \\ 
y^{(N)} 
\end{bmatrix} \quad \xrightarrow{\text{Feature Selection}} \quad \{i_1, i_2, \ldots, i_{d'}\}
$$

# Filter: Univariate

## Introduction

**Univariate Method:**  
Evaluates each feature independently by considering its individual relationship with the target variable.

**Filter Method:**  
Ranks features or subsets *without* involving a classifier (preprocessing step).

**Advantage:**  
Computationally efficient and statistically scalable.

**Univariate Filter Method:**
- Score each feature $k$ based on the $k$-th column of the data matrix and the label vector
    - Score: relevance of the feature to predict labels: Can the feature discriminate the patterns of different classes?

- Rank features according to their score values and select the ones with the highest scores.
    - Use *cross-validation* to choose the number of selected features

**General Procedure**

- **Scoring:**  
    For each feature $X_k$, compute a score quantifying its relevance to predict labels $Y$ (e.g., correlation, mutual information).  
    *Key question:* How well does $X_k$ discriminate between classes?

- **Ranking:**  
    Sort features by their scores and select the $d'$ highest-scoring features.  
    *Optimization:* Use cross-validation to determine the optimal $d'$.
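
The scoring and ranking steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `rank_features` and `abs_mean_difference` are hypothetical, the score used here (gap between class means) is just a stand-in for any univariate criterion, and the labels are assumed to be 0/1. Choosing the number of kept features by cross-validation is not shown.

```python
def rank_features(X, y, score_fn):
    """Score every column of X against y and return indices by decreasing score."""
    d = len(X[0])
    scores = []
    for k in range(d):
        column = [row[k] for row in X]  # k-th column of the data matrix
        scores.append(score_fn(column, y))
    order = sorted(range(d), key=lambda k: scores[k], reverse=True)
    return order, scores


def abs_mean_difference(column, y):
    """Toy relevance score (an assumption): gap between the two class means."""
    pos = [x for x, label in zip(column, y) if label == 1]
    neg = [x for x, label in zip(column, y) if label == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


# Made-up data: feature 0 separates the classes, feature 1 does not.
X = [
    [0.9, 5.0, 0.1],
    [1.1, 4.0, 0.2],
    [3.0, 5.0, 0.1],
    [3.2, 4.0, 0.3],
]
y = [0, 0, 1, 1]

order, scores = rank_features(X, y, abs_mean_difference)
top_features = order[:1]  # keep the single highest-scoring feature
print(order)              # [0, 2, 1]
```

Any of the scoring criteria in the next section could be passed as `score_fn` in place of the toy score.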

## Scoring Criteria

### **Pearson Correlation**

**Formula:**
$$
R(k) = \frac{\text{cov}(X_k, Y)}{\sqrt{\text{var}(X_k)} \sqrt{\text{var}(Y)}} \approx \frac{\sum_{i=1}^N (x_k^{(i)} - \bar{x}_k)(y^{(i)} - \bar{y})}{\sqrt{\sum_{i=1}^N (x_k^{(i)} - \bar{x}_k)^2} \sqrt{\sum_{i=1}^N (y^{(i)} - \bar{y})^2}}
$$

**Goal:**  
Rank features by $|R(k)|$ and select those with the largest values, i.e., the features with the strongest (positive or negative) linear dependence on $Y$.

**Limitation:**  
Fails to capture nonlinear relationships.
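
As a rough sketch, the sample estimate of $R(k)$ can be computed directly from the formula above. The function name `pearson` and the example data are made up for illustration; returning 0 for a constant column is an assumption of this sketch, since the formula is undefined in that case. The second example shows the limitation: a feature that determines the target through a quadratic relationship still gets a correlation of zero.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation between one feature column and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # assumption: treat a constant column as uninformative
    return cov / (spread_x * spread_y)


# Linear dependence: correlation close to 1.
r_linear = pearson([2.0, 4.0, 6.0, 8.0, 10.0], [1.0, 2.0, 3.0, 4.0, 5.0])

# Nonlinear dependence (y = x^2 on a symmetric range): correlation is 0,
# although y is completely determined by x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_quadratic = pearson(xs, [x ** 2 for x in xs])
```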

### **Mutual Information**

**Recall: Mutual Information**
$$
I(k) = MI(X, Y) = H(Y) - H(Y | X) = \sum_{i} \sum_{j} p(x_i, y_j) \left(\log{\frac{p(x_i, y_j)}{p(x_i)p(y_j)}} \right)
$$
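Where $X$ denotes the feature being scored (i.e., $X_k$), $x_i$ and $y_j$ range over the values of the feature and the label, and: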
- Marginal Entropy:
    $$
    H(Y) = -\sum_{j} p(y_j) \log{p(y_j)}
    $$

- Conditional Entropy:
    $$
    H(Y | X) = \sum_{i} \sum_{j} p(x_i, y_j) \log{\frac{p(x_i)}{p(x_i, y_j)}}
    $$

Equivalently:
$$
MI(X, Y) = E_{X,Y} \left[ \log{\frac{P(X, Y)}{P(X)P(Y)}} \right]
$$

**Advantage:** Captures both linear and nonlinear dependencies.
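
For discrete features and labels, the double sum can be estimated from empirical frequencies. The sketch below is illustrative only: `mutual_information` is a hypothetical helper name, the logarithm is taken in base 2 (so the result is in bits, a choice the notes leave open), and continuous features would first need binning or a density estimate, which is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of MI(X, Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x_i, y_j) pair
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


labels = [0, 0, 1, 1]
mi_copy = mutual_information([0, 0, 1, 1], labels)         # feature equals the label
mi_independent = mutual_information([0, 1, 0, 1], labels)  # feature independent of the label

# Nonlinear dependence (y = x^2): Pearson correlation would be 0 here,
# but the mutual information is clearly positive.
xs = [-2, -1, 0, 1, 2]
mi_quadratic = mutual_information(xs, [x ** 2 for x in xs])
```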

### **KL Divergence for Irrelevance Testing**

#### **Key Definitions:**
- **$X_k$**: The $k$-th feature component in input vectors (e.g., $[x_1, x_2, \dots, x_d]$).
- **$Y$**: Binary labels ($Y \in \{1, -1\}$).

#### **Irrelevance Condition:**
A feature $X_k$ is **irrelevant** for predicting $Y$ if:  
$$
P(X_k | Y = 1) = P(X_k | Y = -1).
$$  
This means the distribution of $X_k$ does not depend on $Y$, so $X_k$ provides no predictive information.

#### **Measuring Relevance with KL Divergence:**
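- The Kullback-Leibler (KL) divergence $D_{KL}(P \| Q)$ measures how much a probability distribution $P$ differs from a distribution $Q$. It is not a true distance, because it is not symmetric in $P$ and $Q$.  
For distributions over a finite set of values $\mathcal{X}$, it is defined as:
$$
D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log{\frac{P(x)}{Q(x)}}
$$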
- To quantify relevance, use the **symmetric KL divergence**:
$$
d(k) = D_{KL}(P(X_k | Y{=}1) \| P(X_k | Y{=}{-1})) + D_{KL}(P(X_k | Y{=}{-1}) \| P(X_k | Y{=}1))
$$
- **$d(k) = 0$**: $X_k$ is irrelevant (distributions are identical).  
- **$d(k) > 0$**: $X_k$ is relevant (distributions differ).

#### **Density Context:**
- "Density" refers to the conditional probability distributions:
  - $P(X_k | Y = 1)$ (density under class 1).
  - $P(X_k | Y = -1)$ (density under class -1).

#### **Summary:**
- **Relevance**: $d(k) > 0$ implies $X_k$ is useful for prediction.  
- **Irrelevance**: $d(k) = 0$ implies $X_k$ can be ignored.
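
For a discrete feature, $d(k)$ can be estimated by plugging empirical class-conditional frequencies into the definition. The sketch below makes two assumptions that are not part of the notes: add-one smoothing (`alpha=1.0`) is used so that no estimated probability is zero, and the helper names are made up. With finite samples the estimate will rarely be exactly zero for an irrelevant feature, so in practice $d(k)$ is used as a ranking score rather than a strict test.

```python
import math
from collections import Counter


def conditional_distribution(column, y, label, values, alpha=1.0):
    """Estimate P(X_k = v | Y = label) with add-alpha smoothing (an assumption)."""
    counts = Counter(x for x, target in zip(column, y) if target == label)
    total = sum(counts.values()) + alpha * len(values)
    return {v: (counts[v] + alpha) / total for v in values}


def kl_divergence(p, q):
    """D_KL(P || Q) for two distributions given as dicts over the same values."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)


def symmetric_kl_score(column, y):
    """d(k): symmetric KL divergence between the two class-conditional distributions."""
    values = sorted(set(column))
    p_pos = conditional_distribution(column, y, 1, values)
    p_neg = conditional_distribution(column, y, -1, values)
    return kl_divergence(p_pos, p_neg) + kl_divergence(p_neg, p_pos)


y = [1, 1, 1, 1, -1, -1, -1, -1]
relevant = [1, 1, 1, 0, 0, 0, 0, 1]    # mostly 1 under Y=1, mostly 0 under Y=-1
irrelevant = [1, 0, 1, 0, 1, 0, 1, 0]  # same distribution under both classes

d_relevant = symmetric_kl_score(relevant, y)
d_irrelevant = symmetric_kl_score(irrelevant, y)
```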

## Disadvantages

**Redundant Features:**
- **Problem:**
    - Fails to detect redundant features (variables providing identical/similar information).
    - Example: If two features $X_1$ and $X_2$ are perfectly correlated and relevant, univariate filters will rank both highly even though the second adds no new information.
- **Key Challenge:**
    - Correlation alone cannot distinguish between relevance and redundancy

**Complementary Features:**
- **Problem:**
    - Ignores *feature interactions*: Pairs of weakly correlated features may jointly improve prediction.

- **Key Challenge:**
    - Univariate methods cannot evaluate synergistic effects between features.

<div style="text-align:center">
  <img src="../assets/complementary_attributes.png" alt="complementary attributes example">
</div>
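
A standard illustration of complementary features is the XOR pattern (used here as an assumed example; it is not necessarily what the figure shows). Each feature on its own has zero correlation with the label, so a univariate filter would discard both, yet together they determine the label exactly. The helper names below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation (returns 0 for a constant input, by assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread if spread > 0 else 0.0


def determines_label(patterns, labels):
    """True if every distinct feature pattern always comes with the same label."""
    seen = {}
    for pattern, label in zip(patterns, labels):
        if seen.setdefault(pattern, label) != label:
            return False
    return True


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]  # XOR of the two features: [0, 1, 1, 0]

x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]

r1 = pearson(x1, labels)  # 0.0: feature 1 alone looks useless
r2 = pearson(x2, labels)  # 0.0: feature 2 alone looks useless

alone = determines_label([(a,) for a in x1], labels)  # False
together = determines_label(rows, labels)             # True
```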

# Multivariate

Search in the space of all possible combinations of features.

**Core Challenges: Feature Subset Search**
- **Search Space:** For $d$ features, $2^d$ possible subsets.

- **Complexity:**
    - *Computational:* Evaluating all subsets is infeasible for large $d$
    - *Statistical:* Risk of overfitting when testing many combinations

**Multivariate Feature Selection:**
- **Wrapper Method:**
    - Uses a classifier to evaluate the score of features or feature subsets.
    - Accurate for specific models
    - Training $2^d$ classifiers is infeasible for large $d$.
    - Most wrapper algorithms use a heuristic search.

- **Filter Method:**
    - Evaluates subsets via statistical measures
    - Cheaper to compute than the performance of the classifier
    - May ignore model-specific interactions

## General Procedure

**Input:**
- OriginalFeatureSet
- EvaluationFunction
- StoppingCriterion

**Output:**
- BestFeatureSubset

> bestSubset $\leftarrow$ OriginalFeatureSet  
> bestScore $\leftarrow$ $-\infty$  
> repeat  
> &nbsp;&nbsp;&nbsp;&nbsp;candidateSubset $\leftarrow$ GenerateNextSubset(OriginalFeatureSet)  
> &nbsp;&nbsp;&nbsp;&nbsp;score $\leftarrow$ EvaluateSubset(candidateSubset, EvaluationFunction)  
> &nbsp;&nbsp;&nbsp;&nbsp;if score > bestScore then  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bestScore $\leftarrow$ score  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bestSubset $\leftarrow$ candidateSubset  
> until StoppingCriterionMet(bestSubset, bestScore, StoppingCriterion)  
> Validate(bestSubset)  
> return bestSubset  

Where:
- **Subset Generation:**  
    - Select a candidate feature subset for evaluation
    - *(Generation strategies are discussed later, under Subset Selection or Generation)*
    
- **Subset Evaluation:**  
    - **Wrapper:** Train a model on the subset; use its performance (e.g., accuracy) as the score.
    - **Filter:** Compute relevance scores (e.g., multivariate mutual information).
    
- **Stopping criterion:**  
    - Decides when to stop the search in the space of feature subsets
    
- **Validation:**  
    - Verify selected subset on holdout data or via cross-validation.

**Stopping Criteria**

- Predefined **number of features** is selected
- Predefined **number of iterations** is reached
- Addition (or deletion) of any feature does **not result in a better subset**
- An **optimal subset** (according to the evaluation criterion) is obtained.
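
The general procedure can be written as a small Python function with the generation and evaluation steps passed in as callables. This is only a sketch under stated assumptions: the stopping criterion is a fixed iteration budget, candidates are drawn at random, `toy_score` is a made-up evaluation function that rewards features 0 and 2 and penalizes subset size, and the final validation step is omitted.

```python
import random


def search_feature_subset(features, evaluate, generate_next, max_iterations):
    """Generic subset search: generate, evaluate, keep the best, stop on a budget."""
    best_subset = tuple(features)
    best_score = float("-inf")
    for _ in range(max_iterations):  # stopping criterion: number of iterations
        candidate = generate_next(features)
        score = evaluate(candidate)
        if score > best_score:
            best_score, best_subset = score, candidate
    return best_subset, best_score


rng = random.Random(0)  # fixed seed so the run is repeatable


def random_subset(features):
    """Hypothetical generator: include each feature with probability 0.5."""
    return tuple(f for f in features if rng.random() < 0.5)


def toy_score(subset):
    """Hypothetical evaluation: features 0 and 2 are useful, every feature costs 0.1."""
    chosen = set(subset)
    return len(chosen & {0, 2}) - 0.1 * len(chosen)


best_subset, best_score = search_feature_subset(range(4), toy_score, random_subset, 50)
```

Swapping `random_subset` for a forward or backward generator, or `toy_score` for a filter or wrapper criterion, gives the variants discussed in the following sections.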


## Filters vs. Wrappers

### **Filters**

**Evaluation criteria:**
- **Euclidean distance:**
    - **Class separability:** Maximize intra-class similarity, inter-class dissimilarity.

- **Information Gain:**
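    - Select features maximizing entropy reduction: $S_{IG}$
        $$S_{IG} = \arg\max_{i} \text{Gain}(S_i, A)$$

    - Where $S$ is a set of samples, $A$ is an attribute, and $S_v$ is the subset of $S$ on which $A$ takes the value $v$:
        $$\text{Gain} (S, A) = H_S(Y) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)$$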

- **Dependency (correlation coefficient):**
    - Good feature subsets contain features highly correlated with the class, yet uncorrelated with each other

- **Consistency (min-features bias):**
    - Selects features that guarantee no inconsistency in data
        - *inconsistent instances have the same feature vector but different class labels*
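
Two of these criteria are easy to sketch for discrete data. `information_gain` follows the $\text{Gain}(S, A)$ formula for a single attribute, and `inconsistency_count` uses one common way of counting inconsistencies (for each distinct pattern, the instances outside its majority class), which is an assumption of this sketch rather than a definition given in the notes. All names and data are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """H(Y) in bits for a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Gain(S, A): entropy of the labels minus the weighted entropy after splitting on A."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)  # S_v: labels of the samples where A = v
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def inconsistency_count(rows, labels, subset):
    """Instances that share a projected feature pattern but fall outside its majority class."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[k] for k in subset)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]  # the label equals feature 0

gain_f0 = information_gain([r[0] for r in rows], labels)  # 1.0 bit
gain_f1 = information_gain([r[1] for r in rows], labels)  # 0.0 bits
consistent = inconsistency_count(rows, labels, (0,))      # 0: feature 0 alone is consistent
inconsistent = inconsistency_count(rows, labels, (1,))    # 2: feature 1 alone is not
```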

**Key Properties**
- **Fast execution:**  
    Computationally efficient (no classifier training).

- **Generality:**  
    Evaluates intrinsic properties of the data rather than their interaction with a particular classifier, so the selected features tend to be “good” for a larger family of classifiers.

- **Tendency to select large subsets:**  
    Their objective functions are generally monotonic in the subset size, so they tend to favor the full feature set.

### **Wrappers**

**How It Works**  
For each feature subset, train a classifier on the training data and assess its performance with an evaluation technique such as cross-validation.

**Key Properties**

- **Slow execution:**  
    Must train a classifier for each feature subset (or several trainings if cross-validation is used)

- **Lack of generality:**  
    The solution lacks generality since it is tied to the bias of the classifier used in the evaluation function.

- **Ability to generalize:**  
    Since they typically use cross-validation measures to evaluate classification accuracy, they have a mechanism to avoid overfitting.

- **Accuracy:**  
    Generally achieve better recognition rates than filters since they find a proper feature set for the intended classifier.
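
A wrapper score can be sketched as the cross-validated accuracy of a classifier trained only on the candidate subset. The classifier here is a nearest-centroid rule, chosen purely because it fits in a few lines; the notes do not prescribe a classifier, and the function names, fold assignment, and data are all assumptions of this sketch.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose training mean is closest to x (squared Euclidean distance)."""
    centroids = {}
    for label in set(train_y):
        members = [row for row, target in zip(train_X, train_y) if target == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def wrapper_score(X, y, subset, n_folds=4):
    """Cross-validated accuracy of the classifier restricted to the features in `subset`."""
    projected = [[row[k] for k in subset] for row in X]
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))  # every n_folds-th sample is held out
        train_X = [projected[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        for i in test_idx:
            correct += nearest_centroid_predict(train_X, train_y, projected[i]) == y[i]
    return correct / len(X)


# Made-up data: feature 0 separates the classes, feature 1 is noise.
X = [
    [0.0, 5.0], [0.2, 1.0], [0.1, 4.0], [0.3, 0.0],
    [1.0, 5.1], [1.2, 0.9], [0.9, 4.1], [1.1, 0.2],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

score_informative = wrapper_score(X, y, (0,))
score_noise = wrapper_score(X, y, (1,))
```

Each call trains the classifier once per fold, which is why wrappers are slow when many subsets have to be scored.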

## Subset Selection or Generation

### **Search Direction**

- **Forward**
- **Backward**
- **Random**

### **Search Strategies**

#### **Exhaustive - Complete**

**Approach:**  
Evaluate all possible feature subsets (total combinations: $2^d$).

**Properties:**
- Optimal subset is achievable
- Too expensive if $d$ is large

**Methods:**
- Branch & Bound
- Best first
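
A brute-force version of the complete search simply enumerates every non-empty subset. The sketch below uses `itertools.combinations` and a made-up `toy_score`; it does not implement Branch & Bound or best-first search, which prune this enumeration instead of visiting every subset.

```python
from itertools import combinations


def all_subsets(d):
    """Yield every non-empty subset of feature indices 0..d-1 (2^d - 1 subsets)."""
    for size in range(1, d + 1):
        yield from combinations(range(d), size)


def exhaustive_search(d, evaluate):
    """Return the subset with the highest evaluation score."""
    return max(all_subsets(d), key=evaluate)


def toy_score(subset):
    """Hypothetical evaluation: features 0 and 2 are useful, every feature costs 0.1."""
    chosen = set(subset)
    return len(chosen & {0, 2}) - 0.1 * len(chosen)


n_subsets = sum(1 for _ in all_subsets(4))  # 15 = 2^4 - 1
best = exhaustive_search(4, toy_score)      # (0, 2)
```

The subset count doubles with each added feature, so this approach is only practical for small $d$.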

#### **Heuristic**

**Approach:**  
Guided incremental selection/elimination.

**Properties:**
- Incremental generation of subsets
- Smaller search space and thus faster search
- May miss optimal subsets

**Methods:**
- Sequential forward selection
- Sequential backward elimination
- Plus-l Minus-r Selection
- Bidirectional Search
- Sequential floating Selection
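
Sequential forward selection is the simplest of these to sketch: start from the empty set and greedily add the feature that improves the score most, stopping when no addition helps. The evaluation function below is made up so that features 1 and 3 are only useful together; it shows how the greedy search can miss the optimal subset.

```python
def sequential_forward_selection(d, evaluate, max_features):
    """Greedy forward search: add one feature at a time while the score improves."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        remaining = [k for k in range(d) if k not in selected]
        if not remaining:
            break
        # Score every one-feature extension of the current subset.
        score, k = max((evaluate(selected + [k]), k) for k in remaining)
        if score <= best_score:
            break  # adding any feature does not give a better subset
        selected.append(k)
        best_score = score
    return selected, best_score


def toy_score(subset):
    """Hypothetical evaluation: feature 0 helps alone; 1 and 3 only help as a pair."""
    chosen = set(subset)
    score = 0.0
    if 0 in chosen:
        score += 1.0
    if {1, 3} <= chosen:
        score += 3.0
    return score - 0.1 * len(chosen)


selected, greedy_score = sequential_forward_selection(4, toy_score, max_features=4)
optimal_score = toy_score([0, 1, 3])
# The greedy search stops at [0], because neither 1 nor 3 helps on its own,
# while the subset {0, 1, 3} scores much higher.
```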

#### **Non-deterministic**

**Approach:**  
No predefined way to select feature candidates (i.e., a probabilistic approach).  
Randomness helps the search escape local optima.

**Properties:**
- Quality of the subset found depends on the number of trials
- Need more user-defined parameters
- Handles more complex interactions

**Methods:**
- Simulated annealing
- Genetic algorithm
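
A bare-bones simulated annealing search over feature subsets might look like the following. Everything specific here is an assumption of the sketch: the neighbour move (flip one feature in or out), the linear cooling schedule, the parameter defaults, and the made-up `toy_score`. The user-defined parameters (`n_steps`, `start_temp`, `seed`) illustrate why these methods need more tuning than the heuristic ones.

```python
import math
import random


def toy_score(subset):
    """Hypothetical evaluation: features 0 and 2 are useful, every feature costs 0.1."""
    chosen = set(subset)
    return len(chosen & {0, 2}) - 0.1 * len(chosen)


def simulated_annealing(d, evaluate, n_steps=500, start_temp=1.0, seed=0):
    """Random local search that sometimes accepts a worse subset to escape local optima."""
    rng = random.Random(seed)
    current = {k for k in range(d) if rng.random() < 0.5}
    current_score = evaluate(current)
    best, best_score = set(current), current_score
    for step in range(n_steps):
        # Linear cooling (an assumption); the small constant avoids division by zero.
        temperature = start_temp * (1.0 - step / n_steps) + 1e-9
        neighbour = current ^ {rng.randrange(d)}  # flip one feature in or out
        score = evaluate(neighbour)
        # Worse moves are accepted with probability exp(delta / temperature).
        accept_prob = math.exp(min(0.0, score - current_score) / temperature)
        if rng.random() < accept_prob:
            current, current_score = neighbour, score
            if score > best_score:
                best, best_score = set(neighbour), score
    return sorted(best), best_score


best_subset, best_score = simulated_annealing(4, toy_score)
```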

# Summary

**Feature Selection Categorization:**
- **Univariate method:**  
    Considers one variable (feature) at a time.
    
- **Multivariate method:**  
    Considers subsets of features together.
    

**Another Feature Selection Categorization:**
- **Filter method:**  
    Ranks features or feature subsets independent of the classifier as a preprocessing step.

- **Wrapper method:**  
    Uses a classifier to evaluate the score of features or feature subsets.

- **Embedded method:**  
    Feature selection is done during the training of a classifier  
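    *E.g., adding a regularization term $\|w\|_1$ or $\|w\|_2$ to the cost function of a linear classifier. The $\|w\|_1$ penalty drives some weights exactly to zero and so discards the corresponding features, whereas $\|w\|_2$ only shrinks weights toward zero.*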
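
The effect of the two penalties can be seen in a small coordinate-descent sketch. To keep it short, it fits a linear model with squared loss rather than a classifier, which is a simplification; the function names, the penalty strength `lam`, and the data are all made up. With the L1 penalty the weight of the nearly irrelevant feature becomes exactly zero, while the squared L2 penalty only shrinks it.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def coordinate_descent(X, y, lam, penalty, n_sweeps=100):
    """Minimize (1/2n)||y - Xw||^2 plus lam*||w||_1 ("l1") or (lam/2)*||w||_2^2 ("l2")."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][m] * w[m] for m in range(d) if m != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "l1":
                w[j] = soft_threshold(rho, lam) / z
            else:
                w[j] = rho / (z + lam)
    return w


# Made-up data: y depends strongly on feature 0 and only very weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.05, 1.95, -1.95, -2.05]

w_l1 = coordinate_descent(X, y, lam=0.1, penalty="l1")  # second weight is exactly 0.0
w_l2 = coordinate_descent(X, y, lam=0.1, penalty="l2")  # second weight is small but nonzero
selected = [j for j, weight in enumerate(w_l1) if weight != 0.0]  # [0]
```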