<img src="./notebooks/ml/assets/regularization.png" alt="Regularization Path" width="600" height="150">

## L0 Norm Definition
**Front:** What is the mathematical definition of the L0 "norm"? <br/>
**Back:** $$ \|\theta\|_0 = \sum_{j=1}^{p} \mathbb{1}(\theta_j \neq 0) $$
It counts the number of non-zero parameters in the model, where $\mathbb{1}$ is the indicator function (1 if true, 0 if false).
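
A minimal sketch of the count in NumPy follows. The helper name `l0_norm` and the optional tolerance `tol` are illustrative assumptions, not part of the definition above. The last line hints at why "norm" is written in quotes: scaling the vector does not scale the count.

```python
import numpy as np

def l0_norm(theta, tol=0.0):
    """Count entries with magnitude above tol (tol=0 matches the exact definition)."""
    theta = np.asarray(theta, dtype=float)
    return int(np.sum(np.abs(theta) > tol))

theta = np.array([0.0, 1.5, 0.0, -2.0, 1e-12])

print(l0_norm(theta))            # counts 1.5, -2.0 and the tiny 1e-12 entry
print(l0_norm(theta, tol=1e-8))  # a tolerance (assumption) treats 1e-12 as zero
print(l0_norm(2 * theta) == l0_norm(theta))  # scaling leaves the count unchanged
```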

## L0 Regularization Objective
**Front:** What does the L0-regularized loss function look like? <br/>
**Back:** $$ J(\theta) = L(\theta) + \lambda \|\theta\|_0 $$
Where $L(\theta)$ is the original loss (e.g., MSE), $\lambda$ is the regularization strength, and $\|\theta\|_0$ counts non-zero parameters. Each additional non-zero parameter costs $\lambda$ in the objective.
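
As a rough illustration, the objective can be evaluated directly. This sketch assumes MSE as the loss and uses a made-up three-row dataset; the function name `l0_objective` is hypothetical.

```python
import numpy as np

def l0_objective(theta, X, y, lam):
    """MSE loss plus lam times the number of non-zero parameters."""
    residual = y - X @ theta
    mse = np.mean(residual ** 2)
    return mse + lam * np.count_nonzero(theta)

# Toy data (assumption): y is reproduced exactly by the first feature alone.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])

sparse = np.array([1.0, 0.0])    # one non-zero parameter
dense = np.array([1.0, 1e-3])    # a tiny second coefficient still costs a full lam

print(l0_objective(sparse, X, y, lam=0.1))
print(l0_objective(dense, X, y, lam=0.1))
```

The penalty depends only on whether a coefficient is non-zero, not on its size, so the nearly identical `dense` vector pays twice the penalty.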

## Computational Complexity
**Front:** Why is exact L0 regularization computationally challenging? <br/>
**Back:** Exact L0 regularization is a combinatorial optimization problem: with $p$ features there are $2^p$ candidate subsets. Best subset selection for linear regression is NP-hard in general, so exhaustive search is feasible only for small $p$, and larger problems rely on heuristics or approximations.
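
To get a feel for the growth, the subset counts can be tabulated. The values of `p` below are arbitrary choices for illustration.

```python
from math import comb

def n_subsets(p):
    """Number of feature subsets of p features, including the empty set."""
    return 2 ** p

for p in (10, 20, 40):
    # Total subsets, and the count at the most populous size k = p // 2.
    print(p, n_subsets(p), comb(p, p // 2))
```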

## Geometric Interpretation
**Front:** What geometric shape represents the L0 constraint region? <br/>
**Back:** The L0 constraint $\|\theta\|_0 \leq k$ forms a union of $k$-dimensional coordinate subspaces. In 2D with $k=1$: the two axes. In 3D: the three axes for $k=1$, the three coordinate planes for $k=2$. Unlike L1's diamond or L2's circle, this set is not convex, which makes optimization much harder.
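
The non-convexity can be checked numerically with a small sketch: two points on different axes both satisfy the constraint with $k=1$, but their midpoint does not. The helper `in_l0_ball` is a hypothetical name.

```python
import numpy as np

def in_l0_ball(v, k):
    """True if v has at most k non-zero entries."""
    return np.count_nonzero(v) <= k

a = np.array([1.0, 0.0])   # on the first axis
b = np.array([0.0, 1.0])   # on the second axis
mid = 0.5 * (a + b)        # convex combination of a and b

print(in_l0_ball(a, 1), in_l0_ball(b, 1), in_l0_ball(mid, 1))
```

A convex set would have to contain `mid`, so the failed check is a direct counterexample.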

## L0 vs L1 Comparison
**Front:** How does L0 differ fundamentally from L1 regularization? <br/>
**Back:** L0 directly counts non-zero parameters (true sparsity), while L1 sums absolute values (a convex surrogate for sparsity). L1 is convex and computationally efficient; L0 is non-convex and NP-hard. L1 does set coefficients exactly to zero, but it also shrinks the surviving coefficients and is not guaranteed to select the same features as the L0 solution.

## Best Subset Selection
**Front:** What is best subset selection and how does it relate to L0? <br/>
**Back:** Best subset selection explicitly solves L0 regularization by evaluating all $2^p$ feature subsets and choosing the best one for each model size k. It's the exact implementation of L0 regularization but limited to p ≤ ~40 due to computational constraints.
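
A brute-force sketch for a single model size k is shown below, assuming ordinary least squares without an intercept and residual sum of squares (RSS) as the selection criterion. The function name `best_subset` and the synthetic noiseless data are assumptions for illustration.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Return (rss, support) for the size-k subset with the lowest least-squares RSS."""
    p = X.shape[1]
    best_rss, best_support = np.inf, None
    for support in itertools.combinations(range(p), k):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_support = rss, support
    return best_rss, best_support

# Synthetic data (assumption): noiseless response built from features 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4]

rss, support = best_subset(X, y, k=2)
print(support, rss)
```

The loop runs one fit per size-k subset, which is where the exponential cost comes from once every k is considered.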

## Stepwise Approximation
**Front:** How do stepwise selection methods approximate L0 regularization? <br/>
**Back:** Forward stepwise adds the best feature at each step; backward stepwise removes the worst. Both greedily approximate the optimal subset without evaluating all $2^p$ combinations, providing practical but suboptimal solutions to L0 regularization.
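
Forward stepwise can be sketched as follows, again assuming least squares without an intercept and RSS as the criterion. The names `rss_for` and `forward_stepwise` are hypothetical, and the identity design is chosen only to keep the example easy to follow.

```python
import numpy as np

def rss_for(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns."""
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return float(np.sum((y - X[:, cols] @ coef) ** 2))

def forward_stepwise(X, y, k):
    """Greedily add, k times, the feature that lowers RSS the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best_j = min(remaining, key=lambda j: rss_for(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy design (assumption): orthonormal columns, response uses features 1 and 4.
X = np.eye(6)
y = 3.0 * X[:, 1] - 2.0 * X[:, 4]

print(forward_stepwise(X, y, k=2))
```

Each step fits at most $p$ models, so selecting k features costs on the order of $k \cdot p$ fits. The greedy choice is never revisited, which is why the result can be suboptimal on correlated designs.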

## Bayesian Interpretation
**Front:** What Bayesian prior corresponds to L0 regularization? <br/>
**Back:** The spike-and-slab prior: a mixture of a point mass at zero (spike) and a continuous distribution (slab). This explicitly models whether each parameter is exactly zero (excluded) or follows some distribution (included), directly implementing L0-style sparsity.
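
Drawing from such a prior can be sketched as below. The Bernoulli inclusion probability and the Gaussian slab are assumptions; other slab distributions are equally valid, and the function name is hypothetical.

```python
import numpy as np

def sample_spike_and_slab(p, inclusion_prob, slab_scale, rng):
    """Draw p parameters: exactly zero (spike) or Gaussian (slab, an assumption)."""
    included = rng.random(p) < inclusion_prob          # which parameters are "on"
    slab = rng.normal(0.0, slab_scale, size=p)         # values used when included
    theta = np.where(included, slab, 0.0)
    return theta, included

rng = np.random.default_rng(0)
theta, included = sample_spike_and_slab(p=10, inclusion_prob=0.3, slab_scale=2.0, rng=rng)

print(theta)
print(np.count_nonzero(theta), int(included.sum()))
```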

## Practical Applications
**Front:** When would you use L0 regularization despite its computational cost? <br/>
**Back:** When model interpretability or deployment constraints require strict feature limits (e.g., medical diagnosis with limited tests, edge devices with memory constraints, or when each feature has significant measurement cost).

## Optimization Challenges
**Front:** Why can't we use gradient descent for L0 regularization? <br/>
**Back:** The L0 "norm" is discontinuous and non-differentiable: its gradient is zero wherever it is defined and undefined where a coordinate crosses zero, so it gives no descent direction. More fundamentally, it's a combinatorial counting problem, not a continuous optimization problem. Specialized algorithms like matching pursuit or Bayesian methods are needed.
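
A finite-difference check makes the problem concrete. This is a sketch with hypothetical helper names; the test point is arbitrary.

```python
import numpy as np

def l0(theta):
    return np.count_nonzero(theta)

def finite_diff_grad(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta = np.array([0.7, -1.2, 3.0])
print(finite_diff_grad(l0, theta))   # no coordinate changes the count locally

# The penalty jumps as soon as a coordinate leaves zero, however slightly.
print(l0(np.array([0.0])), l0(np.array([1e-12])))
```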

## Regularization Path Behavior
**Front:** How does the L0 regularization path differ from L1/L2 paths? <br/>
**Back:** The L0 path shows discrete jumps as features enter/exit the model. Unlike L1's smooth coefficient shrinkage or L2's gradual decay, L0 coefficients jump from 0 to some value or vice versa, with model size changing by integer counts.
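
The jump is easiest to see in the one-dimensional case. Assuming the loss $\frac{1}{2}(z - \theta)^2$ for an unpenalized estimate $z$ (an orthonormal-design simplification), the L0 solution keeps $z$ unchanged when $|z| > \sqrt{2\lambda}$ and zeroes it otherwise, while the L1 solution shrinks every coefficient by $\lambda$. The function names below are hypothetical.

```python
import numpy as np

def hard_threshold(z, lam):
    """Minimizer of 0.5*(z - t)**2 + lam*(t != 0): keep z or drop it entirely."""
    return np.where(np.abs(z) > np.sqrt(2.0 * lam), z, 0.0)

def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - t)**2 + lam*|t|: shrink toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-3.0, -1.0, 0.5, 2.0])
lam = 1.0

print(hard_threshold(z, lam))  # survivors keep their full value
print(soft_threshold(z, lam))  # survivors are shrunk by lam
```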

## Information Criteria Connection
**Front:** How are AIC and BIC related to L0 regularization? <br/>
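**Back:** AIC and BIC can be viewed as L0-regularized objectives, with the loss taken as $-2\log\hat{L}$ (twice the negative maximized log-likelihood): <br/>
AIC: $-2\log\hat{L} + 2\|\theta\|_0$ <br/>
BIC: $-2\log\hat{L} + \log(n)\|\theta\|_0$ <br/>
Both penalize model complexity by counting parameters. BIC's per-parameter penalty exceeds AIC's whenever $\log(n) > 2$, i.e. for $n \geq 8$.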
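
For a concrete case, assume a Gaussian linear model with unknown noise variance, where $-2\log\hat{L}$ reduces to $n\log(\text{RSS}/n)$ up to an additive constant. Under that assumption the two criteria can be sketched as follows; the input numbers are made up.

```python
import numpy as np

def aic(rss, n, k):
    """AIC up to an additive constant, Gaussian linear model (assumption)."""
    return n * np.log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC up to an additive constant, Gaussian linear model (assumption)."""
    return n * np.log(rss / n) + np.log(n) * k

# Hypothetical fit: 100 observations, RSS of 50, 3 non-zero parameters.
print(aic(50.0, n=100, k=3))
print(bic(50.0, n=100, k=3))
```

The two differ only in the per-parameter price, 2 versus $\log(n)$.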

## Recent Advances
**Front:** What modern methods approximate L0 regularization efficiently? <br/>
**Back:**
1. **Learned Thresholding**: Iterative thresholding algorithms (e.g., iterative hard thresholding)
2. **Stochastic Gates**: Continuous relaxations using hard concrete distributions
3. **SparseMAP**: Sparse posterior inference methods
4. **Continuous Relaxations**: Non-convex but continuous penalty functions (e.g., SCAD, MCP)
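
As one example of the iterative thresholding family, a plain (non-learned) iterative hard thresholding loop for least squares can be sketched as below. The step size $1/\sigma_{\max}(X)^2$, the iteration count, and the synthetic data are assumptions. Convergence to the best subset is not guaranteed in general, but every iterate has at most k non-zeros.

```python
import numpy as np

def keep_top_k(v, k):
    """Zero out all but the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def iht(X, y, k, n_iter=200):
    """Gradient step on 0.5*||y - X @ theta||^2, then projection onto k-sparse vectors."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = keep_top_k(theta - step * grad, k)
    return theta

# Synthetic data (assumption): noiseless response from features 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4]

theta_hat = iht(X, y, k=2)
print(theta_hat)
```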

## Subset Selection Algorithms
**Front:** What are the three main approaches to subset selection (L0 optimization)? <br/>
**Back:** 
1. **Filter Methods**: Select features independently of model (e.g., correlation)
2. **Wrapper Methods**: Evaluate subsets using model performance (e.g., forward selection)
3. **Embedded Methods**: Feature selection during model training (e.g., L1 regularization)

L0 is typically implemented via wrapper methods.

## Sparsity-Difficulty Tradeoff
**Front:** Why is achieving exact sparsity (L0) more difficult than approximate sparsity (L1)? <br/>
**Back:** L0 requires solving discrete optimization (feature selection), while L1 solves continuous optimization (shrinkage). The combinatorial nature of "which features" vs. "how much weight" makes L0 fundamentally harder despite seeming conceptually simpler.