This video connects three ideas: classical bias–variance, the modern “double descent” phenomenon, and the surprising way stochastic gradient descent (SGD) behaves like a built‑in regularizer (“implicit regularization”), especially in over‑parameterized models.

***

## Classical view: complexity vs train/test error

In the classical bias–variance picture:

- As **model complexity** goes up (more parameters, more flexible shape), **training error** typically goes down steadily.  
- **Test/validation error** usually:
  - Decreases at first (less bias, better fit).  
  - Reaches a “sweet spot.”  
  - Then **increases** as the model starts to overfit (capturing noise).

The Belkin et al. 2019 paper reframes this in terms of:

- **Risk** (expected error) instead of test error.  
- **Capacity** instead of complexity.

Conceptually it is the same shape: a U‑shaped test error curve with a single optimal complexity.
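For squared‑error loss, this classical picture rests on the standard decomposition of expected test error at a point \(x\). The notation is introduced here for illustration and is not from the video: assume \(y = f(x) + \varepsilon\) with noise variance \(\sigma^2\), and let \(\hat f\) be the model fitted on a randomly drawn training set.

\[
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big[\hat f(x)\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
\]

Raising complexity usually shrinks the bias term and inflates the variance term, which is what produces the U shape.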

***

## Modern phenomenon: double descent

In the late 2010s, people noticed that in many real models (especially large neural networks), the test error vs. complexity story is more complicated:

- If you keep increasing model capacity far beyond the point where training error first hits **zero**, something surprising can happen.  
- Near the point of exactly fitting the training data (interpolation), the test error often **spikes** upward (overfitting).  
- But if you keep increasing capacity further:
  - Test error can **decrease again**, sometimes to a level even **lower** than the original “sweet spot.”  
  - This gives a **double descent** shape:
    - First descent (classical regime).  
    - Spike near interpolation (zero training error).  
    - Second descent in an over‑parameterized “modern” regime.

Key terms:

- **Interpolating regime**: the part of the curve where the model achieves **zero training error** (fits every training point exactly).  
- **Modern interpolating regime**: extremely over‑parameterized models (like big neural nets) that interpolate but still generalize well.

So the simple “more complexity ⇒ worse generalization after a point” story is not always correct in practice.

***

## Witten’s spline example: 4, 6, 20, and 36 degrees

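The video uses an example from Daniela Witten (via a series of tweets) to make double descent concrete in a simpler setting: spline regression (conceptually similar to high‑degree polynomial regression) on 20 data points. “Degree” below is best read as the number of fitted parameters (degrees of freedom), which is why degree 20 lines up exactly with the 20 data points: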

1. **Degree‑4 model**  
   - Blue curve fits the data reasonably well.  
   - Still some bias, but it captures the broad shape of the underlying “true” black curve.

2. **Degree‑6 model**  
   - Blue curve fits a bit better (more flexibility).  
   - Training error decreases; test error also often improves compared to degree 4.

3. **Degree‑20 model**  
   - Now the model has enough flexibility to achieve **zero training error** on the 20 points.  
   - The curve wiggles wildly between points, “chasing” noise.  
   - Training error is zero, but test/generalization is clearly bad (overfitting).  
   - This matches the classical picture near the interpolation point.

4. **Degree‑36 model**  
   - There are now more parameters (36) than data points (20).  
   - In linear algebra terms:
     - 20 data constraints.  
     - 36 parameters.  
     - There are **infinitely many** parameter settings that perfectly fit the training data (zero training error).  
   - Among these infinitely many interpolating solutions, the particular one found by the training algorithm is actually **less wiggly** and somewhat better in test behavior than the degree‑20 solution.  
   - This illustrates the **second descent**: moving deeper into the over‑parameterized regime reduces test error again.

Plotting test and training error vs. degree:

- Training error decreases to zero and stays there.  
- Test error:
  - Decreases at first (degrees 4–6).  
  - Rises up near degree 20 (classic overfit).  
  - Then decreases again at degree 36 (double descent).  

In this specific toy example, test error eventually rises again for even higher degrees; in many neural network cases, the second descent can continue without a visible “second spike.”
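A rough way to experiment with this kind of curve is sketched below. It is not Witten's exact setup: it swaps the spline basis for a Legendre polynomial basis, and the true function `true_f`, the noise level, the evenly spaced inputs, and the seed are all assumptions chosen for illustration. For each parameter count `p` it fits the 20 training points by least squares (NumPy's `lstsq` returns the minimum‑norm solution when `p` exceeds the number of points) and reports training and test error.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth curve, chosen only for illustration.
    return np.sin(3.0 * x)

n_train = 20
x_train = np.linspace(-1.0, 1.0, n_train)   # evenly spaced inputs (assumption)
y_train = true_f(x_train) + rng.normal(0.0, 0.3, n_train)
x_test = np.linspace(-1.0, 1.0, 400)
y_test = true_f(x_test)                     # noise-free targets for the test curve

def fit_and_score(p):
    """Fit p Legendre basis functions by least squares; return (train MSE, test MSE)."""
    phi_train = legendre.legvander(x_train, p - 1)   # shape (n_train, p)
    phi_test = legendre.legvander(x_test, p - 1)
    # For p > n_train, lstsq returns the minimum-norm solution among all exact fits.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    return train_mse, test_mse

results = {p: fit_and_score(p) for p in (4, 6, 20, 36)}
for p, (train_mse, test_mse) in results.items():
    print(f"p={p:2d}  train MSE={train_mse:.3e}  test MSE={test_mse:.3e}")
```

Training error should be essentially zero from `p = 20` onward. Whether, and how strongly, the test error drops again at `p = 36` depends on the basis, the noise, and the seed, so treat any single run as an illustration rather than evidence.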

***

## Over‑parameterization and infinite interpolating solutions

Once a model has **more parameters than data points**:

- It can fit the training data in many different ways.  
- There are **infinitely many** parameter vectors that all give **zero training error**.  

For example:

- With \(n\) points and \(n\) parameters, there is usually exactly one interpolating solution.  
- With \(n\) points and \(p > n\) parameters, there is an entire family (a high‑dimensional set) of interpolating solutions.
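The "entire family" of solutions can be seen directly in a linear model. The sketch below uses a made‑up random design matrix `X` and targets `y` (both hypothetical): it builds one exact fit, then moves along a null‑space direction of `X` to get a second, different parameter vector that fits the training data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                       # fewer data constraints than parameters
X = rng.normal(size=(n, p))       # hypothetical design matrix
y = rng.normal(size=n)            # hypothetical targets

# One exact fit: the minimum-norm solution given by the pseudoinverse.
theta_a = np.linalg.pinv(X) @ y

# Rows n..p-1 of Vt span the null space of X (X has full row rank here).
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[n]

# Moving along the null space changes the parameters but not the predictions.
theta_b = theta_a + 3.0 * null_direction

print(np.allclose(X @ theta_a, y), np.allclose(X @ theta_b, y))
print(np.linalg.norm(theta_a), np.linalg.norm(theta_b))
```

Both vectors reproduce `y` on the training inputs, yet they differ in norm and would generally give different predictions on new inputs.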

So if we just say, “train until training error is zero,” that does not tell us:

- Which particular interpolating model we will end up with.  
- How “wiggly” or “smooth” that model will be.  
- How well it will generalize.

The key question becomes:

> Among all zero‑training‑error models, **which one** does the training procedure (like SGD) actually pick?

***

## Where stochastic gradient descent comes in

Stochastic gradient descent (and mini‑batch versions) is not just “any” optimizer: it has its own built‑in preferences.

- SGD is a **specific procedure** for walking through parameter space.  
- In over‑parameterized settings with infinitely many perfect fits, SGD tends to select **particular** kinds of solutions.  
- Empirically and theoretically (in certain settings), these solutions often behave as if they are **regularized**, even if you did not add any explicit regularization (such as an L2 penalty in the loss, or dropout).

This effect is called **implicit regularization** (or **implicit bias**) of SGD.
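The cleanest setting where this can be checked is over‑parameterized linear least squares, which is far simpler than the neural networks the video has in mind. In the sketch below (random `X` and `y`, step sizes, and iteration counts are all illustrative assumptions), both full‑batch gradient descent and single‑sample SGD start from zero and use no penalty term, yet both end up at the minimum‑norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                          # over-parameterized: p > n
X = rng.normal(size=(n, p))             # hypothetical features
y = rng.normal(size=n)                  # hypothetical targets

theta_min_norm = np.linalg.pinv(X) @ y  # smallest-norm exact fit, for reference

# Full-batch gradient descent on 0.5 * ||X theta - y||^2, started at zero.
theta_gd = np.zeros(p)
lr_gd = 1.0 / np.linalg.norm(X, 2) ** 2         # 1 / (largest singular value)^2
for _ in range(2000):
    theta_gd -= lr_gd * X.T @ (X @ theta_gd - y)

# Single-sample SGD on the same loss, also started at zero.
theta_sgd = np.zeros(p)
lr_sgd = 1.0 / np.max(np.sum(X ** 2, axis=1))   # 1 / (largest squared row norm)
for _ in range(20000):
    i = rng.integers(n)
    theta_sgd -= lr_sgd * (X[i] @ theta_sgd - y[i]) * X[i]

print(np.linalg.norm(theta_gd - theta_min_norm))
print(np.linalg.norm(theta_sgd - theta_min_norm))
```

The reason is that every update is a combination of rows of `X`, so iterates started at zero never leave the row space, and the only interpolating solution in the row space is the minimum‑norm one. A different starting point would generally lead to a different interpolator.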

***

## Implicit regularization: “less wiggly” solutions without explicit penalties

Explicit regularization is something you put directly into the objective, e.g.:

- L2 penalty: minimize \( \text{loss}(\theta) + \lambda \|\theta\|^2 \).  
- Smoothing penalties in splines or polynomials.  

Implicit regularization means:

- There is **no explicit penalty term** in the loss.  
- Yet, the **training algorithm** (SGD) itself tends to favor solutions that look like they had been regularized.
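To make the contrast concrete, here is a sketch of the explicit side on an over‑parameterized linear model (random `X`, `y`, and the chosen `lam` values are illustrative assumptions). It solves the L2‑penalized problem in closed form for several penalty strengths and compares each result with the minimum‑norm interpolator, which is the solution plain gradient descent from a zero start reaches in this linear setting with no penalty at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))   # hypothetical features
y = rng.normal(size=n)        # hypothetical targets

def ridge(lam):
    # Closed-form minimizer of ||X theta - y||^2 + lam * ||theta||^2 (dual form, p > n).
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

theta_min_norm = np.linalg.pinv(X) @ y   # no penalty: smallest-norm exact fit

for lam in (10.0, 1.0, 1e-3, 1e-8):
    theta = ridge(lam)
    print(f"lam={lam:g}  ||theta||={np.linalg.norm(theta):.4f}  "
          f"gap to min-norm={np.linalg.norm(theta - theta_min_norm):.2e}")
```

A larger `lam` shrinks the parameter norm, and as `lam` approaches zero the ridge solution approaches the minimum‑norm interpolator. In that sense the unpenalized solution behaves like the limit of an explicitly regularized one.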

In the Witten example:

- Both degree‑20 and degree‑36 models can achieve zero training error.  
- But when trained with SGD (or similar gradient methods), the degree‑36 model that emerges from the optimization is **less wiggly** and behaves more smoothly around the data.  
- Intuitively: even though it has higher nominal complexity, the *particular* solution SGD lands on is effectively more constrained and smoother, “as if” we had applied stronger regularization.

The lecturer summarizes this as:

- “Stochastic gradient descent results in a model which is implicitly regularized, yielding less wiggly behavior.”  
- “Counterintuitively, because the 36‑degree model has a greater implicit regularization strength, it is actually less free than a 20‑degree model.”

That is, the training dynamics under SGD push the very high‑capacity model into a special subset of all possible interpolating solutions: those that are smoother, simpler in some hidden sense, or of smaller “norm,” depending on the setting.

***

## Why implicit regularization leads to double descent

Putting the pieces together:

1. **Moderate complexity**  
   - Classic bias–variance regime.  
   - Increasing capacity first reduces bias; past the sweet spot, overfitting sets in and test error goes up.

2. **Approaching interpolation (training error → 0)**  
   - Test error spikes because the model is just flexible enough to chase noise.  
   - There is essentially only one interpolating solution; it is typically quite wiggly.

3. **Further over‑parameterization (many more parameters than data points)**  
   - There are infinitely many interpolating solutions.  
   - SGD doesn’t explore all of them; it has a built‑in bias toward certain smoother, more regularized solutions.  
   - As a result, the chosen zero‑training‑error model can generalize **better** than the unique interpolator at the interpolation threshold.  
   - Test error can drop again → **double descent**.

As model capacity grows even more:

- The strength or effect of this implicit regularization can increase.  
- In some contexts (e.g., deep nets), test error after the second descent is **lower** than anywhere in the classical regime, which helps explain why huge models can perform very well in practice despite being capable of massive overfitting.

***

## Active research and pointers

The instructor stresses that:

- The detailed theory of implicit regularization and double descent is still an **active research area**.  
- Many of the deeper mathematical explanations are beyond the course (and in some cases still not fully understood by researchers themselves).  
- They mention:
  - Belkin et al. 2019, *“Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off”*.  
  - Barrett and Dherin 2021, *“Implicit Gradient Regularization”*, which studies how gradient-based methods (like SGD) themselves act as a kind of regularizer.
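As a rough pointer to what the implicit gradient regularization paper shows (stated in simplified form, with notation chosen for this summary rather than taken from the video): gradient descent with a finite learning rate \(h\) on a loss \(L(\theta)\) follows, to leading order in \(h\), the gradient flow of a modified loss

\[
\tilde L(\theta) = L(\theta) + \frac{h}{4}\,\big\|\nabla L(\theta)\big\|^2 .
\]

The extra term penalizes regions where the loss surface is steep, so the discrete steps themselves nudge training toward flatter solutions. The paper should be consulted for the precise statement and the conditions under which it holds.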

The main takeaway for you at this stage is **intuition**, not proofs:

- Over‑parameterized models have many perfect‑fit solutions.  
- SGD tends to pick particular, implicitly regularized ones that can generalize surprisingly well.  
- This selection mechanism is a key piece of why we see double descent and why very large models often perform better than smaller ones, contrary to the classical picture.
