The bias–variance tradeoff describes how model complexity affects two distinct sources of error, bias and variance. In “modern” over‑parameterized models trained with stochastic gradient descent, this classical picture can break down. [telnyx](https://telnyx.com/learn-ai/bias-variance-tradeoff)

### Bias and variance: intuitive meanings

- **Bias**: how fundamentally *wrong* the model’s assumptions are, even with the best possible parameters.  
  - High bias = model is too simple to capture the true pattern (underfitting). [bmc](https://www.bmc.com/blogs/bias-variance-machine-learning/)
  - Example from the video: a model with **only an intercept** for miles‑per‑gallon (MPG) assumes “all cars have the same fuel efficiency.” Even with the best intercept, it can’t capture the relationship between MPG and weight/horsepower, so its bias is high.

- **Variance**: how *sensitive* the model is to the specific training data it sees.  
  - High variance = model changes a lot when the data changes a little (overfitting). [telnyx](https://telnyx.com/learn-ai/bias-variance-tradeoff)
  - In the MPG example, a very wiggly high‑degree model will move dramatically if a single point is nudged, so it has high variance.

You can think:

- Low‑complexity models: **high bias, low variance**.  
- High‑complexity models: **low bias, high variance**. [bmc](https://www.bmc.com/blogs/bias-variance-machine-learning/)

### MPG example: bias decreasing, variance increasing

The video walks through increasing model complexity on an MPG‑style dataset:

1. **Intercept‑only model**  
   - Predicts the same MPG for every vehicle.  
   - High bias: cannot express the clear dependency on the input variable.  
   - Low variance: tiny changes in data barely change the constant estimate.

2. **Linear model (intercept + slope)**  
   - A straight regression line through the data.  
   - Bias is **lower**: captures a general trend (e.g., MPG decreases as weight increases).  
   - Variance still modest: different samples will shift the line, but not wildly.

3. **Quadratic model (degree‑2 polynomial)**  
   - Can capture a **roughly parabolic** shape in MPG vs input.  
   - Bias decreases again: fits the visible curvature better.  
   - Variance increases: more flexible, more sensitive to small data changes.

4. **Very high‑degree polynomial (e.g., degree 25 with 25 points)**  
   - Can fit every training point exactly → **zero training error**, so bias at the training inputs is essentially zero (bias is measured against the true function, not against the training labels).  
   - But extremely high variance: tiny changes in one point can produce a totally different, wildly oscillating curve.  
   - This is classic overfitting: the model is “too free” and learns noise.

So, as complexity increases:

- **Bias** tends to go down.  
- **Variance** tends to go up.
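
The trend above can be checked numerically. The following is a minimal sketch, not the video's experiment: it assumes a hypothetical quadratic "true" curve (`true_fn`), fixed training inputs, and Gaussian noise, then refits polynomials of several degrees on many noisy resamples to estimate squared bias and variance. Degree 12 stands in for the very high‑degree case only to keep the monomial‑basis fit numerically stable.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def true_fn(x):
    # Hypothetical smooth "MPG vs. scaled weight" curve (an assumption, not real data).
    return 30.0 - 12.0 * x + 5.0 * x**2

def bias_variance(degree, n_train=25, n_trials=200, noise_sd=1.0, seed=0):
    """Monte Carlo estimate of (bias^2, variance), averaged over a test grid."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, n_train)   # fixed inputs; only the noise is resampled
    x_test = np.linspace(-0.9, 0.9, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, n_train)
        coefs = P.polyfit(x_train, y_train, degree)
        preds[t] = P.polyval(x_test, coefs)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

# Degree 0 = intercept only, 1 = linear, 2 = quadratic, 12 = very flexible.
results = {d: bias_variance(d) for d in (0, 1, 2, 12)}
for d, (b2, var) in results.items():
    print(f"degree {d:2d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Under these assumptions, squared bias should shrink from degree 0 to degree 2 while variance grows with degree. The exact numbers depend on the assumed curve, noise level, and sample size.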

### Formal relationship: error = bias² + variance + noise

For squared‑error loss, defining bias and variance mathematically gives the decomposition:

$$
\text{Expected loss (risk)} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}
$$

- **Irreducible error** is noise in the data you can’t get rid of (e.g., measurement noise). [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/ml-bias-variance-trade-off/)
- Since all terms are non‑negative, the irreducible error is a lower bound on what any model can achieve. [en.wikipedia](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)
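
Written out more explicitly, the decomposition takes the following standard form. The notation here is an assumption rather than the video's: data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, the loss is squared error, and $\hat f_D$ is the model fitted on a random training set $D$.

$$
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
$$

The cross terms vanish because $\varepsilon$ has zero mean and is independent of $D$, and because $\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]$ has zero mean over $D$.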

The **bias–variance tradeoff** is the fact that:

- Making the model more complex usually **reduces bias** but **increases variance**.  
- Making it simpler usually **increases bias** but **reduces variance**. [mlu-explain.github](https://mlu-explain.github.io/bias-variance/)

Classically, if you plot:

- Total error  
- Bias²  
- Variance  

against model complexity, you get:

- Bias²: decreases as complexity increases.  
- Variance: increases as complexity increases.  
- Total error: U-shaped, with a **sweet spot** at intermediate complexity where bias² + variance is minimized. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/ml-bias-variance-trade-off/)

That sweet spot is the “just right” model in the classical regime.

### Modern interpolating regime and double descent

The video then connects this to the **modern interpolating regime** and **double descent** seen in large models:

- In many neural networks and over‑parameterized models, **test error vs. complexity** does *not* simply follow the classical U‑shape.  
- As you increase complexity:
  1. Test error decreases (classical first descent).  
  2. Near the point where training error hits **zero**, test error can **spike up** (overfitting).  
  3. As you increase complexity further, test error can **drop again** and sometimes reach **even lower** values than at the classical sweet spot.  

This is **double descent**: two descents in test error separated by a peak around the interpolation threshold.
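
A small toy experiment can make the shape of this curve concrete. The sketch below is not the neural network experiment from the video or the cited papers. It assumes a noisy linear teacher, random ReLU features, and a minimum‑norm least‑squares fit (`np.linalg.lstsq`), with the number of random features playing the role of model complexity. All names and constants are illustrative.

```python
import numpy as np

def median_test_mse(n_features, n_train=40, n_test=500, dim=5,
                    noise_sd=0.5, n_trials=20, seed=0):
    """Median test MSE of min-norm least squares on random ReLU features."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        beta = rng.normal(size=dim)
        beta /= np.linalg.norm(beta)                  # unit-norm linear teacher
        W = rng.normal(size=(dim, n_features))        # random, untrained first layer
        x_tr = rng.normal(size=(n_train, dim))
        x_te = rng.normal(size=(n_test, dim))
        y_tr = x_tr @ beta + rng.normal(0.0, noise_sd, n_train)
        y_te = x_te @ beta                            # noiseless test targets
        phi_tr = np.maximum(x_tr @ W, 0.0)            # ReLU features
        phi_te = np.maximum(x_te @ W, 0.0)
        # lstsq returns the minimum-norm solution when the system is underdetermined.
        coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
        errors.append(np.mean((phi_te @ coef - y_te) ** 2))
    return float(np.median(errors))

# Under-parameterized, at the interpolation threshold (features == training points),
# and heavily over-parameterized.
curve = {p: median_test_mse(p) for p in (10, 40, 400)}
for p, err in curve.items():
    print(f"{p:4d} features: median test MSE = {err:.3f}")
```

In this setup the error at 40 features, where the feature count equals the number of training points, is typically far larger than on either side. Whether the over‑parameterized end also beats the under‑parameterized end depends on the noise level and the feature map.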

The cited work (Belkin et al. and Yang et al.) and the video’s discussion emphasize:

- When you move far into the over‑parameterized, interpolating regime, bias can continue to **decrease**.  
- Surprisingly, **variance can start to decrease again**, rather than just increasing forever.  
- In some experiments, the **optimal** model (lowest test risk) is not at the classical sweet spot, but way out in the high‑capacity region, where the model is enormous.

In the neural network experiment mentioned:

- Bias² decreases with complexity.  
- Variance first increases, then **drops again** as the model grows huge.  
- Expected loss (risk) reaches its minimum in this far‑right regime, where both bias and variance are relatively small.

In that setting, it almost looks like there is **no tradeoff**: bigger models can achieve **lower bias and lower variance** simultaneously.

### Why can bigger models have low variance? Role of SGD

The video links this to **stochastic gradient descent** and **implicit regularization** (discussed in earlier videos):

- In over‑parameterized models (more parameters than training points), there are typically **infinitely many** parameter settings that perfectly fit the training data (zero training error).  
- However, training with SGD does **not** pick an arbitrary interpolating solution; it picks a particular one, influenced by:
  - Initialization.  
  - Learning rate schedule.  
  - Noise from mini‑batches.  
  - Properties of the architecture and optimization path.

Empirically and in some theoretical settings, the solutions SGD gravitates toward are:

- **Less wiggly** or smoother in function space.  
- “Smaller” in some norm or capacity measure.  
- More **implicitly regularized**, even though you did not explicitly add a penalty.

As the model size grows:

- The space of interpolating solutions gets larger.  
- The **implicit bias** of SGD becomes more influential in picking a “nice” solution among many bad ones.  
- This can **reduce variance** again, because the chosen solutions are “tamer” than what raw capacity might suggest.
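
The simplest setting where this selection effect can be seen exactly is over‑parameterized linear regression. The sketch below is an illustration under strong assumptions (a linear model, squared loss, zero initialization, single‑sample SGD with a constant step size) and says nothing definitive about deep networks. In this setting, every SGD update is a multiple of a training input, so the iterate never leaves the span of the data and ends up at the minimum‑L2‑norm interpolating solution. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                        # fewer training points than parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Reference: the minimum-L2-norm solution among all that fit the data exactly.
w_min_norm = np.linalg.pinv(X) @ y

# Plain single-sample SGD on squared loss, started from zero.
w_sgd = np.zeros(p)
lr = 1.0 / np.max(np.sum(X**2, axis=1))   # conservative constant step size
for _ in range(20_000):
    i = rng.integers(n)
    residual = X[i] @ w_sgd - y[i]
    w_sgd -= lr * residual * X[i]          # update stays in the span of the rows of X

# A different interpolating solution: add a direction X cannot "see" (its null space).
v = rng.normal(size=p)
null_part = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_part

print("SGD fits training data:   ", np.allclose(X @ w_sgd, y, atol=1e-6))
print("other fits training data: ", np.allclose(X @ w_other, y, atol=1e-6))
print("SGD equals min-norm:      ", np.allclose(w_sgd, w_min_norm, atol=1e-6))
print("norms (SGD, other):       ", np.linalg.norm(w_sgd), np.linalg.norm(w_other))
```

Both `w_sgd` and `w_other` reach zero training error, but SGD lands on the smaller‑norm one without any explicit penalty. That is the sense of "implicit regularization" used here.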

That is part of why:

- Extremely large neural networks trained with SGD and related methods can generalize well, even though the parameter count far exceeds the number of data points.  
- The best‑performing models in practice today are often **huge**, with hundreds of billions of parameters, not small, carefully “bias–variance balanced” models.

### Practical takeaways

From the video’s perspective:

- **Classical regime (smaller models)**  
  - Bias–variance tradeoff works as usual: pick a moderate complexity to balance underfitting and overfitting.  
  - The sweet spot is where bias is fairly low and variance hasn’t exploded.

- **Modern interpolating regime (very large models)**  
  - Test error and even variance can improve again as you go to larger models.  
  - In some contexts, the best model is **very large**, far beyond the interpolation threshold.  
  - The way SGD selects one solution among many zero‑training‑error solutions is crucial to this behavior.

- **In practice**  
  - Modern ML often pushes model sizes very high, especially in neural networks.  
  - Classical bias–variance intuition still helps, but you must be aware of double descent and implicit regularization when reasoning about very large models.