# Bias Variance Analysis
**Most avoidable errors in machine learning are the result of high bias or high variance.**

In this notebook, we'll learn how to identify when a model has high bias (underfitting) or high variance (overfitting), and how to optimise the model for the best bias-variance trade-off. Here is an overview of what will be covered:

1. The bias-variance trade-off
2. Diagnosing for Bias/Variance: Learning Curves, Validation Curves, and Loss Curves
3. Optimising for Bias/Variance: Model Hyperparameters, Outside Hyperparameters
4. Bonus 1: Baseline Models (Incremental Optimisation)
5. Bonus 2: Handling Imbalanced Datasets

## The Bias Variance Trade-Off

## Diagnosing for Bias/Variance
In this section, we'll explore three charts that allow us to visualise and diagnose our model's performance. Before we delve into the charts, note that we should always split our dataset into three sets:

1. **Training Set:** The data you use to train/fit your model
2. **Dev/Validation Set:** The data you use for model selection and tuning. You check performance on this dataset while adjusting hyperparameters, model architectures, and even feature selection/engineering
3. **Test Set:** The data that will be used to evaluate final performance after all tuning is done. It simulates how your model will perform on unseen real-world data. You should only check the test set at the very end, once you’ve chosen your best model via the dev set.
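
The three-way split above can be sketched with two calls to `train_test_split`. This is a minimal sketch, not a prescribed recipe: the synthetic dataset and the 60/20/20 proportions are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First hold out 20% as the test set, then split the remainder 75/25,
# so the final proportions are 60% train, 20% dev, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_dev), len(X_test))
```

Stratifying on `y` keeps the class proportions similar across the three sets, which matters most for small or imbalanced datasets.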

### Learning Curves
A learning curve shows how a model’s performance changes as we increase the size of the training dataset. The objective is to diagnose whether a model is suffering from high bias (underfitting), high variance (overfitting), or is in a good balance (generalising well). By plotting training and test (validation) errors against training set size, we can visualise model behaviour and decide what to do next.

#### Understanding the Learning Curve Plot
In plot A below, the green curve represents the error on the training set and the red curve represents the error on the test (validation) set. With very few training examples, the model learns the training data almost perfectly (very low train error), but generalises poorly (very high test error). As the training set grows, the training error increases slightly (it is harder to fit more data perfectly), while the test error decreases (better generalisation). Eventually, both curves stabilise and converge to a point where adding more data doesn’t help much.

| Plot A: Train vs Test Set Learning Curve | Plot B: Extrapolating the Test Set Curve | Plot C: Plateauing Test Set Curve |
|---|---|---|
| ![Learning Curve Architecture](assets/learning-curve-architecture.png) | ![Extrapolating Learning Curves](assets/learning-curve-extrapolating.png) | ![Plateauing Learning Curves](assets/learning-curve-plateauing.png) |

Extrapolating learning curves can also help us answer: Do we need more data? Will model performance improve further? Or, should we prioritise other performance optimisations rather than collecting more data?

Interpretation: If the test error is still trending downward, more data may help. If the test error has plateaued, adding data won’t bring meaningful improvement (see Plot B and Plot C above).
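
To make this concrete, here is a minimal sketch of how the numbers behind a learning curve could be computed with `sklearn.model_selection.learning_curve`. The synthetic dataset, the choice of `LogisticRegression`, and the five training sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Retrain the model on 10%, 32.5%, ..., 100% of each training fold.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Convert accuracy to error and average across the CV folds.
train_error = 1 - train_scores.mean(axis=1)
val_error = 1 - val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_error, val_error):
    print(f"n={int(n):4d}  train error={tr:.3f}  validation error={va:.3f}")
```

If the validation error is still falling at the largest training size, collecting more data may be worthwhile; if it has flattened, other optimisations are likely a better use of time. The two error arrays can be plotted against `train_sizes` to reproduce a chart like Plot A.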

#### Diagnosing bias and variance with Learning Curves 
| Plot D: First Scenario | Plot E: Second Scenario | Plot F: Third Scenario |
|---|---|---|
| ![High Variance Scenario](assets/learning-curve-high-variance.png) | ![High Bias Scenario](assets/learning-curve-high-bias.png) | ![Ideal Scenario](assets/learning-curve-ideal.png) |

- **First Scenario (High Variance):** Training error is very low, but test error is high. The model memorises training data but fails to generalise.
- **Second Scenario (High Bias):** Both training error and test error are high. The model cannot capture the underlying patterns, even with more data.
- **Third Scenario (Ideal):** Training and test errors are both low, and the gap between them is small. The model generalises well and meets desired performance. This is the sweet spot we aim for.

#### Diagnosing without graphs
| Error | High Variance | High Bias | Ideal |
|---|---|---|---|
| Test Error | 15% | 15% | 9% |
| Train Error | 3% | 12% | 5% |
| Desired Error | 7% | 7% | 7% |

Even without a plot, you can infer the scenario by comparing train error and test error:
- High Variance (Overfitting): Train error is low, but test error is high.
- High Bias (Underfitting): Both train and test errors are high and close to each other.
- Ideal: Both train and test errors are low, with only a small gap.

Usually, if the two errors are close to each other, we can infer, even without a clear desired performance, that our goal is to reduce bias by making the model more complex. However, if there is a significant gap between the errors, our first objective is to make sure the model generalises well to unseen data, i.e. to fix high variance.
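
The comparison above can be written as a small rule of thumb. The sketch below is illustrative only: the `diagnose` function and its `tolerance` threshold are hypothetical, and the threshold of 4 percentage points was picked simply so that the three columns of the table above receive their expected labels.

```python
def diagnose(train_error, test_error, desired_error, tolerance=4.0):
    """Label a model as high bias, high variance, or ideal.

    All errors are percentages. `tolerance` (in percentage points) is a
    hypothetical threshold chosen for this illustration.
    """
    avoidable_bias = train_error - desired_error  # distance of train error from the target
    variance_gap = test_error - train_error       # how much worse we do on unseen data

    if avoidable_bias > tolerance and avoidable_bias >= variance_gap:
        return "high bias"
    if variance_gap > tolerance:
        return "high variance"
    return "ideal"


# The three columns of the table above: (train, test, desired).
scenarios = {
    "High Variance": (3.0, 15.0, 7.0),
    "High Bias": (12.0, 15.0, 7.0),
    "Ideal": (5.0, 9.0, 7.0),
}
for name, (train, test, desired) in scenarios.items():
    print(f"{name}: {diagnose(train, test, desired)}")
```

In practice the threshold depends on the task and on how noisy the error estimates are, so treat the output as a prompt for investigation rather than a verdict.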

#### Desired/Optimal Performance
Even if you can diagnose bias vs variance by comparing train/test errors, you also need to know what "good performance" looks like. Otherwise, you don’t know if a 12% error is acceptable or if you should keep improving. This is where the desired error (optimal error) comes in. You can infer the optimal error from:
- **Existing Solutions (Benchmarking):** If other models/solutions (even simple baselines) are already achieving 7% error on the same task, then that becomes your target. Example: If logistic regression gives 15% error but random forests give 8%, then we know the optimal error is at most 8%, so at least 7 percentage points of the logistic regression error are avoidable. In industry, you can also use an existing solution as the benchmark to surpass: think of GPT-5 improving upon GPT-4, or aiming to beat a competitor's latest model such as Claude 4 Sonnet.
- **Human Error:** In some tasks, human-level performance is used as a reference for optimal error. Example: Humans make ~5% error in road sign recognition. If your model is at 12%, the goal is to get closer to 5%, not necessarily 0%.
- **Ensemble (Multiple Models):** If multiple models converge to similar error rates on the same data, that suggests you’re approaching the optimal achievable error. If a radically different model family does much better, you weren’t at the optimal error yet.

#### Disadvantages of Learning Curves
In conclusion, learning curves are great for identifying whether you are in a high bias or high variance scenario. However, they have the following disadvantages:
- **Computational Time:** Plotting a learning curve often requires retraining models multiple times on subsets of data, which is expensive for large datasets or complex models.
- **Noisy Curve (Especially in Small Datasets):** With little data, the variance in errors across subsets makes the curve jagged, sometimes hiding the real trend.
- **High Human Expertise Needed:** Interpreting the curves (or tables) correctly requires a solid understanding of bias, variance, and optimal error. Misinterpretation can easily lead to wrong interventions (e.g., adding complexity when you actually need regularisation).

### Validation Curves
In a validation curve, we plot the model error/performance score against the values of a selected model hyperparameter. Its objective is to find the best hyperparameter values for the model, i.e. where the model achieves the ideal bias-variance trade-off.

See the images below: in plot A, we have plotted the `accuracy_score` of our classification model against `C`, the regularisation parameter of our Support Vector Classifier; in plot B, we have plotted the `mean_squared_error` of our regression model against the `polynomial_degree` of a Linear Regression (applied as a preprocessing transformation).



In Scikit Learn, you can compute the scores for a validation curve using `sklearn.model_selection.validation_curve`, or by taking the `cv_results_` of your `sklearn.model_selection.GridSearchCV` and plotting `mean_test_score` against your selected parameter (covered in the previous topic on GridSearchCV and RandomizedSearchCV).
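
As a minimal sketch of the first option, the snippet below computes validation-curve scores for the `C` parameter of a Support Vector Classifier. The synthetic dataset, the RBF kernel, and the range of `C` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values of C from 0.001 to 1000 on a log scale.
param_range = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    SVC(kernel="rbf"),
    X,
    y,
    param_name="C",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Average the scores across the CV folds for each value of C.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_C = param_range[np.argmax(mean_val)]

for C, tr, va in zip(param_range, mean_train, mean_val):
    print(f"C={C:g}  train accuracy={tr:.3f}  validation accuracy={va:.3f}")
print("Best C by validation accuracy:", best_C)
```

A widening gap between training and validation accuracy as `C` grows points to increasing variance, while low scores on both at small `C` point to high bias.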

### Loss Curves

## Optimising for Bias/Variance
Now that you have evaluated your model for bias and variance, you need to know what actions to take to reduce bias (fix underfitting) or reduce variance (fix overfitting). There are two ways to approach this:
1. Model hyperparameter tuning - you should understand which hyperparameters adjust for bias and/or variance
2. Model/Workflow adjustments - you should understand how different steps in your modeling workflow affect bias and variance

### Using Hyperparameter Tuning
This section is more of a revision of what you covered when learning different machine learning algorithms. For each of the following algorithms, identify which hyperparameters reduce bias (adding model complexity) or reduce variance (regularisation). 

Below is an activity with a selection of common machine learning algorithms. Can you identify the key hyperparameters for reducing bias or reducing variance? Open the Scikit Learn documentation for each algorithm, go through each hyperparameter, and classify it in the table below:

| ML Algorithms | How to reduce bias (Add complexity) | How to reduce variance (Regularise) |
|---|---|---|
| K-Nearest Neighbours | **Reduce K** (`n_neighbors`), the number of nearest neighbours; **change `weights`** to give closer neighbours higher weights; use a **distance metric** that emphasises closer neighbours, e.g. a larger `p` such as Euclidean (`p=2`) in the Minkowski metric | Increase K; use uniform weights; use a distance metric that de-emphasises closer neighbours, e.g. a smaller `p` such as Manhattan distance (`p=1`) |
| Linear/Logistic Regression | Add polynomial features (a higher `degree` means lower bias but higher variance); apply other functional transformations, i.e. more complex ways to transform the data before regression | Add regularisation: increase the regularisation strength when using Lasso (L1), Ridge (L2), or Elastic Net (L1-L2) regularisation (in `LogisticRegression`, this means decreasing `C`) |
| Support Vector Machines |  | |
| Decision Trees | | |
| Random Forest | | |
| Gradient Boosting |  |  |
| Neural Networks |  |  |

Note that before using any new machine learning algorithm, it is essential that you understand how it works, what hyperparameters it has, and what those hyperparameters adjust (bias, variance, or other algorithmic behaviour, e.g. learning rate and maximum iterations in gradient-descent-based algorithms).
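
As a rough illustration of the K-Nearest Neighbours row, the sketch below trains the same classifier with different values of K and compares training and dev errors. The synthetic dataset (with 10% label noise via `flip_y`) and the chosen values of K are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_error = 1 - model.score(X_train, y_train)
    dev_error = 1 - model.score(X_dev, y_dev)
    results[k] = (train_error, dev_error)
    print(f"K={k:3d}  train error={train_error:.3f}  dev error={dev_error:.3f}")
```

With K=1 every training point is its own nearest neighbour, so the training error is zero and any error on the dev set shows up as a gap (high variance). Larger K smooths the decision boundary, which narrows the gap but can eventually raise both errors (high bias).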

### Adjusting Models or Workflows
Outside the model hyperparameters, here is how other steps of your machine learning workflow affect bias/variance:
1. **Data:** Increasing the amount of training data mainly reduces variance, because the model generalises better when it has seen more examples. More data does not fix high bias by itself (see the high bias learning curve above), but it lets you use more complex models that capture more complex patterns (less bias) without overfitting.
2. **Features:** Increasing features adds model complexity (reducing bias), while reducing features reduces variance. There are numerous techniques for adjusting features, ranging from simple methods such as intuitive feature selection and collecting data for more features, to more advanced methods such as feature selection through feature importance scores, dimensionality reduction algorithms (Principal Component Analysis & Manifold Learning), and domain-anchored engineering of new features.
3. **Models:** You can always try different models to identify the one that performs best on your dataset. It is recommended to start with simple models and move to more complex models as you seek to reduce bias, although trial and error still works. Your understanding of each model should tell you which models can identify more complex data patterns. Examples: Gradient Boosting models tend to be more complex than a simple Linear Regression, and a Multilayer Perceptron is more complex than a Perceptron or a Logistic Regression.
4. **Epochs:** In neural networks and other gradient descent algorithms, the more epochs/iterations we train the model for, the lower the bias but the higher the variance, as we saw when covering loss curves above.
5. **Objective/Cost Functions:** It is possible to adjust the objective function of your model in some cases. For example, in regression, fitting a model on mean squared error penalises larger errors more heavily and therefore yields higher variance and lower bias. If the model were fit on absolute errors instead, there would be less penalty on larger errors, hence higher bias and lower variance. You could also opt for **Huber Loss (a hybrid of MSE and MAE)** to balance bias and variance. In classification, you could opt for cross-entropy/log loss, where misclassifications are penalised more heavily, yielding higher variance and lower bias. Below are some additional things to note:
    - Not all models allow you to adjust the cost function but some do. For example, in Scikit Learn:
        - For linear models, `LinearRegression` is fixed to MSE but `SGDRegressor` allows you to choose the `loss` metric
        - In decision trees, you can choose the `criterion` as `gini` or `entropy`
        - Gradient boosting models also support different `loss` parameters
        - When studying models, it is important to understand if the objective function can be adjusted
    - You can also use the `sample_weight` parameter, supported by most Scikit Learn models, to indirectly adjust the loss function. In the example below, we have doubled the penalty on errors for positive-class samples (`y == 1`) in our Logistic Regression
        ```python
        from sklearn.linear_model import LogisticRegression

        # X_train and y_train are assumed to be defined; samples with y == 1 count twice in the loss
        sample_weight = [2 if y == 1 else 1 for y in y_train]
        clf = LogisticRegression()
        clf.fit(X_train, y_train, sample_weight=sample_weight)
        ```
    - Outside vanilla Scikit Learn, some libraries give you more control over the cost function. These include `scipy.optimize`, `XGBoost`, and `LightGBM`; you can also write a custom estimator by subclassing `sklearn.base.BaseEstimator`
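
To see how the choice of loss changes the fitted model, the sketch below compares a squared-error fit with a Huber-loss fit on data containing a few large outliers. It is an illustration under stated assumptions: the data is synthetic with a true slope of 3, and `HuberRegressor` is used here as a convenient deterministic stand-in for Huber loss rather than the `SGDRegressor` option mentioned above.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic line y = 3x + 2 with mild noise (assumption for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)
y[-10:] += 50.0  # corrupt the last 10 targets with large outliers

# Squared error chases the outliers; Huber loss caps their influence.
ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

print("True slope: 3.0")
print(f"Squared-error slope: {ols.coef_[0]:.2f}")
print(f"Huber slope: {huber.coef_[0]:.2f}")
```

The squared-error fit bends towards the outliers because large residuals dominate its loss, while the Huber fit stays closer to the underlying trend. This is the same trade-off described in the list above: sensitivity to individual points versus stability.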

## Baseline Models (Incremental Optimisation)

## Imbalanced Datasets (Classification Problems)