# :one: TP1 : Regression


---------------------

**Question 1.2: Recall closed form of the posterior distribution in linear case. Then, code and visualize posterior sampling. What can you observe?**

Because the prior and the likelihood are both Gaussian (conjugate), the posterior is also Gaussian.
$$p(w|X, Y) = \mathcal{N}(w|\mu, \Sigma)$$
where
- $\Sigma = \big[\alpha I + \beta \Phi^T \Phi\big]^{-1}$
- $\mu = \beta \Sigma \Phi^T Y$

- If we add more points, the posterior distribution gets narrower (sharper): we gain confidence in the estimate.
- If we set $N=0$ (no samples), we get the degenerate case where posterior = prior.
    - The posterior plot simply looks like an isotropic Gaussian, $p(w|\alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$.
    - With a very large $\alpha$, we place a lot of confidence in the prior, so we get a narrow spot at (0,0).
    - With a low $\alpha$, we have little confidence in the prior and the Gaussian is widely spread.

Note: when $\alpha=0$, $\Sigma = (\beta \Phi^T \Phi)^{-1}$, i.e. the inverse of the (uncentered, unnormalized) scatter matrix of the sample feature vectors, scaled by $1/\beta$.
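A minimal NumPy sketch of the closed form above and of posterior sampling. The affine features $(1, x)$, the values of `alpha` and `beta`, and the ground-truth weights `w_true` are assumptions made for illustration, not the notebook's actual settings.

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Closed-form Gaussian posterior N(w | mu, Sigma) for Bayesian linear regression."""
    d = Phi.shape[1]
    Sigma = np.linalg.inv(alpha * np.eye(d) + beta * Phi.T @ Phi)
    mu = beta * Sigma @ Phi.T @ y
    return mu, Sigma

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0            # assumed prior and noise precisions
w_true = np.array([-0.3, 0.5])     # hypothetical ground-truth weights

x = rng.uniform(-1.0, 1.0, size=50)
Phi_all = np.stack([np.ones_like(x), x], axis=1)   # features (1, x)
y_all = Phi_all @ w_true + rng.normal(0.0, beta ** -0.5, size=x.size)

traces = {}
for n in (0, 1, 5, 50):
    mu, Sigma = posterior(Phi_all[:n], y_all[:n], alpha, beta)
    samples = rng.multivariate_normal(mu, Sigma, size=200)  # posterior samples of w
    traces[n] = np.trace(Sigma)    # overall spread of the posterior
    print(n, mu.round(3), traces[n].round(5))
```

With `n = 0` the function returns the prior covariance $\alpha^{-1} I$, and the trace of $\Sigma$ shrinks as `n` grows, which is the narrowing visible in the figure.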


![](figures/bayesian_linear_regression_posterior.png)


#####  [Question 1.4] Bayesian linear regression: results of the predictive distribution on the synthetic dataset

![](figures/bayesian_linear_regression_closed_form_solution.png)
- Uncertainty (standard deviation of the predictive distribution) increases as we move farther away from the dataset.

![](figures/bayesian_linear_regression_closed_form_solution_HOLE_dataset.png)
- In the case with two point clouds, one would probably expect the uncertainty to increase in between. On the contrary, although it seems a bit counterintuitive, confidence is maximal in the middle ($x=0$, the minimum of the parabola).
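The observation on the "hole" dataset can be reproduced numerically. The sketch below assumes affine features $(1, x)$ and a hypothetical two-cluster dataset; `alpha`, `beta` and the cluster positions are illustrative values only.

```python
import numpy as np

def predictive_std(x_train, x_test, alpha, beta):
    """Standard deviation of the predictive distribution at each test point."""
    Phi = np.stack([np.ones_like(x_train), x_train], axis=1)
    Phi_s = np.stack([np.ones_like(x_test), x_test], axis=1)
    Sigma = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    # noise variance + weight uncertainty projected on each test feature vector
    var = 1.0 / beta + np.einsum("nd,de,ne->n", Phi_s, Sigma, Phi_s)
    return np.sqrt(var)

rng = np.random.default_rng(0)
# hypothetical dataset with a hole: two clusters around x = -2 and x = +2
x_train = np.concatenate([rng.uniform(-3, -1, 20), rng.uniform(1, 3, 20)])
x_test = np.linspace(-6, 6, 121)

std = predictive_std(x_train, x_test, alpha=2.0, beta=25.0)
x_min = x_test[np.argmin(std)]
print("uncertainty is minimal at x =", x_min, "| mean of training inputs =", x_train.mean())
```

The minimum of the predictive standard deviation lands next to the mean of the training inputs, i.e. inside the hole, rather than inside either cluster.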


##### [Question 1.5] Theoretical analysis to explain the form of the distribution (simplified case $\alpha=0, \beta=1$) 
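A possible derivation, assuming affine features $\phi(x) = (1, x)^T$ (the exact feature map of the notebook is an assumption here). With $\alpha=0, \beta=1$, we have $\Sigma = (\Phi^T \Phi)^{-1}$ and the predictive variance is $\sigma^2(x^*) = \frac{1}{\beta} + \phi(x^*)^T \Sigma \phi(x^*)$. Writing $\bar{x} = \frac{1}{N}\sum_i x_i$ and $s_x^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$:

$$\Phi^T \Phi = \begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}, \qquad \Sigma = \frac{1}{N \sum_i x_i^2 - \big(\sum_i x_i\big)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{pmatrix}$$

$$\phi(x^*)^T \Sigma \phi(x^*) = \frac{\sum_i x_i^2 - 2 x^* \sum_i x_i + N {x^*}^2}{N^2 s_x^2} = \frac{1}{N}\Big(1 + \frac{(x^* - \bar{x})^2}{s_x^2}\Big)$$

$$\sigma^2(x^*) = 1 + \frac{1}{N}\Big(1 + \frac{(x^* - \bar{x})^2}{s_x^2}\Big)$$

Under these assumptions, the variance is a parabola in $x^*$ whose minimum sits at the mean $\bar{x}$ of the training inputs, wherever the points actually are. For two clouds placed symmetrically around 0, $\bar{x} \approx 0$, which would explain the maximum confidence inside the hole.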




##### [Question 2.2]
A polynomial basis is used to approximate a sinusoid.
Using basis functions, linear regression can fit more complex shapes, and the same uncertainty framework still applies.
Close to the dataset, the degree-9 polynomial fits the sine wave correctly. Outside, it does not "generalize well", but the uncertainty increases dramatically, which is a good thing.
![](figures/sinusoid_functions_fitting.png)
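As a rough illustration of the behaviour described above, here is a sketch using a degree-9 polynomial feature map with the same closed-form posterior. The training range $[0, 1]$, the noise level and the values of `alpha` and `beta` are assumptions, not the notebook's settings.

```python
import numpy as np

def poly_features(x, degree=9):
    """Polynomial feature map (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_and_predict(x_train, y_train, x_test, alpha, beta):
    Phi, Phi_s = poly_features(x_train), poly_features(x_test)
    Sigma = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    mu = beta * Sigma @ Phi.T @ y_train
    mean = Phi_s @ mu
    var = 1.0 / beta + np.sum((Phi_s @ Sigma) * Phi_s, axis=1)
    return mean, np.sqrt(var)

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                       # assumed precisions
x_train = rng.uniform(0.0, 1.0, size=25)      # training inputs in [0, 1]
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=25)

x_test = np.array([0.5, 2.0])                 # inside vs. outside the training range
mean, std = fit_and_predict(x_train, y_train, x_test, alpha, beta)
std_in, std_out = std
print("std inside:", std_in, "| std outside:", std_out)
```

The high-order monomials blow up outside the training range, so the predictive standard deviation at $x=2$ is much larger than at $x=0.5$.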


##### [Question 2.4/2.5] Non-linear regression: analysis of the Gaussian basis feature maps results
![](figures/gaussian_kernel.png)

The predicted variance does not increase as we move farther away from the training samples. Gaussian basis functions are good at interpolation but not really at extrapolation here, and the result looks overconfident.


##### [Question 2.5]: Explain why in regions far from training distribution, the predictive variance converges to this value when using **localized basis functions such as Gaussians.**
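- As we get far away from the training data, the Gaussian basis functions no longer contribute to the model's output ($||x-\mu|| \gg \sigma$, so $\phi(x) \to 0$).
- The weight-dependent part of the predictive variance, $\phi(x)^T \Sigma \phi(x)$, therefore vanishes and the predictive distribution **falls back to the one given by the prior**...
- which is a Gaussian centered around 0 whose variance is the noise term $1/\beta$, and this is what we observe. Note that $\sigma=0.2$ gives $\sigma^2=0.04$; the plateau at $0.08 = 2\sigma^2$ seen in the curve on the right matches $1/\beta$ if $\beta = 1/(2\sigma^2)$ (an assumption about the notebook's settings).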
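The convergence described above can be checked with a small sketch. It assumes Gaussian features without a bias term, nine centers on $[0, 1]$, a width of 0.1, and $\beta = 12.5$ (chosen so that $1/\beta = 0.08$ matches the plateau reported above); all of these are illustrative assumptions.

```python
import numpy as np

def gaussian_features(x, centers, width):
    """Localized Gaussian basis functions, one column per center (no bias column)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
alpha, beta = 2.0, 12.5                 # assumed precisions, 1 / beta = 0.08
centers = np.linspace(0.0, 1.0, 9)      # hypothetical basis centers
width = 0.1                             # hypothetical basis width

x_train = rng.uniform(0.0, 1.0, size=25)
Phi = gaussian_features(x_train, centers, width)
Sigma = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

x_test = np.array([0.5, 5.0])           # inside the data vs. far away from it
Phi_s = gaussian_features(x_test, centers, width)
var = 1.0 / beta + np.sum((Phi_s @ Sigma) * Phi_s, axis=1)
var_near, var_far = var
print("variance near data:", var_near, "| far from data:", var_far)
```

Far from the data every feature is numerically zero, so only the noise term $1/\beta$ survives and the variance is even lower than inside the training range.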

# :two: TP2 : Uncertainty applications

------


# TP2 PART 1 LOGISTIC REGRESSION
------
- :one: classic
- :two: Laplace posterior
- :three: Variational inference (= weights are Gaussians)

--------------------------------------

### Logistic regression

|  :one:  Logistic regression   | :two: Laplace posterior approximation | :three: Variational inference |
|:----------------------:|:----------------------:|:----------------------:|
| ![Logistic regression](figures/logistic_regression.gif)|![Logistic regression with Laplace approximation](figures/logistic_regression_laplace.gif) | ![vi](figures/variational_logistic_regression.gif) |
- $w_{MAP}$: maximum a posteriori estimate of the weights = trained model weights
- Classic posterior: $p(\mathbf{y}=1 | \pmb{x}, \pmb{w}_{\textrm{MAP}}) = \sigma(w_{MAP}^T x+b)$
  - the probability is simply the output of the model
  - the distribution of the weights is a Dirac delta centered at $w_{MAP}$ (meaning we trust these weights 100%)
- Laplace posterior approximation of the uncertainty. We assume that the weights follow a Gaussian distribution $\mathcal{N}(\pmb{w} ; \pmb{\mu}_{lap}, \pmb{\Sigma}_{lap}^2)$.
  - It makes sense that ${\mu}_{lap} = w_{MAP}$.
  - If we set $\pmb{\Sigma}_{lap}$ to 0, we fall back to the previous "certain" case as a degenerate example.
- Variational inference trains the "weights" under the assumption that they are drawn from Gaussian distributions. Instead of regressing the weights directly, we learn their distribution (mean and variance).

##### Training logistic regression


![](figures/training_logistic_regression.png)

Training of the logistic regression model looks alright. Good to go.

##### [Q 1.1]: Looking at $p(\mathbf{y}=1 | \pmb{x}, \pmb{w}_{\textrm{MAP}})$, what can you say about points far from train distribution?

![classic](figures/logistic_regression_CLASSIC.png)

"Naïve" uncertainty deduced from the logits (in a classical inference with the MAP weights) **does not increase** (remains constant actually) when the distance to training data increases.

##### [Q 1.2] Comment Laplace’s approximation results
![laplace](figures/logistic_regression_LAPLACE.png)

**Laplace posterior approximation**
- Unlike the naïve uncertainty, the **Laplace posterior approximation** is able to **increase uncertainty** as new samples get farther away from the training distribution.
- We note that the mean orientation of the separation line is the same as in the classic case. This comes from the fact that ${\mu}_{lap} = w_{MAP}$.

---
This is a powerful trick available at almost no extra cost:
- we didn't have to do anything specific to the training loop;
- we just take the logistic regressor out of the box, train it as usual and instrument it.

> This simplicity also comes from PyTorch, which allows retrieving gradients, and gradients of gradients, of the loss with respect to the linear weight matrix $w$. :+1: *thank you PyTorch!*
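To make the Laplace recipe concrete, here is a NumPy sketch that swaps PyTorch's automatic differentiation for the analytic Hessian of logistic regression. The toy data, the prior precision `lam`, the absence of a bias term and the probit approximation used for the predictive probability are all assumptions made for illustration; this is not the notebook's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(+1.0, 1.0, size=(n, 2)),    # class 1 blob
               rng.normal(-1.0, 1.0, size=(n, 2))])   # class 0 blob
y = np.concatenate([np.ones(n), np.zeros(n)])
lam = 1.0   # assumed prior precision (plays the role of weight decay)

# Step 1: usual training, here the MAP estimate by Newton's method (no bias term).
w_map = np.zeros(2)
for _ in range(25):
    p = sigmoid(X @ w_map)
    grad = X.T @ (p - y) + lam * w_map
    hess = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(2)
    w_map -= np.linalg.solve(hess, grad)

# Step 2: Laplace approximation, a Gaussian centred at w_MAP whose covariance
# is the inverse Hessian of the negative log posterior at w_MAP.
p = sigmoid(X @ w_map)
hess = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(2)
Sigma_lap = np.linalg.inv(hess)

def predict(x):
    """Return (MAP probability, Laplace-moderated probability, logit variance)."""
    mean_a = x @ w_map
    var_a = x @ Sigma_lap @ x
    p_map = sigmoid(mean_a)
    # probit approximation of the expectation of the sigmoid under N(mean_a, var_a)
    p_lap = sigmoid(mean_a / np.sqrt(1.0 + np.pi * var_a / 8.0))
    return p_map, p_lap, var_a

near = predict(np.array([1.0, 1.0]))
far = predict(np.array([5.0, 5.0]))
print("near:", near)
print("far: ", far)
```

The logit variance grows quadratically with the distance to the origin, so the moderated probability of a far-away point stays below the MAP probability, while both agree on which side of the boundary the point lies.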

### [Part I.3] « Variational inference » :
- comment the class LinearVariational.
- What is the main difference between Laplace’s and VI’s approximations?

-------
#### Note on initialization
:zap: Note: at first sight it looked very strange to reach 100% accuracy at the first step (such a quick training). Initializing the weights and biases with a zero mean instead of random values (as we usually do when training neural networks) helps a lot here.
- In particular, the line separating the red and blue dots goes through 0, so $b=[0, 0]$ seems like a good initializer.
- We initialize with $\mu_{w}^{(t=0)} = \mu_{b}^{(t=0)} = 0$
- and $\rho_{w}^{(t=0)} = \rho_{b}^{(t=0)} = \log(e-1)$, so the initial weights follow the standard Gaussian prior. At initialization, the KL term is therefore 0: $KL(p || q_{\mu, \theta} ) = 0$

To be totally fair when comparing with the previous methods, the prior standard deviation should involve `WEIGHT_DECAY`.
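The initialization described above can be checked numerically. The sketch assumes the standard deviation is parameterized as $\sigma = \mathrm{softplus}(\rho)$, which is what the value $\log(e-1)$ suggests, and a standard normal prior; the vector size of 3 is arbitrary.

```python
import numpy as np

def softplus(rho):
    """softplus(rho) = log(1 + exp(rho)), keeps the standard deviation positive."""
    return np.log1p(np.exp(rho))

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent weights."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))

mu0 = np.zeros(3)                          # zero-mean initialization
rho0 = np.full(3, np.log(np.e - 1.0))      # rho = log(e - 1)
sigma0 = softplus(rho0)                    # log(1 + (e - 1)) = 1
kl0 = kl_to_standard_normal(mu0, sigma0)
print(sigma0, kl0)
```

With $\mu = 0$ and $\sigma = 1$ the variational distribution coincides with the prior, so the KL term starts at zero.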


|  $\mu_{w \& b}^{(t=0)} = 0$    | $\mu_{w \& b}^{(t=0)}$  random|
|:----------------------:|:----------------------:|
| ![](figures/variational_logistic_regression.gif) | ![](figures/variational_logistic_regression_from_random_seed.gif)|


--------

# TP2 PART 2

-----

- :one: classic (MAP weights)
- :two: Bayesian MLP
- :three: Dropout (+MC dropout sampling at inference)

| :one: classic  MLP   | :two: Bayesian MLP | :three: MLP + MC dropout |
|:----------------------:|:----------------------:|:----------------------:|
| ![](figures/classic_MLP_classifier__dropout_0.0.gif)|![](figures/variational_MLP_classifier.gif) | ![dropout](figures/dropout_MLP_classifier__dropout_0.2.gif) |



#### Variational MLP
Let's apply the variational technique to an MLP with one hidden layer.
The shape of the uncertainty now becomes much more complex than in the linear case.

![](figures/variational_MLP_classifier.gif)



##### [Q2.1] Again, analyze the results shown on the plot. What is the benefit of MC Dropout variational inference over Bayesian Logistic Regression with variational inference?

![](figures/MLP_mc_dropout.png)

# TP3 : Uncertainty applications

#### LeNet training


![](figures/Lenet_training_losses.png)

##### [I.1] Monte Carlo dropout sampling to estimate confidence

> Question : Comment results for investigating most uncertain vs confident samples.


![](figures/vr_ratios.png)
Variation-ratio curves for the MNIST dataset.
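For reference, a variation ratio can be computed from the classes predicted over $T$ stochastic forward passes (dropout kept active at inference). This is a minimal sketch with made-up predictions, not the notebook's code; `variation_ratio` is a hypothetical helper name.

```python
import numpy as np

def variation_ratio(pred_classes, num_classes):
    """pred_classes: (T, N) integer array, class predicted at each of T passes.

    Returns 1 - (frequency of the most voted class) for each of the N samples.
    """
    T = pred_classes.shape[0]
    counts = np.stack([(pred_classes == c).sum(axis=0) for c in range(num_classes)])
    return 1.0 - counts.max(axis=0) / T

# Made-up predictions: 4 stochastic passes over 2 images.
preds = np.array([[3, 7],
                  [3, 1],
                  [3, 7],
                  [3, 9]])
vr = variation_ratio(preds, num_classes=10)
print(vr)
```

The first image gets the same class at every pass (ratio 0, most confident), while the second one only gets its majority class in 2 passes out of 4 (ratio 0.5, more confusing).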

| Most confident | Most confusing|
|:-----:|:-----:|
| ![](figures/most_confident_samples.png) | ![](figures/most_confusing_samples.png) |



##### [II] Failure prediction:

--------

> Explain the goal of failure prediction

First of all, **failure prediction** is critical in an autonomous system: no engineer should assume a system is perfect, and safety and emergency mechanisms should always be designed in.

If the autonomous system is overconfident, or simply does not signal that it is making a mistake, there can be real consequences.


Here are the 3 main goals I thought of:
- Reliability and **build trust** in the system: 
  - By predicting when a model might fail, we can improve the reliability of machine-learning based systems, especially in critical applications like autonomous cars, healthcare, or finance, where mistakes can have serious consequences.
  - A rough "bad" example: ChatGPT, arguably the most widespread and popular machine-learning-based technology today, gives no explicit indication of confidence in its answers. *We see from this lab session that it's not an easy thing either*. Although there are warnings everywhere on the website (`ChatGPT can make mistakes. Consider checking important information.`), you can get answers whose content is wrong but whose form is polished, so they look like good answers. The issue with such a sometimes *deceptive* system is that you tend to forget it can make mistakes.
- **Improve model performance**: failure prediction can help identify weaknesses in a model:
   - This insight allows data scientists and engineers to refine and improve the model, either by retraining with more diverse data, tweaking the architecture, or applying different techniques to handle potential failure cases.
   - For instance, reviewing the most confusing MNIST examples (using MC-dropout-based confidence) showed us what they look like. We basically learned what is causing trouble to the network. From there we could try to get more samples of this kind.
   - After mining some "hard examples", you may start collecting new data to improve your system's performance.
- **Safety and Risk Management**: 
  - In safety-critical systems, such as medical diagnosis or industrial automation, predicting failures is crucial for risk management. By understanding when and how a model might fail, steps can be taken to mitigate these risks, either through human intervention or automatic **safeguards**. 
  - Detecting that a failure has occurred can even hand control back to a human or another manual system, trigger an emergency procedure, etc.
   
  

--------
##### Comment the code of the LeNetConfidNet class [II.1]

The implementation is very similar to the originally defined `LeNet` architecture.
- Having the same `conv1`, `conv2`, `fc1`, `fc2` names for the modules allows reloading the weights in `LeNetConfidNet` to initialize from the baseline classifier.
- The key difference is simply an additional regression head made of 4 fully connected layers, plugged on top of the activation output of `fc1`.
- All layers not named `uncertainty` are frozen (variable naming matters here). This leaves the image backbone and the classifier, and therefore the classification performance, unchanged during training.


**Note on code duplication**: 


A cleaner and safer way to implement this and avoid risky copy-paste typos, I believe, would be to give `LeNetConfidNet` an attribute for the backbone which holds a `LeNet` instance.
This would have avoided:
- code duplication
- the risk of module names not being shared

It would only require a minimal change in `LeNet`: returning both the prediction and the output of `fc1`.
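The composition idea suggested above could look like the following PyTorch sketch. The layer sizes are assumptions chosen to be consistent with 28x28 inputs and the 4x4x32 constant, the hidden width of the head (400) and the `return_features` flag are hypothetical, and this is not the notebook's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    """Baseline classifier (layer sizes are assumptions, see the text above)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)
        self.fc1 = nn.Linear(4 * 4 * 32, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x, return_features=False):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28 -> 24 -> 12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 12 -> 8 -> 4
        feats = F.relu(self.fc1(x.flatten(1)))
        logits = self.fc2(feats)
        # the only change needed in LeNet: optionally expose the fc1 activation
        return (logits, feats) if return_features else logits

class LeNetConfidNet(nn.Module):
    """Wraps an existing LeNet instead of duplicating its layers."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():
            param.requires_grad = False              # freeze backbone and classifier
        self.uncertainty = nn.Sequential(            # 4 fully connected layers
            nn.Linear(100, 400), nn.ReLU(),
            nn.Linear(400, 400), nn.ReLU(),
            nn.Linear(400, 400), nn.ReLU(),
            nn.Linear(400, 1),
        )

    def forward(self, x):
        logits, feats = self.backbone(x, return_features=True)
        return logits, self.uncertainty(feats)

model = LeNetConfidNet(LeNet())
logits, confidence = model(torch.zeros(2, 1, 28, 28))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(logits.shape, confidence.shape, trainable)
```

With this layout, a trained baseline can be passed directly as `backbone`, and freezing no longer depends on matching module names by hand.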

**Note on adaptability to image sizes**:

Although there is a function called `num_flat_features`, which suggests that the code could adapt to other image sizes on the fly, all sizes are hardcoded and baked in for 28x28 MNIST images here (the infamous 4x4x32 constant).
This is due to going from convolutional layer outputs to a fixed classifier size.
There are alternatives to this "problem", for instance performing a global pooling operation before classifying, which removes the spatial dimensions.
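A tiny NumPy sketch of the global pooling idea: averaging over the spatial axes yields a vector whose size depends only on the number of channels. The shapes are illustrative; in PyTorch, `nn.AdaptiveAvgPool2d(1)` plays a similar role.

```python
import numpy as np

def global_average_pool(feature_map):
    """(C, H, W) feature map -> (C,) vector, whatever H and W are."""
    return feature_map.mean(axis=(1, 2))

small = global_average_pool(np.ones((32, 4, 4)))     # e.g. from a 28x28 input
large = global_average_pool(np.ones((32, 13, 13)))   # e.g. from a larger input
print(small.shape, large.shape)
```

Both outputs have 32 entries, so the fully connected layer that follows would no longer depend on the input resolution.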

-------
##### Analyze results between MCP, MCDropout and ConfidNet [II.2]

![](figures/confid_net_AUPR.png)

The larger the area under the precision-recall curve (AUPR), the better the system.
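As a reminder of what this metric measures, here is a sketch of average precision, a common step-wise estimate of the AUPR. The labels and scores are made up, and the notebook may compute the AUPR differently (for instance by choosing errors rather than successes as the positive class).

```python
import numpy as np

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    order = np.argsort(-scores, kind="stable")       # highest score first
    ranked = labels[order]
    hits = np.cumsum(ranked)                         # positives found so far
    precision = hits / np.arange(1, ranked.size + 1)
    return float(np.sum(precision * ranked) / ranked.sum())

labels = np.array([1, 0, 1, 1])            # 1 = positive class
scores = np.array([0.9, 0.8, 0.7, 0.1])    # made-up confidence scores
ap = average_precision(labels, scores)
ap_perfect = average_precision(labels, np.array([0.9, 0.1, 0.8, 0.7]))
print(ap, ap_perfect)
```

A confidence measure that ranks every positive above every negative reaches 1.0; each negative ranked too high lowers the precision at the positives that come after it.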

Note on ConfidNet:
When looking at the AUPR test and validation losses, we clearly see some fluctuations. 
One cannot simply pick the best checkpoint based on the validation error here, as this would be cheating.



##### [III.1] OOD detection: analyse results and explain the difference between the 3 methods.
> Compare the precision-recall curves of each OOD method along with their AUPR values. Which method performs best and why?



![](figures/ODIN_AUPR.png)


Overall, the out-of-distribution detection performance looks pretty impressive in this case.
- ODIN has the best performance (97.80%).