# TP2 : Uncertainty applications

### Logistic regression

|  Logistic regression   | Laplace posterior approximation |
|:----------------------:|:----------------------:|
| ![Logistic regression](figures/logistic_regression.gif)|![Logistic regression with Laplace approximation](figures/logistic_regression_laplace.gif) |
- $\pmb{w}_{\textrm{MAP}}$: maximum a posteriori (MAP) estimate of the weights, i.e. the trained model weights.
- Classic posterior: $p(\mathbf{y}=1 | \pmb{x}, \pmb{w}_{\textrm{MAP}}) = \sigma(\pmb{w}_{\textrm{MAP}}^T \pmb{x} + b)$
  - The probability is simply the output of the model.
  - The distribution of the weights is a Dirac delta centered at $\pmb{w}_{\textrm{MAP}}$ (meaning we trust these weights 100%).
- Laplace posterior approximation: we assume that the weights follow a Gaussian distribution $\mathcal{N}(\pmb{w} ; \pmb{\mu}_{lap}, \pmb{\Sigma}_{lap}^2)$.
  - It makes sense that $\pmb{\mu}_{lap} = \pmb{w}_{\textrm{MAP}}$, since the Gaussian is centered at the mode of the posterior.
  - Setting $\pmb{\Sigma}_{lap}$ to 0 recovers the previous "certain" case as a degenerate example.
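
For reference, the two quantities behind this approximation can be written out as below. This is the textbook form for logistic regression, not something read from the lab code. It assumes an isotropic Gaussian prior of variance $\sigma_0^2$ on the weights, a bias $b$ kept fixed at its trained value, labels in $\{0, 1\}$, and $S$ Monte Carlo samples; $\sigma_0$, $S$ and the shorthand $\sigma_n = \sigma(\pmb{w}_{\textrm{MAP}}^T \pmb{x}_n + b)$ are notations introduced here.

$$
\left(\pmb{\Sigma}_{lap}^2\right)^{-1} = -\nabla_{\pmb{w}}\nabla_{\pmb{w}} \log p(\pmb{w} | \mathcal{D}) \Big|_{\pmb{w} = \pmb{w}_{\textrm{MAP}}} = \frac{1}{\sigma_0^2} \pmb{I} + \sum_{n} \sigma_n (1 - \sigma_n) \, \pmb{x}_n \pmb{x}_n^T
$$

$$
p(\mathbf{y}=1 | \pmb{x}, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} \sigma(\pmb{w}_s^T \pmb{x} + b), \qquad \pmb{w}_s \sim \mathcal{N}(\pmb{w} ; \pmb{\mu}_{lap}, \pmb{\Sigma}_{lap}^2)
$$

The covariance is the inverse of the Hessian of the negative log posterior at the MAP weights, so directions that the training data constrain weakly get a large variance.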

##### [Q 1.1]: Looking at $p(\mathbf{y}=1 | \pmb{x}, \pmb{w}_{\textrm{MAP}})$, what can you say about points far from the training distribution?

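![classic](figures/logistic_regression_CLASSIC.png)

"Naïve" uncertainty deduced from the logits (classical inference with the MAP weights) **does not increase** when the distance to the training data increases. The predicted probability depends only on the signed distance to the decision boundary: it remains constant along directions parallel to the boundary, and the model even becomes more confident when moving away from the boundary.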

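The behaviour described above can be checked on a toy model. This is a minimal sketch using only the standard library; the weights `w_map`, `b_map` and the three test points are hypothetical values chosen for illustration, not the ones learned in the lab.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def map_probability(x, w, b):
    """p(y=1 | x, w_MAP) for a linear logistic model."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(logit)

# Hypothetical MAP weights: the decision boundary is the vertical line x1 = 0.
w_map, b_map = (2.0, 0.0), 0.0

near = (1.0, 0.0)       # close to where training data would lie
far = (1.0, 50.0)       # same distance to the boundary, but far along it
farther = (5.0, 50.0)   # also farther from the boundary

p_near = map_probability(near, w_map, b_map)
p_far = map_probability(far, w_map, b_map)
p_farther = map_probability(farther, w_map, b_map)
print(p_near, p_far, p_farther)
```

The first two probabilities are identical although the second point is far from any plausible training sample, and the third one is even closer to 1.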
##### [Q 1.2]: Comment on the results of Laplace’s approximation
![laplace](figures/logistic_regression_LAPLACE.png)

**Laplace posterior approximation**
- Unlike the naïve uncertainty, **Laplace posterior approximation** is able to **increase uncertainty** as new samples get farther away from the training distribution.
- We note that the mean orientation of the separation line is the same as in the classic case. This comes from the fact that $\pmb{\mu}_{lap} = \pmb{w}_{\textrm{MAP}}$.
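
To make the effect concrete, here is a minimal sketch of a Laplace predictive on a toy model. It swaps weight sampling for the classical probit approximation $\sigma(\kappa \, \mu_a)$ with $\kappa = (1 + \pi \sigma_a^2 / 8)^{-1/2}$, so that the result is deterministic; the lab code may well rely on sampling instead. The mean `mu_lap`, the diagonal covariance `var_lap` and the test points are hypothetical values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def laplace_probability(x, mu, var_diag, b):
    """Approximate p(y=1 | x) when w ~ N(mu, diag(var_diag)) (probit approximation)."""
    mean_logit = sum(m * xi for m, xi in zip(mu, x)) + b
    var_logit = sum(v * xi * xi for v, xi in zip(var_diag, x))  # x^T Sigma x
    kappa = 1.0 / math.sqrt(1.0 + math.pi * var_logit / 8.0)
    return sigmoid(kappa * mean_logit)

mu_lap, b_map = (2.0, 0.0), 0.0   # hypothetical, with mu_lap = w_MAP
var_lap = (0.5, 0.5)              # hypothetical diagonal covariance (assumption)

p_map = sigmoid(2.0)              # MAP probability, identical for both points below
p_near = laplace_probability((1.0, 0.0), mu_lap, var_lap, b_map)
p_far = laplace_probability((1.0, 20.0), mu_lap, var_lap, b_map)
p_degenerate = laplace_probability((1.0, 20.0), mu_lap, (0.0, 0.0), b_map)
print(p_map, p_near, p_far, p_degenerate)
```

Both points share the same MAP probability, but the Laplace predictive of the far point is pulled toward 0.5 while the predicted class is unchanged. With a zero covariance, the MAP probability is recovered.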

---
This is a powerful trick available at almost no extra cost, as we didn't have to do anything specific in the training loop: just take the logistic regressor out of the box, train it as usual and instrument it afterwards.

> This simplicity also comes from PyTorch, which allows retrieving the gradients of the loss with respect to the linear weight matrix $w$. :+1: *Thank you PyTorch!*

### [Part I.3] « Variational inference » :
- Comment on the class `LinearVariational`.
- What is the main difference between Laplace’s and VI’s approximations?

# TP3 : Uncertainty applications

##### [I.1] Monte Carlo dropout sampling to estimate confidence
> Question: Comment on the results when investigating the most uncertain vs. the most confident samples.

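As a reminder of how confidence can be derived from Monte Carlo dropout, here is a minimal sketch that aggregates the softmax outputs of several stochastic forward passes into a mean prediction and a predictive entropy. The function name `mc_dropout_summary` and the two sets of probabilities are made up for illustration; they are not outputs of the lab model, and the lab may use other uncertainty measures.

```python
import math

def mc_dropout_summary(prob_samples):
    """Aggregate T stochastic softmax outputs (one list of class probabilities per pass)."""
    n_passes = len(prob_samples)
    n_classes = len(prob_samples[0])
    mean_probs = [sum(p[c] for p in prob_samples) / n_passes for c in range(n_classes)]
    # Entropy of the averaged prediction: high when the passes disagree or are flat.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted, mean_probs, entropy

# Toy values: three dropout passes over a 3-class problem.
confident = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02], [0.98, 0.01, 0.01]]
uncertain = [[0.70, 0.20, 0.10], [0.20, 0.70, 0.10], [0.30, 0.30, 0.40]]

pred_c, mean_c, entropy_c = mc_dropout_summary(confident)
pred_u, mean_u, entropy_u = mc_dropout_summary(uncertain)
print(pred_c, entropy_c, entropy_u)
```

Sorting samples by such a score is one way to pull out the most uncertain and the most confident examples.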
##### [II] Failure prediction:

--------

> Explain the goal of failure prediction

First of all, **failure prediction** in an autonomous system is critical: no engineer should assume a system is perfect, and safety and emergency mechanisms should always be part of the design.

If the autonomous system is overconfident, or simply does not signal that it is making a mistake, there can be real consequences.


Here are the 3 main goals I thought of:
- Improve reliability and **build trust** in the system:
  - By predicting when a model might fail, we can improve the reliability of machine-learning-based systems, especially in critical applications like autonomous cars, healthcare, or finance, where mistakes can have serious consequences.
  - A rough "bad" example: ChatGPT, the most widespread and popular machine-learning-based technology today, gives no explicit indication of confidence in its answers. *We see from this lab session that it's not an easy thing to do either*. Although the website displays the warning `ChatGPT can make mistakes. Consider checking important information.`, you can get answers whose content is wrong but whose form is good, so they look like good answers. The issue with such a sometimes *deceptive* system is that you tend to forget it can make mistakes.
- **Improve model performance**: failure prediction can help identify weaknesses in a model:
  - This insight allows data scientists and engineers to refine and improve the model, either by retraining with more diverse data, tweaking the architecture, or applying different techniques to handle potential failure cases.
  - For instance, reviewing the most confusing MNIST examples (using MC-dropout-based confidence) answered the question "what do they look like?". We basically learnt what is causing trouble for the network. From there, we could try to get more samples of this kind.
  - After mining some "hard examples", you may start collecting new data to improve your system's performance.
- **Safety and risk management**:
  - In safety-critical systems, such as medical diagnosis or industrial automation, predicting failures is crucial for risk management. By understanding when and how a model might fail, steps can be taken to mitigate these risks, either through human intervention or automatic **safeguards**.
  - Detecting that a failure has occurred can even hand control back to a human or another manual system, trigger an emergency procedure, etc.
   
  

--------
##### Comment the code of the LeNetConfidNet class [II.1]

The implementation is very similar to the originally defined `LeNet` architecture.
- Having the same `conv1`, `conv2`, `fc1`, `fc2` names for the modules allows reloading the weights in `LeNetConfidNet` to initialize from the baseline classifier.
- The key difference is an additional regression head made of 4 fully connected layers, plugged on top of the activated output of `fc1`.
- All layers whose name does not contain `uncertainty` are frozen (variable naming matters here). This keeps the image backbone, the classifier and therefore the classification performance unchanged while the confidence head is trained.
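
The name-based freezing can be sketched as follows. This is a minimal illustration that assumes PyTorch is installed; `TinyConfidNet` and `freeze_all_but_uncertainty` are hypothetical stand-ins written for this note, not the lab's actual class or training code.

```python
import torch.nn as nn

class TinyConfidNet(nn.Module):
    """Hypothetical stand-in for LeNetConfidNet, only to illustrate freezing."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 4)            # part of the pretrained classifier
        self.fc2 = nn.Linear(4, 2)            # classification head
        self.uncertainty1 = nn.Linear(4, 1)   # confidence head (stays trainable)

def freeze_all_but_uncertainty(model):
    """Disable gradients for every parameter whose name lacks 'uncertainty'."""
    for name, param in model.named_parameters():
        param.requires_grad = "uncertainty" in name

model = TinyConfidNet()
freeze_all_but_uncertainty(model)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Only the parameters of the confidence head remain trainable, which is why a typo in a layer name would silently change what gets trained.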


**Note on code duplication**: 


A cleaner and safer way to implement this, avoiding risky copy-paste typos, would I believe be to give `LeNetConfidNet` a backbone attribute that holds a `LeNet` instance.
This would have:
- avoided code duplication,
- removed the risk of module names not being shared,
- required only a minimal change in `LeNet`, which would simply have to return both the prediction and the output of `fc1`.
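
A possible shape for this refactoring is sketched below, assuming PyTorch is installed. The class names `LeNetSketch` and `ConfidNetSketch`, the `return_features` flag and all layer sizes are assumptions made for illustration (the confidence head is also shortened to two layers); they are not the lab's actual definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetSketch(nn.Module):
    """Hypothetical LeNet-like classifier; layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)
        self.fc1 = nn.Linear(32 * 4 * 4, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x, return_features=False):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        features = F.relu(self.fc1(torch.flatten(x, 1)))
        logits = self.fc2(features)
        return (logits, features) if return_features else logits

class ConfidNetSketch(nn.Module):
    """Wraps the classifier instead of duplicating its layers."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.uncertainty = nn.Sequential(
            nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        logits, features = self.backbone(x, return_features=True)
        return logits, self.uncertainty(features)

backbone = LeNetSketch()             # in practice, the trained weights would be loaded here
model = ConfidNetSketch(backbone)
logits, confidence = model(torch.zeros(2, 1, 28, 28))
print(logits.shape, confidence.shape)
```

With this layout the classifier weights live in a single place, and freezing reduces to disabling gradients on `model.backbone`.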

**Note on adaptability to image sizes**:

Although there's a function called `num_flat_features`, which suggests that the code could adapt to other image sizes on the fly, all sizes are hardcoded for MNIST 28x28 here (the infamous 4x4x32 constant).
This is due to going from convolutional layer outputs to a fully connected classifier with a fixed input size.
One alternative is to perform a global pooling operation before classifying, which removes the spatial dimensions.
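
The global pooling alternative can be sketched as follows, assuming PyTorch is installed. `PooledClassifier` and its layer sizes are hypothetical; the point is only that `nn.AdaptiveAvgPool2d(1)` makes the classifier input size independent of the image size.

```python
import torch
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Hypothetical conv net whose head does not depend on the input resolution."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2))
        self.pool = nn.AdaptiveAvgPool2d(1)   # (N, 32, H, W) -> (N, 32, 1, 1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(torch.flatten(x, 1))

model = PooledClassifier()
out_28 = model(torch.zeros(2, 1, 28, 28))   # MNIST-sized input
out_64 = model(torch.zeros(2, 1, 64, 64))   # larger input, same network
print(out_28.shape, out_64.shape)
```

The trade-off is that averaging discards spatial layout, which may or may not matter for a given task.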

-------
##### Analyze results between MCP, MCDropout and ConfidNet [II.2]

##### [III.1] OOD detection: analyse results and explain the difference between the 3 methods.
Compare the precision-recall curves of each OOD method along with their AUPR values. Which method performs best and why?
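
As a reminder of what the AUPR measures, here is a minimal sketch of a step-wise area under the precision-recall curve (average precision) using only the standard library. The function `average_precision`, the labels and both score lists are hand-made toy values, not results from the lab, and ties between scores are not handled.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: higher means "more likely OOD"; labels: 1 for OOD, 0 for in-distribution.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_positive = sum(labels)
    true_positives = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            area += true_positives / rank   # precision at this recall step
    return area / n_positive

# Toy values: 3 OOD samples (label 1) and 3 in-distribution samples (label 0).
labels = [1, 1, 0, 1, 0, 0]
good_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]   # ranks every OOD sample first
poor_scores = [0.2, 0.6, 0.9, 0.1, 0.8, 0.3]   # ranks in-distribution samples first

aupr_good = average_precision(good_scores, labels)
aupr_poor = average_precision(poor_scores, labels)
print(aupr_good, aupr_poor)
```

A method whose uncertainty score ranks OOD samples above in-distribution ones gets an AUPR close to 1, which is the criterion for comparing the three methods.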