##### Authors: Rafael Dousse, Eva Ray, Massimo Stefani

# Exercise 3 - Review questions

**a) Assuming a univariate input *x*, what is the complexity at inference time of a Bayesian classifier based on histogram computation of the likelihood?**

The complexity is O(K) where K is the number of classes.

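For a univariate feature (D = 1), evaluating a histogram-based likelihood is a single bin lookup per class. The lookup is O(1) when the bins have equal width, because the bin index is computed directly from x; with B arbitrary bin edges, a binary search takes O(log B). The likelihood is then multiplied by the prior and the K resulting scores are compared. With constant-time lookups, the total cost is O(K·1) = O(K). (If K is treated as a small constant, this is sometimes written as O(1).) See slide 26 of the week 3 lecture.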
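
As a rough illustration of where the O(K) comes from, the sketch below loops once over the classes and does one bin lookup per class. The function name `predict_class` and its arguments are hypothetical and not taken from exercise 1.a; the histograms are assumed to hold normalized bin probabilities, and `np.searchsorted` is used for the lookup, so each lookup here is O(log B) rather than strictly O(1).

```python
import numpy as np

def predict_class(x, priors, hist_values, edge_values):
    """Return the index of the class maximizing P(x|C) * P(C)."""
    scores = []
    # One iteration per class: this loop is the O(K) part
    for prior, hist, edges in zip(priors, hist_values, edge_values):
        # Find the half-open bin [edges[b], edges[b + 1]) that contains x
        b = np.searchsorted(edges, x, side="right") - 1
        # Outside the histogram range, the estimated likelihood is zero
        likelihood = hist[b] if 0 <= b < len(hist) else 0.0
        scores.append(likelihood * prior)
    return int(np.argmax(scores))

# Hypothetical two-class example sharing the same bin edges
edges = np.array([0.0, 1.0, 2.0, 3.0])
hists = [np.array([0.8, 0.15, 0.05]), np.array([0.05, 0.15, 0.8])]
priors = [0.5, 0.5]
low = predict_class(0.5, priors, hists, [edges, edges])
high = predict_class(2.5, priors, hists, [edges, edges])
```

With these made-up histograms, a small x falls in a bin dominated by class 0 and a large x in a bin dominated by class 1.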

**b) Bayesian models are said to be generative as they can be used to generate new samples. Taking the implementation of the exercise 1.a, explain the steps to generate new samples using the system you have put into place.**
 

The procedure is as follows:
1. Draw a class C according to the prior distribution P(C).
2. Draw a sample x from the likelihood distribution P(x|C) corresponding to the selected class C. In our case, we can use the histogram to determine the bin probabilities and sample accordingly.
3. Repeat the above steps to generate as many samples as needed.

***Optional*: Provide an implementation in a function generateSample(priors, histValues, edgeValues, n)**

```python
import numpy as np

def generateSample(priors, histValues, edgeValues, n):
    """Draw n (x, C) pairs from the class priors and per-class histograms."""
    samples = []
    classes = np.arange(len(priors))  # one class index per prior
    for _ in range(n):
        # Step 1: Draw a class C according to the prior distribution P(C).
        C = np.random.choice(classes, p=priors)
        # Step 2: Draw a sample x from the likelihood distribution P(x|C).
        hist = np.asarray(histValues[C], dtype=float)
        edges = edgeValues[C]
        # Choose a bin with probability proportional to its histogram value
        # (valid for counts, or for densities with equal-width bins)
        bin_idx = np.random.choice(len(hist), p=hist / hist.sum())
        # Uniformly sample within the chosen bin
        x = np.random.uniform(edges[bin_idx], edges[bin_idx + 1])
        samples.append((x, C))
    return samples
```


**c) What is the minimum overall accuracy of a 2-class system relying only on priors and that is built on a training set that includes 5 times more samples in class A than in class B?**

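Class A has 5 times more samples than class B, so P(A) = 5 · P(B). With P(A) + P(B) = 1, this gives P(A) = 5/6 and P(B) = 1/6. A system relying only on priors always predicts the most probable class, which is class A. Its overall accuracy is therefore at least P(A) = 5/6 ≈ 83.3%, assuming the evaluation data follow the same class proportions as the training set.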
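
The arithmetic can be checked with a few lines of Python. The counts below are hypothetical (any 5:1 ratio gives the same result), and the accuracy figure assumes the evaluation set has the same class proportions as the training set.

```python
# Hypothetical counts: class A has 5 times as many training samples as class B
counts = {"A": 500, "B": 100}
total = sum(counts.values())

# Priors estimated as relative class frequencies
priors = {label: n / total for label, n in counts.items()}

# A prior-only classifier always predicts the class with the largest prior
majority = max(priors, key=priors.get)
baseline_accuracy = priors[majority]

print(majority, round(baseline_accuracy, 4))  # A 0.8333
```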

**d) Let’s look back at the PW02 exercise 3 of last week. We have built a KNN classification system for images of digits on the MNIST database.**

**How would you build a Bayesian classification for the same task? Comment on the prior probabilities and on the likelihood estimators. More specifically, what kind of likelihood estimator could we use in this case?**

To build a Bayesian classification system for the MNIST digit recognition task, we proceed as follows:
1. **Prior Probabilities**: We estimate the prior probabilities P(C) for each digit class (0-9) based on the frequency of each class in the training dataset. Since the MNIST dataset is relatively balanced, we expect the priors to be roughly equal, but we still compute them from the training data.
2. **Likelihood Estimators**: For the likelihood P(x|C), where x is the image data, a histogram-based approach is possible in principle, but it suffers from the curse of dimensionality: a joint histogram over the 784 pixels of a 28×28 image would need a number of bins that grows exponentially with the number of pixels, and almost all of them would be empty. Instead, we use a Gaussian Naive Bayes approach, where we assume that the pixels are independent given the class and that each pixel value is normally distributed within a class. We estimate the mean and variance of each pixel for each class from the training data. We work with log-probabilities to prevent floating-point underflow when multiplying many small probabilities.
3. **Classification**: For a new image, we compute the posterior probabilities for each class using Bayes' theorem and classify the image to the class with the highest posterior probability.
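
Written out, the decision rule described in the steps above would look roughly as follows. The notation is our own assumption: D is the number of pixels, and μ and σ² denote the per-class, per-pixel mean and variance estimated on the training set. The evidence P(x) is dropped because it is the same for every class.

$$
\hat{C} = \arg\max_{C} \left[ \log P(C) - \frac{1}{2} \sum_{d=1}^{D} \left( \log\left(2\pi\sigma_{C,d}^{2}\right) + \frac{(x_d - \mu_{C,d})^{2}}{\sigma_{C,d}^{2}} \right) \right]
$$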

***Optional:* implement it and report performance!**

```python
import numpy as np

class GaussianNaiveBayesMNIST:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        # Per-class mean and variance of each pixel
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # A small constant is added to the variance to avoid division by zero
        self.vars = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])

    def predict(self, X):
        log_probs = []
        for idx, c in enumerate(self.classes):
            mean = self.means[idx]
            var = self.vars[idx]
            prior = np.log(self.priors[idx])
            # log-likelihood pixel-wise
            log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + ((X - mean) ** 2) / var, axis=1)
            log_probs.append(prior + log_likelihood)
        return self.classes[np.argmax(log_probs, axis=0)]

# Train and test (X_train, y_train, X_test, y_test come from the PW02 exercise 3 notebook)
model = GaussianNaiveBayesMNIST()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.3f}")
```

We ran this code in the notebook of PW02 exercise 3 and obtained an accuracy of 0.542. This is significantly lower than the KNN classifier. There could be several reasons for this performance difference:
- The strong independence assumption of Naive Bayes does not hold for MNIST images, where neighboring pixel values are strongly correlated.
- The Gaussian assumption for pixel values may not accurately capture the true distribution of pixel intensities (0-255) in the images.
- KNN can capture more complex decision boundaries by considering local neighborhoods, while Naive Bayes relies on global statistics.

**e) Read [europe-border-control-ai-lie-detector](https://theintercept.com/2019/07/26/europe-border-control-ai-lie-detector/). The described system is "a virtual policeman designed to strengthen European borders". It can be seen as a 2-class problem, either you are a suspicious traveler or you are not. If you are declared as suspicious by the system, you are routed to a human border agent who analyses your case in a more careful way.**

1. What kind of errors can the system make? Explain them in your own words.
2. Is one error more critical than the other? Explain why.
3. According to the previous points, which metric would you recommend to tune your ML system?

1. The system can make two types of errors: false positives (FP) and false negatives (FN).
    - FP occurs when a non-suspicious traveler is incorrectly classified as suspicious, leading to unnecessary scrutiny and potential delays.
    - FN occurs when a suspicious traveler is incorrectly classified as non-suspicious, allowing them to pass through the border without additional checks.
2. An FN is more critical than an FP because it can lead to security risks, such as allowing individuals who may pose a threat to enter the country without proper screening. An FP only inconveniences an innocent traveler, who is then cleared by the human border agent, which is less severe than the potential consequences of an FN.
3. Our goal is to minimize the number of FNs (while keeping FPs at a reasonable level). Thus, we should focus on recall. A high recall ensures that most suspicious travelers are correctly identified, even if it means accepting a higher number of FPs.
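
As a reminder, recall is TP / (TP + FN), so every missed suspicious traveler lowers it directly while false positives do not affect it. The sketch below computes it on made-up labels (1 = suspicious, 0 = not suspicious); the function name and the data are purely illustrative.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the system flags: TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # With no actual positives, recall is undefined; return 0.0 by convention here
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up example: 3 suspicious travelers, 2 flagged, 1 missed, 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = recall(y_true, y_pred)
```

Here the single false alarm leaves the score unchanged, while the missed traveler brings it down to 2/3.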

**f) When a deep learning architecture is trained using an unbalanced training set, we usually observe a problem of bias, i.e. the system favors one class over another one. Using the Bayes equation, explain what is the origin of the problem.**

Remember that the Bayes equation is given by: P(C|x) = P(x|C) * P(C) / P(x)

When the training set is unbalanced, the majority class has a higher prior probability P(C) than the minority class. Since P(C) appears in the numerator of Bayes' equation, it scales the posterior P(C|x) of the majority class upwards. A model trained on this set implicitly learns a posterior that includes the training prior, so for an ambiguous input, where the likelihoods P(x|C) are similar across classes, the prior term decides the outcome and the majority class is predicted more often.
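
A small numerical sketch can make the effect of the prior visible. The likelihood values and priors below are invented for illustration: the input x is assumed to be twice as likely under the minority class, yet the unbalanced prior still flips the decision towards the majority class.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' equation: P(C|x) = P(x|C) * P(C) / P(x)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x), the normalizing constant
    return [j / evidence for j in joint]

# Hypothetical likelihoods P(x|majority) and P(x|minority)
likelihoods = [0.3, 0.6]

balanced = posterior(likelihoods, [0.5, 0.5])    # about [0.33, 0.67]
unbalanced = posterior(likelihoods, [0.9, 0.1])  # about [0.82, 0.18]
```

With balanced priors the minority class wins, as the likelihoods suggest; with a 9:1 prior the majority class wins even though the evidence from x points the other way.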