# Deep Ensembles: A Loss Landscape Perspective

**Team Members:** Anita Mahinpei (interested in spring research), Carter Nakamoto, Emre Erdem

## Table of Contents

1. [Problem Statement and Context](#problem-statement-context)
2. [Existing Work](#existing-work)
3. [Paper Contributions](#paper-contributions)
4. [Technical Content](#technical-content)
    1. [Comparison Metrics](#comparison-metrics)
    2. [Models](#models)
5. [Paper Experiments](#paper-experiments)
6. [Pedagogical Explorations](#pedagogical-explorations)
    1. [Setup](#setup)
    2. [Diversity Comparison](#diversity-comparison)
    3. [Accuracy Comparison](#accuracy-comparison)
    4. [Uncertainty Comparison](#uncertainty-comparison)
    5. [Comparison with Paper](#comparison-with-paper)
7. [Evaluation](#evaluation)
8. [Future Work](#future-work)
9. [Broader Impact](#broader-impact)

<a id='problem-statement-context'></a>

## 1. Problem Statement and Context

In the paper “Deep Ensembles: A Loss Landscape Perspective”, the authors explore why deep ensembles perform better in practice than Bayesian neural networks in terms of classification accuracy and uncertainty estimation. They hypothesize that, unlike deep ensembles, Bayesian neural networks are unable to find different modes of the solution space, which makes them perform poorly compared to deep ensemble classifiers. To test this hypothesis, they examine the loss landscape of different classification problems through several experiments on the CIFAR and ImageNet datasets, using three neural network architectures of different sizes and four representative Bayesian sampling methods.

The problem the authors tackle is interesting and impactful for machine learning because it investigates how ensemble neural networks behave in function space compared to Bayesian methods, which can help in modifying Bayesian methods to have the desirable traits of ensemble models. Moreover, it sheds light on which methods are most suitable for an application based on their properties. To address these questions, the authors introduce the diversity-accuracy plane and investigate the trade-off between diversity and accuracy for different models. To compare the performance of deep ensembles and Bayesian neural networks, the authors also investigate whether Bayesian and frequentist neural networks can find different modes of the solution space, using weight space cosine similarity plots, prediction disagreement plots, and prediction space t-SNE plots.

They also investigate how state-of-the-art neural network models perform under dataset shift for classification problems. In machine learning, models may perform well on data that is very similar to the training set (i.e. in-distribution) but poorly on data that is significantly different from the training set (i.e. out-of-distribution, OOD). Improving the prediction accuracy of machine learning and deep learning models under dataset shift is an important area of research with many real-life applications. This is particularly important for high-risk use cases such as medicine, finance, and robotics. For example, if a self-driving car is not able to recognize out-of-distribution events, it may behave abnormally and cause catastrophic incidents. To investigate OOD performance, the authors study the effects of function space diversity and evaluate the models on corrupted versions of the benchmark datasets (CIFAR-10-C and ImageNet-C).
 


<a id='existing-work'></a>

## 2. Existing Work
The loss landscape was previously investigated by Goodfellow and Vinyals (2014), who observed the loss along a linear path from an initialization. Li et al. (2018) showed that optimizing on a low-dimensional hyperplane in the weight space can give results comparable to full-space optimization. In “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”, Garipov et al. (2018) investigated geometric properties of the loss functions of deep neural networks, demonstrated that the optima of these loss functions are in fact connected by simple curves, and proposed a method for finding such paths between two local optima so that the train loss and test error remain low along them. Fort and Jastrzebski (2019) modeled the loss landscape as a set of high-dimensional wedges that form a large, connected structure towards which optimization is drawn, and showed that hyperparameter choices such as learning rate, network width, and regularization affect the path the optimizer takes through the landscape.

There has also been important work on uncertainty estimation, which we will explore further in this submission. In “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles”, Lakshminarayanan et al. (2016) proposed deep ensembles as an alternative to Bayesian neural networks that can provide accurate uncertainty estimates, and evaluated predictive uncertainty on test examples from known and unknown classes. Kendall and Gal (2017) studied the benefits of modelling epistemic versus aleatoric uncertainty in Bayesian deep learning models for vision tasks. Depeweg et al. (2018) also worked on extracting and decomposing uncertainty into epistemic and aleatoric components for decision-making purposes.


<a id='paper-contributions'></a>

## 3. Paper Contributions

While there has been extensive work examining loss landscapes, previous works do not directly investigate function space diversity at different modes of the loss function. In this paper, the authors focus on the diversity of solutions in the loss landscape (i.e. diversity in the space of weights and biases that the network navigates during training) and explore whether Bayesian and frequentist ensemble neural networks can generate sufficient variety through a diverse exploration of the weight space. They explicitly measure diversity, accuracy, and uncertainty performance both within subspace samples of a single training trajectory and across different randomly initialized training trajectories. Having found empirical evidence for the greater diversity of ensemble neural networks as well as their superior performance compared to Bayesian methods, they attribute this superior performance to the greater diversity of the weight and prediction spaces of deep ensembles.
 
 
Their main contribution is a thorough investigation of the weight and function spaces through experiments and summary plots. They show that the trajectories of randomly initialized neural networks explore different modes in function space, which explains why deep ensembles trained with just random initializations work well in practice compared to Bayesian subspace sampling methods (weight averaging, Monte Carlo dropout, and two versions of local Gaussian approximations). They also show that samples from these subspace methods may lie relatively far from the starting point in the weight space yet remain similar in function space, giving rise to an insufficiently diverse set of predictions.


<a id='technical-content'></a>

## 4. Technical Content

<a id='comparison-metrics'></a>
### 4.1 Comparison Metrics

Rather than taking a mathematical approach to validating their hypothesis, the authors take an empirical approach and lay out various experiments to demonstrate that deep ensemble methods perform a more diverse exploration of the loss landscape and that they achieve better accuracy and uncertainty estimation results because of this diversity.

As the paper focuses on validating the hypothesis for classification problems, classification accuracy is used as the metric for model performance. To evaluate the model uncertainties, the authors use the Brier score as defined in Glenn Brier's paper, where a lower Brier score suggests better uncertainties. Finally, to quantify diversity, the paper uses three metrics:

1) **cosine similarity** between the neural network weights as defined by $\cos(\boldsymbol{\theta_1}, \boldsymbol{\theta_2}) = \frac{\boldsymbol{\theta_1}^T \boldsymbol{\theta_2}}{||\boldsymbol{\theta_1}|| ||\boldsymbol{\theta_2}||}$ where $\boldsymbol{\theta_1}$ and $\boldsymbol{\theta_2}$ are the weights and biases of two different model runs. 

2) **prediction disagreement** as defined by $\frac{1}{N} \sum_{n=1}^N [ f(\boldsymbol{x_n} | \boldsymbol{\theta_1}) \ne f(\boldsymbol{x_n} | \boldsymbol{\theta_2})]$ where $f$ gives the predicted label for $\boldsymbol{x_n}$ under the given weights and biases.

3) **Jensen-Shannon divergence** given by $\sum_{m=1}^M KL(p_{\boldsymbol{\theta_m}}(y|\boldsymbol{x})|| \bar{p}(y|\boldsymbol{x}))$ where KL is the Kullback-Leibler divergence and $\bar{p}(y|\boldsymbol{x})$ is the average of all the $p_{\boldsymbol{\theta_m}}(y|\boldsymbol{x})$.
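
The metrics above, together with the Brier score, can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions rather than the paper's code: the function names are hypothetical, parameters are assumed to be flattened into 1D vectors, the divergence is averaged over inputs, and the Brier score uses the multi-class form (sum of squared errors over classes).

```python
import numpy as np

def cosine_similarity(theta_1, theta_2):
    """Cosine of the angle between two flattened parameter vectors."""
    theta_1 = np.asarray(theta_1, dtype=float)
    theta_2 = np.asarray(theta_2, dtype=float)
    return float(theta_1 @ theta_2 / (np.linalg.norm(theta_1) * np.linalg.norm(theta_2)))

def prediction_disagreement(labels_1, labels_2):
    """Fraction of inputs on which two models predict different labels."""
    return float(np.mean(np.asarray(labels_1) != np.asarray(labels_2)))

def summed_kl_to_mean(probs, eps=1e-12):
    """Sum over members of KL(p_m || p_bar), averaged over inputs.

    probs has shape (M, N, C): M members, N inputs, C classes.
    Averaging over the N inputs is an assumption of this sketch.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    mean = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs) - np.log(mean)), axis=-1)  # shape (M, N)
    return float(kl.sum(axis=0).mean())

def brier_score(probs, labels):
    """Mean over inputs of the squared error between probs (N, C) and one-hot labels."""
    probs = np.asarray(probs, dtype=float)
    one_hot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))
```

Identical ensemble members give zero disagreement and zero divergence, so larger values of either indicate more diverse predictions.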

<a id='models'></a>
### 4.2 Models

To compare Bayesian methods with Ensemble Neural Networks, the paper uses the following four subspace sampling methods as representative Bayesian methods:

1) **Random Subspace Sampling:** Starting with an optimized parameter set $\boldsymbol{\theta_0}$ from a single deep neural network training run, select a random direction $\boldsymbol{v}$ in the weight space and change the weights by moving in that direction with step size t: $\boldsymbol{\theta_0} + t \boldsymbol{v}$.

2) **Dropout Subspace Sampling:** Starting with an optimized parameter set $\boldsymbol{\theta_0}$ from a single deep neural network training run, make predictions after dropping a random portion of the model weights as specified by a randomly selected dropout rate p.

3) **Diagonal Gaussian Subspace Sampling:** Using the parameter values from the most recent iterations of a single deep neural network training run, compute a mean and standard deviation for each parameter and randomly draw new parameter values distributed as $\theta_i \sim \mathcal{N}(\mu_i, \sigma_i)$

4) **Low-Rank Gaussian Subspace Sampling:** Similar to diagonal Gaussian sampling, compute the mean $\boldsymbol{\mu}$ of the parameters. However, rather than computing standard deviations, use the top k principal components $\boldsymbol{v_1}, \dots, \boldsymbol{v_k}$ of the parameter sets to define a k-dimensional normal distribution for generating new parameters: $\boldsymbol{\theta} \sim \boldsymbol{\mu} + \sum_{i=1}^{k} \mathcal{N}(0, 1) \, \boldsymbol{v_i}$
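
The four sampling schemes above can be sketched as follows. This is an illustrative NumPy sketch under our own assumptions, not the paper's implementation: all function names are hypothetical, `trajectory` is assumed to be a `(T, D)` array of flattened parameter vectors from the most recent training iterations, dropout zeroes individual weights without rescaling, and the low-rank sampler scales each principal direction by its empirical standard deviation.

```python
import numpy as np

def random_subspace_sample(theta_0, t, rng):
    """Move a distance t from theta_0 along a random unit direction v."""
    v = rng.standard_normal(theta_0.shape)
    v /= np.linalg.norm(v)
    return theta_0 + t * v

def dropout_sample(theta_0, p, rng):
    """Zero each parameter independently with probability p (no rescaling)."""
    keep = rng.random(theta_0.shape) >= p
    return theta_0 * keep

def diagonal_gaussian_sample(trajectory, rng):
    """Draw each parameter from N(mu_i, sigma_i) fitted to recent iterates."""
    mu = trajectory.mean(axis=0)
    sigma = trajectory.std(axis=0)
    return mu + sigma * rng.standard_normal(mu.shape)

def low_rank_gaussian_sample(trajectory, k, rng):
    """Draw from a Gaussian restricted to the top-k principal directions."""
    mu = trajectory.mean(axis=0)
    # Rows of vt are the principal directions of the centered iterates.
    _, s, vt = np.linalg.svd(trajectory - mu, full_matrices=False)
    # Assumption: scale each direction by its empirical standard deviation.
    scale = s[:k] / np.sqrt(max(len(trajectory) - 1, 1))
    z = rng.standard_normal(k)
    return mu + (z * scale) @ vt[:k]

rng = np.random.default_rng(0)
theta_0 = rng.standard_normal(10)           # stand-in for trained parameters
trajectory = rng.standard_normal((20, 10))  # stand-in for recent iterates
samples = [
    random_subspace_sample(theta_0, 0.5, rng),
    dropout_sample(theta_0, 0.2, rng),
    diagonal_gaussian_sample(trajectory, rng),
    low_rank_gaussian_sample(trajectory, 2, rng),
]
```

All four samplers start from a single training run, which is why their samples tend to stay close to one mode of the loss landscape.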


<a id='paper-experiments'></a>

## 5. Paper Experiments

The paper mainly focuses on examining the performance of three neural network architectures on the CIFAR-10 dataset: 1) a SmallCNN trained for 10 epochs, 2) a MediumCNN trained for 40 epochs, and 3) a ResNet20v1 trained for 200 epochs. To show generalizability to other, more complex problems, the authors also perform some of their experiments on CIFAR-100 and ImageNet data.

Their first set of experiments compares the performance of the aforementioned neural network architectures under the four subspace sampling techniques versus ensemble training. They found that, as suggested by their hypothesis, the subspace sampling methods stayed in a small region of the prediction and weight space, while different runs of ensemble training with different random initializations explored a larger area of the prediction and weight space. Their experiments showed that the subspace sampling techniques were unable to improve their diversity without sacrificing accuracy, while ensemble neural networks could achieve a much better accuracy-diversity trade-off.

<img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/tsne-paper.png" width="100%" />

<div align="center">Figure 5.1: t-SNE plots of predictions for the validation set for three different training trajectories with a subspace sampling technique applied to the final state. While different trajectories end up far from each other, different subspace sampling techniques stay fairly local relative to the trajectory results.</div>

<img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/diversity-accuracy-paper.png" width="100%" />

<div align="center">Figure 5.2: diversity-accuracy plots show better trade-off for ensemble methods compared to subspace sampling methods.</div>

Their second set of experiments compares the performance of the models under dataset shift. The authors claim that having weight space diversity is crucial for improving generalizability and avoiding over-confident predictions for out-of-distribution data. They test this claim by corrupting images from their datasets to varying degrees and making predictions on the corrupted OOD data. For these experiments, they use the Jensen-Shannon divergence of predictions as a measure of diversity and show that ensemble neural networks have more diversity and achieve better accuracy and Brier scores than the subspace sampling methods under all degrees of data corruption.

<img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/corruption-paper.png" width="100%" />

<div align="center">Figure 5.3: plots of accuracy, Brier score, and J-S divergence for ensemble training and subspace sampling techniques under different degrees of data corruption. Larger ensembles show better accuracy and Brier scores.</div>

<a id='pedagogical-explorations'></a>

## 6. Pedagogical Explorations

In our work, we wanted to test their claims on different 2D classification toy datasets and evaluate how deep ensembles perform against Bayesian neural networks in terms of test set accuracy and uncertainty estimation. We noticed that the authors employed Bayesian methods such as random subspace sampling and dropout subspace sampling but did not try more sophisticated Bayesian inference methods such as Hamiltonian Monte Carlo (HMC) and Automatic Differentiation Variational Inference (ADVI). Although their appendix mentions a comparison with a more sophisticated Bayesian method, namely cyclical stochastic gradient MCMC, the only comparison plots provided are of the diversity measures and not of the accuracy and uncertainty estimation results. Our hypothesis is that, since more advanced Bayesian methods take information about the loss function into account, unlike some of the purely random subspace sampling methods used in the paper, these advanced methods should be able to achieve comparable if not better uncertainty estimates and accuracy.

In the sections below, we lay out the setup of our experiments as well as our main findings. To view the relevant code and more detailed results, please refer to our compiled experiment notebooks. Each notebook shows the results of running the experiments below for one of the toy datasets.

<a id='setup'></a>
### 6.1 Setup

We compared the performance of ensemble neural networks to dropout subspace sampling and random subspace sampling, which were used in the original paper, as well as to HMC and ADVI sampling. We used the following four 2D classification datasets (circles, moons, blobs, and overlap) to investigate whether the results are problem dependent. For each dataset, we generated some test points of interest for evaluating uncertainty estimates on out-of-distribution and in-distribution data.

<table>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_data_plot.png" style="width: "50%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_data_plot.png" style="width: "50%";"/> </td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_data_plot.png" style="width: "50%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_data_plot.png" style="width: "50%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.1: classification toy datasets and their corresponding test points for uncertainty evaluation.</div>

Our classification neural network had 2 hidden layers with 64 and 8 nodes, followed by a single-node output layer. The hidden layers had a tanh activation function while the output layer had a sigmoid activation function. The ensemble neural network model with random initializations was trained with an Adam optimizer and a binary cross-entropy loss function. All the Bayesian methods were initialized with the final weights and biases from a single training trajectory, and the Bayesian sampling method was then used to create several new samples from the parameter space.

<p align="center"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/model.png" width="50%" /></p>

<div align="center">Figure 6.2: neural network model summary</div>
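
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass with all weights and biases packed into one flat vector, which is the representation the sampling methods operate on. It is an illustration under our own assumptions, not the notebook code: the names `LAYER_SIZES`, `num_params`, `forward`, and `binary_cross_entropy` are hypothetical, and the input dimension of 2 is inferred from the 2D toy datasets.

```python
import numpy as np

# 2D input, hidden layers of 64 and 8 units, single sigmoid output.
LAYER_SIZES = [2, 64, 8, 1]

def num_params(sizes=LAYER_SIZES):
    """Total number of weights and biases in the network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def forward(theta, x, sizes=LAYER_SIZES):
    """Return P(y=1 | x) for inputs x of shape (N, 2) given flat parameters theta."""
    h = np.asarray(x, dtype=float)
    offset = 0
    n_layers = len(sizes) - 1
    for layer, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = theta[offset:offset + n_in * n_out].reshape(n_in, n_out)
        offset += n_in * n_out
        b = theta[offset:offset + n_out]
        offset += n_out
        z = h @ w + b
        # tanh on hidden layers, sigmoid on the output layer
        h = np.tanh(z) if layer < n_layers - 1 else 1.0 / (1.0 + np.exp(-z))
    return h[:, 0]

def binary_cross_entropy(probs, y, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs)))

rng = np.random.default_rng(0)
theta = 0.1 * rng.standard_normal(num_params())
probs = forward(theta, rng.standard_normal((5, 2)))
```

With this layout, a parameter sample from any of the methods is just another vector of length `num_params()` that can be passed to `forward`.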

<a id='diversity-comparison'></a>
### 6.2 Diversity Comparison

To compare the diversity of the samples from the different methods, we computed the cosine similarity between the weights from different samples of the Bayesian methods and different runs of the ensemble method.
In the figure below, we can see from left to right heatmaps of the cosine similarities for deep ensembles, dropout sampling, random sampling, HMC sampling, and ADVI sampling. Similar to the paper, we can see that cosine similarities are considerably lower among different runs of the ensemble model than among samples from the dropout and random sampling methods, suggesting greater diversity in the weight space of ensemble models. However, our HMC and ADVI samples were capable of reaching a weight space diversity similar to that of the ensemble runs, as measured by cosine similarity. The blue squares showing high similarity between adjacent samples of ADVI and HMC are indicative of the correlation between consecutive samples, which can be eliminated by further thinning.

<table>
<tr>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Deep Ensemble</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Dropout Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Random Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">HMC</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">ADVI</td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_cos_similarity_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_cos_similarity_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_cosine_similarity_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_cos_similarity_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_cosine_similarity_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_cos_similarity_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_cos_similarity_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_cosine_similarity_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_cos_similarity_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_cosine_similarity_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_cos_similarity_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_cos_similarity_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_cosine_similarity_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_cos_similarity_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_cosine_similarity_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_cos_similarity_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_cos_similarity_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_cosine_similarity_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_cos_similarity_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_cosine_similarity_plot_advi.png" style="width: "25%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.3: Cosine similarity heatmaps of the circles, moons, blobs, and overlap data (from top to bottom) for ensemble training, dropout, random, HMC, and ADVI sampling (from left to right).</div>
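
The matrix behind each heatmap in Figure 6.3 can be computed in one step from a stack of flattened parameter vectors. The sketch below is illustrative only: `cosine_similarity_matrix` is a hypothetical helper, and the two synthetic sets of vectors (samples clustered around one point versus independent random vectors) are stand-ins for subspace samples and ensemble runs, not our actual trained weights.

```python
import numpy as np

def cosine_similarity_matrix(thetas):
    """Pairwise cosine similarities for parameter vectors stacked as rows (S, D)."""
    thetas = np.asarray(thetas, dtype=float)
    unit = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
dim = 721  # hypothetical parameter count

# Samples clustered around a single solution, as in local subspace sampling.
theta_0 = rng.standard_normal(dim)
local_samples = theta_0 + 0.1 * rng.standard_normal((5, dim))

# Independent random vectors, a stand-in for independently trained runs.
independent_runs = rng.standard_normal((5, dim))

local_sim = cosine_similarity_matrix(local_samples)
independent_sim = cosine_similarity_matrix(independent_runs)
```

In this synthetic setting the off-diagonal entries of `local_sim` are close to 1 while those of `independent_sim` are close to 0, which mirrors the qualitative contrast between the subspace sampling and ensemble heatmaps.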

We also produced t-SNE plots of the predictions from different runs/samples, similar to what was done in the original paper. We noticed that, in general, the prediction space diversity was lower for the Bayesian subspace sampling methods than for ensemble training, as was observed in the paper. However, for certain datasets (specifically the blobs and overlap datasets), dropout and random subspace sampling were able to achieve similar or greater prediction space diversity compared to ensemble training. Given our prediction space t-SNE plots and the weight space cosine similarity plots, we can see that weight space diversity is not necessarily indicative of prediction space diversity and vice versa.


<table>
<tr>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Deep Ensemble</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Dropout Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Random Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">HMC</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">ADVI</td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_tsne_plot_ensemble.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_tsne_plot_dropout.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_tsne_plot_random.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_tsne_plot_hmc.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_tsne_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_tsne_plot_ensemble.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_tsne_plot_dropout.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_tsne_plot_random.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_tsne_plot_hmc.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_tsne_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_tsne_plot_ensemble.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_tsne_plot_dropout.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_tsne_plot_random.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_tsne_plot_hmc.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_tsne_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_tsne_plot_ensemble.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_tsne_plot_dropout.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_tsne_plot_random.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_tsne_plot_hmc.png" style="width: "25%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_tsne_plot_advi.png" style="width: "25%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.4: prediction space t-SNE plots of the circles, moons, blobs, and overlap data (from top to bottom) for ensemble training, dropout, random, HMC, and ADVI sampling (from left to right).</div>

<a id='accuracy-comparison'></a>
### 6.3 Accuracy Comparison

With regard to test set accuracy, we observed that the methods we tried perform very similarly on the circles, blobs, and overlap datasets. For the moons dataset, however, ADVI and dropout sampling underperform compared to the other methods. The best performing methods achieved accuracies in the 97-99% range on the circles, moons, and blobs datasets. This high accuracy is expected, as the two classes barely overlap in these datasets and neural networks can create non-linear classification boundaries. For the overlap dataset, accuracy drops to the low 80s, as expected given the large overlap between the two classes.

<table>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_accuracy_plot.png" style="width: "50%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_accuracy_plot.png" style="width: "50%";"/> </td>
</tr>
<tr>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blob_accuracy_plot.png" style="width: "50%";"/> </td>
<td> <img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_accuracy_plot.png" style="width: "50%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.5: comparison of the accuracy of ensemble training, dropout, random, HMC, and ADVI sampling for circles, moons, blobs, and overlap datasets.</div>


When we look at posterior predictive probability plots, we can see that HMC seems to provide the most desirable class probabilities, as it is less certain about out-of-distribution areas than the other methods; this can be observed particularly with the moons and blobs data. In addition to HMC, random subspace sampling and ADVI also achieve desirable levels of uncertainty for the overlap dataset, which was the hardest dataset to fit. The deep ensemble, on the other hand, is one of the worst performers in this respect, as it is very certain about out-of-distribution areas, particularly with the moons and blobs datasets.


<table>
<tr>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Deep Ensemble</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Dropout Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Random Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">HMC</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">ADVI</td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_posterior_predictive_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_posterior_predictive_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_posterior_predictive_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_posterior_predictive_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_posterior_predictive_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_posterior_predictive_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_posterior_predictive_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_posterior_predictive_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_posterior_predictive_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_posterior_predictive_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_predictive_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_predictive_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_predictive_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_predictive_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_predictive_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_posterior_predictive_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_posterior_predictive_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_posterior_predictive_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_posterior_predictive_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_posterior_predictive_plot_advi.png" style="width: "25%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.6: posterior predictive plots of the circles, moons, blobs, and overlap data (from top to bottom) for ensemble training, dropout, random, HMC, and ADVI sampling (from left to right).</div>

<a id='uncertainty-comparison'></a>
### 6.3 Uncertainty Comparison

In our experiments, we used the standard deviation of the predicted probabilities across the different samples/runs as an estimate of the uncertainty in the prediction. As can be seen in the uncertainty plots below, most of the methods did not encapsulate the training data: their out-of-distribution (OOD) uncertainty estimates were similar to those for the in-distribution test points. HMC was the most promising method; for the moons and circles datasets, HMC shows higher uncertainties in OOD regions, and these seem to scale with distance from the training data. Even for the blobs and overlap data, HMC shows some training data encapsulation and higher uncertainties for most OOD regions compared to the other methods.

<table>
<tr>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Deep Ensemble</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Dropout Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">Random Subspace Sampling</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">HMC</td>
<td style="text-align: center; vertical-align: bottom; font-size:80%;">ADVI</td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_uncertainty_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_uncertainty_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_uncertainty_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_uncertainty_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/circles_uncertainty_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_uncertainty_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_uncertainty_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_uncertainty_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_uncertainty_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/moons_uncertainty_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_uncertainty_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_uncertainty_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_posterior_uncertainty_plot_random.png" style="width: "25%";"/></td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_uncertainty_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/blobs_uncertainty_plot_advi.png" style="width: "25%";"/> </td>
</tr>
<tr>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_uncertainty_plot_ensemble.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_uncertainty_plot_dropout.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_uncertainty_plot_random.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_uncertainty_plot_hmc.png" style="width: "25%";"/> </td>
<td style="text-align: center; vertical-align: bottom;"><img src="https://raw.githubusercontent.com/anita76/AM107-Project/main/Images/overlap_uncertainty_plot_advi.png" style="width: "25%";"/> </td>
</tr>
</table>

<div align="center">Figure 6.7: uncertainty plots (based on prediction standard deviation) of the circles, moons, blobs, and overlap data (from top to bottom) for ensemble training, dropout, random, HMC, and ADVI sampling (from left to right).</div>
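
The uncertainty estimate described in this section can be sketched in a few lines. This is a minimal illustration, not the code used to produce the figures: the function name `predictive_mean_and_std` and the array layout (one row per ensemble member or posterior sample, one column per input point) are assumptions made for the example.

```python
import numpy as np

def predictive_mean_and_std(member_probs):
    """Summarize predictions from several samples/runs.

    member_probs: array of shape (n_members, n_points) holding each
    member's predicted probability of the positive class (assumed layout).
    Returns the mean prediction and its standard deviation per point.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean = member_probs.mean(axis=0)  # averaged prediction per point
    std = member_probs.std(axis=0)    # spread across members = uncertainty
    return mean, std

# Toy example: three members, three input points.
# Members agree on the first two points and disagree on the third.
probs = np.array([[0.9, 0.1, 0.2],
                  [0.9, 0.1, 0.8],
                  [0.9, 0.1, 0.5]])
mean, std = predictive_mean_and_std(probs)
```

Points where the members agree get a standard deviation near zero regardless of how confident the shared prediction is, which is why this measure reflects disagreement between samples rather than closeness to the decision boundary.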

<a id='comparison-with-paper'></a>
### 6.4 Comparison with Paper

Our experiments led to some results that differ from those of the "Deep Ensembles: A Loss Landscape Perspective" paper. First, for our toy datasets, we did not see a significant difference between the accuracy of the Deep Ensemble model and the Bayesian Neural Networks. This may be partly due to the simpler, lower dimensional classification problems that we investigated compared to the original paper. We did notice lower performance for the moons dataset with ADVI and dropout sampling compared to the Ensemble NN; thus it is possible that Deep Ensembles perform better than Bayesian methods for some problems. However, based on our experiments, we cannot conclude that deep ensembles always perform better, contrary to what the paper suggests.

Second, we performed a more extensive uncertainty analysis: we took out-of-distribution uncertainties into account by mapping the uncertainty estimates over a portion of the problem space and considering some OOD test points in our experiments. We noticed that HMC gives more reasonable uncertainty estimates than Deep Ensembles for all of the datasets. Given the empirical nature of these findings, however, we would need to perform more diverse experiments before generalizing this observation. Finally, while we did observe greater prediction space and weight space diversity in the Deep Ensembles compared to some or all of our Bayesian methods, greater diversity is not necessary for good results, contrary to what the paper suggests. As an example, despite its lower prediction space diversity, HMC achieved accuracy comparable to Deep Ensembles and better uncertainty estimates.

In summary, our hypothesis was that advanced Bayesian methods such as HMC should identify out-of-distribution areas better than deep ensembles because they obtain a distribution over parameters through accurate posterior sampling. Posterior predictive accuracy was generally comparable across methods, as expected, and our hypothesis regarding epistemic uncertainty was supported: in the posterior predictive probability and epistemic uncertainty plots, deep ensembles failed to identify out-of-distribution areas. For this reason, we believe that the authors did not analyze epistemic uncertainty in enough depth and prematurely suggested that “current variational Bayesian methods do not reach the trade-off between diversity and accuracy achieved by independently trained models”.

<a id='evaluation'></a>

## 7. Evaluation 

Overall, the paper serves as an interesting, thought-provoking exploration of its central idea rather than a firm proof because of the limited scope of its experiments. 
Working with three image datasets and using only four "Bayesian" sampling methods, none of which are standard, advanced Bayesian procedures, seriously restricts the paper's scope. 
In particular, the absence of Hamiltonian Monte Carlo and Variational Inference makes the paper's comparison incomplete. 
The paper also does not explore out-of-distribution uncertainty at length, relying only on Brier scores for different intensities of dataset corruption for different models as an indicator for the quality of uncertainties and handling of OOD points. 
The Brier score, however, can be decomposed into terms that represent the inherent uncertainty of event outcomes, the model's reliability (how well its predicted probabilities are calibrated), and the resolution of the model (its specificity and divergence from overall averages), according to Murphy (1973). 
Because a single Brier score conflates these components, it is cause for skepticism of the paper's evaluations of which models produce "better" uncertainty. 
There are also open questions about parameter choices for the subspace sampling methods: It is possible to change sample variation in random and dropout subspace sampling, and these changes are not explored in the paper. 
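
For reference, the decomposition mentioned above can be written out for the binary case. The notation is an assumption introduced here, not taken from the paper: the $N$ forecasts are grouped into $K$ bins of identical forecast value $f_k$, bin $k$ contains $n_k$ forecasts with observed positive frequency $\bar{o}_k$, and $\bar{o}$ is the overall base rate.

$$
\mathrm{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k-\bar{o}_k)^2}_{\text{reliability}} \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k-\bar{o})^2}_{\text{resolution}} \;+\; \underbrace{\bar{o}\,(1-\bar{o})}_{\text{uncertainty}}
$$

Two models can therefore reach the same Brier score with different mixes of reliability and resolution, and the uncertainty term depends only on the data, so a lower score alone does not show that a model's uncertainty estimates are better.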

With those limitations, though, the paper does use sensible methods to investigate function space diversity and its effect on accuracy, especially for out-of-distribution data, for its chosen models and datasets. 
The t-SNE plot is a reasonable dimensionality reduction method that is used to produce clear plots comparing the function space diversity of different approaches. 
These plots, with the aforementioned caveats, do provide compelling evidence that independent random initializations explore much more of the function space than the chosen subspace sampling methods. 
The paper's demonstration that some of the subspace sampling methods fail to explore multiple optima in the loss landscape, while different initializations succeed in doing so, was not very rigorous or general, but it did illustrate the point of the paper. 

The diversity vs. accuracy plots were particularly convincing, within the aforementioned limitations. 
These plots visualized large datasets composed of many function samples, bolstering their credibility. 
The plots showed the relationship between accuracy and diversity, which map onto the bias-variance tradeoff, for easy interpretation. 
The plots also showed a visually stark and unmistakable pattern, with the different subspace samples forming a distinct curve and the independently initialized and optimized solutions clustering clearly beyond the curve with near-maximum values for both diversity and accuracy. 
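
To make the two axes of such a plot concrete, the sketch below computes one point on a diversity vs. accuracy plane. It assumes that diversity is the fraction of points on which a sampled model disagrees with a base model, divided by the sampled model's error rate; the paper's exact normalization may differ, and the function names are hypothetical.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """Fraction of points whose predicted label matches the true label."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))

def normalized_disagreement(base_preds, other_preds, true_labels):
    """Disagreement with the base model, scaled by the other model's error rate.

    The normalization is an assumption for this sketch. Returns NaN when
    the other model makes no errors, since the ratio is then undefined.
    """
    base_preds = np.asarray(base_preds)
    other_preds = np.asarray(other_preds)
    disagreement = float(np.mean(base_preds != other_preds))
    error_rate = 1.0 - accuracy(other_preds, true_labels)
    if error_rate <= 0.0:
        return float("nan")
    return disagreement / error_rate

# Toy example with five test points.
y_true = np.array([0, 1, 1, 0, 1])
base = np.array([0, 1, 1, 0, 0])    # base solution, wrong on the last point
other = np.array([0, 1, 0, 0, 1])   # sampled solution, wrong on the third point

point_accuracy = accuracy(other, y_true)
point_diversity = normalized_disagreement(base, other, y_true)
```

Each sampled or independently trained model contributes one (`point_diversity`, `point_accuracy`) pair, and plotting these pairs for every method yields the kind of plane discussed above.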

The paper's comparison and combination of randomly initialized ensembles with subspace sampling built reasonably on the earlier sections of the paper. 
Reporting the accuracies for different ensemble sizes (including 1) with all different subspace sampling methods (including none) allows for easy comparative evaluation. 
The resulting plot is fairly convincing, for the given subspace samplers and dataset, that subspace sampling does improve outcomes slightly but much less than ensembling, and the advantage from subspace sampling shrinks as ensembles grow. 

The paper's final results, studying dataset shift, suffer most from a lack of detail around dataset corruption. 
It is unclear if the corruption really generates out-of-distribution data, and corruption intensity cannot be understood as a scale without greater specificity. 
However, Jensen-Shannon divergence, a sum of KL divergences from the mean model, is a good metric for quantifying the diversity of sampled models. Diversity is useful to investigate under dataset shift because the paper's central theory is that diversity is responsible for increased accuracy, particularly under dataset shift. 
The substantial gap between the Jensen-Shannon divergences for independently initialized models and subspace samples is convincing evidence of differences in diversity, which helps to explain why, even as the data are corrupted, ensembling leads to large increases in accuracy while subspace sampling yields small increases. 
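
The Jensen-Shannon divergence described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the paper's code: predictions are assumed to be stored as an array of shape `(n_members, n_points, n_classes)`, and the KL terms are averaged over members, which differs from a plain sum only by the constant factor `n_members`.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last (class) axis, clipped for numerical safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def ensemble_js_divergence(member_probs):
    """Per-point divergence of the members from their mean prediction.

    member_probs: array of shape (n_members, n_points, n_classes),
    with each row along the class axis summing to 1 (assumed layout).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_probs = member_probs.mean(axis=0)                   # the "mean model"
    kl_per_member = kl_divergence(member_probs, mean_probs)  # (n_members, n_points)
    return kl_per_member.mean(axis=0)                        # (n_points,)

# Toy example: two members, two points, two classes.
# The members agree on the first point and fully disagree on the second.
probs = np.array([[[0.7, 0.3], [1.0, 0.0]],
                  [[0.7, 0.3], [0.0, 1.0]]])
jsd = ensemble_js_divergence(probs)
```

The divergence is zero where all members make the same prediction and approaches `log(2)` for two members that put all their mass on different classes, so larger values indicate more diverse members.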

<a id='future-work'></a>

## 8. Future Work

There is the potential for future work in addressing the various limitations of the paper mentioned above. 
More studies conducted with different datasets and models are necessary to interpret the results more generally. 
Ensemble methods ought to be benchmarked against more effective Bayesian methods, including Monte Carlo methods and variational inference. 
Deeper, more detailed investigations of uncertainty are necessary in order to make claims about the appropriateness of model uncertainties, particularly epistemic uncertainty. 
We took up many of these issues with a collection of 2D, 2-class data that presented different classification challenges while allowing for easy visualization, uncertainty investigation, and the easy use of more complex, state-of-the-art Bayesian methods. 

In addition, the paper itself mentions some directions for future work that are compelling, including an investigation of how random initialization affects training and how model diversity could be increased within ensembles. 

<a id='broader-impact'></a>

## 9. Broader Impact

Because this paper is focused on comparative methods evaluation, it is far upstream of immediately impactful technology. 
However, its findings contribute to an improved understanding of neural network training and the landscape of different optimization pathways in training. 
This improved understanding has the potential to inspire future neural network-based prediction technologies that will perform better, particularly with out-of-distribution data. 
Therefore, this paper may affect situations in which complex relationships are elucidated based on different data. 
These situations include facial recognition, medical diagnostics, and human genomics. 
In all of these situations, the most affected people will be members of minority groups not well represented in training data: 
People of color, people with disabilities, people with rare comorbidities, intersex people, and others. 
The improved accuracy of the neural networks under dataset shift is not necessarily advantageous for these people. 
If facial recognition becomes more accurate, for example, it may be adopted more widely to incarcerate people, and the US carceral system is expansive, violent, and racist. 
However, improved predictive performance for minority groups in the research and practice of medicine, if these technologies are deployed equitably, may save lives. 


The paper may have the added effect of pushing practitioners away from Bayesian methods towards frequentist methods. 
This may have some negative consequences, as Bayesian models have many advantages: 
Bayesian models incorporate prior knowledge, balancing out sparse data, and often represent uncertainty in a way that is most useful to users. 