# **Bayesian Neural Network in Classification**

A conventional neural network can fit its training data very well and still underperform on unseen data. Because its weights are fixed point estimates, it cannot express uncertainty and tends to be overconfident in its wrong predictions. Consider a network trained to tell a car from a bike. If it is shown an image of a truck, it should ideally not predict anything. Instead, because the softmax output always sums to 1 across the known classes, the network assigns a high probability to one of them and confidently, though wrongly, predicts a car. To avoid this, we use a Bayesian Neural Network (BNN), which places probability distributions over its weights.
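
The softmax effect described above can be reproduced with a few lines of plain Python. This is a minimal sketch: the class names and the logit values are made up for illustration and do not come from any trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a car-vs-bike classifier might emit for a truck image.
classes = ["car", "bike"]
truck_logits = [4.0, 0.5]

probs = softmax(truck_logits)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
```

There is no "none of the above" output, so all of the probability mass has to be shared between the two known classes, however unfamiliar the input is.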

To read more about it, please refer to [this](https://analyticsindiamag.com/hands-on-guide-to-bayesian-neural-network-in-classification/) article.

## **Implementation**

We will now implement the BNN on the Iris dataset, a small dataset that ships with scikit-learn. It has 4 attributes, 150 data points, and 3 classes.

### **Loading the dataset and importing essential packages**

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels keras tensorflow torch --user -q

In [None]:
!python -m pip install torchbnn --user -q

In [None]:
# Restart the kernel so that the freshly installed packages can be imported.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import numpy as np
from sklearn import datasets
import torch
import torch.nn as nn
import torch.optim as optim
import torchbnn as bnn
import matplotlib.pyplot as plt

dataset = datasets.load_iris()

### **Splitting the dataset into data and target and converting them to tensors**

In [None]:
data = dataset.data
target = dataset.target 
data_tensor=torch.from_numpy(data).float()
target_tensor=torch.from_numpy(target).long()

## **Defining a simple Bayesian model**

In [None]:
model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=4, out_features=100),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=100, out_features=3),
)

# prior_mu (float): mean of the normal prior placed on the weights.
# prior_sigma (float): standard deviation (sigma) of that normal prior.
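
To give a feel for what makes such a layer "Bayesian", here is a dependency-free sketch of one stochastic forward pass. It assumes that every weight has a learned mean and a learned log standard deviation and that a new weight is drawn on each call. The function `sample_bayes_linear` and all the numbers are hypothetical, and the bias term is left out for brevity. The prior (`prior_mu`, `prior_sigma`) does not appear here because it only enters through the KL term of the loss.

```python
import math
import random

def sample_bayes_linear(x, weight_mu, weight_log_sigma, rng):
    """One stochastic forward pass of a toy Bayesian linear layer (no bias)."""
    out = []
    for mu_row, log_sigma_row in zip(weight_mu, weight_log_sigma):
        total = 0.0
        for x_i, mu, log_sigma in zip(x, mu_row, log_sigma_row):
            eps = rng.gauss(0.0, 1.0)            # fresh noise for every weight
            w = mu + math.exp(log_sigma) * eps   # sampled weight
            total += w * x_i
        out.append(total)
    return out

rng = random.Random(0)
x = [5.1, 3.5, 1.4, 0.2]                    # one input with 4 features
weight_mu = [[0.1, -0.2, 0.3, 0.0],         # hypothetical means, 2 output units
             [0.0, 0.1, -0.1, 0.2]]
weight_log_sigma = [[math.log(0.1)] * 4 for _ in range(2)]  # sigma = 0.1

first = sample_bayes_linear(x, weight_mu, weight_log_sigma, rng)
second = sample_bayes_linear(x, weight_mu, weight_log_sigma, rng)
print(first)
print(second)  # differs from the first pass because the weights were resampled
```

Calling the layer twice on the same input gives two different outputs, which is why repeated predictions from the model can disagree.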

## **Defining loss function**

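Two loss functions are used here: the cross-entropy loss, which measures how well the predictions fit the labels, and the BKL loss, which computes the KL (Kullback–Leibler) divergence between the network's weight distributions and their prior. The KL term is scaled by `klweight` before it is added to the cross-entropy.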

In [None]:
cross_entropy_loss = nn.CrossEntropyLoss()
klloss = bnn.BKLLoss(reduction='mean', last_layer_only=False)
klweight = 0.01
optimizer = optim.Adam(model.parameters(), lr=0.01)
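
For intuition about the KL term, the divergence between two one-dimensional normal distributions has a closed form. The sketch below evaluates it for a few hypothetical weights and combines the average with a made-up cross-entropy value. It illustrates the arithmetic only and is not a reimplementation of `bnn.BKLLoss`; averaging over weights is an assumption made here to mirror `reduction='mean'`.

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

prior_mu, prior_sigma = 0.0, 0.1   # same prior values as the model above
klweight = 0.01

# Hypothetical learned (mean, sigma) pairs for three weights.
posterior = [(0.0, 0.1), (0.3, 0.1), (0.3, 0.05)]
kls = [kl_gaussian(mu, s, prior_mu, prior_sigma) for mu, s in posterior]
mean_kl = sum(kls) / len(kls)

cross_entropy = 0.25               # made-up data-fit term for illustration
total_cost = cross_entropy + klweight * mean_kl
print(kls)
print(total_cost)
```

A weight whose distribution matches the prior contributes zero, and the penalty grows as the mean drifts away from `prior_mu` or as sigma moves away from `prior_sigma`. `klweight` controls how strongly this pull towards the prior competes with fitting the labels.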

## **Training the model**

The model is trained for 3000 steps (this would likely have led to overfitting in a traditional network).

In [None]:
for step in range(3000):
    # Each forward pass samples a fresh set of weights, so the logits are stochastic.
    models = model(data_tensor)
    cross_entropy = cross_entropy_loss(models, target_tensor)
    kl = klloss(model)
    total_cost = cross_entropy + klweight*kl

    optimizer.zero_grad()
    total_cost.backward()
    optimizer.step()
  
# Accuracy on the training data, computed from the logits of the final step.
_, predicted = torch.max(models.data, 1)
final = target_tensor.size(0)
correct = (predicted == target_tensor).sum()
print('- Accuracy: %f %%' % (100 * float(correct) / final))
print('- CE : %2.2f, KL : %2.2f' % (cross_entropy.item(), kl.item()))

## **Visualisation**

Let us now visualise the predictions and see how the model has performed. To understand how sampling works, run the model several times and plot each result. You will notice minor changes from run to run.

In [None]:
def draw_graph(predicted):
    # Left panel: true labels. Right panel: labels predicted in this run.
    fig = plt.figure(figsize=(16, 8))

    fig_1 = fig.add_subplot(1, 2, 1)
    fig_2 = fig.add_subplot(1, 2, 2)

    # Plot sepal length (column 0) against sepal width (column 1).
    z1_plot = fig_1.scatter(data[:, 0], data[:, 1], c=target, marker='v')
    z2_plot = fig_2.scatter(data[:, 0], data[:, 1], c=predicted)

    plt.colorbar(z1_plot, ax=fig_1)
    plt.colorbar(z2_plot, ax=fig_2)

    fig_1.set_title("REAL")
    fig_2.set_title("PREDICT")

    plt.show()

#Run 1 : 
models = model(data_tensor)
_, predicted = torch.max(models.data, 1)
draw_graph(predicted)

In [None]:
#Run 2: 
models = model(data_tensor)
_, predicted = torch.max(models.data, 1)
draw_graph(predicted)

The graphs above compare the scatter of data points in the actual dataset with the scatter of the points as labelled by the Bayesian network. Since there are 3 output classes, each one is shown in a different colour. Each time the model is run, a new set of weights is sampled, so the network assigns slightly different class probabilities to the input data. On close observation of the predicted graph near sepal length 6.2 and sepal width 2.3, a different prediction is made between runs because of this change in probability.
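
Rather than comparing two plots by eye, the run-to-run variation can be summarised by repeating the prediction many times and counting how often each class wins. The sketch below is a stand-in under stated assumptions: Gaussian noise added to fixed logits plays the role of the weight sampling, and `noisy_logits`, `mc_predict`, and all numbers are hypothetical. With the model from this notebook, the equivalent would be to call `model(data_tensor)` repeatedly and aggregate the results.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_logits(mean_logits, sigma, rng):
    """Stand-in for one stochastic forward pass: mean logits plus noise."""
    return [z + rng.gauss(0.0, sigma) for z in mean_logits]

def mc_predict(mean_logits, sigma, n_samples, rng):
    """Average class probabilities and count votes over repeated passes."""
    n_classes = len(mean_logits)
    mean_probs = [0.0] * n_classes
    votes = [0] * n_classes
    for _ in range(n_samples):
        probs = softmax(noisy_logits(mean_logits, sigma, rng))
        votes[probs.index(max(probs))] += 1
        mean_probs = [m + p / n_samples for m, p in zip(mean_probs, probs)]
    return mean_probs, votes

rng = random.Random(0)
# Hypothetical mean logits: one clear-cut sample, one near a class boundary.
clear_probs, clear_votes = mc_predict([6.0, 0.0, -2.0], 0.5, 200, rng)
border_probs, border_votes = mc_predict([0.0, 1.0, 1.1], 0.5, 200, rng)
print("clear :", clear_votes)
print("border:", border_votes)
```

A sample far from any class boundary gets the same label on every pass, while a borderline sample splits its votes between classes. Points of the second kind are the ones that change colour between Run 1 and Run 2.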

# **Related Articles**


> * [Bayesian Neural Network in Classification](https://analyticsindiamag.com/hands-on-guide-to-bayesian-neural-network-in-classification/)
> * [Introduction to LSTM Autoencoder](https://analyticsindiamag.com/introduction-to-lstm-autoencoder-using-keras/)
> * [Beginners Guide to MLP](https://analyticsindiamag.com/a-beginners-guide-to-scikit-learns-mlpclassifier/)
> * [Beginners Guide to PyTorch](https://analyticsindiamag.com/a-beginners-guide-pytorch/)
> * [Loss functions in PyTorch](https://analyticsindiamag.com/all-pytorch-loss-function/)
> * [Loss functions in Tensorflow Keras](https://analyticsindiamag.com/ultimate-guide-to-loss-functions-in-tensorflow-keras-api-with-python-implementation/)
> * [Loss function with examples](https://analyticsindiamag.com/loss-functions-in-deep-learning-an-overview/)
> * [Optimizers in Tensorflow Keras](https://analyticsindiamag.com/guide-to-tensorflow-keras-optimizers/)