## Outlier detection and uncertainty estimation in Neural Network Models

Among the most celebrated discoveries of the 20th century are the [Heisenberg Uncertainty Principle](https://en.wikipedia.org/wiki/Uncertainty_principle) and [Gödel's Incompleteness Theorems](https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems), which revealed the limitations of physical models and axiomatic mathematical systems respectively. In that spirit, it is just as important to understand the limitations of neural network models as it is to understand their capabilities. In mission-critical scenarios such as cancer detection or self-driving cars, it is imperative to build trust in the AI models used.
At UntangleAI we build tools, as part of the Clarity SDK, that help explain the decision process behind a neural network model and assess the quality of each decision using uncertainty modeling. Uncertainty modeling helps detect outliers, which would otherwise be assigned to one of the existing classes because the classifier always picks the most likely class for a test point. These are mislabelled data points that need to be excluded from model training. Uncertainty modeling also helps identify uncertain points, for which the model fails to make the correct prediction. These are data points where human intervention is needed to label them correctly, after which the model can be retrained for better decision-making.

## Hello World of Neural Networks

Handwritten digit recognition on the [MNIST database](http://yann.lecun.com/exdb/mnist/) is the Hello World of neural networks. Let us explore uncertainty modeling and ranking on this dataset, which consists of 70,000 handwritten digits split into a training set of 60,000 and a validation set of 10,000. This is a multi-class classification problem with 10 classes, the digits 0 to 9. A neural network cannot be trained to recognize everything outside its defined classes, because the space of out-of-class inputs is unbounded.
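
The closed-set behaviour described above can be seen with a few lines of plain Python. This is a minimal sketch, not part of the Clarity SDK: the logits are made-up random numbers standing in for the output of a 10-class classifier on an input that is not a digit at all.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a non-digit input (e.g. random noise);
# the values are invented purely for illustration.
random.seed(0)
noise_logits = [random.gauss(0.0, 1.0) for _ in range(10)]

probs = softmax(noise_logits)
predicted_class = max(range(10), key=lambda i: probs[i])

# The probabilities always sum to 1 and argmax always names a digit,
# so the classifier has no built-in way to answer "none of the above".
print(predicted_class, round(sum(probs), 6))
```

Whatever the input, the prediction is one of the 10 digits, which is why a separate outlier score is needed.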

## Step 0 - Training a CNN to recognize the MNIST dataset

This step is optional. If you would like to train a CNN to recognize the MNIST dataset, refer to [this tutorial](/tutorials/mnist_model_training), which trains a model for 25 epochs and saves the trained weights to `lenet_mnist_model.h5`.

Alternatively, download the trained weights from [here](https://untanglemodels.s3.amazonaws.com/lenet_mnist_model.h5).

## Step 1 - Model Uncertainty Statistics for each class

In this phase, we take the trained model and the training dataset, batched by class, and estimate the distribution the model has learned for each class during training. The resulting per-class uncertainty statistics are saved for later use.

In [None]:
# Required imports

import os
import torch
import torch.nn as nn
torch.set_printoptions(precision=8)
from untangle import UntangleAI

torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic=True

Load the model from the trained or downloaded checkpoint file.

In [None]:
# Use the same model architecture that was used for training
class LeNet(nn.Module):
    # TODO: This isn't really a LeNet, but we implement this to be
    #  consistent with the Evidential Deep Learning paper
    def __init__(self):
        super().__init__()
        lenet_conv = []
        lenet_conv += [torch.nn.Conv2d(1,20, kernel_size=(5,5))]
        lenet_conv += [torch.nn.ReLU(inplace=True)]
        lenet_conv += [torch.nn.MaxPool2d(kernel_size=(2,2), stride=2)]
        lenet_conv += [torch.nn.Conv2d(20, 50, kernel_size=(5,5))]
        lenet_conv += [torch.nn.ReLU(inplace=True)]
        lenet_conv += [torch.nn.MaxPool2d(kernel_size=(2,2), stride=2)]

        lenet_dense = []
        lenet_dense += [torch.nn.Linear(4*4*50, 500)]
        lenet_dense += [torch.nn.ReLU(inplace=True)]
        lenet_dense += [torch.nn.Linear(500, 10)]

        self.features = torch.nn.Sequential(*lenet_conv)
        self.classifier = torch.nn.Sequential(*lenet_dense)

    def forward(self, x):
        output = self.features(x)
        output = output.view(x.shape[0], -1)  # flatten to (batch, 4*4*50)
        output = self.classifier(output)
        return output
    
model_ckpt_path = 'lenet_mnist_model.h5'
model = LeNet()
# Load the weights onto the GPU when one is available, otherwise onto the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ckpt = torch.load(model_ckpt_path, map_location=device)
model.load_state_dict(ckpt)
model = model.to(device)

model.eval()
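
The first dense layer above takes `4*4*50` inputs. The short calculation below is a sketch of where that number comes from, using the standard output-size formula for convolution and pooling layers; the helper name `conv_out` is ours and is not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                                  # MNIST images are 28x28
side = conv_out(side, kernel=5)            # Conv2d(1, 20, 5x5)      -> 24
side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(2, stride=2)  -> 12
side = conv_out(side, kernel=5)            # Conv2d(20, 50, 5x5)     -> 8
side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(2, stride=2)  -> 4

flat_features = side * side * 50           # 50 channels after the last conv
print(side, flat_features)                 # 4 800
```

This matches `torch.nn.Linear(4*4*50, 500)` in the model definition, so the `view` call in `forward` produces a tensor of the expected width.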

Next, define the arguments needed to model and rank uncertainty. Some of them, such as the outlier threshold and the uncertainty threshold, can be tuned based on model complexity and knowledge of the data distribution.

In [None]:
class UncertaintyArgs:
    mname = 'lenet'              # Model name, used to name the statistics store
    batch_size = 64              # Batch size used when loading data
    num_classes = 10             # Number of classes the model is trained to classify
    img_size = (1,28,28)         # Shape of one input tensor (channels, height, width)
    outlier_threshold = -3       # Outlier threshold (default: -3, i.e. 99.7%)
    uncertainty_threshold = 0.2  # Uncertainty threshold (default: 0.2)
    sigmoid_node = False
    data_class = None            # None: model uncertainty for all classes
    metric = 'similarity'

args = UncertaintyArgs()
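
The default `outlier_threshold` comment pairs `-3` with 99.7%. Assuming the threshold is expressed in standard deviations of a normal distribution (our reading of that comment, not something the SDK documentation in this tutorial states), the figure can be checked with the standard library:

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Fraction of a normal population within 3 standard deviations of the mean
within_3_sigma = normal_cdf(3.0) - normal_cdf(-3.0)

# Fraction falling below -3 standard deviations (the lower tail only)
below_minus_3 = normal_cdf(-3.0)

print(round(within_3_sigma, 4))  # 0.9973
print(round(below_minus_3, 5))   # 0.00135
```

Under that assumption, moving the threshold closer to zero flags more points as outliers, and moving it further away flags fewer.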

Create required directories to store uncertainty statistics which will be used later to rank or evaluate a test point.

In [None]:
# Note: module_path is the parent of the current working directory
module_path = os.path.dirname(os.path.realpath('.'))
proj_path = os.path.abspath(os.path.join(module_path, os.pardir))
model_uncrt_data_path = os.path.join(module_path, 'model_uncrt_data/')
results_path = os.path.join(module_path, 'results')
# exist_ok=True makes directory creation safe to re-run
os.makedirs(model_uncrt_data_path, exist_ok=True)
os.makedirs(results_path, exist_ok=True)
# Location where the per-class uncertainty statistics are stored
uncertainty_store_path = os.path.join(model_uncrt_data_path, '{}_uncertainty'.format(args.mname))

# Create untangle object
untangle_ai = UntangleAI()

Call the untangle API `model_uncertainty` to learn and store the uncertainty statistics. It needs a data loader that loads the training dataset class by class. For MNIST, the SDK provides one: `load_mnist_per_class`.

In [None]:
def train_loader_fun(class_i):
    # Return a loader that yields training batches for a single class
    loader, _ = untangle_ai.load_mnist_per_class(batch_size=args.batch_size, data_class=class_i)
    return loader

untangle_ai.model_uncertainty(model, uncertainty_store_path, train_loader_fun, args)
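
To build intuition for what "per-class statistics" can mean, here is a deliberately simplified sketch. It is not the algorithm used by `model_uncertainty`, whose internals are not described in this tutorial. It assumes each training point has been reduced to a single confidence-like number (the values below are made up) and summarises each class by a mean and standard deviation.

```python
from statistics import mean, pstdev

def fit_class_stats(scores_by_class):
    """Return {class_id: (mean, std)} for each class's training scores."""
    return {c: (mean(v), pstdev(v)) for c, v in scores_by_class.items()}

def z_score(value, stats, class_id):
    """How many standard deviations `value` lies from the class mean."""
    mu, sigma = stats[class_id]
    return (value - mu) / sigma

# Hypothetical per-class scores collected from the training set
train_scores = {
    0: [0.90, 0.95, 0.92, 0.97],
    1: [0.80, 0.85, 0.90, 0.88],
}
stats = fit_class_stats(train_scores)

# A test point predicted as class 0 but scoring far below what
# class 0 training points looked like
z = z_score(0.50, stats, class_id=0)
print(z < -3)  # True
```

A point that sits many standard deviations below the statistics of its predicted class is a natural outlier candidate, which is the idea the thresholds in `UncertaintyArgs` build on.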

## Step 2 - Ranking of the data by assigning outlier and uncertainty scores

We now use the uncertainty statistics modeled in Step 1 to rank the dataset and determine whether each data point is an outlier, uncertain, or certain with respect to the top 2 classes that the model predicts.

The untangle_ai ranking API produces a CSV containing the scores for each data point, classifying each as certain, uncertain, or outlier with respect to the top 2 model predictions and the uncertainty statistics collected in Step 1.

Ranking is done with the untangle API `rank_data`. This API expects a data loader that loads the evaluation dataset and yields `(path, tensor)` pairs. Typically, `path` is the path to an image and `tensor` is the transformed and normalized torch tensor for that image. `path` serves as the key that identifies each data point in the evaluation dataset.

In [None]:
keys = [str(item) for item in range(args.num_classes)]
ID2Name_Map = dict(zip(keys, keys))
data_gen = untangle_ai.mnist_data_gen(args.batch_size, results_path, gen_images=True)
df = untangle_ai.rank_data(model, data_gen, uncertainty_store_path, results_path, ID2Name_Map, args)
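
For a dataset other than MNIST, a generator of the `(path, tensor)` form described above could look like the sketch below. The names `path_tensor_gen`, `load_fn`, and `fake_load` are hypothetical, and whether `rank_data` expects single items or batches is not specified here, so treat this as an illustration of the shape of the data only. In practice `load_fn` would open the image and return a normalized torch tensor; a nested list stands in for it to keep the example dependency-free.

```python
def path_tensor_gen(paths, load_fn):
    """Yield (path, tensor) pairs, one per image path."""
    for path in paths:
        yield path, load_fn(path)

def fake_load(path):
    # Stand-in for image loading and transforms: a 1 x 28 x 28
    # nested list of zeros in place of a real torch tensor.
    return [[[0.0] * 28 for _ in range(28)]]

pairs = list(path_tensor_gen(['img_0.png', 'img_1.png'], fake_load))
print([path for path, _ in pairs])
```

Because each item carries its path, every row in the ranking output can be traced back to the image it came from.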

## Results

The results from `rank_data` are loaded below as a pandas DataFrame and sorted by score. With our threshold, a score below 0.2 is treated as uncertain.

In [None]:
import pandas as pd

df = pd.read_csv(os.path.join(results_path, 'scores_similarity.csv'))
df = df.sort_values(by='score')  # ascending: lowest scores first
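
The `decision` column used in the cells that follow is produced by the SDK. As a rough mental model only, the two thresholds from `UncertaintyArgs` could combine as in this sketch. The function `label_decision`, the `z_score` field, and the order of the checks are our assumptions, not documented SDK behaviour, and the rows are invented.

```python
def label_decision(z_score, score,
                   outlier_threshold=-3.0, uncertainty_threshold=0.2):
    """Assumed rule: check for outliers first, then for uncertainty."""
    if z_score < outlier_threshold:
        return 'outlier'
    if score < uncertainty_threshold:
        return 'uncertain'
    return 'certain'

# Made-up rows; 'z_score' is a hypothetical field for illustration
rows = [
    {'img_path': 'c.png', 'z_score': 0.3, 'score': 0.91},
    {'img_path': 'a.png', 'z_score': -4.1, 'score': 0.05},
    {'img_path': 'b.png', 'z_score': -0.7, 'score': 0.12},
]
for row in rows:
    row['decision'] = label_decision(row['z_score'], row['score'])

# Sort ascending by score, as done with the DataFrame
decisions = [r['decision'] for r in sorted(rows, key=lambda r: r['score'])]
print(decisions)  # ['outlier', 'uncertain', 'certain']
```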


In [None]:
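# scipy.misc.imread is no longer available in recent SciPy releases;
# matplotlib's imread is used instead
from matplotlib.pyplot import imread
import matplotlib.pyplot as plt

# Show every data point flagged as an outlier
df_outliers = df[df['decision'] == 'outlier']
for i in range(len(df_outliers)):
    img = imread(df_outliers['img_path'].values[i])
    print(df_outliers.iloc[i])
    plt.imshow(img)
    plt.show()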

In [None]:
# Show up to three data points flagged as uncertain
df_uncertain = df[df['decision'] == 'uncertain']
for i in range(min(3, len(df_uncertain))):
    img = imread(df_uncertain['img_path'].values[i])
    print(df_uncertain.iloc[i])
    plt.imshow(img)
    plt.show()

In [None]:
from matplotlib.pyplot import imread
import matplotlib.pyplot as plt

# Show up to three data points scoring above the uncertainty threshold
df_certain = df[df['score'] > args.uncertainty_threshold]
for i in range(min(3, len(df_certain))):
    img = imread(df_certain['img_path'].values[i])
    print(df_certain.iloc[i])
    plt.imshow(img)
    plt.show()