# Bayes Error Rate Estimation Tutorial


## _Problem Statement_

Every machine learning classification task has an _inherent difficulty_ determined by the signal-to-noise ratio of the images. One way to quantify this difficulty is the Bayes Error Rate (BER), also called the irreducible error: the lowest error rate that any classifier can achieve on the data.
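
To make the idea of an irreducible error concrete, consider a toy case that is not part of the DAML workflow: two equally likely classes whose single feature follows a Gaussian with the same spread but different means. For this case the Bayes Error Rate has a closed form, and the sketch below evaluates it. The function names are illustrative only.

```python
import math


def gaussian_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_gaussian_ber(mean_a, mean_b, sigma):
    """Bayes error rate for two equally likely classes whose 1-D feature is
    Gaussian with a shared standard deviation."""
    separation = abs(mean_b - mean_a) / sigma
    # The optimal decision threshold is the midpoint between the means;
    # each class leaks one Gaussian tail past that threshold.
    return gaussian_cdf(-separation / 2.0)


ber_far = two_gaussian_ber(0.0, 4.0, 1.0)   # classes far apart
ber_near = two_gaussian_ber(0.0, 1.0, 1.0)  # classes heavily overlapping
print(f"well separated: {ber_far:.4f}")
print(f"overlapping:    {ber_near:.4f}")
```

The more the class distributions overlap, the higher the error that no classifier can avoid, regardless of model capacity or training time.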

DAML provides a method for estimating this error rate from image embeddings.


### _When to use_

Use the `BER` metric when you would like to measure the feasibility of a machine learning task. For example, if you have an operational accuracy requirement of 80%, BER can tell you whether that accuracy is achievable given the imagery.


### _What you will need_

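1. A set of image embeddings and their corresponding class labels. This typically requires training an autoencoder to compress the images; for small images such as MNIST, flattened pixel values can be used instead, as in this tutorial.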


### _Setting up_

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    %pip install -q daml
except Exception:
    pass

import os

from pytest import approx

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow.keras.datasets as tfds
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer
from tensorflow.nn import relu

from daml.metrics import BER

tf.keras.utils.set_random_seed(408)

## Loading in data

Let's start by loading TensorFlow's MNIST dataset,
then we will examine it.


In [None]:
# Load the MNIST dataset from tf.keras.datasets (imported above as tfds)
(images, labels), (test_images, test_labels) = tfds.mnist.load_data()

In [None]:
print("Number of training samples: ", len(images))
print("Image shape:", images[0].shape)
print("Label counts: ", np.unique(labels, return_counts=True))

To highlight the effects of modifying the dataset on its Bayes Error Rate,
we will keep only a subset of 6,000 images and their labels: the first 2,000 samples of each of the digits 1, 4, and 9.


In [None]:
images_split = {}
labels_split = {}

# Keep only 1, 4, and 9
for label in (1, 4, 9):
    subset_indices = np.where(labels == label)
    images_split[label] = images[subset_indices][:2000]
    labels_split[label] = labels[subset_indices][:2000]

images_subset = np.concatenate(list(images_split.values()))
labels_subset = np.concatenate(list(labels_split.values()))
print(images_subset.shape)
print(np.unique(labels_subset, return_counts=True))

We have taken a subset of the data that contains only the digits 1, 4, and 9.
The BER estimate expects each sample to be a 1-dimensional vector, so we flatten the images during this step. This is acceptable because MNIST images are small; in practice, we would need to apply some form of dimension reduction (such as an autoencoder) here.


In [None]:
# Flatten the images
images_flattened = images_subset.reshape((images_subset.shape[0], -1))
print("Dataset shape:", images_flattened.shape)

We now have 6,000 flattened images of size 784. Next, we can move on to evaluating the dataset.


## Evaluation

Suppose we would like to build a classifier that differentiates between the handwritten digits 1, 4, and 9 with a predetermined accuracy requirement of 99%.

We will use BER to check the feasibility of the task.
As the images are small, we can simply use the flattened raw pixel intensities to calculate BER (no embedding necessary).
_Note_: This will not be the case in general.
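
For larger images, the samples would first need to be reduced to compact vectors. The following is a minimal sketch of that step, with PCA swapped in for the autoencoder purely to keep the example short. The function `pca_reduce` and the random placeholder data are illustrative and not part of DAML.

```python
import numpy as np


def pca_reduce(samples, num_components):
    """Project row-vector samples onto their top principal components."""
    centered = samples - samples.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


rng = np.random.default_rng(0)
# Placeholder standing in for 200 flattened 32x32 images.
fake_images = rng.random((200, 32 * 32))
embeddings = pca_reduce(fake_images, 16)
print("Embedding shape:", embeddings.shape)
```

The resulting array has one low-dimensional row per image, which is the same layout as the flattened MNIST array used in this tutorial.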


In [None]:
# Load the BER metric
metric = BER(images_flattened, labels_subset, method="MST")
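
The `method="MST"` argument points to a minimum spanning tree (MST) estimator. DAML's internals are not shown in this tutorial, so the following is only a sketch of the general idea that such estimators build on: connect all samples with an MST and count the edges that join samples with different labels. The names `mst_edges` and `cross_class_edge_count`, as well as the synthetic data, are assumptions made for illustration.

```python
import numpy as np


def mst_edges(points):
    """Edges (i, j) of a Euclidean minimum spanning tree, via Prim's algorithm."""
    n = len(points)
    # Dense pairwise distances: fine for a small demo, costly for large datasets.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = dists[0].copy()          # cheapest known link from the tree to each node
    best_from = np.zeros(n, dtype=int)   # tree node that provides that link
    edges = []
    for _ in range(n - 1):
        # Ignore nodes already in the tree, then attach the closest remaining node.
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(np.argmin(candidates))
        edges.append((int(best_from[j]), j))
        in_tree[j] = True
        closer = dists[j] < best_dist
        best_dist = np.where(closer, dists[j], best_dist)
        best_from = np.where(closer, j, best_from)
    return edges


def cross_class_edge_count(points, labels):
    """Number of MST edges whose two endpoints carry different labels."""
    return int(sum(labels[i] != labels[j] for i, j in mst_edges(points)))


rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
cluster_b = rng.normal(loc=50.0, scale=1.0, size=(20, 2))
points = np.concatenate([cluster_a, cluster_b])
labels = np.array([0] * 20 + [1] * 20)

mismatches = cross_class_edge_count(points, labels)
print("cross-class MST edges:", mismatches, "out of", len(points) - 1)
```

When the classes are well separated, almost every sample's nearest tree neighbor shares its label, so very few edges cross class boundaries. Heavily overlapping classes produce many such edges, which corresponds to a higher error estimate.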

In [None]:
# Evaluate the BER metric for the MNIST data with digits 1, 4, 9.
# One minus the value of this metric gives our estimate of the upper bound on accuracy.
base_ber = metric.evaluate()

In [None]:
print("The Bayes error rate estimate:", base_ber)

In [None]:
### TEST ASSERTION ###
print(base_ber)
assert base_ber["ber"] == approx(0.025833, abs=1e-6)
assert base_ber["ber_lower"] == approx(0.0130443, abs=1e-6)

The estimate of the maximum achievable accuracy is one minus the BER estimate.


In [None]:
print("The maximum achievable accuracy:", (1 - base_ber["ber"]) * 100)

### Results

The maximum achievable accuracy on a dataset of 1, 4, and 9 is about 97.4%.
This _does not_ meet our requirement of 99% accuracy!
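
The comparison against the requirement can be written as a small check. This sketch hard-codes the values asserted earlier in this tutorial and assumes that `ber_lower` is a lower bound on the error rate, as its name suggests. The helper `is_feasible` is illustrative and not part of DAML.

```python
def is_feasible(error_rate, required_accuracy):
    """True if one minus the error rate reaches the required accuracy."""
    return (1.0 - error_rate) >= required_accuracy


# Values taken from the three-class (1, 4, 9) estimate in this tutorial.
three_class = {"ber": 0.025833, "ber_lower": 0.0130443}
requirement = 0.99

feasible = is_feasible(three_class["ber"], requirement)
# Optimistic case: use the lower bound on the error instead (assumption).
feasible_optimistic = is_feasible(three_class["ber_lower"], requirement)

print("Feasible with the BER estimate:", feasible)
print("Feasible with the lower bound: ", feasible_optimistic)
```

Under that assumption, even the most optimistic reading of the estimate stays below 99%, which supports reformulating the problem rather than investing in a larger model.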


## Modify dataset classification

To address the insufficient accuracy, let's modify the dataset to classify an image as "1" or "Not a 1".
By combining classes, we can hopefully reach the desired level of attainable accuracy.


In [None]:
# Creates a binary mask where current label == 1 that can be used as the new labels
labels_merged = labels_subset == 1
print("New label counts:", np.unique(labels_merged, return_counts=True))

# Update the metric with merged labels with digits 1, and not 1 (classes 4 & 9).
metric.labels = labels_merged

In [None]:
# Evaluate the BER metric for the MNIST data with updated labels
new_ber = metric.evaluate()

In [None]:
print("The Bayes error rate estimate:", new_ber)

In [None]:
### TEST ASSERTION ###
print(new_ber)
assert new_ber["ber"] == approx(0.005, abs=1e-6)
assert new_ber["ber_lower"] == approx(0.002506, abs=1e-6)
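
As an observation rather than documented DAML behavior, both `ber` / `ber_lower` pairs asserted in this tutorial are consistent with a simple relation between the two values that depends only on the number of classes. The formula in `lower_from_upper` below is an assumption chosen because it reproduces those numbers; it is not taken from the DAML source.

```python
import math


def lower_from_upper(upper, num_classes):
    """Assumed relation between the reported `ber` and `ber_lower` values."""
    m = num_classes
    inner = max(1.0 - (m / (m - 1)) * upper, 0.0)
    return ((m - 1) / m) * (1.0 - math.sqrt(inner))


# Three classes (1, 4, 9), then two classes (1 vs. not 1).
lower_three = lower_from_upper(0.025833, 3)
lower_two = lower_from_upper(0.005, 2)
print(f"three classes: {lower_three:.7f}")
print(f"two classes:   {lower_two:.7f}")
```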

The estimate of the maximum achievable accuracy is one minus the BER estimate.


In [None]:
print("The maximum achievable accuracy:", (1 - new_ber["ber"]) * 100)

### Results

The maximum achievable accuracy on a dataset of 1 and not 1 (4, 9) is about 99.5%.
This _does_ meet our accuracy requirement.

By using BER to check for feasibility early on, we were able to reformulate the problem so that it is feasible under our specifications.


## Building a classifier

We can now attempt to build a classifier that achieves this level of accuracy.


In [None]:
# Build a simple CNN for classifying MNIST images.
model = Sequential(
    [
        InputLayer(input_shape=(28, 28, 1)),
        Conv2D(
            64,
            4,
            strides=2,
            padding="same",
            activation=relu,
        ),
        Conv2D(
            128,
            4,
            strides=2,
            padding="same",
            activation=relu,
        ),
        Conv2D(
            512,
            4,
            strides=2,
            padding="same",
            activation=relu,
        ),
        Flatten(),
        Dense(2),
    ]
)

Since we are training on a subset of the digits, we must subset the test data in the same way.


In [None]:
test_indices = np.where((test_labels == 1) | (test_labels == 4) | (test_labels == 9))
test_images_subset = test_images[test_indices]
test_labels_subset = test_labels[test_indices]
test_labels_merged = test_labels_subset == 1
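
The chained comparisons used to select the digits can also be expressed with `np.isin`, which scales more gracefully when many classes are kept. The small label array below is made up for illustration.

```python
import numpy as np

# Made-up labels standing in for test_labels.
example_labels = np.array([0, 1, 4, 9, 7, 1, 9, 3])

# Chained comparisons, as in the cell above.
chained_mask = (example_labels == 1) | (example_labels == 4) | (example_labels == 9)

# Equivalent membership test.
isin_mask = np.isin(example_labels, (1, 4, 9))

print("Masks agree:", np.array_equal(chained_mask, isin_mask))
print("Kept labels:", example_labels[isin_mask])
```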

## Train and test the model

Now we train and test the model on the modified data.


In [None]:
# Set up model hyperparameters
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Fitting a model may take a few minutes
history = model.fit(
    images_subset,
    labels_merged,
    epochs=90,
    batch_size=32,
    steps_per_epoch=1,
    validation_data=(test_images_subset, test_labels_merged),
    verbose=0,
)

In [None]:
loss, accuracy = model.evaluate(test_images_subset, test_labels_merged, verbose=1)
print(f"The model accuracy: {accuracy*100:0.2f}%")

In [None]:
### TEST ASSERTION ###
print(accuracy)
assert accuracy == approx(0.9914, abs=1e-4)

In [None]:
plt.title("Model Accuracy")
plt.plot(range(60, 90), np.array(history.history["val_accuracy"])[60:], label="Classifier")
plt.hlines(
    y=1 - new_ber["ber"],
    colors=["red"],
    xmin=60,
    xmax=90,
    label="1-BER",
    linestyles="dashed",
)
plt.hlines(
    y=0.99,
    colors=["green"],
    xmin=60,
    xmax=90,
    label="Accuracy Requirement",
    linestyles="dashed",
)

plt.xticks(range(60, 91, 10))
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(loc=4)

### Results

The model achieves an accuracy of 99.14%, exceeding the requirement of 99%.

The model accuracy does not quite reach the estimated maximum achievable accuracy of about 99.5%, so there is still room for improvement.
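
The remaining gap can be quantified directly from the numbers reported in this tutorial. The sketch below hard-codes those values; the variable names are illustrative.

```python
model_accuracy = 0.9914       # test accuracy reported above
max_accuracy = 1 - 0.005      # one minus the BER estimate for 1 vs. not 1
requirement = 0.99

headroom = max_accuracy - model_accuracy   # distance to the estimated ceiling
margin = model_accuracy - requirement      # distance above the requirement

print(f"Headroom to 1-BER:        {headroom * 100:.2f} percentage points")
print(f"Margin above requirement: {margin * 100:.2f} percentage points")
```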
