# Confusion matrix

A confusion matrix counts how often each word is detected for each spoken/tested word. It is a compact logging tool for a classification experiment.
The spoken words correspond to the rows of the matrix and the detected words correspond to the columns.

In the following experiment, the logging of a confusion matrix is simulated:

In [1]:
import numpy as np

NumberOfWordClasses = 4
NumberOfWordsPerClass = 100
AssumedAccuracy = 0.6
AssumedFalseRejectionRate = 0.05

NumberOfTotalWords = NumberOfWordClasses * NumberOfWordsPerClass
FalseRejectionCounter = 0
# rows: spoken word, columns: detected word
ConfusionMatrix = np.zeros((NumberOfWordClasses, NumberOfWordClasses))
for SpokenWordIndex in range(ConfusionMatrix.shape[0]):
    for n in range(NumberOfWordsPerClass):
        # scalar uniform random number in [0, 1) that decides the outcome
        RandomNumber = np.random.rand()
        if RandomNumber < AssumedFalseRejectionRate:
            # false rejection occurs
            FalseRejectionCounter += 1
        elif RandomNumber < AssumedFalseRejectionRate + AssumedAccuracy:
            # correct classification occurs
            ConfusionMatrix[SpokenWordIndex, SpokenWordIndex] += 1
        else:
            # wrong classification occurs
            # find arbitrary wrong index
            TargetWrongIndex = SpokenWordIndex
            while TargetWrongIndex == SpokenWordIndex:
                TargetWrongIndex = int(np.random.rand() * NumberOfWordClasses)
            ConfusionMatrix[SpokenWordIndex, TargetWrongIndex] += 1
            
print(ConfusionMatrix)

[[55. 13. 12. 13.]
 [ 8. 70.  9. 10.]
 [ 8. 20. 54. 12.]
 [10. 13. 15. 59.]]
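
Because the simulation draws random numbers, every run yields a different matrix. The following is a minimal sketch of a reproducible variant, assuming NumPy's `np.random.default_rng` with a fixed seed. The function name `simulate_confusion_matrix` and its parameter names are illustrative and not part of the original experiment.

```python
import numpy as np

def simulate_confusion_matrix(num_classes=4, words_per_class=100,
                              accuracy=0.6, false_rejection_rate=0.05, seed=0):
    """Simulate a confusion matrix; returns (matrix, number of false rejections)."""
    rng = np.random.default_rng(seed)  # fixed seed: same result on every run
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    false_rejections = 0
    for spoken in range(num_classes):
        for _ in range(words_per_class):
            r = rng.random()
            if r < false_rejection_rate:
                # the spoken word is ignored
                false_rejections += 1
            elif r < false_rejection_rate + accuracy:
                # the spoken word is detected correctly
                matrix[spoken, spoken] += 1
            else:
                # pick one of the other classes uniformly, without a retry loop
                wrong = int(rng.integers(num_classes - 1))
                if wrong >= spoken:
                    wrong += 1
                matrix[spoken, wrong] += 1
    return matrix, false_rejections

matrix, rejections = simulate_confusion_matrix()
print(matrix)
print('false rejections:', rejections)
```

Calling the function twice with the same `seed` returns identical matrices, which makes the derived metrics comparable between runs.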


Several parameters can be derived from the confusion matrix in order to evaluate the quality of the underlying classification experiment:

## Accuracy
Accuracy is the number of correct detections (the sum of the main diagonal) over the total number of spoken words. The range of the accuracy is 0..1. An accuracy of 0 is the worst case, in which no word is detected correctly. An accuracy of 1 is the perfect case, in which all spoken words are detected correctly.

In [2]:
NumberOfCorrectDetectedWords = 0
for SpokenWordIndex in range(ConfusionMatrix.shape[0]):
    NumberOfCorrectDetectedWords += ConfusionMatrix[SpokenWordIndex, SpokenWordIndex]
Accuracy = NumberOfCorrectDetectedWords / NumberOfTotalWords
print('accuracy = ', str(Accuracy))

accuracy =  0.56
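
The loop over the main diagonal can also be written with `np.trace`. The following sketch uses a small hypothetical 3-class matrix (not the simulated one) with 10 spoken words per class and no false rejections.

```python
import numpy as np

# hypothetical 3-class example: 10 spoken words per class, no false rejections
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 3, 7]])
total_words = 30

# np.trace sums the main diagonal, i.e. the correct detections
accuracy = np.trace(cm) / total_words
print('accuracy = ', accuracy)
```

If false rejections occur, `total_words` must remain the number of spoken words, not the sum of the matrix entries.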


## False rejections

Another parameter for evaluating a voice control algorithm is the false rejection rate. If a word is spoken and the voice detection algorithm ignores the spoken word, a false rejection occurs. The false rejection rate is the number of false rejections over the number of spoken/tested words.

In [3]:
NumberOfFalseRejections = NumberOfTotalWords - np.sum(ConfusionMatrix)
FalseRejectionRate = NumberOfFalseRejections / NumberOfTotalWords

print('Number of false rejections = ', str(int(NumberOfFalseRejections)))
print('false rejection rate = ', str(FalseRejectionRate))

Number of false rejections =  19
false rejection rate =  0.0475
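
Since every class is tested with the same number of words, the row sums also reveal how the false rejections are distributed over the classes. A minimal sketch, using the matrix printed for the simulation above and assuming 100 spoken words per class:

```python
import numpy as np

# the simulated matrix printed above; 100 words were spoken per class
cm = np.array([[55, 13, 12, 13],
               [ 8, 70,  9, 10],
               [ 8, 20, 54, 12],
               [10, 13, 15, 59]])
words_per_class = 100

# each row sum counts the words of that class that were detected at all;
# the remaining spoken words of that class were rejected
rejections_per_class = words_per_class - cm.sum(axis=1)
rejection_rate = rejections_per_class.sum() / (words_per_class * cm.shape[0])

print('false rejections per class:', rejections_per_class)
print('false rejection rate:', rejection_rate)
```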


## False alarms

A false alarm occurs if no word is spoken, but the voice control system detects a word due to background noise or internal errors. In a real voice control benchmark, the false alarms must be counted in order to evaluate the voice control algorithm correctly. In this artificial scenario with a random confusion matrix, the number of false alarms is therefore set to a random value.

In [4]:
NumberOfFalseAlarms = int(2 * np.random.rand() * NumberOfTotalWords * AssumedFalseRejectionRate)
FalseAlarmRate = NumberOfFalseAlarms / NumberOfTotalWords

print('Number of false alarms = ', str(NumberOfFalseAlarms))
print('false alarm rate = ', str(FalseAlarmRate))

Number of false alarms =  0
false alarm rate =  0.0


## Word error rate

In order to evaluate the overall performance of a classification task, the accuracy, the false alarms and the false rejections must be considered.

To compare two or more classification algorithms easily, a single-value measure is beneficial. One example of such a measure is the word error rate. It is defined as the sum of all errors (wrong classifications, false rejections and false alarms) over the number of spoken/tested words:

In [5]:
# wrong classifications are the off-diagonal entries of the confusion matrix;
# false rejections are not part of the matrix and are added separately
NumberOfWrongClassifications = np.sum(ConfusionMatrix) - NumberOfCorrectDetectedWords
NumberOfWordErrors = NumberOfFalseAlarms + NumberOfFalseRejections + NumberOfWrongClassifications
WordErrorRate = NumberOfWordErrors / NumberOfTotalWords

print('word error rate = ', str(WordErrorRate))

word error rate =  0.44


## Precision

Precision is the ratio of how often a single word is detected correctly (the entry on the main diagonal) to how often this word is detected at all (the sum of the column).

In [6]:
for n in range(ConfusionMatrix.shape[1]):
    Precision = ConfusionMatrix[n, n] / np.sum(ConfusionMatrix[:, n])
    print('Precision for the ', n, '-th word: ', Precision)

Precision for the  0 -th word:  0.5656565656565656
Precision for the  1 -th word:  0.5517241379310345
Precision for the  2 -th word:  0.59
Precision for the  3 -th word:  0.6421052631578947


## Recall

Recall is the ratio of how often a single word is detected correctly (the entry on the main diagonal) to how often this word is spoken/tested and detected as any word (the sum of the row):

In [7]:
for n in range(ConfusionMatrix.shape[0]):
    Recall = ConfusionMatrix[n, n] / np.sum(ConfusionMatrix[n, :])
    print('Recall for the ', n, '-th word: ', Recall)

Recall for the  0 -th word:  0.5957446808510638
Recall for the  1 -th word:  0.5161290322580645
Recall for the  2 -th word:  0.6210526315789474
Recall for the  3 -th word:  0.6161616161616161
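
Precision and recall differ only in the axis along which the matrix is summed. The following sketch computes both for all classes at once on a small hypothetical 2-class matrix; the variable names are illustrative.

```python
import numpy as np

# hypothetical 2-class confusion matrix (rows: spoken, columns: detected)
cm = np.array([[7, 2],
               [3, 6]])

correct = np.diag(cm)                 # main diagonal: correct detections
precision = correct / cm.sum(axis=0)  # divide by the column sums (detected)
recall = correct / cm.sum(axis=1)     # divide by the row sums (spoken)

print('precision per class:', precision)
print('recall per class:   ', recall)
```

Note that a class that is never detected has a column sum of zero, so its precision is undefined; this sketch assumes that every class is detected at least once.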


## Accuracy paradox
Accuracy is only a meaningful measure of classification quality if the training data is nearly balanced.

## Balanced training data
Balanced training data means that all classes to be detected have nearly the same number of training samples. In imbalanced training data, the classes do not occur with the same probability.

A perfect die produces all six numbers nearly equally often. Therefore, each class has a probability of roughly $\frac{1}{6}$ in the training data.

Training data for detecting earthquakes may be heavily imbalanced, because in $99.99$ percent of all cases there is no earthquake. Training with such data may result in the simplest possible classifier, which always outputs the class with the highest probability. In this case, the classifier always states: there is no earthquake. This statement reaches an accuracy of $99.99$ percent. Nevertheless, this classifier is useless.
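
The earthquake example can be reproduced with a few lines. The sketch below assumes a hypothetical data set of 10000 observations, exactly one of which is an earthquake, and a trivial decision rule that always answers "no earthquake".

```python
import numpy as np

# hypothetical data set: 10000 observations, exactly one of them is an earthquake
# rows: actual class, columns: detected class; order: [no earthquake, earthquake]
# the trivial decision rule never detects an earthquake, so column 1 is empty
cm = np.array([[9999, 0],
               [   1, 0]])

accuracy = np.trace(cm) / cm.sum()
recall_earthquake = cm[1, 1] / cm[1, :].sum()

print('accuracy = ', accuracy)
print('recall for the class earthquake = ', recall_earthquake)
```

The accuracy is close to 1, but the recall for the class of interest is 0: not a single earthquake is found.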

## F Score
In the case of unbalanced data, the F score is a better indicator of the quality of a given classification than the accuracy.

The F score is the harmonic mean of the mean precision and the mean recall:

$F=\frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}$

The bigger the F score, the better.
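
For positive precision and recall, the formula is algebraically equivalent to $F=\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$, which avoids the two reciprocals. A minimal sketch of this form follows; the function name `f_score` and the convention of returning 0 when both inputs are 0 are assumptions made for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# hypothetical values: a high precision cannot compensate for a low recall
print('F score:        ', f_score(0.9, 0.1))
print('arithmetic mean:', (0.9 + 0.1) / 2)
```

Unlike the arithmetic mean, the harmonic mean is dominated by the smaller of the two values, so a classifier must achieve both a good precision and a good recall to reach a high F score.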

## Programming Exercise:

A camera system should detect if a traffic light is red.
The following confusion matrix is measured.
The first row corresponds to a traffic light showing 'red'.
The second row corresponds to a traffic light showing 'green'.
The first column corresponds to a detected traffic light 'red'.
The second column corresponds to a detected traffic light 'green'.

The aim is to detect red lights.
Accordingly, a false alarm occurs if a red light is detected although the traffic light is green.
A false rejection occurs if a green light is detected although the traffic light is red.

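Implement the functions that evaluate the accuracy, the mean precision, the mean recall, the false alarm rate, the false rejection rate and the F1 score. The false alarm rate and the false rejection rate are both relative to the total number of samples in the confusion matrix.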

In [8]:
def EvalAccuracy(ConfusionMatrix):
    Accuracy = 0.0
    # solution begins

    # solution ends
    return Accuracy

def EvalPrecision(ConfusionMatrix):
    MeanPrecision = 0.0
    # solution begins

    # solution ends
    return MeanPrecision

def EvalRecall(ConfusionMatrix):
    MeanRecall = 0.0
    # solution begins

    # solution ends
    return MeanRecall

def EvalFalseAlarms(ConfusionMatrix):
    FalseAlarms = 0.0
    # solution begins

    # solution ends
    return FalseAlarms

def EvalFalseRejections(ConfusionMatrix):
    FalseRejections = 0.0
    # solution begins

    # solution ends
    return FalseRejections

def EvalF1Score(ConfusionMatrix):
    F1Score = 0.0
    # solution begins

    # solution ends
    return F1Score

import unittest

class TestProgrammingExercise(unittest.TestCase):

    def test_Accuracy1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        accuracy = EvalAccuracy(ConfusionMatrix)
        self.assertAlmostEqual(accuracy, 0.722, delta = 1e-3)

    def test_Accuracy2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        accuracy = EvalAccuracy(ConfusionMatrix)
        self.assertAlmostEqual(accuracy, 0.684, delta = 1e-3)

    def test_Precision1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        precision = EvalPrecision(ConfusionMatrix)
        self.assertAlmostEqual(precision, 0.725, delta = 1e-3)

    def test_Precision2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        precision = EvalPrecision(ConfusionMatrix)
        self.assertAlmostEqual(precision, 0.672, delta = 1e-3)

    def test_Recall1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        recall = EvalRecall(ConfusionMatrix)
        self.assertAlmostEqual(recall, 0.722, delta = 1e-3)

    def test_Recall2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        recall = EvalRecall(ConfusionMatrix)
        self.assertAlmostEqual(recall, 0.721, delta = 1e-3)

    def test_FalseAlarms1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        FalseAlarms = EvalFalseAlarms(ConfusionMatrix)
        # green shown, red detected: 3 of 18 samples
        self.assertAlmostEqual(FalseAlarms, 0.167, delta = 1e-3)

    def test_FalseAlarms2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        FalseAlarms = EvalFalseAlarms(ConfusionMatrix)
        # green shown, red detected: 1 of 19 samples
        self.assertAlmostEqual(FalseAlarms, 0.053, delta = 1e-3)

    def test_FalseRejections1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        FalseRejections = EvalFalseRejections(ConfusionMatrix)
        # red shown, green detected: 2 of 18 samples
        self.assertAlmostEqual(FalseRejections, 0.111, delta = 1e-3)

    def test_FalseRejections2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        FalseRejections = EvalFalseRejections(ConfusionMatrix)
        # red shown, green detected: 5 of 19 samples
        self.assertAlmostEqual(FalseRejections, 0.263, delta = 1e-3)

    def test_F1Score1(self):
        ConfusionMatrix = np.array([[7, 2], [3, 6]])
        F1Score = EvalF1Score(ConfusionMatrix)
        self.assertAlmostEqual(F1Score, 0.724, delta = 1e-3)

    def test_F1Score2(self):
        ConfusionMatrix = np.array([[9, 5], [1, 4]])
        F1Score = EvalF1Score(ConfusionMatrix)
        self.assertAlmostEqual(F1Score, 0.696, delta = 1e-3)

unittest.main(argv=[''], verbosity=2, exit=False)

test_Accuracy1 (__main__.TestProgrammingExercise.test_Accuracy1) ... ok
test_Accuracy2 (__main__.TestProgrammingExercise.test_Accuracy2) ... ok
test_F1Score1 (__main__.TestProgrammingExercise.test_F1Score1) ... ok
test_F1Score2 (__main__.TestProgrammingExercise.test_F1Score2) ... ok
test_FalseAlarms1 (__main__.TestProgrammingExercise.test_FalseAlarms1) ... ok
test_FalseAlarms2 (__main__.TestProgrammingExercise.test_FalseAlarms2) ... ok
test_FalseRejections1 (__main__.TestProgrammingExercise.test_FalseRejections1) ... ok
test_FalseRejections2 (__main__.TestProgrammingExercise.test_FalseRejections2) ... ok
test_Precision1 (__main__.TestProgrammingExercise.test_Precision1) ... ok
test_Precision2 (__main__.TestProgrammingExercise.test_Precision2) ... ok
test_Recall1 (__main__.TestProgrammingExercise.test_Recall1) ... ok
test_Recall2 (__main__.TestProgrammingExercise.test_Recall2) ... ok

----------------------------------------------------------------------
Ran 12 tests in 0.019s

OK


<unittest.main.TestProgram at 0x1f067909550>

## Exam preparation

1) You have collected images of three different animals in the wild: apes (356 images), boars (987 images) and unicorns (1 image taken after a wild student party). What is the accuracy of the simplest possible classifier and what is the output of this simplest possible classifier?

2) The following confusion matrix CM is given. Evaluate the accuracy for this confusion matrix and the recall and precision for each class. Which accuracy can be achieved by the simplest possible classifier? Is this data set balanced?

In [5]:
CM = np.array([[8, 1, 2], [0, 5, 3], [3, 2, 5]])
print(CM)

[[8 1 2]
 [0 5 3]
 [3 2 5]]
