# Performance assessment template

**Note**: This notebook suggests one way to compare model performance on a hierarchical multi-label classification problem; see the notebook [5_hierarchical_labels.ipynb](5_hierarchical_labels.ipynb).

## Import libraries

In [1]:
import pandas as pd
from sklearn.metrics import classification_report

In [2]:
UNKNOWN = pd.NA

### Test set

The test data below is made up; replace it with the real test data (see [8_holdout.ipynb](./8_holdout.ipynb)).

In [3]:
test_set = pd.DataFrame({
    'label3level1': ['Copepod',   'Copepod',   'Copepod',    'Copepod', 'Cladocera', 'Cladocera'],
    'label3level2': ['Calanoida', 'Calanoida', 'Cyclopoida', UNKNOWN,   UNKNOWN,     'Evadne'],
    'label3level3': ['Acartia',   'Calanus',   UNKNOWN,      UNKNOWN,   UNKNOWN,     UNKNOWN],
})
test_set

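|   | label3level1 | label3level2 | label3level3 |
|---|--------------|--------------|--------------|
| 0 | Copepod      | Calanoida    | Acartia      |
| 1 | Copepod      | Calanoida    | Calanus      |
| 2 | Copepod      | Cyclopoida   | `<NA>`       |
| 3 | Copepod      | `<NA>`       | `<NA>`       |
| 4 | Cladocera    | `<NA>`       | `<NA>`       |
| 5 | Cladocera    | Evadne       | `<NA>`       |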


### Model output

The model output below is made up; replace it with the classification output of a model run on the test inputs.

The example assumes the classifier can output `UNKNOWN` at the finest level when the predicted label has no children there (as for 'Evadne' in rows 4 and 5).

In [4]:
prediction_model_1 = pd.DataFrame({
    'label3level1': ['Copepod',   'Copepod',    'Copepod',    'Copepod',    'Cladocera', 'Cladocera'],
    'label3level2': ['Calanoida', 'Cyclopoida', 'Cyclopoida', 'Cyclopoida', 'Evadne',    'Evadne'],
    'label3level3': ['Acartia',   'Oithona',    'Oithona',    'Corycaeus',  UNKNOWN,     UNKNOWN],
})
prediction_model_1

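|   | label3level1 | label3level2 | label3level3 |
|---|--------------|--------------|--------------|
| 0 | Copepod      | Calanoida    | Acartia      |
| 1 | Copepod      | Cyclopoida   | Oithona      |
| 2 | Copepod      | Cyclopoida   | Oithona      |
| 3 | Copepod      | Cyclopoida   | Corycaeus    |
| 4 | Cladocera    | Evadne       | `<NA>`       |
| 5 | Cladocera    | Evadne       | `<NA>`       |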


### Report

The general idea is to assess the model against each level in the hierarchy separately. At each level, the test set is reduced by dropping the rows whose test label is `UNKNOWN` at that level, so the model's predictions for those rows are not scored.
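As a quick check of how much test data survives at each level, the known labels can be counted per column. This is a minimal sketch that repeats the made-up test set from above so it runs on its own; `rows_per_level` is a name introduced here for illustration.

```python
import pandas as pd

UNKNOWN = pd.NA

# Same made-up test set as above, repeated so this block is self-contained.
test_set = pd.DataFrame({
    'label3level1': ['Copepod', 'Copepod', 'Copepod', 'Copepod', 'Cladocera', 'Cladocera'],
    'label3level2': ['Calanoida', 'Calanoida', 'Cyclopoida', UNKNOWN, UNKNOWN, 'Evadne'],
    'label3level3': ['Acartia', 'Calanus', UNKNOWN, UNKNOWN, UNKNOWN, UNKNOWN],
})

# Number of test rows with a known (non-missing) label at each level.
rows_per_level = test_set.notna().sum()
print(rows_per_level)
```

With this data the counts are 6, 4 and 2, which should match the total support shown in each level's classification report.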

Note: Since our labels at each level in the hierarchy are all distinct, we can pass a single level (even the finest) to the report function below and compare that column alone. Care should be taken if this is ever *not* the case: the comparison between two labels at a particular level ought to be made between all levels at and below the level of interest.
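If labels were ever repeated across branches, one possible approach is to compare whole paths rather than single labels. The sketch below reads the note above as "join each label with its ancestors"; that reading, the helper `path_labels`, the `/` separator and the `ambiguous` data frame are all assumptions made for illustration, not part of the original notebooks.

```python
import pandas as pd

def path_labels(frame, levels):
    """Join the labels in `levels` (coarsest first) into one path string per row.

    Rows whose label at the last listed level is missing stay missing,
    so they can still be dropped before scoring.
    """
    paths = [
        '/'.join(map(str, row)) if pd.notna(row[-1]) else pd.NA
        for row in frame[levels].itertuples(index=False, name=None)
    ]
    return pd.Series(paths, index=frame.index, dtype='object')

# Hypothetical data in which 'Other' appears under two different parents.
ambiguous = pd.DataFrame({
    'level1': ['Copepod', 'Cladocera', 'Copepod'],
    'level2': ['Other', 'Other', pd.NA],
})

paths = path_labels(ambiguous, ['level1', 'level2'])
print(paths.tolist())
```

Here the two 'Other' labels become 'Copepod/Other' and 'Cladocera/Other', so a prediction in the wrong branch no longer counts as a match. The same path columns, built for both the test set and the predictions, could then be scored in place of the raw level column.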

In [5]:
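def report(label):
    """Print a classification report for one hierarchy level (a column name)."""
    test_data = test_set[label]
    predicted = prediction_model_1[label]
    # Score only the rows whose test label is known at this level
    known = test_data.notna()
    print(
        classification_report(
            test_data[known],
            predicted[known],
            zero_division=0,
        )
    )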

In [6]:
report('label3level1')

              precision    recall  f1-score   support

   Cladocera       1.00      1.00      1.00         2
     Copepod       1.00      1.00      1.00         4

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6



In [7]:
report('label3level2')

              precision    recall  f1-score   support

   Calanoida       1.00      0.50      0.67         2
  Cyclopoida       0.50      1.00      0.67         1
      Evadne       1.00      1.00      1.00         1

    accuracy                           0.75         4
   macro avg       0.83      0.83      0.78         4
weighted avg       0.88      0.75      0.75         4



In [8]:
report('label3level3')

              precision    recall  f1-score   support

     Acartia       1.00      1.00      1.00         1
     Calanus       0.00      0.00      0.00         1
     Oithona       0.00      0.00      0.00         0

    accuracy                           0.50         2
   macro avg       0.33      0.33      0.33         2
weighted avg       0.50      0.50      0.50         2
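To view the levels side by side, the per-level reports can be condensed into one summary table. The following is a minimal sketch under the same made-up data; `summarise` is a hypothetical helper introduced here, and the choice of accuracy and macro F1 as headline metrics is an assumption rather than a recommendation from the original notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

UNKNOWN = pd.NA

# Same made-up data as above, repeated so this block is self-contained.
test_set = pd.DataFrame({
    'label3level1': ['Copepod', 'Copepod', 'Copepod', 'Copepod', 'Cladocera', 'Cladocera'],
    'label3level2': ['Calanoida', 'Calanoida', 'Cyclopoida', UNKNOWN, UNKNOWN, 'Evadne'],
    'label3level3': ['Acartia', 'Calanus', UNKNOWN, UNKNOWN, UNKNOWN, UNKNOWN],
})
prediction_model_1 = pd.DataFrame({
    'label3level1': ['Copepod', 'Copepod', 'Copepod', 'Copepod', 'Cladocera', 'Cladocera'],
    'label3level2': ['Calanoida', 'Cyclopoida', 'Cyclopoida', 'Cyclopoida', 'Evadne', 'Evadne'],
    'label3level3': ['Acartia', 'Oithona', 'Oithona', 'Corycaeus', UNKNOWN, UNKNOWN],
})

def summarise(truth, predictions):
    """Return one row of headline metrics per hierarchy level."""
    rows = []
    for level in truth.columns:
        # Score only the rows whose test label is known at this level.
        known = truth[level].notna()
        y_true = truth.loc[known, level]
        y_pred = predictions.loc[known, level]
        rows.append({
            'level': level,
            'support': int(known.sum()),
            'accuracy': accuracy_score(y_true, y_pred),
            'macro_f1': f1_score(y_true, y_pred, average='macro', zero_division=0),
        })
    return pd.DataFrame(rows).set_index('level')

summary = summarise(test_set, prediction_model_1)
print(summary)
```

Calling `summarise` once per model, with the same test set, would give tables that can be compared directly. Note that the sketch assumes the model never predicts `UNKNOWN` on a row whose test label is known, as is the case in this example.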

