# Evaluation (with annotated test set)

After training a model to classify single-cell images, it is often useful to evaluate its performance on an annotated dataset that the model has not seen. The results indicate how well the model is likely to perform on new data.

Suppose we have the following directory structure. Data from this experiment was not shown to the model during training. Images are saved as NPY files:

    /data/parsed/
        Experiment 003/
            Day 1/
                Sample A/
                    Replicate 1/
                        Class B/
                            B__3618e715e62a229aa78a7e373b49b888.npy
                            B__3cf53cea7f4db1cfd101e06c366c9868.npy
                            B__84949e1eba7802b00d4a1755fa9af15e.npy
                            B__852a1edbf5729fe8721e9e5404a8ad20.npy
        ...
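
Before loading anything, it can help to check how many NPY files each class directory holds. The following is a minimal sketch that only assumes the layout shown above; `count_npy_per_class` is a hypothetical helper and not part of deepometry.

```python
import collections
import os


def count_npy_per_class(root):
    """Count .npy files under `root`, keyed by the name of their parent directory."""
    counts = collections.Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        n_npy = sum(1 for name in filenames if name.lower().endswith(".npy"))
        if n_npy:
            # Files in "Class B" directories of different replicates are pooled together.
            counts[os.path.basename(dirpath)] += n_npy
    return dict(counts)
```

For example, `count_npy_per_class('/data/parsed/')` would return a dictionary such as `{'Class B': 4}` for the tree above, if no other class directories exist.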

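In this notebook, file paths are collected with `deepometry.utils.collect_pathnames` and the parsed data and their corresponding labels are then loaded with `deepometry.utils._load`. The number of samples can be limited to 256 per class by passing `n_samples=256` to `collect_pathnames`.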
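
Capping the number of samples per class can be pictured as follows. This is an illustrative sketch only: `cap_per_class` and its arguments are hypothetical names, and deepometry's own sub-sampling may differ in detail (for instance in how it draws the random subset).

```python
import random


def cap_per_class(paths_by_label, n_samples=None, seed=0):
    """Return at most `n_samples` paths per label; keep everything if `n_samples` is None."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    capped = {}
    for label, paths in paths_by_label.items():
        if n_samples is None or len(paths) <= n_samples:
            capped[label] = list(paths)
        else:
            capped[label] = rng.sample(list(paths), n_samples)
    return capped
```

Classes with fewer than `n_samples` files are kept whole, so the cap only trims over-represented classes.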

# User's settings

In [None]:
input_dir = 'D:/Works/non_GSK/Deepometry/Data/APPROACH_Master_EDSRC/STEP1_Parsing/Test'
modellocation = 'D:/Works/non_GSK/Deepometry/Data/APPROACH_Master_EDSRC/STEP2_Model_training'
output_dir = 'D:/Works/non_GSK/Deepometry/Data/APPROACH_Master_EDSRC/STEP3a_Evaluation/'

# Hyperparameters
n_samples = None # optional per-class cap, used to sub-sample over-represented classes

Recall how many classes were present during the training session. It is crucial to retrieve the list of possible classification targets from **the model training session**, because the training data contain every category the model has been exposed to, and the class labels must be reconstructed in exactly the same way. A testing dataset, by contrast, may be missing one or more of these categories.
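
The small example below illustrates why the label list matters. It assumes that class indices follow the position of each label in the sorted label list; `label_to_index` is a hypothetical helper, not a deepometry function.

```python
def label_to_index(labels):
    """Map each label to its position in the sorted list of unique labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


training_labels = ["class_side", "class_smoothdisc", "class_smoothsphere"]
test_labels = ["class_side", "class_smoothsphere"]  # "class_smoothdisc" is absent

from_training = label_to_index(training_labels)
from_test = label_to_index(test_labels)

# The same class receives a different index when the mapping is rebuilt from the test set alone.
print(from_training["class_smoothsphere"])  # 2
print(from_test["class_smoothsphere"])      # 1
```

Under that assumption, a mapping rebuilt from an incomplete test set would silently assign predictions to the wrong class names.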

In [None]:
# Directory of the parsed data that was used for model training
input_dir_for_model_training = 'D:/Works/non_GSK/Deepometry/Data/APPROACH_Master_EDSRC/STEP1_Parsing/Test'

import glob, os, re
import itertools

# Collect the lower-cased name of every sub-directory below the training input
all_subdirs = [x[0] for x in os.walk(input_dir_for_model_training)]
list1 = sorted(set(os.path.basename(i.lower()) for i in all_subdirs[1:]))
# Group names by their first token (the text before the first space or separator)
keyf = lambda text: re.split(r'\s|(?<!\d)[,._-]|[,._-](?!\d)', text)[0]
sorted([sorted(list(items)) for gr, items in itertools.groupby(list1, key=keyf)])
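
To see what the grouping in the cell above produces, here is the same splitting rule applied to a few made-up directory names. The names are hypothetical; only the regular expression is taken from the cell.

```python
import itertools
import re


def first_token(text):
    """Return the text before the first whitespace or separator character."""
    return re.split(r'\s|(?<!\d)[,._-]|[,._-](?!\d)', text)[0]


# groupby only merges neighbouring items, so the names are sorted first.
names = sorted(['day 1', 'class_side', 'sample a', 'class_smoothdisc', 'day 2'])
groups = [sorted(items) for _, items in itertools.groupby(names, key=first_token)]
print(groups)
# [['class_side', 'class_smoothdisc'], ['day 1', 'day 2'], ['sample a']]
```

The group whose names start with `class` is the one to copy into `labels_of_interest`.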

In [None]:
# Copy a list from the above output
labels_of_interest = [
    'class_crenateddisc_',
    'class_crenateddiscoid',
    'class_crenatedsphere',
    'class_crenatedspheroid',
    'class_side',
    'class_smoothdisc',
    'class_smoothsphere'
]

# Executable

In [None]:
%matplotlib inline

import keras
import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import sklearn.metrics
import tensorflow

import deepometry.model
import deepometry.utils

In [None]:
# build a session that allocates GPU memory on demand
configuration = tensorflow.ConfigProto()
configuration.gpu_options.allow_growth = True
# uncomment to restrict the session to a specific GPU
# configuration.gpu_options.visible_device_list = "0"
session = tensorflow.Session(config=configuration)

# apply session
keras.backend.set_session(session)

In [None]:
pathnames_of_interest = deepometry.utils.collect_pathnames(input_dir, labels_of_interest, n_samples=n_samples)

In [None]:
x, y, _ = deepometry.utils._load(pathnames_of_interest, labels_of_interest)

units = len(set(labels_of_interest))

# Classification test

The evaluation and target data (`x` and `y`, respectively) are next passed to the model. **A previously trained model is required.** The `predict` call below takes `modellocation`, the directory that holds the trained model weights. See the `fit` notebook for instructions on training a model.

Evaluation data is provided to the model in batches of 32 samples. Use `batch_size` to configure the number of samples per batch. A smaller `batch_size` requires less memory.
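
Batching can be pictured as slicing the data into consecutive chunks. The sketch below uses plain NumPy; `iter_batches` and the image shape are hypothetical and only illustrate the idea, since Keras handles batching internally.

```python
import numpy


def iter_batches(x, batch_size=32):
    """Yield consecutive slices of `x` holding at most `batch_size` samples each."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size]


x_demo = numpy.zeros((70, 48, 48, 3))  # 70 hypothetical images
print([len(batch) for batch in iter_batches(x_demo, batch_size=32)])  # [32, 32, 6]
```

Only one batch has to be held in memory for processing at a time, which is why a smaller `batch_size` lowers the memory requirement.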

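The `evaluate` function outputs the model's loss and accuracy metrics as the array `[loss, accuracy]`. The cell below instead calls `predict` and applies `numpy.argmax` to each output row, which yields one predicted class index per sample for the confusion matrix and the classification report.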
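
The relationship between per-class scores, predicted class indices, and accuracy can be shown with a few made-up numbers. The arrays below are hypothetical and are not output from a trained model.

```python
import numpy

# Hypothetical model output: one row per sample, one column per class.
probabilities = numpy.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
])
expected_demo = numpy.array([1, 0, 1])

predicted_demo = numpy.argmax(probabilities, axis=-1)  # index of the largest score per row
accuracy = float(numpy.mean(predicted_demo == expected_demo))  # fraction of correct predictions

print(predicted_demo.tolist())  # [1, 0, 2]
print(accuracy)
```

Two of the three samples are classified correctly here, so the accuracy is two thirds.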

In [None]:
model = deepometry.model.Model(shape=x.shape[1:], units=units)

model.compile()

predicted = model.predict(x, modellocation, batch_size=32, verbose=1)

predicted = numpy.argmax(predicted, -1)

In [None]:
expected = y

confusion = sklearn.metrics.confusion_matrix(expected, predicted)

# Normalize values in confusion matrix
confusion = confusion.astype('float') / confusion.sum(axis=1)[:, numpy.newaxis]
confusion = pandas.DataFrame(confusion)
confusion = confusion.rename(index={index: label for index, label in enumerate(labels_of_interest)}, columns={index: label for index, label in enumerate(labels_of_interest)})

# Plot confusion matrix
fig, _ = plt.subplots()
fig.set_size_inches(10, 10)
plt.imshow(confusion, interpolation='nearest', cmap=plt.cm.Blues)
plt.colorbar()
plt.xticks(numpy.arange(len(labels_of_interest)), labels_of_interest, rotation=45)
plt.yticks(numpy.arange(len(labels_of_interest)), labels_of_interest)

fmt = '.2f'
# Use white text on cells darker than half of the largest value in the matrix
thresh = confusion.values.max() / 2.
for i, j in itertools.product(range(confusion.shape[0]), range(confusion.shape[1])):
    plt.text(j, i, format(confusion.iloc[i, j], fmt),
             horizontalalignment="center",
             color="white" if confusion.iloc[i, j] > thresh else "black")

plt.ylabel('True label')
plt.xlabel('Predicted label')

# Font size applies to figures drawn after this point
plt.rcParams.update({'font.size': 15})
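
The normalization step in the cell above divides every row of the confusion matrix by its row sum, so each row shows the fraction of a true class assigned to each predicted class. If a class has no samples in the test set, its row sum is zero and the division produces NaN values. A minimal sketch of a guarded version follows; `normalize_rows` is a hypothetical helper, not part of deepometry or scikit-learn.

```python
import numpy


def normalize_rows(counts):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    counts = numpy.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    return numpy.divide(counts, row_sums, out=numpy.zeros_like(counts), where=row_sums != 0)


counts_demo = [[8, 2, 0],
               [0, 0, 0],   # a class with no test samples
               [1, 1, 2]]
print(normalize_rows(counts_demo))
```

For the matrix to contain a row for an absent class in the first place, `sklearn.metrics.confusion_matrix` can be given the full list of class indices through its `labels` argument.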

In [None]:
report = pandas.DataFrame(sklearn.metrics.classification_report(expected, predicted, output_dict=True)).transpose()
report.index = labels_of_interest + ['accuracy', 'macro avg', 'weighted avg']
report
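
The precision, recall, and F1 columns of the report can be related back to the (unnormalized) confusion matrix. The sketch below works through the arithmetic on a made-up two-class matrix; the counts are hypothetical.

```python
import numpy

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes.
counts = numpy.array([[8, 2],
                      [1, 9]], dtype=float)

true_positives = numpy.diag(counts)
precision = true_positives / counts.sum(axis=0)  # correct / all samples predicted as the class
recall = true_positives / counts.sum(axis=1)     # correct / all samples truly in the class
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Recall per class equals the diagonal of the row-normalized confusion matrix plotted earlier.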