# Evaluation

After training a model to classify single-cell images, it is often useful to evaluate its performance on an annotated dataset that the model has not seen. This estimates how well the model will generalize to new data.

Suppose we have the following directory structure. Data from this experiment was not shown to the model during training. Images are saved as NPY files with patient prefixes:

    /data/parsed/
        Experiment 003/
            Day 1/
                Sample A/
                    Replicate 1/
                        Class B/
                            B__3618e715e62a229aa78a7e373b49b888.npy
                            B__3cf53cea7f4db1cfd101e06c366c9868.npy
                            B__84949e1eba7802b00d4a1755fa9af15e.npy
                            B__852a1edbf5729fe8721e9e5404a8ad20.npy
        ...

**Reload the same data structure that was used during model training**, then use `deepometry.utils._load` to load the parsed data and their corresponding labels. To limit the number of samples per class, for example to 256, pass `n_samples=256` to `deepometry.utils.collect_pathnames`.
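
To illustrate what a per-class limit does, here is a minimal sketch of sub-sampling. The helper `subsample_per_class` and the toy pathnames are hypothetical and are not part of deepometry; the library's own sampling may differ.

```python
import random


def subsample_per_class(pathnames_by_label, n_samples, seed=0):
    """Keep at most n_samples pathnames per class (hypothetical helper)."""
    rng = random.Random(seed)
    limited = {}
    for label, pathnames in pathnames_by_label.items():
        if n_samples is None or len(pathnames) <= n_samples:
            # Nothing to trim: keep every pathname for this class.
            limited[label] = list(pathnames)
        else:
            # Draw n_samples pathnames without replacement.
            limited[label] = rng.sample(pathnames, n_samples)
    return limited


# Made-up pathnames: class A is over-represented relative to class B.
pathnames_by_label = {
    "Class A": ["A__{:03d}.npy".format(i) for i in range(1000)],
    "Class B": ["B__{:03d}.npy".format(i) for i in range(100)],
}

limited = subsample_per_class(pathnames_by_label, n_samples=256)
print({label: len(paths) for label, paths in limited.items()})
```

In this sketch, classes that already have fewer samples than the limit are left unchanged, so only over-represented classes are trimmed.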

# User's settings

In [None]:
input_dir = '/Data/STEP1_Parsing'
output_dir = '/Data/STEP3_Evaluation'
class_option = 'class'

# Recall how many classes there were during the training session
input_dir_for_model_training = '/Data/STEP1_Parsing'

# Hyperparameters
n_samples = None  # sub-sampling for over-represented classes

# Executable

In [None]:
%matplotlib inline

import glob
import os.path

import keras
import matplotlib.pyplot
import numpy
import pandas
import seaborn
import sklearn.metrics
import tensorflow

import deepometry.model
import deepometry.utils

In [None]:
# build a session that allocates GPU memory on demand
configuration = tensorflow.ConfigProto()
configuration.gpu_options.allow_growth = True
# uncomment to restrict the session to a specific GPU, e.g. device 0
# configuration.gpu_options.visible_device_list = "0"
session = tensorflow.Session(config=configuration)

# apply session
keras.backend.set_session(session)

In [None]:
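# Collect the name of every directory in the training data structure
all_subdirs = [x[0] for x in os.walk(input_dir_for_model_training)]
possible_labels = sorted(set(os.path.basename(i) for i in all_subdirs))
# Keep the directory names that contain class_option (case-insensitive)
labels_of_interest = [i for i in possible_labels if class_option.lower() in i.lower()]

# Gather the pathnames for each label, honoring the n_samples setting
pathnames_of_interest = deepometry.utils.collect_pathnames(input_dir, labels_of_interest, n_samples=n_samples)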
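
The label discovery in the cell above can be tried on a throwaway directory tree. This is a minimal sketch using only the standard library; the directory names are illustrative and mimic the layout shown earlier.

```python
import os
import tempfile

class_option = "class"

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "parsed")

    # Build a toy tree with two class directories at the deepest level.
    for leaf in ("Class A", "Class B"):
        os.makedirs(os.path.join(root, "Experiment 003", "Day 1", "Sample A", "Replicate 1", leaf))

    # Collect the base name of every directory under root, including root itself.
    all_subdirs = [dirpath for dirpath, _, _ in os.walk(root)]
    possible_labels = sorted({os.path.basename(path) for path in all_subdirs})

    # Keep only the names that contain class_option, ignoring case.
    labels_of_interest = [name for name in possible_labels if class_option.lower() in name.lower()]

print(labels_of_interest)
```

Because the filter is a substring match, any directory whose name contains `class_option` is treated as a label, so it helps to choose a value that appears only in the class directories.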

In [None]:
x, y, _ = deepometry.utils._load(pathnames_of_interest, labels_of_interest)

# Number of output units, one per distinct label
units = len(set(labels_of_interest))

# Classification test

The evaluation and target data (`x` and `y`, respectively) are next passed to the model. **A previously trained model is required**, because the trained model weights are loaded at this step. See the `fit` notebook for instructions on training a model.

In the cell below, evaluation data is provided to the model in batches of 32 samples. Use `batch_size` to change the number of samples per batch. A smaller `batch_size` requires less memory.
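
As a rough illustration of why `batch_size` affects memory but not the predictions, the sketch below slices the data into batches and stacks the per-batch outputs. Both `predict_in_batches` and `toy_predict` are hypothetical stand-ins, not deepometry or Keras functions.

```python
import numpy


def predict_in_batches(predict_fn, x, batch_size=32):
    """Apply predict_fn to x in slices of batch_size rows and stack the results."""
    outputs = []
    for start in range(0, len(x), batch_size):
        # Only this slice needs to be processed at any one time.
        batch = x[start:start + batch_size]
        outputs.append(predict_fn(batch))
    return numpy.concatenate(outputs, axis=0)


def toy_predict(batch):
    # Stand-in for a model: returns two made-up scores per sample.
    means = batch.reshape(len(batch), -1).mean(axis=1)
    return numpy.stack([means, 1.0 - means], axis=1)


# Random data shaped like 100 small three-channel images.
x = numpy.random.RandomState(0).rand(100, 8, 8, 3)

small = predict_in_batches(toy_predict, x, batch_size=8)
large = predict_in_batches(toy_predict, x, batch_size=64)

print(small.shape)
print(numpy.allclose(small, large))
```

Each sample is scored independently here, so the two batch sizes give matching outputs; only the amount of data handled at once changes.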

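The `evaluate` function outputs the model's loss and accuracy metrics as the array `[loss, accuracy]`. The cell below instead calls `predict` and converts its output to predicted class indices with `numpy.argmax`, so that the predictions can be compared with the expected labels in `y`.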

In [None]:
model = deepometry.model.Model(shape=x.shape[1:], units=units)

model.compile()

predicted = model.predict(x, output_dir, batch_size=32)

predicted = numpy.argmax(predicted, -1)
expected = y
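
The `numpy.argmax` step can be seen on a small example. The scores and label names below are made up, and the sketch assumes the model returns one score per class for each sample.

```python
import numpy

# Made-up scores for 4 samples and 3 classes; each row sums to 1.
scores = numpy.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = ["Class A", "Class B", "Class C"]  # illustrative names

# Pick the index of the highest score along the last axis (one index per sample).
predicted = numpy.argmax(scores, axis=-1)

print(predicted)  # [0 1 2 0]
print([labels[i] for i in predicted])
```

The resulting indices follow the order of the label list, which is why the same list is used later to name the rows and columns of the confusion matrix.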

In [None]:
confusion = sklearn.metrics.confusion_matrix(expected, predicted)

# Normalize each row so that it sums to 1 (fraction of each expected class)
confusion = confusion.astype('float') / confusion.sum(axis=1)[:, numpy.newaxis]

# Name the rows (expected) and columns (predicted) after the class labels
confusion = pandas.DataFrame(confusion)
label_names = dict(enumerate(labels_of_interest))
confusion = confusion.rename(index=label_names, columns=label_names)
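
The row normalization can be checked by hand on a tiny example. The sketch below builds the count matrix with NumPy only and uses made-up labels. It also guards against a class with no expected samples, for which a plain division would produce NaN; that guard is an addition for illustration and is not part of the cell above.

```python
import numpy

# Made-up labels: classes 0 and 1 have four samples each, class 2 has none.
expected = numpy.array([0, 0, 0, 0, 1, 1, 1, 1])
predicted = numpy.array([0, 0, 0, 1, 1, 1, 0, 0])
n_classes = 3

# counts[i, j] is the number of samples of expected class i predicted as class j.
counts = numpy.zeros((n_classes, n_classes), dtype=int)
numpy.add.at(counts, (expected, predicted), 1)

# Divide each row by its total, leaving rows with no samples as zeros.
row_totals = counts.sum(axis=1, keepdims=True)
normalized = numpy.divide(
    counts,
    row_totals,
    out=numpy.zeros(counts.shape, dtype=float),
    where=row_totals > 0,
)

print(counts)
print(normalized)
```

After normalization, each diagonal entry is the fraction of a class that was predicted correctly, which makes classes of different sizes comparable in the heatmap.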

In [None]:
matplotlib.pyplot.figure(figsize=(12, 8))

seaborn.heatmap(confusion, annot=True)

In [None]:
sklearn.metrics.accuracy_score(expected, predicted)
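
Overall accuracy is the fraction of samples whose predicted label matches the expected label, and it can hide poor performance on a rare class. The sketch below uses made-up labels to compare overall accuracy with the per-class fraction of correct predictions; the variable names are illustrative.

```python
import numpy

# Made-up labels: 90 samples of class 0 and 10 samples of class 1.
expected = numpy.array([0] * 90 + [1] * 10)
# All of class 0 is predicted correctly, but only 2 of the 10 class 1 samples are.
predicted = numpy.array([0] * 90 + [0] * 8 + [1] * 2)

# Fraction of all samples predicted correctly.
accuracy = float(numpy.mean(expected == predicted))

# Fraction of each expected class predicted correctly.
per_class = {}
for label in numpy.unique(expected):
    mask = expected == label
    per_class[int(label)] = float(numpy.mean(predicted[mask] == label))

print(accuracy)
print(per_class)
```

Here the overall accuracy is high even though most samples of the smaller class are misclassified, which is one reason to read the accuracy score together with the confusion matrix.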