# ImageCLEF 2018 concept detector - logistic regression

This notebook performs multi-label classification of biomedical concepts with logistic regression. The feature sets, built separately, are loaded from HDF5 files.

You may read more about this approach in our working notes:

> Eduardo Pinho and Carlos Costa. _Feature Learning with Adversarial Networks for Concept Detection in Medical Images: UA.PT Bioinformatics at ImageCLEF 2018_, CLEF working notes, CEUR, 2018.

#### Instructions for use

1. Run the preamble cells below.

2. Pick one of the available feature representations, then run the cells that load its data set and create its training bundle.

3. Choose the number of epochs to train for and run the respective cell.

4. View the results with the cell that follows. Return to step 3 at will to keep training.

5. When done, write the test set predictions to a file with the cell after that.

#### HDF5 data format

All feature files must contain these two datasets:

- `/data`: (N, D), 32-bit floats containing the feature vectors
- `/id`: (N,), variable-length UTF-8 strings containing the image IDs (the file names without the extension)
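
As an illustration of this layout, here is a minimal sketch that writes and reads back a toy feature file. It assumes the `h5py` package is installed; the helper names `write_features` and `read_features` and the toy IDs are hypothetical and are not part of this notebook's `util` module.

```python
import os
import tempfile

import h5py
import numpy as np


def write_features(path, ids, features):
    """Write feature vectors and image IDs using the /data and /id layout."""
    features = np.asarray(features, dtype=np.float32)
    assert features.ndim == 2 and features.shape[0] == len(ids)
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=features)
        # variable-length UTF-8 strings, one per feature vector
        f.create_dataset("id", data=list(ids),
                         dtype=h5py.string_dtype(encoding="utf-8"))


def read_features(path):
    """Read back (ids, features); IDs are decoded to str if stored as bytes."""
    with h5py.File(path, "r") as f:
        data = f["data"][:]
        ids = [x.decode("utf-8") if isinstance(x, bytes) else x
               for x in f["id"][:]]
    return ids, data


# toy example with N=2 images and D=3 features (hypothetical values)
toy_path = os.path.join(tempfile.mkdtemp(), "toy-features.h5")
write_features(toy_path, ["img_001", "img_002"],
               [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
toy_ids, toy_data = read_features(toy_path)
```

The order of `/id` must match the row order of `/data`, since the two datasets are paired by position.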

In [None]:
import json
import random
import time
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from util import *
from lin import *
%matplotlib inline

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)

## Read concept list (in frequency order)

The following cell creates a list of concepts and their counts in descending order of frequency. This lets us focus on the most frequent concepts, whose labels are better balanced (most concepts are very sparse).

In [None]:
with open("./vocabulary.csv", encoding="utf-8") as file:
    CONCEPT_LIST = []
    CONCEPT_COUNT = []
    for x in file:
        parts = x.strip().split('\t')
        CONCEPT_LIST.append(parts[0])
        CONCEPT_COUNT.append(int(parts[1]))
        
    CONCEPT_COUNT = np.array(CONCEPT_COUNT)
CONCEPT_MAP = {cname: v for (v, cname) in enumerate(CONCEPT_LIST)}
print("Number of concepts:", len(CONCEPT_MAP))

### Read ground truth

Please **add the ground truth file** (`ConceptDetectionTraining2018-Concepts.csv`) to this directory, or modify the file path below.

In [None]:
labels_all = build_labels('./ConceptDetectionTraining2018-Concepts.csv', CONCEPT_MAP)

### Label statistics

The constants below are specific to the ImageCLEF 2018 caption task.

In [None]:
N_SAMPLES = 223859
N_TESTING_SAMPLES = 9938
# number of items that have no concepts assigned in the ground truth
N_UNLABELED_SAMPLES = N_SAMPLES - len(labels_all)
print("{} items in full data set without labels ({:.4}% of set)".format(
    N_UNLABELED_SAMPLES, N_UNLABELED_SAMPLES * 100.0 / N_SAMPLES))
N_AVERAGE_LABELS = np.mean([len(c) for c in labels_all.values()])
print("Each labeled item contains {} labels on average".format(N_AVERAGE_LABELS))

## Make train-val split partition

To obtain feedback on the training process, the training set is split into two parts. Here, 10% of the data set is held out as a validation set for tuning the classifiers.

In [None]:
N_VALIDATION = N_SAMPLES // 10
N_TRAINING_SAMPLES = N_SAMPLES - N_VALIDATION
RANDOM_SEED = 63359405

print("Using {} validation samples (out of {})".format(N_VALIDATION, N_SAMPLES))

random.seed(RANDOM_SEED)
all_indices = list(range(N_SAMPLES))
val_indices = random.sample(all_indices, k=N_VALIDATION)
train_indices = np.delete(all_indices, val_indices)
assert len(train_indices) + len(val_indices) == N_SAMPLES
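
The split above relies on `np.delete`, which removes entries by position. This is correct here only because `all_indices` is the identity range, so each position equals its value. A minimal sketch of the same idea on a small set follows; `make_split` is a hypothetical helper, and it uses a local `random.Random` generator rather than the global seed, which is a design choice of this sketch and not what the notebook does.

```python
import random

import numpy as np


def make_split(n_samples, n_validation, seed):
    """Return (train_indices, val_indices) as disjoint index arrays."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    val = rng.sample(range(n_samples), k=n_validation)
    # np.delete drops *positions*; in arange(n) every position equals its value,
    # so this removes exactly the validation indices
    train = np.delete(np.arange(n_samples), val)
    return train, np.array(val)


train_a, val_a = make_split(20, 2, seed=63359405)
train_b, val_b = make_split(20, 2, seed=63359405)  # same seed, same split
```

Fixing the seed keeps the validation set identical across runs and across feature kinds, so their scores remain comparable.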

# Evaluation with Logistic Regression

The following constants may be adjusted to select which concepts to classify, starting from the most frequent ones.

In [None]:
N_TRAIN = 500       # number of concepts to classify, in frequency order
N_TRAIN_OFFSET = 0  # number of most frequent concepts to skip first

# -------------- AUTOMATICALLY CALCULATED, DO NOT MODIFY --------------
CONCEPTS_TO_TRAIN = CONCEPT_LIST[N_TRAIN_OFFSET:N_TRAIN_OFFSET + N_TRAIN]
# calculate the probability of each concept (based on its frequency in the training set)
CONCEPTS_PROB = CONCEPT_COUNT[N_TRAIN_OFFSET:N_TRAIN_OFFSET + N_TRAIN] / N_SAMPLES
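
To make the effect of the two constants concrete, here is a small sketch on a toy vocabulary. All names and counts are hypothetical; only the slicing mirrors the cell above.

```python
import numpy as np

# hypothetical vocabulary, already sorted by descending frequency
toy_concepts = ["C1", "C2", "C3", "C4", "C5"]
toy_counts = np.array([50, 30, 10, 6, 4])
toy_n_samples = 100


def select_concepts(offset, n):
    """Pick n concepts after skipping the `offset` most frequent ones."""
    names = toy_concepts[offset:offset + n]
    # prior probability of each selected concept, from its frequency
    prob = toy_counts[offset:offset + n] / toy_n_samples
    return names, prob


names, prob = select_concepts(offset=1, n=3)
print(names)  # ['C2', 'C3', 'C4']
```

With `offset=1` and `n=3`, the most frequent concept is skipped and the next three are kept, with priors 0.3, 0.1 and 0.06.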

### Operating point thresholds

Choose a list of operating point thresholds to consider in the fine-tuning process. A threshold of 0.5 maximizes accuracy, but is not very useful in this context, since the concepts are very sparse and infrequent. On the other hand, excessively low thresholds will yield too many concepts, decreasing precision. By defining multiple thresholds, we are searching for the one that will maximize the $F_1$ score.

In [None]:
thresholds = [0.06, 0.0625, 0.07, 0.075, 0.08, 0.1, 0.125, 0.15, 0.175]
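
The threshold search can be sketched as follows. This is an illustration only: it assumes that $F_1$ is computed per image and then averaged, and that an image with no predicted and no true concepts scores 0. The evaluation actually performed inside the `lin` module is not shown in this notebook and may differ. The function names and toy values are hypothetical.

```python
import numpy as np


def mean_sample_f1(y_true, y_prob, threshold):
    """Mean of per-sample F1 scores after thresholding the probabilities."""
    y_pred = y_prob >= threshold
    y_true = y_true.astype(bool)
    tp = np.sum(y_pred & y_true, axis=1)
    denom = np.sum(y_pred, axis=1) + np.sum(y_true, axis=1)
    # F1 = 2*TP / (|predicted| + |true|); taken as 0 when both sets are empty
    f1 = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, score) for the candidate with the highest mean F1."""
    scores = [mean_sample_f1(y_true, y_prob, t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]


# toy data: 2 samples, 4 concepts
toy_true = np.array([[1, 0, 0, 0],
                     [1, 1, 0, 0]])
toy_prob = np.array([[0.30, 0.08, 0.02, 0.01],
                     [0.20, 0.12, 0.07, 0.01]])
toy_best = best_threshold(toy_true, toy_prob, [0.06, 0.1, 0.5])
```

In this toy case a threshold of 0.5 predicts nothing at all, 0.06 admits false positives, and 0.1 recovers the labels exactly.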

## Bags of Colors

The following code uses features based on an implementation of bags of colors. Please see [this repository](https://github.com/Enet4/bag-of-colors-nb) for the implementation. It was only written after the 2018 challenge.

The following cell loads the training set, splits it, and loads the testing set. Please make sure that you have both the training and testing feature files. If the files have different names, change the paths below.

In [None]:
boc_dset = Datasets.from_h5_files_partition(
    './bocs-256-train.h5',
    train_indices,
    './bocs-256-test.h5',
    labels_all,
    CONCEPTS_TO_TRAIN,
    offset=N_TRAIN_OFFSET,
    normalizer_fn=max_normalize)

The following code creates a logistic regression model and its estimator.

In [None]:
model_fn = build_model_fn(
    n_classes=N_TRAIN,
    x_shape=[boc_dset.train_x.shape[1]],
    learning_rate=0.05,
    thresholds=thresholds
)
boc_estimator = tf.estimator.Estimator(model_fn=model_fn, config=get_config('boc'))
boc_bundle = TrainBundle()
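
For readers unfamiliar with the model, the sketch below shows what multi-label logistic regression computes: a linear layer followed by one independent sigmoid per concept, trained on binary cross-entropy. It is written in plain NumPy with ordinary gradient descent and is not the code behind `build_model_fn`, whose implementation and optimizer live in the `lin` module and are not shown here. The function names and toy data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logreg(x, y, learning_rate=0.05, epochs=200):
    """Fit one independent logistic regressor per concept (column of y)."""
    n, d = x.shape
    k = y.shape[1]
    w = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        p = sigmoid(x @ w + b)
        # gradient w.r.t. the logits of the binary cross-entropy,
        # summed over concepts and averaged over samples
        grad = (p - y) / n
        w -= learning_rate * (x.T @ grad)
        b -= learning_rate * grad.sum(axis=0)
    return w, b


# toy data: concept 0 is present iff feature 0 is set, likewise for concept 1
toy_x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
toy_y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
toy_w, toy_b = train_logreg(toy_x, toy_y, learning_rate=0.5, epochs=500)
toy_pred = (sigmoid(toy_x @ toy_w + toy_b) >= 0.5).astype(int)
```

Because each concept gets its own sigmoid, a sample may be assigned any number of concepts, which is what distinguishes this from softmax classification.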

In [None]:
train_and_eval_boc = build_train_and_eval_function(
    boc_estimator, boc_bundle, boc_dset, thresholds, CONCEPTS_TO_TRAIN)

The next cell performs the actual training, evaluation, and test predictions. Its argument is the number of epochs to train for. The cell can be run multiple times, so consider passing a small number of epochs and re-running it to see intermediate results sooner.

In [None]:
boc_f1, boc_test_predictions = train_and_eval_boc(10)

The following cell shows the progression of $F_1$ scores with training.

In [None]:
show_eval(boc_bundle, thresholds, name="boc")
print("Best F1:", boc_f1)

Finally, the submission file can be built with the following cell.

In [None]:
# write predictions to file
print_predictions(boc_test_predictions, boc_bundle.all_metrics, key="lin-boc-{}-o{}".format(N_TRAIN, N_TRAIN_OFFSET))

The same pipeline is repeated below for other kinds of visual features.

## Adversarial Auto-Encoder

Please see [imageclef-aae](https://github.com/bioinformatics-ua/imageclef-toolkit/tree/master/caption/imageclef-aae) to train an adversarial auto-encoder.

In [None]:
aae_dset = Datasets.from_pair_files_partition(
    './aae-features-train.h5',
    './aae-list-train.txt',
    train_indices,
    './aae-features-test.h5',
    './aae-list-test.txt',
    labels_all,
    CONCEPTS_TO_TRAIN,
    offset=N_TRAIN_OFFSET
)

In [None]:
model_fn = build_model_fn(
    n_classes=N_TRAIN,
    x_shape=[aae_dset.train_x.shape[1]],
    learning_rate=0.05,
    thresholds=thresholds
)

In [None]:
aae_estimator = tf.estimator.Estimator(model_fn=model_fn, config=get_config('aae'))
aae_bundle = TrainBundle()

In [None]:
train_and_eval_aae = build_train_and_eval_function(
    aae_estimator, aae_bundle, aae_dset, thresholds, CONCEPTS_TO_TRAIN)

In [None]:
aae_f1, aae_test_predictions = train_and_eval_aae(5)

In [None]:
show_eval(aae_bundle, thresholds, name="aae")
print("Best F1:", aae_f1)

In [None]:
# write predictions to file
print_predictions(aae_test_predictions, aae_bundle.all_metrics, key="aae-{}-o{}".format(N_TRAIN, N_TRAIN_OFFSET))

## Flipped-Adversarial Auto-Encoder

Please see [imageclef-aae](https://github.com/bioinformatics-ua/imageclef-toolkit/tree/master/caption/imageclef-aae) to train a flipped-adversarial auto-encoder.

In [None]:
faae_dset = Datasets.from_h5_files_partition(
    './faae-features-train.h5',
    train_indices,
    './faae-features-test.h5',
    labels_all,
    CONCEPTS_TO_TRAIN,
    offset=N_TRAIN_OFFSET
)

In [None]:
model_fn = build_model_fn(
    n_classes=N_TRAIN,
    x_shape=[faae_dset.train_x.shape[1]],
    learning_rate=0.05,
    thresholds=thresholds
)

In [None]:
faae_estimator = tf.estimator.Estimator(model_fn=model_fn, config=get_config('faae'))
faae_bundle = TrainBundle()

In [None]:
train_and_eval_faae = build_train_and_eval_function(
    faae_estimator, faae_bundle, faae_dset, thresholds, CONCEPTS_TO_TRAIN)

In [None]:
faae_f1, faae_test_predictions = train_and_eval_faae(5)

In [None]:
show_eval(faae_bundle, thresholds, name="faae")
print("Best F1:", faae_f1)

In [None]:
# write predictions to file
print_predictions(faae_test_predictions, faae_bundle.all_metrics, key="faae-{}-o{}".format(N_TRAIN, N_TRAIN_OFFSET))