# CLAIRE-COVID19 universal pipeline

The CLAIRE-COVID19 universal pipeline has been designed to compare different training algorithms for the AI-assisted diagnosis of COVID-19, in order to define a baseline for such techniques and to allow the community to quantitatively measure AI’s progress in this field. For more information, please read [this post](https://streamflow.di.unito.it/2021/04/12/ai-assisted-covid-19-diagnosis-with-the-claire-universal-pipeline/).

This notebook shows how the classification portion of the pipeline can be described and documented in Jupyter, using *Jupyter-workflow* to execute it at scale on an HPC facility.

The initial preprocessing portion of the pipeline is left outside the notebook. It could be included, but since this example is meant as an introductory demonstration, we kept it as simple as possible.

Let's start with a simple import of all the required Python packages.

In [1]:
import argparse
import collections
import math
import os
import sys

%matplotlib notebook
import matplotlib.pyplot as plt

import numpy as np
import torch
import torch.hub

from tqdm import tqdm

import nnframework.data_loader as module_dataloader
import nnframework.graphs.models as module_model
import nnframework.graphs.losses as module_loss
import nnframework.graphs.metrics as module_metric

from metrics.results import Results

from nnframework.parse_config import ConfigParser
from nnframework.trainers import Trainer

Now we need to set up the experiment's parameters. As an example, we decided to run a grid search on a pre-trained `DenseNet-121`, fine-tuning it on a pre-processed and filtered version of the [BIMCV-COVID19+](https://bimcv.cipf.es/bimcv-projects/bimcv-covid19/) dataset (Iteration 1) with different combinations of learning rate (LR), LR decay step and weight decay. Moreover, to make the accuracy metrics more robust, we also perform a 5-fold cross-validation for each training configuration.
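As a minimal sketch of what the LR decay step controls, the snippet below reproduces the step decay rule that PyTorch's `StepLR` scheduler applies: the learning rate is multiplied by `gamma` once every `step_size` epochs. The helper name `step_lr` is hypothetical and only serves this illustration. The value `gamma=0.1` mirrors the scheduler configuration used later in this notebook.

```python
def step_lr(initial_lr: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Learning rate in effect at `epoch` (0-based) under a step decay rule.

    Hypothetical helper: it mirrors the closed form used by StepLR,
    initial_lr * gamma ** (epoch // step_size).
    """
    return initial_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # Show the schedule for one of the grid-search settings (LR 0.001, step 10).
    for epoch in (0, 9, 10, 19, 20):
        print(epoch, step_lr(0.001, epoch, step_size=10))
```

Note that with `epochs = 1`, as set in the parameter cell of this notebook, the decay step is never reached, so the different `lr_step_sizes` values only matter for longer training runs.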

Since this is only a portion of the CLAIRE-COVID19 universal pipeline, there are some manual preliminary steps to be done in order to successfully run this notebook.

First of all, you need to produce a pre-processed and filtered version of the BIMCV-COVID19+ dataset (Iteration 1). Instructions on how to do this can be found [here](https://github.com/CLAIRE-COVID/AI-Covid19-preprocessing) and [here](https://github.com/CLAIRE-COVID/AI-Covid19-pipelines).

This procedure will produce three different versions of the dataset. For this experiment, we need the `final3_BB` folder, which can either be manually pre-transferred to the remote HPC facility (as we did) or left on the local machine, letting *Jupyter-workflow* automatically perform the data transfer when needed.

In the first case, you simply have to set the `dataset_path` variable to the correct remote path. In the second case, you have to put your local dataset path in the `dataset_path` variable and then explicitly include it in the remote step dependencies as

```json
{
  "type": "file",
  "valueFrom": "dataset_path"
}
```

What has been discussed so far for `dataset_path` also applies to `labels_path`, which must contain the path to the `labels_covid19_posi.tsv` file produced by the preprocessing portion of the pipeline.

Next, if you plan to run on an air-gapped architecture (as HPC facilities normally are), you will need to manually download all the pre-trained neural network models and place them into a `weights` folder, in a portion of the file-system shared among all the worker nodes in the data centre. This can be done by running [this script](https://raw.githubusercontent.com/CLAIRE-COVID/AI-Covid19-benchmarking/master/nnframework/init_models.py). We could also have included this script as a notebook cell, but again we chose not to, in order to keep the notebook as linear as possible.

Again, you have two different ways to manage this task. You can either run the script directly on the remote machine and put the path of the `weights` folder in the `weights_path` variable, or run the script locally and let *Jupyter-workflow* automatically manage the data transfer for you, explicitly defining `weights_path` as a file variable as discussed before.

Finally, you need to transfer the dependencies (i.e., PyTorch and the CLAIRE-COVID19 benchmarking [code](https://github.com/CLAIRE-COVID/AI-Covid19-benchmarking)) to the remote facility. To enhance portability, we used a Singularity container with everything inside, but you have to manually transfer it to the remote HPC facility, in a shared portion of the file-system, and change the `environment/cineca-marconi100/slurm_template.jinja2` file accordingly. We plan to make this last transfer automatically managed by *Jupyter-workflow* in the very next releases.

If you plan to run on an x86_64 architecture, creating the Singularity image is as easy as transforming the Docker image for this experiment into a Singularity image. Nevertheless, the [MARCONI 100](https://www.hpc.cineca.it/hardware/marconi100) HPC facility comes with an IBM POWER9 architecture. Therefore, we built a Singularity container on a `ppc64le` architecture using [this script](https://raw.githubusercontent.com/alpha-unito/jupyter-workflow/master/examples/claire-covid/singularity/ppcle64/build).

In [None]:
dataset_path = '/path/to/final3_BB'
epochs = 1
gpus = 1
k_folds = list(range(5))
labels_path = '/path/to/labels_covid19_posi.tsv'
learning_rates = [0.001, 0.0001, 0.00001]
lr_step_sizes = [10, 15]
model_versions = [121]
model_type = 'DenseNetModel'
weight_decays = [0.0005, 0.00005]
weights_path = '/path/to/models/weights'

Now we use the *Jupyter-workflow* `scatter` option to iterate over the cartesian product of experiment configurations, producing a config file for each training instance.

Please note that we perform this step locally, prior to transferring the computation to the remote HPC facility. We can easily separate these two steps because we do not need to move the dataset and the model weights, and therefore we already know their remote paths.

Conversely, if the data had to be transferred, it would probably be easier to merge this code cell with the following one, producing the configuration once the data has already been transferred. Alternatively, it would be possible to rely on the `predump` and `postload` features to modify the value of the `config` variable on the remote node, but this would be much harder than simply merging two code cells.
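To make the size of the search space concrete, here is a small sketch that enumerates the same cartesian product with `itertools.product`, using the parameter values defined in the cell above. It is only an illustration of the combinatorics, not the way *Jupyter-workflow* implements `scatter`; the variable name `grid` is an assumption made for this example.

```python
import itertools

# Parameter values copied from the experiment setup cell.
model_versions = [121]
learning_rates = [0.001, 0.0001, 0.00001]
lr_step_sizes = [10, 15]
weight_decays = [0.0005, 0.00005]
k_folds = list(range(5))

# One tuple per training instance, in the same order as the nested loops
# used to build the configurations.
grid = list(itertools.product(
    model_versions, learning_rates, lr_step_sizes, weight_decays, k_folds))

print(len(grid))  # 1 * 3 * 2 * 2 * 5 = 60 training instances
print(grid[0])    # (121, 0.001, 10, 0.0005, 0)
```

With 12 hyperparameter combinations and 5 folds each, the grid search amounts to 60 independent training runs, which is why distributing them on an HPC facility pays off.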

In [None]:
configs = []
for model_version in model_versions:
    for learning_rate in learning_rates:
        for lr_step_size in lr_step_sizes:
            for weight_decay in weight_decays:
                for k_fold_idx in k_folds:
                    # The f-string already interpolates the values, so no extra .format() call is needed.
                    name = f'{model_type}{model_version}_lr{learning_rate}_step{lr_step_size}_wd{weight_decay}'
                    output_folder = f'training_{name}'
                    configs.append({
                        'name': name,
                        'n_gpu': gpus,
                        'weights_path': weights_path,
                        'arch': {
                            'type': model_type,
                            'args': {
                                'variant': model_version,
                                'num_classes': 2,
                                'print_model': True
                            }
                        },
                        'loss': 'cross_entropy_loss',
                        'metrics': [
                            'accuracy'
                        ],
                        'data_loader': {
                            'type': 'COVID_Dataset',
                            'args': {
                                'root': dataset_path,
                                'k_fold_idx': k_fold_idx,
                                'mode': 'ct',
                                'pos_neg_file': labels_path,
                                'splits': [0.7, 0.15, 0.15],
                                'replicate_channel': 1,
                                'batch_size': 64,
                                'input_size': 224,
                                'num_workers': 2,
                                'self_supervised': 0
                            }
                        },
                        'optimizer': {
                            'type': 'Adam',
                            'args': {
                                'lr': learning_rate,
                                'weight_decay': weight_decay,
                                'amsgrad': True
                            }
                        },
                        'lr_scheduler': {
                            'type': 'StepLR',
                            'args': {
                                'step_size': lr_step_size,
                                'gamma': 0.1
                            }
                        },
                        'trainer': {
                            'epochs': epochs,
                            'save_dir': output_folder,
                            'save_period': 1,
                            'verbosity': 2,
                            'monitor': 'min val_loss',
                            'early_stop': 10,
                            'tensorboard': False
                        }
                    })

Now we perform the main training cycle on the CINECA [MARCONI 100](https://www.hpc.cineca.it/hardware/marconi100) HPC facility. Details about the execution are stored in the `environment/cineca-marconi100/slurm_template.jinja2` file.

*Jupyter-workflow* will automatically transfer all the required dependencies to the login node of the HPC facility prior to running the `sbatch` command with the selected template. Nevertheless, it will not perform any automatic transfer from the login node to the worker nodes. Therefore, please configure the `workdir` directive in the cell metadata to point to a shared portion of the file-system that both the login node and the worker nodes of the data centre can access.

In order to repeat the experiment, you have to replace the `username` and `sshKey` fields in the cell metadata with your credentials. Moreover, to move the computation to a different Slurm-managed environment, you will also have to change the `hostname` field and the contents of the `slurm_template.jinja2` file.

Alternatively, *Jupyter-workflow* also supports PBS-based HPC facilities in an analogous manner. Take a look at the [quantum-espresso example](https://github.com/alpha-unito/jupyter-workflow/tree/master/examples/quantum-espresso) for more details.

In [None]:
SEED = 123
torch.manual_seed(SEED)
np.random.seed(SEED)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

results_dirs = []
for config in configs:
    config = ConfigParser(config)
    results_dirs.append(config.results_dir)
    
    logger = config.get_logger('train')
    data_loader = config.init_obj('data_loader', module_dataloader)
    
    torch.hub.set_dir(config['weights_path'])
    model = config.init_obj('arch', module_model)
    logger.info(model)
    
    if config['data_loader']['args']['self_supervised']:
        criterion = torch.nn.CrossEntropyLoss()
    else:
        criterion = torch.nn.CrossEntropyLoss(weight=data_loader.get_label_proportions().to('cuda'))
        
    metrics = [getattr(module_metric, met) for met in config['metrics']]
    
    trainable_params = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = config.init_obj('optimizer', torch.optim, trainable_params)
    lr_scheduler = config.init_obj('lr_scheduler', torch.optim.lr_scheduler, optimizer)
    
    trainer = Trainer(model.get_model(), criterion, metrics, optimizer,
                      config=config,
                      train_data_loader=data_loader.train,
                      valid_data_loader=data_loader.val,
                      test_data_loader=data_loader.test,
                      lr_scheduler=lr_scheduler)
    trainer.train()

All the metric results produced by each training step are stored in a local folder, as *Jupyter-workflow* automatically transfers them from the remote file-system. The implicit `gather` strategy of the previous cell merges the paths of these folders into a list variable named `results_dirs`. This list can now be processed to, for example, compute more sophisticated metrics and visualise them interactively with the Jupyter `matplotlib` backend.

In [None]:
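# squeeze=False keeps `axes` two-dimensional even when there is a single row of plots.
fig, axes = plt.subplots(math.ceil(len(results_dirs)/2), 2, squeeze=False)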
fig.suptitle('ROC curves')
for i, results_dir in enumerate(results_dirs):
    res = Results(path=results_dir)
    res.load()
    
    for p in tqdm(range(0,102)):
        res.compute_metrics_per_scan_proportion(proportion=p/100)
    res.save(overwrite=True)
    
    data = res.results_per_scan_proportion
    sens = []
    spec = []
    for k in sorted(data.keys()):
        sens.append(res.results_per_scan_proportion[k]['sensitivity'])
        spec.append(res.results_per_scan_proportion[k]['specificity'])
    sens = np.array(sens)
    spec = np.array(spec)
    axes[i // 2, i % 2].title.set_text(os.path.basename(results_dir)[12:-6])
    axes[i // 2, i % 2].set(xlabel='FPR', ylabel='TPR')
    axes[i // 2, i % 2].plot(1-spec, sens)
# With an odd number of results, the last cell of the two-column grid stays empty: remove it.
if len(results_dirs) % 2 != 0:
    fig.delaxes(axes[len(results_dirs) // 2, 1])
plt.tight_layout()
fig.show()
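
As one possible example of a more sophisticated metric, the sketch below computes the area under a ROC curve from sensitivity and specificity values such as the `sens` and `spec` arrays built in the cell above. It applies the trapezoidal rule to the (FPR, TPR) points. The function name `roc_auc` is hypothetical, and the input values in the usage lines are illustrative only, not results of this experiment.

```python
def roc_auc(sensitivity, specificity):
    """Area under the ROC curve via the trapezoidal rule.

    Hypothetical helper: takes matching sequences of sensitivity (TPR)
    and specificity (1 - FPR) values, one pair per decision threshold.
    """
    # Convert to (FPR, TPR) points and sort them by increasing FPR.
    points = sorted(zip((1.0 - s for s in specificity), sensitivity))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Illustrative inputs only: a perfect classifier and a chance-level one.
    print(roc_auc([0.0, 1.0, 1.0], [1.0, 1.0, 0.0]))
    print(roc_auc([0.0, 0.5, 1.0], [1.0, 0.5, 0.0]))
```

In the plotting loop, a call such as `roc_auc(sens, spec)` could be added to each subplot title to compare configurations at a glance.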