Linajea Tracking Example
=====================

This example shows all steps necessary to generate the final tracks, from training the network to finding the optimal ILP weights on the validation data to computing the tracks on the test data.

- train network
- predict on validation data
- grid search weights for ILP
  - solve once per set of weights
  - evaluate once per set of weights
  - select set with fewest errors
- predict on test data
- solve on test data with optimal weights
- evaluate on test data

In [None]:
%load_ext autoreload
%autoreload 2
import logging
import multiprocessing
import os
import shutil
import sys
import time
import types

import numpy as np
import pandas as pd

from linajea.config import TrackingConfig
import linajea.evaluation
from linajea.process_blockwise import (extract_edges_blockwise,
                                       predict_blockwise,
                                       solve_blockwise)
from linajea.training import train
import linajea.config
import linajea.process_blockwise
import linajea.utils

In [None]:
logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(name)s %(levelname)-8s %(message)s')

Experiment
-----------------

To start a new experiment, create a new folder and copy the configuration file(s) you want to use into it.
For this example we have already done this for you (`example_basic`), so you only need to change the current working directory to that folder.
Make sure that the file paths in the configuration files point to the correct locations for your directory structure, and that `config.general.setup_dir` is set to the folder you just created.

In [None]:
setup_dir = "example_basic"
os.chdir(setup_dir)

Setup
--------

Make sure that the `linajea` package is installed and that the correct kernel is selected in the jupyter notebook.


### Data

Set `download_data` to `True` and execute the next cell to download a subset of: *3D+time nuclei tracking dataset of confocal fluorescence microscopy time series of C. elegans embryos* (https://zenodo.org/record/6460303).

Each sample contains 15 time frames from a developing C. elegans embryo.
One sample will be used for training, one for validation and one for testing.
You can of course use your own data, but it has to be in a compatible format (see https://github.com/funkelab/linajea/examples/README.md for more information).

In [None]:
download_data = False
if download_data:
    !wget -c https://figshare.com/ndownloader/files/36873672 -N --show-progress
    !ln -s 36873672 mskcc_emb1_15fr.zip
    !wget -c https://figshare.com/ndownloader/files/36873675 -N --show-progress
    !mv 36873675 mskcc_emb1_15fr_tracks.csv

    !wget -c https://figshare.com/ndownloader/files/36873723 -N --show-progress
    !ln -s 36873723 mskcc_emb2_15fr.zip
    !wget -c https://figshare.com/ndownloader/files/36873726 -N --show-progress
    !mv 36873726 mskcc_emb2_15fr_tracks.csv

    !wget -c https://figshare.com/ndownloader/files/36873825 -N --show-progress
    !ln -s 36873825 mskcc_emb3_15fr.zip
    !wget -c https://figshare.com/ndownloader/files/36873828 -N --show-progress
    !mv 36873828 mskcc_emb3_15fr_tracks.csv

### Database

MongoDB is used to store the computed results. A `mongod` server has to be running before executing the remaining cells.
See https://www.mongodb.com/docs/manual/administration/install-community/ for a guide on how to install it (Linux/Windows/MacOS).
Alternatively you might want to create a singularity image (https://github.com/singularityhub/mongo). This can also be used locally, and it will be necessary if you want to run the code on an HPC cluster where no server is installed.
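
Before running the remaining cells it can help to confirm that something is listening on the database port. The following is a minimal sketch, not part of `linajea`: `mongod_reachable` is a hypothetical helper, and it assumes the server uses MongoDB's default port 27017. It only tests that a TCP connection can be opened, not that the listener is really a MongoDB server.

```python
import socket


def mongod_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only checks that something accepts connections
    on the port, not that the listener is a MongoDB server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timeout, unknown host, ...
        return False


print("mongod reachable:", mongod_reachable("localhost"))
```

If this prints `False`, start `mongod` (or adapt `db_host`) before continuing.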

Set `setup_databases` to `True` to add the ground truth tracks to the database server. Make sure to set `db_host` to the correct server (if you run it locally you can usually just set it to `"localhost"`). If you use different data you also have to adapt `csv_tracks_file` and `db_name`.

In [None]:
setup_databases = False
db_host = "localhost"
if setup_databases:
    # one ground truth database per sample
    for sample in ["mskcc_emb1_15fr", "mskcc_emb2_15fr", "mskcc_emb3_15fr"]:
        csv_tracks_file = f"{sample}_tracks.csv"
        db_name = f"linajea_{sample}_gt"
        linajea.utils.add_tracks_to_database(
            csv_tracks_file,
            db_name,
            db_host)

Configuration
--------------------

All parameters that control the pipeline (e.g. model architecture, data augmentation, training parameters, ILP weights) are contained in a configuration file in the TOML format (https://toml.io).

You can use a single monolithic configuration file or separate configuration files for a subset of the steps of the pipeline, as long as the parameters required for the respective steps are there.

Familiarize yourself with the example configuration files and have a look at the documentation for the configuration to see what is needed. Most parameters have sensible defaults; usually setting the correct paths and the data configuration is all that is needed to start. See `run_advanced.ipynb` for an example setup that can (optionally) handle multiple samples and automates the process of selecting the correct data for each step as much as possible.

In this setup `train_data` has to be set for training, and `inference_data` for validation and testing.
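
As a rough illustration of the rule above, the sketch below parses a TOML string and reports which top-level sections are missing for a given step. It is not how `linajea` validates its configuration (that happens when the file is loaded with `TrackingConfig.from_file`). The mapping in `REQUIRED_SECTIONS`, the helper `missing_sections` and the example TOML content are assumptions made for this sketch.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport package for older Python versions

# Assumption: which top-level sections each step needs. Only `general`,
# `train_data` and `inference_data` are named in this notebook.
REQUIRED_SECTIONS = {
    "train": ["general", "train_data"],
    "inference": ["general", "inference_data"],
}

# Made-up minimal configuration used only for this sketch.
EXAMPLE_TOML = """
[general]
setup_dir = "example_basic"

[train_data]
# data settings omitted in this sketch
"""


def missing_sections(config, step):
    """Return the required top-level sections that are absent from config."""
    return [name for name in REQUIRED_SECTIONS[step] if name not in config]


config = tomllib.loads(EXAMPLE_TOML)
print("train:", missing_sections(config, "train"))
print("inference:", missing_sections(config, "inference"))
```

With the example content, nothing is missing for training, while `inference_data` is reported as missing for validation and testing.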

Training
------------

To start training, load the appropriate configuration file and pass the configuration object to the `train` function. Make sure that the training data and parameters such as the number of iterations/steps are set correctly.

Training until convergence will take from several hours to multiple days.

In [None]:
train_config_file = "config_train.toml"
train_config = TrackingConfig.from_file(train_config_file)

In [None]:
# done in child process to automatically free cuda resources
p = multiprocessing.Process(target=train, args=(train_config,))
p.start()
p.join()

As training until convergence will take a while, we provide a pretrained model that can be used to test the following steps of the tracking pipeline.
To use the pretrained model set `use_pretrained` to `True` and execute the next cell:

In [None]:
use_pretrained = False
if use_pretrained:
    !wget https://figshare.com/ndownloader/files/36939550 -nv -N --show-progress
    shutil.copy2("36939550", f"train_net_checkpoint_{train_config.train.max_iterations}")

Validation
--------------

After the training is completed we first have to determine the optimal ILP weights.
This is achieved by first creating the prediction on the validation data and then performing a grid search by solving the ILP and evaluating the results repeatedly.

In [None]:
validation_config_file = "config_val.toml"
val_config = TrackingConfig.from_file(validation_config_file)

### Predict Validation Data

First we predict the `cell_indicator` and `movement_vectors` on the validation data. Make sure that `inference_data` in the config file points to the data you want to use for validation. The extracted maxima of the `cell_indicator` map correspond to potential cells in our candidate graph.

This command starts a number of workers (`predict.job.num_workers`) in the background, and each worker tries to access a GPU, so do not start more workers than there are GPUs available. By default the workers are started locally. If you are working on a compute cluster (`lsf` is supported, `slurm` and `gridengine` are experimental), set `predict.job.run_on` to the respective string value; the code will then communicate with the cluster scheduler and allocate the appropriate jobs.

Depending on the number of workers used (see config file) and the size of the data this can take a while.

In [None]:
predict_blockwise(val_config)
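
To check the "no more workers than GPUs" rule on a local machine, one option is to count the GPUs listed by `nvidia-smi -L`. This is a sketch under assumptions: `count_gpus` and `visible_gpus` are hypothetical helpers, `requested_workers` stands in for the value of `predict.job.num_workers`, and the count ignores restrictions such as `CUDA_VISIBLE_DEVICES`.

```python
import shutil
import subprocess


def count_gpus(listing):
    """Count GPUs in the text printed by `nvidia-smi -L` (one line per GPU)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpus():
    """Return the number of GPUs reported by nvidia-smi, or 0 if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"],
                            capture_output=True, text=True)
    return count_gpus(result.stdout) if result.returncode == 0 else 0


requested_workers = 4  # hypothetical value of predict.job.num_workers
n_gpus = visible_gpus()
if requested_workers > n_gpus:
    print(f"{requested_workers} workers requested, "
          f"but only {n_gpus} GPU(s) found")
```

On a cluster the scheduler decides which GPUs a job sees, so this check is only meaningful for local runs.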

### Extract Edges Validation Data

In the next step we extract potential edges for our candidate graph. For each cell candidate, look for neighboring cells in the next time frame and insert an edge candidate for each into the database.

In [None]:
extract_edges_blockwise(val_config)
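
The idea behind edge extraction can be sketched with plain NumPy. This is not the `linajea` implementation, which works blockwise and on the database: `candidate_edges`, the fixed `max_distance` threshold and the made-up coordinates are assumptions for illustration only.

```python
import numpy as np


def candidate_edges(positions_t, positions_t1, max_distance):
    """Return (i, j) index pairs linking cells in frame t to nearby cells in t+1."""
    # pairwise Euclidean distances, shape (n_t, n_t1)
    diff = positions_t[:, None, :] - positions_t1[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # keep every pair that is at most max_distance apart
    src, dst = np.nonzero(dist <= max_distance)
    return list(zip(src.tolist(), dst.tolist()))


# made-up cell positions (z, y, x) in two consecutive frames
frame_t = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
frame_t1 = np.array([[1.0, 0.0, 0.0], [9.0, 1.0, 0.0], [50.0, 50.0, 50.0]])
print(candidate_edges(frame_t, frame_t1, max_distance=5.0))
# → [(0, 0), (1, 1)]
```

The distant third cell in the second frame gets no edge, so it could only start a new track. A larger threshold yields more edge candidates and therefore a larger ILP.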

### ILP Weights Grid Search

Together, the cell (node) and edge candidates form our candidate graph. By solving the ILP we extract tracks from this graph. However, the ILP is parameterized by a set of weights, so we first have to find good values for these weights. To achieve this we perform a grid search over a predefined search space: for each set of parameter candidates we solve the ILP once on the validation data.
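
To get a feeling for how quickly a grid grows, the sketch below enumerates all combinations of a small search space. The candidate values are made up and `parameter_grid` is a hypothetical helper. How `linajea` itself builds the entries of `solve.parameters` from the configuration is not shown here.

```python
import itertools

# Hypothetical candidate values; only the weight names come from this notebook.
search_space = {
    "weight_node_score": [-13.0, -17.0],
    "selection_constant": [6.0, 9.0],
    "track_cost": [7.0],
}


def parameter_grid(space):
    """Return one dict per combination of candidate values."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


grid = parameter_grid(search_space)
print(len(grid), "parameter sets")
for params in grid:
    print(params)
```

The number of sets is the product of the number of candidate values per weight (here 2 x 2 x 1 = 4), and the ILP is solved once per set, so keep the search space small at first.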

#### Solve on Validation Data

Make sure to provide a number of parameter sets (`solve.parameters`) to try.

In [None]:
linajea.process_blockwise.solve_blockwise(val_config)

#### Evaluate on Validation Data

As the last validation step we evaluate the performance for each set of parameter candidates.

In [None]:
validation_config_file = "config_val.toml"
val_config = TrackingConfig.from_file(validation_config_file)
parameters = val_config.solve.parameters
for params in parameters:
    val_config.solve.parameters = [params]
    linajea.evaluation.evaluate_setup(val_config)

#### Determine best ILP weights

The set of weights/parameters resulting in the best performance (fewest errors) will then be used to get the performance on the test set.

In [None]:
score_columns = ['fn_edges', 'identity_switches',
                 'fp_divisions', 'fn_divisions']
if not val_config.general.sparse:
    score_columns = ['fp_edges'] + score_columns

results = linajea.evaluation.get_results_sorted(
    val_config,
    filter_params={"val": True},
    score_columns=score_columns,
    sort_by="sum_errors")

# copy the weights of the best-scoring row into the parameter object
best = results.iloc[0]
parameters = val_config.solve.parameters[0]
for name in ("weight_node_score", "selection_constant", "track_cost",
             "weight_edge_score", "weight_division", "weight_child",
             "weight_continuation"):
    setattr(parameters, name, float(best[name]))

print("Best parameters:\n", parameters)
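
The selection criterion can be reproduced on a toy table. This sketch assumes that `sum_errors` is the plain sum of the score columns, which is not confirmed by this notebook. The numbers are made up and only one weight column is shown.

```python
import pandas as pd

score_columns = ["fp_edges", "fn_edges", "identity_switches",
                 "fp_divisions", "fn_divisions"]

# made-up error counts for three hypothetical parameter sets
results = pd.DataFrame({
    "track_cost": [10.0, 15.0, 20.0],
    "fp_edges": [12, 5, 9],
    "fn_edges": [8, 6, 4],
    "identity_switches": [2, 1, 3],
    "fp_divisions": [1, 0, 2],
    "fn_divisions": [3, 2, 1],
})

# total number of errors per parameter set, smallest first
results["sum_errors"] = results[score_columns].sum(axis=1)
results = results.sort_values("sum_errors")
best = results.iloc[0]
print(results[["track_cost", "sum_errors"]])
```

Here the second parameter set (`track_cost` of 15) wins with 14 errors in total, ahead of the sets with 19 and 26 errors.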

Test
------

Now that we know which ILP weights to use we can create the candidate graph on the test data and compute the tracks. 

First load the test configuration file and set the parameters to the previously determined values (alternatively set the values manually directly in the configuration file). 
Make sure that `solve.grid_search` and `solve.random_search` are not set or set to `False`.
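
A small guard can make this check explicit. The sketch below is an assumption-laden illustration: `search_disabled` is a hypothetical helper, and it assumes the loaded configuration exposes `grid_search` and `random_search` as attributes of its `solve` section. `SimpleNamespace` objects stand in for that section here.

```python
from types import SimpleNamespace


def search_disabled(solve_cfg):
    """Return True if neither grid_search nor random_search is switched on."""
    return not (getattr(solve_cfg, "grid_search", False)
                or getattr(solve_cfg, "random_search", False))


# stand-ins for the `solve` section of a loaded configuration
print(search_disabled(SimpleNamespace()))
# → True
print(search_disabled(SimpleNamespace(grid_search=True, random_search=False)))
# → False
```

Under the stated assumption it could be called as `search_disabled(test_config.solve)` once the test configuration has been loaded.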

In [None]:
test_config_file = "config_test.toml"
test_config = TrackingConfig.from_file(test_config_file)
test_config.solve.parameters = [parameters]
test_config.solve.solver_type = val_config.solve.solver_type

### Predict Test Data

As before we first predict the `cell_indicator` and `movement_vectors`, this time on the test data. Make sure that `inference_data` in the config file points to the data you want to use for testing.

In [None]:
predict_blockwise(test_config)

### Extract Edges on Test Data

In the next step we again extract the potential edges for our candidate graph.

In [None]:
extract_edges_blockwise(test_config)

### Solve on Test Data

Then we can solve the ILP on the test data and compute the tracks. Make sure that the ILP weights are set to the values that resulted in the lowest overall number of errors on the validation data.

In [None]:
test_config.solve.from_scratch = True
solve_blockwise(test_config)

### Evaluate on Test Data

And finally we can evaluate the performance of our tracks.

In [None]:
report = linajea.evaluation.evaluate_setup(test_config)
for k, v in report.get_short_report().items():
    print(f"\t{k: <32}: {v}")