# Train a TopCUP Model to Extract Protein Particles from a CryoET Dataset

**Estimated time to complete:** 20 minutes

## Learning Goals
* Create a copick configuration file for loading a cryoET dataset.
* Train TopCUP models and automatically save the best checkpoints via the TopCUP CLI.

## Prerequisites

* The TopCUP model requires `python>=3.10`. At the time of publication, Colab defaults to Python 3.12.
* This model requires at least a T4 GPU to run.
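
The Python requirement can be checked before installing anything. The snippet below is a minimal sketch: the helper name `meets_python_requirement` is hypothetical and not part of TopCUP, and the GPU check is left out because it depends on which deep learning framework is installed.

```python
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 10)


def meets_python_requirement(version=None, minimum=MIN_PYTHON):
    """Return True if `version` (major, minor, ...) is at least `minimum`.

    `version` defaults to the interpreter that runs this code.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_python_requirement() else "too old for TopCUP"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```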

## Introduction

The Top CryoET U-Net Picker (TopCUP) is a 3D U-Net–based ensemble model designed for particle picking in cryo-electron tomography (cryoET) volumes.
It uses a segmentation heatmap approach to identify particle locations.
TopCUP is fully integrated with copick, a flexible cryoET dataset API developed at the Chan Zuckerberg Imaging Institute (CZII).
This integration makes it easy to apply the model directly to any cryoET dataset in copick format.
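
To make the heatmap idea concrete, the sketch below turns a 3D score volume into candidate particle positions by keeping voxels that exceed a threshold and are local maxima among their 26 neighbors. This is a generic illustration written for this tutorial, not TopCUP's actual post-processing; the function name `pick_peaks` and the toy volume are assumptions.

```python
import numpy as np


def pick_peaks(heatmap, threshold):
    """Return (z, y, x) indices of voxels above `threshold` that are
    greater than or equal to all 26 neighbors."""
    heatmap = np.asarray(heatmap, dtype=float)
    nz, ny, nx = heatmap.shape
    # Pad with -inf so border voxels can still be compared to "neighbors".
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dz in (0, 1, 2):
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                if (dz, dy, dx) == (1, 1, 1):
                    continue  # skip the voxel itself
                neighbor = padded[dz:dz + nz, dy:dy + ny, dx:dx + nx]
                is_peak &= heatmap >= neighbor
    return [tuple(int(i) for i in idx) for idx in np.argwhere(is_peak)]


# Toy volume: one strong voxel next to a weaker one.
volume = np.zeros((4, 4, 4))
volume[1, 2, 3] = 0.8
volume[1, 2, 2] = 0.6  # above threshold, but not a local maximum
print(pick_peaks(volume, threshold=0.5))  # [(1, 2, 3)]
```

A per-class threshold such as `score_threshold` would play the role of `threshold` here, with one heatmap per particle class.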


For this tutorial, we will use seven tomograms from the Experimental Training Dataset (Dataset ID: DS-10440), which is the same dataset used in the Kaggle CryoET Challenge.
Now that this dataset is publicly available on the CZ CryoET Data Portal,
we can stream it directly using the copick configuration file provided below.
We generate this configuration file automatically with the copick API and add the per-particle metadata needed to train TopCUP models.

## Setup

A copick configuration file is required as input.

The copick configuration file must define **pickable objects** (i.e., the protein complexes you want to detect) and **three key metadata** parameters for each object:
* `class_loss_weight`: weight for each class in the DenseCrossEntropy loss
* `score_threshold`: threshold to filter final picks per class, reducing false positives
* `score_weight`: weight for each class in the F-beta score evaluation
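
Before writing the metadata into a configuration file, it can help to confirm that every object carries all three parameters. The check below is a small sketch; the helper name `find_missing_metadata` is hypothetical, and the key names are the ones used in the metadata dictionary later in this tutorial.

```python
# Keys that each pickable object's metadata is expected to contain.
REQUIRED_KEYS = {"score_weight", "score_threshold", "class_loss_weight"}


def find_missing_metadata(metadata):
    """Map each object name to the sorted list of required keys it lacks.

    Objects with complete metadata are omitted from the result.
    """
    missing = {}
    for name, params in metadata.items():
        absent = sorted(REQUIRED_KEYS - set(params))
        if absent:
            missing[name] = absent
    return missing


example = {
    "ferritin-complex": {
        "score_weight": 1,
        "score_threshold": 0.16,
        "class_loss_weight": 256,
    },
    # Deliberately incomplete entry to show what the check reports.
    "thyroglobulin": {"score_weight": 2, "score_threshold": 0.18},
}
print(find_missing_metadata(example))  # {'thyroglobulin': ['class_loss_weight']}
```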

Additional instructions and template configurations for accessing datasets on different platforms are available on the official copick [examples page](https://copick.github.io/copick/examples/overview/).

An example copick configuration file is available in the [model GitHub repository](https://github.com/czimaginginstitute/czii_cryoet_mlchallenge_winning_models?tab=readme-ov-file#copick-configuration-file).

## Installation

First, install the TopCUP package from its GitHub repository. This also installs the required dependencies.

In [1]:
!pip install git+https://github.com/czimaginginstitute/czii_cryoet_mlchallenge_winning_models.git

## Copick Configuration File

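The code below generates a copick configuration file that streams dataset 10440 from the CZ CryoET Data Portal, attaches the training metadata to each particle, and saves the result for later use.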

In [None]:
import os, copick


metadata = {
    "ferritin-complex": {
        "score_weight": 1,
        "score_threshold": 0.16,
        "class_loss_weight": 256
    },
    "thyroglobulin": {
        "score_weight": 2,
        "score_threshold": 0.18,
        "class_loss_weight": 256
    },
    "beta-galactosidase": {
        "score_weight": 2,
        "score_threshold": 0.13,
        "class_loss_weight": 256
    },
    "beta-amylase": {
        "score_weight": 0,
        "score_threshold": 0.25,
        "class_loss_weight": 256
    },
    "cytosolic-ribosome": {
        "score_weight": 1,
        "score_threshold": 0.19,
        "class_loss_weight": 256
    },
    "virus-like-capsid": {
        "score_weight": 1,
        "score_threshold": 0.5,
        "class_loss_weight": 256
    }
}


copick_config_path = os.path.abspath('./training_copick_config_portal.json')
overlay_path = os.path.abspath('./tmp_overlay')
copick_root = copick.from_czcdp_datasets(
    [10440],  # dataset_ids
    overlay_path,  # overlay_root, self-defined
    {'auto_mkdir': True},  # overlay filesystem arguments
    output_path=copick_config_path,
)

# keep only the six particles listed in metadata
config_pickable_objects = []
for p in copick_root.config.pickable_objects:
    if p.name in metadata:
        p.metadata = metadata[p.name]
        config_pickable_objects.append(p)

copick_root.config.pickable_objects = config_pickable_objects
# save the copick config for later use
copick_root.save_config(copick_config_path)
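
To confirm that the metadata was written, the saved JSON file can be read back and summarized. This sketch assumes the file has a top-level `pickable_objects` list whose entries carry `name` and `metadata` fields; the helper names are hypothetical.

```python
import json


def summarize_pickable_objects(config):
    """Return (name, score_weight, score_threshold, class_loss_weight) tuples.

    Assumes `config` is the parsed copick JSON with a "pickable_objects" list.
    Missing metadata values are reported as None.
    """
    rows = []
    for obj in config.get("pickable_objects", []):
        meta = obj.get("metadata") or {}
        rows.append((
            obj.get("name"),
            meta.get("score_weight"),
            meta.get("score_threshold"),
            meta.get("class_loss_weight"),
        ))
    return rows


def load_and_summarize(path):
    """Read a copick configuration file from `path` and summarize it."""
    with open(path) as f:
        return summarize_pickable_objects(json.load(f))


# In the notebook: load_and_summarize(copick_config_path)
```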

### Additional Copick Command Options
You can explore dataset-specific options such as `run_names`, `pixelsize`, `tomo_type`, and annotator `user_id` using the copick API.

In [None]:
# Check available run names
for run in copick_root.runs:
    pss = [str(vs.voxel_size) for vs in run.voxel_spacings]
    ps = ','.join(set(pss))
    users = [p.user_id for p in run.picks]
    urs = ','.join(set(users))
    print(f"run name: {run.name}, annotation user_id: {urs}, available voxelsize/pixelsize: {ps} A")

run name: 16463, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16464, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16465, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16466, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16467, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16468, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16469, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A


In [None]:
# Get a single run
run = copick_root.get_run('16463')
voxel_spacing_obj = run.get_voxel_spacing(10.012)

# Check available tomo_type values
tts = [t.tomo_type for t in voxel_spacing_obj.tomograms]
tt = ','.join(tts)
print(f'run {run.name} has tomogram_type: {tt}')

run 16463 has tomogram_type: wbp-denoised-denoiset-ctfdeconv,wbp-filtered-ctfdeconv


## TopCUP CLI Commands

To explore the available options, run `topcup train --help` in your terminal. This displays all command-line options and arguments for TopCUP training:

```
Usage: topcup train [OPTIONS]

Options:
  -c, --copick_config FILE      copick config file path  [required]
  -tts, --train_run_names TEXT  Tomogram dataset run names for training
                                [required]
  -vts, --val_run_names TEXT    Tomogram dataset run names for validation
                                [required]
  -tt, --tomo_type TEXT         Tomogram type. Default is denoised.
  -u, --user_id TEXT            Needed for training, the user_id used for the
                                ground truth picks.
  -s, --session_id TEXT         Needed for training, the session_id used for
                                the ground truth picks. Default is None.
  -bs, --batch_size INTEGER     batch size for data loader
  -n, --n_aug INTEGER           Data augmentation copy. Default is 1112.
  -l, --learning_rate FLOAT     Learning rate for optimizer
  -p, --pretrained_weight TEXT  One pretrained weights file path. Default is
                                None.
  -e, --epochs INTEGER          Number of epochs. Default is 100.
  --pixelsize FLOAT             Pixelsize in angstrom. Default is 10.0A.
  -o, --output_dir TEXT         output dir for saving checkpoints
  -v, --logger_version INTEGER  PyTorch-Lightning logger version. If not set,
                                logs and outputs will increment to the next
                                version.
  -h, --help                    Show this message and exit.
```
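
When training is launched from Python rather than a terminal, the options above have to be passed as a list of strings. The helper below is a sketch of one way to assemble that list; `build_train_args` and its keyword names are hypothetical, while the flags themselves are taken from the help text above. Run names are joined with commas, as in the training cell of this tutorial.

```python
# Hypothetical keyword names mapped to flags from `topcup train --help`.
OPTIONAL_FLAGS = {
    "tomo_type": "-tt",
    "session_id": "-s",
    "batch_size": "-bs",
    "n_aug": "-n",
    "learning_rate": "-l",
    "pretrained_weight": "-p",
    "epochs": "-e",
    "pixelsize": "--pixelsize",
    "logger_version": "-v",
}


def build_train_args(copick_config, train_runs, val_runs, user_id, output_dir, **options):
    """Build the argument list for the `train` subcommand."""
    args = [
        "train",
        "-c", str(copick_config),
        "-tts", ",".join(train_runs),
        "-vts", ",".join(val_runs),
        "-u", user_id,
        "-o", str(output_dir),
    ]
    for key, value in options.items():
        if key not in OPTIONAL_FLAGS:
            raise ValueError(f"unknown option: {key}")
        args += [OPTIONAL_FLAGS[key], str(value)]
    return args


print(build_train_args(
    "config.json", ["16463", "16464"], ["16469"], "data-portal", "outputs",
    batch_size=4, epochs=1,
))
```

Rejecting unknown keywords keeps typos from being silently dropped before they reach the CLI.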

## Training

Next, we train the model through the TopCUP CLI. Training takes about 19 minutes per epoch with a batch size of 4. Downloading the data locally beforehand can reduce the data loading overhead per epoch. For this tutorial, we train the model for only 1 epoch.

In [None]:
# Run model training in Jupyter with live printouts. You can also run the commands directly in a terminal.

from topcup.cli.cli import cli

training_outputs = os.path.abspath('./outputs_training')

cli.main(
    args=[
        "train",
        "-c", copick_config_path,
        "-u", "data-portal",
        "-tts", "16463,16464,16465,16466,16467,16468",
        "-vts", "16469",
        "-bs", "4",
        "-n", "16",  # reduced for this tutorial; omit -n to use the default and replicate the reported performance
        "-o", training_outputs,
        "--pixelsize", "10.012",
        "-tt", "wbp-denoised-denoiset-ctfdeconv",
        "-v", "0",
        "-e", "1"
    ],
    standalone_mode=False,  # return control to the notebook instead of calling sys.exit()
)

logger version 0
logger log_dir /content/outputs_training/logs/training_logs/version_0
making output dir /content/outputs_training/jobs/0
Checkpoint dir: /content/outputs_training/checkpoints


train_dataset length: 96
val_dataset length: 1


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Predicting TS 16469


/usr/local/lib/python3.12/dist-packages/pytorch_lightning/utilities/data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.


Best score threshold values {'ferritin-complex': 0.16, 'beta-galactosidase': 0.13, 'virus-like-capsid': 0.5, 'cytosolic-ribosome': 0.19, 'beta-amylase': 0.25, 'thyroglobulin': 0.18}
{'score_ferritin-complex': 0.0, 'score_beta-galactosidase': 0.0, 'score_virus-like-capsid': 0.0, 'score_cytosolic-ribosome': 0.004812183315877375, 'score_beta-amylase': 0.0, 'score_thyroglobulin': 0.03546944858420268, 'score': np.float64(0.010821582926326106)}
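
The overall `score` in the printout above is consistent with a mean of the per-class scores weighted by each particle's `score_weight`. The sketch below reproduces that arithmetic from the logged values. It is an observation about these numbers, not a description of TopCUP's internal implementation, and the function name is hypothetical.

```python
# score_weight values from the metadata used in this tutorial.
score_weights = {
    "ferritin-complex": 1,
    "thyroglobulin": 2,
    "beta-galactosidase": 2,
    "beta-amylase": 0,
    "cytosolic-ribosome": 1,
    "virus-like-capsid": 1,
}

# Per-class scores copied from the training printout.
per_class_scores = {
    "ferritin-complex": 0.0,
    "thyroglobulin": 0.03546944858420268,
    "beta-galactosidase": 0.0,
    "beta-amylase": 0.0,
    "cytosolic-ribosome": 0.004812183315877375,
    "virus-like-capsid": 0.0,
}


def weighted_mean_score(scores, weights):
    """Average the per-class scores, weighting each class by its score_weight."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Prints approximately 0.0108, matching the logged overall score.
print(weighted_mean_score(per_class_scores, score_weights))
```

Because `beta-amylase` has a weight of 0, its score does not affect the overall value.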


## Analysis of Model Outputs

The model will automatically track the validation performance and save the best checkpoint and history metrics inside the specified output directory. The evaluation score for each epoch will be shown in the printouts. The output directory can be changed using the `-o` flag.
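
After training, the saved checkpoints can be located with a few lines of Python. This sketch assumes checkpoints are written to a `checkpoints` subdirectory of the output directory, as the training log indicates, and that they use the `.ckpt` extension, which is an assumption based on the usual PyTorch Lightning convention. The helper name is hypothetical.

```python
from pathlib import Path


def list_checkpoints(output_dir, pattern="*.ckpt"):
    """Return checkpoint paths under <output_dir>/checkpoints, newest first.

    Returns an empty list if the directory does not exist.
    """
    ckpt_dir = Path(output_dir) / "checkpoints"
    if not ckpt_dir.is_dir():
        return []
    return sorted(ckpt_dir.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)


for ckpt in list_checkpoints("./outputs_training"):
    print(ckpt.name)
```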


## Summary

In this tutorial, we generated a copick configuration file that streams data from the CZ CryoET Data Portal, trained a TopCUP model, and saved the best checkpoint to the specified output directory via the CLI.


## Contact and Acknowledgments

For issues with this notebook, please contact kevin.zhao@czii.org.

Special thanks to Christof Henkel for developing the segmentation models and Utz Ermel for developing copick.


## References

- Peck, A., et al. (2025). A Realistic Phantom Dataset for Benchmarking Cryo-ET Data Annotation. Nature Methods. DOI: 10.1101/2024.11.04.621686

## Responsible Use
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.