# Quickstart: TopCUP

**Estimated time to complete:** 5 minutes

## Learning Goals

By the end of this quickstart you will be able to:
* Create a copick configuration file for loading the cryoET dataset.
* Run the TopCUP model to extract particle locations via its CLI.

## Prerequisites

* The TopCUP model requires `python>=3.10`. At the time of publication, Colab defaults to Python 3.12.
* This model requires a GPU to run; a T4 is the minimum.



## Introduction

The Top CryoET U-Net Picker (TopCUP) is a 3D U-Net–based ensemble model designed for particle picking in cryo-electron tomography (cryoET) volumes.
It uses a segmentation heatmap approach to identify particle locations.
TopCUP is fully integrated with copick, a flexible cryoET dataset API developed at the Chan Zuckerberg Imaging Institute (CZII).
This integration makes it easy to apply the model directly to any cryoET dataset in copick format. This quickstart walks you through creating a copick configuration file for a cryoET dataset and then running the TopCUP model via its CLI to extract particle locations.
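
To make the heatmap idea concrete, here is a minimal sketch of how a segmentation heatmap can be turned into particle coordinates: keep voxels whose score reaches a threshold and that are strict local maxima among their neighbors, then scale the voxel indices by the voxel size to get angstroms. This is an illustration only, not TopCUP's actual post-processing; the helper name `pick_peaks`, the toy heatmap, and the 26-neighbor rule are all assumptions.

```python
from itertools import product


def pick_peaks(heatmap, threshold, voxel_size):
    """Return (z, y, x) peak positions in angstroms from a nested-list 3D heatmap.

    Hypothetical helper for illustration; not part of TopCUP or copick.
    """
    nz, ny, nx = len(heatmap), len(heatmap[0]), len(heatmap[0][0])
    # All 26 neighbor offsets around a voxel (the voxel itself is excluded).
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    peaks = []
    for z, y, x in product(range(nz), range(ny), range(nx)):
        score = heatmap[z][y][x]
        if score < threshold:
            continue
        # Collect the scores of the neighbors that lie inside the volume.
        neighbors = [
            heatmap[z + dz][y + dy][x + dx]
            for dz, dy, dx in offsets
            if 0 <= z + dz < nz and 0 <= y + dy < ny and 0 <= x + dx < nx
        ]
        # Keep the voxel only if it is a strict local maximum.
        if all(score > n for n in neighbors):
            peaks.append((z * voxel_size, y * voxel_size, x * voxel_size))
    return peaks


# Toy 3x3x3 heatmap with a single bright voxel in the center.
toy = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
toy[1][1][1] = 0.9
peaks = pick_peaks(toy, threshold=0.5, voxel_size=10.0)
print(peaks)  # [(10.0, 10.0, 10.0)]
```
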

For this tutorial, we'll use 3 tomograms from the Private Test Dataset (Dataset ID: DS-10446).
Because this dataset is now publicly available on the CZ CryoET Data Portal,
we can stream it directly using the copick and CryoET Data Portal APIs.

* Inputs: a copick configuration file (in this quickstart, we generate it with the copick API and stream the data from the portal)
* Outputs: the model automatically saves the particle picks (locations in angstroms) as a CSV file inside the specified output directory




## Create copick configuration file

The only input required is a copick configuration file.

The copick configuration file must define **pickable objects** (i.e., the protein complexes you want to detect) and **three** key metadata parameters for each object:
* `class_loss_weight`: weight for each class in the DenseCrossEntropy loss
* `score_threshold`: threshold to filter final picks per class, reducing false positives
* `score_weight`: weight for each class in the F-beta score evaluation
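
As a rough illustration of how `score_threshold` and `score_weight` act, the sketch below filters candidate picks per class and then combines per-class F-beta scores into a weighted average. The helper names, the pick scores, the true-positive/false-positive/false-negative counts, the weighted-average aggregation, and `beta=4` are all assumptions made for this example; refer to the TopCUP repository for the evaluation code it actually uses.

```python
def fbeta(tp, fp, fn, beta=4.0):
    """F-beta from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def filter_picks(picks, thresholds):
    """Keep picks whose score meets the score_threshold of their class."""
    return [p for p in picks if p["score"] >= thresholds[p["name"]]]


def weighted_fbeta(counts, weights, beta=4.0):
    """Weighted average of per-class F-beta; a class with weight 0 drops out."""
    total = sum(weights[c] for c in counts)
    if total == 0:
        return 0.0
    return sum(weights[c] * fbeta(*counts[c], beta=beta) for c in counts) / total


# Thresholds and weights taken from the example configuration in this quickstart.
thresholds = {"ribosome": 0.19, "beta-amylase": 0.25}
weights = {"ribosome": 1, "beta-amylase": 0}

# Made-up candidate picks: only the first one clears its class threshold.
picks = [
    {"name": "ribosome", "score": 0.30},
    {"name": "ribosome", "score": 0.10},
    {"name": "beta-amylase", "score": 0.20},
]
kept = filter_picks(picks, thresholds)

# Made-up (tp, fp, fn) counts per class.
counts = {"ribosome": (8, 2, 2), "beta-amylase": (1, 9, 9)}
score = weighted_fbeta(counts, weights)
print(len(kept), round(score, 3))  # 1 0.8
```

With a `score_weight` of 0, as for `beta-amylase` in the example configuration, a class has no influence on the aggregate score under this scheme.
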

You can find additional instructions and template configurations for accessing datasets across different platforms on the official copick [examples page](https://copick.github.io/copick/examples/overview/).

An example copick configuration file is shown below:

```
{
    "name": "Phantom Dataset",
    "description": "CZII ML Challenge Training dataset",
    "version": "1.0.1",
    "pickable_objects": [
        {
            "name": "apo-ferritin",
            "is_particle": true,
            "pdb_id": "4V1W",
            "label": 1,
            "color": [  0, 117, 220, 255],
            "radius": 60,
            "map_threshold": 0.0418,
            "metadata": {
                "score_weight": 1,
                "score_threshold": 0.16,
                "class_loss_weight": 256
            }
        },
        {
            "name": "beta-amylase",
            "is_particle": true,
            "pdb_id": "1FA2",
            "label": 2,
            "color": [153,  63,   0, 255],
            "radius": 65,
            "map_threshold": 0.035,
            "metadata": {
                "score_weight": 0,
                "score_threshold": 0.25,
                "class_loss_weight": 256
            }
        },
        {
            "name": "beta-galactosidase",
            "is_particle": true,
            "pdb_id": "6X1Q",
            "label": 3,
            "color": [ 76,   0,  92, 255],
            "radius": 90,
            "map_threshold": 0.0578,
            "metadata": {
                "score_weight": 2,
                "score_threshold": 0.13,
                "class_loss_weight": 256
            }
        },
        {
            "name": "ribosome",
            "is_particle": true,
            "pdb_id": "6EK0",
            "label": 4,
            "color": [  0,  92,  49, 255],
            "radius": 150,
            "map_threshold": 0.0374,
            "metadata": {
                "score_weight": 1,
                "score_threshold": 0.19,
                "class_loss_weight": 256
            }
        },
        {
            "name": "thyroglobulin",
            "is_particle": true,
            "pdb_id": "6SCJ",
            "label": 5,
            "color": [ 43, 206,  72, 255],
            "radius": 130,
            "map_threshold": 0.0278,
            "metadata": {
                "score_weight": 2,
                "score_threshold": 0.18,
                "class_loss_weight": 256
            }
        },
        {
            "name": "virus-like-particle",
            "is_particle": true,
            "label": 6,
            "color": [255, 204, 153, 255],
            "radius": 135,
            "map_threshold": 0.201,
            "metadata": {
                "score_weight": 1,
                "score_threshold": 0.5,
                "class_loss_weight": 256
            }
        }
    ],
    "config_type": "filesystem",
    "overlay_root": "local:///PATH/TO/EXTRACTED/PROJECT/",
    "static_root": "local:///PATH/TO/EXTRACTED/PROJECT/"
}
```
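
Before running the model, it can help to confirm that every pickable object carries the three metadata keys. The following is a minimal sketch that uses only the standard library; the helper name `missing_metadata` and the trimmed-down example configuration are hypothetical and not part of copick or TopCUP.

```python
import json

REQUIRED_KEYS = {"score_weight", "score_threshold", "class_loss_weight"}


def missing_metadata(config):
    """Map each pickable object name to the required metadata keys it lacks."""
    problems = {}
    for obj in config.get("pickable_objects", []):
        # Treat an absent or null "metadata" entry as an empty mapping.
        present = set(obj.get("metadata") or {})
        missing = REQUIRED_KEYS - present
        if missing:
            problems[obj["name"]] = missing
    return problems


# Trimmed-down configuration: the second object is missing two keys.
example = json.loads("""
{
  "pickable_objects": [
    {"name": "ribosome",
     "metadata": {"score_weight": 1, "score_threshold": 0.19, "class_loss_weight": 256}},
    {"name": "beta-amylase",
     "metadata": {"score_weight": 0}}
  ]
}
""")
problems = missing_metadata(example)
print(problems)
```

To check a file on disk, load it with `json.load` and pass the resulting dictionary to the same function.
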

## Installation of model

First, install the TopCUP package and its dependencies from the Git repository with `pip`. The leading `!` runs the command from a notebook cell; omit it in a terminal.

```
!pip install git+https://github.com/czimaginginstitute/czii_cryoet_mlchallenge_winning_models.git
```

When using datasets from the CryoET Data Portal, we can generate a copick configuration file automatically with the copick API and then add metadata for each particle. The code below generates the copick file and adds the metadata.

```python
import os

import copick

# Metadata for the pickable objects (particles)
metadata = {
    "ferritin-complex": {
        "score_weight": 1,
        "score_threshold": 0.16,
        "class_loss_weight": 256
    },
    "thyroglobulin": {
        "score_weight": 2,
        "score_threshold": 0.18,
        "class_loss_weight": 256
    },
    "beta-galactosidase": {
        "score_weight": 2,
        "score_threshold": 0.13,
        "class_loss_weight": 256
    },
    "beta-amylase": {
        "score_weight": 0,
        "score_threshold": 0.25,
        "class_loss_weight": 256
    },
    "cytosolic-ribosome": {
        "score_weight": 1,
        "score_threshold": 0.19,
        "class_loss_weight": 256
    },
    "virus-like-capsid": {
        "score_weight": 1,
        "score_threshold": 0.5,
        "class_loss_weight": 256
    }
}

# Generate a copick config that streams the dataset from the CryoET Data Portal
copick_config_path = os.path.abspath('./copick_config_portal.json')
overlay_path = os.path.abspath('./tmp_overlay')
copick_root = copick.from_czcdp_datasets(
    [10446],  # ML Challenge private test dataset
    overlay_path,  # overlay_root, self-defined local directory
    {'auto_mkdir': True},  # filesystem arguments for the overlay
    output_path=copick_config_path,
)

# Keep only the 6 particles that have metadata defined above
config_pickable_objects = []
for p in copick_root.config.pickable_objects:
    if p.name in metadata:
        p.metadata = metadata[p.name]
        config_pickable_objects.append(p)

copick_root.config.pickable_objects = config_pickable_objects
# Save the copick config for later use
copick_root.save_config(copick_config_path)
```

### Additional Copick Command Options
You can explore dataset-specific options such as `run_names`, `pixelsize`, and `tomo_type` using the copick API.

```python
# Check available run names, showing the first 5 tomograms
for run in copick_root.runs[:5]:
    # Sort numerically so the printed order is deterministic
    sizes = sorted({vs.voxel_size for vs in run.voxel_spacings})
    ps = ','.join(str(s) for s in sizes)
    print(f"run name: {run.name}, available voxelsize/pixelsize: {ps} A")
```

```
run name: 17803, available voxelsize/pixelsize: 4.99,10.012 A
run name: 17804, available voxelsize/pixelsize: 4.99,10.012 A
run name: 17805, available voxelsize/pixelsize: 4.99,10.012 A
run name: 17806, available voxelsize/pixelsize: 4.99,10.012 A
run name: 17807, available voxelsize/pixelsize: 4.99,10.012 A
```


```python
# Get a single run
run = copick_root.get_run('17803')
voxel_spacing_obj = run.get_voxel_spacing(10.012)

# Check the available tomogram types
tts = [t.tomo_type for t in voxel_spacing_obj.tomograms]
tt = ','.join(tts)
print(f'run {run.name} has tomogram_type: {tt}')
```

```
run 17803 has tomogram_type: wbp-denoised-denoiset-ctfdeconv,wbp-filtered-ctfdeconv
```


## Run Model Inference

To explore the available options for running TopCUP, use the `--help` flag. In your terminal, run `topcup inference --help`. This displays all command-line options and arguments for TopCUP inference, as shown below:

```
Usage: topcup inference [OPTIONS]

Options:
  -c, --copick_config FILE        copick config file path  [required]
  -ts, --run_names TEXT           Tomogram dataset run names
  -bs, --batch_size INTEGER       batch size for data loader
  -p, --pretrained_weights TEXT   Pretrained weights file paths (use comma for
                                  multiple paths). Default is None.
  -pa, --pattern TEXT             The key for pattern matching checkpoints.
                                  Default is *.ckpt
  --pixelsize FLOAT               Pixelsize in angstrom. Default is 10.0A.
  -tt, --tomo_type TEXT
                                  Tomogram type. Default is denoised.
  -u, --user_id TEXT              Needed for training, the user_id used for
                                  the ground truth picks.
  -o, --output_dir TEXT           output dir for saving prediction results
                                  (csv).
  -g, --gpus INTEGER              Number of GPUs for inference. Default is 1.
  -gt, --has_ground_truth BOOLEAN
                                  Inference with ground truth annoatations
  -h, --help                      Show this message and exit.
```
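
When the same inference is launched repeatedly with different runs or tomogram types, it can be convenient to assemble the argument list programmatically. The sketch below uses only flags listed in the help text above; the helper `build_inference_args` is hypothetical and not part of TopCUP, and the resulting list is meant to be passed to the CLI entry point or printed as a terminal command.

```python
import shlex


def build_inference_args(copick_config, run_names, weights_dir, output_dir,
                         pixelsize=None, tomo_type=None, pattern=None):
    """Build the argument list for `topcup inference` (hypothetical helper)."""
    args = [
        "inference",
        "-c", str(copick_config),
        "-ts", ",".join(run_names),  # run names are passed comma-separated
        "-p", str(weights_dir),
        "-o", str(output_dir),
    ]
    # Optional flags are added only when a value is supplied,
    # so the CLI defaults apply otherwise.
    if pixelsize is not None:
        args += ["--pixelsize", str(pixelsize)]
    if tomo_type is not None:
        args += ["-tt", tomo_type]
    if pattern is not None:
        args += ["-pa", pattern]
    return args


args = build_inference_args(
    "copick_config_portal.json",
    ["17803", "17804", "17805"],
    "checkpoints",
    "output/inference",
    pixelsize=10.012,
    tomo_type="wbp-denoised-denoiset-ctfdeconv",
    pattern="*.ckpt",
)
# Equivalent terminal command, with shell-sensitive values quoted.
command = shlex.join(["topcup"] + args)
print(command)
```
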

### Download Checkpoints


TopCUP is an ensemble model, so inference loads several checkpoints. The cell below downloads three pretrained checkpoints into a local `checkpoints` directory and skips any file that already exists.

```python
import urllib.request
from pathlib import Path

TOPCUP_CHECKPOINTS_URL = [
    "https://huggingface.co/kevinzhao/TopCUP/resolve/main/topcup_weights/topcup_phantom_6_tomograms.ckpt",
    "https://huggingface.co/kevinzhao/TopCUP/resolve/main/topcup_weights/topcup_phantom_12_tomograms.ckpt",
    "https://huggingface.co/kevinzhao/TopCUP/resolve/main/topcup_weights/topcup_phantom_24_tomograms.ckpt",
]

# Local directory to save the checkpoints
cache = Path("./checkpoints")
cache.mkdir(parents=True, exist_ok=True)

for url in TOPCUP_CHECKPOINTS_URL:
    filename = url.split("/")[-1]
    dest = cache / filename
    if not dest.exists():
        print(f"Downloading {filename} ...")
        try:
            urllib.request.urlretrieve(url, dest)
            print(f"→ Saved to {dest}")
        except Exception as e:
            print(f"Failed to download {url}: {e}")
    else:
        print(f"Already exists: {dest}")
```

```
Downloading topcup_phantom_6_tomograms.ckpt ...
→ Saved to checkpoints/topcup_phantom_6_tomograms.ckpt
Downloading topcup_phantom_12_tomograms.ckpt ...
→ Saved to checkpoints/topcup_phantom_12_tomograms.ckpt
Downloading topcup_phantom_24_tomograms.ckpt ...
→ Saved to checkpoints/topcup_phantom_24_tomograms.ckpt
```


### Extract Protein Locations

```python
# Run model inference from Jupyter with live printouts.
# You can also run the equivalent command directly in a terminal.

from topcup.cli.cli import cli

# Run inference on the first 3 tomograms
cli.main(
    args=[
        "inference",
        "-c", str(copick_config_path),
        "-ts", "17803,17804,17805",
        "-p", str(cache),
        "--pixelsize", "10.012",
        "-o", "output/inference",
        "-tt", "wbp-denoised-denoiset-ctfdeconv",
        "-pa", "*.ckpt",
    ],
    standalone_mode=False,  # so click returns instead of calling sys.exit()
)
```

```
  from .autonotebook import tqdm as notebook_tqdm
making output dir output/inference
[INFO] Loading 3 checkpoints from checkpoints
/hpc/projects/group.czii/kevin.zhao/conda_envs/topcup/lib/python3.12/site-packages/pytorch_lightning/utilities/migration/utils.py:56: The loaded checkpoint was produced with Lightning v2.5.2, which is newer than your current Lightning version: v2.4.0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hpc/projects/group.czii/kevin.zhao/conda_envs/topcup/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A40') that has Tensor Cores. 
Inference_dataset length: 3
Predicting DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s][INFO] GPU cuda:0 | Using ensemble of 3 models.
Predicting TS 17803
Predicting DataLoader 0:  33%|███▎      | 1/3 [00:38<01:17,  0.03it/s]
[INFO] GPU cuda:0 | Using ensemble of 3 models.
Predicting TS 17804
Predicting DataLoader 0:  67%|██████▋   | 2/3 [00:56<00:28,  0.04it/s][INFO] GPU cuda:0 | Using ensemble of 3 models.
Predicting TS 17805
Predicting DataLoader 0: 100%|██████████| 3/3 [01:13<00:00,  0.04it/s]Save predicted results in output/inference/val_pred_df_seed.csv
Predicting DataLoader 0: 100%|██████████| 3/3 [01:14<00:00,  0.04it/s]
```


## Model Outputs

The model automatically saves the particle picks (locations in angstroms) as a CSV file inside the specified output directory. In the run above, the log reports the file as `output/inference/val_pred_df_seed.csv`.
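
A quick way to inspect such a CSV is to list its columns and count the rows, optionally grouped by one column. The sketch below uses only the standard library and runs against a small stand-in file; the column names (`run`, `particle_type`, `x`, `y`, `z`), the values, and the helper `summarize_picks` are hypothetical, so check the header of the actual output file before relying on any column name.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_picks(csv_path, group_column=None):
    """Return (column names, row count, per-group row counts) for a picks CSV."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []
    groups = Counter(r[group_column] for r in rows) if group_column else Counter()
    return columns, len(rows), groups


# Stand-in CSV with hypothetical columns and made-up coordinates.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_picks.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run", "particle_type", "x", "y", "z"])
        writer.writerow(["17803", "ribosome", "1001.2", "2002.4", "300.4"])
        writer.writerow(["17803", "thyroglobulin", "500.6", "640.8", "120.1"])
        writer.writerow(["17804", "ribosome", "812.0", "95.1", "410.5"])
    columns, n_rows, per_type = summarize_picks(demo_path, group_column="particle_type")

print(columns, n_rows, dict(per_type))
```
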


## Contact and Acknowledgments

For issues with this quickstart, please contact kevin.zhao@czii.org.

Special thanks to Christof Hankel for developing the segmentation models and Ermel Utz for developing copick.


## References

- Peck, A., et al. (2025). A Realistic Phantom Dataset for Benchmarking Cryo-ET Data Annotation. Nature Methods. bioRxiv preprint DOI: 10.1101/2024.11.04.621686

## Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.