DINOv2 for Computational Pathology

Installation

The training and evaluation code requires PyTorch 2.0 and xFormers 0.0.18, as well as a number of other third-party packages. Note that the code has only been tested with the specified versions and expects a Linux environment. To set up all the required dependencies for training and evaluation, follow the instructions below:

Clone the repository and then create and activate a dinov2 conda environment using the provided environment definition:

conda env create -f conda.yaml
conda activate dinov2

For dense tasks (depth estimation and semantic segmentation), there are additional dependencies (specific versions of mmcv and mmsegmentation) which are captured in the extras dependency specifications:

conda env create -f conda-extras.yaml
conda activate dinov2-extras
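
To quickly verify that the pinned versions resolved correctly, you can run a minimal sanity check from the activated environment (a sketch, not part of the repository):

import torch
import xformers

print("PyTorch:", torch.__version__)       # expected: 2.0.x
print("xFormers:", xformers.__version__)   # expected: 0.0.18
print("CUDA available:", torch.cuda.is_available())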

Data preparation

You need to wrap your data in a single tarball file:

  1. Ensure images are all in one directory

  2. Create a single large tarball file that contains all images and name it pretrain_dataset.tar:

    tar -chf pretrain_dataset.tar /path/to/image/folder

Using whole dataset

  1. Infer the auxiliary files pretrain_entries.npy and pretrain_file_indices.npy:

    python scripts/infer_entries.py \
        --tarball_path /path/to/pretrain_dataset.tar \
        --output_root /path/to/output/folder \
        --name pretrain

    The pretrain_entries.npy file will record:

    • a dummy class index (we set it to 0 for all images since we’re not using classes)
    • a unique filename index for each image
    • the start and end offsets of each image within the tarball file

    The pretrain_file_indices.npy file is a dictionary mapping each filename index to the corresponding filename; a short snippet for inspecting both auxiliary files follows this list.

  2. Dump pretrain_dataset.tar, pretrain_entries.npy and pretrain_file_indices.npy in a common folder (e.g. /root/data)
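
To sanity-check these auxiliary files, you can inspect them with numpy. This is only a sketch: the exact column layout of the entries array is an assumption based on the description above (class index, filename index, start and end offsets per row), and the file-indices dictionary is assumed to have been saved with numpy (hence allow_pickle=True):

import numpy as np

entries = np.load("/root/data/pretrain_entries.npy")
file_indices = np.load("/root/data/pretrain_file_indices.npy", allow_pickle=True).item()

print("number of images:", len(entries))
print("first entry:", entries[0])

# map the filename index of the first entry back to its filename
# (assumes the filename index is the second field of each entry)
print("first filename:", file_indices[int(entries[0][1])])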

Restricting to a subset

You may not want to use all the patches of a cohort, but only a subset of them (e.g. the cohort comes with a train/tune/test split and you only want to use the patches belonging to slides in the train partition).

In that case, follow these steps:

  1. Dump the image filenames (e.g. patch1.jpg) of the subset of interest in a .txt file (e.g. {subset}.txt); one way to build this file is sketched after this list

  2. Infer the corresponding auxiliary file pretrain_entries_{subset}.npy:

    python scripts/infer_entries.py \
      --tarball_path /path/to/pretrain_dataset.tar \
      --output_root /path/to/output/folder \
      --keep /path/to/{subset}.txt \
      --name pretrain \
      --suffix {subset}

    The pretrain_entries_{subset}.npy file will record:

    • a dummy class index (we set it to 0 for all images since we’re not using classes)
    • a unique filename index for each image listed in {subset}.txt
    • the start and end offsets of each image within the tarball file

    A generic pretrain_file_indices.npy file will be saved the first time you run this command.
    It is a dictionary mapping each filename index to the corresponding filename for the entire tarball file.

  3. Dump pretrain_dataset.tar, pretrain_entries_{subset}.npy and pretrain_file_indices.npy in a common folder (e.g. /root/data)
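
For illustration, one way to build the {subset}.txt file is to filter patch filenames by the slide they belong to. This is a sketch under assumptions: the slide_splits.csv file, its slide_id / partition columns and the <slide_id>_<x>_<y>.jpg patch naming convention are hypothetical and should be adapted to your cohort:

import csv
from pathlib import Path

image_dir = Path("/path/to/image/folder")

# collect the slides assigned to the train partition (hypothetical split file)
train_slides = set()
with open("/path/to/slide_splits.csv") as f:
    for row in csv.DictReader(f):
        if row["partition"] == "train":
            train_slides.add(row["slide_id"])

# keep only the patches coming from those slides
with open("train.txt", "w") as out:
    for patch in sorted(image_dir.glob("*.jpg")):
        slide_id = patch.name.split("_")[0]  # assumed filename convention
        if slide_id in train_slides:
            out.write(patch.name + "\n")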

(optional) Downstream data preparation

This section describes the steps to follow if you want to run tuning on a downstream task dataset with patch-level labels; a sketch showing how the required files can be generated follows the list.

  1. Create a .csv file containing downstream patches' filenames and labels:

    filename,label
    downstream_patch_1.jpg,3
    downstream_patch_2.jpg,1
    ...
    
  2. Create a single tarball file that contains all downstream tuning patches and name it downstream_dataset.tar:

    tar -chf downstream_dataset.tar /path/to/downstream/dataset/image/folder
  3. Infer the auxiliary files query_entries.npy and query_file_indices.npy:

    python3 scripts/infer_entries.py \
      --tarball_path /path/to/downstream_dataset.tar \
      --output_root /path/to/output/folder \
      --csv /path/to/csv/file.csv \
      --keep /path/to/output/query.txt \
      --prefix query
    

    /path/to/csv/file.csv should point to the .csv file created in step 1 above.
    /path/to/output/query.txt should contain the list of filenames for the patches in the query subset of the downstream dataset.

  4. Infer the auxiliary files test_entries.npy and test_file_indices.npy:

    python3 scripts/infer_entries.py \
      --tarball_path /path/to/downstream_dataset.tar \
      --output_root /path/to/output/folder \
      --csv /path/to/csv/file.csv \
      --keep /path/to/output/test.txt \
      --prefix test
    

    /path/to/csv/file.csv should point to the .csv file created in step 1 above.
    /path/to/output/test.txt should contain the list of filenames for the patches in the test subset of the downstream dataset.

  5. Dump the .tar file and the .npy files in a common folder (e.g. /root/data)
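
For illustration, the .csv file from step 1 and the query.txt / test.txt lists from steps 3 and 4 can be derived from a single labels table. This is a sketch under assumptions: the downstream_labels.csv file and its filename / label / split columns are hypothetical and should be adapted to however your downstream labels are stored:

import csv

with open("/path/to/downstream_labels.csv") as f:
    rows = list(csv.DictReader(f))

# step 1: filename,label csv passed to infer_entries.py via --csv
with open("/path/to/csv/file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    for row in rows:
        writer.writerow([row["filename"], row["label"]])

# steps 3 and 4: one filename list per downstream split, passed via --keep
for split in ("query", "test"):
    with open(f"/path/to/output/{split}.txt", "w") as f:
        for row in rows:
            if row["split"] == split:
                f.write(row["filename"] + "\n")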

Training

⚠️ To execute the commands provided in this section, make sure the dinov2 package is included in the Python module search path:

export PYTHONPATH="${PYTHONPATH}:/path/to/your/dinov2"

Training a ViT-L/14

Update dinov2/configs/train/vitl14.yaml if you want to change some parameters (e.g. enabling early stopping).
Then run:

python -m torch.distributed.run --nproc_per_node=gpu dinov2/train/train.py \
    --config-file dinov2/configs/train/vitl14.yaml \
    train.dataset_path=Pathology:root={path/to/tarball/root}:extra={path/to/entry/root}:subset={subset}

Replace {path/to/tarball/root} with the root folder where tarballs are saved, and {path/to/entry/root} with the root folder where the numpy entry files are saved (e.g. Pathology:root=/root/data:extra=/root/data).
Leave out :subset={subset} if you didn't restrict the dataset to a specific subset when preparing data.
Otherwise, replace {subset} with the suffix you chose for --suffix in data preparation (e.g. Pathology:root=/root/data:extra=/root/data:subset=train).

In case you want to run downstream tuning, make sure to update the following two parameters in your config:

tune:
  query_dataset_path: KNN:root={path/to/data/root}:extra={path/to/entry/root}:split=query
  test_dataset_path: KNN:root={path/to/data/root}:extra={path/to/entry/root}:split=test

Replace {path/to/data/root} with the folder where you dumped the downstream .tar files. Replace {path/to/entry/root} with the folder where you dumped the downstream .npy entry files.
