<a href="https://colab.research.google.com/github/ccosmin97/IDC-Examples-fix-bg/blob/master/notebooks/cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC Google Colab cookbook notebook

This notebook collects short, reusable snippets that IDC users may find helpful when developing their own analysis notebooks.

Please email Andrey Fedorov (andrey dot fedorov at gmail dot com) if you have any questions or suggestions!

Prepared: Spring 2022

Updated: July 2022

# Prerequisites

Please complete the prerequisites described on this documentation page: https://learn.canceridc.dev/introduction/getting-started-with-gcp.

Insert the ID of your Google Cloud project in the cell below in place of `REPLACE_ME_WITH_YOUR_PROJECT_ID`.

In [1]:
# initialize this variable with your Google Cloud Project ID!
# replace the value below with your own project ID
my_ProjectID = "idc-sandbox-003"  # "REPLACE_ME_WITH_YOUR_PROJECT_ID"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID
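
It is easy to run the rest of the notebook with the placeholder still in place. The check below is a minimal sketch of a guard against that; the function name `require_project_id` is hypothetical and is not part of any Google library.

```python
PLACEHOLDER = "REPLACE_ME_WITH_YOUR_PROJECT_ID"


def require_project_id(project_id):
    """Return the project ID, or raise if it is empty or still the placeholder."""
    if not project_id or project_id.strip() == PLACEHOLDER:
        raise ValueError(
            "Set my_ProjectID to your own Google Cloud project ID before continuing."
        )
    return project_id.strip()


# Example with a made-up project ID; in the notebook you would pass my_ProjectID.
checked_id = require_project_id("my-example-project")
```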

# Authentication

In [2]:
# you will need to authenticate with your Google ID to do anything meaningful with IDC
from google.colab import auth
auth.authenticate_user()

# Query

First, instantiate the BigQuery client, which is then used to run the queries.

In [3]:
# python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

## Select by specific UID

Queries below are against the [`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table) that has one row per DICOM file stored in IDC. That table contains the metadata extracted from DICOM files, collection-level attributes (e.g., ID of the collection, license, DOI of the collection), and URLs pointing to the location in the cloud where the file is stored.

In [4]:
# select rows corresponding to a specific patient, as defined by the PatientID value
# similarly, you can select by StudyInstanceUID, SeriesInstanceUID or SOPInstanceUID,
# replacing the PatientID line below with one of the following (as examples):
#   SOPInstanceUID = "1.3.6.1.4.1.14519.5.2.1.6450.2626.226637977389233552278537838820"
#   SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741"
#   StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.4334.1501.116796918629271881210561198785"
selection_query = """
  SELECT  
    StudyInstanceUID, 
    SeriesInstanceUID, 
    SOPInstanceUID, 
    instance_size, 
    gcs_url 
  FROM 
    `bigquery-public-data.idc_current.dicom_all` 
  WHERE 
    PatientID = "R01-001"
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()
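
Swapping the `WHERE` line by hand works, but a small helper makes it harder to mistype the column name. The following is a sketch only: `build_selection_query` and `ALLOWED_COLUMNS` are hypothetical names, and the character whitelist is an assumption that suits DICOM UIDs and simple patient IDs. For arbitrary values, BigQuery query parameters are the more robust option.

```python
import re

# Identifier columns mentioned in the cell above.
ALLOWED_COLUMNS = {"PatientID", "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"}


def build_selection_query(column, value):
    """Build the dicom_all selection query for one identifier column and value."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unsupported column: {column}")
    # Assumption: reject anything that could break out of the quoted literal.
    if not re.fullmatch(r"[A-Za-z0-9._\- ]+", value):
        raise ValueError(f"Unexpected characters in value: {value!r}")
    return f"""
  SELECT
    StudyInstanceUID,
    SeriesInstanceUID,
    SOPInstanceUID,
    instance_size,
    gcs_url
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    {column} = "{value}"
"""


series_query = build_selection_query(
    "SeriesInstanceUID",
    "1.3.6.1.4.1.14519.5.2.1.4334.1501.312037286778380630549945195741",
)
```

The resulting string can be passed to `bq_client.query(...)` in the same way as `selection_query`.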

In [5]:
# instance_size is in bytes; dividing by 1024**3 gives gibibytes (GiB)
size_gb = selection_df["instance_size"].sum()/(1024*1024*1024)
print(f"Cohort size on disk: {size_gb} GiB")

Cohort size on disk: 0.273021187633276 GiB
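
For cohorts of very different sizes it can be convenient to let the unit adapt to the value. The helper below is a minimal sketch with a hypothetical name, `format_size`; it uses binary units, matching the division by `1024*1024*1024` above.

```python
def format_size(num_bytes):
    """Format a byte count using binary units (1 GiB = 1024**3 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


one_gib = format_size(1024 ** 3)
```

In the notebook this could be called as `format_size(selection_df["instance_size"].sum())`.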


## Select by availability of segmentations

What segmentations do we have anyway? Let's look at the distinct combinations of segmentation property category, type and anatomic location, which are the metadata attributes that describe segmentations.

In [6]:
%%bigquery --project=$my_ProjectID

SELECT DISTINCT
  SegmentedPropertyCategory.CodeMeaning as SegmentedPropertyCategory_CodeMeaning,
  SegmentedPropertyType.CodeMeaning as SegmentedPropertyType_CodeMeaning,
  AnatomicRegion.CodeMeaning as AnatomicRegion_CodeMeaning
FROM
  `bigquery-public-data.idc_current.segmentations`

Unnamed: 0,SegmentedPropertyCategory_CodeMeaning,SegmentedPropertyType_CodeMeaning,AnatomicRegion_CodeMeaning
0,Morphological Abnormal Structure,Nodule,Lung
1,Tissue,Mammary Fibroglandular Tissue,Breast
2,Tissue,Breast,Breast
3,Morphologically Altered Structure,Neoplasm,Entire body
4,Morphologically Altered Structure,Mass,
5,Morphologically Altered Structure,"Neoplasm, Secondary",lymph node of head and neck
6,Morphologically Altered Structure,Enhancing Lesion,Brain
7,Morphologically Altered Structure,"Neoplasm, Primary",Head and Neck
8,Anatomical Structure,Esophagus,
9,Anatomical Structure,Lung,


Select all rows that correspond to the instances of segmentations of anything in the prostate.

In [7]:
# select rows corresponding to segmentation instances whose segmented property type
# or anatomic region mentions the prostate
selection_query = """
  SELECT  
    dicom_all.StudyInstanceUID, 
    dicom_all.SeriesInstanceUID, 
    dicom_all.SOPInstanceUID, 
    gcs_url 
  FROM 
    `bigquery-public-data.idc_current.dicom_all` as dicom_all 
  JOIN 
    `bigquery-public-data.idc_current.segmentations` as segmentations 
  ON 
    dicom_all.SOPInstanceUID = segmentations.SOPInstanceUID 
  WHERE 
    segmentations.SegmentedPropertyType.CodeMeaning LIKE "%prostate%" OR 
    segmentations.AnatomicRegion.CodeMeaning LIKE "%prostate%"
    """

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

In [8]:
selection_df

Unnamed: 0,StudyInstanceUID,SeriesInstanceUID,SOPInstanceUID,gcs_url
0,1.3.6.1.4.1.14519.5.2.1.3671.4754.288848219213...,1.2.276.0.7230010.3.1.3.1426846371.5420.151320...,1.2.276.0.7230010.3.1.4.1426846371.5420.151320...,gs://public-datasets-idc/f1907408-9c91-43c4-bd...
1,1.3.6.1.4.1.14519.5.2.1.3671.4754.288848219213...,1.2.276.0.7230010.3.1.3.1426846371.5420.151320...,1.2.276.0.7230010.3.1.4.1426846371.5420.151320...,gs://public-datasets-idc/f1907408-9c91-43c4-bd...
2,1.3.6.1.4.1.14519.5.2.1.3671.4754.288848219213...,1.2.276.0.7230010.3.1.3.1426846371.5420.151320...,1.2.276.0.7230010.3.1.4.1426846371.5420.151320...,gs://public-datasets-idc/f1907408-9c91-43c4-bd...
3,1.3.6.1.4.1.14519.5.2.1.3671.4754.131608903441...,1.2.276.0.7230010.3.1.3.1426846371.17976.15132...,1.2.276.0.7230010.3.1.4.1426846371.17976.15132...,gs://public-datasets-idc/26d7e17c-7bc5-4c51-8b...
4,1.3.6.1.4.1.14519.5.2.1.3671.4754.131608903441...,1.2.276.0.7230010.3.1.3.1426846371.17976.15132...,1.2.276.0.7230010.3.1.4.1426846371.17976.15132...,gs://public-datasets-idc/26d7e17c-7bc5-4c51-8b...
...,...,...,...,...
525,1.3.6.1.4.1.14519.5.2.1.7310.5101.808270842417...,1.2.276.0.7230010.3.1.3.1070885483.13380.15991...,1.2.276.0.7230010.3.1.4.1070885483.13380.15991...,gs://public-datasets-idc/80f2eec5-e6b2-41b7-9b...
526,1.3.6.1.4.1.14519.5.2.1.7310.5101.808270842417...,1.2.276.0.7230010.3.1.3.1070885483.13380.15991...,1.2.276.0.7230010.3.1.4.1070885483.13380.15991...,gs://public-datasets-idc/80f2eec5-e6b2-41b7-9b...
527,1.3.6.1.4.1.14519.5.2.1.7310.5101.808270842417...,1.2.276.0.7230010.3.1.3.1070885483.13380.15991...,1.2.276.0.7230010.3.1.4.1070885483.13380.15991...,gs://public-datasets-idc/80f2eec5-e6b2-41b7-9b...
528,1.3.6.1.4.1.14519.5.2.1.3671.4754.133806669697...,1.2.276.0.7230010.3.1.3.1426846371.12016.15132...,1.2.276.0.7230010.3.1.4.1426846371.12016.15132...,gs://public-datasets-idc/c2ddd0fb-caea-4bea-ba...
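
Since the result has one row per DICOM instance, it helps to know how many studies and series those rows span. The sketch below counts them from plain row dictionaries; `summarize_selection` is a hypothetical helper, and the example rows are made up for illustration.

```python
def summarize_selection(records):
    """Count distinct studies, series and instances in an iterable of row dicts."""
    studies, series, instances = set(), set(), set()
    for row in records:
        studies.add(row["StudyInstanceUID"])
        series.add(row["SeriesInstanceUID"])
        instances.add(row["SOPInstanceUID"])
    return {
        "studies": len(studies),
        "series": len(series),
        "instances": len(instances),
    }


# Made-up rows for illustration only.
example_rows = [
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.1"},
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.2"},
    {"StudyInstanceUID": "1.2.4", "SeriesInstanceUID": "1.2.4.1", "SOPInstanceUID": "1.2.4.1.1"},
]
summary = summarize_selection(example_rows)
```

In the notebook, the rows could come from `selection_df.to_dict("records")`.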


# Visualization

In [9]:
# helper function to view a study or a specific series hosted by IDC
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

# use the study and series of the first row of the selection
my_StudyInstanceUID = selection_df["StudyInstanceUID"].iloc[0]
my_SeriesInstanceUID = selection_df["SeriesInstanceUID"].iloc[0]

print("URL to view the entire study:")
print(get_idc_viewer_url(my_StudyInstanceUID))

URL to view the entire study:
https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.3671.4754.288848219213026850354055725664
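
UIDs copied from other sources sometimes carry stray whitespace, which would end up in the URL if concatenated as-is. The variant below is a sketch that strips and encodes its inputs with the standard library; `build_idc_viewer_url` is a hypothetical name, and the URL layout is the same one used by `get_idc_viewer_url` above.

```python
from urllib.parse import quote, urlencode

VIEWER_BASE = "https://viewer.imaging.datacommons.cancer.gov/viewer/"


def build_idc_viewer_url(study_uid, series_uid=None):
    """Build a viewer URL for a study, optionally pointing at one series."""
    url = VIEWER_BASE + quote(study_uid.strip(), safe=".")
    if series_uid is not None:
        # urlencode takes care of escaping the query string value.
        url += "?" + urlencode({"seriesInstanceUID": series_uid.strip()})
    return url


# Made-up UIDs for illustration; note the stray tab and spaces.
example_url = build_idc_viewer_url("\t1.2.3", " 4.5.6 ")
```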


# Downloading

Downloading is easiest with the `gsutil` command line tool, which is preinstalled in Colab. `gsutil`, however, is not the fastest tool for downloading data, which is particularly important for large datasets. The sections below cover two tools you can use: `gsutil` and `s5cmd`. Complete documentation related to downloading data is here: https://learn.canceridc.dev/data/downloading-data.

## `gsutil`

In [10]:
import os
os.environ["DOWNLOAD_DEST"] = "/content/IDC_downloads"
os.environ["MANIFEST"] = "/content/idc_manifest.txt"

In [11]:
!mkdir -p ${DOWNLOAD_DEST}
!echo "gsutil cp \$* $DOWNLOAD_DEST" > gsutil_download.sh
!chmod +x gsutil_download.sh

In [12]:
# creating a manifest file for the subsequent download of files
selection_df["gcs_url"].to_csv(os.environ["MANIFEST"], header=False, index=False)

In [13]:
%%capture
# download is this simple (the %%capture magic has to be the first line of the cell)

!cat ${MANIFEST} | gsutil -m cp -I ${DOWNLOAD_DEST}

If you want to download a non-trivial amount of data, you will want to parallelize downloads, as illustrated below.

In [14]:
!cat ${MANIFEST} | xargs -n 25 -P 10 ./gsutil_download.sh

Copying gs://public-datasets-idc/7a73c7ce-b30c-4984-ab22-3fcaff395781.dcm...
Copying gs://public-datasets-idc/7a73c7ce-b30c-4984-ab22-3fcaff395781.dcm...
Copying gs://public-datasets-idc/9e6d816f-73bb-4e46-8c07-00dec8c831c1.dcm...
Copying gs://public-datasets-idc/b630454f-147c-4e8a-8394-cc2c08f976bd.dcm...
Copying gs://public-datasets-idc/a9becaad-6cd1-4ec1-be7b-3c8b3700e07c.dcm...
Copying gs://public-datasets-idc/7d149412-2d47-4560-a408-ea639e08c833.dcm...
Copying gs://public-datasets-idc/b630454f-147c-4e8a-8394-cc2c08f976bd.dcm...
Copying gs://public-datasets-idc/a9becaad-6cd1-4ec1-be7b-3c8b3700e07c.dcm...
/ [3 files][  1.3 MiB/  1.3 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://public-datasets-idc/b630454f-147c-4e8a
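
The `xargs -n 25 -P 10` invocation above splits the manifest into batches of 25 URLs and keeps up to 10 copy processes running. The same idea can be sketched in Python as below. This is an illustration under assumptions, not a replacement for the shell approach: `chunked`, `gsutil_copy` and `download_in_parallel` are hypothetical names, and `gsutil` is assumed to be on the `PATH`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def gsutil_copy(batch, dest):
    """Copy one batch of gs:// URLs into dest with a single gsutil call."""
    subprocess.run(["gsutil", "cp", *batch, dest], check=True)


def download_in_parallel(urls, dest, batch_size=25, workers=10, copy=gsutil_copy):
    """Run `copy` on each batch, with up to `workers` batches in flight."""
    batches = chunked(list(urls), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(lambda batch: copy(batch, dest), batches))
    return len(batches)


# Batching alone, on made-up URLs (no download is triggered here).
example_batches = chunked([f"gs://example-bucket/{i}.dcm" for i in range(60)], 25)
```

In the notebook, a call such as `download_in_parallel(selection_df["gcs_url"], os.environ["DOWNLOAD_DEST"])` would start the actual copy.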

## `s5cmd`

See details in https://learn.canceridc.dev/data/downloading-data/downloading-data-with-s5cmd. 

The steps below assume you created a Service Account, generated a key, and saved your credentials in Google Drive (those steps are covered in the aforementioned documentation article).

In [15]:
from google.colab import drive

drive.mount('/content/gdrive')

!mkdir -p ~/.aws

Mounted at /content/gdrive


In [None]:
!cp /content/gdrive/MyDrive/aws/credentials ~/.aws

In [None]:
!wget https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
!tar zxf s5cmd_2.0.0_Linux-64bit.tar.gz
!mv s5cmd /usr/bin

The cell below is just a quick test to confirm that the setup is working. If the file is copied without issues, we are all set.

In [None]:
!s5cmd --endpoint-url https://storage.googleapis.com cp s3://public-datasets-idc/eae91afc-1977-4728-9d6a-06f782c696d4.dcm .

The manifest used with `s5cmd` has a slightly different format: each line is an `s5cmd` command, and the URLs use the `s3://` prefix instead of `gs://`. The cell below demonstrates how to create it.

In [19]:
# create s5cmd manifest file for the subsequent download of files
import os

os.makedirs("s5cmd_downloaded_files", exist_ok=True)

# each manifest line is a "cp" command with the gs:// prefix replaced by s3://
s5cmd_commands = "cp " + selection_df["gcs_url"].str.replace("gs://", "s3://", regex=False) + " s5cmd_downloaded_files"
s5cmd_commands.to_csv("s5cmd_manifest.txt", header=False, index=False)
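
To make the manifest format explicit, here is what a single line looks like when built with plain string handling. This is a sketch; `to_s5cmd_command` is a hypothetical helper, and unlike the cell above it only rewrites the leading `gs://` prefix and rejects anything else.

```python
def to_s5cmd_command(gcs_url, dest_dir="s5cmd_downloaded_files"):
    """Turn one gs:// URL into an s5cmd 'cp' manifest line."""
    prefix = "gs://"
    if not gcs_url.startswith(prefix):
        raise ValueError(f"Not a gs:// URL: {gcs_url!r}")
    return f"cp s3://{gcs_url[len(prefix):]} {dest_dir}"


# The URL is the test file used earlier in this section.
example_line = to_s5cmd_command(
    "gs://public-datasets-idc/eae91afc-1977-4728-9d6a-06f782c696d4.dcm"
)
```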

Once the manifest is ready, use the following command to download the files.

In [20]:
%%capture
!s5cmd --endpoint-url https://storage.googleapis.com run s5cmd_manifest.txt
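
Because the output of the download cell is captured, a failed transfer can go unnoticed. A quick sanity check is to compare the manifest against the files on disk, as in the sketch below. It assumes each object is saved under its base name in a flat directory; `missing_downloads` is a hypothetical helper, and the demonstration uses a temporary directory with made-up URLs.

```python
import os
import tempfile


def missing_downloads(gcs_urls, download_dir):
    """Return the URLs whose file name is not present in download_dir."""
    present = set(os.listdir(download_dir)) if os.path.isdir(download_dir) else set()
    return [url for url in gcs_urls if url.rsplit("/", 1)[-1] not in present]


# Self-contained demonstration: one of the two expected files exists.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "a.dcm"), "w").close()
    demo_missing = missing_downloads(
        ["gs://example-bucket/a.dcm", "gs://example-bucket/b.dcm"], demo_dir
    )
```

In the notebook this could be called as `missing_downloads(selection_df["gcs_url"], "s5cmd_downloaded_files")`; an empty list means every file in the manifest was found.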

# Sorting

In [21]:
%%capture
!pip install pydicom
!git clone https://github.com/pieper/dicomsort
!sudo apt-get install -y dcmtk

In [22]:
import os
os.environ["SORTED_DEST"] = "/content/IDC_sorted"

!mkdir -p $SORTED_DEST
!rm -rf $SORTED_DEST/*
!python dicomsort/dicomsort.py -k -u $DOWNLOAD_DEST ${SORTED_DEST}/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm

100% 188/188 [00:40<00:00,  4.62it/s]
Files sorted
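
The target pattern passed to `dicomsort` above names the destination of each file in terms of DICOM attributes. The sketch below illustrates what such a pattern expresses by expanding it for a single set of attribute values. It is an illustration only, not `dicomsort`'s implementation; `expand_target_pattern` is a hypothetical helper and the UIDs are made up.

```python
import re


def expand_target_pattern(pattern, attributes):
    """Replace each %Keyword in pattern with the matching value from attributes."""

    def substitute(match):
        keyword = match.group(1)
        if keyword not in attributes:
            raise KeyError(f"No value for DICOM keyword {keyword}")
        return str(attributes[keyword])

    return re.sub(r"%([A-Za-z]+)", substitute, pattern)


example_path = expand_target_pattern(
    "/content/IDC_sorted/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm",
    {"StudyInstanceUID": "1.2", "SeriesInstanceUID": "1.2.3", "SOPInstanceUID": "1.2.3.4"},
)
```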


# Misc

## Mount Google Drive

Since everything you save in your Colab instance disappears when the runtime restarts, you may want to use a persistent location, such as Google Drive, for saving your artifacts.

In [23]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
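
Once Drive is mounted, results can be written under `/content/gdrive/MyDrive`. The sketch below picks an output directory depending on whether that path exists. `choose_output_dir` is a hypothetical helper, and the `idc_outputs` and `/content/outputs` names are placeholders chosen for illustration.

```python
import os


def choose_output_dir(
    drive_root="/content/gdrive/MyDrive",
    fallback="/content/outputs",
    subdir="idc_outputs",
):
    """Return a folder on Google Drive if it is mounted, else a local fallback."""
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, subdir)
    # Without Drive, anything written here is lost when the runtime restarts.
    return fallback


output_dir = choose_output_dir()
```

Create the directory before writing to it, for example with `os.makedirs(output_dir, exist_ok=True)`.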
