<a href="https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services, taking major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging TCIA's REST APIs to query and download data.**  If you're interested in additional TCIA notebooks and coding examples, check out https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about available Collections on the TCIA website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) datasets on TCIA are the easiest ways to become familiar with what is available.  These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets, non-DICOM segmentation data), and answer most common questions you might have about the datasets.  

# 2 REST API Overview 
TCIA uses software called NBIA to manage DICOM data.  The NBIA REST APIs include:
1. [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) that allow you to perform basic queries and download data from **public** collections. This API does not require a TCIA account.
2. [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) that allow you to perform basic queries and download data from **public and limited-access** collections. This API requires a TCIA account for creation of authentication tokens.
3. [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) that allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications.  This API requires a TCIA account for creation of authentication tokens.

**This notebook will focus on the NBIA Search REST APIs (not the Advanced API).**  
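
To make the overview a little more concrete, here is a minimal sketch of how a query URL for the Search API might be assembled by hand. The base URL, the `getSeries` endpoint name, and the `Collection`/`Modality` parameter names are assumptions based on the NBIA Search REST API guide linked above, so please confirm them there before relying on this. The `build_query_url` helper is hypothetical and is not part of **tcia_utils**. No request is sent; the code only builds the string.

```python
from urllib.parse import urlencode

# Assumed base URL for the public NBIA Search API; confirm it in the API guide.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def build_query_url(endpoint, **params):
    """Hypothetical helper: join an endpoint and its query parameters.

    Parameters whose value is None are left out of the query string.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    if query:
        return f"{BASE_URL}/{endpoint}?{query}"
    return f"{BASE_URL}/{endpoint}"


# Assumed endpoint and parameter names, used here only for illustration.
url = build_query_url("getSeries", Collection="TCGA-BRCA", Modality="MR")
print(url)
```

The functions in **tcia_utils** take care of this kind of URL handling for you, which is why the rest of this notebook uses them instead of raw requests.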

# 3 Import tcia_utils

The following cells import [**tcia_utils**](https://github.com/kirbyju/TCIA_Notebooks/raw/main/tcia_utils.py), which contains a variety of useful functions for accessing TCIA via Jupyter/Python. It includes two functions for downloading data: **downloadSampleSeries()** and **downloadSeries()**. The only difference between them is that **downloadSampleSeries()** only grabs the first 3 scans in the list of scans to download, which is useful for demonstration and testing purposes.

Both functions take a set of series UIDs to download.  By default, they expect JSON data containing "SeriesInstanceUID" elements, which can be generated using **getSeries()** in **tcia_utils**.  However, if you have a series UID list from some other source, you can set **input_type = "list"** to pass a Python list of series UIDs instead of JSON.
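
If it helps to see the two input shapes side by side, the sketch below turns JSON-style records into a plain Python list of series UIDs. The records are made up for illustration (real responses carry many more fields), and `extract_series_uids` is a hypothetical helper rather than something **tcia_utils** provides.

```python
# Hypothetical records shaped like the JSON described above.
series_json = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "MR"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "MR"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.3.4", "Modality": "CT"},
]


def extract_series_uids(records):
    """Collect the SeriesInstanceUID of every record that has one."""
    return [r["SeriesInstanceUID"] for r in records if "SeriesInstanceUID" in r]


uids = extract_series_uids(series_json)

# Keep only the first 3 UIDs, similar in spirit to a small sample download.
sample_uids = uids[:3]
print(sample_uids)
```

A list like `uids` is the kind of input you would pass along with **input_type = "list"**.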

The **api_url** parameter can be omitted in most cases.  However, it must be set to **api_url = "nlst"** to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection and you must use **api_url = "restricted"** for datasets that require logging in.  To download restricted data you must first use **getToken()** to create an API token with your username and password.

In addition to downloading the data, these functions return a dataframe of the series metadata describing the data that were downloaded.  You can optionally export a CSV of the series metadata by specifying the **csv_filename** parameter.

In [None]:
# imports
import requests
import pandas as pd

# download tcia_utils
tcia_utils_text = requests.get("https://github.com/kirbyju/TCIA_Notebooks/raw/main/tcia_utils.py")
tcia_utils_text.raise_for_status()  # stop here if the download failed
with open('tcia_utils.py', 'wb') as f:
    f.write(tcia_utils_text.content)

In [None]:
import tcia_utils as tcia

# 4 Download Examples

In this section we'll cover downloading data via the REST API for the following use cases:

1.   Download a full TCIA collection
2.   Download custom results of an API query
3.   Download custom results of an API query against the NLST collection
4.   Download a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" that was created via https://nbia.cancerimagingarchive.net/
5.   Download data from a TCIA manifest file
6.   Download data from a **restricted** collection that requires creating an API token

## 4.1 Download a full collection

You can [Browse Collections](https://www.cancerimagingarchive.net/collections) on our website to figure out what you might want to download, but you can also get a list of available collections via the API as shown below.

In [None]:
# get list of available collections as JSON
tcia.getCollections()


Let's say that we're interested in downloading the entire **Soft-tissue-Sarcoma** collection.  First we need to get a list of all Series Instance UIDs in that collection.  We can use **tcia.getSeries()** to return JSON metadata about all series (scans) in this collection.

In [None]:
tcia.getSeries(collection = "Soft-tissue-Sarcoma")

If we save the returned JSON to a variable, we can pass it to our download functions and view/save the metadata for what was downloaded.

In [None]:
# save output of getSeries()
data = tcia.getSeries(collection = "Soft-tissue-Sarcoma")

# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data)
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, csv_filename = "Soft-tissue-Sarcoma")
#display(df)

## 4.2 Download custom API query
The REST API allows for a variety of different query options as demonstrated in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries_for_Public_Datasets.ipynb).  For this use case, let's assume that you are only interested in the MR scans from the [TCGA-BRCA](https://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP) collection that were acquired on Siemens scanners.

In [None]:
# getSeries with query parameters
data = tcia.getSeries(collection = "TCGA-BRCA", 
               modality = "MR", 
               manufacturer = "SIEMENS")

print(len(data), 'Series returned')
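
Before downloading, it can be useful to get a rough feel for what a query returned. The sketch below tallies series by scanner model and adds up image counts. The records are invented, and the `ManufacturerModelName` and `ImageCount` field names are assumptions about the series metadata, so check the keys in your own results first. The `summarize_series` helper is hypothetical.

```python
from collections import Counter

# Invented records; the field names are assumptions about the series metadata.
records = [
    {"SeriesInstanceUID": "1.2.3.1", "ManufacturerModelName": "Avanto", "ImageCount": 120},
    {"SeriesInstanceUID": "1.2.3.2", "ManufacturerModelName": "Avanto", "ImageCount": 80},
    {"SeriesInstanceUID": "1.2.3.3", "ManufacturerModelName": "Symphony", "ImageCount": 60},
]


def summarize_series(records, key):
    """Count series per value of `key` and total the image counts."""
    counts = Counter(r.get(key, "unknown") for r in records)
    total_images = sum(int(r.get("ImageCount", 0)) for r in records)
    return counts, total_images


counts, total_images = summarize_series(records, "ManufacturerModelName")
print(dict(counts), total_images)
```

A summary like this is one way to decide whether to start with a small sample or go straight to the full download.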

Once again, let's pass those Series Instance UIDs to our download functions.

In [None]:
# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data)
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, csv_filename = "TCGA-BRCA_Siemens_MRIs")
#display(df)

## 4.3 Download custom NLST API query
Let's show a similar example where we look for a specific modality and manufacturer within the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  We have to set **api_url = "nlst"** in our functions for this to work, but otherwise the steps are the same.

In [None]:
# getSeries with query parameters
data = tcia.getSeries(collection = "NLST", 
               modality = "CT", 
               manufacturer = "Philips",
               api_url = "nlst")

print(len(data), 'Series returned')

In [None]:
# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data, api_url = "nlst")
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, api_url = "nlst", csv_filename = "NLST_Philips_CTs")
#display(df)

## 4.4 Download a "shared cart"
It's possible to use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others. After creating a Shared Cart you receive a URL like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can use the cart name to download the related scans via the API.

In [None]:
# getSharedCart metadata
data = tcia.getSharedCart(name = "nbia-49121659384603347")
print(len(data), 'Series returned')
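
If someone sends you the full Shared Cart URL rather than just the cart name, you can pull the name out of the `saved-cart` query parameter instead of copying it by hand. This is a small sketch using only the standard library; `cart_name_from_url` is a hypothetical helper and not part of **tcia_utils**.

```python
from urllib.parse import urlparse, parse_qs


def cart_name_from_url(url):
    """Return the value of the saved-cart query parameter in a cart URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("saved-cart")
    if not values:
        raise ValueError("URL does not contain a saved-cart parameter")
    return values[0]


cart_url = "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347"
cart_name = cart_name_from_url(cart_url)
print(cart_name)
```

The resulting string is what gets passed as the **name** parameter of **getSharedCart()**.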

In [None]:
# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data)
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, csv_filename = "my_shared_cart")
#display(df)

## 4.5 Download data from a TCIA manifest file

When working with manifest files in a notebook, you can install the NBIA Data Retriever to open the manifest and download the data as shown in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).  However, there may be cases where you don't have administrative rights to install software or prefer using the REST API to download the data listed in a manifest.

In order to demonstrate this use case, let's assume that after you [Browse Collections](https://www.cancerimagingarchive.net/collections) you decided you are interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) collection.  We can find the URL of the manifest to download the full collection by looking at the blue "Download" button on that page.  Then we can download the manifest with the following commands.

In [None]:
# download manifest file from RIDER Breast MRI page
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/22512757/doiJNLP-Fo0H1NtD.tcia?version=1&modificationDate=1534787017928&api=v2")
with open('RIDER_Breast_MRI.tcia', 'wb') as f:
    f.write(manifest.content)


If you open the manifest file in a text editor you'll see that it contains six lines of download parameters that precede a list of Series Instance UIDs.  The code below will put the Series UIDs into a list while ignoring the parameter text.

In [None]:
# initialize variable
data = []

# open file and write lines to a list
with open("RIDER_Breast_MRI.tcia") as f:
    for line in f:
        data.append(line.rstrip())

# remove the parameters from the list
del data[:6]
#print(data)

print("Result contains", len(data), "Series Instance UIDs (scans).")
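
Dropping a fixed number of lines works as long as the header is always the same length. As an alternative sketch, you could keep only the lines that look like DICOM UIDs (digits separated by dots), which does not depend on the header length. The manifest text below is invented for illustration, including its parameter names, and `parse_manifest_text` is a hypothetical helper.

```python
import re

# A UID is treated here as groups of digits separated by dots.
UID_PATTERN = re.compile(r"^\d+(\.\d+)+$")


def parse_manifest_text(text):
    """Return the lines of a manifest that look like series UIDs."""
    uids = []
    for line in text.splitlines():
        line = line.strip()
        if UID_PATTERN.match(line):
            uids.append(line)
    return uids


# Invented manifest: six made-up parameter lines followed by two made-up UIDs.
sample_manifest = """paramOne=https://example.org/download
paramTwo=true
paramThree=4
paramFour=example.tcia
paramFive=3.0
paramSix=
1.3.6.1.4.1.9328.50.7.1
1.3.6.1.4.1.9328.50.7.2
"""

uids = parse_manifest_text(sample_manifest)
print("Result contains", len(uids), "Series Instance UIDs (scans).")
```

Note that a parameter line such as `paramFive=3.0` is skipped because the pattern must match the whole line, not just part of it.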


Now we can pass this data to our download functions.  Note that we need to use the **input_type = "list"** parameter this time since the series UIDs are in a Python list rather than JSON.

In [None]:
# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data, input_type = "list")
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, input_type = "list", csv_filename = "RIDER_Breast_MRI")
#display(df)

## 4.6 Download data from a restricted collection
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page. The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account, you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions. **<font color='red'>Tokens are valid for 2 hours and must be refreshed after that point.</font>**

In [None]:
tcia.getToken()
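
Since tokens expire after 2 hours, a long-running notebook session may need to request a new one partway through. Here is a minimal sketch of how you might track that yourself. The `token_needs_refresh` helper and the 5-minute safety margin are hypothetical choices for illustration, and nothing here inspects the token itself.

```python
from datetime import datetime, timedelta

# Token lifetime as stated in this notebook.
TOKEN_LIFETIME = timedelta(hours=2)


def token_needs_refresh(issued_at, now=None, margin=timedelta(minutes=5)):
    """Return True once the token is within `margin` of its 2-hour lifetime."""
    if now is None:
        now = datetime.now()
    return now - issued_at >= TOKEN_LIFETIME - margin


# Example with fixed timestamps so the result does not depend on the clock.
issued_at = datetime(2024, 1, 1, 12, 0)
print(token_needs_refresh(issued_at, now=datetime(2024, 1, 1, 13, 0)))
print(token_needs_refresh(issued_at, now=datetime(2024, 1, 1, 13, 56)))
```

In practice you could record `datetime.now()` right after calling **getToken()** and call **getToken()** again whenever this check returns True.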

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval, you can use **tcia.getSeries()** to get a full list of series UIDs in this restricted collection by including **api_url = "restricted"** as a parameter.

In [None]:
# getSeries with query parameters
data = tcia.getSeries(collection = "QIN-Breast-02", 
                      api_url = "restricted")

print(len(data), 'Series returned')

Don't forget to include **api_url = "restricted"** in the download functions as well!

In [None]:
# feed data to our downloadSampleSeries function
df = tcia.downloadSampleSeries(data, 
                               api_url = "restricted")
display(df)

# Or download the full results using downloadSeries...
#df = tcia.downloadSeries(data, api_url = "restricted", csv_filename = "QIN-Breast-02")
#display(df)


# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7