You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Series_UID_Report.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Series_UID_Report.ipynb)

# Summary

This notebook generates collection-oriented summary reports that help you understand the contents of a TCIA manifest or a spreadsheet/list of TCIA Series Instance UIDs.

[Manifest files are used with the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/egOnAg) to download DICOM data from TCIA.  Manifest files for downloading full collections can be found on each collection's homepage.  Custom manifests can also be created via our search portal at https://nbia.cancerimagingarchive.net.

It is also possible to use the Export Metadata function on the "cart" page of https://nbia.cancerimagingarchive.net or use the [REST API](https://wiki.cancerimagingarchive.net/x/NIIiAQ) to create spreadsheets or lists of Series Instance UIDs of interest.

This notebook will provide a series-level metadata report and then help you prepare the data for use by the **reportCollectionSummary()** function in **tcia_utils**, which summarizes the data by ingesting the Series Instance UIDs (series_data) and returns the following:

    - Collections: List of unique collections (1 per row)
    - DOIs: List of unique values by collection
    - Modalities: List of unique values by collection
    - Licenses: List of unique values by collection
    - Manufacturers: List of unique values by collection
    - Body Parts: List of unique values by collection
    - Subjects: Number of subjects by collection
    - Studies: Number of studies by collection
    - Series: Number of series by collection
    - Images: Number of images by collection
    - Disk Space: Formatted as KB/MB/GB/TB/PB by collection
    - TimeStamp Min: Earliest TimeStamp date by collection
    - TimeStamp Max: Latest TimeStamp date by collection
    - UniqueTimestamps: List of dates on which new series were published by collection

Parameters:

    series_data: The input data to be summarized (expects JSON by default).
    input_type: Defaults to dataframe if not populated.
                Set to 'list' for Python list, or 'manifest' for *.TCIA manifest file.
                If manifest is used, series_data should be the path to the TCIA manifest file.
    format (str): Output format (default is dataframe, 'csv' for CSV file, 'chart' for charts).
    api_url: Only necessary if input_type = list or manifest.
            Set to 'restricted' for limited-access collections or
            'nlst' for the National Lung Screening Trial.
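
To make the per-collection counts listed above more concrete, here is a minimal pandas sketch of how such a summary could be computed from series-level rows. It is not the **tcia_utils** implementation: the sample rows are invented, the `format_bytes` helper is hypothetical, and the use of decimal units (1 KB = 1000 bytes) is an assumption.

```python
import pandas as pd

def format_bytes(num_bytes):
    """Render a byte count with a B/KB/MB/GB/TB/PB suffix (decimal units assumed)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000:
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} PB"

# Invented series-level rows, one row per series
series = pd.DataFrame({
    "Collection": ["A", "A", "A", "B"],
    "PatientID": ["p1", "p1", "p2", "p3"],
    "StudyInstanceUID": ["s1", "s1", "s2", "s3"],
    "SeriesInstanceUID": ["1.1", "1.2", "1.3", "1.4"],
    "Modality": ["CT", "MR", "CT", "PT"],
    "ImageCount": [100, 50, 200, 30],
    "FileSize": [52_000_000, 26_000_000, 104_000_000, 1_500],
})

# One output row per collection: unique-value lists, distinct counts, and totals
summary = series.groupby("Collection").agg(
    Modalities=("Modality", lambda s: ", ".join(sorted(s.unique()))),
    Subjects=("PatientID", "nunique"),
    Studies=("StudyInstanceUID", "nunique"),
    Series=("SeriesInstanceUID", "nunique"),
    Images=("ImageCount", "sum"),
    FileSize=("FileSize", "sum"),
)
summary["Disk Space"] = summary["FileSize"].apply(format_bytes)
print(summary)
```

In practice **reportCollectionSummary()** does this work for you; the sketch only illustrates what "by collection" means for each column.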

# 1 Setup

Install the latest release of [**tcia_utils**](https://pypi.org/project/tcia-utils/) and Pandas if you haven't already.

In [None]:
!pip install --upgrade -q tcia-utils
!pip install -q pandas

Import the modules we'll need.

In [None]:
# imports
import requests
import pandas as pd
from tcia_utils import nbia
api_url = ""

# 2 Create a Token (optional)
If you're working with any restricted collections, you must enter your TCIA login/password to create a token.  If not, you can skip this step.




In [None]:
# set api_url to 'restricted' in report query if token is created
api_url = "restricted"

nbia.getToken()

# 3 Prepare your Series UIDs

If you already have the file containing your series UIDs saved on the machine where this notebook is running, you can skip this step. Otherwise:

1. To import a file to Colab from your hard drive, use the menu on the left sidebar to upload it.
2. To import a file from the web (e.g. TCIA), use the command in the next cell by updating it with the URL of the file you want to analyze.  



In [None]:
# OPTIONAL: import your UID file from the web
url = "https://URL_on_TCIA/manifest.tcia"
local_filename = "manifest.tcia"

# download the file and stop on HTTP errors rather than saving an error page
response = requests.get(url)
response.raise_for_status()
with open(local_filename, 'wb') as f:
    f.write(response.content)

Next we'll read the UIDs from your file into a Python list.  If you're using a manifest file, the code below will put the Series UIDs into a list while ignoring the parameter text.
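
To illustrate what "ignoring the parameter text" involves, here is a minimal sketch that assumes the manifest starts with `key=value` parameter lines followed by one Series UID per line. The `read_manifest_uids` helper and the sample header values are hypothetical, and this is not necessarily how `nbia.manifestToList()` is implemented.

```python
import os
import tempfile

def read_manifest_uids(path):
    """Return the UIDs in a manifest, skipping blank and key=value lines."""
    uids = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Parameter lines contain '='; a UID never does
            if not line or "=" in line:
                continue
            uids.append(line)
    return uids

# Invented manifest content, for demonstration only
sample = "\n".join([
    "downloadServerUrl=https://example.org/download",
    "includeAnnotation=true",
    "manifestVersion=3.0",
    "ListOfSeriesToDownload=",
    "1.2.3.4",
    "1.2.3.5",
    "",
])
with tempfile.NamedTemporaryFile("w", suffix=".tcia", delete=False) as tmp:
    tmp.write(sample)

demo_uids = read_manifest_uids(tmp.name)
os.remove(tmp.name)
print(demo_uids)
```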

If you're using a custom text/CSV file of UIDs, every row is inserted into the list.  Verify that the file is formatted correctly **(one UID per row with no column header or commas)** or you may encounter errors.
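
One way to check a custom file before using it is sketched below. It assumes the usual DICOM UID form (dot-separated numeric components, at most 64 characters); the `find_invalid_uids` helper is hypothetical and not part of **tcia_utils**. Blank rows are reported as problems too.

```python
import re

# Dot-separated groups of ASCII digits, e.g. 1.2.840.10008
UID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)*")

def find_invalid_uids(lines):
    """Return (row_number, text) pairs for rows that don't look like a single UID."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if len(text) > 64 or not UID_PATTERN.fullmatch(text):
            problems.append((number, text))
    return problems

# Invented rows: a stray column header and a comma-separated row are flagged
rows = ["SeriesInstanceUID", "1.3.6.1.4.1.14519.5.2.1.1", "1.2.3,1.2.4", "1.2.840.10008"]
bad_rows = find_invalid_uids(rows)
print(bad_rows)
```

With a real file, you could pass the open file object (for example `find_invalid_uids(open("my_uids.txt"))`, where the filename is a placeholder) and fix any reported rows before continuing.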

For large data sets, this code splits the list into groups of 10,000 series UIDs so that the server doesn't time out when you generate the report.

In [None]:
# enter manifest path/filename
manifest = "manifest.tcia"

# converts manifest to list of UIDs
uids = nbia.manifestToList(manifest)

# break up the list into smaller chunks if needed
chunk_size = 10000
if len(uids) > chunk_size:
    chunked_uids = list()
    for i in range(0, len(uids), chunk_size):
        chunked_uids.append(uids[i:i+chunk_size])
    # Count how many chunks
    chunk_count = len(chunked_uids)
    print("Your data has been imported and split into", chunk_count, "groups.")
else:
    chunk_count = 0
    print("Your data has been imported.")


# 4 Download series metadata

The next step creates a dataframe and saves **scan_metadata.csv** containing the Collection Name, Subject ID, Study UID, Study Description, Study Date, Series UID, Series Description, Number of Images, File Size (Bytes), Modality, Manufacturer, Data Description URI (DOI), SOP Class UID, License Name, and License URL for each scan.

**Note:** Due to its size (> 26,000 patients!) the [National Lung Screening Trial](https://doi.org/10.7937/TCIA.HMQ8-J677) resides on a separate server.  If you'd like to create a report about this collection, use the second cell below instead of the first.

In [None]:
# use for regular collections
if chunk_count == 0:
    df = nbia.getSeriesList(uids)
else:
    dfs = []  # one DataFrame per chunk
    for count, chunk in enumerate(chunked_uids):
        # save each chunk's metadata under a numbered filename
        chunk_df = nbia.getSeriesList(chunk, csv_filename=f"scan_metadata_{count}")
        dfs.append(chunk_df)

    # concatenate all the DataFrames in the list into a single DataFrame
    df = pd.concat(dfs, ignore_index=True)

display(df)


In [None]:
# use for NLST collection
api_url = "nlst"

if chunk_count == 0:
    df = nbia.getSeriesList(uids, api_url=api_url)
else:
    dfs = []  # one DataFrame per chunk
    for count, chunk in enumerate(chunked_uids):
        # save each chunk's metadata under a numbered filename
        chunk_df = nbia.getSeriesList(chunk, api_url=api_url, csv_filename=f"scan_metadata_{count}")
        dfs.append(chunk_df)

    # concatenate all the DataFrames in the list into a single DataFrame
    df = pd.concat(dfs, ignore_index=True)

display(df)

# 5 Create the summary report
Now we can use the metadata we've downloaded to create the summary report.

In [None]:
# rename df columns to match expected input
df = df.rename(columns={'Subject ID': 'PatientID',
                        'Study UID': 'StudyInstanceUID',
                        'Series ID': 'SeriesInstanceUID',
                        'Number of images': 'ImageCount',
                        'Collection Name': 'Collection',
                        'File Size (Bytes)': 'FileSize',
                        'Data Description URI': 'CollectionURI',
                        'License Name': 'LicenseName',
                        'Series Number': 'SeriesNumber',
                        'License URL': 'LicenseURI'})

nbia.reportCollectionSummary(df, format = "chart")
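
By default, `DataFrame.rename` silently ignores mapping keys that don't match any column, so a column whose name differs slightly from the mapping keeps its old name without any warning. The sketch below shows one way to catch that before generating the report. The `missing_columns` helper and the sample frame are hypothetical, and the list of required names is only an assumption based on the targets of the rename mapping.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

# Invented frame whose series column is named 'Series UID' rather than 'Series ID'
demo = pd.DataFrame({
    "Subject ID": ["p1"],
    "Study UID": ["1.2"],
    "Series UID": ["1.2.3"],
})
demo = demo.rename(columns={"Subject ID": "PatientID",
                            "Study UID": "StudyInstanceUID",
                            "Series ID": "SeriesInstanceUID"})

# The unmatched key was skipped, so the target column is missing
missing = missing_columns(demo, ["PatientID", "StudyInstanceUID", "SeriesInstanceUID"])
print(missing)
```

Running the same check against `df` with the target names from the rename cell gives an early, readable signal if the metadata columns ever change.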

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7