# GDC April 2021 Webinar: Using the GDC API

### Monday, April 26, 2021<br>2:00 PM - 3:00 PM (EST)<br>Bill Wysocki, Director of User Services <br>University of Chicago

## API User's Guide and Other Helpful Links

[GDC API User's Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/)

[GDC Support Website](https://gdc.cancer.gov/support)

support@nci-gdc.datacommons.io - GDC Helpdesk E-mail

[Requests Python Package User's Guide](https://2.python-requests.org/en/master/)

[Python Documentation Website](https://www.python.org)

# Notebook Overview 


### About this notebook

- This notebook is a resource for GDC users to familiarize themselves with the GDC API endpoints and capabilities. Users can edit the template functions to create custom queries "in place", as well as perform other data analyses and visualizations within the Jupyter Notebook interface
- Jupyter Notebook documentation can be found at https://jupyter.org
- Commands and functions in this notebook will rely on the following Python packages:
    - `requests` - if not already installed on your system, run `pip install requests` from the command line or in a new notebook code cell
    - `json` - part of Python standard library, should already be installed on system
    - `urllib` - part of Python standard library, should already be installed on system

In [1]:
#import packages to use in this notebook

import requests
import json
import urllib.parse  #the parse submodule provides quote() for percent encoding

### Table of Contents

- [GDC API Overview](#api_overview)
- [How to Search and Retrieve Data with GDC API](#search_retrieve)
- [How to Download Files with GDC API](#download)
- [How to Perform BAM Slicing with GDC API](#bam_slice)
- [How to Submit Data with GDC API](#submit)


# <a id='api_overview'>GDC API Overview</a>


- The GDC Application Programming Interface (API) is the external facing REpresentational State Transfer (REST) interface for the GDC, which supports user interactions with the GDC Submission and Data Portals, as well as provides developers with a programmatic interface to query and download GDC data, metadata and annotations and submit data to the GDC.
- The GDC API uses JSON as its communication format, and uses standard HTTP methods (GET, PUT, POST and DELETE)
- The [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) client also relies on the GDC API for user authentication, reading manifests, and for download and upload features


### GDC API Format

- GDC API URL is: https://api.gdc.cancer.gov/
- GDC API format for non-submission use is: <b>API_URL + ENDPOINT + QUERY_PARAMETERS</b>
    - In order to utilize the GDC API, calls to specific API endpoints for a given task need to be made
    - Query parameters can be included, such as <b>filters</b> to search and distill results, and <b>fields</b> or types of data to return
    - Formatting parameters can be specified such as <b>format</b> and <b>pretty</b>
    - Examples of endpoints include `cases`, `files` and `projects`; examples of query parameters include `filters`, `fields`, `size`, `format` and `pretty`
- The format for using the GDC API Submission endpoint is https://api.gdc.cancer.gov/submission/<b>program_name/project_code</b>, e.g. https://api.gdc.cancer.gov/submission/TCGA/LUAD or https://api.gdc.cancer.gov/submission/CPTAC/3
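
As a rough illustration of the API_URL + ENDPOINT + QUERY_PARAMETERS format, the sketch below splits a request URL of the kind used in this notebook back into its three parts with the standard library. The URL is shortened for readability and the variable names are illustrative only.

```python
import json
from urllib.parse import urlsplit, parse_qs

# A shortened request URL of the kind built in this notebook
url = ("https://api.gdc.cancer.gov/cases"
       "?filters=%7B%22op%22%3A%20%22%3D%22%2C%20%22content%22%3A%20%7B%22field%22%3A%20"
       "%22cases.project.program.name%22%2C%20%22value%22%3A%20%22TCGA%22%7D%7D"
       "&fields=submitter_id,samples.submitter_id,samples.is_ffpe"
       "&size=1&format=json&pretty=true")

parts = urlsplit(url)
api_url = f"{parts.scheme}://{parts.netloc}/"   # API_URL
endpoint = parts.path.lstrip("/")               # ENDPOINT
params = parse_qs(parts.query)                  # QUERY_PARAMETERS (each value is a list)

# parse_qs percent-decodes the values, so the filter is plain JSON again
filters = json.loads(params["filters"][0])

print(api_url)
print(endpoint)
print(sorted(params))
print(filters)
```

Reading a URL this way can be a quick check that a hand-built query contains the endpoint and parameters that were intended.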

# <a id='search_retrieve'>How to Search and Retrieve Data with GDC API</a>

### Overview

### Endpoints

There are two 'types' of endpoints that can be used to search and retrieve data:


[GDC Search and Retrieval Endpoints](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#endpoints) - includes endpoints that cover `project`, `file` and `case` information, including clinical and biospecimen metadata, as well as file version and history

[GDC Analysis Endpoints](https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/) - endpoints that are used by the GDC data analysis, visualization and exploration (DAVE) tools in the Exploration tab of the GDC Data Portal to access indexed data including gene, mutation, copy number variation and survival data. 


### Steps

1. Specify and encode `filters`*
2. Specify `fields` to be returned*
3. Specify additional parameters (`size`, `format` of results etc.)
4. Build query URL
5. Submit query and save response text to file

*specifying `filters` and `fields` is optional; not specifying filters will return all results at the endpoint, and not specifying fields will return all fields at the endpoint
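
The first four steps can be sketched with the standard library alone. This is a minimal sketch, not the notebook's template function: `build_search_url` and its keyword arguments are hypothetical names, and it simply leaves out any parameter that is not supplied, which is one way to treat `filters` and `fields` as optional.

```python
import json
from urllib.parse import urlencode, quote

API_URL = "https://api.gdc.cancer.gov/"

def build_search_url(endpoint, filters=None, fields=None, size=None, fmt=None):
    """Hypothetical helper: build a search URL, skipping parameters left unset."""
    params = {}
    if filters:                  # step 1: filters are JSON, percent-encoded below
        params["filters"] = json.dumps(filters)
    if fields:                   # step 2: comma-delimited list of fields
        params["fields"] = ",".join(fields)
    if size is not None:         # step 3: additional parameters
        params["size"] = size
    if fmt:
        params["format"] = fmt
    # step 4: quote_via=quote encodes spaces as %20 rather than '+'
    query = urlencode(params, quote_via=quote)
    return API_URL + endpoint + ("?" + query if query else "")

url = build_search_url(
    "cases",
    filters={"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-BRCA"}},
    fields=["submitter_id", "disease_type"],
    size=2,
    fmt="tsv",
)
print(url)
# step 5 would be requests.get(url), with the response text written to a file
```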

### Template queryBuilder() function

- Search and Retrieval requests can be built as a URL with the API URL, the endpoint and other parameters specified
- Users can edit a template function that builds the URL request for querying data in the GDC API to include other parameters, such as `facets`, `expand`, `from` (pagination) and `sort`: 
https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#request-parameters
- To specify default parameters, such as no `filters`, no `fields`, or default number of results returned (`size`), users can simply input two quotation marks, i.e. `''`

In [7]:
#format is specified as 'frmat' in function as format is an already declared object in python [the format() function]

def queryBuilder(endpoint, filters, fields, size, frmat):
    api_url = 'https://api.gdc.cancer.gov/'
    request_query = api_url + endpoint + '?filters=' + filters + '&fields=' + fields + '&size=' + size + '&format=' + frmat

    #pretty-printing is only requested for JSON output
    if frmat.lower() == 'json':
        request_query += '&pretty=true'
    return request_query
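
The `from` (pagination) and `size` parameters mentioned above can be combined to walk through a large result set one page at a time. The sketch below only generates the page URLs. `page_urls` is a hypothetical helper, the total is assumed to be known already (for example from the `pagination` block of a JSON response), and `from` is assumed to be a zero-based offset, which should be checked against the request parameter documentation.

```python
from urllib.parse import urlencode

API_URL = "https://api.gdc.cancer.gov/"

def page_urls(endpoint, total, page_size, fields=""):
    """Hypothetical helper: yield one URL per page of `page_size` results."""
    for offset in range(0, total, page_size):
        # Assumption: 'from' is a zero-based offset into the result set
        params = {"from": offset, "size": page_size}
        if fields:
            params["fields"] = fields
        yield API_URL + endpoint + "?" + urlencode(params)

urls = list(page_urls("cases", total=250, page_size=100, fields="submitter_id"))
for u in urls:
    print(u)
```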

### Templates for query search parameters (<font color="red">filters</font>)

- Filters are built from operators (`op`), fields (`field`) and values (`value`)
- Filters need to be in JSON format, which then needs to be percent encoded to be sent in a URL request
- The `urllib` package is used for the percent encoding
- The available operators are listed in the [filters section of the GDC API User's Guide](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#filters-specifying-the-query)
- Filters need to use the fields that are available at the endpoint being queried
- Specifying no filters will return all entities for a given endpoint
- Several examples to edit are provided below
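
Because every filter has the same `op`/`content` shape, nested filters can also be assembled programmatically rather than typed out by hand. The sketch below is one possible way to do that; `eq` and `combine` are hypothetical helper names, not part of the GDC API or of this notebook's templates.

```python
import json

def eq(field, value):
    """Hypothetical helper: a single '=' filter on one field."""
    return {"op": "=", "content": {"field": field, "value": value}}

def combine(op, *filters):
    """Hypothetical helper: join filters with a logical 'and' or 'or'."""
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    return {"op": op, "content": list(filters)}

# (project is TCGA-BRCA) AND (disease type is one OR the other)
nested = combine(
    "and",
    eq("cases.project.project_id", "TCGA-BRCA"),
    combine(
        "or",
        eq("cases.disease_type", "cystic, mucinous and serous neoplasms"),
        eq("cases.disease_type", "ductal and lobular neoplasms"),
    ),
)
print(json.dumps(nested, indent=2))
```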

In [10]:
#one filter applied to endpoint

one_filter = {
    "op":"=",
    "content":{
        "field": "cases.project.project_id",
        "value": "TCGA-BRCA"
    }
}

In [None]:
#combination of two filters applied to endpoint, i.e. (x AND/OR y) must be met

combination_two = {
    "op" : "and",
    "content":[
        {
            "op":"=",
            "content":{
                "field": "cases.project.project_id",
                "value": "TCGA-BRCA"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"cases.disease_type",
                "value": "ductal and lobular neoplasms"
            }
        }
    ]
}

In [32]:
#combination of three filters applied to endpoint, i.e. (x AND/OR y AND/OR z) must be met

combination_three = {
    "op" : "and",
    "content":[
        {
            "op":"=",
            "content":{
                "field": "cases.project.project_id",
                "value": "TCGA-BRCA"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"cases.disease_type",
                "value": "ductal and lobular neoplasms"
            }
        },
        {
            "op":">",
            "content":{
                "field":"diagnoses.age_at_diagnosis", #age at diagnosis is recorded in days
                "value": "15000"
            }
        }
    ]
}

In [None]:
#complex combination of three filters applied to endpoint, i.e. (x AND/OR [y AND/OR z]) must be met

combination_three_2 = {
    "op": "and",
    "content": [{
            "op": "=",
            "content": {
                "field": "cases.project.project_id",
                "value": "TCGA-BRCA"
            }
        },
        {
            "op": "or",
            "content": [{
                    "op": "=",
                    "content": {
                        "field": "cases.disease_type",
                        "value": "cystic, mucinous and serous neoplasms"
                    }
                },
                {
                    "op": "=",
                    "content": {
                        "field": "cases.disease_type",
                        "value": "ductal and lobular neoplasms"
                    }
                }
            ]
        }
    ]
}

### Template commands for formatting search query parameters

In [69]:
#percent encoding of filters
json_string = json.dumps(one_filter)
example_filter = urllib.parse.quote(json_string)
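
A quick way to confirm that percent encoding has not altered a filter is to decode it again and compare it with the original. The sketch below does this round trip with `urllib.parse.quote` and `urllib.parse.unquote`; the variable names are illustrative.

```python
import json
import urllib.parse

example = {"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-BRCA"}}

# Encode as in the cell above, then reverse both steps
encoded = urllib.parse.quote(json.dumps(example))
decoded = json.loads(urllib.parse.unquote(encoded))

print(encoded)
assert decoded == example  # decoding restores the original filter
```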

### Template for formatting <font color="red">fields</font> to be returned by query

- Fields are specified as a comma-delimited list of the fields to be returned
- Specifying no fields will return all available fields for the entities that match the filters at the endpoint

In [57]:
#specify fields to be returned
example_fields = ",".join([
    "submitter_id",
    "disease_type",
    "samples.submitter_id",
    "samples.sample_type", 
    "samples.tissue_type",
    "diagnoses.age_at_diagnosis"
])

### Template API `GET` Request 

In [72]:
#build API query: queryBuilder(endpoint, filters, fields, size, frmat)

#to specify no filters and/or no fields to return, replace variable with <''>

template_request = queryBuilder('cases', example_filter, example_fields, '11315', "json")

template_request

'https://api.gdc.cancer.gov/cases?filters=%7B%22op%22%3A%20%22%3D%22%2C%20%22content%22%3A%20%7B%22field%22%3A%20%22cases.project.project_id%22%2C%20%22value%22%3A%20%22TCGA-BRCA%22%7D%7D&fields=submitter_id,disease_type,samples.submitter_id,samples.sample_type,samples.tissue_type,diagnoses.age_at_diagnosis&size=11315&format=json&pretty=true'

##### Note: You can also copy and paste the formatted request URL into the browser URL bar to return results in the browser

In [73]:
#send request
result = requests.get(template_request)

#write request results to file, edit file path, name and type
#the with statement closes the file automatically
with open("path/to/file/results.json", "w") as output:
    output.write(result.text)

### Example 1: Retrieve sample type and primary diagnosis data for DNA-seq files in TCGA-BRCA project

- Retrieve whether BAM files are for normal or tumor samples, as well as what disease cases were diagnosed as, in the TCGA-BRCA project
- Use 'files' endpoint, as this endpoint contains metadata related to files in the GDC (such as experimental strategy and data category)
- Need to filter down to files that are of the data category "sequencing reads" and experimental strategy type "WXS" (whole exome) to filter out other categories (like copy number variation, gene expression) and other experimental strategies (like RNA-Seq).

In [75]:
#step 1: specify and encode filters

filters = {
    "op" : "and",
    "content":[
        {
            "op":"=",
            "content":{
                "field": "cases.project.project_id",
                "value": "TCGA-BRCA"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"files.data_category",
                "value": "sequencing reads"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"files.experimental_strategy",
                "value": "WXS"
            }
        },
        {
            "op":"=", 
            "content":{
                "field":"files.data_format",
                "value": "BAM"
            }
        }
        
    ]
}

json_string = json.dumps(filters)
filters_format = urllib.parse.quote(json_string)

#step 2: specify fields to be returned
fields = ",".join([
    "cases.submitter_id",
    "file_name",
    "cases.samples.sample_type",
    "cases.diagnoses.primary_diagnosis"
])

#step 3+4: build query url with 'files' endpoint, specify size=1 and format=tsv
brca_request = queryBuilder('files', filters_format, fields, '1', "tsv")

#step 5: send request
brca_result = requests.get(brca_request)

print(brca_result.text)

cases.0.diagnoses.0.primary_diagnosis	cases.0.samples.0.sample_type	cases.0.submitter_id	file_name	id
Infiltrating duct carcinoma, NOS	Blood Derived Normal	TCGA-AR-A0TQ	TCGA-AR-A0TQ-10A-01D-A099-09_IlluminaGA-DNASeq_exome_gdc_realn.bam	7e089ffc-a3c6-4182-b14a-b0c75c829af2
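
TSV responses flatten nested fields into column names such as `cases.0.samples.0.sample_type`, so they can be read back into rows with the standard `csv` module. The sketch below reproduces the response printed above as a literal string for illustration; in practice the text would come from `brca_result.text`.

```python
import csv
import io

# The response text printed above, reproduced as a literal for illustration
tsv_text = (
    "cases.0.diagnoses.0.primary_diagnosis\tcases.0.samples.0.sample_type\t"
    "cases.0.submitter_id\tfile_name\tid\n"
    "Infiltrating duct carcinoma, NOS\tBlood Derived Normal\tTCGA-AR-A0TQ\t"
    "TCGA-AR-A0TQ-10A-01D-A099-09_IlluminaGA-DNASeq_exome_gdc_realn.bam\t"
    "7e089ffc-a3c6-4182-b14a-b0c75c829af2\n"
)

# Each row becomes a dict keyed by the flattened column names
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
for row in rows:
    print(row["cases.0.submitter_id"], row["cases.0.samples.0.sample_type"], sep="\t")
```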



### Example 2: Retrieve FFPE data for samples and portions for TCGA projects

- Retrieve whether case samples and portions taken from cases in TCGA projects were Formalin-Fixed Paraffin-Embedded (FFPE) specimens or not
- Will use the 'cases' endpoint, as this endpoint contains biospecimen and clinical information related to cases and samples in the GDC

In [29]:
#step 1: specify and encode filters
filters = {
    "op":"=",
    "content":{
        "field": "cases.project.program.name",
        "value": "TCGA"
    }
}

json_string = json.dumps(filters)
filters_format = urllib.parse.quote(json_string)

#step 2: specify fields to be returned
fields = ",".join([
    "submitter_id",
    "samples.submitter_id",
    "samples.is_ffpe",
    "samples.portions.submitter_id",
    "samples.portions.is_ffpe", 
])

#step 3+4: build query url with 'cases' endpoint, specify size=1 and format=json
ffpe_request = queryBuilder('cases', filters_format, fields, '1', "json")

#step 5: send request
ffpe_result = requests.get(ffpe_request)

print(ffpe_result.text)
ffpe_request

{
  "data": {
    "hits": [
      {
        "id": "375436b3-66ac-4d5e-b495-18a96d812a69", 
        "submitter_id": "TCGA-F5-6810", 
        "samples": [
          {
            "submitter_id": "TCGA-F5-6810-01Z", 
            "is_ffpe": true
          }, 
          {
            "submitter_id": "TCGA-F5-6810-10A", 
            "is_ffpe": false, 
            "portions": [
              {
                "submitter_id": "TCGA-F5-6810-10A-01", 
                "is_ffpe": false
              }
            ]
          }, 
          {
            "submitter_id": "TCGA-F5-6810-01A", 
            "is_ffpe": false, 
            "portions": [
              {
                "submitter_id": "TCGA-F5-6810-01A-11", 
                "is_ffpe": false
              }, 
              {
                "submitter_id": "TCGA-F5-6810-01A-13-1935-20", 
                "is_ffpe": false
              }
            ]
          }
        ]
      }
    ], 
    "pagination": {
      "count": 1, 
      "total": 1

'https://api.gdc.cancer.gov/cases?filters=%7B%22op%22%3A%20%22%3D%22%2C%20%22content%22%3A%20%7B%22field%22%3A%20%22cases.project.program.name%22%2C%20%22value%22%3A%20%22TCGA%22%7D%7D&fields=submitter_id,samples.submitter_id,samples.is_ffpe,samples.portions.submitter_id,samples.portions.is_ffpe&size=1&format=json&pretty=true'
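
JSON responses keep the nesting of cases, samples and portions, so a short loop is enough to flatten them into one record per sample or portion. The sketch below works on an abridged copy of the response shown above; in practice the parsed data would come from `ffpe_result.json()`. The tuple layout is an arbitrary choice for illustration.

```python
import json

# Abridged copy of the response shown above
response_text = """
{"data": {"hits": [{
  "id": "375436b3-66ac-4d5e-b495-18a96d812a69",
  "submitter_id": "TCGA-F5-6810",
  "samples": [
    {"submitter_id": "TCGA-F5-6810-01Z", "is_ffpe": true},
    {"submitter_id": "TCGA-F5-6810-10A", "is_ffpe": false,
     "portions": [{"submitter_id": "TCGA-F5-6810-10A-01", "is_ffpe": false}]}
  ]}]}}
"""

records = []
for case in json.loads(response_text)["data"]["hits"]:
    for sample in case.get("samples", []):
        records.append((case["submitter_id"], "sample", sample["submitter_id"], sample["is_ffpe"]))
        # "portions" is absent for some samples, so default to an empty list
        for portion in sample.get("portions", []):
            records.append((case["submitter_id"], "portion", portion["submitter_id"], portion["is_ffpe"]))

for record in records:
    print(*record, sep="\t")
```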

### Example 3: Analysis endpoint, survival data and mutations/genes

### Example 4: Days to Death after Diagnosis and Vital Status for cases in TCGA-HNSC project

- remove TCGA0HNSc as well as 

# <a id='download'>How to Download Files with GDC API</a>

### Overview

- Users can download 
- 

### Endpoint

- The endpoint to download files is `https://api.gdc.cancer.gov/data/`
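
A download request returns the file contents in the response body, so the response needs to be written to disk to keep the file. The sketch below is one possible way to do that with `requests`. `filename_from_header` and `download_file` are hypothetical helpers, and the sketch assumes the server supplies the file name in a `Content-Disposition` header and that a token, where one is needed, is sent in the `X-Auth-Token` header.

```python
import re

DATA_ENDPT = "https://api.gdc.cancer.gov/data/"

def filename_from_header(content_disposition, default):
    """Hypothetical helper: pull the file name out of a Content-Disposition header."""
    match = re.search(r'filename="?([^";]+)"?', content_disposition or "")
    return match.group(1) if match else default

def download_file(file_id, token=None):
    """Hypothetical helper: download one file by UUID into the working directory."""
    import requests  # imported here so the helper above runs without the package

    headers = {"X-Auth-Token": token} if token else {}
    response = requests.get(DATA_ENDPT + file_id, headers=headers, stream=True)
    response.raise_for_status()
    # Fall back to the UUID if the header is missing (an assumption of this sketch)
    name = filename_from_header(response.headers.get("Content-Disposition"), file_id)
    with open(name, "wb") as output:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            output.write(chunk)
    return name

print(filename_from_header("attachment; filename=example.tsv", "fallback"))
```

Streaming the response in chunks avoids holding a large file in memory all at once.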

### Steps



In [8]:
import requests

In [9]:
requests.get('https://api.gdc.cancer.gov/data/5b2974ad-f932-499b-90a3-93577a9f0573')

#curl --remote-name --remote-header-name 'https://api.gdc.cancer.gov/data/5b2974ad-f932-499b-90a3-93577a9f0573'

<Response [200]>

# <a id='bam_slice'>How to Perform BAM Slicing with GDC API</a>

### Endpoint


# <a id='submit'>How to Submit Data with GDC API</a>

### Overview

- Submitters can make use of the `submission` GDC API endpoint to submit node entities to submission projects
- Submission will require a token downloaded from the [GDC Submission Portal](https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Data_Submission_Process/#authentication)
- Data can be submitted in `JSON` or `TSV` format; depending on the data format, users will need to edit the `"Content-Type"` in the request command (see below)
- Submittable files (such as FASTQ or BAM files) should be uploaded with the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
- Additional features and more information regarding submission using the GDC API can be found here: https://docs.gdc.cancer.gov/API/Users_Guide/Submission/


### Endpoint

- The format for using the GDC API Submission endpoint uses the project information, i.e. `https://api.gdc.cancer.gov/submission/<program_name>/<project_code>`
- for example: https://api.gdc.cancer.gov/submission/TCGA/LUAD or https://api.gdc.cancer.gov/submission/CPTAC/3 


### Steps

1. Read in token file
2. Read in submission file
3. Edit endpoint and submit data using a `POST` or `PUT` request
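
The parts of a submission request that change between projects and data formats are the endpoint URL and the `"Content-Type"` header. The sketch below collects both in small helpers so the request itself stays the same; `submission_endpoint` and `submission_headers` are hypothetical names, and the content types are the two used in the examples in this notebook.

```python
SUBMISSION_URL = "https://api.gdc.cancer.gov/submission/"

# Content types for the two data formats used in this notebook
CONTENT_TYPES = {"json": "application/json", "tsv": "text/tsv"}

def submission_endpoint(program_name, project_code):
    """Hypothetical helper: submission URL for one project."""
    return f"{SUBMISSION_URL}{program_name}/{project_code}"

def submission_headers(token, data_format):
    """Hypothetical helper: headers for a JSON or TSV submission."""
    try:
        content_type = CONTENT_TYPES[data_format.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {data_format!r}") from None
    return {"X-Auth-Token": token, "Content-Type": content_type}

print(submission_endpoint("TCGA", "LUAD"))
print(submission_headers("<token>", "TSV")["Content-Type"])
```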

### Example 1: Submitting a JSON Data File

In [None]:
#1. Read in token file

with open("path/to/file/gdc-user-token.txt") as token_file:
    token = token_file.read().strip()

In [None]:
#2. Read in submission file

with open("example_file.json") as json_file:
    example_file_json = json.load(json_file)

In [5]:
#3. Edit endpoint and submit data using POST request

ENDPT = "https://api.gdc.cancer.gov/submission/<program_name>/<project_id>"

#submission request if data is in JSON format
requests.post(url = ENDPT, json = example_file_json, headers={'X-Auth-Token': token, "Content-Type": "application/json"})


### Example 2: Submitting a TSV Data File

In [2]:
#1. Read in token file

with open("path/to/file/gdc-user-token.txt") as token_file:
    token = token_file.read().strip()

In [3]:
#2. Read in submission file

#example_file_tsv = open("example_file.txt", "rb")
#binary mode is recommended when a file object is passed as the request data
example_file_tsv = open("submission_case_template.tsv", "rb")

In [5]:
#3. Edit endpoint and submit data using PUT request (PUT creates new entities and updates existing ones)

#ENDPT = "https://api.gdc.cancer.gov/submission/<program_name>/<project_id>"
ENDPT = "https://api.gdc.cancer.gov/submission/GDC/INTERNAL"

#submission request if data is in TSV format
res = requests.put(url = ENDPT, data = example_file_tsv, headers={'X-Auth-Token': token, "Content-Type": "text/tsv"})

res.text
