# CDA example for subject 09CO022

## Introduction

This example was [developed by Ian Fore](https://github.com/ianfore/cdatest/blob/8002ae1624b34e0e2f291022c295d4b778989de9/09CO022%20Example.ipynb) to determine whether the [CDA](https://datacommons.cancer.gov/cancer-data-aggregator) can link together data from three different [NCI CRDC](https://datascience.cancer.gov/data-commons) nodes.

It is based on subject 09CO022 from the TCGA Colon Cancer project. This subject is found in at least three NCI CRDC nodes:
* [Case ID 09CO022 from GDC](https://portal.gdc.cancer.gov/cases/c5421e34-e5c7-4ba5-aed9-146a5575fd8d)
* [Case ID 09CO022 from PDC](https://proteomic.datacommons.cancer.gov/pdc/case/459e3b69-63d6-11e8-bcf1-0a2705229b82)
* (possibly?) [Three images](https://pathology.cancerimagingarchive.net/pathdata/select.html?filter=%7B%22study%22%3A%7B%22match%22%3A%22CPTAC%22%7D%2C%22subject_id%22%3A%7B%22match%22%3A%2209CO022%22%7D%7D) from The Cancer Imaging Archive (TCIA).

## Setup

We need to install the [cda-python](https://github.com/CancerDataAggregator/cda-python) library so we can access the CDA API, along with pandas, which we use to display results as tables.

In [13]:
import sys

# Install cdapython so we can access the CDA API.
!{sys.executable} -m pip install git+https://github.com/CancerDataAggregator/cda-python.git
    
# Install pandas.
!{sys.executable} -m pip install pandas

Collecting git+https://github.com/CancerDataAggregator/cda-python.git
  Cloning https://github.com/CancerDataAggregator/cda-python.git to /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-o0gcerze
  Running command git clone -q https://github.com/CancerDataAggregator/cda-python.git /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-o0gcerze
Collecting cda-client@ git+https://github.com/CancerDataAggregator/cda-service-python-client.git
  Cloning https://github.com/CancerDataAggregator/cda-service-python-client.git to /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-install-ch57sgu4/cda-client_69354f9fbb06488fb985786ad19c241c
  Running command git clone -q https://github.com/CancerDataAggregator/cda-service-python-client.git /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-install-ch57sgu4/cda-client_69354f9fbb06488fb985786ad19c241c


## What do we know about Subject 09CO022?

Let's start by searching each node individually.

In [14]:
submitter_id = '09CO022'

### From GDC

We can search GDC using the [GDC Cases API](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#cases-endpoint) for `submitter_id == 09CO022`. Note that the GDC API only expands the sections we specifically request (`diagnoses` and `samples` in this example); other sections, such as `aliquot_ids`, include only identifiers unless we explicitly ask for them to be expanded as well.

In [15]:
# Load required packages.
import requests
import json
import pandas

In [16]:
# Search by submitter_id.
cases_endpt = "https://api.gdc.cancer.gov/cases"

filters = {
    "op": "in",
    "content":{
        "field": "submitter_id",
        "value": [ submitter_id ]
    }
}

params = {
    "filters": json.dumps(filters),
    "expand": "diagnoses,samples",
    "format": "JSON",
    "size": "2"
}

response = requests.get(cases_endpt, params = params)
result = json.loads(response.content)

pandas.DataFrame(result['data']['hits'])

Unnamed: 0,id,lost_to_followup,days_to_lost_to_followup,disease_type,submitter_id,aliquot_ids,submitter_aliquot_ids,diagnoses,diagnosis_ids,samples,sample_ids,created_datetime,submitter_sample_ids,primary_site,submitter_diagnosis_ids,updated_datetime,case_id,index_date,state
0,c5421e34-e5c7-4ba5-aed9-146a5575fd8d,,,Adenomas and Adenocarcinomas,09CO022,"[0d8adcbf-13f0-48c3-83df-3fa205b79ae8, 9250d96...","[f69deaeb-6b6f-4c61-8900-fd0f26_D7_1, 60805d52...","[{'irs_stage': None, 'anaplasia_present_type':...",[7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb],"[{'distributor_reference': None, 'sample_type_...","[4591a53d-5668-4a70-b44b-e08a3d59267e, b12c257...",2017-01-25T15:29:16.160843-06:00,"[f69deaeb-6b6f-4c61-8900-fd0f26, 60805d52-8ca1...",Colon,[09CO022-DX],2019-10-24T07:59:21.887408-05:00,c5421e34-e5c7-4ba5-aed9-146a5575fd8d,,released


In [17]:
cases = result['data']['hits']

# We should only have one case for subject 09CO022.
assert len(cases) == 1
case = cases[0]

# Write this to a file.
with open('gdc_subject_09CO022.json', 'w') as f:
    json.dump(case, f, indent=4, sort_keys=True)

# What information do we have on this case?
pandas.DataFrame(cases)

Unnamed: 0,id,lost_to_followup,days_to_lost_to_followup,disease_type,submitter_id,aliquot_ids,submitter_aliquot_ids,diagnoses,diagnosis_ids,samples,sample_ids,created_datetime,submitter_sample_ids,primary_site,submitter_diagnosis_ids,updated_datetime,case_id,index_date,state
0,c5421e34-e5c7-4ba5-aed9-146a5575fd8d,,,Adenomas and Adenocarcinomas,09CO022,"[0d8adcbf-13f0-48c3-83df-3fa205b79ae8, 9250d96...","[f69deaeb-6b6f-4c61-8900-fd0f26_D7_1, 60805d52...","[{'irs_stage': None, 'anaplasia_present_type':...",[7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb],"[{'distributor_reference': None, 'sample_type_...","[4591a53d-5668-4a70-b44b-e08a3d59267e, b12c257...",2017-01-25T15:29:16.160843-06:00,"[f69deaeb-6b6f-4c61-8900-fd0f26, 60805d52-8ca1...",Colon,[09CO022-DX],2019-10-24T07:59:21.887408-05:00,c5421e34-e5c7-4ba5-aed9-146a5575fd8d,,released
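
The `diagnoses` and `samples` cells in the table above each hold a list of nested records, which the tabular view truncates. As a minimal sketch of how one might inspect what an expanded section contains, the hypothetical helper below (`summarize_nested` is not part of the GDC API or pandas) counts the records in each section and lists their field names. The `example_case` dictionary is a trimmed stand-in that reuses a few field names visible in the output above; in the notebook you would pass the real `case` instead.

```python
def summarize_nested(case, sections=("diagnoses", "samples")):
    """Count the nested records in each expanded section and list their fields."""
    summary = {}
    for section in sections:
        # A section that was not expanded may be missing or None.
        records = case.get(section) or []
        fields = sorted({key for record in records for key in record})
        summary[section] = {"count": len(records), "fields": fields}
    return summary


# Trimmed, illustrative stand-in for the GDC case record (not real values).
example_case = {
    "submitter_id": "09CO022",
    "diagnoses": [{"irs_stage": None, "anaplasia_present_type": None}],
    "samples": [{"distributor_reference": None}, {"distributor_reference": None}],
}

summary = summarize_nested(example_case)
for section, info in summary.items():
    print(section, info["count"], info["fields"])
```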


#### In other formats

While we anticipate that JSON will be the best format to work with, the GDC API also allows us to download data in TSV and XML formats. Let's download those as well so we can compare them.

In [18]:
# Write to TSV.
params = {
    "filters": json.dumps(filters),
    "expand": "diagnoses,samples",
    "format": "TSV",
    "size": "2"
}

response = requests.get(cases_endpt, params = params)

# Write this to a file.
with open('gdc_subject_09CO022.tsv', 'w') as f:
    f.write(response.text)
    
print(f"{len(response.text)} characters written to 'gdc_subject_09CO022.tsv'")

# Write to XML.
params = {
    "filters": json.dumps(filters),
    "expand": "diagnoses,samples",
    "format": "XML",
    "size": "2"
}

response = requests.get(cases_endpt, params = params)

# Write this to a file.
import xml.dom.minidom

dom = xml.dom.minidom.parseString(response.text)
pretty_xml_as_string = dom.toprettyxml()

with open('gdc_subject_09CO022.xml', 'w') as f:
    f.write(pretty_xml_as_string)
    
print(f"{len(response.text)} characters written to 'gdc_subject_09CO022.xml'")

6950 characters written to 'gdc_subject_09CO022.tsv'
9466 characters written to 'gdc_subject_09CO022.xml'
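
The three requests above differ only in their `format` value. One possible way to avoid repeating the parameter dictionary is a small builder function. `build_gdc_case_params` is a hypothetical helper name, and the sketch only constructs the parameters (it makes no network request), reusing the `filters`, `expand`, `format`, and `size` keys from the cells above.

```python
import json


def build_gdc_case_params(submitter_ids, expand=("diagnoses", "samples"),
                          fmt="JSON", size=2):
    """Build query parameters for the GDC cases endpoint (no request is sent)."""
    filters = {
        "op": "in",
        "content": {
            "field": "submitter_id",
            "value": list(submitter_ids),
        },
    }
    return {
        "filters": json.dumps(filters),
        "expand": ",".join(expand),
        "format": fmt,
        "size": str(size),
    }


# One parameter set per output format used in this notebook.
params_by_format = {
    fmt: build_gdc_case_params(["09CO022"], fmt=fmt)
    for fmt in ("JSON", "TSV", "XML")
}
print(params_by_format["TSV"])
```

Each dictionary could then be passed as `params` to `requests.get(cases_endpt, ...)`, as in the cells above.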


### From PDC

PDC has a GraphQL endpoint we can use to make these queries.

In [19]:
# Search for case submitter ID 09CO022:

# I got this query from https://pdc.cancer.gov/data-dictionary/publicapi-documentation/#!/Case/case
graphql_query = """{
    case (
        case_submitter_id: "%s"
        acceptDUA: true
    ) {
        case_id case_submitter_id project_submitter_id days_to_lost_to_followup disease_type
        index_date lost_to_followup primary_site
        externalReferences {
            external_reference_id
            reference_resource_shortname reference_resource_name reference_entity_location
        }
        demographics {
            demographic_id ethnicity gender demographic_submitter_id race cause_of_death days_to_birth
            days_to_death vital_status year_of_birth year_of_death
        }
        samples {
            sample_id sample_submitter_id sample_type sample_type_id gdc_sample_id gdc_project_id
            biospecimen_anatomic_site composition current_weight days_to_collection days_to_sample_procurement
            diagnosis_pathologically_confirmed freezing_method initial_weight intermediate_dimension is_ffpe
            longest_dimension method_of_sample_procurement oct_embedded pathology_report_uuid preservation_method
            sample_type_id shortest_dimension time_between_clamping_and_freezing time_between_excision_and_freezing
            tissue_type tumor_code tumor_code_id tumor_descriptor
            aliquots {
                aliquot_id aliquot_submitter_id analyte_type
                aliquot_run_metadata {
                    aliquot_run_metadata_id
                }
            }
        }
        diagnoses {
            diagnosis_id tissue_or_organ_of_origin age_at_diagnosis primary_diagnosis tumor_grade tumor_stage
            diagnosis_submitter_id classification_of_tumor days_to_last_follow_up days_to_last_known_disease_status
            days_to_recurrence last_known_disease_status morphology progression_or_recurrence
            site_of_resection_or_biopsy prior_malignancy ajcc_clinical_m ajcc_clinical_n ajcc_clinical_stage
            ajcc_clinical_t ajcc_pathologic_m ajcc_pathologic_n ajcc_pathologic_stage ajcc_pathologic_t
            ann_arbor_b_symptoms ann_arbor_clinical_stage ann_arbor_extranodal_involvement ann_arbor_pathologic_stage
            best_overall_response burkitt_lymphoma_clinical_variant circumferential_resection_margin
            colon_polyps_history days_to_best_overall_response days_to_diagnosis days_to_hiv_diagnosis
            days_to_new_event figo_stage hiv_positive hpv_positive_type hpv_status iss_stage laterality
            ldh_level_at_diagnosis ldh_normal_range_upper lymph_nodes_positive lymphatic_invasion_present
            method_of_diagnosis new_event_anatomic_site new_event_type overall_survival perineural_invasion_present
            prior_treatment progression_free_survival progression_free_survival_event residual_disease
            vascular_invasion_present year_of_diagnosis icd_10_code synchronous_malignancy
            tumor_largest_dimension_diameter
        }
    }
}""" % (submitter_id)

params = {
    "query": graphql_query
}

pdc_graphql_endpoint = "https://pdc.cancer.gov/graphql"

response = requests.get(pdc_graphql_endpoint, params = params)
result = json.loads(response.content)

cases = result['data']['case']
assert len(cases) == 1
case = cases[0]

# Write this to a file.
with open('pdc_subject_09CO022.json', 'w') as f:
    json.dump(case, f, indent=4, sort_keys=True)


pandas.DataFrame(cases)

Unnamed: 0,case_id,case_submitter_id,project_submitter_id,days_to_lost_to_followup,disease_type,index_date,lost_to_followup,primary_site,externalReferences,demographics,samples,diagnoses
0,459e3b69-63d6-11e8-bcf1-0a2705229b82,09CO022,CPTAC-2,0,Colon Adenocarcinoma,,,Colon,[{'external_reference_id': 'c5421e34-e5c7-4ba5...,[{'demographic_id': 'f6954ae9-7588-11e8-bcf1-0...,[{'sample_id': 'f4af3e4d-641b-11e8-bcf1-0a2705...,[{'diagnosis_id': 'ff301535-70ca-11e8-bcf1-0a2...
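
The `externalReferences` column is what ties a PDC case to records in other nodes. A minimal sketch of pulling those identifiers out of a PDC case record follows, using the `external_reference_id` and `reference_resource_shortname` fields requested in the query above. The helper name `external_ids`, the placeholder identifiers, and the assumption that GDC references carry the short name `"GDC"` are all illustrative and should be checked against the real response.

```python
def external_ids(pdc_case, resource_shortname):
    """Return the external reference IDs that point at the named resource."""
    references = pdc_case.get("externalReferences") or []
    return [
        ref["external_reference_id"]
        for ref in references
        if ref.get("reference_resource_shortname") == resource_shortname
    ]


# Illustrative record with placeholder IDs (not the real PDC response).
example_pdc_case = {
    "case_submitter_id": "09CO022",
    "externalReferences": [
        {
            "external_reference_id": "placeholder-gdc-case-id",
            "reference_resource_shortname": "GDC",  # assumed short name
        },
        {
            "external_reference_id": "placeholder-other-id",
            "reference_resource_shortname": "OTHER",  # hypothetical resource
        },
    ],
}

gdc_ids = external_ids(example_pdc_case, "GDC")
print(gdc_ids)
```

With the real records, one could compare the returned IDs against the GDC `case_id` to check whether both nodes describe the same subject.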


### From CDA by Case ID

For cases retrieved from GDC or PDC, we can look up the same record in CDA by its `case_id`. Here `case` is still the PDC record from the previous cell, so the query below uses the PDC case ID.

In [20]:
from cdapython import Q, columns

case_id = case['case_id']
query_by_case_id = Q(f'ResearchSubject.id = "{case_id}"')
results = query_by_case_id.run(limit=2) 
print(results)

cda_entry = results[0]

# Write this to a file.
with open(f'cda_case_{case_id}.json', 'w') as f:
    json.dump(cda_entry, f, indent=4, sort_keys=True)

# cda_entry


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.id = '459e3b69-63d6-11e8-bcf1-0a2705229b82')
Offset: 0
Limit: 2
Count: 1
More pages: No



In [21]:
columns()

SELECT field_path FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3'


['days_to_birth',
 'race',
 'sex',
 'ethnicity',
 'id',
 'ResearchSubject',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.tumor_stage',
 'ResearchSubject.Diagnosis.tumor_grade',
 'ResearchSubject.Diagnosis.Treatment',
 'ResearchSubject.Diagnosis.Treatment.type',
 'ResearchSubject.Diagnosis.Treatment.outcome',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Specimen',
 'ResearchSubject.Specimen.File',
 'ResearchSubject.Specimen.File.label',
 'ResearchSubject.Specimen.File.associated_project',
 'ResearchSubject.Specimen.File.drs_uri',
 'ResearchSubject.Specimen.File.identifier',
 'ResearchSubject.Specimen.File.identifier.system',
 'ResearchSubject.Specimen.File.identifier.value',
 'ResearchSubject.Specimen.File.data_category',
 'ResearchSubject.Specimen.File.byte_size',
 'ResearchSubject.Specimen.File.type',
 'ResearchSubject.Specimen.File
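
Because the field paths are dot-separated, the list can be narrowed to one part of the schema with plain string matching. The sketch below uses a short excerpt of the paths printed above; `fields_under` is a hypothetical helper, not part of cdapython.

```python
def fields_under(paths, prefix):
    """Return the field paths nested below the given prefix."""
    return [path for path in paths if path.startswith(prefix + ".")]


# Excerpt of the field paths listed above.
field_paths = [
    "ResearchSubject",
    "ResearchSubject.Diagnosis",
    "ResearchSubject.Diagnosis.morphology",
    "ResearchSubject.Diagnosis.tumor_stage",
    "ResearchSubject.Diagnosis.tumor_grade",
    "ResearchSubject.Diagnosis.Treatment",
    "ResearchSubject.Diagnosis.Treatment.type",
    "ResearchSubject.Diagnosis.Treatment.outcome",
    "ResearchSubject.Specimen",
    "ResearchSubject.Specimen.File",
]

treatment_fields = fields_under(field_paths, "ResearchSubject.Diagnosis.Treatment")
print(treatment_fields)
```

In the notebook, the same helper could presumably be applied to the list returned by `columns()`.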

The big advantage of CDA is that we can query by the submitter ID directly (using the `ResearchSubject.Specimen.derived_from_subject` field) to get a list of all the records associated with a particular subject.

In [22]:
query_by_derived = Q(f'ResearchSubject.Specimen.derived_from_subject = "{submitter_id}"')
results = query_by_derived.run(limit=100)
print(results)

# Collect every result into a plain list.
result_list = list(results)

# Write this to a file.
with open(f'cda_derived_from_{submitter_id}.json', 'w') as f:
    json.dump(result_list, f, indent=4, sort_keys=True)

pandas.DataFrame(result_list)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen WHERE (_Specimen.derived_from_subject = '09CO022')
Offset: 0
Limit: 100
Count: 11
More pages: No



Unnamed: 0,days_to_birth,race,sex,ethnicity,id,ResearchSubject,Diagnosis,Specimen,associated_project,primary_disease_type,identifier,primary_disease_site,File,derived_from_specimen,age_at_collection,anatomical_site,source_material_type,derived_from_subject,specimen_type
0,,black or african american,female,not hispanic or latino,4591a53d-5668-4a70-b44b-e08a3d59267e,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': '4591a53d-5668-4a7...",Colon,[{'label': 'fbc0c313-d356-4ad9-8257-57e90fb7f2...,Initial sample,,,Primary Tumor,09CO022,sample
1,,black or african american,female,not hispanic or latino,c53c4d60-2ddb-5da8-932e-00a86fa2347f,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': 'c53c4d60-2ddb-5da...",Colon,[],4591a53d-5668-4a70-b44b-e08a3d59267e,,,Primary Tumor,09CO022,portion
2,,black or african american,female,not hispanic or latino,31075cfa-7aef-59f1-bf54-a1cddb5ee0fd,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': '31075cfa-7aef-59f...",Colon,[],c53c4d60-2ddb-5da8-932e-00a86fa2347f,,,Primary Tumor,09CO022,analyte
3,,black or african american,female,not hispanic or latino,0d8adcbf-13f0-48c3-83df-3fa205b79ae8,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': '0d8adcbf-13f0-48c...",Colon,[],31075cfa-7aef-59f1-bf54-a1cddb5ee0fd,,,Primary Tumor,09CO022,aliquot
4,,black or african american,female,not hispanic or latino,d085ebd9-7605-54a0-abb9-10867f5fa1b1,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': 'd085ebd9-7605-54a...",Colon,[],4591a53d-5668-4a70-b44b-e08a3d59267e,,,Primary Tumor,09CO022,portion
5,,black or african american,female,not hispanic or latino,a31724b6-e550-552b-bd61-41341c534e28,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': 'a31724b6-e550-552...",Colon,[],d085ebd9-7605-54a0-abb9-10867f5fa1b1,,,Primary Tumor,09CO022,analyte
6,,black or african american,female,not hispanic or latino,9250d96e-1cdc-4d68-8a56-f7b186a6fab5,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': '9250d96e-1cdc-4d6...",Colon,[],a31724b6-e550-552b-bd61-41341c534e28,,,Primary Tumor,09CO022,aliquot
7,,black or african american,female,not hispanic or latino,b12c257d-7409-4858-9384-c430929a075a,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': 'b12c257d-7409-485...",Colon,[{'label': 'fbc0c313-d356-4ad9-8257-57e90fb7f2...,Initial sample,,,Blood Derived Normal,09CO022,sample
8,,black or african american,female,not hispanic or latino,702d7ba0-9558-5b2d-af4d-cd797485b8c1,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': '702d7ba0-9558-5b2...",Colon,[],b12c257d-7409-4858-9384-c430929a075a,,,Blood Derived Normal,09CO022,portion
9,,black or african american,female,not hispanic or latino,f0003f0a-07ea-548e-b1f7-7e6d1b27d47a,"[{'Diagnosis': [{'morphology': '8140/3', 'tumo...","[{'morphology': '8140/3', 'tumor_stage': 'Stag...",[{'File': [{'label': 'fbc0c313-d356-4ad9-8257-...,CPTAC-2,Adenomas and Adenocarcinomas,"[{'system': 'GDC', 'value': 'f0003f0a-07ea-548...",Colon,[],702d7ba0-9558-5b2d-af4d-cd797485b8c1,,,Blood Derived Normal,09CO022,analyte
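
The `derived_from_specimen` column links each record to its parent specimen, with `Initial sample` marking a root. As a minimal sketch, the rows can be arranged into that hierarchy using only the `id`, `derived_from_specimen`, and `specimen_type` values. The example data copies rows 0 to 3 of the table above, and the function names are hypothetical.

```python
from collections import defaultdict

ROOT_MARKER = "Initial sample"  # value used for top-level samples in the table

# Rows 0-3 from the table above, reduced to the three columns we need.
rows = [
    {"id": "4591a53d-5668-4a70-b44b-e08a3d59267e",
     "derived_from_specimen": "Initial sample",
     "specimen_type": "sample"},
    {"id": "c53c4d60-2ddb-5da8-932e-00a86fa2347f",
     "derived_from_specimen": "4591a53d-5668-4a70-b44b-e08a3d59267e",
     "specimen_type": "portion"},
    {"id": "31075cfa-7aef-59f1-bf54-a1cddb5ee0fd",
     "derived_from_specimen": "c53c4d60-2ddb-5da8-932e-00a86fa2347f",
     "specimen_type": "analyte"},
    {"id": "0d8adcbf-13f0-48c3-83df-3fa205b79ae8",
     "derived_from_specimen": "31075cfa-7aef-59f1-bf54-a1cddb5ee0fd",
     "specimen_type": "aliquot"},
]


def build_children(rows):
    """Map each parent identifier to the rows derived from it."""
    children = defaultdict(list)
    for row in rows:
        children[row["derived_from_specimen"]].append(row)
    return children


def lineage_lines(children, parent=ROOT_MARKER, depth=0):
    """Return indented 'type id' lines, walking the hierarchy depth-first."""
    lines = []
    for row in children.get(parent, []):
        lines.append("  " * depth + f"{row['specimen_type']} {row['id']}")
        lines.extend(lineage_lines(children, row["id"], depth + 1))
    return lines


specimen_tree = lineage_lines(build_children(rows))
print("\n".join(specimen_tree))
```

In the notebook, the same functions could be applied to `result_list`, since each result appears to carry these three keys.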
