# CDA example for subject 09CO022

## Introduction

This example was [developed by Ian Fore](https://github.com/ianfore/cdatest/blob/8002ae1624b34e0e2f291022c295d4b778989de9/09CO022%20Example.ipynb) to determine whether the [CDA](https://datacommons.cancer.gov/cancer-data-aggregator) (Cancer Data Aggregator) can link together data from three different [NCI CRDC](https://datascience.cancer.gov/data-commons) nodes.

It is based on subject 09CO022 from the CPTAC-2 colon cancer project (the project ID reported in the PDC and CDA records below). This subject is found in at least three NCI CRDC nodes:
* [Case ID 09CO022 from GDC](https://portal.gdc.cancer.gov/cases/c5421e34-e5c7-4ba5-aed9-146a5575fd8d)
* [Case ID 09CO022 from PDC](https://proteomic.datacommons.cancer.gov/pdc/case/459e3b69-63d6-11e8-bcf1-0a2705229b82)
* Possibly [three images](https://pathology.cancerimagingarchive.net/pathdata/select.html?filter=%7B%22study%22%3A%7B%22match%22%3A%22CPTAC%22%7D%2C%22subject_id%22%3A%7B%22match%22%3A%2209CO022%22%7D%7D) from The Cancer Imaging Archive (TCIA).

## Setup

We only need to install the [cda-python](https://github.com/CancerDataAggregator/cda-python) library so we can access the CDA API.

In [1]:
import sys

# Install cdapython so we can access the CDA API.
!{sys.executable} -m pip install git+https://github.com/CancerDataAggregator/cda-python.git


## What do we know about Subject 09CO022?

Let's start by searching each node individually.

In [19]:
submitter_id = '09CO022'

### From GDC

We can search GDC using the [GDC Cases API](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#cases-endpoint) for `submitter_id == 09CO022`. Note that the GDC API returns a section in full only if we explicitly request it through the `expand` parameter (`diagnoses` and `samples` in this example). Other sections, such as `aliquot_ids`, include only identifiers unless we ask for them to be expanded as well.
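
The request described above can be sketched as a small helper that assembles the query URL without sending it, which makes it easy to inspect how the `filters` and `expand` parameters are encoded. The helper name `build_gdc_cases_url` is hypothetical and exists only for illustration; the endpoint, field, and parameter names are the ones used in this notebook.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_gdc_cases_url(submitter_id, expand=("diagnoses", "samples"), size=2):
    """Build (but do not send) a GDC cases query for one submitter ID.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    filters = {
        "op": "in",
        "content": {"field": "submitter_id", "value": [submitter_id]},
    }
    params = {
        "filters": json.dumps(filters),  # Filters travel as a JSON string.
        "expand": ",".join(expand),      # Sections to return in full.
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_CASES_ENDPOINT}?{urlencode(params)}"


url = build_gdc_cases_url("09CO022")

# Decode the query string again to check what would be sent.
query = parse_qs(urlsplit(url).query)
print(query["expand"][0])
print(json.loads(query["filters"][0]))
```

Passing a different tuple as `expand` changes which sections come back in full; everything else in the request stays the same.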

In [20]:
# Load required packages.
import requests
import json

In [21]:
# Search by submitter_id.
cases_endpt = "https://api.gdc.cancer.gov/cases"

filters = {
    "op": "in",
    "content":{
        "field": "submitter_id",
        "value": [ submitter_id ]
    }
}

params = {
    "filters": json.dumps(filters),
    "expand": "diagnoses,samples",
    "format": "JSON",
    "size": "2"
}

response = requests.get(cases_endpt, params=params)
response.raise_for_status()  # Fail early on an HTTP error.
result = response.json()

result

{'data': {'hits': [{'id': 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d',
    'lost_to_followup': None,
    'days_to_lost_to_followup': None,
    'disease_type': 'Adenomas and Adenocarcinomas',
    'submitter_id': '09CO022',
    'aliquot_ids': ['0d8adcbf-13f0-48c3-83df-3fa205b79ae8',
     '9250d96e-1cdc-4d68-8a56-f7b186a6fab5',
     'ce309009-9257-4c13-9241-83fc629c35a0'],
    'submitter_aliquot_ids': ['f69deaeb-6b6f-4c61-8900-fd0f26_D7_1',
     '60805d52-8ca1-46d4-8101-0ad055_D1_1',
     'f69deaeb-6b6f-4c61-8900-fd0f26_D6_2'],
    'diagnoses': [{'irs_stage': None,
      'anaplasia_present_type': None,
      'iss_stage': None,
      'ajcc_pathologic_stage': 'Stage IIB',
      'gross_tumor_weight': None,
      'tumor_largest_dimension_diameter': None,
      'tumor_stage': 'Stage IIB',
      'ann_arbor_clinical_stage': None,
      'enneking_msts_stage': None,
      'created_datetime': '2017-03-01T11:57:00.297885-06:00',
      'circumferential_resection_margin': None,
      'inrg_stage': None,
  

In [17]:
cases = result['data']['hits']

# We should only have one case for subject 09CO022.
assert len(cases) == 1
case = cases[0]

# Write this to a file.
with open('gdc_subject_09CO022.json', 'w') as f:
    json.dump(case, f, indent=4, sort_keys=True)

# What information do we have on this case?
case

{'id': 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d',
 'lost_to_followup': None,
 'days_to_lost_to_followup': None,
 'disease_type': 'Adenomas and Adenocarcinomas',
 'submitter_id': '09CO022',
 'aliquot_ids': ['0d8adcbf-13f0-48c3-83df-3fa205b79ae8',
  '9250d96e-1cdc-4d68-8a56-f7b186a6fab5',
  'ce309009-9257-4c13-9241-83fc629c35a0'],
 'submitter_aliquot_ids': ['f69deaeb-6b6f-4c61-8900-fd0f26_D7_1',
  '60805d52-8ca1-46d4-8101-0ad055_D1_1',
  'f69deaeb-6b6f-4c61-8900-fd0f26_D6_2'],
 'diagnoses': [{'irs_stage': None,
   'anaplasia_present_type': None,
   'iss_stage': None,
   'ajcc_pathologic_stage': 'Stage IIB',
   'gross_tumor_weight': None,
   'tumor_largest_dimension_diameter': None,
   'tumor_stage': 'Stage IIB',
   'ann_arbor_clinical_stage': None,
   'enneking_msts_stage': None,
   'created_datetime': '2017-03-01T11:57:00.297885-06:00',
   'circumferential_resection_margin': None,
   'inrg_stage': None,
   'enneking_msts_metastasis': None,
   'tissue_or_organ_of_origin': 'Colon, NOS',
   '
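
Because the full record is long, it can help to pull out only the fields needed for cross-node comparison. The following is a minimal sketch, assuming the record has the shape printed above; `summarize_gdc_case` is a hypothetical helper, and `example_case` is an abbreviated stand-in for the real record that keeps only keys visible in the output.

```python
def summarize_gdc_case(case):
    """Return the identifiers and staging recorded on one GDC case dict."""
    stages = {
        diagnosis.get("ajcc_pathologic_stage")
        for diagnosis in case.get("diagnoses", [])
        if diagnosis.get("ajcc_pathologic_stage")
    }
    return {
        "case_id": case.get("id"),
        "submitter_id": case.get("submitter_id"),
        "aliquot_count": len(case.get("aliquot_ids", [])),
        "pathologic_stages": sorted(stages),
    }


# Abbreviated stand-in for the GDC record shown above.
example_case = {
    "id": "c5421e34-e5c7-4ba5-aed9-146a5575fd8d",
    "submitter_id": "09CO022",
    "aliquot_ids": [
        "0d8adcbf-13f0-48c3-83df-3fa205b79ae8",
        "9250d96e-1cdc-4d68-8a56-f7b186a6fab5",
        "ce309009-9257-4c13-9241-83fc629c35a0",
    ],
    "diagnoses": [
        {"ajcc_pathologic_stage": "Stage IIB", "iss_stage": None},
    ],
}

summary = summarize_gdc_case(example_case)
print(summary)
```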

### From PDC

PDC provides a GraphQL endpoint that we can query for the same subject.

In [28]:
# Search for case submitter ID 09CO022:

# I got this query from https://pdc.cancer.gov/data-dictionary/publicapi-documentation/#!/Case/case
graphql_query = """{
    case (
        case_submitter_id: "%s"
        acceptDUA: true
    ) {
        case_id case_submitter_id project_submitter_id days_to_lost_to_followup disease_type
        index_date lost_to_followup primary_site
        externalReferences {
            external_reference_id
            reference_resource_shortname reference_resource_name reference_entity_location
        }
        demographics {
            demographic_id ethnicity gender demographic_submitter_id race cause_of_death days_to_birth
            days_to_death vital_status year_of_birth year_of_death
        }
        samples {
            sample_id sample_submitter_id sample_type sample_type_id gdc_sample_id gdc_project_id
            biospecimen_anatomic_site composition current_weight days_to_collection days_to_sample_procurement
            diagnosis_pathologically_confirmed freezing_method initial_weight intermediate_dimension is_ffpe
            longest_dimension method_of_sample_procurement oct_embedded pathology_report_uuid preservation_method
            shortest_dimension time_between_clamping_and_freezing time_between_excision_and_freezing
            tissue_type tumor_code tumor_code_id tumor_descriptor
            aliquots {
                aliquot_id aliquot_submitter_id analyte_type
                aliquot_run_metadata {
                    aliquot_run_metadata_id
                }
            }
        }
        diagnoses {
            diagnosis_id tissue_or_organ_of_origin age_at_diagnosis primary_diagnosis tumor_grade tumor_stage
            diagnosis_submitter_id classification_of_tumor days_to_last_follow_up days_to_last_known_disease_status
            days_to_recurrence last_known_disease_status morphology progression_or_recurrence
            site_of_resection_or_biopsy prior_malignancy ajcc_clinical_m ajcc_clinical_n ajcc_clinical_stage
            ajcc_clinical_t ajcc_pathologic_m ajcc_pathologic_n ajcc_pathologic_stage ajcc_pathologic_t
            ann_arbor_b_symptoms ann_arbor_clinical_stage ann_arbor_extranodal_involvement ann_arbor_pathologic_stage
            best_overall_response burkitt_lymphoma_clinical_variant circumferential_resection_margin
            colon_polyps_history days_to_best_overall_response days_to_diagnosis days_to_hiv_diagnosis
            days_to_new_event figo_stage hiv_positive hpv_positive_type hpv_status iss_stage laterality
            ldh_level_at_diagnosis ldh_normal_range_upper lymph_nodes_positive lymphatic_invasion_present
            method_of_diagnosis new_event_anatomic_site new_event_type overall_survival perineural_invasion_present
            prior_treatment progression_free_survival progression_free_survival_event residual_disease
            vascular_invasion_present year_of_diagnosis icd_10_code synchronous_malignancy
            tumor_largest_dimension_diameter
        }
    }
}""" % (submitter_id)

params = {
    "query": graphql_query
}

pdc_graphql_endpoint = "https://pdc.cancer.gov/graphql"

response = requests.get(pdc_graphql_endpoint, params=params)
response.raise_for_status()  # Fail early on an HTTP error.
result = response.json()

cases = result['data']['case']
assert len(cases) == 1
case = cases[0]

# Write this to a file.
with open('pdc_subject_09CO022.json', 'w') as f:
    json.dump(case, f, indent=4, sort_keys=True)


case

{'case_id': '459e3b69-63d6-11e8-bcf1-0a2705229b82',
 'case_submitter_id': '09CO022',
 'project_submitter_id': 'CPTAC-2',
 'days_to_lost_to_followup': 0,
 'disease_type': 'Colon Adenocarcinoma',
 'index_date': None,
 'lost_to_followup': '',
 'primary_site': 'Colon',
 'externalReferences': [{'external_reference_id': 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d',
   'reference_resource_shortname': 'GDC',
   'reference_resource_name': 'Genomic Data Commns',
   'reference_entity_location': 'https://portal.gdc.cancer.gov/cases/c5421e34-e5c7-4ba5-aed9-146a5575fd8d'}],
 'demographics': [{'demographic_id': 'f6954ae9-7588-11e8-bcf1-0a2705229b82',
   'ethnicity': 'Not Hispanic or Latino',
   'gender': 'Female',
   'demographic_submitter_id': '09CO022-DM',
   'race': 'Black or African American',
   'cause_of_death': None,
   'days_to_birth': None,
   'days_to_death': None,
   'vital_status': 'Not Reported',
   'year_of_birth': None,
   'year_of_death': None}],
 'samples': [{'sample_id': 'f4af3e4d-641b-11
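
Note that the PDC record already points back to GDC: its `externalReferences` entry carries the same case ID that the GDC query returned. As a minimal sketch, assuming the record shape printed above, the cross-reference can be extracted as follows. `external_ids` is a hypothetical helper, and `example_pdc_case` is an abbreviated stand-in for the real record.

```python
def external_ids(pdc_case, resource_shortname):
    """Return the IDs a PDC case lists for the named external resource."""
    return [
        reference["external_reference_id"]
        for reference in pdc_case.get("externalReferences") or []
        if reference.get("reference_resource_shortname") == resource_shortname
    ]


# Abbreviated stand-in for the PDC record shown above.
example_pdc_case = {
    "case_id": "459e3b69-63d6-11e8-bcf1-0a2705229b82",
    "case_submitter_id": "09CO022",
    "externalReferences": [
        {
            "external_reference_id": "c5421e34-e5c7-4ba5-aed9-146a5575fd8d",
            "reference_resource_shortname": "GDC",
        }
    ],
}

gdc_ids = external_ids(example_pdc_case, "GDC")
print(gdc_ids)
```

A resource with no matching entry simply yields an empty list, so the same call can be used to check for references to other nodes.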

### From CDA by Case ID

For cases retrieved from GDC or PDC, we can look up the same record in CDA by its case ID. The cell below uses the `case_id` of the PDC record retrieved above.

In [32]:
from cdapython import Q, columns

# `case` is the PDC record from the previous section, so this is the PDC case ID.
case_id = case['case_id']
query_by_case_id = Q(f'ResearchSubject.id = "{case_id}"')
results = query_by_case_id.run(limit=2)
print(results)

cda_entry = results[0]

# Write this to a file.
with open(f'cda_case_{case_id}.json', 'w') as f:
    json.dump(cda_entry, f, indent=4, sort_keys=True)

cda_entry


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.id = '459e3b69-63d6-11e8-bcf1-0a2705229b82')
Offset: 0
Limit: 2
Count: 1
More pages: No



{'days_to_birth': None,
 'race': 'black or african american',
 'sex': 'female',
 'ethnicity': 'not hispanic or latino',
 'id': '459e3b69-63d6-11e8-bcf1-0a2705229b82',
 'ResearchSubject': [{'Diagnosis': [{'morphology': '8140/3',
     'tumor_stage': 'Stage IIB',
     'tumor_grade': 'Not Reported',
     'Treatment': [],
     'id': '7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb',
     'primary_diagnosis': 'Adenocarcinoma, NOS',
     'age_at_diagnosis': None}],
   'Specimen': [{'File': [{'label': 'fbc0c313-d356-4ad9-8257-57e90fb7f26b.wxs.Pindel.somatic_annotation.vcf.gz',
       'associated_project': ['CPTAC-2'],
       'drs_uri': 'drs://dg.4DFC:00a9219e-109c-4013-89c8-3d01b39cd9bf',
       'identifier': [{'system': 'GDC',
         'value': '00a9219e-109c-4013-89c8-3d01b39cd9bf'}],
       'data_category': 'Simple Nucleotide Variation',
       'byte_size': 248056,
       'type': None,
       'file_format': None,
       'checksum': 'e09e988c4969cf68fac8168fa622df9f',
       'id': '00a9219e-109c-4013-8

In [12]:
columns()

SELECT field_path FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3'


['days_to_birth',
 'race',
 'sex',
 'ethnicity',
 'id',
 'ResearchSubject',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.tumor_stage',
 'ResearchSubject.Diagnosis.tumor_grade',
 'ResearchSubject.Diagnosis.Treatment',
 'ResearchSubject.Diagnosis.Treatment.type',
 'ResearchSubject.Diagnosis.Treatment.outcome',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Specimen',
 'ResearchSubject.Specimen.File',
 'ResearchSubject.Specimen.File.label',
 'ResearchSubject.Specimen.File.associated_project',
 'ResearchSubject.Specimen.File.drs_uri',
 'ResearchSubject.Specimen.File.identifier',
 'ResearchSubject.Specimen.File.identifier.system',
 'ResearchSubject.Specimen.File.identifier.value',
 'ResearchSubject.Specimen.File.data_category',
 'ResearchSubject.Specimen.File.byte_size',
 'ResearchSubject.Specimen.File.type',
 'ResearchSubject.Specimen.File
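
The dotted names returned by `columns()` describe where a field sits inside the nested CDA record: each segment is a key, and repeated fields such as `ResearchSubject`, `Specimen`, and `File` hold lists. As a rough sketch of how such a path maps onto a record, the hypothetical helper `values_at` below (not part of cdapython) walks a path and collects every value it finds. `example_record` is an abbreviated stand-in for the record shown earlier.

```python
def values_at(record, field_path):
    """Collect every value found at a dotted path, descending through lists.

    List-valued leaves are returned as lists, one per parent object.
    """
    nodes = [record]
    for key in field_path.split("."):
        next_nodes = []
        for node in nodes:
            # A repeated field is a list of dicts; look inside each element.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    return nodes


# Abbreviated stand-in for a CDA record.
example_record = {
    "sex": "female",
    "ResearchSubject": [
        {
            "Diagnosis": [{"tumor_stage": "Stage IIB"}],
            "Specimen": [
                {
                    "File": [
                        {"drs_uri": "drs://dg.4DFC:00a9219e-109c-4013-89c8-3d01b39cd9bf"}
                    ]
                }
            ],
        }
    ],
}

print(values_at(example_record, "ResearchSubject.Diagnosis.tumor_stage"))
print(values_at(example_record, "ResearchSubject.Specimen.File.drs_uri"))
```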

The big advantage of CDA is that we can query by submitter ID directly, using the `ResearchSubject.Specimen.derived_from_subject` field, to get a list of all the cases associated with a particular subject.

In [36]:
query_by_derived = Q(f'ResearchSubject.Specimen.derived_from_subject = "{submitter_id}"')
results = query_by_derived.run(limit=100)
print(results)

result_list = []
for result in results:
    result_list.append(result)

# Write this to a file.
with open(f'cda_derived_from_{submitter_id}.json', 'w') as f:
    json.dump(result_list, f, indent=4, sort_keys=True)

result_list


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen WHERE (_Specimen.derived_from_subject = '09CO022')
Offset: 0
Limit: 100
Count: 11
More pages: No



[{'days_to_birth': None,
  'race': 'black or african american',
  'sex': 'female',
  'ethnicity': 'not hispanic or latino',
  'id': '4591a53d-5668-4a70-b44b-e08a3d59267e',
  'ResearchSubject': [{'Diagnosis': [{'morphology': '8140/3',
      'tumor_stage': 'Stage IIB',
      'tumor_grade': 'Not Reported',
      'Treatment': [],
      'id': '7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb',
      'primary_diagnosis': 'Adenocarcinoma, NOS',
      'age_at_diagnosis': None}],
    'Specimen': [{'File': [{'label': 'fbc0c313-d356-4ad9-8257-57e90fb7f26b.wxs.Pindel.somatic_annotation.vcf.gz',
        'associated_project': ['CPTAC-2'],
        'drs_uri': 'drs://dg.4DFC:00a9219e-109c-4013-89c8-3d01b39cd9bf',
        'identifier': [{'system': 'GDC',
          'value': '00a9219e-109c-4013-89c8-3d01b39cd9bf'}],
        'data_category': 'Simple Nucleotide Variation',
        'byte_size': 248056,
        'type': None,
        'file_format': None,
        'checksum': 'e09e988c4969cf68fac8168fa622df9f',
        'id'
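
One way to check which nodes contributed to these results is to tally the `identifier.system` values attached to the files. The sketch below assumes each result has the nested shape printed above. `files_by_system` is a hypothetical helper, the records in `example_records` are invented placeholders (including the `"PDC"` system value and the `rec-…`/`file-…` IDs), and files are de-duplicated in case the same record appears in more than one result row.

```python
from collections import Counter


def files_by_system(records):
    """Count distinct file identifiers per source system across CDA records."""
    seen = set()
    for record in records:
        for subject in record.get("ResearchSubject", []):
            for specimen in subject.get("Specimen", []):
                for file_entry in specimen.get("File", []):
                    for identifier in file_entry.get("identifier", []):
                        seen.add((identifier.get("system"), identifier.get("value")))
    return Counter(system for system, _ in seen)


def record_with_files(record_id, system, file_ids):
    """Build a placeholder record with the nesting used by CDA results."""
    files = [{"identifier": [{"system": system, "value": f}]} for f in file_ids]
    return {"id": record_id, "ResearchSubject": [{"Specimen": [{"File": files}]}]}


# Invented placeholder records; the first one is repeated on purpose.
example_records = [
    record_with_files("rec-1", "GDC", ["file-a", "file-b"]),
    record_with_files("rec-1", "GDC", ["file-a", "file-b"]),
    record_with_files("rec-2", "PDC", ["file-c"]),
]

counts = files_by_system(example_records)
distinct_ids = sorted({record["id"] for record in example_records})
print(counts)
print(distinct_ids)
```

With the real results, `files_by_system(result_list)` would show whether files from more than one node were returned for subject 09CO022.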