

This GA4GH Data Connect table represents the NCI Cancer Data Aggregator (CDA) Minimal Viable Product (MVP) v3 used in CDA testing in May 2021.

The aim of this notebook is to compare queries made via Data Connect with direct use of the CDA API. Looking at both the Data Connect and CDA APIs shows how data integration and aggregation can be performed, and the two APIs may help inform one another.

Some of the equivalent queries in this notebook were explored using the CDA API in the following GitHub repository: https://github.com/ianfore/cdatest.


## Schema as represented in Data Connect
What information does the schema contain for the CDA table? The following shows that the schema duplicates the BigQuery definition of the table.

Reading a machine-readable schema from CDA may allow creation of a schema that provides more information to (a) a human user and (b) a query toolset that can make use of a machine-readable schema. See below.



In [4]:
from fasp.search import DataConnectClient
cl = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')
cl.listTableInfo('search_cloud.cshcodeathon.cda_mvp_v3', verbose=True)

_Schema for tablesearch_cloud.cshcodeathon.cda_mvp_v3_
{
   "name": "search_cloud.cshcodeathon.cda_mvp_v3",
   "description": "Automatically generated schema",
   "data_model": {
      "$id": "https://ga4gh-search-adapter-presto-public.prod.dnastack.com/table/search_cloud.cshcodeathon.cda_mvp_v3/info",
      "description": "Automatically generated schema",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "properties": {
         "days_to_birth": {
            "format": "bigint",
            "type": "int",
            "$comment": "bigint"
         },
         "race": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "sex": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "ethnicity": {
            "format": "varchar",
            "type": "string",
            "$comment": "varchar"
         },
         "id": {
            "format": 

<fasp.search.data_connect_client.SearchSchema at 0x1251ba940>
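
To make the point about limited information concrete, the sketch below lists each top-level column with its type and whether it carries a description. It assumes the schema is a plain dictionary shaped like the output printed above; `sample_schema` is a hand-copied excerpt of that output and `summarize_columns` is a hypothetical helper, not part of the fasp library.

```python
# Hand-copied excerpt of the schema printed above (first three columns only).
sample_schema = {
    "name": "search_cloud.cshcodeathon.cda_mvp_v3",
    "description": "Automatically generated schema",
    "data_model": {
        "properties": {
            "days_to_birth": {"format": "bigint", "type": "int", "$comment": "bigint"},
            "race": {"format": "varchar", "type": "string", "$comment": "varchar"},
            "sex": {"format": "varchar", "type": "string", "$comment": "varchar"},
        }
    },
}


def summarize_columns(schema):
    """Return (column, type, has_description) tuples for top-level properties."""
    props = schema.get("data_model", {}).get("properties", {})
    return [(name, spec.get("type"), "description" in spec)
            for name, spec in props.items()]


for name, col_type, described in summarize_columns(sample_schema):
    print(f"{name}: type={col_type}, described={described}")
```

None of the columns in the excerpt has a description, which is the gap that a richer machine-readable schema could fill.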

Save the schema to a file

In [None]:
import json

tableInfo = cl.listTableInfo('search_cloud.cshcodeathon.cda_mvp_v3')
with open('cda_mvp_v3_dc_schema.json', 'w', encoding='utf-8') as f:
    json.dump(tableInfo.schema, f, ensure_ascii=False, indent=3)

As in other cases, the schema above, loaded from BigQuery, provides only limited information. In those other cases the Data Connect schema has been enhanced by importing from a machine-readable schema provided by the source system, for example by reading dbGaP data dictionaries. For CDA MVP v3, JSON Schemas were published [here](https://github.com/CancerDataAggregator/cda-data-model). Separate JSON Schemas were provided for the different CDA objects, as opposed to the single nested schema above. For example, this is the [CDA schema for Specimen](https://github.com/CancerDataAggregator/cda-data-model/blob/main/src/schema/json/Specimen.json). Automated translation of the schema may be possible. For now, a [manually integrated schema](cda_mvp_v3_dc_schema_integrated.json) was created, incorporating descriptions from the CDA schemas.
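
As a rough illustration of what an automated translation might involve, the sketch below copies `description` values from a source schema's properties into a Data Connect style schema, matching on property name. It is not how the integrated schema above was produced (that was done manually). The helper `add_descriptions` is hypothetical, and the description strings are placeholders, not text from the CDA schemas.

```python
import copy


def add_descriptions(dc_schema, source_properties):
    """Return a copy of dc_schema with descriptions taken from source_properties.

    Properties are matched by exact name; properties with no match are left as is.
    """
    merged = copy.deepcopy(dc_schema)
    for name, spec in merged["data_model"]["properties"].items():
        source = source_properties.get(name, {})
        if "description" in source:
            spec["description"] = source["description"]
    return merged


# Excerpt of the Data Connect schema shown earlier.
dc_schema = {
    "data_model": {
        "properties": {
            "race": {"format": "varchar", "type": "string"},
            "sex": {"format": "varchar", "type": "string"},
        }
    }
}

# Placeholder stand-in for the "properties" section of a source JSON Schema.
source_properties = {
    "sex": {"type": "string", "description": "Placeholder description for sex."},
}

merged = add_descriptions(dc_schema, source_properties)
print(merged["data_model"]["properties"]["sex"])
```

Because the CDA schemas are split per object while the Data Connect schema is nested, a real translation would also need to map each nested level (for example Specimen) to the right source file; the sketch only handles one flat level.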

## Queries

The following repeats one of the queries conducted using the CDA API. In that case the query required the use of SQL rather than the JSON query format the CDA API provides. The JSON approach provides only limited capability to combine criteria within a query.

The query identifies unique subjects with data in the CPTAC-2 study with Stage IIB colon cancer.

In [3]:
query1 = '''SELECT distinct p.id FROM search_cloud.cshcodeathon.cda_mvp_v3 AS p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') 
AND (_Diagnosis.tumor_stage = 'Stage IIB')) 
AND (_ResearchSubject.primary_disease_site = 'Colon'))'''

res = cl.runQuery(query1)
res

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________
____Page13_______________
____Page14_______________
____Page15_______________
____Page16_______________


[['09CO022'], ['15CO002'], ['05CO039'], ['05CO044']]
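
The result is a list of rows, each row itself a list with one value per selected column. For a single-column query it can be convenient to flatten this into a plain list of identifiers. A minimal sketch, using the rows shown above as literal data:

```python
# Rows as returned above: one single-element list per subject.
rows = [['09CO022'], ['15CO002'], ['05CO039'], ['05CO044']]

# Take the first (only) column of each row and sort for stable display.
subject_ids = sorted(row[0] for row in rows)
print(subject_ids)
```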

The second query obtains all attributes for the subjects found by the query above. It uses the previous query as a subquery.

In [5]:
query2 = ''' select * from search_cloud.cshcodeathon.cda_mvp_v3
where id in
(SELECT distinct p.id FROM search_cloud.cshcodeathon.cda_mvp_v3 AS p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') 
AND (_Diagnosis.tumor_stage = 'Stage IIB')) 
AND (_ResearchSubject.primary_disease_site = 'Colon')) )'''

res2 = cl.runQuery(query2)

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________
____Page13_______________
____Page14_______________
____Page15_______________
____Page16_______________
____Page17_______________
____Page18_______________
____Page19_______________
____Page20_______________
____Page21_______________
____Page22_______________
____Page23_______________
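
Since the same subquery is reused in several queries in this notebook, one option is to hold it in a single string and compose the outer query around it. The sketch below only builds query text and does not contact the server; `TABLE`, `SUBJECT_FILTER` and `select_for_subjects` are names introduced here for illustration.

```python
TABLE = 'search_cloud.cshcodeathon.cda_mvp_v3'

# The subject filter used above, kept in one place.
SUBJECT_FILTER = f'''SELECT distinct p.id FROM {TABLE} AS p,
UNNEST(ResearchSubject) AS _ResearchSubject,
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2')
AND (_Diagnosis.tumor_stage = 'Stage IIB'))
AND (_ResearchSubject.primary_disease_site = 'Colon'))'''


def select_for_subjects(columns=None, subquery=SUBJECT_FILTER, table=TABLE):
    """Build an outer query returning the given columns (or *) for matching ids."""
    column_list = ', '.join(columns) if columns else '*'
    return f'select {column_list} from {table}\nwhere id in\n({subquery})'


print(select_for_subjects(['days_to_birth', 'race', 'sex']))
```

The resulting string could then be passed to `cl.runQuery` in the same way as the hand-written queries.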


Export the query results to a file.

In [6]:
import json
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(res2, f, ensure_ascii=False, indent=3)


Also dump the schema to a file

In [9]:
tableInfo = cl.listTableInfo('search_cloud.cshcodeathon.cda_mvp_v3')
with open('cda_mvp_v3_dc_schema.json', 'w', encoding='utf-8') as f:
    json.dump(tableInfo.schema, f, ensure_ascii=False, indent=3)

The same WHERE clause, but requesting only some named columns.

In [15]:
cl2 = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com', debug=True)
query3 = ''' select days_to_birth, race, sex, researchsubject
from search_cloud.cshcodeathon.cda_mvp_v3
where id in
(SELECT distinct p.id FROM search_cloud.cshcodeathon.cda_mvp_v3 AS p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') 
AND (_Diagnosis.tumor_stage = 'Stage IIB')) 
AND (_ResearchSubject.primary_disease_site = 'Colon')) )'''

res = cl.runQuery(query3)
res

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________
____Page13_______________
____Page14_______________
____Page15_______________
____Page16_______________
____Page17_______________
____Page18_______________
____Page19_______________
____Page20_______________
____Page21_______________
____Page22_______________
____Page23_______________
____Page24_______________
____Page25_______________
____Page26_______________
____Page27_______________
____Page28_______________
____Page29_______________
____Page30_______________
____Page31_______________
____Page32_______________
____Page33_______________
____Page34_______________
____Page35_______________
____Page36_______________
____Page37_______________


[[None,
  'white',
  'male',
  [{'Diagnosis': [{'morphology': '8140/3',
      'tumor_stage': 'Stage IIB',
      'tumor_grade': 'Not Reported',
      'Treatment': [],
      'id': '49674766-d911-4896-8978-c01ed945c4e9',
      'primary_diagnosis': 'Adenocarcinoma, NOS',
      'age_at_diagnosis': None}],
    'Specimen': [{'File': [{'label': '7fb3b3a3-8dba-4ecd-8e7b-d61c9f7627a0.wxs.MuTect2.somatic_annotation.vcf.gz',
        'associated_project': ['CPTAC-2'],
        'drs_uri': 'drs://dg.4DFC:098e18d4-5ece-4bc6-9a79-68f5082da9bc',
        'identifier': [{'system': 'GDC',
          'value': '098e18d4-5ece-4bc6-9a79-68f5082da9bc'}],
        'data_category': 'Simple Nucleotide Variation',
        'byte_size': 11606042,
        'type': None,
        'file_format': None,
        'checksum': '351a76a43d5614d938045d9ff028c4ac',
        'id': '098e18d4-5ece-4bc6-9a79-68f5082da9bc',
        'data_type': 'Annotated Somatic Mutation'},
       {'label': '3bb17a95-aa01-4a01-97f5-1f02907d7bf1.wxs.MuSE.a

Note that in the result above the attribute names are missing at the top level.
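
When the column list is known, as it is for query3, the positional rows can be paired with their names by hand. A minimal sketch, assuming the rows come back in the order of the SELECT list; the sample row is abbreviated from the output above, with the nested `researchsubject` value replaced by an empty list.

```python
# Column names in the order they appear in the SELECT list of query3.
columns = ['days_to_birth', 'race', 'sex', 'researchsubject']

# Abbreviated stand-in for the first row of the result above.
rows = [[None, 'white', 'male', []]]

# Pair each value with its column name.
records = [dict(zip(columns, row)) for row in rows]
print(records[0]['race'], records[0]['sex'])
```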

We can get the attribute names by specifying a return type of json. The cell below reruns query2, so all columns are returned.

In [20]:
res = cl.runQuery(query2,returnType='json')
res

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________
____Page9_______________
____Page10_______________
____Page11_______________
____Page12_______________
____Page13_______________
____Page14_______________
____Page15_______________
____Page16_______________
____Page17_______________
____Page18_______________
____Page19_______________
____Page20_______________
____Page21_______________
____Page22_______________
____Page23_______________
____Page24_______________
____Page25_______________
____Page26_______________
____Page27_______________
____Page28_______________
____Page29_______________
____Page30_______________
____Page31_______________
____Page32_______________
____Page33_______________
____Page34_______________
____Page35_______________
____Page36_______________
____Page37_______________
____Page38______________

[{'days_to_birth': None,
  'race': 'white',
  'sex': 'male',
  'ethnicity': 'not hispanic or latino',
  'id': '05CO039',
  'researchsubject': [{'Diagnosis': [{'morphology': '8140/3',
      'tumor_stage': 'Stage IIB',
      'tumor_grade': 'Not Reported',
      'Treatment': [],
      'id': '49674766-d911-4896-8978-c01ed945c4e9',
      'primary_diagnosis': 'Adenocarcinoma, NOS',
      'age_at_diagnosis': None}],
    'Specimen': [{'File': [{'label': '7fb3b3a3-8dba-4ecd-8e7b-d61c9f7627a0.wxs.MuTect2.somatic_annotation.vcf.gz',
        'associated_project': ['CPTAC-2'],
        'drs_uri': 'drs://dg.4DFC:098e18d4-5ece-4bc6-9a79-68f5082da9bc',
        'identifier': [{'system': 'GDC',
          'value': '098e18d4-5ece-4bc6-9a79-68f5082da9bc'}],
        'data_category': 'Simple Nucleotide Variation',
        'byte_size': 11606042,
        'type': None,
        'file_format': None,
        'checksum': '351a76a43d5614d938045d9ff028c4ac',
        'id': '098e18d4-5ece-4bc6-9a79-68f5082da9bc',
      

Save the JSON result to a file.

In [21]:
with open('data2.json', 'w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=3)
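
With attribute names present, the nested result can be walked directly. The sketch below collects the DRS URIs of all files, following the `researchsubject` → `Specimen` → `File` nesting visible in the output above. `collect_drs_uris` is a hypothetical helper, and `sample` is a heavily trimmed copy of one record rather than a real query result.

```python
def collect_drs_uris(records):
    """Return the sorted, de-duplicated DRS URIs found in JSON-style records."""
    uris = set()
    for record in records:
        for subject in record.get('researchsubject') or []:
            for specimen in subject.get('Specimen') or []:
                for file_entry in specimen.get('File') or []:
                    uri = file_entry.get('drs_uri')
                    if uri:
                        uris.add(uri)
    return sorted(uris)


# Trimmed copy of one record from the output above (most fields omitted).
sample = [{
    'id': '05CO039',
    'researchsubject': [{
        'Diagnosis': [],
        'Specimen': [{
            'File': [{
                'drs_uri': 'drs://dg.4DFC:098e18d4-5ece-4bc6-9a79-68f5082da9bc',
                'data_type': 'Annotated Somatic Mutation',
            }]
        }],
    }],
}]

print(collect_drs_uris(sample))
```

The `or []` guards are there in case a nested attribute is missing or null for some records, which is an assumption about the data rather than something observed above.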