# GDC to CRDC-H Conversion

This notebook demonstrates one method for converting GDC data into CRDC-H instance data: reading node data as JSON and writing it out using the CRDC-H LinkML model. LinkML can [generate](https://github.com/linkml/linkml#python-dataclasses) [Python DataClasses](https://docs.python.org/3/library/dataclasses.html) from the model, and instances of these classes can then be exported in several data publication formats, such as JSON or RDF.
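As a rough illustration of that last step, the sketch below exports a dataclass instance as JSON using only the standard library. It does not use the generated CRDC-H classes: `ExampleIdentifier`, `ExampleSpecimen` and `to_json` are hypothetical names invented for this example, and the real generated classes carry many more fields.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for generated classes; not the real CRDC-H model.
@dataclass
class ExampleIdentifier:
    system: str
    value: str


@dataclass
class ExampleSpecimen:
    id: str
    identifier: List[ExampleIdentifier] = field(default_factory=list)
    source_material_type: Optional[str] = None


def to_json(instance) -> str:
    """Serialize a dataclass instance to JSON, dropping fields set to None."""
    def drop_none(pairs):
        return {key: value for key, value in pairs if value is not None}
    return json.dumps(asdict(instance, dict_factory=drop_none), indent=2)


example_specimen = ExampleSpecimen(
    id='bdc73f48-dc0b-487d-abbe-e3a977b6830a',
    identifier=[ExampleIdentifier(system='GDC', value='AD6426_sample')],
)
print(to_json(example_specimen))
```

Dropping unset fields keeps the exported JSON compact, which matters for models in which most slots are optional.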

## Setup

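We start by installing the [LinkML](https://pypi.org/project/linkml/) and [pandas](https://pypi.org/project/pandas/) packages. The install commands in the cell below are commented out; uncomment them if the packages are not already installed.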

In [1]:
import sys

# Install LinkML.
# We use our own fork of LinkML, but all changes made to this repository will eventually be sent
# upstream to the main LinkML release.
#!{sys.executable} -m pip install git+https://github.com/cancerDHC/linkml.git@ccdh-dev#egg=linkml

# Install pandas.
#!{sys.executable} -m pip install pandas

## Loading GDC data as an example

Next, we load the result of a GDC query, stored as JSON, and preview it as a pandas DataFrame.

In [2]:
import json
import pandas

with open('head-and-mouth/gdc-head-and-mouth.json') as file:
    gdc_head_and_mouth = json.load(file)
    
pandas.DataFrame(gdc_head_and_mouth)

Unnamed: 0,aliquot_ids,case_id,created_datetime,diagnoses,diagnosis_ids,disease_type,id,primary_site,sample_ids,samples,...,submitter_sample_ids,submitter_slide_ids,updated_datetime,analyte_ids,portion_ids,submitter_analyte_ids,submitter_portion_ids,days_to_lost_to_followup,index_date,lost_to_followup
0,[cfcde639-3045-4f66-84a6-ec74b090a5b6],cd7e514f-71ba-4cc1-b74a-a22c6248169c,2017-06-01T08:57:57.249456-05:00,"[{'age_at_diagnosis': 19592, 'classification_o...",[5d2d67d1-4611-4a18-9a66-89823aaa8e3c],Adenomas and Adenocarcinomas,cd7e514f-71ba-4cc1-b74a-a22c6248169c,Nasopharynx,[bdc73f48-dc0b-487d-abbe-e3a977b6830a],[{'created_datetime': '2017-06-01T10:44:57.790...,...,[AD6426_sample],[AD6426_slide],2018-10-25T11:34:27.425461-05:00,,,,,,,
1,"[9069bdd7-e16a-462c-881c-581c8aab6910, a74915f...",9023c9bf-02a0-4396-8161-304089957b62,,"[{'age_at_diagnosis': 24286, 'ajcc_clinical_m'...",[706b1290-3a85-54ea-a123-e8bd14b085bc],Squamous Cell Neoplasms,9023c9bf-02a0-4396-8161-304089957b62,Larynx,"[8b2588c8-4261-492b-b173-2490a5de668f, badeaed...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-CN-6012-10A, TCGA-CN-6012-01A, TCGA-CN-6...","[TCGA-CN-6012-01Z-00-DX1, TCGA-CN-6012-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[80c6fde2-b6bb-4f40-908a-f116c466d296, 6f77017...","[bada788e-5112-4d21-a079-72729bd0cc83, fe24eea...","[TCGA-CN-6012-01A-11D, TCGA-CN-6012-10A-01W, T...","[TCGA-CN-6012-01A-13-2072-20, TCGA-CN-6012-10A...",,,
2,"[8f695cd3-01dd-4601-8b17-37cf40514422, f0e325f...",55f96a9c-e2c8-4243-8a7e-94bc6fab73a6,,"[{'age_at_diagnosis': 20992, 'ajcc_clinical_m'...",[40954a8e-e4c2-5604-937b-0a79ac7489d2],Squamous Cell Neoplasms,55f96a9c-e2c8-4243-8a7e-94bc6fab73a6,Larynx,"[a7692585-a129-4671-bfe5-98342a326776, b069c55...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-CV-7261-01Z, TCGA-CV-7261-11A, TCGA-CV-7...","[TCGA-CV-7261-01A-01-TS1, TCGA-CV-7261-01Z-00-...",2019-08-06T14:26:28.608672-05:00,"[a72f2de7-eb40-4818-a104-edb508d5517b, e8120e5...","[177fa10b-0135-468d-b5a3-6f30cc3cd390, f51d76a...","[TCGA-CV-7261-10A-01D, TCGA-CV-7261-01A-11R, T...","[TCGA-CV-7261-10A-01, TCGA-CV-7261-01A-13-2074...",,,
3,"[1265fd12-4706-43b0-84f3-d16d46f20963, 3443e1b...",c9a36eb5-ac3e-424e-bc2e-303de7105957,,"[{'age_at_diagnosis': 21886, 'ajcc_clinical_m'...",[48e8dd81-ed4d-5c54-af66-84e86477d5c8],Squamous Cell Neoplasms,c9a36eb5-ac3e-424e-bc2e-303de7105957,Oropharynx,"[256469d0-5f36-4966-bf4f-3b4297e55f43, bd90f96...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-BA-A6DL-10A, TCGA-BA-A6DL-01Z, TCGA-BA-A...","[TCGA-BA-A6DL-01Z-00-DX1, TCGA-BA-A6DL-01A-02-...",2019-08-06T14:25:14.243346-05:00,"[ec4487c1-6976-4161-9236-5e6810ed31b7, ffd1e03...","[7f327ef6-4fe6-40c8-aac7-731e051177bb, 2a4b0be...","[TCGA-BA-A6DL-01A-21D, TCGA-BA-A6DL-01A-21R, T...","[TCGA-BA-A6DL-10A-01, TCGA-BA-A6DL-01A-11-A45L...",,,
4,"[59b70846-64f0-489e-8ea5-84a347aedeb8, c8e46ce...",4cffea0b-90a7-4c86-a73f-bb8feca3ada7,,"[{'age_at_diagnosis': 14190, 'ajcc_clinical_m'...",[1da5c51a-ee25-51a6-a4c2-27d8fdcbe24e],Squamous Cell Neoplasms,4cffea0b-90a7-4c86-a73f-bb8feca3ada7,Tonsil,"[1ed245de-fea4-42c9-9197-773bcd12d2a8, 665d4bf...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-CN-5365-01Z, TCGA-CN-5365-10A, TCGA-CN-5...","[TCGA-CN-5365-01Z-00-DX1, TCGA-CN-5365-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[d46b5e9b-3652-45a1-a91d-46277aea3916, 35122dd...","[38c5a4c1-6d01-4885-ba35-0032e6b835b0, 516f802...","[TCGA-CN-5365-01A-01D, TCGA-CN-5365-01A-01W, T...","[TCGA-CN-5365-10A-01, TCGA-CN-5365-01A-21-2072...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,"[1d3b16fd-f98b-45ef-a423-861975f098b6, 0eabe3e...",97640ef0-0259-4244-95ba-48d28c60b372,,"[{'age_at_diagnosis': 19621, 'ajcc_clinical_m'...",[b725e6d2-92c0-5585-9de7-14bb623b472e],Squamous Cell Neoplasms,97640ef0-0259-4244-95ba-48d28c60b372,Larynx,"[fb06ae75-8516-4cdc-ba9e-093444907fc7, 5162217...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-CN-4738-01A, TCGA-CN-4738-01Z, TCGA-CN-4...","[TCGA-CN-4738-01Z-00-DX1, TCGA-CN-4738-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[4dc95dbe-b10f-4d6e-9413-ae47a0a49865, e637c1c...","[56c7d4e4-5703-4686-98b1-0c3125e5913e, 60d72bd...","[TCGA-CN-4738-01A-02D, TCGA-CN-4738-10A-01W, T...","[TCGA-CN-4738-01A-31-2072-20, TCGA-CN-4738-01A...",,,
556,[96f09bc8-a194-482c-bd17-baf28739e4f8],422a72e7-fe76-411d-b59e-1f0f0812c3cf,2018-09-13T13:42:10.444091-05:00,"[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[842d6984-7c03-4ab6-95db-42fa2ea699db],Squamous Cell Neoplasms,422a72e7-fe76-411d-b59e-1f0f0812c3cf,Larynx,[6f9eeaa3-8bd1-479c-a0fc-98317eb458dc],"[{'biospecimen_anatomic_site': None, 'biospeci...",...,[GENIE-DFCI-010671-11105],,2019-11-18T13:54:59.294543-06:00,,,,,,Initial Genomic Sequencing,
557,"[cd211e89-63f7-44f0-8a76-51703ae45112, 866292c...",4b50aea4-4ad1-4bf6-9cf1-984c28a99c84,,"[{'age_at_diagnosis': 21731, 'ajcc_clinical_m'...",[95d85e5a-b82c-59f8-b7ad-710e019cdebc],Squamous Cell Neoplasms,4b50aea4-4ad1-4bf6-9cf1-984c28a99c84,Hypopharynx,"[1077bf93-cf23-41db-925c-c633921894cc, 4a0d79f...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-TN-A7HL-01A, TCGA-TN-A7HL-01Z, TCGA-TN-A...","[TCGA-TN-A7HL-01Z-00-DX1, TCGA-TN-A7HL-01A-01-...",2019-08-06T14:27:14.277986-05:00,"[6ffc3548-d593-47ab-adf8-6d73075b5fa0, 9426e53...","[cd5864c8-b4e0-4405-b7df-1e0a51865670, f9ef56b...","[TCGA-TN-A7HL-01A-11R, TCGA-TN-A7HL-01A-11D, T...","[TCGA-TN-A7HL-01A-21-A45L-20, TCGA-TN-A7HL-10A...",,,
558,"[0c2f310b-fa59-4f6f-894a-dad920214004, 6ddd527...",0394060d-010e-405f-983d-db525f01f2c3,,"[{'age_at_diagnosis': 23640, 'ajcc_clinical_m'...",[7a67eecc-6f46-5181-8b64-c022d0fd0060],Squamous Cell Neoplasms,0394060d-010e-405f-983d-db525f01f2c3,Hypopharynx,"[5c2b4403-cdd4-4550-ba01-d8ebad9fcbc8, 4467ee1...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-BB-A5HY-10A, TCGA-BB-A5HY-01A, TCGA-BB-A...","[TCGA-BB-A5HY-01Z-00-DX1, TCGA-BB-A5HY-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[ee7f98c5-9c78-4bbe-b44a-a9e357a18058, 2d82983...","[9576f242-6874-4df9-8744-e0755d565358, 8842cbd...","[TCGA-BB-A5HY-01A-11W, TCGA-BB-A5HY-01A-11D, T...","[TCGA-BB-A5HY-01A-11, TCGA-BB-A5HY-10A-01]",,,


## Loading the Python classes for the CRDC-H model

We previously generated the Python DataClasses for the CRDC-H model. We can now load these DataClasses and use them to transform data from the GDC model to the CRDC-H model.

In [3]:
from ccdh import ccdhmodel as ccdh
import pprint
import json

from linkml.dumpers.yaml_dumper import dumps

pp = pprint.PrettyPrinter(indent=4)

def create_coding(
    code,
    system=None,
    displayLabel=None,
    systemURL=None,
    systemVersion=None
):
    return {
        'code': code,
        'displayLabel': displayLabel,
        'system': system,
        'systemURL': systemURL,
        'systemVersion': systemVersion
    }

def create_codeable_concept(coding=None, text=None):
    return {
        'coding': coding,
        'text': text
    }

def create_identifier(system, value):
    return ccdh.Identifier(
        # type='unknown', # No types currently defined.
        system=system,
        value=value
    )
        

## Convert the input data in pieces

For demonstration purposes, we'll translate individual pieces of a GDC case record into CRDC-H instance data before converting whole cases.

Let's start with the entries under the `samples` key, each of which corresponds to a [Specimen](https://cancerdhc.github.io/ccdhmodel/entities/Specimen.html) in the CRDC-H model.

In [4]:
firstSample = gdc_head_and_mouth[0]['samples'][0]
firstSample

{'created_datetime': '2017-06-01T10:44:57.790971-05:00',
 'sample_id': 'bdc73f48-dc0b-487d-abbe-e3a977b6830a',
 'sample_type': 'Metastatic',
 'state': 'released',
 'submitter_id': 'AD6426_sample',
 'tissue_type': 'Not Reported',
 'tumor_descriptor': 'Metastatic',
 'updated_datetime': '2018-11-15T21:10:03.529893-06:00'}

In [5]:
def map_sample(sample):
    specimen:ccdh.Specimen = ccdh.Specimen(id = sample.get('sample_id'))
    specimen.identifier = [create_identifier('GDC', sample['submitter_id'])]
    specimen.source_material_type = sample.get('sample_type')
    specimen.general_tissue_morphology = sample.get('tissue_type')
    specimen.tumor_status_at_collection = sample.get('tumor_descriptor')
    specimen.creation_activity = ccdh.SpecimenCreationActivity(
        date_ended=ccdh.TimePoint(
            dateTime=sample.get('created_datetime')
        )
    )
    return specimen

first_specimen = map_sample(firstSample)

# Unmapped fields:
# - state: released (https://github.com/NCI-GDC/gdcdictionary/blob/develop/gdcdictionary/schemas/_definitions.yaml#L128)
# - updated_datetime

first_specimen
print(dumps(first_specimen))

id: bdc73f48-dc0b-487d-abbe-e3a977b6830a
identifier:
- value: AD6426_sample
  system: GDC
source_material_type: Metastatic
tumor_status_at_collection: Metastatic
creation_activity:
  date_ended:
    dateTime: '2017-06-01T10:44:57.790971-05:00'
general_tissue_morphology: Not Reported



In [6]:
specimens = list(map(lambda entry: list(map(map_sample, entry['samples'])), gdc_head_and_mouth))
specimens[1:3]

[[Specimen(id='8b2588c8-4261-492b-b173-2490a5de668f', identifier=[Identifier(value='TCGA-CN-6012-01Z', system='GDC', type=None)], description=None, specimen_type=None, analyte_type=None, associated_project=None, data_provider=None, source_material_type='Primary Tumor', parent_specimen=[], source_subject=None, source_model_system=None, tumor_status_at_collection=None, creation_activity=SpecimenCreationActivity(activity_type=None, date_started=None, date_ended=TimePoint(id=None, dateTime='2018-05-17T12:19:46.292188-05:00', indexTimePoint=None, offsetFromIndex=None, eventType=[]), performed_by=None, collection_method_type=None, derivation_method_type=None, additive=[], collection_site=None, quantity_collected=None, execution_observation=[], specimen_order=None), processing_activity=[], storage_activity=[], transport_activity=[], contained_in=None, dimensional_measure=None, quantity_measure=[], quality_measure=[], cellular_composition_type=None, histological_composition_measure=[], general
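The expression above yields one list of specimens per case. If a single flat list is more convenient, `itertools.chain.from_iterable` can flatten it. The sketch below uses plain dictionaries so that it runs without the model classes; `example_cases` and `map_sample_stub` are made-up names standing in for the GDC data and `map_sample`.

```python
from itertools import chain

# Made-up input shaped like the GDC cases above (two cases, three samples).
example_cases = [
    {'samples': [{'sample_id': 'sample-a'}]},
    {'samples': [{'sample_id': 'sample-b'}, {'sample_id': 'sample-c'}]},
]


def map_sample_stub(sample):
    """Stand-in for map_sample(); returns a dict instead of a Specimen."""
    return {'id': sample.get('sample_id')}


# One list of mapped specimens per case, as in the cell above.
specimens_per_case = [
    [map_sample_stub(sample) for sample in case.get('samples', [])]
    for case in example_cases
]

# A single flat list across all cases.
all_specimens = list(chain.from_iterable(specimens_per_case))
print(len(specimens_per_case), len(all_specimens))
```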

In [7]:
diagnoses = []
for case in gdc_head_and_mouth:
    diagnoses.extend(case['diagnoses'])

for index, diagnosis in enumerate(diagnoses):
    diagnosis['index'] = index

# pandas.DataFrame(sorted(filter(lambda d: d['created_datetime'] is not None, diagnoses), key=lambda d: d['created_datetime'], reverse=True))
pandas.set_option("display.max_rows", None)
pandas.DataFrame(diagnoses).describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age_at_diagnosis,261.0,20901.210728,5267.246124,141.0,18718.0,21689.0,24192.0,32871.0
days_to_last_follow_up,155.0,924.387097,763.86129,1.0,378.5,685.0,1363.0,4241.0
index,560.0,279.5,161.802349,0.0,139.75,279.5,419.25,559.0
days_to_diagnosis,201.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
year_of_diagnosis,201.0,2008.154229,4.780282,1993.0,2006.0,2010.0,2011.0,2013.0
anaplasia_present,0.0,,,,,,,
anaplasia_present_type,0.0,,,,,,,
ann_arbor_b_symptoms,0.0,,,,,,,
ann_arbor_clinical_stage,0.0,,,,,,,
ann_arbor_extranodal_involvement,0.0,,,,,,,
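By default, `describe()` summarizes numeric columns only, so text fields such as `primary_diagnosis` are missing from the table above. To see how often every field is populated regardless of type, one option is a simple counter over the raw records. This is a sketch on made-up records, and `count_populated_fields` is a hypothetical helper name.

```python
from collections import Counter

# Made-up records shaped like the GDC diagnoses above.
example_diagnoses = [
    {'primary_diagnosis': 'Squamous cell carcinoma, NOS', 'age_at_diagnosis': 24286},
    {'primary_diagnosis': 'Adenocarcinoma, NOS', 'age_at_diagnosis': None},
    {'primary_diagnosis': None},
]


def count_populated_fields(records):
    """Count, per field name, how many records hold a non-None value."""
    counts = Counter()
    for record in records:
        counts.update(key for key, value in record.items() if value is not None)
    return counts


field_counts = count_populated_fields(example_diagnoses)
for name, count in field_counts.most_common():
    print(f'{name}: {count}')
```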


In [8]:
pandas.DataFrame(filter(lambda d: d['days_to_last_follow_up'] is not None, diagnoses))
diagnoses[1]

{'age_at_diagnosis': 24286,
 'ajcc_clinical_m': 'M0',
 'ajcc_clinical_n': 'N1',
 'ajcc_clinical_stage': 'Stage III',
 'ajcc_clinical_t': 'T3',
 'ajcc_pathologic_n': 'N2c',
 'ajcc_pathologic_stage': 'Stage IVA',
 'ajcc_pathologic_t': 'T3',
 'ajcc_staging_system_edition': '7th',
 'classification_of_tumor': 'not reported',
 'created_datetime': None,
 'days_to_diagnosis': 0,
 'days_to_last_follow_up': 1460,
 'days_to_last_known_disease_status': None,
 'days_to_recurrence': None,
 'diagnosis_id': '706b1290-3a85-54ea-a123-e8bd14b085bc',
 'icd_10_code': 'C32.9',
 'last_known_disease_status': 'not reported',
 'morphology': '8070/3',
 'primary_diagnosis': 'Squamous cell carcinoma, NOS',
 'prior_malignancy': 'no',
 'prior_treatment': 'No',
 'progression_or_recurrence': 'not reported',
 'site_of_resection_or_biopsy': 'Larynx, NOS',
 'state': 'released',
 'submitter_id': 'TCGA-CN-6012_diagnosis',
 'synchronous_malignancy': 'No',
 'tissue_or_organ_of_origin': 'Larynx, NOS',
 'tumor_grade': 'not rep

In [9]:
pandas.set_option("display.max_rows", 10)

In [10]:
def create_stage_observation(observation_type, value):
    # TODO: we use valueString, but we should really use valueCodeableConcept
    # once that is implemented.
    return ccdh.CancerStageObservation(
        observation_type=observation_type,
        valueString=value
    )

def create_body_site(site_name):
    site_mappings = {
        'Larynx, NOS': 'Larynx'
    }
    
    if site_name in site_mappings:
        return ccdh.BodySite(site=site_mappings[site_name])
    return None

def create_stage_from_gdc(diag, subject):
    cancer_stage_method_type = None
    if diag.get('ajcc_staging_system_edition') == '7th':
        cancer_stage_method_type = 'AJCC staging system 7th edition'

    obs = ccdh.CancerStageObservationSet(
           method_type=cancer_stage_method_type,
           subject=subject
    )
    obs.observations.extend([
        create_stage_observation('Overall', diag.get('tumor_stage')),
        create_stage_observation('Clinical Overall', diag.get('ajcc_clinical_stage')),
        create_stage_observation('Clinical Tumor (T)', diag.get('ajcc_clinical_t')),
        create_stage_observation('Clinical Node (N)', diag.get('ajcc_clinical_n')),
        create_stage_observation('Clinical Metastasis (M)', diag.get('ajcc_clinical_m')),
        create_stage_observation('Pathological Overall', diag.get('ajcc_pathologic_stage')),
        create_stage_observation('Pathological Tumor (T)', diag.get('ajcc_pathologic_t')),
        create_stage_observation('Pathological Node (N)', diag.get('ajcc_pathologic_n')),
        create_stage_observation('Pathological Metastasis (M)', diag.get('ajcc_pathologic_m'))
    ])
    return obs

def map_diagnosis(diag, subject=None):
    ccdh_diagnosis = ccdh.Diagnosis(
        id=diag.get('diagnosis_id'),
        # age_at_diagnosis=ccdh.Quantity(
        #    # TODO: 'unit' doesn't have values yet.
        #    valueDecimal=diag.get('age_at_diagnosis')
        #),
        condition=diag.get('primary_diagnosis'),
        morphology=diag.get('morphology'),
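        # TODO: this passes the literal string 'tissue_or_organ_of_origin' rather than
        # diag.get('tissue_or_organ_of_origin'), so create_body_site() returns None here.
        metastatic_site=create_body_site('tissue_or_organ_of_origin'),
        # TODO: the GDC records shown in this notebook use 'tumor_grade'; no 'grade'
        # key appears in them, so this lookup returns None for those records.
        grade=diag.get('grade'),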
        stage=create_stage_from_gdc(diag, subject)
        # year_at_diagnosis=diag.get('year_of_diagnosis')
    )
    ccdh_diagnosis.identifier = [
        create_identifier('GDC-submitter-id', diag.get('submitter_id'))
    ]

    return ccdh_diagnosis

# Unmapped fields:
# - classification_of_tumor
# - created_datetime
# - days_to_diagnosis
# - days_to_last_follow_up
# - days_to_last_known_disease_status
# - days_to_recurrence
# - icd_10_code
# - last_known_disease_status
# - site_of_resection_or_biopsy
# - state
# - synchronous_malignancy
# - tumor_grade
# - updated_datetime
# - year_of_diagnosis

firstDiagnosis = map_diagnosis(gdc_head_and_mouth[1]['diagnoses'][0])
# TODO: can't represent this as YAML because the YAML transformer doesn't know how to transform Decimal.
firstDiagnosis

Diagnosis(id='706b1290-3a85-54ea-a123-e8bd14b085bc', identifier=[Identifier(value='TCGA-CN-6012_diagnosis', system='GDC-submitter-id', type=None)], subject=None, age_at_diagnosis=None, year_at_diagnosis=None, condition=(text='Squamous cell carcinoma, NOS'), primary_site=[], metastatic_site=[], stage=[CancerStageObservationSet(id=None, category=None, focus=[], subject=None, method_type=[(text='AJCC staging system 7th edition', description='The 7th edition of the criteria developed by the American Joint Committee on Cancer (AJCC) in 2010, used for the classification and staging of neoplastic diseases.')], performed_by=None, observations=[CancerStageObservation(observation_type=(text='Overall', description='The overall stage of the disease'), id=None, category=None, method_type=None, focus=None, subject=None, performed_by=None, valueEntity=None, valueString='stage iva', valueInteger=None, valueDecimal=None, valueBoolean=None, valueDateTime=None, valueQuantity=None, valueCodeableConcept=No
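`create_stage_from_gdc` builds an observation for every staging field, including fields that the GDC record leaves empty. One possible refinement is to skip fields with no value. The sketch below shows the idea with plain dictionaries in place of `ccdh.CancerStageObservation`; `STAGE_FIELDS` and `stage_observations` are hypothetical names, and the table simply restates the label-to-field pairs used above.

```python
# Observation type label -> GDC diagnosis field, as in create_stage_from_gdc.
STAGE_FIELDS = {
    'Overall': 'tumor_stage',
    'Clinical Overall': 'ajcc_clinical_stage',
    'Clinical Tumor (T)': 'ajcc_clinical_t',
    'Clinical Node (N)': 'ajcc_clinical_n',
    'Clinical Metastasis (M)': 'ajcc_clinical_m',
    'Pathological Overall': 'ajcc_pathologic_stage',
    'Pathological Tumor (T)': 'ajcc_pathologic_t',
    'Pathological Node (N)': 'ajcc_pathologic_n',
    'Pathological Metastasis (M)': 'ajcc_pathologic_m',
}


def stage_observations(diag):
    """Build one observation dict per staging field that has a value."""
    return [
        {'observation_type': label, 'valueString': diag[field_name]}
        for label, field_name in STAGE_FIELDS.items()
        if diag.get(field_name) is not None
    ]


# Trimmed example: two populated fields, one explicit None, the rest absent.
example_diag = {
    'ajcc_clinical_stage': 'Stage III',
    'ajcc_clinical_t': 'T3',
    'ajcc_pathologic_m': None,
}
for observation in stage_observations(example_diag):
    print(observation)
```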

In [11]:
diagnoses = list(map(lambda entry: list(map(map_diagnosis, entry['diagnoses'])), gdc_head_and_mouth))
diagnoses[0:5]

[[Diagnosis(id='5d2d67d1-4611-4a18-9a66-89823aaa8e3c', identifier=[Identifier(value='AD6426_diagnosis', system='GDC-submitter-id', type=None)], subject=None, age_at_diagnosis=None, year_at_diagnosis=None, condition=(text='Adenocarcinoma, NOS'), primary_site=[], metastatic_site=[], stage=[CancerStageObservationSet(id=None, category=None, focus=[], subject=None, method_type=[], performed_by=None, observations=[CancerStageObservation(observation_type=(text='Overall', description='The overall stage of the disease'), id=None, category=None, method_type=None, focus=None, subject=None, performed_by=None, valueEntity=None, valueString='not reported', valueInteger=None, valueDecimal=None, valueBoolean=None, valueDateTime=None, valueQuantity=None, valueCodeableConcept=None), CancerStageObservation(observation_type=(text='Clinical Overall', description='The overall stage of the disease; clinical stage is determined from evidence acquired before treatment (including clinical examination, imaging, 

In [12]:
gdc_head_and_mouth[0]


{'aliquot_ids': ['cfcde639-3045-4f66-84a6-ec74b090a5b6'],
 'case_id': 'cd7e514f-71ba-4cc1-b74a-a22c6248169c',
 'created_datetime': '2017-06-01T08:57:57.249456-05:00',
 'diagnoses': [{'age_at_diagnosis': 19592,
   'classification_of_tumor': 'metastasis',
   'created_datetime': '2017-06-19T09:09:57.388287-05:00',
   'days_to_last_follow_up': None,
   'days_to_last_known_disease_status': None,
   'days_to_recurrence': None,
   'diagnosis_id': '5d2d67d1-4611-4a18-9a66-89823aaa8e3c',
   'last_known_disease_status': 'not reported',
   'morphology': '8140/3',
   'primary_diagnosis': 'Adenocarcinoma, NOS',
   'progression_or_recurrence': 'not reported',
   'site_of_resection_or_biopsy': 'Nasal cavity',
   'state': 'released',
   'submitter_id': 'AD6426_diagnosis',
   'tissue_or_organ_of_origin': 'Overlapping lesion of nasopharynx',
   'tumor_grade': 'Not Reported',
   'tumor_stage': 'not reported',
   'updated_datetime': '2019-07-10T13:16:35.855027-05:00',
   'index': 0}],
 'diagnosis_ids': ['

In [13]:
def map_case(case):
    subject = ccdh.Subject(
        id=case.get('submitter_id')
    )
    rs = ccdh.ResearchSubject(
        id = case.get('id'),
        primary_diagnosis_site=create_body_site(case.get('primary_site')),
        associated_subject = subject,
        primary_diagnosis=list(map(lambda d: map_diagnosis(d, subject), case.get('diagnoses')))
    )
    # rs.primary_diagnosis=list(map(lambda d: map_diagnosis(d, subject), case.get('diagnoses')))

    # Unmapped fields:
    # - 'aliquot_ids'
    # - 'case_id'
    # - 'created_datetime'
    # - 'diagnosis_ids'
    # - 'disease_type'
    # - 'sample_ids'
    # - 'state'
    # - 'submitter_aliquot_ids'
    # - 'submitter_diagnosis_ids'
    # - 'updated_datetime'

    return rs

firstResearchSubject = map_case(gdc_head_and_mouth[0])

firstResearchSubject

ResearchSubject(id='cd7e514f-71ba-4cc1-b74a-a22c6248169c', associated_subject=Subject(id='AD6426', identifier=[], species=None, breed=None, sex=None, ethnicity=None, race=[], year_of_birth=None, vital_status=None, age_at_death=None, year_of_death=None, cause_of_death=None), identifier=[], description=None, member_of_research_project=None, age_at_enrollment=None, primary_diagnosis_condition=None, primary_diagnosis_site=None, primary_diagnosis=[Diagnosis(id='5d2d67d1-4611-4a18-9a66-89823aaa8e3c', identifier=[Identifier(value='AD6426_diagnosis', system='GDC-submitter-id', type=None)], subject=None, age_at_diagnosis=None, year_at_diagnosis=None, condition=(text='Adenocarcinoma, NOS'), primary_site=[], metastatic_site=[], stage=[CancerStageObservationSet(id=None, category=None, focus=[], subject=Subject(id='AD6426', identifier=[], species=None, breed=None, sex=None, ethnicity=None, race=[], year_of_birth=None, vital_status=None, age_at_death=None, year_of_death=None, cause_of_death=None), m

In [14]:
rss = list(map(map_case, gdc_head_and_mouth))
rss[0:5]


[ResearchSubject(id='cd7e514f-71ba-4cc1-b74a-a22c6248169c', associated_subject=Subject(id='AD6426', identifier=[], species=None, breed=None, sex=None, ethnicity=None, race=[], year_of_birth=None, vital_status=None, age_at_death=None, year_of_death=None, cause_of_death=None), identifier=[], description=None, member_of_research_project=None, age_at_enrollment=None, primary_diagnosis_condition=None, primary_diagnosis_site=None, primary_diagnosis=[Diagnosis(id='5d2d67d1-4611-4a18-9a66-89823aaa8e3c', identifier=[Identifier(value='AD6426_diagnosis', system='GDC-submitter-id', type=None)], subject=None, age_at_diagnosis=None, year_at_diagnosis=None, condition=(text='Adenocarcinoma, NOS'), primary_site=[], metastatic_site=[], stage=[CancerStageObservationSet(id=None, category=None, focus=[], subject=Subject(id='AD6426', identifier=[], species=None, breed=None, sex=None, ethnicity=None, race=[], year_of_birth=None, vital_status=None, age_at_death=None, year_of_death=None, cause_of_death=None), 

## Exporting CRDC-H data as RDF

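LinkML supports RDF export by way of JSON-LD: we generate a JSON-LD context from the model, then serialize the instance data as JSON together with that context.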
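To make the role of the context concrete: a JSON-LD document is ordinary JSON plus an `@context` entry that tells a processor how to turn keys into IRIs. The hand-built sketch below is not what `ContextGenerator` emits. It assumes a single `@vocab` entry using the `https://example.org/ccdh/` namespace that appears in the Turtle output later in this notebook, and `attach_context` is a hypothetical helper.

```python
import json

# Assumed namespace, taken from the prefix in this notebook's Turtle output.
EXAMPLE_CONTEXT = {'@vocab': 'https://example.org/ccdh/'}


def attach_context(instance, context):
    """Return a copy of a plain-JSON instance with a JSON-LD @context added."""
    document = {'@context': context}
    document.update(instance)
    return document


example_instance = {
    'id': '706b1290-3a85-54ea-a123-e8bd14b085bc',
    'condition': {'text': 'Squamous cell carcinoma, NOS'},
}
example_jsonld = attach_context(example_instance, EXAMPLE_CONTEXT)
print(json.dumps(example_jsonld, indent=2))
```

With `@vocab` set this way, a JSON-LD processor expands the key `condition` to `https://example.org/ccdh/condition`, which is how plain keys become RDF predicates.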

In [15]:
del dumps

from linkml.generators.jsonldcontextgen import ContextGenerator
from linkml.dumpers.json_dumper import dumps

jsonldContext = ContextGenerator('ccdh/ccdhmodel.yaml').serialize()

jsonld = dumps(firstDiagnosis, jsonldContext)
print(''.join(jsonld))

{
  "id": "706b1290-3a85-54ea-a123-e8bd14b085bc",
  "identifier": [
    {
      "value": "TCGA-CN-6012_diagnosis",
      "system": "GDC-submitter-id"
    }
  ],
  "condition": {
    "text": "Squamous cell carcinoma, NOS"
  },
  "stage": [
    {
      "method_type": [
        {}
      ],
      "observations": [
        {
          "observation_type": {
            "text": "Overall",
            "description": "The overall stage of the disease"
          },
          "valueString": "stage iva"
        },
        {
          "observation_type": {
            "text": "Clinical Overall",
            "description": "The overall stage of the disease; clinical stage is determined from evidence acquired before treatment (including clinical examination, imaging, endoscopy, biopsy, surgical exploration)"
          },
          "valueString": "Stage III"
        },
        {
          "observation_type": {
            "text": "Clinical Tumor (T)",
            "description": "T classifies the size 

In [16]:
jsonld = dumps(rss, jsonldContext)
print(jsonld)

{
  "@graph": [
    {
      "id": "cd7e514f-71ba-4cc1-b74a-a22c6248169c",
      "associated_subject": {
        "id": "AD6426"
      },
      "primary_diagnosis": [
        {
          "id": "5d2d67d1-4611-4a18-9a66-89823aaa8e3c",
          "identifier": [
            {
              "value": "AD6426_diagnosis",
              "system": "GDC-submitter-id"
            }
          ],
          "condition": {
            "_code": {
              "text": "Adenocarcinoma, NOS"
            }
          },
          "stage": [
            {
              "subject": {
                "id": "AD6426"
              },
              "observations": [
                {
                  "observation_type": {
                    "_code": {
                      "text": "Overall",
                      "description": "The overall stage of the disease"
                    }
                  },
                  "valueString": "not reported"
                },
                {
                  "observ

In [17]:
with open('head-and-mouth/ccdh-head-and-mouth.jsonld', 'w') as file:
    jsonld = dumps(rss, jsonldContext)
    file.write(''.join(jsonld))

In [18]:
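# Parse the JSON-LD with rdflib and re-serialize it as Turtle.
from rdflib import Graph

g = Graph()
g.parse(data=jsonld, format="json-ld")
rdfAsTurtle = g.serialize(format="turtle")
# rdflib versions before 6.0 return bytes here; later versions return str.
if isinstance(rdfAsTurtle, bytes):
    rdfAsTurtle = rdfAsTurtle.decode()
print(rdfAsTurtle)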

@prefix : <https://example.org/ccdh/> .
@prefix ccdh: <https://example.org/ccdh/> .

[] ccdh:associated_subject [ ccdh:id "GENIE-MDA-4754" ] ;
    ccdh:id "f06450b5-3e30-45c4-a86d-30f0861b265b" ;
    ccdh:primary_diagnosis [ ccdh:condition [ ccdh:_code [ ccdh:text "Carcinoma, NOS" ] ] ;
            ccdh:id "43aa2ba4-d59e-4e9f-9b1f-009fba8ee08a" ;
            ccdh:identifier [ ccdh:system "GDC-submitter-id" ;
                    ccdh:value "GENIE-MDA-4754-15863_diagnosis" ] ;
            ccdh:morphology [ ccdh:_code [ ccdh:text "8010/3" ] ] ;
            ccdh:stage [ ccdh:observations [ ccdh:observation_type [ ccdh:_code [ ccdh:description "The overall stage of the disease; clinical stage is determined from evidence acquired before treatment (including clinical examination, imaging, endoscopy, biopsy, surgical exploration)" ;
                                            ccdh:text "Clinical Overall" ] ] ],
                        [ ccdh:observation_type [ ccdh:_code [ ccdh:description "T 

## Incorporating PDC data

In [19]:
with open('head-and-mouth/pdc-head-and-mouth.json') as file:
    pdc_head_and_mouth = json.load(file)
    
pandas.DataFrame(pdc_head_and_mouth)

Unnamed: 0,case_id,case_submitter_id,days_to_lost_to_followup,demographics,diagnoses,disease_type,externalReferences,index_date,lost_to_followup,primary_site,project_submitter_id,samples
0,0232701d-6d00-440c-af6c-5899fbbf4142,OSCC_13,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '426a2696-f073-4...
1,0e943de7-c277-48f2-8fa9-b2e836b03c2c,OSCC_25,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '38404eb4-20a6-4...
2,1104505a-9890-49ce-8d7d-7a8070261324,OSCC_23,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '15218d5b-fc40-4...
3,195cd133-0d53-402d-b31c-3d4fe0481858,OSCC_37,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '47e8d70c-646d-4...
4,1df726a4-8520-4474-8c00-d238a7384be1,OSCC_06,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '333e24c9-ec45-4...
...,...,...,...,...,...,...,...,...,...,...,...,...
143,df6bef95-c233-4b10-b321-36ef4e79b5d4,OSCC_40,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': 'a3402806-a9ec-4...
144,e11e9155-4ac6-43dc-b8e5-1be822cd2dab,OSCC_47,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '9d1789f8-d629-4...
145,ea7c9fbd-8353-4f3c-9fea-2fba79140536,OSCC_56,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '9c36e4e9-a971-4...
146,f581075d-1b69-4812-9fe4-2bde4aad8bf2,OSCC_38,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '4059003c-b576-4...


In [20]:
firstPdcSample = pdc_head_and_mouth[0]['samples'][0]

firstPdcSample

{'aliquots': [{'aliquot_id': '426a2696-f073-4a0e-bdcc-af62328b5d6d',
   'aliquot_run_metadata': [{'aliquot_run_metadata_id': 'cfb6cae0-6316-47b9-a291-fd2fa9693c93'}],
   'aliquot_submitter_id': 'SAMN05341165_N',
   'analyte_type': 'protein'}],
 'biospecimen_anatomic_site': None,
 'composition': 'Not Reported',
 'current_weight': None,
 'days_to_collection': None,
 'days_to_sample_procurement': None,
 'diagnosis_pathologically_confirmed': None,
 'freezing_method': None,
 'gdc_project_id': None,
 'gdc_sample_id': None,
 'initial_weight': None,
 'intermediate_dimension': None,
 'is_ffpe': None,
 'longest_dimension': None,
 'method_of_sample_procurement': None,
 'oct_embedded': None,
 'pathology_report_uuid': None,
 'preservation_method': None,
 'sample_id': 'd58e2a88-8b0c-4cc4-bb1a-e7734ad58209',
 'sample_submitter_id': 'OSCC_13_N',
 'sample_type': 'Solid Tissue Normal',
 'sample_type_id': None,
 'shortest_dimension': None,
 'time_between_clamping_and_freezing': None,
 'time_between_excis
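In the PDC record, `analyte_type` is recorded on each aliquot rather than on the sample itself. A small helper can collect the distinct values per sample. This is only a sketch: `analyte_types` is a hypothetical name, and the example record is trimmed from the output above.

```python
def analyte_types(sample):
    """Return the distinct analyte types of a sample's aliquots, in order."""
    seen = []
    for aliquot in sample.get('aliquots') or []:
        analyte_type = aliquot.get('analyte_type')
        if analyte_type is not None and analyte_type not in seen:
            seen.append(analyte_type)
    return seen


# Trimmed copy of the first PDC sample shown above.
example_pdc_sample = {
    'sample_id': 'd58e2a88-8b0c-4cc4-bb1a-e7734ad58209',
    'sample_type': 'Solid Tissue Normal',
    'aliquots': [
        {
            'aliquot_id': '426a2696-f073-4a0e-bdcc-af62328b5d6d',
            'analyte_type': 'protein',
        },
    ],
}
print(analyte_types(example_pdc_sample))
```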

In [21]:
# The CRDC-H classes are available through the `ccdh` module imported earlier.
pdcSpecimen = ccdh.Specimen(
    id = firstPdcSample['sample_id'],
    analyte_type = ccdh.CodeableConcept(
        ccdh.Coding(
            # Read the sample type from the PDC sample loaded above.
            firstPdcSample['sample_type']
        )
    )
)

pdcSpecimen