# GDC to CRDC-H Conversion

This notebook demonstrates one method for converting GDC data into CRDC-H instance data: reading node data as JSON and writing it out in the LinkML model. The LinkML model can be used to [generate](https://github.com/linkml/linkml#python-dataclasses) [Python DataClasses](https://docs.python.org/3/library/dataclasses.html), which can then be exported in several data publication formats, such as JSON or RDF.
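
To make the idea concrete, here is a minimal sketch of the dataclass-to-JSON step using only the standard library. The `Coding` and `Specimen` classes below are simplified stand-ins invented for illustration, as is the `to_json` helper; the real classes are generated from the CRDC-H LinkML schema and carry many more fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Coding:
    """Hypothetical stand-in for a generated CRDC-H class."""
    code: str
    system: Optional[str] = None


@dataclass
class Specimen:
    """Hypothetical stand-in with only two fields."""
    id: str
    specimen_type: Optional[Coding] = None


def to_json(instance) -> str:
    """Serialize a dataclass instance to JSON, dropping unset (None) fields."""
    def drop_none(value):
        if isinstance(value, dict):
            return {k: drop_none(v) for k, v in value.items() if v is not None}
        if isinstance(value, list):
            return [drop_none(v) for v in value]
        return value

    return json.dumps(drop_none(asdict(instance)), indent=2)


example = Specimen(id="specimen-1", specimen_type=Coding(code="Metastatic"))
print(to_json(example))
```

Dropping unset fields keeps the exported JSON compact, which is similar in spirit to the sparse JSON-LD output shown later in this notebook.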

## Setup

We start by installing the [LinkML](https://pypi.org/project/linkml/) and [pandas](https://pypi.org/project/pandas/) packages.

In [1]:
import sys

# Install LinkML.
!{sys.executable} -m pip install git+https://github.com/linkml/linkml.git

# Install pandas.
!{sys.executable} -m pip install pandas

Collecting git+https://github.com/linkml/linkml.git
  Cloning https://github.com/linkml/linkml.git to /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-_ccxl0je
  Running command git clone -q https://github.com/linkml/linkml.git /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-_ccxl0je
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: linkml
  Building wheel for linkml (PEP 517) ... done
  Created wheel for linkml: filename=linkml-0.0.8.dev53-py3-none-any.whl size=134767 sha256=af07098044a57035c0b677e48c092f4fe6ab8968d16a2a95a253c691765577b2
  Stored in directory: /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-ephem-wheel-cache-g3ndz0qj

## Loading GDC data as an example

We start by loading the result of a GDC query in JSON.

In [2]:
import json
import pandas

with open('head-and-mouth/gdc-head-and-mouth.json') as file:
    gdc_head_and_mouth = json.load(file)
    
pandas.DataFrame(gdc_head_and_mouth)

Unnamed: 0,aliquot_ids,case_id,created_datetime,diagnoses,diagnosis_ids,disease_type,id,primary_site,sample_ids,samples,...,submitter_sample_ids,submitter_slide_ids,updated_datetime,analyte_ids,portion_ids,submitter_analyte_ids,submitter_portion_ids,days_to_lost_to_followup,index_date,lost_to_followup
0,[cfcde639-3045-4f66-84a6-ec74b090a5b6],cd7e514f-71ba-4cc1-b74a-a22c6248169c,2017-06-01T08:57:57.249456-05:00,"[{'age_at_diagnosis': 19592, 'classification_o...",[5d2d67d1-4611-4a18-9a66-89823aaa8e3c],Adenomas and Adenocarcinomas,cd7e514f-71ba-4cc1-b74a-a22c6248169c,Nasopharynx,[bdc73f48-dc0b-487d-abbe-e3a977b6830a],[{'created_datetime': '2017-06-01T10:44:57.790...,...,[AD6426_sample],[AD6426_slide],2018-10-25T11:34:27.425461-05:00,,,,,,,
1,"[9069bdd7-e16a-462c-881c-581c8aab6910, a74915f...",9023c9bf-02a0-4396-8161-304089957b62,,"[{'age_at_diagnosis': 24286, 'ajcc_clinical_m'...",[706b1290-3a85-54ea-a123-e8bd14b085bc],Squamous Cell Neoplasms,9023c9bf-02a0-4396-8161-304089957b62,Larynx,"[8b2588c8-4261-492b-b173-2490a5de668f, badeaed...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-CN-6012-10A, TCGA-CN-6012-01A, TCGA-CN-6...","[TCGA-CN-6012-01Z-00-DX1, TCGA-CN-6012-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[80c6fde2-b6bb-4f40-908a-f116c466d296, 6f77017...","[bada788e-5112-4d21-a079-72729bd0cc83, fe24eea...","[TCGA-CN-6012-01A-11D, TCGA-CN-6012-10A-01W, T...","[TCGA-CN-6012-01A-13-2072-20, TCGA-CN-6012-10A...",,,
2,"[8f695cd3-01dd-4601-8b17-37cf40514422, f0e325f...",55f96a9c-e2c8-4243-8a7e-94bc6fab73a6,,"[{'age_at_diagnosis': 20992, 'ajcc_clinical_m'...",[40954a8e-e4c2-5604-937b-0a79ac7489d2],Squamous Cell Neoplasms,55f96a9c-e2c8-4243-8a7e-94bc6fab73a6,Larynx,"[a7692585-a129-4671-bfe5-98342a326776, b069c55...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-CV-7261-01Z, TCGA-CV-7261-11A, TCGA-CV-7...","[TCGA-CV-7261-01A-01-TS1, TCGA-CV-7261-01Z-00-...",2019-08-06T14:26:28.608672-05:00,"[a72f2de7-eb40-4818-a104-edb508d5517b, e8120e5...","[177fa10b-0135-468d-b5a3-6f30cc3cd390, f51d76a...","[TCGA-CV-7261-10A-01D, TCGA-CV-7261-01A-11R, T...","[TCGA-CV-7261-10A-01, TCGA-CV-7261-01A-13-2074...",,,
3,"[1265fd12-4706-43b0-84f3-d16d46f20963, 3443e1b...",c9a36eb5-ac3e-424e-bc2e-303de7105957,,"[{'age_at_diagnosis': 21886, 'ajcc_clinical_m'...",[48e8dd81-ed4d-5c54-af66-84e86477d5c8],Squamous Cell Neoplasms,c9a36eb5-ac3e-424e-bc2e-303de7105957,Oropharynx,"[256469d0-5f36-4966-bf4f-3b4297e55f43, bd90f96...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-BA-A6DL-10A, TCGA-BA-A6DL-01Z, TCGA-BA-A...","[TCGA-BA-A6DL-01Z-00-DX1, TCGA-BA-A6DL-01A-02-...",2019-08-06T14:25:14.243346-05:00,"[ec4487c1-6976-4161-9236-5e6810ed31b7, ffd1e03...","[7f327ef6-4fe6-40c8-aac7-731e051177bb, 2a4b0be...","[TCGA-BA-A6DL-01A-21D, TCGA-BA-A6DL-01A-21R, T...","[TCGA-BA-A6DL-10A-01, TCGA-BA-A6DL-01A-11-A45L...",,,
4,"[59b70846-64f0-489e-8ea5-84a347aedeb8, c8e46ce...",4cffea0b-90a7-4c86-a73f-bb8feca3ada7,,"[{'age_at_diagnosis': 14190, 'ajcc_clinical_m'...",[1da5c51a-ee25-51a6-a4c2-27d8fdcbe24e],Squamous Cell Neoplasms,4cffea0b-90a7-4c86-a73f-bb8feca3ada7,Tonsil,"[1ed245de-fea4-42c9-9197-773bcd12d2a8, 665d4bf...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-CN-5365-01Z, TCGA-CN-5365-10A, TCGA-CN-5...","[TCGA-CN-5365-01Z-00-DX1, TCGA-CN-5365-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[d46b5e9b-3652-45a1-a91d-46277aea3916, 35122dd...","[38c5a4c1-6d01-4885-ba35-0032e6b835b0, 516f802...","[TCGA-CN-5365-01A-01D, TCGA-CN-5365-01A-01W, T...","[TCGA-CN-5365-10A-01, TCGA-CN-5365-01A-21-2072...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,"[1d3b16fd-f98b-45ef-a423-861975f098b6, 0eabe3e...",97640ef0-0259-4244-95ba-48d28c60b372,,"[{'age_at_diagnosis': 19621, 'ajcc_clinical_m'...",[b725e6d2-92c0-5585-9de7-14bb623b472e],Squamous Cell Neoplasms,97640ef0-0259-4244-95ba-48d28c60b372,Larynx,"[fb06ae75-8516-4cdc-ba9e-093444907fc7, 5162217...","[{'composition': None, 'created_datetime': Non...",...,"[TCGA-CN-4738-01A, TCGA-CN-4738-01Z, TCGA-CN-4...","[TCGA-CN-4738-01Z-00-DX1, TCGA-CN-4738-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[4dc95dbe-b10f-4d6e-9413-ae47a0a49865, e637c1c...","[56c7d4e4-5703-4686-98b1-0c3125e5913e, 60d72bd...","[TCGA-CN-4738-01A-02D, TCGA-CN-4738-10A-01W, T...","[TCGA-CN-4738-01A-31-2072-20, TCGA-CN-4738-01A...",,,
556,[96f09bc8-a194-482c-bd17-baf28739e4f8],422a72e7-fe76-411d-b59e-1f0f0812c3cf,2018-09-13T13:42:10.444091-05:00,"[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[842d6984-7c03-4ab6-95db-42fa2ea699db],Squamous Cell Neoplasms,422a72e7-fe76-411d-b59e-1f0f0812c3cf,Larynx,[6f9eeaa3-8bd1-479c-a0fc-98317eb458dc],"[{'biospecimen_anatomic_site': None, 'biospeci...",...,[GENIE-DFCI-010671-11105],,2019-11-18T13:54:59.294543-06:00,,,,,,Initial Genomic Sequencing,
557,"[cd211e89-63f7-44f0-8a76-51703ae45112, 866292c...",4b50aea4-4ad1-4bf6-9cf1-984c28a99c84,,"[{'age_at_diagnosis': 21731, 'ajcc_clinical_m'...",[95d85e5a-b82c-59f8-b7ad-710e019cdebc],Squamous Cell Neoplasms,4b50aea4-4ad1-4bf6-9cf1-984c28a99c84,Hypopharynx,"[1077bf93-cf23-41db-925c-c633921894cc, 4a0d79f...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-TN-A7HL-01A, TCGA-TN-A7HL-01Z, TCGA-TN-A...","[TCGA-TN-A7HL-01Z-00-DX1, TCGA-TN-A7HL-01A-01-...",2019-08-06T14:27:14.277986-05:00,"[6ffc3548-d593-47ab-adf8-6d73075b5fa0, 9426e53...","[cd5864c8-b4e0-4405-b7df-1e0a51865670, f9ef56b...","[TCGA-TN-A7HL-01A-11R, TCGA-TN-A7HL-01A-11D, T...","[TCGA-TN-A7HL-01A-21-A45L-20, TCGA-TN-A7HL-10A...",,,
558,"[0c2f310b-fa59-4f6f-894a-dad920214004, 6ddd527...",0394060d-010e-405f-983d-db525f01f2c3,,"[{'age_at_diagnosis': 23640, 'ajcc_clinical_m'...",[7a67eecc-6f46-5181-8b64-c022d0fd0060],Squamous Cell Neoplasms,0394060d-010e-405f-983d-db525f01f2c3,Hypopharynx,"[5c2b4403-cdd4-4550-ba01-d8ebad9fcbc8, 4467ee1...",[{'created_datetime': '2018-05-17T12:19:46.292...,...,"[TCGA-BB-A5HY-10A, TCGA-BB-A5HY-01A, TCGA-BB-A...","[TCGA-BB-A5HY-01Z-00-DX1, TCGA-BB-A5HY-01A-01-...",2019-08-06T14:25:25.511101-05:00,"[ee7f98c5-9c78-4bbe-b44a-a9e357a18058, 2d82983...","[9576f242-6874-4df9-8744-e0755d565358, 8842cbd...","[TCGA-BB-A5HY-01A-11W, TCGA-BB-A5HY-01A-11D, T...","[TCGA-BB-A5HY-01A-11, TCGA-BB-A5HY-10A-01]",,,
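
The empty cells in the table above suggest that not every case record carries a value for every key. One way to see which keys are only sometimes populated is to count, per key, how many records have a non-null value. The sketch below uses two small made-up records in place of `gdc_head_and_mouth`, and the helper name `key_coverage` is hypothetical.

```python
from collections import Counter


def key_coverage(records):
    """Return {key: number of records with a non-null value for that key}."""
    counts = Counter()
    for record in records:
        counts.update(k for k, v in record.items() if v is not None)
    return dict(counts)


# Made-up records standing in for the loaded GDC cases.
records = [
    {"case_id": "case-1", "samples": [], "index_date": "Diagnosis"},
    {"case_id": "case-2", "samples": [], "index_date": None},
]

coverage = key_coverage(records)
# Keys that are missing or null in at least one record.
optional_keys = sorted(k for k, n in coverage.items() if n < len(records))
print(optional_keys)
```

Knowing which keys are optional helps decide where the conversion code needs to tolerate missing values.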


## Loading the Python classes for the CRDC-H model

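We previously generated the Python DataClasses for the CRDC-H model. We can now install the package that contains these DataClasses, so that later cells can use them (through the `entities` module) to build CRDC-H instance data.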

In [None]:
# Install the CRDC-H model package (ccdhmodel), which provides the generated DataClasses.
!{sys.executable} -m pip install git+https://github.com/cancerDHC/ccdhmodel.git@reorganize-to-linkml-template


## Convert the input data in pieces

For demonstration purposes, we'll start by translating pieces of a single GDC record into CRDC-H instance data.

Let's start with the entries in the `samples` key, each of which corresponds to a [Specimen](https://cancerdhc.github.io/ccdhmodel/entities/Specimen.html) in the CRDC-H model.

In [4]:
firstSample = gdc_head_and_mouth[0]['samples'][0]
firstSample

{'created_datetime': '2017-06-01T10:44:57.790971-05:00',
 'sample_id': 'bdc73f48-dc0b-487d-abbe-e3a977b6830a',
 'sample_type': 'Metastatic',
 'state': 'released',
 'submitter_id': 'AD6426_sample',
 'tissue_type': 'Not Reported',
 'tumor_descriptor': 'Metastatic',
 'updated_datetime': '2018-11-15T21:10:03.529893-06:00'}

In [5]:
firstSpecimen = entities.Specimen(
    id = firstSample['sample_id'],
    analyte_type = entities.CodeableConcept(
        entities.Coding(
            firstSample['sample_type']
        )
    )
)

firstSpecimen

Specimen(id='bdc73f48-dc0b-487d-abbe-e3a977b6830a', identifier=[], associated_project=None, specimen_type=None, analyte_type=CodeableConcept(coding=[Coding(code='Metastatic', display=None, system=None, version=None)], text=None), derived_from_specimen=[], derived_from_subject=None, source_material_type=None, cellular_composition=None, general_tissue_morphology=None, specific_tissue_morphology=None, current_weight=[], current_volume=[], analyte_concentration=None, analyte_concentration_method=None, matched_normal_flag=[], qualification_status_flag=None)
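
The same mapping can be applied to every sample of every case by wrapping it in a function. The sketch below is an illustration under stated assumptions: `Coding`, `CodeableConcept` and `Specimen` are small stand-in dataclasses that mimic only the fields used above, and `sample_to_specimen` and `cases_to_specimens` are hypothetical helper names. With the real package, the stand-ins would be replaced by the generated `entities` classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Coding:
    code: str


@dataclass
class CodeableConcept:
    coding: List[Coding] = field(default_factory=list)


@dataclass
class Specimen:
    id: str
    analyte_type: Optional[CodeableConcept] = None


def sample_to_specimen(sample: dict) -> Specimen:
    """Map one GDC sample dict to a Specimen, mirroring the cell above."""
    return Specimen(
        id=sample["sample_id"],
        analyte_type=CodeableConcept(coding=[Coding(sample["sample_type"])]),
    )


def cases_to_specimens(cases: list) -> List[Specimen]:
    """Convert the samples of all cases; cases without samples are skipped."""
    return [
        sample_to_specimen(sample)
        for case in cases
        for sample in (case.get("samples") or [])
    ]


# A tiny made-up input in the same shape as the GDC records.
cases = [
    {"case_id": "c1", "samples": [{"sample_id": "s1", "sample_type": "Metastatic"}]},
    {"case_id": "c2", "samples": None},
]
specimens = cases_to_specimens(cases)
print(specimens)
```

Keeping the per-sample mapping in one function means that any later change to the GDC-to-CRDC-H mapping only needs to be made in one place.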

## Exporting CRDC-H data as RDF

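LinkML supports RDF export via JSON-LD: we generate a JSON-LD context from the model's YAML schema, then serialize the instance data as JSON with that context attached.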

In [11]:
from linkml.generators.jsonldcontextgen import ContextGenerator
from linkml.dumpers.json_dumper import dumps

# Generate a JSON-LD context from the CRDC-H LinkML schema.
jsonldContext = ContextGenerator('crdch/entities.yaml').serialize()

# Serialize the Specimen as JSON-LD, embedding the generated context.
jsonld = dumps(firstSpecimen, jsonldContext)
print(jsonld)

{
  "id": "bdc73f48-dc0b-487d-abbe-e3a977b6830a",
  "analyte_type": {
    "coding": [
      {
        "code": "Metastatic"
      }
    ]
  },
  "@type": "Specimen",
  "@context": {
    "biolinkml": "https://w3id.org/biolink/biolinkml/",
    "ccdh": "https://example.org/ccdh/",
    "types": "https://example.org/ccdh/datatypes/",
    "@vocab": "https://example.org/ccdh/",
    "coding": {
      "@type": "@id"
    },
    "type": {
      "@type": "@id"
    },
    "identifier": {
      "@type": "@id"
    },
    "taxon": {
      "@type": "@id"
    },
    "comparator": {
      "@type": "@id"
    },
    "unit": {
      "@type": "@id"
    },
    "value": {
      "@type": "xsd:decimal"
    },
    "associated_patient": {
      "@type": "@id"
    },
    "associated_project": {
      "@type": "@id"
    },
    "primary_disease_site": {
      "@type": "@id"
    },
    "primary_disease_type": {
      "@type": "@id"
    },
    "analyte_concentration": {
      "@type": "@id"
    },
    "analyte_concentra

In [12]:
# Parse the JSON-LD with rdflib and re-serialize it as Turtle.
from rdflib import Graph

g = Graph()
g.parse(data=jsonld, format="json-ld")
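# In rdflib versions before 6, serialize() returns bytes, hence .decode();
# from rdflib 6 onward it returns a str and the .decode() call must be dropped.
rdfAsTurtle = g.serialize(format="turtle").decode()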
print(rdfAsTurtle)

@prefix : <https://example.org/ccdh/> .
@prefix ccdh: <https://example.org/ccdh/> .

[] a ccdh:Specimen ;
    ccdh:analyte_type [ ccdh:coding [ ccdh:code "Metastatic" ] ] ;
    ccdh:id "bdc73f48-dc0b-487d-abbe-e3a977b6830a" .
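
The Turtle output shows that plain JSON keys such as `analyte_type` became IRIs in the `https://example.org/ccdh/` namespace, which the context declares as `@vocab`. As a rough illustration of that step only, the sketch below expands a key against a context. It is a deliberately simplified stand-in for the JSON-LD expansion algorithm (it ignores per-term definitions, keywords other than `@vocab`, and relative IRIs), and `expand_key` is a hypothetical helper, not part of LinkML or rdflib.

```python
def expand_key(key: str, context: dict) -> str:
    """Expand a JSON key to an IRI using prefixes or @vocab (simplified)."""
    if ":" in key:
        prefix, local = key.split(":", 1)
        namespace = context.get(prefix)
        if isinstance(namespace, str):
            # Compact IRI such as "ccdh:Specimen".
            return namespace + local
        # Unknown prefix: assume the key is already an absolute IRI.
        return key
    vocab = context.get("@vocab")
    return vocab + key if vocab else key


# A two-entry excerpt of the context printed above.
context = {
    "ccdh": "https://example.org/ccdh/",
    "@vocab": "https://example.org/ccdh/",
}

print(expand_key("analyte_type", context))
print(expand_key("ccdh:Specimen", context))
```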




## Incorporating PDC data

The same approach can be applied to PDC data: PDC sample records also carry the `sample_id` and `sample_type` keys used above.

In [13]:
with open('head-and-mouth/pdc-head-and-mouth.json') as file:
    pdc_head_and_mouth = json.load(file)
    
pandas.DataFrame(pdc_head_and_mouth)

Unnamed: 0,case_id,case_submitter_id,days_to_lost_to_followup,demographics,diagnoses,disease_type,externalReferences,index_date,lost_to_followup,primary_site,project_submitter_id,samples
0,0232701d-6d00-440c-af6c-5899fbbf4142,OSCC_13,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '426a2696-f073-4...
1,0e943de7-c277-48f2-8fa9-b2e836b03c2c,OSCC_25,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '38404eb4-20a6-4...
2,1104505a-9890-49ce-8d7d-7a8070261324,OSCC_23,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '15218d5b-fc40-4...
3,195cd133-0d53-402d-b31c-3d4fe0481858,OSCC_37,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '47e8d70c-646d-4...
4,1df726a4-8520-4474-8c00-d238a7384be1,OSCC_06,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '333e24c9-ec45-4...
...,...,...,...,...,...,...,...,...,...,...,...,...
143,df6bef95-c233-4b10-b321-36ef4e79b5d4,OSCC_40,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': 'a3402806-a9ec-4...
144,e11e9155-4ac6-43dc-b8e5-1be822cd2dab,OSCC_47,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '9d1789f8-d629-4...
145,ea7c9fbd-8353-4f3c-9fea-2fba79140536,OSCC_56,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '9c36e4e9-a971-4...
146,f581075d-1b69-4812-9fe4-2bde4aad8bf2,OSCC_38,,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Oral Squamous Cell Carcinoma,[],,,Head and Neck,Oral Squamous Cell Carcinoma - Chang Gung Univ...,[{'aliquots': [{'aliquot_id': '4059003c-b576-4...


In [14]:
firstPdcSample = pdc_head_and_mouth[0]['samples'][0]

firstPdcSample

{'aliquots': [{'aliquot_id': '426a2696-f073-4a0e-bdcc-af62328b5d6d',
   'aliquot_run_metadata': [{'aliquot_run_metadata_id': 'cfb6cae0-6316-47b9-a291-fd2fa9693c93'}],
   'aliquot_submitter_id': 'SAMN05341165_N',
   'analyte_type': 'protein'}],
 'biospecimen_anatomic_site': None,
 'composition': 'Not Reported',
 'current_weight': None,
 'days_to_collection': None,
 'days_to_sample_procurement': None,
 'diagnosis_pathologically_confirmed': None,
 'freezing_method': None,
 'gdc_project_id': None,
 'gdc_sample_id': None,
 'initial_weight': None,
 'intermediate_dimension': None,
 'is_ffpe': None,
 'longest_dimension': None,
 'method_of_sample_procurement': None,
 'oct_embedded': None,
 'pathology_report_uuid': None,
 'preservation_method': None,
 'sample_id': 'd58e2a88-8b0c-4cc4-bb1a-e7734ad58209',
 'sample_submitter_id': 'OSCC_13_N',
 'sample_type': 'Solid Tissue Normal',
 'sample_type_id': None,
 'shortest_dimension': None,
 'time_between_clamping_and_freezing': None,
 'time_between_excis

In [15]:
pdcSpecimen = entities.Specimen(
    id = firstPdcSample['sample_id'],
    analyte_type = entities.CodeableConcept(
        entities.Coding(
            # Use the PDC sample here, not the GDC `firstSample` from earlier.
            firstPdcSample['sample_type']
        )
    )
)

pdcSpecimen

Specimen(id='d58e2a88-8b0c-4cc4-bb1a-e7734ad58209', identifier=[], associated_project=None, specimen_type=None, analyte_type=CodeableConcept(coding=[Coding(code='Solid Tissue Normal', display=None, system=None, version=None)], text=None), derived_from_specimen=[], derived_from_subject=None, source_material_type=None, cellular_composition=None, general_tissue_morphology=None, specific_tissue_morphology=None, current_weight=[], current_volume=[], analyte_concentration=None, analyte_concentration_method=None, matched_normal_flag=[], qualification_status_flag=None)