# GDC to CRDC-H Conversion

This notebook demonstrates one method for converting GDC data into CRDC-H instance data: reading node data as JSON and writing it out using the LinkML model. The LinkML model can be used to [generate](https://github.com/linkml/linkml#python-dataclasses) [Python DataClasses](https://docs.python.org/3/library/dataclasses.html), which can then be exported in several data publication formats, such as JSON or RDF.

## Setup

We start by installing the [LinkML](https://pypi.org/project/linkml/) and [pandas](https://pypi.org/project/pandas/) packages.

In [1]:
import sys

# Install LinkML.
!{sys.executable} -m pip install git+https://github.com/linkml/linkml.git

# Install pandas.
!{sys.executable} -m pip install pandas

Collecting git+https://github.com/linkml/linkml.git
  Cloning https://github.com/linkml/linkml.git to /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-t82gs9m5
  Running command git clone -q https://github.com/linkml/linkml.git /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-t82gs9m5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
You should consider upgrading via the '/usr/local/Cellar/jupyterlab/3.0.14/libexec/bin/python3.9 -m pip install --upgrade pip' command.




## Loading GDC data as an example

We start by loading the result of a GDC query in JSON.

In [2]:
import json
import pandas

with open('cptac2-subject-09CO022/gdc_subject_09CO022.json') as file:
    gdc_subject_09CO022 = json.load(file)
    
pandas.DataFrame([gdc_subject_09CO022])

Unnamed: 0,aliquot_ids,case_id,created_datetime,days_to_lost_to_followup,diagnoses,diagnosis_ids,disease_type,id,index_date,lost_to_followup,primary_site,sample_ids,samples,state,submitter_aliquot_ids,submitter_diagnosis_ids,submitter_id,submitter_sample_ids,updated_datetime
0,"[0d8adcbf-13f0-48c3-83df-3fa205b79ae8, 9250d96...",c5421e34-e5c7-4ba5-aed9-146a5575fd8d,2017-01-25T15:29:16.160843-06:00,,"[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb],Adenomas and Adenocarcinomas,c5421e34-e5c7-4ba5-aed9-146a5575fd8d,,,Colon,"[4591a53d-5668-4a70-b44b-e08a3d59267e, b12c257...","[{'biospecimen_anatomic_site': None, 'biospeci...",released,"[f69deaeb-6b6f-4c61-8900-fd0f26_D7_1, 60805d52...",[09CO022-DX],09CO022,"[f69deaeb-6b6f-4c61-8900-fd0f26, 60805d52-8ca1...",2019-10-24T07:59:21.887408-05:00


## Loading the Python classes for the CRDC-H model

We previously generated the Python DataClasses for the CRDC-H model. We can now load these DataClasses to build CRDC-H instance data.

In [3]:
from crdch.python import entities
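
To make the later cells easier to follow, here is a rough sketch of how generated classes of this kind behave. These are simplified, hypothetical stand-ins written by hand, not the real `crdch.python.entities` code, which has many more fields. The one behaviour worth noticing is that a single `Coding` passed to `CodeableConcept` ends up wrapped in a list, which matches the `coding=[Coding(...)]` shape printed later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Coding:
    # Hypothetical stand-in: only the fields visible in this notebook's output.
    code: str
    display: Optional[str] = None
    system: Optional[str] = None
    version: Optional[str] = None


@dataclass
class CodeableConcept:
    # Accepts either one Coding or a list of them, and always stores a list.
    coding: Union[Coding, List[Coding]] = field(default_factory=list)
    text: Optional[str] = None

    def __post_init__(self):
        if not isinstance(self.coding, list):
            self.coding = [self.coding]


@dataclass
class Specimen:
    # Hypothetical stand-in with just two of the real class's fields.
    id: str
    analyte_type: Optional[CodeableConcept] = None


specimen_sketch = Specimen(
    id="b12c257d-7409-4858-9384-c430929a075a",
    analyte_type=CodeableConcept(Coding("Blood Derived Normal")),
)
print(specimen_sketch)
```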

## Convert the input data in pieces

For demonstration purposes, we'll start by translating pieces of this record into CRDC-H instance data.

Let's start with the entries in the `samples` key, each of which corresponds to a [Specimen](https://cancerdhc.github.io/ccdhmodel/entities/Specimen.html) in the CRDC-H model.

In [4]:
firstSample = gdc_subject_09CO022['samples'][0]
firstSample

{'biospecimen_anatomic_site': None,
 'biospecimen_laterality': None,
 'catalog_reference': None,
 'composition': None,
 'created_datetime': '2017-01-25T15:31:52.788719-06:00',
 'current_weight': None,
 'days_to_collection': None,
 'days_to_sample_procurement': None,
 'diagnosis_pathologically_confirmed': None,
 'distance_normal_to_tumor': None,
 'distributor_reference': None,
 'freezing_method': None,
 'growth_rate': None,
 'initial_weight': None,
 'intermediate_dimension': None,
 'is_ffpe': None,
 'longest_dimension': None,
 'method_of_sample_procurement': None,
 'oct_embedded': None,
 'passage_count': None,
 'pathology_report_uuid': None,
 'preservation_method': None,
 'sample_id': 'b12c257d-7409-4858-9384-c430929a075a',
 'sample_type': 'Blood Derived Normal',
 'sample_type_id': '10',
 'shortest_dimension': None,
 'state': 'released',
 'submitter_id': '60805d52-8ca1-46d4-8101-0ad055',
 'time_between_clamping_and_freezing': None,
 'time_between_excision_and_freezing': None,
 'tissue_t

In [5]:
specimen = entities.Specimen(
    id = firstSample['sample_id'],
    analyte_type = entities.CodeableConcept(
        entities.Coding(
            firstSample['sample_type']
        )
    )
)

specimen

Specimen(id='b12c257d-7409-4858-9384-c430929a075a', identifier=[], associated_project=None, specimen_type=None, analyte_type=CodeableConcept(coding=[Coding(code='Blood Derived Normal', display=None, system=None, version=None)], text=None), derived_from_specimen=[], derived_from_subject=None, source_material_type=None, cellular_composition=None, general_tissue_morphology=None, specific_tissue_morphology=None, current_weight=[], current_volume=[], analyte_concentration=None, analyte_concentration_method=None, matched_normal_flag=[], qualification_status_flag=None)
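
The same mapping can be applied to every entry in `samples` rather than just the first. The following is a minimal sketch under stated assumptions: the helper names `gdc_sample_to_specimen` and `gdc_samples_to_specimens` are hypothetical, the record and its ids are made up, and `stand_in_entities` only imitates the three constructors used above with plain dicts so that the block runs without the CRDC-H package. In the notebook you would pass the real `entities` module instead.

```python
from types import SimpleNamespace


def gdc_sample_to_specimen(sample, entities):
    """Map one GDC sample dict to a Specimen, mirroring the cell above."""
    return entities.Specimen(
        id=sample['sample_id'],
        analyte_type=entities.CodeableConcept(
            entities.Coding(sample['sample_type'])
        ),
    )


def gdc_samples_to_specimens(record, entities):
    """Convert every sample in a GDC case record (an empty list if none)."""
    samples = record.get('samples') or []
    return [gdc_sample_to_specimen(sample, entities) for sample in samples]


# Stand-in for `crdch.python.entities` (an assumption, for illustration only).
stand_in_entities = SimpleNamespace(
    Coding=lambda code: {'code': code},
    CodeableConcept=lambda coding: {'coding': [coding]},
    Specimen=lambda id, analyte_type: {'id': id, 'analyte_type': analyte_type},
)

# Made-up record shaped like the GDC JSON; the ids are placeholders.
toy_record = {
    'samples': [
        {'sample_id': 'sample-1', 'sample_type': 'Blood Derived Normal'},
        {'sample_id': 'sample-2', 'sample_type': 'Primary Tumor'},
    ]
}

toy_specimens = gdc_samples_to_specimens(toy_record, stand_in_entities)
for toy_specimen in toy_specimens:
    print(toy_specimen)
```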

## Exporting CRDC-H data as RDF

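LinkML supports RDF export via JSON-LD: we generate a JSON-LD context from the CRDC-H schema, serialize the instance as JSON with that context attached, and then parse the result with an RDF library.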

In [14]:
from linkml.generators.jsonldcontextgen import ContextGenerator
from linkml.dumpers.json_dumper import dumps

# Generate a JSON-LD context from the CRDC-H LinkML schema.
jsonldContext = ContextGenerator('crdch/entities.yaml').serialize()

# Serialize the specimen as JSON with the context embedded, which makes it JSON-LD.
jsonld = dumps(specimen, jsonldContext)
jsonld

'{\n  "id": "b12c257d-7409-4858-9384-c430929a075a",\n  "analyte_type": {\n    "coding": [\n      {\n        "code": "Blood Derived Normal"\n      }\n    ]\n  },\n  "@type": "Specimen",\n  "@context": {\n    "biolinkml": "https://w3id.org/biolink/biolinkml/",\n    "ccdh": "https://example.org/ccdh/",\n    "types": "https://example.org/ccdh/datatypes/",\n    "@vocab": "https://example.org/ccdh/",\n    "coding": {\n      "@type": "@id"\n    },\n    "type": {\n      "@type": "@id"\n    },\n    "identifier": {\n      "@type": "@id"\n    },\n    "taxon": {\n      "@type": "@id"\n    },\n    "comparator": {\n      "@type": "@id"\n    },\n    "unit": {\n      "@type": "@id"\n    },\n    "value": {\n      "@type": "xsd:decimal"\n    },\n    "associated_patient": {\n      "@type": "@id"\n    },\n    "associated_project": {\n      "@type": "@id"\n    },\n    "primary_disease_site": {\n      "@type": "@id"\n    },\n    "primary_disease_type": {\n      "@type": "@id"\n    },\n    "analyte_concentra
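
The output above is ordinary JSON with two extra keys, `@type` and `@context`. A minimal sketch of that shape, using only the standard library, is shown here. It is not how LinkML's `dumps` is implemented; the helper name `to_jsonld_sketch` is hypothetical, and the context is cut down to two of the entries visible in the output.

```python
import json


def to_jsonld_sketch(instance, type_name, context):
    """Attach @type and @context to a plain dict and serialize it as JSON."""
    document = dict(instance)  # shallow copy, so the input dict is not modified
    document["@type"] = type_name
    document["@context"] = context
    return json.dumps(document, indent=2)


specimen_dict = {
    "id": "b12c257d-7409-4858-9384-c430929a075a",
    "analyte_type": {"coding": [{"code": "Blood Derived Normal"}]},
}

# Two entries copied from the generated context shown above.
context_sketch = {
    "ccdh": "https://example.org/ccdh/",
    "@vocab": "https://example.org/ccdh/",
}

jsonld_sketch = to_jsonld_sketch(specimen_dict, "Specimen", context_sketch)
print(jsonld_sketch)
```

The `@vocab` entry is what lets plain keys such as `analyte_type` expand to IRIs under `https://example.org/ccdh/` when the document is read as RDF.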

In [16]:
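# Parse the JSON-LD with rdflib and re-serialize it as Turtle.
from rdflib import Graph

g = Graph()
g.parse(data=jsonld, format="json-ld")
# serialize() returns bytes in rdflib versions before 6, hence .decode();
# rdflib 6 and later return a str, so drop .decode() there.
rdfAsTurtle = g.serialize(format="turtle").decode()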
print(rdfAsTurtle)

@prefix : <https://example.org/ccdh/> .
@prefix ccdh: <https://example.org/ccdh/> .

[] a ccdh:Specimen ;
    ccdh:analyte_type [ ccdh:coding [ ccdh:code "Blood Derived Normal" ] ] ;
    ccdh:id "b12c257d-7409-4858-9384-c430929a075a" .




## Incorporating PDC data

We can load the PDC record for the same subject in the same way.

In [6]:
with open('cptac2-subject-09CO022/pdc_subject_09CO022.json') as file:
    pdc_subject_09CO022 = json.load(file)
    
pandas.DataFrame([pdc_subject_09CO022])

Unnamed: 0,case_id,case_submitter_id,days_to_lost_to_followup,demographics,diagnoses,disease_type,externalReferences,index_date,lost_to_followup,primary_site,project_submitter_id,samples
0,459e3b69-63d6-11e8-bcf1-0a2705229b82,09CO022,0,"[{'cause_of_death': None, 'days_to_birth': Non...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",Colon Adenocarcinoma,[{'external_reference_id': 'c5421e34-e5c7-4ba5...,,,Colon,CPTAC-2,[{'aliquots': [{'aliquot_id': '208ebc64-6425-1...
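
Before combining the two sources it helps to confirm that both records describe the same case. In the outputs above, the PDC `case_submitter_id` and the GDC `submitter_id` are both `09CO022`, and the PDC `externalReferences` entry appears to begin with the same characters as the GDC `case_id`. A minimal sketch of such a check follows; the helper name `find_gdc_case_ids` is hypothetical and the toy records use placeholder ids rather than the real ones.

```python
def find_gdc_case_ids(pdc_case, gdc_case_ids):
    """Return the PDC external reference ids that match known GDC case ids."""
    references = pdc_case.get('externalReferences') or []
    return [
        reference['external_reference_id']
        for reference in references
        if reference.get('external_reference_id') in gdc_case_ids
    ]


# Made-up records shaped like the two JSON files; the case ids are placeholders.
toy_gdc_case = {'case_id': 'gdc-case-0001', 'submitter_id': '09CO022'}
toy_pdc_case = {
    'case_submitter_id': '09CO022',
    'externalReferences': [{'external_reference_id': 'gdc-case-0001'}],
}

matching_ids = find_gdc_case_ids(toy_pdc_case, {toy_gdc_case['case_id']})
same_submitter = toy_pdc_case['case_submitter_id'] == toy_gdc_case['submitter_id']
print(matching_ids, same_submitter)
```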


In [8]:
firstPdcSample = pdc_subject_09CO022['samples'][0]

firstPdcSample

{'aliquots': [{'aliquot_id': '208ebc64-6425-11e8-bcf1-0a2705229b82',
   'aliquot_run_metadata': [{'aliquot_run_metadata_id': '58a72e91-e26a-11e8-907f-0a2705229b82'},
    {'aliquot_run_metadata_id': 'a2422b46-e26a-11e8-907f-0a2705229b82'}],
   'aliquot_submitter_id': '76498650-fdbf-4f5d-a19a-cce9a2_D2',
   'analyte_type': 'Protein'}],
 'biospecimen_anatomic_site': 'Not Reported',
 'composition': 'Not Reported',
 'current_weight': None,
 'days_to_collection': None,
 'days_to_sample_procurement': None,
 'diagnosis_pathologically_confirmed': 'Not Reported',
 'freezing_method': None,
 'gdc_project_id': '',
 'gdc_sample_id': '',
 'initial_weight': None,
 'intermediate_dimension': None,
 'is_ffpe': None,
 'longest_dimension': None,
 'method_of_sample_procurement': 'Not Reported',
 'oct_embedded': None,
 'pathology_report_uuid': None,
 'preservation_method': 'Not Reported',
 'sample_id': 'f4af3e4d-641b-11e8-bcf1-0a2705229b82',
 'sample_submitter_id': '76498650-fdbf-4f5d-a19a-cce9a2',
 'sample_

In [9]:
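pdcSpecimen = entities.Specimen(
    id = firstPdcSample['sample_id'],
    analyte_type = entities.CodeableConcept(
        entities.Coding(
            # NOTE: this reads the type from the GDC sample (firstSample), not from
            # firstPdcSample, which is why the output repeats 'Blood Derived Normal'.
            firstSample['sample_type']
        )
    )
)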

pdcSpecimen

Specimen(id='f4af3e4d-641b-11e8-bcf1-0a2705229b82', identifier=[], associated_project=None, specimen_type=None, analyte_type=CodeableConcept(coding=[Coding(code='Blood Derived Normal', display=None, system=None, version=None)], text=None), derived_from_specimen=[], derived_from_subject=None, source_material_type=None, cellular_composition=None, general_tissue_morphology=None, specific_tissue_morphology=None, current_weight=[], current_volume=[], analyte_concentration=None, analyte_concentration_method=None, matched_normal_flag=[], qualification_status_flag=None)

In [10]:
# Let's combine these two specimens in one patient.

# TODO: complete this. Patient requires an `id`, so calling it with no arguments fails.
patient = entities.Patient()

ValueError: id must be supplied
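
The error shows that a `Patient` cannot be created without an `id`. One possible way to finish this step is sketched here with hand-written stand-in classes, not the real CRDC-H ones. It assumes that the shared submitter id `09CO022` is an acceptable patient id and that a patient-like object may fill the `derived_from_subject` slot seen in the `Specimen` output; both are assumptions, and the helper `link_specimens_to_patient` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    # Stand-in that imitates the "id must be supplied" check seen above.
    id: Optional[str] = None

    def __post_init__(self):
        if self.id is None:
            raise ValueError("id must be supplied")


@dataclass
class Specimen:
    # Stand-in with only the two fields this sketch needs.
    id: str
    derived_from_subject: Optional[Patient] = None


def link_specimens_to_patient(patient: Patient, specimens: List[Specimen]) -> List[Specimen]:
    """Point each specimen at the patient it was derived from."""
    for specimen in specimens:
        specimen.derived_from_subject = patient
    return specimens


patient_sketch = Patient(id="09CO022")
linked_specimens = link_specimens_to_patient(
    patient_sketch,
    [
        Specimen(id="b12c257d-7409-4858-9384-c430929a075a"),  # GDC sample_id
        Specimen(id="f4af3e4d-641b-11e8-bcf1-0a2705229b82"),  # PDC sample_id
    ],
)
for linked in linked_specimens:
    print(linked.id, "->", linked.derived_from_subject.id)
```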