# GDC to CRDC-H Conversion

This notebook demonstrates one method for converting GDC data into CRDC-H instance data: reading node data as JSON and writing it out using the LinkML model. The LinkML model can be used to [generate](https://github.com/linkml/linkml#python-dataclasses) [Python DataClasses](https://docs.python.org/3/library/dataclasses.html), whose instances can then be exported in several data publication formats, such as JSON or RDF.
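To make the DataClass-to-JSON idea concrete, here is a minimal sketch using only the Python standard library. `ExampleSpecimen` and its values are hypothetical stand-ins, not the classes LinkML generates for CRDC-H, and `dataclasses.asdict` stands in for whatever exporter is actually used with the generated classes.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


# Hypothetical stand-in for a generated class; the real CRDC-H classes
# have many more fields and extra validation logic.
@dataclass
class ExampleSpecimen:
    id: str
    source_material_type: Optional[str] = None
    identifier: List[str] = field(default_factory=list)


# Illustrative values only.
example = ExampleSpecimen(id="specimen-1", source_material_type="Blood")

# asdict() recursively converts the dataclass into plain dicts and lists,
# which json.dumps() can then serialize.
example_json = json.dumps(asdict(example), indent=2)
print(example_json)
```

The same pattern (build typed instances first, serialize at the end) is what the rest of this notebook follows with the real CRDC-H classes.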

## Setup

We start by installing the [LinkML](https://pypi.org/project/linkml/) and [pandas](https://pypi.org/project/pandas/) packages.

In [2]:
import sys

# Install LinkML.
!{sys.executable} -m pip install git+https://github.com/linkml/linkml.git

# Install pandas.
!{sys.executable} -m pip install pandas

Collecting git+https://github.com/linkml/linkml.git
  Cloning https://github.com/linkml/linkml.git to /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-q_zak21a
  Running command git clone -q https://github.com/linkml/linkml.git /private/var/folders/j7/wl1qrc1n7fv9vnszw_0dsm680000gn/T/pip-req-build-q_zak21a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Collecting sqlalchemy
  Downloading SQLAlchemy-1.4.11-cp39-cp39-macosx_10_14_x86_64.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 2.3 MB/s eta 0:00:01
Collecting greenlet!=0.4.17
  Downloading greenlet-1.0.0-cp39-cp39-macosx_10_14_x86_64.whl (86 kB)
     |████████████████████████████████| 86 kB 3.3 MB/s eta 0:00:01
Building wheels for collected 

## Loading GDC data as an example

Next, we load the result of a GDC query from a JSON file and display it as a single-row table.

In [19]:
import json
import pandas

with open('cptac2-subject-09CO022/gdc_subject_09CO022.json') as file:
    gdc_subject_09CO022 = json.load(file)
    
pandas.DataFrame([gdc_subject_09CO022])

| | aliquot_ids | case_id | created_datetime | days_to_lost_to_followup | diagnoses | diagnosis_ids | disease_type | id | index_date | lost_to_followup | primary_site | sample_ids | samples | state | submitter_aliquot_ids | submitter_diagnosis_ids | submitter_id | submitter_sample_ids | updated_datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0d8adcbf-13f0-48c3-83df-3fa205b79ae8, 9250d96... | c5421e34-e5c7-4ba5-aed9-146a5575fd8d | 2017-01-25T15:29:16.160843-06:00 | | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | [7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb] | Adenomas and Adenocarcinomas | c5421e34-e5c7-4ba5-aed9-146a5575fd8d | | | Colon | [4591a53d-5668-4a70-b44b-e08a3d59267e, b12c257... | [{'biospecimen_anatomic_site': None, 'biospeci... | released | [f69deaeb-6b6f-4c61-8900-fd0f26_D7_1, 60805d52... | [09CO022-DX] | 09CO022 | [f69deaeb-6b6f-4c61-8900-fd0f26, 60805d52-8ca1... | 2019-10-24T07:59:21.887408-05:00 |
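Several fields of this record (such as `samples` and `diagnosis_ids`) hold nested lists, which a single-row table truncates. One way to see the shape of the record before converting it is to summarize each top-level key. The sketch below uses an abbreviated stand-in `record` with a few values copied from the outputs in this notebook; `summarize` is a hypothetical helper, not part of any library.

```python
# Abbreviated stand-in for the loaded GDC record; the real record has
# many more keys and more than one sample.
record = {
    "case_id": "c5421e34-e5c7-4ba5-aed9-146a5575fd8d",
    "submitter_id": "09CO022",
    "primary_site": "Colon",
    "diagnosis_ids": ["7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb"],
    "samples": [
        {
            "sample_id": "b12c257d-7409-4858-9384-c430929a075a",
            "sample_type": "Blood Derived Normal",
        }
    ],
}


def summarize(record):
    """Describe each top-level key: its type, and for lists, the item types."""
    summary = {}
    for key, value in record.items():
        if isinstance(value, list):
            kinds = sorted({type(item).__name__ for item in value})
            summary[key] = f"list of {len(value)} ({', '.join(kinds)})"
        else:
            summary[key] = type(value).__name__
    return summary


record_summary = summarize(record)
for key, description in record_summary.items():
    print(f"{key}: {description}")
```

Keys reported as lists of `dict` are the ones that need their own conversion step into nested CRDC-H entities.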


## Loading the Python classes for the CRDC-H model

We previously generated the Python DataClasses for the CRDC-H model, so we can now import them.

In [9]:
from crdch.python import entities

## Convert the input data in pieces

For demonstration purposes, we'll start by translating pieces of this record into CRDC-H instance data.

Let's start with the entries in the `samples` key, each of which corresponds to a [Specimen](https://cancerdhc.github.io/ccdhmodel/entities/Specimen.html) in the CRDC-H model.

In [21]:
firstSample = gdc_subject_09CO022['samples'][0]
firstSample

{'biospecimen_anatomic_site': None,
 'biospecimen_laterality': None,
 'catalog_reference': None,
 'composition': None,
 'created_datetime': '2017-01-25T15:31:52.788719-06:00',
 'current_weight': None,
 'days_to_collection': None,
 'days_to_sample_procurement': None,
 'diagnosis_pathologically_confirmed': None,
 'distance_normal_to_tumor': None,
 'distributor_reference': None,
 'freezing_method': None,
 'growth_rate': None,
 'initial_weight': None,
 'intermediate_dimension': None,
 'is_ffpe': None,
 'longest_dimension': None,
 'method_of_sample_procurement': None,
 'oct_embedded': None,
 'passage_count': None,
 'pathology_report_uuid': None,
 'preservation_method': None,
 'sample_id': 'b12c257d-7409-4858-9384-c430929a075a',
 'sample_type': 'Blood Derived Normal',
 'sample_type_id': '10',
 'shortest_dimension': None,
 'state': 'released',
 'submitter_id': '60805d52-8ca1-46d4-8101-0ad055',
 'time_between_clamping_and_freezing': None,
 'time_between_excision_and_freezing': None,
 'tissue_t

In [31]:
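# Build a CRDC-H Specimen from the first GDC sample.
specimen = entities.Specimen(
    # The GDC sample_id becomes the Specimen id.
    id = firstSample['sample_id'],
    # The GDC sample_type string is wrapped as the code of a Coding
    # inside a CodeableConcept.
    analyte_type = entities.CodeableConcept(
        entities.Coding(
            firstSample['sample_type']
        )
    )
)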

specimen

Specimen(id='b12c257d-7409-4858-9384-c430929a075a', identifier=[], associated_project=None, specimen_type=None, analyte_type=CodeableConcept(coding=[Coding(code='Blood Derived Normal', display=None, system=None, version=None)], text=None), derived_from_specimen=[], derived_from_subject=None, source_material_type=None, cellular_composition=None, general_tissue_morphology=None, specific_tissue_morphology=None, current_weight=[], current_volume=[], analyte_concentration=None, analyte_concentration_method=None, matched_normal_flag=[], qualification_status_flag=None)
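The same mapping can be applied to every entry in `samples` rather than only the first. The sketch below is self-contained, so it uses simplified stand-ins for `entities.Coding`, `entities.CodeableConcept` and `entities.Specimen` that keep only the fields seen in the output above. `sample_to_specimen` is a hypothetical helper, the second sample is made up to show handling of a missing `sample_type`, and the `coding` list is built explicitly here rather than relying on any coercion the generated classes may perform.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Simplified stand-ins for the generated CRDC-H classes (assumption:
# only the fields used in this notebook are reproduced).
@dataclass
class Coding:
    code: str
    display: Optional[str] = None
    system: Optional[str] = None
    version: Optional[str] = None


@dataclass
class CodeableConcept:
    coding: List[Coding] = field(default_factory=list)
    text: Optional[str] = None


@dataclass
class Specimen:
    id: str
    analyte_type: Optional[CodeableConcept] = None


def sample_to_specimen(sample):
    """Map one GDC sample dict to a Specimen, mirroring the cell above."""
    sample_type = sample.get("sample_type")
    analyte_type = None
    if sample_type is not None:
        analyte_type = CodeableConcept(coding=[Coding(code=sample_type)])
    return Specimen(id=sample["sample_id"], analyte_type=analyte_type)


samples = [
    {
        "sample_id": "b12c257d-7409-4858-9384-c430929a075a",
        "sample_type": "Blood Derived Normal",
    },
    # Hypothetical sample with no sample_type, to exercise the None branch.
    {"sample_id": "example-sample-2", "sample_type": None},
]

specimens = [sample_to_specimen(sample) for sample in samples]
for specimen in specimens:
    print(specimen)
```

With the real classes, the comprehension would iterate over `gdc_subject_09CO022['samples']` and construct `entities.Specimen` instances instead.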