### Using FHIR to find files for genomic analysis 

We are imagining here we are a genomic researcher who wants to query on phenotypic attributes of patients within a study and do a compute. We know we can get DRS ids out of the Kids First data - and we know how ot do a compute on those files onve we have a DRS id (see other notebooks for how). So we have a SMART goal to shoot for.

At this point we can't see what study a Patient is part of. The relationship to a study is a relationship that belongs to ResearchSubject not Patient.

Can we find what Research Studies exist?

In [1]:
import sys
import os
import json
import requests

FHIR_SERVER = 'https://kf-api-fhir-service.kidsfirstdrc.org'
#FHIR_SERVER = 'https://ncpi-api-fhir-service-dev.kidsfirstdrc.org'

# Optional: Turn off SSL verification. Useful when dealing with a corporate proxy with self-signed certificates.
# This should be set to True unless you actually see certificate errors.
VERIFY_SSL = True

if not VERIFY_SSL:
    requests.packages.urllib3.disable_warnings()

# Kids First uses cookie-based authentication
# Get my locally saved cookie
full_cookie_path = os.path.expanduser('~/.keys/ncpi_prod_fhir_cookie.txt')
with open(full_cookie_path) as f:
        kf_cookie = f.read()

# We make a requests.Session to ensure consistent headers/cookie across all the requests we make
sess = requests.Session()
sess.headers.update({'Accept': 'application/fhir+json'})
sess.verify = VERIFY_SSL
sess.cookies['AWSELBAuthSessionCookie-0'] = kf_cookie

# Test out the cookie by querying the server metadata
r = sess.get(f"{FHIR_SERVER}/metadata")

if "<!DOCTYPE html>" in r.text:
    sys.stderr.write('ERROR: Could not authenticate with Kids First. The cookie may need to be updated')

SSLError: HTTPSConnectionPool(host='d3b-center.auth0.com', port=443): Max retries exceeded with url: /authorize?client_id=oQHmXPB3ICijjRnK77hQNwjwL5aW536Z&redirect_uri=https%3A%2F%2Fkf-api-fhir-service.kidsfirstdrc.org%2Foauth2%2Fidpresponse&response_type=code&scope=openid&state=TIfCUxK9i4maiNCPU66cguERCMxZzJixqXmvcmDHVyLldXuTSlVMAcxW0OFsLzKCb6pKJAduDhggd5g%2F1N1dDN1a8IZoWsoNbVCfGXebdnNGkrGa2vBqnuxBfQF8JTmbxlEAdT2Bu4D0KVMiIISD2Ug47QBKmczlYIM9o4EiK1HWX5TGBQZAryP9k2UTD4gbshJqB6VupU%2BLNc8QSc0%3D (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))

In [8]:
r = sess.get(f"{FHIR_SERVER}/ResearchStudy")
study_bundle = r.json()

# In the bundle obtained total is not present
#print(f"Number of matches: {patient_bundle['total']}")
print(f"Number of Studies included in Bundle: {len(study_bundle['entry'])}")
studies = [entry['resource'] for entry in study_bundle['entry']]

for s in studies:

    print (s['id'])
    print (s['identifier'][1]['value'])
    print (s['title'])
    #print(json.dumps(s, indent=3))
    print('_'*40)

Number of Studies included in Bundle: 28
276195
ResearchStudy-SD_9PYZAHHE
Genomic Studies of Orofacial Cleft Birth Defects
________________________________________
272532
ResearchStudy-SD_7NQ9151J
Genome-wide Sequencing to Identify the Genes Responsible for Enchondromatoses and Related Malignant Tumors
________________________________________
269246
ResearchStudy-SD_B8X3C1MX
Kids First: Craniofacial Microsomia: Genetic Causes and Pathway Discovery
________________________________________
267437
ResearchStudy-SD_DK0KRWK8
Whole Genome Sequencing of African and Asian Orofacial Clefts Case-Parent Triads
________________________________________
260974
ResearchStudy-SD_NMVV8A1Y
Kids First: Genetics of Structural Defects of the Kidney and Urinary Tract
________________________________________
125635
ResearchStudy-SD_YNSSAPHE
TARGET: Neuroblastoma (NBL)
________________________________________
123556
ResearchStudy-SD_W0V965XZ
Genomic Analysis of Familial Leukemia
______________________________

Query for the Familial Leukemia study as I have access to that

In [9]:
fll_study_id = '123556'
r = sess.get(f"{FHIR_SERVER}/ResearchStudy?_id={fll_study_id}")
study_bundle = r.json()

print(f"Number of Studies included in Bundle: {len(study_bundle['entry'])}")

# Create list of just the Patient Resources in the Bundle
studies = [entry['resource'] for entry in study_bundle['entry']]
print(json.dumps(studies[0], indent=3))

Number of Studies included in Bundle: 1
{
   "resourceType": "ResearchStudy",
   "id": "123556",
   "meta": {
      "versionId": "2",
      "lastUpdated": "2022-01-19T01:26:25.167+00:00",
      "source": "#cdPsDsTirYYBOYpQ",
      "profile": [
         "http://hl7.org/fhir/StructureDefinition/ResearchStudy"
      ]
   },
   "identifier": [
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/studies/",
         "value": "SD_W0V965XZ"
      },
      {
         "system": "urn:kids-first:unique-string",
         "value": "ResearchStudy-SD_W0V965XZ"
      },
      {
         "system": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=",
         "value": "phs001738.v1.p1"
      }
   ],
   "title": "Genomic Analysis of Familial Leukemia",
   "status": "completed",
   "category": [
      {
         "coding": [
            {
               "system": "http://snomed.info/sct",
               "code": "86049000",
               "display": "Malignant neoplasm

ResearchSubject is being used. What do the resources look like.

In [10]:
r = sess.get(f"{FHIR_SERVER}/ResearchSubject")
subject_bundle = r.json()

# In the bundle obtained total is not present
#print(f"Number of matches: {subject_bundle['total']}")
print(f"Number of Subjects included in Bundle: {len(subject_bundle['entry'])}")
subjects = [entry['resource'] for entry in subject_bundle['entry']]

for sub in subjects:
    print(json.dumps(sub))
    print('_____________________')

Number of Subjects included in Bundle: 50
{"resourceType": "ResearchSubject", "id": "104225", "meta": {"versionId": "3", "lastUpdated": "2021-11-16T16:53:15.838+00:00", "source": "#WtnuAKQbiCsiqGsp", "profile": ["http://hl7.org/fhir/StructureDefinition/ResearchSubject"], "tag": [{"code": "SD_0TYVY1TW"}]}, "identifier": [{"value": "360"}, {"system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/", "value": "PT_FP981XCW"}, {"system": "urn:kids-first:unique-string", "value": "ResearchSubject-SD_0TYVY1TW-PT_FP981XCW"}], "status": "off-study", "study": {"reference": "ResearchStudy/104182"}, "individual": {"reference": "Patient/103052"}}
_____________________
{"resourceType": "ResearchSubject", "id": "104235", "meta": {"versionId": "3", "lastUpdated": "2021-11-16T16:53:16.284+00:00", "source": "#doWQbZZcKr5YPgyU", "profile": ["http://hl7.org/fhir/StructureDefinition/ResearchSubject"], "tag": [{"code": "SD_0TYVY1TW"}]}, "identifier": [{"value": "512"}, {"system": "https://kf-api-d

Getting ourselves a list of the resources actually used in Kids First.

In [33]:
populatedRes = {'ResearchSubject': 11529,
 'DocumentReference': 81538,
 'PractitionerRole': 3,
 'Practitioner': 4,
 'Group': 1614,
 'Organization': 22,
 'SearchParameter': 1635,
 'Task': 3212,
 'ResearchStudy': 7,
 'Specimen': 55715,
 'StructureDefinition': 664,
 'OperationDefinition': 46,
 'ValueSet': 1329,
 'CodeSystem': 1070,
 'Observation': 7957,
 'DiagnosticReport': 5,
 'Condition': 120405,
 'CompartmentDefinition': 5,
 'Patient': 11538}

Let's explore some of them. 
Condition would be useful if we eanted to find subjects with a particular disease.

In [11]:
response = sess.get(f"{FHIR_SERVER}/Condition")
cond_bundle = response.json()

# In the bundle obtained total is not present
#print(f"Number of matches: {subject_bundle['total']}")
print(f"Number of Condition included in Bundle: {len(cond_bundle['entry'])}")
conditions = [entry['resource'] for entry in cond_bundle['entry']]

for condition in conditions:
    print(json.dumps(condition))
    print('_____________________')

Number of Condition included in Bundle: 50
{"resourceType": "Condition", "id": "104958", "meta": {"versionId": "2", "lastUpdated": "2021-11-16T09:49:50.487+00:00", "source": "#FZsSU2fO9pJxG6cc", "profile": ["https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/disease"], "tag": [{"code": "SD_0TYVY1TW"}]}, "identifier": [{"system": "https://kf-api-dataservice.kidsfirstdrc.org/diagnoses/", "value": "DG_25PXA42V"}, {"system": "urn:kids-first:unique-string", "value": "Condition-SD_0TYVY1TW-DG_25PXA42V"}], "clinicalStatus": {"coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active", "display": "Active"}], "text": "Active"}, "category": [{"coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-category", "code": "encounter-diagnosis", "display": "Encounter Diagnosis"}]}], "code": {"text": "EA+TEF"}, "bodySite": [{"text": "Not Reported"}], "subject": {"reference": "Patient/103034"}}
_____________________
{"resourceType": "Conditio

### Getting the BAM files
It turns out the BAM files have implemented as Documents in FHIR. Their details can be accessed via the DocumentReference resource.

We filter on documents which use a profile defined for NCPI DRS references

In [12]:
response = sess.get(f'{FHIR_SERVER}/DocumentReference?_profile=https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/drs-document-reference')

drs_id_bundle = response.json()

# In the bundle obtained total is not present
#print(f"Number of matches: {doc_bundle['total']}")
print(f"Number of DRS files included in Bundle: {len(drs_id_bundle['entry'])}")
drs_ids = [entry['resource'] for entry in drs_id_bundle['entry']]

for drs_id in drs_ids:
    print(json.dumps(drs_id))
    print('_'*50)

Number of DRS files included in Bundle: 50
{"resourceType": "DocumentReference", "id": "283186", "meta": {"versionId": "3", "lastUpdated": "2022-01-18T22:19:57.062+00:00", "source": "#N2omm298271KBelG", "profile": ["https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/drs-document-reference"], "tag": [{"code": "SD_ZXJFFMEF"}]}, "identifier": [{"system": "https://kf-api-dataservice.kidsfirstdrc.org/genomic-files/", "value": "GF_AYWGJGRW"}, {"system": "urn:kids-first:unique-string", "value": "DocumentReference-SD_ZXJFFMEF-GF_AYWGJGRW"}], "status": "current", "docStatus": "final", "type": {"text": "Variant Calls Index"}, "subject": {"reference": "Patient/21953"}, "securityLabel": [{"coding": [{"system": "http://terminology.hl7.org/CodeSystem/v3-Confidentiality", "code": "R", "display": "restricted"}], "text": "True"}, {"text": "SD_ZXJFFMEF"}, {"coding": [{"code": "c999"}], "text": "phs001714.c999"}, {"coding": [{"code": "c1"}], "text": "phs001714.c1"}], "content": [{"attachment": {

How would we find all the Document References for a patient?

Define a function that does a couple of things that are a routine part of our approach
a) deals with pagination
b) deals with converting one or more bundles to a list of resources 
c) uses the session we created

It takes any query that would be submitted to the server endpoint.

Care should be taken to not submit queries which would return large number of resources e.g. "/Patient"

If you are uncertain use the page_limit to set a maximum number of pages that would be returned.

In [19]:
def run_query(query, page_limit=None):
    response = sess.get(f"{FHIR_SERVER}/{query}")
    bundles = [response.json()]
    next_page_link = next(filter(lambda link: link['relation'] == 'next', bundles[0]['link']), None)
    page_count = 1
    if page_limit:
        if page_count >= page_limit:
            next_page_link = None
    while next_page_link:
        next_page = sess.get(next_page_link['url']).json()
        bundles.append(next_page)
        next_page_link = next(filter(lambda link: link['relation'] == 'next', next_page['link']), None)
        page_count += 1
        if page_limit:
            if page_count >= page_limit:
                next_page_link = None
    
    resources = [entry['resource'] for sb in bundles for entry in sb['entry']]
    return resources

# NOTE: No cell output.

In [20]:
documents = run_query("Patient?identifier=Patient-SD_BHJXBDQK-PT_ZRQQC2S9")
print("# of documents:{}".format(len(documents)))

# of documents:1


In [21]:
for p in documents:
    for i in p['identifier']:
        print(i)
    print(p)
    print('_'*80)

{'value': 'C945009'}
{'system': 'https://kf-api-dataservice.kidsfirstdrc.org/participants/', 'value': 'PT_ZRQQC2S9'}
{'system': 'urn:kids-first:unique-string', 'value': 'Patient-SD_BHJXBDQK-PT_ZRQQC2S9'}
{'resourceType': 'Patient', 'id': '74104', 'meta': {'versionId': '2', 'lastUpdated': '2021-11-16T08:42:36.313+00:00', 'source': '#AUYp4dxWaLedbTb9', 'profile': ['http://hl7.org/fhir/StructureDefinition/Patient'], 'tag': [{'code': 'SD_BHJXBDQK'}]}, 'extension': [{'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-race', 'extension': [{'url': 'text', 'valueString': 'White'}, {'url': 'ombCategory', 'valueCoding': {'system': 'urn:oid:2.16.840.1.113883.6.238', 'code': '2106-3', 'display': 'White'}}]}, {'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity', 'extension': [{'url': 'text', 'valueString': 'Not Hispanic or Latino'}, {'url': 'ombCategory', 'valueCoding': {'system': 'urn:oid:2.16.840.1.113883.6.238', 'code': '2186-5', 'display': 'Not Hispanic or La

In [22]:
documents = run_query("DocumentReference?subject=74104")
print("# of documents:{}".format(len(documents)))

# of documents:71


Create a dataframe for the files

In [24]:
documents[1]

{'resourceType': 'DocumentReference',
 'id': '490694',
 'meta': {'versionId': '1',
  'lastUpdated': '2021-10-16T01:24:08.401+00:00',
  'source': '#dbXRh1FqIuTWfpA0',
  'profile': ['https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/drs-document-reference']},
 'identifier': [{'system': 'https://kf-api-dataservice.kidsfirstdrc.org/genomic-files/',
   'value': 'GF_4RTXDX2E'},
  {'system': 'urn:kids-first:unique-string',
   'value': 'DocumentReference-SD_BHJXBDQK-GF_4RTXDX2E'}],
 'status': 'current',
 'docStatus': 'final',
 'type': {'text': 'Pathology Reports'},
 'subject': {'reference': 'Patient/74104'},
 'securityLabel': [{'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/v3-Confidentiality',
     'code': 'R',
     'display': 'restricted'}],
   'text': 'True'},
  {'text': 'SD_BHJXBDQK'},
  {'coding': [{'code': 'c1'}], 'text': 'SD_BHJXBDQK.c1'}],
 'content': [{'attachment': {'url': 'drs://data.kidsfirstdrc.org//5f874516-28cb-40da-a356-f1c3c548cc0d'}},
  {'attachment': 

In [27]:
documents[1]['content'][1]

{'attachment': {'extension': [{'url': 'https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/file-size',
    'valueDecimal': 555176},
   {'url': 'https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/hashes',
    'valueCodeableConcept': {'coding': [{'display': 'md5'}],
     'text': '77ffade22206b7304d0da0c0b9bb10e4'}},
   {'url': 'https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/hashes',
    'valueCodeableConcept': {'coding': [{'display': 'sha256'}],
     'text': 'c1032f11649fbb89ea32421e9455cbea2c9f774372ef9489cc9b5acd2294b319'}}],
  'url': 's3://kf-study-us-east-1-prd-sd-bhjxbdqk/source/images/7316-3139/Pathology/Path-Report-7316-3139-Redacted.pdf',
  'title': 'Path-Report-7316-3139-Redacted.pdf'},
 'format': {'display': 'pdf'}}

In [28]:
details = [c for c in documents[1]['content'] if 'format' in c]
print(json.dumps(details, indent =3))

[
   {
      "attachment": {
         "extension": [
            {
               "url": "https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/file-size",
               "valueDecimal": 555176
            },
            {
               "url": "https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/hashes",
               "valueCodeableConcept": {
                  "coding": [
                     {
                        "display": "md5"
                     }
                  ],
                  "text": "77ffade22206b7304d0da0c0b9bb10e4"
               }
            },
            {
               "url": "https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/hashes",
               "valueCodeableConcept": {
                  "coding": [
                     {
                        "display": "sha256"
                     }
                  ],
                  "text": "c1032f11649fbb89ea32421e9455cbea2c9f774372ef9489cc9b5acd2294b319"
               }
         

In [29]:
details[0]['attachment']['title']

'Path-Report-7316-3139-Redacted.pdf'

In [31]:
import pandas as pd

myDocs = []
for d in documents:
    kfid = [did for did in d['identifier'] if did['system']=='urn:kids-first:unique-string']
    details = [c for c in d['content'] if 'format' in c]
    if len(details) > 0:
        fmat = details[0]['format']['display']
        title = details[0]['attachment']['title']
    else:
        title = ""
        fmat = ""
    myDocs.append({
        #"kfid":kfid[0]['value'],
    "type":d['type']['text'],
    "format":fmat,
    "title":title

    #"drs":djson['content'][0]['attachment']['url']
                  }
    )

docsDF = pd.DataFrame(myDocs)
docsDF = docsDF.sort_values(["type", "format"])
from IPython.display import display, HTML
docsDF.style.set_properties(subset=['title'], **{'width': '12px'})
display (docsDF)



Unnamed: 0,type,format,title
2,Aligned Reads,bam,1f2e4474-09f7-4cd0-b9eb-dca358bf16af.bam
15,Aligned Reads,bam,01933cf0-982d-456e-ba21-bccbbbaacea7.bam
63,Aligned Reads,bam,81526ece-87af-4885-925e-105fd8e7f8b4.bam
66,Aligned Reads,bam,81526ece-87af-4885-925e-105fd8e7f8b4.local.tra...
67,Aligned Reads,bam,9fbd4b22-d4c0-4b3d-a8cf-a2cf5c347d1f.Aligned.o...
...,...,...,...
51,Variant Calls Index,tbi,ebd9c103-0ffe-466c-b573-ac01fa6b264d.consensus...
13,gVCF,gVCF,882e7692-5b48-4249-acc6-e93a4f10d232.g.vcf.gz
53,gVCF,gVCF,88afb639-3022-42aa-8e53-0a8b6c8e1329.g.vcf.gz
11,gVCF Index,tbi,882e7692-5b48-4249-acc6-e93a4f10d232.g.vcf.gz.tbi


In [199]:
docsDF.to_csv('subject_74104_files.tsv', sep="\t")

For these 71 files
Nothing tells us what the different bam files are.
Nothing tells us which crai applies to which cram.

Perhaps looking at samples for this subject might tell us that sequencing was done on more than one sample, e.g. tumor and somatic.

How do we query for samples. We've seen that there are Specimen resources.

The following is a guess - based on what worked for Documents

In [32]:
specimens = run_query("Specimen?subject=74104")
print("# of specimens:{}".format(len(specimens)))

# of specimens:19


OK! Let's look at the specimens

In [33]:
print (len(specimens))
for s in specimens:
    print(json.dumps(s, indent=3))
    print ('_'*50)

19
{
   "resourceType": "Specimen",
   "id": "457487",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2021-10-15T23:46:51.221+00:00",
      "source": "#zJ9auAJadjNul0rs",
      "profile": [
         "http://hl7.org/fhir/StructureDefinition/Specimen"
      ]
   },
   "identifier": [
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/biospecimens/",
         "value": "BS_EJG0RFX6"
      },
      {
         "system": "urn:kids-first:unique-string",
         "value": "Specimen-SD_BHJXBDQK-BS_EJG0RFX6"
      }
   ],
   "status": "available",
   "type": {
      "text": "Not Reported"
   },
   "subject": {
      "reference": "Patient/74104"
   },
   "collection": {
      "bodySite": {
         "text": "Not Reported"
      }
   }
}
__________________________________________________
{
   "resourceType": "Specimen",
   "id": "457486",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2021-10-15T23:46:51.081+00:00",
      "source": "#QtYDSYWXmLnq19SX",
  

It's hard to see what's going on there. Let's see if we can summarize. First it looks like there may have been specimens collected at more than one time. There's an extension used to record age. It's done in a complicated way - by defining and event birth, a relationship "after" and an offset a value and unit. (Wouldn't a CDE for days after birth have been easier?)

Let's look at when the specimens were collected.

In [34]:
specimenCollection =[]
for s in specimens:
    if '_collectedDateTime' in s['collection']:
        offsetDuration = s['collection']['_collectedDateTime']['extension'][0]['extension'][2]['valueDuration']
        specimenCollection.append ({
    "id": s['id'],
    "time": offsetDuration['value'],
    "unit": offsetDuration['unit']
            })
        
    else:
            specimenCollection.append ({
    "id": s['id'],
    "time": None,
    "unit": None
            })    
            
            
specimenCollectionDF = pd.DataFrame(specimenCollection)
specimenCollectionDF

Unnamed: 0,id,time,unit
0,457487,,
1,457486,5407.0,days
2,457484,5407.0,days
3,457482,5407.0,days
4,427006,5407.0,days
5,427005,5407.0,days
6,427004,5407.0,days
7,427003,5407.0,days
8,426999,5407.0,days
9,426997,5407.0,days


Apart from one unknown specimen all were collected on the same day.

What was collected?

That still doesn't tell us a lot. What was different about the solid tissue samples. Tumor? Normal? And the multiple blood samples can't be distinguished. There's also nothing linking the files to the specimens. Which were used for the sequencing and prodcued the bam files? 

The Collection method and site. The blood may have been collected during the surgical procedure

In [35]:
specimenDetails = []
for specimen in specimens:
    kfid = [did for did in specimen['identifier'] if did['system']=='urn:kids-first:unique-string']
    specimenDetails.append ({
    "id": specimen['id'],
    "urn:kids-first:unique-string": kfid[0]['value'],
    "type": specimen['type']['text'],
    "site": specimen['collection']['bodySite']['text']
    #"method": specimen['collection']['method']['text']
    })
specimensDF = pd.DataFrame(specimenDetails)
specimensDF = specimensDF.sort_values(["urn:kids-first:unique-string"])
specimensDF



Unnamed: 0,id,urn:kids-first:unique-string,type,site
5,427005,Specimen-SD_BHJXBDQK-BS_2WDK6MWX,Solid Tissue,Frontal Lobe
7,427003,Specimen-SD_BHJXBDQK-BS_32VYKVBS,Solid Tissue,Frontal Lobe
3,457482,Specimen-SD_BHJXBDQK-BS_4EN8D8Y4,Solid Tissue,Frontal Lobe
17,426986,Specimen-SD_BHJXBDQK-BS_4Q972C4Q,Peripheral Whole Blood,Frontal Lobe
8,426999,Specimen-SD_BHJXBDQK-BS_5D77A3Q6,Solid Tissue,Frontal Lobe
15,426988,Specimen-SD_BHJXBDQK-BS_7HM52B6P,Peripheral Whole Blood,Frontal Lobe
12,426992,Specimen-SD_BHJXBDQK-BS_8RNAR5N3,Solid Tissue,Frontal Lobe
13,426990,Specimen-SD_BHJXBDQK-BS_9A8NPXQ0,Peripheral Whole Blood,Frontal Lobe
4,427006,Specimen-SD_BHJXBDQK-BS_CWA50RW8,Solid Tissue,Frontal Lobe
0,457487,Specimen-SD_BHJXBDQK-BS_EJG0RFX6,Not Reported,Not Reported


In [94]:
print(json.dumps(specimens[1],indent=3))

{
   "resourceType": "Specimen",
   "id": "457486",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2021-10-15T23:46:51.081+00:00",
      "source": "#QtYDSYWXmLnq19SX",
      "profile": [
         "http://hl7.org/fhir/StructureDefinition/Specimen"
      ]
   },
   "identifier": [
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/biospecimens/",
         "value": "BS_PEFRDKDZ"
      },
      {
         "system": "urn:kids-first:unique-string",
         "value": "Specimen-SD_BHJXBDQK-BS_PEFRDKDZ"
      }
   ],
   "status": "available",
   "type": {
      "coding": [
         {
            "system": "http://snomed.info/sct",
            "code": "258435002",
            "display": "Tumor tissue sample (specimen)"
         },
         {
            "system": "http://snomed.info/sct",
            "code": "258566005",
            "display": "Deoxyribonucleic acid sample (specimen)"
         }
      ],
      "text": "Solid Tissue"
   },
   "subject": {
      "r

### Back to Genomics
Which is our goal, and the leukemia study

Find the subjects in the study

In [195]:
studyid = 123556
subjects = run_query(f"ResearchSubject?study={studyid}")
print("# of subjects:{}".format(len(subjects)))

print("# of subjects in study {}:{}".format(studyid, len(subjects)))
for s in subjects[30:40]:
    print(json.dumps(s, indent=3))
    print('_'*50)

# of subjects:620
# of subjects in study 123556:620
{
   "resourceType": "ResearchSubject",
   "id": "124146",
   "meta": {
      "versionId": "3",
      "lastUpdated": "2021-11-16T16:54:47.863+00:00",
      "source": "#33wdBTeajVBdfxma",
      "profile": [
         "http://hl7.org/fhir/StructureDefinition/ResearchSubject"
      ],
      "tag": [
         {
            "code": "SD_W0V965XZ"
         }
      ]
   },
   "identifier": [
      {
         "value": "56830.003"
      },
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/",
         "value": "PT_SEK700X2"
      },
      {
         "system": "urn:kids-first:unique-string",
         "value": "ResearchSubject-SD_W0V965XZ-PT_SEK700X2"
      }
   ],
   "status": "off-study",
   "study": {
      "reference": "ResearchStudy/123556"
   },
   "individual": {
      "reference": "Patient/122881"
   }
}
__________________________________________________
{
   "resourceType": "ResearchSubject",
   "id": "12

Define a function to get Files

In [162]:
def getFiles(sess, subject_id):
    response = sess.get(f"{FHIR_SERVER}/DocumentReference?subject={subject_id}")
    bundle = response.json()
    documents = [entry['resource'] for entry in bundle['entry']]

    print("# of documents for subject {} :{}".format(subject_id, len(documents)))

    myDocs = []
    for d in documents:
        #print(json.dumps(d, indent=3))
        myDocs.append({

        "type":d['type']['text'],
        "format":d['content'][0]['format']['display']
        })

    docsDF = pd.DataFrame(myDocs)
    docsDF = docsDF.sort_values(["type", "format"])
    print(docsDF)

And use that function to get the files for the patients above

In [191]:
response = sess.get(f"{FHIR_SERVER}/DocumentReference?subject=122881")
bundle = response.json()
if 'entry' in bundle:
    documents = [entry['resource'] for entry in bundle['entry']]
else:
    documents = []
print("# of documents:{}".format(len(documents)))

# of documents:0


In [192]:
documents = run_query("DocumentReference?subject=122881")

NameError: name 'resources' is not defined

In [165]:
for s in subjects[30:40]:
    patient = s['individual']['reference']
    print(patient)
    getFiles(sess, patient)
    print('_'*50)

Patient/122881


KeyError: 'entry'

### Can we search for bams specifically?

#TODO fix this

In [116]:
response = sess.get(f"{FHIR_SERVER}/DocumentReference?format__text=bam")
bundle = response.json()

bams = [entry['resource'] for entry in bundle['entry']]
print("# of bams:{}".format(len(bams)))

import pandas as pd

file1 = open("bam_subjects.txt","w") 
for d in bams:
    #print(djson['subject'])
    file1.write(d['subject']['reference']+'\n')

file1.close()

KeyError: 'entry'

Try getting everything and inspect the results

In [118]:
response = sess.get(f"{FHIR_SERVER}/DocumentReference")
bundle = response.json()

documents = [entry['resource'] for entry in bundle['entry']]
print("# of documents:{}".format(len(documents)))

#import pandas as pd

file1 = open("all_docs.txt","w") 
for d in documents:
    if 'format' in d['content'][0]:
        fmt = d['content'][0]['format']['display']
    else:
        fmt = 'none'
    file1.write("{}\t{}\n".format(fmt,d['subject']['reference']))                
file1.close()



# of documents:50


Try a different study

In [89]:
studyid = 701322
response = sess.get(f"{FHIR_SERVER}/ResearchSubject?study={studyid}")
bundle = response.json()

subjects = [entry['resource'] for entry in bundle['entry']]
print("# of subjects:{}".format(len(subjects)))

print("# of subjects in study {}:{}".format(studyid, len(subjects)))
for s in subjects[30:40]:
    print(json.dumps(s, indent=3))
    print('_'*50)

# of subjects:50
# of subjects in study 701322:50
{
   "resourceType": "ResearchSubject",
   "id": "705462",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2021-06-16T02:04:43.838+00:00",
      "source": "#Djd3HyXDFStZyuOw",
      "profile": [
         "http://hl7.org/fhir/StructureDefinition/ResearchSubject"
      ]
   },
   "identifier": [
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/",
         "value": "PT_41TVK406"
      },
      {
         "system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/",
         "value": "?study_id=SD_BHJXBDQK&external_id=C3291480"
      },
      {
         "system": "urn:kids-first:unique-string",
         "value": "ResearchSubject-SD_BHJXBDQK-C3291480"
      }
   ],
   "status": "off-study",
   "study": {
      "reference": "ResearchStudy/701322"
   },
   "individual": {
      "reference": "Patient/701278"
   }
}
__________________________________________________
{
   "resourceType": "Re

In [113]:
#TODO - fix the query by format below

In [111]:
def getFiles2(sess, subject_id, file_type):
    response = sess.get(f"{FHIR_SERVER}/DocumentReference?subject={subject_id}&format_display={file_type}")
    bundle = response.json()
    myDocs = []
    if 'entry' in bundle:
        documents = [entry['resource'] for entry in bundle['entry']]

        print("# of {} files for subject {} :{}".format(file_type, subject_id, len(documents)))

        
        for d in documents:
            #print(json.dumps(d, indent=3))
            myDocs.append({

            "type":d['type']['text'],
            "format":d['content'][0]['format']['display']
            })

    if len(myDocs) > 0:
        docsDF = pd.DataFrame(myDocs)
        docsDF = docsDF.sort_values(["type", "format"])
        print(docsDF)
    else:
        print("No {} files found for subject {}".format(file_type, subject_id))

In [112]:
for s in subjects[30:40]:
    patient = s['individual']['reference']
    getFiles2(sess, patient, 'bam')
    print('_'*50)

No bam files found for subject Patient/701278
__________________________________________________
No bam files found for subject Patient/701284
__________________________________________________
No bam files found for subject Patient/701297
__________________________________________________
No bam files found for subject Patient/701272
__________________________________________________
No bam files found for subject Patient/701290
__________________________________________________
No bam files found for subject Patient/701280
__________________________________________________
No bam files found for subject Patient/701279
__________________________________________________
No bam files found for subject Patient/701319
__________________________________________________
No bam files found for subject Patient/701271
__________________________________________________
No bam files found for subject Patient/701292
__________________________________________________
