### dbGaP Asthma datasets for Computable Cohort Representation hackathon

This is a second approach to identifying asthma datasets in dbGaP, this time using the *dbGaP on FHIR* server.

The following steps search dbGaP for studies with "asthma" in the title.

This notebook uses code snippets from Exercise 0 of the [NIH FHIR training](https://github.com/NIH-ODSS/fhir-exercises/tree/main/Python).

In [1]:
import sys
import os
import json
import requests

FHIR_SERVER = 'https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1'

# Optional: Turn off SSL verification. Useful when dealing with a corporate proxy with self-signed certificates.
# This should be set to True unless you actually see certificate errors.
VERIFY_SSL = True

if not VERIFY_SSL:
    requests.packages.urllib3.disable_warnings()

# We make a requests.Session to ensure consistent headers across all the requests we make
sess = requests.Session()
sess.headers.update({'Accept': 'application/fhir+json'})
sess.verify = VERIFY_SSL

# Test the connection by querying the server metadata
r = sess.get(f"{FHIR_SERVER}/metadata")
r.raise_for_status()
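
The request above stores the response in `r` but does not inspect its contents. As a minimal sketch, assuming the server answers `/metadata` with a standard FHIR CapabilityStatement, the helper below pulls out the FHIR version and the resource types the server advertises. The function name and the sample dictionary are hypothetical and written by hand for illustration; they are not real dbGaP output.

```python
def summarize_capability(capability):
    """Return a few identifying fields from a FHIR CapabilityStatement dict."""
    if capability.get("resourceType") != "CapabilityStatement":
        raise ValueError("Expected a CapabilityStatement resource")
    # Collect the resource types listed under each REST endpoint
    resource_types = sorted(
        res["type"]
        for rest in capability.get("rest", [])
        for res in rest.get("resource", [])
    )
    return {
        "fhirVersion": capability.get("fhirVersion"),
        "resourceTypes": resource_types,
    }


# Hypothetical sample; a live server would return far more detail
sample_capability = {
    "resourceType": "CapabilityStatement",
    "fhirVersion": "4.0.1",
    "rest": [
        {"mode": "server", "resource": [{"type": "ResearchStudy"}, {"type": "Observation"}]}
    ],
}

summary = summarize_capability(sample_capability)
print(summary)
```

Against the live server, the same helper could be called as `summarize_capability(r.json())`.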



Before running the actual search, we define a function that handles pagination of FHIR results and returns the resources from all pages as a single list.

In [2]:
def run_query(query, page_limit=None):
    """Run a FHIR search and return the resources from all result pages as one list."""
    def next_link(bundle):
        # Find the 'next' pagination link in a bundle, if there is one
        return next((link for link in bundle.get('link', []) if link['relation'] == 'next'), None)

    response = sess.get(f"{FHIR_SERVER}/{query}")
    response.raise_for_status()
    bundles = [response.json()]
    next_page_link = next_link(bundles[0])
    # Keep following 'next' links until they run out or page_limit pages have been read
    while next_page_link and not (page_limit and len(bundles) >= page_limit):
        next_page = sess.get(next_page_link['url']).json()
        bundles.append(next_page)
        next_page_link = next_link(next_page)

    # A bundle with no matches has no 'entry' key, so default to an empty list
    return [entry['resource'] for sb in bundles for entry in sb.get('entry', [])]
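
To see the pagination logic without touching the network, the following sketch follows `next` links through a small in-memory result set. It is an illustration under stated assumptions rather than the notebook's own code: `collect_resources`, the `fetch` callable, and the two hand-written bundles (including their URLs and ids) are all hypothetical.

```python
def collect_resources(first_bundle, fetch, page_limit=None):
    """Follow 'next' links from first_bundle and return all resources as one list.

    fetch is any callable that maps a URL to a parsed Bundle dict.
    """
    resources = []
    bundle = first_bundle
    pages_read = 0
    while bundle is not None:
        pages_read += 1
        resources.extend(entry["resource"] for entry in bundle.get("entry", []))
        if page_limit and pages_read >= page_limit:
            break
        next_url = next(
            (link["url"] for link in bundle.get("link", []) if link.get("relation") == "next"),
            None,
        )
        bundle = fetch(next_url) if next_url else None
    return resources


# Hypothetical two-page result set held in memory (URLs and ids are made up)
later_pages = {
    "page2": {"resourceType": "Bundle", "link": [], "entry": [{"resource": {"id": "c"}}]},
}
first_page = {
    "resourceType": "Bundle",
    "link": [{"relation": "next", "url": "page2"}],
    "entry": [{"resource": {"id": "a"}}, {"resource": {"id": "b"}}],
}

all_ids = [r["id"] for r in collect_resources(first_page, later_pages.__getitem__)]
first_page_ids = [r["id"] for r in collect_resources(first_page, later_pages.__getitem__, page_limit=1)]
print(all_ids)
print(first_page_ids)
```

Passing the fetch step in as a callable keeps the paging logic testable; in the notebook the equivalent step is a `sess.get(...).json()` call.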

To see what a ResearchStudy resource looks like in dbGaP on FHIR, we run the following query for a single study and print the whole resource.

This helps see that some of the data we want to extract from the resource are held in extensions to the FHIR model.

In [3]:
documents = run_query("ResearchStudy?_id=phs001156")
print("# of studies:{}".format(len(documents)))

for s in documents:
    print("Study id: {}".format(s['id']))
    print("Study title: {}".format(s['title']))
    print("Full resource")
    print(json.dumps(s, indent=3))
    print('_'*40)

# of studies:1
Study id: phs001156
Study title: The EVE Asthma Genetics Consortium: Building Upon GWAS
Full resource
{
   "resourceType": "ResearchStudy",
   "id": "phs001156",
   "meta": {
      "versionId": "1",
      "lastUpdated": "2022-04-18T02:42:56.893-04:00",
      "source": "#zfnyPER0c6MFRVAg",
      "security": [
         {
            "system": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/CodeSystem/DbGaPConcept-SecurityStudyConsent",
            "code": "public",
            "display": "public"
         }
      ]
   },
   "extension": [
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyOverviewUrl",
         "valueUrl": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001156.v2.p1"
      },
      {
         "url": "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-ReleaseDate",
         "valueDate": "2018-05-08"
      },
      {
         "url": "https://dbgap-api.ncbi.

With the information above, we can now run our query for studies with the word "asthma" in the title. We rely on standard FHIR search syntax for this: the `:contains` modifier matches the given text anywhere in the title. According to the FHIR standard, the search is case-insensitive.

In [4]:
documents = run_query("ResearchStudy?title:contains=asthma")

print("# of studies:{}".format(len(documents)))

for s in documents:
    print("Study id: {}".format(s['id']))
    print("Study title: {}".format(s['title']))

#    keyValList = ["https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyConsents"]
#    expectedResult = [d for d in s['extension'] if d['url'] in keyValList]

    #print(json.dumps(s, indent=3))
    #print('_'*40)

# of studies:29
Study id: phs000166
Study title: SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP)
Study id: phs000233
Study title: Genome Wide Association Study of Asthma
Study id: phs000355
Study title: Genome Wide Association for Asthma and Lung Function
Study id: phs000422
Study title: NHLBI GO-ESP: Lung Cohorts Exome Sequencing Project (Asthma)
Study id: phs000886
Study title: An Omics View of Asthma through Monozygotic Twins
Study id: phs000920
Study title: NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II)
Study id: phs000921
Study title: NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE)
Study id: phs000988
Study title: NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica
Study id: phs001009
Study title: Determinants of Asthma Following RSV Bronchiolitis in Early Life
Study id: phs001123
Study title: Consortium on Asthma among African-ancestry Populations in the Americas
Study id
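
The search string passed to `run_query` above follows the FHIR pattern `ResourceType?parameter:modifier=value`. As a hedged sketch, the helper below builds such a string from a dictionary so that values containing spaces are encoded safely. `build_search` is a hypothetical name, and `_count` (a standard FHIR search parameter for page size) is included only as an illustration; whether the dbGaP server honors it is an assumption.

```python
from urllib.parse import urlencode


def build_search(resource_type, params):
    """Build the 'ResourceType?key=value&...' part of a FHIR search URL."""
    # safe=':' keeps the colon in modifiers such as title:contains readable
    return f"{resource_type}?{urlencode(params, safe=':')}"


query = build_search("ResearchStudy", {"title:contains": "lung function", "_count": 50})
print(query)
# ResearchStudy?title:contains=lung+function&_count=50
```

The resulting string could be passed directly to `run_query`.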

### Extracting information about the studies
We'll define another function for convenience. Because so many of the study details are in extensions, and there are extensions within extensions, the following function helps us deal with an extension at any level.

In [5]:
# For a given resource, find the extension identified by a given url.
# The assumption is that there is only one such extension within a given resource;
# for the dbGaP ResearchStudy resource that is true.
# Returns an empty dict if the resource has no such extension.
def getExtension(resource, uri):
    exts = [d for d in resource.get('extension', []) if d['url'] == uri]
    return exts[0] if exts else {}
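
When the extension urls are not known in advance, it can help to list every extension in a resource together with its nesting depth. The recursive generator below is a minimal sketch of that idea. The sample resource is hand-written: its two outer urls are the ones used in this notebook, while the inner consent url, the count, and the consent code are placeholders.

```python
BASE = "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition"


def walk_extensions(node, depth=0):
    """Yield (depth, url) for every extension nested anywhere under node."""
    for ext in node.get("extension", []):
        yield depth, ext.get("url")
        # Extensions can themselves carry extensions, so recurse
        yield from walk_extensions(ext, depth + 1)


# Hand-written sample; values and the inner consent url are placeholders
sample_study = {
    "resourceType": "ResearchStudy",
    "id": "example",
    "extension": [
        {
            "url": f"{BASE}/ResearchStudy-Content",
            "extension": [
                {"url": f"{BASE}/ResearchStudy-Content-NumSubjects", "valueCount": {"value": 42}}
            ],
        },
        {
            "url": f"{BASE}/ResearchStudy-StudyConsents",
            "extension": [
                {"url": "urn:example:consent", "valueCoding": {"display": "GRU"}}
            ],
        },
    ],
}

found = list(walk_extensions(sample_study))
for depth, url in found:
    # Indent by depth and show only the last path segment of each url
    print("  " * depth + url.rsplit("/", 1)[-1])
```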

We can now use the function above to find the extensions for
* the number of subjects in the study
* the consent groups within the study

The following extracts these details and puts them into a list.

In [6]:
studies = []
for s in documents:
    print(s['id'])
    print(s['title'])
    # use our function to find the "study content" extension
    content = getExtension(s, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-Content")
    # use our function again to find the "number of subjects" extension nested within the content extension
    subject_ext = getExtension(content, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-Content-NumSubjects")
    # Not every study reports a subject count, so fall back to 0
    subject_count = subject_ext.get('valueCount', {}).get('value', 0)

    # Now find the extension containing the study consents
    consent_ext = getExtension(s, "https://dbgap-api.ncbi.nlm.nih.gov/fhir/x1/StructureDefinition/ResearchStudy-StudyConsents")
    # extract the display name for each consent group and print them
    consents = [d['valueCoding']['display'] for d in consent_ext.get('extension', [])]
    print(consents)
    # Add the relevant details to our list of studies
    study = {"id": s['id'], "title": s['title'], "num_subjects": subject_count, "consents": consents}
    studies.append(study)
    print('_'*40)

phs000166
SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP)
['NRUP', 'ARR']
________________________________________
phs000233
Genome Wide Association Study of Asthma
['Analysis']
________________________________________
phs000355
Genome Wide Association for Asthma and Lung Function
['Analysis']
________________________________________
phs000422
NHLBI GO-ESP: Lung Cohorts Exome Sequencing Project (Asthma)
['GRU']
________________________________________
phs000886
An Omics View of Asthma through Monozygotic Twins
['GRU']
________________________________________
phs000920
NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II)
['DS-LD-IRB-COL']
________________________________________
phs000921
NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE)
['DS-LD-IRB-COL']
________________________________________
phs000988
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica
['NRUP', 'DS-ASTHMA-IRB-MDS-RD

We can then put our list of studies into a DataFrame for display.

We list the consent groups so we can see whether any of the studies could potentially be used in the Computable Cohort Representation asthma exercise.

In [8]:
import pandas as pd
# None removes the column width limit so long titles are not truncated
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(studies)
df.sort_values(by=['id'], inplace=True)
df


| | id | title | num_subjects | consents |
|---|---|---|---|---|
| 0 | phs000166 | SNP Health Association Resource (SHARe) Asthma Resource Project (SHARP) | 4046 | [NRUP, ARR] |
| 1 | phs000233 | Genome Wide Association Study of Asthma | 0 | [Analysis] |
| 2 | phs000355 | Genome Wide Association for Asthma and Lung Function | 0 | [Analysis] |
| 3 | phs000422 | NHLBI GO-ESP: Lung Cohorts Exome Sequencing Project (Asthma) | 191 | [GRU] |
| 4 | phs000886 | An Omics View of Asthma through Monozygotic Twins | 74 | [GRU] |
| 5 | phs000920 | NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II) | 4860 | [DS-LD-IRB-COL] |
| 6 | phs000921 | NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) | 2106 | [DS-LD-IRB-COL] |
| 7 | phs000988 | NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica | 4230 | [NRUP, DS-ASTHMA-IRB-MDS-RD] |
| 8 | phs001009 | Determinants of Asthma Following RSV Bronchiolitis in Early Life | 178 | [NRUP, DS-AAR-IRB] |
| 9 | phs001123 | Consortium on Asthma among African-ancestry Populations in the Americas | 14548 | [NRUP, GRU-IRB, HMB, DS-LD, HMB-IRB-NPU, DS-FDO-IRB-NPU, HMB-IRB, DS-FDO-IRB] |
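
Once the consent groups are collected, shortlisting candidate studies is a simple filter. The sketch below uses a few rows copied from the table above and keeps the studies that offer a given consent code. Choosing `GRU` (general research use) as the code of interest is an assumption made for illustration, and `with_consent` is a hypothetical helper name.

```python
# A few rows copied from the table above (titles omitted for brevity)
studies_sample = [
    {"id": "phs000233", "num_subjects": 0, "consents": ["Analysis"]},
    {"id": "phs000422", "num_subjects": 191, "consents": ["GRU"]},
    {"id": "phs000886", "num_subjects": 74, "consents": ["GRU"]},
    {"id": "phs000920", "num_subjects": 4860, "consents": ["DS-LD-IRB-COL"]},
    {"id": "phs001009", "num_subjects": 178, "consents": ["NRUP", "DS-AAR-IRB"]},
]


def with_consent(studies, wanted):
    """Return the studies that offer at least one consent group in wanted."""
    wanted = set(wanted)
    # Codes are compared exactly, so 'GRU-IRB' would not match 'GRU'
    return [s for s in studies if wanted & set(s["consents"])]


gru_studies = with_consent(studies_sample, {"GRU"})
print([s["id"] for s in gru_studies])
```

The same function accepts the full `studies` list built earlier in the notebook, since each entry has a `consents` list.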


In [9]:
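# Write to the current user's Downloads folder rather than a hard-coded home directory
df.to_csv(os.path.join(os.path.expanduser("~"), "Downloads", "asthma.csv"))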