## Comparisons of methods for searching FHIR

This notebook explores querying Kids First FHIR resources both through native FHIR and through GA4GH Data Connect. The aim is to see how to execute the kinds of phenotypic data queries that researchers might want to run.

### Starting with Data Connect
Patient resources from the NCPI FHIR Resource created under Project Forge were put into a BigQuery table as native FHIR JSON. These can then be queried through SQL using functions that can search within and/or unpack the nested JSON in the resource.

This is an example of a Patient resource for this Kids First data.

In [4]:
exampleRecord = {'resourceType': 'Patient',
 'id': '451135',
 'meta': {'versionId': '1',
  'lastUpdated': '2020-11-04T19:07:52.139+00:00',
  'source': '#dkai2MIaZ2WetDpt',
  'profile': ['http://hl7.org/fhir/StructureDefinition/Patient']},
 'extension': [{'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity',
   'extension': [{'url': 'ombCategory',
     'valueCoding': {'system': 'urn:oid:2.16.840.1.113883.6.238',
      'code': '2135-2',
      'display': 'Hispanic or Latino'}},
    {'url': 'text', 'valueString': 'Hispanic or Latino'}]},
  {'url': 'http://hl7.org/fhir/us/core/StructureDefinition/us-core-race',
   'extension': [{'url': 'ombCategory',
     'valueCoding': {'system': 'urn:oid:2.16.840.1.113883.6.238',
      'code': '2106-3',
      'display': 'White'}},
    {'url': 'text', 'valueString': 'White'}]}],
 'identifier': [{'system': 'https://kf-api-dataservice.kidsfirstdrc.org/participants?study_id=SD_PREASA7S&external_id=',
   'value': '309'},
  {'system': 'urn:kids-first:unique-string',
   'value': 'Patient|SD_PREASA7S|309'}],
 'gender': 'female'}

Querying on gender is fairly straightforward as it is not deeply nested. 

Note: Can we assume that gender uses a standard set of values? Hold that thought.
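
One way to test that assumption is to compare the values actually present against the FHIR `administrative-gender` value set, to which `Patient.gender` is bound (`male`, `female`, `other`, `unknown`). The following is a minimal sketch, assuming the records are already available as Python dicts; the `summarize_gender` helper and the sample `records` list are hypothetical and not part of the Kids First data.

```python
from collections import Counter

# Codes in the FHIR administrative-gender value set, to which Patient.gender is bound.
ADMINISTRATIVE_GENDER = {"male", "female", "other", "unknown"}

def summarize_gender(records):
    """Count gender values and list any that fall outside the value set."""
    counts = Counter(r.get("gender", "<missing>") for r in records)
    unexpected = sorted(
        v for v in counts if v != "<missing>" and v not in ADMINISTRATIVE_GENDER
    )
    return counts, unexpected

# Hypothetical sample records, for illustration only.
records = [{"gender": "female"}, {"gender": "male"}, {"gender": "F"}, {}]
counts, unexpected = summarize_gender(records)
print(dict(counts))
print("Outside the value set:", unexpected)
```

Run over the patients returned by a query, a check like this would show whether an equality test on `'female'` could miss records coded differently.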

In [16]:
import json

from fasp.search  import DataConnectClient

searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')

query = """select id, patient from kidsfirst.ga4gh_tables.patient 
where json_extract_scalar(patient, '$.gender') = 'female'"""

res = searchClient.runQuery(query)
print("Found {} female patients".format(len(res)))

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
Found 1034 female patients


Next, a query on what one might expect to be another simple, semantically defined attribute: ethnicity. However, in these resources ethnicity is recorded in an extension to the FHIR model. A level of indirection is necessary to query on that attribute, and the value of ethnicity must then be unpacked.

This is somewhat complex, so we take it one step at a time.

The following where clause selects any patient whose first extension is ethnicity. We are not yet filtering on a particular value for ethnicity.

```
where json_extract_scalar(patient, '$.extension[0].url') = 
'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity'
```

Running that as follows:

In [17]:
import json

from fasp.search  import DataConnectClient

searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')

query = """select id, patient from kidsfirst.ga4gh_tables.patient 
where json_extract_scalar(patient, '$.extension[0].url') = 
'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity' 
limit 3"""
#TODO query on the value of ethnicity with AND

res = searchClient.runQuery(query)

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


We can now unpack the results, which will perhaps give us some clues about how to formulate a query on specific values of ethnicity.

In [None]:
for r in res:
    patient = r[1]
    print(patient['id'], patient['gender'])
    for e in patient['extension']:
        print(e['url'])
        print(e['extension'][0]['url'])
        vc = e['extension'][0]['valueCoding']
        print(vc['code'], vc['display'])

We can query on the code for ethnicity. Note that we have to rely on the ethnicity extension being at a particular index in the array of extensions.

Issue: that does not scale.
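
Once a resource has been retrieved, the index dependence can at least be avoided on the client side by matching on the extension `url` instead of its position. This is a minimal sketch under that assumption; the helper names `find_extension` and `omb_category_code` are hypothetical, and the sample patient is a trimmed copy of the example record with the extension order deliberately swapped.

```python
ETHNICITY_URL = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity"
RACE_URL = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race"

def find_extension(resource, url):
    """Return the first extension with the given url, at any index, or None."""
    for ext in resource.get("extension", []):
        if ext.get("url") == url:
            return ext
    return None

def omb_category_code(resource, url):
    """Return the ombCategory code nested inside the named extension, or None."""
    ext = find_extension(resource, url)
    if ext is None:
        return None
    for sub in ext.get("extension", []):
        if sub.get("url") == "ombCategory":
            return sub.get("valueCoding", {}).get("code")
    return None

# Trimmed example record with race listed first and ombCategory listed second.
patient = {
    "resourceType": "Patient",
    "extension": [
        {"url": RACE_URL,
         "extension": [{"url": "ombCategory",
                        "valueCoding": {"code": "2106-3", "display": "White"}}]},
        {"url": ETHNICITY_URL,
         "extension": [{"url": "text", "valueString": "Hispanic or Latino"},
                       {"url": "ombCategory",
                        "valueCoding": {"code": "2135-2", "display": "Hispanic or Latino"}}]},
    ],
}
print(omb_category_code(patient, ETHNICITY_URL))
```

This only helps when unpacking results. The query itself still has to locate the extension on the server side.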

In [41]:
import json

from fasp.search  import DataConnectClient

searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com')

query = """select id, patient from kidsfirst.ga4gh_tables.patient 
where json_extract_scalar(patient, '$.gender') = 'female'
and json_extract_scalar(patient, '$.extension[0].url') = 
'http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity' 
and json_extract_scalar(patient, '$.extension[0].extension[0].valueCoding.code') = '2135-2'
limit 20"""


res = searchClient.runQuery(query)

for r in res:
    patient = r[1]
    print(patient['id'], patient['gender'])
    for e in patient['extension']:
        print(e['url'])
        print(e['extension'][0]['url'])
        vc = e['extension'][0]['valueCoding']
        print(vc['code'], vc['display'])
    print('_______________________')

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
451134 female
http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity
ombCategory
2135-2 Hispanic or Latino
http://hl7.org/fhir/us/core/StructureDefinition/us-core-race
ombCategory
2106-3 White
_______________________
451135 female
http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity
ombCategory
2135-2 Hispanic or Latino
http://hl7.org/fhir/us/core/StructureDefinition/us-core-race
ombCategory
2106-3 White
_______________________
451136 female
http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity
ombCategory
2135-2 Hispanic or Latino
http://hl7.org/fhir/us/core/StructureDefinition/us-core-race
ombCategory
2106-3 White
_______________________
451165 female
http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity
ombCategory
2135-2 Hispanic or Latino
http://hl7.org/fhir/us/core/StructureDefinition/us

At this point we might try formulating a SQL query that does not rely on ethnicity always being the first entry in the list of extensions. A subquery, or the unnest function combined with a join, will probably do it. However, this involves some fairly advanced querying, so it is worth looking for simpler options first.
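
For reference, the unnest approach might look like the sketch below. It has not been run against the Data Connect service and assumes the adapter exposes the standard Presto `json_extract`, `cast(... as array(json))` and `unnest` functions. The `ethnicity_query` helper is hypothetical, and the code value is interpolated directly, which is only reasonable for trusted constants such as those used in this notebook.

```python
ETHNICITY_URL = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity"

def ethnicity_query(code, limit=20):
    """Build a query that matches the ethnicity extension at any array index."""
    return f"""select p.id, p.patient
from kidsfirst.ga4gh_tables.patient p
cross join unnest(cast(json_extract(p.patient, '$.extension') as array(json))) as t(ext)
where json_extract_scalar(t.ext, '$.url') = '{ETHNICITY_URL}'
and json_extract_scalar(t.ext, '$.extension[0].valueCoding.code') = '{code}'
limit {limit}"""

query = ethnicity_query("2135-2")
print(query)
```

Each top-level extension becomes its own row, so the `url` test no longer depends on position. The sketch still assumes `ombCategory` is the first nested extension; removing that assumption would need a second unnest.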

Does native FHIR querying make it any easier?

We should note one thing here: it is not the SQL that makes this hard, it is the underlying way the data is structured, notably the fact that basic attributes have ended up as extensions. Remember the relative simplicity of the gender query in the first example.

Direct FHIR queries will still have to deal with that underlying fact, but perhaps the FHIR Python APIs have been written with that complexity in mind.

### Perform the same query directly via FHIR
Is it any easier to specify the query and to unpack the results?

Using the NIH Cloud Platform Interoperability (NCPI) FHIR server directly.

Note: the file with the cookie for the NCPI FHIR server should contain the following:

{"Cookie":"AWSELBAuthSessionCookie-0=your_cookie_here"}

Instructions on how to get the cookie are available at
https://github.com/NIH-NCPI/ncpi-api-fhir-service
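
As a convenience, a cookie file in that shape could be written with a few lines of Python. This is a sketch only: the `write_cookie_file` helper is hypothetical, the cookie value is a placeholder, and the demonstration writes to a temporary directory to avoid overwriting a real credential.

```python
import json
import os
import tempfile

def write_cookie_file(path, cookie_value):
    """Write the cookie header in the JSON shape described above."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"Cookie": f"AWSELBAuthSessionCookie-0={cookie_value}"}, f)
    # Restrict access to the owner, since the file holds a credential.
    os.chmod(path, 0o600)

# Demonstrated in a temporary directory with a placeholder value.
demo_path = os.path.join(tempfile.mkdtemp(), ".keys", "ncpi_fhir_cookie.json")
write_cookie_file(demo_path, "your_cookie_here")
with open(demo_path) as f:
    print(json.load(f))
```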

First a basic query to check we can query the FHIR server via the fhir-py library.

But before that some preliminaries!

This is a way (under development) to get the cookie required for authentication. Until it's completed we save the cookie to ~/.keys/ncpi_fhir_cookie.json

In [9]:
import requests
fhir_server = 'https://ncpi-api-fhir-service-dev.kidsfirstdrc.org'
#new_server = 'https://kf-api-fhir-service.kidsfirstdrc.org'
x = requests.get(fhir_server)
print(x.cookies)

<RequestsCookieJar[<Cookie _csrf=XfN0sk_Yrg_MTFny2mmOUQ9S for d3b-center.auth0.com/usernamepassword/login>]>


This makes use of the FHIR API directly via the requests module. The code here is reused from 

In [1]:
import sys
import os
import json
import requests

#FHIR_SERVER = 'https://kf-api-fhir-service.kidsfirstdrc.org'
FHIR_SERVER = 'https://ncpi-api-fhir-service-dev.kidsfirstdrc.org'

# Optional: Turn off SSL verification. Useful when dealing with a corporate proxy with self-signed certificates.
# This should be set to True unless you actually see certificate errors.
VERIFY_SSL = False

if not VERIFY_SSL:
    requests.packages.urllib3.disable_warnings()



# Kids First uses cookie-based authentication
# Get my locally saved cookie
full_cookie_path = os.path.expanduser('~/.keys/ncpi_fhir_cookie.json')
with open(full_cookie_path) as f:
    cookies = json.load(f)

# We make a requests.Session to ensure consistent headers/cookie across all the requests we make
s = requests.Session()
s.headers.update({'Accept': 'application/fhir+json'})
s.headers.update(cookies)
s.verify = VERIFY_SSL


# Test out the cookie by querying the server metadata
r = s.get(f"{FHIR_SERVER}/metadata")

if "<!DOCTYPE html>" in r.text:
    sys.stderr.write('ERROR: Could not authenticate with Kids First. The cookie may need to be updated')

Now we have established the basics to access FHIR, back to the gender query.

Note that the following does not retrieve multiple pages. A previous version of this notebook used a Python FHIR client, which isolated the user from low-level request concerns such as pagination. Clients of that kind can return a stream of results, which is useful for domain-level users.

For the illustrative purposes of this notebook we will ignore the need for pagination for now. The main objective here is to illustrate how useful phenotypic queries can be formulated in FHIR.
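
For completeness, pagination can be handled by following the `next` link that a FHIR search Bundle carries in its `link` array. The sketch below is not what the earlier client did internally; the `iter_resources` name and the offline `pages` stand-in are hypothetical, and the fetch function is passed in so the logic can be shown without contacting the server.

```python
def iter_resources(fetch_json, url):
    """Yield every resource in a FHIR search result, following 'next' links."""
    while url:
        bundle = fetch_json(url)
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        # A Bundle advertises its next page as a link with relation 'next'.
        url = next(
            (link["url"] for link in bundle.get("link", [])
             if link.get("relation") == "next"),
            None,
        )

# Offline stand-in for the server: two hypothetical pages keyed by URL.
pages = {
    "page1": {"entry": [{"resource": {"id": "a"}}],
              "link": [{"relation": "next", "url": "page2"}]},
    "page2": {"entry": [{"resource": {"id": "b"}}],
              "link": [{"relation": "self", "url": "page2"}]},
}
ids = [r["id"] for r in iter_resources(pages.__getitem__, "page1")]
print(ids)
```

With the session `s` defined earlier, the fetch function could be something like `lambda u: s.get(u).json()`, starting from the `Patient?gender=female` URL.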

In [14]:
# Search for patients by gender
r = s.get(f"{FHIR_SERVER}/Patient?gender=female")
patient_bundle = r.json()

# In the bundle obtained total is not present
#print(f"Number of matches: {patient_bundle['total']}")
print(f"Number of Patients included in Bundle: {len(patient_bundle['entry'])}")

# Create list of just the Patient Resources in the Bundle
patients = [entry['resource'] for entry in patient_bundle['entry']]


Number of Patients included in Bundle: 50


Look at some of the details of patients

In [15]:
for p in patients[2:10]:
    print(json.dumps(p))
    print('_____________________')

{"resourceType": "Patient", "id": "539318", "meta": {"versionId": "1", "lastUpdated": "2021-04-28T23:43:03.678+00:00", "source": "#Zpvw5NWtHnxvwxWr", "profile": ["http://hl7.org/fhir/StructureDefinition/Patient"]}, "identifier": [{"system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/", "value": "PT_RP789F44"}, {"system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/", "value": "?study_id=SD_7NQ9151J&external_id=BH3504_2"}, {"system": "urn:kids-first:unique-string", "value": "Patient|SD_7NQ9151J|BH3504_2"}], "gender": "female"}
_____________________
{"resourceType": "Patient", "id": "539310", "meta": {"versionId": "1", "lastUpdated": "2021-04-28T23:43:03.546+00:00", "source": "#R8H3r13WvVHAuDkR", "profile": ["http://hl7.org/fhir/StructureDefinition/Patient"]}, "identifier": [{"system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/", "value": "PT_XNXCHGGH"}, {"system": "https://kf-api-dataservice.kidsfirstdrc.org/participants/", "value": "?study_i

#TODO Let's get to querying on ethnicity.