<a href="https://colab.research.google.com/github/bptripp/ai-course/blob/main/ehr_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding an Electronic Health Record (EHR)
A deep network may need EHR information from multiple modalities to make the best inferences. For example, in the recent literature, ICD discharge codes have been inferred more accurately by systems that use the full record of a hospital visit than by systems that use only a discharge letter as input.

You have already seen how to encode notes and images as vectors. As an additional example, consider how to encode a 12-lead ECG order. Most elements of the order are selected from drop-down lists in the EHR interface when the order is placed. Each item in each drop-down list can be added to the system's vocabulary. In a state-of-the-art system, this vocabulary would be one-hot encoded and fed into a transformer. The transformer would learn embedding vectors, attention matrices, etc. for a self-supervised task, and would then be fine-tuned to answer questions, follow instructions, or perform some other particular task. However, this process is too complex and time-consuming to work through here.
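
Only the first step of that pipeline, one-hot encoding a drop-down vocabulary, is simple enough to sketch here. The drop-down fields and items below are hypothetical and chosen for illustration; they are not taken from any real EHR.

```python
import numpy as np

# Hypothetical drop-down options for an ECG order (illustrative only).
dropdowns = {
    "priority": ["routine", "urgent", "stat"],
    "reason": ["chest pain", "palpitations", "pre-operative"],
}

# Build one vocabulary from every item in every drop-down list.
vocabulary = [f"{field}={item}"
              for field, items in dropdowns.items()
              for item in items]
token_index = {token: i for i, token in enumerate(vocabulary)}

def one_hot(token):
    # A vector of zeros with a single 1 at the token's vocabulary index.
    vec = np.zeros(len(vocabulary))
    vec[token_index[token]] = 1.0
    return vec

# An order becomes one row per selected drop-down item.
order = ["priority=urgent", "reason=chest pain"]
encoded = np.stack([one_hot(token) for token in order])
print(encoded.shape)  # (2, 6)
```

Each row contains exactly one 1. A transformer would multiply rows like these by a learned embedding matrix, which is the part that is not shown.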

Instead, as a simplified example, the code below reads a patient's orders from an electronic health record and encodes the elements of each order using pre-trained word embeddings.

Start by downloading a set of pre-trained embeddings. The embeddings below are similar to word2vec embeddings, and should only take a few seconds to download.

*Run the code below to download pretrained word embeddings. The code will also print an example vector for the word "cortisol".*

In [1]:
import gensim.downloader
vectors = gensim.downloader.load('glove-twitter-25')
print(vectors.get_vector('cortisol'))


[-0.042121 -0.30836  -1.2449    2.0054    0.82295   2.0069   -0.20042
 -0.73573   1.2235    0.013556  0.47971   0.78728  -0.43291   0.77171
  0.46065   1.0751   -0.73819   1.2626    1.9331   -0.31433  -0.46394
  0.90599   0.67538  -0.44131  -0.86128 ]


The next step is to query a patient's orders from an EHR system. To make this more realistic, the code below uses the HL7 FHIR protocol to query a fictional EHR. This will require importing a package called "fhirclient". However, fhirclient is not yet installed on the server where you are running this code.

*Run the code below to download and install fhirclient on this server.*

In [2]:
!pip install git+https://github.com/smart-on-fhir/client-py.git

Collecting git+https://github.com/smart-on-fhir/client-py.git
  Cloning https://github.com/smart-on-fhir/client-py.git to /tmp/pip-req-build-a7tt4a4j
  Running command git clone --filter=blob:none --quiet https://github.com/smart-on-fhir/client-py.git /tmp/pip-req-build-a7tt4a4j
  Resolved https://github.com/smart-on-fhir/client-py.git to commit df634f5354aec83335ca45648552f84d1964c033
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting isodate (from fhirclient==4.1.0)
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.7/41.7 kB 1.4 MB/s eta 0:00:00
Building wheels for collected packages: fhirclient
  Building wheel for fhirclient (setup.py) ... done
  Created wheel for fhirclient: filename=fhirclient-4.1.0-py2.py3-none-any.whl size=683652 sha256=0aae2fef54dcf0493d8760da8f9def6424af2ddea8d344526aee3ee13e01ef4

It is now possible to import code from the fhirclient package and connect to an EHR that supports the HL7 FHIR protocol. Rather than accessing real patient data, the code below connects to a fictional EHR provided by the HL7 organization.

*Run the code below to connect to a fictional EHR.*

In [6]:
from fhirclient import client
import fhirclient.models.servicerequest as sr

settings = {
    'app_id': 'EHR',
    'api_base': 'http://hapi.fhir.org/baseR4/'
}
ehr = client.FHIRClient(settings=settings)


The following code defines a function that looks up the embedding for a given word. If the word is not in the vocabulary, the function indicates this by returning the value *None* (recall that this means no value is defined).

*Run the code below to create this function.*

In [7]:
import numpy as np

vector_length = 25 # our embedding vectors are 25 numbers long

def get_word_vector(word):
  # Return the embedding for a word, or None if it is not in the vocabulary.
  if word in vectors:
    return vectors.get_vector(word)
  return None

For a deep network to use information about an order, different elements of the order must be encoded as vectors. Since the order structure in the EHR may be complex, one way forward is to write a separate function for each required element that extracts the information and produces a corresponding vector. For simplicity, the code below defines functions that produce vectors for the text-based order description and the order status.

*Run the code below to create functions that produce vectors for an order's status and description.*

In [14]:
import re  # used by get_description_vector to strip punctuation

def get_status_vector(order):
  # The status will be a single word (e.g. "active") that appears in the
  #  vocabulary, so we can just return the corresponding vector.
  return get_word_vector(order.resource.status)

def get_description_vector(order):
  # The description will contain multiple words. We produce a summary
  # vector by adding their vectors together.
  description = order.resource.code.coding[0].display
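  # Remove everything except word characters and spaces (raw string avoids
  # an invalid-escape warning for \w).
  description = re.sub(r'[^\w ]', '', description)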
  words = description.lower().split()
  description_vector = np.zeros(vector_length)
  for word in words:
    vector = get_word_vector(word)
    if vector is not None:
      description_vector = description_vector + vector
  return description_vector
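
To see what the summing step does without downloading embeddings or querying the EHR, here is a small self-contained sketch. The three-dimensional vectors, the `summarize` helper, and the example description are made up for illustration; only the cleaning and summing logic mirrors the function above.

```python
import re
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration (not GloVe values).
toy_vectors = {
    "chest": np.array([1.0, 0.0, 2.0]),
    "xray": np.array([0.5, 1.0, -1.0]),
}

def summarize(text, embeddings, length=3):
    # Strip punctuation, lower-case, and split into words.
    text = re.sub(r"[^\w ]", "", text)
    total = np.zeros(length)
    for word in text.lower().split():
        vector = embeddings.get(word)  # None if the word is unknown
        if vector is not None:
            total = total + vector
    return total

summary = summarize("Chest X-ray (portable)", toy_vectors)
print(summary)
```

The result is the sum of the "chest" and "xray" vectors, [1.5, 1.0, 1.0]. Removing the hyphen turns "X-ray" into the single word "xray", and "portable" is skipped because it is not in the toy vocabulary. Unknown words therefore contribute nothing to the summary vector.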



You are nearly done. Now you can use a patient's medical record number to retrieve their "service requests" (orders), and create vectors for each one. This code may take a few seconds to run.

In [16]:
import re

patient_mrn = '2782378'
bundle = sr.ServiceRequest.where(struct={'subject': patient_mrn}).perform(ehr.server)

order_vectors = [] # start with empty list of vectors
for order in bundle.entry:
  order_vectors.append(get_status_vector(order))
  order_vectors.append(get_description_vector(order))

print(np.array(order_vectors))


[[-2.53149986e-01  8.59239995e-02 -8.99049997e-01 -9.47350025e-01
   9.77339983e-01  2.30829999e-01  5.89619994e-01 -1.71680003e-01
   3.44660014e-01  4.36550006e-02 -5.71120024e-01 -2.41300002e-01
  -2.55819988e+00  4.75789994e-01 -2.71550007e-02  4.83990014e-01
   4.29300010e-01 -3.38609993e-01  1.61579996e-01 -4.39350009e-01
  -4.17409986e-01 -4.25179988e-01 -3.19889992e-01 -2.78120011e-01
  -1.16789997e+00]
 [-3.19875009e+00  4.42328995e+00 -2.91655003e+00 -1.44630004e+00
   4.52315497e+00 -2.99352004e+00  4.01353185e+00 -7.52678719e+00
   4.60556604e+00  1.93320994e+00  1.49409295e+00 -1.41415702e+00
  -2.59652003e+01  4.76883046e-01  2.18562900e+00 -1.31574000e+00
   4.55082007e+00 -2.91629310e+00 -1.19922797e+00 -1.43058301e+00
  -3.91386001e+00 -5.87148013e+00 -4.52575998e+00 -1.57049969e-01
  -1.72378498e+00]
 [-2.53149986e-01  8.59239995e-02 -8.99049997e-01 -9.47350025e-01
   9.77339983e-01  2.30829999e-01  5.89619994e-01 -1.71680003e-01
   3.44660014e-01  4.36550006e-02 -5.7
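
The output above alternates between status rows and description rows. A possible variation, sketched below, is to concatenate the two vectors so that each order becomes a single 50-number row. The `combine` helper is hypothetical, and the stand-in vectors replace the real results of `get_status_vector` and `get_description_vector`. The sketch also substitutes zeros when a status vector is *None*, since a *None* entry would keep `np.array` from producing a regular numeric matrix.

```python
import numpy as np

vector_length = 25  # matches the embedding size used above

def combine(status_vector, description_vector, length=vector_length):
    # Substitute zeros when the status word is not in the vocabulary.
    if status_vector is None:
        status_vector = np.zeros(length)
    return np.concatenate([status_vector, description_vector])

# Stand-in vectors; in the notebook these would come from
# get_status_vector(order) and get_description_vector(order).
rows = [
    combine(np.ones(vector_length), np.full(vector_length, 2.0)),
    combine(None, np.full(vector_length, 3.0)),
]
order_matrix = np.stack(rows)
print(order_matrix.shape)  # (2, 50)
```

Whether separate rows or one concatenated row per order works better depends on the network that consumes the vectors; this is one option under the stated assumptions.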

This list of numbers looks meaningless to us, but it contains a concise summary of the patient's orders in exactly the form that a deep network needs. This is not so strange. Consider that our brains receive information only in the form of patterns of action potentials from sensory organs. What we have created here is something like a sensory organ for a deep network, one that senses electronic health records directly.