# RAG on FHIR with Knowledge Graphs

### 2. Neo4J & Jupyter Environment
This notebook needs a running [Neo4j](https://www.neo4j.com) instance. I ran Neo4j locally with Docker using the following command:
```
docker run --name testneo4j -p7474:7474 -p7687:7687 -d \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/plugins:/plugins \
    --env NEO4J_AUTH=neo4j/password \
    neo4j:latest
```
**Note:** No particular plugins are needed. 

You can also use a Neo4j Aura instance.
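
Before running the rest of the notebook, it can help to confirm that the Bolt port is reachable. The following is a minimal sketch using only the standard library. The helper name `bolt_port_open` is hypothetical, and the default host and port assume the Docker command above. It only checks that something accepts TCP connections on that port, not that Neo4j is healthy or that the credentials work.

```python
import socket


def bolt_port_open(host: str = "localhost", port: int = 7687, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


print("Bolt port reachable:", bolt_port_open())
```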

#### Jupyter Environment
Regardless of how you run Neo4j, you need to set some environment variables in the notebook's environment:

| Variable | Description | Value for above Docker |
|----------|-------------|------------------------|
| NEO4J_URL | Where to find the instance of Neo4j. | bolt://localhost:7687 |
| NEO4J_USER | The username for the database. | neo4j |
| NEO4J_PASSWORD | The password for the database. | password |


### 3. Synthetic data and working directory
The data I used for this notebook came from [Synthea](https://synthea.mitre.org/). Using the 

All the questions here use the FHIR Bundle: `fhir_data/stanford_llm_on_fhir`

In [1]:
# Imports needed

import glob
import json
import os
import re

# Imports from other local python files
from helpers.neo4j_graph import Graph
from helpers.FHIR_to_graph import resource_to_node, resource_to_edges

## Establish Database Connection

This cell connects to the Neo4j instance. It relies on several environment variables.

**PLEASE NOTE**: The variables have been renamed to support multiple databases in the same instance.

| Variable            | Description                          | Sample Value          |
|---------------------|--------------------------------------|-----------------------|
| FHIR_GRAPH_URL      | Where to find the instance of Neo4j. | bolt://localhost:7687 |
| FHIR_GRAPH_USER     | The username for the database.       | neo4j                 |
| FHIR_GRAPH_PASSWORD | The password for the database.       | password              |
| FHIR_GRAPH_DATABASE | The name of the database instance.   | neo4j                 |
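
If these variables are not set outside Jupyter, one option is to set them from a cell that runs before the connection cell. This is a minimal sketch that uses the sample values from the table above; replace them with your own settings and avoid committing real passwords to a notebook.

```python
import os

# Sample values from the table above; replace them with your own settings.
defaults = {
    "FHIR_GRAPH_URL": "bolt://localhost:7687",
    "FHIR_GRAPH_USER": "neo4j",
    "FHIR_GRAPH_PASSWORD": "password",
    "FHIR_GRAPH_DATABASE": "neo4j",
}

for name, value in defaults.items():
    # setdefault leaves a variable untouched if it is already set
    os.environ.setdefault(name, value)

# Print everything except the password to confirm what will be used
print({name: os.environ[name] for name in defaults if name != "FHIR_GRAPH_PASSWORD"})
```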

In [2]:
NEO4J_URI = os.getenv('FHIR_GRAPH_URL', 'neo4j://localhost:7687')
USERNAME = os.getenv('FHIR_GRAPH_USER', 'neo4j')
PASSWORD = os.getenv('FHIR_GRAPH_PASSWORD', 'password')
DATABASE = os.getenv('FHIR_GRAPH_DATABASE', 'neo4j')

graph = Graph(NEO4J_URI, USERNAME, PASSWORD, DATABASE)

## Helper Database Cells

The following three cells are here to manage the database. They do not need to be run on a blank database.

With the Docker setup above, the Neo4j Browser is available at http://localhost:7474/browser/

In [6]:
print(graph.resource_metrics())

[['Patient', 1], ['CarePlan', 2], ['CareTeam', 2], ['ImagingStudy', 4], ['MedicationRequest', 6], ['AllergyIntolerance', 9], ['Condition', 15], ['Procedure', 18], ['DocumentReference', 31], ['Encounter', 31], ['DiagnosticReport', 33], ['Immunization', 35], ['Claim', 37], ['ExplanationOfBenefit', 37], ['Observation', 233]]


In [7]:
print(graph.database_metrics())

(624, 2468)


In [8]:
graph.wipe_database()

'Deleted 624 nodes and 2468 relationships in 0.279 seconds'

## Load FHIR into the Graph

This cell opens the bundle and creates the nodes and edges in the graph for each resource. 

Every resource becomes a node with two labels: one based on its resource type and the generic label `resource`. The values within the resource are flattened
into properties on the node. A property called `text` also holds a string representation of the resource.
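
To make "flattened" concrete, here is a minimal sketch of one way to turn a nested resource into flat key/value properties. It is not the code in `helpers/FHIR_to_graph.py`; the function name `flatten` and the underscore-joined key scheme are assumptions for illustration, and the sample `id` is invented.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single dict with underscore-joined keys."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, f"{prefix}_{key}" if prefix else key))
    elif isinstance(value, list):
        # List positions become part of the key, e.g. name_0_family
        for index, item in enumerate(value):
            flat.update(flatten(item, f"{prefix}_{index}" if prefix else str(index)))
    else:
        flat[prefix] = value
    return flat


# Invented sample resource; the id is a placeholder.
resource = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Bogan287", "given": ["Beatris270"]}],
}
properties = flatten(resource)
print(properties)
```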

Additionally, nodes will be created for every unique date (ignoring time) found in the FHIR resources. 

Edges will be created for every reference in a resource whose target can be found within the loaded bundles. The referenced resource doesn't have
to be in the same bundle, but it must be in a bundle that is loaded.
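
The idea can be sketched as follows: walk the resource, collect every `reference` value, and keep only those whose target was loaded. This is an illustration, not the logic of `resource_to_edges`. The names `find_references`, `strip_urn`, and `loaded_ids` are hypothetical, the sample data is invented, and the `urn:uuid:` prefix handling is an assumption about how references appear in the bundles.

```python
def find_references(value):
    """Yield every string stored under a `reference` key, at any depth."""
    if isinstance(value, dict):
        for key, item in value.items():
            if key == "reference" and isinstance(item, str):
                yield item
            else:
                yield from find_references(item)
    elif isinstance(value, list):
        for item in value:
            yield from find_references(item)


def strip_urn(reference, prefix="urn:uuid:"):
    """Drop the urn:uuid: prefix if present, otherwise return the reference unchanged."""
    return reference[len(prefix):] if reference.startswith(prefix) else reference


# Invented sample data; the ids are placeholders, not values from the bundle.
encounter = {
    "resourceType": "Encounter",
    "id": "enc-1",
    "subject": {"reference": "urn:uuid:pat-1"},
    "participant": [{"individual": {"reference": "Practitioner?identifier=abc"}}],
}
loaded_ids = {"pat-1", "enc-1"}

# Keep only references whose target was actually loaded.
targets = [strip_urn(ref) for ref in find_references(encounter)]
sample_edges = [(encounter["id"], target) for target in targets if target in loaded_ids]
print(sample_edges)
```

The practitioner reference is dropped here because its target is not among the loaded ids, which mirrors the rule described above.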

Edges will also connect resources to the dates found inside them. 

**Warning:** This cell may take some time to run.

In [9]:
#synthea_bundles = glob.glob("/home/baptvit/Documents/github/mestrado/fhir-rag/fhir_rag/fhir_data/sythea_fhir/Alfonso758_Bins636_e80d4c62-149a-a6a6-4b39-9d4aa3e07ba7.json")
synthea_bundles = glob.glob("/home/baptvit/Documents/github/mestrado/fhir-rag/fhir_rag/fhir_data/stanford_llm_on_fhir/*.json")
synthea_bundles = synthea_bundles[0:1]
synthea_bundles.sort()
synthea_bundles

['/home/baptvit/Documents/github/mestrado/fhir-rag/fhir_rag/fhir_data/stanford_llm_on_fhir/Beatris270_Bogan287_5b3645de-a2d0-d016-0839-bab3757c4c58.json']

In [10]:
nodes = []
edges = []
dates = set() # set is used here to make sure dates are unique
for bundle_file_name in synthea_bundles:
    with open(bundle_file_name) as raw:
        bundle = json.load(raw)
        for entry in bundle['entry']:
            resource_type = entry['resource']['resourceType']
            if resource_type != 'Provenance':
                # generate the Cypher for creating the resource node
                nodes.append(resource_to_node(entry['resource'], bundle_file_name.split("/")[-1].replace(".json", "")))
                # generate the Cypher for creating the reference & date edges, and capture the dates
                node_edges, node_dates = resource_to_edges(entry['resource'])
                edges += node_edges
                dates.update(node_dates)
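
The second argument passed to `resource_to_node` above is the bundle's file name without its directory or `.json` extension. As a sketch, the same value can be derived with `pathlib`, which does not depend on `/` being the path separator. The path below is shortened for illustration.

```python
from pathlib import Path

# Shortened, illustrative path to the bundle used in this notebook
bundle_file_name = "fhir_data/stanford_llm_on_fhir/Beatris270_Bogan287_5b3645de-a2d0-d016-0839-bab3757c4c58.json"

# .stem drops the directory and the final suffix
bundle_name = Path(bundle_file_name).stem
print(bundle_name)
```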

Create the nodes for each resource

In [11]:
# create the nodes for resources
for node in nodes:
    graph.query(node)

Create the nodes for the dates

In [12]:
date_pattern = re.compile(r'([0-9]+)/([0-9]+)/([0-9]+)')

# create the nodes for dates
for date in dates:
    date_parts = date_pattern.findall(date)[0]
    cypher_date = f'{date_parts[2]}-{date_parts[0]}-{date_parts[1]}'
    cypher = 'CREATE (:Date {name:"' + date + '", id: "' + date + '", date: date("' + cypher_date + '"), text:"' + date + '"})'
    graph.query(cypher)
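
The cell above rearranges a month/day/year string into the year-month-day form passed to Cypher's `date()`. A sketch of the same conversion with `datetime.strptime` follows; it also zero-pads single-digit parts and rejects impossible dates. The function name `to_cypher_date` is hypothetical, and a four-digit year is assumed.

```python
from datetime import datetime


def to_cypher_date(us_date: str) -> str:
    """Convert a month/day/year string to ISO year-month-day."""
    # strptime raises ValueError for impossible dates such as 13/45/2020
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()


print(to_cypher_date("3/4/2015"))
print(to_cypher_date("11/23/1998"))
```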

Create the edges/relationships between resources

In [13]:
# create the edges
for edge in edges:
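    try:
        graph.query(edge)
    except Exception as error:
        # catch Exception rather than everything, so Ctrl-C still interrupts the loop
        print(f'Failed to create edge: {edge} ({error})')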
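
If many edges fail, it may be more useful to collect the failures than to print them one at a time. The following is a minimal sketch. `run_all` and `fake_query` are hypothetical names, and `fake_query` is only a stand-in so the example runs without a database.

```python
def run_all(statements, run_query):
    """Run each statement and return the ones that raised, paired with the error."""
    failures = []
    for statement in statements:
        try:
            run_query(statement)
        except Exception as error:
            failures.append((statement, error))
    return failures


# Stand-in for graph.query so the sketch runs without a database.
def fake_query(statement):
    if "BROKEN" in statement:
        raise ValueError("simulated failure")


failed = run_all(["MATCH (n) RETURN n", "BROKEN STATEMENT"], fake_query)
for statement, error in failed:
    print(f"Failed: {statement} ({error})")
```

With the notebook's own objects this would presumably be called as `run_all(edges, graph.query)`, assuming `graph.query` raises an exception when a statement fails.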


## Create the Vector Embedding Index in the Graph

This cell creates a Vector Index in Neo4J. It looks at nodes labeled as `resource` and indexes the string representation in the `text` property. 
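
As a rough illustration of what a vector index is used for, the sketch below ranks a few texts by cosine similarity to a query vector. It is a brute-force toy, not how Neo4j implements its index. The three-dimensional vectors are invented, real embeddings come from the embedding model, and cosine similarity is assumed to be the metric in use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real ones come from the embedding model.
toy_embeddings = {
    "Condition: hypertension": [0.9, 0.1, 0.0],
    "Immunization: influenza": [0.1, 0.9, 0.1],
    "Observation: blood pressure": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

# Most similar text first
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```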

**Warning:** This cell may take some time to run.

In [2]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')

In [3]:
device

'cuda'

In [14]:
from langchain_community.vectorstores import Neo4jVector
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

Neo4jVector.from_existing_graph(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5", model_kwargs={"device": device}),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    database=DATABASE,
    index_name='fhir_text',
    node_label="resource",
    text_node_properties=['text'],
    embedding_node_property='embedding',
)


<langchain_community.vectorstores.neo4j_vector.Neo4jVector at 0x76fe70e17450>

### Create Vector Index 

This cell creates a `Neo4jVector` object that reuses the index created above instead of building it again.

It is a separate cell because the cell above can take time and should only be run once, when the database is created.

In [15]:
vector_index = Neo4jVector.from_existing_index(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    database=DATABASE,
    index_name='fhir_text'
)