# RAG on FHIR with Knowledge Graphs

### 2. Neo4j & Jupyter Environment
This notebook needs an instance of [Neo4j](https://www.neo4j.com) to talk to. I used Docker to run Neo4j locally with the following command:
```
docker run --name testneo4j -p7474:7474 -p7687:7687 -d \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/plugins:/plugins \
    --env NEO4J_AUTH=neo4j/password \
    neo4j:latest
```
**Note:** No particular plugins are needed. 

You can also use a Neo4j Aura instance.

#### Jupyter Environment
Regardless of how you run Neo4j, you need to set some environment variables in the notebook's environment:

| Variable | Description | Value for above Docker |
|----------|-------------|------------------------|
| NEO4J_URL | Where to find the instance of Neo4j. | bolt://localhost:7687 |
| NEO4J_USER | The username for the database. | neo4j |
| NEO4J_PASSWORD | The password for the database. | password |


### 3. Synthetic data and working directory
The data I used for this notebook came from [Synthea](https://synthea.mitre.org/). Using the 

All the questions here use the FHIR Bundle: `fhir_data/stanfor_llm_on_fhir`

In [1]:
# Imports needed

import glob
import json
import os
import re

## Establish Database Connection

This cell sets up the connection details for the Neo4j instance. It relies on several environment variables.

**PLEASE NOTE**: The variables have been renamed to support multiple databases in the same instance.

| Variable            | Description                          | Sample Value          |
|---------------------|--------------------------------------|-----------------------|
| FHIR_GRAPH_URL      | Where to find the instance of Neo4j. | bolt://localhost:7687 |
| FHIR_GRAPH_USER     | The username for the database.       | neo4j                 |
| FHIR_GRAPH_PASSWORD | The password for the database.       | password              |
| FHIR_GRAPH_DATABASE | The name of the database instance.   | neo4j                 |

In [2]:
NEO4J_URI = os.getenv('FHIR_GRAPH_URL', 'neo4j://localhost:7687')
USERNAME = os.getenv('FHIR_GRAPH_USER', 'neo4j')
PASSWORD = os.getenv('FHIR_GRAPH_PASSWORD', 'password')
DATABASE = os.getenv('FHIR_GRAPH_DATABASE', 'neo4j')
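
Because the variable names changed, an environment may still define only the older `NEO4J_*` names. The following is a minimal sketch of a loader that prefers the `FHIR_GRAPH_*` names and falls back to the older ones. The helper `load_graph_config` and the fallback behaviour are assumptions made for illustration; they are not part of the original notebook.

```python
import os
from typing import Mapping, Optional

# (new name, legacy name, default). The fallback to legacy names is an
# assumption for illustration; the notebook itself reads only FHIR_GRAPH_*.
_SETTINGS = {
    "url": ("FHIR_GRAPH_URL", "NEO4J_URL", "neo4j://localhost:7687"),
    "username": ("FHIR_GRAPH_USER", "NEO4J_USER", "neo4j"),
    "password": ("FHIR_GRAPH_PASSWORD", "NEO4J_PASSWORD", "password"),
    "database": ("FHIR_GRAPH_DATABASE", None, "neo4j"),
}


def load_graph_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return connection settings, preferring the new variable names."""
    env = os.environ if env is None else env
    config = {}
    for key, (new_name, legacy_name, default) in _SETTINGS.items():
        value = env.get(new_name)
        if value is None and legacy_name is not None:
            value = env.get(legacy_name)
        config[key] = default if value is None else value
    return config
```

The returned keys (`url`, `username`, `password`, `database`) match the keyword arguments this notebook passes to `Neo4jVector.from_existing_index`, so the dictionary could be unpacked into that call.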

## Helper Database Cells

The following three cells are here to manage the database. They do not need to be run on a blank database.

With the Docker setup above, the Neo4j Browser is available at http://localhost:7474/browser/

### Create Vector Index 

This cell creates a vector store object on top of the index created above, without rebuilding it.

It is a separate cell because building the index in the cell above can take time and only needs to be done once, when the DB is created.

In [3]:
from langchain_community.vectorstores import Neo4jVector
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

vector_index = Neo4jVector.from_existing_index(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url=NEO4J_URI,
    username=USERNAME,
    password=PASSWORD,
    database=DATABASE,
    index_name='fhir_text'
)


In [32]:
from questions import stanford_llm_on_fhir_questions

question = stanford_llm_on_fhir_questions["Q2"]
question

'What are the most common side effects for each medication I am taking?'

In [24]:
question = "Do I have allergi of Aspirin ?"

In [25]:
response = vector_index.similarity_search(question, k=50) # k_nearest is not used here because we don't have a retrieval query yet.

### Similarity search works better when we need feature extraction

In [38]:
response = vector_index.similarity_search_with_score(query=question, k=50, score_threshold=0.80) 

### Hybrid search with key name

In [36]:
response = vector_index.similarity_search_with_score(query=question, k=100, score_threshold=0.80, filter={"name": "AllergyIntolerance"}) 
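
Whether `score_threshold` and `filter` are honoured by the vector store may depend on the installed LangChain version, which this notebook does not pin. As a cross-check, the same two conditions can be applied client-side to the returned `(document, score)` pairs. This is a minimal sketch: `Doc` is a hypothetical stand-in for the returned document objects (only `page_content` is used), and `filter_hits` is an illustrative helper, not a LangChain function.

```python
import json
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a returned document; only page_content is used."""
    page_content: str


def resource_type_of(text):
    """Return the FHIR resourceType of a JSON string, or None if unavailable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data.get("resourceType") if isinstance(data, dict) else None


def filter_hits(hits, min_score, resource_type=None):
    """Keep (doc, score) pairs scoring at least min_score.

    If resource_type is given, also require a matching FHIR resourceType.
    """
    kept = []
    for doc, score in hits:
        if score < min_score:
            continue
        if resource_type is not None and resource_type_of(doc.page_content) != resource_type:
            continue
        kept.append((doc, score))
    return kept
```

With the notebook's variables, this could be called as `filter_hits(response, 0.80, "AllergyIntolerance")`, since only `page_content` is read from each document.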

In [37]:
import json

output_len = len(response)
print(f"len return {output_len}")

list_output = []  # resourceType of every hit that parses as a FHIR resource
output_str = ''   # concatenated page contents, used for token counting below
for doc, _score in response:
    output_str += doc.page_content
    try:
        list_output.append(json.loads(doc.page_content)["resourceType"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Show any hit that is not a JSON FHIR resource.
        print(doc.page_content)

# Tally the number of hits per resource type.
dictionary = {}
for item in list_output:
    dictionary[item] = dictionary.get(item, 0) + 1

print(dictionary)

len return 9
{'AllergyIntolerance': 9}
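
The per-type tally can also be written with `collections.Counter`. The sketch below takes plain JSON strings so it can be run without a database; `count_resource_types` is an illustrative name, and unlike the cell above it silently skips entries that are not FHIR JSON instead of printing them.

```python
import json
from collections import Counter


def count_resource_types(pages):
    """Count FHIR resourceType values across an iterable of JSON strings."""
    counts = Counter()
    for page in pages:
        try:
            data = json.loads(page)
        except json.JSONDecodeError:
            continue  # skip text that is not JSON
        if isinstance(data, dict) and "resourceType" in data:
            counts[data["resourceType"]] += 1
    return counts
```

With the notebook's variables, the input would be `(doc.page_content for doc, _ in response)`.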


## Count the number of tokens in the output

Here we want to see how many tokens the output contains and analyze whether it will fit in the provided context window.


In [28]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in `string` under the named tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [29]:
print(num_tokens_from_string(output_str, "cl100k_base"))

18893
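
If a count like this exceeds the model's context window, one option is to keep only as many retrieved documents as fit a token budget. The following is a minimal sketch of greedy packing in retrieval order. The names `pack_within_budget` and `whitespace_tokens` are hypothetical, and the whitespace counter is only a crude stand-in so the example runs without `tiktoken`; the budget value is up to the reader, since the notebook does not state a context window size.

```python
from typing import Callable, List


def pack_within_budget(
    pages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep pages in order until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for page in pages:
        cost = count_tokens(page)
        if used + cost > max_tokens:
            break
        kept.append(page)
        used += cost
    return kept


def whitespace_tokens(text: str) -> int:
    """Crude stand-in token counter: number of whitespace-separated words."""
    return len(text.split())
```

In the notebook, `count_tokens` could be `lambda s: num_tokens_from_string(s, "cl100k_base")`. The sum of per-document counts can differ slightly from the count of the concatenated string, so leaving some headroom in the budget is sensible.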


In [30]:
output_str

'{"resourceType": "AllergyIntolerance", "id": "faf91c2b-8de6-9a64-aa0c-feceab910bfe", "meta": {"profile": ["http://hl7.org/fhir/us/core/StructureDefinition/us-core-allergyintolerance"]}, "clinicalStatus": {"coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]}, "verificationStatus": {"coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-verification", "code": "confirmed"}]}, "type": "allergy", "category": ["medication"], "criticality": "low", "code": {"coding": [{"system": "http://www.nlm.nih.gov/research/umls/rxnorm", "code": "1191", "display": "Aspirin"}], "text": "Aspirin"}, "patient": {"reference": "urn:uuid:5b3645de-a2d0-d016-0839-bab3757c4c58"}, "recordedDate": "2017-08-30T11:37:42+00:00", "reaction": [{"manifestation": [{"coding": [{"system": "http://snomed.info/sct", "code": "21522001", "display": "Abdominal pain (finding)"}], "text": "Abdominal pain (finding)"}], "severity": "moderate"}, {"manife