<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/main/scripts/130_RAGIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Retrieval Augmented Generation of care notes

**Author:** Eva Rombouts  
**Date:** 2024-06-03  
**Version:** 1.0

### Description
In this script the anonymous client notes generated [here](https://github.com/ekrombouts/GenCareAI/blob/main/scripts/100_GenerateAnonymousCareNotes.ipynb) are processed and stored in a vector database (Chroma). It enables efficient querying of the database using OpenAI's embeddings and a retrieval-augmented generation system, which enhances the searchability and accessibility of client data.

*Document Loading and Processing*: Documents are loaded from the Hugging Face platform, pre-processed, and split into smaller chunks by a LangChain text splitter. Splitting was probably not necessary for this dataset, but the step is included for completeness so that the pipeline scales as data volume or complexity grows.  
*Database Initialization and Population*: A Chroma vector database is initialized and populated with the embedded document chunks.  
*Query Operations*: The RetrievalQA pipeline is used to search the database with natural language queries, demonstrating how relevant information can be retrieved and displayed.

### Goal
My goal is to use this vector database to retrieve relevant examples for few-shot inference in order to create synthetic client notes. This approach can help me improve the generation of this data by providing specific, contextually relevant examples that guide the model's results.

### Setup and configuration
- When running in Colab, Google Drive is mounted to persistently store the Chroma vector database.
- API keys for OpenAI and Hugging Face are retrieved to authenticate access to the [embedding model](https://platform.openai.com/docs/guides/embeddings), the [QA model](https://platform.openai.com/docs/models) and the [dataset](https://huggingface.co/datasets/ekrombouts/dutch_nursing_home_reports). ***Please note*** that embedding isn't free: embedding the 35,000+ notes costs approximately $0.15. The cost of the example queries in this notebook is negligible.
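
As a rough sanity check on the cost figure above, the arithmetic can be sketched as follows. The price per million tokens, the characters-per-token ratio and the average note length are all assumptions made for illustration; they are not taken from this notebook, so check the current OpenAI pricing before relying on them.

```python
# Rough embedding-cost estimate. All constants are assumptions for illustration.
PRICE_PER_MILLION_TOKENS_USD = 0.10  # assumed price for text-embedding-ada-002
CHARS_PER_TOKEN = 4                  # crude heuristic; Dutch text may tokenize differently


def estimate_embedding_cost(texts,
                            price_per_million=PRICE_PER_MILLION_TOKENS_USD,
                            chars_per_token=CHARS_PER_TOKEN):
    """Return (estimated_tokens, estimated_cost_usd) for a list of strings."""
    total_chars = sum(len(t) for t in texts)
    tokens = total_chars / chars_per_token
    cost = tokens / 1_000_000 * price_per_million
    return tokens, cost


# Hypothetical corpus: 35,000 notes of 160 characters each
sample_notes = ["x" * 160] * 35_000
tokens, cost = estimate_embedding_cost(sample_notes)
print(f"~{tokens:,.0f} tokens, ~${cost:.2f}")  # ~1,400,000 tokens, ~$0.14
```

Under these assumptions the estimate lands in the same range as the reported cost; an exact figure would require counting tokens with the model's tokenizer.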

### Recommended Resources
- [RAG - Retrieval Augmented Generation](https://www.youtube.com/playlist?list=PL8motc6AQftn-X1HkaGG9KjmKtWImCKJS) with Sam Witteveen on YouTube
- [Python RAG Tutorial (with Local LLMs)](https://www.youtube.com/watch?v=2TJxpyO3ei4&t=323s) by Pixegami on YouTube
- And of course the [LangChain documentation](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)

In [1]:
import os
# Determines the current environment (Google Colab or local)
def check_environment():
    try:
        import google.colab
        return "Google Colab"
    except ImportError:
        pass

    return "Local Environment"

In [2]:
# Installs and settings depending on the environment
# When running in Colab, Google Drive is mounted and the necessary packages are installed.
# Data paths are set and API keys retrieved

env = check_environment()

if env == "Google Colab":
    print("Running in Google Colab")
    !pip install -q langchain langchain-openai langchain-community chromadb datasets
    from google.colab import drive, userdata
    drive.mount('/content/drive')
    DATA_DIR = '/content/drive/My Drive/Colab Notebooks/GenCareAI/data'
    OPENAI_API_KEY = userdata.get('GCI_OPENAI_API_KEY')
    HF_TOKEN = userdata.get('HF_TOKEN')
else:
    print("Running in Local Environment")
    # !pip install -q chromadb datasets
    DATA_DIR = '../data'
    from dotenv import load_dotenv
    load_dotenv()
    OPENAI_API_KEY = os.getenv('GCI_OPENAI_API_KEY')
    HF_TOKEN = os.getenv('HF_TOKEN')

Running in Local Environment


In [3]:
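# Import necessary modules from LangChain and Hugging Face
from datasets import load_dataset
# Chroma lives in langchain_community; importing it from langchain.vectorstores is deprecated
from langchain_community.vectorstores import Chroma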
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.chains import RetrievalQA
from pprint import pprint



In [4]:
# Constants for dataset and storage paths
PATH_HF_DATASET = 'ekrombouts/dutch_nursing_home_reports'
PATH_DB_GCAI = os.path.join(DATA_DIR, 'chroma_db_gcai')
COLLECTION_NAME = 'anonymous_reports'
MODEL = 'text-embedding-ada-002'

### Functions

In [5]:
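def load_documents():
    """Load the dataset from Hugging Face, ensuring access via token."""
    # Download (and cache) the dataset with the token; the returned object is not used
    load_dataset(PATH_HF_DATASET, token=HF_TOKEN)
    # Load the dataset as LangChain Documents, with the 'report' column as page content
    loader = HuggingFaceDatasetLoader(PATH_HF_DATASET,
                                      page_content_column='report')
    return loader.load()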

In [6]:
def split_documents(documents: list[Document]):
    """Split large text documents into manageable chunks for better handling by ML models."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100)
    chunks = text_splitter.split_documents(documents)
    # Index each chunk to maintain unique identifiers
    for idx, chunk in enumerate(chunks):
        chunk.metadata['id'] = str(idx)
    return chunks
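
To illustrate what `chunk_size=800` and `chunk_overlap=100` mean, here is a minimal sliding-window sketch in plain Python. It is a simplified stand-in, not the algorithm of `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and spaces. The function name `sliding_window_chunks` and the sample strings are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: no attempt is made to break on
    paragraph, sentence or word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short texts stay in one piece
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
    return chunks


short_note = "a" * 150   # hypothetical short note
long_note = "b" * 1000   # hypothetical long note
print(len(sliding_window_chunks(short_note)))               # 1
print([len(c) for c in sliding_window_chunks(long_note)])   # [800, 300]
```

A note shorter than the chunk size is returned unchanged, which is consistent with the chunk count reported later in this notebook (35233) being only slightly higher than the document count (35129).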

In [7]:
def initialize_vectordb(persist_directory, embedding_function, collection_name):
    """Initialize the Chroma vector database, either loading an existing one or creating a new one."""
    # The Chroma constructor opens the collection in persist_directory if it exists
    # and creates an empty one otherwise, so no separate branch is needed.
    # (Chroma.from_documents with an empty document list can fail on the empty insert.)
    return Chroma(persist_directory=persist_directory,
                  embedding_function=embedding_function,
                  collection_name=collection_name)

In [8]:
def load_existing_ids(vectordb):
    """Fetch existing document IDs from the database to avoid duplicates."""
    try:
        existing_items = vectordb.get(include=[])
        existing_ids = set(existing_items["ids"])
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        existing_ids = set()
    return existing_ids

In [9]:
# Modified from https://github.com/pixegami/rag-tutorial-v2/blob/main/populate_database.py
def add_new_documents(vectordb, documents):
    """Add new documents to the database only if they don't already exist."""
    existing_ids = load_existing_ids(vectordb)
    print(f"Number of existing documents in DB: {len(existing_ids)}")
    # Only add documents that don't exist in the DB.
    new_documents = []
    for document in documents:
        if document.metadata["id"] not in existing_ids:
            new_documents.append(document)
    if new_documents:
        print(f"Adding new documents: {len(new_documents)}")
        new_document_ids = [document.metadata["id"] for document in new_documents]
        vectordb.add_documents(new_documents, ids=new_document_ids)
    else:
        print("No new documents to add")
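
The ID-based check in `add_new_documents` makes the population step safe to re-run. The sketch below shows that behaviour with hypothetical stand-ins: plain dicts instead of LangChain `Document` objects and a Python set instead of the Chroma collection.

```python
def select_new_documents(documents, existing_ids):
    """Return only the documents whose metadata id is not yet stored."""
    return [doc for doc in documents if doc["metadata"]["id"] not in existing_ids]


# Hypothetical stand-ins for chunks and for the IDs already in the database
toy_chunks = [{"page_content": f"note {i}", "metadata": {"id": str(i)}} for i in range(5)]
stored_ids = {"0", "1"}

# First run: three chunks are new and get "stored"
first_run = select_new_documents(toy_chunks, stored_ids)
stored_ids.update(doc["metadata"]["id"] for doc in first_run)

# Second run: nothing is left to add
second_run = select_new_documents(toy_chunks, stored_ids)

print(len(first_run), len(second_run))  # 3 0
```

Because the IDs are positional indexes assigned in `split_documents`, they only identify the same chunk across runs as long as the dataset order and the splitting parameters stay the same.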


### Embed and store texts

In [11]:
# Load, split and process documents
documents = load_documents()
chunks = split_documents(documents=documents)

# # Consider experimenting with a smaller dataset
# documents_sample = documents[:15]
# chunks = split_documents(documents=documents_sample)

print(len(documents))
print(len(chunks))

Downloading readme: 100%|██████████| 1.43k/1.43k [00:00<00:00, 2.26MB/s]
Downloading data: 100%|██████████| 5.06M/5.06M [00:00<00:00, 5.58MB/s]
Generating train split: 100%|██████████| 35129/35129 [00:00<00:00, 370352.31 examples/s]


35129
35233


In [12]:
# Initialize vector database, using OpenAI embeddings
embedding = OpenAIEmbeddings(api_key=OPENAI_API_KEY, model=MODEL)
vectordb = initialize_vectordb(PATH_DB_GCAI, embedding, COLLECTION_NAME)

In [13]:
# Add new documents to the database
add_new_documents(vectordb, chunks)

Number of existing documents in DB: 35233
No new documents to add


### Query the database

In [14]:
## To read the db from file

# vectordb = Chroma(persist_directory=PATH_DB_GCAI,
#                   embedding_function=OpenAIEmbeddings(api_key=OPENAI_API_KEY, model=MODEL),
#                   collection_name=COLLECTION_NAME
#                   )

In [15]:
# Delete all items in the db

# items = vectordb.get(include=[])
# existing_ids = items["ids"]
# vectordb.delete(ids=existing_ids)

In [16]:
# Retrieve metadata and document IDs from the database
items = vectordb.get(include=['metadatas'])
existing_ids = set(items["ids"])
metadata = items['metadatas']
print(f"Number of existing documents in DB: {len(existing_ids)}")
print(metadata[0])

Number of existing documents in DB: 35233
{'id': '0', 'topic': 'ADL'}
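
Since every chunk carries a `topic` in its metadata, a quick count per topic gives an overview of what the database contains. This is a minimal sketch on a hypothetical sample that mimics the structure of `items['metadatas']`; the helper name `count_topics` is not part of the original notebook.

```python
from collections import Counter

# Hypothetical sample with the same structure as items['metadatas']
sample_metadata = [
    {"id": "0", "topic": "ADL"},
    {"id": "1", "topic": "ADL"},
    {"id": "2", "topic": "symptomen"},
]


def count_topics(metadatas):
    """Count how many chunks carry each topic label."""
    return Counter(m["topic"] for m in metadatas)


topic_counts = count_topics(sample_metadata)
print(topic_counts.most_common())  # [('ADL', 2), ('symptomen', 1)]
```

In the notebook itself, `count_topics(metadata)` would give the distribution over the full collection.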


In [17]:
# Set up a retriever for document querying
retriever = vectordb.as_retriever(search_kwargs={"k": 4})
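
Conceptually, a similarity retriever with `k=4` embeds the query and returns the four stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity on tiny hand-made vectors. The vectors and labels are hypothetical, and the distance metric and index that Chroma actually uses may differ, so treat this as an illustration of the principle only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 3-dimensional "embeddings"; real embeddings have many more dimensions
toy_vectors = {
    "weight": [0.9, 0.1, 0.0],
    "appetite": [0.7, 0.3, 0.1],
    "sleep": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))  # ['weight', 'appetite']
```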

In [18]:
# Query the vector database using similarity search
query = 'gewichtsverlies'  # Dutch for "weight loss"
docs = retriever.invoke(query)

print(f'Number of docs: {len(docs)}\n')
print(f'Retriever search type: {retriever.search_type}\n')

print(f'Documents most similar to "{query}":')
for doc in docs:
  print(doc.page_content)

Number of docs: 4

Retriever search type: similarity

Documents most similar to "gewichtsverlies":
"Mw verloor vannacht onverwachts een grote hoeveelheid gewicht. Eetlust lijkt verminderd. Weeg mw elke ochtend en let op mogelijke oorzaken."
"Observatie van mevr. Veenstra toont gewichtsverlies en afname in eetlust. Overleg gestart met voedingsdeskundige voor aanpassingen in dieet en monitoring."
"Mw vertoont tekenen van verminderde eetlust en algemene zwakte. Gewichtsverlies van 2 kg opgemerkt sinds vorige week."
"Consult gevraagd bij di\u00ebtist voor voedingsadvies dhr. ivm plotseling gewichtsverlies."


In [19]:
# Initialize the QA chain for answering questions using the retrieved documents
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(api_key=OPENAI_API_KEY),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
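
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below shows the idea in plain Python. The prompt wording and the function `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template; the example texts are shortened excerpts of notes shown in this notebook.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved texts into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Gewichtsverlies van 2 kg opgemerkt sinds vorige week.",
    "Consult gevraagd bij diëtist voor voedingsadvies.",
]
prompt = build_stuff_prompt("Wat moet je doen bij gewichtsverlies?", retrieved)
print(prompt)
```

This approach is simple and works well here because the notes are short and only four are retrieved; with many or long documents the combined prompt could exceed the model's context window.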

In [20]:
# Define a function to process the response
def process_llm_response(llm_response):
    print(100 * '*')
    print(f"\nresult: {llm_response['result']}")
    print('\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['id'], source.metadata['topic'])

In [21]:
# Function to test the pipeline
def query_retrieval_pipeline(query):
    llm_response = qa_chain.invoke(query)
    pprint(llm_response)
    # process_llm_response prints its own output and returns None, so it is not wrapped in print()
    process_llm_response(llm_response)

In [22]:
# Query (Dutch): "What should you do when your client is losing weight?"
query_retrieval_pipeline("Wat moet je doen als je client afvalt in gewicht?")

{'query': 'Wat moet je doen als je client afvalt in gewicht?',
 'result': ' In dit geval is het belangrijk om het gewicht regelmatig te '
           'controleren en mogelijke oorzaken van het gewichtsverlies te '
           'onderzoeken. Daarnaast is het ook belangrijk om contact op te '
           'nemen met de huisarts en eventuele aanpassingen in het dieet te '
           'bespreken. Verder is het ook van belang om de situatie te '
           'bespreken met de familie en eventueel aanpassingen in het zorgplan '
           'te maken.',
 'source_documents': [Document(page_content='"Bij het tillen merkte ik dat mw meer gewicht lijkt te hebben verloren. Graag gewicht nogmaals controleren."', metadata={'id': '7846', 'topic': 'mobiliteit'}),
                      Document(page_content='"Mw verloor vannacht onverwachts een grote hoeveelheid gewicht. Eetlust lijkt verminderd. Weeg mw elke ochtend en let op mogelijke oorzaken."', metadata={'id': '15838', 'topic': 'symptomen'}),
             

In [23]:
# Query (Dutch): "What should you do when your client shows aggressive behaviour?"
query_retrieval_pipeline("Wat moet je doen als je client agressief gedrag vertoont?")

{'query': 'Wat moet je doen als je client agressief gedrag vertoont?',
 'result': ' Het is belangrijk om de situatie te de-escaleren door rust te '
           'brengen en de cliënt zijn/haar gevoelens te laten uiten. Het is '
           'ook belangrijk om op veilige afstand te blijven en extra zorg en '
           'aandacht te bieden tijdens de persoonlijke verzorging. Het kan ook '
           'helpen om samen een rustige activiteit te doen om de cliënt te '
           'kalmeren.',
 'source_documents': [Document(page_content='"Dhr vertoont agressief gedrag en duwt een medebewoner weg die te dichtbij komt. Hij maakt dreigende gebaren en kan moeilijk worden gekalmeerd door het personeel."', metadata={'id': '6770', 'topic': 'onrust'}),
                      Document(page_content='"Meneer wordt agressief tijdens de persoonlijke verzorging, slaat wild om zich heen en schreeuwt dat hij naar huis wil. Extra zorg en aandacht geboden om de situatie te de-escaleren."', metadata={'id': '15311', '

In [24]:
# Query (Dutch): "What can you do when a client is restless at night?"
query_retrieval_pipeline("Wat kan je doen als een cliënt onrustig is 's nachts?")

{'query': "Wat kan je doen als een cliënt onrustig is 's nachts?",
 'result': ' \n'
           '\n'
           'Possible actions could include applying comfort measures, having '
           'the night shift be extra vigilant, discussing possible causes with '
           'the nurse or doctor, and adjusting sleep medication if necessary.',
 'source_documents': [Document(page_content='"Mw vertoont onrustig gedrag en lijkt angstig te zijn in de avonduren. Comfortmaatregelen toepassen en nachtdienst extra alert laten zijn."', metadata={'id': '7166', 'topic': 'symptomen'}),
                      Document(page_content='"Graag overleg met arts over toenemende onrust bij mevr. in de nacht."', metadata={'id': '4203', 'topic': 'medisch_logistiek'}),
                      Document(page_content='"Verpleegkundige observeert onrustig gedrag bij meneer in de avonduren. Graag mogelijke oorzaken bespreken en indien nodig slaapmedicatie aanpassen."', metadata={'id': '4798', 'topic': 'medisch_logistiek'})

In [25]:
# Query (Dutch): "Which interventions have been used to improve sleep at night?"
query_retrieval_pipeline("Welke interventies zijn ingezet voor het verbeteren van de nachtrust?")

{'query': 'Welke interventies zijn ingezet voor het verbeteren van de '
          'nachtrust?',
 'result': '\n'
           'Er zijn verschillende interventies ingezet om de nachtrust te '
           'verbeteren, zoals het bespreken van nachtelijke onrust en het '
           'advies van een slaapspecialist inwinnen, het overwegen van '
           'nachtrust bevorderende interventies, het aanpassen van '
           'slaapmedicatie en het plannen van rustgevende activiteiten, en het '
           'implementeren van extra aandacht voor rust en slaap.',
 'source_documents': [Document(page_content='"Notitie voor artsenvisite: bespreken van nachtelijke onrust bij mevr. en mogelijke interventies om slaapkwaliteit te verbeteren. Advies van slaapspecialist gewenst."', metadata={'id': '21953', 'topic': 'medisch_logistiek'}),
                      Document(page_content='"Gisteravond was meneer onrustig en had hij moeite met in slaap komen. Overweeg nachtrust bevorderende interventies."', metadata={

In [None]:
# Query (Dutch): "What are fun things to do with visitors?"
query_retrieval_pipeline("Wat zijn leuke dingen om te doen met bezoek?")