<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/main/notebooks/100_note_generation/130_RAGIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Retrieval Augmented Generation of care notes

**Author:** Eva Rombouts  
**Date:** 2024-06-03  
**Updated:** 2024-09-30  
**Version:** 2.0

### Description
This notebook processes the anonymous client notes generated [here](https://github.com/ekrombouts/GenCareAI/blob/main/scripts/100_GenerateAnonymousCareNotes.ipynb) and stores them in a Chroma vector database. The database can then be queried using OpenAI embeddings and a retrieval-augmented generation (RAG) pipeline.

*Document Loading and Processing*: Documents are loaded from the Hugging Face platform, pre-processed, and split into smaller chunks by a LangChain text splitter. Splitting was probably not necessary for this dataset, but it is included for completeness and so that the pipeline scales to longer documents in the future.  
*Database Initialization and Population*: A Chroma vector database is initialized and populated with the embedded document chunks.  
*Query Operations*: The RetrievalQA pipeline lets us search the database with natural language queries and display the relevant information it retrieves.

### Goal
My goal is to use this vector database to retrieve relevant examples for few-shot inference in prompts for creating synthetic client notes. This approach can help me improve the generation of this data by providing specific, contextually relevant examples that guide the model's results.
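
As a minimal sketch of that idea, the function below assembles a few-shot prompt from a list of example notes. The function name `build_few_shot_prompt` and the prompt wording are hypothetical, and the examples are stand-ins for notes that would come from the retriever.

```python
def build_few_shot_prompt(examples: list[str], topic: str) -> str:
    """Assemble a few-shot prompt from example notes (hypothetical template)."""
    lines = ["Here are example nursing home notes:"]
    # Number the examples so the model can tell them apart
    for i, example in enumerate(examples, start=1):
        lines.append(f"{i}. {example}")
    lines.append(f"Write one new note in the same style about: {topic}")
    return "\n".join(lines)


# Stand-in examples; in this notebook they would be retrieved from the vector database
examples = ["Example note A", "Example note B"]
prompt = build_few_shot_prompt(examples, "gewichtsverlies")
print(prompt)
```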

### Setup and configuration
- When running in Colab, Google Drive is mounted to persistently store the Chroma vector database.
- API keys for OpenAI and Hugging Face are retrieved to authenticate access to the [embedding model](https://platform.openai.com/docs/guides/embeddings), the [QA model](https://platform.openai.com/docs/models) and the [dataset](https://huggingface.co/datasets/ekrombouts/dutch_nursing_home_reports).
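
This notebook retrieves keys through the `GenCareAISetup` helper, whose internals are not shown here. As a hedged alternative sketch, keys can be read from environment variables. The helper `get_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of GenCareAI.

```python
import os


def get_api_key(name: str, env=None) -> str:
    """Return the key stored under `name`, or raise a clear error if it is missing."""
    # Fall back to the real environment when no mapping is passed in
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise KeyError(f"Environment variable {name} is not set")
    return value


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(get_api_key("OPENAI_API_KEY", demo_env))
```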

### Recommended Resources
- [RAG - Retrieval Augmented Generation](https://www.youtube.com/playlist?list=PL8motc6AQftn-X1HkaGG9KjmKtWImCKJS) with Sam Witteveen on YouTube
- [Python RAG Tutorial (with Local LLMs)](https://www.youtube.com/watch?v=2TJxpyO3ei4&t=323s) by Pixegami on YouTube
- And of course the [LangChain documentation](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)

***Please note*** that embedding isn't free. Embedding the 35,000+ notes costs approximately $0.15. The cost of the example queries in this notebook is negligible.
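
The estimate can be reproduced with simple arithmetic. The sketch below rests on two assumptions that are not stated in this notebook: a price of $0.10 per million tokens for `text-embedding-ada-002` (check OpenAI's current pricing) and an average of about 43 tokens per note (a hypothetical figure chosen so the result matches the estimate above).

```python
def embedding_cost_usd(n_notes: int, avg_tokens_per_note: float,
                       usd_per_million_tokens: float) -> float:
    """Estimate the embedding cost from note count, average length and token price."""
    total_tokens = n_notes * avg_tokens_per_note
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed inputs: 35,000 notes, ~43 tokens per note, $0.10 per 1M tokens
cost = embedding_cost_usd(35_000, 43, 0.10)
print(f"${cost:.2f}")
```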

In [1]:
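# Install the GenCareAI helper package and detect the runtime environment
!pip install GenCareAI
from GenCareAI.GenCareAIUtils import GenCareAISetup

setup = GenCareAISetup()

# On Colab, install the remaining dependencies
if setup.environment == 'Colab':
    !pip install -q langchain langchain-openai langchain-community chromadb datasets langchain-chroma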



In [2]:
# Import necessary modules from Langchain and Hugging Face
import os
import pandas as pd
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.document_loaders import HuggingFaceDatasetLoader, DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.chains import RetrievalQA
from pprint import pprint

In [3]:
# Constants for dataset and storage paths
# Use local csv or HuggingFace dataset
# path_dataset = setup.get_file_path('data/gcai_notes.csv') 
path_dataset = "ekrombouts/Olympia_notes"

path_db_gcai = setup.get_file_path('data/chroma_db_gcai_notes')
collection_name = 'anonymous_notes'
model = 'text-embedding-ada-002'

In [4]:
def load_documents(path):
    """Load the dataset either from Hugging Face or a local CSV file based on the path provided."""
    
    try:
        # Try to load the Hugging Face dataset       
        loader = HuggingFaceDatasetLoader(path, page_content_column='note', use_auth_token=setup.get_hf_token())
        return loader.load()
    
    except Exception:
        # If loading as a Hugging Face dataset fails, assume it's a CSV file        
        df = pd.read_csv(path)
        loader = DataFrameLoader(df, page_content_column='note')
        return loader.load()

documents = load_documents(path=path_dataset)



In [5]:
def split_documents(documents: list[Document]):
  """Split large text documents into manageable chunks for better handling by ML models."""
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=800,
      chunk_overlap=100)
  chunks = text_splitter.split_documents(documents)
  # Index each chunk to maintain unique identifiers
  for idx, chunk in enumerate(chunks):
      chunk.metadata['id'] = str(idx)
  return chunks

chunks = split_documents(documents=documents)

print(len(documents))
print(len(chunks))

19257
19257
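
The identical counts above suggest that no note exceeded the 800-character `chunk_size`, so every note passed through as a single chunk. Below is a minimal sketch of a quick check, assuming length is measured in characters (the splitter's default) and using stand-in notes rather than the real dataset. The helper `count_oversized` is hypothetical.

```python
def count_oversized(texts: list[str], chunk_size: int = 800) -> int:
    """Count texts longer than chunk_size characters, i.e. texts that would be split."""
    return sum(len(text) > chunk_size for text in texts)


# Stand-in notes: one short note and one artificial 900-character note
notes = ["Korte notitie.", "x" * 900]
print(count_oversized(notes))
```

In this notebook, `count_oversized([d.page_content for d in documents])` would be expected to return 0 if this reading is right.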


In [6]:
def initialize_vectordb(persist_directory, embedding_function, collection_name):
    """Initialize the Chroma vector database, loading an existing one or creating a new one."""
    # Both cases use the same call: Chroma opens the collection stored at
    # persist_directory if it exists and creates it otherwise.
    return Chroma(persist_directory=persist_directory,
                  embedding_function=embedding_function,
                  collection_name=collection_name)

# Initialize vector database, using OpenAI embeddings
embedding = OpenAIEmbeddings(api_key=setup.get_openai_key(), model=model)
vectordb = initialize_vectordb(path_db_gcai, embedding, collection_name)

# If you get an error, run this cell again (TODO: fix it).

In [7]:
# Modified from https://github.com/pixegami/rag-tutorial-v2/blob/main/populate_database.py
def add_new_documents(vectordb, documents, batch_size=5000):
    """Add new documents to the database"""

    def load_existing_ids(vectordb):
        """Fetch existing document IDs from the database to avoid duplicates."""
        try:
            existing_items = vectordb.get(include=[])
            existing_ids = set(existing_items["ids"])
        except Exception:
            # Treat any failure to read the DB as an empty DB
            existing_ids = set()
        return existing_ids

    existing_ids = load_existing_ids(vectordb)
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_documents = []
    for document in documents:
        if document.metadata["id"] not in existing_ids:
            new_documents.append(document)

    if len(new_documents):
        print(f"Total new documents to add: {len(new_documents)}")

        # Process documents in batches
        for i in range(0, len(new_documents), batch_size):
            batch = new_documents[i:i + batch_size]
            batch_ids = [document.metadata["id"] for document in batch]
            vectordb.add_documents(batch, ids=batch_ids)
            print(f"Added batch {i//batch_size + 1} with {len(batch)} documents")
    else:
        print("No new documents to add")

add_new_documents(vectordb, chunks, batch_size=250)

Number of existing documents in DB: 19257
No new documents to add
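
The deduplication and batching logic above can be illustrated without a database. This is a simplified sketch with plain dictionaries and sets standing in for the documents and the Chroma collection. The helper names `select_new` and `batched` are hypothetical.

```python
def select_new(items: dict[str, str], existing_ids: set[str]) -> dict[str, str]:
    """Keep only the items whose ID is not already stored."""
    return {item_id: text for item_id, text in items.items() if item_id not in existing_ids}


def batched(seq: list, batch_size: int):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(seq), batch_size):
        yield seq[start:start + batch_size]


# Seven stand-in notes, two of which are already "in the database"
items = {str(i): f"note {i}" for i in range(7)}
existing = {"0", "1"}

new_ids = list(select_new(items, existing))
batches = list(batched(new_ids, 2))
print(batches)
```

Because the IDs are simply the chunk positions, rerunning the cell on the same data adds nothing, which matches the output above.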


### Query the database

In [8]:
## To read the db from file

# vectordb = Chroma(persist_directory=path_db_gcai,
#                   embedding_function=OpenAIEmbeddings(api_key=setup.get_openai_key(), model=model),
#                   collection_name=collection_name
#                   )

In [9]:
# Delete all items in the db

# items = vectordb.get(include=[])
# existing_ids = items["ids"]
# vectordb.delete(ids=existing_ids)

In [10]:
# Retrieve metadata and document IDs from the database
items = vectordb.get(include=['metadatas'])
existing_ids = set(items["ids"])
metadata = items['metadatas']
print(f"Number of existing documents in DB: {len(existing_ids)}")
print(metadata[0])

Number of existing documents in DB: 19257
{'category': 'adl', 'completion': 1, 'generation': 72, 'id': '0'}


In [11]:
# Set up a retriever for document querying
retriever = vectordb.as_retriever(search_kwargs={"k": 4})
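
Conceptually, a similarity retriever with `k=4` embeds the query and returns the four stored chunks whose vectors are closest to it. The sketch below shows the idea with cosine similarity on toy 2-dimensional vectors. It is an illustration only: the distance metric Chroma actually uses depends on how the collection is configured, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings standing in for the real 1536-dimensional ada-002 vectors
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], doc_vecs, k=2))
```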

In [12]:
# Query the vector database using similarity search
query = 'gewichtsverlies'  # Dutch for "weight loss"
docs = retriever.invoke(query)

print(f'Number of docs: {len(docs)}\n')
print(f'Retriever search type: {retriever.search_type}\n')

print(f'Documents most similar to "{query}":')
for doc in docs:
  print(doc.page_content)

Number of docs: 4

Retriever search type: similarity

Documents most similar to "gewichtsverlies":
"Extreem gewichtsverlies geconstateerd, mogelijk onvoldoende voedingsinname, advies inwinnen bij di\u00ebtist."
"Waarneembaar gewichtsverlies en verandering van kledingmaat, duidelijke afname van spiermassa. Voedingsplan bijstellen en calorie\u00ebn verhogen."
"Gewichtsverlies geconstateerd van 2 kilo in de afgelopen week. Voedingsadviezen aangepast en gewicht dagelijks gecontroleerd."
"Familie van cli\u00ebnt heeft aangegeven dat er sprake is van gewichtsverlies, graag di\u00ebtist consulteren"


In [13]:
# Initialize the QA chain for answering questions using the retrieved documents
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(api_key=setup.get_openai_key()),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
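
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. A minimal sketch of that idea follows. The template wording is hypothetical and is not LangChain's actual prompt.

```python
def stuff_prompt(question: str, docs: list[str]) -> str:
    """Combine all retrieved documents and the question into one prompt."""
    # All documents are concatenated, so the total must fit the model's context window
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-in documents; in the notebook these would be the retriever's results
stuffed = stuff_prompt("Wat te doen bij gewichtsverlies?", ["Note 1", "Note 2"])
print(stuffed)
```

With short notes and `k=4`, the combined prompt stays small, which is why this simple chain type is a reasonable fit here.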

In [16]:
# function to test the pipeline
def query_retrieval_pipeline(query):

  def process_llm_response(llm_response):
      print(100 * '*')
      print(f"\nresult: {llm_response['result']}")
      print('\nSources:')
      for source in llm_response["source_documents"]:
          print(source.metadata['id'], source.metadata['category'])

  llm_response = qa_chain.invoke(query)
  pprint(llm_response)
  # process_llm_response prints its own output and returns None, so call it directly
  process_llm_response(llm_response)

In [None]:
# "What should you do if your client is losing weight?"
query_retrieval_pipeline("Wat moet je doen als je client afvalt in gewicht?")

In [None]:
# "What should you do if your client shows aggressive behaviour?"
query_retrieval_pipeline("Wat moet je doen als je client agressief gedrag vertoont?")

In [None]:
# "What can you do if a client is restless at night?"
query_retrieval_pipeline("Wat kan je doen als een cliënt onrustig is 's nachts?")

In [None]:
# "Which interventions have been used to improve sleep at night?"
query_retrieval_pipeline("Welke interventies zijn ingezet voor het verbeteren van de nachtrust?")

In [None]:
# "What are fun things to do with visitors?"
query_retrieval_pipeline("Wat zijn leuke dingen om te doen met bezoek?")