<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/main/script/RAG/200_RAGIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Retrieval Augmented Generation of care notes

**Author:** Eva Rombouts  
**Date:** 2024-06-03  
**Version:** 1.0

### Description
This notebook processes the anonymous client notes generated [here](https://github.com/ekrombouts/GenCareAI/blob/main/scripts/data_generation/100_GenerateCareReportsColab.ipynb) and stores them in a vector database (Chroma). The database can then be queried using OpenAI's embeddings and a retrieval-augmented generation (RAG) system, which makes the client data easier to search and access.

*Document Loading and Processing*: Documents are loaded from the Hugging Face platform, pre-processed, and split into smaller chunks by a LangChain text splitter. Splitting was probably not necessary for this dataset; it is included for completeness, so that the pipeline stays scalable and efficient if data volume or complexity increases.  
*Database Initialization and Population*: A Chroma vector database is initialized and populated with the embedded document chunks.  
*Query Operations*: The RetrievalQA pipeline searches the database with natural language queries, demonstrating how relevant information is retrieved and displayed.

### Goal
My goal is to use this vector database to retrieve relevant examples for few-shot inference when creating synthetic client records. Providing specific, contextually relevant examples should guide the model's output and improve the quality of the generated data.

### Setup and configuration
- Mount Google Drive to persistently store the Chroma vector database.
- Retrieve API keys for OpenAI and Hugging Face, which authenticate access to the [embedding model](https://platform.openai.com/docs/guides/embeddings), the [QA model](https://platform.openai.com/docs/models) and the [dataset](https://huggingface.co/datasets/ekrombouts/dutch_nursing_home_reports). ***Please note*** that embedding is not free: embedding the 35,000+ notes costs approximately $0.15. The cost of the example queries in this notebook is negligible.
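
As a rough sanity check on the embedding cost mentioned above, the arithmetic can be sketched as follows. The price per 1,000 tokens and the average note length are assumptions chosen for illustration, not values taken from this notebook; check the current OpenAI pricing page before relying on them.

```python
# Back-of-the-envelope estimate of the embedding cost.
# ASSUMPTIONS (not from the notebook): the price and the average tokens per note.
PRICE_PER_1K_TOKENS_USD = 0.0001  # assumed price for text-embedding-ada-002
AVG_TOKENS_PER_NOTE = 43          # assumed average length of one care note
N_NOTES = 35_000                  # approximate dataset size stated above


def estimate_embedding_cost(n_notes: int, avg_tokens: float, price_per_1k: float) -> float:
    """Return the estimated cost in USD of embedding n_notes documents."""
    total_tokens = n_notes * avg_tokens
    return total_tokens / 1000 * price_per_1k


cost = estimate_embedding_cost(N_NOTES, AVG_TOKENS_PER_NOTE, PRICE_PER_1K_TOKENS_USD)
print(f"Estimated embedding cost: ${cost:.2f}")
```

Under these assumptions the estimate lands in the same range as the figure quoted above; longer notes or a different price scale the result linearly.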

### Recommended Resources
- [RAG - Retrieval Augmented Generation](https://www.youtube.com/playlist?list=PL8motc6AQftn-X1HkaGG9KjmKtWImCKJS) with Sam Witteveen on YouTube
- [Python RAG Tutorial (with Local LLMs)](https://www.youtube.com/watch?v=2TJxpyO3ei4&t=323s) by Pixegami on YouTube
- And of course the [LangChain documentation](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)

In [None]:
# Install required libraries
!pip install -q langchain langchain-openai langchain-community chromadb datasets

In [None]:
# Import necessary modules from LangChain and Hugging Face
import os
from datasets import load_dataset
from google.colab import drive, userdata
# Chroma lives in langchain_community; the langchain.vectorstores path is deprecated
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.chains import RetrievalQA
from pprint import pprint

In [None]:
# Constants for dataset and storage paths
PATH_HF_DATASET = 'ekrombouts/dutch_nursing_home_reports'
PATH_DB_GCAI = '/content/drive/MyDrive/Colab Notebooks/GenCareAI/data/chroma_db_gcai'
COLLECTION_NAME = 'anonymous_reports'
MODEL = 'text-embedding-ada-002'

### Functions

In [None]:
def load_documents():
  """Load the dataset from Hugging Face, ensuring access via token."""
  dataset = load_dataset(PATH_HF_DATASET, token=HF_TOKEN)
  loader = HuggingFaceDatasetLoader(PATH_HF_DATASET,
                                    page_content_column='report')
  return loader.load()

In [None]:
def split_documents(documents: list[Document]):
  """Split large text documents into manageable chunks for better handling by ML models."""
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=800,
      chunk_overlap=100)
  chunks = text_splitter.split_documents(documents)
  # Index each chunk to maintain unique identifiers
  for idx, chunk in enumerate(chunks):
      chunk.metadata['id'] = str(idx)
  return chunks
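
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal standard-library sketch of fixed-size character windows with overlap. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, which this sketch ignores. The function name and sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical 2,000-character note made of repeating digits
sample = "".join(str(i % 10) for i in range(2000))
pieces = sliding_window_chunks(sample)
print([len(piece) for piece in pieces])
```

Each window starts 700 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next. A text shorter than `chunk_size` comes back as a single chunk.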

In [None]:
def initialize_vectordb(persist_directory, embedding_function, collection_name):
    """Initialize the Chroma vector database, either loading an existing one or creating a new one."""
    if os.path.exists(persist_directory):
        return Chroma(persist_directory=persist_directory,
                      embedding_function=embedding_function,
                      collection_name=collection_name)
    else:
        return Chroma.from_documents(documents=[],
                                     embedding=embedding_function,
                                     persist_directory=persist_directory,
                                     collection_name=collection_name)

In [None]:
def load_existing_ids(vectordb):
    """Fetch existing document IDs from the database to avoid duplicates."""
    try:
        existing_items = vectordb.get(include=[])
        existing_ids = set(existing_items["ids"])
    except Exception:
        # Fall back to an empty set if the collection cannot be read
        existing_ids = set()
    return existing_ids

In [None]:
# Modified from https://github.com/pixegami/rag-tutorial-v2/blob/main/populate_database.py
def add_new_documents(vectordb, documents):
    """Add new documents to the database only if they don't already exist."""
    existing_ids = load_existing_ids(vectordb)
    print(f"Number of existing documents in DB: {len(existing_ids)}")
    # Only add documents that don't exist in the DB.
    new_documents = [
        document for document in documents
        if document.metadata["id"] not in existing_ids
    ]
    if new_documents:
        print(f"Adding new documents: {len(new_documents)}")
        new_document_ids = [document.metadata["id"] for document in new_documents]
        vectordb.add_documents(new_documents, ids=new_document_ids)
    else:
        print("No new documents to add")
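
The ID check above makes population idempotent: running it twice should not add anything the second time. The sketch below illustrates that behaviour with a plain dictionary as a hypothetical stand-in for the vector store, so it runs without Chroma or an API key. The names `select_new` and `fake_db` are illustrative only.

```python
def select_new(documents: list[dict], existing_ids: set[str]) -> list[dict]:
    """Return only the documents whose 'id' is not yet in existing_ids."""
    return [doc for doc in documents if doc["id"] not in existing_ids]


# Hypothetical stand-in for the vector store: a plain dict keyed by document ID
fake_db: dict[str, str] = {}
docs = [{"id": str(i), "text": f"note {i}"} for i in range(3)]

for run in (1, 2):
    new_docs = select_new(docs, set(fake_db))
    for doc in new_docs:
        fake_db[doc["id"]] = doc["text"]
    print(f"run {run}: added {len(new_docs)}, total {len(fake_db)}")
```

Because the IDs in this notebook are positional indices assigned in `split_documents`, they identify the same chunk only as long as the dataset order and the splitter settings stay unchanged.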


### Embed and store texts

In [None]:
# Setup and authentication
drive.mount('/content/drive')
OPENAI_API_KEY = userdata.get('GCI_OPENAI_API_KEY')
HF_TOKEN = userdata.get('HF_TOKEN')

In [None]:
# Load, split and process documents
documents = load_documents()
chunks = split_documents(documents=documents)

# # Consider experimenting with a smaller dataset
# documents_sample = documents[:15]
# chunks = split_documents(documents=documents_sample)

print(len(documents))
print(len(chunks))

In [None]:
# Initialize vector database, using OpenAI embeddings
embedding = OpenAIEmbeddings(api_key=OPENAI_API_KEY, model=MODEL)
vectordb = initialize_vectordb(PATH_DB_GCAI, embedding, COLLECTION_NAME)

In [None]:
# Add new documents to the database
add_new_documents(vectordb, chunks)

### Query the database

In [None]:
## To read the db from file

# vectordb = Chroma(persist_directory=PATH_DB_GCAI,
#                   embedding_function=OpenAIEmbeddings(api_key=OPENAI_API_KEY, model=MODEL),
#                   collection_name = COLLECTION_NAME
#                   )

In [None]:
# Delete all items in the db

# items = vectordb.get(include=[])
# existing_ids = items["ids"]
# vectordb.delete(ids=existing_ids)

In [None]:
# Retrieve metadata and document IDs from the database
items = vectordb.get(include=['metadatas'])
existing_ids = set(items["ids"])
metadata = items['metadatas']
print(f"Number of existing documents in DB: {len(existing_ids)}")
print(metadata[0])

In [None]:
# Set up a retriever for document querying
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

In [None]:
# Query the vector database using similarity search
query = 'gewichtsverlies'  # Dutch for "weight loss"
docs = retriever.invoke(query)

print(f'Number of docs: {len(docs)}\n')
print(f'Retriever search type: {retriever.search_type}\n')

print(f'Documents most similar to "{query}":')
for doc in docs:
  print(doc.page_content)
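
Conceptually, a similarity retriever ranks stored vectors by how close they are to the query vector and returns the top `k`. The sketch below shows that idea with cosine similarity on toy three-dimensional vectors. It is an illustration only: the distance metric Chroma actually uses depends on how the collection is configured, and the vectors, IDs and function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical; real embeddings have far more dimensions)
toy_vectors = {
    "weight_note": [0.9, 0.1, 0.0],
    "sleep_note": [0.0, 0.2, 0.9],
    "diet_note": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

for doc_id, score in top_k(query_vector, toy_vectors, k=2):
    print(doc_id, round(score, 3))
```

With `k=2`, the two vectors pointing in roughly the same direction as the query are returned and the unrelated one is dropped, which mirrors what `search_kwargs={"k": 4}` does at a larger scale.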

In [None]:
# Initialize the QA chain for answering questions using the retrieved documents
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(api_key=OPENAI_API_KEY),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
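
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording and the helper name `build_stuffed_prompt` are hypothetical and are not LangChain's actual template; the example texts are made up for illustration.

```python
def build_stuffed_prompt(question: str, documents: list[str]) -> str:
    """Concatenate all documents into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks (made-up text, not from the dataset)
retrieved = [
    "Client ate only half of the evening meal.",
    "Weight is down 2 kg compared with last month.",
]
prompt = build_stuffed_prompt("What should be done about the weight loss?", retrieved)
print(prompt)
```

Because every retrieved chunk ends up in one prompt, the combined length of `k` chunks plus the question has to fit in the model's context window, which is one reason to keep `k` and the chunk size modest.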

In [None]:
# Define a function to process the response
def process_llm_response(llm_response):
    print(100 * '*')
    print(f"\nresult: {llm_response['result']}")
    print('\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['id'], source.metadata['topic'])

In [None]:
# Function to test the pipeline
def query_retrieval_pipeline(query):
  llm_response = qa_chain.invoke(query)
  pprint(llm_response)
  # process_llm_response prints its own output and returns None, so call it directly
  process_llm_response(llm_response)

In [None]:
# "What should you do if your client is losing weight?"
query_retrieval_pipeline("Wat moet je doen als je client afvalt in gewicht?")

In [None]:
# "What should you do if your client shows aggressive behaviour?"
query_retrieval_pipeline("Wat moet je doen als je client agressief gedrag vertoont?")

In [None]:
# "What can you do if a client is restless at night?"
query_retrieval_pipeline("Wat kan je doen als een cliënt onrustig is 's nachts?")

In [None]:
# "Which interventions have been used to improve sleep at night?"
query_retrieval_pipeline("Welke interventies zijn ingezet voor het verbeteren van de nachtrust?")

In [None]:
# "What are fun things to do with visitors?"
query_retrieval_pipeline("Wat zijn leuke dingen om te doen met bezoek?")