<a href="https://colab.research.google.com/github/ekrombouts/GenCareAI/blob/olympia/notebooks/100_note_generation/112_RAGIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenCare AI: Retrieval Augmented Generation of care notes

**Author:** Eva Rombouts  
**Date:** 2024-06-03  
**Updated:** 2024-09-30  
**Version:** 2.0

### Description
In the previous notebook, we generated progress notes for nursing home residents using a GPT model. In this notebook, we will create a vector database to store these progress notes, which will later be used as few-shot examples for generating new client records.

The workflow includes loading a dataset of care notes, splitting these notes into manageable chunks, and embedding them using OpenAI’s text-embedding-ada-002 model. These embeddings are stored in a Chroma vector database.

- Documents can be loaded either from a CSV file or from HuggingFace.
- The notes are split into smaller chunks. This was probably not necessary for this dataset, but the step is included for completeness and to keep the pipeline scalable.
- When running in Colab, Google Drive is mounted so that the Chroma vector database is stored persistently.
- API keys for OpenAI and HuggingFace are retrieved to authenticate access to the [embedding model](https://platform.openai.com/docs/guides/embeddings), the [QA model](https://platform.openai.com/docs/models) and the [dataset](https://huggingface.co/datasets/ekrombouts/Gardenia_notes).
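
To make the idea of a vector database concrete, here is a minimal sketch of the embed, store and retrieve cycle in plain Python. It is not how Chroma or the OpenAI embedding model work internally: `toy_embed` is a hypothetical bag-of-words stand-in for the real embedding model, and the three example notes are invented for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, notes, k=2):
    """Return the k notes whose toy embeddings are most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(note)), note) for note in notes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:k]]


# Invented example notes, not taken from the dataset
notes = [
    "Resident slept well and ate breakfast independently",
    "Resident was restless at night and refused medication",
    "Family visited in the afternoon",
]
print(top_k("restless at night", notes, k=1))
```

A real embedding model captures meaning rather than word overlap, but the retrieval step is the same in spirit: embed the query and return the stored notes with the closest vectors.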

### Recommended Resources
- [RAG - Retrieval Augmented Generation](https://www.youtube.com/playlist?list=PL8motc6AQftn-X1HkaGG9KjmKtWImCKJS) with Sam Witteveen on YouTube
- [Python RAG Tutorial (with Local LLMs)](https://www.youtube.com/watch?v=2TJxpyO3ei4&t=323s) by Pixegami on YouTube
- And of course the [LangChain documentation](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)

***Please note*** that embedding isn't free. Embedding the 35,000+ notes costs approximately $0.15. The cost of the example queries in this notebook is negligible.
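
As a rough sanity check on that figure, the arithmetic can be sketched as below. The price of $0.10 per million tokens is an assumption, not a value from this notebook; check OpenAI's current pricing page before relying on it. Under that assumption, $0.15 corresponds to about 1.5 million tokens, or roughly 43 tokens per note.

```python
def estimate_embedding_cost(total_tokens, usd_per_million_tokens=0.10):
    """Estimate embedding cost in USD. The default price is an assumption."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


def implied_tokens(cost_usd, usd_per_million_tokens=0.10):
    """Back out how many tokens a given cost corresponds to at the assumed price."""
    return cost_usd / usd_per_million_tokens * 1_000_000


n_notes = 35_000            # "35,000+ notes" from the text above
reported_cost_usd = 0.15    # approximate cost from the text above

total_tokens = implied_tokens(reported_cost_usd)
print(f"Implied total tokens: {total_tokens:,.0f}")
print(f"Implied tokens per note: {total_tokens / n_notes:.1f}")
print(f"Cost check: ${estimate_embedding_cost(total_tokens):.2f}")
```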

In [None]:
!pip install GenCareAI
from GenCareAI.GenCareAIUtils import GenCareAISetup

setup = GenCareAISetup()

if setup.environment == 'Colab':
    !pip install -q langchain langchain-openai langchain-community chromadb datasets langchain-chroma

In [2]:
# Import necessary modules from Langchain and Hugging Face
import os
import pandas as pd
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.document_loaders import HuggingFaceDatasetLoader, DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

In [3]:
# Constants for dataset and storage paths
# Use local csv or HuggingFace dataset
# path_dataset = setup.get_file_path('data/gcai_notes.csv')
path_dataset = "ekrombouts/Gardenia_notes"

path_db_gcai = setup.get_file_path('data/chroma_db_gcai')
collection_name = 'Gardenia'
model = 'text-embedding-ada-002'

In [None]:
def load_documents(path):
    """Load the dataset from a local CSV file if the path points to one, otherwise from Hugging Face."""
    if os.path.isfile(path):
        # Local CSV file
        df = pd.read_csv(path)
        loader = DataFrameLoader(df, page_content_column='note')
    else:
        # Anything else is treated as a Hugging Face dataset name
        loader = HuggingFaceDatasetLoader(path=path,
                                          page_content_column='note',
                                          # use_auth_token=setup.get_hf_token()
                                          )
    return loader.load()

documents = load_documents(path=path_dataset)

In [None]:
def split_documents(documents: list[Document]):
    """Split large text documents into manageable chunks for better handling by ML models."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100)
    chunks = text_splitter.split_documents(documents)
    # Index each chunk to maintain unique identifiers
    for idx, chunk in enumerate(chunks):
        chunk.metadata['id'] = str(idx)
    return chunks

chunks = split_documents(documents=documents)

print(len(documents))
print(len(chunks))
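
To illustrate what `chunk_size=800` and `chunk_overlap=100` mean, here is a simplified sliding-window sketch in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to split on separators such as paragraph and line breaks; the function name `sliding_window_chunks` is hypothetical and only shows how consecutive chunks share an overlapping region.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# A synthetic 2000-character text, long enough to need several chunks
long_text = "".join(str(i % 10) for i in range(2000))
pieces = sliding_window_chunks(long_text)
print([len(piece) for piece in pieces])

# A short text fits in a single chunk and is returned unchanged
print(sliding_window_chunks("Short care note."))
```

A text shorter than `chunk_size` comes back as a single chunk, so for a dataset of short notes the number of chunks equals the number of documents.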

In [None]:
def initialize_vectordb(persist_directory, embedding_function, collection_name):
    """Initialize the Chroma vector database, either loading an existing one or creating a new one."""
    # The same call covers both cases, so no check on persist_directory is needed.
    return Chroma(persist_directory=persist_directory,
                  embedding_function=embedding_function,
                  collection_name=collection_name)

# Initialize vector database, using OpenAI embeddings
embedding = OpenAIEmbeddings(api_key=setup.get_openai_key(), model=model)
vectordb = initialize_vectordb(path_db_gcai, embedding, collection_name)

# If you get an error, run this cell again. (TODO: fix it)

In [None]:
# Modified from https://github.com/pixegami/rag-tutorial-v2/blob/main/populate_database.py
def add_new_documents(vectordb, documents, batch_size=5000):
    """Add new documents to the database"""

    def load_existing_ids(vectordb):
        """Fetch existing document IDs from the database to avoid duplicates."""
        try:
            existing_items = vectordb.get(include=[])
            existing_ids = set(existing_items["ids"])
        except Exception:
            existing_ids = set()
        return existing_ids

    existing_ids = load_existing_ids(vectordb)
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_documents = [
        document for document in documents
        if document.metadata["id"] not in existing_ids
    ]

    if new_documents:
        print(f"Total new documents to add: {len(new_documents)}")

        # Process documents in batches
        for i in range(0, len(new_documents), batch_size):
            batch = new_documents[i:i + batch_size]
            batch_ids = [document.metadata["id"] for document in batch]
            vectordb.add_documents(batch, ids=batch_ids)
            print(f"Added batch {i//batch_size + 1} with {len(batch)} documents")
    else:
        print("No new documents to add")

add_new_documents(vectordb, chunks, batch_size=1000)
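
The deduplication and batching logic can be tried out without an API key using an in-memory stand-in. In this sketch `ToyDocument`, `ToyVectorStore` and `add_missing` are hypothetical names; the store only mimics the `get` and `add_documents` calls used above and computes no embeddings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical minimal document: text plus a metadata dict holding an 'id'."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyVectorStore:
    """Hypothetical in-memory stand-in for the vector database (no embeddings)."""

    def __init__(self):
        self._docs = {}

    def get(self):
        return {"ids": list(self._docs)}

    def add_documents(self, documents, ids):
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc


def add_missing(store, documents, batch_size=2):
    """Add only documents whose ID is not yet stored, in batches. Returns the number added."""
    existing_ids = set(store.get()["ids"])
    new_docs = [doc for doc in documents if doc.metadata["id"] not in existing_ids]
    for i in range(0, len(new_docs), batch_size):
        batch = new_docs[i:i + batch_size]
        store.add_documents(batch, ids=[doc.metadata["id"] for doc in batch])
    return len(new_docs)


store = ToyVectorStore()
docs = [ToyDocument(f"note {i}", {"id": str(i)}) for i in range(5)]

first_run = add_missing(store, docs)   # everything is new
second_run = add_missing(store, docs)  # nothing left to add
print(first_run, second_run)
```

Because the IDs are assigned by chunk position, this check only avoids duplicates as long as the dataset and the splitting parameters stay the same between runs.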