## Assignment Week 6

### RAG using Wikipedia

RAG is a technique of optimizing the output of a large language models that involves referencing a knowledge outside of its training data sources before generating a response. RAG can be used to augment the LLM knowledge with information that was not part of the original training data. In the below activity RAG approach is used with OpenAPI LLM model (chatgp-4o-mini) which have a knowledge cut-off date of `July-2024`. Hence, gpt-4o-mini model will not be able to accurately answer information about events that happened after the `cut-off` date. However, it may still make up some answer even it was not part of training data, but that may be false or not trustworthy. Using, RAG approach relevant and trustworthy response can be received by passing the additional data in the context.

For this exercise, question about wildfires that happened in January 2025 will be sent to LLM `without RAG approach`, and the same questions will be again asked by providing information in the context about California 2025 wildfires from Wikipedia web page `January 2025 Souther California` using `RAG approach`.  Then the responses will be investigated.

In [69]:
# !pip install lark

In [1]:
import json
import os
import re
import spacy
import lark

In [2]:
# get api key from file
with open("../../../apikeys/openai-keys.json", "r") as key_file:
    api_key = json.load(key_file)["default_api_key"]
os.environ["OPENAI_API_KEY"] = api_key

In [9]:
nlp = spacy.load("en_core_web_md")

In [3]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from uuid import uuid4,uuid5
import uuid
from langchain_core.documents import Document
import tiktoken as tk

The chunks are converted to vector embeddings and stored in a vector database which enables retrieval of relevant information from the knowledge base matching the user query and sent as context in the prompt to LLM.

Open AI's [text-embedding-3-small][1] embedding model is used to generate the embeddings. it is important to note that this model is not Gen AI model, instead it is used to generates vector embedings. 

[1]:https://platform.openai.com/docs/guides/embeddings

In [4]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [5]:

vector_store = Chroma(
    collection_name="ehr_collection",
    embedding_function=embeddings,
    persist_directory="./vector-stores/chroma_db",
)

##### Load PDFs page

In [14]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import MarkdownHeaderTextSplitter

In [15]:
def extract_metadata(filename):
    return filename.split("_")[0].lower(),  str(filename.split("_")[-1].split(".")[0])


In [16]:
pdf_files = [fnames for fnames in os.listdir(os.path.join(os.getcwd(),"pdfs")) if re.search(".pdf",fnames)]
pdf_files

['Alexis Sparks_EHR_20250215.pdf']

In [17]:
pat_name, dos = extract_metadata(pdf_files[0])
pat_name,dos

('alexis sparks', '20250215')

In [18]:
loader = PyPDFLoader(os.path.join(os.getcwd(),"pdfs",pdf_files[0]))

In [19]:
pages = []
async for page in loader.alazy_load():
    pages.append(page)


#### use spacy for PHI detection

In [20]:
# print(f"{pages[0].metadata}\n")
# print(pages[0].page_content)
print(len(pages))

2


In [21]:
pages[1].page_content

"* Metformin 500 mg twice daily\n### Health Conditions\n* Type 2 Diabetes (recently diagnosed)\n* Obstructive Sleep Apnea (OSA)\n### Recent Diagnosis\n* Morbid Obesity\n* Essential Hypertension\n* Type 2 Diabetes\n### Personal Health Assessment\n* Lifestyle:\n    * Exercises: Rarely; sedentary lifestyle\n    * Diet: High in carbohydrates and sugars; minimal fruits and vegetables\n    * Sleep: 5-6 hours per night, often interrupted\n    * Stress: Moderate to high due to work and personal life\n* Tobacco Use: Never smoked\n* Alcohol Use: Social drinker, approximately 2-3 drinks per week\n* Occupation: Office manager, primarily desk job with limited physical activity\n### Physician's Notes\n* Patient presents with a significant risk for cardiovascular disease given his morbid obesity and\nhypertension. Discussed the importance of lifestyle modifications including diet changes and\nincreased physical activity. Referral to a dietitian provided to assist with meal planning.\nRecommended foll

In [33]:
def redact_dob(page_content):
    match = None
    if re.search("DOB",page_content):
        match = re.search("DOB",page_content)
    elif re.search("Date Of Birth",page_content):
        match = re.search("Date Of Birth",page_content)
    else:
        match = None
    if match:
        dob_start = match.span()[1]+1
        dob = page_content[dob_start: dob_start+11]
        page_content = (page_content[0:dob_start]
                        +  "<PHI>"
                        + "".join(["X" for _ in page_content[dob_start: dob_start+11]])
                        + "</PHI> "
                        + page_content[dob_start+11:]
                       )
    return page_content

In [23]:
def redact_name(page_content):
    match = re.search("Name",page_content)
    if match:
        name_split  = page_content[match.span()[0]+6:].split("*")
        name_length = len(name_split[0])
        page_content = (page_content[0:match.span()[0]+6] + 
                        "<PHI>" + 
                        "".join(["X" for _ in range(name_length)]) 
                        + "</PHI> "
                        + " ".join([ x for x in name_split[1:]])
                       )
    return page_content


In [24]:
def redact_name_references(page_content):
    doc = nlp(page_content.replace("\n",""))
    
    for ent in doc.ents:
        # print(ent.text, ent.start_char, ent.end_char, ent.label_)
        if ent.label_ in ["PERSON"]:
            
            page_content = (page_content[0:ent.start_char]
                    +  "<PHI>"
                    + "".join(["X" for _ in page_content[ent.start_char:ent.end_char] ])
                    + "</PHI> "
                    + page_content[ent.end_char+11:]
                   )
        
    return page_content  

In [25]:
# redacted_pages = []
# for each_page in pages:
#     page_content = each_page.page_content
#     page_content = redact_dob(page_content)
#     page_content = redact_name(page_content)
#     page_content = redact_name_references(page_content)
#     redacted_pages.append(page_content)

In RAG approach, the additional infomation from knowledge base is first chunked (splitted) and stored as vector embeddings in a vector database.

1. Perform Chunking

In [26]:

text_splitter = RecursiveCharacterTextSplitter(
    separators=["###"],
    chunk_size = 100,
    chunk_overlap  = 10,
    length_function = len,
    is_separator_regex = False,
)

all_splits = text_splitter.split_documents(pages)
print(f"Number of splits : {len(all_splits)}")

Number of splits : 10


In [27]:
all_splits[0:5]

[Document(metadata={'source': '/Users/biswajitmac/Documents/Biswajit/github/DSC670/ehr/pdfs/Alexis Sparks_EHR_20250215.pdf', 'page': 0}, page_content='* Name: Alexis Sparks\n* DOB: 1971-01-25\n* Age: 38\n* Gender: Male\n* Patient ID: 123456'),
 Document(metadata={'source': '/Users/biswajitmac/Documents/Biswajit/github/DSC670/ehr/pdfs/Alexis Sparks_EHR_20250215.pdf', 'page': 0}, page_content='### Vital Signs (Recorded on 2023-10-20)\n* Blood Pressure: 145/95 mmHg\n* Heart Rate: 82 bpm\n* Respiratory Rate: 18 breaths/min\n* Temperature: 98.6 °F\n* Oxygen Saturation (SpO2): 98%\n* BMI: 42.1 kg/m²\n'),
 Document(metadata={'source': '/Users/biswajitmac/Documents/Biswajit/github/DSC670/ehr/pdfs/Alexis Sparks_EHR_20250215.pdf', 'page': 0}, page_content='### Family History\n* Father: Hypertension, Type 2 Diabetes\n* Mother: Morbid Obesity, Hyperlipidemia\n* Siblings: One brother with obesity-related health issues\n'),
 Document(metadata={'source': '/Users/biswajitmac/Documents/Biswajit/github/

It is also a good idea to get the number of tokens chunks may consume.

In [30]:
def get_num_token(split):
    encoding = tk.get_encoding("cl100k_base")
    encoded_string = encoding.encode(split)
    print(f"Number of total tokens: {len(encoded_string)}")

In [46]:
def load_document_embeddings(chunk, pat_name, dos, split_num):
    document_1 = Document(
    page_content=chunk,
    metadata={
        "patient_name": pat_name,
        "date_of_service":dos,
        "version":"1.0"
    },
    # id=hash(f"{pat_name}-{dos}-{split_num}"),
    id = uuid5(uuid.NAMESPACE_URL,f"{pat_name}-{dos}-{split_num}")
    )
    
    documents = [
    document_1,
    ]
    print("--"*50)
    print(document_1.metadata)
    print(document_1.page_content)
    print(document_1.id)
    print("--"*50)
    return_ids = vector_store.add_documents(documents=documents)
    print(f"Document id added: {return_ids}")

2. Generate embeddings and store embeddings in vector database

In [47]:
for split_num, each_page in enumerate(all_splits):
    page_content = each_page.page_content
    page_content = redact_dob(page_content)
    page_content = redact_name(page_content)
    page_content = redact_name_references(page_content)
    get_num_token(page_content)
    load_document_embeddings(page_content, pat_name, dos, split_num)

Number of total tokens: 46
----------------------------------------------------------------------------------------------------
{'patient_name': 'alexis sparks', 'date_of_service': '20250215', 'version': '1.0'}
* Name: <PHI>XXXXXXXXXXXXXX</PHI>  DOB:<PHI>XXXXXXXXXXX</PHI> 
  Age: 38
  Gender: Male
  Patient ID: 123456
77c69a1c-e938-53af-b662-7578face43cc
----------------------------------------------------------------------------------------------------
Document id added: ['77c69a1c-e938-53af-b662-7578face43cc']
Number of total tokens: 82
----------------------------------------------------------------------------------------------------
{'patient_name': 'alexis sparks', 'date_of_service': '20250215', 'version': '1.0'}
### Vital Signs (Recorded on 2023-10-20)
* Blood Pressure: 145/95 mmHg
* Heart Rate: 82 bpm
* Respiratory Rate: 18 breaths/min
* Temperature: 98.6 °F
* Oxygen Saturat<PHI>XXXX</PHI> 
* BMI: 42.1 kg/m²

f30a4fe8-cacb-5934-ad3e-5c4b332660b3
--------------------------------

In [52]:
results = vector_store.similarity_search(
    "do patient ever dignosed with diabetes",
    k=2,
    filter={"patient_name": "alexis sparks"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* ### Recent Diagnosis
* Morbid Obesity
* Essential Hypertension
* Type 2 Diabetes [{'date_of_service': '20250215', 'patient_name': 'alexis sparks', 'version': '1.0'}]
* ### Health Conditions
* Type 2 Diabetes (recently diagnosed)
* Obstructive Sleep Apnea (OSA) [{'date_of_service': '20250215', 'patient_name': 'alexis sparks', 'version': '1.0'}]


In [54]:
filter = {
    "$and": [
        {
            "patient_name": {
                "$eq": "alexis sparks"
            }
        },
        {
            "date_of_service": {
                "$eq": "20250215"
            }
        }
    ]
}

results = vector_store.similarity_search(
    "do patient ever dignosed with diabetes",
    k=2,
    filter=filter,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* ### Recent Diagnosis
* Morbid Obesity
* Essential Hypertension
* Type 2 Diabetes [{'date_of_service': '20250215', 'patient_name': 'alexis sparks', 'version': '1.0'}]
* ### Health Conditions
* Type 2 Diabetes (recently diagnosed)
* Obstructive Sleep Apnea (OSA) [{'date_of_service': '20250215', 'patient_name': 'alexis sparks', 'version': '1.0'}]


### Vector DB is loaded, now use Retriver

In [6]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

#### With RAG

In [4]:
# from langchain.chains import RetrievalQA
# from langchain.prompts import PromptTemplate
# import pprint

In [14]:
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer."""

template = "You are a bot that answers questions from a patients electronic health ecord or charts.\n\
            If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\
            {context}\n\
            Question: {question}"

custom_rag_prompt = PromptTemplate.from_template(
    template=template,
    # input_variables=["context", "question"]
)

In [10]:
llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")

In [46]:
question = input("Enter your question: ")

Enter your question:  whats is the blood pressure of Name:alexis sparks dos:20250215


In [47]:
name = question.lower().split("name:")[-1].split("\n")[0].split("dos:")[0].strip()

In [48]:
dos = question.lower().split("name:")[-1].split("\n")[0].split("dos:")[-1].split("\n")[0].strip()

In [51]:
formatted_question = question.lower().split("name:")[0]

In [52]:
name,dos,formatted_question

('alexis sparks', '20250215', 'whats is the blood pressure of ')

In [64]:
# retrieved_docs = vector_store.similarity_search(question)
filter = {
    "$and": [
        {
            "patient_name": {
                "$eq": name
            }
        },
        {
            "date_of_service": {
                "$eq": str(dos)
            }
        }
    ]
}

retrieved_docs = vector_store.similarity_search(
    question,
    k=2,
    filter=filter,
)
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
doc_source = [doc.page_content for doc in retrieved_docs]
# docs_content
# custom_rag_prompt.format(context=docs_content, question=formatted_question)
prompt = custom_rag_prompt.invoke({"question": formatted_question, "context": docs_content})
answer = llm.invoke(prompt)
# prompt
# custom_rag_prompt

"You are a bot that answers questions from a patients electronic health ecord or charts.\n            If you don't know the answer, just say that you don't know, don't try to make up an answer.\n            ### Vital Signs (Recorded on 2023-10-20)\n* Blood Pressure: 145/95 mmHg\n* Heart Rate: 82 bpm\n* Respiratory Rate: 18 breaths/min\n* Temperature: 98.6 °F\n* Oxygen Saturat<PHI>XXXX</PHI> \n* BMI: 42.1 kg/m²\n\n\n### Current Medications\n* Lisinopril 20 mg daily\n            Question: whats is the blood pressure of "

In [59]:
answer.content

'The blood pressure recorded is 145/95 mmHg.'

In [7]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

In [8]:
metadata_field_info = [
    AttributeInfo(
        name="patient_name",
        description="The name of the patient",
        type="string",
    ),
    AttributeInfo(
        name="date_of_service",
        description="The date when patient receied services or service date or date of service in yyymmdd format",
        type="string",
    ),
]
document_content_description = "chart or electronic health record of patients"

In [11]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vector_store,
    document_content_description,
    metadata_field_info,
)

In [13]:
docs = retriever.invoke("What is th blood pressure of alexis sparks".lower())

In [None]:
docs_content = "\n\n".join(doc.page_content for doc in docs)
# docs_content
# custom_rag_prompt.format(context=docs_content, question=formatted_question)
prompt = custom_rag_prompt.invoke({"question": formatted_question, "context": docs_content})
answer = llm.invoke(prompt)
# prompt
# custom_rag_prompt

In [None]:
# qa_with_source = RetrievalQA.from_chain_type(
#     llm=llm,
#     chain_type="stuff",
#     retriever=store.as_retriever(),
#     chain_type_kwargs={"prompt": PROMPT, },
#     return_source_documents=True,
# )

In [None]:
# pprint.pprint(
#     qa_with_source("How many people were killed in southern california wildfire in January 2025?")
# )

RAG approach is working as expected. LLM was able to accurately answer the question that southern California wildfires in January 2025 "killed at least 27 people". As we see above that the prompt includes the information about the southern california wildfires from the Wikipedia in the `context`. The `retriever` first retrived the context information from the vector database that closely matches the user question and then included it as part of the prompt for the LLM call.

#### Without RAG

In [None]:


# document_1 = Document(
#     page_content=page_content,
#     metadata={
#         "patient_name": pat_name,
#         "date_of_service":dos,
#         "version":"1.0"
#     },
#     id=hash(f"{pat_name}-{dos}-{split_num}"),
# )

# document_2 = Document(
#     page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
#     metadata={"source": "news"},
#     id=2,
# )

# document_3 = Document(
#     page_content="Building an exciting new project with LangChain - come check it out!",
#     metadata={"source": "tweet"},
#     id=3,
# )

# document_4 = Document(
#     page_content="Robbers broke into the city bank and stole $1 million in cash.",
#     metadata={"source": "news"},
#     id=4,
# )

# document_5 = Document(
#     page_content="Wow! That was an amazing movie. I can't wait to see it again.",
#     metadata={"source": "tweet"},
#     id=5,
# )

# document_6 = Document(
#     page_content="Is the new iPhone worth the price? Read this review to find out.",
#     metadata={"source": "website"},
#     id=6,
# )

# document_7 = Document(
#     page_content="The top 10 soccer players in the world right now.",
#     metadata={"source": "website"},
#     id=7,
# )

# document_8 = Document(
#     page_content="LangGraph is the best framework for building stateful, agentic applications!",
#     metadata={"source": "tweet"},
#     id=8,
# )

# document_9 = Document(
#     page_content="The stock market is down 500 points today due to fears of a recession.",
#     metadata={"source": "news"},
#     id=9,
# )

# document_10 = Document(
#     page_content="I have a bad feeling I am going to get deleted :(",
#     metadata={"source": "tweet"},
#     id=10,
# )

# documents = [
#     document_1,
#     # document_2,
#     # document_3,
#     # document_4,
#     # document_5,
#     # document_6,
#     # document_7,
#     # document_8,
#     # document_9,
#     # document_10,
# ]
# uuids = [str(uuid4()) for _ in range(len(documents))]

# vector_store.add_documents(documents=documents, ids=uuids)
# vector_store.add_documents(documents=documents, ids=uuids)

Question 2: How many widfires have affected the Los Angeles metropolitan area in 2025?

In [None]:
messages=[
        {
            "role": "system", 
            "content": f"You are a bot that answers questions about January 2025 Southern California wildfires.\n\
            If you donot know the answer, simply state that you donot know."
        },
        {
            "role": "user", 
            "content": f"Q: How many widfires have affected the Los Angeles metropolitan area in 2025?\nA:"
        }
    ]

In [None]:
response = openai_model.invoke(messages)

In [None]:
print(f"Response from LLM without RAG:\n {response.content}")

Same observation as question 1, OpenAI LLM was not able to answer the question. 

#### With RAG

In [None]:
pprint.pprint(
    qa_with_source("How many widfires have affected the Los Angeles metropolitan area in 2025?")
)

Received the corcect response from LLM about number of wildfires effecting LA in 2025 because the information about the number of wildfires effecting the LA area in 2025 is passed in the context of the prompt which is sent to LLM.

Apart from RAG's capability to provide reponses by using information from a knowledge base, the above responses also includes the source document from which the response were based. This is another important factor in RAG approach which allows the user to trust the answer and make sure the response is grounded in the knowledge base.