# chunk_and_load_data.ipynb

This notebook contains code to load data into a Qdrant database for the CareCompanion app. You must first run `fetch_data.ipynb` to scrape and format the data. You must also create API keys for OpenAI and Anthropic and store them in `../app/.env` as `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`.

Import environment variables

In [1]:
from dotenv import load_dotenv
import os

load_dotenv('../app/.env')

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
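
`os.getenv` returns `None` when a variable is missing, so a misplaced `.env` file only surfaces later as an authentication error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The two keys this notebook reads after load_dotenv()
required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes a wrong path or an empty key visible before any API call is attempted.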

Load the document corpus from a file (generated by `fetch_data.ipynb`).

In [3]:
myfile = "source_documents.json"

import json
from langchain.schema import Document

# Load JSON data
with open(myfile, 'r') as file:
    data = json.load(file)

# Convert JSON data into a list of LangChain Document objects
docs = [
    Document(page_content=item["page_content"], metadata=item["metadata"])
    for item in data
]

print(f"loaded {len(docs)} docs")

loaded 216 docs
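
The loading cell assumes every JSON record carries a `page_content` string and a `metadata` dictionary. A small sketch of a pre-check is shown here, in case the scraped file contains malformed entries; `find_bad_records` is a hypothetical helper and the sample records are made up for illustration.

```python
def find_bad_records(data):
    """Return the indexes of records that lack the expected shape.

    A valid record is a dict with a string `page_content` and a dict `metadata`.
    """
    bad = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            bad.append(index)
        elif not isinstance(item.get("page_content"), str):
            bad.append(index)
        elif not isinstance(item.get("metadata"), dict):
            bad.append(index)
    return bad


# Illustrative records only, not taken from source_documents.json
sample = [
    {"page_content": "some text", "metadata": {"url": "https://example.com"}},
    {"page_content": None, "metadata": {}},
    {"metadata": {}},
]
print(find_bad_records(sample))
```

An empty result means the list comprehension that builds `Document` objects should not raise a `KeyError`.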


Split the documents into chunks of up to 1,500 characters with a 150-character overlap, a size that works for most embedding models

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,       # maximum number of characters per chunk
    chunk_overlap=150,     # characters shared between neighboring chunks
)

split_docs = []

for doc in docs:

    splits = text_splitter.split_text(doc.page_content)
    for i,split in enumerate(splits):
        metadata_with_chunk = {**doc.metadata, "chunk_id": i}
            
        # Create the document with the updated metadata
        split_doc = Document(page_content=split, metadata=metadata_with_chunk)
        split_docs.append(split_doc)

print(f"len(docs): {len(docs)}, len(split_docs):{len(split_docs)}")
print(split_docs[0])

len(docs): 216, len(split_docs):1349
page_content='alzheimer's disease and dementia | alzheimer's disease and dementia | cdc     alzheimer's disease and dementia alzheimer's basics learn about signs and symptoms of alzheimer's disease and who is affected. aug. 15, 2024 dementia basics learn about common types of dementia, signs and symptoms, and risk factors. aug. 17, 2024 signs and symptoms of alzheimer's learn how to recognize the early signs of alzheimer's disease. signs and symptoms of dementia learn what early signs and symptoms of dementia to look out for. tools and resources find a variety of resources about alzheimer’s disease and healthy aging. reducing risk learn what lifestyle behaviors can reduce the risk of developing dementia. additional topics healthy aging at any age information to help you stay healthy and strong throughout your life. sept. 3, 2024 alzheimer's disease program evidence-based, scientific information to educate, inform, and assist translating research int
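
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and spaces); it only illustrates how an overlap repeats the tail of one chunk at the head of the next. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny values so the overlap is easy to see by eye
print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With the notebook's settings, each window would advance by 1,350 characters, so the last 150 characters of a chunk reappear at the start of the following one.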

Set up embeddings - we'll use OpenAI's text-embedding-3-large

In [5]:
from langchain_openai import OpenAIEmbeddings
embedding_model = "text-embedding-3-large"
openai_embeddings = OpenAIEmbeddings(
    model=embedding_model,
    openai_api_key=OPENAI_API_KEY  
)


Use an additional chunking strategy (Semantic chunking)

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from tqdm import tqdm

semantic_text_splitter = SemanticChunker(openai_embeddings,
    breakpoint_threshold_type="percentile")

semantic_split_docs = []

for doc in tqdm(docs):

    splits = semantic_text_splitter.split_text(doc.page_content)
    for i,split in enumerate(splits):
        metadata_with_chunk = {**doc.metadata, "chunk_id": i}
            
        # Create the document with the updated metadata
        semantic_split_doc = Document(page_content=split, metadata=metadata_with_chunk)
        semantic_split_docs.append(semantic_split_doc)

print(f"len(docs): {len(docs)}, len(semantic_split_docs):{len(semantic_split_docs)}")
print(semantic_split_docs[0])
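
The idea behind percentile-based semantic chunking can be sketched without any API calls: measure the distance between the embeddings of neighboring sentences, then start a new chunk wherever that distance exceeds a chosen percentile of all distances. The code below is a simplified illustration under that assumption, not the `SemanticChunker` implementation; the helper names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_at_breakpoints(sentences, vectors, pct=95):
    """Group sentences, breaking where neighbor distance exceeds the percentile."""
    if not sentences:
        return []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    if not distances:
        return [" ".join(sentences)]

    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy vectors: the first two sentences point one way, the last two another
toy_sentences = ["a", "b", "c", "d"]
toy_vectors = [[1, 0], [1, 0.1], [0, 1], [0.1, 1]]
print(split_at_breakpoints(toy_sentences, toy_vectors))
```

Because the threshold is relative to each document's own distance distribution, chunk lengths vary with the text, unlike the fixed-size strategy.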

Let's add the docs to a vector store. Make sure Qdrant is running first (see README.md for more details). We can create each collection once and reuse it after that.

In [6]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

url="http://localhost:6333"

client = QdrantClient(url=url, prefer_grpc=True)
print(client.get_collections())

collection_name_fixed = "DementiaCare_Fixed"
collection_name_semantic = "DementiaCare_Semantic"

collections=[CollectionDescription(name='DementiaCare_Semantic'), CollectionDescription(name='PottyTraining'), CollectionDescription(name='DementiaCare_Fixed')]


In [None]:

# Drop any existing collection so it is rebuilt from scratch
client.delete_collection(collection_name_fixed)

qdrant_vector_store_fixed = None  # stays None if creation fails
try:
    qdrant_vector_store_fixed = QdrantVectorStore.from_documents(
        split_docs,
        openai_embeddings,
        url=url,
        prefer_grpc=True,
        collection_name=collection_name_fixed,
    )
except Exception as e:
    print(f"Encountered error creating vector store: {e}")

if qdrant_vector_store_fixed is not None:
    print(f"Created vector store {collection_name_fixed}")


In [None]:

# Drop any existing collection so it is rebuilt from scratch
client.delete_collection(collection_name_semantic)

qdrant_vector_store_semantic = None  # stays None if creation fails
try:
    qdrant_vector_store_semantic = QdrantVectorStore.from_documents(
        semantic_split_docs,
        openai_embeddings,
        url=url,
        prefer_grpc=True,
        collection_name=collection_name_semantic,
    )
except Exception as e:
    print(f"Encountered error creating vector store: {e}")

if qdrant_vector_store_semantic is not None:
    print(f"Created vector store {collection_name_semantic}")

print(client.get_collections())


Make sure we can load the vector stores from the existing collections too

In [7]:
# If the collections already exist, just load them
from langchain_qdrant import QdrantVectorStore
url="http://localhost:6333"


store_fixed = QdrantVectorStore.from_existing_collection(
    embedding=openai_embeddings,
    collection_name=collection_name_fixed,
    url=url
)

store_semantic = QdrantVectorStore.from_existing_collection(
    embedding=openai_embeddings,
    collection_name=collection_name_semantic,
    url=url
)

Test the retriever by itself

In [9]:
from langchain.retrievers import EnsembleRetriever

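# k must be passed through search_kwargs; a bare k=10 is not forwarded to the search
similarity_retriever_semantic = store_semantic.as_retriever(search_kwargs={"k": 10})
similarity_retriever_fixed = store_fixed.as_retriever(search_kwargs={"k": 10})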

retriever = EnsembleRetriever(retrievers=[similarity_retriever_semantic,similarity_retriever_fixed])
results = retriever.invoke("How does stress impact dementia caregivers?")

for result in results: print(result)

page_content='Levine, and S. Samis, “Home Alone: Family Caregivers Providing Complex Chronic Care,” AARP Public Policy Institute & United Hospital Fund, 2012. 2 Considering all of the responsibilities that dementia caregivers often shoulder, it is of no surprise that the Burden of Care Index2 shows them as one of the more burdened groups of caregivers. Nearly half of dementia caregivers are in a high-burden situation. Dementia caregivers are not the most-burdened group—for example, cancer caregivers are more likely to be in high-burden care relationships (62 percent).3 However, whereas cancer caregiver relationships are short and episodic, dementia caregiver relationships tend to be longer: nearly seven in ten (69 percent) dementia caregivers have provided care for more than a year, and three in ten have provided care for more than five years. This high burden of care over a longer period can take a significant mental and physical toll on dementia caregivers. Nearly half of dementia ca
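
`EnsembleRetriever` is documented as merging the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The sketch below shows that scoring rule in isolation, assuming equal weights and the commonly used constant of 60; it is an illustration rather than the library's code, and the string identifiers stand in for retrieved chunks.

```python
def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Merge several ranked lists into one, best score first.

    Each appearance of an item contributes weight / (rank + c), with rank starting at 1.
    """
    if not ranked_lists:
        return []
    if weights is None:
        weights = [1.0 / len(ranked_lists)] * len(ranked_lists)

    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)

    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the semantic and fixed-size retrievers
semantic_hits = ["a", "b", "c"]
fixed_hits = ["b", "d"]
print(reciprocal_rank_fusion([semantic_hits, fixed_hits]))
```

An item returned by both retrievers, such as `"b"` here, accumulates two contributions and therefore tends to rise to the top of the merged list.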

Test it out in a simple RAG chain: Create a prompt, initialize an LLM, and then use the retriever in a chain

In [10]:
from langchain_core.prompts import PromptTemplate

RAG_PROMPT_TEMPLATE = """
You are an empathetic, kind assistant that specializes in helping informal caregivers of dementia and Alzheimer's patients
navigate the stresses and questions of everyday life. Answer the question based on the context. If the answer is not
in the context, say you don't know. Be concise and conversational, and answer in language that a high school 
graduate with no specialized training can understand. 

You must never give medical, legal, or financial advice. Always make sure the 
user contacts a professional if it is an emergency or if they need medical advice.

<context>
{context}
</context>

<question>
{query}
</question>
"""

rag_prompt = PromptTemplate.from_template(RAG_PROMPT_TEMPLATE)
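
The `{context}` placeholder is filled with whatever value the chain supplies for it. One common option, shown here only as a sketch, is to join the text of the retrieved chunks into a single string before formatting the template. `Doc` is a hypothetical stand-in for LangChain's `Document`, `format_context` is a hypothetical helper, and plain `str.format` is used in place of `PromptTemplate` so the example runs without any dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, separator="\n\n"):
    """Join the text of each chunk, dropping surrounding whitespace."""
    return separator.join(doc.page_content.strip() for doc in docs)


# Shortened template with the same two placeholders as RAG_PROMPT_TEMPLATE
template = "<context>\n{context}\n</context>\n\n<question>\n{query}\n</question>"

chunks = [Doc("First chunk. "), Doc(" Second chunk.")]
print(template.format(context=format_context(chunks), query="Example question?"))
```

Joining the text this way keeps metadata and object representations out of the prompt, at the cost of one extra step in the chain.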

In [11]:
from langchain_anthropic import ChatAnthropic

haiku_model_id = "claude-3-haiku-20240307" # cheaper, so we use it for prototyping; the app itself uses Claude 3.5 Sonnet
claude_3_5_sonnet_model_id = "claude-3-5-sonnet-20240620"

llm = ChatAnthropic(
    model=haiku_model_id,    
    anthropic_api_key=ANTHROPIC_API_KEY
)

In [12]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

# Prototype a simple function that appends a deduplicated list of source links
def add_sources(context: list[Document]) -> str:
    sources_str = ""
    if len(context) > 0:
        i = 1
        sources_str = "Sources: "
        seen_urls = set()

        for doc in context:
            url = doc.metadata.get("url")
            # Skip chunks without a URL and URLs that were already listed
            if url and url not in seen_urls:
                seen_urls.add(url)
                sources_str += f'[<a href="{url}">{i}</a>] '
                i += 1

    return sources_str

# standard RAG that passes the context through
max_context = 4
rag_chain = (
    {"context": itemgetter("query") | retriever | (lambda docs: docs[:max_context]), "query": itemgetter("query")} 
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm, "context": itemgetter("context")}
)

In [13]:
async def answer_question(user_input):
    answer = await rag_chain.ainvoke(input={'query':user_input})
    answer["response"].pretty_print()
    print(add_sources(answer["context"]))   
    return answer["response"],answer["context"]


In [14]:
_, context = await answer_question("who is a typical caregiver?")


Based on the context provided, a typical dementia caregiver has the following characteristics:

- Majority are women (58%)
- Average age is 54 years old, about 6 years older than non-dementia caregivers
- Two-thirds are 50 years or older, and one-quarter are 65 or older
- About two-thirds are non-Hispanic white
- 61% have less than a college degree
- Median household income is around $58,200 per year, close to the national median

The context also provides some insights into millennial (ages 18-34) dementia caregivers, who tend to:

- Care for a grandparent (44%) or parent (26%)
- Provide 18.5 hours of care per week, which is significantly less than older caregivers
- Perform 1.9 ADLs and 4.3 IADLs on average
- 57% do medical/nursing tasks
- 38% are the sole unpaid caregiver
- 81% are employed while caregiving
- 48% are African-American or Hispanic
- 47% have household income less than $50,000

So in summary, the typical dementia caregiver is middle-aged or older, predominantly female

In [15]:
await answer_question("i found an old man walking by the side of the road and he doesn't remember anything, what should i do?")
await answer_question("what are some early warning signs of dementia?")
await answer_question("my mom is having a bad day and wants the car keys, what do i do?")


Based on the context provided, here is what I would suggest if you encounter an older person who seems lost and confused:

First, try to stay calm and focus on helping the person safely. Approach them gently and ask if they need assistance. See if you can get any information from them, like their name or where they live. 

If they are unable to provide that, your next steps should be:

1. Contact the local police non-emergency number and report the situation. Explain that you believe the person may have dementia or memory issues and is lost. The police can help locate their home and loved ones.

2. If possible, try to keep an eye on the person and stay with them until the police arrive, without putting yourself at risk. Avoid leaving them alone.

3. Consider getting the person's photo and any other identifying details you can, in case the police need it to help find their family.

4. Do not try to transport the person yourself or take them to an unfamiliar location. Let the profession

(AIMessage(content='I\'m sorry to hear your mom is having a difficult day. Here are a few suggestions that may help:\n\nDon\'t argue with her about the car keys. That could escalate the situation. Instead, try to understand what need she is trying to meet by asking for the keys. She may feel a desire for independence or to go home. \n\nYou could gently acknowledge her feelings, saying something like "I know you want to go out right now. What is it about that that you\'re missing?" Then try to address the underlying need, perhaps by offering to go for a short walk together or looking at old photos to reminisce about home.\n\nAvoid directly refusing her the keys, as this could make her more upset. Instead, try to redirect her attention to something soothing or engaging. Ultimately, your goal is to provide comfort and meet her needs, not to control the situation.\n\nIf she becomes agitated or aggressive, leave the situation and seek help from other family members or professionals if neede