Set up tools to recursively load and clean our web-based sources.

- Alzheimer's Association Resources for Caregivers: https://www.alz.org/help-support/caregiving
- CDC resources for dementia: https://www.cdc.gov/alzheimers-dementia/index.html
- Mayo Clinic: https://www.mayoclinic.org/diseases-conditions/dementia/symptoms-causes/syc-20352013
- The Alzheimer's Society (international): https://www.alzheimers.org.uk
- WebMD: https://www.webmd.com/alzheimers/

Other data sources we could explore to expand the POC: 

- Alzheimer's Foundation of America: https://alzfdn.org/
- Dementia Society of America: https://www.dementiasociety.org
- Family Caregiver Alliance: https://www.caregiver.org

And top hits from Pubmed for a variety of specific search terms might be good: https://pubmed.ncbi.nlm.nih.gov/?term=dementia+caregiver.  

In [171]:
urls = ["https://www.cdc.gov/alzheimers-dementia",
        "https://www.cdc.gov/alzheimers-dementia/about",
        "https://www.cdc.gov/alzheimers-dementia/prevention",
        "https://www.cdc.giv/alzheimers-dementia/healthy-people-2030",
        "https://www.alz.org/help-support/caregiving",
        "https://www.mayoclinic.org/diseases-conditions/dementia/",
        "https://www.alzheimers.org.uk/about-dementia",
        "https://www.webmd.com/alzheimers"
]

In [172]:
from langchain_community.document_loaders import RecursiveUrlLoader
import requests

# This example uses `beautifulsoup4` and `lxml`
import re
from bs4 import BeautifulSoup

def custom_metadata_extractor(html: str, url: str, response: requests.Response):
    soup = BeautifulSoup(html, "lxml")
    # Extract the page title from the HTML
    title = soup.title.string if soup.title else "No Title"
    return {
        "url": url,
        "title": title
    }

def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    # Remove unwanted tags
    for tag in soup(['nav', 'footer', 'header', 'aside', 'script', 'style']):
        tag.decompose()
    # Extract text
    text = soup.get_text(separator=' ', strip=True)
    # Clean up whitespace
    clean_text = re.sub(r'\s+', ' ', text).strip()
    return clean_text


In [173]:
import random

docs = []
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    # Add more user agents as needed - this helps grab content from sites like mayo clinic and webmd that block bots
]

headers = {"User-Agent": random.choice(user_agents)}

for url in urls:
    loader = RecursiveUrlLoader(url,
                                extractor=bs4_extractor,
                                metadata_extractor=custom_metadata_extractor,
                                headers = headers,
                                max_depth=6,
                                use_async=True,
                                timeout=30
                                )
    new_docs = await loader.aload()
    docs.extend(new_docs)



In [174]:
pdfs = []
for doc in docs:
    if ".pdf" in doc.metadata.get("url",""):
        pdfs.append(doc.metadata.get("url"))

In [175]:
print(pdfs)

[]


In [176]:
unwanted_terms = ["learn more", "get involved with your local chapter","Find a support group near you","join our community",
                  "print this page","skip directly to site content","skip directly to search",
                  "an official website of the united states government", "sign up for email updates",
                  "here's how you know official websites use .gov a .gov website belongs to an official government organization in the united states. secure .gov websites use https a lock ( ) or https:// means you've safely connected to the .gov website. share sensitive information only on official, secure websites."]

docs = [doc for doc in docs if "es-mx" not in doc.metadata.get("url", "").lower()] # English only
docs = [doc for doc in docs if ".pdf" not in doc.metadata.get("url","")] # html only 
docs = [doc for doc in docs if "doctors-departments" not in doc.metadata.get("url","")]
docs = [doc for doc in docs if "care-at-mayo-clinic" not in doc.metadata.get("url","")]
docs = [doc for doc in docs if "cuidado" not in doc.metadata.get("title","")]
docs = [doc for doc in docs if "site.html" not in doc.metadata.get("url","")]
docs = [doc for doc in docs if ("page not found" not in doc.metadata.get("title").lower() and "404" not in doc.metadata.get("title") and "500" not in doc.metadata.get("title") and "403" not in doc.metadata.get("title"))] # failed pages
docs = [doc for doc in docs if "page not found" not in doc.page_content.lower()]
docs = [doc for doc in docs if "access denied" not in doc.page_content.lower()]
docs = [doc for doc in docs if "runtime server error" not in doc.page_content.lower()]

# deduplicate
unique_docs = {}
for doc in docs:
    url = doc.metadata.get("url")
    if url and url not in unique_docs:
        unique_docs[url] = doc
docs = list(unique_docs.values())

# tidy up
for doc in docs:
    content = doc.page_content.lower()
    for term in unwanted_terms:
        content = content.replace(term,"")
    doc.page_content=content

In [177]:
import hashlib

def hash_content(content):
    return hashlib.md5(content.encode('utf-8')).hexdigest()

unique_docs = {}
for doc in docs:
    content_hash = hash_content(doc.page_content)
    if content_hash not in unique_docs:
        unique_docs[content_hash] = doc

# Convert to list if needed
docs = list(unique_docs.values())


In [179]:
from pprint import pprint
print(len(docs))
for doc in docs:
    pprint(doc.metadata)
    pprint(doc.page_content)
    print("\n")

479
{'title': "Alzheimer's Disease and Dementia | Alzheimer's Disease and Dementia "
          '| CDC',
 'url': 'https://www.cdc.gov/alzheimers-dementia'}
("alzheimer's disease and dementia | alzheimer's disease and dementia | "
 "cdc     alzheimer's disease and dementia alzheimer's basics learn about "
 "signs and symptoms of alzheimer's disease and who is affected. aug. 15, 2024 "
 'dementia basics learn about common types of dementia, signs and symptoms, '
 "and risk factors. aug. 17, 2024 signs and symptoms of alzheimer's learn how "
 "to recognize the early signs of alzheimer's disease. signs and symptoms of "
 'dementia learn what early signs and symptoms of dementia to look out for. '
 'tools and resources find a variety of resources about alzheimer’s disease '
 'and healthy aging. reducing risk learn what lifestyle behaviors can reduce '
 'the risk of developing dementia. additional topics healthy aging at any age '
 'information to help you stay healthy and strong throughout y

In [98]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,       # Adjust based on your needs
    chunk_overlap=200,     # Overlap to maintain context
)
split_docs = text_splitter.split_documents(docs)

print(f"len(docs): {len(docs)}, len(split_docs):{len(split_docs)}")

len(docs): 113, len(split_docs):510


In [24]:
for i in range(4): print(split_docs[i])

page_content='you are caring for is stay physically and emotionally strong.    support groups support groups create a safe, confidential and supportive environment. find a support group near you.  daily care by using creativity and caregiving skills, you can adapt routines and activities as needs change.  activities learn how to modify activities to enhance quality of life.  communication and alzheimer's get strategies to help both you and the person with dementia communicate and connect.  daily care plan get tips on organizing the day, planning activities and creating a daily plan.  safety safety is important for everyone, but the need for a comprehensive safety plan becomes vital as dementia progresses.  care options & planning there is no one-size-fits all formula when it comes to alzheimer’s care. each family’s situation is unique. learn about care options in-home care in-home care allows a person with alzheimer's to stay in a familiar environment. it also can be of great assistanc

In [72]:
from langchain_qdrant import QdrantVectorStore
url="http://localhost:6333"

qdrant_vector_store = QdrantVectorStore.from_documents(
    docs,
    embeddings,
    url=url,
    prefer_grpc=True,
    collection_name="PottyTraining",
)

In [73]:
retriever = qdrant_vector_store.as_retriever(
    search_type="mmr",  # Options: 'similarity', 'mmr', etc.
    search_kwargs={"k": 5}     # Number of documents to retrieve
)
retriever.invoke("How is potty training boys different from potty training girls")

[Document(metadata={'language': None, 'title': '\n  Potty Training Boys and Girls – Potty Genius\n  ', 'content_type': 'text/html; charset=utf-8', 'source': 'https://pottygenius.com/blogs/blog/potty-training-differences-in-boys-and-girls', '_id': '88a3095a-aaa3-4c82-a333-7883f091f1c2', '_collection_name': 'PottyTraining'}, page_content='potty training boys and girls – potty genius                                             potty genius blog  games  shop ! potty genius blog — potty training boys — potty training girls — potty training methods  —  stories games  shop potty genius blog potty training boys and girls potty training is challenging regardless of your toddler’s gender. that said, potty training boys is a bit different than potty training girls. while it is obvious that males and females use the bathroom differently, there are some other distinct potty training differences parents may run into when potty training boys versus girls. by brittany tacket, ma brittany tackett is a 

In [80]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": itemgetter("query") | retriever, "query": itemgetter("query")} 
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm}
)

In [84]:
from pprint import pprint
answer = rag_chain.invoke(input={'query':"Should I use candy to help potty train my child?"})
pprint(answer)

{'response': 'Using candy to help potty train your child is a common practice, '
             "but it's not necessarily the most effective or recommended "
             'approach. Here are some things to consider:\n'
             '\n'
             '**Why candy might work:**\n'
             '\n'
             '1. **Temporary association**: Candy can associate the act of '
             'using the bathroom with something pleasant, like food.\n'
             '2. **Immediate satisfaction**: Receiving a treat can provide '
             'temporary relief and motivation for your child to use the '
             'potty.\n'
             '\n'
             '**However, there are also some potential drawbacks:**\n'
             '\n'
             '1. **Unintended consequences**: Over-reliance on candy might '
             'lead to dependence on treats as a means of getting your child to '
             'go pee or poo.\n'
             "2. **Lack of long-term habit formation**: Candy doesn't teach "
     