Set up tools to recursively load and clean our web-based sources.

- Alzheimer's Association Resources for Caregivers: https://www.alz.org/help-support/caregiving
- CDC resources for dementia: https://www.cdc.gov/alzheimers-dementia/index.html
- Mayo Clinic: https://www.mayoclinic.org/diseases-conditions/dementia/symptoms-causes/syc-20352013
- The Alzheimer's Society (international): https://www.alzheimers.org.uk
- WebMD: https://www.webmd.com/alzheimers/

Other data sources we could explore to expand the POC: 

- Alzheimer's Foundation of America: https://alzfdn.org/
- Dementia Society of America: https://www.dementiasociety.org
- Family Caregiver Alliance: https://www.caregiver.org

And top hits from Pubmed for a variety of specific search terms might be good: https://pubmed.ncbi.nlm.nih.gov/?term=dementia+caregiver.  

In [31]:
urls = ["https://www.cdc.gov/alzheimers-dementia",
        "https://www.cdc.gov/alzheimers-dementia/about",
        "https://www.cdc.gov/alzheimers-dementia/prevention",
        "https://www.cdc.giv/alzheimers-dementia/healthy-people-2030",
        "https://www.alz.org/help-support/caregiving",
        "https://www.mayoclinic.org/diseases-conditions/dementia/",
        "https://www.alzheimers.org.uk/about-dementia",
        "https://www.webmd.com/alzheimers"
]

pdfs = ["https://www.cdc.gov/alzheimers-dementia/media/pdfs/2024/05/BrainHealthKeyFactsResources.pdf",
        "https://www.caregiving.org/wp-content/uploads/2020/05/Dementia-Caregiving-in-the-US_February-2017.pdf",
        "https://archrespite.org/wp-content/uploads/2021/12/9-Steps_Dementia-Caregiver-2.pdf"]

In [32]:
from langchain_community.document_loaders import RecursiveUrlLoader
import requests

# This example uses `beautifulsoup4` and `lxml`
import re
from bs4 import BeautifulSoup

def custom_metadata_extractor(html: str, url: str, response: requests.Response):
    soup = BeautifulSoup(html, "lxml")
    # Extract the page title from the HTML
    title = soup.title.string if soup.title else "No Title"
    return {
        "url": url,
        "title": title
    }

def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    # Remove unwanted tags
    for tag in soup(['nav', 'footer', 'header', 'aside', 'script', 'style']):
        tag.decompose()
    # Extract text
    text = soup.get_text(separator=' ', strip=True)
    # Clean up whitespace
    clean_text = re.sub(r'\s+', ' ', text).strip()
    return clean_text


In [33]:
import random

docs = []
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    # Add more user agents as needed - this helps grab content from sites like mayo clinic and webmd that block bots/scraping
]

headers = {"User-Agent": random.choice(user_agents)}

for url in urls:
    loader = RecursiveUrlLoader(url,
                                extractor=bs4_extractor,
                                metadata_extractor=custom_metadata_extractor,
                                headers = headers,
                                max_depth=6,
                                use_async=True,
                                timeout=60
                                )
    new_docs = await loader.aload()
    docs.extend(new_docs)

Unable to load https://www.cdc.gov/alzheimers-dementia/media/pdfs/2024/05/BrainHealthKeyFactsResources.pdf. Received error 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte of type UnicodeDecodeError
Unable to load https://www.cdc.giv/alzheimers-dementia/healthy-people-2030. Received error Cannot connect to host www.cdc.giv:443 ssl:default [nodename nor servname provided, or not known] of type ClientConnectorDNSError
Unable to load https://www.alz.org/help-support/caregiving/stages-behaviors/wandering?lang=en-US&amp;amp;form=FUNSETYDEFK&form=FUNYWTPCJBN. Received error  of type TimeoutError
Unable to load https://www.alz.org/help-support/caregiving?form=FUNSETYDEFK?lang=es-MX. Received error  of type TimeoutError
Unable to load https://www.alz.org/help-support/caregiving/daily-care?lang=en-US?lang=es-MX&amp;form=FUNSMRYZSMP&amp;form=FUNSETYDEFK. Received error  of type TimeoutError
Unable to load https://www.alz.org/help-support/caregiving?form=FUNSETYDEFK&

In [34]:
for doc in docs:
    if ".pdf" in doc.metadata.get("url",""):
        pdfs.append(doc.metadata.get("url"))
print(pdfs)

['https://www.cdc.gov/alzheimers-dementia/media/pdfs/2024/05/BrainHealthKeyFactsResources.pdf']


In [35]:
unwanted_terms = ["learn more", "get involved with your local chapter","Find a support group near you","join our community",
                  "print this page","skip directly to site content","skip directly to search",
                  "an official website of the united states government", "sign up for email updates",
                  "skip to main content","skip navigation menu","share this page email whatsapp messenger", "make a donation",
                  "here's how you know official websites use .gov a .gov website belongs to an official government organization in the united states. secure .gov websites use https a lock ( ) or https:// means you've safely connected to the .gov website. share sensitive information only on official, secure websites.",
                  "email (required) first name (required) last name (required) you can change what you receive at any time and we will never sell your details to third parties. here’s our privacy policy ."]

docs = [
    doc for doc in docs 
    if all([
        "es-mx" not in doc.metadata.get("url", "").lower(),          # English only
        ".pdf" not in doc.metadata.get("url", ""),                    # HTML only
        "doctors-departments" not in doc.metadata.get("url", ""),
        "care-at-mayo-clinic" not in doc.metadata.get("url", ""),
        "cuidado" not in doc.metadata.get("title", ""),
        "site.html" not in doc.metadata.get("url", ""),
        "page not found" not in doc.metadata.get("title", "").lower(),
        "404" not in doc.metadata.get("title", ""),
        "500" not in doc.metadata.get("title", ""),
        "403" not in doc.metadata.get("title", ""),
        "page not found" not in doc.page_content.lower(),
        "access denied" not in doc.page_content.lower(),
        "runtime server error" not in doc.page_content.lower()
    ])
]

# deduplicate urls
unique_docs = {}
for doc in docs:
    url = doc.metadata.get("url")
    if url and url not in unique_docs:
        unique_docs[url] = doc
docs = list(unique_docs.values())

# tidy up
for doc in docs:
    content = doc.page_content.lower()
    for term in unwanted_terms:
        content = content.replace(term,"")
    doc.page_content=content
import hashlib

# deduplicate content
def hash_content(content):
    return hashlib.md5(content.encode('utf-8')).hexdigest()

unique_docs = {}
for doc in docs:
    content_hash = hash_content(doc.page_content)
    if content_hash not in unique_docs:
        unique_docs[content_hash] = doc

# Convert to list if needed
docs = list(unique_docs.values())


In [36]:
from pprint import pprint
print(len(docs))
for doc in docs[:10]:
    pprint(doc.metadata)
    pprint(doc.page_content)
    print("\n")

488
{'title': "Alzheimer's Disease and Dementia | Alzheimer's Disease and Dementia "
          '| CDC',
 'url': 'https://www.cdc.gov/alzheimers-dementia'}
("alzheimer's disease and dementia | alzheimer's disease and dementia | "
 "cdc     alzheimer's disease and dementia alzheimer's basics learn about "
 "signs and symptoms of alzheimer's disease and who is affected. aug. 15, 2024 "
 'dementia basics learn about common types of dementia, signs and symptoms, '
 "and risk factors. aug. 17, 2024 signs and symptoms of alzheimer's learn how "
 "to recognize the early signs of alzheimer's disease. signs and symptoms of "
 'dementia learn what early signs and symptoms of dementia to look out for. '
 'tools and resources find a variety of resources about alzheimer’s disease '
 'and healthy aging. reducing risk learn what lifestyle behaviors can reduce '
 'the risk of developing dementia. additional topics healthy aging at any age '
 'information to help you stay healthy and strong throughout y

In [37]:
import requests, os
from langchain.document_loaders import PyMuPDFLoader
from langchain.schema import Document

pdf_docs = []
for i,pdf_url in enumerate(pdfs):
    response = requests.get(pdf_url)
    pdf_path = f"downloaded_document_{i}.pdf"

    # Save the PDF to a file
    with open(pdf_path, "wb") as f:
        f.write(response.content)

    # Load the PDF using PyMuPDFLoader
    loader = PyMuPDFLoader(pdf_path)
    pages = loader.load()

    # clean up - specific to the cdc doc
    pages = pages[1:]
    full_text = "\n".join(
        re.sub(
            r"Brain Health Key Facts and Resources \| 2014\s*|Page\s*\d+|\n{2,}",  # Remove unwanted patterns
            " ",  # Replace with a space to avoid excessive newlines
            page.page_content
        ).strip() for page in pages
    )
    full_text = re.sub(r"\s{2,}", " ", full_text).strip()

    # preserve the url
    full_doc = Document(page_content=full_text, metadata={"url": pdf_url})

    pdf_docs.append(full_doc)
    os.remove(pdf_path)


In [38]:
print(len(pdf_docs))
for doc in pdf_docs: print(doc)

1
page_content='Alcohol Use Alcohol may act differently in older adults than in younger people. Some older adults can feel “high” without increasing the amount of alcohol they drink. This can make them more likely to be confused or have accidents, including falls, broken bones, and car crashes, which can cause head injuries among other problems. If people choose to drink alcohol, U.S. Dietary Guidelines for Americans say that moderate drinking is up to two drinks a day for men, and one for women. Some people should not drink alcohol. Many older adults should be extra careful because they often take medicines that can interact with it. For example:  Alcohol and over-the-counter cough and cold remedies together can cause drowsiness and potential accidental overdoses. Older people are at even greater risk for these side effects.  Using alcohol with common blood pressure medicines can increase risk for dizziness, drowsiness, and changes in heartbeat. Talk with your health care provider i

In [39]:
docs.extend(pdf_docs)

In [40]:
import json
from langchain.schema import Document

# Convert documents to a list of dictionaries
serialized_docs = [
    {"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs
]

# Save the serialized documents to a JSON file
file = "source_documents.json"
with open("source_documents.json", "w") as f:
    json.dump(serialized_docs, f, indent=4)

print(f"Documents have been serialized to {file}")

Documents have been serialized to source_documents.json


In [189]:
total_len = 0
min = 1000
max = 0
for doc in docs:
    n = len(doc.page_content)
    total_len+=n
    if n>max: max = n
    if n<min: min = n

print(f"Average: {total_len/len(docs)}, Min: {min}, Max: {max}")

Average: 7106.855042016807, Min: 263, Max: 26028
