# Enhance Medical Assistant chatbot with RAG integration

In this lab, I’m improving the performance of the fine-tuned LLM-based Medical Assistant chatbot (https://github.com/diyorarti/medical-assistant) by integrating a Retriever-Augmented Generation (RAG) pipeline. The system leverages three external medical books as the knowledge base to provide more accurate, context-aware, and evidence-based responses.


# Data Loader 1
### Implement data loading pipeline for medical knowledge sources

Developed a data loader to process all PDF files from a folder, extracting text page by page and attaching relevant metadata. The data is loaded using `LangChain.document_loaders.PyPDFLoader`, which treats each page of a PDF as a separate Document object containing both `metadata` and `page_content`.

DATA SOURCES:
1. "Aging: Natural or Disease?" by Alex Zhavoronkov (https://www.ncbi.nlm.nih.gov/books/NBK561517/)
2. "Basic Epidemiology" by R. Bonita, R. Beaglehole, and T. Kjellström (https://iris.who.int/items/3d726576-7d68-4e66-a18b-25abf77c4894)
3. "Genes and Disease" by NCBI (https://www.ncbi.nlm.nih.gov/books/NBK22183/)

In [1]:
import hashlib
from pathlib import Path

def sha256_file(path:Path, chunk_size:int=8192)->str:
    """
    Comupute file Fingerprint
    helps to track whether a file's content has changed or avoid reprocessing duplicates
    """
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

In [3]:
from pathlib import Path
from typing import List
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

def load_data(data_dir:str, min_chars: int = 30) -> List[Document]:
    """
    Recursively load PDFs under `data_dir` and return page-level documents
    enriched with stable file metadata (absolute path, source path, mtime, sha256)

    Args:
        data_dir:Root directory to scan for PDFs
        min_chars:Minimum non-whitespace charaters to keep a page
    """
    data_dir = Path(data_dir)
    files = list(data_dir.glob("**/*.pdf"))
    print(f"Number of files {len(files)}")
    all_documents: List[Document] = []
    skipped_pages = 0

    for file in files:
        abs_path = file.resolve()
        print(f"Processing file: {abs_path}")

        try:
            loader = PyPDFLoader(str(abs_path))
            documents = loader.load()
        except Exception as e:
            print(f"!! Skipping {abs_path} due to error: {e}")
            continue

        file_hash = sha256_file(abs_path)
        try:
            mtime = abs_path.stat().st_mtime
        except Exception:
            mtime = None

        for doc in documents:
            content = (doc.page_content or "").strip()
            if len(content) < min_chars:
                skipped_pages += 1
                continue
            meta = dict(doc.metadata or {})
            page = meta.get("page")
            meta.update({
                "source":str(abs_path),
                "source_file":str(abs_path),
                "source_name":abs_path.name,
                "file_type":"pdf",
                "file_sha256":file_hash,
                "file_mtime":mtime,
                "page":page if page is not None else None,
            })
            doc.metadata=meta
        all_documents.extend(documents)
    print(f"skipped (short/blank) so far: {skipped_pages}")
    print(f"\nTotal documents loaded: {len(all_documents)}")
    return all_documents

In [4]:
documents = load_data("../data")

Number of files 3
Processing file: D:\Users\User\RAG-API\data\Aging_natural_or_disease.pdf
Processing file: D:\Users\User\RAG-API\data\basic_epidemiology.pdf
Processing file: D:\Users\User\RAG-API\data\Genes_and_Disease.pdf
skipped (short/blank) so far: 81

Total documents loaded: 466


In [6]:
documents[4].metadata

{'producer': 'Antenna House PDF Output Library 6.6.1477 (Linux64)',
 'creator': 'AH XSL Formatter V6.6 MR7 for Linux64 : 6.6.9.39847 (2019-07-29T09:58+09)',
 'creationdate': '2020-09-03T09:54:06-05:00',
 'moddate': '2020-09-03T09:54:06-05:00',
 'title': 'Aging: natural or disease? A view from medical textbooks',
 'trapped': '/False',
 'source': 'D:\\Users\\User\\RAG-API\\data\\Aging_natural_or_disease.pdf',
 'total_pages': 19,
 'page': 4,
 'page_label': '5',
 'source_file': 'D:\\Users\\User\\RAG-API\\data\\Aging_natural_or_disease.pdf',
 'source_name': 'Aging_natural_or_disease.pdf',
 'file_type': 'pdf',
 'file_sha256': 'f700a35f9fd7ef679f1c72a53b46a46adad1c2273dad8c7d40aeb97c543bb921',
 'file_mtime': 1760668682.7098148}

In [8]:
print(documents[4].page_content)

Table 1. Selected quotations arguing against the aging vs. disease dichotomy.
Charcot, 1881, p. 
2043
“The textural changes which old age induce in the organism sometimes attain such a point that the 
physiological and pathological states seem to mingle by an imperceptible transition and to be no longer sharply 
distinguishable. ”
Kleemeier, 1965, p. 
5544
“Can the effects of aging per se be distinguished from those of pathology? (…) To attribute to aging all time 
associated changes to which no specific cause can be found is at best a temporary holding tactic which will 
suffice only as long as we are ignorant of the mechanism involved. Time alone causes nothing. ”
Hall, 1984, p. 78f45 “ Attempts have been made by numerous workers to separate physiological from pathological aging. The two 
are, however, so interrelated as to make attempts relatively abortive. It would be far more relevant to accept the 
existence of a continuum of ageing phenomena. ”
Rattan, 1991, p. 
52646
“ Although