### PDF Docs - Unstructured Data

The **./medcare_docs** directory contains some mock Medcare institute documents:

01 - Membership Master Guide, Tiered Benefit Matrix, Enrollment Windows, Dependent Verification.  
02 - Clinical Services Directory, Diagnostic Modalities, Specialized Surgical Units, Preventive Care Tiers.  
03 - Data Privacy & Patient Rights, HIPAA-Level Encryption, Consent Revocation, Subject Access Requests.  
04 - Financial & Billing Operations, Co-payment Schedules, Insurance Arbitration, Debt Mitigation.  
05 - Global Facility Network, On-Site Medcare Clinics, External Affiliate Hospitals, Regional Zones.  
06 - Administrative Workflows, Pre-Admission Protocols, Record Retention, Discharge Paperwork.  
07 - Legal Liability & Compliance, Arbitration Clauses, Malpractice Limits, Regulatory Reporting.  
08 - Emergency & Urgent Triage, Immediate Life-Threat Protocols, Level 1-5 Triage, After-Hours Care.  
09 - Pharmacy & Medication Policy, Formulary Tiering, Prior Authorization, Controlled Substance Rules.  
10 - Telehealth & Digital Services, Virtual Consultation Standards, Remote Vitals Monitoring, Portal Access.  

Here we split each document into overlapping text chunks and write them to a [JSONL](https://jsonlines.org/) file, one chunk per line.
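As a quick illustration of the JSONL format itself, the sketch below serializes two records as one JSON object per line and reads them back. The records are hypothetical; their fields mirror the chunk schema built later in `process_pdf_file`, and the helper names `to_jsonl` and `from_jsonl` are made up for this example.

```python
import json

# Hypothetical records, shaped like the chunk entries written by this notebook.
records = [
    {
        "chunk_id": "example.pdf_0_0",
        "text": "First chunk of page one.",
        "metadata": {"source_file": "example.pdf", "page_number": 1, "org": "Medcare"},
    },
    {
        "chunk_id": "example.pdf_0_1",
        "text": "Second chunk of page one.",
        "metadata": {"source_file": "example.pdf", "page_number": 1, "org": "Medcare"},
    },
]


def to_jsonl(items):
    """Serialize each record as a single JSON object followed by a newline."""
    return "".join(json.dumps(item) + "\n" for item in items)


def from_jsonl(text):
    """Parse JSONL text back into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
restored = from_jsonl(jsonl_text)
```

Because every line is an independent JSON document, the file can be appended to and read back one record at a time without loading everything into memory.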

In [1]:
import os
import json
import fitz  # PyMuPDF
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [2]:
# --- CONFIGURATION ---
INPUT_FOLDER = "./medcare_docs"  # Folder containing your 10 PDFs
OUTPUT_FILE = "medcare_knowledge_base.jsonl"

# The splitter:
# We use a 1000-character limit with a 150-character overlap.
# If a medical warning sits at the end of a chunk, the overlap repeats
# it at the start of the next chunk so the context is preserved.
# Note: text is split page by page, so overlap never spans a page boundary.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

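To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share their boundary text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character sample, using the same sizes as the configuration above
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample, chunk_size=1000, chunk_overlap=150)
```

With these settings each new window starts 850 characters after the previous one, so the last 150 characters of one chunk are identical to the first 150 characters of the next.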

In [3]:
def process_pdf_file(file_path):
    """
    Generator: opens a single PDF, chunks it page by page, and yields
    one dictionary entry at a time.
    """
    filename = os.path.basename(file_path)
    doc = fitz.open(file_path)

    try:
        for page_num, page in enumerate(doc):
            page_text = page.get_text("text")
            chunks = text_splitter.split_text(page_text)

            for i, chunk_content in enumerate(chunks):
                text = chunk_content.strip()
                if not text:
                    continue  # Skip whitespace-only chunks

                # Yielding prevents building a giant list in memory
                yield {
                    # page_num is 0-based here; page_number below is 1-based
                    "chunk_id": f"{filename}_{page_num}_{i}",
                    "text": text,
                    "metadata": {
                        "source_file": filename,
                        "page_number": page_num + 1,
                        "org": "Medcare"
                    }
                }
    finally:
        # Close the document even if the caller stops iterating early
        doc.close()


def main():
    if not os.path.exists(INPUT_FOLDER):
        print(f"Directory {INPUT_FOLDER} not found.")
        return

    print("🚀 Starting Medcare PDF Extraction...")
    
    with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
        # Sort the listing so files are processed in a deterministic order
        for filename in sorted(os.listdir(INPUT_FOLDER)):
            if filename.lower().endswith(".pdf"):
                file_path = os.path.join(INPUT_FOLDER, filename)
                print(f"  🧬 Processing: {filename}")

                # Use the generator to stream chunks into the file
                for chunk_entry in process_pdf_file(file_path):
                    f.write(json.dumps(chunk_entry) + '\n')

    print(f"✅ Finished! Knowledge base saved to {OUTPUT_FILE}")


if __name__ == "__main__":
    main()

🚀 Starting Medcare PDF Extraction...
  🧬 Processing: 1-Medcare_Membership_Eligibility.pdf
  🧬 Processing: 10-Medcare_Telehealth_Terms.pdf
  🧬 Processing: 2-Medcare_Clinical_Services.pdf
  🧬 Processing: 3-Medcare_Privacy_Rights.pdf
  🧬 Processing: 4-Medcare_Financial_Policy.pdf
  🧬 Processing: 5-Medcare_Facilities_Network.pdf
  🧬 Processing: 6-Medcare_Admin_Procedures.pdf
  🧬 Processing: 7-Medcare_Legal_Compliance.pdf
  🧬 Processing: 8-Medcare_Urgent_Care.pdf
  🧬 Processing: 9-Medcare_Pharmacy_Policy.pdf
✅ Finished! Knowledge base saved to medcare_knowledge_base.jsonl
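
Once the file is written, a quick sanity check can confirm that every source PDF produced chunks and that no chunk is unexpectedly long. The sketch below is one possible way to do that; the helper name `summarize_chunks` is hypothetical, and it assumes each line follows the schema written above (`text` plus `metadata.source_file`).

```python
import json
from collections import Counter


def summarize_chunks(jsonl_path):
    """Return (chunks per source file, length of the longest chunk text)."""
    per_file = Counter()
    longest = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # Tolerate blank lines
            entry = json.loads(line)
            per_file[entry["metadata"]["source_file"]] += 1
            longest = max(longest, len(entry["text"]))
    return per_file, longest


# Example usage, assuming the extraction above has already been run:
# per_file, longest = summarize_chunks("medcare_knowledge_base.jsonl")
# for name, count in sorted(per_file.items()):
#     print(f"{name}: {count} chunks")
# print(f"Longest chunk: {longest} characters")
```

A source file missing from the counts, or a longest chunk well above the configured `chunk_size`, would be a signal to inspect that PDF's extracted text.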
