## Document Parsing


### Unstructured IO Parsing

In [41]:
import os
from pathlib import Path

from pypdf import PdfReader, PdfWriter

# ---- inputs/outputs
ROOT = Path(os.getcwd()).resolve().parents[1]
PDF_PATH = ROOT / "data" / "raw" / "example.pdf"
SLICE_PATH = ROOT / "data" / "raw" / "example_slice.pdf"
OUT_PATH = ROOT / "data" / "processed" / "example_p468_470.json"

START_PAGE, END_PAGE = 508, 510

# ---- slice PDF with PyPDF
reader = PdfReader(str(PDF_PATH))
writer = PdfWriter()
for p in range(START_PAGE - 1, END_PAGE):
    writer.add_page(reader.pages[p])
SLICE_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(SLICE_PATH, "wb") as f:
    writer.write(f)

In [52]:
from unstructured.partition.pdf import partition_pdf


PDF_PATH = "../../data/raw/example_slice.pdf"
OUT_PATH = "../../data/processed/example.json"




chunks = partition_pdf(
    filename=PDF_PATH,
    infer_table_structure=True,
    strategy="hi_res",
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    chunking_strategy="by_title",
    max_characters=10000,
    combine_text_under_n_chars=2000,
    new_after_n_chars=6000,
)

print(f"Extracted {len(chunks)} elements")


Extracted 5 elements


In [55]:
from IPython.display import Markdown, display

display(Markdown(str(chunks[0].text)))
print(chunks[0].metadata.text_as_html)

display(Markdown(str(chunks[0].metadata.text_as_html)))

TABLE 67-1 Molecular Pathways Mediating Drug Disposition

ENZYME SUBSTRATESa INHIBITORSa CYP3A Calcium channel Amiodarone blockers Antiarrhythmics Ketoconazole, (lidocaine, quinidine, itraconazole mexiletine) HMG-CoA reductase Erythromycin, inhibitors (“statins”; see clarithromycin text) Cyclosporine, tacrolimus Ritonavir Indinavir, saquinavir, Gemfibrozil and other ritonavir fibrates CYP2D6b Timolol, metoprolol, Quinidine (even at ultra- carvedilol low doses) Propafenone, flecainide Tricyclic antidepressants Tricyclic antidepressants Fluoxetine, paroxetine Fluoxetine, paroxetine CYP2C9b Warfarin Amiodarone Phenytoin Fluconazole Glipizide Phenytoin Losartan CYP2C19b Omeprazole Omeprazole Mephenytoin Clopidogrel CYP2B6b Efavirenz Thiopurine 6-Mercaptopurine, S-methyltransferaseb azathioprine N-acetyltransferaseb Isoniazid Procainamide Hydralazine Some sulfonamides UGT1A1b Irinotecan Pseudocholinesteraseb Succinylcholine TRANSPORTER SUBSTRATESa INHIBITORSa P-glycoprotein Digoxin Quinidine HIV protease inhibitors Amiodarone Many CYP3A substrates Verapamil Cyclosporine Itraconazole Erythromycin SLCO1B1b Simvastatin and some

other statins

aInhibitors affect the molecular pathway and thus may decrease substrate metabolism. bClinically important genetic variants described; see Chap. 68.

Note: A listing of CYP substrates, inhibitors, and inducers is maintained at https:// drug-interactions.medicine.iu.edu/MainTable.aspx.

may exert important pharmacologic activity, as discussed further below. Therapeutic antibodies are very slowly eliminated (allowing infrequent dosing, e.g., monthly injections), probably by lysosomal uptake and degradation.

Clinical Implications of Altered Bioavailability Some drugs undergo near-complete presystemic metabolism and thus cannot be administered orally. Nitroglycerin cannot be used orally because it is completely extracted prior to reaching the systemic circulation. The drug is, therefore, used by the sublingual, transdermal, or intravascular routes, which bypass presystemic metabolism.

Some drugs with very extensive presystemic metabolism can still be administered by the oral route, using much higher doses than those required intravenously. Thus, a typical intravenous dose of verapamil is 1–5 mg, compared to a usual single oral dose of 40–120 mg. Administration

HPIM21e_Part3_p465-p480.indd 467

of low-dose aspirin can result in exposure of cyclooxygenase in platelets in the portal vein to the drug, but systemic sparing because of first- pass aspirin deacylation in the liver. This is an example of presystemic metabolism being exploited to therapeutic advantage.

467

<table><tbody><tr><td rowspan="5">CYP3A</td><td>Calcium channel blockers</td><td>Amiodarone</td></tr><tr><td>Antiarrhythmics (lidocaine, quinidine, mexiletine)</td><td>Ketoconazole, itraconazole</td></tr><tr><td>HMG-CoA reductase inhibitors (“statins”; see text)</td><td>Erythromycin, | clarithromycin</td></tr><tr><td>Cyclosporine, tacrolimus</td><td>| Ritonavir</td></tr><tr><td>Indinavir, saquinavir, ritonavir</td><td>Gemfibrozil and other fibrates</td></tr><tr><td rowspan="5">CYP2D6° cyP2cg?</td><td>Timolol, metoprolol, carvedilol</td><td>Quinidine (even at ultra- low doses)</td></tr><tr><td>Propafenone, flecainide</td><td>| Tricyclic antidepressants</td></tr><tr><td>Tricyclic antidepressants</td><td>| Fluoxetine, paroxetine</td></tr><tr><td>Fluoxetine, paroxetine</td><td></td></tr><tr><td></td><td>Amiodarone</td></tr><tr><td></td><td>Warfarin Phenytoin</td><td>Fluconazole</td></tr><tr><td rowspan="4">CyP2C19° CYP2B6°</td><td>Losartan Omeprazole</td><td>Omeprazole</td></tr><tr><td>Mep

<table><tbody><tr><td rowspan="5">CYP3A</td><td>Calcium channel blockers</td><td>Amiodarone</td></tr><tr><td>Antiarrhythmics (lidocaine, quinidine, mexiletine)</td><td>Ketoconazole, itraconazole</td></tr><tr><td>HMG-CoA reductase inhibitors (“statins”; see text)</td><td>Erythromycin, | clarithromycin</td></tr><tr><td>Cyclosporine, tacrolimus</td><td>| Ritonavir</td></tr><tr><td>Indinavir, saquinavir, ritonavir</td><td>Gemfibrozil and other fibrates</td></tr><tr><td rowspan="5">CYP2D6° cyP2cg?</td><td>Timolol, metoprolol, carvedilol</td><td>Quinidine (even at ultra- low doses)</td></tr><tr><td>Propafenone, flecainide</td><td>| Tricyclic antidepressants</td></tr><tr><td>Tricyclic antidepressants</td><td>| Fluoxetine, paroxetine</td></tr><tr><td>Fluoxetine, paroxetine</td><td></td></tr><tr><td></td><td>Amiodarone</td></tr><tr><td></td><td>Warfarin Phenytoin</td><td>Fluconazole</td></tr><tr><td rowspan="4">CyP2C19° CYP2B6°</td><td>Losartan Omeprazole</td><td>Omeprazole</td></tr><tr><td>Mephenytoin</td><td></td></tr><tr><td>Clopidogrel</td><td></td></tr><tr><td>Efavirenz</td><td></td></tr><tr><td>Thiopurine S-methyltransferase®</td><td>6-Mercaptopurine, azathioprine</td><td></td></tr><tr><td rowspan="3">N-acetyltransferase” UGTIA1® Pseudocholinesterase?</td><td>Isoniazid Procainamide Hydralazine</td><td></td></tr><tr><td>sulfonamides</td><td></td></tr><tr><td>Some Irinotecan Succinylcholine</td><td></td></tr><tr><td rowspan="5">P-glycoprotein SLCO1B1&gt;</td><td>substrates</td><td></td></tr><tr><td>Digoxin HIV protease inhibitors Many CYP3A Simvastatin and some</td><td>Quinidine Amiodarone Verapamil Cyclosporine Itraconazole Erythromycin</td></tr><tr><td></td><td></td></tr><tr><td></td><td></td></tr><tr><td></td><td></td></tr></tbody></table>

In [57]:
chunks[0].metadata.orig_elements

[<unstructured.documents.elements.Text at 0x24860806120>,
 <unstructured.documents.elements.Table at 0x248605beea0>,
 <unstructured.documents.elements.Text at 0x248622ad2b0>,
 <unstructured.documents.elements.FigureCaption at 0x2489e0d5a90>,
 <unstructured.documents.elements.NarrativeText at 0x248622ada90>,
 <unstructured.documents.elements.NarrativeText at 0x248622ae120>,
 <unstructured.documents.elements.NarrativeText at 0x248622ae270>,
 <unstructured.documents.elements.NarrativeText at 0x248622ae660>,
 <unstructured.documents.elements.NarrativeText at 0x248622ae820>,
 <unstructured.documents.elements.Image at 0x2483efd9e50>,
 <unstructured.documents.elements.NarrativeText at 0x24860805fd0>,
 <unstructured.documents.elements.Text at 0x24860805be0>]

## OCR + Unstrucured

In [31]:
from pathlib import Path

SLICE = Path("../../data/raw/example_p468_470.pdf").resolve()
OCRD  = Path("../../data/raw/example_page_parsed.pdf").resolve()
OCRD.parent.mkdir(parents=True, exist_ok=True)

In [32]:
from ocrmypdf.api import ocr
from ocrmypdf import ExitCode
import pikepdf

code = ocr(
    input_file=str(SLICE),
    output_file=str(OCRD),
    language=["eng"],   # or just ["eng"]
    jobs=4,
    use_threads=True,          # set to False/None for process-based parallelism
    skip_text=True,            # only OCR image-only pages
    force_ocr=False,
    redo_ocr=False,
    optimize=0,
    rotate_pages=True,
    deskew=True,
    clean=False,
    clean_final=False,
    progress_bar=True,         # show progress; supported by the API
    output_type="pdf",      # uncomment if you need PDF/A output
)


n_pages = 0

if code == ExitCode.ok:
    with pikepdf.open(str(OCRD)) as pdf:
        n_pages = len(pdf.pages)

print(f"[OK] OCR’d + fixed PDF has {n_pages} pages → {OCRD}")

[OK] OCR’d + fixed PDF has 3 pages → C:\Users\alvar\OneDrive\Desktop\TFM\code\RAG\data\raw\example_page_parsed.pdf


Now the Unstructured run after the OCR has parsed the scanned PDF

In [None]:
from unstructured.partition.pdf import partition_pdf
import os
import json

PDF_PATH = "../../data/raw/example_page_parsed.pdf"
OUT_PATH = "../../data/processed/unstructured/ocr_output.json"
# parse only page 508

chunks = partition_pdf(
    filename=PDF_PATH,
    infer_table_structure=True,
    strategy="hi_res",
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    chunking_strategy="by_title",
    max_characters=10000,
    combine_text_under_n_chars=2000,
    new_after_n_chars=6000,
)

from IPython.display import Markdown, display

display(Markdown(str(chunks[2].text)))




Transiently high drug concentrations after rapid intravenous admin- istration can occasionally be used to advantage. The use of midazolam for intravenous sedation, for example, depends upon its rapid uptake by the brain during the distribution phase to produce sedation quickly, with subsequent egress from the brain during the redistribution of the drug as equilibrium is achieved.

Similarly, adenosine must be administered as a rapid bolus in the treatment of reentrant supraventricular tachycardias (Chap. 246) to prevent elimination by very rapid (t1/2 of seconds) uptake into erythro- cytes and endothelial cells before the drug can reach its clinical site of action, the atrioventricular node.

Clinical Implications of Altered Protein Binding Many drugs circulate in the plasma partly bound to plasma proteins. Since only unbound (free) drug can distribute to sites of pharmacologic action,

HPIM21e_Part3_p465-p480.indd 468

drug response is related to the free rather than the total circulating plasma drug concentration. In chronic kidney or liver disease, protein binding may be decreased and thus drug actions increased. In some situations (myocardial infarction, infection, surgery), acute phase reactants transiently increase binding of some drugs and thus decrease efficacy. These changes assume the greatest clinical importance for drugs that are highly protein-bound since even a small change in protein binding can result in large changes in free drug; for example, a decrease in binding from 99 to 98% doubles the free drug concentration from 1 to 2%. For some drugs (e.g., phenytoin), monitoring free rather than total drug concentrations can be useful.

■ DRUG ELIMINATION

Drug elimination reduces the amount of drug in the body over time. An important approach to quantifying this reduction is to consider that drug concentrations at the beginning and end of a time period are unchanged, and that a specific volume of the body has been “cleared” of the drug during that time period. This defines clearance as volume/ time. Clearance includes both drug metabolism and excretion.

Clinical Implications of Altered Clearance While elimination half-life determines the time required to achieve steady-state plasma concentration (Css), the magnitude of that steady state is determined by clearance (Cl) and dose alone. For a drug administered as an intrave- nous infusion, this relationship is:

Css = dosing rate/Cl or dosing rate = Cl • Css

When a drug is administered orally, the average plasma concentration within a dosing interval (Cavg,ss) replaces Css, and the dosage (dose per unit time) must be increased if bioavailability (F) is <100%:

Dose/time = Cl • Cavg,ss/F

Genetic variants, drug interactions, or diseases that reduce the activity of drug-metabolizing enzymes or excretory mechanisms lead to decreased clearance and, hence, a requirement for a downward dose adjustment to avoid toxicity. Conversely, some drug interactions and genetic variants increase the function of drug elimination pathways, and hence, increased drug dosage is necessary to maintain a therapeu- tic effect.

■ ACTIVE DRUG METABOLITES

Metabolites may produce effects similar to, overlapping with, or dis- tinct from those of the parent drug. Accumulation of the major metab- olite of procainamide, N-acetylprocainamide (NAPA), likely accounts for marked QT prolongation and torsades de pointes ventricular tachy- cardia (Chap. 252) during therapy with procainamide. Neurotoxicity during therapy with the opioid analgesic meperidine is likely due to accumulation of normeperidine, especially in renal disease.

Prodrugs are inactive compounds that require metabolism to gener- ate active metabolites that mediate the drug effects. Examples include many angiotensin-converting enzyme (ACE) inhibitors, the angi- otensin receptor blocker losartan, the antineoplastic irinotecan, the antiestrogen tamoxifen, the analgesic codeine (whose active metabolite morphine probably underlies the opioid effect during codeine admin- istration), and the antiplatelet drug clopidogrel. Drug metabolism has also been implicated in bioactivation of procarcinogens and in the generation of reactive metabolites that mediate certain ADRs (e.g., acetaminophen hepatotoxicity, discussed below).

In [37]:

from IPython.display import Markdown, display

display(Markdown(str(chunks[0].text)))
print(chunks[1])

TABLE 67-1 Molecular Pathways Mediating Drug Disposition

ENZYME SUBSTRATESa INHIBITORSa CYP3A Calcium channel Amiodarone blockers Antiarrhythmics Ketoconazole, (lidocaine, quinidine, itraconazole mexiletine) HMG-CoA reductase Erythromycin, inhibitors (“statins”; see clarithromycin text) Cyclosporine, tacrolimus Ritonavir Indinavir, saquinavir, Gemfibrozil and other ritonavir fibrates CYP2D6b Timolol, metoprolol, Quinidine (even at ultra- carvedilol low doses) Propafenone, flecainide Tricyclic antidepressants Tricyclic antidepressants Fluoxetine, paroxetine Fluoxetine, paroxetine CYP2C9b Warfarin Amiodarone Phenytoin Fluconazole Glipizide Phenytoin Losartan CYP2C19b Omeprazole Omeprazole Mephenytoin Clopidogrel CYP2B6b Efavirenz Thiopurine 6-Mercaptopurine, S-methyltransferaseb azathioprine N-acetyltransferaseb Isoniazid Procainamide Hydralazine Some sulfonamides UGT1A1b Irinotecan Pseudocholinesteraseb Succinylcholine TRANSPORTER SUBSTRATESa INHIBITORSa P-glycoprotein Digoxin Quinidine HIV protease inhibitors Amiodarone Many CYP3A substrates Verapamil Cyclosporine Itraconazole Erythromycin SLCO1B1b Simvastatin and some

other statins

aInhibitors affect the molecular pathway and thus may decrease substrate metabolism. bClinically important genetic variants described; see Chap. 68.

Note: A listing of CYP substrates, inhibitors, and inducers is maintained at https:// drug-interactions.medicine.iu.edu/MainTable.aspx.

may exert important pharmacologic activity, as discussed further below. Therapeutic antibodies are very slowly eliminated (allowing infrequent dosing, e.g., monthly injections), probably by lysosomal uptake and degradation.

Clinical Implications of Altered Bioavailability Some drugs undergo near-complete presystemic metabolism and thus cannot be administered orally. Nitroglycerin cannot be used orally because it is completely extracted prior to reaching the systemic circulation. The drug is, therefore, used by the sublingual, transdermal, or intravascular routes, which bypass presystemic metabolism.

Some drugs with very extensive presystemic metabolism can still be administered by the oral route, using much higher doses than those required intravenously. Thus, a typical intravenous dose of verapamil is 1–5 mg, compared to a usual single oral dose of 40–120 mg. Administration

HPIM21e_Part3_p465-p480.indd 467

of low-dose aspirin can result in exposure of cyclooxygenase in platelets in the portal vein to the drug, but systemic sparing because of first- pass aspirin deacylation in the liver. This is an example of presystemic metabolism being exploited to therapeutic advantage.

467

■ PLASMA HALF-LIFE

Most pharmacokinetic processes, such as elimination, are first-order; that is, the rate of the process depends on the amount of drug present. Elimination can occasionally be zero-order (fixed amount eliminated per unit time), and this can be clinically important (see “Principles of Dose Selection,” later in this chapter). In the simplest pharmacokinetic model (Fig. 67-2A), a drug bolus (D) is administered instantaneously to a central compartment, from which drug elimination occurs as a first-order process. Occasionally, central and other compartments cor- respond to physiologic spaces (e.g., plasma volume), whereas in other cases, they are simply mathematical functions used to describe drug disposition. The first-order nature of drug elimination leads directly to the relationship describing drug concentration (C) at any time (t) following the bolus:

C = D V •e − ( 0.69t/t 1/2 )

c

C

L9 YA LdVHO

H

A

P

T

E

R

6

7

P

r

i

where Vc is the volume of the compa

In [36]:
chunks[0].metadata.orig_elements

[<unstructured.documents.elements.Text at 0x24860905550>,
 <unstructured.documents.elements.Table at 0x2484d735090>,
 <unstructured.documents.elements.Text at 0x2486093b690>,
 <unstructured.documents.elements.FigureCaption at 0x248605befd0>,
 <unstructured.documents.elements.NarrativeText at 0x2486093b7e0>,
 <unstructured.documents.elements.NarrativeText at 0x2486093b9a0>,
 <unstructured.documents.elements.NarrativeText at 0x2486093ba80>,
 <unstructured.documents.elements.NarrativeText at 0x2486093bd20>,
 <unstructured.documents.elements.NarrativeText at 0x2486093bf50>,
 <unstructured.documents.elements.Image at 0x248606643e0>,
 <unstructured.documents.elements.NarrativeText at 0x24860904d70>,
 <unstructured.documents.elements.Text at 0x24860905400>]

## LLamaParse

In [28]:
import os
from pathlib import Path
from dotenv import load_dotenv
from pypdf import PdfReader, PdfWriter
from llama_parse import LlamaParse

# ---- load environment variables from .env
load_dotenv()
api_key = os.getenv("LLAMAPARSE_API_KEY")
if not api_key:
    raise RuntimeError("LLAMAPARSE_API_KEY not set in .env")

# ---- inputs/outputs

PDF_PATH = "../../data/raw/example_p468_470.pdf"
OUT_PATH = "../../data/processed/llamaparse_output.md"


instruction = ("Preserve the document structure and content maintaining, mathematical equations \n"
               "Pay special attention also to the table structure and content so that it matches the original one (headers included)")

parser = LlamaParse(
    api_key=api_key,
    result_type="markdown",
    language="en",
    num_workers=4,
    parsing_instruction = instruction

   
)

In [29]:
try:
    print("[llamaparse] sending slice…")
    docs = parser.load_data(str(PDF_PATH))
    print("[llamaparse] returned docs:", len(docs))
except Exception as e:
    print("[llamaparse] exception:", repr(e))
    raise

# If nothing came back, print more breadcrumbs
if len(docs) == 0:
    print("[warn] LlamaParse returned 0 docs. Quick checklist:")
    print("  - Is the file really text/scan? Try a different page range.")
    print("  - Try result_type='text' or 'structured'.")
    print("  - Try without slicing: pass the full PDF once.")
    print("  - Check network/proxy; Llama Cloud needs outbound https.")
    print("  - Double-check key validity (a bad key can yield empty results).")

# ------

[llamaparse] sending slice…
Started parsing the file under job_id 2d5beec6-abee-4fac-a26e-a7caa2d41de4
[llamaparse] returned docs: 3


In [30]:
from pathlib import Path
import re, html

SINGLE_MD = Path(OUT_PATH)

HEADING_RE = re.compile(r"(?m)^(#{1,6})(\S)")                     # ensure '#Title' -> '# Title'
TABLE_ROW_RE = re.compile(r"^\s*\|")                               # lines that look like pipe-table rows
TABLE_SEP_RE = re.compile(r"^\s*\|?\s*:?-{2,}:?\s*(\|\s*:?-{2,}:?\s*)+\|?\s*$")

def normalize_markdown(md: str) -> str:
    if not md:
        return ""
    # 1) decode HTML entities (&#x3C; -> <)
    md = html.unescape(md)

    # 2) fix headings missing a space: '#Title' -> '# Title'
    md = HEADING_RE.sub(lambda m: f"{m.group(1)} {m.group(2)}", md)

    # 3) collapse >2 blank lines to max 2
    md = re.sub(r"\n{3,}", "\n\n", md)

    # 4) ensure a blank line BEFORE headings
    md = re.sub(r"(?m)([^\n])\n(#{1,6}\s)", r"\1\n\n\2", md)

    # 5) ensure a blank line AFTER headings (if next line is not blank)
    md = re.sub(r"(?m)^(#{1,6}\s.+)\n(?!\n|\||-{3,})", r"\1\n\n", md)

    # 6) ensure blank lines around tables (pipe-style)
    lines = md.splitlines()
    out = []
    for i, line in enumerate(lines):
        # insert blank line before a table row if previous non-empty line isn't blank or header
        if TABLE_ROW_RE.match(line):
            if out and out[-1].strip() and not TABLE_ROW_RE.match(out[-1]) and not TABLE_SEP_RE.match(out[-1]):
                out.append("")  # blank line before table
        out.append(line)
        # if this is a table separator row, ensure there's at least one row above & below (renderer-friendly)
        # (we won’t synthesize; just spacing)
        if TABLE_SEP_RE.match(line):
            # ensure there is a blank line after the table end (will be added when next non-table appears)
            pass
    md = "\n".join(out)

    # 7) demote standalone page banners like "# 468" to smaller heading or italic line
    md = re.sub(r"(?m)^#\s+(\d{1,5})\s*$", r"### Page \1", md)

    # 8) strip stray trailing spaces
    md = re.sub(r"[ \t]+(\n)", r"\1", md)

    return md.strip() + "\n"

def safe_anchor(s: str) -> str:
    s = re.sub(r"[^\w\s-]", "", s).strip().lower()
    s = re.sub(r"\s+", "-", s)
    return s[:80] or "chunk"

toc = ["# Parsed Output", "## Table of Contents"]
parts = []

for i, d in enumerate(docs):
    raw = (getattr(d, "text", "") or "").strip()
    if not raw:
        continue
    md = normalize_markdown(raw)

    # build a visible section header so chunk boundaries are obvious
    meta = dict(getattr(d, "metadata", {}) or {})
    page = meta.get("page_number") or meta.get("page") or "n/a"
    # use first heading in the chunk as title, else "Chunk i"
    m = re.search(r"(?m)^\s*#{1,6}\s+(.+)$", md)
    title = m.group(1).strip() if m else f"Chunk {i}"
    anchor = f"chunk-{i}-{safe_anchor(title)}"

    toc.append(f"- [Chunk {i} — p. {page}: {title}](#{anchor})")
    parts.append(
        f"\n\n---\n\n<a id='{anchor}'></a>\n\n## Chunk {i} — Page {page}\n\n{md}"
    )

final_md = "\n".join(toc) + "\n" + "".join(parts)
SINGLE_MD.parent.mkdir(parents=True, exist_ok=True)
SINGLE_MD.write_text(final_md, encoding="utf-8")
print(f"Saved single markdown → {SINGLE_MD}")

Saved single markdown → ..\..\data\processed\llamaparse_output.md
