diff --git a/.gitignore b/.gitignore index 3133382..ccb93a7 100644 --- a/.gitignore +++ b/.gitignore @@ -19,6 +19,11 @@ build/ data/cache/ *.sqlite +# Docling eval — keep FINDINGS.md and sources/, ignore generated run artifacts +data/docling_eval/* +!data/docling_eval/FINDINGS.md +!data/docling_eval/sources/ + # OS .DS_Store Thumbs.db diff --git a/README.md b/README.md index 9f6a2df..6376e50 100644 --- a/README.md +++ b/README.md @@ -208,6 +208,21 @@ html_parser.py pdf_parser.py text_cleaner.py +Note on PDF table extraction (Docling refiner): + +The extraction stage uses an in-tree PDF parser (PyMuPDF + pdfplumber) as the +default and a Docling TableFormer post-pass to refine table sections when an +in-tree result looks broken or when the source URL is on a curated allowlist +of publishers whose tables are known to be hard (CDC MMWR, certain WHO +situation reports). + +The first PDF that triggers the refiner downloads the Docling layout and +TableFormer models (~40 MB) to the HuggingFace cache (`~/.cache/huggingface/`) +and holds them in memory (~1.5 GB) for the lifetime of the process. The +feature is toggled with `ExtractionConfig.enable_docling_refiner` — when +disabled, no Docling imports occur and behaviour matches the pre-refiner +pipeline exactly. 
+ --- ## Insight Stage diff --git a/bioscancast/extraction/chunking.py b/bioscancast/extraction/chunking.py index ff02a14..0d5c14f 100644 --- a/bioscancast/extraction/chunking.py +++ b/bioscancast/extraction/chunking.py @@ -50,6 +50,7 @@ def normalize_chunks( page_number=chunk.page_number, table_data=None, token_count=part_tokens, + extractor=chunk.extractor, ) ) diff --git a/bioscancast/extraction/config.py b/bioscancast/extraction/config.py index b80e1a5..19b595e 100644 --- a/bioscancast/extraction/config.py +++ b/bioscancast/extraction/config.py @@ -1,6 +1,7 @@ from __future__ import annotations -from dataclasses import dataclass +from dataclasses import dataclass, field +from typing import List @dataclass @@ -10,4 +11,27 @@ class ExtractionConfig: pdf_max_pages: int = 100 chunk_target_tokens: int = 800 chunk_max_tokens: int = 1500 + user_agent: str = ( + "BioScanCast/0.1 (+https://github.com/algorithmicgovernance/BioScanCast)" + ) + + # ---- Docling table refiner ---- + enable_docling_refiner: bool = True + """Toggle the Docling post-pass that refines PDF table sections. + + When False, no Docling imports occur and behaviour is identical to the + pre-refiner pipeline. + """ + + docling_source_allowlist: List[str] = field( + default_factory=lambda: [ + "cdc.gov/mmwr/", + "cdn.who.int/media/docs/default-source/_sage-", + "cdn.who.int/media/docs/default-source/documents/emergencies/situation-reports/", + ] + ) + """Source URL substrings known to contain hard tables. Match triggers Docling unconditionally.""" + + docling_sparse_cell_threshold: float = 0.5 + """Non-empty-cell ratio below which a table is flagged as suspect and triggers Docling.""" impersonate: str = "chrome" diff --git a/bioscancast/extraction/docling_refiner.py b/bioscancast/extraction/docling_refiner.py new file mode 100644 index 0000000..edda060 --- /dev/null +++ b/bioscancast/extraction/docling_refiner.py @@ -0,0 +1,364 @@ +"""Docling-based table refiner. 
+ +Optional post-pass over `ParsedContent` produced by `PdfParser`. When triggered +(URL allowlist hit or a heuristic flag on a "broken" in-tree table), runs +Docling's TableFormer on the original PDF bytes and replaces the in-tree table +sections with Docling's rendering. + +Docling and its transitive deps (`transformers`, `torch`, ...) are intentionally +*lazy-imported*: only `_build_converter()`, called on the first triggered +refinement, touches them. When the feature flag is off, no Docling import ever happens. +""" + +from __future__ import annotations + +import io +import logging +from dataclasses import replace +from typing import Any, FrozenSet, List, Optional, Sequence, Tuple + +from .config import ExtractionConfig +from .parsers.base import ParsedContent, SectionContent + +logger = logging.getLogger(__name__) + + +class DoclingTableRefiner: + """Refines table sections in a `ParsedContent` using Docling. + + The converter is constructed once per instance (Docling models cost + ~10-30s and ~1.5 GB RAM to load), so the pipeline should hold one + instance per process. + """ + + def __init__( + self, + config: ExtractionConfig, + *, + converter: Optional[Any] = None, + ) -> None: + self._config = config + # Allow dependency injection for tests; real construction is lazy. + self._converter = converter + + # ---------- public API ---------- + + def refine( + self, + parsed: ParsedContent, + *, + source_url: str, + content: bytes, + ) -> ParsedContent: + """Return either the original `parsed` or a copy with table sections + replaced by Docling output. + + Triggers (first match wins): + 1. `source_url` matches the configured allowlist. + 2. Any in-tree table looks "broken" by the heuristic. + + Short-circuits to a no-op for OCR-required PDFs — Docling without OCR + cannot help there. 
+ """ + if parsed.is_partial and parsed.partial_reason == "requires_ocr": + logger.debug("docling refiner skipped: requires_ocr") + return parsed + + # Always compute broken-table indices: even URL-triggered runs need + # them, so the merge step knows which in-tree sections to drop when + # Docling produces a different but better table on another page. + flagged = _broken_table_reasons( + parsed, threshold=self._config.docling_sparse_cell_threshold + ) + broken_indices = frozenset(i for i, _ in flagged) + + url_match = _should_refine_by_url( + source_url, self._config.docling_source_allowlist + ) + if url_match: + logger.info( + "docling refiner triggered: source-allowlist hit for %s", source_url + ) + elif flagged: + for _, reason in flagged: + logger.info("docling refiner triggered: %s", reason) + else: + logger.debug( + "docling refiner skipped: no trigger matched for %s", source_url + ) + return parsed + + return self._do_refine(parsed, content, broken_indices=broken_indices) + + # ---------- internals ---------- + + def _do_refine( + self, + parsed: ParsedContent, + content: bytes, + *, + broken_indices: FrozenSet[int] = frozenset(), + ) -> ParsedContent: + try: + converter = self._get_converter() + except Exception as exc: # pragma: no cover - construction failures + logger.warning("docling converter unavailable: %s", exc) + return parsed + + try: + result = converter.convert(content) + except Exception as exc: + logger.warning("docling conversion failed: %s", exc) + return parsed + + docling_doc = getattr(result, "document", None) + if docling_doc is None: + logger.warning("docling result has no document; leaving parsed unchanged") + return parsed + + return _merge_docling_tables_into_parsed( + parsed, docling_doc, broken_indices=broken_indices + ) + + def _get_converter(self) -> Any: + if self._converter is not None: + return self._converter + self._converter = _build_converter() + return self._converter + + +# ---------- helpers ---------- + + +def 
_should_refine_by_url(source_url: str, allowlist: Sequence[str]) -> bool: + if not source_url: + return False + lowered = source_url.lower() + return any(pattern.lower() in lowered for pattern in allowlist) + + +def _broken_table_reasons( + parsed: ParsedContent, *, threshold: float +) -> List[Tuple[int, str]]: + """Inspect every table section in `parsed` and return `(section_index, + reason)` pairs for any that look broken. + + A table with at least 3 rows and 2 columns is suspect when either: + - its non-empty-cell ratio is below `threshold`, or + - more than half its rows have exactly one non-empty cell (over-segmentation) + """ + flagged: List[Tuple[int, str]] = [] + for i, section in enumerate(parsed.sections): + if section.chunk_type != "table" or not section.table_rows: + continue + rows = section.table_rows + if len(rows) < 3: + continue + max_cols = max((len(r) for r in rows), default=0) + if max_cols < 2: + continue + + total_cells = sum(len(r) for r in rows) + if total_cells == 0: + continue + non_empty = sum( + 1 for row in rows for cell in row if cell and str(cell).strip() + ) + ratio = non_empty / total_cells + + page_label = section.page_number if section.page_number is not None else "?" + + if ratio < threshold: + flagged.append( + ( + i, + f"suspect table on page {page_label} " + f"(non-empty-cell ratio {ratio:.2f})", + ) + ) + continue + + single_cell_rows = sum( + 1 + for row in rows + if sum(1 for cell in row if cell and str(cell).strip()) == 1 + ) + if single_cell_rows > len(rows) / 2: + flagged.append( + ( + i, + f"suspect table on page {page_label} " + f"(over-segmented: {single_cell_rows}/{len(rows)} rows have a single cell)", + ) + ) + return flagged + + +def _merge_docling_tables_into_parsed( + parsed: ParsedContent, + docling_doc: Any, + *, + broken_indices: FrozenSet[int] = frozenset(), +) -> ParsedContent: + """Replace in-tree table sections with Docling-rendered tables. + + Strategy: + 1. 
Page-based matching: for each in-tree table section, find a Docling + table on the same page (in order of appearance) and replace. + 2. Drop unmatched in-tree table sections whose index is in + `broken_indices` (the heuristic flagged them as suspect). + 3. Insert any leftover Docling tables as new sections, at the + document-order position corresponding to their page. + + `broken_indices` refers to indices into `parsed.sections` as it was + when the heuristic ran (i.e. the original section list). + """ + docling_tables_by_page: dict[int, list] = {} + for table in getattr(docling_doc, "tables", []) or []: + prov = getattr(table, "prov", None) or [] + if not prov: + continue + page_no = getattr(prov[0], "page_no", None) + if page_no is None: + continue + docling_tables_by_page.setdefault(page_no, []).append(table) + + cursor: dict[int, int] = {} + matched_docling: set[int] = set() + new_sections: List[SectionContent] = [] + + for i, section in enumerate(parsed.sections): + if section.chunk_type != "table" or section.page_number is None: + new_sections.append(section) + continue + + page = section.page_number + idx = cursor.get(page, 0) + candidates = docling_tables_by_page.get(page, []) + if idx < len(candidates): + docling_table = candidates[idx] + cursor[page] = idx + 1 + new_rows = _docling_table_to_rows(docling_table) + if new_rows: + matched_docling.add(id(docling_table)) + new_sections.append( + replace( + section, + table_rows=new_rows, + extractor="docling", + ) + ) + continue + # Fall through: Docling table empty -> treat as no match. + + # No Docling replacement for this in-tree table. + if i in broken_indices: + # The heuristic confirmed this in-tree table is garbage; + # drop it rather than leave noise in the output. + continue + new_sections.append(section) + + # Insert leftover Docling tables in page order. 
+ leftover: List[Tuple[int, Any]] = [] + for page_no, tables in docling_tables_by_page.items(): + for table in tables: + if id(table) not in matched_docling: + leftover.append((page_no, table)) + leftover.sort(key=lambda pair: pair[0]) + + for page_no, table in leftover: + rows = _docling_table_to_rows(table) + if not rows: + continue + insert_at = 0 + for j, existing in enumerate(new_sections): + if ( + existing.page_number is not None + and existing.page_number <= page_no + ): + insert_at = j + 1 + new_sections.insert( + insert_at, + SectionContent( + section_path=None, + page_number=page_no, + text="", + chunk_type="table", + table_rows=rows, + extractor="docling", + ), + ) + + # Return a copy; the caller's `parsed` is left unmodified. + return replace(parsed, sections=new_sections) + + +def _docling_table_to_rows(table: Any) -> List[List[str]]: + """Convert a Docling `TableItem` into row-major plain-string cells. + + Walks `table.data.table_cells` directly so we don't pull in pandas. + Each cell carries `start_row_offset_idx`/`start_col_offset_idx`; we lay + them out on a grid of size `num_rows x num_cols` and stringify the text. + """ + data = getattr(table, "data", None) + if data is None: + return [] + cells = list(getattr(data, "table_cells", []) or []) + if not cells: + return [] + + num_rows = int(getattr(data, "num_rows", 0) or 0) + num_cols = int(getattr(data, "num_cols", 0) or 0) + if num_rows <= 0 or num_cols <= 0: + # Fall back to inferring shape from cell offsets. 
+ num_rows = max(int(getattr(c, "end_row_offset_idx", 0) or 0) for c in cells) + num_cols = max(int(getattr(c, "end_col_offset_idx", 0) or 0) for c in cells) + if num_rows <= 0 or num_cols <= 0: + return [] + + grid: List[List[str]] = [["" for _ in range(num_cols)] for _ in range(num_rows)] + for cell in cells: + r = int(getattr(cell, "start_row_offset_idx", 0) or 0) + c = int(getattr(cell, "start_col_offset_idx", 0) or 0) + text = (getattr(cell, "text", "") or "").strip() + if 0 <= r < num_rows and 0 <= c < num_cols: + grid[r][c] = text + return grid + + +def _build_converter() -> Any: + """Construct a thin wrapper around the real Docling `DocumentConverter` + that takes raw PDF bytes. + + Imports are deferred to this function so that turning off the refiner + means no Docling/torch/transformers import ever happens. The wrapper + layer lets the refiner stay agnostic of Docling-specific stream types, + which keeps test injection simple. + """ + from docling.datamodel.base_models import DocumentStream, InputFormat + from docling.datamodel.pipeline_options import ( + PdfPipelineOptions, + TableFormerMode, + ) + from docling.document_converter import DocumentConverter, PdfFormatOption + + pipeline_options = PdfPipelineOptions( + do_ocr=False, + do_table_structure=True, + ) + pipeline_options.table_structure_options.mode = TableFormerMode.FAST + + real_converter = DocumentConverter( + format_options={ + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options), + }, + ) + + class _BytesConverter: + def convert(self, content: bytes): + stream = DocumentStream( + name="document.pdf", stream=io.BytesIO(content) + ) + return real_converter.convert(stream) + + return _BytesConverter() diff --git a/bioscancast/extraction/parsers/base.py b/bioscancast/extraction/parsers/base.py index 393d283..95b176c 100644 --- a/bioscancast/extraction/parsers/base.py +++ b/bioscancast/extraction/parsers/base.py @@ -24,6 +24,9 @@ class SectionContent: table_rows: 
Optional[List[List[str]]] = None """Row-major table data when chunk_type is 'table'.""" + extractor: Optional[str] = None + """Which backend produced this section ('pymupdf', 'pdfplumber', 'docling', ...).""" + @dataclass class ParsedContent: diff --git a/bioscancast/extraction/parsers/pdf_parser.py b/bioscancast/extraction/parsers/pdf_parser.py index 0ae263a..180b66c 100644 --- a/bioscancast/extraction/parsers/pdf_parser.py +++ b/bioscancast/extraction/parsers/pdf_parser.py @@ -67,12 +67,15 @@ def parse(self, content: bytes, *, source_url: str) -> ParsedContent: # Extract tables with PyMuPDF tables_on_page = self._extract_tables_pymupdf(page) + table_extractor = "pymupdf" # If PyMuPDF found no tables, try pdfplumber as fallback if not tables_on_page and self._page_looks_tabular(page): tables_on_page = self._extract_tables_pdfplumber( content, page_num ) + if tables_on_page: + table_extractor = "pdfplumber" for table_rows in tables_on_page: sections.append( @@ -82,6 +85,7 @@ def parse(self, content: bytes, *, source_url: str) -> ParsedContent: text="", chunk_type="table", table_rows=table_rows, + extractor=table_extractor, ) ) @@ -129,6 +133,7 @@ def parse(self, content: bytes, *, source_url: str) -> ParsedContent: page_number=page_number, text=combined, chunk_type="prose", + extractor="pymupdf", ) ) current_text_parts = [] @@ -147,6 +152,7 @@ def parse(self, content: bytes, *, source_url: str) -> ParsedContent: page_number=page_number, text=combined, chunk_type="prose", + extractor="pymupdf", ) ) diff --git a/bioscancast/extraction/pipeline.py b/bioscancast/extraction/pipeline.py index 76e9aed..0ae2d99 100644 --- a/bioscancast/extraction/pipeline.py +++ b/bioscancast/extraction/pipeline.py @@ -24,6 +24,8 @@ class ExtractionPipeline: def __init__(self, *, config: ExtractionConfig | None = None) -> None: self._config = config or ExtractionConfig() self._parsers = get_parsers(pdf_max_pages=self._config.pdf_max_pages) + # Lazily constructed on first PDF that reaches 
the refiner step. + self._docling_refiner = None def run(self, filtered_docs: List[FilteredDocument]) -> List[Document]: """Process documents in order of extraction_priority. @@ -97,8 +99,28 @@ def extract_one(self, filtered_doc: FilteredDocument) -> Document: fetch_result=fetch_result, ) - # Step 4: Convert ParsedContent → Document with chunks + # Step 3b: Docling table refiner (PDFs only, feature-flagged) document_type = self._detect_document_type(content_type) + if ( + self._config.enable_docling_refiner + and document_type == "pdf" + ): + refiner = self._get_docling_refiner() + if refiner is not None: + try: + parsed = refiner.refine( + parsed, + source_url=filtered_doc.url, + content=fetch_result.content_bytes, + ) + except Exception as exc: + logger.warning( + "Docling refiner failed for %s: %s", + filtered_doc.url, + exc, + ) + + # Step 4: Convert ParsedContent → Document with chunks chunks = self._build_chunks(parsed, doc_id) # Step 5: Normalize chunks @@ -149,6 +171,23 @@ def extract_one(self, filtered_doc: FilteredDocument) -> Document: extracted_dates=extracted_dates, ) + def _get_docling_refiner(self): + """Lazily build (and cache) the Docling refiner. + + Returns None if the heavy Docling imports or model load fail — the + pipeline then falls back to the in-tree parser output unchanged. 
+ """ + if self._docling_refiner is not None: + return self._docling_refiner + try: + from .docling_refiner import DoclingTableRefiner + + self._docling_refiner = DoclingTableRefiner(self._config) + except Exception as exc: + logger.warning("Docling refiner unavailable, continuing without: %s", exc) + self._docling_refiner = None + return self._docling_refiner + def _make_failed_document( self, fdoc: FilteredDocument, @@ -191,6 +230,7 @@ def _build_chunks( page_number=section.page_number, table_data=section.table_rows, token_count=approx_token_count(section.text), + extractor=section.extractor, ) ) return chunks diff --git a/bioscancast/schemas/document.py b/bioscancast/schemas/document.py index b521bbc..d415995 100644 --- a/bioscancast/schemas/document.py +++ b/bioscancast/schemas/document.py @@ -38,6 +38,9 @@ class DocumentChunk: token_count: Optional[int] = None """Approximate token count (tokeniser-dependent).""" + extractor: Optional[str] = None + """Backend that produced this chunk ('pymupdf', 'pdfplumber', 'docling', 'trafilatura', ...).""" + @dataclass class Document: diff --git a/bioscancast/tests/test_extraction_docling_refiner.py b/bioscancast/tests/test_extraction_docling_refiner.py new file mode 100644 index 0000000..842c2db --- /dev/null +++ b/bioscancast/tests/test_extraction_docling_refiner.py @@ -0,0 +1,613 @@ +"""Tests for bioscancast.extraction.docling_refiner. + +Docling is heavyweight (~1.5 GB RAM and ~10-30 s model load on construction). +Every test in this module uses a fake converter injected into the refiner, +so no real Docling model is ever loaded. 
+""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import List, Optional +from unittest.mock import MagicMock + +import pytest + +from bioscancast.extraction.config import ExtractionConfig +from bioscancast.extraction.docling_refiner import ( + DoclingTableRefiner, + _broken_table_reasons, + _docling_table_to_rows, + _merge_docling_tables_into_parsed, + _should_refine_by_url, +) +from bioscancast.extraction.parsers.base import ParsedContent, SectionContent + + +# --------------------------------------------------------------------------- +# Stubs that mimic the bits of the Docling object model we touch +# --------------------------------------------------------------------------- + + +@dataclass +class StubProv: + page_no: int + + +@dataclass +class StubTableCell: + start_row_offset_idx: int + end_row_offset_idx: int + start_col_offset_idx: int + end_col_offset_idx: int + text: str + + +@dataclass +class StubTableData: + num_rows: int + num_cols: int + table_cells: List[StubTableCell] + + +@dataclass +class StubTable: + data: StubTableData + prov: List[StubProv] + + +@dataclass +class StubDoclingDocument: + tables: List[StubTable] = field(default_factory=list) + + +def _make_stub_table( + rows: List[List[str]], *, page_no: int +) -> StubTable: + num_rows = len(rows) + num_cols = max(len(r) for r in rows) if rows else 0 + cells = [] + for r, row in enumerate(rows): + for c, value in enumerate(row): + cells.append( + StubTableCell( + start_row_offset_idx=r, + end_row_offset_idx=r + 1, + start_col_offset_idx=c, + end_col_offset_idx=c + 1, + text=value, + ) + ) + return StubTable( + data=StubTableData(num_rows=num_rows, num_cols=num_cols, table_cells=cells), + prov=[StubProv(page_no=page_no)], + ) + + +def _section( + chunk_type: str, + *, + page_number: Optional[int] = None, + table_rows: Optional[List[List[str]]] = None, + text: str = "", + extractor: Optional[str] = None, +) -> SectionContent: + return SectionContent( 
+ section_path=None, + page_number=page_number, + text=text, + chunk_type=chunk_type, + table_rows=table_rows, + extractor=extractor, + ) + + +# --------------------------------------------------------------------------- +# _should_refine_by_url +# --------------------------------------------------------------------------- + + +class TestShouldRefineByUrl: + def test_match_in_allowlist(self): + assert _should_refine_by_url( + "https://www.cdc.gov/mmwr/volumes/75/wr/mm7509a1.htm", + ["cdc.gov/mmwr/"], + ) + + def test_case_insensitive(self): + assert _should_refine_by_url( + "https://WWW.CDC.GOV/MMWR/foo.pdf", + ["cdc.gov/mmwr/"], + ) + + def test_no_match(self): + assert not _should_refine_by_url( + "https://reuters.com/world/article", ["cdc.gov/mmwr/"] + ) + + def test_empty_url(self): + assert not _should_refine_by_url("", ["cdc.gov/mmwr/"]) + + def test_empty_allowlist(self): + assert not _should_refine_by_url("https://cdc.gov/mmwr/x", []) + + +# --------------------------------------------------------------------------- +# _broken_table_reasons +# --------------------------------------------------------------------------- + + +class TestBrokenTableReasons: + def test_healthy_table_passes(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=1, + table_rows=[ + ["Country", "Cases"], + ["Sudan", "100"], + ["DRC", "250"], + ["Nigeria", "75"], + ], + ), + ], + ) + assert _broken_table_reasons(parsed, threshold=0.5) == [] + + def test_sparse_table_flagged(self): + rows = [ + ["A", "", "", "", "", "", "", "", "", "", "", "", ""], + ["", "", "", "", "", "", "", "", "", "", "", "", ""], + ["", "", "", "", "", "", "", "", "", "", "", "", ""], + ["", "", "", "", "", "", "", "", "", "", "", "", ""], + ["", "", "", "", "", "", "", "", "", "", "", "", ""], + ["", "", "", "", "", "", "", "", "", "", "", "", ""], + ] + parsed = ParsedContent( + raw_text="", + sections=[_section("table", page_number=4, table_rows=rows)], + ) + 
flagged = _broken_table_reasons(parsed, threshold=0.5) + assert len(flagged) == 1 + idx, reason = flagged[0] + assert idx == 0 + assert "page 4" in reason + assert "empty-cell ratio" in reason + + def test_over_segmented_flagged(self): + # Most rows have only one non-empty cell -- looks like per-column over-segmentation. + rows = [ + ["Header", "value"], + ["x", ""], + ["y", ""], + ["z", ""], + ["w", ""], + ["v", ""], + ] + parsed = ParsedContent( + raw_text="", + sections=[_section("table", page_number=2, table_rows=rows)], + ) + flagged = _broken_table_reasons(parsed, threshold=0.5) + assert len(flagged) == 1 + idx, reason = flagged[0] + assert idx == 0 + assert "over-segmented" in reason + + def test_skips_non_table_sections(self): + parsed = ParsedContent( + raw_text="", + sections=[_section("prose", page_number=1, text="hello world")], + ) + assert _broken_table_reasons(parsed, threshold=0.5) == [] + + def test_skips_too_small_tables(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=1, + table_rows=[["", ""], ["", ""]], + ), + ], + ) + assert _broken_table_reasons(parsed, threshold=0.5) == [] + + +# --------------------------------------------------------------------------- +# _docling_table_to_rows +# --------------------------------------------------------------------------- + + +class TestDoclingTableToRows: + def test_simple_grid(self): + stub = _make_stub_table( + [["Country", "Cases"], ["Sudan", "100"], ["DRC", "250"]], + page_no=1, + ) + rows = _docling_table_to_rows(stub) + assert rows == [["Country", "Cases"], ["Sudan", "100"], ["DRC", "250"]] + + def test_missing_data_returns_empty(self): + class _NoData: + data = None + + assert _docling_table_to_rows(_NoData()) == [] + + +# --------------------------------------------------------------------------- +# _merge_docling_tables_into_parsed +# --------------------------------------------------------------------------- + + +class TestMergeDoclingTables: + 
def test_replaces_in_tree_table_by_page(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=4, + table_rows=[["", ""], ["", ""], ["", ""]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[ + _make_stub_table( + [["State", "Count"], ["NM", "9"], ["TX", "11"]], + page_no=4, + ), + ], + ) + + result = _merge_docling_tables_into_parsed(parsed, docling_doc) + assert len(result.sections) == 1 + section = result.sections[0] + assert section.chunk_type == "table" + assert section.extractor == "docling" + assert section.table_rows == [ + ["State", "Count"], + ["NM", "9"], + ["TX", "11"], + ] + + def test_leaves_table_with_no_matching_page(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=4, + table_rows=[["A", "B"]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[ + _make_stub_table([["X", "Y"]], page_no=7), # different page + ], + ) + + result = _merge_docling_tables_into_parsed(parsed, docling_doc) + assert result.sections[0].extractor == "pymupdf" + assert result.sections[0].table_rows == [["A", "B"]] + + def test_multiple_tables_on_same_page_matched_in_order(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=2, + table_rows=[["?", "?"]], + extractor="pymupdf", + ), + _section( + "prose", + page_number=2, + text="some prose between", + extractor="pymupdf", + ), + _section( + "table", + page_number=2, + table_rows=[["?", "?"]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[ + _make_stub_table([["first", "1"]], page_no=2), + _make_stub_table([["second", "2"]], page_no=2), + ], + ) + + result = _merge_docling_tables_into_parsed(parsed, docling_doc) + tables = [s for s in result.sections if s.chunk_type == "table"] + assert tables[0].table_rows == [["first", "1"]] + assert tables[0].extractor == "docling" + assert 
tables[1].table_rows == [["second", "2"]] + assert tables[1].extractor == "docling" + # The prose chunk in between is preserved. + assert any(s.chunk_type == "prose" for s in result.sections) + + def test_leaves_prose_sections_alone(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section("prose", page_number=1, text="hello", extractor="pymupdf"), + ], + ) + docling_doc = StubDoclingDocument( + tables=[_make_stub_table([["X", "Y"]], page_no=1)], + ) + + result = _merge_docling_tables_into_parsed(parsed, docling_doc) + # Prose untouched; the docling table is inserted as a new section. + prose = [s for s in result.sections if s.chunk_type == "prose"] + tables = [s for s in result.sections if s.chunk_type == "table"] + assert len(prose) == 1 + assert prose[0].extractor == "pymupdf" + assert prose[0].text == "hello" + assert len(tables) == 1 + assert tables[0].extractor == "docling" + + def test_drops_unmatched_broken_intree_table(self): + # MMWR-style: in-tree's spurious table is on page 4, Docling's real + # table is on page 3. Page match fails. With broken_indices={0}, + # the in-tree section is dropped and Docling's table is inserted. + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=4, + table_rows=[["", ""], ["", ""], ["", ""]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[ + _make_stub_table( + [["Characteristic", "No. (%)"], ["Total", "99"], ["Sex", ""]], + page_no=3, + ), + ], + ) + + result = _merge_docling_tables_into_parsed( + parsed, docling_doc, broken_indices=frozenset([0]) + ) + tables = [s for s in result.sections if s.chunk_type == "table"] + assert len(tables) == 1 + assert tables[0].extractor == "docling" + assert tables[0].page_number == 3 + assert tables[0].table_rows[0] == ["Characteristic", "No. 
(%)"] + + def test_keeps_unmatched_clean_intree_table(self): + # If an in-tree table didn't match Docling AND wasn't flagged broken, + # keep it (Docling missed a legitimate table). + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=4, + table_rows=[["A", "B"], ["1", "2"]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[_make_stub_table([["X", "Y"]], page_no=3)], + ) + + result = _merge_docling_tables_into_parsed( + parsed, docling_doc, broken_indices=frozenset() + ) + tables = [s for s in result.sections if s.chunk_type == "table"] + # Original in-tree table preserved, plus inserted Docling table. + assert len(tables) == 2 + extractors = sorted(t.extractor for t in tables) + assert extractors == ["docling", "pymupdf"] + + def test_unmatched_docling_inserted_in_page_order(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section("prose", page_number=1, text="page1", extractor="pymupdf"), + _section("prose", page_number=5, text="page5", extractor="pymupdf"), + ], + ) + docling_doc = StubDoclingDocument( + tables=[_make_stub_table([["X", "Y"]], page_no=3)], + ) + + result = _merge_docling_tables_into_parsed(parsed, docling_doc) + # Inserted Docling table should sit between the page-1 prose and the page-5 prose. 
+ assert [s.page_number for s in result.sections] == [1, 3, 5] + assert result.sections[1].extractor == "docling" + + +# --------------------------------------------------------------------------- +# DoclingTableRefiner.refine() end-to-end with a fake converter +# --------------------------------------------------------------------------- + + +class _FakeResult: + def __init__(self, document): + self.document = document + + +class _FakeConverter: + def __init__(self, document): + self._document = document + self.convert_calls = 0 + + def convert(self, _stream): + self.convert_calls += 1 + return _FakeResult(self._document) + + +class TestDoclingTableRefinerEndToEnd: + def _config(self) -> ExtractionConfig: + return ExtractionConfig( + enable_docling_refiner=True, + docling_source_allowlist=["cdc.gov/mmwr/"], + docling_sparse_cell_threshold=0.5, + ) + + def test_triggers_on_allowlist(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=1, + table_rows=[["Country", "Cases"], ["Sudan", "5"]], + extractor="pymupdf", + ), + ], + ) + docling_doc = StubDoclingDocument( + tables=[_make_stub_table([["NM", "9"], ["TX", "11"]], page_no=1)], + ) + converter = _FakeConverter(docling_doc) + refiner = DoclingTableRefiner(self._config(), converter=converter) + + out = refiner.refine( + parsed, + source_url="https://www.cdc.gov/mmwr/volumes/75/wr/mm7509a1.htm", + content=b"%PDF-fake-bytes", + ) + + assert converter.convert_calls == 1 + assert out.sections[0].extractor == "docling" + assert out.sections[0].table_rows == [["NM", "9"], ["TX", "11"]] + + def test_triggers_on_heuristic(self): + # Sparse table -> heuristic fires even without allowlist match. 
+ rows = [["A", "", "", ""], ["", "", "", ""], ["", "", "", ""]] + parsed = ParsedContent( + raw_text="", + sections=[ + _section("table", page_number=3, table_rows=rows, extractor="pymupdf"), + ], + ) + docling_doc = StubDoclingDocument( + tables=[_make_stub_table([["Region", "n"], ["X", "1"]], page_no=3)], + ) + converter = _FakeConverter(docling_doc) + refiner = DoclingTableRefiner(self._config(), converter=converter) + + out = refiner.refine( + parsed, + source_url="https://example.org/random.pdf", + content=b"%PDF-fake-bytes", + ) + + assert converter.convert_calls == 1 + assert out.sections[0].extractor == "docling" + + def test_no_trigger_leaves_parsed_unchanged(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=1, + table_rows=[ + ["Country", "Cases"], + ["Sudan", "5"], + ["DRC", "100"], + ], + extractor="pymupdf", + ), + ], + ) + converter = _FakeConverter(StubDoclingDocument()) + refiner = DoclingTableRefiner(self._config(), converter=converter) + + out = refiner.refine( + parsed, + source_url="https://reuters.com/world/africa/article", + content=b"%PDF-fake-bytes", + ) + + assert converter.convert_calls == 0 + assert out.sections[0].extractor == "pymupdf" + + def test_short_circuits_on_requires_ocr(self): + parsed = ParsedContent( + raw_text="", + sections=[], + is_partial=True, + partial_reason="requires_ocr", + ) + converter = _FakeConverter(StubDoclingDocument()) + refiner = DoclingTableRefiner(self._config(), converter=converter) + + out = refiner.refine( + parsed, + source_url="https://www.cdc.gov/mmwr/volumes/75/wr/mm7509a1.htm", + content=b"%PDF-fake-bytes", + ) + + assert converter.convert_calls == 0 + assert out is parsed + + def test_converter_failure_falls_back_to_parsed(self): + parsed = ParsedContent( + raw_text="", + sections=[ + _section( + "table", + page_number=1, + table_rows=[["A", "B"]], + extractor="pymupdf", + ), + ], + ) + converter = MagicMock() + converter.convert.side_effect = 
RuntimeError("boom") + refiner = DoclingTableRefiner(self._config(), converter=converter) + + out = refiner.refine( + parsed, + source_url="https://www.cdc.gov/mmwr/volumes/75/wr/mm7509a1.htm", + content=b"%PDF-fake-bytes", + ) + assert out.sections[0].extractor == "pymupdf" + + +# --------------------------------------------------------------------------- +# Pipeline integration: extractor provenance flows through to DocumentChunk +# --------------------------------------------------------------------------- + + +def test_disabling_flag_skips_docling_construction(monkeypatch): + """With enable_docling_refiner=False the pipeline must never instantiate + a refiner (and therefore never touch any Docling import).""" + from bioscancast.extraction.pipeline import ExtractionPipeline + + pipeline = ExtractionPipeline( + config=ExtractionConfig(enable_docling_refiner=False) + ) + + def _fail(*_a, **_kw): + raise AssertionError("DoclingTableRefiner should not be constructed") + + monkeypatch.setattr( + "bioscancast.extraction.docling_refiner.DoclingTableRefiner.__init__", + _fail, + ) + # Force the pipeline's path that decides whether to call the refiner. + assert pipeline._config.enable_docling_refiner is False diff --git a/data/docling_eval/FINDINGS.md b/data/docling_eval/FINDINGS.md new file mode 100644 index 0000000..734b014 --- /dev/null +++ b/data/docling_eval/FINDINGS.md @@ -0,0 +1,192 @@ +# Docling Evaluation — Biosecurity Sources + +Ran `scripts/eval_docling.py` against 8 real biosecurity sources (5 PDFs + 3 HTML). Full per-source metrics are in [`run_log.json`](run_log.json); this file summarises what the outputs look like and where Docling struggles for our use case (ingesting WHO/CDC/ECDC/Africa-CDC outbreak documents). + +Environment: Docling 2.90.0 + docling-core 2.74.0 in a fresh `.venv-docling` (Python 3.13, Windows, CPU-only). 
OCR disabled (`do_ocr=False`) and `TableFormerMode.FAST` — with OCR on, the first source alone took >11 minutes and still hadn't finished, so the reported timings are the "fast-path" numbers. + +## Summary + +| Source | Category | Pages | Tables | Chunks | Elapsed | Status | +| --- | --- | ---:| ---:| ---:| ---:| --- | +| WHO Mpox Sitrep #64 | PDF (WHO sitrep) | 15 | 1 | 38 | **274.6 s** (slow) | ok | +| WHO Cholera Epi Update #34 | PDF (WHO sitrep) | 8 | 1 | 17 | 197.7 s | ok | +| CDC MMWR — NM Measles (mm7509a1) | PDF (MMWR) | 5 | 1 | 20 | 110.5 s | ok | +| ECDC CDTR Week 16 | PDF (ECDC) | 12 | 4 (all empty) | 28 | **324.6 s** (slow) | ok | +| Africa CDC Weekly (April 2026) | PDF (Africa CDC) | 15 | 2 (all empty) | **0** | **523.4 s** (slow) | ok† | +| Reuters — healthcare/pharma landing | HTML | — | — | — | 13.3 s | **error** (401) | +| CIDRAP — Utah measles | HTML | 0 | 0 | 16 | 13.8 s | ok | +| ProMED recent-posts listing | HTML | 0 | 1 | 17 | 13.8 s | ok | + +† Africa CDC returned 0 chunks — the PDF is fully image-based and yielded no extractable text with OCR off. + +7/8 succeeded, 3/8 breached the 240 s "slow" threshold, and 1 hard failure (Reuters, bot-protected). Total wall-clock for the 8 sources was ~25 minutes; first-run model download added ~40 MB and ~60 s on top. + +## What's in the Markdown + +### Tables — row/column structure and readable case counts + +| Source | Tables in doc | `num_rows × num_cols` | Readable from MD? | +| --- | ---:| --- | --- | +| WHO mpox sitrep | 1 | 9×4 | **Yes** — country / cases / deaths / reporting-countries readable: e.g. "Madagascar \| 368 \| 1 \| -". | +| WHO cholera update | 1 | 21×8 | **Yes** — full cholera-by-region table (country, cases, deaths, CFR, cases-per-100k, monthly % change). "Democratic Republic of the Congo \| 6 543 \| 148 \| 2.3 \| 5 \| 39 \| 66". | +| CDC MMWR | 1 | 17×2 | **Yes** — demographic/characteristic table rendered. 
| +| ECDC CDTR | 4 detected | **all 0×0** | **No** — TableFormer flagged the table regions but returned empty cells. Case numbers that sit inside the tables are missing from the markdown; in the body text, inline counts ("Italy (63), Spain (36), France (16)") do come through. | +| Africa CDC | 2 detected | **all 0×0** | **No** — image-only PDF, see below. | +| CIDRAP | 0 | — | n/a (article doesn't have tables). | +| ProMED listing | 1 | 157×2 | **Yes** — table of recent post titles by date renders cleanly. | + +Takeaway: Docling produces clean Markdown tables **when the PDF has native text tables** (the three WHO/CDC reports do). It silently degrades to empty cells when the tables are embedded as images or rely on OCR, and ECDC's CDTR layout falls into that bucket. For BioScanCast, this means case-count tables from WHO/CDC/MMWR are usable as-is, but ECDC/Africa-CDC tables will need either OCR-on fallback or an external data pipe. + +### Section headings + +Heading counts after conversion: mpox 40, cholera 15, MMWR 21, ECDC 30, CIDRAP 15, ProMED 7, Africa CDC 0. Order is preserved in all text-extractable sources: + +- WHO mpox: "## Highlights" → "## Epidemiological update" → "## Global monkeypox virus (MPXV) distribution" → "## Update on mpox outbreak transmission dynamics by virus clade" → "## Clade Ia MPXV" → "## Clade Ib MPXV" → … (matches the PDF's hierarchy). +- CDC MMWR: "## Abstract" → "## Introduction" → "## Investigation and Outcomes" → "## Notification of Confirmed Measles Cases in Texas" → "## Characteristics of Outbreak-Related Measles Cases" → "## Public Health Response" → "## Discussion" → "## Limitations" → "## Implications for Public Health Practice". +- ECDC CDTR: "## This week's topics" → "## Executive summary" → per-disease sections in order. + +Chunk `meta.headings` is populated, so the chunk exposes the full heading path (e.g. `['Measles Outbreak - New Mexico, 2025']`, `['Highlights']`). 
One caveat: the very first chunk of each doc has `headings=None` because it precedes the first `##` marker. + +### Reading order on multi-column PDFs + +MMWR uses a classic two-column journal layout. The output reads correctly: paragraphs in column 1 flow into column 2 without interleaving, footnote markers (`*`, `†`, `§`) stay attached, and footnote bodies are placed near their markers. One quirk: the "INSIDE" sidebar (which lives in column 2 of page 1) gets spliced between body paragraphs rather than being lifted out — annoying for reading but not a correctness issue. + +### HTML: nav / ads / footers stripping + +Docling's HTML pipeline does **not** strip boilerplate. + +- CIDRAP article: lines 1–47 are the site nav (Topics & Projects, Podcasts, About, Search, …), line 48 is the actual article H1, the article body runs ~lines 48–77, and the remaining ~260 lines are other articles, "Choose newsletters" CTAs, and footer. The first chunk's heading is `['Main navigation']`, and chunk 15's headings include `['Tetanus still occurs among all ages in US, mainly in undervaccinated', 'Choose newsletters']` — so **unwanted content is definitely in the chunk stream**. For BioScanCast, HTML news articles will need a `trafilatura`-style pre-pass (we already use it in the existing extraction stage) before handing text to Docling, or a post-pass to filter chunks whose heading path contains "navigation" / "newsletters" / etc. +- ProMED recent-posts: by coincidence the listing page is mostly a `<table>` of recent posts, which Docling preserves as a clean 157-row Markdown table. Good for headlines (MEASLES — ROMANIA, AVIAN INFLUENZA — INDIA (19), …) but not actual post bodies — those live at permalinks we didn't probe. + +### JavaScript-rendered sources + +- Reuters (`https://www.reuters.com/business/healthcare-pharmaceuticals/`): **fails** with `HTTPError: 401 Client Error: HTTP Forbidden` in 13 s.
Docling uses a default `requests`/`httpx` fetch that doesn't pass a browser-like user-agent, and Reuters' Cloudflare front rejects it. Any Reuters or AP-equivalent source will need an out-of-band fetch (Playwright, explicit UA, or a news API). +- ProMED listing page: the latest-posts table does render into the initial HTML, so Docling captured it fine. The individual post bodies behind each permalink are likely JS-rendered and would need a different approach. +- CIDRAP: fully server-rendered, Docling converted without issue. + +### Publication dates from metadata + +`pub_date` came back `None` for every source. Docling exposes `DoclingDocument.origin` but its only fields are `filename`, `mimetype`, `binary_hash` — no publication / creation date. Dates exist in the body text ("published 26 March 2026", "Week 16, 11–17 April 2026") but have to be extracted with a regex / LLM pass, not from document metadata. For BioScanCast, assume Docling won't give us a publication date; we need a separate parser over the first page. + +### Failures, timeouts, >3-minute runs + +- **Failure**: Reuters 401. Expected for any Cloudflare-fronted news site. +- **>3 min (slow)**: WHO mpox 274.6 s, ECDC CDTR 324.6 s, and Africa CDC 523.4 s all breached the 240 s "slow" threshold; WHO cholera (197.7 s) also exceeded 3 minutes but stayed under it. PDFs averaged ~26 s/page overall (about 1,431 s across 55 pages) with OCR off and TableFormer FAST on CPU; the mpox + ECDC PDFs are layout-dense (figures + tables + multi-column), and the Africa CDC PDF is pure images so the pipeline still runs layout detection on every page. +- **No hangs, no timeouts** — just slow. +- First-run model cost: ~40 MB of downloads the first time (RapidOCR det/rec models — still downloaded even with `do_ocr=False`, but not used; layout-heron, tableformer). Once cached, subsequent runs skip the download. + +### Africa CDC failure mode (0 chunks) + +The markdown for `africa_cdc_weekly_apr2026.md` is 266 bytes — 15 lines, each `<!-- image -->`.
The PDF has 15 pages but Docling extracted zero text because it's published as a scanned/rasterised document rather than a native-text PDF. OCR would be needed to recover anything; see the OCR cost section below for why that's not viable on this hardware. + +## OCR cost evaluation (ECDC CDTR week 16) + +Follow-up run via `scripts/eval_docling_ocr_cost.py`, using `convert(page_range=...)` to time individual pages and project full-doc cost. Results in [data/docling_eval/ocr/per_page_cost.json](ocr/per_page_cost.json): + +| Mode | Mean per page | Projected 12-page doc | Extra bytes vs OCR-off baseline | +| --- | ---:| ---:| --- | +| `do_ocr=False` (baseline) | 22.5 s | ~4.5 min | — (3214 B on p5, 3721 B on p10) | +| `do_ocr=True`, bitmap-only (default) | 132.6 s | ~26.5 min | **+57 B on p1, +0 B on p5/p10** | +| `do_ocr=True`, `force_full_page_ocr=True` | 1055.8 s on p5 alone | ~3.5 hours | **less** content (2753 B vs 3214 B) — OCR overwrote the clean text layer | + +The earlier full-doc OCR-on run was killed at 42 min before ECDC even finished — that was `force_full_page_ocr=True`. Even the saner default (~26.5 min projected) returns essentially nothing for ECDC because the "4 tables detected but 0×0" in the OCR-off run are layout-detection **false positives on chart/figure regions**, not real tables. The case counts ECDC actually publishes ("Italy (63), Spain (36), France (16) and Poland (five)") are already in the text-flow prose that OCR-off captures. Africa CDC was skipped: at the full-page OCR rate (≈3.5 hours projected for 12 pages, so roughly 4.4 hours for its 15 pages) it is unworkable on CPU. + +Practical conclusion: **don't enable Docling OCR on this hardware**. Use OCR-off everywhere. For genuinely scanned PDFs like Africa CDC, route to a different ingestion path (external OCR service, GPU host, or simply skip). + +## Recommendations for the BioScanCast pipeline + +1. **Keep OCR off everywhere** (`do_ocr=False`).
The OCR cost evaluation above showed bitmap-only OCR adds ~110 s/page of CPU work and recovers near-zero content on ECDC; full-page OCR is worse. For scanned-only PDFs (Africa CDC), OCR is the only path but the wall-clock makes it infeasible on CPU — handle out-of-band. +2. **HTML pre-filter**. Keep the existing `trafilatura` main-content extraction in the pipeline; hand Docling the cleaned article HTML rather than raw URLs, or drop Docling for HTML entirely and use the current HTML path. Nav/footer chunks from Docling's HTML pipeline are not useful. +3. **Reuters/AP**: Docling's default fetcher can't bypass Cloudflare (401). Feed it pre-fetched HTML from a UA-spoofing fetcher (the `curl` test in [data/docling_eval/sources/](sources/) showed that path works for CIDRAP), or skip news HTML sources in the Docling path. +4. **Publication date**: plan a separate extractor; Docling doesn't expose it. Tier as: HTML `<meta>` tags / JSON-LD via trafilatura → PDF `/CreationDate` via PyMuPDF (noisy) → regex over the first chunk's body text. +5. **Budget wall-clock**: expect 2-5 minutes/PDF on CPU even with OCR off; mpox sitrep was 4.6 min, ECDC 5.4 min, Africa CDC 8.7 min (and useless without OCR). A cron-driven BioScanCast scan that touches 10+ PDFs will want a worker pool or a GPU host; don't put this behind a synchronous API call. +6. **Tables**: the WHO/MMWR tables we care about (country/case/death matrices) come through cleanly as Markdown — downstream code can parse them with a simple Markdown-table reader. ECDC's "tables" are charts/figures and need to be read from the surrounding prose instead. + +## Head-to-head: Docling vs. in-tree `PdfParser` + +Run via `scripts/eval_intree_pdf.py` against the same 5 local PDFs in `data/docling_eval/sources/`. In-tree stack: PyMuPDF + pdfplumber-fallback + font-size heading heuristic + `
Date
Title
Wed Apr 22 2026
DENGUE - BANGLADESH (08): UPDATE
Wed Apr 22 2026
MALARIA - INDIA: (JHARKHAND) MILITARY PERSONNEL, FATAL
Wed Apr 22 2026
FOODBORNE ILLNESS - INDIA (07): (GUJARAT) WEDDING FEAST
Wed Apr 22 2026
DENGUE - CHINA: (HONG KONG) FIRST LOCALLY ACQUIRED CASE
Wed Apr 22 2026
MIDDLE EAST RESPIRATORY SYNDROME CORONAVIRUS - SOMALIA: MERS TRANSMISSION, CAMEL TO HUMAN
Wed Apr 22 2026
Острая кишечная инфекция (вспышка) – Россия (Челябинская, Вологодская области)
Wed Apr 22 2026
Болезнь Лайма – Россия (Челябинская и Калужская области)
Tue Apr 21 2026
MENINGOCOCCAL DISEASE - VIET NAM (13): (CA MAU) OUTBREAK BROUGHT UNDER CONTROL
Tue Apr 21 2026
MEASLES - SUDAN (02): (DARFUR, KORDOFAN) SURGE, FATAL, VACCINATION CAMPAIGN
Tue Apr 21 2026
FOODBORNE ILLNESS - SOUTH AFRICA: (LIMPOPO) FATAL, REQUEST FOR INFORMATION
Tue Apr 21 2026
FOOT & MOUTH DISEASE - BOTSWANA (02): CATTLE, TEMPORARY ABATTOIR CLOSURE, EXPANDING OUTBREAK
Tue Apr 21 2026
EQUINE INFLUENZA - USA (03): (OREGON) HORSE
Tue Apr 21 2026
CORONAVIRUS DISEASE 2019 UPDATE - NIGERIA: (CROSS RIVER) ex CHINA, CONFIRMED
Tue Apr 21 2026
ROCKY MOUNTAIN SPOTTED FEVER - MEXICO (10): (BAJA CALIFORNIA) HIGH CASE FATALITY RATE, ALERT
Tue Apr 21 2026
SYPHILIS - USA: (ALABAMA) INCREASE, BENZATHINE PENICILLIN SHORTAGE
Tue Apr 21 2026
MEASLES - CHINA (02): (HONG KONG) AIRPORT STAFF, CASE CLUSTER
Tue Apr 21 2026
MEASLES - SPAIN: UPDATE
Tue Apr 21 2026
SCHISTOSOMIASIS - ZIMBABWE: (MASHONALAND WEST) FATAL
Tue Apr 21 2026
MEASLES, MENINGITIS - CHAD: RESURGENCE
Tue Apr 21 2026
BABESIA HEGOTELFORUM – EUROPA: NUEVA ESPECIE CARACTERIZADA, INFECCIÓN EN HUMANOS
Tue Apr 21 2026
RABIES - BANGLADESH: (RANGPUR) FOX BITE, FATAL, SHORTAGE OF FREE VACCINE SUPPLY
Tue Apr 21 2026
SUTTONELLA ORNITHOCOLA - SWEDEN: BLUE TIT
Tue Apr 21 2026
CHOLERA - DEMOCRATIC REPUBLIC OF CONGO (12): (NORTH KIVU) FATAL, ALERT
Tue Apr 21 2026
FOODBORNE ILLNESS - VIET NAM (21): (NGHE AN) SUSPECTED, UPDATE
Tue Apr 21 2026
AVIAN INFLUENZA, HUMAN - CHINA (03): (GUANGDONG, YUNNAN, JIANGXI) H9N2
Tue Apr 21 2026
CHIKUNGUNYA – ESPAÑA: AUMENTO MARCADO DE INCIDENCIA, AMPLIACIÓN DE VENTANA DE TRANSMISIÓN
Tue Apr 21 2026
FOOD POISONING - CHINA (02): (HONG KONG) PORCINI
Tue Apr 21 2026
Зарубежное эпидобозрение - корь - Турция, Япония
Tue Apr 21 2026
Зарубежное эпидобозрение - лихорадка чикунгунья - Маврикий (2)
Tue Apr 21 2026
NOROVIRUS - USA (07): (WASHINGTON) SHELLFISH, ALERT, RECALL
Tue Apr 21 2026
AVIAN INFLUENZA - USA (18): (IDAHO, SOUTH DAKOTA, ARKANSAS) DAIRY CATTLE, RNA DETECTION IN SEMEN, 3 BIRD OUTBREAKS
Tue Apr 21 2026
MEASLES - LATVIA (03): INCREASING CASES
Tue Apr 21 2026
EQUINE HERPESVIRUS - USA (08): (VIRGINIA) MYELOENCEPHALOPATHY, HORSE, QUARANTINE
Tue Apr 21 2026
HANTAVIRUS - CHILE (17): (LOS LAGOS)
Tue Apr 21 2026
Зарубежное эпидобозрение - лихорадка чикунгунья - Маврикий (2)
Mon Apr 20 2026
NEW WORLD SCREWWORM - MEXICO (13): WILD ANIMALS, WHITE-TAILED DEER
Mon Apr 20 2026
AFRICAN SWINE FEVER - CHINA (03): MULTIPLE PROVINCES, DOMESTIC, SPREAD
Mon Apr 20 2026
UNDIAGNOSED ILLNESS - KENYA (02): (LAMU) RELATED FATALITY SUSPECTED, REQUEST FOR INFORMATION
Mon Apr 20 2026
BOTULISM - USA (10): (NEW YORK, NEW JERSEY) CRAB CAKES, RISK, RECALL
Mon Apr 20 2026
TUBERCULOSIS - UKRAINE: INCREASE IN CASES
Mon Apr 20 2026
SALMONELLOSIS, SEROTYPE BOCHUM - GERMANY: CHOCOLATE/HAZELNUT SPREAD, ALERT, RECALL, FATALITY
Mon Apr 20 2026
AVIAN INFLUENZA - INDIA (19): (CHHATTISGARH) BIRD, H5N1, WOAH
Mon Apr 20 2026
AVIAN INFLUENZA - INDIA (18): (KARNATAKA) POULTRY, H5N1
Mon Apr 20 2026
MEASLES - ROMANIA: HIGH RATES OF MEASLES, LOW RATES OF VACCINATION
Mon Apr 20 2026
E. COLI EHEC - USA (10): (ARIZONA) 2025, COUNTY FAIR PETTING ZOO BAN
Mon Apr 20 2026
MEASLES - USA (75): (CALIFORNIA, SOUTH CAROLINA, ARIZONA, OREGON, FLORIDA)
Mon Apr 20 2026
LEGIONELLOSIS - USA (10): (NORTH CAROLINA) INCREASED CASES, 2025
Mon Apr 20 2026
E. COLI EHEC - USA (09): O157, RAW CHEDDAR CHEESE, FDA FOLLOWUP
Mon Apr 20 2026
RESEARCH & INNOVATION (72): VIROELIXIR, STREPTOCOCCUS MUTANS, ANTIBACTERIAL ACTIVITY, GREEN TEA, POMEGRANATE
Mon Apr 20 2026
ANTIMICROBIAL STEWARDSHIP (103): CYTOSORB, SEPSIS, HEMOADSORPTION, DRUG BINDING, REVIEW
Sun Apr 19 2026
SURVEILLANCE (113): OMAN, ESCHERICHIA COLI, EXTENDED-SPECTRUM BETA-LACTAMASE, ANTIMICROBIAL RESISTANCE, GENOMIC SURVEILLANCE, ONE HEALTH
Sun Apr 19 2026
SURVEILLANCE (112): TÜRKIYE, CENTRAL LINE CATHETER, BLOODSTREAM INFECTIONS, CARBAPENEM RESISTANCE, INTENSIVE CARE UNIT
Sun Apr 19 2026
Shigellosis: USA, increasing incidence, antimicrobial resistance, 2023
Sun Apr 19 2026
ROTAVIRUS - USA
Sun Apr 19 2026
FOOT & MOUTH DISEASE - CHINA (08): MULTIPLE SPECIES, SEROTYPE SAT 1, SPREAD
Sun Apr 19 2026
HEPATITIS A - ITALY (03): (CAMPANIA) ELEMENTARY SCHOOL
Sun Apr 19 2026
MALARIA - SOUTH AFRICA (02): (GAUTENG) OUTBREAK, FATAL, ALARM
Sun Apr 19 2026
ANTHRAX - KENYA (02): (VIHIGA) CONFIRMED HUMAN CASES, FATAL
Sun Apr 19 2026
Сальмонеллез (вспышка) – Россия (Московская область)
Sun Apr 19 2026
Зарубежное эпидобозрение – оспа обезьян – США (Калифорния)
Sun Apr 19 2026
MPOX CLADO IB – COLOMBIA: (ANTIOQUIA) IDENTIFICACIÓN DE PRIMER CASO EN EL PAÍS
Sun Apr 19 2026
SCOMBROID FISH POISONING - PHILIPPINES (02): (NEGROS OCCIDENTAL)
Sun Apr 19 2026
Бешенство (летальный исход) - Россия (Омская область)
Sun Apr 19 2026
FOODBORNE ILLNESS - VIET NAM (20): (NGHE AN) SUSPECTED, BREAD CONSUMPTION
Sun Apr 19 2026
AVIAN INFLUENZA - NEPAL (04): MULTIPLE DISTRICTS, BIRD, HPAI H5N1, SPREAD
Sun Apr 19 2026
MPOX - USA (03): (CALIFORNIA) SAN FRANCISCO, CLADE I
Sun Apr 19 2026
SALMONELLOSIS - CANADA (05): (ONTARIO) RESTAURANT, MORE CASES
Sat Apr 18 2026
MELIOIDOSIS - THAILAND: CASES/FATALITIES
Sat Apr 18 2026
Лептоспироз - Россия (Республика Карелия)
Sat Apr 18 2026
HANTAVIRUS - CHILE (16): (LOS LAGOS) FATALITY
Sat Apr 18 2026
MALARIA – HONDURAS: (ISLAS DE LA BAHÍA) BROTE, P. VIVAX, MEDIDAS DE CONTENCIÓN
Sat Apr 18 2026
MEASLES - BANGLADESH (03): UPDATE
Sat Apr 18 2026
CHOLERA - DEMOCRATIC REPUBLIC OF CONGO (11): (SOUTH KIVU) EPIDEMIC UPDATE
Sat Apr 18 2026
AVIAN INFLUENZA - CHILE (06): (ARAUCANIA) HPAI H5N1, COMMERCIAL POULTRY
Sat Apr 18 2026
MPOX - DENMARK: CLADE 1B, FIRST REPORT
Sat Apr 18 2026
STREPTOCOCCUS SUIS - VIET NAM (02): (HO CHI MINH CITY) RAW PORK PROCESSING
Sat Apr 18 2026
MENINGOCOCCAL DISEASE - UK (12): (ENGLAND) SEROGROUP B, SEPARATE CASES
Sat Apr 18 2026
FOODBORNE ILLNESS - INDIA (06): (UTTARAKHAND) WEDDING
Sat Apr 18 2026
TRICHODERMA WHITE MOLD, MOREL - CHINA: NEW HOST
Sat Apr 18 2026
AVIAN INFLUENZA - USA (17): (ARKANSAS) COMMERCIAL POULTRY
Sat Apr 18 2026
FOODBORNE ILLNESS - INDIA (05): (ODISHA) SUSPECTED, STUDENTS
Sat Apr 18 2026
EQUINE HERPESVIRUS - USA (07): (NEW JERSEY) HORSE
Sat Apr 18 2026
GUILLAIN-BARRE SYNDROME - BRAZIL: RISK AFTER DENGUE INFECTION
Fri Apr 17 2026
ANTIMICROBIAL ENVIRONMENTAL CONTAMINATION (35): THAILAND, SHEWANELLA ALGAE, ANTIMICROBIAL RESISTANCE, PENAEUS MONODON
Fri Apr 17 2026
INVASIVE MOSQUITO - QATAR /بعوض -قطر: زيادة الانتشار
Fri Apr 17 2026
FEBRE AMARELA - BRASIL (04) (ESTADO DE SÃO PAULO), HUMANOS, OBITO, SURTO
Fri Apr 17 2026
Зарубежное эпидобозрение – корь – США (Юта)
Fri Apr 17 2026
Астраханская пятнистая лихорадка - Россия (Астраханская область)
Fri Apr 17 2026
MPOX - GHANA (05): EPIDEMIC UPDATE
Fri Apr 17 2026
AVIAN INFLUENZA - COTE D'IVOIRE: (ZANZAN) POULTRY, H5N1, WOAH
Fri Apr 17 2026
RESEARCH & INNOVATION (71): CARBAPENEM-RESISTANT ACINETOBACTER BAUMANNII, COLISTIN, ANTIMICROBIAL SUSCEPTIBILITY TESTING, VITEK2 AST-N440
Fri Apr 17 2026
RESEARCH & INNOVATION (70): RESPIRATORY PATHOGENS, ESSENTIAL OILS, ANTIBIOTIC ADJUVANTS, SYNERGY, BIOFILM
Fri Apr 17 2026
CRIMEAN-CONGO HEMORRHAGIC FEVER - IRAQ (02): (DHI QAR) NEW CASES, FATAL / حمى القرم الكنغولية النزفية - العراق (2): (ذي قار) حالات جديدة، مميتة
Fri Apr 17 2026
SURVEILLANCE (111): THAILAND, MALASSEZIA PACHYDERMATIS, NANOEMULSION, ANTIFUNGAL, SKIN AND EAR INFECTIONS, DOGS
Fri Apr 17 2026
ANTIMICROBIAL STEWARDSHIP (102): COMMUNITY-ACQUIRED PNEUMONIA, SHORT VERSUS LONGER ANTIBIOTIC DURATION, 2017-2024
Fri Apr 17 2026
ENFERMEDAD DE CAUSA DESCONOCIDA – BURUNDI: (MPANDA) BROTE, MUERTES, INVESTIGACIÓN EPIDEMIOLÓGICA EN CURSO
Fri Apr 17 2026
DENGUE - TONGA: INCREASE, OUTBREAK DECLARED, MOH
Fri Apr 17 2026
RABIES - PAKISTAN (05): (SINDH) DOG, HUMAN, FATAL
Fri Apr 17 2026
EPIZOOTIC HEMORRHAGIC DISEASE - USA (01): (OREGON) CATTLE
Fri Apr 17 2026
TETANUS - USA (02): (IDAHO, MINNESOTA, MISSOURI, WISCONSIN) CDC, 2024
Fri Apr 17 2026
STRANGLES - USA (03): (FLORIDA) HORSE
Fri Apr 17 2026
ANTHRAX - KENYA: (MERU) HUMAN, CONSUMPTION OF INFECTED COW MEAT
Fri Apr 17 2026
MENINGOCOCCAL DISEASE - VIET NAM (12): (HO CHI MINH CITY) MENINGITIS, FATAL
Thu Apr 16 2026
HAEMOPHILUS INFLUENZAE, GROUP B - USA: (ALASKA, OREGON, WASHINGTON) INVASIVE, HOMELESS/SUBSTANCE ABUSE ADULT CLUSTERS, SEQUENCE TYPE 6, 2023-2025
Thu Apr 16 2026
FOOT AND MOUTH DISEASE - ZAMBIA (02): (WESTERN) OUTBREAK CONFIRMED
Thu Apr 16 2026
GASTROENTERITIS - USA (02): CRUISE SHIP, CDC, INTERNATIONAL WATERS
Thu Apr 16 2026
SENECAVIRUS A, SWINE - USA: SWINE HEALTH, NATIONWIDE
Thu Apr 16 2026
FOODBORNE ILLNESS - ISRAEL: (TEL AVIV) CHILDREN, DAYCARE CENTERS, REQUEST FOR INFORMATION
Thu Apr 16 2026
MEASLES - JAPAN (05): CONTINUED INCREASE IN CASES
Thu Apr 16 2026
Q FEVER - AUSTRALIA: INCREASING CASES
Thu Apr 16 2026
RABIES - VIET NAM (16): (QUANG NGAI) DOG, HUMAN, RISK OF RABIES OUTBREAKS, CONTROL
Thu Apr 16 2026
FOOT & MOUTH DISEASE - CHINA (07): CATTLE, SEROTYPE SAT 1, WOAH
Thu Apr 16 2026
FOOT & MOUTH DISEASE - INDONESIA (03): (WEST JAVA) INTERCEPTED COWHIDES
Thu Apr 16 2026
Завозные инфекции (малярия, тропические лихорадки, холера, брюшной тиф) данные за 2025 год - Россия
Thu Apr 16 2026
Бруцеллез (сельскохозяйственные животные) - Россия (Краснодарский край)
Thu Apr 16 2026
Болезнь Лайма - Россия (Санкт-Петербург, Ленинградская область, Орловская область, Ставропольский край)
Thu Apr 16 2026
Зарубежное эпидобозрение – полиомиелит (cVDPV2) – Демократическая Республика Конго, Нигерия, Сомали
Thu Apr 16 2026
DENGUE - BANGLADESH (07): UPDATE
Thu Apr 16 2026
LASSA FEVER - NIGERIA (26): (OYO) FATAL
Thu Apr 16 2026
YELLOW FEVER - BOLIVIA: (SANTA CRUZ) FATAL
Thu Apr 16 2026
CHOLERA - CAYMAN ISLANDS
Thu Apr 16 2026
SARAMPIÓN – GUATEMALA (03): BROTE EN PROGRESO, MUERTES PEDIÁTRICAS
Thu Apr 16 2026
MPOX UPDATE - PAKISTAN (07): (SINDH) SPREAD, LOCAL TRANSMISSION
Thu Apr 16 2026
CHIKUNGUNYA - INDIA (02): (KERALA)
Thu Apr 16 2026
HIV/AIDS - INDIA (02): (NAGALAND) SCREENING
Thu Apr 16 2026
FIEBRE AMARILLA – BOLIVIA: (SANTA CRUZ) BROTE, CASO LETAL, CASOS ADICIONALES, ACCIONES DE BLOQUEO
Thu Apr 16 2026
FOOT & MOUTH DISEASE - CHINA (06): MULTIPLE REGIONS, LIVESTOCK, PREVENTION AND CONTROL
Thu Apr 16 2026
MENINGOCOCCAL DISEASE - TAIWAN: UNIDENTIFIED SOURCE, REQUEST FOR INFORMATION
Thu Apr 16 2026
AVIAN INFLUENZA, HUMAN - CHINA (02): (GUANGDONG, GUANGXI ZHUANG) H9N2
Wed Apr 15 2026
MEASLES - USA (74): (UTAH, CALIFORNIA, OREGON, WASHINGTON)
Wed Apr 15 2026
FOODBORNE ILLNESS - TAIWAN (02): (YUNLIN) AMUSEMENT PARK, STUDENTS, REQUEST FOR INFORMATION
Wed Apr 15 2026
SALMONELLOSIS, SEROTYPE INFANTIS - USA: REOCCURRING, EMERGING, OR PERSISTING STRAIN, HUMAN INFECTIONS, POULTRY SOURCE
Wed Apr 15 2026
TUBERCULOSIS - PALAU
Wed Apr 15 2026
COCCIDIOIDOMYCOSIS - USA: (CALIFORNIA) INCREASED OCCURRENCE IN PEDIATRIC POPULATION
Wed Apr 15 2026
TETANUS - USA: CDC, 2009-2023
Wed Apr 15 2026
SALMONELLOSIS - CANADA (04): (NOVA SCOTIA) RESTAURANT
Wed Apr 15 2026
CHIKUNGUNYA - BRASIL (02) (MATO GROSSO DO SUL), SURTO, ATUALIZAÇÃO, AUMENTO DO NÚMERO DE OBITOS
Wed Apr 15 2026
Геморрагическая лихорадка с почечным синдромом – Россия (Нижегородская область) (2)
Wed Apr 15 2026
SURVEILLANCE (110): ROMANIA, FALCONS, GRAM-NEGATIVE BACTERIA, ANTIMICROBIAL RESISTANCE
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (101): DRUG ALLERGIES, PENICILLINS, GENERAL PRACTICE, DELABELING
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (100): HOSPITAL-ACQUIRED INFECTIONS, ENTEROCOCCUS FAECALIS, GENTAMICIN RESISTANCE, HOSPITAL STAY
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (99): INDONESIA, HIGH ANTIBIOTIC EXPOSURE, EMERGENCY DEPARTMENT, 2022
Wed Apr 15 2026
SURVEILLANCE (109): CHINA, STREPTOCOCCUS SUIS SEROTYPE 9, PROPHAGES, GENOMIC ARCHITECTURE AND RESERVOIRS, VIRULENCE GENES
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (98): VIET NAM, FUNGAL SKIN INFECTION, MISUSING TOPICAL MEDICATIONS
Wed Apr 15 2026
MUMPS - USA (02): (MINNESOTA)
Wed Apr 15 2026
AVIAN INFLUENZA - CHILE (05): (LOS RIOS) SWAN, HPAI H5N1
Wed Apr 15 2026
FUSARIUM HEAD BLIGHT, WHEAT - ETHIOPIA: NEW SPECIES
Wed Apr 15 2026
MEASLES, PERTUSSIS - SUDAN: (NORTH DARFUR) INCREASING INCIDENCE, FATAL / الحصبة – السودان: (شرق دارفور)مميتة، زيادة الانتشار،
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (97): ACUTE APPENDICITIS, ANTIMICROBIAL RESISTANCE, ANTIBIOTICS, GENTAMICIN
Wed Apr 15 2026
ANTIMICROBIAL STEWARDSHIP (96): BACTERIAL BIOFILMS, ANTIMICROBIAL RESISTANCE, BIOFILM THERAPEUTICS, REVIEW
Wed Apr 15 2026
RESEARCH & INNOVATION (69): ACINETOBACTER BAUMANNII, TORMENTIL, MULTIDRUG RESISTANCE, POTENTIATING COLISTIN, IRON HOMEOSTASIS
Wed Apr 15 2026
SURVEILLANCE (108): CHINA, STAPHYLOCOCCUS AUREUS, CYPERUS ESCULENTUS L. LEAF-STEM EXTRACTS, SYNERGISTIC ANTIBACTERIAL EFFECTS
Wed Apr 15 2026
SURVEILLANCE (107): CHINA, STREPTOCOCCUS MUTANS BIOFILM, ORAL MICROBIOTA DIVERSITY, POLYMETHYL METHACRYLATE MICROPLASTICS
Wed Apr 15 2026
RESEARCH & INNOVATION (68): PROTON-ACTIVATED CHLORIDE CHANNEL, NOVEL THERAPY, BACTERIAL SEPSIS, IN VIVO
Wed Apr 15 2026
SURVEILLANCE (106): COSTA RICA, CARBAPENEM-RESISTANT ACINETOBACTER BAUMANNII, ANTIMICROBIAL RESISTANCE, GENOMIC SURVEILLANCE
Wed Apr 15 2026
SURVEILLANCE (105): SOUTH KOREA, SHRIMP JEOTGAL, BACILLUS SPECIES SNB-066, NEW HYDROXYL FATTY ACIDS, ANTIMICROBIAL AND ANTICANCER PROPERTIES

Why Choose ProMED?

Curated Reports by Human Experts

Our team of subject matter experts meticulously verifies and analyzes 5-20 critical health events daily. Unlike AI-driven systems, our human moderators provide nuanced context and expert commentary, ensuring you receive accurate, reliable, and actionable intelligence.

Comprehensive Historical Data

Gain access to our unparalleled archive spanning 31 years of outbreak data. This vast repository offers invaluable insights into disease patterns, emergence, and evolution, empowering researchers, policymakers, and health professionals with a robust historical context for current and future health challenges.

Trusted by Global Health Leaders

ProMED is the go-to resource for premier health organizations worldwide. From the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC) to cutting-edge AI-based systems and leading universities, our reports inform critical decision-making and research across the globe.

Unlock 31 Years of Outbreak Intelligence

Explore our vast archive of historical data to uncover trends, patterns, and insights that shape global health security.

International Society for

Infectious Diseases

867 Boylston Street

5th Floor #1985

Boston, MA 02116

USA

Phone +1-617-925-5272

Fax +1-617-865-7031

W ISID.org

ProMED Support

© 2026 International Society for Infectious Diseases. All Rights Reserved.

    \ No newline at end of file diff --git a/data/docling_eval/sources/who_cholera_epi34.pdf b/data/docling_eval/sources/who_cholera_epi34.pdf new file mode 100644 index 0000000..a65422a Binary files /dev/null and b/data/docling_eval/sources/who_cholera_epi34.pdf differ diff --git a/data/docling_eval/sources/who_mpox_sitrep64.pdf b/data/docling_eval/sources/who_mpox_sitrep64.pdf new file mode 100644 index 0000000..34e249b Binary files /dev/null and b/data/docling_eval/sources/who_mpox_sitrep64.pdf differ diff --git a/requirements.txt b/requirements.txt index b62800e..778005b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,6 +6,7 @@ trafilatura>=1.12,<2.0 # HTML main-content extraction (strips nav, ads, boilerpl beautifulsoup4>=4.12,<5.0 # HTML parsing for heading hierarchy and table recovery (transitive dep of trafilatura, pinned explicitly for direct usage) PyMuPDF>=1.24,<2.0 # PDF text and table extraction (imports as 'fitz') pdfplumber>=0.11,<1.0 # Fallback PDF table extraction for cases PyMuPDF mishandles +docling[chunking]>=2.90,<3.0 # TableFormer-based refinement for borderless / merged-cell PDF tables (first-run downloads ~40 MB to the HuggingFace cache) tiktoken>=0.7,<1.0 # Approximate token counting (cl100k_base encoding) openai>=1.0,<2.0 # OpenAI API client (used by filtering stage LLM calls) pytest>=8.0,<9.0 # Testing diff --git a/scripts/eval_docling.py b/scripts/eval_docling.py new file mode 100644 index 0000000..9dd4e2b --- /dev/null +++ b/scripts/eval_docling.py @@ -0,0 +1,363 @@ +"""Standalone Docling evaluation against real biosecurity sources. + +Converts a curated set of WHO/CDC/ECDC/Africa-CDC PDFs plus a few HTML news +articles using Docling, saves Markdown + JSON output per source, runs +HybridChunker (max_tokens=512), and writes a summary log. + +Not part of the BioScanCast package; uses its own venv (.venv-docling). 
+Run from the repo root: + + .venv-docling/Scripts/python.exe scripts/eval_docling.py + +Outputs: data/docling_eval/ +""" +from __future__ import annotations + +import json +import logging +import sys +import time +import traceback +from dataclasses import dataclass, field +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +# Line-buffer stdout/stderr so the progress log streams in real time. +sys.stdout.reconfigure(line_buffering=True) +sys.stderr.reconfigure(line_buffering=True) + +# Surface docling's own INFO logs (basicConfig writes to stderr) so model-download progress is visible. +logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s") + +print("Importing docling...", flush=True) + +# Docling imports +from docling.chunking import HybridChunker +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer +from transformers import AutoTokenizer + +print("docling imported OK.", flush=True) + + +REPO_ROOT = Path(__file__).resolve().parent.parent +OUT_DIR = REPO_ROOT / "data" / "docling_eval" +OUT_DIR.mkdir(parents=True, exist_ok=True) + +# Soft per-source threshold, not a hard timeout: Docling itself has no convert +# timeout, so a hanging conversion is never interrupted; we only measure elapsed +# time and flag the run once it returns. (Task asks to flag >3 min; this is 4 min.) +SOFT_TIMEOUT_SEC = 240 # flag anything over this as "slow" + + +@dataclass +class Source: + name: str # file-safe slug used for output files + category: str # who_don | cdc_mmwr | ecdc_cdtr | africa_cdc | reuters | cidrap | promed + url: str + notes: str = "" # any caveats/expectations + + +# Curated list of publicly accessible biosecurity sources (verified URLs). 
+# NOTE: WHO "Disease Outbreak News" items themselves are HTML-only on who.int, +# so we use WHO outbreak situation-report PDFs (mpox + cholera) which are the +# table-heavy, multi-section PDFs WHO publishes for the same outbreaks. +SOURCES: list[Source] = [ + Source( + name="who_mpox_sitrep64", + category="who_don", + url="https://cdn.who.int/media/docs/default-source/_sage-2026/multi-country-outbreak-of-mpox--external-situation-report_64.pdf?sfvrsn=10400a6e_4&download=true", + notes="WHO multi-country mpox external situation report #64 (table-heavy).", + ), + Source( + name="who_cholera_epi34", + category="who_don", + url="https://cdn.who.int/media/docs/default-source/documents/emergencies/situation-reports/20260221_multi-country_outbreak-of-cholera_epidemiological_update_34.pdf?sfvrsn=c367355_4&download=true", + notes="WHO multi-country cholera epidemiological update #34 (21 Feb 2026).", + ), + Source( + name="cdc_mmwr_nm_measles", + category="cdc_mmwr", + url="https://www.cdc.gov/mmwr/volumes/75/wr/pdfs/mm7509a1-H.pdf", + notes="MMWR Vol 75 No 9 (Mar 12 2026) — Measles Outbreak New Mexico 2025.", + ), + Source( + name="ecdc_cdtr_week16", + category="ecdc_cdtr", + url="https://www.ecdc.europa.eu/sites/default/files/documents/Communicable-disease-threats-report-week-16-2026.pdf", + notes="ECDC Communicable Disease Threats Report week 16 (12-18 Apr 2026).", + ), + Source( + name="africa_cdc_weekly_apr2026", + category="africa_cdc", + url="https://africacdc.org/download/africa-cdc-epidemic-intelligence-weekly-report-april-2026/?wpdmdl=24028", + notes="Africa CDC Epidemic Intelligence Weekly Report, April 2026.", + ), + Source( + name="reuters_bird_flu", + category="reuters", + # Reuters uses Cloudflare bot protection; docling's default httpx fetch + # typically returns 401/403. We include this source precisely to measure + # whether Docling can handle a hardened HTML source out of the box. 
+ url="https://www.reuters.com/business/healthcare-pharmaceuticals/", + notes="Reuters healthcare section front page (tests bot-protected HTML).", + ), + Source( + name="cidrap_utah_measles", + category="cidrap", + url="https://www.cidrap.umn.edu/measles/utah-measles-outbreak-tops-600-cases-now-most-active-us", + notes="CIDRAP news article — Utah measles outbreak tops 600 cases.", + ), + Source( + name="promed_latest", + category="promed", + # ProMED's public homepage lists recent posts; individual post permalinks + # are behind JS. We feed the list page to exercise HTML handling. + url="https://promedmail.org/promed-post/", + notes="ProMED recent-posts listing page (JS-heavy).", + ), +] + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def _safe_chunk_meta(chunk: Any) -> dict[str, Any]: + """Extract heading path + page refs from a chunk.meta, robust to schema.""" + meta: dict[str, Any] = {} + chunk_meta = getattr(chunk, "meta", None) + if chunk_meta is None: + return meta + + # Heading path: docling exposes chunk.meta.headings (list[str]) in hierarchical chunker. 
+ headings = getattr(chunk_meta, "headings", None) + if headings: + meta["headings"] = list(headings) + + # Page references: docling stores doc_items each with prov -> list[ProvenanceItem(page_no,...)] + pages: set[int] = set() + doc_items = getattr(chunk_meta, "doc_items", None) or [] + for item in doc_items: + provs = getattr(item, "prov", None) or [] + for p in provs: + page_no = getattr(p, "page_no", None) + if isinstance(page_no, int): + pages.add(page_no) + if pages: + meta["pages"] = sorted(pages) + + # Origin (source filename) if available + origin = getattr(chunk_meta, "origin", None) + if origin is not None: + origin_filename = getattr(origin, "filename", None) + if origin_filename: + meta["origin_filename"] = origin_filename + + return meta + + +def _count_tables(doc: Any) -> int: + tables = getattr(doc, "tables", None) or [] + try: + return len(tables) + except TypeError: + return 0 + + +def _count_pages(doc: Any) -> int: + pages = getattr(doc, "pages", None) + if pages is None: + return 0 + try: + return len(pages) + except TypeError: + return 0 + + +def _extract_pub_date(doc: Any) -> str | None: + """Best-effort: docling rarely exposes publication metadata for PDFs. + We look at doc.origin (filename/mimetype/binary_hash) and any top-level meta. + """ + origin = getattr(doc, "origin", None) + if origin is not None: + for attr in ("publication_date", "date", "created"): + val = getattr(origin, attr, None) + if val: + return str(val) + # Some converters attach metadata via doc.meta or doc.properties — try both. 
+ meta = getattr(doc, "meta", None) + if isinstance(meta, dict): + for key in ("publication_date", "date", "created", "creationDate"): + if key in meta and meta[key]: + return str(meta[key]) + return None + + +def convert_one(source: Source, converter: DocumentConverter, + chunker: HybridChunker) -> dict[str, Any]: + """Convert a single source, save outputs, return a metrics record.""" + record: dict[str, Any] = { + "name": source.name, + "category": source.category, + "url": source.url, + "notes": source.notes, + "status": "pending", + "elapsed_sec": None, + "pages": None, + "tables": None, + "chunks": None, + "pub_date": None, + "error": None, + "markdown_path": None, + "doc_json_path": None, + "chunks_json_path": None, + "slow": False, + } + + print(f"\n=== {source.name} ({source.category}) ===", flush=True) + print(f"URL: {source.url}", flush=True) + start = time.monotonic() + try: + result = converter.convert(source.url) + elapsed = time.monotonic() - start + record["elapsed_sec"] = round(elapsed, 2) + record["slow"] = elapsed > SOFT_TIMEOUT_SEC + + doc = result.document + + # Counts + record["pages"] = _count_pages(doc) + record["tables"] = _count_tables(doc) + record["pub_date"] = _extract_pub_date(doc) + + # Save Markdown + md_path = OUT_DIR / f"{source.name}.md" + md_path.write_text(doc.export_to_markdown(), encoding="utf-8") + record["markdown_path"] = str(md_path.relative_to(REPO_ROOT)) + + # Save full document JSON + doc_json_path = OUT_DIR / f"{source.name}.json" + doc_json_path.write_text( + json.dumps(doc.export_to_dict(), indent=2, default=str), + encoding="utf-8", + ) + record["doc_json_path"] = str(doc_json_path.relative_to(REPO_ROOT)) + + # Chunk + chunks_list: list[dict[str, Any]] = [] + for chunk in chunker.chunk(dl_doc=doc): + contextualized = chunker.contextualize(chunk=chunk) + # Token count (using chunker's tokenizer) + try: + token_count = chunker.tokenizer.count_tokens(text=contextualized) + except Exception: + token_count = None + entry: 
dict[str, Any] = { + "text": chunk.text, + "contextualized_text": contextualized, + "token_count": token_count, + } + entry.update(_safe_chunk_meta(chunk)) + chunks_list.append(entry) + + record["chunks"] = len(chunks_list) + chunks_json_path = OUT_DIR / f"{source.name}_chunks.json" + chunks_json_path.write_text( + json.dumps(chunks_list, indent=2, ensure_ascii=False), + encoding="utf-8", + ) + record["chunks_json_path"] = str(chunks_json_path.relative_to(REPO_ROOT)) + + record["status"] = "ok" + print( + f"OK elapsed={record['elapsed_sec']}s pages={record['pages']} " + f"tables={record['tables']} chunks={record['chunks']} " + f"pub_date={record['pub_date']}", + flush=True, + ) + if record["slow"]: + print(f"WARNING: conversion took >{SOFT_TIMEOUT_SEC}s (marked slow).", flush=True) + except Exception as exc: + elapsed = time.monotonic() - start + record["elapsed_sec"] = round(elapsed, 2) + record["status"] = "error" + record["error"] = f"{type(exc).__name__}: {exc}" + print(f"ERROR after {record['elapsed_sec']}s: {record['error']}", flush=True) + traceback.print_exc(limit=2) + + return record + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + +def main() -> int: + started_at = datetime.now(timezone.utc).isoformat() + print(f"Docling eval — started {started_at}", flush=True) + print(f"Output dir: {OUT_DIR}", flush=True) + + # Single converter reused across all sources. + # Disable OCR: every source in SOURCES is a born-digital PDF or HTML page, + # so OCR just burns 5-10 minutes per PDF on CPU without improving extraction. + # Use FAST TableFormer mode — the accurate model is roughly 3x slower. 
+ print("Constructing DocumentConverter (first run downloads layout models)...", flush=True) + pdf_opts = PdfPipelineOptions() + pdf_opts.do_ocr = False + pdf_opts.do_table_structure = True + pdf_opts.table_structure_options.mode = TableFormerMode.FAST + converter = DocumentConverter( + format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts)} + ) + print("Converter ready.", flush=True) + + # HybridChunker with max_tokens=512 on the all-MiniLM-L6-v2 tokenizer + # (the docling default, matches typical embedding contexts). + print("Loading HuggingFace tokenizer (all-MiniLM-L6-v2)...", flush=True) + hf_tokenizer = AutoTokenizer.from_pretrained( + "sentence-transformers/all-MiniLM-L6-v2" + ) + tokenizer = HuggingFaceTokenizer(tokenizer=hf_tokenizer, max_tokens=512) + chunker = HybridChunker(tokenizer=tokenizer) + print("Chunker ready.", flush=True) + + records: list[dict[str, Any]] = [] + for source in SOURCES: + rec = convert_one(source, converter, chunker) + records.append(rec) + + # Summary + finished_at = datetime.now(timezone.utc).isoformat() + summary = { + "started_at": started_at, + "finished_at": finished_at, + "total_sources": len(records), + "ok": sum(1 for r in records if r["status"] == "ok"), + "errors": sum(1 for r in records if r["status"] == "error"), + "slow": sum(1 for r in records if r.get("slow")), + "records": records, + } + (OUT_DIR / "run_log.json").write_text( + json.dumps(summary, indent=2, default=str), encoding="utf-8" + ) + + print("\n=== SUMMARY ===") + print(f"ok={summary['ok']} errors={summary['errors']} slow={summary['slow']}") + for r in records: + status = r["status"] + extra = ( + f"pages={r['pages']} tables={r['tables']} chunks={r['chunks']}" + if status == "ok" + else r["error"] + ) + print(f" [{status:>5}] {r['name']:35s} {r['elapsed_sec']}s {extra}") + + return 0 if summary["errors"] == 0 else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/eval_docling_ocr.py 
b/scripts/eval_docling_ocr.py new file mode 100644 index 0000000..f785c3b --- /dev/null +++ b/scripts/eval_docling_ocr.py @@ -0,0 +1,213 @@ +"""OCR-on re-conversion of the two PDFs that OCR-off couldn't handle well. + +- ECDC CDTR week 16: OCR-off detected 4 tables but all came back 0x0. +- Africa CDC weekly (April 2026): OCR-off yielded 0 chunks — pure image PDF. + +Reads the locally-downloaded copies under data/docling_eval/sources/ (from +the earlier run) to avoid re-fetching and any publisher-side drift, and +writes OCR-on outputs to data/docling_eval/ocr/ so the OCR-off outputs at +data/docling_eval/*.md stay intact for comparison. + +Run from the repo root: + + .venv-docling/Scripts/python.exe -u scripts/eval_docling_ocr.py +""" +from __future__ import annotations + +import json +import logging +import sys +import time +import traceback +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +sys.stdout.reconfigure(line_buffering=True) +sys.stderr.reconfigure(line_buffering=True) +logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s") + +print("Importing docling...", flush=True) +from docling.chunking import HybridChunker +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode +from docling.document_converter import DocumentConverter, PdfFormatOption +from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer +from transformers import AutoTokenizer +print("docling imported OK.", flush=True) + + +REPO_ROOT = Path(__file__).resolve().parent.parent +SRC_DIR = REPO_ROOT / "data" / "docling_eval" / "sources" +OUT_DIR = REPO_ROOT / "data" / "docling_eval" / "ocr" +OUT_DIR.mkdir(parents=True, exist_ok=True) + + +@dataclass +class Source: + name: str + path: Path + notes: str + + +SOURCES: list[Source] = [ + Source( + name="ecdc_cdtr_week16", + path=SRC_DIR / "ecdc_cdtr_week16.pdf", + 
notes="OCR-off: 4 tables detected but 0x0 — recover case-count tables via OCR.", + ), + Source( + name="africa_cdc_weekly_apr2026", + path=SRC_DIR / "africa_cdc_weekly_apr2026.pdf", + notes="OCR-off: 0 chunks extracted — entire PDF is scanned images.", + ), +] + + +def _safe_chunk_meta(chunk: Any) -> dict[str, Any]: + meta: dict[str, Any] = {} + chunk_meta = getattr(chunk, "meta", None) + if chunk_meta is None: + return meta + headings = getattr(chunk_meta, "headings", None) + if headings: + meta["headings"] = list(headings) + pages: set[int] = set() + for item in getattr(chunk_meta, "doc_items", None) or []: + for p in getattr(item, "prov", None) or []: + page_no = getattr(p, "page_no", None) + if isinstance(page_no, int): + pages.add(page_no) + if pages: + meta["pages"] = sorted(pages) + return meta + + +def convert_one(source: Source, converter: DocumentConverter, + chunker: HybridChunker) -> dict[str, Any]: + rec: dict[str, Any] = { + "name": source.name, + "path": str(source.path.relative_to(REPO_ROOT)), + "notes": source.notes, + "status": "pending", + "elapsed_sec": None, + "pages": None, + "tables": None, + "table_shapes": None, + "chunks": None, + "markdown_path": None, + "doc_json_path": None, + "chunks_json_path": None, + } + print(f"\n=== {source.name} (OCR=on) ===", flush=True) + print(f"Path: {source.path}", flush=True) + start = time.monotonic() + try: + result = converter.convert(str(source.path)) + elapsed = time.monotonic() - start + rec["elapsed_sec"] = round(elapsed, 2) + doc = result.document + + pages = getattr(doc, "pages", None) or {} + try: + rec["pages"] = len(pages) + except TypeError: + rec["pages"] = 0 + + tables = getattr(doc, "tables", None) or [] + rec["tables"] = len(tables) + shapes: list[str] = [] + for t in tables: + data = getattr(t, "data", None) + n_rows = getattr(data, "num_rows", None) if data is not None else None + n_cols = getattr(data, "num_cols", None) if data is not None else None + shapes.append(f"{n_rows}x{n_cols}") 
+ rec["table_shapes"] = shapes + + md_path = OUT_DIR / f"{source.name}.md" + md_path.write_text(doc.export_to_markdown(), encoding="utf-8") + rec["markdown_path"] = str(md_path.relative_to(REPO_ROOT)) + + doc_json_path = OUT_DIR / f"{source.name}.json" + doc_json_path.write_text( + json.dumps(doc.export_to_dict(), indent=2, default=str), encoding="utf-8" + ) + rec["doc_json_path"] = str(doc_json_path.relative_to(REPO_ROOT)) + + chunks_list: list[dict[str, Any]] = [] + for chunk in chunker.chunk(dl_doc=doc): + contextualized = chunker.contextualize(chunk=chunk) + try: + token_count = chunker.tokenizer.count_tokens(text=contextualized) + except Exception: + token_count = None + entry: dict[str, Any] = { + "text": chunk.text, + "contextualized_text": contextualized, + "token_count": token_count, + } + entry.update(_safe_chunk_meta(chunk)) + chunks_list.append(entry) + rec["chunks"] = len(chunks_list) + + chunks_json_path = OUT_DIR / f"{source.name}_chunks.json" + chunks_json_path.write_text( + json.dumps(chunks_list, indent=2, ensure_ascii=False), + encoding="utf-8", + ) + rec["chunks_json_path"] = str(chunks_json_path.relative_to(REPO_ROOT)) + + rec["status"] = "ok" + print( + f"OK elapsed={rec['elapsed_sec']}s pages={rec['pages']} " + f"tables={rec['tables']} shapes={shapes} chunks={rec['chunks']}", + flush=True, + ) + except Exception as exc: + rec["elapsed_sec"] = round(time.monotonic() - start, 2) + rec["status"] = "error" + rec["error"] = f"{type(exc).__name__}: {exc}" + print(f"ERROR after {rec['elapsed_sec']}s: {rec['error']}", flush=True) + traceback.print_exc(limit=2) + return rec + + +def main() -> int: + print(f"Output dir: {OUT_DIR}", flush=True) + print("Constructing DocumentConverter (OCR=on, RapidOCR, TableFormer FAST)...", flush=True) + pdf_opts = PdfPipelineOptions() + pdf_opts.do_ocr = True + pdf_opts.do_table_structure = True + pdf_opts.table_structure_options.mode = TableFormerMode.FAST + # Leave force_full_page_ocr off: Docling's default is to 
OCR only bitmap + # regions on otherwise-text pages, which is exactly what we want for + # ECDC (image-embedded tables on text pages). For the fully-scanned + # Africa CDC PDF, every page is one big bitmap so it'll get OCR'd anyway. + converter = DocumentConverter( + format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts)} + ) + print("Converter ready.", flush=True) + + print("Loading HuggingFace tokenizer (all-MiniLM-L6-v2)...", flush=True) + hf_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") + tokenizer = HuggingFaceTokenizer(tokenizer=hf_tokenizer, max_tokens=512) + chunker = HybridChunker(tokenizer=tokenizer) + print("Chunker ready.", flush=True) + + records = [convert_one(s, converter, chunker) for s in SOURCES] + (OUT_DIR / "run_log.json").write_text( + json.dumps({"records": records}, indent=2, default=str), encoding="utf-8" + ) + + print("\n=== SUMMARY (OCR=on) ===", flush=True) + for r in records: + extra = ( + f"pages={r['pages']} tables={r['tables']} shapes={r.get('table_shapes')} chunks={r['chunks']}" + if r["status"] == "ok" else r.get("error") + ) + print(f" [{r['status']:>5}] {r['name']:30s} {r['elapsed_sec']}s {extra}", flush=True) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/eval_docling_ocr_cost.py b/scripts/eval_docling_ocr_cost.py new file mode 100644 index 0000000..7679677 --- /dev/null +++ b/scripts/eval_docling_ocr_cost.py @@ -0,0 +1,123 @@ +"""Per-page OCR cost measurement on the ECDC CDTR PDF. + +Earlier run: OCR-on on the full 12-page ECDC CDTR didn't finish in 42 min +before we killed it, which tells us the wall-clock is prohibitive but not +the per-page rate. 
This script measures cost with `page_range`: + +- OCR-off baseline on pages 1, 5, 10 (3 samples across the doc) +- OCR-on (bitmap-only, default) on the same three pages +- OCR-on (forced full-page) on one page, for upper bound + +Writes each per-page Markdown to data/docling_eval/ocr/per_page_{label}_p{page:02d}.md +and logs timings to stdout. Reads from the local source PDF downloaded earlier. +""" +from __future__ import annotations + +import json +import logging +import sys +import time +from pathlib import Path + +sys.stdout.reconfigure(line_buffering=True) +sys.stderr.reconfigure(line_buffering=True) +logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(name)s %(levelname)s %(message)s") + +from docling.datamodel.base_models import InputFormat +from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode +from docling.document_converter import DocumentConverter, PdfFormatOption + + +REPO_ROOT = Path(__file__).resolve().parent.parent +SRC_PDF = REPO_ROOT / "data" / "docling_eval" / "sources" / "ecdc_cdtr_week16.pdf" +OUT_DIR = REPO_ROOT / "data" / "docling_eval" / "ocr" +OUT_DIR.mkdir(parents=True, exist_ok=True) + + +def _make_converter(do_ocr: bool, force_full_page: bool = False) -> DocumentConverter: + opts = PdfPipelineOptions() + opts.do_ocr = do_ocr + opts.do_table_structure = True + opts.table_structure_options.mode = TableFormerMode.FAST + if do_ocr and force_full_page: + opts.ocr_options.force_full_page_ocr = True + return DocumentConverter( + format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)} + ) + + +def _time_single_page(converter: DocumentConverter, page: int, label: str) -> dict: + start = time.monotonic() + result = converter.convert(str(SRC_PDF), page_range=(page, page)) + elapsed = time.monotonic() - start + doc = result.document + tables = getattr(doc, "tables", []) or [] + shapes = [ + f"{getattr(getattr(t,'data',None),'num_rows',None)}x{getattr(getattr(t,'data',None),'num_cols',None)}" + for t in tables 
+ ] + md = doc.export_to_markdown() + out_path = OUT_DIR / f"per_page_{label}_p{page:02d}.md" + out_path.write_text(md, encoding="utf-8") + n_tokens_approx = len(md.split()) + return { + "label": label, + "page": page, + "elapsed_sec": round(elapsed, 2), + "tables": len(tables), + "shapes": shapes, + "md_bytes": len(md), + "md_words_approx": n_tokens_approx, + "md_path": str(out_path.relative_to(REPO_ROOT)), + } + + +def main() -> int: + print(f"Source: {SRC_PDF}", flush=True) + print(f"Output: {OUT_DIR}", flush=True) + + pages_to_test = [1, 5, 10] + + results: list[dict] = [] + + print("\n--- OCR OFF (baseline) ---", flush=True) + conv_off = _make_converter(do_ocr=False) + for p in pages_to_test: + r = _time_single_page(conv_off, p, "ocroff") + print(f" page {p:>2}: elapsed={r['elapsed_sec']:>6.2f}s tables={r['tables']} shapes={r['shapes']} md_bytes={r['md_bytes']}", flush=True) + results.append(r) + + print("\n--- OCR ON (bitmap-only, default) ---", flush=True) + conv_on = _make_converter(do_ocr=True, force_full_page=False) + for p in pages_to_test: + r = _time_single_page(conv_on, p, "ocron") + print(f" page {p:>2}: elapsed={r['elapsed_sec']:>6.2f}s tables={r['tables']} shapes={r['shapes']} md_bytes={r['md_bytes']}", flush=True) + results.append(r) + + print("\n--- OCR ON (force_full_page), page 5 only ---", flush=True) + conv_full = _make_converter(do_ocr=True, force_full_page=True) + r = _time_single_page(conv_full, 5, "ocrfull") + print(f" page {5:>2}: elapsed={r['elapsed_sec']:>6.2f}s tables={r['tables']} shapes={r['shapes']} md_bytes={r['md_bytes']}", flush=True) + results.append(r) + + (OUT_DIR / "per_page_cost.json").write_text( + json.dumps({"results": results}, indent=2, default=str), encoding="utf-8" + ) + + # Summarise + off_times = [r["elapsed_sec"] for r in results if r["label"] == "ocroff"] + on_times = [r["elapsed_sec"] for r in results if r["label"] == "ocron"] + full_times = [r["elapsed_sec"] for r in results if r["label"] == "ocrfull"] + 
print("\n=== SUMMARY ===", flush=True) + print(f"OCR OFF mean/page: {sum(off_times)/len(off_times):>6.1f}s (pages={pages_to_test})", flush=True) + print(f"OCR ON mean/page: {sum(on_times)/len(on_times):>6.1f}s (pages={pages_to_test})", flush=True) + print(f"OCR ON marginal cost/page: {(sum(on_times)/len(on_times)) - (sum(off_times)/len(off_times)):>6.1f}s", flush=True) + print(f"OCR FULL-PAGE page 5: {full_times[0]:>6.1f}s (vs bitmap-only {[r['elapsed_sec'] for r in results if r['label']=='ocron' and r['page']==5][0]:.1f}s)", flush=True) + total_12p_on = (sum(on_times)/len(on_times)) * 12 + print(f"Projected OCR-on total for 12-page ECDC CDTR: ~{total_12p_on/60:.1f} min", flush=True) + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/eval_hybrid_pdf.py b/scripts/eval_hybrid_pdf.py new file mode 100644 index 0000000..ec88d4a --- /dev/null +++ b/scripts/eval_hybrid_pdf.py @@ -0,0 +1,346 @@ +"""Hybrid eval: run the in-tree PdfParser + DoclingTableRefiner combo +against the same 5 PDFs Docling/in-tree have already been benchmarked on. + +This exercises the new code path from issue #16: in-tree parses, refiner +inspects and (conditionally) replaces table sections with Docling's +rendering when the source URL is on the allowlist OR the in-tree tables +look broken. 
+ +Reads from data/docling_eval/sources/*.pdf +Writes: + - data/docling_eval/hybrid_pdf/{name}.md Markdown re-emitted from refined ParsedContent + - data/docling_eval/hybrid_pdf/{name}.json Full refined ParsedContent + - data/docling_eval/hybrid_pdf/run_log.json Per-source metrics + trigger info + +Run from repo root (uses the docling venv since it imports docling): + + .venv-docling/Scripts/python.exe -u scripts/eval_hybrid_pdf.py +""" +from __future__ import annotations + +import json +import logging +import sys +import time +from pathlib import Path +from typing import Any + +REPO_ROOT = Path(__file__).resolve().parent.parent +sys.path.insert(0, str(REPO_ROOT)) + +from bioscancast.extraction.config import ExtractionConfig # noqa: E402 +from bioscancast.extraction.docling_refiner import ( # noqa: E402 + DoclingTableRefiner, + _broken_table_reasons, + _should_refine_by_url, +) +from bioscancast.extraction.parsers.pdf_parser import PdfParser # noqa: E402 + +SRC_DIR = REPO_ROOT / "data" / "docling_eval" / "sources" +OUT_DIR = REPO_ROOT / "data" / "docling_eval" / "hybrid_pdf" +OUT_DIR.mkdir(parents=True, exist_ok=True) + +# (source basename, plausible publisher URL). 
URLs are constructed so that +# allowlist patterns fire exactly where the issue says they should: +# - MMWR -> matches `cdc.gov/mmwr/` +# - WHO cholera sitrep -> matches the situation-reports path +# - WHO mpox sitrep (this particular one) -> does NOT match +# - ECDC CDTR -> not on allowlist +# - Africa CDC weekly -> not on allowlist (will short-circuit on requires_ocr anyway) +SOURCES: list[tuple[str, str]] = [ + ( + "who_mpox_sitrep64", + "https://cdn.who.int/media/docs/default-source/documents/emergencies/outbreak-reports/2025-mpox-external-sitrep-64.pdf", + ), + ( + "who_cholera_epi34", + "https://cdn.who.int/media/docs/default-source/documents/emergencies/situation-reports/who-cholera-epi-update-34.pdf", + ), + ( + "cdc_mmwr_nm_measles", + "https://www.cdc.gov/mmwr/volumes/75/wr/mm7509a1.htm", + ), + ( + "ecdc_cdtr_week16", + "https://www.ecdc.europa.eu/sites/default/files/documents/communicable-disease-threats-report-week-16-2025.pdf", + ), + ( + "africa_cdc_weekly_apr2026", + "https://africacdc.org/download/weekly-event-based-surveillance-report-april-2026/", + ), +] + + +# ---------- markdown rendering (mirrors eval_intree_pdf.py) ---------- + + +def _table_to_md(rows: list[list[str]]) -> str: + if not rows: + return "" + n_cols = max(len(r) for r in rows) + norm = [r + [""] * (n_cols - len(r)) for r in rows] + header = norm[0] + body = norm[1:] + lines = ["| " + " | ".join(c.replace("\n", " ").strip() for c in header) + " |"] + lines.append("| " + " | ".join(["---"] * n_cols) + " |") + for row in body: + lines.append("| " + " | ".join(c.replace("\n", " ").strip() for c in row) + " |") + return "\n".join(lines) + + +def _emit_markdown(parsed) -> str: + lines: list[str] = [] + if parsed.title: + lines.append(f"# {parsed.title}\n") + if parsed.published_date: + lines.append(f"*Published: {parsed.published_date.date()}*\n") + if parsed.page_count: + lines.append(f"*Pages: {parsed.page_count}*\n") + + last_path: str | None = None + for s in parsed.sections: + 
path = s.section_path or "" + if path != last_path and path: + depth = path.count(" > ") + 2 + depth = min(depth, 6) + lines.append(f"\n{'#' * depth} {path.split(' > ')[-1]}\n") + last_path = path + if s.chunk_type == "table" and s.table_rows: + if s.page_number: + lines.append( + f"\n*Table on page {s.page_number} " + f"(extractor: {s.extractor or 'unknown'}):*\n" + ) + lines.append(_table_to_md(s.table_rows) + "\n") + elif s.text: + lines.append(s.text + "\n") + return "\n".join(lines) + + +def _section_summary(parsed) -> dict[str, Any]: + table_sections = [s for s in parsed.sections if s.chunk_type == "table"] + prose_sections = [s for s in parsed.sections if s.chunk_type == "prose"] + table_cells = sum( + len(s.table_rows or []) + * (len((s.table_rows or [[]])[0]) if s.table_rows else 0) + for s in table_sections + ) + table_shapes = [ + f"{len(s.table_rows or [])}x{(len((s.table_rows or [[]])[0]) if s.table_rows else 0)}" + for s in table_sections + ] + extractors = [s.extractor for s in table_sections] + docling_tables = sum(1 for e in extractors if e == "docling") + return { + "n_sections": len(parsed.sections), + "n_prose": len(prose_sections), + "n_tables": len(table_sections), + "table_shapes": table_shapes, + "table_cells_total": table_cells, + "table_extractors": extractors, + "n_tables_docling": docling_tables, + "raw_text_chars": len(parsed.raw_text), + "is_partial": parsed.is_partial, + "partial_reason": parsed.partial_reason, + } + + +# ---------- main ---------- + + +def main() -> int: + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s %(name)s %(levelname)s %(message)s", + ) + + parser = PdfParser() + config = ExtractionConfig() # default: refiner enabled, default allowlist + refiner = DoclingTableRefiner(config) # converter lazily built on first trigger + + results: list[dict[str, Any]] = [] + + for name, source_url in SOURCES: + pdf_path = SRC_DIR / f"{name}.pdf" + if not pdf_path.exists(): + print(f"SKIP {name}: file not found", 
flush=True)
+            continue
+
+        print(f"\n=== {name} ===", flush=True)
+        print(f" source_url: {source_url}", flush=True)
+        content = pdf_path.read_bytes()
+
+        # ---- parse with in-tree ----
+        start = time.monotonic()
+        try:
+            parsed = parser.parse(content, source_url=source_url)
+        except Exception as exc:
+            elapsed = time.monotonic() - start
+            print(
+                f"PARSE ERROR after {elapsed:.2f}s: {type(exc).__name__}: {exc}",
+                flush=True,
+            )
+            results.append(
+                {
+                    "name": name,
+                    "source_url": source_url,
+                    "status": "parse_error",
+                    "elapsed_sec": round(elapsed, 2),
+                    "error": f"{type(exc).__name__}: {exc}",
+                }
+            )
+            continue
+        intree_elapsed = time.monotonic() - start
+
+        # ---- predict triggers (so we record *why* the refiner runs or doesn't) ----
+        would_trigger_url = _should_refine_by_url(
+            source_url, config.docling_source_allowlist
+        )
+        broken_reasons = _broken_table_reasons(
+            parsed, threshold=config.docling_sparse_cell_threshold
+        )
+        ocr_short_circuit = (
+            parsed.is_partial and parsed.partial_reason == "requires_ocr"
+        )
+
+        # Count tables before refinement (for diff after).
+        in_tree_table_shapes = [
+            f"{len(s.table_rows or [])}x{(len((s.table_rows or [[]])[0]) if s.table_rows else 0)}"
+            for s in parsed.sections
+            if s.chunk_type == "table"
+        ]
+
+        # ---- run refiner ----
+        refine_start = time.monotonic()
+        try:
+            refined = refiner.refine(parsed, source_url=source_url, content=content)
+        except Exception as exc:
+            refine_elapsed = time.monotonic() - refine_start
+            print(
+                f"REFINE ERROR after {refine_elapsed:.2f}s: "
+                f"{type(exc).__name__}: {exc}",
+                flush=True,
+            )
+            results.append(
+                {
+                    "name": name,
+                    "source_url": source_url,
+                    "status": "refine_error",
+                    "intree_elapsed_sec": round(intree_elapsed, 2),
+                    "refine_elapsed_sec": round(refine_elapsed, 2),
+                    "error": f"{type(exc).__name__}: {exc}",
+                }
+            )
+            continue
+        refine_elapsed = time.monotonic() - refine_start
+
+        # ---- write artefacts ----
+        md_path = OUT_DIR / f"{name}.md"
+        md_path.write_text(_emit_markdown(refined), encoding="utf-8")
+
+        json_dump = {
+            "source_url": source_url,
+            "title": refined.title,
+            "published_date": (
+                refined.published_date.isoformat() if refined.published_date else None
+            ),
+            "page_count": refined.page_count,
+            "is_partial": refined.is_partial,
+            "partial_reason": refined.partial_reason,
+            "raw_text_chars": len(refined.raw_text),
+            "sections": [
+                {
+                    "section_path": s.section_path,
+                    "page_number": s.page_number,
+                    "chunk_type": s.chunk_type,
+                    "text": s.text,
+                    "table_rows": s.table_rows,
+                    "extractor": s.extractor,
+                }
+                for s in refined.sections
+            ],
+        }
+        (OUT_DIR / f"{name}.json").write_text(
+            json.dumps(json_dump, indent=2, default=str), encoding="utf-8"
+        )
+
+        summary = _section_summary(refined)
+        rec = {
+            "name": name,
+            "source_url": source_url,
+            "status": "ok",
+            "intree_elapsed_sec": round(intree_elapsed, 2),
+            "refine_elapsed_sec": round(refine_elapsed, 2),
+            "trigger": {
+                "url_match": would_trigger_url,
+                "broken_reasons": broken_reasons,
+                "ocr_short_circuit": ocr_short_circuit,
+            },
+            "title": refined.title,
+            "published_date": (
+                refined.published_date.isoformat() if refined.published_date else None
+            ),
+            "page_count": refined.page_count,
+            "in_tree_table_shapes": in_tree_table_shapes,
+            **summary,
+        }
+        results.append(rec)
+        print(
+            f"OK intree={rec['intree_elapsed_sec']}s "
+            f"refine={rec['refine_elapsed_sec']}s "
+            f"pages={rec['page_count']} sections={rec['n_sections']} "
+            f"tables={rec['n_tables']} "
+            f"docling_tables={rec['n_tables_docling']} "
+            f"shapes={rec['table_shapes']} "
+            f"chars={rec['raw_text_chars']}",
+            flush=True,
+        )
+        print(
+            f" trigger: url_match={would_trigger_url} "
+            f"broken={len(broken_reasons)} "
+            f"ocr_short_circuit={ocr_short_circuit}",
+            flush=True,
+        )
+        if broken_reasons:
+            for r in broken_reasons:
+                print(f" - {r}", flush=True)
+        if rec.get("partial_reason"):
+            print(f" partial: {rec['partial_reason']}", flush=True)
+
+    (OUT_DIR / "run_log.json").write_text(
+        json.dumps({"results": results}, indent=2, default=str), encoding="utf-8"
+    )
+
+    print("\n=== SUMMARY (hybrid: in-tree + Docling refiner) ===", flush=True)
+    for r in results:
+        if r["status"] == "ok":
+            trig = r["trigger"]
+            trigger_label = (
+                "URL"
+                if trig["url_match"]
+                else "HEURISTIC"
+                if trig["broken_reasons"]
+                else "OCR-SKIP"
+                if trig["ocr_short_circuit"]
+                else "NONE"
+            )
+            print(
+                f" [ ok] {r['name']:30s} "
+                f"intree={r['intree_elapsed_sec']:>6.2f}s "
+                f"refine={r['refine_elapsed_sec']:>6.2f}s "
+                f"trigger={trigger_label:9s} "
+                f"tables={r['n_tables']} ({r['n_tables_docling']} docling) "
+                f"shapes={r['table_shapes']}",
+                flush=True,
+            )
+        else:
+            print(
+                f" [{r['status']:>6s}] {r['name']:30s} {r.get('error')}",
+                flush=True,
+            )
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/scripts/eval_intree_pdf.py b/scripts/eval_intree_pdf.py
new file mode 100644
index 0000000..74ecb84
--- /dev/null
+++ b/scripts/eval_intree_pdf.py
@@ -0,0 +1,201 @@
+"""Head-to-head: run the in-tree PdfParser against the same 5 PDFs Docling
+already converted, so we can eyeball Markdown output and compare metrics.
+
+Reads from data/docling_eval/sources/*.pdf, writes:
+- data/docling_eval/intree_pdf/{name}.md — Markdown re-emitted from ParsedContent
+- data/docling_eval/intree_pdf/{name}.json — full ParsedContent (sections + metadata)
+- data/docling_eval/intree_pdf/run_log.json — aggregate metrics
+
+Run from repo root:
+
+    .venv-docling/Scripts/python.exe -u scripts/eval_intree_pdf.py
+"""
+from __future__ import annotations
+
+import json
+import sys
+import time
+from dataclasses import asdict
+from pathlib import Path
+from typing import Any
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(REPO_ROOT))  # for `import bioscancast.*`
+
+from bioscancast.extraction.parsers.pdf_parser import PdfParser  # noqa: E402
+
+SRC_DIR = REPO_ROOT / "data" / "docling_eval" / "sources"
+OUT_DIR = REPO_ROOT / "data" / "docling_eval" / "intree_pdf"
+OUT_DIR.mkdir(parents=True, exist_ok=True)
+
+SOURCES = [
+    "who_mpox_sitrep64",
+    "who_cholera_epi34",
+    "cdc_mmwr_nm_measles",
+    "ecdc_cdtr_week16",
+    "africa_cdc_weekly_apr2026",
+]
+
+
+def _table_to_md(rows: list[list[str]]) -> str:
+    if not rows:
+        return ""
+    # Normalise widths
+    n_cols = max(len(r) for r in rows)
+    norm = [r + [""] * (n_cols - len(r)) for r in rows]
+    header = norm[0]
+    body = norm[1:]
+    lines = ["| " + " | ".join(c.replace("\n", " ").strip() for c in header) + " |"]
+    lines.append("| " + " | ".join(["---"] * n_cols) + " |")
+    for row in body:
+        lines.append("| " + " | ".join(c.replace("\n", " ").strip() for c in row) + " |")
+    return "\n".join(lines)
+
+
+def _emit_markdown(parsed) -> str:
+    """Render ParsedContent as Markdown so it's directly comparable to Docling's."""
+    lines: list[str] = []
+    if parsed.title:
+        lines.append(f"# {parsed.title}\n")
+    if parsed.published_date:
+        lines.append(f"*Published: {parsed.published_date.date()}*\n")
+    if parsed.page_count:
+        lines.append(f"*Pages: {parsed.page_count}*\n")
+
+    last_path: str | None = None
+    for s in parsed.sections:
+        path = s.section_path or ""
+        # Emit a heading marker when the section_path changes
+        if path != last_path and path:
+            depth = path.count(" > ") + 2  # h2 minimum
+            depth = min(depth, 6)
+            lines.append(f"\n{'#' * depth} {path.split(' > ')[-1]}\n")
+            last_path = path
+        if s.chunk_type == "table" and s.table_rows:
+            if s.page_number:
+                lines.append(f"\n*Table on page {s.page_number}:*\n")
+            lines.append(_table_to_md(s.table_rows) + "\n")
+        elif s.text:
+            lines.append(s.text + "\n")
+    return "\n".join(lines)
+
+
+def _section_summary(parsed) -> dict[str, Any]:
+    table_sections = [s for s in parsed.sections if s.chunk_type == "table"]
+    prose_sections = [s for s in parsed.sections if s.chunk_type == "prose"]
+    table_cells = sum(
+        len(s.table_rows or []) * (len((s.table_rows or [[]])[0]) if s.table_rows else 0)
+        for s in table_sections
+    )
+    table_shapes = [
+        f"{len(s.table_rows or [])}x{(len((s.table_rows or [[]])[0]) if s.table_rows else 0)}"
+        for s in table_sections
+    ]
+    return {
+        "n_sections": len(parsed.sections),
+        "n_prose": len(prose_sections),
+        "n_tables": len(table_sections),
+        "table_shapes": table_shapes,
+        "table_cells_total": table_cells,
+        "raw_text_chars": len(parsed.raw_text),
+        "is_partial": parsed.is_partial,
+        "partial_reason": parsed.partial_reason,
+    }
+
+
+def main() -> int:
+    parser = PdfParser()
+    results: list[dict[str, Any]] = []
+
+    for name in SOURCES:
+        pdf_path = SRC_DIR / f"{name}.pdf"
+        if not pdf_path.exists():
+            print(f"SKIP {name}: file not found", flush=True)
+            continue
+
+        print(f"\n=== {name} ===", flush=True)
+        content = pdf_path.read_bytes()
+        start = time.monotonic()
+        try:
+            parsed = parser.parse(content, source_url=str(pdf_path))
+            elapsed = time.monotonic() - start
+        except Exception as exc:
+            elapsed = time.monotonic() - start
+            print(f"ERROR after {elapsed:.2f}s: {type(exc).__name__}: {exc}", flush=True)
+            results.append({
+                "name": name,
+                "status": "error",
+                "elapsed_sec": round(elapsed, 2),
+                "error": f"{type(exc).__name__}: {exc}",
+            })
+            continue
+
+        # Save MD + JSON
+        md_path = OUT_DIR / f"{name}.md"
+        md_path.write_text(_emit_markdown(parsed), encoding="utf-8")
+
+        json_dump = {
+            "title": parsed.title,
+            "published_date": parsed.published_date.isoformat() if parsed.published_date else None,
+            "page_count": parsed.page_count,
+            "is_partial": parsed.is_partial,
+            "partial_reason": parsed.partial_reason,
+            "raw_text_chars": len(parsed.raw_text),
+            "sections": [
+                {
+                    "section_path": s.section_path,
+                    "page_number": s.page_number,
+                    "chunk_type": s.chunk_type,
+                    "text": s.text,
+                    "table_rows": s.table_rows,
+                }
+                for s in parsed.sections
+            ],
+        }
+        (OUT_DIR / f"{name}.json").write_text(
+            json.dumps(json_dump, indent=2, default=str), encoding="utf-8"
+        )
+
+        summary = _section_summary(parsed)
+        rec = {
+            "name": name,
+            "status": "ok",
+            "elapsed_sec": round(elapsed, 2),
+            "title": parsed.title,
+            "published_date": parsed.published_date.isoformat() if parsed.published_date else None,
+            "page_count": parsed.page_count,
+            **summary,
+        }
+        results.append(rec)
+        print(
+            f"OK elapsed={rec['elapsed_sec']}s pages={rec['page_count']} "
+            f"sections={rec['n_sections']} prose={rec['n_prose']} tables={rec['n_tables']} "
+            f"shapes={rec['table_shapes']} chars={rec['raw_text_chars']} "
+            f"pub_date={rec['published_date']}",
+            flush=True,
+        )
+        if rec.get("partial_reason"):
+            print(f" partial: {rec['partial_reason']}", flush=True)
+
+    (OUT_DIR / "run_log.json").write_text(
+        json.dumps({"results": results}, indent=2, default=str), encoding="utf-8"
+    )
+
+    print("\n=== SUMMARY (in-tree PdfParser) ===", flush=True)
+    for r in results:
+        if r["status"] == "ok":
+            print(
+                f" [ ok] {r['name']:30s} {r['elapsed_sec']:>6.2f}s "
+                f"pages={r['page_count']} sections={r['n_sections']} "
+                f"tables={r['n_tables']} {r['table_shapes']} chars={r['raw_text_chars']} "
+                f"pub={r['published_date']}",
+                flush=True,
+            )
+        else:
+            print(f" [error] {r['name']:30s} {r['elapsed_sec']:>6.2f}s {r.get('error')}", flush=True)
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())