v2.0.0

@bosd bosd released this 04 Jun 16:30
eef30ac

[2.0.0] — 2026-06-04

The 2.0 release rolls up a substantial backend migration, the resulting
performance work, an optional neural (Table Transformer) backend for
borderless and scanned tables, and a handful of small but user-visible
breaking changes. If you are upgrading from 1.0.x, see the
migration guide.

Breaking

  • Dropped Python 3.9 (EOL October 2025). Minimum supported is now
    Python 3.10. (#740)
  • flavor="lattice" default line_scale changed from 40 to 15 to match
    the long-standing implementation (the CLI and read_pdf docstring used
    to say 40 but the Lattice parser always defaulted to 15). Tables that
    relied on the documented-but-unimplemented 40 will need
    read_pdf(..., line_scale=40) explicitly. (#709)
  • Table.to_excel now defaults to index=False, header=False to match
    Table.to_csv. Excel exports no longer carry the pandas auto-generated
    row index / column header by default. Opt back in with
    table.to_excel(path, index=True, header=True). (#711)
  • TableList constructor materialises its iterable input to a list
    immediately, so bool() and len() work on TableList(generator())
    inputs. A generator passed in will be exhausted at construction time
    rather than at first access. (#710)
  • PDFHandler.pages is a property (was an attribute). Reads work
    unchanged; the value is now resolved lazily on first access. No callers
    in the wild set it, but if you subclassed and overrode it as an
    attribute, that no longer works. (#732)
  • PDF backend swapped from pypdf + pdfminer.six to playa-pdf. The
    dependency install set is smaller, encrypted-PDF handling is more
    accurate, and parser hot paths shed several layers of
    per-page-temp-PDF dance. Pure import camelot callers should see no
    API change.
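
The TableList change (#710) is easiest to see with a small stand-in. The sketch below does not import Camelot; `EagerList` and `produce` are hypothetical names used only to illustrate what "materialised at construction time" means for a generator input.

```python
from typing import Iterable, Iterator, List


class EagerList:
    """Hypothetical stand-in for a container that materialises its input."""

    def __init__(self, items: Iterable[str]) -> None:
        # list() consumes a generator right here, at construction time.
        self._items: List[str] = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)


def produce(log: List[str]) -> Iterator[str]:
    """Yield two items and record the moment each one is produced."""
    for name in ("table-1", "table-2"):
        log.append(name)
        yield name


log: List[str] = []
tables = EagerList(produce(log))
consumed_at_construction = list(log)  # snapshot taken before any access
has_tables = bool(tables)             # bool() falls back to __len__
```

If your generator has side effects (opening files, network calls), expect them to happen when the container is built rather than when it is first iterated.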

Added

  • Optional neural flavor="ml" backend (Table Transformer / TATR). A
    neural model supplies the row/column/spanning-cell structure while cell
    text is filled from the PDF's own text layer — the model never emits cell
    text, so it cannot hallucinate or alter a value. Aimed at borderless tables,
    where the heuristic parsers plateau: on the FinTabNet borderless benchmark it
    roughly doubles TEDS (~0.20 → ~0.37) over network/hybrid. Heavy
    dependencies are optional and imported lazily — pip install "camelot-py[ml]"
    — so import camelot and the other flavors never load PyTorch. The box→grid
    post-processing and image→PDF mapping are pure (torch-free) and unit-tested.
    (#809)
  • flavor="ml" reads scanned / image-only PDFs via optional OCR. With no
    text layer (ocr="auto", the default) — or always with ocr=True — cell
    text comes from OCR of the rendered page instead of the PDF text layer;
    structure still comes from the model. This lifts Camelot's long-standing
    "needs a text layer" limitation. Opt in with pip install "camelot-py[ocr]".
    Still geometry + recognised text (no invented cells); split_text /
    flag_size aren't supported in OCR mode. (#809)
  • TableList.filter(...) — post-extraction convenience to drop noise /
    low-quality tables by min_rows, min_columns, min_accuracy,
    max_whitespace. Returns a new TableList (composable); all thresholds
    default to a no-op so nothing is dropped unless asked.
  • engine="combined" for flavor="lattice" (and the lattice half of
    flavor="hybrid"): unions the PDF's native vector ruled lines into the
    rasterised OpenCV line masks before contour/joint detection, so tables
    whose rules render faintly (vector strokes, anti-aliasing) are still
    found. Safe by construction — raster always runs, vector lines can only
    add, so output is never worse than engine="raster". It is now the
    default lattice engine and vector lines are clipped to
    table_regions. (#763)
  • engine="vector" for flavor="lattice": detects tables purely from
    the PDF's native vector ruled lines, skipping page rasterisation and
    OpenCV entirely. This is the fastest path for PDFs whose tables are
    drawn with real vector strokes. (#763)
  • engine="vector" for flavor="hybrid": the render-free hybrid.
    Hybrid's lattice half now also accepts engine="vector", so the network
    text-edge alignment is merged (via the completeness-gated combine) with
    ruled lines read straight from the PDF's vector graphics, with no page
    render and no OpenCV. On the in-repo ICDAR-2013 benchmark it matches or
    beats engine="raster" hybrid on every metric (F1 0.702→0.726, TEDS
    0.724→0.755, row 0.417→0.464, col 0.689→0.715) while running ~6×
    faster (113s→19s); on FinTabNet.c (borderless) it matches raster
    hybrid's quality while running ~2.4× faster. Hybrid also now drops the
    empty tables that the vector ruled-line set can raise from decorative
    page borders / form rules; removing those spurious detections in turn
    lifts engine="raster" hybrid F1. (#39)
  • flavor="auto": render the first requested page, count ruled
    horizontal/vertical lines, pick lattice when ruled and network
    otherwise. Emits a UserWarning naming the chosen flavor. (#737)
  • Table.confidence — unified per-table quality score in [0, 1]
    computed as (accuracy / 100) * (1 - whitespace / 100). Now appears as
    a "confidence" key in Table.parsing_report alongside the existing
    accuracy / whitespace / page / order. Suitable for production
    filtering. The whole parsing_report schema is now documented in the
    property docstring. (#739)
  • per_page parameter on read_pdf(..., per_page={...}) — apply
    per-page kwarg overrides (including flavor) on top of the global
    kwargs. Useful for multi-layout PDFs where some pages need different
    table_areas / columns / flavor than the rest. Concept originally
    proposed by @sverma25 in #41. (#41)
  • strip_text= now accepts a list/tuple of substrings alongside the
    long-standing per-character str form. strip_text=["[1]", "[2]"]
    strips those footnote markers as whole substrings;
    strip_text="[]" keeps the existing per-character behaviour. (#484)
  • replace_text parameter on read_pdf — dict of substring →
    replacement applied to every cell's text just before assignment.
    Unlike strip_text (which can only remove), replace_text rewrites
    with arbitrary text — useful for collapsing soft-broken words
    ({" \n": " "}), normalising abbreviations, or rewriting unit
    names. Keys are matched as literal substrings; when several keys
    could match at the same position the longest one wins. (#482)
  • read_pdf accepts bytes and binary file-like objects as
    filepath, in addition to str/Path and URLs. io.BytesIO, an open
    "rb" handle, requests response .raw, etc. all work. The bytes
    are spilled to a temp file once (so the Lattice OpenCV image
    conversion keeps working) and cleaned up on context-manager exit.
    Long-standing requests #170, #245. (#270)
  • cpu_count parameter on read_pdf(..., parallel=True, cpu_count=N)
    and PDFHandler.parse(...) — caps the worker count when running in
    parallel. Defaults to all cores; clamped to
    [1, multiprocessing.cpu_count()]. (#712)
  • camelot-py CLI alias matching the PyPI package name —
    uvx camelot-py … works directly without the --from camelot-py
    prefix. (#738)
  • --format is now optional in the CLI: when omitted, the format is
    inferred from the --output extension (.csv, .xlsx, .html,
    .json, .md, .sqlite, etc.). (#738)
  • Table.to_excel defaults to index=False, header=False (under
    Breaking but worth calling out under Added too — most users will
    prefer the new shape).
  • Python 3.14 stable + 3.15 experimental rows added to the CI matrix.
    Wheels for both Pythons install correctly on Linux/macOS/Windows. (#706)
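
As a worked example of the Table.confidence formula quoted above, the sketch below recomputes the score from an accuracy / whitespace pair and applies a threshold. It is plain Python and does not call Camelot; `Report`, `confidence` and `keep_confident` are hypothetical helper names, and the input numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    """Hypothetical stand-in for the accuracy / whitespace values of a table."""

    accuracy: float    # percent, 0-100
    whitespace: float  # percent, 0-100


def confidence(report: Report) -> float:
    # Formula as stated in the notes: (accuracy / 100) * (1 - whitespace / 100)
    return (report.accuracy / 100) * (1 - report.whitespace / 100)


def keep_confident(reports: List[Report], min_confidence: float = 0.0) -> List[Report]:
    # The default threshold of 0.0 keeps everything, mirroring the
    # "no-op unless asked" behaviour described for TableList.filter.
    return [r for r in reports if confidence(r) >= min_confidence]


reports = [
    Report(accuracy=90.0, whitespace=20.0),  # dense, well-matched table
    Report(accuracy=99.0, whitespace=95.0),  # accurate but almost empty
]
scores = [round(confidence(r), 4) for r in reports]
kept = keep_confident(reports, min_confidence=0.5)
```

The second report shows why the score multiplies the two terms: a near-empty grid scores low even when its accuracy is high, so a single threshold removes it.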

Changed

  • Default lattice engine is now "combined" (was "raster"); the
    transient engine="auto" introduced earlier in the 2.0 cycle was removed.
    Existing flavor="lattice" calls pick up combined automatically and it is
    never worse than raster. engine stays lattice-only and is rejected for the
    text-based flavors. (#803)
  • flavor="hybrid" runs its lattice half with engine="combined" too
    (was "raster"). With the completeness gating this lifts hybrid on ruled
    documents — in-repo ICDAR-2013 TEDS 0.724→0.806, row 0.417→0.659,
    col 0.689→0.868. (#807)

Changed (performance)

  • Lattice raster render skips the PNG round-trip (~20-26% faster). The
    page was rendered to a PIL image, saved to a PNG, then immediately
    cv2.imread-ed back; the encode alone was about a quarter of the raster
    time. The Lattice engine now renders straight to an in-memory BGR array
    (ImageConversionBackend.to_array, pdfium-native; other backends fall
    back to convert+imread). Output is byte-identical (PNG was lossless).
    (#40)

  • text_in_bbox ≈ 30× faster on busy lattice pages. The original
    O(n³) duplicate-discard pass became O(n²) in #718, then the whole
    function was NumPy-vectorised in #731 — a 3-4× win on top of #718 on
    realistic 50-500-text-line bboxes. Memory-safe fallback at n > 1500.

  • get_table_index 3-13× faster. #727 collapsed the row scan + best-overlap
    tracking, #733 added a lazy NumPy + bisect row-band lookup
    (O(log rows)) plus per-table caches on Table (_rows_np,
    _cols_np, _rows_disjoint).

  • read_pdf opens the PDF once per call instead of twice. Page
    resolution is deferred until the parse already has the playa handle
    open. Doubles throughput on workloads that loop over many short PDFs.
    (#732)

  • random_string 4× faster (#718) and compute_whitespace cleanup (#727):
    small, mostly readability changes.

  • A bench/ directory now ships a couple of standalone microbenchmarks
    (bench_get_table_index.py) and a negative-result bench
    (bench_negative_results.py) documenting cases where NumPy did not
    help — useful regression net against well-meaning future rewrites.
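
To illustrate why a bisect-based row-band lookup is O(log rows), here is a minimal sketch using only the standard library. It is not Camelot's implementation: `find_band` is a hypothetical helper, and it assumes the bands are disjoint, contiguous and described by ascending edge coordinates.

```python
from bisect import bisect_right
from typing import List, Optional


def find_band(edges: List[float], y: float) -> Optional[int]:
    """Return i such that edges[i] <= y < edges[i + 1], or None if outside.

    Assumes `edges` is sorted ascending, so len(edges) - 1 bands exist.
    """
    if len(edges) < 2 or y < edges[0] or y >= edges[-1]:
        return None
    # bisect_right returns the insertion point to the right of any equal
    # edge, so subtracting one gives the band whose lower edge is <= y.
    return bisect_right(edges, y) - 1


edges = [0.0, 10.0, 25.0, 40.0]     # three bands: [0,10), [10,25), [25,40)
inside = find_band(edges, 12.5)     # falls in the middle band
on_edge = find_band(edges, 10.0)    # a lower edge belongs to its own band
outside = find_band(edges, 40.0)    # the final upper edge is exclusive
```

A linear scan would compare against every band; the binary search touches roughly log2(rows) edges, which matters when the lookup runs once per text line.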

Fixed

  • text_in_bbox no longer drops legitimate adjacent-cell text. The
    geometry-only overlap dedup (added for #15 font-render duplicates) was
    discarding any shorter textline ≥80 % contained in a wider neighbour's
    bbox — even when the two carried different text — so overlapping cells
    silently lost content. The discard is now content-aware: a contained box
    is dropped only when its stripped text actually equals the longer
    sibling's. (#814, closes #288 / #625)

  • Precision gate for the lattice/combined engine. Near-empty ruled
    grids (page borders, form rules, header separators — whitespace ≥ 90 %)
    are no longer emitted as tables; they were detection noise that
    false-positived on pages with no real table. On the in-repo ICDAR-2013
    benchmark this lifts combined detection F1 0.665 → 0.778 with TEDS /
    row / col all improving too. (#36)

  • Network parser: suppress nested/overlapping duplicate tables. The
    connectivity search sometimes emitted a partial copy of a table nested
    inside the full detection (same columns, fewer rows), inflating the
    table count and mangling row structure. These are now suppressed
    (keep the larger). On the in-repo ICDAR-2013 benchmark this lifts
    flavor='auto' across the board (F1 0.742→0.765, TEDS 0.744→0.763,
    row 0.517→0.540) and runs ~20 % faster. (#35)

  • flavor="hybrid": gate the network-split augmentation by lattice
    completeness. Hybrid used to union network's text-derived column
    splits onto lattice's boundaries and parse the merged table with the
    network parser (text-grouped rows) — which over-segmented and wrecked the
    row structure of fully-ruled tables. Now, when lattice already resolved a
    complete ruled grid (interior rules in both directions, joints covering
    the grid, and a row count commensurate with the table's column-aligned
    text rows), that grid is routed to the lattice parser untouched;
    partially-ruled / borderless tables still take the network-augmented path,
    so hybrid's niche wins are preserved. On the in-repo ICDAR-2013 benchmark
    this lifts hybrid TEDS 0.654→0.724 and row 0.172→0.417 (ruled-doc
    subset row 0.19→0.60) with F1 unchanged. (#805, mitigates #38 for hybrid)

  • flavor="auto" was silently broken. _detect_flavor passed a
    non-existent resolution= kwarg to the image backend; the resulting
    TypeError was swallowed and every PDF fell back to network (never lattice).
    Fixed; auto now also detects the flavor per page and routes ruled
    pages through engine="combined", so mixed cover-page/table documents
    parse correctly. (#763)

  • Windows PermissionError when parsing multiple PDFs. The URL-
    downloaded temp file is now removed on PDFHandler.__exit__ /
    close(); the os.remove is wrapped in try/except OSError so the
    shutdown path keeps working even when pdfium/playa still holds a
    handle to the file. (#735, closes #537 / #678)

  • PdfiumBackend leaks document + image handles. convert() now
    uses try/finally so a render that raises still releases pypdfium's
    resources. (#716, closes #660)

  • TableList(generator) no longer raises TypeError on bool() or
    len(). (#710, closes #655)

  • CLI / docs / Lattice default line_scale are consistent at 15
    (see Breaking). (#709, closes #657)

  • Table.to_excel no longer emits the meaningless integer row index
    and integer column header. (#711, closes #634)

  • CLI options are position-independent (they can sit before or
    after the file argument on any subcommand). (#614, closes #587)

  • Documentation no longer references pdfminer/pypdf as the
    backend; the playa-pdf migration is reflected throughout. (#719)

  • opencv-python conflict warning added to install docs — pip happily
    installs opencv-python alongside opencv-python-headless, breaking
    import cv2 at runtime. (#736, closes #645)

  • how-it-works.rst Network section no longer refers to a missing
    plot. (#736, closes #577)
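
The content-aware discard described in the text_in_bbox fix (#814) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `Line`, `contained_fraction` and `dedupe` are hypothetical names, boxes are axis-aligned (x0, y0, x1, y1), and the 0.8 threshold is taken from the "≥80 % contained" wording above.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """Hypothetical text line: an axis-aligned box plus its text."""

    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def contained_fraction(inner: Line, outer: Line) -> float:
    """Fraction of inner's area that lies inside outer."""
    w = min(inner.x1, outer.x1) - max(inner.x0, outer.x0)
    h = min(inner.y1, outer.y1) - max(inner.y0, outer.y0)
    area = (inner.x1 - inner.x0) * (inner.y1 - inner.y0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def dedupe(lines: List[Line], threshold: float = 0.8) -> List[Line]:
    kept: List[Line] = []
    # Visit widest boxes first so narrower ones are compared against them.
    for line in sorted(lines, key=lambda ln: ln.x1 - ln.x0, reverse=True):
        is_duplicate = any(
            contained_fraction(line, wider) >= threshold
            and line.text.strip() == wider.text.strip()  # the content check
            for wider in kept
        )
        if not is_duplicate:
            kept.append(line)
    return kept


lines = [
    Line(0, 0, 100, 10, "Total revenue"),
    Line(1, 0, 99, 10, "Total revenue "),  # render duplicate, same text
    Line(10, 0, 50, 10, "2024"),           # contained, but different text
]
result = [ln.text for ln in dedupe(lines)]
```

Without the text comparison, the "2024" line would be discarded purely on geometry, which is the failure mode the fix addresses.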

Security

  • pypdf<6 (CVE-2025-55197) is no longer a dependency; replaced by
    playa-pdf. The pypdf vulnerability therefore no longer applies to
    Camelot, which in any case never directly called the affected APIs.
    (closes #643)
  • PDFTextExtractionNotAllowed is now actually enforced for
    encrypted PDFs whose user-password permissions forbid extraction —
    the previous architecture (split-into-per-page-temp-PDFs via pypdf)
    silently dropped the encryption metadata after decryption, so the
    check was effectively a no-op. The playa-based parse path keeps the
    document handle open with permissions intact. Note: for unencrypted
    PDFs that claim "no extraction" via /Perms, no mechanism in the PDF
    spec actually enforces the flag and Camelot extracts. (closes #590)

Deprecated / Removed

  • The internal _save_page per-page-temp-PDF helper is gone — no
    external callers known. (gh-#21, gh-#11)
  • pdfminer.six is no longer a direct dependency — playa.miner
    exposes a PDFMiner-compatible layout API; users who imported through
    Camelot keep working without code changes. (gh-#172)