[2.0.0] — 2026-06-04
The 2.0 release rolls up a substantial backend migration, the resulting
performance work, an optional neural (Table Transformer) backend for
borderless and scanned tables, and a handful of small but user-visible
breaking changes. If you are upgrading from 1.0.x, see the
migration guide.
Breaking
- Dropped Python 3.9 (EOL October 2025). Minimum supported is now
Python 3.10. (#740)
- `flavor="lattice"` default `line_scale` changed from 40 to 15 to match
the long-standing implementation (the CLI and `read_pdf` docstring used
to say 40, but the Lattice parser always defaulted to 15). Tables that
relied on the documented-but-unimplemented `40` will need
`read_pdf(..., line_scale=40)` explicitly. (#709)
- `Table.to_excel` now defaults to `index=False, header=False` to match
`Table.to_csv`. Excel exports no longer carry the pandas auto-generated
row index / column header by default. Opt back in with
`table.to_excel(path, index=True, header=True)`. (#711)
- `TableList` constructor materialises its iterable input to a list
immediately, so `bool()` and `len()` work on `TableList(generator())`
inputs. A generator passed in will be exhausted at construction time
rather than at first access. (#710)
- `PDFHandler.pages` is a property (was an attribute). Reads work
unchanged; the value is now resolved lazily on first access. No callers
in the wild set it, but if you subclassed and overrode it as an
attribute, that no longer works. (#732)
- PDF backend swapped from `pypdf` + `pdfminer.six` to `playa-pdf`. The
dependency install set is smaller, encrypted-PDF handling is more
accurate, and parser hot paths shed several layers of
per-page-temp-PDF dance. Pure `import camelot` callers should see no
API change.
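
The `TableList` entry above is easiest to see with a generator. The sketch below uses a hypothetical stand-in class, `MaterialisedList`, rather than Camelot's actual implementation; it assumes only what the entry states, namely that the iterable is copied to a list inside the constructor.

```python
from typing import Iterable, Iterator, List


class MaterialisedList:
    """Hypothetical stand-in for TableList: copies its input to a list up front."""

    def __init__(self, items: Iterable[str]) -> None:
        # Materialise immediately so len() and bool() never touch the source again.
        self._items: List[str] = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)


def fake_tables() -> Iterator[str]:
    # Placeholder generator; real code would yield Table objects.
    yield "table-1"
    yield "table-2"


source = fake_tables()
tables = MaterialisedList(source)

print(len(tables))         # works: the generator was consumed in __init__
print(bool(tables))        # truthiness falls back to __len__
print(next(source, None))  # the original generator is already exhausted
```

If other code still needs to pull from the same generator, it has to do so before the generator is handed over, since nothing is left afterwards.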
Added
- Optional neural `flavor="ml"` backend (Table Transformer / TATR). A
neural model supplies the row/column/spanning-cell structure while cell
text is filled from the PDF's own text layer. The model never emits cell
text, so it cannot hallucinate or alter a value. Aimed at borderless tables,
where the heuristic parsers plateau: on the FinTabNet borderless benchmark it
roughly doubles TEDS (~0.20 → ~0.37) over `network` / `hybrid`. Heavy
dependencies are optional and imported lazily (`pip install "camelot-py[ml]"`),
so `import camelot` and the other flavors never load PyTorch. The box→grid
post-processing and image→PDF mapping are pure (torch-free) and unit-tested.
(#809)
- `flavor="ml"` reads scanned / image-only PDFs via optional OCR. With no
text layer (`ocr="auto"`, the default), or always with `ocr=True`, cell
text comes from OCR of the rendered page instead of the PDF text layer;
structure still comes from the model. This lifts Camelot's long-standing
"needs a text layer" limitation. Opt in with `pip install "camelot-py[ocr]"`.
Still geometry + recognised text (no invented cells); `split_text` /
`flag_size` aren't supported in OCR mode. (#809)
- `TableList.filter(...)`: post-extraction convenience to drop noise /
low-quality tables by `min_rows`, `min_columns`, `min_accuracy`,
`max_whitespace`. Returns a new `TableList` (composable); all thresholds
default to a no-op so nothing is dropped unless asked.
- `engine="combined"` for `flavor="lattice"` (and the lattice half of
`flavor="hybrid"`): unions the PDF's native vector ruled lines into the
rasterised OpenCV line masks before contour/joint detection, so tables
whose rules render faintly (vector strokes, anti-aliasing) are still
found. Safe by construction: raster always runs and vector lines can only
add, so output is never worse than `engine="raster"`. It is now the
default lattice engine, and vector lines are clipped to
`table_regions`. (#763)
- `engine="vector"` for `flavor="lattice"`: detects tables purely from
the PDF's native vector ruled lines, skipping page rasterisation and
OpenCV entirely. This is the fastest path for PDFs whose tables are drawn
with real vector strokes. (#763)
- `engine="vector"` for `flavor="hybrid"`, the render-free hybrid.
Hybrid's lattice half now also accepts `engine="vector"`, so the network
text-edge alignment is merged (via the completeness-gated combine) with
ruled lines read straight from the PDF's vector graphics: no page
render, no OpenCV. On the in-repo ICDAR-2013 benchmark it matches or
beats `engine="raster"` hybrid on every metric (F1 0.702→0.726, TEDS
0.724→0.755, row 0.417→0.464, col 0.689→0.715) while running about 6×
faster (113s→19s); on FinTabNet.c (borderless) it matches raster hybrid's
quality while running about 2.4× faster. Hybrid now also drops the empty
tables that the vector ruled-line set can raise from decorative page
borders / form rules (removing those spurious detections in turn lifts
`engine="raster"` hybrid F1). (#39)
- `flavor="auto"`: render the first requested page, count ruled
horizontal/vertical lines, pick `lattice` when ruled and `network`
otherwise. Emits a `UserWarning` naming the chosen flavor. (#737)
- `Table.confidence`: unified per-table quality score in `[0, 1]`,
computed as `(accuracy / 100) * (1 - whitespace / 100)`. Now appears as
a `"confidence"` key in `Table.parsing_report` alongside the existing
`accuracy` / `whitespace` / `page` / `order`. Suitable for production
filtering. The whole `parsing_report` schema is now documented in the
property docstring. (#739)
- `per_page` parameter on `read_pdf(..., per_page={...})`: apply
per-page kwarg overrides (including `flavor`) on top of the global
kwargs. Useful for multi-layout PDFs where some pages need different
`table_areas` / `columns` / `flavor` than the rest. Concept originally
proposed by @sverma25 in #41. (#41)
- `strip_text=` now accepts a list/tuple of substrings alongside the
long-standing per-character `str` form. `strip_text=["[1]", "[2]"]`
strips those footnote markers as whole substrings;
`strip_text="[]"` keeps the existing per-character behaviour. (#484)
- `replace_text` parameter on `read_pdf`: dict of substring →
replacement applied to every cell's text just before assignment.
Unlike `strip_text` (which can only remove), `replace_text` rewrites
with arbitrary text, which is useful for collapsing soft-broken words
(`{" \n": " "}`), normalising abbreviations, or rewriting unit
names. Keys are matched as literal substrings; when several keys
could match at the same position the longest one wins. (#482)
- `read_pdf` accepts `bytes` and binary file-like objects as
`filepath`, in addition to str/Path and URLs. `io.BytesIO`, an open
`"rb"` handle, `requests` `response.raw`, etc. all work. The bytes
are spilled to a temp file once (so the Lattice OpenCV image
conversion keeps working) and cleaned up on context-manager exit.
Long-standing requests #170, #245. (#270)
- `cpu_count` parameter on `read_pdf(..., parallel=True, cpu_count=N)`
and `PDFHandler.parse(...)`: caps the worker count when running in
parallel. Defaults to all cores; clamped to
`[1, multiprocessing.cpu_count()]`. (#712)
- `camelot-py` CLI alias matching the PyPI package name:
`uvx camelot-py …` works directly without the `--from camelot-py`
prefix. (#738)
- `--format` is now optional in the CLI: when omitted, the format is
inferred from the `--output` extension (`.csv`, `.xlsx`, `.html`,
`.json`, `.md`, `.sqlite`, etc.). (#738)
- `Table.to_excel` defaults to `index=False, header=False` (listed under
Breaking but worth calling out under Added too; most users will
prefer the new shape).
- Python 3.14 stable + 3.15 experimental rows added to the CI matrix.
Wheels for both Pythons install correctly on Linux/macOS/Windows. (#706)
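
Two of the entries above state small rules that can be checked in isolation: the `Table.confidence` formula and the "longest key wins" matching of `replace_text`. The sketch below is a standalone illustration, not Camelot's code; the function names `confidence` and `replace_substrings` are hypothetical, and the regex-based approach is one assumed way to get longest-match behaviour.

```python
import re
from typing import Dict


def confidence(accuracy: float, whitespace: float) -> float:
    """Score in [0, 1] from the formula quoted above; both inputs are percentages."""
    return (accuracy / 100) * (1 - whitespace / 100)


def replace_substrings(text: str, mapping: Dict[str, str]) -> str:
    """Literal substring replacement where the longest matching key wins."""
    # Ignore empty keys: an empty pattern would match at every position.
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    if not keys:
        return text
    # Regex alternation tries branches left to right, so listing the longest
    # keys first means the longest literal key is attempted first at each position.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


units = {"kg": "kilogram", "kg/m": "kilogram per metre"}

print(round(confidence(90.0, 10.0), 2))
print(replace_substrings("5 kg/m and 3 kg", units))
print(replace_substrings("soft \nbreak", {" \n": " "}))
```

Doing the whole replacement in a single pass also means replaced text is never re-scanned, so one substitution cannot trigger another.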
Changed
- Default lattice `engine` is now `"combined"` (was `"raster"`); the
transient `engine="auto"` introduced earlier in the 2.0 cycle was removed.
Existing `flavor="lattice"` calls pick up combined automatically and it is
never worse than raster. `engine` stays lattice-only and is rejected for the
text-based flavors. (#803)
- `flavor="hybrid"` runs its lattice half with `engine="combined"` too
(was `"raster"`). With the completeness gating this lifts hybrid on ruled
documents: in-repo ICDAR-2013 TEDS 0.724→0.806, row 0.417→0.659,
col 0.689→0.868. (#807)
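
The "never worse than raster" claim rests on the combined engine taking a union of line masks. The toy sketch below shows only that property, using nested lists of booleans in place of OpenCV images; `union_masks` is a hypothetical helper, and the real engine's masks, clipping and detection steps are not modelled.

```python
from typing import List

Mask = List[List[bool]]


def union_masks(raster: Mask, vector: Mask) -> Mask:
    """Pixel-wise OR of two equally sized line masks."""
    return [
        [r or v for r, v in zip(raster_row, vector_row)]
        for raster_row, vector_row in zip(raster, vector)
    ]


raster = [[True, False, False], [False, False, False]]
vector = [[False, False, True], [False, True, False]]
combined = union_masks(raster, vector)

# Every pixel set by the raster pass is still set after the union.
raster_preserved = all(
    c
    for raster_row, combined_row in zip(raster, combined)
    for r, c in zip(raster_row, combined_row)
    if r
)
print(combined)
print(raster_preserved)
```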
Changed (performance)
- Lattice raster render skips the PNG round-trip (~20-26% faster). The
page was rendered to a PIL image, saved to a PNG, then immediately
`cv2.imread`-ed back; the encode alone was about a quarter of the raster
time. The Lattice engine now renders straight to an in-memory BGR array
(`ImageConversionBackend.to_array`, pdfium-native; other backends fall
back to convert + imread). Output is byte-identical (PNG was lossless).
(#40)
- `text_in_bbox` ≈ 30× faster on busy lattice pages. The original
O(n³) duplicate-discard pass became O(n²) in #718, then the whole
function was NumPy-vectorised in #731, a 3-4× win on top of #718 on
realistic 50-500-text-line bboxes. Memory-safe fallback at n > 1500.
- `get_table_index` 3-13× faster. #727 collapsed the row scan + best-overlap
tracking, #733 added a lazy NumPy + `bisect` row-band lookup
(O(log rows)) plus per-table caches on `Table` (`_rows_np`,
`_cols_np`, `_rows_disjoint`).
- `read_pdf` opens the PDF once per call instead of twice. Page
resolution is deferred until the parse already has the `playa` handle
open. Doubles throughput on workloads that loop over many short PDFs.
(#732)
- `random_string` 4× faster (#718) and `compute_whitespace` cleanup (#727):
small, mostly readability.
- A `bench/` directory now ships a couple of standalone microbenchmarks
(`bench_get_table_index.py`) and a negative-result bench
(`bench_negative_results.py`) documenting cases where NumPy did not
help, a useful regression net against well-meaning future rewrites.
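
The `bisect` row-band lookup mentioned for `get_table_index` can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: rows are modelled as disjoint `(low, high)` bands, and `build_index` and `find_row` are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Band = Tuple[float, float]


def build_index(bands: List[Band]) -> Tuple[List[float], List[Band]]:
    """Sort disjoint (low, high) row bands and keep their lower edges for bisect."""
    ordered = sorted(bands)
    lows = [low for low, _ in ordered]
    return lows, ordered


def find_row(y: float, lows: List[float], ordered: List[Band]) -> Optional[int]:
    """Return the index of the band containing y, or None, in O(log rows)."""
    i = bisect_right(lows, y) - 1  # last band whose lower edge is <= y
    if i >= 0 and y <= ordered[i][1]:
        return i
    return None


lows, ordered = build_index([(0.0, 10.0), (20.0, 30.0), (10.0, 20.0)])

print(find_row(25.0, lows, ordered))  # inside the third band after sorting
print(find_row(-1.0, lows, ordered))  # below every band
```

The binary search is only valid because the bands do not overlap; overlapping rows would need a fallback to a linear scan.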
Fixed
- `text_in_bbox` no longer drops legitimate adjacent-cell text. The
geometry-only overlap dedup (added for #15 font-render duplicates) was
discarding any shorter textline ≥80 % contained in a wider neighbour's
bbox, even when the two carried different text, so overlapping cells
silently lost content. The discard is now content-aware: a contained box
is dropped only when its stripped text actually equals the longer
sibling's. (#814, closes #288 / #625)
- Precision gate for the lattice/combined engine. Near-empty ruled
grids (page borders, form rules, header separators; whitespace ≥ 90 %)
are no longer emitted as tables; they were detection noise that
produced false positives on pages with no real table. On the in-repo
ICDAR-2013 benchmark this lifts combined detection F1 0.665 → 0.778 with
TEDS / row / col all improving too. (#36)
- Network parser: suppress nested/overlapping duplicate tables. The
connectivity search sometimes emitted a partial copy of a table nested
inside the full detection (same columns, fewer rows), inflating the
table count and mangling row structure. These are now suppressed
(keep the larger). On the in-repo ICDAR-2013 benchmark this lifts
`flavor="auto"` across the board (F1 0.742→0.765, TEDS 0.744→0.763,
row 0.517→0.540) and runs ~20 % faster. (#35)
- `flavor="hybrid"`: gate the network-split augmentation by lattice
completeness. Hybrid used to union network's text-derived column
splits onto lattice's boundaries and parse the merged table with the
network parser (text-grouped rows), which over-segmented and wrecked the
row structure of fully-ruled tables. Now, when lattice already resolved a
complete ruled grid (interior rules in both directions, joints covering
the grid, and a row count commensurate with the table's column-aligned
text rows), that grid is routed to the lattice parser untouched;
partially-ruled / borderless tables still take the network-augmented path,
so hybrid's niche wins are preserved. On the in-repo ICDAR-2013 benchmark
this lifts hybrid TEDS 0.654→0.724 and row 0.172→0.417 (ruled-doc
subset row 0.19→0.60) with F1 unchanged. (#805, mitigates #38 for hybrid)
- `flavor="auto"` was silently broken: `_detect_flavor` passed a
non-existent `resolution=` kwarg to the image backend, the resulting
`TypeError` was swallowed, and every PDF fell back to `network` (never
`lattice`). Fixed; `auto` now also detects the flavor per page and routes
ruled pages through `engine="combined"`, so mixed cover-page/table
documents parse correctly. (#763)
- Windows `PermissionError` when parsing multiple PDFs. The
URL-downloaded temp file is now removed on `PDFHandler.__exit__` /
`close()`; the `os.remove` is wrapped in `try/except OSError` so the
shutdown path keeps working even when pdfium/playa still holds a
handle to the file. (#735, closes #537 / #678)
- `PdfiumBackend` leaks document + image handles. `convert()` now
uses `try/finally` so a render that raises still releases pypdfium's
resources. (#716, closes #660)
- `TableList(generator)` no longer raises `TypeError` on `bool()` or
`len()`. (#710, closes #655)
- CLI / docs / Lattice default `line_scale` are consistent at 15
(see Breaking). (#709, closes #657)
- `Table.to_excel` no longer emits the meaningless integer-index row
and integer-header column. (#711, closes #634)
- CLI options are position-independent (they can sit before or
after the file argument on any subcommand). (#614, closes #587)
- Documentation no longer references `pdfminer` / `pypdf` as the
backend; the `playa-pdf` migration is reflected throughout. (#719)
- opencv-python conflict warning added to install docs: pip happily
installs `opencv-python` alongside `opencv-python-headless`, breaking
`import cv2` at runtime. (#736, closes #645)
- `how-it-works.rst` Network section no longer refers to a missing
plot. (#736, closes #577)
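
The content-aware dedup described in the `text_in_bbox` fix can be illustrated with a small standalone sketch. Everything here is assumed for illustration: the `TextBox` type, measuring containment by area, and the tie-break for equally wide boxes are not taken from Camelot's source; only the 80 % threshold and the "drop only when the stripped text matches" rule come from the entry above.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def width(box: TextBox) -> float:
    return box.x1 - box.x0


def contained_fraction(inner: TextBox, outer: TextBox) -> float:
    """Fraction of inner's area that lies inside outer."""
    w = min(inner.x1, outer.x1) - max(inner.x0, outer.x0)
    h = min(inner.y1, outer.y1) - max(inner.y0, outer.y0)
    inner_area = width(inner) * (inner.y1 - inner.y0)
    if w <= 0 or h <= 0 or inner_area <= 0:
        return 0.0
    return (w * h) / inner_area


def dedupe(boxes: List[TextBox], threshold: float = 0.8) -> List[TextBox]:
    """Drop a box only if it sits mostly inside a wider box with the same text."""
    kept = []
    for i, box in enumerate(boxes):
        is_duplicate = False
        for j, other in enumerate(boxes):
            if j == i:
                continue
            # "other" wins if strictly wider, or equally wide but earlier in the list.
            wins = width(other) > width(box) or (width(other) == width(box) and j < i)
            same_text = box.text.strip() == other.text.strip()
            if wins and same_text and contained_fraction(box, other) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(box)
    return kept


boxes = [
    TextBox(0, 0, 100, 10, "Total"),
    TextBox(0, 0, 100, 10, "Total "),  # render duplicate: same text once stripped
    TextBox(10, 0, 50, 10, "42"),      # fully contained, but different text
]
result = dedupe(boxes)
print([b.text for b in result])
```

A geometry-only rule would also have discarded the `"42"` box here, which is the data loss the fix addresses.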
Security
- `pypdf<6` (CVE-2025-55197) is no longer a dependency; replaced by
`playa-pdf`. The pypdf vulnerability does not apply to current Camelot,
and Camelot never directly called the affected APIs in any case. (closes
#643)
- `PDFTextExtractionNotAllowed` is now actually enforced for
encrypted PDFs whose user-password permissions forbid extraction.
The previous architecture (split-into-per-page-temp-PDFs via pypdf)
silently dropped the encryption metadata after decryption, so the
check was effectively a no-op. The playa-based parse path keeps the
document handle open with permissions intact. Note: for unencrypted
PDFs that claim "no extraction" via `/Perms`, no mechanism in the PDF
spec actually enforces the flag and Camelot extracts. (closes #590)