v0.4.0
Features
3 new table structure recognition options!
- Added TabledFormatter, with support for the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
- Added HistogramFormatter, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bboxes to detect separating lines between text. Check out the demo notebook for a quick example.
- Added DITRFormatter. This formatter is a blend of TATRFormatter and HistogramFormatter, trained to recognize table separating lines rather than cells. It fine-tunes microsoft/table-transformer-structure-recognition-v1.1-all on PubTables-1M for 15 epochs. Its main draw is mixing and matching deep and algorithmic separating line detection. Check out the demo notebook for a quick example.
These formatters can all be used in combination with any detector (like TATRDetector).
[Figure: a visual explaining HistogramFormatter]
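The general idea of finding separating lines from word bboxes can be sketched in a few lines of Python. This is a simplified illustration, not gmft's actual implementation: the function name `separator_positions`, the `bin_size` parameter, and the `(x0, y0, x1, y1)` bbox layout are all assumptions made for this example. It projects every word onto one axis, counts how many words cover each bin, and places a separator in the middle of each run of empty bins.

```python
import math
from typing import List, Tuple

# Hypothetical bbox layout for this sketch: (x0, y0, x1, y1)
BBox = Tuple[float, float, float, float]


def separator_positions(bboxes: List[BBox], axis: int, bin_size: float = 1.0) -> List[float]:
    """Return midpoints of empty gaps between words along one axis.

    axis=0 uses x-extents (yields column separators);
    axis=1 uses y-extents (yields row separators).
    """
    if not bboxes:
        return []
    lo, hi = axis, axis + 2  # indices of the min and max coordinate on this axis
    start = min(b[lo] for b in bboxes)
    end = max(b[hi] for b in bboxes)
    n_bins = max(1, math.ceil((end - start) / bin_size))

    # Histogram: number of words covering each bin.
    hist = [0] * n_bins
    for b in bboxes:
        first = int((b[lo] - start) // bin_size)
        last = min(n_bins, math.ceil((b[hi] - start) / bin_size))
        for i in range(first, last):
            hist[i] += 1

    # Each run of empty bins between occupied bins becomes one separator.
    separators = []
    gap_start = None
    for i, count in enumerate(hist):
        if count == 0 and gap_start is None:
            gap_start = i
        elif count > 0 and gap_start is not None:
            separators.append(start + (gap_start + i) / 2 * bin_size)
            gap_start = None
    return separators


# A toy 2x2 table: two words per row, two rows.
words = [
    (0, 0, 30, 10), (50, 0, 80, 10),
    (0, 20, 30, 30), (50, 20, 80, 30),
]
print(separator_positions(words, axis=0))  # [40.0]
print(separator_positions(words, axis=1))  # [15.0]
```

A real formatter would likely need to tolerate noise (for example, treating bins with very low counts as empty) and handle spanning cells, which this sketch ignores.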
Bugfixes
- Tweaked spanning cell merging
- Fixed bug where it would overwrite data
- Added a warning when importing from gmft directly (use gmft.auto instead)
- Merged PR #32, thanks!
