Chunk output; custom prompts; structured extraction improvements

VikParuchuri released this 27 Jun 19:55


Marker 1.8.0

Marker can now output a flat list of blocks with associated html, which is useful for RAG
Structured extraction beta is significantly improved, with better performance/accuracy
New LLM sectionheader processor will correctly label section header levels
You can pass a prompt to marker in LLM mode to adjust the output
Marker batch conversion script has somewhat better performance, closer to our inference container. Email us at hi@datalab.to if you want to get set up with our inference container (used on-prem at top AI research orgs)
Add an option to filter out blank white page images from output
Enable keeping pageheader/pagefooter

Chunking/RAG improvements

Add chunk output format which is a flat list of chunks with full html in each
Add an llm sectionheader processor that re-evaluates all the header levels relative to each other, so they are labeled consistently

Use the sectionheaderprocessor by setting --use_llm, and the chunk output by setting --output_format chunks.
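As a rough illustration of the two flags above, the sketch below assembles a command line from Python. It assumes the `marker_single` entry point and its `--output_dir` option are available in your install (check `marker_single --help`); `build_chunk_command` and the file paths are hypothetical names used only for this example.

```python
import shlex


def build_chunk_command(pdf_path, output_dir):
    """Assemble a marker command that enables LLM mode and chunk output.

    Assumption: the `marker_single` entry point and `--output_dir` option
    exist in the installed version; only `--use_llm` and
    `--output_format chunks` come from these release notes.
    """
    return [
        "marker_single",
        pdf_path,
        "--use_llm",                  # LLM mode, which the sectionheader processor needs
        "--output_format", "chunks",  # flat list of chunks with full html in each
        "--output_dir", output_dir,
    ]


command = build_chunk_command("report.pdf", "out")
# Print a copy-pasteable shell command; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to inspect the exact flags before starting a conversion.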

Structured extraction

Fix structured extraction, so it works much better than before (requires llm)
Improve structured extraction test app

You can try it with the streamlit app by running python extraction_app.py.

Promptability/customization!

Add promptability via block_correction_prompt, which can be used to create custom behavior (requires llm)

Try it by setting the block_correction_prompt config key to a specific prompt.
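A minimal sketch of setting that key, assuming configuration is supplied as a JSON file (for example through a `--config_json` option, if your version provides one). The prompt text, the `write_prompt_config` helper, and the file name are illustrative only; the `use_llm` key is an assumption, included because the prompt requires LLM mode.

```python
import json
import tempfile
from pathlib import Path


def write_prompt_config(path, prompt):
    """Write a JSON config holding a custom block correction prompt."""
    config = {
        "use_llm": True,  # assumed key name; the prompt only applies in LLM mode
        "block_correction_prompt": prompt,
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


config_path = Path(tempfile.gettempdir()) / "marker_prompt_config.json"
config = write_prompt_config(
    config_path,
    # Example prompt for illustration; not taken from the release.
    "Rewrite every table caption as a single plain sentence.",
)
print(config_path.read_text(encoding="utf-8"))
```

Keeping the prompt in a config file rather than on the command line avoids shell quoting issues with long, multi-sentence prompts.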

Misc

Get the marker script to perform a bit closer to our inference container by default (inference container gets 10-25 pages/s on H100). Will auto-configure worker count to available VRAM.
Fix an issue where marker would output blank pages as images
Enable keeping pageheader/pagefooter in the output
Adjust llm services to enable text-only input
Add html field to almost every block type

Test pageheader/pagefooter by setting keep_pagefooter_in_output and keep_pageheader_in_output.

What's Changed

Adding image format arg to OpenAI interface by @rgeorgi in #752
fix: don't remove leading newlines when paginate=True for markdown by @zanussbaum in #769
Fix LLM query retry logic and unnecessary sleeping by @runarmod in #772
Update issue templates by @tarun-menta in #767
WIP: Add Blank Page Processor by @tarun-menta in #750
Vik/0625fixes by @VikParuchuri in #768
Marker v1.8 by @VikParuchuri in #773
Add Azure OpenAI service support to marker package by @MauritsBrinkman in #675
Add azure openai by @VikParuchuri in #774

New Contributors

@rgeorgi made their first contribution in #752
@zanussbaum made their first contribution in #769
@runarmod made their first contribution in #772
@MauritsBrinkman made their first contribution in #675

Full Changelog: v1.7.5...v1.8.0
