Chunk output; custom prompts; structured extraction improvements
Marker 1.8.0
- Marker will now output a flat list of blocks with associated html, which is useful for RAG
- Structured extraction beta is significantly improved, with better performance/accuracy
- New LLM sectionheader processor will correctly label section header levels
- You can pass a prompt to marker in LLM mode to adjust the output
- Marker batch conversion script has somewhat better performance, closer to our inference container. Email us at hi@datalab.to if you want to get set up with our inference container (used on-prem at top AI research orgs).
- Add an option to filter out blank white page images from output
- Enable keeping pageheader/pagefooter.
Chunking/RAG improvements
- Add chunk output format which is a flat list of chunks with full html in each
- Add an LLM sectionheader processor that re-assigns all header levels relative to each other, so the heading hierarchy is consistent
Use the sectionheaderprocessor by setting --use_llm, and the chunk output by setting --output_format chunks.
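For RAG pipelines, each chunk's HTML usually needs to be reduced to plain text before embedding. The following is a minimal sketch, assuming each chunk is a dict that carries its markup under an `html` key and an identifier under `id`. Both field names, the helper names, and the sample data are assumptions for illustration; check the actual chunk output for the real schema.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def chunk_to_text(html):
    """Strip tags from one chunk's HTML and normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def chunks_to_records(chunks):
    """Turn a flat list of chunks into {id, text} records for embedding.

    Assumes each chunk is a dict with hypothetical "id" and "html" keys.
    Chunks with no visible text are skipped.
    """
    records = []
    for chunk in chunks:
        text = chunk_to_text(chunk["html"])
        if text:
            records.append({"id": chunk["id"], "text": text})
    return records


# Hypothetical sample data, not real Marker output.
sample_chunks = [
    {"id": "c0", "html": "<h1>Introduction</h1>"},
    {"id": "c1", "html": "<p>Marker converts <b>PDFs</b> to text.</p>"},
    {"id": "c2", "html": "<p>   </p>"},
]
records = chunks_to_records(sample_chunks)
```

Keeping the original `html` alongside the stripped text is often worthwhile, since the markup can be shown to users or passed to an LLM at answer time.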
Structured extraction
- Fix structured extraction, so it works much better than before (requires llm)
- Improve structured extraction test app
You can try it with the Streamlit app by running python extraction_app.py.
Promptability/customization!
- Add promptability via block_correction_prompt, which can be used to create custom behavior (requires llm)
Try it by setting the block_correction_prompt config key to a specific prompt.
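A minimal sketch of a config that enables a custom prompt is shown here. Only `block_correction_prompt` and `use_llm` come from these notes; the `build_llm_config` helper and the example prompt are hypothetical, and how the resulting dict is handed to Marker depends on your setup.

```python
def build_llm_config(block_correction_prompt=None):
    """Build a config dict for LLM mode with an optional correction prompt.

    `build_llm_config` is a hypothetical helper for illustration only.
    """
    # block_correction_prompt requires LLM mode, so it is always enabled here.
    config = {"use_llm": True}
    if block_correction_prompt is not None:
        prompt = block_correction_prompt.strip()
        if not prompt:
            raise ValueError("block_correction_prompt must not be empty")
        config["block_correction_prompt"] = prompt
    return config


# Example prompt (an assumption, chosen only to show the shape of the config).
config = build_llm_config("Rewrite every date in ISO 8601 format (YYYY-MM-DD).")
```

Building the dict through one helper keeps the "requires llm" constraint in a single place, so a prompt is never set without LLM mode.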
Misc
- Get the marker script to perform a bit closer to our inference container by default (inference container gets 10-25 pages/s on H100). It will auto-configure the worker count based on available VRAM.
- Fix an issue where marker would output blank pages as images
- Enable keeping pageheader/pagefooter in the output
- Adjust llm services to enable text-only input
- Add html field to almost every block type
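The notes do not describe how the worker count is derived from VRAM. As a purely hypothetical illustration of the idea, one simple heuristic divides the available memory by a per-worker budget. The function name, the 5 GB default, and the optional cap are all assumptions, not Marker's actual logic.

```python
def estimate_worker_count(available_vram_gb, vram_per_worker_gb=5.0, max_workers=None):
    """Estimate how many workers fit in the available VRAM.

    Hypothetical heuristic: the per-worker budget is an assumed placeholder,
    not a figure taken from Marker.
    """
    if vram_per_worker_gb <= 0:
        raise ValueError("vram_per_worker_gb must be positive")
    # Always run at least one worker, even on a small GPU.
    workers = max(1, int(available_vram_gb // vram_per_worker_gb))
    if max_workers is not None:
        workers = min(workers, max_workers)
    return workers


workers_large_gpu = estimate_worker_count(80)                 # 80 GB card
workers_small_gpu = estimate_worker_count(3)                  # below one budget
workers_capped = estimate_worker_count(80, max_workers=8)     # explicit cap
```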
Test pageheader/pagefooter by setting keep_pagefooter_in_output and keep_pageheader_in_output.
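If you drive Marker from the command line, config keys like these can be turned into arguments programmatically. This sketch assumes each config key maps to a like-named `--flag`, in the same way `--use_llm` and `--output_format chunks` are written above; that mapping and the `config_to_cli_args` helper are assumptions, so verify the flags against your installed version.

```python
def config_to_cli_args(config):
    """Convert a config dict to a list of command-line arguments.

    Assumes each key maps to a like-named `--flag`: True becomes a bare
    switch, False/None is omitted, and other values follow their flag.
    """
    args = []
    for key, value in config.items():
        if value is True:
            args.append(f"--{key}")
        elif value is False or value is None:
            continue  # disabled options are simply left off
        else:
            args.extend([f"--{key}", str(value)])
    return args


config = {
    "keep_pageheader_in_output": True,
    "keep_pagefooter_in_output": True,
    "output_format": "chunks",
}
cli_args = config_to_cli_args(config)
```

Returning a list rather than a joined string lets the arguments be passed straight to `subprocess.run` without shell quoting concerns.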
What's Changed
- Adding image format arg to OpenAI interface by @rgeorgi in #752
- fix: don't remove leading newlines when paginate=True for markdown by @zanussbaum in #769
- Fix LLM query retry logic and unnecessary sleeping by @runarmod in #772
- Update issue templates by @tarun-menta in #767
- WIP: Add Blank Page Processor by @tarun-menta in #750
- Vik/0625fixes by @VikParuchuri in #768
- Marker v1.8 by @VikParuchuri in #773
- Add Azure OpenAI service support to marker package by @MauritsBrinkman in #675
- Add azure openai by @VikParuchuri in #774
New Contributors
- @rgeorgi made their first contribution in #752
- @zanussbaum made their first contribution in #769
- @runarmod made their first contribution in #772
- @MauritsBrinkman made their first contribution in #675
Full Changelog: v1.7.5...v1.8.0