Chunk output; custom prompts; structured extraction improvements

VikParuchuri released this 27 Jun 19:55
edbcb8c

Marker 1.8.0

  • Marker can now output a flat list of blocks with associated HTML, which is useful for RAG
  • Structured extraction beta is significantly improved, with better performance/accuracy
  • New LLM sectionheader processor will correctly label section header levels
  • You can pass a prompt to marker in LLM mode to adjust the output
  • Marker batch conversion script has somewhat better performance, closer to our inference container. Email us at hi@datalab.to if you want to get set up with our inference container (used on-prem at top AI research orgs)
  • Add an option to filter out blank white page images from output
  • Enable keeping pageheader/pagefooter.

Chunking/RAG improvements

  • Add a chunk output format, which is a flat list of chunks with the full HTML in each
  • Add an LLM sectionheader processor that re-evaluates all header levels relative to each other

Use the sectionheaderprocessor by setting --use_llm, and the chunk output by setting --output_format chunks.
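
As a rough sketch of how the two flags above combine, the helper below assembles a command line in Python. The flags --use_llm and --output_format chunks come from these notes; the marker_single entry point, the helper name build_marker_command, and the file name paper.pdf are assumptions made for illustration, so check them against your installed version.

```python
import shlex


def build_marker_command(pdf_path, use_llm=True, output_format="chunks"):
    """Assemble an argv list for a single-file conversion.

    Assumption: the CLI entry point is called `marker_single`.
    The two flags are the ones named in these release notes.
    """
    argv = ["marker_single", pdf_path]
    if use_llm:
        # Enables LLM mode, which the sectionheader processor requires.
        argv.append("--use_llm")
    # Request the flat chunk list instead of the default output format.
    argv += ["--output_format", output_format]
    return argv


command = build_marker_command("paper.pdf")
# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```

The list form can be handed to subprocess.run directly, which avoids shell quoting problems with unusual file names.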

Structured extraction

  • Fix structured extraction, so it works much better than before (requires llm)
  • Improve structured extraction test app

You can try it with the Streamlit app by running python extraction_app.py.

Promptability/customization!

  • Add promptability via block_correction_prompt, which can be used to create custom behavior (requires llm)

Try it by setting the block_correction_prompt config key to a specific prompt.
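
One possible way to set the key is through a JSON config file, sketched below. The key name block_correction_prompt and the LLM requirement come from these notes; the use_llm config key, the helper write_marker_config, the file name, the example prompt, and the idea of passing the file to marker via a --config_json option are all assumptions to verify against your installed version.

```python
import json
import tempfile
from pathlib import Path


def write_marker_config(path, prompt):
    """Write a JSON config containing a custom block correction prompt.

    Assumption: marker accepts a JSON config whose keys mirror its
    config keys, and `use_llm` is the config equivalent of --use_llm.
    """
    config = {
        "use_llm": True,  # the prompt only takes effect in LLM mode
        "block_correction_prompt": prompt,
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Hypothetical file location and prompt, for illustration only.
config_path = Path(tempfile.gettempdir()) / "marker_config.json"
config = write_marker_config(config_path, "Rewrite all dates in ISO 8601 format.")
print(config_path.read_text(encoding="utf-8"))
```

Keeping the prompt in a file rather than on the command line makes longer prompts easier to edit and version.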

Misc

  • Get the marker script to perform a bit closer to our inference container by default (the inference container gets 10-25 pages/s on an H100). The script now auto-configures the worker count based on available VRAM.
  • Fix an issue where marker would output blank pages as images
  • Enable keeping pageheader/pagefooter in the output
  • Adjust llm services to enable text-only input
  • Add html field to almost every block type

Test pageheader/pagefooter by setting keep_pagefooter_in_output and keep_pageheader_in_output.
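
A minimal sketch of toggling both keys in a config dictionary follows. The key names are taken from these notes; treating them as boolean values, and the helper name with_page_headers_and_footers, are assumptions made for illustration.

```python
def with_page_headers_and_footers(config, keep_header=True, keep_footer=True):
    """Return a copy of `config` with the header/footer keys set.

    Assumption: both keys take boolean values.
    """
    updated = dict(config)  # copy so the caller's dict is not mutated
    updated["keep_pageheader_in_output"] = keep_header
    updated["keep_pagefooter_in_output"] = keep_footer
    return updated


base_config = {"output_format": "chunks"}
full_config = with_page_headers_and_footers(base_config)
print(full_config)
```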

Full Changelog: v1.7.5...v1.8.0