New OCR model; inline math; beta structured extraction
Surya OCR 3 (inline math, better accuracy/speed)
New OCR model that is more accurate, supports inline math, and is faster on GPU. Use the --format_lines option to OCR inline math properly.
More benchmarks for this model are coming soon. Math recognition appears to be the best available, but this has not been fully validated yet.
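For Python users, the --format_lines option mentioned above can probably be set through the config dictionary instead of the command line. The sketch below assumes that the `format_lines` config key mirrors the CLI flag and that the standard `PdfConverter` setup applies. The helper names `build_inline_math_config` and `convert_with_inline_math` are hypothetical and only for illustration.

```python
from typing import Any


def build_inline_math_config(output_format: str = "markdown") -> dict[str, Any]:
    """Return a marker config dict with line formatting enabled.

    Assumption: the "format_lines" key mirrors the --format_lines CLI flag.
    """
    return {"output_format": output_format, "format_lines": True}


def convert_with_inline_math(filepath: str):
    """Convert a file with inline math OCR enabled (requires marker installed)."""
    # Imported lazily so the config helper above works without marker.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    config_parser = ConfigParser(build_inline_math_config())
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
    )
    return converter(filepath)
```

Calling `convert_with_inline_math("FILEPATH")` loads the models on first use, so expect the first call to be slow.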
Structured extraction (beta)
We now have an early version of structured extraction. You pass in a file and a pydantic schema to extract data. You can use it like this:
from marker.converters.extraction import ExtractionConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from pydantic import BaseModel

# Describe the data to extract as a pydantic model.
class Links(BaseModel):
    links: list[str]

# The converter receives the model's JSON schema through the page_schema config key.
schema = Links.model_json_schema()
config_parser = ConfigParser({
    "page_schema": schema
})

converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)
rendered = converter("FILEPATH")

This requires you to configure an LLM service; see the docs here.
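Since `page_schema` is a plain JSON Schema dictionary, it can help to see what the `Links` model expands to and how a result could be checked against it. The sketch below uses only the standard library. The schema literal is an assumption about what pydantic v2 produces for this model, and `matches_links_schema` is a hypothetical helper that checks a JSON string shaped like the schema. It does not reflect the converter's actual output format.

```python
import json

# Roughly what Links.model_json_schema() returns under pydantic v2 (assumption).
LINKS_SCHEMA = {
    "title": "Links",
    "type": "object",
    "properties": {
        "links": {"title": "Links", "type": "array", "items": {"type": "string"}},
    },
    "required": ["links"],
}


def matches_links_schema(payload: str) -> bool:
    """Check a JSON string against the Links shape using only the stdlib.

    Hypothetical helper: it covers just this schema, not general JSON Schema.
    """
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # Every required key must be present.
    if any(key not in data for key in LINKS_SCHEMA["required"]):
        return False
    links = data["links"]
    # The schema asks for an array whose items are all strings.
    return isinstance(links, list) and all(isinstance(item, str) for item in links)
```

For anything beyond a toy schema, validating with the pydantic model itself is likely the better choice, since it stays in sync with the schema automatically.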
There is also a structured extraction GUI app, which you can run with:
pip install streamlit streamlit-ace
marker_extract

OCR converter
You can now run OCR with marker and keep the individual characters in the output. This allows block equations to be handled properly. You can use it like this:
from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict

converter = OCRConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")

Misc improvements
- The PdfConverter can now take an io.BytesIO object instead of a file path.
- Fixed some rare bugs with merging blocks together.
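The in-memory input mentioned above is useful when a PDF arrives as bytes, for example from an upload or a download, and never touches disk. The following is a minimal sketch, assuming the converter is called with the stream in place of the file path. The helper names `pdf_bytes_to_stream` and `convert_in_memory` are hypothetical.

```python
import io


def pdf_bytes_to_stream(data: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in a seekable in-memory stream."""
    return io.BytesIO(data)


def convert_in_memory(data: bytes):
    """Convert PDF bytes without writing a file (requires marker installed)."""
    # Imported lazily so the stream helper above works without marker.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    converter = PdfConverter(artifact_dict=create_model_dict())
    # Assumption: the stream is passed where the file path would normally go.
    return converter(pdf_bytes_to_stream(data))
```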
What's Changed
- Keep chars by @VikParuchuri in #662
- Keep chars by @VikParuchuri in #665
- Structured extraction by @VikParuchuri in #687
- WIP: Foundation Model Integration by @tarun-menta in #616
- New OCR model, structured extraction beta by @VikParuchuri in #693
Full Changelog: v1.6.2...v1.7.0