This Solr plugin lets you put word-level OCR text into one or more of your documents' fields and then obtain structured highlighting data, with the text and its position on the page, at query time. The OCR data does not have to be stored in the index itself and can live at arbitrary external locations instead.
It works by extending Solr's standard
UnifiedHighlighter with support for
loading external field values and determining OCR positions from those field
values. This means that all options and query types supported by the
UnifiedHighlighter are also supported for OCR highlighting. The plugin also
works transparently with non-OCR fields and just lets the default
implementation handle those.
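Since the standard highlighting options carry over, a query against an OCR field can be put together like any other Solr highlighting request. The following is only a minimal sketch: the host, the core name `ocr`, the field name `ocr_text`, and the helper `build_highlight_url` are all hypothetical, and the use of Solr's standard `hl.fl` parameter to select the field is an assumption (the plugin may expect a different parameter, so check the Getting Started guide).

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, core, query, field,
                        field_param="hl.fl", snippets=3):
    """Build a Solr select URL that asks for highlighting on one field.

    field_param defaults to Solr's standard "hl.fl"; whether the plugin
    reads this parameter for OCR fields is an assumption.
    """
    params = {
        "q": f"{field}:({query})",    # search inside the OCR field
        "hl": "true",                 # standard Solr switch for highlighting
        field_param: field,           # which field to highlight
        "hl.snippets": str(snippets)  # standard Solr: max snippets per field
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and field names, for illustration only.
    print(build_highlight_url("http://localhost:8983/solr", "ocr",
                              "steam engine", "ocr_text"))
```

The field-selection parameter is kept as an argument so the sketch can be adapted once the actual parameter name is confirmed.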
The plugin works with all Solr versions >= 7.x (tested with 7.6, 7.7 and 8.0).
- Index hOCR, ALTO or MiniOCR directly without preprocessing
- Retrieve all the information needed to render a highlighted snippet view directly from Solr, without postprocessing
- Keep your index size manageable, since the OCR does not have to be stored in the index
- Download the latest JAR from the GitHub Releases Page
- Drop the JAR into the core/lib/ directory for your Solr core
- Refer to the Getting Started guide for instructions on how to configure Solr
If you want to use the latest bleeding-edge version, you can also compile the plugin yourself. For this you will need at least Java 8 and Maven:
$ mvn package
The JAR will be in
Running the example
The repository includes a full-fledged Docker-based example setup based on the Google Books 1000 Dataset. It consists of 1000 volumes along with their OCRed text in the hOCR format and all book pages as full-resolution JPEG images. The example ships with a search interface that allows querying the OCRed texts and displays the matching passages as highlighted image and text snippets. Also included is a small IIIF viewer that allows viewing the complete volumes and searching for text within them.
$ cd example
$ docker-compose up -d
$ ./ingest_google1000.py
For more information about the example setup, refer to the documentation.
- The supported file size is limited to 2 GiB, since Lucene uses 32-bit integers throughout for storing offsets
- The hl.weightMatches parameter is not supported when using external UTF-8 files, i.e. it will be ignored and the classical highlighting approach will be used instead.
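The file size limit can be checked before indexing. The sketch below is an illustration only: the helper names are hypothetical, and it conservatively treats any file of 2 GiB (2^31 bytes) or more as too large, which is an assumption about where exactly the boundary lies.

```python
import os

# Offsets are stored as signed 32-bit integers, so the largest
# representable offset is 2**31 - 1. Treating 2**31 bytes and above as
# too large is a conservative assumption, not the plugin's documented check.
MAX_OCR_FILE_BYTES = 2**31


def fits_offset_limit(size_in_bytes):
    """Return True if a file of this size stays below the 2 GiB limit."""
    return size_in_bytes < MAX_OCR_FILE_BYTES


def find_oversized(paths):
    """Return the paths whose on-disk size reaches or exceeds the limit."""
    return [p for p in paths if not fits_offset_limit(os.path.getsize(p))]
```

Running `find_oversized` over the OCR files before ingestion gives a list of documents that would need to be split or skipped.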
Found a bug? Want a new feature? Make a fork, create a pull request.
For larger changes/features, it's usually wise to open an issue before starting the work, so we can discuss if it's a fit.
We always appreciate it when users let us know how they're using our software and libraries. It helps us to focus our efforts on our open source offerings, so we can create even more useful stuff for the community.
So don't hesitate to drop us a line at email@example.com if you could make use of the plugin :-)