0.6.0

jbaiter released this 11 May 15:27

This is a major new release with significant improvements in stability, accuracy and, most importantly, performance.
Updating is highly recommended, especially for ALTO users, who can expect a speed-up in indexing of up to 6000% (i.e. 60x as fast). We also recommend updating your JVM to at least Java 11 (LTS), since Java 9 introduced a feature that speeds up highlighting significantly.

Performance:

  • Indexing performance drastically improved for ALTO, worse for hOCR and MiniOCR.
    Under the hood we switched from regular-expression and automaton-based parsing to a proper XML parser to support more features and improve correctness. This drastically improved indexing performance for ALTO (6000% speedup, since the previous implementation was pathologically slow), but caused a big hit for hOCR (~57% slower) and a slight hit for MiniOCR (~15% slower). These numbers are based on benchmarks done on a ramdisk, so the changes are very likely to be less pronounced in practice, depending on the choice of storage. Note that this also makes the parser stricter about whitespace: if you were previously indexing OCR documents without any whitespace between word elements, you will run into problems (see #147).
  • Highlighting performance significantly improved for all formats.
    The time for highlighting a single snippet has gone down for all formats (ALTO 12x as fast, hOCR 10x as fast, MiniOCR 6x as fast). Again, these numbers are based on benchmarks performed on a ramdisk and might be less pronounced in practice, depending on the storage layer.

New Features:

  • Indexing alternative forms encoded in the source OCR files.
    All supported formats offer a way to encode alternative readings for recognized words. The plugin can now parse these from the input files and index them at the same position as the default form. This is a form of index-time term expansion (much like the Synonym Graph Filter shipping with Solr). For example, if your OCR file has the alternatives christmas and christrias for the token clistrias in the span presents on clistrias eve, users can search for "presents christmas" or "presents clistrias" and get the correct match with full highlighting in both cases. Refer to the corresponding section in the documentation for instructions on setting it up.
  • On-the-fly repair of 'broken' markup.
    OcrCharFilterFactory has a new option fixMarkup that enables on-the-fly repair of invalid XML in OCR input documents, namely problems that can arise when the markup contains unescaped instances of <, > and &. This option is disabled by default because it incurs a bit of a performance hit during indexing. We recommend enabling it only when your OCR engine exhibits this problem and you are unable to fix the files on disk.
  • Return snippets in order of appearance.
    By default, Solr scores each highlighted passage as a "mini-document" and returns the passages in order of decreasing score. While this is a good match for a lot of use cases, many others are better served by a simple by-appearance order. This can now be controlled with the new hl.ocr.scorePassages parameter, which switches to the by-appearance sort order if set to off (it is set to on by default).
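
To make the idea of indexing alternative forms at the same position more concrete, here is a minimal Python sketch of a toy positional index. It is not the plugin's implementation; the functions build_index and matches and the token layout are hypothetical and exist only for illustration. Each position carries the default form plus any alternative readings, so a query term matches regardless of which form the user types.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions at which it occurs.

    `tokens` holds one list of forms per position: the default form
    first, followed by any alternative readings (hypothetical layout).
    """
    index = defaultdict(set)
    for position, forms in enumerate(tokens):
        for form in forms:
            # Alternatives share the position of the default form.
            index[form].add(position)
    return index


def matches(index, query):
    """Return the positions per query term if every term occurs, else None."""
    terms = query.split()
    if all(term in index for term in terms):
        return {term: sorted(index[term]) for term in terms}
    return None


# The span "presents on clistrias eve" with two alternatives for "clistrias".
tokens = [["presents"], ["on"], ["clistrias", "christmas", "christrias"], ["eve"]]
index = build_index(tokens)

print(matches(index, "presents christmas"))
print(matches(index, "presents clistrias"))
```

Both queries resolve to the same position for the third token, which is what allows the highlighter to mark the same region of the page in either case.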

API changes:

  • No more need for an explicit hl.fl parameter for highlighting non-OCR fields. By default, if highlighting is enabled and no hl.fl parameter is passed by the user, Solr falls back to highlighting every stored field in the document. Previously this did not work with the plugin and users always had to specify explicitly which fields they wanted to have highlighted. This is no longer necessary: the default behavior now works as expected.
  • Add a new hl.ocr.trackPages parameter to disable page tracking during highlighting.
    This is intended for users who index one page per document. In these cases, seeking backwards to determine the page identifier of a match is not needed, since the containing document contains enough information to identify the page. Disabling page tracking improves highlighting performance because less backwards-seeking in the input files is required.
  • Add new expandAlternatives attribute to OcrCharFilterFactory. This enables the parsing of alternative readings from input files (see above and the corresponding section in the documentation).
  • Add new hl.ocr.scorePassages parameter to disable sorting of passages by their score. See the above section under New Features for an explanation of this flag.
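
As a rough illustration of how the two new request parameters might be passed, the following Python sketch builds a query URL. The base URL, the core name, the field name ocr_text and the helper build_highlight_query are assumptions made for this example. The hl.ocr.fl parameter is not mentioned in these notes and is assumed here to be the way OCR fields are selected, and the off value for hl.ocr.trackPages is assumed to mirror the on/off values described for hl.ocr.scorePassages.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, query, score_passages=True, track_pages=True):
    """Build a Solr select URL with the OCR highlighting flags set.

    Hypothetical helper: only hl.ocr.scorePassages and hl.ocr.trackPages
    are taken from the release notes, everything else is an assumption.
    """
    params = {
        "q": query,
        "hl": "true",
        "hl.ocr.fl": "ocr_text",  # assumed parameter and field name
        "hl.ocr.scorePassages": "on" if score_passages else "off",
        "hl.ocr.trackPages": "on" if track_pages else "off",
    }
    return f"{base_url}?{urlencode(params)}"


# One page per document, snippets in order of appearance.
url = build_highlight_query(
    "http://localhost:8983/solr/ocr/select",  # assumed host and core
    "presents christmas",
    score_passages=False,
    track_pages=False,
)
print(url)
```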

Bugfixes:

  • Improved tolerance for incomplete bounding boxes. Previously the occurrence of an incomplete bounding box in a snippet (i.e. with one or more missing coordinates) would crash the whole query. We now simply insert a default value of 0 in these cases.
  • Improvements in the handling of hyphenated terms. This release fixes a few bugs in edge cases when handling hyphenated words during indexing, highlighting and snippet text generation.
  • Handle empty field values during indexing. This would previously lead to an exception since the OCR parsers would try to either load a file from the empty string or parse OCR markup from it.
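
The bounding box fix can be pictured with a small Python sketch. This is not the plugin's code: the helper complete_box is hypothetical, and the ALTO-style attribute names are used purely as an example of a box with four named coordinates. Missing or empty coordinates fall back to 0 instead of raising an error.

```python
def complete_box(attrs, keys=("HPOS", "VPOS", "WIDTH", "HEIGHT"), default=0):
    """Return all four coordinates, filling missing or empty ones with `default`.

    Hypothetical helper for illustration; `attrs` maps attribute names
    to their string values as read from the markup.
    """
    # `or default` also covers attributes that are present but empty.
    return {key: float(attrs.get(key) or default) for key in keys}


# A word element whose HEIGHT attribute is missing.
box = complete_box({"HPOS": "10", "VPOS": "20", "WIDTH": "30"})
print(box)
```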