Resolve hyphenation in indexing analysis chain #35

jbaiter · 2019-05-14T09:28:55Z

Currently, hyphenation is not resolved when indexing OCR documents from disk. It would be good to have a way to resolve hyphenations that are designated in the source data during indexing.

A way to go about this would be:

At indexing time, replace the two hyphenation parts with a single word block that contains the dehyphenated form (while keeping the length the same!).
At highlighting time, check if the highlighted span contains multiple word blocks. If so, use the dehyphenated forms for building the plaintext snippet and the hyphenated parts for calculating the regions and highlighting snippets.
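
The indexing-time step above can be sketched in plain Java. This is a minimal illustration, not the plugin's implementation: the class `HyphenationSketch` and the method `replaceKeepingLength` are hypothetical names, and padding with spaces is an assumption about how "keeping the length the same" could be achieved so that character offsets after the replaced span do not shift.

```java
/** Hypothetical helper, for illustration only. */
public final class HyphenationSketch {
    private HyphenationSketch() {}

    /**
     * Replaces source[start, end) (the two hyphenation parts plus whatever
     * lies between them) with the dehyphenated form, padded with spaces so
     * that the result has exactly the same length as the input.
     */
    public static String replaceKeepingLength(
            String source, int start, int end, String dehyphenated) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IllegalArgumentException("invalid span");
        }
        int spanLength = end - start;
        if (dehyphenated.length() > spanLength) {
            throw new IllegalArgumentException(
                "dehyphenated form is longer than the span it replaces");
        }
        StringBuilder sb = new StringBuilder(source.length());
        sb.append(source, 0, start);            // text before the span
        sb.append(dehyphenated);                // joined word
        for (int i = dehyphenated.length(); i < spanLength; i++) {
            sb.append(' ');                     // padding keeps offsets stable
        }
        sb.append(source, end, source.length()); // text after the span
        return sb.toString();
    }
}
```

For example, calling `replaceKeepingLength("ein Zusam-\nmenhang hier", 4, 18, "Zusammenhang")` yields a string of the same length in which `hier` still starts at offset 19, so offsets computed against the original input remain valid for everything after the joined word.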

This would have to be implemented for each format:

ALTO supports hyphenation with @SUBS_TYPE="HypPart1"/"HypPart2", @SUBS_CONTENT, and <HYP />
hOCR supports hyphenation by encoding it with a soft hyphen character (U+00AD)
MiniOCR does not support hyphenation at the moment

An open question is how this differs from, or improves on, Solr's own hyphenation filter.

After some thinking, this involves the following tasks:

ALTO: Resolve hyphenation in AltoCharFilterFactory so that dehyphenated tokens are indexed (see 279f3be)
ALTO: Add code to AltoPassageFormatter#getTextFromXml to resolve the hyphenations (see fb5c201)
hOCR: Add a HocrCharFilterFactory that strips out the soft hyphen (U+00AD) characters so downstream tokenization does not split the hyphenated tokens (see f770248)

At passage generation time, nothing needs to be done for hOCR, since text renderers won't display the soft hyphen anyway.
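
The hOCR task above boils down to removing soft hyphens before tokenization. The following is a minimal sketch in plain Java, assuming the input is already available as a string. `SoftHyphenSketch` and `stripSoftHyphens` are hypothetical names, and the sketch deliberately ignores the offset correction that a real Lucene char filter would need in order to map positions back to the original input.

```java
/** Hypothetical helper, for illustration only. */
public final class SoftHyphenSketch {
    private static final char SOFT_HYPHEN = '\u00AD';

    private SoftHyphenSketch() {}

    /** Returns the text with every soft hyphen (U+00AD) removed. */
    public static String stripSoftHyphens(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != SOFT_HYPHEN) {
                sb.append(c); // keep everything except the soft hyphen
            }
        }
        return sb.toString();
    }
}
```

With the soft hyphen removed, a tokenizer sees `Zusammenhang` as one token instead of splitting it at the hyphenation point. Unlike the ALTO sketch, this shortens the text, which is why an actual char filter has to track the offset differences.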


jbaiter · 2019-06-13T14:58:09Z

Implemented in #45

jbaiter added the enhancement New feature or request label May 14, 2019

jbaiter mentioned this issue Jun 7, 2019

Handle hyphenation #45

Merged

jbaiter closed this as completed Jun 13, 2019

beatrycze-volk mentioned this issue Sep 11, 2023

Display and index hyphenated words as normal words kitodo/kitodo-presentation#1009

Merged
