Currently, hyphenation is not resolved when indexing OCR documents from disk. It would be good to have a way to resolve hyphenations that are designated in the source data during indexing.
A way to go about this would be:
At indexing time, replace the two hyphenation parts with a single Word block containing the dehyphenated form (while keeping the overall length the same!)
At highlighting time, check if the highlighted span contains multiple word blocks. If so, use the dehyphenated forms for building the plaintext snippet and the hyphenated parts for calculating the regions and highlighting snippets.
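The length-preserving replacement in the first step could be sketched like this (the class and method names are illustrative, not part of the plugin's API):

```java
// Hypothetical sketch: merge the two hyphenation parts into a single
// dehyphenated form while keeping the overall character length the same,
// so that the offsets of all following words remain valid.
public class Dehyphenator {
    static String dehyphenate(String part1, String part2, String dehyphenated) {
        int totalLen = part1.length() + part2.length();
        if (dehyphenated.length() > totalLen) {
            // Cannot fit the dehyphenated form without shifting offsets;
            // fall back to the hyphenated original.
            return part1 + part2;
        }
        StringBuilder sb = new StringBuilder(dehyphenated);
        while (sb.length() < totalLen) {
            sb.append(' '); // pad with spaces to preserve the length
        }
        return sb.toString();
    }
}
```

For example, `dehyphenate("hyphen-", "ation", "hyphenation")` yields the dehyphenated word padded to the original twelve characters.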
This would have to be implemented for each format:
ALTO supports hyphenation with @SUBS_TYPE="HypPart1/2", @SUBS_CONTENT and <HYP/>
hOCR supports hyphenation by encoding the break with the soft hyphen character (&shy;, U+00AD)
MiniOCR does not support hyphenation at the moment
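For illustration, a hyphenated word in ALTO might look like this (coordinates and surrounding TextLine elements omitted); both parts carry the full word in SUBS_CONTENT:

```xml
<!-- "dehy-" at the end of one line, "phenation" at the start of the next -->
<String CONTENT="dehy" SUBS_TYPE="HypPart1" SUBS_CONTENT="dehyphenation"/>
<HYP CONTENT="-"/>
<!-- ... end of TextLine, next TextLine ... -->
<String CONTENT="phenation" SUBS_TYPE="HypPart2" SUBS_CONTENT="dehyphenation"/>
```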
An open question is how this differs from, or improves on, Solr's own hyphenation filter.
After some thinking, this involves the following tasks:
Extend AltoCharFilterFactory so that de-hyphenated tokens are indexed (see 279f3be)
Extend AltoPassageFormatter#getTextFromXml to resolve the hyphenations (see fb5c201)
Add a HocrCharFilterFactory that strips out the &shy; characters so downstream tokenization does not split the hyphenated tokens (see f770248)
At passage generation time, nothing needs to be done for hOCR, since text renderers won't display the soft hyphen anyway.