Resolve hyphenation in indexing analysis chain #35

Closed
3 tasks done
jbaiter opened this issue May 14, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@jbaiter
Member

jbaiter commented May 14, 2019

Currently, hyphenation is not resolved when indexing OCR documents from disk. It would be good to have a way to resolve hyphenations that are designated in the source data during indexing.

A way to go about this would be:

  1. At indexing time, replace the two hyphenation parts with a single Word block that contains the dehyphenated form (while keeping the length the same!)
  2. At highlighting time, check whether the highlighted span contains multiple word blocks. If so, use the dehyphenated forms to build the plaintext snippet, and the hyphenated parts to calculate the regions and the highlighting snippets.
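
To illustrate step 1, here is a minimal Python sketch of a length-preserving replacement on a plain string. It is not the plugin's code: the function name `dehyphenate_keep_length`, the `(start, end, whole)` span format, and the choice of trailing spaces as padding are all assumptions made for the example. The point is only that padding the dehyphenated form to the width of the original span leaves every later character offset unchanged.

```python
from typing import List, Tuple


def dehyphenate_keep_length(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end) span with its dehyphenated form.

    The replacement is right-padded with spaces (an assumption of this
    sketch) so that the output has exactly the same length as the input.
    """
    out = []
    pos = 0
    for start, end, whole in sorted(spans):
        width = end - start
        if len(whole) > width:
            raise ValueError("dehyphenated form is longer than the span it replaces")
        out.append(text[pos:start])      # untouched text before the span
        out.append(whole.ljust(width))   # dehyphenated word, padded to the span width
        pos = end
    out.append(text[pos:])               # untouched text after the last span
    return "".join(out)


source = "eine Ver-\nbindung zwischen"
start = source.index("Ver-")
end = start + len("Ver-\nbindung")
result = dehyphenate_keep_length(source, [(start, end, "Verbindung")])
```

Because the length is preserved, an offset found in `result` (for example the start of `zwischen`) points at the same position in `source`.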

This would have to be implemented for each format:

  • ALTO supports hyphenation with @SUBS_TYPE="HypPart1"/"HypPart2", @SUBS_CONTENT and <HYP />
  • hOCR supports hyphenation by encoding it with &shy;
  • MiniOCR does not support hyphenation at the moment
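
As a rough illustration of the ALTO case, the following Python sketch reads the hyphenation markers from a small hand-written fragment. It assumes the ALTO schema's `SUBS_TYPE` and `SUBS_CONTENT` attributes on `String` elements, omits the XML namespace for brevity, and uses a hypothetical helper name (`dehyphenated_words`). It only shows how the dehyphenated form could be picked up; it says nothing about how the plugin does it.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; real ALTO files declare a namespace.
ALTO_SNIPPET = """
<TextBlock>
  <TextLine>
    <String CONTENT="eine"/><SP/>
    <String CONTENT="Ver" SUBS_TYPE="HypPart1" SUBS_CONTENT="Verbindung"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="bindung" SUBS_TYPE="HypPart2" SUBS_CONTENT="Verbindung"/><SP/>
    <String CONTENT="zwischen"/>
  </TextLine>
</TextBlock>
"""


def dehyphenated_words(alto_xml: str) -> list:
    """Return the words of the fragment with hyphenated pairs merged."""
    root = ET.fromstring(alto_xml.strip())
    words = []
    for string in root.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # First half: emit the full word recorded in SUBS_CONTENT.
            words.append(string.get("SUBS_CONTENT", string.get("CONTENT", "")))
        elif subs_type == "HypPart2":
            # Second half: already covered by the first half, so skip it.
            continue
        else:
            words.append(string.get("CONTENT", ""))
    return words


words = dehyphenated_words(ALTO_SNIPPET)
```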

An open question is how this is different/better than Solr's own hyphenation filter.


After some thinking, this involves the following tasks:

  • ALTO: Resolve hyphenation in AltoCharFilterFactory so that dehyphenated tokens are indexed (see 279f3be)
  • ALTO: Add code to AltoPassageFormatter#getTextFromXml to resolve the hyphenations (see fb5c201)
  • hOCR: Add a HocrCharFilterFactory that strips out the &shy; characters so that downstream tokenization does not split the hyphenated tokens (see f770248)

At passage generation time, nothing needs to be done for hOCR, since text renderers won't display the soft hyphen anyway.
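
The hOCR task can be sketched in a few lines of Python. This is an illustration only, not the code from the commits above: the function name `strip_soft_hyphens` and the returned offset list are assumptions. It removes both the `&shy;` entity and the literal U+00AD character, and records for every output character the input position it came from, so that match positions can be mapped back to the original markup.

```python
from typing import List, Tuple

SHY_ENTITY = "&shy;"
SHY_CHAR = "\u00ad"


def strip_soft_hyphens(text: str) -> Tuple[str, List[int]]:
    """Remove soft hyphens and return (stripped text, offset map).

    offsets[i] is the index in `text` of the i-th character of the output.
    """
    out = []
    offsets = []
    i = 0
    while i < len(text):
        if text.startswith(SHY_ENTITY, i):
            i += len(SHY_ENTITY)   # skip the whole entity
        elif text[i] == SHY_CHAR:
            i += 1                 # skip the literal soft hyphen
        else:
            out.append(text[i])
            offsets.append(i)
            i += 1
    return "".join(out), offsets


stripped, offsets = strip_soft_hyphens("Ver&shy;bindung")
```

With the soft hyphen gone, a tokenizer sees one token instead of two, and `offsets` still allows a hit in the stripped text to be located in the source.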

@jbaiter jbaiter added the enhancement New feature or request label May 14, 2019
@jbaiter
Member Author

jbaiter commented Jun 13, 2019

Implemented in #45
