Ocr'd pdf does not land in catalog (Plone 5) #57

Open
pabo3000 opened this Issue Nov 21, 2015 · 1 comment

pabo3000 commented Nov 21, 2015

I have OCR'd a PDF that contains a fax. I expected to be able to find words from the fax with the Plone search, but that does not work: the words are not in the catalog. I have checked /Plone/portal_catalog/plone_lexicon.
I placed a pdb breakpoint in catalog.py, but it was never reached. However, the SearchableText adapters seem to be registered correctly.

@pabo3000 pabo3000 added the bug label Nov 21, 2015

pabo3000 commented Nov 23, 2015

After further investigation I found the following. If a PDF contains text information, that text is added to the index and you can find it with the Plone search. If you add an image, or a PDF without text information (an image saved as a PDF), the text is not added to the index. A PDF that already contains text information is OCR'd anyway. That is unnecessary, and the process is expensive and error-prone: some words are misrecognized. If you spot a misrecognized word in the text view of the document viewer, try to find it with Plone's full-text search. It won't be there, but the original word will be. Example: imagine the word "proper" is recognized as "prooer". Then "proper" is in the index but "prooer" is not. In these cases the bug is a feature. It shows that the OCR'd text is not written to the index.
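
For anyone who wants to reproduce the check, here is a minimal sketch of the comparison described above. It assumes only that the catalog object offers `searchResults(**query)` and that each result has `getPath()`, as ZCatalog does. The helper names `paths_matching` and `compare_words` are made up for illustration and are not part of the add-on.

```python
def paths_matching(catalog, word):
    """Return the paths of all catalog entries whose SearchableText matches word."""
    # ZCatalog-style query: keyword arguments name the index to search.
    brains = catalog.searchResults(SearchableText=word)
    return [brain.getPath() for brain in brains]


def compare_words(catalog, original_word, misrecognized_word):
    """Look up both spellings and return a dict mapping each word to its hits.

    If only the original word has hits, the index was built from the text
    embedded in the PDF and not from the OCR output.
    """
    return {
        original_word: paths_matching(catalog, original_word),
        misrecognized_word: paths_matching(catalog, misrecognized_word),
    }
```

In a `bin/instance debug` session this could be called as `compare_words(app.Plone.portal_catalog, "proper", "prooer")`, assuming the site id is `Plone` as in the path above. Keep in mind that `searchResults` filters by the current user's permissions, so in a session without a logged-in user `unrestrictedSearchResults` may be the better call.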
