Ocr'd pdf does not land in catalog (Plone 5) #57
I have OCR'd a PDF that contains a fax. I expected to be able to find words from the fax in the Plone search, but that does not work: the words are not in the catalog. I checked /Plone/portal_catalog/plone_lexicon.
After further investigation I found the following.

If a PDF contains text information, that text is added to the index and you can find it with the Plone search. If you add an image, or a PDF without text information (an image saved as a PDF), the text is not added to the index.

A PDF that already has text information is OCR'd nevertheless. That isn't necessary, because the process is expensive and error-prone: some words are misrecognized.

If you find a misrecognized word in the text view of the document viewer, try to find that word in Plone's full-text search. It won't be there, but the original word will be. Example: imagine the word "proper" is recognized as "prooer". Then "proper" is in the index but "prooer" is not. In this case the bug is a feature: it shows that the OCR'd text is not written to the index.
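The check described above can be sketched as a small script. This is only an illustration, not code from the add-on: `words_in_catalog` and `FakeCatalog` are hypothetical names, and `FakeCatalog` is a stand-in so the snippet runs outside Plone. It assumes the catalog is callable with a `SearchableText` keyword, which is how `portal_catalog` is normally queried.

```python
def words_in_catalog(catalog, words):
    """Return {word: bool}: True if a full-text query for the word finds anything."""
    return {word: len(catalog(SearchableText=word)) > 0 for word in words}


class FakeCatalog:
    """Hypothetical stand-in for portal_catalog so the sketch runs outside Plone."""

    def __init__(self, indexed_texts):
        # Store each indexed document as a list of lowercase words.
        self._texts = [text.lower().split() for text in indexed_texts]

    def __call__(self, SearchableText):
        # Return every "document" that contains the exact search term.
        term = SearchableText.lower()
        return [words for words in self._texts if term in words]


# Simulates the reported behaviour: only the PDF's embedded text was indexed,
# while the OCR output (with the misrecognized "prooer") was not.
catalog = FakeCatalog(["a proper fax message"])
result = words_in_catalog(catalog, ["proper", "prooer"])
print(result)  # {'proper': True, 'prooer': False}
```

On a real site the same function could presumably be called with the actual catalog tool, for example one obtained via `plone.api.portal.get_tool('portal_catalog')`, in place of the stand-in.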