Any way to stop GetHOCRText or ? #313

CodeWarrior-Hawaii · 2017-02-13T22:33:54Z

I have a B/W image that is of pretty low quality. It has what was originally a colored background that now introduces a lot of noise into the image. Tesseract churns away on this image (in the GetHOCRText method) for about 25 minutes. The output text is not correct, but that is actually of little consequence.

My application is for all intents and purposes completely automated, and ends up processing very large numbers of files. For images of this low quality, we are fine with inaccurate detected text. Is there a way to check the Page instance before calling GetHOCRText() for some property that will let me know how valid it is and how complicated the HOCR will be? If it ends up telling me it is junk data, I will just forgo the HOCR altogether.

I tried looking at GetMeanConfidence(), but it took the same amount of time to return as generating the HOCR text, so it will scarcely do me any good. (It also came back as .85, which I take to mean 85% confidence that the text is accurate, so that is of even less use at this point.) Is there any good way of sending an interrupt after a certain amount of time? I suppose I could run the Tesseract processing in a different thread and kill it if it takes longer than XX.

Also, I meant to edit the title and forgot, so I apologize for the dangling "or".
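
For anyone weighing the "run it separately and kill it" idea from the post above, here is a minimal sketch of the pattern in Python. It is not the .NET wrapper's API. It assumes the OCR can be run as a child process (for example the `tesseract` command-line tool) instead of in-process, because a child process can be killed cleanly while a thread stuck in native code generally cannot. The helper name `run_with_timeout` and the stand-in commands are hypothetical and only for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and return its stdout as text.

    Returns None if the child does not finish within timeout_s seconds.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    try:
        done = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so nothing is left running at this point.
        return None
    return done.stdout


# Stand-in commands (assumptions, so the sketch runs without OCR installed).
# In practice cmd might be something like ["tesseract", "page.png", "stdout", "hocr"].
FAST_CMD = [sys.executable, "-c", "print('ok')"]
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]

if __name__ == "__main__":
    print(run_with_timeout(FAST_CMD, 10.0))
    print(run_with_timeout(SLOW_CMD, 0.5))
```

A `None` result can then be treated as "junk image, skip the HOCR" in a batch pipeline. The right timeout depends on the workload and would need to be chosen per application.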

jay-hill · 2017-02-13T23:42:37Z

I have a pending pull request (#292) which implements this feature. It adds an additional parameter to GetHOCRText which allows you to specify the timeout in milliseconds. I had exactly the same issue, with the OCR process taking a very long time to recognize low quality images, and needed a way to bail out early.

CodeWarrior-Hawaii · 2017-02-14T16:41:54Z

Awesome. That is really terrific, thanks!
