Any way to stop GetHOCRText or ? #313

Open
CodeWarrior-Hawaii opened this issue Feb 13, 2017 · 2 comments

Comments


CodeWarrior-Hawaii commented Feb 13, 2017

I have a B/W image of pretty low quality. What was originally a colored background now introduces a lot of noise into the image. Tesseract churns away on this image (in the GetHOCRText method) for about 25 minutes. The output text is not correct, but that is actually of little consequence.

My application is for all intents and purposes completely automated, and ends up processing very large numbers of files. For images of this low quality, we are fine with inaccurate detected text. Is there a way to check the Page instance before calling GetHOCRText() for some property that will let me know how valid it is and how complicated the HOCR will be? If it ends up telling me it is junk data, I will just forgo the HOCR altogether.

I tried GetMeanConfidence(), but it took just as long to run as generating the HOCR text, so it does not help. (It also came back as .85, which I take to mean 85% confidence that the text is accurate, so it is of even less use here.) Is there a good way to send an interrupt after a certain amount of time? I suppose I could run the Tesseract processing in a different thread and kill it if it takes longer than XX.

Also, I meant to edit the title and forgot, so I apologize for the dangling "or".

@jay-hill

I have a pending pull request (#292) which implements this feature. It adds a parameter to GetHOCRText that lets you specify a timeout in milliseconds. I had exactly the same issue: the OCR process took a very long time on low-quality images, and I needed a way to bail out early.

@CodeWarrior-Hawaii
Author

Awesome. That is really terrific, thanks!
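
For setups where GetHOCRText has no timeout parameter, the "run it in a different thread" idea from the original question can be sketched roughly as below. This is only a minimal sketch, not part of the library: `TimeLimited` and `TryRun` are hypothetical names, and the delegate passed in is assumed to wrap whatever OCR call you make. Note that it does not kill the work on timeout; it only stops waiting for it. Forcibly aborting a thread that is busy inside native code is generally unsafe, so truly terminating a runaway OCR call would likely mean running it in a separate process and killing that process instead.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeLimited
{
    // Runs `work` on a thread-pool thread and waits at most `timeout` for it.
    // Returns true and sets `result` if the work finished in time.
    // On timeout the work is NOT stopped: it keeps running in the background
    // and its result is discarded.
    // If the work throws before the timeout, Wait rethrows it wrapped in an
    // AggregateException.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(work);

        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        // Touch any exception thrown after we stopped waiting,
        // so it does not surface later as an unobserved task exception.
        task.ContinueWith(
            t => { var ignored = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);

        result = default(T);
        return false;
    }
}
```

A caller would pass a delegate that performs the GetHOCRText call and skip the HOCR for that file when `TryRun` returns false. Because the abandoned call is still running, it is probably safest not to dispose or reuse the same engine and page objects until it has finished.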
