I have a B/W image of pretty low quality. What was originally a colored background now introduces a lot of noise into the image. Tesseract churns away on this image (in the GetHOCRText method) for about 25 minutes. The output text is not correct, but that is actually of little consequence.
My application is for all intents and purposes completely automated, and ends up processing very large numbers of files. For images of this low quality, we are fine with inaccurate detected text. Is there a way to check the Page instance before calling GetHOCRText() for some property that will let me know how valid it is and how complicated the HOCR will be? If it ends up telling me it is junk data, I will just forgo the HOCR altogether.
I tried looking at GetMeanConfidence(), but it took the same amount of time to compute as the HOCR text, so it will scarcely do me any good. (It also came back as 0.85, which I take to mean 85% confidence that the text is accurate, so that is of even less use at this point.) Is there a good way of sending an interrupt after a certain amount of time? I suppose I could run the Tesseract processing in a different thread and kill it if it takes longer than XX.
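For what it's worth, here is a rough, language-agnostic sketch of that "run it elsewhere and kill it after XX" idea, written in Python and not against the .NET wrapper. It uses a child process instead of a thread, on the assumption that a thread stuck inside native OCR code may not be safely killable, while a process can always be terminated. The names `run_with_timeout` and `hocr_command` are made up for illustration, and the command line it builds assumes the `tesseract` command-line tool is installed and accepts `stdout` as the output base and `hocr` as the config name.

```python
import subprocess
from typing import List, Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Optional[str]:
    """Run cmd in a child process and return its stdout.

    Returns None if the child exceeds timeout_s seconds or exits with a
    non-zero status, so the caller can simply skip that file.
    """
    try:
        done = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout if done.returncode == 0 else None


def hocr_command(image_path: str) -> List[str]:
    """Build the assumed CLI invocation that writes hOCR to stdout (hypothetical helper)."""
    return ["tesseract", image_path, "stdout", "hocr"]
```

In a batch job this would be called as `run_with_timeout(hocr_command("page.png"), timeout_s=60)`, treating a `None` result as "junk image, skip the HOCR". The trade-off is process start-up cost per image in exchange for a hard upper bound on processing time.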
Also, I meant to edit the title and forgot, so I apologize for the dangling "or".
I have a pending pull request (#292) that implements this feature. It adds a parameter to GetHOCRText that lets you specify a timeout in milliseconds. I had exactly the same issue: the OCR process took a very long time on low-quality images, and I needed a way to bail out early.