Tildes (~) in BoxText output not in UTF8 output #6

Closed
robertmeta opened this issue Oct 7, 2014 · 2 comments

Comments

@robertmeta
Contributor

I am going to have to dig through the Tesseract source to find out why I am getting ~ in the BoxText output when it is not in the UTF8 text. I am mapping boxes to UTF8 text chars, and simply removing the ~'s allows the two to line up properly. Still trying to puzzle it out.
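For illustration, the workaround described above (drop the ~ boxes, then pair the remaining boxes with the characters of the UTF8 text) could be sketched as follows. This is a minimal sketch, not go.tesseract code: the names `Box`, `Pair`, `ParseBoxText`, `DropRejects` and `AlignBoxes` are hypothetical, and it assumes the usual box line layout `<symbol> <left> <bottom> <right> <top> <page>`, one symbol per box, and that whitespace in the UTF8 text has no box of its own. A genuine ~ in the source image would also be dropped by this filter.

```go
package main

import (
	"strconv"
	"strings"
	"unicode"
)

// Box is one line of box output: a symbol plus its bounding box.
// (Hypothetical type for this sketch, not part of go.tesseract.)
type Box struct {
	Symbol                   string
	Left, Bottom, Right, Top int
	Page                     int
}

// Pair links one rune of the UTF8 text to the box it came from.
type Pair struct {
	Rune rune
	Box  Box
}

// rejectSymbol is the placeholder that shows up in the box output
// but not in the UTF8 text.
const rejectSymbol = "~"

// ParseBoxText parses "<symbol> <left> <bottom> <right> <top> <page>"
// lines. Lines that do not have exactly six fields, or whose numeric
// fields do not parse, are skipped.
func ParseBoxText(boxText string) []Box {
	var boxes []Box
	for _, line := range strings.Split(boxText, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 6 {
			continue
		}
		nums := make([]int, 5)
		valid := true
		for i, f := range fields[1:] {
			n, err := strconv.Atoi(f)
			if err != nil {
				valid = false
				break
			}
			nums[i] = n
		}
		if !valid {
			continue
		}
		boxes = append(boxes, Box{
			Symbol: fields[0],
			Left:   nums[0], Bottom: nums[1], Right: nums[2], Top: nums[3],
			Page: nums[4],
		})
	}
	return boxes
}

// DropRejects returns the boxes whose symbol is not the ~ placeholder.
func DropRejects(boxes []Box) []Box {
	var kept []Box
	for _, b := range boxes {
		if b.Symbol != rejectSymbol {
			kept = append(kept, b)
		}
	}
	return kept
}

// AlignBoxes pairs each non-whitespace rune of text with the next box.
// It reports false as soon as a rune and a box symbol disagree, or if
// boxes are left over once the text is exhausted.
func AlignBoxes(boxes []Box, text string) ([]Pair, bool) {
	var pairs []Pair
	i := 0
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		if i >= len(boxes) || boxes[i].Symbol != string(r) {
			return pairs, false
		}
		pairs = append(pairs, Pair{Rune: r, Box: boxes[i]})
		i++
	}
	return pairs, i == len(boxes)
}
```

Calling `AlignBoxes(DropRejects(ParseBoxText(boxText)), utf8Text)` would give the box-to-character mapping, with the boolean result acting as a sanity check that the two outputs really do line up once the ~ entries are gone.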

@robertmeta
Contributor Author

These are the "tilde crunch words", the placeholder for stuff that isn't being detected. I think textord_noise_rejwords controls it.

@robertmeta
Contributor Author

"Working As Intended": see https://code.google.com/p/tesseract-ocr/source/browse/api/baseapi.cpp (line 86). That is a hard-coded Tesseract thing, so there is nothing to be done about it on the go.tesseract side.
