fallback python-based .jpg, .jpeg, .png, .gif extraction #52

Open
deanmalmgren opened this issue Aug 12, 2014 · 0 comments

Comments

@deanmalmgren
Owner

These parsers currently use the tesseract shell command to extract content from image files. This is a great starting point (!!!), but it might be nice to replicate the behavior of the .pdf parser and have a pure Python fallback method to make textract more usable across platforms. From a minute of googling around, it doesn't look like there are any real packages that do this, but scikit-learn does have some examples of character classification that might be useful.

This sounds like a pretty serious undertaking (and not terribly urgent given the relative portability of the tesseract-ocr package), but I thought I'd create this issue for posterity in case someone knows of a Python-based fallback that would be appropriate in these situations.
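
For anyone picking this up, here is a rough sketch of what the fallback dispatch might look like. It is only an illustration, not textract's actual parser code: the names `extract_image_text`, `pure_python_ocr`, and `NoOCRBackendError` are hypothetical, and the Python fallback is left as a stub since no suitable package has been identified yet. It assumes a tesseract build that accepts `stdout` as its output base.

```python
import shutil
import subprocess


class NoOCRBackendError(RuntimeError):
    """Raised when neither tesseract nor a Python fallback is usable."""


def pure_python_ocr(path):
    # Hypothetical placeholder: no pure Python OCR package has been
    # chosen in this issue, so the sketch does not implement one.
    raise NoOCRBackendError("no pure Python OCR fallback for %s" % path)


def extract_image_text(path, which=shutil.which, run=subprocess.run,
                       fallback=pure_python_ocr):
    """Prefer the tesseract shell command, else defer to a Python fallback.

    `which`, `run`, and `fallback` are injectable so the dispatch logic
    can be exercised without tesseract installed.
    """
    if which("tesseract") is not None:
        # Passing "stdout" as the output base makes tesseract print the
        # recognized text instead of writing a .txt file.
        result = run(["tesseract", path, "stdout"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     check=True)
        return result.stdout.decode("utf-8")
    return fallback(path)
```

With this shape, a future Python-based OCR routine would only need to be passed in as `fallback` (or replace the stub), and the tesseract path stays untouched on platforms where it is installed.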
