These parsers currently use the `tesseract` shell command to extract content from image files. This is a great starting point (!!!), but it might be nice to replicate the behavior of the `.pdf` parser and have a pure Python fallback method to make textract more usable across platforms. From a minute of googling around, it doesn't look like there are any real packages that do this, but scikit-learn does have some examples of character classification that might be useful.
This sounds like a pretty serious undertaking (and not terribly urgent given the relative portability of the `tesseract-ocr` package), but I thought I'd create this issue for posterity in case someone knows of a Python-based fallback that would be appropriate in these situations.
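For illustration, here is a rough sketch of what the fallback dispatch could look like. It is not textract's actual code: the function names (`ocr_with_tesseract`, `ocr_pure_python`, `select_backend`, `extract_text`) are hypothetical, and the pure Python backend is a placeholder that only raises, since no suitable package has been identified yet. The only external behavior assumed is that `tesseract <image> stdout` prints the recognized text to standard output.

```python
import shutil
import subprocess


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def ocr_pure_python(image_path):
    """Hypothetical placeholder for a pure Python OCR backend."""
    raise NotImplementedError("no pure Python OCR fallback is available yet")


def select_backend(which=shutil.which):
    """Pick the tesseract backend if the binary is on PATH, else the fallback."""
    if which("tesseract") is not None:
        return ocr_with_tesseract
    return ocr_pure_python


def extract_text(image_path, which=shutil.which):
    """Extract text from an image using whichever backend is available."""
    backend = select_backend(which)
    return backend(image_path)
```

The `which` parameter is only there so the dispatch can be exercised without tesseract installed; a real parser would probably just call `shutil.which` directly.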