fallback python-based .jpg, .jpeg, .png, .gif extraction #52

deanmalmgren · 2014-08-12T15:29:25Z

These parsers currently use the tesseract shell command to extract content from image files. This is a great starting point (!!!), but it might be nice to replicate the behavior of the .pdf parser and have a pure-Python fallback method to make textract more usable across platforms. From a minute of googling around, it doesn't look like there are any real packages that do this, but scikit-learn does have some examples of doing character classification that might be useful.

This sounds like a pretty serious undertaking (and not terribly urgent given the relative portability of the tesseract-ocr package) but I thought I'd create this issue for posterity in case someone knows of a python-based fallback that would be appropriate in these situations.
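
To make the proposal concrete, here is a minimal sketch of the fallback pattern described above: use the tesseract command when it is on the PATH, otherwise hand off to a Python implementation. The function names (`extract_with_tesseract`, `extract_with_python`, `choose_parser`, `extract_image_text`) are hypothetical and are not textract's actual API, and the pure-Python branch is a placeholder because no such OCR implementation has been identified yet.

```python
import shutil
import subprocess


def extract_with_tesseract(path):
    """Run the tesseract CLI on an image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def extract_with_python(path):
    """Placeholder for a pure-Python OCR fallback (hypothetical, not implemented)."""
    raise NotImplementedError("no pure-Python OCR fallback is available yet")


def choose_parser(which=shutil.which):
    """Return the tesseract parser if the binary is on PATH, else the fallback."""
    if which("tesseract") is not None:
        return extract_with_tesseract
    return extract_with_python


def extract_image_text(path):
    """Extract text from an image using whichever parser is available."""
    return choose_parser()(path)
```

Passing `which` as a parameter is only there so the selection logic can be exercised without tesseract installed; a real parser would presumably follow whatever convention the .pdf parser already uses.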

deanmalmgren added the enhancement label Aug 12, 2014

deanmalmgren mentioned this issue Oct 3, 2014

pdf parser: chain pdftotext/pdfminer + tesseract #77

Open
