Tesseract error in preprocessing #7

Closed
one2many opened this issue Mar 1, 2021 · 3 comments


one2many commented Mar 1, 2021

Attempting to OCR a table and I keep getting an error.
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 69, in preprocess_img
rotate = get_rotate(filepath, tess_params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 79, in get_rotate
subprocess.check_output(tess_command)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-']' returned non-zero exit status 1.

The image is the logo at the top of the page (every page).
[Image: ga-20190131-001]

Owner

eihli commented Mar 2, 2021

That last line of the error message is informative.

'tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-'

Try running that command and see if it gives you any more useful info.
tesseract --psm 0 --oem 0 /Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png -

Maybe tesseract takes a --verbose flag that will print additional useful information about the error?
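
If running it from a shell is awkward, the same command can be run from Python with stderr captured, since `check_output` only reports the exit status and drops whatever tesseract printed. This is a rough sketch, not part of this library: `run_and_capture` is a made-up helper name, and it sticks to `subprocess` arguments that exist on Python 3.6.

```python
import subprocess


def run_and_capture(command):
    """Run `command` (a list of strings) and return (exit_code, stdout, stderr).

    Unlike subprocess.check_output, a non-zero exit status does not raise,
    so the text written to stderr is still available for inspection.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str (Python 3.6 compatible)
    )
    return result.returncode, result.stdout, result.stderr


# Example call, using the command from the traceback:
# code, out, err = run_and_capture([
#     "tesseract", "--psm", "0", "--oem", "0",
#     "/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png",
#     "-",
# ])
# print(code)
# print(err)
```

Whatever ends up in `err` is likely to say more than "returned non-zero exit status 1".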

Looking at that image of the FirstBank logo... it doesn't look like something this library was written for. I don't think it will parse anything from that as-is. It will require a huge amount of customization for that.

Author

one2many commented Mar 2, 2021

I will try to add the verbose flag and see what happens. The logo is where it gets stuck, but I really don't need it. My end goal is to run this on 60 pdfs of bank statements that are 98% tables. The logo is on the top left of the first page of every document though. I ran this on one pdf and it got stuck on the logo, but if I could ignore it that would be fine.
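
One way to ignore an image like that might be to catch the failure and carry on with a default rotation. The sketch below is only an idea under assumptions: `rotation_or_default` is a hypothetical wrapper, not something the library provides, `detect_rotation` stands in for whatever function does the orientation check (the traceback shows `get_rotate(filepath, tess_params)`), and treating a failed image as unrotated is a guess that may not suit every page.

```python
import subprocess


def rotation_or_default(detect_rotation, filepath, default=0):
    """Return detect_rotation(filepath), or `default` if the command fails.

    `detect_rotation` is any callable that shells out (for example to
    tesseract) and raises subprocess.CalledProcessError on a non-zero exit.
    """
    try:
        return detect_rotation(filepath)
    except subprocess.CalledProcessError as exc:
        # Report the failure and continue instead of aborting the whole run.
        print("Orientation detection failed for {} (exit status {}); using {}".format(
            filepath, exc.returncode, default))
        return default
```

With the library's function this would look something like `rotation_or_default(lambda p: get_rotate(p, tess_params), filepath)`, which means wrapping or editing the preprocessing step rather than calling it unchanged.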

Attached is a sample of one of the tables.
[Image: 9007-GENERALACCOUNT-20200131-006]

Also, have you seen Tabula?

Owner

eihli commented Mar 8, 2021

This library will need a lot of customization to work with a table like that. The code looks for vertical and horizontal lines to detect a table. And then when it finds a cell, it expects the cell to contain a single line of text. Your example is incompatible with both of those expectations.
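
To make the first expectation concrete, here is a toy illustration of what "looking for vertical and horizontal lines" means. It is not the library's actual implementation, just a pure-Python stand-in: the image is assumed to be a grid of 0/1 values (1 = dark pixel), and `has_long_run` and `find_ruling_lines` are names made up for this example.

```python
def has_long_run(pixels, min_length):
    """Return True if `pixels` contains at least `min_length` consecutive 1s."""
    run = 0
    for pixel in pixels:
        run = run + 1 if pixel else 0
        if run >= min_length:
            return True
    return False


def find_ruling_lines(grid, min_length):
    """Return (rows, cols): indices of rows and columns holding a long dark run.

    `grid` is a list of equal-length rows of 0/1 values. Text produces short
    runs, while drawn table borders produce long ones.
    """
    rows = [y for y, row in enumerate(grid) if has_long_run(row, min_length)]
    cols = [x for x, col in enumerate(zip(*grid)) if has_long_run(col, min_length)]
    return rows, cols


# A 2x2 table drawn with borders: every second row and column is a solid line.
ruled = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
print(find_ruling_lines(ruled, 5))  # ([0, 2, 4], [0, 2, 4])
```

A table laid out with whitespace alone, with no drawn borders, yields no long runs at all, so an approach like this finds nothing to treat as a cell boundary.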

The best I can suggest if you want to use anything from this library is to use it as a reference while writing a lot of custom code.

eihli closed this as completed Apr 9, 2021