remove tesseract version info from extracted text #48

Closed
deanmalmgren opened this Issue Aug 12, 2014 · 10 comments

5 participants
@deanmalmgren
Owner

deanmalmgren commented Aug 12, 2014

It looks like the output for every image processed with textract starts with a message like Tesseract Open Source OCR Engine v3.02 with Leptonica. This message appears to be printed to standard output by tesseract and is not part of the text that is actually in the document:

vagrant@dev:/vagrant$ textract tests/gif/i_heart_gifs.gif
Tesseract Open Source OCR Engine v3.02 with Leptonica
TEXT


vagrant@dev:/vagrant$ tesseract tests/gif/i_heart_gifs.gif kk.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ cat kk.txt.txt 
TEXT

vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
vagrant@dev:/vagrant$ tesseract tests/jpg/i_heart_jpegs.jpg gg.txt > kk
vagrant@dev:/vagrant$ cat kk
Tesseract Open Source OCR Engine v3.02 with Leptonica

This should be a simple fix of just redirecting the output of the tesseract command in a smarter way, but I'm in the middle of something else and thought I'd create a placeholder issue in case someone else has the time to address it.
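
One way the redirect could look, as a minimal Python sketch. It assumes tesseract is invoked as `tesseract <image> <outputbase>` and writes the recognized text to `<outputbase>.txt`, which is what the session above shows. The name `extract_text` and its `tesseract_cmd` parameter are hypothetical and are not textract's actual code. Both output streams are captured in pipes so the banner cannot leak into the extracted text.

```python
import os
import subprocess
import tempfile


def extract_text(image_path, tesseract_cmd=("tesseract",)):
    """Return only the recognized text, without tesseract's banner.

    Hypothetical helper for illustration; not textract's implementation.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # tesseract appends ".txt" to the output base it is given.
        output_base = os.path.join(tmpdir, "ocr")
        # Capture both streams so the version banner never reaches our stdout.
        subprocess.run(
            list(tesseract_cmd) + [image_path, output_base],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
        )
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

Capturing the streams rather than discarding them keeps tesseract's diagnostics available if the call fails, since `check=True` raises `subprocess.CalledProcessError` with the captured output attached.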

@ShawnMilo

Contributor

ShawnMilo commented Aug 12, 2014

I do not see this behavior. Running the latest branch.

Output of tesseract --version:

tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

@christomitov

Contributor

christomitov commented Aug 12, 2014

@deanmalmgren Yeah, I noticed this when I submitted that PR. Using tesseract with stdout actually resolves this issue, but apparently there were problems with the tests?

@christomitov

Contributor

christomitov commented Aug 12, 2014

@ShawnMilo Perhaps it's because you're running 3.03; they may have removed that header from the output.

@deanmalmgren

Owner

deanmalmgren commented Aug 12, 2014

vagrant@dev:/vagrant$ tesseract --version
tesseract 3.02

I still have 3.02 on my system (I guess that is what gets installed by default when you run sudo apt-get install tesseract-ocr on Ubuntu 12.04), which is probably part of the problem.

@christomitov

Contributor

christomitov commented Aug 12, 2014

Yup, I had the same problem. I still think we should get the stdout method to work, since using tmp files is hacky and non-optimal IMO. I did notice something fishy with stdout, though: the checksum was the same for all 3 image files xD

@deanmalmgren

Owner

deanmalmgren commented Aug 13, 2014

I'm completely open to using the stdout method, but I think we should make sure textract is as insensitive to the particular version of tesseract that is installed as possible (to make this package as easy to use as possible). If you're inspired, it'd be great to have a solution that uses temporary files for tesseract 3.02 and stdout piping for tesseract 3.03. In the meantime, I thought I'd just patch this real quick.
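
A rough Python sketch of how that version switch might be made. The version string format is taken from the tesseract --version output quoted in this thread. The function names are hypothetical, and the (3, 3) threshold simply mirrors the 3.02 versus 3.03 split suggested above; it has not been checked against tesseract's release notes.

```python
import re
import subprocess


def parse_version(version_output):
    """Extract (major, minor) from text such as 'tesseract 3.02'."""
    match = re.search(r"tesseract\s+(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no tesseract version found in %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def installed_version(tesseract_cmd="tesseract"):
    """Ask the installed binary for its version (hypothetical helper)."""
    result = subprocess.run(
        [tesseract_cmd, "--version"],
        stdout=subprocess.PIPE,
        # Merge stderr in, in case a release prints the version there.
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_version(result.stdout)


def choose_method(version):
    """Assumed rule: temporary files before 3.03, stdout piping from 3.03 on."""
    return "stdout" if version >= (3, 3) else "tempfile"
```

With the outputs quoted in this thread, `choose_method(parse_version("tesseract 3.02"))` selects the temporary-file path and `choose_method(parse_version("tesseract 3.03"))` selects stdout piping.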

@davidsonsns

davidsonsns commented Sep 23, 2015

Only the command below worked for me:

tesseract infile.png outprefix 1>/dev/null 2>&1
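
For callers that launch tesseract from Python rather than a shell, a roughly equivalent sketch follows. The helper name `run_silently` is hypothetical; `subprocess.DEVNULL` plays the role of the /dev/null redirection for both streams.

```python
import subprocess


def run_silently(command):
    """Run a command with stdout and stderr discarded; return its exit code."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode
```

Calling `run_silently(["tesseract", "infile.png", "outprefix"])` mirrors the command above. Note that discarding stderr also hides genuine error messages, so the exit code is the only remaining failure signal.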

@deanmalmgren

Owner

deanmalmgren commented Sep 23, 2015

@davidsonsns is this still an issue?

@monika297

monika297 commented Feb 23, 2016

@deanmalmgren, this is the output of tesseract --version on my system:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

While processing images, I am getting several error messages that are filling up my error logs:
Tesseract Open Source OCR Engine v3.03 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixBlockconvAccum: pixd not made
Error in pixBlockconvGray: pixt not made
Error in pixGetWidth: pix not defined
Error in pixGetHeight: pix not defined

Could you please suggest what is wrong?
Looking forward to your help.

Thanks,
Monika

@deanmalmgren

Owner

deanmalmgren commented Feb 25, 2016

@monika297 I'm sorry to hear you're having trouble with tesseract. In this case, I might suggest putting this issue on StackOverflow to see if others can help. It sounds like an issue with tesseract :(

There may be a problem with the image, or it may be a memory allocation issue (what else is running on your server?), but I'm not enough of a tesseract expert to know for sure.

Best of luck,
Dean
