TessPdfRenderer not working with jpg files #31

Closed
PabloRodrizHp opened this issue Feb 9, 2021 · 7 comments

Comments

@PabloRodrizHp

Hi.

With the alexcohn/tess-two repository, TessPdfRenderer works with JPG files, but with the latest version of this library it doesn't. The generated PDF apparently contains the text but not the image. The workaround is to create a PNG from the JPG file, which in some situations adds up to 2 seconds of processing. This situation is quite similar to the old issue (from 2015) found in the original rmtheis repository.

Here is the code that works with 'com.rmtheis:tess-two:9.1.0' but not this library:

TessBaseAPI mTess = new TessBaseAPI();
mTess.setDebug(true);
mTess.init(DATA_PATH, lang);

String pdfOutput = Environment.getExternalStorageDirectory().toString() + "/Download/ocrOutput";
String jpegInput = Environment.getExternalStorageDirectory().toString() + "/Download/jpegInput.jpg";
TessPdfRenderer renderer = new TessPdfRenderer(mTess, pdfOutput);
    
mTess.beginDocument(renderer);

File file = new File(jpegInput);
Pix pix = ReadFile.readFile(file);

boolean addedPageOne = mTess.addPageToDocument(pix, file.getAbsolutePath(), renderer);
Log.e(TAG, "convertImageToSearchablePdf: addedPageOne: " + addedPageOne);

boolean endDocument = mTess.endDocument(renderer);
Log.e(TAG, "convertImageToSearchablePdf: endDocument: " + endDocument );

renderer.recycle();
pix.recycle();

Am I missing something in my code that makes it work with the other library but not with this one?
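
A possible refinement of the PNG workaround described above, sketched under assumptions: since the conversion is only needed for JPEG input, one could check the file's leading bytes and convert only when they match the JPEG signature (FF D8 FF). The class name JpegSniffer and the method isJpeg are hypothetical helpers, not part of tesseract4android or tess-two; the code uses only java.io.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of any OCR library): detects JPEG files
// by their leading bytes instead of trusting the file extension.
final class JpegSniffer {

    private JpegSniffer() {
    }

    /**
     * Returns true when the file starts with the JPEG start-of-image
     * marker (FF D8) followed by the first byte of the next marker (FF).
     */
    static boolean isJpeg(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            byte[] head = new byte[3];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until
            // the header buffer is full or the stream ends.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // file is shorter than three bytes
                }
                read += n;
            }
            return (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xD8
                    && (head[2] & 0xFF) == 0xFF;
        }
    }
}
```

A caller could then run the PNG conversion only when JpegSniffer.isJpeg(file) returns true and pass other formats to ReadFile.readFile(file) directly, so that non-JPEG inputs avoid the extra conversion time.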

@Robyer
Member

Robyer commented Feb 9, 2021

Hi, I have made no intentional changes to TessPdfRenderer, so I would expect it to work the same as in the tess-two library. Based on your link, it probably has something to do with missing libjpeg support. Maybe I missed a parameter that enables libjpeg support when compiling Leptonica.

If that is the case, I expect that the line Pix pix = ReadFile.readFile(file); can't read the file and pix is then null. Is that correct?

@PabloRodrizHp
Author

PabloRodrizHp commented Feb 25, 2021

Hi.

No, it is not; pix is not null.

And when the generated PDF file is opened, it shows the error "There was an error processing a page. There was a problem reading this document (18)".
When I dismiss this error and view the document, there is no image, only a white rectangle the same size as the original image, and I can even select text where the original image has text:

[Screenshot: the generated PDF showing a white rectangle with selectable text]
(There should be an ID there)
Trying to copy the selected text gives the same error as when opening the file.

Also, I found something interesting (though perhaps only because I don't know how this works or how the library is compiled).

When I check the code of the classes TessPdfRenderer, ReadFile, and Pix in the library cz.adaptech.android:tesseract4android:2.1.0, they load the library "png":

[Screenshots: library loading code in TessPdfRenderer, Pix, and ReadFile]

but in the library com.rmtheis:tess-two:9.1.0 it is "pngt":

[Screenshots: the corresponding code in the three classes]

Could this mean that it is a different library version, and could that cause the malfunction? I ask because com.rmtheis:tess-two:9.1.0 uses libjpeg 9b, while cz.adaptech.android:tesseract4android:2.1.0 uses libjpeg 9c.

Cheers.

@Robyer
Member

Robyer commented Feb 26, 2021

@PabloRodrizHp I see. In that case you can try:

Btw, the difference between the jpgt / jpeg names is not relevant; I just wanted to use more explicit names for the libraries.

@PabloRodrizHp
Author

Sorry, @Robyer. Maybe I failed to explain that I am not compiling the library; I am using the dependency.

So I guess my approach for now will be to try the 5.0.0 Tesseract branch, whose dependency is cz.adaptech.android:tesseract4android:2.2.0-dev, right?

@Robyer
Member

Robyer commented Mar 2, 2021

@PabloRodrizHp

Sorry, @Robyer. Maybe I failed to explain that I am not compiling the library; I am using the dependency.

I see. In that case you are not really using alexcohn's tess-two, because he doesn't provide a compiled library. The README is copied from the original rmtheis tess-two, so the dependency line that you copy into your build.gradle really points to the original tess-two library, which uses Tesseract 3.x. I just wanted to clarify this. I couldn't even make alexcohn's library work when I manually compiled his latest code.

Anyway, I've reproduced the problem and found the cause. It comes from a change in the Tesseract code itself, but it's easy to work around. I have just committed the fix to the master branch. You will need to compile this library yourself for now, as I don't have time to release a new version yet.

@PabloRodrizHp
Author

PabloRodrizHp commented Mar 3, 2021

Hi @Robyer.

I kind of understood that I was not really using alexcohn's tess-two when I saw it was Tesseract 3, but thanks for helping me confirm it.

Also, thank you for the fix. I will try to compile it myself then.

Btw, the tests were carried out on Android Oreo, API 26, with the master branch (Tesseract 4).

Robyer added a commit that referenced this issue Mar 4, 2021
Robyer added a commit that referenced this issue Mar 4, 2021
@Robyer
Member

Robyer commented Mar 4, 2021

@PabloRodrizHp I committed a better fix for this issue and also released a new library version. You can use version 4.1.1 now.
