Tesseract crash after glibc update on linux (when two languages selected) #1314

Closed
AndrewG10i opened this issue Jan 22, 2023 · 10 comments

AndrewG10i commented Jan 22, 2023

I am not sure from which side to approach this issue, so I am starting here in the hope that someone can point me to the best way to deal with it.

I ran into a weird issue recently: Tesseract started crashing when two languages are selected for recognition (in my case "chi_tra+eng"). When the languages are selected separately (one at a time), everything works fine.

I am using org.bytedeco.tesseract-platform and this issue happens on both versions: 5.2.0-1.5.8 and 5.0.1-1.5.7.

I am attaching the crash dump separately, see: glibc issue - hs_err_pid57917.log

Further investigation revealed that the crashes started after a routine CentOS 8 Stream package update with the dnf update command, and specifically after the glibc update.

From the glibc history below I was able to identify that 2.28-214.el8 is the last properly working version; starting with version 2.28-216.el8, Tesseract crashes:

# dnf list glibc --showduplicates
...
glibc.x86_64               2.28-214.el8               baseos    <-- last working version
glibc.x86_64               2.28-216.el8               baseos
glibc.x86_64               2.28-220.el8               baseos
glibc.x86_64               2.28-224.el8               baseos

So far the only solution I could find was to downgrade glibc back to 2.28-214.el8 and add it to the dnf exclusions, which is obviously a temporary workaround until this is resolved properly.

Additionally, I have found the glibc commit history and what seems to be its changelog (but I am not a C guy, so it didn't help me much):
https://git.centos.org/rpms/glibc/commits/c8s
https://rpmfind.net/linux/RPM/centos/8-stream/baseos/x86_64/Packages/glibc-2.28-216.el8.x86_64.html

I would highly appreciate any support with this issue!

@AndrewG10i AndrewG10i changed the title Tesseract crash when two languages selected after glibc update on linux Tesseract crash after glibc update on linux (when two languages selected) Jan 22, 2023
Member

saudet commented Jan 22, 2023

  • Improve malloc implementation (#1871383)

There's probably memory corruption happening that just happens to crash your code after this malloc tweak. It might be caused by memory getting deallocated prematurely. If you set "org.bytedeco.javacpp.nopointergc" to "true" and that stops the crashing, then that's what is happening. You'll need to figure out what is getting deallocated that shouldn't be.
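
For anyone trying this suggestion, here is a minimal sketch of setting the property from code instead of on the command line. The class and method names are made up for illustration, and it assumes JavaCPP reads the property once, when its classes are first initialized, so the call has to happen before any JavaCPP or Tesseract class is touched. Passing -Dorg.bytedeco.javacpp.nopointergc=true to the JVM should have the same effect.

```java
// Sketch only: NoPointerGcSetup and disablePointerGc are hypothetical names.
final class NoPointerGcSetup {
    // Property name quoted from the comment above.
    static final String NO_POINTER_GC = "org.bytedeco.javacpp.nopointergc";

    // Call this before the first use of any JavaCPP class (for example
    // TessBaseAPI), on the assumption that the property is read only once.
    static void disablePointerGc() {
        System.setProperty(NO_POINTER_GC, "true");
    }

    public static void main(String[] args) {
        disablePointerGc();
        System.out.println(NO_POINTER_GC + "=" + System.getProperty(NO_POINTER_GC));
        // ... create TessBaseAPI and run the OCR code from here ...
    }
}
```

If the crash disappears with the property set, that points at premature deallocation; if it does not, the cause is probably elsewhere.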

Author

AndrewG10i commented Jan 23, 2023

Thank you so much for your reply! I have tried -Dorg.bytedeco.javacpp.nopointergc=true but that didn't help.
So I started working on a reproducer, and while working on it I seem to have found an issue in the original Tesseract implementation.

The reproducer code looks as follows:

public static void main(String[] args) {
    TessBaseAPI tessBaseApi = null;
    ETEXT_DESC tessMonitor = null;
    Mat imageMat = null;
    BytePointer outText = null;

    try (InputStream is = TessApi.class.getClassLoader().getResourceAsStream("sample01s.png")) {

        tessBaseApi = new TessBaseAPI();
        int initResult = tessBaseApi.Init(System.getenv("TESSDATA_PREFIX"), "chi_tra+eng"); // Crashes here on glibc version 2.28-216.el8 and later
        if (initResult == 0) {
            LOG.info("TessAPI initialization SUCCESS with langs: " + tessBaseApi.GetInitLanguagesAsString().getString());
        } else {
            LOG.severe("TessAPI initialization FAILED, initCode=" + initResult);
        }

        tessBaseApi.SetPageSegMode(tesseract.PSM_AUTO);

        BufferedImage image = ImageIO.read(is);
        imageMat = Java2DFrameUtils.toMat(image);
        tessBaseApi.SetImage(imageMat.data().asBuffer(), imageMat.size().width(), imageMat.size().height(), imageMat.channels(), (int) imageMat.step1());

        tessMonitor = tesseract.TessMonitorCreate();
        tessBaseApi.Recognize(tessMonitor);

        outText = tessBaseApi.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());
    } catch (IOException ex) {
        LOG.log(Level.SEVERE, null, ex);
    } finally {
        close(imageMat);
        close(outText);
        close(tessMonitor);
        close(tessBaseApi);
    }
}

I noticed that for some language pairs (e.g. chi_tra+eng) with glibc-2.28-214.el8, it prints the following message to the console:

Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language ''

Whereas starting from glibc version 2.28-216.el8, it simply crashes.

Investigating further, it seems that this issue occurs when both of the following conditions are true:

  1. Two languages are defined, e.g.: chi_tra+eng
  2. The first language's .traineddata file contains the param tessedit_load_sublangs with a value other than eng

So I have checked all language traineddata files that contain tessedit_load_sublangs, and my observations are the following:

The traineddata files listed below, which contain the "tessedit_load_sublangs" param, give the following message when used in conjunction with another language:
   Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata
   Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
   Failed loading language ''
aze
aze_cyrl
chi_sim
chi_tra
ell
jpn
srp
uzb
uzb_cyrl
As a result, tessBaseApi.Init() crashes when the glibc version is 2.28-216.el8 or later

The traineddata files listed below, which also contain "tessedit_load_sublangs", work fine, most likely for the following reasons:
ben           - works fine, but it has the param set as: tessedit_load_sublangs	eng
mal           - works fine, but it has the param set as: tessedit_load_sublangs	eng
srp_latn      - works fine, seemingly because the param is commented out: # tessedit_load_sublangs srp
tel           - works fine, seemingly because the param is commented out: #tessedit_load_sublangs	eng
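
The check described above can be sketched roughly as follows. This assumes the config component of a .traineddata file has already been extracted to plain text (for example with Tesseract's combine_tessdata tool); the class and method names are made up for illustration, and the "not equal to eng" rule simply mirrors my observation, not anything confirmed in the Tesseract sources.

```java
import java.util.Optional;

// Sketch only: SublangsCheck is a hypothetical helper, not part of any library.
final class SublangsCheck {
    static final String KEY = "tessedit_load_sublangs";

    // Returns the value of tessedit_load_sublangs from the given config text,
    // skipping blank lines and lines commented out with '#'.
    static Optional<String> loadSublangs(String configText) {
        for (String raw : configText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Key and value are separated by spaces or a tab.
            String[] parts = line.split("\\s+", 2);
            if (parts[0].equals(KEY)) {
                return Optional.of(parts.length > 1 ? parts[1].trim() : "");
            }
        }
        return Optional.empty();
    }

    // Mirrors the observation above: an active value other than "eng"
    // is treated as a candidate for the failure.
    static boolean looksAffected(String configText) {
        return loadSublangs(configText).map(v -> !v.equals("eng")).orElse(false);
    }
}
```

With this, a config line like the one in ben or mal yields "eng" and is not flagged, while the commented-out lines in srp_latn and tel are ignored entirely.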

So it seems this is not related to the bytedeco wrapper, and I will post this issue in the Tesseract repo.
Thank you!

Author

AndrewG10i commented Jan 23, 2023

Moreover, I just tried org.bytedeco.tesseract-platform@4.1.1-1.5.6 on glibc 2.28-224.el8 and everything worked fine ("chi_tra+eng" loaded without the "Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata" message)!
Whereas org.bytedeco.tesseract-platform@5.3.0-1.5.9-SNAPSHOT still fails with the same error.
So it definitely seems to be a bug in the Tesseract source code.

@saudet saudet added bug and removed question labels Jan 23, 2023
Author

AndrewG10i commented Jan 24, 2023

I am afraid I need a bit more help here...

I have tried a compiled Tesseract v5.3 CLI version from here, and it works fine without printing the message Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata:

[root@dev1 tess]# tesseract sample01s.png sample01s.png --oem 1 -l chi_tra+eng
Estimating resolution as 327

But using my reproducer code with org.bytedeco.tesseract-platform@5.3.0-1.5.9-SNAPSHOT, I am getting a warning with the same language pair, tessBaseApi.Init(System.getenv("TESSDATA_PREFIX"), "chi_tra+eng");:

Error opening data file /tmp/tess/tessdata/.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language ''
Jan 24, 2023 12:16:01 PM tess.TessApi main
INFO: TessAPI initialization SUCCESS with langs: chi_tra+eng
Estimating resolution as 327
OCR output:
...

That warning message, Please make sure the TESSDATA_PREFIX environment..., comes from this code.

I am afraid the Tesseract project team will not accept my bug report, as it is not reproducible with the compiled CLI version.
Any suggestions on how to debug and track down this issue (that is, why the reproducer warns that it tried to load a language from the .traineddata file)?

@AndrewG10i AndrewG10i reopened this Jan 24, 2023
Member

saudet commented Jan 24, 2023

I am afraid the Tesseract project team will not accept my bug report, as it is not reproducible with the compiled CLI version.

The cli is also available:
http://bytedeco.org/javacpp-presets/tesseract/apidocs/org/bytedeco/tesseract/program/tesseract.html

Author

AndrewG10i commented Jan 25, 2023

Thank you once again! Yes, I was able to use that approach. It didn't work from the code (due to the linking error tesseract: error while loading shared libraries: libtesseract.so.5.3.0: cannot open shared object file: No such file or directory), but I noticed that the tesseract executable appeared in the cache directory, so I did the following:

  1. Set the environment variable LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/root/.javacpp/cache/tesseract-5.3.0-1.5.9-SNAPSHOT-linux-x86_64.jar/org/bytedeco/tesseract/linux-x86_64/
  2. Then manually launched tesseract via the CLI, and it gave basically the same result as I am getting from the code:
[root@dev1z tess]# /root/.javacpp/cache/tesseract-5.3.0-1.5.9-SNAPSHOT-linux-x86_64.jar/org/bytedeco/tesseract/linux-x86_64/tesseract sample01s.png sample01s.png.txt --oem 1 -l chi_tra+eng
Error opening data file /local/tessData/tessdata_best-4.1.0/.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language ''
Estimating resolution as 327
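
The two manual steps above could probably also be done from Java. The following is only a sketch: CachedTesseractCli is a made-up helper name, the executable path is passed in as a plain string (however it was obtained), and it assumes the shared libraries sit in the same directory as the executable, as they do in the cache directory shown above. Note that it overwrites any LD_LIBRARY_PATH inherited by the child process.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: CachedTesseractCli is a hypothetical helper.
final class CachedTesseractCli {
    // Builds (but does not start) a process for the given tesseract executable
    // and points LD_LIBRARY_PATH at the executable's own directory, so that a
    // libtesseract.so located next to it can be found by the dynamic loader.
    static ProcessBuilder build(String executable, String... cliArgs) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        command.addAll(Arrays.asList(cliArgs));

        ProcessBuilder pb = new ProcessBuilder(command);
        String libDir = new File(executable).getAbsoluteFile().getParent();
        if (libDir != null) {
            // Replaces any inherited value for the child process only.
            pb.environment().put("LD_LIBRARY_PATH", libDir);
        }
        // Forward the child's stdout/stderr so Tesseract's messages stay visible.
        pb.inheritIO();
        return pb;
    }
}
```

Usage would be along the lines of CachedTesseractCli.build(path, "sample01s.png", "sample01s.png.txt", "--oem", "1", "-l", "chi_tra+eng").start().waitFor(), where path is the location of the cached executable.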

So next I decided to investigate what exactly I had installed with this:

I have tried a compiled Tesseract v5.3 CLI version from here

And found the following:

[root@dev1z tess]# find / -name libtesseract*
/root/.javacpp/cache/tesseract-5.3.0-1.5.9-SNAPSHOT-linux-x86_64.jar/org/bytedeco/tesseract/linux-x86_64/libtesseract.so.5.3.0
/root/.javacpp/cache/tesseract-5.3.0-1.5.9-SNAPSHOT-linux-x86_64.jar/org/bytedeco/tesseract/linux-x86_64/libtesseract.so
/usr/lib64/libtesseract.so.5.0.3
/usr/lib64/libtesseract.so.5

So it looks like the tesseract from that repo uses the outdated lib v5.0.3 instead of v5.3 (even tesseract --version reports different versions of the dependencies, e.g. leptonica-1.76.0 vs. leptonica-1.83.0).
So it seems fine to proceed with filing the issue with the Tesseract team.
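
For completeness, the same kind of library search can be sketched in Java. FindLibTesseract is a made-up name, and the sketch assumes it is pointed at specific directories (such as /usr/lib64 or the JavaCPP cache) and not at /, where unreadable directories would make the walk fail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch only: FindLibTesseract is a hypothetical helper.
final class FindLibTesseract {
    // Recursively lists files under root whose name starts with "libtesseract",
    // roughly what `find <root> -name 'libtesseract*'` reports for files.
    static List<Path> find(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile) // follows symlinks such as libtesseract.so
                    .filter(p -> p.getFileName().toString().startsWith("libtesseract"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Seeing more than one version in the result is a hint that the CLI and the Java bindings may not be loading the same library.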

@AndrewG10i
Author

Hello @saudet! May I please re-open this issue for a quick clarification: I tried to test the long-awaited javacv-platform:1.5.10 release, which contains the Tesseract version with the fix for this issue, and ran into another issue: /lib64/libm.so.6: version `GLIBC_2.29' not found. I am running on Rocky Linux 8 (EoL 2029), which has glibc 2.28.x.

I have already confirmed that everything works well on Rocky Linux 9, but at the moment I am not able to upgrade to that version due to other dependencies.

So may I please ask for your advice: is there any chance of returning to javacv-platform builds that use glibc 2.28, or is the only solution for me now to compile and build javacv-platform from source on my own?
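
A quick way to see this mismatch up front, before the native library fails to load, might look like the sketch below. It assumes a Linux system where getconf GNU_LIBC_VERSION prints something like "glibc 2.28"; the required version 2.29 is taken from the error message above, and GlibcVersionCheck is a made-up name.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch only: GlibcVersionCheck is a hypothetical helper.
final class GlibcVersionCheck {
    // Parses "glibc 2.28" or plain "2.28" into {major, minor}.
    static int[] parse(String text) {
        String version = text.trim();
        int space = version.lastIndexOf(' ');
        if (space >= 0) {
            version = version.substring(space + 1);
        }
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return new int[] { major, minor };
    }

    // True if the installed version is the required version or newer.
    static boolean atLeast(String installed, String required) {
        int[] a = parse(installed);
        int[] b = parse(required);
        return a[0] != b[0] ? a[0] > b[0] : a[1] >= b[1];
    }

    public static void main(String[] args) throws Exception {
        // Assumption: getconf is available, as on typical glibc-based Linux systems.
        Process p = new ProcessBuilder("getconf", "GNU_LIBC_VERSION")
                .redirectErrorStream(true)
                .start();
        String installed;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            installed = r.readLine();
        }
        p.waitFor();
        if (installed == null) {
            System.out.println("Could not determine the glibc version");
            return;
        }
        System.out.println(installed + " satisfies GLIBC_2.29: " + atLeast(installed, "2.29"));
    }
}
```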

Many thanks!

@AndrewG10i AndrewG10i reopened this Jan 29, 2024
Member

saudet commented Jan 29, 2024

Duplicate of #1379

@saudet saudet marked this as a duplicate of #1379 Jan 29, 2024
@saudet saudet closed this as completed Jan 29, 2024
Author

AndrewG10i commented Jan 29, 2024

Okay, understood, the answer is "compile on your own". ;) Just curious why the "enhancement" label was added...? Are there any enhancements planned for compiling, or what? :)

Member

saudet commented Jan 29, 2024 via email
