Settle on Tesseract model type #545

Closed
apyrgio opened this issue Sep 11, 2023 · 1 comment · Fixed by #548

apyrgio commented Sep 11, 2023

Background

There are three different Tesseract model types that we can choose from:

  • tessdata_fast: Fast integer versions of trained LSTM models. Best "value for money" in speed vs accuracy, Integer models.
  • tessdata_best: Best (most accurate) trained LSTM models. Best results on Google's eval data, slower, Float models.
  • tessdata: Trained models with fast variant of the "best" LSTM models + legacy models. The LSTM models have been updated with Integer version of tessdata_best LSTM models.

Their differences are outlined in the following sources:

  • https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files.md
    • Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.
    • tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.
    • The third set in tessdata is the only one that supports the legacy recognizer. [...]
  • https://towardsdatascience.com/googles-tesseract-ocr-how-good-is-it-on-documents-d71d4bf7640?gi=82d346e1e9e8

    tessdata and tessdata_best appears to exhibit comparable performance in terms of recognition accuracy. tessdata_fast, on the other hand, is marginally better than the former two models. And as expected, this model is also the fastest.

  • fast vs. best tesseract-ocr/tesseract#1404
    • Best is what is says it is. For languages where we have eval data, it is the network configuration that yielded best results on the eval data.
    • Fast is a speed/accuracy compromise, based on my own judgement, as to what offered the best "value for money" in speed vs accuracy. For some languages, this is still best, but for most not.
      [...] If you want best to run faster, it is easy to integerize "best" at the cost of a small loss in accuracy.

Size Comparison

Model type      Compressed (MiB)   Uncompressed (MiB)
tessdata_fast   336                668
tessdata_best   638                1357
tessdata        638                1357

Using tessdata_fast shaves ~300MiB off the container image and ~650MiB off its disk usage.
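As a quick sanity check, the savings can be recomputed from the table rows. This is only a sketch: the variable names are made up for illustration, and the numbers are copied from the table above. The exact differences come out slightly different from the rounded figures quoted in the text.

```shell
# Sizes in MiB, copied from the "Size Comparison" table.
fast_compressed=336
fast_uncompressed=668
full_compressed=638      # tessdata (same figure listed for tessdata_best)
full_uncompressed=1357

# POSIX arithmetic expansion; no external tools needed.
image_savings=$((full_compressed - fast_compressed))
disk_savings=$((full_uncompressed - fast_uncompressed))

echo "Container image savings: ${image_savings} MiB"
echo "Disk savings: ${disk_savings} MiB"
```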

Distro support

  1. Alpine Linux offers the tessdata model type. See https://git.alpinelinux.org/aports/tree/community/tesseract-ocr/APKBUILD
  2. Debian offers the tessdata-fast model type. See https://tracker.debian.org/pkg/tesseract-lang and https://github.com/AlexanderP/tesseract-lang-debian/blob/master/debian/upstream/metadata
  3. Fedora offers the tessdata-fast model type. See https://src.fedoraproject.org/rpms/tesseract-tessdata/blob/rawhide/f/tesseract-tessdata.spec

Dangerzone originally installed Tesseract language models from Alpine Linux, but due to some issues (#417), we resorted to downloading the tessdata language models directly from GitHub (#422). See:

&& wget https://github.com/tesseract-ocr/tessdata/archive/$TESSDATA_VERSION/tessdata-$TESSDATA_VERSION.tar.gz \

Switching to a different model type is as simple as switching the repo we download tarballs from.
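To illustrate, here is a minimal sketch of how the download URL could be parameterized by model type. The `tessdata_url` helper and the version value are hypothetical and not part of the Dangerzone Dockerfile. The sketch assumes the other repos follow the same archive URL layout as the `wget` line above, and that the chosen version exists as a tag or commit in the target repo.

```shell
# Hypothetical helper: build the archive URL for a Tesseract model repo.
#   $1: repo name (tessdata, tessdata_fast or tessdata_best)
#   $2: tag or commit in that repo (assumed to exist there)
tessdata_url() {
    repo="$1"
    version="$2"
    printf 'https://github.com/tesseract-ocr/%s/archive/%s/%s-%s.tar.gz\n' \
        "$repo" "$version" "$repo" "$version"
}

# Placeholder version, for illustration only.
TESSDATA_VERSION="4.1.0"

# Only the first argument changes when moving from tessdata to tessdata_fast.
tessdata_url tessdata "$TESSDATA_VERSION"
tessdata_url tessdata_fast "$TESSDATA_VERSION"
```

The printed URL could then be passed to `wget` in place of the hard-coded one.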

Problem

In Qubes we plan to install the language packs via their RPM counterparts (#431 (comment)). This means that a regular Dangerzone installation will use the tessdata model type, whereas Dangerzone on Qubes will use the tessdata-fast model type.

Suggestion

We have an opportunity to bring these platforms in sync by using tessdata-fast in the Dangerzone container.

Arguments for switching to tessdata-fast:

  1. Available in major Linux distros (although Alpine/Arch Linux deviate from this)
  2. Slightly faster
  3. Smaller size (-300MiB from package, -650MiB from disk)
  4. Almost as accurate as the best models.

Arguments for staying with tessdata:

  1. Backwards compatibility with previous versions of Dangerzone.

Do people see any reason not to use the tessdata-fast models on the Dangerzone container?

Related Issues

apyrgio added this to the 0.5.0 milestone Sep 11, 2023
@eloquence

I think the quoted line is pretty persuasive:

Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.

It seems worth considering offering the "slightly more accurate" models once we switch to an architecture with flexible downloads, but I agree for now this seems like an easy win to shave off download size and get cross-platform consistency.

apyrgio added a commit that referenced this issue Sep 18, 2023
Switch to the tessdata-fast Tesseract model, instead of the tessdata
one. The tessdata-fast Tesseract model is much smaller, and a bit faster
than the other one. Also, it's the model that Debian/Fedora ship by
default.

Closes #545