Support more OCR languages #422

Merged
merged 4 commits into from
May 24, 2023

apyrgio
Contributor

@apyrgio apyrgio commented May 22, 2023

Ditch the Alpine Linux packages for Tesseract OCR languages, in favor of directly downloading them from the source.

Fixes #417

@apyrgio apyrgio force-pushed the 417-eng-ocr-alt branch 2 times, most recently from bf43cb7 to 795b264 on May 22, 2023 18:19
@deeplow
Contributor

deeplow commented May 23, 2023

I tried downloading the full set of trained data: it is 638 MB as a tar.gz and 1.4 GB uncompressed. This brings the final container.tar.gz to 838 MB and the complete RPM to 831 MB.

Without any training data, the tar.gz container is 380 MB and the built RPM is 365 MB.

Still, that's not too bad compared with the 600 MB download we had. In the future we can probably look into downloading the models dynamically, although with Docker images that will introduce issues of its own.

@deeplow
Contributor

deeplow commented May 23, 2023

This one also fixes #357

@apyrgio
Contributor Author

apyrgio commented May 23, 2023

> I tried downloading the full set of trained data: it is 638 MB as a tar.gz and 1.4 GB uncompressed. This brings the final container.tar.gz to 838 MB and the complete RPM to 831 MB.
>
> Without any training data, the tar.gz container is 380 MB and the built RPM is 365 MB.
>
> Still, that's not too bad compared with the 600 MB download we had. In the future we can probably look into downloading the models dynamically, although with Docker images that will introduce issues of its own.

Yes, I've noticed the inflation in the size as well. That's the price we pay for extra language coverage. As for downloading extra languages dynamically, that would be very interesting. The Dangerzone client could:

  • Check if the user wants to perform OCR with a specific language.
  • Download the trained model for this language, if not already downloaded.
  • Validate the trained model against a known checksum.
  • Mount the trained model in the Dangerzone container, under /usr/share/tessdata.

This would break air-gapped systems, so I'm not proposing it for now. If at some point we distinguish between air-gapped and networked installations, though, we could achieve pretty slim images.
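
The download-on-demand steps listed in this comment could be sketched roughly as follows. This is only an illustration in Python, not code from this PR: `ensure_model`, `mount_args`, the `fetch` callable, and the `known_sha256` mapping are all hypothetical names, and the `-v host:container:ro` form assumes a Docker- or Podman-style bind mount.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List

# Mount target inside the container, as named in the comment above.
TESSDATA_DIR = "/usr/share/tessdata"


def ensure_model(
    lang: str,
    cache_dir: Path,
    known_sha256: Dict[str, str],
    fetch: Callable[[str], bytes],
) -> Path:
    """Return a verified local path to <lang>.traineddata (hypothetical helper)."""
    model = cache_dir / f"{lang}.traineddata"
    if not model.exists():
        # Download only when the model is not already cached.
        cache_dir.mkdir(parents=True, exist_ok=True)
        model.write_bytes(fetch(lang))
    # Validate against the pinned checksum, whether cached or freshly fetched.
    digest = hashlib.sha256(model.read_bytes()).hexdigest()
    if digest != known_sha256.get(lang):
        model.unlink()  # do not keep a file that failed validation
        raise ValueError(f"checksum mismatch for {lang}")
    return model


def mount_args(model: Path) -> List[str]:
    """Build a read-only bind-mount argument for the container runtime."""
    return ["-v", f"{model}:{TESSDATA_DIR}/{model.name}:ro"]
```

Passing `fetch` in as a parameter keeps the network access out of the helper, so the caching and validation logic can be exercised offline, which matters given the air-gapped concern raised here.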

Grab Tesseract's trained models from GitHub, instead of from the Alpine
Linux repos. Over the past few months, the models in the Alpine Linux
repos did not remain stable, leading to CI issues.

Since the models are already pre-trained and available through
Tesseract's repo on GitHub, we can use the release tarball that they
offer to install them in the container image, which is basically what
the upstream packages are doing as well.

In order to make sure that we have no regressions, at the time of this
commit we ensured that the hashes of the models offered through the
Alpine Linux repos and the models offered from the GitHub release are
the same. Also, in order to detect future regressions or foul play, we
check the downloaded models against a known checksum. Given that these
models change every few years, updating the checksum should not be an
issue.
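
The two checks described in this commit message could look something like the sketch below. It is an assumption-laden illustration, not the code in this PR: the function names are hypothetical, SHA-256 is assumed as the hash, and the models are assumed to be `*.traineddata` files sitting in two local directories (one unpacked from the Alpine Linux packages, one from the GitHub release).

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def sha256_file(path: Path) -> str:
    """Hash a file in chunks, so large models are not read into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def model_hashes(tessdata_dir: Path) -> Dict[str, str]:
    """Map each *.traineddata file name in a directory to its SHA-256 digest."""
    return {p.name: sha256_file(p) for p in sorted(tessdata_dir.glob("*.traineddata"))}


def differing_models(dir_a: Path, dir_b: Path) -> Dict[str, Tuple[str, str]]:
    """Return the models present in both directories whose digests differ."""
    a, b = model_hashes(dir_a), model_hashes(dir_b)
    return {n: (a[n], b[n]) for n in a.keys() & b.keys() if a[n] != b[n]}


def verify_download(path: Path, expected_sha256: str) -> None:
    """Fail loudly if a downloaded file does not match the pinned checksum."""
    actual = sha256_file(path)
    if actual != expected_sha256:
        raise RuntimeError(f"{path.name}: expected {expected_sha256}, got {actual}")
```

An empty result from `differing_models` corresponds to the "no regressions" check; `verify_download` corresponds to the ongoing checksum validation at image build time.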

Fix #357

Restore the OCR languages to the state they were in
66d3c40, with some minor changes. We
can now do so because we download all the trained models, not just the
ones that Alpine Linux offers.

Remove the Kurdish (Arabic) language ("kur_ara") from the list of
languages that we offer for OCR, since it's not included in the
installed languages.

Interestingly, it is not present in the Alpine Linux repos either, so
this was probably an omission in the first place.

Test that the languages that we provide to users for OCR match the
languages that are installed in the container image.

Fixes #417
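
A consistency check of the kind the last commit describes might be sketched as below. This is not the test added in tests/test_ocr.py. It assumes that share/ocr-languages.json is a flat object mapping display names to Tesseract language codes, that the installed models are `*.traineddata` files in a single directory, and that `osd` (Tesseract's orientation and script detection data) should be ignored; all function names are hypothetical.

```python
import json
from pathlib import Path
from typing import FrozenSet, Set, Tuple


def offered_languages(json_path: Path) -> Set[str]:
    """Language codes offered to users (assumes a name -> code JSON object)."""
    with json_path.open(encoding="utf-8") as f:
        return set(json.load(f).values())


def installed_languages(
    tessdata_dir: Path, ignored: FrozenSet[str] = frozenset({"osd"})
) -> Set[str]:
    """Language codes derived from installed model file names, minus ignored ones."""
    return {p.stem for p in tessdata_dir.glob("*.traineddata")} - ignored


def language_mismatch(json_path: Path, tessdata_dir: Path) -> Tuple[Set[str], Set[str]]:
    """Return (offered but not installed, installed but not offered)."""
    offered = offered_languages(json_path)
    installed = installed_languages(tessdata_dir)
    return offered - installed, installed - offered
```

With a check like this, an entry such as "kur_ara" that is offered but has no installed model would show up in the first set, and a test would assert that both sets are empty.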
@apyrgio apyrgio merged commit 641aa13 into main May 24, 2023
24 of 26 checks passed
@apyrgio apyrgio deleted the 417-eng-ocr-alt branch May 24, 2023 19:14
@deeplow deeplow mentioned this pull request May 29, 2023
7 tasks

Successfully merging this pull request may close these issues.

Adapt to multiple language changes in tesseract OCR alpine packaging