Support more OCR languages #422

apyrgio · 2023-05-22T14:16:08Z

Ditch the Alpine Linux packages for Tesseract OCR languages, in favor of directly downloading them from the source.

Fixes #417

deeplow · 2023-05-23T08:17:32Z

I tried downloading all of the training data: it's 638MB as a tar.gz and 1.4GB uncompressed. This makes the final container.tar.gz 838MB and the complete RPM 831MB.

Without any training data, the container tar.gz is 380MB and the built RPM is 365MB.

That's not too bad compared to the 600MB download we had. In the future we can probably look into downloading the models dynamically, although with Docker images that will introduce issues of its own.

deeplow · 2023-05-23T08:17:46Z

This one also fixes #357

apyrgio · 2023-05-23T12:43:24Z

> I've tried downloading the whole data and it's 638MB in tar.gz and 1.4GB uncompressed. This makes the final container.tar.gz be 838mb and the complete rpm 831mb.
>
> Without any training data the tar.gz container is 380mb and the built rpm is 365MB.
>
> But it's not too bad from the 600mb download we had. But in the future we can probably look into downloading the models dynamically. Although with docker images that'll introduce issues of its own.

Yes, I've noticed the inflation in the size as well. That's the price we pay for extra language coverage. As for downloading extra languages dynamically, that would be very interesting. The Dangerzone client could:

1. Check if the user wants to perform OCR with a specific language.
2. Download the trained model for this language, if not already downloaded.
3. Validate the trained model against a known checksum.
4. Mount the trained model in the Dangerzone container, under /usr/share/tessdata.

This will break air-gapped systems, so I'm not proposing it for now. If at some point we distinguish between air-gapped and online systems, though, we could achieve pretty slim images.
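The steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Dangerzone code: the `tessdata_fast` download URL, the function names (`ensure_model`, `mount_args`), and the cache directory layout are all hypothetical, and the thread does not say which model repository or checksum algorithm would be used.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, List

# Assumption: models come from the tessdata_fast repository. The thread does
# not settle on a model type, so treat this URL as a placeholder.
MODEL_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/{lang}.traineddata"


def fetch(url: str) -> bytes:
    """Download a URL and return the response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def ensure_model(lang: str, sha256: str, cache_dir: Path,
                 download: Callable[[str], bytes] = fetch) -> Path:
    """Return a checksum-verified local model, downloading it only if missing."""
    path = cache_dir / f"{lang}.traineddata"
    cached = path.exists()
    data = path.read_bytes() if cached else download(MODEL_URL.format(lang=lang))
    # Validate both fresh downloads and previously cached files.
    if hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"checksum mismatch for language {lang!r}")
    if not cached:
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path


def mount_args(cache_dir: Path) -> List[str]:
    """Container runtime arguments for a read-only bind mount of the models."""
    return ["-v", f"{cache_dir.resolve()}:/usr/share/tessdata:ro"]
```

The downloader is passed in as a parameter so the logic can be exercised without network access. Note that bind-mounting a directory over /usr/share/tessdata hides any models already baked into the image, which only makes sense for a slim image that ships none.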

Grab Tesseract's trained models from GitHub, instead of from the Alpine Linux repos.

Over the past few months, the models in the Alpine Linux repos did not remain stable, leading to CI issues. Since the models are already pre-trained and available through Tesseract's repo on GitHub, we can use the release tarball that they offer to install them in the container image, which is basically what the upstream packages are doing as well.

In order to make sure that we have no regressions, at the time of this commit we ensured that the hashes of the models offered through the Alpine Linux repos and the models offered from the GitHub release are the same. Also, in order to detect future regressions or foul play, we check the downloaded models against a known checksum. Given that these models change every few years, updating the checksum should not be an issue.

Fix #357
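As an illustration of the checksum step described in this commit message, a minimal sketch might look like the following. It assumes SHA-256 and a locally downloaded release tarball. The helper names (`sha256_of`, `install_models`) are hypothetical, and the actual image build may well do this with shell tools instead.

```python
import hashlib
import tarfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so a large tarball is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_models(tarball: Path, expected_sha256: str, dest: Path) -> List[str]:
    """Verify the tarball first, then copy its *.traineddata files into dest."""
    if sha256_of(tarball) != expected_sha256:
        raise ValueError("tarball checksum mismatch, refusing to extract")
    dest.mkdir(parents=True, exist_ok=True)
    installed = []
    with tarfile.open(tarball) as tar:
        for member in tar:
            # Keep only the base name: this flattens subdirectories and
            # prevents archive entries from escaping dest.
            name = Path(member.name).name
            if member.isfile() and name.endswith(".traineddata"):
                (dest / name).write_bytes(tar.extractfile(member).read())
                installed.append(name)
    return sorted(installed)
```

Checking the hash before opening the archive means a tampered or changed release fails the build loudly instead of silently installing different models.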

Restore the OCR languages to the state they were in 66d3c40, with some minor changes. We can now do so because we download all the trained models, not just the ones that Alpine Linux offers.

Remove the Kurdish (Arabic) language ("kur_ara") from the list of languages that we offer for OCR, since it's not included in the installed languages. Interestingly, it is not present in the Alpine Linux repos either, so this was probably an omission in the first place.

Test that the languages that we provide to users for OCR match the languages that are installed in the container image. Fixes #417
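A check of this kind could be sketched as below. This is not the actual contents of tests/test_ocr.py. It assumes that share/ocr-languages.json maps display names to Tesseract language codes, that the installed languages are read from the output of `tesseract --list-langs`, and that the `osd` (orientation and script detection) entry should be ignored. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Set, Tuple


def parse_list_langs(output: str) -> Set[str]:
    """Parse `tesseract --list-langs` output, skipping its header line."""
    lines = [line.strip() for line in output.splitlines()[1:]]
    # Assumption: "osd" is detection data, not a language offered to users.
    return {line for line in lines if line and line != "osd"}


def offered_langs(json_path: Path) -> Set[str]:
    """Language codes offered to users.

    Assumption: the JSON file maps display names to Tesseract codes.
    """
    return set(json.loads(json_path.read_text()).values())


def language_mismatch(offered: Set[str], installed: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (offered but not installed, installed but not offered)."""
    return offered - installed, installed - offered
```

A test would then assert that both sets returned by `language_mismatch` are empty. With the inputs `{"eng", "kur_ara"}` and `{"eng", "fra"}`, the first set contains `kur_ara`, which is the kind of discrepancy the "kur_ara" removal above addresses.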

apyrgio force-pushed the 417-eng-ocr-alt branch 2 times, most recently from bf43cb7 to 795b264 Compare May 22, 2023 18:19

deeplow requested changes May 23, 2023

apyrgio force-pushed the 417-eng-ocr-alt branch from 795b264 to 7a4e244 Compare May 23, 2023 13:28

apyrgio mentioned this pull request May 24, 2023

Fixes some OCR languages that were changed in upstream apline packaging #418

Closed

deeplow approved these changes May 24, 2023

apyrgio added 3 commits May 24, 2023 13:43

Restore the OCR languages

35e439f

Restore the OCR languages to the state they were in 66d3c40, with some minor changes. We can now do so because we download all the trained models, not just the ones that Alpine Linux offers.

ci: Add test for OCR languages

641aa13

Test that the languages that we provide to users for OCR match the languages that are installed in the container image. Fixes #417

apyrgio force-pushed the 417-eng-ocr-alt branch from 7a4e244 to 641aa13 Compare May 24, 2023 10:44

apyrgio merged commit 641aa13 into main May 24, 2023
24 of 26 checks passed

apyrgio deleted the 417-eng-ocr-alt branch May 24, 2023 19:14

deeplow mentioned this pull request May 29, 2023

Qubes: Beta integration #412

Closed

7 tasks

deeplow mentioned this pull request Jun 7, 2023

Qubes: Install all other OCR languages #438

Closed

apyrgio mentioned this pull request Sep 11, 2023

Settle on Tesseract model type #545

Closed
