Dynamic discovery of OCR languages #448

apyrgio · 2023-06-14T16:13:29Z

One of the main contributors to the size of our container image is the set of OCR models for all the languages that we bundle with our application:

# du -hd 1 / | sort -h
...
1.8G    /usr
1.8G    /

# du -hd 1 /usr | sort -h
...
717.7M  /usr/lib
1.1G    /usr/share
1.8G    /usr

# du -hd 1 /usr/share/ | sort -h
...
1015.6M /usr/share/tessdata   <--- OCR models
1.1G    /usr/share/

We don't expect any single user to need all of the Tesseract models at once, so we should find a better approach here. One interesting idea is to dynamically fetch only the models needed for a conversion, store them on the host (e.g., under Dangerzone's data directory), and mount them into the container at /usr/share/tessdata.
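To make the mount idea concrete, here is a minimal sketch, not Dangerzone's actual code. The helper names (`missing_models`, `tessdata_mount_args`) are hypothetical, and it assumes the models are stored on the host as `<lang>.traineddata` files and that the container runtime accepts the usual `--volume host:container:ro` syntax.

```python
from pathlib import Path

# Path where Tesseract looks for models inside the container (from the issue).
CONTAINER_TESSDATA = "/usr/share/tessdata"


def missing_models(tessdata_dir: Path, langs: list[str]) -> list[str]:
    """Return the languages whose model file is not yet present on the host."""
    # Assumption: one "<lang>.traineddata" file per language.
    return [
        lang
        for lang in langs
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]


def tessdata_mount_args(tessdata_dir: Path) -> list[str]:
    """Build the container runtime arguments that bind-mount the host models.

    The mount is read-only, since the container only needs to read the models.
    """
    host_dir = tessdata_dir.resolve()  # bind mounts need an absolute path
    return ["--volume", f"{host_dir}:{CONTAINER_TESSDATA}:ro"]
```

A caller would first check `missing_models(...)` for the languages requested by the user, fetch whatever is missing, and then append `tessdata_mount_args(...)` to the container command line.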

There are two issues with this approach:

1. The canonical repo would be GitHub, which may reject requests due to rate-limiting.
2. This cannot work on air-gapped installations, so we would need separate instructions for those.
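Both issues could be softened by checking the local directory first and retrying with a backoff when the server pushes back. The sketch below is only an illustration under stated assumptions: `ensure_model` and `http_download` are hypothetical names, the `tessdata_fast` URL is one possible source (the issue does not settle on a repository or model type), and rate limiting is assumed to surface as HTTP 403 or 429.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable

# Assumption: models are fetched from the tessdata_fast repository on GitHub.
MODEL_URL = (
    "https://github.com/tesseract-ocr/tessdata_fast/raw/main/{lang}.traineddata"
)


def http_download(url: str) -> bytes:
    """Default downloader: fetch the URL and return the response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def ensure_model(
    lang: str,
    tessdata_dir: Path,
    download: Callable[[str], bytes] = http_download,
    retries: int = 3,
    backoff: float = 1.0,
) -> Path:
    """Return the path of the model for `lang`, downloading it if needed."""
    target = tessdata_dir / f"{lang}.traineddata"
    if target.is_file():
        # Already present (e.g., copied by hand on an air-gapped machine).
        return target

    url = MODEL_URL.format(lang=lang)
    for attempt in range(retries):
        try:
            data = download(url)
        except urllib.error.HTTPError as exc:
            # Assumption: 403/429 indicate rate limiting; anything else is fatal.
            if exc.code not in (403, 429) or attempt == retries - 1:
                raise
            time.sleep(backoff * 2**attempt)  # exponential backoff
            continue
        tessdata_dir.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file first so a partial download is never used.
        tmp = target.with_suffix(".part")
        tmp.write_bytes(data)
        tmp.replace(target)
        return target
    raise RuntimeError("no download attempts were made (retries must be >= 1)")
```

Because the local check comes first, an air-gapped user who places the `.traineddata` files in the directory by hand never triggers a network request; instructions for those installations would only need to describe where to put the files.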


deeplow mentioned this issue Jun 15, 2023

Eventual plans for integration into Tails OS? #103

Open

deeplow mentioned this issue Jun 29, 2023

Assess language support impact on Dangerzone #465

Open

This was referenced Sep 6, 2023

Qubes: RPM Packaging #431

Closed

Settle on Tesseract model type #545

Closed
