Dynamic discovery of OCR languages #448

Open
apyrgio opened this issue Jun 14, 2023 · 0 comments

One of the main contributing factors to the size of our container image is the OCR models for all the languages that we bundle with our application:

# du -hd 1 / | sort -h
...
1.8G    /usr
1.8G    /

# du -hd 1 /usr | sort -h
...
717.7M  /usr/lib
1.1G    /usr/share
1.8G    /usr

# du -hd 1 /usr/share/ | sort -h
...
1015.6M /usr/share/tessdata   <--- OCR models
1.1G    /usr/share/

We don't expect any single user to need all of the Tesseract models at once, so bundling every one of them in the image is wasteful. An interesting idea would be to dynamically fetch only the models necessary for a conversion, store them on the host (e.g., under Dangerzone's data directory), and mount them into the container at /usr/share/tessdata.

There are two issues with this approach:

  1. The canonical repo would be GitHub, which may reject requests due to rate-limiting.
  2. This is something that cannot work on airgapped installations, so we would need instructions for those.
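
A minimal sketch of the fetch-and-mount idea, to make the two issues above concrete. Everything here is an assumption for illustration, not Dangerzone's actual code: the helper names are hypothetical, the download URL assumes the models come from the tesseract-ocr/tessdata_fast repository on GitHub, and the mount flag assumes a Docker/Podman-style `-v` bind mount. Models that already exist on disk are never downloaded, so an airgapped user could copy the `.traineddata` files into the directory by hand.

```python
import urllib.request
from pathlib import Path

# Assumption: models are fetched from the tessdata_fast repository on GitHub.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def model_url(lang: str) -> str:
    """Return the (assumed) download URL for a language model."""
    return f"{BASE_URL}/{lang}.traineddata"


def missing_models(langs, tessdata_dir: Path):
    """Return the requested languages that have no model file on disk yet."""
    return [
        lang for lang in langs
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]


def fetch_models(langs, tessdata_dir: Path):
    """Download only the missing models; return the languages fetched.

    Performs no network access when every model is already present,
    which is what makes manual (airgapped) provisioning possible.
    """
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for lang in missing_models(langs, tessdata_dir):
        dest = tessdata_dir / f"{lang}.traineddata"
        tmp = tessdata_dir / f"{lang}.traineddata.part"
        # Download to a temporary name so a failed transfer (e.g. one
        # rejected by rate-limiting) never leaves a truncated model.
        urllib.request.urlretrieve(model_url(lang), str(tmp))
        tmp.replace(dest)
        fetched.append(lang)
    return fetched


def mount_args(tessdata_dir: Path):
    """Container runtime arguments that mount the models read-only."""
    return ["-v", f"{tessdata_dir}:/usr/share/tessdata:ro"]
```

A caller would run `fetch_models(["eng", "fra"], data_dir / "tessdata")` before starting a conversion and append `mount_args(...)` to the container command line. Where `data_dir` lives is left open here.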