Support more OCR languages #422

Merged (4 commits) on May 24, 2023

Commits on May 23, 2023

  1. container: Grab trained OCR models from GitHub

    Grab Tesseract's trained models from GitHub, instead of from the Alpine
    Linux repos. Over the past few months, the models in the Alpine Linux
    repos did not remain stable, leading to CI issues.
    
    Since the models are already pre-trained and available through
    Tesseract's repo on GitHub, we can use the release tarball that they
    offer to install them in the container image, which is basically what
    the upstream packages are doing as well.
    
    In order to make sure that we have no regressions, at the time of this
    commit we ensured that the hashes of the models offered through the
    Alpine Linux repos and the models offered from the GitHub release are
    the same. Also, in order to detect future regressions or foul play, we
    check the downloaded models against a known checksum. Given that these
    models change every few years, updating the checksum should not be an
    issue.
    
    Fix #357
    apyrgio committed May 23, 2023 (a0d6f0d)
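
The checksum step described in the commit above can be illustrated with a minimal sketch. The commit performs the check during the container build; the Python below is only an illustration of the idea, and the function names, file name, and pinned value are hypothetical rather than taken from the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large tarballs are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

A build script would call `verify_tarball(Path("tessdata.tar.gz"), PINNED_SHA256)` right after the download and before unpacking, so that a changed or tampered release fails the build instead of silently altering the installed models. Both the file name and `PINNED_SHA256` are placeholders here.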

Commits on May 24, 2023

  1. Restore the OCR languages

    Restore the OCR languages to the state they were in as of commit
    66d3c40, with some minor changes. We can now do so because we download
    all the trained models, not just the ones that Alpine Linux offers.
    apyrgio committed May 24, 2023 (35e439f)
  2. Remove Kurdish (Arabic) language

    Remove the Kurdish (Arabic) language ("kur_ara") from the list of
    languages that we offer for OCR, since it's not included in the
    installed languages.
    
    Interestingly, it is not present in the Alpine Linux repos either, so
    this was probably an omission in the first place.
    apyrgio committed May 24, 2023 (5bd6097)
  3. ci: Add test for OCR languages

    Test that the languages that we provide to users for OCR match the
    languages that are installed in the container image.
    
    Fixes #417
    apyrgio committed May 24, 2023 (641aa13)
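
The kind of check the last commit describes, comparing the languages offered to users against the models actually installed, might look roughly like the sketch below. This is not the project's test. It assumes the installed models are `*.traineddata` files in a single directory, and that non-language models such as `osd` and `equ` should be ignored; all function and parameter names are hypothetical.

```python
from pathlib import Path

# Assumption: these models ship alongside the languages but are not
# languages a user would pick for OCR.
NON_LANGUAGE_MODELS = frozenset({"osd", "equ"})


def installed_languages(tessdata_dir: Path, ignored=NON_LANGUAGE_MODELS) -> set[str]:
    """Language codes that have a .traineddata file in tessdata_dir."""
    codes = {p.stem for p in tessdata_dir.glob("*.traineddata")}
    return codes - set(ignored)


def compare_languages(offered, installed):
    """Return (offered but not installed, installed but not offered)."""
    offered, installed = set(offered), set(installed)
    return sorted(offered - installed), sorted(installed - offered)
```

A test would then assert that both returned lists are empty. With a check like this in place, an entry such as "kur_ara", offered to users but absent from the installed models, would show up in the first list.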