Adapt to multiple language changes in tesseract OCR alpine packaging #417

Closed
deeplow opened this issue May 18, 2023 · 1 comment · Fixed by #422

deeplow commented May 18, 2023

CI example

Broken as of commit ac88a2d

python ./dev_scripts/pytest-wrapper.py -v --cov --ignore dev_scripts
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: '3.4.4' is an invalid version and will not be supported in a future release
  warnings.warn(
running tests sequentially
============================= test session starts ==============================
platform linux -- Python 3.10.7, pytest-7.2.2, pluggy-1.0.0 -- /home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/user/dangerzone
plugins: mock-3.10.0, forked-1.6.0, cov-3.0.0, xdist-2.5.0
collecting ... collected 67 items                                                             

tests/test_cli.py::TestCliBasic::test_no_args PASSED                     [  1%]
tests/test_cli.py::TestCliBasic::test_help PASSED                        [  2%]
tests/test_cli.py::TestCliBasic::test_display_banner PASSED              [  4%]
tests/test_cli.py::TestCliBasic::test_version PASSED                     [  5%]
tests/test_cli.py::TestCliConversion::test_no_args PASSED                [  7%]
tests/test_cli.py::TestCliConversion::test_help PASSED                   [  8%]
tests/test_cli.py::TestCliConversion::test_display_banner PASSED         [ 10%]
tests/test_cli.py::TestCliConversion::test_version PASSED                [ 11%]
tests/test_cli.py::TestCliConversion::test_invalid_lang PASSED           [ 13%]
tests/test_cli.py::TestCliConversion::test_formats[doc0] PASSED          [ 14%]
tests/test_cli.py::TestCliConversion::test_formats[doc1] PASSED          [ 16%]
tests/test_cli.py::TestCliConversion::test_formats[doc2] PASSED          [ 17%]
tests/test_cli.py::TestCliConversion::test_formats[doc3] PASSED          [ 19%]
tests/test_cli.py::TestCliConversion::test_formats[doc4] PASSED          [ 20%]
tests/test_cli.py::TestCliConversion::test_formats[doc5] PASSED          [ 22%]
tests/test_cli.py::TestCliConversion::test_formats[doc6] PASSED          [ 23%]
tests/test_cli.py::TestCliConversion::test_formats[doc7] PASSED          [ 25%]
tests/test_cli.py::TestCliConversion::test_formats[doc8] PASSED          [ 26%]
tests/test_cli.py::TestCliConversion::test_formats[doc9] PASSED          [ 28%]
tests/test_cli.py::TestCliConversion::test_formats[doc10] PASSED         [ 29%]
tests/test_cli.py::TestCliConversion::test_formats[doc11] PASSED         [ 31%]
tests/test_cli.py::TestCliConversion::test_formats[doc12] PASSED         [ 32%]
tests/test_cli.py::TestCliConversion::test_formats[doc13] PASSED         [ 34%]
tests/test_cli.py::TestCliConversion::test_formats[doc14] PASSED         [ 35%]
tests/test_cli.py::TestCliConversion::test_formats[doc15] PASSED         [ 37%]
tests/test_cli.py::TestCliConversion::test_formats[doc16] PASSED         [ 38%]
tests/test_cli.py::TestCliConversion::test_formats[doc17] PASSED         [ 40%]
tests/test_cli.py::TestCliConversion::test_formats[doc18] PASSED         [ 41%]
tests/test_cli.py::TestCliConversion::test_formats[doc19] PASSED         [ 43%]
tests/test_cli.py::TestCliConversion::test_output_filename PASSED        [ 44%]
tests/test_cli.py::TestCliConversion::test_output_filename_spaces PASSED [ 46%]
tests/test_cli.py::TestCliConversion::test_output_filename_new_dir PASSED [ 47%]
tests/test_cli.py::TestCliConversion::test_sample_not_found PASSED       [ 49%]
tests/test_cli.py::TestCliConversion::test_lang_eng FAILED               [ 50%]
tests/test_cli.py::TestCliConversion::test_filenames[\u201cCurly_Quotes\u201d.pdf] PASSED [ 52%]
tests/test_cli.py::TestCliConversion::test_filenames[\u041e\u0440\u0438\u0433\u0438\u043d\u0430\u043b.pdf] PASSED [ 53%]
tests/test_cli.py::TestCliConversion::test_filenames[spaces test.pdf] PASSED [ 55%]
tests/test_cli.py::TestCliConversion::test_bulk PASSED                   [ 56%]
tests/test_cli.py::TestCliConversion::test_bulk_fail_on_output_filename PASSED [ 58%]
tests/test_cli.py::TestCliConversion::test_archive PASSED                [ 59%]
tests/test_cli.py::TestCliConversion::test_dummy_conversion PASSED       [ 61%]
tests/test_cli.py::TestCliConversion::test_dummy_conversion_bulk PASSED  [ 62%]
tests/test_cli.py::TestSecurity::test_suspicious_double_dash_file PASSED [ 64%]
tests/test_cli.py::TestSecurity::test_suspicious_double_dash_and_equals_file PASSED [ 65%]
tests/test_document.py::test_input_sample_init PASSED                    [ 67%]
tests/test_document.py::test_input_sample_init_archive PASSED            [ 68%]
tests/test_document.py::test_input_sample_after PASSED                   [ 70%]
tests/test_document.py::test_input_file_none PASSED                      [ 71%]
tests/test_document.py::test_input_file_non_existing PASSED              [ 73%]
tests/test_document.py::test_input_file_unreadable PASSED                [ 74%]
tests/test_document.py::test_output_file_unwriteable_dir PASSED          [ 76%]
tests/test_document.py::test_output PASSED                               [ 77%]
tests/test_document.py::test_output_file_none PASSED                     [ 79%]
tests/test_document.py::test_output_file_not_pdf PASSED                  [ 80%]
tests/test_document.py::test_archive_unwriteable_dir PASSED              [ 82%]
tests/test_document.py::test_archive PASSED                              [ 83%]
tests/test_document.py::test_set_output_dir PASSED                       [ 85%]
tests/test_document.py::test_set_output_dir_non_existant PASSED          [ 86%]
tests/test_document.py::test_set_output_dir_is_file PASSED               [ 88%]
tests/test_document.py::test_default_output_filename PASSED              [ 89%]
tests/test_document.py::test_set_output_filename_suffix PASSED           [ 91%]
tests/test_document.py::test_is_unconverted_by_default PASSED            [ 92%]
tests/test_document.py::test_mark_as_safe PASSED                         [ 94%]
tests/test_document.py::test_mark_as_converting PASSED                   [ 95%]
tests/test_document.py::test_mark_as_failed PASSED                       [ 97%]
tests/test_util.py::test_get_resource_path PASSED                        [ 98%]
tests/test_util.py::test_get_subprocess_startupinfo SKIPPED (Windows...) [100%]
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/__init__.py': No source for code: '/home/user/dangerzone/shibokensupport/__init__.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/feature.py': No source for code: '/home/user/dangerzone/shibokensupport/feature.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/__init__.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/__init__.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/errorhandler.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/errorhandler.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/importhandler.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/importhandler.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/layout.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/layout.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/lib/__init__.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/lib/__init__.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/lib/enum_sig.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/lib/enum_sig.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/lib/pyi_generator.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/lib/pyi_generator.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/lib/tool.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/lib/tool.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/loader.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/loader.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/mapping.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/mapping.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/shibokensupport/signature/parser.py': No source for code: '/home/user/dangerzone/shibokensupport/signature/parser.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")
/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/coverage/report.py:113: CoverageWarning: Couldn't parse '/home/user/dangerzone/signature_bootstrap.py': No source for code: '/home/user/dangerzone/signature_bootstrap.py'. (couldnt-parse)
  coverage._warn(msg, slug="couldnt-parse")


=================================== FAILURES ===================================
_______________________ TestCliConversion.test_lang_eng ________________________

self = <tests.test_cli.TestCliConversion object at 0x7f6d7d995f00>

    def test_lang_eng(self) -> None:
        result = self.run_cli([self.sample_doc, "--ocr-lang", "eng"])
>       result.assert_success()

tests/test_cli.py:227: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <CLIResult SystemExit(1)>

    def assert_success(self) -> None:
        """Assert that the command succeeded."""
        try:
>           assert self.exit_code == 0
E           assert 1 == 0
E            +  where 1 = <CLIResult SystemExit(1)>.exit_code

tests/test_cli.py:53: AssertionError
----------------------------- Captured stdout call -----------------------------
<CLIResult args: ['/home/user/dangerzone/tests/test_docs/sample-pdf.pdf', '--ocr-lang', 'eng'], exit code: 1, exception: 1
Output (20 lines follow):
╭──────────────────────────╮
│           ▄██▄           │
│          ██████          │
│         ███▀▀▀██         │
│        ███   ████        │
│       ███   ██████       │
│      ███   ▀▀▀▀████      │
│     ███████  ▄██████     │
│    ███████ ▄█████████    │
│   ████████████████████   │
│    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
│                          │
│    Dangerzone v0.4.1     │
│ https://dangerzone.rocks │
╰──────────────────────────╯

Converting document to safe PDF

Failed to convert document(s)
/home/user/dangerzone/tests/test_docs/sample-pdf.pdf

The original traceback follows:
Traceback (most recent call last):
  File "/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/click/testing.py", line 408, in invoke
    return_value = cli.main(args=args or (), prog_name=prog_name, **extra)
  File "/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.cache/pypoetry/virtualenvs/dangerzone-hQU0mwlP-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/dangerzone/dangerzone/errors.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/user/dangerzone/dangerzone/cli.py", line 116, in cli_main
    sys.exit(1)
SystemExit: 1
------------------------------ Captured log call -------------------------------
ERROR    dangerzone.isolation_provider.base:base.py:59 [doc xkfTj_] 50% Page 1/4 OCR failed
ERROR    dangerzone.isolation_provider.container:container.py:302 pixels-to-pdf failed

deeplow commented May 18, 2023

It looks like this wasn't the first breaking change in the tesseract packaging: see commit 3d822e1.

deeplow added the OCR label May 18, 2023
deeplow changed the title from "English OCR broken in main" to "Adapt to multiple language changes in tesseract OCR alpine packaging" May 18, 2023
apyrgio added a commit that referenced this issue May 22, 2023
Test that the languages that we provide to users for OCR match the
languages that are installed in the container image

Fixes #417
apyrgio added a commit that referenced this issue May 23, 2023
Test that the languages that we provide to users for OCR match the
languages that are installed in the container image

Fixes #417
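
The check described in the commit message above can be sketched roughly as follows. This is a minimal illustration only, not the code from the referenced commits: the helper names `parse_list_langs` and `missing_languages`, the `offered` mapping, and the sample output string are all hypothetical. It assumes the installed languages are obtained from the text printed by `tesseract --list-langs`, which starts with a "List of available languages ..." header followed by one language code per line.

```python
from typing import Dict, Iterable, Set


def parse_list_langs(output: str) -> Set[str]:
    """Turn the text printed by `tesseract --list-langs` into a set of codes.

    Assumption: the output has a header line starting with "List of",
    followed by one language code per line.
    """
    langs: Set[str] = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.startswith("List of"):
            continue
        langs.add(line)
    return langs


def missing_languages(offered: Iterable[str], installed: Set[str]) -> Set[str]:
    """Return the offered language codes that are not installed."""
    return set(offered) - installed


# Hypothetical data: languages shown to users, and a captured command output.
offered: Dict[str, str] = {"English": "eng", "French": "fra", "German": "deu"}
sample_output = (
    'List of available languages in "/usr/share/tessdata/" (2):\n'
    "eng\n"
    "fra\n"
)

installed = parse_list_langs(sample_output)
missing = missing_languages(offered.values(), installed)
print(sorted(missing))
```

In a real test, `sample_output` would come from running the command inside the container image, and the test would assert that `missing` is empty, so that a packaging change which drops or renames a language package fails CI immediately instead of surfacing as an OCR failure at conversion time.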