Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

deeplow · 2024-01-24T14:37:55Z

NOTE: to be merged after #627 since Qubes support for some reason was not working on main. The stream pages PR must have fixed something (can't recall exactly).

PyMuPDF versions lower than 1.22.5 pass the tesseract data path as an argument to pixmap.pdfocr_tobytes() 1, but lower versions require setting instead the TESSDATA_PREFIX environment variable 2.

Because on Qubes the pixels to pdf conversion happens on the host and Qubes has a lower PyMuPDF package version, we need to pass instead via environment variable.

NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli instead of closer to the calling method in doc_to_pixels.py since PyMuPDF reads this variable as soon as the fitz module is imported 3.

Fixes #682

apyrgio

Looks good. Haven't tested locally yet, but the change makes sense.

deeplow · 2024-02-01T11:53:45Z

Thanks for the review. This can only be merged after #627, so we'll have to wait.

deeplow · 2024-02-05T18:52:56Z

@apyrgio I sneaked in da684dd in this PR since it is more topically related to this PR than in the page streaming PR (this was a leftover from the PyMuPDF integration. Plus, it's small enought that I don't think it deserves its own PR.

Simple rename of the __convert() method in the Qubes conversion to make the code structurally similar.

Remove container_log messages ahead of debug info being sent over standard streams.

Merge Qubes and Containers isolation providers core code into the class parent IsolationProviders abstract class. This is done by streaming pages in containers for exclusively in first conversion process. The commit is rather large due to the multiple interdependencies of the code, making it difficult to split into various commits. The main conversion method (_convert) now in the superclass simply calls two methods: - doc_to_pixels() - pixels_to_pdf() Critically, doc_to_pixels is implemented in the superclass, diverging only in a specialized method called "start_doc_to_pixels_proc()". This method obtains the process responsible that communicates with the isolation provider (container / disp VM) via `podman/docker` and qrexec on Containers and Qubes respectively. Known regressions: - progress reports stopped working on containers Fixes #443

If one converted more than one document, since the state of IsolationProvider.percentage would be stored in the IsolationProvider instance, it would get reused for the second document. The fix is to keep it as a local variable, but we can explore having progress stored on the document itself, for example. Or having one IsolationProvider per conversion.

Now that only the second container can send JSON-encoded progress information, we can the untrusted JSON parsing. The parse_progress was also renamed to `parse_progress_trusted` to ensure future developers don't mistake this as a safe method. The old methods for sending untrusted JSON were repurposed to send the progress instead to stderr for troubleshooting in development mode. Fixes #456

@apyrgio

Avoids downloading the container image 4 times in the multi-stage build by first pulling the alpine image once and then building without any pulls. Implemented following a suggestion of @apyrgio.

Conversions methods had changed and that was part of the reason why the tests were failing. Furthermore, due to the `provider.proc`, which stores the associated qrexec / container process, "server" exceptions raise a IterruptedConversion error (now ConverterProcException), which then requires interpretation of the process exit code to obtain the "real" exception.

Now that Qubes and Containers essentially share the same code, we can have both run the same tests.

If we increased the number of parallel conversions, we'd run into an issue where the streams were getting mixed together. This was because the Converter.proc was a single attribute. This breaks it down into a local variable such that this mixup doesn't happen.

Attempt to fix missing issue installing poetry [1]. [1]: https://github.com/freedomofpress/dangerzone/actions/runs/7487413482/job/20379748604?pr=627

Fix issues in older distros that don't yet support python 3.11 where endianness was not a default argument [1]. This is in response to CI failures [2]. [1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes [2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y

@apyrgio

The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)

This reverts commit fea193e. This is part of the purge of timeout-related code since we no longer need it [1]. Non-blocking reads were introduced in the reverted commit in order to be able to cut a stream mid-way due to a timeout. This is no longer needed now that we're getting rid of timeouts. [1]: #687

This reverts commit 344d6f7. Stopwatch is no longer needed now that we're removing timeouts.

Remove timeouts due to several reasons: 1. Lost purpose: after implementing the containers page streaming the only subprocess we have left is LibreOffice. So don't have such a big risk of commands hanging (the original reason for timeouts). 2. Little benefit: predicting execution time is generically unsolvable computer science problem. Ultimately we were guessing an arbitrary time based on the number of pages and the document size. As a guess we made it pretty lax (30s per page or MB). A document hanging for this long will probably lead to user frustration in any case and the user may be compelled to abort the conversion. 3. Technical Challenges with non-blocking timeout: there have been several technical challenges in keeping timeouts that we've made effort to accommodate. A significant one was having to do non-blocking read to ensure we could timeout when reading conversion stream (and then used here) Fixes #687

Since the progress information is now inferred on host based on the number of pages obtained, progress-tracking variables should be removed from the server.

Switching from mounting files to writing to stdout has introduced some Podman crashes in specific environments (Ubuntu Jammy / Debian Bullseye) due to a conmon bug that affects version 2.0.25. Fixing it for various permutations of the environments we support requires the following: 1. CI tests: Install conmon from the oldstable-proposed-updates in our Debian Bullseye / Ubuntu Jammy dev/end-user environments. 2. Developers: Add a line in BUILD.md that suggests users to install conmon from the oldstable-proposed-updates repo, or some other repo they prefer. 3. End-user installations: We will build conmon for Ubuntu Jammy, and wait until the proposed updates repo gets merged in Debian Bullseye. Fixes #685

PyMuPDF versions lower than 1.22.5 pass the tesseract data path as an argument to `pixmap.pdfocr_tobytes()` [1], but lower versions require setting instead the TESSDATA_PREFIX environment variable [2]. Because on Qubes the pixels to pdf conversion happens on the host and Qubes has a lower PyMuPDF package version, we need to pass instead via environment variable. NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli instead of closer to the calling method in `doc_to_pixels.py` since PyMuPDF reads this variable as soon as the fitz module is imported [3][4]. [1]: https://pymupdf.readthedocs.io/en/latest/pixmap.html#Pixmap.pdfocr_tobytes [2]: https://pymupdf.readthedocs.io/en/latest/installation.html#enabling-integrated-ocr-support [3]: pymupdf/PyMuPDF#2439 [4]: https://github.com/pymupdf/PyMuPDF/blob/5d6a7db/src/__init__.py#L159 Fixes #682

The container image does not need the TESSDATA_PREFIX env variable since its PyMuPDF version is new enough to support `tessdata` as an argument when calling the PyMuPDF tesseract method.

deeplow · 2024-02-07T14:32:03Z

@apyrgio I sneaked in da684dd in this PR since it is more topically related to this PR than in the page streaming PR (this was a leftover from the PyMuPDF integration. Plus, it's small enought that I don't think it deserves its own PR.

Verbally @apyrgio confirmed that this was fine to include. So I'll merge this now.

deeplow · 2024-02-07T14:36:51Z

Interesting... GitHub still thinks this is a massive PR after this branch was just 2 commits after the tip of main (with the merging of #627)

apyrgio approved these changes Jan 29, 2024

View reviewed changes

deeplow and others added 19 commits February 6, 2024 18:54

Homogenize qubes and containers inner convert method

dca46d0

Simple rename of the __convert() method in the Qubes conversion to make the code structurally similar.

Containers: remove debug messages (via files)

331b651

Remove container_log messages ahead of debug info being sent over standard streams.

Speed up container image building (pull + build)

0a54f64

Avoids downloading the container image 4 times in the multi-stage build by first pulling the alpine image once and then building without any pulls. Implemented following a suggestion of @apyrgio.

Move Qubes-specific tests also to containers

943bab2

Now that Qubes and Containers essentially share the same code, we can have both run the same tests.

Bump CI macOS python version to 3.11

5e169a8

Attempt to fix missing issue installing poetry [1]. [1]: https://github.com/freedomofpress/dangerzone/actions/runs/7487413482/job/20379748604?pr=627

Fix hanging: disable container logging

07dd54c

The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)

Revert "Add Stopwatch implementation"

4d3f2b3

This reverts commit 344d6f7. Stopwatch is no longer needed now that we're removing timeouts.

Remove leftover progress variable in pixels_to_pdf

8a32d80

Since the progress information is now inferred on host based on the number of pages obtained, progress-tracking variables should be removed from the server.

Remove uneeded TESSDATA_PREFIX setting in container

879fca6

The container image does not need the TESSDATA_PREFIX env variable since its PyMuPDF version is new enough to support `tessdata` as an argument when calling the PyMuPDF tesseract method.

deeplow force-pushed the 682-fix-ocr-lang-qubes branch from da684dd to 879fca6 Compare February 7, 2024 13:18

deeplow marked this pull request as ready for review February 7, 2024 14:31

deeplow merged commit 879fca6 into main Feb 7, 2024

deeplow deleted the 682-fix-ocr-lang-qubes branch February 7, 2024 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

Uh oh!

deeplow commented Jan 24, 2024

Uh oh!

apyrgio left a comment

Uh oh!

deeplow commented Feb 1, 2024

Uh oh!

deeplow commented Feb 5, 2024

Uh oh!

deeplow commented Feb 7, 2024

Uh oh!

deeplow commented Feb 7, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686

Uh oh!

Conversation

deeplow commented Jan 24, 2024

Uh oh!

apyrgio left a comment

Choose a reason for hiding this comment

Uh oh!

deeplow commented Feb 1, 2024

Uh oh!

deeplow commented Feb 5, 2024

Uh oh!

deeplow commented Feb 7, 2024

Uh oh!

deeplow commented Feb 7, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants