-
Notifications
You must be signed in to change notification settings - Fork 218
Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX #686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
apyrgio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Haven't tested locally yet, but the change makes sense.
|
Thanks for the review. This can only be merged after #627, so we'll have to wait. |
|
@apyrgio I sneaked in da684dd in this PR since it is more topically related to this PR than in the page streaming PR (this was a leftover from the PyMuPDF integration. Plus, it's small enought that I don't think it deserves its own PR. |
Simple rename of the __convert() method in the Qubes conversion to make the code structurally similar.
Remove container_log messages ahead of debug info being sent over standard streams.
Merge Qubes and Containers isolation providers core code into the class parent IsolationProviders abstract class. This is done by streaming pages in containers for exclusively in first conversion process. The commit is rather large due to the multiple interdependencies of the code, making it difficult to split into various commits. The main conversion method (_convert) now in the superclass simply calls two methods: - doc_to_pixels() - pixels_to_pdf() Critically, doc_to_pixels is implemented in the superclass, diverging only in a specialized method called "start_doc_to_pixels_proc()". This method obtains the process responsible that communicates with the isolation provider (container / disp VM) via `podman/docker` and qrexec on Containers and Qubes respectively. Known regressions: - progress reports stopped working on containers Fixes #443
If one converted more than one document, since the state of IsolationProvider.percentage would be stored in the IsolationProvider instance, it would get reused for the second document. The fix is to keep it as a local variable, but we can explore having progress stored on the document itself, for example. Or having one IsolationProvider per conversion.
Now that only the second container can send JSON-encoded progress information, we can the untrusted JSON parsing. The parse_progress was also renamed to `parse_progress_trusted` to ensure future developers don't mistake this as a safe method. The old methods for sending untrusted JSON were repurposed to send the progress instead to stderr for troubleshooting in development mode. Fixes #456
Avoids downloading the container image 4 times in the multi-stage build by first pulling the alpine image once and then building without any pulls. Implemented following a suggestion of @apyrgio.
Conversions methods had changed and that was part of the reason why the tests were failing. Furthermore, due to the `provider.proc`, which stores the associated qrexec / container process, "server" exceptions raise a IterruptedConversion error (now ConverterProcException), which then requires interpretation of the process exit code to obtain the "real" exception.
Now that Qubes and Containers essentially share the same code, we can have both run the same tests.
If we increased the number of parallel conversions, we'd run into an issue where the streams were getting mixed together. This was because the Converter.proc was a single attribute. This breaks it down into a local variable such that this mixup doesn't happen.
Attempt to fix missing issue installing poetry [1]. [1]: https://github.com/freedomofpress/dangerzone/actions/runs/7487413482/job/20379748604?pr=627
Fix issues in older distros that don't yet support python 3.11 where endianness was not a default argument [1]. This is in response to CI failures [2]. [1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes [2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)
This reverts commit fea193e. This is part of the purge of timeout-related code since we no longer need it [1]. Non-blocking reads were introduced in the reverted commit in order to be able to cut a stream mid-way due to a timeout. This is no longer needed now that we're getting rid of timeouts. [1]: #687
This reverts commit 344d6f7. Stopwatch is no longer needed now that we're removing timeouts.
Remove timeouts due to several reasons: 1. Lost purpose: after implementing the containers page streaming the only subprocess we have left is LibreOffice. So don't have such a big risk of commands hanging (the original reason for timeouts). 2. Little benefit: predicting execution time is generically unsolvable computer science problem. Ultimately we were guessing an arbitrary time based on the number of pages and the document size. As a guess we made it pretty lax (30s per page or MB). A document hanging for this long will probably lead to user frustration in any case and the user may be compelled to abort the conversion. 3. Technical Challenges with non-blocking timeout: there have been several technical challenges in keeping timeouts that we've made effort to accommodate. A significant one was having to do non-blocking read to ensure we could timeout when reading conversion stream (and then used here) Fixes #687
Since the progress information is now inferred on host based on the number of pages obtained, progress-tracking variables should be removed from the server.
Switching from mounting files to writing to stdout has introduced some Podman crashes in specific environments (Ubuntu Jammy / Debian Bullseye) due to a conmon bug that affects version 2.0.25. Fixing it for various permutations of the environments we support requires the following: 1. CI tests: Install conmon from the oldstable-proposed-updates in our Debian Bullseye / Ubuntu Jammy dev/end-user environments. 2. Developers: Add a line in BUILD.md that suggests users to install conmon from the oldstable-proposed-updates repo, or some other repo they prefer. 3. End-user installations: We will build conmon for Ubuntu Jammy, and wait until the proposed updates repo gets merged in Debian Bullseye. Fixes #685
PyMuPDF versions lower than 1.22.5 pass the tesseract data path as an argument to `pixmap.pdfocr_tobytes()` [1], but lower versions require setting instead the TESSDATA_PREFIX environment variable [2]. Because on Qubes the pixels to pdf conversion happens on the host and Qubes has a lower PyMuPDF package version, we need to pass instead via environment variable. NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli instead of closer to the calling method in `doc_to_pixels.py` since PyMuPDF reads this variable as soon as the fitz module is imported [3][4]. [1]: https://pymupdf.readthedocs.io/en/latest/pixmap.html#Pixmap.pdfocr_tobytes [2]: https://pymupdf.readthedocs.io/en/latest/installation.html#enabling-integrated-ocr-support [3]: pymupdf/PyMuPDF#2439 [4]: https://github.com/pymupdf/PyMuPDF/blob/5d6a7db/src/__init__.py#L159 Fixes #682
The container image does not need the TESSDATA_PREFIX env variable since its PyMuPDF version is new enough to support `tessdata` as an argument when calling the PyMuPDF tesseract method.
da684dd to
879fca6
Compare
Verbally @apyrgio confirmed that this was fine to include. So I'll merge this now. |
|
Interesting... GitHub still thinks this is a massive PR after this branch was just 2 commits after the tip of main (with the merging of #627) |
PyMuPDF versions lower than 1.22.5 pass the tesseract data path as an argument to
pixmap.pdfocr_tobytes()1, but lower versions require setting instead the TESSDATA_PREFIX environment variable 2.Because on Qubes the pixels to pdf conversion happens on the host and Qubes has a lower PyMuPDF package version, we need to pass instead via environment variable.
NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli instead of closer to the calling method in
doc_to_pixels.pysince PyMuPDF reads this variable as soon as the fitz module is imported 3.Fixes #682