Skip to content

Conversation

@deeplow
Copy link
Contributor

@deeplow deeplow commented Jan 24, 2024

NOTE: to be merged after #627 since Qubes support for some reason was not working on main. The stream pages PR must have fixed something (can't recall exactly).

PyMuPDF versions lower than 1.22.5 pass the tesseract data path as an argument to pixmap.pdfocr_tobytes() 1, but lower versions require setting instead the TESSDATA_PREFIX environment variable 2.

Because on Qubes the pixels to pdf conversion happens on the host and Qubes has a lower PyMuPDF package version, we need to pass instead via environment variable.

NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli instead of closer to the calling method in doc_to_pixels.py since PyMuPDF reads this variable as soon as the fitz module is imported 3.

Fixes #682

Copy link
Contributor

@apyrgio apyrgio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Haven't tested locally yet, but the change makes sense.

@deeplow
Copy link
Contributor Author

deeplow commented Feb 1, 2024

Thanks for the review. This can only be merged after #627, so we'll have to wait.

@deeplow
Copy link
Contributor Author

deeplow commented Feb 5, 2024

@apyrgio I sneaked in da684dd in this PR since it is more topically related to this PR than in the page streaming PR (this was a leftover from the PyMuPDF integration. Plus, it's small enought that I don't think it deserves its own PR.

deeplow and others added 19 commits February 6, 2024 18:54
Simple rename of the __convert() method in the Qubes conversion to make
the code structurally similar.
Remove container_log messages ahead of debug info being sent over
standard streams.
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.

This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.

The main conversion method (_convert) now in the superclass simply calls
two methods:
  - doc_to_pixels()
  - pixels_to_pdf()

Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.

Known regressions:
  - progress reports stopped working on containers

Fixes #443
If one converted more than one document, since the state of
IsolationProvider.percentage would be stored in the IsolationProvider
instance, it would get reused for the second document. The fix is to
keep it as a local variable, but we can explore having progress stored
on the document itself, for example. Or having one IsolationProvider per
conversion.
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.

The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.

Fixes #456
Avoids downloading the container image 4 times in the multi-stage build
by first pulling the alpine image once and then building without any
pulls.

Implemented following a suggestion of @apyrgio.
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
Now that Qubes and Containers essentially share the same code, we can
have both run the same tests.
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
Fix issues in older distros that don't yet support python 3.11 where
endianness was not a default argument [1]. This is in response to CI
failures [2].

[1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
[2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
The conversion was hanging arbitrarily [1] on some systems. Sometimes it
would send the full page other times stop half-way.

Originally found by @apyrgio.

Co-authored-by: @apyrgio

[1]: #627 (comment)
This reverts commit fea193e.

This is part of the purge of timeout-related code since we no longer
need it [1]. Non-blocking reads were introduced in the reverted commit
in order to be able to cut a stream mid-way due to a timeout. This is
no longer needed now that we're getting rid of timeouts.

[1]: #687
This reverts commit 344d6f7.
Stopwatch is no longer needed now that we're removing timeouts.
Remove timeouts due to several reasons:

1. Lost purpose: after implementing the containers page streaming the
   only subprocess we have left is LibreOffice. So don't have such a
   big risk of commands hanging (the original reason for timeouts).

2. Little benefit: predicting execution time is generically unsolvable
   computer science problem. Ultimately we were guessing an arbitrary
   time based on the number of pages and the document size. As a guess
   we made it pretty lax (30s per page or MB). A document hanging for
   this long will probably lead to user frustration in any case and the
   user may be compelled to abort the conversion.

3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)

Fixes #687
Since the progress information is now inferred on host based on the
number of pages obtained, progress-tracking variables should be removed
from the server.
Switching from mounting files to writing to stdout has introduced some
Podman crashes in specific environments (Ubuntu Jammy / Debian Bullseye)
due to a conmon bug that affects version 2.0.25.

Fixing it for various permutations of the environments we support
requires the following:

1. CI tests: Install conmon from the oldstable-proposed-updates in
   our Debian Bullseye / Ubuntu Jammy dev/end-user environments.
2. Developers: Add a line in BUILD.md that suggests users to install
   conmon from the oldstable-proposed-updates repo, or some other repo
   they prefer.
3. End-user installations: We will build conmon for Ubuntu Jammy, and
   wait until the proposed updates repo gets merged in Debian Bullseye.

Fixes #685
PyMuPDF versions lower than 1.22.5 pass the tesseract data path as
an argument to `pixmap.pdfocr_tobytes()` [1], but lower versions require
setting instead the TESSDATA_PREFIX environment variable [2].

Because on Qubes the pixels to pdf conversion happens on the host and
Qubes has a lower PyMuPDF package version, we need to pass instead via
environment variable.

NOTE: the TESSDATA_PREFIX env. variable was set in dangerzone-cli
instead of closer to the calling method in `doc_to_pixels.py` since
PyMuPDF reads this variable as soon as the fitz module is imported
[3][4].

[1]: https://pymupdf.readthedocs.io/en/latest/pixmap.html#Pixmap.pdfocr_tobytes
[2]: https://pymupdf.readthedocs.io/en/latest/installation.html#enabling-integrated-ocr-support
[3]: pymupdf/PyMuPDF#2439
[4]: https://github.com/pymupdf/PyMuPDF/blob/5d6a7db/src/__init__.py#L159

Fixes #682
The container image does not need the TESSDATA_PREFIX env variable since
its PyMuPDF version is new enough to support `tessdata` as an argument
when calling the PyMuPDF tesseract method.
@deeplow deeplow force-pushed the 682-fix-ocr-lang-qubes branch from da684dd to 879fca6 Compare February 7, 2024 13:18
@deeplow deeplow marked this pull request as ready for review February 7, 2024 14:31
@deeplow
Copy link
Contributor Author

deeplow commented Feb 7, 2024

@apyrgio I sneaked in da684dd in this PR since it is more topically related to this PR than in the page streaming PR (this was a leftover from the PyMuPDF integration. Plus, it's small enought that I don't think it deserves its own PR.

Verbally @apyrgio confirmed that this was fine to include. So I'll merge this now.

@deeplow deeplow merged commit 879fca6 into main Feb 7, 2024
@deeplow deeplow deleted the 682-fix-ocr-lang-qubes branch February 7, 2024 14:35
@deeplow
Copy link
Contributor Author

deeplow commented Feb 7, 2024

Interesting... GitHub still thinks this is a massive PR after this branch was just 2 commits after the tip of main (with the merging of #627)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--ocr-lang not working on Qubes

3 participants