
Solve most congestion issues in /tmp (client & server) #619

Closed
wants to merge 8 commits

Conversation


@deeplow deeplow commented Nov 2, 2023

On more resource-limited systems (such as Qubes OS), converting 1000+ page documents was not possible due to several bottlenecks.

This PR addresses several of these bottlenecks (see commits for details).

Fixes #616 #625 #574

Regressions introduced

The containers version is slightly slower, stopping at the end of each batch for a brief moment. I don't know yet what's causing the wait, but this will be solved with #443, once containers stream pages.

Remaining bottlenecks

Currently, converting a 1299-page document in Qubes still fails. This is due to the compression step (the reason why this is still a draft): the compression creates additional files in /tmp, which fill it up.

PDFtoPPM was producing RGB files faster than they were getting consumed.
Since the RGB files were only removed after they were sent, this was
clogging /tmp on the server.

This solution consists of processing and sending images in chunks of 50
pages (see the sketch below). It is slightly inefficient, since it can't
process and send data simultaneously; that will be solved in a future
commit.

Fixes #574
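
A minimal sketch of the chunking helper (hypothetical; the PR's actual batch_iterator may differ in signature and defaults):

from typing import Iterator, Tuple

def batch_iterator(num_pages: int, batch_size: int = 50) -> Iterator[Tuple[int, int]]:
    # Yield inclusive (first_page, last_page) ranges covering 1..num_pages.
    for first in range(1, num_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, num_pages)

# Example: list(batch_iterator(120)) == [(1, 50), (51, 100), (101, 120)]
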
Previously, the batched PDFtoPPM conversion would stop after converting
a batch in order to send the data. By sending that data during the
following loop iteration instead, we can perform the sending at the
"same time" as converting the next batch (sketched below).
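
Roughly, the overlap looks like this (a sketch with hypothetical start_pdftoppm_batch and send_batch helpers, not the PR's actual function names):

prev_batch = None
for first_page, last_page in batch_iterator(num_pages):
    # start_pdftoppm_batch returns a Popen-like handle (hypothetical, non-blocking)
    proc = start_pdftoppm_batch(first_page, last_page)
    if prev_batch:
        send_batch(*prev_batch)  # send the previous batch while pdftoppm runs
    proc.wait()
    prev_batch = (first_page, last_page)
if prev_batch:
    send_batch(*prev_batch)  # flush the final batch
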
Storing all RGB files on the host was leading to a fast-filling `/tmp`.
This solution converts all the RGB files to PNGs (which are compressed),
saving valuable space in the process. The conversion is done with the
Pillow (PIL) module, without the need for any external dependencies
(sketched below).
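
A minimal sketch of that conversion, assuming raw RGB pixel files plus width/height values as in the diff below (the PR's real convert_pixels_to_png has a different signature):

from PIL import Image

def rgb_to_png(rgb_path: str, png_path: str, width: int, height: int) -> None:
    with open(rgb_path, "rb") as f:
        untrusted_pixels = f.read()
    # Wrap the raw pixel buffer directly, with no intermediate PPM file.
    image = Image.frombuffer("RGB", (width, height), untrusted_pixels, "raw", "RGB", 0, 1)
    image.save(png_path, "PNG")  # PNG is compressed, unlike raw RGB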

Fixes #526
An overdue dead-code removal was finally done, combined with merging the
parts of the containers and Qubes isolation providers that handle size
(and width/height) checks.
For large (1200+ page) PDFs, the pdfunite command would fail on some
systems (e.g. Qubes), because it would be called with 1024+ files,
running into the open-file limit (`ulimit -n`).

This solution splits the merging into batches, accumulating the results
in a single PDF and then merging it with the next batch (sketched below).
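
A sketch of the accumulate-and-merge loop (hypothetical helper; assumes at least two page PDFs and reuses the 50-page batch size):

import os
import subprocess

def merge_pdfs_in_batches(page_pdfs: list, out_pdf: str, batch_size: int = 50) -> None:
    # Merge the first batch, then fold each subsequent batch into the
    # accumulated result, so pdfunite never opens more than batch_size + 2
    # files at once (staying well below `ulimit -n`).
    subprocess.run(["pdfunite", *page_pdfs[:batch_size], out_pdf], check=True)
    for i in range(batch_size, len(page_pdfs), batch_size):
        tmp_pdf = f"{out_pdf}.tmp"
        subprocess.run(
            ["pdfunite", out_pdf, *page_pdfs[i : i + batch_size], tmp_pdf],
            check=True,
        )
        os.replace(tmp_pdf, out_pdf)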
Comment on lines +308 to +328
num_pages = len(glob.glob(f"{pixel_dir}/page-*.rgb"))
self.verify_received_pixel_files(pixel_dir, num_pages)

for page in range(1, num_pages + 1):
    filename_base = f"{pixel_dir}/page-{page}"
    rgb_filename = f"{filename_base}.rgb"
    width_filename = f"{filename_base}.width"
    height_filename = f"{filename_base}.height"
    with open(width_filename) as f:
        width = int(f.read().strip())
    with open(height_filename) as f:
        height = int(f.read().strip())
    with open(rgb_filename, "rb") as rgb_f:
        untrusted_pixels = rgb_f.read()
    self.convert_pixels_to_png(
        str(pixel_dir), page, width, height, rgb_data=untrusted_pixels
    )

    os.remove(rgb_filename)
    os.remove(width_filename)
    os.remove(height_filename)
deeplow (Contributor Author):

All of this should be moved to after checking whether there was a non-zero exit code.

apyrgio (Contributor):

Just to make sure why this code is necessary in the first place:

  1. The current containers version, which does not stream the data to the host yet, has to write all the pixels in the temporary directory. This is the worst-case scenario, and converting them to PNG after the fact does not help much.
  2. The streaming containers version (post-#443, "Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py)") will piggyback on the Qubes code that converts the RGB images to PNG on the fly, which has already been added in this PR.

Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?

deeplow (Contributor Author):

> Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?

Yes, that's the logic here. And it's not worth leaving the pixels-to-pdf code more complex to make #412 easier.


apyrgio commented Nov 6, 2023

I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

Consider this snippet:

import sys, fitz  # import the bindings

fname = sys.argv[1]  # get filename from command line
doc = fitz.open(fname)  # open document

for page in doc:  # iterate through the pages
    pix = page.get_pixmap(dpi=150)  # render page to an RGB image
    rgb_buf = pix.samples_mv  # Get a memory view over the pixel data
    with open("mu-%i.rgb" % page.number, "wb+") as f:
        f.write(rgb_buf)
    print(f"Page {page.number}: W: {pix.width}, H: {pix.height}")

This snippet:

  1. Opens a document and reads every page
  2. Renders each page as an RGB image on the fly, and immediately has access to the width and height.
    • Rendering the page to RGB on the fly means that we can ditch the unnecessary PPM -> RGB conversion step, as well as the convoluted callback logic that comes with pdftoppm
  3. Writes the pixel data to a file (but can also write them to a socket).

For the container case, we can simply iterate all the pages and write the data to the filesystem. For the Qubes case though, we can write the data immediately to the stdout, without buffering them locally. This means that we don't need any batch logic, and Unix pipes become the means of congestion control.
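
For illustration, the Qubes variant could look something like this (the wire format here is hypothetical, not the project's actual protocol):

import sys
import fitz  # PyMuPDF

doc = fitz.open(sys.argv[1])
out = sys.stdout.buffer
for page in doc:
    pix = page.get_pixmap(dpi=150)
    # Hypothetical framing: width and height as 2-byte integers, followed
    # by the raw RGB pixels. The pipe's backpressure throttles rendering,
    # so no batching or temporary files are needed.
    out.write(pix.width.to_bytes(2, "big"))
    out.write(pix.height.to_bytes(2, "big"))
    out.write(pix.samples_mv)
out.flush()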

Also, another improvement that this snippet brings is that it creates RGB files 2x faster than pdftoppm creates PPM files. And this does not even account for the time it takes to convert PPM -> RGB.

Alas, there's a caveat: PyMuPDF is an AGPL project, and Artifex takes its GPL violations seriously. So, what do we have to concede here to make our code faster, simpler, and more secure? My understanding is that we need to license the conversion module under AGPL. This does not affect the application as a whole, because the code runs either in a separate Docker container, or disposable Qube.

Thoughts on this?


EDIT: Alternatively, we can use the liberally licensed pypdfium2, but I haven't experimented yet with it. Also, with a cursory look I don't see any Fedora packages for it, which will prove important for Qubes.


# return True, True
try:
    Image.open(ppm_data).save(png_path, "PNG")
apyrgio (Contributor):

I had tackled something similar in my 2023-10-test-enospc branch. You don't need to go through the PPM -> PNG conversion, nor create a BytesIO object. You can convert it from the existing buffer with:

image = Image.frombuffer("RGB", (width, height), untrusted_pixels, "raw", "RGB", 0, 1)
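
(If I'm reading the Pillow docs right, the trailing "raw", "RGB", 0, 1 arguments are the raw decoder's rawmode, stride, and orientation: tightly packed rows, read top to bottom.)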

deeplow (Contributor Author):

Ah, nice! I thought there would be some golden nugget like this, but I had difficulties finding that branch of yours. Thanks! I'll try and add it.


deeplow commented Nov 6, 2023

> I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

Thanks for the feedback. That's certainly an interesting proposal. I did consider some of those PDF parsing libraries, but ultimately thought they wouldn't add much benefit due to their disadvantages:

  • We're adding another library.
  • The PDF rendering engine would change: poppler isn't perfect, but changing it may change some things (I think).
  • The potential licensing issues: I'm not sure the disposable qube counts the same as a docker image. It would probably be so if we shipped a template image, which we don't.

But the speed increase and Python-native bindings are certainly appealing. Regarding speed, the current performance seems fast enough to be acceptable, and there are slower steps in the pipeline (e.g. OCR), so I don't think it's too critical at the moment.

I am unsure if the benefits outweigh the risks. We could have a sync discussion if you have the availability.

@apyrgio apyrgio left a comment

As you can see, I have proposed some radical detours from the batching implementation, which use native Python libraries instead. I think at the end of the day, this will lead to a simpler, faster, and more secure implementation, which is worth doing now that we are at the start of the release cycle. I'd like to hear your opinion as well, though.

    ),
    timeout=timeout,
)
for first_page, last_page in batch_iterator(num_pages):
apyrgio (Contributor):

Question: what do we gain by batching the pdfunite operations?

At this point, we already have all the pages as PDF files on disk, and that is the main space hog, right? I get that with a single pdfunite call we need double the space, but that's a requirement for the conversion step as well, which takes the original PDF and creates a compressed one, and they both coexist on disk.

I have another suggestion here: using PyPDF2 instead, and performing the compression online:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

for page in writer.pages:
    # ⚠️ This has to be done on the writer, not the reader!
    page.compress_content_streams()  # This is CPU intensive!

with open("out.pdf", "wb") as f:
    writer.write(f)

(snippet taken from the docs)

With PyPDF2, we can open each PDF, compress it, and immediately create the output PDF, without pdfunite and ps2pdf. We can even delete the underlying PDFs as we process them. The good thing about PyPDF2 is that it's available in Fedora repos as well, so it will work on Qubes.
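
A rough sketch of that merge-as-you-go idea (hypothetical file names; PdfWriter.append is the modern pypdf API):

import os
from pypdf import PdfReader, PdfWriter

num_pages = 1299  # e.g. the document that currently fails in Qubes
writer = PdfWriter()
for i in range(1, num_pages + 1):
    page_pdf = f"page-{i}.pdf"
    writer.append(PdfReader(page_pdf))  # assumption: pages are fully copied here
    os.remove(page_pdf)  # free /tmp space as soon as each page is merged

for page in writer.pages:
    page.compress_content_streams()  # CPU intensive

with open("out.pdf", "wb") as f:
    writer.write(f)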

deeplow (Contributor Author):

> Question: what do we gain by batching the pdfunite operations?

It's in the commit message: it's not size-related but rather ulimit-related. We could change the ulimit, but since I had already done the batch logic, I figured it could be reused here.

apyrgio (Contributor):

Oh snap. I didn't expect that Qubes had such a low ulimit (nor that pdfunite opens the files at the same time, jeez...).


deeplow commented Nov 10, 2023

> I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

I have now stalled progress on this PR, since I'll be evaluating how well PyMuPDF works for us, especially regarding the need to change the code license (or at least part of it).


deeplow commented Dec 12, 2023

Closing potentially in favor of #622

@deeplow deeplow closed this Dec 12, 2023
@apyrgio apyrgio deleted the 616-tmp-size branch May 22, 2024 09:38
Successfully merging this pull request may close these issues.

Fix /tmp Filling Up Bottlenecks (server + client)
2 participants