
Solve most congestion issues in /tmp (client & server) #619

Closed
wants to merge 8 commits

Conversation


@deeplow deeplow commented Nov 2, 2023

On more resource-limited systems (such as Qubes OS), converting 1000+ page documents was not possible due to several bottlenecks.

This PR addresses several of these bottlenecks (see commits for details).

Fixes #616 #625 #574

Regressions introduced

The containers version is slightly slower, stopping at the end of each batch for a brief moment. I don't know yet what's causing the wait, but this will be solved with #443, once containers stream pages.

Remaining bottlenecks

Currently, converting a 1299-page document in Qubes still fails. This is due to the compression step (the reason why this is still a draft): the compression creates additional files in /tmp, which fill it up.

PDFtoPPM was producing RGB files faster than they were getting consumed.
Since the RGB files were only removed after they were sent, this was
clogging /tmp on the server.

This solution consists of processing and sending images in chunks of 50
pages (see the sketch below). It is slightly inefficient, since it can't
process and send data simultaneously; that will be solved in a future
commit.

Fixes #574
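
A minimal sketch of the chunking helper (hypothetical; the PR's actual batch_iterator may differ in signature and defaults):

from typing import Iterator, Tuple

def batch_iterator(num_pages: int, batch_size: int = 50) -> Iterator[Tuple[int, int]]:
    # Yield inclusive (first_page, last_page) ranges covering 1..num_pages.
    for first in range(1, num_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, num_pages)

# Example: list(batch_iterator(120)) == [(1, 50), (51, 100), (101, 120)]
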
Previously, the batched PDFtoPPM conversion would stop after converting
a batch in order to send the data. By sending that data during the
following loop iteration instead, we can perform the sending at the
"same time" as converting the next batch (sketched below).
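
Roughly, the overlap looks like this (a sketch with hypothetical start_pdftoppm_batch and send_batch helpers, not the PR's actual function names):

prev_batch = None
for first_page, last_page in batch_iterator(num_pages):
    # start_pdftoppm_batch returns a Popen-like handle (hypothetical, non-blocking)
    proc = start_pdftoppm_batch(first_page, last_page)
    if prev_batch:
        send_batch(*prev_batch)  # send the previous batch while pdftoppm runs
    proc.wait()
    prev_batch = (first_page, last_page)
if prev_batch:
    send_batch(*prev_batch)  # flush the final batch
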
Storing all RGB files on the host was leading to a fast-filling `/tmp`.
This solution converts all the RGB files to PNGs (which are compressed),
saving valuable space in the process. The conversion is done with the
Pillow (PIL) module, without the need for any external dependencies
(sketched below).
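
A minimal sketch of that conversion, assuming raw RGB pixel files plus width/height values as in the diff below (the PR's real convert_pixels_to_png has a different signature):

from PIL import Image

def rgb_to_png(rgb_path: str, png_path: str, width: int, height: int) -> None:
    with open(rgb_path, "rb") as f:
        untrusted_pixels = f.read()
    # Wrap the raw pixel buffer directly, with no intermediate PPM file.
    image = Image.frombuffer("RGB", (width, height), untrusted_pixels, "raw", "RGB", 0, 1)
    image.save(png_path, "PNG")  # PNG is compressed, unlike raw RGB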

Fixes #526
An overdue dead-code removal was finally done, combined with merging the
parts of the containers and Qubes isolation providers that handle size
(and width/height) checks.
For large (1200+ page) PDFs, the pdfunite command would fail on some
systems (e.g. Qubes), because it would be called with 1024+ files,
running into the open-file limit (`ulimit -n`).

This solution splits the merging into batches, accumulating the results
in a single PDF and then merging it with the next batch (sketched below).
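
A sketch of the accumulate-and-merge loop (hypothetical helper; assumes at least two page PDFs and reuses the 50-page batch size):

import os
import subprocess

def merge_pdfs_in_batches(page_pdfs: list, out_pdf: str, batch_size: int = 50) -> None:
    # Merge the first batch, then fold each subsequent batch into the
    # accumulated result, so pdfunite never opens more than batch_size + 2
    # files at once (staying well below `ulimit -n`).
    subprocess.run(["pdfunite", *page_pdfs[:batch_size], out_pdf], check=True)
    for i in range(batch_size, len(page_pdfs), batch_size):
        tmp_pdf = f"{out_pdf}.tmp"
        subprocess.run(
            ["pdfunite", out_pdf, *page_pdfs[i : i + batch_size], tmp_pdf],
            check=True,
        )
        os.replace(tmp_pdf, out_pdf)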
Comment on lines +308 to +328
num_pages = len(glob.glob(f"{pixel_dir}/page-*.rgb"))
self.verify_received_pixel_files(pixel_dir, num_pages)

for page in range(1, num_pages + 1):
    filename_base = f"{pixel_dir}/page-{page}"
    rgb_filename = f"{filename_base}.rgb"
    width_filename = f"{filename_base}.width"
    height_filename = f"{filename_base}.height"
    with open(width_filename) as f:
        width = int(f.read().strip())
    with open(height_filename) as f:
        height = int(f.read().strip())
    with open(rgb_filename, "rb") as rgb_f:
        untrusted_pixels = rgb_f.read()
    self.convert_pixels_to_png(
        str(pixel_dir), page, width, height, rgb_data=untrusted_pixels
    )

    os.remove(rgb_filename)
    os.remove(width_filename)
    os.remove(height_filename)
deeplow (Contributor Author):

All of this should be moved to after checking whether there was a non-zero exit code.

apyrgio (Contributor):

Just to make sure why this code is necessary in the first place:

  1. The current containers version, which does not stream the data to the host yet, has to write all the pixels in the temporary directory. This is the worst-case scenario, and converting them to PNG after the fact does not help much.
  2. The streaming containers version (post-#443, "Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py)") will piggyback on the Qubes code that converts the RGB images to PNG on the fly, which has already been added in this PR.

Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?

deeplow (Contributor Author):

> Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?

Yes, that's the logic here. And it's not worth leaving the pixels-to-pdf code more complex to make #412 easier.


apyrgio commented Nov 6, 2023

I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

Consider this snippet:

import sys, fitz  # import the bindings

fname = sys.argv[1]  # get filename from command line
doc = fitz.open(fname)  # open document

for page in doc:  # iterate through the pages
    pix = page.get_pixmap(dpi=150)  # render page to an RGB image
    rgb_buf = pix.samples_mv  # Get a memory view over the pixel data
    with open("mu-%i.rgb" % page.number, "wb+") as f:
        f.write(rgb_buf)
    print(f"Page {page.number}: W: {pix.width}, H: {pix.height}")

This snippet:

  1. Opens a document and reads every page
  2. Renders each page as an RGB image on the fly, and immediately has access to the width and height.
    • Rendering the page to RGB on the fly means that we can ditch the unnecessary PPM -> RGB conversion step, as well as the convoluted callback logic that comes with pdftoppm
  3. Writes the pixel data to a file (but can also write them to a socket).

For the container case, we can simply iterate all the pages and write the data to the filesystem. For the Qubes case though, we can write the data immediately to the stdout, without buffering them locally. This means that we don't need any batch logic, and Unix pipes become the means of congestion control.
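
For illustration, the Qubes variant could look something like this (the wire format here is hypothetical, not the project's actual protocol):

import sys
import fitz  # PyMuPDF

doc = fitz.open(sys.argv[1])
out = sys.stdout.buffer
for page in doc:
    pix = page.get_pixmap(dpi=150)
    # Hypothetical framing: width and height as 2-byte integers, followed
    # by the raw RGB pixels. The pipe's backpressure throttles rendering,
    # so no batching or temporary files are needed.
    out.write(pix.width.to_bytes(2, "big"))
    out.write(pix.height.to_bytes(2, "big"))
    out.write(pix.samples_mv)
out.flush()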

Also, another improvement that this snippet brings is that it creates RGB files 2x faster than pdftoppm creates PPM files. And this does not even account for the time it takes to convert PPM -> RGB.

Alas, there's a caveat: PyMuPDF is an AGPL project, and Artifex takes its GPL violations seriously. So, what do we have to concede here to make our code faster, simpler, and more secure? My understanding is that we need to license the conversion module under AGPL. This does not affect the application as a whole, because the code runs either in a separate Docker container, or disposable Qube.

Thoughts on this?


EDIT: Alternatively, we can use the liberally licensed pypdfium2, but I haven't experimented yet with it. Also, with a cursory look I don't see any Fedora packages for it, which will prove important for Qubes.


# return True, True
try:
    Image.open(ppm_data).save(png_path, "PNG")
apyrgio (Contributor):

I had tackled something similar in my 2023-10-test-enospc branch. You don't need to go through the PPM -> PNG conversion, nor create a BytesIO object. You can convert it from the existing buffer with:

image = Image.frombuffer("RGB", (width, height), untrusted_pixels, "raw", "RGB", 0, 1)
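
(If I'm reading the Pillow docs right, the trailing "raw", "RGB", 0, 1 arguments are the raw decoder's rawmode, stride, and orientation: tightly packed rows, read top to bottom.)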

deeplow (Contributor Author):

Ah, nice! I thought there would be some golden nugget like this, but I had difficulties finding that branch of yours. Thanks! I'll try and add it.


deeplow commented Nov 6, 2023

> I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

Thanks for the feedback. That's certainly an interesting proposal. I did consider some of those PDF parsing libraries, but ultimately thought they wouldn't add much benefit due to their disadvantages:

  • We're adding another library.
  • The PDF rendering engine would change: poppler isn't perfect, but changing it may change some things (I think).
  • The potential licensing issues: I'm not sure the disposable qube counts the same as a docker image. It would probably be so if we shipped a template image, which we don't.

But the speed increase and Python-native bindings are certainly appealing. Regarding speed, the current performance seems fast enough to be acceptable, and there are slower steps in the pipeline (e.g. OCR), so I don't think it's too critical at the moment.

I am unsure if the benefits outweigh the risks. We could have a sync discussion if you have the availability.

@apyrgio apyrgio left a comment

As you can see, I have proposed some radical detours from the batching implementation, which use native Python libraries instead. I think at the end of the day, this will lead to a simpler, faster, and more secure implementation, which is worth doing now that we are at the start of the release cycle. I'd like to hear your opinion as well, though.

    ),
    timeout=timeout,
)
for first_page, last_page in batch_iterator(num_pages):
apyrgio (Contributor):

Question: what do we gain by batching the pdfunite operations?

At this point, we already have all the pages as PDF files on disk, and that is the main space hog, right? I get that with a single pdfunite call we need double the space, but that's a requirement for the conversion step as well, which takes the original PDF and creates a compressed one, and they both coexist on disk.

I have another suggestion here: using PyPDF2 instead, and performing the compression online:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

for page in writer.pages:
    # ⚠️ This has to be done on the writer, not the reader!
    page.compress_content_streams()  # This is CPU intensive!

with open("out.pdf", "wb") as f:
    writer.write(f)

(snippet taken from the docs)

With PyPDF2, we can open each PDF, compress it, and immediately create the output PDF, without pdfunite and ps2pdf. We can even delete the underlying PDFs as we process them. The good thing about PyPDF2 is that it's available in Fedora repos as well, so it will work on Qubes.
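
A rough sketch of that merge-as-you-go idea (hypothetical file names; PdfWriter.append is the modern pypdf API):

import os
from pypdf import PdfReader, PdfWriter

num_pages = 1299  # e.g. the document that currently fails in Qubes
writer = PdfWriter()
for i in range(1, num_pages + 1):
    page_pdf = f"page-{i}.pdf"
    writer.append(PdfReader(page_pdf))  # assumption: pages are fully copied here
    os.remove(page_pdf)  # free /tmp space as soon as each page is merged

for page in writer.pages:
    page.compress_content_streams()  # CPU intensive

with open("out.pdf", "wb") as f:
    writer.write(f)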

deeplow (Contributor Author):

> Question: what do we gain by batching the pdfunite operations?

It's in the commit message: it's not size-related but rather ulimit-related. We could change the ulimit, but since I had already done the batch logic, I figured it could be reused here.

apyrgio (Contributor):

Oh snap. I didn't expect that Qubes had such a low ulimit (nor that pdfunite opens the files at the same time, jeez...).


deeplow commented Nov 10, 2023

> I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF.

I have now stalled progress on this PR, since I'll be evaluating how well PyMuPDF works for us, especially regarding the need to change the code license (or at least part of it).


deeplow commented Dec 12, 2023

Closing potentially in favor of #622

@deeplow deeplow closed this Dec 12, 2023
@apyrgio apyrgio deleted the 616-tmp-size branch May 22, 2024 09:38
Successfully merging this pull request may close these issues.

Fix /tmp Filling Up Bottlenecks (server + client)
2 participants