Add merge feature #194

xaviRodri · 2022-12-14T10:22:18Z

In order to be able to merge multiple PDF files, we have added this Ghostscript command that can be called using ChromicPDF.merge/2.

This new function accepts a list of paths and an output option that can be added or not depending on the desired output (same as print_to_pdf/2 output option).

xaviRodri · 2022-12-14T10:27:22Z

Hey!
What do you think about this?
I love and use your library, but I am generating multiple PDF files (as I need different headers) and I thought about pushing this PR so it was easy to merge them afterwards.
I already started with the testing, but wanted to ask for some advice on it:

would be better do run the test against some already generated PDF files in "priv/" for example?
or maybe just generating them using print_to_pdf/2?

I also thought on the possibility to add something else so we could print and merge multiple PDFs from different HTML templates. Any idea on this?

Thanks a lot!

maltoe · 2022-12-14T11:10:05Z

Hey @xaviRodri

Thanks for your contribution! I think a merge function is indeed a nice little addition - even though people keep telling me that the Ghostscript part of ChromicPDF should actually be its own library 🙃

I'll take some time looking into your changes later. Already have 2 questions for you though:

What happens to the metadata set on the individual PDFs? (= if I set a title on one or both of them, what does pdfinfo show for the merged file?)
I'm assuming the PDF/A features are also "lost" when merging two PDF/A files. Did you try that?

I already started with the testing, but wanted to ask for some advice on it:

would be better do run the test against some already generated PDF files in "priv/" for example?

or maybe just generating them using print_to_pdf/2?

No strong opinion, but probably would go with the pre-generated files in priv as it's completely independent of the PDF printing

I also thought on the possibility to add something else so we could print and merge multiple PDFs from different HTML templates. Any idea on this?

We have the source_and_options map already, we could add an API that consumes multiple of these and runs the merge step at the end.

[
  Template.source_and_options(...),
  Template.source_and_options(...)
]
|> ChromicPDF.print_to_pdf_and_merge(output: "merged.pdf")

Naming tbd :)

xaviRodri · 2022-12-14T12:27:25Z

Thanks for the fast answer @maltoe !! 💪

First, to answer to your questions:

What happens to the metadata set on the individual PDFs?

It depends on 2 factors:

if only one of them has metadata, the merged one will take it from that one (order doesn't affect here)
if more than one has metadata, the last one being merged will give the meta to the merged file (order is important here)

I'm assuming the PDF/A features are also "lost" when merging two PDF/A files. Did you try that?

I'm not super familiar with all the PDF/A features.. but I tried to see an example file with xpdf and looks like the merged file still warns me about syntax and so.

Also, I like a lot the idea to get a list of source and options and create a print_to_pdf_and_merge/2 function!
Let me try it too because that would fit my use case 100%

Thanks again! 💯

maltoe · 2022-12-15T09:34:49Z

@xaviRodri Thought about it some more and figured we could as well just extend the existing print_to_pdf/2 function with a new clause. I pushed a small commit changing around the parameter types slightly, hopefully you can make sense of it.

New signature would be:

@spec print_to_pdf([source() | source_and_options()], [export_option()]) :: export_return()

Behaviour:

create a tmpdir
print a pdf for each source in the list
merge them all and return the file as indicated by the output option (so follow the existing API design)

Things to look out for:

telemetry should be called only once, so you would need to rearrange the chrome_export utility a bit (sorry for the mess, btw, I know the API implementation is a bit confusing)
documentation needs to be clear on the fact that this requires ghostscript (as print_to_pdf otherwise doesn't)

-> With this new function I think we should not add a top-level ChromicPDF.merge/2 API anymore.

Next, since we're using ghostscript for the merge anyway, we would be able to use the existing ghostscript functionality, i.e. adding metadata (see t:info_option) as well as PDF/A conversion. But if you think this is too much for a single PR, feel free to use the Ghostscript.merge/2 function you've already implemented for now, and we'll see further afterwards.

xaviRodri · 2022-12-15T11:02:38Z

Thanks for taking a look!!

I'm right now trying to get it work 👍

what do you think about this spec?

@spec print_to_pdf(source() | source_and_options() | [source() | source_and_options()], [export_option()]) :: export_return()

So the current people using it don't need to change it into a list?
I can try to patternmatch or add a guard so we can distinguish the different options 👌

maltoe · 2022-12-15T12:40:10Z

what do you think about this spec?

Yes, sorry, this is what I had in mind, too. @spec above was meant to illustrate the new clause of the print_to_pdf/2 function. The new clause should match on the first param being a list.

And yes, let's absolutely make sure to not break existing API ;)

xaviRodri · 2022-12-16T13:46:38Z

Some updates @maltoe !

I got it working! 🎊
Currently I have it as a separate function but takes the same arguments as print_to_pdf/2.

I'm running into a "problem" that I've solved but maybe some help could improve the solution..

So, right now, if we use source_and_options/1 to create a source and then we call print_to_pdf/2, what we are doing is overriding the source's options (let's call them individual opts) with the function options (general opts maybe?).

But I was thinking on the possibility of merging both instead of overriding them, so we would be able to have some general attributes (e.g. margins for the whole merged pdf) but at the same time specify some for a particular "section" of the final file (e.g. I want the top margin bigger in this section, I want a different header for that section, etc.).
This won't work if we just override the print_to_pdf option... So I created a merge_print_options/2 function that, for each source, merges some of the options we have there.
I had conflicts with some default values that doesn't appear in the result of source_and_options/1, so I had to do some changes to the Template module (and that's making the tests fail).

Let me push these draft changes so you can check the idea better (don't expect a super clean code, specs and so as it needs still some work!)
If you think that these may be too much for what we need let me know! I'm 100% open to change it as many times as needed.

lib/chromic_pdf/api.ex

maltoe · 2022-12-17T09:58:19Z

lib/chromic_pdf/api.ex

+
+        tmp_path
+    end)
+  end


You can DRY up this function a bit by extracting the identical "branches" of the anonymous function into another function

lib/chromic_pdf/api/pdf_options.ex

lib/chromic_pdf/template.ex

maltoe · 2022-12-17T10:09:45Z

@xaviRodri Nice work already!

xaviRodri · 2022-12-19T07:15:19Z

I will play with it today too so I can address all the above 🔝

But I think we are now on the good direction 💯

Thanks a lot for the feedback till now 🙏

xaviRodri · 2022-12-19T11:17:32Z

More updates! 👀

I have a couple of questions before writing documentation & tests:

What to do with the with_telemetry clauses - I remembered you said we should only call it once.. but once for all the process or once for all the chrome_exports we do? (e.g. in print_to_pdfa/3 it's call it twice) I could move it one level up so with_telemetry calls the chrome_export (that's maybe what you wanted to say with "rearrange the chrome_export utility a bit").
Should I maintain the API's merge feature? Or should we only have it as a tool for the "print & merge"?

For the rest, if you have a sec to check it, I'm happy to change anything you find weird 👍

In order to be able to merge multiple PDF files, we have added this Ghostscript command that can be called using `ChromicPDF.merge/2`. This new function accepts a list of paths and an output option that can be added or not depending on the desired output (same as `print_to_pdf/2` output option).

In order to be able to print and merge multiple sources, we have done the following: * Added a temporary `print_to_pdf_and_merge/2` function that will, in a tmp dir, create all the individual PDF files and then merge them * Updated the way the `print_to_pdf` options are merged (so we don't just override them) * Update the `source_and_options/1` function so it accepts more style options (margins) * Update the `source_and_options/1` so you can specify that your styles will be written in your templates (`style: :custom` option) * Added a `PDFOptions.feed_merge_data_into_output/2` similar to `PDFAOptions.feed_ghostscript_file_into_output/2`

In order to have a single module where we manage how to output PDF files, we have created `ChromicPDF.OutputOptions` (using plural as we also do it for `PDFOptions`). In here we can find the previous `feed_chrome_data_into_output/2` and a new `feed_file_into_output/2` that replaces `PDFAOptions.feed_ghostscript_file_into_output/2` . That's because we will also use the same function to output merged PDF files.

As the feature is the same, just that getting an input with multiple sources, we have joined both functions into one. Also updated the specs to match this change.

lib/chromic_pdf/api.ex

lib/chromic_pdf/api/output_option.ex

lib/chromic_pdf/pdfa/ghostscript_worker.ex

maltoe · 2022-12-20T09:03:24Z

@xaviRodri

e.g. in print_to_pdfa/3 it's call it twice

Hmm, wasn't aware of this anymore. OK then, maybe we align with the existing behaviour and continue to send one print_to_pdf event for each print job and one new merge event for the merger? Not sure 🤷 IMO having one combined events would make it more useful for instrumentation, but we shouldn't change existing behaviour.

Should I maintain the API's merge feature?

Let's get rid of it for now. If someone wants to only merge PDFs, they can use pdfjoin or the like. Keep the main API small.

We will use the merge functionality internally only.

As we already get the full path, we don't need to expand it in this case.

Added an event around the merge operation in `print_to_pdf/3` to align with the current behaviour in other processes.

maltoe · 2022-12-20T15:25:54Z

@xaviRodri Looking good, thanks!

When you document this, please make sure to mention that Ghostscript is required for the merge step and that printing of individual files before merge is done sequentially. For testing, I'd suggest to add a test case to pdf_generation_test.exs and perhaps one to telemetry_test.exs

andreasknoepfle · 2022-12-21T09:59:39Z

printing of individual files before merge is done sequentially

I wonder if it would be worth adding a ticket after this to allowing parallel printing here (at least up to the pool size?). Wdyt @maltoe

maltoe · 2022-12-21T10:33:47Z

allowing parallel printing

I think that is more complex than we foresee now, but yeah, we could explore it for sure.

Minimal tests so we only cover the specific feature. The rest is already covered by other tests.

xaviRodri · 2022-12-21T15:46:19Z

Hey again!

Added some tests and documentation as we agreed!
I'm not 100% sure if the tests cover all the expected for this merge part, but I didn't want to add tests for already tested features (dynamic templates, etc.).

Also about the documentation: I added it inside "Input options" as this is related to the inputs. But if you think it should belong to another part let me know! 👍

Marking it as "Ready for review"! 🚀

maltoe

Thanks, this is great!

Since we already had a few iterations - and since I realized that I want to refactor a few things (types, docs, Ghostscript module) -, I'd merge this as-is and "own" your code in a follow-up commit. Hope you don't mind.

Thanks a lot for the contribution! 💜 💙 💛 Super happy about how this turned out. Very nice addition, especially given we've recently been asked about this very feature in #177

maltoe · 2022-12-21T19:24:30Z

lib/chromic_pdf/supervisor.ex

+      **This feature requires Ghostscript to merge different files.**
+
+      If you have multiple inputs, you can pass them as a list of sources. In this case, the different results will be merged into one single output.
+      Remember that the merge will be done sequentially, so the order of the given list is important.


🤔 The order of the list is significant either way, right? As in, the order determines the order of pages in the merged document. What I meant with "sequentially" in my original comment is that the processing is not concurrent, i.e. when you print n pages and printing a single page takes t time, you would expect a total runtime of >= nt (regardless of the number of sessions in Chrome)

Suggested change

Remember that the merge will be done sequentially, so the order of the given list is important.

Please note that processing of individual inputs is not concurrent, so total runtime will scale linearly with the number of inputs.

Oh I missunderstood it then!
Time of printing is also something to be aware of, true!
Let’s see if what @andreasknoepfle said about concurrent printing can become real 💪

maltoe · 2022-12-21T21:37:37Z

test/integration/pdf_generation_test.exs

+    setup do
+      start_supervised!(ChromicPDF)
+      :ok
+    end


You probably wanted this describe section to contain the tests above

First I started in here but moved them into the other describe!
Sorry I forgot to delete this 😅

maltoe mentioned this pull request Dec 15, 2022

Rendering a combined PDF of multiple "files" #177

Closed

xaviRodri marked this pull request as draft December 16, 2022 13:55

maltoe reviewed Dec 17, 2022

View reviewed changes

xaviRodri added 5 commits December 19, 2022 12:31

Revert Template changes

2b65edc

Move print_to_pdf_and_merge into print_to_pdf

c0325e7

As the feature is the same, just that getting an input with multiple sources, we have joined both functions into one. Also updated the specs to match this change.

xaviRodri force-pushed the feature/merge-pdf-files branch from b9c6ec3 to c0325e7 Compare December 19, 2022 11:35

Update specs

1039fc2

maltoe reviewed Dec 20, 2022

View reviewed changes

lib/chromic_pdf/api.ex Outdated Show resolved Hide resolved

lib/chromic_pdf/api.ex Outdated Show resolved Hide resolved

lib/chromic_pdf/api/output_option.ex Show resolved Hide resolved

lib/chromic_pdf/pdfa/ghostscript_worker.ex Outdated Show resolved Hide resolved

xaviRodri added 4 commits December 20, 2022 13:18

Consolidate source for print & merge feature

13531bd

Delete merge API feature

40385e0

We will use the merge functionality internally only.

Delete path expanding

5341a1e

As we already get the full path, we don't need to expand it in this case.

Add merge event

c253c71

Added an event around the merge operation in `print_to_pdf/3` to align with the current behaviour in other processes.

Add tests for merging feature

75d0e36

Minimal tests so we only cover the specific feature. The rest is already covered by other tests.

xaviRodri added 2 commits December 21, 2022 16:14

Delete unused credo comment

5f3f959

Add documentation for the new merge feature

d8aaac4

xaviRodri marked this pull request as ready for review December 21, 2022 15:46

xaviRodri requested a review from maltoe December 21, 2022 15:46

maltoe approved these changes Dec 21, 2022

View reviewed changes

maltoe merged commit bf896ab into bitcrowd:main Dec 21, 2022

andreasknoepfle mentioned this pull request Dec 22, 2022

Parallel printing for multiple sources #197

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add merge feature #194

Add merge feature #194

xaviRodri commented Dec 14, 2022

xaviRodri commented Dec 14, 2022

maltoe commented Dec 14, 2022 •

edited

Loading

xaviRodri commented Dec 14, 2022

maltoe commented Dec 15, 2022

xaviRodri commented Dec 15, 2022

maltoe commented Dec 15, 2022 •

edited

Loading

xaviRodri commented Dec 16, 2022 •

edited

Loading

maltoe Dec 17, 2022

maltoe commented Dec 17, 2022

xaviRodri commented Dec 19, 2022 •

edited

Loading

xaviRodri commented Dec 19, 2022

maltoe commented Dec 20, 2022

maltoe commented Dec 20, 2022

andreasknoepfle commented Dec 21, 2022 •

edited

Loading

maltoe commented Dec 21, 2022 •

edited

Loading

xaviRodri commented Dec 21, 2022

maltoe left a comment

maltoe Dec 21, 2022

xaviRodri Dec 22, 2022

maltoe Dec 21, 2022

xaviRodri Dec 22, 2022 •

edited

Loading

	Remember that the merge will be done sequentially, so the order of the given list is important.
	Please note that processing of individual inputs is not concurrent, so total runtime will scale linearly with the number of inputs.

Add merge feature #194

Add merge feature #194

Conversation

xaviRodri commented Dec 14, 2022

xaviRodri commented Dec 14, 2022

maltoe commented Dec 14, 2022 • edited Loading

xaviRodri commented Dec 14, 2022

maltoe commented Dec 15, 2022

xaviRodri commented Dec 15, 2022

maltoe commented Dec 15, 2022 • edited Loading

xaviRodri commented Dec 16, 2022 • edited Loading

maltoe Dec 17, 2022

Choose a reason for hiding this comment

maltoe commented Dec 17, 2022

xaviRodri commented Dec 19, 2022 • edited Loading

xaviRodri commented Dec 19, 2022

maltoe commented Dec 20, 2022

maltoe commented Dec 20, 2022

andreasknoepfle commented Dec 21, 2022 • edited Loading

maltoe commented Dec 21, 2022 • edited Loading

xaviRodri commented Dec 21, 2022

maltoe left a comment

Choose a reason for hiding this comment

maltoe Dec 21, 2022

Choose a reason for hiding this comment

xaviRodri Dec 22, 2022

Choose a reason for hiding this comment

maltoe Dec 21, 2022

Choose a reason for hiding this comment

xaviRodri Dec 22, 2022 • edited Loading

Choose a reason for hiding this comment

maltoe commented Dec 14, 2022 •

edited

Loading

maltoe commented Dec 15, 2022 •

edited

Loading

xaviRodri commented Dec 16, 2022 •

edited

Loading

xaviRodri commented Dec 19, 2022 •

edited

Loading

andreasknoepfle commented Dec 21, 2022 •

edited

Loading

maltoe commented Dec 21, 2022 •

edited

Loading

xaviRodri Dec 22, 2022 •

edited

Loading