Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add merge feature #194

Merged
merged 13 commits into from
Dec 21, 2022
Merged

Add merge feature #194

merged 13 commits into from
Dec 21, 2022

Conversation

xaviRodri
Copy link
Contributor

In order to be able to merge multiple PDF files, we have added this Ghostscript command that can be called using ChromicPDF.merge/2.

This new function accepts a list of paths and an output option that can be added or not depending on the desired output (same as print_to_pdf/2 output option).

@xaviRodri
Copy link
Contributor Author

Hey!
What do you think about this?
I love and use your library, but I am generating multiple PDF files (as I need different headers) and I thought about pushing this PR so it was easy to merge them afterwards.
I already started with the testing, but wanted to ask for some advice on it:

  • would be better do run the test against some already generated PDF files in "priv/" for example?
  • or maybe just generating them using print_to_pdf/2?

I also thought on the possibility to add something else so we could print and merge multiple PDFs from different HTML templates. Any idea on this?

Thanks a lot!

@maltoe
Copy link
Collaborator

maltoe commented Dec 14, 2022

Hey @xaviRodri

Thanks for your contribution! I think a merge function is indeed a nice little addition - even though people keep telling me that the Ghostscript part of ChromicPDF should actually be its own library 🙃

I'll take some time looking into your changes later. Already have 2 questions for you though:

  • What happens to the metadata set on the individual PDFs? (= if I set a title on one or both of them, what does pdfinfo show for the merged file?)
  • I'm assuming the PDF/A features are also "lost" when merging two PDF/A files. Did you try that?

I already started with the testing, but wanted to ask for some advice on it:

  • would be better do run the test against some already generated PDF files in "priv/" for example?
  • or maybe just generating them using print_to_pdf/2?

No strong opinion, but probably would go with the pre-generated files in priv as it's completely independent of the PDF printing

I also thought on the possibility to add something else so we could print and merge multiple PDFs from different HTML templates. Any idea on this?

We have the source_and_options map already, we could add an API that consumes multiple of these and runs the merge step at the end.

[
  Template.source_and_options(...),
  Template.source_and_options(...)
]
|> ChromicPDF.print_to_pdf_and_merge(output: "merged.pdf")

Naming tbd :)

@xaviRodri
Copy link
Contributor Author

Thanks for the fast answer @maltoe !! 💪

First, to answer to your questions:

What happens to the metadata set on the individual PDFs?

It depends on 2 factors:

  • if only one of them has metadata, the merged one will take it from that one (order doesn't affect here)
  • if more than one has metadata, the last one being merged will give the meta to the merged file (order is important here)

I'm assuming the PDF/A features are also "lost" when merging two PDF/A files. Did you try that?

I'm not super familiar with all the PDF/A features.. but I tried to see an example file with xpdf and looks like the merged file still warns me about syntax and so.

Also, I like a lot the idea to get a list of source and options and create a print_to_pdf_and_merge/2 function!
Let me try it too because that would fit my use case 100%

Thanks again! 💯

@maltoe
Copy link
Collaborator

maltoe commented Dec 15, 2022

@xaviRodri Thought about it some more and figured we could as well just extend the existing print_to_pdf/2 function with a new clause. I pushed a small commit changing around the parameter types slightly, hopefully you can make sense of it.

New signature would be:

@spec print_to_pdf([source() | source_and_options()], [export_option()]) :: export_return()

Behaviour:

  • create a tmpdir
  • print a pdf for each source in the list
  • merge them all and return the file as indicated by the output option (so follow the existing API design)

Things to look out for:

  • telemetry should be called only once, so you would need to rearrange the chrome_export utility a bit (sorry for the mess, btw, I know the API implementation is a bit confusing)
  • documentation needs to be clear on the fact that this requires ghostscript (as print_to_pdf otherwise doesn't)

-> With this new function I think we should not add a top-level ChromicPDF.merge/2 API anymore.


Next, since we're using ghostscript for the merge anyway, we would be able to use the existing ghostscript functionality, i.e. adding metadata (see t:info_option) as well as PDF/A conversion. But if you think this is too much for a single PR, feel free to use the Ghostscript.merge/2 function you've already implemented for now, and we'll see further afterwards.

@xaviRodri
Copy link
Contributor Author

Thanks for taking a look!!

I'm right now trying to get it work 👍

what do you think about this spec?

@spec print_to_pdf(source() | source_and_options() | [source() | source_and_options()], [export_option()]) :: export_return()

So the current people using it don't need to change it into a list?
I can try to patternmatch or add a guard so we can distinguish the different options 👌

@maltoe
Copy link
Collaborator

maltoe commented Dec 15, 2022

what do you think about this spec?

Yes, sorry, this is what I had in mind, too. @spec above was meant to illustrate the new clause of the print_to_pdf/2 function. The new clause should match on the first param being a list.

And yes, let's absolutely make sure to not break existing API ;)

@xaviRodri
Copy link
Contributor Author

xaviRodri commented Dec 16, 2022

Some updates @maltoe !

I got it working! 🎊
Currently I have it as a separate function but takes the same arguments as print_to_pdf/2.

I'm running into a "problem" that I've solved but maybe some help could improve the solution..

So, right now, if we use source_and_options/1 to create a source and then we call print_to_pdf/2, what we are doing is overriding the source's options (let's call them individual opts) with the function options (general opts maybe?).

But I was thinking on the possibility of merging both instead of overriding them, so we would be able to have some general attributes (e.g. margins for the whole merged pdf) but at the same time specify some for a particular "section" of the final file (e.g. I want the top margin bigger in this section, I want a different header for that section, etc.).
This won't work if we just override the print_to_pdf option... So I created a merge_print_options/2 function that, for each source, merges some of the options we have there.
I had conflicts with some default values that doesn't appear in the result of source_and_options/1, so I had to do some changes to the Template module (and that's making the tests fail).

Let me push these draft changes so you can check the idea better (don't expect a super clean code, specs and so as it needs still some work!)
If you think that these may be too much for what we need let me know! I'm 100% open to change it as many times as needed.

@xaviRodri xaviRodri marked this pull request as draft December 16, 2022 13:55
lib/chromic_pdf/api.ex Outdated Show resolved Hide resolved

tmp_path
end)
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can DRY up this function a bit by extracting the identical "branches" of the anonymous function into another function

lib/chromic_pdf/api/pdf_options.ex Outdated Show resolved Hide resolved
lib/chromic_pdf/template.ex Outdated Show resolved Hide resolved
@maltoe
Copy link
Collaborator

maltoe commented Dec 17, 2022

@xaviRodri Nice work already!

@xaviRodri
Copy link
Contributor Author

xaviRodri commented Dec 19, 2022

I will play with it today too so I can address all the above 🔝

But I think we are now on the good direction 💯

Thanks a lot for the feedback till now 🙏

@xaviRodri
Copy link
Contributor Author

More updates! 👀

I have a couple of questions before writing documentation & tests:

  1. What to do with the with_telemetry clauses - I remembered you said we should only call it once.. but once for all the process or once for all the chrome_exports we do? (e.g. in print_to_pdfa/3 it's call it twice) I could move it one level up so with_telemetry calls the chrome_export (that's maybe what you wanted to say with "rearrange the chrome_export utility a bit").
  2. Should I maintain the API's merge feature? Or should we only have it as a tool for the "print & merge"?

For the rest, if you have a sec to check it, I'm happy to change anything you find weird 👍

In order to be able to merge multiple PDF files, we have added this Ghostscript command that can be called using `ChromicPDF.merge/2`.

This new function accepts a list of paths and an output option that can be added or not depending on the desired output (same as `print_to_pdf/2` output option).
In order to be able to print and merge multiple sources, we have done the following:

* Added a temporary `print_to_pdf_and_merge/2` function that will, in a tmp dir, create all the individual PDF files and then merge them
* Updated the way the `print_to_pdf` options are merged (so we don't just override them)
* Update the `source_and_options/1` function so it accepts more style options (margins)
 * Update the `source_and_options/1` so you can specify that your styles will be written in your templates (`style: :custom` option)
* Added a `PDFOptions.feed_merge_data_into_output/2` similar to `PDFAOptions.feed_ghostscript_file_into_output/2`
In order to have a single module where we manage how to output PDF files, we have created `ChromicPDF.OutputOptions` (using plural as we also do it for `PDFOptions`).

In here we can find the previous `feed_chrome_data_into_output/2` and a new `feed_file_into_output/2` that replaces `PDFAOptions.feed_ghostscript_file_into_output/2` . That's because we will also use the same function to output merged PDF files.
As the feature is the same, just that getting an input with multiple sources, we have joined both functions into one.

Also updated the specs to match this change.
lib/chromic_pdf/api.ex Outdated Show resolved Hide resolved
lib/chromic_pdf/api.ex Outdated Show resolved Hide resolved
lib/chromic_pdf/api/output_option.ex Show resolved Hide resolved
lib/chromic_pdf/pdfa/ghostscript_worker.ex Outdated Show resolved Hide resolved
@maltoe
Copy link
Collaborator

maltoe commented Dec 20, 2022

@xaviRodri

e.g. in print_to_pdfa/3 it's call it twice

Hmm, wasn't aware of this anymore. OK then, maybe we align with the existing behaviour and continue to send one print_to_pdf event for each print job and one new merge event for the merger? Not sure 🤷 IMO having one combined events would make it more useful for instrumentation, but we shouldn't change existing behaviour.

Should I maintain the API's merge feature?

Let's get rid of it for now. If someone wants to only merge PDFs, they can use pdfjoin or the like. Keep the main API small.

We will use the merge functionality internally only.
As we already get the full path, we don't need to expand it in this case.
Added an event around the merge operation in `print_to_pdf/3` to align with the current behaviour in other processes.
@maltoe
Copy link
Collaborator

maltoe commented Dec 20, 2022

@xaviRodri Looking good, thanks!

When you document this, please make sure to mention that Ghostscript is required for the merge step and that printing of individual files before merge is done sequentially. For testing, I'd suggest to add a test case to pdf_generation_test.exs and perhaps one to telemetry_test.exs

@andreasknoepfle
Copy link
Member

andreasknoepfle commented Dec 21, 2022

printing of individual files before merge is done sequentially

I wonder if it would be worth adding a ticket after this to allowing parallel printing here (at least up to the pool size?). Wdyt @maltoe

@maltoe
Copy link
Collaborator

maltoe commented Dec 21, 2022

allowing parallel printing

I think that is more complex than we foresee now, but yeah, we could explore it for sure.

Minimal tests so we only cover the specific feature.

The rest is already covered by other tests.
@xaviRodri
Copy link
Contributor Author

Hey again!

Added some tests and documentation as we agreed!
I'm not 100% sure if the tests cover all the expected for this merge part, but I didn't want to add tests for already tested features (dynamic templates, etc.).

Also about the documentation: I added it inside "Input options" as this is related to the inputs. But if you think it should belong to another part let me know! 👍

Marking it as "Ready for review"! 🚀

@xaviRodri xaviRodri marked this pull request as ready for review December 21, 2022 15:46
Copy link
Collaborator

@maltoe maltoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great!

Since we already had a few iterations - and since I realized that I want to refactor a few things (types, docs, Ghostscript module) -, I'd merge this as-is and "own" your code in a follow-up commit. Hope you don't mind.

Thanks a lot for the contribution! 💜 💙 💛 Super happy about how this turned out. Very nice addition, especially given we've recently been asked about this very feature in #177

**This feature requires Ghostscript to merge different files.**

If you have multiple inputs, you can pass them as a list of sources. In this case, the different results will be merged into one single output.
Remember that the merge will be done sequentially, so the order of the given list is important.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 The order of the list is significant either way, right? As in, the order determines the order of pages in the merged document. What I meant with "sequentially" in my original comment is that the processing is not concurrent, i.e. when you print n pages and printing a single page takes t time, you would expect a total runtime of >= nt (regardless of the number of sessions in Chrome)

Suggested change
Remember that the merge will be done sequentially, so the order of the given list is important.
Please note that processing of individual inputs is not concurrent, so total runtime will scale linearly with the number of inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I missunderstood it then!
Time of printing is also something to be aware of, true!
Let’s see if what @andreasknoepfle said about concurrent printing can become real 💪

setup do
start_supervised!(ChromicPDF)
:ok
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably wanted this describe section to contain the tests above

Copy link
Contributor Author

@xaviRodri xaviRodri Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I started in here but moved them into the other describe!
Sorry I forgot to delete this 😅

@maltoe maltoe merged commit bf896ab into bitcrowd:main Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants