Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PdfSplitter, PdfRearranger #251

Closed
wants to merge 12 commits into from
Closed

PdfSplitter, PdfRearranger #251

wants to merge 12 commits into from

Conversation

Poltuu
Copy link
Contributor

@Poltuu Poltuu commented Jan 4, 2021

This is my take on the PdfSplitter/PdfMerger/ general pdf pages recombination problem. Feel free to put me back in my place if anything is wrong.

It mostly builds up api based on the #248 logic, to allow treatment of several new documents at once, which is necessary when splitting a pdf. I didn't know how to cherry pick commit across forks, so this will conflict with #248, and the page selection logic comes from this branch.

The suggested new public apis do conflicts with the suggested new ones in #248, but since it's not merged yet, I thought I would give it a try.

I introduce the IPdfArrangement public api for ultimate control over the pages order to the user. I believe proposing a method instead of a collection of collection opens up more dynamic possibilities.

I tried to keep the commit chain readable, so that's the easiest way to read this PR IMO.

Suggested new apis:

  • class PdfMerger

    • void Merge(string file1, string file2, Stream output)
    • void Merge(Stream output, params string[] filePaths)
    • void Merge(IReadOnlyList<Stream> streams, Stream output)
  • class PdfSplitter

    • void SplitTwoParts(Stream file, int secondDocumentFirstPageIndex, Stream output1, Stream output2)
    • void RemovePages(Stream file, IReadOnlyCollection<int> removedPages, Stream output, Stream removedPagesOutput = null)
    • void SplitEveryPage(Stream file, IEnumerable<Stream> outputs, int pageCountPerFile = 1)
    • void Split(Stream file, IReadOnlyCollection<(IReadOnlyCollection<int> Pages, Stream output)> pageBundles)
  • class PdfRearranger

    • void Rearrange(IReadOnlyList<IInputBytes> files, IPdfArrangement arrangement, Stream output)
    • void RearrangeMany(IReadOnlyList<IInputBytes> files, IEnumerable<(IPdfArrangement Arrangement, Stream Output)> rearrangements)
  • interface IPdfArrangement

    • IEnumerable<(int FileIndex, IReadOnlyCollection<int> PageIndices)> GetArrangements(Dictionary<int, int> pagesCountPerFileIndex)

Other changes:

  • I spotted that PdfDocumentFactory.Open(string filename) uses File.ReadAllBytes, which allocates the whole document, when I believe opening a stream should be beneficial for large documents. This can be put in another PR.
  • I tried to favor streams over byte[], in general, which helps with big documents
  • I changed some of the PdfStreamWriter logic to allow flushing after every page tree. This releases references and should again help with large documents.

@InusualZ
Copy link
Collaborator

InusualZ commented Jan 5, 2021

First of all thanks for taking your time in putting this code together. This comment is my opinion and shouldn't be considered final at all.

But, I don't like the three new classes. Why? Because they are little bit verbose, they duplicate code and they do what PdfMerger already can do (I merged #248). The change related to Stream is welcome I was going to that myself at some point, for the same reason that you explain. But, this change introduce public API incompatibility so I don't know If @EliotJones would want to merge these changes as they are (Maybe instead add an overload). The change to the PdfStreamWriter I like but I think we should create a interface from PdfStreamWriter so we can have two classes that implement. The current PdfStreamWriter implementation and another that flushes the tokens immediately. That way we can offer the user the option to save time (while constructing document) or save memory (by flushing immediately the tokens to the stream).

Lastly, I think you should break this changes down into their own PR. Maybe one for the Stream and PdfStreamWriter and another for the Merger API changes

@Poltuu
Copy link
Contributor Author

Poltuu commented Jan 6, 2021

Thanks for your feedback.

The reason that I went for PdfSplitter, PdfMerger, PdfRearranger is because it seems to me that there is actually 4 different use cases that would rather have their own API.

One Output Many Output
One Input Edition Split
Many Input Merge Rearrangment

I guess they could all fall under the PdfRearranger class, but we already had a PdfMerger class, so it seemed natural to create new classes.
I don't really see how they duplicate code thus, everything ends up using the PdfRearranger code.

I would also like to underline that, thanks to the IPdfArrangement argument, PdfRearranger can now merge documents in all sorts of way (ZipMerge, pages from file1 then pages from file2 then pages from file1 again...) that PdfMerger could not.

Similarly, this allows to split documents in complex ways with a fairly simple api.

Finally, the PdfRearranger can split/merge several documents into several new documents, which could be handled by PdfMerger of course, but the api didn't not seem natural to me PdfMerger.Recombine(...), hence the new class.


Concerning the stream aspect of this PR, I'll split in in differents PR, as you suggested. I made sure not to break any public API thus.


Concerning PdfStreamWriter modifications, I'll suggest the modifications in another PR, as this is not core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants