PdfSplitter, PdfRearranger #251

Poltuu · 2021-01-04T09:11:58Z

This is my take on the PdfSplitter/PdfMerger/ general pdf pages recombination problem. Feel free to put me back in my place if anything is wrong.

It mostly builds up api based on the #248 logic, to allow treatment of several new documents at once, which is necessary when splitting a pdf. I didn't know how to cherry pick commit across forks, so this will conflict with #248, and the page selection logic comes from this branch.

The suggested new public apis do conflicts with the suggested new ones in #248, but since it's not merged yet, I thought I would give it a try.

I introduce the IPdfArrangement public api for ultimate control over the pages order to the user. I believe proposing a method instead of a collection of collection opens up more dynamic possibilities.

I tried to keep the commit chain readable, so that's the easiest way to read this PR IMO.

Suggested new apis:

class PdfMerger
- void Merge(string file1, string file2, Stream output)
- void Merge(Stream output, params string[] filePaths)
- void Merge(IReadOnlyList<Stream> streams, Stream output)
class PdfSplitter
- void SplitTwoParts(Stream file, int secondDocumentFirstPageIndex, Stream output1, Stream output2)
- void RemovePages(Stream file, IReadOnlyCollection<int> removedPages, Stream output, Stream removedPagesOutput = null)
- void SplitEveryPage(Stream file, IEnumerable<Stream> outputs, int pageCountPerFile = 1)
- void Split(Stream file, IReadOnlyCollection<(IReadOnlyCollection<int> Pages, Stream output)> pageBundles)
class PdfRearranger
- void Rearrange(IReadOnlyList<IInputBytes> files, IPdfArrangement arrangement, Stream output)
- void RearrangeMany(IReadOnlyList<IInputBytes> files, IEnumerable<(IPdfArrangement Arrangement, Stream Output)> rearrangements)
interface IPdfArrangement
- IEnumerable<(int FileIndex, IReadOnlyCollection<int> PageIndices)> GetArrangements(Dictionary<int, int> pagesCountPerFileIndex)

Other changes:

I spotted that PdfDocumentFactory.Open(string filename) uses File.ReadAllBytes, which allocates the whole document, when I believe opening a stream should be beneficial for large documents. This can be put in another PR.
I tried to favor streams over byte[], in general, which helps with big documents
I changed some of the PdfStreamWriter logic to allow flushing after every page tree. This releases references and should again help with large documents.

…ting the document in byte[]

Also, add api similar to PdfDocument regarding streams

in pdfmerger methods

to clear tokenReferences as we are merging the documents

(not on public api)

spotted on 68-1990-01_A.pdf

from commit 1c9b86b

InusualZ · 2021-01-05T00:52:46Z

First of all thanks for taking your time in putting this code together. This comment is my opinion and shouldn't be considered final at all.

But, I don't like the three new classes. Why? Because they are little bit verbose, they duplicate code and they do what PdfMerger already can do (I merged #248). The change related to Stream is welcome I was going to that myself at some point, for the same reason that you explain. But, this change introduce public API incompatibility so I don't know If @EliotJones would want to merge these changes as they are (Maybe instead add an overload). The change to the PdfStreamWriter I like but I think we should create a interface from PdfStreamWriter so we can have two classes that implement. The current PdfStreamWriter implementation and another that flushes the tokens immediately. That way we can offer the user the option to save time (while constructing document) or save memory (by flushing immediately the tokens to the stream).

Lastly, I think you should break this changes down into their own PR. Maybe one for the Stream and PdfStreamWriter and another for the Merger API changes

Poltuu · 2021-01-06T11:36:57Z

Thanks for your feedback.

The reason that I went for PdfSplitter, PdfMerger, PdfRearranger is because it seems to me that there is actually 4 different use cases that would rather have their own API.

	One Output	Many Output
One Input	Edition	Split
Many Input	Merge	Rearrangment

I guess they could all fall under the PdfRearranger class, but we already had a PdfMerger class, so it seemed natural to create new classes.
I don't really see how they duplicate code thus, everything ends up using the PdfRearranger code.

I would also like to underline that, thanks to the IPdfArrangement argument, PdfRearranger can now merge documents in all sorts of way (ZipMerge, pages from file1 then pages from file2 then pages from file1 again...) that PdfMerger could not.

Similarly, this allows to split documents in complex ways with a fairly simple api.

Finally, the PdfRearranger can split/merge several documents into several new documents, which could be handled by PdfMerger of course, but the api didn't not seem natural to me PdfMerger.Recombine(...), hence the new class.

Concerning the stream aspect of this PR, I'll split in in differents PR, as you suggested. I made sure not to break any public API thus.

Concerning PdfStreamWriter modifications, I'll suggest the modifications in another PR, as this is not core.

Poltuu added 12 commits January 2, 2021 15:10

Opening a PdfDocument via a file path open a stream instead of alloca…

d81c9d9

…ting the document in byte[]

PdfMerger favors streams instead of byte[]

486eb14

Also, add api similar to PdfDocument regarding streams

Add output stream as an argument

d8d879a

in pdfmerger methods

flush after every document

77f51af

to clear tokenReferences as we are merging the documents

add page cherry picking on merge

1b9d045

(not on public api)

split pdf merger with pdfrearranger

94365d0

add IPdfArrangement to parametrized the merge

2b252e6

fix potential stackoverflow

cb8243a

spotted on 68-1990-01_A.pdf

add pdfsplitter

13bcd89

add pdfRearranger

c3569d1

Impose artificial limit in the amount of page, that a node can contain

645fbe5

from commit 1c9b86b

change syntax for CI

914c3ff

Poltuu closed this Jan 15, 2021

Poltuu mentioned this pull request Jan 15, 2021

Pdfmerger complete rearrangment api #262

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PdfSplitter, PdfRearranger #251

PdfSplitter, PdfRearranger #251

Poltuu commented Jan 4, 2021

InusualZ commented Jan 5, 2021

Poltuu commented Jan 6, 2021 •

edited

PdfSplitter, PdfRearranger #251

PdfSplitter, PdfRearranger #251

Conversation

Poltuu commented Jan 4, 2021

InusualZ commented Jan 5, 2021

Poltuu commented Jan 6, 2021 • edited

Poltuu commented Jan 6, 2021 •

edited