
PDF Embedding

gal kahana edited this page Mar 29, 2024 · 16 revisions

Various applications need to use another PDF in the process of creating a new one. For instance, applications that merge multiple PDF files into one need to recreate the pages of those PDFs in the new file. Imposition applications might want to use pages as placed objects in a newly created page.

For this purpose, you can use the PDFWriter methods for embedding PDF. Two methods are supported:

  • Use PDF Pages as pages, simply appending them to the pages of the generated PDF.
  • Use PDF pages as components in the creation of one or more pages in the generated PDF. This method is, in turn, divided into two sub-methods:
    • Embed pages as form XObjects. The library creates a list of form XObjects based on the source PDF pages. You can then use them as regular form XObjects, placing them on pages in the generated PDF, in one or more locations. With form XObjects the content becomes potentially reusable.
    • Embed pages directly as content of existing pages. The library merges the content of source PDF pages into the content of a target page, allowing a one-time inclusion of the graphics. This fits scenarios of placing content that will not be reused on other pages.

In addition to embedding pages, the library also provides the ability to copy other objects of interest, based on the user's choice. This is important for extensibility. Note that the library recreates pages according to the features it is familiar with. For example, it can recreate page content. However, annotations, which are not supported for the time being, will not be copied; you will have to use the extensibility options for them. These options are explained here as well.

All types of PDFs are supported: regular non-updated PDFs, PDFs with incremental changes, linearized PDFs, and PDFs from version 1.5 and up that use object streams.

Note that PDF input can come from any stream, and not just files. For a discussion of custom stream input see Custom input and output.

Appending pages

Appending pages from another PDF is rather simple. Here's an example:

PDFWriter pdfWriter;
pdfWriter.StartPDF("C:\\MyPDF.PDF",ePDFVersion13);
pdfWriter.AppendPDFPagesFromPDF("C:\\OtherPDF.pdf",PDFPageRange());
pdfWriter.EndPDF();

In this example, all pages of OtherPDF.pdf are appended to the result PDF (which is MyPDF.PDF). Notice the third line, which contains the call to AppendPDFPagesFromPDF. The first parameter is the path of the PDF to take the pages from. The second parameter is the choice of pages.

To select which pages to append, use the PDFPageRange structure. The structure has two members: mType and mSpecificRanges. mType can be either eRangeTypeAll, for all pages, or eRangeTypeSpecific, which denotes that only selected pages should be used. To select the pages to append, use the second member, mSpecificRanges. This member is a list of pairs of unsigned longs (essentially list< pair<unsigned long,unsigned long> >), where each member is an inclusive range. For example, providing (1,3) and (5,9) in the list will append pages 1,2,3,5,6,7,8,9. Page indexes are 0 based, so these are the 2nd to 4th and 6th to 10th pages of the source PDF.

The default constructor of this structure, as used in this example, simply means that all pages should be embedded.
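To make the range semantics concrete, here is a minimal standalone sketch that expands a list of inclusive, 0 based ranges into the individual page indexes they select. It does not use the library: the PageRangeList typedef and the ExpandPageRanges function are hypothetical names introduced for illustration, and only the list-of-pairs shape mirrors mSpecificRanges as described above.

```cpp
#include <list>
#include <utility>
#include <vector>

// Stand-in for the type of PDFPageRange::mSpecificRanges as described above:
// a list of inclusive (first,last) pairs of 0 based page indexes.
typedef std::list<std::pair<unsigned long, unsigned long> > PageRangeList;

// Hypothetical helper (not part of the library): lists every page index
// selected by the ranges, in the order the ranges are given.
std::vector<unsigned long> ExpandPageRanges(const PageRangeList& inRanges)
{
    std::vector<unsigned long> pages;
    for (PageRangeList::const_iterator it = inRanges.begin(); it != inRanges.end(); ++it)
    {
        // both ends of each range are included
        for (unsigned long page = it->first; page <= it->second; ++page)
            pages.push_back(page);
    }
    return pages;
}
```

With the ranges (1,3) and (5,9) from the text, the helper yields the eight indexes 1,2,3,5,6,7,8,9.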

The complete signature of AppendPDFPagesFromPDF is as follows:

EStatusCodeAndObjectIDTypeList AppendPDFPagesFromPDF(
            const string& inPDFFilePath,
            const PDFPageRange& inPageRange,
            const ObjectIDTypeList& inCopyAdditionalObjects = ObjectIDTypeList());

We discussed the first and second parameters. Note, in addition, that as is the case with all text input, the path should be encoded using UTF8. The third parameter is intended for extensibility, for copying non-page related objects from the source PDF. We'll discuss it later. It has a default value, so you can ignore it for now.
The return value is a pair of a status code and a list of object IDs (in essence pair<EStatusCode, list<ObjectIDType> >). The status code tells whether appending succeeded or not. The list holds the object IDs of the created pages. This is useful when you wish to reference the pages from other objects. Yeah...Extensibility.

For a complete code example (more complete than this one, that is) you can check Append Pages Test.

Using pages as form XObjects

Sometimes you'll want to use the pages of an original PDF as graphic components of a new page. A good example is an imposition application that implements step and repeat: you can use the library to create an "imposed" PDF by creating form XObjects from the original PDF, and then placing them as content in the new PDF page (or pages...because they are reusable).

The following example shows how to do this:

PDFWriter pdfWriter;
pdfWriter.StartPDF("C:\\MyPDF.PDF",ePDFVersion13);
EStatusCodeAndObjectIDTypeList result = pdfWriter.CreateFormXObjectsFromPDF(
                                              "C:\\Other2PagePDF.PDF",
                                              PDFPageRange(),
                                              ePDFPageBoxMediaBox);
PDFPage* page = new PDFPage();
page->SetMediaBox(PDFRectangle(0,0,595,842));
PageContentContext* contentContext = pdfWriter.StartPageContentContext(page);

// place the first page in the top left corner of the document
contentContext->q();
contentContext->cm(0.5,0,0,0.5,0,421);
contentContext->Do(page->GetResourcesDictionary().AddFormXObjectMapping(result.second.front()));
contentContext->Q();

// place the second page in the bottom right corner of the document
contentContext->q();
contentContext->cm(0.5,0,0,0.5,297.5,0);
contentContext->Do(page->GetResourcesDictionary().AddFormXObjectMapping(result.second.back()));
contentContext->Q();

pdfWriter.EndPageContentContext(contentContext);
pdfWriter.WritePageAndRelease(page);
pdfWriter.EndPDF();

The important line is the 3rd one:

EStatusCodeAndObjectIDTypeList result = pdfWriter.CreateFormXObjectsFromPDF(
                                              "C:\\Other2PagePDF.PDF",
                                              PDFPageRange(),
                                              ePDFPageBoxMediaBox);

The call to CreateFormXObjectsFromPDF returns a pair of a status code and a list of object IDs, similar to the pages append function. This time, the IDs are of forms. You can use these IDs later when you wish to place the form XObject, such as in this line:

contentContext->Do(page->GetResourcesDictionary().AddFormXObjectMapping(result.second.front()));

which places the first "page".

The function receives three parameters here: the file name, the page range, and an enumerator of type EPDFPageBox. This last parameter determines which of the page's boxes to use as the form bounding box. In this example, the media box is used (ePDFPageBoxMediaBox).

The complete signature of CreateFormXObjectsFromPDF is as follows:

EStatusCodeAndObjectIDTypeList CreateFormXObjectsFromPDF(
                const string& inPDFFilePath,
                const PDFPageRange& inPageRange,
                EPDFPageBox inPageBoxToUseAsFormBox,
                const double* inTransformationMatrix = NULL,
                const ObjectIDTypeList& inCopyAdditionalObjects = ObjectIDTypeList());

The fourth parameter here is an optional transformation matrix to apply to the form. Its functionality is similar to the transformation matrix provided when creating form XObjects with the library. The last parameter is, again, a list of object IDs to copy from the source PDF in addition to the pages themselves, meant for extensibility.
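As an illustration of how such a matrix can be computed for step and repeat, the following standalone sketch builds the six values of a scale-and-translate matrix that fits a source page into one cell of a grid on the target page. GridCellMatrix is a hypothetical helper, not a library function. The values are in the same order as the operands of the cm calls in the example above; that inTransformationMatrix takes six doubles in that same order is an assumption here, so check the library headers before relying on it.

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper (not part of the library): computes the six values
// [a b c d e f] of a scale-and-translate matrix that fits a source page of
// inSourceWidth x inSourceHeight into one cell of a grid laid over the
// target page. Row 0 is the top row; PDF coordinates start at the bottom left.
std::array<double, 6> GridCellMatrix(
    double inSourceWidth, double inSourceHeight,
    double inTargetWidth, double inTargetHeight,
    unsigned int inColumns, unsigned int inRows,
    unsigned int inColumn, unsigned int inRow)
{
    const double cellWidth = inTargetWidth / inColumns;
    const double cellHeight = inTargetHeight / inRows;
    // uniform scale, so the page keeps its aspect ratio inside the cell
    const double scale = std::min(cellWidth / inSourceWidth, cellHeight / inSourceHeight);
    // the cell's bottom left corner; inRow must be smaller than inRows
    const double translateX = inColumn * cellWidth;
    const double translateY = (inRows - 1 - inRow) * cellHeight;
    std::array<double, 6> matrix = {{scale, 0.0, 0.0, scale, translateX, translateY}};
    return matrix;
}
```

For a 595x842 source page on a 595x842 target with a 2x2 grid, the top left cell (column 0, row 0) gives 0.5,0,0,0.5,0,421 and the bottom right cell (column 1, row 1) gives 0.5,0,0,0.5,297.5,0, which are the two placements used in the example above.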

For those of you who wish not to rely on one of the bounding boxes of the page, but rather to provide your own crop box, there is another overload for the CreateFormXObjectsFromPDF method:

EStatusCodeAndObjectIDTypeList CreateFormXObjectsFromPDF(
                const string& inPDFFilePath,
                const PDFPageRange& inPageRange,
                const PDFRectangle& inCropBox,
                const double* inTransformationMatrix = NULL,
                const ObjectIDTypeList& inCopyAdditionalObjects = ObjectIDTypeList());

This overload is very similar, differing in one parameter, inCropBox, which is the rectangle describing the crop box for the page. It will be used as the form XObject box. This overload fits cases where the page is known to contain graphics in only a particular area, but the PDF does not record this area in any of the bounding boxes common to a PDF page.

For a complete code example check PDF Embedding Test.

Merging pages content

Using a form XObject as a container for a source PDF page, in order to place it later in one or more pages, works especially well when the content is to be reused. This is due to the natural ability of forms to encapsulate code and be identified by their object ID. Sometimes, however, you don't need to reuse the content, and then the creation of a form might be an unnecessary overhead. In addition, some information in the source page, such as reusability information, may be lost when a mediator such as a form is used instead of placing the source page content directly in a target page.

For scenarios where it is more fitting to use the graphics just once, use the MergePDFPagesToPage function. This method accepts a target page as input, and injects the code of a source page into it. It does so at the point of calling the method, so any graphics already placed in the target page are maintained. The result is as if the graphics were placed there by the user through other methods.

This method fits unique placement of pages, as it does not allow reuse of the content. The following is a simple usage example:

PDFPage* page = new PDFPage();
page->SetMediaBox(PDFRectangle(0,0,595,842));

PDFPageRange singlePageRange;
singlePageRange.mType = PDFPageRange::eRangeTypeSpecific;
singlePageRange.mSpecificRanges.push_back(ULongAndULong(0,0));

pdfWriter.MergePDFPagesToPage(page,"C:\\Other2PagePDF.PDF",singlePageRange);

pdfWriter.WritePageAndRelease(page);

There are two interesting parts here. First, note the usage of PDFPageRange in the third to fifth rows. It is set to point to the first (0 indexed) page of the source document, and is later passed to MergePDFPagesToPage, which therefore merges just the first page of the source document.

The second thing of note is the call to MergePDFPagesToPage itself. The first parameter is the target page. The method will use its content stream and add the source page content to it. The second parameter is the source PDF file, and the last parameter is the PDFPageRange object defined earlier, which instructs the method to import just the first page.

The complete signature of MergePDFPagesToPage is as follows:

EStatusCode MergePDFPagesToPage(
                PDFPage* inPage,
                const string& inPDFFilePath,
                const PDFPageRange& inPageRange,
                const ObjectIDTypeList& inCopyAdditionalObjects = ObjectIDTypeList());

Something to notice about this function is that it works very well for embedding a single page. With more than one page, unless something is done, the pages will be positioned one on top of the other. You should either use it for a single page import, or try one of several possible strategies for importing pages directly:

  • Use DocumentContext events, through IDocumentContextExtender, to introduce positioning code between the pages using the OnBeforeMergePageFromPage and OnAfterMergePageFromPage events. This will solve most cases, but is a bit cumbersome.
  • Call MergePDFPagesToPage multiple times, once for each page. This is the easiest method, but it is very inefficient. Each separate call for embedding PDF content (unless the copying context is used) requires parsing the PDF header and directory content. Also, multiple calls for embedding don't share objects, while multiple additions of content in the same call do.
  • The best method is to create a copying context (as explained in the next section), and use its merging functionality. This allows multiple calls with shared elements. In addition, it allows you to merge some pages as immediate, unique content, some as reusable content through form XObjects, and some as complete pages, all with a single parsing pass. Amazing.

For a complete code example see PDF Merging Test.

Merging helper class

A helper class, PDFPageMergingHelper, exists for merging pages into target pages. Sometimes it may be preferable to the regular PDFWriter methods.

It has three overloads of a method called MergePageContent, as follows:

EStatusCode MergePageContent(PDFWriter* inWriter,const string& inPDFFilePath,unsigned long inPageIndex);
EStatusCode MergePageContent(PDFWriter* inWriter,IByteReaderWithPosition* inPDFStream,unsigned long inPageIndex);
EStatusCode MergePageContent(PDFDocumentCopyingContext* inCopyingContext,unsigned long inPageIndex); 

Each of these methods merges page content into the target page, with three different options for the source. To use it, just wrap the target page and call the method, like this:

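PDFPage* myPage = new PDFPage();
myPage->SetMediaBox(PDFRectangle(0,0,595,842));

// pdfWriter is assumed to be a PDFWriter object, as in the earlier examples; the helper takes a pointer to it
PDFPageMergingHelper(myPage).MergePageContent(&pdfWriter,"c:\\mySourceFile.pdf",0);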

The code example will merge the first page of mySourceFile.pdf into myPage.

Using copying context

In addition to the three methods above, you can copy pages and objects from a PDF in an alternative way. You can create a "Copying Context" and then use it for copying one page at a time, as a form XObject or as a page. It also allows you to inject page content directly and to copy miscellaneous PDF objects. Though more complex, the copying context path allows more sophisticated copying, including copying from multiple PDFs in an interleaved fashion, by creating multiple contexts and using them together.

To create a copying context, call the CreatePDFCopyingContext method of PDFWriter:

PDFDocumentCopyingContext* copyingContext = pdfWriter.CreatePDFCopyingContext("C:\\PDFLibTests\\TestMaterials\\BasicTIFFImagesTest.PDF");

This creates a context for copying content from the source PDF to the result PDF. You can now use the functions of the returned PDFDocumentCopyingContext:

EStatusCodeAndObjectIDType CreateFormXObjectFromPDFPage(
                         unsigned long inPageIndex,
                         EPDFPageBox inPageBoxToUseAsFormBox,
                         const double* inTransformationMatrix = NULL);

CreateFormXObjectFromPDFPage creates a form XObject from a page in the PDF. The page is indicated by inPageIndex. Note that using multiple calls here is similar to using the matching command from PDFWriter. Here, however, you get to make different decisions about the other parameters for each page.

There is another overload that lets you determine a custom box for the form XObject. If this is desirable, use this method instead:

EStatusCodeAndObjectIDType CreateFormXObjectFromPDFPage(
                         unsigned long inPageIndex,
                         const PDFRectangle& inCropBox,
                         const double* inTransformationMatrix = NULL);

This overload allows you to provide a custom crop box, instead of using one of the page boxes.

EStatusCodeAndObjectIDType AppendPDFPageFromPDF(unsigned long inPageIndex);

The 'AppendPDFPageFromPDF' appends a page (designated by the input index).

EStatusCode MergePDFPageToPage(PDFPage* inTargetPage,unsigned long inSourcePageIndex);

The 'MergePDFPageToPage' merges a source page content to a target page in the written PDF.

EStatusCode MergePDFPageToFormXObject(PDFFormXObject* inTargetFormXObject,unsigned long inSourcePageIndex);

The 'MergePDFPageToFormXObject' merges a source page content to a target form xobject in the written PDF. This is very useful for merging page content with other drawing commands in order to place it in multiple positions in the end result PDF.

EStatusCodeAndObjectIDType CopyObject(ObjectIDType inSourceObjectID);

This method allows you to copy any indirect object from the source PDF, by providing its object ID. This is good for extensibility options, for implementing currently unsupported features such as annotations.

Note that you can use these methods in any fashion you want - you can embed some pages of a PDF as pages, and some as XObjects, and even merge some. If that is what you are looking for then this method is preferred over multiple calls to the PDFWriter functions, because the copied objects will be shared...and so you'll get a more efficient PDF. Note that you can create multiple contexts for different PDFs at the same time, and embed pages from them in an interleaved manner.

EStatusCodeAndObjectIDTypeList CopyDirectObject(PDFObject* inObject);

EStatusCode CopyNewObjectsForDirectObject(const ObjectIDTypeList& inReferencedObjects);

You can also copy direct objects. Use the CopyDirectObject method with a PDFObject pointer. This is good when you are creating a new object and want to copy into it some direct objects from the original PDF. Mostly, this call is all that is required. Sometimes, however, the direct object will contain references to indirect objects, and specifically to objects that have not been copied yet. Obviously, when copying a direct object you don't want the library to start writing anything but this object, as that would break the PDF structure. For this reason, the CopyDirectObject method returns a pair of items: an EStatusCode and an ObjectIDType list. The second returned value is a list of indirect objects referenced by the direct object, which should be copied once you finish writing your current object. Copy them using the second method here, CopyNewObjectsForDirectObject. Please use this method ONLY for this scenario.

Now, if you need to copy multiple direct objects, and can't call CopyNewObjectsForDirectObject after each one, just merge the lists (make sure to truly merge, and not just append) and call CopyNewObjectsForDirectObject with the merged list when you can.
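One way to merge such lists, rather than just append them, is sketched below. It assumes that "truly merge" means each object ID ends up in the list only once, which is an interpretation and not something the library enforces. The typedefs are stand-ins that assume ObjectIDType is an unsigned integer and ObjectIDTypeList is a list of them, and MergeObjectIDLists is a hypothetical helper, not a library function.

```cpp
#include <list>
#include <set>

// Stand-ins for the library types, assumed here to be an unsigned integer
// and a list of them. Use the library's own definitions in real code.
typedef unsigned long ObjectIDType;
typedef std::list<ObjectIDType> ObjectIDTypeList;

// Hypothetical helper (not part of the library): adds to ioTarget every ID
// from inSource that ioTarget does not already hold, keeping first-seen order.
void MergeObjectIDLists(ObjectIDTypeList& ioTarget, const ObjectIDTypeList& inSource)
{
    std::set<ObjectIDType> seen(ioTarget.begin(), ioTarget.end());
    for (ObjectIDTypeList::const_iterator it = inSource.begin(); it != inSource.end(); ++it)
    {
        // insert() reports whether the ID was new to the set
        if (seen.insert(*it).second)
            ioTarget.push_back(*it);
    }
}
```

Merge each list returned by CopyDirectObject into one accumulated list this way, and pass the accumulated list to CopyNewObjectsForDirectObject once the current object is finished.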


You may sometimes want to embed a PDF page, but in the process have some objects of the source PDF replaced by newly created objects. An example of such an application is creating a PDF with reduced image resolution from a source PDF. To do that you need to embed the PDF document in a new PDF, while avoiding the copying of some of the content, namely the original hi-res images, and instead use other images wherever the pages use them.

To do this, use the following method of the Copying Context:

void ReplaceSourceObjects(const ObjectIDTypeToObjectIDTypeMap& inSourceObjectsToNewTargetObjects);

This method accepts an input map of source object ID to target object ID. Each entry in the map signals an object that should not be copied, and the object in the target PDF to which references to the non-copied object should go. Note the following:

  1. The target object does not have to exist while the copying happens. It may be a forward declaration.
  2. A single, shared, target object may serve multiple source objects.

You can call this method at any time in the lifetime of the copying context. Note, however, that any object that was already copied before the call will not be replaced.
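As a small illustration of the second point, the sketch below fills a replacement map in which several source objects (say, duplicate hi-res images) all point to one shared target object. The typedefs are stand-ins that assume the map type is a plain map from object ID to object ID, and AddSharedReplacement is a hypothetical helper, not a library function.

```cpp
#include <list>
#include <map>

// Stand-ins for the library types, assumed here to be an unsigned integer,
// a list of them, and a map between them.
typedef unsigned long ObjectIDType;
typedef std::list<ObjectIDType> ObjectIDTypeList;
typedef std::map<ObjectIDType, ObjectIDType> ObjectIDTypeToObjectIDTypeMap;

// Hypothetical helper (not part of the library): marks every source object
// in inSourceObjectIDs as replaced by the single target object inTargetObjectID.
void AddSharedReplacement(
    ObjectIDTypeToObjectIDTypeMap& ioReplacements,
    const ObjectIDTypeList& inSourceObjectIDs,
    ObjectIDType inTargetObjectID)
{
    for (ObjectIDTypeList::const_iterator it = inSourceObjectIDs.begin();
         it != inSourceObjectIDs.end(); ++it)
    {
        // key: object in the source PDF that should not be copied
        // value: object in the target PDF that references should point to
        ioReplacements[*it] = inTargetObjectID;
    }
}
```

The filled map is what you would then hand to ReplaceSourceObjects, before copying the pages that reference those source objects.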


In addition to just the embedding options you also get some nice "getters" for extensibility activities:

PDFParser* GetSourceDocumentParser();

This method returns the parser object for the input PDF. The parser object contains the interpreted xref and a list of page IDs (only!). You can use it to retrieve the PDF file objects. The parser is discussed in detail in PDF Parsing.

IByteReaderWithPosition* GetSourceDocumentStream();

You may want a handle to the source document stream. You can retrieve it using GetSourceDocumentStream. This is good for cases where you would like to parse the document while copying it. Specifically, it is important for PDF stream reading. For more details see PDF Parsing.

EStatusCodeAndObjectIDType GetCopiedObjectID(ObjectIDType inSourceObjectID);

If you want to know which object in the result PDF matches an object of the original PDF, use the GetCopiedObjectID method. Provide the source PDF object ID, and it will return a pair of a status code and an object ID. The status code is eSuccess if the object was copied, in which case the second member of the pair holds the object ID in the result PDF.

MapIterator<ObjectIDTypeToObjectIDTypeMap> GetCopiedObjectsMappingIterator();

For iterating all objects that were copied, you can use GetCopiedObjectsMappingIterator. This method returns a MapIterator iterator object that loops through the copied object IDs. The following example shows how to use it:

MapIterator<ObjectIDTypeToObjectIDTypeMap> it = context->GetCopiedObjectsMappingIterator();

while(it.MoveNext())
{
   ObjectIDType sourceObjectID = it.GetKey();
   ObjectIDType targetObjectID = it.GetValue();
}

The sourceObjectID in this example will have the original ID from the source PDF, and the targetObjectID will have the resulted copied object ID.

When done with the context, just delete it. (You can also call its End method before that, but the destructor does that as well, so there is no need.)

Something of note: you can create a PDF copying context (one or more) before starting to actually write the PDF with the StartPDF method call. This allows you to take the input PDF details into account when starting the file (even for the PDF level passed to StartPDF).

For a code example see here - PDF Copying Context Test

Events and Copying individual objects

The copying context gives quite a lot of control of the copying process to satisfy most extensibility requirements, when used together with the existing extensibility options of DocumentContext extenders (to say, add content to pages).

Still, there are some additional events that you can use, added to IDocumentContextExtender. To read more about them check out The DocumentContext Object.

Also, note that each of the PDFWriter methods for embedding pages (either as pages or XObjects) lets you copy individual objects. In most applications it is difficult to know in advance which objects you need to copy, so if you think you need such a capability, better use the copying context, which provides you with the parser and with individual object copying.
