changetocoding edited this page Oct 4, 2023 · 21 revisions

This wiki contains more detail on various aspects of the public API and the PDF document format.

Features

  • Extracts the position and size of letters from any PDF document. This enables access to the text and words in a PDF document.
  • Allows the user to retrieve images from the PDF document.
  • Allows the user to read PDF annotations, PDF forms, embedded documents and hyperlinks from a PDF.
  • Provides access to metadata in the document.
  • Exposes the internal structure of the PDF document.
  • Creates PDF documents containing text and path operations.
  • Reads content from encrypted files when the password is provided.
  • Document Layout Analysis - PdfPig also includes tools for document layout analysis, such as the Recursive XY Cut, Document Spectrum and Nearest Neighbour algorithms. It can also export page contents to the ALTO, PAGE XML and hOCR formats. See Document Layout Analysis.
  • Tables are not directly supported, but you can use Tabula Sharp or Camelot Sharp. As of 2023, Tabula Sharp is the most complete port.
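
Several of the features listed above can be combined in one short program. The following is a minimal sketch, not a definitive implementation: the file path and password are hypothetical placeholders, and it assumes the `ParsingOptions.Password`, `PdfDocument.Information`, `Page.GetImages()` and `IPdfImage.TryGetPng` members as they exist in recent PdfPig releases.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical path and password, for illustration only.
var options = new ParsingOptions { Password = "password" };

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\encrypted.pdf", options))
{
    // Metadata entries are optional in a PDF, so they may be null.
    string title = document.Information.Title ?? "(none)";
    string author = document.Information.Author ?? "(none)";

    Console.WriteLine($"Pages: {document.NumberOfPages}");
    Console.WriteLine($"Title: {title}");
    Console.WriteLine($"Author: {author}");

    foreach (Page page in document.GetPages())
    {
        foreach (IPdfImage image in page.GetImages())
        {
            // TryGetPng returns false when the image cannot be converted to PNG.
            bool hasPng = image.TryGetPng(out byte[] pngBytes);
            Console.WriteLine($"Page {page.Number}: image at {image.Bounds}, PNG available: {hasPng}");
        }
    }
}
```

For an unencrypted file the `ParsingOptions` argument can be omitted.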

PdfPig provides an alternative to commercial libraries such as SpirePDF, or copyleft libraries such as iText 7 (AGPL), for some use cases.

Note that the library does not support converting HTML or other document formats to PDF. For HTML to PDF, wkhtmltopdf is a good-quality solution. PdfPig also does not currently support generating images from PDF pages; if you need this functionality, see whether docnet meets your requirements.

Getting Started

PdfPig aims to provide two main areas of functionality:

  • Extracting PDF content.
  • Creating PDFs.

The simplest usage of the library for extracting content involves opening a document and extracting the position and text of all words across all pages:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	foreach (Page page in document.GetPages())
	{
		// Each Word exposes its text and its bounding box on the page.
		IEnumerable<Word> words = page.GetWords();
	}
}
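
To show what the extracted words contain, the loop above can be extended to print each word with its position. This is a minimal sketch under the same assumptions as the original snippet (a hypothetical file path); it uses the `Word.Text`, `Word.BoundingBox` and `Page.Number` members.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical path; replace with a real PDF file.
using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            // BoundingBox coordinates are in PDF user-space units.
            double left = word.BoundingBox.Left;
            double bottom = word.BoundingBox.Bottom;
            Console.WriteLine($"Page {page.Number}: '{word.Text}' at ({left:F1}, {bottom:F1})");
        }
    }
}
```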

Pages can also be accessed individually by page number, starting at 1. The positions and sizes of the individual letters on a page are available as well:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	Page page = document.GetPage(1);
	IReadOnlyList<Letter> letters = page.Letters;
}
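
The letters returned above carry more than their text. As a rough sketch (the file path is a hypothetical placeholder), the following prints the value, font name, point size and glyph rectangle of each letter on the first page, using the `Letter.Value`, `Letter.FontName`, `Letter.PointSize` and `Letter.GlyphRectangle` members.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;

// Hypothetical path; replace with a real PDF file.
using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
    // Page numbers start at 1, so the valid range is 1..NumberOfPages.
    if (document.NumberOfPages >= 1)
    {
        Page page = document.GetPage(1);

        foreach (Letter letter in page.Letters)
        {
            PdfRectangle box = letter.GlyphRectangle;
            Console.WriteLine(
                $"'{letter.Value}' font={letter.FontName} size={letter.PointSize:F1} " +
                $"left={box.Left:F1} bottom={box.Bottom:F1} width={box.Width:F1} height={box.Height:F1}");
        }
    }
}
```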

To create a document, use PdfDocumentBuilder with one of the Standard 14 fonts, which are included in the PDF specification:

using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
// Arguments: text, font size, position on the page, font.
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] b = builder.Build();

The resulting bytes are a valid PDF document and can be saved to the file system, served from a web server, etc.
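
As one way of using the resulting bytes, the sketch below writes them to disk and then reopens them with PdfPig to read the text back. The output file name `hello-world.pdf` is a hypothetical choice, and the round trip is only an illustrative check.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] bytes = builder.Build();

// Hypothetical output location, relative to the working directory.
File.WriteAllBytes("hello-world.pdf", bytes);

// Reopen the generated bytes to confirm that they parse as a PDF.
using (PdfDocument document = PdfDocument.Open(bytes))
{
    Console.WriteLine($"Pages: {document.NumberOfPages}");
    Console.WriteLine(document.GetPage(1).Text);
}
```

In a web application the same byte array could instead be written to the response body with an `application/pdf` content type.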

Contents

More details on the API can be found here.

Additional automated documentation from doc-comments can be found on DotNetApis.

Release Notes

Release notes and downloadable packages can be found on the releases page: https://github.com/UglyToad/PdfPig/releases