State of the library

The library works with only a few PDFs, for two main reasons:

The transformation matrix and the graphics state are not handled
Fonts and encodings are not handled correctly

ExtractTablesFromPdf

Extracts tables (and paragraphs outside tables) from PDF files.

License limitations

(please read before use)

This software is released under the MIT license but uses iTextSharp v.4.1.6, which is released under the MPL/LGPL licenses. Before using this software you should also agree to the iTextSharp v.4.1.6 license. Take care if you upgrade iTextSharp, because newer versions are released under the AGPL.

What's this

PDF is a file format used to define device-independent page output. This project aims to retrieve text and tables from a PDF.

The main part is the Engine.

The Renderer is a debug window that helps you understand what the Engine is doing.

Usage

Call

var pages = ExtractText.Read(fileName);

to read all the pages.

Then, for every page, call

Page.DetermineTableStructures();
Page.DetermineParagraphs();
Page.FillContent();

To check whether you have already called the methods above, use

Page.IsRefreshed

After that you can access

Page.Contents

Contents is a collection of IPageContent objects ordered from the top of the page to the bottom.
An IPageContent can be a

Paragraph that contains text (Content)
Table that contains a matrix of text (Content[,])
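
The steps above can be combined into one small console program. This is only a sketch under several assumptions that the text above does not confirm: the namespace (guessed from the BuildTablesFromPdf.Engine project name), that IsRefreshed is true once the three methods have run, that Paragraph.Content can be printed directly, and that the first dimension of Table.Content indexes rows. It also relies on C# 7 pattern matching. Check these against the Engine source before relying on it.

```csharp
using System;
// Assumption: the Engine types live in a namespace named after the project.
using BuildTablesFromPdf.Engine;

class Program
{
    static void Main(string[] args)
    {
        // Read all the pages of the PDF whose path is passed on the command line.
        var pages = ExtractText.Read(args[0]);

        foreach (var page in pages)
        {
            // Assumption: IsRefreshed is true once the three methods below have run.
            if (!page.IsRefreshed)
            {
                page.DetermineTableStructures();
                page.DetermineParagraphs();
                page.FillContent();
            }

            // Contents is ordered from the top of the page to the bottom.
            foreach (var content in page.Contents)
            {
                if (content is Paragraph paragraph)
                {
                    Console.WriteLine(paragraph.Content);
                }
                else if (content is Table table)
                {
                    var cells = table.Content;
                    // Assumption: dimension 0 is rows, dimension 1 is columns.
                    for (int row = 0; row < cells.GetLength(0); row++)
                    {
                        for (int col = 0; col < cells.GetLength(1); col++)
                            Console.Write(cells[row, col] + "\t");
                        Console.WriteLine();
                    }
                }
            }
        }
    }
}
```

Printing cells separated by tabs keeps the example short; a real consumer would more likely write each table to CSV or map it to its own data structure.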
