This repository has been archived by the owner on Apr 15, 2024. It is now read-only.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Programming with PDFMiner</title>
</head>
<body>
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Mon Mar 24 11:49:28 UTC 2014
<!-- hhmts end -->
</div>
<p>
<a href="index.html">[Back to PDFMiner homepage]</a>
<h1>Programming with PDFMiner</h1>
<p>
This page explains how to use PDFMiner as a library
from other applications.
<ul>
<li> <a href="#overview">Overview</a>
<li> <a href="#basic">Basic Usage</a>
<li> <a href="#layout">Performing Layout Analysis</a>
<li> <a href="#tocextract">Obtaining Table of Contents</a>
<li> <a href="#extend">Extending Functionality</a>
</ul>
<h2><a name="overview">Overview</a></h2>
<p>
<strong>PDF is evil.</strong> Although it is called a PDF
"document", it's nothing like a Word or HTML document. PDF is more
like a graphic representation. PDF contents are just a bunch of
instructions that tell how to place the stuff at each exact
position on a display or paper. In most cases, it has no logical
structure such as sentences or paragraphs, and it cannot adapt
itself when the paper size changes. PDFMiner attempts to
reconstruct some of those structures by guessing from the
positioning of the text, but nothing is guaranteed to work. Ugly, I
know. Again, PDF is evil.
<p>
[More technical details about the internal structure of PDF:
"How to Extract Text Contents from PDF Manually"
<a href="http://www.youtube.com/watch?v=k34wRxaxA_c">(part 1)</a>
<a href="http://www.youtube.com/watch?v=_A1M4OdNsiQ">(part 2)</a>
<a href="http://www.youtube.com/watch?v=sfV_7cWPgZE">(part 3)</a>]
<p>
Because a PDF file has such a big and complex structure,
parsing a PDF file as a whole takes a lot of time and memory. However,
not every part is needed for most PDF processing tasks. Therefore
PDFMiner takes a strategy of lazy parsing, which is to parse the
stuff only when it's necessary. To parse PDF files, you need to use at
least two classes: <code>PDFParser</code> and <code>PDFDocument</code>.
These two objects are associated with each other.
<code>PDFParser</code> fetches data from a file,
and <code>PDFDocument</code> stores it. You'll also need
<code>PDFPageInterpreter</code> to process the page contents
and <code>PDFDevice</code> to translate them into whatever output you need.
<code>PDFResourceManager</code> is used to store
shared resources such as fonts or images.
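<p>
The idea of lazy parsing can be illustrated with a minimal sketch.
This is not PDFMiner's actual code: <code>LazyObjectStore</code>, its
<code>get</code> method and the use of <code>int</code> as a stand-in
"parser" are all hypothetical names chosen for illustration. The point is
only that an object is parsed the first time it is requested and the result is reused afterwards.

```python
class LazyObjectStore(object):
    """Hypothetical store that parses an object only on first access."""

    def __init__(self, raw_objects, parse):
        self._raw = raw_objects   # object id -> unparsed data
        self._parse = parse       # function that turns raw data into a value
        self._cache = {}          # object id -> parsed value
        self.parse_count = 0      # how many times parsing actually ran

    def get(self, objid):
        # Parse on demand, then remember the result for later lookups.
        if objid not in self._cache:
            self._cache[objid] = self._parse(self._raw[objid])
            self.parse_count += 1
        return self._cache[objid]


# int() stands in for a real (and far more expensive) parser.
store = LazyObjectStore({1: "42", 2: "7", 3: "1000"}, int)
first = store.get(1)
again = store.get(1)
```

<p>
After the two lookups, only object 1 has been parsed, and it has been parsed once.
Objects 2 and 3 stay untouched until something asks for them.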
<p>
Figure 1 shows the relationship between the classes in PDFMiner.
<div align=center>
<img src="objrel.png"><br>
<small>Figure 1. Relationships between PDFMiner classes</small>
</div>
<h2><a name="basic">Basic Usage</a></h2>
<p>
A typical way to parse a PDF file is the following:
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
<span class="comment"># Open a PDF file.</span>
fp = open('mypdf.pdf', 'rb')
<span class="comment"># Create a PDF parser object associated with the file object.</span>
parser = PDFParser(fp)
<span class="comment"># Create a PDF document object that stores the document structure.</span>
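<span class="comment"># Supply the password for initialization.</span>
<span class="comment"># (An empty string is assumed here for a file that is not encrypted.)</span>
password = ''
document = PDFDocument(parser, password)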
<span class="comment"># Check if the document allows text extraction. If not, abort.</span>
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
<span class="comment"># Create a PDF resource manager object that stores shared resources.</span>
rsrcmgr = PDFResourceManager()
<span class="comment"># Create a PDF device object.</span>
device = PDFDevice(rsrcmgr)
<span class="comment"># Create a PDF interpreter object.</span>
interpreter = PDFPageInterpreter(rsrcmgr, device)
<span class="comment"># Process each page contained in the document.</span>
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
</pre></blockquote>
<h2><a name="layout">Performing Layout Analysis</a></h2>
<p>
Here is a typical way to use the layout analysis function:
<blockquote><pre>
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
<span class="comment"># Set parameters for analysis.</span>
laparams = LAParams()
<span class="comment"># Create a PDF page aggregator object.</span>
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    <span class="comment"># Receive the LTPage object for the page.</span>
    layout = device.get_result()
</pre></blockquote>
A layout analyzer returns an <code>LTPage</code> object for each page
in the PDF document. This object contains child objects within the page,
forming a tree structure. Figure 2 shows the relationship between
these objects.
<div align=center>
<img src="layout.png"><br>
<small>Figure 2. Layout objects and their tree structure</small>
</div>
<dl>
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTCurve</code> and <code>LTLine</code>.
<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
The <code>get_text()</code> method returns the text content.
<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.
The <code>get_text()</code> method returns the text content.
<dt> <code>LTChar</code>
<dt> <code>LTAnno</code>
<dd> Represent an actual letter in the text as a Unicode string.
Note that, while an <code>LTChar</code> object has actual boundaries,
<code>LTAnno</code> objects do not, as these are "virtual" characters,
inserted by a layout analyzer according to the relationship between two characters
(e.g. a space).
<dt> <code>LTFigure</code>
<dd> Represents an area used by PDF Form objects. PDF Forms can be used to
present figures or pictures by embedding yet another PDF document within a page.
Note that <code>LTFigure</code> objects can appear recursively.
<dt> <code>LTImage</code>
<dd> Represents an image object. Embedded images can be
in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.
<dt> <code>LTLine</code>
<dd> Represents a single straight line.
Could be used for separating text or figures.
<dt> <code>LTRect</code>
<dd> Represents a rectangle.
Could be used for framing other pictures or figures.
<dt> <code>LTCurve</code>
<dd> Represents a generic Bezier curve.
</dl>
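<p>
The tree described above can be walked recursively. The following is a
minimal sketch, not part of PDFMiner: <code>iter_text</code> is a hypothetical helper,
and it assumes that container objects such as <code>LTPage</code> and
<code>LTFigure</code> can be iterated over to obtain their children, while text-bearing
objects expose <code>get_text()</code>. The <code>Fake...</code> classes are stand-ins
so that the sketch runs without a PDF file.

```python
def iter_text(layout_obj):
    """Yield text from a layout object and its descendants, depth first."""
    if hasattr(layout_obj, "get_text"):
        # Text-bearing objects (text boxes, lines, characters) report their own text.
        yield layout_obj.get_text()
        return
    try:
        children = iter(layout_obj)
    except TypeError:
        # Leaf objects without text (images, lines, rectangles, curves).
        return
    for child in children:
        for text in iter_text(child):
            yield text


# Stand-in classes, used only so that this sketch runs on its own.
class FakeText(object):
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeContainer(list):
    pass


class FakeImage(object):
    pass


page = FakeContainer([
    FakeText("Title\n"),
    FakeImage(),
    # Nested container, playing the role of an LTFigure.
    FakeContainer([FakeText("c"), FakeText("a"), FakeText("p")]),
])
texts = list(iter_text(page))
```

<p>
With a real document, the object returned by <code>device.get_result()</code>
would be passed to <code>iter_text</code> in place of the stand-in <code>page</code>.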
<p>
Also, check out <a href="http://denis.papathanasiou.org/?p=343">a more complete example by Denis Papathanasiou</a>.
<h2><a name="tocextract">Obtaining Table of Contents</a></h2>
<p>
PDFMiner provides functions to access the document's table of contents
("Outlines").
<blockquote><pre>
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
<span class="comment"># Open a PDF document.</span>
fp = open('mypdf.pdf', 'rb')
parser = PDFParser(fp)
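<span class="comment"># An empty password is assumed here for a file that is not encrypted.</span>
password = ''
document = PDFDocument(parser, password)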
<span class="comment"># Get the outlines of the document.</span>
outlines = document.get_outlines()
for (level,title,dest,a,se) in outlines:
    print (level, title)
</pre></blockquote>
<p>
Some PDF documents use page numbers as destinations, while others
use page numbers and the physical location within the page. Since
PDF does not have a logical structure, and it does not provide a
way to refer to any in-page object from the outside, there's no
way to tell exactly which part of the text these destinations are
referring to.
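<p>
Since each outline entry carries its nesting level, the entries can be turned
into an indented listing. The sketch below is only an illustration:
<code>format_outline</code> is a hypothetical helper, the sample tuples are made up,
and it assumes that top-level entries are reported with level 1.

```python
def format_outline(outlines, indent="  "):
    """Return one indented line per outline entry."""
    lines = []
    for entry in outlines:
        # Only the first two fields (level, title) are used here.
        level, title = entry[0], entry[1]
        # Assumption: top-level entries have level 1; never indent negatively.
        depth = max(level - 1, 0)
        lines.append(indent * depth + title)
    return lines


# Made-up entries in the same (level, title, dest, a, se) shape.
sample = [
    (1, "Introduction", None, None, None),
    (2, "Background", None, None, None),
    (1, "Conclusion", None, None, None),
]
toc_lines = format_outline(sample)
```

<p>
With a real document, the result of <code>document.get_outlines()</code> would be
passed in place of <code>sample</code>. A document without outlines may need separate handling.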
<h2><a name="extend">Extending Functionality</a></h2>
<p>
You can extend the <code>PDFPageInterpreter</code> and <code>PDFDevice</code> classes
in order to process page contents differently or to obtain other information.
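<p>
As a minimal sketch of such an extension, the following subclass of
<code>PDFDevice</code> only counts the pages it is asked to render.
<code>PageCountingDevice</code> and <code>page_count</code> are hypothetical names,
and the fallback base class exists only so that the sketch runs where PDFMiner is not installed.
It assumes that the interpreter calls <code>begin_page(page, ctm)</code> once per page.

```python
try:
    from pdfminer.pdfdevice import PDFDevice
except ImportError:
    # Stand-in base class so that the sketch runs without PDFMiner installed.
    class PDFDevice(object):
        def __init__(self, rsrcmgr):
            self.rsrcmgr = rsrcmgr

        def begin_page(self, page, ctm):
            pass


class PageCountingDevice(PDFDevice):
    """Hypothetical device that only counts the pages it is asked to render."""

    def __init__(self, rsrcmgr):
        PDFDevice.__init__(self, rsrcmgr)
        self.page_count = 0

    def begin_page(self, page, ctm):
        # Assumed to be called by the interpreter at the start of every page.
        self.page_count += 1


# Called by hand here; normally PDFPageInterpreter drives the device.
device = PageCountingDevice(None)
device.begin_page(object(), None)
device.begin_page(object(), None)
```

<p>
In real use, the device would be created with a <code>PDFResourceManager</code> and
handed to <code>PDFPageInterpreter</code>, as in the Basic Usage example.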
<hr noshade>
<address>Yusuke Shinyama</address>
</body>