# What is pdfminer.six?
pdfminer.six is a community-maintained fork of the original PDFMiner, a tool for extracting information from PDF documents. It focuses on getting and analyzing text data. pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also report the exact location, font, or color of the text.
## Why .six?
The original goal of pdfminer.six was to add support for Python 3. This was done with the six package, which helps write code that is compatible with both Python 2 and Python 3. Hence the name pdfminer.six.
### Features
Parse all objects from a PDF document into Python objects.<br>
Analyze and group text in a human-readable way.<br>
Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged contents and more.

## Install

In [1]:
!pip install pdfminer.six



## Import

In [7]:
import pdfminer

## Extract text from a PDF using Python

In [13]:
from pdfminer.high_level import extract_text
text = extract_text('one_page.pdf')

print(repr(text))

print(text) 

' \n\n \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n\n  or  \n\n.   \n\n\x0c'
 

 
PDF Test File 
 
Congratulations, your computer is equipped with a PDF (Portable Document Format) 
reader!  You should be able to view any of the PDF documents and forms available on 
our site.  PDF forms are indicated by these icons: 
 
Yukon Department of Education 
Box 2703 
Whitehorse,Yukon 
Canada 
Y1A 2C6 
 
Please visit our website at:  http://www.education.gov.yk.ca/
   

  or  

.   
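
The repr output above ends in '\x0c', a form feed character. pdfminer.six appears to write one at the end of each page, so a multi-page result can be split back into pages. A minimal sketch under that assumption; `split_pages` is a hypothetical helper name, and `sample` is a shortened stand-in for real extract_text output, not actual data from one_page.pdf:

```python
def split_pages(text):
    """Split extract_text() output into one string per page.

    Assumes every page ends with a form feed character.
    """
    pages = text.split("\x0c")
    # The final form feed leaves an empty trailing chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Shortened stand-in for the string returned by extract_text().
sample = "PDF Test File \n\x0cSecond page \n\x0c"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page.strip()))
```

With a real file, we would pass the `text` variable from the cell above instead of `sample`.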




## Extract elements from a PDF using Python

The high-level functions cover common tasks. In this case, we can use extract_pages:

In [14]:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages("one_page.pdf"):
    for element in page_layout:
        print(element)

<LTTextBoxHorizontal(0) 197.400,660.468,200.736,672.468 ' \n'>
<LTTextBoxHorizontal(1) 72.000,455.448,532.853,661.368 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n'>
<LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>
<LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>
<LTFigure(Im0) 72.000,663.000,197.400,738.000 matrix=[125.40,0.00,0.00,75.00, (72.00,663.00)]>
<LTFigure(Im1) 342.420,588.960,343.980,591.120 matrix=[1.56,0.00,0.00,2.16, (342.42,588.96)]>
<LTFigure(Im2) 342.420,587.520,343.200,588.960 matrix=[0.78,0.00,0.00,1.44, (342.42,587.52)]>
<LTFigure(Im3) 342.420,586.800,343.980,587.520 matrix=[1.56,0.

Each element will be an LTTextBox, LTFigure, LTLine, LTRect or LTImage. Some of these can be iterated further: iterating through an LTTextBox gives LTTextLine objects, and iterating through an LTTextLine gives LTChar objects.
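
The nesting described above can be explored with a small recursive walk. This is a minimal sketch, not part of pdfminer.six: `walk` is a hypothetical helper, and the `Fake...` classes are stand-ins so the example runs without a PDF. It relies only on the assumption that container elements are iterable and leaf elements are not.

```python
def walk(element, depth=0):
    """Yield (depth, element) for an element and everything nested inside it."""
    yield depth, element
    # Containers (text boxes, text lines) are iterable; leaves such as
    # single characters are not, so iter() raises TypeError for them.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from walk(child, depth + 1)


class FakeChar:
    """Hypothetical stand-in for a leaf element such as LTChar."""


class FakeTextLine(list):
    """Hypothetical stand-in for an iterable element such as LTTextLine."""


class FakeTextBox(list):
    """Hypothetical stand-in for an iterable element such as LTTextBox."""


box = FakeTextBox([FakeTextLine([FakeChar(), FakeChar()])])
for depth, element in walk(box):
    # Indent each class name by its nesting depth.
    print("  " * depth + type(element).__name__)
```

With real data, each `page_layout` yielded by extract_pages could be passed to `walk` in place of `box`.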

### When we want to extract all of the text

In [15]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("one_page.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):        
            print(element.get_text())

 

 
PDF Test File 
 
Congratulations, your computer is equipped with a PDF (Portable Document Format) 
reader!  You should be able to view any of the PDF documents and forms available on 
our site.  PDF forms are indicated by these icons: 
 
Yukon Department of Education 
Box 2703 
Whitehorse,Yukon 
Canada 
Y1A 2C6 
 
Please visit our website at:  http://www.education.gov.yk.ca/
   

  or  

.   



isinstance() returns True if the specified object is of the specified type (or a subclass of it), otherwise False.
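
Because the check also succeeds for subclasses, testing against LTTextContainer catches the LTTextBoxHorizontal objects printed earlier. A small sketch of that behavior, using hypothetical stand-in classes (`TextContainer`, `TextBoxHorizontal`, `Figure`) rather than the real pdfminer.six ones:

```python
class TextContainer:
    """Stand-in for a base class such as LTTextContainer."""


class TextBoxHorizontal(TextContainer):
    """Stand-in for a subclass such as LTTextBoxHorizontal."""


class Figure:
    """Stand-in for an unrelated class such as LTFigure."""


elements = [TextBoxHorizontal(), Figure()]
# True for the subclass instance, False for the unrelated one.
matches = [isinstance(element, TextContainer) for element in elements]
print(matches)  # [True, False]
```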

### When we want to extract the font name or size of each individual character

In [16]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("one_page.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)

ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
ArialMT
12.0
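
Printing one pair per character gets long quickly. A sketch of tallying the pairs instead, assuming each character object exposes `fontname` and `size` as in the cell above. `count_fonts` and `FakeChar` are hypothetical names, and the sample values are made up for illustration:

```python
from collections import Counter, namedtuple


def count_fonts(chars):
    """Tally (fontname, size) pairs over an iterable of character objects."""
    return Counter((c.fontname, round(c.size, 1)) for c in chars)


# Hypothetical stand-in for LTChar so the sketch runs without a PDF.
FakeChar = namedtuple("FakeChar", ["fontname", "size"])
chars = [FakeChar("ArialMT", 12.0)] * 3 + [FakeChar("Arial-BoldMT", 14.0)]

for (fontname, size), count in count_fonts(chars).most_common():
    print(fontname, size, count)
```

In the loop above, the LTChar objects could be collected into a list and passed to `count_fonts` instead of being printed one by one.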

## How to extract AcroForm interactive form fields from a PDF using PDFMiner
AcroForms are the interactive forms found in PDF files with fillable fields or multiple-choice options.

In [15]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral, PSKeyword
from pdfminer.utils import decode_text


data = {}


def decode_value(value):
    """Decode a field value and return it as a string where possible."""
    # PSLiteral and PSKeyword objects carry their text in .name
    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name

    # Raw bytes are decoded to str
    if isinstance(value, bytes):
        value = decode_text(value)

    return value


with open('AcroForm_TEST.pdf', 'rb') as fp:
    # Initialize the parser and the PDFDocument objects
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    # Get the catalog and check that it has an AcroForm entry
    res = resolve1(doc.catalog)
    if 'AcroForm' not in res:
        raise ValueError("No AcroForm Found")

    # Resolve the AcroForm entry, then the field list itself,
    # since either one may be an indirect reference
    fields = resolve1(resolve1(res['AcroForm'])['Fields'])
    print(fields)

    for f in fields:
        field = resolve1(f)

        # Field name and field value(s)
        name, values = field.get('T'), field.get('V')

        # Decode the field name
        name = decode_text(name)

        # Resolve indirect field value objects
        values = resolve1(values)

        # Decode the value(s)
        if isinstance(values, list):
            values = [decode_value(v) for v in values]
        else:
            values = decode_value(values)

        data.update({name: values})

        print(name, values)

[<PDFObjRef:6>, <PDFObjRef:12>, <PDFObjRef:22>, <PDFObjRef:27>, <PDFObjRef:29>, <PDFObjRef:31>, <PDFObjRef:41>]
Push Button0 None
Check Box0 None
Radio Button0 None
Combo Box0 111
List Box0 fff
Text Field0 None
List Box1 ['mm11']
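
Many fields in the output above have no value. A minimal sketch of keeping only the filled ones, with the `data` dictionary retyped from the printed output (assuming the printed values 111 and fff are strings). `filled_fields` is a hypothetical helper name:

```python
# Field names and values copied from the output above.
data = {
    "Push Button0": None,
    "Check Box0": None,
    "Radio Button0": None,
    "Combo Box0": "111",
    "List Box0": "fff",
    "Text Field0": None,
    "List Box1": ["mm11"],
}


def filled_fields(form_data):
    """Return only the fields whose value is set (not None and not empty)."""
    return {
        name: value
        for name, value in form_data.items()
        if value not in (None, "", [])
    }


for name, value in filled_fields(data).items():
    print(name, value)
```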
