# Testing Textract (Functions) on OAs

Here we look at extracting text from PDF versions of office actions (OAs).

Textract is my preferred tool, but there appear to be issues installing it for Python 3. Its PDF parsing functions are short wrappers around command-line tools, so I replicate them here instead.
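Because the replicated functions shell out to external programs, it may be worth checking that those programs are installed before running anything. A minimal sketch, assuming the three executables called in this notebook (`pdftotext`, `pdftoppm` and `tesseract`); the helper name `missing_tools` is made up for illustration:

```python
import shutil


def missing_tools(tools=("pdftotext", "pdftoppm", "tesseract")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All command-line tools found.")
```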

In [1]:
# imports
import textract

ImportError: No module named 'textract'

In [2]:
import os

In [3]:
import subprocess

def run(args):
    """Run a command and return its (stdout, stderr) as bytes."""
    pipe = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = pipe.communicate()
    return stdout, stderr
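The `run` helper returns stderr but never inspects the exit status, so a failed command can pass unnoticed as empty output. One possible variant, sketched with `subprocess.run` (Python 3.5+), raises instead. The name `run_checked` is hypothetical, and the Python interpreter stands in for `pdftotext` so the example runs without any PDF tools installed:

```python
import subprocess
import sys


def run_checked(args):
    """Run a command and return its stdout as bytes; raise if it exits non-zero."""
    result = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise RuntimeError(
            "{} exited with status {}: {}".format(
                args[0],
                result.returncode,
                result.stderr.decode("utf-8", errors="replace"),
            )
        )
    return result.stdout


# The current interpreter is used here only as a stand-in command.
print(run_checked([sys.executable, "-c", "print('ok')"]))
```

Note that a missing executable still raises `FileNotFoundError` from `subprocess` itself, before any exit status is available.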

In [4]:
filename = 'OAs-OCR/06.06.2017 - Art 94(3).pdf'
args = ['pdftotext', filename, '-']

s, err = run(args)
print(s.decode("UTF-8"))

Europaisches
Patentamt

European Patent Office
80298 MUNICH
GERMANY
Tel: +49 89 2399 0
Fax: +49 89 2399 4465

European
Patent Office
Office europeen
des brevets

Formalities Officer
Name: Abadie, Nathalie
Tel: +49 89 2399 - 2746
or call
+31 (0)70 340 45 00

Illllll lllll lllll lllll 111111111111111111111111111111111
Zimmermann, Tankred Klaus
Schoppe, Zimmermann, St6ckeler
Zinkler, Schenk & Partner mbB
Patentanwalte
Radlkoferstrasse 2
81373 Munchen
ALLEMAGNE

Substantive Examiner
Name: Peller, Ingrid
Tel: +49 89 2399 - 7016

L

_J

Application No.

Ref.

Date

13 883 570.7 - 1958

HP130403PEP

06.06.2017

I

I

Applicant

Hewlett Packard Enterprise Development LP

Communication pursuant to Article 94(3) EPC
The examination of the above-identified application has revealed that it does not meet the requirements of the
European Patent Convention for the reasons enclosed herewith. If the deficiencies indicated are not rectified
the application may be refused pursuant to Article 97(2) EPC.
Y
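In the output above the header table comes apart: the labels (Application No., Ref., Date) and their values land on separate lines. `pdftotext` has a `-layout` option that tries to keep the physical layout of the page, plus `-f` and `-l` to restrict the page range. The following is only a sketch of an argument builder (the name `pdftotext_args` is hypothetical), and whether `-layout` actually helps on this document is untested here:

```python
def pdftotext_args(filename, layout=False, first_page=None, last_page=None):
    """Build a pdftotext command line that writes the extracted text to stdout."""
    args = ["pdftotext"]
    if layout:
        args.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        args += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        args += ["-l", str(last_page)]  # last page to convert
    # Options must come before the input file; "-" sends the text to stdout.
    return args + [filename, "-"]


print(pdftotext_args("OAs-OCR/06.06.2017 - Art 94(3).pdf", layout=True, first_page=1, last_page=1))
```

The resulting list can be passed straight to `run` in place of the hand-written `args`.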

In [5]:
text = s.decode("UTF-8")
print(text[0:100])

Europaisches
Patentamt

European Patent Office
80298 MUNICH
GERMANY
Tel: +49 89 2399 0
Fax: +49 89 2


In [14]:
import shutil
from tempfile import mkdtemp

def extract_tesseract(filename):
    """Extract text from a PDF by OCRing each page with tesseract."""
    temp_dir = mkdtemp()
    base = os.path.join(temp_dir, 'conv')
    contents = []
    try:
        # Render every page to an image file whose name starts with 'conv'
        run(['pdftoppm', filename, base])

        for page in sorted(os.listdir(temp_dir)):
            page_path = os.path.join(temp_dir, page)
            # run() returns (stdout, stderr); only stdout holds the page text
            page_text, _ = run(['tesseract', page_path, 'stdout'])
            contents.append(page_text)
        return b''.join(contents)
    finally:
        # Always remove the temporary page images
        shutil.rmtree(temp_dir)
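With both approaches available, one option is to try the embedded text layer first and fall back to OCR only when almost nothing comes back. The sketch below is an assumption about how the two could be combined, not how Textract does it. The name `extract_pdf_text` and the `min_chars` threshold are invented for illustration, and stub extractors replace the real tools so the block runs anywhere:

```python
def extract_pdf_text(filename, text_extractor, ocr_extractor, min_chars=20):
    """Return decoded text, using OCR only when the text layer looks empty.

    Both extractors take a filename and return bytes. `min_chars` is an
    arbitrary threshold chosen for illustration.
    """
    text = text_extractor(filename).decode("utf-8", errors="replace")
    if len(text.strip()) >= min_chars:
        return text
    return ocr_extractor(filename).decode("utf-8", errors="replace")


# Stub extractors stand in for pdftotext and tesseract.
def no_text_layer(filename):
    # Assumed output for a scanned PDF without a text layer: whitespace only.
    return b"\x0c"


def fake_ocr(filename):
    return b"Communication pursuant to Article 94(3) EPC"


print(extract_pdf_text("scan.pdf", no_text_layer, fake_ocr))
```

With the helpers defined in this notebook, the call would look like `extract_pdf_text(filename, lambda f: run(['pdftotext', f, '-'])[0], extract_tesseract)`.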

In [15]:
filename_nonocr = 'OAs/Search Opinion Positive.pdf'
text = extract_tesseract(filename_nonocr)

In [16]:
text

b'Eulopllxzhu European Paienl Oliice\npmmm\n\n80298 MUNiCH\n5:13:31.\xe2\x80\x9c GERMANY\nUV?!" wwphn Tel. +49 (0)89 2395 - 0\n\n \n\ndo: mm:\n\nFax +49 (0)39 2399 , 4465\n\n|l|||||||||| l|||| llllllllll l||l| l|||| l||l|| lllll l||| 0,3..me\n\nName: Nesciobeili, Kainrzyne\nFlutter, Paula Tel 4790\n\n   \n\nEIP \' or call:\nFairfax House RECEiVED +31(0)70 340 45 00\n15 Fuiwood Place\n\nLondon\nGreater London WC1V6 U 1 1 MAY 2012\nROYAUME UNI\n\n  \n    \n\n\xe2\x80\x9d1.:\n\nEIF\xe2\x80\x99\n\n \n\n\xe2\x80\x942012\n\n \n\n   \n  \n\ne arenas - Applismiun NcJFmen! No.\n\nE1338.800(F),EP 11160714 9 - 2224 / 2372552\n\npplicAnVProprielor\nIron Mountain |ncorporaied\n\n \n\n \n\n \n\nBRIEF COMMUNICATION\n\nSubiect: [N Ycurlerterof 05.04.2012\nEl Our telephone conversation of\n\n[:1 Communication of\n\nE]\n\nEncloeure(s): D Letter from the proprietor of the patent of\nEl Copy (copies)\nm Communication:\n\nPlease be informed that the European Search Opinion accompanying the European Search 

In [17]:
text.decode("UTF-8")

'Eulopllxzhu European Paienl Oliice\npmmm\n\n80298 MUNiCH\n5:13:31.“ GERMANY\nUV?!" wwphn Tel. +49 (0)89 2395 - 0\n\n \n\ndo: mm:\n\nFax +49 (0)39 2399 , 4465\n\n|l|||||||||| l|||| llllllllll l||l| l|||| l||l|| lllll l||| 0,3..me\n\nName: Nesciobeili, Kainrzyne\nFlutter, Paula Tel 4790\n\n   \n\nEIP \' or call:\nFairfax House RECEiVED +31(0)70 340 45 00\n15 Fuiwood Place\n\nLondon\nGreater London WC1V6 U 1 1 MAY 2012\nROYAUME UNI\n\n  \n    \n\n”1.:\n\nEIF’\n\n \n\n—2012\n\n \n\n   \n  \n\ne arenas - Applismiun NcJFmen! No.\n\nE1338.800(F),EP 11160714 9 - 2224 / 2372552\n\npplicAnVProprielor\nIron Mountain |ncorporaied\n\n \n\n \n\n \n\nBRIEF COMMUNICATION\n\nSubiect: [N Ycurlerterof 05.04.2012\nEl Our telephone conversation of\n\n[:1 Communication of\n\nE]\n\nEncloeure(s): D Letter from the proprietor of the patent of\nEl Copy (copies)\nm Communication:\n\nPlease be informed that the European Search Opinion accompanying the European Search Report\ndated September 7, 2011 is indeed a p

In [18]:
filename2 = 'OAs/06.06.2017 - Art 94(3).pdf'
text2 = extract_tesseract(filename2)
print(text2.decode("UTF-8"))

Eulopllsches
menu-m European Patent Otllce

 

5...”... 50298 MUNlCH

new ofﬁce GERMANY

wkuumym Tel +49 as 2399 a

des brevets Fax +49 89 2399 4465

F 7 Formlities onicer

Name Abadie,NathalIe
Tel +49 89 2399 - 2746
or Call

 

 

 

 

 

 

 

. ‘31 070 340 45 00
Zimmermann, Tankred Klaus ( ’

Schoppe. Zimmermann, Stockeler . .
Zinkler, Schenk & Partner mbB i:%i":':'§i§.5ﬁ;ﬂ§“e'
Patentanwﬁne Tel +49 89 2399 , 7015
Radlkoferstrasse 2

81373 Miinchen

 

 

 

 

ALLEMAGNE
L 4
Application No Rel Date
13 883 570.7 - 1958 HP130403PEP 06.06.2017
Applicant
Hewlett Packard Enterprise Development LP

 

 

 

Communication pursuant to Article 94(3) EPC

The examination of the above-identified application has revealed that it does not meet the requirements of the
European Patent Convention for the reasons enclosed herewith. If the deficiencies indicated are not rectified
the application may be refused pursuant to Article 97(2) EPC.

You are invited to file your observations and insofar as th