# Testing Textract (Functions) on OAs

Here we look at extracting text from PDF versions of office actions (OAs).

Textract is my preferred tool, but there appear to be issues installing it for Python 3, so I replicate its PDF parsing functions here.

In [1]:
import os

In [2]:
import subprocess

def run(args):
    """Run a command as a subprocess and return its (stdout, stderr) as bytes."""
    pipe = subprocess.Popen(
                args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
    stdout, stderr = pipe.communicate()
    return stdout, stderr

In [3]:
filename = 'OAs-OCR/06.06.2017 - Art 94(3).pdf'
args = ['pdftotext', filename, '-']

s, err = run(args)
print(s.decode("UTF-8"))

Europaisches
Patentamt

European Patent Office
80298 MUNICH
GERMANY
Tel: +49 89 2399 0
Fax: +49 89 2399 4465

European
Patent Office
Office europeen
des brevets

Formalities Officer
Name: Abadie, Nathalie
Tel: +49 89 2399 - 2746
or call
+31 (0)70 340 45 00

Illllll lllll lllll lllll 111111111111111111111111111111111
Zimmermann, Tankred Klaus
Schoppe, Zimmermann, St6ckeler
Zinkler, Schenk & Partner mbB
Patentanwalte
Radlkoferstrasse 2
81373 Munchen
ALLEMAGNE

Substantive Examiner
Name: Peller, Ingrid
Tel: +49 89 2399 - 7016

L

_J

Application No.

Ref.

Date

13 883 570.7 - 1958

HP130403PEP

06.06.2017

I

I

Applicant

Hewlett Packard Enterprise Development LP

Communication pursuant to Article 94(3) EPC
The examination of the above-identified application has revealed that it does not meet the requirements of the
European Patent Convention for the reasons enclosed herewith. If the deficiencies indicated are not rectified
the application may be refused pursuant to Article 97(2) EPC.
Y

In [4]:
text = s.decode("UTF-8")
print(text[0:100])

Europaisches
Patentamt

European Patent Office
80298 MUNICH
GERMANY
Tel: +49 89 2399 0
Fax: +49 89 2


In [22]:
import shutil
from tempfile import mkdtemp

def extract_tesseract(filename):
    """Extract text from PDFs using tesseract (per-page OCR)."""
    temp_dir = mkdtemp()
    base = os.path.join(temp_dir, 'conv')
    contents = []
    try:
        # Render each PDF page to an image file in temp_dir
        run(['pdftoppm', filename, base])

        for page in sorted(os.listdir(temp_dir)):
            page_path = os.path.join(temp_dir, page)
            # run() returns (stdout, stderr); the OCR text is in stdout
            page_stdout, _ = run(['tesseract', page_path, 'stdout'])
            contents.append(page_stdout)
        return b''.join(contents).decode("UTF-8")
    finally:
        # Always clean up the temporary page images
        shutil.rmtree(temp_dir)

In [23]:
filename_nonocr = 'OAs/Search Opinion Positive.pdf'
text = extract_tesseract(filename_nonocr)

In [24]:
text

'Eulopllxzhu European Paienl Oliice\npmmm\n\n80298 MUNiCH\n5:13:31.“ GERMANY\nUV?!" wwphn Tel. +49 (0)89 2395 - 0\n\n \n\ndo: mm:\n\nFax +49 (0)39 2399 , 4465\n\n|l|||||||||| l|||| llllllllll l||l| l|||| l||l|| lllll l||| 0,3..me\n\nName: Nesciobeili, Kainrzyne\nFlutter, Paula Tel 4790\n\n   \n\nEIP \' or call:\nFairfax House RECEiVED +31(0)70 340 45 00\n15 Fuiwood Place\n\nLondon\nGreater London WC1V6 U 1 1 MAY 2012\nROYAUME UNI\n\n  \n    \n\n”1.:\n\nEIF’\n\n \n\n—2012\n\n \n\n   \n  \n\ne arenas - Applismiun NcJFmen! No.\n\nE1338.800(F),EP 11160714 9 - 2224 / 2372552\n\npplicAnVProprielor\nIron Mountain |ncorporaied\n\n \n\n \n\n \n\nBRIEF COMMUNICATION\n\nSubiect: [N Ycurlerterof 05.04.2012\nEl Our telephone conversation of\n\n[:1 Communication of\n\nE]\n\nEncloeure(s): D Letter from the proprietor of the patent of\nEl Copy (copies)\nm Communication:\n\nPlease be informed that the European Search Opinion accompanying the European Search Report\ndated September 7, 2011 is indeed a p

In [9]:
filename2 = 'OAs/06.06.2017 - Art 94(3).pdf'
text2 = extract_tesseract(filename2)
print(text2)

Eulopllsches
menu-m European Patent Otllce

 

5...”... 50298 MUNlCH

new ofﬁce GERMANY

wkuumym Tel +49 as 2399 a

des brevets Fax +49 89 2399 4465

F 7 Formlities onicer

Name Abadie,NathalIe
Tel +49 89 2399 - 2746
or Call

 

 

 

 

 

 

 

. ‘31 070 340 45 00
Zimmermann, Tankred Klaus ( ’

Schoppe. Zimmermann, Stockeler . .
Zinkler, Schenk & Partner mbB i:%i":':'§i§.5ﬁ;ﬂ§“e'
Patentanwﬁne Tel +49 89 2399 , 7015
Radlkoferstrasse 2

81373 Miinchen

 

 

 

 

ALLEMAGNE
L 4
Application No Rel Date
13 883 570.7 - 1958 HP130403PEP 06.06.2017
Applicant
Hewlett Packard Enterprise Development LP

 

 

 

Communication pursuant to Article 94(3) EPC

The examination of the above-identified application has revealed that it does not meet the requirements of the
European Patent Convention for the reasons enclosed herewith. If the deficiencies indicated are not rectified
the application may be refused pursuant to Article 97(2) EPC.

You are invited to file your observations and insofar as th

In [15]:
# Make a cleaner function
def get_text_from_pdf(ocred_pdf):
    """Extract text from an OCRed PDF file called ocred_pdf."""
    # Reuse run() from above; '-' tells pdftotext to write to stdout
    stdout, _ = run(['pdftotext', ocred_pdf, '-'])
    return stdout.decode("UTF-8")

In [29]:
ocred_pdf = os.listdir("OAs-OCR")[0]
print(ocred_pdf)

Article 94(3) EPC(2).pdf


In [28]:
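# os.listdir returns bare filenames; without the "OAs-OCR" folder prefix
# pdftotext presumably cannot find the file, hence the empty string below
get_text_from_pdf(ocred_pdf)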

''

## Get Test Data

Let's process the folders of example OAs to get a set of OA text to use for testing. We can get text from both the already-OCRed PDFs and the PDFs we OCR ourselves with Tesseract. Any cleaning functions should work on both.

For now we'll use a simple dictionary to hold the data (we can build an object class later).

In [35]:
# Start with pre-ocred files - these will be faster
import pickle

filename = 'ocr_text.pkl'
folder = "OAs-OCR"

if os.path.isfile(filename):
    with open(filename, 'rb') as f:
        ocr_text = pickle.load(f)
else:
    ocrfiles = os.listdir(folder)
    
    # Generate a dict with the filename as key - we can iterate through later using items
    ocr_text = dict()
    for file in ocrfiles:
        ocr_text[file] = get_text_from_pdf(os.path.join(folder,file))
        
    with open(filename, 'wb') as f:
        pickle.dump(ocr_text, f)

OAs-OCR/Article 94(3) EPC(2).pdf
OAs-OCR/Search Report(12).pdf
OAs-OCR/2016-5-18 - Search report.pdf
OAs-OCR/Extended Search Report(1).pdf
OAs-OCR/Response to Art 94(3) EPC(4).pdf
OAs-OCR/Response to Supplementary Search Report(2).pdf
OAs-OCR/2016-08-26 - Search Report.pdf
OAs-OCR/Article 94(3) EPC(3).pdf
OAs-OCR/2016-11-23 - Article 94(3) EPC.pdf
OAs-OCR/2016-06-20 - Search Report.pdf
OAs-OCR/2016-05-04 - Search report(1).pdf
OAs-OCR/11-07-2016 - Search Report.pdf
OAs-OCR/Article 94(3) EPC(27).pdf
OAs-OCR/Search Report(13).pdf
OAs-OCR/Further Processing Notice (Exam Report).pdf
OAs-OCR/Response to Article 94(3)(5).pdf
OAs-OCR/Response to Article 94(3)(4).pdf
OAs-OCR/Article 94(3)EPC(7).pdf
OAs-OCR/Response to Article 94(3)(8).pdf
OAs-OCR/Response to Art 94(3) as filed(1).pdf
OAs-OCR/Article 94(3) EPC(5).pdf
OAs-OCR/Search Report(21).pdf
OAs-OCR/Response to Article 94(3) as filed(4).pdf
OAs-OCR/Response to Art. 94(3) EPC(1).pdf
OAs-OCR/2017-06-21 - Search Report.pdf
OAs-OCR/Search Repo

OAs-OCR/Response to Supplementary Search Report(1).pdf
OAs-OCR/2016-08-18 - Article 94(3) EPC.pdf
OAs-OCR/Response to Article 94(3)(3).pdf
OAs-OCR/2016-04-15 - Search report.pdf
OAs-OCR/Article 94(3)EPC(6).pdf
OAs-OCR/Response to A94(3) as filed.pdf
OAs-OCR/Response to Art 94(3) EPC(3).pdf
OAs-OCR/Article 94(3) EPC(17).pdf
OAs-OCR/Response to Art 94(3).pdf
OAs-OCR/Supplementary Search Report(2).pdf
OAs-OCR/Response to Art 94(3) EPC(1).pdf
OAs-OCR/Response to Search Opinion(3).pdf


In [37]:
ocr_text['2016-08-18 - Article 94(3) EPC.pdf']

'European Patent Office\nPos1bus 5818\n2280 HV Rijswijk\nNETHERLANDS\nTel: +31 70 340 2040\nFax: +31 70 340 3016\n\nEuropaisches\nPatentamt\nEuropean\nPatent Office\nOffice europeen\ndes brevets\n\nFormalities Officer\nName: Sogno-Pabis, E\nTel: +31 70 340 - 2414\nor call\n+31 (0)70 340 45 00\n\nI llllll lllll lllll lllll 111111111111111111111111111111111\nMccann, Heather Alison\nEIP\nFairfax House\n15 Fulwood Place\nLondon WC1 V 6HU\nROYAUME-UNI\n\nSubstantive Examiner\nName: Bassanini, Anna\nTel: +31 70 340 - 2036\n\nApplication No.\n\nRef.\n\nDale\n\n07 844 381.9 - 1955\n\nE1129.651 (T)EPW\n\n18.08.2016\n\nApplicant\n\nTrading Technologies International, Inc.\n\nCommunication pursuant to Article 94(3) EPC\nThe examination of the above-identified application has revealed that it does not meet the requirements of the\nEuropean Patent Convention for the reasons enclosed herewith. If the deficiencies indicated are not rectified\nthe application may be refused pursuant to Article 97(2) E

In [38]:
# Now the files without a text layer - we OCR these ourselves with Tesseract

filename = 'nonocr_text.pkl'
folder = "OAs"

if os.path.isfile(filename):
    with open(filename, 'rb') as f:
        nonocr_text = pickle.load(f)
else:
    nonocrfiles = os.listdir(folder)
    
    # Generate a dict with the filename as key - we can iterate through later using items
    nonocr_text = dict()
    for file in nonocrfiles:
        nonocr_text[file] = extract_tesseract(os.path.join(folder,file))
        
    with open(filename, 'wb') as f:
        pickle.dump(nonocr_text, f)

## Cleaning EPO Office Action Data

### Header Text

We need to get rid of the text from the header of the letter.  

Basic heuristic - look for "\nEPO Form ... Date Feuille Demande n°:"

```
\n\nEPO Form 2906 01 .91TRI\n\n\x0cD~e\nDate\n\nAnmelde-Nr:\n\nBlatt\n\nDatum\n\n18.08.2016\n\nSheet\nFeuille\n\n8\n\nApplication No:\nDemande n°:\n\n0 7 8 4 4 3 81 . 9\n\n
```
Can we look for "\n\n.?EPO Form*\[0-9].?\n\n"?
```
\n\n {0,3}EPO.*\d {0,3}\n\n
```
(But the date is also a number followed by two newlines.)

We need to remove headers before we strip newlines, as we may need the runs of multiple newlines to spot the header.
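A rough sketch of the header heuristic, tried on the sample above. The pattern, the function name `remove_header_sketch` and the surrounding "page one / page two" text are my own assumptions for illustration; the pattern has only been checked against this one sample and will likely need loosening for other OCR output.

```python
import re

# Assumed pattern: from the "EPO Form" footer through to the application
# number that follows "Demande n°:" at the top of the next page.
# ".?" after "Demande n" tolerates the degree sign being garbled by OCR.
HEADER_PATTERN = re.compile(
    r"\n\n {0,3}EPO Form.*?Demande n.?:\s*[\d .]+\n\n",
    re.DOTALL,  # let "." run across the newlines inside the header
)

def remove_header_sketch(text):
    """Replace each footer/header block with a single paragraph break."""
    return HEADER_PATTERN.sub("\n\n", text)

# Header text copied from the sample above; the sentences around it
# are hypothetical placeholders.
sample = (
    "end of page one."
    "\n\nEPO Form 2906 01 .91TRI\n\n\x0cD~e\nDate\n\nAnmelde-Nr:\n\nBlatt\n\n"
    "Datum\n\n18.08.2016\n\nSheet\nFeuille\n\n8\n\nApplication No:\n"
    "Demande n°:\n\n0 7 8 4 4 3 81 . 9\n\n"
    "start of page two."
)
cleaned = remove_header_sketch(sample)
print(repr(cleaned))
```

Anchoring on both "EPO Form" and "Demande n°:" avoids the problem with the date: a bare "number followed by two newlines" test is no longer what ends the match.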

### Extra Newlines

These can be stripped.
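One possible way to do this, as a sketch. I am assuming we want to keep a single blank line as a paragraph break rather than remove every newline; the function name is hypothetical.

```python
import re

def strip_extra_newlines_sketch(text):
    """Collapse runs of blank lines into a single paragraph break."""
    # Lines holding only spaces or tabs count as blank, so drop
    # whitespace that sits directly before a newline.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Squeeze three or more newlines down to two.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

# Mimics the Tesseract output above, where lines contain a lone space
noisy = "Patentanwalte\n\n \n\n \n\nRadlkoferstrasse 2\n"
print(repr(strip_extra_newlines_sketch(noisy)))
```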

### Garbage lines

In [None]:
def clean_letter_text(text):
    """ Cleans text from an OCRed office action. """
    pass

def strip_extra_newline(text):
    """ Strip extra newlines from text."""
    pass

def remove_header(text):
    """ Remove letter header text."""
    pass
    

## Identifying Objections

### Naive Approach
Just look for strings present in text - e.g. Article 56 EPC and "lack" or "contrary"
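
A minimal sketch of the naive check. I have tightened it slightly by requiring the Article and the cue word to appear in the same sentence, which is my assumption rather than part of the idea above. The function name and the example sentences are made up for illustration.

```python
import re

def has_naive_objection(text, article="56", cue_words=("lack", "contrary")):
    """Return True if one sentence mentions the Article and a cue word."""
    article_pattern = re.compile(
        r"Article\s+" + re.escape(article) + r"\b", re.IGNORECASE
    )
    # Crude sentence split: ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if article_pattern.search(sentence):
            lowered = sentence.lower()
            # Substring test, so "lack" also matches "lacks" and "lacking"
            if any(cue in lowered for cue in cue_words):
                return True
    return False

negative = "The subject-matter of claim 1 lacks an inventive step under Article 56 EPC."
positive = "Claim 1 meets the requirements of Article 56 EPC."
print(has_naive_objection(negative), has_naive_objection(positive))
```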

### Machine Learning Approach
Label each office action. Use a bag of words classifier.
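
To make the idea concrete, here is a toy bag-of-words classifier in pure Python. I have picked multinomial Naive Bayes with add-one smoothing as the classifier; that choice, the class name and the four training sentences are all assumptions for illustration, not real labelled office actions.

```python
import math
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Count lowercase alphabetic tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

class NaiveBayesSketch:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word, n in bag.items():
                if word in self.vocab:  # skip words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training data
train_texts = [
    "claim lacks inventive step over the prior art",
    "claim is not novel and lacks clarity",
    "claim is novel and involves an inventive step",
    "claims meet the requirements of the convention",
]
train_labels = ["objection", "objection", "no_objection", "no_objection"]

model = NaiveBayesSketch().fit(train_texts, train_labels)
print(model.predict("the claim lacks novelty"))
```

With real data we would label whole office actions (or sections of them) and would probably reach for an existing library rather than this hand-rolled version.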

## Identifying Legal Basis

### Naive Approach
Just look for "Article 94(3)" in text. (May give false positives if reference is made to a previous communication.)

Or look for line with string distance past a threshold with "Communication pursuant to Article 94(3) EPC"
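
A sketch of the second idea using `difflib.SequenceMatcher` from the standard library. Note this scores similarity (0 to 1) rather than an edit distance, and the 0.8 threshold is an untuned guess. The function name and the garbled example line are hypothetical.

```python
from difflib import SequenceMatcher

TARGET = "Communication pursuant to Article 94(3) EPC"

def find_legal_basis_line(text, target=TARGET, threshold=0.8):
    """Return (line, score) for the line most similar to target,
    or None if no line reaches the threshold."""
    best_line, best_score = None, 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        score = SequenceMatcher(None, line.lower(), target.lower()).ratio()
        if score > best_score:
            best_line, best_score = line, score
    if best_score >= threshold:
        return best_line, best_score
    return None

# Made-up OCR damage: "l" for "i" and "1" for "l"
ocr_text_sample = (
    "Applicant\n\n"
    "Communicatlon pursuant to Artic1e 94(3) EPC\n"
    "The examination of the above-identified application has revealed"
)
print(find_legal_basis_line(ocr_text_sample))
```

Matching the whole title line should avoid the false positives from a bare "94(3)" appearing in a reference to an earlier communication.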

In [39]:
text = ocr_text['2016-08-18 - Article 94(3) EPC.pdf']
print("94(3)" in text)
print("70(2)" in text)

True
False


*Observations*

D1, D2, etc. are sometimes extracted as 01, 02.

I am kind of doing sentiment analysis - I want to know if, for an objection, the office action is positive or negative. Do we extract text related to particular objections first, then look at that subtext?

* Data to extract:
    * headings, paragraph numbers and paragraph text.
    * date of action (although could we get this from the Register?)
    * objections - which were outstanding and which were met
    * prior art (can we use regex at first?)
    * current state of application (e.g. pages as published or amended)
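
On the prior art question, a first regex pass could look something like the sketch below. It assumes cited documents are labelled D1, D2, ... in the text; the pattern, the function name and the example sentence are mine. It will not catch labels that OCR has misread as 01, 02.

```python
import re

# "D" followed by a one- or two-digit number without a leading zero
CITATION_PATTERN = re.compile(r"\bD([1-9]\d?)\b")

def find_cited_documents(text):
    """Return the de-duplicated document labels (D1, D2, ...) in numeric order."""
    numbers = {int(n) for n in CITATION_PATTERN.findall(text)}
    return ["D{}".format(n) for n in sorted(numbers)]

# Hypothetical example sentence
example = "D1 discloses a widget. D2, like D1, fails to disclose feature X (see D10, paragraph 12)."
print(find_cited_documents(example))
```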