# Sample Notebook for Extracting Data from OCRed PDFs Using Regex and LLMs

One can use this notebook to build a pipeline to parse and extract data from OCRed PDF files. **Warning:** When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sounds "kind of like" what you're looking for) and missing language (making up content to fill any holes), and importantly, they do **NOT** provide any hint of when they may be making an error.
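
One cheap quality-control check, offered here as a minimal sketch rather than a complete solution, is to confirm that a value the model returned actually appears in the source text. This only makes sense for fields that should be copied verbatim (a case number, an address), not for summaries. The helper names `normalize` and `is_grounded` and the sample string are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so OCR line breaks don't block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def is_grounded(value, source_text):
    """Return True if the extracted value appears verbatim (after normalization) in the source."""
    value_norm = normalize(value)
    return bool(value_norm) and value_norm in normalize(source_text)

# Hypothetical OCR text with a line break in the middle of the address.
source = "CASE NO. BOA848024\nconcerning premises 109 to 117 A Blue Hill\nAvenue, Ward 12"

print(is_grounded("109 to 117 A Blue Hill Avenue", source))  # found in the source
print(is_grounded("42 Elm Street", source))                  # not in the source
```

A value that fails this check isn't necessarily wrong (OCR noise can break a match), but it is a good candidate for manual review.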

First we load the libraries we need. Note: if you try to run the cell and get something like `ModuleNotFoundError: No module named 'mod_name'`, you'll need to install the module. You can do this by uncommenting the line below that reads `#!pip install mod_name`, if it's listed. If it isn't, you can probably install it with a similarly formatted command.
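
If you'd rather check up front which modules are missing instead of waiting for a `ModuleNotFoundError`, something like the following sketch can help. It uses the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is an assumption of this example, and the list holds import names, which don't always match the package name you'd pass to `pip`.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used by the first cell of this notebook.
needed = ["os", "re", "PyPDF2", "pandas", "numpy"]
print("Missing:", missing_modules(needed))
```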

In [21]:
# Note: os and re ship with Python and don't need to be installed.
#!pip install PyPDF2
#!pip install pandas
#!pip install numpy

import os
from os import walk, path
import PyPDF2
import re
import pandas as pd
import numpy as np

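def read_pdf(file):
    # Return the text of every page in the PDF, or "" if the file can't be read.
    # Note: PdfReader and extract_text() replace PdfFileReader and extractText(),
    # which were removed in PyPDF2 3.0.
    try:
        with open(file, "rb") as pdf_handle:
            pdfFile = PyPDF2.PdfReader(pdf_handle, strict=False)
            text = ""
            for page in pdfFile.pages:
                text += " " + page.extract_text()
        return text
    except Exception as e:
        print("Could not read {}: {}".format(file, e))
        return ""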

In [2]:
# Test Audio call
# Only works on Mac. If you aren't using a Mac, you should disable such calls below.
tmp = os.system( "say Testing, testing, one, two, three.")
del(tmp)

In [14]:
#!pip install transformers
#!pip install openai
#!pip install tiktoken
#!pip install nltk

import json

from nltk.tokenize import word_tokenize, sent_tokenize

import openai
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

import tiktoken
ENCODING = "gpt2"
encoding = tiktoken.get_encoding(ENCODING)

def complete_text(prompt,temp=0,trys=0,clean=True):
    
    global tokens_used
    
    model="text-davinci-003"
    model_token_limit = 4097
    
    token_count = len(encoding.encode(prompt))
    max_tokens= model_token_limit-round(token_count+5)
    
    try:
        response = openai.Completion.create(
          model=model,
          prompt=prompt,
          temperature=temp,
          max_tokens=max_tokens,
          top_p=1.0,
          frequency_penalty=0.0,
          presence_penalty=0.0
        )
        output = str(response["choices"][0]["text"].strip())
    except:
        print("Problem with API call!")
        output = """{"output":"error"}"""
        
    tokens_used += token_count+len(encoding.encode(output))
    
    if clean:
        return clean_pseudo_json(output,temp=0,trys=trys)
    else:
        return output
    
def clean_pseudo_json(string,temp=0,key="output",trys=0,ask_for_help=1):
    try:
        output = json.loads(string)[key]
    except:
        try:
            string_4_json = re.findall(r"\{.*\}",re.sub("\n","",string))[0]
            output = json.loads(string_4_json)[key]
        except:
            try:
                string = "{"+string+"}"
                string_4_json = re.findall(r"\{.*\}",re.sub("\n","",string))[0]
                output = json.loads(string_4_json)[key]
            except Exception as e:
                prompt = "I tried to parse some json and got this error, '{}'. This was the would-be json.\n\n{}\n\nReformat it to fix the error.".format(e,string)              
                if trys <= 3:
                    if trys == 0:
                        warm_up = 0
                    else:
                        warm_up = 0.25
                    output = complete_text(prompt,temp=0+warm_up,trys=trys+1)  
                    print("\n"+str(output)+"\n")            
                elif ask_for_help==1:
                    print(prompt+"\nReformatting FAILED!!!")
                    try:
                        os.system( "say hey! I need some help. A little help please?")
                    except:
                        print("'say' not supported.\n\n")
                    output = input("Let's see if we can avoid being derailed. Examine the above output and construct your own output text. Then enter it below. If the output needs to be something other than a string, e.g., a list or json, start it with `EVAL: `. If you're typing that, be very sure there's no malicious code in the output.\n")      
                    if output[:6]=="EVAL: ":
                        output = eval(output[6:])
                else:
                    output = "There was an error getting a response!"
            
    return output
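
The manual fallback above passes hand-typed text to `eval`, which will execute whatever is entered. A more cautious alternative, sketched here under the assumption that the override only ever needs to be a string, list, dict, or number, is to parse it as JSON or as a Python literal. The function name `parse_manual_override` is hypothetical and not part of the notebook.

```python
import ast
import json

def parse_manual_override(raw, prefix="EVAL: "):
    """Parse hand-entered text; input after the prefix is read as data, never executed."""
    if not raw.startswith(prefix):
        return raw  # plain strings pass through unchanged
    payload = raw[len(prefix):]
    try:
        return json.loads(payload)        # try valid JSON first
    except json.JSONDecodeError:
        return ast.literal_eval(payload)  # then Python literals (lists, dicts, numbers, strings)

print(parse_manual_override('EVAL: {"reasoning": "hardship shown", "decision": "granted"}'))
print(parse_manual_override("EVAL: ['a', 'b']"))
print(parse_manual_override("none found"))
```

Anything that isn't a plain literal, such as a function call, makes `ast.literal_eval` raise an error instead of running.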

This notebook will make use of the [OpenAI API](https://openai.com/blog/openai-api). For things to work, you'll need to make sure that the files referenced below exist and contain your relevant credentials.
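
If a credential file is missing, a bare `open` fails with a traceback that doesn't say much about what to fix. The sketch below shows one way to read a credential with a clearer error and an optional environment-variable fallback. The helper `read_credential`, the fallback behavior, and the placeholder key are assumptions of this example, not something the notebook requires.

```python
import os
import tempfile

def read_credential(file_path, env_var=None):
    """Return a credential from file_path, falling back to an environment variable if given."""
    if os.path.isfile(file_path):
        with open(file_path, "r") as fh:
            return fh.read().strip()
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]
    raise FileNotFoundError("No credential at {} (and no fallback found)".format(file_path))

# Demo with a throwaway file; the value below is a placeholder, not a real key.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "openai_key.txt")
    with open(demo_path, "w") as fh:
        fh.write("sk-placeholder\n")
    print(read_credential(demo_path))
```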

In [8]:
with open("../keys_work/openai_org.txt", "r") as file:
    openai.organization = file.read().rstrip()
with open("../keys_work/openai_key.txt", "r") as file:
    openai.api_key = file.read().rstrip()

llm_temperature = 0 # I strongly suggest keeping the LLM's temp at zero to avoid it making things up.

# Toggle LLM usage on or off
use_LLM = True

tokens_used = 0

In [36]:
human_hourly_wage = 15
human_reading_words_per_min = 280
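
These two constants aren't used in the cells shown here; presumably they are meant for comparing the API cost against the cost of a person reading each document. A minimal sketch of that arithmetic follows, assuming a single read-through at a steady pace. The function names are made up for this example, and the per-token rate is the one used in the cost printout later in the notebook.

```python
def human_reading_cost(words, hourly_wage=15, words_per_min=280):
    """Estimated cost in dollars for a person to read `words` words once."""
    minutes = words / words_per_min
    return minutes / 60 * hourly_wage

def api_cost(tokens, dollars_per_1k_tokens=0.02):
    """Estimated API cost in dollars at a flat per-1,000-token rate."""
    return tokens * (dollars_per_1k_tokens / 1000)

words = 1135  # word count reported for the first sample file in this notebook
print("Human read-through: ${:.2f}".format(human_reading_cost(words)))
# → Human read-through: $1.01
```

That is 1135 / 280 ≈ 4.05 minutes of reading, or about 0.068 hours at $15 per hour.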

Next, place a bunch of OCRed PDF files in the right folder (here, the `data/boston` folder). FWIW, you can use Adobe Pro to OCR in batch.
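
A folder like this often picks up stray non-PDF files (for example `.DS_Store` on a Mac), and every file found gets handed to the PDF reader. The sketch below shows one way to list only the PDFs. The helper `list_pdfs` and the demo file names are hypothetical.

```python
import os
import tempfile

def list_pdfs(folder):
    """Return sorted names of files directly inside `folder` that end in .pdf (case-insensitive)."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf") and os.path.isfile(os.path.join(folder, name))
    )

# Demo on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ["a_Decision.pdf", "B_DECISION.PDF", ".DS_Store", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    print(list_pdfs(folder))
```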

In [42]:
df = pd.DataFrame() #this will create an empty dataframe

path = "data/boston/" # this is where we'll be looking for files
f = []
for (dirpath, dirnames, filenames) in walk(path): # create a list of file names
    f.extend(filenames)
    break

token_counts = []
for file in f: # for each file in the list of file names, do some stuff
    
    global tokens_used
    
    tokens_used = 0
    column_names = ["file"]
    column_values = [file]

    fileloc = path+file
    text = read_pdf(fileloc)
    words = len(text.split())
    
    print("Reading ~{} words from: \"{}\"\n".format(words,fileloc))
    
    #############################################################
    # Here's where we use regex to pull out specific content
    
    # ---------------------------------------------------------       
    # case Number
    # ---------------------------------------------------------        
    case_no = re.search(r"(?<=case no\.\s)(.*?)(?=permit#)",text, flags=re.IGNORECASE).group(1).strip()
    column_names.append("case_no")
    column_values.append(case_no)
    
    # ---------------------------------------------------------        
    # address
    # ---------------------------------------------------------        
    address = re.search(r"(?<=concerning premises\s)(.*?)(?=\s?,?\s?ward)",text, flags=re.IGNORECASE).group(1).strip()
    column_names.append("address")
    column_values.append(address)
    
    # ---------------------------------------------------------        
    # ward
    # ---------------------------------------------------------        
    ward = re.search(r"(?<=ward\s)(.*?)(?=to vary)",text, flags=re.IGNORECASE).group(1).strip()
    column_names.append("ward")
    column_values.append(ward)
    
    #############################################################
    # Here's where we use GPT to pull out some specific content.
    
    #
    # Note: You should consider combining multiple prompts into a single prompt 
    # to avoid making unnecessary api calls. 
    #
  
    
    if use_LLM:
    
        # ---------------------------------------------------------    
        # description of variance requested
        # ---------------------------------------------------------        
        prompt_text = """Below you will be provided with the text of an order from a local zoning board of appeals responding to a variance request. You're looking to find the _description of variance requested_. That is, what the petitioner was asking for.        

    Here's the text of the order. 

    {}

    ---

    Return a json object, including the outermost curly brackets, where the key is "output" and the value is the _description of variance requested_. If you can't find a _description of variance requested_ in the text above, answer "none found". Be sure to use valid json, encasing keys and values in double quotes, and escaping internal quotes and special characters as needed.""".format(text)
        #print(prompt_text)    
        request = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("request")
        column_values.append(request)

        # ---------------------------------------------------------   
        # relevant facts
        # ---------------------------------------------------------        
        prompt_text = """Below you will be provided with the text of an order from a local zoning board of appeals responding to a variance request. You're looking to find the _relevant facts_. That is, what facts the board needed to know to rule on the petitioner's request.

    Here's the text of the order. 

    {}

    ---

    Return a json object, including the outermost curly brackets, where the key is "output" and the value is a short summary of the _relevant facts_. If you can't find _relevant facts_ in the text above, answer "none found". Be sure to use valid json, encasing keys and values in double quotes, and escaping internal quotes and special characters as needed.""".format(text)
        #print(prompt_text)    
        facts = complete_text(prompt_text,temp=llm_temperature)
        column_names.append("facts")
        column_values.append(facts)

        # ---------------------------------------------------------
        # decision/reasoning & short decision
        # ---------------------------------------------------------        
        prompt_text = """Below you will be provided with the text of an order from a local zoning board of appeals responding to a variance request. You're looking to find the board's _decision_ and _reasoning_. That is, how the board ruled on the petitioner's request and how it came to that decision.        

    Here's the text of the order. 

    {}

    ---

    Return a json object, including the outermost curly brackets, where the key is "output" and the value is a json object with two key-value pairs: (1) the first item has the key "reasoning" and the value is a summary of the board's _reasoning_ as stated above; and (2) the second item has the key "decision" with a value that is a one or two word restatement of the _decision_ found above (e.g., "granted," "not granted," or "granted in part"). If you can't find the _decision_ or _reasoning_ in the text of the above order, both values should read "none found". Be sure to use valid json, encasing keys and values in double quotes, and escaping internal quotes and special characters as needed.""".format(text)
        #print(prompt_text)    
        output = complete_text(prompt_text,temp=llm_temperature)
        
        reasoning = output["reasoning"]
        column_names.append("reasoning")
        column_values.append(reasoning)

        decision = output["decision"]
        column_names.append("decision")
        column_values.append(decision)

    #############################################################   

    # print each extracted value next to its column name
    for name, datum in zip(column_names, column_values):
        print("{}: {}\n".format(name.upper(),datum))

    print("Tokens used (approx.): {} (API Cost ${})\n".format(tokens_used,tokens_used*(0.02/1000))) # See https://openai.com/pricing
    token_counts.append(tokens_used)
        
    print("================================================\n")
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([column_values],columns=column_names)], ignore_index=True,sort=False)

print("Average approx. tokens used per item {} (API Cost ${})".format(np.array(token_counts).mean(),np.array(token_counts).mean()*(0.02/1000)))
display(df)

Reading ~1135 words from: "data/boston/109 to 117A Blue Hill Ave_BOA848024_Decision.pdf"

FILE: 109 to 117A Blue Hill Ave_BOA848024_Decision.pdf

CASE_NO: BOA848024

ADDRESS: 109 to 117 A Blue Hill Avenue

WARD: 12

REQUEST: Change Occupancy to include Coffee Shop.

FACTS: The petitioner, Domingo De La Paz, is requesting a variance to the Zoning Act, Ch. 665, Acts of 1956, as amended, to change the occupancy of premises 109 to 117 A Blue Hill Avenue, Ward 12 to include a coffee shop. The Board of Appeal held a public hearing on the appeal on April 9, 2019 and found that the proposed project is an appropriate use of the lot and will not adversely affect the community or create any detriment for abutting residents. The Board unanimously voted to grant the requested Conditional Use Permit with the proviso of BPDA design review with attention to grates.

REASONING: The proposal will allow the Appellant to have reasonable use of the premises by changing the occupancy to include a coffee sho

| | file | case_no | address | ward | request | facts | reasoning | decision |
|---|---|---|---|---|---|---|---|---|
| 0 | 109 to 117A Blue Hill Ave_BOA848024_Decision.pdf | BOA848024 | 109 to 117 A Blue Hill Avenue | 12 | Change Occupancy to include Coffee Shop. | The petitioner, Domingo De La Paz, is requesti... | The proposal will allow the Appellant to have ... | granted |
| 1 | 107 Buttonwood St_BOA784573_Decision.pdf | BOA 784573 | 107 Buttonwood Street | 7 | Variance Article(s): 65(65-9: Floor area ratio... | The petitioner, Thanh Nguyen, is requesting a ... | The proposed project will allow the Appellant ... | granted |
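
The regex lookups in the extraction loop assume every pattern matches. When one doesn't (for example, because `read_pdf` returned an empty string or the OCR garbled a heading), `re.search` returns `None` and the loop stops with an `AttributeError`. Here is a minimal sketch of a more forgiving lookup, using the same three patterns. The helper `extract_first` and the sample text are made up for illustration.

```python
import re

def extract_first(pattern, text, default=""):
    """Return the stripped first capture group of `pattern` in `text`, or `default` if nothing matches."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1).strip() if match else default

# Hypothetical single-line text shaped like the orders processed above.
sample = ("Case No. BOA848024 Permit# X concerning premises "
          "109 to 117 A Blue Hill Avenue, Ward 12 to vary the terms")

print(extract_first(r"(?<=case no\.\s)(.*?)(?=permit#)", sample))
print(extract_first(r"(?<=concerning premises\s)(.*?)(?=\s?,?\s?ward)", sample))
print(extract_first(r"(?<=ward\s)(.*?)(?=to vary)", sample))
print(repr(extract_first(r"(?<=case no\.\s)(.*?)(?=permit#)", "")))  # no match, so the default
```

Returning a default keeps the batch running, but an empty cell in the resulting table should still be treated as a prompt to inspect that file by hand.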


In [12]:
# If you're happy with the stuff you pulled out above, you can write the df to a csv file

df.to_csv("data/Coding of Boston Variance Decisions.csv", index=False, encoding="utf-8")  
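
A comment in the extraction loop suggests combining multiple prompts into a single prompt to avoid unnecessary API calls. The sketch below shows one way that might look: a single prompt asking for all four fields, plus a parser that fills in anything the model leaves out. It makes no API call. The names `FIELDS`, `build_combined_prompt`, and `parse_combined_response`, the prompt wording, and the sample reply are all assumptions of this example.

```python
import json

# Field name -> what the model is asked to return for it.
FIELDS = {
    "request": "a description of the variance requested",
    "facts": "a short summary of the relevant facts",
    "reasoning": "a summary of the board's reasoning",
    "decision": 'a one or two word restatement of the decision (e.g., "granted")',
}

def build_combined_prompt(order_text, fields=FIELDS):
    """Build one prompt that asks for every field in a single JSON object."""
    field_lines = "\n".join('- "{}": {}'.format(key, desc) for key, desc in fields.items())
    return (
        "Below is the text of an order from a local zoning board of appeals "
        "responding to a variance request.\n\n{}\n\n---\n\n"
        "Return a json object whose key is \"output\" and whose value is a json object "
        "with these keys:\n{}\n"
        "Use \"none found\" for any value you can't find. Be sure to use valid json."
    ).format(order_text, field_lines)

def parse_combined_response(raw, fields=FIELDS):
    """Parse the model's reply, filling any missing field with "none found"."""
    data = json.loads(raw).get("output", {})
    return {key: data.get(key, "none found") for key in fields}

# Hypothetical reply that omits two of the four fields.
reply = '{"output": {"request": "Change occupancy", "decision": "granted"}}'
print(parse_combined_response(reply))
```

One trade-off to keep in mind: a single larger response leaves fewer tokens for each field, and one malformed reply costs you all four values rather than one.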