## Setup

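This notebook uses Google Document AI to extract fields from invoices and purchase orders. Running it requires a Google Cloud service account whose key file is referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.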

Instructions at: https://cloud.google.com/document-ai/docs/setup

In [37]:
from dotenv import load_dotenv
import os
import pandas as pd

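load_dotenv()  # read variables such as GOOGLE_APPLICATION_CREDENTIALS from a local .env file

assert "GOOGLE_APPLICATION_CREDENTIALS" in os.environ, "GOOGLE_APPLICATION_CREDENTIALS is not set; point it at the service account key file"

DEBUG = 0  # set to 1 to print every entity returned by the parser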
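
The assertion above only confirms that the variable is set. A minimal sketch of a stricter check follows, assuming the variable should point to a readable JSON key file. `check_credentials_path` is a hypothetical helper for illustration, not part of any Google library.

```python
import json
import os


def check_credentials_path(env=os.environ):
    """Return the key-file path from the environment, or raise if unusable.

    Hypothetical helper; assumes the service account key is a JSON file.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Service account file not found: {path}")
    with open(path, "r", encoding="utf-8") as fh:
        json.load(fh)  # raises ValueError if the file is not valid JSON
    return path
```

Calling `check_credentials_path()` right after `load_dotenv()` would surface a wrong path before the first API request rather than during it.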

## Google Document AI based Parser

For this quick implementation, I use Google's prebuilt invoice parser. Because the sample data includes purchase orders as well as invoices, a cleaner, use-case-specific approach would be a custom document parser.

See demo at https://cloud.google.com/document-ai/docs/drag-and-drop

TODO: Add a video

Performance is very good even with zero training examples.

In [28]:
# NOTES:
# Quick hack using Google's prebuilt invoice parser.
# A custom document parser can give much better results, even with no training data.
# Try the demo linked above.

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

project_id = "q-cloud-f0042"
location = "us"  # processor location; must match the API endpoint below
processor_id = "92be89420caf567f"
mime_type = "application/pdf"

opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)
# Fully qualified resource name of the processor
name = client.processor_path(project_id, location, processor_id)

# Entity types to keep from the parser output
KEYS = ("invoice_total", "receiver_name", "supplier_name", "total_amount", "invoice_date")

def parse_invoice_or_po(
    file_path: str,
) -> dict[str, str]:
    """Send one PDF to the processor and return the entities listed in KEYS."""
    with open(file_path, "rb") as input_pdf:
        pdf_content = input_pdf.read()
    raw_document = documentai.RawDocument(content=pdf_content, mime_type=mime_type)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document,
        # Only return the extracted entities and page numbers.
        field_mask="entities,pages.pageNumber",
    )

    result = client.process_document(request=request)
    if DEBUG:
        print([(r.type_, r.mention_text) for r in result.document.entities])
    # If an entity type occurs more than once, the last mention wins.
    return {str(r.type_): str(r.mention_text) for r in result.document.entities if r.type_ in KEYS}
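
The dictionary comprehension keeps one value per entity type, so a repeated type such as `line_item` would collapse to a single mention if it were added to `KEYS`. A minimal sketch of grouping repeated types into lists instead is shown here. `group_entities` and the `SimpleNamespace` stand-ins are hypothetical; the sketch assumes only that each entity exposes `type_` and `mention_text`, as in the function above.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(entities, repeated=("line_item",)):
    """Collect single-valued fields as strings and repeated fields as lists."""
    single = {}
    multi = defaultdict(list)
    for ent in entities:
        if ent.type_ in repeated:
            multi[ent.type_].append(str(ent.mention_text))
        else:
            # First mention wins for single-valued fields (a design choice).
            single.setdefault(ent.type_, str(ent.mention_text))
    return {**single, **multi}


# Stand-in entities for illustration only; real ones would come from
# result.document.entities.
fake_entities = [
    SimpleNamespace(type_="supplier_name", mention_text="WDLHQC, INC"),
    SimpleNamespace(type_="line_item", mention_text="TB-301-BLACK-FBA TOILETRY BAG 1000 36.00 36,000.00"),
    SimpleNamespace(type_="line_item", mention_text="TB-301-PINK-FBA TOILETRY BAG 500 36.00 18,000.00"),
]
grouped = group_entities(fake_entities)
```

Here `grouped["line_item"]` holds both line items as a list, while `grouped["supplier_name"]` stays a plain string.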

## Process All Samples

In [31]:
input_dir = "./inputs/Sample PDFs/"

data = []
for fname in os.listdir(input_dir):
    if not fname.endswith(".pdf"):
        continue
    pdf_path = os.path.join(input_dir, fname)
    print("Processing ", fname)
    parsed_data = parse_invoice_or_po(pdf_path)
    data.append({'Filename': fname, **parsed_data})

Processing  PO-VIAEN20220830001（INVOICE）.pdf
Processing  PO-WDLH20220812001.pdf
Processing  PO-SMKD20220511001.pdf
Processing  PO-WDLH20220805001.pdf
Processing  invoiceBJ202208120006.pdf
Processing  PO-VIAEN20220830001.pdf
Processing  PO-WDLH20220722001.rev.1.pdf
Processing  PO-VIAEN20220914001.pdf
Processing  PO-SMKD20220830001（INVOICE）.pdf
Processing  PO-WDLH20220722002.pdf
Processing  PO-SMKD20220830001.pdf
Processing  invoiceBJ20220830008.pdf
Processing  PO-VIAEN20220824001.pdf


In [35]:
pd.DataFrame(data, columns=("Filename", "invoice_date", "receiver_name", "supplier_name", "total_amount"))

,Filename,invoice_date,receiver_name,supplier_name,total_amount
0,PO-VIAEN20220830001（INVOICE）.pdf,2022-08-30,"VIAEON, INC",CHINA NATIONAL PUBLICATIONS IMPORT & EXPORT GU...,16058.28
1,PO-WDLH20220812001.pdf,12-Aug-22,"VIAEON, INC","BJ Global Supply Chain Co., Ltd",17148.56
2,PO-SMKD20220511001.pdf,2022/5/11,"SMARKIDS, INC","SMARKIDS, INC",17158.61
3,PO-WDLH20220805001.pdf,05-Aug-22,"Guangzhou tuwai leather goods Co., Ltd","WDLHQC, INC",29940.74
4,invoiceBJ202208120006.pdf,2022/8/12,"WDLHQC, INC","BJ Global Supply Chain Co., Ltd",17148.56
5,PO-VIAEN20220830001.pdf,30-Aug-22,CHINA NATIONAL PUBLICATIONS IMPORT \nEXPORT GU...,"VIAEON, INC",16058.28
6,PO-WDLH20220722001.rev.1.pdf,22-Jul-22,"WDLHQC, INC","WDLHQC, INC",29228.39
7,PO-VIAEN20220914001.pdf,14-Sep-22,CHINA NATIONAL PUBLICATIONS IMPORT \nEXPORT GU...,"VIAEON, INC",7801.08
8,PO-SMKD20220830001（INVOICE）.pdf,2022-09-01,"SMARKIDS, INC",CHINA NATIONAL PUBLICATIONS IMPORT & EXPORT GU...,7972.94
9,PO-WDLH20220722002.pdf,22-Jul-22,"BJ Global Supply Chain Co., Ltd","WDLHQC, INC",13866.13
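
The parser returns raw text, so dates appear in several formats (`2022-08-30`, `12-Aug-22`, `2022/5/11`) and amounts can carry thousands separators (for example `29,940.74` in the raw parser output). A minimal sketch of normalizing both follows, assuming only the three date formats seen in these samples. `normalize_date` and `normalize_amount` are hypothetical helpers.

```python
from datetime import date, datetime
from decimal import Decimal

# Formats observed in the sample output; other documents may need more.
DATE_FORMATS = ("%Y-%m-%d", "%d-%b-%y", "%Y/%m/%d")


def normalize_date(text: str) -> date:
    """Parse a date string using the first format that matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def normalize_amount(text: str) -> Decimal:
    """Convert an amount such as '29,940.74' to a Decimal.

    Assumes ',' is a thousands separator and '.' the decimal point.
    """
    return Decimal(text.replace(",", "").strip())
```

These helpers could be applied per column, for instance with `Series.map`, before comparing a purchase order against its invoice. Month abbreviations such as `Aug` are parsed according to the current locale, so the sketch assumes an English locale.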


## Testing Cell

In [29]:
DEBUG = 1
test_path = "./inputs/Sample PDFs/PO-WDLH20220805001.pdf"
parse_invoice_or_po(test_path)

[('invoice_date', '05-Aug-22'), ('purchase_order', 'WDLH20220805001'), ('invoice_type', ''), ('currency', 'US DOLLARS'), ('receiver_name', 'Guangzhou tuwai leather goods Co., Ltd'), ('total_amount', '29,940.74'), ('receiver_address', 'No. 7 Yaoji Alley, Qianjin Village, Shiling Town\nHuadu District, Guangdong Province\nChina'), ('supplier_name', 'WDLHQC, INC'), ('supplier_address', '9\nSheridan\n, \nWY \n82801\nUnited \nStates'), ('line_item', 'TOILETRY-BAG-301-FBA TOILETRY BAG BLACK&WHITE 3500 36.00 126,000.00'), ('line_item', 'TB-301-BLACK-FBA TOILETRY BAG 1000 36.00 36,000.00'), ('line_item', 'TB-301-PINK-FBA TOILETRY BAG 500 36.00 18,000.00'), ('line_item', 'TB-301-BROWN-FBA TOILETRY BAG 600 36.00 21,600.00'), ('line_item', '5600 201,600.00')]


{'invoice_date': '05-Aug-22',
 'receiver_name': 'Guangzhou tuwai leather goods Co., Ltd',
 'total_amount': '29,940.74',
 'supplier_name': 'WDLHQC, INC'}