Document Parsing Pipeline

This repo provides code for building a document processing pipeline on GCP. It works like this:

Upload a document as text, as a PDF, or as an image to the cloud (in this case, a Google Cloud Storage bucket).
The document gets tagged by type (e.g. "invoice," "article," etc.) and is then moved into a new folder/storage bucket based on that type.
Each document type is processed differently. A file that's moved to the "articles" bucket gets tagged by topic (e.g. Sports, Politics, Technology). A file that's moved to the "invoices" bucket gets analyzed for prices, phone numbers, and other entities.
The extracted data then gets put into a BigQuery table.

Step 0: Enable APIs

For this project, you'll need to enable these GCP services:

AutoML Natural Language
Natural Language API
Vision API
Places API

Step 1: Create Storage Buckets

First, you'll want to create a few Storage buckets.

Create one bucket, something like gs://input-documents, where you'll initially upload your unsorted documents.

Next, create a bucket for each of the file types you want to sort documents into. In this project, I have code to analyze invoices and articles, so I created three buckets:

gs://articles
gs://invoices
gs://unsorted-docs
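
If you'd rather script this step than click through the console, the buckets can be created with the google-cloud-storage client. This is a minimal sketch and not code from this repo: the helper name create_pipeline_buckets and the example bucket names are hypothetical. Bucket names are globally unique across GCP, so short names like articles are probably taken and you'll likely need your own prefix.

```python
def create_pipeline_buckets(client, names, location="US"):
    """Create each bucket in `names` that does not exist yet.

    `client` is expected to be a google.cloud.storage.Client (or any object
    with the same lookup_bucket / create_bucket methods).
    Returns the list of bucket names that were actually created.
    """
    created = []
    for name in names:
        # lookup_bucket returns None when the bucket does not exist.
        if client.lookup_bucket(name) is None:
            client.create_bucket(name, location=location)
            created.append(name)
    return created


# Example usage (needs `pip install google-cloud-storage` and GCP credentials;
# the bucket names below are placeholders):
#   from google.cloud import storage
#   create_pipeline_buckets(
#       storage.Client(),
#       ["my-input-documents", "my-articles", "my-invoices", "my-unsorted-docs"],
#   )
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real project.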

Step 2: Create an AutoML Document Classification Model

Next, you'll want to build a model that sorts documents by type, labeling them as invoices, articles, emails, and so on. To do this, I used the RVL-CDIP dataset:

A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

This huge dataset contains 400,000 grayscale document images in 16 classes.

I also added some scanned invoice data from the Scanned invoices OCR and Information Extraction Dataset.

Because the combined dataset was so large and hard to work with, I used only a sample of documents, with this breakdown by document type:

label	count
advertisement	902
budget	900
email	900
form	900
handwritten	900
invoice	900
letter	900
news	188
invoice	626

I've hosted all this data at https://console.cloud.google.com/storage/browser/doc-classification-training-data (or gs://doc-classification-training-data). The training files, as PDF and txt files, are in the processed_files folder. You'll need to copy those files to your own GCP project in order to train your own model. You'll also see a file automl_input.csv in the bucket. This is the CSV file you'll use to import data into AutoML Natural Language.
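
Once the files live in your own bucket, the import CSV probably needs to point there too. The sketch below rewrites the bucket in every gs:// path of the CSV. It rests on an assumption I haven't checked against the actual file: that automl_input.csv holds rows of the form gcs_uri,label (optionally with a leading TRAIN/VALIDATION/TEST column), with URIs under gs://doc-classification-training-data/. The function name retarget_csv and the bucket my-training-bucket are hypothetical.

```python
import csv
import io

# Assumed prefix of the URIs inside automl_input.csv.
SOURCE_PREFIX = "gs://doc-classification-training-data/"


def retarget_csv(csv_text, new_bucket):
    """Return `csv_text` with every SOURCE_PREFIX URI pointed at `new_bucket`."""
    new_prefix = f"gs://{new_bucket}/"
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(
            [
                new_prefix + cell[len(SOURCE_PREFIX):]
                if cell.startswith(SOURCE_PREFIX)
                else cell  # labels and split names pass through unchanged
                for cell in row
            ]
        )
    return out.getvalue()


# Example usage:
#   with open("automl_input.csv") as f:
#       fixed = retarget_csv(f.read(), "my-training-bucket")
#   with open("automl_input_mine.csv", "w") as f:
#       f.write(fixed)
```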

AutoML Natural Language is a tool that builds custom deep learning models from user-provided training data. It works with text and pdf files. This is what I used to build my document type classifier. The docs will show you how to train your own document classification model. When you're done training your model, Google Cloud will host the model for you.

To use your AutoML model in your app, you'll need to find your model name:

projects/YOUR LONG NUMBER/locations/us-central1/models/YOUR LONG MODEL ID NUMBER

We'll use this id in the next step.
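
Since a mistyped model name is an easy way to break the next step, here is a small sketch for assembling and checking that string. It only does string handling and is not part of this repo. The helper names, and the project number and model ID in the usage example, are placeholders.

```python
import re

# Shape of a full AutoML model resource name, as shown above.
MODEL_NAME_RE = re.compile(r"^projects/([^/]+)/locations/([^/]+)/models/([^/]+)$")


def build_model_name(project, model_id, location="us-central1"):
    """Assemble the full model name from its parts."""
    return f"projects/{project}/locations/{location}/models/{model_id}"


def parse_model_name(name):
    """Split a full model name into its parts, or raise ValueError."""
    match = MODEL_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a full model name: {name!r}")
    project, location, model_id = match.groups()
    return {"project": project, "location": location, "model_id": model_id}


# Example usage (placeholder IDs):
#   name = build_model_name("123456789", "TCN0000000000")
#   parse_model_name(name)["model_id"]
```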

Step 3: Set up Cloud Functions

This document pipeline is designed so that when you upload files to an input bucket (e.g. gs://input-documents), they're sorted by type by the AutoML model we built in Step 2.

After you've created this storage bucket and trained your AutoML model, you'll need to create a new cloud function that runs every time a new file is uploaded to gs://input-documents.

Create a new cloud function with the code in sort_documents.py. For this to work, you'll need to set several environment variables in the Cloud Functions console, like INVOICES_BUCKET, UNSORTED_BUCKET, and ARTICLES_BUCKET. These should be the names of the corresponding buckets you created, without the leading gs:// (e.g. gs://invoices -> invoices). You'll also need to set the environment variable SORT_MODEL_NAME to the model name we found in the last step (the entire long path that ends in the model ID number).
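
To make the expected values concrete, here is a sketch of how a function could read and sanity-check those settings at startup. It is an illustration, not the code in sort_documents.py: the helper names are hypothetical, and it assumes the four variables named above are the only ones required.

```python
import os

REQUIRED_VARS = (
    "INVOICES_BUCKET",
    "ARTICLES_BUCKET",
    "UNSORTED_BUCKET",
    "SORT_MODEL_NAME",
)


def bucket_name(value):
    """Strip an optional gs:// prefix and trailing slash from a bucket setting."""
    name = value.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    return name.rstrip("/")


def load_sort_config(env=None):
    """Read the settings from `env` (defaults to os.environ) and normalize them."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_VARS if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "invoices_bucket": bucket_name(env["INVOICES_BUCKET"]),
        "articles_bucket": bucket_name(env["ARTICLES_BUCKET"]),
        "unsorted_bucket": bucket_name(env["UNSORTED_BUCKET"]),
        "sort_model_name": env["SORT_MODEL_NAME"].strip(),
    }
```

Stripping the prefix defensively means a value pasted as gs://invoices still resolves to the bucket name invoices.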

Once you've set up this function, documents uploaded to gs://input-documents (or whatever you've called your input bucket) will be classified and moved into the invoices, articles, or unsorted bucket accordingly.
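
The routing step can be sketched roughly as follows. This is my own illustration of the idea rather than the logic in sort_documents.py: the label-to-bucket mapping (for example, sending the "news" label to the articles bucket), the 0.5 score threshold, and the function names are all assumptions. The move itself uses copy_blob followed by delete from the google-cloud-storage client, since Cloud Storage has no single "move" call between buckets.

```python
def choose_destination(label, score, buckets, min_score=0.5):
    """Pick the destination bucket for a predicted label.

    `buckets` maps model labels to bucket names and must contain an
    "unsorted" entry, used for unknown labels and low-confidence predictions.
    """
    if score >= min_score and label in buckets:
        return buckets[label]
    return buckets["unsorted"]


def move_blob(client, source_bucket_name, blob_name, dest_bucket_name):
    """Copy a blob to another bucket, then delete the original.

    `client` is expected to be a google.cloud.storage.Client.
    """
    source_bucket = client.bucket(source_bucket_name)
    dest_bucket = client.bucket(dest_bucket_name)
    blob = source_bucket.blob(blob_name)
    source_bucket.copy_blob(blob, dest_bucket, blob_name)
    blob.delete()


# Example usage (placeholder bucket names; label and score would come from
# the AutoML prediction for the uploaded file):
#   buckets = {"invoice": "my-invoices", "news": "my-articles",
#              "unsorted": "my-unsorted-docs"}
#   dest = choose_destination("invoice", 0.93, buckets)
#   move_blob(storage.Client(), "my-input-documents", "scan_001.pdf", dest)
```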

There are two more cloud functions defined in this repository:

analyze_invoice.py
tag_article.py

You'll want to create two new cloud functions that connect these scripts to your invoices and articles buckets.

But first, you'll need to create BigQuery tables to store the data extracted by those functions.

Step 4: Create BigQuery Tables

Create two BigQuery tables. One, which you can call something like "article_tags", will store the tags extracted by your tag_article cloud function. Its schema should be:

column name	type
filename	string
tag	string

Create a second table called something like invoice_data with the schema:

column name	type
filename	string
address	string
phone_number	string
name	string
total	float
date_str	string

For each of these tables, note the Table ID, which should be of the form:

your-project-name.your-dataset-name.your-table-name
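
As a rough illustration of how rows end up in the article_tags table, the sketch below builds rows matching the schema above and streams them in with insert_rows_json from the google-cloud-bigquery client. It is not the code in tag_article.py. Writing one row per tag is my assumption, and the function names and the example Table ID are hypothetical.

```python
import re

# project.dataset.table, with no empty parts and no whitespace.
TABLE_ID_RE = re.compile(r"^[^.\s]+\.[^.\s]+\.[^.\s]+$")


def article_tag_rows(filename, tags):
    """Build one row per tag, matching the (filename, tag) schema."""
    return [{"filename": filename, "tag": tag} for tag in tags]


def write_article_tags(client, table_id, filename, tags):
    """Stream the rows into BigQuery; returns the number of rows written.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    if not TABLE_ID_RE.match(table_id):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    rows = article_tag_rows(filename, tags)
    if not rows:
        return 0  # nothing to insert
    # insert_rows_json returns a list of per-row errors (empty on success).
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return len(rows)


# Example usage (placeholder Table ID):
#   from google.cloud import bigquery
#   write_article_tags(bigquery.Client(), "my-project.docs.article_tags",
#                      "story.txt", ["Sports", "Politics"])
```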

Step 5: Create Remaining Cloud Functions

Now that you've created those BigQuery tables, you can deploy the cloud functions:

tag_article.py
analyze_invoice.py

As you deploy, you'll need to set the environment variables ARTICLE_TAGS_TABLE and INVOICES_TABLE respectively to their BigQuery table IDs. For the analyze_invoice cloud function, you'll also need to create an API key (https://cloud.google.com/docs/authentication/api-keys) and set the environment variable GOOGLE_API_KEY to that key (this script uses the Google Places API to find information about businesses from their phone numbers).
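
For a sense of what a phone-number lookup looks like, here is a sketch against the Places API "Find Place" endpoint, which accepts inputtype=phonenumber with the number in international format (leading +). I'm assuming this is the kind of request involved; analyze_invoice.py may call the API differently, and the function names and the sample phone number are hypothetical.

```python
import json
import urllib.parse
import urllib.request

FIND_PLACE_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"


def find_place_url(phone_number, api_key, fields=("name", "formatted_address")):
    """Build a Find Place request URL for a phone number like +16502530000."""
    params = {
        "input": phone_number,
        "inputtype": "phonenumber",
        "fields": ",".join(fields),
        "key": api_key,
    }
    return FIND_PLACE_URL + "?" + urllib.parse.urlencode(params)


def find_business(phone_number, api_key):
    """Return the first matching place as a dict, or None if nothing matched."""
    url = find_place_url(phone_number, api_key)
    with urllib.request.urlopen(url, timeout=10) as response:
        data = json.load(response)
    candidates = data.get("candidates", [])
    return candidates[0] if candidates else None


# Example usage (reads the key from the environment variable set above):
#   import os
#   find_business("+16502530000", os.environ["GOOGLE_API_KEY"])
```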

This documentation shows you how to create a Python cloud function on Google Cloud.

Step 6: Analyze

Voila! You're done. Upload a file to your gs://input-documents bucket and watch as it's sorted and analyzed.
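
If you want to trigger the pipeline from a script instead of the console, an upload can be sketched like this with the google-cloud-storage client. The helper name upload_document, the bucket name, and the file path are placeholders.

```python
import os


def upload_document(client, bucket_name, local_path, dest_name=None):
    """Upload a local file to the input bucket and return its gs:// URI.

    `client` is expected to be a google.cloud.storage.Client.
    """
    blob_name = dest_name or os.path.basename(local_path)
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


# Example usage (placeholder names):
#   from google.cloud import storage
#   upload_document(storage.Client(), "my-input-documents", "scans/invoice_001.pdf")
```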

This is not an officially supported Google product.
