# Document Extractor

## Setup and requirements
### Enabling Document AI

In [None]:
# set to your designated project
! gcloud config set project <PROJECT_ID>

Enable the Document AI API:

In [None]:
! gcloud services enable documentai.googleapis.com

### Project setup
Get the project ID:

In [None]:
! gcloud config get-value core/project

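The service account key will be stored in a JSON file named 'key.json'. The file itself is created by the ```gcloud iam service-accounts keys create``` step further down.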

Set environment variables:

In [2]:
import os
os.environ['PROJECT_ID'] = '<PROJECT_ID>'
os.environ['SERVICE_ACCOUNT_NAME'] = 'documentai-processors'
os.environ['SERVICE_ACCOUNT'] = f"documentai-processors@{os.getenv('PROJECT_ID', '')}.iam.gserviceaccount.com"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json' # path to json key file
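Since `<PROJECT_ID>` is a placeholder, a quick check before running the `gcloud` commands can catch a forgotten substitution. The following is a minimal sketch; the helper name `missing_settings` and the `REQUIRED_VARS` tuple are hypothetical and not part of the original notebook.

```python
import os

# Variables the cell above is expected to set (assumed list, taken from that cell).
REQUIRED_VARS = (
    "PROJECT_ID",
    "SERVICE_ACCOUNT_NAME",
    "SERVICE_ACCOUNT",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_settings(env=None):
    """Return the names of variables that are unset, empty, or still hold a <PLACEHOLDER>."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or "<" in value or ">" in value:
            problems.append(name)
    return problems
```

Calling `missing_settings()` with no arguments inspects `os.environ`; an empty list means every variable has been filled in.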

Create the service account:

In [13]:
! gcloud iam service-accounts create documentai-processors

Created service account [documentai-processors].


Add the documentai.editor, bigquery.dataEditor, and bigquery.user roles to the service account:

In [None]:
! gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" \
                                                    --role "roles/documentai.editor"

In [None]:
! gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" \
                                                    --role "roles/bigquery.dataEditor"

In [None]:
! gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" \
                                                    --role "roles/bigquery.user"

Create and download the service account key:

In [16]:
! gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS --iam-account $SERVICE_ACCOUNT

created key [2b0f6a5c376227feb74f1ba7da9f2356c604ab9d] of type [json] as [documentai-processors/key.json] for [documentai-processors@fraud-detection-python.iam.gserviceaccount.com]


In [None]:
# caution: this prints the private key; clear the cell output before sharing the notebook
! cat $GOOGLE_APPLICATION_CREDENTIALS

### Python Setup
Install required packages:

In [None]:
! pip install -r requirements.txt

## Create a custom processor

In [2]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.api_core.exceptions import FailedPrecondition

In [3]:
PROJECT_ID = os.getenv("PROJECT_ID", "")
API_LOCATION = "us" # change to your document ai api location

In [4]:
def create_processor(display_name: str, processor_type: str) -> documentai.Processor:
    """Create a Document AI processor of the given type in PROJECT_ID / API_LOCATION."""
    # the API endpoint must match the processor's location
    client_options = ClientOptions(
        api_endpoint=f"{API_LOCATION}-documentai.googleapis.com"
    )
    client = documentai.DocumentProcessorServiceClient(client_options=client_options)
    parent = client.common_location_path(PROJECT_ID, API_LOCATION)
    processor = documentai.Processor(display_name=display_name, type_=processor_type)
    return client.create_processor(parent=parent, processor=processor)

In [16]:
create_processor('customProcessor', 'CUSTOM_EXTRACTION_PROCESSOR')

name: "projects/1014038123259/locations/us/processors/160d1359502b88e0"
type_: "CUSTOM_EXTRACTION_PROCESSOR"
display_name: "customProcessor"
state: ENABLED
create_time {
  seconds: 1665421935
  nanos: 288389000
}
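The `name` field in the output above is a full resource name, while later steps only need the trailing processor ID. As a minimal sketch, the ID can be pulled out with a regular expression; the helper `parse_processor_name` is a hypothetical name introduced here for illustration.

```python
import re

# Shape of a processor resource name, as seen in the create_processor output.
_PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor_id>[^/]+)$"
)


def parse_processor_name(name: str) -> dict:
    """Split a processor resource name into project, location and processor_id."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a processor resource name: {name!r}")
    return match.groupdict()


parts = parse_processor_name(
    "projects/1014038123259/locations/us/processors/160d1359502b88e0"
)
print(parts["processor_id"])  # 160d1359502b88e0
```

In the notebook, the same helper could be applied to the `name` attribute of the object returned by `create_processor`.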

## Train the Processor
### Upload Training and Testing Data
Navigate to the custom processor detail page in the Cloud Console.
![Screenshot1](img/Screenshot1.png)

Set the dataset location to a Cloud Storage (GCS) bucket and click on ```CREATE DATA SET```.
![Screenshot2.png](img/Screenshot2.png)

Go to the GCS bucket and upload the training and testing data to the ```driving-license-train-data``` and ```driving-license-test-data``` folders, respectively.

Go to the custom processor detail page and open the 'Train' tab. Click on ```IMPORT DOCUMENTS``` and import all the data stored in the GCS bucket. In the ```Data split``` field, select 'Training' when importing documents from the ```driving-license-train-data``` folder and 'Test' when importing documents from the ```driving-license-test-data``` folder. In this demo, 10 documents were used to train the processor and 10 to test it.
![Screenshot4.png](img/Screenshot4.png)
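If the sample documents start out in a single local folder, they need to be divided between the two folders before upload. The sketch below shows one reproducible way to do that; the helper `split_train_test` and the file names are hypothetical, and only the 10/10 split comes from this demo.

```python
import random


def split_train_test(file_names, n_test, seed=0):
    """Return (train, test) lists; the same seed always gives the same split."""
    names = sorted(file_names)  # sort first so input order does not matter
    if not 0 <= n_test <= len(names):
        raise ValueError("n_test must be between 0 and the number of files")
    rng = random.Random(seed)
    rng.shuffle(names)
    return names[n_test:], names[:n_test]


# Hypothetical file names standing in for the 20 demo documents.
all_files = [f"license_{i:02d}.pdf" for i in range(20)]
train_files, test_files = split_train_test(all_files, n_test=10)
print(len(train_files), len(test_files))  # 10 10
```

The two lists can then be copied into ```driving-license-train-data``` and ```driving-license-test-data``` before uploading.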


### Labeling data
Click on ```EDIT SCHEMA``` and add the following labels by clicking ```CREATE LABEL```:
![Screenshot5](img/Screenshot5.png)

Select one of the imported documents and label all the fields (select the text on the image and choose the correct label). After labelling all the fields, click on ```MARK AS LABELLED```. All the imported documents need to be labelled before training the processor.
![Screenshot6](img/Screenshot6.png)

### Training the processor
After labelling all the documents, click on "TRAIN NEW VERSION". Enter the version name, and click on "START TRAINING".
![Screenshot7](img/Screenshot7.png)

### Deploying processor
After training is complete, go to the 'MANAGE VERSIONS' tab and deploy the trained version. Once the version has been deployed, an endpoint will be available under the 'PROCESSOR DETAILS' tab.
![Screenshot8](img/Screenshot8.png)

## Process document using the trained processor
Once the version has been deployed, copy the processor ID into the ```PROCESSOR_ID``` field below.

In [5]:
# constants
PROJECT_ID = os.getenv("PROJECT_ID", "") # change to your own project id
LOCATION = 'us' # change to your document ai location
MIME_TYPE = 'application/pdf' # document type
PROCESSOR_ID = '16c52250499a3864' # change to the id of your trained processor
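`MIME_TYPE` is hard-coded to PDF above. When processing a mix of file types, the MIME type can instead be guessed from the file extension with the standard library. This is a minimal sketch: the helper `guess_mime_type` is hypothetical, and the allow-list is an assumption that should be checked against the file types your processor accepts.

```python
import mimetypes

# Assumed allow-list for illustration; verify against the Document AI documentation.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/gif",
}


def guess_mime_type(file_path: str) -> str:
    """Guess the MIME type from the file extension and reject unexpected types."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported or unknown file type: {file_path!r}")
    return mime_type


print(guess_mime_type("license.pdf"))  # application/pdf
```

The guess is based on the extension only, so a mislabelled file would still be sent with the wrong type.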

In [6]:
# Send the document for processing
def send_processing_req(project_id, location, processor_id, file_path, mime_type, GCS_INPUT_URI = None):
    # note: GCS_INPUT_URI is accepted but not used; the document is read from file_path
    docai_client = documentai.DocumentProcessorServiceClient(
        client_options = ClientOptions(api_endpoint=f'{location}-documentai.googleapis.com')
    )

    resource_name = docai_client.processor_path(project_id, location, processor_id)

    # load file into memory
    with open(file_path, 'rb') as f:
        content = f.read()

    # use the mime_type argument rather than the global MIME_TYPE constant
    raw_doc = documentai.RawDocument(content=content, mime_type=mime_type)
    request = documentai.ProcessRequest(name=resource_name, raw_document=raw_doc)

    result = docai_client.process_document(request=request)

    document_object = result.document
    print('Document processing complete')
    # print(document_object.text)
    return document_object


Pass 'license.pdf' to the processor; a document object is returned.

In [9]:
file_path = 'license.pdf'
document_object = send_processing_req(PROJECT_ID, LOCATION, PROCESSOR_ID, file_path, MIME_TYPE)

Document processing complete


Text extracted from the document:

In [10]:
document_object.text

'UK\nDRIVING LICENCE\n1. MORGAN\n2. SARAH\nMEREDYTH\n3. 11.03.1976 UNITED KINGDOM\n4a. 19.01.2013 4c. DVLA\n4b. 18.01.2023\n5. MORGA753116SM9IJ 35\nMPLE\n*\n7.\nScamp\n18.01.2023\nISHI\n122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP\n9. AM/A/B1/B/f/k/l/n/p/q\nNEUROPEAN UNION MOCKS\nDVLA INTERNAL USE\n'

Fields and values identified by the trained processor:

In [11]:
from results_to_bigquery import extract_document_entities
extract_document_entities(document_object)

{'address': '122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP',
 'date_of_expiry': '18.01.2023',
 'date_of_issue': '19.01.2013',
 'dob_pob': '11.03.1976 UNITED KINGDOM',
 'first_name': 'SARAH\nMEREDYTH',
 'issued_by': 'DVLA',
 'last_name': 'MORGAN',
 'license_number': 'MORGA753116SM9IJ 35',
 'type': 'AM/A/B1/B/f/k/l/n/p/q'}
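The source of `extract_document_entities` lives in `results_to_bigquery` and is not shown in this notebook. As a rough sketch of what such a helper might do, the function below maps each entity's label (`type_`) to its text (`mention_text`), the attributes Document AI entities expose. The name `entities_to_dict` is hypothetical, and `SimpleNamespace` objects stand in for a real document so the example runs without calling the API.

```python
from types import SimpleNamespace


def entities_to_dict(document) -> dict:
    """Map each entity label to its extracted text.

    If a label occurs more than once, the last occurrence wins.
    """
    result = {}
    for entity in document.entities:
        result[entity.type_] = entity.mention_text
    return result


# Stand-in for a processed document (not real API output).
fake_document = SimpleNamespace(
    entities=[
        SimpleNamespace(type_="last_name", mention_text="MORGAN"),
        SimpleNamespace(type_="issued_by", mention_text="DVLA"),
    ]
)
print(entities_to_dict(fake_document))  # {'last_name': 'MORGAN', 'issued_by': 'DVLA'}
```

The actual helper may differ, for example in how it handles repeated labels or confidence scores.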

## Output to BigQuery

In [12]:
from results_to_bigquery import process_document

Output the Document AI object to BigQuery. If you wish to store the output in a new table, change ```ENTITIES_TABLE_NAME``` to a new table name. If the table does not exist in the dataset, ```process_document``` will create it.

In [3]:
DATASET_NAME = 'document_extractor_results'
ENTITIES_TABLE_NAME = 'driving_license_extracted_entities'

In [19]:
process_document(document_object, file_path, DATASET_NAME, ENTITIES_TABLE_NAME)

Entities:  {'address': '122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP', 'date_of_expiry': '18.01.2023', 'date_of_issue': '19.01.2013', 'dob_pob': '11.03.1976 UNITED KINGDOM', 'first_name': 'SARAH\nMEREDYTH', 'issued_by': 'DVLA', 'last_name': 'MORGAN', 'license_number': 'MORGA753116SM9IJ 35', 'type': 'AM/A/B1/B/f/k/l/n/p/q', 'input_file_name': 'license.pdf', 'text': 'UK\nDRIVING LICENCE\n1. MORGAN\n2. SARAH\nMEREDYTH\n3. 11.03.1976 UNITED KINGDOM\n4a. 19.01.2013 4c. DVLA\n4b. 18.01.2023\n5. MORGA753116SM9IJ 35\nMPLE\n*\n7.\nScamp\n18.01.2023\nISHI\n122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP\n9. AM/A/B1/B/f/k/l/n/p/q\nNEUROPEAN UNION MOCKS\nDVLA INTERNAL USE\n'}
Writing DocAI Entities to bq
LoadJob<project=fraud-detection-python, location=US, id=4d010d2b-7093-40b3-ab11-16365cba9877>
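The entity values printed above are plain strings, including the dates in `DD.MM.YYYY` form and the combined `dob_pob` field. If typed or separate columns are wanted downstream, a normalisation step along these lines could be applied to the entities dictionary first. This is a sketch only: `normalize_entities`, `date_of_birth` and `place_of_birth` are hypothetical names, not part of `results_to_bigquery`, and adding fields would change the resulting table schema.

```python
from datetime import datetime

DATE_FIELDS = ("date_of_issue", "date_of_expiry")


def _to_iso(date_text: str) -> str:
    """Convert 'DD.MM.YYYY' to ISO 'YYYY-MM-DD'."""
    return datetime.strptime(date_text, "%d.%m.%Y").date().isoformat()


def normalize_entities(entities: dict) -> dict:
    """Return a copy with ISO dates and dob_pob split into two extra fields."""
    out = dict(entities)
    for field in DATE_FIELDS:
        if field in out:
            out[field] = _to_iso(out[field])
    if "dob_pob" in out:
        # The date comes first, followed by the place of birth.
        dob, _sep, place = out["dob_pob"].partition(" ")
        out["date_of_birth"] = _to_iso(dob)
        out["place_of_birth"] = place
    return out


sample = {
    "date_of_expiry": "18.01.2023",
    "date_of_issue": "19.01.2013",
    "dob_pob": "11.03.1976 UNITED KINGDOM",
}
print(normalize_entities(sample)["date_of_expiry"])  # 2023-01-18
```

A value that does not match the expected date format raises `ValueError`, which makes extraction errors visible before the load job runs.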


View the output result in BigQuery:

In [8]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = f"""
    SELECT *
    FROM `{PROJECT_ID}.{DATASET_NAME}.{ENTITIES_TABLE_NAME}`
    LIMIT 20
"""
query_job = client.query(query)  # Make an API request.

In [9]:
query_job.to_dataframe()

| | input_file_name | type | issued_by | date_of_expiry | dob_pob | text | license_number | last_name | date_of_issue | first_name | address |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | license.pdf | AM/A/B1/B/f/k/l/n/p/q | DVLA | 18.01.2023 | 11.03.1976 UNITED KINGDOM | UK\nDRIVING LICENCE\n1. MORGAN\n2. SARAH\nMERE... | MORGA753116SM9IJ 35 | MORGAN | 19.01.2013 | SARAH\nMEREDYTH | 122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP |
| 1 | license.pdf | AM/A/B1/B/f/k/l/n/p/q | DVLA | 18.01.2023 | 11.03.1976 UNITED KINGDOM | UK\nDRIVING LICENCE\n1. MORGAN\n2. SARAH\nMERE... | MORGA753116SM9IJ 35 | MORGAN | 19.01.2013 | SARAH\nMEREDYTH | 122 BURNS CRESCENT\nEDINBURGH\nEH1 9GP |
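The two rows in the result above are identical, which is what one would expect if the same file was loaded more than once (an inference from the output, not something the notebook states). A minimal sketch for dropping such repeats on the client side follows; `drop_duplicate_rows` and its default key fields are hypothetical choices for this table.

```python
def drop_duplicate_rows(rows, key_fields=("input_file_name", "license_number")):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Rows represented as dictionaries, mirroring the query result above.
rows = [
    {"input_file_name": "license.pdf", "license_number": "MORGA753116SM9IJ 35"},
    {"input_file_name": "license.pdf", "license_number": "MORGA753116SM9IJ 35"},
]
print(len(drop_duplicate_rows(rows)))  # 1
```

This only cleans up what is read back; the duplicate rows remain in the BigQuery table unless they are removed there as well.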
