## DocAI Parser
This notebook demonstrates how to set up Google's Document AI (DocAI) service for document processing. It walks through the following steps:

- Setup: Import the necessary libraries and initialize logging.
- Configuration: Authenticate with the service and define essential parameters.
- Fetch Available Processor Types: Check which document processors can be created.
- Create a DocAI Processor: Set up a new document processor for our use case.

#### 1. Setup
#### 1.1. Importing Necessary Libraries

In [1]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai
import logging
import os

#### 1.2. Initialize Logging
To keep track of our operations and potential issues, we're setting up a basic logging mechanism.

In [2]:
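logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# Attach the handler only once so that re-running this cell does not duplicate log lines.
if not logger.handlers:
    logger.addHandler(logging.StreamHandler())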
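
If timestamps and severity levels would help when tracing API calls, the handler can be given a formatter. The following is a minimal sketch using only the standard library; the helper name `build_logger`, the logger name `docai-demo`, and the format string are illustrative choices, not part of the original notebook.

```python
import io
import logging


def build_logger(name, stream=None):
    """Return an INFO-level logger with a single formatted stream handler.

    `build_logger` is a hypothetical helper name used for illustration.
    """
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    if not log.handlers:  # avoid stacking handlers when a cell is re-run
        handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        log.addHandler(handler)
    return log


# Usage: write to an in-memory buffer so the formatted output can be inspected.
buffer = io.StringIO()
demo_logger = build_logger("docai-demo", stream=buffer)
demo_logger.info("processor types fetched")
print(buffer.getvalue().strip())
```

Passing `stream=None` keeps the default behaviour of writing to standard error, which is what the notebook's own handler does.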

#### 2. Configuration
#### 2.1. Authentication and Environment Setup

To interact with Google Cloud services, we authenticate with service account credentials by pointing the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at the key file.

🔴 Note: Ensure that the service account key file (`vai-key.json`) is present at the specified path before executing the code.

In [3]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './../credentials/vai-key.json'
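
Setting the environment variable does not itself verify that the key file exists, so a wrong path typically surfaces only later, when a client is created. A minimal sketch of an early check follows; `require_key_file` is a hypothetical helper (not part of any Google library), and the demo uses a temporary placeholder file rather than a real key.

```python
import os
import tempfile


def require_key_file(key_path):
    """Return the absolute path of key_path, or raise if the file is missing."""
    resolved = os.path.abspath(key_path)
    if not os.path.isfile(resolved):
        raise FileNotFoundError(f"Service account key not found: {resolved}")
    return resolved


# In the notebook this could be used as:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = require_key_file('./../credentials/vai-key.json')

# Demo with a temporary placeholder file standing in for a real key.
with tempfile.TemporaryDirectory() as tmp_dir:
    placeholder = os.path.join(tmp_dir, "placeholder-key.json")
    with open(placeholder, "w") as fh:
        fh.write("{}")
    found = require_key_file(placeholder)
    try:
        require_key_file(os.path.join(tmp_dir, "missing.json"))
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
print(os.path.basename(found), missing_detected)
```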

#### 2.2. Define Essential Parameters

These parameters define which GCP project and Document AI location we're working with. The client is pointed at the API endpoint that matches that location.

In [4]:
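PROJECT_ID = 'arun-genai-bb'
LOCATION = 'us'  # Document AI multi-region location (e.g. 'us' or 'eu'), not a region such as 'us-central1'
# The API endpoint must match the location, e.g. 'us-documentai.googleapis.com'.
client_options = ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=client_options)
# Parent resource path of the form 'projects/{PROJECT_ID}/locations/{LOCATION}'.
parent = client.common_location_path(PROJECT_ID, LOCATION)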
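
To make the two derived strings concrete, here is a small sketch that rebuilds them in plain Python, so it runs without the client library. The helper names `docai_endpoint` and `location_path` are hypothetical; `location_path` only mirrors the `projects/{project}/locations/{location}` format that `common_location_path` produces.

```python
def docai_endpoint(location):
    """Build the location-specific Document AI endpoint host name."""
    return f"{location}-documentai.googleapis.com"


def location_path(project_id, location):
    """Build the parent resource path in the common location format."""
    return f"projects/{project_id}/locations/{location}"


endpoint = docai_endpoint("us")
parent_path = location_path("arun-genai-bb", "us")
print(endpoint)     # us-documentai.googleapis.com
print(parent_path)  # projects/arun-genai-bb/locations/us
```

Keeping the endpoint derived from `LOCATION` means that switching location requires changing only one constant.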

#### 3. Fetch Available Processor Types
Before creating our processor, let's check which types of processors we can create.

In [5]:
response = client.fetch_processor_types(parent=parent)
for processor_type in response.processor_types:
    if processor_type.allow_creation:
        logger.info(processor_type.type_)


INVOICE_PROCESSOR
CUSTOM_EXTRACTION_PROCESSOR
FORM_PARSER_PROCESSOR
OCR_PROCESSOR
FORM_W9_PROCESSOR
CUSTOM_CLASSIFICATION_PROCESSOR
UTILITY_PROCESSOR
EXPENSE_PROCESSOR
CUSTOM_SPLITTING_PROCESSOR
US_DRIVER_LICENSE_PROCESSOR
US_PASSPORT_PROCESSOR
PURCHASE_ORDER_PROCESSOR
ID_PROOFING_PROCESSOR
SUMMARY_PROCESSOR
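
If the list of creatable types is needed later, for instance to validate a type name before requesting a processor, the same filter can collect the names instead of logging them. This is a sketch under assumptions: `creatable_types` is a hypothetical helper, and the sample objects are stand-ins that only mimic the `type_` and `allow_creation` attributes used above (`EXAMPLE_RESTRICTED_PROCESSOR` is a made-up name).

```python
from types import SimpleNamespace


def creatable_types(processor_types):
    """Return the sorted type names of processor types that allow creation."""
    return sorted(pt.type_ for pt in processor_types if pt.allow_creation)


# Stand-in objects; in the notebook the input would be response.processor_types.
sample = [
    SimpleNamespace(type_="OCR_PROCESSOR", allow_creation=True),
    SimpleNamespace(type_="INVOICE_PROCESSOR", allow_creation=True),
    SimpleNamespace(type_="EXAMPLE_RESTRICTED_PROCESSOR", allow_creation=False),
]
names = creatable_types(sample)
print(names)
print("OCR_PROCESSOR" in names)
```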


#### 4. Create a DocAI Processor
Now let's create a new DocAI processor. In this case we set up an OCR processor (`OCR_PROCESSOR`), one of the creatable types listed in the previous step.

💡 Tip: If you frequently work with specific types of documents, consider creating specialized processors for them.
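
Each run of the creation cell sends a new create request, so it can be worth checking for an existing processor with the same display name first. The sketch below assumes a hypothetical helper, `find_by_display_name`, and uses stand-in objects; in the notebook the iterable would come from `client.list_processors(parent=parent)`.

```python
from types import SimpleNamespace


def find_by_display_name(processors, display_name):
    """Return the first processor whose display_name matches, or None."""
    for processor in processors:
        if processor.display_name == display_name:
            return processor
    return None


# Stand-in objects mimicking the display_name and type_ attributes of a processor.
existing = [
    SimpleNamespace(display_name="doc-processor01", type_="OCR_PROCESSOR"),
    SimpleNamespace(display_name="another-processor", type_="FORM_PARSER_PROCESSOR"),
]
match = find_by_display_name(existing, "doc-processor01")
no_match = find_by_display_name(existing, "doc-processor02")
print(match.type_ if match else "not found")
print(no_match)
```

When a match is found, it can be reused in place of the result of a new create call.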

In [6]:
docai_processor = client.create_processor(
    parent=parent,
    processor=documentai.Processor(
        display_name='doc-processor01',
        type_='OCR_PROCESSOR',
    ),
)
docai_processor

name: "projects/390991481152/locations/us/processors/e02745c109a7cbc4"
type_: "OCR_PROCESSOR"
display_name: "doc-processor01"
state: ENABLED
default_processor_version: "projects/390991481152/locations/us/processors/e02745c109a7cbc4/processorVersions/pretrained-ocr-v1.0-2020-09-23"
process_endpoint: "https://us-documentai.googleapis.com/v1/projects/390991481152/locations/us/processors/e02745c109a7cbc4:process"
create_time {
  seconds: 1698334962
  nanos: 570907000
}
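
The returned `name` is the identifier to keep for later calls. Note that it contains a numeric project identifier rather than the `PROJECT_ID` string. As a minimal sketch using only the standard library, the name can be split into its parts and `create_time.seconds` converted to a UTC timestamp; `parse_processor_name` is a hypothetical helper, and the values are copied from the output above.

```python
import re
from datetime import datetime, timezone

PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name):
    """Split a processor resource name into project, location, and processor ID."""
    match = PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a processor resource name: {name}")
    return match.groupdict()


parts = parse_processor_name("projects/390991481152/locations/us/processors/e02745c109a7cbc4")
created = datetime.fromtimestamp(1698334962, tz=timezone.utc)  # create_time.seconds
print(parts["project"], parts["location"], parts["processor"])
print(created.isoformat())
```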