Skip to content

cyber-nic/go-gcp-doc-ai

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

go-gcp-doc-ai

This project is a quick and effective way to OCR files in a GCP bucket. It attempts to handle failures by tracking already processed files enabling itself to pickup where it leftoff. The project makes reasonable performance for cost tradeoffs such as tracking processed files using a bucket rather than a database.

  • The application is tailored to operated in GCP.
  • The code is deployed as a Cloud Function.
  • The application:
    • reads from an src bucket
    • reads and writes image CRC32 hashes to a refs bucket
    • utilizes the Document AI OCR batch capabilities, which writes to a dst bucket
    • writes errors to an err bucket
  • The application can be triggered/invoked with via http. It runs until completion

https://cloud.google.com/functions/docs/running/functions-emulator#cloudevent-function

Back-of-the-envelope calculations

Quotas, Limitations and Configurations

Assumptions and Calculations

  • Concurrent worker functions/document ai batch processes (1:1 mapping): 5

batch 100

  • Files per batch processing request: 100
  • OCR per 30s: 5*100=500
  • Expected completed OCR per 30s, 1min, 1hour, 1day: 5*100=500, 1000, 60k, 1.44M
  • Expected duration to process 2.5M pages: 42h

Natural Language

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages