Ocr adapters #4
Merged
8 commits
9720a2e OCR adapter changes (gaya3-zipstack)
a09f4b8 Fix function signature (gaya3-zipstack)
2d1079e Add .idea to be ignored by Git commits (gaya3-zipstack)
569cad3 Merge branch 'main' into ocr-adapters (gaya3-zipstack)
09f60c5 Roll up version for adapter changes for OCR (gaya3-zipstack)
c70cd1e Remove unwanted space (gaya3-zipstack)
39187f0 Private method func name refactoring (gaya3-zipstack)
acb9ef7 Changes to support byte and string content types for x2text adapters (gaya3-zipstack)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| from unstract.adapters import AdapterDict | ||
| from unstract.adapters.ocr.register import OCRRegistry | ||
| | ||
| adapters: AdapterDict = {} | ||
| OCRRegistry.register_adapters(adapters) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| class FileType: | ||
|     TEXT_PLAIN = "text/plain" | ||
|     IMAGE_JPEG = "image/jpeg" | ||
|     IMAGE_PNG = "image/png" | ||
|     IMAGE_TIFF = "image/tiff" | ||
|     IMAGE_BMP = "image/bmp" | ||
|     IMAGE_GIF = "image/gif" | ||
|     IMAGE_WEBP = "image/webp" | ||
|     APPLICATION_PDF = "application/pdf" | ||
|     ALLOWED_TYPES = [ | ||
|         IMAGE_JPEG, | ||
|         IMAGE_PNG, | ||
|         IMAGE_TIFF, | ||
|         IMAGE_BMP, | ||
|         IMAGE_GIF, | ||
|         IMAGE_WEBP, | ||
|         APPLICATION_PDF, | ||
|     ] |
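
Review note: the allow-list above is later checked against a MIME type guessed from the first bytes of the input file. The following is a minimal sketch of that idea, not the PR's implementation. It assumes a hypothetical helper `sniff_mime` with a hand-written magic-byte table, whereas the PR delegates detection to the `filetype` package; `is_supported` is likewise a name invented for illustration.

```python
# Mirrors FileType.ALLOWED_TYPES from the diff above.
ALLOWED_TYPES = [
    "image/jpeg",
    "image/png",
    "image/tiff",
    "image/bmp",
    "image/gif",
    "image/webp",
    "application/pdf",
]

# Hypothetical magic-byte table covering only the allowed types.
# The PR itself relies on the `filetype` package for detection.
_MAGIC_PREFIXES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"BM", "image/bmp"),
    (b"%PDF", "application/pdf"),
]


def sniff_mime(sample: bytes) -> str:
    """Guess a MIME type from leading bytes, falling back to text/plain."""
    for prefix, mime in _MAGIC_PREFIXES:
        if sample.startswith(prefix):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if sample[:4] == b"RIFF" and sample[8:12] == b"WEBP":
        return "image/webp"
    return "text/plain"


def is_supported(sample: bytes) -> bool:
    """Return True when the guessed type is in the allow-list."""
    return sniff_mime(sample) in ALLOWED_TYPES
```

Because `text/plain` is the fallback and is not in `ALLOWED_TYPES`, any unrecognised input is treated as unsupported in this sketch.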
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| # Unstract Google Document AI OCR Adapter |
src/unstract/adapters/ocr/google_document_ai/pyproject.toml
26 changes: 26 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| [build-system] | ||
| requires = ["pdm-backend"] | ||
| build-backend = "pdm.backend" | ||
| | ||
| | ||
| [project] | ||
| name = "unstract-googledocumentai-ocr" | ||
| version = "0.0.1" | ||
| description = "Google Document AI OCR" | ||
| authors = [ | ||
|     {name = "Zipstack Inc.", email = "devsupport@zipstack.com"}, | ||
| ] | ||
| dependencies = [ | ||
| | ||
| ] | ||
| requires-python = ">=3.9" | ||
| readme = "README.md" | ||
| classifiers = [ | ||
|     "Programming Language :: Python" | ||
| ] | ||
| license = {text = "MIT"} | ||
| | ||
| [tool.pdm.build] | ||
| includes = ["src"] | ||
| package-dir = "src" | ||
| # source-includes = ["tests"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| # Unstract Google Document AI OCR Adapter |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| from .google_document_ai import GoogleDocumentAI | ||
| | ||
| metadata = { | ||
|     "name": GoogleDocumentAI.__name__, | ||
|     "version": "1.0.0", | ||
|     "adapter": GoogleDocumentAI, | ||
|     "description": "Google Document AI OCR adapter", | ||
|     "is_active": True, | ||
| } |
src/unstract/adapters/ocr/google_document_ai/src/google_document_ai.py
174 changes: 174 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,174 @@ | ||
| import base64 | ||
| import json | ||
| import logging | ||
| import os | ||
| from typing import Any, Optional | ||
| | ||
| import requests | ||
| from filetype import filetype | ||
| from google.auth.transport import requests as google_requests | ||
| from google.oauth2.service_account import Credentials | ||
| | ||
| from unstract.adapters.exceptions import AdapterError | ||
| from unstract.adapters.ocr.constants import FileType | ||
| from unstract.adapters.ocr.ocr_adapter import OCRAdapter | ||
| | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| | ||
| class GoogleDocumentAIKey: | ||
|     RAW_DOCUMENT = "rawDocument" | ||
|     MIME_TYPE = "mimeType" | ||
|     CONTENT = "content" | ||
|     SKIP_HUMAN_REVIEW = "skipHumanReview" | ||
|     FIELD_MASK = "fieldMask" | ||
| | ||
| | ||
| class Constants: | ||
|     URL = "url" | ||
|     CREDENTIALS = "credentials" | ||
|     CREDENTIAL_SCOPES = ["https://www.googleapis.com/auth/cloud-platform"] | ||
| | ||
| | ||
| class GoogleDocumentAI(OCRAdapter): | ||
|     def __init__(self, settings: dict[str, Any]): | ||
|         super().__init__("GoogleDocumentAI") | ||
|         self.config = settings | ||
|         google_service_account = self.config.get(Constants.CREDENTIALS) | ||
|         if not google_service_account: | ||
|             logger.error("Google service account not found") | ||
|         else: | ||
|             self.google_service_account = json.loads(google_service_account) | ||
| | ||
|     @staticmethod | ||
|     def get_id() -> str: | ||
|         return "googledocumentai|1013f64b-ecc9-4e35-b986-aebd60fb55d7" | ||
| | ||
|     @staticmethod | ||
|     def get_name() -> str: | ||
|         return "GoogleDocumentAI" | ||
| | ||
|     @staticmethod | ||
|     def get_description() -> str: | ||
|         return "Google Document AI OCR" | ||
| | ||
|     @staticmethod | ||
|     def get_icon() -> str: | ||
|         return ( | ||
|             "https://storage.googleapis.com/pandora-static/" | ||
|             "adapter-icons/GoogleDocumentAI.png" | ||
|         ) | ||
| | ||
|     @staticmethod | ||
|     def get_json_schema() -> str: | ||
|         schema_path = f"{os.path.dirname(__file__)}/static/json_schema.json" | ||
|         with open(schema_path) as f: | ||
|             schema = f.read() | ||
|         return schema | ||
| | ||
|     # Construct the request body to be sent to Google AI Document server | ||
| | ||
|     def _get_request_body( | ||
|         self, file_type_mime: str, file_content_in_bytes: bytes | ||
|     ) -> dict[str, Any]: | ||
|         return { | ||
|             GoogleDocumentAIKey.RAW_DOCUMENT: { | ||
|                 GoogleDocumentAIKey.MIME_TYPE: file_type_mime, | ||
|                 GoogleDocumentAIKey.CONTENT: base64.b64encode( | ||
|                     file_content_in_bytes | ||
|                 ).decode("utf-8"), | ||
|             }, | ||
|             GoogleDocumentAIKey.SKIP_HUMAN_REVIEW: True, | ||
|             GoogleDocumentAIKey.FIELD_MASK: "text", | ||
|         } | ||
| | ||
|     # Construct the request headers to be sent | ||
|     # to Google AI Document server | ||
| | ||
|     def _get_request_headers(self) -> dict[str, Any]: | ||
|         credentials = Credentials.from_service_account_info( | ||
|             self.google_service_account, scopes=Constants.CREDENTIAL_SCOPES | ||
|         ) | ||
|         credentials.refresh(google_requests.Request()) | ||
| | ||
|         return { | ||
|             "Content-Type": "application/json; charset=utf-8", | ||
|             "Authorization": f"Bearer {credentials.token}", | ||
|         } | ||
| | ||
|     # Detect the mime type from the file content | ||
| | ||
|     def _get_input_file_type_mime(self, input_file_path: str) -> str: | ||
|         with open(input_file_path, mode="rb") as file_obj: | ||
|             sample_contents = file_obj.read(100) | ||
|             file_type = filetype.guess(sample_contents) | ||
| | ||
|         file_type_mime: str = ( | ||
|             file_type.MIME if file_type else FileType.TEXT_PLAIN | ||
|         ) | ||
| | ||
|         if file_type_mime not in FileType.ALLOWED_TYPES: | ||
|             logger.error(f"Input file type not supported: {file_type_mime}") | ||
| | ||
|         logger.info(f"file: `{input_file_path} [{file_type_mime}]`\n\n") | ||
| | ||
|         return file_type_mime | ||
| | ||
|     def process( | ||
|         self, input_file_path: str, output_file_path: Optional[str] = None | ||
|     ) -> str: | ||
|         try: | ||
|             if os.path.isfile(input_file_path): | ||
|                 # The with statement closes the file; no explicit cleanup needed. | ||
|                 with open(input_file_path, "rb") as fop: | ||
|                     file_content_in_bytes: bytes = fop.read() | ||
|             else: | ||
|                 raise AdapterError(f"File not found {input_file_path}") | ||
|             file_type_mime = self._get_input_file_type_mime(input_file_path) | ||
|             processor_url = self.config.get(Constants.URL, "") + ":process" | ||
|             headers = self._get_request_headers() | ||
|             data = self._get_request_body( | ||
|                 file_type_mime=file_type_mime, | ||
|                 file_content_in_bytes=file_content_in_bytes, | ||
|             ) | ||
|             response = requests.post(processor_url, headers=headers, json=data) | ||
|             if response.status_code != 200: | ||
|                 logger.error( | ||
|                     f"Error while calling Google Document AI: {response.text}" | ||
|                 ) | ||
|                 raise AdapterError( | ||
|                     f"{response.status_code} - {response.reason}" | ||
|                 ) | ||
|             response_json: dict[str, Any] = response.json() | ||
|             result_text: str = response_json["document"]["text"] | ||
|             if output_file_path is not None: | ||
|                 with open(output_file_path, "w", encoding="utf-8") as f: | ||
|                     f.write(result_text) | ||
|             return result_text | ||
|         except Exception as e: | ||
|             logger.error(f"Error while processing document {e}") | ||
|             if not isinstance(e, AdapterError): | ||
|                 raise AdapterError(str(e)) from e | ||
|             else: | ||
|                 raise e | ||
| | ||
|     def test_connection(self) -> bool: | ||
|         try: | ||
|             url = self.config.get(Constants.URL, "") | ||
|             headers = self._get_request_headers() | ||
|             response = requests.get(url, headers=headers) | ||
|             if response.status_code != 200: | ||
|                 logger.error( | ||
|                     f"Error while testing Google Document AI: {response.text}" | ||
|                 ) | ||
|                 raise AdapterError( | ||
|                     f"{response.status_code} - {response.reason}" | ||
|                 ) | ||
|             else: | ||
|                 return True | ||
|         except Exception as e: | ||
|             logger.error(f"Error occurred while testing adapter {e}") | ||
|             if not isinstance(e, AdapterError): | ||
|                 raise AdapterError(str(e)) from e | ||
|             else: | ||
|                 raise e |
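
Review note: the request and response shapes used by `process` can be exercised without network access. The sketch below is an illustration under stated assumptions, not the adapter's code: `build_request_body` and `extract_text` are hypothetical standalone functions that mirror `_get_request_body` and the `response_json["document"]["text"]` lookup from the diff, using the same JSON keys (`rawDocument`, `mimeType`, `content`, `skipHumanReview`, `fieldMask`).

```python
import base64
from typing import Any


def build_request_body(mime_type: str, content: bytes) -> dict[str, Any]:
    """Hypothetical mirror of _get_request_body: base64-encode the raw file."""
    return {
        "rawDocument": {
            "mimeType": mime_type,
            # JSON cannot carry raw bytes, so the content is sent as base64 text.
            "content": base64.b64encode(content).decode("utf-8"),
        },
        "skipHumanReview": True,
        # Ask the service to return only the extracted text.
        "fieldMask": "text",
    }


def extract_text(response_json: dict[str, Any]) -> str:
    """Hypothetical helper: pull the OCR text out of a response payload."""
    try:
        return response_json["document"]["text"]
    except (KeyError, TypeError) as exc:
        # A clearer error than a bare KeyError when the payload is an error body.
        raise ValueError("Response has no document.text field") from exc
```

Usage: `build_request_body("application/pdf", data)` yields a dictionary that can be passed as the `json=` argument of `requests.post`, as the adapter does.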
src/unstract/adapters/ocr/google_document_ai/src/static/json_schema.json
30 changes: 30 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| { | ||
|   "title": "Google Document AI OCR", | ||
|   "type": "object", | ||
|   "required": [ | ||
|     "adapter_name", | ||
|     "url", | ||
|     "credentials" | ||
|   ], | ||
|   "properties": { | ||
|     "adapter_name": { | ||
|       "type": "string", | ||
|       "title": "OCR Adapter ID", | ||
|       "default": "", | ||
|       "description": "Provide a unique name for this adapter instance. Example: google-document-ai-1" | ||
|     }, | ||
|     "url": { | ||
|       "type": "string", | ||
|       "title": "URL", | ||
|       "default": "", | ||
|       "format": "uri", | ||
|       "description": "The URL of the Google Document AI endpoint for the processor. Example: https://{endpoint}/v1/projects/{project}/locations/{location}/processors/{processor}" | ||
|     }, | ||
|     "credentials": { | ||
|       "type": "string", | ||
|       "title": "Google Service Account", | ||
|       "default": "", | ||
|       "description": "Service Account in JSON format" | ||
|     } | ||
|   } | ||
| } |
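
Review note: the schema marks `adapter_name`, `url` and `credentials` as required, and `credentials` is a string that the adapter later parses with `json.loads`. A minimal sketch of checking a settings dictionary up front, assuming a hypothetical helper `validate_settings` that is not part of this PR and uses only the standard library rather than a JSON Schema validator:

```python
import json
from typing import Any

# Required keys copied from the "required" array of the schema above.
REQUIRED_KEYS = ("adapter_name", "url", "credentials")


def validate_settings(settings: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical check: return the parsed service account or raise ValueError.

    Assumption: empty strings (the schema defaults) are treated as missing,
    which is stricter than what "required" alone enforces.
    """
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    try:
        service_account = json.loads(settings["credentials"])
    except json.JSONDecodeError as exc:
        raise ValueError("credentials is not valid JSON") from exc
    if not isinstance(service_account, dict):
        raise ValueError("credentials must be a JSON object")
    return service_account
```

Failing early like this would surface a missing or malformed service account at configuration time instead of on the first request.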