# Text Extraction using OCR

## Introduction

In this notebook, we will demonstrate the results of using the `ocr-extraction` workflow.

For more information, see [the documentation for this workflow](https://docs.aperturedata.io/workflows/ocr-extraction).

## Setup

This notebook assumes that you have already created the `ocr-extraction` workflow.
You can do this conveniently in the Cloud UI.

You will also need to have added some images to your workflow. For example, you could run the `dataset-ingestion` workflow with the COCO dataset.

In [None]:
from aperturedb.CommonLibrary import create_connector, execute_query
from aperturedb.NotebookHelpers import display as display_images
from aperturedb.Images import Images

## Set up client connection

In [None]:
# This will only work if you have aperturedb installed and configured.
# The configuration is either created by setting an APERTUREDB_KEY environment variable,
# or by creating a configuration using adb config.
# See https://docs.aperturedata.io/Setup/client/adb for more information.
client = create_connector()

# If you wish to explicitly use the Connector class, you can do so like this:
# from aperturedb import Connector
# client = Connector.Connector(host="<YOUR_HOST_NAME_HERE>", user="<YOUR_USERNAME_HERE>", password="<YOUR_PASSWORD_HERE>")

response, _ = client.query([{"GetStatus": {}}])
print(response)

## Find some extracted text

Here we are going to look for extracted text that is associated with an image.
First we find some extracted text.
Then we find the connected images.
We fetch the blob for the image, scaled down.

In [None]:
query = [
    {
        "FindEntity": {
            "with_class": "ExtractedText",
            "results": {"all_properties": True},
            "constraints": {"source_type": ["==", "image"]},
            "limit": 10,
            "_ref": 1,
        }
    },
    {
        "FindImage": {
            "is_connected_to": {"ref": 1},
            "blobs": True,
            "results": {"list": ["_uniqueid"]},
            "operations": [
                {
                    "type": "resize",
                    "width": 200,
                }
            ],
            "group_by_source": True,
        }
    },
]

status, response, blobs = execute_query(client, query)
assert status == 0, response

In [None]:
text_blocks = response[0]["FindEntity"]["entities"]
images_map = response[1]["FindImage"]["entities"]

for text_block in text_blocks:
    text = text_block["text"]
    text_id = text_block["_uniqueid"]
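    # With group_by_source, connected images are keyed by the text entity's _uniqueid.
    connected_images = images_map.get(text_id) or []
    image_index = connected_images[0].get("_blob_index") if connected_images else None
    if image_index is not None:  # 0 is a valid blob index, so do not test truthiness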
        image_blob = blobs[image_index]
        display_images([image_blob])
        print(text)
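
The loop above pairs each text block with its image by looking up the text's `_uniqueid` in the grouped `FindImage` result. The following is a minimal standalone sketch of that lookup. It assumes the grouped result maps each source `_uniqueid` to a list of connected entities, each carrying a `_blob_index` (inferred from how the response is indexed in this notebook). The helper name `first_blob_index`, the IDs, and the sample data are hypothetical.

```python
def first_blob_index(entities_by_source, source_id):
    """Return the blob index of the first entity grouped under source_id, or None."""
    connected = entities_by_source.get(source_id) or []
    if not connected:
        return None
    return connected[0].get("_blob_index")


# Hypothetical response fragments, shaped like the ones assumed above.
sample_images_map = {
    "1.0.100": [{"_uniqueid": "2.0.7", "_blob_index": 0}],
    "1.0.101": [],  # a text block with no connected image
}
sample_blobs = [b"first-image-bytes"]

for text_id in ["1.0.100", "1.0.101", "1.0.102"]:
    index = first_blob_index(sample_images_map, text_id)
    if index is not None:  # 0 is a valid index, so compare against None
        print(text_id, "->", len(sample_blobs[index]), "bytes")
    else:
        print(text_id, "-> no connected image")
```

Comparing against `None` matters because the first blob has index 0, which a plain truthiness check would skip.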

## Extract from image PDF

Now we're going to do the same thing but for image PDFs.
An image PDF is one that has only images of text with no text layer.

In [None]:
query = [
    {
        "FindEntity": {
            "with_class": "ExtractedText",
            "results": {"all_properties": True},
            "constraints": { "source_type": ["==", "pdf"] },
            "limit": 10,
            "_ref": 1,
        }
    },
    {
        "FindBlob": {
            "is_connected_to": {"ref": 1},
            "blobs": True,
            "results": {"list": ["_uniqueid"]},
            "group_by_source": True,
        }
    },
]

status, response, blobs = execute_query(client, query)
assert status == 0, response

In [None]:
import base64
from IPython.display import HTML, display

text_blocks = response[0]["FindEntity"]["entities"]
images_map = response[1]["FindBlob"]["entities"]

for i, text_block in enumerate(text_blocks, start=1):
    text = text_block["text"]
    text = text.replace("\n", " ")
    text_id = text_block["_uniqueid"]
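    # With group_by_source, connected blobs are keyed by the text entity's _uniqueid.
    connected_blobs = images_map.get(text_id) or []
    image_index = connected_blobs[0].get("_blob_index") if connected_blobs else None
    if image_index is not None:  # 0 is a valid blob index, so do not test truthiness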
        image_blob = blobs[image_index]
        base64_pdf = base64.b64encode(image_blob).decode('utf-8')
        pdf_display_html = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="800" height="600"></iframe>'
        display(HTML(pdf_display_html))

        print(f"{i}. {text}")
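
The PDF is shown inline by base64-encoding its bytes into a `data:` URI and using that as the `src` of an `<iframe>`. The same step could be factored into a small helper, sketched below. The function name `pdf_iframe_html` and the placeholder bytes are hypothetical and not part of the workflow or of `aperturedb`.

```python
import base64


def pdf_iframe_html(pdf_bytes, width=800, height=600):
    """Build an <iframe> tag that embeds the given PDF bytes as a base64 data URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        f'<iframe src="data:application/pdf;base64,{encoded}" '
        f'width="{width}" height="{height}"></iframe>'
    )


# Placeholder bytes standing in for a real PDF blob returned by the query.
sample_pdf = b"%PDF-1.4 placeholder"
html = pdf_iframe_html(sample_pdf)
print(html[:41])
```

In a notebook, the returned string can be passed to `display(HTML(...))` as in the cell above. Large PDFs produce very long data URIs, which some browsers may refuse to render, so this approach is best suited to small documents.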