In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Text Extraction with Generative Models on Vertex AI 

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/examples/prompt-design/text_extraction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/prompt-design/text_extraction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/examples/prompt-design/text_extraction.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

Text extraction is a process of extracting text from a document. This can be done manually or automatically. Manual text extraction is the process of reading the document and copying the text into a new document. Automatic text extraction is the process of using software to extract the text from the document.

Text extraction can be used for a variety of purposes. One common purpose is to convert documents into a machine-readable format. This can be useful for storing documents in a database or for processing documents with software. Another common purpose is to extract information from documents. This can be useful for finding specific information in a document or for summarizing the content of a document.

Large language models (LLMs) are good for text extraction because they are trained on massive datasets of text and code, which allows them to learn the relationships between words and phrases. They can also understand the context of text and generate text, which allows them to extract information that is not explicitly stated or fill in the gaps in text that is missing information. The answers from LLMs can also be further improved through methods like few-shot prompting.
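
As a rough illustration of few-shot prompting, the sketch below assembles an instruction, a list of worked examples, and a final query into a single prompt string. The helper name `build_few_shot_prompt` and the `Text:` / `JSON:` labels are assumptions made for this sketch; they are not part of the Vertex AI SDK.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str, examples: List[Tuple[str, str]], query: str
) -> str:
    """Assemble an instruction, worked examples, and a final query into one prompt.

    Hypothetical helper for illustration only.
    """
    parts = [instruction.strip(), ""]
    for text, answer in examples:
        parts.append(f"Text: {text}")
        parts.append(f"JSON: {answer}")
        parts.append("")
    # Leave the final "JSON:" label open so the model completes it.
    parts.append(f"Text: {query}")
    parts.append("JSON:")
    return "\n".join(parts)


few_shot_prompt = build_few_shot_prompt(
    "Extract the technical specifications from the text below in JSON format.",
    [
        (
            "Google Nest WiFi, 2.4GHz and 5GHz frequencies",
            '{"product": "Google Nest WiFi", "frequencies": ["2.4GHz", "5GHz"]}',
        )
    ],
    "Google Pixel 7, 5G network, 8GB RAM",
)
print(few_shot_prompt)
```

Adding more `(text, answer)` pairs to the list turns the one-shot prompt into a few-shot prompt without changing its layout.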

Learn more about extraction prompts in the [official documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/text/extraction-prompts).

### Objective

In this tutorial, you will learn how to use generative models to extract information from text by working through the following examples:
- Google Pixel technical specifications extraction
- WiFi troubleshooting with constraints
- Respond to inquiries in character
- Converting an ingredients list to JSON format
- Organizing the results of a text extraction


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK

In [1]:
!pip install google-cloud-aiplatform google-cloud-core google-cloud-documentai google-cloud-storage simplejson --upgrade --user

Collecting google-cloud-documentai
  Downloading google_cloud_documentai-2.16.0-py2.py3-none-any.whl (275 kB)
Collecting google-cloud-storage
  Downloading google_cloud_storage-2.10.0-py2.py3-none-any.whl (114 kB)
Collecting simplejson
  Downloading simplejson-3.19.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (137 kB)
Installing collected packages: simplejson, google-cloud-storage, google-cloud-documentai
Successfully installed google-cloud-documentai-2.16.0 google-cloud-storage-2.10.0 simplejson-3.19.1


**Colab only:** Uncomment the following cell to restart the kernel, or use the restart button. For Vertex AI Workbench, you can restart the kernel using the button at the top of the notebook.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries


**Colab only:** Uncomment the following cell to initialize the Vertex AI SDK. For Vertex AI Workbench, you don't need to run this.  

In [None]:
# import vertexai

# PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
# vertexai.init(project=PROJECT_ID, location="us-central1")

In [14]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore
from typing import Optional

from vertexai.preview.language_models import TextGenerationModel

### Import models

In [24]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

## Text Extraction

### Google Pixel technical specifications extraction

In this example, you use the PaLM API to extract the technical specifications of a Pixel phone from text and return them in JSON format.

In [41]:
prompt = """
Extract the technical specifications from the text below in JSON format.

Text: Google Nest WiFi, network speed up to 1200Mbps, 2.4GHz and 5GHz frequencies, WPA3 protocol
JSON: {
  "product":"Google Nest WiFi",
  "speed":"1200Mbps",
  "frequencies": ["2.4GHz", "5GHz"],
  "protocol":"WPA3"
}

Text: Google Pixel 7, 5G network, 8GB RAM, Tensor G2 processor, 128GB of storage, Lemongrass
JSON:
"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    ).text
)

{
  "product":"Google Pixel 7",
  "network":"5G",
  "RAM":"8GB",
  "processor":"Tensor G2",
  "storage":"128GB",
  "color":"Lemongrass"
}
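
The model returns its answer as plain text, so downstream code usually needs to turn it into a Python object. The following is a minimal sketch, assuming the response contains exactly one JSON object, possibly surrounded by whitespace or extra text. The helper name `parse_json_response` is hypothetical, and `sample_response` simply repeats the output shown above.

```python
import json


def parse_json_response(response_text: str) -> dict:
    """Parse the JSON object found between the first '{' and the last '}'."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the response")
    return json.loads(response_text[start : end + 1])


# Copy of the model output shown above, used here as sample input.
sample_response = """
{
  "product":"Google Pixel 7",
  "network":"5G",
  "RAM":"8GB",
  "processor":"Tensor G2",
  "storage":"128GB",
  "color":"Lemongrass"
}
"""

specs = parse_json_response(sample_response)
print(specs["processor"])
```

If the model returns malformed JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch to retry or log the failure.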


### WiFi troubleshooting with constraints

In this example, you ask the generative model to answer a question about troubleshooting a Google WiFi router, based on a description of the router's status lights. The prompt instructs the model to respond only with the provided text, which helps reduce the risk of harmful or incorrect answers. Here is how you can do this using the PaLM API.

In [42]:
prompt = """
Answer the question using the text below. Respond with only the text provided.
Question: What should I do to fix my disconnected WiFi? The light on my Google WiFi router is yellow and blinking slowly.

Text:
Color: No light
What it means: Router has no power or the light was dimmed in the app.
What to do:
Check that the power cable is properly connected to your router and to a working wall outlet.
If your device is already set up and the light appears off, check your light brightness settings in the app.
If there's still no light, contact WiFi customer support.

Color: Solid white, no light, solid white
What it means: Device is booting up.
What to do:
Wait for the device to boot up. This takes about a minute. When it's done, it will slowly pulse white, letting you know it's ready for setup.

Color: Slow-pulsing white
What it means: Device is ready for set up.
What to do:
Use the Google Home app to set up your router.

Color: Solid white
What it means: Router is online and all is well.
What to do:
You're online. Enjoy!

Color: Slowly pulsing yellow
What it means: There is a network error.
What to do:
Check that the Ethernet cable is connected to both your router and your modem and both devices are turned on. You might need to unplug and plug in each device again.

Color: Fast blinking yellow
What it means: You are holding down the reset button and are factory resetting this device.
What to do:
If you keep holding down the reset button, after about 12 seconds, the light will turn solid yellow. Once it is solid yellow, let go of the factory reset button.

Color: Solid yellow
What it means: Router is factory resetting.
What to do:
This can take up to 10 minutes. When it's done, the device will reset itself and start pulsing white, letting you know it's ready for setup.

Color: Solid red
What it means: Something is wrong. Critical failure.
What to do:
Factory reset the router. If the light stays red, contact WiFi customer support.
"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=256, top_k=1, top_p=0.8
    ).text
)

There is a network error.
Check that the Ethernet cable is connected to both your router and your modem and both devices are turned on. You might need to unplug and plug in each device again.
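
The instruction to respond only with the provided text is a request, not a guarantee, so it can be worth checking the answer against the source. The sketch below is one simple way to do that, assuming a verbatim match after whitespace normalization is a good enough test. The function name `ungrounded_lines` is hypothetical, and the `source` string is a shortened copy of the prompt text above.

```python
def ungrounded_lines(answer: str, source_text: str) -> list:
    """Return the answer lines that do not appear verbatim in the source text."""
    # Collapse all whitespace so line breaks in the source do not cause misses.
    normalized_source = " ".join(source_text.split())
    missing = []
    for line in answer.splitlines():
        normalized_line = " ".join(line.split())
        if normalized_line and normalized_line not in normalized_source:
            missing.append(line)
    return missing


source = """
Color: Slowly pulsing yellow
What it means: There is a network error.
What to do:
Check that the Ethernet cable is connected to both your router and your modem and both devices are turned on.
"""

answer = (
    "There is a network error.\n"
    "Check that the Ethernet cable is connected to both your router "
    "and your modem and both devices are turned on."
)

print(ungrounded_lines(answer, source))
```

An empty list means every line of the answer was found in the source. A paraphrased answer would fail this strict check even if it is correct.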


### Extract from PDF

The following example integrates the PaLM API with the Document AI API. Document AI first runs OCR on a PDF file (a Non-Disclosure Agreement), and the extracted text is then sent in a prompt to text-bison to create a structured JSON document containing the most important parts of the NDA.

In [36]:
DOCAI_PROJECT_ID = "dventerpriseaisearch"
DOCAI_LOCATION = "eu"  # Format is 'us' or 'eu'
DOCAI_PROCESSOR_ID = "b116c372f85dbbf7"  # Create processor in Cloud Console
DOCUMENT_PATH = "./nda.pdf" # Path of target document

The following function sends the `nda.pdf` file to Document AI for OCR and returns the recognized text.

In [37]:
def get_ocr_text(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: Optional[str] = None,
    processor_version_id: Optional[str] = None,
) -> str:
    # You must set the `api_endpoint` if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # `projects/{project_id}/locations/{location}/processors/{processor_id}`
        name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load binary data
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name, raw_document=raw_document, field_mask=field_mask
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)
    
    return document.text
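
The function above expects the caller to supply `mime_type` explicitly. As a small convenience, the sketch below derives it from the file extension using Python's standard `mimetypes` module. The helper name `guess_document_mime_type` is hypothetical, and the sketch does not check whether Document AI accepts the resulting type.

```python
import mimetypes


def guess_document_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension, or raise if it is unknown."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None:
        raise ValueError(f"Could not determine a MIME type for {file_path!r}")
    return mime_type


print(guess_document_mime_type("./nda.pdf"))
```

The result could then be passed as the `mime_type` argument of `get_ocr_text` instead of a hard-coded string.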

In [38]:
ndaText = get_ocr_text(DOCAI_PROJECT_ID, DOCAI_LOCATION, DOCAI_PROCESSOR_ID, DOCUMENT_PATH, "application/pdf")

The document contains the following text:
Non-Disclosure Agreement
(hereinafter: Agreement)
between
Digital Salt Technologies Pvt
(Ramnord Lab, Jogeshwari West
Mumbai, 400102, India)
effective date
14th of March 2022
and
Dataverse Ltd
(131 Ethnikis Antistaseos st,
Thessaloniki, 55134 Greece)
- hereinafter jointly referred to as “PARTIES” –
NOW IT IS AGREED as follows:
Digital Salt Technologies Pvt contacted Dataverse Ltd (VAT no EL998682224) in order to
purchase a virtual space on www.artsteps.com SaaS digital web platform. For that purpose
(hereinafter referred as the "Purpose of this Agreement"), the PARTIES intend to share certain
information of a confidential nature ("Confidential Information"). The PARTIES therefore wish
to enter into this Agreement to govern the confidentiality obligations between them as either
being the Receiving - or Disclosing PARTY. Both parties agree that this agreement, besides the
parties themselves, binds each party's shareholder, legal representatives, 

Now we construct a suitable prompt and ask PaLM to extract the key fields of the NDA as JSON.
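
The prompt in the next cell lists the requested fields and their JSON keys by hand. As a hedged alternative, the sketch below generates the same kind of numbered list from a dictionary and checks a response for missing keys. The names `NDA_FIELDS`, `build_extraction_prompt`, and `missing_keys` are assumptions made for this sketch and appear nowhere else in the notebook.

```python
import json

# Mapping of JSON key -> field description, mirroring the hand-written prompt.
NDA_FIELDS = {
    "date": "Agreement date",
    "partyOne": "First agreement party official name",
    "partyTwo": "Second agreement party official name",
    "purpose": "Purpose of NDA",
    "summary": "Summary of non-disclosure and confidentiality",
    "court": "Applicable law authority",
}


def build_extraction_prompt(fields: dict, document_text: str) -> str:
    """Build a numbered extraction prompt that ends with an open 'JSON:' label."""
    lines = ["You are a lawyer and you want to extract:", ""]
    for number, (key, description) in enumerate(fields.items(), start=1):
        lines.append(f"{number}. {description} (key: {key})")
    lines += ["", "from the following NDA text in JSON format.", "Text:"]
    lines += [document_text, "", "JSON:"]
    return "\n".join(lines)


def missing_keys(fields: dict, response_json: str) -> list:
    """Return the requested keys that are absent from the model's JSON answer."""
    parsed = json.loads(response_json)
    return [key for key in fields if key not in parsed]


demo_prompt = build_extraction_prompt(NDA_FIELDS, "Non-Disclosure Agreement ...")
print(demo_prompt)
print(missing_keys(NDA_FIELDS, '{"date": "14th of March 2022"}'))
```

Keeping the field list in one place means the prompt and the validation step cannot drift apart when a field is added or renamed.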

In [40]:
prompt = """
You are a lawyer and you want to extract:

1. Agreement date (key: date)
2. First agreement party official name (key: partyOne)
3. Second agreement party official name (key: partyTwo)
4. Purpose of NDA (key: purpose)
5. Summary of non-disclosure and confidentiality (key: summary)
6. Applicable law authority (key: court)

from the following NDA text in JSON format.
Text:
"""

prompt += ndaText
prompt += """

JSON:
"""

print(prompt)
print("-------------------------------")
print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    ).text
)


You are a lawyer and you want to extract:

1. Agreement date (key: date)
2. First agreement party official name (key: partyOne)
3. Second agreement party official name (key: partyTwo)
4. Purpose of NDA (key: purpose)
5. Summary of non-disclosure and confidentiality (key: summary)
6. Applicable law authority (key: court)

from the following NDA text in JSON format.
Text:
Non-Disclosure Agreement
(hereinafter: Agreement)
between
Digital Salt Technologies Pvt
(Ramnord Lab, Jogeshwari West
Mumbai, 400102, India)
effective date
14th of March 2022
and
Dataverse Ltd
(131 Ethnikis Antistaseos st,
Thessaloniki, 55134 Greece)
- hereinafter jointly referred to as “PARTIES” –
NOW IT IS AGREED as follows:
Digital Salt Technologies Pvt contacted Dataverse Ltd (VAT no EL998682224) in order to
purchase a virtual space on www.artsteps.com SaaS digital web platform. For that purpose
(hereinafter referred as the "Purpose of this Agreement"), the PARTIES intend to share certain
information of a confidentia