# 🚀 TABULATE DEMO

The **purpose** of this notebook:
- demonstrate how to call the Tabulate API from Python
- run custom feature extraction for text documents

============================================================

# 1. PREPARATIONS

First, make sure to deploy the Tabulate stack by following the [README file](README.md).

This section imports required dependencies and connects to the AWS account.

In [32]:
### REQUIREMENTS

import os

In [33]:
### AWS CREDENTIALS

"""
Run this cell when running the notebook locally; skip it when running on SageMaker.
"""

# Option 1
# os.environ["AWS_ACCESS_KEY_ID"] = "XXX"
# os.environ["AWS_SECRET_ACCESS_KEY"] = "XXX"
# os.environ["AWS_SESSION_TOKEN"] = "XXX"

# Option 2
os.environ["AWS_PROFILE"] = "XXX"

In [34]:
### ARN & BUCKET

# add the ARN of the Tabulate Step Functions state machine and the S3 bucket name
# both are displayed in the Terminal output after running `cdk deploy`

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:081277383238:stateMachine:tabulate-StepFunctions"
BUCKET_NAME = "tabulate-data-081277383238"
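
A copy-paste slip in the ARN usually surfaces only later, as a failed API call. The check below is a minimal sketch that assumes the standard colon-separated layout of a Step Functions state machine ARN. `parse_state_machine_arn` is a hypothetical helper and not part of Tabulate.

```python
def parse_state_machine_arn(arn: str) -> dict:
    """Split a state machine ARN into named fields (hypothetical helper, not part of Tabulate)."""
    parts = arn.split(":", 6)
    if len(parts) != 7 or parts[0] != "arn":
        raise ValueError(f"not a state machine ARN: {arn!r}")
    _, partition, service, region, account_id, resource_type, name = parts
    if service != "states" or resource_type != "stateMachine":
        raise ValueError(f"not a state machine ARN: {arn!r}")
    return {"partition": partition, "region": region, "account_id": account_id, "name": name}


info = parse_state_machine_arn(
    "arn:aws:states:us-east-1:081277383238:stateMachine:tabulate-StepFunctions"
)
print(info["region"], info["account_id"], info["name"])
```

Comparing the parsed region and account ID against the active AWS profile is a quick way to confirm that the notebook points at the stack that was actually deployed.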

============================================================

# 2. DEMO

In [35]:
### DOCUMENT TEXTS

# upload documents to the Tabulate bucket
local_path = "originals"
s3_path = f"s3://{BUCKET_NAME}/originals"
!aws s3 cp $local_path $s3_path --recursive

documents = [
    "originals/email_1.txt",
    "originals/email_2.txt",
    "originals/email_3.txt",
]

upload: originals/email_1.txt to s3://tabulate-data-081277383238/originals/email_1.txt
upload: originals/email_3.txt to s3://tabulate-data-081277383238/originals/email_3.txt
upload: originals/email_2.txt to s3://tabulate-data-081277383238/originals/email_2.txt
upload: originals/code-sample-catalog.pdf to s3://tabulate-data-081277383238/originals/code-sample-catalog.pdf
upload: originals/cloud-adoption-framework.pdf to s3://tabulate-data-081277383238/originals/cloud-adoption-framework.pdf


In [None]:
### HELPER FUNCTION

from utils import run_tabulate_api
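
The source of `run_tabulate_api` is not shown in this notebook. As a rough sketch of what such a helper could do, the code below builds an execution input and runs it through a standard (non-Express) Step Functions workflow with `boto3`. The payload key names and both function names are assumptions made for illustration; the real helper in `utils` may differ.

```python
import json
import time


def build_payload(documents, attributes, instructions=None, model_params=None, parsing_mode=None):
    """Assemble the execution input; the key names are assumed, not taken from Tabulate."""
    payload = {"documents": list(documents), "attributes": list(attributes)}
    optional = {"instructions": instructions, "model_params": model_params, "parsing_mode": parsing_mode}
    # Leave out optional fields that were not provided
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


def run_state_machine(payload, state_machine_arn, poll_seconds=5):
    """Start an execution and block until it leaves the RUNNING state."""
    import boto3  # imported lazily so build_payload works without AWS dependencies

    client = boto3.client("stepfunctions")
    execution = client.start_execution(stateMachineArn=state_machine_arn, input=json.dumps(payload))
    while True:
        status = client.describe_execution(executionArn=execution["executionArn"])
        if status["status"] != "RUNNING":
            break
        time.sleep(poll_seconds)
    if status["status"] != "SUCCEEDED":
        raise RuntimeError(f"execution ended with status {status['status']}")
    return json.loads(status["output"])


payload = build_payload(
    documents=["originals/email_1.txt"],
    attributes=[{"name": "language", "description": "two-letter language code of the email"}],
)
print(json.dumps(payload, indent=2))
```

Only `build_payload` runs here; `run_state_machine` needs valid AWS credentials and a deployed state machine.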

### Level 1: Extract Well-Defined Entities

In this example, we extract well-defined entities, including `customer_name`, `shipment_id`, and email `language`. For each entity, we provide the entity name and a brief description.
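
Since every attribute is a plain dictionary, a small check before calling the API can catch a missing `name` or `description` early. This is a minimal sketch; `validate_attributes` is a hypothetical helper, and the rule that names must be unique is our assumption rather than a documented Tabulate requirement.

```python
def validate_attributes(attributes):
    """Return a list of problems found in an attribute list (an empty list means OK)."""
    problems = []
    seen = set()
    for index, attribute in enumerate(attributes):
        # Both keys must be present and hold non-empty strings
        for key in ("name", "description"):
            value = attribute.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"attribute {index}: missing or empty '{key}'")
        name = attribute.get("name")
        if isinstance(name, str) and name.strip():
            if name in seen:
                problems.append(f"attribute {index}: duplicate name '{name}'")
            seen.add(name)
    return problems


sample = [
    {"name": "customer_name", "description": "name of the customer who wrote the email"},
    {"name": "shipment_id"},
]
print(validate_attributes(sample))
```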

In [None]:
### DEFINE ATTRIBUTES

attributes = [
    {"name": "customer_name", "description": "name of the customer who wrote the email"},
    {"name": "shipment_id", "description": "unique shipment identifier"},
    {"name": "language", "description": "two-letter language code of the email"},
]

In [None]:
### RUN ATTRIBUTE EXTRACTION

run_tabulate_api(documents=documents, attributes=attributes, state_machine_arn=STATE_MACHINE_ARN)

### Level 2: Assign Custom Numeric Scores

In this example, we extract custom numeric scores that describe the texts, including `sentiment` and shipment `delay`. Since these are not well-defined entities, we provide a more verbose description to calibrate the LLM. 

We also provide optional document-level `instructions`.
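
Formatting instructions guide the LLM but do not guarantee the output format, so it can help to normalize numeric values after the call. The sketch below assumes the results use the same `file_key` / `attributes` layout as the output printed at the end of this notebook. `round_numeric_attributes` is a hypothetical helper, and the sample values are made up for illustration.

```python
def round_numeric_attributes(results, ndigits=1):
    """Round every attribute value that parses as a number; keep all other values unchanged."""
    rounded = []
    for item in results:
        attributes = {}
        for name, value in item["attributes"].items():
            try:
                attributes[name] = round(float(value), ndigits)
            except (TypeError, ValueError):
                attributes[name] = value  # not a number, leave as returned
        rounded.append({"file_key": item["file_key"], "attributes": attributes})
    return rounded


# Made-up sample in the assumed result layout
sample_results = [
    {"file_key": "originals/email_1.txt", "attributes": {"sentiment": 0.8234, "delay": "7", "note": "n/a"}},
]
cleaned = round_numeric_attributes(sample_results)
print(cleaned)
```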

In [None]:
### DEFINE ATTRIBUTES

DELAY_DESCRIPTION = """Delay of the shipment in days.

Example email: I have been waiting for my shipment for a week now! Where is it?
Delay: 7

Example email: The shipment is supposed to arrive today, can you send me the tracking number?
Delay: 0
"""

attributes = [
    {
        "name": "sentiment",
        "description": "Sentiment score between 0 and 1. 0 refers to a very negative text, 1 is a very positive text, and 0.5 is neutral text",
    },
    {
        "name": "delay",
        "description": DELAY_DESCRIPTION,
    },
]

instructions = "All numbers must have at most 1 digit after the decimal point"

In [None]:
### RUN ATTRIBUTE EXTRACTION

run_tabulate_api(
    documents=documents, attributes=attributes, instructions=instructions, state_machine_arn=STATE_MACHINE_ARN
)

### Level 3: Generate Text-Based Features

In this example, we extract custom text snippets generated by the LLM, such as `summary` and `response`. 

We also:
- specify the LLM inference parameters as part of the API call to select a suitable model for our task.
- provide additional high-level instructions. These are optional and can contain formatting instructions, input-output examples, and more.
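
A typo in the inference parameters is cheaper to catch locally than inside a running workflow. The sketch below checks the three keys used in this notebook. `check_model_params` is a hypothetical helper; treating all three keys as required is our assumption, and the 0 to 1 temperature range is the one Anthropic documents for Claude models.

```python
def check_model_params(model_params):
    """Return a list of problems found in the inference parameters (an empty list means OK)."""
    problems = []
    model_id = model_params.get("model_id")
    if not isinstance(model_id, str) or not model_id:
        problems.append("model_id must be a non-empty string")
    output_length = model_params.get("output_length")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(output_length, bool) or not isinstance(output_length, int) or output_length <= 0:
        problems.append("output_length must be a positive integer")
    temperature = model_params.get("temperature")
    if (
        isinstance(temperature, bool)
        or not isinstance(temperature, (int, float))
        or not 0.0 <= temperature <= 1.0
    ):
        problems.append("temperature must be a number between 0 and 1")
    return problems


params = {
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "output_length": 256,
    "temperature": 0.0,
}
print(check_model_params(params))
```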

In [None]:
### DEFINE LLM INFERENCE PARAMS

model_params = {
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "output_length": 256,
    "temperature": 0.0,
}

In [None]:
### DEFINE ATTRIBUTES

attributes = [
    {
        "name": "summary",
        "description": "One-sentence summary of the issue mentioned in the email",
        "type": "character",
    },
    {
        "name": "response",
        "description": "suggested response to the customer email that aims to resolve the issue",
        "type": "character",
    },
]

In [None]:
### CUSTOM INSTRUCTIONS

instructions = "Provide all attribute values in Spanish"

In [None]:
### RUN ATTRIBUTE EXTRACTION

run_tabulate_api(
    documents=documents,
    attributes=attributes,
    instructions=instructions,
    model_params=model_params,
    state_machine_arn=STATE_MACHINE_ARN,
)

### Level 4: Use LLM As Parser

When processing PDF, JPG and PNG documents, you can use either `Amazon Textract` or `Amazon Bedrock` for parsing.

Let's take the sample PDF documents uploaded earlier and extract information with an Anthropic Claude model:

In [36]:
### PDF DOCUMENTS

documents = [
    "originals/code-sample-catalog.pdf",
    "originals/cloud-adoption-framework.pdf",
]

In [37]:
### DEFINE ATTRIBUTES

attributes = [
    {
        "name": "summary",
        "description": "Summary of the document",
    },
]

In [38]:
### DEFINE PARAMS

parsing_mode = "Amazon Bedrock"

model_params = {
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "output_length": 256,
    "temperature": 0.0,
}

In [39]:
### RUN ATTRIBUTE EXTRACTION

run_tabulate_api(
    documents=documents,
    attributes=attributes,
    parsing_mode=parsing_mode,
    model_params=model_params,
    state_machine_arn=STATE_MACHINE_ARN,
)

[{'file_key': 'originals/code-sample-catalog.pdf',
  'attributes': {'summary': 'The AWS Code Sample Catalog has moved to the AWS Code Examples Library.'}},
 {'file_key': 'originals/cloud-adoption-framework.pdf',
  'attributes': {'summary': 'The document provides an overview of the AWS Cloud Adoption Framework (AWS CAF), which leverages AWS experience and best practices to help organizations digitally transform and accelerate their business outcomes through innovative use of AWS. The framework identifies capabilities across six perspectives - Business, People, Governance, Platform, Security, and Operations - that organizations can leverage to improve their cloud readiness and transformation journey. The document covers the cloud transformation value chain, foundational capabilities, and the four iterative and incremental cloud transformation phases of Envision, Align, Launch, and Scale.'}}]
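
The returned list of dictionaries maps naturally onto a table with one row per document and one column per attribute. A minimal sketch using only the standard library is shown below; `results_to_rows` and `rows_to_csv` are hypothetical helpers, and the second summary is shortened for brevity.

```python
import csv
import io


def results_to_rows(results):
    """Flatten each result into one dict: file_key plus one column per attribute."""
    return [{"file_key": item["file_key"], **item["attributes"]} for item in results]


def rows_to_csv(rows):
    """Serialize rows to CSV text; columns follow first-seen order, missing values stay empty."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


results = [
    {
        "file_key": "originals/code-sample-catalog.pdf",
        "attributes": {"summary": "The AWS Code Sample Catalog has moved to the AWS Code Examples Library."},
    },
    {
        "file_key": "originals/cloud-adoption-framework.pdf",
        "attributes": {"summary": "Overview of the AWS Cloud Adoption Framework."},  # shortened
    },
]
rows = results_to_rows(results)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The same rows can also be passed to a dataframe library for further analysis if one is available in the environment.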