## What is Rhubarb?
---

Rhubarb is a lightweight Python framework that makes it easy to build document understanding applications using multi-modal Large Language Models (LLMs). Rhubarb is built from the ground up to work with Amazon Bedrock and supports multiple foundation models, including Anthropic Claude Sonnet, Haiku, and Opus models, as well as Amazon Nova Pro and Nova Lite models, for document understanding and analysis.


## What can I do with Rhubarb?
---

Rhubarb can do multiple document processing tasks such as

- ✅ Document Q&A
- ✅ Streaming chat with documents (Q&A)
- ✅ Document Summarization
  - 🚀 Page level summaries
  - 🚀 Full summaries
  - 🚀 Summaries of specific pages
  - 🚀 Streaming Summaries
- ✅ Extraction based on a JSON schema
  - 🚀 Key-value extractions
  - 🚀 Table extractions
- ✅ Named entity recognition (NER) 
  - 🚀 With 50 built-in common entities
- ✅ PII recognition with built-in entities
- ✅ Figure and image understanding from documents
- ✅ Document classification with Multi-modal Language models
- ✅ Document classification with vector sampling using Multi-modal embedding models

Rhubarb comes with built-in system prompts that make it easy to use for a number of different document understanding use cases. You can customize Rhubarb by passing in your own system prompts. It supports output generation that follows an exact JSON schema, which makes it easy to integrate into downstream applications.

- Supports PDF, TIFF, DOCX, PNG, JPG files
- Performs document to image conversion internally to work with the multi-modal models
- Works on local files or files stored in S3
- Supports specifying page numbers for multi-page documents
- Supports chat-history based chat for documents
- Supports streaming and non-streaming mode
- Supports Converse API 
- Supports Cross-Region Inference

## How do I use Rhubarb?
---

Start by installing Rhubarb using `pip`.

In [None]:
!python -m pip install pyrhubarb

Initialize Boto3 Session

In [1]:
import boto3
session = boto3.Session(profile_name="anjanavb+demo1-Admin")

### Basic Usage - Q&A with local file
---

Initialize `DocAnalysis` with a local file and a boto3 session, then call the `run` method to get a response back. In its default form, it uses a default system prompt.

In [2]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,)
resp = da.run(message="What is the employee's name?")
resp

{'output': [{'page': 1,
   'detected_languages': ['English'],
   'content': 'Martha C Rivera'}],
 'token_usage': {'input_tokens': 5093,
  'output_tokens': 45,
  'total_tokens': 5138}}
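
Every response carries a `token_usage` dictionary, which is handy for tracking cost across several calls. The sketch below sums usage over a list of responses. `add_usage` is a hypothetical helper, not part of Rhubarb, and it relies only on the key names visible in the outputs in this notebook. Some outputs omit `total_tokens`, so the helper falls back to input plus output in that case.

```python
def add_usage(responses):
    """Sum token usage across Rhubarb-style response dicts (hypothetical helper)."""
    totals = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
    for resp in responses:
        usage = resp.get("token_usage", {})
        inp = usage.get("input_tokens", 0)
        out = usage.get("output_tokens", 0)
        totals["input_tokens"] += inp
        totals["output_tokens"] += out
        # Fall back to input + output when total_tokens is absent.
        totals["total_tokens"] += usage.get("total_tokens", inp + out)
    return totals
```

Calling `add_usage([resp])` after each `run` gives a running total you can log next to the answer.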

### Basic Usage - Q&A with S3 file
---

Initialize `DocAnalysis` with a file in S3 and a boto3 session, then call the `run` method to get a response back. In its default form, it uses a default system prompt.

In [3]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="s3://<your-bucket>/<prefix>/employee_enrollment.pdf", 
                 boto3_session=session)
resp = da.run(message="What is the employee's name?")
resp

{'output': [{'page': 1,
   'detected_languages': ['English'],
   'content': 'Martha C Rivera'},
  {'page': 2,
   'detected_languages': ['English'],
   'content': 'Answer not found'},
  {'page': 3,
   'detected_languages': ['English'],
   'content': 'Answer not found'}],
 'token_usage': {'input_tokens': 5093,
  'output_tokens': 113,
  'total_tokens': 5206}}
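
With the default system prompt, the response has one entry per page, and pages without an answer carry the text `Answer not found`. A minimal sketch for keeping only the pages that answered is shown below. `found_answers` is a hypothetical helper, and the placeholder string is taken from the output above rather than from any documented constant.

```python
NOT_FOUND = "Answer not found"  # placeholder string as seen in the output above

def found_answers(resp):
    """Return {page: content} for pages that produced an answer (hypothetical helper)."""
    output = resp.get("output", [])
    # With a custom system prompt the output may be a plain string, not a page list.
    if not isinstance(output, list):
        return {}
    return {
        item["page"]: item["content"]
        for item in output
        if item.get("content") != NOT_FOUND
    }
```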

### Default Model
---
By default, Rhubarb uses the Claude Sonnet model. You can also select Haiku, Sonnet 3.5, or Opus (when available) through the `modelId` parameter.

In [4]:
from rhubarb import DocAnalysis, LanguageModels

da = DocAnalysis(file_path="s3://<your-bucket>/<prefix>/employee_enrollment.pdf", 
                 boto3_session=session,
                 modelId=LanguageModels.CLAUDE_HAIKU_V1)
resp = da.run(message="What is the employee's name?")
resp

{'output': "The employee's name is Martha.",
 'token_usage': {'input_tokens': 5093,
  'output_tokens': 10,
  'total_tokens': 5103}}

In [3]:
from rhubarb import DocAnalysis, LanguageModels

da = DocAnalysis(file_path="s3://<your-bucket>/<prefix>/employee_enrollment.pdf", 
                 boto3_session=session,
                 enable_cri=True,
                 modelId=LanguageModels.CLAUDE_OPUS_V1)
resp = da.run(message="What is the employee name?")
resp

{'output': [{'page': 1,
   'detected_languages': ['English'],
   'content': 'Martha C Rivera'}],
 'token_usage': {'input_tokens': 5092,
  'output_tokens': 51,
  'total_tokens': 5143}}

### Using Converse API
---

Rhubarb supports streaming responses through the Amazon Bedrock Converse API, so output can be displayed in real time as it is generated. To use it, set `use_converse_api=True` when creating `DocAnalysis`:


In [None]:

from rhubarb import DocAnalysis, SystemPrompts
import boto3

# Initialize a boto3 session
session = boto3.Session()

# Create a DocAnalysis instance with converse API enabled
da = DocAnalysis(
    file_path="./test_docs/employee_enrollment.pdf",
    boto3_session=session,
    use_converse_api=True,  # Enable converse API
    system_prompt=SystemPrompts().SummarySysPrompt
)

# Stream the response
for resp in da.run_stream(message="Give me a brief summary of this document."):
    if isinstance(resp, str):
        print(resp, end='')
    else:
        print("\n")
        print(resp)

### Using Cross-Region Inference
---

Rhubarb supports cross-region inference capabilities, allowing you to process documents using models deployed in different AWS regions. This feature can help optimize latency and provide regional failover support. Here's how to enable it:


In [None]:
from rhubarb import DocAnalysis, SystemPrompts
import boto3

# Initialize a boto3 session
session = boto3.Session()

# Create a DocAnalysis instance with cross-region inference enabled
da = DocAnalysis(
    file_path="./test_docs/employee_enrollment.pdf",
    boto3_session=session,
    enable_cri=True,  # Enable cross-region inference
    system_prompt=SystemPrompts().SummarySysPrompt
)

# Run document analysis
resp = da.run(message="Give me a brief summary of this document.")
print(resp)

### Q&A with specific pages
---

Initialize `DocAnalysis` with a file, the page numbers, and a boto3 session, then call the `run` method to get a response back.

In [4]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 pages=[3])
resp = da.run(message="For beneficiary type 'Secondary', what is the full name?")
resp

{'output': [{'page': 3,
   'detected_languages': ['English'],
   'content': 'Pat Rivera'}],
 'token_usage': {'input_tokens': 2004,
  'output_tokens': 44,
  'total_tokens': 2048}}

Or specify multiple pages

In [7]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 pages=[1,3])
resp = da.run(message="What is the employee's name and what is the spouse's name?")
resp

{'output': [{'page': 1,
   'detected_languages': ['English'],
   'content': 'Martha C Rivera'},
  {'page': 3,
   'detected_languages': ['English'],
   'content': "Employee's name: Martha C Rivera, Spouse's name: Mateo Rivera"}],
 'token_usage': {'input_tokens': 3552,
  'output_tokens': 93,
  'total_tokens': 3645}}

### Document Classification
---
You can classify the pages of a document using Rhubarb by using either the `ClassificationSysPrompt` system prompt for single class classification or `MultiClassificationSysPrompt` system prompt for multi-class classification.

In [13]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/Sample1.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().ClassificationSysPrompt)
resp = da.run(message="""Given the document, classify the pages into the following classes
                        <classes>
                        DRIVERS_LICENSE  # a driver's license
                        INSURANCE_ID     # a medical insurance ID card
                        RECEIPT          # a store receipt
                        BANK_STATEMENT   # a bank statement
                        W2               # a W2 tax document
                        MOM              # a minutes of meeting or meeting notes
                        </classes>""")
resp

{'output': [{'page': 1, 'class': 'BANK_STATEMENT'},
  {'page': 2, 'class': 'RECEIPT'},
  {'page': 3, 'class': 'DRIVERS_LICENSE'},
  {'page': 4, 'class': 'INSURANCE_ID'},
  {'page': 5, 'class': 'W2'},
  {'page': 6, 'class': 'MOM'}],
 'token_usage': {'input_tokens': 8803, 'output_tokens': 156}}
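
Downstream code often needs the inverse view of this result: which pages belong to each class. The sketch below groups the per-page output by class. `pages_by_class` is a hypothetical helper that assumes the single-class output shape shown above, where each item has a `page` and a string `class`.

```python
from collections import defaultdict

def pages_by_class(resp):
    """Group page numbers by predicted class (hypothetical helper)."""
    grouped = defaultdict(list)
    for item in resp["output"]:
        grouped[item["class"]].append(item["page"])
    return dict(grouped)
```

This makes it straightforward to route, for example, every `RECEIPT` page to a separate extraction step using the `pages` parameter.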

Or use multi-class classification. Note that in multi-class classification it helps to clarify the hierarchy of classes to the model by providing two separate lists of classes. These should typically match your document taxonomy, such as

```
FINANCIAL           (Level-2)
├── BANK_STATEMENT  (Level-1 leaf)
└── W2              (Level-1 leaf)

IDENTIFICATION      (Level-2)
├── DRIVERS_LICENSE (Level-1 leaf)
└── INSURANCE_ID    (Level-1 leaf)
```

And so on

In [14]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/Sample1.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().MultiClassificationSysPrompt)
resp = da.run(message="""Given the document, classify the pages into the following classes
                        <classes_level1>
                        DRIVERS_LICENSE  # a driver's license
                        INSURANCE_ID     # a medical insurance ID card
                        RECEIPT          # a store receipt
                        BANK_STATEMENT   # a bank statement
                        W2               # a W2 tax document
                        MOM              # a minutes of meeting or meeting notes
                        </classes_level1>
                        <classes_level2>
                        FINANCIAL        # a document related to finances of a person
                        IDENTIFICATION   # a personal document such as ID, membership cards, etc.
                        GENERAL          # any other general document
                        </classes_level2>""")
resp

{'output': [{'page': 1, 'class': ['BANK_STATEMENT', 'FINANCIAL']},
  {'page': 2, 'class': ['RECEIPT', 'GENERAL']},
  {'page': 3, 'class': ['DRIVERS_LICENSE', 'IDENTIFICATION']},
  {'page': 4, 'class': ['INSURANCE_ID', 'IDENTIFICATION']},
  {'page': 5, 'class': ['W2', 'FINANCIAL']},
  {'page': 6, 'class': ['MOM', 'GENERAL']}],
 'token_usage': {'input_tokens': 8925, 'output_tokens': 228}}
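
Because the two labels come from the model, it can be worth checking that each leaf class is paired with the parent your taxonomy expects. The sketch below does that check. `TAXONOMY` and `taxonomy_mismatches` are hypothetical, the mapping of `RECEIPT` and `MOM` to `GENERAL` is an assumption inferred from the output above, and the code assumes each `class` value is a `[leaf, parent]` pair in that order.

```python
# Leaf class -> expected parent class (assumed taxonomy for this example).
TAXONOMY = {
    "BANK_STATEMENT": "FINANCIAL",
    "W2": "FINANCIAL",
    "DRIVERS_LICENSE": "IDENTIFICATION",
    "INSURANCE_ID": "IDENTIFICATION",
    "RECEIPT": "GENERAL",
    "MOM": "GENERAL",
}

def taxonomy_mismatches(resp, taxonomy=TAXONOMY):
    """Return page numbers whose [leaf, parent] pair disagrees with the taxonomy."""
    mismatches = []
    for item in resp["output"]:
        leaf, parent = item["class"]
        if taxonomy.get(leaf) != parent:
            mismatches.append(item["page"])
    return mismatches
```

An empty list means every page was labeled consistently with the taxonomy; any returned page is a candidate for review or a retry.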

### Perform Named Entity Recognition 
---
Rhubarb comes with 50 built-in entities, which include common entities such as LOCATION and EVENT, and PII entities such as NAME, SSN, and ADDRESS. Entities are available via the `Entities` class. You can pick which entities to detect and pass them to the `run_entity` method.

In [9]:
from rhubarb import DocAnalysis, Entities

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 pages=[1,3])
resp = da.run_entity(message="Extract all the specified entities from this document.", 
                     entities=[Entities.PERSON, Entities.ADDRESS])
resp

{'output': [{'page': 1,
   'entities': [{'PERSON': 'Martha C Rivera'},
    {'ADDRESS': '5005 ANY AVENUE, NEW YORK, NY- 10021'},
    {'ADDRESS': '8 Any Plaza, 21 Street, Any City, CA 90210'}]},
  {'page': 3,
   'entities': [{'PERSON': 'Mateo Rivera'},
    {'PERSON': 'Pat Rivera'},
    {'ADDRESS': '8 Any Plaza, 21 Street, Any City, CA 90210'}]}],
 'token_usage': {'input_tokens': 3523,
  'output_tokens': 196,
  'total_tokens': 3719}}
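
The same entity can appear on several pages, as the repeated address above shows. A minimal sketch for collapsing the per-page result into distinct values per entity type follows. `collect_entities` is a hypothetical helper that assumes the output shape shown above, where each entity is a one-key dictionary.

```python
def collect_entities(resp):
    """Map each entity type to its distinct values, in first-seen order."""
    found = {}
    for page in resp["output"]:
        for entity in page["entities"]:
            for kind, value in entity.items():
                values = found.setdefault(kind, [])
                if value not in values:
                    values.append(value)
    return found
```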

### Perform PII Recognition 
---
You can use the same `run_entity` method with PII entities available via `Entities`.

In [16]:
from rhubarb import DocAnalysis, Entities

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 pages=[1,3])
resp = da.run_entity(message="Extract all the specified entities from this document.", 
                     entities=[Entities.SSN, Entities.ADDRESS])
resp

{'output': [{'page': 1,
   'entities': [{'SSN': '376 12 1987'},
    {'ADDRESS': '8 Any Plaza, 21 Street'}]},
  {'page': 3,
   'entities': [{'SSN': '791 36 9771'},
    {'ADDRESS': '8 Any Plaza, 21 Street'},
    {'SSN': '824 26 2211'},
    {'ADDRESS': '8 Any Plaza, 21 Street'}]}],
 'token_usage': {'input_tokens': 3534, 'output_tokens': 183}}
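
Detected PII is usually masked before it is logged or shown to users. The sketch below masks all but the last four digits of selected entity types. `mask_value` and `mask_entities` are hypothetical helpers, not Rhubarb functions, and they assume the `run_entity` output shape shown above.

```python
def mask_value(value, keep_last=4):
    """Replace every digit except the last `keep_last` with '*', keeping separators."""
    to_hide = max(sum(ch.isdigit() for ch in value) - keep_last, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_hide > 0:
            out.append("*")
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)

def mask_entities(resp, kinds=("SSN",), keep_last=4):
    """Return a copy of the per-page output with the chosen entity types masked."""
    masked = []
    for page in resp["output"]:
        entities = [
            {kind: mask_value(value, keep_last) if kind in kinds else value
             for kind, value in entity.items()}
            for entity in page["entities"]
        ]
        masked.append({"page": page["page"], "entities": entities})
    return masked
```

The original response is left untouched, so the unmasked values remain available to code that is allowed to see them.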

### Perform key-value extraction using custom JSON schema
---
Rhubarb supports extraction of key values using JSON Schema. You can pass in a valid JSON schema to extract specific data out of your document. Let's define a custom JSON schema appropriate for our document.

In [10]:
schema = {
    "type": "object",
    "properties": {
        "employee_name": {
            "description": "Employee's Name",
            "type": "string"
        },
        "employee_ssn": {
            "description": "Employee's social security number",
            "type": "string"
        },
        "employee_address": {
            "description": "Employee's mailing address",
            "type": "string"
        },
        "employee_dob": {
            "description": "Employee's date of birth",
            "type": "string"
        },
        "employee_gender": {
            "description": "Employee's gender",
            "type": "object",
            "properties": {
                "male":{
                    "description": "Whether the employee gender is Male",
                    "type": "boolean"
                },
                "female":{
                    "description": "Whether the employee gender is Female",
                    "type": "boolean"
                }
            },
            "required": ["male", "female"]
        },
        "employee_hire_date": {
            "description": "Employee's hire date",
            "type": "string"
        },
        "employer_no": {
            "description": "Employer number",
            "type": "string"
        },
        "employment_status": {
            "type": "object",
            "description": "Employment status",
            "properties": {
                "full_time":{
                    "description": "Whether employee is full-time",
                    "type": "boolean"
                },
                "part_time": {
                    "description": "Whether employee is part-time",
                    "type": "boolean"
                }
            },
            "required": ["full_time", "part_time"]
        },
        "employee_salary_rate":{
            "description": "The dollar value of employee's salary",
            "type": "integer"
        },
        "employee_salary_frequency":{
            "type": "object",
            "description": "Salary rate of the employee",
            "properties": {
                "annual":{
                    "description": "Whether salary rate is annual",
                    "type": "boolean"
                },
                "monthly": {
                    "description": "Whether salary rate is monthly",
                    "type": "boolean"
                },
                "semi_monthly": {
                    "description": "Whether salary rate is semi_monthly",
                    "type": "boolean"
                },
                "bi_weekly": {
                    "description": "Whether salary rate is bi_weekly",
                    "type": "boolean"
                },
                "weekly": {
                    "description": "Whether salary rate is weekly",
                    "type": "boolean"
                }
            },
            "required": ["annual", "monthly", "semi_monthly","bi_weekly","weekly"]
        }
    },
    "required": ["employee_name","employee_hire_date", "employer_no", "employment_status"]
}

In [11]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session)
resp = da.run(message="Give me the output based on the provided schema.", output_schema=schema)
resp

{'output': {'employee_name': 'Martha C Rivera',
  'employee_ssn': '376 12 1987',
  'employee_address': '8 Any Plaza, 21 Street, Any City, CA 90210',
  'employee_dob': '09/19/80',
  'employee_gender': {'male': False, 'female': True},
  'employee_hire_date': '07/19/2023',
  'employer_no': '784371',
  'employment_status': {'full_time': True, 'part_time': False},
  'employee_salary_rate': 79930,
  'employee_salary_frequency': {'annual': True,
   'monthly': False,
   'semi_monthly': False,
   'bi_weekly': False,
   'weekly': False}},
 'token_usage': {'input_tokens': 5375,
  'output_tokens': 224,
  'total_tokens': 5599}}
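
The schema above models checkbox groups as objects of booleans, which is convenient for the model but awkward downstream. A minimal sketch for collapsing each group into the single selected option is shown below. `selected_option` and `collapse_flags` are hypothetical helpers that assume exactly one flag per group should be true, and they return `None` when that does not hold.

```python
def selected_option(flags):
    """Return the only key set to True, or None if zero or several are set."""
    chosen = [name for name, value in flags.items() if value is True]
    return chosen[0] if len(chosen) == 1 else None

def collapse_flags(record):
    """Replace every all-boolean sub-object in `record` with its selected key."""
    collapsed = {}
    for key, value in record.items():
        is_flag_group = (
            isinstance(value, dict)
            and value
            and all(isinstance(v, bool) for v in value.values())
        )
        collapsed[key] = selected_option(value) if is_flag_group else value
    return collapsed
```

Applied to `resp['output']`, this would turn `employee_gender` into `'female'` and `employee_salary_frequency` into `'annual'` for the result above.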

### Perform table extraction using custom JSON schema
---
You can also perform table extraction using a custom JSON schema. In this case we will use a rather complex table from an AMZN 10-K filing and attempt to extract the data from it. Here's what a JSON schema might look like.

In [12]:
table_schema = {
  "additionalProperties": {
    "type": "object",
    "patternProperties": {
      "^(2022|2023)$": {
        "type": "object",
        "properties": {
          "Net Sales": {
            "type": "object",
            "properties": {
              "North America": {
                "type": "number"
              },
              "International": {
                "type": "number"
              },
              "AWS": {
                "type": "number"
              },
              "Consolidated": {
                "type": "number"
              }
            },
            "required": ["North America", "International", "AWS", "Consolidated"]
          },
          "Year-over-year Percentage Growth (Decline)": {
            "type": "object",
            "properties": {
              "North America": {
                "type": "number"
              },
              "International": {
                "type": "number"
              },
              "AWS": {
                "type": "number"
              },
              "Consolidated": {
                "type": "number"
              }
            },
            "required": ["North America", "International", "AWS", "Consolidated"]
          },
          "Year-over-year Percentage Growth, excluding the effect of foreign exchange rates": {
            "type": "object",
            "properties": {
              "North America": {
                "type": "number"
              },
              "International": {
                "type": "number"
              },
              "AWS": {
                "type": "number"
              },
              "Consolidated": {
                "type": "number"
              }
            },
            "required": ["North America", "International", "AWS", "Consolidated"]
          },
          "Net Sales Mix": {
            "type": "object",
            "properties": {
              "North America": {
                "type": "number"
              },
              "International": {
                "type": "number"
              },
              "AWS": {
                "type": "number"
              },
              "Consolidated": {
                "type": "number"
              }
            },
            "required": ["North America", "International", "AWS", "Consolidated"]
          }
        },
        "required": ["Net Sales", "Year-over-year Percentage Growth (Decline)", "Year-over-year Percentage Growth, excluding the effect of foreign exchange rates", "Net Sales Mix"]
      }
    }
  }
}

We are only interested in the results of operation table, which we know is on the first page, so we call Rhubarb with just that page to save costs. When the table's exact location isn't known, you can pass the full document instead.

In [14]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/amzn-10k.pdf", 
                 boto3_session=session,
                 pages=[1])
resp = da.run(message="Give me data in the results of operation table from this 10-K SEC filing document. Use the schema provided.", 
              output_schema=table_schema)
resp

{'output': {'2022': {'Net Sales': {'North America': 315880,
    'International': 118007,
    'AWS': 80096,
    'Consolidated': 513983},
   'Year-over-year Percentage Growth (Decline)': {'North America': 13,
    'International': -8,
    'AWS': 29,
    'Consolidated': 9},
   'Year-over-year Percentage Growth, excluding the effect of foreign exchange rates': {'North America': 13,
    'International': 4,
    'AWS': 29,
    'Consolidated': 13},
   'Net Sales Mix': {'North America': 61,
    'International': 23,
    'AWS': 16,
    'Consolidated': 100}},
  '2023': {'Net Sales': {'North America': 352828,
    'International': 131200,
    'AWS': 90757,
    'Consolidated': 574785},
   'Year-over-year Percentage Growth (Decline)': {'North America': 12,
    'International': 11,
    'AWS': 13,
    'Consolidated': 12},
   'Year-over-year Percentage Growth, excluding the effect of foreign exchange rates': {'North America': 12,
    'International': 11,
    'AWS': 13,
    'Consolidated': 12},
   'Net Sal
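
Extracted numbers are easier to load into a spreadsheet or data frame as flat rows, and financial tables often allow a quick arithmetic sanity check. The sketch below does both for the nested structure shown above. `table_rows` and `net_sales_add_up` are hypothetical helpers, and the check assumes that the three segments should sum to the `Consolidated` figure, which holds for the values in this output.

```python
def table_rows(output):
    """Flatten {year: {metric: {segment: value}}} into (year, metric, segment, value) rows."""
    rows = []
    for year, metrics in output.items():
        for metric, segments in metrics.items():
            for segment, value in segments.items():
                rows.append((year, metric, segment, value))
    return rows

def net_sales_add_up(year_data, segments=("North America", "International", "AWS")):
    """Check that segment net sales sum to the consolidated figure for one year."""
    net = year_data["Net Sales"]
    return sum(net[s] for s in segments) == net["Consolidated"]
```

A failed check does not say which cell is wrong, but it is a cheap signal that a page should be re-run or reviewed.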

### Schema creation assistant
---
Rhubarb can also help create accurate JSON schemas from plain-text prompts. You provide a document and describe the values you want to extract, and it responds with a JSON schema. You can then pass that schema as `output_schema` as shown above, or modify it further to fit your needs. You do this using the `generate_schema` method.

In [15]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 pages=[1])
resp = da.generate_schema(message="I want to extract the employee name, employee SSN, employee address, date of birth and phone number from this document.")
resp['output']

{'type': 'object',
 'description': 'Employee enrollment form information',
 'properties': {'employee_name': {'type': 'object',
   'description': "The employee's full name",
   'properties': {'first': {'type': 'string',
     'description': "Employee's first name"},
    'initial': {'type': 'string', 'description': "Employee's middle initial"},
    'last': {'type': 'string', 'description': "Employee's last name"}},
   'required': ['first', 'last']},
  'ssn': {'type': 'string',
   'description': "Employee's Social Security Number"},
  'address': {'type': 'object',
   'description': "Employee's mailing address",
   'properties': {'street': {'type': 'string',
     'description': 'Street address'},
    'city': {'type': 'string', 'description': 'City'},
    'state': {'type': 'string', 'description': 'State'},
    'zip_code': {'type': 'string', 'description': 'ZIP code'}},
   'required': ['street', 'city', 'state', 'zip_code']},
  'date_of_birth': {'type': 'string',
   'description': "Employee'

We can then use this schema to perform extraction on the same document.

In [16]:
output_schema = resp['output']
resp = da.run(message="I want to extract the employee name, employee SSN, employee address, date of birth and phone number from this document. Use the schema provided.", 
              output_schema=output_schema)
resp

{'output': {'employee_name': {'first': 'Martha',
   'initial': 'C',
   'last': 'Rivera'},
  'ssn': '376 12 1987',
  'address': {'street': '8 Any Plaza, 21 Street',
   'city': 'Any City',
   'state': 'CA',
   'zip_code': '90210'},
  'date_of_birth': '09/19/80',
  'phone_number': '(888) 555-0100'},
 'token_usage': {'input_tokens': 2113,
  'output_tokens': 145,
  'total_tokens': 2258}}
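
Since a generated schema comes from the model, it can be worth confirming that an extraction result contains every field the schema marks as required. The sketch below walks the schema and the result together. `missing_required` is a hypothetical helper that handles only `type: object`, `properties`, and `required`; it is not a full JSON Schema validator.

```python
def missing_required(schema, data, path=""):
    """List dotted paths of required properties that are absent from `data`."""
    missing = []
    if schema.get("type") != "object" or not isinstance(data, dict):
        return missing
    for name in schema.get("required", []):
        if name not in data:
            missing.append(f"{path}{name}")
    # Recurse into nested objects that are present in the data.
    for name, subschema in schema.get("properties", {}).items():
        if name in data:
            missing.extend(missing_required(subschema, data[name], f"{path}{name}."))
    return missing
```

For example, `missing_required(output_schema, resp['output'])` returns an empty list when nothing required is missing.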

### Schema creation assistance with question rephrase
---
In many cases you may want to get started quickly with a JSON schema for your document without spending much time crafting a precise prompt. For example, for a birth certificate you could ask a vague question such as "_I want to get the child's, the mother's and father's details from the given document_". In such cases Rhubarb can rephrase the question based on the document and generate a matching schema, which can be used directly to extract the data. To do this, set the `assistive_rephrase` parameter in your call to `generate_schema`.

In [17]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/birth_cert.jpeg",
                 boto3_session=session)
resp = da.generate_schema(message="I want to get the child's, the mother's and father's details from the given document",
                          assistive_rephrase=True)
resp['output']

{'rephrased_question': "Extract the child's, mother's, and father's details from the given birth certificate document.",
 'output_schema': {'type': 'object',
  'properties': {'child': {'type': 'object',
    'properties': {'name': {'type': 'object',
      'properties': {'first': {'type': 'string'},
       'middle': {'type': 'string'},
       'last': {'type': 'string'}},
      'required': ['first', 'middle', 'last']},
     'sex': {'type': 'string'},
     'date_of_birth': {'type': 'string'},
     'time_of_birth': {'type': 'string'},
     'place_of_birth': {'type': 'object',
      'properties': {'hospital': {'type': 'string'},
       'city': {'type': 'string'},
       'county': {'type': 'string'}},
      'required': ['hospital', 'city', 'county']}},
    'required': ['name',
     'sex',
     'date_of_birth',
     'time_of_birth',
     'place_of_birth']},
   'mother': {'type': 'object',
    'properties': {'name': {'type': 'object',
      'properties': {'first': {'type': 'string'},
       'mi

In [18]:
output_schema = resp['output']['output_schema']
question = resp['output']['rephrased_question']
resp = da.run(message = question,
              output_schema = output_schema)
resp

{'output': {'child': {'name': {'first': 'PAULO',
    'middle': 'SOUZA',
    'last': 'SANTOS'},
   'sex': 'MALE',
   'date_of_birth': 'MARCH 23,1981',
   'time_of_birth': '7:52A',
   'place_of_birth': {'hospital': 'ANYGOVERNMENT MEMORIAL HOSPITAL',
    'city': 'ANY TOWN',
    'county': 'ANY COUNTY'}},
  'mother': {'name': {'first': 'MARIA',
    'middle': 'OLIVERIA',
    'last': 'GARCIA'},
   'age': 29,
   'birthplace': 'SWITZERLAND',
   'residence': {'state': 'FLORIDA',
    'county': 'ANY COUNTY',
    'city': 'ANY TOWN',
    'address': '543 ANYSTREET DR'}},
  'father': {'name': {'first': 'DIEGO', 'middle': '', 'last': 'RAMIREZ'},
   'age': 31,
   'birthplace': 'ILLINOIS'}},
 'token_usage': {'input_tokens': 2301,
  'output_tokens': 325,
  'total_tokens': 2626}}

### Perform page level summarization
---
Rhubarb can generate summaries of every page in the document.

In [19]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session)
resp = da.run(message="Give me a brief summary for each page.")
resp

{'output': [{'page': 1,
   'detected_languages': ['English'],
   'content': "This page contains an employee enrollment form for a 401(k) plan with designated Roth contributions from Anycompany of America Life Insurance Co. It includes employer information, employee details such as name (Martha C Rivera), address, social security number, salary, and employment status. The employee's hire date is 07/19/2023, with a full-time status and an annual salary of $79,930. The form also includes sections for prior tax-exempt service and contribution details."},
  {'page': 2,
   'detected_languages': ['English'],
   'content': 'This page shows Section 2 - Allocation of Contributions for the 401(k) plan. It provides options for allocating contributions between the Interest Accumulation Account and various Separate Account Investment Funds. The employee has allocated 50% to the AnyCompany of America Interest Accumulation Account. The page also includes Section 3 - Beneficiary Designations, which exp

### Perform full summarization
---
Or you can generate an overall summary of the entire document. In this case, we will override the default System Prompt which breaks down the response per page. Rhubarb comes with a Summary specific System Prompt for the model, available via `SystemPrompts`.

In [20]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().SummarySysPrompt)
resp = da.run(message="Give me a brief summary of this document.")
resp

{'output': "This document is an Employee Enrollment Form for a 401(k) Plan with Designated Roth Contributions from Anycompany of America Life Insurance Co. The key details are:\n\n1. The employee, Martha C. Rivera, is enrolling in the 401(k) plan.\n2. She was hired on 07/19/2023 as a full-time employee with an annual salary of $79,930.\n3. The form includes personal information such as address, Social Security number, and date of birth.\n4. There's a section for allocation of contributions to various investment funds.\n5. The beneficiary designation section shows:\n   - Primary beneficiary: Mateo Rivera (spouse)\n   - Secondary beneficiary: Pat Rivera (child)\n6. Both beneficiaries are allocated 50% of the benefits.\n7. The employee is married, and there's a spouse's waiver section signed by Mateo Rivera.\n8. The form is signed and dated by Martha on 8/25/2022.\n\nThis document serves as an official record of the employee's enrollment in the company's 401(k) plan and their choices rega

### Perform summarization of specific pages
---
You can also perform summarization of specific pages using the `pages` parameter.

In [27]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().SummarySysPrompt,
                 pages=[1,3])
resp = da.run(message="Give me a brief summary of this document.")
resp

{'output': "## Summary\n\nThis document is an employee enrollment form for a 401(k) plan with designated Roth contributions from AnyCompany of America Life Insurance Co. The key details are:\n\n- The employee is Martha C. Rivera, hired on 07/19/2023 as a full-time employee with an annual salary of $79,930. \n- Her employer is AnyCompany Constructions Inc. with the employer number 784371.\n- Traditional pre-tax contributions are set at 10% of salary effective 08/19/23 and employer matching contributions effective 08/19/23.\n- Martha has designated her spouse Mateo Rivera as the primary beneficiary at 50% and her child Pat Rivera as the secondary beneficiary at 50%.\n- Mateo has signed a spouse's waiver allowing Martha to receive the death benefit after his death.\n- Martha has signed the form on 3/25/2022, indicating she has read the plan materials and wishes to participate in the Thrift Plan.",
 'token_usage': {'input_tokens': 3207, 'output_tokens': 228}}

### Streaming summaries
---
In some cases, you may want to stream the summary, for example in a real-time chat application. You can do that with the `run_stream` method. Let's generate the full summary and stream it.

In [28]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().SummarySysPrompt)
for resp in da.run_stream(message="Give me a brief summary of this document."):
    if isinstance(resp, str):
        print(resp,end='')
    else:
        print("\n")
        print(resp)

|-START-|
This document is an Employee Enrollment Form for 401(k) Plans with Designated Roth Contributions from AnyCompany of America Life Insurance Co. It contains the following key information:

- Employee details: A female employee named Martha C. Rivera, hired on 07/19/2023 as a full-time employee with an annual salary of $79,930. Her personal information like address, date of birth, and contact details are provided.

- Contribution details: The employee is enrolling in the company's 401(k) plan with 10% of salary allocated to Traditional Pre-Tax Contributions effective 08/19/23 and 08/19/23 for employer matching and non-matching contributions respectively. No Designated Roth Contributions are specified.

- Investment allocation: 50% of contributions will be placed in the AnyCompany of America Interest Accumulation Account, while the remaining options for separate investment funds like equity, real estate, retirement, balanced, asset allocation, and fixed income funds are listed.



#### With `converse` API

In [21]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf", 
                 boto3_session=session,
                 use_converse_api=True,
                 system_prompt=SystemPrompts().SummarySysPrompt)
for resp in da.run_stream(message="Give me a brief summary of this document."):
    if isinstance(resp, str):
        print(resp,end='')
    else:
        print("\n")
        print(resp)

|-START-|
This document is an Employee Enrollment Form for a 401(k) Plan with Designated Roth Contributions from Anycompany of America Life Insurance Co. The key details are:

1. The employee, Martha C. Rivera, is enrolling in the 401(k) plan.
2. She was hired on 07/19/2023 as a full-time employee with an annual salary of $79,930.
3. The form includes personal information such as address, Social Security number, and date of birth.
4. There's a section for allocation of contributions, with 50% allocated to the Interest Accumulation Account.
5. The beneficiary designation section shows:
   - Primary beneficiary: Mateo Rivera (spouse)
   - Secondary beneficiary: Pat Rivera (child)
6. Both beneficiaries are set to receive 50% of the benefits.
7. The employee is married, and there's a spouse's waiver section signed by Mateo Rivera.
8. The form is signed and dated by Martha on 8/25/2022.

This document represents a standard 401(k) enrollment process, including personal information, contribut

Text streaming starts with a `|-START-|` marker and ends with an `|-END-|` marker, so that the client application receiving the stream has a clear demarcation of where the streamed text begins and ends. Rhubarb doesn't currently support the `stop_words` kwarg during model invocation, but that is coming soon.
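On the client side, the markers can be used to separate the streamed text from everything else. The snippet below is a minimal sketch, not part of Rhubarb: `collect_stream` and `fake_stream` are hypothetical names, and it assumes that each marker arrives as its own chunk (possibly surrounded by whitespace) and that any non-string item in the stream is the token usage dictionary, as the outputs in this notebook suggest.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

START_MARKER = "|-START-|"
END_MARKER = "|-END-|"


def collect_stream(chunks: Iterable[Any]) -> Tuple[str, Optional[Dict[str, int]]]:
    """Join the text streamed between the markers and return (text, token_usage)."""
    parts = []
    usage = None
    started = False
    for chunk in chunks:
        if isinstance(chunk, str):
            marker = chunk.strip()
            if marker == START_MARKER:
                started = True      # text begins after this chunk
            elif marker == END_MARKER:
                started = False     # text is complete
            elif started:
                parts.append(chunk)
        else:
            # Assumption: non-string items carry token usage.
            usage = chunk
    return "".join(parts), usage


def fake_stream() -> Iterator[Any]:
    """Stand-in for da.run_stream(...), used only for illustration."""
    yield "|-START-|\n"
    yield "Hello, "
    yield "world."
    yield "\n|-END-|"
    yield {"input_tokens": 10, "output_tokens": 3}


text, usage = collect_stream(fake_stream())
print(text)
print(usage)
```

In a real application, `fake_stream()` would be replaced by `da.run_stream(message=...)`. If the markers can arrive merged with other text in the same chunk, the comparison above would need to be replaced with a substring search.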

Keep in mind that streaming only makes sense for a couple of use cases:
- A real-time chat interface where a lot of text (like a summary) is expected from the model
- A real-time conversational chat interface, a.k.a. a chatbot

As such, Rhubarb supports streaming only for the Summary System Prompt and the Chat System Prompt, as we will see in the next section. You can also view the chat history by accessing the `history` property.

In [None]:
da.history

### Conversational Chat with Documents (no streaming)
---

You can chat with your documents using the Chat System Prompt, `ChatSysPrompt`, available via the `SystemPrompts` class. Here's an example of a (non-streaming) chat. Note that Rhubarb supports chat history internally, but much of the history implementation is left to the developer. This, coupled with the fact that Claude models accept only 20 images per invocation, makes it a little complicated to properly implement a chat-history-based conversational system. We are working on simplifying this and will add support in upcoming releases.
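One way to stay within the 20-image limit mentioned above is to split a long document into page batches before invoking the model. The helper below is a hypothetical sketch, not a Rhubarb API: `page_batches` and `MAX_IMAGES_PER_CALL` are illustrative names, and it assumes 1-based page numbers, as in the `pages=[1,3]` example earlier in this notebook.

```python
from typing import Iterator, List

MAX_IMAGES_PER_CALL = 20  # limit quoted in the text above


def page_batches(total_pages: int,
                 batch_size: int = MAX_IMAGES_PER_CALL) -> Iterator[List[int]]:
    """Yield consecutive lists of 1-based page numbers, each at most batch_size long."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for start in range(1, total_pages + 1, batch_size):
        stop = min(start + batch_size, total_pages + 1)
        yield list(range(start, stop))


# A 45-page document splits into batches of 20, 20, and 5 pages.
for batch in page_batches(45):
    print(batch[0], "to", batch[-1], "->", len(batch), "pages")
```

Each batch could then be passed as the `pages` argument of a separate `DocAnalysis` call. How to carry chat history across those calls is left open here.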

Also note that there is little fundamental difference between the default Rhubarb invocation and an invocation using `ChatSysPrompt`; the only difference is the response payload structure. The default invocation gives you responses to your question on a per-page basis, whereas chat gives you the output (i.e. the response to your chat) and optionally cites the sources (i.e. the page numbers).

For non-streaming chat responses, the output structure is as follows:

```python
{
    'output': {
        'text': "Response",
        'sources': [1, 2, 3],  # these are the page numbers
        'quotes': ['quote 1', 'quote 2', 'quote 3']  # these are the verbatim quotes from the document
    },
    'token_usage': {'input_tokens': 5029, 'output_tokens': 215, 'total_tokens': 5244}
}
```
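A downstream application can read this structure like any other dictionary. The following is a minimal sketch with a hypothetical helper, `format_chat_response`, which assumes the keys shown above and treats `sources` and `quotes` as optional.

```python
from typing import Any, Dict


def format_chat_response(resp: Dict[str, Any]) -> str:
    """Render the answer text, followed by cited pages and verbatim quotes if present."""
    output = resp["output"]
    lines = [output["text"]]
    sources = output.get("sources") or []
    if sources:
        lines.append("Pages: " + ", ".join(str(page) for page in sources))
    for quote in output.get("quotes") or []:
        lines.append(f'> "{quote}"')
    return "\n".join(lines)


# Sample payload mirroring the structure shown above.
sample = {
    "output": {
        "text": "Response",
        "sources": [1, 2, 3],
        "quotes": ["quote 1", "quote 2"],
    },
    "token_usage": {"input_tokens": 5029, "output_tokens": 215},
}
print(format_chat_response(sample))
```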

In [30]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/amzn-10k.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().ChatSysPrompt)
resp = da.run(message="What is this document about?")
resp

{'output': {'text': "This document appears to be a financial report or annual report for a company, likely Amazon, based on the references to business segments like North America, International, and AWS (Amazon Web Services). It provides details on the company's results of operations, net sales, operating income/loss, operating expenses, cost of sales, and fulfillment costs across these segments for the years 2022 and 2023.",
  'sources': [1, 2, 3],
  'quotes': ['We have organized our operations into three segments: North America, International, and AWS. These segments reflect the way the Company evaluates its business performance and manages its operations.',
   'Operating income (loss) by segment is as follows (in millions):',
   'Cost of sales primarily consists of the purchase price of consumer products, inbound and outbound shipping costs, including costs related to sortation and delivery centers and Where we are the transportation service provider, and digital media content costs

### Conversational Chat with Documents (streaming)
---
Let's perform a streaming chat. Note the use of `SystemPrompts(streaming=True)`: this tells Rhubarb not to respond with JSON output, since JSON in streamed text is hard to parse and reconstruct.

In [31]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/amzn-10k.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts(streaming=True).ChatSysPrompt)
for resp in da.run_stream(message="What is this document about?"):
    if isinstance(resp, str):
        print(resp,end='')
    else:
        print("\n")
        print(resp)

|-START-|
Based on the content in these pages, this document appears to be Amazon's annual report or financial statements, specifically discussing the company's results of operations for the fiscal year 2023 compared to 2022.

The key points covered include:

- Net sales breakdown by segment - North America, International, and AWS (Amazon Web Services) (Page 1)
- Operating income/(loss) by segment (Page 2)
- Detailed operating expenses like cost of sales, fulfillment, technology/infrastructure, marketing etc. (Page 3)
- Explanations for changes in net sales, operating income, and operating expenses across the different segments (Pages 1-3)

This report provides an overview of Amazon's financial performance, growth rates, and underlying factors driving the results in its core business segments for the fiscal year.
|-END-|

{'input_tokens': 4933, 'output_tokens': 178}


### Reasoning and explaining figures, charts etc.
---
Rhubarb can also explain and reason about images, charts, and graphs within documents.

In [24]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/scientific_paper.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().FigureSysPrompt)
resp = da.run(message="Explain the bar chart in this document.")
resp

{'output': [{'page': 1,
   'figure_analysis': "The bar chart in this document is labeled 'Figure 2: Normalized purchases of the same type, as function of the days from the request, in late shopping stages.' It shows three different categories: 'Same Product', 'Same Type', and 'Unrelated Purchases' over a period of 0 to 15 days after the request. The y-axis represents 'Normalized number of purchases' with values ranging from 0 to 30. The 'Same Product' category has the highest bars, particularly in the first few days, followed by 'Same Type'. 'Unrelated Purchases' shows the lowest values consistently across all days.",
   'figure_description': 'Figure 2: Normalized purchases of the same type, as function of the days from the request, in late shopping stages.',
   'reasoning': "1. Identified the chart as a bar chart from visual inspection.\n2. Located the title of the chart in the caption below it.\n3. Observed the x-axis showing days from 0 to 15.\n4. Noted the y-axis label 'Normalized 

### Reasoning with tables (experimental)
---
We can also perform reasoning with tables using the Figure System Prompt.

In [33]:
from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/amzn-10k.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().FigureSysPrompt)
resp = da.run(message="What is the dollar value difference of Net Sales between 2022 and 2023 for North America in the given table. Explain your answer.")
resp

{'output': [{'page': 1,
   'figure_analysis': 'The Net Sales for North America increased from $315,880 million in 2022 to $352,828 million in 2023, a difference of $36,948 million.',
   'figure_description': 'A table showing Net Sales broken down by segment (North America, International, AWS) for the years ended December 31, 2022 and 2023.',
   'reasoning': "In the 'Net Sales' table on page 1, the value for 'North America' in the 2022 column is $315,880 million. The value for 'North America' in the 2023 column is $352,828 million. The difference between these two values is $352,828 million - $315,880 million = $36,948 million."}],
 'token_usage': {'input_tokens': 4960, 'output_tokens': 209}}
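Because this feature is experimental, it can be worth re-checking any arithmetic the model reports. The snippet below is an illustrative sketch, not part of Rhubarb: the hypothetical `dollar_amounts` helper pulls the dollar figures out of the `figure_analysis` text shown above and recomputes the difference.

```python
import re
from typing import List


def dollar_amounts(text: str) -> List[int]:
    """Return every dollar figure in the text as an integer, in order of appearance."""
    return [int(match.replace(",", "")) for match in re.findall(r"\$([\d,]+)", text)]


analysis = (
    "The Net Sales for North America increased from $315,880 million in 2022 "
    "to $352,828 million in 2023, a difference of $36,948 million."
)

sales_2022, sales_2023, claimed_difference = dollar_amounts(analysis)
recomputed = sales_2023 - sales_2022
print(recomputed, recomputed == claimed_difference)
```

Here the recomputed difference matches the model's claim of $36,948 million. The regular expression is deliberately simple and assumes amounts are written as `$` followed by digits and thousands separators.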

### Multi-lingual documents
---
Analyze multi-lingual documents.

In [6]:
from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/public-notice-spanish.pdf", 
                 boto3_session=session)
resp = da.run(message="Which entity or organization issued this notice?")
resp

{'output': [{'page': 1,
   'detected_languages': ['English', 'Spanish'],
   'content': 'El Departamento de Vivienda de la Ciudad de Phoenix (City of Phoenix Housing Department, COPHD) aceptará solicitudes previas para su lista de espera del Programa de vales de elección de vivienda (Housing Choice Voucher, HCV) de la Sección 8 a partir del martes 12 de septiembre a las 8 a. m. hasta el martes 26 de septiembre de 2023 a las 7 p. m. hora de Arizona (AZ).'},
  {'page': 2,
   'detected_languages': ['English', 'Spanish'],
   'content': 'No hay costo para solicitar o recibir un vale de elección de vivienda de la Sección 8.'},
  {'page': 3,
   'detected_languages': ['English', 'Spanish'],
   'content': 'Compromiso con la equidad en la vivienda y la no discriminación La ciudad de Phoenix no discrimina por motivos de raza, etnia, sexo, identidad de género, color, religión, estado civil, estado familiar, país de origen, edad, discapacidad, ascendencia, fuente de ingresos u orientación sexual en 

## Generate ALT Text from images in a document
---
We will generate alt text for the images in a document, for example to digitize books and make them available in web format. Alt text improves accessibility for people who rely on screen readers.

In [35]:
#./test_docs/Anatomy_and_Physiology_2e_small.pdf

from rhubarb import DocAnalysis, SystemPrompts

da = DocAnalysis(file_path="./test_docs/aquifers.pdf", 
                 boto3_session=session,
                 system_prompt=SystemPrompts().FigureSysPrompt,
                 max_tokens=4096)
resp = da.run(message="You will generate alt-texts for images found in the pages of this document. Make sure to be short but descriptive and explain the colors since alt text is used by screen-readers thus enabling accessibility.")
resp

{'output': [{'page': 1,
   'figure_analysis': "The image depicts the initial phase of the hydrological cycle, where rainwater infiltrates the topsoil layer ('T') and percolates through subsurface layers, recharging aquifers ('R'). The colors used are blue for water, brown for soil/rock layers, and green for vegetation.",
   'figure_description': 'Figure 1: The Surface Connection',
   'reasoning': "The image illustrates the process of rainwater infiltration into the topsoil ('T') and subsequent percolation through subsurface layers to recharge aquifers ('R'). The colors blue, brown, and green represent water, soil/rock layers, and vegetation, respectively."},
  {'page': 2,
   'figure_analysis': "The image shows a cross-sectional view of an aquifer system, with 'Y' representing wells and springs that tap into the aquifer, 'R' depicting the aquifer itself, and 'K' representing the bedrock layer beneath. The colors used are green for the surface, blue for water, and brown/gray for rock lay