#  Enhancing accuracy in LLM multi-modal vision use cases on Amazon Bedrock

As Large Language Models (LLMs) evolve to process multiple modalities, we face challenges reminiscent of early text-based LLMs: basic inconsistency, inaccuracy, and hallucinations. 
This notebook focus on techniques to increase the accuracy of vision use cases using various techniques. Inlucded is a quick introduction to AWS Gen AI services landscape and how to get started with using Anthropic Claude and other LLMs on Amazon Bedrock.

## Intro: Structured OCR

The goal is to A. idetify and read the (Hebrew) text on the page. B. Figure out the structure of the page. C. populate the structure with the text and output it (as Json).
Note that this example includes Hebrew text, some block lettes, some handwritten.

<img title="Sturctured OCR Task" src="./images/structured ocr 2.png"/>

# How can we accomplish this on AWS?
<img title="aws stack.png" src="./images/aws stack 3 layers.png" />

## Let's use a multi-modal LLM for this task
Let's proceed with using Anthropic Claude Sonnet 3.5 multi modal on Amazon Bedrock. We get a lot of abilities out of the box and we get started in minutes.

#### What vision use cases can it do?
- Classification (+ few shots)
- Object Detection
- Scene insights extraction (emotions)
- (Structured) OCR
- Content moderation
- Custome tasks - you name it

### How do we call A bedrock model with an image?

```python 
    # 1) Read the image bytes
    with open(input_image, "rb") as f:
        image_bytes = f.read()

    # 2) Create a message(s) with text or image
    message = {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "jpeg",
                    "source": {
                        "bytes": image_bytes
                    }
                }
            }
        ]
    }

    # 3) Invoke Converse API (same exact API for different model vendors + native tools support). Supported by LangChain.
    response = bedrock_client.converse(
        modelId = model_id,
        messages = [messages,],
        system = system_prompts,
        inferenceConfig = inference_config,
        #additionalModelRequestFields = additional_model_fields
    )
```

## Let's call the Sonnet 3.5/Bedrock

In [33]:
%load_ext autoreload
%autoreload 2
%pip install --upgrade pip
%pip install boto3 --quiet
%pip install botocore --quiet
%pip install json --quiet
%pip install pandas --quiet
import base64
import boto3
import json
from IPython.display import Image
from json_eval import compare_json

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
[31mERROR: Could not find a version that satisfies the requirement json (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for json[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
from llm_util import generate_conversation, get_ocr
input_image_fullpath = "./images/pizza-2e.png"
base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : input_image_fullpath,
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
                }

output = get_ocr(base_test)

Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0
Role: assistant
Text: I understand your request. I'll analyze the menu image and provide the information in the requested JSON structure, being careful to accurately transcribe the text as it appears. Here's the structured menu information:

{
  "categories": [
    {
      "name": "פיצות",
      "dishes": [
        {
          "name": "פיצה אישית",
          "price": 23
        },
        {
          "name": "פיצה משפחתית",
          "price": 42
        },
        {
          "name": "פיצה ענקית",
          "price": 65
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 50
        },
        {
          "name": "משפחתית טבעונית",
          "price": 48
        },
        {
          "name": "פיצה בדרוס",
          "price": 70
        },
        {
          "name": "פיצה מיוחדת",
          "price": 70,
          "description": "מיקס גבינות, שום קונפי, קשקבל יווני, אספרגוס צלוי, בלסמי"
  

In [5]:
# write response to disk
actual_filepath = './outputs/pizza-2e_actual.json'
with open(actual_filepath, 'w', encoding='utf-8') as f:
    f.write(output)

# Lets evaluate (manually)! 
It looks promising, however a manual diff uncovers some mistakes compared to the ground truth Json. For example instead of רטבים (sauces) it hallucinated לחמניות (buns).
<img title="Diff view" src="./images/pizza-2e-diff.png" />

## Hurray! While far from being perfect, it delivers immediate business value (💸, ⏱️)!
While it's not perfect, we can see immediate business value here, as instead of typing everything from scratch a data entry worker could just review the model's output and fix mistakes. Saving time and $.
## Let's proceed with formal evaluation
We'll calculate a **similarity score**. A custome function that compares both the json structure, text and numbers values, against a ground truth.

In [6]:
import json_eval
expected_json_filepath = './outputs/pizza-2e_expected.json'
expected = json.load(open(expected_json_filepath))
#actual_output = open(actual_filepath, "r").read()
actual_output = open('./outputs/pizza-2e_chatgpt_actual.json', "r").read()
actual = json_eval.extract_json_from_string(actual_output)

similarity_score, difference_report = compare_json(expected, actual)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 78.80%

Detailed Difference Report:
Differences found:
  Path: extraInfo[0]
  Type: String Difference
  Expected Value 1: האפייה בתנור אבן
  Actual Value 2: כל הפיצות שלנו עם 100% מוצרלה

  Path: categories[0].dishes[6].description
  Type: String Difference
  Expected Value 1: מיקס גבינות, שום בהגזמה, קוקטייל זיתים, אקסטרה מוצרלה
  Actual Value 2: מיקס גבינות, שום גבינה, קממבר, ריקוטה, פרמז'ן, אקסטרה מוצרלה

  Path: categories[0].dishes[7].description
  Type: String Difference
  Expected Value 1: טבסקו, שיפקה, בוליביה, בקיאנו, פרסנו
  Actual Value 2: טבסקו, שקנפף, חלפיניו, ברבקיו, רק אולין פסטו

  Path: categories[0].dishes[7].name
  Type: String Difference
  Expected Value 1: לאמיצים חריפה אש
  Actual Value 2: לאצ'יצים חיפה אש

  Path: categories[2].name
  Type: String Difference
  Expected Value 1: שתיה
  Actual Value 2: שתייה

  Path: categories[2].dishes[1].price
  Type: Value Difference
  Expected Value 1: 14
  Actual Value 2: 10

  Path: categories[2].dishes[2].

# Dealing with hallucinations 😱
Let's discuss detection and resolution of hallucinations:
#### 🕵🏼‍♀️ Detection
If we could accurately detect hallucinations, it could improve the human data entry worker to focus on what to fix. 🔦 Plus, also assist with automatic resolution.
#### 🛠️ Resolution
resolve hallucination to improve accuracy. 📈

## Detection strategies
When in the wild, i.e., w/o ground truth data:
1. Compare outputs over multiple different invocations 🔢 to uncover variations in values at specific json paths (use different 🌡️ temperature,  📝 prompts, models, few shots examples, or anything else).
2. Ask the LLM for a confidence score  alongside each value (wishful thinking?) ✰ ✰ ✰

<img title="detection through multi invocation" src="./images/detection multi invoke.png" />

## Resolution strategies
1. Majority vote - pick the common answer from multiple invocations
2. Confidence - pick highest probability answer
3. Character level inference - Once established detected a suspicious value, we can ask the LLM to spell it out character by character.
4. Attention - Cropping the document to smaller sections and invoke per prompt (classic CV for bounding boxes)
5. Ensemble of the techniques above
6. Fine tuning the LLM on this OCR usecase (fine tuning Claude Haiku currently possible in Amazon Bedrock)

### Let's attempt detection and resolution using multiple invocations and majority voting
Start with defning 9 different invocations designd to induce different results:

In [14]:
#model_id = "anthropic.claude-3-sonnet-20240229-v1:0".  # Sonnet 3.5 outperforms Sonnet 3.0 in this task
base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : input_image_fullpath,
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
        }

# variations
from copy import deepcopy
num_of_invocations = 9
tests = [deepcopy(base_test) for i in range(num_of_invocations)]
tests[0]['temperature'] = 0
tests[1]['temperature'] = 0.7
tests[2]['temperature'] = 1

tests[3]['temperature'] = 0
tests[3]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[4]['temperature'] = 0.5
tests[4]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[5]['temperature'] = 1
tests[5]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[6]['temperature'] = 0
tests[6]['instructions_filename'] = 'pizza_system_prompt3.txt'

tests[7]['temperature'] = 0.7
tests[7]['instructions_filename'] = 'pizza_system_prompt3.txt'

tests[8]['temperature'] = 1
tests[8]['instructions_filename'] = 'pizza_system_prompt3.txt'

def get_outfile_path(idx, t):
        return f'./outputs/pizza_actual_' + t['model_id'] + f"_tempr({t['temperature']})_" + t['instructions_filename'].replace('.txt', '') + f'_{idx}' + '.json'


# print invocations as table
import pandas as pd
df = pd.DataFrame(tests)
df[['instructions_filename', 'temperature']]

Unnamed: 0,instructions_filename,temperature
0,pizza_system_prompt1.txt,0.0
1,pizza_system_prompt1.txt,0.7
2,pizza_system_prompt1.txt,1.0
3,pizza_system_prompt2.txt,0.0
4,pizza_system_prompt2.txt,0.5
5,pizza_system_prompt2.txt,1.0
6,pizza_system_prompt3.txt,0.0
7,pizza_system_prompt3.txt,0.7
8,pizza_system_prompt3.txt,1.0


Execute multi-invocations:

In [15]:
for idx,test in enumerate(tests): # IMPROVE: can run in parallel
        outfile_path = get_outfile_path(idx, test)
        output = get_ocr(test)
        print(f'Writing output to {outfile_path}')
        open(outfile_path, 'w').write(output)

Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0
Role: assistant
Text: I understand your request. I'll analyze the menu image and provide the information in the requested JSON structure, being careful to accurately transcribe the text as it appears. Here's the extracted menu information:

{
  "categories": [
    {
      "name": "פיצות",
      "dishes": [
        {
          "name": "פיצה אישית",
          "price": 23
        },
        {
          "name": "פיצה משפחתית",
          "price": 42
        },
        {
          "name": "פיצה ענקית",
          "price": 65
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 50
        },
        {
          "name": "משפחתית טבעונית",
          "price": 48
        },
        {
          "name": "פיצה בדרוק",
          "price": 70
        },
        {
          "name": "פיצה מיוחדת",
          "price": 70,
          "description": "מיקס גבינות, שום קונפי, קשקבל יווני, אספרגוס צלוי"
        },

## Comparing tests with ground truth data
We can see that we get different outputs with different levels of accuracy, hopefully these differences will largely intersect with the hallucinations we want to detect and resolve.
Comparing with similarity with ground truth:

In [16]:
# Data for the table
data = {
    'Test': [f"Test({i})" for i in range(9)],
    'Similarity': [94.38, 94.33, 94.90, 94.55, 94.55, 94.28, 82.51, 82.58, 94.38]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Adding additional metrics for average, worst, and best invocation similarity
summary_data = {
    'Test': ['Average invocation', 'Worst invocation', 'Best invocation'],
    'Similarity': [91.83, 82.51, 94.90]
}

# Creating a summary DataFrame and concatenating with the main data
summary_df = pd.DataFrame(summary_data)
df = pd.concat([df, summary_df], ignore_index=True)

df

Unnamed: 0,Test,Similarity
0,Test(0),94.38
1,Test(1),94.33
2,Test(2),94.9
3,Test(3),94.55
4,Test(4),94.55
5,Test(5),94.28
6,Test(6),82.51
7,Test(7),82.58
8,Test(8),94.38
9,Average invocation,91.83


In [22]:
import os
data = {
    'Invocation': [],
    'Scores': []
}
for idx,t in enumerate(tests):
    # load expected and actual output
    expected = json.load(open(expected_json_filepath))
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    data['Invocation'].append(f'{idx}')
    data['Scores'].append(similarity_score)

# Summary data
invocations_avg_similarity = sum(data['Scores'])/len(data['Scores'])
worst_score = min(data['Scores'])
best_score = max(data['Scores'])
summary_data = {
    'Invocation': ['Average invocation', 'Worst invocation', 'Best invocation'],
    'Scores': [invocations_avg_similarity, worst_score, best_score]
}

df = pd.concat([pd.DataFrame(data), pd.DataFrame(summary_data)], ignore_index=True)
df['Scores'] = df['Scores'].round(2)
df

Unnamed: 0,Invocation,Scores
0,0,94.44
1,1,82.19
2,2,94.4
3,3,94.39
4,4,94.73
5,5,94.26
6,6,82.4
7,7,47.71
8,8,94.65
9,Average invocation,86.58


## Compare all tests outputs to invocation[0] output
Let's see how similar are the different invocations to invocation(0) results. We can see they are 85%-90% similar, so might be able to use this to our advantage.

In [28]:
self_similarity = {
    'Invocation A': [],
    'Invocation B': [],
    'Similarity score': []
}

import json
expected_output = open(get_outfile_path(0, tests[0]), "r").read()
expected = json_eval.extract_json_from_string(expected_output)
for idx,t in enumerate(tests):
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    similarity_score, difference_report = compare_json(expected, actual)
    self_similarity['Invocation A'].append(0)
    self_similarity['Invocation B'].append(idx)
    self_similarity['Similarity score'].append(round(similarity_score, 2))

pd.DataFrame(self_similarity)

Unnamed: 0,Invocation A,Invocation B,Similarity score
0,0,0,100.0
1,0,1,78.41
2,0,2,99.61
3,0,3,99.04
4,0,4,99.22
5,0,5,99.24
6,0,6,78.97
7,0,7,51.77
8,0,8,98.41


## Calculate majority vote across all possible variations of all invocations

In [29]:
import pprint
pp = pprint.PrettyPrinter(depth=5)
print("let's review different values for each difference:")
all_differences = {}
num_tests = len(tests)
for idx,t in enumerate(tests):
    if idx == 0: continue
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    for i, diff in enumerate(difference_report.differences):
        path_diff = all_differences.get(diff['path'])
        if diff['type'] == 'Missing Key': continue # not dealing with these now, only with value differences
        if path_diff is None:
            values = {  diff['expected value'] : num_tests-1, # start assuming it's the only difference we'll encounter
                        diff['actual value'] : 1} 
            path_diff = {'path' : diff['path'],
                         'type' : diff['type'],
                         'values' : values}
            all_differences[diff['path']] = path_diff
        else:
            count = path_diff['values'].get(diff['actual value']) or 1
            path_diff['values'][diff['actual value']] = count + 1
            path_diff['values'][diff['expected value']] += -1

pp.pprint(all_differences)

let's review different values for each difference:
{'categories[0].dishes[5].name': {'path': 'categories[0].dishes[5].name',
                                  'type': 'String Difference',
                                  'values': {'פיצה בדרוזי': 2,
                                             'פיצה בדרום': 4,
                                             'פיצה בדרוס': 3,
                                             'פיצה בדרוק': 1,
                                             'פיצה ענקית': 2}},
 'categories[0].dishes[5].price': {'path': 'categories[0].dishes[5].price',
                                   'type': 'Value Difference',
                                   'values': {65: 1, 70: 8}},
 'categories[0].dishes[6].description': {'path': 'categories[0].dishes[6].description',
                                         'type': 'String Difference',
                                         'values': {'בצק בצבע, קונפיט שום, אנשובי, אספרגוס ועוד': 2,
                                       

## Measure accuracy of majority vote resolution against ground truth
We'll update invocation(0) output with the mojority voted values for all value differences.

In [32]:
test0_output = open(get_outfile_path(0, tests[0]), "r").read()
test0 = json_eval.extract_json_from_string(test0_output)

import copy
avg_actual_obj = json_eval.update_json_with_highest_values(copy.deepcopy(test0), all_differences)

expected = json.load(open(expected_json_filepath))
similarity_score, difference_report = compare_json(expected, avg_actual_obj)
print(f"majority_vote<>GT - Similarity: {similarity_score:.2f}%\n")

uplift_from_avg = similarity_score - invocations_avg_similarity
print(f"Similarity for the average invocation is: {invocations_avg_similarity:.2f}%. Majority vote uplifts similarity score by {uplift_from_avg:.2f}%.") 

uplift_from_worst = similarity_score - worst_score
print(f"Worst invocation is: {worst_score:.2f}%. We avoided it and uplifted by {uplift_from_worst:.2f}%.")

uplift_from_best = similarity_score - best_score
print(f"Best invocation is: {best_score:.2f}%. We uplifted by {uplift_from_best:.2f}%.")

majority_vote<>GT - Similarity: 94.39%

Similarity for the average invocation is: 86.58%. Majority vote uplifts similarity score by 7.81%.
Worst invocation is: 47.71%. We avoided it and uplifted by 46.68%.
Best invocation is: 94.73%. We uplifted by -0.34%.


# Conclusion
1. It's still early days for vision tasks, their robustness needs to improve.
2. Until this happens, we should include detection and resolution flows in order to stabilize and improve predictions.
3. The same techniques can be used with Agents where small inaccuracies causes drift in multi-steps based Re:Act computations and tools launch and evaluation.

# Next steps
1. 🗒 You can find this notebook in the Github repo here: [https://bit.ly/structured-ocr](https://bit.ly/structured-ocr).
2. 🎞️ See a recorded deep dive in the AWS Israel Hebrew YouTube: [https://bit.ly/aws-hebrew-youtube](https://bit.ly/aws-hebrew-youtube)
<img title="Subscribe to AWS Israel Hebrew YouTube Channel!" src="./images/subscribe youtube.png" />
3. 📢 WE'RE HIRING!
<img title="📢 WE'RE HIRING!" src="./images/we are hiring.png" />
