#  Enhancing accuracy in LLM multi-modal vision use cases on Amazon Bedrock

As Large Language Models (LLMs) evolve to process multiple modalities, we face challenges reminiscent of early text-based LLMs: basic inconsistency, inaccuracy, and hallucinations. 
This notebook focus on techniques to increase the accuracy of vision use cases using various techniques. Inlucded is a quick introduction to AWS Gen AI services landscape and how to get started with using Anthropic Claude and other LLMs on Amazon Bedrock.

## Let's deep dive into a specific vision task of Structured OCR

The goal is to A. idetify and read the (Hebrew) text on the page. B. Figure out the structure of the page. C. populate the structure with the text and output it (as Json).
Note that this example includes Hebrew text, some block lettes, some handwritten.

<img title="Sturctured OCR Task" src="./images/structured ocr.png" />

# How can we accomplish this on AWS?
<img title="aws stack.png" src="./images/aws stack.png" />

## Let's use a multi-modal LLM for this task
Let's proceed with using Anthropic Claude Sonnet 3.5 multi modal on Amazon Bedrock. We get a lot of abilities out of the box and we get started in minutes.

#### What vision use cases can it do?
- Classification (+ few shots)
- Object Detection
- Scene insights extraction (emotions)
- (Structured) OCR
- Content moderation
- Custome tasks - you name it

### How do we call A bedrock model with an image?

```python 
    # 1) Read the image bytes
    with open(input_image, "rb") as f:
        image_bytes = f.read()

    # 2) Create a message(s) with text or image
    message = {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "jpeg",
                    "source": {
                        "bytes": image_bytes
                    }
                }
            }
        ]
    }

    # 3) Invoke Converse API (same exact API for different model vendors + native tools support). Supported by LangChain.
    response = bedrock_client.converse(
        modelId = model_id,
        messages = [messages,],
        system = system_prompts,
        inferenceConfig = inference_config,
        #additionalModelRequestFields = additional_model_fields
    )
```

In [1]:
%load_ext autoreload
%autoreload 2
%pip install --upgrade pip
%pip install boto3 --quiet
%pip install botocore --quiet
%pip install json --quiet
import base64
import boto3
import json
from IPython.display import Image
from json_eval import compare_json

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
[31mERROR: Could not find a version that satisfies the requirement json (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for json[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [2]:
from llm_util import generate_conversation, get_ocr

base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : "./images/pizza-1c.png",
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
                }

output = get_ocr(base_test)


Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0


Role: assistant
Text: I'll extract the menu information from the image and structure it according to the provided JSON schema. I'll go through the menu from right to left and top to bottom, as it's written in Hebrew.

{
  "categories": [
    {
      "name": "תפריט",
      "dishes": [
        {
          "name": "פיצה אישית (S)",
          "price": 20
        },
        {
          "name": "פיצה משפחתית (L)",
          "price": 29
        },
        {
          "name": "פיצה ענקית (XL)",
          "price": 49
        },
        {
          "name": "משפחתית טבעונית",
          "price": 40
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 45
        },
        {
          "name": "המלצת השף פיצה יוונית",
          "price": 49,
          "description": "עגבניה,בצל,זיתים שחורים,בולגרית"
        }
      ],
      "extraInfo": [
        "תוספת שומשום בקצוות הפיצה 2 ₪"
      ]
    },
    {
      "name": "לחם שום",
      "dishes": [
        {
          "name": "לחם ש

# Lets evaluate (manually)! 
It looks promising, however a manual diff uncovers some mistakes compared to the ground truth Json. For example instead of רטבים (sauces) it hallucinated לחמניות (buns).
<img title="Diff view" src="./images/pizza diff.png" />

## Hurray! While far from being perfect, it delivers immediate business value (💸, ⏱️)!
While it's not perfect, we can see immediate business value here, as instead of typing everything from scratch a data entry worker could just review the model's output and fix mistakes. Saving time and $.
## Let's proceed with formal evaluation
We'll calculate a **similarity score**. A custome function that compares both the json structure, text and numbers values, against a ground truth.

In [44]:
import json_eval
expected = json.load(open('./outputs/pizza_expected.json'))
actual = json_eval.extract_json_from_string(output)

similarity_score, difference_report = compare_json(expected, actual)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 78.61%

Detailed Difference Report:
Differences found:
  Path: categories[0].dishes[3].description
  Type: Missing Key
  Expected Value 1: V
  Actual Value 2: Not present

  Path: categories[0].extraInfo[0]
  Type: String Difference
  Expected Value 1: תוספת שומשום בקצוות הפיצה 2 ₪
  Actual Value 2: 100% צהובה

  Path: categories[1].dishes[4].description
  Type: String Difference
  Expected Value 1: בתוספת צהובה, מוצרלה ובולגרית
  Actual Value 2: (בתוספת גבינה, מוצרלה ובולגרית)

  Path: categories[2].dishes[1].description
  Type: String Difference
  Expected Value 1: חסה, מלפפון, עגבניות שרי, תירס, זיתים ירוקים וטונה
  Actual Value 2: חסה, מלפפון, עגבניות שרי, תירס, זיתים שחורים וטונה

  Path: categories[2].extraInfo[0]
  Type: String Difference
  Expected Value 1: כל הסלטים מוגשים עם שתי רטבים לבחירה
  Actual Value 2: כל הסלטים מוגשים עם שמן זית ולימון סחוט

  Path: categories[3].dishes[0].price
  Type: Value Difference
  Expected Value 1: 3
  Actual Value 2: 2

  Pa

# Dealing with hallucinations 😱
Let's discuss detection and resolution of hallucinations:
#### Detection
If we could accurately detect hallucinations, it could improve the human data entry worker to focus on what to fix. 🔦
#### Resolution
Reduce hallucination to improve accuracy. 📈

## Detection strategies
When in the wild, i.e., w/o ground truth data:
1. Compare outputs over multiple invocations 🔢 to uncover variations in values at specific json paths (use high 🌡️ temperature and different 📝 prompts).
2. Ask the LLM for a confidence score  alongside each value (wishful thinking?) ✰ ✰ ✰

<img title="detection through multi invocation" src="./images/detection multi invoke.png" />

## Resolution strategies
1. Majority vote - pick the common answer from multiple invocations
2. Confidence - pick highest probability answer
3. Character level inference - Once established detected a suspicious value, we can ask the LLM to spell it out character by character.
4. Attention - Cropping the document to smaller sections and invoke per prompt (classic CV for bounding boxes)
5. Ensemble of the techniques above
6. Fine tuning the LLM on this OCR usecase (fine tuning Claude Haiku currently possible in Amazon Bedrock)

### Let's attempt detection and resolution using multiple invocations and majority voting
Start with defning 5 different invocations designd to induce different results:

In [37]:
#model_id = "anthropic.claude-3-sonnet-20240229-v1:0".  # Sonnet 3.5 is far better than 3.0 in this task
base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : "./images/pizza-1c.png",
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
        }

# variations
from copy import deepcopy
tests = [deepcopy(base_test) for i in range(5)]
tests[0]['temperature'] = 0

tests[1]['temperature'] = 0.7

tests[2]['temperature'] = 0
tests[2]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[3]['temperature'] = 1
tests[3]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[4]['temperature'] = 1
tests[4]['instructions_filename'] = 'pizza_system_prompt3.txt'

def get_outfile_path(idx, t):
        return f'./outputs/pizza_actual_' + t['model_id'] + f"_tempr({t['temperature']})_" + t['instructions_filename'].replace('.txt', '') + f'_{idx}' + '.json'


Execute tests:

In [39]:
for idx,test in enumerate(tests): # IMPROVE: can run in parallel
        outfile_path = get_outfile_path(idx, test)
        output = get_ocr(test)
        print(f'Writing output to {outfile_path}')
        open(outfile_path, 'w').write(output)

Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0
Role: assistant
Text: Here's the structured menu information extracted from the provided Hebrew menu image:

{
  "categories": [
    {
      "name": "תפריט",
      "dishes": [
        {
          "name": "פיצה אישית (S)",
          "price": 20
        },
        {
          "name": "פיצה משפחתית (L)",
          "price": 29
        },
        {
          "name": "פיצה ענקית (XL)",
          "price": 49
        },
        {
          "name": "משפחתית טבעונית",
          "price": 40
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 45
        },
        {
          "name": "המלצת השף פיצה יוונית",
          "price": 49,
          "description": "עגבניה,בצל,זיתים שחורים,בולגרית"
        }
      ]
    },
    {
      "name": "תוספות",
      "dishes": [
        {
          "name": "תוספת שימוש בקצות הפיצה",
          "price": 2
        }
      ]
    },
    {
      "name": "לחם שום",
      "

## Comparing tests with ground truth data
We can see that we get different outputs with different levels of accuracy, hopefully these differences will largely intersect with the hallucinations we want to detect and resolve.
Comparing with similarity with ground truth:

In [56]:
scores = []
for idx,t in enumerate(tests):
    # load expected and actual output
    expected = json.load(open('./outputs/pizza_expected.json'))
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    scores.append(similarity_score)
    print(f"Test({idx})<>GT - similarity: {similarity_score:.2f}%")

invocations_avg_similarity = sum(scores)/len(scores)
print(f"Average Similarity: {invocations_avg_similarity:.2f}%") # average similary

Test(0)<>GT - similarity: 86.98%
Test(1)<>GT - similarity: 81.33%
Test(2)<>GT - similarity: 78.96%
Test(3)<>GT - similarity: 88.56%
Test(4)<>GT - similarity: 78.61%
Average Similarity: 82.89%


## Compare all tests outputs to invocation[0] output
Let's see how similar are the different results to results of invocation(0). We can see they are 85%-90% similar, so might be able to use this to our advantage.

In [53]:
import json
expected_output = open(get_outfile_path(0, tests[0]), "r").read()
expected = json_eval.extract_json_from_string(expected_output)
for idx,t in enumerate(tests):
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    similarity_score, difference_report = compare_json(expected, actual)
    print(f"Test({idx})<>Test(0) - Similarity: {similarity_score:.2f}%")

Test(0)<>Test(0) - Similarity: 100.00%
Test(1)<>Test(0) - Similarity: 90.34%
Test(2)<>Test(0) - Similarity: 85.09%
Test(3)<>Test(0) - Similarity: 94.56%
Test(4)<>Test(0) - Similarity: 85.13%


## Calculate majority vote per all possible variations across all invocations

In [54]:
import pprint
pp = pprint.PrettyPrinter(depth=5)
print("let's review different values for each difference:")
all_differences = {}
num_tests = len(tests)
for idx,t in enumerate(tests):
    if idx == 0: continue
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    for i, diff in enumerate(difference_report.differences):
        path_diff = all_differences.get(diff['path'])
        if diff['type'] == 'Missing Key': continue # not dealing with these now, only with value differences
        if path_diff is None:
            values = {  diff['expected value'] : num_tests-1, # start assuming it's the only difference we'll encounter
                        diff['actual value'] : 1} 
            path_diff = {'path' : diff['path'],
                         'type' : diff['type'],
                         'values' : values}
            all_differences[diff['path']] = path_diff
        else:
            count = path_diff['values'].get(diff['actual value']) or 1
            path_diff['values'][diff['actual value']] = count + 1
            path_diff['values'][diff['expected value']] += -1

pp.pprint(all_differences)

let's review different values for each difference:
{'categories[0].extraInfo[0]': {'path': 'categories[0].extraInfo[0]',
                                'type': 'String Difference',
                                'values': {'100% צהובה': 2,
                                           '100% צהובה עמק': 1,
                                           'תוספת שומשום בקצוות הפיצה 2 ₪': 3}},
 'categories[1].dishes[4].description': {'path': 'categories[1].dishes[4].description',
                                         'type': 'String Difference',
                                         'values': {'(בתוספת גבינה, מוצרלה ובולגרית)': 2,
                                                    '(בתוספת עגבניה, מוצרלה ובולגרית)': 1,
                                                    'בתוספת בצלים, מוצרלה ובולגרית': 2,
                                                    'בתוספת גבינה, מוצרלה ובולגרית': 2,
                                                    'בתוספת זעתר, מוצרלה ובולגרית': 1}},
 'categor

## Measure accuracy of majority vote output against ground truth
We'll update invocation(0) output with the mojority voted values for all value differences.

In [63]:
test0_output = open(get_outfile_path(0, tests[0]), "r").read()
test0 = json_eval.extract_json_from_string(test0_output)

import copy
avg_actual_obj = json_eval.update_json_with_highest_values(copy.deepcopy(test0), all_differences)

expected = json.load(open('./outputs/pizza_expected.json'))
similarity_score, difference_report = compare_json(expected, avg_actual_obj)
print(f"avg_test<>GT - Similarity: {similarity_score:.2f}%")
print(f"Keep in mind that the Average Similarity is only: {invocations_avg_similarity:.2f}%")
uplift = similarity_score - invocations_avg_similarity
print(f"We achieved an uplift of {uplift:.2f}% using majority vote resolution")

avg_test<>GT - Similarity: 87.40%
Keep in mind that the Average Similarity is only: 82.89%
We achieved an uplift of 4.51% using majority vote resolution


# Conclusion
1. It's still early days for vision tasks, and it's robustness needs to improve
2. Until this happens being able to "Reflect" and fix halucinations can greatly help the end result
3. The same techniques can be used with Agents where small inaccuracies causes drift in multi-steps computations.

# Next steps
1. 🗒 You can find this notebook in the Github repo here: [https://github.com/gilinachum/LLM-OCR-Evaluation/blob/main/main.ipynb](https://github.com/gilinachum/LLM-OCR-Evaluation/blob/main/main.ipynb).
2. 🎞️ See a recorded deep dive in the AWS Israel Hebrew YouTube: [https://bit.ly/aws-hebrew-youtube](https://bit.ly/aws-hebrew-youtube)
<img href>
3. 📢 WE'RE HIRING!
<img title="📢 WE'RE HIRING!" src="./images/we are hiring.png" />


In [None]:
# load json from outputs/groundtruth1.json
expected = json.load(open('./outputs/pizza_expected.json'))
actual = json.load(open('./outputs/pizza_actual.json'))
actual = actual_json

similarity_score, difference_report = compare_json(expected, actual)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 81.66%

Detailed Difference Report:
Differences found:
  Path: categories[0].dishes[3].description
  Type: Missing Key
  Value 1: V
  Value 2: Not present

  Path: categories[0].dishes[5].name
  Type: String Difference
  Value 1: המלצת השף פיצה יוונית
  Value 2: המלצת השף פיצה יווניתת

  Path: categories[0].extraInfo[0]
  Type: String Difference
  Value 1: תוספת שימושים בקצוות הפיצה 2 ₪
  Value 2: 100% גבינה עמק

  Path: categories[1].extraInfo[0]
  Type: String Difference
  Value 1: כל הסלטים מגיעים עם שתי לחמניות לבחירה
  Value 2: כל הסלטים מוגשים עם פיתה לבנה

  Path: categories[3].name
  Type: String Difference
  Value 1: רטבים
  Value 2: תוספות

  Path: categories[3].dishes[1].name
  Type: String Difference
  Value 1: רוטב פיצה, מיני סטייק
  Value 2: רוטב פיצה, מיני טוסטר

  Path: categories[4].dishes[4].name
  Type: String Difference
  Value 1: לחם שום ענק (L)
  Value 2: לחם שום הבית (L)

  Path: categories[4].dishes[4].description
  Type: String Difference
  Va

In [None]:
# Example usage
import json
json1 = json.loads('{"name": "John", "age": 30, "city": "New York", "hobbies": ["reading", "swimming"]}')
json2 = json.loads('{"name": "Jon", "age": 31, "city": "New York", "hobbies": ["reading", "running"]}')

from json_eval import compare_json

similarity_score, difference_report = compare_json(json1, json2)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 77.76%

Detailed Difference Report:
Differences found:
  Path: age
  Type: Value Difference
  Value 1: 30
  Value 2: 31

  Path: name
  Type: String Difference
  Value 1: John
  Value 2: Jon



In [None]:
# load json from outputs/groundtruth1.json
expected = json.load(open('./outputs/pizza_expected.json'))
actual = json.load(open('./outputs/pizza_actual.json'))
actual = json.load(open(output_filename_fullpath))

similarity_score, difference_report = compare_json(expected, actual)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 35.73%

Detailed Difference Report:
Differences found:
  Path: categories[0].extraInfo[0]
  Type: String Difference
  Value 1: תוספת שימושים בקצוות הפיצה 2 ₪
  Value 2: תוספת ביצה קשה לסלט 3 ש"ח

  Path: categories[0].dishes[0].price
  Type: Value Difference
  Value 1: 20
  Value 2: 34

  Path: categories[0].dishes[0].name
  Type: String Difference
  Value 1: פיצה אישית (S)
  Value 2: סלט טורקי

  Path: categories[0].dishes[0].description
  Type: Missing Key
  Value 1: Not present
  Value 2: חסה, מלפפון, עגבניה שרי, חצילים וזיתים שרויים

  Path: categories[0].dishes[1].price
  Type: Value Difference
  Value 1: 29
  Value 2: 34

  Path: categories[0].dishes[1].name
  Type: String Difference
  Value 1: פיצה משפחתית (L)
  Value 2: סלט בגרוזית

  Path: categories[0].dishes[1].description
  Type: Missing Key
  Value 1: Not present
  Value 2: חסה, מלפפון, עגבניה שרי, חצילים וביצים שרויים

  Path: categories[0].dishes[2].price
  Type: Value Difference
  Value 1: 49
  Value 2