#  Enhancing accuracy in LLM multi-modal vision use cases on Amazon Bedrock

As Large Language Models (LLMs) evolve to process multiple modalities, we face challenges reminiscent of early text-based LLMs: basic inconsistency, inaccuracy, and hallucinations. 
This notebook focus on techniques to increase the accuracy of vision use cases using various techniques. Inlucded is a quick introduction to AWS Gen AI services landscape and how to get started with using Anthropic Claude and other LLMs on Amazon Bedrock.

## What is Structured OCR

The goal is to A. idetify and read the (Hebrew) text on the page. B. Figure out the structure of the page. C. populate the structure with the text and output it (as Json).
Note that this example includes Hebrew text, some block lettes, some handwritten.

<img title="Sturctured OCR Task" src="./images/structured ocr 2.png"/>

# How can we accomplish this on AWS?
<img title="aws stack.png" src="./images/aws stack 3 layers.png" />

## Let's use a multi-modal LLM for this task
Let's proceed with using Anthropic Claude Sonnet 3.5 multi modal on Amazon Bedrock. We get a lot of abilities out of the box and we get started in minutes.

#### What vision use cases can it do?
- Classification (+ few shots)
- Object Detection
- Scene insights extraction (emotions)
- (Structured) OCR
- Content moderation
- Custome tasks - you name it

### How do we call A bedrock model with an image?

```python 
    # 1) Read the image bytes
    with open(input_image, "rb") as f:
        image_bytes = f.read()

    # 2) Create a message(s) with text or image
    message = {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "jpeg",
                    "source": {
                        "bytes": image_bytes
                    }
                }
            }
        ]
    }

    # 3) Invoke Converse API (same exact API for different model vendors + native tools support). Supported by LangChain.
    response = bedrock_client.converse(
        modelId = model_id,
        messages = [messages,],
        system = system_prompts,
        inferenceConfig = inference_config,
        #additionalModelRequestFields = additional_model_fields
    )
```

In [2]:
%load_ext autoreload
%autoreload 2
%pip install --upgrade pip
%pip install boto3 --quiet
%pip install botocore --quiet
%pip install json --quiet
import base64
import boto3
import json
from IPython.display import Image
from json_eval import compare_json

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
[31mERROR: Could not find a version that satisfies the requirement json (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for json[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.


In [3]:
from llm_util import generate_conversation, get_ocr
input_image_fullpath = "./images/pizza-2e.png"
base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : input_image_fullpath,
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
                }

output = get_ocr(base_test)

Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0
Role: assistant
Text: I understand your request. I'll analyze the menu image and provide the information in the requested JSON structure, being careful to accurately transcribe the text as it appears. Here's the structured menu information:

{
  "categories": [
    {
      "name": "פיצות",
      "dishes": [
        {
          "name": "פיצה אישית",
          "price": 23
        },
        {
          "name": "פיצה משפחתית",
          "price": 42
        },
        {
          "name": "פיצה ענקית",
          "price": 65
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 50
        },
        {
          "name": "משפחתית טבעונית",
          "price": 48
        },
        {
          "name": "פיצה בדרום",
          "price": 70
        },
        {
          "name": "פיצה מיוחדת",
          "price": 70,
          "description": "מיקס גבינות, שום קונפי, קשקבל יווני, אספרגוס צלוי, בלסמי"
  

In [4]:
# write response to disk
actual_filepath = './outputs/pizza-2e_actual.json'
with open(actual_filepath, 'w', encoding='utf-8') as f:
    f.write(output)

# Lets evaluate (manually)! 
It looks promising, however a manual diff uncovers some mistakes compared to the ground truth Json. For example instead of רטבים (sauces) it hallucinated לחמניות (buns).
<img title="Diff view" src="./images/pizza diff.png" />

## Hurray! While far from being perfect, it delivers immediate business value (💸, ⏱️)!
While it's not perfect, we can see immediate business value here, as instead of typing everything from scratch a data entry worker could just review the model's output and fix mistakes. Saving time and $.
## Let's proceed with formal evaluation
We'll calculate a **similarity score**. A custome function that compares both the json structure, text and numbers values, against a ground truth.

In [5]:
import json_eval
expected_json_filepath = './outputs/pizza-2e_expected.json'
expected = json.load(open(expected_json_filepath))
#actual_output = open(actual_filepath, "r").read()
actual_output = open('./outputs/pizza-2e_chatgpt_actual.json', "r").read()
actual = json_eval.extract_json_from_string(actual_output)

similarity_score, difference_report = compare_json(expected, actual)
print(f"Similarity score: {similarity_score:.2f}%")
print("\nDetailed Difference Report:")
difference_report.print_report()

Similarity score: 78.80%

Detailed Difference Report:
Differences found:
  Path: categories[0].dishes[6].description
  Type: String Difference
  Expected Value 1: מיקס גבינות, שום בהגזמה, קוקטייל זיתים, אקסטרה מוצרלה
  Actual Value 2: מיקס גבינות, שום גבינה, קממבר, ריקוטה, פרמז'ן, אקסטרה מוצרלה

  Path: categories[0].dishes[7].name
  Type: String Difference
  Expected Value 1: לאמיצים חריפה אש
  Actual Value 2: לאצ'יצים חיפה אש

  Path: categories[0].dishes[7].description
  Type: String Difference
  Expected Value 1: טבסקו, שיפקה, בוליביה, בקיאנו, פרסנו
  Actual Value 2: טבסקו, שקנפף, חלפיניו, ברבקיו, רק אולין פסטו

  Path: categories[2].name
  Type: String Difference
  Expected Value 1: שתיה
  Actual Value 2: שתייה

  Path: categories[2].dishes[1].price
  Type: Value Difference
  Expected Value 1: 14
  Actual Value 2: 10

  Path: categories[2].dishes[2].name
  Type: String Difference
  Expected Value 1: ספרייט
  Actual Value 2: ספייט

  Path: categories[2].dishes[4].name
  Type: Strin

# Dealing with hallucinations 😱
Let's discuss detection and resolution of hallucinations:
#### Detection
If we could accurately detect hallucinations, it could improve the human data entry worker to focus on what to fix. 🔦 Plus, also assist with automatic resolution.
#### Resolution
Reduce hallucination to improve accuracy. 📈

## Detection strategies
When in the wild, i.e., w/o ground truth data:
1. Compare outputs over multiple invocations 🔢 to uncover variations in values at specific json paths (use high 🌡️ temperature and different 📝 prompts).
2. Ask the LLM for a confidence score  alongside each value (wishful thinking?) ✰ ✰ ✰

<img title="detection through multi invocation" src="./images/detection multi invoke.png" />

## Resolution strategies
1. Majority vote - pick the common answer from multiple invocations
2. Confidence - pick highest probability answer
3. Character level inference - Once established detected a suspicious value, we can ask the LLM to spell it out character by character.
4. Attention - Cropping the document to smaller sections and invoke per prompt (classic CV for bounding boxes)
5. Ensemble of the techniques above
6. Fine tuning the LLM on this OCR usecase (fine tuning Claude Haiku currently possible in Amazon Bedrock)

### Let's attempt detection and resolution using multiple invocations and majority voting
Start with defning 9 different invocations designd to induce different results:

In [6]:
#model_id = "anthropic.claude-3-sonnet-20240229-v1:0".  # Sonnet 3.5 is far better than 3.0 in this task
base_test = {
                'model_id' : "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
                'input_image' : input_image_fullpath,
                'instructions_filename' : 'pizza_system_prompt1.txt',
                'temperature' : 0
        }

# variations
from copy import deepcopy
num_of_invocations = 9
tests = [deepcopy(base_test) for i in range(num_of_invocations)]
tests[0]['temperature'] = 0

tests[1]['temperature'] = 0.7

tests[2]['temperature'] = 1

tests[3]['temperature'] = 0
tests[3]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[4]['temperature'] = 0.5
tests[4]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[5]['temperature'] = 1
tests[5]['instructions_filename'] = 'pizza_system_prompt2.txt'

tests[6]['temperature'] = 0
tests[6]['instructions_filename'] = 'pizza_system_prompt3.txt'

tests[7]['temperature'] = 0.7
tests[7]['instructions_filename'] = 'pizza_system_prompt3.txt'

tests[8]['temperature'] = 1
tests[8]['instructions_filename'] = 'pizza_system_prompt3.txt'

def get_outfile_path(idx, t):
        return f'./outputs/pizza_actual_' + t['model_id'] + f"_tempr({t['temperature']})_" + t['instructions_filename'].replace('.txt', '') + f'_{idx}' + '.json'


Execute tests:

In [7]:
for idx,test in enumerate(tests): # IMPROVE: can run in parallel
        outfile_path = get_outfile_path(idx, test)
        output = get_ocr(test)
        print(f'Writing output to {outfile_path}')
        open(outfile_path, 'w').write(output)

Generating message with model us.anthropic.claude-3-5-sonnet-20240620-v1:0
Role: assistant
Text: I understand. I'll provide a structured summary of the menu information without reproducing copyrighted material, focusing on accurately representing the menu items and prices as shown in the image.

{
  "categories": [
    {
      "name": "פיצות",
      "dishes": [
        {
          "name": "פיצה אישית",
          "price": 23
        },
        {
          "name": "פיצה משפחתית",
          "price": 42
        },
        {
          "name": "פיצה ענקית",
          "price": 65
        },
        {
          "name": "משפחתית ללא גלוטן",
          "price": 50
        },
        {
          "name": "משפחתית טבעונית",
          "price": 48
        },
        {
          "name": "פיצה בדרום",
          "price": 70,
          "description": "מיקס גבינות, שום קונפי, קישואים, זיתים, אספרגוס, עגבניה"
        },
        {
          "name": "לאסאנה חריפה אש",
          "price": 78,
          "descrip

## Comparing tests with ground truth data
We can see that we get different outputs with different levels of accuracy, hopefully these differences will largely intersect with the hallucinations we want to detect and resolve.
Comparing with similarity with ground truth:

In [8]:
import os
scores = []
for idx,t in enumerate(tests):
    # load expected and actual output
    expected = json.load(open(expected_json_filepath))
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    scores.append(similarity_score)
    print(f"Test({idx})<>GT - similarity: {similarity_score:.2f}%")

invocations_avg_similarity = sum(scores)/len(scores)
print(f"Average invocation Similarity: {invocations_avg_similarity:.2f}%") # average similary
worst_score = min(scores)
print(f"Worst invocation Similarity: {worst_score:.2f}%") # worst similarity
best_score = max(scores)
print(f"Best invocation Similarity: {best_score:.2f}%") # best similarity

Test(0)<>GT - similarity: 94.38%
Test(1)<>GT - similarity: 94.33%
Test(2)<>GT - similarity: 94.90%
Test(3)<>GT - similarity: 94.55%
Test(4)<>GT - similarity: 94.55%
Test(5)<>GT - similarity: 94.28%
Test(6)<>GT - similarity: 82.51%
Test(7)<>GT - similarity: 82.58%
Test(8)<>GT - similarity: 94.38%
Average invocation Similarity: 91.83%
Worst invocation Similarity: 82.51%
Best invocation Similarity: 94.90%


## Compare all tests outputs to invocation[0] output
Let's see how similar are the different results to results of invocation(0). We can see they are 85%-90% similar, so might be able to use this to our advantage.

In [9]:
import json
expected_output = open(get_outfile_path(0, tests[0]), "r").read()
expected = json_eval.extract_json_from_string(expected_output)
for idx,t in enumerate(tests):
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    similarity_score, difference_report = compare_json(expected, actual)
    print(f"Test({idx})<>Test(0) - Similarity: {similarity_score:.2f}%")

Test(0)<>Test(0) - Similarity: 100.00%
Test(1)<>Test(0) - Similarity: 98.96%
Test(2)<>Test(0) - Similarity: 98.94%
Test(3)<>Test(0) - Similarity: 99.26%
Test(4)<>Test(0) - Similarity: 99.04%
Test(5)<>Test(0) - Similarity: 99.12%
Test(6)<>Test(0) - Similarity: 78.57%
Test(7)<>Test(0) - Similarity: 78.59%
Test(8)<>Test(0) - Similarity: 98.31%


## Calculate majority vote across all possible variations of all invocations

In [10]:
import pprint
pp = pprint.PrettyPrinter(depth=5)
print("let's review different values for each difference:")
all_differences = {}
num_tests = len(tests)
for idx,t in enumerate(tests):
    if idx == 0: continue
    actual_output = open(get_outfile_path(idx, t), "r").read()
    actual = json_eval.extract_json_from_string(actual_output)
    
    similarity_score, difference_report = compare_json(expected, actual)
    for i, diff in enumerate(difference_report.differences):
        path_diff = all_differences.get(diff['path'])
        if diff['type'] == 'Missing Key': continue # not dealing with these now, only with value differences
        if path_diff is None:
            values = {  diff['expected value'] : num_tests-1, # start assuming it's the only difference we'll encounter
                        diff['actual value'] : 1} 
            path_diff = {'path' : diff['path'],
                         'type' : diff['type'],
                         'values' : values}
            all_differences[diff['path']] = path_diff
        else:
            count = path_diff['values'].get(diff['actual value']) or 1
            path_diff['values'][diff['actual value']] = count + 1
            path_diff['values'][diff['expected value']] += -1

pp.pprint(all_differences)

let's review different values for each difference:
{'categories[0].dishes[5].description': {'path': 'categories[0].dishes[5].description',
                                         'type': 'String Difference',
                                         'values': {'טירלי נבונה, שום קונפי, קישואים, זיתים, אספרגוס, בזיליקום': 2,
                                                    'מיקס גבינות, שום קונפי, קישואים, זיתים, אספרגוס טרי': 1,
                                                    'מיקס גבינות, שום קונפי, קישואים, זיתים, אספרגוס, עגבניה': 6,
                                                    'מיקס גבינות, שום קונפי, קרעי חלה, אספרגוס צרוב': 2}},
 'categories[0].dishes[5].name': {'path': 'categories[0].dishes[5].name',
                                  'type': 'String Difference',
                                  'values': {'פיצה בדרום': 4,
                                             'פיצה בדרוס': 3,
                                             'פיצה מיוחדת': 3}},
 'categories[0].di

## Measure accuracy of majority vote resolution against ground truth
We'll update invocation(0) output with the mojority voted values for all value differences.

In [11]:
test0_output = open(get_outfile_path(0, tests[0]), "r").read()
test0 = json_eval.extract_json_from_string(test0_output)

import copy
avg_actual_obj = json_eval.update_json_with_highest_values(copy.deepcopy(test0), all_differences)

expected = json.load(open(expected_json_filepath))
similarity_score, difference_report = compare_json(expected, avg_actual_obj)
print(f"majority_vote<>GT - Similarity: {similarity_score:.2f}%\n")

uplift_from_avg = similarity_score - invocations_avg_similarity
print(f"Similarity for the average invocation is: {invocations_avg_similarity:.2f}%. Majority vote uplifts similarity score by {uplift_from_avg:.2f}%.") 

uplift_from_worst = similarity_score - worst_score
print(f"Worst invocation is: {worst_score:.2f}%. We avoided it and uplifted by {uplift_from_worst:.2f}%.")

uplift_from_best = similarity_score - best_score
print(f"Best invocation is: {best_score:.2f}%. We uplifted by {uplift_from_best:.2f}%.")
print(f"Best invocation Similarity: {best_score:.2f}%") # best similarity

majority_vote<>GT - Similarity: 94.55%

Similarity for the average invocation is: 91.83%. Majority vote uplifts similarity score by 2.73%.
Worst invocation is: 82.51%. We avoided it and uplifted by 12.04%.
Best invocation is: 94.90%. We uplifted by -0.34%.
Best invocation Similarity: 94.90%


# Conclusion
1. It's still early days for vision tasks, their robustness needs to improve.
2. Until this happens being able to "Reflect" and fix halucinations can greatly help the end result
3. The same techniques can be used with Agents where small inaccuracies causes drift in multi-steps computations.

# Next steps
1. 🗒 You can find this notebook in the Github repo here: [https://github.com/gilinachum/LLM-OCR-Evaluation/blob/main/main.ipynb](https://github.com/gilinachum/LLM-OCR-Evaluation/blob/main/main.ipynb).
2. 🎞️ See a recorded deep dive in the AWS Israel Hebrew YouTube: [https://bit.ly/aws-hebrew-youtube](https://bit.ly/aws-hebrew-youtube)
<img href>
3. 📢 WE'RE HIRING!
<img title="📢 WE'RE HIRING!" src="./images/we are hiring.png" />
