# Assess GPT-4o Accuracy
# Create Truth Dataset for Fine-Tuning
This is a companion notebook to a LinkedIn post <INSERT URL>
In this notebook we do the following 2 things:
* assess GPT-4o accuracy on a custom dataset of PDF invoices from the Biden for President campaign
* create a truth dataset, which will be used for supervised fine-tuning.

The goal of fine-tuning will be to teach a model to extract data about Spots from line items of PDF invoices with a complex structure, which includes:
* multiple pages
* multiple line items with nested elements (Spots)

To create a truth dataset for fine-tuning, we will run GPT-4o over a sample of our data and then apply a quality function to identify completions with accurate data.

As a result, we will:
* estimate GPT-4o accuracy
* obtain a dataset with truth values


## Steps Overview
1. Run GPT-4o on a sample of PDF invoices
2. Use a quality function to filter accurate completions
3. Store the resulting truth dataset

# Initializer

In [None]:
!pip install boto3
import boto3
from IPython.display import clear_output
clear_output()

In [None]:
import os
# clone a helper library for reading/saving files from/to s3, unless it is already there
if 's3-operator' not in os.listdir():
  !git clone https://github.com/aguille-vert/s3-operator

import sys
sys.path.append('/content/s3-operator')

import s3_operator as oper

It is assumed that the PDF files were downloaded from the FCC website, page images were extracted from each file, and everything was uploaded to an AWS S3 bucket.  
For details see this [Colab notebook](https://github.com/aguille-vert/trump-biden-ads/blob/main/notebooks/trump_biden_download_preprocess_store.ipynb)

In [None]:
BUCKET = '<INSERT YOUR BUCKET NAME HERE>'

In [None]:
import json
from time import time
import pandas as pd
import numpy as np
from io import BytesIO
from PIL import Image
import requests
from collections import defaultdict
from random import choice, choices, randint
import re
from datetime import datetime, timedelta
from joblib import Parallel, delayed, parallel_backend, dump, load
from pprint import pprint
from time import sleep
import base64

## clients and tokens

In [None]:
from google.colab import userdata

OPENAI_KEY = userdata.get('OPENAI_KEY')

AWS_BRG_ACCESS_KEY = userdata.get('AWS_BRG_ACCESS_KEY')
AWS_BRG_SECRET_ACCESS_KEY = userdata.get('AWS_BRG_SECRET_ACCESS_KEY')

s3_client = boto3.client('s3',
            aws_access_key_id = AWS_BRG_ACCESS_KEY,
            aws_secret_access_key = AWS_BRG_SECRET_ACCESS_KEY)


# Functions

In [None]:
# price per million tokens
model_pricing = {
    'gpt-4o': {'prompt_tokens': 5, 'completion_tokens': 15},
    'gpt-3.5-turbo-0125': {'prompt_tokens': 0.5, 'completion_tokens': 1.5},
    'gpt-4-turbo': {'prompt_tokens': 10, 'completion_tokens': 30},
    # https://mistral.ai/technology/:
    'open-mistral-7b': {'prompt_tokens': 0.25, 'completion_tokens': 0.25},
    'open-mixtral-8x7b': {'prompt_tokens': 0.7, 'completion_tokens': 0.7},
    'open-mixtral-8x22b': {'prompt_tokens': 2, 'completion_tokens': 6},
    # groq.com 8,192 tokens:
    'llama3-8b-8192': {'prompt_tokens': 0.005, 'completion_tokens': 0.010},
    'mixtral-8x7b-32768': {'prompt_tokens': 0.27, 'completion_tokens': 0.27},
    'llama3-70b-8192': {'prompt_tokens': 0.59, 'completion_tokens': 0.79},
}

def get_image(s3_client, bucket, key):
    # Use the S3 client to download the file
    buffer = BytesIO()
    s3_client.download_fileobj(bucket, key, buffer)
    buffer.seek(0)
    pil_image = Image.open(buffer)

    # Reset buffer's pointer to the beginning
    buffer.seek(0)

    # Read the buffer content into bytes
    image_bytes = buffer.read()

    # Encode image bytes to base64
    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    return pil_image, base64_image

def get_image_completion(model,
                         prompt,
                         api,
                         api_key,
                         base64_images,
                         temperature=0.1,
                         max_tokens=2048,
                         TIMEOUT=30):

    apis = {
        'groq': 'https://api.groq.com/openai/v1/chat/completions',
        'openai': 'https://api.openai.com/v1/chat/completions',
        'mistral': 'https://api.mistral.ai/v1/chat/completions',
        'nvidia': 'https://integrate.api.nvidia.com/v1/chat/completions'
    }
    url = apis.get(api)

    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Create the messages payload
    content = [

            {"type": "text", "text": prompt}
          ]
    for i in base64_images:
      content.append({"type": "image_url", "image_url": {
                          "url": f"data:image/png;base64,{i}"}
                      })
    messages=[
            {
                "role": "user", "content": content
            }
            ]

    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

    return requests.post(url, json=payload, headers=headers, timeout=TIMEOUT)

def get_text_completion(model, prompt, api, api_key,
                   temperature = 0.1,
                   max_tokens = 1024,
                        timeout=60):

  apis = {'groq': 'https://api.groq.com/openai/v1/chat/completions',
          'openai' : 'https://api.openai.com/v1/chat/completions',
          'mistral' : 'https://api.mistral.ai/v1/chat/completions',
          'nvidia' : 'https://integrate.api.nvidia.com/v1/chat/completions'}
  url = apis.get(api)

  headers = {"Content-Type": "application/json",
          "Accept": "application/json",
          "Authorization": f"Bearer {api_key}"}


  messages = [
                {
                    "role": "user",
                    "content": json.dumps(prompt),
                },
                # {
                #   "role": "system",
                #   "content": "you are a helpful accountant."
                # }
              ]

  payload = {
          "model": model,
          "messages": messages,
          "max_tokens" : max_tokens,
          "response_format" : {"type": "json_object"},
          "temperature":temperature,
          # "stop" : '<END>'
        }


  return requests.post(url, json=payload, headers=headers, timeout=timeout)

def get_prompt(questions, document):

    return f"""Analyze the invoice text below and answer the following questions:
    ## Questions: {questions}
    Answer in JSON
    If you can't retrieve a value, assign 'None' to it.

    ### Documents start here:
    {document}
    ### Documents end here

    ### Answer:"""


def get_tokens_usage(completion):

  try:
    usage = completion['usage']
    prompt_tokens = usage['prompt_tokens']
    completion_tokens = usage['completion_tokens']
  except (KeyError, TypeError):
    # the response carries no usage block (for example, an error payload)
    prompt_tokens = None
    completion_tokens = None
  return prompt_tokens, completion_tokens
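
As a worked example of how `model_pricing` and the token counts combine, the sketch below recomputes the cost of a single request. The token counts (5916 prompt tokens, 1801 completion tokens) are those reported for the first invoice in the run log later in this notebook; the helper name `request_cost` is hypothetical and is not used anywhere else here.

```python
# Prices per million tokens for gpt-4o, copied from model_pricing above.
GPT_4O_PRICING = {'prompt_tokens': 5, 'completion_tokens': 15}

def request_cost(pricing, prompt_tokens, completion_tokens):
    """Return (prompt_cost, completion_cost, total_cost) for one request.

    Hypothetical helper for illustration: prices are quoted per million
    tokens, so each token count is divided by 1,000,000.
    """
    prompt_cost = pricing['prompt_tokens'] * prompt_tokens / 1_000_000
    completion_cost = pricing['completion_tokens'] * completion_tokens / 1_000_000
    return prompt_cost, completion_cost, prompt_cost + completion_cost

prompt_cost, completion_cost, total_cost = request_cost(GPT_4O_PRICING, 5916, 1801)
print(round(prompt_cost, 6), round(completion_cost, 6), round(total_cost, 6))
```

The three values should agree with the `prompt_tokens_price`, `completion_tokens_price` and `model_price` lines printed for that invoice.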




# Load Source dataset

In [None]:

key = 'datasets/FCC/biden_df.parquet'
biden_df = oper.pd_read_parquet(s3_client,
                                 BUCKET,
                                 key)
biden_df['last_update_ts'] = pd.to_datetime(biden_df['last_update_ts'])
biden_df['month'] = biden_df['create_ts'].dt.month
biden_df['year'] = biden_df['create_ts'].dt.year

biden_df.shape

(29613, 14)

# Prompt

In [None]:
questions = """
### Question1: What are the summary fields in the invoice ?

### Question2: How many line items are in the invoice ?

### Question3: How many spots are in the invoice ?

### Question4: What are the headers of the line items in the invoice ?

### Question5: Which headers correspond to:
    * TV program description
    * Amount paid for Spot
    * Spot Air Date


### Question6: Analyze each line item, for each header from the Answer to Question5 assign
corresponding spot value.

### Question7: Add all spot_amounts and output the total sum

### Answer_example =
{
  'Summary_fields' : {
                        'number': '3983920-1',
                        'date': '2023-09-29',
                        'gross_amount': 10000.00,
                        'net_amount': 9500.00,
                        'issuer': 'WTAE'
                    },
  'Line_items_num' : 10,
  'Spots_num' : 12,
  'Headers_mapping':
                    {'description' : 'Description',
                    'spot_amount' : 'Amount',
                    'air_date' : 'Air Date'},
  'Line_items' : [{
                      'line_num': 1,
                      'spot_num' : 1,
                      'air_date' : '2024-03-13',
                      'description': '6-7am News',
                      'spot_amount': 750.00
                      },
                      {
                      'line_num': 5,
                      'spot_num' : 6,
                      'air_date' : '2024-03-15',
                      'description': '6-7am News',
                      'spot_amount': 1750.00
                      }],
  'spot_amounts_total' : 2500.00

}


Answer in JSON format
"""
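
Because the prompt asks for a fixed set of top-level keys, a parsed completion can be checked against that set before any amounts are compared. The sketch below is an assumption about how such a check could look and is not part of the original pipeline; `EXPECTED_KEYS` mirrors the Answer_example above, and `missing_keys` is a hypothetical helper.

```python
# Top-level keys requested in the Answer_example of the prompt.
EXPECTED_KEYS = {
    'Summary_fields',
    'Line_items_num',
    'Spots_num',
    'Headers_mapping',
    'Line_items',
    'spot_amounts_total',
}

def missing_keys(completion):
    """Return the sorted expected top-level keys that a parsed completion lacks."""
    return sorted(EXPECTED_KEYS - set(completion))

# Hypothetical parsed completion that lacks two of the requested keys.
sample = {
    'Summary_fields': {'gross_amount': 2500.00},
    'Line_items_num': 2,
    'Spots_num': 2,
    'Line_items': [{'spot_amount': 750.00}, {'spot_amount': 1750.00}],
}
print(missing_keys(sample))  # ['Headers_mapping', 'spot_amounts_total']
```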

# Step 1: Run GPT-4o on a Sample of PDF Invoices
First, we'll use OpenAI's GPT-4o model to process a sample of PDF invoices: the page images of each invoice are sent together with the prompt, and the model generates a completion that attempts to extract the relevant information.

## april_invoices_df: a sample of PDF invoices
Let's use the invoices received by the Biden campaign in April 2024 as our sample of PDF invoices.

In [None]:
april_invoices_df = biden_df.query("month ==4 & year == 2024 & \
                                  file_name.str.contains('inv',case=False)").reset_index(drop=True)
april_invoices_df.shape

(191, 14)

## Run GPT-4o

I tried batching documents and running inference batch by batch, but that did not lead to significant time savings; therefore I'm falling back on a loop with a single document (one or more page images) per iteration.

In [None]:
model = 'gpt-4o'
api = 'openai'
api_key = OPENAI_KEY



invoice_df = pd.DataFrame()
price_counter = 0
failed = []
for row in april_invoices_df.itertuples():
  if row[0]>0:
    file_name = row.file_name
    print(file_name)
    json_data = None
    try:

        prefix = f"FCC/images/{file_name}/page_"
        _, image_keys = oper.get_latest_keys_from_(s3_client,
                                                    BUCKET,
                                                    prefix,
                                                    time_unit='day',
                                                    time_interval=100,
                                                    zipped=False)

        base64_images = []
        for key in image_keys:
            pil_image, base64_image = get_image(s3_client,
                                                 BUCKET,
                                                 key)
            base64_images.append(base64_image)

        PROMPT = questions



        s = time()
        r = get_image_completion(model,
                                 PROMPT,
                                  api,
                                  api_key,
                                  base64_images,
                                  temperature=0.1,
                                  max_tokens=4096,
                                  TIMEOUT=90)
        time_to_completion = round(time() - s, 2)

        json_data = r.json()

        # make sure that the model did not exhaust max_tokens
        assert json_data['choices'][0]['finish_reason'] == 'stop'

        prompt_tokens, completion_tokens = get_tokens_usage(json_data)
        prompt_tokens_price = model_pricing[model]['prompt_tokens']*prompt_tokens/1000000
        completion_tokens_price = model_pricing[model]['completion_tokens']*completion_tokens/1000000
        model_price = prompt_tokens_price + completion_tokens_price
        price_counter += model_price

        print(f"prompt_tokens : {prompt_tokens}")
        print(f"completion_tokens : {completion_tokens}")
        print(f"prompt_tokens_price : {prompt_tokens_price}")
        print(f"completion_tokens_price : {completion_tokens_price}")
        print(f"model_price : {model_price}")

        completion = json_data['choices'][0]['message']['content']

        try:
            completion = json.loads(completion)
        except json.JSONDecodeError:
            # the completion is not always bare JSON;
            # as a workaround, drop its first and last lines and parse the rest
            completion = json.loads(''.join(i for i in completion.splitlines()[1:-1]))
        print(row[0], file_name, time_to_completion, price_counter, len(failed))


        # break
        key = f'datasets/FCC/completions/april_2024_biden_invoices/gpt-4o/{file_name}/completion.json'
        s3_client.put_object(Body=json.dumps(json_data),
                            Bucket=BUCKET,
                            Key=key)
    except Exception:
        print(f"exception: {row[0]}")
        failed.append((file_name,json_data))
    # break

Biden for presiden Invoice 74406-1
prompt_tokens : 5916
completion_tokens : 1801
prompt_tokens_price : 0.02958
completion_tokens_price : 0.027015
model_price : 0.056595
1 Biden for presiden Invoice 74406-1 39.36 0.056595 0
Biden for President 1340531 March 24 invoice
prompt_tokens : 4216
completion_tokens : 1245
prompt_tokens_price : 0.02108
completion_tokens_price : 0.018675
model_price : 0.039755
2 Biden for President 1340531 March 24 invoice 34.99 0.09634999999999999 0
Biden for President - Invoice 3484398-1
prompt_tokens : 1921
completion_tokens : 790
prompt_tokens_price : 0.009605
completion_tokens_price : 0.01185
model_price : 0.021455000000000002
3 Biden for President - Invoice 3484398-1 18.55 0.11780499999999999 0
77585-5224040004 Biden for President est 11583 April invoice
exception: 4
Biden for President 386096 Invoice March
prompt_tokens : 2686
completion_tokens : 2101
prompt_tokens_price : 0.01343
completion_tokens_price : 0.031515
model_price : 0.044945
5 Biden for Preside
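
The fallback in the loop above drops the first and last lines of a completion that is not bare JSON. A slightly more tolerant variant, sketched here as an assumption rather than the code that produced these results, parses whatever sits between the first `{` and the last `}`. The name `extract_json` is hypothetical.

```python
import json

def extract_json(text):
    """Parse a completion that is bare JSON or a JSON object surrounded by extra text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first '{' and the last '}'.
    start, end = text.find('{'), text.rfind('}')
    if start == -1 or end <= start:
        raise ValueError('no JSON object found in completion')
    return json.loads(text[start:end + 1])

# Hypothetical completion with a line of text before and after the JSON object.
wrapped = 'Here is the answer:\n{"Spots_num": 2, "spot_amounts_total": 2500.0}\nDone.'
print(extract_json(wrapped))  # {'Spots_num': 2, 'spot_amounts_total': 2500.0}
```

This assumes the answer is a single top-level JSON object, which is what the prompt requests; a completion with braces in the surrounding text would still fail to parse and should be counted as failed.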

# Step 2: Use a Quality Function to Filter Accurate Completions (gpt_df_true)

Let's load the completions we stored in the previous step.

Some of the amounts retrieved by GPT-4o may be None or strings that cannot be converted to float. `get_totals` below converts the invoice gross amount and every spot amount with `float()`, so a completion with such values raises an exception and is collected in `failed` instead of being scored.

In [None]:
def get_totals(completion):
  summary = completion['Summary_fields']
  gross_amount = float(summary['gross_amount'])
  line_items = completion['Line_items']
  line_items_sum = sum([float(i['spot_amount']) for i in line_items])
  return gross_amount, line_items_sum
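
To show what `get_totals` returns and which completions it rejects, the sketch below applies it to two hypothetical completions shaped like the Answer_example in the prompt. `get_totals` is repeated from the cell above so that the block runs on its own; the sample dictionaries are made up for illustration.

```python
def get_totals(completion):
    # Same logic as get_totals above: gross amount and sum of spot amounts.
    summary = completion['Summary_fields']
    gross_amount = float(summary['gross_amount'])
    line_items = completion['Line_items']
    line_items_sum = sum(float(i['spot_amount']) for i in line_items)
    return gross_amount, line_items_sum

# Hypothetical completion: numeric strings are accepted by float().
good = {
    'Summary_fields': {'gross_amount': '2500.00'},
    'Line_items': [{'spot_amount': 750.00}, {'spot_amount': '1750.00'}],
}
print(get_totals(good))  # (2500.0, 2500.0)

# Hypothetical completion with a missing spot amount: float(None) raises TypeError.
bad = {
    'Summary_fields': {'gross_amount': 2500.00},
    'Line_items': [{'spot_amount': None}],
}
try:
    get_totals(bad)
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)  # True
```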

## gpt-4o

In [None]:
# let's get file_keys of gpt-4o completions (see Step1 above)
model = 'gpt-4o'

prefix = f'datasets/FCC/completions/april_2024_biden_invoices/{model}/'
keys_df = oper.get_latest_keys_from_(s3_client,
                                    BUCKET,
                                    prefix,
                                    zipped=True,
                                     time_unit='day',
                                     time_interval=100)
keys_df = pd.DataFrame(keys_df,
                       columns = ['ts','key'])
keys_df['file_name'] = keys_df['key'].str.split('/').str[5]
print(keys_df.shape)


# sum the spot amounts of each invoice and pair the sum with the invoice gross amount
collector = []
failed = []
for row in keys_df.itertuples():
  try:
    json_data = s3_client.get_object(
                                    Bucket = BUCKET,
                                    Key = row.key
                                    )
    json_data = json.loads(json_data['Body'].read())
    completion = json_data['choices'][0]['message']['content']
    try:
      completion = json.loads(completion)
    except json.JSONDecodeError:
        # the completion is not always bare JSON;
        # as a workaround, drop its first and last lines and parse the rest
        completion = json.loads(''.join(i for i in completion.splitlines()[1:-1]))

    gross_amount, line_items_sum = get_totals(completion)
    collector.append((row.file_name, gross_amount, line_items_sum))
  except Exception:
    failed.append(row.file_name)
    # break
  if row[0]%10==0:
    print(row[0], len(collector), len(failed))

# gpt_df_true: invoice gross amount equals line items total
df = pd.DataFrame(collector,
             columns=[
                      'file_name',
                      'gross_amount',
                      'line_sum'])
gpt_df_true = df.query("gross_amount == line_sum").reset_index(drop=True).copy()

df.shape, gpt_df_true.shape, len(failed)

(175, 3)
0 1 0
10 11 0
20 21 0
30 31 0
40 41 0
50 51 0
60 61 0
70 71 0
80 81 0
90 91 0
100 101 0
110 111 0
120 121 0
130 131 0
140 141 0
150 151 0
160 161 0
170 171 0


((175, 3), (142, 3), 0)

In [None]:
gpt_df_wrong = df.query("gross_amount != line_sum")
gpt_df_wrong.shape

(33, 3)

In [None]:
accuracy = len(gpt_df_true)/len(df)
print(f"accuracy of GPT-4o is {round(accuracy*100, 2)} %")

accuracy of GPT-4o is 81.14 %


In [None]:
gpt_df_wrong.reset_index(drop=True, inplace=True)

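### Accuracy of GPT-4o is 81.14 %
This number is based on matching the invoice gross total with the sum of spot amounts: if the two values are equal, we assume that the model identified all the spots and retrieved the correct spot amounts. It is computed over the 175 completions that were stored and parsed (142 / 175), out of the 191 invoices in the April 2024 sample. Other extracted fields, such as air dates and descriptions, are not checked by this test.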
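
The exact `==` comparison can reject a correct completion when spot amounts include cents, because summing binary floats may not reproduce the decimal total exactly. One possible refinement, which is an assumption and not what produced the 81.14 % figure, is to compare the two totals with a tolerance of half a cent. `totals_match` and its `tol` parameter are hypothetical names.

```python
import math

def totals_match(gross_amount, line_items_sum, tol=0.005):
    """True when the two totals differ by at most half a cent (absolute tolerance)."""
    return math.isclose(gross_amount, line_items_sum, rel_tol=0.0, abs_tol=tol)

# Hypothetical spot amounts whose float sum is not exactly 0.30.
spot_amounts = [0.10, 0.20]
line_items_sum = sum(spot_amounts)

exact_equal = line_items_sum == 0.30
print(exact_equal)                          # False
print(totals_match(0.30, line_items_sum))   # True
print(totals_match(1000.00, 990.00))        # False
```

With a tolerance, invoices that differ only by rounding noise would move from `gpt_df_wrong` to `gpt_df_true`, while genuine mismatches stay rejected.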

# Step 3: Store gpt_df_true and gpt_df_wrong in the S3 Bucket

In [None]:

key = 'datasets/FCC/completions/april_2024_biden_invoices/true_df.parquet'
oper.pd_save_parquet(s3_client,
                     gpt_df_true,
                     BUCKET,
                     key,
                     )

In [None]:

key = 'datasets/FCC/completions/april_2024_biden_invoices/wrong_df.parquet'
oper.pd_save_parquet(s3_client,
                     gpt_df_wrong,
                     BUCKET,
                     key,
                     )