# Compliance Tasks Using OpenAI's `gpt-4o`

## Implementing an "autoscan" function via prompt engineering

In [1]:
import base64
import json
import os

from openai import OpenAI

from pydantic import BaseModel, Field, ConfigDict

### Authentication

> The OpenAI API uses API keys for authentication....
>
> ... Once you've generated an API key, export it as an environment variable in your terminal.
>
> ... OpenAI SDKs are configured to automatically read your API key from the system environment.

In [2]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

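The SDK sends the API key with every request using [HTTP Bearer authentication](https://swagger.io/docs/specification/v3_0/authentication/bearer-authentication/), so we never have to pass it explicitly.

The actual HTTP header looks like this, with `OPENAI_API_KEY` standing in for the value of the key: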

    Authorization: Bearer OPENAI_API_KEY
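
To make the header format above concrete, here is a minimal sketch of how such a header could be built by hand. The helper name `bearer_headers` and the placeholder key are assumptions for illustration only; the OpenAI SDK builds this header internally, and its actual code is not shown here.

```python
def bearer_headers(api_key: str) -> dict[str, str]:
    """Build the HTTP header used for Bearer authentication (hypothetical helper)."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {"Authorization": f"Bearer {api_key}"}


# A placeholder key for illustration only; never hard-code a real key.
headers = bearer_headers("sk-example")
print(headers)
```

In practice the key should still come from the environment, as in the cell above.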

### Instantiate a client to the OpenAI API

In [3]:
client = OpenAI()

In [4]:
SYSTEM_LEVEL_INSTRUCTION = (
    "You are an expert financial data analyst, specializing in Compliance. "
    "You are also perfectly bilingual in Japanese and English. "
    "As you work for the Compliance team, you need to follow the given instructions to the letter "
    "and make sure that the given file does not break any financial laws or regulations. "
    "Not only must you read the text in the file, but you must use OCR in order to also "
    "read any text that might appear in all images and objects embedded in the file. "
).strip()

### File inputs

* For file inputs, the OpenAI API expects PDF files.
* We can send the PDF file inline with our request by encoding its raw bytes as a [Base64](https://en.wikipedia.org/wiki/Base64) string.

In [5]:
with open("samples/sample_ppt.pdf", "rb") as f:
    raw_data = f.read()

base64_data = base64.b64encode(raw_data).decode("utf-8")
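
The Base64 string is later wrapped in a `data:application/pdf;base64,...` URL inside the request. As a small sketch, the encoding and its inverse can be written as a pair of helpers; the names `to_pdf_data_url` and `from_pdf_data_url` are hypothetical, and the sample bytes stand in for a real file.

```python
import base64

PDF_DATA_URL_PREFIX = "data:application/pdf;base64,"


def to_pdf_data_url(raw: bytes) -> str:
    """Encode raw PDF bytes as a Base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"{PDF_DATA_URL_PREFIX}{encoded}"


def from_pdf_data_url(data_url: str) -> bytes:
    """Recover the original bytes from a URL built by to_pdf_data_url."""
    if not data_url.startswith(PDF_DATA_URL_PREFIX):
        raise ValueError("not a Base64 PDF data URL")
    return base64.b64decode(data_url[len(PDF_DATA_URL_PREFIX):])


sample = b"%PDF-1.4 example bytes"  # stand-in for a real file's contents
url = to_pdf_data_url(sample)
assert from_pdf_data_url(url) == sample  # the encoding is lossless
```

Base64 output is about a third larger than the raw bytes, which is worth keeping in mind for large PDFs.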

### Defining the "autoscan" function

#### Inputs

The general inputs to the "autoscan" function are:
* OpenAI API client
* GPT model name
* Input PDF file name
* Input PDF file bytes as Base64-string
* Prompt to model

#### Output

The general output is the answer from the specified model. Note that the shape of the returned "answer" object depends on which OpenAI API is called inside the "autoscan" function.

There are two major APIs available from OpenAI:
1. the newer [Responses API](https://platform.openai.com/docs/api-reference/responses)
2. the legacy [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)

In both APIs:
* System-level instructions guide the model's responses; define general model behavior there.
* Per-request model instructions are provided via the prompt.

Because we will probably want [zero data retention](https://platform.openai.com/docs/guides/your-data#default-usage-policies-by-endpoint), do NOT use the newer [Responses API](https://platform.openai.com/docs/api-reference/responses)!

We instead use the older [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Although this API retains input data for 30 days by default, _zero data retention is available as an option_.

In [6]:
model_name = "gpt-4o"

In [7]:
prompt = """
Scan the given file, and tell me if there is any appearance of following terms:
1. ティムコ
2. 7501
3. ダイワ
4. 6121

For each of the terms listed above, create a result object that includes both the term
and the location in the PDF of the term.

Good examples of location values are:
- page 1, title
- page 2, table
- page 3, text
- page 4, bullet
- page 5, footnotes
- page 6, graph
- page 7, image
- not found
""".strip()
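
The term list in the prompt above is hard-coded. As a sketch of how the same prompt could be generated for an arbitrary watch list, the helper below numbers the terms and appends the fixed instructions. The names `build_prompt` and `RESULT_INSTRUCTIONS` are hypothetical, the wording is lightly paraphrased, and the list of example locations is omitted for brevity.

```python
RESULT_INSTRUCTIONS = (
    "For each of the terms listed above, create a result object that includes "
    "both the term\nand the location in the PDF of the term."
)


def build_prompt(terms: list[str]) -> str:
    """Build a scan prompt for the given search terms (hypothetical helper)."""
    if not terms:
        raise ValueError("at least one search term is required")
    # Number the terms starting from 1, one per line.
    numbered = "\n".join(f"{i}. {term}" for i, term in enumerate(terms, start=1))
    return (
        "Scan the given file, and tell me whether any of the following terms appear:\n"
        f"{numbered}\n\n"
        f"{RESULT_INSTRUCTIONS}"
    )


print(build_prompt(["ティムコ", "7501", "ダイワ", "6121"]))
```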

### Structured output

Read more about structured outputs in the OpenAI API [here](https://platform.openai.com/docs/guides/structured-outputs?api-mode=chat#introduction). The `schema` entry of the response format, written as a Python dict, has the following shape:
        
        "schema": {
            "type": "object",
            "properties": {
                "results": {
                    "type": "array", 
                    "items": {
                        "type": "object",
                        "properties": {
                            "term": {"type": "string"},
                            "location": {"type": "string"}
                        },
                        "required": ["term", "location"],
                        "additionalProperties": False
                    }
                },
            },
            "required": ["results"],
            "additionalProperties": False
        }

In [9]:
class Result(BaseModel):
    model_config = ConfigDict(extra="forbid")
    term: str = Field(description="the search term")
    location: str = Field(
        description=("single location in the document where the term was found.")
    )

class AutoscanResults(BaseModel):
    model_config = ConfigDict(extra="forbid")
    results: list[Result] = Field(description="list of search results")
    
results_schema = AutoscanResults.model_json_schema()

print(json.dumps(results_schema, indent=2))

{
  "$defs": {
    "Result": {
      "additionalProperties": false,
      "properties": {
        "term": {
          "description": "the search term",
          "title": "Term",
          "type": "string"
        },
        "location": {
          "description": "single location in the document where the term was found.",
          "title": "Location",
          "type": "string"
        }
      },
      "required": [
        "term",
        "location"
      ],
      "title": "Result",
      "type": "object"
    }
  },
  "additionalProperties": false,
  "properties": {
    "results": {
      "description": "list of search results",
      "items": {
        "$ref": "#/$defs/Result"
      },
      "title": "Results",
      "type": "array"
    }
  },
  "required": [
    "results"
  ],
  "title": "AutoscanResults",
  "type": "object"
}
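
The schema printed above sets `additionalProperties` to `false` and lists every property under `required` on each object, which is what `extra="forbid"` and the absence of default values give us. As far as I understand the structured outputs guide, strict mode expects both. The following is a minimal, deliberately partial sketch that checks only those two rules; the function name `strict_schema_problems` is hypothetical and this is not an official validator.

```python
def strict_schema_problems(schema: dict, path: str = "#") -> list[str]:
    """List violations of two strict-mode rules found in a JSON schema dict."""
    problems = []
    if schema.get("type") == "object":
        # Rule 1: objects must forbid undeclared properties.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        # Rule 2: every declared property must be listed as required.
        props = schema.get("properties", {})
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: properties not listed in required: {missing}")
        for name, sub in props.items():
            problems.extend(strict_schema_problems(sub, f"{path}/properties/{name}"))
    # Recurse into array items and shared definitions.
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}/items"))
    for name, sub in schema.get("$defs", {}).items():
        problems.extend(strict_schema_problems(sub, f"{path}/$defs/{name}"))
    return problems


# A schema that breaks both rules, for illustration.
loose = {"type": "object", "properties": {"term": {"type": "string"}}}
for problem in strict_schema_problems(loose):
    print(problem)
```

Calling `strict_schema_problems(results_schema)` on the schema generated above should return an empty list.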


#### `autoscan` implemented with the newer [Responses API](https://platform.openai.com/docs/api-reference/responses)

In [10]:
def autoscan_responses(client, model_name, filename, filedata, prompt):
    response = client.responses.create(
        model=model_name,
        
        input=[
            {
                "role": "system", 
                "content": SYSTEM_LEVEL_INSTRUCTION
            },
            
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_file",
                        "filename": filename,
                        "file_data": f"data:application/pdf;base64,{filedata}"
                    },
                    {
                        "type": "input_text",
                        "text": prompt,
                    },
                ]
            }
        ],

        text={
            "format": {
                "type": "json_schema",
                "name": "autoscan",
                "schema": results_schema,
                "strict": True
            }
        },

        store=False
    )

    return response

In [11]:
%%time

response = autoscan_responses(client, model_name, "sample_ppt.pdf", base64_data, prompt)

#print(response.output_text)
if response.error:
    # Show the error object returned by the API rather than a fixed placeholder.
    print(f"ERROR: {response.error}")
else:
    print("Autoscan results:")
    o = json.loads(response.output_text)
    for result in o["results"]:
        print(f"- {result}")    

Autoscan results:
- {'term': 'ティムコ', 'location': 'page 2, table'}
- {'term': '7501', 'location': 'page 2, table'}
- {'term': 'ダイワ', 'location': 'not found'}
- {'term': '6121', 'location': 'not found'}
CPU times: user 36.9 ms, sys: 8.15 ms, total: 45 ms
Wall time: 4.45 s
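
For a compliance check, the terms that were actually found matter most. Here is a small sketch that splits the JSON answer into found and missing terms, relying on the `not found` convention from the prompt. The helper name `split_results` is hypothetical, and the sample string is a shortened copy of the output above.

```python
import json


def split_results(output_text: str) -> tuple[dict[str, str], list[str]]:
    """Split a JSON answer into found terms (term -> location) and missing terms."""
    found: dict[str, str] = {}
    missing: list[str] = []
    for item in json.loads(output_text)["results"]:
        if item["location"].strip().lower() == "not found":
            missing.append(item["term"])
        else:
            # If a term is reported more than once, the last location wins.
            found[item["term"]] = item["location"]
    return found, missing


sample_output = (
    '{"results":[{"term":"ティムコ","location":"page 2, table"},'
    '{"term":"ダイワ","location":"not found"}]}'
)
found, missing = split_results(sample_output)
print(found)    # {'ティムコ': 'page 2, table'}
print(missing)  # ['ダイワ']
```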


In [12]:
print(response)

Response(id='resp_67fb67892fc881929c0788d6da1505bf05ff722e7e450649', created_at=1744529290.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_67fb678c0d708192baf3361bf1db02a705ff722e7e450649', content=[ResponseOutputText(annotations=[], text='{"results":[{"term":"ティムコ","location":"page 2, table"},{"term":"7501","location":"page 2, table"},{"term":"ダイワ","location":"not found"},{"term":"6121","location":"not found"}]}', type='output_text')], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, max_output_tokens=None, previous_response_id=None, reasoning=Reasoning(effort=None, generate_summary=None), status='completed', text=ResponseTextConfig(format=ResponseFormatTextJSONSchemaConfig(name='autoscan', schema_={'$defs': {'Result': {'additionalProperties': False, 'properties': {'term': {'description': 

----

#### `autoscan` implemented with the legacy [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)

In [13]:
def autoscan_completions(client, model_name, filename, filedata, prompt):
    completion = client.chat.completions.create(
        model=model_name,

        messages=[
            {
                "role": "system", 
                "content": SYSTEM_LEVEL_INSTRUCTION
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "file",
                        "file": {
                            "filename": filename,
                            "file_data": f"data:application/pdf;base64,{filedata}"
                        }
                    },
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ]
            }
        ],

        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "autoscan",
                "schema": results_schema,
                "strict": True
            }
        },

        store=False     
    )
    
    return completion

In [14]:
%%time

completion = autoscan_completions(client, model_name, "sample_ppt.pdf", base64_data, prompt)

if completion.choices[0].message.refusal:
    print(f"ERROR: {completion.choices[0].message.refusal}")
else:
    print("Autoscan results:")
    o = json.loads(completion.choices[0].message.content)
    for result in o["results"]:
        print(f"- {result}")

Autoscan results:
- {'term': 'ティムコ', 'location': 'not found'}
- {'term': '7501', 'location': 'not found'}
- {'term': 'ダイワ', 'location': 'not found'}
- {'term': '6121', 'location': 'not found'}
CPU times: user 33.6 ms, sys: 5.43 ms, total: 39.1 ms
Wall time: 1min 38s
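
The two runs above do not agree: the Responses API run located ティムコ and 7501 on page 2, while the Chat Completions run reported every term as not found. A minimal sketch for surfacing such disagreements between two result lists follows. The helper name `diff_results` is hypothetical, and the data is copied from the two recorded outputs.

```python
def diff_results(a: list[dict], b: list[dict]) -> dict[str, tuple[str, str]]:
    """Map each term whose location differs between two runs to (location_a, location_b)."""
    loc_a = {r["term"]: r["location"] for r in a}
    loc_b = {r["term"]: r["location"] for r in b}
    return {
        term: (loc_a.get(term, "missing"), loc_b.get(term, "missing"))
        for term in sorted(set(loc_a) | set(loc_b))
        if loc_a.get(term) != loc_b.get(term)
    }


responses_results = [
    {"term": "ティムコ", "location": "page 2, table"},
    {"term": "7501", "location": "page 2, table"},
    {"term": "ダイワ", "location": "not found"},
    {"term": "6121", "location": "not found"},
]
completions_results = [
    {"term": "ティムコ", "location": "not found"},
    {"term": "7501", "location": "not found"},
    {"term": "ダイワ", "location": "not found"},
    {"term": "6121", "location": "not found"},
]

for term, (first, second) in diff_results(responses_results, completions_results).items():
    print(f"{term}: Responses={first!r}, Chat Completions={second!r}")
```

Any term listed by this helper deserves a manual look at the source PDF before the scan result is trusted.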


In [15]:
print(completion.to_dict())

{'id': 'chatcmpl-BLm5wyXYiQTBNxxqoiQDIuzzwc6z5', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': '{"results":[{"term":"ティムコ","location":"not found"},{"term":"7501","location":"not found"},{"term":"ダイワ","location":"not found"},{"term":"6121","location":"not found"}]}', 'refusal': None, 'role': 'assistant', 'annotations': []}}], 'created': 1744529464, 'model': 'gpt-4o-2024-08-06', 'object': 'chat.completion', 'service_tier': 'default', 'system_fingerprint': 'fp_92f14e8683', 'usage': {'completion_tokens': 48, 'prompt_tokens': 345, 'total_tokens': 393, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}}


### File clean-up

In [16]:
print("Lastly, we clean up any files on OpenAI storage")

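# NOTE: this deletes every file in the account's OpenAI storage,
# not only files created by this notebook.
for f in client.files.list():
    client.files.delete(f.id)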

Lastly, we clean up any files on OpenAI storage


In [17]:
print("Any files left on OpenAI storage?")

client.files.list()

Any files left on OpenAI storage?


SyncCursorPage[FileObject](data=[], has_more=False, object='list', first_id=None, last_id=None)