# Data Analysis with the OpenAI Assistants API

A tutorial on how to get GPT-4 to analyse a CSV file with the new Assistants API.

OpenAI tells us "The Assistants API allows you to build AI assistants within your own applications. An Assistant has instructions and can leverage models, tools, and knowledge to respond to user queries. The Assistants API currently supports three types of tools: Code Interpreter, Retrieval, and Function calling."

In this tutorial we use the Code Interpreter tool to read CSV data. We then ask the model questions about the data and display the messages that are returned.

Some of the code is adapted from the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/Assistants_API_overview_python.ipynb) (MIT licence).

In [1]:
import json

# This utility function uses the IPython function 'display' to pretty-print the JSON of the
# various entities created in the Assistants API. This code is copied directly from the OpenAI Cookbook.

def show_json(obj):
    display(json.loads(obj.model_dump_json()))
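Outside a notebook the IPython `display` function is not available. A minimal sketch of a plain-text alternative follows. It assumes only that the object has the `model_dump_json()` method used above; the names `to_plain` and `show_json_text` are illustrative and are not part of the Cookbook.

```python
import json


def to_plain(obj):
    """Convert an API object (anything with model_dump_json) into plain dicts and lists."""
    return json.loads(obj.model_dump_json())


def show_json_text(obj, indent=2):
    """Print the object as indented JSON and return the printed text."""
    text = json.dumps(to_plain(obj), indent=indent, sort_keys=True)
    print(text)
    return text
```

Calling `show_json_text(assistant)` should print the same fields that `show_json` displays, as ordinary text.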

### Assistants
You can create Assistants directly through the Assistants API. You must provide a name, a model and instructions, as in the code below, and you must, of course, be authenticated: this notebook requires you to enter your OpenAI API key.

After creating the assistant we display the JSON version of it.

In [2]:
from openai import OpenAI

# If you have set the OPENAI_API_KEY environment variable to your API key then you do not need
# to pass the key to OpenAI
# client = OpenAI()

# Otherwise you need to pass the key as a parameter, e.g.
# api_key = "your key here - don't put this in publicly available code!"
# or enter it interactively
api_key = input("API key: ")

client = OpenAI(api_key = api_key)
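A key typed into `input` stays visible in the cell output. A hedged alternative is to read the `OPENAI_API_KEY` environment variable (the one the OpenAI library itself looks for) and fall back to a hidden prompt. `resolve_api_key` is a hypothetical helper name, not part of the library, and its `env` and `prompt` parameters exist only to make it easy to try out.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass):
    """Return the key from OPENAI_API_KEY if set, otherwise ask for it without echoing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed characters, unlike input()
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

The client could then be created with `client = OpenAI(api_key=resolve_api_key())`.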

In [3]:
# Create an assistant and show what it looks like

assistant = client.beta.assistants.create(
    name="Data Assistant",
    instructions="You are a data analyst who can analyse and interpret data files such as CSV and can draw conclusions from the data",
    model="gpt-4-1106-preview",
)
show_json(assistant)

{'id': 'asst_KPJMedvBcgVfZq1SKtnxbhVe',
 'created_at': 1710757235,
 'description': None,
 'file_ids': [],
 'instructions': 'You are a data analyst who can analyse and interpret data files such as CSV and can draw conclusions from the data',
 'metadata': {},
 'model': 'gpt-4-1106-preview',
 'name': 'Data Assistant',
 'object': 'assistant',
 'tools': []}

### Create a thread

Initially, the thread is empty.


In [4]:
thread = client.beta.threads.create()
show_json(thread)

{'id': 'thread_Wqyp05bvHMyp2jsiIi46yexd',
 'created_at': 1710757235,
 'metadata': {},
 'object': 'thread'}

The following utility, from the OpenAI Cookbook, polls a run until it is no longer queued or in progress.

In [5]:
import time

def wait_on_run(run, thread):
    while run.status == "queued" or run.status == "in_progress":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
        time.sleep(0.5)
    return run
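`wait_on_run` loops for as long as the run is queued or in progress, so a stalled run would block the notebook indefinitely. The variant below is a sketch that adds a timeout. The function name and the `timeout` and `interval` parameters are additions for illustration, and it takes the client as an argument rather than relying on the global.

```python
import time


def wait_on_run_with_timeout(client, run, thread, timeout=120.0, interval=0.5):
    """Poll until the run leaves 'queued'/'in_progress', or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while run.status in ("queued", "in_progress"):
        if time.monotonic() > deadline:
            raise TimeoutError(f"Run {run.id} is still {run.status} after {timeout} seconds")
        time.sleep(interval)
        # Fetch the latest state of the run from the API
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
    return run
```

Leaving the loop does not mean the run succeeded: it is worth checking that the returned `run.status` is `"completed"` before reading the messages, since a run can also finish as failed, cancelled or expired.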

A helper to print the text of each message in a list. It assumes that the first content item of every message is text.

In [6]:
# Pretty printing helper from the OpenAI Cookbook

def pretty_print(messages):
    print("# Messages")
    for m in messages:
        print(f"{m.role}: {m.content[0].text.value}")
    print()

#### Code Interpreter

Add the Code Interpreter tool to the assistant.


In [7]:
assistant = client.beta.assistants.update(
    assistant.id,
    tools=[{"type": "code_interpreter"}],
)
show_json(assistant)


{'id': 'asst_KPJMedvBcgVfZq1SKtnxbhVe',
 'created_at': 1710757235,
 'description': None,
 'file_ids': [],
 'instructions': 'You are a data analyst who can analyse and interpret data files such as CSV and can draw conclusions from the data',
 'metadata': {},
 'model': 'gpt-4-1106-preview',
 'name': 'Data Assistant',
 'object': 'assistant',
 'tools': [{'type': 'code_interpreter'}]}

In [8]:
from vega_datasets import data

# Load the Gapminder dataset as a DataFrame and save it as a CSV file to upload
df = data.gapminder()
df.to_csv("data.csv")

In [9]:
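# Upload the file, closing the local file handle once the upload is done
with open("data.csv", "rb") as csv_file:
    file = client.files.create(
        file=csv_file,
        purpose="assistants",
    )
# Attach the uploaded file to the assistant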
assistant = client.beta.assistants.update(
    assistant.id,
    tools=[{"type": "code_interpreter"}],
    file_ids=[file.id],
)
show_json(assistant)

{'id': 'asst_KPJMedvBcgVfZq1SKtnxbhVe',
 'created_at': 1710757235,
 'description': None,
 'file_ids': ['file-KuB2jlqECC7T8YqWvRWIGxKr'],
 'instructions': 'You are a data analyst who can analyse and interpret data files such as CSV and can draw conclusions from the data',
 'metadata': {},
 'model': 'gpt-4-1106-preview',
 'name': 'Data Assistant',
 'object': 'assistant',
 'tools': [{'type': 'code_interpreter'}]}

In [10]:
# Create a message to append to our thread
message = client.beta.threads.messages.create(
    thread_id=thread.id, 
    role="user", 
    content="""Draw a line chart of the change in US population over time"""
)

# Execute our run
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

# Wait for completion and keep the updated run object
run = wait_on_run(run, thread)

# Retrieve all the messages added after our last user message
messages = client.beta.threads.messages.list(
    thread_id=thread.id, 
    order="asc", 
    after=message.id
)
show_json(messages)

{'data': [{'id': 'msg_pNy3H7met2SwVK7MzGdVqt3F',
   'assistant_id': 'asst_KPJMedvBcgVfZq1SKtnxbhVe',
   'content': [{'text': {'annotations': [],
      'value': "To create the line chart, I first need to examine the contents of the uploaded file to understand the structure of the data, including what time period it covers and how the population figures are recorded. Once I have this information, I can parse the data and create the appropriate line chart.\n\nI will begin by loading the data and previewing its structure to determine how we can extract the US population over time. Let's take a look at the file."},
     'type': 'text'}],
   'created_at': 1710757241,
   'file_ids': [],
   'metadata': {},
   'object': 'thread.message',
   'role': 'assistant',
   'run_id': 'run_84kPyLyMWCDZMcwgvLPtj0PU',
   'thread_id': 'thread_Wqyp05bvHMyp2jsiIi46yexd'},
  {'id': 'msg_3LiWou93oHp8voLgeWltavli',
   'assistant_id': 'asst_KPJMedvBcgVfZq1SKtnxbhVe',
   'content': [{'text': {'annotations': [],
   

In [13]:
# pretty_print does not work here because the messages contain more than just text
# pretty_print(messages)
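`pretty_print` reads `m.content[0].text`, which only exists for text content. A message that includes a chart carries a content part of type `image_file` instead. The sketch below handles both. `format_messages` is an illustrative name, and the code assumes each content part exposes `type` plus either `text.value` or `image_file.file_id`, which matches the Assistants API message format at the time of writing.

```python
def format_messages(messages):
    """Return a printable summary with one line per content part of each message."""
    lines = ["# Messages"]
    for m in messages:
        for part in m.content:
            if part.type == "text":
                lines.append(f"{m.role}: {part.text.value}")
            elif part.type == "image_file":
                # Images are referenced by file ID; the bytes are downloaded separately
                lines.append(f"{m.role}: [image file {part.image_file.file_id}]")
            else:
                lines.append(f"{m.role}: [unsupported content type {part.type}]")
    return "\n".join(lines)
```

`print(format_messages(messages))` would then list every part of every message rather than failing on the first image.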

In [18]:
# List the files generated by the assistant; the chart image is one of them
output_files = client.files.list(purpose="assistants_output")

In [24]:
print(output_files.data)

[FileObject(id='file-xdZcG8zRWKInraZlUUJA8pws', bytes=103495, created_at=1710757260, filename='7920bf66-aa74-4c35-8ce3-a750d8c8ef79', object='file', purpose='assistants_output', status='processed', status_details=None)]


In [25]:
output_file_id = output_files.data[0].id
print(output_file_id)

file-xdZcG8zRWKInraZlUUJA8pws
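Listing files by purpose returns the output files stored for the whole account, so `output_files.data[0]` is not guaranteed to be the chart from this particular run. A sketch of an alternative is to read the file IDs from the messages themselves. It assumes that image content parts have `type == "image_file"` and an `image_file.file_id` attribute; `image_file_ids` is a hypothetical helper name.

```python
def image_file_ids(messages):
    """Collect the file IDs of all image_file content parts, in message order."""
    ids = []
    for m in messages:
        for part in m.content:
            if part.type == "image_file":
                ids.append(part.image_file.file_id)
    return ids
```

With the `messages` list retrieved earlier, `image_file_ids(messages)` should contain the ID of the chart, provided the assistant produced one.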


In [35]:
import ipywidgets as widgets

s = widgets.Dropdown(
    options=output_files.data,
    description='Choose file',
    disabled = False
)

s

Dropdown(description='Choose file', options=(FileObject(id='file-xdZcG8zRWKInraZlUUJA8pws', bytes=103495, crea…

In [36]:
print(s.value)

FileObject(id='file-xdZcG8zRWKInraZlUUJA8pws', bytes=103495, created_at=1710757260, filename='7920bf66-aa74-4c35-8ce3-a750d8c8ef79', object='file', purpose='assistants_output', status='processed', status_details=None)
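Passing `output_files.data` directly makes the dropdown use the full text form of each `FileObject` as its label. `ipywidgets` also accepts a list of `(label, value)` pairs, so a shorter label can be mapped to the file ID. A sketch, with `dropdown_options` as an illustrative helper name:

```python
def dropdown_options(files):
    """Build (label, value) pairs: a short readable label mapped to the file ID."""
    return [(f"{f.id} ({f.bytes:,} bytes)", f.id) for f in files]
```

`widgets.Dropdown(options=dropdown_options(output_files.data), description='Choose file')` then gives a widget whose `value` is the selected file ID rather than the whole object.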


In [26]:
image_data = client.files.content(output_file_id)
image_data_bytes = image_data.read()

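# Use a name other than 'file' for the handle so that the uploaded file object,
# which the clean-up cell below still needs, is not overwritten
with open("./my-image.png", "wb") as image_file:
    image_file.write(image_data_bytes)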
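The file is written with a `.png` extension, but nothing checks that the downloaded bytes really are a PNG image. Every PNG file starts with the same eight-byte signature, so a quick check is possible before saving. This is a sketch that assumes the Code Interpreter output is a PNG; `is_png` and `save_png` are illustrative names.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file


def is_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def save_png(data, path):
    """Write the bytes to `path`, refusing data that does not look like a PNG."""
    if not is_png(data):
        raise ValueError("Downloaded bytes do not look like a PNG image")
    with open(path, "wb") as out:
        out.write(data)
    return path
```

For example, `save_png(image_data_bytes, "./my-image.png")` would raise an error instead of silently saving a mislabelled file.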

In [14]:
# Clear up after the session: detach the file from the assistant, then delete the assistant

deleted_assistant_file = client.beta.assistants.files.delete(
    assistant_id=assistant.id,
    file_id=file.id
)
print(deleted_assistant_file)

response = client.beta.assistants.delete(assistant.id)
print(response)



FileDeleteResponse(id='file-KuB2jlqECC7T8YqWvRWIGxKr', deleted=True, object='assistant.file.deleted')
AssistantDeleted(id='asst_KPJMedvBcgVfZq1SKtnxbhVe', deleted=True, object='assistant.deleted')
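Removing the file from the assistant only detaches it. The uploaded CSV and the generated image stay in the account's file storage until they are deleted through the Files endpoint. A sketch of that extra step follows. It assumes `client.files.delete(file_id)` returns an object with a `deleted` flag; `delete_files` is a hypothetical helper.

```python
def delete_files(client, file_ids):
    """Delete each file from storage and report which deletions succeeded."""
    results = {}
    for file_id in file_ids:
        response = client.files.delete(file_id)
        results[file_id] = bool(response.deleted)
    return results
```

It could be called with the ID of the uploaded CSV and `output_file_id` once the image has been saved locally.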
