The cell below imports the required packages and creates the client used to call the OpenAI or DeepSeek models. To switch providers, change the base_url (and the matching api_key) in the client constructor.

In [None]:
from openai import OpenAI
import dotenv
import importlib
import messages


importlib.reload(messages)
from messages import messages_list
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()


dotenv.load_dotenv()
openai_api_key = dotenv.get_key(".env", "OPENAI_API_KEY")
deepseek_api_key = dotenv.get_key(".env", "DEEPSEEK_API_KEY")

responses = []

deepseek_url = "https://api.deepseek.com"
openai_url = "https://api.openai.com/v1"

client = OpenAI(api_key=openai_api_key,  base_url=openai_url)
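
Switching providers means changing the URL and the key together, so it can help to keep both in one place. The following is a minimal sketch of that idea, not part of the original notebook: `PROVIDERS` and `client_kwargs` are hypothetical names, and the URLs and environment variable names are the ones used in the cell above.

```python
import os

# Hypothetical lookup table; URLs and key names are copied from the cell above.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_var": "OPENAI_API_KEY"},
    "deepseek": {"base_url": "https://api.deepseek.com", "key_var": "DEEPSEEK_API_KEY"},
}


def client_kwargs(provider, env=os.environ):
    """Return the api_key/base_url pair for the chosen provider."""
    cfg = PROVIDERS[provider]
    # env.get returns None when the key is missing, so a missing key is easy to spot
    return {"api_key": env.get(cfg["key_var"]), "base_url": cfg["base_url"]}
```

With this in place the client could be created as `client = OpenAI(**client_kwargs("deepseek"))`, assuming the keys have already been loaded into the environment by `dotenv.load_dotenv()`.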


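This builds a list of (conversation, index) tuples, one per question. Each conversation places the question between a general instruction for the LLM and a reminder to output only the answer. The index is kept so the responses can be put back in question order later.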

In [4]:
messages = [
    (
        [
            {
                "role": "user",
                "content": "You are trying to help people that are not very knowledgeable about finance answer questions about their mortgage.",
            },
            {"role": msg["role"], "content": msg["content"]},
            {
                "role": "user",
                "content": "Output only the answer you get.",
            },
        ],
        i,
    )
    for i, msg in enumerate(messages_list)
]
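
To make the shape of one entry easier to see, here is a small sketch that builds a single conversation outside the list comprehension. `build_conversation` and the sample question are hypothetical and only for illustration; the two prompt strings are the ones used above.

```python
INSTRUCTION = (
    "You are trying to help people that are not very knowledgeable "
    "about finance answer questions about their mortgage."
)
REMINDER = "Output only the answer you get."


def build_conversation(msg):
    """Wrap one question between the instruction and the output reminder."""
    return [
        {"role": "user", "content": INSTRUCTION},
        {"role": msg["role"], "content": msg["content"]},
        {"role": "user", "content": REMINDER},
    ]


# Hypothetical placeholder question, not taken from messages_list.
sample = {"role": "user", "content": "What is my monthly payment?"}
conversation = build_conversation(sample)
```

Each tuple in the list above is then `(build_conversation(msg), i)` for the i-th question.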

A helper function that strips a response down to a single number so it can be compared easily with the expected answer.

In [5]:
import re


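def extract_number(s):
    """Return the first number found in s as a float, or None if there is none."""
    # Remove backticks the model may wrap around the answer
    s = s.replace("`", "")
    # Optional minus and dollar signs, then digits with optional thousands
    # separators and decimals. Requiring a leading digit stops a bare comma
    # in the text from matching and crashing float().
    match = re.search(r"(-?\$?\d[\d,]*(?:\.\d+)?)", s)
    if match:
        return float(match.group(1).replace(",", "").replace("$", ""))
    return None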

This lists the available models in ai_models so it is simple to swap between them by index. To use deepseek-chat, make sure you change the base_url (and API key) of the client in the first cell.

In [None]:
ai_models = [
    "deepseek-chat", # 0
    "gpt-3.5-turbo", # 1
    "gpt-4",         # 2
    "o1-mini",       # 3
    "gpt-4o",        # 4
    "gpt-4o-mini",   # 5
    "o3-mini",       # 6
    "o1",            # 7
    "o1-preview",    # 8
]

This collects the responses from one of the models above. The requests are submitted to a thread pool executor so that they run in parallel, which speeds things up.

Temperature - controls the variability of a response. It ranges from 0 to 2, with 0 being the least random and 2 the most.

Max completion tokens - sets an upper bound on the number of tokens a completion can use. As a rule of thumb, each token is around 3/4 of an English word, so 2000 tokens is roughly 1500 words.

Reasoning effort - can be low, medium, or high, and specifies how hard a reasoning model will try. If set to low, responses will be faster and use fewer tokens, but they will also involve less reasoning. Only available on the o1 and o3-mini models.

In [None]:
temperature = 1
max_completion_tokens = 2000
reasoning_effort = ["low", "medium", "high"]


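def get_response(message):
    # message is a (conversation, index) tuple built in the cell above
    response = client.chat.completions.create(
        model=ai_models[6],
        temperature=temperature,
        max_completion_tokens=max_completion_tokens,
        messages=message[0],
        # reasoning_effort=reasoning_effort[1]
    )
    # Return the index with the text so each answer stays tied to its question
    return (response.choices[0].message.content, message[1])


# Submit every request to the thread pool, then wait for all of them
futures = [executor.submit(get_response, message) for message in messages]
responses_parallel = [future.result() for future in futures]
# Sort by question index so the order matches messages_list
responses_parallel = sorted(responses_parallel, key=lambda x: x[1])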
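
Because reasoning effort applies only to some models, the request arguments can be assembled per model instead of commenting a line in and out. This is a sketch under that assumption: `build_request_kwargs` and `REASONING_EFFORT_MODELS` are hypothetical names, and the set of models is taken from the note above rather than checked against the API reference.

```python
# Assumption from the note above: only these models accept reasoning_effort.
REASONING_EFFORT_MODELS = {"o1", "o3-mini"}


def build_request_kwargs(model, messages, temperature=1,
                         max_completion_tokens=2000, reasoning_effort=None):
    """Assemble keyword arguments for one chat completion request."""
    kwargs = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_completion_tokens": max_completion_tokens,
    }
    # Only add reasoning_effort when it was requested and the model is in the set
    if reasoning_effort is not None and model in REASONING_EFFORT_MODELS:
        kwargs["reasoning_effort"] = reasoning_effort
    return kwargs
```

Inside `get_response` the call would then read `client.chat.completions.create(**build_request_kwargs(ai_models[6], message[0], reasoning_effort="medium"))`.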

This cleans the responses from above and prints each one next to the expected answer.

In [17]:
cleaned_responses = [extract_number(text) for text, _ in responses_parallel]

for i, (response, message) in enumerate(zip(cleaned_responses, messages_list)):
    print(f"Question {i}:")
    print(f"Response: {response}")
    print(f"Expected: {message['answer']}")
    print()

Question 0:
Response: 336.38
Expected: 336.37

Question 1:
Response: 19.77
Expected: 19.77

Question 2:
Response: 88.81
Expected: 88.85

Question 3:
Response: 1638.62
Expected: 1637.97

Question 4:
Response: 1713.36
Expected: 1713.37

Question 5:
Response: 44289.0
Expected: 44289.03

Question 6:
Response: 171867.0
Expected: 171836

Question 7:
Response: -1957.06
Expected: -1954.91

Question 8:
Response: 15.0
Expected: 15

Question 9:
Response: 909.09
Expected: 909.09

Question 10:
Response: 64843.56
Expected: 64843.56

Question 11:
Response: 15.9374246
Expected: 15.94

Question 12:
Response: 1869.16
Expected: 1869.07

Question 13:
Response: 11960.0
Expected: 11969.6

Question 14:
Response: 388000.0
Expected: 388075.82

Question 15:
Response: 8.51
Expected: 8.51

Question 16:
Response: 570457.0
Expected: 570455.96

Question 17:
Response: 75.13
Expected: 74.7

Question 18:
Response: 123542.0
Expected: 123592.35

Question 19:
Response: 5148.19
Expected: 5122.28
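
Several responses above differ from the expected answer only by rounding, so an exact equality check would be too strict. One way to score them is a relative tolerance, sketched below. The `matches` helper and the 0.1% tolerance are assumptions for illustration, not something the notebook defines.

```python
import math


def matches(pred, expected, rel_tol=1e-3):
    """Return True when pred is within rel_tol (relative) of expected."""
    if pred is None:
        # extract_number found no number in the response
        return False
    return math.isclose(pred, expected, rel_tol=rel_tol)


# Questions 0 and 19 from the output above: (response, expected)
pairs = [(336.38, 336.37), (5148.19, 5122.28)]
flags = [matches(pred, expected) for pred, expected in pairs]
print(flags)  # [True, False]
```

Question 0 is off by a cent and passes, while Question 19 is off by about 0.5% and fails.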



This creates two assistants: one with OpenAI's code interpreter and one without. The instructions are similar for the two, but the first tells the assistant to use Python scripts to solve the problem. Both assistants are told to output only the answer they get, which makes data collection easier. The Assistants feature is only available on the gpt-4o and gpt-4o-mini models; gpt-4o-mini (ai_models[5]) is used here.

In [22]:
assistant_instructions = "You are trying to help people that are not very knowledgeable about finance answer questions about their mortgage. Use python scripts to solve for things like the time value of money. Output only the answer you get after running the python script. Please make sure the only output you have is a number"
assistant_name = "Finance Advisor With Python"

my_assistant = client.beta.assistants.create(
    instructions=assistant_instructions,
    name=assistant_name,
    tools=[{"type": "code_interpreter"}],
    model=ai_models[5],
)

assistant_instructions_without_python = "You are trying to help people that are not very knowledgeable about finance answer questions about their mortgage. Output only the answer you get. Please make sure the only output you have is a number"
assistant_name_without_python = "Finance Advisor No Python"

my_assistant_without_python = client.beta.assistants.create(
    instructions=assistant_instructions_without_python,
    name=assistant_name_without_python,
    model=ai_models[5],
)

The function below sends a single message to an assistant. It creates an OpenAI thread, adds the message to it, runs the assistant on the thread, and returns the answer text together with the message index once the run has completed.

In [23]:
def process_message(idx, msg, assistant):
    thread = client.beta.threads.create()

    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=msg["content"],
    )

    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id,
        assistant_id=assistant.id,
    )

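    if run.status != "completed":
        # Fail loudly instead of silently returning None, which would break
        # the tuple unpacking in the cells below
        raise RuntimeError(f"Run for message {idx} ended with status {run.status!r}")

    # Messages are listed newest first by default, so data[0] is the assistant's reply
    messages_assistant = client.beta.threads.messages.list(thread_id=thread.id)
    answer_text = messages_assistant.data[0].content[0].text.value
    return (answer_text, idx)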

This generates responses from the assistant that can write and run Python, then cleans them so that only the numeric answer remains.

In [24]:
futures = [
    executor.submit(process_message, idx, msg, my_assistant)
    for idx, msg in enumerate(messages_list)
]
responses_assistant = [future.result() for future in futures]

cleaned_responses_python = [
    round(extract_number(text), 2) for text, idx in responses_assistant
]

This generates the responses for the assistant without Python, using the same logic as above.

In [25]:
futures = [
    executor.submit(process_message, idx, msg, my_assistant_without_python)
    for idx, msg in enumerate(messages_list)
]
responses_assistant_without_python = [future.result() for future in futures]

cleaned_responses_without_python = [
    round(extract_number(text), 2) for text, idx in responses_assistant_without_python
]
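
The two cells above repeat the same submit, collect, and clean steps. A shared helper along the following lines could cover both. This is a sketch with hypothetical names (`run_and_clean`, `worker`, `extractor`), and it returns None for a response with no number instead of letting `round` fail.

```python
from concurrent.futures import ThreadPoolExecutor


def run_and_clean(worker, items, extractor, max_workers=8):
    """Run worker(idx, item) in parallel; return cleaned values in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, idx, item) for idx, item in enumerate(items)]
        results = [future.result() for future in futures]
    # Each result is a (text, idx) pair; sort by idx to match the input order
    results.sort(key=lambda pair: pair[1])
    cleaned = []
    for text, _ in results:
        value = extractor(text) if text is not None else None
        cleaned.append(round(value, 2) if value is not None else None)
    return cleaned
```

In the notebook this could be called as `run_and_clean(lambda i, m: process_message(i, m, my_assistant), messages_list, extract_number)`, and again with `my_assistant_without_python`.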

In [27]:
print("Responses no python    ", [f"{x:12,.2f}" for x in cleaned_responses_without_python])
print("Responses using python ", [f"{x:12,.2f}" for x in cleaned_responses_python])
print("Answers                ", [f"{x['answer']:12,.2f}" for x in messages_list])

Responses no python     ['      335.48', '       19.16', '       76.66', '    1,500.94', '    1,391.23', '   44,676.34', '  155,654.00', '      701.28', '       12.00', '      909.09', '   64,150.14', '       15.93', '    1,360.49', '   20,117.21', '  386,195.34', '        9.64', '  749,688.93', '      102.77', '  196,242.21', '      727.29']
Responses using python  ['      336.37', '       19.77', '       88.85', '    1,637.97', '    1,713.37', '   44,289.03', '  171,835.94', '   -1,954.91', '       15.00', '      909.09', '   64,843.56', '       15.94', '    1,869.07', '   11,969.60', '  388,075.83', '        8.51', '  570,455.96', '       74.70', '  123,592.35', '    5,122.28']
Answers                 ['      336.37', '       19.77', '       88.85', '    1,637.97', '    1,713.37', '   44,289.03', '  171,836.00', '   -1,954.91', '       15.00', '      909.09', '   64,843.56', '       15.94', '    1,869.07', '   11,969.60', '  388,075.82', '        8.51', '  570,455.96', '       74.70

This calculates the percent error with and without Python by comparing each response to the correct answer.

In [31]:
def percent_error(pred, actual):
    if actual != 0:
        return abs(pred - actual) / abs(actual) * 100
    return float("nan")


percent_errors_python = [
    percent_error(pred, msg['answer'])
    for pred, msg in zip(cleaned_responses_python, messages_list)
]
percent_errors_without_python = [
    percent_error(pred, msg['answer'])
    for pred, msg in zip(cleaned_responses_without_python, messages_list)
]

print(
    f"mean error without python: {round(sum(percent_errors_without_python) / len(percent_errors_without_python), 2)}%"
)
print(
    f"mean error with python: {round(sum(percent_errors_python) / len(percent_errors_python), 2)}%"
)

mean error without python: 26.71%
mean error with python: 0.0%
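
For reference, the quantity computed by `percent_error` can be written out and checked by hand on one entry. The worked case uses Question 2 without Python from the lists printed earlier (response 76.66, answer 88.85).

```latex
\[
\text{percent error}
  = \frac{\lvert \text{pred} - \text{actual} \rvert}{\lvert \text{actual} \rvert} \times 100
\]
% Question 2 without Python: pred = 76.66, actual = 88.85
\[
\frac{\lvert 76.66 - 88.85 \rvert}{\lvert 88.85 \rvert} \times 100
  = \frac{12.19}{88.85} \times 100
  \approx 13.72
\]
```

The reported mean is the average of this value over all 20 questions, so a few large misses can dominate it.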
