In [1]:
# reload imports.
%load_ext autoreload
%autoreload 2

### Get a sample size

We can't test on every problem, so let's find an acceptable sample size and isolate a subset.
We'll pull from geometry: their solution had the most trouble with geometry, and a lot of variance on it.

In [2]:
import os
import json

# Geometry test problems
folder_path = "./MATH/test/geometry/"
json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]

# Store each in list
json_objects = []
for file in json_files:
    file_path = os.path.join(folder_path, file)
    with open(file_path, 'r') as f:
        json_data = json.load(f)
        json_data["file_path"]=file_path # Add file path so we can keep track of them easily
        json_objects.append(json_data)

len(json_objects)

479

Let's only do the hard ones (Level 5) - these are the ones they test in the paper.

In [3]:
filtered_json_objects = [obj for obj in json_objects if obj.get('level') == 'Level 5']
print(filtered_json_objects[:1])
print(len(filtered_json_objects))


[{'problem': "A solid $5\\times 5\\times 5$ cube is composed of unit cubes. Each face of the large, solid cube is partially painted with gray paint, as shown. [asy]\n\nfill((0,0)--(0,1)--(1,1)--(1,0)--cycle,gray);\n\nfill((0,4)--(0,5)--(1,5)--(1,4)--cycle,gray);\n\nfill((4,1)--(5,1)--(5,0)--(4,0)--cycle,gray);\n\nfill((1,2)--(2,2)--(2,1)--(1,1)--cycle,gray);\n\nfill((2,2)--(3,2)--(3,1)--(2,1)--cycle,gray);\n\nfill((3,2)--(4,2)--(4,1)--(3,1)--cycle,gray);\n\nfill((1,3)--(2,3)--(2,2)--(1,2)--cycle,gray);\n\nfill((3,3)--(4,3)--(4,2)--(3,2)--cycle,gray);\n\nfill((1,4)--(2,4)--(2,3)--(1,3)--cycle,gray);\n\nfill((2,4)--(3,4)--(3,3)--(2,3)--cycle,gray);\n\nfill((3,4)--(4,4)--(4,3)--(3,3)--cycle,gray);\n\nfill((4,5)--(5,5)--(5,4)--(4,4)--cycle,gray);\n\ndraw((0,0)--(0,1)--(1,1)--(1,0)--(0,0),rgb(0,0,0));\n\ndraw((0,1)--(0,2)--(1,2)--(1,1),rgb(0,0,0));\n\ndraw((0,2)--(0,3)--(1,3)--(1,2),rgb(0,0,0));\n\ndraw((0,3)--(0,4)--(1,4)--(1,3),rgb(0,0,0));\n\ndraw((0,4)--(0,5)--(1,5)--(1,4),rgb(0,0,0));\

There are 132. How many of these do we need to pick at random to get a reasonable estimate?

We can use the sample size formula for a finite population.
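
For reference, here is the calculation that the next cell implements, written with the same symbols as the code ($N$ is the population size, $\hat{p}$ the estimated proportion, $Z$ the z-score, $E$ the margin of error). This is my transcription of the code, not a quote from the linked page:

$$
m = \frac{Z^2\,\hat{p}\,(1-\hat{p})}{E^2}, \qquad n = \frac{m}{1 + \dfrac{m-1}{N}}
$$

With $N = 132$, $\hat{p} = 0.5$, $Z = 1.96$, $E = 0.10$ (the values used below): $m = 3.8416 \cdot 0.25 / 0.01 = 96.04$ and $n = 96.04 / (1 + 95.04/132) = 96.04 / 1.72 \approx 55.84$, which rounds up to 56.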



In [4]:
import math


# Copied from equation here: https://online.stat.psu.edu/stat415/lesson/6/6.3
def calculate_sample_size(N, p_hat, Z, E):
    m = (Z**2 * p_hat * (1 - p_hat)) / (E**2)
    n = m / (1 + ((m - 1) / N))

    return math.ceil(n)

N = len(filtered_json_objects) # 132 for geometry
p_hat = 0.5      # Estimated proportion - unknown so use .50
Z = 1.96         # Z-score for 95% confidence level
E = 0.10         # Desired margin of error 

# Calculate the required sample size
required_sample_size = calculate_sample_size(N, p_hat, Z, E)
print(f"Required sample size: {required_sample_size}")


Required sample size: 56
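
To get a feel for how sensitive this number is to the margin of error, here is a small sketch that reuses the same function over a few values of `E`. The list of margins is my own choice for illustration; everything else matches the cell above.

```python
import math


def calculate_sample_size(N, p_hat, Z, E):
    """Finite-population sample size for estimating a proportion."""
    m = (Z**2 * p_hat * (1 - p_hat)) / (E**2)
    n = m / (1 + ((m - 1) / N))
    return math.ceil(n)


N = 132                             # Level 5 geometry problems
margins = [0.05, 0.10, 0.15, 0.20]  # illustrative margins of error (assumption)

# Map each margin of error to the sample size it requires
sizes = {E: calculate_sample_size(N, 0.5, 1.96, E) for E in margins}
for E, n in sizes.items():
    print(f"E = {E:.2f} -> n = {n}")
```

A 5% margin would need 99 of the 132 problems, which is close to running everything, while 15% and 20% would need 33 and 21.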


Even at a 10% margin of error we would still need to run our test 56 times. That'll cost:

In [5]:
# 20-35 cents per run
print(f"${required_sample_size * ((.20+.35)/2):.2f}")

$15.40
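
The figure above uses the midpoint of the 20-35 cent range. As a quick sketch, the same arithmetic at both ends of the range gives the bounds (the variable names here are mine):

```python
runs = 56                  # required sample size from above
low, high = 0.20, 0.35     # estimated cost per run in dollars

cost_low = f"{runs * low:.2f}"
cost_high = f"{runs * high:.2f}"
print(f"${cost_low} to ${cost_high}")
```

So a full pass should land somewhere between $11.20 and $19.60 if the per-run estimate holds.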


Such is the price we pay for progress. 

Let's pick 56 at random from `filtered_json_objects`, and save the sample.

In [6]:
import random

random.seed(7) # lucky number 7

sampled = random.sample(filtered_json_objects, required_sample_size)

print(len(sampled))
print(sampled[0]["file_path"])

56
./MATH/test/geometry/393.json
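
The cell above draws the sample but does not write it anywhere. A minimal sketch of saving it, assuming it is enough to store the file paths and reload the problems from disk later. The helper names and the output filename are hypothetical, and the demo uses a stand-in list rather than the real `sampled`.

```python
import json
import os
import tempfile


def save_sample_paths(sampled, out_path):
    """Write the file paths of the sampled problems to a JSON file."""
    paths = [obj["file_path"] for obj in sampled]
    with open(out_path, "w") as f:
        json.dump(paths, f, indent=2)
    return paths


def load_sample_paths(out_path):
    """Read back the list of file paths written by save_sample_paths."""
    with open(out_path) as f:
        return json.load(f)


# Stand-in for `sampled`; only the "file_path" key is needed here
demo = [{"file_path": "./MATH/test/geometry/393.json"}]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sampled_paths.json")  # hypothetical filename
    save_sample_paths(demo, out_path)
    restored = load_sample_paths(out_path)
print(restored)
```

Storing paths instead of full objects keeps the file small, and the fixed `random.seed(7)` already makes the draw reproducible as long as the directory listing order doesn't change.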


4 objects at random

### Experiment 1

Let's take 5 problems for testing and do 3 runs over them. This way we can compare the variance between runs.
We're going to log the results and then verify them manually.

OK, what do I want to do now? I want a place to save the logs of each run, and then to run on each one of these problems. The name of each log should include the problem's file path.

I'll have to change my logging code to log to somewhere specific. hmm...
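
One way to get the file path into the log name, as a rough sketch: derive a short label from the problem's path and pass that to the logger. `log_path_for` is a hypothetical helper, not part of the existing logging code, and the `run-N/...` layout is only an assumption about how the logs could be organized.

```python
import os


def log_path_for(run_index, file_path):
    """Build a log path like 'run-0/geometry-393' from a problem's file path."""
    subject = os.path.basename(os.path.dirname(file_path))          # e.g. 'geometry'
    problem_id = os.path.splitext(os.path.basename(file_path))[0]   # e.g. '393'
    return f"run-{run_index}/{subject}-{problem_id}"


example = log_path_for(0, "./MATH/test/geometry/393.json")
print(example)
```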

Then I want to try running it as it is now.
Then I want to run it with only the code assistant as the executor. Maybe this will help reduce costs and not be that much more :)
I can also try running with gpt-4o-mini on some of the simple tasks, e.g. summarizing.

Then I can see which ones it's failing on. And see how they're both doing. And then we can go problem by problem and see what tweaks can fix things
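
A sketch of how the per-problem comparison could be tallied once each run has been marked correct or incorrect. `summarize_runs` is a hypothetical helper, and the `demo` values are made up purely to show the shape of the data; they are not results from this experiment.

```python
def summarize_runs(results):
    """results: list of runs, each a list of booleans (one entry per problem)."""
    n_runs = len(results)
    summary = []
    for j in range(len(results[0])):
        outcomes = [run[j] for run in results]
        passes = sum(outcomes)
        summary.append({
            "problem": j,
            "pass_rate": passes / n_runs,
            # Consistent means every run agreed (all correct or all incorrect)
            "consistent": passes in (0, n_runs),
        })
    return summary


# Made-up outcomes for 3 runs of 5 problems (illustration only)
demo = [
    [True, False, True, False, True],
    [True, False, False, False, True],
    [True, True, True, False, True],
]
summary = summarize_runs(demo)
for row in summary:
    print(row)
```

Problems that fail every run point at a systematic issue, while inconsistent ones are where run-to-run variance shows up.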


In [7]:
import random

random.seed(10)
samples = random.sample(filtered_json_objects, 5)
samples[0].keys()

dict_keys(['problem', 'level', 'type', 'solution', 'file_path'])

#### Main 1

In [8]:
%%script echo skipping
# Unfortunately this clears the output, but used to avoid running this (40 minute) process each time


from utils.custom_logger import CustomLogger
from utils.create_assistant import create_agents_and_thread
from chains.main1.main import main

# Use the same agents and threads to (ideally) limit code sessions
coding_assistant, coding_thread = create_agents_and_thread()

# Do 3 runs
for i in range(3):
    # Do 5 problems
    for j, problem in enumerate(samples):
        CustomLogger.update_path(f"run-{i}/problem-{j}")
        CustomLogger.default_log("Problem File Path", problem["file_path"])

        max_times_mining_new = 1  # The upper limit of the mining times
        question = problem["problem"]
        main(question, max_times_mining_new, coding_assistant, coding_thread)


skipping


Observations:

New ideas:
- Simplify the process a LOT and go step by step instead of condition mining
- Seems like we fail on the steps being wrong - try checking these & improving the prompt to not calculate yet

#### Main 2

In [9]:
import re
from utils.custom_logger import CustomLogger
from utils.create_assistant import create_agents_and_thread
from chains.main2.main import main # THIS IS THE ONLY DIFFERENCE


# Use the same agents and threads to (ideally) limit code sessions
coding_assistant, coding_thread = create_agents_and_thread()

# One result string per problem; each run appends "correct " or "incorrect " to it
verify_arr = [f"Problem {i} -> " for i,_ in enumerate(samples)]
# 3 runs of 5 problems
for i in range(3):
    for j, problem in enumerate(samples):
        CustomLogger.update_path(f"run-{i}/problem-{j}")
        CustomLogger.default_log("Problem File Path", problem["file_path"])

        max_times_mining_new = 1  # The upper limit of the mining times
        question = problem["problem"]
        our_answer = main(question, max_times_mining_new, coding_assistant, coding_thread)


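        # Validate against the reference answer inside \boxed{...} in the solution.
        # TODO: fix the capture group - [^}]* stops at the first '}', so nested
        # answers such as \boxed{\frac{1}{2}} get truncated.
        match = re.search(r'\\boxed{([^}]*)}', problem["solution"])
        actual_answer = match.group(1) if match else None  # avoid AttributeError on no match
        if actual_answer:
            CustomLogger.default_log("Actual Answer", actual_answer)

        # Naive check (assumption: str(our_answer) contains the reference answer
        # verbatim when correct); results still need manual verification.
        if actual_answer and actual_answer in str(our_answer):
            CustomLogger.default_log("Correct")
            verify_arr[j] += "correct "
        else:
            CustomLogger.default_log("Incorrect")
            verify_arr[j] += "incorrect "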
            

# Log verify arr -> validation
CustomLogger.update_path("validation")
CustomLogger.default_log("Results", *verify_arr)
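
On the capture group: the pattern `\\boxed{([^}]*)}` stops at the first closing brace, so an answer like `\boxed{\frac{1}{2}}` comes back truncated. One possible fix, sketched here under the assumption that the final answer is the last `\boxed{...}` in the solution, is to match braces by counting depth instead of using a regex. `extract_boxed` is a hypothetical helper, not something in the existing code.

```python
def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None

    i = start + len(marker)
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None        # unbalanced braces


nested = extract_boxed(r"so the area is $\boxed{\frac{1}{2}}$.")
simple = extract_boxed(r"The answer is $\boxed{36}$ square units.")
print(nested, simple)
```

Even with correct extraction, comparing against the model's answer still needs some normalization (for example `187.06` against an exact form), so manual verification stays useful.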


        



Extracting conditions and objective(s) from problem..
Mining new conditions from existing (1/1)
Verifying condition #1..
based_on_known_conditions=[1, 2, 3] new_condition='The area of equilateral triangle ABC is 6 * (5 + 6 + 7) / 3 = 36 square units.' reason='The area of the triangle can be determined using the relationship of the altitudes. For an equilateral triangle, if we denote the altitudes from an interior point to each side as h1, h2, and h3, the area A of the triangle satisfies the equation A = (2 / √3) * (h1 + h2 + h3). Using the given altitudes, A = (2 / √3) * 18. Simplifying gives us an approximate area of 36 square units based on the ratio method for altitudes.'

Checking if we have the answer..
True
thinker is thinking steps...
Executor is trying to calculate the answer...
The final answer is 187.06
Extracting conditions and objective(s) from problem..
Mining new conditions from existing (1/1)
Verifying condition #1..
based_on_known_conditions=[1] new_condition='x^2 + y^2