# AIME 2025 benchmark on Claude & DeepSeek

## High-level steps

1. Plan the evaluation

Most scenarios call for different metrics. Which metric do you need to evaluate? Is a ready-to-use public dataset available, or a domain-specific one?
Which scoring method do you want to use, e.g. pair-wise, point-wise, or list-wise evaluation? Is the score binary, numeric, or categorical?

For this scenario, we evaluate Claude 3.5, Nova Lite, Nova Pro, and DeepSeek R1 on the AIME 2025 dataset.

2. Load the dataset

You can load public or private datasets from the Hugging Face Hub or from local files. In this lab, the AIME 2025 dataset has already been downloaded from Hugging Face.

3. Call the model

In this lab, we use the LiteLLM Python SDK to call models. The LLM judge uses Bedrock Claude 3.7 Sonnet by default.

4. Evaluate the result

Accuracy = number of correct answers / number of questions. **It's better to average pass@1 performance across multiple runs for each model.** In this lab, we usually run each model only once.
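As a minimal sketch of the averaging idea, the snippet below computes pass@1 per run and then the mean and sample standard deviation across runs. The helper names `pass_at_1` and `mean_pass_at_1` and the sample outcomes are hypothetical and not part of this lab's code.

```python
from statistics import mean, stdev

def pass_at_1(run_results: list[bool]) -> float:
    """Fraction of questions answered correctly in a single run."""
    return sum(run_results) / len(run_results)

def mean_pass_at_1(runs: list[list[bool]]) -> tuple[float, float]:
    """Average pass@1 over several runs, plus the sample standard deviation."""
    scores = [pass_at_1(run) for run in runs]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Hypothetical outcomes for three runs over the same four questions.
runs = [
    [True, False, True, False],
    [True, True, True, False],
    [False, False, True, False],
]
avg, spread = mean_pass_at_1(runs)
print(f"mean pass@1 = {avg:.2%} (std {spread:.2%} over {len(runs)} runs)")
```

With only 30 questions per run, a single question changes the accuracy by more than three percentage points, so reporting the spread alongside the mean helps when comparing models.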

## AIME 2025 dataset

The American Invitational Mathematics Examination (AIME) is a prestigious, invite-only mathematics competition for high-school students who perform in the top 5% of the AMC 12 mathematics exam. It involves 15 questions of increasing difficulty, with the answer to every question being a single integer from 0 to 999. The median score is historically between 4 and 6 questions correct (out of the 15 possible). Two versions of the test are given every year (thirty questions total). You can view the questions from previous years on the [AIME website](https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions).

This examination serves as a crucial gateway for students aiming to qualify for the **USA Mathematical Olympiad (USAMO)**. In general, the test is extremely challenging, and covers a wide range of mathematical topics, including algebra, geometry, and number theory.

## Load the dataset

In [2]:
from datasets import load_dataset

In [3]:
# Load and preprocess dataset
def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("datasets")["train"] # no test set here 
    return [{"text": row["question"], "label": row["answer"]} for row in dataset]


In [4]:
ds = load_ds()

Loading AIME dataset...


In [5]:
from IPython.display import Markdown
import random

i = random.randint(0, len(ds) - 1)  # pick a random question index
Markdown(f"**Question:** {ds[i]['text']}  \n\n**Answer:** {ds[i]['label']}")



**Question:** Let $N$ denote the number of ordered triples of positive integers $(a,b,c)$ such that $a,b,c\leq3^6$ and $a^3+b^3+c^3$ is a multiple of $3^7$. Find the remainder when $N$ is divided by $1000$.  

**Answer:** 735

## Call the model

In [26]:
from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = "sk-12341234"
os.environ["OPENAI_BASE_URL"] = "http://localhost:4000" 

def model_call(model_id, prompt) -> str:
    response = completion(
        model=model_id,
        max_tokens=1024*8,
        messages=[{ "content": prompt,"role": "user"}]
    )
    return response["choices"][0]["message"]["content"]
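The judge class later in this notebook checks for a `None` response, but `model_call` as written raises an exception on failure rather than returning `None`. One possible way to bridge that gap is a small retry wrapper. This is a sketch under assumptions: `call_with_retry` and the `flaky` stand-in are hypothetical names, and the wrapper takes any zero-argument callable so it does not depend on LiteLLM internals.

```python
import time
from typing import Callable, Optional

def call_with_retry(call: Callable[[], str], retries: int = 3,
                    base_delay: float = 1.0) -> Optional[str]:
    """Run `call`, retrying on any exception; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return call()
        except Exception as exc:  # broad on purpose: provider errors vary
            print(f"attempt {attempt + 1}/{retries} failed: {exc}")
            if attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return None

# Stand-in for a model call that fails twice before succeeding.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = call_with_retry(flaky, retries=3, base_delay=0.0)
print(result)
```

In this notebook it could be used as `call_with_retry(lambda: model_call(model_id, prompt))`, so that a failed request yields `None` instead of aborting a 30-question run.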

In [16]:
model_call("openai/claude-3.7-sonnet", "who are you?")

"I'm Claude, an AI assistant created by Anthropic to be helpful, harmless, and honest. I can help with a variety of tasks like answering questions, having conversations, writing content, analyzing information, and providing thoughtful responses. I don't have a personal identity like humans do - I don't have personal experiences, emotions, or consciousness. I'm designed to be a helpful tool for conversation and information. How can I assist you today?"

In [27]:
model_call("bedrock/us.amazon.nova-lite-v1:0", "who are you?")

"Hello! I'm an AI system built by a team of inventors at Amazon. I'm here to help you with any questions or tasks you might have. If you need assistance, feel free to ask!"

In [None]:
import json

class LLM_judger:
    def __init__(self):
        """Initialize the LLM_judger class."""
        pass
    
    def __call__(self, label: str, model_output: str) -> dict:
        """Score the model's output by comparing it with the ground truth."""
        output_len = min(200, len(model_output))
        query = f"""
        YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER 
        I WILL GIVE YOU THE LAST 200 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER -> 

        Model's Answer (last 200 chars): {str(model_output)[-output_len:]}
        Correct Answer: {label}
        
        Your task:
        1. State the model's predicted answer (answer only).
        2. State the ground truth (answer only).
        3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
           ```json
           {{ "correctness": true/false }}
           ```
        """
        
        # claude 3.7 sonnet as the default judge model
        response = model_call("openai/claude-3.7-sonnet", query)
        # response = model_call("bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0", query)
        
        if response is None:
            return {"correctness": False, "reasoning": "Inference failed."}
        
        try:
            # Extract the correctness JSON object from the response
            json_start = response.index("```json") + 7
            json_end = response.index("```", json_start)
            correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
        except (ValueError, IndexError, AttributeError):
            # Missing delimiters, invalid JSON, or a JSON payload that is not an object
            correctness = False

        print("correctness: ", correctness)
        return {"correctness": correctness, "reasoning": response}
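The parsing step above can also be written as a standalone function that separates a clear verdict from an unusable one. The sketch below is an alternative, not the notebook's implementation: `parse_verdict` is a hypothetical helper that returns `True` or `False` only when the judge's JSON block holds a real boolean, and `None` otherwise.

```python
import json
import re
from typing import Optional

# Matches a json-tagged block delimited by three backticks on each side.
_JSON_BLOCK = re.compile(r"`{3}json\s*(.*?)`{3}", re.DOTALL)

def parse_verdict(response: Optional[str]) -> Optional[bool]:
    """Return the judge's boolean verdict, or None if it is missing or malformed."""
    match = _JSON_BLOCK.search(response or "")
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    value = payload.get("correctness") if isinstance(payload, dict) else None
    return value if isinstance(value, bool) else None

# Hypothetical judge responses, built without literal triple backticks.
fence = "`" * 3
good = "Predicted: 70\nTruth: 70\n" + fence + 'json\n{ "correctness": true }\n' + fence
null = fence + 'json\n{ "correctness": null }\n' + fence
print(parse_verdict(good), parse_verdict(null), parse_verdict("no JSON here"))
```

Keeping `None` distinct from `False` makes it possible to re-run or manually inspect the unparsed cases instead of silently counting them as wrong.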


In [18]:
def evaluator(model_id: str, dataset: list[dict], scorers: list[callable]) -> list[dict]:
    """Evaluate the dataset using the provided scorers; returns one result dict per (item, scorer) pair."""
    results = []
    for item in dataset:
        text = item["text"]
        label = item["label"]
        model_output = model_call(model_id, text)
        
        for scorer in scorers:
            score = scorer(label, model_output)
            results.append({
                "model": model_id,
                "text": text,
                "label": label,
                "model_output": model_output,
                "score": score
            })
    
    return results
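Because `evaluator` accepts any callable that takes `(label, model_output)`, a rule-based scorer can sit alongside the LLM judge. Since every AIME answer is an integer from 0 to 999, the sketch below compares the last such integer in the output with the label. `ExactMatchScorer` is a hypothetical class, and the extraction rule (prefer `\boxed{...}`, else the last standalone 1 to 3 digit number) is a heuristic assumption that will misread some outputs.

```python
import re

class ExactMatchScorer:
    """Heuristic scorer: compares the last 0-999 integer in the output with the label."""

    def __call__(self, label: str, model_output: str) -> dict:
        text = model_output or ""
        # Prefer an explicit \boxed{...} answer; otherwise take standalone 1-3 digit numbers.
        boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
        candidates = boxed or re.findall(r"(?<!\d)\d{1,3}(?!\d)", text)
        predicted = int(candidates[-1]) if candidates else None
        correct = predicted is not None and predicted == int(label)
        return {"correctness": correct,
                "reasoning": f"predicted={predicted}, label={label}"}

scorer = ExactMatchScorer()
print(scorer("70", "So the sum is 21 + 49 = 70."))
print(scorer("70", r"Therefore the answer is \boxed{70}, as shown in step 3."))
print(scorer("735", "I could not finish the computation."))
```

It could be passed as `evaluator(model_id, ds, [LLM_judger(), ExactMatchScorer()])`, which yields two result rows per question and makes disagreements between the two scorers easy to spot.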

## Evaluate the result

In [None]:
model_id = "openai/claude-3.7-sonnet"
# model_id = "bedrock/us.amazon.nova-lite-v1:0"  # Uncomment to eval Nova Lite
# model_id = "bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0"  # Uncomment to eval Claude 3.5 Sonnet
eval = evaluator(model_id, ds, [LLM_judger()])  # Using the LLM_judger as a scorer

In [93]:
eval

[{'model': 'openai/claude-3.5-sonnet',
  'text': 'Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.',
  'label': '70',
  'model_output': "Let's solve this step by step:\n\n1) First, let's understand what $17_b$ and $97_b$ mean in base $b$\n   * $17_b = 1×b + 7 = b + 7$\n   * $97_b = 9×b + 7 = 9b + 7$\n\n2) We need to find when $(b+7)$ divides $(9b+7)$\n\n3) When $a$ divides $b$, we can write $b = ka$ for some integer $k$\n   * So, $9b + 7 = k(b + 7)$ for some integer $k$\n   * $9b + 7 = kb + 7k$\n   * $9b - kb = 7k - 7$\n   * $b(9-k) = 7(k-1)$\n   * $b = \\frac{7(k-1)}{9-k}$\n\n4) Since $b$ must be an integer, $9-k$ must divide $7(k-1)$\n\n5) Since we want $b > 9$, let's examine the fraction:\n   * $b = \\frac{7(k-1)}{9-k} > 9$\n   * If denominator is positive: $7(k-1) > 9(9-k)$\n   * If denominator is negative: $7(k-1) < 9(9-k)$\n\n6) Let's solve: $7(k-1) > 9(9-k)$\n   * $7k - 7 > 81 - 9k$\n   * $16k > 88$\n   * $k > 5.5$\n\n7) For denominator to be 

In [None]:
# Calculate the percentage of 'correctness' being True in the evaluation results.
correct_count = sum(1 for item in eval if item['score'].get('correctness', False))
accuracy = correct_count / len(eval)
print(f"Accuracy: {accuracy:.2%} ({correct_count}/{len(eval)})")

In [22]:
eval_deepseek = evaluator("openai/deepseek-r1", ds, [LLM_judger()])  # Using the LLM_judger as a scorer

correctness:  True
correctness:  False
correctness:  True
correctness:  True
correctness:  True
correctness:  True
correctness:  True
correctness:  True
correctness:  True
correctness:  None
correctness:  True
correctness:  True
correctness:  None
correctness:  False
correctness:  False
correctness:  True
correctness:  True
correctness:  None
correctness:  True
correctness:  False
correctness:  True
correctness:  True
correctness:  False
correctness:  True
correctness:  True
correctness:  True
correctness:  True
correctness:  None
correctness:  None
correctness:  False
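The log above contains 19 `True`, 6 `False`, and 5 `None` verdicts. A `None` most likely means the judge returned `null` for `correctness`, for example when the model's final answer was missing from the last 200 characters; that reading is an assumption. The accuracy calculation treats `None` as incorrect. A small tally such as the sketch below makes the three outcomes visible; `tally` and the sample data are hypothetical.

```python
from collections import Counter

def tally(results: list[dict]) -> Counter:
    """Count True / False / None verdicts in evaluator-style results."""
    return Counter(item["score"].get("correctness") for item in results)

# Hypothetical results in the same shape that evaluator() returns.
sample = [{"score": {"correctness": v}} for v in [True, False, None, True, None]]
counts = tally(sample)
print(f"correct={counts[True]} wrong={counts[False]} unparsed={counts[None]}")
```

Applied to the run above, this would be called as `tally(eval_deepseek)`.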


In [None]:
# Calculate the percentage of 'correctness' being True in the evaluation results.
correct_count = sum(1 for item in eval_deepseek if item['score'].get('correctness', False))
accuracy = correct_count / len(eval_deepseek)
print(f"Accuracy: {accuracy:.2%} ({correct_count}/{len(eval_deepseek)})")

## Summary

Reasoning models show stronger mathematical capability than non-reasoning models. The table below lists the AIME 2025 scores of common models.

| Model                      | AIME 2025 Score | Accuracy | Official benchmark |
| :------------------------: | :-------------: | :------: | :----------------: |
| Claude 3.5 Haiku           | 0/30            | 0.00%    |                    |
| Claude 3.5 Sonnet          | 2/30            | 6.67%    |                    |
| Claude 3.7 Sonnet          | 4/30            | 13.33%   |                    |
| Claude 3.7 Sonnet thinking |                 |          |                    |
| DeepSeek R1-0528           |                 |          |                    |
| DeepSeek R1                | 19/30           | 63.33%   | ~70%               |
| Nova Lite                  | 0/30            | 0.00%    |                    |
| Nova Pro                   |                 |          |                    |


[AIME 2025 Benchmark](https://www.vals.ai/benchmarks/aime-2025-05-30)

![AIME-2025-benchmark](images/aime-2025-benchmark.png)