# DABstep Benchmark Qwen2.5-Coder-32B Baseline

This notebook will guide you through submitting a Qwen2.5-Coder-32B baseline to the benchmark.

* Live 🤗 Leaderboard: https://huggingface.co/spaces/adyen/DABstep
* Benchmark 🤗 Dataset: https://huggingface.co/datasets/adyen/DABstep
* LLM Agent Framework by 🤗: https://github.com/huggingface/smolagents/tree/main



## Environment Setup

We need to set up:
* **HuggingFace Token:** To make free API calls to the HuggingFace Inference API you must have a HF account; the API verifies this by checking your account's token. This token will not be used for anything else.
* **Benchmark context files:** To solve the benchmark tasks, the agent needs to reference documentation and analyze data spread across multiple files, just like a real data analyst would.

### HuggingFace Token Setup
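
The token cell itself is not shown in this notebook. A minimal sketch of one way to supply it, assuming the token is stored in an `HF_TOKEN` environment variable (the helper name `get_hf_token` is hypothetical):

```python
import os


def get_hf_token(env=None) -> str:
    """Return the Hugging Face token from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "Set the HF_TOKEN environment variable to your Hugging Face access token."
        )
    return token


# Usage (not executed here): hand the token to the Hub client, for example
# from huggingface_hub import login
# login(token=get_hf_token())
```

Reading the token from the environment keeps it out of the notebook file, so the notebook can be shared without leaking credentials.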

In [17]:
import time
import json
import datasets
import pandas as pd
from smolagents import CodeAgent
from smolagents.agents import ActionStep
from smolagents.models import HfApiModel
from dabstep_benchmark.utils import evaluate

from utils import download_dataset

#### Download context files
First we download the context files from the [Benchmark's Dataset](https://huggingface.co/datasets/adyen/DABstep) so that our agent can access them.


In [19]:
download_dataset()

## Agent

Here we will set up a simple zero-shot prompt for the agent. It has two parts: the general prompt, followed by a quick outline of which files are available.
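
The cells below reference `CONTEXT_FILENAMES`, which is never defined in this notebook. A minimal sketch of how such a list could be built, assuming `download_dataset()` places the context files in a local directory (the `data/context` path and the helper name `list_context_files` are assumptions, not confirmed by the source):

```python
from pathlib import Path


def list_context_files(context_dir) -> list[str]:
    """Return the sorted paths of all regular files under context_dir."""
    root = Path(context_dir)
    if not root.is_dir():
        return []  # nothing downloaded yet
    return sorted(str(p) for p in root.rglob("*") if p.is_file())


# Hypothetical location; adjust to wherever download_dataset() stores the files.
CONTEXT_FILENAMES = list_context_files("data/context")
```

Passing explicit file paths in the prompt saves the agent from spending steps on discovering the directory layout.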

## Testing Agent

In [3]:
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"
MAX_STEPS = 7

agent = CodeAgent(
    tools=[],
    model=HfApiModel(MODEL_ID),
    additional_authorized_imports=["numpy", "pandas", "json", "csv", "os", "glob", "markdown"],
    max_steps=MAX_STEPS,
)
# give agent power to open files
agent.python_executor.send_tools({"open": open})

In [4]:
PROMPT = """You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.
You have these files available:
{context_files}
Don't forget to reference any documentation in the data dir before answering a question.

Here is the question you need to answer:
{question}

Here are the guidelines you must follow when answering the question above:
{guidelines}
"""
question = "What are the unique set of merchants in the payments data?"
guidelines = "Answer with a comma separated list"

PROMPT = PROMPT.format(
    context_files=CONTEXT_FILENAMES,
    question=question,
    guidelines=guidelines
)

answer = agent.run(PROMPT)


In [5]:
# You can inspect the steps taken by the agent by doing this
def clean_reasoning_trace(trace: list) -> list:
    for step in trace:
        # Remove memory from logs to make them more compact.
        if hasattr(step, "memory"):
            step.memory = None
        if isinstance(step, ActionStep):
            step.agent_memory = None
    return trace



for step in clean_reasoning_trace(agent.logs):
    print(step)

The 'logs' attribute is deprecated and will soon be removed. Please use 'self.memory.steps' instead.


SystemPromptStep(system_prompt='You are an expert assistant who can solve any task using code blobs. You will be given a task to solve as best you can.\nTo do so, you have been given access to a list of tools: these tools are basically Python functions which you can call with code.\nTo solve the task, you must plan forward to proceed in a series of steps, in a cycle of \'Thought:\', \'Code:\', and \'Observation:\' sequences.\n\nAt each step, in the \'Thought:\' sequence, you should first explain your reasoning towards solving the task and the tools that you want to use.\nThen in the \'Code:\' sequence, you should write the code in simple Python. The code sequence must end with \'<end_code>\' sequence.\nDuring each intermediate step, you can use \'print()\' to save whatever important information you will then need.\nThese print outputs will then appear in the \'Observation:\' field, which will be available as input for the next step.\nIn the end you have to return a final answer using t

## Warmup Agent

We encourage you to first get a well-performing agent on the smaller dev split of the benchmark, since running against the full benchmark can be very costly in both time and compute.

Once we are confident in our agent, we can create a submission to the leaderboard using the complete set of tasks.


In [6]:
def run_benchmark(dataset: datasets.Dataset, agent: CodeAgent) -> list[dict]:
  agent_answers = []
  for task in dataset:
    tid = task['task_id']

    prompt = PROMPT.format(
      context_files=CONTEXT_FILENAMES,
      question=task['question'],
      guidelines=task['guidelines']
    )

    answer = agent.run(prompt)

    task_answer = {
        "task_id": str(tid),
        "agent_answer": str(answer),
        "reasoning_trace": str(clean_reasoning_trace(agent.logs))
    }

    agent_answers.append(task_answer)

  return agent_answers

In [8]:
MAX_STEPS = 7
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

agent = CodeAgent(
    tools=[], # evaluating zero-shot capabilities of the model when writing code
    model=HfApiModel(MODEL_ID),
    additional_authorized_imports=["numpy", "pandas", "json", "csv", "os", "glob", "markdown"],
    max_steps=MAX_STEPS,
)

# give agent power to open files
agent.python_executor.send_tools({"open": open})

In [9]:
PROMPT = """You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.
You have these files available:
{context_files}
Don't forget to reference any documentation in the data dir before answering a question.

Here is the question you need to answer:
{question}

Here are the guidelines you must follow when answering the question above:
{guidelines}
"""

SPLIT = "dev"
MAX_TASKS = 3  # It takes ~20 min to run the agent on the whole dev set, so let's start with just the first 3 tasks
RUN_ID = int(time.time())


# load dataset from Hub
dev_task_dataset = datasets.load_dataset("adyen/DABstep", name="tasks", split=f"{SPLIT}[:{MAX_TASKS}]")
%time agent_answers = run_benchmark(dataset=dev_task_dataset, agent=agent)


The 'logs' attribute is deprecated and will soon be removed. Please use 'self.memory.steps' instead.



CPU times: user 3.06 s, sys: 256 ms, total: 3.31 s
Wall time: 2min 5s


### Evaluation

In [10]:
# Let's visualize the answers from our agent
pd.DataFrame(agent_answers)

| | task_id | agent_answer | reasoning_trace |
|---|---|---|---|
| 0 | 5 | NL | [SystemPromptStep(system_prompt='You are an ex... |
| 1 | 49 | Not Applicable | [SystemPromptStep(system_prompt='You are an ex... |
| 2 | 70 | Error in generating final LLM output:\n402 Cli... | [SystemPromptStep(system_prompt='You are an ex... |
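
The answer for task 70 above is an API error message rather than a real answer. A minimal sketch for separating such runs before scoring, assuming failures can be recognized by the `Error in generating final LLM output` prefix seen in that output (the helper name is hypothetical, and it expects the list of dicts returned by `run_benchmark`):

```python
ERROR_PREFIX = "Error in generating final LLM output"


def split_failed_runs(answers: list[dict], error_prefix: str = ERROR_PREFIX):
    """Split answers into (completed, failed) based on the error prefix."""
    completed, failed = [], []
    for entry in answers:
        is_error = str(entry["agent_answer"]).startswith(error_prefix)
        (failed if is_error else completed).append(entry)
    return completed, failed


# Illustrative data shaped like the rows above (values abbreviated).
example = [
    {"task_id": "5", "agent_answer": "NL"},
    {"task_id": "49", "agent_answer": "Not Applicable"},
    {"task_id": "70", "agent_answer": "Error in generating final LLM output:\n402 ..."},
]
completed, failed = split_failed_runs(example)
print([e["task_id"] for e in failed])  # tasks worth re-running
```

Keeping failed runs apart makes it easier to tell infrastructure problems from genuinely wrong answers.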


In [11]:
# Now we evaluate the answers
agent_answers = pd.DataFrame(agent_answers)
tasks_df = dev_task_dataset.to_pandas()
task_scores = evaluate(agent_answers=agent_answers, tasks_with_gt=tasks_df)

In [12]:
# Inspect scores
task_scores = pd.DataFrame(task_scores)
task_scores["correct_answer"] = tasks_df["answer"]
task_scores["question"] = tasks_df["question"]
task_scores

| | submission_id | task_id | score | level | agent_answer | correct_answer | question |
|---|---|---|---|---|---|---|---|
| 0 | | 5 | True | easy | NL | NL | Which issuing country has the highest number o... |
| 1 | | 49 | False | easy | Not Applicable | B. BE | What is the top country (ip_country) for fraud... |
| 2 | | 70 | False | easy | Error in generating final LLM output:\n402 Cli... | Not Applicable | Is Martinis_Fine_Steakhouse in danger of getti... |
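
To summarize these scores, accuracy can be computed per difficulty level. A minimal sketch in plain Python, assuming each score record carries the `score` and `level` fields shown above (the helper name `accuracy_by_level` is hypothetical):

```python
from collections import defaultdict


def accuracy_by_level(records: list[dict]) -> dict[str, float]:
    """Return the fraction of correct answers for each difficulty level."""
    totals, correct = defaultdict(int), defaultdict(int)
    for row in records:
        totals[row["level"]] += 1
        correct[row["level"]] += int(bool(row["score"]))
    return {level: correct[level] / totals[level] for level in totals}


# The three dev tasks scored above: one correct out of three, all "easy".
records = [
    {"task_id": 5, "score": True, "level": "easy"},
    {"task_id": 49, "score": False, "level": "easy"},
    {"task_id": 70, "score": False, "level": "easy"},
]
print(accuracy_by_level(records))
```

With the DataFrame from the cell above, `task_scores.to_dict("records")` should produce records of this shape.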


## Final submission
Now we run the agent against the full benchmark and create a submission file to submit to the leaderboard.


⚠️ *THIS WILL TAKE A LONG TIME IF YOU RUN IT FOR THE FULL BENCHMARK (450 tasks)* ⚠️
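
Because a full run is long, it can help to save each answer as soon as it is produced so that an interrupted run can resume. A minimal sketch, assuming a generic `answer_fn(task)` callable stands in for the agent (all names here are hypothetical and not part of the benchmark utilities):

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_with_checkpoint(
    tasks: Iterable[dict], answer_fn: Callable[[dict], str], out_path
) -> list[dict]:
    """Answer tasks one by one, appending each result to a JSONL file.

    Tasks whose task_id is already present in out_path are skipped,
    so calling this again after an interruption resumes the run.
    """
    out_path = Path(out_path)
    done: dict[str, dict] = {}
    if out_path.exists():
        with open(out_path) as f:
            for line in f:
                if line.strip():
                    entry = json.loads(line)
                    done[entry["task_id"]] = entry

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "a") as f:
        for task in tasks:
            tid = str(task["task_id"])
            if tid in done:
                continue
            entry = {"task_id": tid, "agent_answer": str(answer_fn(task))}
            f.write(json.dumps(entry) + "\n")
            f.flush()  # push the result to the file before the next (slow) task
            done[tid] = entry
    return list(done.values())
```

A `reasoning_trace` field can be added to each entry in the same way `run_benchmark` builds it.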

In [None]:
from pathlib import Path

# Create dir to store agent run results
RUNS_DIR = Path().resolve() / "runs"
RUNS_DIR.mkdir(parents=True, exist_ok=True)

def write_jsonl(data: list[dict], filepath: Path) -> None:
    """Write a list of dictionaries to a JSONL file."""
    # Ensure the directory exists
    filepath.parent.mkdir(parents=True, exist_ok=True)

    with open(filepath, "w") as file:
        for entry in data:
            file.write(json.dumps(entry) + "\n")

In [None]:
PROMPT = """You are an expert data analyst and you will answer factoid questions by loading and referencing the files/documents listed below.
You have these files available:
{context_files}

Here is the question you need to answer:
{question}

Here are the guidelines you must follow when answering the question above:
{guidelines}
"""

# Now we run on the full dataset
SPLIT = "default"
RUN_ID = int(time.time())


# load dataset from Hub
benchmark_task_dataset = datasets.load_dataset("adyen/DABstep", name="tasks", split=SPLIT)
%time agent_answers = run_benchmark(dataset=benchmark_task_dataset, agent=agent)

write_jsonl(agent_answers, RUNS_DIR / f"{RUN_ID}.jsonl")

In [None]:
# Let's visualize the answers from our agent
pd.DataFrame(agent_answers)

## Submission to Leaderboard

Your submission file should be saved in the `runs` folder of this directory.
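
Before uploading, it can be worth checking that the file is well formed. A minimal sketch, assuming each line should carry the three keys written by `run_benchmark` (`task_id`, `agent_answer`, `reasoning_trace`); the leaderboard's actual validation rules are not shown in this notebook, and the helper name `check_submission` is hypothetical:

```python
import json

REQUIRED_KEYS = {"task_id", "agent_answer", "reasoning_trace"}


def check_submission(filepath) -> int:
    """Return the number of entries in a JSONL file, raising ValueError on malformed lines."""
    count = 0
    with open(filepath) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)  # JSONDecodeError is a ValueError subclass
            if not isinstance(entry, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
    return count
```

For the run above this would be called as `check_submission(RUNS_DIR / f"{RUN_ID}.jsonl")`.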

Now comes the fun part, where you submit it to the Leaderboard and see your score!

The leaderboard is here: https://huggingface.co/spaces/adyen/DABstep


### Analyze your submission scores

Once you have made a submission, you can inspect its scores as shown below. Task scores are the more granular, task-level scores.

In [None]:
submissions_dataset = datasets.load_dataset("adyen/DABstep", name="submissions", split="default")
task_scores_dataset = datasets.load_dataset("adyen/DABstep", name="task_scores", split="default")

AGENT_NAME = "Claude 3.5 Sonnet ReACT Baseline" # the agent name given in the submission form
ORGANISATION = "Adyen" # the organization name given in the submission form
SUBMISSION_ID = f"{ORGANISATION}-{AGENT_NAME}"

submissions_dataset_df = submissions_dataset.filter(lambda row: row["submission_id"] == SUBMISSION_ID).to_pandas()
task_scores_dataset_df = task_scores_dataset.filter(lambda row: row["submission_id"] == SUBMISSION_ID).to_pandas()

In [None]:
submissions_dataset_df

In [None]:
task_scores_dataset_df

## Next Steps

The agent uses a few-shot ReAct prompt by default.

Some things you can try next:

* do an error analysis to see why the agent failed certain questions
* tweak the zero-shot prompt
* try a few-shot prompt
* try different agentic workflows, such as CoT
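
As a starting point for the few-shot idea, solved examples can be prepended to the task prompt. A minimal sketch, assuming examples are supplied as dicts with `question`, `guidelines`, and `answer` keys (the wording of the example block and the helper name are hypothetical):

```python
def _escape_braces(text: str) -> str:
    """Escape braces so the text survives a later str.format() call."""
    return text.replace("{", "{{").replace("}", "}}")


def build_few_shot_prompt(base_prompt: str, examples: list[dict]) -> str:
    """Prepend worked examples to a prompt template that still has its placeholders."""
    if not examples:
        return base_prompt
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Question: {_escape_braces(ex['question'])}\n"
            f"Guidelines: {_escape_braces(ex['guidelines'])}\n"
            f"Answer: {_escape_braces(ex['answer'])}"
        )
    header = "Here are some solved examples:\n\n"
    return header + "\n\n".join(blocks) + "\n\n" + base_prompt
```

The result keeps the `{context_files}`, `{question}`, and `{guidelines}` placeholders intact, so it can replace `PROMPT` in `run_benchmark` without further changes. Examples should come from the dev split only, so that test tasks are not leaked into the prompt.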