# Optimize a tool using Arxiv agent with DSPy and Argilla



In [None]:
%env OPENAI_API_KEY=<OPEN_AI_API_KEY>

In [None]:
!pip install -qqq dspy arxiv argilla

## Set up DSPy

This notebook combines DSPy, Langchain tools, and Argilla to create and optimize an agent for using the Arxiv API. The agent will be able to search for papers, download them, and extract the text. We will start by defining the agent's llm as an openai model.

In [17]:
import os
import dspy

dspy.settings.configure(
    lm=dspy.OpenAI(
        model="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        max_tokens=4000,
        temperature=0,
    )
)

## Defining DSPy Signature

DSPy relies on signatures to represent data samples. We will define the signature for the Arxiv agent which consists of the following fields:

In [5]:
class ArxivQASignature(dspy.Signature):
    """You will be given a question and an Arxiv Paper ID. Your task is to answer the question."""

    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
        format=lambda x: x.strip(),
    )
    paper_id: str = dspy.InputField(
        prefix="Paper ID:",
        desc="Arxiv Paper ID",
    )
    answer: str = dspy.OutputField(
        prefix="Answer:",
        desc="answer to the question",
    )

## Loading the Arxiv Dataset

ArXiv QA is a dataset of automated question answering (QA) pairs generated from ArXiv papers using large language models. The dataset includes over 900 papers with corresponding QA pairs, covering a wide range of topics in computer science and related fields. The dataset is organized by year, with papers dating back to 2009.

In [7]:
from random import sample
from dspy.datasets import DataLoader

dl = DataLoader()

arxiv_qa = dl.from_huggingface(
    "taesiri/arxiv_qa",
    split="train",
    input_keys=("question", "paper_id"),
)

Generating train split: 100%|██████████| 210580/210580 [00:00<00:00, 993446.51 examples/s] 


To keep requests down we'll use a subset of training and testing dataset. We'll be using 100 examples for training set and 20 examples for testing set.

In [8]:
import random

# Set a random seed for reproducibility
random.seed(42)

aqa_train = [
    dspy.Example(
        question=example.question, paper_id=example.paper_id, answer=example.answer
    ).with_inputs("question", "paper_id")
    for example in sample(arxiv_qa, 100)
]
aqa_test = [
    dspy.Example(
        question=example.question, paper_id=example.paper_id, answer=example.answer
    ).with_inputs("question", "paper_id")
    for example in sample(arxiv_qa, 20)
]

## DSPy Avatar and Tool

We'll setup `Avatar` module with a signature and the tool. Once we have defined our `tools`, we can now create an `Avatar` object by passing the `tools` and `signature`. It takes 2 more optional parameters `verbose` and `max_iters`. `verbose` is used to display the logs and `max_iters` is used to control the number of iterations in multi step execution. 


In [11]:
from dspy.predict.avatar import Tool, Avatar
from langchain_community.utilities import ArxivAPIWrapper

tools = [
    Tool(
        tool=ArxivAPIWrapper(),
        name="ARXIV_SEARCH",
        desc="Pass the arxiv paper id to get the paper information.",
        input_type="Arxiv Paper ID",
    ),
]

arxiv_agent = Avatar(
    tools=tools,
    signature=ArxivQASignature,
    verbose=True,
)

## Evaluate performance

Open enden QA tasks are hard to evaluate on rigid metrics like exact match. So, we'll be using an improvised LLM as Judge for the evaluation of our model on test set.

In [13]:
class Evaluator(dspy.Signature):
    """Please act as an impartial judge and evaluate the quality of the responses provided by multiple AI assistants to the user question displayed below. You should choose the assistant that offers a better user experience by interacting with the user more effectively and efficiently, and providing a correct final response to the user's question.

    Rules:
    1. Avoid Position Biases: Ensure that the order in which the responses were presented does not influence your decision. Evaluate each response on its own merits.
    2. Length of Responses: Do not let the length of the responses affect your evaluation. Focus on the quality and relevance of the response. A good response is targeted and addresses the user's needs effectively, rather than simply being detailed.
    3. Objectivity: Be as objective as possible. Consider the user's perspective and overall experience with each assistant."""

    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
    )
    reference_answer: str = dspy.InputField(
        prefix="Reference Answer:",
        desc="Answer to the question given by the model.",
    )
    answer: str = dspy.InputField(
        prefix="Answer:",
        desc="Answer to the question given by the model.",
    )
    rationale: str = dspy.OutputField(
        prefix="Rationale:",
        desc="Explanation of why the answer is correct or incorrect.",
    )
    is_correct: bool = dspy.OutputField(
        prefix="Correct:",
        desc="Whether the answer is correct.",
    )


evaluator = dspy.TypedPredictor(Evaluator)


def metric(example, prediction, trace=None):
    return int(
        evaluator(
            question=example.question,
            answer=prediction.answer,
            reference_answer=example.answer,
        ).is_correct
    )

For evaluation we can't use `dspy.Evaluate`, reason being that `Avatar` changes it's signature per iteration by adding the actions and it's results to it as fields. So we can create our own hacky thread safe evaluator for it.

In [14]:
import tqdm

from concurrent.futures import ThreadPoolExecutor


def process_example(example, signature):
    try:
        avatar = Avatar(
            signature,
            tools=tools,
            verbose=False,
        )
        prediction = avatar(**example.inputs().toDict())

        return metric(example, prediction)
    except Exception as e:
        print(e)
        return 0


def multi_thread_executor(test_set, signature, num_threads=60):
    total_score = 0
    total_examples = len(test_set)

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(process_example, example, signature) for example in test_set
        ]

        for future in tqdm.tqdm(
            futures, total=total_examples, desc="Processing examples"
        ):
            total_score += future.result()

    avg_metric = total_score / total_examples
    return avg_metric

In [18]:
aqa_score = multi_thread_executor(aqa_test, ArxivQASignature)
print(f"Average Score on ArxivQA: {aqa_score:.2f}")

Processing examples: 100%|██████████| 100/100 [01:53<00:00,  1.14s/it]

Average Score on ArxivQA: 0.38





## DSPy Optimization with `AvatarOptimizer`

To optimize the Actor for optimal tool usage, we use the AvatarOptimizer class. This class is a DSPy implementation of the Avatar method, which uses a comparator module to optimize the Actor for the given tools.

Parameters

The AvatarOptimizer class takes the following parameters:

metric: The metric to optimize for
max_iters: The maximum number of iterations for the optimizer
lower_bound: The lower bound for the metric to classify an example as negative
upper_bound: The upper bound for the metric to classify an example as positive
max_positive_inputs: The maximum number of positive inputs to sample for the comparator
max_negative_inputs: The maximum number of negative inputs to sample for the comparator
optimize_for: Whether to maximize or minimize the metric during optimization
Usage

To use the AvatarOptimizer, create an instance of the class and pass in the required parameters. Then, call the compile method to optimize the Actor.

The AvatarOptimizer optimizes the Actor for optimal tool usage, but it does not optimize the instruction of the signature passed to the Agent. The Actor is the module that directs tool execution and flow, and it is not the same as the signature passed to the Agent.

In [None]:
from dspy.teleprompt import AvatarOptimizer

teleprompter = AvatarOptimizer(
    metric=metric,
    max_iters=1,
    max_negative_inputs=10,
    max_positive_inputs=10,
)

optimized_arxiv_agent = teleprompter.compile(student=arxiv_agent, trainset=aqa_train)

Now we can evaluate our actor module, for this we've provided an implementation of thread safe evaluator that we above as part of class method of `AvatarOptimizer`.

In [None]:
teleprompter.thread_safe_evaluator(aqa_test, optimized_arxiv_agent)

# Review optimizations in Argilla

Now let's take a look at the opimized tool usage in Argilla. We will pass both agents' responses to the UI and rank them blindly, to see how the optimized agent performs.

In [31]:
from uuid import uuid4
import argilla as rg

client = rg.Argilla()

dataset = rg.Dataset(
    name=f"arxiv-tools-{uuid4()}",
    settings=rg.Settings(
        fields=[
            rg.TextField(name="response1"),
            rg.TextField(name="response2"),
        ],
        questions=[
            rg.RankingQuestion(name="ranking", values=["response1", "response2"])
        ],
    ),
)

dataset.create()

Dataset(id=UUID('dd04ab88-168d-4bbb-adb9-7f698c58ace2') inserted_at=datetime.datetime(2024, 10, 3, 9, 44, 24, 471390) updated_at=datetime.datetime(2024, 10, 3, 9, 44, 25, 564826) name='arxiv-tools-64c1f828-e922-43d6-91f5-a15902c66ed6' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('31fb2d4c-46f8-4698-8dcb-27555e78c092') last_activity_at=datetime.datetime(2024, 10, 3, 9, 44, 25, 564826))

Now let's use the agents and push the responses to the UI for ranking.

In [33]:
max_samples = 2

for example in aqa_test[:max_samples]:
    question = aqa_test[0].question
    base_answer = arxiv_agent(**aqa_test[0]).answer
    optimized_answer = optimized_arxiv_agent(**aqa_test[0]).answer
    dataset.records.log(
        [
            {
                "question": question,
                "response1": base_answer,
                "response2": optimized_answer,
                "ranking": ["response1", "response2"],
            }
        ]
    )

Starting the task...
Action 1: ARXIV_SEARCH (2403.06404 keywords)
Action 2: Finish (The keywords or key terms associated with the paper 2403.06404 are: uncertainty modeling, speaker representation, cosine scoring, neural speaker embedding, embedding estimation, speaker recognition.)
Starting the task...
Action 1: ARXIV_SEARCH (What are the keywords or key terms associated with the paper 2403.06404?)
Action 2: Finish ()




Sending records...: 100%|██████████| 1/1 [00:00<00:00,  1.77batch/s]


Starting the task...
Action 1: ARXIV_SEARCH (2403.06404 keywords)
Action 2: Finish (The keywords or key terms associated with the paper 2403.06404 are: uncertainty modeling, speaker representation, cosine scoring, neural speaker embedding, embedding estimation, speaker recognition.)
Starting the task...
Action 1: ARXIV_SEARCH (What are the keywords or key terms associated with the paper 2403.06404?)
Action 2: Finish ()


Sending records...: 100%|██████████| 1/1 [00:00<00:00,  1.79batch/s]
