# Choose a base model for your AI Agent using Atla x Langfuse

This notebook demonstrates how to evaluate function calling capabilities across different models using **Atla Selene for evaluation** and **Langfuse for experiment observability**. If you'd like a visual walkthrough, check out our [demo video](https://youtu.be/0lenKLgn1p8).

We compare the performance of various models (o1-mini, o3-mini, and gpt-4o) on function calling tasks using the [Salesforce ShareGPT dataset](https://huggingface.co/datasets/arcee-ai/agent-data/viewer/default/train?f%5Bdataset%5D%5Bvalue%5D=%27glaive-function-calling-v2-extended%27&sql=SELECT+*%0AFROM+train%0AWHERE+dataset+%3D+%27salesforce_sharegpt%27%0ALIMIT+10%3B&views%5B%5D=train).

The notebook uploads the dataset to Langfuse and sets up one experiment run per model. Selene then evaluates the outputs of each run automatically.

<br>

**Requirements:**

- An Atla account - you can sign up for free [here](https://www.atla-ai.com/sign-up)
- A Langfuse account - you can sign up for free [here](https://cloud.langfuse.com/auth/sign-up)
- An OpenAI API key - you can sign up for free [here](https://platform.openai.com/signup)

**Get started**

1. Follow the steps below in **Setup Atla on Langfuse**.
2. Set your Langfuse + OpenAI API keys.
3. Run the remaining cells to compare performance across the three OpenAI models. Selene will score the outputs against the ground truths from the dataset.
4. Easily compare model outputs and assess average scores in your Langfuse Cloud.

## Setup Atla on Langfuse

Navigate to your project on [cloud.langfuse.com](https://cloud.langfuse.com):

<br>

**Add your Atla API key to your Langfuse project:**

1. Head to **Settings** → **LLM Connections** and select **+** **Add new LLM API key.**
2. Set `atla` as the **Provider name** and select `atla` from the **LLM adapter** dropdown.
3. The API Base URL will automatically be filled in. Paste your Atla API key beginning with “pk-…” into the **API Key** field.
4. Leave **Enable default models** on.
5. Click **Save new LLM API key.**

![alt text](https://atla-ai.notion.site/image/attachment%3Aed7c7d92-b0bf-464f-9f43-83df6de65a5e%3Aimage.png?table=block&id=1bc309d1-7745-80fd-9f2d-cdd40f3005cf&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)

**Add an LLM-as-a-Judge template**

1. Head to **Evaluation → LLM-as-a-Judge** in your sidebar and select **Templates**
2. Click **+ New Template**
3. Select `atla` as the **Model Provider** and choose `atla-selene` as the **Model name**
4. Let’s evaluate the function calling ability of a model - name your template `function_calling` and paste in the following prompt:

```
Evaluate whether the model accurately selects the appropriate function(s) from the available options with a binary score of 0 or 1, comparing against the provided ground truth.

Scoring Criteria:
1: The model selection matches the ground truth—selected the correct function(s) using the proper tool call syntax.
0: The model selection does not match the ground truth—selected incorrect functions, missed required functions, or used significantly different parameters.

Instruction: {{input}}
Ground Truth: {{expected_output}}
Response: {{output}}
```

5. Adjust the prompt under **Reasoning** to tune Selene’s evaluation critique to your liking - we use "One sentence reasoning for the score"

6. Adjust the prompt under **Score** to "Score 0 or 1" in alignment with our eval prompt and hit **Save**

![alt text](https://atla-ai.notion.site/image/attachment%3A0078240d-4138-4b4c-a00e-075304065051%3Aimage.png?table=block&id=1ba309d1-7745-80fa-bc58-d43390728b0c&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)
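
To make the `{{input}}`, `{{expected_output}}`, and `{{output}}` placeholders concrete, here is a minimal sketch of how such a template gets filled. Langfuse performs this substitution for you when the evaluator runs; `render_prompt` is a hypothetical helper written only for illustration, and the bracketed values are stand-ins rather than real dataset content.

```python
import re

# Last three lines of the `function_calling` template above.
TEMPLATE = (
    "Instruction: {{input}}\n"
    "Ground Truth: {{expected_output}}\n"
    "Response: {{output}}"
)


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder in one pass (hypothetical helper).

    Placeholders without a matching variable are left untouched.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        template,
    )


example = render_prompt(
    TEMPLATE,
    {
        "input": "Where can I find live giveaways for beta access and games?",
        "expected_output": "<ground-truth tool calls from the dataset>",
        "output": "<tool calls produced by the model under test>",
    },
)
print(example)
```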

**Add a new Evaluator configuration:**

1. Head to **Evaluation → LLM-as-a-Judge** in your sidebar and select **Evaluators**
2. Click **+ New evaluator**
3. Select the template `function_calling` you just created
4. Select **Dataset** as the **Target object** and make sure the Evaluator runs on **New dataset items**
5. Configure the **Variable mapping** for your evaluator - this ensures that the correct parts of your traces get evaluated.
    - `{{input}}` can be mapped to **Dataset item → Input**
    - `{{expected_output}}` can be mapped to **Dataset item → Expected output**
    - `{{output}}` can be mapped to **Trace → Output**
6. Click **Save**

![alt text](https://atla-ai.notion.site/image/attachment%3Af49c428d-d8ef-44ed-9a87-3d6936fe1252%3Aimage.png?table=block&id=1ba309d1-7745-80af-b04c-c558c7f8225f&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)

> You can set a sampling rate so that only a percentage of the dataset is evaluated. For this demonstration we want to evaluate every trace, so keep the sampling rate at 100%!



## Set Langfuse + OpenAI API keys

In [1]:
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "" # Your public key
os.environ["LANGFUSE_SECRET_KEY"] = "" # Your secret key

os.environ["OPENAI_API_KEY"] = "" # Your OpenAI API key
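
Before moving on, it can help to confirm that none of the keys above was left empty. The sketch below is an optional check; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here and are not part of Langfuse or OpenAI.

```python
import os

# Environment variables this notebook expects to be set.
REQUIRED_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "OPENAI_API_KEY")


def missing_keys(env=os.environ) -> list[str]:
    """Return the required variables that are unset or blank (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


unset = missing_keys()
if unset:
    print("Still missing:", ", ".join(unset))
else:
    print("All keys are set.")
```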

## Setup experiment runs

### Install packages

In [2]:
pip install -qU datasets "langfuse<3" openai

### Upload dataset to Langfuse

We process data from the [Salesforce ShareGPT dataset](https://huggingface.co/datasets/arcee-ai/agent-data/viewer/default/train?f%5Bdataset%5D%5Bvalue%5D=%27glaive-function-calling-v2-extended%27&sql=SELECT+*%0AFROM+train%0AWHERE+dataset+%3D+%27salesforce_sharegpt%27%0ALIMIT+10%3B&views%5B%5D=train), extracting the system prompt (which defines the list of available functions), the user query, and the expected response from each conversation.

This data includes:

1. `system_prompt` - Defines the AI's role as a function calling model, provides available functions in a `<tools>` XML format, and specifies the expected response format using `<tool_call>` tags.
<br>

2. `input` - A human query that asks the model to perform some task using the available functions.
<br>

3. `expected_output` - The model's expected response, which includes one or more function calls in the specified XML format.
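
To see what the `<tool_call>` format looks like once parsed, here is a small sketch that pulls the JSON payloads out of a response string. `extract_tool_calls` is a hypothetical helper and the sample string is made up to match the format described in the system prompt; neither comes from the dataset or from Langfuse.

```python
import json
import re

# Non-greedy match so that several <tool_call> blocks in one string stay separate.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the JSON payload of every <tool_call> block that parses cleanly."""
    calls = []
    for payload in TOOL_CALL_PATTERN.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip blocks whose content is not valid JSON
    return calls


# Made-up response in the format requested by the dataset's system prompt.
sample = (
    '<tool_call>{"tool_name": "live_giveaways_by_type", '
    '"tool_arguments": {"type": "beta"}}</tool_call>\n'
    '<tool_call>{"tool_name": "live_giveaways_by_type", '
    '"tool_arguments": {"type": "game"}}</tool_call>'
)

for call in extract_tool_calls(sample):
    print(call["tool_name"], call["tool_arguments"])
```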

In [3]:
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # EU region; for the US region use "https://us.cloud.langfuse.com"

from datasets import load_dataset
from langfuse import Langfuse

langfuse = Langfuse()

# Load and process data
dataset = load_dataset("arcee-ai/agent-data")
salesforce_data = list(dataset['train'].filter(lambda x: x['dataset'] == "salesforce_sharegpt"))[:10]  # convert to list first

print("Number of items after filtering:", len(salesforce_data))
print("\nFirst item structure:")
print(salesforce_data[0])

# Process conversations into the format we want
processed_items = []

for item in salesforce_data:
    try:
        conversations = item['conversations']

        # Debug print
        print("\nProcessing item with conversations:", conversations)

        system_prompt = next((conv['value'] for conv in conversations if conv['from'] == 'system'), '')
        human_input = next((conv['value'] for conv in conversations if conv['from'] == 'human'), '')
        model_output = next((conv['value'] for conv in conversations if conv['from'] == 'gpt'), '')

        if system_prompt and human_input and model_output:
            processed_items.append({
                "input": {
                    "system_prompt": system_prompt,
                    "query": human_input
                },
                "expected_output": model_output
            })
    except Exception as e:
        print(f"Skipping malformed entry: {e}")
        print(f"Item that caused error: {item}")

print(f"\nProcessed {len(processed_items)} items")

# Create the dataset
langfuse.create_dataset(
    name="salesforce_sharegpt",
    description="Function calling dataset from ArceeAI"
)

# Upload items
for item in processed_items:
    langfuse.create_dataset_item(
        dataset_name="salesforce_sharegpt",
        input=item["input"],
        expected_output=item["expected_output"]
    )

Number of items after filtering: 10

First item structure:
{'conversations': [{'from': 'system', 'value': 'You are a function calling AI model. You may call one or more functions to assist with the user query. Don\'t make assumptions about what values to plug into function. The user may use the terms function calling or tool use interchangeably.\n\nHere are the available functions:\n<tools>[{"name": "live_giveaways_by_type", "description": "Retrieve live giveaways from the GamerPower API based on the specified type.", "parameters": {"type": {"description": "The type of giveaways to retrieve (e.g., game, loot, beta).", "type": "str", "default": "game"}}}]</tools>\n\nFor each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags in the format:\n<tool_call>{"tool_name": <function-name>, "tool_arguments": <args-dict>}</tool_call>'}, {'from': 'human', 'value': 'Where can I find live giveaways for beta access and games?'}, {'from': 'gpt',

### Define experiment runs

We run the system prompt and user query through each model, and evaluate whether the responses follow the proper tool call syntax (enclosed in `<tool_call>` tags) as defined in the dataset.

In [6]:
from langfuse.openai import openai
from langfuse.decorators import observe, langfuse_context
import time

@observe()
def run_function_calling(input, model_name):
    # Models that support system messages
    system_supported_models = ["gpt-4", "gpt-4o", "gpt-3.5-turbo"]

    if model_name in system_supported_models:
        messages = [
            {"role": "system", "content": input["system_prompt"]},
            {"role": "user", "content": input["query"]}
        ]
    else:
        # For models that don't support system messages (the o-series), combine them in the user prompt
        combined_prompt = f"{input['system_prompt']}\n\nUser query: {input['query']}"
        messages = [
            {"role": "user", "content": combined_prompt}
        ]

    completion = openai.chat.completions.create(
        model=model_name,
        messages=messages
    ).choices[0].message.content

    return completion

def run_experiment(experiment_name, model_name):
    dataset = langfuse.get_dataset("salesforce_sharegpt")

    for idx, item in enumerate(dataset.items):
        with item.observe(run_name=experiment_name) as trace_id:
            # Run application with the specific model
            output = run_function_calling(item.input, model_name)

        # Add a delay between requests to avoid rate limits
        if idx < len(dataset.items) - 1:
            time.sleep(2)  # 2 second delay
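
The fixed 2-second pause above is the simplest way to stay under rate limits. If you still hit them, one common alternative is to retry with exponential backoff. The sketch below is not what this notebook does; `call_with_backoff` is a hypothetical helper, and it catches every exception for brevity, whereas in practice you would catch only your client's rate-limit error.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to a rate-limit error in real use
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


# Demo with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, len(delays))
```

Inside `run_experiment`, the model call could then be wrapped as `call_with_backoff(lambda: run_function_calling(item.input, model_name))`.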

## Run experiments over different models

In [7]:
# Run experiments with different models
models = ["o1-mini", "o3-mini", "gpt-4o"]
for model in models:
    run_experiment(f"function_calling_{model}", model)

## Analyze results

<br>

After the experiments with the different models finish running, head to **Datasets** in your sidebar and click on `salesforce_sharegpt`.

<br>

#### **High-level model comparison**

- You will see **Latency (s)** and **Average Total Cost ($)** graphs by default.
- Add **Charts → Scores → Function_calling (Eval)** to see the average Selene scores for `function_calling`.

  ![alt text](https://atla-ai.notion.site/image/attachment%3Aecd25ea4-86ed-4534-a647-5b39391e0b35%3Aimage.png?table=block&id=1bd309d1-7745-8012-b2eb-f4b07f9d2a5b&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)

Based on our limited test set of 10 samples, the reasoning models (o1-mini and o3-mini) appear better suited to our agentic task, although a sample this small can only hint at a difference.
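
To get a feel for how much an average over 10 binary scores can tell us, here is a sketch that computes a Wilson score interval for a proportion. The count used below is hypothetical and is not a result from this notebook; `wilson_interval` is a helper written for this illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical count, not a measured result: 8 of 10 items scored 1.
low, high = wilson_interval(8, 10)
print(f"mean score {8 / 10:.2f}, interval {low:.2f} to {high:.2f}")
```

With only 10 items the interval for this hypothetical count runs from roughly 0.49 to 0.94, so small gaps between models should be confirmed on a larger dataset.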

#### **Fine-grained model comparison**

- Select the three experiment runs and click **Actions (3 selected) → Compare**

  ![alt text](https://atla-ai.notion.site/image/attachment%3Aecd25ea4-86ed-4534-a647-5b39391e0b35%3Aimage.png?table=block&id=1bd309d1-7745-8012-b2eb-f4b07f9d2a5b&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)


- In the comparison screen, we can inspect cases where some models fail and others succeed. Failures are marked by a `# Function_calling (Eval)` score of 0 in the cell.

  ![alt text](https://atla-ai.notion.site/image/attachment%3Aeb428c05-cac0-46de-ba69-bcfc389ae1a6%3Aimage.png?table=block&id=1c0309d1-7745-8058-b75b-fa6f963f2979&spaceId=f08e6e70-73af-4363-9621-90e906b92ebc&width=2000&userId=&cache=v2)

#### **Next steps**

Now that you've analyzed the models' performance, consider setting up additional Evaluators to measure other metrics that matter to your specific use case. If you have a collection of test user queries, upload this dataset and run comparative tests across different models.