# Measure the accuracy of an agentic app with Arize and Langflow

In this notebook, we'll step through how to use the open source tools [Arize](https://arize.com/) and [Langflow](https://www.langflow.org/) to measure the accuracy of an agentic application.

## Prepping a Dataset

Let's start by gathering some example data from a standard question-answering benchmark. We'll use the Stanford Question Answering Dataset (SQuAD), available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).


In [1]:
import requests
import json

# Download the file
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'
response = requests.get(url)

# Check if the download was successful
if response.status_code == 200:
    # Parse the JSON once and reuse it below
    data = response.json()

    # Save a local copy
    with open('dev-v2.0.json', 'w', encoding='utf-8') as file:
        json.dump(data, file)
    print("File downloaded successfully")
    print(f"Dataset loaded with {len(data['data'])} articles")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully
Dataset loaded with 35 articles


In [2]:
# get all articles from the dataset
docs = []
for record in data["data"]:
    text = ""
    for paragraph in record["paragraphs"]:
        text += paragraph["context"] + "\n"
    docs.append(text)

In [3]:
print(f"Number of wikipedia articles in dataset: {len(docs)}.")

Number of wikipedia articles in dataset: 35.


In [None]:
# create markdown files for each article for use in Langflow UI
import os

os.makedirs("data/markdown_files", exist_ok=True)  # make sure the output folder exists
for i, doc in enumerate(docs):
    filename = f"data/markdown_files/article_{i}.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(doc)

In [6]:
# get all questions and answers for each article
qandas = []
for record in data["data"]:
    for paragraph in record["paragraphs"]:
        for qa in paragraph["qas"]:
            if len(qa["answers"]) > 0:
                qandas.append((qa["question"], qa["answers"][0]["text"]))

# there are about 6000 Q&A pairs in this dataset.
# Let's sample 100 of them to make our tests quicker to run
import random
samples = random.sample(qandas, 100)

In [None]:
# Let's look at a couple examples of questions and answers
print(samples[0])
print(samples[1])
print(samples[2])

('What type of fossils were found in China?', 'Cambrian sessile frond-like fossil Stromatoveris')
('Where do pharmacists acquire more preparation following pharmacy school?', 'a pharmacy practice residency')
('What contributed to the decreased inequality between trained and untrained workers?', 'period of compression')
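Because `random.sample` draws a different subset on every run, two sessions will not evaluate the same questions unless the sample is saved or seeded. The following is a minimal sketch of seeded sampling; the helper name `sample_pairs`, the stand-in list `qandas_demo`, and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import random

# Stand-in for the real list of (question, answer) tuples.
qandas_demo = [(f"question {i}", f"answer {i}") for i in range(1000)]

def sample_pairs(pairs, k, seed=42):
    """Return k pairs chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(pairs, k)

samples_a = sample_pairs(qandas_demo, 100)
samples_b = sample_pairs(qandas_demo, 100)
print(samples_a == samples_b)  # the same seed gives the same sample
```

Using a local `random.Random` instance keeps the seeding from affecting any other code that relies on the module-level generator.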


## Install modules needed for the rest of this example

We'll need the following modules to be available:

- `langflow` for rapidly designing AI workflows
- `arize` to run evals
- `pandas` to format the data
- `openai` for access to LLMs and embedding models

If you don't already have these installed, go ahead and run the install commands below.

In [None]:
!pip install uv

In [None]:
# uv requires a virtual environment to run
!uv venv
!source .venv/bin/activate

In [None]:
# install all the modules for the rest of the notebook
!uv pip install langflow pandas openai arize -U

## Loading our ground truth data to Arize

Now that we have a set of questions and answers, we can use this to evaluate the quality of our agentic application.

Let's upload this dataset to Arize.

In [8]:
# let's start by creating a dataframe with our Q&A pairs and save a copy
import pandas as pd
df = pd.DataFrame(samples, columns=["question", "answer"])

# only run this line if you want to resave the dataframe
df.to_csv("data/qa_pairs.csv", index=False)

In [1]:
# or just load from the csv provided
import pandas as pd
df = pd.read_csv("data/qa_pairs.csv")
with pd.option_context('display.max_colwidth', None):
    display(df.head(3))

| Unnamed: 0 | question | answer |
|---|---|---|
| 0 | What type of fossils were found in China? | Cambrian sessile frond-like fossil Stromatoveris |
| 1 | Where do pharmacists acquire more preparation following pharmacy school? | a pharmacy practice residency |
| 2 | What contributed to the decreased inequality between trained and untrained workers? | period of compression |


In [6]:
# get the Arize client, which we'll use to create datasets and run experiments
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

import os
from dotenv import load_dotenv
load_dotenv()

ARIZE_API_KEY = os.getenv("ARIZE_API_KEY")
ARIZE_SPACE_ID = os.getenv("ARIZE_SPACE_ID")
ARIZE_DEVELOPER_KEY = os.getenv("ARIZE_DEVELOPER_KEY")

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY, developer_key=ARIZE_DEVELOPER_KEY)
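`os.getenv` returns `None` for a variable that is not set, so a missing key only surfaces later as a failed client call. As a minimal sketch, a small check can name the missing variables up front; the helper `missing_env_vars` and the stand-in mapping `demo_env` are hypothetical and only illustrate the idea.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

REQUIRED = ["ARIZE_API_KEY", "ARIZE_SPACE_ID", "ARIZE_DEVELOPER_KEY"]

# Demo with a stand-in mapping instead of the real environment.
demo_env = {"ARIZE_API_KEY": "abc", "ARIZE_SPACE_ID": ""}
print(missing_env_vars(REQUIRED, demo_env))
# prints ['ARIZE_SPACE_ID', 'ARIZE_DEVELOPER_KEY']
```

In the notebook, calling `missing_env_vars(REQUIRED)` after `load_dotenv()` would check the real environment.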



In [None]:
# create the dataset in Arize
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID, 
    dataset_name="squad_dataset",
    dataset_type=GENERATIVE,
    data=df
)





In [None]:
# for speed, let's also make a smaller dataset
df_small = df.head(20)
dataset = arize_client.create_dataset(
    data=df_small,
    dataset_name="squad-dev-v2.0-x-small",
    space_id=ARIZE_SPACE_ID,
    dataset_type=GENERATIVE
)

## Run Langflow Flows from code

In this section, we define a method to use the Langflow runtime to execute flows and return the results.

You can find details on how to call these methods by clicking the **Publish** button in the upper right corner of the Langflow UI when a flow is open and you are editing it.  Then select **API Access**.

The code below assumes you are running Langflow locally on your machine.


In [None]:
# code to run a flow in Langflow
# trigger flow via API
import requests
import uuid

BASE_API_URL = "http://127.0.0.1:7860"

def run_flow(input: str, flow_id: str, input_type: str = "chat", tweaks: dict = None):
    """Run a Langflow flow and return the HTTP response.

    Retries on timeouts (doubling the timeout each time) and on other request
    errors, including non-200 status codes. Raises RuntimeError if every attempt fails.
    """
    api_url = f"{BASE_API_URL}/api/v1/run/{flow_id}"

    payload = {
        "input_value": input,
        "output_type": "chat",
        "input_type": input_type,
        "session_id": str(uuid.uuid4()), # create a new session to avoid chat history
    }

    if tweaks:
        payload["tweaks"] = tweaks

    timeout = 25
    attempts = 6
    for i in range(attempts):
        try:
            response = requests.post(api_url, json=payload, timeout=timeout)
            if response.status_code != 200:
                raise requests.exceptions.RequestException(f"Status code: {response.status_code}")
            return response

        except requests.exceptions.Timeout:
            print(f"The flow request timed out. Attempt {i}, current timeout {timeout}.")
            timeout *= 2
        except requests.exceptions.RequestException as e:
            print(e)

    # every attempt failed; raise instead of silently returning None
    raise RuntimeError(f"Flow {flow_id} failed after {attempts} attempts")
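The retry pattern in `run_flow` can be looked at separately from the HTTP details. The following is a minimal sketch under stated assumptions: `call_with_retries` and `fake_send` are hypothetical names, and the fake sender stands in for a slow Langflow server so the example runs without a network. This sketch raises once all attempts are exhausted, so a caller never receives `None`.

```python
def call_with_retries(send, attempts=6, timeout=25.0):
    """Call send(timeout) until it succeeds, doubling the timeout after each TimeoutError."""
    last_error = None
    for _ in range(attempts):
        try:
            return send(timeout)
        except TimeoutError as exc:
            last_error = exc
            timeout *= 2  # give a slow flow more time on the next attempt
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error

# Fake sender: "times out" unless it is given at least 100 seconds.
seen_timeouts = []

def fake_send(timeout):
    seen_timeouts.append(timeout)
    if timeout < 100:
        raise TimeoutError("too slow")
    return "ok"

result = call_with_retries(fake_send)
print(result, seen_timeouts)  # prints: ok [25.0, 50.0, 100.0]
```

Doubling the timeout, rather than retrying with the same value, helps when a flow is simply slow, since a request that needs 40 seconds will never succeed under a fixed 25-second limit.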

## Run an experiment with Langflow and Arize

1. Define a task to run a flow on the dataset
2. Create an LLM to act as a judge
3. Define an evaluator to measure the accuracy of the results
4. Run the experiment and log the results to Arize

In [56]:
CHAT_FLOW_ID = "796625a6-2526-4e12-b2f8-873d66940016"
TWEAKS = {}

def task(dataset_row) -> str:
    question = dataset_row["question"]
    response = run_flow(question, CHAT_FLOW_ID, tweaks=TWEAKS)
    text = response.json()['outputs'][0]['outputs'][0]['results']['message']["text"]
    return text   
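The nested lookup in `task` raises a bare `KeyError` or `IndexError` if the response body has a different shape, which makes a failed run hard to diagnose. The following is a minimal sketch of a more defensive version; the helper name `extract_message_text` is hypothetical, and `demo_payload` is hand-built to match the path used in `task`, not a captured Langflow response.

```python
def extract_message_text(payload: dict) -> str:
    """Pull the chat message text out of a response body shaped like the one task() expects."""
    try:
        return payload["outputs"][0]["outputs"][0]["results"]["message"]["text"]
    except (KeyError, IndexError, TypeError) as exc:
        # Report the problem in one consistent exception type.
        raise ValueError(f"unexpected response shape: {exc!r}") from exc

# Hand-built stand-in that follows the same nesting as the lookup above.
demo_payload = {
    "outputs": [
        {"outputs": [{"results": {"message": {"text": "a pharmacy practice residency"}}}]}
    ]
}
print(extract_message_text(demo_payload))  # prints: a pharmacy practice residency
```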

In [47]:
from dotenv import load_dotenv
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# load environment variables (such as the OpenAI API key) from a local .env file
load_dotenv()

# the LLM that will act as the judge
judge_model = OpenAIModel(
    model="gpt-4o-mini",
    temperature=0.0,
)

In [5]:
# let's look at the prompt template
print(QA_PROMPT_TEMPLATE)


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.
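The template has three placeholders, `{input}`, `{reference}`, and `{output}`, and the evaluator supplies values for them through dataframe columns with the same names. As a rough sketch of how one row turns into a judge prompt, the example below fills a shortened stand-in template with plain `str.format`; Phoenix's own templating may differ in its details, and `demo_template` and `row` are illustrative.

```python
# Shortened stand-in for QA_PROMPT_TEMPLATE with the same three placeholders.
demo_template = (
    "[Question]: {input}\n"
    "[Reference]: {reference}\n"
    "[Answer]: {output}"
)

# One row: the question, the ground-truth answer, and the flow's answer.
row = {
    "input": "Where do pharmacists acquire more preparation following pharmacy school?",
    "reference": "a pharmacy practice residency",
    "output": "In a pharmacy practice residency.",
}

prompt = demo_template.format(**row)
print(prompt)
```

Note that the ground-truth answer from the dataset goes in the `reference` slot, while the flow's response goes in the `output` slot.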



In [36]:
#from phoenix.experiments.evaluators import create_evaluator
from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult
import pandas as pd

#@create_evaluator(name="Answer Correctness", kind="LLM")
def answer_correctness(input, output, dataset_row) -> EvaluationResult:

    question = dataset_row["question"]
    answer = dataset_row["answer"]
    df_in = pd.DataFrame({
        "input": question,
        "output": output,
        "reference": answer,
    }, index=[0])
              
    rails = list(QA_PROMPT_RAILS_MAP.values())
    
    eval_df = llm_classify(
        data=df_in,
        template=QA_PROMPT_TEMPLATE,
        model=judge_model,
        rails=rails,
        provide_explanation=True,
        run_sync=True,
    )
    
    label = eval_df["label"][0]
    explanation = eval_df["explanation"][0]
    score = 1 if label == "correct" else 0

    return EvaluationResult(label=label, score=score, explanation=explanation)
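Since each evaluation yields a score of 1 for "correct" and 0 otherwise, the accuracy of a run is the mean of those scores. A minimal sketch of that arithmetic follows; the `accuracy` helper and the `demo_labels` list are made up for illustration and are not results from this notebook.

```python
def accuracy(labels):
    """Fraction of labels equal to "correct"; 0.0 for an empty list."""
    scores = [1 if label == "correct" else 0 for label in labels]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative labels only, not output from a real experiment.
demo_labels = ["correct", "incorrect", "correct", "correct"]
print(accuracy(demo_labels))  # prints: 0.75
```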


In [8]:
# dataset IDs - get these from the Arize UI
squad_qa_20 = "RGF0YXNldDozNDMxNjowWFIx"  
squad_qa_100 = "RGF0YXNldDozNDMxNTpSZ21u"

In [49]:
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=squad_qa_20,
    task=task,
    evaluators=[answer_correctness],
    experiment_name="astra_v3_small_n1_gpt4o_concise",
    exit_on_error=True
)

  arize.utils.logging | INFO | 🧪 Experiment started.


🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
running tasks |          | 0/20 (0.0%) | ⏳ 00:00<? | ?it/s

The flow request timed out. Attempt 0, current timeout 10.


running tasks |████      | 8/20 (40.0%) | ⏳ 00:49<00:53 |  4.49s/it

The flow request timed out. Attempt 0, current timeout 10.


running tasks |██████████| 20/20 (100.0%) | ⏳ 02:01<00:00 |  6.07s/it

  arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (04/29/25 04:01 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|           20 |       20 |          0 |



🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.57s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.56s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.22s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.39s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.25s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.33s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.14s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.36s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.38s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00

  arize.utils.logging | INFO | ✅ All evaluators completed.





('RXhwZXJpbWVudDo5OTg0OmNhWVE=',
                id                            example_id  \
 0   EXP_ID_e5570c  fd618731-c84c-4b19-a3cc-1926cfaf46e9   
 1   EXP_ID_ddc0f7  cb30a0b3-25fc-4e02-83ef-d24e0854bfe0   
 2   EXP_ID_1262a8  652d1890-e423-48d4-82fb-f6b81aec92d2   
 3   EXP_ID_0f323c  ca6ebe6d-14b8-4636-9248-73bc5e600a2a   
 4   EXP_ID_1727ab  e210d864-6a21-4988-800e-6bd2b584dcc8   
 5   EXP_ID_39234c  5b7a3b5d-8c68-4796-a2cf-6f14c3cc4f4a   
 6   EXP_ID_acc5f4  48a4e6eb-205e-4574-808c-e92d59640f84   
 7   EXP_ID_3a5b8a  b025e287-4ad4-43bd-8255-56f185832e92   
 8   EXP_ID_4caef2  768a00d8-56a5-496a-b42b-d131ed10d3e8   
 9   EXP_ID_8a95fb  551de8b8-5b29-47dd-a76e-49f95ad528ed   
 10  EXP_ID_adc3e7  4b1762ff-b7af-4e46-b6b8-1bc4f4284d39   
 11  EXP_ID_d18163  b1f7b3ff-dee6-4040-8f8a-75719fef3cff   
 12  EXP_ID_43e51a  9e0975bf-5989-4da1-8088-86dc6a88a69b   
 13  EXP_ID_9c7699  165c43f8-1ea6-4f2d-99b4-b81b32640da5   
 14  EXP_ID_8c87d5  e0089d72-aabb-4f96-bf65-2e6d9c4d75cc   
 15  EX

## Modify the flow from code instead of in the Langflow frontend

We can also script changes to flows so we can write a batch of tests to try out many different configurations.

In this case, we'll change the LLM used by the Orchestrating Agent to `gpt-4.1-mini`.

In Langflow, you can find this by clicking the **Publish** button in the upper right, then selecting **API Access** and clicking the **Tweaks** button.

Select the components you want to modify, and then close the Tweaks section to see updated code to call the API with these changes.


![How to set tweaks](img/tweaks_ide.png)

In [57]:
# override the model used by the agent component; task() reads the global TWEAKS
TWEAKS = {
    "Agent-C1GU0": {
        "model_name": "gpt-4.1-mini"
    }
}

arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=squad_qa_20,
    task=task,
    evaluators=[answer_correctness],
    experiment_name="astra_v3_small_n3_gpt41mini_concise",
    exit_on_error=True
)

  arize.utils.logging | INFO | 🧪 Experiment started.


🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
running tasks |██████████| 20/20 (100.0%) | ⏳ 02:36<00:00 |  7.81s/it

  arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (04/29/25 04:43 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|           20 |       20 |          0 |



🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.27s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.14s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.19s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.07s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.42s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.30s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.49s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.47s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.41s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00

  arize.utils.logging | INFO | ✅ All evaluators completed.





('RXhwZXJpbWVudDo5OTk1OnBQY1Q=',
                id                            example_id  \
 0   EXP_ID_bceea2  fd618731-c84c-4b19-a3cc-1926cfaf46e9   
 1   EXP_ID_2af667  cb30a0b3-25fc-4e02-83ef-d24e0854bfe0   
 2   EXP_ID_6f452c  652d1890-e423-48d4-82fb-f6b81aec92d2   
 3   EXP_ID_8ffdd6  ca6ebe6d-14b8-4636-9248-73bc5e600a2a   
 4   EXP_ID_abd09a  e210d864-6a21-4988-800e-6bd2b584dcc8   
 5   EXP_ID_3bc1fa  5b7a3b5d-8c68-4796-a2cf-6f14c3cc4f4a   
 6   EXP_ID_daa7c7  48a4e6eb-205e-4574-808c-e92d59640f84   
 7   EXP_ID_34b6e5  b025e287-4ad4-43bd-8255-56f185832e92   
 8   EXP_ID_acaf12  768a00d8-56a5-496a-b42b-d131ed10d3e8   
 9   EXP_ID_73fecf  551de8b8-5b29-47dd-a76e-49f95ad528ed   
 10  EXP_ID_292c23  4b1762ff-b7af-4e46-b6b8-1bc4f4284d39   
 11  EXP_ID_5d8461  b1f7b3ff-dee6-4040-8f8a-75719fef3cff   
 12  EXP_ID_b030c0  9e0975bf-5989-4da1-8088-86dc6a88a69b   
 13  EXP_ID_b75547  165c43f8-1ea6-4f2d-99b4-b81b32640da5   
 14  EXP_ID_185367  e0089d72-aabb-4f96-bf65-2e6d9c4d75cc   
 15  EX
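
To try several configurations in one batch, the tweaks dictionary and the experiment name can both be generated from a list of model names. The following is a minimal sketch: the component ID `Agent-C1GU0` comes from this notebook's flow and will differ in yours, while the helper `build_runs` and the name prefix are hypothetical.

```python
AGENT_COMPONENT_ID = "Agent-C1GU0"  # from this notebook's flow; yours will differ

def build_runs(model_names, prefix="astra_v3_small"):
    """Return (experiment_name, tweaks) pairs, one per model name."""
    runs = []
    for model in model_names:
        tweaks = {AGENT_COMPONENT_ID: {"model_name": model}}
        # drop dots and dashes so the model name reads cleanly inside an experiment name
        safe = model.replace(".", "").replace("-", "")
        runs.append((f"{prefix}_{safe}", tweaks))
    return runs

runs = build_runs(["gpt-4o-mini", "gpt-4.1-mini"])
for name, tweaks in runs:
    print(name, tweaks)
    # In the notebook, each iteration would set the global TWEAKS to `tweaks`
    # and call arize_client.run_experiment(...) with experiment_name=name.
```

Keeping the experiment name derived from the configuration makes it easier to tell runs apart later in the Arize UI.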