# Measure accuracy of agentic app with Arize and Langflow

In this notebook, we'll step through how to use the open source tools [Arize](https://arize.com/) and [Langflow](https://www.langflow.org/) to measure the accuracy of an agentic application.

## Prepping a Dataset

Let's start by gathering some example data from a standard question and answer benchmark. We'll use the Stanford Question Answering Dataset (SQuAD), available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).


In [1]:
import requests
import json

# Download the file
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'
response = requests.get(url)

# Check if the download was successful
if response.status_code == 200:
    # Parse the JSON body once and reuse it
    data = response.json()

    # Save the file locally
    with open('dev-v2.0.json', 'w') as file:
        json.dump(data, file)
    print("File downloaded successfully")

    print(f"Dataset loaded with {len(data['data'])} articles")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully
Dataset loaded with 35 articles


In [2]:
# get all articles from the dataset
docs = []
for record in data["data"]:
    text = ""
    for paragraph in record["paragraphs"]:
        text += paragraph["context"] + "\n"
    docs.append(text)

In [3]:
print(f"Number of wikipedia articles in dataset: {len(docs)}.")

Number of wikipedia articles in dataset: 35.


In [None]:
# create markdown files for each article for use in Langflow UI
import os
os.makedirs("data/markdown_files", exist_ok=True)  # make sure the output folder exists
for i, doc in enumerate(docs):
    filename = f"data/markdown_files/article_{i}.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(doc)

In [6]:
# get all questions and answers for each article
qandas = []
for record in data["data"]:
    for paragraph in record["paragraphs"]:
        for qa in paragraph["qas"]:
            if len(qa["answers"]) > 0:
                qandas.append((qa["question"], qa["answers"][0]["text"]))

# there are about 6000 Q&A pairs in this data set.  
# Let's shorten to 100 to make it easier to run our tests
import random
samples = random.sample(qandas, 100)
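
SQuAD v2.0 mixes answerable and unanswerable questions, which is why the loop above skips entries whose `answers` list is empty. The following is a minimal sketch of the same extraction on a tiny hand-written record. The `toy` data and the `extract_qa_pairs` helper are made up for illustration, and the fixed seed is an optional addition that makes the sample repeatable.

```python
import random

def extract_qa_pairs(squad: dict) -> list[tuple[str, str]]:
    """Collect (question, first answer) pairs, skipping unanswerable questions."""
    pairs = []
    for record in squad["data"]:
        for paragraph in record["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:  # an empty list means no answer is available
                    pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs

# Hypothetical toy record in the same shape as the SQuAD JSON used above
toy = {
    "data": [{
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [
                {"question": "What is the capital of France?",
                 "answers": [{"text": "Paris"}]},
                {"question": "What is the capital of Atlantis?",
                 "answers": []},
            ],
        }]
    }]
}

toy_pairs = extract_qa_pairs(toy)
print(toy_pairs)

# Seeding a private Random instance makes the sample repeatable across runs
rng = random.Random(42)
toy_sample = rng.sample(toy_pairs, k=1)
```

Using a dedicated `random.Random` instance keeps the seed from affecting other code that relies on the global random state.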

In [None]:
# Let's look at a couple examples of questions and answers
print(samples[0])
print(samples[1])
print(samples[2])

('What type of fossils were found in China?', 'Cambrian sessile frond-like fossil Stromatoveris')
('Where do pharmacists acquire more preparation following pharmacy school?', 'a pharmacy practice residency')
('What contributed to the decreased inequality between trained and untrained workers?', 'period of compression')


## Install modules needed for the rest of this example

We'll need the following modules to be available

- `langflow` for rapidly designing AI workflows
- `arize-phoenix` to run evals
- `pandas` to format the data
- `openai` for access to LLMs and embedding models

If you don't already have these installed, go ahead and run the install commands below.

In [12]:
!pip install uv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [13]:
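# uv requires a virtual environment to run
# note: each `!` command runs in its own subshell, so the `source` line below
# does not keep the environment activated for later cells
!uv venv
!source .venv/bin/activate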

Using CPython 3.11.7 interpreter at: /opt/homebrew/opt/python@3.11/bin/python3.11
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate


In [None]:
# install all the modules for the rest of the notebook
!uv pip install -U langflow pandas openai arize arize-phoenix python-dotenv
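
Before moving on, it can help to confirm that the modules are importable from the running kernel. The sketch below is one way to check, using only the standard library. The `missing_modules` helper is hypothetical, and the list of import names is an assumption based on the imports used in this notebook (import names can differ from pip package names).

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]

# Import names used in this notebook (assumed; adjust to match your own setup)
needed = ["pandas", "openai", "arize", "phoenix", "dotenv", "requests"]
print("Missing modules:", missing_modules(needed))
```

An empty list means every module was found; anything listed still needs to be installed into the environment the kernel is using.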

## Loading our ground truth data to Arize

Now that we have a set of questions and answers, we can use them to evaluate the quality of our agentic application.

Let's upload this dataset to Arize.

In [8]:
# let's start by creating a dataframe with our Q&A pairs and save a copy
import pandas as pd
df = pd.DataFrame(samples, columns=["question", "answer"])

# only run this line if you want to resave the dataframe
df.to_csv("data/qa_pairs.csv", index=False)

In [1]:
# or just load from the csv provided
import pandas as pd
df = pd.read_csv("data/qa_pairs.csv")
with pd.option_context('display.max_colwidth', None):
    display(df.head(3))

Unnamed: 0,question,answer
0,What type of fossils were found in China?,Cambrian sessile frond-like fossil Stromatoveris
1,Where do pharmacists acquire more preparation following pharmacy school?,a pharmacy practice residency
2,What contributed to the decreased inequality between trained and untrained workers?,period of compression


In [6]:
# load credentials from the environment and create the Arize datasets client
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

import os
from dotenv import load_dotenv
load_dotenv()

ARIZE_API_KEY = os.getenv("ARIZE_API_KEY")
ARIZE_SPACE_ID = os.getenv("ARIZE_SPACE_ID")
ARIZE_DEVELOPER_KEY = os.getenv("ARIZE_DEVELOPER_KEY")

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY, developer_key=ARIZE_DEVELOPER_KEY)
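
`os.getenv` returns `None` for unset variables, so a missing key only shows up later as an authentication error. A small guard like the sketch below can surface the problem earlier. The `require_env` helper is hypothetical and not part of the Arize SDK; the example uses a stand-in mapping rather than real credentials.

```python
import os

def require_env(names, env=None):
    """Return the requested variables as a dict, or raise if any are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Example with a stand-in mapping instead of the real environment
fake_env = {"ARIZE_API_KEY": "key", "ARIZE_SPACE_ID": "space"}
print(require_env(["ARIZE_API_KEY", "ARIZE_SPACE_ID"], env=fake_env))
```

Calling `require_env([...])` without the `env` argument checks the real process environment after `load_dotenv()` has run.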



In [None]:
# create the dataset in Arize
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID, 
    dataset_name="squad_dataset",
    dataset_type=GENERATIVE,
    data=df
)






In [None]:
# for speed, let's also make a smaller dataset
df_small = df.head(20)
dataset = arize_client.create_dataset(
    data=df_small,
    dataset_name="squad-dev-v2.0-x-small",
    space_id=ARIZE_SPACE_ID,
    dataset_type=GENERATIVE
)

## Run Langflow Flows from code

In this section, we define a function that uses the Langflow runtime to execute flows and return the results.

To find the API details for a flow, open the flow in the Langflow UI, click the **Publish** button in the upper right corner, and then select **API Access**.

The code below assumes you are running Langflow locally on your machine.


In [None]:
# code to run a flow in Langflow
# trigger flow via API
import requests

BASE_API_URL = "http://127.0.0.1:7860"

def run_flow(input: str, flow_id: str, input_type: str = "chat", tweaks: dict = None):
    """Run a Langflow flow and return the HTTP response.

    Retries on timeouts (doubling the timeout each time) and on other request
    errors, including non-200 status codes. Raises RuntimeError if every
    attempt fails.
    """
    api_url = f"{BASE_API_URL}/api/v1/run/{flow_id}"

    payload = {
        "input_value": input,
        "output_type": "chat",
        "input_type": input_type,
    }

    if tweaks:
        payload["tweaks"] = tweaks

    timeout = 25
    attempts = 6
    for i in range(attempts):
        try:
            response = requests.post(api_url, json=payload, timeout=timeout)
            if response.status_code != 200:
                raise requests.exceptions.RequestException(f"Status code: {response.status_code}")
            return response

        except requests.exceptions.Timeout:
            print(f"The flow request timed out. Attempt {i}, current timeout {timeout}.")
            timeout *= 2
        except requests.exceptions.RequestException as e:
            print(e)

    # every attempt failed: fail loudly instead of silently returning None
    raise RuntimeError(f"Flow {flow_id} failed after {attempts} attempts")
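
The retry loop in `run_flow` doubles the timeout after each timed-out attempt. The sketch below isolates that pattern so it can be tried without a running Langflow server. `call_with_retries` and `fake_send` are hypothetical names, and the built-in `TimeoutError` stands in for `requests.exceptions.Timeout`.

```python
def call_with_retries(send, attempts=6, timeout=25.0):
    """Call send(timeout) until it succeeds, doubling the timeout after each TimeoutError."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(timeout)
        except TimeoutError as exc:
            last_error = exc
            print(f"Attempt {attempt} timed out after {timeout}s")
            timeout *= 2
    raise RuntimeError(f"Gave up after {attempts} attempts") from last_error

# Hypothetical stand-in for the HTTP call: fails until the timeout reaches 100s
seen_timeouts = []

def fake_send(timeout):
    seen_timeouts.append(timeout)
    if timeout < 100:
        raise TimeoutError("too slow")
    return "ok"

result = call_with_retries(fake_send)
print(result, seen_timeouts)
```

Raising after the final attempt, rather than falling through, means the caller never has to handle a `None` result.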

## Run an experiment with Langflow and Arize

1. Define a task to run a flow on the dataset
2. Create an LLM to act as a judge
3. Define an evaluator to measure the accuracy of the results
4. Run the experiment and log the results to Arize

In [2]:
CHAT_FLOW_ID = "796625a6-2526-4e12-b2f8-873d66940016"

def task(dataset_row) -> str:
    question = dataset_row["question"]
    response = run_flow(question, CHAT_FLOW_ID)
    text = response.json()['outputs'][0]['outputs'][0]['results']['message']["text"]
    return text   
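
The long index chain in `task` raises a bare `KeyError` or `IndexError` if the response has a different shape. As a rough sketch, the lookup can be wrapped so that failures are easier to read. `extract_message_text` is a hypothetical helper, and `sample_body` is hand-written to mirror only the path used above; real responses carry more fields.

```python
def extract_message_text(body: dict) -> str:
    """Walk the nested response body and return the chat message text."""
    try:
        return body["outputs"][0]["outputs"][0]["results"]["message"]["text"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Unexpected response shape: {exc!r}") from exc

# Hand-written body that mirrors the path used in task()
sample_body = {
    "outputs": [{"outputs": [{"results": {"message": {"text": "Paris"}}}]}]
}
print(extract_message_text(sample_body))
```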

In [4]:
from dotenv import load_dotenv
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# load environment variables (such as OPENAI_API_KEY) from a local .env file, if present
load_dotenv()

# LLM that acts as the judge
judge_model = OpenAIModel(
    model="gpt-4o-mini",
    temperature=0.0,
)

In [5]:
# let's look at the prompt template
print(QA_PROMPT_TEMPLATE)


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.
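
The template has three placeholders, `{input}`, `{reference}`, and `{output}`, which line up with the column names the evaluator passes in. The sketch below shows the substitution idea using plain `str.format` on a shortened stand-in template. It is not Phoenix's own templating code, and the `output` text is invented for illustration.

```python
# Shortened stand-in for the template printed above; only the placeholders matter here
template = (
    "[Question]: {input}\n"
    "[Reference]: {reference}\n"
    "[Answer]: {output}\n"
)

row = {
    "input": "Where do pharmacists acquire more preparation following pharmacy school?",
    "reference": "a pharmacy practice residency",
    "output": "They can complete a pharmacy practice residency.",  # invented model answer
}

filled = template.format(**row)
print(filled)
```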



In [36]:
#from phoenix.experiments.evaluators import create_evaluator
from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult
import pandas as pd

#@create_evaluator(name="Answer Correctness", kind="LLM")
def answer_correctness(input, output, dataset_row) -> EvaluationResult:

    question = dataset_row["question"]
    answer = dataset_row["answer"]
    df_in = pd.DataFrame({
        "input": question,
        "output": output,
        "reference": answer,
    }, index=[0])
              
    rails = list(QA_PROMPT_RAILS_MAP.values())
    
    eval_df = llm_classify(
        data=df_in,
        template=QA_PROMPT_TEMPLATE,
        model=judge_model,
        rails=rails,
        provide_explanation=True,
        run_sync=True,
    )
    
    label = eval_df["label"][0]
    explanation = eval_df["explanation"][0]
    score = 1 if label == "correct" else 0

    return EvaluationResult(label=label, score=score, explanation=explanation)
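
Because each row is scored 1 for "correct" and 0 otherwise, overall accuracy is the mean of the scores. Here is a minimal sketch of that arithmetic; the `accuracy` helper and the example labels are hypothetical, not output from the experiment.

```python
def accuracy(labels):
    """Fraction of labels equal to "correct"; every other label scores 0."""
    if not labels:
        raise ValueError("No labels to score")
    scores = [1 if label == "correct" else 0 for label in labels]
    return sum(scores) / len(scores)

# Hypothetical judge labels for four dataset rows
labels = ["correct", "incorrect", "correct", "correct"]
print(f"Accuracy: {accuracy(labels):.2f}")
```

With these four labels, three are correct, so the result is 0.75.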


In [8]:
# dataset IDs - get these from the Arize UI
squad_qa_20 = "RGF0YXNldDozNDMxNjowWFIx"  
squad_qa_100 = "RGF0YXNldDozNDMxNTpSZ21u"

In [38]:
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=squad_qa_100,
    task=task,
    evaluators=[answer_correctness],
    experiment_name="astra_v3_small_n4_gpt4o_try3",
    exit_on_error=True
)

arize.utils.logging | INFO | 🧪 Experiment started.

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.

The flow request timed out. Attempt 0, current timeout 10.

running tasks |██████████| 100/100 (100.0%) | ⏳ 13:08<00:00 |  6.67s/it

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (04/28/25 08:23 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|          100 |      100 |          0 |

arize.utils.logging | INFO | ✅ All evaluators completed.





('RXhwZXJpbWVudDo5OTA1OlB3MlA=',
                id                            example_id  \
 0   EXP_ID_05d1f6  fd618731-c84c-4b19-a3cc-1926cfaf46e9   
 1   EXP_ID_e17228  cb30a0b3-25fc-4e02-83ef-d24e0854bfe0   
 2   EXP_ID_48e019  652d1890-e423-48d4-82fb-f6b81aec92d2   
 3   EXP_ID_51c6fd  ca6ebe6d-14b8-4636-9248-73bc5e600a2a   
 4   EXP_ID_919634  e210d864-6a21-4988-800e-6bd2b584dcc8   
 ..            ...                                   ...   
 95  EXP_ID_c8f322  5d3c5fac-d7fd-4b4d-a955-1fc48636906e   
 96  EXP_ID_46f4a8  4f309848-a47c-410b-9ba1-eaa387d442d2   
 97  EXP_ID_4f2119  b6ba039c-28df-400a-9884-93b20236a00f   
 98  EXP_ID_cf8489  3ae1c206-29d2-4886-acc1-e0d5dd4be250   
 99  EXP_ID_873646  92038ae5-482d-4d90-9906-76d0a0da015c   
 
                                                result  \
 0   In China, significant fossils have been discov...   
 1   Pharmacists can acquire additional preparation...   
 2   The decreased inequality between trained and u...   
 3   In South