# Measure the accuracy of an agentic app with Arize and Langflow

In this notebook we'll step through how to use the open source tools [Arize](https://arize.com/) and [Langflow](https://www.langflow.org/) to measure the accuracy of an agentic application.

## Prepping a Dataset

Let's start by gathering some example data from a standard question and answer benchmark. We'll use the Stanford Question Answering Dataset (SQuAD), available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).


In [1]:
import requests
import json

# Download the file
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'
response = requests.get(url)

# Check if the download was successful
if response.status_code == 200:
    # Save the file locally
    with open('dev-v2.0.json', 'w') as file:
        json.dump(response.json(), file)
    print("File downloaded successfully")
    
    # Load the data
    data = response.json()
    print(f"Dataset loaded with {len(data['data'])} articles")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully
Dataset loaded with 35 articles


In [3]:
# get all articles from the dataset
docs = []
for record in data["data"]:
    text = ""
    for paragraph in record["paragraphs"]:
        text += paragraph["context"] + "\n"
    docs.append(text)

In [4]:
print(f"Number of wikipedia articles in dataset: {len(docs)}.")

Number of wikipedia articles in dataset: 35.
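
The loop above relies on the nested layout of the SQuAD file: articles contain paragraphs, and each paragraph carries a `context` string plus a list of `qas`. The sketch below walks a tiny hand-written record in that shape. The record contents are made up for illustration and are not taken from the dataset; the point is only to show where the text and the answers sit, and that unanswerable questions in v2.0 come with an empty `answers` list.

```python
# Hand-written toy record in the SQuAD v2.0 layout (contents are illustrative only).
toy = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                            "is_impossible": False,
                        },
                        {
                            "question": "What is the capital of Atlantis?",
                            "answers": [],  # unanswerable questions have no answers
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ]
}

# One text blob per article: join the paragraph contexts with newlines.
toy_docs = [
    "\n".join(p["context"] for p in article["paragraphs"])
    for article in toy["data"]
]

# Keep only answerable questions, paired with their first reference answer.
toy_pairs = [
    (qa["question"], qa["answers"][0]["text"])
    for article in toy["data"]
    for p in article["paragraphs"]
    for qa in p["qas"]
    if qa["answers"]
]

print(toy_docs)
print(toy_pairs)
```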


In [6]:
# get all questions and answers for each article
qandas = []
for record in data["data"]:
    for paragraph in record["paragraphs"]:
        for qa in paragraph["qas"]:
            if len(qa["answers"]) > 0:
                qandas.append((qa["question"], qa["answers"][0]["text"]))

# there are about 6000 Q&A pairs in this data set.  
# Let's shorten to 100 to make it easier to run our tests
import random
samples = random.sample(qandas, 100)

In [None]:
# Let's look at a couple examples of questions and answers
print(samples[0])
print(samples[1])
print(samples[2])

('What type of fossils were found in China?', 'Cambrian sessile frond-like fossil Stromatoveris')
('Where do pharmacists acquire more preparation following pharmacy school?', 'a pharmacy practice residency')
('What contributed to the decreased inequality between trained and untrained workers?', 'period of compression')
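
Because `random.sample` is called without a seed, rerunning the sampling cell picks a different set of 100 pairs each time, which makes accuracy numbers hard to compare across runs. A minimal sketch of one way to make the sample repeatable is shown below. The helper name `sample_pairs`, the seed value, and the stand-in data are assumptions for illustration.

```python
import random

def sample_pairs(pairs, k, seed=42):
    """Return k pairs chosen reproducibly; `seed` is an arbitrary illustrative value."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return rng.sample(pairs, k)

# Stand-in data; in the notebook this would be `qandas`.
demo_pairs = [(f"question {i}", f"answer {i}") for i in range(1000)]

first = sample_pairs(demo_pairs, 100)
second = sample_pairs(demo_pairs, 100)
print(first == second)  # same seed, same sample
```

Saving the sampled pairs to disk, as done later in this notebook, is another way to pin down the evaluation set.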


## Install modules needed for the rest of this example

We'll need the following modules to be available:

- `langflow` for rapidly designing AI workflows
- `arize-phoenix` to run evals
- `pandas` to format the data
- `openai` for access to LLMs and embedding models

If you don't already have these installed, go ahead and run the install commands below.
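
To see what is already installed before running anything, a quick check along these lines may help. This is a sketch: the helper `missing_modules` is hypothetical, and the list holds the import names used in this notebook, which can differ from the pip package names (for example, the `phoenix` module comes from the `arize-phoenix` package).

```python
import importlib.util

def missing_modules(names):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used later in the notebook (not necessarily the pip package names).
needed = ["pandas", "openai", "phoenix", "arize", "dotenv", "requests"]
print("Missing:", missing_modules(needed))
```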

In [12]:
!pip install uv


[notice] A new release of pip is available: 23.3.2 -> 25.0.1
[notice] To update, run: pip install --upgrade pip


In [13]:
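# uv requires a virtual environment to run
# note: each `!` command runs in its own subshell, so the activation below
# does not carry over to later cells
!uv venv
!source .venv/bin/activate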

Using CPython 3.11.7 interpreter at: /opt/homebrew/opt/python@3.11/bin/python3.11
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate


In [None]:
# install all the modules for the rest of the notebook
!uv pip install langflow pandas openai arize arize-phoenix python-dotenv -U

## Loading our ground truth data to Arize

Now that we have a set of questions and answers, we can use them to evaluate the quality of our agentic application.

Let's upload this dataset to Phoenix.

In [8]:
# let's start by creating a dataframe with our Q&A pairs and save a copy
import os
import pandas as pd
df = pd.DataFrame(samples, columns=["question", "answer"])

# only run these lines if you want to resave the dataframe
os.makedirs("data", exist_ok=True)  # make sure the target folder exists
df.to_csv("data/qa_pairs.csv", index=False)

In [2]:
# or just load from the csv provided
import pandas as pd
df = pd.read_csv("data/qa_pairs.csv")
with pd.option_context('display.max_colwidth', None):
    display(df.head(3))

| | question | answer |
|---|---|---|
| 0 | What type of fossils were found in China? | Cambrian sessile frond-like fossil Stromatoveris |
| 1 | Where do pharmacists acquire more preparation following pharmacy school? | a pharmacy practice residency |
| 2 | What contributed to the decreased inequality between trained and untrained workers? | period of compression |


In [12]:
# now let's create a dataset in Arize
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

import os
from dotenv import load_dotenv
load_dotenv()

ARIZE_API_KEY = os.getenv("ARIZE_API_KEY")
ARIZE_SPACE_ID = os.getenv("ARIZE_SPACE_ID")

client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)

dataset_id = client.create_dataset(
    space_id=ARIZE_SPACE_ID, 
    dataset_name="squad_dataset",
    dataset_type=GENERATIVE,
    data=df
)


RuntimeError: Failed to create dataset: name=squad_dataset, type=1 for space_id=U3BhY2U6MTg3OTg6RDY3Vg==
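
The `RuntimeError` above does not say what went wrong. One thing that is cheap to rule out first is a missing credential, since `os.getenv` silently returns `None` when a variable is not set. The sketch below shows one way to fail early with a clearer message. The helper `require_env` is hypothetical and not part of the Arize SDK, and a missing variable is only one possible cause of the error.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for each variable, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Demo with a plain dict standing in for the real environment.
fake_env = {"ARIZE_API_KEY": "demo-key", "ARIZE_SPACE_ID": ""}
try:
    require_env(["ARIZE_API_KEY", "ARIZE_SPACE_ID"], env=fake_env)
except KeyError as err:
    print(err)
```

In the notebook, the same check could be run against the real environment right after `load_dotenv()`.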

In [23]:
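# for speed, let's also make a smaller dataset and upload it to Phoenix
# (assumes a Phoenix server is running locally, as the output URL below suggests)
import phoenix as px

client = px.Client()  # Phoenix client; this rebinds `client` from the Arize client above
df_small = df.head(20)
dataset = client.upload_dataset(
    dataframe=df_small,
    dataset_name="squad-dev-v2.0-x-small",
    input_keys=["question"],
    output_keys=["answer"],
)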

📤 Uploading dataset...
💾 Examples uploaded: http://localhost:6006/datasets/RGF0YXNldDoy/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246Mg==


## Run Langflow Flows from code

In this section, we define a method to use the Langflow runtime to execute flows and return the results.

You can find details on how to call these methods by clicking the API button in the upper right corner of the Langflow UI when a flow is open and you are editing it.


In [24]:
# code to run a flow in Langflow
# trigger flow via API
import requests

BASE_API_URL = "http://127.0.0.1:7860"

def run_flow(input: str, flow_id: str, input_type: str = "chat", tweaks: dict | None = None):
    """Run a Langflow flow via the REST API and return the HTTP response.

    Retries on timeouts (doubling the timeout each time) and on other request
    errors; raises RuntimeError if every attempt fails.
    """
    api_url = f"{BASE_API_URL}/api/v1/run/{flow_id}"

    payload = {
        "input_value": input,
        "output_type": "chat",
        "input_type": input_type,
    }

    if tweaks:
        payload["tweaks"] = tweaks

    timeout = 10
    attempts = 6
    for i in range(attempts):
        try:
            response = requests.post(api_url, json=payload, timeout=timeout)
            if response.status_code != 200:
                raise requests.exceptions.RequestException(f"Status code: {response.status_code}")
            return response

        except requests.exceptions.Timeout:
            print(f"The flow request timed out. Attempt {i + 1} of {attempts}, timeout was {timeout}s.")
            timeout *= 2
        except requests.exceptions.RequestException as e:
            print(e)

    # every attempt failed: fail loudly instead of silently returning None
    raise RuntimeError(f"Flow {flow_id} failed after {attempts} attempts.")
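
To get a feel for how long `run_flow` can block on a single input, the timeout schedule can be worked out by hand. The sketch below assumes the worst case where every attempt times out, using the starting timeout and attempt count from the function above. Treat the total as a rough figure rather than an exact wall-clock bound.

```python
initial_timeout = 10  # seconds, as in run_flow
attempts = 6

# Timeout used on each attempt when every previous attempt timed out.
schedule = [initial_timeout * 2**i for i in range(attempts)]
print(schedule)       # [10, 20, 40, 80, 160, 320]
print(sum(schedule))  # 630 seconds in total
```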

## Let's load our Wikipedia data into the vector database with a flow in Langflow

In [None]:
data_load_flow_id = "d469424c-96da-44ae-b018-9b044b805144"
for i, doc in enumerate(docs):
    run_flow(doc, data_load_flow_id, input_type="text")
    print(f"Completed processing doc {i}.")

## Run an experiment with Langflow and Arize Phoenix

1. Define a task to run a flow on the dataset
2. Create an LLM to act as a judge
3. Define an evaluator to measure the accuracy of the results
4. Run the experiment and log the results to Arize Phoenix
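
Before wiring in Langflow and an LLM judge, the shape of these steps can be sketched offline. In the toy version below, the task is a lookup table standing in for the flow, and the evaluator is an exact-match check standing in for the judge. All names and data here are hypothetical and only illustrate how per-example scores roll up into an accuracy figure.

```python
def toy_task(row):
    """Stand-in for the Langflow call: answers from a tiny lookup table."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    return canned.get(row["question"], "")

def toy_evaluator(output, expected):
    """Stand-in for the LLM judge: 1 on case-insensitive exact match, else 0."""
    return int(output.strip().lower() == expected["answer"].strip().lower())

toy_dataset = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# Run the task on every row, score each output, then average the scores.
scores = [toy_evaluator(toy_task(row), row) for row in toy_dataset]
accuracy = sum(scores) / len(scores)
print(scores, accuracy)  # [1, 0] 0.5
```

An exact-match check is much stricter than an LLM judge, which is why the rest of this section uses a judge model instead.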

In [29]:
CHAT_FLOW_ID = "feab8feb-9f27-4132-8aa6-265962e19144"


def task(dataset_row) -> str:
    question = dataset_row["question"]
    response = run_flow(question, CHAT_FLOW_ID)
    text = response.json()['outputs'][0]['outputs'][0]['results']['message']["text"]
    return text   
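
The chained indexing in `task` fails with a bare `KeyError` or `IndexError` if a flow returns a different shape, for example when the flow has no chat output. A small defensive wrapper, sketched below, gives a clearer error. The helper `extract_message_text` is hypothetical, and the payload is a hand-written stand-in that mirrors only the keys used above.

```python
def extract_message_text(payload: dict) -> str:
    """Pull the chat message text out of a run response shaped like the one above."""
    try:
        return payload["outputs"][0]["outputs"][0]["results"]["message"]["text"]
    except (KeyError, IndexError, TypeError) as err:
        raise ValueError(f"Unexpected response shape: {err!r}") from err

# Hand-written stand-in for response.json().
fake_payload = {
    "outputs": [
        {"outputs": [{"results": {"message": {"text": "Paris"}}}]}
    ]
}
print(extract_message_text(fake_payload))  # Paris
```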

In [26]:
import json
import os

import openai
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# read the OpenAI API key from a local JSON file
with open("openai_key.json", "r") as file:
    keys = json.load(file)

openai_api_key = keys.get("openai_api_key")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

# LLM that will act as the judge
judge_model = OpenAIModel(
    model="gpt-4o-mini",
    temperature=0.0,
)




In [13]:
print(QA_PROMPT_TEMPLATE)


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.
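
The template has three placeholders, `{input}`, `{reference}` and `{output}`, and asks the judge for a one-word label. The sketch below illustrates both ideas with plain Python: filling a shortened, hand-written template, and mapping a raw reply onto the allowed labels. The `snap_to_rail` helper and its `"unparseable"` fallback are assumptions for illustration, not Phoenix's actual implementation.

```python
# Shortened, hand-written template using the same three placeholders.
template = "[Question]: {input}\n[Reference]: {reference}\n[Answer]: {output}"

prompt = template.format(
    input="What is the capital of France?",
    reference="Paris",
    output="The capital of France is Paris.",
)
print(prompt)

def snap_to_rail(raw: str, rails: list[str], default: str = "unparseable") -> str:
    """Map a raw judge reply onto one of the allowed labels (simplified illustration)."""
    cleaned = raw.strip().strip('."').lower()
    return cleaned if cleaned in rails else default

rails = ["correct", "incorrect"]
print(snap_to_rail(" Correct. ", rails))  # correct
print(snap_to_rail("maybe", rails))       # unparseable
```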



In [27]:
from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="Answer Correctness", kind="LLM")
def answer_correctness(input, output, expected) -> int:
    # one-row dataframe with the column names the QA template expects
    df_in = pd.DataFrame({
        "input": input["question"],
        "output": output,
        "reference": expected["answer"]
    }, index=[0])

    # allowed labels for the judge
    rails = list(QA_PROMPT_RAILS_MAP.values())

    eval_df = llm_classify(
        data=df_in,
        template=QA_PROMPT_TEMPLATE,
        model=judge_model,
        rails=rails,
        provide_explanation=True,
        run_sync=True,
    )

    # convert the judge's label into a numeric score
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0

    return score


In [None]:
from phoenix.experiments import run_experiment

run_experiment(
    dataset,
    task=task,
    evaluators=[answer_correctness],
    experiment_name="Experiment 6 - long",
)