## LlamaIndex Agents + Ground Truth & Custom Evaluations

In this example, we build an agent-based app with LlamaIndex to answer questions with the help of Yelp. We'll evaluate it using a few different feedback functions (some custom, some out-of-the-box).

The first set of feedback functions completes what we call the non-hallucination triad. However, because we're dealing with agents here, we've added a fourth leg (query translation) to cover the additional interaction between the query planner and the agent. This combination provides a foundation for eliminating hallucination in LLM applications.

1. Query Translation - The first step. Here we compare the similarity of the original user query to the query sent to the agent. This ensures that we're providing the agent with the correct question.
2. Context or QS Relevance - Next, we compare the relevance of the context provided by the agent back to the original query. This ensures that we're providing context for the right question.
3. Groundedness - Third, we ensure that the final answer is supported by the context. This ensures that the LLM is not extending beyond the information provided by the agent.
4. Question Answer Relevance - Last, we want to make sure that the final answer provided is relevant to the user query. This last step confirms that the answer is not only supported but also useful to the end user.

In this example, we'll add two additional feedback functions.

5. Ratings usage - evaluate if the summarized context uses ratings as justification. Note: this may not be relevant for all queries.
6. Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.

Last, we'll compare the evaluation of this app against a standalone LLM. May the best bot win?

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/expositional/frameworks/llama_index/llama_index_agents.ipynb)

### Install TruLens and LlamaIndex

In [8]:
! pip install trulens_eval==0.24.0 llama_index==0.10.33 llama-index-tools-yelp==0.1.2 openai

Collecting trulens_eval==0.24.0
  Using cached trulens_eval-0.24.0-py3-none-any.whl (662 kB)
Installing collected packages: trulens_eval
Successfully installed trulens_eval-0.24.0


In [None]:
# If running from github repo, uncomment the below to setup paths.
#from pathlib import Path
#import sys
#trulens_path = Path().cwd().parent.parent.parent.parent.resolve()
#sys.path.append(str(trulens_path))

In [1]:
# Setup OpenAI Agent
import llama_index
from llama_index.agent.openai import OpenAIAgent
import openai

import os

In [2]:
# Set your API keys. If they are already set as environment variables, you can skip these steps.

os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.environ["OPENAI_API_KEY"]

os.environ["YELP_API_KEY"] = "..."
os.environ["YELP_CLIENT_ID"] = "..."

# If your keys are already set as environment variables, use this check instead:
# from trulens_eval.keys import check_keys
# check_keys("OPENAI_API_KEY", "YELP_API_KEY", "YELP_CLIENT_ID")
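
Before creating the agent, it can help to confirm that none of the keys were left empty or as the `"..."` placeholder. The sketch below uses only the standard library; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of TruLens.

```python
import os

# Hypothetical helper (not a TruLens API): the key names come from the cell above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "YELP_API_KEY", "YELP_CLIENT_ID")

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset, empty, or still the "..." placeholder."""
    env = os.environ if env is None else env
    return [name for name in required if env.get(name, "").strip() in ("", "...")]

# Lists any keys that still need a real value in the current environment.
print(missing_keys())
```

An empty list means all three variables hold a non-placeholder value; it does not verify that the keys are actually valid with OpenAI or Yelp.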

### Set up our LlamaIndex App

For this app, we will use a tool from LlamaIndex to connect to Yelp and allow the agent to search for businesses and fetch reviews.

In [3]:
# Import and initialize our tool spec
from llama_index.tools.yelp.base import YelpToolSpec
from llama_index.core.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

# Add Yelp API key and client ID
tool_spec = YelpToolSpec(
    api_key=os.environ.get("YELP_API_KEY"),
    client_id=os.environ.get("YELP_CLIENT_ID")
)

In [4]:
gordon_ramsay_prompt = "You answer questions about restaurants in the style of Gordon Ramsay, often insulting the asker."

In [5]:
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools([
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list()
    ],
    verbose=True,
    system_prompt=gordon_ramsay_prompt
)

### Create a standalone GPT-3.5 for comparison

In [6]:
client = openai.OpenAI()

chat_completion = client.chat.completions.create

In [7]:
from trulens_eval.tru_custom_app import TruCustomApp, instrument

class LLMStandaloneApp():
    @instrument
    def __call__(self, prompt):
        return chat_completion(
            model="gpt-3.5-turbo",
            messages=[
                    {"role": "system", "content": gordon_ramsay_prompt},
                    {"role": "user", "content": prompt}
                ]
        ).choices[0].message.content

llm_standalone = LLMStandaloneApp()



## Evaluation and Tracking with TruLens

In [8]:
# imports required for tracking and evaluation
from trulens_eval import Feedback, OpenAI, Tru, TruLlama, Select, OpenAI as fOpenAI
from trulens_eval.feedback import GroundTruthAgreement

tru = Tru()
# tru.reset_database() # if needed

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


## Evaluation setup

To set up our evaluation, we'll first create two new custom feedback functions: query_translation_score and ratings_usage. These are straightforward prompts to the OpenAI API.

In [10]:
class Custom_OpenAI(OpenAI):
    def query_translation_score(self, question1: str, question2: str) -> float:
        prompt = f"Your job is to rate how similar two questions are on a scale of 1 to 10. Respond with the number only. QUESTION 1: {question1}; QUESTION 2: {question2}"
        return self.generate_score_and_reason(system_prompt = prompt)

    def ratings_usage(self, last_context: str) -> float:
        prompt = f"Your job is to respond with a '1' if the following statement mentions ratings or reviews, and a '0' if not. STATEMENT: {last_context}"
        return self.generate_score_and_reason(system_prompt = prompt)
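
Both custom functions depend on the provider turning the model's text reply into a number. As a rough illustration only (this is not the TruLens implementation), the sketch below shows one way a "number only" reply on a 1 to 10 scale could be parsed and normalized to the 0 to 1 range. `parse_rating` is a hypothetical helper, and dividing by the scale is an assumption about how such a rating might be normalized.

```python
import re

def parse_rating(reply: str, scale: float = 10.0) -> float:
    """Extract the first integer in `reply`, clamp it to [0, scale], and normalize to [0, 1]."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no rating found in reply: {reply!r}")
    rating = min(max(int(match.group()), 0), int(scale))
    return rating / scale

print(parse_rating("8"))           # a bare number, as the prompt requests
print(parse_rating("Rating: 10"))  # tolerate a short prefix around the number
```

Raising on a reply with no digits, rather than silently returning 0, makes it easier to spot cases where the model ignored the "number only" instruction.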

Now that we have all of our feedback functions available, we can instantiate them. For many of our evals, we want to check intermediate parts of our app, such as the query passed to the Yelp app or the summarization of the Yelp content. We'll do so here using Select.

In [12]:
# unstable: perhaps reduce temperature?

custom_provider = Custom_OpenAI()
# Input to tool based on trimmed user input.
f_query_translation = Feedback(
    custom_provider.query_translation_score,
    name="Query Translation") \
.on_input() \
.on(Select.Record.app.query[0].args.str_or_query_bundle)

f_ratings_usage = Feedback(
    custom_provider.ratings_usage,
    name="Ratings Usage") \
.on(Select.Record.app.query[0].rets.response)

# Result of this prompt: Given the context information and not prior knowledge, answer the query.
# Query: address of Gumbo Social
# Answer: "
provider = fOpenAI()
# Context relevance between question and last context chunk (i.e. summary)
f_context_relevance = Feedback(
    provider.context_relevance,
    name="Context Relevance") \
.on_input() \
.on(Select.Record.app.query[0].rets.response)

# Groundedness
f_groundedness = (
    Feedback(
    provider.groundedness_measure_with_cot_reasons,
    name="Groundedness") \
    .on(Select.Record.app.query[0].rets.response) \
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(
    provider.relevance,
    name="Answer Relevance"
).on_input_output()

✅ In Query Translation, input question1 will be set to __record__.main_input or `Select.RecordInput` .
✅ In Query Translation, input question2 will be set to __record__.app.query[0].args.str_or_query_bundle .
✅ In Ratings Usage, input last_context will be set to __record__.app.query[0].rets.response .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.query[0].rets.response .
✅ In Groundedness, input source will be set to __record__.app.query[0].rets.response .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
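
Conceptually, a selector such as `app.query[0].args.str_or_query_bundle` names a path into the data recorded for each call: attribute-like steps separated by dots, with `[0]` picking the first recorded call. The toy below is not TruLens code. `resolve` and the `record` dictionary are hypothetical stand-ins, used only to illustrate how such a path could be walked through nested data.

```python
import re

def resolve(data, path):
    """Follow a dotted path with optional [index] steps through nested dicts and lists."""
    current = data
    # Each match is either a name step ("app") or an index step ("[0]").
    for name, index in re.findall(r"([A-Za-z_]\w*)|\[(\d+)\]", path):
        current = current[name] if name else current[int(index)]
    return current

# Hypothetical, heavily simplified stand-in for one recorded app call.
record = {
    "app": {
        "query": [
            {
                "args": {"str_or_query_bundle": "address of Gumbo Social"},
                "rets": {"response": "(summary text returned by the tool)"},
            }
        ]
    }
}

print(resolve(record, "app.query[0].args.str_or_query_bundle"))
print(resolve(record, "app.query[0].rets.response"))
```

In this picture, Query Translation compares the record's main input with the first value, while Ratings Usage, Context Relevance, and Groundedness all read the second.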


In [16]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.2.7-py3-none-any.whl (2.2 MB)
Installing collected packages: langchain-community
Successfully installed langchain-community-0.2.7


### Ground Truth Eval

It's also useful in many cases to do ground truth eval with small golden sets. We'll do so here.

In [13]:
golden_set = [
    {"query": "Hello there mister AI. What's the vibe like at oprhan andy's in SF?", "response": "welcoming and friendly"},
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "I'm in san francisco for the morning, does Juniper serve pastries?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?", "response": "5176 3rd St, San Francisco, CA 94124"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.6/5"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
    {"query": "What's the best diner in Toronto?", "response": "The George Street Diner"}
]

f_groundtruth = Feedback(
    GroundTruthAgreement(golden_set).agreement_measure,
    name="Ground Truth Eval") \
.on_input_output()

✅ In Ground Truth Eval, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Ground Truth Eval, input response will be set to __record__.main_output or `Select.RecordOutput` .
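
One practical detail when building a golden set: assuming the ground-truth lookup matches the incoming prompt against each `query` by exact string equality (an assumption about `GroundTruthAgreement` that is worth verifying for your installed version), a prompt that differs at all from the golden-set entry has no expected response to compare against. The hypothetical helper `find_expected` sketches that behavior on a two-entry sample taken from the set above.

```python
def find_expected(golden_set, prompt):
    """Return the expected response for an exactly matching query, or None if there is none."""
    for entry in golden_set:
        if entry["query"] == prompt:
            return entry["response"]
    return None

sample_golden_set = [
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.6/5"},
]

# An exact match finds the expected response.
print(find_expected(sample_golden_set, "What are the reviews like of Gola in SF?"))
# A differently cased prompt does not match under exact equality.
print(find_expected(sample_golden_set, "what are the reviews like of gola in sf?"))
```

Under that assumption, the first golden-set entry (which begins with "Hello there mister AI.") would only match a prompt that includes the same greeting.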


### Run the dashboard

By running the dashboard before we start making app calls, we can watch the records come in one by one.

In [25]:
tru.run_dashboard(
#     _dev=trulens_path, force=True  # if running from github
)

Starting dashboard ...
npx: installed 22 in 4.115s

Go to this url and submit the ip given here. your url is: https://many-tips-end.loca.lt

Dashboard closed.
Dashboard closed.


RuntimeError: Dashboard failed to start in time. Please inspect dashboard logs for additional information.

### Instrument Yelp App

We can instrument our Yelp app with TruLlama and use the full suite of evals we set up.

In [14]:
tru_agent = TruLlama(agent,
    app_id='YelpAgent',
    tags = "agent prototype",
    feedbacks = [
        f_qa_relevance,
        f_groundtruth,
        f_context_relevance,
        f_groundedness,
        f_query_translation,
        f_ratings_usage
    ]
)

In [15]:
tru_agent.print_instrumented()

Components:
	TruLlama (Other) at 0x7b4165622980 with path __app__
	OpenAIAgent (Other) at 0x7b416b2e41c0 with path __app__.app
	ChatMemoryBuffer (Other) at 0x7b416a3ed480 with path __app__.app.memory
	SimpleChatStore (Other) at 0x7b416b2e42b0 with path __app__.app.memory.chat_store

Methods:
Object at 0x7b416a3ed480:
	<function BaseChatStoreMemory.put at 0x7b416c7d8160> with path __app__.app.memory
	<function BaseMemory.put at 0x7b416c61bbe0> with path __app__.app.memory
Object at 0x7b416b2e41c0:
	<function BaseQueryEngine.query at 0x7b4173691cf0> with path __app__.app
	<function BaseQueryEngine.aquery at 0x7b4173692200> with path __app__.app
	<function AgentRunner.chat at 0x7b416a8040d0> with path __app__.app
	<function AgentRunner.achat at 0x7b416a804430> with path __app__.app
	<function AgentRunner.stream_chat at 0x7b416a8041f0> with path __app__.app
	<function BaseQueryEngine.retrieve at 0x7b4173692320> with path __app__.app
	<function BaseQueryEngine.synthesize at 0x7b4173692290> 

### Instrument Standalone LLM App

Since we don't have insight into OpenAI's inner workings, we cannot run many of the evals on intermediate steps.

We can still do QA relevance on input and output, and check for similarity of the answers compared to the ground truth.

In [16]:
tru_llm_standalone = TruCustomApp(
    llm_standalone,
    app_id="OpenAIChatCompletion",
    tags = "comparison",
    feedbacks=[
        f_qa_relevance,
        f_groundtruth
    ]
)

In [17]:
tru_llm_standalone.print_instrumented()

Components:
	TruCustomApp (Other) at 0x7b4165456de0 with path __app__
	LLMStandaloneApp (Custom) at 0x7b416a064a90 with path __app__.app

Methods:
Object at 0x7b416a064a90:
	<function LLMStandaloneApp.__call__ at 0x7b416a0c92d0> with path __app__.app


### Start using our apps!

In [18]:
prompt_set = [
    "What's the vibe like at oprhan andy's in SF?",
    "What are the reviews like of Gola in SF?",
    "Where's the best pizza in New York City",
    "What's the address of Gumbo Social in San Francisco?",
    "I'm in san francisco for the morning, does Juniper serve pastries?",
    "What's the best diner in Toronto?"
]

In [19]:
for prompt in prompt_set:
    print(prompt)

    with tru_llm_standalone as recording:
        llm_standalone(prompt)
    record_standalone = recording.get()

    with tru_agent as recording:
        agent.query(prompt)
    record_agent = recording.get()

What's the vibe like at oprhan andy's in SF?
Added user message to memory: What's the vibe like at oprhan andy's in SF?
=== Calling Function ===
Calling function: business_search with args: {"location":"San Francisco","term":"Orphan Andy's"}
Got output: Error: 400 Client Error: Bad Request for url: https://api.yelp.com/v3/businesses/search?location=San+Francisco&term=Orphan+Andy%27s

What are the reviews like of Gola in SF?
Added user message to memory: What are the reviews like of Gola in SF?
=== Calling Function ===
Calling function: business_search with args: {"location":"San Francisco","term":"Gola"}
Got output: Error: 400 Client Error: Bad Request for url: https://api.yelp.com/v3/businesses/search?location=San+Francisco&term=Gola

=== Calling Function ===
Calling function: business_search with args: {"location":"San Francisco","term":"Gola Restaurant"}
Got output: Error: 400 Client Error: Bad Request for url: https://api.yelp.com/v3/businesses/search?location=San+Francisco&term=Go