### RAG Triad of metrics
This section sets up the three metrics, often called the RAG triad, used to evaluate retrieval-augmented generation (RAG) applications:
- Answer relevance
- Context relevance
- Groundedness
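
As a rough orientation, each metric in the triad compares a different pair among the user query, the retrieved context, and the generated response. The sketch below only encodes that pairing, which matches the selectors used on the feedback functions later in this notebook. The names `TRIAD` and `inputs_for` are illustrative and not part of any library.

```python
# Illustrative only: which two pieces of a RAG interaction each metric compares.
TRIAD = {
    "Answer relevance": ("query", "response"),
    "Context relevance": ("query", "context"),
    "Groundedness": ("context", "response"),
}


def inputs_for(metric: str) -> tuple[str, str]:
    """Return the pair of RAG artefacts that the given metric compares."""
    return TRIAD[metric]


for metric, (left, right) in TRIAD.items():
    print(f"{metric}: {left} vs. {right}")
```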


In [2]:
import os
from llama_index.readers.file.base import SimpleDirectoryReader
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
documents = SimpleDirectoryReader(input_files=["./MIV2 - LLM paper.pdf"]).load_data()

In [5]:
len(documents), print(documents[0].text)

Human-Robot interaction through joint robot planning
with Large Language Models
Kosi Asuzu1*
1*Birmingham City University.
Abstract
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalisation capa-
bilities, expanding their utility beyond natural language processing into various applications.
Leveraging extensive web knowledge, these models generate meaningful text data in response
to user-defined prompts, introducing a novel mode of interaction with software applications.
Recent investigations have extended the generalizability of LLMs into the domain of robotics,
addressing challenges in existing robot learning techniques such as reinforcement learning and
imitation learning. This paper explores the application of LLMs for robot planning as an alter-
native approach to generate high-level robot plans based on prompts provided to the language
model. The proposed methodology facilitates continuous user interaction and adjustment of task
execution plans in real-ti

(17, None)

In [6]:
from llama_index.schema import Document

document = Document(text="\n\n".join(doc.text for doc in documents))

In [8]:
document.text[:100]

'Human-Robot interaction through joint robot planning\nwith Large Language Models\nKosi Asuzu1*\n1*Birmi'

In [None]:
from utils import build_sentence_window_index

from llama_index.llms import OpenAI

llm = OpenAI(model="mistralai/Mistral-7B-Instruct-v0.2", temperature=0.1)

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index"
)

In [None]:
from utils import get_sentence_window_query_engine

sentence_window_engine = \
get_sentence_window_query_engine(sentence_index)

In [None]:
output = sentence_window_engine.query(
    "How do you create your AI portfolio?")
output.response

### Using feedback functions for evaluations
This section uses feedback functions to evaluate the outputs of LLMs.
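
To make the idea concrete: a feedback function takes pieces of text from an app run and returns a score, typically between 0 and 1. The sketch below is a deliberately simple stand-in based on word overlap. It is not how the provider used in this notebook scores relevance (that is delegated to an LLM), and `toy_relevance` is a hypothetical name.

```python
import re


def toy_relevance(question: str, answer: str) -> float:
    """Fraction of distinct question words that also appear in the answer."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    a_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not q_words:
        return 0.0  # nothing to compare against
    return len(q_words & a_words) / len(q_words)


score = toy_relevance(
    "What do LLMs plan?",
    "LLMs generate high-level robot plans.",
)
print(score)
```

Any callable with this shape (text in, score out) plays the same role as the provider methods passed to `Feedback` below; only the quality of the judgement differs.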

#### Answer Relevance

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from trulens_eval import OpenAI as fOpenAI

provider = fOpenAI()

In [None]:
from trulens_eval import Feedback

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

In [None]:
from trulens_eval import TruLlama

context_selection = TruLlama.select_source_nodes().node.text

#### Context Relevance

In [None]:
import numpy as np

f_qs_relevance = (
    Feedback(provider.qs_relevance,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

In [None]:
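import numpy as np

# Same metric as the previous cell, but using the provider variant that also
# returns chain-of-thought reasons. This reassignment replaces f_qs_relevance.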
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)
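
The context selection yields one text per retrieved node, so the feedback function is scored once per node and `.aggregate(np.mean)` reduces those scores to a single number. The sketch below mimics that flow with a hypothetical word-overlap scorer (`toy_chunk_score`) and the standard library's `mean` in place of `np.mean`; it illustrates the aggregation step only, not the provider's scoring.

```python
from statistics import mean


def toy_chunk_score(question: str, chunk: str) -> float:
    """Hypothetical scorer: share of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def aggregate_context_relevance(question: str, chunks: list[str]) -> float:
    """Score every retrieved chunk, then average the per-chunk scores."""
    scores = [toy_chunk_score(question, chunk) for chunk in chunks]
    return mean(scores) if scores else 0.0


chunks = [
    "robot planning with language models",
    "weather forecast for tomorrow",
]
print(aggregate_context_relevance("robot planning", chunks))
```

Averaging means a single irrelevant chunk lowers the score without zeroing it; a stricter aggregator such as `min` would penalise any irrelevant retrieval.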

#### Groundedness

In [None]:
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

In [None]:
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons,
             name="Groundedness"
            )
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
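
Groundedness asks whether the statements in the response are supported by the retrieved context. The sketch below shows the general shape of such a check under strong simplifying assumptions: the response is split into sentences, and a sentence counts as supported when each of its longer words occurs somewhere in the context. The notebook itself delegates this judgement to the provider; all function names here are hypothetical.

```python
import re


def split_statements(response: str) -> list[str]:
    """Split a response into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def is_supported(statement: str, context: str) -> bool:
    """Toy check: every word of 4+ letters occurs as a substring of the context."""
    words = re.findall(r"[a-z]{4,}", statement.lower())
    context_lower = context.lower()
    return bool(words) and all(w in context_lower for w in words)


def toy_groundedness(response: str, context: str) -> float:
    """Fraction of statements in the response that pass the toy support check."""
    statements = split_statements(response)
    if not statements:
        return 0.0
    supported = sum(is_supported(s, context) for s in statements)
    return supported / len(statements)


context = "This paper explores the application of LLMs for robot planning."
response = "The paper explores robot planning. It won several awards."
print(toy_groundedness(response, context))
```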

### Evaluating RAG applications
We will attach the feedback functions to our RAG pipeline so that each recorded query is evaluated.

In [None]:
from trulens_eval import TruLlama
from trulens_eval import FeedbackMode

tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

In [None]:
eval_questions = []
with open('eval_questions.txt', 'r') as file:
    for line in file:
        # Strip surrounding whitespace, including the trailing newline
        item = line.strip()
        eval_questions.append(item)

In [None]:
eval_questions

In [None]:
eval_questions.append("How can I be successful in AI?")

In [None]:
eval_questions

In [None]:
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)

In [None]:
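from trulens_eval import Tru
tru = Tru()  # `tru` is not created earlier in this notebook; assumes the default local database
records, feedback = tru.get_records_and_feedback(app_ids=[])
records.head()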

In [None]:
import pandas as pd

pd.set_option("display.max_colwidth", None)
records[["input", "output"] + feedback]

In [None]:
tru.get_leaderboard(app_ids=[])

In [None]:
tru.run_dashboard()