<a href="https://colab.research.google.com/github/adidahl/RAGAS/blob/main/code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAGAS Evaluation for LangChain Agents (Using ChromaDB and OpenAI)

In [1]:
!python --version

Python 3.11.12


**R**etrieval **A**ugmented **G**eneration **As**sessment (RAGAS) is an evaluation framework for quantifying the performance of RAG pipelines. In this example we will see how to use it with a RAG-enabled conversational agent in LangChain, using **ChromaDB** as the vector store and **OpenAI** for models.

Because we need an agent and a RAG pipeline to evaluate with RAGAS, the first part of this notebook covers setting up an XML agent with RAG. Jump ahead to **Integrating RAGAS** for the RAGAS section.

To begin, let's install the prerequisites:

In [1]:
!pip install -qU \
    langchain \
    langchain-community \
    langchain-openai \
    openai \
    chromadb==0.5.0 \
    tiktoken==0.7.0 \
    datasets==2.19.0 \
    ragas==0.1.8 \
    pandas


In [2]:
# Run this cell and enter your OpenAI API Key when prompted
import os
from getpass import getpass

# platform.openai.com
if "OPENAI_API_KEY" not in os.environ:
  openai_api_key = getpass("Please enter your OpenAI API key: ")
  os.environ["OPENAI_API_KEY"] = openai_api_key
else:
  print("OpenAI API Key already set.")

Please enter your OpenAI API key: ··········


## Finding Knowledge

The first thing we need for an agent using RAG is somewhere we want to pull knowledge from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).

_Note: we're using the prechunked dataset. For the raw version see [`jamescalam/ai-arxiv2`](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._

In [3]:
from datasets import load_dataset

# Load a smaller subset for quicker testing, adjust as needed
dataset = load_dataset("jamescalam/ai-arxiv2-chunks", split="train[:5000]")
dataset


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 5000
})

In [4]:
dataset[1]

{'doi': '2401.09350',
 'chunk-id': 1,
 'chunk': 'These neural networks and their training algorithms may be complex, and the scope of their impact broad and wide, but nonetheless they are simply functions in a high-dimensional space. A trained neural network takes a vector as input, crunches and transforms it in various ways, and produces another vector, often in some other space. An image may thereby be turned into a vector, a song into a sequence of vectors, and a social network as a structured collection of vectors. It seems as though much of human knowledge, or at least what is expressed as text, audio, image, and video, has a vector representation in one form or another.\nIt should be noted that representing data as vectors is not unique to neural networks and deep learning. In fact, long before learnt vector representations of pieces of dataâ\x80\x94what is commonly known as â\x80\x9cembeddingsâ\x80\x9dâ\x80\x94came along, data was often encoded as hand-crafted feature vectors. E

## Building the Knowledge Base with ChromaDB and OpenAI

To build our knowledge base we need _two things_:

1.  **Embeddings:** We will use `OpenAIEmbeddings` which requires the OpenAI API key we provided earlier.
2.  **A vector database:** We will use `ChromaDB`, a popular open-source vector database that can run locally.
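
To make the roles of these two pieces concrete, here is a minimal sketch of what a vector store does with embeddings at query time. It is a toy under stated assumptions: the three-dimensional vectors and the `toy_store` contents are made up for illustration, and cosine similarity is only one common choice (the distance Chroma actually uses depends on how the collection is configured).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions.
toy_store = {
    "llama 2 paper": [0.9, 0.1, 0.0],
    "mistral 7b paper": [0.7, 0.3, 0.1],
    "cooking recipes": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([text for text, _ in results])  # ['llama 2 paper', 'mistral 7b paper']
```

The embedding model turns text into vectors, and the vector database stores those vectors and answers "which stored vectors are closest to this one?" efficiently.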

In [5]:
from langchain_openai import OpenAIEmbeddings
import os

# Initialize OpenAI embeddings
# Using a smaller, efficient model like text-embedding-3-small is recommended
embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

Now, let's set up ChromaDB. We'll use a persistent client so the data isn't lost when the notebook session ends.

In [6]:
import chromadb
from langchain_community.vectorstores import Chroma

# Define the path for the persistent Chroma database
persist_directory = "./chroma_db_arxiv"
# Define the collection name
collection_name = "arxiv_openai_ragas"

# Initialize Chroma vector store. This will create the directory if it doesn't exist
# or load it if it does.
vectorstore = Chroma(
    collection_name=collection_name,
    embedding_function=embed,
    persist_directory=persist_directory
)

print(f"Chroma vector store initialized.")
print(f"Collection name: {collection_name}")
print(f"Persisting to: {persist_directory}")
print(f"Number of documents currently in collection: {vectorstore._collection.count()}")

Chroma vector store initialized.
Collection name: arxiv_openai_ragas
Persisting to: ./chroma_db_arxiv
Number of documents currently in collection: 0


Let's check the dimensionality of our OpenAI embedding model:

In [7]:
sample_embedding = embed.embed_query("test query")
embedding_dim = len(sample_embedding)
print(f"Embedding dimension: {embedding_dim}")

Embedding dimension: 1536


### Populating our ChromaDB Collection

Now our knowledge base is ready to be populated with our data. We will use the `vectorstore.add_texts` method, which handles embedding generation internally.

We will include metadata from each record.

In [8]:
from tqdm.auto import tqdm
import pandas as pd

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100 # ChromaDB can handle larger batches, adjust as needed
total_docs = len(data)
docs_added = 0

print(f"Preparing to add {total_docs} documents to ChromaDB in batches of {batch_size}...")

for i in tqdm(range(0, total_docs, batch_size)):
    i_end = min(total_docs, i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = batch["id"].tolist()
    # get text to embed (which becomes the document content)
    texts = batch['chunk'].tolist()
    # get metadata to store
    metadata = batch[['source', 'title']].to_dict('records')

    # Add to ChromaDB
    try:
        vectorstore.add_texts(texts=texts, metadatas=metadata, ids=ids)
        docs_added += len(ids)
    except Exception as e:
        print(f"Error adding batch {i//batch_size + 1}: {e}")
        # Optional: break or continue depending on desired behavior
        # break

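# Chroma 0.4+ writes to `persist_directory` automatically, so this explicit call
# is redundant on recent versions and only triggers a deprecation warning
vectorstore.persist()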
print(f"\nFinished adding documents.")
print(f"Total documents added in this run: {docs_added}")
print(f"Total documents in collection: {vectorstore._collection.count()}")

Preparing to add 5000 documents to ChromaDB in batches of 100...



Finished adding documents.
Total documents added in this run: 5000
Total documents in collection: 5000
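
The batching arithmetic used in the loop above can be isolated into a small helper. This is a sketch only; `batch_ranges` is a hypothetical name and is not part of ChromaDB or LangChain.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch is clipped so it never runs past the data
        yield start, min(total, start + batch_size)

ranges = list(batch_ranges(250, 100))
print(ranges)  # [(0, 100), (100, 200), (200, 250)]
```

With 5000 rows and a batch size of 100 this yields 50 ranges, matching the 50 iterations reported by the progress bar.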



Create a tool for our agent to use when searching for ArXiv papers in our ChromaDB collection:

In [9]:
from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # Perform similarity search in ChromaDB
    # We use the vectorstore object directly, which already has the embedder
    docs = vectorstore.similarity_search(query, k=5)
    # Reformat results into string (Langchain docs have 'page_content')
    # RAGAS expects a list of strings as context, so we return the combined string here
    # which will be split later during RAGAS evaluation.
    results_str = "\n---\n".join(
        [doc.page_content for doc in docs]
    )
    if not results_str:
        return "No relevant documents found."
    return results_str

tools = [arxiv_search]

When our agent uses this tool, the call looks like this:

In [10]:
print(
    arxiv_search.run(tool_input={"query": "can you tell me about llama 2?"})
)

Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng et al. 2023) Vicuna-V1.5 (Zheng et al. 2023) WizardLM (Xu et al. 2023b) LongChat-V1 (Li* et al. 2023) LongChat-V1.5 (Li* et al. 2023) OpenChat-V3.2 (Wang et al. 2023a) GPT-3.5-turbo GPT-4 Llama2 Llama1 Llama2 Llama1 Llama1 Llama2 Llama2 - - 7B, 13B, 70B 7B, 13B, 33B 7B, 13B 13B 7B, 13B 7B 13B - - N/A N/A N/A N/A N/A N/A N/A N/A N/A 4k 2k 16k 2k 16k 32k
---
Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama 1 34B4 in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.
4Since Llama 2 34B was not open-

## Defining XML Agent with OpenAI

The XML agent is built primarily to support Anthropic models. Anthropic models have been trained to use XML tags like `<input>{some input}</input>`, and when using a tool they emit:

```
<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>
```

While OpenAI models are more commonly used with ReAct or function-calling agents, we will proceed with the XML agent here, using an OpenAI model (`gpt-4o`). How strictly the model follows the XML format may vary.
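
To show what "following the XML format" means in practice, here is a minimal sketch that pulls the tool name and tool input out of a model response with a regular expression. It is an illustration only, not LangChain's own parser; `parse_tool_call` and `TOOL_CALL` are hypothetical names. The closing `</tool_input>` tag is treated as optional because generation is often cut off before it is emitted.

```python
import re

# Matches <tool>name</tool><tool_input>arg, with the closing tag optional
TOOL_CALL = re.compile(
    r"<tool>(?P<name>.*?)</tool>\s*<tool_input>(?P<arg>.*?)(?:</tool_input>|$)",
    re.DOTALL,
)

def parse_tool_call(text):
    """Return (tool_name, tool_input) if text contains a tool call, else None."""
    match = TOOL_CALL.search(text)
    if match is None:
        return None
    return match.group("name").strip(), match.group("arg").strip()

print(parse_tool_call("<tool>arxiv_search</tool><tool_input>llama 2"))
print(parse_tool_call("Llama 2 is a family of language models."))
```

A response without the tags returns `None`, which is the case an agent has to treat as either a final answer or a formatting error.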

To create an XML agent we need a `prompt`, `llm`, and list of `tools`. We can download a prebuilt prompt for conversational XML agents from LangChain hub.

In [11]:
from langchain import hub

prompt = hub.pull("hwchase17/xml-agent-convo")
prompt



ChatPromptTemplate(input_variables=['agent_scratchpad', 'input', 'tools'], input_types={}, partial_variables={'chat_history': ''}, metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'xml-agent-convo', 'lc_hub_commit_hash': '00f6b7470fa25a24eef6e4e3c1e44ba07189f3e91c4d987223ad232490673be8'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['agent_scratchpad', 'chat_history', 'input', 'tools'], input_types={}, partial_variables={}, template="You are a helpful assistant. Help the user answer any questions.\n\nYou have access to the following tools:\n\n{tools}\n\nIn order to use a tool, you can use <tool></tool> and <tool_input></tool_input> tags. You will then get back a response in the form <observation></observation>\nFor example, if you have a tool called 'search' that could run a google search, in order to search for the weather in SF you would respond:\n\n<tool>search</tool><tool_input>weather in SF</tool_input>\n<observation>64 degrees</observation>\n\n

We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools. Now we initialize the OpenAI LLM.

In [19]:
from langchain_openai import ChatOpenAI

# chat completion llm using OpenAI
llm = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model_name='gpt-4o', # Or 'gpt-4-turbo-preview', 'gpt-4' etc.
    temperature=0.0 # Low temperature for more deterministic factual answers
)

When the agent runs we provide it with a single `input`, the text from the user. Within the agent logic, however, an *agent_scratchpad* object is passed too, which holds the tool calls made so far and their results. To feed this information into our LLM we need to transform it into the XML format described above. We define the `convert_intermediate_steps` function to handle that.

In [20]:
def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log
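
As a quick check of the format this produces, the function can be run on a hand-made step. In this sketch `FakeAction` is a hypothetical stand-in that exposes only the two attributes the function reads (`tool` and `tool_input`); in the real agent these come from LangChain's action objects.

```python
from collections import namedtuple

# Hypothetical stand-in for the action object the agent produces
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])

def convert_intermediate_steps(intermediate_steps):
    # Same logic as the cell above, repeated so this example is self-contained
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log

steps = [(FakeAction("arxiv_search", "llama 2"), "Llama 2 is ...")]
scratchpad = convert_intermediate_steps(steps)
print(scratchpad)
# <tool>arxiv_search</tool><tool_input>llama 2</tool_input><observation>Llama 2 is ...</observation>
```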

We must also parse the tools into a string containing `tool_name: tool_description` — we handle that with the `convert_tools` function.

In [21]:
def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])

With everything ready we can go ahead and initialize our agent object using [**L**ang**C**hain **E**xpression **L**anguage (LCEL)](https://python.langchain.com/docs/expression_language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])` and finally we parse the output from the agent using an `XMLAgentOutputParser` object.

In [22]:
from langchain.agents.output_parsers.xml import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # Ensure chat_history is passed for conversational context
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</observation>"]) # Adjusted stop sequences for robustness
    | XMLAgentOutputParser()
)
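
The effect of the stop sequences can be simulated locally. The sketch below is an assumption-labelled illustration: `apply_stop` is a hypothetical helper that mimics the truncation, whereas the real cut-off happens inside the model API, which ends generation before the stop string is returned.

```python
def apply_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Without a stop sequence the model could go on to invent its own observation
raw = (
    "<tool>arxiv_search</tool><tool_input>llama 2</tool_input>"
    "<observation>made-up result</observation>"
)
truncated = apply_stop(raw, ["</tool_input>", "</observation>"])
print(truncated)  # <tool>arxiv_search</tool><tool_input>llama 2
```

This is why the agent logs show tool calls ending at the tool input with no closing `</tool_input>` tag.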

With our `agent` object initialized we pass it to an `AgentExecutor` object alongside our original `tools` list:

In [23]:
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain_core.messages import AIMessage, HumanMessage

# Define memory key consistent with the prompt
memory_key = "chat_history"

# Initialize memory
memory = ConversationBufferMemory(memory_key=memory_key, return_messages=True)

# Initialize AgentExecutor with agent, tools, memory, and intermediate steps
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True, # Set to True to see agent reasoning
    return_intermediate_steps=True,
    handle_parsing_errors=True # Helps with potential XML parsing issues
)

Now we can use the agent via the `invoke` method. Because a memory object is attached, the executor fills `chat_history` from memory on every call, so the empty `chat_history` passed below only makes the starting state explicit.

In [24]:
import traceback

try:
    response = agent_executor.invoke({
        "input": "can you tell me about llama 2?",
        "chat_history": [] # Start with empty history for invoke
    })
    print(response)
except Exception as e:
    print(f"Agent invocation failed: {e}")
    print("--- Full Traceback ---")
    traceback.print_exc() # Print the detailed traceback
    print("----------------------")
    response = {'output': f'Agent failed to generate response: {e}', 'intermediate_steps': []}



> Entering new AgentExecutor chain...
<tool>arxiv_search</tool><tool_input>llama 2
Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng et al. 2023) Vicuna-V1.5 (Zheng et al. 2023) WizardLM (Xu et al. 2023b) LongChat-V1 (Li* et al. 2023) LongChat-V1.5 (Li* et al. 2023) OpenChat-V3.2 (Wang et al. 2023a) GPT-3.5-turbo GPT-4 Llama2 Llama1 Llama2 Llama1 Llama1 Llama2 Llama2 - - 7B, 13B, 70B 7B, 13B, 33B 7B, 13B 13B 7B, 13B 7B 13B - - N/A N/A N/A N/A N/A N/A N/A N/A N/A 4k 2k 16k 2k 16k 32k
---
0.644 0.622 0.632 0.690 0.794 0.740 0.808 0.046 0.473 0.247 0.589 0.653 0.491 0.486 0.532 0.394 0.396 0.515 0.725 0.686 0.728 0.767 0.742 0.054 0.500 0.399 0.669 0.571 0.701 0.705 0.672 0.294 0.521 0.428 0.590 0.427 0.601 0.575 0.573 0.135 0.696 0.732 0.738 0.758 0.776 0.710 0.735 0.321 0.658 0.877 0.777 0.876 0.857 0.



Now let's put together a helper function `chat` to simplify interaction, using the agent's memory.

In [25]:
def chat(text: str):
    # The AgentExecutor now manages chat history via the memory object
    try:
        out = agent_executor.invoke({"input": text})
    except Exception as e:
        print(f"Chat invocation failed: {e}")
        out = {'output': 'Agent failed to generate response.', 'intermediate_steps': []}
    return out
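
The memory object attached to the executor is what lets `chat` send only the new message. As a rough mental model, a buffer memory just stores every turn and renders the transcript into the prompt. The class below is a toy sketch under that assumption; `ToyBufferMemory` is hypothetical and is not how `ConversationBufferMemory` is implemented.

```python
class ToyBufferMemory:
    """Keeps every (speaker, text) turn and renders them for a prompt."""

    def __init__(self):
        self.turns = []

    def save_turn(self, user_text, ai_text):
        # One exchange adds two entries: the user message and the reply
        self.turns.append(("Human", user_text))
        self.turns.append(("AI", ai_text))

    def render(self):
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)

    def clear(self):
        self.turns = []

memory_sketch = ToyBufferMemory()
memory_sketch.save_turn("can you tell me about llama 2?", "Llama 2 is a family of language models.")
print(memory_sketch.render())
```

Because the whole transcript is replayed on each call, a follow-up question can omit details that earlier turns already established.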

Now we simply chat with our agent and it will remember the context of previous interactions.

In [26]:
# Reset memory before starting a new conversation if needed
agent_executor.memory.clear()

response1 = chat("can you tell me about llama 2?")
print(f'\nUser: can you tell me about llama 2?')
print(f'Agent: {response1.get("output", "Error: No output found.")}')



> Entering new AgentExecutor chain...
<tool>arxiv_search</tool><tool_input>llama 2
Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng et al. 2023) Vicuna-V1.5 (Zheng et al. 2023) WizardLM (Xu et al. 2023b) LongChat-V1 (Li* et al. 2023) LongChat-V1.5 (Li* et al. 2023) OpenChat-V3.2 (Wang et al. 2023a) GPT-3.5-turbo GPT-4 Llama2 Llama1 Llama2 Llama1 Llama1 Llama2 Llama2 - - 7B, 13B, 70B 7B, 13B, 33B 7B, 13B 13B 7B, 13B 7B 13B - - N/A N/A N/A N/A N/A N/A N/A N/A N/A 4k 2k 16k 2k 16k 32k
---
0.644 0.622 0.632 0.690 0.794 0.740 0.808 0.046 0.473 0.247 0.589 0.653 0.491 0.486 0.532 0.394 0.396 0.515 0.725 0.686 0.728 0.767 0.742 0.054 0.500 0.399 0.669 0.571 0.701 0.705 0.672 0.294 0.521 0.428 0.590 0.427 0.601 0.575 0.573 0.135 0.696 0.732 0.738 0.758 0.776 0.710 0.735 0.321 0.658 0.877 0.777 0.876 0.857 0.



We can ask follow-up questions that omit key information. Thanks to the conversational history, the LLM understands the context and uses it to adjust the search query.

In [27]:
# Ask a follow-up question
response2 = chat("was any red teaming done?")
print(f'\nUser: was any red teaming done?')
print(f'Agent: {response2.get("output", "Error: No output found.")}')

# Store the last response with intermediate steps for RAGAS
last_response_with_steps = response2



> Entering new AgentExecutor chain...
<tool>arxiv_search</tool><tool_input>Llama 2 red teaming
Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng et al. 2023) Vicuna-V1.5 (Zheng et al. 2023) WizardLM (Xu et al. 2023b) LongChat-V1 (Li* et al. 2023) LongChat-V1.5 (Li* et al. 2023) OpenChat-V3.2 (Wang et al. 2023a) GPT-3.5-turbo GPT-4 Llama2 Llama1 Llama2 Llama1 Llama1 Llama2 Llama2 - - 7B, 13B, 70B 7B, 13B, 33B 7B, 13B 13B 7B, 13B 7B 13B - - N/A N/A N/A N/A N/A N/A N/A N/A N/A 4k 2k 16k 2k 16k 32k
---
Llama 2 70B Mixtral 8x7B BBQ accuracy 51.5% 56.0% BOLD sentiment score (avg Â± std) gender profession religious_ideology political_ideology race 0.293 Â± 0.073 0.218 Â± 0.073 0.188 Â± 0.133 0.149 Â± 0.140 0.232 Â± 0.049 0.323 Â±0.045 0.243 Â± 0.087 0.144 Â± 0.089 0.186 Â± 0.146 0.232 Â± 0.052
Figure 5: Bias



---

## Integrating RAGAS

To integrate RAGAS evaluation into this pipeline we need three things from it: the **query**, the **retrieved contexts**, and the **generated output**.

We have the generated output (`answer`) and the query (`question`). We need to extract the retrieved contexts from the agent's intermediate steps.

In [28]:
# Display the last response which includes intermediate steps
print(last_response_with_steps)

{'input': 'was any red teaming done?', 'chat_history': [HumanMessage(content='can you tell me about llama 2?', additional_kwargs={}, response_metadata={}), AIMessage(content='Llama 2 is a family of language models developed as an improvement over the original Llama models. It includes models of various sizes, such as 7B, 13B, and 70B parameters. Llama 2 models are designed to perform well across a range of tasks, including reasoning, comprehension, and code generation. They have been compared to other models like Mistral 7B and GPT-3.5, with Llama 2 showing competitive performance on several benchmarks. However, Mistral 7B has been noted to outperform Llama 2 13B in certain areas like code, mathematics, and reasoning benchmarks. Llama 2 models are also evaluated for their efficiency in terms of cost-performance, with some models achieving performance levels expected from larger models.', additional_kwargs={}, response_metadata={}), HumanMessage(content='was any red teaming done?', addi

When initializing our `AgentExecutor` object we included `return_intermediate_steps=True`. Those steps include the response from our `arxiv_search` tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so (remembering the tool output is a single string joined by `\n---\n`):

In [29]:
contexts = []
intermediate_steps = last_response_with_steps.get('intermediate_steps', [])

if intermediate_steps:
    # Assuming the arxiv_search tool is the first (and likely only) step
    tool_output = intermediate_steps[0][1] # Output of the tool (observation)
    if isinstance(tool_output, str):
        contexts = tool_output.split("\n---\n")
    else:
        print("Tool output is not a string, cannot extract contexts.")
else:
    print("No intermediate steps found to extract contexts from.")

print(f"Extracted {len(contexts)} contexts:")
for i, ctx in enumerate(contexts):
    print(f"--- Context {i+1} ---")
    print(ctx[:200] + "...") # Print start of each context

Extracted 5 contexts:
--- Context 1 ---
Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng e...
--- Context 2 ---
Llama 2 70B Mixtral 8x7B BBQ accuracy 51.5% 56.0% BOLD sentiment score (avg Â± std) gender profession religious_ideology political_ideology race 0.293 Â± 0.073 0.218 Â± 0.073 0.188 Â± 0.133 0.149 Â± 0...
--- Context 3 ---
TogetherCompute. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavi...
--- Context 4 ---
Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama 1 34B4 in different categorie...
--- Context 5 ---
Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Lla
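
The extraction above reads only the first intermediate step. An agent may call the search tool more than once for a single question, so one possible variation is to gather contexts from every step. The sketch below assumes each step is an `(action, observation)` pair and that observations are strings joined by `\n---\n`, as in this notebook; `collect_contexts` is a hypothetical helper.

```python
SEPARATOR = "\n---\n"

def collect_contexts(intermediate_steps, separator=SEPARATOR):
    """Gather de-duplicated context chunks from every string observation."""
    contexts = []
    seen = set()
    for _action, observation in intermediate_steps:
        if not isinstance(observation, str):
            continue  # skip anything that is not plain tool output
        for chunk in observation.split(separator):
            chunk = chunk.strip()
            if chunk and chunk not in seen:
                seen.add(chunk)
                contexts.append(chunk)
    return contexts

# Two tool calls whose results overlap on "chunk B"
steps = [
    ("call 1", "chunk A\n---\nchunk B"),
    ("call 2", "chunk B\n---\nchunk C"),
]
print(collect_contexts(steps))  # ['chunk A', 'chunk B', 'chunk C']
```

Note that the tool's fallback message ("No relevant documents found.") would also be collected as a context here, so it may be worth filtering out before evaluation.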

## Evaluation with RAGAS

To evaluate with RAGAS we need an evaluation dataset containing `question`s and the `ground_truth` answers to those questions. RAGAS can then use the `answer` generated by our agent and the retrieved `contexts` to calculate various metrics.

We will use a pre-made evaluation dataset based on the AI ArXiv dataset.
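
Before handing rows to the evaluator it can help to check that each one has the four fields named above in the expected shape. This sketch reflects how this notebook builds its rows (strings for `question`, `answer`, and `ground_truth`, a list of strings for `contexts`); it is not an official RAGAS schema check, and `validate_record` is a hypothetical helper.

```python
def validate_record(record):
    """Return a list of problems found in one evaluation record."""
    problems = []
    for key in ("question", "answer", "ground_truth"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} must be a non-empty string")
    contexts = record.get("contexts")
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("contexts must be a list of strings")
    return problems

good = {
    "question": "What is Llama 2?",
    "answer": "A family of language models.",
    "contexts": ["Llama 2 7B, 13B, 70B"],
    "ground_truth": "A family of LLMs.",
}
bad = {"question": "What is Llama 2?", "answer": "", "contexts": "one big string"}

print(validate_record(good))  # []
print(validate_record(bad))
```

An empty `contexts` list passes this check on purpose, since the agent may answer without calling the tool.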

In [30]:
# Using the mixtral dataset, but evaluation uses OpenAI
ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")
# Note: This dataset includes ground_truth_context, which RAGAS v0.1.x
# can use for context_recall/precision if available, but often they are calculated
# without pre-defined relevant contexts, comparing generated contexts to ground_truth answer.
ragas_data


Dataset({
    features: ['question', 'ground_truth_context', 'ground_truth', 'question_type', 'episode_done'],
    num_rows: 51
})

In [31]:
ragas_data[0]

{'question': 'What is the impact of encoding the input prompt on inference speed in generative inference?',
 'ground_truth_context': ['- This technique works particularly well when processing large batches of data, during train-\ning Pudipeddi et al. (2020); Ren et al. (2021) or large-batch non-interactive inference Aminabadi et al.\n(2022); Sheng et al. (2023), where each layer processes a lot of tokens each time the layer is loaded\nfrom RAM.\n- In turn, when doing interactive inference (e.g. as a chat assistants), offloading works\nsignificantly slower than on-device inference.\n- The generative inference workload consists of two phases: 1) encoding the input prompt and 2)\ngenerating tokens conditioned on that prompt.\n- The key difference between these two phases is that\nprompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially\n(token-by-token and layer-by-layer).\n- In general, phase 1 works relatively well with existing Mixture-\nof-Exper

We first iterate through the questions in this evaluation dataset and ask these questions to our agent, collecting the questions, answers, contexts, and ground truths.

In [32]:
import pandas as pd
from tqdm.auto import tqdm

eval_results = []
limit = 10 # Number of questions to evaluate (adjust as needed)

print(f"Running evaluation for {limit} questions...")

# Reset agent memory before evaluation loop
agent_executor.memory.clear()

for i, row in tqdm(enumerate(ragas_data), total=limit):
    if i >= limit:
        break
    question = row["question"]
    # Normalize ground_truth to a list; the first entry is stored as a single string below
    ground_truths = [row["ground_truth"]] if isinstance(row["ground_truth"], str) else row["ground_truth"]

    # Clear memory for each new question to avoid contamination between eval samples
    agent_executor.memory.clear()

    try:
        # Use invoke to get intermediate steps
        out = agent_executor.invoke({"input": question, "chat_history": []})
        answer = out.get('output', "ERROR: No output")
        intermediate_steps = out.get('intermediate_steps', [])

        contexts = []
        if intermediate_steps:
            # Assuming the search tool is the first step
            tool_output = intermediate_steps[0][1]
            if isinstance(tool_output, str):
                 contexts = tool_output.split("\n---\n")
            else:
                 print(f"Warning: Tool output for Q{i} is not a string.")
        else:
             # This happens if the agent answers without using the tool
             print(f"Warning: No intermediate steps (tool not used) for Q{i}. Contexts will be empty.")
             contexts = []

    except Exception as e:
        print(f"Error processing question {i}: {e}")
        answer = f"ERROR: {e}"
        contexts = []

    eval_results.append({
        "question": question,
        "answer": answer,
        "contexts": contexts,
        "ground_truth": ground_truths[0] # RAGAS expects single string ground_truth
    })

df = pd.DataFrame(eval_results)
print("\nFinished collecting evaluation data.")

Running evaluation for 10 questions...





> Entering new AgentExecutor chain...
<tool>arxiv_search</tool><tool_input>impact of encoding input prompt on inference speed in generative models
The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.
Below, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.
4



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>generating tokens affect inference speed in generative inference[0m[36;1m[1;3mâ¢ The average number of generated tokens outputted by LLMs per query. Much like the assessment of average prompt tokens, this metric provides an evaluation of computational efficiency, but from a token generation perspective. Instead of focusing on the number of tokens in the prompt, it takes into account the number of tokens generated. This is particularly significant because transformer-based generative LLMs produce content token-by-token, with each subsequent token relying on the gen- eration of preceding ones. Consequently, an increase in number of generated tokens leads to a corresponding increase in the computational cost, as each additional generated token implies another LLM forward inference. In fact, OpenAI applies a pricing structure wherein the cost for the number of generated tokens is twice that of the number of prompt tokens for their LLM A



[32;1m[1;3mTo answer your question about the differences in architecture between Mixtral 8x7B and Mistral 7B, particularly in terms of feedforward blocks and active parameters used during inference, I will search for relevant information in arXiv papers.

<tool>arxiv_search</tool><tool_input>Mixtral 8x7B architecture vs Mistral 7B feedforward blocks active parameters inference[0m[36;1m[1;3mAbstract
We introduce Mistral 7B, a 7âbillion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B â Instruct, that surpasses



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>offloading A100 server MoE-based language models[0m[36;1m[1;3m# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com
# Abstract
With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) â a type of model architectures where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their âdenseâ counterparts, but it also increases model size due to having multiple âexpertsâ. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading al



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>Mixtral Llama 2 70B code benchmarks comparison[0m[36;1m[1;3mTable 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.
70 Mixtral 8x7B. âMixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â= Mistral Â° 20 âe LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params
Figure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly super



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>Mixtral mathematics benchmarks Llama 2 70B[0m[36;1m[1;3mTable 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.
70 Mixtral 8x7B. âMixtral 8x7B Mixtral 8x7B 355 =o = Es & E60! Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â= Mistral Â° 20 âe LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params
Figure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior 



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>Mixtral-8x7B-Instruct model OpenAssistant dataset benchmarking expert LRU cache speculative loading expert recall rate[0m[36;1m[1;3mFor this evaluation, we run Mixtral-8x7B-Instruct model on the OpenAssistant dataset (KÃ¶pf et al., 2023). We test LRU caching by running the model on recorded conversations and measuring the recall (aka âhit ratioâ from caching perspective) for different cache sizes k. Next, we test speculative loading in isolation by âguessingâ which experts should be loaded (by applying the next layerâs gating function on current layer activations), then measuring how often the actual next experts get loaded this way. A recall of 1.0 corresponds to a situation where both (2) Mixtral active experts were pre-fetched. We test speculative loading in three settings: 1, 2 and 10 layers ahead.
# 4.2 Mixed MoE Quantization
---
7Notably, Google Colab RAM cannot fit Mixtral-8x7B with a reasonable compression rate 8Thi



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>sparse Mixture-of-Experts language models faster token generation[0m[36;1m[1;3mThe generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that prompt tokens are encoded in parallel (layer-by-layer), whereas the generation runs sequentially (token-by-token and layer-by-layer). In general, phase 1 works relatively well with existing Mixture- of-Experts algorithms, since each layer can only be loaded once for the entire prompt. In turn, when generating tokens, one must load layer once per each token generated. In practice, this means that inference speed is limited by how fast one can fetch parameters from system memory.
Below, we look for patterns in how the MoE model loads its experts and propose ways to exploit these patterns to speed up inference time.
4To learn more about these methods, please refer to sur



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>sparse Mixture-of-Experts impact on language model size[0m[36;1m[1;3mShazeer et al. (2017) builds on this idea to train a sparsely gated Mixture-of-Experts to serve as a language model. The full model consists of a recurrent neural network backbone and a MoE module with up to 131072 experts. When processing a given token, a linear gating function select 4 most suitable experts based on the latest hidden state. The resulting model (including the gating function and experts) is trained end-to-end to minimize cross-entropy, with an additional regularizer to promote equal expert utilization. Shazeer et al. (2017) observed that the MoE model not only improves perplexity, but also learns interpretable expert specializations: some experts would âspecializeâ on prepositions, while others learn to express a particular concept (e.g. speed).
---
# 2 Background & Related Work
# 2.1 Mixture-of-Experts
The recent surge in MoE language models b



[32;1m[1;3m<tool>arxiv_search</tool><tool_input>LRU caching Mixture-of-Experts language models inference speed[0m[36;1m[1;3mWe illustrate an example of how LRU cache saves experts in Figure 1 (see caption). LRU is a very simple strategy that does not consider factors like expert activation frequencies, varying cache size between MoE layers, or any sequential patterns in expert activation. However, we found that even this simple strategy can significantly speed up inference for modern Mixture-of-Experts models such as Mixtral-8x7B (see Section 4 for detailed evaluation).
# 3.2 Speculative Expert Loading
While LRU caching can reduce the average expert loading time, most of the inference time is still spent waiting for the next expert to be loaded. The reason behind this is that, unlike with dense models, MoE offloading cannot effectively overlap expert loading with computation. To understand this problem, let us zoom into the process of generating a single token, layer-by-layer. The



In [33]:
pd.set_option('display.max_colwidth', 200) # Adjust display width
df.head()

Unnamed: 0,question,answer,contexts,ground_truth
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"The impact of encoding the input prompt on inference speed in generative models is significant due to the difference in processing phases. Encoding the input prompt is done in parallel, layer-by-l...",[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that pro...,"The encoding of the input prompt has an impact on inference speed in generative inference. During the encoding phase, prompt tokens are encoded in parallel, layer-by-layer, which works relatively ..."
1,How does generating tokens affect the inference speed in generative inference?,"Generating tokens in generative inference significantly affects inference speed due to the sequential nature of token generation. Unlike the parallel processing of input prompt tokens, generating ...","[• The average number of generated tokens outputted by LLMs per query. Much like the assessment of average prompt tokens, this metric provides an evaluation of computational efficiency, but from...","Generating tokens affects the inference speed in generative inference by slowing it down. In interactive inference, where tokens are generated autoregressively from left to right, the inference sy..."
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"The architecture of Mixtral 8x7B differs from Mistral 7B primarily in its use of Sparse Mixture of Experts (SMoE) layers. While Mistral 7B is a standard transformer model, Mixtral 8x7B incorporate...","[Abstract\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all e...","The architecture of Mixtral 8x7B differs from Mistral 7B in terms of feedforward blocks and active parameters used during inference. Mixtral 8x7B has 8 feedforward blocks (experts) in each layer, ..."
3,When is offloading used on the A100 server for accelerating MoE-based language models?,Offloading is used on the A100 server for accelerating MoE-based language models to manage the large model size and limited accelerator memory. The strategy involves using parameter offloading alg...,"[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\n# Abstract\nWith the widespread adoption of Large Language Models (LLMs), many deep learning pr...",Offloading is used on the A100 server for accelerating MoE-based language models when there is resource-constricted hardware and the goal is to enable broader access to these powerful models for r...
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"Mixtral significantly outperforms Llama 2 70B in code benchmarks, as well as in most other categories, while using 5x fewer active parameters during inference. This makes Mixtral not only more eff...",[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\n70 Mix...,Mixtral outperforms Llama 2 70B in code benchmarks.


In [34]:
from datasets import Dataset
from ragas.metrics import (
    faithfulness,          # How factual is the answer based on context?
    answer_relevancy,      # How relevant is the answer to the question?
    context_precision,     # Signal-to-noise ratio in retrieved context
    context_recall,        # Were all relevant parts of ground truth context retrieved?
    answer_correctness,    # How factually correct is the answer compared to ground truth?
    answer_similarity      # How semantically similar is the answer to ground truth?
)
from ragas import evaluate

# Ensure 'ground_truth' column exists and is a string
if 'ground_truth' not in df.columns:
    raise ValueError("DataFrame must contain a 'ground_truth' column for RAGAS evaluation.")
df['ground_truth'] = df['ground_truth'].astype(str)

# Convert pandas DataFrame to Hugging Face Dataset
eval_dataset = Dataset.from_pandas(df)

# Define the metrics we want to compute
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
    answer_similarity
]

print("Starting RAGAS evaluation...")

# Run the evaluation
# By default RAGAS calls OpenAI models, so OPENAI_API_KEY must be set in the environment
result = evaluate(
    dataset=eval_dataset,
    metrics=metrics,
    # Pass models explicitly to override the defaults if needed
    # llm=llm,
    # embeddings=embed
)

print("RAGAS evaluation complete.")

# Convert result back to pandas DataFrame for easier viewing
ragas_df = result.to_pandas()
ragas_df.head()



Starting RAGAS evaluation...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

RAGAS evaluation complete.


Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_precision,context_recall,answer_correctness,answer_similarity
0,What is the impact of encoding the input prompt on inference speed in generative inference?,"The impact of encoding the input prompt on inference speed in generative models is significant due to the difference in processing phases. Encoding the input prompt is done in parallel, layer-by-l...",[The generative inference workload consists of two phases: 1) encoding the input prompt and 2) generating tokens conditioned on that prompt. The key difference between these two phases is that pro...,"The encoding of the input prompt has an impact on inference speed in generative inference. During the encoding phase, prompt tokens are encoded in parallel, layer-by-layer, which works relatively ...",1.0,0.992326,1.0,1.0,0.553538,0.950995
1,How does generating tokens affect the inference speed in generative inference?,"Generating tokens in generative inference significantly affects inference speed due to the sequential nature of token generation. Unlike the parallel processing of input prompt tokens, generating ...","[• The average number of generated tokens outputted by LLMs per query. Much like the assessment of average prompt tokens, this metric provides an evaluation of computational efficiency, but from...","Generating tokens affects the inference speed in generative inference by slowing it down. In interactive inference, where tokens are generated autoregressively from left to right, the inference sy...",1.0,0.934732,1.0,0.6,0.856394,0.925577
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,"The architecture of Mixtral 8x7B differs from Mistral 7B primarily in its use of Sparse Mixture of Experts (SMoE) layers. While Mistral 7B is a standard transformer model, Mixtral 8x7B incorporate...","[Abstract\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all e...","The architecture of Mixtral 8x7B differs from Mistral 7B in terms of feedforward blocks and active parameters used during inference. Mixtral 8x7B has 8 feedforward blocks (experts) in each layer, ...",0.428571,0.911671,1.0,1.0,0.708698,0.95979
3,When is offloading used on the A100 server for accelerating MoE-based language models?,Offloading is used on the A100 server for accelerating MoE-based language models to manage the large model size and limited accelerator memory. The strategy involves using parameter offloading alg...,"[# Denis Mazur Moscow Institute of Physics and Technology Yandex Researchcore denismazur8@gmail.com\n# Abstract\nWith the widespread adoption of Large Language Models (LLMs), many deep learning pr...",Offloading is used on the A100 server for accelerating MoE-based language models when there is resource-constricted hardware and the goal is to enable broader access to these powerful models for r...,0.75,0.953038,0.755556,0.0,0.353497,0.952448
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,"Mixtral significantly outperforms Llama 2 70B in code benchmarks, as well as in most other categories, while using 5x fewer active parameters during inference. This makes Mixtral not only more eff...",[Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.\n70 Mix...,Mixtral outperforms Llama 2 70B in code benchmarks.,1.0,0.922079,1.0,1.0,0.401302,0.93854


### Interpreting RAGAS Metrics

Let's look at the key metrics produced by RAGAS:

#### Retrieval Metrics (Context Related):

*   **`context_precision`**: Measures the signal-to-noise ratio of the retrieved contexts, rewarding retrievals in which relevant chunks rank ahead of irrelevant ones. Ideally, every retrieved chunk (`contexts`) is relevant to the `question`. Scores closer to 1 are better.
*   **`context_recall`**: Measures whether the information needed to produce the `ground_truth` answer is present in the retrieved `contexts`. Scores closer to 1 are better.
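
As a rough illustration of the arithmetic behind these two retrieval scores, the sketch below follows the rank-weighted precision and attribution-ratio formulas as described in the RAGAS documentation. The helper names and the verdict lists are hypothetical; in RAGAS the per-chunk and per-statement verdicts come from an LLM judge, which this sketch does not reproduce.

```python
from typing import Sequence


def context_precision_sketch(verdicts: Sequence[int]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk.

    verdicts[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


def context_recall_sketch(attributed: Sequence[bool]) -> float:
    """Fraction of ground-truth statements that can be attributed to the contexts."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# Hypothetical verdicts: relevant chunks at ranks 1, 3 and 5 out of five retrieved.
precision = context_precision_sketch([1, 0, 1, 0, 1])
# Hypothetical attribution: three of five ground-truth statements are supported.
recall = context_recall_sketch([True, True, True, False, False])
print(round(precision, 6), recall)
```

The first verdict pattern is one of several that would give a score of 0.755556; the notebook does not show the judge's actual verdicts, so it is only an illustration of how an irrelevant chunk ranked above a relevant one lowers the score.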

#### Generation Metrics (Answer Related):

*   **`faithfulness`**: Measures how factually consistent the generated `answer` is with the retrieved `contexts`. It checks if the answer hallucinates or makes claims not supported by the provided context. Scores closer to 1 are better.
*   **`answer_relevancy`**: Measures how relevant the `answer` is to the original `question`. It penalizes answers that are incomplete or contain redundant information. Scores closer to 1 are better.
*   **`answer_similarity`**: Measures the semantic similarity between the generated `answer` and the `ground_truth` answer. Scores closer to 1 mean the generated answer is semantically close to the ideal answer.
*   **`answer_correctness`**: Combines factual correctness (compared to `ground_truth`) with semantic similarity in a single score, making it a more holistic measure of whether the answer is right. Scores closer to 1 are better.
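
To make the generation scores less abstract, here is a minimal sketch of how `faithfulness` and `answer_correctness` can be computed once an LLM judge has classified the claims. The function names and the claim counts are hypothetical, and the 0.75/0.25 weighting between factual F1 and semantic similarity is an assumption about the default in this RAGAS version rather than something the notebook states.

```python
def faithfulness_sketch(supported_claims: int, total_claims: int) -> float:
    """Share of the answer's claims that the retrieved contexts support."""
    return supported_claims / total_claims if total_claims else 0.0


def answer_correctness_sketch(
    tp: int,
    fp: int,
    fn: int,
    similarity: float,
    weights: tuple = (0.75, 0.25),  # assumed default weighting
) -> float:
    """Weighted mix of a claim-level F1 score and semantic similarity.

    tp: claims present in both the answer and the ground truth
    fp: claims in the answer that the ground truth does not contain
    fn: ground-truth claims missing from the answer
    """
    denominator = tp + 0.5 * (fp + fn)
    f1 = tp / denominator if denominator else 0.0
    return weights[0] * f1 + weights[1] * similarity


# Hypothetical counts: 3 of 7 claims supported by the contexts.
faith = faithfulness_sketch(3, 7)
# Hypothetical counts combined with a similarity of 0.950995.
correctness = answer_correctness_sketch(tp=4, fp=5, fn=6, similarity=0.950995)
print(round(faith, 6), round(correctness, 6))
```

Under these assumptions a high `answer_similarity` contributes at most 0.25 to `answer_correctness`, which is one plausible reason the two columns can diverge as much as they do in the results table.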

In [35]:
# Display specific metrics
print("\n--- Retrieval Metrics ---")
pd.set_option('display.max_colwidth', 150)
display(ragas_df[['question', 'context_precision', 'context_recall']])

print("\n--- Generation Metrics ---")
display(ragas_df[['question', 'faithfulness', 'answer_relevancy', 'answer_similarity', 'answer_correctness']])


--- Retrieval Metrics ---


Unnamed: 0,question,context_precision,context_recall
0,What is the impact of encoding the input prompt on inference speed in generative inference?,1.0,1.0
1,How does generating tokens affect the inference speed in generative inference?,1.0,0.6
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,1.0,1.0
3,When is offloading used on the A100 server for accelerating MoE-based language models?,0.755556,0.0
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,1.0,1.0
5,"In terms of mathematics benchmarks, how does Mixtral perform compared to Llama 2 70B?",1.0,1.0
6,"What is the relationship between benchmarking the expert LRU cache and speculative loading, and the expert recall rate in the Mixtral-8x7B-Instruc...",1.0,1.0
7,How does the use of sparse Mixture-of-Experts (MoE) in language models contribute to faster token generation?,1.0,1.0
8,What impact does the use of sparse Mixture-of-Experts (MoE) have on the size of language models?,1.0,1.0
9,How does LRU caching improve the inference speed of Mixture-of-Experts language models?,1.0,0.5



--- Generation Metrics ---


Unnamed: 0,question,faithfulness,answer_relevancy,answer_similarity,answer_correctness
0,What is the impact of encoding the input prompt on inference speed in generative inference?,1.0,0.992326,0.950995,0.553538
1,How does generating tokens affect the inference speed in generative inference?,1.0,0.934732,0.925577,0.856394
2,How does the architecture of Mixtral 8x7B differ from Mistral 7B in terms of feedforward blocks and active parameters used during inference?,0.428571,0.911671,0.95979,0.708698
3,When is offloading used on the A100 server for accelerating MoE-based language models?,0.75,0.953038,0.952448,0.353497
4,How does Mixtral compare to Llama 2 70B in code benchmarks?,1.0,0.922079,0.93854,0.401302
5,"In terms of mathematics benchmarks, how does Mixtral perform compared to Llama 2 70B?",1.0,0.936447,0.930693,0.661245
6,"What is the relationship between benchmarking the expert LRU cache and speculative loading, and the expert recall rate in the Mixtral-8x7B-Instruc...",1.0,0.997693,0.965707,0.57476
7,How does the use of sparse Mixture-of-Experts (MoE) in language models contribute to faster token generation?,1.0,1.0,0.976767,0.644192
8,What impact does the use of sparse Mixture-of-Experts (MoE) have on the size of language models?,0.636364,1.0,0.947064,0.823723
9,How does LRU caching improve the inference speed of Mixture-of-Experts language models?,1.0,0.952684,0.964203,0.522302


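How to read the scores:

*   **High `context_recall` and `context_precision`** suggest the retrieval step (ChromaDB + OpenAI embeddings) is working well, finding relevant information without too much noise.
*   **High `faithfulness`** indicates the LLM is grounding its answers in the provided context rather than making things up.
*   **High `answer_relevancy`** means the answer directly addresses the question without unnecessary details.
*   **High `answer_similarity` and `answer_correctness`** show the final answer is close to the ideal ground truth answer.

Low scores in specific areas can pinpoint bottlenecks: poor retrieval, LLM hallucination, or answers that don't properly address the question. In this run, question 3 has a `context_recall` of 0.0 and the lowest `answer_correctness` (0.353), which points at retrieval, whereas question 2 scores 1.0 on both retrieval metrics but only 0.429 on `faithfulness`, which points at generation. Note also that `answer_similarity` stays above 0.92 for every question while `answer_correctness` ranges from 0.353 to 0.856, so semantic closeness alone does not guarantee a factually complete answer.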
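
The kind of triage described above can be sketched with pandas. The snippet below is self-contained: it rebuilds a small frame from the first five rows of the results shown earlier rather than relying on `ragas_df`, and the 0.7 cut-off is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Scores copied from the first five rows of the results table above.
scores = pd.DataFrame(
    {
        "faithfulness": [1.0, 1.0, 0.428571, 0.75, 1.0],
        "context_precision": [1.0, 1.0, 1.0, 0.755556, 1.0],
        "context_recall": [1.0, 0.6, 1.0, 0.0, 1.0],
        "answer_correctness": [0.553538, 0.856394, 0.708698, 0.353497, 0.401302],
    }
)

THRESHOLD = 0.7  # hypothetical cut-off, tune for your own use case

# Average score per metric gives a quick overall picture.
metric_means = scores.mean()

# For every question, list the metrics that fall below the threshold.
below = scores.lt(THRESHOLD)
flagged = {
    idx: row[row].index.tolist()
    for idx, row in below.iterrows()
    if row.any()
}

print(metric_means.round(3))
for idx, weak_metrics in flagged.items():
    print(f"question {idx}: {', '.join(weak_metrics)}")
```

With the real results, the same pattern should work on `ragas_df` after selecting its numeric metric columns; grouping the flagged metrics into retrieval and generation then indicates which half of the pipeline to inspect first.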