# Evaluating LlamaIndex with BeyondLLM

This guide demonstrates how to use the BeyondLLM framework to evaluate a LlamaIndex pipeline. It covers installing the required packages, setting up an API token, configuring the embedding model and LLM, and scoring the pipeline on three metrics: context relevancy, answer relevancy, and groundedness.

## Installation

First, install the required packages using the following commands:

In [1]:
!pip install llama-index-vector-stores-chroma
!pip install llama-index
!pip install llama-index-embeddings-fastembed
!pip install llama-index-llms-huggingface-api
!pip install beyondllm



## Set Up Hugging Face API Token
Hugging Face's hosted models require an API token for authentication. The cell below prompts for the token without echoing it and stores it in the `HUGGINGFACEHUB_API_TOKEN` environment variable:

In [4]:
import os
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass("API:")

# Set the API token in the environment variable
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

API:··········
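When the notebook is re-run often, prompting every time gets tedious. The following is a minimal sketch, not part of the original notebook, that reuses a token already present in the environment and prompts only when it is missing. The helper name `load_hf_token` is hypothetical.

```python
import os
from getpass import getpass


def load_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN"):
    """Return the token from the environment, prompting only if it is missing.

    `load_hf_token` is a hypothetical helper, not a Hugging Face or BeyondLLM API.
    """
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed characters, as in the cell above.
        token = getpass(f"{env_var}: ")
        os.environ[env_var] = token
    return token
```

Calling `HUGGINGFACEHUB_API_TOKEN = load_hf_token()` would then replace the explicit `getpass` call.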


## Import Libraries
Import the necessary libraries to work with the LlamaIndex framework, vector stores, embeddings, and evaluation metrics:

In [12]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
import chromadb
from beyondllm.utils import CONTEXT_RELEVENCE, GROUNDEDNESS, ANSWER_RELEVENCE
import re
import numpy as np
import pysbd

## Load the Documents
Load the documents from a directory. `SimpleDirectoryReader` reads each file in the directory and returns a list of `Document` objects ready for indexing:

In [2]:
documents = SimpleDirectoryReader("/content/sample_data/Data").load_data()



## Configure Embeddings and Language Models
Set up the FastEmbed embedding model (`thenlper/gte-large`) and initialize the Hugging Face Inference API client for `mistralai/Mistral-7B-Instruct-v0.2` with the API token. The embedding model vectorizes the documents, and the LLM is used by the evaluation functions to produce scores:

In [5]:
embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mistral-7B-Instruct-v0.2", token=HUGGINGFACEHUB_API_TOKEN
)


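## Initialize Chroma Vector Store
Chroma is a vector store that persists embeddings and serves similarity queries. The cell below opens a persistent Chroma client at `./chroma_db`, gets or creates a collection named `quickstart`, and builds a `VectorStoreIndex` over the loaded documents: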

In [6]:
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
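To make "querying vectorized embeddings" concrete, here is a toy sketch of similarity search in plain Python. It is only an illustration: the vectors are made up, cosine similarity is one common choice of metric, and Chroma's actual index structure and distance function may differ. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """Return the k (text, similarity) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embedding models emit far more dimensions.
stored = [
    ("exercise lowers heart risk", [0.9, 0.1, 0.0]),
    ("smoking raises heart risk", [0.8, 0.3, 0.1]),
    ("ankara is a city", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], stored, k=2)
for text, similarity in results:
    print(f"{similarity:.2f}  {text}")
```

A retriever does the same thing at scale: it embeds the query, ranks the stored chunks by similarity, and hands the best matches to the LLM as context.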


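## Define Evaluation Functions
Implement one function per metric. Each function prompts the LLM with a BeyondLLM template and reads a score from the reply. The functions call two helpers, `extract_number` and `sent_tokenize`, which must be defined before they run: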
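The metric functions in this section call `extract_number` and `sent_tokenize`, neither of which is defined in this guide. The following is a minimal sketch of both using only the standard library. It makes two assumptions: the score is the first standalone integer from 0 to 10 in the reply, and a simple regex split is an acceptable stand-in for a real sentence segmenter such as `pysbd`, which the notebook imports but never calls.

```python
import re


def extract_number(response):
    """Return the first standalone integer 0-10 in the reply, or NaN if none.

    The 0-10 range is an assumption about the scoring prompt.
    """
    match = re.search(r"\b(10|[0-9])\b", response)
    return int(match.group(0)) if match else float("nan")


def sent_tokenize(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A stand-in for a proper segmenter; it will mis-split abbreviations like 'e.g. x'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

Because `extract_number` returns NaN when no score is found, a single unparseable reply turns the averages below into NaN. Skipping such replies is one way to avoid that.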

## Context Relevancy
This function evaluates how relevant the context is to the query. It calculates the average relevancy score across all context segments:

In [18]:
def get_context_relevancy(llm, query, context):
    total_score = 0
    score_count = 0

    for content in context:
        score_response = llm.complete(CONTEXT_RELEVENCE.format(question=query, context=content))

        score_str = score_response.text
        score = float(extract_number(score_str))
        total_score += score
        score_count += 1

    average_score = total_score / score_count if score_count > 0 else 0
    return f"Context Relevancy Score: {round(average_score, 1)}"
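The averaging logic can be checked offline, without calling the Hugging Face API, by swapping in a stub that returns canned replies. The sketch below is illustrative only: `StubLLM`, `PROMPT`, and `average_relevancy` are hypothetical names, the prompt text is a placeholder rather than BeyondLLM's actual template, and the function skips replies with no score instead of failing on them.

```python
import re
from types import SimpleNamespace

# Placeholder prompt; NOT the real BeyondLLM CONTEXT_RELEVENCE template.
PROMPT = "Rate the relevance of the context.\nQuestion: {question}\nContext: {context}\nScore:"


class StubLLM:
    """Mimics the one call used here: complete(prompt) -> object with a .text attribute."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def complete(self, prompt):
        return SimpleNamespace(text=next(self._replies))


def average_relevancy(llm, query, context):
    """Average the per-chunk scores, ignoring replies that contain no score."""
    scores = []
    for chunk in context:
        reply = llm.complete(PROMPT.format(question=query, context=chunk)).text
        match = re.search(r"\b(10|[0-9])\b", reply)
        if match:  # skip replies with no parseable score
            scores.append(int(match.group(0)))
    return sum(scores) / len(scores) if scores else 0.0


stub = StubLLM(["Score: 4", "2", "I cannot rate this."])
result = average_relevancy(stub, "what causes heart disease", ["a", "b", "c"])
print(result)  # 3.0: the third reply is skipped, so (4 + 2) / 2
```

The same stub can stand in for `llm` when testing the other metric functions, since they also rely only on `complete(...).text`.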



## Answer Relevancy
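This function assesses how relevant the generated answer is to the query. Unlike the other two metrics, it returns the model's reply verbatim, so the output can include explanatory text after the score: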

In [None]:
def get_answer_relevancy(llm, query, response):
    score_response = llm.complete(ANSWER_RELEVENCE.format(question=query, context=response))
    score_str = score_response.text
    return f"Answer Relevancy Score: {score_str}"



## Groundedness
This function evaluates how well the response is supported by the retrieved context. It splits the response into statements, scores each statement against the combined context, and returns the average:

In [None]:
def get_groundedness(llm, query, context, response):
    total_score = 0
    score_count = 0

    statements = sent_tokenize(response)

    for statement in statements:
        score_response = llm.complete(GROUNDEDNESS.format(statement=statement, context=" ".join(context)))
        score_str = score_response.text
        score = float(extract_number(score_str))
        total_score += score
        score_count += 1

    average_groundedness = total_score / score_count if score_count > 0 else 0
    return f"Groundedness Score: {round(average_groundedness, 1)}"


## Execute Evaluation
Perform a query using the pipeline and evaluate its performance with the defined functions:

In [29]:
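# Create the query engine from the index; it is used below but not built in earlier cells
query_engine = index.as_query_engine(llm=llm)

query = "what doesnt cause heart diseases"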

retrieved_documents = query_engine.retrieve(query)
context = [doc.node.text for doc in retrieved_documents]

response = query_engine.query(query)

print(get_context_relevancy(llm, query, context))
print(get_answer_relevancy(llm, query, response.response))
print(get_groundedness(llm, query, context, response.response))

Context Relevancy Score: 3.0
Answer Relevancy Score: 6

The context provides some information about factors that can help protect against heart disease, but it does not provide a definitive list of things that do not cause heart disease. Therefore, the relevance score is 6.
Groundedness Score: 6.0


In [30]:
query = "what is the capital of turkey"

retrieved_documents = query_engine.retrieve(query)
context = [doc.node.text for doc in retrieved_documents]

response = query_engine.query(query)

print(get_context_relevancy(llm, query, context))
print(get_answer_relevancy(llm, query, response.response))
print(get_groundedness(llm, query, context, response.response))

Context Relevancy Score: 0.0
Answer Relevancy Score: 0
Groundedness Score: 5.0
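Because the evaluation functions return formatted strings, comparing runs means parsing the numbers back out. The sketch below shows one way to do that with a regular expression. `parse_score` is a hypothetical helper, and the input lines are copied from the output above.

```python
import re


def parse_score(line):
    """Split 'Name Score: value ...' into (name, float); return None if no number follows."""
    match = re.match(r"\s*(.+?) Score:\s*(\d+(?:\.\d+)?)", line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


# Output lines from the second query above.
outputs = [
    "Context Relevancy Score: 0.0",
    "Answer Relevancy Score: 0",
    "Groundedness Score: 5.0",
]
scores = dict(parse_score(line) for line in outputs)
print(scores)
```

Only the number directly after `Score:` is read, so a multi-line answer relevancy reply still yields its leading score. Returning numbers from the metric functions and formatting them at print time would avoid the parsing step.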


## Conclusion

The evaluation of the LlamaIndex pipeline with BeyondLLM revealed the following results for two example queries:

### Query 1: "What doesn't cause heart diseases?"

- **Context Relevancy Score:** 3.0 - The context is partially relevant but doesn’t fully address the query.
- **Answer Relevancy Score:** 6 - The response provides some relevant information but lacks a definitive list.
- **Groundedness Score:** 6.0 - The response is reasonably grounded based on the context.

### Query 2: "What is the capital of Turkey?"

- **Context Relevancy Score:** 0.0 - The context is not relevant to the query.
- **Answer Relevancy Score:** 0 - The response does not answer the query.
- **Groundedness Score:** 5.0 - The response shows some level of groundedness, though it doesn't match the query.

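These results show that the three metrics measure different things. Context relevancy and answer relevancy both drop to 0 when the indexed documents do not cover the query, as in Query 2. Groundedness compares the response only against the retrieved context, not against the query, which is why it can stay at 5.0 even when the response does not answer the question.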