# Part 2: Evaluating our LLM application

## Setup

In [None]:
!pip install -r requirements.txt

### PLEASE PASTE YOUR OWN API KEY HERE

In [3]:
import os
os.environ["OPENAI_API_KEY"] = ""

In [2]:
import nest_asyncio
nest_asyncio.apply()

## Step 8: Evaluate Retrieval

### Golden Context Dataset

A golden context dataset pairs each query with the source that correctly answers it and, optionally, the answer the LLM should return.


In [11]:
from pathlib import Path
import json

golden_dataset_path = Path("../datasets/eval_qst.json")
data = []
# Read the JSON file
with open(golden_dataset_path, 'r', encoding='utf-8') as json_file:
    # Load the JSON data into a list
    data = json.load(json_file)
    
len(data)

2966

Our dataset contains 'question' and 'source' pairs, along with the 'text' of the relevant passage. When an **ideal** context dataset like this is available, it is the best option for evaluating retrieval.
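
Before running retrieval over every entry, it can help to confirm that each record is well formed. The sketch below assumes each entry is a dict with non-empty `question` and `source` strings and treats `text` as optional; `find_invalid_entries` and `REQUIRED_KEYS` are hypothetical names used for illustration, not helpers from this repository.

```python
REQUIRED_KEYS = ("question", "source")  # assumed minimal schema for one entry

def find_invalid_entries(entries):
    """Return the indices of entries with a missing or empty required field."""
    invalid = []
    for i, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                invalid.append(i)
                break  # one bad field is enough to flag the entry
    return invalid

# Toy records; only the first one is complete.
sample = [
    {"question": "What are the side effects of doxycycline?",
     "source": "https://www.drugs.com/doxycycline.html"},
    {"question": "", "source": "https://www.drugs.com/doxycycline.html"},
    {"question": "A question without a source"},
]
print(find_invalid_entries(sample))
```

On the real dataset, `find_invalid_entries(data)` returning an empty list would mean every entry has both fields.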

In [5]:
data[:5]

[{'question': 'What are the side effects of doxycycline?',
  'source': 'https://www.drugs.com/doxycycline.html',
  'text': '(hives, difficult breathing, swelling in your face or throat) or a severe skin reaction (fever, sore throat, burning in your eyes, skin pain, red or purple skin rash that spreads and causes blistering and peeling). Seek medical treatment if you have a serious drug reaction that can affect many parts of your body. Symptoms may include: skin rash, fever, swollen glands, flu-like symptoms, muscle aches, severe weakness, unusual bruising, or yellowing of your skin or eyes. This reaction may occur several weeks after you began using doxycycline. Doxycycline may cause serious side effects. Call your doctor at once if you have: severe stomach pain, diarrhea that is watery or bloody; throat irritation, trouble swallowing; chest pain, irregular heart rhythm, feeling short of breath; little or no urination; low white blood cell counts - fever, chills, swollen glands, body a

### Evaluating Retrieval with hit rate

Given a query, we check whether our retriever pulls in the correct context to answer it. If the LLM does not have the right context, it cannot provide the right answer.

For each query in our evaluation dataset, we will measure the following:
1. Is the correct source included in any of the retrieved chunks?
2. What is the score our retriever gives to the correct source?


In [6]:
from utils import get_retriever

In [7]:
retriever = get_retriever(similarity_top_k=5, embedding_model_name='sentence-transformers/all-mpnet-base-v2')

LLM is explicitly disabled. Using MockLLM.


Now let's evaluate our retriever. We do this by checking how often the metadata (`drug_link`) of the top 5 retrieved sources matches the expected source.

In [8]:
from tqdm import tqdm
results = []
print(data[0])
for entry in tqdm(data):
    query = entry["question"]
    expected_source = entry['source']
    
    retrieved_nodes = retriever.retrieve(query)
    retrieved_sources = [node.metadata['drug_link'] for node in retrieved_nodes]
    
    # If our label does not include a section, then any sections on the page should be considered a hit.
    if "#" not in expected_source:
        retrieved_sources = [source.split("#")[0] for source in retrieved_sources]
    
    if expected_source in retrieved_sources:
        is_hit = True
        score = retrieved_nodes[retrieved_sources.index(expected_source)].score
    else:
        is_hit = False
        score = 0.0
    
    result = {
        "is_hit": is_hit,
        "score": score,
        "retrieved": retrieved_sources,
        "expected": expected_source,
        "query": query,
    }
    results.append(result)

100%|███████████████████████████████████████████████████████████████████████████████| 2966/2966 [17:30<00:00,  2.82it/s]


In [9]:
results[:2]

[{'is_hit': True,
  'score': 0.762622118,
  'retrieved': ['https://www.drugs.com/doxycycline.html',
   'https://www.drugs.com/cdi/doans-pills.html',
   'https://www.drugs.com/mtm/doxylamine.html',
   'https://www.drugs.com/cons/dolono.html',
   'https://www.drugs.com/doxazosin.html'],
  'expected': 'https://www.drugs.com/doxycycline.html',
  'query': 'What are the side effects of doxycycline?'},
 {'is_hit': True,
  'score': 0.724890232,
  'retrieved': ['https://www.drugs.com/spironolactone.html',
   'https://www.drugs.com/mtm/hydrochlorothiazide-and-spironolactone.html',
   'https://www.drugs.com/sprix.html',
   'https://www.drugs.com/spravato.html',
   'https://www.drugs.com/cdi/kapspargo-sprinkle.html'],
  'expected': 'https://www.drugs.com/spironolactone.html',
  'query': 'What are the side effects of spironolactone?'}]

In [10]:
total_hits = sum(result["is_hit"] for result in results)
hit_percentage = total_hits / len(results)
hit_percentage

0.8125421443020904

So this retrieval technique gives us a hit 81.25% of the time; that is, the expected document appears among the top 5 retrieved documents for 81.25% of the queries.
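
Because each result keeps the ordered list of retrieved sources, the same records can show how the hit rate would change with a smaller `similarity_top_k`, without re-running retrieval. The sketch below assumes result dicts shaped like the ones built above (`expected` and `retrieved` keys, with `retrieved` ordered best first); `hit_rate_at_k` is a hypothetical helper and the toy data is made up.

```python
def hit_rate_at_k(results, k):
    """Fraction of queries whose expected source is among the first k retrieved."""
    if not results:
        return 0.0
    hits = sum(r["expected"] in r["retrieved"][:k] for r in results)
    return hits / len(results)

# Toy results: a hit at rank 1, a hit at rank 3, and a miss.
toy_results = [
    {"expected": "a", "retrieved": ["a", "b", "c"]},
    {"expected": "b", "retrieved": ["x", "y", "b"]},
    {"expected": "c", "retrieved": ["x", "y", "z"]},
]
for k in (1, 3):
    print(k, hit_rate_at_k(toy_results, k))
```

Calling `hit_rate_at_k(results, 1)` on the real results would give the share of queries where the top-ranked chunk already comes from the expected page.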

In [11]:
average_score = sum(result["score"] for result in results) / len(results)
average_score

0.578564292256575

### Evaluating Retrieval Using LlamaIndex RetrieverEvaluator

In [None]:
from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
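
For intuition about the `mrr` metric requested above, here is a plain-Python sketch of the textbook definition of mean reciprocal rank. It is independent of the library's internals; `reciprocal_rank`, `mean_reciprocal_rank`, and the toy data are illustrative names and values only.

```python
def reciprocal_rank(expected, retrieved):
    """1/rank of the first match, or 0.0 if the expected item was not retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item == expected:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(pairs):
    """Average reciprocal rank over (expected, retrieved_list) pairs."""
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(e, r) for e, r in pairs) / len(pairs)

toy_pairs = [
    ("a", ["a", "b", "c"]),  # rank 1 -> 1.0
    ("b", ["x", "y", "b"]),  # rank 3 -> 1/3
    ("c", ["x", "y", "z"]),  # miss   -> 0.0
]
print(round(mean_reciprocal_rank(toy_pairs), 4))
```

Unlike hit rate, MRR rewards placing the correct source near the top of the list, so two retrievers with the same hit rate can still differ on MRR.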

## Step 9: Evaluating performance before Step 7 (RAG)

### Golden Responses Dataset Creation

To effectively evaluate our generated responses, we need "ground truth" responses. These ground truth responses can be generated by **feeding the correct context to a golden LLM**. Then, we can use an LLM to evaluate our generated responses compared to the ground truth responses.

We used the PaLM model (text-bison-1) as the golden LLM, since it has been shown to align well with human preferences and was the best model we had access to.

### Generating Golden Responses for reference

In [4]:
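from tqdm import tqdm
from llama_index import ServiceContext
from llama_index.response_synthesizers import get_response_synthesizer
from llama_index.schema import TextNode, NodeWithScore

def generate_responses(entries, llm):
    # Use a context window 500 tokens smaller than the model's limit.
    context_window = llm.metadata.context_window - 500
    service_context = ServiceContext.from_defaults(llm=llm, context_window=context_window)
    rs = get_response_synthesizer(service_context=service_context)

    responses = []
    for entry in tqdm(entries):
        query = entry["question"]

        # Feed the known-correct context directly to the LLM, bypassing retrieval.
        context = entry["text"]
        nodes = [NodeWithScore(node=TextNode(text=context))]

        response = rs.synthesize(query, nodes=nodes)
        responses.append(response.response)
    return responses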

We can now generate our reference responses. Let's generate 10 reference responses and save them to a file.

In [None]:
!pip install -q google-generativeai

### PLEASE PASTE YOUR OWN API KEY HERE

In [8]:
import google.generativeai as palm
palm_api_key = ""
palm.configure(api_key=palm_api_key)
from llama_index.llms.palm import PaLM

In [29]:
llm = PaLM(api_key=palm_api_key)
ten_samples = data[:10]
golden_responses = generate_responses(ten_samples, llm)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [02:35<00:00, 15.55s/it]


In [30]:
reference_dataset = [{"question": entry["question"], "source": entry["source"], "response": response} for entry, response in zip(ten_samples, golden_responses)]

In [31]:
with open("../datasets/golden-responses.json", "w") as file:
    json.dump(reference_dataset, file, indent=4)

### Generating Barebones Responses for gpt-3.5-turbo (no RAG) for reference

We will evaluate how the **gpt-3.5-turbo** model performs when given queries about drug side effects but **no context**. This performance establishes a baseline for comparison with our RAG approach.

In [3]:
from tqdm import tqdm
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.response_synthesizers import get_response_synthesizer
from llama_index.schema import TextNode, NodeWithScore

def generate_bare_responses(entries, llm):
    responses = []
    for entry in tqdm(entries):
        query = entry["question"]
        response = llm.complete(query)
        responses.append(response)
    return responses

In [30]:
llm = OpenAI(model='gpt-3.5-turbo', temperature=0.0, max_tokens=64) # max_tokens=512
ten_samples = data[0:3]
bare_responses1 = generate_bare_responses(ten_samples, llm)

100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:47<00:00, 15.77s/it]


In [32]:
llm = OpenAI(model='gpt-3.5-turbo', temperature=0.0, max_tokens=64) # max_tokens=512
ten_samples = data[3:6]
bare_responses2 = generate_bare_responses(ten_samples, llm)

100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:46<00:00, 15.60s/it]


In [33]:
llm = OpenAI(model='gpt-3.5-turbo', temperature=0.0, max_tokens=64) # max_tokens=512
ten_samples = data[6:10]
bare_responses3 = generate_bare_responses(ten_samples, llm)

100%|█████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:12<00:00, 18.24s/it]


In [36]:
bare_responses = bare_responses1 + bare_responses2 + bare_responses3

In [53]:
for i in range(len(bare_responses)):
    bare_responses[i] = str(bare_responses[i])

In [65]:
ten_samples = data[:10]

In [66]:
barebones_dataset = [{"question": entry["question"], "source": entry["source"], "response": response} for entry, response in zip(ten_samples, bare_responses)]

In [67]:
print(len(barebones_dataset))
with open("../datasets/bare-responses-gpt.json", "w") as file:
    json.dump(barebones_dataset, file, indent=4)

10


### Evaluating our bare LLM (gpt-3.5-turbo) 

#### Using the LlamaIndex Correctness Evaluator on Golden Responses Dataset

In [1]:
import json
with open("../datasets/golden-responses.json", "r") as file:
    golden_responses = json.load(file)

In [3]:
with open("../datasets/bare-responses-gpt.json", "r") as file:
    bare_responses = json.load(file)

In [8]:
from llama_index.evaluation import CorrectnessEvaluator

In [9]:
from llama_index import ServiceContext
from llama_index.llms.palm import PaLM
palm_api_key = ""  # paste your own PaLM API key here
eval_llm = PaLM(api_key=palm_api_key, temperature=0.0)
service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator = CorrectnessEvaluator(service_context=service_context)

In [10]:
eval_results = []
from tqdm import tqdm
for bare_response, golden_response in tqdm(list(zip(bare_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    bare_answer = bare_response["response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=bare_answer)
    eval_results.append(eval_result)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [22:20<00:00, 134.06s/it]


In [None]:
[r.score for r in eval_results]

In [12]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]

In [13]:
with open("eval-scores-bare-gpt.json", "w") as file:
    json.dump(scores, file, indent=4)

In [14]:
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.3

#### Using the LlamaIndex Correctness Evaluator on User Responses Dataset

In [None]:
import json
with open("../datasets/bare-responses-gpt.json", "r") as file:
    pred_responses = json.load(file)

In [None]:
eval_results = []
from tqdm import tqdm
for bare_response, golden_response in tqdm(list(zip(pred_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    bare_answer = bare_response["response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=bare_answer)
    eval_results.append(eval_result)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:29<00:00, 14.93s/it]


In [None]:
[r.score for r in eval_results]

[3.0, 3.0, 3.5, 3.0, 3.0, 3.5, 3.0, 4.0, 3.5, 3.0]

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]
with open("gpt3.5vshuman.json", "w") as file:
    json.dump(scores, file, indent=4)
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.25

#### Industry Metrics on User Responses Dataset

In [1]:
import json
with open("../datasets/human1_responses.json", "r") as file:
    human1 = json.load(file)
with open("../datasets/human2_responses.json", "r") as file:
    human2 = json.load(file)
with open("../datasets/human3_responses.json", "r") as file:
    human3 = json.load(file)

In [2]:
with open("../datasets/eval-scores-rag-gpt.json", "r") as file:
    rag = json.load(file)
with open("../datasets/bare-responses-gpt.json", "r") as file:
    bare_llm = json.load(file)

In [3]:
human1_responses = []
human2_responses = []
human3_responses = []
rag_responses = []
bare_responses = []

for i in range(0, 10):
    human1_responses.append(human1[i]["response"])
    human2_responses.append(human2[i]["response"])
    human3_responses.append(human3[i]["response"])
    rag_responses.append(rag[i]["generated_response"])
    bare_responses.append(bare_llm[i]["response"])
references_dict = {
    "human_1": human1_responses,
    "human_2": human2_responses,
    "human_3": human3_responses,
}

In [4]:
from eval import generate_human_eval_summary
barellm_vs_humans_result = generate_human_eval_summary(references_dict, bare_responses, "Bare LLM")



Calculating ROUGE Score...
Calculating BLEU Score...
Calculating BERT Score...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calculating METEOR Score...


[nltk_data] Downloading package wordnet to /home/aditi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/aditi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [5]:
print(barellm_vs_humans_result["rouge"])

  Bare LLM    rouge1    rouge2    rougeL  rougeLsum
0  human_1  0.348891  0.133115  0.240907   0.240724
1  human_2  0.329711  0.109742  0.225374   0.224765
2  human_3  0.291540  0.105376  0.220748   0.220880
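
For intuition about what the `rouge1` column measures, here is a minimal sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes lowercased whitespace tokenization with no stemming, so it will not reproduce the numbers above exactly; the implementation inside the `eval` module is not shown here, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter

def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 between a reference and a predicted answer."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "nausea vomiting and mild diarrhea"
prediction = "nausea and diarrhea"
# 3 shared tokens: precision 3/3, recall 3/5
print(rouge1_f1(reference, prediction))
```

A short answer that only repeats words from the reference gets perfect precision but low recall, which is why the F1 form is reported.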


In [6]:
print(barellm_vs_humans_result["bleu"])

  Bare LLM     bleu
0  human_1  0.03488
1  human_2  0.03488
2  human_3  0.03488


In [7]:
print(barellm_vs_humans_result["meteor"])

  Bare LLM    meteor
0  human_1  0.172977
1  human_2  0.172977
2  human_3  0.172977


## Step 10: Evaluating our RAG Query Engine

#### Using the LlamaIndex Correctness Evaluator on Golden Responses Dataset

In [4]:
import json
with open("../datasets/golden-responses.json", "r") as file:
    golden_responses = json.load(file)

In [8]:
from utils import get_query_engine
from tqdm import tqdm

In [9]:
query_engine = get_query_engine(similarity_top_k=5, llm_model_name='gpt-3.5-turbo', embedding_model_name='sentence-transformers/all-mpnet-base-v2')

# Store both the original response object and the response string.
rag_responses = []
rag_response_str = []

for entry in tqdm(golden_responses):
    query = entry["question"]
    response = query_engine.query(query)
    rag_responses.append(response)
    rag_response_str.append(response.response)
store_rag_responses = rag_responses

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [02:27<00:00, 14.78s/it]


In [10]:
rag_response_str[0]

'The side effects of doxycycline may include nausea and vomiting, upset stomach, loss of appetite, mild diarrhea, skin rash or itching, darkened skin color, vaginal itching or discharge.'

In [11]:
from llama_index.evaluation import CorrectnessEvaluator

In [12]:
from llama_index import ServiceContext
from llama_index.llms.palm import PaLM
palm_api_key = ""  # paste your own PaLM API key here
eval_llm = PaLM(api_key=palm_api_key, temperature=0.0)
service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator = CorrectnessEvaluator(service_context=service_context)

In [13]:
eval_results = []
for rag_response, golden_response in tqdm(list(zip(rag_response_str, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    generated_answer = rag_response
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=generated_answer)
    eval_results.append(eval_result)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [02:24<00:00, 14.48s/it]


In [14]:
[r.score for r in eval_results]

[4.5, 3.5, 3.5, 4.5, 4.0, 3.5, 4.5, 4.0, 5.0, 5.0]

Let's save the query, both responses, and the score to a JSON file.

In [16]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]

In [17]:
with open("eval-scores-rag-gpt.json", "w") as file:
    json.dump(scores, file, indent=4)

We can also calculate the average score.

In [18]:
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

4.2

#### Using the LlamaIndex Correctness Evaluator on User Responses Dataset

In [None]:
import json
with open("../datasets/eval-scores-rag-gpt.json", "r") as file:
    pred_responses = json.load(file)

In [None]:
eval_results = []
from tqdm import tqdm
for pred_response, golden_response in tqdm(list(zip(pred_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    generated_answer = pred_response["generated_response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=generated_answer)
    eval_results.append(eval_result)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:25<00:00, 14.52s/it]


In [None]:
[r.score for r in eval_results]

[4.5, 4.0, 3.5, 3.0, 3.5, 3.5, 4.5, 3.5, 4.5, 4.0]

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]
with open("gpt3.5ragvshuman.json", "w") as file:
    json.dump(scores, file, indent=4)
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.85
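
The two averages on the user responses dataset (3.25 for the bare LLM, 3.85 for RAG) hide how individual questions moved. The sketch below compares the per-question scores printed in this notebook, assuming both lists follow the same question order; `compare_scores` is a hypothetical helper.

```python
# Per-question correctness scores copied from the outputs in this notebook.
bare_scores = [3.0, 3.0, 3.5, 3.0, 3.0, 3.5, 3.0, 4.0, 3.5, 3.0]
rag_scores = [4.5, 4.0, 3.5, 3.0, 3.5, 3.5, 4.5, 3.5, 4.5, 4.0]

def compare_scores(baseline, candidate):
    """Summarize per-question differences between two aligned score lists."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    return {
        "mean_diff": sum(diffs) / len(diffs),
        "improved": sum(d > 0 for d in diffs),
        "unchanged": sum(d == 0 for d in diffs),
        "worse": sum(d < 0 for d in diffs),
    }

print(compare_scores(bare_scores, rag_scores))
```

With only ten questions, the count of improved versus worse questions is a useful companion to the mean, since a single outlier can shift the average noticeably.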

#### Industry Metrics on User Responses Dataset

In [10]:
from eval import generate_human_eval_summary
rag_vs_humans_result = generate_human_eval_summary(references_dict, rag_responses, "RAG")

Calculating ROUGE Score...
Calculating BLEU Score...
Calculating BERT Score...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calculating METEOR Score...


[nltk_data] Downloading package wordnet to /home/aditi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/aditi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [13]:
rag_vs_humans_result["meteor"].to_csv("r.csv", index=False)

In [11]:
print(rag_vs_humans_result["rouge"])

       RAG    rouge1    rouge2    rougeL  rougeLsum
0  human_1  0.584965  0.441892  0.402719   0.401882
1  human_2  0.567369  0.403966  0.379018   0.379128
2  human_3  0.589005  0.489582  0.570795   0.572329


In [12]:
print(rag_vs_humans_result["bleu"])

       RAG      bleu
0  human_1  0.377064
1  human_2  0.377064
2  human_3  0.377064


In [14]:
print(rag_vs_humans_result["meteor"])

       RAG    meteor
0  human_1  0.474049
1  human_2  0.474049
2  human_3  0.474049


### Evaluation Summary of Bare LLM vs RAG in Industry Standard Metrics on Golden Responses Dataset

In [16]:
with open("../datasets/bare-responses-gpt.json", "r") as file:
    bare_llm = json.load(file)
with open("../datasets/eval-scores-rag-gpt.json", "r") as file:
    rag = json.load(file)
with open("../datasets/golden-responses.json", "r") as file:
    golden = json.load(file)

rag_responses = []
bare_responses = []
golden_responses = []
for i in range(0, 10):
    rag_responses.append(rag[i]["generated_response"])
    bare_responses.append(bare_llm[i]["response"])
    golden_responses.append(golden[i]["response"])
    
predictions_dict = {
    "Bare LLM": bare_responses,
    "RAG": rag_responses,
}

In [None]:
from eval import generate_metrics_summary
result = generate_metrics_summary(golden_responses, predictions_dict)

In [18]:
print(result["rouge"])

     System    rouge1    rouge2    rougeL  rougeLsum
0  Bare LLM  0.410927  0.172764  0.338584   0.335817
1       RAG  0.729363  0.690272  0.732632   0.728783


In [19]:
print(result["bleu"])

     System      bleu
0  Bare LLM  0.074376
1       RAG  0.594337


In [20]:
print(result["bert"])

     System  average_bertscore_precision  average_bertscore_recall  \
0  Bare LLM                     0.867267                  0.879961   
1       RAG                     0.949020                  0.970628   

   average_bertscore_f1  
0              0.872771  
1              0.958935  


In [21]:
print(result["meteor"])

     System    meteor
0  Bare LLM  0.339511
1       RAG  0.801181
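
To read the four tables side by side, the absolute gain of RAG over the bare LLM can be computed per metric. The values below are copied from the outputs above (ROUGE-1, BLEU, BERTScore F1, METEOR); the `summary` dict layout is only an illustration and is not the structure returned by `generate_metrics_summary`.

```python
# (bare LLM, RAG) scores against the golden responses, copied from the tables above.
summary = {
    "rouge1": (0.410927, 0.729363),
    "bleu": (0.074376, 0.594337),
    "bertscore_f1": (0.872771, 0.958935),
    "meteor": (0.339511, 0.801181),
}

# Absolute improvement of RAG over the bare LLM for each metric.
gains = {metric: rag - bare for metric, (bare, rag) in summary.items()}

for metric, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric:>13}: +{gain:.3f}")
```

RAG improves on every metric; the BERTScore gap is the smallest in absolute terms because both systems already score above 0.87 there.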


### Evaluation without Golden Responses and User Responses

Generating reference responses and then using them for evaluation can give us a fairly accurate assessment of how our query engine is performing. However, this approach can be expensive and is biased toward the reference dataset.

#### Evaluating for faithfulness/relevancy

In [8]:
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index import ServiceContext
# from llama_index.llms import OpenAI
import google.generativeai as palm

palm_api_key = ""  # paste your own PaLM API key here
palm.configure(api_key=palm_api_key)
from llama_index.llms.palm import PaLM

from tqdm import tqdm
def evaluate(queries: list, responses: list, metric: str):
    llm = PaLM(api_key=palm_api_key, temperature=0.0)
    service_context = ServiceContext.from_defaults(llm=llm)
    
    if metric == 'faithfulness':
        evaluator = FaithfulnessEvaluator(service_context=service_context)
    elif metric == 'relevancy':
        evaluator = RelevancyEvaluator(service_context=service_context)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    evals = []
    for query, response in tqdm(list(zip(queries, responses))):
        eval_result = evaluator.evaluate_response(query=query, response=response)
        evals.append(eval_result)
    
    return evals

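def get_pass_rate(evals):
    # Count only the evaluations that passed; taking len() of the list of
    # flags would count every evaluation and always return 1.0.
    return sum(1 for val in evals if val.passing) / len(evals)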
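
A pass-rate helper is easy to check on stand-in results with a known mix of passes and failures. The sketch below uses a hypothetical `FakeEvalResult` dataclass in place of the evaluator's result objects and assumes only that each result exposes a boolean `passing` attribute.

```python
from dataclasses import dataclass

@dataclass
class FakeEvalResult:
    """Stand-in for an evaluation result; only the `passing` flag is modeled."""
    passing: bool

def pass_rate(evals):
    """Fraction of evaluations whose `passing` flag is true."""
    if not evals:
        return 0.0
    return sum(1 for e in evals if e.passing) / len(evals)

# Three passes and one failure.
toy_evals = [
    FakeEvalResult(True),
    FakeEvalResult(True),
    FakeEvalResult(False),
    FakeEvalResult(True),
]
print(pass_rate(toy_evals))
```

A helper that reports 1.0 on this input is counting evaluations rather than passes.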

In [None]:
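# evaluate_response needs the original response objects (with their source nodes), not plain strings.
faithfulness_results = evaluate(queries=[sample["question"] for sample in ten_samples], responses=store_rag_responses, metric='faithfulness')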

In [32]:
faithfulness_score = get_pass_rate(faithfulness_results)
faithfulness_score

1.0

In [33]:
relevancy_results = evaluate(queries=[sample["question"] for sample in ten_samples], responses=store_rag_responses, metric='relevancy')

100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [02:14<00:00, 14.99s/it]


In [34]:
relevancy_score = get_pass_rate(relevancy_results)
relevancy_score

1.0

## Mistral Evaluation

### Setup

### PLEASE PASTE YOUR OWN API KEY HERE

In [1]:
import os
os.environ["REPLICATE_API_TOKEN"]=""

In [2]:
import pinecone
api_key = ""
pinecone.init(api_key=api_key, environment="gcp-starter")


In [3]:
pinecone_index = pinecone.Index("langchain-retrieval-agent")

In [5]:
import numpy as np
from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding

def get_embedding_model(model_name, embed_batch_size=100):
    if model_name == "text-embedding-ada-002":
        return OpenAIEmbedding(
            model=model_name,
            embed_batch_size=embed_batch_size,
            api_key=os.environ["OPENAI_API_KEY"])
    else:
        return HuggingFaceEmbedding(
            model_name=model_name,
            embed_batch_size=embed_batch_size)

### Step 9: Evaluating performance before Step 7 (bare Mistral w/o RAG)

In [4]:
from tqdm import tqdm
from llama_index import ServiceContext

def generate_bare_responses(entries, llm):
    responses = []
    for entry in tqdm(entries):
        query = entry["question"]
        response = llm.complete(query)
        responses.append(response)
    return responses

In [5]:
import json
with open("../datasets/golden-responses.json", "r") as file:
    data = json.load(file)

In [6]:
from llama_index.llms import Replicate
llm = Replicate(
    model="mistralai/mistral-7b-v0.1:3e8a0fb6d7812ce30701ba597e5080689bef8a013e5c6a724fafb108cc2426a0"
)
ten_samples = data[0:10]
bare_responses = generate_bare_responses(ten_samples, llm)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:51<00:00,  5.15s/it]


In [15]:
bare_responses = [str(i) for i in bare_responses]

In [17]:
barebones_dataset = [{"question": entry["question"], "source": entry["source"], "response": response} for entry, response in zip(ten_samples, bare_responses)]

In [18]:
print(len(barebones_dataset))
with open("../datasets/bare-responses-mistral.json", "w") as file:
    json.dump(barebones_dataset, file, indent=4)

10


#### Using the LlamaIndex Correctness Evaluator on Golden Responses Dataset

In [34]:
with open("../datasets/bare-responses-mistral.json", "r") as file:
    bare_responses = json.load(file)

In [37]:
from llama_index import ServiceContext
from llama_index.llms.palm import PaLM
from llama_index.evaluation import CorrectnessEvaluator
palm_api_key = ""  # paste your own PaLM API key here
eval_llm = PaLM(api_key=palm_api_key, temperature=0.0)
service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator = CorrectnessEvaluator(service_context=service_context)

In [11]:
import json
with open("../datasets/golden-responses.json", "r") as file:
    golden_responses = json.load(file)

In [38]:
eval_results = []
from tqdm import tqdm
for bare_response, golden_response in tqdm(list(zip(bare_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    bare_answer = bare_response["response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=bare_answer)
    eval_results.append(eval_result)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [02:32<00:00, 15.20s/it]


In [39]:
[r.score for r in eval_results]

[3.5, 3.5, 3.5, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.5]

In [40]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]

In [41]:
# Write next to the other datasets so later cells can read the file back.
with open("../datasets/eval-scores-bare-mistral.json", "w") as file:
    json.dump(scores, file, indent=4)

In [42]:
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.2

#### Using the LlamaIndex Correctness Evaluator on User Responses Dataset

In [None]:
import json
with open("../datasets/eval-scores-bare-mistral.json", "r") as file:
    pred_responses = json.load(file)

In [None]:
eval_results = []
from tqdm import tqdm
for pred_response, golden_response in tqdm(list(zip(pred_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    bare_answer = pred_response["generated_response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=bare_answer)
    eval_results.append(eval_result)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:29<00:00, 14.94s/it]


In [None]:
[r.score for r in eval_results]

[3.0, 3.5, 3.5, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.5]

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]
with open("mistralvshuman.json", "w") as file:
    json.dump(scores, file, indent=4)
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.25

#### Industry Metrics on User Responses Dataset

In [None]:
import json
with open("../datasets/human1_responses.json", "r") as file:
    human1 = json.load(file)
with open("../datasets/human2_responses.json", "r") as file:
    human2 = json.load(file)
with open("../datasets/human3_responses.json", "r") as file:
    human3 = json.load(file)
with open("../datasets/eval-scores-rag-mistral.json", "r") as file:
    rag = json.load(file)
with open("../datasets/bare-responses-mistral.json", "r") as file:
    bare_llm = json.load(file)
with open("../datasets/human1_responses.json", "r") as file:
    golden_responses = json.load(file)

In [None]:
human1_responses = []
human2_responses = []
human3_responses = []
rag_responses = []
bare_responses = []

for i in range(0, 10):
    human1_responses.append(human1[i]["response"])
    human2_responses.append(human2[i]["response"])
    human3_responses.append(human3[i]["response"])
    rag_responses.append(rag[i]["generated_response"])
    bare_responses.append(bare_llm[i]["response"])
references_dict = {
    "human_1": human1_responses,
    "human_2": human2_responses,
    "human_3": human3_responses,
}

In [None]:
from eval import generate_human_eval_summary
barellm_vs_humans_result = generate_human_eval_summary(references_dict, bare_responses, "Bare Mistral-LLM")

Calculating ROUGE Score...
Calculating BLEU Score...
Calculating BERT Score...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calculating METEOR Score...


[nltk_data] Downloading package wordnet to /home/aditi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/aditi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [None]:
print(barellm_vs_humans_result["rouge"])

  Bare Mistral-LLM    rouge1    rouge2    rougeL  rougeLsum
0          human_1  0.282957  0.084747  0.167597   0.184145
1          human_2  0.282700  0.075186  0.167647   0.187259
2          human_3  0.266228  0.069078  0.170472   0.198220
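
To make the `rouge1` column easier to read, here is a minimal sketch of what ROUGE-1 measures: the F1 score of unigram overlap between a reference and a prediction. It is a simplification (whitespace tokenization, no stemming) and is not the implementation used by `generate_human_eval_summary`; the function name `rouge1_f1` and the example strings are assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1: F1 of unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings, not taken from the datasets above.
reference = "doxycycline may cause nausea and skin rash"
prediction = "doxycycline can cause nausea and vomiting"
score = rouge1_f1(reference, prediction)
print(f"ROUGE-1 F1: {score:.4f}")
```

Four of the six predicted tokens appear among the seven reference tokens, so precision is 4/6 and recall is 4/7. A low `rouge1` therefore means the generated answer shares few words with the human answer, even if it is factually reasonable.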


In [None]:
print(barellm_vs_humans_result["bleu"])

  Bare Mistral-LLM      bleu
0          human_1  0.040892
1          human_2  0.040892
2          human_3  0.040892


In [None]:
print(barellm_vs_humans_result["meteor"])

  Bare Mistral-LLM    meteor
0          human_1  0.195242
1          human_2  0.195242
2          human_3  0.195242


### Step 10: Evaluating our RAG Query Engine

In [6]:
from llama_index.llms import Replicate
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import PineconeVectorStore

mistral = Replicate(
    model="mistralai/mistral-7b-v0.1:3e8a0fb6d7812ce30701ba597e5080689bef8a013e5c6a724fafb108cc2426a0"
)
embedding_model = get_embedding_model("sentence-transformers/all-mpnet-base-v2")

service_context = ServiceContext.from_defaults(embed_model=embedding_model, llm=mistral)

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    add_sparse_vector=True,
)
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=3)

In [9]:
# Store both the original response object and the response string.
rag_responses = []
rag_response_str = []
from tqdm import tqdm
for entry in tqdm(golden_responses):
    query = entry["question"]
    response = query_engine.query(query)
    rag_responses.append(response)
    rag_response_str.append(response.response)
store_mistral_rag_responses = rag_responses

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:52<00:00,  5.29s/it]


#### Using the LlamaIndex Correctness Evaluator on Golden Responses Dataset

In [7]:
from llama_index.evaluation import CorrectnessEvaluator
with open("../datasets/human1_responses.json", "r") as file:
    golden_responses = json.load(file)

In [8]:
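import os
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms.palm import PaLM

# Read the PaLM API key from the environment instead of hard-coding it
# (the variable name PALM_API_KEY is an assumption; set it to your own key).
palm_api_key = os.environ["PALM_API_KEY"]
eval_llm = PaLM(api_key=palm_api_key, temperature=0.0)
service_context = ServiceContext.from_defaults(llm=eval_llm)
evaluator = CorrectnessEvaluator(service_context=service_context)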

In [9]:
rag_response_str = [
    'The side effects of doxycycline can include nausea and vomiting, upset stomach, loss of appetite, mild diarrhea, skin rash or itching, darkened skin color, vaginal itching or discharge, severe stomach pain, diarrhea that is watery or bloody, throat irritation, trouble swallowing, chest pain, irregular heart rhythm, feeling short of breath, little or no urination, low white blood cell counts - fever, chills, swollen glands, body aches, weakness, pale skin, easy bruising or bleeding, severe headaches, ring',
    'Based on the context information provided, some common side effects of spironolactone may include breast swelling or tenderness. However, it is important to note that spironolactone may also cause more serious side effects, such as high potassium levels, low potassium levels, and low sodium levels, which can lead to symptoms such as nausea, weakness, chest pain, irregular heartbeats, and loss of movement. It is important to consult with a healthcare professional for a complete list of potential side effects and to discuss any concerns you may have.',
    'The side effects of minocycline can include:\n\n* Nausea, vomiting, loss of appetite, diarrhea, and abdominal pain\n* Headache, dizziness, and lightheadedness\n* Hair loss\n* Skin rash, itching, and discoloration\n* Joint pain, muscle aches, and weakness\n* Fatigue, weakness, and weight loss\n* Flu-like symptoms, such as fever, chills, and sore throat\n* Difficulty swallowing, swelling of the tongue, and mouth sores\n* Chest',
    'The side effects of Accutane can include dryness of the skin, lips, eyes, or nose (with possible nosebleeds), vision problems, headache, back pain, joint pain, muscle problems, skin reactions, and cold symptoms such as stuffy nose, sneezing, and sore throat. \nSevere side effects can include severe stomach or chest pain, pain when swallowing, heartburn, diarrhea, rectal bleeding, bloody or tarry stools, increased pressure inside the skull (with severe headaches, ringing in the ears, dizziness, nausea, vision problems, and',
    'The side effects of clindamycin can include burning, itching, dryness, peeling or redness of treated skin; oily skin, nausea, vomiting, stomach pain, mild skin rash, or vaginal itching or discharge. Severe side effects can include severe redness, itching, or dryness of treated skin areas; severe stomach pain, diarrhea that is watery or bloody (even if it occurs months after your last dose); or a severe skin reaction (fever, sore throat, burning in your eyes, skin pain, red or purple skin rash that',
    'The side effects of Aldactone may include breast swelling or tenderness.',
    'The side effects of tretinoin may include severe burning, stinging, or irritation of treated skin; severe skin dryness; or severe redness, swelling, blistering, peeling, or crusting. Your skin may be more sensitive to weather extremes such as cold and wind while using tretinoin topical. Common side effects of tretinoin topical may include skin pain, redness, burning, itching, or irritation; sore throat ; mild warmth or stinging where the medicine was applied; or changes in color of treated skin.',
    'The side effects of isotretinoin include: dryness of skin, lips, eyes, or nose; vision problems; headache, back pain, joint pain, muscle problems; skin reactions; or cold symptoms such as stuffy nose, sneezing, sore throat.\n\nIt is important to note that not all patients will experience these side effects, and some may be more severe or last longer than others. \nIf you experience any side effects, it is important to speak with your healthcare provider, as they may be able to adjust your treatment plan or recommend additional medications to help manage your symptoms.',
    'The side effects of Bactrim may include nausea, vomiting, loss of appetite; or skin rash. More severe side effects may include severe stomach pain, diarrhea that is watery or bloody, yellowing of your skin or eyes, seizure, new or unusual joint pain, increased or decreased urination, swelling, bruising, or irritation around the IV needle, increased thirst, dry mouth, fruity breath odor, new or worsening cough, fever, trouble breathing, high blood potassium, low blood sodium, or low blood cell counts. If you',
    'The side effects of Retin-A may include mild warmth or stinging where the medicine was applied; or changes in color of treated skin. Severe side effects may include severe burning, stinging, or irritation of treated skin; severe redness, swelling, blistering, peeling, or crusting; or difficulty breathing, swelling of the face, lips, tongue, or throat.',
]

In [13]:
eval_results = []
from tqdm import tqdm
for rag_response, golden_response in tqdm(list(zip(rag_response_str, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    generated_answer = rag_response
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=generated_answer)
    eval_results.append(eval_result)

100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [02:30<00:00, 15.03s/it]


In [14]:
[r.score for r in eval_results]

[3.5, 3.5, 3.0, 3.5, 3.5, 5.0, 4.5, 4.5, 3.5, 3.5]

Let's save the query, both responses, the score, and the evaluator's reasoning to a JSON file.

In [15]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]

In [16]:
# Write next to the other datasets so later cells can read the file back.
with open("../datasets/eval-scores-rag-mistral.json", "w") as file:
    json.dump(scores, file, indent=4)

We can also calculate the average score.

In [17]:
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.8
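
A single average hides where the change comes from. As a rough sketch, the per-question scores printed above for the bare Mistral run and the Mistral RAG run can be compared directly. This assumes both lists are ordered by the same ten questions; the variable names are illustrative.

```python
from statistics import mean

# Correctness scores copied from the evaluator outputs shown in this notebook.
bare_scores = [3.5, 3.5, 3.5, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.5]
rag_scores = [3.5, 3.5, 3.0, 3.5, 3.5, 5.0, 4.5, 4.5, 3.5, 3.5]

bare_avg = mean(bare_scores)
rag_avg = mean(rag_scores)

# Per-question change when retrieval is added (assumes matching question order).
deltas = [rag - bare for rag, bare in zip(rag_scores, bare_scores)]
improved = sum(1 for d in deltas if d > 0)
worse = sum(1 for d in deltas if d < 0)

print(f"bare average: {bare_avg:.2f}, RAG average: {rag_avg:.2f}")
print(f"improved: {improved}, worse: {worse}, unchanged: {len(deltas) - improved - worse}")
```

With only ten questions, the per-question view is more informative than the mean alone, since one or two large jumps can move the average noticeably.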

#### Using the LlamaIndex Correctness Evaluator on User Responses Dataset

In [None]:
import json
with open("../datasets/eval-scores-rag-mistral.json", "r") as file:
    pred_responses = json.load(file)

In [None]:
eval_results = []
from tqdm import tqdm
for pred_response, golden_response in tqdm(list(zip(pred_responses, golden_responses))):
    query = golden_response["question"]
    golden_answer = golden_response["response"]
    rag_answer = pred_response["generated_response"]
    
    eval_result = evaluator.evaluate(query=query, reference=golden_answer, response=rag_answer)
    eval_results.append(eval_result)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:24<00:00, 14.44s/it]


In [None]:
[r.score for r in eval_results]

[3.0, 3.5, 3.5, 4.5, 3.5, 3.5, 4.5, 4.5, 4.5, 3.5]

In [None]:
scores = [
    {"question": golden_response["question"],
     "golden_response": golden_response["response"],
     "generated_response": eval_result.response,
     "score": eval_result.score,
     "reasoning": eval_result.feedback,
    }
    for eval_result, golden_response in zip(eval_results, golden_responses)
]
with open("mistralragvshuman.json", "w") as file:
    json.dump(scores, file, indent=4)
average_scores = sum(score["score"] for score in scores) / len(scores)
average_scores

3.85

#### Industry Metrics on User Responses Dataset

In [None]:
from eval import generate_human_eval_summary
rag_vs_humans_result = generate_human_eval_summary(references_dict, rag_responses, "Mistral+RAG")

Calculating ROUGE Score...
Calculating BLEU Score...
Calculating BERT Score...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calculating METEOR Score...


[nltk_data] Downloading package wordnet to /home/aditi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/aditi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [None]:
print(rag_vs_humans_result["rouge"])

  Mistral+RAG    rouge1    rouge2    rougeL  rougeLsum
0     human_1  0.599769  0.461251  0.521128   0.529843
1     human_2  0.587186  0.427751  0.499804   0.506154
2     human_3  0.490641  0.335950  0.352346   0.367209


In [None]:
print(rag_vs_humans_result["bleu"])

  Mistral+RAG      bleu
0     human_1  0.307101
1     human_2  0.307101
2     human_3  0.307101


In [None]:
print(rag_vs_humans_result["meteor"])

  Mistral+RAG    meteor
0     human_1  0.416993
1     human_2  0.416993
2     human_3  0.416993


### Industry standard metric summary of gpt-3.5-turbo RAG vs Mistral RAG on golden context dataset

In [15]:
import json
with open("../datasets/bare-responses-gpt.json", "r") as file:
    bare_llm = json.load(file)
with open("../datasets/bare-responses-mistral.json", "r") as file:
    bare_llm_mistral = json.load(file)
with open("../datasets/eval-scores-rag-gpt.json", "r") as file:
    gpt_rag = json.load(file)
with open("../datasets/golden-responses.json", "r") as file:
    golden = json.load(file)
with open("../datasets/eval-scores-rag-mistral.json", "r") as file:
    mistral_rag = json.load(file)

gpt_rag_responses = []
bare_responses = []
bare_mistral_responses = []
golden_responses = []
mistral_rag_responses = []

for i in range(0, 10):
    gpt_rag_responses.append(gpt_rag[i]["generated_response"])
    mistral_rag_responses.append(mistral_rag[i]["generated_response"])
    bare_responses.append(bare_llm[i]["response"])
    bare_mistral_responses.append(bare_llm_mistral[i]["response"])
    golden_responses.append(golden[i]["response"])
    
predictions_dict = {
    "Bare gpt-3.5.-turbo": bare_responses,
    "Bare mistral-7b-instruct-v0.1": bare_mistral_responses,
    "gpt-3.5-RAG": gpt_rag_responses,
    "mistral-RAG": mistral_rag_responses
}
from eval import generate_metrics_summary
result = generate_metrics_summary(golden_responses, predictions_dict)

Calculating ROUGE Score...
Calculating BLEU Score...
Calculating BERT Score...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Calculating METEOR Score...


[nltk_data] Downloading package wordnet to /home/aditi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/aditi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [16]:
print(result['rouge'])

                          System    rouge1    rouge2    rougeL  rougeLsum
0            Bare gpt-3.5.-turbo  0.410056  0.169581  0.334222   0.337933
1  Bare mistral-7b-instruct-v0.1  0.252921  0.069907  0.173862   0.196391
2                    gpt-3.5-RAG  0.729535  0.692276  0.725238   0.724041
3                    mistral-RAG  0.619861  0.501489  0.479023   0.494634


In [17]:
print(result['bleu'])

                          System      bleu
0            Bare gpt-3.5.-turbo  0.074376
1  Bare mistral-7b-instruct-v0.1  0.039006
2                    gpt-3.5-RAG  0.594337
3                    mistral-RAG  0.365213


In [18]:
print(result['meteor'])

                          System    meteor
0            Bare gpt-3.5.-turbo  0.339511
1  Bare mistral-7b-instruct-v0.1  0.239231
2                    gpt-3.5-RAG  0.801181
3                    mistral-RAG  0.609384
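
One way to read the table above is as the gain each model gets from retrieval. The sketch below copies the four METEOR values from the output and subtracts the bare score from the RAG score; the shortened dictionary keys are illustrative labels, not names used by `generate_metrics_summary`.

```python
# METEOR scores copied from the summary table above.
meteor = {
    "gpt-3.5": {"bare": 0.339511, "rag": 0.801181},
    "mistral": {"bare": 0.239231, "rag": 0.609384},
}

# Absolute gain from adding retrieval, per model.
gains = {
    model: round(scores["rag"] - scores["bare"], 6)
    for model, scores in meteor.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.6f} METEOR with RAG")
```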


### Faithfulness and Relevance Calculation

In [8]:
import os

import google.generativeai as palm
from llama_index import ServiceContext
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.palm import PaLM
# from llama_index.llms import OpenAI
from tqdm import tqdm

# Read the PaLM API key from the environment instead of hard-coding it
# (the variable name PALM_API_KEY is an assumption; set it to your own key).
palm_api_key = os.environ["PALM_API_KEY"]
palm.configure(api_key=palm_api_key)

def evaluate(queries: list, responses: list, metric: str):
    llm = PaLM(api_key=palm_api_key, temperature=0.0)
    service_context = ServiceContext.from_defaults(llm=llm)
    
    if metric == 'faithfulness':
        evaluator = FaithfulnessEvaluator(service_context=service_context)
    elif metric == 'relevancy':
        evaluator = RelevancyEvaluator(service_context=service_context)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    evals = []
    for query, response in tqdm(list(zip(queries, responses))):
        eval_result = evaluator.evaluate_response(query=query, response=response)
        evals.append(eval_result)
    
    return evals

def get_pass_rate(evals):
    # Count only the evaluations that actually passed.
    return sum(1 for val in evals if val.passing) / len(evals)

In [None]:
# evaluate_response expects the original Response objects, not plain strings.
faithfulness_results = evaluate(
    queries=[sample["question"] for sample in ten_samples],
    responses=store_mistral_rag_responses,
    metric='faithfulness',
)

In [32]:
faithfulness_score = get_pass_rate(faithfulness_results)
faithfulness_score

1.0

In [33]:
relevancy_results = evaluate(queries=[sample["question"] for sample in ten_samples], responses=store_mistral_rag_responses, metric='relevancy')

100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [02:14<00:00, 14.99s/it]


In [34]:
relevancy_score = get_pass_rate(relevancy_results)
relevancy_score

1.0
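
The pass rate is meant to be the fraction of evaluations whose `passing` flag is true. A minimal sketch of that calculation follows, using a hypothetical stand-in class (`FakeEval` is an assumption for illustration and is not a LlamaIndex type) so it runs without calling any LLM.

```python
from dataclasses import dataclass


@dataclass
class FakeEval:
    """Hypothetical stand-in exposing only the `passing` flag."""
    passing: bool


def pass_rate(evals) -> float:
    """Fraction of evaluation results whose `passing` flag is true."""
    if not evals:
        return 0.0
    return sum(1 for ev in evals if ev.passing) / len(evals)


# Three of the four made-up results pass.
sample = [FakeEval(True), FakeEval(True), FakeEval(False), FakeEval(True)]
rate = pass_rate(sample)
print(rate)
```

Counting only the passing results matters: taking the length of a list of all `passing` flags would report 1.0 regardless of how many evaluations failed.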