## Building a Retrieval-Augmented Generation (RAG) System
#### Alan García Zermeño
06/13/2024

### Section 4: Evaluation and Optimization
#### This section includes:
- Code snippets for the evaluation and optimization process.
- Evaluation results (e.g., tables, charts).
- A summary of optimizations and their impact on performance.

In [1]:
from tqdm.notebook import tqdm
from openai import OpenAI
import time
import textwrap
import sys
import os

# Import script modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../Scripts')))
from datacleaner import data_cleaner
from RetrievalSystemWebEval import evaluator, generGemini, CRAG

[nltk_data] Downloading package punkt to /home/alan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


To evaluate the CRAG system, we will first load and clean the database.

In [2]:
corpus,questions,answers = data_cleaner()

49 Question/Answer pairs extracted!


Next, we call a version of our CRAG with web search (see Section 6) that returns the final response for each question together with two boolean variables, which are explained below.
The main idea is to evaluate the system on every question in our database, saving all the answers in the "_answersRAG_" array.

Important: We pause for 2.5 seconds between requests to Gemini so as not to exceed its quota limit.

In [3]:
websbool = []
qbool = []
answersRAG  = []
for que in tqdm(questions):
    ans,errors,flag = CRAG(que,corpus,answers,safeURLs = ['keytruda.com','wikipedia'])
    answersRAG.append(ans)
    websbool.append(errors)
    qbool.append(flag)
    time.sleep(2.5)


The "_websbool_" array stores one boolean per CRAG call (returned as "_errors_"), indicating whether the web search failed to return information from the pages marked as safe (in this case keytruda.com and wikipedia); the scoring code later subtracts these failures from the number of web searches. The other variable, "_flag_", is stored in the "_qbool_" array and tells us whether our evaluator model authorized a web search for each query.


Now, to evaluate our answers, we compare them with the official answers in the database for each of the questions. The evaluating model is a version of GPT. The prompt contains the question and the two answers, and asks the model to decide which of the two is better or whether both are equally good.

In [4]:
def askGPT4(query, model):
    """
    Ask a GPT model to judge which of two responses is better.
        Args:
            query:  String containing the question and the two responses
            model:  String with the GPT model version, e.g. "gpt-4o" or "gpt-4"
        Returns:
            String with the evaluator's verdict
    """
    # Read the OpenAI API key from a local file
    with open("../APIS/gpt.txt", 'r') as file:
        apik = file.readline().strip()
    client = OpenAI(api_key=apik)

    prompteval = """Evaluate the following responses to the given question.
    Your task is to determine which response is better or if both are kind of the same.
    You can only respond JUST with ONE of the following options:

    -  Response 1 is better
    -  Response 2 is better
    -  Both responses are kind of the same

    """

    completion = client.chat.completions.create(
                model=model,
                messages=[
                {"role": "system", "content": prompteval},
                {"role": "user", "content": query}
                ],
                max_tokens=120
            )
    return completion.choices[0].message.content
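
Because the evaluator is instructed to answer with exactly one of three options, its reply can be mapped to a fixed label before counting. The following is only a minimal sketch, assuming the reply contains one of the phrases listed in the prompt; `parse_verdict` is a hypothetical helper that is not part of the original notebook, which instead checks for the characters '2' and '1' in the reply.

```python
import re

def parse_verdict(reply: str) -> str:
    """Map an evaluator reply to 'response_1', 'response_2', 'same' or 'unknown'.

    Hypothetical helper: it assumes the reply follows the wording of the
    three options given in the evaluation prompt.
    """
    text = reply.lower()
    if re.search(r"response\s*1\s+is\s+better", text):
        return "response_1"
    if re.search(r"response\s*2\s+is\s+better", text):
        return "response_2"
    if "same" in text:
        return "same"
    # Anything else is flagged instead of being silently counted as a tie
    return "unknown"

print(parse_verdict("Response 2 is better"))
print(parse_verdict("- Both responses are kind of the same"))
```

Returning a separate "unknown" label makes off-format replies visible, whereas a plain substring check would count them as ties.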

Now, we run our evaluation test by iteratively passing all the questions and their two answers, and we save the GPT evaluation in the "_bestres_" array.

In [5]:
bestres = []
for i in tqdm(range(len(questions))):
    prompteval = f"""Question:
    {questions[i]}

    Response 1:
    {answers[i]}

    Response 2:
    {answersRAG[i]}

    Evaluation: """
    best = askGPT4(prompteval,model = "gpt-4")
    bestres.append(best)
    #print(f"{i+1}.  {best}")


Finally, we can define our 3 metrics to take into account based on all the responses:

- **Retriever score (RS)**: The percentage of questions for which our evaluator model judged that the retriever found a correct answer within the database, so no web search was authorized. It tells us how good our retriever system is.

- **Web search score (WSS)**: Among the queries for which the model had to fall back on web search, the percentage for which we were able to obtain information from our secure pages (Wikipedia, keytruda.com). It tells us how safe our web search is, based on which web pages are marked as safe.

- **CRAG score (CRAGS)**: The percentage of questions, excluding those judged as ties, for which the CRAG answer was evaluated as "better" than the one in the database by our chosen version of GPT. It tells us how much our CRAG improves on the original answers in the database.
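
The three metrics can be gathered into a single function. This is a sketch that mirrors the formulas used in the test cells of this notebook; the function name `crag_metrics`, the returned dictionary, and the toy inputs are assumptions made for illustration, and the zero-division guards are an addition.

```python
def crag_metrics(qbool, websbool, verdicts):
    """Compute RS, WSS and CRAGS (in percent) from the per-question flags.

    qbool:    True where a web search was authorized for the query
    websbool: True where the web search returned no usable information
    verdicts: evaluator replies, one per question
    """
    n = len(qbool)
    searches = sum(qbool)
    failures = sum(websbool)

    # Same substring checks as the notebook: '2' first, then '1', else tie
    two = sum('2' in v for v in verdicts)
    one = sum('2' not in v and '1' in v for v in verdicts)
    same = len(verdicts) - two - one
    decided = len(verdicts) - same

    rs = round((n - searches) / n * 100, 2)
    # Guards for the cases with no web search or only ties (assumption)
    wss = round((searches - failures) * 100 / searches, 2) if searches else None
    crags = round(two * 100 / decided, 2) if decided else None
    return {"RS": rs, "WSS": wss, "CRAGS": crags}

# Toy data (not taken from the real evaluation)
toy_qbool = [False, False, True, True, False]
toy_websbool = [False, False, True, False, False]
toy_verdicts = [
    "Response 2 is better",
    "Response 2 is better",
    "Response 1 is better",
    "Both responses are kind of the same",
    "Response 2 is better",
]
print(crag_metrics(toy_qbool, toy_websbool, toy_verdicts))
```

In the toy data, two of five queries trigger a web search, one of those searches fails, and one verdict is a tie that is left out of the CRAGS denominator.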

**First test**: First, we evaluate our CRAG with **GPT-3.5** as the generating model and **two secure URLs**: keytruda.com and wikipedia. The model that grades the answers is **GPT-4o**.

In [54]:
#gen by Gpt-3.5,secure webs and Gpt-4o eval
two,one,same = 0,0,0
for resp in bestres:
    if '2' in resp:
        two+=1
    elif '1' in resp:
        one+=1
    else:
        same+=1

n = len(questions)
print(f"Retriever score by evaluator model: {round((n-sum(qbool))/n*100,2)}%")
print(f"Web search score: {round((sum(qbool)-sum(websbool))*100/sum(qbool),2)}%")
print(f"CRAG score by GPT4o: {round((two)*100/(n-same),2)}%")

Retriever score by evaluator model: 81.63%
Web search score: 44.44%
CRAG score by GPT4o: 91.84%


**Second test**: Next, we evaluate our CRAG with **GPT-4o** as the generating model and **no secure URLs**, so the web search runs without filters. The model that grades the answers is **GPT-4**.

In [6]:
#gen by Gpt-4o,no secure webs and Gpt-4 eval
two,one,same = 0,0,0
for resp in bestres:
    if '2' in resp:
        two+=1
    elif '1' in resp:
        one+=1
    else:
        same+=1

n = len(questions)
print(f"Retriever score by evaluator model: {round((n-sum(qbool))/n*100,2)}%")
print(f"Secure webs score: {round((sum(qbool)-sum(websbool))*100/sum(qbool),2)}%")
print(f"CRAG score by GPT4: {round((two)*100/(n-same),2)}%")

Retriever score by evaluator model: 83.67%
Secure webs score: 100.0%
CRAG score by GPT4: 100.0%


**Third test**: Finally, we evaluate our CRAG with **GPT-4** as the generating model and **no secure URLs**, so the web search runs without filters. The model that grades the answers is **GPT-4**.

In [19]:
#gen by Gpt-4,no secure webs and Gpt-4 eval
two,one,same = 0,0,0
for resp in bestres:
    if '2' in resp:
        two+=1
    elif '1' in resp:
        one+=1
    else:
        same+=1

n = len(questions)
print(f"Retriever score by evaluator model: {round((n-sum(qbool))/n*100,2)}%")
print(f"Secure webs score: {round((sum(qbool)-sum(websbool))*100/sum(qbool),2)}%")
print(f"CRAG score by GPT4: {round((two)*100/(n-same),2)}%")

Retriever score by evaluator model: 81.63%
Secure webs score: 100.0%
CRAG score by GPT4: 75.86%


## Final Results

| **Test** | **RS** | **WSS** | **CRAGS** |
|----------|----------|----------|----------|
| **First**    | 81.63%    | 44.44%   | 91.84%   |
| **Second**    | 83.67%    | 100%   | 100%   |
| **Third**    | 81.63%    | 100%   | 75.86%   |


### Observations, Conclusions:
- Since no corrections or improvements were made to the retriever system across the three tests, we can report that, on average, our retriever model achieved an **82.31%** success rate. This is a good result, and it could likely be improved by further enriching the database.

- Our web search was consistently successful (100%) when we did not limit ourselves to secure websites. When the filter was applied, we found results on secure sites only 44.44% of the time. This could be improved by expanding our list of secure websites.

- Regarding the CRAG metric, we observe that when GPT-3.5 was our generating model, GPT-4o evaluated it with a score of 91.84%. When GPT-4 generated and was evaluated by itself, the score dropped to 75.86%. However, when GPT-4o generated and was evaluated by GPT-4, we achieved the highest score (100%). Therefore, we conclude that for this specific test, **GPT-4o is the most effective model to be used as the generating model for our CRAG**.
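
The average retriever score quoted in the first observation can be checked from the RS column of the results table. This is a small arithmetic check only; the variable names are illustrative.

```python
from statistics import mean

# RS values from the First, Second and Third tests, in percent
retriever_scores = [81.63, 83.67, 81.63]

average_rs = round(mean(retriever_scores), 2)
print(f"Average retriever score: {average_rs}%")
```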