# RAG Evaluation w/ BERT Score

BERTScore evaluates a generated text by comparing the contextual embeddings of its words with those of a reference text. It differs from simple word matching because each word's embedding depends on the sentence around it, so paraphrases and synonyms can still score well.

**STEP 1**: Create Embeddings
First, generate contextual embeddings for both the **reference text** (the standard or target text) and the **candidate text** (the text you are evaluating).

**STEP 2**: Calculate Similarities
Use cosine similarity to measure how close each word embedding in the candidate text is to each word embedding in the reference text.
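As a rough illustration of Step 2, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for the example (real contextual embeddings have hundreds of dimensions), and `cosine_similarity` is a helper defined here, not part of `bert_score`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings", invented purely for illustration
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.2]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)

# Vectors pointing in similar directions score closer to 1
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```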

**STEP 3**: Compute Precision and Recall
- **Precision**: For each word in the candidate text, find the most similar word in the reference text and average these best-match similarities --> High precision = low level of false positives (little candidate content that the reference does not support) --> 1 or 100% --> focus on correctness

![precision](../images/prec-300x53.png)

- **Recall**: For each word in the reference text, find the most similar word in the candidate text and average these best-match similarities --> High recall = low level of false negatives (little reference content missing from the candidate) --> 1 or 100% --> focus on completeness

![recall](../images/recall-300x64.png)

In practice, achieving both 100% precision and 100% recall is often impossible, leading to a trade-off:

🔸 A higher recall can be achieved at the expense of lower precision: capturing more positives but also increasing false positives.

🔸 A higher precision can be achieved at the cost of lower recall: fewer false positives but also missing more true positives.
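The greedy matching behind Step 3 can be sketched as follows. This is a simplified illustration, assuming a precomputed similarity matrix with made-up values and no importance (IDF) weighting; `greedy_precision_recall` is a helper defined here, not the library's implementation.

```python
def greedy_precision_recall(sim):
    """sim[i][j] = cosine similarity of candidate word i and reference word j."""
    # Precision: best reference match for each candidate word, averaged
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: best candidate match for each reference word, averaged
    n_ref = len(sim[0])
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    return precision, recall

# Toy matrix: 3 candidate words (rows) x 2 reference words (columns)
sim = [
    [0.9, 0.2],
    [0.3, 0.8],
    [0.4, 0.5],
]

precision, recall = greedy_precision_recall(sim)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```

In this toy case the third candidate word has no strong match in the reference, which lowers precision, while both reference words are well covered, so recall stays higher.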

**STEP 4**: Compute F1

Combine Precision and Recall into the F1 score, their harmonic mean, which provides a single summary measure.

- **High F1**: An F1 score near 1 is very good, indicating strong precision and recall.
- **Low F1**: An F1 score near 0 is poor, indicating weak precision, weak recall, or both.
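A short sketch of the harmonic mean used in Step 4. The inputs are the rounded precision and recall printed by the test cell further down (0.870 and 0.882), so the result is approximate; `f1_score` is a helper defined here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values from the single-sentence test in this notebook
f1 = f1_score(0.870, 0.882)
print(f"F1 Score: {f1:.3f}")

# The harmonic mean is pulled toward the weaker of the two values
lopsided = f1_score(0.95, 0.10)
print(f"Lopsided F1: {lopsided:.3f}")
```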


❗ **NOTE**: Remember, the embeddings assess **contextual** similarity, which reflects how words are used in specific contexts, rather than direct word-to-word similarity.


In [17]:
# import modules

from bert_score import score

In [18]:
# test

candidates = ["In this article we will talk about cats"]
references = ["This is a piece of text about kittens"]

P, R, F1 = score(candidates, references, lang='en', verbose=True)

print(f"Precision: {P.mean():.3f}")
print(f"Recall: {R.mean():.3f}")
print(f"F1 Score: {F1.mean():.3f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.29 seconds, 3.46 sentences/sec
Precision: 0.870
Recall: 0.882
F1 Score: 0.876


## 1. Import RAG w/ Contextual Compression for testing

In [19]:
import sys
sys.path.append("../1-rag-contextual-compression")

from rag_cc import RAGContextualCompression

In [20]:
# logging
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.WARNING)

In [21]:
data_path = "../data/rag-con-comp-data"

# step 1: initialize agent
ragcc = RAGContextualCompression(data_path=data_path)

# step 2: load & preprocess documents
docs = ragcc.load_documents()
doc_chunks = ragcc.preprocess_documents(docs)

# step 3: initialize vector store
db = ragcc.setup_vector_store(doc_chunks)

In [22]:
user_query = "Where does the water present in the egg go after boiling the egg?"

# step 4: retrieve documents
retrieved_docs = ragcc.retrieve_documents(db, user_query)

# step 5: setup compression and redundancy filters to optimize document retrieval
contextual_comp_retriever = ragcc.setup_compression_pipeline_retriever(db)

# step 6: generate the final answer to the user query
answer = ragcc.generate_answer(retriever=contextual_comp_retriever, user_query=user_query)
print(answer)

Yow yow, nice to see you here, curious mind! When an egg is boiled, the water present in the egg doesn't go anywhere but remains within the egg. The heat from boiling causes the proteins in the egg to denature and coagulate, trapping the water within the solidified egg white and yolk.


## 2. Generate Reference Texts

Reference texts are the texts that your model's output will be compared against. Ideally, they are high-quality, human-written texts that represent what you expect from the model.

In [23]:
# generate references

questions = [
    "What are monosaccharides?",
    "What are reducing sugars?",
    "What are two main functions of carbohydrates in plants?",
    "Classify the following into monosaccharides and disaccharides: Ribose, 2-deoxyribose, maltose, galactose, fructose, lactose.",
    "What do you understand by the term glycosidic linkage?"
]

refs = [
    "Monosaccharides are carbohydrates that cannot be hydrolyzed further to give simpler units of polyhydroxy aldehyde or ketone.",
    "Reducing sugars are those carbohydrates that can reduce Fehling’s solution and Tollens’ reagent, including all monosaccharides whether aldose or ketose.",
    "Carbohydrates act as a primary source of energy and serve as structural components within plant cells.",
    "Monosaccharides: Ribose, 2-deoxyribose, galactose, fructose; Disaccharides: maltose, lactose.",
    "A glycosidic linkage is a type of covalent bond that joins a carbohydrate molecule to another group, which can be another carbohydrate."
]

## 3. Generate Candidate Texts

Candidate texts are the outputs generated by the imported RAG w/ Contextual Compression model.

In [24]:
# create candidates

cands = [ragcc.generate_answer(retriever=contextual_comp_retriever, user_query=q) for q in questions]

In [25]:
cands

['Yow yow, nice to see you here, curious mind! Monosaccharides are a type of carbohydrate that cannot be hydrolyzed further to give simpler units of polyhydroxy aldehyde or ketone. Some common examples of monosaccharides include glucose, fructose, and ribose.',
 'Yow yow, nice to see you here, curious mind! Reducing sugars are sugars that have free functional groups, specifically aldehydic or ketonic groups, which allow them to act as reducing agents. Examples of reducing sugars include maltose and lactose.',
 'Yow yow, nice to see you here, curious mind! The two main functions of carbohydrates in plants are providing energy and serving as a structural component. They are used as storage molecules in the form of starch and they constitute the cell wall of plant cells in the form of cellulose.',
 'Yow yow, nice to see you here, curious mind! \nThe monosaccharides in your list are Ribose, 2-deoxyribose, galactose, and fructose. The disaccharides are maltose and lactose.',
 'Yow yow, nice

## 4. Calculate Individual BERT Scores for RAG w/ Contextual Compression

In [26]:
rag_P, rag_R, rag_F1 = score(cands, refs, lang="en", verbose=True)

print("RAG w/ Contextual Compression | Precision:", rag_P)
print("RAG w/ Contextual Compression | Recall:", rag_R)
print("RAG w/ Contextual Compression | F1 Score:", rag_F1)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 1.43 seconds, 3.49 sentences/sec
RAG w/ Contextual Compression | Precision: tensor([0.8856, 0.8496, 0.8555, 0.8914, 0.8434])
RAG w/ Contextual Compression | Recall: tensor([0.9555, 0.8518, 0.9100, 0.9487, 0.8902])
RAG w/ Contextual Compression | F1 Score: tensor([0.9192, 0.8507, 0.8820, 0.9191, 0.8662])
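The per-question tensors are easier to read when paired with their questions. A minimal sketch, with the F1 values copied from the output above into a plain list (in the notebook they could come from `rag_F1.tolist()`); the short labels are shorthand introduced here for the five questions.

```python
# F1 values copied from the printed tensor; in the notebook use rag_F1.tolist()
f1_scores = [0.9192, 0.8507, 0.8820, 0.9191, 0.8662]

# Abbreviated labels for the five questions (shorthand for this sketch)
labels = [
    "monosaccharides",
    "reducing sugars",
    "functions in plants",
    "classification",
    "glycosidic linkage",
]

# Sort from weakest to strongest so problem questions surface first
ranked = sorted(zip(labels, f1_scores), key=lambda pair: pair[1])
for label, f1 in ranked:
    print(f"{f1:.4f}  {label}")

weakest_label, weakest_f1 = ranked[0]
```

Here the "reducing sugars" answer scores lowest, which is consistent with the candidate and the reference describing the concept in different terms (free functional groups versus Fehling's and Tollens' reagents).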


## 5. Calculate Overall BERT Score

In [27]:
print(f"RAG w/ Contextual Compression | Precision: {rag_P.mean():.3f}")
print(f"RAG w/ Contextual Compression | Recall: {rag_R.mean():.3f}")
print(f"RAG w/ Contextual Compression | F1 Score: {rag_F1.mean():.3f}")

RAG w/ Contextual Compression | Precision: 0.865
RAG w/ Contextual Compression | Recall: 0.911
RAG w/ Contextual Compression | F1 Score: 0.887


## Interpretation

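- **Precision** - 0.865 indicates that, on average, the words in the generated answers have close semantic matches in the human-written references --> **most of the generated content is relevant to the references.** This is an average similarity score, not the share of answers that are correct.
- **Recall** - 0.911 indicates that, on average, the words in the references have close semantic matches in the generated answers --> **the model captured most of the information in the reference texts.** Again, this is a similarity score, not a literal 91% coverage.
- **F1 Score** - 0.887 is the harmonic mean of Precision & Recall, computed per question and then averaged.

Raw BERTScore values tend to fall in a narrow, high range, so they are better suited to comparing systems or answers against each other than to being read as absolute percentages.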
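Every candidate shown earlier opens with the same greeting ("Yow yow, nice to see you here, curious mind!"), which has no counterpart in the references and may be one reason precision is lower than recall. A minimal sketch, assuming the greeting is a fixed prefix, of stripping it before scoring; `strip_greeting` is a hypothetical helper, not part of `rag_cc` or `bert_score`.

```python
GREETING = "Yow yow, nice to see you here, curious mind!"

def strip_greeting(text, greeting=GREETING):
    """Remove the fixed greeting prefix (if present) and surrounding whitespace."""
    text = text.strip()
    if text.startswith(greeting):
        text = text[len(greeting):]
    return text.strip()

sample = "Yow yow, nice to see you here, curious mind! \nThe disaccharides are maltose and lactose."
cleaned = strip_greeting(sample)
print(cleaned)
```

In the notebook, `score([strip_greeting(c) for c in cands], refs, lang="en")` would then compare only the substantive part of each answer. Whether precision actually rises is something to check empirically.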
