# RAG Workflow

<div style="text-align: center;">
    <img src="images/rag-workflow.png" width="60%"></img>
</div>

# Information Retrieval

+ Retrieval is the core of a RAG system.
+ If retrieval fails, the whole system fails, because the generator never sees the relevant context.

## Different Methods for IR

| Method                            | Description                                    | Advantage                     | Disadvantage                   |
| --------------------------------- | ---------------------------------------------- | ----------------------------- | ------------------------------ |
| **Lexical/Keyword Retrieval** (e.g., TF-IDF, BM25) | Matches documents that share words with the query | Fast and simple | Misses synonyms and paraphrases |
| **Dense Retrieval** (Embedding)   | Searches based on **meaning** rather than exact words | Handles synonyms and paraphrases | Requires an embedding model |
| **Hybrid Retrieval**              | Combines lexical and dense scores | Usually the best quality | Slightly more complex |

### Keyword Search (Sparse Vectors)

<div style="text-align: center;">
    <img src="images/keyword-search.png" width="70%"></img>
    <img src="images/bm25.jpeg" width="40%"></img>
    <img src="images/bm25-explanation.png" width="40%"></img>
</div>


Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

<div style="text-align: center;">
    <img src="images/tf-idf.png" width="50%"></img>
</div>
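To make the TF-IDF idea concrete, here is a minimal sketch that weights the words of the two sentences above. It assumes lowercase whitespace tokenization, term frequency normalized by sentence length, and the plain IDF `log(N / df)`; the figure may use a different variant, so exact numbers can differ.

```python
import math
from collections import Counter

docs = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]

def tokenize(text):
    # Assumption: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of sentences that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
for sentence, w in zip(docs, weights):
    print(sentence, {term: round(score, 3) for term, score in w.items()})
```

Under this variant, words shared by both sentences ("the", "is", "driven", "on") get weight 0, while "car"/"road" and "truck"/"highway" carry all the weight. That is what lets a keyword retriever tell the two sentences apart.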


```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

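# Query (Persian): "The best pizza shop"
sentences_1 = ["بهترین پیتزا فروشی"]
# Documents (Persian): "I like fast food." and "I like pizza."
sentences_2 = ["من فست فود دوست دارم.", "من پیتزا دوست دارم ."]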

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)


# Compute lexical matching scores from the sparse (token-weight) outputs
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0 (the query shares no words with the "fast food" sentence)

lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][1])
print(lexical_scores)
# 0.11194 (the query and the second sentence share the word for "pizza")
```
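The BM25 formula pictured at the start of this section can also be sketched in plain Python. This is a minimal illustration, assuming the common Okapi BM25 form with `k1=1.5`, `b=0.75`, and the IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`. The function name `bm25_scores`, the whitespace tokenization, and the toy corpus are illustrative and not part of BGE-M3.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "pizza is my favorite food",
]
scores = bm25_scores("truck on the highway", docs)
print(scores)
```

The second document matches every query term and ranks first, the first document matches only "on" and "the", and the third document shares no query term, so it scores 0.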

### Semantic Search (Dense Vectors)

<div style="text-align: center;">
    <img src="images/semantic-search.png" width="60%"></img>
</div>

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

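# Query (Persian): "The best pizza shop"
sentences_1 = ["بهترین پیتزا فروشی"]
# Documents (Persian): "I like fast food." and "I like pizza."
sentences_2 = ["من فست فود دوست دارم.", "من پیتزا دوست دارم ."]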

embeddings_1 = model.encode(sentences_1, 
                            batch_size=12, 
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
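# Dot product between the query vector and each document vector
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.468  0.6655]]  (the "pizza" sentence scores higher than the "fast food" one)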

```

### Hybrid Search (sparse+dense)

<div style="text-align: center;">
    <img src="images/hybrid-search.webp" width="50%"></img>
</div>

+ **Keyword Search**: Ensures sensitivity to the exact words in the user query.
+ **Semantic Search**: Finds documents with similar meaning, even without matching words.

<div style="text-align: center;">
    <img src="images/beta.png" width="50%"></img>
</div>
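One common way to combine the two signals is a weighted sum controlled by a single parameter. The sketch below assumes the form `hybrid = beta * dense + (1 - beta) * sparse`, with each score list min-max normalized first so the two scales are comparable. The name `beta`, the normalization step, and the sample scores are assumptions for illustration; other fusion methods, such as reciprocal rank fusion, are also used in practice.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, beta=0.5):
    """Weighted sum of normalized dense and sparse scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    return [beta * d + (1 - beta) * s for d, s in zip(dense_n, sparse_n)]

def best_index(scores):
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical scores for three candidate documents.
dense = [0.70, 0.65, 0.30]
sparse = [0.00, 0.20, 0.10]

for beta in (0.0, 0.5, 1.0):
    scores = hybrid_scores(dense, sparse, beta)
    print(f"beta={beta}: best document index = {best_index(scores)}")
```

With `beta=1.0` only the dense score counts and document 0 wins. With `beta=0.0` only the keyword score counts and document 1 wins. Values in between trade exact-word sensitivity against semantic similarity.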

# Embeddings

+ Embeddings are numerical vector representations of text.
+ They transform words, sentences, or documents into points in a high-dimensional space where semantic similarity corresponds to geometric closeness.
+ If two texts mean similar things, their vectors will be close in the embedding space.

<div style="text-align: center;">
    <img src="images/embeddigns.png" width="100%"></img>
</div>
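Geometric closeness is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional toy vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the values here do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values): two food words and one vehicle word.
pizza = [0.9, 0.1, 0.0]
pasta = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print("pizza vs pasta:", round(cosine_similarity(pizza, pasta), 3))
print("pizza vs car:  ", round(cosine_similarity(pizza, car), 3))
```

The two food vectors point in nearly the same direction and score close to 1, while the food and vehicle vectors are almost orthogonal and score close to 0.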

# Evaluating a RAG Pipeline

A RAG system has two parts that affect performance:

| Component           | Question to Evaluate                           | Metric Type            |
| ------------------- | ---------------------------------------------- | ---------------------- |
| **Retriever**       | Does it find the *right documents*?            | Retrieval Metrics      |
| **Generator (LLM)** | Does it produce a *correct and useful answer*? | Answer Quality Metrics |


## Evaluating the Retriever

We want to check whether the retriever returns relevant documents when given a query.

### Key Retrieval Metrics

| Metric          | What it Measures                                          | Interpretation             |
| --------------- | --------------------------------------------------------- | -------------------------- |
| **Recall@k**    | Fraction of all relevant documents that appear in the top k results | Higher = Fewer relevant docs missed |
| **Precision@k** | Fraction of the top k results that are actually relevant  | Higher = Cleaner retrieval |
| **Hit Rate**    | Whether at least 1 relevant doc appears in the top k results | Binary version of Recall@k |
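
A minimal sketch of the three metrics for a single query. It assumes Recall@k is the fraction of relevant documents found in the top k, Precision@k is the fraction of the top k that is relevant, and Hit Rate is 1 when the top k contains at least one relevant document. The function names and document IDs are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

# Hypothetical ranking returned by the retriever, best match first.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
# Hypothetical ground-truth relevant documents for the query.
relevant = {"d1", "d4", "d8"}

for k in (1, 3, 5):
    print(
        f"k={k}",
        f"recall={recall_at_k(retrieved, relevant, k):.2f}",
        f"precision={precision_at_k(retrieved, relevant, k):.2f}",
        f"hit={hit_rate_at_k(retrieved, relevant, k):.0f}",
    )
```

In practice these values are averaged over a set of test queries, each with its own list of known relevant documents.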


## Evaluating the Generated Answer

Even with perfect retrieval, the LLM may:
+ Hallucinate
+ Ignore evidence
+ Give incomplete answers

So we evaluate the final generated output separately.

### Answer Quality Metrics

| Metric           | Meaning                                     | How to Score |
| ---------------- | ------------------------------------------- | ------------ |
| **Faithfulness** | Is the answer *grounded in retrieved text*? | 1–5 scale    |
| **Correctness**  | Is the answer factually correct?            | 1–5 scale    |
| **Completeness** | Is the answer sufficiently detailed?        | 1–5 scale    |
| **Conciseness**  | Is the answer clear and not overly long?    | 1–5 scale    |


### Automatic Evaluation (Model-as-a-Judge)

```python
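# Assumption: the Ollama Python client, with the llama3:8b model available locally
from ollama import chat

# Placeholders: fill in the retrieved passages and the RAG system's answer
context = "..."
system_answer = "..."

evaluation_prompt = f"""
Context:
{context}

System Answer:
{system_answer}

Evaluate:
1) Faithfulness: Does the answer rely only on the context?
2) Correctness: Is the answer accurate based on the context?

Return a score from 1 to 5 for each criterion.
"""
# Ask the judge model to grade the answer, then print its free-text verdict
judge = chat(model="llama3:8b", messages=[{"role": "user", "content": evaluation_prompt}])
print(judge["message"]["content"])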
```
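
The judge replies in free text, so the scores still have to be extracted before they can be averaged over a test set. A minimal sketch, assuming the judge writes each criterion as `Name: <digit>` (optionally followed by `/5`). Real model output varies, so the helper name `parse_scores`, the regular expression, and the sample reply are illustrative; criteria that cannot be matched come back as `None`.

```python
import re

CRITERIA = ("Faithfulness", "Correctness")

def parse_scores(judge_text, criteria=CRITERIA):
    """Extract a 1-5 score per criterion from the judge's reply."""
    scores = {}
    for name in criteria:
        # Matches e.g. "Faithfulness: 4" or "Faithfulness: 4/5".
        match = re.search(rf"{name}\s*[:=]\s*([1-5])\b", judge_text, flags=re.IGNORECASE)
        scores[name] = int(match.group(1)) if match else None
    return scores

# Hypothetical judge reply, not real model output.
sample_reply = """1) Faithfulness: 4/5
2) Correctness: 5
The answer is mostly grounded in the context."""

print(parse_scores(sample_reply))
```

Asking the judge for a stricter output format in the prompt (for example one `Name: score` line per criterion) makes this kind of parsing more reliable.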