# 1.0 Challenges with Naive RAG

## 1.1 Bad Retrieval

### 1.1.1 Low precision
- Not all chunks in the retrieved set are relevant.
- This leads to hallucination and to the "lost in the middle" problem, where the LLM pays less attention to content in the middle of the context.

### 1.1.2 Low recall
- Not all relevant chunks are retrieved.
- The LLM lacks enough context to synthesize an answer.
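
To make the two failure modes concrete, here is a minimal sketch that computes precision and recall for a single query. The chunk ids and the `precision_recall` helper are made up for illustration; they assume you already know which chunks are relevant to the query.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant chunks that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical chunk ids, for illustration only.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c2", "c4", "c7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved chunks are relevant (low precision), and one relevant chunk, `c7`, is missed entirely (imperfect recall).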

### 1.1.3 Outdated information
- The data is redundant or out of date.

## 1.2 Bad Response Generation

### 1.2.1 Hallucinations
- The model makes up an answer that isn't in the context.

### 1.2.2 Irrelevance
- The model gives an answer that doesn't address the question.

### 1.2.3 Toxicity/Bias
- The model gives an answer that is harmful or offensive.
- A related failure is ignoring instructions: for example, the user asks for the response in JSON format, but the LLM still generates data that is not JSON.

# 2.0 Correction in RAG

## 2.1 Things to look at before creating a RAG system

### 2.1.1 Data:
- Can we store additional information beyond raw text chunks?
- How do we process the chunked data?
- Evaluation data can be created by a **human labeller** who writes (question, answer) pairs.
- Synthetic data can be created with an LLM: generate questions from context chunks, then feed each question and its context chunk back into the LLM to generate a synthetic ground-truth response. The resulting (question, context, synthetic response) triples form a dataset that can be used with any evaluation metric you have defined.
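
The synthetic-data recipe above can be sketched as follows. This is only an outline under assumptions: `llm` is a hypothetical callable that takes a prompt string and returns text, the prompts are illustrative, and `fake_llm` is a stand-in so the example runs without a model.

```python
from typing import Callable, Dict, List


def build_synthetic_dataset(
    chunks: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Create (question, context, synthetic response) records from chunks."""
    dataset = []
    for chunk in chunks:
        # Step 1: ask the LLM for a question that the chunk can answer.
        question = llm(f"Write one question that this passage answers:\n{chunk}")
        # Step 2: feed the question and chunk back in to get a reference answer.
        response = llm(
            "Answer the question using only the passage.\n"
            f"Passage: {chunk}\nQuestion: {question}"
        )
        dataset.append(
            {"question": question, "context": chunk, "response": response}
        )
    return dataset


def fake_llm(prompt: str) -> str:
    # Stand-in so the sketch runs offline; replace with a real model call.
    if prompt.startswith("Write one question"):
        return "What does the passage describe?"
    return "A placeholder answer grounded in the passage."


chunks = [
    "RAG retrieves chunks before generation.",
    "Top-k retrieval returns the k nearest chunks.",
]
dataset = build_synthetic_dataset(chunks, fake_llm)
print(dataset[0])
```

In practice the synthetic responses are worth spot-checking by hand, since any error in them becomes an error in the ground truth.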

### 2.1.2 Embeddings:
- Can we optimize our embedding representations?
- Better embeddings can improve the data retrieved from the vector store.

### 2.1.3 Retrieval:
- Top-k retrieval is the standard approach nowadays.
- We should improve on it or introduce new retrieval techniques.
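
For reference, plain top-k retrieval amounts to ranking chunks by similarity to the query embedding and keeping the best k. The sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real embeddings; the names `cosine` and `top_k` are illustrative, not a library API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


# Toy 2-d "embeddings", for illustration only.
chunk_vecs = {
    "c1": [1.0, 0.0],
    "c2": [0.8, 0.6],
    "c3": [0.0, 1.0],
}
result = top_k([1.0, 0.1], chunk_vecs, k=2)
print(result)
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the ranking idea is the same, which is why improvements tend to target what is embedded or how results are re-ranked.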


### 2.1.4 Synthesis

- Can the LLM be used for more than generation, such as reasoning or planning?
- Can we improve prompts and queries for better generation?

# 3.0 Evaluations

### 3.1 How do we properly evaluate a RAG system?

- Evaluate in isolation, separately for retrieval and for generation/response.
- Evaluate end to end (E2E).

### 3.2 Evaluation in isolation (Retrieval)

- Evaluate the quality of the retrieved chunks given the user query. This is context relevance, which can also be measured with trulens-eval. The RAG triad consists of three evaluations for inspecting an LLM application:
    - **Context relevance**: Is the retrieved context relevant to the query?
    - **Answer relevance**: Is the generated response relevant to the query?
    - **Groundedness**: Is the generated response supported by the context?

- Try **Deeplake**:
    - Create a Deeplake dataset, either by hand or synthetically with an LLM.
    - Run the retriever over the dataset.
    - Measure ranking metrics.
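
The notes do not say which ranking metrics to use; two common choices are hit rate and mean reciprocal rank (MRR). The sketch below assumes each query has exactly one expected chunk id and that the retriever returns a ranked list of ids. The data and the `hit_rate_and_mrr` helper are hypothetical, and plain dictionaries stand in for the dataset.

```python
def hit_rate_and_mrr(results, expected, k=3):
    """Compute hit rate and MRR over the top-k results of each query."""
    if not results:
        return 0.0, 0.0
    hits = 0
    reciprocal_ranks = []
    for query, ranked_ids in results.items():
        top = ranked_ids[:k]
        target = expected[query]
        if target in top:
            hits += 1
            # Rank is 1-based, so the first position scores 1.0.
            reciprocal_ranks.append(1.0 / (top.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(results)
    return hits / n, sum(reciprocal_ranks) / n


# Hypothetical retriever output and expected chunk per query.
results = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c5", "c4", "c6"],
    "q3": ["c9", "c8", "c7"],
}
expected = {"q1": "c1", "q2": "c4", "q3": "c0"}
hit_rate, mrr = hit_rate_and_mrr(results, expected)
print(f"hit_rate={hit_rate:.2f} mrr={mrr:.2f}")
```

Hit rate only asks whether the expected chunk appears in the top k, while MRR also rewards ranking it higher.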


![image.png](attachment:image.png)


### 3.3 End-to-end (E2E) evaluation

- Evaluate the final generated response given the input.

### 3.3.1 Metrics for evaluation

**Correctness**:
- The generated response from the LLM is compared against a ground-truth reference response.
- Sometimes a ground-truth reference is not needed: in **label-free evaluation**, the generator's response is evaluated against the context returned by the retriever. This works because retrieving context and judging its relevance is already part of the RAG pipeline.

**Relevance**:
- Check whether the response actually answers the question at hand, and if not, why.
- Check whether it adheres to the rules and guidelines provided, and if not, why.

### 3.3.2 Steps for evaluation
- Create a Deeplake dataset. Sometimes a human creates a dataset of around 50-100 (question, answer) pairs.
- Run the dataset through the full RAG pipeline.
- Collect evaluation metrics.
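
The three steps above can be sketched as a small loop. Several things here are assumptions: a plain list stands in for the dataset, `rag_pipeline` is a hypothetical callable that maps a question to a response, and token-overlap F1 is used as a crude correctness score in place of whatever metric you have actually defined (an LLM judge, for example).

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_e2e(dataset, rag_pipeline):
    """Run every question through the pipeline and average the scores."""
    scores = [
        token_f1(rag_pipeline(row["question"]), row["answer"]) for row in dataset
    ]
    return sum(scores) / len(scores) if scores else 0.0


def toy_pipeline(question: str) -> str:
    # Stand-in for the full retrieve-then-generate pipeline.
    canned = {"What does RAG retrieve?": "RAG retrieves relevant chunks"}
    return canned.get(question, "I do not know")


dataset = [
    {"question": "What does RAG retrieve?", "answer": "RAG retrieves relevant chunks"},
    {"question": "What is top-k?", "answer": "the k nearest chunks"},
]
mean_score = evaluate_e2e(dataset, toy_pipeline)
print(f"mean correctness: {mean_score:.2f}")
```

Token overlap penalizes correct answers that are phrased differently from the reference, so it is best treated as a rough baseline.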

![image-2.png](attachment:image-2.png)