# Metric Definitions
We evaluate our system on five key metrics: Precision, Recall, F1 Score, Mean Reciprocal Rank (MRR), and End-to-End Accuracy.

## Retrieval Metrics:

### Precision
Precision represents the proportion of retrieved chunks that are actually relevant. It answers the question: "Of the chunks we retrieved, how many were correct?"

Key points:
- High precision indicates an efficient system with few false positives.
- Low precision suggests many irrelevant chunks are being retrieved.
- Our system retrieves a minimum of 3 chunks per query, so precision drops whenever fewer than 3 chunks are relevant to a query.

Formula:
$$
\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}
$$
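As a minimal sketch of the formula above, assuming chunks are identified by hashable IDs. The function name `precision` and the sample IDs are illustrative and not taken from the system's code:

```python
def precision(retrieved, correct):
    """Fraction of retrieved chunk IDs that are also in the correct set."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        # Assumed convention: nothing retrieved means precision 0
        return 0.0
    return len(retrieved_set & set(correct)) / len(retrieved_set)


# Hypothetical query: 3 chunks retrieved, only "c1" is correct
print(precision(["c1", "c2", "c3"], ["c1", "c7"]))
```

Here one of the three retrieved chunks is correct, so the result is one third.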

### Recall
Recall measures the completeness of our retrieval system. It answers the question: "Of all the correct chunks that exist, how many did we manage to retrieve?"

Key points:
- High recall indicates comprehensive coverage of necessary information.
- Low recall suggests important chunks are being missed.
- Recall is crucial for ensuring the LLM has access to all needed information.

Formula:
$$
\text{Recall} = \frac{\text{True Positives}}{\text{Total Correct}} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}
$$
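The recall formula can be sketched the same way, again assuming chunks are identified by hashable IDs. The function name `recall` and the sample IDs are illustrative:

```python
def recall(retrieved, correct):
    """Fraction of correct chunk IDs that appear among the retrieved ones."""
    correct_set = set(correct)
    if not correct_set:
        # Assumed convention: no correct chunks defined means recall 0
        return 0.0
    return len(set(retrieved) & correct_set) / len(correct_set)


# Hypothetical query: 2 correct chunks exist, only "c1" was retrieved
print(recall(["c1", "c2", "c3"], ["c1", "c7"]))  # 0.5
```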

### F1 Score
The F1 score provides a balanced measure between precision and recall. It's particularly useful when you need a single metric to evaluate system performance, especially with uneven class distributions.

Key points:
- F1 score ranges from 0 to 1, with 1 representing perfect precision and recall.
- It's the harmonic mean of precision and recall, tending towards the lower of the two values.
- Useful in scenarios where both false positives and false negatives are important.

Formula:
$$
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
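A short sketch of this formula, showing how the harmonic mean is pulled towards the lower value. The function name `f1_score` is illustrative, and returning 0 when both inputs are 0 is an assumed convention, since the formula is undefined in that case:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumed convention: avoid division by zero, report 0
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced inputs: F1 equals both values
print(f1_score(0.5, 0.5))  # 0.5

# Unbalanced inputs: the arithmetic mean would be 0.6,
# but F1 is much closer to the lower value (about 0.33)
print(f1_score(1.0, 0.2))
```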

Interpreting F1 score:
- An F1 score of 1.0 means every retrieved chunk is correct and every correct chunk was retrieved.
- An F1 score of 0.0 means no correct chunks were retrieved.
- Generally, the higher the F1 score, the better the overall retrieval performance.

### Balancing Precision, Recall, and F1 Score:
- There's often a trade-off between precision and recall.
- Because our system always retrieves at least 3 chunks per query, it favors recall over precision.
- The optimal balance depends on the specific use case.
- In many RAG systems, high recall is prioritized, since the LLM can ignore less relevant chunks during generation.
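To make the trade-off concrete, here is a small sketch that computes precision and recall over the top `k` results of a single ranked list. The helper `precision_recall_at_k`, the chunk IDs, and the ranking are all hypothetical:

```python
def precision_recall_at_k(ranked, correct, k):
    """Precision and recall computed over the top-k ranked chunk IDs."""
    top_k = set(ranked[:k])
    hits = len(top_k & set(correct))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical ranking for one query; "c1" and "c2" are the correct chunks
ranked = ["c1", "c4", "c2", "c9", "c5"]
correct = {"c1", "c2"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked, correct, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

In this example, going from 1 to 3 retrieved chunks raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.67. Retrieving 5 chunks adds no recall and lowers precision further to 0.40.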

### Mean Reciprocal Rank (MRR) @k
MRR measures how well our system ranks relevant information. It helps us understand how quickly a user would find what they're looking for if they started from the top of our retrieved results. The "@k" suffix means that only the top k retrieved results are considered for each query.

Key points:
- MRR ranges from 0 to 1, where 1 is perfect (a relevant chunk is always ranked first).
- It only considers the rank of the first correct result for each query.
- Higher MRR indicates better ranking of relevant information.

Formula:
$$
\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
$$

Where:
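- $|Q|$ is the total number of queries
- $\text{rank}_i$ is the position of the first relevant item for the $i$-th query (a query with no relevant item among its retrieved results is conventionally given a reciprocal rank of 0)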
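The MRR computation can be sketched as follows. The function name `mean_reciprocal_rank`, its `k` parameter, and the sample data are hypothetical. Scoring a query as 0 when no relevant chunk appears in its results is a common convention assumed here, not something stated by the formula:

```python
def mean_reciprocal_rank(ranked_results, correct_sets, k=None):
    """Average of 1/rank of the first relevant chunk across all queries.

    ranked_results: one ranked list of chunk IDs per query.
    correct_sets:   one collection of correct chunk IDs per query.
    k:              optional cutoff; only the top-k results are inspected.
    """
    if len(ranked_results) != len(correct_sets):
        raise ValueError("Need one set of correct chunks per query")
    if not ranked_results:
        return 0.0

    total = 0.0
    for ranked, correct in zip(ranked_results, correct_sets):
        top = ranked if k is None else ranked[:k]
        for position, chunk_id in enumerate(top, start=1):
            if chunk_id in correct:
                total += 1.0 / position
                break  # Only the first relevant result counts
        # Assumed convention: no relevant result contributes 0
    return total / len(ranked_results)


# Hypothetical results: first hit at rank 1, at rank 3, and no hit at all
ranked_results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
correct_sets = [{"a"}, {"f"}, {"z"}]

print(mean_reciprocal_rank(ranked_results, correct_sets))       # (1 + 1/3 + 0) / 3
print(mean_reciprocal_rank(ranked_results, correct_sets, k=2))  # (1 + 0 + 0) / 3
```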

## End-to-End Metrics:

### End-to-End Accuracy
We use an LLM-as-judge (Claude 3.5 Sonnet) to decide whether each generated answer is correct, given the question and the ground-truth answer.

Formula:
$$
\text{End-to-End Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}}
$$

This metric evaluates the entire pipeline, from retrieval to answer generation.
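
The accuracy calculation itself can be sketched independently of the judge. In the sketch below, `end_to_end_accuracy`, the dictionary keys, and `exact_match_judge` are all hypothetical. The exact-match function is only a stand-in so the example runs offline, and it does not reflect how the LLM judge is actually prompted or called:

```python
def end_to_end_accuracy(examples, judge):
    """Share of examples whose generated answer the judge marks as correct.

    examples: list of dicts with "question", "ground_truth", "generated_answer".
    judge:    callable returning True if the generated answer is correct.
    """
    if not examples:
        return 0.0
    correct = sum(
        1
        for ex in examples
        if judge(ex["question"], ex["ground_truth"], ex["generated_answer"])
    )
    return correct / len(examples)


def exact_match_judge(question, ground_truth, generated_answer):
    # Stand-in for an LLM judge, for illustration only
    return ground_truth.strip().lower() == generated_answer.strip().lower()


# Hypothetical evaluation set: 3 of 4 answers match the ground truth
examples = [
    {"question": "q1", "ground_truth": "Paris", "generated_answer": "paris"},
    {"question": "q2", "ground_truth": "42", "generated_answer": "42"},
    {"question": "q3", "ground_truth": "blue", "generated_answer": "green"},
    {"question": "q4", "ground_truth": "yes", "generated_answer": " Yes "},
]

print(end_to_end_accuracy(examples, exact_match_judge))  # 0.75
```

Passing the judge in as a callable keeps the metric separate from the model call, so the same function works with any judge that returns a boolean verdict.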