## Reference-free evaluation

In a retrieval-augmented generation (RAG) system, creating ground truth for every sample to be analyzed and evaluated can be time-consuming. Moreover, useful insights about the performance of a RAG pipeline can be derived using only the questions, retrieved contexts, and generated answers. **Reference-free evaluation metrics are metrics that can be used without annotated ground truth.** These metrics can be applied to any random sample drawn from a RAG pipeline.



### Faithfulness 


This measures the factual consistency of the generated answer against the given context. It is calculated from the `answer` and the retrieved `contexts`. The score is scaled to the (0,1) range; higher is better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Each of these claims is then cross-checked against the given context to determine whether it can be inferred from that context. The faithfulness score is the number of claims that can be inferred from the given context divided by the total number of claims in the generated answer:

$$
\text{Faithfulness score} = {|\text{Number of claims that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}
$$
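The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the Ragas implementation: the function name `faithfulness_score` and the list of per-claim verdicts are hypothetical, and the step that extracts and judges claims (done by an LLM in practice) is assumed to have already happened.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of claims that can be inferred from the context.

    claim_supported[i] is True when claim i was judged inferable from
    the retrieved context (the judging step itself is not shown here).
    """
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer with four claims, three of them supported.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(score)  # 0.75
```

A single unsupported claim out of four already lowers the score to 0.75, which is why longer answers with loosely grounded details tend to score lower.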


#### Example usage

```python
from datasets import Dataset
from ragas.metrics.faithfulness import Faithfulness

faithfulness = Faithfulness()

# Placeholder: `dataset` is expected to look like this
# Dataset({
#     features: ['question','contexts','answer'],
#     num_rows: 25
# })
dataset: Dataset

results = faithfulness.score(dataset)
```

#### Example

**High faithfulness**

**Low faithfulness**



### Answer Relevancy 

This measures how relevant the generated answer is to the prompt. If the generated answer is incomplete or contains redundant information, the score will be low. It is calculated from the `question` and the `answer`. Values range over (0,1); higher is better.

The generated answer is considered relevant if it directly addresses the question in an appropriate way. In particular, this assessment of answer relevance does not take factuality into account, but penalises cases where the answer is incomplete or contains redundant information. It is calculated by first prompting the LLM n times to generate an appropriate question for the generated answer, and then measuring the mean cosine similarity between the generated questions and the actual question. The key idea is that if the generated answer addresses the original question, the LLM should be able to reconstruct that question from the answer.
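The final averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the embeddings are toy two-dimensional vectors rather than output from a real embedding model, and the question-generation step is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_emb: Sequence[float],
    generated_question_embs: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the original question and the
    n questions generated back from the answer."""
    if not generated_question_embs:
        raise ValueError("at least one generated question is required")
    sims = [cosine_similarity(question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical): one generated question matches the
# original exactly, the other is orthogonal to it.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
relevancy = answer_relevancy_score(original, generated)
print(relevancy)  # 0.5
```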

#### Example usage

```python
from datasets import Dataset
from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy()

# init_model loads the models used by the metric
answer_relevancy.init_model()

# Placeholder: `dataset` is expected to look like this
# Dataset({
#     features: ['question','answer'],
#     num_rows: 25
# })
dataset: Dataset

results = answer_relevancy.score(dataset)
```

#### Example

**High answer relevancy**

**Low answer relevancy**



### Context Precision

Measures the precision of the retrieved context. It is calculated from `question` and `contexts`. Values range over (0,1); higher is better.

The retrieved context should ideally contain only the information necessary to answer the given query. To calculate this, we first estimate $|S|$ by identifying the sentences in the retrieved context that can be used to answer the given question. The final score is given by

$$
\text{context precision} = {|S| \over |\text{Total number of sentences in retrieved context}|}
$$
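A minimal sketch of this ratio is shown below. It assumes the context has already been split into sentences and uses a hypothetical keyword check, `mentions_capital`, as a stand-in for the LLM judgment that decides whether a sentence belongs to $S$; none of these names come from Ragas.

```python
from typing import Callable, Sequence


def context_precision_score(
    sentences: Sequence[str],
    is_relevant: Callable[[str], bool],
) -> float:
    """|S| divided by the total number of sentences in the context."""
    if not sentences:
        raise ValueError("the retrieved context must contain at least one sentence")
    relevant = [s for s in sentences if is_relevant(s)]  # this is the set S
    return len(relevant) / len(sentences)


# Toy context for the question "What is the capital of France?"
context_sentences = [
    "Paris is the capital of France.",
    "France is known for its cheese.",
    "The Eiffel Tower was completed in 1889.",
    "Many tourists visit France every year.",
]


def mentions_capital(sentence: str) -> bool:
    # Hypothetical stand-in for the LLM relevance judgment.
    return "capital" in sentence.lower()


precision = context_precision_score(context_sentences, mentions_capital)
print(precision)  # 0.25
```

Only one of the four sentences helps answer the question, so the context scores 0.25 even though it does contain the answer.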

#### Example usage

```python
from datasets import Dataset
from ragas.metrics import ContextPrecision

context_precision = ContextPrecision(strictness=3)

# init_model loads the models used by the metric
context_precision.init_model()

# Placeholder: `dataset` is expected to look like this
# Dataset({
#     features: ['question','contexts'],
#     num_rows: 25
# })
dataset: Dataset

results = context_precision.score(dataset)
```

#### Example


## Aspect critique 

Aspect critiques are designed to judge the submission against defined aspects such as harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of an aspect critique is always binary. It is calculated from `answer`.

Critiques are LLM evaluators that evaluate your submission using the provided aspect. Several aspects, such as correctness and harmfulness, come predefined with Ragas critiques (check `SUPPORTED_ASPECTS` for the full list). The `strictness` parameter is used to ensure a level of self-consistency in the prediction (ideal range 2-4). The binary output indicates whether or not the submission adhered to the given aspect definition. These scores are not considered for the final `ragas_score` because of their non-continuous nature.

```python
from datasets import Dataset

# Check the predefined aspects
from ragas.metrics.critique import SUPPORTED_ASPECTS
print(SUPPORTED_ASPECTS)

from ragas.metrics.critique import conciseness

# Placeholder: `dataset` is expected to look like this
# Dataset({
#     features: ['question','answer'],
#     num_rows: 25
# })
dataset: Dataset

results = conciseness.score(dataset)

# Define your own critique
from ragas.metrics.critique import AspectCritique
mycritique = AspectCritique(name="my-critique", definition="Is the submission safe for children?", strictness=2)

results = mycritique.score(dataset)
```
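One common way to obtain self-consistency from an LLM judge is to ask for several independent verdicts and take a majority vote. The sketch below assumes this interpretation of `strictness` (as the number of verdicts collected); it is an illustration of the idea, not a confirmed description of how Ragas implements it, and `majority_verdict` is a hypothetical helper.

```python
from collections import Counter
from typing import Sequence


def majority_verdict(verdicts: Sequence[bool]) -> bool:
    """Combine several binary verdicts into one by majority vote.

    Ties are resolved as False, i.e. the submission is only accepted
    when strictly more than half of the verdicts are positive.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    return counts[True] > counts[False]


# Hypothetical verdicts from three independent judge calls.
final = majority_verdict([True, False, True])
print(final)  # True
```

Using an odd number of verdicts avoids ties, at the cost of one extra LLM call.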



## Ground truth

The performance of some components in a RAG system can only be fully determined using an evaluation dataset that contains annotated ground truth. TODO: Give reference to testset generation

### Context Recall

Measures the recall of the retrieved context using the annotated answer as ground truth. It is calculated from the `ground truth` and the `retrieved context`. Values range over (0,1); higher is better.

To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to verify whether it can be attributed to the retrieved context. In the ideal scenario, all the sentences in the ground truth answer can be attributed to the retrieved context.

$$
\text{context recall} = {|\text{GT sentences that can be attributed to context}| \over |\text{Number of sentences in GT}|}
$$
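As a hypothetical worked example of the formula above, suppose the ground truth (GT) answer contains 4 sentences and 3 of them can be attributed to the retrieved context:

$$
\text{context recall} = {3 \over 4} = 0.75
$$

The one sentence that cannot be attributed points to information the retriever failed to return.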


#### Example usage

```python
from datasets import Dataset
from ragas.metrics import ContextRecall

context_recall = ContextRecall()

# Placeholder: `dataset` is expected to look like this
# Dataset({
#     features: ['contexts','ground_truths'],
#     num_rows: 25
# })
dataset: Dataset

results = context_recall.score(dataset)
```

### Answer semantic similarity

Measures the semantic similarity of the generated answer against the ground truth. It is calculated from the `ground truth` and the `answer`. Values range over (0,1); higher is better.

Semantic similarity between answers can provide insights into the quality of the generated answer. It is calculated using a cross-encoder model.

#### Example usage


```python

```

### Answer correctness




## Test set generation 
TODO: Discuss the idea of continual testing


### Pytest

## Ragas LLM


## Integrations

### Langchain

### Llama-index

### Langsmith




