59 changes: 48 additions & 11 deletions docs/concepts/metrics/available_metrics/context_recall.md
# Context Recall

Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important.

Since it is about not missing anything, calculating context recall always requires a reference to compare against. The LLM-based Context Recall metric is computed from `user_input`, `reference`, and `retrieved_contexts`, and its values range between 0 and 1, with higher values indicating better performance. It uses `reference` as a proxy for `reference_contexts`, which makes the metric easier to use, since annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim is analyzed to determine whether it can be attributed to the retrieved context. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.

The formula for calculating context recall is as follows:

$$
\text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}
$$
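
To make the formula concrete, here is a small hand-worked sketch. The claim split below is hypothetical; the actual metric delegates both the decomposition of the reference and the attribution check to the LLM.

```python
# Hypothetical decomposition of a reference into claims (illustration only)
reference_claims = [
    "The Eiffel Tower is located in Paris.",    # attributable to the retrieved context
    "The Eiffel Tower was completed in 1889.",  # not found in the retrieved context
]

supported_claims = 1  # claims attributable to the retrieved context
context_recall = supported_claims / len(reference_claims)
print(context_recall)  # 0.5
```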

## Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ContextRecall

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = ContextRecall(llm=llm)

# Evaluate
result = await scorer.ascore(
    user_input="Where is the Eiffel Tower located?",
    retrieved_contexts=["Paris is the capital of France."],
    reference="The Eiffel Tower is located in Paris."
)
print(f"Context Recall Score: {result.value}")
```

Output:

```
Context Recall Score: 1.0
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        user_input="Where is the Eiffel Tower located?",
        retrieved_contexts=["Paris is the capital of France."],
        reference="The Eiffel Tower is located in Paris."
    )
    ```

## LLM Based Context Recall (Legacy API)

!!! warning "Legacy API"
    The following example uses the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above. This API will be deprecated in version 0.4 and removed in version 1.0.

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["Paris is the capital of France."],
)

context_recall = LLMContextRecall(llm=evaluator_llm)
await context_recall.single_turn_ascore(sample)
```

Output:
```
1.0
```
155 changes: 125 additions & 30 deletions docs/concepts/metrics/available_metrics/factual_correctness.md

`FactualCorrectness` is a metric that compares and evaluates the factual accuracy of the generated `response` with the `reference`. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the `mode` parameter.

### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import FactualCorrectness

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = FactualCorrectness(llm=llm)

# Evaluate
result = await scorer.ascore(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)
print(f"Factual Correctness Score: {result.value}")
```

Output:

```
Factual Correctness Score: 0.67
```

By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter:

```python
# Precision mode - measures what fraction of response claims are supported by reference
scorer = FactualCorrectness(llm=llm, mode="precision")
result = await scorer.ascore(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)
print(f"Precision Score: {result.value}")
```

Output:

```
Precision Score: 1.0
```

You can also configure the claim decomposition granularity using `atomicity` and `coverage` parameters:

```python
# High granularity - more detailed claim decomposition
scorer = FactualCorrectness(
    llm=llm,
    mode="f1",
    atomicity="high",  # More atomic claims
    coverage="high"    # Comprehensive coverage
)
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        response="The Eiffel Tower is located in Paris.",
        reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
    )
    ```

### How It's Calculated

The formula for calculating True Positive (TP), False Positive (FP), and False Negative (FN) is as follows:

$$
\text{True Positive (TP)} = \text{Number of claims in the response that are present in the reference}
$$

$$
\text{False Positive (FP)} = \text{Number of claims in the response that are not present in the reference}
$$

$$
\text{False Negative (FN)} = \text{Number of claims in the reference that are not present in the response}
$$

Precision, recall, and F1 score are then computed from these counts:

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

$$
\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
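
To make the arithmetic concrete, here is a small hand-worked sketch with hypothetical claim counts; the actual metric obtains these counts from LLM-based claim decomposition and natural language inference.

```python
# Hypothetical claim counts, used only to illustrate the formulas above
tp, fp, fn = 2, 1, 1

precision = tp / (tp + fp)                          # 0.67
recall = tp / (tp + fn)                             # 0.67
f1 = 2 * precision * recall / (precision + recall)  # 0.67

print(round(f1, 2))
```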


### Controlling the Number of Claims

Each sentence in the response and reference can be broken down into one or more claims. The number of claims that are generated from a single sentence is determined by the level of `atomicity` and `coverage` required for your application.
By adjusting both atomicity and coverage, you can customize the level of detail in the generated claims:
- Use **Low Atomicity and Low Coverage** when only the key information is necessary, such as for summarization.

This flexibility in controlling the number of claims helps ensure that the information is presented at the right level of granularity for your application's requirements.
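
As a rough illustration of what these settings change (the exact decomposition depends on the LLM and prompts, so the claims below are hypothetical), a single sentence might be split as follows:

```python
sentence = "Albert Einstein was a German theoretical physicist."

# Low atomicity / low coverage: a single coarse claim keeping only the key information
low_granularity_claims = [
    "Albert Einstein was a theoretical physicist.",
]

# High atomicity / high coverage: smaller, more atomic claims covering every detail
high_granularity_claims = [
    "Albert Einstein was German.",
    "Albert Einstein was a physicist.",
    "Albert Einstein worked on theoretical physics.",
]
```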

## Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

### Example with SingleTurnSample

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness


sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

scorer = FactualCorrectness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```

Output:

```
0.67
```

### Changing the Mode

By default, the mode is set to `f1`. You can change the mode to `precision` or `recall` by setting the `mode` parameter.

```python
scorer = FactualCorrectness(llm=evaluator_llm, mode="precision")
await scorer.single_turn_ascore(sample)
```

Output:

```
1.0
```

### Controlling Atomicity

```python
scorer = FactualCorrectness(llm=evaluator_llm, mode="precision", atomicity="low")
await scorer.single_turn_ascore(sample)
```

Output:

```
1.0
```
69 changes: 60 additions & 9 deletions docs/concepts/metrics/available_metrics/nvidia_metrics.md
- **0** → The response is not grounded.
- **1** → The response is partially grounded.
- **2** → The response is fully grounded (every statement can be found or inferred from the retrieved context).

### Example

```python
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ResponseGroundedness

# Setup LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Create metric
scorer = ResponseGroundedness(llm=llm)

# Evaluate
result = await scorer.ascore(
    response="Albert Einstein was born in 1879.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)
print(f"Response Groundedness Score: {result.value}")
```

Output:

```
Response Groundedness Score: 1.0
```

!!! note "Synchronous Usage"
    If you prefer synchronous code, you can use the `.score()` method instead of `.ascore()`:

    ```python
    result = scorer.score(
        response="Albert Einstein was born in 1879.",
        retrieved_contexts=[...]
    )
    ```

### How It’s Calculated

**Step 1:** The LLM is prompted with two distinct templates to evaluate the grounding of the response with respect to the retrieved contexts. Each prompt returns a grounding rating of **0**, **1**, or **2**.
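
As a minimal sketch of how two such ratings could be combined into the final score (the averaging and normalization to the 0 to 1 range shown here are assumptions for illustration, not the exact implementation):

```python
# Assumption: the two template ratings (each 0, 1, or 2) are averaged
# and then normalized to the 0-1 range.
def aggregate_groundedness(rating_a: int, rating_b: int) -> float:
    return ((rating_a + rating_b) / 2) / 2

print(aggregate_groundedness(2, 2))  # 1.0, fully grounded according to both judges
```
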
In this example, the retrieved contexts provide both the birthdate and location of Albert Einstein's birth, so the response is fully grounded.
- **Token Usage:** Faithfulness consumes more tokens, whereas Response Groundedness is more token-efficient.
- **Explainability:** Faithfulness provides transparent reasoning for each claim, while Response Groundedness provides a raw score.
- **Robust Evaluation:** Faithfulness incorporates user input for a comprehensive assessment, whereas Response Groundedness ensures consistency through dual LLM evaluations.

### Legacy Metrics API

The following examples use the legacy metrics API pattern. For new projects, we recommend using the collections-based API shown above.

!!! warning "Deprecation Timeline"
    This API will be deprecated in version 0.4 and removed in version 1.0. Please migrate to the collections-based API shown above.

#### Example with SingleTurnSample

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ResponseGroundedness

sample = SingleTurnSample(
    response="Albert Einstein was born in 1879.",
    retrieved_contexts=[
        "Albert Einstein was born March 14, 1879.",
        "Albert Einstein was born at Ulm, in Württemberg, Germany.",
    ]
)

scorer = ResponseGroundedness(llm=evaluator_llm)
score = await scorer.single_turn_ascore(sample)
print(score)
```

Output:

```
1.0
```