Merged
30 commits
8c4f0d9
added new metrics experiments
shahules786 Jul 20, 2023
9ec071c
crtique metrics
shahules786 Jul 20, 2023
276feba
merge main
shahules786 Jul 20, 2023
3e4757d
added critique metrics
shahules786 Jul 20, 2023
9e8fd16
rmv
shahules786 Jul 20, 2023
8a59c59
rmv
shahules786 Jul 20, 2023
edd8294
added critique experiments
shahules786 Jul 21, 2023
9d1d445
update readme
shahules786 Jul 21, 2023
5cf8174
rename metrics
shahules786 Jul 21, 2023
91d6725
added aspect critique
shahules786 Jul 21, 2023
70862f9
added new metrics to tests
shahules786 Jul 21, 2023
a166da6
formating
shahules786 Jul 21, 2023
3d0f644
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 21, 2023
fbfa533
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 21, 2023
f62caf6
update base class
shahules786 Jul 21, 2023
cc9625d
rmv binary metrics from ragas_score
shahules786 Jul 21, 2023
2a0f671
crtique assesments
shahules786 Jul 21, 2023
fe416d6
update metrics
shahules786 Jul 21, 2023
5210160
update aspects
shahules786 Jul 21, 2023
dc227cb
added documentation
shahules786 Jul 22, 2023
c90c92a
Merge branch 'main' of https://github.com/explodinggradients/ragas in…
shahules786 Jul 22, 2023
1f4ff75
change to default_factory
shahules786 Jul 22, 2023
0ad0cce
revert commit
shahules786 Jul 22, 2023
118e47e
fixed defualt factory
jjmachan Jul 22, 2023
80b4d74
fixed format
jjmachan Jul 22, 2023
75ff74f
smaller batch for benchmark
jjmachan Jul 22, 2023
4e5cfcb
fix types
shahules786 Jul 22, 2023
7a7a242
Merge branch 'dev-gptscore' of https://github.com/shahules786/ragas i…
shahules786 Jul 22, 2023
6a25fc0
fix types
shahules786 Jul 22, 2023
d5675c7
added supported aspects
shahules786 Jul 24, 2023
5 changes: 3 additions & 2 deletions README.md
@@ -80,12 +80,13 @@ results = evaluate(dataset)
If you want a more in-depth explanation of core components, check out our [quick-start notebook](./docs/quickstart.ipynb)
## :luggage: Metrics

- Ragas measures your pipeline's performance against two dimensions
+ Ragas measures your pipeline's performance against different dimensions
1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
2. **Context Relevancy**: measures how relevant the retrieved context is to the question. Ideally, the context should contain only the information necessary to answer the question; the presence of redundant information in the context is penalized.
3. **Answer Relevancy**: measures how relevant the generated answer is to the question. This does not ensure the factuality of the generated answer; rather, it penalizes the presence of redundant information in the answer.
4. **Aspect Critiques**: designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary.

- Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
+ The final `ragas_score` is the harmonic mean of the individual metric scores.
**Member** commented: is the harmonic mean still relevant?

**Member Author** replied: This is specified in the docs.

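A minimal sketch of what the harmonic mean of the individual metric scores looks like (the per-metric values here are illustrative, not produced by this PR; binary critique scores are excluded, as noted above):

```python
from statistics import harmonic_mean

# Hypothetical per-metric scores; aspect critiques are binary and excluded.
scores = {"faithfulness": 0.90, "context_relevancy": 0.75, "answer_relevancy": 0.85}
ragas_score = harmonic_mean(list(scores.values()))
print(f"ragas_score = {ragas_score:.3f}")  # ragas_score = 0.829
```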
To read more about our metrics, check out the [docs](/docs/metrics.md).
## 🫂 Community
28 changes: 28 additions & 0 deletions docs/metrics.md
@@ -1,5 +1,6 @@
# Metrics


1. `faithfulness`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. The score is scaled to the (0,1) range; higher is better.
```python
from ragas.metrics.factuality import Faithfulness
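# The diff collapses the rest of this snippet; a minimal sketch of the
# usage pattern, mirroring the other examples in this file (the
# `faithfulness` instance name and the dataset columns are assumptions,
# not shown in this diff):
from datasets import Dataset

faithfulness = Faithfulness()

# Dataset({
#     features: ['question', 'contexts', 'answer'],
#     num_rows: 25
# })
dataset: Dataset

results = faithfulness.score(dataset)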
```

@@ -41,6 +42,33 @@

```python
results = answer_relevancy.score(dataset)
```


4. `Aspect Critiques`: Critiques are LLM evaluators that judge your submission against a provided aspect. Several aspects like `correctness`, `harmfulness`, etc. come predefined with Ragas critiques (check `SUPPORTED_ASPECTS` for the full list), and you can also define your own aspect. The `strictness` parameter is used to ensure a level of self-consistency in predictions (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition or not. These scores are not considered for the final `ragas_score` due to their non-continuous nature.
- List of predefined aspects:
`correctness`, `harmfulness`, `coherence`, `conciseness`, `maliciousness`

```python
## check predefined aspects
from ragas.metrics.critique import SUPPORTED_ASPECTS
print(SUPPORTED_ASPECTS)

from ragas.metrics.critique import conciseness
from datasets import Dataset
# Dataset({
# features: ['question','answer'],
# num_rows: 25
# })
dataset: Dataset

results = conciseness.score(dataset)


## Define your critique
from ragas.metrics.critique import AspectCritique
mycritique = AspectCritique(name="my-critique", definition="Is the submission safe to children?", strictness=2)

```
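
Scoring with a custom critique follows the same pattern as the predefined aspects (a sketch; `mycritique` and `dataset` are the objects defined above):

```python
# Output is binary: 1 if the submission adheres to the aspect, 0 otherwise.
results = mycritique.score(dataset)
```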


## Why is ragas better than scoring with GPT-3.5 directly?
LLMs like GPT-3.5 struggle to score generated text directly. For instance, these models tend to generate only integer scores, and those scores vary across invocations. Ragas's solution is to apply advanced paradigms and techniques that leverage LLMs while minimizing this bias; the self-consistency idea behind the `strictness` parameter is sketched below.
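A conceptual sketch of that self-consistency idea (an illustration only, not ragas internals; `ask_llm` is a hypothetical callable that returns a 0/1 verdict):

```python
from collections import Counter
from typing import Callable

def self_consistent_verdict(ask_llm: Callable[[str], int], prompt: str, strictness: int = 3) -> int:
    # Ask the LLM for the same binary judgement several times and take the
    # majority vote, damping the run-to-run variance of a single call.
    votes = [ask_llm(prompt) for _ in range(strictness)]
    return Counter(votes).most_common(1)[0][0]
```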
<h1 align="center">