# Quickstart

In [1]:
# only run this if you have an editable install
%load_ext autoreload
%autoreload 2

### load your data

For this quickstart we will use a dataset that we prepared from the [eli5](https://huggingface.co/datasets/eli5) dataset, with the model's responses added. The dataset is available on [Hugging Face](https://huggingface.co/datasets/explodinggradients/eli5-test).

The dataset has the following columns:

| column name    | type      | description                                                                       |
|----------------|-----------|-----------------------------------------------------------------------------------|
| prompt         | str       | the prompt/question to answer                                                     |
| context        | str       | context string that has any relevant priors the LLM needs to answer the questions |
| references     | list[str] | reference documents the LLM can use to respond to the prompt                      |
| ground_truth   | list[str] | accepted answers given by human annotators                                        |
| generated_text | str       | the generated output from the LLM                                                 |
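
To make the column table concrete, here is a minimal sketch of one row expressed as a typed Python dictionary. The `Eli5Record` type, the `check_row` helper, and the example values are hypothetical and only illustrate the schema; they are not part of ragas or of the dataset itself.

```python
from typing import List, TypedDict


class Eli5Record(TypedDict):
    """One row of the dataset, mirroring the column table (hypothetical type)."""

    prompt: str
    context: str
    references: List[str]
    ground_truth: List[str]
    generated_text: str


# Made-up values for illustration only; not taken from the actual dataset.
example_row: Eli5Record = {
    "prompt": "Why is the sky blue?",
    "context": "Answer for a general audience.",
    "references": ["Air molecules scatter short wavelengths more strongly."],
    "ground_truth": ["Blue light is scattered more than red light by the air."],
    "generated_text": "The air scatters blue light more than other colours.",
}


def check_row(row: dict) -> bool:
    """Return True if `row` has every expected column with the expected type."""
    string_columns = ("prompt", "context", "generated_text")
    list_columns = ("references", "ground_truth")
    if not all(isinstance(row.get(name), str) for name in string_columns):
        return False
    for name in list_columns:
        value = row.get(name)
        # Each list column must be a list whose items are all strings.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            return False
    return True
```

A check like this can be useful before evaluation if you swap in your own data and want to confirm it matches the expected format.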

In [2]:
from datasets import load_dataset

ds = load_dataset("explodinggradients/eli5-test", split="test_eli5")
ds

Dataset({
    features: ['context', 'prompt', 'ground_truth', 'references', 'generated_text'],
    num_rows: 500
})

### choose the metrics

ragas provides a wide range of metrics, based on recent research, for evaluating generated answers. You can see the entire list [here](https://github.com/explodinggradients/ragas#metrics). For this quickstart we will use three metrics, one from each type we support:
1. `edit_ratio` - the Levenshtein distance between the generated text and the ground truth, divided by the sum of their character counts.
2. `bleu_score` - measures precision by comparing clipped n-grams in the generated text to the ground truth text.
3. `bert_score` - measures the similarity between the ground truth answers and the generated text using SBERT vector embeddings.
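
As an illustration of the `edit_ratio` description, the sketch below computes a Levenshtein distance in plain Python and divides it by the combined length of the two strings. It follows the wording of the description literally; the function names `levenshtein` and `edit_ratio_sketch` are hypothetical, and the library's own implementation may differ in details such as substitution cost or normalisation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def edit_ratio_sketch(generated: str, ground_truth: str) -> float:
    """Levenshtein distance divided by the summed character counts."""
    total = len(generated) + len(ground_truth)
    if total == 0:
        # Two empty strings are identical; avoid dividing by zero.
        return 0.0
    return levenshtein(generated, ground_truth) / total
```

Under this definition, identical strings score 0 and the value grows as the texts diverge.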

In [5]:
from ragas.metrics import edit_ratio, bleu_score, bert_score

Now we can initialize the `Evaluation` object. It loads your metrics and data and runs the evaluation for you.

In [7]:
from ragas.metrics import Evaluation

e = Evaluation(
    metrics=[bert_score, edit_ratio, bleu_score],
    batched=False,
    batch_size=30,
)

In [18]:
# run it with .eval()
result = e.eval(ds["ground_truth"], ds["generated_text"])


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
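
The warnings above are emitted when the generated text shares no n-grams of a given order with the ground truth. A minimal sketch of clipped n-gram precision, the quantity BLEU builds on, shows how this happens for short or loosely matching answers. The helper names `ngrams` and `clipped_precision` and the toy sentences are assumptions for illustration; this is not the code ragas runs.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in `tokens`."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: List[str], reference: List[str], n: int) -> float:
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        # The hypothesis is shorter than n tokens, so it has no n-grams.
        return 0.0
    # Clip each n-gram's count at the number of times it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total


# Toy example (made-up sentences, whitespace tokenisation).
reference = "the cat sat on the mat".split()
hypothesis = "the the the cat".split()
precisions = [clipped_precision(hypothesis, reference, n) for n in (1, 2, 3, 4)]
```

In this toy example the 3-gram and 4-gram precisions are zero even though the unigrams overlap well, which is the situation the warnings describe.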


### analysing results

The returned `Result` object is used to analyse the results.

In [28]:
from rich.pretty import pprint

pprint(result)

You can access individual metric results via `result['<name>']`. The object also has a `.describe()` method that shows the distribution of the results, and the individual scores are available through the `.scores` attribute.

In [16]:
from pandas import DataFrame

# view with pandas
df = DataFrame(result.describe())
df

| statistic | BERTScore_cosine | edit_ratio | BLEU          |
|-----------|------------------|------------|---------------|
| mean      | 0.375526         | 0.414824   | 0.01084858    |
| 25%       | 0.212339         | 0.399876   | 3.489775e-155 |
| 50%       | 0.332697         | 0.429187   | 4.318061e-79  |
| 75%       | 0.532642         | 0.449509   | 1.525948e-05  |
| min       | 0.007017         | 0.102182   | 4.029193e-232 |
| max       | 0.91068          | 0.572917   | 0.1506915     |
| std       | 0.207559         | 0.058072   | 0.02343307    |


In [29]:
result.scores

Dataset({
    features: ['BERTScore_cosine', 'edit_ratio', 'BLEU'],
    num_rows: 500
})