# ARES

**IMPORTANT:** ARES uses a lot of resources, both computationally and in terms of disk storage. We split the code into two sections.

The first section is especially demanding, so before running it make sure you have 100 GB of available disk space and access to substantial GPU power. Unfortunately, we were not able to run it to completion ourselves, because at some point we ran out of GPU memory.

The second section is lighter, so we were able to run it ourselves, although we had to use a lightweight model to do so.

The code examples come from the ARES website, but parts had to be modified because the official guide is not up to date. We also made some changes so that the code can run in our environment. Additionally, we added detailed descriptions of what each step does, based on the information in the paper. If you are not able to run this code, we recommend that you at least follow along and read everything to see how this package is used.

ARES's website:
https://ares-ai.vercel.app/index.html

## Environment setup

In [None]:
%pip install ares-ai faiss-gpu
%pip install datasets==2.12.0
%pip install transformers==4.43.2
%pip install vllm==0.5.2

Collecting transformers==4.40.1 (from ares-ai)
  Using cached transformers-4.40.1-py3-none-any.whl.metadata (137 kB)
Using cached transformers-4.40.1-py3-none-any.whl (9.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.43.2
    Uninstalling transformers-4.43.2:
      Successfully uninstalled transformers-4.43.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.5.2 requires transformers>=4.42.4, but you have transformers 4.40.1 which is incompatible.
Successfully installed transformers-4.40.1
Collecting transformers==4.43.2
  Using cached transformers-4.43.2-py3-none-any.whl.metadata (43 kB)
Using cached transformers-4.43.2-py3-none-any.whl (9.4 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found exis

## Downloading sample data provided by the authors of ARES

In [None]:
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_labeled_output.tsv
!wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_unlabeled_output.tsv

--2024-11-16 16:02:56--  https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4803 (4.7K) [text/plain]
Saving to: ‘nq_few_shot_prompt_for_judge_scoring.tsv’


2024-11-16 16:02:56 (68.4 MB/s) - ‘nq_few_shot_prompt_for_judge_scoring.tsv’ saved [4803/4803]

--2024-11-16 16:02:56--  https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.

In [None]:
few_shot_examples_path = 'nq_few_shot_prompt_for_judge_scoring.tsv'
prompts_for_synthetic_query = 'nq_few_shot_prompt_for_synthetic_query_generation.tsv'
human_preference_dataset_path = 'nq_labeled_output.tsv'
rag_output_path = 'nq_unlabeled_output.tsv'
documents_sampled = 6189

### Shortening the input files

This is to speed up computations. If you have substantial computational power available and wish to evaluate the full dataset, skip this step.

In [None]:
import pandas as pd

documents_sampled = 20

for file_path in [human_preference_dataset_path, rag_output_path]:
    df = pd.read_csv(file_path, sep='\t')
    df = df.iloc[:documents_sampled]
    df.to_csv(file_path, sep='\t', index=False)
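
Note that the cell above overwrites the downloaded files, so going back to the full dataset means downloading them again. A non-destructive variant could write the shortened data to separate files instead. The sketch below is only an illustration using the standard library; the helper name `truncate_tsv` and the `_small` file suffix are hypothetical and not part of ARES.

```python
import csv

def truncate_tsv(src, dst, n_rows):
    """Copy the header and the first n_rows data rows of a TSV file to dst.

    Hypothetical helper: the source file is left untouched.
    """
    with open(src, newline='', encoding='utf-8') as fin, \
         open(dst, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.reader(fin, delimiter='\t')
        writer = csv.writer(fout, delimiter='\t', lineterminator='\n')
        for i, row in enumerate(reader):
            if i > n_rows:  # row 0 is the header, rows 1..n_rows are data
                break
            writer.writerow(row)
    return dst

# Possible usage (assumed file names; the path variables would then
# point at the shortened copies rather than the originals):
# rag_output_path = truncate_tsv('nq_unlabeled_output.tsv',
#                                'nq_unlabeled_output_small.tsv', 20)
```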

# 1. Synthetic query generation and classifier training

Note: as mentioned at the start, the code in section 1 is very computationally demanding. The free Google Colab GPU is not enough to run it.

In [None]:
from ares import ARES

# 1.1 Generate synthetic queries

In [None]:
synthetic_queries_output_path = 'synthetic_queries_1.tsv'

In [None]:
synth_config = {
    "document_filepaths": [human_preference_dataset_path],  # human preference validation set
    "few_shot_prompt_filenames": [prompts_for_synthetic_query],  # few-shot examples of in-domain queries and answers
    "synthetic_queries_filenames": [synthetic_queries_output_path],  # synthetic queries will be saved here
    "documents_sampled": documents_sampled  # number of documents
}

ares_module = ARES(synthetic_query_generator=synth_config)
results_synthetic = ares_module.generate_synthetic_data()

In [None]:
results_synthetic

# 1.2. Train ARES's classifier for Context Relevance

In [None]:
classifier_config = {
    "training_dataset": [synthetic_queries_output_path],  # synthetic data generated in stage 1
    "validation_set": [human_preference_dataset_path],  # human-annotated validation dataset
    "label_column": ["Context_Relevance_Label"],  # here we specify that we are interested in a context relevance judge
    "num_epochs": 10,  # the rest are hyperparameters for the model.
    "patience_value": 3,
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,
    "gradient_accumulation_multiplier": 32,
}

ares = ARES(classifier_model=classifier_config)
results_clf = ares.train_classifier()
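
To give a feel for how the hyperparameters above interact, here is a small sketch. It assumes the conventional meaning of the names: `gradient_accumulation_multiplier` is the number of mini-batches accumulated before each optimizer step, and `patience_value` is the number of consecutive epochs without validation improvement that are tolerated before training stops. ARES's internals may differ, and the validation losses below are made up for illustration.

```python
def effective_batch_size(batch_size, accumulation_steps):
    """Number of examples contributing to one optimizer step (assumed semantics)."""
    return batch_size * accumulation_steps

def epochs_run(val_losses, patience):
    """Return how many epochs run under patience-based early stopping."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop early
    return len(val_losses)

print(effective_batch_size(1, 32))  # 32
# Illustrative losses: improvement stops after epoch 2, so training
# halts at epoch 5 and never reaches the better epoch 6.
print(epochs_run([0.9, 0.7, 0.71, 0.72, 0.73, 0.6], patience=3))  # 5
```

Under these assumptions, a batch size of 1 keeps GPU memory usage low while the accumulation still gives each update the statistical weight of 32 examples.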

In [None]:
results_clf

# 2. UES/IDP and PPI scores for RAG model outputs

Note: By default, ARES uses GPT3.5-turbo as the LLM judge. If you have an OpenAI API key and wish to use it, change the code below according to ARES's documentation. Otherwise, we have configured ARES to use the free `allenai/OLMo-1B-hf` model through vLLM. This lowers the quality of the results, but it allows the code to be run for free.

You should be able to run the code in this section in a reasonable time using Colab's free GPU resources, provided you are using the shortened versions of the datasets.

In [None]:
!nohup vllm serve "allenai/OLMo-1B-hf" &

nohup: appending output to 'nohup.out'


In [None]:
# Run this to make sure that the model is already running.
# If not, wait a bit and retry.
# The model can take a few minutes to initialise on the first startup
# In case of issues look inside the nohup.out file
!curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{"model": "allenai/OLMo-1B-hf", "messages": [{"role": "user", "content": "Hello!"}]}'

{"id":"cmpl-4ac78e8d820b4236be03c80243b7e8a3","object":"chat.completion","created":1731777566,"model":"allenai/OLMo-1B-hf","choices":[{"index":0,"message":{"role":"assistant","content":"Nothing much to report. I was in the West End finishing up some laundry.\nI keep forgetting to post about this, but I was on the West End.\nI was on the West End.\nIt was interesting to see how some people reacted to the news that I was going to be leaving the West End. Some people were excited, some were sad, some told me not to go. The answer was always the same. I was doing my bit.\nI was doing my bit.\nThe West End is not the same place it used to be. It is not the same place it was when I started here. I have seen it change. I watched a lot of shows that were on when I started here and saw them only in reruns afterwards. A lot of the shows that I loved when I started here are no longer here. It was a new era for the West End. It is a new era for me.\nI was on the West End. I was doing my bit.\nI di

In [None]:
model_name = 'allenai/OLMo-1B-hf'
model_url = 'http://localhost:8000/v1'
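
Instead of rerunning the `curl` check by hand, the wait can be automated. The sketch below polls the server until it answers. It assumes the vLLM server exposes the OpenAI-compatible `/v1/models` endpoint on the default port; the function names and the retry settings are our own choices, not part of ARES or vLLM.

```python
import time
import urllib.error
import urllib.request

def server_ready(url, timeout=5):
    """Return True if a GET request to url succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False

def wait_for_server(url, attempts=30, delay=10,
                    probe=server_ready, sleep=time.sleep):
    """Probe url up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False

# Possible usage (assumed endpoint):
# if not wait_for_server('http://localhost:8000/v1/models'):
#     print('Server did not start; look inside nohup.out')
```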

# 2.1 Compute UES/IDP scores

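These are the direct scores for each metric: the LLM judge is prompted with the in-domain few-shot examples (`in_domain_prompts_dataset`) and scores the examples in the unlabeled evaluation set (`unlabeled_evaluation_set`).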

In [None]:
# ~ 3min 30s
from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": few_shot_examples_path,
    "unlabeled_evaluation_set": rag_output_path,  # output of RAG we wish to evaluate
    "model_choice" : model_name,
    "vllm": True,
    "host_url": model_url,
}

ares = ARES(ues_idp=ues_idp_config)
results_ues_idp = ares.ues_idp()

Evaluating large subset with allenai/OLMo-1B-hf:   0%|          | 0/20 [00:00<?, ?it/s]

Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Didn't extract Yes or No!
Number of times did not extract Yes or No: 43


{'Context Relevance Scores': 0.35,
 'Answer Faithfulness Scores': 0.15,
 'Answer Relevance Scores': 0.2}

In [None]:
results_ues_idp

{'Context Relevance Scores': 0.35,
 'Answer Faithfulness Scores': 0.15,
 'Answer Relevance Scores': 0.2}
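
The repeated `Didn't extract Yes or No!` messages suggest that the judge's free-text reply is searched for a Yes/No verdict, and that a small model such as OLMo-1B often does not produce one. The sketch below illustrates the idea; it is not ARES's actual parsing code. In particular, counting an unparseable reply as a "No" is an assumption made here for illustration.

```python
import re

_VERDICT = re.compile(r'\b(yes|no)\b', re.IGNORECASE)

def extract_yes_no(reply):
    """Return 1 for Yes, 0 for No, or None if neither word appears."""
    match = _VERDICT.search(reply)
    if match is None:
        return None
    return 1 if match.group(1).lower() == 'yes' else 0

def judge_score(replies):
    """Fraction of replies judged Yes; unparseable replies count as No (assumption)."""
    verdicts = [extract_yes_no(r) or 0 for r in replies]
    return sum(verdicts) / len(verdicts)

replies = [
    '[[Yes]]',
    'No, the document does not answer the question.',
    'I was on the West End. I was doing my bit.',  # no verdict at all
    'Yes',
]
print(judge_score(replies))  # 0.5
```

With many unparseable replies, scores computed this way say more about the judge's ability to follow the output format than about the RAG system being evaluated.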

# 2.2 Compute PPI scores

Due to a bug in the ARES package (see https://github.com/stanford-futuredata/ARES/issues/67), we cannot use a custom LLM judge for calculating PPI scores. As a workaround, one can download the checkpoint provided by the authors:

https://drive.google.com/file/d/15poFyeoqdnaNZVjl41HllL2213DKyZjH/view

then upload it to Colab. However, this only works with the default model (GPT3.5-turbo), so an OpenAI API key is needed.
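
For context, prediction-powered inference (PPI) combines the judge's predictions on the large unlabeled set with a small human-labeled set that is used to correct the judge's bias. The sketch below implements the standard PPI estimator for a mean with a normal-approximation confidence interval. The labels are toy values made up for illustration, and ARES's implementation may differ in its details.

```python
from math import sqrt
from statistics import mean, variance

def ppi_mean(preds_unlabeled, preds_labeled, gold_labeled, z=1.96):
    """PPI estimate of the true mean label, with an approximate 95% interval.

    preds_unlabeled: judge predictions on the unlabeled evaluation set
    preds_labeled:   judge predictions on the human-labeled set
    gold_labeled:    human labels for the same labeled examples
    """
    # The rectifier measures how far the judge is off on labeled data.
    rectifier = [y - f for y, f in zip(gold_labeled, preds_labeled)]
    estimate = mean(preds_unlabeled) + mean(rectifier)
    half_width = z * sqrt(
        variance(preds_unlabeled) / len(preds_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    return estimate, (estimate - half_width, estimate + half_width)

# Toy data: the judge says "relevant" for 5 of 8 unlabeled examples,
# but on the labeled set it was too optimistic in 1 of 4 cases.
estimate, interval = ppi_mean(
    preds_unlabeled=[1, 1, 0, 1, 0, 1, 1, 0],
    preds_labeled=[1, 1, 0, 1],
    gold_labeled=[1, 0, 0, 1],
)
print(estimate)  # 0.375
```

The raw judge average here is 0.625; the rectifier of -0.25 pulls the estimate down to 0.375, and the interval widens when the labeled set is small.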

In [None]:
import os

# filename valid as of 16/11/2024 - change if needed
checkpoint_filename = 'ares_context_relevance_general_checkpoint_V1.1.pt'

# check if the checkpoint is available in current directory
if os.path.isfile(checkpoint_filename):
    checkpoints = [checkpoint_filename]
    # check for OpenAI api key
    try:
        os.environ['OPENAI_API_KEY']
    except KeyError:
        print('Checkpoint file detected, but OpenAI API Key is not set, so the code below will not run.')
    else:
        print('Checkpoint file detected and the OpenAI API Key is set. The code below should run with no problems :)')
else:
    print('No checkpoints detected. Unless ARES fixed their bug in the meantime, the code below will fail at the very end of execution.')
    checkpoints = []

No checkpoints detected. Unless ARES fixed their bug in the meantime, the code below will fail at the very end of execution.


In [None]:
# ~ 30 min
from ares import ARES

ppi_config = {
    "checkpoints": checkpoints,  # checkpoints for judges training
    "rag_type": "question_answering",  # type of RAG we evaluate
    "evaluation_datasets": [rag_output_path],  # RAG outputs we evaluate
    "few_shot_examples_filepath": few_shot_examples_path,
    "gold_label_paths": [human_preference_dataset_path],  # validation dataset of human preferences
    "labels": ["Context_Relevance_Label"],  # We calculate PPI for context relevance
}
if len(checkpoints) == 0:
    # specify the self-hosted model if no checkpoints are provided
    ppi_config['llm_judge'] = model_name
    ppi_config["vllm"] = True
    ppi_config["host_url"] = model_url

ares = ARES(ppi=ppi_config)
results_ppi = ares.evaluate_RAG()

--------------------------------------------------------
Evaluation Sets: ['nq_unlabeled_output.tsv']
Checkpoints: []
Labels: ['Context_Relevance_Label']
--------------------------------------------------------
Loaded API model based on model identifier: allenai/OLMo-1B-hf
Performing Model scoring!


  0%|          | 0/16 [00:00<?, ?it/s]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 3 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 4 failed with error: 'int' object is not subscriptable


 12%|█▎        | 2/16 [04:17<30:01, 128.68s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 3 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 4 failed with error: 'int' object is not subscriptable


 19%|█▉        | 3/16 [08:22<38:22, 177.10s/it]

Didn't extract Yes or No!
All attempts failed. Last error was: 'int' object is not subscriptable


 25%|██▌       | 4/16 [08:31<22:55, 114.64s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable


 31%|███▏      | 5/16 [10:47<22:20, 121.86s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable


 50%|█████     | 8/16 [13:17<08:23, 62.94s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable


 56%|█████▋    | 9/16 [14:25<07:30, 64.42s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable


 75%|███████▌  | 12/16 [15:59<02:32, 38.05s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 3 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 4 failed with error: 'int' object is not subscriptable


 81%|████████▏ | 13/16 [20:03<05:00, 100.31s/it]

Didn't extract Yes or No!
All attempts failed. Last error was: 'int' object is not subscriptable


 88%|████████▊ | 14/16 [20:14<02:26, 73.38s/it] 

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 3 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 4 failed with error: 'int' object is not subscriptable


 94%|█████████▍| 15/16 [24:19<02:04, 124.93s/it]

Didn't extract Yes or No!
All attempts failed. Last error was: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable


100%|██████████| 16/16 [25:32<00:00, 109.43s/it]

Didn't extract Yes or No!
Attempt 1 failed with error: 'int' object is not subscriptable
Didn't extract Yes or No!
Attempt 2 failed with error: 'int' object is not subscriptable


100%|██████████| 16/16 [27:39<00:00, 103.73s/it]


AttributeError: 'numpy.ndarray' object has no attribute 'nelement'

In [None]:
results_ppi