## RAGBench

*Robert Friel et al. [RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2407.11005). 25.06.24*

> It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals.
>
>**Source Domains**
> - bio-medical research (PubmedQA, CovidQA), 
> - general knowledge (HotpotQA, MS Marco, HAGRID, ExpertQA), 
> - legal contracts (CuAD), 
> - customer support (DelucionQA, EManual, TechQA), 
> - finance (FinBench, TAT-QA)
>
><p style="color: red;">RAGBench component datasets contain between 1% and 20% hallucinations</p>

## Customer support Domain

### DelucionQA

> DelucionQA [33] is a curated collection of user queries on the operation of Jeep’s 2023 Gladiator model.
>
> **Document source**: Jeep manual
>
> **Question source**: LLM 
>
> **Num docs**: 3 (unclear: there seems to be a single manual throughout, and only the retrievers differ)
>
> **Average doc length in tokens**: 296 
>
> **Train, dev, test size**: (1.5k, 177, 182); the train split is much smaller after cleaning

QA pairs were generated with `gpt-3.5-turbo-0125` and annotated with `gpt-4-turbo-2024-04-09`.

<p style="color: blue;">Since this dataset contains hallucinations, I would rather not touch it for now. It is very hard to judge how good the given answers and text fragments really are. Another drawback of the original dataset: the Jeep manual itself has to be parsed separately to obtain the context, so more time has to be spent on the data.</p>

In [1]:
from datasets import load_dataset
import pandas as pd

import requests
import io

In [None]:
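# Download the train split of the original DelucionQA dataset.
response = requests.get('https://raw.githubusercontent.com/boschresearch/DelucionQA/refs/heads/main/data/DelucionQA_final/train.csv')
response.raise_for_status()  # fail early if the download did not succeed
initial = pd.read_csv(io.BytesIO(response.content))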

In [69]:
initial['Retreival Setting'].value_counts()

Retreival Setting
Adaptive Ensemble Search     343
Ensemble Retriever (Base)    328
Lucene Search                323
Ensemble Retriever           157
Name: count, dtype: int64

In [97]:
# Keep answerable, non-hallucinated rows, then drop repeated (Question, Answer) pairs.
clean_rows = initial.query('`Label` == "Not Hallucinated" and `Answerable` == True')
non_duplicated_non_hallucinated_answers = clean_rows.drop_duplicates(subset=['Question', 'Answer'])

In [113]:
non_duplicated_non_hallucinated_answers[non_duplicated_non_hallucinated_answers.Question == 'How many batteries does the Stop/Start system need?'].Answer.values

array(['The Stop/Start system requires two batteries - a main battery and a supplemental battery.'],
      dtype=object)

In [54]:
not_hallucinated_questions = initial.query('`Label` == "Not Hallucinated" and `Answerable` == True').Question.unique()

In [2]:
delucion_qa = load_dataset('rungalileo/ragbench', 'delucionqa')

In [5]:
delucion_qa_train = pd.DataFrame(delucion_qa['train'])

In [9]:
delucion_qa_train.columns

Index(['id', 'question', 'documents', 'response', 'generation_model_name',
       'annotating_model_name', 'dataset_name', 'documents_sentences',
       'response_sentences', 'sentence_support_information',
       'unsupported_response_sentence_keys', 'adherence_score',
       'overall_supported_explanation', 'relevance_explanation',
       'all_relevant_sentence_keys', 'all_utilized_sentence_keys',
       'trulens_groundedness', 'trulens_context_relevance',
       'ragas_faithfulness', 'ragas_context_relevance', 'gpt3_adherence',
       'gpt3_context_relevance', 'gpt35_utilization', 'relevance_score',
       'utilization_score', 'completeness_score'],
      dtype='object')

In [17]:
delucion_qa_train[['question', 'documents', 'response', 
                   'relevance_score','utilization_score', 
                   'completeness_score']].sort_values(by=['utilization_score', 'completeness_score', 'relevance_score'], ascending=False)[:25]

| | question | documents | response | relevance_score | utilization_score | completeness_score |
|---|---|---|---|---|---|---|
| 133 | what should I pay attention to when the engine... | [ Engine Warm-Up Avoid full throttle operation... | When the engine is cold, you should avoid full... | 1.0 | 1.0 | 1.0 |
| 914 | Can I mannually turn on and turn off the stop/... | [ To Manually Turn On The Stop/Start System Pu... | Yes, based on the context provided, you can ma... | 1.0 | 1.0 | 1.0 |
| 1366 | is there an indicator light for cruise control? | [ Cruise control set indicator light — if equi... | Yes, based on the context provided, there are ... | 1.0 | 1.0 | 1.0 |
| 1335 | when will the cruise indicator light be turned... | [ Cruise Control Ready Indicator Light — If Eq... | Based on the given context, the cruise indicat... | 0.833333 | 1.0 | 1.0 |
| 1241 | What should I do for off-road driving? | [ Off-Road Driving Tips, The Basics Of Off-Ro... | Based on the provided context, here are the ke... | 0.97561 | 0.95122 | 0.975 |
| 1358 | what does the N position do? | [ N (Neutral) N (Neutral) — This range disenga... | Based on the provided context, the N (Neutral)... | 1.0 | 0.928571 | 0.928571 |
| 843 | any tips to drive safely? | [ Off-Road Driving Tips, DRIVING TIPS Off-Roa... | Based on the given context, here are the key t... | 0.923077 | 0.923077 | 1.0 |
| 1025 | Any instructions for the refueling process? | [ See the following steps for refueling: The v... | Based on the provided context, here are the in... | 0.909091 | 0.909091 | 1.0 |
| 1232 | any tips or suggestions for refueling the car? | [ See the following steps for refueling: The v... | Based on the context provided, here are some t... | 0.909091 | 0.909091 | 1.0 |
| 1103 | Any tips for Engine Oil Filter replacement? | [ Engine Oil Filter Selection A full-flow type... | Based on the provided context, here are some t... | 0.882353 | 0.882353 | 0.933333 |


In [14]:
delucion_qa_train.adherence_score.describe(), delucion_qa_train.adherence_score.value_counts()

(count     1458
 unique       2
 top       True
 freq      1366
 Name: adherence_score, dtype: object,
 adherence_score
 True     1366
 False      92
 Name: count, dtype: int64)

In [24]:
delucion_qa_train.query('`adherence_score` == True').all_relevant_sentence_keys[0]

['0c', '0e', '0f', '1e', '1g', '1o', '2d', '2f', '2g']
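
The keys in the output above look like a document index followed by a sentence letter. That reading is an assumption based on the `documents_sentences` column, not something the notebook confirms. Under that assumption, a minimal sketch for grouping the relevant sentences by document could look like this (`group_sentence_keys` is a hypothetical helper):

```python
import re
from collections import defaultdict

# Assumption: a key is "<document index><sentence letters>", e.g. "1o".
KEY_PATTERN = re.compile(r"^(\d+)([a-z]+)$")


def group_sentence_keys(keys):
    """Group sentence keys by the (assumed) document index they start with."""
    grouped = defaultdict(list)
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            raise ValueError(f"unexpected sentence key: {key!r}")
        grouped[int(match.group(1))].append(match.group(2))
    return dict(grouped)


# Keys copied from the cell output above.
keys = ['0c', '0e', '0f', '1e', '1g', '1o', '2d', '2f', '2g']
print(group_sentence_keys(keys))
# {0: ['c', 'e', 'f'], 1: ['e', 'g', 'o'], 2: ['d', 'f', 'g']}
```

For this example every one of the three documents contributes three relevant sentences.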

In [59]:
not_hallucinated_questions = list(not_hallucinated_questions)

In [63]:
delucion_qa_train.query('`question` in @not_hallucinated_questions').adherence_score.value_counts()

adherence_score
True     660
False     37
Name: count, dtype: int64

In [110]:
(delucion_qa_train.query('`question` in @not_hallucinated_questions').question[0],
delucion_qa_train.query('`question` in @not_hallucinated_questions').response[0])

('How many batteries does the Stop/Start system need?',
 'The Stop/Start system needs two batteries.')

In [118]:
for q in not_hallucinated_questions: 
    print(delucion_qa_train.query('`question` == @q')[['question', 'response', 'relevance_score', 'adherence_score']].values)
    break

[['Why do I have to apply the parking brake fully for parking?'
  'You need to apply the parking brake fully when parking to guard against vehicle movement and possible injury or damage. This is important because failure to do so could cause the vehicle to roll and cause injury or damage. Additionally, fully applying the parking brake ensures that the transmission is locked in place, securing the vehicle against unwanted movement.'
  0.1267605633802817 False]
 ['Why do I have to apply the parking brake fully for parking?'
  0.056338028169014086 True]]
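
The output above shows one question with two responses that received different `adherence_score` values. A rough sketch for listing all such questions is given below. It is written without pandas so it runs on plain pairs; `questions_with_mixed_adherence` and the second toy question are hypothetical.

```python
from collections import defaultdict


def questions_with_mixed_adherence(rows):
    """Return questions that have both adherent and non-adherent responses.

    `rows` is an iterable of (question, adherence_score) pairs.
    """
    flags_by_question = defaultdict(set)
    for question, adherent in rows:
        flags_by_question[question].add(bool(adherent))
    return sorted(q for q, flags in flags_by_question.items() if flags == {True, False})


# Toy rows: the first question mirrors the output above, the second is made up.
rows = [
    ("Why do I have to apply the parking brake fully for parking?", False),
    ("Why do I have to apply the parking brake fully for parking?", True),
    ("Hypothetical question with a single response", True),
]
print(questions_with_mixed_adherence(rows))
# ['Why do I have to apply the parking brake fully for parking?']
```

In the notebook the pairs could be built with `zip(delucion_qa_train.question, delucion_qa_train.adherence_score)`.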


In [106]:
len(not_hallucinated_questions)

427

In [104]:
delucion_qa_train.question.unique().shape

(730,)
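
The two counts above (427 clean questions in the original DelucionQA train file, 730 unique questions in the RAGBench train split) invite a direct overlap check. The sketch below shows one way to do it with plain sets. The `normalize` step (collapsing whitespace and lowercasing) is an assumption of this sketch; the notebook itself compares questions verbatim.

```python
def normalize(question):
    """Hypothetical normalization: collapse whitespace and lowercase."""
    return " ".join(question.split()).lower()


def question_overlap(original_questions, ragbench_questions):
    """Count questions shared by, and exclusive to, the two collections."""
    original = {normalize(q) for q in original_questions}
    ragbench = {normalize(q) for q in ragbench_questions}
    return {
        "shared": len(original & ragbench),
        "only_original": len(original - ragbench),
        "only_ragbench": len(ragbench - original),
    }


# Toy data; only the first question comes from the notebook.
original = ["How many batteries does the Stop/Start system need?", "Hypothetical question A"]
ragbench = ["how many batteries does the stop/start system need? ", "Hypothetical question B"]
print(question_overlap(original, ragbench))
# {'shared': 1, 'only_original': 1, 'only_ragbench': 1}
```

In the notebook the call would be `question_overlap(not_hallucinated_questions, delucion_qa_train.question.unique())`.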

### EManual

*Abhilash Nandy et al. [Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework](https://arxiv.org/abs/2109.05897). 13.09.2021*

> **Document source**: TV manual
>
> **Question source**: annotator 
>
> **Num docs**: 3 (unclear: the original seems to contain many manuals)
>
> **Average doc length in tokens**: 165 
>
> **Train, dev, test size**: (1k, 132, 132)

Original GitHub - https://github.com/abhi1nandy2/EMNLP-2021-Findings/tree/main/data (contains the questions but not the manuals themselves; those have to be parsed from links, and it is unclear whether the links still work)

Manual corpus - https://drive.google.com/drive/folders/1-gX1DlmVodP6OVRJC3WBRZoGgxPuJvvt

> The QA dataset of the **Samsung Smart TV manual** is used to sanitize a community-based question answering dataset described next. Questions are extracted from question answering forum (where well-formed answers are available) of the **different Samsung Smart TV models** sold on amazon. Annotators are asked to certify whether a question is answerable by solely using the E-Manual of the product. The dataset has a total of 3000 such questions, out of which 1028 are certified as answerable.

<p style="color: blue;">Judging by the repository, the documents will have to be parsed. On the other hand, they may already be included in the corpus, but that is not stated clearly. About 500 unique questions with good answers can be extracted from RAGBench.</p>

In [120]:
emanual = load_dataset('rungalileo/ragbench', 'emanual')

Loaded splits: train 1054 examples, validation 132 examples, test 132 examples.

In [None]:
emanual_train = pd.DataFrame(emanual['train'])

In [124]:
emanual_train.adherence_score.value_counts()

adherence_score
True     901
False    153
Name: count, dtype: int64

In [134]:
emanual_train.query('`adherence_score` == True').question.unique().shape

(491,)
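
Since several adherent rows can share one question, one way to get a single "good" answer per question is to keep the best-scored row. The sketch below is written without pandas and works on plain records. The ranking order (relevance, then utilization, then completeness) and the helper name `best_response_per_question` are assumptions of this sketch, and it presumes the three scores are present on every row.

```python
SCORE_COLUMNS = ("relevance_score", "utilization_score", "completeness_score")


def best_response_per_question(records, score_columns=SCORE_COLUMNS):
    """Keep, for each question, the adherent record with the highest scores."""
    best = {}
    for record in records:
        if not record["adherence_score"]:
            continue  # skip responses flagged as non-adherent
        key = tuple(record[column] for column in score_columns)
        current = best.get(record["question"])
        if current is None or key > current[0]:
            best[record["question"]] = (key, record)
    return [record for _, record in best.values()]


# Hypothetical toy records using the RAGBench column names.
records = [
    {"question": "q1", "response": "a", "adherence_score": True,
     "relevance_score": 0.5, "utilization_score": 0.5, "completeness_score": 1.0},
    {"question": "q1", "response": "b", "adherence_score": True,
     "relevance_score": 0.9, "utilization_score": 0.4, "completeness_score": 1.0},
    {"question": "q2", "response": "c", "adherence_score": False,
     "relevance_score": 1.0, "utilization_score": 1.0, "completeness_score": 1.0},
]
picked = best_response_per_question(records)
print([record["response"] for record in picked])
# ['b']
```

In the notebook the input could be `emanual_train.to_dict('records')`. The result should then hold one record per adherent question, which matches the count in the cell above.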

### TechQA

*Vittorio Castelli et al. [The TechQA Dataset](https://arxiv.org/abs/1911.02984). 08.11.2019*

> **Document source**: Tech notes
>
> **Question source**: tech forums
>
> **Num docs**: 5
>
> **Average doc length in tokens**: 1.8k 
>
> **Train, dev, test size**: (1.2k, 302, 310)

Original HF - https://huggingface.co/datasets/PrimeQA/TechQA/blob/main/TechQA.tar.gz (~3 GB)

Original GitHub - https://github.com/IBM/techqa

In [2]:
techqa = load_dataset('rungalileo/ragbench', 'techqa')

Loaded splits: train 1192 examples, validation 304 examples, test 314 examples.

In [3]:
techqa_train = pd.DataFrame(techqa['train'])

In [5]:
techqa_train.adherence_score.value_counts()

adherence_score
True     746
False    446
Name: count, dtype: int64
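
To compare the three customer support train splits with the 1% to 20% hallucination range quoted at the top, the counts from the `value_counts()` outputs in this notebook can be turned into shares. Treating a non-adherent response (`adherence_score == False`) as a hallucination is an assumption of this sketch, and `non_adherent_share` is a hypothetical helper.

```python
# (adherent, non-adherent) counts copied from the value_counts() outputs in this notebook.
TRAIN_COUNTS = {
    "delucionqa": (1366, 92),
    "emanual": (901, 153),
    "techqa": (746, 446),
}


def non_adherent_share(adherent, non_adherent):
    """Fraction of responses whose adherence_score is False."""
    return non_adherent / (adherent + non_adherent)


for name, (adherent, non_adherent) in TRAIN_COUNTS.items():
    print(f"{name}: {non_adherent_share(adherent, non_adherent):.1%}")
# delucionqa: 6.3%
# emanual: 14.5%
# techqa: 37.4%
```

Under this reading, the TechQA train split sits well above the quoted range, while DelucionQA and EManual fall inside it.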

In [7]:
techqa_train.query('`adherence_score` == True').question.unique().shape

(460,)

In [11]:
# Sort adherent rows by score, then show repeated questions (every occurrence after the first in sort order).
adherent_sorted = techqa_train.query('`adherence_score` == True').sort_values(
    by=['relevance_score', 'utilization_score', 'completeness_score'], ascending=False)
adherent_sorted[adherent_sorted.question.duplicated()]

| | id | question | documents | response | generation_model_name | annotating_model_name | dataset_name | documents_sentences | response_sentences | sentence_support_information | ... | trulens_groundedness | trulens_context_relevance | ragas_faithfulness | ragas_context_relevance | gpt3_adherence | gpt3_context_relevance | gpt35_utilization | relevance_score | utilization_score | completeness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 510 | techqa_TRAIN_Q331 | mirandabarby I have purchased a grad pack of S... | [SPSS Support Acquisition Statistics Stats Gra... | To enter the authorization code to activate/li... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, SPSS Support Acquisition Statistics Sta... | [[a, To enter the authorization code to activa... | [{'explanation': 'This statement is directly s... | ... | | | | | | | | 0.263736 | 0.175824 | 0.666667 |
| 549 | techqa_TRAIN_Q014 | NMA agent installation failure \n\nHello, I'm ... | [ITM Virtual environments install NMA N4 unabl... | Yes, the installation of the NMA agent from th... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, ITM Virtual environments install NMA N4... | [[a, Yes, the installation of the NMA agent fr... | [{'explanation': 'This statement is supported ... | ... | | | | | | | | 0.200837 | 0.046025 | 0.229167 |
| 1129 | techqa_TRAIN_Q595 | How can I configure a DataPower MQ client to u... | [datapower; mq; websphere mq; ssl; mutual auth... | Based on the provided context, here are the st... | claude-3-haiku-20240307 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, datapower; mq; websphere mq; ssl; mutua... | [[a, Based on the provided context, here are t... | [{'explanation': 'General introduction to the ... | ... | | | | | | | | 0.200000 | 0.200000 | 0.987179 |
| 111 | techqa_TRAIN_Q192 | What are the recommended configuration setting... | [tsa; tsamp; db2; hadr; ha; policy; checker; d... | 1. Ensure that the Windows cluster uses the bu... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, tsa; tsamp; db2; hadr; ha; policy; chec... | [[a, 1. Ensure that the Windows cluster uses t... | [{'explanation': 'This is directly supported b... | ... | | | | | | | | 0.200000 | 0.164286 | 0.428571 |
| 487 | techqa_TRAIN_Q422 | Help with Security Bulletin: Vulnerability in ... | [WTX security bulletin transformation extender... | You can find detailed information regarding th... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, WTX security bulletin transformation ex... | [[a, You can find detailed information regardi... | [{'explanation': 'This statement is supported ... | ... | | | | | | | | 0.196809 | 0.063830 | 0.324324 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 721 | techqa_TRAIN_Q004 | Help with Action required for IIB H.E. V9 & WM... | [ FLASH (ALERT)\n\nABSTRACT\n IBM Integration ... | Based on the provided context, there is no inf... | claude-3-haiku-20240307 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, FLASH (ALERT) ABSTRACT], [0b, IBM Int... | [[a, Based on the provided context, there is n... | [{'explanation': 'This statement accurately re... | ... | | | | | | | | 0.000000 | 0.000000 | 1.000000 |
| 727 | techqa_TRAIN_Q532 | Does Tivoli Monitoring NetApp monitoring agent... | [ A FIX IS AVAILABLE\nIBM Tivoli Monitoring f... | There is no mention of a Tivoli Monitoring Net... | claude-3-haiku-20240307 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, A FIX IS AVAILABLE], [0b, IBM Tivoli ... | [[a, There is no mention of a Tivoli Monitorin... | [{'explanation': 'This statement is verified b... | ... | | | | | | | | 0.000000 | 0.000000 | 1.000000 |
| 929 | techqa_TRAIN_Q098 | Restore JazzSM DASH login page to default imag... | [DASHL2; validate; validation; credentials; pr... | Based on the context provided, there is no inf... | claude-3-haiku-20240307 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, DASHL2; validate; validation; credentia... | [[a, Based on the context provided, there is n... | [{'explanation': 'The response correctly notes... | ... | | | | | | | | 0.000000 | 0.000000 | 1.000000 |
| 1141 | techqa_TRAIN_Q058 | Portal v8.5 install fails with INSTCONFFAILED ... | [TFIM; unparseable; date; exception; installat... | Based on the given context, the question does ... | claude-3-haiku-20240307 | gpt-4-turbo-2024-04-09 | techqa_train | [[[0a, TFIM; unparseable; date; exception; ins... | [[a, Based on the given context, the question ... | [{'explanation': 'This sentence correctly asse... | ... | | | | | | | | 0.000000 | 0.000000 | 1.000000 |
