## RAGBench

*Robert Friel, et al. [RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2407.11005). 25.06.24*

> It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals.
>
>**Source Domains**
> - bio-medical research (PubmedQA, CovidQA), 
> - general knowledge (HotpotQA, MS Marco, HAGRID, ExperQA), 
> - legal contracts (CuAD), 
> - customer support (DelucionQA, EManual, TechQA), 
> - finance (FinBench, TAT-QA)
>
><p style="color: red;">RAGBench component datasets contain between 1% and 20% hallucinations</p>

## Bio-medical research Domain

### PubmedQA

> *Qiao Jin, et al. [PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/pdf/1909.06146). 13.09.2019*
>
> **Document source**: research abstracts
>
> **Question source**: automated heuristics
>
> **Num docs**: 4
>
> **Average doc length in tokens**: 99 
>
> **Train, dev, test size**: (19.5k, 2.5k, 2.5k) 
>
> PubMedQA is split into three subsets: labeled, unlabeled and artificially generated.

<p style="color: blue;">Only yes/no questions are included. They are split into three groups: 1. labeled, 2. unlabeled, 3. generated (ELMO + BERT). Questions are mostly taken from the article title and abstract. The answer is taken from the abstract. Annotators rated the correctness of the answer as yes/no/maybe. The context is given as chunks. The title is not provided, but it is essentially the question.</p>

In [1]:
from datasets import load_dataset
import pandas as pd

import requests
import io

In [10]:
pubmedqa = load_dataset('rungalileo/ragbench', 'pubmedqa')

Loaded splits: train 19600, validation 2450, test 2450 examples.

In [11]:
pubmedqa_train = pd.DataFrame(pubmedqa['train'])

In [12]:
# Count training examples whose response is flagged as not adhering to the retrieved context
sum(pubmedqa_train.adherence_score == False)

4803
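
The count above can be turned into a rate and compared with the 1% to 20% range quoted from the paper. This is a minimal sketch, assuming every row exposes a boolean `adherence_score` field as in the cells above. The helper name `non_adherence_rate` is hypothetical, and skipping rows with a missing score is an assumption, not something the dataset card prescribes.

```python
def non_adherence_rate(rows):
    """Share of rows whose `adherence_score` is False.

    `rows` is any iterable of dict-like records, e.g. pubmedqa['train'].
    Rows where the score is missing (None) are ignored (an assumption).
    """
    scored = [row["adherence_score"] for row in rows
              if row.get("adherence_score") is not None]
    if not scored:
        return 0.0
    return sum(1 for score in scored if not score) / len(scored)


# Toy records standing in for dataset rows (not real data).
toy_rows = [
    {"adherence_score": True},
    {"adherence_score": False},
    {"adherence_score": True},
    {"adherence_score": None},
]
print(non_adherence_rate(toy_rows))
```

For the PubmedQA train split, 4803 flagged rows out of 19600 examples is about 24.5%.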

In [39]:
pubmedqa_orig = load_dataset('qiaojin/PubMedQA', 'pqa_artificial')

Loaded split: train 211269 examples.

In [41]:
pubmedqa_orig_train = pd.DataFrame(pubmedqa_orig['train'])

In [43]:
pubmedqa_orig_train.head()

| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 25429730 | Are group 2 innate lymphoid cells ( ILC2s ) in... | {'contexts': ['Chronic rhinosinusitis (CRS) is... | As ILC2s are elevated in patients with CRSwNP,... | yes |
| 1 | 25433161 | Does vagus nerve contribute to the development... | {'contexts': ['Phosphatidylethanolamine N-meth... | Neuronal signals via the hepatic vagus nerve c... | yes |
| 2 | 25445714 | Does psammaplin A induce Sirtuin 1-dependent a... | {'contexts': ['Psammaplin A (PsA) is a natural... | PsA significantly inhibited MCF-7/adr cells pr... | yes |
| 3 | 25431941 | Is methylation of the FGFR2 gene associated wi... | {'contexts': ['This study examined links betwe... | We identified a novel biologically plausible c... | yes |
| 4 | 25432519 | Do tumor-infiltrating immune cell profiles and... | {'contexts': ['Tumor microenvironment immunity... | Breast cancer immune cell subpopulation profil... | yes |


In [44]:
pubmedqa_orig_train.context[0]

{'contexts': ['Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated.',
  'The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease.',
  'A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions through flow cytometry. ILC2 frequencies, measured as a percentage of CD45(+) cells, were compared across CRS phenotype, endo
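
Since the original PubMedQA `context` stores the abstract as a list of sections under the `contexts` key, a single passage can be rebuilt by joining them. This is a minimal sketch: the helper name `join_contexts` is hypothetical, and any other keys in the dict are simply ignored.

```python
def join_contexts(context):
    """Concatenate the abstract sections stored under the 'contexts' key."""
    return " ".join(part.strip() for part in context["contexts"])


# Shortened stand-in for one `context` value (not real data).
toy_context = {
    "contexts": ["Background sentence. ", "Aim of the study.", " Methods used."]
}
full_text = join_contexts(toy_context)
print(full_text)
```

With the DataFrame above, the same helper could be applied as `pubmedqa_orig_train.context.apply(join_contexts)`.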

### CovidQA

*Timo Möller, et al. [COVID-QA: A Question Answering Dataset for COVID-19](https://aclanthology.org/2020.nlpcovid19-acl.18.pdf). 07.2020.*

> **Document source**: research papers
>
> **Question source**: expert 
>
> **Num docs**: 4
>
> **Average doc length in tokens**: 122 
>
> **Train, dev, test size**: (2.5k, 534, 492)
>
> - dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19
> - selected 147 scientific articles mostly related to COVID-19 from the CORD-19 collection to be annotated by 15 experts

<p style="color: blue;">Good data, but we need to think about how to parse it, because 1. the title is included in the context, 2. the text is not split into parts.</p>
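
One possible way to handle both issues is sketched here. It assumes each context starts with the title followed by a blank line, which is the layout of the first record in the original COVID-QA file and may not hold for every record. The helpers `split_title` and `chunk_words` and the default chunk size of 100 words are hypothetical. Metadata lines that follow the title (URL, authors, DOI) stay in the body in this sketch.

```python
def split_title(context):
    """Split a context into (title, rest) at the first blank line."""
    title, _, rest = context.partition("\n\n")
    return title.strip(), rest.strip()


def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# Toy context following the assumed "title, blank line, body" layout.
toy_context = "Some Paper Title\n\nAbstract: one two three four five six seven"
title, body = split_title(toy_context)
chunks = chunk_words(body, max_words=3)
print(title)
print(chunks)
```

Fixed-size word chunks are only the simplest option. Splitting on sentence or section boundaries would likely fit the data better, at the cost of more parsing logic.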

In [2]:
covidqa = load_dataset('rungalileo/ragbench', 'covidqa')

Loaded splits: train 1252, validation 267, test 246 examples.

In [5]:
covidqa_train = pd.DataFrame(covidqa['train'])

In [8]:
# Count training examples whose response is flagged as not adhering to the retrieved context
sum(covidqa_train.adherence_score == False)

185

In [None]:
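# Download the original COVID-QA file (SQuAD-style JSON) from the deepset-ai repository
response = requests.get('https://raw.githubusercontent.com/deepset-ai/COVID-QA/refs/heads/master/data/question-answering/COVID-QA.json')
response.raise_for_status()  # fail early if the download did not succeed
covidqa_orig = pd.read_json(io.BytesIO(response.content))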

In [35]:
covidqa_orig.iloc[0].values[0]['paragraphs'][0]['context']

"Functional Genetic Variants in DC-SIGNR Are Associated with Mother-to-Child Transmission of HIV-1\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752805/\n\nBoily-Larouche, Geneviève; Iscache, Anne-Laure; Zijenah, Lynn S.; Humphrey, Jean H.; Mouland, Andrew J.; Ward, Brian J.; Roger, Michel\n2009-10-07\nDOI:10.1371/journal.pone.0007211\nLicense:cc-by\n\nAbstract: BACKGROUND: Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide. Given that the C-type lectin receptor, dendritic cell-specific ICAM-grabbing non-integrin-related (DC-SIGNR, also known as CD209L or liver/lymph node–specific ICAM-grabbing non-integrin (L-SIGN)), can interact with pathogens including HIV-1 and is expressed at the maternal-fetal interface, we hypothesized that it could influence MTCT of HIV-1. METHODS AND FINDINGS: To investigate the potential role of DC-SIGNR in MTCT of HIV-1, we carried out a genetic association study of DC-SIGNR in a well-characterized cohort of 197

In [37]:
covidqa_orig.iloc[0].values[0]['paragraphs'][0]['qas'][0]

{'question': 'What is the main cause of HIV-1 infection in children?',
 'id': 262,
 'answers': [{'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide. ',
   'answer_start': 370}],
 'is_impossible': False}
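
The nested layout shown above (article, then `paragraphs`, then `qas`, then `answers`) can be flattened into one record per answer. This is a minimal sketch, assuming every article follows the structure of the record printed above. The helper name `flatten_qas` and the `span_matches` field are hypothetical. The check compares the answer text with the slice of the context starting at `answer_start`.

```python
def flatten_qas(articles):
    """Yield one flat record per answer from SQuAD-style articles."""
    for article in articles:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "answer": answer["text"],
                        "answer_start": start,
                        # True when the offset really points at the answer text
                        "span_matches": context[start:end] == answer["text"],
                    }


# Toy article in the same nested layout (not real data).
toy_articles = [{
    "paragraphs": [{
        "context": "Title\n\nWater boils at 100 C at sea level.",
        "qas": [{
            "question": "At what temperature does water boil?",
            "id": 1,
            "answers": [{"text": "100 C", "answer_start": 22}],
            "is_impossible": False,
        }],
    }],
}]
records = list(flatten_qas(toy_articles))
print(records)
```

With the DataFrame above, the articles would presumably be passed as `covidqa_orig.iloc[:, 0]`, since the cells above read each article from the first column. That is an assumption about the file's top-level layout.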

### TREC-COVID (BEIR)

> **Corpus size**: 171k
>
> **Number of original queries**: 50
>
> **Number of generated queries**: 480k (roughly 367k after cleaning)

In [50]:
dataset_name = "BeIR/trec-covid"
# The corpus comes from the mteb/trec-covid repo; generated queries and qrels come from the BeIR repos
corpus = load_dataset("mteb/trec-covid", "corpus", split="corpus")
queries = load_dataset(f"{dataset_name}-generated-queries", split="train")
qrels = load_dataset(f"{dataset_name}-qrels", "default", split="test")

Loaded corpus split: 171332 examples.

In [58]:
corpus = pd.DataFrame(corpus)
queries = pd.DataFrame(queries)
qrels = pd.DataFrame(qrels)

In [73]:
queries.head()

| | _id | title | text | query |
|---|---|---|---|---|
| 0 | ug7v899j | Clinical features of culture-proven Mycoplasma... | OBJECTIVE: This retrospective chart review des... | what is pneumoniae infection prevalence |
| 1 | 02tnwd4m | Nitric oxide: a pro-inflammatory mediator in l... | Inflammatory diseases of the respiratory tract... | does nitric oxide cause inflammation |
| 2 | ejv2xln0 | Surfactant protein-D and pulmonary host defense | Surfactant protein-D (SP-D) participates in th... | which microorganisms are affected by surfactant-d |
| 3 | 2b73a28n | Role of endothelin-1 in lung disease | Endothelin-1 (ET-1) is a 21 amino acid peptide... | what is a et1 |
| 4 | 9785vg6d | Gene expression in epithelial cells in respons... | Respiratory syncytial virus (RSV) and pneumoni... | what is the epithelial cellular response |


In [74]:
queries.shape

(480036, 4)

In [61]:
corpus.head()

| | _id | title | text |
|---|---|---|---|
| 0 | ug7v899j | Clinical features of culture-proven Mycoplasma... | OBJECTIVE: This retrospective chart review des... |
| 1 | 02tnwd4m | Nitric oxide: a pro-inflammatory mediator in l... | Inflammatory diseases of the respiratory tract... |
| 2 | ejv2xln0 | Surfactant protein-D and pulmonary host defense | Surfactant protein-D (SP-D) participates in th... |
| 3 | 2b73a28n | Role of endothelin-1 in lung disease | Endothelin-1 (ET-1) is a 21 amino acid peptide... |
| 4 | 9785vg6d | Gene expression in epithelial cells in respons... | Respiratory syncytial virus (RSV) and pneumoni... |


In [60]:
qrels.head()

| | query-id | corpus-id | score |
|---|---|---|---|
| 0 | 1 | 005b2j4b | 2 |
| 1 | 1 | 00fmeepz | 1 |
| 2 | 1 | g7dhmyyo | 2 |
| 3 | 1 | 0194oljo | 1 |
| 4 | 1 | 021q9884 | 1 |
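
To use the relevance judgments for retrieval evaluation, the qrels rows can be grouped into a mapping from each query id to its relevant corpus ids. This is a minimal sketch: the helper name `relevant_docs` and the `min_score` threshold are hypothetical, and treating any score of at least 1 as relevant is an assumption about how the grades should be read.

```python
from collections import defaultdict


def relevant_docs(qrel_rows, min_score=1):
    """Map each query id to the corpus ids judged relevant for it."""
    by_query = defaultdict(list)
    for row in qrel_rows:
        if row["score"] >= min_score:
            by_query[row["query-id"]].append(row["corpus-id"])
    return dict(by_query)


# The five qrels rows shown above, copied as plain dicts.
sample_qrels = [
    {"query-id": 1, "corpus-id": "005b2j4b", "score": 2},
    {"query-id": 1, "corpus-id": "00fmeepz", "score": 1},
    {"query-id": 1, "corpus-id": "g7dhmyyo", "score": 2},
    {"query-id": 1, "corpus-id": "0194oljo", "score": 1},
    {"query-id": 1, "corpus-id": "021q9884", "score": 1},
]
highly_relevant = relevant_docs(sample_qrels, min_score=2)
print(highly_relevant)
```

With the DataFrame above, the rows could be passed as `qrels.to_dict("records")`.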


In [81]:
# Keep only rows with a non-empty passage text and a non-empty generated query
queries.query('`text` != "" and `query` != ""').shape

(367324, 4)

In [84]:
# Mean passage length in characters over the cleaned rows
queries.query('`text` != "" and `query` != ""').text.apply(len).mean()

np.float64(1363.4602231272663)