# Datasets

These are the datasets we use to evaluate different RAG pipelines.

In [None]:
%pip install -q ragstack-ai pypdf

## Blockchain Solana Dataset

Description: A labelled RAG dataset based off an article, From Bitcoin to Solana – Innovating Blockchain towards Enterprise Applications, by Xiangyu Li, Xinyu Wang, Tingli Kong, Junhao Zheng and Min Luo, consisting of queries, reference answers, and reference contexts.

Number Of Examples: 58

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/blockchain_solana/llamaindex_baseline.py) | 0.945 | 4.457 | 1 | 1 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |
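Every dataset in this notebook reports its baseline in the same nine-column row, so the rows can be turned into records for side-by-side comparison. The sketch below is one way to do that with the standard library; `parse_baseline_row` and the snake_case field names are hypothetical, chosen here for illustration, and a run of dashes is treated as a missing score.

```python
import re

# Field names are illustrative; they follow the table's column order.
COLUMNS = [
    "baseline", "context_similarity", "correctness", "faithfulness",
    "relevancy", "llm", "chunk_size", "similarity_top_k", "embed_model",
]
FLOAT_FIELDS = {"context_similarity", "correctness", "faithfulness", "relevancy"}
INT_FIELDS = {"chunk_size", "similarity_top_k"}


def parse_baseline_row(row):
    """Parse one Markdown baseline row into a dict with typed values."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")

    record = {}
    for name, cell in zip(COLUMNS, cells):
        if name == "baseline":
            # Keep only the link text of a Markdown link such as [name](url).
            match = re.match(r"\[([^\]]+)\]\([^)]+\)", cell)
            record[name] = match.group(1) if match else cell
        elif name in FLOAT_FIELDS:
            # A cell made only of dashes (e.g. "----") means no score was reported.
            record[name] = None if set(cell) <= {"-"} else float(cell)
        elif name in INT_FIELDS:
            record[name] = int(cell)
        else:
            record[name] = cell
    return record


if __name__ == "__main__":
    row = (
        "| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/"
        "llama_hub/llama_datasets/blockchain_solana/llamaindex_baseline.py) "
        "| 0.945 | 4.457 | 1 | 1 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |"
    )
    print(parse_baseline_row(row))
```

Collecting one record per dataset makes it straightforward to sort or filter the baselines, for example by correctness or by chunk size.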

Source(s): https://arxiv.org/abs/2207.05240

In [None]:
! llamaindex-cli download-llamadataset BlockchainSolanaDataset --download-dir ./data/blockchain_solana
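After a download finishes, it can help to sanity-check what landed on disk. The sketch below uses only the standard library and assumes the download directory contains a `rag_dataset.json` file with a top-level `examples` list whose entries carry `query`, `reference_answer`, and `reference_contexts` keys. That layout is an assumption rather than something this notebook confirms, so adjust the names if your download differs. `summarize_rag_dataset` is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path


def summarize_rag_dataset(download_dir):
    """Return basic counts for a downloaded labelled RAG dataset.

    Assumes (not confirmed by this notebook) that the directory holds a
    `rag_dataset.json` file shaped like {"examples": [{...}, ...]}.
    """
    dataset_path = Path(download_dir) / "rag_dataset.json"
    with dataset_path.open(encoding="utf-8") as f:
        payload = json.load(f)

    examples = payload.get("examples", [])
    # Count examples that lack a reference answer or reference contexts.
    incomplete = sum(
        1
        for ex in examples
        if not ex.get("reference_answer") or not ex.get("reference_contexts")
    )
    return {
        "num_examples": len(examples),
        "num_incomplete": incomplete,
        "first_query": examples[0].get("query") if examples else None,
    }


if __name__ == "__main__":
    target = Path("./data/blockchain_solana")
    if (target / "rag_dataset.json").exists():
        print(summarize_rag_dataset(target))
    else:
        print(f"No rag_dataset.json under {target}; run the download cell first.")
```

If the layout assumption holds, `num_examples` for this dataset should agree with the 58 examples listed above; a mismatch would suggest a partial download or a different file format.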

## Braintrust Coda Help Desk

Description: A list of automatically generated question/answer pairs from the Coda (https://coda.io/) help docs. This dataset is interesting because most models include Coda’s documentation as part of their training set, so you can baseline performance without RAG.

Number Of Examples: 100

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/braintrust_coda/llamaindex_baseline.py) | 0.955 | 4.32 | 0.9 | 0.93 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json

In [None]:
! llamaindex-cli download-llamadataset BraintrustCodaHelpDeskDataset --download-dir ./data/braintrust_coda_help_desk

## Covid QA

Description: A human-annotated RAG dataset consisting of over 300 question-answer pairs. This dataset represents a subset of the Covid-QA dataset available on Kaggle and authored by Xhlulu. It is a collection of frequently asked questions on COVID from various websites. This subset only considers the top 10 webpages containing the most question-answer pairs.

Number Of Examples: 316

Examples Generated By: Human

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/covidqa/llamaindex_baseline.py) | ---- | 3.96 | 0.889 | 0.848 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://www.kaggle.com/datasets/xhlulu/covidqa/?select=news.csv

In [None]:
! llamaindex-cli download-llamadataset CovidQaDataset --download-dir ./data/covid_qa

## Evaluating LLM Survey Paper

Description: A labelled RAG dataset over a comprehensive, 111-page survey on evaluating LLMs.

Number Of Examples: 276

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/mini_squadv2/llamaindex_baseline.py) | 0.923 | 3.81 | 0.888 | 0.808 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://arxiv.org/pdf/2310.19736.pdf

In [None]:
! llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data/evaluating_llm_survey_paper

## History Of AlexNet

Description: A labelled RAG dataset based off an article, The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches, by Md Zahangir Alom, Tarek M. Taha, Christopher Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Brian C Van Esesn, Abdul A S. Awwal, Vijayan K. Asari, consisting of queries, reference answers, and reference contexts.

Number Of Examples: 160

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/history_of_alexnet/llamaindex_baseline.py) | 0.931 | 4.434 | 0.963 | 0.931 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://arxiv.org/abs/1803.01164

In [None]:
# broken
! llamaindex-cli download-llamadataset HistoryOfAlexnetDataset --download-dir ./data/history_of_alexnet

## Llama 2 Paper

Description: A labelled RAG dataset based off the Llama 2 ArXiv PDF.

Number Of Examples: 100

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/llama2_paper/llamaindex_baseline.py) | 0.939 | 4.08 | 0.97 | 0.95 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://arxiv.org/abs/2307.09288

In [None]:
! llamaindex-cli download-llamadataset Llama2PaperDataset --download-dir ./data/llama_2_paper

## Mini Squad V2

Description: This is a subset of the original SquadV2 dataset. In particular, it considers only the 10 Wikipedia pages with the most questions about them.

Number Of Examples: 195

Examples Generated By: Human

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/mini_squadv2/llamaindex_baseline.py) | 0.878 | 3.464 | 0.815 | 0.697 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://huggingface.co/datasets/squad_v2

In [None]:
! llamaindex-cli download-llamadataset MiniSquadV2Dataset --download-dir ./data/mini_squad_v2

## Mini TruthfulQA

Description: This is a subset of the TruthfulQA benchmark. Only examples based on Wikipedia pages are considered, and Wikipedia pages that contain only one question are dropped. The result is 152 examples for evaluating a RAG system.

Number Of Examples: 152

Examples Generated By: Human

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/mini_truthfulqa/llamaindex_baseline.py) | ---- | 3.845 | 0.605 | 0.599 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://huggingface.co/datasets/truthful_qa

## Origin Of COVID-19

Description: A labelled RAG dataset based off an article, The Origin Of COVID-19 and Why It Matters, by Morens DM, Breman JG, Calisher CH, Doherty PC, Hahn BH, Keusch GT, Kramer LD, LeDuc JW, Monath TP, Taubenberger JK, consisting of queries, reference answers, and reference contexts.

Number Of Examples: 24

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/origin_of_covid19/llamaindex_baseline.py) | 0.952 | 4.562 | 1 | 0.958 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7470595/

In [None]:
! llamaindex-cli download-llamadataset OriginOfCovid19Dataset --download-dir ./data/origin_of_covid_19

## Patronus AI FinanceBench

Description: This is a subset of the original FinanceBench dataset. FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). This is an open source sample of 150 annotated examples used in the evaluation and analysis of models assessed in the FinanceBench paper. The dataset comprises questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.

Number Of Examples: 98

Examples Generated By: Human

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/patronus_financebench/llamaindex_baseline.py) | 0.87 | 2.622 | 0.755 | 0.684 | gpt-3.5-turbo | 1024 | 1 | text-embedding-ada-002 |

Source(s): https://huggingface.co/datasets/PatronusAI/financebench

In [None]:
! llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data/patronus_ai_financebench

## Paul Graham Essay

Description: A labelled RAG dataset based off an essay by Paul Graham, consisting of queries, reference answers, and reference contexts.

Number Of Examples: 44

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/paul_graham_essay/llamaindex_baseline.py) | 0.934 | 4.239 | 0.977 | 0.977 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s): http://www.paulgraham.com/articles.html

In [None]:
! llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data/paul_graham_essay

## Uber 10K

Description: A labelled RAG dataset based on the Uber 2021 10K document, consisting of queries, reference answers, and reference contexts.

Number Of Examples: 822

Examples Generated By: AI

| Baseline | Context Similarity | Correctness | Faithfulness | Relevancy | LLM | Chunk Size | Similarity Top-K | Embed Model |
| ---      | ---                | ---          | ---           | ---        | --- | ---        | ---              | ---         |
| [llamaindex](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_datasets/10k/uber_2021/llamaindex_baseline.py) | 0.943 | 3.874 | 0.667 | 0.844 | gpt-3.5-turbo | 1024 | 2 | text-embedding-ada-002 |

Source(s):

In [None]:
! llamaindex-cli download-llamadataset Uber10KDataset2021 --download-dir ./data/uber_10k
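The download cells in this notebook all follow one pattern, so they can also be generated from a single mapping. The sketch below builds the same `llamaindex-cli download-llamadataset` invocations as argument lists and prints them instead of running them. The `DATASETS` mapping and `build_download_command` are hypothetical names introduced for illustration, and the directory names are only suggestions; use whichever paths you pass to `--download-dir`.

```python
import shlex

# Dataset class name -> download directory, following the cells in this notebook.
# Directory names are illustrative; change them to match your own layout.
DATASETS = {
    "BlockchainSolanaDataset": "./data/blockchain_solana",
    "BraintrustCodaHelpDeskDataset": "./data/braintrust_coda_help_desk",
    "CovidQaDataset": "./data/covid_qa",
    "EvaluatingLlmSurveyPaperDataset": "./data/evaluating_llm_survey_paper",
    # The notebook marks this download as broken; kept here for completeness.
    "HistoryOfAlexnetDataset": "./data/history_of_alexnet",
    "Llama2PaperDataset": "./data/llama_2_paper",
    "MiniSquadV2Dataset": "./data/mini_squad_v2",
    "OriginOfCovid19Dataset": "./data/origin_of_covid_19",
    "PatronusAIFinanceBenchDataset": "./data/patronus_ai_financebench",
    "PaulGrahamEssayDataset": "./data/paul_graham_essay",
    "Uber10KDataset2021": "./data/uber_10k",
}


def build_download_command(dataset_class, download_dir):
    """Return the CLI invocation as an argument list (usable with subprocess.run)."""
    return [
        "llamaindex-cli",
        "download-llamadataset",
        dataset_class,
        "--download-dir",
        download_dir,
    ]


if __name__ == "__main__":
    for dataset_class, download_dir in DATASETS.items():
        # Print rather than execute, so the sketch has no side effects.
        print(shlex.join(build_download_command(dataset_class, download_dir)))
```

Passing each argument list to `subprocess.run(..., check=True)` would execute the downloads in sequence and stop at the first failure, which makes a broken dataset easy to spot.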