# ColBERT Analysis


## Test Setup

**Baseline** (RED)
  * LlamaIndex 0.10.x with AstraDBVectorStore 
  * Embedding
    * SimpleDirectoryReader
    * TokenTextSplitter(chunk_size=512, chunk_overlap=128)
    * OpenAIEmbedding(model="text-embedding-3-small")
  * Query (with TruLlama recorder)
    * VectorIndexRetriever(similarity_top_k=5)
    * Prompt: llama_index.core.get_response_synthesizer()
    * AzureOpenAI(model="gpt-3.5-turbo")

**ColBERT** (GREEN)
  * LlamaIndex 0.10.x with embedding.AstraDB
  * Embedding
    * SimpleDirectoryReader
    * TokenTextSplitter(chunk_size=160, chunk_overlap=50)
    * ColbertTokenEmbeddings(doc_maxlen=220, nbits=1, kmeans_niters=4, nranks=1)
  * Query (with TruLlama recorder)
    * ColbertAstraRetriever(k=5, query_maxlen=32)
    * Prompt: llama_index.core.get_response_synthesizer()
    * AzureOpenAI(model="gpt-3.5-turbo")
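
As a rough illustration of how the two `TokenTextSplitter` settings above differ (512/128 for the baseline, 160/50 for ColBERT), the sketch below splits a made-up token sequence with a plain sliding window. It is a simplification, not LlamaIndex's implementation: the real splitter works on text and separators, so its chunk boundaries can differ. The helper `sliding_window_chunks` and the 2,000-token document are hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Hypothetical document of 2,000 tokens (integers stand in for token ids).
tokens = list(range(2000))

baseline_chunks = sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=128)
colbert_chunks = sliding_window_chunks(tokens, chunk_size=160, chunk_overlap=50)

print(len(baseline_chunks))  # 5
print(len(colbert_chunks))   # 18
```

Under this simplification the ColBERT setting produces several times as many, much shorter, chunks from the same text. Each chunk stays below `doc_maxlen=220`, although that limit is counted in ColBERT's own tokens, which may not match the splitter's tokenizer, so the comparison is only approximate.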

**ColBERT_k_5_query_maxlen_64** (BLUE)
  * LlamaIndex 0.10.x with embedding.AstraDB
  * Embedding
    * SimpleDirectoryReader
    * TokenTextSplitter(chunk_size=160, chunk_overlap=50)
    * ColbertTokenEmbeddings(doc_maxlen=220, nbits=1, kmeans_niters=4, nranks=1)
  * Query (with TruLlama recorder)
    * ColbertAstraRetriever(k=5, query_maxlen=64)
    * Prompt: llama_index.core.get_response_synthesizer()
    * AzureOpenAI(model="gpt-3.5-turbo")
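
The GREEN and BLUE configurations differ only in `query_maxlen`. For context, ColBERT scores a document by late interaction ("MaxSim"): each query token embedding is matched to its most similar document token embedding, and those maxima are summed. In the original ColBERT design the query is padded or truncated to a fixed length, so `query_maxlen` bounds how many query vectors take part in that sum; whether `ColbertAstraRetriever` behaves identically is an assumption here. The sketch below uses made-up 2-dimensional embeddings and a hypothetical `maxsim_score` helper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token take the similarity of its
    best-matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (illustrative values only, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

# Rank documents by descending score, as a retriever would before taking the top k.
ranking = sorted([("doc_a", score_a), ("doc_b", score_b)], key=lambda p: p[1], reverse=True)
print(ranking)
```

Because every query vector contributes one term to the score, a larger `query_maxlen` gives longer questions more tokens to match against, which is one plausible reason the two settings behave differently on some datasets.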

**RAGatouille** (VIOLET)
  * LlamaIndex 0.10.x 
  * Embedding
    * SimpleDirectoryReader
    * RAGatouille default embedding
  * Query (with TruLlama recorder)
    * RAGatouilleRetriever(k=5)
    * Prompt: llama_index.core.get_response_synthesizer()
    * AzureOpenAI(model="gpt-3.5-turbo")   


## Results Summary

#### From the 11 datasets, compared to the baseline, ColBERT `maxlen=64` was:
* Significantly better for 4
* Slightly better for 2
* Slightly worse for 3
* Significantly worse for 0
* Tie for 2

This confirms that ColBERT `maxlen=64` generally performs better than the baseline.

#### From the 11 datasets, compared to the baseline, ColBERT `maxlen=32` was:
* Significantly better for 4
* Slightly better for 2
* Slightly worse for 2
* Significantly worse for 2
* Inconclusive for 1

This confirms that ColBERT `maxlen=32` performs similarly to the baseline.

#### From the 11 datasets, compared to RAGatouille, ColBERT `maxlen=64` was:
* Significantly better for 1
* Slightly better for 1
* Slightly worse for 5
* Significantly worse for 0
* Tie for 4

This confirms that ColBERT `maxlen=64` performs similarly to RAGatouille.
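
The `maxlen=64` tallies above can be cross-checked against the per-dataset verdicts in Detailed Results. The sketch below transcribes those verdicts by hand (the dictionary names and the mapping of "performs similarly" to "tie" are choices made for this example) and counts them.

```python
from collections import Counter

# ColBERT (maxlen=64) vs. the baseline, taken from each dataset's verdict heading.
VS_BASELINE = {
    "Blockchain Solana": "slightly better",
    "Braintrust Coda Help Desk": "significantly better",
    "Covid QA": "tie",
    "Evaluating LLM Survey Paper": "significantly better",
    "History Of Alexnet": "slightly better",
    "Llama 2 Paper": "tie",
    "Mini Squad V2": "significantly better",
    "Origin Of COVID-19": "slightly worse",
    "Patronus AI FinanceBench": "slightly worse",
    "Paul Graham Essay": "significantly better",
    "Uber 10K": "slightly worse",
}

# ColBERT (maxlen=64) vs. RAGatouille, taken from the last line of each dataset section.
VS_RAGATOUILLE = {
    "Blockchain Solana": "slightly worse",
    "Braintrust Coda Help Desk": "slightly worse",
    "Covid QA": "tie",
    "Evaluating LLM Survey Paper": "tie",
    "History Of Alexnet": "tie",
    "Llama 2 Paper": "tie",
    "Mini Squad V2": "significantly better",
    "Origin Of COVID-19": "slightly better",
    "Patronus AI FinanceBench": "slightly worse",
    "Paul Graham Essay": "slightly worse",
    "Uber 10K": "slightly worse",
}


def tally(verdicts):
    """Count how many datasets fall under each verdict."""
    return Counter(verdicts.values())


for name, verdicts in (("baseline", VS_BASELINE), ("RAGatouille", VS_RAGATOUILLE)):
    print(name, dict(tally(verdicts)))
```

Both tallies sum to 11 and agree with the counts listed in this summary.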

## Detailed Results

### Blockchain Solana Dataset

Description: A labelled RAG dataset based off an article, From Bitcoin to Solana – Innovating Blockchain towards Enterprise Applications, by Xiangyu Li, Xinyu Wang, Tingli Kong, Junhao Zheng and Min Luo, consisting of queries, reference answers, and reference contexts.

Source Data: 1 PDF file (27 pages total)

Number Of Examples: 58

Examples Generated By: AI

Source(s):
https://arxiv.org/abs/2207.05240

![blockchain_solana.png](blockchain_solana.png)

#### ColBERT slightly better

Both ColBERT tests perform slightly better than the baseline on answer correctness. Context relevance and grounded-ness scores are also higher.

ColBERT `(maxlen=64)` performs slightly worse than RAGatouille.

### Braintrust Coda Help Desk

Description: A list of automatically generated question/answer pairs from the Coda (https://coda.io/) help docs. This dataset is interesting because most models include Coda’s documentation as part of their training set, so you can baseline performance without RAG.

Source Data: 50 Markdown files

Number Of Examples: 100

Examples Generated By: AI

Source(s): https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json

![braintrust_coda_help_desk.png](braintrust_coda_help_desk.png)

#### ColBERT significantly better

ColBERT shows much tighter and higher answer correctness and context relevance. Grounded-ness is also much better.

ColBERT `(maxlen=64)` performs slightly worse than RAGatouille.

### Covid QA

Description: A human-annotated RAG dataset consisting of over 300 question-answer pairs. This dataset represents a subset of the Covid-QA dataset available on Kaggle and authored by Xhlulu. It is a collection of frequently asked questions on COVID from various websites. This subset only considers the top 10 webpages containing the most question-answer pairs.

Source Data: 10 html files

Number Of Examples: 316

Examples Generated By: Human

Source(s): https://www.kaggle.com/datasets/xhlulu/covidqa/?select=news.csv

![covid_qa.png](covid_qa.png)

#### Tie

Nearly identical performance between ColBERT and the baseline.

ColBERT `(maxlen=64)` performs similarly to RAGatouille.

### Evaluating LLM Survey Paper

Description: A labelled RAG dataset over a comprehensive survey on evaluating LLMs, spanning 111 pages in total.

Source Data: 1 PDF file (111 pages total)

Number Of Examples: 276

Examples Generated By: AI

Source(s): https://arxiv.org/pdf/2310.19736.pdf

![evaluating_llm_survey_paper.png](evaluating_llm_survey_paper.png)

#### ColBERT significantly better

ColBERT `(maxlen=32)` shows much tighter and higher answer correctness. Context relevance is slightly better. Grounded-ness is also better: the LLM hallucinates less with ColBERT.

ColBERT `(maxlen=64)` has even stronger results.

ColBERT `(maxlen=64)` performs similarly to RAGatouille.

### History Of Alexnet

Description: A labelled RAG dataset based off an article, The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches, by Md Zahangir Alom, Tarek M. Taha, Christopher Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Brian C Van Esesn, Abdul A S. Awwal, Vijayan K. Asari, consisting of queries, reference answers, and reference contexts.

Source Data: 1 PDF file (39 pages total)

Number Of Examples: 160

Examples Generated By: AI

Source(s): https://arxiv.org/abs/1803.01164

![history_of_alexnet.png](history_of_alexnet.png)

#### ColBERT slightly better

ColBERT `(maxlen=64)` shows slightly better context relevance and answer correctness.

ColBERT `(maxlen=64)` performs similarly to RAGatouille.

### Llama 2 Paper

Description: A labelled RAG dataset based off the Llama 2 ArXiv PDF.

Source Data: 1 PDF file (77 pages total)

Number Of Examples: 100

Examples Generated By: AI

Source(s): https://arxiv.org/abs/2307.09288

![llama_2_paper.png](llama_2_paper.png)

#### Tie

ColBERT `(maxlen=32)` has significantly lower answer correctness and grounded-ness, despite having nearly identical context relevance.

ColBERT `(maxlen=64)` performs similarly to the baseline.

ColBERT `(maxlen=64)` performs similarly to RAGatouille.

### Mini Squad V2

Description: This is a subset of the original SquadV2 dataset. In particular, it considers only the top 10 Wikipedia pages in terms of having questions about them.

Source Data: 10 txt files

Number Of Examples: 195

Examples Generated By: Human

Source(s): https://huggingface.co/datasets/squad_v2

![mini_squad_v2.png](mini_squad_v2.png)

#### ColBERT significantly better

ColBERT `(maxlen=32)` shows significantly better answer correctness, context relevance and grounded-ness.

ColBERT `(maxlen=64)` performs significantly better than ColBERT `(maxlen=32)`.

ColBERT `(maxlen=64)` performs significantly better than RAGatouille.

### Origin Of COVID-19

Description: A labelled RAG dataset based off an article, The Origin Of COVID-19 and Why It Matters, by Morens DM, Breman JG, Calisher CH, Doherty PC, Hahn BH, Keusch GT, Kramer LD, LeDuc JW, Monath TP, Taubenberger JK, consisting of queries, reference answers, and reference contexts.

Source Data: 1 PDF file (5 pages total)

Number Of Examples: 24

Examples Generated By: AI

Source(s): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7470595/

![origin_of_covid_19.png](origin_of_covid_19.png)

#### ColBERT slightly worse

ColBERT `(maxlen=64)` has slightly lower answer correctness and significantly lower context relevance than the baseline.

ColBERT `(maxlen=64)` performs slightly better than RAGatouille.

### Patronus AI FinanceBench

Description: This is a subset of the original FinanceBench dataset. FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). This is an open source sample of 150 annotated examples used in the evaluation and analysis of models assessed in the FinanceBench paper. The dataset comprises questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.

Source Data: 32 PDF files (4148 pages total)

Number Of Examples: 98

Examples Generated By: Human

Source(s): https://huggingface.co/datasets/PatronusAI/financebench

![patronus_finance_bench.png](patronus_ai_financebench.png)

#### ColBERT slightly worse

Nearly identical performance between ColBERT and the baseline. If anything, ColBERT is slightly worse.

ColBERT `(maxlen=64)` performs slightly worse than RAGatouille.

### Paul Graham Essay

Description: A labelled RAG dataset based off an essay by Paul Graham, consisting of queries, reference answers, and reference contexts.

Source Data: 1 txt file

Number Of Examples: 44

Examples Generated By: AI

Source(s): http://www.paulgraham.com/articles.html

![paul_graham_essay.png](paul_grahman_essay.png)

#### ColBERT significantly better

ColBERT has better context relevance. Answer correctness and grounded-ness are significantly better.

ColBERT `(maxlen=64)` performs slightly worse than RAGatouille.

### Uber 10K

Description: A labelled RAG dataset based on the Uber 2021 10K document, consisting of queries, reference answers, and reference contexts.

Source Data: 1 PDF file (307 pages total)

Number Of Examples: 822

Examples Generated By: AI

Source(s): https://s23.q4cdn.com/407969754/files/doc_financials/2022/ar/2021-Annual-Report.pdf

![uber_10k.png](uber_10k.png)

#### ColBERT slightly worse

ColBERT `(maxlen=32)` shows significantly lower scores on context relevance, answer correctness and grounded-ness as compared to the baseline.

ColBERT `(maxlen=64)` shows a big improvement over ColBERT `(maxlen=32)`. Both answer correctness and context relevance are just slightly worse than the baseline.

ColBERT `(maxlen=64)` performs slightly worse than RAGatouille.