## Comparing Different Retriever Approaches in RAG Models

A simple comparison of three retrievers: a sentence window retriever, an auto retriever, and an auto-merging retriever. They are compared on correctness, relevancy, faithfulness, and context similarity. The auto-merging retriever worked best, followed by the sentence window retriever and then the auto retriever.

Install necessary libraries

In [None]:
%%capture
!pip install llama_index
!pip install chromadb
!pip install llama-index-vector-stores-chroma
!pip install llama-index-packs-rag-evaluator
!pip install llama-index-embeddings-huggingface

Set up keys that we need

In [None]:
import os
from google.colab import userdata
OPENAI_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_KEY
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

HUGGINGFACE_KEY = userdata.get('HUGGINGFACE_API_KEY')
os.environ['HUGGINGFACE_API_KEY'] = HUGGINGFACE_KEY
HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']

Scrape the investor relations pages for the 10-K PDFs. Note that the first PDF on page 1 is not a 10-K filing, and one of the 10-K filings is listed on page 2. The code below accounts for both, so we end up with the right documents.

In [None]:
import requests
from bs4 import BeautifulSoup
import os

# Function to extract PDF links from a given URL
def get_pdf_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    pdf_links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.endswith('.pdf'):
            # Ensure the link is an absolute URL
            if href.startswith('http'):
                pdf_url = href
            else:
                pdf_url = f"https://investors.coca-colacompany.com{href}"
            pdf_links.append(pdf_url)
    return pdf_links

# URLs of the first and second pages
urls = [
    "https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k",
    "https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k?page=2"
]

# Collect links from both pages
pdf_links_page1 = get_pdf_links(urls[0])
pdf_links_page2 = get_pdf_links(urls[1])

# Skip the first link from page 1 and take the first link from page 2
pdf_links = pdf_links_page1[1:] + pdf_links_page2[:1]

# Check if any PDF links were found
if pdf_links:
    print(f"Found {len(pdf_links)} PDF links:")
    for link in pdf_links:
        print(link)
else:
    print("No PDF links found. Check the page structure or URL.")

# Directory to save the PDFs
folder = 'coca_cola_10k'
os.makedirs(folder, exist_ok=True)

# Download each PDF and save it
for i, pdf_url in enumerate(pdf_links):
    pdf_response = requests.get(pdf_url)
    if pdf_response.status_code == 200:
        pdf_path = os.path.join(folder, f"10K_{i+1}.pdf")
        with open(pdf_path, 'wb') as f:
            f.write(pdf_response.content)
        print(f"Downloaded: {pdf_path}")
    else:
        print(f"Failed to download: {pdf_url}")


Found 10 PDF links:
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-24-000009/0000021344-24-000009.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-23-000011/0000021344-23-000011.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-22-000009/0000021344-22-000009.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-21-000008/0000021344-21-000008.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-20-000006/0000021344-20-000006.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-19-000014/0000021344-19-000014.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/content/0000021344-18-000008/0000021344-18-000008.pdf
https://investors.coca-colacompany.com/filings-reports/annual-filings-10-k/co
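
The scraper above prefixes relative links with a hardcoded host. As a hedged alternative, the sketch below resolves each `href` against the page it came from using the standard library's `urljoin`. The helper name `resolve_pdf_links` and the example.com URLs are made up for illustration and are not part of the original notebook.

```python
from urllib.parse import urljoin


def resolve_pdf_links(page_url, hrefs):
    """Return absolute URLs for every href that ends in .pdf."""
    return [
        urljoin(page_url, href)  # absolute hrefs pass through unchanged
        for href in hrefs
        if href.lower().endswith(".pdf")
    ]


# Hypothetical inputs: one relative link, one absolute link, one non-PDF link
links = resolve_pdf_links(
    "https://example.com/filings/annual",
    ["/content/a.pdf", "https://cdn.example.com/b.pdf", "/about.html"],
)
print(links)
```

With this approach the base URL does not need to be repeated inside the function, so the same helper would work for both pages in `urls`.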

This creates metadata for each chunk to make sure the year it comes from is known.

In [None]:
from pathlib import Path
from llama_index.readers.file import PDFReader

# Directory containing the PDFs
pdf_directory = './coca_cola_10k/'
pdf_files = sorted(Path(pdf_directory).glob("*.pdf"))  # Ensure files are processed in order

# Define year mapping based on order (2023 for 10K_1.pdf, 2022 for 10K_2.pdf, etc.)
start_year = 2023
year_mapping = {f"10K_{i+1}.pdf": start_year - i for i in range(len(pdf_files))}

# Load PDFs and attach year metadata
documents_with_metadata = []

for pdf_file in pdf_files:
    pdf_reader = PDFReader()
    document = pdf_reader.load_data(file=pdf_file)

    # Get the year from the mapping
    year = year_mapping[pdf_file.name]

    # Attach metadata to each chunk
    for chunk in document:
        chunk.metadata = {"year": year, "source_file": str(pdf_file)}
        documents_with_metadata.append(chunk)

# Verify metadata
print("Metadata for the first chunk:")
print(documents_with_metadata[0].metadata)
print(documents_with_metadata[0].text)

# Verify document count and years
print(f"Total documents loaded: {len(documents_with_metadata)}")
print("Years included in metadata:", {doc.metadata['year'] for doc in documents_with_metadata})


Metadata for the first chunk:
{'year': 2023, 'source_file': 'coca_cola_10k/10K_1.pdf'}
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
WASHINGTON, D.C. 20549
FORM 10-K 
(Mark One)
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2023 
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from            to
Commission File Number 001-02217 
 COCA COLA CO 
(Exact name of Registrant as specified in its charter)
Delaware 58-0628465
(State or other jurisdiction of incorporation) (I.R.S. Employer Identification No.)
One Coca-Cola Plaza
Atlanta, Georgia 30313
(Address of principal executive offices) (Zip Code)
Registrant’s telephone number, including area code: (404) 676-2121 
Securities registered pursuant to Section 12(b) of the Act:
Title of each class Trading Symbol(s) Name of each exchange on which registered
Common Stock, $0.25 Par Value KO New 

In [None]:
len(documents_with_metadata)


1957
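
The year mapping above is keyed on the file name, not on the order in which the files are listed. That matters because a plain `sorted` call puts `10K_10.pdf` before `10K_2.pdf`. The sketch below derives the year directly from the number in the file name. The helper `year_for` is hypothetical, and it assumes the `10K_<n>.pdf` naming from the download step with `10K_1.pdf` as fiscal year 2023.

```python
import re


def year_for(filename, start_year=2023):
    """Map '10K_<n>.pdf' to a fiscal year, with n=1 being start_year."""
    match = re.fullmatch(r"10K_(\d+)\.pdf", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return start_year - (int(match.group(1)) - 1)


names = ["10K_1.pdf", "10K_2.pdf", "10K_10.pdf"]
print(sorted(names))                   # lexicographic order, not numeric
print({n: year_for(n) for n in names})
```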

Get evaluation data set ready.

In [None]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

In [None]:
import random
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# Initialize LLM
gpt4 = OpenAI(model='gpt-4o-mini', temperature=0.1)

# Randomly sample 20 chunks
sampled_chunks = random.sample(documents_with_metadata, k=20)

# Initialize dataset generator
dataset_generator = RagDatasetGenerator.from_documents(
    sampled_chunks,
    llm=gpt4,
    num_questions_per_chunk=1,
    show_progress=True,
)

# Generate the dataset
eval_dataset = dataset_generator.generate_dataset_from_nodes()


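
`random.sample` draws a different set of chunks on every run, so the generated evaluation questions change between runs. Below is a minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for this comparison. The seed value 42, the helper `sample_chunks`, and the stand-in list `chunks` are arbitrary illustrations, not taken from the notebook.

```python
import random


def sample_chunks(chunks, k, seed=42):
    """Draw k chunks without replacement using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(chunks, k=k)


chunks = [f"chunk-{i}" for i in range(100)]  # stand-in for the loaded documents
first = sample_chunks(chunks, k=20)
second = sample_chunks(chunks, k=20)
print(first == second)
```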

### Sentence window retriever

Get necessary packages

In [None]:
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# In-memory Chroma client; the collection is discarded when the session ends
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("quickstart")


Set up vector index, using chroma for efficient storage

In [None]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents_with_metadata, storage_context=storage_context)


Parsing documents into sentences.

In [None]:
from llama_index.core.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents_with_metadata)
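
To make the idea concrete, here is a simplified, library-free sketch of what a sentence window is: each sentence becomes its own node and carries the surrounding sentences as metadata. The regex-based sentence splitter and the helper `sentence_windows` are assumptions made for illustration and do not reproduce the parser's actual implementation.

```python
import re


def sentence_windows(text, window_size=3):
    """Return one node per sentence, each with a window of neighbouring sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,                 # what gets embedded
            "window": " ".join(sentences[start:end]),  # what the LLM later reads
        })
    return nodes


demo = sentence_windows(
    "Revenue grew. Costs fell. Margins rose. Cash increased. Debt declined.",
    window_size=1,
)
for node in demo:
    print(node)
```

The design choice is that a small unit (one sentence) is used for matching, while a larger unit (the window) is used for answering.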

Create the vector index over the sentence nodes

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
embed_model = OpenAIEmbedding(model='text-embedding-3-small', api_key=OPENAI_API_KEY)

sentence_index = VectorStoreIndex(nodes, embed_model=embed_model, llm=llm, show_progress=True)

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]


Set up query engine

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    top_n=3, model="BAAI/bge-reranker-base"
)


In [None]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
sentence_query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"), rerank
    ],
)
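
The two post-processing steps in the query engine can be illustrated without the library. The sketch below first swaps each retrieved sentence for its window and then keeps only the best-scoring nodes. The word-overlap scorer is a toy stand-in for the cross-encoder reranker, and the sketch assumes that a node without a `window` entry simply keeps its own text. All names and sample texts here are hypothetical.

```python
def replace_with_window(nodes, key="window"):
    """Swap each node's text for its window; keep the text if the key is absent."""
    return [
        {**node, "text": node["metadata"].get(key, node["text"])}
        for node in nodes
    ]


def toy_rerank(nodes, query, top_n=3):
    """Toy reranker: order nodes by how many query words they contain."""
    query_terms = set(query.lower().split())

    def overlap(node):
        return len(query_terms & set(node["text"].lower().split()))

    return sorted(nodes, key=overlap, reverse=True)[:top_n]


retrieved = [
    {"text": "Costs fell.",
     "metadata": {"window": "Revenue grew. Costs fell. Margins rose."}},
    {"text": "The company is based in Atlanta.", "metadata": {}},
]
expanded = replace_with_window(retrieved)
top = toy_rerank(expanded, query="did revenue and margins grow", top_n=1)
print(top[0]["text"])
```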

In [None]:
from llama_index.packs.rag_evaluator import RagEvaluatorPack

In [None]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=sentence_query_engine,
    judge_llm=gpt4,
)

base_benchmark_sentence = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)




### Auto retriever

In [None]:
from llama_index.core.vector_stores.types import VectorStoreInfo, MetadataInfo

metadata_info = [
    MetadataInfo(name="year", type="int", description="The fiscal year of the document."),
    MetadataInfo(name="source_file", type="str", description="The file where the document is located."),
]

vector_store_info = VectorStoreInfo(
    content_info="Main text content of the documents.",
    metadata_info=metadata_info
)



In [None]:
from llama_index.core.indices.vector_store.retrievers.auto_retriever.auto_retriever import VectorIndexAutoRetriever

auto_retriever = VectorIndexAutoRetriever(
    index=index,
    vector_store_info=vector_store_info,
    llm=llm,
    similarity_top_k=5,
)
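
Conceptually, an auto retriever turns a natural-language question into a search string plus metadata filters, using the metadata schema described above. The sketch below imitates that with a regular expression in place of the LLM, purely for illustration. The helpers `infer_year_filter` and `filter_chunks` and the placeholder chunks are hypothetical.

```python
import re


def infer_year_filter(query):
    """Stand-in for the LLM step: pull a four-digit year out of the query."""
    match = re.search(r"\b(?:19|20)\d{2}\b", query)
    return {"year": int(match.group(0))} if match else {}


def filter_chunks(chunks, filters):
    """Keep chunks whose metadata matches every inferred filter."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(k) == v for k, v in filters.items())
    ]


chunks = [
    {"text": "placeholder chunk A", "metadata": {"year": 2023}},
    {"text": "placeholder chunk B", "metadata": {"year": 2022}},
    {"text": "placeholder chunk C", "metadata": {"year": 2021}},
]

with_year = filter_chunks(chunks, infer_year_filter("What risks were listed in 2022?"))
without_year = filter_chunks(chunks, infer_year_filter("What risks were listed?"))
print(len(with_year), len(without_year))
```

A query that names no year yields no filter, in which case this sketch falls back to searching all chunks.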

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

auto_retrieval_query_engine = RetrieverQueryEngine(
    retriever=auto_retriever,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        rerank,
    ],
)


In [None]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=auto_retrieval_query_engine,
    judge_llm=gpt4,
)

# Run evaluation
base_benchmark_auto = await rag_evaluator_pack.arun(
    batch_size=10,
    sleep_time_in_seconds=1
)




### Auto-merging retriever

In [None]:
from llama_index.core.retrievers.auto_merging_retriever import AutoMergingRetriever

# Retrieve candidate nodes from the sentence index; the AutoMergingRetriever
# then looks up parent nodes through the storage context
base_retriever = sentence_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)
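
For intuition, here is a simplified sketch of the merging rule as it is generally described: when enough children of the same parent chunk are retrieved, they are replaced by the parent. The sketch assumes an explicit parent/child hierarchy (in LlamaIndex such a hierarchy is commonly built with `HierarchicalNodeParser`), and the helper `auto_merge`, the 0.5 threshold, and the node ids are illustrative choices.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved children with their parent when enough siblings were hit."""
    hits_by_parent = defaultdict(set)
    for child_id in retrieved_ids:
        hits_by_parent[parent_of[child_id]].add(child_id)

    merged, emitted_parents = [], set()
    for child_id in retrieved_ids:
        parent = parent_of[child_id]
        ratio = len(hits_by_parent[parent]) / len(children_of[parent])
        if ratio > ratio_threshold:
            if parent not in emitted_parents:  # emit each parent only once
                merged.append(parent)
                emitted_parents.add(parent)
        else:
            merged.append(child_id)
    return merged


children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

result = auto_merge(["a", "b", "d"], parent_of, children_of)
print(result)
```

In this example two of the three children of `P1` were retrieved, so they collapse into `P1`, while the single hit under `P2` stays as a leaf.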


In [None]:
auto_merging_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        rerank,
    ],
)


In [None]:

# Evaluate the AutoMergingRetriever
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=auto_merging_query_engine,
    judge_llm=gpt4
)

# Run evaluation
base_benchmark_merge = await rag_evaluator_pack.arun(
    batch_size=10,
    sleep_time_in_seconds=1
)




In [None]:
base_benchmark_merge

rag                            base_rag
metrics
mean_correctness_score         4.214286
mean_relevancy_score           0.892857
mean_faithfulness_score        0.928571
mean_context_similarity_score  0.944713

In [None]:
base_benchmark_sentence

rag                            base_rag
metrics
mean_correctness_score         3.964286
mean_relevancy_score           0.892857
mean_faithfulness_score        0.857143
mean_context_similarity_score  0.945790

In [None]:
import pandas as pd
# Combine all data
all_benchmarks = pd.concat(
    [base_benchmark_merge, base_benchmark_auto, base_benchmark_sentence],
    ignore_index=True,
    axis=1,
)
all_benchmarks.columns = ['merge', 'auto', 'sentence']
print(all_benchmarks)

                                  merge      auto  sentence
metrics                                                    
mean_correctness_score         4.214286  2.392857  3.964286
mean_relevancy_score           0.892857  0.392857  0.892857
mean_faithfulness_score        0.928571  0.392857  0.857143
mean_context_similarity_score  0.944713  0.963153  0.945790


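The auto-merging and sentence window retrievers score much higher than the auto retriever on correctness, relevancy, and faithfulness. Context similarity is nearly identical across all three (about 0.94 to 0.96), with the auto retriever marginally highest. The auto-merging retriever is slightly ahead of the sentence window retriever on correctness (4.21 vs. 3.96) and faithfulness (0.93 vs. 0.86), and the two tie on relevancy.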
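
As a quick check on that reading, the sketch below finds the top-scoring retriever for each metric. The numbers are copied from the combined table above; the helper `best_per_metric` is hypothetical and reports ties as a list.

```python
scores = {
    "mean_correctness_score": {"merge": 4.214286, "auto": 2.392857, "sentence": 3.964286},
    "mean_relevancy_score": {"merge": 0.892857, "auto": 0.392857, "sentence": 0.892857},
    "mean_faithfulness_score": {"merge": 0.928571, "auto": 0.392857, "sentence": 0.857143},
    "mean_context_similarity_score": {"merge": 0.944713, "auto": 0.963153, "sentence": 0.945790},
}


def best_per_metric(scores):
    """Return, for each metric, the retriever(s) with the highest score."""
    winners = {}
    for metric, by_retriever in scores.items():
        top = max(by_retriever.values())
        winners[metric] = sorted(
            name for name, value in by_retriever.items() if value == top
        )
    return winners


winners = best_per_metric(scores)
for metric, names in winners.items():
    print(f"{metric}: {', '.join(names)}")
```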