# LlamaIndex Indexing Strategy Comparison
Comparing RAG responses across different LlamaIndex indexing strategies:
- VectorStore Index
- Summary Index
- Tree Index
- Keyword Table Index
- Property Graph Index

The comparison is run on two types of datasets:
1) Essay - with queries closer to Q&A tasks
2) Code - for a code generation task

All indices use their default settings, and the LLM is `gpt-3.5-turbo-0125` with `temperature=0`.

In [1]:
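# Using llama-index-core (0.10.68)
# The OpenAI embedding integration imported below ships as its own package
%pip install --user -qU llama-index-llms-openai llama-index-embeddings-openai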

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass()

In [3]:
# create an embedding model and embed a sample sentence
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
)

In [4]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

embed_model = OpenAIEmbedding(embed_batch_size=10)
Settings.embed_model = embed_model
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo-0125")

# Essay Dataset
Using the Paul Graham Essay Dataset from LlamaIndex

https://llamahub.ai/l/llama_datasets/Paul%20Graham%20Essay?from=llama_datasets


In [5]:
from llama_index.core.llama_dataset import download_llama_dataset

# download and install dependencies
essay_dataset, essay_documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

In [6]:
essay_dataset.to_pandas()[:5]

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"In the essay, the author mentions his early ex...",[What I Worked On\n\nFebruary 2021\n\nBefore c...,The first computer the author used for program...,ai (gpt-4),ai (gpt-4)
1,The author switched his major from philosophy ...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,The two specific influences that led the autho...,ai (gpt-4),ai (gpt-4)
2,"In the essay, the author discusses his initial...",[I couldn't have put this into words when I wa...,The two main influences that initially drew th...,ai (gpt-4),ai (gpt-4)
3,The author mentions his shift of interest towa...,[I couldn't have put this into words when I wa...,The author shifted his interest towards Lisp a...,ai (gpt-4),ai (gpt-4)
4,"In the essay, the author mentions his interest...",[So I looked around to see what I could salvag...,"The author in the essay is Paul Graham, who wa...",ai (gpt-4),ai (gpt-4)


In [7]:
# keep `essay_dataset` as the LabelledRagDataset; work with a separate DataFrame
essay_df = essay_dataset.to_pandas()

In [8]:
test_query = essay_df.loc[0, 'query']
print(f'Test query: \n{test_query}')
print()
test_answer = essay_df.loc[0, 'reference_answer']
print(f'Test answer: \n{test_answer}')
print()
test_reference = essay_df.loc[0, 'reference_contexts']
print(f'Test reference: \n{test_reference}')

Test query: 
In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced.

Test answer: 
The first computer the author used for programming was the IBM 1401, which was used by his school district for data processing. He started using it in 9th grade, around the age of 13 or 14. The programming language he used was an early version of Fortran. The author faced several challenges while using this computer. The only form of input to programs was data stored on punched cards, and he didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but he didn't know enough math to do anything interesting of that type. Therefore, he couldn't figure out what to do with it and in retrospect, he believes there's not much he could have done with it.

Test reference: 
['What I Worked On\n\nFebru

## Vectorstore Index
https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_guide/


In [9]:
from llama_index.core import VectorStoreIndex
vectorstore_index = VectorStoreIndex.from_documents(essay_documents)
vectorstore_query_engine = vectorstore_index.as_query_engine()

response = vectorstore_query_engine.query(
    test_query
)

In [10]:
print(response)

The author's first experience with programming was on an IBM 1401 in 9th grade. The computer was located in the basement of his junior high school. The language he used was an early version of Fortran. One of the challenges he faced was the limited input options for programs, as the only form of input was data stored on punched cards, which he did not have access to. This limitation led him to struggle with finding meaningful tasks to perform with the computer.
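
Conceptually, a vector store index embeds every chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with plain cosine similarity. The three-dimensional vectors and the `top_k` helper are made up for illustration, and `k=2` is an assumption; this is not LlamaIndex's implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones come from the embedding model.
toy_chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

for index, score in top_k(toy_query, toy_chunks, k=2):
    print(index, round(score, 3))
```

The retrieved chunks are then passed to the LLM as context for the final answer.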


## Summary Index (`DocumentSummaryIndex`)
https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/


In [11]:
from llama_index.core import DocumentSummaryIndex
document_summary_index = DocumentSummaryIndex.from_documents(essay_documents)
document_summary_query_engine = document_summary_index.as_query_engine()

response = document_summary_query_engine.query(
    test_query
)

current doc id: 8c929378-8da2-4a28-8369-dcea88d4018e


In [12]:
print(response)

The author's first experience with programming was on an IBM 1401 computer in 9th grade. The language used was an early version of Fortran. The challenges he faced included not being able to figure out what to do with the computer, the limited input options through punched cards, and the lack of stored data for input. This led to difficulties in creating meaningful programs due to the constraints of the system.


## Tree Index

https://docs.llamaindex.ai/en/stable/api_reference/indices/tree/#llama_index.core.indices.TreeIndex


In [13]:
from llama_index.core import TreeIndex
tree_index = TreeIndex.from_documents(essay_documents)
tree_index_query_engine = tree_index.as_query_engine()

response = tree_index_query_engine.query(
    test_query
)

In [14]:
print(response)

The author's first experience with programming was on an IBM 1401 computer in 9th grade. The language used was an early version of Fortran. The main challenge he faced was the limited input options as the only form of input was data stored on punched cards, which he did not have access to. This limitation made it difficult for him to create programs that could perform meaningful tasks.
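
A tree index arranges chunks into a hierarchy in which each parent node holds a summary of its children. The sketch below shows one way such a bottom-up build could look. The `Node` class and `build_tree` function are hypothetical, and `summarize` is a stand-in that simply joins texts where a real index would call the LLM; it is not LlamaIndex's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)


def build_tree(
    leaf_texts: List[str],
    summarize: Callable[[List[str]], str],
    num_children: int = 2,
) -> Node:
    """Group nodes level by level until a single root remains."""
    if not leaf_texts:
        raise ValueError("need at least one leaf")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    level = [Node(text) for text in leaf_texts]
    while len(level) > 1:
        parents = []
        for start in range(0, len(level), num_children):
            group = level[start : start + num_children]
            parents.append(Node(summarize([n.text for n in group]), group))
        level = parents
    return level[0]


def depth(node: Node) -> int:
    """Number of levels from this node down to the deepest leaf."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Stand-in summarizer: joins the child texts instead of calling an LLM.
root = build_tree(["a", "b", "c", "d"], summarize=" | ".join, num_children=2)
print(root.text, depth(root))
```

At query time, a tree like this can be walked from the root towards the leaves, choosing the most relevant child at each level.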


## Keyword Table Index
https://docs.llamaindex.ai/en/stable/api_reference/indices/keyword/#llama_index.core.indices.KeywordTableIndex


In [15]:
from llama_index.core import KeywordTableIndex
keyword_table_index = KeywordTableIndex.from_documents(essay_documents)
keyword_table_index_query_engine = keyword_table_index.as_query_engine()

response = keyword_table_index_query_engine.query(
    test_query
)

In [16]:
print(response)

The author's first experience with programming was on an IBM 1401 computer in 9th grade. The language used was an early version of Fortran. The author faced challenges due to the limited input options, as the only form of input was data stored on punched cards, and the author did not have any data stored on punched cards. This limited the author's ability to create meaningful programs on the IBM 1401.
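
A keyword table index maps keywords to the chunks that contain them and retrieves chunks that share keywords with the query. The sketch below illustrates the lookup with a regex tokenizer and a tiny stopword list, both of which are assumptions standing in for whatever keyword extraction the real index performs. The two toy sentences are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Small illustrative stopword list (an assumption, not taken from the library).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "was", "were"}


def extract_keywords(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def build_keyword_table(chunks: List[str]) -> Dict[str, Set[int]]:
    """Map each keyword to the indices of the chunks that contain it."""
    table: Dict[str, Set[int]] = defaultdict(set)
    for index, chunk in enumerate(chunks):
        for keyword in extract_keywords(chunk):
            table[keyword].add(index)
    return table


def retrieve(table: Dict[str, Set[int]], query: str, top_n: int = 1) -> List[int]:
    """Rank chunks by how many query keywords they share."""
    counts: Dict[int, int] = defaultdict(int)
    for keyword in extract_keywords(query):
        for index in table.get(keyword, ()):
            counts[index] += 1
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:top_n]


toy_chunks = [
    "The first programs were written on the IBM 1401 in an early version of Fortran.",
    "Lisp was first implemented on the IBM 704.",
]
table = build_keyword_table(toy_chunks)
print(retrieve(table, "Which computer ran Fortran programs?"))
```

Because matching is purely lexical, a query that shares no keywords with a relevant chunk retrieves nothing from it.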


## Property Graph Index
https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/


In [17]:
import nest_asyncio

nest_asyncio.apply()

In [18]:
from llama_index.core import PropertyGraphIndex
property_graph_index = PropertyGraphIndex.from_documents(essay_documents)
property_graph_index_query_engine = property_graph_index.as_query_engine()

response = property_graph_index_query_engine.query(
    test_query
)

In [19]:
print(response)

The author mentions his early experiences with programming on the IBM 704 machine language. He used Lisp for programming, which was initially intended as a formal model of computation rather than a traditional programming language. The challenges he faced included the limitations of McCarthy's original Lisp interpreter, which lacked many features needed in a programming language. McCarthy had to test the interpreter by hand-simulating program executions due to the lack of powerful enough computers at the time.
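
Unlike the other four responses, this one talks about the IBM 704 and Lisp rather than the IBM 1401 and Fortran named in the reference answer. A quick way to spot such misses is to check each response for a few key facts. The sketch below does this with a case-insensitive substring match; the `key_facts` list is hand-picked from the reference answer, `missing_facts` is a hypothetical helper, and the response strings are excerpts of the outputs above.

```python
from typing import Dict, List

# Key facts hand-picked from the reference answer (an assumption about what matters).
key_facts = ["IBM 1401", "Fortran", "punched cards"]

# Excerpts of two of the responses printed above.
responses: Dict[str, str] = {
    "vector_store": (
        "The author's first experience with programming was on an IBM 1401 in 9th grade. "
        "The language he used was an early version of Fortran. "
        "One of the challenges he faced was the limited input options for programs, "
        "as the only form of input was data stored on punched cards, "
        "which he did not have access to."
    ),
    "property_graph": (
        "The author mentions his early experiences with programming on the "
        "IBM 704 machine language. He used Lisp for programming, which was "
        "initially intended as a formal model of computation rather than a "
        "traditional programming language."
    ),
}


def missing_facts(response: str, facts: List[str]) -> List[str]:
    """Return the facts that do not appear in the response (case-insensitive)."""
    lowered = response.lower()
    return [fact for fact in facts if fact.lower() not in lowered]


for name, text in responses.items():
    print(name, "is missing:", missing_facts(text, key_facts))
```

Substring matching is crude, so this is only a first filter before a proper evaluation such as the LLM-as-a-judge approach.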


# Code Dataset

reference: https://code-rag-bench.github.io/

RAG with canonical datastore
- target repository

RAG with open datastore
- the research paper suggests that retrieving from a larger datastore consisting of documents from different sources could unlock the effectiveness of code RAG.

According to the paper, there are 4 types of code generation scenarios:
1. Basic Programming: interview-style problems, code completion
2. Open-domain: requires libraries beyond the standard Python library (e.g. `pandas`, web requests)
3. Repository-level: requires editing files in the context of an entire GitHub repository
4. Code-retrieval: code search/retrieval tasks that measure retrieval quality

For simplicity, we will test on the first scenario (Basic Programming).

In [20]:
import pandas as pd

# Knowledge base comprising programming solutions for the HumanEval and MBPP datasets 
programming_solutions = pd.read_json("hf://datasets/code-rag-bench/programming-solutions/programming_solutions.json")

In [21]:
programming_solutions.head(5)

Unnamed: 0,title,text,meta
0,has_close_elements,from typing import List\n\n\ndef has_close_ele...,"{'task_name': 'humaneval', 'task_id': 'HumanEv..."
1,separate_paren_groups,from typing import List\n\n\ndef separate_pare...,"{'task_name': 'humaneval', 'task_id': 'HumanEv..."
2,truncate_number,\n\ndef truncate_number(number: float) -> floa...,"{'task_name': 'humaneval', 'task_id': 'HumanEv..."
3,below_zero,from typing import List\n\n\ndef below_zero(op...,"{'task_name': 'humaneval', 'task_id': 'HumanEv..."
4,mean_absolute_deviation,from typing import List\n\n\ndef mean_absolute...,"{'task_name': 'humaneval', 'task_id': 'HumanEv..."


In [22]:
from llama_index.core import Document

text_list = programming_solutions['text'].tolist()
code_documents = [Document(text=t) for t in text_list]
print(code_documents[0])

Doc ID: 55a5943a-9a2f-4216-9400-0d706307a457
Text: from typing import List   def has_close_elements(numbers:
List[float], threshold: float) -> bool:     """ Check if in given list
of numbers, are any two numbers closer to each other than     given
threshold.     >>> has_close_elements([1.0, 2.0, 3.0], 0.5)     False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)     True
"""...


In [33]:
print(len(code_documents))

1128


In [23]:
import pandas as pd

humaneval_dataset = pd.read_json("hf://datasets/code-rag-bench/humaneval/humaneval.json", lines=True)

In [24]:
humaneval_dataset.head(5)

Unnamed: 0,task_id,prompt,canonical_solution,test,entry_point,docs
0,HumanEval/0,from typing import List\n\n\ndef has_close_ele...,"for idx, elem in enumerate(numbers):\n ...","\n\nMETADATA = {\n 'author': 'jt',\n 'da...",has_close_elements,[{'text': 'from typing import List def has_c...
1,HumanEval/1,from typing import List\n\n\ndef separate_pare...,result = []\n current_string = []\n ...,"\n\nMETADATA = {\n 'author': 'jt',\n 'da...",separate_paren_groups,[{'text': 'from typing import List def separ...
2,HumanEval/2,\n\ndef truncate_number(number: float) -> floa...,return number % 1.0\n,"\n\nMETADATA = {\n 'author': 'jt',\n 'da...",truncate_number,[{'text': ' def truncate_number(number: float...
3,HumanEval/3,from typing import List\n\n\ndef below_zero(op...,balance = 0\n\n for op in operations:\n...,"\n\nMETADATA = {\n 'author': 'jt',\n 'da...",below_zero,[{'text': 'from typing import List def below...
4,HumanEval/4,from typing import List\n\n\ndef mean_absolute...,mean = sum(numbers) / len(numbers)\n re...,"\n\nMETADATA = {\n 'author': 'jt',\n 'da...",mean_absolute_deviation,[{'text': 'from typing import List def mean_...


In [29]:
test_query = 'Please complete the following function:\n\n' + humaneval_dataset.loc[0, 'prompt']
print(f'Test query: \n{test_query}')
print()
test_answer = humaneval_dataset.loc[0, 'canonical_solution']
print(f'Test answer: \n{test_answer}')
print()
test_reference = humaneval_dataset.loc[0, 'docs'][0]['text']
print(f'Test reference: \n{test_reference}')

Test query: 
Please complete the following function:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """


Test answer: 
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False


Test reference: 
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4

## Vectorstore Index

In [30]:
from llama_index.core import VectorStoreIndex
vectorstore_index = VectorStoreIndex.from_documents(code_documents)
vectorstore_query_engine = vectorstore_index.as_query_engine()

response = vectorstore_query_engine.query(
    test_query
)

In [31]:
print(response)

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False


## Summary Index (`DocumentSummaryIndex`)

In [43]:
from llama_index.core import DocumentSummaryIndex
document_summary_index = DocumentSummaryIndex.from_documents(code_documents)
document_summary_query_engine = document_summary_index.as_query_engine()

response = document_summary_query_engine.query(
    test_query
)

current doc id: 55a5943a-9a2f-4216-9400-0d706307a457
current doc id: 130aaed6-3876-4b81-82ed-98fff99cbf00
current doc id: 1c629d98-29ec-4e18-9fa2-49411be94f75
current doc id: 6bc5d325-4610-459b-a86d-c1c8da9ede6c
current doc id: b148e93f-1d21-44b9-af5a-7bcd2747132d
current doc id: d542249c-42ab-4602-b289-a34773d3ee8b
current doc id: 86721095-dce0-4908-8ce5-1b38db83b7a3
current doc id: be51563a-c0a1-4968-bb3f-9d19a2f5a808
current doc id: d93bc3ce-fae6-4534-92d6-367418ba4f75
current doc id: 7ec85c01-d583-46ba-9c8a-250ef2acc895
current doc id: 1fd320d2-ea54-4801-98df-222bd8f200f0
current doc id: 6509b4da-1894-481e-9e8e-a4c5fc2468d2
current doc id: e8de2602-15db-4a52-9aba-7502d95ba81c
current doc id: 4eed625f-676e-443e-be53-0d720a1708b1
current doc id: fa17512a-14eb-4a4c-a4cb-afa656d01b3f
current doc id: 3e7b5fe5-757a-4f45-ac4e-c7c1b0bf8320
current doc id: 435368e0-9149-4e39-9ce8-de0ed171ddf5
current doc id: c7cb121b-2dcd-4c7d-9161-4265c38b0971
current doc id: 08671989-5d70-444b-8459-d19e7d

In [44]:
print(response)

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
```


## Tree Index

In [36]:
from llama_index.core import TreeIndex
tree_index = TreeIndex.from_documents(code_documents)
tree_index_query_engine = tree_index.as_query_engine()

response = tree_index_query_engine.query(
    test_query
)

In [37]:
print(response)

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
```


## Keyword Table Index

In [38]:
from llama_index.core import KeywordTableIndex
keyword_table_index = KeywordTableIndex.from_documents(code_documents)
keyword_table_index_query_engine = keyword_table_index.as_query_engine()

response = keyword_table_index_query_engine.query(
    test_query
)

In [39]:
print(response)

```python
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False
```


## Property Graph Index

In [40]:
import nest_asyncio

nest_asyncio.apply()

In [45]:
from llama_index.core import PropertyGraphIndex
property_graph_index = PropertyGraphIndex.from_documents(code_documents)
property_graph_index_query_engine = property_graph_index.as_query_engine()

response = property_graph_index_query_engine.query(
    test_query
)

In [46]:
print(response)

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for i in range(len(numbers)):
        for j in range(i+1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False
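
The Property Graph Index response uses a different loop structure from the canonical solution, so comparing text is not enough to judge it. A minimal sketch of a behavioral check follows: it runs both variants on the two doctest cases from the prompt. The function names `has_close_elements_nested` and `has_close_elements_pairs` and the `passes_all` helper are introduced here for illustration only. Two cases are far from a full test suite.

```python
from typing import Callable, List, Tuple


def has_close_elements_nested(numbers: List[float], threshold: float) -> bool:
    # Variant returned by most indices (matches the canonical solution).
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False


def has_close_elements_pairs(numbers: List[float], threshold: float) -> bool:
    # Variant returned by the Property Graph Index query engine.
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False


# The two doctest examples from the HumanEval prompt.
doctest_cases: List[Tuple[Tuple[List[float], float], bool]] = [
    (([1.0, 2.0, 3.0], 0.5), False),
    (([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3), True),
]


def passes_all(fn: Callable[[List[float], float], bool]) -> bool:
    """True if fn returns the expected value for every doctest case."""
    return all(fn(*args) == expected for args, expected in doctest_cases)


print(passes_all(has_close_elements_nested), passes_all(has_close_elements_pairs))
```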


# Predictions (Not in Use)

**NOTE**: The rest of the notebook illustrates how to manually perform predictions and subsequent evaluations for demonstrative purposes. Alternatively, you can use the `RagEvaluatorPack`, which takes care of predicting and evaluating with a RAG system that you provide.

In [None]:
from llama_index.core import VectorStoreIndex

# a basic RAG pipeline, uses defaults
index = VectorStoreIndex.from_documents(documents=essay_documents)
query_engine = index.as_query_engine()

You can now create predictions and perform evaluation manually or download the `RagEvaluatorPack` to do this for you in a single line of code.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
# manually; `essay_dataset` must be the LabelledRagDataset, not a DataFrame
prediction_dataset = await essay_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)

In [None]:
prediction_dataset.to_pandas()[:5]

# Evaluation (Not in Use)

Now that we have our predictions, we can perform evaluations on two dimensions:

1. The generated response: how well the predicted response matches the reference answer.
2. The retrieved contexts: how well the retrieved contexts for the prediction match the reference contexts.

NOTE: For retrieved contexts, we cannot use standard retrieval metrics such as `hit rate` and `mean reciprocal rank`, because they require the same index that was used to generate the ground-truth data. A `LabelledRagDataset`, however, does not even need to be created from an index. We therefore use `semantic similarity` between the prediction's contexts and the reference contexts as a measure of goodness.
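
For reference, the two metrics mentioned in the note can be sketched as follows. Both compare the ids of the retrieved nodes with the id of the expected node, which is why they only make sense when the ids come from the same index. The helper names and the node ids (`n1`, `n2`, ...) are made up for illustration.

```python
from typing import Sequence


def hit_rate(retrieved: Sequence[Sequence[str]], expected: Sequence[str]) -> float:
    """Fraction of queries whose expected id appears among the retrieved ids."""
    hits = sum(1 for ids, gold in zip(retrieved, expected) if gold in ids)
    return hits / len(expected)


def mean_reciprocal_rank(
    retrieved: Sequence[Sequence[str]], expected: Sequence[str]
) -> float:
    """Average of 1/rank of the expected id (0 when it was not retrieved)."""
    total = 0.0
    for ids, gold in zip(retrieved, expected):
        if gold in ids:
            total += 1.0 / (list(ids).index(gold) + 1)
    return total / len(expected)


# Hypothetical retrieval results for three queries.
retrieved_ids = [["n1", "n7"], ["n4", "n2"], ["n9", "n3"]]
expected_ids = ["n1", "n2", "n5"]

print(hit_rate(retrieved_ids, expected_ids))
print(mean_reciprocal_rank(retrieved_ids, expected_ids))
```

In this toy case the expected node is found for two of the three queries, once at rank 1 and once at rank 2.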

In [None]:
import tqdm

For evaluating the response, we will use the LLM-As-A-Judge pattern. Specifically, we will use `CorrectnessEvaluator`, `FaithfulnessEvaluator` and `RelevancyEvaluator`.

For evaluating the goodness of the retrieved contexts we will use `SemanticSimilarityEvaluator`.

In [None]:
# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = RelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["faithfulness"] = FaithfulnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["semantic_similarity"] = SemanticSimilarityEvaluator()

Loop through the (`labelled_example`, `prediction`) pairs and perform the evaluations on each of them individually.

In [None]:
evals = {
    "correctness": [],
    "relevancy": [],
    "faithfulness": [],
    "context_similarity": [],
}

for example, prediction in tqdm.tqdm(
    zip(essay_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = judges["correctness"].evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )

    relevancy_result = judges["relevancy"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    faithfulness_result = judges["faithfulness"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    semantic_similarity_result = judges["semantic_similarity"].evaluate(
        query=example.query,
        response="\n".join(prediction.contexts),
        reference="\n".join(example.reference_contexts),
    )

    evals["correctness"].append(correctness_result)
    evals["relevancy"].append(relevancy_result)
    evals["faithfulness"].append(faithfulness_result)
    evals["context_similarity"].append(semantic_similarity_result)

In [None]:
import json

# saving evaluations
evaluations_objects = {
    "context_similarity": [e.dict() for e in evals["context_similarity"]],
    "correctness": [e.dict() for e in evals["correctness"]],
    "faithfulness": [e.dict() for e in evals["faithfulness"]],
    "relevancy": [e.dict() for e in evals["relevancy"]],
}

with open("evaluations.json", "w") as json_file:
    json.dump(evaluations_objects, json_file)
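
The saved evaluations can also be summarized without any library helpers. The sketch below averages the `score` field per metric and skips results without a score. It assumes each saved result is a dict with a `score` key; the `mean_scores` helper and the sample values are hypothetical and are not results from this notebook.

```python
from statistics import mean
from typing import Dict, List, Optional


def mean_scores(evaluations: Dict[str, List[dict]]) -> Dict[str, Optional[float]]:
    """Average the 'score' field per metric, ignoring missing or null scores."""
    summary: Dict[str, Optional[float]] = {}
    for metric, results in evaluations.items():
        scores = [r["score"] for r in results if r.get("score") is not None]
        summary[metric] = mean(scores) if scores else None
    return summary


# Made-up sample in the same shape as `evaluations_objects`.
sample_evaluations = {
    "correctness": [{"score": 4.0}, {"score": 5.0}],
    "faithfulness": [{"score": 1.0}, {"score": None}],
    "relevancy": [],
}

print(mean_scores(sample_evaluations))
```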

Now, we can use our notebook utility functions to view these evaluations.

In [None]:
import pandas as pd
from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    ["base_rag"] * len(evals["correctness"]),
    evals["correctness"],
    metric="correctness",
)
deep_eval_df, mean_relevancy_df = get_eval_results_df(
    ["base_rag"] * len(evals["relevancy"]),
    evals["relevancy"],
    metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
    ["base_rag"] * len(evals["faithfulness"]),
    evals["faithfulness"],
    metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
    ["base_rag"] * len(evals["context_similarity"]),
    evals["context_similarity"],
    metric="context_similarity",
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
        mean_context_similarity_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])

In [None]:
mean_scores_df

# Note: RagEvaluatorPack

On this toy example, we see that the basic RAG pipeline performs quite well against the evaluation benchmark (`essay_dataset`)! For completeness, to perform the above steps with the `RagEvaluatorPack` instead, use the code provided below:

In [None]:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine, rag_dataset=essay_dataset, show_progress=True
)

############################################################################
# NOTE: If you have a lower-tier OpenAI API subscription, such as Usage    #
# Tier 1, you'll need a different batch_size and sleep_time_in_seconds.    #
# For Usage Tier 1, settings that seemed to work well were batch_size=5    #
# and sleep_time_in_seconds=15 (as of December 2023).                      #
############################################################################

benchmark_df = await rag_evaluator.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)
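
To make the two parameters concrete, here is a minimal sketch of batched, rate-limited async calls. It assumes the pause happens between batches, which may differ from what the pack does internally. `run_in_batches` and `fake_api_call` are hypothetical names, and the fake call stands in for a real API request.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_in_batches(
    items: Sequence[int],
    worker: Callable[[int], Awaitable[int]],
    batch_size: int,
    sleep_time_in_seconds: float,
) -> List[int]:
    """Run worker over items, batch_size at a time, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[int] = []
    for start in range(0, len(items), batch_size):
        if start > 0:
            # Pause before every batch except the first to stay under rate limits.
            await asyncio.sleep(sleep_time_in_seconds)
        batch = items[start : start + batch_size]
        # Calls within a batch run concurrently; gather keeps input order.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_api_call(x: int) -> int:
    # Stand-in for a real API request.
    await asyncio.sleep(0)
    return x * 2


# In a notebook cell, use `await run_in_batches(...)` instead of asyncio.run.
doubled = asyncio.run(
    run_in_batches(list(range(5)), fake_api_call, batch_size=2, sleep_time_in_seconds=0)
)
print(doubled)
```

A smaller `batch_size` with a longer sleep trades total runtime for fewer requests per minute.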