## Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex
- https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5
- Colab
    - https://colab.research.google.com/drive/1LPvJyEON6btMpubYdwySfNs0FuNR9nza?usp=sharing

# 1. Setup

In [3]:
!pip install llama-index pypdf

Collecting llama-index
  Downloading llama_index-0.9.3.post1-py3-none-any.whl.metadata (8.2 kB)
Collecting aiostream<0.6.0,>=0.5.2 (from llama-index)
  Downloading aiostream-0.5.2-py3-none-any.whl.metadata (9.9 kB)
Collecting beautifulsoup4<5.0.0,>=4.12.2 (from llama-index)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl.metadata (22 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting fsspec>=2023.5.0 (from llama-index)
  Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx (from llama-index)
  Downloading httpx-0.25.1-py3-none-any.whl.metadata (7.1 kB)
Collecting nest-asyncio<2.0.0,>=1.5.8 (from llama-in

In [5]:
import nest_asyncio

nest_asyncio.apply()

from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator
)
from llama_index.llms import OpenAI

import openai
import time
openai.api_key = ""  # placeholder: set your OpenAI API key here
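
Hard-coding the key in a notebook makes it easy to leak. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helper `load_openai_api_key` is hypothetical and not part of the original notebook:

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key


# Usage (assumes the variable is set in the shell that launched the notebook):
# openai.api_key = load_openai_api_key()
```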

# 2. Download Data

In [6]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'

--2023-11-18 02:50:24--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/10k/uber_2021.pdf’


2023-11-18 02:50:24 (201 MB/s) - ‘data/10k/uber_2021.pdf’ saved [1880483/1880483]



# 3. Load Data

In [7]:
# Load Data

reader = SimpleDirectoryReader("./data/10k/")
documents = reader.load_data()

In [9]:
print(len(documents))

307


# 4. Question Generation

## Sample: Question Generation
- The output below shows the questions generated for the first page of the document.


![uber_1st_page.png](img/uber_1st_page.png)

In [13]:
# As a sample, generate up to 40 questions from the first page only.
eval_documents = documents[:1]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num = 40)

In [14]:
eval_questions

['What is the file type of the document "uber_2021.pdf"?',
 'When was the document "uber_2021.pdf" last accessed?',
 'What is the address of the principal executive offices of Uber Technologies, Inc.?',
 "What is the trading symbol for Uber's common stock?",
 'Is Uber Technologies, Inc. a well-known seasoned issuer?',
 'Has Uber Technologies, Inc. filed all reports required by the Securities Exchange Act of 1934 in the past 12 months?',
 'Has Uber Technologies, Inc. submitted every Interactive Data File required by Rule 405 of Regulation S-T in the past 12 months?',
 'What is the state of incorporation or organization for Uber Technologies, Inc.?',
 'What is the file size of "uber_2021.pdf"?',
 'Is Uber Technologies, Inc. a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company?']

## Generating Questions for the First 20 Pages

In [30]:
# To evaluate each chunk size, we first generate a set of 50 questions from the first 20 pages.
eval_documents = documents[:20]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num = 50)

In [31]:
print(len(eval_questions))
eval_questions

50


['What is the file type of the document "uber_2021.pdf"?',
 'When was the document "uber_2021.pdf" last accessed?',
 'What is the address of the principal executive offices of Uber Technologies, Inc.?',
 "What is the trading symbol for Uber's common stock?",
 'Is Uber Technologies, Inc. a well-known seasoned issuer?',
 'Has Uber Technologies, Inc. filed all reports required by the Securities Exchange Act of 1934 in the past 12 months?',
 'Has Uber Technologies, Inc. submitted every Interactive Data File required by Rule 405 of Regulation S-T in the past 12 months?',
 'What is the state of incorporation or organization for Uber Technologies, Inc.?',
 'What is the file size of "uber_2021.pdf"?',
 'Is Uber Technologies, Inc. a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company?',
 'What is the file type of the document "uber_2021.pdf"?',
 'When was the document "uber_2021.pdf" last accessed?',
 'Is Uber consid
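
The generated list above contains repeated questions (the file-type and last-accessed questions each appear twice), so some questions are counted more than once in the averages computed later. A minimal sketch for removing repeats while preserving order; the helper `deduplicate_questions` is hypothetical and was not used in the original run:

```python
from typing import Iterable, List


def deduplicate_questions(questions: Iterable[str]) -> List[str]:
    """Drop repeated questions, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for question in questions:
        # Normalise whitespace and case so trivially different copies match.
        key = " ".join(question.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


sample = [
    'What is the file type of the document "uber_2021.pdf"?',
    "What is the trading symbol for Uber's common stock?",
    'What is the file type of the document "uber_2021.pdf"?',
]
print(deduplicate_questions(sample))
```

In the notebook this would be applied as `eval_questions = deduplicate_questions(eval_questions)` before running the evaluation.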

# 5. Setting Up Evaluation
- OpenAI Pricing Page
    - gpt-4: $0.03 / 1K input tokens, $0.06 / 1K output tokens
    - gpt-3.5-turbo-1106: $0.0010 / 1K input tokens, $0.0020 / 1K output tokens
    - gpt-3.5-turbo-instruct: $0.0015 / 1K input tokens, $0.0020 / 1K output tokens
    - https://openai.com/pricing
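
To see why the choice of evaluation model matters for cost, here is a rough estimate built from the prices listed above, assuming the two figures per model are the input and output prices. The workload numbers (100 evaluation calls, about 2,000 input tokens and 10 output tokens each) are illustrative assumptions, not measurements from this notebook:

```python
# Prices in USD per 1K tokens as (input, output), copied from the list above.
PRICES_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo-1106": (0.0010, 0.0020),
    "gpt-3.5-turbo-instruct": (0.0015, 0.0020),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


# Hypothetical workload: 50 questions x 2 evaluators = 100 evaluation calls.
calls = 100
tokens_in_per_call = 2000   # assumed prompt size (question + context + answer)
tokens_out_per_call = 10    # assumed size of the short verdict

for model in PRICES_PER_1K:
    cost = estimate_cost(model, calls * tokens_in_per_call, calls * tokens_out_per_call)
    print(f"{model}: ${cost:.2f} per chunk size")
```

Because the evaluation prompt is dominated by input tokens, the input price drives the total under these assumptions.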

In [32]:
# We use gpt-3.5-turbo-instruct (rather than GPT-4) to evaluate the responses
gpt35_turbo_inst = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

# Define the service context for the evaluation LLM
service_context_gpt35_turbo_inst = ServiceContext.from_defaults(llm=gpt35_turbo_inst)

# Define Faithfulness and Relevancy Evaluators backed by gpt-3.5-turbo-instruct
faithfulness_gpt35_turbo_inst = FaithfulnessEvaluator(service_context=service_context_gpt35_turbo_inst)
relevancy_gpt35_turbo_inst = RelevancyEvaluator(service_context=service_context_gpt35_turbo_inst)


# 6. Response Evaluation for a Chunk Size

In [33]:
# Define a function that computes the average response time, faithfulness, and relevancy for a given chunk size.
# GPT-3.5-Turbo generates the responses; the gpt-3.5-turbo-instruct evaluators defined above judge them.
def evaluate_response_time_and_accuracy(chunk_size, eval_questions):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses generated by GPT-3.5-turbo for a given chunk size.

    The index is built from the global `eval_documents`.

    Parameters:
    chunk_size (int): The chunk size (in tokens) used when splitting documents for indexing.
    eval_questions (list[str]): The questions to run against the query engine.

    Returns:
    tuple: (average response time in seconds, average faithfulness, average relevancy).
    """

    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # create vector index
    llm = OpenAI(model="gpt-3.5-turbo")
    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
    vector_index = VectorStoreIndex.from_documents(
        eval_documents, service_context=service_context
    )
    # build query engine
    # By default, similarity_top_k is set to 2. To experiment with different values, pass it as an argument to as_query_engine()
    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    # Iterate over each question in eval_questions to compute metrics.
    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
    # we're using a loop here to specifically measure response time for different chunk sizes.
    for question in eval_questions:
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt35_turbo_inst.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy_gpt35_turbo_inst.evaluate_response(
            query=question, response=response_vector
        ).passing

        # To evaluate with GPT-4 instead, define GPT-4-based evaluators
        # (e.g. faithfulness_gpt4 and relevancy_gpt4, which are not defined in
        # this notebook) and use them in place of the two calls above.
        
        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
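
The measurement pattern in the function above (time only the generation call, then average boolean pass/fail verdicts into rates) can be tried without any API calls. A minimal sketch with stub functions standing in for `query_engine.query` and the two evaluators; `average_metrics` and the stubs are hypothetical names introduced for illustration. Unlike the original, it uses `time.perf_counter()` and rejects an empty question list instead of dividing by zero:

```python
import time
from typing import Callable, List, Sequence, Tuple


def average_metrics(
    questions: Sequence[str],
    answer_fn: Callable[[str], str],
    judge_fns: Sequence[Callable[[str, str], bool]],
) -> Tuple[float, List[float]]:
    """Return (mean latency in seconds, pass rates, one per judge)."""
    if not questions:
        raise ValueError("questions must not be empty")
    total_time = 0.0
    passes = [0] * len(judge_fns)
    for question in questions:
        start = time.perf_counter()
        answer = answer_fn(question)
        total_time += time.perf_counter() - start  # judging time is excluded
        for i, judge in enumerate(judge_fns):
            passes[i] += bool(judge(question, answer))  # True counts as 1
    n = len(questions)
    return total_time / n, [p / n for p in passes]


# Stubs standing in for the query engine and the two evaluators.
def fake_answer(question: str) -> str:
    return question.upper()


def always_pass(question: str, answer: str) -> bool:
    return True


def short_answers_only(question: str, answer: str) -> bool:
    return len(answer) < 10


latency, rates = average_metrics(
    ["short?", "a much longer question?"],
    fake_answer,
    [always_pass, short_answers_only],
)
print(f"latency={latency:.6f}s, pass rates={rates}")
```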

In [35]:
# Iterate over different chunk sizes and evaluate the metrics to help choose a chunk size.

# for chunk_size in [128, 256, 512, 1024, 2048]:
for chunk_size in [256, 512, 1024, 2048]:
    avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size, eval_questions)
    print(f"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")

Chunk size 256 - Average Response time: 1.30s, Average Faithfulness: 0.86, Average Relevancy: 0.84
Chunk size 512 - Average Response time: 1.52s, Average Faithfulness: 0.78, Average Relevancy: 0.82
Chunk size 1024 - Average Response time: 1.44s, Average Faithfulness: 0.88, Average Relevancy: 0.86
Chunk size 2048 - Average Response time: 1.50s, Average Faithfulness: 0.90, Average Relevancy: 0.88
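
One way to read these numbers is to rank the chunk sizes and check how much of the difference could be noise. The sketch below is an illustration, not part of the original notebook: the ranking rule (sum of faithfulness and relevancy, ties broken by response time) is an assumption, and the standard error treats each of the 50 questions as an independent pass/fail trial.

```python
import math

# (chunk_size, avg_response_time_s, avg_faithfulness, avg_relevancy) from the run above.
RESULTS = [
    (256, 1.30, 0.86, 0.84),
    (512, 1.52, 0.78, 0.82),
    (1024, 1.44, 0.88, 0.86),
    (2048, 1.50, 0.90, 0.88),
]
NUM_QUESTIONS = 50  # length of eval_questions in this run


def rank_chunk_sizes(results):
    """Sort by faithfulness + relevancy (descending), then response time (ascending)."""
    return sorted(results, key=lambda r: (-(r[2] + r[3]), r[1]))


def pass_rate_std_error(rate: float, n: int) -> float:
    """Binomial standard error of a pass rate measured on n questions."""
    return math.sqrt(rate * (1 - rate) / n)


best = rank_chunk_sizes(RESULTS)[0]
std_error = pass_rate_std_error(best[2], NUM_QUESTIONS)
print(f"Best chunk size under this rule: {best[0]}")
print(f"Standard error of its faithfulness: {std_error:.3f}")
```

With 50 questions, a 0.02 gap in a pass rate corresponds to a single question, and the standard error is about 0.04, so the gaps between chunk sizes 256, 1024, and 2048 are likely within noise. A larger question set would be needed to separate them with confidence.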
