# Segmentation Model Evaluation

This notebook only _evaluates_ the results of the different chunking strategies; it does _not_ generate the chunks or build the indexes.

To do this, you'll need to put the following files in the notebook's directory:

- `indexes.pkl`
- `docs.pkl`

# Basic setup

### Install prerequisites

In [12]:
!pip install -q pydantic llama-index
!pip install -q llama-index-embeddings-huggingface llama-index-embeddings-jinaai
!pip install -q llama-index-llms-huggingface "huggingface_hub[inference]"

# Load prebuilt docs and indexes

I pre-built the documents and indexes to keep Google Colab from timing out or running out of credits in the middle of an operation.

Upload the following to Colab:

- `docs.pkl` - Pickled `list` of `Document` objects
- `indexes.pkl` - Pickled `list` of `Index` objects

In [1]:
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Dict

In [2]:
class BlogPost(BaseModel):
  url: HttpUrl
  filename: str = ""
  text: str = ""
  text_short: str = ""
  markdown: str = "" # todo: convert markdown to plain text, store here
  questions: List = []
  chunks: dict = {} # populated by different chunking strategies later

In [3]:
from llama_index.core import VectorStoreIndex

class Index(BaseModel):
  name: str
  index: VectorStoreIndex
  questions: list[Dict[str, str]] = [] # store q and a here

  class Config:
    arbitrary_types_allowed = True

In [4]:
import pickle
import os

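# BlogPost and Index must already be defined (see above) for unpickling to work
with open("docs.pkl", "rb") as f:
    docs = pickle.load(f)

with open("indexes.pkl", "rb") as f:
    indexes = pickle.load(f)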

In [5]:
# check size looks okay
import glob

# Define the directory path
directory_path = './'

# Use glob to find all .pkl files in the directory
pkl_files = glob.glob(os.path.join(directory_path, "*.pkl"))

# Iterate through the files and print their sizes in MB
for file_path in pkl_files:
    file_size_bytes = os.path.getsize(file_path)  # Get the file size in bytes
    file_size_mb = file_size_bytes / (1024 * 1024)  # Convert bytes to MB
    print(f"File: {file_path}, Size: {file_size_mb:.2f} MB")

File: ./docs.pkl, Size: 0.15 MB
File: ./indexes.pkl, Size: 451.84 MB


## Ask questions to index

In [139]:
# from llama_index.core.query_engine import RetrieverQueryEngine
# from llama_index.core.retrievers import VectorIndexRetriever

# # note: response_synthesizer is not defined in this notebook; it has to be
# # created before this function is called

# def query_index(index, question, top_k=3):
#   # configure retriever
#   retriever = VectorIndexRetriever(
#       index=index.index,
#       similarity_top_k=top_k
#       )

#   # assemble query engine
#   query_engine = RetrieverQueryEngine(
#       retriever=retriever,
#       response_synthesizer=response_synthesizer,
#   )

#   answer = query_engine.query(question).response.strip()

#   # store the question and answer on the index for later evaluation
#   index.questions.append(
#       {"question": question,
#        "answer": answer
#       }
#   )
#   return answer

In [140]:
# # all questions in one list

# questions = []

# for doc in docs:
#   for question in doc.questions:
#     questions.append(question["question"])

In [142]:
# for index in indexes:
#   print(f"Asking {index.name}")
#   for question in questions:
#     # print(f"- {question}")
#     query_index(index, question)

Asking jina-segmenter-api
Asking segmentation-model
Asking langchain_semantic
Asking text-seg-lm


In [143]:
# pickle.dump(indexes, open("indexes.pkl", "wb"))



In [165]:
# [index.name for index in indexes]

['jina-segmenter-api',
 'segmentation-model',
 'langchain_semantic',
 'text-seg-lm']
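
Before evaluating, it can help to confirm that every question in `docs` has a stored answer in each index, since a missing answer would otherwise just be absent from the evaluation prompt. The following is a minimal sketch, assuming the object shapes defined above (`doc.questions` as a list of dicts with a `"question"` key, `index.questions` as a list of `{"question", "answer"}` dicts). `find_missing_answers` is a hypothetical helper, and the stand-in objects exist only to keep the example self-contained:

```python
from types import SimpleNamespace


def find_missing_answers(docs, indexes):
    """Return (index_name, question) pairs that have no stored answer."""
    missing = []
    for index in indexes:
        # Set of questions this index has already answered
        answered = {qa["question"] for qa in index.questions}
        for doc in docs:
            for q in doc.questions:
                if q["question"] not in answered:
                    missing.append((index.name, q["question"]))
    return missing


# Stand-in data mimicking the shapes of the pickled objects (illustrative only)
demo_docs = [SimpleNamespace(questions=[{"question": "Q1"}, {"question": "Q2"}])]
demo_indexes = [
    SimpleNamespace(
        name="index-a",
        questions=[
            {"question": "Q1", "answer": "A1"},
            {"question": "Q2", "answer": "A2"},
        ],
    ),
    SimpleNamespace(name="index-b", questions=[{"question": "Q1", "answer": "A1"}]),
]

print(find_missing_answers(demo_docs, demo_indexes))
# [('index-b', 'Q2')]
```

In this notebook the call would be `find_missing_answers(docs, indexes)`; an empty list means every index answered every question.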

## Evaluate answers

### Generate prompts

In [41]:
output_format = """
## Overall assessment

Explain how each model stacked up overall, which was best, which was worst

## QUESTION 1: QUESTION

### jina-segmenter-api: X points

answer from jina-segmenter-api

### segmentation-model: X points

answer from segmentation-model

### langchain_semantic: X points

answer from langchain_semantic

### text-seg-lm: X points

answer from text-seg-lm

### Comments

- comment 1
- comment 2
- comment 3
- comment 4
- etc

## QUESTION 2: QUESTION

### jina-segmenter-api: X points

answer from jina-segmenter-api

### segmentation-model: X points

answer from segmentation-model

### langchain_semantic: X points

answer from langchain_semantic

### text-seg-lm: X points

answer from text-seg-lm

### Comments

- comment 1
- comment 2
- comment 3
- comment 4
- etc

## QUESTION 3: QUESTION

### jina-segmenter-api: X points

answer from jina-segmenter-api

### segmentation-model: X points

answer from segmentation-model

### langchain_semantic: X points

answer from langchain_semantic

### text-seg-lm: X points

answer from text-seg-lm

### Comments

- comment 1
- comment 2
- comment 3
- comment 4
- etc
"""

In [57]:
def generate_evaluation_prompt(doc):
  evaluation_prompt = """
  You are an expert evaluator of RAG systems.

  Your job is to look at a given text and then look at several questions about the text.
  Each question has answers from four different RAG systems. You will evaluate these answers based on:
  - Accuracy (does it correctly answer the question based on original text)
  - Conciseness (does it get to the point? does it include any extraneous information)
  - Readability (how easy is it for an expert user to understand?)
  - Total points

  The RAG systems are called:
  - 'jina-segmenter-api'
  - 'segmentation-model'
  - 'langchain_semantic'
  - 'text-seg-lm'

  Give each RAG system a score out of ten for each answer it provides, taking into account all the above criteria.
  At the end of your output, include a nicely-formatted table with the scores for each RAG system on each question.
  Follow that with a bulleted summary of how each one stacks up.

  Use this output format:
  <output format begin>
  {}
  <output format end>

  Here is the text:
  <text begin>
  {}
  <text end>

  Here are the questions and answers:
  <questions and answers begin>
  {}
  <questions and answers end>
  """
  qna_text = ""

  for doc_question in doc.questions:
    # add question
    qna_text += f"Question: {doc_question['question']}\n"

    for index in indexes:
      # add each answer
      for idx_question in index.questions:
        if doc_question["question"] == idx_question["question"]:
          qna_text += f"{index.name}: {idx_question['answer']}\n"

  # build the prompt once, after all questions and answers have been collected
  prompt = evaluation_prompt.format(output_format, doc.text, qna_text)

  return prompt.strip()
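
The question-and-answer block assembled in `generate_evaluation_prompt` can also be built with a dictionary lookup instead of scanning every stored answer for every question. This is only a sketch under the same assumptions about the data shapes; `build_qna_text` is a hypothetical name, and unlike the nested loops it keeps just the last answer if an index stored the same question more than once:

```python
from types import SimpleNamespace


def build_qna_text(doc, indexes):
    """Format each question followed by one answer line per index."""
    # Map index name -> {question: answer} so each lookup is a dict access
    answers_by_index = {
        index.name: {qa["question"]: qa["answer"] for qa in index.questions}
        for index in indexes
    }
    lines = []
    for doc_question in doc.questions:
        question = doc_question["question"]
        lines.append(f"Question: {question}")
        for name, answers in answers_by_index.items():
            if question in answers:
                lines.append(f"{name}: {answers[question]}")
    # One trailing newline per line, matching the format used in the prompt
    return "".join(line + "\n" for line in lines)


# Stand-in objects (illustrative only)
demo_doc = SimpleNamespace(questions=[{"question": "What is X?"}])
demo_indexes = [
    SimpleNamespace(name="a", questions=[{"question": "What is X?", "answer": "X is 1"}]),
    SimpleNamespace(name="b", questions=[]),
]

print(build_qna_text(demo_doc, demo_indexes))
```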

In [58]:
eval_prompts = []

for doc in docs:
  prompt = generate_evaluation_prompt(doc)
  eval_prompts.append(prompt)
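
The `output_format` template above hard-codes three question blocks and four system names. If a document has a different number of questions, the template could be generated instead. A minimal sketch, assuming a hypothetical helper `build_output_format` and a `SYSTEM_NAMES` list that mirrors the index names (neither is part of the original notebook):

```python
SYSTEM_NAMES = [
    "jina-segmenter-api",
    "segmentation-model",
    "langchain_semantic",
    "text-seg-lm",
]


def build_output_format(system_names, num_questions=3):
    """Build the markdown answer template for the given systems and question count."""
    lines = [
        "## Overall assessment",
        "",
        "Explain how each model stacked up overall, which was best, which was worst",
    ]
    for n in range(1, num_questions + 1):
        lines += ["", f"## QUESTION {n}: QUESTION"]
        for name in system_names:
            lines += ["", f"### {name}: X points", "", f"answer from {name}"]
        lines += ["", "### Comments", ""]
        lines += [f"- comment {i}" for i in range(1, 5)]
        lines.append("- etc")
    return "\n" + "\n".join(lines) + "\n"


print(build_output_format(SYSTEM_NAMES, num_questions=1))
```

Passing `num_questions=len(doc.questions)` would keep the template in step with each document.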

In [59]:
len(eval_prompts)

4

In [60]:
eval_prompts[0]

'You are an expert evaluator of RAG systems.\n\n  Your job is to look at a given text and then look at several questions about the text.\n  Each question has answers from four different RAG systems. You will evaluate these answers based on:\n  - Accuracy (does it correctly answer the question based on original text)\n  - Conciseness (does it get to the point? does it include any extraneous information)\n  - Readability (how easy is it for an expert user to understand?)\n  - Total points\n\n  The RAG systems are called:\n  - \'jina-segmenter-api\'\n  - \'segmentation-model\'\n  - \'langchain_semantic\'\n  - \'text-seg-lm\'\n\n  Give each RAG system a score out of ten for each answer it provides, taking into account all the above criteria.\n  At the end of your output, include a nicely-formatted table with the scores for each RAG system on each question.\n  Follow that with a bulleted summary of how each one stacks up.\n\n  Use this output format:\n  <output format begin>\n  \n## Overall

In [76]:
with open("eval_prompt_example.md", "w") as file:
    file.write(eval_prompts[0])

### Ask LLM to evaluate

In [25]:
import getpass

if "HF_TOKEN" not in globals():
    HF_TOKEN = getpass.getpass("Enter your Hugging Face token: ")

In [2]:
!pip install -q llama-index-llms-openai

In [6]:
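import getpass
import os
import openai

# OPENAI_KEY is not defined earlier in this notebook;
# read it from the environment or prompt for it
OPENAI_KEY = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
openai.api_key = OPENAI_KEY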

In [9]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

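# stream_complete() takes a plain string; llm.stream() expects a PromptTemplate
# and raised "AttributeError: 'str' object has no attribute 'kwargs'" here
stream = llm.stream_complete("Hi, write a short story")

for r in stream:
    print(r.delta, end="")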

In [10]:
resp = OpenAI().complete("Paul Graham is ")
print(resp)

Retrying llama_index.llms.openai.base.OpenAI._chat in 0.5778508113232185 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}.
Retrying llama_index.llms.openai.base.OpenAI._chat in 0.6972810903863749 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}.


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [64]:
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1", token=HF_TOKEN, max_input_length=32768
)



In [65]:
def get_response(prompt):
    response = mixtral_llm.complete(prompt).text.strip()
    return response

In [68]:
del response

In [70]:
response = get_response(eval_prompts[1])
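
The cell above evaluates a single prompt. To cover every prompt in one pass, the responses could be written to Markdown files named by their position in `eval_prompts`. A minimal sketch, assuming a hypothetical helper `save_evaluations` that takes the response function as an argument, so it can be tried with a stand-in before spending API calls:

```python
from pathlib import Path


def save_evaluations(prompts, get_response, out_dir="segmenter-evaluations"):
    """Run get_response on each prompt and write the result to <out_dir>/<i>.md."""
    out_path = Path(out_dir)
    # Create the output directory if it does not exist yet
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, prompt in enumerate(prompts):
        target = out_path / f"{i}.md"
        target.write_text(get_response(prompt), encoding="utf-8")
        written.append(target)
    return written


# Usage in this notebook (makes one LLM call per prompt):
# save_evaluations(eval_prompts, get_response)
```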

In [71]:
response

"Here is the output:\n\n## Overall assessment\n\nThe 'langchain_semantic' model performed the best overall, providing accurate and concise answers with good readability. The 'text-seg-lm' model also performed well, but its answers were slightly less concise and readable compared to 'langchain_semantic'. The 'jina-segmenter-api' model struggled with providing accurate answers, while the 'segmentation-model' model provided accurate answers but with less concise and readable responses.\n\n## QUESTION 1: What are the challenges of the simple RAG pipeline of chunking-embedding-retrieving-generating in terms of long-distance contextual dependencies?\n\n### jina-segmenter-api: 3 points\n\nThe simple RAG pipeline of chunking-embedding-retrieving-generating may face challenges in handling long-distance contextual dependencies due to the independent processing of chunks in the chunking stage. This could potentially lead to loss of contextual information when the chunks are reassembled during the

In [72]:
!mkdir -p segmenter-evaluations

In [75]:
with open("./segmenter-evaluations/1.md", "w") as file:
    file.write(response)
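
Because the prompt asks for headings of the form `### <system>: X points`, the per-question scores can be pulled out of a saved response with a regular expression. This is a rough sketch, assuming the model followed the requested format; `parse_scores` is a hypothetical helper, and lines that deviate from the format (for example `3/10` instead of `3 points`) are simply skipped:

```python
import re

QUESTION_RE = re.compile(r"^## QUESTION (\d+):")
SCORE_RE = re.compile(r"^### (.+?): (\d+(?:\.\d+)?) points?\s*$")


def parse_scores(markdown):
    """Return {question_number: {system_name: score}} from an evaluation response."""
    scores = {}
    current_question = None
    for line in markdown.splitlines():
        q_match = QUESTION_RE.match(line)
        if q_match:
            current_question = int(q_match.group(1))
            scores.setdefault(current_question, {})
            continue
        s_match = SCORE_RE.match(line)
        # Ignore score-like headings that appear before any question heading
        if s_match and current_question is not None:
            scores[current_question][s_match.group(1)] = float(s_match.group(2))
    return scores


# Made-up response text in the requested format (illustrative only)
sample = (
    "## Overall assessment\n\nSummary text\n\n"
    "## QUESTION 1: What is X?\n\n"
    "### jina-segmenter-api: 3 points\n\nanswer\n\n"
    "### text-seg-lm: 7 points\n\nanswer\n\n"
    "### Comments\n\n- comment 1\n"
)

print(parse_scores(sample))
# {1: {'jina-segmenter-api': 3.0, 'text-seg-lm': 7.0}}
```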

In [51]:
eval_prompts[0]

'You are an expert evaluator of RAG systems.\n\n  Your job is to look at a given text and then look at several questions about the text.\n  Each question has answers from four different RAG systems. You will evaluate these answers based on:\n  - Accuracy (does it correctly answer the question based on original text)\n  - Conciseness (does it get to the point? does it include any extraneous information)\n  - Readability (how easy is it for an expert user to understand?)\n  - Total points\n\n  The RAG systems are called:\n  - \'jina-segmenter-api\'\n  - \'segmentation-model\'\n  - \'langchain_semantic\'\n  - \'text-seg-lm\'\n\n  Give each RAG system a score out of ten for each answer it provides, taking into account all the above criteria.\n  At the end of your output, include a nicely-formatted table with the scores for each RAG system on each question.\n  Follow that with a bulleted summary of how each one stacks up.\n\n  Here is the text:\n  <text begin>\n  Title: Jina ColBERT v2: Mul