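# Evaluating Chunk Size
In RAG, choosing the right chunk size is key to improving retrieval accuracy. The goal is to balance retrieval performance against response quality.
This article walks through the following steps:
1. Extract text from a PDF
2. Split the text into chunks of different sizes
3. Create an embedding for each chunk
4. Retrieve the chunks relevant to a query
5. Generate an answer from the retrieved chunks
6. Measure faithfulness and relevancy
7. Compare the results across chunk sizes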

# Setting Up the Environment

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI


In [2]:
client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key=os.getenv("SILLICONFLOW_API_KEY")
    )
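
`os.getenv` returns `None` when a variable is not set, so a missing or misspelled name (the cell above reads `SILLICONFLOW_API_KEY`, which has to match the name exported in the shell exactly) is easy to overlook. A minimal sketch of an explicit check follows; `require_env` and `DEMO_API_KEY` are hypothetical names introduced only for illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# DEMO_API_KEY is a hypothetical variable used only for this demonstration.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the client setup, the same helper could be used as `api_key=require_env("SILLICONFLOW_API_KEY")`, so that a missing key fails at this cell with a clear message.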

# Extracting Text

In [3]:
def extract_text_from_pdf(pdf_path):
    mypdf = fitz.open(pdf_path)
    all_text = ""
    for page in mypdf:
        all_text += page.get_text("text") + " "
    
    return all_text.strip()

pdf_path = "data/AI_Information.pdf"

extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


# Chunking the Extracted Text

In [12]:
def chunk_text(text, chunk_size, overlap):
    # The window step (chunk_size - overlap) must be positive,
    # otherwise range() raises an error or yields nothing.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    # Split the text with a sliding window
    print("An overlap of 1/5 of the chunk size is a common starting point, but it should be tuned for the data at hand")
    print("Chunking... chunk size: ", chunk_size, " overlap: ", overlap)
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i+chunk_size]
        chunks.append(chunk)
    return chunks
# Chunk sizes to evaluate: 128, 256, 512
chunk_sizes = [128, 256, 512]

text_chunks_dict = {size: chunk_text(extracted_text, size, size//5) 
                    for size in chunk_sizes}
for size, chunks in text_chunks_dict.items():
    print("-"*100)
    print(f"Chunk size: {size}, number of chunks: {len(chunks)}")
    print("First chunk: ", chunks[0])
    print("-"*100)



An overlap of 1/5 of the chunk size is a common starting point, but it should be tuned for the data at hand
Chunking... chunk size:  128  overlap:  25
An overlap of 1/5 of the chunk size is a common starting point, but it should be tuned for the data at hand
Chunking... chunk size:  256  overlap:  51
An overlap of 1/5 of the chunk size is a common starting point, but it should be tuned for the data at hand
Chunking... chunk size:  512  overlap:  102
----------------------------------------------------------------------------------------------------
Chunk size: 128, number of chunks: 326
First chunk:  Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers t
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Chunk size: 256, number of chunks: 164
First chunk:  Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. 
--------------------------------------------------------------
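
The chunk counts reported above follow directly from the window step, which is `chunk_size - overlap`: a text of length `n` yields `ceil(n / step)` chunks. The sketch below works through that arithmetic. `expected_chunk_count` is a hypothetical helper, and the text length of 33,500 characters is an assumption chosen only because it is consistent with the reported counts; the real length is not shown in the notebook.

```python
def expected_chunk_count(text_length, chunk_size, overlap):
    """Number of windows produced by range(0, text_length, chunk_size - overlap)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    # Integer ceiling division: one window starts at every multiple of step.
    return (text_length + step - 1) // step


ASSUMED_TEXT_LENGTH = 33_500  # assumption for illustration, not taken from the PDF

for size in (128, 256, 512):
    overlap = size // 5
    count = expected_chunk_count(ASSUMED_TEXT_LENGTH, size, overlap)
    print(f"chunk_size={size} overlap={overlap} step={size - overlap} chunks={count}")
```

Under this assumed length the counts come out as 326, 164, and 82. Because the last window starts at the final multiple of the step, the tail chunk can be much shorter than `chunk_size` and may consist mostly of text already covered by the previous chunk.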

# Creating Embeddings for the Text Chunks

In [13]:
from typing import List
from tqdm import tqdm
# Candidate embedding models and their context lengths:
#   BAAI/bge-m3              context length: 8k
#   BAAI/bge-large-en-v1.5   context length: 0.5k (512 tokens)
def create_embeddings(chunks: List[str], model="BAAI/bge-large-en-v1.5"):
    """Create embeddings for a list of text chunks.

    Args:
        chunks (List[str]): the text chunks to embed
        model (str): name of the embedding model

    Returns:
        List[np.ndarray]: one embedding vector per chunk
    """
    print("Creating embeddings for text chunks..., number of chunks: ", len(chunks))

    # Send the chunks in batches of at most MAX_BATCH_SIZE inputs per request
    MAX_BATCH_SIZE = 32
    all_embeddings = []
    for start in range(0, len(chunks), MAX_BATCH_SIZE):
        batch = chunks[start:start + MAX_BATCH_SIZE]
        response = client.embeddings.create(model=model, input=batch)
        all_embeddings.extend(np.array(item.embedding) for item in response.data)
    return all_embeddings


chunk_embeddings_dict = {size: create_embeddings(chunks) for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")}
for size, embeddings in chunk_embeddings_dict.items():
    print(f"size of chunk embeddings for chunk-size {size} is {len(embeddings)}")





Generating Embeddings:   0%|                                                                                                                     | 0/3 [00:00<?, ?it/s]

Creating embeddings for text chunks..., number of chunks:  326


Generating Embeddings:  33%|████████████████████████████████████▎                                                                        | 1/3 [00:12<00:24, 12.24s/it]

Creating embeddings for text chunks..., number of chunks:  164


Generating Embeddings:  67%|████████████████████████████████████████████████████████████████████████▋                                    | 2/3 [00:16<00:07,  7.38s/it]

Creating embeddings for text chunks..., number of chunks:  82


Generating Embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:18<00:00,  6.23s/it]

size of chunk embeddings for chunk-size 128 is 326
size of chunk embeddings for chunk-size 256 is 164
size of chunk embeddings for chunk-size 512 is 82
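
`create_embeddings` sends the chunks to the API in groups of at most 32. The batching logic can be looked at on its own, without any network call. The sketch below uses a hypothetical helper, `iter_batches`, and plain integers in place of chunk texts, to show how many requests each chunk size would need.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Chunk counts reported above, with the batch limit of 32 used in create_embeddings.
for n_chunks in (326, 164, 82):
    batch_lengths = [len(batch) for batch in iter_batches(list(range(n_chunks)), 32)]
    print(f"{n_chunks} chunks -> {len(batch_lengths)} requests, last batch has {batch_lengths[-1]}")
```

Smaller chunks mean more embedding requests for the same document (11 requests for 326 chunks versus 3 for 82), which is one of the costs to weigh when comparing chunk sizes.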





In [15]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
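
Cosine similarity compares direction only, so scaling a vector does not change the score. One edge case is worth noting: if either vector has zero norm, the formula divides by zero. The dependency-free sketch below illustrates both points; `cosine_similarity_safe` is a hypothetical variant, and returning `0.0` for a zero vector is an assumed convention, not something the notebook specifies.

```python
import math


def cosine_similarity_safe(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for the degenerate case
    return dot / (norm_a * norm_b)


print(cosine_similarity_safe([1.0, 0.0], [2.0, 0.0]))   # same direction, different length
print(cosine_similarity_safe([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity_safe([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
print(cosine_similarity_safe([0.0, 0.0], [1.0, 1.0]))   # zero vector
```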


In [16]:
def retrieve_relative_chunks(query, text_chunks, chunk_embeddings, top_k=5):
    # Embed the query with the same model that embedded the chunks
    query_embedding = create_embeddings([query])[0]
    similarities = [cosine_similarity(query_embedding, chunk_embedding) for chunk_embedding in chunk_embeddings]
    # Indices of the top_k most similar chunks, best match first
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [text_chunks[i] for i in top_indices]
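
The index expression `np.argsort(similarities)[-top_k:][::-1]` sorts the indices by ascending score, keeps the last `top_k`, and reverses them so the best match comes first. The same ranking can be sketched without NumPy. `top_k_indices`, the scores, and the chunk labels below are made up for illustration.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


scores = [0.12, 0.87, 0.40, 0.73, 0.05]  # made-up similarity scores
chunks = ["c0", "c1", "c2", "c3", "c4"]  # placeholder chunk texts

best = top_k_indices(scores, k=2)
print(best, [chunks[i] for i in best])
```

One difference to be aware of: with `top_k=0` the slice `[-0:]` in the NumPy expression selects every index, whereas this sketch returns an empty list.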


In [17]:
with open("data/val.json", "r") as f:
    val_data = json.load(f)

query = val_data[3]["question"]
retrieved_chunks_dict = {size: retrieve_relative_chunks(query, text_chunks_dict[size], chunk_embeddings_dict[size], top_k=5) for size in chunk_sizes}

print(retrieved_chunks_dict[256])

Creating embeddings for text chunks..., number of chunks:  1
Creating embeddings for text chunks..., number of chunks:  1
Creating embeddings for text chunks..., number of chunks:  1
['AI enables personalized medicine by analyzing individual patient data, predicting treatment \nresponses, and tailoring interventions. Personalized medicine enhances treatment effectiveness \nand reduces adverse effects. \nRobotic Surgery \nAI-powered robotic s', ' analyzing biological data, predicting drug \nefficacy, and identifying potential drug candidates. AI-powered systems reduce the time and cost \nof bringing new treatments to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzing indiv', 'g \npatient outcomes, and assisting in treatment planning. AI-powered tools enhance accuracy, \nefficiency, and patient care. \nDrug Discovery and Development \nAI accelerates drug discovery and development by analyzing biological data, predicting drug \neffica', 'mains. \nThese applications include: \nHealthcare \nAI is transforming healthcare through applications such as m

In [18]:
system_prompt = """
You are a helpful assistant that strictly answers the question based on the given context. If the answer cannot be derived from the context, respond with: "I don't have enough information to answer that."
"""
def generate_answer(query, system_prompt, retrieved_chunk, model="Qwen/Qwen3-8B"):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nContext: {retrieved_chunk}"}
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0
    )
    return response.choices[0].message.content


ai_answer_dict = {size: generate_answer(query, system_prompt, retrieved_chunks_dict[size]) for size in chunk_sizes}

print(ai_answer_dict[256])






AI contributes to personalized medicine by analyzing individual patient data to predict treatment responses and tailor interventions, thereby enhancing treatment effectiveness and reducing adverse effects. It also aids in drug discovery by predicting drug efficacy, identifying potential candidates, and accelerating development processes. Additionally, AI-powered tools assist in treatment planning, improve accuracy and efficiency, and support better patient outcomes through data-driven decision-making.
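
`generate_answer` interpolates `retrieved_chunk` straight into the prompt. In the calls above that value is a Python list, so the model receives the list's `repr`, with brackets, quotes, and escaped `\n` sequences, just like the list printed by `print(retrieved_chunks_dict[256])`. A minimal sketch of one alternative is to join the chunks into plain text first. `format_context` and its `[Chunk N]` labels are hypothetical and not part of the original notebook.

```python
def format_context(chunks):
    """Join retrieved chunks into one numbered, plain-text context string."""
    return "\n\n".join(
        f"[Chunk {i}]\n{chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )


# Shortened sample chunks, for illustration only.
sample_chunks = [
    "AI enables personalized medicine by analyzing individual patient data. \n",
    " Robotic Surgery \nAI-powered robotic s",
]
print(format_context(sample_chunks))
```

The result could be passed as the third argument, for example `generate_answer(query, system_prompt, format_context(retrieved_chunks_dict[size]))`. Whether this changes answer quality would need to be measured with the evaluation below.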


# Evaluating the AI Response

In [19]:
SCORE_FULL = 1
SCORE_PARTIAL = 0.5
SCORE_WRONG = 0

FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the correct answer.

Question: {question}
AI Response: {ai_response}
Correct Answer: {correct_answer}

Faithfulness measures how well the AI response aligns with the correct answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    - {SCORE_FULL} Completely faithful, no contradictions
    - {SCORE_PARTIAL} Partially faithful, minor contradictions
    - {SCORE_WRONG} No faithfulness, major contradictions or hallucinations
- Return ONLY the score, nothing else.
"""
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {ai_response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {SCORE_FULL} = Completely relevant, directly addresses the query
    * {SCORE_PARTIAL} = Partially relevant, addresses some aspects
    * {SCORE_WRONG} = Not relevant, fails to address the query
- Return ONLY the numerical score ({SCORE_FULL}, {SCORE_PARTIAL}, or {SCORE_WRONG}) with no explanation or additional text.
"""


In [20]:
def evaluate_answer(query, ai_response, correct_answer, model="Qwen/Qwen3-8B"):
    faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
        question=query,
        ai_response=ai_response,
        correct_answer=correct_answer,
        SCORE_FULL=SCORE_FULL,
        SCORE_PARTIAL=SCORE_PARTIAL,
        SCORE_WRONG=SCORE_WRONG
    )
    relevance_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
        question=query,
        ai_response=ai_response,
        SCORE_FULL=SCORE_FULL,
        SCORE_PARTIAL=SCORE_PARTIAL,
        SCORE_WRONG=SCORE_WRONG
    )
    faithfulness_response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": faithfulness_prompt}],
        temperature=0.0
    )
    
    relevance_response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": relevance_prompt}],
        temperature=0.0
    )   
    return faithfulness_response.choices[0].message.content, relevance_response.choices[0].message.content

true_answer = val_data[3]['ideal_answer']

# Evaluate response for chunk size 256 and 128
faithfulness, relevancy = evaluate_answer(query, ai_answer_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_answer(query, ai_answer_dict[128], true_answer)

# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print(f"\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")




Faithfulness Score (Chunk Size 256): 

1
Relevancy Score (Chunk Size 256): 

1


Faithfulness Score (Chunk Size 128): 

1
Relevancy Score (Chunk Size 128): 

1
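
The run above scores a single question (`val_data[3]`) at two chunk sizes, and both receive 1 on each metric, so it cannot separate them. Comparing chunk sizes requires scores over more questions, averaged per size. The sketch below shows only the aggregation step. `summarize_scores` is a hypothetical helper, and the records are made-up numbers for illustration, not results from this notebook.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(records):
    """Average faithfulness and relevancy per chunk size.

    records: iterable of (chunk_size, faithfulness, relevancy) tuples.
    """
    grouped = defaultdict(list)
    for size, faithfulness, relevancy in records:
        grouped[size].append((faithfulness, relevancy))
    return {
        size: {
            "faithfulness": mean(f for f, _ in pairs),
            "relevancy": mean(r for _, r in pairs),
            "n": len(pairs),
        }
        for size, pairs in sorted(grouped.items())
    }


# Made-up scores for illustration only; not measured results.
records = [(128, 1.0, 1.0), (128, 0.5, 1.0), (256, 1.0, 1.0), (256, 1.0, 0.5)]
for size, stats in summarize_scores(records).items():
    print(size, stats)
```

In practice the records would come from looping over every entry in `val_data` and every size in `chunk_sizes`, converting each judge reply to a number before aggregating.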
