# 简单 RAG 中的上下文块头 （CCH）

检索增强生成 （RAG） 通过在生成响应之前检索相关的外部知识来提高语言模型的事实准确性。但是，标准分块通常会丢失重要的上下文，从而使检索效率降低。

上下文块标头 （CCH） 通过在嵌入之前为每个块预置高级上下文（如文档标题或章节标题）来增强 RAG。这可以提高检索质量并防止脱离上下文的响应。

### 步骤：
1. 数据接入：加载并预处理文本数据。
2. 使用上下文标题进行分块：提取章节标题并将其添加到块的前面。
3. 嵌入创建：将上下文增强的块转换为数字表示。
4. 语义搜索：根据用户查询检索相关块。
5. 响应生成：使用语言模型从检索到的文本中生成响应。
6. 评估：使用评分系统评估响应准确性。

In [1]:
import os
import numpy as np
import json
from openai import OpenAI
import fitz
from tqdm import tqdm

In [2]:
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key=os.getenv("SILLICONFLOW_API_KEY")
    )

# 根据上下文头来切分块

In [3]:
def generate_header_chunks(chunks, model="Qwen/Qwen3-8B"):
    system_prompt = f"""Generate a concise and informative title for each text of the given text arrays.
    response format is json array, each item is a json object with the following fields:
    - index: <index of the text in the text array>
    - header: <header of the text>
    """
    text_array = ""
    for chunk in chunks:
        text_array += f"index: {chunk['index']}, text: {chunk['text']}\n"
    
    user_prompt = f"""Generate header for each text in the following text array: 
    {text_array} 
    """
    print(f"user_prompt: {user_prompt[:100]} ...")
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content


In [5]:
def chunk_text_with_headers(text, n, overlap):
    """
    将文本分块，每个块包含n个字符，重叠n-overlap个字符。
    
    Args:
        text: 需要分块的文本
        n: 每个块的字符数
        overlap: 重叠的字符数
    
    Returns:
        list: 分块后的文本列表
    """
    chunks = []
    step_size = n - overlap
    end = len(text)
    
    index = 0
    for i in range(0, end, step_size):
        start = i
        chunk = text[i:i+step_size]
        if(len(chunk.strip()) == 0):
            continue
        
        chunks.append({
            "index": index,
            "text": chunk
        })
        index += 1
    if start < end:
        chunk = text[start:]
        if len(chunk.strip()) != 0:
            chunks.append({
                "index": index,
                "text": chunk
            })
    print(f"chunking length: {len(chunks)}")
    
    response = generate_header_chunks(chunks)
    """
    response:
    [
    {
        "index": 0,
        "header": "Introduction to Artificial Intelligence"
    },
    ...
    ]
    """
    results = []
    res_json = json.loads(response)
    print(f"generated headers length: {len(res_json)}")
    length = min(len(chunks), len(res_json))
    for i in range(length):
        results.append({
            "header": res_json[i]["header"],
            "text": chunks[i]["text"]
        })
  
    return results
    
    

In [6]:
pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
chunks = chunk_text_with_headers(extracted_text, 1000, 200)
index = 3
print(f"Sample Chunk {index}")
print("Header:", chunks[index]["header"])
print("Text:", chunks[index]["text"])


chunking length: 43
user_prompt: Generate header for each text in the following text array: 
    index: 0, text: Understanding Artifi ...
generated headers length: 24
Sample Chunk 3
Header: Deep Learning and Natural Language Processing: Advanced AI Techniques
Text: trained on unlabeled data, where the algorithm must 
discover patterns and structures in the data without explicit guidance. Common techniques 
include clustering (grouping similar data points) and dimensionality reduction (reducing the 
number of variables while preserving important information). 
 
Reinforcement Learning 
Reinforcement learning involves training an agent to make decisions in an environment to 
maximize a reward. The agent learns through trial and error, receiving feedback in the form of 
rewards or penalties. This approach is used in game playing, robotics, and resource 
management. 
Deep Learning 
Deep learning is a subfield of machine learning that uses artificial neural networks with multiple 
layers (d

In [7]:
def create_embeddings(text, model="BAAI/bge-m3"):
    """
    Creates embeddings for the given text.

    Args:
    text (str): The input text to be embedded.
    model (str): The embedding model to be used. Default is "BAAI/bge-en-icl".

    Returns:
    dict: The response containing the embedding for the input text.
    """
    # Create embeddings using the specified model and input text
    response = client.embeddings.create(
        model=model,
        input=text
    )
    # Return the embedding from the response
    return response.data[0].embedding

In [8]:
# Generate embeddings for each chunk
embeddings = []  # Initialize an empty list to store embeddings

# Iterate through each text chunk with a progress bar
for chunk in tqdm(chunks, desc="Generating embeddings"):
    # Create an embedding for the chunk's text
    text_embedding = create_embeddings(chunk["text"])
    # Create an embedding for the chunk's header
    header_embedding = create_embeddings(chunk["header"])
    # Append the chunk's header, text, and their embeddings to the list
    embeddings.append({"header": chunk["header"], "text": chunk["text"], "embedding": text_embedding, "header_embedding": header_embedding})

Generating embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:29<00:00,  1.23s/it]


In [9]:
def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity score.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [11]:
def semantic_search(query, chunks, k=5):
    """
    Searches for the most relevant chunks based on a query.

    Args:
    query (str): User query.
    chunks (List[dict]): List of text chunks with embeddings.
    k (int): Number of top results.

    Returns:
    List[dict]: Top-k most relevant chunks.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query)

    similarities = []  # Initialize a list to store similarity scores
    
    # Iterate through each chunk to calculate similarity scores
    for chunk in chunks:
        # Compute cosine similarity between query embedding and chunk text embedding
        sim_text = cosine_similarity(np.array(query_embedding), np.array(chunk["embedding"]))
        # Compute cosine similarity between query embedding and chunk header embedding
        sim_header = cosine_similarity(np.array(query_embedding), np.array(chunk["header_embedding"]))
        # Calculate the average similarity score
        avg_similarity = (sim_text + sim_header) / 2
        print(f"avg_similarity: {avg_similarity}")
        # Append the chunk and its average similarity score to the list
        similarities.append((chunk, avg_similarity))

    # Sort the chunks based on similarity scores in descending order
    # x[1] is the similarity score, x[0] is the chunk
    similarities.sort(key=lambda x: x[1], reverse=True)
    # Return the top-k most relevant chunks
    return [x[0] for x in similarities[:k]]

In [12]:
# Load validation data
with open('data/val.json') as f:
    data = json.load(f)

query_index = 0
query = data[query_index]['question']

# Retrieve the top 2 most relevant text chunks
top_chunks = semantic_search(query, embeddings, k=2)

# Print the results
print("Current Query:", query)
for i, chunk in enumerate(top_chunks):  
    print(f"Header {i+1}: {chunk['header']}")
    print(f"Content:\n{chunk['text']}\n")

avg_similarity: 0.5141255182627062
avg_similarity: 0.4861915784763715
avg_similarity: 0.48519307808450063
avg_similarity: 0.41353554937566883
avg_similarity: 0.453214371133864
avg_similarity: 0.4681727856968765
avg_similarity: 0.4833216155728286
avg_similarity: 0.45704598346023007
avg_similarity: 0.4707456695938005
avg_similarity: 0.5203494308864659
avg_similarity: 0.5298433979651435
avg_similarity: 0.48241328650322085
avg_similarity: 0.487565906902785
avg_similarity: 0.47416048692829915
avg_similarity: 0.45119198567727997
avg_similarity: 0.4673251069267104
avg_similarity: 0.47950523840176257
avg_similarity: 0.4726999455339618
avg_similarity: 0.4347077375422013
avg_similarity: 0.4532016003143917
avg_similarity: 0.4539905446226156
avg_similarity: 0.468996053287773
avg_similarity: 0.5001564927782303
avg_similarity: 0.4703719502064978
Current Query: What is 'Explainable AI' and why is it considered important?
Header 1: AI and the Future of Work: Automation, Reskilling, and New Opportuniti

In [16]:
system_prompt = """You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"""

def generate_response(system_prompt, user_query, model="Qwen/Qwen3-8B"):
    """Generate a response to a user query based on the given system prompt and model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
    )
    return response

user_prompt = "\n".join(f"header:{chunk['header']}\n content:{chunk['text']}\n" for i, chunk in enumerate(top_chunks))
user_prompt = f"{user_prompt}\n问题：{query}"
print("user_prompt length: ", len(user_prompt))
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response.choices[0].message.content)
    

user_prompt length:  1838


Explainable AI (XAI) refers to techniques designed to make AI systems more transparent and understandable, enabling users to comprehend how AI arrives at its decisions. It is considered important because many AI systems, particularly deep learning models, operate as "black boxes," making their decision-making processes opaque. Enhancing transparency and explainability is crucial for building trust, ensuring accountability, and addressing ethical concerns related to AI's impact on privacy, security, and societal trust.


In [19]:
evaluate_system_prompt = """You are an intelligent evaluation system. 
Assess the AI assistant's response based on the provided context. 
- Assign a score of 1 if the response is very close to the true answer. 
- Assign a score of 0.5 if the response is partially correct. 
- Assign a score of 0 if the response is incorrect.
Return only the score (0, 0.5, or 1)."""

true_answer = data[query_index]['ideal_answer']

evaluate_prompt = f"""
User Query: {query}
True Answer: {true_answer}
AI Assistant's Response: {ai_response}
{evaluate_system_prompt}
"""
eval_response = generate_response(evaluate_system_prompt, evaluate_prompt)
print("Query: ", query)
print("True Answer: ", true_answer)
print("AI Assistant's Response: ", ai_response.choices[0].message.content)
print("Evaluation Response:", eval_response.choices[0].message.content)

Query:  What is 'Explainable AI' and why is it considered important?
True Answer:  Explainable AI (XAI) aims to make AI systems more transparent and understandable, providing insights into how they make decisions. It's considered important for building trust, accountability, and ensuring fairness in AI systems.
AI Assistant's Response:  

Explainable AI (XAI) refers to techniques designed to make AI systems more transparent and understandable, enabling users to comprehend how AI arrives at its decisions. It is considered important because many AI systems, particularly deep learning models, operate as "black boxes," making their decision-making processes opaque. Enhancing transparency and explainability is crucial for building trust, ensuring accountability, and addressing ethical concerns related to AI's impact on privacy, security, and societal trust.
Evaluation Response: 

1
