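# Context-Enriched Retrieval
RAG (retrieval-augmented generation) improves generated content by retrieving relevant external knowledge. Conventional retrieval returns isolated chunks, which can lead to incomplete answers.
To address this, we introduce context-enriched retrieval: each retrieved chunk is returned together with its neighboring chunks, so the model sees the surrounding context rather than a fragment.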

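Steps
1. Data ingestion: extract the text from a PDF.
2. Chunking: split the text into overlapping chunks to preserve continuity of context.
3. Embedding creation: convert each chunk into a numeric vector.
4. Context-aware retrieval: find the most relevant chunk and include its neighboring chunks for more complete context.
5. Response generation: generate an answer from the retrieved context.
6. Evaluation: assess the accuracy of the generated answer.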

In [35]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

In [21]:
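# Data ingestion
def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

# Chunking
def chunk_text(text, chunk_size=256, overlap=0.2):
    """Split text into chunks of chunk_size characters; overlap is the fraction shared by neighbors."""
    chunks = []
    overlap_size = int(chunk_size * overlap)
    # Advance by chunk_size minus the overlap so consecutive chunks share overlap_size characters
    for i in range(0, len(text), chunk_size - overlap_size):
        chunk = text[i:i+chunk_size]
        chunks.append(chunk)
    return chunks

# Embedding creation
client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key=os.getenv("SILLICONFLOW_API_KEY")
)
def create_embeddings(chunks, model_name="BAAI/bge-m3"):
    """Embed a string or a list of strings; always returns a list of vectors."""
    response = client.embeddings.create(
        model=model_name,
        input=chunks
    )
    return [np.array(embedding.embedding) for embedding in response.data]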
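
To see what the overlap does in practice, here is a minimal sketch that runs the same chunking logic on a short string. The function body restates `chunk_text` from the cell above so the snippet runs on its own; the sample string and the small `chunk_size` are assumptions chosen only to make the chunk boundaries easy to read.

```python
def chunk_text(text, chunk_size=256, overlap=0.2):
    # Same logic as the notebook's chunk_text, restated for a standalone demo
    overlap_size = int(chunk_size * overlap)
    step = chunk_size - overlap_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Hypothetical sample input: the lowercase alphabet (26 characters)
sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_text(sample, chunk_size=10, overlap=0.2)
for chunk in toy_chunks:
    print(chunk)
# abcdefghij
# ijklmnopqr
# qrstuvwxyz
# yz
```

With `chunk_size=10` and `overlap=0.2` the stride is 8 characters, so each chunk repeats the last 2 characters of the previous one. The final chunk (`yz`) is a short fragment already contained in the chunk before it; whether to drop such tails is a design choice.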

In [22]:
# Chunk the document
pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path=pdf_path)
text_chunks = chunk_text(extracted_text, 1000, 0.2)
print("Number of chunks:", len(text_chunks))
print("\nFirst chunk:")
print(text_chunks[0])

Number of chunks: 42

First chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and symbolic methods. The 1980s sa

In [27]:
# Create embeddings for the chunks
response = create_embeddings(text_chunks)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def create_context_search(query, chunk_embeddings, chunks, top_k=1, context_size=1):
    # Note: top_k is currently unused; only the single best match is expanded
    print("chunk_embeddings size: ", len(chunk_embeddings))
    print("chunks size: ", len(chunks))
    # create_embeddings returns a list of vectors, so take the first (and only) one
    query_embedding = create_embeddings(query)[0]
    similarities = []
    for i, embedding in enumerate(chunk_embeddings):
        similarity = cosine_similarity(query_embedding, embedding)
        similarities.append((i, similarity))
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    top_index = similarities[0][0]
    # Return the best chunk together with its neighbors, clipped to the document bounds
    start_index = max(0, top_index - context_size)
    end_index = min(len(chunks), top_index + context_size + 1)

    return [chunks[i] for i in range(start_index, end_index)]
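
The neighbor expansion can be checked without calling the embedding API. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `context_window` is a hypothetical helper, the two-dimensional vectors are made-up stand-ins for real embeddings, and cosine similarity is written with the standard library so the snippet has no dependencies.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def context_window(query_vec, chunk_vecs, chunks, context_size=1):
    # Hypothetical helper: find the best-matching chunk, then widen to its neighbors
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    start = max(0, best - context_size)
    end = min(len(chunks), best + context_size + 1)
    return best, chunks[start:end]

# Made-up chunks and 2-D "embeddings" for illustration only
toy_chunks = ["A", "B", "C", "D", "E"]
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]

best, window = context_window([0.0, 1.0], toy_vecs, toy_chunks)
print(best, window)  # 2 ['B', 'C', 'D']

edge_best, edge_window = context_window([1.0, 0.0], toy_vecs, toy_chunks)
print(edge_best, edge_window)  # 0 ['A', 'B']
```

When the best match is the first or last chunk, the window is clipped at the document boundary, so fewer than `2 * context_size + 1` chunks come back.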




# Run a Query with Context Retrieval

In [31]:
with open("data/val.json", "r") as f:
    val_data = json.load(f)

test_index = 0
query = val_data[test_index]["question"]
print("Query: ", query)
top_chunks = create_context_search(query, response, text_chunks)
print("Retrieved context: ", top_chunks)
for i, chunk in enumerate(top_chunks):
    print(f"Context chunk {i+1}:\n")
    print(chunk)
    print("=========================\n")

Query:  What is 'Explainable AI' and why is it considered important?
chunk_embeddings size:  42
chunks size:  42
Retrieved context:  ['nt aligns with societal values. Education and awareness campaigns inform the public \nabout AI, its impacts, and its potential. \nChapter 19: AI and Ethics \nPrinciples of Ethical AI \nEthical AI principles guide the development and deployment of AI systems to ensure they are fair, \ntransparent, accountable, and beneficial to society. Key principles include respect for human \nrights, privacy, non-discrimination, and beneficence. \n \n \nAddressing Bias in AI \nAI systems can inherit and amplify biases present in the data they are trained on, leading to unfair \nor discriminatory outcomes. Addressing bias requires careful data collection, algorithm design, \nand ongoing monitoring and evaluation. \nTransparency and Explainability \nTransparency and explainability are essential for building trust in AI systems. Explainable AI (XAI) \ntechniques aim to make AI decis

# Generate a Response Using the Retrieved Context

In [None]:
system_prompt = """
You are an AI assistant that STRICTLY answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'
"""

def generate_answer(system_prompt, user_prompt, model="Qwen/Qwen3-8B"):
    client = OpenAI(
        base_url="https://api.siliconflow.cn/v1/",
        api_key=os.getenv("SILLICONFLOW_API_KEY")
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.choices[0].message.content

user_prompt = "\n".join([f"Context:{i+1}: {chunk}" for i, chunk in enumerate(top_chunks)])
user_prompt = f"Question: {query} \n{user_prompt}"
print("Question: \n", query)
print("Generating an answer from the context")
answer = generate_answer(system_prompt, user_prompt)
print("Answer: \n", answer)

Question: 
 What is 'Explainable AI' and why is it considered important?
Generating an answer from the context
Answer: 
 

Explainable AI (XAI) refers to techniques that make AI decisions more understandable, enabling users to assess the fairness, accuracy, and reliability of AI outcomes. It is considered important for several reasons:  
1. **Trust**: Transparency and explainability are critical for building trust in AI systems, as they allow users to understand how decisions are made and evaluate their validity.  
2. **Accountability**: By making AI processes interpretable, XAI supports accountability, ensuring developers and deployers can address potential harms or biases.  
3. **Fairness and Ethics**: XAI helps identify and mitigate biases in AI systems, aligning with ethical principles like non-discrimination and beneficence.  
4. **Reliability**: Understanding decision-making processes enhances confidence in AI reliability and robustness.  

These factors collectively ensure AI systems align with societal values and e

In [34]:
# Evaluation
evaluation_system_prompt = """
You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5.
"""

def evaluate_answer(answer, true_answer, model="Qwen/Qwen3-8B"):
    client = OpenAI(
        base_url="https://api.siliconflow.cn/v1/",
        api_key=os.getenv("SILLICONFLOW_API_KEY")
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": evaluation_system_prompt},
            {"role": "user", "content": f"AI assistant's response: {answer}\n True response: {true_answer} \n {evaluation_system_prompt}"}
        ]
    )
    return response.choices[0].message.content

ground_truth = val_data[test_index]["ideal_answer"]
score = evaluate_answer(answer, ground_truth)
print("Evaluation score: ", score)

Evaluation score:  

1.0

The AI assistant's response aligns closely with the true response, covering all essential aspects of Explainable AI (XAI). It accurately defines XAI as making AI systems transparent and understandable, which matches the true response. The assistant also highlights key reasons for XAI's importance (trust, accountability, fairness) that are explicitly mentioned in the true response. While the assistant adds an additional point about "reliability," this does not contradict the true response and can be seen as a reasonable elaboration. The core information is fully aligned, making the score 1.0.
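
The evaluator replies with free text rather than a bare number, so aggregating results over the whole validation set needs a numeric value. One possible approach, sketched here as an assumption rather than part of the original notebook, is a small regex helper (`parse_score` is a hypothetical name) that extracts the first standalone 0, 0.5, or 1 from the reply.

```python
import re

# Matches a standalone 0, 0.5, or 1 (optionally written as 0.0 / 1.0),
# rejecting digits that are part of a longer number such as 10 or 0.75.
_SCORE_PATTERN = re.compile(r"(?<![\d.])(0\.5|1(?:\.0+)?|0(?:\.0+)?)(?!\.?\d)")

def parse_score(evaluation_text):
    # Hypothetical helper: return the first valid score as a float, or None if absent
    match = _SCORE_PATTERN.search(evaluation_text)
    return float(match.group(1)) if match else None

sample_reply = "1.0\n\nThe AI assistant's response aligns closely with the true response."
print(parse_score(sample_reply))       # 1.0
print(parse_score("Score: 0.5"))       # 0.5
print(parse_score("No score given."))  # None
```

Returning `None` for replies without a recognizable score keeps malformed evaluator output from being silently counted as 0 when averaging.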
