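# Reranking for Enhanced RAG Systems

This notebook implements reranking techniques to improve retrieval quality in RAG systems. Reranking is a second filtering step that runs after initial retrieval, so that only the most relevant content is used to generate the response.

## Key Concepts of Reranking

1. Initial retrieval: a first pass using basic similarity search (faster but less accurate)
2. Document scoring: evaluate how relevant each retrieved document is to the query
3. Reordering: sort the documents by relevance score
4. Selection: use only the most relevant documents to generate the response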
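
The four steps above can be sketched in a few lines. This is a toy illustration rather than the notebook's implementation: the `retrieve_then_rerank` name, the two word-based scorers, and the sample documents are all assumptions made up for the example.

```python
def retrieve_then_rerank(query, docs, first_stage, second_stage, k=3, top_n=2):
    """Toy two-stage retrieval: a cheap scorer picks k candidates, a finer scorer reorders them."""
    # Step 1: initial retrieval, keep the k best documents under the cheap scorer
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k]
    # Steps 2 and 3: score the candidates with the finer scorer and sort by that score
    reranked = sorted(candidates, key=lambda d: second_stage(query, d), reverse=True)
    # Step 4: selection, keep only the top_n documents
    return reranked[:top_n]


def shared_words(query, doc):
    """Cheap scorer (hypothetical): number of whitespace-separated words in common."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_match(query, doc):
    """Finer scorer (hypothetical): 1.0 if the whole query appears verbatim in the document."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "ranking search results is hard",
    "how to rerank search results with a second pass",
    "cooking pasta at home",
    "rerank search results: a guide",
]
top = retrieve_then_rerank("rerank search results", docs, shared_words, phrase_match)
```

Here the first stage ranks the first document above the last one, and the second stage reverses that order because only the last one contains the full query phrase.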

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI
import re

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the text of every page in a PDF file and returns it as a single string.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

In [3]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    
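    # Validate the stride so a bad overlap cannot silently produce no chunks
    step = n - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than n")

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), step):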
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
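
To make the stride arithmetic concrete, here is a small self-contained check on a ten-character string. The `sliding_chunks` name is hypothetical; it is only meant to mirror the slicing logic of `chunk_text` at a scale that is easy to read.

```python
def sliding_chunks(text, n, overlap):
    """Split text into windows of n characters whose starts are (n - overlap) apart."""
    step = n - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than n")
    return [text[i:i + n] for i in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghij", n=4, overlap=2)
for chunk in chunks:
    print(repr(chunk))
```

Two things are worth noticing in the printed chunks: each full chunk repeats the last two characters of the previous one, and the final chunk is shorter than `n` because the slice runs off the end of the text.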

In [4]:
client = OpenAI(
    base_url="https://api.siliconflow.cn/v1/",
    api_key=os.getenv("SILLICONFLOW_API_KEY")
)

In [5]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        """
        Initialize the vector store.
        """
        self.vectors = []  # List to store embedding vectors
        self.texts = []  # List to store original texts
        self.metadata = []  # List to store metadata for each text
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.

        Args:
        text (str): The original text.
        embedding (List[float]): The embedding vector.
        metadata (dict, optional): Additional metadata.
        """
        self.vectors.append(np.array(embedding))  # Convert embedding to numpy array and add to vectors list
        self.texts.append(text)  # Add the original text to texts list
        self.metadata.append(metadata or {})  # Add metadata to metadata list, use empty dict if None
    
    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.

        Args:
        query_embedding (List[float]): Query embedding vector.
        k (int): Number of results to return.

        Returns:
        List[Dict]: Top k most similar items with their texts and metadata.
        """
        if not self.vectors:
            return []  # Return empty list if no vectors are stored
        
        # Convert query embedding to numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Compute cosine similarity between query vector and stored vector
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))  # Append index and similarity score
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],  # Add the corresponding text
                "metadata": self.metadata[idx],  # Add the corresponding metadata
                "similarity": score  # Add the similarity score
            })
        
        return results  # Return the list of top k similar items
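
The per-vector loop in `similarity_search` can also be written as a single matrix operation. The following is a sketch of that idea under the assumption that all stored vectors have the same dimension and none is the zero vector; `top_k_cosine` is a hypothetical helper, not part of the class above.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k=5):
    """Return (row_index, similarity) pairs for the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity: dot product divided by the product of the two norms
    sims = matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))
    # argsort is ascending, so negate the scores to get the highest similarity first
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_cosine([1.0, 0.0], vectors, k=2)
print(hits)
```

For a store with only a few dozen chunks the loop and the matrix form behave the same; the matrix form mainly helps readability and speed as the store grows.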

In [6]:
def create_embeddings(text, model="BAAI/bge-m3"):
    """
    Creates embeddings for the given text using the specified embedding model.

    Args:
    text (str or List[str]): The input text, or a list of texts, to embed.
    model (str): The model to be used for creating embeddings.

    Returns:
    List[float] or List[List[float]]: One embedding vector for a string input,
    or a list of vectors for a list input.
    """
    # Handle both string and list inputs by converting string input to a list
    input_text = text if isinstance(text, list) else [text]
    
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model=model,
        input=input_text
    )
    
    # If input was a string, return just the first embedding
    if isinstance(text, str):
        return response.data[0].embedding
    
    # Otherwise, return all embeddings as a list of vectors
    return [item.embedding for item in response.data]

In [7]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a document for RAG.

    Args:
    pdf_path (str): Path to the PDF file.
    chunk_size (int): Size of each chunk in characters.
    chunk_overlap (int): Overlap between chunks in characters.

    Returns:
    SimpleVectorStore: A vector store containing document chunks and their embeddings.
    """
    # Extract text from the PDF file
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    
    # Chunk the extracted text
    print("Chunking text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")
    
    # Create embeddings for the text chunks
    print("Creating embeddings for chunks...")
    chunk_embeddings = create_embeddings(chunks)
    
    # Initialize a simple vector store
    store = SimpleVectorStore()
    
    # Add each chunk and its embedding to the vector store
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )
    
    print(f"Added {len(chunks)} chunks to the vector store")
    return store

In [11]:
def rerank_with_llm(query, results, top_n=3, model="Qwen/Qwen3-8B"):
    """
    Reranks search results using LLM relevance scoring.
    
    Args:
        query (str): User query
        results (List[Dict]): Initial search results
        top_n (int): Number of results to return after reranking
        model (str): Model to use for scoring
        
    Returns:
        List[Dict]: Reranked results
    """
    print(f"Reranking {len(results)} documents... with LLM model {model}")  # Print the number of documents to be reranked
    
    scored_results = []  # Initialize an empty list to store scored results
    
    # Define the system prompt for the LLM
    system_prompt = """
    You are an expert at evaluating document relevance for search queries.
    Your task is to rate documents on a scale from 0 to 10 based on how well they answer the given query.

    Guidelines:
    - Score 0-2: Document is completely irrelevant
    - Score 3-5: Document has some relevant information but doesn't directly answer the query
    - Score 6-8: Document is relevant and partially answers the query
    - Score 9-10: Document is highly relevant and directly answers the query

    You MUST respond with ONLY a single integer score between 0 and 10. Do not include ANY other text.
    """
    if results:  # Guard against an empty result list before indexing
        print("results[0]: ", results[0])
    # Iterate through each result
    for i, result in enumerate(results):
        # Show progress every 5 documents
        if i % 5 == 0:
            print(f"Scoring document {i+1}/{len(results)}...")
        
        # Define the user prompt for the LLM
        user_prompt = f"""
        Query: {query}

        Document:
        {result['text']}

        Rate this document's relevance to the query on a scale from 0 to 10:
        """
        
        # Get the LLM response
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )
        
        # Extract the score from the LLM response
        score_text = response.choices[0].message.content.strip()
        
        # Use regex to extract the numerical score
        score_match = re.search(r'\b(10|[0-9])\b', score_text)
        if score_match:
            score = float(score_match.group(1))
        else:
            # If score extraction fails, use similarity score as fallback
            print(f"Warning: Could not extract score from response: '{score_text}', using similarity score instead")
            score = result["similarity"] * 10
        
        # Append the scored result to the list
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": score
        })
    
    # Sort results by relevance score in descending order
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    
    # Return the top_n results
    return reranked_results[:top_n]
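
The score-extraction step can be exercised on its own, without calling a model. The sketch below reuses the same regular expression as `rerank_with_llm`; the `parse_score` helper and the sample replies are hypothetical and only illustrate how the pattern behaves.

```python
import re

# Same pattern as in rerank_with_llm: a standalone integer from 0 to 10
SCORE_PATTERN = re.compile(r'\b(10|[0-9])\b')


def parse_score(score_text, fallback):
    """Return the first standalone 0-10 integer found in score_text, else the fallback."""
    match = SCORE_PATTERN.search(score_text)
    return float(match.group(1)) if match else fallback


replies = ["8", "Score: 10", "8/10", "I cannot rate this.", "Out of 10, I give 8"]
parsed = {reply: parse_score(reply, fallback=-1.0) for reply in replies}
for reply, score in parsed.items():
    print(f"{reply!r} -> {score}")
```

Because `re.search` returns the first match, a reply such as "Out of 10, I give 8" is read as 10 rather than 8. That is one reason the system prompt insists on a bare integer.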

In [12]:
def rerank_with_keywords(query, results, top_n=3):
    """
    A simple alternative reranking method based on keyword matching and position.
    
    Args:
        query (str): User query
        results (List[Dict]): Initial search results
        top_n (int): Number of results to return after reranking
        
    Returns:
        List[Dict]: Reranked results
    """
    print(f"Reranking {len(results)} documents... with keyword matching")  # Print the number of documents to be reranked
    if results:  # Guard against an empty result list before indexing
        print("results[0]:  ", results[0])
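    # Extract important keywords from the query; splitting on word characters
    # drops punctuation, so a trailing "?" cannot prevent a match
    keywords = [word for word in re.findall(r"\w+", query.lower()) if len(word) > 3]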
    
    scored_results = []  # Initialize a list to store scored results
    
    for result in results:
        document_text = result["text"].lower()  # Convert document text to lowercase
        
        # Base score starts with vector similarity
        base_score = result["similarity"] * 0.5
        
        # Initialize keyword score
        keyword_score = 0
        for keyword in keywords:
            if keyword in document_text:
                # Add points for each keyword found
                keyword_score += 0.1
                
                # Add more points if keyword appears near the beginning
                first_position = document_text.find(keyword)
                if first_position < len(document_text) / 4:  # In the first quarter of the text
                    keyword_score += 0.1
                
                # Add points for keyword frequency
                frequency = document_text.count(keyword)
                keyword_score += min(0.05 * frequency, 0.2)  # Cap at 0.2
        
        # Calculate the final score by combining base score and keyword score
        final_score = base_score + keyword_score
        
        # Append the scored result to the list
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": final_score
        })
    
    # Sort results by final relevance score in descending order
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    
    # Return the top_n results
    return reranked_results[:top_n]
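
A worked example helps show how the keyword bonus and the similarity term combine. This sketch reproduces only the scoring arithmetic from `rerank_with_keywords`; the `keyword_bonus` name, the sample document, the keyword list, and the similarity value of 0.8 are assumptions chosen for illustration.

```python
def keyword_bonus(keywords, document_text):
    """Sum the per-keyword bonuses: presence, early position, and capped frequency."""
    text = document_text.lower()
    bonus = 0.0
    for keyword in keywords:
        if keyword in text:  # substring match, so "rank" would also match "reranking"
            bonus += 0.1  # the keyword is present
            if text.find(keyword) < len(text) / 4:
                bonus += 0.1  # first occurrence falls in the first quarter of the text
            bonus += min(0.05 * text.count(keyword), 0.2)  # frequency bonus, capped at 0.2
    return bonus


doc = "Reranking reorders candidates. Good reranking helps retrieval quality."
bonus = keyword_bonus(["reranking", "improve", "retrieval"], doc)
final_score = 0.8 * 0.5 + bonus  # assumed similarity of 0.8, halved as in the base score
print(round(bonus, 2), round(final_score, 2))
```

In this example "reranking" contributes 0.3 (present, early, two occurrences), "retrieval" contributes 0.15 (present, one occurrence), and "improve" contributes nothing. Each keyword can add up to 0.4 while the similarity term is at most 0.5, so with several matching keywords the bonus can outweigh the vector similarity.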

In [13]:
def generate_response(query, context, model="Qwen/Qwen3-8B"):
    """
    Generates a response based on the query and context.
    
    Args:
        query (str): User query
        context (str): Retrieved context
        model (str): Model to use for response generation
        
    Returns:
        str: Generated response
    """
    # Define the system prompt to guide the AI's behavior
    system_prompt = """
    You are a helpful AI assistant.
    Answer the user's question based only on the provided context. 
    If you cannot find the answer in the context, state that you don't have enough information.
    """
    
    # Create the user prompt by combining the context and query
    user_prompt = f"""
        Context:
        {context}

        Question: {query}

        Please provide a comprehensive answer based only on the context above.
    """
    
    # Generate the response using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Return the generated response content
    return response.choices[0].message.content

In [18]:
def rag_with_reranking(query, vector_store, reranking_method="llm", top_n=3, model="Qwen/Qwen3-8B"):
    """
    Complete RAG pipeline incorporating reranking.
    
    Args:
        query (str): User query
        vector_store (SimpleVectorStore): Vector store
        reranking_method (str): Method for reranking ('llm' or 'keywords')
        top_n (int): Number of results to return after reranking
        model (str): Model for response generation
        
    Returns:
        Dict: Results including query, context, and response
    """
    # Create query embedding
    query_embedding = create_embeddings(query)
    
    # Initial retrieval (get more than we need for reranking)
    initial_results = vector_store.similarity_search(query_embedding, k=10)
    
    print(f"Get similar {len(initial_results)} documents from vector store")
    # Apply reranking
    if reranking_method == "llm":
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n)
    elif reranking_method == "keywords":
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n)
    else:
        # No reranking, just use top results from initial retrieval
        reranked_results = initial_results[:top_n]
    
    # Combine context from reranked results
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])
    
    # Generate response based on context
    response = generate_response(query, context, model)
    
    return {
        "query": query,
        "reranking_method": reranking_method,
        "initial_results": initial_results[:top_n],
        "reranked_results": reranked_results,
        "context": context,
        "response": response
    }
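
Since the returned dictionary carries both `initial_results` and `reranked_results`, it is possible to see how far each chunk moved. The helper below is a hypothetical sketch that relies only on the `metadata["index"]` field set in `process_document`; the sample inputs are made up.

```python
def rank_changes(initial_results, reranked_results):
    """List each reranked chunk with its position before and after reranking."""
    # Map chunk index -> 1-based position in the initial ordering
    initial_rank = {
        result["metadata"]["index"]: position
        for position, result in enumerate(initial_results, start=1)
    }
    changes = []
    for new_rank, result in enumerate(reranked_results, start=1):
        chunk_index = result["metadata"]["index"]
        changes.append({
            "index": chunk_index,
            "old_rank": initial_rank.get(chunk_index),  # None if absent from initial_results
            "new_rank": new_rank,
        })
    return changes


initial = [{"metadata": {"index": 7}}, {"metadata": {"index": 2}}, {"metadata": {"index": 9}}]
reranked = [{"metadata": {"index": 9}}, {"metadata": {"index": 7}}, {"metadata": {"index": 4}}]
changes = rank_changes(initial, reranked)
for change in changes:
    print(change)
```

Note that `rag_with_reranking` stores only the first `top_n` initial results, so a chunk promoted from further down the candidate list shows up with `old_rank` set to `None`.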

In [16]:
# Load the validation data from a JSON file
with open('data/val.json') as f:
    data = json.load(f)

query_index = 4
# Extract the query at query_index from the validation data
query = data[query_index]['question']

# Extract the reference answer from the validation data
reference_answer = data[query_index]['ideal_answer']

# pdf_path
pdf_path = "data/AI_Information.pdf"

In [19]:
# Process document
vector_store = process_document(pdf_path)

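# Example query (this overrides the query loaded from data/val.json above;
# reference_answer still comes from data[query_index])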
query = "Does AI have the potential to transform the way we live and work?"

# Compare different methods
print("Comparing retrieval methods...")

# 1. Standard retrieval (no reranking)
print("\n=== STANDARD RETRIEVAL ===")
standard_results = rag_with_reranking(query, vector_store, reranking_method="none")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{standard_results['response']}")

Extracting text from PDF...
Chunking text...
Created 42 text chunks
Creating embeddings for chunks...
Added 42 chunks to the vector store
Comparing retrieval methods...

=== STANDARD RETRIEVAL ===
Get similar 10 documents from vector store

Query: Does AI have the potential to transform the way we live and work?

Response:


Yes, AI has significant potential to transform both how we live and work, as outlined in the context. Here’s a comprehensive analysis based on the provided information:

### **Transformation in Work**  
1. **Automation and Efficiency**:  
   - AI automates repetitive tasks in industries like finance (e.g., algorithmic trading) and customer service (e.g., chatbots), increasing efficiency and reducing costs.  
   - It optimizes business operations by analyzing data, predicting market trends, and streamlining processes, leading to improved decision-making and productivity.  

2. **Job Displacement and New Opportunities**:  
   - While AI may displace roles involving r

In [20]:
# 2. LLM-based reranking
print("\n=== LLM-BASED RERANKING ===")
llm_results = rag_with_reranking(query, vector_store, reranking_method="llm")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{llm_results['response']}")



=== LLM-BASED RERANKING ===
Get similar 10 documents from vector store
Reranking 10 documents... with LLM model Qwen/Qwen3-8B
results[0]:  {'text': 'agement, algorithmic trading, and \ncustomer service. AI-powered systems analyze large datasets to identify patterns, predict market \nmovements, and automate financial processes. \nChapter 8: AI and the Future of Work \nAutomation and Job Displacement \nThe increasing capabilities of AI raise concerns about job displacement, particularly in industries \nwith repetitive or routine tasks. While AI may automate some jobs, it also creates new \nopportunities and transforms existing roles. \nReskilling and Upskilling \nAddressing the potential impacts of AI on the workforce requires reskilling and upskilling \ninitiatives. These programs equip workers with the skills needed to adapt to new roles and \ncollaborate with AI systems. \nHuman-AI Collaboration \nThe future of work is likely to involve increased collaboration between humans and AI s

In [21]:
# 3. Keyword-based reranking
print("\n=== KEYWORD-BASED RERANKING ===")
keyword_results = rag_with_reranking(query, vector_store, reranking_method="keywords")
print(f"\nQuery: {query}")
print(f"\nResponse:\n{keyword_results['response']}")


=== KEYWORD-BASED RERANKING ===
Get similar 10 documents from vector store
Reranking 10 documents... with keyword matching
results[0]:   {'text': 'agement, algorithmic trading, and \ncustomer service. AI-powered systems analyze large datasets to identify patterns, predict market \nmovements, and automate financial processes. \nChapter 8: AI and the Future of Work \nAutomation and Job Displacement \nThe increasing capabilities of AI raise concerns about job displacement, particularly in industries \nwith repetitive or routine tasks. While AI may automate some jobs, it also creates new \nopportunities and transforms existing roles. \nReskilling and Upskilling \nAddressing the potential impacts of AI on the workforce requires reskilling and upskilling \ninitiatives. These programs equip workers with the skills needed to adapt to new roles and \ncollaborate with AI systems. \nHuman-AI Collaboration \nThe future of work is likely to involve increased collaboration between humans and AI sys

In [24]:
def evaluate_reranking(query, standard_results, reranked_results, reference_answer=None):
    """
    Evaluates the quality of reranked results compared to standard results.
    
    Args:
        query (str): User query
        standard_results (Dict): Results from standard retrieval
        reranked_results (Dict): Results from reranked retrieval
        reference_answer (str, optional): Reference answer for comparison
        
    Returns:
        str: Evaluation output
    """
    # Define the system prompt for the AI evaluator
    system_prompt = """You are an expert evaluator of RAG systems.
    Compare the retrieved contexts and responses from two different retrieval methods.
    Assess which one provides better context and a more accurate, comprehensive answer."""
    
    # Prepare the comparison text with truncated contexts and responses
    comparison_text = f"""
    Query: {query}

    Standard Retrieval Context:
    {standard_results['context'][:1000]}... [truncated]

    Standard Retrieval Answer:
    {standard_results['response']}

    Reranked Retrieval Context:
    {reranked_results['context'][:1000]}... [truncated]

    Reranked Retrieval Answer:
    {reranked_results['response']}
    """

    # If a reference answer is provided, include it in the comparison text
    if reference_answer:
        comparison_text += f"""

        Reference Answer:
        {reference_answer}
        """

    # Create the user prompt for the AI evaluator
    user_prompt = f"""
        {comparison_text}

        Please evaluate which retrieval method provided:
        1. More relevant context
        2. More accurate answer
        3. More comprehensive answer
        4. Better overall performance

        Provide a detailed analysis with specific examples.
    """
    
    # Generate the evaluation response (the evaluator model is hard-coded here)
    response = client.chat.completions.create(
        model="Qwen/Qwen3-14B",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Return the evaluation output
    return response.choices[0].message.content

In [25]:
# Evaluate the quality of reranked results compared to standard results
evaluation = evaluate_reranking(
    query=query,  # The user query
    standard_results=standard_results,  # Results from standard retrieval
    reranked_results=llm_results,  # Results from LLM-based reranking
    reference_answer=reference_answer  # Reference answer for comparison
)

# Print the evaluation results
print("\n=== EVALUATION RESULTS ===")
print(evaluation)


=== EVALUATION RESULTS ===


### **Evaluation of Retrieval Methods for the Query: "Does AI have the potential to transform the way we live and work?"**

---

#### **1. More Relevant Context**  
**Both retrieval methods provided identical contexts**, which are truncated but focus on **AI's role in automation, job displacement, reskilling, human-AI collaboration, and new job roles** (e.g., finance, customer service, CRM, supply chain, and ethical considerations). Since the context is the same for both methods, **relevance is equal**. However, the **reranked retrieval answer** introduces **new points** (e.g., "social and environmental impact," "AI for social good initiatives") not present in the original context. This suggests the reranked method may have **inferred or extrapolated** beyond the provided text, which could reduce relevance if the goal is strict adherence to the given context.

---

#### **2. More Accurate Answer**  
**Standard Retrieval Answer** is **more accurate** becaus