# 法律问答中的长上下文 RAG：

构建一个智能代理系统，从复杂的法律文档中回答问题。

## 下载document

In [1]:
import requests
from io import BytesIO
from pypdf import PdfReader
import re
import tiktoken
from nltk.tokenize import sent_tokenize
import nltk
from typing import List, Dict, Any

# Download nltk data if not already present
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/blackink/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [2]:
def load_document(pdf_path: str) -> str:
    """Load a document from a URL and return its text content."""

    pdf_reader = PdfReader(pdf_path)
    full_text = ""


    max_page = 100  # Page cutoff before section 1000 (Interferences)
    for i, page in enumerate(pdf_reader.pages):
        if i >= max_page:
            break
        full_text += page.extract_text() + "\n"

    # Count words and tokens
    word_count = len(re.findall(r'\b\w+\b', full_text))

    tokenizer = tiktoken.get_encoding("o200k_base")
    token_count = len(tokenizer.encode(full_text))

    print(f"Document loaded: {len(pdf_reader.pages)} pages, {word_count} words, {token_count} tokens")
    return full_text

In [3]:
# Load the document
pdf_path = "../../data/tbmp-Master-June2024.pdf"
document_text = load_document(pdf_path)

# Show the first 500 characters
print("\nDocument preview (first 500 chars):")
print("-" * 50)
print(document_text[:500])
print("-" * 50)

Document loaded: 1194 pages, 56856 words, 82496 tokens

Document preview (first 500 chars):
--------------------------------------------------
TRADEMARK TRIAL AND
APPEAL BOARD MANUAL
OF PROCEDURE (TBMP)
 June 2024
June   2024
United States Patent and Trademark Office
PREFACE TO THE JUNE 2024 REVISION
The June 2024 revision of the Trademark Trial and Appeal Board Manual of Procedure is an update of the
June 2023 edition. This update is moderate in nature and incorporates relevant case law issued between March
3, 2023 and March 1, 2024.
The title of the manual is abbreviated as “TBMP.” A citation to a section of the manual may be written
--------------------------------------------------


### 文本切分：高阶的20段分割器（依据Tokens大小限制）

现在，我们将创建一个高阶文本切分，用于将文档分割成20个片段。保证每个最小单元都是句子同时确保每个片段具有最小的Token数量。

> `20` 是针对该特定文档和任务通过经验确定的分段数。对于其他文档，可能需要根据其大小和结构进行调整。分段数量越高，粒度越细。
> 核心原则是在分割文档的不同部分后，**让语言模型自行判断哪些部分是相关的**。

In [4]:
# Global tokenizer name to use consistently throughout the code
TOKENIZER_NAME = "o200k_base"

def split_into_20_chunks(text: str, min_tokens: int = 500) -> List[Dict[str, Any]]:
    """
    Split text into up to 20 chunks, respecting sentence boundaries and ensuring
    each chunk has at least min_tokens (unless it's the last chunk).
    
    Args:
        text: The text to split
        min_tokens: The minimum number of tokens per chunk (default: 500)
    
    Returns:
        A list of dictionaries where each dictionary has:
        - id: The chunk ID (0-19)
        - text: The chunk text content
    """
    # First, split the text into sentences
    sentences = sent_tokenize(text)
    
    # Get tokenizer for counting tokens
    tokenizer = tiktoken.get_encoding(TOKENIZER_NAME)
    
    # Create chunks that respect sentence boundaries and minimum token count
    chunks = []
    current_chunk_sentences = []
    current_chunk_tokens = 0
    
    for sentence in sentences:
        # Count tokens in this sentence
        sentence_tokens = len(tokenizer.encode(sentence))
        
        # If adding this sentence would make the chunk too large AND we already have the minimum tokens,
        # finalize the current chunk and start a new one
        if (current_chunk_tokens + sentence_tokens > min_tokens * 2) and current_chunk_tokens >= min_tokens:
            chunk_text = " ".join(current_chunk_sentences)
            chunks.append({
                "id": len(chunks),  # Integer ID instead of string
                "text": chunk_text
            })
            current_chunk_sentences = [sentence]
            current_chunk_tokens = sentence_tokens
        else:
            # Add this sentence to the current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_tokens += sentence_tokens
    
    # Add the last chunk if there's anything left
    if current_chunk_sentences:
        chunk_text = " ".join(current_chunk_sentences)
        chunks.append({
            "id": len(chunks),  # Integer ID instead of string
            "text": chunk_text
        })
    
    # If we have more than 20 chunks, consolidate them
    if len(chunks) > 20:
        # Recombine all text
        all_text = " ".join(chunk["text"] for chunk in chunks)
        # Re-split into exactly 20 chunks, without minimum token requirement
        sentences = sent_tokenize(all_text)
        sentences_per_chunk = len(sentences) // 20 + (1 if len(sentences) % 20 > 0 else 0)
        
        chunks = []
        for i in range(0, len(sentences), sentences_per_chunk):
            # Get the sentences for this chunk
            chunk_sentences = sentences[i:i+sentences_per_chunk]
            # Join the sentences into a single text
            chunk_text = " ".join(chunk_sentences)
            # Create a chunk object with ID and text
            chunks.append({
                "id": len(chunks),  # Integer ID instead of string
                "text": chunk_text
            })
    
    # Print chunk statistics
    print(f"Split document into {len(chunks)} chunks")
    for i, chunk in enumerate(chunks):
        token_count = len(tokenizer.encode(chunk["text"]))
        print(f"Chunk {i}: {token_count} tokens")
    
    return chunks

# Split the document into 20 chunks with minimum token size
document_chunks = split_into_20_chunks(document_text, min_tokens=500)

Split document into 20 chunks
Chunk 0: 5681 tokens
Chunk 1: 4722 tokens
Chunk 2: 3519 tokens
Chunk 3: 4197 tokens
Chunk 4: 3627 tokens
Chunk 5: 3491 tokens
Chunk 6: 3132 tokens
Chunk 7: 4664 tokens
Chunk 8: 3734 tokens
Chunk 9: 4707 tokens
Chunk 10: 4189 tokens
Chunk 11: 3413 tokens
Chunk 12: 3834 tokens
Chunk 13: 5516 tokens
Chunk 14: 4785 tokens
Chunk 15: 3916 tokens
Chunk 16: 4000 tokens
Chunk 17: 3470 tokens
Chunk 18: 4598 tokens
Chunk 19: 3451 tokens


In [5]:
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

## 相关性判断路由（调用工具）----GPT-4.1-mini
现在，我们来创建一个路由函数，用于选择相关的文本块，并维护一个记录。

维护一个推理记录可以让模型在推理过程中持续记录决策标准和推理依据。、
使用GPT-4.1-mini采用了两步实验：
- 第一步要求模型通过工具调用（tool_choice="required"）来更新推理记录
- 第二步则请求模型以结构化 JSON 格式输出所选择的文本块。
这种方法可以让我们更清楚地了解模型的推理过程，同时保证为下游处理提供一致、结构化的输出。

In [6]:
from openai import OpenAI
import json
from typing import List, Dict, Any

# Initialize OpenAI client
client = OpenAI()

def route_chunks(question: str, chunks: List[Dict[str, Any]], 
                depth: int, scratchpad: str = "") -> Dict[str, Any]:
    """
    Ask the model which chunks contain information relevant to the question.
    Maintains a scratchpad for the model's reasoning.
    Uses structured output for chunk selection and required tool calls for scratchpad.
    
    Args:
        question: The user's question
        chunks: List of chunks to evaluate
        depth: Current depth in the navigation hierarchy
        scratchpad: Current scratchpad content
    
    Returns:
        Dictionary with selected IDs and updated scratchpad
    """
    print(f"\n==== ROUTING AT DEPTH {depth} ====")
    print(f"Evaluating {len(chunks)} chunks for relevance")
    
    # Build system message
    system_message = """You are an expert document navigator. Your task is to:
1. Identify which text chunks might contain information to answer the user's question
2. Record your reasoning in a scratchpad for later reference
3. Choose chunks that are most likely relevant. Be selective, but thorough. Choose as many chunks as you need to answer the question, but avoid selecting too many.

First think carefully about what information would help answer the question, then evaluate each chunk.
"""

    # Build user message with chunks and current scratchpad
    user_message = f"QUESTION: {question}\n\n"
    
    if scratchpad:
        user_message += f"CURRENT SCRATCHPAD:\n{scratchpad}\n\n"
    
    user_message += "TEXT CHUNKS:\n\n"
    
    # Add each chunk to the message
    for chunk in chunks:
        user_message += f"CHUNK {chunk['id']}:\n{chunk['text']}\n\n"
    
    # Define function schema for scratchpad tool calling
    tools = [
        {
            "type": "function",
            "name": "update_scratchpad",
            "description": "Record your reasoning about why certain chunks were selected",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "Your reasoning about the chunk(s) selection"
                    }
                },
                "required": ["text"],
                "additionalProperties": False
            }
        }
    ]
    
    # Define JSON schema for structured output (selected chunks)
    text_format = {
        "format": {
            "type": "json_schema",
            "name": "selected_chunks",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "chunk_ids": {
                        "type": "array",
                        "items": {"type": "integer"},
                        "description": "IDs of the selected chunks that contain information to answer the question"
                    }
                },
                "required": [
                    "chunk_ids"
                ],
                "additionalProperties": False
            }
        }
    }
    
    # First pass: Call the model to update scratchpad (required tool call)
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message + "\n\nFirst, you must use the update_scratchpad function to record your reasoning."}
    ]
    
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=messages,
        tools=tools,
        tool_choice="required"
    )
    
    # Process the scratchpad tool call
    new_scratchpad = scratchpad
    
    for tool_call in response.output:
        if tool_call.type == "function_call" and tool_call.name == "update_scratchpad":
            args = json.loads(tool_call.arguments)
            scratchpad_entry = f"DEPTH {depth} REASONING:\n{args.get('text', '')}"
            if new_scratchpad:
                new_scratchpad += "\n\n" + scratchpad_entry
            else:
                new_scratchpad = scratchpad_entry
            
            # Add function call and result to messages
            messages.append(tool_call)
            messages.append({
                "type": "function_call_output",
                "call_id": tool_call.call_id,
                "output": "Scratchpad updated successfully."
            })
    
    # Second pass: Get structured output for chunk selection
    messages.append({"role": "user", "content": "Now, select the chunks that could contain information to answer the question. Return a JSON object with the list of chunk IDs."})
    
    response_chunks = client.responses.create(
        model="gpt-4.1-mini",
        input=messages,
        text=text_format
    )
    
    # Extract selected chunk IDs from structured output
    selected_ids = []
    if response_chunks.output_text:
        try:
            # The output_text should already be in JSON format due to the schema
            chunk_data = json.loads(response_chunks.output_text)
            selected_ids = chunk_data.get("chunk_ids", [])
        except json.JSONDecodeError:
            print("Warning: Could not parse structured output as JSON")
    
    # Display results
    print(f"Selected chunks: {', '.join(str(id) for id in selected_ids)}")
    print(f"Updated scratchpad:\n{new_scratchpad}")
    
    return {
        "selected_ids": selected_ids,
        "scratchpad": new_scratchpad
    }

## 分层查询相关性

现在，我们来创建一个逐层查询，用于深入遍历文档。**max_depth**是指允许向下递归的最大层数（同时需要注意最小 token 数量的限制）

In [7]:
def navigate_to_paragraphs(document_text: str, question: str, max_depth: int = 1) -> Dict[str, Any]:
    """
    Navigate through the document hierarchy to find relevant paragraphs.
    
    Args:
        document_text: The full document text
        question: The user's question
        max_depth: Maximum depth to navigate before returning paragraphs (default: 1)
    
    Returns:
        Dictionary with selected paragraphs and final scratchpad
    """
    scratchpad = ""
    
    # Get initial chunks with min 500 tokens
    chunks = split_into_20_chunks(document_text, min_tokens=500)
    
    # Navigator state - track chunk paths to maintain hierarchy
    chunk_paths = {}  # Maps numeric IDs to path strings for display
    for chunk in chunks:
        chunk_paths[chunk["id"]] = str(chunk["id"])
    
    # Navigate through levels until max_depth or until no chunks remain
    for current_depth in range(max_depth + 1):
        # Call router to get relevant chunks
        result = route_chunks(question, chunks, current_depth, scratchpad)
        
        # Update scratchpad
        scratchpad = result["scratchpad"]
        
        # Get selected chunks
        selected_ids = result["selected_ids"]
        selected_chunks = [c for c in chunks if c["id"] in selected_ids]
        
        # If no chunks were selected, return empty result
        if not selected_chunks:
            print("\nNo relevant chunks found.")
            return {"paragraphs": [], "scratchpad": scratchpad}
        
        # If we've reached max_depth, return the selected chunks
        if current_depth == max_depth:
            print(f"\nReturning {len(selected_chunks)} relevant chunks at depth {current_depth}")
            
            # Update display IDs to show hierarchy
            for chunk in selected_chunks:
                chunk["display_id"] = chunk_paths[chunk["id"]]
                
            return {"paragraphs": selected_chunks, "scratchpad": scratchpad}
        
        # Prepare next level by splitting selected chunks further
        next_level_chunks = []
        next_chunk_id = 0  # Counter for new chunks
        
        for chunk in selected_chunks:
            # Split this chunk into smaller pieces
            sub_chunks = split_into_20_chunks(chunk["text"], min_tokens=200)
            
            # Update IDs and maintain path mapping
            for sub_chunk in sub_chunks:
                path = f"{chunk_paths[chunk['id']]}.{sub_chunk['id']}"
                sub_chunk["id"] = next_chunk_id
                chunk_paths[next_chunk_id] = path
                next_level_chunks.append(sub_chunk)
                next_chunk_id += 1
        
        # Update chunks for next iteration
        chunks = next_level_chunks

## Example

In [8]:
# Run the navigation for a sample question
question = "What format should a motion to compel discovery be filed in? How should signatures be handled?"
navigation_result = navigate_to_paragraphs(document_text, question, max_depth=2)

# Sample retrieved paragraph
print("\n==== FIRST 3 RETRIEVED PARAGRAPHS ====")
for i, paragraph in enumerate(navigation_result["paragraphs"][:3]):
    display_id = paragraph.get("display_id", str(paragraph["id"]))
    print(f"\nPARAGRAPH {i+1} (ID: {display_id}):")
    print("-" * 40)
    print(paragraph["text"])
    print("-" * 40)

Split document into 20 chunks
Chunk 0: 5681 tokens
Chunk 1: 4722 tokens
Chunk 2: 3519 tokens
Chunk 3: 4197 tokens
Chunk 4: 3627 tokens
Chunk 5: 3491 tokens
Chunk 6: 3132 tokens
Chunk 7: 4664 tokens
Chunk 8: 3734 tokens
Chunk 9: 4707 tokens
Chunk 10: 4189 tokens
Chunk 11: 3413 tokens
Chunk 12: 3834 tokens
Chunk 13: 5516 tokens
Chunk 14: 4785 tokens
Chunk 15: 3916 tokens
Chunk 16: 4000 tokens
Chunk 17: 3470 tokens
Chunk 18: 4598 tokens
Chunk 19: 3451 tokens

==== ROUTING AT DEPTH 0 ====
Evaluating 20 chunks for relevance
Selected chunks: 3, 4, 7, 11, 13
Updated scratchpad:
DEPTH 0 REASONING:
The question asks about the format and signature handling of a motion to compel discovery before the Trademark Trial and Appeal Board (TTAB). Relevant information would likely be in sections discussing motions, signature requirements, electronic filing procedures, and service of papers.

Chunks 3 and 4 discuss signature of submissions (§ 106.02) and form of submissions (§ 106.03) respectively, import

我们使用百万级长文阅读：**GPT-4.1-mini**进行迭代提取文档中的相关要素，并利用草稿板记录解释其思考过程！
这个过程表明，GPT-4.1 能像法律分析师一样工作：**逐层深入挖掘相关内容，并解释其推理过程**，这使得我们更容易**调试模型为何选取了这些内容段落**。


### 生成答案

现在，我们将使用**GPT-4.1** 和先前检索到的段落生成答案。

> 这里我们使用了一个巧妙的技巧：**动态构建一个Literal列表**，这样可以**强制模型的回答仅从我们提供的选项中选择**——在本例中是段落 ID。
> 不过这有一些限制：**我们最多只能提供一定数量的选项**，所以如果系统需要引用超过 500 个文档，这种方法可能就不适用了。
> 在这种情况下，你有两个选择：
* 设计一个过滤器，仅保留最多 500 个可能的引用项；
* 或者让模型在回答中**明确写出引用的 ID**，之后再通过后处理提取这些引用（例如模型可能会说：“... \[doc 0.0.12]”，你可以用正则表达式提取出这些引用 ID）。

In [9]:
from typing import List, Dict, Any
from pydantic import BaseModel, field_validator

class LegalAnswer(BaseModel):
    """Structured response format for legal questions"""
    answer: str
    citations: List[str]
    
    @field_validator('citations')
    def validate_citations(cls, citations, info):
        # Access valid_citations from the model_config
        valid_citations = info.data.get('_valid_citations', [])
        if valid_citations:
            for citation in citations:
                if citation not in valid_citations:
                    raise ValueError(f"Invalid citation: {citation}. Must be one of: {valid_citations}")
        return citations

def generate_answer(question: str, paragraphs: List[Dict[str, Any]], 
                   scratchpad: str) -> LegalAnswer:
    """Generate an answer from the retrieved paragraphs."""
    print("\n==== GENERATING ANSWER ====")
    
    # Extract valid citation IDs
    valid_citations = [str(p.get("display_id", str(p["id"]))) for p in paragraphs]
    
    if not paragraphs:
        return LegalAnswer(
            answer="I couldn't find relevant information to answer this question in the document.",
            citations=[],
            _valid_citations=[]
        )
    
    # Prepare context for the model
    context = ""
    for paragraph in paragraphs:
        display_id = paragraph.get("display_id", str(paragraph["id"]))
        context += f"PARAGRAPH {display_id}:\n{paragraph['text']}\n\n"
    
    system_prompt = """You are a legal research assistant answering questions about the 
Trademark Trial and Appeal Board Manual of Procedure (TBMP).

Answer questions based ONLY on the provided paragraphs. Do not rely on any foundation knowledge or external information or extrapolate from the paragraphs.
Cite phrases of the paragraphs that are relevant to the answer. This will help you be more specific and accurate.
Include citations to paragraph IDs for every statement in your answer. Valid citation IDs are: {valid_citations_str}
Keep your answer clear, precise, and professional.
"""
    valid_citations_str = ", ".join(valid_citations)
    
    # Call the model using structured output
    response = client.responses.parse(
        model="gpt-4.1",
        input=[
            {"role": "system", "content": system_prompt.format(valid_citations_str=valid_citations_str)},
            {"role": "user", "content": f"QUESTION: {question}\n\nSCRATCHPAD (Navigation reasoning):\n{scratchpad}\n\nPARAGRAPHS:\n{context}"}
        ],
        text_format=LegalAnswer,
        temperature=0.3
    )
    
    # Add validation information after parsing
    response.output_parsed._valid_citations = valid_citations
    
    print(f"\nAnswer: {response.output_parsed.answer}")
    print(f"Citations: {response.output_parsed.citations}")

    return response.output_parsed

# Generate an answer
answer = generate_answer(question, navigation_result["paragraphs"], 
                       navigation_result["scratchpad"])


==== GENERATING ANSWER ====

Answer: A motion to compel discovery must be filed electronically via ESTTA, unless ESTTA is unavailable due to technical problems or extraordinary circumstances, in which case a paper submission is permitted with a written explanation ("Submissions must be made to the Trademark Trial and Appeal Board via ESTTA"; "In the event that ESTTA is unavailable due to technical problems, or when extraordinary circumstances are present, submissions may be filed in paper form. All submissions in paper form... must include a written explanation of such technical problems or extraordinary circumstances"; 3.3.10.0). 

For electronic filings, the motion must be in at least 11-point type and double-spaced, and any exhibits must be attached electronically and be clear and legible ("Text in an electronic submission must be filed in at least 11-point type and double-spaced. Exhibits pertaining to an electronic submission must be made electronically as an attachment to the su

### 3.7 结果验证

首先看一下具体的引用片段

In [10]:
cited_paragraphs = []
for paragraph in navigation_result["paragraphs"]:
    para_id = str(paragraph.get("display_id", str(paragraph["id"])))
    if para_id in answer.citations:
        cited_paragraphs.append(paragraph)
    

# Display the cited paragraphs for the audience
print("\n==== CITED PARAGRAPHS ====")
for i, paragraph in enumerate(cited_paragraphs):
    display_id = paragraph.get("display_id", str(paragraph["id"]))
    print(f"\nPARAGRAPH {i+1} (ID: {display_id}):")
    print("-" * 40)
    print(paragraph["text"])
    print("-" * 40)


==== CITED PARAGRAPHS ====

PARAGRAPH 1 (ID: 3.0.0):
----------------------------------------
§ 2.194. 106.02  Signature of Submissions
37 C.F.R. § 2.119(e) Every submission filed in an inter partes proceeding, and every request for an extension
of time to file an opposition, must be signed by the party filing it, or by the party’s attorney or other authorized
representative, but an unsigned submission will not be r efused consideration if a signed copy is submitted
to the Office within the time limit set in the notification of this defect by the Office. 37 C.F.R. § 11.14(e) Appearance. No individual other than those specified in par agraphs (a), (b), and (c)
of this section will be permitted to pr actice before the Office in tr ademark matters on behalf of a client. Except as specified in § 2.11(a) of this chapter, an individual may appear in a trademark or other non-patent
matter in his or her own behalf or on behalf of:
(1)   A firm of which he or she is a member;
(2)   A partnersh

接下来使用`LLM-as-judge`对答案进行验证 （o4-mini）

In [11]:
from typing import List, Dict, Any, Literal
from pydantic import BaseModel

class VerificationResult(BaseModel):
    """Verification result format"""
    is_accurate: bool
    explanation: str
    confidence: Literal["high", "medium", "low"]

def verify_answer(question: str, answer: LegalAnswer, 
                 cited_paragraphs: List[Dict[str, Any]]) -> VerificationResult:
    """
    Verify if the answer is grounded in the cited paragraphs.
    
    Args:
        question: The user's question
        answer: The generated answer
        cited_paragraphs: Paragraphs cited in the answer
        
    Returns:
        Verification result with accuracy assessment, explanation, and confidence level
    """
    print("\n==== VERIFYING ANSWER ====")
    
    # Prepare context with the cited paragraphs
    context = ""
    for paragraph in cited_paragraphs:
        display_id = paragraph.get("display_id", str(paragraph["id"]))
        context += f"PARAGRAPH {display_id}:\n{paragraph['text']}\n\n"
    
    # Prepare system prompt
    system_prompt = """You are a legal assistant. Your job is to analyze whether a provided answer is well-supported by the following source paragraphs. 
Explain how well the content aligns with the source, and provide an assessment of completeness, precision, and relevance."""

    
    response = client.responses.parse(
        model="o4-mini",
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"""
QUESTION: {question}

ANSWER TO VERIFY:
{answer.answer}

CITATIONS USED: {', '.join(answer.citations)}

SOURCE PARAGRAPHS:
{context}

Is this answer accurate and properly supported by the source paragraphs?
Assign a confidence level (high, medium, or low) based on completeness and accuracy.
            """}
        ],
        text_format=VerificationResult
    )
    
    # Log and return the verification result
    print(f"\nAccuracy verification: {'PASSED' if response.output_parsed.is_accurate else 'FAILED'}")
    print(f"Confidence: {response.output_parsed.confidence}")
    print(f"Explanation: {response.output_parsed.explanation}")
    
    return response.output_parsed

# Verify the answer using only the cited paragraphs
verification = verify_answer(question, answer, cited_paragraphs)

# Display final result with verification
print("\n==== FINAL VERIFIED ANSWER ====")
print(f"Verification: {'PASSED' if verification.is_accurate else 'FAILED'} | Confidence: {verification.confidence}")
print("\nAnswer:")
print(answer.answer)
print("\nCitations:")
for citation in answer.citations:
    print(f"- {citation}")


==== VERIFYING ANSWER ====


BadRequestError: Error code: 400 - {'error': {'message': 'Invalid prompt: your prompt was flagged as potentially violating our usage policy. Please try again with a different prompt: https://platform.openai.com/docs/guides/reasoning#advice-on-prompting', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_prompt'}}

## 成本消耗
我们来拆解一下这种 Agentic RAG 方法的成本结构：
### 预计的固定成本 vs. 变量成本
* **固定的一次性开销:**  
  * **传统RAG:** ~$0.43 (embedding + metadata generation)
  * **Agentic RAG:** $0.00 

* **每次查询开销:**  
  * **Router Model (`gpt-4.1-mini`):**  
    * I初始路由 (20 chunks): ~$0.10  
    * 多级迭代递归: ~$0.20
  * **生成答案 (`gpt-4.1`):** ~$0.05
  * **验证 (`o4-mini`):** ~$0.01
  * **总共** ~$0.36

尽管每次查询的成本高于传统RAG，但这种方法具有以下优势：

* 可立即从新文档中获取结果
* 更精确的引用
* 更好地处理转述和概念性问题
* 无需维护基础设施


## 与传统RAG的优势与权衡

### 优势

* **零预处理延迟**：无需预处理即可立即回答新文档中的问题。
* **动态查询**：模仿人类阅读模式，聚焦于更有前景的文档片段。
* **跨片段推理能力**：模型能够发现不同文档片段之间的关联，这种关联可能被传统独立片段检索所忽略，从而提高生成答案的准确性，并节省优化检索管道的时间。

### 权衡

* **更高的每次查询成本**：与基于嵌入的检索相比，每次查询需要更多的计算。
* **增加延迟**：分层导航处理的时间长于简单的向量查找。
* **有限的可扩展性**：在极大规模文档集合中，预处理可能更有效，此方法可能会遇到困难。

## 后续可改进的点

针对目前的方案，我们可以进行以下优化和扩展：

* **生成知识图谱**：我们可以利用GPT 4.1-mini的大型上下文窗口，迭代地生成详细的知识图谱，随后GPT 4.1可以基于此图谱回答问题，这样只需“摄入”一次文档，无需每次重新导航。
* **增强的暂存工具（Scratchpad）**：可以为暂存工具提供更多选择，例如编辑或删除已有记忆。这允许模型更灵活地选择与当前问题最相关的记忆片段。
* **调整导航深度**：我们可以通过调整分层导航的深度，平衡成本与性能。某些场景（如法律文档）可能需要精确到句子级别的引用，而其他场景（如新闻文章）可能只需要段落级别的引用即可。

## 关键核心总结

1. **上下文窗口能力超凡**：百万token的上下文窗口使即时导航文档成为可能。
2. **分层方法模拟人类阅读**：智能路由方法类似人类快速浏览文档中的相关片段。
3. **暂存工具支持多步推理**：保留推理记录能够提高导航质量。
4. **快速实现，无需数据库**：整个系统完全可以通过API调用搭建，不需要额外的基础设施。
5. **验证机制提高可靠性**：使用模型自身进行判断的模式能在结果交付用户前捕获错误。