# 12.3 Advanced RAG Techniques

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-org/ai-first-app/blob/main/demos/12-rag-memory/advanced_rag.ipynb)

**Estimated API cost: ~$0.05**

This notebook demonstrates three advanced RAG techniques: chunking strategies, reranking, and hybrid search.

In [None]:
!pip install -q langchain langchain-openai chromadb cohere rank-bm25 jieba

## Experiment 1: Comparing Chunking Strategies

In [None]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

text = """
Python 异常处理最佳实践

1. 使用具体的异常类型
不要捕获所有异常,应该捕获具体的异常类型。

2. 避免空的 except 块
空的 except 块会隐藏错误。

3. 使用 finally 清理资源
finally 块总是会执行,适合清理资源。
"""

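# Strategy 1: fixed character budget (splits on "\n\n", then merges pieces up to chunk_size)
print("=== Strategy 1: fixed character count ===")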
splitter1 = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks1 = splitter1.split_text(text)
for i, chunk in enumerate(chunks1):
    print(f"Chunk {i+1}: {chunk[:50]}...\n")

# Strategy 2: recursive chunking (recommended)
print("\n=== Strategy 2: recursive chunking ===")
splitter2 = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n", "\n", "。", " ", ""]
)
chunks2 = splitter2.split_text(text)
for i, chunk in enumerate(chunks2):
    print(f"Chunk {i+1}: {chunk[:50]}...\n")
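To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is not how the LangChain splitters work internally (they split on separators and then merge pieces); the helper name `chunk_with_overlap` and the sample string are assumptions made for illustration only.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for i, chunk in enumerate(demo_chunks):
    print(f"Chunk {i+1}: {chunk}")
```

Each chunk starts with the last three characters of the previous one. A larger overlap preserves more context across boundaries at the cost of storing and embedding more duplicated text.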

## Experiment 2: Reranking

In [None]:
import cohere
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

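# Prepare documents (kept in Chinese: "异常" means both "exception" and "abnormal", which makes retrieval ambiguous)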
docs_content = [
    "Python 使用 try-except 处理异常",
    "Java 使用 try-catch 处理异常",
    "异常情况下系统会重启",
    "Python 的异常处理非常灵活",
    "异常天气可能导致航班延误"
]

from langchain.schema import Document
docs = [Document(page_content=content) for content in docs_content]

# Build the vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# Query
query = "Python 如何处理异常?"

# Vector search, top 5 (Chroma returns a distance: lower means more similar)
print("=== Vector search results ===")
results_vector = vectorstore.similarity_search_with_score(query, k=5)
for i, (doc, score) in enumerate(results_vector):
    print(f"{i+1}. [{score:.4f}] {doc.page_content}")

# Rerank
print("\n=== Results after reranking ===")
co = cohere.Client("your-cohere-api-key")  # requires a Cohere API key
rerank_results = co.rerank(
    query=query,
    documents=docs_content,
    top_n=3,
    model="rerank-multilingual-v3.0"
)

for i, result in enumerate(rerank_results.results):
    print(f"{i+1}. [{result.relevance_score:.4f}] {docs_content[result.index]}")
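The retrieve-then-rerank pattern can also be sketched without any API key. In the sketch below, `rerank_candidates` and `char_overlap_score` are hypothetical names, and the character-overlap scorer is only a toy stand-in for a real rerank model; in practice the candidates would come from the first-stage vector search and the scoring function would call a cross-encoder or a hosted rerank endpoint.

```python
from typing import Callable


def char_overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of distinct query characters found in doc."""
    query_chars = {ch for ch in query if not ch.isspace()}
    if not query_chars:
        return 0.0
    return len(query_chars & set(doc)) / len(query_chars)


def rerank_candidates(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Second stage: rescore every first-stage candidate and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Pretend these came back from a first-stage retriever in a poor order.
candidates = [
    "异常天气可能导致航班延误",
    "Python 使用 try-except 处理异常",
    "Java 使用 try-catch 处理异常",
]
reranked = rerank_candidates("Python 如何处理异常?", candidates, char_overlap_score, top_n=2)
for rank, (doc, score) in enumerate(reranked, start=1):
    print(f"{rank}. [{score:.4f}] {doc}")
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be swapped independently.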

## Experiment 3: Hybrid Search

In [None]:
from rank_bm25 import BM25Okapi
import jieba

# BM25 keyword search
class BM25Searcher:
    def __init__(self, documents):
        self.documents = documents
        # Tokenize with jieba (Chinese word segmentation)
        tokenized_docs = [list(jieba.cut(doc)) for doc in documents]
        self.bm25 = BM25Okapi(tokenized_docs)
    
    def search(self, query, k=3):
        tokenized_query = list(jieba.cut(query))
        scores = self.bm25.get_scores(tokenized_query)
        top_k_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [(self.documents[i], scores[i]) for i in top_k_idx]

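# Min-max normalize scores to [0, 1] so the two retrievers are comparable
def min_max(values):
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hybrid search
def hybrid_search(query, vectorstore, bm25_searcher, k=3, alpha=0.5):
    # Vector search (Chroma returns a distance: lower means more similar)
    vector_results = vectorstore.similarity_search_with_score(query, k=k*2)
    
    # BM25 search
    bm25_results = bm25_searcher.search(query, k=k*2)
    
    # Normalize each score list, then fuse with weight alpha
    combined = {}
    
    # Negate distances so that larger means more similar before normalizing
    vector_sims = min_max([-score for _, score in vector_results])
    for (doc, _), sim in zip(vector_results, vector_sims):
        text = doc.page_content
        combined[text] = combined.get(text, 0) + sim * alpha
    
    bm25_norm = min_max([score for _, score in bm25_results])
    for (text, _), score in zip(bm25_results, bm25_norm):
        combined[text] = combined.get(text, 0) + score * (1 - alpha)
    
    # Sort by fused score, highest first
    sorted_results = sorted(combined.items(), key=lambda x: x[1], reverse=True)[:k]
    return sorted_results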

# Test
bm25_searcher = BM25Searcher(docs_content)

print("=== Hybrid search results ===")
hybrid_results = hybrid_search(query, vectorstore, bm25_searcher, k=3)
for i, (doc, score) in enumerate(hybrid_results):
    print(f"{i+1}. [{score:.4f}] {doc}")
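To see how `alpha` shifts the ranking without calling any API, the sketch below fuses hand-picked toy scores. The document ids, the distances, and the BM25 values are invented for illustration, and `normalize` and `fuse_scores` are hypothetical helper names; the fusion rule is the weighted sum used in this experiment, applied after min-max normalization.

```python
def normalize(values: list[float]) -> list[float]:
    """Min-max scale values into [0, 1]; a constant list maps to all 1.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_scores(
    vector_distances: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float,
) -> list[tuple[str, float]]:
    """Return (doc_id, fused_score) pairs, best first.

    alpha weights the vector side and (1 - alpha) weights the BM25 side.
    """
    vec_ids = list(vector_distances)
    # Distances are "lower is better", so negate before normalizing.
    vec_sims = normalize([-vector_distances[d] for d in vec_ids])
    kw_ids = list(bm25_scores)
    kw_norm = normalize([bm25_scores[d] for d in kw_ids])

    fused: dict[str, float] = {}
    for doc_id, sim in zip(vec_ids, vec_sims):
        fused[doc_id] = fused.get(doc_id, 0.0) + alpha * sim
    for doc_id, score in zip(kw_ids, kw_norm):
        fused[doc_id] = fused.get(doc_id, 0.0) + (1 - alpha) * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy scores: A is closest in embedding space, B matches the keywords best.
toy_distances = {"A": 0.20, "B": 0.35, "C": 0.50}
toy_bm25 = {"A": 0.5, "B": 2.5, "C": 0.0}

for alpha in (0.3, 0.5, 0.7):
    ranking = fuse_scores(toy_distances, toy_bm25, alpha)
    print(alpha, [doc_id for doc_id, _ in ranking])
```

With these toy numbers the keyword-heavy settings put B first, while `alpha=0.7` lets the embedding-favored document A take the top spot.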

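## Hands-on Exercises

1. **Compare chunk sizes**: chunk_size 200 vs 500 vs 1000
2. **Tune the hybrid search weight**: alpha 0.3 vs 0.5 vs 0.7
3. **Implement query rewriting**: use an LLM to rewrite ambiguous queries
4. **Add metadata filtering**: search only documents in a specific category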
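As a possible starting point for the metadata filtering exercise, the sketch below filters a small in-memory corpus in plain Python before any search runs. The `Doc` class, the `category` field, and `filter_by_metadata` are assumptions made for illustration; vector stores usually expose their own filter argument, which would replace this step.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs: list[Doc], **required) -> list[Doc]:
    """Keep only docs whose metadata matches every required key/value pair."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in required.items())
    ]


corpus = [
    Doc("Python 使用 try-except 处理异常", {"category": "programming"}),
    Doc("Java 使用 try-catch 处理异常", {"category": "programming"}),
    Doc("异常天气可能导致航班延误", {"category": "weather"}),
]

programming_docs = filter_by_metadata(corpus, category="programming")
for doc in programming_docs:
    print(doc.text)
```

Filtering first shrinks the candidate set, so an off-topic sentence that happens to share a keyword never reaches the ranking stage.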

---

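## Key Takeaways

1. **Recursive chunking is a sensible default**: it tries paragraph, line, and sentence boundaries first, which tends to keep chunks semantically intact
2. **Reranking improves precision**: a second ranking pass improves the quality of the top-K results
3. **Hybrid search**: vector + BM25 covers both semantic similarity and exact keyword matches
4. **Weight tuning**: the alpha parameter sets the balance between vector and BM25 scores, which changes the ranking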

---

**Next**: [12.4 Memory Management](./memory_chatbot.ipynb)