# 12.2 Vector Search Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-org/ai-first-app/blob/main/demos/12-rag-memory/vector_search.ipynb)

**Estimated API cost: ~$0.01**

This notebook demonstrates how embeddings and vector search work.

In [None]:
!pip install -q openai chromadb numpy

## Experiment 1: Embedding Basics

In [None]:
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

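# Test sentences: the first two are paraphrases, the third differs in meaning
texts = [
    "Dogs are humans' friends",
    "Canines are good companions to humans",
    "Cats are independent animals"
]

print("=== Text vectorization ===")
vectors = {}
for text in texts:
    vec = get_embedding(text)
    vectors[text] = vec
    # Show only the first two and the last component, plus the vector length
    print(f"{text}: [{vec[0]:.4f}, {vec[1]:.4f}, ..., {vec[-1]:.4f}] ({len(vec)} dimensions)")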


In [None]:
# Compute cosine similarity: the dot product divided by the product of the vector lengths
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print("\n=== Similarity ===")
for i, text1 in enumerate(texts):
    for text2 in texts[i+1:]:
        sim = cosine_similarity(vectors[text1], vectors[text2])
        print(f"{text1} vs {text2}: {sim:.4f}")
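To make the formula concrete without calling the API, here is a minimal sketch that writes cosine similarity out with plain Python arithmetic. The function name `cosine_similarity_plain` and the three-dimensional vectors are made up for illustration; real embeddings have many more dimensions.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Cosine similarity spelled out: dot product over the product of lengths."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# Hypothetical toy vectors, chosen by hand (not real embeddings)
base = [1.0, 0.0, 0.0]
scaled = [2.0, 0.0, 0.0]      # same direction as base, twice the length
orthogonal = [0.0, 1.0, 0.0]  # at a right angle to base
opposite = [-1.0, 0.0, 0.0]   # points the other way

sim_scaled = cosine_similarity_plain(base, scaled)
sim_orthogonal = cosine_similarity_plain(base, orthogonal)
sim_opposite = cosine_similarity_plain(base, opposite)

print(f"base vs scaled:     {sim_scaled:.4f}")
print(f"base vs orthogonal: {sim_orthogonal:.4f}")
print(f"base vs opposite:   {sim_opposite:.4f}")
```

Only direction matters: scaling a vector leaves the score unchanged, which is why cosine similarity is a common choice for comparing embeddings.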

## Experiment 2: Vector Search with ChromaDB

In [None]:
import chromadb
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=client.api_key,
    model_name="text-embedding-3-small"
)

client_chroma = chromadb.Client()
# get_or_create_collection avoids an "already exists" error when this cell is re-run
collection = client_chroma.get_or_create_collection(
    name="demo_collection",
    embedding_function=openai_ef
)

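# Add documents
documents = [
    "Python is an easy-to-learn programming language",
    "JavaScript is used for web development",
    "Python is widely used in data science",
    "Java is the first choice for enterprise development",
    "Python's syntax is concise and elegant"
]

# Each document needs a unique ID
ids = [f"doc{i}" for i in range(len(documents))]

collection.add(documents=documents, ids=ids)

print(f"✅ Added {len(documents)} documents")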

In [None]:
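# Search
test_queries = [
    "Which language is suitable for beginners?",
    "Which language is used for web development?",
    "Which language is recommended for data analysis?"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}")

    results = collection.query(query_texts=[query], n_results=2)

    print("\nMost relevant documents:")
    # Results come back sorted by distance; a smaller distance means a closer match
    for i, doc in enumerate(results['documents'][0]):
        distance = results['distances'][0][i]
        print(f"{i+1}. {doc} (distance: {distance:.4f})")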
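Conceptually, a query ranks stored vectors by their distance to the query vector and returns the closest few. The sketch below does this with an exact scan over hand-picked two-dimensional vectors. It rests on two assumptions: that the collection uses Chroma's default metric (squared L2), and that the vectors are unit length, as OpenAI documents for its embedding models. Under those assumptions the distance equals `2 - 2 * cosine_similarity`, which links this experiment to Experiment 1. The helper names and toy vectors are hypothetical, and Chroma itself uses an approximate HNSW index rather than a full scan.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(vec1, vec2):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2))

def brute_force_query(query_vec, doc_vecs, n_results=2):
    """Return (doc_id, distance) pairs for the n_results closest documents."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # ascending: closest first
    return scored[:n_results]

# Hypothetical unit vectors standing in for real embeddings
toy_docs = {
    "doc0": normalize([1.0, 0.1]),
    "doc1": normalize([0.0, 1.0]),
    "doc2": normalize([0.9, 0.5]),
}
toy_query = normalize([1.0, 0.0])

top_hits = brute_force_query(toy_query, toy_docs, n_results=2)
for doc_id, dist in top_hits:
    cos = sum(a * b for a, b in zip(toy_query, toy_docs[doc_id]))
    print(f"{doc_id}: squared L2 = {dist:.4f}, 2 - 2*cos = {2 - 2 * cos:.4f}")
```

For unit-length vectors, ranking by ascending squared L2 distance gives the same order as ranking by descending cosine similarity, so the two experiments agree on which document is closest.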

## Experiment 3: Comparing Embedding Models

In [None]:
# Compare small vs. large
models = ["text-embedding-3-small", "text-embedding-3-large"]

query = "machine learning"
docs_to_compare = ["Deep learning is a branch of AI", "The weather forecast says it will rain tomorrow"]

print(f"Query: {query}\n")

for model in models:
    print(f"=== {model} ===")
    # Query and documents must be embedded with the same model to be comparable
    query_vec = client.embeddings.create(model=model, input=query).data[0].embedding

    for doc in docs_to_compare:
        doc_vec = client.embeddings.create(model=model, input=doc).data[0].embedding
        sim = cosine_similarity(query_vec, doc_vec)
        print(f"{doc}: {sim:.4f}")

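## Hands-on Exercises

1. **Test multiple languages**: Try searches that mix Chinese and English
2. **Adjust n_results**: Observe how returning more results changes the output
3. **Metadata filtering**: Add metadata to the documents and filter searches by it
4. **Performance test**: Measure search speed with a large number of documents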
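As a starting point for the performance exercise, here is a minimal timing sketch. `time_search` is a hypothetical helper that measures the average wall-clock time of any search callable. The stand-in search is an exact scan over random vectors, not real embeddings, so the script runs without an API key.

```python
import random
import time

def time_search(search_fn, queries, repeats=3):
    """Return the average wall-clock seconds per call of search_fn(query)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))

# Hypothetical stand-in data: 1000 random 16-dimensional vectors
random.seed(0)
DIM = 16
fake_index = [[random.random() for _ in range(DIM)] for _ in range(1000)]

def brute_force_search(query_vec, n_results=2):
    """Return the indices of the n_results closest vectors by squared L2 distance."""
    def squared_l2(vec):
        return sum((a - b) ** 2 for a, b in zip(query_vec, vec))
    ranked = sorted(range(len(fake_index)), key=lambda i: squared_l2(fake_index[i]))
    return ranked[:n_results]

fake_queries = [[random.random() for _ in range(DIM)] for _ in range(5)]
avg_seconds = time_search(brute_force_search, fake_queries)
print(f"Average time per query: {avg_seconds * 1000:.2f} ms")
```

To time the Chroma collection instead, pass something like `lambda q: collection.query(query_texts=[q], n_results=2)` together with `test_queries`. That measurement includes the round trip to the embedding API. To isolate the index lookup, one option is to embed the queries beforehand and pass them via `query_embeddings`.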

---

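## Key Takeaways

1. **Embeddings turn text into vectors**: semantically similar text → nearby vectors
2. **Cosine similarity**: a measure of how similar two vectors are, based on the angle between them
3. **ChromaDB is easy to use**: well suited to rapid prototyping and local development
4. **Model choice**: small is faster and cheaper; large is generally more accurate but costs more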

---

**Next**: [12.3 Advanced RAG](./advanced_rag.ipynb)