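## Using vector databases
***
- Adding data to a vector store
- Deleting data from a vector store
- Similarity search in a vector store
- Advanced usage: MMR
- Advanced usage: hybrid search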

#### Adding data to a vector store
****

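Unless noted otherwise, the examples use a Chinese-developed embedding model, BAAI/bge-m3, served through an OpenAI-compatible endpoint.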

In [1]:
from langchain_openai import OpenAIEmbeddings
import os
embeddings_model = OpenAIEmbeddings(
    model="BAAI/bge-m3",
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    # os.environ[...] raises a clear KeyError if the base URL is not set
    base_url=os.environ["DEEPSEEK_API_BASE"] + "/v1",
)

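For demonstration purposes we use an in-memory vector store. It keeps the vectors in a dictionary in memory and uses numpy to compute cosine similarity at search time.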
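
To make the idea concrete, here is a minimal sketch of dictionary-backed cosine-similarity search. It is not the `InMemoryVectorStore` implementation: `toy_store`, `cosine_similarity`, and `toy_search` are hypothetical names, the 2-dimensional vectors are made up, and the sketch uses only the standard library rather than numpy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy store: document id -> embedding vector
toy_store = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}


def toy_search(query_vector, k=2):
    """Score every stored vector against the query and return the top k."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in toy_store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


top = toy_search([1.0, 0.1])
print([doc_id for doc_id, _ in top])
```

A brute-force scan like this is fine for a handful of documents; dedicated vector databases use approximate indexes to avoid scoring every vector.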

In [2]:
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embedding=embeddings_model)

In [7]:
from langchain_core.documents import Document

document_1 = Document(
    page_content="今天在抖音学会了一个新菜：锅巴土豆泥！看起来简单，实际炸了厨房，连猫都嫌弃地走开了。",
    metadata={"source": "社交媒体"},
)

document_2 = Document(
    page_content="小区遛狗大爷今日播报：广场舞大妈占领健身区，遛狗群众纷纷撤退。现场气氛诡异，BGM已循环播放《最炫民族风》两小时。",
    metadata={"source": "社区新闻"},
)

documents = [document_1, document_2]

vector_store.add_documents(documents=documents)

['d48967e1-214d-4407-a394-bf461275603a',
 '57567d0d-6624-4a6f-b034-2e783c485ce8']

You can also assign explicit IDs to the documents you add, which makes them easier to manage later.

In [4]:
vector_store.add_documents(documents=documents, ids=["doc1", "doc2"])

['doc1', 'doc2']

#### Deleting data from a vector store
****

In [5]:
vector_store.delete(ids=["doc1"])
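
The add and delete calls above can be pictured with a plain dictionary keyed by ID. This is only a sketch under that assumption: `toy_store`, `toy_add`, and `toy_delete` are hypothetical helpers, not LangChain functions, and other vector stores may handle duplicate IDs differently.

```python
import uuid

# Hypothetical toy store: document id -> text
toy_store = {}


def toy_add(texts, ids=None):
    """Store texts under the given ids, or under fresh UUIDs if none are given."""
    if ids is None:
        ids = [str(uuid.uuid4()) for _ in texts]
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    for doc_id, text in zip(ids, texts):
        toy_store[doc_id] = text  # an existing id is overwritten
    return ids


def toy_delete(ids):
    """Remove the given ids; unknown ids are ignored."""
    for doc_id in ids:
        toy_store.pop(doc_id, None)


toy_add(["first", "second"], ids=["doc1", "doc2"])
toy_add(["first (revised)"], ids=["doc1"])  # same id: replaces the old entry
auto_ids = toy_add(["first"])               # no ids: creates a new entry
toy_delete(["doc1"])
print(sorted(toy_store.values()))
```

In this model, adding the same text once without IDs and once with IDs leaves two separate entries, so choosing stable IDs up front is what makes later updates and deletions predictable.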

#### Similarity search in a vector store
****


In [6]:
query = "遛狗"
docs = vector_store.similarity_search(query)
print(docs[0].page_content)

小区遛狗大爷今日播报：广场舞大妈占领健身区，遛狗群众纷纷撤退。现场气氛诡异，BGM已循环播放《最炫民族风》两小时。


You can also search with a vector directly: embed the query yourself, then look up the stored vectors most similar to it.

In [8]:
embedding_vector = embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

小区遛狗大爷今日播报：广场舞大妈占领健身区，遛狗群众纷纷撤退。现场气氛诡异，BGM已循环播放《最炫民族风》两小时。


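Note: LangChain only wraps these operations behind a common interface; the actual search behavior depends on the capabilities of the underlying vector store. Pinecone, for example, supports dictionary-style metadata filtering, whereas the in-memory store does not (in recent langchain-core versions it accepts only a Python callable as `filter`).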

https://docs.pinecone.io/guides/get-started/quickstart
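
As a rough illustration of what metadata filtering means, the sketch below keeps only documents whose metadata matches every key in the filter and then ranks the survivors by score. The helper names and the scores are made up for the example; real stores such as Pinecone apply filters inside the index and support richer operators than exact match.

```python
def matches(metadata, metadata_filter):
    """True if every key in the filter has exactly the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_top_k(scored_docs, metadata_filter, k=1):
    """Drop non-matching documents, then return the k highest-scoring ones."""
    candidates = [doc for doc in scored_docs if matches(doc["metadata"], metadata_filter)]
    return sorted(candidates, key=lambda doc: doc["score"], reverse=True)[:k]


# Hypothetical documents with precomputed (made-up) similarity scores
scored_docs = [
    {"text": "movie review", "metadata": {"source": "tweet"}, "score": 0.61},
    {"text": "weather forecast", "metadata": {"source": "news"}, "score": 0.74},
    {"text": "phone review", "metadata": {"source": "website"}, "score": 0.58},
    {"text": "breakfast post", "metadata": {"source": "tweet"}, "score": 0.33},
]

best_tweet = filtered_top_k(scored_docs, {"source": "tweet"}, k=1)
print(best_tweet[0]["text"])
```

Note that the highest-scoring document overall ("weather forecast") is excluded because its source does not match the filter.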

In [10]:
! pip install -qU langchain-pinecone pinecone-notebooks


In [11]:
import getpass
import os
import time

from pinecone import Pinecone, ServerlessSpec

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)

Error while installing plugin inference: property 'inference' of 'Pinecone' object has no setter
Traceback (most recent call last):
  File "/Volumes/MOVESPEED/AI课程/localCode/.venv/lib/python3.13/site-packages/pinecone_plugin_interface/actions/installation.py", line 13, in install_plugins
    setattr(target, plugin.namespace, impl(target.config, plugin_client_builder))
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: property 'inference' of 'Pinecone' object has no setter


Initialize the index

In [12]:
import time

index_name = "langchain-test-index"  # change if desired

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=3072,  # must match the embedding model's output dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)

Attach the embedding model

In [13]:
from langchain_pinecone import PineconeVectorStore
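# bge-m3 produces 1024-dimensional vectors, which do not fit this 3072-dimensional index,
# so we use text-embedding-3-large here instead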
vector_store = PineconeVectorStore(index=index, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))
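
A dimension mismatch like the one described in the comment above can be caught before anything is written. The guard below is a minimal sketch: `check_dimension` is a hypothetical helper, not part of Pinecone or LangChain, and the zero vectors merely stand in for real embeddings.

```python
def check_dimension(vector, index_dimension):
    """Raise early if an embedding does not fit the index it is meant for."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"but the index expects {index_dimension}"
        )
    return vector


# A 3072-dimensional vector fits the 3072-dimensional index
check_dimension([0.0] * 3072, 3072)

# A 1024-dimensional vector (the size the text gives for bge-m3) does not
try:
    check_dimension([0.0] * 1024, 3072)
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```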

In [14]:
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="今天早餐吃了老王家的生煎包，馅料实在得快从褶子里跳出来了！这才是真正的上海味道！",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="明日天气预报：北京地区将出现大范围雾霾，建议市民戴好口罩，看不见脸的时候请不要慌张。",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="终于搞定了AI聊天机器人！我问它'你是谁'，它回答'我是你爸爸'，看来还需要调教...",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="震惊！本市一男子在便利店抢劫，只因店员说'扫码支付才有优惠'，现已被警方抓获。",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="刚看完《流浪地球3》，特效简直炸裂！就是旁边大妈一直问'这是在哪拍的'有点影响观影体验。",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="新发布的小米14Ultra值不值得买？看完这篇测评你就知道为什么李老板笑得合不拢嘴了。",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="2025年中超联赛十大最佳球员榜单新鲜出炉，第一名居然是他？！",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="用LangChain开发的AI助手太神奇了！问它'人生的意义'，它给我推荐了一份外卖优惠券...",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="A股今日暴跌，分析师称原因是'大家都在抢着卖'，投资者表示很有道理。",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="感觉我马上要被删库跑路了，祝我好运 /(ㄒoㄒ)/~~",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['4a167660-65a2-4c22-a7f0-4e2f1c27f798',
 '1687e3b4-d8cc-4652-b9e8-1626c146975a',
 '555413b0-5677-44c0-a574-3e081ecf22a1',
 '5ef9765d-e32a-4d8a-9f20-49ec6f6296c6',
 '90503ce6-9197-4738-9f5d-f7031b30f2e3',
 '9ad5e511-8dcf-4358-a4ab-d792baff6fa7',
 'b5e9d05a-fb8f-4d9b-bd3c-4e86f4cceda6',
 'def48eea-cce5-4803-ad7f-a450cf64eddf',
 'a327e865-100c-4321-8ea8-d5355b537167',
 'bebf9b8d-4cb1-4617-8c87-a36e0fb9c364']

Delete the last item

In [15]:
vector_store.delete(ids=[uuids[-1]])

Similarity search with metadata filtering

In [17]:
results = vector_store.similarity_search(
    "看电影",
    k=1,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* 刚看完《流浪地球3》，特效简直炸裂！就是旁边大妈一直问'这是在哪拍的'有点影响观影体验。 [{'source': 'tweet'}]


Similarity search with scores

In [18]:
results = vector_store.similarity_search_with_score(
    "明天热吗?", k=1, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.468] 明日天气预报：北京地区将出现大范围雾霾，建议市民戴好口罩，看不见脸的时候请不要慌张。 [{'source': 'news'}]


#### Advanced usage: MMR (maximal marginal relevance)
****
- Not every vector store supports it

In [21]:
vector_store.max_marginal_relevance_search(
    query="新手机",
    k=1,
    lambda_mult=0.8,  # 1.0 favors relevance only, 0.0 favors diversity only
    filter={"source": "website"},
)

[Document(metadata={'source': 'website'}, page_content='新发布的小米14Ultra值不值得买？看完这篇测评你就知道为什么李老板笑得合不拢嘴了。')]
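
For intuition, here is a minimal sketch of the usual MMR formulation: pick results one at a time, each time maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is an illustration with made-up 2-dimensional vectors and hypothetical helper names, not the code a particular vector store runs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.6, -0.80],  # 2: less relevant, but different
]

relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
balanced = mmr(query, docs, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With `lambda_mult=1.0` the near-duplicate is returned second; with `lambda_mult=0.5` it is skipped in favor of the less similar but more distinct document. With `k=1`, as in the cell above, the first pick is simply the most relevant document, so the weighting has no visible effect.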

#### Advanced usage: hybrid search
****


In [22]:
! pip install --upgrade --quiet  pinecone pinecone-text pinecone-notebooks

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-pinecone 0.2.3 requires pinecone<6.0.0,>=5.4.0, but you have pinecone 6.0.1 which is incompatible.


In [23]:
import os

api_key = os.environ["PINECONE_API_KEY"]

In [24]:
from langchain_community.retrievers import (
    PineconeHybridSearchRetriever,
)

Initialize the index

In [25]:
import os

from pinecone import Pinecone, ServerlessSpec

index_name = "langchain-pinecone-hybrid-search"

# initialize Pinecone client
pc = Pinecone(api_key=api_key)

# create the index
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimensionality of dense model
        metric="dotproduct",  # sparse values supported only for dotproduct
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )


In [26]:
index = pc.Index(index_name)

In [27]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [28]:
from pinecone_text.sparse import BM25Encoder

# or from pinecone_text.sparse import SpladeEncoder if you wish to work with SPLADE

# use default tf-idf values
bm25_encoder = BM25Encoder.default()

In [29]:
corpus = ["foo", "bar", "world", "hello"]

# fit tf-idf values on your corpus
bm25_encoder.fit(corpus)

# store the values to a json file
bm25_encoder.dump("bm25_values.json")

# load to your BM25Encoder object
bm25_encoder = BM25Encoder().load("bm25_values.json")

100%|██████████| 4/4 [00:00<00:00, 247.62it/s]


In [30]:
retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings, sparse_encoder=bm25_encoder, index=index
)

Add texts

In [31]:
retriever.add_texts(["foo", "bar", "world", "hello"])

100%|██████████| 1/1 [00:04<00:00,  4.96s/it]


Retrieve

In [32]:
result = retriever.invoke("foo")


In [33]:
result[0]

Document(metadata={'score': 0.730220914}, page_content='foo')
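
One common way to combine dense and sparse signals, sketched below, is a convex weighting of the query: scale the dense vector by `alpha` and the sparse values by `1 - alpha`, then let the dot product add the two contributions. Because the dot product is linear, the result equals `alpha * dense_score + (1 - alpha) * sparse_score`, which is one way to see why the index above uses the `dotproduct` metric. The function names, the vectors, and the choice of `alpha=0.25` are assumptions for illustration, not the retriever's actual code.

```python
def scale_hybrid_query(dense, sparse, alpha):
    """Weight a dense vector by alpha and a sparse vector by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


def dense_dot(a, b):
    """Dot product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two sparse vectors given as index/value lists."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Made-up query and document representations
query_dense = [1.0, 2.0]
query_sparse = {"indices": [3, 7], "values": [2.0, 1.0]}
doc_dense = [0.5, 0.5]
doc_sparse = {"indices": [7, 9], "values": [4.0, 1.0]}

scaled_dense, scaled_sparse = scale_hybrid_query(query_dense, query_sparse, alpha=0.25)
hybrid_score = dense_dot(scaled_dense, doc_dense) + sparse_dot(scaled_sparse, doc_sparse)
print(hybrid_score)
```

Setting `alpha` to 1 reduces this to pure dense (semantic) search, and setting it to 0 reduces it to pure sparse (keyword) search.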