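To improve retrieval quality, in practice we can store multiple vectors per document. This has proven useful in a number of use cases.
LangChain provides a retriever component, MultiVectorRetriever, that supports this mechanism. It can be implemented in the following ways:
1. Smaller chunks
Split a document into smaller chunks and embed them.
2. Summaries
Create a summary for each document and embed it along with (or instead of) the document.
3. Hypothetical questions
Create hypothetical questions that each document is suited to answer, and embed them along with (or instead of) the document.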
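
The idea common to all three approaches can be sketched without any library: several small vectors point back to one parent document, and a match on any of them returns the parent. The block below is only an illustration under stated assumptions; `embed`, `cosine`, `parents`, `index`, and `retrieve` are hypothetical names, and the bag-of-words "embedding" is a stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Parent documents, keyed by id (this plays the role of the docstore).
parents = {
    "doc-1": "Gross margin increased because data center revenue grew.",
    "doc-2": "The company repurchased shares and paid cash dividends.",
}

# Several vectors per parent (this plays the role of the vector store).
index = [
    ("doc-1", embed("gross margin increased")),
    ("doc-1", embed("data center revenue grew")),
    ("doc-2", embed("shares repurchased")),
    ("doc-2", embed("cash dividends paid")),
]


def retrieve(query):
    """Find the best-matching small vector, then return its parent document."""
    q = embed(query)
    best_parent_id, _ = max(index, key=lambda item: cosine(q, item[1]))
    return parents[best_parent_id]
```

Calling `retrieve("what is the gross margin")` matches one of the small vectors of `doc-1` and returns the full parent text rather than the small chunk.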

In [1]:
# !pip install -q -U  pypdf


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# Smaller Chunks
## Download the file
# !wget -O nvidia_10q_2023.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf


--2024-06-24 14:44:43--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 18.155.204.129, 18.155.204.91, 18.155.204.120, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|18.155.204.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458891 (448K) [application/pdf]
Saving to: ‘nvidia_10q_2023.pdf’


2024-06-24 14:44:45 (359 KB/s) - ‘nvidia_10q_2023.pdf’ saved [458891/458891]



In [10]:
# CloudflareWorkersAI
import os

from dotenv import load_dotenv
from langchain_community.embeddings.cloudflare_workersai import (
    CloudflareWorkersAIEmbeddings,
)
from langchain_community.llms.cloudflare_workersai import CloudflareWorkersAI

# Load the .env file in the current directory.
# load_dotenv(override=True) re-reads .env and overwrites variables that are already set.
load_dotenv(override=True)

# Variables from .env can now be read like ordinary environment variables.
account_id = os.getenv('CF_ACCOUNT_ID')
api_token = os.getenv('CF_API_TOKEN')

# Fail early if the credentials are missing; never print the secret values themselves.
assert account_id and api_token, "CF_ACCOUNT_ID and CF_API_TOKEN must be set in .env"

model = '@cf/meta/llama-3-8b-instruct'
cf_llm = CloudflareWorkersAI(account_id=account_id, api_token=api_token, model=model)

# Embedding model; its vectors have 384 dimensions.
embeddings = CloudflareWorkersAIEmbeddings(
    account_id=account_id,
    api_token=api_token,
    model_name="@cf/baai/bge-small-en-v1.5",
)


# Smaller Chunks

In [3]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma

In [7]:
path = "../../file/nvidia_10q_2023.pdf"
loaders = [PyPDFLoader(path)]
docs = []
for loader in loaders:
    # list.extend appends every element of the iterable (one Document per PDF page)
    docs.extend(loader.load())
# Split the pages into parent documents of at most 10,000 characters each
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

In [8]:
len(docs)

51

In [9]:
print(docs[6])

page_content="NVIDIA CORPORATION AND SUBSIDIARIES\nCONDENSED CONSOLIDATED STATEMENTS OF SHAREHOLDERS’ EQUITY\nFOR THE SIX MONTHS ENDED JULY 30, 2023 AND JULY 31, 2022\n(Unaudited)\nCommon Stock\nOutstandingAdditional\nPaid-in\nCapitalAccumulated Other\nComprehensive LossRetained\nEarningsTotal\nShareholders'\nEquity (In millions, except per share data) Shares Amount\nBalances, January 29, 2023 2,466 $ 2 $ 11,971 $ (43) $ 10,171 $ 22,101 \nNet income — — — — 8,232 8,232 \nOther comprehensive loss — — — (8) — (8)\nIssuance of common stock from stock plans 14 — 247 — — 247 \nTax withholding related to vesting of restricted stock units (3) — (1,179) — — (1,179)\nShares repurchased (8) — (1) — (3,283) (3,284)\nCash dividends declared and paid ($0.08 per common share) — — — — (199) (199)\nStock-based compensation — — 1,591 — — 1,591 \nBalances, July 30, 2023 2,469 $ 2 $ 12,629 $ (51) $ 14,921 $ 27,501 \nBalances, January 30, 2022 2,506 $ 3 $ 10,385 $ (11) $ 16,235 $ 26,612 \nNet income — — —

In [11]:
import uuid

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings
)
# The storage layer for the parent documents
# InMemoryStore is an in-memory key-value store for arbitrary data
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# One unique id per parent document
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [12]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
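
To make the role of `chunk_size` concrete, here is a much simplified sketch of recursive character splitting: try the coarsest separator first, pack pieces into chunks up to the size limit, and fall back to finer separators only for pieces that are still too long. This is not LangChain's implementation (it ignores chunk overlap and other options); `recursive_split` and its default separators are assumptions made for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
small_chunks = recursive_split(sample, chunk_size=7)
```

With `chunk_size=7` the first paragraph is too long and gets split on spaces, while the second paragraph fits and stays whole.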

In [13]:
# Split each parent document into child chunks, tag every chunk with its parent id, and collect them in sub_docs
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

In [14]:
print(sub_docs[1])

page_content='For the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION\n(Exact name of registrant as specified in its charter)\nDelaware 94-3177549\n(State or other jurisdiction of (I.R.S. Employer\nincorporation or organization) Identification No.)' metadata={'source': '../../file/nvidia_10q_2023.pdf', 'page': 0, 'doc_id': '42488fd8-97ab-419e-ac0c-dae52ac7c70b'}


In [ ]:
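# Embed the child chunks and add them to the vector store.
# This cell must run before querying; with an empty vector store every search returns an empty list.
retriever.vectorstore.add_documents(sub_docs)
# zip() pairs each id with its parent document as (key, value) tuples
# mset() stores those key-value pairs in the docstore
retriever.docstore.mset(list(zip(doc_ids, docs)))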
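
The `zip` and `mset` step can be tried in isolation. The sketch below uses a hypothetical dict-backed `ToyStore` that mimics only the two calls used in this notebook (`mset` and `mget`); it is an assumption for illustration, not the `InMemoryStore` class itself.

```python
import uuid


class ToyStore:
    """Minimal key-value store with batch set and batch get."""

    def __init__(self):
        self._data = {}

    def mset(self, pairs):
        # pairs is a sequence of (key, value) tuples
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys):
        # Missing keys yield None, so the result lines up with the requested keys
        return [self._data.get(key) for key in keys]


toy_docs = ["parent A", "parent B"]
toy_ids = [str(uuid.uuid4()) for _ in toy_docs]

# zip pairs each id with the document at the same position
pairs = list(zip(toy_ids, toy_docs))

toy_store = ToyStore()
toy_store.mset(pairs)
```

Looking up `toy_ids[1]` returns `"parent B"`, and an unknown key returns `None` in the same position.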

In [16]:
print(type(sub_docs))
print(list(zip(doc_ids, docs)))

<class 'list'>
[('42488fd8-97ab-419e-ac0c-dae52ac7c70b', Document(page_content='UNITED ST ATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-Q\n☒ QUARTERL Y REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION\n(Exact name of registrant as specified in its charter)\nDelaware 94-3177549\n(State or other jurisdiction of (I.R.S. Employer\nincorporation or organization) Identification No.)\n2788 San T omas Expressway , Santa Clara, California 95051\n(Address of principal executive offices) (Zip Code)\n(408) 486-2000\n(Registrant\'s telephone number, including area code)\nN/A\n(Former name, former address and former fiscal year if changed since last report)\nSecurities registered pursuant to Section 12(b) of the Act:\nTitle of each class Trading Symb

In [17]:
# The vectorstore alone retrieves the small chunks
similar_docs = retriever.vectorstore.similarity_search("What is the gross margin?")

In [18]:
print(similar_docs)

[]


In [19]:
relevant_docs = retriever.invoke("What is the gross margin?")


In [21]:
len(relevant_docs)
# relevant_docs[0]

0
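
The step from child-chunk hits to parent documents can be sketched as follows: collect the parent ids found in the hits' metadata, drop duplicates while keeping rank order, then look the parents up in the docstore. This is a simplified sketch, assuming hits are plain metadata dicts and the docstore is a dict; `parents_from_hits` is a hypothetical helper, not the retriever's actual code. It also shows why an empty search result leads to zero relevant documents.

```python
def parents_from_hits(hits, docstore, id_key="doc_id"):
    """Map child-chunk hits to their parent documents, without duplicates."""
    ids = []
    for metadata in hits:
        parent_id = metadata.get(id_key)
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)  # keep the order of the best-ranked hit
    # Skip ids that have no parent stored
    return [docstore[i] for i in ids if i in docstore]


toy_docstore = {"p1": "parent one", "p2": "parent two"}
toy_hits = [
    {"doc_id": "p2", "page": 3},
    {"doc_id": "p1", "page": 0},
    {"doc_id": "p2", "page": 4},  # second chunk of the same parent
]
```

Three chunk hits collapse into two parents here, and an empty hit list yields an empty result.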

In [22]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [23]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(cf_llm, retriever, memory=memory)

In [25]:
result = qa.invoke({"question": "What is the gross margin?"})

In [26]:
result

{'question': 'What is the gross margin?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content="I'd be happy to help!\n\nThe gross margin is a financial metric that represents the difference between a company's revenue and the cost of goods sold (COGS). It's expressed as a percentage and is calculated by dividing the gross profit by the revenue, then multiplying by 100.\n\nThe formula is:\n\nGross Margin = (Revenue - COGS) / Revenue x 100\n\nFor example, if a company has revenue of $100,000 and COGS of $60,000, the gross margin would be:\n\nGross Margin = ($100,000 - $60,000) / $100,000 x 100 = 40%\n\nA higher gross margin indicates that a company is able to generate more profit from each dollar of sales, while a lower gross margin suggests that a company may need to optimize its pricing or cost structure to improve its profitability.\n\nDoes that help clarify things?"),
  HumanMessage(content='What is the gross margin?'),
  AIMessage(content="I'm ha

In [27]:
result = qa.invoke({"question": "What is the main contribution to it?"})
result

{'question': 'What is the main contribution to it?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content="I'd be happy to help!\n\nThe gross margin is a financial metric that represents the difference between a company's revenue and the cost of goods sold (COGS). It's expressed as a percentage and is calculated by dividing the gross profit by the revenue, then multiplying by 100.\n\nThe formula is:\n\nGross Margin = (Revenue - COGS) / Revenue x 100\n\nFor example, if a company has revenue of $100,000 and COGS of $60,000, the gross margin would be:\n\nGross Margin = ($100,000 - $60,000) / $100,000 x 100 = 40%\n\nA higher gross margin indicates that a company is able to generate more profit from each dollar of sales, while a lower gross margin suggests that a company may need to optimize its pricing or cost structure to improve its profitability.\n\nDoes that help clarify things?"),
  HumanMessage(content='What is the gross margin?'),
  AIMessage(cont

# Summary

In [28]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [34]:
# openai/moonshot
# Basic imports
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load the .env file in the current directory.
# load_dotenv(override=True) re-reads .env and overwrites variables that are already set.
load_dotenv(override=True)

# Variables from .env can now be read like ordinary environment variables.
api_key = os.getenv('OPENAI_API_KEY')
base_url = os.getenv('OPENAI_API_BASE')

# Print only the endpoint; the API key is a secret and should not be echoed.
print(base_url)

ms_chat = ChatOpenAI(
    openai_api_base=base_url,
    openai_api_key=api_key,
    model_name="moonshot-v1-8k",
    temperature=0.7,
)

https://api.moonshot.cn/v1/


In [37]:
# Tongyi
import os

from dotenv import load_dotenv
from langchain_community.llms.tongyi import Tongyi

# Load the .env file in the current directory.
# load_dotenv(override=True) re-reads .env and overwrites variables that are already set.
load_dotenv(override=True)

# Variables from .env can now be read like ordinary environment variables.
# The key is a secret, so it is checked here but not printed.
DASHSCOPE_API_KEY = os.getenv('DASHSCOPE_API_KEY')
assert DASHSCOPE_API_KEY, "DASHSCOPE_API_KEY must be set in .env"
# os.environ["DASHSCOPE_API_KEY"] = DASHSCOPE_API_KEY

qwen_llm = Tongyi(model='qwen2-1.5b-instruct')


In [47]:
chain = (
        {"doc": lambda x: x.page_content}
        | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
        | ms_chat
        | StrOutputParser()
)

In [43]:
summaries = chain.batch(docs, {"max_concurrency": 1})

RateLimitError: Error code: 429 - {'error': {'message': 'Your account conjldivk6gfi4skbkpg<ak-erqmom4bpsg111e7w3ci> request reached max request: 3, please try again after 1 seconds', 'type': 'rate_limit_reached_error'}}
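
One way to work around a 429 like the one above is to call the chain one document at a time and retry with exponential backoff. The helper below is a minimal sketch under stated assumptions: `call_with_backoff` and its parameters are hypothetical names, and it retries on any exception for brevity, whereas real code would catch only the provider's rate-limit error.

```python
import time


def call_with_backoff(fn, item, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(item), retrying with exponentially growing pauses on failure."""
    for attempt in range(max_attempts):
        try:
            return fn(item)
        except Exception:  # assumption: narrow this to the rate-limit error in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demonstration with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_summarize(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429: rate limit reached")
    return text.upper()


recorded_delays = []
demo_result = call_with_backoff(flaky_summarize, "ok", sleep=recorded_delays.append)
```

In this notebook the same helper could be applied as `summaries = [call_with_backoff(chain.invoke, d) for d in docs]`. LangChain runnables also offer a `with_retry` method that may serve the same purpose.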

In [49]:
summaries1 = chain.invoke(docs[0])

In [50]:
print(summaries1)

This document is a quarterly report (Form 10-Q) filed by NVIDIA Corporation with the United States Securities and Exchange Commission (SEC) for the period ended July 30, 2023. The report indicates that the company has met all filing requirements and submitted all required Interactive Data Files. NVIDIA is identified as a large accelerated filer and is not a shell company. As of August 18, 2023, the company has 2.47 billion shares of common stock outstanding.


In [31]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=embeddings
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [ ]:
summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

In [ ]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [ ]:
sub_docs = vectorstore.similarity_search("What is the gross margin?")
sub_docs[0]

In [ ]:
retrieved_docs = retriever.invoke("What is the gross margin?")
retrieved_docs[0]

In [ ]:
qa = ConversationalRetrievalChain.from_llm(
    cf_llm,
    retriever,
    memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True),
)

In [ ]:
result = qa.invoke({"question": "What is the gross margin?"})
result

In [ ]:
result = qa.invoke({"question": "What is the main contribution to it?"})
result

# Hypothetical Questions


In [44]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    },
                },
            },
            "required": ["questions"]
        }
    }
]

In [45]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

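chain = (
    # Map the input Document to the prompt variable {doc}
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    # Note: binding `functions` and parsing with JsonKeyOutputFunctionsParser assumes a chat model
    # with OpenAI-style function calling. Tongyi is a plain-text LLM wrapper, so this step is
    # likely what fails when the chain runs.
    | qwen_llm.bind(functions=functions,
                    function_call={"name": "hypothetical_questions"})
    | JsonKeyOutputFunctionsParser(key_name="questions")
)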

In [46]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 1})

KeyError: 'request'
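
When the model does not support function calling, one alternative is to ask for one question per line in plain text and parse the reply. The parser below is a sketch of that swapped-in technique, not what the notebook does; `parse_questions` is a hypothetical helper, and it assumes each question sits on its own line and ends with a question mark.

```python
import re


def parse_questions(text, limit=3):
    """Extract up to `limit` questions from a plain-text, line-oriented reply."""
    questions = []
    for line in text.splitlines():
        # Strip list markers such as "1.", "2)", "-" or "*"
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:limit]


sample_reply = (
    "Here are the questions:\n"
    "1. What was the revenue for the quarter?\n"
    "2) How many shares are outstanding?\n"
    "- What is the gross margin?\n"
)
parsed = parse_questions(sample_reply)
```

Each parsed list could then replace one entry of `hypothetical_questions`, since the later cells only need a list of question strings per document.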

In [ ]:
vectorstore = Chroma(
    collection_name="hypo-questions",
    embedding_function=embeddings 
)
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend([Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list])

question_docs

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

qa = ConversationalRetrievalChain.from_llm(qwen_llm, retriever,
                                           memory=ConversationBufferMemory(memory_key="chat_history",
                                                                           return_messages=True))

result = qa.invoke({"question": "What is the gross margin?"})
result

result = qa.invoke({"question": "What is the main contribution to it?"})
result