# Use LangChain, GPT and Activeloop's Deep Lake to work with a code base

In this tutorial, we use LangChain and Activeloop's Deep Lake together with GPT to analyze the code base of LangChain itself.

## Design

1. Prepare data:
   1. Upload all Python project files using `langchain_community.document_loaders.TextLoader`. We will call these files the **documents**.
   2. Split all documents into chunks using `langchain_text_splitters.CharacterTextSplitter`.
   3. Embed the chunks and upload them to Deep Lake using `langchain_openai.OpenAIEmbeddings` and `langchain_community.vectorstores.DeepLake`.
2. Question answering:
   1. Build a chain from `langchain_openai.ChatOpenAI` and `langchain.chains.ConversationalRetrievalChain`.
   2. Prepare the questions.
   3. Run the chain to get the answers.
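The design above can be sketched end to end without any external service. The block below is a toy illustration only: `split`, `embed`, `build_index` and `retrieve` are hypothetical helpers, the word-count "embedding" is a stand-in for a real embedding model, an in-memory list stands in for the vector store, and the final step (passing the retrieved chunks to a chat model) is left out.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Chunk = Tuple[str, str, Counter]  # (source file, chunk text, toy embedding)


def split(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: Dict[str, str], chunk_size: int = 200) -> List[Chunk]:
    """Step 1: split every document and embed every chunk."""
    return [
        (source, chunk, embed(chunk))
        for source, text in documents.items()
        for chunk in split(text, chunk_size)
    ]


def retrieve(index: List[Chunk], question: str, k: int = 1) -> List[Tuple[str, str]]:
    """Step 2 (retrieval part): return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[2]), reverse=True)
    return [(source, chunk) for source, chunk, _ in ranked[:k]]


documents = {
    "chains.py": "class Chain: base class for all chains",
    "retrievers.py": "class BaseRetriever: returns documents for a query",
}
index = build_index(documents)
top = retrieve(index, "which class returns documents", k=1)
print(top[0][0])
```

In the real pipeline, the retrieved chunks are handed to the chat model together with the question; the toy version stops at retrieval so that it runs offline.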

## Implementation

### Integration preparations

We need to set up keys for the external services and install the necessary Python libraries.

In [None]:
# Install the latest versions of the packages imported in this notebook
!python3 -m pip install --upgrade langchain langchain-community langchain-openai langchain-text-splitters deeplake openai

Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, and authenticate. For the full Deep Lake documentation, see https://docs.activeloop.ai/; for the API reference, see https://docs.deeplake.ai/en/latest/.

In [1]:
import os
from getpass import getpass

# Enter the OpenAI API key manually when prompted
os.environ["OPENAI_API_KEY"] = getpass()

If you want to create your own dataset and publish it, authenticate with Deep Lake. You can get an API key from the platform at [app.activeloop.ai](https://app.activeloop.ai).

In [2]:
import getpass
import os

# Prompt for the Activeloop token without echoing it to the screen
activeloop_token = getpass.getpass("Activeloop Token:")
# Store the token in the ACTIVELOOP_TOKEN environment variable
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
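Both credential cells follow the same pattern: prompt for a secret, then store it in an environment variable. A minimal sketch of a reusable helper is shown below. `ensure_env` and its `prompt` parameter are hypothetical names introduced here for illustration; they are not part of LangChain or Deep Lake. The helper skips the prompt when the variable is already set, which is convenient when re-running the notebook.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return os.environ[name], prompting for it only if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value
```

In a notebook, this would be called as `ensure_env("OPENAI_API_KEY")` and `ensure_env("ACTIVELOOP_TOKEN")`. The `prompt` argument exists only so that the helper can be exercised without an interactive terminal.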

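### Prepare data

Load all repository files. Here we assume that this notebook was downloaded as part of a fork of the `langchain` repository, and we work with the Python files of the `langchain` repository. If you want to use files from a different repository, change `root_dir` to the root directory of your repository.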

In [10]:
import os  # used by os.walk in the next cell

# The leading "!" runs a shell command from the notebook rather than Python code.
# List the files and folders under the relative path used as root_dir below.
!ls "../../../../../../libs"

CITATION.cff  MIGRATE.md  README.md  libs	  poetry.toml
LICENSE       Makefile	  docs	     poetry.lock  pyproject.toml


In [11]:
from langchain_community.document_loaders import TextLoader

root_dir = "../../../../../../libs"  # root directory of the repository to index

docs = []  # loaded documents
for dirpath, dirnames, filenames in os.walk(root_dir):
    # Skip virtual environments: any path component ending in "venv".
    # (A literal "*venv/" substring test would never match a real path.)
    if any(part.endswith("venv") for part in dirpath.split(os.sep)):
        continue
    for file in filenames:
        if file.endswith(".py"):
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
                # Load the file, split it, and collect the resulting documents
                docs.extend(loader.load_and_split())
            except Exception:
                pass  # ignore files that cannot be read or decoded
print(f"{len(docs)}")  # number of loaded documents

2554
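The file-collection step can be tried in isolation before loading anything. The sketch below uses only the standard library; `iter_python_files` and its `skip_suffix` parameter are hypothetical names, and the temporary directory tree exists only to demonstrate the behaviour. It prunes directories whose name ends in `venv` so that `os.walk` never descends into them.

```python
import os
import tempfile
from typing import Iterator


def iter_python_files(root_dir: str, skip_suffix: str = "venv") -> Iterator[str]:
    """Yield paths of .py files under root_dir, skipping *venv directories."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if not d.endswith(skip_suffix)]
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirpath, name)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for rel in ("pkg/a.py", "pkg/notes.txt", ".venv/lib/b.py"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = [os.path.relpath(p, root) for p in iter_python_files(root)]

print(found)
```

Only `pkg/a.py` is reported: the text file is filtered by extension and the `.venv` subtree is never visited.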


Then, split the files into chunks.

In [12]:
from langchain_text_splitters import CharacterTextSplitter

# Target chunks of about 1000 characters with no overlap between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)  # split every loaded document
print(f"{len(texts)}")  # number of chunks

Created a chunk of size 1010, which is longer than the specified 1000
Created a chunk of size 3466, which is longer than the specified 1000
Created a chunk of size 1375, which is longer than the specified 1000
Created a chunk of size 1928, which is longer than the specified 1000
Created a chunk of size 1075, which is longer than the specified 1000
Created a chunk of size 1063, which is longer than the specified 1000
Created a chunk of size 1083, which is longer than the specified 1000
Created a chunk of size 1074, which is longer than the specified 1000
Created a chunk of size 1591, which is longer than the specified 1000
Created a chunk of size 2300, which is longer than the specified 1000
Created a chunk of size 1040, which is longer than the specified 1000
Created a chunk of size 1018, which is longer than the specified 1000
Created a chunk of size 2787, which is longer than the specified 1000
Created a chunk of size 1018, which is longer than the specified 1000
Created a chunk of s

8244
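The "longer than the specified 1000" warnings above appear because a separator-based splitter only cuts at its separator: a stretch of text that contains no separator cannot be made smaller. The block below is a simplified sketch of that behaviour, not LangChain's implementation (which also handles overlap and emits the warning). `split_on_separator` is a hypothetical helper, and the blank-line separator is assumed to match the splitter's default.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is never cut, so it ends up as an
    oversized chunk of its own.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # still fits, or first piece of a new chunk
        else:
            chunks.append(current)  # close the current chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "a" * 5 + "\n\n" + "b" * 5 + "\n\n" + "c" * 30
chunks = split_on_separator(text, chunk_size=12)
print([len(c) for c in chunks])
```

The two short paragraphs are merged into one chunk of exactly 12 characters, while the 30-character paragraph is kept whole even though it exceeds `chunk_size`.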


Then embed the chunks and upload them to Deep Lake. This can take several minutes.

In [13]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # create the embeddings object
embeddings  # display its configuration

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={})

In [15]:
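from langchain_community.vectorstores import DeepLake

username = "<USERNAME_OR_ORG>"  # replace with your Activeloop username or organization

# Embed the chunks and upload them, overwriting any existing dataset at this path
db = DeepLake.from_documents(
    texts,
    embeddings,
    dataset_path=f"hub://{username}/langchain-code",
    overwrite=True,
)
db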

Your Deep Lake dataset has been successfully created!


 

Dataset(path='hub://adilkhan/langchain-code', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
 embedding  embedding  (8244, 1536)  float32   None   
    id        text      (8244, 1)      str     None   
 metadata     json      (8244, 1)      str     None   
   text       text      (8244, 1)      str     None   




<langchain_community.vectorstores.deeplake.DeepLake at 0x7fe1b67d7a30>
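The summary above reports an `embedding` tensor of shape (8244, 1536) with dtype float32 and no compression. As a rough back-of-the-envelope check, the raw size of those vectors can be computed directly; this counts only the float values and ignores any storage overhead, which is an assumption of the sketch.

```python
num_chunks, dim = 8244, 1536  # shape of the embedding tensor reported above
bytes_per_value = 4           # one float32 value

total_bytes = num_chunks * dim * bytes_per_value
size_mib = total_bytes / 2**20
print(f"{total_bytes} bytes, about {size_mib:.1f} MiB")
```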

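`Optional`: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as {'tensor_db': True} when creating the vector store. With this configuration, queries are executed on the Managed Tensor Database rather than on the client side. Note that this functionality does not apply to datasets stored locally or in memory. If a vector store has already been created outside of the Managed Tensor Database, it can be transferred to the Managed Tensor Database by following the prescribed steps.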

In [16]:
from langchain_community.vectorstores import DeepLake

# Same call as before, with runtime={"tensor_db": True} so that queries run
# on the Managed Tensor Database. `username` holds your user or organization ID.
db = DeepLake.from_documents(
    texts,
    embeddings,
    dataset_path=f"hub://{username}/langchain-code",
    runtime={"tensor_db": True},
)
db

### Question Answering

First load the dataset, construct the retriever, then construct the conversational chain.

In [17]:
from langchain_community.vectorstores import DeepLake

# Load the existing dataset:
# - dataset_path: location of the dataset, given with the hub:// scheme
# - read_only: open the dataset without write access
# - embedding: embedding model used to embed the queries
db = DeepLake(
    dataset_path=f"hub://{username}/langchain-code",
    read_only=True,
    embedding=embeddings,
)

Deep Lake Dataset in hub://adilkhan/langchain-code already exists, loading from the storage


In [18]:
# Build a retriever from the vector store
retriever = db.as_retriever()
# Use cosine similarity as the distance metric
retriever.search_kwargs["distance_metric"] = "cos"
# Number of candidate documents fetched before MMR re-ranking
retriever.search_kwargs["fetch_k"] = 20
# Enable maximal marginal relevance (MMR)
retriever.search_kwargs["maximal_marginal_relevance"] = True
# Number of documents returned to the chain
retriever.search_kwargs["k"] = 20
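Maximal marginal relevance trades relevance to the query against redundancy among the results already selected. The block below is a minimal pure-Python sketch of the idea, not the code LangChain or Deep Lake runs; `cosine`, `mmr` and the weighting parameter `lambda_mult=0.5` are assumptions made for illustration. In terms of the settings above, `candidates` plays the role of the `fetch_k` documents fetched by similarity and `k` is the number finally kept.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Similarity to the closest already-selected result (0 if none yet).
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]  # 0 and 1 are near-duplicates
by_similarity = sorted(
    range(len(candidates)), key=lambda i: cosine(query, candidates[i]), reverse=True
)[:2]
by_mmr = mmr(query, candidates, k=2)
print(by_similarity, by_mmr)
```

Plain similarity ranking keeps both near-duplicates, whereas MMR replaces the second one with the less similar but more diverse candidate. When `k` equals the number of candidates, as with `fetch_k = 20` and `k = 20`, this sketch selects every candidate and only the order changes.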

You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter).

In [19]:
def filter(x):
    # Filter on the source code: drop samples whose text contains "something"
    if "something" in x["text"].data()["value"]:
        return False

    # Filter on the path stored in the metadata, e.g. by extension
    metadata = x["metadata"].data()["value"]
    return "only_this" in metadata["source"] or "also_that" in metadata["source"]


### Uncomment the line below to enable custom filtering
# retriever.search_kwargs['filter'] = filter

The `filter` function first rejects any sample whose text contains "something". It then keeps a sample only if the `source` entry of its metadata contains "only_this" or "also_that". To use the custom filter, uncomment the last line.
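The filter logic can be checked without a dataset by writing the same predicate against plain strings. `keep_document` and the sample values below are hypothetical and serve only to exercise the two rules; the placeholder strings are the ones used in the filter above.

```python
def keep_document(text: str, source: str) -> bool:
    """Same rules as the filter above, applied to plain strings."""
    if "something" in text:
        return False
    return "only_this" in source or "also_that" in source


samples = [
    ("def f(): pass", "libs/only_this/module.py"),              # kept: path matches
    ("something to exclude", "libs/only_this/module.py"),       # dropped: text matches
    ("def g(): pass", "libs/other/module.py"),                  # dropped: path does not match
]
kept = [source for text, source in samples if keep_document(text, source)]
print(kept)
```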

In [20]:
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

# Chat model that answers the questions
model = ChatOpenAI(model_name="gpt-3.5-turbo-0613")
# Combine the model with the retriever into a conversational chain
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [32]:
# Questions to ask
questions = [
    "What is the class hierarchy?",
    "What classes are derived from the Chain class?",
    "What kind of retrievers does LangChain have?",
]
chat_history = []  # (question, answer) pairs from earlier turns
qa_dict = {}  # maps each question to its answer

for question in questions:
    # Run the chain with the question and the history so far
    result = qa({"question": question, "chat_history": chat_history})
    # Record the exchange so that later questions can refer to it
    chat_history.append((question, result["answer"]))
    qa_dict[question] = result["answer"]
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the class hierarchy? 

**Answer**: The class hierarchy for Memory is as follows:

    BaseMemory --> BaseChatMemory --> <name>Memory  # Examples: ZepMemory, MotorheadMemory

The class hierarchy for ChatMessageHistory is as follows:

    BaseChatMessageHistory --> <name>ChatMessageHistory  # Example: ZepChatMessageHistory

The class hierarchy for Prompt is as follows:

    BasePromptTemplate --> PipelinePromptTemplate
                           StringPromptTemplate --> PromptTemplate
                                                    FewShotPromptTemplate
                                                    FewShotPromptWithTemplates
                           BaseChatPromptTemplate --> AutoGPTPrompt
                                                      ChatPromptTemplate --> AgentScratchPadChatPromptTemplate
 

-> **Question**: What classes are derived from the Chain class? 

**Answer**: The classes derived from the Chain class are:

- APIChain
- OpenAPIEndpoin
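The loop above feeds each (question, answer) pair back in as `chat_history`, so later questions are answered in the context of earlier ones. The sketch below replays the same loop against `echo_chain`, a hypothetical stub that makes no API calls and merely reports how much history it received, to show how the history grows from turn to turn.

```python
from typing import Dict, List, Tuple

History = List[Tuple[str, str]]


def echo_chain(inputs: Dict[str, object]) -> Dict[str, str]:
    """Hypothetical stand-in for the chain: reports the history length it saw."""
    history = inputs["chat_history"]
    question = inputs["question"]
    return {"answer": f"answer to {question!r} given {len(history)} earlier turn(s)"}


demo_questions = ["first question", "second question"]
demo_history: History = []
for question in demo_questions:
    result = echo_chain({"question": question, "chat_history": demo_history})
    demo_history.append((question, result["answer"]))

for question, answer in demo_history:
    print(question, "->", answer)
```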

In [31]:
qa_dict  # display the question-to-answer dictionary

{'question': 'LangChain possesses a variety of retrievers including:\n\n1. ArxivRetriever\n2. AzureAISearchRetriever\n3. BM25Retriever\n4. ChaindeskRetriever\n5. ChatGPTPluginRetriever\n6. ContextualCompressionRetriever\n7. DocArrayRetriever\n8. ElasticSearchBM25Retriever\n9. EnsembleRetriever\n10. GoogleVertexAISearchRetriever\n11. AmazonKendraRetriever\n12. KNNRetriever\n13. LlamaIndexGraphRetriever\n14. LlamaIndexRetriever\n15. MergerRetriever\n16. MetalRetriever\n17. MilvusRetriever\n18. MultiQueryRetriever\n19. ParentDocumentRetriever\n20. PineconeHybridSearchRetriever\n21. PubMedRetriever\n22. RePhraseQueryRetriever\n23. RemoteLangChainRetriever\n24. SelfQueryRetriever\n25. SVMRetriever\n26. TFIDFRetriever\n27. TimeWeightedVectorStoreRetriever\n28. VespaRetriever\n29. WeaviateHybridSearchRetriever\n30. WebResearchRetriever\n31. WikipediaRetriever\n32. ZepRetriever\n33. ZillizRetriever\n\nIt also includes self query translators like:\n\n1. ChromaTranslator\n2. DeepLakeTranslator\n

In [33]:
# Print the answer stored for "What is the class hierarchy?"
print(qa_dict["What is the class hierarchy?"])

The class hierarchy for Memory is as follows:

    BaseMemory --> BaseChatMemory --> <name>Memory  # Examples: ZepMemory, MotorheadMemory

The class hierarchy for ChatMessageHistory is as follows:

    BaseChatMessageHistory --> <name>ChatMessageHistory  # Example: ZepChatMessageHistory

The class hierarchy for Prompt is as follows:

    BasePromptTemplate --> PipelinePromptTemplate
                           StringPromptTemplate --> PromptTemplate
                                                    FewShotPromptTemplate
                                                    FewShotPromptWithTemplates
                           BaseChatPromptTemplate --> AutoGPTPrompt
                                                      ChatPromptTemplate --> AgentScratchPadChatPromptTemplate



In [34]:
# Print the answer stored for "What classes are derived from the Chain class?"
print(qa_dict["What classes are derived from the Chain class?"])

The classes derived from the Chain class are:

- APIChain
- OpenAPIEndpointChain
- AnalyzeDocumentChain
- MapReduceDocumentsChain
- MapRerankDocumentsChain
- ReduceDocumentsChain
- RefineDocumentsChain
- StuffDocumentsChain
- ConstitutionalChain
- ConversationChain
- ChatVectorDBChain
- ConversationalRetrievalChain
- FlareChain
- ArangoGraphQAChain
- GraphQAChain
- GraphCypherQAChain
- HugeGraphQAChain
- KuzuQAChain
- NebulaGraphQAChain
- NeptuneOpenCypherQAChain
- GraphSparqlQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMBashChain
- LLMCheckerChain
- LLMMathChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAIModerationChain
- NatBotChain
- QAGenerationChain
- QAWithSourcesChain
- RetrievalQAWithSourcesChain
- VectorDBQAWithSourcesChain
- RetrievalQA
- VectorDBQA
- LLMRouterChain
- MultiPromptChain
- MultiRetrievalQAChain
- MultiRouteChain
- RouterChain
- SequentialChain
- SimpleSequentialChain
- TransformChain
- TaskPlaningChain
- QueryChain
- CPAL

In [35]:
# Print the answer stored for "What kind of retrievers does LangChain have?"
print(qa_dict["What kind of retrievers does LangChain have?"])

The LangChain class includes various types of retrievers such as:

- ArxivRetriever
- AzureAISearchRetriever
- BM25Retriever
- ChaindeskRetriever
- ChatGPTPluginRetriever
- ContextualCompressionRetriever
- DocArrayRetriever
- ElasticSearchBM25Retriever
- EnsembleRetriever
- GoogleVertexAISearchRetriever
- AmazonKendraRetriever
- KNNRetriever
- LlamaIndexGraphRetriever and LlamaIndexRetriever
- MergerRetriever
- MetalRetriever
- MilvusRetriever
- MultiQueryRetriever
- ParentDocumentRetriever
- PineconeHybridSearchRetriever
- PubMedRetriever
- RePhraseQueryRetriever
- RemoteLangChainRetriever
- SelfQueryRetriever
- SVMRetriever
- TFIDFRetriever
- TimeWeightedVectorStoreRetriever
- VespaRetriever
- WeaviateHybridSearchRetriever
- WebResearchRetriever
- WikipediaRetriever
- ZepRetriever
- ZillizRetriever
