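## A Complete Example
This is the final lesson in the LangChain minimalist primer series. Drawing on what we learned in the previous nine lessons, we will build a fully featured LLM application. The application is built on the LangChain framework, uses the contents of a PDF file as its knowledge base, and answers user questions based on that file.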

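We use LangChain's QA chain together with Chroma to implement semantic search over the PDF document. The sample code uses the AWS Serverless Developer Guide; the copy downloaded for this run has 114 pages.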

In [ ]:
# !pip install -q langchain langchain-community langchain-openai chromadb pymupdf tiktoken python-dotenv dashscope

In [1]:
# !wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

PDF_NAME = 'serverless-core.pdf'

In [3]:
# !pip install pymupdf

In [4]:
# Load the PDF file (one Document per page)
from langchain_community.document_loaders import PyMuPDFLoader

docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')


There are 114 document(s) in serverless-core.pdf.
There are 112 characters in the first page of your document.


In [29]:
import os
from langchain_community.llms.cloudflare_workersai import CloudflareWorkersAI
from langchain_community.llms.tongyi import Tongyi
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

# Load credentials from a local .env file.
# Do not print API keys or tokens: notebook outputs are saved and shared with the file.
load_dotenv(override=True)

account_id = os.getenv('CF_ACCOUNT_ID')
api_token = os.getenv('CF_API_TOKEN')

# Cloudflare Workers AI
model = '@cf/meta/llama-3-8b-instruct'
cf_llm = CloudflareWorkersAI(
    account_id=account_id,
    api_token=api_token,
    model=model
)

DASHSCOPE_API_KEY = os.getenv('DASHSCOPE_API_KEY')

# Qwen through the native Tongyi integration
qw_llm = Tongyi(
    model='qwen2-1.5b-instruct'
)

# Qwen through its OpenAI-compatible endpoint
qw_llm_openai = ChatOpenAI(
    openai_api_base='https://dashscope.aliyuncs.com/compatible-mode/v1',
    openai_api_key=DASHSCOPE_API_KEY,
    model_name="qwen2-1.5b-instruct",
    temperature=0.7,
    streaming=True,
)

# In the original run, OPENAI_API_BASE pointed at https://api.moonshot.cn/v1/
api_key = os.getenv('OPENAI_API_KEY')
base_url = os.getenv('OPENAI_API_BASE')

# Moonshot through the OpenAI-compatible client
ms_llm = ChatOpenAI(
    openai_api_base=base_url,
    openai_api_key=api_key,
    model_name="moonshot-v1-8k",
    temperature=0.7,
)


In [7]:
# Newer embedding integration
# Cloudflare Workers AI
from langchain_community.embeddings.cloudflare_workersai import (
    CloudflareWorkersAIEmbeddings,
)

# @cf/baai/bge-large-en-v1.5
# dimension: 1024

# @cf/baai/bge-small-en-v1.5
# dimension: 384
embeddings = CloudflareWorkersAIEmbeddings(
    account_id=account_id,
    api_token=api_token,
    model_name="@cf/baai/bge-small-en-v1.5",
)

In [9]:
# Split the documents into chunks; their embeddings are stored in the vector store below
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)
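
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this sketch does not attempt. The helper name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into fixed-size windows; neighbouring windows share chunk_overlap characters.

    Hypothetical helper for illustration, not LangChain's splitting algorithm.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 2500-character sample so the window arithmetic is easy to follow.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.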

In [10]:
split_docs

[Document(page_content='Developer Guide\nServerless\nCopyright © 2024 Amazon Web Services, Inc. and/or its aﬃliates. All rights reserved.', metadata={'source': 'serverless-core.pdf', 'file_path': 'serverless-core.pdf', 'page': 0, 'total_pages': 114, 'format': 'PDF 1.4', 'title': 'Serverless - Developer Guide', 'author': 'AWS', 'subject': '', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'producer': 'Apache FOP Version 2.6', 'creationDate': 'D:20240627120636Z', 'modDate': '', 't

In [13]:
vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="aaaa")

In [33]:
# Create the QA chain, using Qwen through its OpenAI-compatible endpoint
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(qw_llm_openai, chain_type="stuff")
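
The `"stuff"` chain type places all supplied documents into a single prompt and makes one model call. The sketch below shows that idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are assumptions made for illustration; they are not LangChain's built-in template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, then append the question.

    Hypothetical helper: the wording here is illustrative, not LangChain's own prompt.
    """
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


passages = [
    "Lambda runs code without provisioning servers.",
    "DynamoDB is a managed NoSQL database.",
]
stuffed = build_stuff_prompt(passages, "What runs code without servers?")
print(stuffed)
```

Because everything goes into one prompt, this approach only works while the combined documents fit within the model's context window, which is one reason the retrieval step below limits the number of chunks.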

In [17]:
# Run a similarity search for the question and keep the 3 closest chunks
query = "What is the use case of AWS Serverless?"
similar_docs = vectorstore.similarity_search(query, k=3)

In [18]:
similar_docs

[Document(page_content='Serverless\nDeveloper Guide\nSummary\n• You need an Amazon Web Services account to get started.\n• Python and JavaScript/Typescript are popular programming languages for serverless. You will \nsee these most frequently in examples, tutorials, and workshops.\n• Java, C#, Go, Ruby, and PowerShell are available runtimes, but you can also bring your own.\n• Set up your development environment with your preferred local IDE\n• AWS data centers are organized into one or more Availability Zones located in multiple regions\nacross the globe\n• Region codes and ARNs are used to identify and connect to speciﬁc AWS services and resources\n• Responsibility for security of serverless solutions is shared between you and AWS.\nSummary\n19', metadata={'author': 'AWS', 'creationDate': 'D:20240627120636Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverl
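
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below ranks a few toy vectors by cosine similarity. It is an illustration under stated assumptions: the 3-dimensional vectors and document ids are made up, and the distance metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings; real ones from bge-small-en-v1.5 have 384 dimensions.
toy_vectors = {
    "lambda": [0.9, 0.1, 0.0],
    "dynamodb": [0.1, 0.9, 0.0],
    "pricing": [0.0, 0.2, 0.9],
    "api-gateway": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
nearest = top_k(query_vec, toy_vectors, k=2)
print(nearest)  # ['lambda', 'api-gateway']
```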

In [34]:
# Answer the question with the QA chain, using the retrieved documents
chain.run(input_documents=similar_docs, question=query)
# chain.invoke(query)

'The primary use case of AWS Serverless is to write code that serves customer requests without having to manage servers. This allows developers to focus on their application logic and leave the operational aspects of running servers to the cloud provider. AWS Serverless offers pay-per-use pricing, automatic scaling, and support for multiple programming languages. It also enables users to easily extend their deployments across different regions and availability zones.'

In [23]:
retriever = vectorstore.as_retriever()

In [24]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Use the context below to answer the user's question. If you can't find the information in the context, simply answer \"I don't know\".\n\n {context} {question}")

In [30]:
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | qw_llm_openai | StrOutputParser()
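
In the chain above, the retriever returns a list of `Document` objects, which is inserted into the prompt in its stringified form, metadata included. A common refinement is to join only the page text before it reaches the prompt. The sketch below shows such a helper; `format_docs` is a conventional but hypothetical name, and `SimpleNamespace` stands in for real `Document` objects so the example runs without LangChain installed.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the page_content of retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects; only page_content is needed here.
fake_docs = [
    SimpleNamespace(page_content="Lambda runs code without servers."),
    SimpleNamespace(page_content="DynamoDB is a NoSQL database."),
]
context = format_docs(fake_docs)
print(context)
```

Because LCEL coerces plain functions into runnables when they are piped, the helper could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, leaving the rest of the chain unchanged.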

In [31]:
chain.invoke(query)

'The use case of AWS Serverless is to provide developers with a guided learning path for the core services needed to build serverless solutions. This includes services such as AWS Lambda, Amazon Elastic Compute Cloud (EC2), and AWS App Runner, among others. The goal is to simplify building serverless solutions by focusing on writing code that serves customers without managing servers. Serverless technologies offer pay-as-you-go scalability, automatic scaling, and ease of expansion across geographic regions.'

In [38]:
# chain3 = {"context": retriever} | qw_llm_openai | StrOutputParser()

In [None]:
# chain3.invoke(query)