<a href="https://colab.research.google.com/github/dangspin/myExercise/blob/master/pdf_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Complete Example

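This is the final installment of the `LangChain` minimalist primer series. Drawing on what we covered in the previous nine installments, we will build an LLM application with a complete feature set. The application is built on the `LangChain` framework, uses the contents of a `PDF` file as its knowledge base, and lets users ask questions that are answered from that file.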

We use `LangChain`'s QA chain together with `Chroma` to implement semantic search over the PDF document. The sample code uses the [AWS Serverless Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf), a PDF of 84 pages.

1. Install the required `Python` packages

In [1]:
# Core dependencies: LangChain, the Chroma vector store, and the PyMuPDF loader
!pip install -q langchain openai chromadb pymupdf tiktoken
# Wenxin LLM wrapper used in step 2
!pip install langchain-wenxin
# sentence-transformers backs the HuggingFaceEmbeddings used in step 5
!pip install huggingface-hub==0.14.1
!pip install sentence-transformers==2.2.2
# Not used by the steps below; kept from the original environment
!pip install python-dotenv
!pip install PyPDF2
!pip install --user google-search-results
!pip install InstructorEmbedding==1.0.1
!pip install faiss-cpu

Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4


2. Set up the LLM (Baidu Wenxin)

In [2]:
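import os

from langchain_wenxin import Wenxin

# Read the Baidu credentials from the environment rather than hard-coding them.
# The variable names WENXIN_APP_KEY / WENXIN_APP_SECRET are placeholders; set them before running.
WENXIN_APP_KEY = os.environ["WENXIN_APP_KEY"]
WENXIN_APP_SECRET = os.environ["WENXIN_APP_SECRET"]

llm = Wenxin(
    temperature=0.9,
    model="ernie-bot-turbo",
    baidu_api_key=WENXIN_APP_KEY,
    baidu_secret_key=WENXIN_APP_SECRET,
    verbose=True,
)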

3. Download the AWS Serverless Developer Guide PDF

In [3]:
!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

PDF_NAME = 'serverless-core.pdf'

--2023-10-06 12:47:45--  https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf
Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 13.35.7.105, 13.35.7.2, 13.35.7.27, ...
Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|13.35.7.105|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4716244 (4.5M) [application/pdf]
Saving to: ‘serverless-core.pdf’


2023-10-06 12:47:47 (4.98 MB/s) - ‘serverless-core.pdf’ saved [4716244/4716244]



4. Load the PDF file

In [4]:
from langchain.document_loaders import PyMuPDFLoader
docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 82 document(s) in serverless-core.pdf.
There are 27 characters in the first page of your document.


5. Split the documents and store their text embeddings in a vector store

In [5]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Sentence-embedding model downloaded from the Hugging Face Hub
model_name = "shibing624/text2vec-base-chinese"

# Split the pages into chunks of up to 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name=model_name)

vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")
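To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-width chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks and newlines); the function name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With the notebook's settings (`chunk_size=1000`, `chunk_overlap=200`), a fixed-window scheme like this one would advance 800 characters per chunk.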


6. Create the QA chain on top of the LLM

In [7]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# To use OpenAI instead of the LLM configured in step 2, uncomment the next line:
# llm = OpenAI(temperature=0)
chain = load_qa_chain(llm, chain_type="stuff")
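The `stuff` chain type places all of the supplied documents into a single prompt and sends it to the LLM in one call. The sketch below illustrates that idea in plain Python; the `build_stuff_prompt` helper and the prompt wording are hypothetical and are not the template that `load_qa_chain` actually uses.

```python
def build_stuff_prompt(documents: list[str], question: str) -> str:
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk one about AWS Lambda.", "Chunk two about AWS SAM."],
    "What is the use case of AWS Serverless?",
)
print(prompt)
```

Because everything goes into one prompt, the number and size of the documents passed to the chain have to stay within the model's context window.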

7. Run a similarity search for the question

In [8]:
query = "What is the use case of AWS Serverless?"
# Retrieve the 3 chunks most similar to the query; each Document carries its metadata
similar_docs = vectorstore.similarity_search(query, k=3)

In [9]:
similar_docs

[Document(page_content='Before we dive into how the AWS SAM CLI works, we need to explain Serverless Application Model\ntemplates. And before that, you need to know about AWS CloudFormation and AWS CloudFormation \ntemplates.\nIn 2006, there were only a few services, including S3 and EC2. Managing these services was not complex, \nbut as more services were added to the AWS portfolio, creating and managing services manually became \ncumbersome.\nIn 2011, AWS released AWS CloudFormation as a service to create and conﬁgure resources. With AWS \nCloudFormation, you create text templates to deﬁne infrastructure and code. The service provisions and \nconﬁgures those resources. Adding resources became as easy as copy/paste!\nAWS CloudFormation provisions your resources in a safe, repeatable manner, enabling you to \nfrequently build your infrastructure and applications without manual steps. When you change the \nconﬁguration, AWS CloudFormation determines the right operations to perform to up
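Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The following is a minimal sketch of that idea using cosine similarity over hand-written toy vectors; the `top_k` helper and the vectors are made up for illustration, and `Chroma` maintains its own index rather than running this brute-force loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": real ones have hundreds of dimensions.
toy_store = [
    ("chunk about Lambda", [0.9, 0.1, 0.0]),
    ("chunk about S3", [0.1, 0.9, 0.0]),
    ("chunk about billing", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

for text, score in top_k(toy_query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```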

8. Answer the question with the QA chain, using the retrieved documents

In [10]:
chain.run(input_documents=similar_docs, question=query)

"AWS Serverless is a solution for developers who want to focus on building applications instead of worrying about infrastructure. It uses Serverless Application Model templates to create infrastructure as code, making it easier to provision, update, and scale resources. This approach is particularly useful for applications that require high availability, scalability, and low-cost resources. Serverless also provides a pay-as-you-go model, which means you only pay for the resources you use, and you don't have to provision or manage them yourself."
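Steps 7 and 8 can be wrapped in one helper so that each new question takes a single call. This is a sketch and not part of the original notebook: the name `answer_question` is hypothetical, and it assumes only that the store exposes `similarity_search(query, k=...)` and the chain exposes `run(input_documents=..., question=...)`. The stub classes exist solely so that the snippet runs without an API key or a vector store.

```python
def answer_question(question, vectorstore, chain, k=3):
    """Retrieve the k most similar chunks, then let the QA chain answer from them."""
    similar_docs = vectorstore.similarity_search(question, k=k)
    return chain.run(input_documents=similar_docs, question=question)


# Stand-ins for the real Chroma store and QA chain, for a dry run only.
class StubStore:
    def similarity_search(self, query, k=4):
        return [f"doc {i} for {query!r}" for i in range(k)]


class StubChain:
    def run(self, input_documents, question):
        return f"answered {question!r} from {len(input_documents)} documents"


print(answer_question("What is AWS SAM?", StubStore(), StubChain(), k=2))
```

In the notebook itself, the equivalent call would be `answer_question(query, vectorstore, chain)`.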