<a href="https://colab.research.google.com/github/dangspin/myExercise/blob/master/pdf_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Complete Example

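This is the final lesson in the minimal `LangChain` introduction series. Drawing on what we covered in the previous nine lessons, we build an LLM application with a complete feature set. The application is built on the `LangChain` framework, uses the content of a `PDF` file as its knowledge base, and lets users ask questions about that content.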

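We use `LangChain`'s QA chain together with `Chroma` to implement semantic search over a PDF document. The sample code references the [AWS Serverless Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf), an 84-page PDF. The run recorded in this notebook, however, loads a local file, `sunshine.pdf` (an insurance policy document in Chinese), and uses Baidu's Wenxin models rather than OpenAI.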

1. Install the required `Python` packages

In [35]:
# Core packages for this example
!pip install -q langchain openai chromadb pymupdf tiktoken
!pip install python-dotenv
!pip install PyPDF2
!pip install langchain-wenxin
!pip install --user google-search-results

# Optional: Hugging Face LLMs
!pip install huggingface-hub==0.14.1

# Optional: Instructor embeddings
!pip install InstructorEmbedding==1.0.1
!pip install sentence-transformers==2.2.2
!pip install faiss-cpu

Collecting huggingface-hub==0.14.1
  Using cached huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.17.3
    Uninstalling huggingface-hub-0.17.3:
      Successfully uninstalled huggingface-hub-0.17.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tokenizers 0.14.1 requires huggingface_hub<0.18,>=0.16.4, but you have huggingface-hub 0.14.1 which is incompatible.
transformers 4.34.0 requires huggingface-hub<1.0,>=0.16.4, but you have huggingface-hub 0.14.1 which is incompatible.
Successfully installed huggingface-hub-0.14.1
Collecting huggingface-hub>=0.4.0 (from sentence-transformers==2.2.2)
  Using cached huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface-hub
  Attemp
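
The log above shows the pinned `huggingface-hub==0.14.1` conflicting with the versions that `tokenizers` and `transformers` require. As a minimal sketch using only the standard library, the following reports which versions are actually installed after the cell runs. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages named in the dependency-conflict message above
report = installed_versions(["huggingface-hub", "tokenizers", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If the reported `huggingface-hub` version falls outside the ranges printed in the conflict message, relaxing the pin is one option to consider.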

2. Set up the Wenxin (Baidu) environment

In [36]:
import os

# Replace the placeholders with your own Baidu credentials; do not commit real keys
os.environ['BAIDU_API_KEY'] = "<your-baidu-api-key>"
os.environ['BAIDU_SECRET_KEY'] = "<your-baidu-secret-key>"
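
Keys written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the two variables have already been exported in the shell (or loaded from a `.env` file with `python-dotenv`, which step 1 installs): read them from the environment and fail early if one is missing. The helper name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("BAIDU_API_KEY")` and `require_env("BAIDU_SECRET_KEY")` before creating the LLM would then surface a missing key as a clear error rather than a failed API call.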

In [37]:
from langchain_wenxin import Wenxin

# Wenxin (ERNIE Bot) LLM; credentials come from the environment variables set above
llm = Wenxin(
    temperature=0.9,
    model="ernie-bot-turbo",
    verbose=True,
)

3. Download the PDF file (AWS Serverless Developer Guide)

In [38]:
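!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

# The cells below load this local file, not the PDF downloaded above;
# it must already exist in the working directory
PDF_NAME = 'sunshine.pdf'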

--2023-10-09 02:36:02--  https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf
Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 65.8.248.16, 65.8.248.109, 65.8.248.56, ...
Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|65.8.248.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4733257 (4.5M) [application/pdf]
Saving to: ‘serverless-core.pdf.1’


2023-10-09 02:36:03 (8.22 MB/s) - ‘serverless-core.pdf.1’ saved [4733257/4733257]
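
The log above shows `wget` saving the file as `serverless-core.pdf.1`, which indicates that an earlier copy was already present. A minimal standard-library sketch that skips the download when the target file exists; the helper name `download_if_missing` is hypothetical.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download url to path unless the file already exists; return True if downloaded."""
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

For this notebook the call would be `download_if_missing("https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf", "serverless-core.pdf")`, which avoids accumulating numbered duplicates on repeated runs.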



4. Load the PDF file

In [39]:
from langchain.document_loaders import PyMuPDFLoader

# PyMuPDFLoader returns one Document per PDF page
docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 43 document(s) in sunshine.pdf.
There are 1372 characters in the first page of your document.


5. Split the document and store the text embeddings as vectors

In [None]:
# from langchain.embeddings.openai import OpenAIEmbeddings
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores import Chroma
# from langchain.embeddings import HuggingFaceEmbeddings

# model_name = "shibing624/text2vec-base-chinese"

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# split_docs = text_splitter.split_documents(docs)

# embeddings = HuggingFaceEmbeddings(model_name=model_name)

# vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")

In [40]:
from langchain_wenxin.embeddings import WenxinEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Embedding model used for the chunks and, later, for the query
wenxin_embed = WenxinEmbeddings(truncate="END")

# Split the pages into chunks of up to 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

# Embed every chunk and store the vectors in a Chroma collection
vectorstore = Chroma.from_documents(split_docs, wenxin_embed, collection_name="serverless_guide")
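
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size windowing with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical stand-in that only shows how consecutive chunks share text.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 2500
print([len(chunk) for chunk in chunk_text(sample)])  # [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.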

6. Create the QA chain with the Wenxin LLM

In [41]:
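from langchain.chains.question_answering import load_qa_chain

# To use OpenAI instead of the Wenxin LLM created in step 2:
# from langchain.llms import OpenAI
# llm = OpenAI(temperature=0)

# "stuff" places all supplied documents into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")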
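
With `chain_type="stuff"`, all supplied documents are placed into one prompt together with the question. A rough sketch of that idea follows; the prompt wording and the function `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt(["First passage.", "Second passage."], "What is covered?"))
```

Because everything goes into a single prompt, the combined text has to fit within the model's context window, which is one reason to retrieve only a few chunks per question.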

7. Run a similarity search for the question

In [63]:
# English: "In this product, how is carcinoma in situ defined?"
query = "本产品当中，原位癌需如何定义的?"
# Retrieve the 3 chunks most similar to the query
similar_docs = vectorstore.similarity_search(query, k=3)

In [64]:
similar_docs

[Document(page_content='多倍保障重大疾病保险条款                                                    新华人寿保险股份有限公司 \n55520161                                                                             第 10 页[共 33 页] \n注：原位癌指恶性细胞局限于上皮内尚未穿破基底膜浸润周围正常组织的癌\n细胞新生物。被保险人必须经对固定活组织的组织病理学检查明确诊断为原位癌，\n且已经接受了针对原位癌病灶的积极治疗。 \n \n \n \n \n6.4.2\n侵\n蚀 性 葡 萄\n胎 \n指异常增生的绒毛组织浸润性生长侵入子宫肌层或转移至其他器官或组织的\n葡萄胎，经组织病理学检查明确诊断并已经进行化疗或手术治疗的。 \n \n \n \n \n6.4.3\n不\n典 型 的 急\n性 心 肌 梗\n塞 \n指被临床诊断为急性心肌梗塞并接受了急性心肌梗塞治疗，且满足下列全部\n条件，但未达到本合同所指重大疾病“急性心肌梗塞”的标准： \n（1）心肌酶或肌钙蛋白有诊断意义的升高； \n（2）心电图有损伤性的 ST 段改变但未出现病理 Q 波。 \n \n \n \n \n6.4.4\n冠\n状 动 脉 介\n入手术 \n指为了治疗明显的冠状动脉狭窄性疾病，首次实际实施了冠状动脉球囊扩张\n成形术、冠状动脉支架植入术、冠状动脉粥样斑块切除术或激光冠状动脉成形术。 \n被保险人已达到本合同所指重大疾病“急性心肌梗塞”或“严重冠心病”标\n准的，本公司不承担本项手术保险责任。 \n本公司仅对本条“冠状动脉介入手术”、6.4.8 条“微创冠状动脉搭桥术”和\n6.4.14 条“激光心肌血运重建术”三项中的其中一项承担保险责任，给付其中一\n项保险金后，对其他两项轻症疾病保险责任同时终止。 \n \n \n \n \n6.4.5\n心\n脏 瓣 膜 介\n入手术 \n指为了治疗心脏瓣膜疾病，实际实施了非开胸的经胸壁打孔内镜手术或经皮\n经导管介入手术进行的心脏瓣膜置换或修复手术。 \n实施开胸心脏瓣膜置换或修复手术的，本公司不承担本项手术保险责任。 \n \n \n \n \n6.4.6\n轻\n度 原 
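
Conceptually, a similarity search embeds the query with the same embedding model as the chunks and returns the stored chunks whose vectors are closest to it. The sketch below ranks toy two-dimensional vectors by cosine similarity. The metric is an illustrative assumption (the measure a vector store uses depends on its configuration), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=3):
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data: real embeddings have hundreds of dimensions or more
stored = [("chunk A", [1.0, 0.0]), ("chunk B", [0.0, 1.0]), ("chunk C", [0.7, 0.7])]
print(top_k([1.0, 0.1], stored, k=2))  # ['chunk A', 'chunk C']
```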

8. Answer the question with the QA chain, using the relevant documents

In [65]:
print(chain.run(input_documents=similar_docs, question=query))

根据原文中的描述，原位癌指恶性细胞局限于上皮内尚未穿破基底膜浸润周围正常组织的癌细胞新生物。被保险人必须经对固定活组织的组织病理学检查明确诊断为原位癌，且已经接受了针对原位癌病灶的积极治疗。因此，在产品中，原位癌需要被保险人经对固定活组织的组织病理学检查明确诊断为原位癌，且已经接受了针对原位癌病灶的手术治疗。
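
Steps 7 and 8 can be wrapped into one helper so that each new question needs a single call. This is a minimal sketch, assuming `vectorstore` and `chain` objects that expose `similarity_search` and `run` as used in the cells above; the function name `answer_question` is hypothetical.

```python
def answer_question(vectorstore, chain, question, k=3):
    """Retrieve the k most similar chunks, then ask the QA chain to answer from them."""
    documents = vectorstore.similarity_search(question, k=k)
    return chain.run(input_documents=documents, question=question)
```

With the objects built earlier in the notebook, `print(answer_question(vectorstore, chain, query))` would repeat the two steps above for any question.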
