<a href="https://colab.research.google.com/github/dangspin/myExercise/blob/master/pdf_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Complete Example

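This is the final installment of the minimal `LangChain` primer series. Drawing on the previous nine installments, we build an LLM application with a complete feature set. It is built on the `LangChain` framework, uses the contents of a `PDF` file as its knowledge base, and lets users ask questions that are answered from that file.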

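We use a `LangChain` QA chain together with `Chroma` to implement semantic search over a PDF document. The example was written around the [AWS Serverless Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf), an 84-page PDF. Note that the cells as run below use the Baidu Wenxin model (`ernie-bot-turbo`) and Wenxin embeddings rather than OpenAI, and they load a local file named `sunshine.pdf`, which judging by the retrieved text is a Chinese-language insurance policy, rather than the AWS guide.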

1. Install the required `Python` packages

In [35]:
!pip install -q langchain openai chromadb pymupdf tiktoken
!pip install -q python-dotenv PyPDF2 langchain-wenxin
!pip install --user google-search-results

# uncomment to use huggingface llms
# !pip install huggingface-hub==0.14.1

# uncomment to use instructor embeddings
# !pip install InstructorEmbedding==1.0.1
# !pip install sentence-transformers==2.2.2
# !pip install faiss-cpu


2. Set up the LLM environment (Baidu Wenxin credentials are used here in place of OpenAI)

In [36]:
import os

# Do not commit real credentials to a notebook; replace the placeholders with your own keys.
os.environ['BAIDU_API_KEY'] = "<your-baidu-api-key>"
os.environ['BAIDU_SECRET_KEY'] = "<your-baidu-secret-key>"

In [37]:
from langchain_wenxin import Wenxin


llm = Wenxin(
    temperature=0.9,
    model="ernie-bot-turbo",
    verbose=True,
)

3. Download the PDF file (AWS Serverless Developer Guide)

In [38]:
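!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

# NOTE: the remaining cells read a local file named sunshine.pdf, not the file downloaded
# above; set PDF_NAME = 'serverless-core.pdf' to query the AWS guide instead.
PDF_NAME = 'sunshine.pdf'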

--2023-10-09 02:36:02--  https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf
Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 65.8.248.16, 65.8.248.109, 65.8.248.56, ...
Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|65.8.248.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4733257 (4.5M) [application/pdf]
Saving to: ‘serverless-core.pdf.1’


2023-10-09 02:36:03 (8.22 MB/s) - ‘serverless-core.pdf.1’ saved [4733257/4733257]



4. Load the PDF file

In [39]:
from langchain.document_loaders import PyMuPDFLoader

# PyMuPDFLoader returns one Document per PDF page.
docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 43 document(s) in sunshine.pdf.
There are 1372 characters in the first page of your document.


5. Split the documents and store the text embeddings in a vector store

In [None]:
# Alternative (not executed): embed with a local HuggingFace model instead of WenxinEmbeddings.
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores import Chroma
# from langchain.embeddings import HuggingFaceEmbeddings

# model_name = "shibing624/text2vec-base-chinese"

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# split_docs = text_splitter.split_documents(docs)

# embeddings = HuggingFaceEmbeddings(model_name=model_name)

# vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")

In [40]:
from langchain_wenxin.embeddings import WenxinEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

wenxin_embed = WenxinEmbeddings(truncate="END")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(split_docs, wenxin_embed, collection_name="serverless_guide")
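
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines before falling back to smaller pieces); the helper `sliding_chunks` is a hypothetical name used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.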

6. Create the QA chain (originally based on OpenAI; the Wenxin `llm` from step 2 is used here)

In [41]:
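# from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# llm = OpenAI(temperature=0)  # original OpenAI setup; the Wenxin llm from step 2 is used instead
# The "stuff" chain type places all retrieved documents into a single prompt.
chain = load_qa_chain(llm, chain_type="stuff")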
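
As a rough illustration of what the "stuff" chain type does, the sketch below concatenates every retrieved document into one context block and fills a single prompt with it. The `Doc` class, the `TEMPLATE` wording, and `build_stuff_prompt` are all hypothetical stand-ins; the real chain uses its own prompt text and `Document` type.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


# Illustrative template only; the real chain ships its own prompt wording.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_stuff_prompt(docs: list[Doc], question: str) -> str:
    """Join every document into one context block and fill the template."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return TEMPLATE.format(context=context, question=question)


demo_docs = [Doc("Clause A: item X is excluded."), Doc("Clause B: item Y is covered.")]
stuffed_prompt = build_stuff_prompt(demo_docs, "Which item is excluded?")
print(stuffed_prompt)
```

Because everything is sent in one request, the combined documents have to fit within the model's context window, which is one reason to retrieve only a few chunks per question.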

7. Run a similarity search for the question

In [67]:
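# Query (Chinese): "In this product, which mild malignant tumors are not covered?"
query = "本产品当中，不在保障范围内的轻度恶性肿瘤?"
# Retrieve the 3 most similar chunks; each returned Document carries its metadata.
similar_docs = vectorstore.similarity_search(query, k=3)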

In [68]:
similar_docs

[Document(page_content='41-28\n非脊髓内的其他椎管内肿瘤、脊柱肿瘤、神经纤维瘤不在保障范围内。\n脊髓内恶性肿瘤不在本疾病保障范围内。\n3.6.86\n严重癫痫\n须由我们认可的医院专科医生根据典型临床症状和脑电图及 MRI、PET、CT 等影像\n学检查做出诊断。须提供 180 天以上的相关病历记录证明被保险人存在经抗癫痫药\n物治疗无效而反复发作的强直阵挛性发作或癫痫大发作，且实际实施了神经外科手\n术。\n发热性惊厥以及没有全身性发作的失神发作（癫痫小发作）不在保障范围内。\n3.6.87\n严 重 脊 髓\n灰质炎\n指由于急性脊髓灰质炎病毒感染所导致的运动功能障碍的瘫痪性疾病。本疾病须经\n我们认可的医院专科医生出具医学诊断证明，并提供相关的脊髓灰质炎病毒感染的\n证据（例如：脑脊液检查或血清学抗体检查报告），且脊髓灰质炎造成的神经系统功\n能损害导致被保险人一肢（含）以上肢体肌力2级（含）以下。\n3.6.88\n严 重 肾 上\n腺 脑 白 质\n营养不良\n肾上腺脑白质营养不良是一种脂质代谢障碍病，由于体内缺乏过氧化物酶而致长链\n脂肪酸在体内沉积，造成脑白质和肾上腺皮质破坏。主要表现为情感障碍、运动功\n能障碍、肾上腺皮质功能减退等。须经我们认可的医院专科医生诊断，且已经造成\n自主生活能力完全丧失，确诊 180 天以后仍无法独立完成六项基本日常生活活动中\n的三项或三项以上。\n3.6.89\n结 核 性 脊\n髓炎\n指因结核杆菌引起的脊髓炎，即该疾病首次确诊180天后，仍遗留下列一种或一种以\n上障碍：\n（1）一肢（含）以上肢体肌力2级（含）以下；\n（2）自主生活能力完全丧失，无法独立完成六项基本日常生活活动中的三项或三项\n以上。\n3.6.90\n脑型疟疾\n恶性疟原虫严重感染导致的脑病或脑型疟疾，以昏迷为主要特征。脑型疟疾的诊断\n须由我们认可的医院专科医生确认，且外周血涂片存在恶性疟原虫。\n其他明确病因导致的脑病不在保障范围内。\n3.6.91\n严 重 强 直\n性脊柱炎\n强直性脊柱炎是一种慢性全身性炎性疾病，主要侵犯脊柱导致脊柱畸形。\n强直性脊柱炎必须明确诊断并且满足下列所有条件：\n（1）严重脊柱畸形；\n（2）自主生活能力完全丧失，无法独立完成六项基本日常生活活动中的三项或三项\n以上
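
Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by closeness to it. The sketch below shows this with cosine similarity on toy vectors. It is an assumption-laden illustration: the helper names are hypothetical, the vectors are made up, and the vector store may be configured with a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return indices of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors have many more dimensions.
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.2, 0.0]
best = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(best)
```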

8. Answer the question with the QA chain, using the retrieved documents

In [71]:
str_res = chain.run(input_documents=similar_docs, question=query)
print(str_res)

根据保险条款，本产品中不在保障范围内的轻度恶性肿瘤包括：

1. 甲状腺癌（TNM分期为Ⅰ期）；
2. 前列腺癌（TNM分期为T1N0M0期）；
3. 皮肤恶性肿瘤（黑色素瘤以外的未发生淋巴结和远处转移的皮肤恶性肿瘤）；
4. 慢性淋巴细胞白血病（相当于Binet分期方案A期程度）；
5. 何杰金氏病（相当于Ann Arbor分期方案Ⅰ期程度）；
6. 神经内分泌肿瘤（未发生淋巴结和远处转移且WHO分级为G1级别）。

以上是符合条款中不在保障范围内的轻度恶性肿瘤，如有其他问题，可以进一步咨询保险公司。


## Parse the results

In [88]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema, CommaSeparatedListOutputParser
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_wenxin import ChatWenxin

In [79]:
chat_model = ChatWenxin(
    temperature=0.9,
    model="ernie-bot-turbo",
    verbose=True,
)

In [95]:
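# The list parser is kept for reference only: the prompt below asks for JSON,
# so its format instructions are not used.
output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

# Pass the answer in as a template variable instead of interpolating it with an f-string,
# so that braces in the model output cannot be mistaken for template placeholders.
# Prompt (Chinese): "Please organize each item in this sentence {answer} into JSON format, splitting on line breaks."
prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template("请把这句话 {answer} 中每一条，依照换行符整理成JSON 格式")
])

_input = prompt.format_prompt(answer=str_res)
output = chat_model(_input.to_messages())

# output_parser.parse(output)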

In [101]:
print(output.content)

以下是将您提供的文字按照换行符整理成JSON格式的结果：

```json
{
  "轻度恶性肿瘤": [
    {
      "疾病名称": "甲状腺癌（TNM分期为Ⅰ期）",
      "描述": "不在保障范围内"
    },
    {
      "疾病名称": "前列腺癌（TNM分期为T1N0M0期）",
      "描述": "不在保障范围内"
    },
    {
      "疾病名称": "皮肤恶性肿瘤（黑色素瘤以外的未发生淋巴结和远处转移的皮肤恶性肿瘤）",
      "描述": "不在保障范围内"
    },
    {
      "疾病名称": "慢性淋巴细胞白血病（相当于Binet分期方案A期程度）",
      "描述": "不在保障范围内"
    },
    {
      "疾病名称": "何杰金氏病（相当于Ann Arbor分期方案Ⅰ期程度）",
      "描述": "不在保障范围内"
    },
    {
      "疾病名称": "神经内分泌肿瘤（未发生淋巴结和远处转移且WHO分级为G1级别）",
      "描述": "不在保障范围内，如有其他问题，可以进一步咨询保险公司"
    }
  ]
}
```
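
The reply above wraps the JSON in an introductory sentence and a markdown fence, so passing it straight to `json.loads` would fail. One possible way to recover the structured data is sketched below. `extract_json` is a hypothetical helper, not part of `LangChain`, and the sample reply is made up for the example.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed


def extract_json(reply: str):
    """Return the parsed JSON found in a chat reply.

    Looks for a fenced block first and falls back to parsing the whole reply.
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)


# The closing fence is glued to the last brace, as in the model output above.
sample_reply = (
    "Here is the result:\n\n"
    + FENCE + "json\n"
    + '{"items": [{"name": "A", "note": "excluded"}]}'
    + FENCE
)
parsed = extract_json(sample_reply)
print(parsed["items"][0]["name"])
```

In the notebook this would be called as `extract_json(output.content)`. `json.loads` raises `json.JSONDecodeError` when the model returns malformed JSON, so a real application would want to catch that and retry or report the failure.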
