# Building Your Own Conversational LLM with LangChain

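1. Accurate question answering in a specific domain
2. Frequently updated data
3. Explainable and traceable generated content
4. Data privacy protection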

In this example, we build our own LLM-powered chat application on top of `LangChain`, `OpenAI (LLM)`, and a `vector DB`.

The main technique used: ***Retrieval Augmented Generation (RAG)***

First, make sure you have an `OpenAI API key` (not strictly required; a later section shows an alternative without OpenAI).
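
Hardcoding the key in a notebook makes it easy to leak. Below is a minimal sketch, assuming the key is supplied through the `OPENAI_API_KEY` environment variable. The helper name `get_api_key` is hypothetical and is not part of LangChain or the OpenAI SDK.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None,
                name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    # Hypothetical helper: falls back to the process environment by default.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key
```

The returned value could then be passed as `openai_api_key=get_api_key()` when constructing the chat model.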


### Setting up the environment

In [2]:
! pip install -qU \
    langchain==0.0.316 \
    openai==0.28.1  \
    tiktoken==0.5.1  \
    cohere \
    chromadb==0.4.15

### Create a chat model (no RAG)

In [1]:
import os
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

An example of the message format used by the OpenAI Python library:
```python
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Knock knock."},
    {"role": "assistant", "content": "Who's there?"},
    {"role": "user", "content": "Orange."},
]
```
https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models


LangChain, however, expects the following format:

In [2]:

from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Knock knock."),
    AIMessage(content="Who's there?"),
    HumanMessage(content="Orange"),
]

In [3]:
res = chat(messages)
res

AIMessage(content='Orange who?')

Because `res` is itself an `AIMessage`, we can append it directly to `messages` and call the model again to get the next response.

In [4]:
messages.append(res)
res = chat(messages)

print(res.content)

Orange you glad I'm here to assist you?
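
The append-then-call pattern above can be wrapped in a small helper so that every turn keeps the history consistent. The sketch below uses plain role/content dictionaries and a stand-in `echo_model` function instead of a real `ChatOpenAI` call, so it runs offline. `chat_turn` and `echo_model` are hypothetical names.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(model: Callable[[List[Message]], str],
              history: List[Message],
              user_text: str) -> str:
    """Append the user message, call the model, and record its reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_model(history: List[Message]) -> str:
    """Stand-in for a real chat model: reports how many messages it received."""
    return f"received {len(history)} messages"


history: List[Message] = [{"role": "system", "content": "You are a helpful assistant."}]
first = chat_turn(echo_model, history, "Knock knock.")
second = chat_turn(echo_model, history, "Orange.")
```

With LangChain, `model` would wrap `chat(messages)` and the dictionaries would become `HumanMessage` and `AIMessage` objects.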


#### Dealing with the limitations of LLMs
1. Prone to hallucination
2. Outdated information
3. Lack of in-depth knowledge in specialized domains


In [7]:
messages = [
    SystemMessage(content="你是一个专业的知识助手。"),
    HumanMessage(content="你知道baichuan2模型吗？"),
]

In [5]:
res = chat(messages)
print(res.content)



ChatGPT cannot meet our needs in certain specialized domains. We can address this by injecting knowledge through the prompt:

In [9]:

baichuan2_information = [
    "Baichuan 2是一个大规模多语言语言模型，它专注于训练在多种语言中表现优异的模型，包括不仅限于英文。这使得Baichuan 2在处理各种语言的任务时能够取得显著的性能提升。",
    "Baichuan 2是从头开始训练的，使用了包括了2.6万亿个标记的庞大训练数据集。相对于以往的模型，Baichuan 2提供了更丰富的数据资源，从而能够更好地支持多语言的开发和应用。",
    "Baichuan 2不仅在通用任务上表现出色，还在特定领域（如医学和法律）的任务中展现了卓越的性能。这为特定领域的应用提供了强有力的支持。"
]

source_knowledge = "\n".join(baichuan2_information)

In [10]:
print(source_knowledge)

Baichuan 2是一个大规模多语言语言模型，它专注于训练在多种语言中表现优异的模型，包括不仅限于英文。这使得Baichuan 2在处理各种语言的任务时能够取得显著的性能提升。
Baichuan 2是从头开始训练的，使用了包括了2.6万亿个标记的庞大训练数据集。相对于以往的模型，Baichuan 2提供了更丰富的数据资源，从而能够更好地支持多语言的开发和应用。
Baichuan 2不仅在通用任务上表现出色，还在特定领域（如医学和法律）的任务中展现了卓越的性能。这为特定领域的应用提供了强有力的支持。


In [11]:
query = "你知道baichuan2模型吗？"

prompt_template = f"""基于以下内容回答问题：

内容:
{source_knowledge}

Query: {query}"""

In [13]:
prompt = HumanMessage(
    content=prompt_template
)
messages.append(prompt)

res = chat(messages)

In [14]:
print(res.content)

是的，我知道Baichuan 2模型。Baichuan 2是一个大规模多语言语言模型，它专注于训练在多种语言中表现优异的模型，包括不仅限于英文。Baichuan 2从头开始训练，使用了庞大的训练数据集，其中包括了2.6万亿个标记。相比以往的模型，Baichuan 2提供了更丰富的数据资源，从而能够更好地支持多语言的开发和应用。Baichuan 2不仅在通用任务上表现出色，还在特定领域的任务中展现了卓越的性能，为特定领域的应用提供了强有力的支持。


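Once we inject some domain knowledge, the model answers the related question well.
If every question were augmented with relevant external knowledge in this way, would the accuracy of the answers improve substantially?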
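
The manual steps above (join the knowledge, fill the template, wrap it in a message) generalize to a small function. Here is a sketch that covers only the string assembly. The helper name `build_augmented_prompt` is hypothetical, and the English template wording is an assumption.

```python
from typing import Sequence


def build_augmented_prompt(query: str, knowledge: Sequence[str]) -> str:
    """Join the non-empty knowledge snippets and place them ahead of the question."""
    context = "\n".join(s.strip() for s in knowledge if s.strip())
    return (
        "Answer the question based on the content below.\n\n"
        f"Content:\n{context}\n\n"
        f"Query: {query}"
    )


prompt_text = build_augmented_prompt(
    "Do you know the Baichuan 2 model?",
    ["Baichuan 2 was trained on 2.6 trillion tokens.", "  ", "It is multilingual."],
)
```

The resulting string can be passed as `content` to a `HumanMessage`, exactly as in the cell above.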

### Create a RAG chat model

#### 1. Load the data (using the Baichuan 2 paper as an example)

   https://arxiv.org/pdf/2309.10305v2.pdf

In [None]:
! pip install pypdf

In [22]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://arxiv.org/pdf/2309.10305.pdf")

pages = loader.load_and_split()

In [23]:
pages

[Document(page_content='Baichuan 2: Open Large-scale Language Models\nAiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan\nDian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai\nGuosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji\nJian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma\nMang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun\nTao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng\nXiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang\nYiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu\nBaichuan Inc.\nAbstract\nLarge language models (LLMs) have\ndemonstrated remarkable performance on\na variety of natural language tasks based\non just a few examples of natural language\ninstructions, reducing the need for extensive\nfeature engineering. However, most powerful\nLLMs are closed-sourc

#### 2. Chunking: split the documents into evenly sized chunks, each holding a piece of the original text

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50,
)

docs = text_splitter.split_documents(pages)

In [None]:
len(docs)

215

In [None]:
docs[0]

Document(page_content='Baichuan 2: Open Large-scale Language Models\nAiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan\nDian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai\nGuosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji\nJian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma\nMang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun', metadata={'source': 'baichuan2.pdf', 'page': 0})
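
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and newlines). The function `split_with_overlap` is a hypothetical illustration of the two parameters only.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.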

#### 3. Embed each text chunk with an embedding model and store the vectors in a vector database

In [25]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


embed_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents=docs, embedding=embed_model)

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/embeddings (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1129)'))).
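
Conceptually, this step maps every chunk to a fixed-length vector and keeps the vector next to its text. The sketch below replaces the OpenAI embedding model with a toy hashed bag-of-words function so it runs offline. `toy_embed` and `InMemoryStore` are hypothetical stand-ins, not Chroma or LangChain APIs, and the vectors carry no real semantic meaning.

```python
import hashlib
import math
from typing import List, Tuple


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Hash each lowercase word into one of dim buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryStore:
    """Keeps (vector, text) pairs in a list; a real vector DB adds indexing and persistence."""

    def __init__(self) -> None:
        self.rows: List[Tuple[List[float], str]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self.rows.append((toy_embed(text), text))


store = InMemoryStore()
store.add_texts(["Baichuan 2 has two separate models", "trained on 2.6 trillion tokens"])
```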


#### 4. Retrieve the K documents most relevant to the question by vector similarity

In [26]:
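query = "How large is the baichuan2 vocabulary?"
# Use a separate name so the chunk list in `docs` is not overwritten;
# `docs` is reused later when building the second vector store.
results = vectorstore.similarity_search(query)

In [27]:
results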

[Document(page_content='languages, such as Chinese.\nIn this technical report, we introduce Baichuan\n2, a series of large-scale multilingual language\nmodels. Baichuan 2 has two separate models,\nBaichuan 2-7B with 7 billion parameters and\nBaichuan 2-13B with 13 billion parameters. Both\nmodels were trained on 2.6 trillion tokens, which\nto our knowledge is the largest to date, more than\ndouble that of Baichuan 1 (Baichuan, 2023b,a).\nWith such a massive amount of training data,', metadata={'page': 1, 'source': 'C:\\Users\\blackink\\AppData\\Local\\Temp\\tmpw2rjzd69\\tmp.pdf'}),
 Document(page_content='Baichuan 2: Open Large-scale Language Models\nAiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan\nDian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai\nGuosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji\nJian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma\nMang Wang, Mickel
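
Similarity search boils down to scoring every stored vector against the query vector and keeping the best K. Below is a brute-force sketch using cosine similarity on hand-written 3-dimensional vectors. The function names and the example vectors are made up for illustration, and the distance metric and index that Chroma actually uses may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          rows: List[Tuple[Sequence[float], str]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Score every row against the query and return the k highest-scoring texts."""
    scored = [(cosine(query_vec, vec), text) for vec, text in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [
    ([1.0, 0.0, 0.0], "chunk about tokenizer"),
    ([0.0, 1.0, 0.0], "chunk about training data"),
    ([0.7, 0.7, 0.0], "chunk about both"),
]
best = top_k([1.0, 0.1, 0.0], rows, k=2)
```

Here the query vector points mostly along the first axis, so the tokenizer chunk ranks first and the mixed chunk second.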

#### 5. Combine the original `query` with the retrieved text and pass it to the language model to get the final answer

In [28]:
def augment_prompt(query: str):
    # Retrieve the top 3 text chunks
    results = vectorstore.similarity_search(query, k=3)
    source_knowledge = "\n".join([x.page_content for x in results])
    # Build the prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

  contexts:
  {source_knowledge}

  query: {query}"""
    return augmented_prompt

In [29]:
print(augment_prompt(query))

Using the contexts below, answer the query.

  contexts:
  languages, such as Chinese.
In this technical report, we introduce Baichuan
2, a series of large-scale multilingual language
models. Baichuan 2 has two separate models,
Baichuan 2-7B with 7 billion parameters and
Baichuan 2-13B with 13 billion parameters. Both
models were trained on 2.6 trillion tokens, which
to our knowledge is the largest to date, more than
double that of Baichuan 1 (Baichuan, 2023b,a).
With such a massive amount of training data,
Baichuan 2: Open Large-scale Language Models
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan
Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai
Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji
Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma
Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun
With such a massive amount of training data,
Baich

In [30]:
# Build the prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)

messages.append(prompt)

res = chat(messages)

print(res.content)

Based on the given contexts, the size of the Baichuan2 vocabulary is not explicitly mentioned.
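
The model reports that the retrieved contexts do not state the vocabulary size. When debugging a case like this, it helps to see which pages the contexts came from, which also supports the traceability goal listed at the top. The sketch below assumes each retrieved chunk is available as a `(text, metadata)` pair with a `page` key, as in the `Document` metadata shown earlier. `format_contexts` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def format_contexts(chunks: List[Tuple[str, Dict[str, object]]]) -> str:
    """Number each chunk and tag it with its page so answers can cite sources."""
    lines = []
    for i, (text, meta) in enumerate(chunks, start=1):
        page = meta.get("page", "?")  # "?" marks chunks with no page metadata
        lines.append(f"[{i}] (page {page}) {text.strip()}")
    return "\n".join(lines)


contexts = format_contexts([
    ("Both models were trained on 2.6 trillion tokens", {"page": 1}),
    ("Baichuan 2: Open Large-scale Language Models", {"page": 0}),
])
```

Placing this string in the prompt instead of the bare joined text lets the model, or a human reader, point back to a numbered source.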


### What if you have no OpenAI API key? Build a chat model without OpenAI

Two components need to be replaced:

1.   Embedding model
2.   Chat model


In [31]:
! pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting torchvision (from sentence-transformers)
  Downloading torchvision-0.16.1-cp39-cp39-win_amd64.whl.metadata (6.6 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_lea

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
altair 5.2.0 requires jsonschema>=3.0, which is not installed.
altair 5.2.0 requires toolz, which is not installed.
gradio 4.7.1 requires matplotlib~=3.0, which is not installed.


In [32]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

model_name = "sentence-transformers/sentence-t5-large"

In [33]:
embedding = HuggingFaceEmbeddings(model_name=model_name)

  from .autonotebook import tqdm as notebook_tqdm
.gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 587kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 95.2kB/s]
2_Dense/config.json: 100%|██████████| 116/116 [00:00<00:00, 58.0kB/s]
pytorch_model.bin: 100%|██████████| 3.15M/3.15M [00:02<00:00, 1.31MB/s]
README.md: 100%|██████████| 2.02k/2.02k [00:00<00:00, 1.01MB/s]
config.json: 100%|██████████| 1.39k/1.39k [00:00<00:00, 463kB/s]
config_sentence_transformers.json: 100%|██████████| 122/122 [00:00<00:00, 61.1kB/s]
pytorch_model.bin: 100%|██████████| 670M/670M [03:39<00:00, 3.05MB/s] 
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 20.2kB/s]
special_tokens_map.json: 100%|██████████| 1.79k/1.79k [00:00<00:00, 598kB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 799kB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:01<00:00, 1.22MB/s]
tokenizer_config.json: 100%|██████████| 1.92k/1.92k [00:00<00:00, 1.92MB/s]
modules.json: 100%|█

In [34]:
vectorstore = Chroma.from_documents(
    documents=docs[:50],
    embedding=embedding,
    collection_name="embedding_dim300",
)

In [35]:
print(augment_prompt(query))

Using the contexts below, answer the query.

  contexts:
  languages, such as Chinese.
In this technical report, we introduce Baichuan
2, a series of large-scale multilingual language
models. Baichuan 2 has two separate models,
Baichuan 2-7B with 7 billion parameters and
Baichuan 2-13B with 13 billion parameters. Both
models were trained on 2.6 trillion tokens, which
to our knowledge is the largest to date, more than
double that of Baichuan 1 (Baichuan, 2023b,a).
With such a massive amount of training data,
Baichuan 2: Open Large-scale Language Models
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan
Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai
Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji
Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma
Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun
instructions, reducing the need for extensive
feat

Interact with a locally deployed model.