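# Getting Started with LangChain in 10 Examples

This second lesson walks through 10 representative application examples to take you from zero to a working knowledge of LangChain. All of the code is intended to be runnable.

1. Summarization: summarize the key points of a text or chat log.

2. Question Answering over Documents: use documents as context and answer questions based on their content.

3. Extraction: extract structured content from text.

4. Evaluation: analyze and assess the quality of LLM output.

5. Querying Tabular Data: retrieve information from databases or database-like sources.

6. Code Understanding: analyze code and work out its logic, with QA support as well.

7. Interacting with APIs: read and understand API documentation, then call real-world APIs to fetch real data.

8. Chatbots: a chatbot framework with memory (and UI interaction capability).

9. A document chat bot in fewer than 50 lines of code.

10. Agents: use LLMs to analyze tasks and make decisions, then call tools to carry those decisions out.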

In [1]:
# Install the required dependencies before we start
%pip install langchain==0.0.258
%pip install openai==0.23.1

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Use your own OpenAI API key
openai_api_key='your OpenAI API key'
# If you do not have an API key, don't panic; we can ...

In [3]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
# Widen the notebook container so the .ipynb is more comfortable to read

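## 1. Summarization

Handing an LLM a piece of text and asking it for a summary is one of the most common use cases.

Probably the most popular application of this kind right now is ChatPDF, and something similar can be built with LangChain.


### 1. Summarizing short text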

In [5]:

# Summaries Of Short Text

from langchain.llms import OpenAI
from langchain import PromptTemplate

llm = OpenAI(temperature=0, model_name='gpt-3.5-turbo', openai_api_key=openai_api_key)  # initialize the LLM

# Create the template
template = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""

# Create a LangChain prompt template; values can be inserted later
prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

RuntimeError: no validator found for <class 're.Pattern'>, see `arbitrary_types_allowed` in Config

In [None]:
confusing_text = """
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”
"""

In [None]:
print ("------- Prompt Begin -------")
# Fill in the template and print the resulting prompt
final_prompt = prompt.format(text=confusing_text)
print(final_prompt)

print ("------- Prompt End -------")

------- Prompt Begin -------

%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:

For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”


------- Prompt End -------
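
As the printed prompt shows, formatting a prompt template comes down to placeholder substitution. Here is a minimal sketch of that idea using only Python's built-in `str.format`, with no LangChain dependency. `render_prompt` is a hypothetical helper name, and the real `PromptTemplate` class does more than this, so treat it as an illustration rather than the library's implementation.

```python
# Hypothetical stand-in for PromptTemplate.format, using plain str.format.
TEMPLATE = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""


def render_prompt(template: str, **values: str) -> str:
    """Substitute the named placeholders in the template."""
    return template.format(**values)


rendered = render_prompt(TEMPLATE, text="For the next 130 years, debate raged.")
print(rendered)
```

Keeping the instructions and the variable text in separate, clearly labelled sections makes it easy to reuse the same instructions with different inputs.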


In [None]:
output = llm(final_prompt)
print (output)

For a really long time, scientists argued about what Prototaxites was. Some said it was a lichen, some said it was a fungus, and others thought it was a tree. The problem was that when they looked closely at it, they couldn't figure out exactly what it was. It was also really, really big, so people didn't understand how it could be a lichen that was 20 feet tall.


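### 2. Summarizing long text

Short texts can be summarized directly, as above.

When the text is longer than the maximum token size the LLM supports, however, that approach breaks down.

LangChain provides an out-of-the-box tool for long text: load_summarize_chain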


In [None]:
# Summaries Of Longer Text

from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
with open('wonderland.txt', 'r',encoding = 'utf-8') as file:
    text = file.read()  # the text itself is described as Alice in Wonderland

# Print the first 285 characters of the text
print (text[:285])

Wonderland
Where is the wonderland I want to find it.——题记
（一）
“Wonderland”，很多人会把这个词翻译作“奇幻世界”之类的。但我，更爱把它称做“乐土”。
人生只有一次。我奉信完全随心所欲、自由自在的人生。因此，我也一直，梦想着……不，幻想着，寻找到“乐土”。
我觉得有一句来自东方古国的话很适合当我的座右铭——“人生得意须尽欢”。如果这唯一的人生都不能过得快乐，令自己满意，那还有什么意义呢？人生在世，重要的不是为别人留下些什么，而是能够在离开的时候潇洒一笑，说，我这一生过得无怨无悔。
没错，我就是这


In [None]:
%pip install tiktoken --user # install the tokenizer dependency used when counting and splitting by tokens

Collecting tiktoken
Collecting regex>=2022.1.18 (from tiktoken)
Installing collected packages: regex, tiktoken
Successfully installed regex-2023.8.8 tiktoken-0.5.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
num_tokens = llm.get_num_tokens(text)

print (f"There are {num_tokens} tokens in your file") 
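# The full novel is said to be about 48,000 tokens; the file used in this run is much shorter (see the count printed below)
# A text of novel length clearly cannot be sent to the LLM in a single request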

There are 3298 tokens in your file


The usual way to handle long text is chunking (splitting): break the original text into smaller passages.


In [None]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=5000, chunk_overlap=350)
# RecursiveCharacterTextSplitter is used here, but other splitters work as well
docs = text_splitter.create_documents([text])

print (f"You now have {len(docs)} docs instead of 1 piece of text")

You now have 1 docs instead of 1 piece of text
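
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on the given separators). `chunk_text` is a hypothetical helper written only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

A text shorter than `chunk_size` comes back as a single chunk, which is consistent with the one-document result printed above for `chunk_size=5000`.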


Next we need a LangChain tool that sends the chunks to the LLM for summarization.

In [None]:
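# Set up the LangChain chain
# With chain_type='map_reduce', each document is summarized separately and the partial summaries are then combined into one
chain = load_summarize_chain(llm=llm, chain_type='map_reduce') # pass verbose=True to show the run log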

In [None]:
# Use it. This runs through every document chunk, summarizes each one, then produces a summary of the summaries.
# This is the classic map-reduce approach: split the text into parts, summarize each part, then merge by summarizing the summaries
output = chain.run(docs)
print (output)
# Try yourself

 The narrator reflects on their life and dreams of finding a better place, while remembering a person from their past who was burdened by family and societal pressures and died from overworking. Visiting the person's grave, the narrator realizes that the pressures they faced allowed them to be themselves.
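
The map-reduce idea can be sketched without any LLM. In the sketch below, the toy function `first_sentence` stands in for the model call, and the two sample chunks are made up for illustration. It shows only the control flow, not what `load_summarize_chain` does internally.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Toy 'summarizer': keep only the first sentence. A real chain would call an LLM here."""
    head = text.strip().split(".")[0].strip()
    return head + "." if head else ""


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenation of the partial summaries.
    return summarize(" ".join(partial_summaries))


sample_chunks = [
    "Alice sees a rabbit. She follows it down a hole.",
    "The hole is deep. Alice falls for a long time.",
]
summary = map_reduce_summarize(sample_chunks, first_sentence)
print(summary)
```

Because each chunk is summarized independently, the map step stays within the token limit no matter how long the full text is. The cost is one model call per chunk plus one for the final merge.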


## 2. Question Answering over Documents

For an LLM to perform a QA task:
1. It needs context it can refer to.
2. It needs to be given our question accurately.

### 1. Short-text QA

In [None]:
# In short, building a QA system with documents as context boils down to llm(your context + your question) = your answer
# Simple Q&A Example

from langchain.llms import OpenAI

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
context = """
Rachel is 30 years old
Bob is 45 years old
Kevin is 65 years old
"""

question = "Who is under 40 years old?"

In [None]:
output = llm(context + question)

print (output.strip())

Rachel is under 40 years old.
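
Plain concatenation works for this small case. A slightly more explicit prompt that labels the context and the question tends to be easier to debug. The sketch below shows one possible layout; `build_qa_prompt` and the wording of its instruction line are assumptions, not part of LangChain.

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Combine the context and the question into one labelled prompt string."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


context = """
Rachel is 30 years old
Bob is 45 years old
Kevin is 65 years old
"""
question = "Who is under 40 years old?"
qa_prompt = build_qa_prompt(context, question)
print(qa_prompt)
```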


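### 2. Long-text QA

For longer text, split the text into chunks, embed each chunk, store the embeddings in a database, and then query that database.

The goal is to select the relevant chunks, but which ones should we choose? The most popular approach today is to compare vector embeddings and pick the chunks most similar to the question.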
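
A common way to compare embeddings is cosine similarity. The sketch below ranks a few chunks against a query vector. The three-dimensional vectors and the chunk names are made up for illustration, and `top_k` is a hypothetical helper. A vector store such as FAISS performs a comparable ranking at much larger scale.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]
best = top_k(query_vector, chunk_vectors, k=2)
print(best)
```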


In [None]:
%pip install faiss-cpu --user # note: faiss comes in GPU and CPU builds; install the one that matches your runtime

Collecting faiss-cpu
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4
Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain import OpenAI
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
loader = TextLoader('wonderland.txt',encoding='utf-8') # load a long text; wonderland.txt (described as Alice in Wonderland) is used again as input
doc = loader.load()
print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 1683 characters in that document


In [None]:
# Split the text into multiple parts
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)

In [None]:
# Get the total number of characters so the average can be computed
num_total_characters = sum([len(x.page_content) for x in docs])

print (f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 1 documents that have an average of 1,683 characters (smaller pieces)


In [None]:
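# Set up the embedding engine
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Embed the documents, then use a pseudo-database (the FAISS index) to keep the vectors together with the original text
# This step sends API requests to OpenAI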
docsearch = FAISS.from_documents(docs, embeddings)

In [None]:
# Create the retrieval QA chain
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())

In [None]:
query = "What does the author describe the Alice following with?"
qa.run(query)
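# In this step the retriever fetches the chunks most similar to the question and hands them to the LLM together with the question, and the LLM reasons its way to an answer
# Many details here are worth exploring: the best chunk size, the best embedding engine, the best retriever, and so on
# A cloud-hosted vector store is also an option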

' The author describes Alice following with a large and heavy book bag.'

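## 3. Extraction

Extraction is the process of parsing structured data out of a piece of text.

It is usually combined with a parser to structure the data. Some example uses:

1. Extracting a structured row from a sentence to insert into a database
2. Extracting multiple rows from a long document to insert into a database
3. Extracting parameters from a user query to make an API call

A recently popular extraction library is Kor.

### 1. Manual format conversion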

In [None]:
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

from langchain.chat_models import ChatOpenAI


chat_model = ChatOpenAI(temperature=0, model='gpt-3.5-turbo', openai_api_key=openai_api_key)

In [None]:
# Vanilla Extraction
instructions = """
You will be given a sentence with fruit names, extract those fruit names and assign an emoji to them
Return the fruit name and emojis in a python dictionary
"""

fruit_names = """
Apple, Pear, this is an kiwi
"""

In [None]:
# Make your prompt which combines the instructions w/ the fruit names
prompt = (instructions + fruit_names)

# Call the LLM
output = chat_model([HumanMessage(content=prompt)])

print (output.content)
print (type(output.content))

{
  "Apple": "🍎",
  "Pear": "🍐",
  "kiwi": "🥝"
}
<class 'str'>


In [None]:
import ast

output_dict = ast.literal_eval(output.content)  # parse the dict literal without executing arbitrary code, unlike eval()

print (output_dict)
print (type(output_dict))

{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}
<class 'dict'>
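
Model replies are untrusted text, so it is safer to parse them than to evaluate them as code. Below is a sketch of one way to do that, trying JSON first and falling back to Python literal syntax. `parse_dict_reply` is a hypothetical helper, and the sample reply mirrors the output shown above.

```python
import ast
import json


def parse_dict_reply(reply: str) -> dict:
    """Parse a model reply that should contain a single dictionary."""
    try:
        parsed = json.loads(reply)  # works when the reply is valid JSON
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (e.g. single quotes) without executing code.
        parsed = ast.literal_eval(reply.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary in the model reply")
    return parsed


reply = '{\n  "Apple": "🍎",\n  "Pear": "🍐",\n  "kiwi": "🥝"\n}'
fruit_dict = parse_dict_reply(reply)
print(fruit_dict)
```

If the model wraps the dictionary in extra prose, both parsers raise an error. That is usually preferable to silently executing whatever came back.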


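### 2. Automatic format conversion

langchain.output_parsers.StructuredOutputParser can generate the format instructions for the prompt automatically.

That way you no longer have to hand-craft the output-format part of the prompt: LangChain takes care of it and converts the LLM output into a Python object.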



In [None]:
# Parse the output into structured data
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

response_schemas = [
    ResponseSchema(name="artist", description="The name of the musical artist"),
    ResponseSchema(name="song", description="The name of the song that the artist plays")
]

# The parser uses the schema defined above to parse the LLM output and return the expected structure
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)


In [None]:
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```


In [None]:
# This prompt differs from the ones built earlier for the chat model
# It is a ChatPromptTemplate that embeds the format instructions, so the parser can later turn the reply into a Python object
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("Given a command from the user, extract the artist and song names \n \
                                                    {format_instructions}\n{user_prompt}")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [None]:
artist_query = prompt.format_prompt(user_prompt="I really like So Young by Portugal. The Man")
print(artist_query.messages[0].content)

Given a command from the user, extract the artist and song names 
                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
I really like So Young by Portugal. The Man


In [None]:
artist_output = chat_model(artist_query.to_messages())
output = output_parser.parse(artist_output.content)

print (output)
print (type(output))
# Note: because a turbo model is used here, the result is not guaranteed to be the same on every run
# Switching to a GPT-4 model may be a better choice

{'artist': 'Portugal. The Man', 'song': 'So Young'}
<class 'dict'>
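
To illustrate what parsing such a reply involves, here is a sketch that pulls the JSON object out of a fenced json block and checks that the expected keys are present. It is an illustration under stated assumptions, not LangChain's implementation. `parse_fenced_json` is a hypothetical helper, and the sample reply is constructed by hand to match the format instructions shown above.

```python
import json
import re
from typing import Dict, List

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed


def parse_fenced_json(reply: str, expected_keys: List[str]) -> Dict[str, str]:
    """Extract the JSON object inside a fenced json block and check its keys."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced json block found in the reply")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


sample_reply = (
    FENCE + "json\n"
    '{"artist": "Portugal. The Man", "song": "So Young"}\n'
    + FENCE
)
parsed = parse_fenced_json(sample_reply, ["artist", "song"])
print(parsed)
```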


## 4. Evaluation

Because natural language is unpredictable and variable, judging whether an LLM's output is correct is hard. LangChain offers a way to help with this.


In [None]:
# Embeddings, store, and retrieval
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Model and doc loader
from langchain import OpenAI
from langchain.document_loaders import TextLoader

# Eval
from langchain.evaluation.qa import QAEvalChain

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
# Use wonderland.txt (described as Alice in Wonderland) as the input text again
loader = TextLoader('wonderland.txt',encoding='utf-8')
doc = loader.load()

print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 1683 characters in that document


In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)

# Get the total number of characters so we can see the average later
num_total_characters = sum([len(x.page_content) for x in docs])

print (f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 1 documents that have an average of 1,683 characters (smaller pieces)


In [None]:
# Embeddings and docstore
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = FAISS.from_documents(docs, embeddings)

In [None]:
chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever(), input_key="question")
# Note the input_key parameter: it tells the chain which dictionary key holds the question
# The chain then picks up the question from that key and passes it to the LLM

In [None]:
question_answers = [
    {'question' : "Which animal give alice a instruction?", 'answer' : 'rabbit'},
    {'question' : "What is the author of the book", 'answer' : 'Elon Mask'}
]

In [None]:
predictions = chain.apply(question_answers)
predictions
# Predict with the LLM and compare against the answers I provided; my hand-written answers are trusted as correct here

[{'question': 'Which animal give alice a instruction?',
  'answer': 'rabbit',
  'result': " I don't know."},
 {'question': 'What is the author of the book',
  'answer': 'Elon Mask',
  'result': " I don't know."}]

In [None]:
# Start your eval chain
eval_chain = QAEvalChain.from_llm(llm)

graded_outputs = eval_chain.evaluate(question_answers,
                                     predictions,
                                     question_key="question",
                                     prediction_key="result",
                                     answer_key='answer')

In [None]:
graded_outputs

[{'results': ' INCORRECT'}, {'results': ' INCORRECT'}]
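
`QAEvalChain` asks an LLM to do the grading. For comparison, here is a much cruder rule-based grader that marks a prediction correct when the normalized reference answer appears inside it. The `grade` and `normalize` helpers are hypothetical, and the prediction "The White Rabbit." is made up to show a passing case.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial differences do not matter."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def grade(examples: List[Dict[str, str]], predictions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Mark each prediction CORRECT if it contains the reference answer."""
    graded = []
    for example, prediction in zip(examples, predictions):
        expected = normalize(example["answer"])
        actual = normalize(prediction["result"])
        verdict = "CORRECT" if expected and expected in actual else "INCORRECT"
        graded.append({"results": verdict})
    return graded


examples = [{"question": "Which animal give alice a instruction?", "answer": "rabbit"}]
good = grade(examples, [{"result": " The White Rabbit."}])
bad = grade(examples, [{"result": " I don't know."}])
print(good, bad)
```

String matching is cheap and deterministic, but it cannot recognize paraphrases. That limitation is the reason for handing the grading to an LLM.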

## 5. Querying Tabular Data

In [None]:
# Query a SQLite database in natural language, using the San Francisco trees dataset
# Skip the following cells unless SQLite and the database file below are available
from langchain import OpenAI, SQLDatabase
%pip install  --user langchain_experimental
from langchain_experimental.sql import SQLDatabaseChain
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

Collecting langchain_experimental
Installing collected packages: langchain-experimental
Successfully installed langchain-experimental-0.0.22
Note: you may need to restart the kernel to use updated packages.


In [None]:
sqlite_db_path = 'San_Francisco_Trees.db'
db = SQLDatabase.from_uri(f"sqlite:///{sqlite_db_path}")

In [None]:
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)



In [None]:
db_chain.run("How many Species of trees are there in San Francisco?")



> Entering new SQLDatabaseChain chain...
How many Species of trees are there in San Francisco?
SQLQuery: SELECT COUNT(DISTINCT "qSpecies") FROM "SFTrees";
SQLResult: [(578,)]
Answer: There are 578 Species of trees in San Francisco.
> Finished chain.


'There are 578 Species of trees in San Francisco.'

Behind the scenes, the chain works through these steps:

1. Find which table to use
2. Find which column to use
3. Construct the correct SQL query
4. Execute that query
5. Get the result
6. Return a natural language response
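
These steps can be sketched with the standard-library `sqlite3` module and an in-memory database. Only the table and column names (`SFTrees`, `qSpecies`) come from the chain output above. The `TreeID` column and the three rows are made up, and the SQL that the LLM would normally write is hard-coded.

```python
import sqlite3

# In-memory database with a few made-up rows standing in for the real dataset.
connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "SFTrees" ("TreeID" INTEGER PRIMARY KEY, "qSpecies" TEXT)')
connection.executemany(
    'INSERT INTO "SFTrees" ("qSpecies") VALUES (?)',
    [("Acer rubrum",), ("Acer rubrum",), ("Ginkgo biloba",)],
)

question = "How many Species of trees are there in San Francisco?"
# Steps 1-3: in the real chain the LLM writes this query; here it is hard-coded.
sql_query = 'SELECT COUNT(DISTINCT "qSpecies") FROM "SFTrees";'
# Steps 4-5: execute the query and fetch the result.
sql_result = connection.execute(sql_query).fetchall()
connection.close()
# Step 6: the LLM would phrase the answer; a format string stands in for it.
answer = f"There are {sql_result[0][0]} Species of trees in the sample table."
print(answer)
```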

Confirm the LLM result with pandas:

In [None]:
import sqlite3
%pip install pandas --user
import pandas as pd

# Connect to the SQLite database
connection = sqlite3.connect(sqlite_db_path)

# Define your SQL query
query = "SELECT count(distinct qSpecies) FROM SFTrees"

# Read the SQL query into a Pandas DataFrame
df = pd.read_sql_query(query, connection)

# Close the connection
connection.close()

Collecting pandas
Collecting tzdata>=2022.1 (from pandas)
Collecting pytz>=2020.1 (from pandas)
Installing collected packages: tzdata, pytz, pandas
Successfully installed pandas-2.0.3 pytz-2023.3.post1 tzdata-2023.3
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Display the result in the first column first cell
print(df.iloc[0,0])

578


## 6. Code Understanding

Code understanding uses much the same tools as document QA, except that the input is a project's source code.


In [None]:
# Helper to read local files
import os

# Vector Support
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Model and chain
from langchain.chat_models import ChatOpenAI

# Text splitters
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

llm = ChatOpenAI(model='gpt-3.5-turbo', openai_api_key=openai_api_key)

In [None]:
embeddings = OpenAIEmbeddings(disallowed_special=(), openai_api_key=openai_api_key)

In [None]:
root_dir = './thefuzz-master/'
docs = []

# Go through each folder
for dirpath, dirnames, filenames in os.walk(root_dir):
    
    # Go through each file
    for file in filenames:
        try: 
            # Load up the file as a doc and split
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e: 
            pass

In [None]:
for i,doc in enumerate(docs):
    docs[i].page_content = docs[i].page_content[:300]

In [None]:
print (f"You have {len(docs)} documents\n")
print ("------ Start Document ------")
print (docs[0].page_content[:300])


You have 129 documents

------ Start Document ------
id|custom_title|stubhub_title|vividseats_title
701562|Toronto Blue Jays at Baltimore Orioles (Wednesday April 25, 2012)|Baltimore Orioles vs Toronto Blue Jays [4/25/2012] Tickets at StubHub!|Toronto Blue Jays at Baltimore Orioles
701563|Texas Rangers at Baltimore Orioles (Tuesday May 8, 2012)|Baltim
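
The first document printed above is tabular ticket data rather than source code, because the walk loads every file it finds. One way to restrict the walk to source files is sketched below. `collect_source_files` is a hypothetical helper, and the file names in the demonstration are made up.

```python
import os
import tempfile
from typing import List, Tuple


def collect_source_files(root_dir: str, extensions: Tuple[str, ...] = (".py",)) -> List[str]:
    """Walk root_dir and return the paths whose extension is in `extensions`, sorted."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(extensions):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fuzz.py", "process.py", "titledata.csv"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("# placeholder\n")
    found = [os.path.basename(path) for path in collect_source_files(tmp)]

print(found)
```

The resulting paths could then be passed to `TextLoader` one by one, as in the loop above. This keeps data fixtures out of the index.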


In [None]:
docsearch = FAISS.from_documents(docs, embeddings)

In [None]:
# Get our retriever ready
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())

In [None]:
query = "What function do I use if I want to find the most similar item in a list of items?"
output = qa.run(query)

In [None]:
print (output)

To find the most similar item in a list of items, you can use the cosine similarity function.


In [None]:
query = "Can you write the code to use the process.extractOne() function? Only respond with code. No other text or explanation"
output = qa.run(query)
print(output)

from fuzzywuzzy import process

choices = ['Detroit Tigers vs Houston Astros', 'New York Mets vs Houston Astros', 'Cleveland Indians at Chicago White Sox',
           'Milwaukee Brewers at Chicago White Sox', 'New York Yankees at Chicago White Sox', 'Boston Red Sox at Chicago White Sox',
           'Washington Nationals at Boston Red Sox', 'Toronto Blue Jays at Boston Red Sox']

query = 'Detroit Tigers at Houston Astros'

result = process.extractOne(query, choices)
print(result)


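## 7. Interacting with APIs

If the data or action you need sits behind an API, the LLM has to be able to interact with that API.

At this point things become closely tied to Agents and Plugins.

The demo may be simple, but the functionality can get quite sophisticated.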

In [None]:
from langchain.chains import APIChain
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
api_docs = """

BASE URL: https://restcountries.com/

API Documentation:

The API endpoint /v3.1/name/{name} Used to find information about a country. All URL parameters are listed below:
    - name: Name of country - Ex: italy, france
    
The API endpoint /v3.1/currency/{currency} Used to find information about a region. All URL parameters are listed below:
    - currency: 3 letter currency. Example: USD, COP
    
Woo! This is my documentation
"""

chain_new = APIChain.from_llm_and_api_docs(llm, api_docs, verbose=True)

In [None]:
chain_new.run('Can you tell me information about france?')



> Entering new APIChain chain...
 https://restcountries.com/v3.1/name/france
[{"name":{"common":"France","official":"French Republic","nativeName":{"fra":{"official":"République française","common":"France"}}},"tld":[".fr"],"cca2":"FR","ccn3":"250","cca3":"FRA","cioc":"FRA","independent":true,"status":"officially-assigned","unMember":true,"currencies":{"EUR":{"name":"Euro","symbol":"€"}},"idd":{"root":"+3","suffixes":["3"]},"capital":["Paris"],"altSpellings":["FR","French Republic","République française"],"region":"Europe","subregion":"Western Europe","languages":{"fra":"French"},"translations":{"ara":{"official":"الجمهورية الفرنسية","common":"فرنسا"},"bre":{"official":"Republik Frañs","common":"Frañs"},"ces":{"official":"Francouzská republika","common":"Francie"},"cym":{"official":"French Republic","common":"France"},"deu":{"official":"Französische Republik","common":"Frankreich"},"est":{"official":"Prantsuse Vabariik","common":"Prantsusmaa"},"f

' France is an officially-assigned, independent country located in Western Europe. Its capital is Paris and its official language is French. Its currency is the Euro (€). It has a population of 67,391,582 and its borders are with Andorra, Belgium, Germany, Italy, Luxembourg, Monaco, Spain, and Switzerland.'

In [None]:
chain_new.run('Can you tell me about the currency COP?')



> Entering new APIChain chain...
 https://restcountries.com/v3.1/currency/COP
[{"name":{"common":"Colombia","official":"Republic of Colombia","nativeName":{"spa":{"official":"República de Colombia","common":"Colombia"}}},"tld":[".co"],"cca2":"CO","ccn3":"170","cca3":"COL","cioc":"COL","independent":true,"status":"officially-assigned","unMember":true,"currencies":{"COP":{"name":"Colombian peso","symbol":"$"}},"idd":{"root":"+5","suffixes":["7"]},"capital":["Bogotá"],"altSpellings":["CO","Republic of Colombia","República de Colombia"],"region":"Americas","subregion":"South America","languages":{"spa":"Spanish"},"translations":{"ara":{"official":"جمهورية كولومبيا","common":"كولومبيا"},"bre":{"official":"Republik Kolombia","common":"Kolombia"},"ces":{"official":"Kolumbijská republika","common":"Kolumbie"},"cym":{"official":"Gweriniaeth Colombia","common":"Colombia"},"deu":{"official":"Republik Kolumbien","common":"Kolumbien"},"est":{"official":"Colom

> Finished chain.


' The currency of Colombia is the Colombian peso (COP), symbolized by the "$" sign.'
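
In both runs, the first thing the chain produces is a URL derived from the API documentation. The sketch below builds the same URLs deterministically, without any LLM or network call. The endpoint templates are copied from the documentation string above, and `build_url` and the `ENDPOINTS` table are hypothetical names.

```python
from urllib.parse import quote, urljoin

BASE_URL = "https://restcountries.com/"
# Endpoint templates copied from the API documentation string above.
ENDPOINTS = {
    "name": "/v3.1/name/{name}",
    "currency": "/v3.1/currency/{currency}",
}


def build_url(endpoint: str, **params: str) -> str:
    """Fill in the path parameters (URL-encoded) and join the path with the base URL."""
    encoded = {key: quote(value, safe="") for key, value in params.items()}
    path = ENDPOINTS[endpoint].format(**encoded)
    return urljoin(BASE_URL, path)


country_url = build_url("name", name="france")
currency_url = build_url("currency", currency="COP")
print(country_url)
print(currency_url)
```

The LLM's contribution in `APIChain` is choosing the endpoint and the parameter values from a free-form question. After that, the request itself is ordinary HTTP.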

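## 8. Chatbots

Chatbots draw on many of the tools covered earlier and, most importantly, add one key ingredient: memory.

They interact with users in real time and give them an approachable UI for asking questions in natural language.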

In [None]:
from langchain.llms import OpenAI
from langchain import LLMChain
from langchain.prompts.prompt import PromptTemplate

# Chat specific components
from langchain.memory import ConversationBufferMemory

In [None]:
template = """
You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

{chat_history}
Human: {human_input}
Chatbot:"""

prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], 
    template=template
)
memory = ConversationBufferMemory(memory_key="chat_history")

In [None]:
llm_chain = LLMChain(
    llm=OpenAI(openai_api_key=openai_api_key), 
    prompt=prompt, 
    verbose=True, 
    memory=memory
)

In [None]:

llm_chain.predict(human_input="Is an pear a fruit or vegetable?")



> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it


H: Is an pear a fruit or vegetable?
Chatbot:

> Finished chain.


" It's neither, it's a mystery!"

In [None]:
llm_chain.predict(human_input="What was one of the fruits I first asked you about?")
# The answer to this second question depends on the first exchange, which is why memory is needed



> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

H: Is an pear a fruit or vegetable?
AI:  It's neither, it's a mystery!
Human: What was one of the fruits I first asked you about?
Chatbot:

> Finished chain.


' An enigma!'
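
The second formatted prompt shows what the memory contributes: the earlier turns are rendered as text and substituted into `{chat_history}`. Here is a minimal sketch of that mechanism. `BufferMemory` is a hypothetical class written for illustration, not LangChain's `ConversationBufferMemory`.

```python
from typing import List, Tuple

TEMPLATE = """{chat_history}
Human: {human_input}
Chatbot:"""


class BufferMemory:
    """Hypothetical minimal memory: keeps every turn and renders it as text."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save(self, human_input: str, ai_output: str) -> None:
        """Record one completed exchange."""
        self.turns.append((human_input, ai_output))

    def render(self) -> str:
        """Render the stored turns in the Human/AI layout used in the prompt."""
        lines = []
        for human_input, ai_output in self.turns:
            lines.append(f"Human: {human_input}")
            lines.append(f"AI: {ai_output}")
        return "\n".join(lines)


memory = BufferMemory()
memory.save("Is an pear a fruit or vegetable?", "It's neither, it's a mystery!")
next_prompt = TEMPLATE.format(
    chat_history=memory.render(),
    human_input="What was one of the fruits I first asked you about?",
)
print(next_prompt)
```

A buffer like this grows with every turn, so long conversations eventually need truncation or summarization to stay within the model's token limit.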

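## 9. A document chat bot in fewer than 50 lines of code
Suppose everything that happened in 2022 is stored in the file 2022.txt. The code below then lets ChatGPT answer questions about 2022.

The principle is simple:

1. Vectorize the user's input/prompt.
2. Segment the document into words.
3. Split the document into chunks.
4. Vectorize the text chunks; similarity between vectors can only be computed once the text is vectorized.
5. Store the vectorized text in a vector database.
6. Search the vector database using the user's input/prompt (a match is judged by the similarity between the prompt vector and the vectors of the relevant passages).
7. Finally, return the answer through the LLM.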
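
The whole pipeline can be sketched end to end with the standard library. In this toy version, word counts stand in for embeddings, a plain list stands in for the vector database, and whitespace splitting plays the role that jieba plays for Chinese text. The two passages are made-up sample sentences, and all helper names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Split and vectorize the document, then store the vectors
# (a plain list stands in for the vector database).
passages = [
    "the world cup final was played in december",
    "a new telescope released its first images in july",
]
vector_store: List[Tuple[str, Counter]] = [(p, embed(p)) for p in passages]


def retrieve(question: str) -> str:
    """Vectorize the question and return the most similar stored passage."""
    query = embed(question)
    return max(vector_store, key=lambda item: similarity(query, item[1]))[0]


question = "when was the world cup final played"
retrieved = retrieve(question)
# The retrieved passage plus the question is what gets sent to the LLM.
llm_prompt = f"Context: {retrieved}\nQuestion: {question}"
print(llm_prompt)
```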


In [None]:
#!/usr/bin/python
# -*- coding: UTF-8 -*-

# Install the required dependencies before we start
%pip install langchain==0.0.258
%pip install openai==0.23.1

%pip install chromadb --user         # the Chroma vector store is provided by the chromadb package

import os                            # operating-system helpers
os.environ['OPENAI_API_KEY'] = 'your OpenAI API key'  # use your own key; never publish a real one

%pip install jieba --user            # install the jieba word-segmentation library
import jieba as jb                   # Chinese word segmentation
from langchain.chains import ConversationalRetrievalChain   # conversational retrieval chain
from langchain.chat_models import ChatOpenAI                # chat model wrapper
from langchain.document_loaders import DirectoryLoader      # loads files from a directory
from langchain.embeddings import OpenAIEmbeddings           # OpenAI embeddings
from langchain.text_splitter import TokenTextSplitter       # splits documents by token count
from langchain.vectorstores import Chroma                   # Chroma vector database

if not os.path.exists('./data/cut'):
    os.makedirs('./data/cut')

# Preprocessing: segment the input documents into words
def init():
    files = ['2022.txt']      # files to process
    for file in files:        # handle each file in turn
        with open(f"./data/{file}", 'r', encoding='utf-8') as f:   # open for reading
            data = f.read()   # read the whole file

        cut_data = " ".join(jb.cut(data))        # segment into words, separated by spaces
        cut_file = f"./data/cut/cut_{file}"      # path of the segmented output file
        with open(cut_file, 'w', encoding='utf-8') as f:   # open for writing, with the same encoding used for reading
            f.write(cut_data)                    # write the segmented text

# Load the documents
def load_documents(directory):
    # DirectoryLoader loads every .txt file under the given directory
    loader = DirectoryLoader(directory, glob='**/*.txt')
    docs = loader.load()  # load the files
    return docs  # return the loaded documents

# Split the documents
def split_documents(docs):
    # TokenTextSplitter splits documents by token count
    text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs_texts = text_splitter.split_documents(docs)  # split the loaded text
    return docs_texts  # return the chunks

# Create the embedding client
def create_embeddings(api_key):
    # OpenAIEmbeddings requests embedding vectors from OpenAI
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)
    return embeddings  # return the embedding client

# Create the vector database
def create_chroma(docs_texts, embeddings, persist_directory):
    # Build a Chroma store from the chunks, the embeddings, and a persistence directory
    vectordb = Chroma.from_documents(docs_texts, embeddings, persist_directory=persist_directory)
    vectordb.persist()      # persist the vectors to disk
    return vectordb         # return the vector database

# load() wires together the single-purpose functions defined above
def load():
    docs = load_documents('./data/cut')        # load the segmented documents
    docs_texts = split_documents(docs)         # split them into chunks
    
    api_key = os.environ.get('OPENAI_API_KEY')   # read the OpenAI API key from the environment
    if not api_key:
        raise ValueError("OpenAI API key is missing. Please set it as an environment variable.")

    embeddings = create_embeddings(api_key)      # create the embedding client

    # Create the vector database
    vectordb = create_chroma(docs_texts, embeddings, './data/cut/')

    # Chat model used to generate the answers
    openai_obj = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

    # Build a ConversationalRetrievalChain from the model and the vector retriever
    chain = ConversationalRetrievalChain.from_llm(openai_obj, vectordb.as_retriever())
    return chain  # return the chain

# Run init() to preprocess the documents before loading
init()

# Build the ConversationalRetrievalChain
chain = load()

# Answer a single question
def get_ans(question):
    chat_history = []      # start from an empty history, so each question is answered independently
    result = chain({       # call the chain
        'chat_history': chat_history,  # conversation history
        'question': question,          # the question
    })
    return result['answer']      # return the answer

if __name__ == '__main__':       # when run as the main program
    s = input('please input:')   # read user input
    while s != 'exit':      # keep going until the user types 'exit'
        ans = get_ans(s)    # get the answer
        print(ans)  # print it
        s = input('please input:')  # read the next input

Collecting langchain==0.0.258
Collecting pydantic<2,>=1 (from langchain==0.0.258)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain==0.0.258)
Installing collected packages: pydantic, openapi-schema-pydantic, langchain
  Found existing installation: pydantic 2.3.0
    Uninstalling pydantic-2.3.0:
      Successfully uninstalled pydantic-2.3.0
  Rolling back uninstall of pydantic
  Moving to /Users/xuan/Library/Caches/com.apple.python/Users/xuan/Library/Python/3.8/lib/python/site-packages/

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 24: invalid start byte

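## 10. Agents

Agents are one of the hottest 🔥 topics in the LLM space.

An agent can look at data, reason about what action to take next, and carry out that action for you through tools; it acts as an AI-powered decision maker.

A word of caution: be careful with Auto-GPT, which can burn through a large number of tokens very quickly.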

In [None]:
# Helpers
import os
import json

from langchain.llms import OpenAI

# Agent imports
from langchain.agents import load_tools
from langchain.agents import initialize_agent

# Tool imports
from langchain.agents import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.utilities import TextRequestsWrapper

In [None]:
os.environ["GOOGLE_CSE_ID"] = "YOUR_GOOGLE_CSE_ID"
os.environ["GOOGLE_API_KEY"] = "YOUR_GOOGLE_API_KEY"

In [None]:
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [None]:
search = GoogleSearchAPIWrapper()

requests = TextRequestsWrapper()

In [None]:
toolkit = [
    Tool(
        name = "Search",
        func=search.run,
        description="useful for when you need to search google to answer questions about current events"
    ),
    Tool(
        name = "Requests",
        func=requests.get,
        description="Useful for when you need to make a request to a URL"
    ),
]

In [None]:
agent = initialize_agent(toolkit, llm, agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)

In [None]:
response = agent({"input":"What is the capital of canada?"})
response['output']

In [None]:
response = agent({"input":"Tell me what the comments are about on this webpage https://news.ycombinator.com/item?id=34425779"})
response['output']
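
The decide-then-act cycle behind an agent can be sketched without an LLM. In the toy below, a rule-based `decide` function stands in for the model's reasoning, and two trivial tools stand in for search and HTTP requests. Every name is hypothetical, and only a single step of the cycle is shown, whereas a real agent repeats it until it reaches a final answer.

```python
from typing import Callable, Dict, Tuple


def calculator(expression: str) -> str:
    """Toy tool: add the integers in an expression such as '2 + 3'."""
    return str(sum(int(part) for part in expression.split("+")))


def echo(text: str) -> str:
    """Toy tool: return the text unchanged."""
    return text


TOOLS: Dict[str, Callable[[str], str]] = {"Calculator": calculator, "Echo": echo}


def decide(question: str) -> Tuple[str, str]:
    """Stand-in for the LLM's decision: pick a tool name and the input to give it."""
    if "+" in question:
        return "Calculator", question
    return "Echo", question


def run_agent(question: str) -> str:
    tool_name, tool_input = decide(question)    # decide on an action
    observation = TOOLS[tool_name](tool_input)  # execute it through the chosen tool
    return f"{tool_name} -> {observation}"


print(run_agent("2 + 3"))
print(run_agent("hello"))
```

In the LangChain version above, the tool descriptions serve the same purpose as the `if` test in `decide`: the model reads them to choose which tool fits the question.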