# LangChain Getting-Started Examples


In [95]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

True

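## 1. Summarization
Handing an LLM a piece of text and asking it for a summary is one of the most common use cases.

The most popular application of this kind is probably ChatPDF, and the same idea can be implemented with LangChain.
### 1. Summarizing short text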


In [3]:
from langchain_community.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
llm = ChatOpenAI(temperature=0, 
                 model_name = 'gpt-3.5-turbo') # Initialize the LLM

# Create the template
template = """
%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:
{text}
"""

# Create a LangChain prompt template; values can be inserted into it later
prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [4]:
confusing_text = """
For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”
"""
print ("------- Prompt Begin -------")
# Print the formatted prompt
final_prompt = prompt.format(text=confusing_text)
print(final_prompt)

print ("------- Prompt End -------")

------- Prompt Begin -------

%INSTRUCTIONS:
Please summarize the following piece of text.
Respond in a manner that a 5 year old would understand.

%TEXT:

For the next 130 years, debate raged.
Some scientists called Prototaxites a lichen, others a fungus, and still others clung to the notion that it was some kind of tree.
“The problem is that when you look up close at the anatomy, it’s evocative of a lot of different things, but it’s diagnostic of nothing,” says Boyce, an associate professor in geophysical sciences and the Committee on Evolutionary Biology.
“And it’s so damn big that when whenever someone says it’s something, everyone else’s hackles get up: ‘How could you have a lichen 20 feet tall?’”


------- Prompt End -------


In [8]:
output = llm.invoke(final_prompt)
print (output)

content="For a long time, scientists argued about what Prototaxites was. Some thought it was a lichen, some thought it was a fungus, and others thought it was a tree. The problem was that when they looked closely at it, it looked like different things but didn't match anything exactly. It was also really big, so people couldn't agree on what it was." response_metadata={'token_usage': {'completion_tokens': 77, 'prompt_tokens': 166, 'total_tokens': 243}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None} id='run-1df4ea55-c1d3-45ee-9891-f94f09a0c55f-0'


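### 2. Summarizing long text

Short texts can be summarized directly, as shown above.

Once the text exceeds the maximum number of tokens the LLM supports, that approach no longer works.

LangChain provides an out-of-the-box tool for long texts: load_summarize_chain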

In [9]:
# Summaries Of Longer Text
from langchain_community.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = ChatOpenAI(temperature=0)

In [15]:
with open('data/wonderland.txt', 'r',encoding = 'utf-8') as file:
    text = file.read() # The text (described by the author as Alice in Wonderland)

# Print the first 285 characters of the story
print (text[:285])

Wonderland
Where is the wonderland I want to find it.——题记
（一）
“Wonderland”，很多人会把这个词翻译作“奇幻世界”之类的。但我，更爱把它称做“乐土”。
人生只有一次。我奉信完全随心所欲、自由自在的人生。因此，我也一直，梦想着……不，幻想着，寻找到“乐土”。
我觉得有一句来自东方古国的话很适合当我的座右铭——“人生得意须尽欢”。如果这唯一的人生都不能过得快乐，令自己满意，那还有什么意义呢？人生在世，重要的不是为别人留下些什么，而是能够在离开的时候潇洒一笑，说，我这一生过得无怨无悔。
没错，我就是这


In [18]:
num_tokens = llm.get_num_tokens(text)

print (f"There are {num_tokens} tokens in your file") 
# Token count of the full text

There are 5505 tokens in your file


Long texts are handled by chunking (splitting) the original text into smaller passages.
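The idea of chunking with overlap can be sketched in plain Python. This is a simplified fixed-window version, assuming character-based sizes; `split_with_overlap` is a hypothetical helper, and the real RecursiveCharacterTextSplitter additionally tries to cut at separators such as newlines.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in at least one of them.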

In [20]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], 
                                               chunk_size=4000, 
                                               chunk_overlap=350)
# RecursiveCharacterTextSplitter is used here, but other splitters work as well
docs = text_splitter.create_documents([text])

print (f"You now have {len(docs)} docs instead of 1 piece of text")

You now have 2 docs instead of 1 piece of text


In [21]:
docs

[Document(page_content='Wonderland\nWhere is the wonderland I want to find it.——题记\n（一）\n“Wonderland”，很多人会把这个词翻译作“奇幻世界”之类的。但我，更爱把它称做“乐土”。\n人生只有一次。我奉信完全随心所欲、自由自在的人生。因此，我也一直，梦想着……不，幻想着，寻找到“乐土”。\n我觉得有一句来自东方古国的话很适合当我的座右铭——“人生得意须尽欢”。如果这唯一的人生都不能过得快乐，令自己满意，那还有什么意义呢？人生在世，重要的不是为别人留下些什么，而是能够在离开的时候潇洒一笑，说，我这一生过得无怨无悔。\n没错，我就是这么一个自私的人。不过我一直认为这样很好。\n曾经有那么一个人——跟我完全相反的一个人。他身上背负着来自家庭、来自社会各方面的压力和负担，但他从来没认为这些是应该抛弃的累赘，反而认为这是他的责任。\n对，责任。他认为这是他必须承担的，是他生命的一部分。\n哈哈，笑话。你看，他最后被那些责任压垮了哦。英年早逝，死因是劳累过度。这些都和我早就过世的“家人”一模一样。他们都是这样让我无法理解的人。如果人不能为自己而活……我实在是无法想象他们的那种人生。\n我没有去他的葬礼，而是在他下葬之后的第二天去了公墓。我把从郊外采来，还带着露水的白色野百合放在他的墓碑前，然后就一屁股坐在了地上，舒舒服服地靠着他的墓碑。\n那天天气不太好，空中飘着密密麻麻的雨丝。低气压也令人生厌，我觉得呼吸困难，鼻腔酸涩。公墓那边是这座濒临毁灭的城市中唯一一处还有点绿化的地方了。寥寥几棵种植在公墓外围的树上的叶子在雨中泛出鲜嫩的绿色，在风中沙沙地摇曳着。你看这世界上的人多不可理喻。明明就剩下这么一点环境还算可以的地方，却要留给已经不在了的人们。为什么……不能留给还在的人？\n我胡乱抹了抹脸上的水，顺带着捋顺了在雨中凌乱不堪的一头杂毛。我的余光瞟到了让我感兴趣的东西。于是我转头盯着那人墓碑上的遗像发呆。\n“你说，我以前从来没看过你的笑容。这唯一一次却是看到了你的遗像。”我禁不住笑了出来，“你说，要是你能像我一样只为自己而活的话，是不是就不会是现在这个样子了？”\n那个人叫宇智波鼬。\n其实……我也明白，只有这样的他才算是他。\n不过他已经再也不能反驳我的话了。\n（二）\n宇智波鼬是我

In [22]:
# Set up the LangChain chain
# Use chain_type='map_reduce' so the summaries of several documents are merged into one
chain = load_summarize_chain(llm=llm, 
                             chain_type='map_reduce') # Pass verbose=True to show the run log

In [23]:
# Use it. This runs through the documents, summarizes each chunk, then summarizes the summaries.
# A typical map-reduce approach: split the text into parts, summarize each part, then merge by summarizing the summaries
output = chain.run(docs)
print (output)
# Try it yourself

The narrator reflects on their search for happiness and freedom in life, contrasting their carefree lifestyle with that of a responsible individual. They recall interactions with a talented classmate and explore themes of loneliness and the search for happiness in a troubled world. The story ends with a man in a mental hospital creating a painting of a young man labeled as a patient, questioning if anyone has truly found their own paradise.
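The map-reduce flow used above can be sketched without any LLM. This is only an illustration of the control flow: `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is a stand-in for the model call, not how load_summarize_chain is implemented.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries.
    return summarize("\n".join(partial_summaries))


def first_sentence(text: str) -> str:
    # Hypothetical stand-in for an LLM call: keep only the first sentence.
    return text.split(".")[0].strip() + "."


chunks = ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch sticks."]
print(map_reduce_summarize(chunks, first_sentence))
```

With N chunks the summarizer is called N + 1 times, which is why long inputs cost noticeably more tokens than a single direct summary.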


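## 2. Question Answering over Documents (QA based Documents)

For the LLM to perform a QA task:
1. We need to pass it context that it can refer to
2. We need to state our question precisely

### 1. Question answering over short text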

In [33]:
from langchain_community.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
llm = ChatOpenAI(temperature=0)

In [34]:
loader = TextLoader('data/wonderland.txt', 
                    encoding='utf-8') # Load a long text; we again use the Alice in Wonderland text as input
doc = loader.load()
print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 4723 characters in that document


In [35]:
# Split the story into several parts
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, 
                                               chunk_overlap=400)
docs = text_splitter.split_documents(doc)

In [36]:
# Get the total number of characters so the average can be computed
num_total_characters = sum([len(x.page_content) for x in docs])

print (f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 2 documents that have an average of 2,537 characters (smaller pieces)


In [37]:
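# Set up the embedding engine
embeddings = OpenAIEmbeddings()

# Embed the documents and keep the vectors together with the original text in a lightweight vector store
# This step sends API requests to OpenAI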
docsearch = FAISS.from_documents(docs, embeddings)

In [38]:
# Create the retrieval QA chain
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff", 
                                 retriever=docsearch.as_retriever())

In [39]:
query = "What does the author describe the Alice following with?"
qa.run(query)
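# Here the retriever fetches the chunks most similar to the question and passes them, together with the question, to the LLM, which reasons out the answer
# Many details are worth tuning at this step: the chunk size, the embedding engine, the retriever, and so on
# A cloud-hosted vector store is also an option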

"I don't have enough information to determine what the author describes Alice following with."

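The retrieve-then-answer flow can be sketched offline. The example below assumes a toy word-count "embedding" in place of OpenAIEmbeddings and a plain list in place of FAISS; `embed`, `cosine`, `retrieve`, and `stuff_prompt` are hypothetical helpers, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. A real embedding model returns dense vectors.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def stuff_prompt(query: str, context: list[str]) -> str:
    # "stuff" means all retrieved chunks are placed into a single prompt.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"


chunks = ["the rabbit checked its watch", "the queen played croquet"]
best = retrieve("what did the rabbit check", chunks)
print(stuff_prompt("what did the rabbit check", best))
```

If the retrieved chunks do not contain the answer, the model can only reply that it lacks information, which matches the response shown above.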
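## 3. Extraction

Extraction is the process of parsing structured data out of a piece of text.

It is usually combined with an extraction parser to structure the data. Some example uses:

1. Extracting a structured row from a sentence to insert into a database
2. Extracting multiple rows from a long document to insert into a database
3. Extracting parameters from a user query to make an API call

A popular extraction library at the moment is Kor.

### 1. Manual format conversion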

In [40]:
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_community.chat_models import ChatOpenAI


chat_model = ChatOpenAI(temperature=0, 
                        model='gpt-3.5-turbo')

In [41]:
# Vanilla Extraction
instructions = """
You will be given a sentence with fruit names, extract those fruit names and assign an emoji to them
Return the fruit name and emojis in a python dictionary
"""

fruit_names = """
Apple, Pear, this is an kiwi
"""

In [42]:
# Make your prompt which combines the instructions w/ the fruit names
prompt = (instructions + fruit_names)

# Call the LLM
output = chat_model([HumanMessage(content=prompt)])

print (output.content)
print (type(output.content))

{
    "Apple": "🍎",
    "Pear": "🍐",
    "kiwi": "🥝"
}
<class 'str'>


In [43]:
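import ast
output_dict = ast.literal_eval(output.content) # Convert the string manually; literal_eval only accepts literals, which is safer than eval on model output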

print (output_dict)
print (type(output_dict))

{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}
<class 'dict'>


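### 2. Automatic format conversion

langchain.output_parsers.StructuredOutputParser can generate the format instructions for a prompt automatically.

You no longer have to hand-craft the output format in the prompt: LangChain handles that part and converts the LLM output into a Python object.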
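What such a parser has to do with the model's reply can be sketched with the standard library. This is an assumption-laden illustration, not StructuredOutputParser's actual implementation; `parse_json_reply` and `FENCE` are hypothetical names.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly in Markdown


def parse_json_reply(reply: str, expected_keys: list[str]) -> dict:
    # Pull the JSON object out of a fenced json snippet, if the fence is present.
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, reply, re.DOTALL)
    payload = match.group(1) if match else reply
    data = json.loads(payload)
    # Fail loudly when the model left out a requested field.
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


reply = FENCE + 'json\n{"artist": "Portugal. The Man", "song": "So Young"}\n' + FENCE
print(parse_json_reply(reply, ["artist", "song"]))
```

Checking for the expected keys matters because the model is not guaranteed to follow the format instructions on every run.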

In [44]:
# Parse the output into structured data
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

response_schemas = [
    ResponseSchema(name="artist", description="The name of the musical artist"),
    ResponseSchema(name="song", description="The name of the song that the artist plays")
]

# The parser reads the LLM output according to the schema defined above and returns the expected structured data
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)


In [45]:
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```


In [46]:
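# This prompt differs from the one built for the chat model earlier
# It is a ChatPromptTemplate that injects the format instructions, so the output parser can turn the model's reply into a Python object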
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("Given a command from the user, extract the artist and song names \n \
                                                    {format_instructions}\n{user_prompt}")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [48]:
artist_query = prompt.format_prompt(user_prompt="I really like So Young by Portugal. The Man")
print(artist_query.messages[0].content)
print("--------------")
artist_query

Given a command from the user, extract the artist and song names 
                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
I really like So Young by Portugal. The Man
--------------


ChatPromptValue(messages=[HumanMessage(content='Given a command from the user, extract the artist and song names \n                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"artist": string  // The name of the musical artist\n\t"song": string  // The name of the song that the artist plays\n}\n```\nI really like So Young by Portugal. The Man')])

In [49]:
artist_output = chat_model(artist_query.to_messages())
output = output_parser.parse(artist_output.content)

print (output)
print (type(output))
# Note that because we are using the turbo model, the result is not guaranteed to be identical on every run
# Switching to a GPT-4 model may be a better choice

{'artist': 'Portugal. The Man', 'song': 'So Young'}
<class 'dict'>


In [50]:
artist_query.to_messages()

[HumanMessage(content='Given a command from the user, extract the artist and song names \n                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"artist": string  // The name of the musical artist\n\t"song": string  // The name of the song that the artist plays\n}\n```\nI really like So Young by Portugal. The Man')]

## 4. Evaluation

Because natural language is unpredictable and variable, it is hard to judge whether an LLM's output is correct. LangChain provides a way to help with this.

In [51]:
# Embeddings, store, and retrieval
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Model and doc loader
from langchain.document_loaders import TextLoader
from langchain_community.chat_models import ChatOpenAI
# Eval
from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(temperature=0)

In [52]:
# Use the Alice in Wonderland text as input again
loader = TextLoader('data/wonderland.txt',encoding='utf-8')
doc = loader.load()

print (f"You have {len(doc)} document")
print (f"You have {len(doc[0].page_content)} characters in that document")

You have 1 document
You have 4723 characters in that document


In [53]:
# Embeddings and docstore
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(docs, embeddings)

In [54]:
chain = RetrievalQA.from_chain_type(llm=llm, 
                                    chain_type="stuff", 
                                    retriever=docsearch.as_retriever(), 
                                    input_key="question")
# Note the input_key argument: it tells the chain which dictionary key holds the question
# The chain then picks up the question automatically and passes it to the LLM

In [55]:
question_answers = [
    {'question' : "Which animal give alice a instruction?", 'answer' : 'rabbit'},
    {'question' : "What is the author of the book", 'answer' : 'Elon Mask'}
]

In [56]:
predictions = chain.apply(question_answers)
predictions
# Run the LLM to get predictions, to be compared with the answers I supplied; the hand-written answers are trusted as correct here

[{'question': 'Which animal give alice a instruction?',
  'answer': 'rabbit',
  'result': 'The text provided does not mention any specific animal giving Alice instructions. Therefore, based on the context provided, there is no information about an animal giving Alice instructions.'},
 {'question': 'What is the author of the book',
  'answer': 'Elon Mask',
  'result': "I don't have enough information to determine the author of the book based on the provided context."}]

In [58]:
# Start your eval chain
eval_chain = QAEvalChain.from_llm(llm)

graded_outputs = eval_chain.evaluate(question_answers,
                                     predictions,
                                     question_key="question",
                                     prediction_key="result",
                                     answer_key='answer')
graded_outputs

[{'results': 'INCORRECT'}, {'results': 'CORRECT'}]
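QAEvalChain asks an LLM to do the grading. As a cross-check, a deterministic baseline can be sketched in a few lines. This is a different, much cruder technique (normalized substring matching), and `normalize` and `grade` are hypothetical helpers. In the run above the second pair was graded CORRECT although the prediction does not contain the reference answer, which is the kind of case a baseline like this flags.

```python
def normalize(text: str) -> str:
    # Lowercase and drop punctuation so "Rabbit." matches "rabbit".
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def grade(examples: list[dict], predictions: list[dict]) -> list[str]:
    grades = []
    for example, prediction in zip(examples, predictions):
        # A prediction counts as correct when it contains the reference answer.
        hit = normalize(example["answer"]) in normalize(prediction["result"])
        grades.append("CORRECT" if hit else "INCORRECT")
    return grades


examples = [{"question": "Which animal?", "answer": "rabbit"}]
predictions = [{"result": "It was the White Rabbit."}]
print(grade(examples, predictions))
```

Substring matching misses paraphrases, so it complements an LLM grader rather than replacing it.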

## 5. Querying Tabular Data

In [63]:
# Query a SQLite database in natural language; we use the San Francisco trees dataset
# Skip the following cells if you do not have SQLite and the database file below
from langchain import OpenAI, SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain
llm = OpenAI(temperature=0)

In [64]:
sqlite_db_path = 'data/San_Francisco_Trees.db'
db = SQLDatabase.from_uri(f"sqlite:///{sqlite_db_path}")

In [65]:
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)



In [66]:
db_chain.run("How many Species of trees are there in San Francisco?")

> Entering new SQLDatabaseChain chain...
How many Species of trees are there in San Francisco?
SQLQuery:SELECT COUNT(DISTINCT qSpecies) FROM SFTrees
SQLResult: [(578,)]
Answer:There are 578 species of trees in San Francisco.
> Finished chain.


'There are 578 species of trees in San Francisco.'

In [67]:
import sqlite3
import pandas as pd

# Connect to the SQLite database
connection = sqlite3.connect(sqlite_db_path)

# Define your SQL query
query = "SELECT count(distinct qSpecies) FROM SFTrees"

# Read the SQL query into a Pandas DataFrame
df = pd.read_sql_query(query, connection)
print(len(df))
# Close the connection
connection.close()

1


In [68]:
# Display the result in the first column first cell
print(df.iloc[0,0])

578
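Readers without the database file can try the same kind of query on an in-memory table. The table and column names `SFTrees` and `qSpecies` come from the query above; the `TreeID` column and the three rows are made up for illustration.

```python
import sqlite3

# Build a tiny in-memory stand-in for the SFTrees table (rows are made up).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE SFTrees (TreeID INTEGER PRIMARY KEY, qSpecies TEXT)")
connection.executemany(
    "INSERT INTO SFTrees (qSpecies) VALUES (?)",
    [("Acer rubrum",), ("Acer rubrum",), ("Ginkgo biloba",)],
)

# Same shape of query as the one the chain generated above.
species_count = connection.execute(
    "SELECT COUNT(DISTINCT qSpecies) FROM SFTrees"
).fetchone()[0]
connection.close()
print(species_count)
```

COUNT(DISTINCT ...) counts each species once, so duplicate rows for the same species do not inflate the result.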


## 6. Code Understanding

Code understanding uses much the same tools as document question answering, except that the input is a project's source code.

In [70]:
# Helper to read local files
import os

# Vector Support
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Model and chain
from langchain.chat_models import ChatOpenAI

# Text splitters
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

llm = ChatOpenAI(model='gpt-3.5-turbo')

In [71]:
embeddings = OpenAIEmbeddings(disallowed_special=())

In [72]:
root_dir = './data/thefuzz/'
docs = []

# Go through each folder
for dirpath, dirnames, filenames in os.walk(root_dir):
    
    # Go through each file
    for file in filenames:
        try: 
            # Load up the file as a doc and split
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e: 
            pass

In [75]:
for i,doc in enumerate(docs):
    docs[i].page_content = docs[i].page_content[:300]

In [76]:
print (f"You have {len(docs)} documents\n")
print ("------ Start Document ------")
print (docs[0].page_content[:300])

You have 129 documents

------ Start Document ------
id|custom_title|stubhub_title|vividseats_title
701562|Toronto Blue Jays at Baltimore Orioles (Wednesday April 25, 2012)|Baltimore Orioles vs Toronto Blue Jays [4/25/2012] Tickets at StubHub!|Toronto Blue Jays at Baltimore Orioles
701563|Texas Rangers at Baltimore Orioles (Tuesday May 8, 2012)|Baltim


In [77]:
docsearch = FAISS.from_documents(docs, embeddings)
# Get our retriever ready
qa = RetrievalQA.from_chain_type(llm=llm, 
                                 chain_type="stuff", 
                                 retriever=docsearch.as_retriever())
query = "What function do I use if I want to find the most similar item in a list of items?"
output = qa.run(query)

In [78]:
print (output)

You can use the cosine similarity function to find the most similar item in a list of items.


In [79]:
query = "Can you write the code to use the process.extractOne() function? Only respond with code. No other text or explanation"
output = qa.run(query)
print(output)

from fuzzywuzzy import process

choices = ["Atlanta Braves", "Boston Red Sox", "Chicago Cubs", "Los Angeles Dodgers"]
query = "Red Sox"

process.extractOne(query, choices)


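## 7. Interacting with APIs
If the data or action you need sits behind an API, the LLM has to be able to interact with that API.

This is where Agents and Plugins come into play.

The demo may be simple, but the functionality can become very sophisticated.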

In [80]:
from langchain.chains import APIChain
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

In [86]:
api_docs = """

BASE URL: https://restcountries.com/

API Documentation:

The API endpoint /v3.1/name/{name} Used to find information about a country. All URL parameters are listed below:
    - name: Name of country - Ex: italy, france
    
The API endpoint /v3.1/currency/{currency} Used to find information about a region. All URL parameters are listed below:
    - currency: 3 letter currency. Example: USD, COP
    
Woo! This is my documentation
"""
chain_new = APIChain.from_llm_and_api_docs(llm, 
                                           api_docs,
                                           limit_to_domains=["https://restcountries.com/"],
                                           verbose=True)
chain_new.run('Can you tell me information about france?')

> Entering new APIChain chain...
https://restcountries.com/v3.1/name/france
[{"name":{"common":"France","official":"French Republic","nativeName":{"fra":{"official":"République française","common":"France"}}},"tld":[".fr"],"cca2":"FR","ccn3":"250","cca3":"FRA","cioc":"FRA","independent":true,"status":"officially-assigned","unMember":true,"currencies":{"EUR":{"name":"Euro","symbol":"€"}},"idd":{"root":"+3","suffixes":["3"]},"capital":["Paris"],"altSpellings":["FR","French Republic","République française"],"region":"Europe","subregion":"Western Europe","languages":{"fra":"French"},"translations":{"ara":{"official":"الجمهورية الفرنسية","common":"فرنسا"},"bre":{"official":"Republik Frañs","common":"Frañs"},"ces":{"official":"Francouzská republika","common":"Francie"},"cym":{"official":"French Republic","common":"France"},"deu":{"official":"Französische Republik","common":"Frankreich"},"est":{"official":"Prantsuse Vabariik","common":"Prantsusmaa"},"f

' The response from the API provides information about France, including its official name, native name, top-level domain, country codes, currencies, capital, region, languages, translations, borders, area, population, timezones, continents, flags, coat of arms, and postal code format.'

In [87]:
chain_new.run('Can you tell me about the currency COP?')

> Entering new APIChain chain...
https://restcountries.com/v3.1/currency/COP
[{"name":{"common":"Colombia","official":"Republic of Colombia","nativeName":{"spa":{"official":"República de Colombia","common":"Colombia"}}},"tld":[".co"],"cca2":"CO","ccn3":"170","cca3":"COL","cioc":"COL","independent":true,"status":"officially-assigned","unMember":true,"currencies":{"COP":{"name":"Colombian peso","symbol":"$"}},"idd":{"root":"+5","suffixes":["7"]},"capital":["Bogotá"],"altSpellings":["CO","Republic of Colombia","República de Colombia"],"region":"Americas","subregion":"South America","languages":{"spa":"Spanish"},"translations":{"ara":{"official":"جمهورية كولومبيا","common":"كولومبيا"},"bre":{"official":"Republik Kolombia","common":"Kolombia"},"ces":{"official":"Kolumbijská republika","common":"Kolumbie"},"cym":{"official":"Gweriniaeth Colombia","common":"Colombia"},"deu":{"official":"Republik Kolumbien","common":"Kolumbien"},"est":{"official":"Colom

' The currency COP belongs to Colombia, a country in South America with a population of over 50 million. The flag of Colombia consists of three horizontal bands of yellow, blue, and red. The capital of Colombia is Bogotá and the official language is Spanish. '
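The two things the chain does before calling the API, building a URL from the documented routes and checking it against the allowed domains, can be sketched offline. The base URL and routes come from the API documentation string above; `build_url`, `is_allowed`, and `ALLOWED_HOSTS` are hypothetical names, and no request is sent.

```python
from urllib.parse import quote, urlparse

BASE_URL = "https://restcountries.com"
ALLOWED_HOSTS = {"restcountries.com"}  # plays the role of limit_to_domains


def build_url(endpoint: str, value: str) -> str:
    # endpoint is "name" or "currency", matching the two documented routes.
    if endpoint not in {"name", "currency"}:
        raise ValueError(f"undocumented endpoint: {endpoint}")
    return f"{BASE_URL}/v3.1/{endpoint}/{quote(value)}"


def is_allowed(url: str) -> bool:
    # Refuse any URL whose host is not on the allow-list.
    return urlparse(url).hostname in ALLOWED_HOSTS


url = build_url("currency", "COP")
print(url, is_allowed(url))
```

Restricting the host is a useful guard because the URL is produced by a model and could otherwise point anywhere.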

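## 8. Chatbots

Chatbots use many of the tools covered earlier and, most importantly, add one more: memory.

They interact with users in real time and give them an approachable UI for asking questions in natural language.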


In [88]:
from langchain.llms import OpenAI
from langchain import LLMChain
from langchain.prompts.prompt import PromptTemplate

# Chat specific components
from langchain.memory import ConversationBufferMemory

In [89]:
template = """
You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

{chat_history}
Human: {human_input}
Chatbot:"""

prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"], 
    template=template
)
memory = ConversationBufferMemory(memory_key="chat_history")

In [90]:
llm_chain = LLMChain(
    llm=OpenAI(), 
    prompt=prompt, 
    verbose=True, 
    memory=memory
)

In [91]:
llm_chain.predict(human_input="Is an pear a fruit or vegetable?")

> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it


H: Is an pear a fruit or vegetable?
Chatbot:

> Finished chain.


" It's both! It's a fruit when it's young and a vegetable when it's old and wrinkly. Just like my grandma."

In [92]:
llm_chain.predict(human_input="What was one of the fruits I first asked you about?")
# Answering the second question requires the first exchange, which is why memory is used here

> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot that is unhelpful.
Your goal is to not help the user but only make jokes.
Take what the user is saying and make a joke out of it

H: Is an pear a fruit or vegetable?
AI:  It's both! It's a fruit when it's young and a vegetable when it's old and wrinkly. Just like my grandma.
Human: What was one of the fruits I first asked you about?
Chatbot:

> Finished chain.


" Hmm, let me think. Oh yes, it was an unripe banana. But let's be real, any banana is just a potential banana bread ingredient."

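The way a buffer memory feeds earlier turns back into the prompt can be sketched as follows. `BufferMemory` is a hypothetical, simplified stand-in for ConversationBufferMemory; the "Human:" and "AI:" prefixes mirror the formatted prompt shown above.

```python
class BufferMemory:
    """Keeps every turn and renders the whole history as one text block."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


TEMPLATE = "{chat_history}\nHuman: {human_input}\nChatbot:"

memory = BufferMemory()
memory.save("Is a pear a fruit or vegetable?", "It's both!")
prompt = TEMPLATE.format(chat_history=memory.render(), human_input="What did I ask first?")
print(prompt)
```

Because the full history is pasted into every prompt, token usage grows with each turn; that is the trade-off of a plain buffer.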
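## 9. A document chatbot in under 50 lines of code
Assume everything updated in 2022 is stored in the file 2022.txt. The code below then lets ChatGPT answer questions about 2022.

The principle is simple:

1. Vectorize the user's input/prompt
2. Segment the document into words
3. Split the document into chunks
4. Vectorize the text; similarity can only be computed between vectors
5. Store the vectorized text in a vector database
6. Search the vector database with the user's input/prompt (a match is judged by the vector similarity between the prompt and the relevant passages)
7. Return the answer through the LLM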

In [94]:
import os                            # file-system helpers
import jieba as jb                   # jieba, a Chinese word-segmentation library
from langchain.chains import ConversationalRetrievalChain   # builds the conversational retrieval chain
from langchain.chat_models import ChatOpenAI                # chat model wrapper
from langchain.document_loaders import DirectoryLoader      # loads files from a directory
from langchain.embeddings import OpenAIEmbeddings           # creates embeddings
from langchain.text_splitter import TokenTextSplitter       # splits documents
from langchain.vectorstores import Chroma                   # vector database

if not os.path.exists('./data/cut'):
    os.makedirs('./data/cut')

# Preprocess the input documents
def init():  
    files = ['2022.txt']      # Files to process
    for file in files:        # Iterate over the files
        with open(f"./data/{file}", 'r', encoding='utf-8') as f:   # Open for reading
            data = f.read()   # Read the file content

        cut_data = " ".join(jb.cut(data))        # Segment the text into space-separated words
        cut_file = f"./data/cut/cut_{file}"      # Path of the segmented output file
        with open(cut_file, 'w', encoding='utf-8') as f:   # Open for writing, with an explicit encoding
            f.write(cut_data)                    # Write the segmented text

# Load the documents
def load_documents(directory):  
    # DirectoryLoader loads every .txt file under the given directory
    loader = DirectoryLoader(directory, glob='**/*.txt')  
    docs = loader.load()  # Load the files
    return docs

# Split the documents
def split_documents(docs):  
    # TokenTextSplitter splits the documents by token count
    text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)  
    docs_texts = text_splitter.split_documents(docs)  # Split the loaded text
    return docs_texts

# Create the embedding object
def create_embeddings(api_key):  
    # load() calls create_embeddings(api_key), so the key is accepted here and passed on explicitly
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)  
    return embeddings

# Create the vector database
def create_chroma(docs_texts, embeddings, persist_directory):  
    # Build a Chroma store from the documents, the embeddings, and the persistence directory
    vectordb = Chroma.from_documents(docs_texts, embeddings, persist_directory=persist_directory)  
    vectordb.persist()      # Persist the vector data to disk
    return vectordb

# load() calls the single-purpose functions defined above
def load():
    docs = load_documents('./data/cut')        # Load the documents
    docs_texts = split_documents(docs)         # Split the documents
    
    api_key = os.environ.get('OPENAI_API_KEY')   # Read the OpenAI API key from the environment
    if not api_key:
        raise ValueError("OpenAI API key is missing. Please set it as an environment variable.")

    embeddings = create_embeddings(api_key)      # Create the embedding object

    # Create the vector database
    vectordb = create_chroma(docs_texts, embeddings, './data/cut/')  

    # Create the chat model
    openai_obj = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")  

    # Build the ConversationalRetrievalChain from the model and the vector retriever
    chain = ConversationalRetrievalChain.from_llm(openai_obj, vectordb.as_retriever())  
    return chain

# Preprocess the documents before loading
init()

# Build the ConversationalRetrievalChain
chain = load()  

# Answer a single question
def get_ans(question):  
    chat_history = []      # Start with an empty chat history
    result = chain({       # Call the chain
        'chat_history': chat_history,  # Pass the chat history
        'question': question,          # Pass the question
    })
    return result['answer']      # Return the answer

if __name__ == '__main__':       # When run as the main program
    s = input('please input:')   # Read user input
    while s != 'exit':      # Loop until the user types 'exit'
        ans = get_ans(s)    # Get the answer
        print(ans)  # Print the answer
        s = input('please input:')  # Read the next input

FileNotFoundError: [Errno 2] No such file or directory: './data/2022.txt'
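Note that get_ans in the cell above starts from an empty chat_history on every call, so the chain never sees earlier turns. One possible way to carry the history across questions is sketched below, assuming the chain accepts a list of (question, answer) pairs; `make_chat` and `fake_chain` are hypothetical, and `fake_chain` merely stands in for the real chain so the sketch runs offline.

```python
from typing import Callable


def make_chat(chain: Callable[[dict], dict]) -> Callable[[str], str]:
    chat_history: list[tuple[str, str]] = []  # lives across calls

    def ask(question: str) -> str:
        result = chain({"chat_history": chat_history, "question": question})
        # Remember this exchange so the next question can refer back to it.
        chat_history.append((question, result["answer"]))
        return result["answer"]

    return ask


def fake_chain(inputs: dict) -> dict:
    # Hypothetical stand-in for the real chain: reports how many earlier turns it saw.
    return {"answer": f"seen {len(inputs['chat_history'])} earlier turns"}


ask = make_chat(fake_chain)
print(ask("first question"))
print(ask("second question"))
```

Keeping the history in a closure avoids a module-level global while still letting follow-up questions build on earlier ones.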

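## 10. Agents

Agents are one of the hottest 🔥 topics around LLMs.

An agent can look at data, infer what action to take next, and carry out that action for you through tools; it acts as an AI-driven decision maker.

A word of caution: use Auto-GPT carefully, as it can burn through a large number of tokens very quickly.

We need to obtain `GOOGLE_API_KEY` from the Google Cloud credential console (https://console.cloud.google.com/apis/credentials) and `GOOGLE_CSE_ID` from the Programmable Search Engine (https://programmablesearchengine.google.com/controlpanel/create), then install the third-party module with `pip install google-api-python-client`.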


In [98]:
# Helpers
import os
import json

from langchain.llms import OpenAI

# Agent imports
from langchain.agents import load_tools
from langchain.agents import initialize_agent

# Tool imports
from langchain.agents import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.utilities import TextRequestsWrapper

In [99]:
llm = OpenAI(temperature=0)

In [102]:
search = GoogleSearchAPIWrapper()

requests = TextRequestsWrapper()

In [103]:
toolkit = [
    Tool(
        name = "Search",
        func=search.run,
        description="useful for when you need to search google to answer questions about current events"
    ),
    Tool(
        name = "Requests",
        func=requests.get,
        description="Useful for when you need to make a request to a URL"
    ),
]

In [104]:
agent = initialize_agent(toolkit, 
                         llm, 
                         agent="zero-shot-react-description", 
                         verbose=True, 
                         return_intermediate_steps=True)

In [106]:
response = agent({"input":"What is the capital of canada?"})
response['output']

> Entering new AgentExecutor chain...
 I should search for the answer on google
Action: Search
Action Input: "capital of canada"

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
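The step the agent was attempting when the request timed out, reading the model's "Action" and "Action Input" lines and dispatching to the named tool, can be sketched offline. `parse_action` is a hypothetical helper and the `Search` entry below is an offline stub, not the Google search wrapper; the real agent's output parsing is more involved.

```python
import re


def parse_action(llm_text: str) -> tuple[str, str]:
    # Extract the tool name and its input from a ReAct-style step.
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", llm_text)
    if match is None:
        raise ValueError("no action found in model output")
    return match.group(1), match.group(2).strip().strip('"')


# Hypothetical offline tool standing in for the Google search wrapper.
tools = {"Search": lambda q: f"(offline stub) results for: {q}"}

step = 'I should search for the answer on google\nAction: Search\nAction Input: "capital of canada"'
tool_name, tool_input = parse_action(step)
print(tools[tool_name](tool_input))
```

In a full agent loop the tool's result is appended to the prompt as an observation and the model is asked for its next step, until it produces a final answer.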