# LangChain: Debugging & Evaluation

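How do you assess whether an application meets a given acceptance standard? When you change a parameter of the system, such as swapping the LLM, the vector database, or the prompt, how do you know whether the results got better or worse? This section covers how to evaluate LLM-based applications and how to debug the intermediate steps of chains by displaying the prompt, the retrieved documents, and the intermediate results of each step.

## Environment Setup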

In [None]:
!pip install python-dotenv
!pip install openai
!pip install --upgrade langchain
!pip install pandas
!pip install docarray
!pip install tiktoken

In [None]:
%env OPENAI_API_KEY=sk-T8NU5uCIOvnyvU===QuV

In [4]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
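
If the key is missing, `os.environ['OPENAI_API_KEY']` fails with a bare `KeyError`. A minimal sketch of a friendlier check is shown here; `require_env` is a hypothetical helper name and is not part of `openai` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value
```

In the cell above, this could be used as `openai.api_key = require_env("OPENAI_API_KEY")`.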

## Creating a QA Application

In [5]:
from langchain.chains import RetrievalQA # chain that answers questions over retrieved documents
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch # vectorstore_cls
from langchain.indexes import VectorstoreIndexCreator # used to create the vector store index

In [10]:
# Load the data: an outdoor clothing catalog
loader = CSVLoader(
  file_path="drive/MyDrive/LangChain-Learning/OutdoorClothingCatalog_1000.csv",
  encoding='utf-8'
)
data = loader.load()

In [8]:
# Embed the data and keep the vectors and text chunks in memory (this vector store is in-memory only)
vector_store_index = VectorstoreIndexCreator(
  vectorstore_cls = DocArrayInMemorySearch
).from_loaders([loader])
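
Conceptually, an in-memory vector store keeps one embedding per text chunk and answers a query by ranking the chunks by similarity to the query embedding. The following is a toy sketch of that idea in plain Python, with hand-made 3-dimensional vectors standing in for real embeddings. `ToyVectorStore` is a hypothetical class and is not how `DocArrayInMemorySearch` is implemented.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) pairs."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: List[float], text: str) -> None:
        self._items.append((vector, text))

    def search(self, query_vector: List[float], k: int = 1) -> List[str]:
        # Rank every stored chunk by similarity to the query, best first.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


# Made-up vectors; a real setup would obtain them from an embedding model.
store = ToyVectorStore()
store.add([1.0, 0.0, 0.0], "shoes")
store.add([0.0, 1.0, 0.0], "dog mat")
store.add([0.0, 0.0, 1.0], "swimsuit")
top = store.search([0.9, 0.1, 0.0], k=1)
```

A linear scan like this is fine for a small catalog; larger collections usually call for an approximate nearest-neighbor index.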

In [11]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store_index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
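
The `stuff` chain type places all retrieved documents into a single prompt, joined by the document separator. A simplified sketch of that step follows. `build_stuff_context`, `build_prompt`, and the prompt wording are illustrative assumptions, not LangChain's actual implementation or default prompt.

```python
from typing import List

SEPARATOR = "<<<<>>>>>"  # same separator string as passed in chain_type_kwargs


def build_stuff_context(docs: List[str], separator: str = SEPARATOR) -> str:
    """Join all retrieved document texts into one context string."""
    return separator.join(docs)


def build_prompt(question: str, docs: List[str]) -> str:
    """Hypothetical prompt template: context first, then the question."""
    context = build_stuff_context(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "Does the set have side pockets?",
    [
        "name: Cozy Comfort Pullover Set, Stripe",
        "name: Ultra-Lofty 850 Stretch Down Hooded Jacket",
    ],
)
```

A distinctive separator makes document boundaries easy to spot when reading the `context` field in debug output.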

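## Creating Test Cases Manually

Write QA test cases by hand after reading through the documents in the dataset. This approach is not recommended, since every case has to be written by hand.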

In [12]:
data[10]

Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'drive/MyDrive/LangChain-Learning/OutdoorClothingCatalog_1000.csv', 'row': 10})

In [13]:
data[11]

Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'drive/MyDrive/LangChain-Learning/OutdoorClothingCatalog_1000.csv', 'row': 11})

In [None]:
examples = [
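    {
        # Keep each query on one line: a backslash continuation inside the
        # literal would embed the next line's indentation in the string.
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",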
        "answer": "The DownTek collection"
    }
]
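
When a long query is split across lines, a backslash continuation inside a string literal keeps the next line's indentation as part of the string, while adjacent string literals do not. A small sketch of the difference in plain Python:

```python
# Backslash continuation: the 8 spaces of indentation end up inside the string.
with_backslash = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

# Adjacent literals inside parentheses are concatenated exactly as written.
with_adjacent_literals = (
    "Do the Cozy Comfort Pullover Set "
    "have side pockets?"
)

# Collapsing runs of whitespace repairs a string that was already built badly.
repaired = " ".join(with_backslash.split())
```

The extra whitespace rarely changes the model's answer, but it does change the exact text that is embedded and sent in the prompt.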

## Generating Test Cases with an LLM

Pass the dataset to QAGenerateChain and it generates the corresponding QA pairs automatically.

In [14]:
from langchain.evaluation.qa import QAGenerateChain
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [16]:
examples = example_gen_chain.apply_and_parse(
  [{"doc": t} for t in data[:5]]
)

examples



[{'query': "What is the weight of the Women's Campside Oxfords per pair?",
  'answer': "The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair."},
 {'query': 'What are the dimensions for the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?",
  'answer': "The swimsuit has bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The fabric is also UPF 50+ rated, providing the highest rated sun protection possible by blocking 98% of the sun's harmful rays. The swimsuit has crossover no-slip straps and a fully lined bottom ensuring a secure fit and maximum coverage. It can be machine washed and line dried for best results."},
 {'query': 'Wha
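
Generated examples are worth a quick structural check before they are fed to the QA chain. The sketch below keeps only entries with a non-empty `query` and `answer`. `normalize_examples` is a hypothetical helper, and the handling of a nested `qa_pairs` key is an assumption about a shape that some LangChain versions are reported to return; it does not appear in the output above.

```python
from typing import Any, Dict, List


def normalize_examples(raw: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return flat {'query', 'answer'} dicts, dropping incomplete entries."""
    cleaned: List[Dict[str, str]] = []
    for item in raw:
        nested = item.get("qa_pairs")
        # Assumption: some versions wrap each pair as {'qa_pairs': {...}}.
        pair = nested if isinstance(nested, dict) else item
        query = str(pair.get("query", "")).strip()
        answer = str(pair.get("answer", "")).strip()
        if query and answer:
            cleaned.append({"query": query, "answer": answer})
    return cleaned
```

The same helper can be applied to hand-written examples, so both sources end up in one list with a uniform shape.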

## Enabling Debug Mode

To make debugging easier and see the exact input and output of each step in a chain, just set the debug attribute.

In [17]:
import langchain
langchain.debug = True
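
Because `langchain.debug` is a global flag, it stays on until it is reset. One way to limit its scope is a small context manager that sets an attribute and restores the previous value afterwards. This is a generic sketch: `temporary_attr` is a hypothetical helper, and a `SimpleNamespace` stands in for the `langchain` module so the snippet runs without the library.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the block, then restore the old value."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the body raises, so the flag never stays switched on.
        setattr(obj, name, previous)


# Stand-in for the real module (assumption made for a runnable example).
fake_langchain = SimpleNamespace(debug=False)

with temporary_attr(fake_langchain, "debug", True):
    inside = fake_langchain.debug  # True while the block runs
after = fake_langchain.debug  # back to False afterwards
```

With the real module this would read `with temporary_attr(langchain, "debug", True): ...`, assuming the flag is a plain module attribute that is read at call time, as in the cells here.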

In [18]:
qa.run(examples[0]["query"])

[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What is the weight of the Women's Campside Oxfords per pair?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "What is the weight of the Women's Campside Oxfords per pair?",
  "context": ": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole wi

"The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair."

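The debug trace is colorized with ANSI escape sequences, which show up as noise such as `[32;1m` when the output is saved as plain text. A minimal sketch for stripping them from captured output follows. `strip_ansi` is a hypothetical helper, and the regex only covers color and style (SGR) sequences.

```python
import re

# Matches SGR sequences such as ESC[32;1m and ESC[0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Remove ANSI color/style escape sequences from text."""
    return ANSI_SGR.sub("", text)


raw = (
    "\x1b[32;1m\x1b[1;3m[chain/start]\x1b[0m "
    "\x1b[1m[1:chain:RetrievalQA] Entering Chain run with input:\n"
    "\x1b[0m{"
)
clean = strip_ansi(raw)
```
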
## Evaluating the Application on the Test Cases with an LLM

In [19]:
langchain.debug = False

In [20]:
predictions = qa.apply(examples)

> Entering new chain...
> Finished chain.

> Entering new chain...
> Finished chain.

> Entering new chain...
> Finished chain.

> Entering new chain...
> Finished chain.

> Entering new chain...
> Finished chain.

In [21]:
predictions

[{'query': "What is the weight of the Women's Campside Oxfords per pair?",
  'answer': "The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair.",
  'result': "The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair."},
 {'query': 'What are the dimensions for the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".',
  'result': 'The dimensions for the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28" and the dimensions for the medium size are 22.5" x 34.5".'},
 {'query': "What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?",
  'answer': "The swimsuit has bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The fabric is also UPF 50+ rated, providing the highest rated

In [23]:
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

graded_outputs



[{'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'}]
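
With more than a handful of examples, a single summary number is easier to compare across configurations than a list of grades. A minimal sketch, assuming each graded output is a dict whose `text` field is `CORRECT` or `INCORRECT`; `accuracy` is a hypothetical helper, and the second list of grades is made up purely for illustration.

```python
from typing import Dict, List


def accuracy(graded_outputs: List[Dict[str, str]]) -> float:
    """Fraction of outputs graded exactly CORRECT (0.0 for an empty list)."""
    if not graded_outputs:
        return 0.0
    # Compare the whole label: "INCORRECT" contains "CORRECT" as a substring,
    # so a substring test would count failures as passes.
    correct = sum(
        1 for g in graded_outputs if g["text"].strip().upper() == "CORRECT"
    )
    return correct / len(graded_outputs)


# Grades as shown in the output above.
baseline = [{"text": "CORRECT"} for _ in range(5)]

# Hypothetical grades for a changed configuration (not real results).
changed = [
    {"text": "CORRECT"},
    {"text": "INCORRECT"},
    {"text": "CORRECT"},
    {"text": "INCORRECT"},
    {"text": "CORRECT"},
]

baseline_score = accuracy(baseline)
changed_score = accuracy(changed)
```

Comparing such scores on the same example set gives a rough signal of whether a change to the LLM, vector store, or prompt helped or hurt.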

In [24]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer']) # answer the LLM generated from the full document
    print("Predicted Answer: " + predictions[i]['result']) # answer generated after retrieving documents via embedding + vector store
    print("Predicted Grade: " + graded_outputs[i]['text']) # the LLM's grade of the predicted answer against the real answer
    print()

Example 0:
Question: What is the weight of the Women's Campside Oxfords per pair?
Real Answer: The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair.
Predicted Answer: The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair.
Predicted Grade: CORRECT

Example 1:
Question: What are the dimensions for the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions for the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28" and the dimensions for the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 2:
Question: What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?
Real Answer: The swimsuit has bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and res
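
The examples above show why an LLM is used as the grader: in Example 1 the predicted answer is worded differently from the reference yet carries the same facts. As a point of comparison, here is a sketch of a plain normalized exact-match check. It is a much stricter metric, swapped in for illustration only, and is not what `QAEvalChain` does.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punct).strip()


def exact_match(reference: str, prediction: str) -> bool:
    return normalize(reference) == normalize(prediction)


# Example 0 from the output above: identical wording.
ref0 = "The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair."
pred0 = "The Women's Campside Oxfords weigh approximately 1 lb. 1 oz. per pair."

# Example 1 from the output above: same facts, different wording.
ref1 = (
    'The small size has dimensions of 18" x 28" and the medium size '
    'has dimensions of 22.5" x 34.5".'
)
pred1 = (
    "The dimensions for the small size of the Recycled Waterhog Dog Mat, "
    'Chevron Weave are 18" x 28" and the dimensions for the medium size '
    'are 22.5" x 34.5".'
)

match0 = exact_match(ref0, pred0)
match1 = exact_match(ref1, pred1)
```

String matching accepts Example 0 but rejects Example 1, even though the LLM grader marked both as CORRECT.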

## LangChain Application Evaluation Platform

ref: https://blog.langchain.dev/auto-eval-of-question-answering-tasks/

The platform provides a GUI for visually editing the test set, adding test cases to it, and so on.