## Structured generation with LLM (1): Kor

First, install the required libraries: `pip3 install kor langchain pydantic`

Then import them:

In [1]:
import os
from langchain_community.chat_models import ChatOpenAI
from kor import create_extraction_chain, Object, Text, Number, from_pydantic
import pydantic
from pydantic import BaseModel, Field
from typing import List, Optional
import enum
from pprint import pprint

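Configure the `base_url` and API key for the OpenAI API. Many providers in China expose OpenAI-compatible APIs.

One I recommend is SiliconFlow (硅基流动): it offers **free** API access to several strong open-source LLMs of 9B parameters or fewer, and its API is OpenAI-compatible. The list of free models is here: https://docs.siliconflow.cn/docs/model-names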

In [2]:
os.environ['OPENAI_BASE_URL'] = "https://api.siliconflow.cn/v1" # this experiment uses SiliconFlow's free API
os.environ['OPENAI_API_KEY'] = "xxx" # replace with your own API key

llm = ChatOpenAI(
    model_name='THUDM/glm-4-9b-chat',
    temperature=0, # 0 gives the most stable output; stability matters most in structured generation
    max_tokens=2000,
    model_kwargs = {
        'frequency_penalty':0,
        'presence_penalty':0,
        'top_p':1.0,
    }
)

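## Example 1: Chinese translator

Goal: given any input text, return `{"translate_result": {"chinese": <translation>}}`.

Structured output generally takes just two steps:

1. Define the schema (the structure you want the LLM to output, together with descriptions and examples);
2. Use a structured-output tool (such as Kor, covered in this post) to get a result that follows the schema.

Kor supports two ways of defining a schema, *Kor schema* and *Pydantic Model*. This example uses a Kor schema.

**Note**: this post does not cover Kor in depth; for details, see the documentation: https://eyurtsev.github.io/kor/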

In [3]:
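# Kor schema: the output format we want
schema = Object(
    id="translate_result",
    description=(
        "任意文本的翻译结果。"  # "Translation result for any text."
    ),
    attributes=[
        Text(
            id="chinese",
            description="中文翻译结果",  # "Chinese translation result"
            examples=[], # Kor supports few-shot examples, but this case is simple enough not to need them
            many=False,
        ),
    ],
    many=False,
)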

In [4]:
# Run the chain and print the result
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
text = "We've trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools."
print(chain.run(text)['data'])

{'translate_result': {'chinese': '我们训练了一个基于GPT-4的模型，称为CriticGPT，用于捕捉ChatGPT代码输出的错误。我们发现，当人们从CriticGPT那里获得帮助来审查ChatGPT代码时，他们比没有帮助的人高出60%的效率。我们正在开始将类似CriticGPT的模型集成到我们的RLHF标记流程中，为我们的训练师提供明确的AI辅助。这是朝着能够评估来自高级AI系统的输出迈出的一步，这些输出在没有更好的工具的情况下很难被人类评估。'}}


**Example 1 runs successfully :)**

Let's print Kor's prompt to see what it looks like.

In [5]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

translate_result: { // 任意文本的翻译结果。
 chinese: string // 中文翻译结果
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: [user input]
Output:
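
The prompt above asks the model to wrap its JSON in `<json>` tags, so the reply has to be unwrapped before it can be used. The following is a minimal sketch of that idea using only the standard library. It is not Kor's actual parsing code; the helper name `extract_json_block` and the sample reply are made up for illustration.

```python
import json
import re

# Matches the first <json>...</json> block, including newlines inside it.
JSON_TAG = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def extract_json_block(raw: str):
    """Return the parsed content of the first <json> block, or None."""
    match = JSON_TAG.search(raw)
    if match is None:
        return None  # the model ignored the tag instruction
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # tags are present but the payload is not valid JSON


# Hypothetical model reply, shaped like the schema in Example 1.
raw = '\n<json>\n{"translate_result": {"chinese": "你好"}}\n</json>'
print(extract_json_block(raw))
```

Returning `None` for both failure modes keeps the sketch short; a real pipeline would probably want to distinguish "no tags" from "invalid JSON" when reporting errors.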


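Next, on to the second example.

## Example 2: Review parsing

Expected result: given a user review, extract the aspect being reviewed (taste, price, etc.), the sentiment polarity (positive, negative, neutral), the sentiment word (tasty, expensive, etc.), and the supporting snippet from the original text.

The first step in structured output is defining the schema. We can use a schema like this: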

```
[
    {
        'aspect': <aspect being reviewed>,
        'sentiment': <sentiment polarity>,
        'sentiment_word': <sentiment word>,
        'reference': <supporting snippet>,
    }
]
```

In this example we define the schema with a *Pydantic Model*. A *Pydantic Model* also supports few-shot examples, with the added benefit that the output can be validated.

In [6]:
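# Pydantic model for review parsing
class Sentiment(enum.Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"

class Dianpin(BaseModel):
    aspect: str = Field(
        description="评价属性"  # "aspect being reviewed"
    )
    sentiment_word: str = Field(
        description='对评价属性的评价词，从原文中抽取',  # "sentiment word for the aspect, extracted from the original text"
    )
    sentiment: Optional[Sentiment] = Field(
        # "sentiment toward the aspect, one of positive/negative/neutral"
        # (use "/" here: a bare "\n" inside the string would be read as a newline)
        description='对评价属性的情感，positive/negative/neutral中的一个',
    )
    reference: str = Field(
        description='评价的原文'  # "original text of the review"
    )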
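
To make the validation benefit concrete, here is a small standalone sketch of what a Pydantic model accepts and rejects. It assumes `pydantic` is installed, as in the setup step. The names `ReviewRecord` and `validate_record` are hypothetical, and the model is a simplified stand-in for the one above, not the code Kor runs internally.

```python
import enum
from typing import Optional

from pydantic import BaseModel, ValidationError


class Sentiment(enum.Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"


class ReviewRecord(BaseModel):
    # Simplified stand-in for the review model (no Field descriptions).
    aspect: str
    sentiment_word: str
    sentiment: Optional[Sentiment] = None
    reference: str


def validate_record(record: dict):
    """Return (model, None) on success or (None, error) on failure."""
    try:
        return ReviewRecord(**record), None
    except ValidationError as exc:
        return None, exc


good = {"aspect": "味道", "sentiment_word": "真不错",
        "sentiment": "positive", "reference": "味道真不错"}
missing_reference = {"aspect": "环境", "sentiment_word": "可以",
                     "sentiment": "positive"}
bad_sentiment = dict(good, sentiment="great")

for record in (good, missing_reference, bad_sentiment):
    model, error = validate_record(record)
    print("valid" if error is None else "invalid")
```

A record that lacks `reference`, or whose `sentiment` is not one of the enum values, is rejected instead of silently flowing into downstream code.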

In [7]:
# Run Kor
schema, validator = from_pydantic(
    Dianpin,
    description='对评价的解析结果',  # "parsed result of the review"
    examples=[],
    many=True # allow a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))

{'data': {},
 'errors': [ParseError('The LLM has returned structured data which does not match the expected schema. Providing additional examples may help improve the parse.')],
 'raw': '\n'
        '<json>\n'
        '[\n'
        '  {\n'
        '    "aspect": "环境",\n'
        '    "sentiment_word": "可以",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "味道",\n'
        '    "sentiment_word": "还不错",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "价格",\n'
        '    "sentiment_word": "小贵",\n'
        '    "sentiment": "negative"\n'
        '  }\n'
        ']\n'
        '</json>',
 'validated_data': {}}


**Note**: the `data` field is empty here, **because the LLM's response does not match the expected schema**; Kor suggests adding examples.
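
Comparing the raw output above with the schema suggests two mismatches: the reply is a bare list rather than an object keyed by `dianpin`, and no record carries a `reference` field. The sketch below checks for exactly those two things using only the standard library. It is a hand-rolled illustration, not Kor's validation logic; the helper `shape_problems` and the `EXPECTED_FIELDS` set are assumptions based on the model's field names.

```python
import json

# Raw payload from the first run, with the <json> tags already removed.
first_response = json.loads("""
[
  {"aspect": "环境", "sentiment_word": "可以", "sentiment": "positive"},
  {"aspect": "味道", "sentiment_word": "还不错", "sentiment": "positive"},
  {"aspect": "价格", "sentiment_word": "小贵", "sentiment": "negative"}
]
""")

# Field names taken from the Pydantic model in this example.
EXPECTED_FIELDS = {"aspect", "sentiment_word", "sentiment", "reference"}


def shape_problems(payload, root_key="dianpin"):
    """List the ways `payload` deviates from {root_key: [record, ...]}."""
    problems = []
    if not isinstance(payload, dict) or root_key not in payload:
        problems.append(f"top level is not an object with key '{root_key}'")
        # Still inspect the records if the model returned a bare list.
        records = payload if isinstance(payload, list) else []
    else:
        records = payload[root_key]
    for index, record in enumerate(records):
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            problems.append(f"record {index} is missing {sorted(missing)}")
    return problems


for problem in shape_problems(first_response):
    print(problem)
```

A check like this is handy when debugging a `ParseError`, because it points at what a few-shot example needs to demonstrate: the wrapping key and the complete set of fields.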

So we add one simple example:

In [8]:
# Run Kor
schema, validator = from_pydantic(
    Dianpin, 
    description='对评价的解析结果', 
    examples=[
        ('味道真不错，下次还来！', [{"aspect":"味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}])
    ],
    many=True # allow a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))

{'data': {'dianpin': [{'aspect': '环境',
                       'reference': '整体来说，环境可以',
                       'sentiment': 'positive',
                       'sentiment_word': '可以'},
                      {'aspect': '味道',
                       'reference': '味道的话也还不错',
                       'sentiment': 'positive',
                       'sentiment_word': '还不错'},
                      {'aspect': '价格',
                       'reference': '但价格有一点小贵',
                       'sentiment': 'negative',
                       'sentiment_word': '小贵'}]},
 'errors': [],
 'raw': '\n'
        '<json>\n'
        '{\n'
        '  "dianpin": [\n'
        '    {\n'
        '      "aspect": "环境",\n'
        '      "sentiment_word": "可以",\n'
        '      "sentiment": "positive",\n'
        '      "reference": "整体来说，环境可以"\n'
        '    },\n'
        '    {\n'
        '      "aspect": "味道",\n'
        '      "sentiment_word": "还不错",\n'
        '      "sentiment": "positive",\n'
        '      "refere

**Example 2 runs successfully too!**

Let's print Kor's prompt again to see what it looks like and how the few-shot examples are used.

In [9]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

dianpin: Array<{ // 对评价的解析结果
 aspect: string // 评价属性
 sentiment_word: string // 对评价属性的评价词，从原文中抽取
 sentiment: "positive" | "negative" | "neutral" // 对评价属性的情感，positive/negative/neutral中的一个
 reference: string // 评价的原文
}>
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: 味道真不错，下次还来！
Output: <json>{"dianpin": [{"aspect": "味道", "sentiment_word": "真不错", "sentiment":