#### **1. Import Modules**

Import modules from the Python standard library, third-party packages, and this project's own code.

In [1]:
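# Standard library
import os
import sys
from collections import defaultdict

# Third-party packages
# pip install tqdm
from tqdm import tqdm

# Add the parent directory to the module search path
# so that the project's own modules can be imported
sys.path.append(os.path.abspath('..'))

# Project modules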
from src.utils import load_data
from src.llm_client import LLMClient

#### **2. Load the Corpus**

Load the parallel corpus, which is stored in TSV format.

In [2]:
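# Path to the corpus directory
data_dir = '../data/raw'

# Load the TSV corpus
# The limit parameter caps the number of parallel sentence pairs
# The data is stored in a DataFrame
data = load_data(data_dir, limit=10)
print(f'Loaded {len(data)} parallel sentence pairs')

# Preview the data
# Column 1: Chinese source text, from 《鹿鼎记》
# Column 2: English translation, from The Deer and the Cauldron (translated by John Minford)
print('First 5 rows:')
data.head()

Loaded 10 parallel sentence pairs
First 5 rows: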


Unnamed: 0,source,target
1,江南近海滨的一条大路上，一队清兵手执刀枪，押着七辆囚车，冲风冒寒，向北而行。,Along a coastal road somewhere south of the Ya...
2,前面三辆囚车中分别监禁的是三个男子，都作书生打扮，一个是白发老者，两个是中年人。,In each of the first three carts a single male...
3,后面四辆囚车中坐的是女子，最后一辆囚车中是个少妇，怀中抱着个女婴。,"The four rear carts were occupied by women, th..."
4,女婴啼哭不休。 她母亲温言相呵，女婴只是大哭。,The little girl was crying in a continuous wai...
5,囚车旁一清兵恼了，伸腿在车上踢了一脚，喝道：“再哭，再哭！,"One of the soldiers marching alongside, irrita..."
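
The project's `load_data` is not shown in this notebook. As a rough illustration of how a TSV file with a header row and a row limit can be read, here is a minimal standard-library sketch. The function name `read_tsv_pairs` and the placeholder rows are assumptions made for this example, and unlike `load_data` it returns a list of dicts rather than a DataFrame.

```python
import csv
import io
from itertools import islice


def read_tsv_pairs(handle, limit=None):
    """Read up to `limit` rows from a tab-separated file with a header row.

    Hypothetical helper for illustration; not the project's load_data.
    """
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    # islice(reader, None) reads every row; islice(reader, n) stops after n rows
    return list(islice(reader, limit))


# Placeholder rows standing in for a real corpus file
sample = io.StringIO(
    'source\ttarget\n'
    '原文一\tTarget one\n'
    '原文二\t"Target two"\n'
)
pairs = read_tsv_pairs(sample, limit=1)
print(pairs)
```

Disabling quote handling is a deliberate choice here: literary text often contains unbalanced quotation marks, which a quote-aware parser could misread as field delimiters.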


In [3]:
# Extract the data for a given row

# First row
row_id = 1

# Source text
src_lang = 'Chinese'
src_text = data.iloc[row_id - 1]['source']

# Target text
tgt_lang = 'English'
tgt_text = data.iloc[row_id - 1]['target']

print(f'[ID]: {row_id:05d}')
print(f'{"-" * 80}')
print(f'[{src_lang}]: {src_text}')
print(f'[{tgt_lang}]: {tgt_text}')

[ID]: 00001
--------------------------------------------------------------------------------
[Chinese]: 江南近海滨的一条大路上，一队清兵手执刀枪，押着七辆囚车，冲风冒寒，向北而行。
[English]: Along a coastal road somewhere south of the Yangtze River, a detachment of soldiers, each of them armed with a halberd, was escorting a line of seven prison carts, trudging northwards in the teeth of a bitter wind.


#### **3. Load the Model**

Set up the client for the LLM API.

In [4]:
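# Before loading the model, log in to the Alibaba Cloud Bailian platform: https://bailian.console.aliyun.com/
# Apply for an API key for the LLM service
# and set LLM_API_KEY=sk-******** in the config file

# Newly registered users can call the APIs of some models for free
# After logging in, the list of free models is shown on the model service page

# Model name
# Available models include:
# qwen-flash, qwen-plus, qwen3-max, glm-4.7, deepseek-v3.2
model = "deepseek-v3.2"

# Initialize the LLM API client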
client = LLMClient(model)
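
How `LLMClient` reads the `LLM_API_KEY` setting is not shown in this notebook. The sketch below shows one plausible way to parse `KEY=VALUE` lines from a config file's contents. The function name `parse_config` and the example contents are assumptions for illustration only.

```python
def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper; the project's real config loading may differ.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # partition splits on the first '=' only, so values may contain '='
        key, sep, value = line.partition('=')
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Example contents with the masked key from the comment above
example = "# hypothetical config contents\nLLM_API_KEY=sk-********\n"
settings = parse_config(example)
api_key = settings.get('LLM_API_KEY', '')
print(api_key.startswith('sk-'))
```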

In [5]:
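# === Note ===
# To save on API costs,
# model outputs are stored in a local cache at data/llm_cache
# After the first call, repeated calls only read the result from the local database

# To test whether the API connection works,
# modify the prompt and run the annotation again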
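
The caching inside `LLMClient` is not shown here. One common design, sketched below, keys each stored response on a hash of the model name and the prompt. Under that assumption, a changed prompt misses the cache and triggers a real API call, which is why editing the prompt can serve as a connection test. The class name `ResponseCache`, the table layout, and the `fake_model` stand-in are all hypothetical.

```python
import hashlib
import sqlite3


class ResponseCache:
    """A tiny prompt-to-response cache backed by SQLite (illustrative only)."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)'
        )

    @staticmethod
    def make_key(model, prompt):
        # The key depends on both the model name and the exact prompt text
        return hashlib.sha256(f'{model}\n{prompt}'.encode('utf-8')).hexdigest()

    def get_or_call(self, model, prompt, call):
        key = self.make_key(model, prompt)
        row = self.conn.execute(
            'SELECT response FROM cache WHERE key = ?', (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no API call is made
        response = call(prompt)  # cache miss: call the model once
        self.conn.execute(
            'INSERT INTO cache (key, response) VALUES (?, ?)', (key, response)
        )
        self.conn.commit()
        return response


calls = []


def fake_model(prompt):
    """Stand-in for a real API call; records every invocation."""
    calls.append(prompt)
    return f'echo: {prompt}'


cache = ResponseCache()
first = cache.get_or_call('demo-model', 'hello', fake_model)
second = cache.get_or_call('demo-model', 'hello', fake_model)  # served from cache
third = cache.get_or_call('demo-model', 'hello again', fake_model)  # new prompt
print(len(calls))
```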

#### **4. Prompt Templates**

Write reusable prompt templates whose variables are filled in automatically.

In [6]:
# POS-tagging prompt template v1

# {lang} and {text} are placeholders
# that are filled in dynamically with the language and the text

prompt_tmpl_1 = """
You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for {lang} text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: {text}
"""

In [7]:
# Build the Chinese POS-tagging prompt (v1)
src_prompt_1 = prompt_tmpl_1.format(
    lang=src_lang,
    text=src_text,
)
print(f'[Chinese POS-tagging prompt v1]\n{src_prompt_1}')

# Call the LLM API
# to tag the Chinese text

text_result = client.get_text_response(
    prompt=src_prompt_1,
)

# Print the model's tagging result
print(f'{"-" * 80}\n')
print(f'[Model output]\n\n{text_result}')

[Chinese POS-tagging prompt v1]

You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for Chinese text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: 江南近海滨的一条大路上，一队清兵手执刀枪，押着七辆囚车，冲风冒寒，向北而行。

--------------------------------------------------------------------------------

[Model output]

江南/ns 近/v 海滨/n 的/u 一/m 条/q 大路/n 上/f ，/w 一/m 队/q 清兵/n 手执/v 刀枪/n ，/w 押/v 着/u 七/m 辆/q 囚车/n ，/w 冲风冒寒/v ，/w 向/p 北/f 而/c 行/v 。/w
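
Here the model happened to answer in a `token/tag` format, which can be parsed with a few lines of code. The sketch below uses a hypothetical helper, `parse_slash_tagged`, on the first four items of the output above. Nothing in prompt v1 fixes the output format, so a parser like this only works when the model chooses this layout.

```python
def parse_slash_tagged(line):
    """Split 'token/tag token/tag ...' into (token, tag) pairs.

    Hypothetical helper; assumes items are separated by whitespace.
    """
    pairs = []
    for item in line.split():
        # rpartition splits on the last '/', so a token may itself contain '/'
        token, sep, tag = item.rpartition('/')
        if not sep or not token or not tag:
            raise ValueError(f'not in token/tag form: {item!r}')
        pairs.append((token, tag))
    return pairs


line = '江南/ns 近/v 海滨/n 的/u'
pairs = parse_slash_tagged(line)
print(pairs)
```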


In [8]:
# Build the English POS-tagging prompt (v1)
tgt_prompt_1 = prompt_tmpl_1.format(
    lang=tgt_lang,
    text=tgt_text,
)
print(f'[English POS-tagging prompt v1]\n{tgt_prompt_1}')

# Call the LLM API
# to tag the English text
text_result = client.get_text_response(
    prompt=tgt_prompt_1,
)

# Print the model's tagging result
print(f'{"-" * 80}\n')
print(f'[Model output]\n\n{text_result}')

[English POS-tagging prompt v1]

You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for English text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: Along a coastal road somewhere south of the Yangtze River, a detachment of soldiers, each of them armed with a halberd, was escorting a line of seven prison carts, trudging northwards in the teeth of a bitter wind.

--------------------------------------------------------------------------------

[Model output]

**Tokenization & Part-of-Speech Tagging**

1. **Along** – ADP (preposition/subordinating conjunction)  
2. **a** – DET (determiner)  
3. **coastal** – ADJ (adjective)  
4. **road** – NOUN (noun)  
5. **somewhere** – ADV (adverb)  
6. **south** – ADV (adverb)  
7. **of** – ADP (preposition/subordinating conjunction)  
8. **the** – DET (determiner)  
9. **Yangtze** – PROPN (proper noun)  
10. **River** – PROPN (proper noun)  
11. **,** – PUNCT (pu

#### **5. Structured Output**

Ask the model to return its result in JSON format.

In [9]:
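# POS-tagging prompt template v2

# {lang} and {text} are placeholders
# that are filled in dynamically with the language and the text
# The template also specifies JSON as the output format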

prompt_tmpl_2 = """
You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for {lang} text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: {text}

Output format:
Return output in JSON format with the following fields:
- tokens: List of tokens (words and punctuation)
- pos_tags: List of POS tags (must correspond one-to-one with tokens)
"""

In [10]:
# Build the Chinese POS-tagging prompt (v2)

src_prompt_2 = prompt_tmpl_2.format(
    lang=src_lang,
    text=src_text,
)
print(f'[Chinese POS-tagging prompt v2]\n{src_prompt_2}')

# Call the LLM API
# to tag the Chinese text

json_result = client.get_json_response(
    prompt=src_prompt_2,
)

# Print the model's tagging result
print(f'{"-" * 80}\n')
print(f'[Model output]\n\n{json_result}')

[Chinese POS-tagging prompt v2]

You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for Chinese text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: 江南近海滨的一条大路上，一队清兵手执刀枪，押着七辆囚车，冲风冒寒，向北而行。

Output format:
Return output in JSON format with the following fields:
- tokens: List of tokens (words and punctuation)
- pos_tags: List of POS tags (must correspond one-to-one with tokens)

--------------------------------------------------------------------------------

[Model output]

{'tokens': ['江南', '近', '海滨', '的', '一条', '大路', '上', '，', '一队', '清兵', '手执', '刀枪', '，', '押着', '七辆', '囚车', '，', '冲风', '冒寒', '，', '向北', '而', '行', '。'], 'pos_tags': ['NR', 'VV', 'NN', 'DEG', 'CD', 'NN', 'LC', 'PU', 'CD', 'NN', 'VV', 'NN', 'PU', 'VV', 'CD', 'NN', 'PU', 'VV', 'VV', 'PU', 'VV', 'CC', 'VV', 'PU']}
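
The prompt asks for a one-to-one correspondence between `tokens` and `pos_tags`, but the model is not guaranteed to comply, so it can be worth checking the parsed result before using it. The sketch below shows one way to do so. The helper `validate_annotation` is hypothetical, it assumes the result is a dict, and the sample data reuses the last four tokens of the output above.

```python
def validate_annotation(anno, text=None):
    """Return a list of problems in a {'tokens': [...], 'pos_tags': [...]} dict.

    Hypothetical helper; an empty list means no problem was found.
    """
    tokens = anno.get('tokens')
    tags = anno.get('pos_tags')
    if not isinstance(tokens, list) or not isinstance(tags, list):
        return ['tokens and pos_tags must both be lists']

    problems = []
    if len(tokens) != len(tags):
        problems.append(f'{len(tokens)} tokens but {len(tags)} tags')
    # Compare with whitespace removed, so the check works for both
    # Chinese (no spaces) and English (space-separated) text
    if text is not None and ''.join(tokens) != ''.join(text.split()):
        problems.append('tokens do not reproduce the text')
    return problems


good = {'tokens': ['向北', '而', '行', '。'], 'pos_tags': ['VV', 'CC', 'VV', 'PU']}
bad = {'tokens': ['向北', '而', '行', '。'], 'pos_tags': ['VV', 'CC', 'VV']}
print(validate_annotation(good, text='向北而行。'))
print(validate_annotation(bad, text='向北而行。'))
```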


In [11]:
# Build the English POS-tagging prompt (v2)

tgt_prompt_2 = prompt_tmpl_2.format(
    lang=tgt_lang,
    text=tgt_text,
)
print(f'[English POS-tagging prompt v2]\n{tgt_prompt_2}')

# Call the LLM API
# to tag the English text

json_result = client.get_json_response(
    prompt=tgt_prompt_2,
)

# Print the model's tagging result
print(f'{"-" * 80}\n')
print(f'[Model output]\n\n{json_result}')

[English POS-tagging prompt v2]

You are a professional corpus linguist specialized in Part-of-Speech (POS) tagging for English text.

Your task is to perform POS tagging on the given text.
First tokenize the text, then assign a POS tag to each token.

text: Along a coastal road somewhere south of the Yangtze River, a detachment of soldiers, each of them armed with a halberd, was escorting a line of seven prison carts, trudging northwards in the teeth of a bitter wind.

Output format:
Return output in JSON format with the following fields:
- tokens: List of tokens (words and punctuation)
- pos_tags: List of POS tags (must correspond one-to-one with tokens)

--------------------------------------------------------------------------------

[Model output]

{'tokens': ['Along', 'a', 'coastal', 'road', 'somewhere', 'south', 'of', 'the', 'Yangtze', 'River', ',', 'a', 'detachment', 'of', 'soldiers', ',', 'each', 'of', 'them', 'armed', 'with', 'a', 'halberd', ',', 'was', 'escorting', 'a', 'line', 'of', 'seven', 'pr

#### **6. Batch Annotation**

Tag the parts of speech of all Chinese-English parallel sentence pairs in one pass.

In [12]:
annos = []

# Iterate over all rows
for row_id, row in tqdm(data.iterrows(), total=len(data), desc="Tagging"):
    # if row_id > 1: break
    try:
        record = defaultdict(lambda: defaultdict(dict))
        record['id'] = f'{row_id:05d}'

        # Source and target text
        src_text = row['source']
        tgt_text = row['target']
        record['source']['text'] = src_text
        record['target']['text'] = tgt_text

        # === Tag the source text ===

        # Build the prompt for the source text
        src_prompt = prompt_tmpl_2.format(
            lang=src_lang,
            text=src_text,
        )
        # Call the LLM API to tag the source text
        src_anno = client.get_json_response(
            prompt=src_prompt,
        )

        # === Tag the target text ===

        # Build the prompt for the target text
        tgt_prompt = prompt_tmpl_2.format(
            lang=tgt_lang,
            text=tgt_text,
        )
        # Call the LLM API to tag the target text
        tgt_anno = client.get_json_response(
            prompt=tgt_prompt,
        )

        # Store the tagging results
        record['source']['tok'] = src_anno['tokens']
        record['source']['pos'] = src_anno['pos_tags']
        record['target']['tok'] = tgt_anno['tokens']
        record['target']['pos'] = tgt_anno['pos_tags']
        annos.append(record)
    except Exception as e:
        # Report the failing row and move on to the next one
        print(f'Error at index {row_id}: {e}')
        continue

Tagging: 100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<?, ?it/s]
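
The loop above skips a row as soon as any call fails, so a transient network or rate-limit error costs that row's annotation. One option, sketched below, is to retry a failing call a few times before giving up. The helper `call_with_retry` and the `flaky` stand-in are hypothetical and not part of the project code.

```python
import time


def call_with_retry(func, retries=3, delay=1.0):
    """Call func(); on failure wait and retry, re-raising after the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)


attempts = []


def flaky():
    """Stand-in for an API call that fails twice and then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError('temporary failure')
    return {'tokens': [], 'pos_tags': []}


result = call_with_retry(flaky, retries=3, delay=0)
print(len(attempts), result)
```

Inside the loop, a call could then be wrapped as `call_with_retry(lambda: client.get_json_response(prompt=src_prompt))`, with the surrounding `try`/`except` still catching the error if every attempt fails.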


In [13]:
# Print the annotation results

for anno in annos:
    # Use anno_id rather than id to avoid shadowing the built-in id()
    anno_id = anno['id']
    print(f'[ID]: {anno_id}')
    print(f'{"-" * 80}')

    # Source text with its (token, tag) pairs
    src_text = anno['source']['text']
    src_tok = anno['source']['tok']
    src_pos = anno['source']['pos']
    src_tags = list(zip(src_tok, src_pos))
    print(f'{src_text}\n')
    print(f'{src_tags}')
    print(f'{"-" * 80}')

    # Target text with its (token, tag) pairs
    tgt_text = anno['target']['text']
    tgt_tok = anno['target']['tok']
    tgt_pos = anno['target']['pos']
    tgt_tags = list(zip(tgt_tok, tgt_pos))
    print(f'{tgt_text}\n')
    print(f'{tgt_tags}')
    print(f'{"=" * 80}')

[ID]: 00001
--------------------------------------------------------------------------------
江南近海滨的一条大路上，一队清兵手执刀枪，押着七辆囚车，冲风冒寒，向北而行。

[('江南', 'NR'), ('近', 'VV'), ('海滨', 'NN'), ('的', 'DEG'), ('一条', 'CD'), ('大路', 'NN'), ('上', 'LC'), ('，', 'PU'), ('一队', 'CD'), ('清兵', 'NN'), ('手执', 'VV'), ('刀枪', 'NN'), ('，', 'PU'), ('押着', 'VV'), ('七辆', 'CD'), ('囚车', 'NN'), ('，', 'PU'), ('冲风', 'VV'), ('冒寒', 'VV'), ('，', 'PU'), ('向北', 'VV'), ('而', 'CC'), ('行', 'VV'), ('。', 'PU')]
--------------------------------------------------------------------------------
Along a coastal road somewhere south of the Yangtze River, a detachment of soldiers, each of them armed with a halberd, was escorting a line of seven prison carts, trudging northwards in the teeth of a bitter wind.

[('Along', 'IN'), ('a', 'DT'), ('coastal', 'JJ'), ('road', 'NN'), ('somewhere', 'RB'), ('south', 'RB'), ('of', 'IN'), ('the', 'DT'), ('Yangtze', 'NNP'), ('River', 'NNP'), (',', ','), ('a', 'DT'), ('detachment', 'NN'), ('of', 'IN'), ('soldiers