This is the code repository for the paper 《基于多层级语言特征融合的中文文本可读性分级模型》 (Journal of Chinese Information Processing / 中文信息学报, 2024, to appear).
A preprint version of the paper is available at this link: [基于多层级语言特征融合的中文文本可读性分级模型-Preprint]
If you find this work useful for your research, please cite our paper:
```bibtex
@article{谭可人2024基于多层级语言特征融合的中文文本可读性分级模型,
  title={基于多层级语言特征融合的中文文本可读性分级模型},
  author={谭可人 and 兰韵诗 and 张杨 and 丁安琪},
  journal={中文信息学报},
  volume={-},
  number={-},
  pages={-},
  year={2024}
}
```
This project uses a Python 3.7 environment. Run the following command to install the required libraries:

```bash
pip install -r requirements.txt
```
The dataset comes from the Chinese Text Readability Grading Dataset [CTRDG].

- Training set: `hsk_all/data/train.txt`
- Validation set: `hsk_all/data/dev.txt`
- Test set: `hsk_all/data/test.txt`
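For reference, a minimal loader for these split files might look like the sketch below. The one-sample-per-line, `text<TAB>label` layout is an assumption about the file format, not documented behavior; adjust the parsing to match the actual files.

```python
def load_split(path):
    """Load one data split; assumes each line is `text<TAB>label` (unverified)."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            text, label = line.rsplit("\t", 1)  # split on the last tab
            samples.append((text, int(label)))
    return samples

train = load_split("hsk_all/data/train.txt")
dev = load_split("hsk_all/data/dev.txt")
test = load_split("hsk_all/data/test.txt")
```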
Download the Chinese BERT pretrained model from [BERT] and place its files as follows:

- Configuration file: `bert_pretrain/config.json`
- Weights file: `bert_pretrain/pytorch_model.bin`
- Vocabulary file: `bert_pretrain/vocab.txt`

The configuration file extends the standard BERT config with the MLF-BERT fusion switches:
```json
{
    (standard BERT pretrained-model configuration)
    ......
    "with_linguistic_information_embedding_layer": "True",
    "with_character_level_embedding_layer": "True",
    "with_word_level_embedding_layer": "True",
    "with_grammar_level_embedding_layer": "False",
    "character_level_size_embedding_layer": 8,
    "word_level_size_embedding_layer": 8,
    "grammar_level_size_embedding_layer": 8,
    "with_linguistic_information_selfattention_layer": "True",
    "linguistic_information_selfattention_layer_num": 7,
    "with_character_level_selfattention_layer": "False",
    "with_word_level_selfattention_layer": "False",
    "with_grammar_level_selfattention_layer": "True",
    "character_level_hp_selfattention_layer": 1,
    "word_level_hp_selfattention_layer": 1,
    "grammar_level_hp_selfattention_layer": 1,
    "level_with_nnembedding": "False",
    "add_begin_attention_layer": 0
}
```
Note:

- `with_linguistic_information_embedding_layer`: whether to add linguistic features in the BERT embedding layer.
- `with_character/word/grammar_level_embedding_layer`: whether to add character-, word-, or grammar-level linguistic features in the BERT embedding layer.
- `character/word/grammar_level_size_embedding_layer`: number of levels for the character/word/grammar features (the syllabus defines 7 levels; one extra dimension is added for unmatched items, giving 8).
- `with_linguistic_information_selfattention_layer`: whether to add linguistic features in the BERT self-attention layers.
- `linguistic_information_selfattention_layer_num`: number of BERT self-attention layers to which linguistic features are added.
- `with_character/word/grammar_level_selfattention_layer`: whether to add character-, word-, or grammar-level features in the BERT self-attention layers.
- `character/word/grammar_level_hp_selfattention_layer`: weight hyperparameter for the character/word/grammar features in the self-attention layers.
- `level_with_nnembedding`: whether to apply an embedding to the feature levels in the self-attention layers.
- `add_begin_attention_layer`: index of the first BERT self-attention layer at which linguistic features are added.
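To make the embedding-layer options concrete, the sketch below shows the general idea of fusing a level feature into BERT's token embeddings: each token is tagged with a syllabus level id in [0, 7] (0 = unmatched, matching the size of 8 above), and a learned embedding of that id is added to the token embedding. This is an illustrative simplification, not the repository's actual implementation; the class and argument names here are made up for the example.

```python
import torch.nn as nn

class LevelFusionEmbedding(nn.Module):
    """Add a learned linguistic-level embedding to BERT token embeddings.

    num_levels=8 mirrors character_level_size_embedding_layer: 7 syllabus
    levels plus one id (0) reserved for tokens with no syllabus match.
    """
    def __init__(self, bert_embeddings, hidden_size=768, num_levels=8):
        super().__init__()
        self.bert_embeddings = bert_embeddings  # e.g. a standard BertEmbeddings module
        self.level_embedding = nn.Embedding(num_levels, hidden_size)

    def forward(self, input_ids, level_ids):
        # level_ids: (batch, seq_len) ints in [0, 7], aligned with input_ids
        token_emb = self.bert_embeddings(input_ids=input_ids)  # (batch, seq_len, hidden)
        return token_emb + self.level_embedding(level_ids)
```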
Keep the following two fields in the training configuration consistent with `bert_pretrain/config.json`:

```python
self.with_linguistic_information_embedding_layer = True      # fuse linguistic features in the BERT embedding layer
self.with_linguistic_information_selfattention_layer = True  # fuse linguistic features in the BERT self-attention layers
```
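A small runtime guard can catch drift between the two sources. The sketch below assumes the training configuration is an object (as in the snippet above) whose attribute names match the JSON keys, and that the JSON stores these flags as the strings "True"/"False", as shown earlier.

```python
import json

def check_config_consistency(config, json_path="bert_pretrain/config.json"):
    """Assert the Python config flags agree with bert_pretrain/config.json."""
    with open(json_path, encoding="utf-8") as f:
        pretrained = json.load(f)
    for key in ("with_linguistic_information_embedding_layer",
                "with_linguistic_information_selfattention_layer"):
        json_flag = pretrained[key] == "True"  # flags are stored as strings
        assert getattr(config, key) == json_flag, f"{key} mismatch with config.json"
```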
Train the model with:

```bash
python run.py
```
Predict the readability level of input texts with:

```bash
python predict.py
```

For the LLM evaluation, run:

```bash
python llm_evaluation/MLF-BERT_llm_evaluate.py
```
The trained MLF-BERT model can be downloaded from this link: [MLF-BERT]. Results on the test set:
| Model | Test Acc (%) | Test F1 (%) |
|---|---|---|
| BERT | 91.10 | 90.97 |
| MLF-BERT | 94.24 | 93.96 |