Oracle Theme

Theme classification of oracle bone inscriptions.

Models:

LSTM
BERT-Chinese
JiaguTextBERT (宋晨阳)

Data

Place the train, dev, and test data in the same directory, for example:

./my/data/path/train.json
./my/data/path/dev.json
./my/data/path/test.json

Then set data_dir on line 20 of train.py to the directory that contains the data:

# ...
data_dir = Path('./my/data/path')
# ...

Data format:

[
  {
    "book_name": "H26018",
    "text": "鼎（貞）：翼（翌）…其冓（遘）…歲于■…",
    "theme": "來時 / 祭祀",
    "oracle_text": " 嗀瓠…磦昅…厺枂… "
  },
  {
    "book_name": "H27996",
    "text": " 庚…； 叀（惠）用沙于止（翦）方，不雉眾。； 戍※方戍。； 弗（翦）。； 戍…（翦）。",
    "theme": "問捷 / 方國",
    "oracle_text": " 域…譫笘 "
  },
  // ...
]

The theme field is the label.

The oracle_text field is currently ignored.
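A minimal sketch of reading one split in this format. It assumes the file is plain JSON (the `// ...` in the sample is only an elision) and that several themes in one theme string are separated by "/", as the samples suggest. `load_examples` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path


def load_examples(path):
    """Read one split and return (text, themes) pairs; oracle_text is ignored."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    examples = []
    for rec in records:
        # Assumption: "/" separates multiple themes, as in the samples above.
        themes = [t.strip() for t in rec["theme"].split("/") if t.strip()]
        examples.append((rec["text"].strip(), themes))
    return examples


if __name__ == "__main__":
    # Same placeholder directory as in the example above.
    train_file = Path("./my/data/path") / "train.json"
    if train_file.exists():
        train = load_examples(train_file)
        print(len(train), "training examples")
```

Whether the training code actually treats a string such as "來時 / 祭祀" as two labels is not stated here, so the split on "/" should be checked against dataset.py before relying on it.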

Labels

The raw data is labeled manually, so the theme labeling schema is not very strict. We post-process the labels by merging similar labels. This results in 28 themes:

[
  '災難', '祭祀', '奴隸主貴族', '卜法', '氣象', '時間', 
  '地名', '漁獵、畜牧', '人名', '方域', '戰爭', '吉凶、夢幻', 
  '鬼神崇拜', '刑法', '死喪', '農業', '音樂', '飲食', 
  '生育', '官吏', '居住', '貢納', '疾病', '文字', 
  '軍隊', '天文', '奴隸和平民', '交通'
]

The mapping from merged themes to original themes is stored in data/label_clusters.json.
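The merge step can be sketched as a lookup from original label to merged theme. This assumes data/label_clusters.json is a JSON object of the form {merged theme: [original themes]}; the actual layout of the file is not shown here. The helper names and the toy clusters below are hypothetical and for illustration only.

```python
import json


def load_clusters(path="data/label_clusters.json"):
    """Load the cluster file; assumed layout is {merged: [original, ...]}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def invert_clusters(clusters):
    """Turn {merged: [original, ...]} into {original: merged}."""
    original_to_merged = {}
    for merged, originals in clusters.items():
        for original in originals:
            original_to_merged[original] = merged
    return original_to_merged


def merge_labels(raw_labels, original_to_merged):
    """Map raw labels to merged themes, dropping duplicates and unknown labels."""
    merged = []
    for label in raw_labels:
        theme = original_to_merged.get(label)
        if theme is not None and theme not in merged:
            merged.append(theme)
    return merged


# Toy clusters for illustration only; they are not taken from the real file.
toy_clusters = {"祭祀": ["祭祀", "祭品"], "時間": ["來時"]}
lookup = invert_clusters(toy_clusters)
print(merge_labels(["來時", "祭祀", "祭品"], lookup))  # ['時間', '祭祀']
```

Dropping unknown labels silently is a design choice of this sketch; raising an error instead would make gaps in the cluster file easier to spot.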

Execution

Training

LSTM:

python3 train_lstm.py

BERT and RBT:

python3 train.py

Testing

Same as above. The test code will be split out into a separate script in the future.

Result

Model	Acc.	Micro-F1	Macro-F1
LSTM	15.43	20.05	9.21
BiLSTM	23.05	25.93	-
RBT3	81.00	75.76	42.37
BERT (Rand-init)	-	-	-
BERT (Pretrained)	79.10	-	-

F1 of RBT3 on each theme sorted by example count (descending):

F1 is very poor on the less frequent themes, which is consistent with RBT3's Macro-F1 (42.37) being far below its Micro-F1 (75.76).
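To illustrate why rare themes pull Macro-F1 down while Micro-F1 stays high, here is a small self-contained sketch. It uses a single-label simplification and toy data, and `f1_scores` is a hypothetical helper; the repository's own scoring code may compute these metrics differently.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[t] += 1  # true label t was missed
    labels = sorted(set(y_true) | set(y_pred))

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data: 9 examples of a frequent theme, 1 of a rare theme that is always missed.
y_true = ["祭祀"] * 9 + ["交通"]
y_pred = ["祭祀"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.90 macro=0.47
```

A single missed rare theme contributes an F1 of 0 to the macro average with the same weight as the frequent theme, which is the effect described above.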

NER using BiLSTM

Data

Data should be placed in the ner directory and stored in CSV format, for example:

train set: ner/train.csv
dev set: ner/dev.csv
test set: ner/test.csv

Training

python3 train_ner_lstm.py
