This is the source code of the COLING 2022 paper "CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction". See our paper for more details.
Abstract: Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.
```bash
# Python version == 3.7
git clone https://github.com/cofe-ai/CofeNet.git
cd CofeNet
pip install -r requirements.txt
```
The datasets are stored in the `./res` directory. We provide the two datasets used in our paper, `polnear` and `riqua`. You can store other datasets here for the framework to read.
```
./res
├── polnear
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── riqua
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── others
└── ...
```
If you want to use other datasets, you need to build 5 files for each dataset. The file names must not change:

- `train.txt`, `test.txt`, `valid.txt`: the structured dataset. Each data item is stored on its own line as JSON. The key `"tokens"` holds the sequence of text words, and `"labels"` holds the corresponding sequence of label tags, e.g.:

  ```json
  {"tokens": ["WikiLeaks", "claims", "`", "state", ...], "labels": ["B-source", "B-cue", "B-content", "I-content", ...]}
  ```

- `tag.txt`: the set of `"labels"` used in the dataset.
- `voc.txt`: the token vocabulary for non-pretrained models (e.g., LSTM).
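For reference, here is a minimal Python sketch (not part of the codebase; the file path and the BIO grouping logic simply follow the format shown above) that reads one data item and groups its BIO labels into (component, tokens) spans:

```python
import json

def bio_to_spans(tokens, labels):
    """Group BIO labels into (component, token span) pairs."""
    spans, start, comp = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):            # a new component begins
            if comp is not None:
                spans.append((comp, tokens[start:i]))
            start, comp = i, label[2:]
        elif label.startswith("I-") and comp == label[2:]:
            continue                          # still inside the same component
        else:                                 # "O" or an inconsistent tag ends the span
            if comp is not None:
                spans.append((comp, tokens[start:i]))
            start, comp = None, None
    if comp is not None:                      # close a span that runs to the end
        spans.append((comp, tokens[start:]))
    return spans

# Example: read the first item of the polnear training split.
with open("./res/polnear/train.txt", encoding="utf-8") as f:
    item = json.loads(f.readline())
print(bio_to_spans(item["tokens"], item["labels"]))
# e.g., [('source', ['WikiLeaks']), ('cue', ['claims']), ('content', ['`', 'state', ...])]
```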
Configuration files are stored in the `conf/setting` directory. Below we list the experiment configuration names (`exp_name`) used in the paper so that you can quickly reproduce the experimental results. You can also configure your own experiments here.
| Base Model | Dataset | Base | with CRF | with Cofe | Dataset | Base | with CRF | with Cofe |
|---|---|---|---|---|---|---|---|---|
| Embedding | polnear | pn_emb | pn_emb_crf | pn_emb_cofe | riqua | rq_emb | rq_emb_crf | rq_emb_cofe |
| CNN | polnear | pn_cnn | pn_cnn_crf | pn_cnn_cofe | riqua | rq_cnn | rq_cnn_crf | rq_cnn_cofe |
| GRU | polnear | pn_gru | pn_gru_crf | pn_gru_cofe | riqua | rq_gru | rq_gru_crf | rq_gru_cofe |
| LSTM | polnear | pn_lstm | pn_lstm_crf | pn_lstm_cofe | riqua | rq_lstm | rq_lstm_crf | rq_lstm_cofe |
| BiLSTM | polnear | pn_blstm | pn_blstm_crf | pn_blstm_cofe | riqua | rq_blstm | rq_blstm_crf | rq_blstm_cofe |
| BiLSTM L2 | polnear | pn_blstm2 | pn_blstm2_crf | pn_blstm2_cofe | riqua | rq_blstm2 | rq_blstm2_crf | rq_blstm2_cofe |
| BERT | polnear | pn_bert | pn_bert_crf | pn_bert_cofe | riqua | rq_bert | rq_bert_crf | rq_bert_cofe |
| BERT-CNN | polnear | pn_bert_cnn | | | riqua | rq_bert_cnn | | |
| BERT-LSTM | polnear | pn_bert_lstm | | | riqua | rq_bert_lstm | | |
| BERT-BiLSTM | polnear | pn_bert_blstm | pn_bert_blstm_crf | | riqua | rq_bert_blstm | rq_bert_blstm_crf | |
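To check which experiment configurations are available locally, you can list the files under `conf/setting`. This is a minimal sketch; it assumes each configuration is a single file named after its `exp_name`, which may differ from the repository's actual layout:

```python
import os

# List available experiment configurations (assumes one file per exp_name
# under conf/setting; adjust if the repository organizes them differently).
for filename in sorted(os.listdir("conf/setting")):
    exp_name, _ = os.path.splitext(filename)
    print(exp_name)
```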
Download the trained models and save them in `./conf/models`. You can then reproduce our results by following the Evaluate step below.
(a) Run the code
```bash
# Cofe for polnear
python run_train.py --exp_name pn_emb_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_cnn_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_gru_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_lstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm2_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_bert_cofe --trn_name v1 --eval_per_step 500 --max_epoch 6 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

# Cofe for riqua
python run_train.py --exp_name rq_emb_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_cnn_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_gru_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_lstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm2_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_bert_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0
```
(b) Check the logs

You can find log files in `./log`. For each experiment you will find these files:

- Parameter configuration (e.g., `pn_bert_cofe_v1_20221101_040732.json`)
- Training log (e.g., `pn_bert_cofe_v1_20221101_040732.txt`)
- TensorBoard files (e.g., `pn_bert_cofe_v1_20221101_040732/`)

In this example, `pn_bert_cofe_v1_20221101_040732` is the unique name of the experiment. It consists of the experiment name (`pn_bert_cofe`), the training version (`v1`), and the training start time (`20221101_040732`).
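If you need to group or filter runs programmatically, the run name can be split back into its parts. This is a minimal sketch based only on the naming scheme described above:

```python
def parse_run_name(run_name):
    """Split a run name like 'pn_bert_cofe_v1_20221101_040732' into
    (exp_name, version, start_time), following the naming scheme above."""
    parts = run_name.split("_")
    start_time = "_".join(parts[-2:])   # e.g., '20221101_040732'
    version = parts[-3]                 # e.g., 'v1'
    exp_name = "_".join(parts[:-3])     # e.g., 'pn_bert_cofe'
    return exp_name, version, start_time

print(parse_run_name("pn_bert_cofe_v1_20221101_040732"))
# ('pn_bert_cofe', 'v1', '20221101_040732')
```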
(c) Run TensorBoard

```bash
tensorboard --bind_all --port 9900 --logdir ./log
```
Run the following commands to print the experimental results of the trained models.
```bash
# Cofe for polnear
python run_eval.py --exp_name pn_emb_cofe --gpu 0
python run_eval.py --exp_name pn_cnn_cofe --gpu 0
python run_eval.py --exp_name pn_gru_cofe --gpu 0
python run_eval.py --exp_name pn_lstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm2_cofe --gpu 0
python run_eval.py --exp_name pn_bert_cofe --gpu 0

# Cofe for riqua
python run_eval.py --exp_name rq_emb_cofe --gpu 0
python run_eval.py --exp_name rq_cnn_cofe --gpu 0
python run_eval.py --exp_name rq_gru_cofe --gpu 0
python run_eval.py --exp_name rq_lstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm2_cofe --gpu 0
python run_eval.py --exp_name rq_bert_cofe --gpu 0
```
Detailed CofeNet experimental results are available here.
If this code helps you, please cite the following paper.
```bibtex
@inproceedings{wang-etal-2022-cofenet,
    title = "{C}ofe{N}et: Context and Former-Label Enhanced Net for Complicated Quotation Extraction",
    author = "Wang, Yequan and
      Li, Xiang and
      Sun, Aixin and
      Meng, Xuying and
      Liao, Huaming and
      Guo, Jiafeng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.215",
    pages = "2438--2449",
    abstract = "Quotation extraction aims to extract quotations from written text. There are three components in a quotation: \textit{source} refers to the holder of the quotation, \textit{cue} is the trigger word(s), and \textit{content} is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the \textbf{Co}ntext and \textbf{F}ormer-Label \textbf{E}nhanced \textbf{Net} (\textbf{CofeNet}) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.",
}
```