MTGNN-SUM

This repository contains the implementation of our paper "Multi Graph Neural Network for Extractive Long Document Summarization".

Installation

The code requires Python 3.6 or later. Its dependencies are listed in requirements.txt and can be installed with:

pip install -r requirements.txt

Datasets

Download the PubMed and arXiv datasets from here.

Preprocess data

For the PubMed dataset:

python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task train
python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task val
python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task test

For the arXiv dataset:

python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task train
python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task val
python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task test
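
The six commands above differ only in the dataset name and the split, so they can be generated in a loop. The following is a minimal sketch, assuming the script is launched from the project directory with exactly the flags shown above. The helper names `build_preprocess_commands` and `run_preprocessing` are hypothetical and not part of the repository.

```python
import subprocess
import sys

DATASETS = ("pubmed", "arxiv")
SPLITS = ("train", "val", "test")


def build_preprocess_commands(datasets=DATASETS, splits=SPLITS):
    """Return one argument list per (dataset, split) pair."""
    commands = []
    for name in datasets:
        for split in splits:
            commands.append([
                sys.executable, "preprocess_data.py",
                "--input_path", f"dataset/{name}-dataset",
                "--output_path", f"dataset/{name}",
                "--task", split,
            ])
    return commands


def run_preprocessing(dry_run=True):
    """Print every command; execute them only when dry_run is False."""
    for cmd in build_preprocess_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first split that fails.
            subprocess.run(cmd, check=True)
```

Calling `run_preprocessing()` only prints the commands, which makes it easy to compare them with the list above before running `run_preprocessing(dry_run=False)`.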

Once the data is in the standard JSON format, process the dataset by running sh PrepareDataset.sh from the project directory. The processed files are written to the cache directory.
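
Before running PrepareDataset.sh, it can help to confirm that preprocessing produced well-formed files. The sketch below assumes that each `*.label.jsonl` file (the file names used by the feature-extraction commands) holds one JSON object per line. It makes no assumption about field names and only reports what it finds. `summarize_jsonl` is a hypothetical helper, not part of the repository.

```python
import json
from collections import Counter


def summarize_jsonl(path):
    """Count records and the top-level keys seen in a JSON Lines file."""
    n_records = 0
    key_counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, dict(key_counts)
```

For example, `summarize_jsonl("dataset/pubmed/train.label.jsonl")` returns the number of records and how often each key occurs. A key that appears in fewer records than the total points to incomplete entries.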

Get contextualized embeddings

For the PubMed dataset:

python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/train.label.jsonl --output ./bert_features_pubmed/bert_features_train --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/val.label.jsonl --output ./bert_features_pubmed/bert_features_val --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/test.label.jsonl --output ./bert_features_pubmed/bert_features_test --batch_size 100

For the arXiv dataset:

python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/train.label.jsonl --output ./bert_features_arxiv/bert_features_train --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/val.label.jsonl --output ./bert_features_arxiv/bert_features_val --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/test.label.jsonl --output ./bert_features_arxiv/bert_features_test --batch_size 100
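
The feature-extraction commands follow one pattern per dataset and split. As a sketch, assuming the paths and flags are exactly those shown above, the mapping from split to input and output path can be written out explicitly. `feature_extraction_args` is a hypothetical helper, and its defaults simply mirror the commands above.

```python
def feature_extraction_args(dataset, split, bert_model="bert-base-uncased", batch_size=100):
    """Build the argument list for one feature_extraction.py run."""
    if split not in ("train", "val", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return [
        "python", "feature_extraction.py",
        "--bert_model", bert_model,
        "--data", f"./dataset/{dataset}/{split}.label.jsonl",
        "--output", f"./bert_features_{dataset}/bert_features_{split}",
        "--batch_size", str(batch_size),
    ]
```

Passing the result to `subprocess.run(..., check=True)` runs one split. A smaller `batch_size` is a common first thing to try if the run exhausts GPU memory.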

Training

Run a command like this:

python train.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path> --model [HSG|MTHSG] --save_root <model path> --log_root <log path> --bert_path <bert feature path> --lr_descent --grad_clip -m 3

For example:

python train.py --cuda --gpu 0 --data_dir dataset/arxiv --cache_dir cache/arxiv --embedding_path glove.42B.300d.txt --model MTHSG --save_root models_arxiv --log_root log_arxiv/ --bert_path bert_features_arxiv --lr_descent --grad_clip -m 3
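
Training depends on the outputs of every earlier step, and a missing path may only surface once the script is running. A small pre-flight check, sketched here, lists what is missing first. `missing_training_inputs` is a hypothetical helper. It only tests whether each path exists and does not inspect file contents.

```python
import os


def missing_training_inputs(data_dir, cache_dir, embedding_path, bert_path):
    """Return the flag/path pairs whose path does not exist on disk."""
    required = {
        "--data_dir": data_dir,
        "--cache_dir": cache_dir,
        "--embedding_path": embedding_path,
        "--bert_path": bert_path,
    }
    return [
        f"{flag} {path}"
        for flag, path in required.items()
        if not os.path.exists(path)
    ]
```

With the example above, `missing_training_inputs("dataset/arxiv", "cache/arxiv", "glove.42B.300d.txt", "bert_features_arxiv")` returns an empty list when all four inputs are in place.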

Evaluation

For evaluation, the command may look like this:

python evaluation.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path> --model [HSG|MTHSG] --save_root <model path> --log_root <log path> --bert_path <bert feature path> -m 5 --test_model multi --use_pyrouge

For example:

python evaluation.py --cuda --gpu 0 --data_dir dataset/arxiv --cache_dir cache/arxiv --embedding_path glove.42B.300d.txt  --model MTHSG --save_root models_arxiv --log_root log_arxiv/ --bert_path bert_features_arxiv -m 5 --test_model multi --use_pyrouge

Note: ROUGE evaluation requires the ROUGE-1.5.5 package, which you need to download and then call through pyrouge.

Error handling: If pyrouge reports "Cannot open exception db file for reading: /path/to/ROUGE-1.5.5/data/WordNet-2.0.exc.db", the problem can be solved as described here.
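
A first step when diagnosing this error is to check that the files involved are present. The sketch below assumes the usual layout of the ROUGE-1.5.5 package, with `ROUGE-1.5.5.pl` at the top level and `data/WordNet-2.0.exc.db` beneath it (the latter path is taken from the error message). `missing_rouge_files` is a hypothetical helper.

```python
import os

# Paths relative to the ROUGE-1.5.5 directory (assumed standard layout).
EXPECTED_ROUGE_FILES = (
    "ROUGE-1.5.5.pl",
    os.path.join("data", "WordNet-2.0.exc.db"),
)


def missing_rouge_files(rouge_home):
    """Return the expected files that are absent under rouge_home."""
    return [
        rel for rel in EXPECTED_ROUGE_FILES
        if not os.path.isfile(os.path.join(rouge_home, rel))
    ]
```

An existing file can still be unreadable, so an empty result does not guarantee that the error is gone. It only rules out a missing file or a wrong directory.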

Some of the code is borrowed from HSG. Thanks to its authors for their work.

Citation

@inproceedings{doan-etal-2022-multi,
    title = "Multi Graph Neural Network for Extractive Long Document Summarization",
    author = "Doan, Xuan-Dung  and Nguyen, Le-Minh  and Bui, Khac-Hoai Nam",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    year = "2022"
}
