This repository contains code accompanying the paper “CATS: Customizable Abstractive Topic-based Summarization”, published in the ACM Transactions on Information Systems (TOIS) journal, 2021.
The code was developed with Python 2.7 and TensorFlow 1.4. The implementation builds on the code releases for Pointer-Generator Networks here and the TextSum project.
Obtaining the non-anonymized CNN/DailyMail dataset used in the paper: to obtain the dataset, we encourage users to download and preprocess it as described here. We use exactly the same chunked-data setting.
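For orientation, the chunked .bin files produced by the preprocessing linked above store length-prefixed serialized tf.train.Example protos (an 8-byte little-endian length, then the record bytes). The sketch below reads such records without depending on TensorFlow; it is a hedged illustration of the assumed format, so verify it against your own chunked data:

```python
import struct

def read_chunk(path):
    """Yield raw serialized records from a chunked .bin file.

    Assumed format (pointer-generator style, to be verified against
    your data): each record is an 8-byte little-endian length prefix
    followed by that many bytes of a serialized tf.train.Example.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<q", header)
            yield f.read(length)
```

Each yielded byte string can then be parsed with TensorFlow's tf.train.Example if you need the article/abstract fields.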
The LDA models used in our paper can be obtained from here. The current code release has been tested with the pre-trained LDA model with 150 topics. You can reference one of the provided LDA topic models in the TopicModel class in data.py.
To train the model, run:
python run_summarization.py --mode=train --data_path=/path/to/chunked/train_* --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment
This will create a subdirectory of your specified log_root called myexperiment, where all checkpoints will be saved. The model will then start training on the train_*.bin files.
As stated in the paper, no topic information is used at test time. To decode without topic information, we used the basic pointer-generator model code here. After downloading that code, you can decode with:
python run_summarization.py --mode=decode --data_path=/path/to/chunked/val_* --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment
Please note that the decode command must use the same settings as the training job (plus any decode-specific flags such as beam_size).
This will repeatedly load random examples from your specified data file and generate a summary for each using beam search. The results are printed to screen.
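For readers unfamiliar with the decoding procedure, beam search keeps the beam_size highest-scoring partial hypotheses at each step. The following is a simplified, self-contained toy sketch of the idea, not the repository's TensorFlow decoder; the step_logprobs interface is hypothetical:

```python
def beam_search(step_logprobs, beam_size, max_len, start, stop):
    """Toy beam search sketch (illustration only, not the repo's decoder).

    step_logprobs(prefix) -> dict of {next_token: log_probability}.
    Expands each hypothesis, keeps the beam_size best unfinished ones,
    and returns the highest-scoring complete sequence.
    """
    beam = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            for tok, lp in step_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for tokens, score in candidates:
            if tokens[-1] == stop:
                finished.append((tokens, score))  # hypothesis is complete
            else:
                beam.append((tokens, score))
            if len(beam) == beam_size:
                break
        if not beam:
            break
    pool = finished or beam
    return max(pool, key=lambda c: c[1])[0]
```

In the real model, step_logprobs would come from the decoder's output distribution; length normalization and coverage penalties, which practical decoders often add, are omitted here for brevity.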
If you would like to evaluate on the entire validation or test set and obtain ROUGE scores, set the flag single_pass=1. This processes the whole dataset in order, writes the generated summaries to file, and then runs evaluation with pyrouge.
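pyrouge wraps the official ROUGE-1.5.5 Perl toolkit, which is what produces the reported scores. As a rough, self-contained illustration of what a unigram-overlap score measures, here is a toy ROUGE-1 F1 (not a substitute for pyrouge; the function name is ours):

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Toy ROUGE-1 F1 on unigram overlap (illustration only;
    reported scores should come from pyrouge / ROUGE-1.5.5)."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / float(len(candidate_tokens))
    recall = overlap / float(len(reference_tokens))
    return 2 * precision * recall / (precision + recall)
```

The official toolkit additionally handles stemming, sentence splitting, and multiple references, so its numbers will differ from this sketch.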