
Improving Multi-Document Modeling via Cross-Document Question-Answering

This repository contains the accompanying code for the paper:

"Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering ." Avi Caciularu, Arman Cohan, Ido Dagan, Jacob Goldberger and Arman Cohan. In ACL, 2023. [PDF]

You can either pretrain by yourself or use the pretrained QAmden model weights and tokenizer files, which are available on HuggingFace.

Pre-trained Model Usage

Code for loading and using the QAmden pre-trained model:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden')

Please note that during pretraining we used document separators (similarly to PRIMERA), which you might want to add to your data. The document separator is <doc-sep> (the last token in the vocabulary).
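
For example, a minimal sketch of preparing a multi-document input with the separator (the example documents and truncation settings here are illustrative, not the exact pretraining preprocessing):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')

# hypothetical documents from the same news cluster
docs = ["First related news article ...", "Second related news article ..."]

# concatenate the documents with the <doc-sep> separator used during pretraining
multi_doc_input = "<doc-sep>".join(docs)

# tokenize the concatenated input
inputs = tokenizer(multi_doc_input, return_tensors="pt", truncation=True)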

We also provide QAmden fine-tuned on the Multi-News dataset:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden-multinews')
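
A minimal usage sketch for summarization with the fine-tuned checkpoint (loading it with a seq2seq head and the generation hyperparameters below are assumptions, shown for illustration only):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
# assumes the fine-tuned checkpoint loads as a standard seq2seq model
model = AutoModelForSeq2SeqLM.from_pretrained('biu-nlp/QAmden-multinews')

# hypothetical cluster of related articles, joined with the pretraining separator
docs = ["Article one ...", "Article two ..."]
inputs = tokenizer("<doc-sep>".join(docs), return_tensors="pt", truncation=True)

# generate an abstractive multi-document summary (generation settings are illustrative)
summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))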

Pre-training your own QAmden model

To generate the pre-training data and pre-train your own QAmden model:

  1. Download and untar the preprocessed newshead data.
  2. Process the data by running pretrain_preprocess_qasem.py.
  3. Filter the processed data and create the csv files by running preprocess_and_filter_data.py.

Alternatively, you can download and use the already-preprocessed data:

from datasets import load_dataset
qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")
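
Once loaded, it can be useful to inspect the splits and fields before launching pre-training (a quick sanity check; the split name "train" is an assumption about the dataset layout):

# print the available splits and the fields of the first training example
print(qamden_pretraining_dataset)
print(qamden_pretraining_dataset["train"][0].keys())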

Once you have the data, launch pre-training using the pretrain_qamden.py script.

Evaluating the QAmden model on multi-document summarization

Use the finetune_summarization.py script to evaluate on Multi-News or on Multi-XScience (multi_x_science_sum).
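
Both benchmarks are available on the HuggingFace Hub; a minimal sketch for loading them directly with the datasets library:

from datasets import load_dataset

# the two multi-document summarization benchmarks referenced above
multi_news = load_dataset("multi_news")
multi_x_science = load_dataset("multi_x_science_sum")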


Citation

If you find our work useful, please cite the paper as:

@article{caciularu2023Peekacross,
  title={Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering},
  author={Caciularu, Avi and Peters, Matthew E. and Goldberger, Jacob and Dagan, Ido and Cohan, Arman},
  journal={The Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}
