
[NeurIPS 2023] Pre-training Machine Reader (Instead of Masked Language Model) at Scale

Introduction

Pre-trained Machine Reader (PMR) is pre-trained on 18 million Machine Reading Comprehension (MRC) examples constructed from Wikipedia hyperlinks, using a learning objective called Wiki Anchor Extraction (WAE). Specifically, we constructed a large volume of general-purpose, high-quality MRC-style training data based on Wikipedia anchors (i.e., hyperlinked texts). For each Wikipedia anchor, we composed a pair of correlated articles. One side of the pair is the Wikipedia article that contains a detailed description of the hyperlinked entity, which we call the definition article. The other side is the article that mentions the anchor text, which we call the mention article. From each pair we composed an MRC-style training instance in which the anchor is the answer, the passage surrounding the anchor in the mention article is the context, and the definition of the anchor entity in the definition article is the query.

[Figure: PMR data]
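To make the construction concrete, the snippet below shows an illustrative MRC-style instance shaped as described above; the field names and text are hypothetical and are not taken from the released data files.

# An illustrative (hypothetical) instance built from one Wikipedia anchor:
# the anchor is the answer, the passage around the anchor in the mention
# article is the context, and the definition article's opening description
# is the query.
example = {
    # Query: definition of the anchor entity from its definition article.
    "query": "Silicon is a chemical element with the symbol Si and atomic number 14.",
    # Context: the passage surrounding the anchor in the mention article.
    "context": "Most solar cells are made from wafers of monocrystalline silicon doped with boron.",
    # Answer: the anchor (hyperlinked) text inside the context.
    "answer": "silicon",
}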

Based on the above data, we then introduced a novel WAE problem as the pre-training task of PMR. In this task, PMR determines whether the context and the query are relevant. If so, PMR extracts the answer from the context that satisfies the query description.

[Figure: PMR model]
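As a rough illustration of the WAE task (a sketch, not the repo's actual preprocessing or modeling code), the snippet below shows how a single query-context pair could be packed into one sequence and labeled: a relevant pair keeps the anchor span as the answer, while an irrelevant pair is labeled as unanswerable. The function name, label fields, and special-token strings are assumptions for illustration only.

# A minimal sketch of WAE-style labeling, under the assumptions stated above.
def build_wae_example(query_tokens, context_tokens, answer_span=None):
    """answer_span is a (start, end) index pair inside context_tokens,
    or None when the query does not describe any anchor in this context."""
    # Generic special-token placeholders; the real tokenizer's tokens may differ.
    sequence = ["[CLS]"] + query_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    if answer_span is None:
        # Irrelevant query-context pair: nothing should be extracted.
        label = {"relevant": False, "start": None, "end": None}
    else:
        # Shift the span by the length of "[CLS] query [SEP]".
        offset = len(query_tokens) + 2
        label = {"relevant": True,
                 "start": answer_span[0] + offset,
                 "end": answer_span[1] + offset}
    return sequence, label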

Requirements

  • python 3.6.12
  • pytorch 1.9.0+cu111 (installed from the official PyTorch website)
  • transformers 4.17.0
  • tqdm 4.63.0
  • sentencepiece 0.1.96
  • Levenshtein

Preprocessing Wikipedia (Optional)

  1. PMR

    1. We downloaded the English Wikipedia dump (2022-01-01) from Wikimedia and preprocessed it with WikiExtractor. The following command extracts plain text with hyperlinks from the Wiki dump and creates a folder named en that stores all extracted pages.
      python WikiExtractor.py enwiki-20220801-pages-articles.xml.bz2 -c -l --json -o en   
      
    2. Then we tokenize all the wiki pages with the following command. The outputs are two files, en/processed/e2p.bz2 and en/processed/e2c.bz2: e2p is a dictionary that maps the entity (e) to its tokenized wiki page (p), and e2c is another dictionary that maps the entity (e) to its anchor's context (c). We provide our processed files here.
      cp -r wikiextractor/en ./
      python preprocessing_PMR/wiki_preprocess_multiprocess.py --file en --processes 20
      
    3. Once we have the tokenized data, we sample several contexts for each wiki entity. We set an upper bound (--up) on the number of contexts per entity and a filtering threshold (--bottom) to filter out entities that have fewer contexts.
      python preprocessing_PMR/wiki_sampling.py --method fre --fre 10 --k 10 --file en --sample_data full-random10
      
    4. We also generate token ids with the model tokenizer and save them in a cache file for fast reuse, which accelerates the training process (pass --evaluate to generate the cache file of the test set).
      python preprocessing_PMR/wikipedia2mrc_multiprocess.py --processes 20 --file en --sample_data full-random10 --model_type roberta --model_name_or_path roberta-base --do_negative --buffer 7000000 --simple
      python preprocessing_PMR/wikipedia2mrc_multiprocess.py --processes 20 --file en --sample_data full-random10 --model_type roberta --model_name_or_path roberta-base --do_negative --buffer 7000000 --simple --evaluate
      
  2. mPMR

    1. We downloaded the English Wikipedia dump (2022-08-01) from Wikimedia and preprocessed it with WikiExtractor; the same procedure is repeated for every other language. The following command extracts plain text with hyperlinks from the English dump and creates a folder named en that stores all extracted pages.

      python WikiExtractor.py enwiki-20220801-pages-articles.xml.bz2 -c -l --json -o en   
      cp -r ./en ../mPMR/
      
    2. After extracting Wikipedia pages for all languages (as in step 1 above), we provide preprocessing scripts to create MRC examples for all languages.

      bash preprocessing_mPMR/wiki_preprocess.sh
      bash preprocessing_mPMR/wiki_sampling.sh
      bash preprocessing_mPMR/wiki2mrc.sh
      
    3. Since each cache file contains the MRC examples of a single language, we run an additional mixing step so that each cache file contains examples from all languages with the same distribution (pass --evaluate to mix the cache files of the test set).

      python preprocessing_mPMR/wiki_mix_fast.py --langs ar_bn_de_fi_fr_el_en_es_hi_id_it_ja_ko_nl_pl_pt_ru_sv_sw_te_th_tr_vi_zh --sample_data full-random-10 --model_type xlmr --saved_buffer 7000000 --buffer 10000000 --do_negative --out_dir mix24
      python preprocessing_mPMR/wiki_mix_fast.py --langs ar_bn_de_fi_fr_el_en_es_hi_id_it_ja_ko_nl_pl_pt_ru_sv_sw_te_th_tr_vi_zh --sample_data full-random-10 --model_type xlmr --saved_buffer 7000000 --buffer 10000000 --do_negative --out_dir mix24 --evaluate
      

Pre-training (Optional)

Once the cache files are ready, we can start pre-training. Please refer to the following scripts for pre-training the PMR/mPMR models.

bash train.sh           ## Notes: PMR on top of RoBERTa-base
bash train-large.sh     ## Notes: PMR on top of RoBERTa-large
bash train-xxlarge.sh   ## Notes: PMR on top of ALBERT-xxlarge
bash mtrain.sh          ## Notes: mPMR on top of XLMR-base
bash mtrain-large.sh    ## Notes: mPMR on top of XLMR-large

Pre-trained Models

Model           HF Name
PMR-base        DAMO-NLP-SG/PMR-base
PMR-large       DAMO-NLP-SG/PMR-large
PMR-xxlarge     DAMO-NLP-SG/PMR-xxlarge
NER-PMR-large   DAMO-NLP-SG/NER-PMR-large
EQA-PMR-large   DAMO-NLP-SG/EQA-PMR-large
mPMR-base       DAMO-NLP-SG/mPMR-base
mPMR-large      DAMO-NLP-SG/mPMR-large
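
The checkpoints above are hosted on the Hugging Face Hub. Below is a minimal sketch of loading one of them with the transformers library, assuming the checkpoint ships an encoder-compatible config for AutoTokenizer/AutoModel; the WAE span-extraction head is implemented in this repository's code and is not loaded by AutoModel.

# Minimal sketch: load the PMR-base tokenizer and encoder from the Hugging Face Hub.
# Assumes compatibility with AutoTokenizer/AutoModel; the task-specific extraction
# head is defined in this repo and is not loaded here.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DAMO-NLP-SG/PMR-base")
encoder = AutoModel.from_pretrained("DAMO-NLP-SG/PMR-base")

# Encode one query-context pair (query as the first segment, context as the second).
inputs = tokenizer("definition of the entity ...", "passage mentioning the entity ...",
                   return_tensors="pt", truncation=True)
outputs = encoder(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)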

Fine-tuning

  • The downstream datasets are provided in $task/Data/$dataset.
  • Use the provided scripts to fine-tune a task-specific model (the following is an example of fine-tuning on CoNLL03):
    cd NER
    bash conll03.sh
    

Example Usage of Fine-tuned PMR

TODO

Citation

If you use the code in your research, please star our repo and cite our papers as follows:

@article{xu2022clozing,
  title={From Clozing to Comprehending: Retrofitting Pre-trained Language Model to Pre-trained Machine Reader},
  author={Xu, Weiwen and Li, Xin and Zhang, Wenxuan and Zhou, Meng and Bing, Lidong and Lam, Wai and Si, Luo},
  journal={arXiv preprint arXiv:2212.04755},
  year={2022}
}

@inproceedings{xu2022mpmr,
  title={mPMR: A Multilingual Pre-trained Machine Reader at Scale},
  author={Xu, Weiwen and Li, Xin and Lam, Wai and Bing, Lidong},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
  year={2023}
}
