Fantabulous-J/CLASS

Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision

The source code for our paper "Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision".

Install environment

pip install -r requirements.txt

Evaluation

Models

XOR-TYDI-QA

Download Dataset

mkdir -p data/XOR-Retrieve
cd data/XOR-Retrieve
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_retrieve_eng_span.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_retrieve_eng_span_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/models/enwiki_20190201_w100.tsv -O psgs_w100.tsv
cd ../../

mkdir -p data/XOR-Full
cd data/XOR-Full
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_full.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_full_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/cora/models/all_w100.tsv
cd ../../
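The two corpus files are large, so an interrupted download is easy to miss. A minimal sketch for checking that every file named in the commands above ended up where the scripts presumably expect it. The helper name `missing_files` and the "non-empty file" criterion are assumptions made for illustration and are not part of this repository:

```python
from pathlib import Path

# File names are taken from the wget commands above.
EXPECTED_FILES = {
    "XOR-Retrieve": [
        "xor_train_retrieve_eng_span.jsonl",
        "xor_dev_retrieve_eng_span_v1_1.jsonl",
        "nq-train.qa.csv",
        "psgs_w100.tsv",
    ],
    "XOR-Full": [
        "xor_train_full.jsonl",
        "xor_dev_full_v1_1.jsonl",
        "nq-train.qa.csv",
        "all_w100.tsv",
    ],
}


def missing_files(data_root="data"):
    """Return the expected dataset files that are absent or empty under data_root."""
    root = Path(data_root)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            path = root / subdir / name
            # Hypothetical criterion: the file must exist and be non-empty.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    for path in missing_files():
        print(f"missing or empty: {path}")
```

Run from the repository root, the script prints nothing when all eight files are present.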

XOR-Retrieve

Generate Embeddings

Encode Query

bash scripts/XOR-Retrieve/encode_query.sh

Encode Corpus

bash scripts/XOR-Retrieve/encode_corpus.sh

Note that MODEL_PATH should be set to fanjiang98/CLASS-XOR-Retrieve in both scripts.

Retrieve

bash scripts/XOR-Retrieve/retrieve_hn.sh

Note that MODEL_PATH should be set to fanjiang98/CLASS-XOR-Retrieve.

We use the official scripts provided by XOR-TYDI-QA for evaluation:

python3 evals/eval_xor_retrieve.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>

This leads to the following results:

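R@2k

| Model    | Ar   | Bn   | Fi   | Ja   | Ko   | Ru   | Te   | Avg  |
|----------|------|------|------|------|------|------|------|------|
| CLASS-US | 54.5 | 67.4 | 58.6 | 47.7 | 51.6 | 59.9 | 65.6 | 57.9 |
| CLASS-ZS | 59.2 | 70.1 | 59.9 | 51.5 | 57.2 | 51.5 | 72.3 | 60.2 |
| CLASS    | 66.7 | 79.6 | 64.3 | 58.1 | 66.0 | 64.1 | 77.7 | 68.1 |

R@5k

| Model    | Ar   | Bn   | Fi   | Ja   | Ko   | Ru   | Te   | Avg  |
|----------|------|------|------|------|------|------|------|------|
| CLASS-US | 64.8 | 73.0 | 64.7 | 57.3 | 58.6 | 67.9 | 70.6 | 65.3 |
| CLASS-ZS | 66.7 | 78.6 | 66.6 | 60.2 | 63.2 | 58.2 | 78.2 | 67.4 |
| CLASS    | 70.6 | 84.9 | 71.0 | 66.0 | 72.6 | 70.0 | 81.9 | 73.9 |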
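For readers unfamiliar with the metric, the following is a simplified sketch of how a recall-at-k-tokens score can be computed. It assumes R@2k and R@5k mean that a gold answer must appear within the first 2,000 or 5,000 tokens of the concatenated retrieved passages, as in the XOR-TYDI-QA benchmark. Whitespace tokenization, substring matching, and the function name `recall_at_k_tokens` are illustrative assumptions; the official evals/eval_xor_retrieve.py remains the reference for reported numbers.

```python
def recall_at_k_tokens(predictions, answers, k):
    """Percentage of questions with a gold answer in the first k retrieved tokens.

    predictions: {question_id: [passage_text, ...]} in ranked order.
    answers:     {question_id: [gold_answer, ...]}.
    """
    hits = 0
    for qid, golds in answers.items():
        tokens = []
        for passage in predictions.get(qid, []):
            tokens.extend(passage.split())  # assumption: whitespace tokenization
            if len(tokens) >= k:
                break
        window = " ".join(tokens[:k])
        if any(gold in window for gold in golds):
            hits += 1
    return 100.0 * hits / len(answers) if answers else 0.0


predictions = {
    "q1": ["Paris is the capital of France", "Berlin is in Germany"],
    "q2": ["Tokyo is in Japan"],
}
answers = {"q1": ["Berlin"], "q2": ["Kyoto"]}

print(recall_at_k_tokens(predictions, answers, k=6))  # 0.0: "Berlin" lies beyond token 6
print(recall_at_k_tokens(predictions, answers, k=8))  # 50.0: q1 is now a hit
```

Measuring recall over a token budget rather than a passage count keeps the comparison fair across systems that retrieve passages of different lengths.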

XOR-Full

Retrieve

Retrieval works the same way as for XOR-Retrieve. Use the corresponding scripts under scripts/XOR-Full and set MODEL_PATH to fanjiang98/CLASS-XOR-Full.

Answer Generation

bash scripts/XOR-Full/eval_reader.sh

MODEL_PATH should be fanjiang98/CLASS-XOR-Full. We use the official scripts provided by XOR-TYDI-QA for evaluation:

python3 evals/eval_xor_full.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>

This leads to the following results:

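The per-language columns report F1; the last three columns are macro averages.

| Model | Ar   | Bn   | Fi   | Ja   | Ko   | Ru   | Te   | Avg F1 | Avg EM | Avg BLEU |
|-------|------|------|------|------|------|------|------|--------|--------|----------|
| CORA  | 42.9 | 26.9 | 41.4 | 36.8 | 30.4 | 33.9 | 30.9 | 34.7   | 25.8   | 23.3     |
| CLASS | 49.1 | 32.0 | 46.7 | 44.1 | 38.4 | 39.9 | 41.1 | 41.6   | 32.5   | 28.2     |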
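As a rough illustration of what the F1, EM, and macro-average columns measure, here is a simplified sketch. It assumes lowercasing and whitespace tokenization, which will differ from the official evals/eval_xor_full.py (that script is the reference and may apply language-specific tokenization). The function names are hypothetical.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, gold):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def macro_average(per_language_scores):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language_scores.values()) / len(per_language_scores)


# Per-language F1 of CLASS, copied from the table above.
class_f1 = {"Ar": 49.1, "Bn": 32.0, "Fi": 46.7, "Ja": 44.1,
            "Ko": 38.4, "Ru": 39.9, "Te": 41.1}
print(round(macro_average(class_f1), 1))  # 41.6, matching the reported macro F1
```

Because the macro average weights every language equally, a low-scoring language such as Bn pulls the average down as much as a high-scoring one pulls it up, regardless of how many questions each language has.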

Training

Please download the training data from OneDrive and place it in the corresponding directories under data.

  1. Stage-1 Pre-training:
bash scripts/train_mss_distill_reader.sh
  2. Stage-2 Pre-training:
bash scripts/XOR-Retrieve/train_mss_iterative_reader.sh
  3. Fine-tuning on Natural Questions (zero-shot model):
bash scripts/XOR-Retrieve/train_nq_iterative_reader.sh
  4. Fine-tuning on XOR-Retrieve training data (i.e., our released CLASS-XOR-Retrieve model):
bash scripts/XOR-Retrieve/train_iterative_reader.sh

The training pipeline for XOR-Full is the same. Please find the corresponding scripts for steps 2, 3 and 4 under scripts/XOR-Full.
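The stage order can be summarized in a small helper that lists the commands for a given task. This is only a sketch: the function `training_scripts` is hypothetical, and for XOR-Full it assumes the script file names mirror those under scripts/XOR-Retrieve, which this README does not state explicitly. Check the scripts/XOR-Full directory before relying on it.

```python
def training_scripts(task="XOR-Retrieve"):
    """Return the four training scripts in the order listed above.

    Assumption: scripts under scripts/<task>/ use the same file names
    for both XOR-Retrieve and XOR-Full.
    """
    return [
        "scripts/train_mss_distill_reader.sh",            # stage-1 pre-training (shared)
        f"scripts/{task}/train_mss_iterative_reader.sh",  # stage-2 pre-training
        f"scripts/{task}/train_nq_iterative_reader.sh",   # fine-tuning on Natural Questions
        f"scripts/{task}/train_iterative_reader.sh",      # fine-tuning on task training data
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for script in training_scripts("XOR-Retrieve"):
        print(f"bash {script}")
```

Each stage presumably starts from the checkpoint produced by the previous one, so the scripts should be run in this order.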

We train with Slurm on 32 80GB A100 GPUs for stage 1 and on 16 A100 GPUs for the remaining stages.

Acknowledgement

Some of the code was adapted from https://github.com/jzbjyb/ReAtt.
