This repository contains the code and resources for our paper:
- Xuanang Chen, Jian Luo, Ben He, Le Sun, Yingfei Sun. Towards Robust Dense Retrieval via Local Ranking Alignment. In IJCAI 2022.
Our code is developed based on the Tevatron DR training toolkit.
We recommend creating a new conda environment with `conda create -n rodr python=3.7`, activating it with `conda activate rodr`, and then installing the following packages: `torch==1.8.1`, `faiss-cpu==1.7.1`, `transformers==4.9.2`, `datasets==1.11.0`.
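For example (shown here with pip; the exact `torch` install command may differ depending on your CUDA setup):

```bash
conda create -n rodr python=3.7
conda activate rodr
pip install torch==1.8.1 faiss-cpu==1.7.1 transformers==4.9.2 datasets==1.11.0
```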
Note: In this repo, we mainly take the MS MARCO passage ranking dataset as an example. Before the experiments, you can refer to the `download_raw_data.sh` script to download and process the raw data, which will be saved in the `data/msmarco_passage/raw` folder, e.g., the `train.negatives.tsv` file that contains the negatives of each train query for constructing the training data.
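For example, from the repository root:

```bash
# Downloads and preprocesses the raw MS MARCO passage data into
# data/msmarco_passage/raw (including train.negatives.tsv).
bash download_raw_data.sh
```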
Dev Query: All query variation sets for the MS MARCO small Dev set used in our paper are provided in the `data/msmarco_passage/query/dev` folder. You can directly use these query variation sets to test the robustness of your DR model, or you can use the `query_variation_generation.py` script to generate a query variation set yourself:
```bash
qv_type=MisSpell
python query_variation_generation.py \
  --original_query_file ./msmarco_passage/raw/queries.dev.small.tsv \
  --query_variation_file ./msmarco_passage/process/query/dev/queries.dev.small.${qv_type}.tsv \
  --variation_type ${qv_type}
```
You need to specify the type of query variation (namely, `qv_type`) from eight pre-defined types of query variations: `MisSpell`, `ExtraPunc`, `BackTrans`, `SwapSyn_Glove`, `SwapSyn_WNet`, `TransTense`, `NoStopword`, `SwapWords`.
Note that a few queries may be kept in their original form in a given query variation set. For example, if a query does not contain any stopword, the `NoStopword` variation is not applicable. Besides, before using the `query_variation_generation.py` script, you may need to install the TextFlint, TextAttack, and NLTK toolkits.
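For convenience, the loop below generates all eight Dev variation sets using the same arguments as the example above (assuming the output folder already exists):

```bash
for qv_type in MisSpell ExtraPunc BackTrans SwapSyn_Glove SwapSyn_WNet TransTense NoStopword SwapWords; do
  python query_variation_generation.py \
    --original_query_file ./msmarco_passage/raw/queries.dev.small.tsv \
    --query_variation_file ./msmarco_passage/process/query/dev/queries.dev.small.${qv_type}.tsv \
    --variation_type ${qv_type}
done
```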
Train Query: We also need to generate variations for the train queries to enhance the DR model. Similar to the Dev set, we first generate eight variation sets for the train query set, and then merge them uniformly to obtain the final train query variation set (our generated train query variation file is available in the `data/msmarco_passage/query/train` folder), which is used to insert variations into the training data by adding a `query_variation` field to each training example. You can refer to the `construct_train_query_variations.py` script after you obtain the train variation sets and the original training data.
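A hypothetical invocation of this step is sketched below; the argument names and file names are placeholders rather than the script's actual interface, so check `construct_train_query_variations.py` before running:

```bash
# Placeholder arguments for illustration only; the real flags may differ.
python construct_train_query_variations.py \
  --train_dir ./data/msmarco_passage/train/OQ \
  --query_variation_file ./data/msmarco_passage/query/train/queries.train.variations.tsv \
  --output_dir ./data/msmarco_passage/train/OQ_with_variations
```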
Standard DR: To obtain a standard DR model, like `DR_OQ` in our paper, you need to construct the training data first:

- `OQ`: the training data with original train queries, generated by the `bulid_train.py` script.
- `QV`: the training data with train query variations, obtained by inserting the variation version of the original train queries into the `OQ` training data.

After that, you can refer to the `train_standard_dpr.sh` script to train the `DR_OQ`, `DR_QV`, and `DR_OQ->QV` models using the `OQ` and `QV` training data, as described in our paper.
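A minimal sketch of this step, assuming `train_standard_dpr.sh` reads its data and output paths from variables defined inside the script:

```bash
# Build the OQ training data first (check bulid_train.py for its arguments),
# then train DR_OQ / DR_QV / DR_OQ->QV; edit the script to select the
# training data (OQ or QV) and the output directory before running.
bash train_standard_dpr.sh
```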
RoDR: As for our proposed RoDR model, to achieve better alignment, you need to collect nearer neighbors for queries. Specifically, you can update the negatives in the `OQ` training data by sampling from the top candidates returned by the `DR_OQ` model. After that, you can refer to the `bulid_train_nn.py` script, whose `--query_variation` argument requires the generated train query variation file. Alternatively, you can also add the variation version of the train queries after constructing the training data, similar to `QV`, using the `construct_training_data_with_variations` function available in the `construct_train_query_variations.py` script.
Then, you can refer to the `train_rodr_dpr.sh` script to train a `RoDR w/ DR_OQ` model on top of the `DR_OQ` model. Compared to standard DR training, you need to change `--training_mode` to the `oq.qv.lra` mode, provide the initial DR model path via the `--model_name_or_path` argument, and set the loss weights in Eq. 8, as described in our paper.
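A hedged sketch of the RoDR training step; only `--training_mode oq.qv.lra` and `--model_name_or_path` come from the description above, and the actual entry point and remaining arguments should be taken from `train_rodr_dpr.sh`:

```bash
# Edit train_rodr_dpr.sh so that the underlying training command uses:
#   --training_mode oq.qv.lra             # switch to local ranking alignment training
#   --model_name_or_path /path/to/DR_OQ   # initialize from the trained DR_OQ checkpoint
#   (plus the Eq. 8 loss weights described in the paper)
bash train_rodr_dpr.sh
```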
After training a DR model, you can use it to carry out dense retrieval as follows (a sketch follows this list):

- Tokenizing: use the `tokenize_passages.py` and `tokenize_queries.py` scripts to tokenize all passages in the corpus, the original queries, and the query variations.
- Encoding and Retrieval: refer to `encode_retrieve_dpr.sh` to first encode passages and queries into vectors, and then use Faiss to index and retrieve.
As for zero-shot retrieval on ANTIQUE, all DR models are trained only on the MS MARCO passage dataset; please refer to the `run_antique_zeroshot.sh` script.
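A minimal sketch, assuming the MS MARCO-trained model path is configured inside the script:

```bash
# Zero-shot retrieval on ANTIQUE with a DR model trained only on MS MARCO passage.
bash run_antique_zeroshot.sh
```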
For evaluation on the MS MARCO passage ranking dataset, such as MRR@10, Recall, and the statistical t-test, we provide the `variations_avg_tt_test.py` script to compute the metrics for all paired run files from the two DR models being compared. You can use it like this:
```bash
# for a single run file
python variations_avg_tt_test.py qrels run_file1 run_file2
# for all run files
python variations_avg_tt_test.py qrels run_dir1 run_dir2 fusion
```
- Query variations:
  - Passage-Dev: available in the `data/msmarco_passage/query` folder, for both `dev` and `train` query sets.
  - Document-Dev: available in the `data/msmarco_doc/query` folder, for both `dev` and `train` query sets.
  - ANTIQUE: available in the `data/antique/query` folder, collected from five types of manually validated query variations.
- Models:

| MS MARCO Passage | MS MARCO Document |
| --- | --- |
| DR_OQ | DR_OQ |
| DR_QV | DR_QV |
| DR_OQ->QV | DR_OQ->QV |
| RoDR w/ DR_OQ | RoDR w/ DR_OQ |

- Retrieval files*:

| Dataset | DR_OQ | RoDR w/ DR_OQ |
| --- | --- | --- |
| Passage-Dev | Download | Download |
| Document-Dev | Download | Download |
| ANTIQUE | Download | Download |

\* Due to the large size of the run files on Passage-Dev, we only provide the run files of the `DR_OQ` and `RoDR w/ DR_OQ` models. If you want to obtain the run files of the `DR_QV` and `DR_OQ->QV` models, please feel free to contact us.
If you want to apply RoDR to publicly available DR models, such as ANCE, TAS-Balanced, and ADORE+STAR, which are enhanced in our paper, you need to make some minor changes at the model level, such as adding the pooler in ANCE and using separate query and passage encoders in ADORE+STAR. Herein, we provide the model checkpoints and retrieval files for the reproducibility of our experiments and for other research uses.
- Models:

| Original | RoDR |
| --- | --- |
| ANCE | RoDR w/ ANCE |
| TAS-Balanced | RoDR w/ TAS-Balanced |
| ADORE+STAR | RoDR w/ ADORE+STAR |

- Retrieval files**:

| Model | Passage-Dev | ANTIQUE |
| --- | --- | --- |
| RoDR w/ ANCE | Download | Download |
| RoDR w/ TAS-Balanced | Download | Download |
| RoDR w/ ADORE+STAR | Download | Download |

\*\* Due to the large size of the run files on Passage-Dev, we only provide the run files of the RoDR models. If you want to obtain the run files of the original DR models, please feel free to contact us.
If you find our paper/resources useful, please cite:
```
@inproceedings{chen_ijcai2022-275,
  title     = {Towards Robust Dense Retrieval via Local Ranking Alignment},
  author    = {Xuanang Chen and Jian Luo and Ben He and Le Sun and Yingfei Sun},
  booktitle = {Proceedings of the Thirty-First International Joint Conference on
               Artificial Intelligence, {IJCAI-22}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  pages     = {1980--1986},
  year      = {2022}
}
```