Search engines often fail to search knowledge repositories effectively since users' queries are not tailored to their information needs. Users' queries are, more often than not, under-specified and hence also retrieve irrelevant documents. Query refinement, also known as query reformulation or query suggestion, is the process of transforming users' queries into new, refined versions without semantic drift to enhance the relevance of search results. Prior query refiners have been benchmarked on web retrieval datasets under the weak assumption that users' input queries improve gradually within a search session. To fill the gap, we contribute `RePair`, an open-source configurable toolkit to generate large-scale gold-standard benchmark datasets from a variety of domains for the task of query refinement. `RePair` takes a dataset of queries and their relevance judgements (e.g., `msmarco` or `aol`), an information retrieval method (e.g., `bm25`), and an evaluation metric (e.g., `map`), and outputs refined versions of the queries using a transformer (e.g., `T5`), each with a relevance improvement guarantee. Currently, `RePair` includes gold-standard datasets for `aol-ia` and `msmarco.passage`.
Future Work: We are investigating contextual query refinement by incorporating query session information, such as the user or timestamp of a query, and measuring its effect on the performance of neural query refinement methods compared to context-free refinement.
You need Python 3.8 and the `pyserini` package (which needs `Java`), among others listed in `requirements.txt`. We also suggest cloning our repo with `--recurse-submodules` (alternatively, run `git submodule update --init` inside the cloned repository) to get the `trec_eval` metric evaluation tool:
By `pip`, clone the codebase and install the required packages:

```sh
git clone https://github.com/Fani-Lab/RePair --recurse-submodules
cd RePair
pip install -r requirements.txt
```
By `conda`:

```sh
git clone https://github.com/Fani-Lab/RePair --recurse-submodules
cd RePair
conda env create -f environment.yml
conda activate repair
```
Note: When installing `Java`, remember to set `JAVA_HOME` in Windows's environment variables.

For `trec_eval`:

```sh
cd src/trec_eval.9.0.4
make
cd ..
```
To perform fast IR tasks, we need to build indexes of the document corpora or use prebuilt indexes like `msmarco.passage`. The path to the index needs to be set in `./src/param.py`, e.g., `param.settings['msmarco.passage']['index']`.

In case there is no prebuilt index, the steps include collecting the corpus and building an index, as we did for `aol-ia` using `ir-datasets`.
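For instance, a sparse Lucene index can be built with `pyserini`'s indexing CLI. This is a minimal sketch, assuming the corpus has already been exported to `pyserini`'s `JsonCollection` (jsonl) format; the paths and thread count below are placeholders, not RePair's settings:

```sh
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input ../data/raw/aol-ia/corpus/title \
  --index ../data/raw/aol-ia/lucene-index/title \
  --generator DefaultLuceneDocumentGenerator \
  --threads 4 \
  --storeRaw
```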
We use `T5` to train a model that, given an input (original) query, generates refined (better) versions of the query in terms of retrieving more relevant documents at higher ranking positions. Currently, we have finetuned the `T5` model on `msmarco.passage` (no context) and `aol` (w/o `userid`). For the `yandex` dataset, we will train `T5` from scratch since its tokens are anonymized by random numbers.
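RePair runs Google's original `T5` codebase. Purely as a hedged illustration of the training objective (not RePair's actual code), a single (docs -> query) pair could be used to take one gradient step with the Hugging Face `transformers` API; the model size and texts below are made up:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One (docs -> query) pair: concatenated relevant passages as input,
# the original query as the generation target.
source = tokenizer('affordable apartment listings in new york ...',
                   return_tensors='pt', truncation=True)
target = tokenizer('rentdirect com', return_tensors='pt', truncation=True)

loss = model(**source, labels=target.input_ids).loss  # teacher-forced cross-entropy
loss.backward()
optimizer.step()
```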
As seen in the above workflow, `RePair` has four pipelined steps:

- Transformer Finetuning: [`pair`, `finetune`]
- Refined Query Prediction: [`predict`]
- Performance Evaluation: [`search`, `eval`]
- Dataset Curation: [`agg`, `box`]
To run the `RePair` pipeline, we need to set the required parameters of each step in `./src/param.py`, such as the pairing strategy (`pairing`) for a query set and the choice of transformer (`t5model`). Then, the pipeline can be run by its driver at `./src/main.py`:

```sh
python -u main.py -data ../data/raw/toy.msmarco.passage -domain msmarco.passage
python -u main.py -data ../data/raw/toy.aol-ia -domain aol-ia
python -u main.py -data ../data/raw/toy.msmarco.passage ../data/raw/toy.aol-ia -domain msmarco.passage aol-ia
```
We create training sets based on different pairings of queries and their relevant passages in `./data/preprocessed/{domain name}/` for each domain, like `./data/preprocessed/toy.msmarco.passage/` for `msmarco.passage`:

- `ctx.query.docs`: context: query -> concatenated relevant documents (passages)
- `ctx.docs.query`: context: concatenated relevant documents (passages) -> query, like in docTTTTTquery
- `ctx.query.doc`: context: query -> relevant document (passage)
- `ctx.doc.query`: context: relevant document (passage) -> query

where the context is `userid` (personalized) or empty (context-free). For instance, for `msmarco.passage`, which has no contextual information, we have `docs.query` or `query.docs`. Further, if a query has more than one relevant document, we can either concatenate all relevant documents into a single document, i.e., `docs` (the `s` denotes concatenation), or duplicate the (query, doc) pair for each relevant document, i.e., `doc`; see the pairing sketch after the directory listing below.
After this step, the `./data/` directory looks like:

```
├── data
│   ├── preprocessed
│   │   └── toy.msmarco.passage
│   │       ├── docs.query.passage.test.tsv
│   │       ├── docs.query.passage.train.tsv
│   │       ├── queries.qrels.docs.ctx.passage.test.tsv
│   │       └── queries.qrels.docs.ctx.passage.train.tsv
```
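Here is the pairing sketch referenced above for building `docs.query` pairs. It is hypothetical (RePair's actual pairing code lives in the `pair` step), and the input file names and column layouts are assumptions:

```python
import pandas as pd

# Hypothetical inputs: queries, qrels mapping queries to relevant passage ids, and passages.
queries = pd.read_csv('queries.tsv', sep='\t', names=['qid', 'query'])
qrels = pd.read_csv('qrels.tsv', sep='\t', names=['qid', 'did'])
docs = pd.read_csv('docs.tsv', sep='\t', names=['did', 'text'])

# docs.query pairing: concatenate all relevant passages of a query (the "s" in docs)
# into a single input whose generation target is the original query.
pairs = (qrels.merge(docs, on='did')
              .groupby('qid')['text'].apply(' '.join).reset_index()
              .merge(queries, on='qid'))
pairs[['text', 'query']].to_csv('docs.query.train.tsv', sep='\t', index=False, header=False)
```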
We use `T5` to generate the refinements of the original queries. We can run `T5` on a local machine (cpu/gpu) or on google cloud (tpu), which is `T5`'s preference. We store the finetuned transformer in `./output/{domain name}/{transformer name}.{pairing strategy}`. For instance, for `T5` whose `small` version has been finetuned on a local machine for `toy.msmarco.passage`, we save the model in `./output/toy.msmarco.passage/t5.small.local.docs.query.passage/`.

After this step, `./output/` looks like:
```
├── output
│   ├── t5-data
│   │   ├── pretrained_models
│   │   │   └── small
│   │   └── vocabs
│   │       ├── cc_all.32000
│   │       └── cc_en.32000
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── checkpoint
│           ├── events.out.tfevents.1675961098.HFANI
│           ├── graph.pbtxt
│           ├── model.ckpt-1000000.data-00000-of-00002
│           ├── model.ckpt-1000000.index
│           ├── model.ckpt-1000000.meta
│           ├── model.ckpt-1000005.data-00000-of-00002
│           ├── model.ckpt-1000005.index
│           ├── model.ckpt-1000005.meta
│           ├── operative_config.gin
```
Once a transformer has been finetuned, we feed it the original queries, w/ or w/o context, and whatever the model generates is considered a potential refined query. To obtain a collection of potential refined queries for the same original query, we use top-k random sampling as opposed to beam search, as suggested by Nogueira and Lin. So, we run the transformer `nchanges` times at inference and generate `nchanges` potential refined queries.
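Purely for illustration (RePair runs inference through the T5 codebase, not this API), top-k random sampling with the Hugging Face `generate` function looks as follows; with beam search, the `nchanges` outputs would largely be near-duplicates of one best sequence:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')  # stand-in for the finetuned model
model = T5ForConditionalGeneration.from_pretrained('t5-small')

inputs = tokenizer('rentdirect com', return_tensors='pt')
# Sample nchanges=5 refinements, drawing each token from the top-10 candidates.
outputs = model.generate(**inputs, do_sample=True, top_k=10,
                         num_return_sequences=5, max_new_tokens=32)
for o in outputs:
    print(tokenizer.decode(o, skip_special_tokens=True))
```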
We store the i-th potential refined query of the original queries in the same folder as the finetuned model, i.e., `./output/{domain name}/{transformer name}.{pairing strategy}/pred.{refinement index}-{model checkpoint}`, like `./output/toy.msmarco.passage/t5.small.local.docs.query.passage/pred.0-1000005`.

After this step, prediction files will be added to `./output`:
```
├── output
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── original.-1
│           ├── pred.0-1000005
│           ├── pred.1-1000005
│           ├── pred.2-1000005
│           ├── pred.3-1000005
│           ├── pred.4-1000005
```
We search for relevant documents for both the original query and each of its potential refined queries. We need to set an information retrieval method, called a ranker, that retrieves relevant documents and ranks them based on relevance scores. We integrate `pyserini`, which provides efficient implementations of sparse and dense rankers, including `bm25` and `qld` (query likelihood with Dirichlet smoothing).
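As a minimal sketch of what the `search` step does per query (the index path and query string below are placeholders), sparse retrieval with `pyserini` looks like:

```python
from pyserini.search.lucene import LuceneSearcher

# Point this at the index set in param.settings['msmarco.passage']['index'].
searcher = LuceneSearcher('../data/raw/msmarco.passage/lucene-index.msmarco-v1-passage.20220131.9ea315/')
# searcher.set_qld(1000)  # switch from the default bm25 to query likelihood
hits = searcher.search('rentdirect com', k=100)  # topk=100 as in the settings
for hit in hits[:3]:
    print(hit.docid, hit.score)
```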
We store the search results for the i-th potential refined query in the same folder, in files whose names end with the ranker's name, i.e., `./output/{domain name}/{transformer name}.{pairing strategy}/pred.{refinement index}-{model checkpoint}.{ranker name}`, like `./output/toy.msmarco.passage/t5.small.local.docs.query.passage/pred.0-1000005.bm25`.
After this step, search results will be added to `./output`:

```
├── output
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── original.-1.bm25
│           ├── original.-1.qld
│           ├── pred.0-1000005.bm25
│           ├── pred.0-1000005.qld
│           ├── pred.1-1000005.bm25
│           ├── pred.1-1000005.qld
│           ├── pred.2-1000005.bm25
│           ├── pred.2-1000005.qld
│           ├── pred.3-1000005.bm25
│           ├── pred.3-1000005.qld
│           ├── pred.4-1000005.bm25
│           ├── pred.4-1000005.qld
```
The search results of each potential refined query are evaluated based on how much they improve retrieval performance with respect to an evaluation metric like `map` or `mrr`.
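Evaluation relies on the `trec_eval` tool compiled during setup; a standalone call of the following shape (the qrels and run file paths are placeholders) computes the per-query `map` of a run:

```sh
./src/trec_eval.9.0.4/trec_eval -q -m map \
  ./data/raw/toy.msmarco.passage/qrels.train.tsv \
  ./output/toy.msmarco.passage/t5.small.local.docs.query.passage/pred.0-1000005.bm25
```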
We store the evaluation results for the i-th potential refined query in the same folder, in files whose names end with the evaluation metric's name, i.e., `./output/{domain name}/{transformer name}.{pairing strategy}/pred.{refinement index}-{model checkpoint}.{ranker name}.{metric name}`, like `./output/toy.msmarco.passage/t5.small.local.docs.query.passage/pred.0-1000005.bm25.map`.
After this step, evaluation results will be added to `./output`:

```
├── output
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── original.-1.bm25.map
│           ├── original.-1.bm25.success.10
│           ├── original.-1.qld.map
│           ├── original.-1.qld.success.10
```
Finally, we keep those potential refined queries whose performance (metric score) is better than or equal to that of the original query, i.e., `refined_q_metric >= original_q_metric and refined_q_metric > 0`.
We keep two main datasets as the final outcome of the `RePair` pipeline:

- `./output/{input query set}/{transformer name}.{pairing strategy}/{ranker}.{metric}.agg.gold.tsv`: contains the original queries and their refined queries that guaranteed better (or equal) performance, along with the performance metric values
- `./output/{input query set}/{transformer name}.{pairing strategy}/{ranker}.{metric}.agg.all.tsv`: contains the original queries and all their predicted refined queries, along with the performance metric values

For instance, for the `toy` query sets of `msmarco.passage` and `aol-ia.title`, here are the files:

```
./output/toy.msmarco.passage/t5.small.local.docs.query.passage/bm25.map.agg.gold.tsv
./output/toy.msmarco.passage/t5.small.local.docs.query.passage/bm25.map.agg.all.tsv
./output/toy.aol-ia/t5.small.local.docs.query.title/bm25.map.agg.gold.tsv
./output/toy.aol-ia/t5.small.local.docs.query.title/bm25.map.agg.all.tsv
```
For boxing, since we keep the performance of all the potential refined queries, we can change the condition and make a customized selection, e.g., `diamond` refined queries that achieve the maximum possible performance, i.e., `1`, by setting the condition `refined_q_metric > original_q_metric and refined_q_metric == 1`. The boxing conditions can be set in `./src/param.py`:
```python
'box': {'gold': 'refined_q_metric >= original_q_metric and refined_q_metric > 0',
        'platinum': 'refined_q_metric > original_q_metric',
        'diamond': 'refined_q_metric > original_q_metric and refined_q_metric == 1'}
```
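Each condition is a plain boolean expression over per-query metric columns of the aggregated file, so boxing reduces to a row filter. A hedged sketch (the column names are assumed to match the condition strings; the actual agg file layout may differ):

```python
import pandas as pd

agg = pd.read_csv('qld.map.agg.all.tsv', sep='\t',
                  names=['qid', 'refined_q', 'original_q_metric', 'refined_q_metric'])
condition = 'refined_q_metric > original_q_metric and refined_q_metric == 1'  # diamond
agg.query(condition).to_csv('diamond.tsv', sep='\t', index=False)
```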
After this step, `./output` will further include:

```
├── output
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── qld.map.agg.all.tsv
│           ├── qld.map.agg.all_.tsv
│           ├── qld.map.agg.gold.tsv
│           ├── qld.map.boxes
│           │   ├── diamond.tsv
│           │   ├── gold.tsv
│           │   ├── platinum.tsv
│           │   └── stamps
│           │       ├── diamond.change.qld.map
│           │       ├── diamond.original.qld.map
│           │       ├── gold.change.qld.map
│           │       ├── gold.original.qld.map
│           │       ├── platinum.change.qld.map
│           │       └── platinum.original.qld.map
│           ├── qld.success.10.agg.all.tsv
│           ├── qld.success.10.agg.all_.tsv
│           ├── qld.success.10.agg.gold.tsv
│           └── qld.success.10.boxes
│               ├── diamond.tsv
│               ├── gold.tsv
│               ├── platinum.tsv
│               └── stamps
│                   ├── diamond.change.qld.success.10
│                   ├── diamond.original.qld.success.10
│                   ├── gold.change.qld.success.10
│                   ├── gold.original.qld.success.10
│                   ├── platinum.change.qld.success.10
│                   └── platinum.original.qld.success.10
```
It is imperative for us to support dense retrieval as a ranker method, for which we choose to explore `tct_colbert`, the current state-of-the-art method. `tct_colbert` is supported directly by `pyserini`, which allows us to index and search. We explain in detail how to do this for any dataset and release our dense indexes for `aol.title` and `aol.title.url`; for `msmarco.passage`, we used the prebuilt index provided by `pyserini`.

Indexing: Before building a dense index, we need to encode the passages in either faiss or jsonl format. We provide a sample command that encodes `aol.title`:
```sh
python -m pyserini.encode \
  input   --corpus data/raw/aol-ia/title \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings data/raw/aol-ia/dense-encoder/tct_colbert.title \
          --to-faiss \
  encoder --encoder castorini/tct_colbert-v2-hnp-msmarco \
          --fields text \
          --batch 32
```
Once the encoding is done, we need to index it. This indexing takes about 24 hours:
```sh
python -m pyserini.index.faiss \
  --input data/raw/aol-ia/dense-encoder/tct_colbert.title \
  --output data/raw/aol-ia/dense-index/tct_colbert.title \
  --hnsw
```
We chose to go with the HNSW index structure; you can explore more about it here.

Searching: We added dense searching to our pipeline, since it is very similar to searching a sparse index:
```python
from pyserini.search.faiss import TctColBertQueryEncoder, FaissSearcher

encoder = TctColBertQueryEncoder('./data/raw/aol-ia/dense-encoder/tct_colbert.title')
searcher = FaissSearcher('./data/raw/aol-ia/dense-index/tct_colbert.title', encoder)
```
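A retrieval call then mirrors the sparse case (the query string and `k` are placeholders):

```python
hits = searcher.search('rentdirect com', k=100)
for hit in hits[:3]:
    print(hit.docid, hit.score)
```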
We do not apply `tct_colbert` to our whole dataset; instead, we attempt to improve only those predicted queries that failed to beat their original query, i.e., those satisfying `original_q_metric > refined_q_metric and 0 <= original_q_metric <= 1`.
After this step, the `./output` folder will have a final structure as below:

```
├── output
│   └── toy.msmarco.passage
│       └── t5.small.local.docs.query.passage
│           ├── qld.map.agg.all.tsv
│           ├── qld.map.agg.all_.tsv
│           ├── qld.map.agg.gold.tsv
│           ├── qld.map.agg.no_pred.tsv
│           ├── qld.recip_rank.10.agg.all.tsv
│           ├── qld.recip_rank.10.agg.all_.tsv
│           ├── qld.recip_rank.agg.no_pred.tsv
│           ├── original.no_pred.tct_colbert
│           ├── original.no_pred.tct_colbert.map
│           ├── original.no_pred.tct_colbert.recip_rank.10
│           ├── pred.no_pred.tct_colbert
│           ├── pred.no_pred.tct_colbert.map
│           ├── pred.no_pred.tct_colbert.recip_rank.10
│           ├── colbert.comparison.no_pred.map.tsv
│           ├── colbert.comparison.no_pred.recip_rank.10.tsv
│           ├── qld.map.boxes
│           │   ├── diamond.tsv
│           │   ├── gold.tsv
│           │   ├── platinum.tsv
│           │   └── stamps
│           │       ├── diamond.change.qld.map
│           │       ├── diamond.original.qld.map
│           │       ├── gold.change.qld.map
│           │       ├── gold.original.qld.map
│           │       ├── platinum.change.qld.map
│           │       └── platinum.original.qld.map
│           ├── qld.success.10.agg.all.tsv
│           ├── qld.success.10.agg.all_.tsv
│           ├── qld.success.10.agg.gold.tsv
│           └── qld.success.10.boxes
│               ├── diamond.tsv
│               ├── gold.tsv
│               ├── platinum.tsv
│               └── stamps
│                   ├── diamond.change.qld.success.10
│                   ├── diamond.original.qld.success.10
│                   ├── gold.change.qld.success.10
│                   ├── gold.original.qld.success.10
│                   ├── platinum.change.qld.success.10
│                   └── platinum.original.qld.success.10
```
| query set | final gold standard dataset | data folder (raw and preprocessed) | output folder (model, predictions, ...) |
|---|---|---|---|
| `msmarco.passage` | `./output/msmarco.passage/t5.base.gc.docs.query.passage/bm25.map.agg.gold.tsv` (398 MB) | `./data/raw/msmarco.passage`, `./data/preprocessed/msmarco.passage` | `./output/msmarco.passage/t5.base.gc.docs.query.passage/` |
| `aol-ia.title` | `./output/aol-ia/t5.base.gc.docs.query.title/bm25.map.agg.gold.tsv` (756 MB) | `./data/raw/aol-ia`, `./data/preprocessed/aol-ia` | `./output/aol-ia/t5.base.gc.docs.query.title/` |
| `aol-ia.url.title` | `./output/aol-ia/t5.base.gc.docs.query.url.title/bm25.map.agg.gold.tsv` (706 MB) | `./data/raw/aol-ia`, `./data/preprocessed/aol-ia` | `./output/aol-ia/t5.base.gc.docs.query.url.title/` |
Here are the refined queries for two original queries in `./output/aol-ia/t5.base.gc.docs.query.title/bm25.map.agg.gold.tsv` for `aol-ia.title`:
```
qid             order    query                          bm25.map
8c418e7c9e5993  -1       rentdirect com                 0.0
8c418e7c9e5993  pred.9   hurston apartments             0.0556
8c418e7c9e5993  pred.4   rental apartments              0.0357
8c418e7c9e5993  pred.8   apartments nyc                 0.0312
8c418e7c9e5993  pred.6   first class apartments in nyc  0.0135
8c418e7c9e5993  pred.2   apartments and new york        0.0068
0cc411681d1441  -1       staple com                     0.037
0cc411681d1441  pred.8   staple pubs                    1.0
0cc411681d1441  pred.7   staple england pub             0.5
0cc411681d1441  pred.1   staple east of england         0.1
0cc411681d1441  pred.10  staple                         0.0385
0cc411681d1441  pred.3   staple england                 0.0385
0cc411681d1441  pred.4   staple england                 0.0385
0cc411681d1441  pred.6   staple england                 0.0385
```
As seen, `order: -1` shows the original query with its retrieval performance. The remaining rows show the refined queries in decreasing order of retrieval performance. For instance, for the original query `staple com`, the retrieval performance is `bm25.map: 0.037`, while the best refined query improves it to `bm25.map: 1.0`!
`RePair` has generated gold-standard query refinement datasets for the `msmarco.passage` and `aol-ia` query sets using the `t5.base` transformer on google cloud's tpus (`gc`), with the `docs.query` pairing strategy, the `bm25` ranker, and the `map` evaluation metric. The gold datasets, along with all the artifacts, including the preprocessed `docs.query` pairs, model checkpoints, predicted refined queries, their search results, and evaluation metric values, are available at the above links. The running settings were:
```python
settings = {
    't5model': 'base.gc',
    'iter': 4000,         # number of finetuning iterations for t5
    'nchanges': 10,       # number of changes to a query
    'ranker': 'bm25',
    'batch': 100,         # search per batch of queries using pyserini; if None, search per query
    'topk': 100,          # number of retrieved documents for a query
    'large_ds': True,     # set if dataset size is greater than one million
    'metric': 'map',      # any valid trec_eval.9.0.4 metric like map, ndcg, recip_rank, ...
    'box': {'gold': 'refined_q_metric >= original_q_metric and refined_q_metric > 0',
            'platinum': 'refined_q_metric > original_q_metric',
            'diamond': 'refined_q_metric > original_q_metric and refined_q_metric == 1'},
    'msmarco.passage': {
        'index': '../data/raw/msmarco.passage/lucene-index.msmarco-v1-passage.20220131.9ea315/',
        'pairing': [None, 'docs', 'query'],  # input=doc(s), output=query; s means concat of relevant docs
        'lseq': {"inputs": 32, "targets": 256},  # query length and doc length for t5 model
    },
    'aol-ia': {
        'index_item': ['title'],  # ['url'], ['title', 'url'], ['title', 'url', 'text']
        'index': '../data/raw/aol-ia/lucene-index/title/',
        'pairing': [None, 'docs', 'query'],  # input=doc(s), output=query
        'lseq': {"inputs": 32, "targets": 256},  # query length and doc length for t5 model
        'filter': {'minql': 1, 'mindocl': 10},  # [min query length, min doc length]; after merging queries with relevant 'index_item', drop the row if |query| <= minql or |'index_item'| < mindocl
    }
}
```
| query set={q} | #q | avg\|q\| | avg bm25.map(q) | #gold | avg\|q*\| | % | avg bm25.map(q*) | Δ% | #diamond (ap=1) | %q* |
|---|---|---|---|---|---|---|---|---|---|---|
| msmarco.passage | 502,939 | 5.9675 | 0.0862 | 414,337 | 7.4419 | 82% | 0.5704 | +562% | 176,922 | 35% |
| aol-ia-title | 4,459,613 | 3.5849 | 0.0252 | 2,583,023 | 3.1270 | 58% | 0.4175 | +1,556% | 649,764 | 14% |
| aol-ia-url-title | 4,672,506 | 3.5817 | 0.0271 | 2,421,347 | 3.5354 | 52% | 0.3997 | +1,374% | 591,001 | 13% |
We benefit from `trec_eval`, `pyserini`, `ir-datasets`, and other libraries. We would like to thank the authors of these libraries and helpful resources.
©2024. This work is licensed under a CC BY-NC-SA 4.0 license.
```bibtex
@inproceedings{DBLP:conf/cikm/NarayananF23,
  author    = {Yogeswar Lakshmi Narayanan and Hossein Fani},
  title     = {RePair: An Extensible Toolkit to Generate Large-Scale Datasets via Transformers for Query Refinement},
  booktitle = {Proceedings of the 32nd {ACM} International Conference on Information {\&} Knowledge Management, University of Birmingham and Eastside Rooms, UK, October 21-25, 2023},
  publisher = {{ACM}},
  year      = {2023},
  doi       = {10.1145/3583780.3615129},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```