Serverless BM25 and BERT reranking pipeline

This repository contains the code for the paper titled Serverless BM25 Search and BERT Reranking. As mentioned in the paper, the project was developed in two parts: BM25 retrieval and BERT reranking.

Stage 1: For the BM25 retrieval task, refer to bm25_retrieval.

Stage 2: For BERT reranking, the instructions can be found in rerank.

This project is a retrieval and reranking pipeline built on top of AWS. Once BM25 retrieval and BERT reranking have been set up, AWS API Gateway provides a REST API endpoint through which queries can be passed to each stage.

We provide a script, run_end_to_end.py, that runs this pipeline end to end. It needs the API URLs and the path to the dataset from which the queries are read. For the experiments in the paper, we used the dev set of the MS MARCO passage dataset, which is available at collections/msmarco-passage/queries.dev.small.tsv once you have set up Anserini as part of setting up BM25 retrieval.

The input file has the following format:

head collections/msmarco-passage/queries.dev.small.tsv
1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number

Any file in which each line holds a query ID (qid) and a query, separated by a tab character (\t), will work as input.
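The input format above can be read with a few lines of Python. This is a minimal sketch, not the code from run_end_to_end.py: the function name read_queries and its limit argument are assumptions made for illustration.

```python
import io
from typing import Iterable, Iterator, Optional, Tuple


def read_queries(lines: Iterable[str], limit: Optional[int] = None) -> Iterator[Tuple[str, str]]:
    """Yield (qid, query) pairs from tab-separated lines.

    Hypothetical helper: the name and the `limit` argument are
    illustrative and not taken from run_end_to_end.py.
    """
    count = 0
    for line in lines:
        if limit is not None and count >= limit:
            break
        qid, sep, query = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip blank lines and lines without a tab
        # Strip stray whitespace, e.g. the leading space in " Androgen receptor define".
        yield qid.strip(), query.strip()
        count += 1


# Two lines copied from the sample above, held in memory instead of a file.
sample = io.StringIO(
    "1048585\twhat is paula deen's brother\n"
    "2\t Androgen receptor define\n"
)
queries = list(read_queries(sample))
```

With a real file, the same function can be called as read_queries(open(path, encoding="utf-8")).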

In run_end_to_end.py, we go through each query in the source file and first call the search API to return 1000 docids. These docids are then used to fetch the corresponding passages from DynamoDB (see fetch_msmarco_passage_all.py), where we ingested the corpus as part of setting up BM25 search.

Before running the experiment, replace the URL in run_end_to_end.py with your own.
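Fetching 1000 passages per query from DynamoDB usually means splitting the docids into smaller requests, because the DynamoDB BatchGetItem operation accepts at most 100 keys per call. Whether fetch_msmarco_passage_all.py uses batch reads is not stated here, so the sketch below only illustrates the chunking step, and the helper name chunk_docids is hypothetical.

```python
from typing import Iterator, List, Sequence

# DynamoDB's BatchGetItem accepts at most 100 keys per request.
MAX_KEYS_PER_BATCH = 100


def chunk_docids(docids: Sequence[str], size: int = MAX_KEYS_PER_BATCH) -> Iterator[List[str]]:
    """Split docids into consecutive batches of at most `size` items.

    Hypothetical helper; not taken from fetch_msmarco_passage_all.py.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(docids), size):
        yield list(docids[start:start + size])


# 1000 docids, as returned by the search stage, split into batches.
docids = [str(i) for i in range(1000)]
batches = list(chunk_docids(docids))
```

Each batch could then be sent as one batch read request, with the order of the original docid list preserved across batches.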

For testing purposes, we can pass the --limit parameter, which specifies the number of queries the script will run through. For example:

python3 run_end_to_end.py --limit 100
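A flag like --limit is typically declared with argparse. The sketch below is an assumption about how such a flag could be defined, not the parser used in run_end_to_end.py; the default of None (meaning "run all queries") is also an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a --limit flag (illustrative, not the script's own parser)."""
    parser = argparse.ArgumentParser(description="Run the retrieval and reranking pipeline.")
    parser.add_argument(
        "--limit",
        type=int,
        default=None,  # assumed default: no limit, process every query
        help="number of queries to run through",
    )
    return parser


# Parse an explicit argument list, equivalent to passing `--limit 100` on the command line.
args = build_parser().parse_args(["--limit", "100"])
```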

The script runs through limit queries. For each query, it retrieves 1000 passages using BM25, passes the passage contents and the query to the BERT API for reranking, sorts the results by score, and picks the top k (default is 100).
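The per-query flow described above can be sketched as a single function. The three stages are passed in as callables because the actual request and response formats of the search API, the DynamoDB lookup, and the BERT API are not described here. All names below (rerank_query, search, fetch_passages, score) are hypothetical, and the stub implementations only stand in for the real services.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rerank_query(
    query: str,
    search: Callable[[str, int], List[str]],
    fetch_passages: Callable[[Sequence[str]], Dict[str, str]],
    score: Callable[[str, List[str]], List[float]],
    hits: int = 1000,
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Retrieve, fetch, rerank, and keep the top_k (docid, score) pairs."""
    docids = search(query, hits)                 # stage 1: BM25 retrieval
    passages = fetch_passages(docids)            # passage text keyed by docid
    present = [d for d in docids if d in passages]
    scores = score(query, [passages[d] for d in present])  # stage 2: reranking
    ranked = sorted(zip(present, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Stand-in stubs so the sketch runs without AWS; real calls would hit the APIs.
corpus = {
    "d1": "tension headaches treatment",
    "d2": "the prime rate in canada",
    "d3": "prime numbers",
}


def stub_search(query: str, hits: int) -> List[str]:
    return list(corpus)[:hits]


def stub_fetch(docids: Sequence[str]) -> Dict[str, str]:
    return {d: corpus[d] for d in docids}


def stub_score(query: str, texts: List[str]) -> List[float]:
    # Toy scorer: number of query words that appear in the passage.
    words = set(query.split())
    return [float(len(words & set(t.split()))) for t in texts]


result = rerank_query("prime rate canada", stub_search, stub_fetch, stub_score, top_k=2)
```

Keeping the stages as parameters makes it possible to test the sorting and top-k logic locally before pointing the functions at the deployed endpoints.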

castorini/serverless-bert-reranking
