# BERTQE Integration with PyTerrier

Craig Macdonald, University of Glasgow
21/12/2020

In [1]:
#!conda install -y numpy
#!pip install --force-reinstall -r requirements.txt

In [2]:
#this suppresses the various Tensorflow INFO messages.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import warnings
warnings.filterwarnings("ignore", message=r"Passing", category=FutureWarning)

import tensorflow as tf
# more work at suppressing logging messages
tf.get_logger().setLevel('WARN')

# finally, check the GPU is activated
tf.test.is_gpu_available()

True

## PyTerrier setup

This code block assumes that:
 - you have PyTerrier installed;
 - you have an index of the MSMARCO passage ranking dataset with text metadata. See the [PyTerrier indexing documentation](https://pyterrier.readthedocs.io/en/latest/modules/terrier-indexing.html) for how to create this.

We then load the index and set up a DPH retrieval transformer.

In [3]:
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
dataset = pt.get_dataset("trec-deep-learning-passages")
index = pt.IndexFactory.of("/users/tr.craigm/projects/trec2020/passage_index/data.properties")

DPH = pt.BatchRetrieve(index, wmodel="DPH", metadata=["docno", "text"], num_results=10, verbose=True)


23:30:06.287 [main] WARN  o.t.structures.CompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 2.2 GiB of memory would be required.


## Loading our custom transformer

We implement a PyTerrier transformer class to perform the integration with PyTerrier. In particular, the BERTQE class extends TransformerBase, and is defined in pyt_bertqe.py. It borrows code from expansion_inference.py and functions.py.

To instantiate it, we provide:
 - the location of our BERT configuration (we used BERT-base)
 - the location of the checkpoint provided by the original authors of BERT-QE (we used Robust04 Fold 1 for BERT-base).

Once fully instantiated, the last output of the call below should be `BERTQE Ready`.

In [4]:
from pyt_bertqe import BERTQE
bqe = BERTQE(
    "/users/tr.craigm/projects/pyterrier/BERT-QE/robust04_base/bert_config.json", 
    "/users/tr.craigm/projects/pyterrier/BERT-QE/robust04_base/Fold1/model.ckpt-9375",
    verbose=True)




The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

BERTQE Ready


## Testing the transformer

Let's see if it works. We give it one query (`chemical reactions`) and the text of two documents. One document is clearly more related to the query than the other, even though both documents have the same length and each matches only one query term.

The output should be a ranking where, hopefully, d1 gets a higher score than d2.

Note that, compared to the original BERT QE code, there is no need to encode passages into files before running the QE phase.

In [5]:
import pandas as pd
df = pd.DataFrame([
        ["q1", "chemical reactions", "d1", "profossor proton demonstrated the chemical reaction"], 
        ["q1", "chemical reactions", "d2", "the chemical brothers started their gig"]
    ], columns=["qid", "query", "docno", "text"])


bqe.transform(df)


Instructions for updating:
Use keras.layers.dense instead.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use standard file APIs to check for files with this prefix.



| | qid | query | docno | score | rank |
|---|---|---|---|---|---|
| 0 | q1 | chemical reactions | d1 | 0.092511 | 0 |
| 1 | q1 | chemical reactions | d2 | 0.06637 | 1 |


This example looks good - d1 gets a higher score than d2!

## PyTerrier Ranking Integration

Now let's formulate a ranking pipeline in which the output of DPH is re-ranked by BERT QE.

In [6]:
pipe1 = DPH >> bqe
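
The `>>` operator chains two transformers, so that the output of the first becomes the input of the second. As a rough illustration of that idea only, here is a toy sketch in plain Python using `__rshift__`. The `Stage` class and the two stage functions are hypothetical stand-ins invented for this example; they are not PyTerrier's implementation, and the second stage is not the BERT QE scorer.

```python
class Stage:
    """A hypothetical pipeline stage wrapping a function from rows to rows."""

    def __init__(self, fn):
        self.fn = fn

    def transform(self, rows):
        return self.fn(rows)

    def __call__(self, rows):
        return self.transform(rows)

    def __rshift__(self, other):
        # a >> b builds a new stage that runs a, then feeds its output to b
        return Stage(lambda rows: other.transform(self.transform(rows)))


def first_stage(rows):
    # stand-in for a retriever: keep the two highest-scoring rows
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:2]


def second_stage(rows):
    # stand-in for a re-ranker: rescore by text length (purely illustrative)
    rescored = [dict(r, score=float(len(r["text"]))) for r in rows]
    return sorted(rescored, key=lambda r: r["score"], reverse=True)


rows = [
    {"docno": "d1", "text": "short", "score": 3.0},
    {"docno": "d2", "text": "a longer passage", "score": 2.0},
    {"docno": "d3", "text": "mid length", "score": 1.0},
]
toy_pipe = Stage(first_stage) >> Stage(second_stage)
result = toy_pipe(rows)
print([r["docno"] for r in result])
```

The second stage only ever sees what the first stage returned, which is why the number of results kept by the first stage bounds what the re-ranker can reorder.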

Let's try to execute that pipeline on the first TREC DL passage ranking topic.

The output should be a ranking of passages for that query, as scored by BERT QE. You can also see how the order differs from that provided by DPH.

In [7]:
pipe1(dataset.get_topics("test-2019").head(1))

23:30:14.212 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8



| | qid | query | docno | score | rank |
|---|---|---|---|---|---|
| 0 | 1108939 | what slows down the flow of blood | 3652655 | 0.081384 | 9 |
| 1 | 1108939 | what slows down the flow of blood | 4744533 | 0.090464 | 0 |
| 2 | 1108939 | what slows down the flow of blood | 6707713 | 0.085161 | 7 |
| 3 | 1108939 | what slows down the flow of blood | 7152561 | 0.090107 | 1 |
| 4 | 1108939 | what slows down the flow of blood | 4069373 | 0.086065 | 6 |
| 5 | 1108939 | what slows down the flow of blood | 5992241 | 0.087156 | 4 |
| 6 | 1108939 | what slows down the flow of blood | 841975 | 0.087522 | 3 |
| 7 | 1108939 | what slows down the flow of blood | 6041119 | 0.08411 | 8 |
| 8 | 1108939 | what slows down the flow of blood | 130390 | 0.088443 | 2 |
| 9 | 1108939 | what slows down the flow of blood | 6959553 | 0.086724 | 5 |
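
In the output above, the `rank` column is consistent with sorting the passages by descending `score` and numbering them from 0. The following minimal sketch reproduces that numbering in plain Python from the scores shown. It assumes that the row order of the output reflects the order in which the passages arrived from DPH; the helper `ranks_from_scores` is a hypothetical name for this example and is not part of PyTerrier.

```python
# (docno, BERT QE score) pairs, in the row order of the output above
scored = [
    ("3652655", 0.081384),
    ("4744533", 0.090464),
    ("6707713", 0.085161),
    ("7152561", 0.090107),
    ("4069373", 0.086065),
    ("5992241", 0.087156),
    ("841975", 0.087522),
    ("6041119", 0.08411),
    ("130390", 0.088443),
    ("6959553", 0.086724),
]


def ranks_from_scores(pairs):
    """Return {docno: rank}, where rank 0 is the highest score."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    return {docno: rank for rank, (docno, _) in enumerate(ordered)}


ranks = ranks_from_scores(scored)
for position, (docno, score) in enumerate(scored):
    # compare each passage's row position with its BERT QE rank
    print(f"{docno}: row {position} -> rank {ranks[docno]}")
```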


## PyTerrier Experiment

Now let's conduct an experiment: we want to compare the effectiveness of BERT QE with that of DPH on the passage ranking topics.
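
The experiment reports MAP (`map`) and NDCG@10 (`ndcg_cut_10`). For readers unfamiliar with these measures, here is a small sketch of how each can be computed for a single query. It uses one common formulation (linear gain with a `log2(rank + 1)` discount for NDCG, and any label above 0 counted as relevant for AP); the evaluation tool used by PyTerrier may differ in details such as relevance thresholds, so treat this as an illustration rather than a reference implementation. The labels are invented for the example.

```python
import math


def average_precision(ranked_rels, num_relevant):
    """AP for one query: ranked_rels holds 0/1 labels in ranked order."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """NDCG@k with linear gain and a log2(rank + 1) discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# hypothetical graded labels for one query's ranking (not from the experiment)
ranked_gains = [3, 0, 2, 0, 1]
all_gains = [3, 2, 1, 1]  # every judged-relevant passage for that query
binary = [1 if g > 0 else 0 for g in ranked_gains]

ap = average_precision(binary, num_relevant=len(all_gains))
ndcg = ndcg_at_k(ranked_gains, all_gains, k=10)
print(round(ap, 4), round(ndcg, 4))
```

MAP is then the mean of the per-query AP values, and the reported NDCG@10 is the mean of the per-query NDCG@10 values.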

In [8]:
# there are 200 topics, but only 43 appear in the qrels; let's shortcut down to just those topics with corresponding qrels.
topics_with_judgements = dataset.get_topics("test-2019").merge(dataset.get_qrels("test-2019")[["qid"]], on="qid").drop_duplicates()

pt.Experiment(
    [DPH, pipe1], 
    topics_with_judgements, 
    dataset.get_qrels("test-2019"), 
    eval_metrics=["map", "ndcg_cut_10"],
    names=["DPH", "DPH >> BERT_QE"],
    baseline=0
)

23:30:20.209 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8



| | name | map | ndcg_cut_10 | map + | map - | map p-value | ndcg_cut_10 + | ndcg_cut_10 - | ndcg_cut_10 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DPH | 0.11107 | 0.502513 | | | | | | |
| 1 | DPH >> BERT_QE | 0.108519 | 0.499562 | 11.0 | 19.0 | 0.597111 | 19.0 | 19.0 | 0.809695 |


So, according to this experiment, re-ranking with BERT QE gave no improvement *on average* over DPH on the MSMARCO passage ranking dataset: MAP fell slightly from 0.111 to 0.109, and NDCG@10 from 0.503 to 0.500. At the query level the picture is mixed. For MAP, 11 queries were improved and 19 were degraded; for NDCG@10, 19 queries were improved and 19 were degraded. In short, there was no significant difference ($p=0.60$ for MAP, $p=0.81$ for NDCG@10).
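
The `+`, `-` and p-value columns come from comparing each system with the baseline query by query. To our understanding, `pt.Experiment` counts the queries improved and degraded and applies a paired t-test. The sketch below shows the counting and the paired t statistic in plain Python; the per-query values are invented for illustration and are not the values from this experiment. Turning the t statistic into a p-value needs the t distribution, which is outside the standard library (for instance, `scipy.stats.ttest_rel` would do both steps).

```python
import math
from statistics import mean, stdev

# hypothetical per-query NDCG@10 values (not the values from the experiment)
baseline = [0.50, 0.62, 0.40, 0.71, 0.55, 0.30]
reranked = [0.55, 0.60, 0.40, 0.69, 0.61, 0.28]

# per-query difference: positive means the re-ranker did better on that query
diffs = [r - b for b, r in zip(baseline, reranked)]
improved = sum(d > 0 for d in diffs)
degraded = sum(d < 0 for d in diffs)

# paired t statistic: mean difference divided by its standard error
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(f"improved={improved} degraded={degraded} t={t_stat:.3f}")
```

Queries whose score is unchanged count as neither improved nor degraded, which is why the two counts in the results above do not add up to 43.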

However, these results are still promising, because this notebook does not yet demonstrate a full implementation of the BERT-QE paper. In particular:
 - Equation (a) and the equation in the appendix both need to be integrated and checked.
 - The role of qc_scores is not yet fully understood.
 - Results should be replicated for the Robust04 and GOV2 datasets.