# PyTerrier ANCE Demo Notebook - Vaswani

This notebook demonstrates the use of the [PyTerrier plugin for ANCE](https://github.com/terrierteam/pyterrier_ance) for dense passage retrieval.

[ANCE](https://github.com/microsoft/ANCE) is a dense retrieval system that encodes each document and each query as a single vector. ANCE does not need to be combined with sparse retrieval. Its training mechanism constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process, so that it selects more realistic negative training instances than those selected by a sparse retrieval mechanism.
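
To make the negative-sampling idea concrete, here is a minimal sketch in plain Python. It is not the ANCE training code: an exhaustive dot-product search stands in for the ANN index, and the names `dot` and `hard_negatives`, as well as the toy vectors, are assumptions made for illustration.

```python
# Conceptual sketch only: exhaustive search stands in for the ANN index,
# and all names and vectors here are illustrative, not taken from ANCE.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def hard_negatives(query_vec, doc_vecs, positives, k=2):
    """Return the k highest-scoring documents that are NOT labelled relevant.

    These are the documents the current encoder confuses with relevant ones,
    which makes them more informative negatives than random documents.
    """
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return [d for d in ranked if d not in positives][:k]

# Toy two-dimensional "embeddings" for four documents and one query.
doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.3],
    "d3": [0.1, 0.9],
    "d4": [0.7, 0.2],
}
query_vec = [1.0, 0.0]

negatives = hard_negatives(query_vec, doc_vecs, positives={"d1"}, k=2)
print(negatives)
```

Because the index is refreshed as the encoder learns, the negatives selected this way keep tracking the documents that the current model ranks too highly.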

ANCE is built on top of [BERT](https://arxiv.org/abs/1810.04805). Using only dot products in the ANCE-learned representation space, it nearly matches the accuracy of a sparse retrieval and BERT reranking pipeline, while providing an almost 100x speed-up.

The corpus used in this demo is the [Vaswani NPL corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a corpus of 11,429 scientific abstracts, with corresponding queries and relevance assessments.

## Installation 

We need to install [PyTerrier](https://github.com/terrier-org/pyterrier).

In [1]:
!pip install -q python-terrier

[ANCE](https://github.com/microsoft/ANCE) requires [FAISS](https://github.com/facebookresearch/faiss), a library for efficient similarity search and clustering of dense vectors.

This is the setup for FAISS on Colab. Outside of Colab your mileage may vary; on macOS, for example, `apt` is not available, as the output below shows.

In [2]:
!apt install libomp-dev
!pip install faiss

Unable to locate an executable at "/Library/Java/JavaVirtualMachines/openjdk-8.jdk/Contents/Home/bin/apt" (-1)
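
As a rough illustration of what FAISS provides, the sketch below runs an exact top-k inner-product search. It uses `faiss.IndexFlatIP` when FAISS can be imported and otherwise falls back to plain Python, so it runs either way. The helper name `top_k_inner_product` and the toy vectors are assumptions for this example; they are not part of `pyterrier_ance`, which may configure its index differently.

```python
# Sketch: exact top-k inner-product search, the operation FAISS accelerates.
# `top_k_inner_product` and the toy vectors are illustrative names only.

def top_k_inner_product(doc_vecs, query_vec, k):
    """Return the positions of the k documents with the largest dot product."""
    try:
        import faiss
        import numpy as np
    except ImportError:
        # Pure-Python fallback so the sketch still runs without FAISS.
        scores = [sum(d * q for d, q in zip(doc, query_vec)) for doc in doc_vecs]
        return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

    index = faiss.IndexFlatIP(len(query_vec))          # flat (exact) inner-product index
    index.add(np.asarray(doc_vecs, dtype="float32"))   # FAISS expects float32 rows
    _, ids = index.search(np.asarray([query_vec], dtype="float32"), k)
    return [int(i) for i in ids[0]]

doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.6]]
top = top_k_inner_product(doc_vecs, query_vec=[1.0, 0.0], k=2)
print(top)
```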


This installs the [PyTerrier plugin for ANCE](https://github.com/terrierteam/pyterrier_ance). It supplies an indexer and a retrieval transformer. This also installs [ANCE](https://github.com/microsoft/ANCE).

In [3]:
!pip install --upgrade git+https://github.com/seanmacavaney/pyterrier_ance.git@reranker

Collecting git+https://github.com/seanmacavaney/pyterrier_ance.git@reranker
  Cloning https://github.com/seanmacavaney/pyterrier_ance.git (to revision reranker) to /private/var/folders/_l/bjhppdnd3k1_5g6bgx2p9mqw0000gn/T/pip-req-build-qyz10tua
  Running command git clone -q https://github.com/seanmacavaney/pyterrier_ance.git /private/var/folders/_l/bjhppdnd3k1_5g6bgx2p9mqw0000gn/T/pip-req-build-qyz10tua
  Running command git checkout -b reranker --track origin/reranker
  Switched to a new branch 'reranker'
  Branch 'reranker' set up to track remote branch 'reranker' from 'origin'.
Collecting ANCE@ git+https://github.com/cmacdonald/ANCE.git
  Cloning https://github.com/cmacdonald/ANCE.git to /private/var/folders/_l/bjhppdnd3k1_5g6bgx2p9mqw0000gn/T/pip-install-yvnwutkf/ance_c73f747fcab047b0bbbffc0ddf3d84bc
  Running command git clone -q https://github.com/cmacdonald/ANCE.git /private/var/folders/_l/bjhppdnd3k1_5g6bgx2p9mqw0000gn/T/pip-install-yvnwutkf/ance_c73f747fcab047b0bbbffc0ddf3d84b

Collecting s3transfer<0.5.0,>=0.4.0
  Downloading s3transfer-0.4.2-py2.py3-none-any.whl (79 kB)
Collecting botocore<1.21.0,>=1.20.101
  Downloading botocore-1.20.101-py2.py3-none-any.whl (7.7 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Using cached jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting typing-extensions
  Using cached typing_extensions-3.10.0.0-py3-none-any.whl (26 kB)
Building wheels for collected packages: pyterrier-ance, ANCE, pytrec-eval
  Building wheel for pyterrier-ance (setup.py) ... done
  Created wheel for pyterrier-ance: filename=pyterrier_ance-0.0.1-py3-none-any.whl size=5521 sha256=1e2ca2fec783df7d4e0c5f0615cb6c8339264ab672191c5a2b04b5b1c5ea9585
  Stored in directory: /private/var/folders/_l/bjhppdnd3k1_5g6bgx2p9mqw0000gn/T/pip-ephem-wheel-cache-w3i00ma6/wheels/f3/99/14/270f46700f5f46f42065ab17b03

## Setup

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org) IR platform.

In [4]:
import pyterrier as pt
pt.init(tqdm='notebook')

PyTerrier 0.6.0 has loaded Terrier 5.5 (built by craigmacdonald on 2021-05-20 13:12)


We are using the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/); let's load it so that we can access its corpus, topics, and qrels.

In [5]:
dataset = pt.get_dataset("irds:vaswani")

This downloads the model checkpoint listed on the [ANCE github repository](https://github.com/microsoft/ANCE/#results). Download time varies with bandwidth: it typically takes 11 to 12 minutes, although the run logged below took about 43 minutes.

In [6]:
import os
if not os.path.exists("Passage_ANCE_FirstP_Checkpoint.zip"):
  !wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
  !unzip Passage_ANCE_FirstP_Checkpoint.zip

--2021-06-27 00:06:15--  https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
Resolving webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)... 52.239.193.68
Connecting to webdatamltrainingdiag842.blob.core.windows.net (webdatamltrainingdiag842.blob.core.windows.net)|52.239.193.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1277112820 (1,2G) [application/octet-stream]
Saving to: ‘Passage_ANCE_FirstP_Checkpoint.zip’


2021-06-27 00:49:37 (480 KB/s) - ‘Passage_ANCE_FirstP_Checkpoint.zip’ saved [1277112820/1277112820]

Archive:  Passage_ANCE_FirstP_Checkpoint.zip
   creating: Passage ANCE(FirstP) Checkpoint/
  inflating: Passage ANCE(FirstP) Checkpoint/config.json  
  inflating: Passage ANCE(FirstP) Checkpoint/desktop.ini  
  inflating: Passage ANCE(FirstP) Checkpoint/merges.txt  
  inflating: Passage ANCE(FirstP) Checkpoint/optimizer.pt  
  inflating: P
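
Outside a notebook, where the `!wget` and `!unzip` shell escapes are not available, the same "download once, extract once" step could be written with the standard library. This is a minimal sketch: `fetch_and_unzip` is a hypothetical helper name, and it assumes the archive unpacks into a single top-level folder, as the listing above suggests.

```python
import os
import urllib.request
import zipfile

def fetch_and_unzip(url, zip_path, extracted_dir):
    """Download `url` to `zip_path` and extract it, skipping steps already done."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    if not os.path.isdir(extracted_dir):
        with zipfile.ZipFile(zip_path) as archive:
            # Assumption: all archive members sit under one top-level folder,
            # so extracting into the parent directory creates `extracted_dir`.
            archive.extractall(os.path.dirname(extracted_dir) or ".")

# Example call (commented out because the archive is about 1.2 GB):
# fetch_and_unzip(
#     "https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip",
#     "Passage_ANCE_FirstP_Checkpoint.zip",
#     "Passage ANCE(FirstP) Checkpoint",
# )
```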

## Indexing

This indexes the [Vaswani dataset](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/). Indexing takes about 3 minutes using a Colab GPU.

In [None]:
!rm -rf ./anceindex

import pyterrier_ance
indexer = pyterrier_ance.ANCEIndexer(checkpoint_path="./Passage ANCE(FirstP) Checkpoint",
                                     index_path="./anceindex",
                                     num_docs=11429)
indexer.index(dataset.get_corpus_iter())

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[INFO] loading configuration file ./Passage ANCE(FirstP) Checkpoint/config.json
[INFO] Model config {
  "_num_labels": 2,
  "architectures": [
    "RobertaDot_NLL_LN"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bad_words_ids": null,
  "bos_token_id": 0,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": 2,
  "eos_token_ids": 0,
  "finetuning_task": "MSMarco",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 514,
  "min_length": 0,
  "model_type": "roberta",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "num_

Using mean: False


[INFO] Inference parameters <pyterrier_ance. object at 0x7fd43697dc40>


Indexing:   0%|          | 0/11429 [00:00<?, ?d/s]

Segment 0


[INFO] ***** Running ANN Embedding Inference *****
[INFO]   Batch size = 128


Inferencing: 0it [00:00, ?it/s]

Not running in distributed mode
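
The log above reports a batch size of 128. As a rough picture of what batched encoding over a corpus iterator looks like, here is a small sketch; the `batched` helper and the toy documents are assumptions for illustration and do not reproduce the indexer's internals.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Toy stand-in for a corpus iterator yielding dicts with `docno` and `text`.
docs = ({"docno": str(i), "text": f"document {i}"} for i in range(300))

# Each batch would be encoded in one forward pass; here we only count sizes.
sizes = [len(batch) for batch in batched(docs, 128)]
print(sizes)  # [128, 128, 44]
```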


We will not need the indexer anymore, so we free up some memory.

In [None]:
del indexer

The indexing procedure generates a number of [FAISS](https://github.com/facebookresearch/faiss) shards, together with some additional files.

In [None]:
!ls ./anceindex

## Retrieval

Now that indexing has completed, we can load the index and the checkpoint model (which we will need for encoding queries). Index loading can take some time, as the [FAISS](https://github.com/facebookresearch/faiss) shards need to be loaded into main memory.

In [None]:
ance_retr = pyterrier_ance.ANCERetrieval(checkpoint_path="./Passage ANCE(FirstP) Checkpoint",
                                         index_path="./anceindex")

Here we ask [PyTerrier](https://github.com/terrier-org/pyterrier) to search the [ANCE](https://github.com/microsoft/ANCE) index for `'chemical reactions'`, returning the 10 highest-ranked documents.

In [None]:
(ance_retr % 10).search("chemical reactions")
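
In PyTerrier, `%` is the rank cutoff operator, so `ance_retr % 10` keeps the 10 highest-ranked documents for each query. The sketch below imitates that behaviour on plain Python dicts. It is a conceptual illustration, not PyTerrier's implementation; `rank_cutoff` is a hypothetical name, and the `rank` values are assumed to start at 0.

```python
# Conceptual sketch of a per-query rank cutoff; not PyTerrier's implementation.

def rank_cutoff(results, k):
    """Keep the k highest-scoring rows for each query id (ranks start at 0)."""
    by_query = {}
    for row in results:
        by_query.setdefault(row["qid"], []).append(row)
    kept = []
    for rows in by_query.values():
        best = sorted(rows, key=lambda r: r["score"], reverse=True)[:k]
        kept.extend({**row, "rank": rank} for rank, row in enumerate(best))
    return kept

results = [
    {"qid": "q1", "docno": "d1", "score": 2.0},
    {"qid": "q1", "docno": "d2", "score": 5.0},
    {"qid": "q1", "docno": "d3", "score": 3.5},
    {"qid": "q2", "docno": "d1", "score": 1.0},
]
top2 = rank_cutoff(results, k=2)
print([(r["qid"], r["docno"], r["rank"]) for r in top2])
```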

## Running an Experiment

Let's prepare an experiment. First, let's create a BM25 baseline transformer.

In [None]:
bm25 = pt.BatchRetrieve(pt.get_dataset("vaswani").get_index(), wmodel="BM25")

You can also use ANCE as a text scorer (in a re-ranking setting): here it re-scores the top 100 BM25 results. We'll include this pipeline in the comparison as well.

In [None]:
ance_rerank = (
    (bm25 % 100)
    >> pt.text.get_text(dataset, 'text')
    >> pyterrier_ance.ANCETextScorer(checkpoint_path="./Passage ANCE(FirstP) Checkpoint")
)
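
The re-ranking pipeline has three stages: take a candidate list from a first-stage ranker, look up the text of each candidate, and re-score the candidates with a second model. The sketch below mirrors that flow in plain Python. Everything in it is a toy assumption: `encode` is a word-count stand-in for a learned encoder, and `rerank` is a hypothetical helper, not the `ANCETextScorer` implementation.

```python
# Toy re-ranking sketch; `encode` is a word-count stand-in for a neural encoder.

VOCAB = ["chemical", "reactions", "circuit"]

def encode(text):
    """Map text to a vector of term counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def rerank(query, candidates, texts):
    """Re-score first-stage candidates by dot product and sort by the new score."""
    q_vec = encode(query)
    scored = []
    for docno in candidates:
        d_vec = encode(texts[docno])  # text lookup, then encoding
        scored.append((docno, sum(q * d for q, d in zip(q_vec, d_vec))))
    return [docno for docno, _ in sorted(scored, key=lambda p: p[1], reverse=True)]

texts = {
    "d1": "circuit design notes",
    "d2": "chemical reactions in chemical plants",
    "d3": "reactions of a circuit",
}
first_stage = ["d1", "d3", "d2"]  # candidate order from the first-stage ranker

reranked = rerank("chemical reactions", first_stage, texts)
print(reranked)
```

Note that a re-ranker can only reorder the candidates it is given: a relevant document that the first stage leaves out of its top results is never scored.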

Finally, let's evaluate effectiveness and efficiency, comparing both ANCE configurations against the BM25 baseline created above for the same corpus.

In [None]:
pt.Experiment(
    [bm25, ance_rerank, ance_retr], 
    dataset.get_topics(), 
    dataset.get_qrels(), 
    eval_metrics=["map", "recip_rank", "mrt"],
    names=['BM25', 'BM25 >> ANCE Re-Rank', 'ANCE']
    )

So on this collection, ANCE isn't as effective as BM25 under MAP or MRR (either as a ranker or as a BM25 re-ranker), but the ranker does have a lower mean response time.
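
For reference, the two effectiveness measures used above can be sketched for a single query as follows. MAP and MRR are then the means of these values over all queries. The function names and the toy ranking are illustrative assumptions; `pt.Experiment` computes the measures with its own evaluation backend.

```python
def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant document, or 0.0 if none is retrieved."""
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0

def average_precision(ranking, relevant):
    """Mean of precision@position over the positions of relevant documents.

    The sum is divided by the total number of relevant documents, so relevant
    documents that are never retrieved lower the score.
    """
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2"]   # system output for one query
relevant = {"d1", "d2"}              # qrels for that query

rr = reciprocal_rank(ranking, relevant)
ap = average_precision(ranking, relevant)
print(rr, ap)  # 0.5 0.5
```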