[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/castorini/pygaggle/blob/master/notebooks/pygaggle_covidqa_demo.ipynb)

# **PyGaggle CovidQA demo**

## Install pyserini

In [0]:
%%capture
!pip install pyserini
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

## Checkout GPU, install transformers and pygaggle

In [0]:
!nvidia-smi

Mon May  4 21:26:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
%%capture
# Install huggingface
!pip uninstall -y transformers
!pip install transformers

In [0]:
%%capture
# Clone the master branch from pygaggle
!rm -rf pygaggle && pip uninstall -y pygaggle
!git clone https://github.com/castorini/pygaggle.git # use master once that branch is merged
!cd pygaggle && pip install --editable .

## Get the CORD-19 paragraph index from 2020-04-10

In [0]:
%%capture
%cd /content/pygaggle
!sh scripts/update-index.sh

## Let's start of with BM-25

### First, we use the natural query string format

In [0]:
!python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25

[32m2020-05-04 21:42:57[0m [1;30m[INFO][0m file_utils: PyTorch version 1.5.0+cu101 available.
2020-05-04 21:42:58.128420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[32m2020-05-04 21:42:59[0m [1;30m[INFO][0m file_utils: TensorFlow version 2.2.0-rc3 available.
[32m2020-05-04 21:43:04[0m [1;30m[INFO][0m kaggle: Average spans: 1.5725806451612903
[32m2020-05-04 21:43:04[0m [1;30m[INFO][0m kaggle: Random P@1: 0.011513690878122968
[32m2020-05-04 21:43:04[0m [1;30m[INFO][0m kaggle: Random R@3: 0.034106620910472035
[32m2020-05-04 21:43:04[0m [1;30m[INFO][0m kaggle: Random MRR: 0.05247032691539293
100% 124/124 [00:08<00:00, 14.85it/s]
[32m2020-05-04 21:43:13[0m [1;30m[INFO][0m evaluate_kaggle_highlighter: precision@1 0.15
[32m2020-05-04 21:43:13[0m [1;30m[INFO][0m evaluate_kaggle_highlighter: recall@3    0.2164
[32m2020-05-04 21:43:13[0m [1;30m[INFO][0m evaluate_kaggle_highlighter: rec

### Then, we evaluate with keyword query format

In [0]:
!python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25 \ 
                                                     --split kq

[32m2020-05-04 21:50:07[0m [1;30m[INFO][0m file_utils: PyTorch version 1.5.0+cu101 available.
2020-05-04 21:50:07.779993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[32m2020-05-04 21:50:09[0m [1;30m[INFO][0m file_utils: TensorFlow version 2.2.0-rc3 available.
usage: evaluate_kaggle_highlighter.py [-h] [--dataset DATASET] --method
                                      {transformer,bm25,t5,seq_class_transformer,qa_transformer,random}
                                      [--model-name MODEL_NAME]
                                      [--split {nq,kq}]
                                      [--batch-size BATCH_SIZE]
                                      [--device DEVICE]
                                      [--tokenizer-name TOKENIZER_NAME]
                                      [--do-lower-case]
                                      [--metrics {precision@1,recall@3,recall@50,recall@1000,mrr,mrr@10} [{precis

## Let's evaluate using our best neural ranker, T5

### Again, we first use the natural query string format

In [0]:
!python -um pygaggle.run.evaluate_kaggle_highlighter --method t5


[32m2020-05-04 21:43:22[0m [1;30m[INFO][0m file_utils: PyTorch version 1.5.0+cu101 available.
2020-05-04 21:43:22.711873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[32m2020-05-04 21:43:24[0m [1;30m[INFO][0m file_utils: TensorFlow version 2.2.0-rc3 available.
[32m2020-05-04 21:43:29[0m [1;30m[INFO][0m kaggle: Average spans: 1.5725806451612903
[32m2020-05-04 21:43:29[0m [1;30m[INFO][0m kaggle: Random P@1: 0.011513690878122968
[32m2020-05-04 21:43:29[0m [1;30m[INFO][0m kaggle: Random R@3: 0.034106620910472035
[32m2020-05-04 21:43:29[0m [1;30m[INFO][0m kaggle: Random MRR: 0.05247032691539293
2020-05-04 21:43:29.564495: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
2020-05-04 21:43:29.721468: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an e

### Finally, we evaluate using the keyword query string format.

In [0]:
!python -um pygaggle.run.evaluate_kaggle_highlighter --method t5 --split kq

[32m2020-05-04 21:51:13[0m [1;30m[INFO][0m file_utils: PyTorch version 1.5.0+cu101 available.
2020-05-04 21:51:13.856830: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[32m2020-05-04 21:51:15[0m [1;30m[INFO][0m file_utils: TensorFlow version 2.2.0-rc3 available.
[32m2020-05-04 21:51:20[0m [1;30m[INFO][0m kaggle: Average spans: 1.5725806451612903
[32m2020-05-04 21:51:20[0m [1;30m[INFO][0m kaggle: Random P@1: 0.011513690878122968
[32m2020-05-04 21:51:20[0m [1;30m[INFO][0m kaggle: Random R@3: 0.034106620910472035
[32m2020-05-04 21:51:20[0m [1;30m[INFO][0m kaggle: Random MRR: 0.05247032691539293
2020-05-04 21:51:20.636924: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
2020-05-04 21:51:20.774761: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an e