<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anserini Demo on Robust04

This notebook provides a brief overview of how to use [Anserini](http://anserini.io) to perform an _ad hoc_ retrieval run over the test collection from the TREC 2004 Robust Track.


## Setup


First, install Maven (Java 11 comes pre-installed):



In [1]:
%%capture
!apt-get install maven -qq

Clone and build Anserini:

In [2]:
%%capture
!git clone --recurse-submodules https://github.com/castorini/anserini.git
%cd anserini
!cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
!mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true

If all goes well, you should see `anserini-X.Y.Z-SNAPSHOT-fatjar.jar` in `target/`:



In [3]:
!ls target

anserini-0.21.1-SNAPSHOT-fatjar.jar   classes		      maven-status
anserini-0.21.1-SNAPSHOT.jar	      generated-sources       test-classes
anserini-0.21.1-SNAPSHOT-sources.jar  generated-test-sources
appassembler			      maven-archiver
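
The same check can be scripted, which may be useful if later cells should fail fast when the build did not produce the fatjar. The following is a minimal sketch, assuming the notebook's working directory is the `anserini` checkout; `find_fatjar` is a hypothetical helper written for this illustration, not part of Anserini:

```python
from pathlib import Path


def find_fatjar(target_dir="target"):
    """Return the fatjar paths found directly in target_dir, sorted by name.

    Hypothetical helper for illustration; the glob pattern mirrors the
    anserini-X.Y.Z-SNAPSHOT-fatjar.jar naming described in the text.
    """
    directory = Path(target_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("anserini-*-fatjar.jar"))


jars = find_fatjar()
if jars:
    print("Found fatjar:", jars[0])
else:
    print("No fatjar found in target/; check the Maven build output.")
```

Returning a list rather than a single path keeps the helper usable if more than one version has been built into the same directory.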


Let's grab the pre-built index:

In [4]:
%%capture
!wget https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.robust04.20221005.252b5e.tar.gz
!tar xvfz lucene-index.robust04.20221005.252b5e.tar.gz -C indexes/

Sanity check of index size:

In [5]:
!du -h indexes/lucene-index.robust04.20221005.252b5e

2.0G	indexes/lucene-index.robust04.20221005.252b5e
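
For a check that does not depend on `du`, the size can also be approximated from Python. This is a rough sketch: `dir_size_bytes` is a hypothetical helper, and it sums apparent file sizes, so the figure may differ slightly from the disk usage that `du` reports:

```python
import os


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a link's target is not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


index_dir = "indexes/lucene-index.robust04.20221005.252b5e"
size_gib = dir_size_bytes(index_dir) / 2**30
print(f"{index_dir}: {size_gib:.1f} GiB")
```

A result far below the roughly 2 GB shown above would suggest an incomplete download or extraction.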


## Batch Retrieval and Evaluation

Let's run the queries from the TREC 2004 Robust Track, with BM25 as the ranking model:

In [6]:
!target/appassembler/bin/SearchCollection \
 -index indexes/lucene-index.robust04.20221005.252b5e \
 -topics src/main/resources/topics-and-qrels/topics.robust04.txt \
 -topicreader Trec \
 -output runs/run.robust04.bm25.txt \
 -bm25

2023-04-02 15:38:45,021 INFO  [main] search.SearchCollection (SearchCollection.java:929) - Index: indexes/lucene-index.robust04.20221005.252b5e
2023-04-02 15:38:45,258 INFO  [main] search.SearchCollection (SearchCollection.java:933) - Fields: []
2023-04-02 15:38:45,259 INFO  [main] search.SearchCollection (SearchCollection.java:691) - Using DefaultEnglishAnalyzer
2023-04-02 15:38:45,260 INFO  [main] search.SearchCollection (SearchCollection.java:692) - Stemmer: porter
2023-04-02 15:38:45,261 INFO  [main] search.SearchCollection (SearchCollection.java:693) - Keep stopwords? false
2023-04-02 15:38:45,262 INFO  [main] search.SearchCollection (SearchCollection.java:694) - Stopwords file: null
2023-04-02 15:38:45,344 INFO  [main] search.SearchCollection (SearchCollection.java:1208) - runtag: Anserini
2023-04-02 15:38:57,421 INFO  [pool-3-thread-8] search.SearchCollection$SearcherThread (SearchCollection.java:843) - ranker: bm25(k1=0.9,b=0.4), reranker: default: 100 queries processed
2023-04
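
The run is written to `runs/run.robust04.bm25.txt` in the standard TREC run format: one line per retrieved document, with six whitespace-separated columns (topic id, the literal `Q0`, document id, rank, score, run tag). To inspect results before evaluating them, a small parser along these lines may help. It is a sketch only: `parse_run` is a hypothetical helper, and the sample lines use made-up document ids and scores rather than real output from this run:

```python
from collections import defaultdict


def parse_run(lines):
    """Parse TREC run lines into {topic_id: [(docid, rank, score), ...]}.

    Hypothetical helper for illustration; hits are sorted by rank per topic.
    """
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic_id, _q0, docid, rank, score, _tag = parts
        run[topic_id].append((docid, int(rank), float(score)))
    for hits in run.values():
        hits.sort(key=lambda hit: hit[1])
    return dict(run)


# Illustrative lines only: the docids and scores are made up, not real output.
sample = [
    "301 Q0 DOC-A 1 7.5 Anserini",
    "301 Q0 DOC-B 2 7.1 Anserini",
    "302 Q0 DOC-C 1 9.3 Anserini",
]
run = parse_run(sample)
print(run["301"][0])  # top-ranked hit for topic 301 in the sample

# To read the actual run file instead (assumes the search cell above completed):
# with open("runs/run.robust04.bm25.txt") as f:
#     run = parse_run(f)
```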

Finally, let's use `trec_eval` to determine how good the results are:

In [7]:
!tools/eval/trec_eval.9.0.4/trec_eval -c src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.robust04.bm25.txt

runid                 	all	Anserini
num_q                 	all	249
num_ret               	all	241339
num_rel               	all	17412
num_rel_ret           	all	10272
map                   	all	0.2531
gm_map                	all	0.1456
Rprec                 	all	0.2924
bpref                 	all	0.2603
recip_rank            	all	0.6769
iprec_at_recall_0.00  	all	0.7158
iprec_at_recall_0.10  	all	0.5286
iprec_at_recall_0.20  	all	0.4268
iprec_at_recall_0.30  	all	0.3541
iprec_at_recall_0.40  	all	0.2789
iprec_at_recall_0.50  	all	0.2299
iprec_at_recall_0.60  	all	0.1786
iprec_at_recall_0.70  	all	0.1404
iprec_at_recall_0.80  	all	0.0866
iprec_at_recall_0.90  	all	0.0553
iprec_at_recall_1.00  	all	0.0281
P_5                   	all	0.5004
P_10                  	all	0.4382
P_15                  	all	0.3987
P_20                  	all	0.3631
P_30                  	all	0.3102
P_100                 	all	0.1837
P_200                 	all	0.1250
P_500                 	all	0.0689
P_1000           