<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anserini Demo on Robust04

This notebook provides a brief overview of how to use [Anserini](http://anserini.io) to perform an _ad hoc_ retrieval run over the test collection from the TREC 2004 Robust Track.


## Setup


First, set up Java 11 and Maven:



In [0]:
%%capture

!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

Clone and build Anserini:

In [0]:
%%capture

!git clone https://github.com/castorini/anserini.git
%cd anserini
!mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true
!cd eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make

If all goes well, you should see `anserini-X.Y.Z-SNAPSHOT-fatjar.jar` in `target/`:



In [0]:
!ls target

anserini-0.6.1-SNAPSHOT-fatjar.jar  classes		    maven-archiver
anserini-0.6.1-SNAPSHOT.jar	    generated-sources	    maven-status
appassembler			    generated-test-sources  test-classes
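
The same check can be done from Python rather than by eye. The following is a minimal sketch, assuming the build wrote its artifacts to `target/` as in the listing above; the helper name `find_fatjars` is hypothetical and not part of Anserini.

```python
import glob
import os


def find_fatjars(target_dir="target"):
    """Return the sorted paths of Anserini fatjars found in target_dir."""
    # Matches names such as anserini-0.6.1-SNAPSHOT-fatjar.jar.
    pattern = os.path.join(target_dir, "anserini-*-fatjar.jar")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    jars = find_fatjars()
    if jars:
        print("Found fatjar:", jars[0])
    else:
        print("No fatjar found; the Maven build may have failed.")
```

An empty list most likely means the Maven build did not complete, in which case rerunning the build cell without `%%capture` should show the error.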


Let's grab the pre-built index:

In [0]:
%%capture

!wget https://www.dropbox.com/s/mdoly9sjdalh44x/lucene-index.robust04.pos%2Bdocvectors%2Brawdocs.tar.gz
!tar xvfz lucene-index.robust04.pos+docvectors+rawdocs.tar.gz

As a sanity check, let's look at the index size:

In [0]:
!du -h lucene-index.robust04.pos+docvectors+rawdocs

2.1G	lucene-index.robust04.pos+docvectors+rawdocs


## Batch Retrieval and Evaluation

Let's run the queries from the TREC 2004 Robust Track, with BM25 as the ranking model:

In [0]:
!target/appassembler/bin/SearchCollection -topicreader Trec -index lucene-index.robust04.pos+docvectors+rawdocs \
 -topics src/main/resources/topics-and-qrels/topics.robust04.txt -output run.robust04.bm25.topics.robust04.txt -bm25

2019-11-01 18:09:49,830 INFO  [main] search.SearchCollection (SearchCollection.java:212) - Reading index at lucene-index.robust04.pos+docvectors+rawdocs
2019-11-01 18:09:50,095 INFO  [main] search.SearchCollection (SearchCollection.java:239) - Use Bag of Terms query
2019-11-01 18:09:50,161 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:163) - [Start] Ranking with similarity: BM25(k1=0.9,b=0.4)
2019-11-01 18:10:13,069 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:195) - [Finished] Ranking with similarity: BM25(k1=0.9,b=0.4)
2019-11-01 18:10:13,109 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:196) - Run 250 topics searched in 00:00:22
2019-11-01 18:10:13,148 INFO  [main] search.SearchCollection (SearchCollection.java:562) - Total run time: 00:00:23
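
The log above reports `BM25(k1=0.9,b=0.4)` as the similarity. To make those two parameters concrete, here is a minimal, textbook-style sketch of BM25 scoring in Python. It is not Anserini's or Lucene's actual implementation, which differs in details such as how document lengths are stored. The function name `bm25_score`, the toy documents, and the collection statistics are all hypothetical.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=0.9, b=0.4):
    """Score one document against a query with a textbook BM25 formula.

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length (b=0 disables it).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * f * (k1 + 1.0) / (f + length_norm)
    return score


# Toy collection (hypothetical, for illustration only).
docs = {
    "d1": "international organized crime report".split(),
    "d2": "weather report for the weekend".split(),
}
doc_freqs = Counter(term for terms in docs.values() for term in set(terms))
avg_doc_len = sum(len(terms) for terms in docs.values()) / len(docs)
query = ["organized", "crime"]

for docid, terms in docs.items():
    print(docid, bm25_score(query, terms, doc_freqs, len(docs), avg_doc_len))
```

In this toy collection only `d1` contains the query terms, so it receives a positive score while `d2` scores zero.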

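The run file written by the command above holds a ranked list of documents for each topic, and evaluating it means comparing those rankings against the relevance judgments (qrels). As a rough sketch of what a metric such as `map` (mean average precision) computes, consider the Python below. It is an illustration under simplifying assumptions, not `trec_eval`'s code: the function names and the document and topic IDs are hypothetical, and topics without any relevant documents are simply skipped.

```python
def average_precision(ranked_docids, relevant_docids):
    """Average precision of one ranked list against a set of relevant docs."""
    relevant = set(relevant_docids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of per-topic average precision over topics with relevant docs.

    run:   dict mapping topic id -> ranked list of doc ids
    qrels: dict mapping topic id -> set of relevant doc ids
    Topics missing from the run count as zero.
    """
    topics = [topic for topic, rel in qrels.items() if rel]
    if not topics:
        return 0.0
    total = sum(average_precision(run.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)


# Hypothetical example: relevant docs found at ranks 2 and 4, one never found.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(average_precision(ranking, relevant))
```

Here the precision is 1/2 at rank 2 and 2/4 at rank 4, and dividing their sum by the three relevant documents gives an average precision of 1/3.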

Finally, let's use `trec_eval` to determine how good the results are:

In [0]:
!eval/trec_eval.9.0.4/trec_eval -c src/main/resources/topics-and-qrels/qrels.robust04.txt run.robust04.bm25.topics.robust04.txt


runid                 	all	Anserini
num_q                 	all	249
num_ret               	all	241339
num_rel               	all	17412
num_rel_ret           	all	10272
map                   	all	0.2531
gm_map                	all	0.1455
Rprec                 	all	0.2924
bpref                 	all	0.2603
recip_rank            	all	0.6769
iprec_at_recall_0.00  	all	0.7158
iprec_at_recall_0.10  	all	0.5286
iprec_at_recall_0.20  	all	0.4268
iprec_at_recall_0.30  	all	0.3541
iprec_at_recall_0.40  	all	0.2789
iprec_at_recall_0.50  	all	0.2299
iprec_at_recall_0.60  	all	0.1786
iprec_at_recall_0.70  	all	0.1404
iprec_at_recall_0.80  	all	0.0866
iprec_at_recall_0.90  	all	0.0553
iprec_at_recall_1.00  	all	0.0281
P_5                   	all	0.5004
P_10                  	all	0.4382
P_15                  	all	0.3987
P_20                  	all	0.3631
P_30                  	all	0.3102
P_100                 	all	0.1837
P_200                 	all	0.1250
P_500                 	all	0.0689
P_1000           