# Information Retrieval I #
## Assignment 2: retrieval models [100 points + 10 bonus points] ##
**TA**: Christophe Van Gysel (cvangysel@uva.nl; C3.258B, Science Park 904)

**Secondary TAs**: Harrie Oosterhuis, Nikos Voskarides

In this assignment you will get familiar with basic information retrieval concepts. You will implement different information retrieval ranking models and evaluate their performance.

We provide you with a VirtualBox image that comes pre-loaded with an index and a Python installation. To query the index, you'll use a Python package ([pyndri](https://github.com/cvangysel/pyndri)) that allows easy access to the underlying document statistics.

For evaluation you'll use the [TREC Eval](https://github.com/usnistgov/trec_eval) utility, provided by the National Institute of Standards and Technology of the United States. TREC Eval is the de facto standard way to compute Information Retrieval measures and is frequently referenced in scientific papers.

This is a **groups-of-two assignment**; the deadline is **23:59 - 25 January, 2017**. Code quality, informative comments and convincing analysis of the results will be considered when grading. Submissions should be made through Blackboard; questions can be asked on the course [Piazza](https://piazza.com/class/ixoz63p156g1ts).

### Technicalities (must-read!) ###
This assignment comes pre-loaded on a VirtualBox running Ubuntu. We have configured the indexing software and Python environment such that it works out of the box. You are allowed to extract the files from the VirtualBox and set-up your own non-virtualized environment. However, in this case you are on your own w.r.t. software support.

The assignment directory is organized as follows:
   * `./assignment.ipynb` (this file): the description of the assignment.
   * `./index/`: the index we prepared for you.
   * `./ap_88_89/`: directory with ground-truth and evaluation sets:
      * `qrel_test`: test query relevance collection (**test set**).
      * `qrel_validation`: validation query relevance collection (**validation set**).
      * `topics_title`: semicolon-separated file with query identifiers and terms.
      
`Python + Jupyter`, `Indri`, `Gensim` and `Pyndri` come pre-installed (see `$HOME/.local`). TREC Eval can be found in `$HOME/Downloads/trec_eval.9.0`. The password of the `student` account on the VirtualBox is `datascience`.

### TREC Eval primer ###
The TREC Eval utility can be downloaded and compiled as follows:

    git clone https://github.com/usnistgov/trec_eval.git
    cd trec_eval
    make

TREC Eval computes evaluation scores given two files: ground-truth information regarding relevant documents, named *query relevance* or *qrel*, and a ranking of documents for a set of queries, referred to as a *run*. The *qrel* will be supplied by us and should not be changed. For every retrieval model (or combinations thereof) you will generate a run of the top-1000 documents for every query. The format of the *run* file is as follows:

    $query_identifier Q0 $document_identifier $rank_of_document_for_query $query_document_similarity $run_identifier
    
where
   * `$query_identifier` is the unique identifier corresponding to a query (usually this follows a sequential numbering).
   * `Q0` is a legacy field that you can ignore.
   * `$document_identifier` corresponds to the unique identifier of a document (e.g., APXXXXXXX where AP denotes the collection and the Xs correspond to a unique numerical identifier).
   * `$rank_of_document_for_query` denotes the rank of the document for the particular query. This field is ignored by TREC Eval and is only maintained for legacy support. The ranks are computed by TREC Eval itself using the `$query_document_similarity` field (see next). However, it remains good practice to correctly compute this field.
   * `$query_document_similarity` is a score indicating the similarity between query and document where a higher score denotes greater similarity.
   * `$run_identifier` is an identifier of the run. This field is for your own convenience and has no purpose beyond bookkeeping.
   
For example, say we have two queries, `Q1` and `Q2`, and we rank three documents (`DOC1`, `DOC2`, `DOC3`). For query `Q1`, we find the following similarity scores: `score(Q1, DOC1) = 1.0`, `score(Q1, DOC2) = 0.5`, `score(Q1, DOC3) = 0.75`; and for `Q2`: `score(Q2, DOC1) = -0.1`, `score(Q2, DOC2) = 1.25`, `score(Q2, DOC3) = 0.0`. We can generate a run using the following snippet:

In [45]:
import logging
import sys

def write_run(model_name, data, out_f,
              max_objects_per_query=sys.maxsize,
              skip_sorting=False):
    """
    Write a run to an output file.
    Parameters:
        - model_name: identifier of run.
        - data: dictionary mapping topic_id to object_assesments;
            object_assesments is an iterable (list or tuple) of
            (relevance, object_id) pairs.
            The object_assesments iterable is sorted by decreasing order.
        - out_f: output file stream.
        - max_objects_per_query: cut-off for number of objects per query.
    """
    for subject_id, object_assesments in data.items():
        if not object_assesments:
            logging.warning('Received empty ranking for %s; ignoring.',
                            subject_id)

            continue

        # Probe types, to make sure everything goes alright.
        # assert isinstance(object_assesments[0][0], float) or \
        #     isinstance(object_assesments[0][0], np.float32)
        assert isinstance(object_assesments[0][1], str) or \
            isinstance(object_assesments[0][1], bytes)

        if not skip_sorting:
            object_assesments = sorted(object_assesments, reverse=True)

        if max_objects_per_query < sys.maxsize:
            object_assesments = object_assesments[:max_objects_per_query]

        if isinstance(subject_id, bytes):
            subject_id = subject_id.decode('utf8')

        for rank, (relevance, object_id) in enumerate(object_assesments):
            if isinstance(object_id, bytes):
                object_id = object_id.decode('utf8')

            out_f.write(
                '{subject} Q0 {object} {rank} {relevance} '
                '{model_name}\n'.format(
                    subject=subject_id,
                    object=object_id,
                    rank=rank + 1,
                    relevance=relevance,
                    model_name=model_name))
            
# The following writes the PLM run to local storage so that it can be
# passed to trec_eval. PLM_scores must have been computed beforehand.
with open("results/PLM_scores.run", "w") as out_f:
    write_run(
        model_name="PLMscores",
        data=PLM_scores,
        out_f=out_f,
        max_objects_per_query=1000)
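
As a minimal sketch of the run format for the toy example above (two queries, three documents), the snippet below builds the run lines directly. `format_run` is a hypothetical helper written only for this illustration; it is not part of the assignment code, although it follows the same line layout that `write_run` emits.

```python
def format_run(model_name, data):
    """Return run lines for a dict mapping query id to (score, doc id) pairs.

    Hypothetical helper for illustration only.
    """
    lines = []
    for query_id, scored_docs in data.items():
        # Rank documents by decreasing similarity score.
        ranked = sorted(scored_docs, reverse=True)
        for rank, (score, doc_id) in enumerate(ranked, start=1):
            lines.append('{} Q0 {} {} {} {}'.format(
                query_id, doc_id, rank, score, model_name))
    return lines


# Scores taken from the example in the text.
example_data = {
    'Q1': [(1.0, 'DOC1'), (0.5, 'DOC2'), (0.75, 'DOC3')],
    'Q2': [(-0.1, 'DOC1'), (1.25, 'DOC2'), (0.0, 'DOC3')],
}

example_run = format_run('example', example_data)
for line in example_run:
    print(line)
```

Each query contributes one line per document, ordered from the highest to the lowest score.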

Now, imagine that we know that `DOC1` is relevant and `DOC3` is non-relevant for `Q1`. In addition, for `Q2` we only know of the relevance of `DOC3`. The query relevance file looks like:

    Q1 0 DOC1 1
    Q1 0 DOC3 0
    Q2 0 DOC3 1
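
To make the Mean Average Precision of this example concrete, the sketch below computes it by hand from the rankings implied by the scores and from the qrel above. It is a simplified re-implementation for illustration, not how TREC Eval is implemented; `average_precision` is a hypothetical helper, and documents without a judgement are assumed to be non-relevant.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list of document ids.

    `relevant` is the set of documents judged relevant for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Rankings obtained by sorting the example scores in decreasing order.
rankings = {
    'Q1': ['DOC1', 'DOC3', 'DOC2'],
    'Q2': ['DOC2', 'DOC3', 'DOC1'],
}
# Relevant documents according to the example qrel.
relevant_docs = {'Q1': {'DOC1'}, 'Q2': {'DOC3'}}

ap = {q: average_precision(rankings[q], relevant_docs[q]) for q in rankings}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```

For `Q1` the only relevant document is ranked first (average precision 1.0); for `Q2` it is ranked second (average precision 0.5), which gives a mean of 0.75.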
    
We store the run and qrel in files `example.run` and `example.qrel` respectively on disk. We can now use TREC Eval to compute evaluation measures. In this example, we're only interested in Mean Average Precision and we'll only show this below for brevity. However, TREC Eval outputs much more information such as NDCG, recall, precision, etc.

    $ trec_eval -m all_trec -q example.qrel example.run | grep -E "^map\s"
    > map                   	Q1	1.0000
    > map                   	Q2	0.5000
    > map                   	all	0.7500
    
Now that we've discussed the output format of rankings and how you can compute evaluation measures from these rankings, we'll proceed with an overview of the indexing framework you'll use.

In [None]:
import subprocess

def create_output(type_set, filename):
    """Run trec_eval on results/<filename> and return its output as text."""
    if type_set == 'test':
        qrel = 'ap_88_89/qrel_test'
    else:
        qrel = 'ap_88_89/qrel_validation'
    command = ['trec_eval', '-m', 'all_trec', '-q', qrel, 'results/' + filename]

    # Decode the bytes so that the output can be split into lines;
    # str() on bytes would return the repr with escaped tabs and newlines.
    return subprocess.check_output(command).decode('utf8')

In [None]:
def analyse_output(output, title):
    # NDCG@10, Mean Average Precision (MAP@1000), Precision@5 and Recall@1000.
    measures = ["ndcg_cut_10", "map_cut_1000", "P_5", "recall_1000"]
    measure_results = {}
    for measure in measures:
        measure_list = []
        measure_all = 0
        for line in output.splitlines():
            # Each line reads "<measure> <query id or 'all'> <value>".
            # Matching the measure name exactly avoids confusing
            # e.g. P_5 with P_500 or relative_P_5.
            fields = line.split()
            if len(fields) != 3 or fields[0] != measure:
                continue
            if fields[1] == "all":
                measure_all = fields[2]
            else:
                measure_list.append(fields[2])
        measure_results[measure] = measure_all, measure_list

    return [title, measure_results]
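
As a small, self-contained sketch of how such output can be parsed, the snippet below extracts one measure from the `map` lines of the TREC Eval primer example. `parse_trec_eval` is a hypothetical helper introduced for this illustration, and it assumes that every output line has three whitespace-separated fields (measure, query identifier or `all`, value).

```python
def parse_trec_eval(output, measure):
    """Return (per-query values, overall value) for one measure."""
    per_query = {}
    overall = None
    for line in output.splitlines():
        fields = line.split()
        # Skip lines that belong to other measures.
        if len(fields) != 3 or fields[0] != measure:
            continue
        _, query_id, value = fields
        if query_id == 'all':
            overall = float(value)
        else:
            per_query[query_id] = float(value)
    return per_query, overall


# The three map lines from the TREC Eval primer example.
sample_output = (
    'map                   \tQ1\t1.0000\n'
    'map                   \tQ2\t0.5000\n'
    'map                   \tall\t0.7500\n'
)

per_query, overall = parse_trec_eval(sample_output, 'map')
print(per_query, overall)
```

Keeping the per-query values is useful for the paired significance test, which compares two models query by query.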

## Dirichlet mu optimisation:

In [None]:
output500 = create_output('validation', 'dirichlet_scores_500.run')
measure_results500 = analyse_output(output500, "Dirichlet mu=500")
output1000 = create_output('validation', 'dirichlet_scores_1000.run')
measure_results1000 = analyse_output(output1000, "Dirichlet mu=1000")
output1500 = create_output('validation', 'dirichlet_scores_1500.run')
measure_results1500 = analyse_output(output1500, "Dirichlet mu=1500")
output2000 = create_output('validation', 'dirichlet_scores_2000.run')
measure_results2000 = analyse_output(output2000, "Dirichlet mu=2000")
dirichlet_measures = [measure_results500, measure_results1000, measure_results1500, measure_results2000]

In [None]:
# for param in dirichlet_measures:
#     print(param[0])
#     for key,value in param[1].items():
#         print(str(key)+':', value[0])
#     print('')

**-> Dirichlet mu winner: 2000**

## Jelinek lambda optimisation:

In [None]:
outl01 = create_output('validation', 'jelinek_scores_0_1.run')
outl02 = create_output('validation', 'jelinek_scores_0_2.run')
outl03 = create_output('validation', 'jelinek_scores_0_3.run')
outl04 = create_output('validation', 'jelinek_scores_0_4.run')
outl05 = create_output('validation', 'jelinek_scores_0_5.run')
outl06 = create_output('validation', 'jelinek_scores_0_6.run')
outl07 = create_output('validation', 'jelinek_scores_0_7.run')
outl08 = create_output('validation', 'jelinek_scores_0_8.run')
outl09 = create_output('validation', 'jelinek_scores_0_9.run')
resl01 = analyse_output(outl01, "Jelinek lamb=0.1")
resl02 = analyse_output(outl02, "Jelinek lamb=0.2")
resl03 = analyse_output(outl03, "Jelinek lamb=0.3")
resl04 = analyse_output(outl04, "Jelinek lamb=0.4")
resl05 = analyse_output(outl05, "Jelinek lamb=0.5")
resl06 = analyse_output(outl06, "Jelinek lamb=0.6")
resl07 = analyse_output(outl07, "Jelinek lamb=0.7")
resl08 = analyse_output(outl08, "Jelinek lamb=0.8")
resl09 = analyse_output(outl09, "Jelinek lamb=0.9")
jelinek_measures = [resl01, resl02, resl03, resl04, resl05, resl06, resl07, resl08, resl09]

In [None]:
# for param in jelinek_measures:
#     print(param[0])
#     for key,value in param[1].items():
#         print(str(key)+':', value[0])
#     print('')

**-> Jelinek lambda winner: 0.1**

## Absolute Discounting delta optimisation:

In [None]:
outd01 = create_output('validation', 'AD_scores_0_1.run')
outd02 = create_output('validation', 'AD_scores_0_2.run')
outd03 = create_output('validation', 'AD_scores_0_3.run')
outd04 = create_output('validation', 'AD_scores_0_4.run')
outd05 = create_output('validation', 'AD_scores_0_5.run')
outd06 = create_output('validation', 'AD_scores_0_6.run')
outd07 = create_output('validation', 'AD_scores_0_7.run')
outd08 = create_output('validation', 'AD_scores_0_8.run')
outd09 = create_output('validation', 'AD_scores_0_9.run')
resd01 = analyse_output(outd01, "AD delta=0.1")
resd02 = analyse_output(outd02, "AD delta=0.2")
resd03 = analyse_output(outd03, "AD delta=0.3")
resd04 = analyse_output(outd04, "AD delta=0.4")
resd05 = analyse_output(outd05, "AD delta=0.5")
resd06 = analyse_output(outd06, "AD delta=0.6")
resd07 = analyse_output(outd07, "AD delta=0.7")
resd08 = analyse_output(outd08, "AD delta=0.8")
resd09 = analyse_output(outd09, "AD delta=0.9")
AD_measures = [resd01, resd02, resd03, resd04, resd05, resd06, resd07, resd08, resd09]

In [None]:
# for param in AD_measures:
#     print(param[0])
#     for key,value in param[1].items():
#         print(str(key)+':', value[0])
#     print('')

**-> AD delta winner: 0.8**

## Now run the winning parameters on the test set

In [None]:
outd08t = create_output('test', 'AD_scores_test.run')
resd08t = analyse_output(outd08t, "AD delta=0.8")
print(resd08t[0])
for key,value in resd08t[1].items():
    print(str(key)+':', value[0])
print('')

outl02t = create_output('test', 'jelinek_scores_test.run')
resl02t = analyse_output(outl02t, "Jelinek lamb=0.2")
print(resl02t[0])
for key,value in resl02t[1].items():
    print(str(key)+':', value[0])
print('')

output2000t = create_output('test', 'dirichlet_scores_test.run')
measure_results2000t = analyse_output(output2000t, "Dirichlet mu=2000")
print(measure_results2000t[0])
for key,value in measure_results2000t[1].items():
    print(str(key)+':', value[0])
print('')

### Pyndri primer ###
For this assignment you will use [Pyndri](https://github.com/cvangysel/pyndri) [[1](https://arxiv.org/abs/1701.00749)], a Python interface for [Indri](https://www.lemurproject.org/indri.php). We have indexed the document collection and you can query the index using Pyndri. We will start by giving you some examples of what Pyndri can do:

First we read the document collection index with Pyndri:

In [1]:
import pyndri

index = pyndri.Index('index/')

The loaded index can be used to access a collection of documents in an easy manner. We'll give you some examples to get an idea of what it can do; it is up to you to figure out how to use it for the remainder of the assignment.

First, let's look at the number of documents. Since Pyndri indexes the documents using incremental identifiers, we can simply take the difference between the maximum document identifier and the lowest one:

In [2]:
# print("There are %d documents in this collection." % (index.maximum_document() - index.document_base()))
# print(index.maximum_document())
# print(index.document_base())
# print(index.maximum_document()-index.document_base())


Let's take the first document out of the collection and take a look at it:

In [3]:
# example_document = index.document(index.document_base())
# # print(example_document)

Here we see that a document consists of two things: a string representing the external document identifier and an integer list representing the identifiers of the words that make up the document. Pyndri uses integer representations for words or terms; thus a token_id is an integer that represents a word, whereas the token is the actual text of the word/term. Every id has a unique token and vice versa, with the exception of stop words: words so common that they are uninformative. All of these receive the zero id.

To see some ids and their matching tokens, we take a look at the dictionary of the index:

In [4]:
token2id, id2token, _ = index.get_dictionary()
# print(list(id2token.items())[:15])

Using this dictionary we can see the tokens for the (non-stop) words in our example document:

In [5]:
# print([id2token[word_id] for word_id in example_document[1] if word_id > 0])

The reverse can also be done. Say we want to look for news about the "University of Massachusetts"; the tokens of that query can be converted to ids using the reverse dictionary:

In [6]:
# query_tokens = index.tokenize("University of Massachusetts")
# print("Query by tokens:", query_tokens)
# query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
# print("Query by ids with stopwords:", query_id_tokens)
# query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]
# print("Query by ids without stopwords:", query_id_tokens)
# print(token2id.get('university'))

Naturally, we can now match the document and query in the id space. Let's see how often a word from the query occurs in our example document:

In [7]:
# matching_words = sum([True for word_id in example_document[1] if word_id in query_id_tokens])
# print("Document %s has %d word matches with query: \"%s\"." % (example_document[0], matching_words, ' '.join(query_tokens)))
# print("Document %s and query \"%s\" have a %.01f%% overlap." % (example_document[0], ' '.join(query_tokens),matching_words/float(len(example_document[1]))*100))

While this is certainly not everything Pyndri can do, it should give you an idea of how to use it. Please take a look at the [examples](https://github.com/cvangysel/pyndri) as it will help you a lot with this assignment.

**CAUTION**: Avoid printing out the whole index in this Notebook as it will generate a lot of output and is likely to corrupt the Notebook.

### Parsing the query file
You can parse the query file (`ap_88_89/topics_title`) using the following snippet:

In [8]:
import collections
import io
import logging
import sys

def parse_topics(file_or_files,
                 max_topics=sys.maxsize, delimiter=';'):
    """Parse topic files into an ordered mapping of topic id to query terms."""
    # Test for None first: comparing None with an integer raises a TypeError.
    assert max_topics is None or max_topics >= 0

    topics = collections.OrderedDict()

    if not isinstance(file_or_files, list) and \
            not isinstance(file_or_files, tuple):
        if hasattr(file_or_files, '__iter__'):
            file_or_files = list(file_or_files)
        else:
            file_or_files = [file_or_files]

    for f in file_or_files:
        assert isinstance(f, io.IOBase)

        for line in f:
            assert isinstance(line, str)

            line = line.strip()

            if not line:
                continue

            topic_id, terms = line.split(delimiter, 1)

            if topic_id in topics and (topics[topic_id] != terms):
                logging.error('Duplicate topic "%s" (%s vs. %s).',
                              topic_id,
                              topics[topic_id],
                              terms)

            topics[topic_id] = terms

            # A max_topics of None or 0 means that there is no limit.
            if max_topics and len(topics) >= max_topics:
                break

    return topics

### Task 1: Implement and compare lexical IR methods [45 points] ### 

In this task you will implement a number of lexical methods for IR using the **Pyndri** framework. Then you will evaluate these methods on the dataset we have provided using **TREC Eval**.

Use the **Pyndri** framework to get statistics of the documents (term frequency, document frequency, collection frequency; **you are not allowed to use the query functionality of Pyndri**) and implement the following scoring methods in **Python**:

- [TF-IDF](http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html). **[5 points]**
- [BM25](http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html) with k1=1.2 and b=0.75. **[5 points]**
- Language models ([survey](https://drive.google.com/file/d/0B-zklbckv9CHc0c3b245UW90NE0/view))
    - Jelinek-Mercer (explore different values of 𝛌 in the range [0.1, 0.2, ..., 0.9]). **[5 points]**
    - Dirichlet Prior (explore different values of 𝛍 [500, 1000, ..., 2000]). **[5 points]**
    - Absolute discounting (explore different values of 𝛅 in the range [0.1, 0.2, ..., 0.9]). **[5 points]**
    - [Positional Language Models](http://sifaka.cs.uiuc.edu/~ylv2/pub/sigir09-plm.pdf) define a language model for each position of a document, and score a document based on the scores of its PLMs. The PLM is estimated based on propagated counts of words within a document through a proximity-based density function, which both captures proximity heuristics and achieves an effect of “soft” passage retrieval. Implement the PLM, all five kernels, but only the Best position strategy to score documents. Use 𝛔 equal to 50, and Dirichlet smoothing with 𝛍 optimized on the validation set (decide how to optimize this value yourself and motivate your decision in the report). **[10 points]**
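
The five proximity kernels mentioned in the last item can be sketched as follows. The definitions reflect our reading of the PLM paper linked above and should be checked against it before use; the function names and the `propagated_count` helper are assumptions made for this illustration, with `i` and `j` denoting word positions and `sigma` the kernel width.

```python
import math


def gaussian_kernel(i, j, sigma):
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))


def triangle_kernel(i, j, sigma):
    distance = abs(i - j)
    return 1 - distance / sigma if distance <= sigma else 0.0


def cosine_kernel(i, j, sigma):
    distance = abs(i - j)
    if distance > sigma:
        return 0.0
    return 0.5 * (1 + math.cos(distance * math.pi / sigma))


def circle_kernel(i, j, sigma):
    distance = abs(i - j)
    if distance > sigma:
        return 0.0
    return math.sqrt(1 - (distance / sigma) ** 2)


def passage_kernel(i, j, sigma):
    return 1.0 if abs(i - j) <= sigma else 0.0


def propagated_count(positions, i, kernel, sigma=50):
    """Propagated count at position i of a word occurring at `positions`."""
    return sum(kernel(i, j, sigma) for j in positions)


# A word occurring at positions 10 and 20, evaluated at position 10.
print(propagated_count([10, 20], 10, gaussian_kernel))
```

All kernels equal 1 at distance zero; every kernel except the Gaussian is zero beyond a distance of `sigma`.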
    
Implement the above methods and report evaluation measures (on the test set) using the hyper parameter values you optimized on the validation set (also report the values of the hyper parameters). Use TREC Eval to obtain the results and report on `NDCG@10`, Mean Average Precision (`MAP@1000`), `Precision@5` and `Recall@1000`.

For the language models, create plots showing `NDCG@10` with varying values of the parameters. You can do this by chaining small scripts using shell scripting (preferred) or by executing trec_eval using Python's `subprocess`.

Compute significance of the results using a [two-tailed paired Student t-test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) **[10 points]**. Be wary of false rejection of the null hypothesis caused by the [multiple comparisons problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem). There are multiple ways to mitigate this problem and it is up to you to choose one.
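
One common way to mitigate the multiple comparisons problem is the Bonferroni correction, which divides the significance level by the number of comparisons. The sketch below applies it to a set of p-values such as those returned by `scipy.stats.ttest_rel` on per-query scores. The p-values and model names are made up for illustration and are not results from this assignment; `bonferroni` is a hypothetical helper.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which comparisons stay significant after Bonferroni correction."""
    # Each individual test is held to a stricter significance level.
    corrected_alpha = alpha / len(p_values)
    return {name: p < corrected_alpha for name, p in p_values.items()}


# Made-up p-values for illustration only.
example_p_values = {
    'model A vs model B': 0.004,
    'model A vs model C': 0.03,
    'model B vs model C': 0.2,
}

significant = bonferroni(example_p_values)
print(significant)
```

With three comparisons the corrected level is 0.05 / 3, so a p-value of 0.03 is no longer considered significant even though it is below 0.05.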

Analyse the results by identifying specific queries where different methods succeed or fail and discuss possible reasons that cause these differences.

**NOTE**: Don’t forget to use log computations in your calculations to avoid underflows. 
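
The following sketch illustrates why log computations matter: multiplying many small probabilities underflows to zero in floating point, whereas summing their logarithms stays well within range. The probability values are hypothetical.

```python
import math

# Hypothetical per-term probabilities, e.g. for a long sequence of terms.
probabilities = [1e-5] * 100

# Naive product of probabilities.
product = 1.0
for p in probabilities:
    product *= p

# Equivalent score computed in log space.
log_score = sum(math.log(p) for p in probabilities)

print(product)    # the product underflows to 0.0
print(log_score)  # the log-space score remains finite
```

Since the logarithm is monotonic, ranking documents by the sum of log probabilities gives the same order as ranking by the product.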

## Setups

In [9]:
import time
import pickle
import math
from itertools import islice
import random
from operator import itemgetter

In [10]:
n = index.maximum_document()-index.document_base()
def get_rid_of_zeros(n):
    collection = []
    for i in range(1,n+1):
        collection.append([word for word in index.document(i)[1] if word > 0])
    return collection

In [11]:
# collection = get_rid_of_zeros(n)
# pickle.dump(collection, open("./results/collection.p", "wb"))

In [12]:
collection = pickle.load(open( "./results/collection.p", "rb"))

In [13]:
def get_collection_length():
    length_collection = 0
    for i in range(n):
        length_collection += len(collection[i])
    return length_collection

In [14]:
col_len = get_collection_length()

In [15]:
def get_unique_collection(n):
    unique_words_docs = []
    for doc in collection:
        unique_words_docs.append(list(set(doc)))
    return unique_words_docs

In [16]:
# unique_words_docs = get_unique_collection(n)
# pickle.dump(unique_words_docs, open("./results/unique_words_docs.p", "wb"))

In [17]:
unique_words_docs = pickle.load(open( "./results/unique_words_docs.p", "rb"))

In [18]:
# the query ids of the validation list
with open('./ap_88_89/qrel_validation', 'r') as val_queries_: 
    val_queries = list(set([line.split(' ')[0] for line in val_queries_]))

In [19]:
# the query ids of the test list
with open('./ap_88_89/qrel_test', 'r') as test_queries_: 
    test_queries = list(set([line.split(' ')[0] for line in test_queries_]))

**Inverted Index List**

In [20]:
queries_dict = {} # {qid: qstring, qid: qstring...}
with open('./ap_88_89/topics_title', 'r') as f_topics: 
    for query in parse_topics([f_topics]).items():
        queries_dict[query[0]] = query[1]

In [21]:
def get_inverted_index():
    inverted_list = {}
    query_list = []

    with open('./ap_88_89/topics_title', 'r') as f_topics:   
        for query in parse_topics([f_topics]).items():
            query_list.append(query)

    nr = 0
    for query in query_list:
        nr += 1
        if nr % 10 == 0: # report progress every 10 queries
            print('\r',str(nr)+'/'+str(len(query_list)))

        # getting the query term ids
        query_id = query[0]
        query_string = query[1]
        query_tokens = index.tokenize(query_string)
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]

        for qti in query_id_tokens: # for every query term
            if qti not in inverted_list: # only check unique query token once
                tot_count = 0
                inverted_list[qti] = [[],0]
                for i in range(n):                
                    word_counter = collection[i].count(qti)
                    if word_counter > 0:
                        tot_count += word_counter
                        docid = index.document_ids([index.document(i+1)[0]])[0][1]
                        inverted_list[qti][0].append(docid) # add document to query tok id
                inverted_list[qti][1]= (tot_count)
    return inverted_list
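
The structure built above maps each query term id to a list of documents containing it and to its collection frequency. As a minimal sketch on a toy collection, the same structure can be built in a single pass over the documents rather than one pass per query term. The toy documents and the `build_inverted_index` helper are assumptions made for this illustration.

```python
from collections import Counter

# Toy collection: document id -> list of term ids.
toy_docs = {1: [4, 7, 4], 2: [7, 9], 3: [9, 9, 2]}


def build_inverted_index(docs):
    """Map term id -> [doc ids containing it, collection frequency]."""
    inverted = {}
    for doc_id in sorted(docs):
        # Count every term of the document once.
        for term, count in Counter(docs[doc_id]).items():
            postings = inverted.setdefault(term, [[], 0])
            postings[0].append(doc_id)
            postings[1] += count
    return inverted


toy_index = build_inverted_index(toy_docs)
print(toy_index)
```

The document frequency of a term is then the length of its document list, and its collection frequency is the second entry.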

Get lists:

In [22]:
# inverted_list = get_inverted_index()
# pickle.dump(inverted_list, open("./results/inverted_index.p", "wb"))

Load lists:

In [23]:
query_list = pickle.load(open( "./results/query_list.p", "rb"))
inverted_list = pickle.load(open( "./results/inverted_index.p", "rb"))

In [24]:
def make_dict_format(dic):
    results = []
    for doc, score in dic.items():
        results.append((score, doc))
    return tuple(results)

In [25]:
def get_query_docs(queryset):
    query_docus = {}
    nr = 0
    for query_id in queryset:
        nr +=1
        if nr % 5 ==0:
            print('doc', nr)
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]

        relev_documents = []
        for token in query_id_tokens:
            for doc in inverted_list[token][0]:
                if doc not in relev_documents:
                    relev_documents.append(doc)

        query_docus[query_id] = relev_documents
    return query_docus

In [26]:
# query_val_docs = get_query_docs(val_queries)
# pickle.dump(query_val_docs, open("./results/query_val_docs.p", "wb"))

In [27]:
# query_test_docs = get_query_docs(test_queries)
# pickle.dump(query_test_docs, open("./results/query_test_docs.p", "wb"))

In [28]:
query_val_docs = pickle.load(open( "./results/query_val_docs.p", "rb"))
query_test_docs = pickle.load(open( "./results/query_test_docs.p", "rb"))

In [29]:
def idf(t):
    return math.log(n)-math.log(len(inverted_list[t][0]))

In [30]:
def background_prob(w):
    if w in inverted_list:
        tf_w_C = inverted_list[w][-1]
    else: 
        tf_w_C = 0
    return tf_w_C/float(col_len)

# Retrieval Models

## TF-IDF  (Vector-space)

In [None]:
def tf_idf(t, d):
    return math.log(1+collection[d].count(t)) * idf(t)

In [None]:
def score_TFIDF(q,d):
    unique = unique_words_docs[d-1]
    score = 0
    for word in q:
        score += tf_idf(word,d-1)   
    return score
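
As a worked example of this weighting scheme, the sketch below scores a two-term query against a toy collection using the same `log(1 + tf) * idf` form. The toy documents and the `toy_` helpers are assumptions made for illustration; they use words instead of term ids for readability.

```python
import math

toy_collection = [
    ['apple', 'banana', 'apple'],
    ['banana', 'cherry'],
    ['cherry', 'cherry', 'date'],
]
num_docs = len(toy_collection)


def toy_idf(term):
    # Document frequency: number of documents that contain the term.
    df = sum(1 for doc in toy_collection if term in doc)
    return math.log(num_docs) - math.log(df)


def toy_tf_idf(term, doc):
    return math.log(1 + doc.count(term)) * toy_idf(term)


def toy_score(query, doc):
    return sum(toy_tf_idf(term, doc) for term in query)


scores = [toy_score(['apple', 'banana'], doc) for doc in toy_collection]
print(scores)
```

The first document contains both query terms and scores highest, the second contains only the more common term, and the third contains neither and scores zero.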

In [None]:
def get_TFIDF_scores():
    TFIDF_dict = {}
    
    nr = 0
    for query_id in test_queries:
        r = {}
        print('\r',str(nr)+'/'+str(len(test_queries)), end=" ")
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]

        for d in query_test_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = score_TFIDF(query_id_tokens, d)
            
        # Pad the ranking with randomly chosen non-empty documents (score 0)
        # until it holds 1000 entries.
        while len(r) < 1000:
            i = random.randrange(1,n)
            if len(collection[i-1]) == 0:
                continue
            j = str(index.document(i)[0])
            if j not in r:
                r[j] = 0

        TFIDF_dict[str(query_id)] = make_dict_format(r)
        nr += 1
        
    return TFIDF_dict

Get scores:

In [None]:
# TFIDF_scores = get_TFIDF_scores()
# pickle.dump(TFIDF_scores, open("./results/tfidf_results.p", "wb"))

Load scores:

In [None]:
TFIDF_scores = pickle.load(open( "./results/tfidf_results.p", "rb"))

In [None]:
tfidf_top_1000_docs = {}
for key, values in TFIDF_scores.items():
    query_id = key
    tfidf_top_1000_docs[int(query_id)] = []
    sorted_list = sorted(values, key=itemgetter(0), reverse = True)[:1000]
    for value in sorted_list:
        tfidf_top_1000_docs[int(query_id)].append(index.document_ids([value[1]])[0][1])

In [None]:
pickle.dump(tfidf_top_1000_docs, open("./results/tfidf_top.p", "wb"))

## BM25 (Probabilistic)

In [None]:
def average_length():
    l = 0
    for i in range(n):
        l += len(collection[i])
    return l/float(n)

l_av = average_length()

In [None]:
def BM25(t,d):
    k1 = 1.2
    b = 0.75
    first = ((k1+1)* collection[d].count(t)) / (k1*((1-b)+b*(len(collection[d])/l_av))+collection[d].count(t))
    return  first*idf(t)
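
For reference, the per-term score computed by `BM25` can be written as the formula below. It is read off from the code above rather than taken from an external source. Here tf is the count of term t in document d, |d| the document length, l_av the average document length, n the number of documents and df the number of documents containing t (as used in `idf`).

```latex
\mathrm{BM25}(t, d) =
  \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
       {k_1\left((1 - b) + b\,\frac{|d|}{l_{\mathrm{av}}}\right) + \mathrm{tf}_{t,d}}
  \cdot \log\frac{n}{\mathrm{df}_t},
\qquad k_1 = 1.2,\; b = 0.75
```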

In [None]:
def score_BM25(q,d):
    unique = unique_words_docs[d-1]
    score = 0
    for word in set(q):
        score += BM25(word,d-1)
        
    return score

In [None]:
def get_BM25_scores():
    BM25_dict = {}
    
    nr = 0
    for query_id in test_queries:
        r = {}
        print('\r',str(nr)+'/'+str(len(test_queries)), end=" ")
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]
        for d in query_test_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = score_BM25(query_id_tokens, d)
            
        # Pad the ranking with randomly chosen non-empty documents (score 0)
        # until it holds 1000 entries.
        while len(r) < 1000:
            i = random.randrange(1,n)
            if len(collection[i-1]) == 0:
                continue
            j = str(index.document(i)[0])
            if j not in r:
                r[j] = 0

        BM25_dict[str(query_id)] = make_dict_format(r)
        nr += 1
        
    return BM25_dict

Get scores:

In [None]:
BM25_scores = get_BM25_scores()
pickle.dump(BM25_scores, open("./results/bm25_results.p", "wb"))

Load scores:

In [None]:
BM25_scores = pickle.load(open( "./results/bm25_results.p", "rb"))

**Background probability**


## Smoothing

## Jelinek-Mercer


In [None]:
def jelinek_mercer(lamb, d, w):
    P = (lamb * (collection[d].count(w)/len(collection[d]))) + ((1-lamb) * background_prob(w))
    return P

In [None]:
def jelinek_score(lamb,query,d):
    score = 0
    for q in query:
        score += math.log(jelinek_mercer(lamb,d,q))
    return score
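
Written out, the two functions above compute the following quantities (read off from the code). Here tf(w; d) is the count of word w in document d, |d| the document length, and P(w | C) the background probability returned by `background_prob`, i.e. the collection frequency of w divided by the collection length.

```latex
P_{\lambda}(w \mid d) = \lambda\,\frac{\mathrm{tf}(w; d)}{|d|} + (1 - \lambda)\,P(w \mid C),
\qquad
\mathrm{score}(q, d) = \sum_{w \in q} \log P_{\lambda}(w \mid d)
```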

In [None]:
def get_jelinek_scores(lamb, type_set):
    print('\r','lamb:',lamb)
    jelinek_dict = {}
    nr = 0
    
    if type_set == "val":
        queries = val_queries
        query_docs = query_val_docs
    else:
        queries = test_queries
        query_docs = query_test_docs
        
    for query_id in queries:
        r = {}
        print('\r',str(nr)+'/'+str(len(queries)), end=" ")
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]
        
    
        for d in query_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = jelinek_score(lamb, query_id_tokens, d-1)
            
        # Pad the ranking with randomly chosen non-empty documents
        # until it holds 1000 entries.
        while len(r) < 1000:
            i = random.randrange(1,n)
            if len(collection[i-1]) == 0:
                continue
            j = str(index.document(i)[0])
            if j not in r:
                r[j] = jelinek_score(lamb, query_id_tokens, i-1)

        jelinek_dict[str(query_id)] = make_dict_format(r)
        nr += 1
        
    return jelinek_dict

Getting validation scores:

In [None]:
jelinek_scores_0_1 = get_jelinek_scores(0.1, 'val')
jelinek_scores_0_2 = get_jelinek_scores(0.2, 'val')
jelinek_scores_0_3 = get_jelinek_scores(0.3, 'val')
jelinek_scores_0_4 = get_jelinek_scores(0.4, 'val')
jelinek_scores_0_5 = get_jelinek_scores(0.5, 'val')
jelinek_scores_0_6 = get_jelinek_scores(0.6, 'val')
jelinek_scores_0_7 = get_jelinek_scores(0.7, 'val')
jelinek_scores_0_8 = get_jelinek_scores(0.8, 'val')
jelinek_scores_0_9 = get_jelinek_scores(0.9, 'val')

In [None]:
pickle.dump(jelinek_scores_0_1, open("./results/jelinek_scores_0_1.p", "wb"))
pickle.dump(jelinek_scores_0_2, open("./results/jelinek_scores_0_2.p", "wb"))
pickle.dump(jelinek_scores_0_3, open("./results/jelinek_scores_0_3.p", "wb"))
pickle.dump(jelinek_scores_0_4, open("./results/jelinek_scores_0_4.p", "wb"))
pickle.dump(jelinek_scores_0_5, open("./results/jelinek_scores_0_5.p", "wb"))
pickle.dump(jelinek_scores_0_6, open("./results/jelinek_scores_0_6.p", "wb"))
pickle.dump(jelinek_scores_0_7, open("./results/jelinek_scores_0_7.p", "wb"))
pickle.dump(jelinek_scores_0_8, open("./results/jelinek_scores_0_8.p", "wb"))
pickle.dump(jelinek_scores_0_9, open("./results/jelinek_scores_0_9.p", "wb"))

Loading validation scores:

In [None]:
jelinek_scores_0_1 = pickle.load(open( "./results/jelinek_scores_0_1.p", "rb"))
jelinek_scores_0_2 = pickle.load(open( "./results/jelinek_scores_0_2.p", "rb"))
jelinek_scores_0_3 = pickle.load(open( "./results/jelinek_scores_0_3.p", "rb"))
jelinek_scores_0_4 = pickle.load(open( "./results/jelinek_scores_0_4.p", "rb"))
jelinek_scores_0_5 = pickle.load(open( "./results/jelinek_scores_0_5.p", "rb"))
jelinek_scores_0_6 = pickle.load(open( "./results/jelinek_scores_0_6.p", "rb"))
jelinek_scores_0_7 = pickle.load(open( "./results/jelinek_scores_0_7.p", "rb"))
jelinek_scores_0_8 = pickle.load(open( "./results/jelinek_scores_0_8.p", "rb"))
jelinek_scores_0_9 = pickle.load(open( "./results/jelinek_scores_0_9.p", "rb"))

Getting test scores:

In [None]:
jelinek_scores_test = get_jelinek_scores(0.2, 'test')
pickle.dump(jelinek_scores_test, open("./results/jelinek_scores_test.p", "wb"))

Loading test scores:

In [None]:
jelinek_scores_test = pickle.load(open( "./results/jelinek_scores_test.p", "rb"))

## Dirichlet Prior

In [None]:
def dirichlet_prior(mu,d,w):
    # Algebraically equal to the interpolation
    # (|d|/(|d|+mu)) * tf/|d| + (mu/(|d|+mu)) * p(w|C),
    # but without dividing by the document length.
    doc = collection[d]
    return (doc.count(w) + mu * background_prob(w)) / (len(doc) + mu)

In [None]:
def dirichlet_score(mu,query,d):
    score = 0
    for q in query:
        score += math.log(dirichlet_prior(mu,d,q))
    return score
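
Dirichlet smoothing can be read as Jelinek-Mercer smoothing with a weight that depends on document length. The sketch below illustrates this with made-up counts; `dirichlet_prob` and `document_weight` are hypothetical helpers that take plain numbers rather than the notebook's document indices.

```python
def dirichlet_prob(tf, doc_len, mu, p_bg):
    """Dirichlet-smoothed p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)."""
    return (tf + mu * p_bg) / (doc_len + mu)


def document_weight(doc_len, mu):
    """Weight given to the document's own maximum-likelihood estimate."""
    return doc_len / (doc_len + mu)


p_bg = 0.001  # assumed background probability of the term

# Both documents have the same relative frequency tf/|d| = 0.02.
short_doc = dirichlet_prob(tf=2, doc_len=100, mu=1000, p_bg=p_bg)
long_doc = dirichlet_prob(tf=20, doc_len=1000, mu=1000, p_bg=p_bg)

# The same value, written as an interpolation with a length-dependent weight.
w_short = document_weight(100, 1000)
short_doc_interpolated = w_short * (2 / 100) + (1 - w_short) * p_bg
```

With the same relative term frequency, the longer document keeps more of its own estimate, since its weight `|d| / (|d| + mu)` is larger.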

In [None]:
def get_dirichlet_scores(mu, type_set):
    print('\r','mu:',mu)
    dirichlet_dict = {}
    nr = 0
    
    if type_set == "val":
        queries = val_queries
        query_docs = query_val_docs
    else:
        queries = test_queries
        query_docs = query_test_docs
        
    for query_id in queries:
        r = {}
        print('\r',str(nr)+'/'+str(len(queries)), end=" ")
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]      
    
        for d in query_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = dirichlet_score(mu, query_id_tokens, d-1)
            
        while len(r) < 1000:
            i = random.randrange(1,n)
            j = str(index.document(i)[0])
            # Skip empty documents and documents that were already scored.
            if len(collection[i-1]) == 0 or j in r:
                continue
            r[j] = dirichlet_score(mu, query_id_tokens, i-1)

        dirichlet_dict[str(query_id)] = make_dict_format(r)
        nr += 1
        
    return dirichlet_dict

Getting validation scores:

In [None]:
dirichlet_scores_500 = get_dirichlet_scores(500, "val")
dirichlet_scores_1000 = get_dirichlet_scores(1000, "val")
dirichlet_scores_1500 = get_dirichlet_scores(1500, "val")
dirichlet_scores_2000 = get_dirichlet_scores(2000, "val")

In [None]:
pickle.dump(dirichlet_scores_500, open("./results/dirichlet_scores_500.p", "wb"))
pickle.dump(dirichlet_scores_1000, open("./results/dirichlet_scores_1000.p", "wb"))
pickle.dump(dirichlet_scores_1500, open("./results/dirichlet_scores_1500.p", "wb"))
pickle.dump(dirichlet_scores_2000, open("./results/dirichlet_scores_2000.p", "wb"))

Loading validation scores:

In [None]:
dirichlet_scores_500 = pickle.load(open( "./results/dirichlet_scores_500.p", "rb"))
dirichlet_scores_1000 = pickle.load(open( "./results/dirichlet_scores_1000.p", "rb"))
dirichlet_scores_1500 = pickle.load(open( "./results/dirichlet_scores_1500.p", "rb"))
dirichlet_scores_2000 = pickle.load(open( "./results/dirichlet_scores_2000.p", "rb"))


Getting test scores:

In [None]:
dirichlet_scores_test = get_dirichlet_scores(2000, "test")
pickle.dump(dirichlet_scores_test, open("./results/dirichlet_scores_test.p", "wb"))

Loading test scores:

In [None]:
dirichlet_scores_test = pickle.load(open( "./results/dirichlet_scores_test.p", "rb"))

## Absolute Discounting

In [None]:
def absolute_discounting(delta, d, w):
    return (max(collection[d].count(w)-delta, 0)/len(collection[d])) + (((delta * len(unique_words_docs[d]))/len(collection[d])) * (background_prob(w)))

In [None]:
def AD_score(delta,query,d):
    score = 0
    for q in query:
        score += math.log(absolute_discounting(delta,d,q))
    return score
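
A useful sanity check for absolute discounting is that the probability mass removed from seen terms is exactly what the background model redistributes, so the smoothed distribution still sums to one. The sketch below checks this on a toy document; `ad_prob`, `toy_doc` and `toy_bg` are illustrative names, and the background probabilities are made up.

```python
from collections import Counter


def ad_prob(w, doc, delta, p_bg):
    """max(tf - delta, 0)/|d| + (delta * |d|_u / |d|) * p(w|C)."""
    tf = Counter(doc)
    unique_terms = len(tf)  # |d|_u, the number of distinct terms in the document
    discounted = max(tf[w] - delta, 0) / len(doc)
    redistributed = delta * unique_terms / len(doc) * p_bg[w]
    return discounted + redistributed


toy_doc = ["a", "a", "b"]
toy_bg = {"a": 0.5, "b": 0.3, "c": 0.2}  # assumed background model, sums to 1

# Total probability mass over the whole (toy) vocabulary.
ad_total = sum(ad_prob(w, toy_doc, 0.7, toy_bg) for w in toy_bg)
```

The identity holds for `delta` between 0 and 1, since every seen term then has a count of at least `delta` to give up.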

In [None]:
def get_AD_scores(delta, type_set):
    print('\r','delta:',delta)
    AD_dict = {}
    nr = 1 
    
    if type_set == "val":
        queries = val_queries
        query_docs = query_val_docs
    else:
        queries = test_queries
        query_docs = query_test_docs
        
    for query_id in queries:
        r = {}
        print('\r',str(nr)+'/'+str(len(queries)), end=" ")
        query_tokens = index.tokenize(queries_dict[str(query_id)])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]      
    
        for d in query_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = AD_score(delta, query_id_tokens, d-1)
            
        while len(r) < 1000:
            i = random.randrange(1,n)
            j = str(index.document(i)[0])
            # Skip empty documents and documents that were already scored.
            if len(collection[i-1]) == 0 or j in r:
                continue
            r[j] = AD_score(delta, query_id_tokens, i-1)

        AD_dict[str(query_id)] = make_dict_format(r)
        nr += 1
        
    return AD_dict

Getting validation scores:

In [None]:
AD_scores_0_1 = get_AD_scores(0.1,"val")
AD_scores_0_2 = get_AD_scores(0.2,"val")
AD_scores_0_3 = get_AD_scores(0.3,"val")
AD_scores_0_4 = get_AD_scores(0.4,"val")
AD_scores_0_5 = get_AD_scores(0.5,"val")
AD_scores_0_6 = get_AD_scores(0.6,"val")
AD_scores_0_7 = get_AD_scores(0.7,"val")
AD_scores_0_8 = get_AD_scores(0.8,"val")
AD_scores_0_9 = get_AD_scores(0.9,"val")

In [None]:
pickle.dump(AD_scores_0_1, open("./results/AD_scores_0_1.p", "wb"))
pickle.dump(AD_scores_0_2, open("./results/AD_scores_0_2.p", "wb"))
pickle.dump(AD_scores_0_3, open("./results/AD_scores_0_3.p", "wb"))
pickle.dump(AD_scores_0_4, open("./results/AD_scores_0_4.p", "wb"))
pickle.dump(AD_scores_0_5, open("./results/AD_scores_0_5.p", "wb"))
pickle.dump(AD_scores_0_6, open("./results/AD_scores_0_6.p", "wb"))
pickle.dump(AD_scores_0_7, open("./results/AD_scores_0_7.p", "wb"))
pickle.dump(AD_scores_0_8, open("./results/AD_scores_0_8.p", "wb"))
pickle.dump(AD_scores_0_9, open("./results/AD_scores_0_9.p", "wb"))

Loading validation scores:

In [None]:
AD_scores_0_1 = pickle.load(open( "./results/AD_scores_0_1.p", "rb"))
AD_scores_0_2 = pickle.load(open( "./results/AD_scores_0_2.p", "rb"))
AD_scores_0_3 = pickle.load(open( "./results/AD_scores_0_3.p", "rb"))
AD_scores_0_4 = pickle.load(open( "./results/AD_scores_0_4.p", "rb"))
AD_scores_0_5 = pickle.load(open( "./results/AD_scores_0_5.p", "rb"))
AD_scores_0_6 = pickle.load(open( "./results/AD_scores_0_6.p", "rb"))
AD_scores_0_7 = pickle.load(open( "./results/AD_scores_0_7.p", "rb"))
AD_scores_0_8 = pickle.load(open( "./results/AD_scores_0_8.p", "rb"))
AD_scores_0_9 = pickle.load(open( "./results/AD_scores_0_9.p", "rb"))

Getting test scores:

In [None]:
AD_scores_test = get_AD_scores(0.8,"test")
pickle.dump(AD_scores_test, open("./results/AD_scores_test.p", "wb"))

Loading test scores:

In [None]:
AD_scores_test = pickle.load(open( "./results/AD_scores_test.p", "rb"))

## Positional Language Models

In [31]:
query_terms =[]
for query in query_list:
    query_tokens = index.tokenize(query[1])
    query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
    query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]

    for query_token in query_id_tokens:
        if query_token not in query_terms:
            query_terms.append(query_token)

In [None]:
# q_docs = []
# nr = 0
# for doc in collection:
#     nr +=1
#     if nr % 1000 ==0:
#         print(nr)
#     query_doc = []
#     for i in range(len(doc)):
#         if doc[i] in query_terms:
#             for qterm in query_terms:
#                 if qterm == doc[i]:
#                     query_doc.append([qterm, i])
#     q_docs.append(query_doc)               

In [None]:
# pickle.dump(q_docs, open("./results/query_docs.p", "wb"))

In [32]:
q_docs = pickle.load(open( "./results/query_docs.p", "rb"))

In [33]:
def kernel_gaussian(sigma,i,j):
    return math.exp((-1*((i-j)**2))/(2*(sigma**2)))
    
def kernel_triangle(sigma,i,j):
    # The bounded kernels depend on the absolute distance |i-j|,
    # so they are symmetric in i and j and never exceed 1.
    if abs(i-j) <= sigma:
        return 1-(abs(i-j)/sigma)
    else:
        return 0.0

def kernel_cosine(sigma,i,j):
    if abs(i-j) <= sigma:
        return 0.5*(1+math.cos((abs(i-j)*math.pi)/sigma))
    else:
        return 0.0

def kernel_circle(sigma,i,j):
    if abs(i-j) <= sigma:
        return math.sqrt(1-((abs(i-j)/sigma)**2))
    else:
        return 0.0

def kernel_passage(sigma,i,j):
    if abs(i-j) <= sigma:
        return 1.0
    else:
        return 0.0
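
A proximity kernel should equal 1 at distance zero, be symmetric in the two positions and, for the bounded kernels, vanish beyond `sigma`. The sketch below tabulates two kernels so these properties can be checked quickly; `gaussian` and `triangle` are hypothetical helpers that take a signed distance instead of two positions.

```python
import math


def gaussian(distance, sigma):
    """Gaussian kernel; never reaches zero, but decays quickly."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


def triangle(distance, sigma):
    """Triangle kernel; linear decay to zero at |distance| == sigma."""
    return max(0.0, 1 - abs(distance) / sigma)


# (gaussian, triangle) values for a few signed distances with sigma = 50.
kernel_table = {
    d: (round(gaussian(d, 50), 4), round(triangle(d, 50), 4))
    for d in (-50, -25, 0, 25, 50, 75)
}
```

Entries for `d` and `-d` should be identical; if they are not, the kernel is using the signed difference where it should use the absolute distance.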

In [34]:
def c(w,j,d): 
    if w == d[j]:
        return 1
    else: return 0

In [35]:
def c_prime(w,i,d): # 0.0003 seconds
    c_prime = 0
    
    for query in q_docs[d]:
        if query[0] == w:
            j = query[1]
            c_prime += kernel_gaussian(50,i,j)
        
#     for j in range(len(collection[d])): 
#         c_prime += c(w,j,collection[d])*kernel_gaussian(50,i,j)
    return c_prime

In [36]:
def Z(i,d): # 0.25 seconds
    Z = 0
    for word in unique_words_docs[d]:
        Z += c_prime(word,i,d)
    return Z
    

In [37]:
def get_all_zs():
    Z = []
    
    max_len = 0
    for doc in collection:
        if len(doc) > max_len:
            max_len = len(doc)
    
    for i in range(max_len):
        z = 0
        for j in range(max_len):
            z += kernel_gaussian(50,i,j)
        Z.append(z)
    return Z


In [38]:
Zs = get_all_zs()

In [39]:
def PLM(mu,w,i,d): # 0.25 seconds
    P = (c_prime(w,i,d) + (mu * background_prob(w))) / (Zs[i] + mu)#(Z(i,d) + mu)
    return P
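
Since every position holds exactly one term, summing the propagated counts over all terms at position `i` gives the same value as summing the kernel over all positions of the document. The sketch below checks this identity on a toy document and evaluates a Dirichlet-smoothed positional model; all names, the `sigma`, `mu` and background values are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(i, j, sigma):
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))


def propagated_count(w, i, doc, sigma):
    """c'(w, i): kernel-weighted count of w as seen from position i."""
    return sum(
        gaussian_kernel(i, j, sigma)
        for j, token in enumerate(doc)
        if token == w
    )


def positional_prob(w, i, doc, sigma, mu, p_bg):
    """Dirichlet-smoothed positional model (c'(w,i) + mu * p(w|C)) / (Z_i + mu)."""
    z_i = sum(gaussian_kernel(i, j, sigma) for j in range(len(doc)))
    return (propagated_count(w, i, doc, sigma) + mu * p_bg) / (z_i + mu)


toy_doc = ["x", "y", "x", "z", "y"]
pos = 2

# Normaliser computed two ways: over distinct terms and over positions.
z_from_words = sum(propagated_count(w, pos, toy_doc, 2.0) for w in set(toy_doc))
z_from_positions = sum(gaussian_kernel(pos, j, 2.0) for j in range(len(toy_doc)))

# "x" occurs at positions 0 and 2, so it is more likely near position 2 than 4.
p_near = positional_prob("x", 2, toy_doc, 2.0, 1.0, 0.1)
p_far = positional_prob("x", 4, toy_doc, 2.0, 1.0, 0.1)
```

Note that the identity uses the actual length of the document, so a normaliser precomputed for a different length is only an approximation.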

In [40]:
def PLM_score(q,d):
    # Best-position strategy: return the maximum over positions i of the
    # negative KL divergence between the query model and the positional
    # language model of document d (mu is fixed at 1000 here).
    max_score = -100000
    for i in range(len(collection[d])):
        kl = 0.0
        for word in q:
            if inverted_list[word][0]:
                p_q = q.count(word)/float(len(q))
                kl += p_q * math.log(p_q/PLM(1000,word,i,d))
        if -kl > max_score:
            max_score = -kl
    return max_score

In [43]:
def get_PLM_scores(mu, type_set):
    PLM_dict = {}
    secdoc = 0
    nr = 0
    
    if type_set == "val":
        queries = val_queries
        query_docs = query_val_docs
    else:
        queries = test_queries
        query_docs = query_test_docs
        
    for query_id in queries:
        nr +=1
        start = time.time()
        r = {}
        if len(query_docs[query_id]) < 1000:
            nrdocs = 1000
        else: nrdocs = len(query_docs[query_id])
        print('\r','Q'+str(query_id), 'estimated time: '+ str(round((secdoc*nrdocs)/60.0,2)), 'min,', nrdocs, 'documents', '\t'+str(nr)+'/'+str(len(queries)), end=" ") 
        query_tokens = index.tokenize(queries_dict[str(query_id)]) #query[1])
        query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
        query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]

        for d in query_docs[query_id]:
            d_id = str(index.document(d)[0])
            r[d_id] = PLM_score(query_id_tokens, d-1)

            
        while len(r) < 1000:
            i = random.randrange(1,n)
            j = str(index.document(i)[0])
            # Skip empty documents and documents that were already scored.
            if len(collection[i-1]) == 0 or j in r:
                continue
            r[j] = PLM_score(query_id_tokens, i-1)
                
        PLM_dict[str(query_id)] = make_dict_format(r)
        secdoc = (time.time()-start)/float(nrdocs)
    
    return PLM_dict

In [44]:
PLM_scores = get_PLM_scores(2000, 'test')
pickle.dump(PLM_scores, open("./results/plm_results.p", "wb"))


 Q83 estimated time: 2.08 min, 13560 documents 	120/120   

### Task 2: Latent Semantic Models (LSMs) [25 points + 10 bonus points] ###

In this task you will experiment with applying distributional semantics methods ([word2vec](http://arxiv.org/abs/1411.2738) **[5 points]**, [LSI](http://lsa3.colorado.edu/papers/JASIS.lsi.90.pdf) **[5 points]**, [LDA](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf) **[5 points]** and [doc2vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) **[5 points]**) for retrieval.

You do not need to implement word2vec, LSI, LDA and doc2vec on your own. Instead, you can use [gensim](http://radimrehurek.com/gensim/index.html) (pre-loaded on the VirtualBox). An example on how to integrate Pyndri with Gensim for word2vec can be found [here](https://github.com/cvangysel/pyndri/blob/master/examples/word2vec.py). For the remaining latent vector space models, you will need to implement connector classes (such as `IndriSentences`) by yourself.

In order to use a latent semantic model for retrieval, you need to:
   * build a representation of the query **q**,
   * build a representation of the document **d**,
   * calculate the similarity between **q** and **d** (e.g., cosine similarity, KL-divergence).
     
The exact implementation here depends on the latent semantic model you are using. For example, in the case of word2vec, you only have vectors for individual words and not for documents or phrases. Try one of the following methods for producing these representations:
   * Average or sum the word vectors.
   * Cluster words in the document using [k-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and use the centroid of the most important cluster. Experiment with different values of K for k-means.
   * Using the [bag-of-word-embeddings representation](https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1248). **[10 bonus points]**
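
As an illustration of the simplest option above, the following sketch averages word vectors to represent a query and a document and compares them with cosine similarity. It is written in plain Python on a made-up three-dimensional vocabulary (`toy_vectors`); with a trained model the vectors would come from the model instead, and all helper names here are hypothetical.

```python
import math

# Made-up word vectors; a trained word2vec model would supply real ones.
toy_vectors = {
    "stock": [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "rain": [0.0, 0.1, 0.9],
}


def average_vector(tokens, vectors):
    """Mean of the vectors of in-vocabulary tokens, or None if there are none."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query_vec = average_vector(["stock", "market"], toy_vectors)
sim_finance = cosine(query_vec, average_vector(["market", "stock", "unknown"], toy_vectors))
sim_weather = cosine(query_vec, average_vector(["rain"], toy_vectors))
```

Out-of-vocabulary tokens are skipped here, so a text without any known token has no representation; how to rank such documents is a design choice that should be stated in the report.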
   
Each of these LSMs comes with various hyperparameters to tune. Make a choice on the parameters, and explicitly mention the reasons that led you to these decisions. You can use the validation set to optimize the hyperparameters you see fit; motivate your decisions. In addition, mention clearly how the query/document representations were constructed for each LSM and explain your choices.

In this experiment, you will first obtain an initial top-1000 ranking for each query using TF-IDF in **Task 1**, and then re-rank the documents using the LSMs. Use TREC Eval to obtain the results and report on `NDCG@10`, Mean Average Precision (`MAP@1000`), `Precision@5` and `Recall@1000`.
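
The re-ranking step can be sketched as follows: take the candidate list of a query, score every candidate with the semantic model, sort, and emit lines in the six-column run format that TREC Eval reads (`query_id Q0 doc_id rank score run_tag`). The document identifiers, similarity values and the helper names `rerank` and `to_trec_lines` are made up for illustration.

```python
def rerank(candidates, score_fn):
    """Return (doc_id, score) pairs sorted by descending score."""
    scored = [(doc_id, score_fn(doc_id)) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def to_trec_lines(query_id, ranking, run_tag):
    """Format a ranking as TREC run lines, with ranks starting at 1."""
    return [
        "{} Q0 {} {} {:.6f} {}".format(query_id, doc_id, rank, score, run_tag)
        for rank, (doc_id, score) in enumerate(ranking, start=1)
    ]


# Hypothetical candidate set (e.g. from a first-stage ranker) and LSM similarities.
toy_candidates = ["AP880212-0001", "AP880212-0002", "AP880212-0003"]
toy_similarity = {
    "AP880212-0001": 0.12,
    "AP880212-0002": 0.87,
    "AP880212-0003": 0.45,
}

run_lines = to_trec_lines("51", rerank(toy_candidates, toy_similarity.get), "toy_lsm")
```

Writing `run_lines` to a file, one line each, gives a run that can be evaluated against the relevance judgments.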

Perform significance testing **[5 points]** (similar to Task 1) within the class of semantic matching methods.
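
Significance tests for retrieval compare two systems on the same queries, so the per-query scores are paired. The sketch below uses a paired randomization (sign-flip) test as a dependency-free stand-in; it is not necessarily the test prescribed in Task 1, and the per-query values are invented for illustration.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean per-query difference under random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    return extreme / trials


# Invented per-query average precision for two systems on the same 8 queries.
ap_a = [0.30, 0.42, 0.25, 0.51, 0.38, 0.47, 0.33, 0.29]
ap_b = [0.20, 0.30, 0.15, 0.40, 0.27, 0.35, 0.22, 0.18]

p_value = paired_randomization_test(ap_a, ap_b)
p_same = paired_randomization_test(ap_a, ap_a)
```

When several methods are compared pairwise, the number of comparisons grows quickly, so a correction for multiple testing is worth considering.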

### Task 3: Learning to rank (LTR) [10 points] ###

In this task you will get an introduction to learning to rank for information retrieval. You will experiment with a pointwise learning to rank method, logistic regression, implemented in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

**NOTE**: you can only perform this task if you have completely finished Task 1 and Task 2.

In this experiment, you will use the retrieval methods you implemented in Task 1 and Task 2 as features for the learning to rank model. Train your LTR model using 10-fold cross validation on the test set. For every query, first create a document candidate set using the top-1000 documents using TF-IDF. Secondly, compute query-document values using the retrieval models above and use them as features. Note that the feature values of different retrieval methods are likely to be distributed differently.
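
Because the features come from models with very different score ranges (log-likelihoods, similarities, TF-IDF sums), some rescaling is usually needed before fitting a linear model. The sketch below shows one possible choice, min-max scaling within each query, together with a query-level fold assignment so that documents of one query never appear in both training and held-out folds. Both helpers and the toy rows are hypothetical; the classifier itself is left out.

```python
def min_max_per_query(rows):
    """Scale each feature to [0, 1] within a query.

    rows: list of (query_id, doc_id, [feature values]).
    """
    by_query = {}
    for qid, doc_id, feats in rows:
        by_query.setdefault(qid, []).append((doc_id, feats))

    scaled = []
    for qid, docs in by_query.items():
        columns = list(zip(*(feats for _, feats in docs)))
        lows = [min(col) for col in columns]
        spans = [max(col) - min(col) for col in columns]
        for doc_id, feats in docs:
            scaled.append((qid, doc_id, [
                (x - lo) / span if span else 0.0  # constant feature -> 0.0
                for x, lo, span in zip(feats, lows, spans)
            ]))
    return scaled


def query_folds(query_ids, k):
    """Split the distinct query ids into k folds in round-robin order."""
    ordered = sorted(set(query_ids))
    return [ordered[i::k] for i in range(k)]


# Toy rows: (query id, doc id, [log-likelihood score, cosine similarity]).
toy_rows = [
    ("q1", "d1", [-12.0, 0.2]),
    ("q1", "d2", [-8.0, 0.6]),
    ("q1", "d3", [-10.0, 0.4]),
    ("q2", "d4", [-30.0, 0.9]),
    ("q2", "d5", [-20.0, 0.9]),
]

scaled_rows = min_max_per_query(toy_rows)
folds = query_folds(["51", "52", "53", "54", "55"], 2)
```

Per-query scaling keeps only the relative ordering information within a query; scaling over the whole training set is an alternative when absolute score levels are believed to be informative.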

Your approach will definitely not be as good as the state-of-the-art since you are taking a pointwise approach, but we do not ask you to try pair- or listwise methods because they will be the main topic of the next assignment.

In [None]:
# from sklearn.linear_model import LogisticRegression

### Task 4: Write a report [20 points; instant FAIL if not provided] ###

The report should be a PDF file created using the [sigconf ACM template](https://www.acm.org/publications/proceedings-template) and will determine a significant part of your grade.

   * It should explain what you have implemented, motivate your experiments and detail what you expect to learn from them. **[10 points]**
   * Lastly, provide a convincing analysis of your results and conclude the report accordingly. **[10 points]**
      * Do all methods perform similarly on all queries? Why?
      * Is there a single retrieval model that outperforms all other retrieval models (i.e., silver bullet)?
      * ...

**Hand in the report and your self-contained implementation source files.** Do not send us the VirtualBox, but only the files that matter, organized in a well-documented zip/tgz file with clear instructions on how to reproduce your results. That is, we want to be able to regenerate all your results with minimal effort. You can assume that the index and ground-truth information is present in the same file system structure as on the VirtualBox.
