### Group 30 
#### Alessandro Cortese
#### Gianmarco Lodi

# Assignment 1 - _Foundations of Information Retrieval 2022_

This assignment is divided into 4 parts, which must all be delivered together via Canvas no later than 03/10/2022 (strict - no extensions will be granted!). Delivery of the assignment solutions is mandatory (_see grading conditions on Canvas and in slides of Lecture01_).

We will use [ElasticSearch](https://www.elastic.co/) as the search engine. It provides state-of-the-art tools to implement your own engine and index your documents, and it lets you focus on the methodological aspects of search models and optimization.

The assignment is about text-based Information Retrieval and it is structured in three parts:
1. IR performance evaluation (implementation of performance metrics)
2. Setting up a search engine, pre-processing and indexing using ElasticSearch (Indexing, Analyzers)
3. Implementation and optimization of models of search (Similarity)


This assignment contains exercises, marked with the section title __Exercise 01.(x)__, which are evaluated, and other sections that contain support code which you should study and use as is. Write your answers between the comments `BEGIN ANSWER` and `END ANSWER`. 

_Note:_ the comment `#THIS IS GRADED!` in a section indicates that it will be graded.


### Initial preparation (self-study)
For the first part, it is good to acquire (or refresh) basic knowledge of Python. Please use the [Python tutorials](https://docs.python.org/3/tutorial/) if needed.

For the second and third parts of the assignment, please study the [Getting Started guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) of ElasticSearch on your own and get acquainted with the framework.


***
***
***

# PART 01 - Performance evaluation


### Background information and reading
Study the slides of Lecture 01 (available on Canvas) and the reference book chapter (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, [Chapter 8, Evaluation in information retrieval](http://nlp.stanford.edu/IR-book/pdf/08eval.pdf), Cambridge University Press. 2008)

### Basic concepts
Suppose the set of relevant documents (the document identifiers - _doc-IDs_) is called `relevant`, then we  define it as follows (in Python):

In [1]:
relevant = set([2, 3, 5, 8, 13, 17, 21, 34, 38])

A perfect run would retrieve exactly these 9 documents in any order. Now, suppose the list of retrieved documents (the document identifiers - _doc-IDs_) is called `retrieved`, and contains the following _doc-IDs_:

In [2]:
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

One of the simplest evaluation measures is the _Success at rank 1_, i.e. `Is the first document retrieved a relevant document?`

_Success at rank 1_ returns 1 if the first document is relevant, and 0 otherwise. A possible implementation is: 

In [3]:
def success_at_1 (relevant, retrieved):
    if len(retrieved) > 0 and retrieved[0] in relevant:
        return 1
    else:
        return 0

success_at_1(relevant, retrieved)

0

The first retrieved document ID is 14, which is not in the set of relevant documents, so `success_at_1` is 0.

_________________

> Note how easy it is to check if an item occurs in a Python set or list by using the keyword `in`. Similarly, you can loop over all items in a set or list with: 
`for doc in retrieved:`, 
where `doc` will refer to each item in the set or list. 

Be sure to use the internet to sharpen your knowledge of Python constructs, for instance [Python list slicing](https://duckduckgo.com/?q=python+list+slicing). Also note that the code above checks that at least one document was retrieved, which avoids an index-out-of-bounds exception (i.e. we avoid accessing the first element of an empty list).
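As a small illustration of slicing and the `in` keyword on the example data, the following sketch may help. The variable names `top_5`, `beyond`, `empty` and `hits_in_top_5` are only illustrative and are not part of the assignment code.

```python
relevant = {2, 3, 5, 8, 13, 17, 21, 34, 38}
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

top_5 = retrieved[:5]      # the first five doc-IDs of the ranked list
beyond = retrieved[:100]   # slicing past the end is safe and returns the whole list
empty = [][:5]             # slicing an empty list returns an empty list, no exception

# Membership test combined with a loop over the slice
hits_in_top_5 = [doc for doc in top_5 if doc in relevant]

print(top_5, len(beyond), empty, hits_in_top_5)
```

Because a slice never raises on a short or empty list, it is often a convenient way to look at the top of a ranking without a separate length check.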

> ___Suggestion:___ _to be sure of the correctness of the implementation of the performance metrics, you can compute their values manually and compare them with those computed by your functions. This is important, as you will use these metrics for later exercises and to compare the results of different models._

## Preparation exercise: _Success at k_
The measure _Success at k_ returns 1 if a relevant document is among the first _k_ documents retrieved and zero otherwise.

> Success at _k_ measures are well-suited in case there is typically only one relevant document (or retrieving one relevant document is enough).

 __Implement _Success at 5_ below.__ 
 > The correct result is 1.

In [4]:
def success_at_5(relevant, retrieved):
    # BEGIN ANSWER
    if len(retrieved) == 0:
        return 0
    for el in retrieved[:5]:
        if el in relevant:
            return 1
    return 0
    # END ANSWER
    
success_at_5(relevant, retrieved)

1

Similarly __implement success at rank 10__

> The correct result is 1.

In [5]:
def success_at_10(relevant, retrieved):
    # BEGIN ANSWER
    if len(retrieved) == 0:
        return 0
    for el in retrieved[:10]:
        if el in relevant:
            return 1
    return 0
    # END ANSWER
    
success_at_10(relevant, retrieved)

1
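The two functions above differ only in the cut-off, so they could be generalized to an arbitrary rank _k_. The sketch below is one possible way to do this; the name `success_at_k` is our own choice and is not required by the assignment.

```python
def success_at_k(relevant, retrieved, k):
    """Return 1 if any of the first k retrieved doc-IDs is relevant, else 0."""
    # The slice handles empty or short lists, so no explicit length check is needed.
    return int(any(doc in relevant for doc in retrieved[:k]))


relevant = {2, 3, 5, 8, 13, 17, 21, 34, 38}
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

for k in (1, 2, 5, 10):
    print('Success@%d: %d' % (k, success_at_k(relevant, retrieved, k)))
```

On the example data the first relevant document (doc-ID 2) appears at rank 3, so the value switches from 0 to 1 at _k_ = 3.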

## Exercise 01.A: _Precision, Recall and F-measure_
__1. Implement _Precision_ using Formula 8.1 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

>_Hint:_ one can count the number of documents in a list using the built-in Python function [len()](https://docs.python.org/3/library/functions.html#len) \
> _example:_ `len(retrieved)` for the number of retrieved documents. 

In [6]:
#THIS IS GRADED!

def precision(relevant, retrieved):
    # BEGIN ANSWER
    return len(set(retrieved).intersection(relevant)) / len(retrieved)
    # END ANSWER
    
precision(relevant, retrieved)

0.26666666666666666

__2. Implement _Recall_ using Formula 8.2 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

In [7]:
#THIS IS GRADED!

def recall(relevant, retrieved):
    # BEGIN ANSWER
    return len(set(retrieved).intersection(relevant)) / len(relevant)
    # END ANSWER
    
recall(relevant, retrieved)

0.4444444444444444

__3. Implement the balanced F measure (_F_ with β=1) using Formula 8.6 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

> Tip: you may reuse your implementations of precision and recall

In [8]:
#THIS IS GRADED!

def f_measure(relevant, retrieved):
    # BEGIN ANSWER
    p = precision(relevant, retrieved)
    r = recall(relevant, retrieved)
    if p + r == 0:
        return 0  # no relevant document retrieved: avoid dividing by zero
    return 2 * p * r / (p + r)
    # END ANSWER
    
f_measure(relevant, retrieved)

0.33333333333333337
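Following the earlier suggestion to verify the metrics manually, the three values above can be checked with exact fractions. This is only a sanity-check sketch using the standard `fractions` module; it is independent of the graded functions.

```python
from fractions import Fraction

relevant = {2, 3, 5, 8, 13, 17, 21, 34, 38}
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

# Doc-IDs 2, 8, 17 and 34 are both retrieved and relevant.
hits = len(relevant & set(retrieved))

p = Fraction(hits, len(retrieved))   # 4 of the 15 retrieved documents are relevant
r = Fraction(hits, len(relevant))    # 4 of the 9 relevant documents are retrieved
f1 = 2 * p * r / (p + r)             # harmonic mean of precision and recall

print(p, r, f1)
```

The exact results are 4/15, 4/9 and 1/3, which agree with the floating-point outputs above up to rounding.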

## Exercise 01.B: _Precision at rank k_ and  _R-Precision_

Precision, Recall and F are _set_-based measures and suited for unranked lists of documents. If our search system returns a ranked _list_ of results, we can measure precision for several cut-off levels _k_ in the ranked list, i.e. we evaluate the relevance of the TOP-_k_ retrieved documents _(see lecture 01 slides and the book chapter)_. 


**1. Implement the function `precision_at_k()` that measures the precision at rank _k_**

> Interesting fact: For _k_=1, the _Precision at rank 1_ would be the same as _Success at rank 1_ (why?) 

In [9]:
#THIS IS GRADED!

def precision_at_k(relevant, retrieved, k):
    # BEGIN ANSWER
    if k == 0:
        return 0
    else:
        return len(set(retrieved[:k]).intersection(relevant)) / len(retrieved[:k])
    # END ANSWER

print('Pr@1: %1.2f' % precision_at_k(relevant, retrieved, k=1))
print('Pr@5: %1.2f' % precision_at_k(relevant, retrieved, k=5))
print('Pr@10: %1.2f' % precision_at_k(relevant, retrieved, k=10))


Pr@1: 0.00
Pr@5: 0.20
Pr@10: 0.40


__2. Implement R-Precision as defined in Chapter 8 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book)__.

In [10]:
#THIS IS GRADED!

def r_precision(relevant, retrieved):
    # BEGIN ANSWER
    return len(set(retrieved[:len(relevant)]).intersection(relevant)) / len(retrieved[:len(relevant)])
    # END ANSWER   
r_precision(relevant, retrieved)


0.3333333333333333

## Exercise 01.D:  Interpolated precision at _recall_ X

Another way to address ranked retrieval is to measure precision for several _recall_ levels _X_.

__Implement the function `interpolated_precision_at_recall_X()` that measures the interpolated precision at recall level _X_ as defined by formula 8.7 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).__

> Tip: calculate for each rank the recall. If the recall is greater than or equal to X, 
> calculate the precision. Keep the highest (maximum) precision of those to be returned at the end.

In [11]:

#THIS IS GRADED!

def interpolated_precision_at_recall_X (relevant, retrieved, X):
    # BEGIN ANSWER
    max_p = 0
    for rank in range(1, len(retrieved) + 1):  # include the last rank as well
        rec = recall(relevant, retrieved[:rank])
        if rec >= X:
            prec = precision(relevant, retrieved[:rank])
            if prec > max_p:
                max_p = prec
    return max_p
        
    # END ANSWER
    
 

print('Pr_i@Re01: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.1))
print('Pr_i@Re02: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.2))
print('Pr_i@Re03: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.3))
print('Pr_i@Re04: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.4))
print('Pr_i@Re05: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.5))
print('Pr_i@Re06: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.6))
print('Pr_i@Re07: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.7))
print('Pr_i@Re08: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.8))
print('Pr_i@Re09: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=0.9))
print('Pr_i@Re10: %1.2f' % interpolated_precision_at_recall_X(relevant, retrieved, X=1))

Pr_i@Re01: 0.40
Pr_i@Re02: 0.40
Pr_i@Re03: 0.40
Pr_i@Re04: 0.40
Pr_i@Re05: 0.00
Pr_i@Re06: 0.00
Pr_i@Re07: 0.00
Pr_i@Re08: 0.00
Pr_i@Re09: 0.00
Pr_i@Re10: 0.00
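The answer above recomputes precision and recall from scratch at every rank. As an alternative, a single pass that keeps a running count of relevant documents gives the same values on this example. The sketch below assumes that `retrieved` contains no duplicate doc-IDs; the function name `interpolated_precision_single_pass` is our own.

```python
def interpolated_precision_single_pass(relevant, retrieved, X):
    """Highest precision over all ranks whose recall is at least X."""
    if not relevant:
        return 0.0  # recall is undefined without relevant documents
    hits = 0
    best = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        # Recall at this rank is hits / |relevant|, precision is hits / rank.
        if hits / len(relevant) >= X:
            best = max(best, hits / rank)
    return best


relevant = {2, 3, 5, 8, 13, 17, 21, 34, 38}
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

for X in (0.1, 0.4, 0.5):
    print('Pr_i@Re%.1f: %1.2f' % (X, interpolated_precision_single_pass(relevant, retrieved, X)))
```

Recall never exceeds 4/9 on this run, which is why every recall level from 0.5 upwards yields 0.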


## Exercise 01.E:  _Average Precision_

For a single information need, _Average Precision_ is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved (see [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book), Pages 159 and 160). 

__Implement _Average Precision_ for a single information need.__

In [12]:
#THIS IS GRADED!

def average_precision(relevant, retrieved):
    # BEGIN ANSWER
    numerator = 0
    for i in range(len(retrieved)):
        if retrieved[i] in relevant:
            numerator += precision_at_k(relevant, retrieved, i+1)
    
    return numerator/len(relevant)
            
    # END ANSWER

average_precision(relevant, retrieved)

0.15555555555555556
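The value above can also be checked by hand: the relevant documents appear at ranks 3, 6, 9 and 10, so the precision values to average are 1/3, 2/6, 3/9 and 4/10, and the sum is divided by the 9 relevant documents. The sketch below reproduces this arithmetic with exact fractions; it is a verification aid, not a replacement for the graded function.

```python
from fractions import Fraction

relevant = {2, 3, 5, 8, 13, 17, 21, 34, 38}
retrieved = [14, 4, 2, 18, 16, 8, 46, 32, 17, 34, 33, 22, 47, 39, 11]

precisions = []  # precision at each rank where a relevant document is found
hits = 0
for rank, doc in enumerate(retrieved, start=1):
    if doc in relevant:
        hits += 1
        precisions.append(Fraction(hits, rank))

# Relevant documents that are never retrieved contribute a precision of 0.
ap = sum(precisions) / len(relevant)

print(ap, float(ap))
```

The exact result is 7/45, which matches the floating-point output above.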

***
## Performance measures in TREC benchmarks

The relevance judgments are provided by TREC in so-called _"qrels"_ files that look as follows:

    1000 Q0 1341 1
    1000 Q0 1231 0
    1001 Q0 12332 1
     ...

The columns of the _qrels_ file contain:
1. the query identifier
2. the query number within that topic (currently unused and should always be Q0)
3. the document identifier that was examined by the judges
4. the relevance of the document (_1_:relevant; _0_: not relevant).

Below we provide some Python code that reads the _qrels_ and the _run_. The qrels will be put in the Python dictionary `all_relevant`. A [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) provides quick lookup of a value given a key. We will use the `query_id` as the key and a [Python set](https://docs.python.org/3/tutorial/datastructures.html#sets) of relevant document identifiers as the value. For the partial qrels file above, `all_relevant` would look as follows:

    {
        "1000": set(["1341"]),
        "1001": set(["12332"])
    }
    
We will use a dictionary called `all_retrieved` with `query_id` as key, and as value a [Python list](https://docs.python.org/3/tutorial/introduction.html#lists) of document identifiers retrieved by the IR system:

    {
        "1000": ["1341", "12346", "2345"],
        "1001": [..., ..., ...],
        ...
    }

Note that, with this data structure, for each `query_id` we can easily access the list of retrieved and relevant documents, and compute the performance metrics. We can then average these measures over all the queries to compute the mean performance of the IR system on the given retrieval task.
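To make the averaging step concrete, here is a minimal sketch on two toy dictionaries. The helper names `mean_over_queries` and `precision_sketch`, as well as the doc-IDs in the toy data, are made up for illustration and are not part of the provided code.

```python
def precision_sketch(relevant, retrieved):
    """Set-based precision; returns 0 for an empty result list."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & relevant) / len(retrieved)


def mean_over_queries(metric, all_relevant, all_retrieved):
    """Average metric(relevant, retrieved) over queries present in both dictionaries."""
    query_ids = [qid for qid in all_retrieved if qid in all_relevant]
    if not query_ids:
        return 0.0
    total = sum(metric(all_relevant[qid], all_retrieved[qid]) for qid in query_ids)
    return total / len(query_ids)


# Toy data (hypothetical doc-IDs), shaped like all_relevant and all_retrieved
toy_relevant = {"1000": {"1341"}, "1001": {"12332"}}
toy_retrieved = {"1000": ["1341", "12346", "2345"], "1001": ["999", "12332"]}

mean_precision = mean_over_queries(precision_sketch, toy_relevant, toy_retrieved)
print('Mean precision: %1.4f' % mean_precision)
```

Passing the metric as a function argument means the same helper can be reused for any measure with the signature `metric(relevant, retrieved)`.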

Please examine the code below, and make sure you understand every line. Use the Python documentation where needed.

### DATA: the TREC genomics benchmark

For the following exercises, we will use a subset of the TREC genomics document collection and queries. 
It is stored in the folder `data01/` in the directory where you have been instructed to place the assignment notebooks (`/`).

The collection contains:

* `FIR-s05-medline.json` (the collection in Elasticsearch batch format - because of its size it cannot be indexed with a single curl command!)
* `FIR-s05-training-queries-simple.txt` (test queries)
* `FIR-s05-training-qrels.txt` (the "relevance judgements" for the test queries, i.e. the correct answers)

> ___Note___ that these files contain a subset of the documents and queries of the TREC genomics track benchmark, to facilitate experimentation with less computation time.
> The original files are also included in the `data01/` directory, without the `FIR-s05-` prefix (you may use them for the final project).

To make things easy, the data is already provided in Elasticsearch's batch processing format. 
Inspect the collection file in the terminal:

`head FIR-s05-medline.json`

This shows the first 5 documents in the collection (in JSON format prepared for ElasticSearch, as you have seen in the tutorial)
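In this batch layout each document occupies two lines, an action line followed by the document itself, which is why 10 lines correspond to 5 documents. The sketch below parses such pairs from an in-memory string. It assumes every entry uses an `index` action, and the `_id` values and the `title` field are hypothetical, since the actual fields of `FIR-s05-medline.json` are not shown here.

```python
import json

# Two made-up documents in batch layout: an action line, then a source line.
sample = "\n".join([
    '{"index": {"_id": "1"}}',
    '{"title": "first hypothetical document"}',
    '{"index": {"_id": "2"}}',
    '{"title": "second hypothetical document"}',
])


def read_bulk_pairs(text):
    """Yield (action, source) dictionaries from alternating lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    for action_line, source_line in zip(lines[0::2], lines[1::2]):
        yield json.loads(action_line), json.loads(source_line)


docs = {action["index"]["_id"]: source for action, source in read_bulk_pairs(sample)}
print(docs)
```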

#### Baseline model and results
We also provide the list of retrieved documents by a _baseline_ model, in the file `data01/baseline.run`. For each query, it contains the list of document IDs of the retrieved documents (to be compared with those in the qrels file). We use this file in the examples and evaluation exercises below. 

In [13]:
def read_qrels_file(qrels_file):  # reads the content of the qrels file
    trec_relevant = dict()  # query_id -> set([docid1, docid2, ...])
    with open(qrels_file, 'r') as qrels:
        for line in qrels:
            (qid, q0, doc_id, rel) = line.strip().split()
            if qid not in trec_relevant:
                trec_relevant[qid] = set()
            if (rel == "1"):
                trec_relevant[qid].add(doc_id)
    return trec_relevant

def read_run_file(run_file):  
    # read the content of the run file produced by our IR system 
    # (in the following exercises you will create your own run_files)
    trec_retrieved = dict()  # query_id -> [docid1, docid2, ...]
    with open(run_file, 'r') as run:
        for line in run:
            (qid, q0, doc_id, rank, score, tag) = line.strip().split()
            if qid not in trec_retrieved:
                trec_retrieved[qid] = []
            trec_retrieved[qid].append(doc_id) 
    return trec_retrieved
    

def read_eval_files(qrels_file, run_file):
    return read_qrels_file(qrels_file), read_run_file(run_file)

(all_relevant, all_retrieved) = read_eval_files('data01/FIR-s05-training-qrels.txt', 'data01/baseline.run')
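Since later exercises ask you to create your own run files, it may help to see the inverse of `read_run_file()`. The sketch below writes the six whitespace-separated columns that `read_run_file()` expects (`qid`, `Q0`, `doc_id`, `rank`, `score`, `tag`). The function name `write_run_file`, the 1-based rank, the scores and the tag `sketch` are assumptions made for illustration.

```python
import os
import tempfile


def write_run_file(run_file, ranked, tag="sketch"):
    """Write {qid: [(doc_id, score), ...]} as 'qid Q0 doc_id rank score tag' lines."""
    with open(run_file, "w") as run:
        for qid, results in ranked.items():
            for rank, (doc_id, score) in enumerate(results, start=1):
                run.write(f"{qid} Q0 {doc_id} {rank} {score} {tag}\n")


# Toy ranking with made-up scores
toy_ranking = {"1": [("11929828", 2.5), ("11751903", 1.75)]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.run")
    write_run_file(path, toy_ranking)
    with open(path) as run:
        lines = run.read().splitlines()

print(lines)
```

Each written line splits into exactly six fields, so a file produced this way can be read back with `read_run_file()`.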

### _Number of queries_ and _number of retrieved documents per query_
 
The following code counts the number of queries evaluated in the file `baseline.run` (provided in the `data01/` folder, containing the list of doc-ids retrieved using a baseline model) and prints it (use the result from the cell above). For each query, it also prints the number of documents that were retrieved for that query.

In [14]:
print('Number of queries: %d' % len(all_retrieved))

for qid in all_retrieved:
    print ('Docs retrieved for query #{}: {}'.format(qid, str(len(all_retrieved[qid]))))

Number of queries: 38
Docs retrieved for query #1: 1000
Docs retrieved for query #3: 1000
Docs retrieved for query #4: 1000
Docs retrieved for query #5: 1000
Docs retrieved for query #6: 1000
Docs retrieved for query #7: 1000
Docs retrieved for query #8: 1000
Docs retrieved for query #9: 1000
Docs retrieved for query #10: 1000
Docs retrieved for query #11: 1000
Docs retrieved for query #12: 1000
Docs retrieved for query #13: 1000
Docs retrieved for query #14: 1000
Docs retrieved for query #15: 1000
Docs retrieved for query #16: 1000
Docs retrieved for query #18: 1000
Docs retrieved for query #20: 1000
Docs retrieved for query #22: 1000
Docs retrieved for query #23: 1000
Docs retrieved for query #24: 1000
Docs retrieved for query #25: 1000
Docs retrieved for query #27: 1000
Docs retrieved for query #28: 1000
Docs retrieved for query #29: 1000
Docs retrieved for query #31: 1000
Docs retrieved for query #32: 1000
Docs retrieved for query #34: 1000
Docs retrieved for query #36:

For your own understanding, __inspect the structure and content of the `all_retrieved` and `all_relevant` data structures__ to understand them better. Use the `print()` function to see the content of the data structures.

In [15]:
# write here the code to inspect the data structures
from pprint import pprint
print(all_retrieved.keys())
pprint(all_retrieved) # dict: query_id -> list of retrieved doc-IDs


dict_keys(['1', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '18', '20', '22', '23', '24', '25', '27', '28', '29', '31', '32', '34', '36', '37', '38', '39', '40', '42', '44', '45', '46', '48', '50'])
{'1': ['11929828',
       '11751903',
       '12384701',
       '12065641',
       '11980715',
       '12126481',
       '12455049',
       '12444545',
       '12431783',
       '12204896',
       '12119358',
       '12242284',
       '11886527',
       '11779850',
       '12203364',
       '12110586',
       '11767002',
       '12115564',
       '11827966',
       '12112322',
       '11762751',
       '12368211',
       '12055678',
       '11940356',
       '11989975',
       '11862714',
       '11756412',
       '12203371',
       '12173048',
       '11809764',
       '12124333',
       '11879190',
       '12080324',
       '12079680',
       '12363184',
       '12214254',
       '11950701',
       '11882322',
       '12098019',
       '12495933',
       '

       '11880333',
       '12231382',
       '11943203',
       '12182878',
       '12130660',
       '11971762',
       '12393857',
       '12574114',
       '11836553',
       '12136008',
       '11960010',
       '12388661',
       '12093738',
       '12147251',
       '11893918',
       '12230554',
       '12052892',
       '11895924',
       '11909607',
       '11774038',
       '12084709',
       '12217326',
       '12176996',
       '12097295',
       '12362432',
       '12149231',
       '12153166',
       '11965495',
       '11956195',
       '12371905',
       '11962755',
       '12203814',
       '12007147',
       '12115727',
       '12198773',
       '11920194',
       '11854065',
       '11815601',
       '11909854',
       '12543805',
       '12533508',
       '12149264',
       '12147293',
       '12203396',
       '12181191',
       '11889563',
       '12417711',
       '12028440',
       '12167697',
       '11784717',
       '12047228',
       '12069500',
       '1202

        '12196474',
        '12366817',
        '12390803',
        '12427847',
        '12521647',
        '12538178',
        '11873965',
        '11906188',
        '11964174',
        '12127819',
        '11972332',
        '12077850',
        '11723484',
        '11786500',
        '12446688',
        '12202167',
        '11984962',
        '12149272',
        '12391024',
        '12139748',
        '12059821',
        '12271276',
        '11749377',
        '11950247',
        '12134540',
        '11991241',
        '12113569',
        '12438094',
        '12427468',
        '11718705',
        '12075802',
        '12485864',
        '12223426',
        '12460196',
        '11975893',
        '12518265',
        '11921019',
        '12153727',
        '11741892',
        '12463798',
        '11900841',
        '12244047',
        '12096914',
        '12013332',
        '11914646',
        '12167117',
        '11912207',
        '12545900',
        '12532736',
        '11920289',


        '12395995',
        '12215542',
        '12558974',
        '11751906',
        '11756443',
        '12424403',
        '11893235',
        '12422331',
        '12439663',
        '12171996',
        '11788600',
        '12149263',
        '12107278',
        '12083524',
        '11839793',
        '12176907',
        '12079632',
        '12221077',
        '11907521',
        '12210370',
        '11931639',
        '12431389',
        '12505155',
        '12215546',
        '12220675',
        '11964570',
        '12244319',
        '12471888',
        '12115629',
        '11727829',
        '12112020',
        '11853696',
        '11782604',
        '12427903',
        '12374399',
        '12554656',
        '12087728',
        '12006660',
        '12242934',
        '11842095',
        '11836232',
        '11954869',
        '12374986',
        '11809763',
        '12496151',
        '12370244',
        '12270917',
        '12102560',
        '12054821',
        '12387898',


        '12354107',
        '12390015',
        '12438247',
        '11839775',
        '12538822',
        '12105232',
        '11874466',
        '12145804',
        '11953433',
        '12091385',
        '12591241',
        '12545801',
        '12052828',
        '11876294',
        '11916662',
        '12149264',
        '12438245',
        '12424224',
        '12362432',
        '12560494',
        '12203814',
        '11894618',
        '12032348',
        '11997045',
        '12612907',
        '12359219',
        '12563018',
        '12442347',
        '12373606',
        '12571357',
        '12595558',
        '11955286',
        '12404241',
        '12176064',
        '11828408',
        '12094700',
        '12438124',
        '11867517',
        '12562790',
        '12112450',
        '11983700',
        '12028664',
        '12221282',
        '12055254',
        '12069807',
        '12012460',
        '12021258',
        '11741949',
        '11877430',
        '12116281',


        '12388149',
        '12151787',
        '12073149',
        '12115600',
        '12071753',
        '12017180',
        '12083325',
        '12525377',
        '12566316',
        '12223541',
        '12176891',
        '11969508',
        '12139015',
        '12417256',
        '12196117',
        '12235027',
        '11958955',
        '12393547',
        '11861015',
        '11959138',
        '11994966',
        '11850064',
        '12427030',
        '11986954',
        '12412761',
        '11725690',
        '12010214',
        '12400445',
        '12187771',
        '11962515',
        '11814315',
        '11744734',
        '12206767',
        '12020777',
        '12009976',
        '11880364',
        '12086911',
        '12023070',
        '12212197',
        '12352682',
        '11891239',
        '12351672',
        '12051684',
        '12413072',
        '12183655',
        '11886847',
        '12126553',
        '12524657',
        '12011702',
        '11818520',


        '11744366',
        '11861560',
        '11864605',
        '11959669',
        '11965438',
        '11973303',
        '11976688',
        '11976950',
        '12042820',
        '12054512',
        '12136022',
        '12168620',
        '12207040',
        '12367628',
        '12479377',
        '11914937',
        '11906210',
        '12530959',
        '12124901',
        '12039744',
        '11898863',
        '12242667',
        '12218096',
        '11969261',
        '12417715',
        '11932949',
        '12074588',
        '11937490',
        '11978860',
        '12034840',
        '12003135',
        '12006491',
        '12034023',
        '12535522',
        '12215391',
        '12508276',
        '11695180',
        '11931103',
        '12023361',
        '11929599',
        '12137734',
        '12177427',
        '12367510',
        '12124764',
        '11834372',
        '11988765',
        '12025404',
        '12397110',
        '12205397',
        '12203721',


        '12428456',
        '12070284',
        '12467982',
        '12054017',
        '12242056',
        '12445465',
        '11875759',
        '11857584',
        '12054775',
        '12077454',
        '12121414',
        '12393395',
        '12605687',
        '11775190',
        '11905744',
        '12110242',
        '12113999',
        '12143401',
        '12151774',
        '12236025',
        '12399326',
        '12437832',
        '12480054',
        '12164311',
        '12428781',
        '12390037',
        '11846675',
        '12009566',
        '12208977',
        '11859078',
        '12271459',
        '11960996',
        '12417310',
        '12297206',
        '12423249',
        '11752778',
        '12124445',
        '12426600',
        '11869584',
        '11894322',
        '12018513',
        '12096408',
        '12151144',
        '12271836',
        '12387686',
        '12409552',
        '12418038',
        '12456881',
        '12520584',
        '12528617',


        '12205865',
        '12365403',
        '12368033',
        '12569160',
        '12574025',
        '12601309',
        '12456074',
        '12163594',
        '11814165',
        '11820419',
        '11849703',
        '11884065',
        '11926079',
        '11926722',
        '11943123',
        '12010837',
        '12011161',
        '12019297',
        '12120420',
        '12153828',
        '12153930',
        '12163277',
        '12189257',
        '12207329',
        '12218654',
        '12270104',
        '12391569',
        '12394276',
        '12404675',
        '12411815',
        '12440093',
        '12485326',
        '11940212',
        '12417308',
        '12438085',
        '12054559',
        '11950979',
        '12068803',
        '11897801',
        '12191777',
        '12225865',
        '12525650',
        '12426468',
        '11887007',
        '12072526',
        '11793386',
        '11850407',
        '11880575',
        '11941898',
        '12376669',


        '12034046',
        '12161509',
        '11861600',
        '12479521',
        '12533311',
        '12444914',
        '12154029',
        '12023551',
        '12388159',
        '12367608',
        '11735249',
        '12493245',
        '11796712',
        '12112335',
        '12070674',
        '11909416',
        '12009920',
        '12100360',
        '12364492',
        '12107166',
        '12433051',
        '12162733',
        '12107675',
        '11882385',
        '12076660',
        '11896619',
        '12084453',
        '12455001',
        '11856764',
        '11978790',
        '12415111',
        '12137501',
        '11858750',
        '12398538',
        '12402260',
        '12355354',
        '11899296',
        '12050554',
        '11940301',
        '11888245',
        '12055072',
        '11771419',
        '12053026',
        '12054608',
        '11822540',
        '11937599',
        '11973208',
        '12114512',
        '11850023',
        '12123718',


        '12130825',
        '12189152',
        '12154014',
        '12196157',
        '11962246',
        '12228317',
        '12445738',
        '11983446',
        '11904962',
        '11776683',
        '12067977',
        '12525103',
        '11916181',
        '11975986',
        '12498792',
        '11788595',
        '12069182',
        '12394376',
        '12057750',
        '12077736',
        '11824562',
        '12237284',
        '11855294',
        '12363420',
        '12243751',
        '11777919',
        '12230117',
        '12384057',
        '12023442',
        '12143039',
        '12162435',
        '11911421',
        '12062635',
        '11819730',
        '11886496',
        '12586778',
        '12411578',
        '11861935',
        '12200125',
        '11956588',
        '12296295',
        '11895936',
        '11947922',
        '12135752',
        '12543804',
        '12154423',
        '11969180',
        '12239152',
        '12388624',
        '12606705',


        '12590320',
        '11989792',
        '12518848',
        '12460404',
        '12421807',
        '12391725',
        '12016124',
        '12208950',
        '12487806',
        '11874242',
        '12198614',
        '11796350',
        '12587528',
        '12426230',
        '12174791',
        '11948062',
        '12195386',
        '12375741',
        '12368665',
        '12027302',
        '12466384',
        '11849840',
        '12221085',
        '12324716',
        '12009762',
        '12370743',
        '12091878',
        '12417077',
        '11893849',
        '12552448',
        '11785337',
        '11830478',
        '12182233',
        '12575348',
        '12138198',
        '11986602',
        '11809531',
        '12477270',
        '12182494',
        '12575775',
        '12004607',
        '12601362',
        '12221586',
        '12414627',
        '12045357',
        '12120892',
        '11962515',
        '11773073',
        '11893437',
        '11761333',


        '12402613',
        '11884466',
        '12048173',
        '12177138',
        '12237339',
        '11741961',
        '12087165',
        '11992576',
        '12096136',
        '12374567',
        '12167627',
        '12234597',
        '12368275',
        '12021325',
        '12397075',
        '12147528',
        '12556328',
        '11807254',
        '11905033',
        '12409672',
        '12167698',
        '12102354',
        '12032122',
        '12079563',
        '12478065',
        '12443340',
        '12403296',
        '12006557',
        '11870090',
        '12234620',
        '11916956',
        '11972632',
        '12537437',
        '12355524',
        '12581652',
        '12016009',
        '12370291',
        '12441126',
        '12039909',
        '12368284',
        '11994446',
        '12504106',
        '12420796',
        '12496416',
        '12120667',
        '12601357',
        '12167302',
        '12446004',
        '11985591',
        '12568342',


        '12444016',
        '12488416',
        '12505980',
        '12536838',
        '11937878',
        '12130572',
        '11884546',
        '12379844',
        '11848461',
        '11854270',
        '12403624',
        '11820931',
        '11922141',
        '12547209',
        '12441114',
        '11881412',
        '11958535',
        '11966699',
        '11994971',
        '12161751',
        '12180062',
        '12183409',
        '12200409',
        '12244622',
        '12413662',
        '12458198',
        '12485889',
        '12511087',
        '12575217',
        '12594231',
        '12244063',
        '11878564',
        '12419257',
        '11778873',
        '11980726',
        '12176071',
        '12470887',
        '11809751',
        '11801731',
        '12038451',
        '12239337',
        '11896461',
        '11932921',
        '12015304',
        '12127069',
        '12379729',
        '12228058',
        '12407685',
        '12426378',
        '12237119',


        '12225896',
        '12145096',
        '12090785',
        '11940567',
        '12401212',
        '12504923',
        '12036354',
        '12421360',
        '12123693',
        '11744163',
        '12027461',
        '11779037',
        '12061874',
        '11829303',
        '12527474',
        '12459018',
        '12198333',
        '12396876',
        '12167621',
        '12110612',
        '11888549',
        '11998397',
        '12598643',
        '12063077',
        '11937087',
        '11914115',
        '12488534',
        '12209562',
        '11897633',
        '12163545',
        '12569562',
        '12122505',
        '11738567',
        '12207967',
        '11831548',
        '12230916',
        '12110570',
        '12204896',
        '12445963',
        '12030006',
        '11875658',
        '12166950',
        '11964087',
        '12124444',
        '12166935',
        '12184995',
        '12069829',
        '11771748',
        '12454989',
        '12181311',


        '11914023',
        '12195808',
        '12149254',
        '12543784',
        '11963465',
        '11952872',
        '12038622',
        '12370537',
        '12471325',
        '12025890',
        '12109035',
        '11914745',
        '12488361',
        '12171170',
        '12217855',
        '11694079',
        '12050157',
        '12375805',
        '12003331',
        '12575226',
        '12601028',
        '11999354',
        '11886167',
        '12234623',
        '11984981',
        '12176022',
        '12063292',
        '12208074',
        '12208766',
        '11877389',
        '11971931',
        '11942535',
        '11990525',
        '12130571',
        '12533263',
        '12514134',
        '12172555',
        '12086964',
        '12371959',
        '12080908',
        '12203668',
        '12588768',
        '12374299',
        '12384485',
        '11956602',
        '11912163',
        '12572885',
        '11716154',
        '11897806',
        '11936485',


        '11803594',
        '11803877',
        '11804036',
        '11804129',
        '11804233',
        '11808658',
        '11821924',
        '11825008',
        '11827184',
        '11831220',
        '11835488',
        '11836923',
        '11837398',
        '11858150',
        '11859793',
        '11862679',
        '11862915',
        '11863143',
        '11867930',
        '11873319',
        '11874775',
        '11874845',
        '11875359',
        '11879761',
        '11880719',
        '11880903',
        '11882357',
        '11885345',
        '11886925',
        '11888370',
        '11889870',
        '11891016',
        '11895584',
        '11897618',
        '11898385',
        '11899510',
        '11900038',
        '11900365',
        '11903103',
        '11903905',
        '11904690',
        '11905236',
        '11906069',
        '11906781',
        '11906986',
        '11912835',
        '11914118',
        '11915693',
        '11916010',
        '11919497',


       '12065772',
       '11804102',
       '11943339',
       '11914402',
       '12359741',
       '12377766',
       '12132876',
       '11939509',
       '12440775',
       '12092657',
       '11997509',
       '11879554',
       '11866532',
       '12377004',
       '12086877',
       '12405259',
       '12592383',
       '12386824',
       '12150958',
       '12213818',
       '11964133',
       '12062858',
       '11911850',
       '12407683',
       '12414521',
       '12216069',
       '12372840',
       '11896592',
       '12297544',
       '12421830',
       '12574201',
       '11872709',
       '12115994',
       '12437512',
       '12112017',
       '12476790',
       '12363395',
       '12240375',
       '11975969',
       '12359731',
       '12086860',
       '11943721',
       '11960311',
       '12513920',
       '12124355',
       '11857543',
       '12118377',
       '12517951',
       '12437293',
       '12180649',
       '12465931',
       '12431784',
       '1198

       '11846985',
       '11919388',
       '11792588',
       '12203833',
       '12517793',
       '12500631',
       '12154226',
       '12212378',
       '12183675',
       '12540572',
       '12553908',
       '11875597',
       '12565816',
       '12355494',
       '12399978',
       '12296933',
       '12162776',
       '12139420',
       '12131020',
       '12138960',
       '12244085',
       '12583973',
       '12620741',
       '11902091',
       '11987975',
       '12121369',
       '12186530',
       '12209527',
       '12456990',
       '11949846',
       '12073009',
       '12384552',
       '12401729',
       '12436481',
       '12202228',
       '11838608',
       '12078048',
       '12032836',
       '11870227',
       '11880201',
       '12466622',
       '12023898',
       '11962602',
       '12031504',
       '12128217',
       '12361502',
       '12224391',
       '12031468',
       '12097480',
       '12370174',
       '12134031',
       '12133587',
       '1205

        '12003115',
        '12117068',
        '12170649',
        '11952349',
        '12008139',
        '11775086',
        '12083530',
        '12075399',
        '11857048',
        '12029084',
        '12473914',
        '11843363',
        '11925174',
        '12431177',
        '11889584',
        '12045347',
        '11885811',
        '11708758',
        '12167590',
        '12052024',
        '12468802',
        '12513167',
        '12207328',
        '12202016',
        '11747129',
        '11870904',
        '12162334',
        '11770996',
        '11761080',
        '11934476',
        '11909041',
        '11829210',
        '12594053',
        '12031982',
        '11804535',
        '12081502',
        '11749169',
        '12402245',
        '11839053',
        '11890362',
        '12012784',
        '11985059',
        '12095532',
        '11891150',
        '12074224',
        '12096059',
        '12485085',
        '12165394',
        '12032847',
        '12038741',


        '12239457',
        '12073608',
        '12228246',
        '12379516',
        '12142389',
        '12213258',
        '11979926',
        '12082956',
        '12394039',
        '12576260',
        '12169383',
        '12058063',
        '12193914',
        '11788792',
        '11961504',
        '12415623',
        '12063562',
        '11706286',
        '11882494',
        '12353810',
        '11981030',
        '12203035',
        '11929766',
        '12359859',
        '12468456',
        '12147290',
        '12323121',
        '12078201',
        '12071468',
        '12453737',
        '11981757',
        '11815367',
        '12218182',
        '11877543',
        '12379955',
        '12359222',
        '12428995',
        '11988133',
        '12144815',
        '12043562',
        '11910705',
        '12209854',
        '12025675',
        '12000748',
        '12083416',
        '12097326',
        '12113879',
        '12104076',
        '12117075',
        '12064474',


        '12218537',
        '12235253',
        '12213517',
        '12603818',
        '12021825',
        '12213656',
        '11850144',
        '12125043',
        '11860464',
        '11891282',
        '12127200',
        '11916368',
        '12228895',
        '12462416',
        '12452975',
        '12064655',
        '12442209',
        '12505656',
        '12094914',
        '11967634',
        '12404079',
        '11936576',
        '11809514',
        '12105128',
        '12105102',
        '11994130',
        '12490616',
        '11920936',
        '12213518',
        '12070162',
        '12486179',
        '11954652',
        '12023530',
        '11838768',
        '12464459',
        '12182882',
        '12213776',
        '11920720',
        '12388642',
        '11830362',
        '12574402',
        '12521669',
        '11841897',
        '12420795',
        '11755138',
        '11922858',
        '12213712',
        '12394416',
        '12079598',
        '12098701',


        '11997106',
        '12196399',
        '12509240',
        '12486147',
        '11902836',
        '11862451',
        '11910118',
        '12215542',
        '12468090',
        '12453727',
        '12510193',
        '11944984',
        '12044495',
        '12073549',
        '12110602',
        '12198500',
        '12408640',
        '12456751',
        '12459199',
        '11966878',
        '12111533',
        '12510194',
        '11875367',
        '11884612',
        '12019234',
        '12399311',
        '12436253',
        '11891299',
        '12188050',
        '12213251',
        '12361959',
        '12242239',
        '11976956',
        '12556490',
        '11695180',
        '11960010',
        '12086463',
        '11960020',
        '12208233',
        '11809829',
        '11971982',
        '11991710',
        '12006374',
        '12105418',
        '12545177',
        '12124764',
        '12029485',
        '12483226',
        '11888271',
        '12058017',


        '11985591',
        '12074273',
        '12176330',
        '12217327',
        '12447383',
        '12110665',
        '12361719',
        '12470576',
        '12511959',
        '12135920',
        '12163413',
        '12006495',
        '12459267',
        '12364587',
        '11886777',
        '11889557',
        '12009783',
        '12456656',
        '12191612',
        '11990765',
        '12223408',
        '12242505',
        '12466196',
        '11744372',
        '12408863',
        '12591247',
        '11744389',
        '12087134',
        '12117817',
        '11783006',
        '11861759',
        '11923211',
        '12151639',
        '11929995',
        '11972332',
        '12061953',
        '11786535',
        '11964389',
        '12110842',
        '11781497',
        '11944935',
        '11973283',
        '12050152',
        '12050669',
        '12111997',
        '12173684',
        '12226665',
        '12466201',
        '11913812',
        '11943466',


        '12176328',
        '11933058',
        '11907158',
        '11972031',
        '12429738',
        '11878884',
        '12456666',
        '12234178',
        '11854264',
        '12379483',
        '11812827',
        '11988469',
        '12133946',
        '12220654',
        '12466492',
        '12242302',
        '12206784',
        '11867758',
        '11922755',
        '12193606',
        '12374802',
        '12076763',
        '12479403',
        '11866516',
        '11964403',
        '12458196',
        '12110665',
        '12101294',
        '11856373',
        '12460196',
        '11812777',
        '12173745',
        '11926060',
        '11959095',
        '11867528',
        '11858823',
        '12009882',
        '11744384',
        '12438588',
        '11834378',
        '11893338',
        '11895294',
        '12135916',
        '11901161',
        '12036274',
        '12511959',
        '11911886'],
 '38': ['12065414',
        '12084706',
        '11710527',

 '39': ['11710527',
        '11805056',
        '12572604',
        '12018164',
        '12198117',
        '12370302',
        '12364783',
        '12498730',
        '12610309',
        '12018165',
        '12213252',
        '12353025',
        '12225917',
        '12451171',
        '11749711',
        '12209126',
        '11965438',
        '12001949',
        '11809773',
        '11901172',
        '11948200',
        '11937514',
        '11809845',
        '11965658',
        '12196400',
        '11985896',
        '12386932',
        '12230547',
        '12036300',
        '12017506',
        '12490253',
        '12137756',
        '12362054',
        '12408796',
        '12379215',
        '11861556',
        '12419898',
        '12438685',
        '12186628',
        '11994755',
        '12471440',
        '12420787',
        '11972156',
        '12207040',
        '11805061',
        '11933981',
        '11974603',
        '12150949',
        '11854371',
        '11852789',


       '12615796',
       '11907576',
       '12467226',
       '11886498',
       '12406163',
       '12024954',
       '12149254',
       '12446727',
       '11751851',
       '12135910',
       '12200763',
       '11903885',
       '12047917',
       '12204896',
       '12410184',
       '11876536',
       '11760850',
       '11783969',
       '11834245',
       '11844039',
       '11908923',
       '11935281',
       '11938054',
       '12055392',
       '12076735',
       '12187079',
       '12223436',
       '12234288',
       '12234991',
       '12297114',
       '12298426',
       '12354288',
       '12387143',
       '12425822',
       '12429853',
       '12437643',
       '12476614',
       '12494452',
       '11956562',
       '12439723',
       '11805401',
       '11778913',
       '12438247',
       '11994280',
       '11890876',
       '12096842',
       '12023962',
       '11966774',
       '12141135',
       '12023835',
       '12045736',
       '12434309',
       '1195

        '12391249',
        '11870879',
        '11818516',
        '12081968',
        '12368275',
        '12135607',
        '12123460',
        '11950097',
        '12145330',
        '11958545',
        '11722774',
        '12196557',
        '12205098',
        '12194978',
        '12160064',
        '11992723',
        '12189141',
        '12270814',
        '11925432',
        '12068803',
        '12062059',
        '12527889',
        '11888667',
        '12107112',
        '11884525',
        '12370805',
        '11931321',
        '12095679',
        '11806635',
        '12397110',
        '12120007',
        '11847283',
        '12231630',
        '12090141',
        '12473098',
        '11967304',
        '11854265',
        '11976357',
        '12356733',
        '12191913',
        '12036298',
        '11847285',
        '12054864',
        '12391259',
        '12412658',
        '11854370',
        '12495751',
        '12174585',
        '12421316',
        '12074273',


        '12030011',
        '12128220',
        '12354653',
        '12414534',
        '12051715',
        '12479811',
        '12495622',
        '12368260',
        '12271490',
        '11893338',
        '11914941',
        '11973308',
        '12135916',
        '12147138',
        '12466192',
        '11861482',
        '11923209',
        '11934862',
        '12004617',
        '12141451',
        '12150502',
        '12151333',
        '12403709',
        '11956649',
        '11976200',
        '11990762',
        '12202036',
        '12222828',
        '12530131',
        '12036960',
        '11850406',
        '11923414',
        '12193652',
        '12234026',
        '12504588',
        '11983923',
        '12165468',
        '12231507',
        '12408801',
        '12504024',
        '12062107',
        '12081647',
        '12086617',
        '12408870',
        '12526814',
        '12196118',
        '11976952',
        '11874920',
        '12223397',
        '11812047',


        '11809831',
        '12493632',
        '11847218',
        '11879635',
        '12186954',
        '12378866',
        '12408366',
        '11846675',
        '12075078',
        '11997455',
        '12421668',
        '11744739',
        '11800270',
        '11879636',
        '12171153',
        '11943587',
        '12042876',
        '12526811',
        '12039850',
        '12093819',
        '11922232',
        '12135356',
        '11992003',
        '12054559',
        '11955960',
        '11859071',
        '12064770',
        '12007788',
        '12044884',
        '12163035',
        '11823445',
        '12124901',
        '12039744',
        '12490411',
        '12367747',
        '11908957',
        '12116250',
        '11877387',
        '12477848',
        '12473691',
        '12090816',
        '11997237',
        '12176897',
        '11964556',
        '12165133',
        '12223406',
        '12096140',
        '11920685',
        '12075937',
        '12372030',


        '12372334',
        '12390324',
        '12439489',
        '12464265',
        '12497009',
        '12518325',
        '12532041',
        '11937351',
        '12074554',
        '12205095',
        '12355433',
        '11988842',
        '12422021',
        '12453242',
        '12074798',
        '12427560',
        '12427877',
        '12470336',
        '11839812',
        '11897993',
        '11973354',
        '12063067',
        '11829600',
        '12498796',
        '11751860',
        '11788597',
        '11940311',
        '12547974',
        '12636205',
        '11853552',
        '12032079',
        '11825278',
        '11874082',
        '11898748',
        '11904293',
        '11937346',
        '11954847',
        '11958148',
        '12054639',
        '12088560',
        '12148995',
        '12197034',
        '12231534',
        '12372304',
        '12372448',
        '12379197',
        '12421975',
        '12428625',
        '12432820',
        '12437970',


        '11904379',
        '12059782',
        '11853552',
        '11955338',
        '12002945',
        '12063230',
        '12164930',
        '11853760',
        '12403787',
        '11837789',
        '11880361',
        '11924901',
        '11929927',
        '11930246',
        '11987469',
        '11988224',
        '11993328',
        '11999318',
        '12014921',
        '12051692',
        '12068953',
        '12094781',
        '12106668'],
 '48': ['11937514',
        '12527194',
        '12167152',
        '12468090',
        '12520032',
        '12547394',
        '12135477',
        '12525084',
        '12645611',
        '12018119',
        '12148460',
        '11950932',
        '12234520',
        '12566294',
        '12028792',
        '12088152',
        '12503676',
        '12231201',
        '11805079',
        '11934490',
        '12177301',
        '12039735',
        '12407179',
        '11960597',
        '12525161',
        '12581528',
        '12033939',

       '12210907',
       '12188756',
       '12490615',
       '11989502',
       '12059079',
       '12091494',
       '12464083',
       '12187029',
       '12151513',
       '12420862',
       '12590259',
       '12016425',
       '12069903',
       '12098208',
       '12150592',
       '12086636',
       '12020769',
       '12234906',
       '11857930',
       '12133820',
       '12455717',
       '11912272',
       '11844582',
       '11796712',
       '12395735',
       '11834878',
       '11942385',
       '12365365',
       '11999355',
       '12060771',
       '12145154',
       '12192473',
       '12152388',
       '12351283',
       '12394376',
       '12174154',
       '12549904',
       '12090816',
       '12021953',
       '12402257',
       '12081468',
       '12035114',
       '11889437',
       '11878971',
       '11856774',
       '11813853',
       '12052541',
       '12039024',
       '12046312',
       '12451252',
       '12510369',
       '11956150',
       '1209

        '12041681',
        '12358433',
        '11891214',
        '12183079',
        '11766993',
        '12055230',
        '12530006',
        '12437217',
        '12530960',
        '11956171',
        '12075577',
        '12020679',
        '12031907',
        '12089150',
        '11980874',
        '11755926',
        '12099715',
        '12478612',
        '12349967',
        '11864576',
        '11885938',
        '11903339',
        '11963600',
        '12019501',
        '12065477',
        '12069464',
        '12100502',
        '12134004',
        '12209008',
        '12218919',
        '12372348',
        '12394871',
        '12400460',
        '12110367',
        '12061783',
        '12508652',
        '11976344',
        '12507912',
        '11765285',
        '11895539',
        '11902745',
        '11904461',
        '11999387',
        '12004922',
        '12021264',
        '12028060',
        '12071342',
        '12072546',
        '12113822',
        '12169581',


       '12183532',
       '12019264',
       '12270384',
       '12525619',
       '12093747',
       '12037321',
       '12473486',
       '12297020',
       '12381639',
       '11792068',
       '11972706',
       '12207326',
       '12237799',
       '12393418',
       '12118707',
       '11777907',
       '12031892',
       '11909852',
       '11853874',
       '11868844',
       '12358854',
       '12389097',
       '12490871',
       '12490873',
       '12269828',
       '12079471',
       '11925173',
       '12462857',
       '12051861',
       '12067484',
       '12045773',
       '12074224',
       '12078484',
       '12556266',
       '12009284',
       '12000681',
       '12409672',
       '11804690',
       '12118085',
       '12619514',
       '12051859',
       '12456788',
       '11994385',
       '12139680',
       '11864961',
       '11977834',
       '12227938',
       '12135765',
       '12136334',
       '12097320',
       '11944527',
       '12073723',
       '1221

       '11967186',
       '12151779',
       '12377625',
       '12043456',
       '12032390',
       '11700862',
       '11912517',
       '11751919',
       '11777079',
       '12124837',
       '11998958',
       '12169773',
       '12370117',
       '12383444',
       '11937360',
       '12354620',
       '12097559',
       '11891007',
       '12468734',
       '11775002',
       '11735251',
       '12193281',
       '12479044',
       '12230569',
       '12107374',
       '11958598',
       '12215662',
       '12297113',
       '12240120',
       '12042530',
       '12089345',
       '11897155',
       '11935408',
       '12027878',
       '11829081',
       '12002784',
       '12457042',
       '12370278',
       '12356856',
       '12017546',
       '11883353',
       '12153143',
       '12234549',
       '12200233',
       '12563029',
       '11857915',
       '11912131',
       '12090977',
       '12151362',
       '11857914',
       '12088102',
       '11996925',
       '1232

       '12142086',
       '12404348',
       '12473486',
       '12076959',
       '11779033',
       '11912279',
       '11908676',
       '12369936',
       '11767002',
       '11933134',
       '12460985',
       '12171170',
       '12475893',
       '11820569',
       '12455885',
       '12475802',
       '12460908',
       '11756682',
       '12456637',
       '12130673',
       '11925073',
       '11809469',
       '12424260',
       '11950841',
       '12447108',
       '11860031',
       '12174943',
       '11950247',
       '11953315',
       '12384924',
       '11812214',
       '11958509',
       '12147302',
       '12103440',
       '12168844',
       '12086825',
       '12183421',
       '11948660',
       '12108515',
       '11894134',
       '11886444',
       '12203361',
       '11749167',
       '12004246',
       '12491782',
       '12068136',
       '11886862',
       '12039487',
       '12595689',
       '12610302',
       '11937323',
       '11727929',
       '1193

In [16]:
print(all_relevant.keys())
pprint(all_relevant)
type(all_relevant["1"])

dict_keys(['1', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '18', '20', '22', '23', '24', '25', '27', '28', '29', '31', '32', '34', '36', '37', '38', '39', '40', '42', '44', '45', '46', '48', '50'])
{'1': {'11751903',
       '11756412',
       '11762751',
       '11872638',
       '11877274',
       '11882322',
       '11911463',
       '11953864',
       '11956602',
       '11968052',
       '11989975',
       '12015083',
       '12025230',
       '12054572',
       '12054658',
       '12055678',
       '12111504',
       '12123335',
       '12151346',
       '12151347',
       '12151395',
       '12168821',
       '12359245',
       '12384701',
       '12388558',
       '12393707',
       '12396717',
       '12429910',
       '12474524',
       '12513833',
       '12517948'},
 '10': {'12029084',
        '12063259',
        '12067914',
        '12091344',
        '12091346',
        '12193972',
        '12407115'},
 '11': {'12038964',
        '12115629

set

## Exercise 01.F: _mean average precision_
__Using the `average_precision()` function you implemented above, write the code to compute the _Mean Average Precision_ for the `baseline.run` results.__

In [17]:
#THIS IS GRADED!

def mean_average_precision(all_relevant, all_retrieved):
    # BEGIN ANSWER
    total = 0
    count = len(all_relevant.keys())
    for q in all_relevant:
        total += average_precision(all_relevant[q], all_retrieved[q])
    # END ANSWER
    return total / count

mapr = mean_average_precision(all_relevant, all_retrieved)
print('Mean Average Precision (MAP): %1.3f\n' % mapr)

Mean Average Precision (MAP): 0.112
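
To make the computation easy to check by hand, here is a minimal sketch of MAP on two toy queries. The names `toy_average_precision`, `toy_mean_average_precision`, `toy_relevant` and `toy_retrieved` are hypothetical, and the average-precision definition used here (sum of the precision values at the ranks of relevant documents, divided by the number of relevant documents) is an assumption that may differ in detail from the `average_precision()` implemented above.

```python
def toy_average_precision(relevant, retrieved):
    """Assumed AP definition: sum of precision at each rank holding a relevant
    document, divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def toy_mean_average_precision(all_relevant, all_retrieved):
    # A query without any retrieved documents contributes an AP of 0.
    scores = [toy_average_precision(rel, all_retrieved.get(qid, []))
              for qid, rel in all_relevant.items()]
    return sum(scores) / len(scores)


# Toy data (not taken from the benchmark files).
toy_relevant = {"q1": {"a", "b"}, "q2": {"c"}}
toy_retrieved = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}

toy_map = toy_mean_average_precision(toy_relevant, toy_retrieved)
print("toy MAP: %1.3f" % toy_map)
```

For `q1` the average precision is (1/1 + 2/3) / 2 and for `q2` it is (1/2) / 1, so the mean over both queries is 2/3.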



***
## TREC benchmark evaluation

Below you find a function that takes `all_relevant` and `all_retrieved` and computes the mean value of `measure` over all queries.

The first argument of `mean_metric()`, `measure`, is special: it is itself a function. `mean_metric` sums the score of that measure over all queries and divides the total by the number of queries, which gives the mean of the measure over all query results.

_This part will be reused later to compare the results of different models._

In [18]:
def mean_metric(measure, all_relevant, all_retrieved):
    total = 0
    count = 0
    for qid in all_relevant:
        relevant  = all_relevant[qid]
        retrieved = all_retrieved.get(qid, [])
        value = measure(relevant, retrieved)
        total += value
        count += 1
    return "mean " + measure.__name__, total / count

# Example of use of the mean_metric function: computing the average r_precision
mean_metric(r_precision, all_relevant, all_retrieved)

('mean r_precision', 0.09155954402134368)
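
Because `measure` is just a function, any callable that takes `(relevant, retrieved)` can be plugged in. The sketch below repeats the logic of `mean_metric` so that it runs on its own and passes in a hypothetical `recall` measure on toy data. Neither `recall` nor the toy dictionaries are part of the assignment code.

```python
def mean_metric(measure, all_relevant, all_retrieved):
    # Same logic as the notebook's mean_metric, repeated so the sketch is self-contained.
    total = 0
    count = 0
    for qid in all_relevant:
        total += measure(all_relevant[qid], all_retrieved.get(qid, []))
        count += 1
    return "mean " + measure.__name__, total / count


def recall(relevant, retrieved):
    """Hypothetical measure: fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant.intersection(retrieved)) / len(relevant)


# Toy data: q3 is judged but has no results in the run.
toy_relevant = {"q1": {"a", "b"}, "q2": {"c"}, "q3": {"d"}}
toy_retrieved = {"q1": ["a", "x"], "q2": ["c"]}

label, value = mean_metric(recall, toy_relevant, toy_retrieved)
print(label, value)
```

Note that `q3` still counts in the denominator: a query that is missing from the run is scored on an empty result list, so it lowers the mean instead of being skipped.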

### TREC overview of the results of a run
The following two functions use your implementation of the metrics to create an overview of the performance metrics on the TREC benchmark data. Take a look at the numbers and interpret the results yourself.

In [19]:
def trec_eval(qrels_file, run_file):

    def precision_at_1(rel, ret): return precision_at_k(rel, ret, k=1)
    def precision_at_5(rel, ret): return precision_at_k(rel, ret, k=5)
    def precision_at_10(rel, ret): return precision_at_k(rel, ret, k=10)
    def precision_at_50(rel, ret): return precision_at_k(rel, ret, k=50)
    def precision_at_100(rel, ret): return precision_at_k(rel, ret, k=100)
    def precision_at_recall_00(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.0)
    def precision_at_recall_01(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.1)
    def precision_at_recall_02(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.2)
    def precision_at_recall_03(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.3)
    def precision_at_recall_04(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.4)
    def precision_at_recall_05(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.5)
    def precision_at_recall_06(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.6)
    def precision_at_recall_07(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.7)
    def precision_at_recall_08(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.8)
    def precision_at_recall_09(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.9)
    def precision_at_recall_10(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=1.0)

    (all_relevant, all_retrieved) = read_eval_files(qrels_file, run_file)
    
    unknown_qids = set(all_retrieved.keys()).difference(all_relevant.keys())
    if len(unknown_qids) > 0:
        raise ValueError("Unknown qids in run: {}".format(sorted(list(unknown_qids))))

    metrics = [success_at_1,
               success_at_5,
               success_at_10,
               r_precision,
               precision_at_1,
               precision_at_5,
               precision_at_10,
               precision_at_50,
               precision_at_100,
               precision_at_recall_00,
               precision_at_recall_01,
               precision_at_recall_02,
               precision_at_recall_03,
               precision_at_recall_04,
               precision_at_recall_05,
               precision_at_recall_06,
               precision_at_recall_07,
               precision_at_recall_08,
               precision_at_recall_09,
               precision_at_recall_10,
               average_precision]

    return [mean_metric(metric, all_relevant, all_retrieved) for metric in metrics]


def print_trec_eval(qrels_file, run_file):
    results = trec_eval(qrels_file, run_file)
    print("Results for {}".format(run_file))
    for (metric, score) in results:
        print("{:<30} {:.4}".format(metric, score))

print_trec_eval('data01/FIR-s05-training-qrels.txt', 'data01/baseline.run')

Results for data01/baseline.run
mean success_at_1              0.1053
mean success_at_5              0.2632
mean success_at_10             0.3158
mean r_precision               0.09156
mean precision_at_1            0.1053
mean precision_at_5            0.07895
mean precision_at_10           0.04737
mean precision_at_50           0.01947
mean precision_at_100          0.01395
mean precision_at_recall_00    0.2015
mean precision_at_recall_01    0.1898
mean precision_at_recall_02    0.1683
mean precision_at_recall_03    0.1333
mean precision_at_recall_04    0.1236
mean precision_at_recall_05    0.1227
mean precision_at_recall_06    0.08744
mean precision_at_recall_07    0.08435
mean precision_at_recall_08    0.05999
mean precision_at_recall_09    0.05803
mean precision_at_recall_10    0.05803
mean average_precision         0.1116
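
The many small wrapper functions in `trec_eval` exist because `mean_metric` expects a two-argument measure and reads its `__name__` for the report. One possible alternative, sketched here under assumptions, is a small factory that builds such wrappers and sets the name explicitly. `make_precision_at` is a hypothetical helper, and the `precision_at_k` below is only a stand-in with an assumed definition (relevant documents in the top `k`, divided by `k`), not the implementation from the exercise above.

```python
def precision_at_k(relevant, retrieved, k):
    """Stand-in for the notebook's precision_at_k (assumed definition)."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def make_precision_at(k):
    """Return a two-argument measure with a readable __name__ for reporting."""
    def measure(rel, ret):
        return precision_at_k(rel, ret, k=k)
    # mean_metric builds its label from measure.__name__, so set it explicitly.
    measure.__name__ = "precision_at_{}".format(k)
    return measure


cutoff_measures = [make_precision_at(k) for k in (1, 5, 10, 50, 100)]
names = [m.__name__ for m in cutoff_measures]
print(names)

p_at_5 = make_precision_at(5)
print(p_at_5({"a", "b"}, ["a", "x", "b", "y", "z"]))
```

A plain `functools.partial` would also fix `k`, but partial objects do not carry a `__name__` attribute, so they would fail inside `mean_metric` as written.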


## Exercise 01.H: _Significance testing_

Testing the statistical significance of differences of the results of different IR systems is important (see slides of lecture 01 and course book, Section 8.8). One of the basic tests one can perform is the two-tailed [sign test](https://en.wikipedia.org/wiki/Sign_test).

For this exercise only, we use the run files obtained by [Hiemstra and Aly](https://djoerdhiemstra.com/wp-content/uploads/trec2014mirex-draft.pdf) for the TREC Web track 2014 benchmark (note that these files come from a different benchmark than the one we have been working with so far). The `utbase.run` file was generated using Language Modeling, while `utexact.run` was generated using an IR system based on matching the exact query string and ranking the documents by the number of exact matches found. The exact run improves the _Precision at 5_ to 0.456 (compared to 0.440 for the baseline run).

__Implement the code to perform the _sign test_ of statistical significance.__
> _Hint:_ for each sign, compute the number of queries that increase/decrease performance (called `better, worse` in the code below). How would you use these values to compute the _p_ value of the two-tailed sign test? Is the difference between _utbase_ and _utexact_ significant?

In [20]:
#THIS IS GRADED!

def sign_test_values(measure, qrels_file, run_file_1, run_file_2):
    all_relevant = read_qrels_file(qrels_file)
    all_retrieved_1 = read_run_file(run_file_1)
    all_retrieved_2 = read_run_file(run_file_2)
    better = 0
    worse  = 0
    # BEGIN ANSWER
    for query in all_relevant:
        # Score both runs with the measure passed in, instead of a hard-coded one.
        score_1 = measure(all_relevant[query], all_retrieved_1[query])
        score_2 = measure(all_relevant[query], all_retrieved_2[query])
        if score_1 < score_2:
            better += 1   # run 2 beats run 1 on this query
        elif score_1 > score_2:
            worse += 1    # run 2 is worse than run 1 on this query
        # Ties are not counted: the sign test ignores them.
    # END ANSWER
    return (better, worse)
    
def precision_at_rank_5(rel, ret):
    return precision_at_k(rel, ret, k=5)

sign_test_values(precision_at_rank_5, 'data01/trec.qrels', 'data01/utbase.run', 'data01/utexact.run')

(9, 9)

## BEGIN ANSWER
The null hypothesis is that there is no difference between the performance of the two run files, whereas the alternative hypothesis is that there is a difference between them. 
To calculate the p-value, we should ask ourselves what is the probability that the observed result of 9 positive differences, or a more extreme result, would occur if there is no difference between the performances of the two runs? 
We also need to discard the ties, which are 32, so our n = 18.
Since the test is two-sided, we need to consider the probabilities calculated using the binomial test and the p-value will be their sum:

- probability that we observe 9 better out of 18
- probability that we observe 10 better out of 18
- probability that we observe 11 better out of 18
- ...
- probability that we observe 18 better out of 18.

Since the test is two-sided, we also add the lower tail:

- probability that we observe 8 better out of 18
- probability that we observe 7 better out of 18
- probability that we observe 6 better out of 18
- ...
- probability that we observe 0 better out of 18

In our case the two tails together cover every possible outcome, so the p-value is 1. According to this test, the difference between the two runs is therefore not significant.

As a double check, we perform the same statistical test using the `binom_test` function from SciPy, as implemented below. This confirms our result.


In [21]:
from scipy import stats
stats.binom_test(9, n=18, p = 0.5, alternative = "two-sided")

1.0

## END ANSWER
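
The reasoning in the answer above can also be reproduced with the standard library only. The following is a minimal sketch, not part of the graded solution: the function name `sign_test_p_value` is hypothetical, and it assumes the common convention of doubling the smaller tail and capping the result at 1.

```python
from math import comb

def sign_test_p_value(better, worse):
    """Two-sided sign test p-value; ties are assumed to be discarded already."""
    n = better + worse
    if n == 0:
        return 1.0  # no informative queries at all
    k = min(better, worse)
    # P(X <= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # double the smaller tail, but a probability cannot exceed 1
    return min(1.0, 2 * tail)

print(sign_test_p_value(9, 9))
```

With 9 improvements and 9 deteriorations the smaller tail already holds more than half of the probability mass, which is why the capped value is 1.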

***
***
***
***
***

# Part 02 - Indexing and querying with ElasticSearch

## Preparation: Getting started with Elasticsearch

The following parts of the assignment are based on ElasticSearch. You are advised to go through the Elasticsearch [reference guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) and work through the tutorials. You can skip the section on [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html), as Elasticsearch is already installed in the Virtual Machine.

> If you want (disclaimer: we do __not__ give help with this!), you can 
> follow the [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) to run Elasticsearch on your laptop without VM. Beware your system will likely be different from the 
> one of your colleagues and they might not be able to help you if 
> you have problems that are specific to your system, your operating
> system, or your Elasticsearch version.

### Starting/Stopping ElasticSearch
To start ElasticSearch on the virtual machine, you can type `sudo service elasticsearch start` in a Terminal.
To stop the ElasticSearch server, instead, you can type `sudo service elasticsearch stop`. Refer to the [official guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-running-init) for more information.

### The REST API

Elasticsearch runs its own server that can be accessed by a regular web browser by opening this link: http://localhost:9200. 

Elasticsearch will respond with something like:

    {
        "name" : "fir-machine",
        "cluster_name" : "elasticsearch",
        "cluster_uuid" : "w7SBVo1ESVivMApbLIqRvA",
        "version" : {
            "number" : "7.9.0",
            "build_flavor" : "default",
            "build_type" : "deb",
            "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
            "build_date" : "2020-08-11T21:36:48.204330Z",
            "build_snapshot" : false,
            "lucene_version" : "8.6.0",
            "minimum_wire_compatibility_version" : "6.8.0",
            "minimum_index_compatibility_version" : "6.0.0-beta1"
        },
        "tagline" : "You Know, for Search"
    }


If you see this, then your Elasticsearch node is up and running. The RESTful API uses simple text or JSON over HTTP. 
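
The same response can be read from Python, since it is plain JSON over HTTP. The sketch below uses only the standard library; the function names `describe_node` and `fetch_node_info` are hypothetical, and the network call is left commented out so that the parsing step also works without a running server.

```python
import json
from urllib.request import urlopen

def describe_node(info):
    """Summarise the root-endpoint response (already parsed into a dict)."""
    version = info["version"]
    return "{} runs Elasticsearch {} (Lucene {})".format(
        info["name"], version["number"], version["lucene_version"])

def fetch_node_info(url="http://localhost:9200"):
    """Fetch and parse the JSON document served at the root endpoint."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)

# Trimmed copy of the response shown above, so the parsing step runs offline.
sample = json.loads(
    '{"name": "fir-machine",'
    ' "version": {"number": "7.9.0", "lucene_version": "8.6.0"}}'
)
print(describe_node(sample))

# With the server running, the live response can be summarised the same way:
# print(describe_node(fetch_node_info()))
```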

> REST, API, JSON, HTTP, that's a lot of abbreviations! It is good to
> be familiar with the terminology. Let us explain: The Elasticsearch
> response is not (only) intended for humans. It is supposed to be used 
> by applications that run on the client machines, and therefore the
> interface is called an Application Programming Interface (API). The 
> API uses a format called JSON (JavaScript Object Notation), which 
> can be easily read by machines (and humans). The API sends its JSON
> response using the same method as your web browser displays web
> pages. This method is called HTTP (Hyper Text Transfer Protocol), 
> and it is the reason you can inspect the response in a normal web
> browser. APIs that use HTTP are called RESTful interfaces. REST 
> stands for REpresentational State Transfer, arguably one of the
> simplest ways to define an API.


### Interacting with the ElasticSearch server

You can interact with your Elasticsearch service in different ways. In this first part we explore Kibana, a dashboard for inspecting your indices. Later in the practical work we will use the Python Elasticsearch client or the DSL library. If you prefer, you can also start with Python right away.

#### Kibana
Kibana provides a web interface to interact with your Elasticsearch service. It's available from http://localhost:5601. You can use Kibana to create interactive dashboards visualizing data in your Elasticsearch indices. It also provides a console to execute Elasticsearch commands. It's available from http://localhost:5601/app/kibana#/dev_tools

To start Kibana on the virtual machine, you can type `sudo service kibana start` in a Terminal. \
To stop the Kibana server, instead, you can type `sudo service kibana stop`.

Many examples from the Elasticsearch user guide can be directly executed in Kibana by clicking on the `CONSOLE` button.



***
***
***

# Indexing and queries (Exercises - Part 02)

_You can work on this part after Lecture 01 already_


## Collection indexing: useful code

We provide some code to read the TREC collection documents and index them into the ElasticSearch engine.
As we need to re-index the document collection whenever we use a different indexing configuration (called Mappings in ElasticSearch), we developed some functions to support quick re-indexing in the following exercises.

Below you find the Python code for bulk-indexing our (FIR)Medline collection. Execute the following cells to index the collection in an Elasticsearch index called `genomics-base`. Study the code carefully, as you will use the indexing functions later for the completion of the assignment.

> The code uses additional helper functions 
> (`elasticsearch.helpers`) and a library for processing JSON.
> The function `read_documents()` reads the bulk collection file: The 
> function is a [Python generator](https://wiki.python.org/moin/Generators) function. It generates an 'on-demand' list
> by using the statement `yield` for every item of the list. It
> is used in the helper function `elasticsearch.helpers.bulk()`.
> The statement `raise` is how Python throws exceptions: if the exception is not caught, the program exits with an error.
> Note the (keyword) arguments to `bulk()`:
> `chunk_size` indicates the number of documents to be processed by
> Elasticsearch in one batch. 
> `request_timeout` is set to 30 seconds because processing a single batch
> of documents can take some time.

> __Note:__ _when processing a bulk index, be sure to have a few gigabytes free on the hard drive of the VM. If you get a BulkIndexError with read-only/FORBIDDEN errors, you probably have too little hard drive space available for ElasticSearch to work properly._


**_Note:_ indexing the (FIR)TREC genomics collection can take some time, be patient.**
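
The generator behaviour described in the note above can be tried without a running server. The sketch below is a simplified stand-in for the reading step, not the code used for indexing: the function name `read_pairs` and the two sample documents are made up, and it assumes the file alternates between an action line and a document line.

```python
import io
import json

def read_pairs(lines):
    """Yield one document per (action line, document line) pair."""
    current_id = None
    for line in lines:
        record = json.loads(line)
        if "index" in record:
            # action line: remember the identifier for the next document
            current_id = record["index"]["_id"]
        else:
            # document line: attach the remembered identifier and hand it out
            record["_id"] = current_id
            yield record

# Two made-up documents in the action/document layout (hypothetical data)
sample = io.StringIO(
    '{"index": {"_id": "1"}}\n'
    '{"PMID": "100", "TI": "first title"}\n'
    '{"index": {"_id": "2"}}\n'
    '{"PMID": "200", "TI": "second title"}\n'
)

docs = read_pairs(sample)
first = next(docs)        # only the first pair has been read at this point
remaining = list(docs)    # consuming the generator reads the rest
print(first["_id"], [d["_id"] for d in remaining])
```

Because documents are produced on demand, a consumer that processes them in batches never needs the whole collection in memory at once.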

In [22]:
import elasticsearch
import elasticsearch.helpers
import json

def read_documents(file_name):
    """
    Returns a generator of documents to be indexed by elastic, read from file_name
    """
    with open(file_name, 'r') as documents:
        for line in documents:
            doc_line = json.loads(line)
            if ('index' in doc_line):
                id = doc_line['index']['_id']
            elif ('PMID' in doc_line):
                doc_line['_id'] = id
                yield doc_line
            else:
                raise ValueError('Woops, error in index file')

def create_index(es, index_name, body={}):
    # delete index when it already exists
    es.indices.delete(index=index_name, ignore=[400, 404])
    # create the index 
    es.indices.create(index=index_name, body=body)
                
def index_documents(es, collection_file_name, index_name, body={}):
    create_index(es, index_name, body)
    # bulk index the documents from file_name
    return elasticsearch.helpers.bulk(
        es, 
        read_documents(collection_file_name),
        index=index_name,
        chunk_size=2000,
        request_timeout=30
    )

In [23]:
# Connect to the ElasticSearch server
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# Index the collection into the index called 'genomics-base'
body = {} # no indexing options (leave default)
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-base', body)



(263080, [])

> You can change the name of the index, in case you want to have different indices of the same collection created with different indexing settings, and compare the performance on the test queries. 

> E.g. you create two indices 'genomics01' and 'genomics02': genomics01 uses the default options, while genomics02 uses custom tokenizers. You will then have two indices with different characteristics (and probably different performance). 
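
As an illustration of what a non-default `body` could look like, here is a hedged sketch of an index configuration with a custom analyzer. The analyzer name `my_english` and the choice of fields are assumptions made for this example; the `standard` tokenizer and the `lowercase` and `stop` token filters are built into Elasticsearch.

```python
import json

# Hypothetical analyzer name and field choice, for illustration only.
custom_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "TI": {"type": "text", "analyzer": "my_english"},
            "AB": {"type": "text", "analyzer": "my_english"},
        }
    },
}

print(json.dumps(custom_body, indent=2))
```

Such a dictionary could be passed as the `body` argument of `index_documents()` together with a new index name, for example `'genomics02'`.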

## Exercise 02.A: index properties and querying

__1. Query the index called 'genomics-base' and determine how many documents are indexed.__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code here
# BEGIN ANSWER

### Kibana query ###

GET /genomics-base/_count
{
  "query": {
    "match_all": {}
  }
}

### Result ###
{
  "count" : 263080,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

# The total number of indexed documents is 263080.
# END ANSWER

__2. How many documents containing the term `molecule` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
### Kibana query ###

GET /genomics-base/_count
{
  "query": {
    "query_string": {
      "query": "molecule"
    }
  }
}    
### Result ###
{
  "count" : 3404,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
# The number of retrieved documents is 3404.

# END ANSWER

__3. How many documents containing the term `molecular` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
### Kibana query ###

GET /genomics-base/_count
{
  "query": {
    "query_string": {
      "query": "molecular"
    }
  }
}
    
### Result ###
{
  "count" : 31556,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
# The number of retrieved documents is 31556.
# END ANSWER

__4. How many documents containing the terms `cell` AND `blood` are there in your index? (searching all fields of the documents).__

You can use Kibana (suggested for the time being - you can use the command line in Kibana), the Python ElasticSearch library or DSL. Report the code you implemented and the resulting number of documents.

In [None]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
### Kibana query ###

GET /genomics-base/_count
{
  "query": {
    "query_string": {
      "query": "cell AND blood"
    }
  }
}

### Result ###
{
  "count" : 6865,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
# The number of retrieved documents is 6865.
# END ANSWER

In [25]:
import elasticsearch
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# this is another solution (if query_string is used, be sure that AND is in the query, otherwise it will not search properly)
term = 'blood AND cell'
body = {"track_total_hits": True, "query": {"query_string": 
                                            {"query": term, 
                                             "default_operator":"AND", 
                                             "auto_generate_synonyms_phrase_query": True }}}
result = es.search(index='genomics-base', body=body)
print("Number of results: {}".format(result['hits']['total']['value']))

Number of results: 6865
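
The query bodies used in this exercise are ordinary Python dictionaries, so they can be built by a small helper. This is a sketch under assumptions: the helper name `query_string_body` is hypothetical, and it relies on `default_operator` so that the terms do not need an explicit `AND` between them.

```python
import json

def query_string_body(terms, operator="AND"):
    """Build a query_string body whose terms are combined with `operator`."""
    return {
        "query": {
            "query_string": {
                "query": " ".join(terms),
                "default_operator": operator,
            }
        }
    }

body = query_string_body(["cell", "blood"])
print(json.dumps(body, indent=2))
```

The resulting dictionary has the same shape as the bodies sent through Kibana above, and could be passed to the client in the same way as the `body` in the previous cell.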


## Exercise 02.B: the Python ElasticSearch library

#### Preparation
The command line is fine for doing basic operations on your Elasticsearch indices, but as soon as things get more complex, it is better to use custom client programs.
We will use the [Elasticsearch client library for Python](https://elasticsearch-py.readthedocs.io). This library will execute the HTTP requests that you have used before (with CURL or Kibana). The library is pre-installed on the VM.

#### Exercise

__Write the code that searches the index for _"molecule"_ using the [search()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search) function.__ Your code will take at minimum the following steps:

1. import the python library `elasticsearch`.
2. open a connection with the Elasticsearch host (`'localhost'` on the VM, `'elasticsearch'` if you use Docker) with `Elasticsearch()`.
3. execute a search with `search()` using the index `genomics-base`, and a correct query body.
4. print the JSON output of Elasticsearch 

How many hits are there in your index? Is the result the same as in Exercise 02.A?

> Elasticsearch runs on localhost on your laptop, at port 9200 (i.e. at http://localhost:9200)


In [26]:
#THIS IS GRADED!

import elasticsearch

# your code below
#BEGIN ANSWER
es = elasticsearch.Elasticsearch(host='localhost') 

from pprint import pprint
body = {
    "query": { 
        "simple_query_string" : {
          "query": "molecule"
        }    
    }
}
response = es.search(index='genomics-base', body=body)
n_results = response["hits"]["total"]["value"]
print("The number of retrieved results is",n_results)
print("\n")
# The number is 3404, the same as in exercise 02.A
# Since we need to print the JSON output, we do not specify the size parameter. 
# In this way, only the top 10 hits will be shown instead of all 3404, to avoid waiting too long for the response to print.
pprint(response) # Printing the JSON output

#END ANSWER



The number of retrieved results is 3404


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '374140',
                    '_ignored': ['AB.keyword'],
                    '_index': 'genomics-base',
                    '_score': 31.376125,
                    '_source': {'AB': 'OBJECTIVE: Endothelial cell dysfunction '
                                      'has been implicated in the inflammatory '
                                      'response to cardiopulmonary bypass, and '
                                      'the upregulation of endothelial cell '
                                      'expression of adhesion molecules might '
                                      'promote leukocyte extravasation in '
                                      'vivo. Soluble endothelial cell adhesion '
                                      'molecules are increased after bypass. '
                                      'The aim of this study was to '
        

The Python client library returns Python objects built from [dictionaries](https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries) and [lists](https://docs.python.org/3.6/tutorial/introduction.html#lists).
Use a [for loop](https://docs.python.org/3.6/tutorial/controlflow.html#for-statements) to inspect each hit, and print the retrieved documents' titles one by one. 
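
The nesting of dictionaries and lists can be explored on a small hand-written stand-in before querying the real index. The data below is made up, and the helper name `titles` is hypothetical; the sketch only assumes that hits live under `response["hits"]["hits"]` and that each hit carries its document under `_source`, as in the output shown above.

```python
# Hand-written stand-in for a search response (hypothetical data)
mock_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "a", "_score": 2.5,
             "_source": {"PMID": "100", "TI": "First title."}},
            {"_id": "b", "_score": 1.2,
             "_source": {"PMID": "200"}},  # this document has no title field
        ],
    }
}

def titles(response, missing="(no title)"):
    """Collect the TI field of every hit, tolerating documents without one."""
    return [hit["_source"].get("TI", missing)
            for hit in response["hits"]["hits"]]

for position, title in enumerate(titles(mock_response), start=1):
    print("{}) {}".format(position, title))
```

Using `.get()` with a default is a defensive choice here: indexing with `["TI"]` would raise a `KeyError` for a document that lacks the field.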

In [27]:
#example
print("Number of results: {}".format(response['hits']['total']['value']))
# your code below
body = {
    "size": 10000, # now we specify the size so that we can print the titles of ALL the retrieved documents
    "query": { 
        "simple_query_string" : {
          "query": "molecule"
        }    
    }
}
response = es.search(index='genomics-base', body=body)

for i, hit in enumerate(response["hits"]["hits"], start=1):
    print(f"{i}) ", hit["_source"]["TI"])
    print("\n")

Number of results: 3404
1)  Endothelial expression of intercellular adhesion molecule 1 and vascular cell adhesion molecule 1 is suppressed by postbypass plasma containing increased soluble intercellular adhesion molecule 1 and vascular cell adhesion molecule 1.


2)  Influence of single low-density lipoprotein apheresis on the adhesion molecules soluble vascular cellular adhesion molecule-1, soluble intercellular adhesion molecule-1, and P-selectin.


3)  Effects of severe, uncontrolled hypertension on endothelial activation: soluble vascular cell adhesion molecule-1, soluble intercellular adhesion molecule-1 and von Willebrand factor.


4)  Vascular cell adhesion molecule-1 is a key adhesion molecule in melanoma cell adhesion to the leptomeninges.


5)  Localization and quantification of adhesion molecule expression in the lower uterine segment during premature labor.


6)  Clinical significance of serum levels of E-selectin, intercellular adhesion molecule-1, and vascular cell adhes


762)  Polymeric aqua(nitrilotriacetato)erbium(III).


763)  Correlating structural dynamics and function in single ribozyme molecules.


764)  11-Methyl-12a-phenyl-9a,12a-dihydrophenanthro[9',10':5,6][1,4]dioxino[2,3


765)  [Genetic polymorphisms, function and clinical effect of HLA-G]


766)  Nucleic acid aptamers in cancer medicine.


767)  [Leptin: regulatory role in bone metabolism and in inflammation]


768)  Hexamethylenetetramine-4-nitrocatechol-water (1/2/1).


769)  Crystal packing in vicinal diols CnHm(OH)2.


770)  Diffusion-limited kinetics of the solution-solid phase transition of molecular substances.


771)  Preclinical pharmacology of albumin-free B-domain deleted recombinant factor VIII.


772)  Description of ordered solvent molecules in a platinated decanucleotide duplex refined at 1.6A resolution against experimental MAD phases.


773)  Increased expression of endothelial cell adhesion molecules due to mediator release from human foreskin mast cells stimulated by 



1212)  Selective detection of the proton NMR spectra of molecules containing rare spins at natural abundance in liquid crystalline samples.


1213)  The effects of deimination of myelin basic protein on structures formed by its interaction with phosphoinositide-containing lipid monolayers.


1214)  KIR: diverse, rapidly evolving receptors of innate and adaptive immunity.


1215)  MALDI TOF mass spectrometry: an emerging platform for genomics and diagnostics.


1216)  Interrelation between thermochemical and structural data of polymorphs exemplified by diflunisal.


1217)  Lymphocyte activation via NKG2D: towards a new paradigm in immune recognition?


1218)  Genetic engineering of Streptococcus gordonii for the simultaneous display of two heterologous proteins at the bacterial surface.


1219)  The saga of the discovery of IL-1 and TNF and their specific inhibitors in the pathogenesis and treatment of rheumatoid arthritis.


1220)  [In vitro study of chlorhexidine resistance in subgi


1545)  Overcoming the blockade at the upstream of caspase cascade in Fas-resistant HTLV-I-infected T cells by cycloheximide.


1546)  Rapid detection of polymorphisms of the nitric oxide cascade.


1547)  Expression, purification, crystallization and preliminary X-ray analysis of a DNA-binding protein from Methanococcus jannaschii.


1548)  Drug receptor identification from multiple tissues using cellular-derived mRNA display libraries.


1549)  Simultaneous determination of Aloe-emodin and Rhein by synchronous fluorescence spectroscopy.


1550)  Measurement of the electron electric dipole moment using YbF molecules.


1551)  Immunology of factor VIII inhibitors.


1552)  Combinatorial informatics in the post-genomics ERA.


1553)  New therapeutics that modulate chemokine networks.


1554)  Novel and alternate SNP and genetic technologies.


1555)  A stepwise optimization of crystals of rhamnogalacturonan lyase from Aspergillus aculeatus.


1556)  9-(Trichloroacetylimino)acridine mono


1982)  Structure of green pigment formed by the reaction of caffeic acid esters (or chlorogenic acid) with a primary amino compound.


1983)  Chromatographic performance on a C30-bonded stationary phase of monohydroxycarotenoids with variable chain length or degree of desaturation and of lycopene isomers synthesized by various carotene desaturases.


1984)  The complexation of mercury (II) and organomercurial compounds by 8-hydroxyquinoline-bovine serum albumin conjugates.


1985)  Inflammatory mechanisms in atherosclerosis: from laboratory evidence to clinical application.


1986)  Age-associated thymic atrophy is linked to a decline in IL-7 production.


1987)  Superimposition-based protocol as a tool for determining bioactive conformations. II. Application to the GABA(A) receptor.


1988)  Antiplasmin activity of a peptide that binds to the receptor-binding site of angiogenin.


1989)  Chromosome 13 dementia syndromes as models of neurodegeneration.


1990)  A choice of death--the 

2116)  Antigen-specific dose-dependent system for the study of an inheritable and reversible phenotype in mouse CD4+ T cells.


2117)  Osteopontin inhibits mineral deposition and promotes regression of ectopic calcification.


2118)  Advances in therapy for hepatitis C infection.


2119)  Cutting edge: a novel Toll/IL-1 receptor domain-containing adapter that preferentially activates the IFN-beta promoter in the Toll-like receptor signaling.


2120)  Regulation of the metastatic process by E-selectin and stress-activated protein kinase-2/p38.


2121)  Integrin alphav and NCAM mediate the effects of GDNF on DA neuron survival, outgrowth, DA turnover and motor activity in rats.


2122)  Visualization of molecular dynamics by simulation.


2123)  Antibodies present in normal human serum inhibit invasion of human brain microvascular endothelial cells by Listeria monocytogenes.


2124)  CD25+ immunoregulatory CD4 T cells mediate acquired central transplantation tolerance.


2125)  Identific

2493)  Targeted therapy in non-small-cell lung cancer.


2494)  Activation of channel catfish (Ictalurus punctatus) T cells involves NFAT-like transcription factors.


2495)  Preparation of recombinant MK-1/Ep-CAM and establishment of an ELISA system for determining soluble MK-1/Ep-CAM levels in sera of cancer patients.


2496)  T lymphocytes express B7 family molecules following interaction with dendritic cells and acquire bystander costimulatory properties.


2497)  Serum leptin and CD4+ T lymphocytes in HIV+ children during highly active antiretroviral therapy.


2498)  Molecular topography imaging by intermembrane fluorescence resonance energy transfer.


2499)  [N(CH3)4]2[Mn(H2O)]3[Mo(CN)7](2).2H2O: a new high Tc cyano-bridged ferrimagnet based on the [MoIII(CN)7]4


2500)  Synergistic induction of apoptosis by acyclic retinoid and interferon-beta in human hepatocellular carcinoma cells.


2501)  Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Paramet


2711)  Ferromagnetism in malonato-bridged copper(II) complexes. Synthesis, crystal structures, and magnetic properties of [[Cu(H2O)3][Cu(mal)2(H2O)]]n and [[Cu(H2O)4]2[Cu(mal)2(H2O)]][Cu(mal)2(H2O)2][[Cu(H2O)4][Cu(mal)2(H2O)2][(H 2mal = malonic acid).


2712)  Rapid and precise epitope mapping of monoclonal antibodies against Plasmodium falciparum AMA1 by combined phage display of fragments and random peptides.


2713)  In vitro assay for site-specific proteases using bead-attached GFP substrate.


2714)  Selective in vivo inhibition of mitogen-activated protein kinase activation using cell-permeable peptides.


2715)  Wet air oxidation of a direct dye solution catalyzed by CoAlPO4 -5. Performance assessment and kinetic study.


2716)  Association of the A561C E-selectin polymorphism with systemic lupus erythematosus in 2 independent populations.


2717)  Simplified method for the detection of apo(a) isoforms.


2718)  Functional modification of cytochrome c by peroxynitrite in an ele

3044)  Expression of the type 1 insulin-like growth factor receptor is up-regulated in primary prostate cancer and commonly persists in metastatic disease.


3045)  Transcriptional regulation of the human toll-like receptor 2 gene in monocytes and macrophages.


3046)  Downregulation of P2X3 receptor-dependent sensory functions in A/J inbred mouse strain.


3047)  Structure of a domain-opened mutant (R121D) of the human lactoferrin N-lobe refined from a merohedrally twinned crystal form.


3048)  Beta-lactamase protein fragment complementation assays as in vivo and in vitro sensors of protein protein interactions.


3049)  SLAM (CD150)-independent measles virus entry as revealed by recombinant virus expressing green fluorescent protein.


3050)  Mitochondrial calcium ion and oxidative phosphorylation in regenerating rat liver.


3051)  Reactivities of methylenetriangulanes and spirocyclopropanated bicyclopropylidenes toward bromine. Relative stabilities of spirocyclopropanated versus m


3404)  Molecular complexity of vertebrate tight junctions (Review).




## Exercise 02.C: _Search using the Elasticsearch DSL_

You will notice that the native query format of Elasticsearch can be quite verbose.
Elasticsearch provides the Python library `elasticsearch_dsl` to write more concise Elasticsearch queries. 
This is only to simplify the syntax: the library still issues Elasticsearch queries.

For example, a simple `multi_match` query looks as follows:
```python
query = {
   "query": {
       "multi_match": {}
   }
}
```

The same query can be created with the DSL as follows:
```python
query = Q("multi_match")
```

Especially for more complicated boolean queries, using the native query format can become cumbersome.
Read more about the DSL [here](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html)

__1. Search for the query `molecule` and check whether you get the same number of results as for exercise 02.A(2).__

In [28]:
#THIS IS GRADED!

# your code here
# BEGIN ANSWER
from elasticsearch_dsl import Q
from elasticsearch_dsl import Search

q = Q("multi_match", query='molecule')
s = Search().using(es).query(q).index("genomics-base")
response = s.execute()


print("Number of retrieved documents:", (response.to_dict()['hits']['total']['value']))
# It is the same as in exercise 02.A(2)

Number of retrieved documents: 3404


__2. Search for the documents that contain the words `cell` AND `blood`, using the DSL library. Check whether you get the same number of results as for exercise 02.A(4).__

In [29]:
#THIS IS GRADED!

# your code here
# BEGIN ANSWER
from elasticsearch_dsl import Q
from elasticsearch_dsl import Search

q = Q("multi_match", query='cell') & Q("multi_match", query ='blood')
s = Search().using(es).query(q).index("genomics-base")
response = s.execute()

print("Number of retrieved documents:", (response.to_dict()['hits']['total']['value']))
# It is the same as in exercise 02.A(4)
# END ANSWER

Number of retrieved documents: 6865
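
For comparison, the conjunction of the two `multi_match` queries can also be written in the native format as a `bool` query with two `must` clauses. To our understanding this is what combining two `Q` objects with `&` corresponds to, but treat that as an assumption and check it with `to_dict()` on your own query. The helper name `match_all_terms` is hypothetical.

```python
import json

def match_all_terms(terms, fields=None):
    """Build a native bool query in which every term must match."""
    clauses = []
    for term in terms:
        clause = {"query": term}
        if fields:
            clause["fields"] = list(fields)  # restrict the search to these fields
        clauses.append({"multi_match": clause})
    return {"query": {"bool": {"must": clauses}}}

body = match_all_terms(["cell", "blood"])
print(json.dumps(body))
```

The verbosity of this dictionary, compared with the one-line DSL expression, illustrates why the DSL is convenient for boolean queries.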


***
##  Exercise 02.D: Making your own TREC run

We will adopt a scientific approach to building search engines. That is, we are not only going to build a search engine and see that it works, but we are also going to _measure_ how well it works, by measuring the search engine's quality. We will adopt the method from the [Text Retrieval Conference](http://trec.nist.gov) (TREC). TREC provides researchers with test collections, which consist of 3 parts:

1. the document collection (in our case a part of the MEDLINE database)
2. the topics (which are natural language descriptions of what the user is searching for: you can think of them as the _queries_)
3. the relevance judgments (for each topic, what documents are relevant)



__Exercise: Complete the code of the Python function `make_trec_run()` that reads the topics [FIR-s05-training-queries-simple.txt](data01/FIR-s05-training-queries-simple.txt), and for each topic does a search using Elasticsearch.__ The program should output a file in the [TREC submission format](https://trec-core.github.io/2017/#submission-guidelines). We already provided the first lines for this exercise, which include:

1. Open the file `'run_file_name'` for writing and call it `run_file`.
2. Open the file `'topics_file_name'` for reading, call it `test_queries`.
3. For each line in `test_queries`:
4. Remove the newline using `strip()`, then split the string on the tab character (`'\t'`). The first part of the line is now `qid` (the query identifier) and the last part is `query` (a textual description of the query).
5. complete the Python program such that the correct TREC run file is written to `'run_file_name'`.

> **Note**: Make sure you output the `PMID` (pubmed identifier) of the document `hit['_source']['PMID']`. Do **not** use the elasticsearch identifier `_id` because they do not match the document identifiers in the relevance judgements. They were randomly generated by Elasticsearch during indexing.


__Make sure to search in the fields `TI` and `AB`, which correspond to the title and abstract, respectively, of the scientific papers of the MEDLINE collection.__
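
Each line of a run file has six whitespace-separated columns: query identifier, the literal `Q0`, document identifier, rank, score and run name. The sketch below shows one way to format such lines from (document, score) pairs; the helper names and the sample data are hypothetical, and whether ranks start at 0 or 1 is left as a parameter because it is an assumption you should check against the submission guidelines.

```python
def trec_line(qid, doc_id, rank, score, run_name):
    """Format one result as a line of a TREC run file."""
    return "{} Q0 {} {} {} {}".format(qid, doc_id, rank, score, run_name)

def trec_lines(qid, scored_docs, run_name, first_rank=0):
    """Rank (doc_id, score) pairs by descending score and format them."""
    ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [trec_line(qid, doc_id, rank, score, run_name)
            for rank, (doc_id, score) in enumerate(ordered, start=first_rank)]

# Made-up identifiers and scores, for illustration only
sample = [("222", 1.5), ("111", 3.25)]
for line in trec_lines("1", sample, "test"):
    print(line)
```

Keeping the formatting in a separate function makes it easy to test the file layout without a running search engine.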

In [30]:
# THIS IS GRADED!
def make_trec_run(es, topics_file_name, run_file_name, index_name="genomics", run_name="test"):
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # BEGIN ANSWER
                q = Q("multi_match", query = query, fields = ["TI", "AB"])
                s = Search().using(es).query(q).index(index_name)
                s = s[:1000]
                response = s.execute()
                hits = response.to_dict()['hits']['hits']
                for rank, hit in enumerate(hits):
                    out = str(qid) + ' Q0 ' + hit['_source']['PMID'] + ' ' + str(rank) + ' ' + str(hit['_score']) + ' ' + run_name
                    run_file.write(out+'\n')
                # END ANSWER

                
# connect to ES server             
es = elasticsearch.Elasticsearch('localhost')
# Write the results of the queries contained in the topic file 'data01/FIR-s05-training-queries-simple.txt' 
# to the run file 'baseline.run', and name this run test01
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'baseline.run', "genomics-base", run_name='test01')

In [31]:
# this prints out (it is a shell command) the content of the file baseline.run 
!cat baseline.run

1 Q0 11929828 0 43.560047 test01
1 Q0 11751903 1 43.358788 test01
1 Q0 12384701 2 41.708458 test01
1 Q0 12065641 3 40.89842 test01
1 Q0 11980715 4 40.009636 test01
1 Q0 12126481 5 38.631733 test01
1 Q0 12455049 6 37.12175 test01
1 Q0 12444545 7 36.621918 test01
1 Q0 12431783 8 36.194813 test01
1 Q0 12204896 9 36.05962 test01
1 Q0 12119358 10 35.775574 test01
1 Q0 12242284 11 35.638016 test01
1 Q0 11886527 12 35.580967 test01
1 Q0 11779850 13 35.447456 test01
1 Q0 12203364 14 35.040398 test01
1 Q0 12110586 15 34.98481 test01
1 Q0 11767002 16 34.724464 test01
1 Q0 12115564 17 34.63533 test01
1 Q0 11827966 18 34.54958 test01
1 Q0 12112322 19 34.139168 test01
1 Q0 11762751 20 33.923866 test01
1 Q0 12368211 21 33.21544 test01
1 Q0 12055678 22 32.81765 test01
1 Q0 11940356 23 32.102512 test01
1 Q0 11989975 24 32.03188 test01
1 Q0 11862714 25 31.95563 test01
1 Q0 11756412 26 31.488062 test01
1 Q0 12203371 27 31.462303 test01
1 Q0 12173048 28 31.39246 test01
...

> Tip: Write a line to `run_file` using `run_file.write(line)`. 
> The newline character is: `'\n'`. Before writing a number to
> the file, cast it to a string using `str()`.
>
> The TREC Submission guidelines allow you to submit up to 1000
> documents per topic. Keep this in mind!
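
As a minimal sketch of the tip above, the helper below formats one line of a TREC run file and stops after 1000 lines per topic. The names `format_trec_line` and `write_topic` are hypothetical and are not part of the notebook's support code.

```python
import io

MAX_DOCS_PER_TOPIC = 1000  # limit from the TREC submission guidelines quoted above


def format_trec_line(qid, doc_id, rank, score, run_name):
    """Return one run-file line: qid Q0 doc_id rank score run_name."""
    return f"{qid} Q0 {doc_id} {rank} {score} {run_name}\n"


def write_topic(run_file, qid, scored_docs, run_name):
    """Write at most MAX_DOCS_PER_TOPIC lines for one topic.

    scored_docs is an iterable of (doc_id, score) pairs that is already
    sorted by decreasing score. Returns the number of lines written.
    """
    written = 0
    for rank, (doc_id, score) in enumerate(scored_docs):
        if rank >= MAX_DOCS_PER_TOPIC:
            break
        run_file.write(format_trec_line(qid, doc_id, rank, score, run_name))
        written += 1
    return written


# Usage with an in-memory file and two example (doc_id, score) pairs
buffer = io.StringIO()
write_topic(buffer, 1, [("11929828", 43.560047), ("11751903", 43.358788)], "test01")
print(buffer.getvalue(), end="")
```

Ranks start at 0 here, as in the run written by `make_trec_run` above.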

# Part 03: Search models 


<span style="background:red; color: white;">__You are advised to work on this part after Lecture 02__</span>


### Background
The way documents are indexed influences the performance of IR systems. 
Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/mapping.html) define how a document and its properties (fields) are stored and indexed, and also provide tools to implement and execute different document similarity measures (i.e. search models). When you change the configuration of an Elasticsearch mapping, the document collection needs to be re-indexed (or a new index needs to be created; use the functions we provided above to do that).

> See again: [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/indices-create-index.html).
> _Note: the default model (similarity) in ElasticSearch is BM25. Different models need to be specified (see example)._

For instance, we can add a new field `"title-abstract"` that uses the [similarity measure](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/similarity.html) _Boolean_, and let it serve as an index for the fields `"TI"` and `"AB"` (title and abstract):

> Please note that if you want to use the `boolean` similarity for the single fields, you need to specify it for each field. Otherwise, the default BM25 will be used.

In [33]:
boolean = {
  "settings" : {
    # a single shard, so we do not suffer from approximate document frequencies
    "number_of_shards" : 1
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "title-abstract",
          "similarity": "boolean"
        },
        "TI": {
          "type": "text",
          "copy_to": "title-abstract",
          "similarity": "boolean"
        },
        "title-abstract": {  # compound field
          "type": "text",
          "similarity": "boolean"
        }
      }
  }
}

es = elasticsearch.Elasticsearch('localhost')
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-bool', body=boolean)

(263080, [])

> Most changes to the mappings cannot be done on an existing index. Some (for instance
> similarity measures) can be changed if the index is first closed. Nevertheless, we 
> will in this notebook _re-index_ the collection for every change to the mappings
> using the function `index_documents()` that we defined above. Mappings (and settings)
> can be passed to the function using the `body` parameter.
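
Since the models in this part share the same mapping layout and differ only in the similarity, the index body can be generated by a small helper. This is a sketch only: `make_index_body` is a hypothetical name, and it assumes the `TI`/`AB` fields and the `copy_to` layout of the `boolean` example above.

```python
def make_index_body(similarity, compound_field="title-abstract", custom_similarity=None):
    """Build a settings/mappings dict in which TI, AB and the compound
    field all use the same similarity.

    similarity: name of a built-in similarity (e.g. "boolean") or of a
        custom similarity defined in custom_similarity.
    custom_similarity: optional dict {name: {"type": ..., <params>}} that is
        placed under settings -> index -> similarity.
    """
    source_field = {"type": "text", "copy_to": compound_field, "similarity": similarity}
    body = {
        # a single shard, so document frequencies are exact
        "settings": {"number_of_shards": 1},
        "mappings": {
            "properties": {
                "AB": dict(source_field),
                "TI": dict(source_field),
                compound_field: {"type": "text", "similarity": similarity},
            }
        },
    }
    if custom_similarity:
        body["settings"]["index"] = {"similarity": custom_similarity}
    return body


boolean_body = make_index_body("boolean")
print(boolean_body["mappings"]["properties"]["title-abstract"])
```

The resulting dict could then be passed as the `body` argument of `index_documents()`.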

<span style="background:#444; color: white;">__We suggest creating a separate index for each search model (as far as the disk space on your VM allows). This avoids situations in which changes to the mappings are not applied correctly and you do not see the expected results.__</span>

<span style="background:#444; color: white;">E.g. for the 'boolean' model, we created the 'genomics-bool' index.</span>

Let's have a look at the mappings and settings for our index as follows:

In [34]:
es.indices.get(index='genomics-bool')

{'genomics-bool': {'aliases': {},
  'mappings': {'properties': {'AB': {'type': 'text',
     'copy_to': ['title-abstract'],
     'similarity': 'boolean'},
    'AD': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'AID': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CI': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CIN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CON': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CY': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DA': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DCOM': {'type': 'text',
     'fields': {'keyword': {'t

Now let's search our new field `"title-abstract"` as follows:

In [41]:
query = "molecule"
search_type = "dfs_query_then_fetch" # this will use exact document frequencies even for multiple shards
body = {
  "query": {
    "match" : { "title-abstract" : query }
  },
  "size": 10
}
es.search(index="genomics-bool", search_type=search_type, body=body)

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3185, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'genomics-bool',
    '_type': '_doc',
    '_id': '125',
    '_score': 1.0,
    '_source': {'AB': 'Previous studies from our laboratories revealed the susceptibility of Leishmania sp. to glibenclamide (GLIB), a potassium channel blocker which selectively interacts with adenosine-binding-cassette transporters. In the present work, we analyzed whether the drug sensitivity of intracellular amastigotes correlates with changes in macrophage features that are related to their function as antigen-presenting cells. We provide evidence that in BALB/c murine macrophages, GLIB induced a decrease in the interferon-gamma-stimulated expression of major histocompatibility complex class II molecules and the co-stimulatory molecule CD86 (B7-2). Furthermore, it caused a decrease in the interleukin-1 secretion
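
The raw response is verbose. A small helper, sketched below under the assumption that the response has the `hits` -> `hits` -> `_id` / `_score` shape shown in the output above, reduces it to (id, score) pairs. The name `top_hits` and the sample response are illustrative only.

```python
def top_hits(response, k=10):
    """Return up to k (document id, score) pairs from a search response dict."""
    hits = response["hits"]["hits"]
    return [(hit["_id"], hit["_score"]) for hit in hits[:k]]


# Hand-written response with the same shape as the output above (not real results)
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_index": "genomics-bool", "_id": "125", "_score": 1.0},
            {"_index": "genomics-bool", "_id": "301", "_score": 1.0},
        ],
    }
}
print(top_hits(sample_response))
```

With the `boolean` similarity and a single query term, every matching document receives the same score (here 1.0), so the order among the hits carries no ranking information.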

## Exercise 03.A: _new run and evaluation_
Create a new run file (e.g. `boolean.run`), compute the retrieval performance with the function `print_trec_eval()` and compare the results with the baseline run file `baseline.run`.

In [79]:
#THIS IS GRADED!

# write your code here
# BEGIN ANSWER
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'boolean.run', "genomics-bool", run_name='test02')

print("--- BASELINE RUN ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "baseline.run")
print("\n")

print("--- BOOLEAN RUN ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "boolean.run")
# END ANSWER

--- BASELINE RUN ---
Results for baseline.run
mean success_at_1              0.1053
mean success_at_5              0.2632
mean success_at_10             0.3158
mean r_precision               0.09156
mean precision_at_1            0.1053
mean precision_at_5            0.07895
mean precision_at_10           0.04737
mean precision_at_50           0.01947
mean precision_at_100          0.01395
mean precision_at_recall_00    0.2015
mean precision_at_recall_01    0.1898
mean precision_at_recall_02    0.1683
mean precision_at_recall_03    0.1333
mean precision_at_recall_04    0.1236
mean precision_at_recall_05    0.1227
mean precision_at_recall_06    0.08744
mean precision_at_recall_07    0.08435
mean precision_at_recall_08    0.05999
mean precision_at_recall_09    0.05803
mean precision_at_recall_10    0.05803
mean average_precision         0.1116


--- BOOLEAN RUN ---
Results for boolean.run
mean success_at_1              0.1579
mean success_at_5              0.1842
mean success_at_10      

The boolean model has a higher mean success at 1 than the baseline model (0.1579 against 0.1053). However, the baseline model has a higher mean success at 5 and mean success at 10. 

The same pattern holds for precision: mean precision at 1 (which by definition equals mean success at 1) is higher for the boolean model, whereas precision at 5, 10, 50 and 100 is higher for the baseline model.

As far as r-precision is concerned, the boolean model performs slightly better, with a value of 0.1096 against 0.09156 for the baseline model.

For mean precision at recall X, the baseline model outperforms the boolean model at the first two thresholds (mean precision_at_recall_00 and mean precision_at_recall_01), but performs worse at all the following thresholds. Overall, the mean average precision of the two models is comparable: the boolean model scores 0.1197, whereas the baseline scores 0.1116.

All in all, we think that these results indicate that the boolean model performs slightly better than the baseline.
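
To make such a comparison less error-prone, the per-metric winner can be computed rather than read off by eye. The sketch below uses only the four boolean-run values quoted in the text above, together with the corresponding baseline values; `compare_runs` is a hypothetical helper and not part of the provided support code.

```python
def compare_runs(metrics_a, metrics_b, name_a="a", name_b="b"):
    """Return {metric: winner} for the metrics present in both dicts."""
    winners = {}
    for metric, value_a in metrics_a.items():
        if metric not in metrics_b:
            continue
        value_b = metrics_b[metric]
        if value_a > value_b:
            winners[metric] = name_a
        elif value_a < value_b:
            winners[metric] = name_b
        else:
            winners[metric] = "tie"
    return winners


baseline = {"success_at_1": 0.1053, "success_at_5": 0.2632,
            "r_precision": 0.09156, "average_precision": 0.1116}
boolean = {"success_at_1": 0.1579, "success_at_5": 0.1842,
           "r_precision": 0.1096, "average_precision": 0.1197}

for metric, winner in compare_runs(baseline, boolean, "baseline", "boolean").items():
    print(f"{metric:20s} {winner}")
```

A higher mean does not by itself show that the difference is statistically significant.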

## Exercise 03.B: _Language models_

Custom similarities can be configured by tuning the parameters of the built-in similarities. Read more about these (expert) options in the [similarity module](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html).

> Tip: the example similarity settings have to be used in a `"settings"` object.
> Check your settings and mappings with: `es.indices.get(index='NAME-OF-INDEX')`.

__1. Make a run that uses Language Models with [Jelinek-Mercer smoothing](http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html) (linear interpolation smoothing) on the field `"all"` that indexes the fields `"TI"` and `"AB"`. Use the parameter `lambda=0.2`.__

In [80]:
# Increase the timeout; with the default value we got timeout errors on our machine
es = elasticsearch.Elasticsearch([{"host": "localhost", "port": 9200}], timeout=30)

In [81]:
#THIS IS GRADED!

lmjelinekmercer = {
    # BEGIN ANSWER
  "settings" : {
      "number_of_shards" : 1,
      "index" : {
          "similarity" : {
              "Jelinek_Mercer" : {
                  "type" : "LMJelinekMercer",
                  "lambda" : 0.2
              }
          }
      }
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "all",
          "similarity": "Jelinek_Mercer"
        },
        "TI": {
          "type": "text",
          "copy_to": "all",
          "similarity": "Jelinek_Mercer"
        },
        "all": {  # compound field
          "type": "text",
          "similarity": "Jelinek_Mercer"
        }
      }
  }
    # END ANSWER
}


In [82]:
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-jm', body=lmjelinekmercer)
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'lmjelinekmercer.run', 'genomics-jm')
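
For intuition about what `lambda` controls, the sketch below computes a Jelinek-Mercer smoothed term probability as a linear interpolation of the document model and the collection model. It assumes the convention in which `lambda` weights the collection model (to our understanding this is the convention of Lucene's `LMJelinekMercerSimilarity`; some textbooks and slides put lambda on the document model instead). The function name and the numbers are illustrative, and the code is not Elasticsearch's scoring implementation.

```python
def jelinek_mercer_prob(tf, doc_len, collection_prob, lam=0.2):
    """Smoothed P(t|d) = (1 - lam) * tf / doc_len + lam * P(t|C).

    tf: frequency of the term in the document
    doc_len: number of tokens in the document
    collection_prob: P(t|C), relative frequency of the term in the collection
    lam: weight of the collection model (assumed convention, see lead-in)
    """
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    return (1.0 - lam) * tf / doc_len + lam * collection_prob


# A term occurring twice in a 10-token document, with P(t|C) = 0.01
print(jelinek_mercer_prob(tf=2, doc_len=10, collection_prob=0.01))
# A term that is absent from the document still gets a non-zero probability
print(jelinek_mercer_prob(tf=0, doc_len=10, collection_prob=0.01))
```

Under this convention, a small `lambda` such as 0.2 keeps most of the weight on the document's own term statistics.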

__2. Make a run that uses Language Models with [Dirichlet smoothing](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) to index the fields `"TI"` and `"AB"`. Use the parameter `mu=2000`.__

In [83]:
#THIS IS GRADED!

dirichlet = {
    # BEGIN ANSWER
  "settings" : {
      "number_of_shards" : 1,
      "index" : {
          "similarity" : {
              "Dirichlet" : {
                  "type" : "LMDirichlet",
                  "mu" : 2000
              }
          }
      }
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "all",
          "similarity": "Dirichlet"
        },
        "TI": {
          "type": "text",
          "copy_to": "all",
          "similarity": "Dirichlet"
        },
        "all": {  # compound field
          "type": "text",
          "similarity": "Dirichlet"
        }
      }
  }
    # END ANSWER
}

In [84]:
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-dirichlet', body=dirichlet)
make_trec_run(es, 'data01/FIR-s05-training-queries-simple.txt', 'dirichlet.run', 'genomics-dirichlet')
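
To illustrate the role of `mu`, the sketch below computes a Dirichlet-smoothed term probability and the share of the estimate that comes from the collection model. It follows the textbook formulation of Dirichlet smoothing; the function names and numbers are illustrative, and the code is not Elasticsearch's scoring implementation.

```python
def dirichlet_prob(tf, doc_len, collection_prob, mu=2000):
    """Smoothed P(t|d) = (tf + mu * P(t|C)) / (doc_len + mu)."""
    if doc_len + mu <= 0:
        raise ValueError("doc_len + mu must be positive")
    return (tf + mu * collection_prob) / (doc_len + mu)


def collection_weight(doc_len, mu=2000):
    """Share of the smoothed estimate that comes from the collection model."""
    return mu / (doc_len + mu)


# A term occurring twice in a 10-token document, with P(t|C) = 0.01
print(dirichlet_prob(tf=2, doc_len=10, collection_prob=0.01))
# For a hypothetical 200-token document, most of the weight is on the collection
print(collection_weight(doc_len=200))
```

Unlike Jelinek-Mercer smoothing, the amount of smoothing depends on the document length: the shorter the document relative to `mu`, the more the estimate leans on the collection model.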

## Exercise 03.C: _Model comparison_


__1. Compute the performance results of the `lmjelinekmercer.run` and `dirichlet.run`. Compare them with those of the `baseline.run` and `boolean.run`. Evaluate the runs using the `print_trec_eval` function. Performing statistical tests may help strengthen your claims.__
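
One simple statistical test for paired per-query scores is the two-sided sign test. The sketch below is one possible implementation using only the standard library; it assumes you have per-query values of a metric (for example average precision) for two runs, listed in the same query order. The function name `sign_test` and the per-query lists in the example are made up for illustration and are not results from this collection.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired scores.

    Ties are discarded. Returns the p-value of observing a split at least
    as uneven as the one found, if both runs were equally likely to win
    on any query.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs need one score per query")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of at most k wins for the weaker side under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-query scores: run A wins on five queries, one tie
run_a = [0.5, 0.6, 0.7, 0.8, 0.9, 0.4]
run_b = [0.1, 0.2, 0.3, 0.4, 0.5, 0.4]
print(sign_test(run_a, run_b))
```

The test only looks at which run wins on each query, not by how much, so it is conservative but makes few assumptions about the score distribution.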

In [86]:
#THIS IS GRADED!

# your comments here
# BEGIN ANSWER
print("--- Jelinek-Mercer smoothing ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "lmjelinekmercer.run")
print("\n")

print("--- Dirichlet smoothing ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "dirichlet.run")
print("\n")

print("--- Baseline model ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "baseline.run")
print("\n")

print("--- Boolean search ---")
print_trec_eval("data01/FIR-s05-training-qrels.txt", "boolean.run")
print("\n")

# TODO: sign tests
# END ANSWER

--- Jelinek-Mercer smoothing ---
Results for lmjelinekmercer.run
mean success_at_1              0.1316
mean success_at_5              0.2895
mean success_at_10             0.2895
mean r_precision               0.1525
mean precision_at_1            0.1316
mean precision_at_5            0.08947
mean precision_at_10           0.05263
mean precision_at_50           0.02053
mean precision_at_100          0.01316
mean precision_at_recall_00    0.2201
mean precision_at_recall_01    0.2196
mean precision_at_recall_02    0.2179
mean precision_at_recall_03    0.1992
mean precision_at_recall_04    0.1932
mean precision_at_recall_05    0.1905
mean precision_at_recall_06    0.1523
mean precision_at_recall_07    0.1493
mean precision_at_recall_08    0.1178
mean precision_at_recall_09    0.1172
mean precision_at_recall_10    0.1172
mean average_precision         0.1668


--- Dirichlet smoothing ---
Results for dirichlet.run
mean success_at_1              0.05263
mean success_at_5              0.2632


In [59]:
print('Top 10 retrieved documents baseline.run')
! head -10 baseline.run

print('\nTop 10 retrieved documents boolean.run')
! head -10 boolean.run

print('\nTop 10 retrieved documents lmjelinekmercer.run')
! head -10 lmjelinekmercer.run

print('\nTop 10 retrieved documents dirichlet.run')
! head -10 dirichlet.run

Top 10 retrieved documents baseline.run
1 Q0 11929828 0 43.560047 test01
1 Q0 11751903 1 43.358788 test01
1 Q0 12384701 2 41.708458 test01
1 Q0 12065641 3 40.89842 test01
1 Q0 11980715 4 40.009636 test01
1 Q0 12126481 5 38.631733 test01
1 Q0 12455049 6 37.12175 test01
1 Q0 12444545 7 36.621918 test01
1 Q0 12431783 8 36.194813 test01
1 Q0 12204896 9 36.05962 test01

Top 10 retrieved documents boolean.run
1 Q0 11929828 0 43.560047 test02
1 Q0 11751903 1 43.358788 test02
1 Q0 12384701 2 41.708458 test02
1 Q0 12065641 3 40.89842 test02
1 Q0 11980715 4 40.009636 test02
1 Q0 12126481 5 38.631733 test02
1 Q0 12455049 6 37.12175 test02
1 Q0 12444545 7 36.621918 test02
1 Q0 12431783 8 36.194813 test02
1 Q0 12204896 9 36.05962 test02

Top 10 retrieved documents lmjelinekmercer.run
1 Q0 12368211 0 44.58327 test
1 Q0 11929828 1 43.560047 test
1 Q0 11751903 2 43.358788 test
1 Q0 11929828 3 42.470695 test
1 Q0 12384701 4 41.708458 test
1 Q0 12065641 5 40.89842 test
1 Q0 11980715 6 40.48863 test
1 Q0 11



__2. Provide below your comments and interpretations of the results. Why, in your opinion, is one search model better than the others?__

### Answer
The boolean model is the best with respect to the mean success_at_1 metric, with a score of 0.1579. This means that for roughly 16% of the queries, this model retrieves a relevant document in the first position. Dirichlet is the worst of the four models, with a mean success_at_1 of 0.05263.
For the success_at_5 and success_at_10 metrics, the best models are the Jelinek-Mercer model and the baseline, respectively, with scores of 0.2895 and 0.3158.

The Jelinek-Mercer model also outperformed the other three models with respect to mean r_precision (by a clear margin: it scores 0.1525, whereas the second-best model, the boolean one, scores 0.1096) and mean precision_at_k for k = 5, 10 and 50. The baseline model was again the best with respect to mean precision_at_100.

The Jelinek-Mercer model is also the best at every level of mean precision_at_recall_X, while the Dirichlet model is the worst at every level. It is therefore unsurprising that the Jelinek-Mercer model's mean average precision is also the highest of the four models.

In our opinion, the model implementing Jelinek-Mercer smoothing is the best of the four, since it outperforms the others in both r-precision and mean average precision, two of the most important metrics for evaluating the performance of IR systems.
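Differences between mean scores do not show whether one model is reliably better across queries. A two-sided sign test on paired per-query scores is one simple way to check this. The sketch below is a minimal illustration: the function name `sign_test` and the example scores are hypothetical, and it assumes that per-query values (for example average precision) are available as two lists in the same query order.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-query scores.

    Returns (wins_a, wins_b, p_value). Queries on which both systems
    score the same are ignored, as is usual for the sign test.
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # number of non-tied queries
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this uneven under H0 (p = 0.5), one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)

# Made-up per-query scores, for illustration only
system_a = [0.30, 0.10, 0.25, 0.40, 0.05]
system_b = [0.20, 0.05, 0.20, 0.10, 0.00]
print(sign_test(system_a, system_b))  # (5, 0, 0.0625)
```

With only five queries, even a 5-0 split is not significant at the 0.05 level, which shows why the number of queries matters when comparing runs.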

However, it is worth noting that we did not tune the parameters of the smoothed models. It would be interesting to tune the values of lambda (Jelinek-Mercer) and mu (Dirichlet): this could either corroborate or overturn the result.
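One way such tuning could be set up is to build one index body per parameter value and evaluate each resulting run on the training qrels. The sketch below only constructs the bodies. The helper `lm_index_body`, the similarity name `tuned_lm` and the grid values are hypothetical choices; `LMJelinekMercer` with `lambda` and `LMDirichlet` with `mu` are the similarity types and options as we understand them from the Elasticsearch similarity documentation.

```python
def lm_index_body(sim_type, param_name, value, fields=("TI", "AB")):
    """Build an index body that scores `fields` with one smoothed language model."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # 'tuned_lm' is an arbitrary name chosen for this sketch
                    "tuned_lm": {"type": sim_type, param_name: value}
                }
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "similarity": "tuned_lm"}
                for field in fields
            }
        },
    }

# Illustrative grids; sensible ranges depend on the collection
jm_grid = [lm_index_body("LMJelinekMercer", "lambda", lam)
           for lam in (0.1, 0.3, 0.5, 0.7, 0.9)]
dirichlet_grid = [lm_index_body("LMDirichlet", "mu", mu)
                  for mu in (500, 1000, 2000, 3000)]
```

Each body could then be passed to `index_documents()` as in the other exercises, followed by a new run file and `print_trec_eval()` for every parameter value.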


# Part 04: Index improvements: Tokenization
<span style="background:red; color: white;">__You are advised to work on this part after Lecture 03__</span>



## Background
The following part of the assignment requires some self-study of the ElasticSearch tools that support improving the index. Please read the following:
* [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-create-index.html).
* Elasticsearch [Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis.html) contain many options for improving your search engine.

> We suggest consulting the [Python Elasticsearch Client](https://elasticsearch-py.readthedocs.io) library documentation.

## Example: _ElasticSearch Analyzers for tokenization_

The number and quality of the tokens used to construct the inverted index are of great importance. In ElasticSearch, mappings and settings also allow specifying which [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) is used to tokenize your documents and queries. The mappings below use the _Dutch_ analyzer for the field `"all"`:

> Usually, the same analyzer should be applied to documents and queries, but 
> Elasticsearch allows you to specify a `"search_analyzer"` that is used on 
> your queries (which we do not need to use in the assignment).

In [42]:
analyzer_test = {
  "mappings": {
      "properties": {
        "all": {
          "type": "text",
          "analyzer": "dutch"
        }
      }
  }
}

# create the index, but don't index any documents:
create_index(es, 'test-tokens', body=analyzer_test)

The analyzer defined for the `"all"` field can be tested [as follows](https://elasticsearch-py.readthedocs.io/en/master/api.html#indices). Translated to English, the text says: _"These are Dutch sentences"_.

> The following script lists the tokens produced by the `dutch` analyzer: try different analyzers and different sentences to see how the tokens are created.

In [43]:
from pprint import pprint # pretty print

body = { "field": "all", "text": "dit zijn nederlandse zinnen"}
tokens = es.indices.analyze(index='test-tokens', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 20,
             'position': 2,
             'start_offset': 9,
             'token': 'nederland',
             'type': '<ALPHANUM>'},
            {'end_offset': 27,
             'position': 3,
             'start_offset': 21,
             'token': 'zinn',
             'type': '<ALPHANUM>'}]}


##  Exercise 04.A: _chat language analyzer_

Read the documentation for [Custom Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis-custom-analyzer.html). 
Make a custom analyzer for _English chat language_. The analyzer should do the following:
* change common abbreviations to the full forms: 
  * _b4_ to _before_, 
  * _abt_ to _about_, 
  * _chk_ to _check_, 
  * _dm_ to _direct message_,
  * _f2f_ to _face-to-face_
* use the _standard_ tokenizer;
* put everything to lower-case;
* filter English stopwords.
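The order of these steps matters: in a custom analyzer, character filters run on the raw text first, then the tokenizer splits it, and finally the token filters are applied in the order listed. The plain-Python sketch below imitates that pipeline to make the order visible. It is not the Elasticsearch implementation: `toy_chat_analyzer` is a hypothetical name, the regular expression only approximates the `standard` tokenizer, and the stopword set is our recollection of Lucene's default English list.

```python
import re

ABBREVIATIONS = {
    "b4": "before",
    "abt": "about",
    "chk": "check",
    "dm": "direct message",
    "f2f": "face to face",
}

# Assumed to match the default '_english_' stopword list
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def toy_chat_analyzer(text):
    # 1. Character filter: plain substring replacement on the raw text
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # 2. Tokenizer: split on non-word characters (rough stand-in)
    tokens = re.findall(r"\w+", text)
    # 3. Token filters, in order: lowercase, then stopword removal
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]

print(toy_chat_analyzer("done it b4! what abt dm me?"))
# ['done', 'before', 'what', 'about', 'direct', 'message', 'me']
print(toy_chat_analyzer("B4 admin"))
# ['b4', 'adirect', 'messagein']
```

The second call shows two side effects of replacing characters before tokenizing and lowercasing, which to our understanding also apply to the real `mapping` character filter: the replacement is case-sensitive, so `B4` is not expanded, and it acts on substrings, so the `dm` inside `admin` is rewritten as well.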

In [57]:
#THIS IS GRADED!

tweet_analyzer = {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "replace_abbreviations" 
          ],
          "tokenizer": "standard", 
          "filter": [
            "lowercase",
            "stopwords_english" 
          ]
        }
      },
      "char_filter": {
        "replace_abbreviations": { 
          "type": "mapping",
          "mappings": [
            "b4 => before",
            "abt => about",
            "chk => check",
            "dm => direct message",
            "f2f => face to face"
          ]
        }
      },
      "filter": {
        "stopwords_english": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "all": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}


# create the index, but don't index any documents:
create_index(es, 'genomics', body=tweet_analyzer)
body = { "field": "all", "text": "done it b4! what abt dm me?"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 4,
             'position': 0,
             'start_offset': 0,
             'token': 'done',
             'type': '<ALPHANUM>'},
            {'end_offset': 10,
             'position': 2,
             'start_offset': 8,
             'token': 'before',
             'type': '<ALPHANUM>'},
            {'end_offset': 16,
             'position': 3,
             'start_offset': 12,
             'token': 'what',
             'type': '<ALPHANUM>'},
            {'end_offset': 20,
             'position': 4,
             'start_offset': 17,
             'token': 'about',
             'type': '<ALPHANUM>'},
            {'end_offset': 22,
             'position': 5,
             'start_offset': 21,
             'token': 'direct',
             'type': '<ALPHANUM>'},
            {'end_offset': 23,
             'position': 6,
             'start_offset': 22,
             'token': 'message',
             'type': '<ALPHANUM>'},
            {'end_offset': 26,
             'po

## Exercise 04.B: Stemmers

Referring to Exercise 02.A, we have seen that queries like `molecule` and `molecular` retrieve different sets of documents. Lemmatizers and stemmers can help with indexing and searching 'similar' terms, and retrieve more consistent sets of documents.

__Use ElasticSearch [Stemming](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html) to index the document collection. Then retrieve documents with the queries `molecule` and `molecular` and comment on any differences from the previous query results.__


In [61]:
#THIS IS GRADED!

body = {
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_stemmer"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "AB": {
        "analyzer": "my_custom_analyzer",
        "type": "text"
      },
      "TI": {
        "analyzer": "my_custom_analyzer",
        "type": "text"
      }
    }
  }
}

In [62]:
# Connect to the ElasticSearch server
es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

# Index the collection into the index called 'genomics-stem'
index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-stem', body)

(263080, [])

__Retrieve documents with the queries `molecule` and `molecular` and comment on any differences from the previous query results.__

In [63]:
#THIS IS GRADED!
# BEGIN ANSWER
# Compare the hit counts of each query on the stemmed and the unstemmed index.
for term in ("molecule", "molecular"):
    print("---", term.upper(), "---")
    q = Q("multi_match", query=term)
    for label, index_name in (("STEMMING!", "genomics-stem"),
                              ("NON STEMMING!", "genomics-base")):
        # count() returns the total number of matching documents
        n_hits = Search().using(es).query(q).index(index_name).count()
        print(label, "Number of retrieved documents:", n_hits, "\n")

# END ANSWER

--- MOLECULE ---
STEMMING! Number of retrieved documents: 7345 

NON STEMMING! Number of retrieved documents: 3404 

--- MOLECULAR ---
STEMMING! Number of retrieved documents: 31569 

NON STEMMING! Number of retrieved documents: 31556 



In [None]:
# Comment here about the eventual different results you get 
# -> words that are stemmed, like molecule and molecules, should improve the retrieval results (in terms of the number of retrieved docs)
# -> words that are not stemmed, like 'molecular', should not see very different results

# the point of the exercise is not to have the same results with molecule and molecular, but 
# what the stemmer does, and reason after that.


**ANSWER** As expected, using stemming on `molecule` increases the number of results from 3404 to 7345. This is because the word "molecule" gets stemmed, so documents containing other forms with the same stem (for example, "molecules") are retrieved as well.

Using stemming on `molecular` increases the number of results by just 13 documents (from 31556 to 31569), meaning that the two indexes (with and without stemming) behave almost identically for this query. Again, this was expected: the word "molecular" is left unchanged by the stemmer, so applying stemming or not gives more or less the same number of retrieved documents.
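The effect described above can be reproduced on a toy collection. The sketch below builds a tiny inverted index with and without term normalization. `crude_stem` is a deliberately simplistic suffix stripper written for this illustration (it is not the algorithm behind the Elasticsearch `english` stemmer), and the three documents are made up.

```python
def crude_stem(token):
    """Strip a plural 's' and then a trailing 'e' (toy rule, not Porter)."""
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token

def build_index(docs, normalize):
    """Map each normalized token to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

def search(index, query, normalize):
    # The query must be normalized in the same way as the documents
    return index.get(normalize(query.lower()), set())

docs = {1: "a small molecule", 2: "signalling molecules", 3: "molecular biology"}
identity = lambda token: token

plain = build_index(docs, identity)
stemmed = build_index(docs, crude_stem)

print(search(plain, "molecule", identity))      # {1}
print(search(stemmed, "molecule", crude_stem))  # {1, 2}
print(search(plain, "molecular", identity))     # {3}
print(search(stemmed, "molecular", crude_stem)) # {3}
```

The query `molecule` gains a document because `molecule` and `molecules` are mapped to the same index term, while `molecular` is untouched by the toy rule and matches the same document in both indexes.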

# BONUS PART: _Implement your own similarity measure_ 

We have only seen the results of applying the analyzer to queries. The analyzer results for the _documents_ are available through the `termvectors()` function, shown below for document `id=3`. (Additionally, we can get overall field statistics, such as the number of documents.)

> First, index the collection again. While waiting, have a coffee or tea :) 

> `id=3` refers to the internal document identifier, not to the PubMed identifier.

_The bonus exercise is not mandatory. It can compensate for lower grades of other exercises._

In [66]:
import elasticsearch
es = elasticsearch.Elasticsearch(host='localhost')

index_documents(es, 'data01/FIR-s05-medline.json', 'genomics-base')

es.termvectors(index="genomics-base", id="3", fields="TI", 
               term_statistics=True, field_statistics=True, offsets=False)

{'_index': 'genomics-base',
 '_type': '_doc',
 '_id': '3',
 '_version': 1,
 'found': True,
 'took': 119,
 'term_vectors': {'TI': {'field_statistics': {'sum_doc_freq': 2312794,
    'doc_count': 197176,
    'sum_ttf': 2436354},
   'terms': {'against': {'doc_freq': 1318,
     'ttf': 1324,
     'term_freq': 1,
     'tokens': [{'position': 8}]},
    'alignment': {'doc_freq': 64,
     'ttf': 67,
     'term_freq': 1,
     'tokens': [{'position': 18}]},
    'an': {'doc_freq': 9015,
     'ttf': 9209,
     'term_freq': 1,
     'tokens': [{'position': 19}]},
    'and': {'doc_freq': 74206,
     'ttf': 86810,
     'term_freq': 1,
     'tokens': [{'position': 12}]},
    'application': {'doc_freq': 1041,
     'ttf': 1042,
     'term_freq': 1,
     'tokens': [{'position': 20}]},
    'binding': {'doc_freq': 2060,
     'ttf': 2147,
     'term_freq': 1,
     'tokens': [{'position': 23}]},
    'carbonyl': {'doc_freq': 54,
     'ttf': 57,
     'term_freq': 1,
     'tokens': [{'position': 13}]},
    'change

### Implement the BM25 similarity

Complete the function `bm25_similarity()` below by implementing the BM25 similarity as described in Section 11.4.3 of [Manning, Raghavan and Schuetze, Chapter 11](https://nlp.stanford.edu/IR-book/pdf/11prob.pdf). Are you able to replicate the score of ElasticSearch (9.55)? If not, are you using a different variant of the BM25 model? Provide your comments in plain text.

In [77]:
#THIS IS GRADED!

import math

# math.log(x) computes the logarithm of x

def bm25_similarity (query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics-base', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics-base", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check if the tokens exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER    
    ### IMPLEMENTING AS FORMULA 11.30 ###
    
    for tkn in tokens:
        if tkn["token"] in vector.keys():
            similarity += math.log((f_stats["doc_count"] / vector[tkn["token"]]["doc_freq"]))
  
    # END ANSWER
    return similarity
    
    
print("The value of BM25 similarity, as per formula 11.30 is:\n")
bm25_similarity("structure refinement", 3)

The value of BM25 similarity, as per formula 11.30 is:



13.603278431432187

In [78]:
#THIS IS GRADED!

import math

# math.log(x) computes the logarithm of x

def bm25_similarity (query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics-base', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics-base", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check if the tokens exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER
    ### IMPLEMENTING AS FORMULA 11.31 ###
    
    for tkn in tokens:
        if tkn["token"] in vector.keys():
            similarity += math.log((f_stats["doc_count"] - vector[tkn["token"]]["doc_freq"] + 0.5) / (vector[tkn["token"]]["doc_freq"] + 0.5))
  
    # END ANSWER
    return similarity
    
    
print("The value of BM25 similarity, as per formula 11.31 is:\n")
bm25_similarity("structure refinement", 3)

The value of BM25 similarity, as per formula 11.31 is:



13.579422882031427

In [79]:
#THIS IS GRADED!

import math

# math.log(x) computes the logarithm of x

def bm25_similarity (query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics-base', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics-base", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check if the tokens exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER
    ### IMPLEMENTING AS FORMULA 11.32 ###
    
    # Default elasticsearch values for b and k_1, as written on 
    # https://www.elastic.co/blog/practical-bm25-part-3-considerations-for-picking-b-and-k1-in-elasticsearch
    k_1 = 1.2
    b = 0.75
    L_ave = f_stats["sum_ttf"] / f_stats["doc_count"]
    L_d = len(vector)
    for tkn in tokens:
        if tkn["token"] in vector.keys():
            
            numerator = vector[tkn["token"]]["term_freq"] * (k_1 + 1)
            denominator = k_1 * ((1 -b) + b*(L_d / L_ave) + vector[tkn["token"]]["term_freq"])
            similarity += math.log((f_stats["doc_count"] / vector[tkn["token"]]["doc_freq"])) * (numerator / denominator)
  
    # END ANSWER
    return similarity
    
print("The value of BM25 similarity, as per formula 11.32 is:\n")
bm25_similarity("structure refinement", 3)

The value of BM25 similarity, as per formula 11.32 is:



8.988042300492586

We implemented three variants of the BM25 similarity, corresponding to formulas 11.30, 11.31 and 11.32 of the book, respectively.
The first two formulas are quite simple and return values of 13.60 and 13.58, respectively. The third formula is more complex and uses two hyperparameters, $b$ and $k_1$, which we set to Elasticsearch's default values.
The result is a BM25 similarity of 8.99, which is much closer to the reference score of 9.55 computed by Elasticsearch.
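
For reference, the three formulas can be written out as follows. This is our transcription of Section 11.4.3 of the book and is worth checking against the source. Here $N$ is the number of documents, $\mathrm{df}_t$ the document frequency of term $t$, $\mathrm{tf}_{t,d}$ its frequency in document $d$, $L_d$ the length of $d$ and $L_{ave}$ the average document length:

$$
\begin{aligned}
RSV_d &= \sum_{t \in q} \log \frac{N}{\mathrm{df}_t} && (11.30)\\
RSV_d &= \sum_{t \in q} \log \frac{N - \mathrm{df}_t + \frac{1}{2}}{\mathrm{df}_t + \frac{1}{2}} && (11.31)\\
RSV_d &= \sum_{t \in q} \log \frac{N}{\mathrm{df}_t} \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + \mathrm{tf}_{t,d}} && (11.32)
\end{aligned}
$$

In (11.32) the term frequency is added outside the $k_1(\dots)$ bracket and $L_d$ counts tokens. Our third cell above uses `len(vector)` (the number of distinct terms) for $L_d$ and places the term frequency inside the bracket that is multiplied by $k_1$, so its value of 8.99 should be read as an approximation of (11.32).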

We are probably unable to replicate this score exactly because Elasticsearch implements a slightly different variant of the formula and therefore performs slightly different computations. Another possible reason is the document length: our code and Elasticsearch may not measure the length of the `TI` field in the same way.

See below the 'reference score' computed by ElasticSearch:

In [80]:
body = {
  "query": {
    "match" : { "TI" : "structure refinement" }
  }
}
explain = es.explain(index="genomics-base", id="3", body=body)
print (explain['explanation']['value'])  # BM25 score computed by ElasticSearch

9.552309
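
The explanation object returned by `es.explain` also describes how the score was computed. In the Elasticsearch 7 versions we are aware of, each term score is reported as `boost * idf * tf`, with `idf = log(1 + (N - n + 0.5) / (n + 0.5))` and `tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))`, where the boost includes the factor `k1 + 1`. Treat this as an assumption to verify against `explain['explanation']['details']` on your own installation. The sketch below puts that variant next to formula 11.32 of the book as plain functions. All function names are hypothetical, and the numbers in the usage example are made up rather than taken from the collection.

```python
import math

def field_length(terms):
    """Token count of a field, from the 'terms' part of a termvectors response."""
    return sum(info["term_freq"] for info in terms.values())

def bm25_term_book(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Contribution of one query term under formula 11.32 of the book."""
    idf = math.log(n_docs / df)
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (norm + tf)

def bm25_term_lucene(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Same term weight, but with the idf variant described above (assumed)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (norm + tf)

# Made-up term vector and statistics, for illustration only
terms = {"structure": {"term_freq": 1}, "protein": {"term_freq": 2}}
doc_len = field_length(terms)  # 3 tokens, although only 2 distinct terms
print(bm25_term_book(1, df=10, n_docs=100, doc_len=doc_len, avg_len=3.0))
print(bm25_term_lucene(1, df=10, n_docs=100, doc_len=doc_len, avg_len=3.0))
```

Summing either function over the query terms found in the document gives a document score. Even with identical inputs the two variants differ slightly through the idf term. Elasticsearch also stores field lengths in a compressed form, so the `dl` it uses may differ from the exact token count for longer fields.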
