# Information Retrieval Practice

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It is a great tool for quickly building applications with full-text search capabilities. The core implementation is in Java, but it provides a REST interface that lets you interact with Elasticsearch from any programming language.


## Install Elastic Search

To install Elasticsearch, download the package for your platform from the Get Elasticsearch page at https://www.elastic.co/es/start


![](https://github.com/acastellanos-ie/NLP-MBD-EN-PT-2021-J-1/blob/main/ir_practice/download.png?raw=1)

Once downloaded, unzip the tar.gz file and run `bin/elasticsearch` (or `bin\elasticsearch.bat` on Windows). This will launch the ElasticSearch Server. Once the server is running, by default it's accessible at [localhost:9200](http://localhost:9200).

## Querying Elastic Search via Python

To query Elasticsearch you can call the server's REST endpoint directly. However, the `elasticsearch-py` Python library makes this easier: it provides a wrapper around the REST endpoint that allows us to query the server from Python.

In case you have not yet installed the libraries, you can execute the following code

In [None]:
! pip install elasticsearch-dsl
! pip install elasticsearch

In [None]:
import warnings
warnings.filterwarnings('ignore')

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, Index
from pprint import pprint

# Exercise 0: Indexing and Searching Demo for ElasticSearch

Now it's time to run a demo program. In this practice, we will create an inverted index of sample documents (indexing) and then use the Elasticsearch query grammar to search the documents (searching).
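To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is not how Elasticsearch (or Lucene) stores its index; the `build_inverted_index` helper and the toy documents are assumptions made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Toy documents (hypothetical, not the practice dataset)
toy_docs = {
    "d1": "Obama was a senator",
    "d2": "Clinton was a senator",
    "d3": "Obama met Clinton",
}
toy_index = build_inverted_index(toy_docs)
print(toy_index["obama"])    # ['d1', 'd3']
print(toy_index["senator"])  # ['d1', 'd2']
```

Searching then becomes a dictionary lookup per query term instead of a scan over every document.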

### Useful functions

Functions to facilitate the reading of the dataset

In [None]:
import os, io
from collections import namedtuple

Doc = namedtuple('Doc', 'filename path text')

def read_doc(doc_path, encoding):
    '''
        reads a document from path
        input:
            - doc_path : path of document
            - encoding: encoding
        output: =>
            - doc: instance of Doc namedtuple
    '''
    filename = doc_path.split('/')[-1]
    fp = io.open(doc_path, 'r', encoding = encoding)
    text = fp.read().strip()
    fp.close()
    return Doc(filename = filename, text = text, path = doc_path)

def read_dataset(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : root directory of the dataset (walked recursively)
            - encoding: encoding
        output: =>
            - docs: instances of Doc namedtuple returned as generator
    '''
    for root, dirs, files in os.walk(path):
        for doc_path in files:
            yield read_doc(root + '/' + doc_path, encoding)

### Setting up the connector

To index the documents, we first need to make a connection to **Elasticsearch**.

In [None]:
es_conn = Elasticsearch(
    'localhost',
)

es_conn

<Elasticsearch([{'host': 'localhost'}])>

## Indexing

We will try to index the sample documents in `./sample_documents`.

Before we index the documents, we first need to define the **configuration of Elasticsearch**. In this step you can configure the indexer (tokenizer, stemmer, lemmatizer, and so on) and choose which search algorithm Elasticsearch will use.

The code below shows a simple configuration for this demo.
It tells Elasticsearch that our document `doc` will have three fields, `filename`, `path`, and `text`, and that we will use the `text` field for search. `my_analyzer` will be used to parse the `text` field and, as the search analyzer, to parse search queries later on. `"index": False` in the `filename` and `path` fields tells Elasticsearch not to index these two fields, so we cannot search them with queries.

The detailed documentation of analyzer can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).
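As a rough illustration of what an analyzer such as `my_analyzer` does to a piece of text, the sketch below imitates its steps in plain Python: an HTML-stripping character filter that keeps `<b>` tags, a whitespace tokenizer, a lowercase filter, and a stop-word filter. The `toy_analyze` function and the short `STOP_WORDS` set are assumptions for illustration; the real analyzer runs inside Elasticsearch and may behave differently on edge cases.

```python
import re

# Small subset of English stop words, chosen for illustration only
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def toy_analyze(text):
    """Rough imitation of my_analyzer: strip HTML tags (except <b>),
    split on whitespace, lowercase, and drop stop words."""
    no_html = re.sub(r"<(?!/?b>)[^>]+>", " ", text)    # keep <b> and </b>
    tokens = no_html.split()                           # whitespace tokenizer
    tokens = [t.lower() for t in tokens]               # lowercase filter
    return [t for t in tokens if t not in STOP_WORDS]  # stop filter

print(toy_analyze("<p>The <b>President</b> of the United States, Barack Obama</p>"))
# ['<b>president</b>', 'united', 'states,', 'barack', 'obama']
```

Note that splitting on whitespace leaves punctuation attached to tokens (`states,`), so a query for `states` would not match that token.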

`"similarity": "boolean"` in `text` field will let elasticsearch know that we will use a boolean search algorithm to search `text` field. The detailed documentation of search algorithms can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html)  and [here](https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html). 


In [None]:
# configuration for indexing
settings = {
  "mappings": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase","stop"
          ],
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Now we will retrieve the sample documents and index them into the `INDEX_NAME` index. To that end, the following two functions will help you create the index and index the documents.


In [None]:
ES_HOSTS = ['http://localhost:9200']
INDEX_NAME = 'sample_index'
DOCS_PATH = 'practice_data/sample_documents'

def create_index(es_conn, index_name, settings):
    '''
        create index structure in elasticsearch server. 
        If index_name exists in the server, it will be removed, and new index will be created.
        input:
            - es_conn: elasticsearch connection object
            - index_name: name of index to create
            - settings: settings and mappings for index to create
        output: =>
            - None
    '''
    if es_conn.indices.exists(index = index_name):
        es_conn.indices.delete(index = index_name)
        print('index `{}` deleted'.format(index_name))
    es_conn.indices.create(index = index_name, body = settings)
    print('index `{}` created'.format(index_name))            
            
def build_index(es_conn, dataset, index_name, settings, DOC_TYPE='doc'):
    '''
        build index from a collection of documents
        input:
            - es_conn: elasticsearch connection object
            - dataset: iterable, collection of namedtuple Doc objects
            - index_name: name of the index where the documents will be stored
            - settings: settings and mappings for the index to create
            - DOC_TYPE: type signature of documents
    '''
    # (re)create the index: an existing index with the same name is deleted first
    create_index(es_conn = es_conn, index_name = index_name, settings=settings)
    counter_read, counter_idx_failed = 0, 0 # counters

    # retrieve & index documents
    for doc in dataset:
        res = es_conn.index(
            index = index_name,
            id = doc.filename,
            body = doc._asdict())
        counter_read += 1

        if res['result'] != 'created':
            counter_idx_failed += 1
        elif counter_read % 500 == 0:
            print('indexed {} documents'.format(counter_read))

    print('indexed {} docs to index `{}`, failed to index {} docs'.format(
        counter_read,
        index_name,
        counter_idx_failed
    ))
    
    # refresh after indexing
    es_conn.indices.refresh(index=index_name)  



In [None]:
dataset = read_dataset(DOCS_PATH)
build_index(es_conn, dataset, INDEX_NAME, settings)

index `sample_index` deleted
index `sample_index` created
indexed 5 docs to index `sample_index`, failed to index 0 docs


We successfully created an inverted index for the sample documents in `./sample_documents`. It's time to search the documents with some queries.

## Searching

### Full-Text Search

The two most important aspects of full-text search are as follows:

##### Relevance

>The ability to rank results by how relevant they are to the given query, whether relevance is calculated using TF/IDF (see [What Is Relevance?](https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-intro.html)), proximity to a geolocation, fuzzy similarity, or some other algorithm.

##### Analysis

>The process of converting a block of text into distinct, normalized tokens (see [Analysis and Analyzers](https://www.elastic.co/guide/en/elasticsearch/guide/master/analysis-intro.html)) in order to (a) create an inverted index and (b) query the inverted index.

#### Term-Based Versus Full-Text

Two types of text query:

##### Term-based

Queries like the `term` or `fuzzy` queries are low-level queries that have no analysis phase. They operate on a single term. A `term` query for the term `Foo` looks for that exact term in the inverted index and calculates the TF/IDF relevance `_score` for each document that contains the term.

##### Full-text queries

Queries like the `match` or `query_string` queries are high-level queries that understand the mapping of a field:

* If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.

* If you query an exact value (not_analyzed) string field, they will treat the whole query string as a single term.

* But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document.
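The difference between the two query families can be sketched with a toy index in plain Python. The functions `toy_term_query` and `toy_match_query` are hypothetical stand-ins, and lowercasing plus whitespace splitting is assumed as the analyzer; this is not the Elasticsearch implementation.

```python
# Toy inverted index whose terms were lowercased at indexing time (assumption)
toy_index = {"foo": ["d1", "d2"], "bar": ["d2"]}

def toy_term_query(index, term):
    """Term-level lookup: the term is used exactly as given, with no analysis."""
    return index.get(term, [])

def toy_match_query(index, query_string):
    """Full-text lookup: analyze the query first, then run one term lookup per token."""
    hits = set()
    for token in query_string.lower().split():   # stand-in for the field's analyzer
        hits.update(toy_term_query(index, token))
    return sorted(hits)

print(toy_term_query(toy_index, "Foo"))       # [] because the index holds 'foo', not 'Foo'
print(toy_match_query(toy_index, "Foo Bar"))  # ['d1', 'd2']
```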

#### The match Query

We will now run several types of queries.

First, a query with a single term:

In [None]:
s = Search(using=es_conn, index="sample_index")
s = s.query("match", text={"query": "obama"})
res = s.execute()

for hit in res:
    print(hit.filename, hit.text[:100], '... - Score:', hit.meta.score)
    print()

doc1.txt Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States ... - Score: 1.0

doc2.txt Michelle LaVaughn Robinson Obama (born January 17, 1964) is the wife of the forty-fourth President o ... - Score: 1.0

doc3.txt Joseph Robinette "Joe" Biden, Jr. (born November 20, 1942) is the 47th and current Vice President of ... - Score: 1.0

doc5.txt John Sidney McCain III (born August 29, 1936) is the senior United States Senator from Arizona. He w ... - Score: 1.0



#### Multiword Queries

Obviously, we can search on more than one word at a time:

In [None]:
s = Search(using=es_conn, index="sample_index")
s = s.query("match", text={"query":    "Obama Hillary"})
res = s.execute()

for hit in res:
    print(hit.filename, hit.text[:200], '... - Score:', hit.meta.score)
    print()

doc1.txt Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States, the first African American to hold the office. He served as the junior United States Senator from  ... - Score: 2.0

doc2.txt Michelle LaVaughn Robinson Obama (born January 17, 1964) is the wife of the forty-fourth President of the United States, Barack Obama, and is the first African-American First Lady of the United States ... - Score: 1.0

doc3.txt Joseph Robinette "Joe" Biden, Jr. (born November 20, 1942) is the 47th and current Vice President of the United States. He was a United States Senator from Delaware from January 3, 1973 until his resi ... - Score: 1.0

doc4.txt Hillary Diane Rodham Clinton (born October 26, 1947) is the 67th United States Secretary of State, serving within the administration of President Barack Obama. She was a United States Senate from New  ... - Score: 1.0

doc5.txt John Sidney McCain III (born August 29, 1936) is the senior United States Senat

The important thing is: any document whose `text` field contains at least one of the specified terms will match the query. The more terms that match, the more relevant the document.
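The scores printed above (2.0 when both terms match, 1.0 when only one does) are consistent with a simple counting scheme. The sketch below reproduces that behaviour on toy data; `toy_boolean_score` and the tokenized toy documents are assumptions for illustration and not the Elasticsearch scoring code.

```python
def toy_boolean_score(query_tokens, doc_tokens):
    """Count how many distinct query terms occur in the document."""
    return float(len(set(query_tokens) & set(doc_tokens)))

# Hypothetical, already-analyzed documents
toy_docs = {
    "a.txt": ["obama", "met", "hillary"],
    "b.txt": ["obama", "spoke"],
    "c.txt": ["biden", "spoke"],
}
query = ["obama", "hillary"]

# "or" semantics: keep every document that matches at least one term
scores = {name: toy_boolean_score(query, toks) for name, toks in toy_docs.items()}
ranked = sorted(((n, s) for n, s in scores.items() if s > 0), key=lambda x: -x[1])
print(ranked)  # [('a.txt', 2.0), ('b.txt', 1.0)]
```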

But what if we want both terms to appear in the document?

In [None]:
s = Search(using=es_conn, index="sample_index")
s = s.query("match", text={
    "query":    "Obama Hillary",
    "operator": "and"})
res = s.execute()

for hit in res:
    print(hit.filename, hit.text, '... - Score:', hit.meta.score)
    print()


doc1.txt Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States, the first African American to hold the office. He served as the junior United States Senator from Illinois from January 2005 until he resigned after his election to the presidency in November 2008.

Obama is a graduate of Columbia University and Harvard Law School, where he was the president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney in Chicago and also taught constitutional law at the University of Chicago Law School from 1992 to 2004.

Obama served three terms in the Illinois Senate from 1997 to 2004. Following an unsuccessful bid for a seat in the U.S. House of Representatives in 2000, Obama ran for United States Senate in 2004. His victory, from a crowded field, in the March 2004 Democratic primary raised his visibility. His prime-time televised keynote address at the Democratic Nat

And now a query for documents that contain one term but NOT the other:

In [None]:
# Boolean query: documents that match "hillary" but NOT "obama"
s = Search(using=es_conn, index="sample_index")
s = s.query("bool", 
            must = [Q('match', text="hillary")],
            must_not = [Q('match', text="obama")]
           )

res = s.execute()

for hit in res:
    print(hit.filename, hit.text[:100], '... - Score:', hit.meta.score)
    print()

doc4.txt Hillary Diane Rodham Clinton (born October 26, 1947) is the 67th United States Secretary of State, s ... - Score: 1.0
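Conceptually, a `bool` query with `must` and `must_not` clauses behaves like set algebra over posting lists: intersect the documents of the `must` terms, then subtract the documents of the `must_not` terms. The following sketch shows this on hypothetical postings; `toy_bool_query` is an illustrative helper, not part of `elasticsearch-dsl`.

```python
# Hypothetical posting sets: term -> ids of documents that contain it
toy_postings = {
    "hillary": {"d1", "d4"},
    "obama": {"d1", "d2", "d3"},
}

def toy_bool_query(postings, must=(), must_not=()):
    """Documents containing every `must` term and none of the `must_not` terms."""
    if not must:
        return set()
    result = set.intersection(*(postings.get(t, set()) for t in must))
    for term in must_not:
        result -= postings.get(term, set())
    return result

print(toy_bool_query(toy_postings, must=["hillary"], must_not=["obama"]))  # {'d4'}
```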



# Exercise 1: Evaluating Results

We will show how the retrieved results can be evaluated with the **trec_eval** evaluation program.

**trec_eval** is the standard software for evaluating search engines with test collections.

First, we need to check the `government` folder which contains three things:

- A set of documents to be indexed, in the *documents* directory.
    
- A set of queries, also called 'topics', in the *topics/gov.topics* file. The format of a `*.topics` file is "query_id query_terms". For example, the first line of the 'gov.topics' file is
    
    `1 mining gold silver coal`
    
    which means that the ID of the query is *1* and the corresponding query is *mining gold silver coal*.

- A set of judgements, saying which documents are relevant for each query, in the *qrels/gov.qrels* file. The format of a `*.qrels` file is "query_id 0 document_name binary_relevance". For example, the first line of 'gov.qrels' is
    
    `1 0 G00-00-0681214 0`
    
    which means that the document `G00-00-0681214` is not relevant to the query with ID *1*. The binary relevance is *1* if the file is relevant to the query, otherwise *0*. Please ignore the second field *0*, as it is always *0*.
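The two line formats described above can be parsed with a few lines of Python. This is a minimal sketch; `parse_topic_line` and `parse_qrels_line` are hypothetical helper names and are not used by the rest of the notebook.

```python
def parse_topic_line(line):
    """Split 'query_id query_terms' into (query_id, query_terms)."""
    query_id, _, terms = line.strip().partition(" ")
    return query_id, terms

def parse_qrels_line(line):
    """Split 'query_id 0 document_name binary_relevance' into
    (query_id, document_name, is_relevant)."""
    query_id, _unused, doc_name, relevance = line.split()
    return query_id, doc_name, int(relevance) == 1

print(parse_topic_line("1 mining gold silver coal"))  # ('1', 'mining gold silver coal')
print(parse_qrels_line("1 0 G00-00-0681214 0"))       # ('1', 'G00-00-0681214', False)
```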

## Create new index

In the previous exercise, we created the index (inverted index) of five sample documents. In this one, you will create a new index with the documents in the `government/documents` folder.

To build a new index, you first need to define a new configuration. Note that `EVAL_INDEX_NAME` should differ from the previous index name so that a separate index is built for the documents in `government/documents`.

After defining the new configuration, your job is to create the new index reusing the code from the previous exercise.

In [None]:
settings = {
  "mappings": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

### Exercise 1.1: Create the new index

You can reuse the previous code

In [None]:
EVAL_INDEX_NAME = 'government'
EVAL_DOCS_PATH = 'practice_data/government/documents'

dataset = read_dataset(EVAL_DOCS_PATH)
build_index(es_conn, dataset, EVAL_INDEX_NAME, settings)

index `government` deleted
index `government` created
indexed 500 documents
indexed 1000 documents
indexed 1500 documents
indexed 2000 documents
indexed 2500 documents
indexed 3000 documents
indexed 3500 documents
indexed 4000 documents
indexed 4079 docs to index `government`, failed to index 0 docs


### Exercise 1.2. Read topics and produce result file

Read topics (queries) from a file (`government/topics/gov.topics`) and then search the documents indexed by **Elasticsearch**. You may choose any of the search algorithms.

Produce result file (e.g., *retrieved.txt*) according to **trec_eval** standard output format: 

`01 Q0 document1 0 1.23 my_IR_system1`

`01 Q0 document2 1 1.08 my_IR_system1`

where '01' is the query ID; ignore 'Q0'; 'documentX' is the name of the file; '0' (or '1' or some other integer) is the rank of this result; '1.23' (or '1.08' or some other number) is the score of this result; and 'my_IR_system1' is the name of your retrieval system. In particular, note that the rank field is ignored by **trec_eval**; internally, ranks are assigned by sorting by the score field, with ties broken deterministically (using the file name).

**Now here's your first job**

1. read the `gov.topics` file line by line,
2. send each query to Elasticsearch,
3. write the output according to the output format described above.

In [None]:
def read_topic_file(path, encoding = "ISO-8859-1"):
    '''
        reads the topics (queries) from every file under path
        input:
            - path : directory that contains the topic files
            - encoding: encoding
        output: =>
            - queries: list of (query_id, query_terms) tuples
    '''
    queries = []
    for root, dirs, files in os.walk(path):
        for doc_path in files:
            filename = os.path.join(root, doc_path)
            # the context manager closes the file even if reading fails
            with io.open(filename, 'r', encoding = encoding) as fp:
                for line in fp:
                    # skip blank lines; the first token is the id, the rest is the query
                    if line.strip():
                        query_id, _, terms = line.strip().partition(" ")
                        queries.append((query_id, terms.strip()))
    return queries

In [None]:
queries = read_topic_file("./practice_data/government/topics")

In [None]:
queries

[('1', 'mining gold silver coal'),
 ('2', 'juvenile delinquency'),
 ('4', 'wireless communications'),
 ('6', 'physical therapists'),
 ('7', 'cotton industry'),
 ('9', 'genealogy searches'),
 ('10', 'Physical Fitness'),
 ('14', 'Agricultural biotechnology'),
 ('16', 'Emergency and disaster preparedness assistance'),
 ('18', 'Shipwrecks'),
 ('19', 'Cybercrime, internet fraud, and cyber fraud'),
 ('22', "Veteran's Benefits"),
 ('24', 'Air Bag Safety'),
 ('26', 'Nuclear power plants'),
 ('28', 'Early Childhood Education')]

In [None]:
def search(query_string, es_conn, index_name, operator = "or"):
    '''
        searches for query_string with default search algorithm
        input:
            - query_string: a query
            - es_conn: elasticsearch connection
            - index_name: name of index
        output:
            - a generator of tuple (filename, score)

    '''
    res = es_conn.search(index = index_name, size = 100,
        body = {
            "_source": ["filename"],
            "query": {
                "match": {
                    "text":{
                        "query": query_string,
                        "operator" : operator    
                    }
                }
            }
        }
    )
    for hit in res['hits']['hits']:
        filename = hit["_source"]["filename"]
        score = hit["_score"]
        yield (filename, score)

   

In [None]:
def write_trec_file(query, res, output_file):
    # formatter of searched result
    for ranking, match in enumerate(sorted(res, key = lambda x: -x[1])):
        output_file.write('{} Q0 {} {} {} {}\n'.format(
            query,
            match[0], # filename
            ranking,
            match[1], # score
            "IR_system"
        ))

In [None]:
es_conn = Elasticsearch(ES_HOSTS)

# the context manager closes the file even if a query fails
with open("retrieved.txt", "w") as output_file:
    for query_id, query in queries:
        res = search(query, es_conn, EVAL_INDEX_NAME)
        write_trec_file(query_id, res, output_file)

### Exercise 1.3.  Evaluation

It's time to run the evaluation which compares the qrels file provided in *gov.qrels* with your result file.

TREC_EVAL measures the performance of your search engine. To evaluate your search results, you need two files: the retrieved results file and the ground truth file.
Let's say your retrieval results are saved in `retrieved.txt`, and the ground truth is saved in `gov.qrels`.
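Before running the evaluation, it may help to see the main metrics computed by hand on a toy ranking. The sketch below is a simplified illustration with made-up document ids; the function names are assumptions and are independent of the evaluation code used in this notebook.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

ranked = ["d3", "d1", "d7", "d2"]   # hypothetical system output, best first
relevant = {"d1", "d2", "d9"}       # hypothetical ground truth

print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 4))     # 2 of 3 relevant documents found
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 3
```

Relevant documents that are never retrieved (`d9` here) still count in the denominators of recall and average precision, which is why missing them lowers both scores.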

The TREC_EVAL evaluation tool is rather outdated and difficult to execute. For this reason, I have taken the following piece of code from this repository https://github.com/prachibhansali/TrecIREvaluation to facilitate its execution.

In [None]:
from collections import defaultdict
import sys
import getopt

def computePrecisionsAndRecall(rankedDocs, queryRelevantDocs, kranks):
	precisions = defaultdict(list)
	recall = defaultdict(list)
	kprecisions = defaultdict(list)
	krecall = defaultdict(list)
	fValues = defaultdict(list)
	rPrecisions = {}
	for key, value in rankedDocs.items():
		rel=0
		qlen = len(queryRelevantDocs[key])
		for index, docid in enumerate(value):
			if(docid in queryRelevantDocs[key]):
				rel=rel+1
			precision_index=float(rel)/(index+1)
			recall_index=float(rel)/qlen
			if(docid in queryRelevantDocs[key]):
				precisions[key].append(precision_index)
				recall[key].append(recall_index)
			if((index+1) in kranks):
				kprecisions[key].append((index+1,precision_index))
				krecall[key].append((index+1,recall_index))
				fval = computeFValue(precision_index , recall_index)
				fValues[key].append((index+1,fval))
			if((index+1)==len(queryRelevantDocs[key])):
				rPrecisions[key]=precision_index
	return precisions, recall, rPrecisions, kprecisions, krecall, fValues

def computeAveragePrecision(precisions, queryRelevantDocs):
	avgPrecisions = {}
	for key,value in precisions.items():
		sum=0
		for f in value:
			sum = sum + float(f)
		sum = sum/(len(queryRelevantDocs[key]))
		avgPrecisions[key]=sum
	return avgPrecisions

def computeFValue(p,r):
	if(p==0 and r==0):
		return 0
	return (2*float(p)*float(r))/(float(p)+float(r))

import math
def computeNDCG(rankedDocs, queryRelevantDocs, grades):
	ndcg = {}
	for key,value in rankedDocs.items():
		sum=0
		for index,docid in enumerate(value):
			if(docid in queryRelevantDocs[key]):
				rank = index+1;
				ids = [id for id in grades[key] if id[0] == docid]
				tup = ids[0]
				grade = tup[1]
				sum = sum + (((2**grade)-1) * (1/(math.log((1+rank),2))))
		ndcg[key] = sum
	return ndcg

def computeKAverageAllQueries(lst, kranks):
	kp = []
	for rank in kranks:
		sum=0
		for key,values in lst.items():
			scores = [score for score in values if score[0]==rank]
			for score in (item[1] for item in scores):
				sum = sum + score
		sum = sum/(len(lst))
		kp.append(sum)
	return kp

def computeAverageAllQueries(lst, avgPrecisions):
	avg=0
	for _,v in lst.items():
		avg = avg+v	
	return float(avg)/len(avgPrecisions)

def evaluate(hasQ,qrel_loc,rankedlist_loc):
    
	grades = defaultdict(list)
	kranks = [5,10,20,50,100]
	queryRelevantDocs = defaultdict(list)
	rankedDocs = defaultdict(list)
    
	with open(qrel_loc) as f:
		for line in f:
			(qid,_,docid,rel) = line.split(' ')
			if(int(rel)==1 or int(rel)==2):
				queryRelevantDocs[int(qid)].append(docid)
				grades[int(qid)].append((docid,int(rel)))

	with open(rankedlist_loc) as r:
		for line in r:
			(qid,_,docid,_,score,_) = line.split(' ')
			t=(docid,float(score))
			rankedDocs[int(qid)].append(t)

	import operator
	for qid, value in rankedDocs.items():
		value.sort(key=operator.itemgetter(1),reverse=True)
		rankedDocs[qid]=list(x[0] for x in value)

	# unpack in the same order as the return statement of computePrecisionsAndRecall
	precisions, recall, rPrecisions, kprecisions, krecall, fValues = computePrecisionsAndRecall(rankedDocs, queryRelevantDocs,kranks)
	avgPrecisions = computeAveragePrecision(precisions, queryRelevantDocs)
	ndcg = computeNDCG(rankedDocs, queryRelevantDocs, grades)

	if(hasQ==True):
		avgPrecisionAllQueries = computeAverageAllQueries(avgPrecisions, avgPrecisions)
		avgRPrecisionAllQueries = computeAverageAllQueries(rPrecisions, avgPrecisions)
		avgndcgAllQueries = computeAverageAllQueries(ndcg, avgPrecisions)
		rt,rl,retrel,pvd,rvd,fvd = calculate_avg_metrics(avgPrecisions,rankedDocs,queryRelevantDocs,kprecisions,krecall,fValues)
		# print the summary only when the averages above were computed
		writeAverageOverQueries(rt,rl,retrel,pvd,rvd,fvd,avgPrecisionAllQueries,avgRPrecisionAllQueries,avgndcgAllQueries,kranks,kprecisions)
	return

def calculate_avg_metrics(avgPrecisions,rankedDocs,queryRelevantDocs,kprecisions,krecall,fValues):
	retrievedDocs=0
	relevantDocs=0
	retrel = 0
	pvd = {}
	rvd = {}
	fvd = {}
	
	for q in avgPrecisions:
		retrievedDocs = retrievedDocs + len(rankedDocs[q])
		relevantDocs = relevantDocs + len(queryRelevantDocs[q])
		relev = set(queryRelevantDocs[q])
		retr = set(rankedDocs[q])
		retrel = retrel+len(relev.intersection(retr))
		k=0
		for pv,rv,fv in zip(kprecisions[q],krecall[q],fValues[q]):
			val = pvd[k] if k in pvd else 0 
			pvd[k]=val+pv[1]
			val = rvd[k] if k in rvd else 0 
			rvd[k]=val+rv[1]
			val = fvd[k] if k in fvd else 0 
			fvd[k]=val+fv[1]
			k=k+1
	return (retrievedDocs,relevantDocs,retrel,pvd,rvd,fvd)

def writeAverageOverQueries(rt,rl,retrel,pvd,rvd,fvd,avgPrecisionAllQueries,avgRPrecisionAllQueries,avgndcgAllQueries,kranks,kprecisions):
	print('Total number of documents')
	print('Retrieved : '+str(rt))
	print('Relevant : '+str(rl) )
	print('ret_rel : ' + str(retrel))
	print('Average precision (non-interpolated) for all rel docs(averaged over queries): ' + "%.2f" %(avgPrecisionAllQueries))
	print('K'+ '\t' +'Precision'+'\t'+'Recall'+'\t\t'+'F1')
	writeKValues(pvd,rvd,fvd,kranks,kprecisions)
	print('R-Precision (precision after R (= num_rel for a query) docs retrieved):' + "%.2f" %(avgRPrecisionAllQueries) )
	print('ndcg over all queries : ' + "%.2f" %(avgndcgAllQueries) )
	return

def writeKValues(pvd,rvd,fvd,kranks,kprecisions):
	k=0
	length = len(kprecisions)
	for r in kranks:
		print(str(r)+"\t"+"%.2f" % (float(pvd[k])/length)+"\t\t"+"%.2f" %(float(rvd[k])/length)+"\t\t"+"%.2f" %(float(fvd[k])/length))
		k=k+1
	return




In [None]:
evaluate(True,"./practice_data/government/qrels/gov.qrels","retrieved.txt")

Total number of documents
Retrieved : 1045
Relevant : 31
ret_rel : 20
Average precision (non-interpolated) for all rel docs(averaged over queries): 0.17
K	Precision	Recall		F1
5	0.08		0.12		0.31
10	0.05		0.08		0.32
20	0.02		0.04		0.25
50	0.01		0.03		0.42
100	0.01		0.02		0.51
R-Precision (precision after R (= num_rel for a query) docs retrieved):0.06
ndcg over all queries : 0.49


# Improving the index

The baseline retrieval proposed above offered rather low performance. To improve it, we can tune the index settings to include some of the NLP processing that we have learned (e.g., stemming, stopwords, ...).

To that end, review the documentation of analyzer [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).



Although we could generate our own analyzers (as we did in the previous exercises with `my_analyzer`), Elasticsearch provides a set of predefined analyzers for the different languages. More information [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html).

In particular, we are going to use the `English Analyzer`

In addition, we can modify the index to use a more sophisticated similarity measure (e.g., `BM25`) than the boolean similarity.
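To give an intuition for why `BM25` can rank better than counting matched terms, here is a sketch of the textbook BM25 score of a single query term in a single document. The function name and the example numbers are assumptions; `k1=1.2` and `b=0.75` are the commonly used defaults, and the exact formula inside Lucene may differ in constant factors.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Hypothetical numbers: 1000 documents, the term occurs in 50 (or 900) of them
rare_hit = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=50)
common = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=900)

print(rare_hit > long_doc > 0)  # True: length normalization penalizes the long document
print(rare_hit > common)        # True: rarer terms get a higher idf
```

Unlike the boolean similarity, the score depends on how often the term occurs, how long the document is, and how rare the term is in the collection.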

## Exercise 2.1 English Analyzer + BM25

Modify the settings to apply the `English Analyzer` and use the `BM25` similarity

In [None]:
new_settings = {  
# Your code here
}

With these new settings we will create a new index, generate a new result file, and evaluate it with the `trec_eval` metrics.

In [None]:
ES_HOSTS = ['http://localhost:9200']
EVAL_INDEX_NAME = 'government'
EVAL_DOCS_PATH = 'practice_data/government/documents'

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(EVAL_DOCS_PATH)
build_index(es_conn, dataset, EVAL_INDEX_NAME, new_settings)

In [None]:
es_conn = Elasticsearch(ES_HOSTS)

# the context manager closes the file even if a query fails
with open("improved_retrieved.txt", "w") as output_file:
    for query_id, query in queries:
        res = search(query, es_conn, EVAL_INDEX_NAME)
        write_trec_file(query_id, res, output_file)

In [None]:
evaluate(True,"./practice_data/government/qrels/gov.qrels","improved_retrieved.txt")

Did the performance of the IR system improve? How?