# Assignment 1: Demo

The aim of this demo is to help you become familiar with some of the Python packages and methods that you'll be using in Assignment 1. This demo will also help reinforce some of the theory that you've learned in the Information Retrieval module.

## Setting up the work environment

To build a working environment for this demo, please refer to the **Assignment 1 - Setting up your work environment** instructions available on Wattle. 

1. Unzip `elasticsearch-6.3.0.zip`, then run the command below in a Terminal/Console window:

    `./elasticsearch-6.3.0/bin/elasticsearch -d`
    
    which will run **Elasticsearch** in the background.
    
    To check that Elasticsearch is running correctly, open your browser and go to *http://localhost:9200*. If 
    Elasticsearch is working correctly, you will see something like the following in your browser:
    
    ![ElasticSearch](images/es_working.png)
    
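1. To run the demo code, install the Python Elasticsearch client with *pip*. The demo code uses the 6.x client API (for example, the `doc_type` argument), so pin the client to the 6.x series to match the server:

    `pip install "elasticsearch>=6.0.0,<7.0.0"`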
    
1. To check that the client has installed correctly, run Python and import the library:

    `python -c "import elasticsearch"`
    
    If you don't receive any error messages, then your install has been successful.
    
1. You are now ready to proceed with the demo!
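
If you prefer to check the server from Python rather than from the browser, the sketch below requests the same *http://localhost:9200* page using only the standard library. The helper name `elasticsearch_version` is hypothetical, and the code assumes the server answers with a JSON document containing a `version.number` entry, as in the screenshot above.

```python
import json
import urllib.request

def elasticsearch_version(url="http://localhost:9200", timeout=2):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.loads(response.read().decode("utf-8"))
        return info.get("version", {}).get("number")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers replies that are not valid JSON.
        return None

version = elasticsearch_version()
print("Elasticsearch version:", version if version else "server not reachable")
```

If this prints "server not reachable", check that the `elasticsearch -d` command from step 1 is still running.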

## Tutorial 1: Indexing and Searching using Elasticsearch

Now it's time to run some demo programs. 

In this demo, we will create an inverted index of five sample documents (indexing), and then use the Elasticsearch query grammar to search the documents (searching).
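
To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is only an illustration of the data structure, not how Elasticsearch implements it, and the three toy documents are hypothetical rather than the ones shipped with the demo.

```python
from collections import defaultdict

# Hypothetical toy documents, not the ones shipped with the demo.
docs = {
    "d1.txt": "Obama met Hillary",
    "d2.txt": "Obama gave a speech",
    "d3.txt": "Hillary wrote a book",
}

def build_inverted_index(docs):
    """Map each whitespace-separated term to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in text.split():
            index[term].add(name)
    return index

index = build_inverted_index(docs)
print(sorted(index["Obama"]))    # documents containing "Obama"
print(sorted(index["Hillary"]))  # documents containing "Hillary"
```

Looking up a term is then a single dictionary access, which is why searching an inverted index is much faster than scanning every document.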

To begin with, we'll first define some useful functions used for indexing:

In [5]:
import os
from collections import namedtuple

#  A document class with the following attributes:
#   filename: document filename
#   text: body of document
#   path: path of document

Doc = namedtuple('Doc', 'filename path text')

def read_doc(doc_path, encoding):
    '''
        reads a document from path
        input:
            - doc_path : path of document
            - encoding: encoding
        output: =>
            - doc: instance of Doc namedtuple
    '''
    filename = doc_path.split('/')[-1]
    fp = open(doc_path, 'r', encoding = encoding)
    text = fp.read().strip()
    fp.close()
    return Doc(filename = filename, text = text, path = doc_path)

def read_dataset(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : directory containing the documents (searched recursively)
            - encoding: encoding
        output: =>
            - docs: instances of Doc namedtuple returned as generator
    '''
    for root, dirs, files in os.walk(path):
        for doc_path in files:
            yield read_doc(root + '/' + doc_path, encoding)

### Indexing

We will begin by indexing the sample documents stored in `./lab-demo/documents`. 

In order to do this, we first need to make a connection to the **Elasticsearch** server. The following code will:

1. Import the Python **Elasticsearch** library,
1. Establish a connection to the server,
1. Read sample documents, and
1. Create an inverted-index structure named `INDEX_NAME`

Before we index the documents, we first need to define the **configuration of Elasticsearch**. Here we define the basic configuration of the indexer, such as the tokenizer, stemmer, and lemmatizer, and also which similarity (scoring) algorithm Elasticsearch will use.

The code below shows a simple configuration for this demo.

The configuration tells Elasticsearch that our document type `doc` will have three fields: `filename`, `path`, and `text`. We will use the `text` field for our search queries. The analyzer `my_analyzer` will be used to parse the `text` field at index time, and it is also set as the search analyzer, which will parse search queries later on. The setting `"index": False` on the `filename` and `path` fields tells Elasticsearch not to index these two fields, so we cannot search them with queries. The detailed documentation of analyzers can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).

The `"similarity": "boolean"` setting in the `text` field tells Elasticsearch to score matches on that field with the boolean similarity: each query term that matches contributes a constant score, so the score depends only on whether the query terms match and not on how often they occur. The detailed documentation of the search algorithms can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html)  and [here](https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html). Read these documents carefully before attempting Assignment 1.
 
We further define `my_analyzer` in `settings`. The analyzer applies a stopword filter (`stop`) and uses `whitespace` as the tokenizer, which splits text into tokens at whitespace.
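
As a rough illustration of what this analyzer does, the sketch below mimics it in plain Python: split on whitespace, then drop stop words. The stop word set is an abbreviated, hypothetical stand-in for Elasticsearch's built-in English list, and `analyze` is an illustrative helper, not part of the Elasticsearch client.

```python
# Hypothetical, abbreviated stop word list (an assumption for illustration;
# Elasticsearch's built-in English list is longer).
STOP_WORDS = {"a", "an", "and", "the", "this", "is", "of", "to", "in"}

def analyze(text, stop_words=STOP_WORDS):
    """Split on whitespace, then drop tokens that are stop words."""
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]

tokens = analyze("Let's see. how the analyzer analyse. this sentence.!")
print(tokens)
```

Note that punctuation stays attached to the tokens (for example `see.`), because a whitespace tokenizer only splits at whitespace. This is one reason you may want to try other tokenizers and filters later.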

In [6]:
# configuration for indexing
demo_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

We next retrieve the sample documents and index them into the `INDEX_NAME` index:

In [19]:
from elasticsearch import Elasticsearch

ES_HOSTS = ['http://localhost:9200']
INDEX_NAME = 'lab-demo'
DOCS_PATH = 'lab-demo/documents'
DOC_TYPE = 'doc'

def create_index(es_conn, index_name, settings):
    '''
        create index structure in elasticsearch server. 
        If index_name exists in the server, it will be removed, and new index will be created.
        input:
            - es_conn: elasticsearch connection object
            - index_name: name of index to create
            - settings: settings and mappings for index to create
        output: =>
            - None
    '''
    if es_conn.indices.exists(index_name):
        es_conn.indices.delete(index = index_name)
        print('index `{}` deleted'.format(index_name))
    es_conn.indices.create(index = index_name, ignore = 400, body = settings)
    print('index `{}` created'.format(index_name))            
            
def build_index(es_conn, dataset, index_name, settings, DOC_TYPE='doc'):
    '''
        build index from a collection of documents
        input:
            - es_conn: elasticsearch connection object
            - dataset: iterable, collection of namedtuple Doc objects
            - index_name: name of the index where the documents will be stored
            - settings: settings and mappings for the index to create
            - DOC_TYPE: type signature of documents
    '''
    # (re)create the index; create_index deletes any existing index with this name
    create_index(es_conn = es_conn, index_name = index_name, settings=settings)
    counter_read, counter_idx_failed = 0, 0 # counters

    # retrieve & index documents
    for doc in dataset:
        res = es_conn.index(
            index = index_name,
            id = doc.filename,
            doc_type = DOC_TYPE,
            body = doc._asdict())
        counter_read += 1

        if res['result'] != 'created':
            counter_idx_failed += 1
        else:
            print('indexed {} documents'.format(counter_read))

    print('indexed {} docs to index `{}`, failed to index {} docs'.format(
        counter_read,
        index_name,
        counter_idx_failed
    ))
    
    # refresh after indexing
    es_conn.indices.refresh(index=index_name)  

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(DOCS_PATH)
build_index(es_conn, dataset, INDEX_NAME, demo_settings)

index `lab-demo` deleted
index `lab-demo` created
indexed 1 documents
indexed 2 documents
indexed 3 documents
indexed 4 documents
indexed 5 documents
indexed 5 docs to index `lab-demo`, failed to index 0 docs


Now that we have successfully created an inverted index for the sample documents in `./lab-demo/documents`, it's time to search the documents with some queries.

### Searching

Elasticsearch supports a specific query grammar: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

To understand the **score** of each result, see: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html#explain

**Example 1**

In [8]:
def extract_response(res):
    if res is not None:
        for hit in res['hits']['hits']:
            filename = hit["_source"]["filename"]
            score = hit["_score"]
            yield (filename, score)

def print_result(query, res):
    # formatter of searched result
    matches = extract_response(res)
    if matches is not None:
        for match in sorted(matches, key = lambda x: -x[1]):
            print('{}, {}, {}'.format(
                query,
                match[0], # filename
                match[1], # score
            ))

# Elasticsearch Query grammar can be found here:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

# Boolean Query "Obama"
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                  "match": {"text": "Obama"}
                }
              ]
            }
          }
        }
    )
print_result("Obama", res)

Obama, doc2.txt, 1.0
Obama, doc5.txt, 1.0
Obama, doc1.txt, 1.0
Obama, doc3.txt, 1.0


**Example 2**

In [9]:
# Boolean Query "Obama AND Hillary"
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
                "operator" : "and"
              }
            }
          }
        }
    )
print_result("Obama AND Hillary", res)

Obama AND Hillary, doc1.txt, 2.0


**Example 3**

In [10]:
# Boolean Query "Obama BUT Hillary", i.e. Obama AND NOT Hillary
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                    "match": {"text": "Obama"}
                }
              ],
              "must_not":[
                {
                    "match": {"text": "Hillary"}
                }
              ]
            }
          }
        }
    )
print_result("Obama BUT Hillary", res)

Obama BUT Hillary, doc2.txt, 1.0
Obama BUT Hillary, doc5.txt, 1.0
Obama BUT Hillary, doc3.txt, 1.0


**Example 4**

In [11]:
# Boolean Query "Obama OR Hillary"
# default is OR
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
              }
            }
          }
        }
    )
print_result("Obama OR Hillary", res)

Obama OR Hillary, doc1.txt, 2.0
Obama OR Hillary, doc4.txt, 1.0
Obama OR Hillary, doc2.txt, 1.0
Obama OR Hillary, doc5.txt, 1.0
Obama OR Hillary, doc3.txt, 1.0
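
The four examples can be mimicked with a small pure-Python sketch, which may help when reading the scores above. It assumes that each matched query term contributes 1.0 to the score, which is consistent with the outputs shown (2.0 when both terms match). The toy corpus and the helper names `matched_terms` and `boolean_search` are hypothetical.

```python
# Hypothetical toy corpus; the demo's real documents are not reproduced here.
docs = {
    "a.txt": "Obama and Hillary spoke",
    "b.txt": "Obama spoke",
    "c.txt": "Hillary spoke",
    "d.txt": "nobody spoke",
}

def matched_terms(text, terms):
    """Return how many of the given terms occur in the document text."""
    tokens = set(text.split())
    return sum(1 for term in terms if term in tokens)

def boolean_search(docs, terms, mode="or", must_not=()):
    """Return {filename: score}, where score is the number of matched terms."""
    results = {}
    for name, text in docs.items():
        if matched_terms(text, must_not) > 0:
            continue  # excluded by a must_not term
        score = matched_terms(text, terms)
        required = len(terms) if mode == "and" else 1
        if score >= required:
            results[name] = float(score)
    return results

and_hits = boolean_search(docs, ["Obama", "Hillary"], mode="and")
or_hits = boolean_search(docs, ["Obama", "Hillary"], mode="or")
not_hits = boolean_search(docs, ["Obama"], must_not=["Hillary"])
print(and_hits)
print(or_hits)
print(not_hits)
```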


## Tutorial 2: Evaluating an IR system using trec_eval

In this tutorial, we will show how the retrieved results can be evaluated using the **trec_eval** evaluation program, which is standard software for evaluating search engines against test collections.

### Initial setup

First, we need to install `trec_eval`:

- Unzip `trec_eval.zip`
- Go to the `trec_eval` folder
- Run the shell command `make` to create the `trec_eval` binary file. If you are working on your own machine, you may need to install this command first using `sudo apt-get install build-essential`.

Next, take a look at the contents of the `eval-demo` folder. It contains a small data set consisting of three things:

- A set of documents (11 email messages) to be indexed, in the *documents* directory.
    
- A set of queries, also called 'topics', in the *topics/air.topics* file. Each line of a `*.topics` file has the format "query_id query_terms". For example, the first line of 'air.topics' is
    
    `01 ducks`
    
    which means that the query ID is *01* and the corresponding query is *ducks*.

- A set of judgements, saying which documents are relevant for each query, in the *qrels/air.qrels* file. Each line of a `*.qrels` file has the format "query_id 0 document_name binary_relevance". For example, the first line of 'air.qrels' is
    
    `01 0 email01 0`
    
    which means that the document 'email01' is not relevant to the query with ID *01*. The binary relevance is *1* 
    if the file is relevant to the query, otherwise *0*. Please ignore the second field, as it is always *0*.
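
The two file formats described above can be parsed with a few lines of Python. This is a minimal sketch: the helper names `parse_topic` and `parse_qrel` are hypothetical, and the sample lines are the ones quoted above.

```python
def parse_topic(line):
    """Split a topics line into (query_id, query_terms)."""
    # Split on the first space only, because the query may contain spaces.
    query_id, query_terms = line.strip().split(" ", 1)
    return query_id, query_terms

def parse_qrel(line):
    """Split a qrels line into (query_id, document_name, is_relevant)."""
    query_id, _unused, doc_name, relevance = line.split()
    return query_id, doc_name, relevance == "1"

print(parse_topic("01 ducks\n"))
print(parse_topic("02 ig nobel prizes\n"))
print(parse_qrel("01 0 email01 0\n"))
```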
    
### Create an index

In the first tutorial, we created an index (inverted-index) of five sample documents. 

In this tutorial, we will create a new index with the documents stored in the eval-demo/documents folder.

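We first need to create a new index. Note that we use a different index name, `EVAL_INDEX_NAME`, so that a separate index is built for the documents in eval-demo/documents. The configuration below is the same as in Tutorial 1 except for an added `html_strip` character filter, which removes HTML markup from the text before it is tokenized.

After creating the new configuration, *your job will be to load it and create the index using the code from the demo*.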

In [12]:
# configuration for indexing
new_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

In [13]:
ES_HOSTS = ['http://localhost:9200']
EVAL_INDEX_NAME = 'eval-demo'
EVAL_DOCS_PATH = 'eval-demo/documents'
DOC_TYPE = 'doc'

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(EVAL_DOCS_PATH)
build_index(es_conn, dataset, EVAL_INDEX_NAME, new_settings)

index `eval-demo` deleted
index `eval-demo` created
indexed 1 documents
indexed 2 documents
indexed 3 documents
indexed 4 documents
indexed 5 documents
indexed 6 documents
indexed 7 documents
indexed 8 documents
indexed 9 documents
indexed 10 documents
indexed 11 documents
indexed 11 docs to index `eval-demo`, failed to index 0 docs


### Read topics and produce a results file

We next need to read topics (queries) from a file (*eval-demo/topics/air.topics*) instead of having them in the program directly, as we had in Tutorial 1, where they were hard-coded. 

We will search the documents indexed by **Elasticsearch**. You may choose any one of the search algorithms used in the demo.

Then, we produce a result file (e.g., *retrieved.txt*), according to the **trec_eval** standard output format: 

`01 Q0 email09 0 1.23 my_IR_system1`

`01 Q0 email06 1 1.08 my_IR_system1`

where '01' is the query ID (01 to 06); ignore 'Q0'; 'emailxx' is the name of the file; '0' (or '1' or some other integer number) is the rank of this result; '1.23' (or '1.08' or some other number) is the score of this result; and 'my_IR_system1' is the name for your retrieval system. 

In particular, note that **trec_eval** ignores the rank field; internally, ranks are assigned by sorting on the score field, with ties broken deterministically (using the file name).
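
As a small illustration of this output format, the sketch below turns a list of `(filename, score)` pairs into result lines. The helper name `format_run` is hypothetical, and the sample pairs are taken from the two example lines above.

```python
def format_run(query_id, matches, system_name="my_IR_system1"):
    """Return trec_eval result lines for one query, best score first."""
    ranked = sorted(matches, key=lambda m: -m[1])
    return [
        "{} Q0 {} {} {} {}".format(query_id, filename, rank, score, system_name)
        for rank, (filename, score) in enumerate(ranked)
    ]

lines = format_run("01", [("email06", 1.08), ("email09", 1.23)])
print("\n".join(lines))
```

Here the rank field is filled in from the sorted order, although any integer would do for evaluation purposes.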

**Now, here's your first job:**

1. Read the `air.topics` file line by line,
2. Send each query to Elasticsearch, and
3. Write the output according to the output format described above.

In [33]:
def search(query_string, es_conn, index_name):
    '''
        searches for query_string with default search algorithm
        input:
            - query_string: a query
            - es_conn: elasticsearch connection
            - index_name: name of index
        output:
            - a generator of tuple (filename, score)

    '''
    res = es_conn.search(index = index_name,
        body = {
            "_source": [ "filename"],
            "query": {
                "match": {
                    "text": {
                        "query": query_string,
                    }
                }
            }
        }
    )
    for hit in res['hits']['hits']:
        filename = hit["_source"]["filename"]
        score = hit["_score"]
        yield (filename, score)
        

#TODO: 1) read `air.topics` file line by line, 
#      2) send each query to Elasticsearch
#      3) write output according to the output format described above

with open('eval-demo/topics/air.topics', 'r') as f: # Note that the 'r' means we want to read a string from this path.
    query_strings = f.readlines()

with open('retrieved.txt', 'w') as f:
    for query_string in query_strings:
        print(query_string)
        matches = search(query_string[2:], es_conn, EVAL_INDEX_NAME)
        if matches is not None:
            for match in sorted(matches, key = lambda x: -x[1]):
                #print(match)
                f.write('{} Q0 {} 0 {} my_IR_system1\n'.format(
                    query_string.split(' ')[0],
                    match[0], # filename
                    match[1], # score
        ))
        #print('query done')
    

01 ducks

02 ig nobel prizes

03 mathematics

04 flowing hair

05 music

06 AIR TV



### Evaluation

Once you have done this, index and rank the documents for any or all of the six test queries, and run **trec_eval**, which compares the qrels file provided in *air.qrels* with your results file. (Hint: prefixing a shell command with **!** lets you execute it in a Jupyter notebook, e.g. `!ls`.)

**trec_eval** will evaluate the performance of your search engine. 

To evaluate your search results, you need two files: the retrieved results file and the ground truth file. If your retrieval results are saved as `retrieved.txt` and the ground truth file as `air.qrels`, then the performance of your retrieval can be measured via:

`./trec_eval/trec_eval air.qrels retrieved.txt`

In [15]:
# your shell command for comparing *air.qrels* against *retrieved.txt*
!./trec_eval/trec_eval ./eval-demo/qrels/air.qrels retrieved.txt

runid                 	all	my_IR_system1
num_q                 	all	4
num_ret               	all	13
num_rel               	all	12
num_rel_ret           	all	4
map                   	all	0.2792
gm_map                	all	0.0250
Rprec                 	all	0.2583
bpref                 	all	0.3833
recip_rank            	all	0.6250
iprec_at_recall_0.00  	all	0.6667
iprec_at_recall_0.10  	all	0.6667
iprec_at_recall_0.20  	all	0.6667
iprec_at_recall_0.30  	all	0.4167
iprec_at_recall_0.40  	all	0.1667
iprec_at_recall_0.50  	all	0.1667
iprec_at_recall_0.60  	all	0.1667
iprec_at_recall_0.70  	all	0.1667
iprec_at_recall_0.80  	all	0.1667
iprec_at_recall_0.90  	all	0.1667
iprec_at_recall_1.00  	all	0.1667
P_5                   	all	0.2000
P_10                  	all	0.1000
P_15                  	all	0.0667
P_20                  	all	0.0500
P_30                  	all	0.0333
P_100                 	all	0.0100
P_200                 	all	0.0050
P_500                 	all	0.00

If **trec_eval** runs correctly and produces numbers which you think are sensible, then you have completed this tutorial!

You may want to look at the output, though, and get some understanding of what it means. **In Assignment 1, you will be asked to interpret this and to choose evaluation measures you prefer.**

Running `./trec_eval/trec_eval -h` will list all the options available.
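
To help interpret a few of the measures in the output, here is a minimal per-query sketch of precision at k, average precision, and reciprocal rank. In the output above, `P_5`, `map`, and `recip_rank` are the averages of these per-query values over the evaluated queries. The ranking and relevance judgements below are hypothetical, and the functions are simplified illustrations rather than the trec_eval implementation.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions holding a relevant document."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking and judgements for a single query.
ranked = ["email03", "email07", "email01", "email09"]
relevant = {"email07", "email09", "email11"}

p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
print(p5, ap, rr)
```

In this example one relevant document (`email11`) is never retrieved, which lowers the average precision because the sum is still divided by all three relevant documents.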

**TODO**: 
- Try changing the Elasticsearch configuration (`new_settings`), 
- Modify `search` to use a different DSL query structure, and 
- Compare the results with the above example via `trec_eval`.
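
One possible starting point for the second item is to build the search body in a small helper, so that query variants are easy to swap. The sketch below only uses the `match` query and the `operator` option already shown in Tutorial 1; the helper name `build_match_query` is hypothetical.

```python
def build_match_query(query_string, operator="or"):
    """Build a search body for the `text` field; operator is "or" or "and"."""
    return {
        "_source": ["filename"],
        "query": {
            "match": {
                "text": {
                    "query": query_string,
                    "operator": operator,
                }
            }
        },
    }

body = build_match_query("ig nobel prizes", operator="and")
print(body["query"]["match"]["text"])
```

The resulting dictionary can be passed as the `body` argument of `es_conn.search`, in place of the dictionary written inline in `search`.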

### Changing analyzers
We used dictionary-structured settings when we created the indexes. 

However, there are multiple ways to change the settings. Please refer to the official Elasticsearch documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) for more details. Try different filters and tokenizers by changing the settings.
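
One way to experiment is to generate the settings dictionary from a function, so that each variant differs only in its arguments. The sketch below is an assumption about how you might organise this: `make_settings` is a hypothetical helper that reproduces the shape of `new_settings`, and the `standard` tokenizer and `lowercase` filter are used here only as examples, so check the analysis documentation for your Elasticsearch version before relying on them.

```python
def make_settings(tokenizer="whitespace", filters=("stop",), similarity="boolean"):
    """Return index settings shaped like `new_settings`, with the given analyzer parts."""
    return {
        "mappings": {"doc": {"properties": {
            "filename": {"type": "keyword", "index": False},
            "path": {"type": "keyword", "index": False},
            "text": {"type": "text", "similarity": similarity,
                     "analyzer": "my_analyzer", "search_analyzer": "my_analyzer"},
        }}},
        "settings": {"analysis": {"analyzer": {"my_analyzer": {
            "type": "custom",
            "tokenizer": tokenizer,
            "filter": list(filters),
            "char_filter": ["html_strip"],
        }}}},
    }

# Example variant: a different tokenizer plus lowercasing before stop word removal.
variant = make_settings(tokenizer="standard", filters=("lowercase", "stop"))
print(variant["settings"]["analysis"]["analyzer"]["my_analyzer"])
```

A variant built this way can be passed to `build_index` in place of `new_settings`, ideally with a new index name so that the results of different configurations can be compared.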

Additionally, you can test your analyzer with arbitrary text using the `analyze_query` function below:

In [16]:
def analyze_query(text, es_conn, index_name):
    '''
        analyzes any text with my_analyzer defined in the settings of the given index
        input:
            - text: a query text
            - es_conn: elasticsearch connection
            - index_name: name of index
        output:
            - a list of tokens
    '''

    tokens = es_conn.indices.analyze(
        index = index_name,
        body = {"text": text, "analyzer": "my_analyzer"})['tokens']

    return [token_row["token"].encode('utf-8') for token_row in tokens]

print(analyze_query("Let's see. how the analyzer analyse. this sentence.!", es_conn, INDEX_NAME))

[b"Let's", b'see.', b'how', b'analyzer', b'analyse.', b'sentence.!']


## You are ready to solve the first assignment! Open Assignment1.ipynb for further instructions.

### Have fun!