# Information Retrieval Practice

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It lets you quickly build applications with full-text search capabilities. The core implementation is in Java, but it exposes a REST interface, so you can interact with Elasticsearch from any programming language.


**Note: I recommend running this practice locally on your computer rather than in Google Colab. You will need to download, install, and run Elasticsearch, which is rather tricky to do in Colab.**

## Install Elasticsearch

To install Elasticsearch, download the package for your platform from the Get Elasticsearch section at https://www.elastic.co/es/start


![](https://github.com/acastellanos-ie/MBD-EN-BL-ENE-2020-J-1/blob/master/ir_practice/download.png?raw=1)

Once downloaded, unpack the tar.gz file and run `bin/elasticsearch` (or `bin\elasticsearch.bat` on Windows). This launches the Elasticsearch server, which by default is accessible at [localhost:9200](http://localhost:9200).
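Before moving on, it can be useful to confirm that the server answers. The sketch below uses only the Python standard library to request the root endpoint and read the version number from the JSON it returns. The function names `parse_version` and `server_version` are hypothetical helpers, and the default URL assumes the server listens on plain HTTP at port 9200. Recent releases may enable TLS and authentication by default, in which case a plain HTTP request like this one fails and the function returns `None`.

```python
import json
import urllib.request


def parse_version(payload):
    """Extract the version string from the JSON returned by the root endpoint."""
    info = json.loads(payload)
    return info["version"]["number"]


def server_version(url="http://localhost:9200", timeout=5):
    """Return the server version, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_version(response.read())
    except (OSError, ValueError, KeyError):
        # URLError and connection resets are OSError subclasses;
        # ValueError and KeyError cover an unexpected response body.
        return None
```

Calling `server_version()` once the server has started should return a version string; `None` suggests the server is not running or is not reachable at that URL.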

## Querying Elasticsearch via Python

You can query Elasticsearch directly through its REST endpoint. However, the `elasticsearch-py` Python library makes this easier: it wraps the REST endpoint so that we can query the server from Python.

In [None]:
from elasticsearch import Elasticsearch

# Exercise 0: Indexing and Searching Demo for ElasticSearch

Now it's time to run a demo program. In this practice, we will create an inverted index of sample documents (indexing) and then use the Elasticsearch query grammar to search the documents (searching).
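To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is not how Elasticsearch or Lucene store their index; it only illustrates the mapping from each token to the documents that contain it. The function `build_inverted_index` and the toy documents are made up for this illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical toy collection, unrelated to the practice dataset.
toy_docs = {
    "a.txt": "Search engines index documents",
    "b.txt": "Documents are split into tokens",
    "c.txt": "Tokens point back to documents",
}

toy_index = build_inverted_index(toy_docs)
print(sorted(toy_index["documents"]))  # ['a.txt', 'b.txt', 'c.txt']
print(sorted(toy_index["tokens"]))     # ['b.txt', 'c.txt']
```

Answering a query then becomes a lookup of each query term followed by set operations on the resulting document sets.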

### Useful functions

The following functions read the dataset from disk.

In [None]:
import os, io
from collections import namedtuple

# A document record with the following attributes:
# filename: document filename
# path: path of the document
# text: body of the document
Doc = namedtuple('Doc', 'filename path text')

def read_doc(doc_path, encoding):
    '''
        reads a document from path
        input:
            - doc_path : path of document
            - encoding: encoding
        output: =>
            - doc: instance of Doc namedtuple
    '''
    filename = os.path.basename(doc_path)
    # the context manager closes the file even if reading fails
    with io.open(doc_path, 'r', encoding = encoding) as fp:
        text = fp.read().strip()
    return Doc(filename = filename, text = text, path = doc_path)

def read_dataset(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : root directory containing the documents
            - encoding: encoding
        output: =>
            - docs: instances of Doc namedtuple returned as generator
    '''
    for root, dirs, files in os.walk(path):
        for doc_path in files:
            yield read_doc(os.path.join(root, doc_path), encoding)

## Indexing

We will index the sample documents in `practice_data/sample_documents`. To index the documents, we first need to make a connection to **Elasticsearch**.

Before we index the documents, we need to define the **configuration of Elasticsearch**. In this step you define the basic components of the indexing pipeline, such as the tokenizer, stemmer, or lemmatizer, as well as the similarity (scoring) algorithm that Elasticsearch will use at search time.

The code below shows a simple configuration for this demo.
The configuration tells Elasticsearch that each document will have three fields, `filename`, `path`, and `text`, and that we will search on the `text` field. `my_analyzer` parses the `text` field at indexing time and is also used as the search analyzer, which parses search queries later on. It strips HTML markup (leaving `<b>` tags in place), splits the text on whitespace, lowercases the tokens, and removes stop words. Setting `"index": False` on the `filename` and `path` fields tells Elasticsearch not to index them, so we cannot search these two fields with queries.

Detailed documentation on analyzers can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).
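To get a feeling for what an analyzer made of an `html_strip` character filter, a `whitespace` tokenizer, and `lowercase` and `stop` filters does to a piece of text, here is a rough pure-Python approximation. It is only a sketch: the stop list is a small illustrative set rather than the list Elasticsearch uses, every removed tag is replaced by a space, and HTML entities are not decoded, so the real analyzer can behave differently on some inputs. The names `strip_html`, `analyze`, and `STOP_WORDS` are hypothetical.

```python
import re

# Assumption: a small illustrative stop list, not Elasticsearch's own list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def strip_html(text, escaped_tags=("b",)):
    """Replace HTML tags with a space, keeping the tags named in escaped_tags."""
    def replace(match):
        tag_name = match.group(1).lower()
        return match.group(0) if tag_name in escaped_tags else " "
    return re.sub(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>", replace, text)


def analyze(text):
    """Apply the char filter, the tokenizer, and the token filters in order."""
    text = strip_html(text)                             # char_filter: html_strip
    tokens = text.split()                               # tokenizer: whitespace
    tokens = [token.lower() for token in tokens]        # filter: lowercase
    return [t for t in tokens if t not in STOP_WORDS]   # filter: stop


print(analyze("<p>The President, Obama, met <b>Hillary</b> in Washington.</p>"))
# ['president,', 'obama,', 'met', '<b>hillary</b>', 'washington.']
```

Note that splitting on whitespace keeps punctuation attached to the tokens, so in this sketch the token `obama,` is different from `obama`. This is worth keeping in mind when a query unexpectedly misses a document.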

`"similarity": "boolean"` in the `text` field tells Elasticsearch to score the `text` field with a boolean similarity model. Detailed documentation on the search algorithms can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html) and [here](https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html).
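With the boolean similarity, each query term that matches a document contributes its query boost (1.0 unless changed) to the score, regardless of how often the term occurs. The sketch below mimics that behaviour in plain Python. The function `boolean_score` is a hypothetical helper for illustration, not part of any Elasticsearch API.

```python
def boolean_score(query_terms, doc_tokens, boost=1.0):
    """Sum the boost of every distinct query term present in the document."""
    doc_vocabulary = set(doc_tokens)
    return sum(boost for term in set(query_terms) if term in doc_vocabulary)


doc_tokens = ["obama", "met", "hillary", "obama"]
print(boolean_score(["obama", "hillary"], doc_tokens))  # 2.0
print(boolean_score(["obama"], doc_tokens))             # 1.0
```

Repeating a term in the document does not raise the score, which is why the results in this practice only take whole-number scores.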


In [None]:
# configuration for indexing
settings = {
  "mappings": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase","stop"
          ],
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Next, we read the sample documents and index them into the `INDEX_NAME` index. The following two functions create the index and index the documents.


In [None]:
ES_HOSTS = ['http://localhost:9200']
INDEX_NAME = 'sample_index'
DOCS_PATH = 'practice_data/sample_documents'

def create_index(es_conn, index_name, settings):
    '''
        create index structure in elasticsearch server. 
        If index_name exists in the server, it will be removed, and new index will be created.
        input:
            - es_conn: elasticsearch connection object
            - index_name: name of index to create
            - settings: settings and mappings for index to create
        output: =>
            - None
    '''
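    # pass the index name as a keyword argument; newer client versions reject positional arguments
    if es_conn.indices.exists(index = index_name):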
        es_conn.indices.delete(index = index_name)
        print('index `{}` deleted'.format(index_name))
    es_conn.indices.create(index = index_name, body = settings)
    print('index `{}` created'.format(index_name))            
            
def build_index(es_conn, dataset, index_name, settings, DOC_TYPE='doc'):
    '''
        build index from a collection of documents
        input:
            - es_conn: elasticsearch connection object
            - dataset: iterable, collection of namedtuple Doc objects
            - index_name: name of the index where the documents will be stored
            - settings: settings and mappings for the index to create
            - DOC_TYPE: type signature of documents (currently unused)
    '''
    # (re)create the index; an existing index with the same name is deleted first
    create_index(es_conn = es_conn, index_name = index_name, settings=settings)
    counter_read, counter_idx_failed = 0, 0 # counters

    # retrieve & index documents
    for doc in dataset:
        res = es_conn.index(
            index = index_name,
            id = doc.filename,
            body = doc._asdict())
        counter_read += 1

        if res['result'] != 'created':
            counter_idx_failed += 1
        elif counter_read % 500 == 0:
            print('indexed {} documents'.format(counter_read))

    print('indexed {} docs to index `{}`, failed to index {} docs'.format(
        counter_read,
        index_name,
        counter_idx_failed
    ))
    
    # refresh after indexing
    es_conn.indices.refresh(index=index_name)  

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(DOCS_PATH)
build_index(es_conn, dataset, INDEX_NAME, settings)

index `sample_index` deleted
index `sample_index` created
indexed 1 documents
indexed 2 documents
indexed 3 documents
indexed 4 documents
indexed 5 documents
indexed 5 docs to index `sample_index`, failed to index 0 docs
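Indexing one document per request is fine for a handful of files. For larger collections, the `elasticsearch-py` library also offers a `bulk` helper in `elasticsearch.helpers` that sends many documents per request. The following is a minimal sketch under stated assumptions: `make_actions` and `bulk_index` are hypothetical names, the `Doc` namedtuple is redefined only to keep the example self-contained, and the helper is imported inside the function so that the action generator can be inspected without the library or a running server.

```python
from collections import namedtuple

Doc = namedtuple("Doc", "filename path text")


def make_actions(docs, index_name):
    """Yield one bulk action per document, using the filename as document id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc.filename,
            "_source": doc._asdict(),
        }


def bulk_index(es_conn, docs, index_name):
    """Send all documents through the bulk helper and refresh the index."""
    from elasticsearch.helpers import bulk  # requires the elasticsearch package

    success_count, errors = bulk(es_conn, make_actions(docs, index_name))
    es_conn.indices.refresh(index=index_name)
    return success_count, errors
```

Assuming the index already exists, a call such as `bulk_index(es_conn, read_dataset(DOCS_PATH), INDEX_NAME)` would index the same documents as the loop above.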


We successfully created an inverted index for the sample documents in `practice_data/sample_documents`. It's time to search the documents with some queries.

## Searching

**Elasticsearch** provides its own query grammar, the Query DSL, which is modelled on the grammar of traditional search engines (Google Search supports a similar one):
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

To understand how the score of each result is computed, see: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html#explain

### Useful Functions

These functions unpack the Elasticsearch response format so that the search results are easy to display.

In [None]:
def extract_response(res):
    '''
        yields (filename, score) pairs from an Elasticsearch search response
    '''
    if res is not None:
        for hit in res['hits']['hits']:
            filename = hit["_source"]["filename"]
            score = hit["_score"]

            yield (filename, score)

def print_result(query, res):
    '''
        prints one "query, filename, score," line per hit, highest score first
    '''
    matches = extract_response(res)
    for match in sorted(matches, key = lambda x: -x[1]):
        print('{}, {}, {},\n'.format(
            query,
            match[0], # filename
            match[1], # score
        ))
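The helpers above rely on a small part of the search response: a list of hits under `hits.hits`, each carrying a `_score` and the stored document under `_source`. The sketch below shows that shape with a hand-written dictionary, so it runs without a server. The values in `sample_response` are invented for illustration, and `ranked_hits` is a hypothetical stand-in for the ranking step.

```python
# Hand-written stand-in for a search response; real responses contain more keys.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc3.txt", "_score": 1.0, "_source": {"filename": "doc3.txt"}},
            {"_id": "doc1.txt", "_score": 2.0, "_source": {"filename": "doc1.txt"}},
        ]
    }
}


def ranked_hits(res):
    """Return (filename, score) pairs sorted by descending score."""
    pairs = [
        (hit["_source"]["filename"], hit["_score"])
        for hit in res["hits"]["hits"]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


print(ranked_hits(sample_response))
# [('doc1.txt', 2.0), ('doc3.txt', 1.0)]
```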

We will now run different types of queries.

First, a query with a single term:

In [None]:
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                  "match": {"text": "Obama"}
                }
              ]
            }
          }
        }
    )
print_result("Obama", res)

Obama, doc1.txt, 1.0,

Obama, doc3.txt, 1.0,

Obama, doc5.txt, 1.0,

Obama, doc2.txt, 1.0,



Next, a query for the documents containing both terms:

In [None]:
# Boolean Query "Obama AND Hillary"
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
                "operator" : "and"
              }
            }
          }
        }
    )
print_result("Obama AND Hillary", res)

Obama AND Hillary, doc1.txt, 2.0,



Then, a query for the documents containing one term but NOT the other:

In [None]:
# Boolean Query "Obama AND NOT Hillary" (printed with the label "Obama BUT Hillary")
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                    "match": {"text": "Obama"}
                }
              ],
              "must_not":[
                {
                    "match": {"text": "Hillary"}
                }
              ]
            }
          }
        }
    )
print_result("Obama BUT Hillary", res)

Obama BUT Hillary, doc3.txt, 1.0,

Obama BUT Hillary, doc5.txt, 1.0,

Obama BUT Hillary, doc2.txt, 1.0,



Finally, the default behaviour for queries with more than one term: OR.

In [None]:
# Boolean Query "Obama OR Hillary"
# default is OR
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
              }
            }
          }
        }
    )
print_result("Obama OR Hillary", res)

Obama OR Hillary, doc1.txt, 2.0,

Obama OR Hillary, doc3.txt, 1.0,

Obama OR Hillary, doc5.txt, 1.0,

Obama OR Hillary, doc4.txt, 1.0,

Obama OR Hillary, doc2.txt, 1.0,
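The query bodies used in this section follow two patterns, so they can be generated by small helper functions instead of being written out by hand each time. The sketch below is one possible way to do that. The names `match_query` and `and_not_query` are hypothetical, and the functions only build dictionaries, so they can be checked without a running server.

```python
def match_query(field, terms, operator="or"):
    """Build a match query; operator 'and' requires every term, 'or' any term."""
    return {
        "query": {
            "match": {
                field: {"query": " ".join(terms), "operator": operator}
            }
        }
    }


def and_not_query(field, required, excluded):
    """Build a bool query with must clauses and must_not clauses."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}} for term in required],
                "must_not": [{"match": {field: term}} for term in excluded],
            }
        }
    }


print(match_query("text", ["Obama", "Hillary"], operator="and"))
print(and_not_query("text", ["Obama"], ["Hillary"]))
```

The resulting dictionaries can be passed as the `body` argument of `es_conn.search`, for example `es_conn.search(index = INDEX_NAME, body = match_query("text", ["Obama", "Hillary"], "and"))`.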

