# Assignment 1: Information Retrieval (10 marks)

This assignment is based on the Assignment 1 Demo tutorials.

In this assignment, your task is to index a new document collection into *Elasticsearch*, and then measure search performance based on predefined queries.

A new document collection containing descriptions of more than 10,000 government sites, together with a set of predefined queries, is provided for this assignment.

Throughout this assignment:
1. You will develop a better understanding of indexing, including the tokeniser, parser, and normaliser components, and how to improve the search performance given a predefined evaluation metric, 
2. You will develop a better understanding of search algorithms, and how to obtain better search results, and 
3. You will find the best way to combine an indexer and search algorithm to maximise your performance.

Below, you will solve five programming questions and four written questions.

We will check the correctness of your code and the overall performance score.

- Write your code after `### Your code here`, and remove `raise NotImplementedError` after implementation.
- Write your answers in the designated cells of this notebook, and upload the file to the Wattle submission site. **Please rename the Jupyter notebook file (`Assignment1.ipynb`) to `your_uid.ipynb` (e.g. `u1234567.ipynb`) and submit it with your written answers therein - FAILURE TO DO SO WILL RESULT IN -0.5 POINTS**.

*Hint*: After you finish coding your notebook, select Kernel -> Restart & Run All from the Jupyter Notebook menu. Once every block has finished executing, inspect the output, save the notebook, and shut down the kernel. Only then can you safely manipulate the .ipynb file, which contains code, explanations, and output.

## Coding component (Q1 - Q5), 4 marks

### Q1: Index Gov dataset (0.5 marks)

For this assignment we will be working with a corpus of government documents, located in the gov folder. 

The gov folder contains three sub-folders: documents, qrels and topics. The documents folder consists of sub-folders, each of which contains multiple documents. The topics and qrels folders contain the search queries and the corresponding ground-truth relevant documents, respectively.
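The nested layout described above can be traversed with `os.walk`. The following is only a minimal sketch: the helper name `iter_document_paths` and the folder and file names in the demo are hypothetical, and the demo builds a throwaway directory rather than reading the real collection.

```python
import os
import tempfile

def iter_document_paths(root):
    """Yield (filename, full_path) for every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield name, os.path.join(dirpath, name)

# Demo on a temporary tree that mimics documents/<sub-folder>/<file>.
# The names "sub1", "doc_a", ... are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for sub, name in [("sub1", "doc_a"), ("sub2", "doc_b")]:
        os.makedirs(os.path.join(tmp, sub), exist_ok=True)
        with open(os.path.join(tmp, sub, name), "w") as fh:
            fh.write("<html>example</html>")
    found = sorted(name for name, _path in iter_document_paths(tmp))

print(found)
```

Yielding both the bare filename and the full path matches the `filename` and `path` fields used in the index mapping.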

Your first job is to index the documents as we have done in the tutorial exercises (Assignment 1 Demo).

Note that, depending on your machine, indexing may take anywhere from several minutes to a few hours. You may implement a multi-threaded version of indexing to mitigate this problem.

The basic configuration for indexing is provided below:

In [None]:
# basic configuration for indexing
basic_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }   
}

You need to implement the function `build_gov_index` below. Don't forget to remove `raise NotImplementedError` after implementation.

In [None]:
from elasticsearch import Elasticsearch

ES_HOSTS = ['http://localhost:9200']
DOCS_PATH = 'gov/documents'
INDEX_NAME = 'gov'
DOC_TYPE = 'doc'

def build_gov_index(es_conn, index_name, doc_path, settings):
    # TODO implement function that:
    #  1. Create an index with `index_name`. If `index_name` already exists, remove the index first.
    #  2. Index the documents under doc_path, including subfolders, into elasticsearch (hint: read demo carefully)
    # Note that this function will be used throughout this assignment    
    # YOUR CODE HERE
    raise NotImplementedError()

es_conn = Elasticsearch(ES_HOSTS)
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, basic_settings)

### Q2: Search and performance measure (0.5 marks)

For this second task, you will first need to read the topics/gov.topics file. 

As in the demo tutorial, each line of the file is formatted as `query_id query_terms`, where
query_id is a number and query_terms consists of multiple keywords used as search terms.
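As a rough illustration of the `query_id query_terms` layout, the sketch below splits one line on the first run of whitespace. The helper name `parse_topic_line` and the query ID `1` in the demo are assumptions made for illustration; check the actual file for the real IDs and separator.

```python
def parse_topic_line(line):
    """Split 'query_id query_terms' into (query_id, query_terms)."""
    parts = line.strip().split(None, 1)  # split on the first whitespace run only
    query_id = parts[0]
    query_terms = parts[1] if len(parts) > 1 else ""
    return query_id, query_terms

# The ID "1" is hypothetical; the terms are the first topic quoted in Q5.
parsed = parse_topic_line("1 mining gold silver coal\n")
print(parsed)
```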

Your job is to read the query file and search using the provided search function. You will need to write the output of the search results to an output.txt file in the trec-eval standard format used in the demo tutorial. 

As a reminder, this means that each result of each query should be written on its own line in output.txt, like this:

`01 Q0 email09 0 1.23 my_IR_system1`

`01 Q0 email09 1 1.11 my_IR_system1`

`02 Q0 email07 0 1.08 my_IR_system1`

where '01' is the query ID; ignore 'Q0'; 'emailxx' is the name of the file; '0' (or '1' or some other integer number) is the rank of this result; '1.23' (or '1.08' or some other number) is the score of this result; and 'my_IR_system1' is the name for your retrieval system. 

Note that you are allowed to write at most 10 documents for each query. If your output file contains more than 10 documents for any query, you will receive a score of 0 for this question.
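The line format and the 10-document limit can be sketched as follows. This is not the required implementation: the helper name `format_trec_lines` and the `(filename, score)` input shape are assumptions, and the hits are assumed to be sorted by descending score already.

```python
def format_trec_lines(query_id, hits, run_name="my_IR_system1", limit=10):
    """Format up to `limit` (filename, score) pairs as trec-eval result lines."""
    lines = []
    # Ranks start at 0, following the example lines above.
    for rank, (filename, score) in enumerate(hits[:limit]):
        lines.append(f"{query_id} Q0 {filename} {rank} {score} {run_name}")
    return lines

# Hypothetical hits for illustration only.
demo_lines = format_trec_lines("01", [("email09", 1.23), ("email07", 1.11)])
print("\n".join(demo_lines))

# Twelve hits are truncated to ten lines.
many_hits = [(f"email{i:02d}", 2.0 - i * 0.1) for i in range(12)]
truncated = format_trec_lines("02", many_hits)
print(len(truncated))
```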

**Please rename your output.txt file to `YourUID_output_q2.txt`, e.g. `u1234567_output_q2.txt`, before submitting to Wattle - FAILURE TO DO SO WILL RESULT IN -0.5 POINTS**.

Below is some code to get you started and for you to complete:

In [None]:
def search(query_string, es_conn, index_name):
    '''
        Searches for query_string with the default search algorithm.
        input:
            - query_string: a query
            - es_conn: elasticsearch connection
            - index_name: name of index
        output:
            - a list of hit dicts; each hit holds the filename under
              hit['_source']['filename'] and the score under hit['_score']
    '''
    res = es_conn.search(index = index_name,
        body = {
            "_source": [ "filename"],
            "query": {
                "query_string": {
                    "query": query_string,
                }
            }
        }
    )
    return res['hits']['hits']

# TODO: 
#       Read query file from `query_path`, search using `search_fn`, and 
#       Write top 10 outputs per query to `output_file`
#       Note that the function takes a search function as an argument. You can directly call the search function
#       as `result = search_fn(query_string, es_conn, index_name)` within the function.
#       This function will be used throughout this assignment
def read_search_write_output(search_fn, query_path, output_file):
    with open(output_file, 'w') as output:
        # YOUR CODE HERE
        raise NotImplementedError()

query_path = 'gov/topics/gov.topics'
output_file = 'output.txt'
read_search_write_output(search, query_path, output_file)

Once you have written the results of your queries to an output file, you can run trec-eval on your output file and the provided gov.qrels file to evaluate your system. Trec-eval provides many different measures of quality, but for the purposes of this assignment you will use precision@10 (`P_10` in the trec-eval output) to measure the performance of your systems.
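To make the metric concrete, here is a small sketch of precision@10 for a single query. The function name `precision_at_k` and the document names are hypothetical; trec-eval computes the value per query and then averages over all queries.

```python
def precision_at_k(ranked_filenames, relevant, k=10):
    """Fraction of the top-k results that are relevant.

    The denominator is always k, so returning fewer than k results
    cannot raise the score.
    """
    top_k = ranked_filenames[:k]
    hits = sum(1 for name in top_k if name in relevant)
    return hits / k

# Hypothetical ranking and relevance judgements for one query.
ranked = [f"d{i}" for i in range(1, 11)]   # d1 ... d10
relevant = {"d1", "d4", "d7", "d99"}       # d99 was never retrieved
p10 = precision_at_k(ranked, relevant)
print(p10)
```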

In [None]:
!./trec_eval/trec_eval ./gov/qrels/gov.qrels output.txt

### Q3: Improving the search algorithm: compare similarity algorithms (1 mark)

*Elasticsearch* also provides multiple configurable scoring algorithms. 

For this task, you are asked to find a better similarity module to improve search performance. Please refer to the documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html) for a better understanding of the configurable Elasticsearch similarity modules.
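To see why the choice of similarity matters, the toy comparison below contrasts a boolean-style score, where a matching term contributes only its query boost, with the textbook BM25 formula. It is a sketch under stated assumptions: the function names and all statistics are made up, and the exact formula and constants Elasticsearch applies may differ between versions.

```python
import math

def boolean_term_score(tf, boost=1.0):
    """Boolean-style score: any match contributes the boost, regardless of tf."""
    return boost if tf > 0 else 0.0

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Hypothetical collection statistics, for illustration only.
stats = dict(df=50, n_docs=10000, doc_len=300, avg_doc_len=300)
for tf in (1, 2, 5):
    print(tf, boolean_term_score(tf), round(bm25_term_score(tf, **stats), 3))
```

Under the boolean-style score every matching document ties, whereas BM25 rewards repeated and rarer terms, which gives the ranker something to sort on.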

Here's some code to get you started and for you to complete:

In [None]:
# TODO: define the `text` field, choosing a similarity module for indexing and searching
q3_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
            # YOUR CODE HERE
            raise NotImplementedError()
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

In [None]:
# TODO: run this block to generate an output based on q3_settings defined above.
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, q3_settings)
read_search_write_output(search, query_path, output_file)

In [None]:
!./trec_eval/trec_eval ./gov/qrels/gov.qrels output.txt

Upload the final output to Wattle, but **please first rename output_file to `YourUID_output_q3.txt`, e.g. `u1234567_output_q3.txt` - FAILURE TO DO SO WILL RESULT IN -0.5 POINTS**.

### Q4: Improving the indexer: compare different ways of indexing (1 mark)

For this part, you are asked to change the configuration of the indexer (`basic_settings`) to improve search performance.

Please look at the official Elasticsearch documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) for a better understanding of the configuration and other options.

Note that you can check how your tokeniser tokenises your input string via the `analyze_query` function provided in the demo code.
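The effect of analyzer choices can also be approximated offline. The pure-Python sketch below mimics a whitespace tokeniser followed by a case-sensitive stop-word filter, with and without a lowercasing step. The function names and the tiny stop-word list are illustrative assumptions and are not Elasticsearch's own; use `analyze_query` to see what your real analyzer produces.

```python
# Tiny illustrative stop-word list, not the one Elasticsearch ships with.
STOPWORDS = {"a", "an", "and", "the", "of", "in"}

def whitespace_stop(text):
    """Whitespace tokens, then a case-sensitive stop-word filter."""
    return [tok for tok in text.split() if tok not in STOPWORDS]

def lowercase_stop(text):
    """Same pipeline, but tokens are lowercased before the stop filter."""
    return [tok for tok in (t.lower() for t in text.split()) if tok not in STOPWORDS]

sample = "The Mining of Gold and Coal"
print(whitespace_stop(sample))
print(lowercase_stop(sample))
```

Without lowercasing, `The` survives the stop filter and `Mining` would not match a query for `mining`, which is one way an analyzer can limit search performance.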

In [None]:
# TODO: configure settings to define your own analyzer for indexing
q4_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
            # YOUR CODE HERE
            raise NotImplementedError()
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
          "my_analyzer": {
            # YOUR CODE HERE
            raise NotImplementedError()
        }
      }
    }
  }
}

In [None]:
# TODO: run this block to generate an output based on q4_settings defined above.
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, q4_settings)
read_search_write_output(search, query_path, output_file)

In [None]:
!./trec_eval/trec_eval ./gov/qrels/gov.qrels output.txt

Upload the final output to Wattle, but **please first rename output_file to `YourUID_output_q4.txt`, e.g. `u1234567_output_q4.txt` - FAILURE TO DO SO WILL RESULT IN -0.5 POINTS**.

### Q5: Tolerant retrieval: wildcard queries (1 mark)

*Elasticsearch* provides wildcard query search. You can search with wildcard expressions consisting of `*` and `?`.

For this task, you can reuse the previous index, i.e., *q4_settings*. Refer to the [link](https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl-wildcard-query.html) to see how to search with wildcard queries. 

For each query term from `gov.topics`, replace the last two characters with a wildcard expression of your choice. For example, the first topic in `gov.topics` is `mining gold silver coal`; instead of searching for it directly, you search for `mini* go?? silv* co??`.
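The rewriting step in the example can be sketched as below. The helper name `to_wildcard` is hypothetical, and leaving terms of two characters or fewer unchanged is an assumption of this sketch, made so that a term never collapses to a bare `*`. Python's `fnmatchcase` is used only as a stand-in to illustrate that `*` matches any run of characters and `?` exactly one; it is not how Elasticsearch evaluates wildcards.

```python
from fnmatch import fnmatchcase

def to_wildcard(term, wildcard="*"):
    """Replace the last two characters of term with '*' or '??'."""
    if wildcard not in ("*", "??"):
        raise ValueError("wildcard must be '*' or '??'")
    if len(term) <= 2:
        return term  # assumption: very short terms are left untouched
    return term[:-2] + wildcard

terms = "mining gold silver coal".split()
# Alternate between '*' and '??' to reproduce the example above.
rewritten = " ".join(
    to_wildcard(t, "*" if i % 2 == 0 else "??") for i, t in enumerate(terms)
)
print(rewritten)

# '*' matches any number of characters, '?' exactly one.
print(fnmatchcase("mining", "mini*"), fnmatchcase("golden", "go??"))
```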

In [None]:
def my_search(query_string, es_conn, index_name):
    res = es_conn.search(index = index_name,
        body = {
            "_source": [ "filename"],
            "query": {
                "query_string": {
                    "query": query_string,
                }
            }
        }
    )
    return res['hits']['hits']

In [None]:
# TODO: run this block to generate the output
read_search_write_output(my_search, query_path, output_file)

In [None]:
!./trec_eval/trec_eval ./gov/qrels/gov.qrels output.txt

## Written component (Q6 - Q9), 6 marks

Answer the following questions based on your implementation of Questions 1-5:

### Q6 (1.5 marks): What changes did you make to the search similarity to improve the performance of the system? Why do you think it improved the performance?

(provide answers below using bullet points with 2~3 items)

INSERT ANSWERS TO Q6 HERE

### Q7 (1.5 marks): What changes did you make to the indexer to improve the performance of the system? Why do you think it improved the performance?

(provide answers below using bullet points with 2~3 items; check [this](https://sourceforge.net/p/jupiter/wiki/markdown_syntax/#md_ex_lists) if you are not familiar with markdown syntax)

INSERT ANSWERS TO Q7 HERE

### Q8 (1.5 marks): Apart from Precision@10, what other metrics can be used to measure the performance of the developed IR system for the government document collection? Provide two metrics and explain why they would be suited for this particular government IR system.

(provide answers below using bullet points with 2~3 items)

INSERT ANSWERS TO Q8 HERE

### Q9 (1.5 marks): How do wildcard queries affect retrieval performance in terms of the measures you gave in your answer to Q8? Also describe some situations in which wildcard queries are useful.

INSERT ANSWERS TO Q9 HERE

**Academic Misconduct Policy**: All submitted written work and code must be your own (except for any provided starter code, of course) – submitting work other than your own will lead to both a failure on the assignment and a referral of the case to the ANU academic misconduct review procedures: ANU Academic Misconduct Procedures