# Assignment 2: IR

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [1]:
!pip install whoosh
!pip install pytrec_eval
!pip install wget

Collecting whoosh
  Downloading https://files.pythonhosted.org/packages/ba/19/24d0f1f454a2c1eb689ca28d2f178db81e5024f42d82729a4ff6771155cf/Whoosh-2.7.4-py2.py3-none-any.whl (468kB)
Installing collected packages: whoosh
Successfully installed whoosh-2.7.4
Collecting pytrec_eval
  Downloading https://files.pythonhosted.org/packages/36/0a/5809ba805e62c98f81e19d6007132712945c78e7612c11f61bac76a25ba3/pytrec_eval-0.4.tar.gz
Building wheels for collected packages: pytrec-eval
  Building wheel for pytrec-eval (setup.py) ... done
  Created wheel for pytrec-eval: filename=pytrec_eval-0.4-cp36-cp36m-linux_x86_64.whl size=277001 sha256=91aab45616cd5f068ceab8b65f0efede9081c13b4a1fa372fcde336941c2431d
  Stored in directory: /root/.cache/pip/wheels/58/30/73/8858a1b6e5e2674e2ea85c9904949c06addcf6fd34d59b5ea6
Successfully built pytrec-eval
Installing collected packages: pytrec-eval
Successfully installed pytrec-eval-0.4
C

In [2]:
import wget
wget.download("https://github.com/MIE451-1513-2019/course-datasets/raw/master/government.zip", "government.zip")

'government.zip'

In [3]:
!unzip government.zip

Archive:  government.zip
   creating: government/
  inflating: government/topics-with-full-descriptions.txt  
  inflating: government/gov.topics   
  inflating: government/gov.qrels    
   creating: government/documents/
   creating: government/documents/61/
  inflating: government/documents/61/G00-61-2800209  
  inflating: government/documents/61/G00-61-1192048  
  inflating: government/documents/61/G00-61-1118212  
  inflating: government/documents/61/G00-61-0749882  
  inflating: government/documents/61/G00-61-2230501  
  inflating: government/documents/61/G00-61-0680698  
  inflating: government/documents/61/G00-61-0551387  
  inflating: government/documents/61/G00-61-2575433  
  inflating: government/documents/61/G00-61-0469713  
  inflating: government/documents/61/G00-61-0280746  
  inflating: government/documents/61/G00-61-2574316  
  inflating: government/documents/61/G00-61-3933997  
  inflating: government/documents/61/G00-61-3290635  
  inflating: government/documents/61/G0

In [4]:
# imports
# Put all your imports here

from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os.path
from pathlib import Path
import tempfile
import subprocess
import pytrec_eval
import wget
import numpy as np
import nltk
from nltk.stem import *
# download required resources
nltk.download("wordnet")
from whoosh import scoring, qparser
from whoosh.analysis import Filter

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [0]:
DATA_DIR = "government"
#
# Put other path constants here
#
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")



## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]

MAP and gMAP

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]

Human evaluation of performance correlates well with Average Precision.

https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_493
https://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf

The traditional Mean Average Precision (MAP) metric, which uses the arithmetic mean of average precision scores, gives equal weight to absolute changes in per-topic scores, regardless of the relative size of the change. The GMAP measure, which uses the geometric mean instead, is designed for situations where you want to highlight improvements for low-performing topics.

If the evaluator is interested in a measure of consistency and collective performance across all topics, GMAP is a good choice for the evaluation metric.
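
To make the difference between the two means concrete, here is a minimal sketch. The per-topic scores are hypothetical, and the `1e-5` floor for zero scores is an assumption (a small floor is a common way to keep a single zero from collapsing the geometric mean). The same absolute gain of 0.05 is applied once to the weakest topic and once to the strongest one.

```python
import math

EPS = 1e-5  # assumed floor so that a zero score does not collapse the product


def mean_ap(scores):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(scores) / len(scores)


def geometric_mean_ap(scores, eps=EPS):
    """Geometric mean of per-topic average precision (GMAP)."""
    logs = [math.log(max(s, eps)) for s in scores]
    return math.exp(sum(logs) / len(logs))


# Hypothetical per-topic AP scores for three topics.
baseline = [0.01, 0.40, 0.80]
# The same absolute gain (+0.05) on the weakest vs. the strongest topic.
helps_weak = [0.06, 0.40, 0.80]
helps_strong = [0.01, 0.40, 0.85]

for name, run in [("baseline", baseline),
                  ("helps weak topic", helps_weak),
                  ("helps strong topic", helps_strong)]:
    print(f"{name:20s} MAP={mean_ap(run):.4f}  GMAP={geometric_mean_ap(run):.4f}")
```

Both modified runs have the same MAP, but GMAP rises far more when the gain lands on the low-performing topic.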

## Question 2

### Q2 (a): Write your code below

**Creating the index**

In [0]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2

def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))


# now, create the index at the path INDEX_DIR based on the new schema
myIndex = createIndex(mySchema)

**Indexing the documents**

In [0]:

def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (docNum + 1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the index
        writer.close()
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

In [8]:
addFilesToIndex(myIndex, filesToIndex)

done indexing.


**Querying**

In [0]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

**Evaluation using TREC_EVAL**

In [0]:
def pyTrecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load the topic file: a list of topics (search phrases) used for evaluation
    # topic file -> the queries to run, qrels file -> the true relevance judgments
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                #print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)


    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, pytrec_eval.supported_measures)
      
    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
        print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    # print the per-topic MAP results
    for query_id, query_measures in results.items():
        for measure, value in query_measures.items():
            # only report MAP; skip every other measure
            if measure != 'map':
                continue
            print_line(measure, query_id, value)
    for measure in query_measures.keys():
        # only report MAP; skip every other measure
        if measure != 'map':
            continue
        print_line(
            measure,
            'all',
            pytrec_eval.compute_aggregated_measure(
                measure,
                [query_measures[measure]
                 for query_measures in results.values()]))

In [11]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, myQueryParser, mySearcher) 

map                      1       0.0000
map                      2       0.0000
map                      4       0.0312
map                      6       0.0000
map                      7       0.0000
map                      9       0.0000
map                      10      0.1667
map                      14      0.2500
map                      16      0.0000
map                      18      1.0000
map                      22      0.2000
map                      24      1.0000
map                      26      0.1111
map                      28      0.0000
map                      all     0.1971


In [0]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = myQueryParser # Replace None with your query parser for Q2
SEARCHER_Q2 = mySearcher # Replace None with your searcher for Q2

In [13]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2) 

map                      1       0.0000
map                      2       0.0000
map                      4       0.0312
map                      6       0.0000
map                      7       0.0000
map                      9       0.0000
map                      10      0.1667
map                      14      0.2500
map                      16      0.0000
map                      18      1.0000
map                      22      0.2000
map                      24      1.0000
map                      26      0.1111
map                      28      0.0000
map                      all     0.1971


### Q2 (b): Provide answer to Q2 (b) here [markdown cell]

The baseline Whoosh system performance is:

for all: 

map : **0.1971**

gm_map : **0.0015**

### Q2 (c): Provide answer to Q2(c) here [markdown cell]

Topics that did **very well** (map = 1):

topic: **18, 24**

Topics that did **very badly** (map = 0):

topic: **1, 2, 6, 7, 9, 16, 28**

## Question 3

### Q3 (a): Provide answer to Q3 (a) here [markdown cell]

In [0]:
def printRelName(topicFile, qrelsFile, queryParser, searcher, id):
  with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()
  for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        if topic_id == id:
          print("---------------------------Topic_id and Topic_phrase----------------------------------")
          print(topic_id, topic_phrase)
          topicQuery = queryParser.parse(topic_phrase)
          topicResults = searcher.search(topicQuery, limit=None)
          print("---------------------------Return documents----------------------------------")
          for (docnum, result) in enumerate(topicResults):
              score = topicResults.score(docnum)
              print("%s Q0 %s %d %lf test" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
          print("---------------------------Relevant documents----------------------------------")
          with open(qrelsFile, 'r') as f_qrel:
            qrels = f_qrel.readlines()
            for i in qrels:
              qid, _, doc, rel = i.rstrip().split(" ")
              if qid == id and rel == "1":
                print(i.rstrip())

In [15]:
printRelName(TOPIC_FILE, QRELS_FILE, myQueryParser, mySearcher, "16")

---------------------------Topic_id and Topic_phrase----------------------------------
16 Emergency and disaster preparedness assistance
---------------------------Return documents----------------------------------
16 Q0 G00-34-3591274 0 34.092076 test
16 Q0 G00-05-0719078 1 32.195486 test
16 Q0 G00-92-2053892 2 27.131764 test
16 Q0 G00-70-2681284 3 26.574622 test
16 Q0 G00-33-2857182 4 21.813916 test
16 Q0 G00-51-3264753 5 10.948533 test
16 Q0 G00-32-1907807 6 10.008862 test
---------------------------Relevant documents----------------------------------
16 0 G00-03-0589290 1
16 0 G00-21-0494028 1
16 0 G00-21-2114990 1
16 0 G00-32-0551737 1
16 0 G00-86-3719816 1
16 0 G00-92-2974327 1
16 0 G00-99-0140748 1


The query I chose is number 16: **Emergency and disaster preparedness assistance**

---------------------------**Return documents**----------------------------------

16 Q0 G00-34-3591274 0 34.092076 test<br>
16 Q0 G00-05-0719078 1 32.195486 test<br>
16 Q0 G00-92-2053892 2 27.131764 test<br>
16 Q0 G00-70-2681284 3 26.574622 test<br>
16 Q0 G00-33-2857182 4 21.813916 test<br>
16 Q0 G00-51-3264753 5 10.948533 test<br>
16 Q0 G00-32-1907807 6 10.008862 test

---------------------------**Relevant documents**----------------------------------

16 0 G00-03-0589290 1<br>
16 0 G00-21-0494028 1<br>
16 0 G00-21-2114990 1<br>
16 0 G00-32-0551737 1<br>
16 0 G00-86-3719816 1<br>
16 0 G00-92-2974327 1<br>
16 0 G00-99-0140748 1


**Results from TREC_EVAL:**

TREC_EVAL | Topic # | Score
 ---|---|---
num_ret                 |16 |      7.0000
num_rel                 |16 |      7.0000
num_rel_ret             |16 |     0.0000



---
As the table above shows, we retrieved 7 documents, but none of them is relevant, and there are 7 relevant documents in total. So:<br>
False Positives (irrelevant documents ranked highly) = 7<br>
False Negatives (relevant documents not ranked highly) = 7<br>


---


**False Positive:**

G00-34-3591274<br>
G00-05-0719078<br>
G00-92-2053892<br>
G00-70-2681284<br>
G00-33-2857182<br>
G00-51-3264753<br>
G00-32-1907807<br><br>
**False Negative:**

G00-03-0589290<br>
G00-21-0494028<br>
G00-21-2114990<br>
G00-32-0551737<br>
G00-86-3719816<br>
G00-92-2974327<br>
G00-99-0140748
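
The two lists above can be derived mechanically with set operations. The sketch below copies the document IDs from the topic 16 listings; the variable names are only illustrative.

```python
# Documents returned by the baseline system for topic 16 (in rank order).
retrieved = [
    "G00-34-3591274", "G00-05-0719078", "G00-92-2053892", "G00-70-2681284",
    "G00-33-2857182", "G00-51-3264753", "G00-32-1907807",
]
# Documents judged relevant for topic 16 in the qrels file.
relevant = {
    "G00-03-0589290", "G00-21-0494028", "G00-21-2114990", "G00-32-0551737",
    "G00-86-3719816", "G00-92-2974327", "G00-99-0140748",
}

retrieved_set = set(retrieved)
true_positives = retrieved_set & relevant    # retrieved and relevant
false_positives = retrieved_set - relevant   # retrieved but not relevant
false_negatives = relevant - retrieved_set   # relevant but not retrieved

print("num_ret     =", len(retrieved_set))
print("num_rel     =", len(relevant))
print("num_rel_ret =", len(true_positives))
print("false positives:", sorted(false_positives))
print("false negatives:", sorted(false_negatives))
```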


---

Query 16: Emergency and disaster preparedness assistance


False Positive Example: G00-92-2053892


*   this document contains all the words in topic 16
*   it contains many occurrences of the stop word 'and'
*   this document is very long, so it contains more occurrences of the keywords than the relevant documents do

False Negative Example: G00-21-2114990


*   the document contains variations of the query words:
 *  e.g. 'disaster' -> 'Disaster', 'disasters'
 *  e.g. 'preparedness' -> 'Preparedness', 'prepare'

*   it contains many occurrences of the stop word 'and'
*   it does not contain the word 'assistance'



---

**Improvements**

*   Use 'or' instead of 'and' to combine the query terms
*   Stemming
*   Lemmatization
*   Stop-word filter
*   Lowercase filter
*   Lower the weights of words that are highly frequent in the corpus
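
As a rough illustration of why lowercasing, stop-word removal and stemming matter, the sketch below compares exact token overlap with overlap after normalization. It uses only the standard library: the stop-word list, the `toy_stem` function and the document text are all hypothetical stand-ins, not the Whoosh filters or a real stemmer.

```python
import re

STOP_WORDS = {"and", "the", "of", "for", "to", "a", "in"}  # tiny illustrative list


def tokenize(text):
    """Split text into word tokens without changing case."""
    return re.findall(r"\w+", text)


def toy_stem(token):
    """Crude stand-in for a stemmer: strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text):
    """Lowercase, drop stop words, then apply the toy stemmer."""
    tokens = [t.lower() for t in tokenize(text)]
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


query = "Emergency and disaster preparedness assistance"
doc = "Disasters: how to be prepared for emergencies"  # hypothetical document text

raw_overlap = set(tokenize(query)) & set(tokenize(doc))
norm_overlap = set(normalize(query)) & set(normalize(doc))

print("exact-match overlap:", raw_overlap)
print("normalized overlap: ", norm_overlap)
```

With exact matching nothing overlaps, while the normalized version matches on "disaster". The toy stemmer still misses "emergencies" vs. "emergency", which is the kind of case a proper stemmer is meant to handle.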



### Q3 (b): Write your code below

In [0]:
# Analyzer: tokenize, lowercase, split intra-word parts, remove stop words, then stem with Whoosh's built-in StemFilter
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()

In [17]:
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = stmLwrStpIntraAnalyzer))

# create the index based on the new schema
myIndex2 = createIndex(mySchema2)
addFilesToIndex(myIndex2, filesToIndex)
# define a query parser for the field "file_content" in the index
myQueryParser2 = QueryParser("file_content", schema=myIndex2.schema)
mySearcher2 = myIndex2.searcher()

done indexing.


In [0]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3

In [0]:
INDEX_Q3 = myIndex2 # Replace None with your index for Q3
QP_Q3 = myQueryParser2 # Replace None with your query parser for Q3
SEARCHER_Q3 = mySearcher2 # Replace None with your searcher for Q3

In [20]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3) 

map                      1       0.0000
map                      2       0.5000
map                      4       0.5357
map                      6       0.0000
map                      7       0.0000
map                      9       0.0769
map                      10      0.2500
map                      14      1.0000
map                      16      0.0000
map                      18      1.0000
map                      19      0.5000
map                      22      0.0385
map                      24      1.0000
map                      26      0.0771
map                      28      0.0711
map                      all     0.3366


In [21]:
printRelName(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3, "16")

---------------------------Topic_id and Topic_phrase----------------------------------
16 Emergency and disaster preparedness assistance
---------------------------Return documents----------------------------------
16 Q0 G00-68-3661801 0 26.913555 test
16 Q0 G00-34-3591274 1 26.876672 test
16 Q0 G00-88-2853984 2 25.422361 test
16 Q0 G00-45-0006211 3 25.422361 test
16 Q0 G00-05-0719078 4 24.287607 test
16 Q0 G00-92-2053892 5 24.132200 test
16 Q0 G00-20-3839216 6 24.011339 test
16 Q0 G00-03-2245885 7 23.690870 test
16 Q0 G00-70-2681284 8 23.246485 test
16 Q0 G00-93-0870338 9 23.084161 test
16 Q0 G00-46-3010333 10 22.197888 test
16 Q0 G00-14-0931254 11 22.121642 test
16 Q0 G00-33-2857182 12 21.183803 test
16 Q0 G00-49-2630728 13 19.483217 test
16 Q0 G00-23-1010771 14 18.660356 test
16 Q0 G00-30-1702552 15 16.122116 test
16 Q0 G00-51-3264753 16 13.517693 test
16 Q0 G00-15-3335359 17 13.013648 test
16 Q0 G00-23-3822276 18 12.816667 test
16 Q0 G00-79-0620805 19 12.281928 test
16 Q0 G00-32-19

### Q3 (c): Provide answer to Q3 (c) here [markdown cell]

**Modifications made:**

I added:
*    LowercaseFilter() : converts every token to lowercase
*    IntraWordFilter() : splits compound tokens like "whoosh.analysis" into "whoosh" and "analysis"
*    StopFilter() : removes stop words
*    StemFilter() : stems each token

**Improvements:**
There are overall improvements, given the evidence:


*   overall map increased from 0.1971 to 0.3366
*   overall gm_map increased from 0.0015 to 0.0177
*   results for query 19 now appear, whereas the baseline system returned no documents for it


---

**Results from TREC_EVAL:**

TREC_EVAL | Topic # | Score
 ---|---|---
num_ret                 |16 |     24.0000
num_rel                 |16 |     7.0000
num_rel_ret             |16 |     0.0000

According to the table for topic 16: <br>
False positives increased from 7 to 24.<br>
False negatives stay the same: 7 (none of the 7 relevant documents is retrieved).

Hence, the map for topic 16 is still 0, so there is no improvement for topic 16.
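
To see why the score stays at 0 no matter how many documents are returned, here is a small sketch of average precision in plain Python. It follows the usual definition (sum of precision at each relevant hit, divided by the number of relevant documents); the second ranking, which places a relevant document at rank 3, is hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


# Relevant documents for topic 16, copied from the qrels listing.
relevant_16 = {
    "G00-03-0589290", "G00-21-0494028", "G00-21-2114990", "G00-32-0551737",
    "G00-86-3719816", "G00-92-2974327", "G00-99-0140748",
}

# Top three results of the improved system for topic 16.
actual_top = ["G00-68-3661801", "G00-34-3591274", "G00-88-2853984"]
# Hypothetical ranking in which one relevant document reaches rank 3.
hypothetical_top = ["G00-68-3661801", "G00-34-3591274", "G00-03-0589290"]

ap_actual = average_precision(actual_top, relevant_16)
ap_hypothetical = average_precision(hypothetical_top, relevant_16)
print(f"actual: {ap_actual:.4f}  hypothetical: {ap_hypothetical:.4f}")
```

Adding more non-relevant documents never changes the sum, so only retrieving at least one relevant document can move the score above 0.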



### Q3 (d): Provide answer to Q3 (d) here [markdown cell]

**Yes.**<br>
Improvements: There are overall improvements, given the evidence:

* overall map increased from 0.1971 to 0.3366
* overall gm_map increased from 0.0015 to 0.0177

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]

TREC_EVAL | Topic # | Score before | Score after
 ---|---|--- |---
map           |           1    |   0.0000 |0.0000
map           |           2    |   0.0000 |0.5000
map           |           4    |   0.0312 |0.5357
map           |           6    |   0.0000 |0.0000
map           |           7    |   0.0000 |0.0000
map           |           9    |   0.0000 |0.0769
map           |           10   |   0.1667 |0.2500
map           |           14   |   0.2500 |1.0000
map           |           16   |   0.0000 |0.0000
map           |           18   |   1.0000 |1.0000
map           |           19   |          |0.5000   
map           |           22   |   0.2000 |0.0385
map           |           24   |   1.0000 |1.0000
map           |           26   |   0.1111 |0.0771
map           |           28   |   0.0000 |0.0711
map           |           all  |   0.1971 |0.3366

Yes. According to the table above, many queries got better (e.g. 2, 4, 9, 10, 14 and 28), while two queries got worse: 22 and 26.
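
The comparison can also be done programmatically. The sketch below copies the per-topic map values from the two `pyTrecEval` outputs (baseline and Q3 system) and groups the topics by how their score changed; the dictionary and list names are only illustrative.

```python
# Per-topic map values copied from the baseline output.
before = {
    "1": 0.0000, "2": 0.0000, "4": 0.0312, "6": 0.0000, "7": 0.0000,
    "9": 0.0000, "10": 0.1667, "14": 0.2500, "16": 0.0000, "18": 1.0000,
    "22": 0.2000, "24": 1.0000, "26": 0.1111, "28": 0.0000,
}
# Per-topic map values copied from the Q3 output.
after = {
    "1": 0.0000, "2": 0.5000, "4": 0.5357, "6": 0.0000, "7": 0.0000,
    "9": 0.0769, "10": 0.2500, "14": 1.0000, "16": 0.0000, "18": 1.0000,
    "19": 0.5000, "22": 0.0385, "24": 1.0000, "26": 0.0771, "28": 0.0711,
}

improved = [t for t in after if t in before and after[t] > before[t]]
worsened = [t for t in after if t in before and after[t] < before[t]]
unchanged = [t for t in after if t in before and after[t] == before[t]]
new_topics = [t for t in after if t not in before]

print("improved: ", improved)
print("worsened: ", worsened)
print("unchanged:", unchanged)
print("new:      ", new_topics)
```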

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]

The overall map improved from 0.1971 to 0.3366 and the gm_map increased from 0.0015 to 0.0177; more specifically, the map of six topics increased and topic 19 now returns results. So I consider it a good improvement. However, the map of topics 22 and 26 dropped and topic 16 still retrieves none of its relevant documents (its map is still 0), so I conclude that there is room for further improvement in this search system.

## Question 4

In [0]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

### Please answer the following questions here
(a) A clear list of all final modifications made.  
(b) Why each modification was made – how did it help?  
(c) The final MAP performance that these modifications attained.

**Iteration 1: Implementing NLTK**

In [0]:
# import nltk
# from nltk.stem import *
# # download required resources
# nltk.download("wordnet")

# from whoosh import scoring, qparser

In [0]:
# Dont change this! Use it as-is in your code
# This filter will run for both the index and the query

class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [0]:
# Analyzer_with_nltk = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()|  CustomFilter(WordNetLemmatizer().lemmatize)
# Analyzer_with_nltk = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()|  CustomFilter(WordNetLemmatizer().lemmatize, 'v')
Analyzer_with_nltk = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()|  CustomFilter(LancasterStemmer().stem)


In [26]:
mySchema4_iter = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = Analyzer_with_nltk))
# create the index based on the new schema
myIndex4_iter = createIndex(mySchema4_iter)
addFilesToIndex(myIndex4_iter, filesToIndex) 

done indexing.


In [0]:
myQueryParser4_iter1 = QueryParser("file_content", schema=myIndex4_iter.schema)
mySearcher4_iter1 = myIndex4_iter.searcher()

In [28]:
pyTrecEval(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter1, mySearcher4_iter1) 

map                      1       0.0000
map                      2       0.5000
map                      4       0.5357
map                      6       0.1667
map                      7       0.0000
map                      9       0.0588
map                      10      0.2500
map                      14      1.0000
map                      16      0.0000
map                      18      1.0000
map                      19      0.5000
map                      22      0.0357
map                      24      1.0000
map                      26      0.0771
map                      28      0.2262
map                      all     0.3567


**Iteration 2: Change 'and' to 'or'**

In [0]:
myQueryParser4_iter2 = QueryParser("file_content", schema=myIndex4_iter.schema, group=qparser.OrGroup)
mySearcher4_iter2 = myIndex4_iter.searcher()

In [30]:
pyTrecEval(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter2, mySearcher4_iter2) 

map                      1       0.0625
map                      2       0.5357
map                      4       0.5583
map                      6       0.1667
map                      7       0.1778
map                      9       0.0588
map                      10      0.2500
map                      14      1.0000
map                      16      0.1570
map                      18      1.0000
map                      19      0.5000
map                      22      0.0357
map                      24      1.0000
map                      26      0.1074
map                      28      0.2262
map                      all     0.3891
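
The effect of switching the query group can be sketched with plain sets. The posting lists below are hypothetical; the point is only that requiring every term (the default AND behaviour of the query parser) can leave nothing to rank, while OR keeps every document that matches at least one term and lets the scoring function order them.

```python
# Hypothetical posting lists: term -> set of documents containing it.
postings = {
    "emergency": {"d1", "d2", "d3"},
    "disaster": {"d2", "d3", "d4"},
    "preparedness": {"d3", "d5"},
    "assistance": {"d6"},
}

terms = ["emergency", "disaster", "preparedness", "assistance"]

# AND semantics: a document must contain every query term.
and_matches = set.intersection(*(postings[t] for t in terms))
# OR semantics: a document must contain at least one query term.
or_matches = set.union(*(postings[t] for t in terms))

print("AND matches:", sorted(and_matches))
print("OR matches: ", sorted(or_matches))
```

In this toy index no document contains all four terms, so AND returns nothing, while OR returns six candidates for ranking.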


**Iteration 3: Try TF and TF-IDF**

In [0]:
myQueryParser4_iter3 = QueryParser("file_content", schema=myIndex4_iter.schema, group=qparser.OrGroup)
mySearcher4_iter3 = myIndex4_iter.searcher(weighting=scoring.TF_IDF())

In [32]:
pyTrecEval(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter3, mySearcher4_iter3) 

map                      1       0.0498
map                      2       0.1884
map                      4       0.1018
map                      6       0.0476
map                      7       0.0791
map                      9       0.2000
map                      10      0.0625
map                      14      0.1667
map                      16      0.0902
map                      18      0.5000
map                      19      0.0294
map                      22      0.0270
map                      24      0.3333
map                      26      0.0428
map                      28      0.0684
map                      all     0.1325


**Iteration 4: Tuning the k1 and b in BM25F**

In [0]:
def pyTrecEval_all_map(topicFile, qrelsFile, queryParser, searcher):
    res = 0
    # Load the topic file: a list of topics (search phrases) used for evaluation
    # topic file -> the queries to run, qrels file -> the true relevance judgments
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                #print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)


    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, {"map"})
      
    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
        print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    # aggregate MAP over all evaluated topics
    res = pytrec_eval.compute_aggregated_measure(
        "map",
        [query_measures["map"] for query_measures in results.values()])
    return res

In [0]:

# num = 0
# op_b = 0
# for b_iter in np.arange(0.4,0.9,0.05):
#   for k1_iter in np.arange(1, 4, 0.1):
#     w = scoring.BM25F(B=b_iter, K1=k1_iter)
#        # define a query parser for the field "file_content" in the index
#     myQueryParser4_iter = QueryParser("file_content", schema=myIndex4_iter.schema, group=qparser.OrGroup)
#     mySearcher4_iter = myIndex4_iter.searcher(weighting=w)
#     a = pyTrecEval_all_map(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter, mySearcher4_iter) 
#     if a>num:
#       num=a
#       op_b = b_iter
#       op_k1 = k1_iter

In [0]:
# print(op_b)
# print(op_k1)

In [0]:
# for or_score in np.arange(0,1,0.1):
#     num = 0
#     op_b = 0
#     op_k1 = 0
#     for b_iter in np.arange(0.4,0.9,0.05):
#       for k1_iter in np.arange(1, 4, 0.1):
#         w = scoring.BM25F(B=b_iter, K1=k1_iter)
#        # define a query parser for the field "file_content" in the index
#         myQueryParser4_iter = QueryParser("file_content", schema=myIndex4_iter.schema, group=qparser.OrGroup.factory(or_score))
#         mySearcher4_iter = myIndex4_iter.searcher(weighting=w)
#         a = pyTrecEval_all_map(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter, mySearcher4_iter) 
#         if a>num:
#           num=a
#           op_b = b_iter
#           op_k1 = k1_iter
#           op_or = or_score
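
The commented-out search above can be written as a small reusable helper. This is only a sketch: `toy_evaluate` is a hypothetical stand-in objective (not real MAP) so that the example runs on its own; in the notebook, the evaluation callback would build a searcher with `scoring.BM25F(B=b, K1=k1)` and return the result of `pyTrecEval_all_map`. The grids are built from integer steps to approximate the ranges used above without accumulating floating-point drift.

```python
import itertools


def grid_search(evaluate, b_values, k1_values):
    """Return (best_score, best_b, best_k1) over the full parameter grid."""
    best = (float("-inf"), None, None)
    for b, k1 in itertools.product(b_values, k1_values):
        score = evaluate(b, k1)
        if score > best[0]:
            best = (score, b, k1)
    return best


# Grids built from integer steps and rounded to the intended precision.
b_values = [round(0.40 + 0.05 * i, 2) for i in range(10)]   # 0.40 ... 0.85
k1_values = [round(1.0 + 0.1 * i, 1) for i in range(30)]    # 1.0 ... 3.9


def toy_evaluate(b, k1):
    """Hypothetical objective that peaks at b=0.55, k1=2.7 (NOT real MAP)."""
    return -((b - 0.55) ** 2) - ((k1 - 2.7) ** 2)


best_score, best_b, best_k1 = grid_search(toy_evaluate, b_values, k1_values)
print("best b:", best_b, "best k1:", best_k1)
```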

In [0]:
myQueryParser4_iter4 = QueryParser("file_content", schema=myIndex4_iter.schema, group=qparser.OrGroup)
mySearcher4_iter4 = myIndex4_iter.searcher(weighting=scoring.BM25F(B=0.55, K1=2.7))



In [38]:
pyTrecEval(TOPIC_FILE,  QRELS_FILE, myQueryParser4_iter4, mySearcher4_iter4) 

map                      1       0.0660
map                      2       0.5294
map                      4       0.5493
map                      6       0.1429
map                      7       0.2167
map                      9       0.2500
map                      10      0.3333
map                      14      1.0000
map                      16      0.2055
map                      18      1.0000
map                      19      0.5000
map                      22      0.0435
map                      24      1.0000
map                      26      0.0948
map                      28      0.2429
map                      all     0.4116


In [0]:
INDEX_Q4 = myIndex4_iter # Replace None with your index for Q4
QP_Q4 = myQueryParser4_iter4 # Replace None with your query parser for Q4
SEARCHER_Q4 = mySearcher4_iter4 # Replace None with your searcher for Q4

In [40]:
pyTrecEval(TOPIC_FILE,  QRELS_FILE, QP_Q4, SEARCHER_Q4) 

map                      1       0.0660
map                      2       0.5294
map                      4       0.5493
map                      6       0.1429
map                      7       0.2167
map                      9       0.2500
map                      10      0.3333
map                      14      1.0000
map                      16      0.2055
map                      18      1.0000
map                      19      0.5000
map                      22      0.0435
map                      24      1.0000
map                      26      0.0948
map                      28      0.2429
map                      all     0.4116


### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

1. Implementing NLTK stemmer
2. Add the parameter 'group=qparser.OrGroup' in the QueryParser
3. Tried using TF-IDF
4. Tune the parameters (b, k1) for BM25F

### Q4 (b): Provide answer to Q4 (b) here [markdown cell]

1. Implementing NLTK stemmer


    *   The MAP improved from 0.3366 to 0.3567 after adding LancasterStemmer
    

2. Add the parameter 'group=qparser.OrGroup' in the QueryParser

    *   Uses 'or' instead of 'and' to combine the query terms in the qparser.
    *   Retrieves many more documents, so recall increases substantially
    *   The MAP improved from 0.3567 to 0.3891


3. Tried using TF-IDF

    *   It did not help: the MAP dropped from 0.3891 to 0.1325, so it was not kept

4. Tune the parameters (b, k1) for BM25F
    *   b: controls document-length normalization, since longer documents tend to contain more words
    *   k1: controls term-frequency saturation, so that a larger term frequency does not count that much more
    *   With B=0.55 and K1=2.7 the MAP improved from 0.3891 to 0.4116

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]

The final MAP is **0.4116**

## Validation

In [0]:
# Run the following cells to make sure your code returns the correct value types

In [0]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [43]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [44]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [45]:
assert(isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert(isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
