# Assignment 2: IR

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [8]:
# imports
from whoosh import index, writing, qparser, scoring
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser, MultifieldParser
import os.path
import time
from pathlib import Path
import tempfile
import subprocess

In [7]:
DATA_DIR = "government"

DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")

TREC_EVAL = os.path.join("trec_eval", "trec_eval.exe")


## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]

Chosen measures:
- **gm_map (geometric mean of average precision): over all queries**
- **map (mean average precision): for each query**

***Note: gm_map is only available as an overall measure, so map is used for each individual query.***

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]

We assume that users of government websites want to find many relevant documents for each query, and that average precision in real-world collections is usually skewed across queries. We therefore use the mean and the geometric mean of average precision, since both take into account the number of relevant documents and the number of relevant documents retrieved. The geometric mean is also less dominated by a few high-scoring queries, which suits a skewed distribution.

Source: http://www.cs.sfu.ca/CourseCentral/456/jpei/web%20slides/L19%20-%20Evaluation.pdf
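
To make the measure concrete, here is a minimal sketch of how average precision could be computed for a single ranked list. The function name and the document IDs are hypothetical; trec_eval remains the reference implementation used in this notebook.

```python
def average_precision(ranked_docs, relevant_docs):
    """Average precision of one ranked result list.

    ranked_docs: document IDs in ranked order (best first).
    relevant_docs: set of IDs judged relevant for the query.
    """
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant_docs)


# Hypothetical example: one relevant document, found at rank 6 of 16.
ranked = ["doc%02d" % i for i in range(1, 17)]
ap = average_precision(ranked, {"doc06"})
print(round(ap, 4))
```

Dividing by the number of relevant documents (not the number retrieved) is what makes the measure sensitive to relevant documents that are missed entirely.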

## Question 2

### Q2 (a): Write your code below
Save your index (after indexing all the documents) in the provided variable INDEX_Q2, your query parser in the provided variable QP_Q2, and your searcher in the provided variable SEARCHER_Q2.

#### Creating the index

In [9]:
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index (in a temporary directory) based on the new schema
INDEX_Q2 = createIndex(mySchema)

#### Indexing the documents

In [10]:
def addFilesToIndex(indexObj, fileList):
    start = time.time()
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (int(docNum+1) % 1000 == 0):
                    print("already indexed:", docNum+1)
        end = time.time()
        print("done indexing, time spent: " + str("%.2f" % (end - start)) + "s.")

    finally:
        # close the writer so that buffered documents are committed to the index
        writer.close()
    
    
        
# Build the list of files to index, skipping directories
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

In [5]:
# count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [6]:
addFilesToIndex(INDEX_Q2, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing, time spent: 53.19s.


In [8]:
INDEX_Q2

FileIndex(FileStorage('C:\\Users\\austi\\AppData\\Local\\Temp\\tmpliwe6hft'), 'MAIN')

#### Querying

In [7]:
# define a query parser for the field "file_content" in the index,
# and a searcher on the same index
QP_Q2 = QueryParser("file_content", schema=INDEX_Q2.schema)
SEARCHER_Q2 = INDEX_Q2.searcher()

#### Evaluation using TREC_EVAL

In [8]:
# print the topic file
with open(TOPIC_FILE, "r") as f:
    print(f.read())

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education



In [9]:
def trecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    # (wrap the descriptor returned by mkstemp so it is closed properly)
    fd, tempOutputFile = tempfile.mkstemp()
    with os.fdopen(fd, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    print(result.stdout.decode())

In [10]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2) 

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000
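
The per-topic output above has three whitespace-separated columns: measure, topic ID and value. A small sketch of collecting one measure per topic from such text follows. The helper name `parse_trec_eval` and the sample string are illustrative only and are not part of the assignment code.

```python
def parse_trec_eval(output_text, measure="map"):
    """Return {topic_id: value} for one measure from trec_eval -q output."""
    values = {}
    for line in output_text.splitlines():
        fields = line.split()
        # expected layout: <measure> <topic id or "all"> <value>
        if len(fields) == 3 and fields[0] == measure:
            values[fields[1]] = float(fields[2])
    return values


# Sample lines copied from the output above.
sample = """num_ret               \t1\t1
map                   \t1\t0.0000
num_ret               \t10\t16
map                   \t10\t0.1667
"""
per_topic_map = parse_trec_eval(sample)
print(per_topic_map)
```

Matching on the exact measure name avoids accidentally picking up other measures whose names merely contain "map".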


### Q2 (b): Provide answer to Q2 (b) here [markdown cell]
How well did the baseline Whoosh system do on your chosen measure? [Provide the number.]

Topic | Measure | Value
--- | --- | ---
1 | map | 0
10 | map | 0.1667
14 | map | 0.25
16 | map | 0
18 | map | 1
2 | map | 0
22 | map | 0.2
24 | map | 1
26 | map | 0.1111
28 | map | 0
4 | map | 0.0312
6 | map | 0
7 | map | 0
9 | map | 0
**all** | **map** | **0.1971**
**all** | **gm_map** | **0.0015**

<br><br>
**Performance:** The baseline Whoosh system did not perform well: gm_map is 0.0015 and the overall map is 0.1971.
<br><br>
***Note: The results exclude topic #19. The baseline retrieved no documents for this query (which contains commas), and by default TREC_EVAL leaves out topics with no retrieved documents.***
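
The two overall figures can be reproduced from the per-topic values in the table. The sketch below assumes that gm_map replaces each average precision below 0.00001 with that floor before taking logarithms. This is an assumption about how trec_eval handles zero scores, although it does reproduce the reported value here. The function names are illustrative.

```python
import math


def mean_ap(ap_values):
    """Arithmetic mean of per-topic average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_mean_ap(ap_values, floor=1e-5):
    """Geometric mean of per-topic average precision.

    Assumption: scores below `floor` are raised to `floor` so that
    the logarithm is defined for topics with zero average precision.
    """
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


# Per-topic map values from the table above (topic #19 is absent).
baseline_ap = [0, 0.1667, 0.25, 0, 1, 0, 0.2, 1, 0.1111, 0, 0.0312, 0, 0, 0]
print(round(mean_ap(baseline_ap), 4), round(geometric_mean_ap(baseline_ap), 4))
```

The seven topics with zero average precision barely move the arithmetic mean but pull the geometric mean down by orders of magnitude, which is why gm_map is so much lower than map for the baseline.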

### Q2 (c): Provide answer to Q2(c) here [markdown cell]
Are there any particular topics where it did very well, or very badly? [If so, list a few topic IDs for each]

**Yes, some topics performed at the extremes.**
<br><br>
**VERY WELL (map = 1.0000)**
- #18 Shipwrecks  
- #24 Air Bag Safety


**VERY BADLY (map = 0.0000)**
- #1 mining gold silver coal  
- #2 juvenile delinquency
- #6 physical therapists
- #7 cotton industry
- #9 genealogy searches
- #16 Emergency and disaster preparedness assistance
- #28 Early Childhood Education

## Question 3

### Q3 (a): Provide answer to Q3 (a) here [markdown cell]
What do you think would improve Whoosh’s performance on this test collection, and why?

**Chosen query:** topic #16: Emergency and disaster preparedness assistance
<br><br>
In Q2, we got the total number of files = len(filesToIndex) = 4078
<br><br>
**Other measures in TREC_EVAL:**
- num_rel_ret (# relevant retrieved) = 0, so True Positives = 0
- num_ret (# retrieved) = 7, so False Positives = num_ret - num_rel_ret = 7
- num_rel (# relevant) = 7, so False Negatives = num_rel - num_rel_ret = 7


**Generated Confusion Matrix:**

Confusion Matrix | Relevant | Non-Relevant
--- | --- | ---
*Retrieved* | 0 | 7
*Not Retrieved* | 7 | 4064
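
The matrix above follows directly from the three TREC_EVAL counts and the collection size. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def confusion_matrix(total_docs, num_ret, num_rel, num_rel_ret):
    """Derive confusion-matrix cells from trec_eval counts for one topic."""
    tp = num_rel_ret                 # relevant and retrieved
    fp = num_ret - num_rel_ret       # retrieved but not relevant
    fn = num_rel - num_rel_ret       # relevant but not retrieved
    tn = total_docs - tp - fp - fn   # everything else in the collection
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Topic #16 with the baseline system: 4078 documents in the collection.
print(confusion_matrix(4078, num_ret=7, num_rel=7, num_rel_ret=0))
```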

**False Positive**
- G00-34-3591274 
- G00-05-0719078 
- G00-92-2053892 
- G00-70-2681284 
- G00-33-2857182 
- G00-51-3264753 
- G00-32-1907807 

**False Negative**
- G00-03-0589290
- G00-21-0494028
- G00-21-2114990
- G00-32-0551737
- G00-86-3719816 
- G00-92-2974327
- G00-99-0140748


Taking a detailed look at the documents:
- False Positive example ***G00-32-1907807***: an irrelevant document that was retrieved because it contains all the terms in the query.  
- False Negative example ***G00-99-0140748***: a relevant document that was not retrieved. It contains:
  *  variations of "Emergency", e.g. "emergency" and "Emergencies"
  *  upper-case forms of "disaster" and "preparedness"
  *  many occurrences of the stop word "and" (39 times)
  *  no occurrence of "assistance"

Therefore, a customized analyzer (the default tokenizer followed by a lowercase filter, a stop-word filter, and stemming or lemmatizing, applied to both the index and the query), combined with searching for any of the query terms instead of all of them, should provide better search results.
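
The reasoning above can be illustrated without Whoosh. The sketch below uses a toy analyzer and plain set operations; the stemmer is a crude stand-in for a real one, the stop-word list is tiny, and the document text is hypothetical rather than taken from the collection.

```python
import re

STOP_WORDS = {"and", "the", "of", "for"}  # tiny illustrative list


def toy_stem(word):
    """Crude plural stripping, only a stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def analyze(text, normalize=True):
    """Return the set of index terms for a piece of text."""
    tokens = re.findall(r"\w+", text)
    if not normalize:
        return set(tokens)  # case-sensitive, no stop words, no stemming
    return {toy_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS}


def matches(query, document, mode, normalize):
    """mode 'and' requires every query term, 'or' requires at least one."""
    q = analyze(query, normalize)
    d = analyze(document, normalize)
    return q <= d if mode == "and" else bool(q & d)


query = "Emergency and disaster preparedness assistance"
document = "Disaster Preparedness for Emergencies"  # hypothetical text

print(matches(query, document, "and", normalize=False))
print(matches(query, document, "or", normalize=False))
print(matches(query, document, "or", normalize=True))
```

Without normalization the document shares no exact term with the query, so neither AND nor OR matches. With normalization and OR it matches even though "assistance" is missing.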


In [11]:
sampleQuery = QP_Q2.parse("Emergency and disaster preparedness assistance") # parse topic 16
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)
# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-34-3591274 0 34.092075576251
G00-05-0719078 1 32.195485675325315
G00-92-2053892 2 27.13176387951665
G00-70-2681284 3 26.574622331662233
G00-33-2857182 4 21.813916494265353
G00-51-3264753 5 10.948532555349608
G00-32-1907807 6 10.008862397603384


### Q3 (b): Write your code below

#### Customize filter

In [11]:
import nltk
from nltk.stem import *

# download required resources
nltk.download("wordnet")

# Initializing the filter; it will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        # two filters compare equal if they are of the same class
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        # the same transformation is applied whether the token comes from
        # the query parser (t.mode == 'query') or the indexer ('index')
        for t in tokens:
            t.text = self.customFilter(t.text, *self.args, **self.kwargs)
            yield t

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\austi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
# we'll compare two stemmers and a lemmatizer
lrStem = LancasterStemmer()
sbStem = SnowballStemmer("english")
wnLemm = WordNetLemmatizer()

In [13]:
# make a customized analyzer: regex tokenizer, lowercase, intra-word splitting, stop words, stemmers
# in our tests lrStem outperformed the other NLTK options
# we added IntraWordFilter to break up the phrases in #19 and improve overall performance (it was not originally needed for #16)
myAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter() | CustomFilter(lrStem.stem) 

In [15]:
# define a Schema with the new analyzer
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myAnalyzer))

# create the index based on the new schema
INDEX_Q3 = createIndex(mySchema2)

In [16]:
addFilesToIndex(INDEX_Q3, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing, time spent: 62.18s.


In [17]:
# define a query parser for the field "file_content" in the index
# reconfigured the QP to use OrGroup with a scaling factor of 0.65 on the bonus
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema, group=qparser.OrGroup.factory(0.65))
SEARCHER_Q3= INDEX_Q3.searcher()
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3) 

num_ret               	1	469
num_rel               	1	5
num_rel_ret           	1	5
map                   	1	0.0673
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0556
iprec_at_recall_0.00  	1	0.1034
iprec_at_recall_0.10  	1	0.1034
iprec_at_recall_0.20  	1	0.1034
iprec_at_recall_0.30  	1	0.1034
iprec_at_recall_0.40  	1	0.1034
iprec_at_recall_0.50  	1	0.1034
iprec_at_recall_0.60  	1	0.1034
iprec_at_recall_0.70  	1	0.0435
iprec_at_recall_0.80  	1	0.0435
iprec_at_recall_0.90  	1	0.0431
iprec_at_recall_1.00  	1	0.0431
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.1000
P_100                 	1	0.0400
P_200                 	1	0.0250
P_500                 	1	0.0100
P_1000                	1	0.0050
num_ret               	10	490
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.00

In [18]:
sampleQuery = QP_Q3.parse("Emergency and disaster preparedness assistance") # parse topic 16
sampleQueryResults = SEARCHER_Q3.search(sampleQuery, limit=None)
# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-68-3661801 0 18.599747640697878
G00-34-3591274 1 18.479768692710223
G00-03-2245885 2 17.751934884382294
G00-70-2681284 3 17.683326346242325
G00-45-0006211 4 17.469466366514794
G00-88-2853984 5 17.469466366514794
G00-05-0719078 6 17.43829971560651
G00-84-2647789 7 17.360890515680687
G00-92-2053892 8 16.66956198606738
G00-20-3839216 9 16.602779963567237
G00-93-0870338 10 16.3046950063232
G00-21-0494028 11 16.184500620880957
G00-21-2114990 12 16.042227201700257
G00-46-3010333 13 15.652530245869027
G00-77-1693859 14 15.627349583551965
G00-86-3719816 15 15.626463396210132
G00-14-0931254 16 15.49111020561254
G00-53-0263242 17 15.140227721793492
G00-60-2564326 18 14.56206832027543
G00-33-2857182 19 14.355026017168779
G00-32-0551737 20 14.248538083393573
G00-02-0866559 21 14.082025972000068
G00-09-1393028 22 13.96537578435461
G00-75-3633903 23 13.93301974889707
G00-97-3418169 24 13.875352442130099
G00-56-1706812 25 13.722377670457522
G00-56-2140972 26 13.667839536280049
G00-45-0028267 27 1

G00-03-2338574 446 3.3974589568774665
G00-42-1834554 447 3.396823577651823
G00-62-0234683 448 3.3963684961914877
G00-03-2104296 449 3.3921143683169483
G00-11-2794349 450 3.385137565037657
G00-98-1706780 451 3.385137565037657
G00-07-2941106 452 3.3827007224643055
G00-86-0205886 453 3.380980414136774
G00-10-0704121 454 3.369163942334394
G00-78-3080721 455 3.369163942334394
G00-07-1711598 456 3.364406877921359
G00-61-0280746 457 3.3632076036709737
G00-28-0286112 458 3.338429590188017
G00-95-3892368 459 3.3375810876184477
G00-36-1433110 460 3.337486159830353
G00-70-2216707 461 3.337486159830353
G00-98-3853828 462 3.337458333259546
G00-18-0270371 463 3.329229174848943
G00-48-2907514 464 3.3260479975366506
G00-04-3319765 465 3.3192591379582863
G00-31-2469242 466 3.3192591379582863
G00-56-1880693 467 3.3192591379582863
G00-66-3921879 468 3.3192591379582863
G00-02-3807812 469 3.316109158775551
G00-62-0503229 470 3.316109158775551
G00-73-3056531 471 3.316109158775551
G00-84-2576377 472 3.316109

G00-85-1525415 931 2.1095669737965914
G00-95-4067546 932 2.1077095362189078
G00-98-1785584 933 2.1077095362189078
G00-01-2292288 934 2.082487909533082
G00-74-1546702 935 2.082487909533082
G00-96-1446108 936 2.082487909533082
G00-30-1592409 937 2.072517927834433
G00-55-1156700 938 2.060073726778834
G00-01-1184872 939 2.053926000343255
G00-06-1950576 940 2.053926000343255
G00-10-2871392 941 2.053926000343255
G00-36-2319087 942 2.053926000343255
G00-59-0641167 943 2.053926000343255
G00-60-2428516 944 2.053926000343255
G00-95-2914435 945 2.053926000343255
G00-33-2469103 946 2.051572873300147
G00-60-0739588 947 2.040439891005219
G00-19-0629904 948 2.0351399907121617
G00-03-0316753 949 2.026136960319596
G00-05-0013004 950 2.026136960319596
G00-10-0460101 951 2.026136960319596
G00-11-2857124 952 2.026136960319596
G00-46-1439567 953 2.026136960319596
G00-55-2853576 954 2.026136960319596
G00-68-2213331 955 2.026136960319596
G00-98-3138271 956 2.026136960319596
G00-99-2416431 957 2.0261369603195

### Q3 (c): Provide answer to Q3 (c) here [markdown cell]
**Modification**
- Added a customized analyzer to the schema used for indexing, combining lowercasing, stop-word removal, Whoosh's StemFilter and an NLTK stemmer.
  * added IntraWordFilter to break up the phrases in #19 and improve overall performance (not originally needed for #16).
- Reconfigured the QueryParser to use OR instead of AND via the "group" keyword, with a scaling factor of 0.65 on the bonus, chosen after many trials.

Topic | Measure | Value_in_Q2 | Value_in_Q3 | Improvement
--- | --- | --- | --- | ---
1 | map | 0 | 0.0673 | +
10 | map | 0.1667 | 0.2500 | +
14 | map | 0.2500 | 1 | +
***16*** | ***map*** | ***0*** | ***0.1552*** | ***+***
18 | map | 1 | 1 |
19 | map | N/A | 0.5000 | +
2 | map | 0 | 0.5385 | +
22 | map | 0.2000 | 0.0357 | -
24 | map | 1 | 1 |
26 | map | 0.1111 | 0.1074 | -
28 | map | 0 | 0.2262 | +
4 | map | 0.0312 | 0.5614 | +
6 | map | 0 | 0.1667 | +
7 | map | 0 | 0.1778 | +
9 | map | 0 | 0.0588 | +
**all** | **map** | **0.1971** | **0.3897** | ***97.72%***
**all** | **gm_map** | **0.0015** | **0.2427** | ***16080%***

**Confusion Matrix (topic #16): map for topic #16 rose from 0 to 0.1552**

Confusion Matrix | Relevant | Non-Relevant
--- | --- | ---
*Retrieved* | 7 | 1104
*Not Retrieved* | 0 | 2967

**False Positive:**
<br>There are **1111** retrieved documents in total, but **1104** of them are not relevant, so precision is very low.
- The first True Positive document, ***G00-21-0494028***, is ranked only **12th (recip_rank = 0.0833)** among the retrieved documents.
- The first False Positive document, ***G00-68-3661801***, is ranked **1st** among the retrieved documents; it contains far more occurrences of "emergency" than of the other words in the query.

**False Negative:**
<br>All False Negative documents were eliminated.<br>
***Note: the OR parser returned considerably more documents, which ensures high recall at the cost of low precision.***

### Q3 (d): Provide answer to Q3 (d) here [markdown cell]
YES, refer to the table in Q3 (c).

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]
YES, refer to the table in Q3 (c).

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]

Although the customized analyzer with the reconfigured QueryParser retrieved considerably more documents, giving high recall with low precision, the overall map and gm_map and the map for topic #16 all increased substantially, which suggests the modification is a clear improvement over Q2. The large gains in map (97.72%) and gm_map (16080%) also indicate that relevant documents are generally ranked higher among the retrieved documents. Notably, a few queries performed worse than in Q2, while the rest improved dramatically.

## Question 4 (Graduate Students)

In [19]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

**Chosen query:** topic #16: Emergency and disaster preparedness assistance
<br><br>
**Confusion Matrix (topic #16):**

Confusion Matrix | Relevant | Non-Relevant
--- | --- | ---
*Retrieved* | 0 | 7
*Not Retrieved* | 7 | 4064

**False Positive**
- G00-34-3591274 
- G00-05-0719078 
- G00-92-2053892 
- G00-70-2681284 
- G00-33-2857182 
- G00-51-3264753 
- G00-32-1907807 

**False Negative**
- G00-03-0589290
- G00-21-0494028
- G00-21-2114990
- G00-32-0551737
- G00-86-3719816 
- G00-92-2974327
- G00-99-0140748


Taking a detailed look at the documents:
- False Positive example ***G00-32-1907807***: an irrelevant document that was retrieved because it contains all the terms in the query.  
- False Negative example ***G00-99-0140748***: a relevant document that was not retrieved. It contains:
  *  variations of "Emergency", e.g. "emergency" and "Emergencies"
  *  upper-case forms of "disaster" and "preparedness"
  *  many occurrences of the stop word "and" (39 times)
  *  no occurrence of "assistance"

Therefore, a customized analyzer (the default tokenizer followed by a lowercase filter, a stop-word filter, and stemming or lemmatizing, applied to both the index and the query), combined with searching for any of the query terms instead of all of them, should provide better search results.



**In Q3 we focused on improving retrieval by customizing the analyzer in the schema and reconfiguring the QueryParser. Now we look at improving the measures by modifying the Searcher.**

**By default, search results are ranked by the BM25F scoring function, which has two tunable parameters, B and K1.**
- B: the degree of field-length normalization. 
- K1: how quickly an increase in term frequency results in term frequency saturation.

**By tuning these parameters, the searcher can be given a more appropriate degree of field-length normalization and term-frequency saturation, which should yield better measures than in Q3.**

Source: https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
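
To see what the two parameters do, the sketch below evaluates the textbook BM25 weight of a single term in a single field. It is an illustration of the formula only: the function name and the input numbers are made up, and Whoosh's own BM25F implementation may differ in details.

```python
def bm25_term_score(idf, tf, field_len, avg_field_len, B=0.75, K1=1.2):
    """Textbook BM25 weight of one term in one document field."""
    # length normalization: 1.0 for an average-length field,
    # and always 1.0 when B = 0
    norm = (1 - B) + B * field_len / avg_field_len
    return idf * (tf * (K1 + 1)) / (tf + K1 * norm)


# B = 0 switches off field-length normalization entirely.
short_doc = bm25_term_score(1.0, tf=3, field_len=100, avg_field_len=500, B=0.0)
long_doc = bm25_term_score(1.0, tf=3, field_len=2000, avg_field_len=500, B=0.0)
print(short_doc == long_doc)

# K1 controls saturation: compare the gain from tf=1 to tf=10.
gains = {}
for k1 in (1.2, 2.7):
    low = bm25_term_score(1.0, tf=1, field_len=500, avg_field_len=500, K1=k1)
    high = bm25_term_score(1.0, tf=10, field_len=500, avg_field_len=500, K1=k1)
    gains[k1] = high / low
    print(k1, round(gains[k1], 2))
```

A larger K1 lets repeated occurrences of a term keep adding to the score for longer, while a smaller B penalizes long documents less.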

### Q4 (b): Write your code below

In [14]:
# define a Schema with the new analyzer
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myAnalyzer))

# create the index based on the new schema
INDEX_Q4 = createIndex(mySchema2)

In [15]:
addFilesToIndex(INDEX_Q4, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing, time spent: 69.88s.


In [16]:
def trecEval_map_all(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    # (wrap the descriptor returned by mkstemp so it is closed properly)
    fd, tempOutputFile = tempfile.mkstemp()
    with os.fdopen(fd, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    
    # collect the overall ("all") map and gm_map values from the trec_eval
    # output by name, instead of relying on their position in the output
    overall = {}
    for line in result.stdout.decode().splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] == "all" and fields[0] in ("map", "gm_map"):
            overall[fields[0]] = float(fields[2])

    # return [map, gm_map] over all topics
    return [overall["map"], overall["gm_map"]]

In [17]:
# define a query parser for the field "file_content" in the index
QP_Q4 = QueryParser("file_content", schema=INDEX_Q4.schema, group=qparser.OrGroup.factory(0.65))

temp_best_map = [0, 0]
temp_best_gm_map = [0, 0]
temp_best_map_d = {}
temp_best_gm_map_d = {}

# grid search on the map score: B in 0..1 (step 0.01), K1 in 0..10 (step 0.1)
for b in range(101):
    for k in range(101):
        w = scoring.BM25F(B=b*0.01, K1=k*0.1)
        searcher_temp = INDEX_Q4.searcher(weighting=w)
        result_temp = trecEval_map_all(TOPIC_FILE, QRELS_FILE, QP_Q4, searcher_temp)
        if result_temp[0] > temp_best_map[0]:
            temp_best_map = result_temp
            temp_best_map_d[(b, k)] = temp_best_map
            print("best map: " + str(temp_best_map_d[(b, k)][0]) + ", corresponding gm_map: " + str(temp_best_map_d[(b, k)][1]) + ", b: " + str(b) + ", k: " + str(k))

print("finished searching, best map captured.")


best map: 0.0929, corresponding gm_map: 0.0606, b: 0, k: 0
best map: 0.2678, corresponding gm_map: 0.1703, b: 0, k: 1
best map: 0.2681, corresponding gm_map: 0.1749, b: 0, k: 10
best map: 0.2682, corresponding gm_map: 0.1763, b: 0, k: 12
best map: 0.2687, corresponding gm_map: 0.1774, b: 0, k: 13
best map: 0.2743, corresponding gm_map: 0.181, b: 0, k: 14
best map: 0.2749, corresponding gm_map: 0.1822, b: 0, k: 15
best map: 0.2757, corresponding gm_map: 0.1836, b: 0, k: 16
best map: 0.2759, corresponding gm_map: 0.1844, b: 0, k: 18
best map: 0.2763, corresponding gm_map: 0.185, b: 0, k: 24
best map: 0.277, corresponding gm_map: 0.1855, b: 0, k: 25
best map: 0.2881, corresponding gm_map: 0.1908, b: 0, k: 26
best map: 0.3067, corresponding gm_map: 0.183, b: 1, k: 1
best map: 0.3103, corresponding gm_map: 0.1876, b: 1, k: 5
best map: 0.3104, corresponding gm_map: 0.188, b: 1, k: 6
best map: 0.3107, corresponding gm_map: 0.1889, b: 1, k: 7
best map: 0.3112, corresponding gm_map: 0.1907, b: 

**best map: 0.4121, corresponding gm_map: 0.2811, b: 56, k: 27**

***B = 0.56, K1 = 2.7***
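
The search above is an exhaustive grid search. A compact, self-contained sketch of the same idea follows. The `evaluate` callback stands in for "build a searcher with these parameters and run TREC_EVAL"; here it is replaced by a made-up function (`fake_map`) so the example runs on its own, and its peak at B = 0.56, K1 = 2.7 is chosen purely for illustration.

```python
from itertools import product


def grid_search(evaluate, b_values, k1_values):
    """Try every (B, K1) pair and keep the first one with the best score."""
    best_score, best_params = float("-inf"), None
    for b, k1 in product(b_values, k1_values):
        score = evaluate(b, k1)
        if score > best_score:
            best_score, best_params = score, (b, k1)
    return best_params, best_score


def fake_map(b, k1):
    """Hypothetical stand-in for an evaluation run returning a map score."""
    return 1.0 - (b - 0.56) ** 2 - ((k1 - 2.7) / 10) ** 2


b_grid = [i / 100 for i in range(101)]   # 0.00, 0.01, ..., 1.00
k1_grid = [i / 10 for i in range(101)]   # 0.0, 0.1, ..., 10.0
params, score = grid_search(fake_map, b_grid, k1_grid)
print(params, round(score, 4))
```

Because the parameters are tuned and evaluated on the same 15 topics, the best pair may not carry over to other queries; a held-out set of topics would give a fairer estimate.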

In [None]:
# running based on the score of gm_map (same B and K1 ranges as the map search)
for b in range(101):
    for k in range(101):
        w = scoring.BM25F(B=b*0.01, K1=k*0.1)
        searcher_temp = INDEX_Q4.searcher(weighting=w)
        result_temp = trecEval_map_all(TOPIC_FILE, QRELS_FILE, QP_Q4, searcher_temp)

        if result_temp[1] > temp_best_gm_map[1]:
            temp_best_gm_map = result_temp
            # record the gm_map result (not the best-map result) for this (b, k)
            temp_best_gm_map_d[(b, k)] = temp_best_gm_map
            print("best gm_map: " + str(temp_best_gm_map_d[(b, k)][1]) + ", corresponding map: " + str(temp_best_gm_map_d[(b, k)][0]) + ", b: " + str(b) + ", k: " + str(k))

print("finished searching, best gm_map captured.")

In [29]:
# B and K1 value got from running loop above
w_final = scoring.BM25F(B=0.56, K1=2.7)
SEARCHER_Q4 = INDEX_Q4.searcher(weighting=w_final)
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q4, SEARCHER_Q4) 

num_ret               	1	469
num_rel               	1	5
num_rel_ret           	1	5
map                   	1	0.0689
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0556
iprec_at_recall_0.00  	1	0.1071
iprec_at_recall_0.10  	1	0.1071
iprec_at_recall_0.20  	1	0.1071
iprec_at_recall_0.30  	1	0.1071
iprec_at_recall_0.40  	1	0.1071
iprec_at_recall_0.50  	1	0.1071
iprec_at_recall_0.60  	1	0.1071
iprec_at_recall_0.70  	1	0.0500
iprec_at_recall_0.80  	1	0.0500
iprec_at_recall_0.90  	1	0.0485
iprec_at_recall_1.00  	1	0.0485
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.1000
P_100                 	1	0.0400
P_200                 	1	0.0250
P_500                 	1	0.0100
P_1000                	1	0.0050
num_ret               	10	490
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.00

In [30]:
sampleQuery = QP_Q4.parse("Emergency and disaster preparedness assistance") # parse topic 16
sampleQueryResults = SEARCHER_Q4.search(sampleQuery, limit=None)
# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-34-3591274 0 26.030995129301573
G00-68-3661801 1 25.066902612605578
G00-70-2681284 2 24.282161587672004
G00-21-2114990 3 23.457763430628543
G00-84-2647789 4 23.292300749759384
G00-77-1693859 5 23.24168281117828
G00-03-2245885 6 23.195509312631383
G00-21-0494028 7 22.64257000655455
G00-05-0719078 8 22.492704440707005
G00-53-0263242 9 21.88061758429768
G00-86-3719816 10 21.731413202170955
G00-60-2564326 11 21.624171648367508
G00-45-0006211 12 21.612991633253735
G00-88-2853984 13 21.612991633253735
G00-92-2053892 14 21.502549204423534
G00-33-2857182 15 20.90019449067951
G00-93-0870338 16 20.174302404872865
G00-09-1393028 17 19.035018093668768
G00-46-3010333 18 18.87528205390649
G00-99-0140748 19 18.43479367671693
G00-14-0931254 20 18.30171029337989
G00-56-2140972 21 18.27760849294458
G00-02-0866559 22 18.270938854470856
G00-97-3418169 23 18.2576266979494
G00-32-0551737 24 18.206983515333306
G00-75-3633903 25 18.20484167314297
G00-45-0028267 26 17.53452829046222
G00-56-1706812 27 17.08

G00-09-0524984 554 3.2834653393006574
G00-02-1833746 555 3.27912087563572
G00-28-3558822 556 3.27912087563572
G00-00-2737496 557 3.2753471357382917
G00-80-1112147 558 3.2534869505655197
G00-33-2327900 559 3.2506062972749032
G00-58-3488759 560 3.2506062972749032
G00-95-3043676 561 3.2506062972749032
G00-32-2166728 562 3.2505822028119784
G00-15-3874791 563 3.2164758782500416
G00-47-4117952 564 3.2157432031607045
G00-97-0293538 565 3.2039235489063205
G00-54-4103368 566 3.2038321360083826
G00-95-2566183 567 3.181619994394031
G00-07-0926644 568 3.158273971920461
G00-17-0424066 569 3.158273971920461
G00-53-1684082 570 3.158273971920461
G00-76-1350144 571 3.158273971920461
G00-80-0235938 572 3.158273971920461
G00-95-2305328 573 3.158273971920461
G00-34-0427966 574 3.158023206001361
G00-54-3996340 575 3.158023206001361
G00-81-1989367 576 3.1564775613236886
G00-10-1437219 577 3.1540931763181526
G00-16-2895420 578 3.151733680646212
G00-61-1131563 579 3.151733680646212
G00-00-1794090 580 3.148213

G00-02-1178594 888 2.1997661850970154
G00-10-1806241 889 2.1997661850970154
G00-31-3447261 890 2.199543944466273
G00-01-1676158 891 2.192501727111305
G00-03-1970588 892 2.192501727111305
G00-04-3662045 893 2.192501727111305
G00-09-3881335 894 2.192501727111305
G00-19-2800132 895 2.192501727111305
G00-54-0905235 896 2.192501727111305
G00-65-1737276 897 2.192501727111305
G00-69-0204239 898 2.192501727111305
G00-72-3133704 899 2.192501727111305
G00-75-2944787 900 2.192501727111305
G00-98-3928796 901 2.192501727111305
G00-77-0359759 902 2.1759402830192744
G00-26-3218156 903 2.1740636324693363
G00-63-3078014 904 2.1740636324693363
G00-15-1460278 905 2.166552709786399
G00-00-3536797 906 2.1657015249188234
G00-03-2042174 907 2.1657015249188234
G00-28-0596013 908 2.1657015249188234
G00-33-2762518 909 2.1657015249188234
G00-35-2527252 910 2.1657015249188234
G00-39-3855269 911 2.1657015249188234
G00-46-2131394 912 2.1657015249188234
G00-50-2576270 913 2.1657015249188234
G00-53-2952068 914 2.1657

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]
**Modification**

- When creating the searcher, we tuned the BM25F scoring algorithm by varying B and K1 to obtain the best map and gm_map scores. The optimal B and K1 were found to be 0.56 and 2.7 respectively (using a loop to sweep both parameters).
  * B ranged over 0-1 in increments of 0.01 per iteration - **B = 0.56 applies less field-length normalization than the default (0.75)**
  * K1 ranged over 0-10 in increments of 0.1 per iteration - **K1 = 2.7 gives slower term-frequency saturation than the default (1.2)**

Topic | Measure | Value_in_Q2 | Value_in_Q3 | Value_in_Q4 | Improvement(Q3->Q4)
--- | --- | --- | --- | --- | ---
1 | map | 0 | 0.0673 | 0.0689 | 2.37%
10 | map | 0.1667 | 0.2500 | 0.3333 | 33.32% 
14 | map | 0.2500 | 1 | 1 | 0%
***16*** | ***map*** | ***0*** | ***0.1552*** | ***0.2066*** | ***33.12%***
18 | map | 1 | 1 | 1 | 0%
19 | map | N/A | 0.5000 | 0.5000 | 0%
2 | map | 0 | 0.5385| 0.5323 | -1.15%
22 | map | 0.2000 | 0.0357| 0.0435 | 21.85%
24 | map | 1 | 1 | 1 | 0%
26 | map | 0.1111 | 0.1074| 0.0955 | -11.08%
28 | map | 0 | 0.2262 | 0.2429 | 7.38%
4 | map | 0.0312 | 0.5614| 0.5485 | -2.30%
6 | map | 0 | 0.1667 | 0.1429 | -14.28%
7 | map | 0 | 0.1778 | 0.2167 | 21.88%
9 | map | 0 | 0.0588 | 0.2500 | 325.17%
**all** | **map** | **0.1971** | **0.3897** | **0.4121** | **5.75%**
**all** | **gm_map** | **0.0015** | **0.2427** | **0.2811** | **15.82%**
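
The Improvement column is the relative change from Q3 to Q4. A minimal sketch of that calculation for the two overall rows, with an illustrative function name:

```python
def relative_change(old, new):
    """Percentage change from old to new; undefined when old is 0."""
    if old == 0:
        return None
    return (new - old) / old * 100


map_gain = relative_change(0.3897, 0.4121)     # overall map, Q3 -> Q4
gm_map_gain = relative_change(0.2427, 0.2811)  # overall gm_map, Q3 -> Q4
print(round(map_gain, 2), round(gm_map_gain, 2))
```

The change is undefined when the earlier value is 0, which is why a percentage cannot be given for topics that scored 0 in an earlier run.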

<br><br>
**Confusion Matrix (topic #16): topic#16's map increased by 33.12% from Q3**

Confusion Matrix | Relevant | Non-Relevant
--- | --- | ---
*Retrieved* | 7 | 1104
*Not Retrieved* | 0 | 2967

**False Positive:**
<br>There are still **1111** retrieved documents in total, and **1104** of them are not relevant, so precision remains very low.
- **IMPROVEMENT** - the first True Positive document, ***G00-21-2114990***, is now ranked **4th (recip_rank = 0.25)** among the retrieved documents (in Q3 the first True Positive was ranked 12th).

**False Negative:**
<br>All False Negative documents are eliminated.<br>

### Q4 (d): Provide answer to Q4 (d) here [markdown cell]
YES, refer to the table in Q4 (c).

### Q4 (e): Provide answer to Q4 (e) here [markdown cell]
YES, refer to the table in Q4 (c).

### Q4 (f): Provide answer to Q4 (f) here [markdown cell]
It is a good idea.
- The map score, both overall and for topic #16, increased from Q3 (and therefore also from Q2).
- The gm_map score also increased from Q3 and Q2.
- Although the number of retrieved documents and the number of retrieved relevant documents did not increase from Q3, the relevant documents are now ranked higher than in Q3 and Q2, as shown by the higher map, gm_map and recip_rank.
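
The last point can be checked with a short sketch: set-based precision and recall depend only on the counts, while reciprocal rank depends on where the first relevant document appears. The function names are illustrative; the document IDs are the top four results for topic #16 in Q4 and two of the relevant documents listed earlier.

```python
def reciprocal_rank(ranked_docs, relevant_docs):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            return 1.0 / rank
    return 0.0


def precision_recall(num_ret, num_rel, num_rel_ret):
    """Set-based precision and recall from trec_eval counts."""
    precision = num_rel_ret / num_ret if num_ret else 0.0
    recall = num_rel_ret / num_rel if num_rel else 0.0
    return precision, recall


# Top four documents for topic #16 in Q4, copied from the ranking above.
q4_top = ["G00-34-3591274", "G00-68-3661801", "G00-70-2681284", "G00-21-2114990"]
relevant = {"G00-21-2114990", "G00-21-0494028"}  # two of the seven relevant documents

rr = reciprocal_rank(q4_top, relevant)
precision, recall = precision_recall(num_ret=1111, num_rel=7, num_rel_ret=7)
print(rr, round(precision, 4), recall)
```

Precision and recall are identical for Q3 and Q4 because the same documents are retrieved; only rank-sensitive measures such as map and recip_rank register the improvement.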

## Validation

In [None]:
# Run the following cells to make sure your code returns the correct value types

In [52]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [55]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [134]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [142]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
