# Assignment 2: IR

## Preparations
* Put all your imports and path constants in the next cells.
* Make sure ***MATERIALS_DIR*** points to the directory where you extracted the Zip file.
* Make sure all your paths are **relative to** ***MATERIALS_DIR*** and **NOT hard-coded** in your code.

In [1]:
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os
import shutil

In [2]:
MATERIALS_DIR = r"C:\DSS_Fall2017_Assign2"

DOCUMENTS_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents")
INDEX_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index1")
QUER_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\topics\gov.topics")
QRELS_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\qrels\gov.qrels")
OUTPUT_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres")
TREC_EVAL = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\trec_eval\trec_eval.exe")
INDEX_DIR2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index2")
OUTPUT_FILE2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres2")

## Question 1

### Q1 (a):
Mean Average Precision (MAP)

### Q1 (b):
MAP takes the order of the retrieved documents into account: a higher MAP score indicates that relevant documents are placed nearer the top of the ranking, so users can find relevant results with less effort.
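To make the rank sensitivity concrete, here is a minimal sketch of how average precision and MAP can be computed. The function names and the toy document IDs are illustrative assumptions and are not part of the assignment materials or of `trec_eval`.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query.

    Sums precision@k at every rank k where a relevant document appears and
    divides by the total number of relevant documents, so relevant documents
    that are never retrieved contribute zero.
    """
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """MAP: mean of per-query average precision.

    `runs` is a list of (ranked_doc_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical IDs): the same relevant document at rank 1 vs rank 4.
early = average_precision(["d1", "d2", "d3", "d4"], {"d1"})
late = average_precision(["d2", "d3", "d4", "d1"], {"d1"})
print(early, late)
```

The same set of retrieved documents scores differently depending only on where the relevant document sits, which is the property the answer above relies on.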

## Question 2

### Q2 (a): Write your code below

In [3]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2

In [4]:
# first we build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [5]:
# print the first 5 paths to make sure it worked
print("\n".join(filesToIndex[:5]))

C:\DSS_Fall2017_Assign2\DSS_Fall2017_Assign2\government\documents\00\G00-00-0088569
C:\DSS_Fall2017_Assign2\DSS_Fall2017_Assign2\government\documents\00\G00-00-0114013
C:\DSS_Fall2017_Assign2\DSS_Fall2017_Assign2\government\documents\00\G00-00-0124389
C:\DSS_Fall2017_Assign2\DSS_Fall2017_Assign2\government\documents\00\G00-00-0158061
C:\DSS_Fall2017_Assign2\DSS_Fall2017_Assign2\government\documents\00\G00-00-0165832


In [6]:
# count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [7]:
# first, define a Schema for the index
mySchema = Schema(file_path=ID(stored=True),
                  file_content=TEXT(analyzer=RegexTokenizer()))

In [8]:
# if index exists - remove it
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
INDEX_Q2 = index.create_in(INDEX_DIR, mySchema)

In [9]:
# open writer
myWriter = writing.BufferedWriter(INDEX_Q2, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding="utf-8") as f:
            fileContent = f.read()
            myWriter.add_document(file_path=filePath,
                                  file_content=fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [10]:
# define a query parser for the field "file_content" in the index
QP_Q2 = QueryParser("file_content", schema=INDEX_Q2.schema)
SEARCHER_Q2 = INDEX_Q2.searcher()

In [11]:
# print the topic file
!cat $QUER_FILE

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education


In [12]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE, "r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q2.parse(topic_phrase)
    topicResults = SEARCHER_Q2.search(topicQuery, limit=None)
    
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
topicsFile.close()
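The loop above writes one line per retrieved document in the six-column run format read by `trec_eval`: topic ID, the literal `Q0`, document ID, rank, score, and a run tag. As a small sketch, the formatting step could be isolated in a helper like the one below. The name `format_trec_line` is a hypothetical helper introduced here for illustration and does not exist in the notebook.

```python
def format_trec_line(topic_id, doc_id, rank, score, run_tag="test"):
    """Format one result as a TREC run line.

    Columns: topic ID, literal "Q0", document ID, rank, score, run tag.
    """
    return "%s Q0 %s %d %f %s\n" % (topic_id, doc_id, rank, score, run_tag)


# Example using a document ID and score of the kind the searcher returns.
line = format_trec_line("14", "G00-89-0000000", 3, 15.38736843598047)
print(line, end="")
```

Keeping the format string in one place would avoid repeating it in every evaluation cell.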

In [13]:
# INDEX_Q2 = None # Replace None with your index for Q2
# QP_Q2 = None # Replace None with your query parser for Q2
# SEARCHER_Q2 = None # Replace None with your searcher for Q2

In [14]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


### Q2 (b):
For all queries, the baseline Whoosh system has an overall MAP score of 0.1971.

### Q2 (c):
For topics 10 (Physical Fitness), 14 (Agricultural biotechnology), 18 (Shipwrecks), 22 (Veteran's Benefits) and 24 (Air Bag Safety), MAP is the highest of the reported metrics (topic 10: 0.1667, topic 14: 0.25, topic 22: 0.2, topics 18 and 24: 1.0). For topic 4 (wireless communications), MAP (0.0312) is lower than some of the other measures, such as P_10 (0.10). For topics 1, 2, 6, 7, 9, 16 and 28, MAP is 0. Topic 19 (Cybercrime, internet fraud, and cyber fraud) does not return any results with this schema.

## Question 3

### Q3 (a):
These documents are ranked highly because they receive high scores from BM25F, the default scoring model in Whoosh. According to Wikipedia (https://en.wikipedia.org/wiki/Okapi_BM25), given a query containing keywords q_i, the BM25F score of a document is: ![alt-text](https://wikimedia.org/api/rest_v1/media/math/render/svg/43e5c609557364f7836b6b2f4cd8ea41deb86a96)

where IDF is the inverse document frequency of a keyword computed as: ![alt-text](https://wikimedia.org/api/rest_v1/media/math/render/svg/c652b6871ce4872c8e924ff0f806bc8b06dc94ed)

In this BM25F model, the score combines the IDF of each query keyword with the term frequency of that keyword in the document. The main idea of IDF is to penalize terms that appear in many documents of the collection (high document frequency) and to reward terms that occur rarely. As a result, a keyword that occurs frequently in a given document but rarely across the whole collection receives a higher weight.
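The interaction between IDF and term frequency can be sketched in a few lines. This is a simplified single-field BM25 sketch, not Whoosh's implementation: the smoothed IDF variant and the defaults `k1=1.2`, `b=0.75` are assumptions (implementations differ in both), and the document frequencies are made-up values. Only the collection size of 4078 comes from the file count above.

```python
import math


def idf(num_docs, doc_freq):
    # One common smoothed IDF variant (assumption; other variants exist).
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of a single query term to a document's BM25 score."""
    # Length normalization: documents longer than average are penalized.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * (tf * (k1 + 1.0)) / (tf + k1 * norm)


N = 4078  # number of indexed files reported above

# Same term frequency, different document frequency (hypothetical values).
rare = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, num_docs=N, doc_freq=10)
common = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, num_docs=N, doc_freq=2000)
print("rare term:", rare, "common term:", common)

# Term frequency saturates: each extra occurrence adds less than the previous one.
gains = [
    bm25_term_score(tf + 1, 100, 100, N, 10) - bm25_term_score(tf, 100, 100, N, 10)
    for tf in range(1, 6)
]
print("marginal gains:", gains)
```

A term that is rare in the collection dominates the score, and repeating a term many times in one document yields diminishing returns.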

### False Positive/Negative Study Using Topic 14 "Agricultural biotechnology" and Its Variation:

In [15]:
# run a sample query for topic 14: "Agricultural biotechnology"
sampleQuery = QP_Q2.parse("Agricultural biotechnology")
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
if not sampleQueryResults:
    print("Nothing found.")
else:
    for (docnum, result) in enumerate(sampleQueryResults):
        score = sampleQueryResults.score(docnum)
        fileName = os.path.basename(result["file_path"])
        print(fileName, docnum, score)

G00-79-4144643 0 16.029011125416673
G00-09-1193469 1 15.95192473760627
G00-45-0809730 2 15.654311255102378
G00-89-0000000 3 15.38736843598047
G00-88-1894712 4 8.201903512787501
G00-34-0444524 5 3.247974551340488
G00-84-0274223 6 1.5484756466896998


In [16]:
!grep -rn $QRELS_FILE -e "14 0 G00-79-4144643"
!grep -rn $QRELS_FILE -e "14 0 G00-09-1193469"
!grep -rn $QRELS_FILE -e "14 0 G00-45-0809730"
!grep -rn $QRELS_FILE -e "14 0 G00-89-0000000"

1348:14 0 G00-79-4144643 0
1272:14 0 G00-09-1193469 0
1322:14 0 G00-45-0809730 0
1356:14 0 G00-89-0000000 1


The highly ranked documents contain tokens that exactly match the query (e.g. "Agricultural biotechnology"). If a document contains the word "biotechnology" with a high term frequency and the word has a low document frequency, the document may be ranked highly without necessarily being related to "Agricultural biotechnology". Conversely, a document that is relevant to this topic but uses frequent terms such as "agriculture" and "technology", which do not match the query exactly, drops out of the top ranks even though it should have been ranked highly.

Above, we run topic 14 (Agricultural biotechnology) as a sample query, and Whoosh retrieves 7 results. Cross-referencing with the qrels file shows that none of the top 3 ranked documents "G00-79-4144643", "G00-09-1193469" and "G00-45-0809730" is actually relevant to this query, i.e. they are false positives, whereas the 4th-ranked document "G00-89-0000000", which is labelled relevant for topic 14, should have been ranked higher.

Looking at the actual content of a highly ranked document such as "G00-79-4144643", the word "Agricultural" has a very high term frequency on its own but is rarely seen together with the word "biotechnology". This is why "G00-79-4144643" is scored higher than "G00-89-0000000" even though "G00-89-0000000" is the truly relevant document.

However, the scores of the top 4 documents are very close, so there may be room for improvement if a more advanced schema lets the query keywords weigh more in the scoring model. Here we aim to improve the system so that the more relevant documents (e.g. "G00-89-0000000") move to higher ranks (a higher MAP score).
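The cross-reference against the qrels file can also be done in Python instead of `grep`. The sketch below assumes the four-column qrels layout visible in the `grep` output above (topic, iteration, document ID, relevance) and uses only those four judgments as sample data. `load_qrels` is a hypothetical helper name, and treating unjudged documents as non-relevant is an assumption of this sketch.

```python
def load_qrels(lines):
    """Parse qrels lines '<topic> <iteration> <doc_id> <relevance>'
    into a dict keyed by (topic, doc_id)."""
    judgments = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        judgments[(topic, doc_id)] = int(relevance)
    return judgments


# The four judgments shown in the grep output for topic 14.
sample_qrels = [
    "14 0 G00-79-4144643 0",
    "14 0 G00-09-1193469 0",
    "14 0 G00-45-0809730 0",
    "14 0 G00-89-0000000 1",
]
judgments = load_qrels(sample_qrels)

# Top 4 documents of the baseline ranking for "Agricultural biotechnology".
ranking = ["G00-79-4144643", "G00-09-1193469", "G00-45-0809730", "G00-89-0000000"]

# Documents without a judgment are treated as non-relevant (assumption).
labels = [judgments.get(("14", doc_id), 0) for doc_id in ranking]
false_positives = [d for d, rel in zip(ranking, labels) if rel == 0]
first_relevant_rank = labels.index(1)  # 0-based, like the ranks printed above

print("false positives:", false_positives)
print("first relevant document at rank:", first_relevant_rank)
```

With the real file, `load_qrels(open(QRELS_FILE))` would give the same lookup for every topic at once.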

In [17]:
# run a sample topic 14 query variation: "agricultural biotechnology"
sampleQuery = QP_Q2.parse("agricultural biotechnology")
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
if not sampleQueryResults:
    print("Nothing found.")
else:
    for (docnum, result) in enumerate(sampleQueryResults):
        score = sampleQueryResults.score(docnum)
        fileName = os.path.basename(result["file_path"])
        print(fileName, docnum, score)

G00-51-1924264 0 19.030803336624324
G00-09-1193469 1 16.430263786266377
G00-45-0809730 2 16.088674406075803
G00-79-4144643 3 13.725541733913213
G00-88-1894712 4 10.807640941786207
G00-21-4119651 5 6.916103279158381
G00-34-0444524 6 5.772794096660242
G00-84-0274223 7 4.265523383740064


In this query we use a minor variation, the lowercase "agricultural biotechnology" instead of the original "Agricultural biotechnology", but the relevant document "G00-89-0000000" retrieved by the original query does not appear in the list of results. This is a false negative, because the two queries mean essentially the same thing.

To summarize, we can try to improve the system performance by applying several layers of filters, such as lowercasing, stop-word removal, intra-word splitting and stemming, in a new text analyzer used to index the files. This would merge similar keywords (e.g. "Agricultural" and "agricultural" above) and potentially increase the weight of the terms that are more relevant.
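The effect of case folding and stop-word removal can be illustrated without Whoosh. The sketch below is a pure-Python stand-in for an analyzer chain, not the Whoosh filters themselves: the `\w+` token pattern and the tiny stop-word list are assumptions chosen for illustration.

```python
import re

# Tiny illustrative stop-word list (assumption, far shorter than a real one).
STOP_WORDS = {"a", "and", "in", "of", "the"}


def case_sensitive_tokens(text):
    """Tokenize only, roughly what a bare tokenizer does."""
    return re.findall(r"\w+", text)


def normalized_tokens(text):
    """Tokenize, lowercase, then drop stop words."""
    lowered = (token.lower() for token in re.findall(r"\w+", text))
    return [token for token in lowered if token not in STOP_WORDS]


original = "Agricultural biotechnology"
variant = "agricultural biotechnology"

# Without normalization the two queries produce different index terms.
print(case_sensitive_tokens(original), case_sensitive_tokens(variant))

# With normalization they collapse to the same terms.
print(normalized_tokens(original), normalized_tokens(variant))
```

Because the same chain runs at index time and at query time, both spellings end up looking up the same terms.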

### Q3 (b): Write your code below

In [18]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3

In [19]:
# import nltk
from nltk.stem import *
# download required resources
#nltk.download("wordnet")

# Dont change this! Use it as-is in your code
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [20]:
# Create filters with a lemmatizer
wnFilter = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(WordNetLemmatizer().lemmatize)

In [21]:
wnSchema = Schema(file_path=ID(stored=True),
                  file_content=TEXT(analyzer=wnFilter))

In [22]:
# if index exists - remove it
if os.path.isdir(INDEX_DIR2):
    shutil.rmtree(INDEX_DIR2)

# create the directory for the index
os.makedirs(INDEX_DIR2)

# create the index
INDEX_Q3 = index.create_in(INDEX_DIR2, wnSchema)

In [23]:
# open writer
myWriter2 = writing.BufferedWriter(INDEX_Q3, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r", encoding="utf-8") as f:
            fileContent = f.read()
            myWriter2.add_document(file_path=filePath,
                                   file_content=fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter2.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [24]:
# define a query parser for the field "file_content" in the index
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema)
SEARCHER_Q3 = INDEX_Q3.searcher()

In [25]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile2 = open(OUTPUT_FILE2, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q3.parse(topic_phrase)
    topicResults = SEARCHER_Q3.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile2.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile2.close()
topicsFile.close()

In [26]:
# INDEX_Q3 = None # Replace None with your index for Q3
# QP_Q3 = None # Replace None with your query parser for Q3
# SEARCHER_Q3 = None # Replace None with your searcher for Q3

In [27]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE2

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.0000


### Q3 (c):
In this part, we apply LowercaseFilter, IntraWordFilter and StopFilter, followed by a lemmatizer based on the NLTK WordNetLemmatizer.

Overall, most queries improve significantly. Across all topics, the MAP score increases from 0.1971 to 0.3402. Topics 2 (0 to 0.5), 4 (0.0312 to 0.5375), 9 (0 to 0.0435), 10 (0.1667 to 0.3333), 14 (0.25 to 1.0), 19 (no results to 0.5), 22 (0 to 0.0385) and 28 (0 to 0.0729) all improve. There is also a minor setback: for topic 26, MAP drops from 0.1111 to 0.0771.

In [28]:
# run a sample topic 14 query variation: "agricultural biotechnology"
sampleQuery = QP_Q3.parse("agricultural biotechnology")
sampleQueryResults = SEARCHER_Q3.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
if not sampleQueryResults:
    print("Nothing found.")
else:
    for (docnum, result) in enumerate(sampleQueryResults):
        score = sampleQueryResults.score(docnum)
        fileName = os.path.basename(result["file_path"])
        print(fileName, docnum, score)

G00-89-0000000 0 18.214011639257414
G00-79-4144643 1 16.785311512357875
G00-01-3251318 2 16.422827720764104
G00-97-1475424 3 16.422827720764104
G00-51-1924264 4 16.122169342434024
G00-09-1193469 5 14.744013233628984
G00-45-0809730 6 14.606627550144285
G00-36-2788975 7 14.212273794173173
G00-72-0385489 8 14.134450247884832
G00-97-2215955 9 13.864211664168874
G00-91-1609512 10 13.66634773244634
G00-35-2527252 11 13.237218534622421
G00-86-0847220 12 12.021346487775538
G00-06-2853218 13 11.80498173034213
G00-06-0690672 14 11.759498291565809
G00-70-3424520 15 11.667867670919065
G00-10-2024294 16 11.451648390647527
G00-88-1894712 17 11.252756736452708
G00-07-2371962 18 9.687714705407522
G00-21-4119651 19 5.8381231007890175
G00-34-0444524 20 5.598449430181252
G00-84-0274223 21 4.424027712336162
G00-46-0840102 22 2.001014379880117
G00-82-3144058 23 1.7428418759257123


In [29]:
!grep -rn $QRELS_FILE -e "14 0 G00-89-0000000"
!grep -rn $QRELS_FILE -e "14 0 G00-79-4144643"
!grep -rn $QRELS_FILE -e "14 0 G00-01-3251318"
!grep -rn $QRELS_FILE -e "14 0 G00-97-1475424"

1356:14 0 G00-89-0000000 1
1348:14 0 G00-79-4144643 0
1204:14 0 G00-01-3251318 0
1359:14 0 G00-97-1475424 0


As seen above, for the topic 14 variation "agricultural biotechnology", the modified system not only retrieves the relevant document "G00-89-0000000" but also ranks it first (MAP improves from 0.25 to 1.0). This resolves the false negative in this case, although many false positives remain in the current setup.

### Q3 (d):
Yes

### Q3 (e):
Yes

### Q3 (f):
Given the major improvements across most of the queries, I believe the idea is implemented correctly and the modified system works reasonably well. Preprocessing with layers of filters is necessary so that the relevant information is properly extracted during indexing and parsed at query time. Resolving the false positive cases needs further investigation.

## Question 4 (Graduate Students)

In [30]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a):
Here we use the topic 2 "juvenile delinquency" as a sample query to investigate the false positive/negative cases in the improved system.

### False Positive/Negative Study Using Topic 2 "juvenile delinquency":

At present, the system scores a MAP of 0.5 on topic 2 and retrieves only 1 (num_rel_ret: 1) of the 2 relevant documents (num_rel: 2), which is a false negative (1 relevant document is not retrieved). As for false positives, the list of retrieved documents still contains non-relevant documents.
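The false positive and false negative counts follow directly from the three counters that `trec_eval` prints per topic. The sketch below uses a hypothetical helper, `error_counts`, and plugs in the counters for topics 1 and 10 from the Q3 `trec_eval` output.

```python
def error_counts(num_ret, num_rel, num_rel_ret):
    """Derive (false_positives, false_negatives) from trec_eval counters.

    false positives = retrieved documents that are not relevant
    false negatives = relevant documents that were not retrieved
    """
    false_positives = num_ret - num_rel_ret
    false_negatives = num_rel - num_rel_ret
    return false_positives, false_negatives


# Topic 1 in the Q3 output: num_ret=3, num_rel=5, num_rel_ret=0
print("topic 1:", error_counts(3, 5, 0))

# Topic 10 in the Q3 output: num_ret=24, num_rel=1, num_rel_ret=1
print("topic 10:", error_counts(24, 1, 1))
```

For topic 2 the same subtraction gives 2 - 1 = 1 false negative, matching the description above.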

### Q4 (b): Write your code below

In [33]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

### Weighting Scheme Using TF-IDF

In [34]:
from whoosh import scoring

# Weighting scheme using tf-idf
OUTPUT_FILE_TFIDF = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres_tfidf")
SEARCHER_TFIDF = INDEX_Q3.searcher(weighting=scoring.TF_IDF())

In [35]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile_tfidf = open(OUTPUT_FILE_TFIDF, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q3.parse(topic_phrase)
    topicResults = SEARCHER_TFIDF.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile_tfidf.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile_tfidf.close()
topicsFile.close()

In [36]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE_TFIDF

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2000
Rprec                 	10	0.0000


As seen above, the overall performance decreases with the tf-idf scheme. Across all queries, the MAP score drops from 0.3402 to 0.1559. Hence, the following sections continue to use the BM25F scoring model.

### Parsing Query Using "OR" Group

In [69]:
from whoosh import qparser

# Query parser that joins the query terms with OR instead of the default AND
OUTPUT_FILE_OR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres_or")
QP_OR = QueryParser("file_content", schema=INDEX_Q3.schema, group=qparser.OrGroup)

In [70]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile_OR = open(OUTPUT_FILE_OR, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_OR.parse(topic_phrase)
    topicResults = SEARCHER_Q3.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile_OR.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile_OR.close()
topicsFile.close()

In [71]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE_OR

num_ret               	1	205
num_rel               	1	5
num_rel_ret           	1	3
map                   	1	0.0307
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0588
iprec_at_recall_0.00  	1	0.0690
iprec_at_recall_0.10  	1	0.0690
iprec_at_recall_0.20  	1	0.0690
iprec_at_recall_0.30  	1	0.0690
iprec_at_recall_0.40  	1	0.0690
iprec_at_recall_0.50  	1	0.0259
iprec_at_recall_0.60  	1	0.0259
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.0667
P_100                 	1	0.0200
P_200                 	1	0.0150
P_500                 	1	0.0060
P_1000                	1	0.0030
num_ret               	10	275
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.00

As seen above, the overall MAP improves from 0.3402 to 0.3744 using a query parser with the "OR" group.
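The jump in `num_ret` (for topic 1, from 3 to 205) is what one would expect when query terms are combined with OR rather than AND. A toy inverted index makes the difference visible. The index contents and the helper names `match_all` and `match_any` are illustrative assumptions, not Whoosh code.

```python
# Toy inverted index: term -> set of document IDs (hypothetical data).
toy_index = {
    "juvenile": {"d1", "d2", "d3"},
    "delinquency": {"d2", "d4"},
}


def match_all(terms, index):
    """AND semantics: a document must contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_any(terms, index):
    """OR semantics: a document needs at least one query term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


query = ["juvenile", "delinquency"]
and_hits = match_all(query, toy_index)
or_hits = match_any(query, toy_index)
print("AND:", sorted(and_hits))
print("OR: ", sorted(or_hits))
```

OR matching always returns a superset of the AND matches, which raises recall and also admits more non-relevant documents, so the ranking function has to keep the relevant ones near the top.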

### BM25F Hyperparameter Optimization with Query Parser Using "OR" Group

In [76]:
# Weighting scheme using various BM25F hyperparameter setting
import numpy as np
import itertools

# B: 0 to 0.95 in steps of 0.05 (np.arange excludes the stop value)
# K1: 1.2 to 2.0 in steps of 0.05
B = np.arange(0, 1, 0.05)
K1 = np.arange(1.2, 2.01, 0.05)

# Generate a grid search list for the two parameters
grid_search = list(itertools.product(B, K1))
for b, k1 in grid_search:
    print(b, k1)

0.0 1.2
0.0 1.25
0.0 1.3
0.0 1.35
0.0 1.4
0.0 1.45
0.0 1.5
0.0 1.55
0.0 1.6
0.0 1.65
0.0 1.7
0.0 1.75
0.0 1.8
0.0 1.85
0.0 1.9
0.0 1.95
0.0 2.0
0.05 1.2
0.05 1.25
0.05 1.3
0.05 1.35
0.05 1.4
0.05 1.45
0.05 1.5
0.05 1.55
0.05 1.6
0.05 1.65
0.05 1.7
0.05 1.75
0.05 1.8
0.05 1.85
0.05 1.9
0.05 1.95
0.05 2.0
0.1 1.2
0.1 1.25
0.1 1.3
0.1 1.35
0.1 1.4
0.1 1.45
0.1 1.5
0.1 1.55
0.1 1.6
0.1 1.65
0.1 1.7
0.1 1.75
0.1 1.8
0.1 1.85
0.1 1.9
0.1 1.95
0.1 2.0
0.15 1.2
0.15 1.25
0.15 1.3
0.15 1.35
0.15 1.4
0.15 1.45
0.15 1.5
0.15 1.55
0.15 1.6
0.15 1.65
0.15 1.7
0.15 1.75
0.15 1.8
0.15 1.85
0.15 1.9
0.15 1.95
0.15 2.0
0.2 1.2
0.2 1.25
0.2 1.3
0.2 1.35
0.2 1.4
0.2 1.45
0.2 1.5
0.2 1.55
0.2 1.6
0.2 1.65
0.2 1.7
0.2 1.75
0.2 1.8
0.2 1.85
0.2 1.9
0.2 1.95
0.2 2.0
0.25 1.2
0.25 1.25
0.25 1.3
0.25 1.35
0.25 1.4
0.25 1.45
0.25 1.5
0.25 1.55
0.25 1.6
0.25 1.65
0.25 1.7
0.25 1.75
0.25 1.8
0.25 1.85
0.25 1.9
0.25 1.95
0.25 2.0
0.3 1.2
0.3 1.25
0.3 1.3
0.3 1.35
0.3 1.4
0.3 1.45
0.3 1.5
0.3 1.55
0.3 1.6
0.3 1.65


In [77]:
# Here we try to capture IPython output using its I/O utilities
from IPython.utils import io

# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

map_score = []
with io.capture_output() as captured:
    for b, k1 in grid_search:
        # create an output file to which we'll write our results
        output_file_tmp = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres_tmp")
        outputTRECFile_tmp = open(output_file_tmp, "w")
        searcher_tmp = INDEX_Q3.searcher(weighting=scoring.BM25F(B=b, content_B=1.0, K1=k1))

        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = QP_OR.parse(topic_phrase)
            topicResults = searcher_tmp.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile_tmp.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

        outputTRECFile_tmp.close()
        
        # Capture the TREC_EVAL printout of the MAP measure over all queries
        !$TREC_EVAL -m map $QRELS_FILE $output_file_tmp
        trec_output = captured.stdout
        # convert to float so that max() compares numerically, not lexically
        map_score.append(float(trec_output.split()[-1]))
        
    topicsFile.close()
map_max = max(map_score)
print("The max MAP score in the grid-search is: {}".format(map_max))
print("The optimal B, K1 are: {}".format(grid_search[map_score.index(map_max)]))

The max MAP score in the grid-search is: 0.3788
The optimal B, K1 are: (0.55000000000000004, 1.9000000000000006)


In [78]:
INDEX_Q4 = INDEX_Q3
QP_Q4 = QP_OR
SEARCHER_Q4 = INDEX_Q3.searcher(weighting=scoring.BM25F(B=0.55, content_B=1.0, K1=1.9))

### Q4 (c):
An "OR" group query parser is incorporated and clearly improves the overall MAP score (0.3402 to 0.3744). An exhaustive grid search over the hyperparameters B and K1 gives a further increase from 0.3744 to 0.3788 (optimal: B = 0.55, K1 = 1.9).
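The grid search pattern itself can be written independently of Whoosh and `trec_eval`. In the sketch below, `toy_objective` is a hypothetical stand-in for "run all topics with BM25F(B, K1) and read the MAP"; its shape and peak are arbitrary and carry no information about the real system. `frange` and `grid_search` are likewise illustrative helper names.

```python
import itertools


def frange(start, stop, step):
    """Inclusive float range built from integer steps to limit rounding drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def grid_search(evaluate, b_values, k1_values):
    """Return (best_score, (b, k1)) over the Cartesian product of both grids."""
    best_score, best_params = float("-inf"), None
    for b, k1 in itertools.product(b_values, k1_values):
        # Scores are converted to float so the comparison is numeric.
        score = float(evaluate(b, k1))
        if score > best_score:
            best_score, best_params = score, (b, k1)
    return best_score, best_params


def toy_objective(b, k1):
    # Hypothetical objective with an arbitrary peak at b=0.5, k1=1.5.
    return -((b - 0.5) ** 2 + (k1 - 1.5) ** 2)


b_grid = frange(0.0, 0.95, 0.05)   # 20 values
k1_grid = frange(1.2, 2.0, 0.05)   # 17 values
best_score, best_params = grid_search(toy_objective, b_grid, k1_grid)
print(len(b_grid) * len(k1_grid), "combinations; best:", best_params)
```

Building the grid from integer steps also yields clean values such as `0.55` instead of `0.55000000000000004`.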

As seen below, the MAP of topic 2 "juvenile delinquency" increases from 0.5 to 0.5345. Both relevant documents (num_rel: 2) are now retrieved (num_rel_ret: 2), which resolves the false negative, but the false positive problem becomes more severe as the number of retrieved non-relevant documents increases.

In [80]:
OUTPUT_FILE3 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres3")

# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile3 = open(OUTPUT_FILE3, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = QP_Q4.parse(topic_phrase)
    topicResults = SEARCHER_Q4.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile3.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile3.close()
topicsFile.close()

In [81]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE3

num_ret               	1	205
num_rel               	1	5
num_rel_ret           	1	3
map                   	1	0.0282
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0588
iprec_at_recall_0.00  	1	0.0588
iprec_at_recall_0.10  	1	0.0588
iprec_at_recall_0.20  	1	0.0588
iprec_at_recall_0.30  	1	0.0571
iprec_at_recall_0.40  	1	0.0571
iprec_at_recall_0.50  	1	0.0248
iprec_at_recall_0.60  	1	0.0248
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0500
P_30                  	1	0.0333
P_100                 	1	0.0200
P_200                 	1	0.0150
P_500                 	1	0.0060
P_1000                	1	0.0030
num_ret               	10	275
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.00

### Q4 (d):
Yes

### Q4 (e):
Yes

### Q4 (f):
Overall, the idea is well implemented: using the OR group and hyperparameter tuning gives the system better performance in terms of MAP. The major improvement comes from the OR group parser, because more relevant documents are retrieved, which improves the MAP score and eliminates many false negatives. Meanwhile, the false positive issue is difficult to resolve because of this trade-off.

## Validation

In [82]:
# Run the following cells to make sure your code returns the correct value types

In [83]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Path Validation

In [84]:
assert "MATERIALS_DIR" in globals(), "variable MATERIALS_DIR does not exists"
assert(os.path.isdir(os.path.join(MATERIALS_DIR))), "MATERIALS_DIR folder does not exists"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2"))), "invalid folder structure"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents"))), "invalid folder structure"
print("Paths validated")

Paths validated


### Q2 Validation

In [85]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [86]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [87]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
