## Assignment 1 - Information Retrieval 

### Q1 : a) Which of trec_eval's measures may be appropriate for measuring search system performance for government websites?

trec_eval is a tool used to evaluate information retrieval results. Government search requests may have many different requirements: from staff in a department looking up all instances of a particular regulation, to individuals searching for more information on a given topic.

In many cases `map` (mean average precision) is the most important measure.


### Q1 : b) Why do you think this measure is appropriate?

The `map` measure reports mean average precision. For each query, average precision is approximately the area under that query's precision-recall curve, and MAP is the mean of those values over all queries. Overall it indicates whether a) the returned results are relevant according to what the judges deemed important to the information need, and b) whether the relevant results occurred near the top of the ranked list.
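To make the definition concrete, here is a minimal sketch of average precision and MAP in plain Python. The function names and the toy document IDs are hypothetical and are not part of the assignment data; trec_eval's own implementation may differ in details such as tie handling.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query.

    Sums precision@k at every rank k that holds a relevant document, then
    divides by the total number of relevant documents, so relevant
    documents that were never retrieved contribute zero.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP: the mean of the per-query average precision values."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries with made-up document IDs.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}

print(average_precision(runs["q1"], qrels["q1"]))  # (1/1 + 2/3) / 2
print(mean_average_precision(runs, qrels))
```

In the toy data, query `q1` finds both relevant documents (at ranks 1 and 3), while `q2` retrieves none, which pulls the mean down. This is why a single badly served topic can noticeably lower MAP on a small topic set.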

In [1]:
# Install whoosh
!pip --quiet install whoosh
!pip --quiet install nltk

In [1]:
# Extract dataset
#/resources/data/DSS_Fall2016_Assign1/government

In [2]:
#compile trec_eval
!make -s -w -C /resources/data/DSS_Fall2016_Assign1/trec_eval.8.1 > /dev/null 2>&1

In [3]:
#import necessary libraries
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os, os.path
import shutil

In [4]:
# define constants for file paths
DOCUMENTS_DIR = "/resources/data/DSS_Fall2016_Assign1/government/documents"
INDEX_DIR = "/resources/data/DSS_Fall2016_Assign1/government/index1"
QUER_FILE = "/resources/data/DSS_Fall2016_Assign1/government/topics/gov.topics"
QRELS_FILE = "/resources/data/DSS_Fall2016_Assign1/government/qrels/gov.qrels"
OUTPUT_FILE = "/resources/data/DSS_Fall2016_Assign1/government/myres"
TREC_EVAL = "/resources/data/DSS_Fall2016_Assign1/trec_eval.8.1/trec_eval"
# file path constants for question 3
INDEX_DIR2 = "/resources/data/DSS_Fall2016_Assign1/government/index2"
OUTPUT_FILE2 = "/resources/data/DSS_Fall2016_Assign1/government/myres2"


### Building the index

In [5]:
# define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

In [6]:
# if index exists - remove it
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
myIndex = index.create_in(INDEX_DIR, mySchema)

### Indexing the files

In [7]:
#a review of the documents in our dataset
!ls $DOCUMENTS_DIR

00  05	10  15	20  25	30  35	40  45	50  55	60  65	70  75	80  85	90  95
01  06	11  16	21  26	31  36	41  46	51  56	61  66	71  76	81  86	91  96
02  07	12  17	22  27	32  37	42  47	52  57	62  67	72  77	82  87	92  97
03  08	13  18	23  28	33  38	43  48	53  58	63  68	73  78	83  88	93  98
04  09	14  19	24  29	34  39	44  49	54  59	64  69	74  79	84  89	94  99


In [8]:
# build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [9]:
# count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [11]:
# open writer
myWriter = writing.BufferedWriter(myIndex, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r") as f:
            fileContent = f.read()
            myWriter.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


### Querying

In [12]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

In [14]:
# run a sample query for the phrase "mining"
sampleQuery = myQueryParser.parse("mining")
sampleQueryResults = mySearcher.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-27-2048511 0 11.170158826961423
G00-23-3149835 1 10.652009691707796
G00-21-2004003 2 10.503256176663015
G00-02-0351712 3 10.400390721635022
G00-42-1455285 4 9.983462665074871
G00-80-2792256 5 9.968921376585962
G00-09-3243231 6 9.785659651660213
G00-07-3659195 7 9.785659651660213
G00-32-2907392 8 9.6203188529462
G00-34-1044519 9 9.450078792832446
G00-94-0326199 10 9.242492296562132
G00-74-1802348 11 9.129242769451768
G00-15-3106058 12 8.988621065292778
G00-90-0342721 13 8.763825043794746
G00-78-2441218 14 8.469978686870341
G00-12-1831266 15 8.230056959150348
G00-62-3289850 16 8.131299320987106
G00-67-2481553 17 8.079090699241481
G00-86-3214229 18 7.909972777923419
G00-00-0681214 19 7.605797631067926
G00-01-2689026 20 6.9793862722689
G00-31-1216640 21 6.9793862722689
G00-44-0995015 22 6.918337097042838
G00-07-1172041 23 6.728688991101429
G00-40-0635599 24 6.659445406770013
G00-05-2231767 25 6.591612449542602
G00-09-1443772 26 6.316697752806301
G00-50-1708114 27 5.529164039346947
G00-

### Evaluate Results Using Trec_Eval

In [13]:
# print the topic file
!cat $QUER_FILE

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education


In [14]:
# print the first 10 lines in the qrels file
!head -n 10 $QRELS_FILE

1 0 G00-00-0681214 0
1 0 G00-00-0945765 0
1 0 G00-00-1006224 1
1 0 G00-00-1591495 0
1 0 G00-00-2764912 0
1 0 G00-00-3253540 0
1 0 G00-00-3717374 0
1 0 G00-01-0270065 0
1 0 G00-01-0400712 0
1 0 G00-01-0682299 0
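Each qrels line above appears to follow the usual four-column layout: topic ID, an iteration field, document ID, and a relevance judgment. A minimal sketch of loading such lines into a per-topic set of relevant documents follows; `parse_qrels` is a hypothetical helper, and treating any judgment greater than zero as relevant is an assumption.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each topic ID to the set of document IDs judged relevant.

    Assumes four whitespace-separated columns per line:
    topic_id, iteration, doc_id, judgment (> 0 means relevant).
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, judgment = parts
        if int(judgment) > 0:
            relevant[topic_id].add(doc_id)
    return dict(relevant)


# The first three lines of the qrels output shown above.
sample = [
    "1 0 G00-00-0681214 0",
    "1 0 G00-00-0945765 0",
    "1 0 G00-00-1006224 1",
]
qrels = parse_qrels(sample)
print(qrels)
```

Only `G00-00-1006224` is judged relevant in this sample, so it is the only document kept for topic 1.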


In [15]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser.parse(topic_phrase)
    topicResults = mySearcher.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf amoogle\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
topicsFile.close()

In [16]:
#ensure output file contains contents 
!head -n 10 $OUTPUT_FILE

1 Q0 G00-90-0342721 0 26.645398 amoogle
2 Q0 G00-22-3396139 0 17.262139 amoogle
2 Q0 G00-76-0415824 1 10.597055 amoogle
2 Q0 G00-78-1531079 2 8.778648 amoogle
2 Q0 G00-15-1718631 3 8.076860 amoogle
2 Q0 G00-70-2787853 4 6.788751 amoogle
2 Q0 G00-74-1394517 5 3.368380 amoogle
4 Q0 G00-99-2247765 0 16.449155 amoogle
4 Q0 G00-85-1525415 1 13.364613 amoogle
4 Q0 G00-05-1218739 2 12.956314 amoogle


In [17]:
# evaluate the output against the qrels with trec_eval
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret        	1	1
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	6
num_rel        	2	2
num_rel_ret    	2	0
map            	2	0.0000
R-prec         	2	0.0000
bpref          	2	0.0000
recip_rank     	2	0.0000
ircl_prn.0.00  	2	0.0000
ircl_prn.0.10  	2	0.0000
ircl_prn.0.20  	2	0.0000
ircl_prn.0.30  	2	0.0000
ircl_prn.0.40  	2	0.0000
ircl_prn.0.50  	

### Q2 a) How well did the baseline Whoosh system do on your chosen measures? 

Based on the MAP measure, the baseline performed poorly, with an overall MAP of 0.1971.

### Q2 b) Are there any particular topics where it did very well, or very badly? 

It performed well on topics: 24, 18

It performed poorly on topics: 26, 22, 16, 14, 10, 9, 7, 6, 4, 2, 1
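A per-topic ranking like this can be read off the `trec_eval -q` output programmatically. The sketch below is one possible approach: `per_topic_measure` is a hypothetical helper, the sample rows are taken from the output above, and the layout of the summary row (topic column `all`) is an assumption about the output format.

```python
def per_topic_measure(trec_eval_output, measure="map"):
    """Extract {topic_id: value} for one measure from `trec_eval -q` text.

    Assumes each row has three whitespace-separated columns:
    measure name, topic ID, value. The summary row (assumed to use the
    topic ID "all") is skipped.
    """
    scores = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != measure:
            continue
        _, topic_id, value = parts
        if topic_id == "all":
            continue
        scores[topic_id] = float(value)
    return scores


# Rows for topics 1 and 2 copied from the output above; the "all" row
# uses the overall MAP reported in Q2 a) and an assumed row format.
sample_output = """
num_rel         1   5
map             1   0.0000
num_rel         2   2
map             2   0.0000
map             all 0.1971
"""

map_by_topic = per_topic_measure(sample_output)
# Sort topics from worst to best on the chosen measure.
for topic_id, value in sorted(map_by_topic.items(), key=lambda kv: kv[1]):
    print(topic_id, value)
```

Sorting by value makes it easy to list the weakest topics first, and passing a different `measure` (for example `num_rel`) reuses the same parser.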

### Q3 a) What do you think would improve Whoosh's performance on this test collection, and why? 

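Performance could potentially be improved by adding filters to the analyzer after the tokenizer, such as a lowercase filter, intraword filter, stop filter and stem filter. The baseline index uses only `RegexTokenizer()`, so terms are matched exactly as written and a query such as "Physical Fitness" will not match a document that contains "physical fitness".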
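The effect of normalization can be illustrated without Whoosh. The following is a simplified sketch in plain Python, not Whoosh's actual analyzer: the `\w+` pattern, the tiny stop list, and the document text are all assumptions made for illustration. Only the query "Physical Fitness" comes from the topic file.

```python
import re

# Hypothetical, very small stop list (Whoosh's StopFilter has its own).
STOP_WORDS = {"and", "the", "of", "a", "in", "on"}


def tokenize(text):
    """Split text into word tokens, keeping the original case."""
    return re.findall(r"\w+", text)


def normalize(text):
    """Tokenize, lowercase, and drop stop words."""
    lowered = (token.lower() for token in tokenize(text))
    return [token for token in lowered if token not in STOP_WORDS]


query = "Physical Fitness"  # topic 10 in the topic file
document = "guidelines on physical fitness and the health of adults"  # made up

# Without normalization the capitalized query terms match nothing.
raw_overlap = set(tokenize(query)) & set(tokenize(document))
# With lowercasing (and stop-word removal) both query terms match.
normalized_overlap = set(normalize(query)) & set(normalize(document))

print(sorted(raw_overlap))
print(sorted(normalized_overlap))
```

Case-sensitive matching is a plausible reason why topics written with capital letters score poorly in the baseline, although the per-topic results would need to be checked to confirm it.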

### Evaluating Additional Configurations

In [14]:
# build an analyzer that tokenizes, lowercases, splits intra-word parts, removes stop words and stems
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()

In [15]:
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = stmLwrStpIntraAnalyzer))

In [16]:
# if index exists - remove it
if os.path.isdir(INDEX_DIR2):
    shutil.rmtree(INDEX_DIR2)

# create the directory for the index
os.makedirs(INDEX_DIR2)

# create index (any existing index was removed above)
myIndex2 = index.create_in(INDEX_DIR2, mySchema2)

In [13]:
# open writer
myWriter2 = writing.BufferedWriter(myIndex2, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r") as f:
            fileContent = f.read()
            myWriter2.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter2.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [17]:
# define a query parser for the field "file_content" in the index
myQueryParser2 = QueryParser("file_content", schema=myIndex2.schema)
mySearcher2 = myIndex2.searcher()

In [18]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile2 = open(OUTPUT_FILE2, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser2.parse(topic_phrase)
    topicResults = mySearcher2.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile2.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile2.close()
topicsFile.close()

In [22]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE2

### Q3 b) What modifications did you make and what were the improvements?

write write

### Q3 c) Did your changes improve things overall?

write write

### Q3 d) Did some queries get better while others got worse?

write write

### Q3 e) What do you think this means for your idea?

write write

### Q4 b) Additional iteration (graduate students). Try alternative techniques for improving performance and answer the Q3 questions again.

write write