# Assignment 2: IR

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [1]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
from whoosh import scoring

import os.path
from pathlib import Path
import tempfile
import subprocess

import nltk
from nltk.stem import *

In [2]:
DATA_DIR = "government"

# Put other path constants here

DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")

# For windows:
TREC_EVAL = os.path.join("trec_eval", "trec_eval.exe")
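
The constant above is Windows-specific. As a hedged sketch, the binary name could be chosen per platform; the helper name `trec_eval_path` and the bare `trec_eval` file name for non-Windows systems are assumptions, not part of the assignment:

```python
import os
import sys

def trec_eval_path(base_dir="trec_eval"):
    """Return a relative path to the trec_eval binary for the current platform."""
    # "trec_eval.exe" matches the constant above; the bare "trec_eval"
    # name used on other platforms is an assumption.
    binary = "trec_eval.exe" if sys.platform.startswith("win") else "trec_eval"
    return os.path.join(base_dir, binary)

print(trec_eval_path())
```

The path stays relative, so it keeps working when the project folder is moved.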

In [3]:
DOCUMENTS_DIR

'government\\documents'

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]


     Q1 (a): RPrec - R-Precision

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]

    Q1 (b): RPrec (R-Precision) allows the whole system to be assessed with a single number per topic. Let R be the total number of relevant documents for the topic in the collection, and let r be the number of relevant documents retrieved up to rank R. R-Precision is the ratio of r to R:
    RPrec = r/R
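
A minimal sketch of the definition above. The function name `r_precision` is illustrative, and the document IDs are copied from the topic 4 qrels and baseline results shown later in this notebook:

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Fraction of the top-R ranked documents that are relevant,
    where R is the total number of relevant documents for the topic."""
    relevant = set(relevant_doc_ids)
    R = len(relevant)
    if R == 0:
        return 0.0
    top_r = ranked_doc_ids[:R]
    r = sum(1 for doc_id in top_r if doc_id in relevant)
    return r / R

# Topic 4 ("wireless communications"): four relevant documents in the qrels file
relevant_topic4 = ["G00-03-2855342", "G00-36-1275993",
                   "G00-47-2117970", "G00-65-0162935"]
# First four documents returned by the baseline system for topic 4
baseline_top = ["G00-99-2247765", "G00-85-1525415",
                "G00-05-1218739", "G00-09-0774298"]

print(r_precision(baseline_top, relevant_topic4))
```

Only the first R results matter, so a relevant document at rank 8 does not contribute when R is 4.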

## Question 2

### Q2 (a): Write your code below

In [4]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2

def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

In [5]:
# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index in a temporary directory based on the new schema
INDEX_02 = createIndex(mySchema)

In [6]:
def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (docNum+1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the index
        writer.close()

In [7]:
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

In [8]:
filesToIndex[:5]

['government\\documents\\00\\G00-00-0088569',
 'government\\documents\\00\\G00-00-0114013',
 'government\\documents\\00\\G00-00-0124389',
 'government\\documents\\00\\G00-00-0158061',
 'government\\documents\\00\\G00-00-0165832']

In [9]:
#count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [10]:
addFilesToIndex(INDEX_02 , filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [11]:
# define a query parser for the field "file_content" in the index
QP_02 = QueryParser("file_content", schema=INDEX_02.schema)

# defining a searcher using the above index created
SEARCHER_02 = INDEX_02.searcher()

In [12]:
# run a sample query for the topic "physical therapists"
sampleQuery = QP_02.parse("physical therapists")
sampleQueryResults = SEARCHER_02.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)  # ranked by the default BM25F score

G00-26-3134051 0 13.996502415475085
G00-59-0786269 1 13.8539343648808
G00-60-3914816 2 11.345260421579761
G00-21-0649032 3 5.95590290335526
G00-45-4032177 4 5.937136717292031


### Evaluating using TREC-EVAL

In [13]:
# print the topic file
with open(TOPIC_FILE, "r") as f:
    print(f.read())

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education



In [14]:
#defining a reader object on the index
READER_02 = INDEX_02.reader()

In [46]:
#printing the first five indexed documents
[(docnum, doc_dict) for (docnum, doc_dict) in READER_02.iter_docs()][:5]

[(0, {'file_path': 'government\\documents\\00\\G00-00-0088569'}),
 (1, {'file_path': 'government\\documents\\00\\G00-00-0114013'}),
 (2, {'file_path': 'government\\documents\\00\\G00-00-0124389'}),
 (3, {'file_path': 'government\\documents\\00\\G00-00-0158061'}),
 (4, {'file_path': 'government\\documents\\00\\G00-00-0165832'})]

In [16]:
#how many terms do we have?
print('Total number of terms: ',READER_02.field_length("file_content"))

Total number of terms:  2165181


In [17]:
#printing a sample of terms read by the reader
[term for term in READER_02.field_terms("file_content")][90010:90020]

['quantum',
 'quarantine',
 'quark',
 'quark.phy.bnl.gov',
 'quarknet.fnal.gov',
 'quarks',
 'quarries',
 'quart',
 'quarter',
 'quarter1.gif']

In [18]:
# collecting all the relevance judgments from the qrels file
with open(QRELS_FILE, "r") as f:
    qrels = f.read().splitlines()
# keep only the lines whose relevance label (the last field) is '1';
# splitting on whitespace also copes with blank lines and a missing final newline
rel = [line for line in qrels if line.split() and line.split()[-1] == '1']

rel

['1 0 G00-00-1006224 1',
 '1 0 G00-02-0901987 1',
 '1 0 G00-03-1898526 1',
 '1 0 G00-10-3730888 1',
 '1 0 G00-10-3849661 1',
 '2 0 G00-08-1145623 1',
 '2 0 G00-37-1427392 1',
 '4 0 G00-03-2855342 1',
 '4 0 G00-36-1275993 1',
 '4 0 G00-47-2117970 1',
 '4 0 G00-65-0162935 1',
 '6 0 G00-10-0106475 1',
 '7 0 G00-07-4009621 1',
 '7 0 G00-10-3302265 1',
 '7 0 G00-76-1350144 1',
 '9 0 G00-91-3181951 1',
 '10 0 G00-04-0412407 1',
 '14 0 G00-89-0000000 1',
 '16 0 G00-03-0589290 1',
 '16 0 G00-21-0494028 1',
 '16 0 G00-21-2114990 1',
 '16 0 G00-32-0551737 1',
 '16 0 G00-86-3719816 1',
 '16 0 G00-92-2974327 1',
 '16 0 G00-99-0140748 1',
 '18 0 G00-07-0978415 1',
 '19 0 G00-02-3479535 1',
 '19 0 G00-10-2344253 1',
 '22 0 G00-08-2045138 1',
 '24 0 G00-35-3406418 1',
 '26 0 G00-01-1806077 1',
 '26 0 G00-01-3645577 1',
 '26 0 G00-92-1620651 1',
 '28 0 G00-02-0541868 1',
 '28 0 G00-54-2576117 1']

In [19]:
#total number of relevant documents in qrels.
len(rel)

35

In [20]:
def trecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    print(result.stdout.decode())
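
Each line written by `trecEval` above follows the six-column TREC run format: topic id, the literal `Q0`, document id, rank, score, and a run tag. A small sketch of that formatting step; the helper name `format_trec_line` is hypothetical:

```python
def format_trec_line(topic_id, doc_id, rank, score, run_tag="test"):
    """Build one line of a TREC run file:
    <topic id> Q0 <document id> <rank> <score> <run tag>"""
    return "%s Q0 %s %d %f %s" % (topic_id, doc_id, rank, score, run_tag)

# Example values taken from the baseline results for topic 4
print(format_trec_line("4", "G00-47-2117970", 7, 10.213356414484583))
```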

In [21]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_02, SEARCHER_02)

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


Only 33 relevant documents were retrieved, whereas the qrels file contains 35 relevant documents, which shows that the search system is not fully effective. For query id 19, no relevant documents were retrieved.

In [22]:
INDEX_Q2 = INDEX_02 # Replace None with your index for Q2
QP_Q2 = QP_02 # Replace None with your query parser for Q2
SEARCHER_Q2 = SEARCHER_02 # Replace None with your searcher for Q2

### Q2 (b): Provide answer to Q2 (b) here [markdown cell]

    Overall, the Whoosh baseline system did not perform well and needs some data preprocessing to filter out unnecessary strings. Overall R-Prec = 0.1667

### Q2 (c): Provide answer to Q2(c) here [markdown cell]

    The baseline Whoosh system did not do well on our chosen measure:
    R-Prec was 0 for topic ids 1, 2, 4, 6, 7, 9, 10, 14, 16, 22 and 28, which means it performed very badly on these queries.
    R-Prec was 1 for topic ids 18 and 24, which means it performed extremely well on them.
    R-Prec was 0.33 for topic id 26, where it performed moderately well.
    It also did not retrieve any documents for topic id 19.
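
The per-topic values quoted above were read off the `trec_eval -q` output. As a sketch, they could also be extracted programmatically. The function name `parse_measure` is hypothetical, and the sample text reuses a few lines from the output shown earlier:

```python
def parse_measure(trec_eval_output, measure="Rprec"):
    """Return {topic_id: value} for one measure from `trec_eval -q` text output.

    Each output line is assumed to have three whitespace-separated fields:
    measure name, topic id, value."""
    values = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == measure:
            _, topic_id, value = parts
            values[topic_id] = float(value)
    return values

sample = (
    "map                   \t1\t0.0000\n"
    "Rprec                 \t1\t0.0000\n"
    "map                   \t10\t0.1667\n"
    "Rprec                 \t10\t0.0000\n"
)
print(parse_measure(sample))
```

To use it with `trecEval`, the function would have to return `result.stdout.decode()` instead of only printing it.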

## Question 3

### Q3 (a): Provide answer to Q3 (a) here [markdown cell]

In [23]:
#defining an empty list
res=[]

#opening the topic file
with open(TOPIC_FILE, "r") as tf:
    topics = tf.read().splitlines()
    
    #appending the topic names after the searcher searches the query string.
    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        topicQuery = QP_02.parse(topic_phrase)
        topicResults = SEARCHER_02.search(topicQuery, limit=None)
        res.append(topicResults)
        
    #printing the rank and the score of documents for each queries
    for i in range(len(res)):
        ret_doc = res[i]
        print("topic: ", topics[i])
        print("")
        print("      Doc", "     Rank       ", "Score")
        for (docnum, result) in enumerate(ret_doc):
            score = ret_doc.score(docnum)
            fileName = os.path.basename(result["file_path"])
            print(fileName,"",docnum+1,"", score)
        print("")

topic:  1 mining gold silver coal

      Doc      Rank        Score
G00-90-0342721  1  26.64539777117502

topic:  2 juvenile delinquency

      Doc      Rank        Score
G00-22-3396139  1  17.26213868755707
G00-76-0415824  2  10.597054574626043
G00-78-1531079  3  8.77864826829756
G00-15-1718631  4  8.076859679154353
G00-70-2787853  5  6.788751086401821
G00-74-1394517  6  3.368379609499579

topic:  4 wireless communications

      Doc      Rank        Score
G00-99-2247765  1  16.44915453547273
G00-85-1525415  2  13.364613279303013
G00-05-1218739  3  12.956313628711154
G00-09-0774298  4  11.781349226871903
G00-56-4151981  5  11.367247611926537
G00-21-2229498  6  10.743957712158082
G00-98-4068688  7  10.46486548752591
G00-47-2117970  8  10.213356414484583
G00-67-0152545  9  8.392871246133646
G00-06-1757034  10  6.4315561377014046
G00-78-2551063  11  3.955775319427501
G00-84-0274223  12  2.0684375138105002

topic:  6 physical therapists

      Doc      Rank        Score
G00-26-3134051  1 

Data Preprocessing can help improve the search results.

#### ID:- 4
#### Topic:- wireless communications

Total number of documents retrieved:-

          Doc      Rank        Score
    G00-99-2247765  1  16.44915453547273
    G00-85-1525415  2  13.364613279303013
    G00-05-1218739  3  12.956313628711154
    G00-09-0774298  4  11.781349226871903
    G00-56-4151981  5  11.367247611926537
    G00-21-2229498  6  10.743957712158082
    G00-98-4068688  7  10.46486548752591
    G00-47-2117970  8  10.213356414484583
    G00-67-0152545  9  8.392871246133646
    G00-06-1757034  10  6.4315561377014046
    G00-78-2551063  11  3.955775319427501
    G00-84-0274223  12  2.0684375138105002

Total number of relevant documents in Qrels file:-

              G00-03-2855342
              G00-36-1275993
              G00-47-2117970
              G00-65-0162935

Total number of relevant document retrieved:-
            
              G00-47-2117970 


Hence,

FP(Irrelevant documents retrieved):- 12-1=11

##### Taking Example of One False Positive:-
The document "G00-85-1525415" is ranked 2nd but is irrelevant. It was retrieved because it contains both query terms, 'wireless' and 'communications', exactly as they are written in the query. However, the domain experts did not consider it relevant because it does not meet the information need. The conditions of the query are met, but relevance is judged against the information need, not against the query terms.

FN(Relevant Documents Not Retrieved):- 4-1=3


##### Taking Example of One False Negative:-
The document "G00-03-2855342" covers the topic of the query but uses different vocabulary: 'wireless' appears as 'Cellular', which refers to a wireless device, and 'communication' appears as 'transmission of signals' or 'Telecommunication'. Our search system cannot take these words into account because they are not in the form it expects. The document therefore contains relevant information and satisfies the information need, but it is not retrieved because its words are in different forms.
To retrieve such documents, we need to preprocess the data to reduce the noise, for example by lemmatizing the words or by reducing them to their root form via stemming.

##### The only True Positive:- 'G00-47-2117970'
This document is all about 'wireless communication', so it provides relevant information and fulfils the information need. It is nevertheless ranked only 8th because many of the words are capitalized or written in some other form. Removing this noise may improve the ranking of this document.
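
The counts above can be reproduced with a short sketch. The function name `classify` is illustrative; the document IDs are the baseline results and qrels entries for topic 4 listed above:

```python
def classify(retrieved, relevant):
    """Split a result list into true positives, false positives and false negatives."""
    relevant = set(relevant)
    tp = [d for d in retrieved if d in relevant]      # relevant and retrieved
    fp = [d for d in retrieved if d not in relevant]  # retrieved but not relevant
    fn = sorted(relevant - set(retrieved))            # relevant but not retrieved
    return tp, fp, fn

retrieved_topic4 = [
    "G00-99-2247765", "G00-85-1525415", "G00-05-1218739", "G00-09-0774298",
    "G00-56-4151981", "G00-21-2229498", "G00-98-4068688", "G00-47-2117970",
    "G00-67-0152545", "G00-06-1757034", "G00-78-2551063", "G00-84-0274223",
]
relevant_topic4 = ["G00-03-2855342", "G00-36-1275993",
                   "G00-47-2117970", "G00-65-0162935"]

tp, fp, fn = classify(retrieved_topic4, relevant_topic4)
print(len(tp), len(fp), len(fn))
```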


###### The desired information need is "Information on existing and planned uses, research/technology, regulations and legislative interest"


In [24]:
#printing the contents of the selected False Positive document
FP_id4 = list(Path(DOCUMENTS_DIR).glob("**/G00-85-1525415"))[0]
with open(FP_id4, "r" , encoding = 'utf-8') as f:
    content = f.read()
    print(content)

http://www.its.dot.gov/tcomm/tcomm.htm


     ITS U.S. Department of Transportation - Home link

      Commercial Vehicles | Intelligent Vehicles | Intermodal Freight |
                        Travel Management | 511 Info

      Architecture | Standards | Architecture Conformity | Evaluation |
                         Public Safety | Training |






     [white_arrow.gif] ITS Information
    About Us
    ITS in Your State
    ITS Newsletter &
    Forum
    Document Library
    Press Room
    Speeches
    FAQs
     [white_arrow.gif] Key Sources
    Deployment Tracking
    Benefit/Cost Dbase
    ITS Peer-To-Peer
    Technical Assistance
    Rural
    Telecommunications

     [white_arrow.gif] Related Items
    Search
    Related Links
    US DOT
    Operations Web Site
    Federal Transit
    Administration
    Contact Us


                         HOME
   ITS U.S. Department of Transportation - Home link


   Telecommunications
   RIGHT-OF-WAY
     * "Telecommunications: Getting More F

In [25]:
#printing the contents of the selected False Negative document
FN_id4 = list(Path(DOCUMENTS_DIR).glob("**/G00-03-2855342"))[0]
with open(FN_id4, "r" , encoding = 'utf-8') as f:
    content = f.read()
    print(content)

### Q3 (b): Write your code below

In [26]:
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            # apply the same transformation in both modes:
            # 'query' (called by the query parser) and 'index' (called by the indexer)
            t.text = self.customFilter(t.text, *self.args, **self.kwargs)
            yield t

In [27]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3

#defining our own analyzer: a tokenizer followed by a chain of filters
myfilter = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter() | CustomFilter(WordNetLemmatizer().lemmatize , 'v')

In [28]:
# define a Schema with the new analyzer
mySchema3 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myfilter))

# create the index based on the new schema
INDEX_03 = createIndex(mySchema3)

In [29]:
addFilesToIndex(INDEX_03, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [30]:
# define a query parser for the field "file_content" in the index
QP_03 = QueryParser("file_content", schema=INDEX_03.schema)
SEARCHER_03 = INDEX_03.searcher()

In [31]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_03, SEARCHER_03)

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	42
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.0000


In [32]:
#defining an empty list
res=[]

#opening the topic file
with open(TOPIC_FILE, "r") as tf:
    topics = tf.read().splitlines()
    
    #appending the topic names after the searcher searches the query string.
    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        topicQuery = QP_03.parse(topic_phrase)
        topicResults = SEARCHER_03.search(topicQuery, limit=None)
        res.append(topicResults)
        
    #printing the rank and the score of documents for each queries
    for i in range(len(res)):
        ret_doc = res[i]
        print("topic: ", topics[i])
        print("")
        print("      Doc", "     Rank       ", "Score")
        for (docnum, result) in enumerate(ret_doc):
            score = ret_doc.score(docnum)
            fileName = os.path.basename(result["file_path"])
            print(fileName,"",docnum+1,"", score)
        print("")

topic:  1 mining gold silver coal

      Doc      Rank        Score
G00-90-0342721  1  23.96711110734902
G00-55-3817584  2  13.436290649235756
G00-69-2353421  3  7.288463562861348

topic:  2 juvenile delinquency

      Doc      Rank        Score
G00-37-1427392  1  19.449213860332442
G00-22-3396139  2  18.434193291655042
G00-78-1531079  3  16.335289209741667
G00-92-0578141  4  15.493780459238984
G00-67-0637954  5  14.85795689433918
G00-91-1567424  6  14.721796656533094
G00-94-1117794  7  13.437583261488802
G00-76-0415824  8  12.54308590417346
G00-15-1718631  9  11.522913461281409
G00-90-3871013  10  11.372035203185476
G00-70-2787853  11  10.268645726545305
G00-27-2159399  12  8.788646752754644
G00-74-1394517  13  4.022868439339689

topic:  4 wireless communications

      Doc      Rank        Score
G00-36-1275993  1  15.16426827480711
G00-47-2117970  2  14.653075734498902
G00-99-2247765  3  13.87236761038983
G00-85-1525415  4  13.224219175900872
G00-00-1958915  5  13.150224612762251
G00

G00-08-2869885  23  13.826300497566422
G00-69-1630483  24  13.747454714216891
G00-93-0207532  25  13.683912720461787
G00-08-2045138  26  13.66334443025179
G00-97-1855812  27  13.662969261192329
G00-40-1089727  28  13.624250666257609
G00-83-3561112  29  13.614213964697754
G00-83-0087723  30  13.572514289779232
G00-05-3024540  31  13.423088034128272
G00-67-1920188  32  13.422131262485886
G00-07-1233793  33  13.382579198942642
G00-16-2479016  34  13.352986771015319
G00-80-4041999  35  13.334741840097971
G00-09-0472163  36  13.255398838217829
G00-01-0241150  37  13.163602706857741
G00-81-2370175  38  13.116984535675064
G00-81-2715832  39  13.069695018533771
G00-21-0649032  40  13.056274671061459
G00-60-0543216  41  12.964561914056574
G00-09-3548030  42  12.871801805274405
G00-29-2589930  43  12.762899294976048
G00-10-3017657  44  12.665857923161234
G00-06-3205886  45  12.592326714454867
G00-21-3411003  46  12.49934951917097
G00-09-1022942  47  12.448074362437854
G00-10-2085310  48  12.3928

In [33]:
INDEX_Q3 = INDEX_03 # Replace None with your index for Q3
QP_Q3 = QP_03 # Replace None with your query parser for Q3
SEARCHER_Q3 = SEARCHER_03 # Replace None with your searcher for Q3

### Q3 (c): Provide answer to Q3 (c) here [markdown cell]

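Modifications made to the query terms as well as to the document terms:
1. Tokenization (splitting the text into tokens with RegexTokenizer)
2. Lowercasing (LowercaseFilter, so that capitalized words match their lowercase forms)
3. Intra-word filter (IntraWordFilter, which splits compound tokens into their parts)
4. Stop word filter (removing very common words that do not convey any information)
5. Stemming (reducing inflected word forms to a common stem)
6. Lemmatization (mapping each token to its dictionary form, here treating tokens as verbs)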
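
To illustrate why these steps help matching, here is a toy pipeline that uses only the standard library. It is not the Whoosh analyzer used in this notebook: the stop list is a tiny made-up sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop list (an assumption; real stop lists are longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def naive_stem(token):
    """Crude suffix stripper used only for illustration (not a real stemmer)."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_stem(t) for t in tokens]               # (naive) stemming

print(analyze("Wireless Communications"))
print(analyze("the wireless communication"))
```

Both inputs are reduced to the same token list, so a query in one form can match a document written in the other.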

With the improved analyzer, the performance on some of the queries also improved.
The RPrec increased from 0 to 1 for topic id 14.
The RPrec increased from 0 to 0.5 for topic ids 2 and 4.
Topic id 19 previously returned no results and now has RPrec = 0.5.
For the rest of the topic ids, RPrec remained the same.

Also, the overall RPrec improved from 0.1667 to 0.3.

The document 'G00-85-1525415' is still a False Positive, but its rank dropped from 2nd to 4th, which is desirable for an irrelevant document. In this respect, the improved analyzer improved our search engine's performance.

The document 'G00-03-2855342' is still not retrieved and therefore remains a False Negative. However, another relevant document, 'G00-36-1275993', which the baseline did not retrieve, is now retrieved at rank 1. It changed from a False Negative to a True Positive.



### Q3 (d): Provide answer to Q3 (d) here [markdown cell]

YES

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]

YES

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]

The overall performance of the system improved, and it gave better results than the baseline Whoosh system. RPrec, which is the deciding measure in our case, improved for the respective queries as well as for the overall system. In addition, topic id 19, which returned no results before, now returns results after the improvements to the search system. The changes therefore help relevant documents to be retrieved and ranked higher.

## Question 4 (Graduate Students)

In [34]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

The default scoring method, BM25F, has two parameters: B (document length normalization) and k1 (a smoothing parameter for adjusting term frequency saturation). By varying these parameters, we can vary the scores and change the weighting.
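
To make the role of the two parameters concrete, here is a sketch of the textbook BM25 weight for a single term. It is written from the standard formula rather than taken from Whoosh's source, so Whoosh's implementation may differ in details; the function and argument names are illustrative.

```python
def bm25_term_score(idf, tf, doc_len, avg_doc_len, B=0.75, k1=1.2):
    """Textbook BM25 weight of one term in one document.

    B  controls how strongly the score is normalized by document length
       (0 = no length normalization, 1 = full normalization).
    k1 controls how quickly repeated occurrences of a term stop adding score.
    """
    length_norm = (1 - B) + B * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Same term statistics, one average-length and one long document
short_doc = bm25_term_score(idf=1.0, tf=3, doc_len=100, avg_doc_len=100, B=0.2, k1=2)
long_doc = bm25_term_score(idf=1.0, tf=3, doc_len=1000, avg_doc_len=100, B=0.2, k1=2)
print(short_doc, long_doc)
```

With a small B, long documents are penalized less; with B=0, document length has no effect on the score.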

##### Considering the same topic id:- 4. wireless communication

Total number of documents retrieved:-40

         Doc       Rank       Score
    G00-36-1275993  1  15.082136853124123
    G00-47-2117970  2  14.859084332141869
    G00-74-4030396  3  13.56707831073486
    G00-21-2229498  4  13.402077083419002
    G00-99-2247765  5  13.275229484579372
    G00-67-0152545  6  13.164082308558882
    G00-69-0005329  7  13.066413811539078
    G00-85-1525415  8  13.01223391576302
    G00-84-3349019  9  12.20501686969157
    G00-05-1218739  10  12.190895132762657
    G00-28-2286602  11  11.774213421994189
    G00-46-1439567  12  11.55047844834981
    G00-02-1720397  13  11.278400395889964
    G00-00-1958915  14  11.10549125146912
    G00-59-3586444  15  10.601790642470505
    G00-07-3064254  16  10.391653162713634
    G00-65-4078383  17  10.172789724381795
    G00-16-0059045  18  10.141842710958606
    G00-44-1482914  19  9.748095799853669
    G00-65-0162935  20  9.656693194713643
    G00-09-0774298  21  9.574744896512644
    G00-71-3454228  22  9.55080291993928
    G00-21-2773039  23  8.890601230785663
    G00-20-1390240  24  8.83244257094294
    G00-98-4068688  25  8.725965207486437
    G00-04-3812745  26  8.42501242882606
    G00-87-3746944  27  8.422553665781484
    G00-06-1757034  28  8.409994238279356
    G00-02-0735704  29  8.277472854059127
    G00-02-2698369  30  8.162340417375768
    G00-61-3907226  31  8.040488620538229
    G00-05-1550998  32  7.960844428852013
    G00-56-4151981  33  7.85102734489849
    G00-78-2551063  34  7.813420431814233
    G00-36-3945230  35  7.697174902781903
    G00-80-2865414  36  7.3278289665422776
    G00-79-0620805  37  7.196661201129199
    G00-15-1460278  38  5.995307053483447
    G00-84-0274223  39  4.800252890187574
    G00-23-3986715  40  2.8560802058266566
    
Total number of relevant documents in the Qrels file:- 4

              G00-03-2855342
              G00-36-1275993
              G00-47-2117970
              G00-65-0162935

Total number of relevant documents retrieved:- 3

              G00-36-1275993
              G00-47-2117970
              G00-65-0162935


Changing the scoring parameters did not change our example False Positive and False Negative, but it did improve the ranking of the documents, which suggests that the parameter tuning was beneficial for this query.

FP(Irrelevant documents retrieved):- 40-3=37

##### Taking Example of One False Positive:-
The document "G00-85-1525415" was ranked 2nd by the baseline but is irrelevant. It was retrieved because it contains both query terms, 'wireless' and 'communications', exactly as they are written in the query. However, the domain experts did not consider it relevant because it does not meet the information need.

However, the rank of this document dropped both with the custom filter and with the tuned scoring parameters. Initially, it was ranked 2nd; with the custom filter it dropped to 4th, and after tuning the parameters it dropped to 8th. This is good because it is an irrelevant document.
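
The rank changes described above can be checked with a small helper. The name `rank_of` is hypothetical, and the lists below contain only the leading entries of each topic 4 ranking as printed in this notebook:

```python
def rank_of(doc_id, ranking):
    """1-based rank of doc_id in ranking, or None if it was not retrieved."""
    try:
        return ranking.index(doc_id) + 1
    except ValueError:
        return None

# Leading entries of the topic 4 rankings (baseline, custom filter, tuned BM25F)
baseline = ["G00-99-2247765", "G00-85-1525415", "G00-05-1218739"]
custom_filter = ["G00-36-1275993", "G00-47-2117970", "G00-99-2247765",
                 "G00-85-1525415", "G00-00-1958915"]
tuned = ["G00-36-1275993", "G00-47-2117970", "G00-74-4030396", "G00-21-2229498",
         "G00-99-2247765", "G00-67-0152545", "G00-69-0005329", "G00-85-1525415"]

for name, ranking in [("baseline", baseline),
                      ("custom filter", custom_filter),
                      ("tuned BM25F", tuned)]:
    print(name, rank_of("G00-85-1525415", ranking))
```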

FN(Relevant documents not retrieved):- 4-3=1

##### Taking Example of One False Negative:-
The document "G00-03-2855342" covers the topic of the query but uses different vocabulary: 'wireless' appears as 'Cellular', which refers to a wireless device, and 'communication' appears as 'transmission of signals' or 'Telecommunication'. Our search system cannot take these words into account because they are not in the form it expects. The document therefore contains relevant information and satisfies the information need, but it is not retrieved because its words are in different forms.

##### True Positives:-
The total number of true positives increased from 1 to 3, which suggests improved performance on this query.

##### The overall RPrec increased only marginally (from 0.167 to 0.2), but this is still an improvement; hence, our scoring method gave better performance results.

### Q4 (b): Write your code below

In [35]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

# define a Schema with the new analyzer
mySchema4 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myfilter))

# create the index based on the new schema
INDEX_04 = createIndex(mySchema4)

In [36]:
addFilesToIndex(INDEX_04, filesToIndex)

already indexed: 1000
already indexed: 2000
already indexed: 3000
already indexed: 4000
done indexing.


In [37]:
# define a query parser for the field "file_content" in the index
QP_04 = QueryParser("file_content", schema=INDEX_04.schema)
# defining a new searcher with different scoring parameters
SEARCHER_04 = INDEX_04.searcher(weighting = scoring.BM25F(B=0.2 , k1=2))

In [38]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_04, SEARCHER_04)

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	42
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.0000


In [39]:
# collect the search results for every topic
res = []

# read the topic file, one topic per line: "<topic_id> <topic_phrase>"
with open(TOPIC_FILE, "r") as tf:
    topics = tf.read().splitlines()

# run each topic phrase through the query parser and searcher, and keep its results
for topic in topics:
    topic_id, topic_phrase = topic.split(" ", 1)
    topicQuery = QP_04.parse(topic_phrase)
    topicResults = SEARCHER_04.search(topicQuery, limit=None)
    res.append(topicResults)

# print the rank and score of every retrieved document for each query
for topic, ret_doc in zip(topics, res):
    print("topic: ", topic)
    print("")
    print("      Doc", "     Rank       ", "Score")
    for (pos, result) in enumerate(ret_doc):
        # Results.score(n) gives the score of the hit at position n in the ranking
        score = ret_doc.score(pos)
        fileName = os.path.basename(result["file_path"])
        print(fileName, "", pos + 1, "", score)
    print("")

topic:  1 mining gold silver coal

      Doc      Rank        Score
G00-90-0342721  1  31.545119318842566
G00-55-3817584  2  21.177774593173304
G00-69-2353421  3  14.929473092855272

topic:  2 juvenile delinquency

      Doc      Rank        Score
G00-37-1427392  1  21.289285165270623
G00-78-1531079  2  19.989420261429263
G00-22-3396139  3  16.500465372630174
G00-15-1718631  4  15.863583833021245
G00-70-2787853  5  15.580282708497759
G00-76-0415824  6  15.494493729021688
G00-92-0578141  7  14.853212972376546
G00-27-2159399  8  13.988102942128435
G00-67-0637954  9  12.286801773820741
G00-91-1567424  10  12.26179302540987
G00-94-1117794  11  12.006928235578876
G00-90-3871013  12  11.508837959981449
G00-74-1394517  13  9.502598020927163

topic:  4 wireless communications

      Doc      Rank        Score
G00-36-1275993  1  15.082136853124123
G00-47-2117970  2  14.859084332141869
G00-74-4030396  3  13.56707831073486
G00-21-2229498  4  13.402077083419002
G00-99-2247765  5  13.27522948457937

G00-67-1920188  38  12.276855799536527
G00-08-2869885  39  12.186324736617674
G00-69-1630483  40  12.169919618111162
G00-10-2085310  41  12.168956633009746
G00-09-0441906  42  12.065907336875544
G00-01-0241150  43  12.04381636773509
G00-83-0087723  44  12.034588294048895
G00-30-3863788  45  11.864613994388481
G00-06-3205886  46  11.84030443494839
G00-71-0183650  47  11.807852652936706
G00-30-2725539  48  11.723740474398879
G00-69-0982341  49  11.687981225407388
G00-62-3414482  50  11.641805051305706
G00-02-0832578  51  11.554797252067711
G00-53-2539578  52  11.531240238141622
G00-03-2694354  53  11.529164995797139
G00-26-1904362  54  11.495710641191312
G00-10-3017657  55  11.382345299744973
G00-81-2715832  56  11.311520874793777
G00-00-1246456  57  11.28033266131526
G00-04-0177999  58  11.252756163650858
G00-21-3411003  59  11.01535635102174
G00-57-0508479  60  10.992033020444888
G00-93-0207532  61  10.961624554750742
G00-07-0452218  62  10.883979771632514
G00-66-1204421  63  10.818954

In [40]:
INDEX_Q4 = INDEX_04 # Replace None with your index for Q4
QP_Q4 = QP_04 # Replace None with your query parser for Q4
SEARCHER_Q4 = SEARCHER_04 # Replace None with your searcher for Q4

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]

The BM25F parameters were tuned to B=0.2 and k1=2 to improve retrieval performance.
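
To illustrate what B and k1 control, here is a minimal sketch of the common BM25 term weight in plain Python. It is not Whoosh's implementation: the function name `bm25_term_weight` and all input numbers are hypothetical, and the defaults of 0.75 and 1.2 are assumed to mirror Whoosh's documented BM25F defaults.

```python
def bm25_term_weight(idf, tf, doc_len, avg_doc_len, B=0.75, K1=1.2):
    """Common BM25 weight of one term in one document.

    B  controls how strongly long documents are penalised (0 = no length normalisation).
    K1 controls how quickly repeated occurrences of a term stop adding to the score.
    """
    length_norm = (1 - B) + B * (doc_len / avg_doc_len)
    return idf * (tf * (K1 + 1)) / (tf + K1 * length_norm)


# Hypothetical term statistics for a document four times longer than average
long_doc = dict(idf=2.0, tf=3, doc_len=2000, avg_doc_len=500)

default_weight = bm25_term_weight(**long_doc)            # assumed defaults
tuned_weight = bm25_term_weight(**long_doc, B=0.2, K1=2)  # the tuned values

print("default:", default_weight)
print("tuned:  ", tuned_weight)
```

With a smaller B the length penalty shrinks, so in this sketch the long document receives a higher weight under the tuned setting than under the assumed defaults.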

For the false positive case:

The document's rank dropped both with the custom filter and with the tuned scoring parameters. Initially it was ranked 2nd; with the custom filter it dropped to 4th, and after tuning the parameters it dropped to 8th. This is good because the document is irrelevant to the query, so a lower rank is desirable.

For the false negative case:

Changing the scoring parameters retrieved 3 documents instead of 1, so overall performance on the query improved.
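
The two cases above can be found mechanically by comparing a ranking with the relevance judgments. A minimal sketch, assuming a hypothetical helper `split_errors` and made-up document IDs (not taken from the collection):

```python
def split_errors(ranked_doc_ids, relevant_doc_ids):
    """Return (false positives with their ranks, false negatives) for one query."""
    relevant = set(relevant_doc_ids)
    # retrieved but not judged relevant, together with the 1-based rank
    false_positives = [
        (rank, doc_id)
        for rank, doc_id in enumerate(ranked_doc_ids, start=1)
        if doc_id not in relevant
    ]
    # judged relevant but never retrieved
    false_negatives = sorted(relevant - set(ranked_doc_ids))
    return false_positives, false_negatives


# Hypothetical ranking and judgments for a single query
ranked = ["D4", "D1", "D7"]
relevant = {"D1", "D9"}

fps, fns = split_errors(ranked, relevant)
print("false positives (rank, doc):", fps)
print("false negatives:", fns)
```

Running the same helper on the rankings from two searchers shows whether a false positive moved down the list or a false negative was recovered.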


### Q4 (d): Provide answer to Q4 (d) here [markdown cell]

YES

### Q4 (e): Provide answer to Q4 (e) here [markdown cell]

YES

### Q4 (f): Provide answer to Q4 (f) here [markdown cell]

Tuning the parameters was a good idea because it improved system performance. The overall RPrec increased to 0.2 and the RPrec for our query increased to 0.5. As a result, more relevant documents were retrieved and ranked higher, while irrelevant documents were ranked lower.
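
For reference, RPrec for a single query can be sketched as below. The helper `r_precision` and the document IDs are hypothetical; trec_eval remains the source of the numbers reported here.

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Fraction of the top-R ranked documents that are relevant, where R = number of relevant documents."""
    relevant = set(relevant_doc_ids)
    R = len(relevant)
    if R == 0:
        return 0.0  # no relevant documents: treat RPrec as 0 in this sketch
    top_r = ranked_doc_ids[:R]
    hits = sum(1 for doc_id in top_r if doc_id in relevant)
    return hits / R


# Hypothetical query with two relevant documents, one of them in the top 2
ranked = ["D3", "D7", "D1", "D9"]
relevant = {"D3", "D9"}
rprec = r_precision(ranked, relevant)
print(rprec)
```

Averaging this value over all topics gives the overall RPrec that is compared across system configurations.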

## Validation

In [41]:
# Run the following cells to make sure your code returns the correct value types

In [42]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [43]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [44]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [45]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated
