# Assignment 2: IR

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [1]:
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os.path
from pathlib import Path
import tempfile
import subprocess
import nltk
from nltk.stem import *
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
DATA_DIR = "government"
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")
TREC_EVAL = os.path.join("trec_eval", "trec_eval.exe")

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]
map

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]
MAP is the mean of average precision (AP) over all queries. Unlike P@K, which only counts how many relevant documents appear in the top K, average precision rewards relevant documents that are ranked higher, so the ordering of the results matters.
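As a minimal sketch of this difference, the snippet below computes P@K and average precision for two hypothetical rankings that contain the same number of relevant documents in the top 5. The function names and the relevance lists are illustrative assumptions, not part of trec_eval.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (order inside the top k is ignored)."""
    return sum(relevance[:k]) / k


def average_precision(relevance, num_relevant):
    """Mean of precision-at-rank taken over the ranks where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


# Two hypothetical rankings (1 = relevant, 0 = not relevant).
early = [1, 1, 0, 0, 0]  # relevant documents at ranks 1 and 2
late = [0, 0, 0, 1, 1]   # relevant documents at ranks 4 and 5

print("P@5:", precision_at_k(early, 5), precision_at_k(late, 5))
print("AP: ", average_precision(early, 2), average_precision(late, 2))
```

Both rankings get the same P@5, while average precision is higher for the ranking that places the relevant documents first.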

## Question 2

### Q2 (a): Write your code below

In [3]:
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index (in a temporary directory) based on the new schema
INDEX_Q2 = createIndex(mySchema)

def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)
    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                # (parentheses matter: docNum+1 % 1000 parses as docNum + (1 % 1000))
                if (docNum + 1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the index
        writer.close()

In [4]:
# Build a list of files to index
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]
addFilesToIndex(INDEX_Q2, filesToIndex)

done indexing.


In [5]:
QP_Q2 = QueryParser("file_content", schema=INDEX_Q2.schema) 
SEARCHER_Q2 = INDEX_Q2.searcher()

In [6]:
def trecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    
    print(result.stdout.decode())
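The result file written above uses the six-column line format that trec_eval reads: topic id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below isolates that formatting step. The helper name `write_trec_run` is hypothetical, and the two scores are rounded illustrative values.

```python
import io


def write_trec_run(out, topic_id, ranked_docs, run_tag="test"):
    """Write one topic's ranked (doc_id, score) pairs as trec_eval result lines."""
    for rank, (doc_id, score) in enumerate(ranked_docs):
        # columns: topic id, literal "Q0", document id, rank, score, run tag
        out.write("%s Q0 %s %d %f %s\n" % (topic_id, doc_id, rank, score, run_tag))


# An in-memory buffer stands in for the temporary output file.
buffer = io.StringIO()
write_trec_run(buffer, "6", [("G00-26-3134051", 13.9965), ("G00-59-0786269", 13.8539)])
print(buffer.getvalue(), end="")
```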

In [7]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q2, SEARCHER_Q2)

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


### Q2 (b): Provide answer to Q2 (b) here [markdown cell]

| Topic | MAP |
|---|---|
| 1 | 0.0000 |
| 10 | 0.1667 |
| 14 | 0.2500 |
| 16 | 0.0000 |
| 18 | 1.0000 |
| 2 | 0.0000 |
| 22 | 0.2000 |
| 24 | 1.0000 |
| 26 | 0.1111 |
| 28 | 0.0000 |
| 4 | 0.0312 |
| 6 | 0.0000 |
| 7 | 0.0000 |
| 9 | 0.0000 |
| all | 0.197 |
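The per-topic MAP values listed above can be collected from the text that `trec_eval -q` prints instead of being copied by hand. The following is a minimal sketch, assuming every result line has the three whitespace-separated columns seen in the output (metric name, topic id, value). The helper name `per_topic_metric` is hypothetical.

```python
def per_topic_metric(trec_eval_output, metric="map"):
    """Collect {topic_id: value} for one metric from trec_eval -q text output."""
    values = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        # each result line has three columns: metric name, topic id, value
        if len(parts) == 3 and parts[0] == metric:
            values[parts[1]] = float(parts[2])
    return values


# A few lines in the same layout as the trec_eval output shown earlier.
sample = """num_ret               \t1\t1
map                   \t1\t0.0000
Rprec                 \t1\t0.0000
num_ret               \t10\t16
map                   \t10\t0.1667
"""
print(per_topic_metric(sample))
```

To use it with the evaluation function, `trecEval` would need to return `result.stdout.decode()` rather than only printing it.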


### Q2 (c): Provide answer to Q2(c) here [markdown cell]
The system did well on topics 18 and 24 (MAP of 1.0), while the MAP was 0 for topics 1, 2, 6, 7, 9, 16, and 28.

## Question 3

In [9]:
# run a query for topic 6 "physical therapists"
sampleQuery = QP_Q2.parse("physical therapists")
sampleQueryResults = SEARCHER_Q2.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-26-3134051 0 13.996502415475085
G00-59-0786269 1 13.8539343648808
G00-60-3914816 2 11.345260421579761
G00-21-0649032 3 5.95590290335526
G00-45-4032177 4 5.937136717292031


### Q3 (a): Provide answer to Q3 (a) here [markdown cell]
For topic 6, "physical therapists", G00-10-0106475 is an example of a false negative: the qrels file marks it as relevant to the query, but it is not retrieved.


As shown in the picture below, in the baseline Whoosh configuration both "physical" and "therapists" must be present for a document to match. G00-10-0106475 contains "therapy" but not "therapists", so it is not retrieved.


<img src="./fig1.JPG"> 


An example of a false positive is G00-26-3134051, which is ranked highly but is not relevant. It is retrieved because it contains both query terms. Moreover, since the scoring module implements the BM25F algorithm, short documents in which the query terms have high term frequencies receive higher scores, and G00-26-3134051 has these features.
Accordingly, stemming or lemmatization should be applied to reduce words to their base form, in addition to filters such as LowercaseFilter and StopFilter.
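The false negative can be illustrated with a toy model of exact-token AND matching. This is a plain-Python sketch, not Whoosh itself: the `tokens` and `matches_all` helpers and the two document strings are hypothetical.

```python
import re


def tokens(text):
    """Split on word characters and keep case, roughly like a plain regex tokenizer."""
    return set(re.findall(r"\w+", text))


def matches_all(query, document):
    """True only if every query token occurs verbatim in the document (AND semantics)."""
    return tokens(query) <= tokens(document)


query = "physical therapists"
doc_with_therapy = "Physical therapy programs offer physical rehabilitation"  # hypothetical text
doc_with_exact = "physical therapists are licensed by the state"              # hypothetical text

print(matches_all(query, doc_with_therapy))
print(matches_all(query, doc_with_exact))
```

Without stemming, "therapy" and "therapists" are different tokens, so the first document cannot match however relevant it is.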

### Q3 (b): Write your code below

In [10]:
with open(TOPIC_FILE, "r") as tf:
    topics = tf.read().splitlines()

# collect the phrase of every topic (dropping the topic id) into one string
topiclist = [topic.split(" ", 1)[1] for topic in topics]
topicstr = " ".join(topiclist)
            
# we start with basic tokenizer on words of topics
tokenizer = RegexTokenizer()
[token.text for token in tokenizer(topicstr)]

['mining',
 'gold',
 'silver',
 'coal',
 'juvenile',
 'delinquency',
 'wireless',
 'communications',
 'physical',
 'therapists',
 'cotton',
 'industry',
 'genealogy',
 'searches',
 'Physical',
 'Fitness',
 'Agricultural',
 'biotechnology',
 'Emergency',
 'and',
 'disaster',
 'preparedness',
 'assistance',
 'Shipwrecks',
 'Cybercrime',
 'internet',
 'fraud',
 'and',
 'cyber',
 'fraud',
 'Veteran',
 's',
 'Benefits',
 'Air',
 'Bag',
 'Safety',
 'Nuclear',
 'power',
 'plants',
 'Early',
 'Childhood',
 'Education']

In [28]:
# using some Tokenizers and Filters of whoosh
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()
[token.text for token in stmLwrStpIntraAnalyzer(topicstr)]
# note: "therapists" is only reduced to "therapist", not to a stem it would share with "therapy"

['mine',
 'gold',
 'silver',
 'coal',
 'juvenil',
 'delinqu',
 'wireless',
 'commun',
 'physical',
 'therapist',
 'cotton',
 'industri',
 'genealog',
 'search',
 'physical',
 'fit',
 'agricultur',
 'biotechnolog',
 'emerg',
 'disast',
 'prepared',
 'assist',
 'shipwreck',
 'cybercrime',
 'internet',
 'fraud',
 'cyber',
 'fraud',
 'veteran',
 'benefit',
 'air',
 'bag',
 'safeti',
 'nuclear',
 'power',
 'plant',
 'earli',
 'childhood',
 'educ']

In [12]:

# Using NLTK's stemmers
lrStem = LancasterStemmer()
sbStem = SnowballStemmer("english")
wnLemm = WordNetLemmatizer()

topiclist = topicstr.split()

# compare two stemmers and a lemmatizer (as noun and as verb); LancasterStemmer() is the one chosen in the end
for word in topiclist:
    print("%15s %15s %15s %15s" % (lrStem.stem(word),
                                   sbStem.stem(word),
                                   wnLemm.lemmatize(word),
                                   wnLemm.lemmatize(word, 'v')))

            min            mine          mining            mine
           gold            gold            gold            gold
           silv          silver          silver          silver
           coal            coal            coal            coal
        juvenil         juvenil        juvenile        juvenile
        delinqu         delinqu     delinquency     delinquency
       wireless        wireless        wireless        wireless
         commun        communic   communication  communications
           phys          physic        physical        physical
         therap       therapist       therapist      therapists
         cotton          cotton          cotton          cotton
       industry        industri        industry        industry
       genealog        genealog       genealogy       genealogy
         search          search          search          search
           phys          physic        Physical        Physical
            fit             fit         

In [13]:
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
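The pattern used by `CustomFilter`, wrapping a plain string function so that it is applied to every token in a stream, can be sketched without Whoosh. Everything below (`Token`, `tokenize`, `FunctionFilter`, `strip_plural`) is a hypothetical stand-in for illustration. The naive plural rule is not a real stemmer.

```python
class Token:
    """Minimal stand-in for an analysis token; it only carries text."""

    def __init__(self, text):
        self.text = text


def tokenize(text):
    """Yield one Token per whitespace-separated word."""
    for word in text.split():
        yield Token(word)


class FunctionFilter:
    """Apply a plain str -> str function to the text of every token in a stream."""

    def __init__(self, func):
        self.func = func

    def __call__(self, token_stream):
        for token in token_stream:
            token.text = self.func(token.text)
            yield token


def strip_plural(word):
    # naive illustrative rule, not a real stemmer
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


lowercase = FunctionFilter(str.lower)
singular = FunctionFilter(strip_plural)

# Filters are chained by feeding one generator into the next.
result = [t.text for t in singular(lowercase(tokenize("Physical Therapists")))]
print(result)
```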

In [15]:
# combining some Tokenizers and Filters of whoosh and LancasterStemmer() of NLTK
myFilter = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter() | CustomFilter(LancasterStemmer().stem)

In [16]:

# define a Schema with the new analyzer
mySchema3 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter))

INDEX_Q3 = createIndex(mySchema3)
addFilesToIndex(INDEX_Q3, filesToIndex)
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema)
SEARCHER_Q3 = INDEX_Q3.searcher()

done indexing.


In [18]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3)

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	52
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.0000


In [30]:
# run a query for topic 6 "physical therapists"
sampleQuery3 = QP_Q3.parse("physical therapists")
sampleQueryResults3 = SEARCHER_Q3.search(sampleQuery3, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults3):
    score = sampleQueryResults3.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-66-3921879 0 17.46899044668693
G00-32-3515764 1 17.26262146618101
G00-17-0942793 2 17.203982825361432
G00-53-1998631 3 17.173316137966196
G00-77-4032514 4 16.685394988230875
G00-10-0106475 5 16.65128340677739
G00-71-1781567 6 16.266956052752562
G00-29-3538849 7 15.755316702073758
G00-59-0786269 8 15.71866404031583
G00-62-0503229 9 15.158572789697388
G00-79-0620805 10 13.75204488718396
G00-03-1993772 11 12.982378015754998
G00-03-2010455 12 12.96896377464737
G00-60-3914816 13 12.75222315709509
G00-59-2617080 14 12.591580436540987
G00-09-3069955 15 11.933489951001182
G00-91-3031381 16 11.481451278302718
G00-44-0827206 17 11.202738416515597
G00-26-3134051 18 11.197801300706354
G00-43-2127379 19 10.66364081706265
G00-21-0649032 20 10.392688022841604
G00-47-1267820 21 10.211837768307724
G00-45-4032177 22 9.798988235832775
G00-70-1984499 23 9.706129968148273
G00-41-0098543 24 9.24521648427766
G00-68-1600868 25 9.164480123843322
G00-82-3144058 26 8.749605197210283
G00-33-1942384 27 8.23669

### Q3 (c): Provide answer to Q3 (c) here [markdown cell]

I combined several Whoosh tokenizers and filters with LancasterStemmer(), one of NLTK's stemmers. Performance improved for most queries, but 5 topics did not improve. The false negative improved: with these changes G00-10-0106475 is retrieved and ranked 6th. The false positive also improved because its ranking dropped: G00-26-3134051 fell from 1st to 19th.

### Q3 (d): Provide answer to Q3 (d) here [markdown cell]
yes

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]
yes

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]
The change was worthwhile because it moved performance in the right direction overall, although there were a few exceptions. The next step is to examine the topics that did not improve and propose solutions for them.

## Question 4 (Graduate Students)

In [31]:
GRAD_STUDENT = True # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here [markdown cell]
For topic 9, "genealogy searches", G00-91-3181951 is a false negative: it is relevant but not ranked highly. G00-59-0523165 is an example of a false positive. The two documents have almost the same term frequencies, but the false positive, G00-59-0523165, is much shorter. Since the parameter B controls how strongly document length is normalized, I chose a smaller B to weaken the effect of document length, and also tuned K1 to get the best result. The searcher below uses B=0.5 and K1=1.5.
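The effect of B can be sketched with the textbook BM25 term score. This is a minimal illustration under assumptions: the function name, the default parameter values, and all the numbers (idf, term frequency, lengths) are hypothetical, and it may differ in detail from Whoosh's BM25F implementation.

```python
def bm25_term_score(idf, tf, doc_len, avg_len, B=0.75, K1=1.2):
    """Textbook BM25 score of one term in one document."""
    # B=0 disables length normalization; B=1 applies it fully
    length_norm = (1 - B) + B * doc_len / avg_len
    return idf * (tf * (K1 + 1)) / (tf + K1 * length_norm)


# Hypothetical numbers: same term frequency, very different document lengths.
idf, tf, avg_len = 2.0, 3, 1000
short_len, long_len = 200, 3000

for B in (0.75, 0.5, 0.0):
    short = bm25_term_score(idf, tf, short_len, avg_len, B=B, K1=1.5)
    long_ = bm25_term_score(idf, tf, long_len, avg_len, B=B, K1=1.5)
    print("B=%.2f short=%.3f long=%.3f gap=%.3f" % (B, short, long_, short - long_))
```

With equal term frequencies the shorter document always scores at least as high, and the gap shrinks as B decreases, which is the motivation for lowering B here.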

In [32]:
from whoosh import scoring
mySchema4 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter))
INDEX_Q4 = createIndex(mySchema4)
addFilesToIndex(INDEX_Q4, filesToIndex)
QP_Q4 = QueryParser("file_content", schema=INDEX_Q4.schema)

done indexing.


In [33]:
SEARCHER_Q4 = INDEX_Q4.searcher(weighting=scoring.BM25F(B=0.5, K1=1.5))

In [35]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q4, SEARCHER_Q4) 

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	52
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.0000


In [36]:
# run a query for topic 9 "genealogy searches" with the Q3 configuration
Query3 = QP_Q3.parse("genealogy searches")
QueryResults3 = SEARCHER_Q3.search(Query3, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(QueryResults3):
    score = QueryResults3.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-30-0221651 0 14.375037867758056
G00-79-2892445 1 13.424950588630212
G00-26-1048210 2 12.392029318651186
G00-55-0643570 3 11.497054143096872
G00-02-1372443 4 10.777541937361102
G00-08-0900666 5 10.777541937361102
G00-08-1314254 6 10.777541937361102
G00-06-1975174 7 10.405808918386139
G00-59-0523165 8 10.405808918386139
G00-88-2629440 9 10.405808918386139
G00-95-3755341 10 10.405808918386139
G00-24-0016657 11 10.353918860271968
G00-95-3337324 12 10.353918860271968
G00-01-2134408 13 10.1814202423583
G00-33-1729611 14 10.070068981217394
G00-01-2898660 15 9.879990942669014
G00-91-3181951 16 9.815246958879358
G00-43-3812747 17 9.743915969345494
G00-21-1529615 18 9.435812076690379
G00-67-1176122 19 9.147730089331123
G00-08-3780534 20 9.093147175005733
G00-49-2630728 21 8.601572598020624
G00-00-2016453 22 8.542644070469915
G00-59-3622783 23 7.592934911703263
G00-48-2464830 24 6.1386074695904185
G00-22-3171818 25 6.031816992720081
G00-21-2737022 26 4.565078287082096


In [37]:
# run a query for topic 9 "genealogy searches" with the Q4 configuration
sampleQuery4 = QP_Q4.parse("genealogy searches")
sampleQueryResults4 = SEARCHER_Q4.search(sampleQuery4, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults4):
    score = sampleQueryResults4.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-30-0221651 0 15.509571914726536
G00-79-2892445 1 15.362037509183018
G00-26-1048210 2 13.16705002702678
G00-91-3181951 3 11.571274640605147
G00-55-0643570 4 11.41317399162092
G00-02-1372443 5 11.149755344737851
G00-08-0900666 6 11.149755344737851
G00-08-1314254 7 11.149755344737851
G00-43-3812747 8 10.445997706712355
G00-24-0016657 9 10.055135612781925
G00-95-3337324 10 10.055135612781925
G00-01-2134408 11 9.937999292775888
G00-01-2898660 12 9.539763238773023
G00-06-1975174 13 9.514204481886681
G00-59-0523165 14 9.514204481886681
G00-88-2629440 15 9.514204481886681
G00-95-3755341 16 9.514204481886681
G00-33-1729611 17 9.30616799651125
G00-59-3622783 18 9.053096687161352
G00-08-3780534 19 8.993113769641333
G00-21-1529615 20 8.900705918684896
G00-49-2630728 21 8.801252691386097
G00-67-1176122 22 8.710941640903279
G00-00-2016453 23 8.300382162307162
G00-48-2464830 24 7.10669262335312
G00-22-3171818 25 6.40163627955863
G00-21-2737022 26 5.121052453870837


### Q4 (c): Provide answer to Q4 (c) here [markdown cell]
Since the parameter B controls how strongly document length is normalized, I chose a smaller B to weaken the effect of document length, and also tuned K1 to get the best result.

The false negative improved: the rank of G00-91-3181951 rose from 17th to 4th. The false positive also improved, since the rank of G00-59-0523165 dropped from 9th to 15th. Across all topics, results improved for some queries and got worse for others.
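The rank changes can be checked directly against the two "genealogy searches" result lists printed above. The sketch below copies the first 17 document ids of each list from those outputs. The helper name `rank_of` is hypothetical.

```python
def rank_of(doc_id, ranked_ids):
    """1-based rank of doc_id in a ranked list, or None if it was not retrieved."""
    try:
        return ranked_ids.index(doc_id) + 1
    except ValueError:
        return None


# First 17 results for "genealogy searches" with the Q3 searcher.
run_q3 = [
    "G00-30-0221651", "G00-79-2892445", "G00-26-1048210", "G00-55-0643570",
    "G00-02-1372443", "G00-08-0900666", "G00-08-1314254", "G00-06-1975174",
    "G00-59-0523165", "G00-88-2629440", "G00-95-3755341", "G00-24-0016657",
    "G00-95-3337324", "G00-01-2134408", "G00-33-1729611", "G00-01-2898660",
    "G00-91-3181951",
]
# First 17 results for the same query with the Q4 searcher (BM25F with B=0.5, K1=1.5).
run_q4 = [
    "G00-30-0221651", "G00-79-2892445", "G00-26-1048210", "G00-91-3181951",
    "G00-55-0643570", "G00-02-1372443", "G00-08-0900666", "G00-08-1314254",
    "G00-43-3812747", "G00-24-0016657", "G00-95-3337324", "G00-01-2134408",
    "G00-01-2898660", "G00-06-1975174", "G00-59-0523165", "G00-88-2629440",
    "G00-95-3755341",
]

for doc_id in ("G00-91-3181951", "G00-59-0523165"):
    print(doc_id, rank_of(doc_id, run_q3), "->", rank_of(doc_id, run_q4))
```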

### Q4 (d): Provide answer to Q4 (d) here [markdown cell]
No.

### Q4 (e): Provide answer to Q4 (e) here [markdown cell]
Yes.

### Q4 (f): Provide answer to Q4 (f) here [markdown cell]

It has been fairly well studied that no single pair of b and k1 values is best for all queries, which the results above also show: a smaller B suits query 9 but not query 26. For long documents that cover several topics (a political article may touch on economics, sports, and so on), a larger b is preferable, while for long documents that are highly specific to one topic, a smaller b is better.
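One way to explore this per-query behaviour is to sweep B separately for each topic. The sketch below shows only the selection logic. The scoring function is a placeholder with made-up preferences, not measured results, and `best_b_per_topic`, `fake_average_precision`, and the topic names are hypothetical.

```python
def best_b_per_topic(score_fn, topics, b_values):
    """For each topic, return the B value that maximises score_fn(topic, b)."""
    return {topic: max(b_values, key=lambda b: score_fn(topic, b)) for topic in topics}


# Placeholder objective, NOT measured results: each hypothetical topic prefers a different B.
PREFERRED_B = {"topic_x": 0.25, "topic_y": 0.75}


def fake_average_precision(topic, b):
    """Stand-in for a real per-topic evaluation; peaks at the topic's preferred B."""
    return 1.0 - abs(b - PREFERRED_B[topic])


print(best_b_per_topic(fake_average_precision, PREFERRED_B, [0.25, 0.5, 0.75]))
```

In a real run, the scoring function would open a searcher with `scoring.BM25F(B=b, K1=...)` and read the topic's average precision from trec_eval. That wiring is left out here because it depends on the local data and binary.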

## Validation

In [38]:
# Run the following cells to make sure your code returns the correct value types

In [39]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [40]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [41]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [42]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

Q4 Types Validated


## Reference
https://whoosh.readthedocs.io/en/latest/api/scoring.html#whoosh.scoring.BM25F