#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 1:  Information Retrieval Basics

### 100 points [7% of your final grade]

### Due: January 31 (Friday) by 11:59pm

*Goals of this homework:* In this homework you will get first-hand experience building a text-based mini search engine. In particular, there are three main learning objectives: (i) the basics of tokenization (e.g. stemming, case-folding, etc.) and its effect on information retrieval; (ii) basics of index building and Boolean retrieval; and (iii) basics of the Vector Space model and ranked retrieval.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw1.ipynb`. For example, my homework submission would be something like `555001234_hw1.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Dataset

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard generated by a user is made up of an entity on the front and a definition describing or explaining that entity on the back. We treat the entity on each flashcard's front as a query and the definition on its back as a document. A definition (document) is relevant to an entity (query) if the definition comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, queries and entities are interchangeable as well as documents and definitions.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
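
As a minimal sketch of reading this format, the snippet below splits one line into its three fields. The `Flashcard` type and the `parse_line` helper are hypothetical names introduced only for illustration, and the sketch assumes each line has exactly two tab separators before the definition.

```python
from typing import NamedTuple


class Flashcard(NamedTuple):
    query: str     # entity on the front of the flashcard
    doc_id: str    # id of the definition
    document: str  # definition on the back of the flashcard


def parse_line(line: str) -> Flashcard:
    """Split one tab-separated line into (query, document id, document)."""
    # maxsplit=2 keeps any stray tab inside the definition text intact.
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    return Flashcard(query.strip(), doc_id.strip(), document.strip())


sample = "decision tree\t27946\tshow complex processes with multiple decision rules."
card = parse_line(sample)
print(card.query, card.doc_id)  # decision tree 27946
```

Stripping each field removes the trailing newline and any padding spaces around the tabs, so the document id can be used directly as a dictionary key.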

# Part 1: Parsing (20 points)

First, you should tokenize documents (definitions) using **whitespace and punctuation as delimiters**. Your parser must also provide the following three pre-processing options:
* Remove stop words: use the nltk stop words list (`from nltk.corpus import stopwords`)
* Stemming: use [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.

Please note that you should stick to the stemming package listed above. Otherwise, given the same query, the results generated by your code may differ from those of other students.
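
One way to structure such a parser is sketched below. It is only an illustration under stated assumptions: `tokenize`, `DEMO_STOP_WORDS`, and the particular definition of "noise" (single characters and pure numbers) are hypothetical choices, and the tiny stop word set stands in for the nltk list so that the sketch runs without any downloads.

```python
import re

# Hypothetical stand-in for the nltk stop word list (assumption for this sketch).
DEMO_STOP_WORDS = {"a", "an", "and", "as", "of", "the", "with"}


def tokenize(text, stop_words=None, stemmer=None, drop_noise=False):
    """Split text on whitespace and punctuation, then apply optional filters."""
    # Any run of non-alphanumeric characters acts as a delimiter.
    tokens = [t for t in re.split(r"[\W_]+", text.lower()) if t]
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    if drop_noise:
        # Assumed notion of noise: single characters and pure numbers.
        tokens = [t for t in tokens if len(t) > 1 and not t.isdigit()]
    if stemmer is not None:
        # stemmer is any callable mapping a token to its stem.
        tokens = [stemmer(t) for t in tokens]
    return tokens


print(tokenize("fall-out; probability of a false alarm", stop_words=DEMO_STOP_WORDS))
# ['fall', 'out', 'probability', 'false', 'alarm']
```

In the homework itself, the stop word set would come from `stopwords.words('english')` and the `stemmer` argument could be `PorterStemmer().stem`, so each option can be switched on or off independently when measuring dictionary size.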

In [1]:
# configuration options
remove_stopwords = True  # or False
use_stemming = True # or False
remove_otherNoise = True # or False

In [439]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.tokenize import RegexpTokenizer
import math
ps = PorterStemmer() 


In [359]:
def FetchData(filename):
    f = open(filename, encoding = "utf8")
            
    return f

In [379]:
f = FetchData('homework_1_data.txt')
ps = PorterStemmer() 
stop_words = set(stopwords.words('english')) 
count_stopwords=0
filtered_sentence = [] 
stemmed = []
unique_list = []

for line in f:
    
    s = line.split('\t')[2]
    out = s.translate( str.maketrans("", "", string.punctuation))
    tokens= word_tokenize(out)
    unique_list.append(tokens)
    unique_words = set(tokens)
    for w in unique_words: 
        if w not in stop_words: 
            filtered_sentence.append(w)
            stemmed.append(ps.stem(w))
    
#print(stemmed)   
unique_words = set(stemmed) 
unique_words_filtered = set(filtered_sentence) 

flat_list = []
for sublist in unique_list:
    for item in sublist:
        flat_list.append(item)
unique_words_wout_processing = set(flat_list) 

Processed_length = len(unique_words)
filtered_length = len(unique_words_filtered)
without_processing = len(unique_words_wout_processing)



In [497]:
# write for remove other noise
other_noise =[]
for wrd in unique_words:
        ch =0
        for c in wrd:
            if ord(c)>128:
                ch =1
                break
        if(ch ==0):
            other_noise.append(wrd)
other_noise_removed =len(other_noise)           
#print(other_noise_removed)            

In [391]:
print("None of pre-processing options = " , without_processing)
print("remove stop words =" ,filtered_length) 
print("remove stop words + stemming =" ,Processed_length) 
print("remove stop words + stemming + remove other noise = " ,other_noise_removed) 


None of pre-processing options =  19388
remove stop words = 19261
remove stop words + stemming = 13344
remove stop words + stemming + remove other noise =  12714


### Observations

Once you have your parser working, you should report here the size of your dictionary under the four cases. That is, how many unique tokens do you have with stemming on and casefolding on? And so on. You should fill in the following

* None of pre-processing options      = 19388
* remove stop words       = 19261
* remove stop words + stemming       = 13344
* remove stop words + stemming  + remove other noise     = 12714

# Part 2: Boolean Retrieval (30 points)

In this part you build an inverted index to support Boolean retrieval. We only require your index to support AND queries. In other words, your index does not have to support OR, NOT, or parentheses. Also, we do not explicitly expect to see AND in queries, e.g., when we query **relational model**, your search engine should treat it as **relational** AND **model**.

Search for the queries below using your index and print out matching documents (for each query, print out 5 matching documents):
* relational database
* garbage collection
* retrieval model

Please use the following format to present your results:
* query: relational database
* result 1:
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......
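
The idea of an inverted index with implicit AND can be sketched as follows. This is a toy illustration, not the required implementation: `simple_tokens`, `build_index`, `boolean_and`, and the three `toy_docs` are all hypothetical, and stop word removal and stemming are left out to keep the example short.

```python
import re
from collections import defaultdict


def simple_tokens(text):
    """Lowercase and split on whitespace and punctuation (no stemming here)."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in simple_tokens(text):
            index[term].add(doc_id)
    return index


def boolean_and(query, index):
    """Return ids of documents containing every term of the query."""
    terms = simple_tokens(query)
    if not terms:
        return set()
    # Intersect the shortest posting sets first to keep intermediate results small.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


toy_docs = {
    "1": "a relational database stores data in tables",
    "2": "garbage collection reclaims unused memory",
    "3": "a database model based on relations",
}
index = build_index(toy_docs)
print(sorted(boolean_and("relational database", index)))  # ['1']
```

Using `index.get(t, set())` means a query term that never occurs in the collection yields an empty result instead of raising a `KeyError`.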

In [8]:
import nltk
import string
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
import re
import numpy as np

In [392]:
f = FetchData('homework_1_data.txt')
document_file=[]
document_list =[]
document_entity = []
for file in f:
    definition = file.split('\t')[2]
    docidx =file.split('\t')[1]
    entity = file.split('\t')[0]
    document_file.append(definition)
    document_list.append(docidx)
    document_entity.append(entity)



In [394]:
f = FetchData('homework_1_data.txt')
token_file =[]
for files in f:
    tokens = word_tokenize(files.split('\t')[2])
    unique_words = set(tokens)
    token_file.append(unique_words)
    
flat_list = []
for sublist in token_file:
    for item in sublist:
        flat_list.append(item)
vocab = set(flat_list)
#print(vocab)

In [395]:
# Use this to be given into inverted index mapping
tf ={}

f = FetchData('homework_1_data.txt')
ps = PorterStemmer() 
stop_words = set(stopwords.words('english'))
token_file = []
for files in f:
    entity, docid, definition = files.split('\t')
    tf[docid] = {}
    tokens = wordpunct_tokenize(definition)
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [ps.stem(word) for word in tokens]
    tokens = [word for word in tokens if len(word) > 0]
    
    for token in tokens:
        if token not in tf[docid]:
            tf[docid][token] = 1
        else:
            tf[docid][token] += 1
            
    token_file.append(tokens)
    
flat_list = []
for sublist in token_file:
    for item in sublist:
        flat_list.append(item)
vocab1 = set(flat_list)
#print(vocab1) 

In [396]:
f = FetchData('homework_1_data.txt')
ps = PorterStemmer() 
stop_words = set(stopwords.words('english'))

In [397]:
f = FetchData('homework_1_data.txt')
def original_dict(f):
    # map each doc id to its processed tokens: tokenize, lowercase,
    # drop stop words, stem, and discard empty strings
    doc_tokens = {}

    for line in f:
        entity, docid, definition = line.split('\t')
        tokens = wordpunct_tokenize(definition)
        tokens = [word.lower() for word in tokens]
        tokens = [word for word in tokens if word not in stop_words]
        tokens = [ps.stem(word) for word in tokens]
        doc_tokens[docid] = [word for word in tokens if len(word) > 0]

    return doc_tokens

In [398]:
f = FetchData('homework_1_data.txt')
def invert_dict(vocab, original):
    # start with one empty posting set per vocabulary term
    inverted_index = {}
    for word in vocab:
        inverted_index[word] = set()

    # add each doc id to the posting set of every term it contains
    for key, value in original.items():
        for word in value:
            inverted_index[word].add(key)
            
    return inverted_index 



In [405]:
def returnval(output,query,document_file,document_entity):
    print("query:" + query)
    for num in range(len(output)):
        print("result " + str(num+1) + ":")
        id = int(output[num]) 
        print("entity:" + document_entity[int(id)])
        print("definition id:" + str(id))
        print("definition:" + document_file[id])
            


In [406]:
# search for the input using your index and print out ids of matching documents.
f = FetchData('homework_1_data.txt')
inverted_index = invert_dict(vocab1,original_dict(f))


query1 = ps.stem("relational")
val1 = inverted_index[query1]
query2 = ps.stem("database")
val2 = inverted_index[query2]
outputBR = list(val1 & val2)
final_output = outputBR[:5]

query3 = ps.stem("garbage")
val3 = inverted_index[query3]
query4 = ps.stem("collection")
val4 = inverted_index[query4]
outputBR2 = list(val3 & val4)
final_output2 = outputBR2[:5]

query5 = ps.stem("retrieval")
val5 = inverted_index[query5]
query6 = ps.stem("model")
val6 = inverted_index[query6]
outputBR3 = list(val5 & val6)
final_output3 = outputBR3[:5]

In [407]:
returnval(final_output,"relational database",document_file,document_entity)
returnval(final_output2,"garbage collection",document_file,document_entity)
returnval(final_output3,"retrieval model",document_file,document_entity)

query:relational database
result 1:
entity:relational model
definition id:831
definition:a database model that describes data in which all data elements are placed in two-dimensional tables, called relations, which are the logical equivalent of files.

result 2:
entity:relational database
definition id:28160
definition:a group of related databases associated by a key, or a common identifying (qualitative) characteristic.

result 3:
entity:data management
definition id:16052
definition:data stored in relational databases -tables stored in secondary storage

result 4:
entity:relational algebra
definition id:7135
definition:a theoretical way of manipulating a relational database based on set theory

result 5:
entity:data model
definition id:7008
definition:model used for planning the org's database that identifies what kind of info is needed, what entities will be created and how they are related to one another

query:garbage collection
result 1:
entity:garbage collection
definition 

### Observations
Could your boolean search engine find relevant documents for these queries? What is the impact of the three pre-processing options? Do they improve your search quality?

# Part 3: Ranking Documents (50 points) 

In this part, your job is to rank the documents that have been retrieved by the Boolean Retrieval component in Part 2, according to their relevance with each query.

### A: Ranking with simple sums of TF-IDF scores (15 points) 
For a multi-word query, we rank documents by a simple sum of the TF-IDF scores for the query terms in the document.
TF is the log-weighted term frequency $1+\log(tf)$; and IDF is the log-weighted inverse document frequency $\log(\frac{N}{df})$

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 results plus the TF-IDF sum score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......
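
The TF-IDF sum described above can be sketched as below. The function name `tf_idf_sum` and the toy numbers are hypothetical, and the sketch assumes base-10 logarithms and a precomputed document-frequency table `df`; a different log base only rescales the scores.

```python
import math
from collections import Counter


def tf_idf_sum(query_terms, doc_tokens, df, n_docs):
    """Sum (1 + log10 tf) * log10(N / df) over the query terms found in the document."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0 or df.get(term, 0) == 0:
            continue  # a term absent from the document or the collection adds nothing
        score += (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return score


# Toy numbers (assumed): 1000 documents, 'relat' in 10 of them, 'databas' in 100.
df = {"relat": 10, "databas": 100}
doc = ["relat", "databas", "relat", "tabl"]
print(tf_idf_sum(["relat", "databas"], doc, df, n_docs=1000))  # about 3.60
```

Here `relat` occurs twice in a rare-term setting and contributes $(1+\log_{10} 2)\cdot 2$, while `databas` occurs once and contributes $1 \cdot 1$.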

In [498]:
# your code here
# hint: you could first call boolean retrieval function in part 2 to find possible relevant documents, 
# and then rank these documents in this part. Hence, you don't need to rank all documents.
import math

total_document = len(document_list)  # document_list holds every doc id

def tfidfcal(query, document_list, inverted_index, document_file):
    # document_file holds the raw definition text, indexed by doc id
    query = parseDefination(query)   # e.g. ['relat', 'databas']
    # IDF of each query term that occurs in the index: log10(N / df)
    idf_val = {}
    for word in query:
        if word not in inverted_index:
            continue
        dftv = len(inverted_index[word])
        idf_val[word] = math.log10(total_document / dftv)
    rank = []
    for docid in document_list:
        tfidf_score = 0
        parsedDoc = document_file[int(docid)]
        for word in query:
            if word not in idf_val:
                continue
            # number of times the stemmed term appears as a substring of the raw definition
            word_freq = parsedDoc.count(word)
            if word_freq > 0:
                tfv = 1 + math.log10(word_freq)
                tfidf_score = tfidf_score + tfv * idf_val[word]
        rank.append((tfidf_score, docid))
    # sort once after scoring every document, highest score first
    rank = sorted(rank, reverse = True)
    tfidf_score = [(ranked[1],ranked[0]) for ranked in rank]
    return tfidf_score[:5]
result1 = tfidfcal('relational database',document_list,inverted_index,document_file)
#print(result1)

In [425]:
result1 = tfidfcal('relational database',document_list,inverted_index,document_file)
result2 = tfidfcal('garbage collection',document_list,inverted_index,document_file)
result3 = tfidfcal('retrieval model',document_list,inverted_index,document_file)

In [296]:
def returnvalPart3(output,query,document_file,document_entity):
    print("query:" + query)
    for num in range(len(output)):
        print("result " + str(num+1) + ":")
        if(isinstance(output[0],tuple)):
            id = output[num][0]
            print("score: " + str(output[num][1]))
            print("entity:" + document_entity[int(id)])
            print("definition id:" + str(id))
            print("definition:" + document_file[int(id)])
        else:
            id = int(output[num]) 
            print("entity:" + document_entity[int(id)])
            print("definition id:" + str(id))
            print("definition:" + document_file[id])

In [426]:
returnvalPart3(result1,"relational database",document_file,document_entity)
returnvalPart3(result2,"garbage collection",document_file,document_entity)
returnvalPart3(result3,"retrieval model",document_file,document_entity)

query:relational database
result 1:
score: 4.71733880527531
entity:relational algebra
definition id:7156
definition:- a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s)  - relation-at-a-time (or set) language in which all tuples, possibly from several relations, are manipulated in one statement without looping  relational algebra, first created by edgar f. codd while at ibm, is a family of algebras with a well-founded semantics used for modelling the data stored in relational databases, and defining queries on it.  the main application of relational algebra is providing a theoretical foundation for relational databases, particularly query languages for such databases, chief among which is sql.

result 2:
score: 4.357658330802902
entity:relational database
definition id:28378
definition:a type of database system where data is stored in  tables related by common fields. a relational database is th

In [441]:
def parseDefination(query):
    tokens = RegexpTokenizer(r'\w+')
    query = tokens.tokenize(query.lower())
    stm = PorterStemmer()
    if remove_stopwords:
        query = [w for w in query if not w in stop_words]
    if use_stemming:
        query = [stm.stem(word) for word in query]    
    return query
#print(parseDefination('relational database'))

### B: Ranking with vector space model with TF-IDF (15 points) 

**Cosine:** You should use cosine as your scoring function. 

**TFIDF:** For the document vectors, use the standard TF-IDF scores as introduced in A. For the query vector, use simple weights (the raw term frequency). For example:
* query: troll $\rightarrow$ (1)
* query: troll trace $\rightarrow$ (1, 1)

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the cosine score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

You can additionally assume that your queries will contain at most three words. Be sure to normalize your vectors as part of the cosine calculation!
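
A minimal sketch of the cosine scoring described above follows. The names `doc_vector` and `cosine` and the toy document-frequency table are hypothetical; the sketch assumes base-10 logarithms, TF-IDF weights over all terms of the document, and raw term frequencies for the query vector.

```python
import math
from collections import Counter


def doc_vector(doc_tokens, df, n_docs):
    """TF-IDF weight for every term of the document: (1 + log10 tf) * log10(N / df)."""
    counts = Counter(doc_tokens)
    return {
        term: (1 + math.log10(tf)) * math.log10(n_docs / df[term])
        for term, tf in counts.items()
        if df.get(term)
    }


def cosine(query_terms, doc_vec):
    """Cosine between the raw-frequency query vector and a TF-IDF document vector."""
    q_vec = Counter(query_terms)
    dot = sum(weight * doc_vec.get(term, 0.0) for term, weight in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


# Toy collection statistics (assumed): 100 documents.
df = {"troll": 10, "trace": 10, "cave": 1}
vec = doc_vector(["troll", "trace", "cave"], df, n_docs=100)
print(cosine(["troll", "trace"], vec))
```

The document norm is taken over the whole document vector, not only over the query terms, so a long document that mentions many unrelated rare terms scores lower than a short one focused on the query.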

In [303]:
def calidf(vocab1):
    idf_matrix_vocab = {}
    for v in vocab1:
        #print(v)
        dftval = len(inverted_index[v])  #1057
        #print(dftval)
        idfval = math.log10(total_document/dftval)
        #print(idf)
        idf_matrix_vocab[v] = idfval

    return  idf_matrix_vocab

idfmatrix = calidf(vocab1)  # IDF lookup table used by numerator()


In [161]:
def numerator(query):
    query = parseDefination(query)  # e.g. ['relat', 'databas']

    tfidf = {}
    for wrd in query:
        if wrd not in inverted_index:
            continue
        # vocab1 is a set, so a term is counted at most once and tf = 1 + log10(1) = 1
        word_frq = 1 if wrd in vocab1 else 0
        if word_frq > 0:
            tfv = 1 + math.log10(word_frq)
            idfv = idfmatrix[wrd]
            tfidf[wrd] = tfv * idfv
    return tfidf

print(numerator('relational database'))

{'relat': 1.4661083113184445, 'databas': 1.2538980211778423}


In [442]:
def parseQuery(query):
    query = parseDefination(query)
    out ={}
    for w in query:
        # raw term frequency of w within the query itself
        out[w] = query.count(w)
    return out

#print(parseQuery('relational database'))

In [418]:
def VSMCosine(query,outputBR):
    queryvect = parseQuery(query)  #['relat', 'databas']
    qdenm = 0
    tfid = numerator(query)  #0': {'estim': 1, 'durat': 1, 'cost': 1, 'made': 1, 'compon': 1, 'separ': 1, 'combin': 1, 'provid': 1, 'overal': 1, 'figur': 1}
    docvect=0
    docdenm=0

    ranking = []

    sum=0
    for i in tfid:     # according to query calculated
        sum=sum+ tfid[i]

#outputBR-- for relational database 
    for docid in outputBR:
        for w in queryvect:
            
        #print(w)
            qdenm = qdenm+(queryvect[w]*queryvect[w])
            docdenm = docdenm +(tfid[w]*tfid[w])
        qdenm = math.sqrt(qdenm) 
        docdenm = math.sqrt(docdenm)
    
    
    
        deno = qdenm*docdenm
        nume = sum
        cosine_score = nume/deno
    
        ranking.append((cosine_score,docid))
        ranking = sorted(ranking, reverse = True)

    cosine_score_list = [(ranked[1],ranked[0]) for ranked in ranking]
    result_VSM  = cosine_score_list[:5]
    return result_VSM

print(VSMCosine('relational database',outputBR))

[('831', 0.9969703954378698), ('28160', 0.6192489053674813), ('16052', 0.5614877243385036), ('7135', 0.5492173984278385), ('7008', 0.5464098806564088)]


In [421]:
resultX = VSMCosine('relational database',outputBR)
resultY = VSMCosine('garbage collection',outputBR2)
resultZ = VSMCosine('retrieval model',outputBR3)


In [422]:
returnvalPart3(resultX,"relational database",document_file,document_entity)
returnvalPart3(resultY,"garbage collection",document_file,document_entity)
returnvalPart3(resultZ,"retrieval model",document_file,document_entity)

query:relational database
result 1:
score: 0.9969703954378698
entity:relational model
definition id:831
definition:a database model that describes data in which all data elements are placed in two-dimensional tables, called relations, which are the logical equivalent of files.

result 2:
score: 0.6192489053674813
entity:relational database
definition id:28160
definition:a group of related databases associated by a key, or a common identifying (qualitative) characteristic.

result 3:
score: 0.5614877243385036
entity:data management
definition id:16052
definition:data stored in relational databases -tables stored in secondary storage

result 4:
score: 0.5492173984278385
entity:relational algebra
definition id:7135
definition:a theoretical way of manipulating a relational database based on set theory

result 5:
score: 0.5464098806564088
entity:data model
definition id:7008
definition:model used for planning the org's database that identifies what kind of info is needed, what entities will

### C: Ranking with BM25 (20 points) 
Finally, let's try the BM25 approach for ranking. Refer to https://en.wikipedia.org/wiki/Okapi_BM25 for the specific formula. You could choose k_1 = 1.2 and b = 0.75 but feel free to try other options.

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the BM25 score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......
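
For reference, the BM25 formula from the linked Wikipedia article can be sketched as below. The function name `bm25_score` and the toy statistics are hypothetical; the sketch assumes the IDF variant $\ln\left(\frac{N - n + 0.5}{n + 0.5} + 1\right)$ given on that page and measures document length in tokens.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Sum the BM25 contribution of every query term found in the document."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)  # document length in tokens
    score = 0.0
    for term in query_terms:
        f = counts[term]      # term frequency in this document
        n_t = df.get(term, 0)  # number of documents containing the term
        if f == 0 or n_t == 0:
            continue
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
        norm = f + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score


# Toy statistics (assumed): 10 documents, average length 2 tokens.
df = {"relat": 1, "databas": 3}
short_doc = ["relat", "databas"]
long_doc = ["relat", "databas", "tabl", "row", "column", "key"]
print(bm25_score(["relat", "databas"], short_doc, df, n_docs=10, avgdl=2.0))
print(bm25_score(["relat", "databas"], long_doc, df, n_docs=10, avgdl=2.0))
```

With `b = 0.75`, the longer document receives a lower score for the same term counts, which is the length normalization that distinguishes BM25 from the plain TF-IDF sum.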

In [443]:
#outputBR-- for relational database 
sum=0
for docid in outputBR:
    sum =sum + word_count[docid]

avglen =float(sum) / float(len(outputBR))

#print(avglen)
    

In [321]:
#document_list ==== total document considered
sum=0
for docid in document_list:
    sum =sum + word_count[docid]

avglen_total =float(sum) / float(len(document_list))

print(avglen_total)
    

20.14141087427629


In [322]:
print(document_file[0])
wds = document_file[0].split(" ")
print(len(wds))

estimates of duration and cost are made for each component separately and combined to provide an overall figure

18


In [480]:
word_count = {}
for docid in document_list:
    word_count[docid]= len(document_file[int(docid)].split(" "))
#print (word_count)   
k2=4
print(word_count['0'])   

18


In [481]:
k1 = 1.5

b = 0.75

def scoreBM25(f,n,avgdl,queryTerm):
    numerator = f*(k1+1)*k2
    deno1 = (n/avgdl)*b
    denominator = f + k1*(1-b + deno1)
    sm = (numerator/denominator)*idfmax[queryTerm]
    return sm

score = scoreBM25(word_freq,n,avglen_total,'relat')
#print(score)

In [483]:
def BM25(query,outputBR):
    queryvect = parseQuery(query)  # {'relat': 1, 'databas': 1}

    ranks = []

#outputBR-- for relational database 
    for docid in outputBR:
        for w in queryvect:
            if w not in inverted_index:
                    continue
            parsedDocument = document_file[int(docid)]
            #print(parsedDocument)
            word_freq = parsedDocument.count(w)
            #print(word_freq)
            if(word_freq >0):
                wds = parsedDocument.split(" ")
                n = len(parsedDocument) 
                #print(n)#no of words in parsed document
                score = scoreBM25(word_freq,n,avglen_total,w)
    
        ranks.append((score,docid))
        ranks = sorted(ranks, reverse = True)

    BM25_list = [(ranked[1],ranked[0]) for ranked in ranks]
    return BM25_list[:5]

#print(BM25('relational database',outputBR))

In [477]:
resultA = BM25('relational database',outputBR)
resultB = BM25('garbage collection',outputBR2)
resultC = BM25('retrieval model',outputBR3)


In [478]:
returnvalPart3(resultA,"relational database",document_file,document_entity)
returnvalPart3(resultB,"garbage collection",document_file,document_entity)
returnvalPart3(resultC,"retrieval model",document_file,document_entity)

query:relational database
result 1:
score: 4.70036763058626
entity:relational database
definition id:28205
definition:a database built using the relational database model

result 2:
score: 3.701653166778901
entity:relational databases
definition id:5134
definition:• a database is intended to be shared by many users • there are three structures for storing database files: - relational database structures - hierarchical database structures - network database structures

result 3:
score: 3.6433178422136003
entity:relational database
definition id:28177
definition:relational database schema with data

result 4:
score: 3.4741687611343863
entity:relational database
definition id:28210
definition:a collection of related database tables

result 5:
score: 3.2716443379401814
entity:relational database
definition id:28227
definition:a database using the relational data model.

query:garbage collection
result 1:
score: 2.740246983633277
entity:garbage collection
definition id:21553
definition

### Discussion
Briefly discuss the differences you see between the three methods. Is there one you prefer?

## BONUS: Evaluation (10 points)
Rather than just compare methods by pure observation, there are several metrics to evaluate the performance of an IR engine: Precision, Recall, MAP, NDCG, HitRate and so on. These all require a ground truth set of queries and documents with a notion of relevance. These ground truth judgments can be expensive to obtain, so we are cutting corners here and treating a flashcard's front and back as a "relevant" query-document pair.

That is, if a document (definition) in your top-5 results is from the back of query's (entity's) flashcard, this document is regarded as relevant to the query (entity). This document is also called a hit in IR. Based on the ground-truth, you could calculate the metrics for the three ranking methods and provide the results like these:

* metric: Precision@5
* TF-IDF - score1
* Vector Space Model with TF-IDF - score2
* BM25 - score3

You could pick any of the reasonable metrics.
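
As one possible metric, Precision@k under this ground truth can be sketched as follows. The helper `precision_at_k` is a hypothetical name, and the sketch assumes a result counts as a hit only when its flashcard entity matches the query string exactly.

```python
def precision_at_k(ranked_entities, query, k=5):
    """Fraction of the top-k results whose flashcard entity equals the query."""
    top = ranked_entities[:k]
    hits = sum(1 for entity in top if entity == query)
    return hits / k


# Entities of the top-5 BM25 results shown above for "relational database".
bm25_entities = [
    "relational database",
    "relational databases",
    "relational database",
    "relational database",
    "relational database",
]
print(precision_at_k(bm25_entities, "relational database"))  # 0.8
```

Dividing by `k` rather than by the number of results returned penalizes a method that retrieves fewer than `k` documents; exact string matching also treats "relational databases" as a miss, which stemming the entity would avoid.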

In [496]:
# your code here
def Precision(output,query,document_file,document_entity):
    # count the results whose flashcard entity equals the query (hits)
    print("query:" + query)
    count = 0
    for result in output:
        if isinstance(result, tuple):
            docid = result[0]
            if document_entity[int(docid)] == query:
                count = count + 1
    return count
num1 =Precision(result1,"relational database",document_file,document_entity)
print("precision for TF-IDF = " , num1/5)
num2 =Precision(resultX,"relational database",document_file,document_entity)
print("precision for VSM = " , num2/5)
num3 =Precision(resultA,"relational database",document_file,document_entity)
print("precision for BM25 = " , num3/5)  # each method returns its top 5 results


query:relational database
precision for TF-IDF =  0.4
query:relational database
precision for VSM =  0.2
query:relational database
precision for BM25 =  0.8


# Collaboration Declarations

**You should fill out your collaboration declarations here.**

**Reminder:** You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by filling out the Collaboration Declarations at the bottom of this notebook.

Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.