#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 1:  Information Retrieval Basics

### 100 points [7% of your final grade]

### Due: 

*Goals of this homework:* In this homework you will get first-hand experience building a text-based mini search engine. In particular, there are three main learning objectives: (i) the basics of tokenization (e.g. stemming, case-folding, etc.) and its effect on information retrieval; (ii) the basics of index building and Boolean retrieval; and (iii) the basics of the Vector Space model and ranked retrieval.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw1.ipynb`. For example, my homework submission would be something like `555001234_hw1.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Dataset

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard has an entity on the front and, on the back, a definition that describes or explains that entity. We treat the entity on each flashcard's front as a query and the definition on its back as a document. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, queries and entities are interchangeable, as are documents and definitions.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
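
As a minimal sketch of reading this format, the snippet below splits one line into its three fields. It assumes every line has exactly three tab-separated fields and an integer id; the helper name `parse_line` is hypothetical and not part of the assignment.

```python
def parse_line(line):
    """Split one dataset line into (query, document id, document)."""
    # maxsplit=2 keeps any further tab characters inside the document text
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    return query.strip(), int(doc_id), document.strip()


sample = "false positive rate\t686\tfall-out; probability of a false alarm\n"
print(parse_line(sample))
# ('false positive rate', 686, 'fall-out; probability of a false alarm')
```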

# Part 1: Parsing (20 points)

First, you should tokenize documents (definitions) using **whitespace and punctuation as delimiters**. Your parser also needs to provide the following three pre-processing options:
* Remove stop words: use the nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use the [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.

Please note that you should stick to the stemming package listed above. Otherwise, given the same query, the results generated by your code can be different from others.
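
One possible shape for such a parser is sketched below. It is an illustration under stated assumptions, not the required implementation: the function name `parse`, its parameters, and the tiny stop list are hypothetical, and the stemmer is passed in as a callable so the sketch runs without nltk.

```python
import re


def parse(text, stop_words=None, stem=None, drop_digits=False):
    """Tokenize on whitespace and punctuation, then apply the optional filters."""
    # any run of characters other than ASCII letters and digits is a delimiter
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    if drop_digits:
        # one example of "other noise": tokens made up only of digits
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


# toy stop list for illustration only; the assignment uses nltk's list
toy_stop_words = {"a", "of", "the"}
print(parse("probability of a false alarm (type 1)", toy_stop_words, drop_digits=True))
# ['probability', 'false', 'alarm', 'type']
```

With nltk available, `stop_words=set(stopwords.words('english'))` and `stem=PorterStemmer().stem` would plug in the resources named above.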

In [4]:
# configuration options
remove_stopwords = True  # or False
use_stemming = True # or False
remove_otherNoise = True # or False

In [5]:
# Your parser function here. It will take the three option variables above as the parameters
# add cells as needed to organize your code
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

strings = ''
with open("homework_1_data.txt", encoding='UTF-8') as f:
    for line in f:
        line_list = line.split("\t")
        # replace every character that is not a letter, digit, space, or '^' with a space
        string = re.sub("[^A-Z^a-z^0-9^ ]", " ", line_list[2])
        strings = strings + string

words = nltk.word_tokenize(strings)
words_freq_dist = nltk.FreqDist(words)
print(len(words_freq_dist))

# apply the enabled options in order, printing the dictionary size after each step
tokens = words
if remove_stopwords:
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    print(len(nltk.FreqDist(tokens)))

if use_stemming:
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    print(len(nltk.FreqDist(tokens)))

# remove other noise: drop tokens that consist only of digits
if remove_otherNoise:
    tokens = [word for word in tokens if not word.isdigit()]
    print(len(nltk.FreqDist(tokens)))


15742
15602
9722
9458


### Observations

Once you have your parser working, you should report here the size of your dictionary under the four cases. That is, how many unique tokens do you have with stemming on and casefolding on? And so on. You should fill in the following:

* No pre-processing options = 15742
* Remove stop words = 15602
* Remove stop words + stemming = 9722
* Remove stop words + stemming + remove other noise = 9458

# Part 2: Boolean Retrieval (30 points)

In this part you build an inverted index to support Boolean retrieval. We only require your index to support AND queries. In other words, your index does not have to support OR, NOT, or parentheses. Also, we do not explicitly expect to see AND in queries, e.g., when we query **relational model**, your search engine should treat it as **relational** AND **model**.
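
The idea can be sketched on a toy collection as follows. The helper names `build_index` and `and_query` and the three toy documents are hypothetical; the sketch assumes the documents are already tokenized into space-separated terms.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return index


def and_query(index, query):
    """Return the sorted ids of the documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in query.split()]
    return sorted(set.intersection(*postings)) if postings else []


toy_docs = {
    1: "relational model of data",
    2: "a model of retrieval",
    3: "relational database model",
}
toy_index = build_index(toy_docs)
print(and_query(toy_index, "relational model"))  # [1, 3]
```

Intersecting sets keeps the sketch short; a merge of the two sorted postings lists would give the same result without building sets.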

Search for the queries below using your index and print out matching documents (for each query, print out 5 matching documents):
* relational database
* garbage collection
* retrieval model

Please use the following format to present your results:
* query: relational database
* result 1:
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [1]:
# build the index here
# add cells as needed to organize your code
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# build the stop word set and the stemmer once instead of once per token or per line
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
strings = ''
with open("homework_1_data.txt", encoding='UTF-8') as f, \
        open("definition.txt", 'w', encoding='UTF-8') as f2, \
        open("entity.txt", 'w', encoding='UTF-8') as f3, \
        open("original_definition.txt", 'w', encoding='UTF-8') as f4:
    for line in f:
        line_list = line.split("\t")
        string = re.sub("[^A-Z^a-z^0-9^ ]", " ", line_list[2])
        strings = strings + string

        # pre-process the definition: remove stop words, stem, drop pure-digit tokens
        line_words = nltk.word_tokenize(string)
        filtered_line_words = [w for w in line_words if w not in stop_words]
        line_singles = [stemmer.stem(w) for w in filtered_line_words]
        line_singles_no_digits = [w for w in line_singles if not w.isdigit()]

        # line i of each output file belongs to the document on line i of the input
        f2.write(" ".join(line_singles_no_digits) + "\n")
        f3.write(line_list[0] + "\n")
        f4.write(line_list[2])

# construct index: map each term to the sorted list of line numbers (used as
# document ids) of the pre-processed definitions that contain it
with open("definition.txt", encoding='UTF-8') as f:
    lines = f.readlines()

invert_index = dict()
for ID, line in enumerate(lines):
    for term in set(line.split()):
        invert_index.setdefault(term, []).append(ID)


In [7]:
# query
def boolean_query(argus):
    query = argus
    # apply the same pre-processing to the query as to the documents
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    query_terms = [stemmer.stem(t) for t in nltk.word_tokenize(query) if t not in stop_words]
    query_terms = [t for t in query_terms if not t.isdigit()]

    # AND query: intersect the postings lists of all query terms
    postings = [set(invert_index.get(term, [])) for term in query_terms]
    merge_list = sorted(set.intersection(*postings)) if postings else []

    # print
    with open("entity.txt", encoding='UTF-8') as f:
        lines = f.readlines()
    with open("original_definition.txt", encoding='UTF-8') as f:
        original_definition_lines = f.readlines()
    print("query: " + query)
    # print at most 5 matches (fewer if the query matches fewer documents);
    # the printed id is the line number of the definition in the data file
    for i, doc_id in enumerate(merge_list[:5]):
        print("result {}:".format(i + 1))
        print("entity: " + lines[doc_id], end="")
        print("definition id: {}".format(doc_id))
        print("definition: " + original_definition_lines[doc_id], end="")


In [8]:
# search for the input using your index and print out ids of matching documents.
boolean_query("relational database")
print("\n")
boolean_query("garbage collection")
print("\n")
boolean_query("retrieval model")

query: relational database
result 1:
entity: database management system
definition id: 654
definition: dbms allows users to create, read, update, and delete structured data in a relational database. managers send requests to dbms and the dbms performs manipulation of the data. can retrieve information from using sql or qbe (query by example).   relational database management system: allows users to create, read, update, and delete data in a relational database.  pros: increased flexibility, inc scalability and performance, reduced info redundancy, inc info integrity/quality, increased info security.
result 2:
entity: database management system
definition id: 657
definition: general hospital utilizes various related files that include clinical and financial data to generate reports such as ms drg case mix reports. what application would be most effective for this activity  desktop publishing  word processing database management system command interpreter
result 3:
entity: database manag

### Observations
Could your boolean search engine find relevant documents for these queries? What is the impact of the three pre-processing options? Do they improve your search quality?

Yes!

The three pre-processing options help me find relevant documents for these queries. I can find the expected results even when a document does not contain the query words in exactly the same form: if the document expresses the meaning of the query with a variant of a query word, I can also find it.

They definitely improve my search quality.

# Part 3: Ranking Documents (50 points) 

In this part, your job is to rank the documents that have been retrieved by the Boolean Retrieval component in Part 2, according to their relevance with each query.

### A: Ranking with simple sums of TF-IDF scores (15 points) 
For a multi-word query, we rank documents by a simple sum of the TF-IDF scores for the query terms in the document.
TF is the log-weighted term frequency $1+\log(tf)$; and IDF is the log-weighted inverse document frequency $\log(\frac{N}{df})$.
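
A minimal sketch of this scoring rule on toy numbers is given below. The helper names, the toy document frequencies, and the choice of base-10 logarithms are assumptions made for illustration.

```python
import math


def tf_idf(tf, df, n_docs):
    """Log-weighted TF times log-weighted IDF; 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def sum_tf_idf(query_terms, doc_tokens, df, n_docs):
    """Sum the TF-IDF weights of the query terms in one tokenized document."""
    return sum(tf_idf(doc_tokens.count(t), df.get(t, 0), n_docs) for t in query_terms)


# toy numbers: 1000 documents; "relat" occurs in 10 of them, "databas" in 100
toy_df = {"relat": 10, "databas": 100}
doc = ["relat", "databas", "store", "relat"]
score = sum_tf_idf(["relat", "databas"], doc, toy_df, 1000)
print(round(score, 4))
```

Here "relat" contributes $(1+\log_{10} 2)\cdot\log_{10} 100$ and "databas" contributes $1\cdot\log_{10} 10$, so the rarer and more frequent-in-document term dominates the sum.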

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 results plus the TF-IDF sum score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [9]:
# your code here
# hint: you could first call boolean retrieval function in part 2 to find possible relevant documents, 
# and then rank these documents in this part. Hence, you don't need to rank all documents.
import math

def boolean_query_sum_tf_idf(argus):
    query = argus
    # apply the same pre-processing to the query as to the documents
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    query_terms = [stemmer.stem(t) for t in nltk.word_tokenize(query) if t not in stop_words]
    query_terms = [t for t in query_terms if not t.isdigit()]

    # candidate documents: those containing every query term (Boolean AND)
    postings = [set(invert_index.get(term, [])) for term in query_terms]
    merge_list = sorted(set.intersection(*postings)) if postings else []

    with open("definition.txt", encoding='UTF-8') as f:
        lines = f.readlines()

    N = 30917  # number of documents in the collection
    sum_wtd_index = dict()
    for doc_id in merge_list:
        sum_wtd = 0
        for term in query_terms:
            # note: str.count counts substring occurrences of the stem in the line
            tf = lines[doc_id].count(term)
            idf = math.log10(N / len(invert_index[term]))
            sum_wtd += (1 + math.log10(tf)) * idf
        sum_wtd_index[doc_id] = sum_wtd

    sorted_sum_wtd_index = dict(sorted(sum_wtd_index.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_sum_wtd_index.keys())

    # print
    with open("entity.txt", encoding='UTF-8') as f:
        entity_lines = f.readlines()
    with open("original_definition.txt", encoding='UTF-8') as f:
        original_definition_lines = f.readlines()
    print("query: " + query)
    # print at most 5 results (fewer if fewer documents were retrieved)
    for i, doc_id in enumerate(sorted_keys_list[:5]):
        print("result {}:".format(i + 1))
        print("score: {}".format(sorted_sum_wtd_index[doc_id]))
        print("entity: " + entity_lines[doc_id], end="")
        print("definition id: {}".format(doc_id))
        print("definition: " + original_definition_lines[doc_id], end="")
        

In [10]:
boolean_query_sum_tf_idf("relational database")
print("\n")
boolean_query_sum_tf_idf("garbage collection")
print("\n")
boolean_query_sum_tf_idf("retrieval model")

query: relational database
result 1:
score: 4.71733880527531
entity: relational algebra
definition id: 7156
definition: - a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s)  - relation-at-a-time (or set) language in which all tuples, possibly from several relations, are manipulated in one statement without looping  relational algebra, first created by edgar f. codd while at ibm, is a family of algebras with a well-founded semantics used for modelling the data stored in relational databases, and defining queries on it.  the main application of relational algebra is providing a theoretical foundation for relational databases, particularly query languages for such databases, chief among which is sql.
result 2:
score: 4.357658330802902
entity: relational database
definition id: 28378
definition: a type of database system where data is stored in  tables related by common fields. a relational database

### B: Ranking with vector space model with TF-IDF (15 points) 

**Cosine:** You should use cosine as your scoring function. 

**TFIDF:** For the document vectors, use the standard TF-IDF scores as introduced in A. For the query vector, use simple weights (the raw term frequency). For example:
* query: troll $\rightarrow$ (1)
* query: troll trace $\rightarrow$ (1, 1)

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the cosine score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

You can additionally assume that your queries will contain at most three words. Be sure to normalize your vectors as part of the cosine calculation!
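
The cosine computation described above can be sketched as follows. This is an illustration only: `cosine_score`, the toy document frequencies, and the base-10 logarithm are assumptions, and the sketch assumes `df` holds a document frequency for every term of the document.

```python
import math
from collections import Counter


def cosine_score(query_terms, doc_tokens, df, n_docs):
    """Cosine between a raw-tf query vector and a TF-IDF document vector."""
    # document vector: (1 + log10 tf) * log10(N / df) for each distinct term
    doc_vec = {
        t: (1 + math.log10(tf)) * math.log10(n_docs / df[t])
        for t, tf in Counter(doc_tokens).items()
    }
    query_vec = Counter(query_terms)  # raw term frequencies
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)


toy_df = {"troll": 10, "trace": 10, "bridge": 100}
same = cosine_score(["troll", "trace"], ["troll", "trace"], toy_df, 1000)
partial = cosine_score(["troll", "trace"], ["troll", "bridge"], toy_df, 1000)
print(same, partial)
```

A document whose weights point in the same direction as the query scores close to 1, while a document that shares only one of the two query terms scores lower.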

In [11]:
# your code here
import numpy as np

def boolean_query_vsm_tf_idf(argus):
    query = argus
    # apply the same pre-processing to the query as to the documents
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    query_terms = [stemmer.stem(t) for t in nltk.word_tokenize(query) if t not in stop_words]
    query_terms = [t for t in query_terms if not t.isdigit()]

    # candidate documents: those containing every query term (Boolean AND)
    postings = [set(invert_index.get(term, [])) for term in query_terms]
    merge_list = sorted(set.intersection(*postings)) if postings else []

    with open("definition.txt", encoding='UTF-8') as f:
        lines = f.readlines()

    docID_score_dict = dict()
    for doc_id in merge_list:
        # document vector: raw counts of the document's distinct terms, length-normalized
        line_list_dict = nltk.FreqDist(lines[doc_id].split())
        doc_vector = np.array(list(line_list_dict.values()))
        doc_vector = doc_vector / np.linalg.norm(doc_vector)
        # query vector over the same terms: 1 for each query term, 0 elsewhere
        query_dict = dict([(k, 0) for k in line_list_dict.keys()])
        for term in query_terms:
            query_dict[term] = 1
        query_vector = np.array(list(query_dict.values()))
        query_vector = query_vector / np.linalg.norm(query_vector)
        # cosine score of the two unit vectors
        docID_score_dict[doc_id] = query_vector.dot(doc_vector)

    sorted_docID_score_dict = dict(sorted(docID_score_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_docID_score_dict.keys())

    # print
    with open("entity.txt", encoding='UTF-8') as f:
        entity_lines = f.readlines()
    with open("original_definition.txt", encoding='UTF-8') as f:
        original_definition_lines = f.readlines()
    print("query: " + query)
    # print at most 5 results (fewer if fewer documents were retrieved)
    for i, doc_id in enumerate(sorted_keys_list[:5]):
        print("result {}:".format(i + 1))
        print("score: {}".format(sorted_docID_score_dict[doc_id]))
        print("entity: " + entity_lines[doc_id], end="")
        print("definition id: {}".format(doc_id))
        print("definition: " + original_definition_lines[doc_id], end="")
        

In [12]:
boolean_query_vsm_tf_idf("relational database")
print("\n")
boolean_query_vsm_tf_idf("garbage collection")
print("\n")
boolean_query_vsm_tf_idf("retrieval model")

query: relational database
result 1:
score: 0.7878385971583353
entity: relational database
definition id: 28234
definition: a database that is modeled using the relational database model a collection of related relations within which each relation has a unique name
result 2:
score: 0.7499999999999998
entity: relational database
definition id: 28205
definition: a database built using the relational database model
result 3:
score: 0.7302967433402214
entity: relational database
definition id: 28312
definition: a collection of related relations in which each relation has a unique name  operational/transactional databases
result 4:
score: 0.7071067811865475
entity: relational model
definition id: 771
definition: a database is a collection of relations or tables.
result 5:
score: 0.7071067811865475
entity: database schema
definition id: 19673
definition: set of schemas for the relations of a database


query: garbage collection
result 1:
score: 0.5773502691896258
entity: garbage collector
de

### C: Ranking with BM25 (20 points) 
Finally, let's try the BM25 approach for ranking. Refer to https://en.wikipedia.org/wiki/Okapi_BM25 for the specific formula. You could choose $k_1 = 1.2$ and $b = 0.75$ but feel free to try other options.
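
As a hedged sketch of the formula, the function below scores one tokenized document. The name `bm25_score` and the toy statistics are hypothetical, and the IDF variant with `+ 1` inside the natural logarithm is one common choice rather than the only one.

```python
import math


def bm25_score(query_terms, doc_tokens, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    score = 0.0
    dl = len(doc_tokens)  # document length in tokens
    for term in query_terms:
        tf = doc_tokens.count(term)
        n = df.get(term, 0)
        if tf == 0 or n == 0:
            continue
        # IDF variant with +1 inside the log, which keeps the weight non-negative
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score


toy_df = {"garbag": 5, "collect": 50}
short_doc = ["garbag", "collect"]
long_doc = ["garbag", "collect"] + ["filler"] * 18
short_score = bm25_score(["garbag", "collect"], short_doc, toy_df, 1000, avgdl=10)
long_score = bm25_score(["garbag", "collect"], long_doc, toy_df, 1000, avgdl=10)
print(short_score, long_score)
```

Both toy documents contain each query term once, so the difference between the two scores comes only from the length normalization controlled by `b`.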

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the BM25 score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [13]:
# your code here

def boolean_query_BM25(argus):
    query = argus
    # apply the same pre-processing to the query as to the documents
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    query_terms = [stemmer.stem(t) for t in nltk.word_tokenize(query) if t not in stop_words]
    query_terms = [t for t in query_terms if not t.isdigit()]

    # candidate documents: those containing every query term (Boolean AND)
    postings = [set(invert_index.get(term, [])) for term in query_terms]
    merge_list = sorted(set.intersection(*postings)) if postings else []

    with open("definition.txt", encoding='UTF-8') as f:
        lines = f.readlines()

    N = 30917  # number of documents in the collection
    k1 = 1.2
    b = 0.75
    # average document length, in tokens of the pre-processed definitions
    avgdl = sum(len(line.split()) for line in lines[:N]) / N

    score_BM25_dict = dict()
    for doc_id in merge_list:
        dl = len(lines[doc_id].split())
        score_BM25 = 0
        for term in query_terms:
            df = len(invert_index[term])
            idf = math.log10((N - df + 0.5) / (df + 0.5))
            # note: str.count counts substring occurrences of the stem in the line
            tf = lines[doc_id].count(term)
            score_BM25 += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
        score_BM25_dict[doc_id] = score_BM25
    sorted_score_BM25_dict = dict(sorted(score_BM25_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_score_BM25_dict.keys())

    # print
    with open("entity.txt", encoding='UTF-8') as f:
        entity_lines = f.readlines()
    with open("original_definition.txt", encoding='UTF-8') as f:
        original_definition_lines = f.readlines()
    print("query: " + query)
    # print at most 5 results (fewer if fewer documents were retrieved)
    for i, doc_id in enumerate(sorted_keys_list[:5]):
        print("result {}:".format(i + 1))
        print("score: {}".format(sorted_score_BM25_dict[doc_id]))
        print("entity: " + entity_lines[doc_id], end="")
        print("definition id: {}".format(doc_id))
        print("definition: " + original_definition_lines[doc_id], end="")


In [14]:
boolean_query_BM25("relational database")
print("\n")
boolean_query_BM25("garbage collection")
print("\n")
boolean_query_BM25("retrieval model")

query: relational database
result 1:
score: 4.068414906597672
entity: relational database
definition id: 28234
definition: a database that is modeled using the relational database model a collection of related relations within which each relation has a unique name
result 2:
score: 3.7876375911531026
entity: relational database
definition id: 28205
definition: a database built using the relational database model
result 3:
score: 3.7750544210039623
entity: relational database
definition id: 28312
definition: a collection of related relations in which each relation has a unique name  operational/transactional databases
result 4:
score: 3.7384378453329443
entity: relational model
definition id: 795
definition: - database is a collection of relations - each relation has attributes and a collection of tuples
result 5:
score: 3.682438274725153
entity: relational model
definition id: 771
definition: a database is a collection of relations or tables.


query: garbage collection
result 1:
score:

### Discussion
Briefly discuss the differences you see between the three methods. Is there one you prefer?

For the query "relational database", the numbers of top-5 results whose entity is "relational database" are 2, 2 and 3 for the three methods, respectively.

For the query "garbage collection", the numbers of top-5 results whose entity is "garbage collection" are 4, 3 and 3, respectively.

For the query "retrieval model", none of the three methods returns a top-5 result whose entity is "retrieval model". The top-5 results of VSM-TF-IDF are the same as those of BM25, however. Judging by the meaning contained in the definitions, I think the top-5 results of the simple sum of TF-IDF scores are not as good as those of VSM-TF-IDF and BM25.

Hence, on this dataset, I prefer the BM25 method.

## Bonus: Evaluation (10 points)
Rather than just compare methods by pure observation, there are several metrics to evaluate the performance of an IR engine: Precision, Recall, MAP, NDCG, HitRate and so on. These all require a ground truth set of queries and documents with a notion of **relevance**. These ground truth judgments can be expensive to obtain, so we are cutting corners here and treating a flashcard's front and back as a "relevant" query-document pair.

That is, if a document (definition) in your top-5 results is from the back of the query's (entity's) flashcard, this document is regarded as relevant to the query (entity). Such a document is also called a hit in IR. Based on this ground truth, you could calculate the metrics for the three ranking methods and provide the results like these:

* metric: Precision@5
* TF-IDF - score1
* Vector Space Model with TF-IDF - score2
* BM25 - score3

You could pick any of the reasonable metrics.
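
For instance, Precision@k under this notion of relevance could be computed as in the sketch below. The function name `precision_at_k` and the ranked list are hypothetical; a result counts as a hit when its flashcard entity equals the query.

```python
def precision_at_k(ranked_entities, query, k=5):
    """Fraction of the top-k results whose flashcard entity equals the query."""
    top_k = ranked_entities[:k]
    hits = sum(1 for entity in top_k if entity == query)
    return hits / k


# hypothetical ranked list of result entities for one query
ranked = ["relational database", "relational database", "relational model",
          "relational database", "database schema"]
print(precision_at_k(ranked, "relational database"))  # 0.6
```

Averaging this value over all test queries gives one score per ranking method.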

In [3]:
# your code here
I use Precision@5 as the metric and calculate it by hand.

For the TF-IDF method:
the precision for the query "relational database" is 2/5
the precision for the query "garbage collection" is 4/5
the precision for the query "retrieval model" is 0/5
So the mean Precision@5 for the TF-IDF method is score1=(2/5+4/5+0/5)/3=0.4

For the Vector Space Model with TF-IDF:
the precision for the query "relational database" is 2/5
the precision for the query "garbage collection" is 3/5
the precision for the query "retrieval model" is 0/5
So the mean Precision@5 for the Vector Space Model with TF-IDF is score2=(2/5+3/5+0/5)/3≈0.33

For the BM25 method:
the precision for the query "relational database" is 3/5
the precision for the query "garbage collection" is 3/5
the precision for the query "retrieval model" is 0/5
So the mean Precision@5 for the BM25 method is score3=(3/5+3/5+0/5)/3=0.4

So, score1=score3>score2

Hence, judging only by Precision@5, the TF-IDF and BM25 methods are better than the Vector Space Model with TF-IDF, and TF-IDF is as good as BM25.

But considering the meaning contained in the documents, I think that BM25 is the best, as its results are more relevant than those of TF-IDF.


# Collaboration Declarations

**You should fill out your collaboration declarations here.**

**Reminder:** You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by filling out the Collaboration Declarations at the bottom of this notebook.

Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.