#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 4:  Word Embeddings for Information Retrieval and Query Expansion

### 100 points [5% of your final grade]

### Due: April 28, 2020 by 11:59pm

*Goals of this homework:* In this homework you will improve the information retrieval engine you built in homework 1 by using word embeddings to: (i) directly match the query and the document in the latent semantic space of word embeddings; (ii) expand the original query via word embeddings.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw4.ipynb`. For example, my homework submission would be something like `555001234_hw4.ipynb`. Submit this notebook via eCampus (look for the homework 4 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Part 0. Dataset and Parsing (The same as Homework 1)

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard has an entity on the front and, on the back, a definition that describes or explains that entity. We treat the entities on the fronts of the flashcards as the queries and the definitions on the backs as the documents. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, "query" and "entity" are used interchangeably, as are "document" and "definition".**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity on the front of a flashcard, "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back, and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
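
As a small illustration of the record format described above, the sketch below splits one line into its three fields. The `Flashcard` type and the `parse_line` helper are hypothetical names introduced only for this example; they are not part of the homework starter code.

```python
from typing import NamedTuple


class Flashcard(NamedTuple):
    query: str      # entity on the front of the flashcard
    doc_id: int     # id of the definition
    document: str   # definition on the back of the flashcard


def parse_line(line: str) -> Flashcard:
    """Split one 'query<TAB>document id<TAB>document' record into its fields."""
    # maxsplit=2 keeps any further tabs inside the definition text
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    return Flashcard(query.strip(), int(doc_id), document.strip())


sample = "false positive rate\t686\tfall-out; probability of a false alarm\n"
card = parse_line(sample)
print(card.query, card.doc_id, card.document)
```

Two records are relevant to the same query exactly when their `query` fields are equal, which is how relevance is defined for this dataset.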

For parsing this dataset, you could also just copy your code from homework 1 to complete the following tasks:
* Tokenize documents (definitions) using **whitespace and punctuation as delimiters**.
* Remove stop words: use nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stop_words = set(stopwords.words('english'))  # build the stop word set once
stemmer = PorterStemmer()
corpus = []  # one list of stemmed tokens per document
with open("homework_1_data.txt", encoding='UTF-8') as f:
    for line in f:
        line_list = line.split("\t")
        # keep letters, digits and spaces (the inner ^ are literal, so carets survive too)
        string = re.sub("[^A-Z^a-z^0-9^ ]", " ", line_list[2])

        line_words = nltk.word_tokenize(string)
        filtered_line_words = [w for w in line_words if w not in stop_words]
        line_singles = [stemmer.stem(w) for w in filtered_line_words]
        # drop tokens that consist only of digits
        corpus.append([x for x in line_singles if not x.isdigit()])

# Part 1: Word2Vec (30 points)

In this part you will use the Word2Vec algorithm to generate word embeddings for tokens in the dataset. You can just use a package like https://radimrehurek.com/gensim/models/word2vec.html. Let's set the size of word embeddings to be 20. Please print the word embeddings for the tokens: 
* relational
* database
* garbage
* collection
* retrieval 
* model

In [3]:
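# train Word2Vec on the tokenized corpus with 20-dimensional embeddings;
# min_count=1 keeps every token, so each document token has a vector
# (gensim 4.x renamed the `size` parameter to `vector_size`)
from gensim.models import Word2Vec
model = Word2Vec(corpus, size=20, min_count=1)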

In [4]:
from gensim.models import KeyedVectors

wv = model.wv 
del model 
wv.save('word_vector') 
loaded_wv = KeyedVectors.load('word_vector', mmap='r') 

In [9]:
# print the word embeddings of the six tokens
def print_word_embeddings(word):
    stemmer = PorterStemmer()
    stem = stemmer.stem(word)
    print("The word embedding for:", word)
    print(wv[stem],'\n')

In [10]:
print_word_embeddings('relational')
print_word_embeddings('database')
print_word_embeddings('garbage')
print_word_embeddings('collection')
print_word_embeddings('retrieval')
print_word_embeddings('model')

The word embedding for: relational
[ 2.1185431  -1.7910612   2.3100224  -3.8898535   0.95128983 -1.6378244
  0.1950791   0.29736868 -1.1256076  -0.12825722  1.9100384   1.6876783
  0.15590088  1.1038716   1.5969405  -2.4179933  -0.19025594 -0.22537753
  0.6726891  -0.23532271] 

The word embedding for: database
[ 2.3996198  -0.5639794   0.79490227 -4.1667533   1.0665674  -1.4387202
 -1.287428    0.47218665 -1.8812461   0.61816096  2.1934264   0.5602395
  0.57601064 -0.06206043  1.5491109  -3.4975069   1.4149612  -2.2815917
  2.357336    0.4816189 ] 

The word embedding for: garbage
[ 0.23065002 -0.2146568   0.31055576 -0.17249243  0.24463701 -0.2938342
 -0.09150504  0.325075   -0.2880192   0.16644886  0.28005844 -0.00961642
  0.18499838 -0.00350569  0.04795637  0.00319307 -0.01125147 -0.05349322
  0.37475586  0.25783548] 

The word embedding for: collection
[ 2.3029718   0.05625704  2.0099719  -3.3979063   0.9960767  -1.6817203
 -1.2647372   1.7185093  -1.1941811   0.87113434  1.248264

# Part 2: Vector Space Model via Word Embeddings (40 points) 

In this part, your job is to match the query and the document via the cosine similarity between their embeddings.

Since a query or a document usually contains more than one token, the first challenge is how to aggregate many word embeddings into one embedding of the query or document. There are many ways to do so: 
* Max pooling: return the maximum value along each dimension of a bunch of word embeddings. For example, [1, 3, 4], [2, 1, 5] -> [2, 3, 5].
* Min pooling: return the minimum value along each dimension of a bunch of word embeddings
* Mean pooling: return the mean value along each dimension of a bunch of word embeddings
* Sum: element-wise add a bunch of word embeddings together
* Weighted sum: assign weights to word embeddings and then add them together. Weights could be TF, IDF or TF-IDF.
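
The aggregation methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration on toy vectors, not the required implementation; the `pool` function and its `method` and `weights` parameters are names made up for this example.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def pool(vectors: Sequence[Vector], method: str = "mean",
         weights: Optional[Sequence[float]] = None) -> Vector:
    """Aggregate equal-length word vectors into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "sum":
        return [sum(col) for col in columns]
    if method == "mean":
        return [sum(col) / len(vectors) for col in columns]
    if method == "weighted_sum":
        if weights is None or len(weights) != len(vectors):
            raise ValueError("weighted_sum needs one weight per vector")
        return [sum(w * x for w, x in zip(weights, col)) for col in columns]
    raise ValueError(f"unknown pooling method: {method}")


vectors = [[1, 3, 4], [2, 1, 5]]
print(pool(vectors, "max"))   # [2, 3, 5], the example from the list above
print(pool(vectors, "min"))   # [1, 1, 4]
print(pool(vectors, "mean"))  # [1.5, 2.0, 4.5]
print(pool(vectors, "sum"))   # [3, 4, 9]
print(pool(vectors, "weighted_sum", weights=[2, 1]))  # [4, 7, 13]
```

Note that the mean divides by the number of vectors being pooled, not by the embedding size.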

In [None]:
# your code here

In [11]:
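# preprocess every definition and write it to definition.txt (one document per line);
# the cleaned text is also accumulated in `strings` for the term frequencies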
f = open("homework_1_data.txt", encoding='UTF-8')             
f2 = open("definition.txt", 'w', encoding='UTF-8')

line = f.readline()             
strings = ''
while line:
    line_list = line.split("\t")
    string = re.sub("[^A-Z^a-z^0-9^ ]", " ", line_list[2])
    strings = strings + string

    line_words = nltk.word_tokenize(string)
    filtered_line_words = [line_word for line_word in line_words if line_word not in stopwords.words('english')]
    stemmer = PorterStemmer()
    line_singles = [stemmer.stem(line_plural) for line_plural in filtered_line_words]
    line_singles_no_digits = []
    for x in line_singles:
        if not x.isdigit():
            line_singles_no_digits.append(x)
    # list -> string
    list_to_string = " ".join(line_singles_no_digits)
    f2.write(list_to_string+"\n")

    line = f.readline()

f.close()
f2.close()


In [12]:
words = nltk.word_tokenize(strings)
filtered_words = [word for word in words if word not in stopwords.words('english')]
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in filtered_words]

singles_no_digits = []
for x in singles:
    if not x.isdigit():
        singles_no_digits.append(x)

term_freq_dist = nltk.FreqDist(singles_no_digits)

In [13]:
stemmer = PorterStemmer()
print(stemmer.stem('relational'))
print(stemmer.stem('database'))
print(stemmer.stem('garbage'))
print(stemmer.stem('collection'))
print(stemmer.stem('retrieval'))
print(stemmer.stem('model'))

relat
databas
garbag
collect
retriev
model


In [21]:
# save in list_1 the line indices (used as docIDs) of the documents for 'relational database'
f = open("homework_1_data.txt", encoding='UTF-8')
line = f.readline()
ID = 0
list_1 = []
while line:
    line_list = line.split("\t")
    if "relational database" == line_list[0]:
        list_1.append(ID)
        ID += 1
    else:
        ID += 1
        
    line = f.readline()
   
f.close()
print("The docID of 'relational database':") 
print(list_1)

The docID of 'relational database':
[28128, 28129, 28130, 28131, 28132, 28133, 28134, 28135, 28136, 28137, 28138, 28139, 28140, 28141, 28142, 28143, 28144, 28145, 28146, 28147, 28148, 28149, 28150, 28151, 28152, 28153, 28154, 28155, 28156, 28157, 28158, 28159, 28160, 28161, 28162, 28163, 28164, 28165, 28166, 28167, 28168, 28169, 28170, 28171, 28172, 28173, 28174, 28175, 28176, 28177, 28178, 28179, 28180, 28181, 28182, 28183, 28184, 28185, 28186, 28187, 28188, 28189, 28190, 28191, 28192, 28193, 28194, 28195, 28196, 28197, 28198, 28199, 28200, 28201, 28202, 28203, 28204, 28205, 28206, 28207, 28208, 28209, 28210, 28211, 28212, 28213, 28214, 28215, 28216, 28217, 28218, 28219, 28220, 28221, 28222, 28223, 28224, 28225, 28226, 28227, 28228, 28229, 28230, 28231, 28232, 28233, 28234, 28235, 28236, 28237, 28238, 28239, 28240, 28241, 28242, 28243, 28244, 28245, 28246, 28247, 28248, 28249, 28250, 28251, 28252, 28253, 28254, 28255, 28256, 28257, 28258, 28259, 28260, 28261, 28262, 28263, 28264, 2826

In [22]:
# save in list_2 the line indices (used as docIDs) of the documents for 'garbage collection'
f = open("homework_1_data.txt", encoding='UTF-8')
line = f.readline()
ID = 0
list_2 = []
while line:
    line_list = line.split("\t")
    if "garbage collection" == line_list[0]:
        list_2.append(ID)
        ID += 1
    else:
        ID += 1
        
    line = f.readline()
   
f.close()
print("The docID of 'garbage collection':") 
print(list_2)

The docID of 'garbage collection':
[21543, 21544, 21545, 21546, 21547, 21548, 21549, 21550, 21551, 21552, 21553, 21554, 21555, 21556, 21557, 21558, 21559, 21560, 21561, 21562, 21563, 21564, 21565, 21566, 21567, 21568, 21569, 21570, 21571, 21572, 21573, 21574, 21575, 21576, 21577, 21578, 21579, 21580]


In [23]:
# save in list_3 the line indices (used as docIDs) of the documents for 'retrieval model'
f = open("homework_1_data.txt", encoding='UTF-8')
line = f.readline()
ID = 0
list_3 = []
while line:
    line_list = line.split("\t")
    if "retrieval model" == line_list[0]:
        list_3.append(ID)
        ID += 1
    else:
        ID += 1
        
    line = f.readline()
   
f.close()
print("The docID of 'retrieval model':") 
print(list_3)

The docID of 'retrieval model':
[13961, 13962]


In [25]:
# define the function of cosine similarity
import numpy as np
def cos_sim(vector_a, vector_b):
    # accept plain lists as well as numpy arrays
    vector_a = np.asarray(vector_a)
    vector_b = np.asarray(vector_b)
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # a zero vector has no direction; define its similarity as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return vector_a.dot(vector_b) / (norm_a * norm_b)

In [28]:
# construct the Max pooling word embeddings for documents
# save it in doc_Max_pooling
f = open("definition.txt", encoding='UTF-8')
line = f.readline()
doc_Max_pooling = []
while line:
    line = line.strip()    
    line_list = line.split()
    if line_list:
        temp = list(wv[line_list[0]])
        for i in range(len(line_list)-1):
            for j in range(20):
                temp[j] = max(temp[j], wv[line_list[i+1]][j])
        
        doc_Max_pooling.append(temp)
        line = f.readline()
    else:
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        doc_Max_pooling.append(temp)
        line = f.readline()    
   
f.close()


In [49]:
def print_precision_Max(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append(max(wv[word1][i], wv[word2][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Max_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Max_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    precision = len(merge_list)/10

    print(precision)

In [50]:
print("Max_pooling method:")
print("Precision@10 for query: relational database:   ", end='')
print_precision_Max("relat", "databas", list_1)
print("Precision@10 for query: garbage collection:    ", end='')
print_precision_Max("garbag", "collect", list_2)
print("Precision@10 for query: retrieval model:       ", end='')
print_precision_Max("retriev", "model", list_3)

Max_pooling method:
Precision@10 for query: relational database:   0.2
Precision@10 for query: garbage collection:    0.0
Precision@10 for query: retrieval model:       0.0


In [45]:
# construct the Min pooling word embeddings for documents
# save it in doc_Min_pooling
f = open("definition.txt", encoding='UTF-8')
line = f.readline()
doc_Min_pooling = []
while line:
    line = line.strip()    
    line_list = line.split()
    if line_list:
        temp = list(wv[line_list[0]])
        for i in range(len(line_list)-1):
            for j in range(20):
                temp[j] = min(temp[j], wv[line_list[i+1]][j])
        
        doc_Min_pooling.append(temp)
        line = f.readline()
    else:
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        doc_Min_pooling.append(temp)
        line = f.readline()    
   
f.close()


In [51]:
def print_precision_Min(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append(min(wv[word1][i], wv[word2][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Min_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Min_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    precision = len(merge_list)/10

    print(precision)

In [52]:
print("Min_pooling method:")
print("Precision@10 for query: relational database:   ", end='')
print_precision_Min("relat", "databas", list_1)
print("Precision@10 for query: garbage collection:    ", end='')
print_precision_Min("garbag", "collect", list_2)
print("Precision@10 for query: retrieval model:       ", end='')
print_precision_Min("retriev", "model", list_3)

Min_pooling method:
Precision@10 for query: relational database:   0.3
Precision@10 for query: garbage collection:    0.0
Precision@10 for query: retrieval model:       0.0


In [48]:
# construct the Mean pooling word embeddings for documents
# save it in doc_Mean_pooling
f = open("definition.txt", encoding='UTF-8')
line = f.readline()
doc_Mean_pooling = []
while line:
    line = line.strip()    
    line_list = line.split()
    if line_list:        
        # mean pooling divides by the number of tokens in the document,
        # not by the embedding size (cosine rankings are unaffected by the scale)
        temp = [0.0] * 20
        for token in line_list:
            for j in range(20):
                temp[j] = temp[j] + wv[token][j] / len(line_list)
        
        doc_Mean_pooling.append(temp)
        line = f.readline()
    else:
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        doc_Mean_pooling.append(temp)
        line = f.readline()    
   
f.close()

In [53]:
def print_precision_Mean(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append((wv[word1][i] + wv[word2][i])/2)
    
    cos_sim_list = []
    for i in range(len(doc_Mean_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Mean_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    precision = len(merge_list)/10

    print(precision)

In [54]:
print("Mean_pooling method:")
print("Precision@10 for query: relational database:   ", end='')
print_precision_Mean("relat", "databas", list_1)
print("Precision@10 for query: garbage collection:    ", end='')
print_precision_Mean("garbag", "collect", list_2)
print("Precision@10 for query: retrieval model:       ", end='')
print_precision_Mean("retriev", "model", list_3)

Mean_pooling method:
Precision@10 for query: relational database:   0.6
Precision@10 for query: garbage collection:    0.0
Precision@10 for query: retrieval model:       0.0


In [55]:
# construct the Sum word embeddings for documents
# save it in doc_Sum
f = open("definition.txt", encoding='UTF-8')
line = f.readline()
doc_Sum = []
while line:
    line = line.strip()    
    line_list = line.split()
    if line_list:        
        temp = list(wv[line_list[0]])
        for i in range(len(line_list)-1):
            for j in range(20):
                temp[j] = temp[j] + wv[line_list[i+1]][j]
        
        doc_Sum.append(temp)
        line = f.readline()
    else:
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        doc_Sum.append(temp)
        line = f.readline()    
   
f.close()

In [56]:
def print_precision_Sum(word1, word2, truth):  
    query_list = []
    for i in range(20):
        query_list.append(wv[word1][i] + wv[word2][i])
    
    cos_sim_list = []
    for i in range(len(doc_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Sum[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    precision = len(merge_list)/10

    print(precision)

In [57]:
print("Sum method:")
print("Precision@10 for query: relational database:   ", end='')
print_precision_Sum("relat", "databas", list_1)
print("Precision@10 for query: garbage collection:    ", end='')
print_precision_Sum("garbag", "collect", list_2)
print("Precision@10 for query: retrieval model:       ", end='')
print_precision_Sum("retriev", "model", list_3)

Sum method:
Precision@10 for query: relational database:   0.6
Precision@10 for query: garbage collection:    0.0
Precision@10 for query: retrieval model:       0.0


In [58]:
# construct the Weighted Sum word embeddings for documents
# save it in doc_Weighted_Sum
f = open("definition.txt", encoding='UTF-8')
line = f.readline()
doc_Weighted_Sum = []
while line:
    line = line.strip()    
    line_list = line.split()
    if line_list:        
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        for k in range(20):
            temp[k] = term_freq_dist[line_list[0]] * wv[line_list[0]][k]
        for i in range(len(line_list)-1):
            for j in range(20):
                temp[j] = temp[j] + term_freq_dist[line_list[i+1]] * wv[line_list[i+1]][j]                
        
        doc_Weighted_Sum.append(temp)
        line = f.readline()
    else:
        temp = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        doc_Weighted_Sum.append(temp)
        line = f.readline()    
   
f.close()

In [59]:
def print_precision_Weighted_Sum(word1, word2, truth):  
    query_list = []
    for i in range(20):
        query_list.append(term_freq_dist[word1] * wv[word1][i] + term_freq_dist[word2] * wv[word2][i])
    
    cos_sim_list = []
    for i in range(len(doc_Weighted_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Weighted_Sum[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    precision = len(merge_list)/10

    print(precision)

In [60]:
print("Weighted Sum method:")
print("Precision@10 for query: relational database:   ", end='')
print_precision_Weighted_Sum("relat", "databas", list_1)
print("Precision@10 for query: garbage collection:    ", end='')
print_precision_Weighted_Sum("garbag", "collect", list_2)
print("Precision@10 for query: retrieval model:       ", end='')
print_precision_Weighted_Sum("retriev", "model", list_3)

Weighted Sum method:
Precision@10 for query: relational database:   0.2
Precision@10 for query: garbage collection:    0.0
Precision@10 for query: retrieval model:       0.0


Try different aggregation methods and report the precision@10 for these queries:
* query: relational database
* query: garbage collection
* query: retrieval model
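
For reference, precision@10 is the fraction of the 10 highest-ranked documents that are relevant to the query. The sketch below ranks toy 2-dimensional document vectors by cosine similarity and computes precision@k; `cosine` and `precision_at_k` are illustrative helper names, and the vectors are made up rather than taken from the dataset.

```python
import math
from typing import Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def precision_at_k(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   relevant: Iterable[int],
                   k: int = 10) -> float:
    """Rank documents by cosine similarity; return the relevant fraction of the top k."""
    scores = [(cosine(query_vec, doc), doc_id) for doc_id, doc in enumerate(doc_vecs)]
    ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
    top_k = [doc_id for _, doc_id in ranked[:k]]
    return len(set(top_k) & set(relevant)) / k


# toy data: document ids are list positions, as in the cells above
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
print(precision_at_k(toy_query, toy_docs, relevant=[0, 1], k=2))  # 0.5
```

Here the top 2 documents are 0 and 2, and only document 0 is relevant, so precision@2 is 0.5. The same function works for any pooling method, since only the query and document vectors change.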

### Discussion
Among these aggregation methods, which one is the best and which one is the worst?

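| Aggregation | relational database | garbage collection | retrieval model |
|---|---|---|---|
| Max pooling | 0.2 | 0.0 | 0.0 |
| Min pooling | 0.3 | 0.0 | 0.0 |
| Mean pooling | 0.6 | 0.0 | 0.0 |
| Sum | 0.6 | 0.0 | 0.0 |
| Weighted sum | 0.2 | 0.0 | 0.0 |

Mean pooling and Sum are the best (precision@10 of 0.6 for "relational database"), and Max pooling and Weighted sum are the worst (0.2). Mean pooling and Sum necessarily tie, because the mean vector is the sum vector scaled by a positive constant and cosine similarity ignores vector length. No method retrieves a relevant document in the top 10 for "garbage collection" or "retrieval model", so the comparison rests on a single query.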

# Part 3: Query Expansion via Word Embeddings (30 points) 
Remember the hardest query, "retrieval model", in homework 1? Because no document in the dataset contains "retrieval model", you cannot retrieve any documents by Boolean matching. Now it is time for your "revenge" via query expansion.

In this part, your job is to expand an original query such as "retrieval model" by adding semantically similar words (e.g., "search"), which are selected from all tokens in the dataset.

There are many ways to do so. For this part, we want you to calculate the cosine similarity between each of the original query tokens and the other tokens based on their word embeddings.

First, please find the top 3 similar tokens for:
* relational
* database
* garbage
* collection
* retrieval 
* model
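
To make the similarity search concrete, here is a minimal sketch that scores every other token against a query token by cosine similarity and keeps the top matches. The `most_similar` helper and the 2-dimensional `toy_embeddings` are assumptions made for illustration; real embeddings would come from the trained Word2Vec model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(token: str,
                 embeddings: Dict[str, Sequence[float]],
                 topn: int = 3) -> List[Tuple[str, float]]:
    """Return the topn (token, similarity) pairs closest to `token`, excluding itself."""
    target = embeddings[token]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != token]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up 2-dimensional vectors keyed by stemmed tokens
toy_embeddings = {
    "retriev": [1.0, 0.0],
    "search": [0.9, 0.1],
    "store": [0.5, 0.5],
    "fan": [0.0, 1.0],
}
print(most_similar("retriev", toy_embeddings, topn=2))
```

In this toy vocabulary the two nearest tokens to "retriev" are "search" and "store", in that order.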

In [65]:
# your code here
print("Top 3 similar tokens and their similarities")
print("relational:")
print(wv.similar_by_word("relat", topn=3))
print("database:")
print(wv.similar_by_word("databas", topn=3))
print("garbage:")
print(wv.similar_by_word("garbag", topn=3))
print("collection:")
print(wv.similar_by_word("collect", topn=3))
print("retrieval:")
print(wv.similar_by_word("retriev", topn=3))
print("model:")
print(wv.similar_by_word("model", topn=3))

Top 3 similar tokens and their similarities
relational:
[('entiti', 0.9473505020141602), ('certifi', 0.9239091873168945), ('tabl', 0.9139374494552612)]
database:
[('dbm', 0.9263573884963989), ('od', 0.9236255288124084), ('cleans', 0.9000288248062134)]
garbage:
[('later', 0.9856059551239014), ('reorgan', 0.9844050407409668), ('fan', 0.9817262887954712)]
collection:
[('repositori', 0.9401402473449707), ('warehous', 0.9266716241836548), ('gather', 0.9155275821685791)]
retrieval:
[('store', 0.9658225774765015), ('updat', 0.9093206524848938), ('warehous', 0.904878556728363)]
model:
[('mathemat', 0.9043813943862915), ('formal', 0.8658748865127563), ('conceptu', 0.8592434525489807)]


Second, please add these similar tokens to the original query and redo the **vector space model** in Part 2.
* query: relational database
* query: garbage collection
* query: retrieval model
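
One possible way to build the expanded query is sketched below: keep the original tokens first, then append each token's similar tokens while skipping duplicates. The `expand_query` helper is a hypothetical name, and the neighbor lists are copied from the top-3 output printed earlier in this notebook. The expanded token list can then be pooled and ranked exactly as in Part 2.

```python
from typing import Dict, List, Sequence


def expand_query(tokens: Sequence[str], neighbors: Dict[str, List[str]]) -> List[str]:
    """Original tokens first, then their similar tokens, without duplicates."""
    expanded = list(dict.fromkeys(tokens))  # de-duplicate while keeping order
    for token in tokens:
        for candidate in neighbors.get(token, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# top-3 similar (stemmed) tokens, copied from the printed output
neighbors = {
    "retriev": ["store", "updat", "warehous"],
    "model": ["mathemat", "formal", "conceptu"],
    "collect": ["repositori", "warehous", "gather"],
}
print(expand_query(["retriev", "model"], neighbors))
```

Skipping duplicates matters when two query tokens share a neighbor (for example, "warehous" appears for both "retriev" and "collect"), so that a shared token is not counted twice by sum-based pooling.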

In [None]:
# your code here

In [71]:
def print_recall_Max(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append(max(wv[word1][i], wv[word2][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Max_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Max_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)

    
print("Max_pooling method:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Max("relat", "databas", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Max("garbag", "collect", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Max("retriev", "model", list_3)

Max_pooling method:
Recall@10 for query: relational database:   0.007042253521126761
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [72]:
def print_recall_Min(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append(min(wv[word1][i], wv[word2][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Min_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Min_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)


print("Min_pooling method:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Min("relat", "databas", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Min("garbag", "collect", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Min("retriev", "model", list_3)

Min_pooling method:
Recall@10 for query: relational database:   0.01056338028169014
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [73]:
def print_recall_Mean(word1, word2, truth):
    query_list = []
    for i in range(20):
        query_list.append((wv[word1][i] + wv[word2][i])/2)
    
    cos_sim_list = []
    for i in range(len(doc_Mean_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Mean_pooling[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Mean_pooling method:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Mean("relat", "databas", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Mean("garbag", "collect", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Mean("retriev", "model", list_3)

Mean_pooling method:
Recall@10 for query: relational database:   0.02112676056338028
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [74]:
def print_recall_Sum(word1, word2, truth):  
    query_list = []
    for i in range(20):
        query_list.append(wv[word1][i] + wv[word2][i])
    
    cos_sim_list = []
    for i in range(len(doc_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Sum[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Sum method:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Sum("relat", "databas", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Sum("garbag", "collect", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Sum("retriev", "model", list_3)

Sum method:
Recall@10 for query: relational database:   0.02112676056338028
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [75]:
def print_recall_Weighted_Sum(word1, word2, truth):  
    query_list = []
    for i in range(20):
        query_list.append(term_freq_dist[word1] * wv[word1][i] + term_freq_dist[word2] * wv[word2][i])
    
    cos_sim_list = []
    for i in range(len(doc_Weighted_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Weighted_Sum[i]))
    
    cos_sim_dict = {}
    for i in range(len(cos_sim_list)):
        cos_sim_dict[i] = cos_sim_list[i]
    sorted_cos_sim_dict = dict(sorted(cos_sim_dict.items(), key=lambda d: d[1], reverse=True))
    sorted_keys_list = list(sorted_cos_sim_dict.keys())
    top10_list = []
    for i in range(10):
        top10_list.append(sorted_keys_list[i])

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Weighted Sum method:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Weighted_Sum("relat", "databas", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Weighted_Sum("garbag", "collect", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Weighted_Sum("retriev", "model", list_3)

Weighted Sum method:
Recall@10 for query: relational database:   0.007042253521126761
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


#### -------------------------------------------------------Expansion--------------------------------------------------------------------------

In [76]:
def print_recall_Max_Expansion(word1, word2, word3, word4, word5, word6, word7, word8, truth):
    query_list = []
    for i in range(20):
        query_list.append(max(wv[word1][i], wv[word2][i], wv[word3][i], wv[word4][i], wv[word5][i], wv[word6][i], wv[word7][i], wv[word8][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Max_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Max_pooling[i]))
    
    # Rank document indices by cosine similarity (highest first) and keep the top 10.
    # Slicing avoids an IndexError when fewer than 10 documents are available.
    sorted_keys_list = sorted(range(len(cos_sim_list)), key=lambda d: cos_sim_list[d], reverse=True)
    top10_list = sorted_keys_list[:10]

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)

    
print("Max_pooling method after Expansion:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Max_Expansion("relat", "databas", "entiti", "certifi", "tabl", "dbm", "od", "cleans", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Max_Expansion("garbag", "collect", "later", "reorgan", "fan", "repositori", "warehous", "gather", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Max_Expansion("retriev", "model", "store", "updat", "warehous", "mathemat", "formal", "conceptu", list_3)

Max_pooling method after Expansion:
Recall@10 for query: relational database:   0.02112676056338028
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [78]:
def print_recall_Min_Expansion(word1, word2, word3, word4, word5, word6, word7, word8, truth):
    query_list = []
    for i in range(20):
        query_list.append(min(wv[word1][i], wv[word2][i], wv[word3][i], wv[word4][i], wv[word5][i], wv[word6][i], wv[word7][i], wv[word8][i]))
    
    cos_sim_list = []
    for i in range(len(doc_Min_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Min_pooling[i]))
    
    # Rank document indices by cosine similarity (highest first) and keep the top 10.
    # Slicing avoids an IndexError when fewer than 10 documents are available.
    sorted_keys_list = sorted(range(len(cos_sim_list)), key=lambda d: cos_sim_list[d], reverse=True)
    top10_list = sorted_keys_list[:10]

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)


print("Min_pooling method after Expansion:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Min_Expansion("relat", "databas", "entiti", "certifi", "tabl", "dbm", "od", "cleans", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Min_Expansion("garbag", "collect", "later", "reorgan", "fan", "repositori", "warehous", "gather", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Min_Expansion("retriev", "model", "store", "updat", "warehous", "mathemat", "formal", "conceptu", list_3)

Min_pooling method after Expansion:
Recall@10 for query: relational database:   0.017605633802816902
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [79]:
def print_recall_Mean_Expansion(word1, word2, word3, word4, word5, word6, word7, word8, truth):
    query_list = []
    for i in range(20):
        query_list.append((wv[word1][i] + wv[word2][i] + wv[word3][i] + wv[word4][i] + wv[word5][i] + wv[word6][i] + wv[word7][i] + wv[word8][i])/8)
    
    cos_sim_list = []
    for i in range(len(doc_Mean_pooling)):
        cos_sim_list.append(cos_sim(query_list, doc_Mean_pooling[i]))
    
    # Rank document indices by cosine similarity (highest first) and keep the top 10.
    # Slicing avoids an IndexError when fewer than 10 documents are available.
    sorted_keys_list = sorted(range(len(cos_sim_list)), key=lambda d: cos_sim_list[d], reverse=True)
    top10_list = sorted_keys_list[:10]

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Mean_pooling method after Expansion:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Mean_Expansion("relat", "databas", "entiti", "certifi", "tabl", "dbm", "od", "cleans", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Mean_Expansion("garbag", "collect", "later", "reorgan", "fan", "repositori", "warehous", "gather", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Mean_Expansion("retriev", "model", "store", "updat", "warehous", "mathemat", "formal", "conceptu", list_3)

Mean_pooling method after Expansion:
Recall@10 for query: relational database:   0.028169014084507043
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [80]:
def print_recall_Sum_Expansion(word1, word2, word3, word4, word5, word6, word7, word8, truth): 
    query_list = []
    for i in range(20):
        query_list.append(wv[word1][i] + wv[word2][i] + wv[word3][i] + wv[word4][i] + wv[word5][i] + wv[word6][i] + wv[word7][i] + wv[word8][i])
    
    cos_sim_list = []
    for i in range(len(doc_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Sum[i]))
    
    # Rank document indices by cosine similarity (highest first) and keep the top 10.
    # Slicing avoids an IndexError when fewer than 10 documents are available.
    sorted_keys_list = sorted(range(len(cos_sim_list)), key=lambda d: cos_sim_list[d], reverse=True)
    top10_list = sorted_keys_list[:10]

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Sum method after Expansion:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Sum_Expansion("relat", "databas", "entiti", "certifi", "tabl", "dbm", "od", "cleans", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Sum_Expansion("garbag", "collect", "later", "reorgan", "fan", "repositori", "warehous", "gather", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Sum_Expansion("retriev", "model", "store", "updat", "warehous", "mathemat", "formal", "conceptu", list_3)

Sum method after Expansion:
Recall@10 for query: relational database:   0.028169014084507043
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


In [81]:
def print_recall_Weighted_Sum_Expansion(word1, word2, word3, word4, word5, word6, word7, word8, truth): 
    query_list = []
    for i in range(20):
        query_list.append(term_freq_dist[word1] * wv[word1][i] + term_freq_dist[word2] * wv[word2][i] + term_freq_dist[word3] * wv[word3][i] + term_freq_dist[word4] * wv[word4][i] + term_freq_dist[word5] * wv[word5][i] + term_freq_dist[word6] * wv[word6][i] + term_freq_dist[word7] * wv[word7][i] + term_freq_dist[word8] * wv[word8][i])
    
    cos_sim_list = []
    for i in range(len(doc_Weighted_Sum)):
        cos_sim_list.append(cos_sim(query_list, doc_Weighted_Sum[i]))
    
    # Rank document indices by cosine similarity (highest first) and keep the top 10.
    # Slicing avoids an IndexError when fewer than 10 documents are available.
    sorted_keys_list = sorted(range(len(cos_sim_list)), key=lambda d: cos_sim_list[d], reverse=True)
    top10_list = sorted_keys_list[:10]

    merge_list = list(set(truth).intersection(set(top10_list)))
    recall = len(merge_list)/len(truth)

    print(recall)
    
    
print("Weighted Sum method after Expansion:")
print("Recall@10 for query: relational database:   ", end='')
print_recall_Weighted_Sum_Expansion("relat", "databas", "entiti", "certifi", "tabl", "dbm", "od", "cleans", list_1)
print("Recall@10 for query: garbage collection:    ", end='')
print_recall_Weighted_Sum_Expansion("garbag", "collect", "later", "reorgan", "fan", "repositori", "warehous", "gather", list_2)
print("Recall@10 for query: retrieval model:       ", end='')
print_recall_Weighted_Sum_Expansion("retriev", "model", "store", "updat", "warehous", "mathemat", "formal", "conceptu", list_3)

Weighted Sum method after Expansion:
Recall@10 for query: relational database:   0.02464788732394366
Recall@10 for query: garbage collection:    0.0
Recall@10 for query: retrieval model:       0.0


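Report recall@10 before the query expansion:

| Method | relational database | garbage collection | retrieval model |
|:--|:--|:--|:--|
| Max | 0.007042253521126761 | 0 | 0 |
| Min | 0.01056338028169014 | 0 | 0 |
| Mean | 0.02112676056338028 | 0 | 0 |
| Sum | 0.02112676056338028 | 0 | 0 |
| Weighted Sum | 0.007042253521126761 | 0 | 0 |

Report recall@10 after the query expansion:

| Method | relational database | garbage collection | retrieval model |
|:--|:--|:--|:--|
| Max | 0.02112676056338028 | 0 | 0 |
| Min | 0.017605633802816902 | 0 | 0 |
| Mean | 0.028169014084507043 | 0 | 0 |
| Sum | 0.028169014084507043 | 0 | 0 |
| Weighted Sum | 0.02464788732394366 | 0 | 0 |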
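
The `print_recall_*` functions above differ only in how the query embedding is built; the ranking and recall steps are the same each time. As a minimal sketch, those shared steps could be factored into one helper. The name `recall_at_k_from_scores` and the toy scores below are hypothetical and are not part of the notebook.

```python
def recall_at_k_from_scores(scores, truth, k=10):
    """Fraction of the relevant document ids found among the k highest-scoring documents.

    scores: list where scores[d] is the similarity of document d to the query.
    truth:  iterable of relevant document ids.
    """
    # Rank document ids by score, highest first (ties keep their original order).
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k.intersection(truth)) / len(truth)


# Toy data (made up): five documents, of which documents 1 and 2 are relevant.
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
toy_truth = [1, 2]
print(recall_at_k_from_scores(toy_scores, toy_truth, k=2))
```

With such a helper, each pooling method would only need to build its query vector and its list of cosine similarities before calling it.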

### Discussion
Why do we measure recall here instead of precision or NDCG?

Should the tokens added for expansion have the same importance as the original query tokens? If not, how can we improve the query expansion in this part?

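In the dataset, some queries, such as "retrieval model", have only two relevant lines (13961 and 13962). Even a perfect ranking could then reach a precision@10 of only 2/10 = 0.2, so it is not reasonable to compute precision@10 or NDCG@10 for this query, whereas recall@10 for this query can still reach 1.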
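
As an illustration of why precision@10 is uninformative for a query with only two relevant lines, the sketch below compares precision@10 and recall@10 for a perfect ranking. The function names and the ranking are hypothetical; only the two line numbers come from the text above.

```python
def precision_at_k(ranked, truth, k=10):
    """Fraction of the top-k ranked ids that are relevant."""
    hits = len(set(ranked[:k]).intersection(truth))
    return hits / k


def recall_at_k(ranked, truth, k=10):
    """Fraction of the relevant ids that appear in the top-k ranked ids."""
    hits = len(set(ranked[:k]).intersection(truth))
    return hits / len(truth)


truth = [13961, 13962]
# Hypothetical perfect ranking: both relevant lines first, then eight non-relevant ones.
perfect_ranking = [13961, 13962] + list(range(8))

print(precision_at_k(perfect_ranking, truth))  # 0.2
print(recall_at_k(perfect_ranking, truth))     # 1.0
```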

The tokens added for expansion should not have the same importance as the original query tokens. When aggregating the word embeddings into a single query embedding, each expansion token can be given a weight, such as its similarity to the original query token.
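
A minimal sketch of this weighting idea, in plain Python with made-up 3-dimensional vectors. The function `weighted_query_embedding`, the toy embeddings, and the choice to clip negative similarities to zero are assumptions for illustration. Original tokens keep weight 1, and each expansion token is weighted by its cosine similarity to the original token it expands.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_query_embedding(vectors, expansions):
    """Weighted mean of original and expansion token embeddings.

    vectors:    dict mapping token -> embedding (list of floats).
    expansions: dict mapping each original query token -> list of its expansion tokens.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for original, added in expansions.items():
        weighted_tokens = [(original, 1.0)]  # original tokens get full weight
        for token in added:
            # Expansion tokens are down-weighted; negative similarity is clipped to 0.
            weight = max(cosine(vectors[original], vectors[token]), 0.0)
            weighted_tokens.append((token, weight))
        for token, weight in weighted_tokens:
            for i in range(dim):
                total[i] += weight * vectors[token][i]
            weight_sum += weight
    return [x / weight_sum for x in total]


# Toy embeddings (made up): "tabl" is close to "databas", "cleans" is orthogonal to it.
toy_vectors = {
    "databas": [1.0, 0.0, 0.0],
    "tabl": [0.8, 0.6, 0.0],
    "cleans": [0.0, 0.0, 1.0],
}
query = weighted_query_embedding(toy_vectors, {"databas": ["tabl", "cleans"]})
print(query)
```

In this toy case the unrelated token "cleans" receives weight 0 and does not move the query vector, while "tabl" contributes with weight of about 0.8. Dividing by the total weight is optional when the result is only used for cosine similarity, which ignores vector length.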