#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 4:  Word Embeddings for Information Retrieval and Query Expansion

### 100 points [5% of your final grade]

### Due: April 28, 2020 by 11:59pm

*Goals of this homework:* In this homework you will improve your information retrieval engine in homework 1 by word embeddings to: (i) directly match the query and the document in the latent semantic space of word embeddings; (ii) expand the original query via word embeddings.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw4.ipynb`. For example, my homework submission would be something like `555001234_hw4.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Part 0. Dataset and Parsing (The same as Homework 1)

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generated their own flashcards. Each flashcard generated by a user is made up of an entity on the front and a definition describing or explaining the entity correspondingly on the back. We treat entities on each flashcard's front as the queries and the definitions on the back of flashcards as the documents. Definitions (documents) are relevant to an entity (query) if the definitions are from the back of the entity's flashcard; otherwise definitions are not relevant. **In this homework, queries and entities are interchangeable as well as documents and definitions.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".

For parsing this dataset, you could also just copy your code from homework 1 to complete the following tasks:
* Tokenize documents (definitions) using **whitespaces and punctuations as delimiters**.
* Remove stop words: use nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or nosiy.

In [36]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer 
from nltk.tokenize import RegexpTokenizer
import string
import re 
ps = PorterStemmer() 
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# PART 1 : REMOVE STOP WORDS

def remove_stop_words(content_tokens):
    
    # stopwords is used to get a list of all the stopwords in english language   
    stop_words = set(stopwords.words('english')) 

    # now from all the tokens, stopwords are removed
    content_tokens_without_stopwords = []
    temp=[]
    for tokens in content_tokens:
        temp=[]
        for token in tokens:
            if token not in stop_words:
                temp.append(token)
        content_tokens_without_stopwords.append(temp)
    
    all_tokens = []
    for tokens in content_tokens_without_stopwords:
        for token in tokens:
            all_tokens.append(token)
    print('Total number of tokens after removing stopwords : ',len(all_tokens))
    print('Number of unique words after removing stopwords : ',len(set(all_tokens)))
    return content_tokens_without_stopwords


In [3]:
# PART 2 : STEMMING

def stemming(content_tokens):
    stemmed_tokens = []
    ps = PorterStemmer() 
    for tokens in content_tokens:
        temp=[]
        for token in tokens:
            temp.append(ps.stem(token))
        stemmed_tokens.append(temp)
#     print(stemmed_tokens)
    
    all_tokens = []
    for tokens in stemmed_tokens:
        for token in tokens:
            all_tokens.append(token)
    print('Total number of tokens after stemming : ',len((all_tokens)))
    print('Number of unique tokens after stemming : ',len(set(all_tokens)))
#     print(stemmed_tokens)
    return stemmed_tokens
    

In [4]:
# PART 3 : REMOVE LESS INFORMATIVE AND NOISY STRINGS

def remove_noise(contnet_tokens):  
    tokens_no_noise=[]
    unn = []
    for tokens in contnet_tokens:
        temp=[]
        for token in tokens:
            if(token!='' and len(token)>1):
                temp.append(token.lower())
        tokens_no_noise.append(temp)
        
    all_tokens = []
    for tokens in tokens_no_noise:
        for token in tokens:
            all_tokens.append(token)
          
    print('Total number of tokens after removing extra noise : ',len(all_tokens))
    print('Number of unique tokens after removing extra noise : ',len(set(all_tokens)))
  
    return tokens_no_noise


In [5]:
# configuration options
remove_stopwords = True  # or false
use_stemming = True # or false
remove_otherNoise = True # or false

In [6]:
# Your parser function here. It will take the three option variables above as the parameters
# add cells as needed to organize your code

# read the entire document into content variable
with open('homework_1_data.txt','r',encoding='utf8') as text:
    content = text.read()
    
# store each flashcard seperately by splitting the document using newline
list_sentences = content.split('\n')

# now for each flashcard, seperate the query, doc id and doc by spitting with \t
sub_sentences = []
for sentence in list_sentences:
    sub_sentences.append(sentence.split('\t'))

definitions=[]
doc_id = []
entity = []
for sub in sub_sentences:
    if len(sub)==3:
        entity.append(sub[0])
        doc_id.append(sub[1])
        definitions.append((sub[2]))
    else:
        entity.append('#')
        doc_id.append('#')
        definitions.append('#')


def parser_function(remove_stopwords,use_stemming,remove_otherNoise): 

    # remove punctuations 
#     for d in range(0,len(definitions)):
#         definitions[d] = re.sub(r'[^\w\s]','',definitions[d])
        
    def_tokens = []
    for d in definitions:
        def_tokens.append(re.split('\W+',d))

    all_tokens = []
    for tokens in def_tokens:
        for token in tokens:
            all_tokens.append(token)
    
    print('Total number of tokens : ',len(all_tokens))
    print('Number of unique words : ',len(set(all_tokens)))
    
    if remove_stopwords == True:
        token_without_stop_words = remove_stop_words(def_tokens)
    if use_stemming == True:
        stopped_stemmed_tokens = stemming(token_without_stop_words)
#         print(stopped_stemmed_tokens)
    if remove_otherNoise == True:
        final_tokens = remove_noise(stopped_stemmed_tokens)
        
    return final_tokens

In [7]:
final_tokens = parser_function(remove_stopwords,use_stemming,remove_otherNoise)

Total number of tokens :  631067
Number of unique words :  15939
Total number of tokens after removing stopwords :  388439
Number of unique words after removing stopwords :  15799
Total number of tokens after stemming :  388439
Number of unique tokens after stemming :  9921
Total number of tokens after removing extra noise :  368309
Number of unique tokens after removing extra noise :  9866


In [8]:
(final_tokens)

[['estim',
  'durat',
  'cost',
  'made',
  'compon',
  'separ',
  'combin',
  'provid',
  'overal',
  'figur'],
 ['pattern',
  'process',
  'deduct',
  'approach',
  'use',
  'understand',
  'process',
  'predict',
  'pattern',
  'emerg',
  'individu',
  'agent',
  'base',
  'model',
  'macroscop',
  'emerg',
  'pattern',
  'micro',
  'level',
  'behavior'],
 ['normal', '3nf'],
 ['client', 'factor', 'standpoint'],
 ['approach',
  'dynam',
  'program',
  'problem',
  'problem',
  'solut',
  'compos',
  'solut',
  'problem',
  'smaller',
  'input'],
 ['piec', 'togeth', 'system', 'give', 'rise', 'complex', 'system'],
 ['piec', 'togeth', 'system', 'give', 'rise', 'grander', 'system'],
 ['begin',
  'look',
  'process',
  'directli',
  'activ',
  'level',
  'aggreg',
  'identifi',
  'process',
  'across',
  'organ'],
 ['begin',
  'level',
  'attribut',
  'normal',
  'bottom',
  'use',
  'small',
  'databas',
  'attribut'],
 ['often',
  'intuit',
  'approach',
  'use',
  'recurs',
  'dynam',

# Part 1: Word2Vec (30 points)

In this part you will use the Word2Vec algorithm to generate word embeddings for tokens in the dataset. You can just use a package like https://radimrehurek.com/gensim/models/word2vec.html. Let's set the size of word embeddings to be 20. Please print the word embeddings for the tokens: 
* relational
* database
* garbage
* collection
* retrieval 
* model

In [126]:
# code here.
# how do you generate the word embeddings
model = Word2Vec(final_tokens, min_count=1,size= 20,workers=5, window =3, sg = 1)

In [127]:
words = list(model.wv.vocab)
# print the word embeddings of the six tokens
def print_embeddings(word):
    wrd= ps.stem(word)
    print("WORD EMBEDDINGS OF ", word, " is : ")
    print()
    print(model[wrd])
    print()

In [128]:
print_embeddings('relational')
print_embeddings('database')
print_embeddings('garbage')
print_embeddings('collection')
print_embeddings('retrieval')
print_embeddings('model')

WORD EMBEDDINGS OF  relational  is : 

[-0.19857235  0.12671016  0.58526367  1.1316813   0.23514447  0.4050185
  0.06603111 -0.78574866 -1.1994452  -0.6083639  -0.5986841   0.31020227
  0.12358997 -0.8132407   0.33675492 -0.4835127   1.1031913   0.70540714
 -1.1240478   0.9866053 ]

WORD EMBEDDINGS OF  database  is : 

[ 0.14583108  0.17329638  0.47260886  0.9799709   0.15488948  0.05742981
  0.60294205 -0.32326025 -1.6562929  -0.14949217 -0.92841756  1.171146
  0.1251197  -0.40019056  0.5721219   0.01220717  0.95245594  0.9872849
 -0.59492344  0.9843003 ]

WORD EMBEDDINGS OF  garbage  is : 

[-0.00233705 -0.24472564  0.3015086   0.50329936  0.01594679 -0.18599935
 -0.27282998 -0.05815934 -0.8369989  -0.09737109 -0.35560307  0.38770807
 -0.16374335 -0.07756652  0.21853782  0.09844609  0.03178845  0.18159175
 -0.65845513  0.16140185]

WORD EMBEDDINGS OF  collection  is : 

[-0.18443732 -0.26626533  0.5432484   0.870646   -0.2960243  -0.35501286
 -0.06668856 -0.9452887  -1.5189544   0.04

  import sys


# Part 2: Vector Space Model via Word Embeddings (40 points) 

In this part, your job is to match the query and the document via the cosine similarity between the embeddings of them.

Since there are not just one token in a query or a document, the first challenge is how to aggregate many word embeddings into one embedding of a query or a document. There are many ways to do so: 
* Max pooling: return the maximum value along each dimension of a bunch of word embeddings. For example, [1, 3, 4], [2, 1, 5] -> [2, 3, 5].
* Min pooling: return the minimum value along each dimension of a bunch of word embeddings
* Mean pooling: return the mean value along each dimension of a bunch of word embeddings
* Sum: element-wise add a bunch of word embeddings together
* Weighted sum: assign weights to word embeddings and then add them together. Weights could be TF, IDF or TF-IDF.

Try different aggregation methods and report the precision@10 for these queries:
* query: relational database
* query: garbage collection
* query: retrieval model

In [129]:
def query_embedding(queryname):
    q1 = model[ps.stem(queryname.split(' ')[0])]
    q2 = model[ps.stem(queryname.split(' ')[1])]
    
    q_emb_vect_max = np.maximum.reduce([q1, q2])
    q_emb_vect_min = np.minimum.reduce([q1, q2])
    q_emb_vect_sum = np.add.reduce([q1, q2])
    q_emb_vect_mean = np.mean(([q1, q2]), axis = 0)
    return q_emb_vect_max,q_emb_vect_min,q_emb_vect_sum,q_emb_vect_mean

In [130]:
relational_database_maxpooling,relational_database_minpooling,relational_database_sum,relational_database_mean = query_embedding('relational database')

garbage_collection_maxpooling,garbage_collection_minpooling,garbage_collection_sum,garbage_collection_mean = query_embedding('garbage collection')

retrieval_model_maxpooling,retrieval_model_minpooling,retrieval_model_sum,retrieval_model_mean = query_embedding('retrieval model')


  
  This is separate from the ipykernel package so we can avoid doing imports until


In [131]:
def FetchData(filename):
    f = open(filename, encoding = "utf8")         
    return f

In [132]:
f = FetchData('homework_1_data.txt')
document_file=[]
document_list =[]
document_entity = []
for file in f:
    definition = file.split('\t')[2]
    docidx =file.split('\t')[1]
    entity = file.split('\t')[0]
    document_file.append(definition)
    document_list.append(docidx)
    document_entity.append(entity)

In [133]:
tf_list = []
for d in document_list:
    defi = final_tokens[int(d)]
    frequency_list = []
    for every_word in defi:
        leng = len(defi)
        wf = defi.count(every_word)
        tf_val = wf/leng
        frequency_list.append((every_word,tf_val))
        
    tf_list.append(frequency_list)  

In [134]:
def process_tf(index, r = 0):
    if final_tokens[index]:
        tftf = tf_list[index] 
        values = []
        for i in range(len(final_tokens[index])):
           
            wrd_vector = model[final_tokens[index][i]]  # word embedding of estim
            
            wrd_vector = wrd_vector * tftf[i][1]
           
            values.append(wrd_vector)
            
        k = values[0]
        for j in range(1,len(values)):
            r = np.add.reduce([k,values[j]])
            k = r    
        
    return r
           

In [135]:
def process_mean(index, r= 0):
    
    
    if final_tokens[index]:
       
        values = []
        for i in range(len(final_tokens[index])):
            wrd_vector = model[final_tokens[index][i]]
            values.append(wrd_vector)
       
        k = values[0]
        for j in range(1,len(values)):
            r = np.add.reduce([k,values[j]])
            k = r
       
        r = r/len(final_tokens[index])   
        
    return r   

In [136]:
def process_max(index, r= 0):
    
    
    if final_tokens[index]:
        
        values = []
        for i in range(len(final_tokens[index])):
            wrd_vector = model[final_tokens[index][i]]
            values.append(wrd_vector)

        k = values[0]
        for j in range(1,len(values)):
            r = np.maximum.reduce([k,values[j]])
            k = r
        
    return r    

In [137]:
def process_min(index, r= 0):
    
    
    if final_tokens[index]:
        
        values = []
        for i in range(len(final_tokens[index])):
            wrd_vector = model[final_tokens[index][i]]
            values.append(wrd_vector)

        k = values[0]
        for j in range(1,len(values)):
            r = np.minimum.reduce([k,values[j]])
            k = r
        
    return r    

In [138]:
def process_sum(index, r= 0):
  
    
    if final_tokens[index]:
        
        values = []
        for i in range(len(final_tokens[index])):
            wrd_vector = model[final_tokens[index][i]]
            values.append(wrd_vector)

        k = values[0]
        for j in range(1,len(values)):
            r = np.add.reduce([k,values[j]])
            k = r
        
    return r    

In [139]:
doc_embeds_min = []
doc_embeds_max = []
doc_embeds_sum = []
doc_embeds_mean = []
doc_embeds_tf = []
for ids in range(len(final_tokens)):   #the whole list
    max_line = process_max(ids)            #one line of documents's embedding vector---MAX
    doc_embeds_max.append(max_line)
    min_line = process_min(ids)
    doc_embeds_min.append(min_line)
    sum_line = process_sum(ids)
    doc_embeds_sum.append(sum_line)
    tf_line = process_tf(ids)
    doc_embeds_tf.append(tf_line)
    mean_line = process_mean(ids)
    doc_embeds_mean.append(mean_line)


  
  import sys


In [140]:
def cal_csimilarity(query_vector,document_vector, csv = 0):
    
    if (not isinstance(document_vector, int)) and (not isinstance(document_vector, float)) :
        #print(document_vector)
        vector1 = query_vector
        vector2 = document_vector
    
        vector1 = vector1.reshape(1, -1)
        vector2 = vector2.reshape(1, -1)
        csv = cosine_similarity(vector1,vector2)
    
    return csv

In [141]:
def lst_cvals(queryv,doc_embeds):
    similarity_value_list =  []
    for docid in range(len(doc_embeds)):
        #print(docid)
        similarity_vals = cal_csimilarity(queryv,doc_embeds[docid])
        #print ("ggg",similarity_vals[0])
        #break
        similarity_value_list.append((similarity_vals,docid))
    similarity_value_list = sorted(similarity_value_list, reverse = True)
    tfidf_score = [(ranked[1],ranked[0]) for ranked in similarity_value_list]
        
    return tfidf_score

In [142]:
sorted_csmililarity_RD_max = lst_cvals(relational_database_maxpooling,doc_embeds_max)
sorted_csmililarity_GC_max = lst_cvals(garbage_collection_maxpooling,doc_embeds_max)
sorted_csmililarity_RM_max = lst_cvals(retrieval_model_maxpooling,doc_embeds_max)

sorted_csmililarity_RD_min = lst_cvals(relational_database_minpooling,doc_embeds_min)
sorted_csmililarity_GC_min = lst_cvals(garbage_collection_minpooling,doc_embeds_min)
sorted_csmililarity_RM_min = lst_cvals(retrieval_model_minpooling,doc_embeds_min)

sorted_csmililarity_RD_sum = lst_cvals(relational_database_sum,doc_embeds_sum)
sorted_csmililarity_GC_sum = lst_cvals(garbage_collection_sum,doc_embeds_sum)
sorted_csmililarity_RM_sum = lst_cvals(retrieval_model_sum,doc_embeds_sum)

sorted_csmililarity_RD_mean = lst_cvals(relational_database_mean,doc_embeds_mean)
sorted_csmililarity_GC_mean = lst_cvals(garbage_collection_mean,doc_embeds_mean)
sorted_csmililarity_RM_mean = lst_cvals(retrieval_model_mean,doc_embeds_mean)

sorted_csmililarity_RD_tf = lst_cvals(relational_database_sum,doc_embeds_tf)
sorted_csmililarity_GC_tf = lst_cvals(garbage_collection_sum,doc_embeds_tf)
sorted_csmililarity_RM_tf = lst_cvals(retrieval_model_sum,doc_embeds_tf)

In [143]:
result_RD_max = sorted_csmililarity_RD_max[:10]
result_GC_max = sorted_csmililarity_GC_max[:10]
result_RM_max = sorted_csmililarity_RM_max[:10]

result_RD_min = sorted_csmililarity_RD_min[:10]
result_GC_min = sorted_csmililarity_GC_min[:10]
result_RM_min = sorted_csmililarity_RM_min[:10]

result_RD_sum = sorted_csmililarity_RD_sum[:10]
result_GC_sum = sorted_csmililarity_GC_sum[:10]
result_RM_sum = sorted_csmililarity_RM_sum[:10]

result_RD_mean = sorted_csmililarity_RD_mean[:10]
result_GC_mean = sorted_csmililarity_GC_mean[:10]
result_RM_mean = sorted_csmililarity_RM_mean[:10]

result_RD_tf = sorted_csmililarity_RD_tf[:10]
result_GC_tf = sorted_csmililarity_GC_tf[:10]
result_RM_tf = sorted_csmililarity_RM_tf[:10]

In [144]:
# your code here
def Precision(output,query,document_file,document_entity):
    print("query:" + query)
    count =0
    
    for num in range(len(output)):
        
        if(isinstance(output[0],tuple)):
            id = output[num][0]   #28177
            score = str(output[num][1]) #0.9823
            entity= document_entity[int(id)]  #relational database
            docid=  str(id) 
            definition = document_file[int(id)]  #relational database schema with data
            if(entity==query):
                count=count+1
    return count     


In [170]:
print("********************************PRECISION@10**********************************************************")
print()
print("MAXPOOLING")
print("============")
num1 =Precision(result_RD_max,"relational database",document_file,document_entity)
print("precision for cosine similarity of relational database Maxpooling  = " , num1/10)
num2 =Precision(result_GC_max,"garbage collection",document_file,document_entity)
print("precision for cosine similarity of garbage collection Maxpooling  = " , num2/10)
num3 =Precision(result_RM_max,"retrieval model",document_file,document_entity)
print("precision for cosine similarity of retrieval model Maxpooling  = " , num3/10)
print()
print("MINPOOLING")
print("============")
RD_min =Precision(result_RD_min,"relational database",document_file,document_entity)
print("precision for cosine similarity of relational database Minpooling  = " , RD_min/10)
GC_min =Precision(result_GC_min,"garbage collection",document_file,document_entity)
print("precision for cosine similarity of garbage collection Minpooling  = " , GC_min/10)
RM_min =Precision(result_RM_min,"retrieval model",document_file,document_entity)
print("precision for cosine similarity of retrieval model Minpooling  = " , RM_min/10)
print()
print("SUM")
print("============")
RD_sum =Precision(result_RD_sum,"relational database",document_file,document_entity)
print("precision for cosine similarity of relational database sum  = " , RD_sum/10)
GC_sum =Precision(result_GC_sum,"garbage collection",document_file,document_entity)
print("precision for cosine similarity of garbage collection sum  = " , GC_sum/10)
RM_sum =Precision(result_RM_sum,"retrieval model",document_file,document_entity)
print("precision for cosine similarity of retrieval model sum  = " , RM_sum/10)
print()
print("MEAN POOLING")
print("============")
RD_mean =Precision(result_RD_mean,"relational database",document_file,document_entity)
print("precision for cosine similarity of relational database mean  = " , RD_mean/10)
GC_mean =Precision(result_GC_mean,"garbage collection",document_file,document_entity)
print("precision for cosine similarity of garbage collection mean  = " , GC_mean/10)
RM_mean =Precision(result_RM_mean,"retrieval model",document_file,document_entity)
print("precision for cosine similarity of retrieval model mean  = " , RM_mean/10)
print()
print("WEIGHTED SUM : TF WEIGHT")
print("===========================")
RD_tf=Precision(result_RD_tf,"relational database",document_file,document_entity)
print("precision for cosine similarity of relational database tf  = " , RD_tf/10)
GC_tf =Precision(result_GC_tf,"garbage collection",document_file,document_entity)
print("precision for cosine similarity of garbage collection tf  = " , GC_tf/10)
RM_tf =Precision(result_RM_tf,"retrieval model",document_file,document_entity)
print("precision for cosine similarity of retrieval model tf  = " , RM_tf/10)

********************************PRECISION@10**********************************************************

MAXPOOLING
query:relational database
precision for cosine similarity of relational database Maxpooling  =  0.3
query:garbage collection
precision for cosine similarity of garbage collection Maxpooling  =  0.0
query:retrieval model
precision for cosine similarity of retrieval model Maxpooling  =  0.0

MINPOOLING
query:relational database
precision for cosine similarity of relational database Minpooling  =  0.5
query:garbage collection
precision for cosine similarity of garbage collection Minpooling  =  0.0
query:retrieval model
precision for cosine similarity of retrieval model Minpooling  =  0.0

SUM
query:relational database
precision for cosine similarity of relational database sum  =  0.7
query:garbage collection
precision for cosine similarity of garbage collection sum  =  0.0
query:retrieval model
precision for cosine similarity of retrieval model sum  =  0.0

MEAN POOLING
query

### Discussion
Among these aggregation methods, which one is the best and which one is the worst?

BEST AGGREGATION RESULT - TF WEIGHT
WORST AGGREGATION RESULT - MAXPOOLING

# Part 3: Query Expansion via Word Embeddings (30 points) 
Remember the hardest query "retrieval model" in homework 1? Because there is no document containing "retrieval model" in the dataset, you cannot retrieve any documents by Boolean matching. Now, it is the time of your "revenge" via query expansion.

In this part, your job is to expand the original query like "retrieval model" by adding semantically similar words (e.g., "search"), which are selected from all tokens in the dataset.

There are many ways to do so. For this part, we want you to calculate the cosine similarity between each of the original query tokens and the other tokens based on their word embeddings.

First, please find the top 3 similar tokens for:
* relational
* database
* garbage
* collection
* retrieval 
* model

In [146]:
# your code here
relational_similar = model.most_similar(ps.stem('relational'))[:3]
print("RELATIONAL TOP 3 SIMILAR WORDS : ")
print("===================================")
print(relational_similar)
print("SIMILAR TOKENS : ",relational_similar[0][0],",",relational_similar[1][0],",",relational_similar[2][0] )
print()
database_similar = model.most_similar(ps.stem('database'))[:3]
print("DATABASE TOP 3 SIMILAR WORDS : ")
print("===================================")
print(database_similar)
print("SIMILAR TOKENS : ",database_similar[0][0],",",database_similar[1][0],",",database_similar[2][0] )
print()
garbage_similar = model.most_similar(ps.stem('garbage'))[:3]
print("GARBAGE TOP 3 SIMILAR WORDS : ")
print("===================================")
print(garbage_similar)
print("SIMILAR TOKENS : ",garbage_similar[0][0],",",garbage_similar[1][0],",",garbage_similar[2][0] )
print()
collection_similar = model.most_similar(ps.stem('collection'))[:3]
print("COLLECTION TOP 3 SIMILAR WORDS : ")
print("===================================")
print(collection_similar)
print("SIMILAR TOKENS : ",collection_similar[0][0],",",collection_similar[1][0],",",collection_similar[2][0] )
print()
retrieval_similar = model.most_similar(ps.stem('retrieval'))[:3]
print("RETRIEVAL TOP 3 SIMILAR WORDS : ")
print("===================================")
print(retrieval_similar)
print("SIMILAR TOKENS : ",retrieval_similar[0][0],",",retrieval_similar[1][0],",",retrieval_similar[2][0] )
print()
model_similar = model.most_similar(ps.stem('model'))[:3]
print("MODEL TOP 3 SIMILAR WORDS : ")
print("===================================")
print(model_similar)
print("SIMILAR TOKENS : ",model_similar[0][0],",",model_similar[1][0],",",model_similar[2][0] )
print()

RELATIONAL TOP 3 SIMILAR WORDS : 
[('tabl', 0.9308838844299316), ('rdbm', 0.911335825920105), ('compli', 0.9051710367202759)]
SIMILAR TOKENS :  tabl , rdbm , compli

DATABASE TOP 3 SIMILAR WORDS : 
[('dbm', 0.9440183639526367), ('cleans', 0.9279346466064453), ('catalog', 0.9182411432266235)]
SIMILAR TOKENS :  dbm , cleans , catalog

GARBAGE TOP 3 SIMILAR WORDS : 
[('oltp', 0.9758096933364868), ('collet', 0.9738173484802246), ('circut', 0.9729315042495728)]
SIMILAR TOKENS :  oltp , collet , circut

COLLECTION TOP 3 SIMILAR WORDS : 
[('dispar', 0.9542980194091797), ('consolid', 0.9531002044677734), ('catalog', 0.9460099339485168)]
SIMILAR TOKENS :  dispar , consolid , catalog

RETRIEVAL TOP 3 SIMILAR WORDS : 
[('store', 0.9334141612052917), ('etl', 0.9210958480834961), ('metadata', 0.91939377784729)]
SIMILAR TOKENS :  store , etl , metadata

MODEL TOP 3 SIMILAR WORDS : 
[('deduct', 0.8712215423583984), ('hierarch', 0.8681561946868896), ('structur', 0.8670740723609924)]
SIMILAR TOKENS :  

  
  
  


Second, please add these similar tokens to the orignal query and redo the **vector space model** in part 2. 
* query: relational database
* query: garbage collection
* query: retrieval model

In [147]:
# your code here
def retrieved_number(query):
    number = 0
    for h in range(len(document_entity)):
        if(document_entity[h] == query):
            number += 1
    return number    

In [148]:
RD_query = retrieved_number("relational database")
GC_query = retrieved_number("garbage collection")
RM_query = retrieved_number("retrieval model")


In [149]:
# QUERY EMBEDDING
def similar_vectors(qv):
    
    word_vecs = []
    for q in range(len(qv)):
        word_vecs.append(qv[q][0])
    return word_vecs    

In [150]:
relationalS = similar_vectors(relational_similar)
databaseS = similar_vectors(database_similar)
garbageS = similar_vectors(garbage_similar)
collectionS = similar_vectors(collection_similar)
retrievalS =similar_vectors(retrieval_similar)
modelS = similar_vectors(model_similar)
print(relationalS)
print(databaseS)
print(garbageS)
print(collectionS)
print(retrievalS)
print(modelS)
print(relationalS +databaseS)

['tabl', 'rdbm', 'compli']
['dbm', 'cleans', 'catalog']
['oltp', 'collet', 'circut']
['dispar', 'consolid', 'catalog']
['store', 'etl', 'metadata']
['deduct', 'hierarch', 'structur']
['tabl', 'rdbm', 'compli', 'dbm', 'cleans', 'catalog']


In [151]:
def recall_maxp(query_embeddings):
    values = []
    for g in range(len(query_embeddings)):
        wrd_vector = model[query_embeddings[g]]
        values.append(wrd_vector)
        
    k = values[0]
    for j in range(1,len(values)):
        r = np.maximum.reduce([k,values[j]])
        k = r   
    #print(values)
    return r

In [152]:
def recall_minp(query_embeddings):
    values = []
    for g in range(len(query_embeddings)):
        wrd_vector = model[query_embeddings[g]]
        values.append(wrd_vector)
        
    k = values[0]
    for j in range(1,len(values)):
        r = np.minimum.reduce([k,values[j]])
        k = r   
    #print(values)
    return r

In [153]:
def recall_sump(query_embeddings):
    values = []
    for g in range(len(query_embeddings)):
        wrd_vector = model[query_embeddings[g]]
        values.append(wrd_vector)
        
    k = values[0]
    for j in range(1,len(values)):
        r = np.add.reduce([k,values[j]])
        k = r   
   # print(values)
    return r

In [154]:
def recall_meanp(query_embeddings):
    values = []
    for g in range(len(query_embeddings)):
        wrd_vector = model[query_embeddings[g]]
        values.append(wrd_vector)
    #print(values)    
    k = values[0]
    for j in range(1,len(values)):
        r = np.add.reduce([k,values[j]])
        k = r   
   # print(values)
    r = r/len(query_embeddings)
    return r


In [155]:
def cal_q_tf(query_embeddings):
    tf_query_list = []
    for r in range(len(query_embeddings)):
        qeer = query_embeddings[r]
        leng = len(query_embeddings)
        wf = query_embeddings.count(qeer)
        tf_q_val = wf/leng
        tf_query_list.append((qeer,tf_q_val))
        #print(qeer)
        #print(leng)
    return tf_query_list

In [156]:
tf_rd_v = cal_q_tf(relationalS +databaseS)
tf_gc_v = cal_q_tf(garbageS + collectionS)
tf_rm_v = cal_q_tf(retrievalS + modelS)

In [157]:
def recall_tfp(query_embeddings,tftf):
    values = []
    for g in range(len(query_embeddings)):
        wrd_vector = model[query_embeddings[g]]
        wrd_vector = wrd_vector*tftf[g][1]
        values.append(wrd_vector)
   # print(values)  
    
    k = values[0]
    for j in range(1,len(values)):
        r = np.add.reduce([k,values[j]])
        k = r   
        #print(r)
        #break
    #print(r)
    return r

In [158]:
RD_recall_max = recall_maxp(relationalS + databaseS) #maxpooling of all 6 vectors
RD_recall_min = recall_minp(relationalS + databaseS)
RD_recall_sum = recall_sump(relationalS + databaseS)
RD_recall_mean = recall_meanp(relationalS + databaseS)
RD_recall_tf = recall_tfp(relationalS + databaseS,tf_rd_v)

GC_recall_max = recall_maxp(garbageS + collectionS)
GC_recall_min = recall_minp(garbageS + collectionS)
GC_recall_sum = recall_sump(garbageS + collectionS)
GC_recall_mean = recall_meanp(garbageS + collectionS)
GC_recall_tf = recall_tfp(garbageS + collectionS,tf_gc_v)

RM_recall_max = recall_maxp(retrievalS + modelS)
RM_recall_min = recall_minp(retrievalS + modelS)
RM_recall_sum = recall_sump(retrievalS + modelS)
RM_recall_mean = recall_meanp(retrievalS + modelS)
RM_recall_tf = recall_tfp(retrievalS + modelS,tf_rm_v)

  after removing the cwd from sys.path.


In [159]:
q_rd_max = np.maximum.reduce([relational_database_maxpooling, RD_recall_max]) # all 8 vectors
q_gc_max = np.maximum.reduce([garbage_collection_maxpooling, GC_recall_max])
q_rm_max = np.maximum.reduce([retrieval_model_maxpooling, RM_recall_max])

q_rd_min = np.minimum.reduce([relational_database_minpooling, RD_recall_min])
q_gc_min = np.minimum.reduce([garbage_collection_minpooling, GC_recall_min])
q_rm_min = np.minimum.reduce([retrieval_model_minpooling, RM_recall_min])

q_rd_sum = np.add.reduce([relational_database_sum, RD_recall_sum])
q_gc_sum = np.add.reduce([garbage_collection_sum, GC_recall_sum])
q_rm_sum = np.add.reduce([retrieval_model_sum, RM_recall_sum])

q_rd_mean = np.mean(([relational_database_mean, RD_recall_mean]), axis = 0)
q_gc_mean = np.mean(([garbage_collection_mean, GC_recall_mean]), axis = 0)
q_rm_mean = np.mean(([retrieval_model_mean, RM_recall_mean]), axis =0)

q_rd_tf = np.add.reduce([relational_database_sum, RD_recall_tf])
q_gc_tf = np.add.reduce([relational_database_sum, RD_recall_tf])
q_rm_tf = np.add.reduce([relational_database_sum, RD_recall_tf])


In [160]:
sorted_csmililarity_RD_max_r = lst_cvals(q_rd_max,doc_embeds_max)
sorted_csmililarity_GC_max_r = lst_cvals(q_gc_max,doc_embeds_max)
sorted_csmililarity_RM_max_r = lst_cvals(q_rm_max,doc_embeds_max)

sorted_csmililarity_RD_min_r = lst_cvals(q_rd_min,doc_embeds_min)
sorted_csmililarity_GC_min_r = lst_cvals(q_gc_min,doc_embeds_min)
sorted_csmililarity_RM_min_r = lst_cvals(q_rm_min,doc_embeds_min)

sorted_csmililarity_RD_sum_r = lst_cvals(q_rd_sum,doc_embeds_sum)
sorted_csmililarity_GC_sum_r = lst_cvals(q_gc_sum,doc_embeds_sum)
sorted_csmililarity_RM_sum_r = lst_cvals(q_rm_sum,doc_embeds_sum)

sorted_csmililarity_RD_mean_r = lst_cvals(q_rd_mean,doc_embeds_mean)
sorted_csmililarity_GC_mean_r = lst_cvals(q_gc_mean,doc_embeds_mean)
sorted_csmililarity_RM_mean_r = lst_cvals(q_rm_mean,doc_embeds_mean)

sorted_csmililarity_RD_tf_r = lst_cvals(q_rd_tf,doc_embeds_tf)
sorted_csmililarity_GC_tf_r = lst_cvals(q_gc_tf,doc_embeds_tf)
sorted_csmililarity_RM_tf_r = lst_cvals(q_rm_tf,doc_embeds_tf)

In [161]:
result_RD_max_r = sorted_csmililarity_RD_max_r[:10]
result_GC_max_r = sorted_csmililarity_GC_max_r[:10]
result_RM_max_r = sorted_csmililarity_RM_max_r[:10]

result_RD_min_r = sorted_csmililarity_RD_min_r[:10]
result_GC_min_r = sorted_csmililarity_GC_min_r[:10]
result_RM_min_r = sorted_csmililarity_RM_min_r[:10]

result_RD_sum_r = sorted_csmililarity_RD_sum_r[:10]
result_GC_sum_r = sorted_csmililarity_GC_sum_r[:10]
result_RM_sum_r = sorted_csmililarity_RM_sum_r[:10]

result_RD_mean_r = sorted_csmililarity_RD_mean_r[:10]
result_GC_mean_r = sorted_csmililarity_GC_mean_r[:10]
result_RM_mean_r = sorted_csmililarity_RM_mean_r[:10]

result_RD_tf_r = sorted_csmililarity_RD_tf_r[:10]
result_GC_tf_r = sorted_csmililarity_GC_tf_r[:10]
result_RM_tf_r = sorted_csmililarity_RM_tf_r[:10]

In [171]:
print()
print("*************************************BEFORE EXPANSION*******************************************")
print()
print("RECALL@10 :")
print()
print("MAXPOOLING")
print("==============")
RD_max_recall=Precision(result_RD_max,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database Maxpooling before expansion  = " , RD_max_recall/RD_query)
GC_max_recall =Precision(result_GC_max,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection Maxpooling before expansion= " , GC_max_recall/GC_query)
RM_max_recall =Precision(result_RM_max,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model Maxpooling before expansion  = " , RM_max_recall/RM_query)
print()
print("MINPOOLING")
print("==============")
RD_min_recall =Precision(result_RD_min,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database Minpooling before expansion = " , RD_min_recall/RD_query)
GC_min_recall =Precision(result_GC_min,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection Minpooling before expansion = " , GC_min_recall/GC_query)
RM_min_recall =Precision(result_RM_min,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model Minpooling before expansion = " , RM_min_recall/RM_query)
print()
print("SUM")
print("==============")
RD_sum_recall =Precision(result_RD_sum,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database sum before expansion = " , RD_sum_recall/RD_query)
GC_sum_recall =Precision(result_GC_sum,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection sum before expansion = " , GC_sum_recall/GC_query)
RM_sum_recall =Precision(result_RM_sum,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model sum before expansion= " , RM_sum_recall/RM_query)
print()
print("MEAN POOLING")
print("==============")
RD_mean_recall =Precision(result_RD_mean,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database mean before expansion = " , RD_mean_recall/RD_query)
GC_mean_recall =Precision(result_GC_mean,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection mean before expansion = " , GC_mean_recall/GC_query)
RM_mean_recall =Precision(result_RM_mean,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model mean before expansion= " , RM_mean_recall/RM_query)
print()
print("WEIGHTED SUM : TF WEIGHT")
print("===========================")
RD_tf_recall =Precision(result_RD_tf,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database tf-weight before expansion = " , RD_tf_recall/RD_query)
GC_tf_recall =Precision(result_GC_tf,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection tf-weight before expansion = " , GC_tf_recall/GC_query)
RM_tf_recall =Precision(result_RM_tf,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model tf-weight before expansion= " , RM_tf_recall/RM_query)


*************************************BEFORE EXPANSION*******************************************

RECALL@10 :

MAXPOOLING
query:relational database
Recall for cosine similarity of relational database Maxpooling before expansion  =  0.01056338028169014
query:garbage collection
Recall for cosine similarity of garbage collection Maxpooling before expansion=  0.0
query:retrieval model
Recall for cosine similarity of retrieval model Maxpooling before expansion  =  0.0

MINPOOLING
query:relational database
Recall for cosine similarity of relational database Minpooling before expansion =  0.017605633802816902
query:garbage collection
Recall for cosine similarity of garbage collection Minpooling before expansion =  0.0
query:retrieval model
Recall for cosine similarity of retrieval model Minpooling before expansion =  0.0

SUM
query:relational database
Recall for cosine similarity of relational database sum before expansion =  0.02464788732394366
query:garbage collection
Recall for cosine sim

In [172]:
print()
print("*************************************AFTER EXPANSION*******************************************")
print()
print("RECALL@10 :")
print()
print("MAXPOOLING")
print("==============")
RD_max_recall_r=Precision(result_RD_max_r,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database Maxpooling after expansion  = " , RD_max_recall_r/RD_query)
GC_max_recall_r =Precision(result_GC_max_r,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection Maxpooling after expansion= " , GC_max_recall_r/GC_query)
RM_max_recall_r =Precision(result_RM_max_r,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model Maxpooling after expansion  = " , RM_max_recall_r/RM_query)
print()
print("MINPOOLING")
print("==============")
RD_min_recall_r =Precision(result_RD_min_r,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database Minpooling after expansion = " , RD_min_recall_r/RD_query)
GC_min_recall_r =Precision(result_GC_min_r,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection Minpooling after expansion = " , GC_min_recall_r/GC_query)
RM_min_recall_r =Precision(result_RM_min_r,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model Minpooling after expansion = " , RM_min_recall_r/RM_query)
print()
print("SUM")
print("==============")
RD_sum_recall_r =Precision(result_RD_sum_r,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database sum after expansion = " , RD_sum_recall_r/RD_query)
GC_sum_recall_r =Precision(result_GC_sum_r,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection sum after expansion = " , GC_sum_recall_r/GC_query)
RM_sum_recall_r =Precision(result_RM_sum_r,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model sum after expansion= " , RM_sum_recall_r/RM_query)
print()
print("MEAN POOLING")
print("==============")
RD_mean_recall_r =Precision(result_RD_mean_r,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database mean after expansion = " , RD_mean_recall_r/RD_query)
GC_mean_recall_r =Precision(result_GC_mean_r,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection mean after expansion = " , GC_mean_recall_r/GC_query)
RM_mean_recall_r =Precision(result_RM_mean_r,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model mean after expansion= " , RM_mean_recall_r/RM_query)
print()
print("WEIGHTED SUM : TF WEIGHT")
print("===========================")
RD_tf_recall_r =Precision(result_RD_tf_r,"relational database",document_file,document_entity)
print("Recall for cosine similarity of relational database tf after expansion = " , RD_tf_recall_r/RD_query)
GC_tf_recall_r =Precision(result_GC_tf_r,"garbage collection",document_file,document_entity)
print("Recall for cosine similarity of garbage collection tf after expansion = " , GC_tf_recall_r/GC_query)
RM_tf_recall_r =Precision(result_RM_tf_r,"retrieval model",document_file,document_entity)
print("Recall for cosine similarity of retrieval model tf after expansion= " , RM_tf_recall_r/RM_query)


*************************************AFTER EXPANSION*******************************************

RECALL@10 :

MAXPOOLING
query:relational database
Recall for cosine similarity of relational database Maxpooling after expansion  =  0.02464788732394366
query:garbage collection
Recall for cosine similarity of garbage collection Maxpooling after expansion=  0.0
query:retrieval model
Recall for cosine similarity of retrieval model Maxpooling after expansion  =  0.0

MINPOOLING
query:relational database
Recall for cosine similarity of relational database Minpooling after expansion =  0.02112676056338028
query:garbage collection
Recall for cosine similarity of garbage collection Minpooling after expansion =  0.0
query:retrieval model
Recall for cosine similarity of retrieval model Minpooling after expansion =  0.0

SUM
query:relational database
Recall for cosine similarity of relational database sum after expansion =  0.028169014084507043
query:garbage collection
Recall for cosine similarity 

Report recall@10 before the query expansion:

Report recall@10 after the query expansion:

### Discussion
Why we measure recall here instead of precision or NDCG?

Should the tokens added for expansion have the same importance as the original query tokens? If not, how to improve the query expansion in this part?

Recall is the fraction of the relevant documents that are successfully retrieved i.e RECALL = (RELEVANT DOC and RETRIEVED DOC)/ RELEVANT DOC. Precision is the fraction of retrieved documents that are relevant to the query:(RELEVANT DOC and RETRIEVED DOC)/ RETRIEVED DOC
By stemming a user-entered term, more documents are matched, as the alternate word forms for a user entered term are matched as well, increasing the total recall. This comes at the expense of reducing the precision. By expanding a search query to search for the synonyms of a user entered term, the recall is also increased at the expense of precision. This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator. Thus with an increase in retrieved documents using query expansion, recall increases but precision decreases.

To observe this phenomenon, we calculated recall instead of precision or NDCG here. we can see from our results that recall increase for all aggregation methods after query expansion

the tokens added will have higher importance than the original query tokens as more relevant the retrievd tokens will be , better will be query expansion and recall value.