## IRWA Final Project

#### Load Python packages
Let's first import all the packages that we will need during the project.

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bertamitjavilapita/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bertamitjavilapita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [99]:
import pandas as pd
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy
import collections
from numpy import linalg as la
import time
import re
from collections import Counter
import statistics
import json
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import ast
import pickle
from sklearn.preprocessing import MinMaxScaler
# import rank_bm25
from gensim.models.word2vec import Word2Vec

#### Load data into memory

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Data:** we import the pickle, i.e. the serialized data from create_index_tfidf from part 2. We need need term frequencies, document frequencies, the idf values and index for each term in the collection. And dataset of processed tweets that contains the following information for each tweet: Tweet, Date, Hashtags, Hashtags_count, Likes, Retweets, Url, and docId.

In [44]:
doc_path = '/content/drive/MyDrive/PROJECT IRWA/Part 2/Data/'
doc_path = '/Users/bertamitjavilapita/Desktop/BERTA/UPF/4RT/IRWA/PROJECT/Part 2/Data/'
# lo load the serialized data for future use
def load_serialized_data(filepath="index_data.pkl"):
    with open(filepath, 'rb') as f:
        index, tf, df, idf = pickle.load(f)
    return index, tf, df, idf

In [46]:
# load the serialized data from the file for future use
index, tf, df, idf = load_serialized_data(filepath=doc_path + "index_data.pkl")
processed_tweets_df = pd.read_csv(doc_path + 'processed_tweets.csv')

### Part 3: Ranking

**Ranking score: Given a query, we want to get the top-20 documents related to the query.**

Queries definition:
- "Indian government response to farmers"
- "International support for farmers"
- "Demands of farmers' protests"
- "Police action during farmers protest"
- "Schedules and locations of demonstrations and protests"

**GOAL: Find all the documents that contain all the words in the query and sort them by their relevance with regard to the query.**

**SCORE:**

**1. You’re asked to provide 2 different ways of ranking:**
- **TF-IDF + cosine similarity: Classical scoring, we have also seen during the practical labs**

In [50]:
# definition of our queries
queries = ['Indian government response to farmers',
           'International support for farmers',
           "Demands of farmers' protests",
           'Police action during farmers protest',
           'Schedules and locations of demonstrations and protests']

In [52]:
# function from part 1 to process tweets
def process_tweet(tweet):

    """
    Preprocess the tweet removing stop words, stemming,
    transforming in lowercase and return the tokens of the text.

    Argument:
    line -- string (tweet) to be preprocessed

    Returns:
    tweet - a list of tokens corresponding to the input text after the preprocessing
    """

    stemmer = PorterStemmer() # define the stemmer
    stop_words = set(stopwords.words("english")) # define the stopwords
    tweet =  tweet.lower() # transform the line to lowercase
    tweet = tweet.replace('\\n', '') # remove newline characters
    tweet = ' '.join(tweet.split()) # remove extra whitespaces
    tweet = re.sub(r'\S*https?:\S*', '', tweet) # delete URLs on the tweet because we won't be able to access to them
    tweet.strip() # remove spaces at first and at the end of a message
    tweet = re.sub(r' ?#\S+', '', tweet) # remove word that is with the hastag - hastag is saved separately
    tweet = re.sub(r'[^a-z0-9#@ ]+', '', tweet) # remove punctuation
    tweet = re.sub(r'[^\w\s]', '', tweet) # delete punctuation
    tweet = tweet.split() # tokenize the text to get a list of terms
    tweet = [word for word in tweet if word not in stop_words] # eliminate the stopwords
    tweet = [stemmer.stem(word) for word in tweet] # perform stemming

    return tweet

In [54]:
def rank_documents(terms, docs, index, idf, tf):
    """
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies

    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= query_terms_count[term]/ query_norm * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):

            #tf[term][0] will contain the tf of the term "term" in the doc 26
            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot

    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)

    result_docs = [x[1] for x in doc_scores]
    result_scores = [x[0] for x in doc_scores]

    if len(result_docs) == 0:
        print('No results found, try again')
        query = input()
        docs = search_tf_idf(query, index)
    return result_docs, result_scores



In [56]:
def search_tf_idf(query, index):
    """
    output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = process_tweet(query)
    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs=[posting[0] for posting in index[term]]
            docs |= set(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs = rank_documents(query, docs, index, idf, tf)
    return ranked_docs

In [58]:
top = 20
results = {}  # dictionary to store results for each query

# iterate through queries
for i, query in enumerate(queries):
    # execute the search function
    ranked_docs, result_scores = search_tf_idf(query, index)

    # store results in the dictionary
    results[f'ranked_docs_q{i+1}'] = ranked_docs
    results[f'result_scores_q{i+1}'] = result_scores

    print('\n----------------------------------------------------------------------------------------------------------')
    # print the top documents and their scores

    print(f"\nTop {top} results out of {len(ranked_docs)} retrieved documents for the query '{query}':\n")
    for doc_id, score in zip(ranked_docs[:top], result_scores[:top]):
        print(f'docId = {doc_id}\t score = {round(score, 4)}')


----------------------------------------------------------------------------------------------------------

Top 20 results out of 15989 retrieved documents for the query 'Indian government response to farmers':

docId = doc_3234	 score = 11.8045
docId = doc_27784	 score = 9.5898
docId = doc_38114	 score = 9.5064
docId = doc_38045	 score = 7.8308
docId = doc_33082	 score = 7.8308
docId = doc_41021	 score = 7.6648
docId = doc_13543	 score = 7.5804
docId = doc_7288	 score = 7.2119
docId = doc_1353	 score = 7.0977
docId = doc_39288	 score = 6.7811
docId = doc_32344	 score = 6.7811
docId = doc_24570	 score = 6.7811
docId = doc_24061	 score = 6.7811
docId = doc_17249	 score = 6.7811
docId = doc_34963	 score = 6.6396
docId = doc_37157	 score = 6.5963
docId = doc_47642	 score = 6.3002
docId = doc_2453	 score = 6.065
docId = doc_19626	 score = 6.065
docId = doc_16612	 score = 6.065

----------------------------------------------------------------------------------------------------------

Top 

- **Your-Score + cosine similarity: Here the task is to create a new score, and it’s up to you to create a new one.**
ELIMINARLO

Extract the features that are relevant to build our score.

In [60]:
relevance_features_df = processed_tweets_df[['docId', 'Hashtags', 'Hashtags_count', 'Likes', 'Retweets']]

Inspect the data in order to decide which normalization apply for build our score.

In [64]:
def detect_outliers_iqr(df, column):
    """
    Detects outliers in a specified column of a DataFrame using the IQR (Interquartile Range) method.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    column (str): The name of the column in which to detect outliers.

    Returns:
    DataFrame: A DataFrame containing the rows with outliers in the specified column.
    """
    # Calculate the first (Q1) and third (Q3) quartiles
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    # Calculate the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Define the lower and upper limits for outliers
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    # Filter rows with values outside the limits (outliers)
    outliers = df[(df[column] < lower_limit) | (df[column] > upper_limit)]

    if not outliers.empty:
        print(f"The column '{column}' contains outliers.")
    else:
        print(f"No outliers found in the column '{column}'.")

In [66]:
# score
# numero de likes i de retweets normalitzem

# per saber quina normalització aplicar
print('Stadistics Number of Hashtags')
display(processed_tweets_df['Hashtags_count'].describe())

print('\nStadistics Likes')
display(processed_tweets_df['Likes'].describe())

print('\nStadistics Retweets')
display(processed_tweets_df['Retweets'].describe())

Stadistics Number of Hashtags


count    48429.000000
mean         2.816597
std          2.386539
min          1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         25.000000
Name: Hashtags_count, dtype: float64


Stadistics Likes


count    48429.000000
mean        17.955419
std        242.634042
min          0.000000
25%          0.000000
50%          1.000000
75%          3.000000
max      27888.000000
Name: Likes, dtype: float64


Stadistics Retweets


count    48429.000000
mean         7.263664
std         69.987750
min          0.000000
25%          0.000000
50%          0.000000
75%          2.000000
max       6164.000000
Name: Retweets, dtype: float64

In [68]:
# Display the rows containing outliers
detect_outliers_iqr(processed_tweets_df, 'Hashtags_count')
detect_outliers_iqr(processed_tweets_df, 'Likes')
detect_outliers_iqr(processed_tweets_df, 'Retweets')

The column 'Hashtags_count' contains outliers.
The column 'Likes' contains outliers.
The column 'Retweets' contains outliers.


Apply logarithmic normalization to our data. 

In [71]:
# firstly generate new columns with the normlized data
# apply logarithmic transformation (adding 1 to avoid log(0) issues)
processed_tweets_df['Log Likes'] = np.log1p(processed_tweets_df['Likes'])
processed_tweets_df['Log Retweets'] = np.log1p(processed_tweets_df['Retweets'])

In [74]:
likes = dict(zip(processed_tweets_df['docId'], processed_tweets_df['Log Likes']))
retweets = dict(zip(processed_tweets_df['docId'], processed_tweets_df['Log Retweets']))
hashtags = dict(zip(processed_tweets_df['docId'], processed_tweets_df['Hashtags_count']))

Function to build the score based on popularity. 

In [77]:
def popularity_score(likes, retweets, hashtags, weigh_1, weigh_2, weight_3):
    
    popularity_score = likes * weigh_1 + retweets * weigh_2 + hashtags * weight_3

    return popularity_score

In [79]:
def our_rank_documents(terms, docs, index, idf, tf, likes, retweets, hashtags, weights):
    """
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies

    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= query_terms_count[term]/ query_norm * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):

            #tf[term][0] will contain the tf of the term "term" in the doc 26
            if doc in docs:

                # obtain the likes and retweets of the doc
                number_likes = likes[doc]
                number_retweets = retweets[doc]
                number_hashtags = hashtags[doc]

                # popularity 
                score_popularity = popularity_score(number_likes, number_retweets, number_hashtags, weights[0], weights[1], weights[2])
                
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term] + score_popularity

    # Calculate the score of each doc
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot

    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)

    result_docs = [x[1] for x in doc_scores]
    result_scores = [x[0] for x in doc_scores]

    if len(result_docs) == 0:
        print('No results found, try again')
        query = input()
        docs = search_our_score(query, index)
    return result_docs, result_scores

In [81]:
def search_our_score(query, index):
    """
    output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = process_tweet(query)
    docs = set()

    # parameters
    # explain weights based on real aspects in twitter
    weights = [0.3, 0.5, 0.2]

    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs=[posting[0] for posting in index[term]]
            docs |= set(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs = our_rank_documents(query, docs, index, idf, tf, likes, retweets, hashtags, weights)
    return ranked_docs

In [83]:
top = 20
results = {}  # dictionary to store results for each query

# iterate through queries
for i, query in enumerate(queries):
    # execute the search function
    ranked_docs, result_scores = search_our_score(query, index)

    # store results in the dictionary
    results[f'ranked_docs_q{i+1}'] = ranked_docs
    results[f'result_scores_q{i+1}'] = result_scores

    print('\n----------------------------------------------------------------------------------------------------------')
    # print the top documents and their scores

    print(f"\nTop {top} results out of {len(ranked_docs)} retrieved documents for the query '{query}':\n")
    for doc_id, score in zip(ranked_docs[:top], result_scores[:top]):
        print(f'docId = {doc_id}\t score = {round(score, 4)}')


----------------------------------------------------------------------------------------------------------

Top 20 results out of 15989 retrieved documents for the query 'Indian government response to farmers':

docId = doc_2727	 score = 27.0626
docId = doc_36986	 score = 24.2098
docId = doc_16637	 score = 24.1463
docId = doc_5169	 score = 23.1911
docId = doc_22724	 score = 22.5843
docId = doc_3500	 score = 22.0304
docId = doc_1353	 score = 20.2666
docId = doc_34869	 score = 19.9662
docId = doc_5936	 score = 19.581
docId = doc_19383	 score = 19.1091
docId = doc_27956	 score = 18.8689
docId = doc_36881	 score = 18.7084
docId = doc_12753	 score = 18.5225
docId = doc_27784	 score = 18.3251
docId = doc_25955	 score = 17.9912
docId = doc_32833	 score = 17.783
docId = doc_19361	 score = 17.4371
docId = doc_43834	 score = 17.3735
docId = doc_4926	 score = 17.1857
docId = doc_7778	 score = 17.111

------------------------------------------------------------------------------------------------

- **BM25**

In [85]:
def rank_documents_bm25 (terms, docs, index, idf, tf, k1, b, N, doc_length, lavg):
    """
    Perform the ranking of the results of a search based on the BM25 weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies
    k1 -- tunning parameter controling document term frequency scaling
    b -- tunning parameter controling the scaling by document length
    N -- number of documents
    doc_lenght -- dictionary with lenghts of all documents
    lavg -- average document length in the whole collection

    Returns:
    list of ranked documents
    scores of the documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        dft = len(index[term]) #in how many documents does it appear

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):

            if doc in docs:
                ld = doc_length[doc] #document length
                doc_vectors[doc][termIndex] = np.log(N/dft) * (((k1+1)*tf[term][doc_index]) / (k1*((1-b)+b*( ld / lavg ) ) + tf[term][doc_index]))

    # Calculate the score of each doc
    doc_scores=[[np.sum(curDocVec), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)
    result_docs = [x[1] for x in doc_scores]
    result_scores = [x[0] for x in doc_scores]
    if len(result_docs) == 0:
        print("No results found, try again")

    return result_docs, result_scores

In [86]:
def search_bm25(query, index, k1, b, N, doc_length, lavg):
    """
    output is the list of documents that contain all of the query terms.
    So, we will get the list of documents for each query term, and take the intersaction of them.
    """
    query = process_tweet(query)
    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]

            # docs = docs Intersaction term_docs
            if len(docs) == 0:
              docs = set(term_docs)
            else:
              docs.intersaction(set(term_docs))
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs, result_scores = rank_documents_bm25 (query, docs, index, idf, tf, k1, b, N, doc_length, lavg)
    return ranked_docs, result_scores

In [87]:
# Ld
doc_length = dict()
for i, row in processed_tweets_df.iterrows():
  doc_length[row['docId']]= len(row['Tweet'])

In [88]:
# Lavg
lavg = 0
for length in doc_length.values():
  lavg+=length

lavg = round(lavg / len(doc_length))

In [89]:
ranked_docs = [0]*5
result_scores = [0]*5
k1 = 2
b = 0.5
N = len(relevance_features_df)

i = 0
for query in queries:

  ranked_docs[i], result_scores[i] = search_bm25(query, index, k1, b, N, doc_length, lavg)
  top = 20
  print("\n----------------------------------------------------------------------------------------------------------")
  print("Top {} results out of {} for the searched query {}:\n".format(top, len(ranked_docs[i]), query))

  j=0
  for j in range(len(ranked_docs[i])):
    if (j < top):
      print("docId = {}\t score = {}".format(ranked_docs[i][j], result_scores[i][j]))
  i+=1



----------------------------------------------------------------------------------------------------------
Top 20 results out of 2719 for the searched query Indian government response to farmers:

docId = doc_13543	 score = 6.018461467422177
docId = doc_30422	 score = 5.711082995326188
docId = doc_41848	 score = 5.47864863495953
docId = doc_40125	 score = 5.47864863495953
docId = doc_40124	 score = 5.47864863495953
docId = doc_38240	 score = 5.47864863495953
docId = doc_31710	 score = 5.47864863495953
docId = doc_3116	 score = 5.47864863495953
docId = doc_29545	 score = 5.47864863495953
docId = doc_26363	 score = 5.47864863495953
docId = doc_17769	 score = 5.47864863495953
docId = doc_14237	 score = 5.47864863495953
docId = doc_12228	 score = 5.47864863495953
docId = doc_11484	 score = 5.47864863495953
docId = doc_31188	 score = 5.451518096785867
docId = doc_27778	 score = 5.451518096785867
docId = doc_34728	 score = 5.424654938061487
docId = doc_36211	 score = 4.8267924274395
docId =

**2. Return a top-20 list of documents for each of the 5 queries, using word2vec + cosine similarity.**

In [101]:
lines = processed_tweets_df['Tweet']
clean_tweets = []
for line in lines:

  clean_tweets.append(line)

In [103]:
sentences = clean_tweets

In [105]:
def search_word2vec(query, index, vector_size):
    """
    output is the list of documents that contain all of the query terms.
    So, we will get the list of documents for each query term, and take the intersaction of them.
    """
    query = process_tweet(query)
    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]

            # docs = docs Intersaction term_docs
            if len(docs) == 0:
              docs = set(term_docs)
            else:
              docs.intersaction(set(term_docs))
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs, result_scores = rank_documents_word2vec(query, docs, vector_size)
    return ranked_docs, result_scores

In [107]:
def rank_documents_word2vec (terms, docs, v_size):
    """
    Perform the ranking of the results of a search based on word2vec

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query

    Returns:
    list of ranked documents
    scores of the documents
    """
    tweet2vec = {'docId': docs}

    # Computing a vector for each word
    sentence = []
    for i, row in processed_tweets_df.iterrows(): # for each tweet
      sentences.append(row['Tweet'])
    model = Word2Vec(sentences, vector_size=100)

    for i, row in processed_tweets_df.iterrows():
      if row['docId'] in docs:
        doc_vector = [model.wv[word] for word in row['Tweet']]
        tweet2vec[row['docId']]=np.mean(doc_vector, axis=0)

    # Computing the vector for the query
    query2vec = [model.wv[word] for word in query if word in model.wv.key_to_index]
    query2vec = np.mean(query2vec, axis=0)

    # Calculate the score of each doc
    doc_scores=[[np.dot(curDocVec, query2vec), doc] for doc, curDocVec in tweet2vec.items() if doc in docs]
    doc_scores.sort(reverse=True)
    result_docs = [x[1] for x in doc_scores]
    result_scores = [x[0] for x in doc_scores]
    if len(result_docs) == 0:
        print("No results found, try again")

    return result_docs, result_scores

In [109]:
ranked_docs = [0]*5
result_scores = [0]*5
k1 = 2
b = 0.5
N = len(relevance_features_df)

i = 0
for query in queries:

  ranked_docs[i], result_scores[i] = search_word2vec(query, index, 100)
  top = 20
  print("\n----------------------------------------------------------------------------------------------------------")
  print("Top {} results out of {} for the searched query {}:\n".format(top, len(ranked_docs[i]), query))
    
  j=0
  for j in range(len(ranked_docs[i])):
    if (j < top):
      print("docId = {}\t score = {}".format(ranked_docs[i][j], result_scores[i][j]))
  i+=1


----------------------------------------------------------------------------------------------------------
Top 20 results out of 2719 for the searched query Indian government response to farmers:

docId = doc_8866	 score = 10.882384300231934
docId = doc_28972	 score = 10.8820161819458
docId = doc_36886	 score = 10.856545448303223
docId = doc_1672	 score = 10.845367431640625
docId = doc_1670	 score = 10.833749771118164
docId = doc_32181	 score = 10.828836441040039
docId = doc_42913	 score = 10.827942848205566
docId = doc_22895	 score = 10.82154655456543
docId = doc_27384	 score = 10.80708122253418
docId = doc_45348	 score = 10.806890487670898
docId = doc_26898	 score = 10.803295135498047
docId = doc_43814	 score = 10.803068161010742
docId = doc_5511	 score = 10.80180835723877
docId = doc_1265	 score = 10.796810150146484
docId = doc_1242	 score = 10.796810150146484
docId = doc_1128	 score = 10.796810150146484
docId = doc_8073	 score = 10.777034759521484
docId = doc_11845	 score = 10.776

**3. Can you imagine a better representation than word2vec? Justify your answer. (HINT - what about Doc2vec? Sentence2vec? Which are the pros and cons.**

This part is developed in our report. 