## IRWA Final Project

#### Load Python packages
Let's first import all the packages that we will need during the project.

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bertamitjavilapita/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bertamitjavilapita/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
#!pip install gensim

In [4]:
import pandas as pd
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy
import collections
from numpy import linalg as la
import time
import re
from collections import Counter
import statistics
import json
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import ast
import pickle
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

#### Load data into memory

**Data:** dataset of processed tweets that contains the following information for each tweet: Tweet, Date, Hashtags, Hashtags_count, Likes, Retweets, Url, and docId. Additionally, there is an evaluation dataset that will be useful for the evaluation part of the project, which includes the following information: docId, query_id, and label.

In [7]:
# import the processed tweet dataset
doc_path_1 = 'Data/'
doc_path_2 = '/Users/bertamitjavilapita/Desktop/BERTA/UPF/4RT/IRWA/PROJECT/IRWA Project summary and first deliverable/data/'
processed_tweets_df = pd.read_csv(doc_path_1 + 'processed_tweets.csv')

In [10]:
# read the evaluation file and immediately delete the last row - nan values (empty line), the file is invalid
evaluation_df = pd.read_csv(doc_path_2 + 'evaluation.csv', sep=';').iloc[:-1]

### Part 2: Indexing and Evaluation

#### Indexing


**1. Build inverted index: After having pre-processed the data, you can then create the inverted index.**

In [152]:
# convert elements of column tweet to from string to lists in order to apply indexing
processed_tweets_df['Tweet'] = processed_tweets_df['Tweet'].apply(lambda x: ast.literal_eval(x))

In [16]:
def create_index_tfidf(tweets_df, num_tweets):
    """
    Implement the inverted index and compute tf, df and idf

    Argument:
    tweets_df -- collection of processed tweets in pandas dataset form
    num_tweets -- total number of tweets

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of document these keys appears in (and the positions) as values.
    tf - normalized term frequency for each term in each document
    df - number of documents each term appear in
    idf - inverse document frequency of each term
    """

    # obtain the tweets content
    #lines = tweets_df['Tweet']
    #page_ids = tweets_df['docId']

    index = defaultdict(list)
    tf = defaultdict(list)  # term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  # document frequencies of terms in the corpus
    title_index = defaultdict(str)
    idf = defaultdict(float)

    # iterate trough tweets
    for row in tweets_df.itertuples():
        # obtain the tweets content
        line = row.Tweet
        page_id = row.docId

        ## ===============================================================
        ## create the index for the **current page** and store it in current_page_index
        ## current_page_index ==> { ‘term1’: [current_doc, [list of positions]], ...,‘term_n’: [current_doc, [list of positions]]}

        ## Example: if the curr_doc has id 1 and his text is
        ##"web retrieval information retrieval":

        ## current_page_index ==> { ‘web’: [1, [0]], ‘retrieval’: [1, [1,4]], ‘information’: [1, [2]]}

        ## the term ‘web’ appears in document 1 in positions 0,
        ## the term ‘retrieval’ appears in document 1 in positions 1 and 4
        ## ===============================================================

        current_page_index = {}

        # line contain a list of terms for each tweet content
        for position, term in enumerate(line):
            try:
                # if the term is already in the dict append the position to the corresponding list
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term]=[page_id, array('I',[position])] #'I' indicates unsigned int (int in Python)

        #normalize term frequencies
        # Compute the denominator to normalize term frequencies (formula 2 above)
        # norm is the same for all terms of a document.
        norm = 0
        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document.
            # posting ==> [current_doc, [list of positions]]
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        # calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] += 1 # increment DF for current term

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        # Compute IDF following the formula (3) above. HINT: use np.log
        for term in df:
            idf[term] = np.round(np.log(float(num_tweets/df[term])), 4)

    return index, tf, df, idf

In [None]:
num_tweets = len(processed_tweets_df)
index, tf, df, idf = create_index_tfidf(processed_tweets_df, num_tweets)

In [158]:
# ater creating the index, tf, df, and idf, serialize them
def serialize_data(index, tf, df, idf, filepath="index_data.pkl"):
    with open(filepath, 'wb') as f:
        pickle.dump((index, tf, df, idf), f)

In [154]:
# lo load the serialized data for future use
def load_serialized_data(filepath="index_data.pkl"):
    with open(filepath, 'rb') as f:
        index, tf, df, idf = pickle.load(f)
    return index, tf, df, idf

In [160]:
# first time using index

# Serialize the data to avoid recomputation in the future
serialize_data(index, tf, df, idf, filepath=doc_path_1 + "index_data.pkl")

In [162]:
# load the serialized data from the file for future use
index, tf, df, idf = load_serialized_data(filepath=doc_path_1 + "index_data.pkl")

**2. Propose test queries: Define five queries that will be used to evaluate your search engine (e.g., “Indian protest”, “support farmers”)**

When evaluating a search engine, it's crucial to use a variety of search terms that represent different topics and formats. This ensures that the engine is tested across a wide range of real-world scenarios users typically face. To do this effectively, we will choose popular search terms. Here are the five queries we'll be using for this evaluation:

- "Indian government response to farmers"
- "International support for farmers"
- "Demands of farmers' protests"
- "Police action during farmers protest"
- "Schedules and locations of demonstrations and protests"

**3. Rank your results: Implement the TF-IDF algorithm and provide ranking based results.**

In [164]:
# function from part 1 to process tweets
def process_tweet(tweet):

    """
    Preprocess the tweet removing stop words, stemming,
    transforming in lowercase and return the tokens of the text.

    Argument:
    line -- string (tweet) to be preprocessed

    Returns:
    tweet - a list of tokens corresponding to the input text after the preprocessing
    """

    stemmer = PorterStemmer() # define the stemmer
    stop_words = set(stopwords.words("english")) # define the stopwords
    tweet =  tweet.lower() # transform the line to lowercase
    tweet = tweet.replace('\\n', '') # remove newline characters
    tweet = ' '.join(tweet.split()) # remove extra whitespaces
    tweet = re.sub(r'\S*https?:\S*', '', tweet) # delete URLs on the tweet because we won't be able to access to them
    tweet.strip() # remove spaces at first and at the end of a message
    tweet = re.sub(r' ?#\S+', '', tweet) # remove word that is with the hastag - hastag is saved separately
    tweet = re.sub(r'[^a-z0-9#@ ]+', '', tweet) # remove punctuation
    tweet = re.sub(r'[^\w\s]', '', tweet) # delete punctuation
    tweet = tweet.split() # tokenize the text to get a list of terms
    tweet = [word for word in tweet if word not in stop_words] # eliminate the stopwords
    tweet = [stemmer.stem(word) for word in tweet] # perform stemming

    return tweet

In [166]:
def rank_documents(terms, docs, index, idf, tf):
    """
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies

    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= query_terms_count[term]/ query_norm * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
           # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....

            #tf[term][0] will contain the tf of the term "term" in the doc 26
            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot

    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)

    result_docs = [x[1] for x in doc_scores]
    result_scores = [x[0] for x in doc_scores]

    if len(result_docs) == 0:
        print('No results found, try again')
        query = input()
        docs = search_tf_idf(query, index)
    return result_docs, result_scores

In [168]:
def search_tf_idf(query, index):
    """
    output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = process_tweet(query)
    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs=[posting[0] for posting in index[term]]
            docs |= set(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs = rank_documents(query, docs, index, idf, tf)
    return ranked_docs

In [170]:
# definition of our queries
queries = ['Indian government response to farmers',
           'International support for farmers',
           "Demands of farmers' protests",
           'Police action during farmers protest',
           'Schedules and locations of demonstrations and protests']

In [172]:
top = 10
results = {}  # dictionary to store results for each query

# iterate through queries
for i, query in enumerate(queries):
    # execute the search function
    ranked_docs, result_scores = search_tf_idf(query, index)

    # store results in the dictionary
    results[f'ranked_docs_q{i+1}'] = ranked_docs
    results[f'result_scores_q{i+1}'] = result_scores

    print('\n----------------------------------------------------------------------------------------------------------')
    # print the top documents and their scores

    print(f"\nTop {top} results out of {len(ranked_docs)} retrieved documents for the query '{query}':\n")
    for doc_id, score in zip(ranked_docs[:top], result_scores[:top]):
        print(f'docId = {doc_id}\t score = {round(score, 4)}')


----------------------------------------------------------------------------------------------------------

Top 10 results out of 15989 retrieved documents for the query 'Indian government response to farmers':

docId = doc_3234	 score = 11.8045
docId = doc_27784	 score = 9.5898
docId = doc_38114	 score = 9.5064
docId = doc_38045	 score = 7.8308
docId = doc_33082	 score = 7.8308
docId = doc_41021	 score = 7.6648
docId = doc_13543	 score = 7.5804
docId = doc_7288	 score = 7.2119
docId = doc_1353	 score = 7.0977
docId = doc_39288	 score = 6.7811

----------------------------------------------------------------------------------------------------------

Top 10 results out of 15767 retrieved documents for the query 'International support for farmers':

docId = doc_859	 score = 10.9298
docId = doc_21846	 score = 10.9298
docId = doc_12950	 score = 9.6171
docId = doc_38762	 score = 8.9963
docId = doc_27400	 score = 8.9963
docId = doc_21630	 score = 8.9963
docId = doc_6713	 score = 8.0471
doc

#### Evaluation

There will be 2 main evaluation components:

- A baseline with 2 queries and the ground truth files for each query will be given to you, using a subset of documents from the dataset.
1. Query 1: people's rights.
2.  Query 2: Indian Government


- You will be the expert judges, so you will be setting the ground truth for each document and query in a binary way for the test queries that you defined in step 2 at the indexing stage.

**Task 1:**

**For the prior evaluation components you must evaluate your algorithm by using different evaluation techniques:**
- Precision@K (P@K)
- Recall@K (R@K)
- Average Precision@K (P@K)
- F1-Score@K
- Mean Average Precision (MAP)
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (NDCG)

**Only for the second component (your queries) comment on each of the evaluation techniques stating how they differ, and which information gives each of them. Analyze your results.**

In [174]:
# queries definition
defined_queries = ["people's rights",
                   "Indian Government"]

In [176]:
# head of evaluation dataset
evaluation_df.head(5)

Unnamed: 0,docId,query_id,label
0,doc_156,1.0,0.0
1,doc_1039,1.0,0.0
2,doc_1047,1.0,1.0
3,doc_1685,1.0,0.0
4,doc_2100,1.0,1.0


In [178]:
print_result = evaluation_df["label"].unique()
print("The evaluation of our dataset is composed of {} Relevance Levels: {}".format(len(print_result), sorted(print_result)))

The evaluation of our dataset is composed of 2 Relevance Levels: [0.0, 1.0]


In [180]:
# create the subset needed for evaluation
evaluation_q1 = evaluation_df[(evaluation_df['query_id'] == 1)]
evaluation_q2 = evaluation_df[(evaluation_df['query_id'] == 2)]

# extract tweets subset from queries by docId
docs_q1 = processed_tweets_df[processed_tweets_df['docId'].isin(evaluation_q1['docId'])]
docs_q2 = processed_tweets_df[processed_tweets_df['docId'].isin(evaluation_q2['docId'])]

In [182]:
# run the inverted index for each subset - each subset is docs of a query
# then run the search tf-idf performed above
# search preprocess the query

# query 1
indexes_q1, t_q1, d_q1, id_q1  = create_index_tfidf(docs_q1, len(docs_q1))
ranked_docs_q1, result_scores_q1 = search_tf_idf(defined_queries[0], indexes_q1)

# query 2
indexes_q2, t_q2, d_q2, id_q2  = create_index_tfidf(docs_q2, len(docs_q2))
ranked_docs_q2, result_scores_q2 = search_tf_idf(defined_queries[1], indexes_q2)

In [184]:
# build structure to execute evaluated techniques
# 1. create a dict to map docs with results
# 2. add score do the corresponding docId in evaluating df for the current query
# 3. sort by score
# 4. round with 4 decimals

# to avoid warnings
evaluation_q1 = evaluation_q1.copy()
evaluation_q2 = evaluation_q2.copy()

# query 1.
doc_score_map_q1 = dict(zip(ranked_docs_q1, result_scores_q1))
evaluation_q1.loc[:, 'score'] = evaluation_q1['docId'].map(doc_score_map_q1)
evaluation_q1 = evaluation_q1.sort_values(by='score', ascending=False).reset_index(drop=True)
evaluation_q1['score'] = evaluation_q1['score'].round(4)

# query 2.
doc_score_map_q2 = dict(zip(ranked_docs_q2, result_scores_q2))
evaluation_q2.loc[:, 'score'] = evaluation_q2['docId'].map(doc_score_map_q2)
evaluation_q2.fillna(0, inplace=True)
evaluation_q2 = evaluation_q2.sort_values(by='score', ascending=False).reset_index(drop=True)
evaluation_q2['score'] = evaluation_q2['score'].round(4)

##### 1. Precision@K (P@K)

In [187]:
def precision_at_k(y_true, y_score, k=10):
    '''
    Parameters
    ----------
    y_true: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    precision @k : float

    '''
    order = np.argsort(y_score)[::-1]
    y_true = y_true[order[:k]]
    relevant = sum(y_true == 1)
    return float(relevant/k)

# variables -> y_true must be label.
# y_score - predicted relevance

##### 2. Recall@K (R@K)

In [190]:
def recall_at_k(y_true, y_score, k=10):
    '''
    Parameters
    ----------
    y_true: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of docs to consider.

    Returns
    -------
    recall @k : float
    '''

    order = np.argsort(y_score)[::-1]
    y_true = y_true[order[:k]]
    relevant = sum(y_true == 1)
    total_relevant = sum(y_true)
    # avoid warning for division by 0
    if total_relevant == 0:
        return 0.0

    # computation of recall
    return float(relevant / total_relevant)

##### 3. Average Precision@K (P@K)

In [193]:
def avg_precision_at_k(y_true, y_score, k=10):
    """
    Parameters
    ----------
    y_true: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    average precision @k : float
    """

    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order.

    prec_at_i = 0
    prec_at_i_list = []
    number_of_relevant = 0
    number_to_iterate = min(k, len(order))

    for i in range(number_to_iterate):
        if y_true[order[i]] == 1:
            number_of_relevant += 1
            prec_at_i = number_of_relevant / (i + 1)
            prec_at_i_list.append(prec_at_i)

    if number_of_relevant == 0:
        return 0
    else:
        return round(np.sum(prec_at_i_list) / number_of_relevant, 4)

##### 4. F1-Score@K

In [196]:
def f1_score_at_k(y_true, y_score, k):
    """
    Parameters
    ----------
    y_true: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    f1-score @k : float
    """

    precision_k = precision_at_k(y_true, y_score, k)
    recall_k = recall_at_k(y_true, y_score, k)
    if precision_k + recall_k == 0:
        return 0
    return round((2 * precision_k * recall_k) / (precision_k + recall_k), 4)

##### 5. Mean Average Precision (MAP)

In [199]:
def map_at_k(search_res, k=10):
    '''
    Parameters
    ----------
    search_res: search results dataset containing:
        q_id: query id.
        doc_id: document id.
        predicted_relevance: relevance predicted through LightGBM.
        y_true: actual score of the document for the query (ground truth).

    Returns
    -------
    mean average precision @k : float
    '''
    avp = []
    for q in search_res["query_id"].unique():  #loop over all query id
        curr_data = search_res[search_res["query_id"] == q]  # select data for current query
        avp.append(avg_precision_at_k(np.array(curr_data["label"]), np.array(curr_data["docId"]),
                                      k))  #append average precision for current query
    return round(np.sum(avp) / len(avp), 4), avp  # return mean average precision

##### 6. Mean Reciprocal Rank (MRR)

In [202]:
def rr_at_k(y_true, y_score, k=10):
    """
    Parameters
    ----------
    y_true: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    Reciprocal Rank for qurrent query
    """

    # convert into array
    y_true = np.array(y_true)
    y_score = np.array(y_score)

    order = np.argsort(y_score)[::-1] # get the list of indexes of the predicted score sorted in descending order.
    y_true = np.take(y_true, order[:k]) # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k.
    if np.sum(y_true) == 0: # if there are not relevant doument return 0
        return 0
    return 1 / (np.argmax(y_true == 1) + 1) # hint: to get the position of the first relevant document use "np.argmax"

##### 7. Normalized Discounted Cumulative Gain (NDCG)

In [205]:
def dcg_at_k(y_true, y_score, k=10):

    # convert into array
    y_true = np.array(y_true)
    y_score = np.array(y_score)

    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order.
    y_true = np.take(y_true, order[
                             :k])  # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k.
    gain = 2 ** y_true - 1  # Compute gain (use formula 7 above)
    discounts = np.log2(np.arange(len(y_true)) + 2)  # Compute denominator
    return np.sum(gain / discounts)  #return dcg@k


def ndcg_at_k(y_true, y_score, k=10):
    dcg_max = dcg_at_k(y_true, y_true, k)
    if not dcg_max:
        return 0
    return np.round(dcg_at_k(y_true, y_score, k) / dcg_max, 4)

##### 2.1. Evaluation for predefined queries

In [208]:
def print_evaluation_metrics(defined_queries, evaluations, k=10):
   
    for i, query in enumerate(defined_queries):
        print("----------------------------------------------------------------------------------------------------------\n")
        print(f"Query: '{query}'\n")
        
        y_true = np.array(evaluations[i]["label"])
        y_score = np.array(evaluations[i]["score"])
        
        precision = precision_at_k(y_true, y_score, k)
        print(f"=> Precision@{k}: {precision}\n")
        
        recall = recall_at_k(y_true, y_score, k)
        print(f"=> Recall@{k}: {recall}\n")
        
        avg_precision = avg_precision_at_k(y_true, y_score, k)
        print(f"=> Average Precision@{k}: {avg_precision}\n")
        
        f1 = f1_score_at_k(y_true, y_score, k)
        print(f"=> F1-Score@{k}: {f1}\n")
        
        map_value = map_at_k(evaluations[i], k)[0]
        print(f"=> MAP with k={k}: {map_value}\n")
        
        mrr = rr_at_k(y_true, y_score, k)
        print(f"=> Mean Reciprocal Rank with k={k}: {mrr}\n")
        
        ndcg = ndcg_at_k(y_true, y_score, k)
        print(f"=> Normalized Discounted Cumulative Gain with k={k}: {ndcg}\n")

In [210]:
# print results for defined queries
k = 10

# join results in order to the for loop
evaluations = [evaluation_q1, evaluation_q2]
print_evaluation_metrics(defined_queries, evaluations, k)

----------------------------------------------------------------------------------------------------------

Query: 'people's rights'

=> Precision@10: 0.9

=> Recall@10: 1.0

=> Average Precision@10: 0.9889

=> F1-Score@10: 0.9474

=> MAP with k=10: 0.8211

=> Mean Reciprocal Rank with k=10: 1.0

=> Normalized Discounted Cumulative Gain with k=10: 0.9337

----------------------------------------------------------------------------------------------------------

Query: 'Indian Government'

=> Precision@10: 1.0

=> Recall@10: 1.0

=> Average Precision@10: 1.0

=> F1-Score@10: 1.0

=> MAP with k=10: 0.361

=> Mean Reciprocal Rank with k=10: 1.0

=> Normalized Discounted Cumulative Gain with k=10: 1.0



##### 2.2. Evaluation for our queries

In [95]:
# based on the results from step 2, we select the top 30 ranked tweets as "retrieved"
# We build a dataset for each of the 5 queries, including docId, queryId, and label

# we create the evaluation dataset for each query

num = 30
evaluation_datasets = []  # list for evaluation sets for each query 

# loop through each query 
for query_id, query in enumerate(queries):

    #ranked_docs, result_scores = search_tf_idf(query, index)
    ranked_docs = results[f'ranked_docs_q{query_id + 1}']
    result_scores = results[f'result_scores_q{query_id + 1}']
    
    # take the top 30 as "retrieved" (label 1)
    retrieved_docs = ranked_docs[:num]

    # take their correspondant scores
    retrieved_scores = result_scores[:num]

    # list for the actual query
    evaluation_data = []
    
    # add the retrieved docs with label 1 to the evaluation set for the query
    for doc_id, score in zip(retrieved_docs, retrieved_scores):
        evaluation_data.append({"docId": doc_id, "query_id": query_id + 1, "label": 1, "score": round(score, 4)})
    
    # add the results for the query in the evaliation datasets
    evaluation_df = pd.DataFrame(evaluation_data)
    evaluation_datasets.append(evaluation_df)

In [97]:
# to check for relationships, we'll use the unprocessed tweets since this is the vocabulary we understand
# load the JSON file
# keep only the variables tweetId and Tweet
# merge it with the document dataset

with open(doc_path_2 + 'farmers-protest-tweets.json') as fp:
    tweets = fp.readlines()
tweets = [t.strip().replace(' +', ' ') for t in tweets]

tweet_data = []

# Iterate over each tweet JSON line
for tweet in tweets:
    # Convert JSON string to dictionary
    tweet_dict = json.loads(tweet)
    
    # Extract relevant information
    tweet_id = tweet_dict['id']
    tweet_content = tweet_dict['content']
    
    # Append extracted info as a dictionary
    tweet_data.append({"id": tweet_id, "content": tweet_content})

# Create DataFrame with only 'id' and 'content' columns
tweets_df = pd.DataFrame(tweet_data)

tweet_document_ids_map = pd.read_csv(docs_path + 'tweet_document_ids_map.csv')

tweets_df = tweets_df.merge(tweet_document_ids_map, on='id', how='right')

In [98]:
# manual check for assess if docs are related to the queries
# Assumptions:
# - `evaluation_datasets` is a list of DataFrames, one for each query, with each DataFrame containing docId, query_id, label, and score.
# - `processed_tweets_df` is a DataFrame that includes a 'docId' column and a 'tweet' column with the tweet content.

for i, evaluation_df in enumerate(evaluation_datasets):
    query = queries[i]  # Retrieve the query associated with this evaluation dataset

    print('--------------------------------------------------------------------------------------------------------------------------------')
    print(f"\n QUERY {i + 1}: {query}\n")
    
    # filter for documents labeled as '1' (retrieved as relevant)
    relevant_docs = evaluation_df[evaluation_df['label'] == 1]
    
    for _, row in relevant_docs.iterrows():
        doc_id = row['docId']
        
        # find the tweet associated with the docId in the processed_tweets_df
        tweet = tweets_df.loc[tweets_df['docId'] == doc_id, 'content'].values
        print(f" - docId: {doc_id} - Tweet: {tweet[0]}\n")

--------------------------------------------------------------------------------------------------------------------------------

 QUERY 1: Indian government response to farmers

 - docId: doc_3234 - Tweet: Nehru is responsible for #ChineseIntrusion 

Previous govt is responsible for #PetrolDieselPriceHike 

#COVID19 is responsible for economic turmoil

#Khalistan is responsible for #FarmersProtest 

Modi isn’t responsible for anything because he is #DaughterofBengal https://t.co/RiXKhTBO34

 - docId: doc_27784 - Tweet: There is no difference between #DeepSidhu #NikitaJacob #Disha_Ravi #GretaThunberg. They all are responsible for #republicdayviolence. #FarmersProtest #FarmBills2020 #ToolkitGang #ToolkitCase #ToolKitExposed #ToolkitAdmission #ToolkitTruthProbe #ToolkitConspiracy #Toolkit

 - docId: doc_38114 - Tweet: Madness!

If the government and Supreme Court has any responsibility please tell the country `If anything happens to her who will be responsible? ' Corona protocols are for

In [99]:
# aabans de fer l'execució mirar si quadren els resultats
# si no ho son posar un 0 a label

# query 1
non_relevant_doc_ids_q1 = [
    "doc_3234", "doc_27784", "doc_38114", "doc_38045", "doc_32344",
    "doc_24570", "doc_24061", "doc_17249", "doc_37157", "doc_2453",
    "doc_19626", "doc_45649", "doc_43568", "doc_10847", "doc_10556"
]
evaluation_datasets[0].loc[evaluation_datasets[0]['docId'].isin(non_relevant_doc_ids_q1), 'label'] = 0

# query 2
non_relevant_doc_ids_q2 = [
    "doc_12950", "doc_38762", "doc_21630", "doc_6713", "doc_28710",
    "doc_42972", "doc_28358", "doc_27294", "doc_25064", "doc_17785",
    "doc_42502", "doc_6920", "doc_6031", "doc_48125", "doc_44813",
    "doc_41492", "doc_29827", "doc_28889", "doc_1896", "doc_40083"
]
evaluation_datasets[1].loc[evaluation_datasets[1]['docId'].isin(non_relevant_doc_ids_q2), 'label'] = 0

# query 3
non_relevant_doc_ids_q3 = [
    "doc_15908", "doc_10445", "doc_18128", "doc_887", 
    "doc_43810", "doc_43799", "doc_43790", "doc_12672", 
    "doc_7128", "doc_46063"
]
evaluation_datasets[2].loc[evaluation_datasets[2]['docId'].isin(non_relevant_doc_ids_q3), 'label'] = 0


# query 4
non_relevant_doc_ids_q4 = [
    "doc_24374", "doc_13980", "doc_32026", "doc_5491", "doc_34650",
    "doc_14464", "doc_17561", "doc_33549", "doc_33540", "doc_32143",
    "doc_31632", "doc_18542", "doc_32162", "doc_41805", "doc_11674",
    "doc_18802", "doc_8324", "doc_41801", "doc_33548", "doc_33539",
    "doc_33454", "doc_32593", "doc_18568", "doc_11393", "doc_44605",
    "doc_22518", "doc_40791", "doc_37762", "doc_33099", "doc_31109"
]
evaluation_datasets[3].loc[evaluation_datasets[3]['docId'].isin(non_relevant_doc_ids_q4), 'label'] = 0

# query 5 
non_relevant_doc_ids_q5 = [
    "doc_44233", "doc_45111", "doc_45564", "doc_9157",
    "doc_19010", "doc_2192", "doc_17302", "doc_16783",
    "doc_3228", "doc_38629", "doc_38189", "doc_19312", 
    "doc_43347", "doc_42936", "doc_23067", "doc_23067",
    "doc_27672"
]
evaluation_datasets[4].loc[evaluation_datasets[4]['docId'].isin(non_relevant_doc_ids_q5), 'label'] = 0

In [100]:
# print results for our queries
k = 10
print_evaluation_metrics(queries, evaluation_datasets, k)

----------------------------------------------------------------------------------------------------------

Query: 'Indian government response to farmers'

=> Precision@10: 0.5

=> Recall@10: 1.0

=> Average Precision@10: 0.4035

=> F1-Score@10: 0.6667

=> MAP with k=10: 0.8762

=> Mean Reciprocal Rank with k=10: 0.2

=> Normalized Discounted Cumulative Gain with k=10: 0.3726

----------------------------------------------------------------------------------------------------------

Query: 'International support for farmers'

=> Precision@10: 0.3

=> Recall@10: 1.0

=> Average Precision@10: 0.8667

=> F1-Score@10: 0.4615

=> MAP with k=10: 0.6429

=> Mean Reciprocal Rank with k=10: 1.0

=> Normalized Discounted Cumulative Gain with k=10: 0.4441

----------------------------------------------------------------------------------------------------------

Query: 'Demands of farmers' protests'

=> Precision@10: 0.7

=> Recall@10: 1.0

=> Average Precision@10: 0.7546

=> F1-Score@10: 0.8235


**Task 2:**

**Choose one vector representation, TF-IDF or word2vec, and represent the tweets in a two-dimensional scatter plot through the T-SNE (T-distributed Stochastic Neighbor Embedding) algorithm. To do so, you may need first to represent the word as a vector, and then the tweet, i.e., resulted as the average value over the words involved. Any other option rather than T-SNE may be used, but needs to be justified.**

In [138]:
# apply word2vec model
model = Word2Vec(sentences=processed_tweets_df['Tweet'], vector_size=100, window=5, min_count=1, workers=4)

In [139]:
# extract words and vectors from the model
words, word_vectors = zip(*[(word, model.wv[word]) for word in model.wv.index_to_key])
# Convert word vectors to a NumPy array for t-SNE processing
word_vectors = np.array(word_vectors)

In [140]:
# Initialize and fit the t-SNE model
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', max_iter=3500, random_state=23)
tsne_results = tsne_model.fit_transform(word_vectors)

In [141]:
# Plot the results
plt.figure(figsize=(30, 30))
x, y = tsne_results[:, 0], tsne_results[:, 1]

for i, word in enumerate(words):
    plt.scatter(x[i], y[i])
    plt.annotate(word,
                 xy=(x[i], y[i]),
                 xytext=(4, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

plot_filename = 'tsne_plot.png'  
plt.savefig(plot_filename)  # save the figure to a file
plt.close()  # close the plot to avoid displaying it