# PROJECT IRWA 2022

## Part 1: Text Processing

1. Take into account that for future queries, the final output must return (when
present) the following information for each of the selected documents: Tweet |
Username | Date | Hashtags | Likes | Retweets | Url (here the “Url” means the
tweet link).

2. Think about how to handle the hashtags from your pre-processing steps (e.g.,
removing the “#” from the word), since it may be useful to involve them as separate terms
inside the inverted index.

The suggested library that may help you in stemming and stopwords: nltk

In [1]:
#We do all the imports
import nltk
nltk.download('stopwords') #Dowload list of stopwords

from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import time
import re #Library used to remove certain symbols / characters from a text
import random

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
!pwd
!ls

/work
data  deepnote_exports	image-20221107-164154.png  imgs


In [3]:
# you can use pandas to explore your files
import pandas as pd
import json

# Read datasets
tweet_document_ids_map = pd.read_csv(r'data/tweet_document_ids_map.csv', header = None, delimiter = "\t")

In [4]:
tweet_document_ids_map


Unnamed: 0,0,1
0,doc_1,1575918182698979328
1,doc_2,1575918151862304768
2,doc_3,1575918140839673873
3,doc_4,1575918135009738752
4,doc_5,1575918119251419136
...,...,...
3995,doc_3996,1575856268022992896
3996,doc_3997,1575856245650919424
3997,doc_3998,1575856228886089728
3998,doc_3999,1575856226139017216


In [5]:
dict_1 = tweet_document_ids_map.to_dict('list')
docs = dict_1[0]
tweet_ids = dict_1[1]

In [6]:
docs_title_index = {docs[i]: tweet_ids[i] for i in range(len(docs))}

In [7]:
tweets_title_index = {tweet_ids[i]: docs[i] for i in range(len(docs))}

In [8]:
doc_ID = {} #Create a dictinary to save the docID
for i in range(len(tweet_document_ids_map)): #Iterate through the csv file rows
    #Store the docID in the dictionary accessing with the tweetID
    doc_ID[tweet_document_ids_map[1].iloc[i]] = tweet_document_ids_map[0].iloc[i]

In [9]:
#We open the json file and extract the tweets in it
with open("data/tw_hurricane_data.json", "r") as json_file:
    raw_tweets = [json.loads(tweet) for tweet in json_file]

In [10]:
tweets = {} #We create a dictionary to save all the needed tweets' information

for tweet in raw_tweets: #Iterate through all tweets in the json file
    dict_tweet = {} #Create a dictinary to save each individual tweet's information
    dict_tweet['Tweet'] = tweet['full_text'] #Store the text of the tweet
    dict_tweet['Date'] = tweet['created_at'] #Store the date of the tweet creation
    dict_tweet['Hashtags']= [] #Create an empty list to store the hashtags in the tweet
    for hashtag in tweet['entities']['hashtags']: #Iterate through the hashtag dictionary
        dict_tweet['Hashtags'].append(hashtag['text']) #Append the hashtag to the list
    dict_tweet['Username'] = tweet['user']['screen_name'] #Store the username of the "writer"
    dict_tweet['Likes'] = tweet['favorite_count'] #Store the likes count of the tweet
    dict_tweet['Retweets'] = tweet['retweet_count'] #Store the retweets count of the tweet

    #We "create" the URL of each tweet and store it
    #https://twitter.com/screen_name/status/tweet_id
    dict_tweet['Url'] = 'https://twitter.com/'+dict_tweet['Username']+'/status/'+tweet['id_str']

    dict_tweet['Doc_ID'] = doc_ID[tweet['id']]
    #add tweet to dictionary tweets with the id as the tweet key
    tweets[tweet['id']] = dict_tweet


In [11]:
#Print an example of a stored tweet information entry from the dictionary
tweets[list(tweets.keys())[2570]]

{'Tweet': '@Next_Gen_X $kerri0922 would help with #HurricaneIan clean up. https://t.co/SuDL5LStQu',
 'Date': 'Fri Sep 30 15:46:52 +0000 2022',
 'Hashtags': ['HurricaneIan'],
 'Username': 'NewKerristartin',
 'Likes': 1,
 'Retweets': 0,
 'Url': 'https://twitter.com/NewKerristartin/status/1575874831190413312',
 'Doc_ID': 'doc_2571'}

In [None]:
#This function that receives a line of text as input will return a list of the words
#contained in it after having removed the stopwords and non-important symbols, 
#transforming to lowercase, tokenizing and stemming.
def build_terms(line):
    

    stemmer = PorterStemmer()

    stop_words = set(stopwords.words("english"))
    line = line.lower()  #Convert to lowercase
    line = line.split()  # Tokenize the text to get a list of terms
    line = [x for x in line if x not in stop_words]  # eliminate the stopwords
    line = [x for x in line if x.startswith(("@", "https://", "$", '#')) != True]  # eliminate mentions
    line = [re.sub('[^a-z]+', '', x) for x in line] # since it's in english we don't have to worry about accents and such
    line = [stemmer.stem(word) for word in line] # perform stemming (HINT: use List Comprehension)
    return line

In [13]:
#We apply the function build_terms(line) to the text section of each tweet
for tweet in tweets.keys():
    tweets[tweet]['Tweet'] = build_terms(tweets[tweet]['Tweet'])

In [14]:
#Print the same example as above to see the result of the text section of the tweet after
#having applied the function build_terms(line)
tweets[list(tweets.keys())[2570]]

{'Tweet': ['would', 'help', 'clean', 'up'],
 'Date': 'Fri Sep 30 15:46:52 +0000 2022',
 'Hashtags': ['HurricaneIan'],
 'Username': 'NewKerristartin',
 'Likes': 1,
 'Retweets': 0,
 'Url': 'https://twitter.com/NewKerristartin/status/1575874831190413312',
 'Doc_ID': 'doc_2571'}

## Part 2: Indexing and Evaluation

**INDEXING**
1. Build inverted index: After having pre-processed the data, you can then create the inverted index.
HINT - you may use the vocabulary data structure, like the one seen during the Practical Labs:
{
Term_id_1: [document_1, document_2, document_4],
Term_id_2: [document_1, document_3, document_5, document_6], etc...
}
Documents information: Since we are dealing with conjunctive queries (AND), each of the returned documents should contain all the words in the query.

## Index Implementation

In [15]:
keys_to_extract = [1575864144393568261, 1575896926129516544, 1575866048775327744, 1575884902138277889, 1575911063144407040, 1575872559937384450, 1575861260372959234, 1575906506909753345, 1575903947180900352, 1575865596935217152, 1575866634191142912, 1575858702913269760, 1575886942977523712, 1575901820060213249, 1575871984566939648, 1575866300488126469, 1575902051862663168, 1575889467813920773, 1575903004896759810, 1575916936373870592, 1575910611208355875, 1575878936940232706, 1575867138174201858, 1575912302024732672, 1575856366320635907]
tweets_subset = {key: tweets[key] for key in keys_to_extract}

In [None]:
def create_index(tweets):
    
    for tweet in tweets.keys():
        doc_ID = tweets[tweet]['Doc_ID']
        terms = build_terms(doc_ID+' '+' '.join(tweets[tweet]['Tweet']))  # docID + tweet_text
        

        current_page_index = {}

        for position, term in enumerate(terms): # terms contains docID + tweet_text. Loop over all terms
            try:
                # if the term is already in the index for the current page (current_page_index)
                # append the position to the corresponding list
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_ID, array('I', [position])]  #'I' indicates unsigned int (int in Python)

        # merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    return index

In [17]:
index = create_index(tweets)
print("Index results for the term 'keeping': {}\n".format(index['keeping']))
print("First 10 Index results for the term 'keep': \n{}".format(index['keep'][:10]))

Index results for the term 'keeping': []

First 10 Index results for the term 'keep': 
[['doc_1', array('I', [1])], ['doc_86', array('I', [14])], ['doc_95', array('I', [11, 14])], ['doc_148', array('I', [10])], ['doc_228', array('I', [2])], ['doc_413', array('I', [1])], ['doc_433', array('I', [1, 3])], ['doc_446', array('I', [4])], ['doc_467', array('I', [2])], ['doc_484', array('I', [28])]]


## Querying The Index

In [None]:
def search(query, index):
    
    query2 = build_terms(query)
    docs = set()
    tweet_ids = set()
    for term in query2:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]
            # docs = docs Union term_docs
            docs |= set(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    return docs

In [19]:
query = "keep warm"
docs = search(query, index)
top = 10

print("\n======================\nSample of {} results out of {} for the searched query \'{}\':\n".format(top, len(docs), query))
for d_id in docs[:top]:
    print("doc_id = {}".format(d_id))


Sample of 10 results out of 90 for the searched query 'keep warm':

doc_id = doc_970
doc_id = doc_3955
doc_id = doc_2959
doc_id = doc_653
doc_id = doc_1102
doc_id = doc_2925
doc_id = doc_148
doc_id = doc_998
doc_id = doc_1651
doc_id = doc_3711


2. Propose test queries: Define five queries that will be used to evaluate your search engine (e.g., “covid pandemic”, “covid vaccine”)

We have proposed the following queries:

Bad storm
Hurricane disaster
Evacuate house
Florida shelter
Rain fall

3. Rank your results: Implement the TF-IDF algorithm and provide ranking based results.

## Add Ranking with TF-IDF

In [20]:
def create_index_tfidf(tweets, num_documents):

    index = defaultdict(list)
    tf = defaultdict(list)  #term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  #document frequencies of terms in the corpus
    idf = defaultdict(float)

    for tweet in tweets.keys():
        doc_ID = tweets[tweet]['Doc_ID']
        terms = build_terms(doc_ID+' '+' '.join(tweets[tweet]['Tweet']))  # docID + tweet_text

        current_page_index = {}

        for position, term in enumerate(terms):  ## terms contains page_title + page_text
            try:
                # if the term is already in the dict append the position to the corresponding list
                current_page_index[term][1].append(position) 
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_ID, array('I', [position])]  #'I' indicates unsigned int (int in Python)

        # normalize term frequencies
        # Compute the denominator to normalize term frequencies (formula 2 above)
        # norm is the same for all terms of a document.
        norm = 0
        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document. 
            # posting ==> [current_doc, [list of positions]] 
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        #calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1]) / norm, 4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] += 1 # increment DF for current term

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        # Compute IDF following the formula (3) above. HINT: use np.log
        for term in df:
            idf[term] = np.round(np.log(float(num_documents / df[term])), 4)

    return index, tf, df, idf

In [21]:
start_time = time.time()
num_documents = len(tweets)
index, tf, df, idf = create_index_tfidf(tweets, num_documents)
print("Total time to create the TD-IDF index: {} seconds" .format(np.round(time.time() - start_time, 2)))

Total time to create the TD-IDF index: 181.79 seconds


In [22]:
def rank_documents(terms, docs, index, idf, tf):
    
    # I'm interested only on the element of the docVector corresponding to the query terms 
    # The remaining elements would become 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query. 
    # Example: collections.Counter(["hello","hello","world"]) --> Counter({'hello': 2, 'world': 1})
    #HINT: use when computing tf for query_vector

    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        # TODO: check how to vectorize the query
        # query_vector[termIndex]=idf[term]  # original
        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex] = query_terms_count[term] / query_norm * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....

            #tf[term][0] will contain the tf of the term "term" in the doc 26            
            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc 
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot
    
    doc_scores = [[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items()]
    doc_scores.sort(reverse=True)
    result_docs = [x[1] for x in doc_scores]
    result_pred_score = [x[0] for x in doc_scores]

    if len(result_docs) == 0:
        print("No results found, try again")
        query = input()
        docs = search_tf_idf(query, index)
    #print ('\n'.join(result_docs), '\n')
    return result_docs, result_pred_score

In [23]:
def search_tf_idf(query, index):
    """
    output is the list of documents that contain any of the query terms. 
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = build_terms(query)
    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"                        
            term_docs = [posting[0] for posting in index[term]]
            
            # docs = docs Union term_docs
            docs |= set(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs, pred_score = rank_documents(query, docs, index, idf, tf)
    return ranked_docs, pred_score

In [144]:
query = "bad storm"
ranked_docs, _ = search_tf_idf(query, index)
top = 10

print("\n======================\nTop {} results out of {} for the searched query:\n".format(top, len(ranked_docs)))
for d_id in ranked_docs[:top]:
    print("page_id= {} - page_title: {}".format(d_id, tweets[docs_title_index[d_id]]['Tweet']))


Top 10 results out of 362 for the searched query:

page_id= doc_3812 - page_title: ['bad', 'moment', 'there', 'come']
page_id= doc_200 - page_title: ['bad', 'peopl', 'consid']
page_id= doc_1744 - page_title: ['feel', 'bad', 'tree']
page_id= doc_1587 - page_title: ['wow', 'get', 'bad']
page_id= doc_1264 - page_title: ['bad', 'right', 'now', 'luckili', 'mt', 'airi', 'hope', 'bad', 'think', 'friend', 'famili', 'north', 'carolina', 'coastand', 'everyon', 'effect', 'ian']
page_id= doc_3685 - page_title: ['what', 'overund', 'bad', 'boy', 'come', 'down']
page_id= doc_2802 - page_title: ['bad', 'idea', 'wind', 'increas']
page_id= doc_1814 - page_title: ['were', 'start', 'get', 'idea', 'bad']
page_id= doc_2079 - page_title: ['myka', 'good', 'mum', 'stay', 'home', 'cuz', 'fsu', 'cancel', 'class', 'tuesday', 'thru', 'friday', 'bad', 'come', 'thank', 'cat', 'you', 'okay', 'bad', 'tri', 'get', 'you', '']
page_id= doc_646 - page_title: ['peak', 'tide', 'bad', 'downtown', 'charleston', 'noon', 'went

## **EVALUATION**
There will be 2 main evaluation components:
1. A baseline with 3 queries and the ground truth files for each query will be given to you, using a subset of documents from the dataset.
        A. Query 1: Landfall in South Carolina
        B. Query 2: Help and recovery during the hurricane disaster
        C. Query 3: Floodings in South Carolina
2. You will be the expert judges, so you will be setting the ground truth for each document and query in a binary way for the test queries that you defined in step 2 at the indexing stage.

In [191]:
def create_df_queries(queries, index):
    rows = []
    for num, query in enumerate(queries):
        doc_ids, scores = search_tf_idf(query, index)
        for i in range(len(scores)):
            rows.append([num+1, doc_ids[i], scores[i]])
        df = pd.DataFrame(rows ,columns=["q_id", "doc_id", "predicted_relevance"])

    return df

In [192]:
queries = ["Landfall in South Carolina", "Help and recovery during the hurricane disaster", "Floodings in South Carolina"]
scores_queries = create_df_queries(queries, index)

In [193]:
our_queries = ['Bad storm', 'Hurricane Disaster', 'evacuate house', 'Florida Shelter', 'Rain fall']

In [194]:
our_index, our_tf, our_df, our_idf = create_index_tfidf(tweets_subset, 25)
df_our = create_df_queries(our_queries, our_index)
df_our

Unnamed: 0,q_id,doc_id,predicted_relevance
0,1,doc_3810,1.766678
1,1,doc_3295,1.200579
2,2,doc_85,1.028241
3,2,doc_3264,1.028241
4,2,doc_1259,0.984353
5,2,doc_1909,0.866192
6,2,doc_3221,0.7625
7,2,doc_468,0.44033
8,3,doc_1259,2.576952
9,4,doc_1259,4.854863


In [195]:
doc_rel = pd.read_csv(r'data/Document Relevance - Full 1.csv', delimiter = ",")
doc_rel.head()

Unnamed: 0,Doc_ID,Bad storm,Hurricane Disaster,evacuate house,Florida Shelter,Rain fall
0,doc_3427,0,0,0,0,0
1,doc_1490,0,0,0,0,0
2,doc_3264,1,1,0,0,0
3,doc_1996,0,0,0,0,0
4,doc_571,1,0,0,0,1


In [196]:
doc_rel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Doc_ID              25 non-null     object
 1   Bad storm           25 non-null     int64 
 2   Hurricane Disaster  25 non-null     int64 
 3   evacuate house      25 non-null     int64 
 4   Florida Shelter     25 non-null     int64 
 5   Rain fall           25 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.3+ KB


In [197]:
docs_q1 = doc_rel[["Doc_ID", "Bad storm"]]
docs_q1 = docs_q1.rename(columns={"Doc_ID": "doc", "Bad storm": "label"})
docs_q1["query_id"] = 0
docs_q2 = doc_rel[["Doc_ID", "Hurricane Disaster"]]
docs_q2 = docs_q2.rename(columns={"Doc_ID": "doc", "Hurricane Disaster": "label"})
docs_q2["query_id"] = 0
docs_q3 = doc_rel[["Doc_ID", "evacuate house"]]
docs_q3 = docs_q1.rename(columns={"Doc_ID": "doc", "evacuate house": "label"})
docs_q3["query_id"] = 0
docs_q4 = doc_rel[["Doc_ID", "Florida Shelter"]]
docs_q4 = docs_q1.rename(columns={"Doc_ID": "doc", "Florida Shelter": "label"})
docs_q4["query_id"] = 0
docs_q5 = doc_rel[["Doc_ID", "Rain fall"]]
docs_q5 = docs_q1.rename(columns={"Doc_ID": "doc", "Rain fall": "label"})
docs_q5["query_id"] = 0
docs = pd.concat([docs_q1, docs_q2, docs_q3, docs_q4, docs_q5])

In [198]:
docs

Unnamed: 0,doc,label,query_id
0,doc_3427,0,0
1,doc_1490,0,0
2,doc_3264,1,0
3,doc_1996,0,0
4,doc_571,1,0
...,...,...,...
20,doc_610,0,0
21,doc_2278,0,0
22,doc_3185,0,0
23,doc_468,0,0


In [199]:
def binary_relevance(row):
    if row['predicted_relevance'] >= 1:
        return 1
    else: return 0

In [200]:
our_scores = pd.merge(docs, df_our,  how='left', left_on=['doc','query_id'], right_on = ['doc_id', 'q_id'])
our_scores = our_scores.drop(["q_id", "doc_id"], axis =1)
our_scores["y_pred"] = scores.apply(lambda row: binary_relevance(row), axis = 1)

In [201]:
our_scores_q1 = our_scores[our_scores["query_id"] == 1]
our_scores_q1 = our_scores_q1.reset_index(drop=True)
our_scores_q2 = our_scores[our_scores["query_id"] == 2]
our_scores_q2 = our_scores_q2.reset_index(drop=True)
our_scores_q3 = our_scores[our_scores["query_id"] == 3]
our_scores_q3 = our_scores_q3.reset_index(drop=True)
our_scores_q4 = our_scores[our_scores["query_id"] == 4]
our_scores_q4 = our_scores_q4.reset_index(drop=True)
our_scores_q5 = our_scores[our_scores["query_id"] == 5]
our_scores_q5 = our_scores_q5.reset_index(drop=True)

In [202]:
# Read datasets
evaluation_gt = pd.read_csv(r'data/evaluation_gt.csv', delimiter = ",")

In [203]:
scores = pd.merge(evaluation_gt, scores_queries,  how='left', left_on=['doc','query_id'], right_on = ['doc_id', 'q_id'])
scores = scores.drop(["q_id", "doc_id"], axis =1)

In [204]:
scores["y_pred"] = scores.apply(lambda row: binary_relevance(row), axis = 1)

In [205]:
scores_q1 = scores[scores["query_id"] == 1]
scores_q1 = scores_q1.reset_index(drop=True)
scores_q2 = scores[scores["query_id"] == 2]
scores_q2 = scores_q2.reset_index(drop=True)
scores_q3 = scores[scores["query_id"] == 3]
scores_q3 = scores_q3.reset_index(drop=True)

In [206]:
print_result = evaluation_gt["label"].unique()
print("The ground truth of our dataset is composed of {} Relevance Levels: {}".format(len(print_result), sorted(print_result)))

The ground truth of our dataset is composed of 2 Relevance Levels: [0, 1]


For the prior evaluation components you must evaluate your algorithm by using different evaluation techniques and only for the second component (your queries) comment in each of them how they differ, and which information gives each of them:
1. Precision@K (P@K)
2. Recall@K (R@K)
3. Average Precision@K (P@K)
4. F1-Score
5. Mean Average Precision (MAP)
6. Mean Reciprocal Rank (MRR)
7. Normalized Discounted Cumulative Gain (NDCG)

### Precision@K 

In [207]:
def precision_at_k(doc_score, y_score, k=10): #binary relevance, predicted relevance, k for a given query
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    precision @k : float

    """
    order = np.argsort(y_score)[::-1] #we get the ranking of the documents accoirding to the predicted score/ use np.argsort and [::1] to obtain the list of indexes of the predicted score sorted in descending order.
    doc_score = np.take(doc_score, order[:k]) # align the binary relevance to the corresponding document / use the indexes of point 1 to sort the actual relevance label of the documents (hint: np.take).
    relevant = sum(doc_score == 1) #get number of relevant documents
    return float(relevant) / k #calculae precision at k, which is the number of relevant documents trieved at k

Query 1: Landfall in South Carolina

In [208]:
print("precision@k query 1: {}".format(precision_at_k (scores_q1["label"], scores_q1["y_pred"], 20)))

precision@k query 1: 0.5


Query 2: Help and recovery during the hurricane disaster

In [209]:
print("precision@k query 2: {}".format(precision_at_k (scores_q2["label"], scores_q2["y_pred"], 20)))

precision@k query 2: 0.5


Query 3: Floodings in South Carolina

In [210]:
print("precision@k query 3: {}".format(precision_at_k (scores_q3["label"], scores_q3["y_pred"], 20)))

precision@k query 3: 0.5


Our proposed query 1: Bad storm

In [211]:
print("precision@k our query 1: {}".format(precision_at_k (our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

precision@k our query 1: 0.0


Our proposed query 2: Hurricane disaster

In [212]:
print("precision@k our query 2: {}".format(precision_at_k (our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

precision@k our query 2: 0.0


Our proposed query 3: evacuate house

In [213]:
print("precision@k our query 3: {}".format(precision_at_k (our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

precision@k our query 3: 0.0


Our proposed query 4: Florida Shelter

In [215]:
print("precision@k our query 4: {}".format(precision_at_k (our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

precision@k our query 4: 0.0


Our proposed query 5: Rain fall

In [214]:
print("precision@k our query 5: {}".format(precision_at_k (our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

precision@k our query 5: 0.0


### Recall@K

In [149]:
def recall_at_k(doc_score, y_score, k=10): #binary relevance, predicted relevance, k for a given query
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    recall @k : float

    """
    total_relevant = sum(doc_score)
    order = np.argsort(y_score)[::-1] #we get the ranking of the documents accoirding to the predicted score/ use np.argsort and [::1] to obtain the list of indexes of the predicted score sorted in descending order.
    doc_score = np.take(doc_score, order[:k]) # align the binary relevance to the corresponding document / use the indexes of point 1 to sort the actual relevance label of the documents (hint: np.take).
    relevant = sum(doc_score == 1) #get number of relevant documents
    return float(relevant)/ total_relevant #total # of relevant docs ?

In [None]:
def recall_at_k(doc_score, y_score, k=10):
    total_relevant = sum(doc_score)
    order = np.argsort(y_score)[::-1]
    doc_score = np.take(doc_score, order[:k])
    relevant = sum(doc_score == 1)
    return float(relevant)/ total_relevant

Query 1: Landfall in South Carolina

In [105]:
print("recall@k query 1: {}".format(recall_at_k (scores_q1["label"], scores_q1["y_pred"], 20)))

recall@k query 1: 1.0


Query 2: Help and recovery during the hurricane disaster

In [150]:
print("recall@k query 2: {}".format(recall_at_k(scores_q2["label"], scores_q2["y_pred"], 20)))

recall@k query 2: 1.0


Query 3: Floodings in South Carolina

In [151]:
print("recall@k query 3: {}".format(recall_at_k(scores_q3["label"], scores_q3["y_pred"], 20)))

recall@k query 3: 1.0


Our proposed query 1: Bad storm

In [None]:
print("recall@k query 1: {}".format(recall_at_k(our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

Our proposed query 2: Hurricane disaster

In [None]:
print("recall@k query 2: {}".format(recall_at_k(our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

Our proposed query 3: evacuate house

In [None]:
print("recall@k query 3: {}".format(recall_at_k(our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

Our proposed query 4: Florida Shelter

In [None]:
print("recall@k query 4: {}".format(recall_at_k(our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

Our proposed query 5: 

In [None]:
print("recall@k query 5: {}".format(recall_at_k(our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

### AveragePrecision@K

In [109]:
def avg_precision_at_k(doc_score, y_score, k=10): #binary relevance, predicted relevance, k for a given query
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    average precision @k : float
    """
    gtp = np.sum(doc_score == 1) #Total number of gt positives
    order = np.argsort(y_score)[::-1] #same as for precision
    doc_score = np.take(doc_score, order[:k]) #same as for precision
    ## if all documents are not relevant
    if gtp == 0:
        return 0
    n_relevant_at_i = 0
    prec_at_i = 0
    for i in range(len(doc_score)):
        if doc_score[i] == 1: #only add the P@k when the doc is relevant
            n_relevant_at_i += 1
            prec_at_i += n_relevant_at_i / (i + 1) #calculate P@K (#docs relevant at k/k)
    return prec_at_i / gtp #return ap

Query 1: Landfall in South Carolina

In [163]:
print("averagePrecision@k query 1: {}".format(avg_precision_at_k(scores_q1["label"], scores_q1["y_pred"], 20)))

averagePrecision@k query 1: 1.0


Query 2: Help and recovery during the hurricane disaster

In [164]:
print("averagePrecision@k query 2: {}".format(avg_precision_at_k(scores_q2["label"], scores_q2["y_pred"],20)))

averagePrecision@k query 2: 1.0


Query 3: Floodings in South Carolina

In [165]:
print("averagePrecision@k query 3: {}".format(avg_precision_at_k(scores_q3["label"], scores_q3["y_pred"], 20)))

averagePrecision@k query 3: 1.0


Our proposed query 1: Bad storm

In [None]:
print("averagePrecision@k our query 1: {}".format(avg_precision_at_k(our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

Our proposed query 2: Hurricane disaster

In [None]:
print("averagePrecision@k our query 2: {}".format(avg_precision_at_k(our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

Our proposed query 3: evacuate house

In [None]:
print("averagePrecision@k our query 3: {}".format(avg_precision_at_k(our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

Our proposed query 4: Florida Shelter

In [None]:
print("averagePrecision@k our query 4: {}".format(avg_precision_at_k(our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

Our proposed query 5: Rain fall

In [None]:
print("averagePrecision@k our query 5: {}".format(avg_precision_at_k(our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

### F1-Score

In [115]:
def f1_score(doc_score, y_score, beta = 1):
    F1 = (1+np.square(beta))/np.square(beta)
    precision = precision_at_k(doc_score, y_score)
    recall = recall_at_k(doc_score,y_score)
    F1 = F1*precision * recall /(precision + recall)
    return F1

Query 1: Landfall in South Carolina

In [170]:
print("F1- Score query 1: {}".format(f1_score (scores_q1["label"], scores_q1["y_pred"], 1)))

F1- Score query 1: 1.0


Query 2: Help and recovery during the hurricane disaster

In [171]:
print("F1- Score query 2: {}".format(f1_score(scores_q2["label"], scores_q2["y_pred"], 1)))

F1- Score query 2: 0.9


Query 3: Floodings in South Carolina

In [172]:
print("F1- Score query 3: {}".format(f1_score(scores_q3["label"], scores_q3["y_pred"], 1)))

F1- Score query 3: 0.9


Our proposed query 1: Bad storm

In [None]:
print("F1 - Score Our query 1: {}".format(f1_score(our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

Our proposed query 2: Hurricane disaster

In [None]:
print("F1 - Score Our query 2: {}".format(f1_score(our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

Our proposed query 3: evacuate house

In [None]:
print("F1 - Score Our query 3: {}".format(f1_score(our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

Our proposed query 4: Florida Shelter

In [None]:
print("F1 - Score Our query 4: {}".format(f1_score(our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

Our proposed query 5: Rain fall

In [None]:
print("F1 - Score Our query 5: {}".format(f1_score(our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

### Mean Average Precision

In [131]:
def map_at_k(search_res, k=10): #receives all the search esults dataframe containing all the queries and the results and relevances
    """
    Parameters
    ----------
    search_res: search results dataset containing:
        query_id: query id.
        doc_id: document id.
        predicted_relevance: relevance predicted through TFIDF.
        doc_score: actual score of the document for the query (ground truth).

    Returns
    -------
    mean average precision @ k : float
    """
    avp = []
    for q in search_res["query_id"].unique():  # loop over all query ids
        curr_data = search_res[search_res["query_id"] == q]  # select data for current query (get a slice of the dataframe keeping only the data related to the current query)
        avp.append(avg_precision_at_k(np.array(curr_data["label"]), 
                   np.array(curr_data["y_pred"]), k))  #append average precision for current query
    return np.sum(avp) / len(avp), avp  # return mean average precision

In [190]:
print(map_at_k(scores,20))

SyntaxError: invalid syntax (1949901118.py, line 1)

In [189]:
print(map_at_k(our_scores, 25))

(0.036937830687830686, [0.036937830687830686]) 25


### Mean Reciprocal Rank

In [119]:
def rr_at_k(doc_score, y_score, k=10):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    Reciprocal Rank for qurrent query
    """

    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order. As before
    doc_score = np.take(doc_score, order[
                             :k])  # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k. As before
    if np.sum(doc_score) == 0:  # if there are not relevant doument return 0
        return 0
    return 1 / (np.argmax(doc_score == 1) + 1)  # hint: to get the position of the first relevant document use "np.argmax" (+1 because the idex starts from 0)


Query 1: Landfall in South Carolina

In [120]:
print("MRR query 1: {}".format(rr_at_k (scores_q1["label"], scores_q1["y_pred"], 20)))

MRR query 1: 1.0


Query 2: Help and recovery during the hurricane disaster

In [121]:
print("MRR query 2: {}".format(rr_at_k(scores_q2["label"], scores_q2["y_pred"], 20)))

MRR query 2: 0.5


Query 3: Floodings in South Carolina

In [122]:
print("MRR query 3: {}".format(rr_at_k(scores_q3["label"], scores_q3["y_pred"], 20)))

MRR query 3: 1.0


Our proposed query 1: Bad storm

In [None]:
print("MRR our query 1: {}".format(rr_at_k(our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

Our proposed query 2: Hurricane disaster

In [None]:
print("MRR our query 2: {}".format(rr_at_k(our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

Our proposed query 3: evacuate house

In [None]:
print("MRR our query 3: {}".format(rr_at_k(our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

Our proposed query 4: Florida Shelter

In [None]:
print("MRR our query 4: {}".format(rr_at_k(our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

Our proposed query 5: Rain fall

In [None]:
print("MRR our query 5: {}".format(rr_at_k(our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

### NDCG

In [125]:
def dcg_at_k(doc_score, y_score, k=10): #doc_scire are the labels (ground truth) and y_score are the system scores
    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order.
    doc_score = np.take(doc_score, order[:k])  # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k.
    gain = 2 ** doc_score - 1  # First we calculate the upper part of the formula which is the CG (use formula 7 above) (notice it is based on the ground truth relevance)
    discounts = np.log2(np.arange(len(doc_score)) + 2)  # Compute denominator (np.arrange creates a list of numbers betweeen 0 and len(doc_score)-1), then the + 2 addresses the fact that the numbers start from 0
    return np.sum(gain / discounts)  #return dcg@k


def ndcg_at_k(doc_score, y_score, k=10):
    dcg_max = dcg_at_k(doc_score, doc_score, k) #ideal dcg
    #print(dcg_max)
    if not dcg_max:
        return 0
    return np.round(dcg_at_k(doc_score, y_score, k) / dcg_max, 4)

Query 1: Landfall in South Carolina

In [126]:
print("NDCG query 1: {}".format(ndcg_at_k (scores_q1["label"], scores_q1["y_pred"], 20)))

NDCG query 1: 1.0


Query 2: Help and recovery during the hurricane disaster

In [127]:
print("NDCG query 2: {}".format(ndcg_at_k(scores_q2["label"], scores_q2["y_pred"], 20)))

NDCG query 2: 0.7799


Query 3: Floodings in South Carolina

In [128]:
print("NDGC query 3: {}".format(ndcg_at_k(scores_q3["label"], scores_q3["y_pred"], 20)))

NDGC query 3: 0.8611


Our proposed query 1: Bad storm

In [None]:
print("NDGC our query 1: {}".format(ndcg_at_k(our_scores_q1["label"], our_scores_q1["y_pred"], 25)))

Our proposed query 2: Hurricane disaster

In [None]:
print("NDGC our query 2: {}".format(ndcg_at_k(our_scores_q2["label"], our_scores_q2["y_pred"], 25)))

Our proposed query 3: evacuate house

In [None]:
print("NDGC our query 3: {}".format(ndcg_at_k(our_scores_q3["label"], our_scores_q3["y_pred"], 25)))

Our proposed query 4: Florida Shelter

In [None]:
print("NDGC our query 4: {}".format(ndcg_at_k(our_scores_q4["label"], our_scores_q4["y_pred"], 25)))

Our proposed query 5: Rain fall

In [None]:
print("NDGC our query 5: {}".format(ndcg_at_k(our_scores_q5["label"], our_scores_q5["y_pred"], 25)))

Choose one vector representation, TF-IDF or word2vec, and represent the tweets in a two-dimensional scatter plot through the T-SNE (T-distributed Stochastic Neighbor Embedding) algorithm. To do so, you may need first to represent the word as a vector, and then the tweet, i.e., resulted as the average value over the words involved. Any other option rather than T-SNE may be used, but needs to be justified.

**HINT**: You don’t have to know all the theoretical details used in T-SNE, just use the proper library and generate the output and play with it.

Also, you can choose to perform an alternative method to generate a 2-dimensional representation for the word embeddings (like PCA).


<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=a8d1a678-5677-4c27-b5af-590183e3d260' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>