# PROJECT IRWA 2022

## Part 1: Text Processing

1. Take into account that for future queries, the final output must return (when
present) the following information for each of the selected documents: Tweet |
Username | Date | Hashtags | Likes | Retweets | Url (here the “Url” means the
tweet link).

2. Think about how to handle hashtags in your pre-processing steps (e.g.,
removing the “#” from the word), since it may be useful to index them as separate terms
in the inverted index.
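
One possible way to treat hashtags as separate terms is sketched below. The helper name `split_hashtags` and the sample sentence are hypothetical, and the letter-only cleaning is an assumption that only suits English text. This is an illustration of the idea, not the notebook's own method.

```python
import re

def split_hashtags(text):
    """Return (plain_tokens, hashtag_terms) for a raw tweet string."""
    plain_tokens = []
    hashtag_terms = []
    for token in text.lower().split():
        # Keep only the letters a-z; this also drops the leading "#".
        cleaned = re.sub(r"[^a-z]+", "", token)
        if not cleaned:
            continue  # the token held no letters at all
        if token.startswith("#"):
            hashtag_terms.append(cleaned)
        else:
            plain_tokens.append(cleaned)
    return plain_tokens, hashtag_terms

# Hypothetical sample text, not taken from the dataset.
plain, tags = split_hashtags("Stay safe during #HurricaneIan, #Florida!")
print(plain)  # ['stay', 'safe', 'during']
print(tags)   # ['hurricaneian', 'florida']
```

Keeping the two lists apart leaves the choice open: the hashtag terms can be merged into the same index as ordinary terms or stored under a separate field.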

Suggested library for stemming and stopword removal: nltk

In [1]:
#We do all the imports
import nltk
nltk.download('stopwords') #Download the list of stopwords

from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import re #Library used to remove certain symbols / characters from a text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
!pwd
!ls

/work
data  deepnote_exports


In [3]:
# you can use pandas to explore your files
import pandas as pd
import json

# Read datasets
tweet_document_ids_map = pd.read_csv(r'data/tweet_document_ids_map.csv', header = None, delimiter = "\t")

In [4]:
tweet_document_ids_map.head(5)

| | 0 | 1 |
|---|---|---|
| 0 | doc_1 | 1575918182698979328 |
| 1 | doc_2 | 1575918151862304768 |
| 2 | doc_3 | 1575918140839673873 |
| 3 | doc_4 | 1575918135009738752 |
| 4 | doc_5 | 1575918119251419136 |


In [5]:
#Create a dictionary that maps each tweetID (column 1) to its docID (column 0)
doc_ID = dict(zip(tweet_document_ids_map[1], tweet_document_ids_map[0]))

In [6]:
#We open the json file and extract the tweets in it
with open("data/tw_hurricane_data.json", "r") as json_file:
    raw_tweets = [json.loads(tweet) for tweet in json_file]

In [7]:
tweets = {} #We create a dictionary to save all the needed tweets' information

for tweet in raw_tweets: #Iterate through all tweets in the json file
    dict_tweet = {} #Create a dictionary to save each individual tweet's information
    dict_tweet['Tweet'] = tweet['full_text'] #Store the text of the tweet
    dict_tweet['Date'] = tweet['created_at'] #Store the date of the tweet creation
    #Store the text of every hashtag entity found in the tweet
    dict_tweet['Hashtags'] = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
    dict_tweet['Username'] = tweet['user']['screen_name'] #Store the username of the "writer"
    dict_tweet['Likes'] = tweet['favorite_count'] #Store the likes count of the tweet
    dict_tweet['Retweets'] = tweet['retweet_count'] #Store the retweets count of the tweet

    #We "create" the URL of each tweet and store it
    #https://twitter.com/screen_name/status/tweet_id
    dict_tweet['Url'] = 'https://twitter.com/'+dict_tweet['Username']+'/status/'+tweet['id_str']

    dict_tweet['Doc_ID'] = doc_ID[tweet['id']]
    #add tweet to dictionary tweets with the id as the tweet key
    tweets[tweet['id']] = dict_tweet


In [8]:
#Print an example of a stored tweet information entry from the dictionary
tweets[list(tweets.keys())[2570]]

{'Tweet': '@Next_Gen_X $kerri0922 would help with #HurricaneIan clean up. https://t.co/SuDL5LStQu',
 'Date': 'Fri Sep 30 15:46:52 +0000 2022',
 'Hashtags': ['HurricaneIan'],
 'Username': 'NewKerristartin',
 'Likes': 1,
 'Retweets': 0,
 'Url': 'https://twitter.com/NewKerristartin/status/1575874831190413312',
 'Doc_ID': 'doc_2571'}
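
A stored entry like the one above could be rendered in the required `Tweet | Username | Date | Hashtags | Likes | Retweets | Url` layout roughly as follows. The helper `format_result` and the constant `FIELDS` are hypothetical names introduced for this sketch. Leaving a missing field empty is one assumed reading of "when present".

```python
# Output fields, in the order requested by the assignment text.
FIELDS = ["Tweet", "Username", "Date", "Hashtags", "Likes", "Retweets", "Url"]

def format_result(tweet_info):
    """Join the requested fields of one tweet dictionary with ' | '."""
    parts = []
    for field in FIELDS:
        value = tweet_info.get(field, "")  # a missing field is left empty
        if isinstance(value, list):
            value = ", ".join(value)  # e.g. the list of hashtags
        parts.append(str(value))
    return " | ".join(parts)

# Entry copied from the example output above.
example = {
    "Tweet": "@Next_Gen_X $kerri0922 would help with #HurricaneIan clean up. https://t.co/SuDL5LStQu",
    "Date": "Fri Sep 30 15:46:52 +0000 2022",
    "Hashtags": ["HurricaneIan"],
    "Username": "NewKerristartin",
    "Likes": 1,
    "Retweets": 0,
    "Url": "https://twitter.com/NewKerristartin/status/1575874831190413312",
    "Doc_ID": "doc_2571",
}
print(format_result(example))
```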

In [9]:
#Given a line of text, this function returns the list of terms it contains after
#lowercasing, tokenizing, removing stopwords, mentions, links and non-letter
#characters, and stemming.
def build_terms(line):
    """
    Preprocess the tweet text: lowercase, tokenize, remove stop words, mentions,
    links and non-letter characters, and stem the remaining tokens.

    Argument:
    line -- string (text) to be preprocessed

    Returns:
    line -- a list of tokens corresponding to the input text after the preprocessing
    """

    stemmer = PorterStemmer()

    stop_words = set(stopwords.words("english"))
    line = line.lower()  # Convert to lowercase
    line = line.split()  # Tokenize the text to get a list of terms
    line = [x for x in line if x not in stop_words]  # Eliminate the stopwords
    line = [x for x in line if not x.startswith(("@", "http://", "https://"))]  # Eliminate mentions and links
    line = [re.sub('[^a-z]+', '', x) for x in line]  # Keep only a-z (the text is English, so accents are not handled); this also strips "#"
    line = [x for x in line if x]  # Drop tokens left empty by the previous step
    line = [stemmer.stem(word) for word in line]  # Perform stemming
    return line

In [10]:
#We apply the function build_terms(line) to the text section of each tweet
for tweet in tweets.keys():
    tweets[tweet]['Tweet'] = build_terms(tweets[tweet]['Tweet'])

In [11]:
#Print the same example as above to see the result of the text section of the tweet after
#having applied the function build_terms(line)
tweets[list(tweets.keys())[2570]]   

{'Tweet': ['kerri', 'would', 'help', 'hurricaneian', 'clean', 'up'],
 'Date': 'Fri Sep 30 15:46:52 +0000 2022',
 'Hashtags': ['HurricaneIan'],
 'Username': 'NewKerristartin',
 'Likes': 1,
 'Retweets': 0,
 'Url': 'https://twitter.com/NewKerristartin/status/1575874831190413312',
 'Doc_ID': 'doc_2571'}
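
In the output above, `'up'` survives. That is probably because the token was still `'up.'` when the stopwords were filtered, so it did not match the stop word list. A minimal sketch of an alternative ordering, which cleans each token first and filters stopwords afterwards, is shown below. `STOP_WORDS` is a tiny hypothetical stand-in for the nltk list so the example runs without any download. `clean_then_filter` is also a hypothetical name, and stemming is left out for brevity.

```python
import re

# Hypothetical stand-in for nltk's English stop word list (assumption:
# the real list would be used in the notebook).
STOP_WORDS = {"with", "up", "the", "a", "to"}

def clean_then_filter(text):
    """Strip non-letters from each token first, then drop stopwords."""
    tokens = []
    for token in text.lower().split():
        # Skip mentions and links before any cleaning.
        if token.startswith(("@", "http://", "https://")):
            continue
        token = re.sub(r"[^a-z]+", "", token)  # 'up.' becomes 'up'
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

text = "@Next_Gen_X $kerri0922 would help with #HurricaneIan clean up. https://t.co/SuDL5LStQu"
print(clean_then_filter(text))
# ['kerri', 'would', 'help', 'hurricaneian', 'clean']
```

Whether trailing stopwords such as `'up'` should be removed is a design choice. The sketch only shows how the order of the steps changes the result.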
