# IRWA-2022-part-1

# Introduction
This notebook covers the first part of the Information Retrieval and Web Analysis project. In this part I load data from two files and preprocess it by removing stopwords, stemming, and more.

At the end of this part, a user can create a dictionary with the doc_id as key and all the information of the tweet as value by calling load_data(path_tweets, path_docs_tweet).

However, it is more useful to store only the main information of each tweet as the value (in a specific format), which is done by calling build_map() after loading.

The final step is to preprocess the text. For that I created another function, build_prep_map(), which after loading builds a dictionary with doc_id as key and the preprocessed information about the tweet (only full text, username and date) as value. I keep the @ in front of usernames and the # in front of hashtags so that they can easily be told apart from other words (hashtags are preprocessed anyway).

I also include the username and date in the preprocessed text because I think they matter when searching for a tweet (in later steps of the project).

At the end there are some results of the preprocessing function.

## 0) Google Colab

I mount Google Drive to use it in Google Colab; this is not necessary when running from the git repository.

In [31]:
#from google.colab import drive
#drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [32]:
#%cd /content/drive/MyDrive/IRWA/Project/IRWA-2022-part-1
!pwd
!ls

/content/drive/MyDrive/IRWA/Project/IRWA-2022-part-1
/content/drive/MyDrive/IRWA/Project/IRWA-2022-part-1
data  IR_project_part_1.pdf  IRWA-2022-u210426-part-1.ipynb


## 1) Import modules

In [None]:
import datetime
import time

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import regex as re
import numpy as np
import json

After importing all modules, I download the stop words from the nltk module.

In [34]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2) Load data and build map
After this setup, I create a function load_data(path_tweets, path_docs_tweet), where path_tweets is the path to the file that stores the tweets in JSON format (one tweet per line) and path_docs_tweet is the path to another file that contains each doc_id and the corresponding tweet_id.

In [35]:
def load_data(path_tweets, path_docs_tweet):
    """
    Argument:
    path_tweets - path of the tweets file
    path_docs_tweet - path of the 'doc_id tweet_id' file 

    Returns:
    doc_tweet - dictionary of doc_id as keys and full tweets as values
    """

    id_tweet = {}
    doc_tweet = {}

    #open the first file, which stores one JSON tweet per line
    with open(path_tweets) as tp:
        for line in tp:
            #json.loads() turns each line into a dictionary for a single tweet
            tweet = json.loads(line)
            #store the tweet in a dictionary keyed by tweet_id
            id_tweet[tweet['id']] = tweet

    #open the second file, which stores 'doc_id tweet_id' pairs
    with open(path_docs_tweet) as dp:
        for line in dp:
            line = line.split()
            #look up the tweet by tweet_id and store it in a new dict under its doc_id
            doc_tweet[line[0]] = id_tweet[int(line[1])]

    #return the new dictionary
    return doc_tweet
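
To make the two-step join easier to follow, here is a minimal sketch of the same idea on in-memory data. The sample tweets, the ids and the tab separator are assumptions made for illustration; they are not taken from the real files.

```python
import io
import json

# Hypothetical samples standing in for the two real files
tweets_file = io.StringIO(
    '{"id": 111, "full_text": "first tweet"}\n'
    '{"id": 222, "full_text": "second tweet"}\n'
)
docs_file = io.StringIO("doc_1\t222\ndoc_2\t111\n")  # assumed separator: a tab

# Step 1: index every tweet by its numeric tweet id
id_tweet = {}
for line in tweets_file:
    tweet = json.loads(line)
    id_tweet[tweet["id"]] = tweet

# Step 2: re-key the tweets by doc_id using the mapping file
doc_tweet = {}
for line in docs_file:
    doc_id, tweet_id = line.split()
    doc_tweet[doc_id] = id_tweet[int(tweet_id)]

print(doc_tweet["doc_1"]["full_text"])  # second tweet
```

The intermediate dictionary keyed by tweet id is what lets the second loop resolve each doc_id with a single lookup instead of scanning all tweets.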

After loading the data, we extract the information from each tweet to show it in the format:

    Tweet | Username | Date | Hashtags | Likes | Retweets | Url

Where:
- by tweet we mean the full text,
- by username, the screen_name of the user (which identifies the user),
- by date, the format 'name_day number_day name_month year' (e.g. Friday 30 September 2022),
- by hashtags, a string of the hashtags inside the text, each prefixed with #,
- by likes, the number of likes,
- by retweets, the number of retweets,
- by url, the tweet link in the format 'https://twitter.com/username/status/tweet_id'.
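
As an illustration of the date conversion, the sketch below parses a timestamp and prints it in the target format. The sample value is hypothetical, and the sketch uses explicit `%H:%M:%S` directives rather than a locale-dependent one, which is an assumption about the input layout rather than a statement about the dataset.

```python
import datetime

# Hypothetical sample value in the layout assumed for 'created_at'
sample_created_at = "Fri Sep 30 18:39:08 +0000 2022"

# Parse the timestamp with explicit directives
parsed = datetime.datetime.strptime(sample_created_at, "%a %b %d %H:%M:%S %z %Y")

# Format it as 'name_day number_day name_month year'
readable = parsed.strftime("%A %d %B %Y")
print(readable)
```

Day and month names produced by `%A` and `%B` follow the active locale, so the English names are only guaranteed under the default C locale or an English one.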

In [36]:
#get the text from a dictionary containing all the information of a tweet
def get_text(tweet):
    try:
        return tweet['full_text']
    except KeyError:
        return ' '

#get the username from a dictionary containing all the information of a tweet
def get_username(tweet):
    try:
        #insert the @ to differentiate the username from other words
        return '@' + tweet['user']['screen_name']
    except KeyError:
        return ' '


#get the date from a dictionary containing all the information of a tweet
def get_date(tweet):
    try:
        #change the date to an easily readable format
        created_at = datetime.datetime.strptime(tweet['created_at'], "%a %b %d %X %z %Y")
        return created_at.strftime('%A %d %B %Y')
    except KeyError:
        return ' '


#get the hashtags from a dictionary containing all the information of a tweet
def get_hashtags(tweet):
    try:
        hashtags = []
        #read all the hashtags, which are stored as dictionaries in the entities of the tweet
        for tag in tweet['entities']['hashtags']:
            hashtags.append('#' + tag['text'])
        #return hashtags as a single string
        return ' '.join(hashtags)
    except KeyError:
        return ' '


#get the number of likes from a dictionary containing all the information of a tweet
def get_likes(tweet):
    try:
        return str(tweet['favorite_count'])
    except KeyError:
        return ' '


#get the number of retweets from a dictionary containing all the information of a tweet
def get_retweets(tweet):
    try:
        return str(tweet['retweet_count'])
    except KeyError:
        return ' '


#get the url from a dictionary containing all the information of a tweet
def get_url(tweet):
    try:
        #create the url from the username and the id of the tweet
        return 'https://twitter.com/' + tweet['user']['screen_name'] + '/status/' + str(tweet['id'])
    except KeyError:
        return ' '

In [37]:
#create the map of 'doc_id : tweet_as_string'
def build_map(dict_docs_tweet):
    """
    Argument:
    dict_docs_tweet -- dictionary of doc_id as keys and all tweets as values

    Returns:
    doc_map - new dictionary of doc_id as keys and main information of tweets as string as values
    """

    doc_map = {}
    for doc, tweet in dict_docs_tweet.items():
        #insert all elements in a list
        items_list = [
            get_text(tweet), get_username(tweet), get_date(tweet),
            get_hashtags(tweet), get_likes(tweet), get_retweets(tweet),
            get_url(tweet),
        ]
        #save the list as a string separated by |
        doc_map[doc] = " | ".join(items_list)

    return doc_map

## 3) See some results

I initialize the paths and then load the data using the function created before:

In [38]:
#save paths
data_path = './../data/'
TWEETS_PATH = data_path + 'tw_hurricane_data.json'
DOCS_PATH = data_path + 'tweet_document_ids_map.csv'

#load data
doc_to_tweet = load_data(TWEETS_PATH, DOCS_PATH)

#print the number of documents in the data
print("Total number of docs of tweets: {}".format(len(doc_to_tweet)))

Total number of docs of tweets: 4000


Next I build the map of the original data that will be shown later, and print some of its items:

In [39]:
#build the map with tweet as string
docs_map = build_map(doc_to_tweet)

#see some results of original tweets
for index in range(2):
    doc = list(docs_map.keys())[index]
    print("Original {} text line:\n    {} \n".format(doc, docs_map[doc]))

Original doc_1 text line:
    So this will keep spinning over us until 7 pm…go away already. #HurricaneIan https://t.co/VROTxNS9rz | @suzjdean | Friday 30 September 2022 | #HurricaneIan | 0 | 0 | https://twitter.com/suzjdean/status/1575918182698979328 

Original doc_2 text line:
    Our hearts go out to all those affected by #HurricaneIan. We wish everyone on the roads currently braving the conditions safe travels. 💙 | @lytx | Friday 30 September 2022 | #HurricaneIan | 0 | 0 | https://twitter.com/lytx/status/1575918151862304768 



## 4) Preprocess the text
After creating the map, we create another map with all the processed text. To do that I create another function that takes a string and transforms it by:
- Lowercasing all the text
- Preprocessing the hashtags
- Removing links
- Removing punctuation marks
- Tokenizing the text to get a list of terms
- Eliminating the stopwords
- Performing stemming

I chose not to remove # and @ so that hashtags and usernames remain distinguishable from other terms. The hashtag text itself is still preprocessed.

In [40]:
# preprocess function
def preprocess(str_line):
    """
    Argument:
    str_line -- string (text) to be preprocessed

    Returns:
    str_line - a list of tokens corresponding to the input text after the preprocessing
    """
    
    stemmer = PorterStemmer() #save the stemmer

    stop_words = set(stopwords.words("english")) #main language english

    str_line = str_line.lower() # case insensitive

    #find hashtags in the text and preprocess them
    for hashtag in re.findall(r'#([^\s]+)', str_line):
        str_line = str_line.replace(hashtag, ''.join(preprocess(hashtag)))  # Preprocess the hashtags

    #(^|\s) also matches a link at the very start of the text, which (\s) alone would miss
    str_line = re.sub(r'(^|\s)http[^\s]+', ' ', str_line) # Removing links
    str_line = re.sub(r'[^\w\s#@]+', ' ', str_line) # Removing punctuation marks
    str_line = str_line.split()  # Tokenize the text to get a list of terms
    str_line = [x for x in str_line if x not in stop_words]  # Eliminate the stopwords
    str_line = [stemmer.stem(word) for word in str_line]  # Perform stemming
    return str_line
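
To see the link and punctuation steps in isolation, the following sketch applies two regular expressions to a hypothetical, already lowercased sample. It uses the standard `re` module and a simplified link pattern, and it leaves out stopword removal and stemming so that it runs without NLTK; it is an illustration, not the function above.

```python
import re

# Hypothetical sample text, already lowercased
sample = "go away already. #hurricaneian https://t.co/abc123 thanks @someuser!"

# Drop links: 'http://' or 'https://' followed by any run of non-whitespace characters
no_links = re.sub(r"https?://\S+", " ", sample)

# Replace punctuation with spaces, keeping word characters, whitespace, # and @
no_punct = re.sub(r"[^\w\s#@]+", " ", no_links)

# Tokenize on whitespace
tokens = no_punct.split()
print(tokens)
# ['go', 'away', 'already', '#hurricaneian', 'thanks', '@someuser']
```

Because # and @ are excluded from the punctuation class, hashtags and mentions survive as single tokens with their prefix intact.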

Now I can build a map with the preprocessed text (I chose the tweet text, the username and the date for the search):

In [41]:
def build_prep_map(dict_docs_tweet):
    """
    Argument:
    dict_docs_tweet -- dictionary of doc_id as keys and all tweets as values

    Returns:
    prep_doc_map - new dictionary of doc_id as keys and, as values, a single list of tokens from the preprocessed text, username and date
    """

    prep_doc_map = {}
    for doc in dict_docs_tweet.keys():
        tweet = dict_docs_tweet[doc]
        #preprocess every item and store the concatenated token lists in the dictionary
        prep_doc_map[doc] = preprocess(get_text(tweet)) + preprocess(get_username(tweet)) + preprocess(get_date(tweet))
    return prep_doc_map

In [42]:
start_time = time.time()
prep_docs_map = build_prep_map(doc_to_tweet)
print("Total time to preprocess tweets: {} seconds".format(np.round(time.time() - start_time, 2)))

Total time to preprocess tweets: 7.67 seconds


## 5) Final result of preprocessing
Now all the text is ready to be inserted in the index, but first we can look at some results:

In [43]:
for index in range(5):
    doc = list(prep_docs_map.keys())[index]
    print("Preprocess {} text line:\n   {}\n".format(doc, prep_docs_map[doc]))

Preprocess doc_1 text line:
   ['keep', 'spin', 'us', '7', 'pm', 'go', 'away', 'alreadi', '#hurricaneian', '@suzjdean', 'friday', '30', 'septemb', '2022']

Preprocess doc_2 text line:
   ['heart', 'go', 'affect', '#hurricaneian', 'wish', 'everyon', 'road', 'current', 'brave', 'condit', 'safe', 'travel', '@lytx', 'friday', '30', 'septemb', '2022']

Preprocess doc_3 text line:
   ['kissimme', 'neighborhood', 'michigan', 'ave', '#hurricaneian', '@cheathwftv', 'friday', '30', 'septemb', '2022']

Preprocess doc_4 text line:
   ['one', 'tree', 'backyard', 'scare', 'poltergeist', 'tree', 'storm', 'windi', 'like', '#scwx', '#hurricaneian', '@spiralgypsi', 'friday', '30', 'septemb', '2022']

Preprocess doc_5 text line:
   ['@ashleyruizwx', '@stephan89441722', '@lilmizzheidi', '@mr__sniffl', '@winknew', '@dylanfedericowx', '@julianamwx', '@sydneypers', '@nicolegabetv', 'pray', 'everyon', 'affect', '#hurricaneian', 'associ', 'winknew', 'sympathi', 'anim', 'abus', 'liar', 'condon', '@blondie610', 

In [44]:
#end of the part-1