# Team Members
1. Aaron Khoo
2. Calvin Yusnoveri 
3. Amarjyot Kaur Narula
4. Joseph Chng 1003811

# Task Description

COVID-19 has disrupted daily routines and changed norms that were taken for granted before the pandemic, so quantifying its impact on a global scale is an important task. One such change is the surge in usage of social media platforms such as Twitter and YouTube, where people who cannot move freely share their thoughts. Twitter in particular lets users post their thoughts succinctly and add hashtags or mentions to increase a tweet's exposure on the platform. Our task is therefore to predict the number of retweets a COVID-19-related tweet will receive, using the TweetsCOV19 dataset.

# Dataset Description

The dataset used for this project comes from the COVID-19 Retweet Prediction Challenge. For this prediction model, we used the Part 2 dataset, available at https://data.gesis.org/tweetscov19/#dataset. It consists of COVID-19-related tweets from May 2020.

Each tweet in the dataset has the following features:

1. Tweet Id: Long. Unique ID of the tweet.
2. Username: String. Username of the user who published the tweet, encrypted for privacy.
3. Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ). Date and time of the tweet.
4. #Followers: Integer. Number of followers of the Twitter user who posted the tweet.
5. #Friends: Integer. Number of friends of the Twitter user who posted the tweet.
6. #Retweets: Integer. Number of retweets the tweet has received. This is the label for this project.
7. #Favorites: Integer. Number of favorites the tweet has received.
8. Entities: String. For each entity, the original text, the annotated entity and the score produced by the FEL library are aggregated in the form "original_text:annotated_entity:score;". The fields of an entity are separated by the char ":" and entities are separated from one another by the char ";". A tweet with no entities is stored as "null;".
9. Sentiment: String. SentiStrength produces a positive score (1 to 5) and a negative score (-1 to -5). The two scores are separated by the whitespace char " ", with the positive score first (e.g. "2 -1").
10. Mentions: String. Mentions in the tweet, concatenated with the whitespace char " ". If there is no mention, it is stored as "null;".
11. Hashtags: String. Hashtags in the tweet, concatenated with the whitespace char " ". If there is no hashtag, it is stored as "null;".
12. URLs: String. URLs in the tweet, concatenated with ":-: ". If there is no URL, it is stored as "null;".
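The field formats above can be decoded with a few small helpers. The following is a minimal sketch, not the project's actual preprocessing code: the function names are hypothetical, the `strptime` pattern is an assumed Python equivalent of the Java-style timestamp format, and it assumes an English locale for day and month names.

```python
from datetime import datetime


def parse_timestamp(raw):
    """Parse a timestamp such as 'Thu Apr 30 22:00:24 +0000 2020'."""
    # Assumed Python equivalent of "EEE MMM dd HH:mm:ss Z yyyy".
    return datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")


def parse_sentiment(raw):
    """Split '2 -1' into (positive, negative) integer scores."""
    positive, negative = raw.split(" ")
    return int(positive), int(negative)


def parse_tokens(raw, sep=" "):
    """Split a Mentions/Hashtags/URLs cell; 'null;' means no values."""
    if raw == "null;":
        return []
    return [tok.strip() for tok in raw.split(sep) if tok.strip()]


def parse_entities(raw):
    """Turn 'text:entity:score;...' into (text, entity, score) tuples."""
    if raw == "null;":
        return []
    entities = []
    for chunk in raw.split(";"):
        if not chunk:
            continue  # the trailing ';' leaves an empty chunk
        # rsplit keeps a ':' inside the original text from breaking the split
        text, entity, score = chunk.rsplit(":", 2)
        entities.append((text, entity, float(score)))
    return entities


print(parse_entities("i hate u:I_Hate_U:-1.8786140035817729;"))
print(parse_tokens("Opinion Next2blowafrica thoughts"))
```

For the URLs column, the same helper can be called with `sep=":-:"`.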

In [7]:
# imports

import logging
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from gensim.models import Word2Vec



In [5]:
header = [
    "Tweet Id", 
    "Username", 
    "Timestamp", 
    "#Followers",
    "#Friends",
    "#Retweets",
    "#Favorites",
    "Entities",
    "Sentiment",
    "Mentions",
    "Hashtags",
    "URLs"]

data = pd.read_csv("./data/TweetsCOV19_052020.tsv.gz", compression='gzip', names=header, sep='\t', quotechar='"')
data.head(5)

Unnamed: 0,Tweet Id,Username,Timestamp,#Followers,#Friends,#Retweets,#Favorites,Entities,Sentiment,Mentions,Hashtags,URLs
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;


# Preprocessing

To train our prediction model, we preprocessed several of the features available in the dataset. These transformations turn the raw features into inputs that the model can relate to the final retweet count more precisely.

Clean Data (Final structure/form of data before it is fed into the model):
1. Username: String. Encrypted for privacy issues. - NOT USED
2. #Followers: Integer.
3. #Friends: Integer.
4. #Retweets: Integer.
5. #Favorites: Integer.
6. Hashtags Embedding: (25, 1) Vector
7. Mentions Embedding: (25, 1) Vector

## Hashtags & Mentions

Both Hashtags and Mentions are lists of Strings separated by whitespace. To create tractable input for the model, embeddings of size `(25, 1)` are created for both.

In [13]:
hashtags_embeddings = Word2Vec.load('./data/hashtag_embeddings')
mentions_embeddings = Word2Vec.load('./data/mention_embeddings')

hashtags_vocab = hashtags_embeddings.wv.index_to_key
mentions_vocab = mentions_embeddings.wv.index_to_key

print(hashtags_vocab[:5]) # example of hashtags key
print(mentions_vocab[:5]) # example of mentions key

['COVID19', 'coronavirus', 'Covid_19', 'covid19', 'May']
['realDonaldTrump', 'PMOIndia', 'narendramodi', 'jaketapper', 'YouTube']


In [21]:
hashtags_example = 'COVID19'
mentions_example = 'realDonaldTrump'

print(f"{hashtags_example} -> {hashtags_embeddings.wv[hashtags_example]}")
print(f"{mentions_example} -> {mentions_embeddings.wv[mentions_example]}")

COVID19 -> [ 0.10622272  0.26996937 -0.46450084  0.10561462 -0.5595082   0.26207525
  0.28835535  0.80339587  0.30626374 -0.13036335  0.8120623  -0.46314418
  0.20126966 -0.9723947  -1.0051426  -0.04809839  0.4593365   0.09532893
 -0.21894015 -0.23557915  0.42107382  0.4622469   0.53460604 -0.589559
  0.6296402 ]
realDonaldTrump -> [ 0.5991303  -0.10410535  0.23690729 -0.23115875 -0.961905   -0.11418784
  0.12405131  0.4196795  -1.493182   -0.20270342  1.2276924  -1.3593616
 -0.19556278  0.27365074  0.32451993  1.9415929  -0.20647514 -0.17526582
 -0.69910485 -1.6436449   1.3161302  -0.17269903 -0.5424232   1.0386076
  1.062889  ]


These embeddings are trained for 5 epochs and only consider String symbols that occur at least 200 times to be relevant. This is done by passing the argument `min_count=200` when training.

Using these embeddings, the Hashtags and Mentions columns are iterated over and each cell is converted into a vector of size `(25, 1)` according to these rules:
- A Hashtags/Mentions cell that contains `null` is mapped to a zero vector of size 25.
- String symbols that occur fewer than 200 times (and are hence not in the vocab) are treated as null.
- For a Hashtags/Mentions cell that contains multiple String symbols, their embedding vectors are summed.
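The three rules above can be sketched as a single function. This is an illustrative sketch rather than the project's code: `embed_cell` is a hypothetical name, and `toy_vectors` is a made-up dictionary standing in for `hashtags_embeddings.wv`, which supports the same `in` and `[]` lookups.

```python
import numpy as np

EMBEDDING_SIZE = 25


def embed_cell(cell, vectors, size=EMBEDDING_SIZE):
    """Sum the vectors of all in-vocabulary symbols in one cell."""
    total = np.zeros(size, dtype=np.float32)
    if cell == "null;":
        return total  # rule 1: null maps to the zero vector
    for symbol in cell.split():
        if symbol in vectors:  # rule 2: rare symbols contribute nothing
            total += vectors[symbol]  # rule 3: vectors are summed
    return total


# Toy stand-in for the trained embeddings (values are invented).
toy_vectors = {
    "COVID19": np.full(EMBEDDING_SIZE, 0.5, dtype=np.float32),
    "coronavirus": np.full(EMBEDDING_SIZE, 0.25, dtype=np.float32),
}

vector = embed_cell("COVID19 coronavirus RareTag", toy_vectors)
print(vector.shape)
```

The result has shape `(25,)`; calling `vector.reshape(-1, 1)` gives the `(25, 1)` column form if the model expects it.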

### Effectiveness of Mentions & Hashtags and their Embeddings

Before creating these embeddings, a quick exploration was done to check how relevant Mentions and Hashtags are in predicting the Retweet score.

The initial assumption was that certain Hashtags or Mentions would correlate with a higher Retweet score. However, a quick look at the data suggests little correlation, as most tweets with a high Retweet score have `null` Mentions and Hashtags.

In [19]:
sorted_by_retweets = data.sort_values(by='#Retweets', ascending=False)
sorted_by_retweets.head(10) # observe that most Mentions and Hashtags are null

Unnamed: 0,Tweet Id,Username,Timestamp,#Followers,#Friends,#Retweets,#Favorites,Entities,Sentiment,Mentions,Hashtags,URLs
1637862,1265465820995411973,0d4d9b3135ab4271ea36f4ebf8e9eae9,Wed May 27 02:12:17 +0000 2020,3317,3524,257467,845579,tear gas:Tear_gas:-1.688018296396458;,1 -1,null;,null;,null;
1208647,1266553959973445639,c9378a990def5939fb179e034a0d402e,Sat May 30 02:16:10 +0000 2020,18661,0,135818,363852,null;,1 -3,null;,null;,null;
1328169,1258750892448387074,1921c65230cd080c689dc82ea62e6e74,Fri May 08 13:29:33 +0000 2020,83320,1753,88667,224288,mike pence:Mike_Pence:-0.6712149436851893;ppe:...,1 -1,null;,null;,null;
1736035,1263579286201446400,7c4529bc4da01f288b95cd3876b4da47,Thu May 21 21:15:52 +0000 2020,451,359,82495,225014,null;,1 -1,null;,null;,null;
751238,1266546753182056453,32634ab407c86a56dde59551b3871c42,Sat May 30 01:47:31 +0000 2020,1545,874,66604,193599,douche:Douche:-2.0041883604919835;,3 -1,null;,null;,null;
702118,1259975524581064704,69745f3009b864ba75b7d066ade0adba,Mon May 11 22:35:48 +0000 2020,6106969,726,63054,248214,null;,1 -1,null;,null;,null;
1037044,1266738565641371648,71b9c38db144b44e4cbbda75c9fbf272,Sat May 30 14:29:43 +0000 2020,45941,4550,61422,100570,null;,1 -1,null;,null;,null;
482286,1267066200049229824,56eb2d106e7611ab8bb76de07af8f318,Sun May 31 12:11:37 +0000 2020,678,524,61038,101117,quarantine:Quarantine:-2.3096035868012508;,2 -1,null;,null;,null;
1812643,1256657625334284292,6b7cc62c18b45d1eee1c34eb375e72a4,Sat May 02 18:51:40 +0000 2020,778,694,60719,213614,null;,1 -1,null;,null;,null;
1401494,1260237550091935746,6b49e6ca36daebd1048d59b1459026ae,Tue May 12 15:57:00 +0000 2020,3704,1144,60650,214508,flatten the curve:Flatten_the_curve:-1.6515462...,1 -1,null;,null;,null;
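One way to put a number on this observation is to compare the share of `null;` cells among the most-retweeted tweets with the share in the whole dataset. The sketch below is an assumption about how such a check could look, not something that was run for this report; `null_share_in_top` is a hypothetical helper and the `toy` frame holds invented rows.

```python
import pandas as pd


def null_share_in_top(df, column, k, label="#Retweets"):
    """Fraction of the k most-retweeted rows whose `column` is 'null;'."""
    top = df.nlargest(k, label)
    return (top[column] == "null;").mean()


# Invented rows, only to show the call pattern.
toy = pd.DataFrame({
    "#Retweets": [900, 500, 4, 1, 0],
    "Hashtags": ["null;", "null;", "COVID19", "null;", "Opinion thoughts"],
})

top_share = null_share_in_top(toy, "Hashtags", k=2)
overall_share = null_share_in_top(toy, "Hashtags", k=len(toy))
print(top_share, overall_share)
```

A high null share in the top rows only indicates weak relevance if it is not far above the overall share, which is why both values are computed.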


However, we believe there is still some value in including Mentions and Hashtags, even though the correlation is weak and not easily discernible. The embeddings are therefore created despite the known weak correlation.

As for the embeddings themselves, they seem to work well based on similarity scores. For instance, the embeddings rank `virus`, `pandemic` and `COVID` among the symbols most similar to `coronavirus`, with similarity scores between roughly 0.69 and 0.79.

In [20]:
hashtags_embeddings.wv.most_similar(['coronavirus'])

[('virus', 0.7936981320381165),
 ('pandemic', 0.6945043802261353),
 ('COVID', 0.6876555681228638),
 ('corona', 0.6836724281311035),
 ('mask', 0.6728377342224121),
 ('trump', 0.6640805006027222),
 ('masks', 0.6489465832710266),
 ('covid', 0.6476311087608337),
 ('ClimateChange', 0.6469046473503113),
 ('COVID__19', 0.6411719918251038)]
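The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of what that number means, the hypothetical helper below computes it directly with NumPy on made-up 2-D vectors.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up vectors: same direction, orthogonal, and opposite.
same = cosine_similarity([1.0, 0.0], [3.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Because the measure ignores vector length, a score near 1 means two symbols point in nearly the same direction in the embedding space, regardless of how often each occurs.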

## Timestamp

## Sentiment



## Entities

# Model Architecture

We use ensemble methods (insert image):
1. 0-classifier (print out layer)
2. regression model (print out layer)

# Results

Loss curve image. (Maybe save to .txt when training so can read independently and display with matplotlib independently.)

Accuracy on train and test set. (just run test.py on test set)

Validation images. (just screenshot gui)

# Discussion

comparing with state of the art.

possible issues. possible improvements

# GUI

Step-by-step Usage:
1. run xxx
2. click button
3. done

# Sources

1. Source code: https://github.com/arglux/50021-ai-project
2. Report: 
3. Reference papers: