# Graph Construction For Tweets

This notebook downloads and processes the SemEval tweet data to build several candidate graphs from the tweets, in order to test the use of a GCN.

## Data Download

The first step is to download the processed data (which is already divided into train/test/validation splits).

In [1]:
%%bash

rm -rf data/
mkdir data/

curl -LO https://cs.famaf.unc.edu.ar/~ccardellino/resources/semeval/semeval.abortion.tgz
tar xvf semeval.abortion.tgz -C data/
rm -rf semeval.abortion.tgz

semeval.abortion.test.csv
semeval.abortion.train.csv
semeval.abortion.validation.csv




## Data Loading

Now that we have the data, we need to process it. First, we import the necessary libraries and then load the dataset.

In [2]:
import json
import nltk
import pandas as pd
import string

from gensim import corpora, models
from joblib import Parallel, delayed
from nltk.tokenize import casual_tokenize
from nltk import ngrams
from nltk.corpus import stopwords
from operator import itemgetter

nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
PUNCTUATION = set(string.punctuation)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/crscardellino/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
dataset = []

for split in ["train", "validation", "test"]:
    dataset.append(
        pd.read_csv("./data/semeval.abortion.{}.csv".format(split))
    )
    dataset[-1].loc[:, "Split"] = split.capitalize()

dataset = pd.concat(dataset, ignore_index=True)
dataset.insert(0, "ID", dataset.index)
dataset.head()

Unnamed: 0,ID,Tweet,Stance,Split
0,0,Just laid down the law on abortion in my bioet...,AGAINST,Train
1,1,Bad 2 days for #Kansas Conservatives #ksleg @g...,NONE,Train
2,2,"Now that there's marriage equality, can we sta...",AGAINST,Train
3,3,I'll always put all my focus and energy toward...,AGAINST,Train
4,4,"@BarackObama celebrates ""equality"" while 3000 ...",AGAINST,Train


## Tweets Graph

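For this part I focus on the graph building cycle. Let's tokenize the tweets with NLTK's `casual_tokenize`, a convenience wrapper around the "TweetTokenizer"; `reduce_len=True` shortens long runs of repeated characters.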

In [4]:
dataset["TokenizedTweet"] = dataset["Tweet"].apply(lambda t: casual_tokenize(t, reduce_len=True))
dataset.head()

Unnamed: 0,ID,Tweet,Stance,Split,TokenizedTweet
0,0,Just laid down the law on abortion in my bioet...,AGAINST,Train,"[Just, laid, down, the, law, on, abortion, in,..."
1,1,Bad 2 days for #Kansas Conservatives #ksleg @g...,NONE,Train,"[Bad, 2, days, for, #Kansas, Conservatives, #k..."
2,2,"Now that there's marriage equality, can we sta...",AGAINST,Train,"[Now, that, there's, marriage, equality, ,, ca..."
3,3,I'll always put all my focus and energy toward...,AGAINST,Train,"[I'll, always, put, all, my, focus, and, energ..."
4,4,"@BarackObama celebrates ""equality"" while 3000 ...",AGAINST,Train,"[@BarackObama, celebrates, "", equality, "", whi..."
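
To make the effect of `reduce_len=True` concrete: runs of three or more identical characters are cut down to exactly three, so elongated spellings collapse to a single form. The following is a minimal standard-library sketch of that idea; the helper name `reduce_lengthening_sketch` is hypothetical, and it only approximates what NLTK does internally.

```python
import re

# Any character repeated three or more times in a row.
_REPEATED = re.compile(r"(.)\1{2,}")


def reduce_lengthening_sketch(text):
    """Collapse runs of 3+ identical characters down to exactly three."""
    return _REPEATED.sub(r"\1\1\1", text)


print(reduce_lengthening_sketch("waaaaayyyy too loooong"))
```

Words with only a doubled letter, such as "too", are left untouched.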


### Normalization

We need to define a function that normalizes a token. Some possible normalizations I could think of:

1. Lowercase.
1. Remove hashtags/mentions.
1. Normalize hashtags/mentions (e.g. by removing the "#"/"@" symbols).
  1. This could be further expanded by splitting the hashtags into multiple words.
1. Remove punctuation.
1. Remove stopwords (careful since many stopwords denote sentiment).
1. Stemming.
1. Remove low occurring words.
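
For the hashtag splitting idea, one possible approach is to split on CamelCase boundaries. The sketch below is only an assumption of how this could look: `split_hashtag` is a hypothetical helper that is not used anywhere in this notebook, and all-lowercase hashtags (e.g. "#ksleg") cannot be split this way without a word list.

```python
import re

# An uppercase run (acronym), a capitalized or lowercase word, or a number.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(token):
    """Return the word pieces of a hashtag such as '#LifeEquality'."""
    body = token.lstrip("#")
    pieces = _WORD.findall(body)
    # Fall back to the whole body when no piece is recognised.
    return pieces if pieces else [body]


for tag in ["#LifeEquality", "#SCOTUSMarriage", "#ksleg"]:
    print(tag, split_hashtag(tag))
```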


In [5]:
def normalize_token(token, **kwargs):
    if kwargs.get("remove_hashtags") and token.startswith("#"):
        return ""
    
    if kwargs.get("remove_mentions") and token.startswith("@"):
        return ""
  
    if kwargs.get("normalize_hashtags") and token.startswith("#"):
        # TODO: Maybe a way to split hashtags?
        token = token[1:]
  
    if kwargs.get("normalize_mentions") and token.startswith("@"):
        token = token[1:]
  
    if kwargs.get("lowercase"):
        token = token.lower()
  
    return token


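def normalize_tweet(tweet, stopwords=set(), punctuation=set(), **kwargs):
    # Filter on the lowercased token: NLTK's stopword list is all lowercase,
    # so a capitalized stopword such as "The" would otherwise slip through.
    tweet = [normalize_token(t, **kwargs).strip() for t in tweet
             if t.lower() not in stopwords and t not in punctuation]

    # Drop the tokens that normalization turned into empty strings.
    return [t for t in tweet if t != ""]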

In [6]:
normalization_config = {
    "lowercase": True,
    "remove_hashtags": True,
    "remove_mentions": True
}

dataset["NormalizedTweet"] = dataset["TokenizedTweet"].apply(
    lambda t: normalize_tweet(t, punctuation=PUNCTUATION, **normalization_config)
)

dataset[["TokenizedTweet", "NormalizedTweet"]].head()

Unnamed: 0,TokenizedTweet,NormalizedTweet
0,"[Just, laid, down, the, law, on, abortion, in,...","[just, laid, down, the, law, on, abortion, in,..."
1,"[Bad, 2, days, for, #Kansas, Conservatives, #k...","[bad, 2, days, for, conservatives, going, 0-4,..."
2,"[Now, that, there's, marriage, equality, ,, ca...","[now, that, there's, marriage, equality, can, ..."
3,"[I'll, always, put, all, my, focus, and, energ...","[i'll, always, put, all, my, focus, and, energ..."
4,"[@BarackObama, celebrates, "", equality, "", whi...","[celebrates, equality, while, 3000, unborn, ba..."


For the TF-IDF we use gensim. We build a vocabulary from the corpus of tweets and, with `no_below=2`, remove every token that is present in only one tweet; `no_above=1.0` means no token is dropped for being too frequent.

In [7]:
tweet_vocab = corpora.Dictionary(dataset["NormalizedTweet"])
tweet_vocab.filter_extremes(no_below=2, no_above=1.0)

bow_corpus = dataset["NormalizedTweet"].apply(tweet_vocab.doc2bow).tolist()
tfidf = models.TfidfModel(bow_corpus, dictionary=tweet_vocab)
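
To see what the filtering and weighting amount to, here is a standard-library sketch that does not use gensim. It assumes, as far as I understand gensim's defaults, a weight of raw term frequency times log2(N / df) followed by scaling each document to unit length; the function name `tfidf_sketch` is hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(docs, no_below=2):
    """Return (vocabulary, per-document {token: weight}) for tokenized docs."""
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vocab = {token for token, df in doc_freq.items() if df >= no_below}

    weighted = []
    for doc in docs:
        counts = Counter(token for token in doc if token in vocab)
        # Raw term frequency times log2(N / df).
        raw = {t: tf * math.log2(len(docs) / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Drop zero weights and scale each document to unit length.
        weighted.append({t: w / norm for t, w in raw.items() if w > 0} if norm else {})
    return vocab, weighted


docs = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
vocab, weighted = tfidf_sketch(docs)
print(sorted(vocab), weighted)
```

Note that a token present in every document (here "a") gets a weight of zero, so it never shows up among the top TF-IDF words.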

### Graph Edges

This part creates the functions that extract the features used to build the edges of the graph.

Some of the edges I thought of are:

1. Hashtags overlap.
1. User mentions overlap.
1. Ngrams overlap (2, 3, 4, and 5 for now).
1. Overlap of Top 10 TF-IDF words.
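
All of these edges follow the same pattern: extract a set of features per tweet and count how many features two tweets share. A minimal standard-library sketch for the n-gram case follows; `ngram_set` and `overlap` are hypothetical names, and the two example sentences are made up.

```python
def ngram_set(tokens, n=3):
    """All contiguous n-grams of a token list, joined with underscores."""
    return {"_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(features_a, features_b):
    """Number of distinct features two tweets have in common."""
    return len(set(features_a) & set(features_b))


a = "equal rights for unborn children".split()
b = "we want equal rights for all".split()
print(overlap(ngram_set(a, 3), ngram_set(b, 3)))
```

A tweet shorter than `n` tokens simply yields an empty set, so it cannot share an edge of that type with any other tweet.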

In [8]:
def extract_hashtags(tokens):
    return sorted([t for t in tokens if t.startswith("#") and t.strip() != "#"])

def extract_mentions(tokens):
    return sorted([t for t in tokens if t.startswith("@") and t.strip() != "@"])

def extract_ngrams(tokens, n=3):
    return sorted(["_".join(ngram) for ngram in ngrams(tokens, n=n)])

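def extract_toptfdf(tweet, tfidf_model, vocab, k=10):
    # Returns the k highest-weighted (token_id, weight) pairs of the tweet.
    # The weights are tweet-specific, so two pairs only compare equal when both id and weight match.
    return sorted(tfidf_model[vocab.doc2bow(tweet)], key=itemgetter(1), reverse=True)[:k]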

In [9]:
dataset["Hashtags"] = dataset["TokenizedTweet"].apply(extract_hashtags)
dataset["Mentions"] = dataset["TokenizedTweet"].apply(extract_mentions)
for i in range(2, 6):
    dataset["{}-grams".format(i)] = dataset["NormalizedTweet"].apply(lambda t: extract_ngrams(t, n=i))
dataset["TopTfIdf"] = dataset["NormalizedTweet"].apply(lambda t: extract_toptfdf(t, tfidf, tweet_vocab))
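
One thing worth checking here: the "TopTfIdf" column holds `(token_id, weight)` pairs, and a TF-IDF weight depends on the rest of the tweet. A set intersection over whole pairs can therefore miss a token that two tweets share. The illustration below uses made-up ids and weights, and `top_ids` is a hypothetical helper that is not part of this notebook.

```python
# Made-up (token_id, weight) pairs for two tweets that share token 7.
tweet_a = [(7, 0.61), (3, 0.44)]
tweet_b = [(7, 0.52), (9, 0.38)]


def top_ids(pairs):
    """Keep only the token ids of (token_id, weight) pairs."""
    return {token_id for token_id, _ in pairs}


pair_overlap = set(tweet_a) & set(tweet_b)         # compares id AND weight
id_overlap = top_ids(tweet_a) & top_ids(tweet_b)   # compares id only

print(len(pair_overlap), len(id_overlap))
```

If the intent is to link tweets by shared top words, comparing ids only (as in `id_overlap`) may be closer to it.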

In [10]:
dataset.head()

Unnamed: 0,ID,Tweet,Stance,Split,TokenizedTweet,NormalizedTweet,Hashtags,Mentions,2-grams,3-grams,4-grams,5-grams,TopTfIdf
0,0,Just laid down the law on abortion in my bioet...,AGAINST,Train,"[Just, laid, down, the, law, on, abortion, in,...","[just, laid, down, the, law, on, abortion, in,...",[#Catholic],[],"[abortion_in, bioethics_class, down_the, in_my...","[abortion_in_my, down_the_law, in_my_bioethics...","[abortion_in_my_bioethics, down_the_law_on, in...","[abortion_in_my_bioethics_class, down_the_law_...","[(1, 0.587697067393385), (2, 0.467890035647413..."
1,1,Bad 2 days for #Kansas Conservatives #ksleg @g...,NONE,Train,"[Bad, 2, days, for, #Kansas, Conservatives, #k...","[bad, 2, days, for, conservatives, going, 0-4,...","[#Kansas, #SCOTUSMarriage, #SCOTUScare, #SemST...",[@govsambrownback],"[0-4_in, 2_days, bad_2, conservatives_going, d...","[0-4_in_courts, 2_days_for, bad_2_days, conser...","[2_days_for_conservatives, bad_2_days_for, con...","[2_days_for_conservatives_going, bad_2_days_fo...","[(11, 0.5348717493397233), (10, 0.455119520052..."
2,2,"Now that there's marriage equality, can we sta...",AGAINST,Train,"[Now, that, there's, marriage, equality, ,, ca...","[now, that, there's, marriage, equality, can, ...",[#SemST],[],"[can_we, equal_rights, equality_can, for_unbor...","[can_we_start, equal_rights_for, equality_can_...","[can_we_start_working, equal_rights_for_unborn...","[can_we_start_working_on, equal_rights_for_unb...","[(22, 0.4166756775234931), (24, 0.355166766759..."
3,3,I'll always put all my focus and energy toward...,AGAINST,Train,"[I'll, always, put, all, my, focus, and, energ...","[i'll, always, put, all, my, focus, and, energ...",[#SemST],[],"[alive_instead, all_my, always_put, and_energy...","[alive_instead_of, all_my_focus, always_put_al...","[alive_instead_of_deciding, all_my_focus_and, ...","[alive_instead_of_deciding_who, all_my_focus_a...","[(32, 0.3120267165946575), (38, 0.312026716594..."
4,4,"@BarackObama celebrates ""equality"" while 3000 ...",AGAINST,Train,"[@BarackObama, celebrates, "", equality, "", whi...","[celebrates, equality, while, 3000, unborn, ba...","[#LifeEquality, #SemST]",[@BarackObama],"[3000_unborn, a_real, about_a, babies_were, ce...","[3000_unborn_babies, a_real_inequality, about_...","[3000_unborn_babies_were, a_real_inequality_si...","[3000_unborn_babies_were_killed, about_a_real_...","[(45, 0.3833117705497533), (52, 0.383311770549..."


### Graph Construction

The graph construction is based on the intersection of the graph features extracted in the previous part.

For each type of edge, we define a graph. For now, we will use them as separate representations for different baselines. Eventually we can see how to aggregate all this info.

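This implementation uses brute force: it compares every pair of tweets, so the cost grows quadratically with the number of tweets. It is not the best approach, but I'll check how to optimize it later.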
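
One possible optimization, sketched here under the assumption that the edge weight stays the plain count of shared features, is an inverted index: map each feature to the tweets that contain it, and only visit pairs of tweets that actually share something. The function `edges_by_inverted_index` is hypothetical and takes a plain `{tweet_id: features}` dictionary instead of the DataFrame.

```python
from collections import Counter, defaultdict
from itertools import combinations


def edges_by_inverted_index(features_by_id):
    """Build (row, col, weight) triples from {tweet_id: iterable of features}."""
    # Map each feature to the set of tweets that contain it.
    index = defaultdict(set)
    for tweet_id, features in features_by_id.items():
        for feature in features:
            index[feature].add(tweet_id)

    # Every pair of tweets sharing a feature gains one unit of weight.
    weights = Counter()
    for tweet_ids in index.values():
        for i, j in combinations(sorted(tweet_ids), 2):
            weights[(i, j)] += 1

    # Zero-weight self loops keep isolated tweets in the edge list.
    triples = [(i, i, 0) for i in sorted(features_by_id)]
    triples += [(i, j, w) for (i, j), w in sorted(weights.items())]
    return triples


example = {0: ["#a", "#b"], 1: ["#a"], 2: ["#b", "#a"], 3: []}
print(edges_by_inverted_index(example))
```

This only pays off when features are sparse. A feature shared by almost every tweet (such as "#SemST" in the hashtags) still produces a quadratic number of pairs, so it may be worth dropping such features first.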

In [11]:
edges = ["Hashtags", "Mentions", "TopTfIdf"] + ["{}-grams".format(i) for i in range(2, 6)]

def edge_adjacency_matrix(edge, dataset):
    adjacency = []
    for idx, row_i in dataset.iterrows():
        adjacency.append((row_i["ID"], row_i["ID"], 0))  # Needed for NetworkX to keep track of all existing nodes (even isolated ones)
        # We only store a triangular matrix (the matrix is symmetric)
        for _, row_j in dataset.loc[idx+1:].iterrows():
            edge_weight = len(set(row_i[edge]).intersection(row_j[edge]))  # TODO: Maybe weight this a little better?
            if edge_weight > 0:
                adjacency.append((row_i["ID"], row_j["ID"], edge_weight))
    return edge, adjacency

adjacencies = dict(
    Parallel(n_jobs=-1, verbose=10)(
        delayed(edge_adjacency_matrix)(edge, dataset.loc[:, ["ID", edge]]) for edge in edges
    )
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   46.1s
[Parallel(n_jobs=-1)]: Done   2 out of   7 | elapsed:   46.2s remaining:  1.9min
[Parallel(n_jobs=-1)]: Done   3 out of   7 | elapsed:   46.8s remaining:  1.0min
[Parallel(n_jobs=-1)]: Done   4 out of   7 | elapsed:   47.4s remaining:   35.5s
[Parallel(n_jobs=-1)]: Done   5 out of   7 | elapsed:   55.3s remaining:   22.1s
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.1min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.1min finished


## Save the data

Finally, with both the graphs and the complete dataset (with the splits), we save them.

In [12]:
for edge, adjacency in adjacencies.items():
    pd.DataFrame(
        adjacency, 
        columns=["row", "col", "weight"]
    ).to_csv("./data/semeval.abortion.graph.{}.csv".format(edge.lower()), index=False)

dataset[["ID", "Tweet", "Stance", "Split"]].to_csv("./data/semeval.abortion.data.csv", index=False)
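
Since only the upper triangle is stored, whoever reads these files back needs to mirror the entries to recover the symmetric matrix. Below is a minimal standard-library sketch, assuming the "row", "col" and "weight" columns written above and IDs that run from 0 to the number of tweets minus one; `read_edge_list` is a hypothetical helper and the sample data is made up.

```python
import csv
import io


def read_edge_list(handle, num_nodes):
    """Read row,col,weight triples into a dense symmetric matrix."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for record in csv.DictReader(handle):
        i, j, w = int(record["row"]), int(record["col"]), int(record["weight"])
        # Only the upper triangle is stored, so mirror each entry.
        matrix[i][j] = w
        matrix[j][i] = w
    return matrix


sample = io.StringIO("row,col,weight\n0,0,0\n1,1,0\n2,2,0\n0,2,3\n")
print(read_edge_list(sample, num_nodes=3))
```

A dense list of lists is fine for a dataset of this size; for larger graphs a sparse representation would be the more sensible choice.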

In [13]:
%%bash

cd data/
tar zcvf semeval.abortion.graph_data.tgz semeval.abortion.data.csv semeval.abortion.graph.*.csv

semeval.abortion.data.csv
semeval.abortion.graph.2-grams.csv
semeval.abortion.graph.3-grams.csv
semeval.abortion.graph.4-grams.csv
semeval.abortion.graph.5-grams.csv
semeval.abortion.graph.hashtags.csv
semeval.abortion.graph.mentions.csv
semeval.abortion.graph.toptfidf.csv


## Resource

This resource is available at: https://cs.famaf.unc.edu.ar/~ccardellino/resources/semeval/semeval.abortion.graph_data.tgz