# Task 1: Rumour Detection System
This notebook covers data pre-processing, BERT language model training, model evaluation, and COVID-19 rumour prediction. The code is taken from Tutorial 6 of COMP90042.
This notebook was run on Google Colab with a GPU runtime.

In [1]:
import json
from collections import Counter
import re
import numpy as np
import pandas as pd
import pickle
from IPython.display import display

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#%% Read the labels of the train and dev data
with open('drive/MyDrive/NLP_Pro1/train.label.json') as f:
    train_label = json.load(f) #read as a dict
with open('drive/MyDrive/NLP_Pro1/dev.label.json') as f:
    dev_label = json.load(f) #read as a dict

In [5]:
#%% Read the Tweet Event Data
'''
The data is read as a list of lists of dicts:
  A list of events,
  Each event is a list of tweets, where the first tweet is the source and the rest are replies
  Each tweet is a dict that contains information about that tweet
'''
train_data = []
dev_data = []
test_data = []
with open('drive/MyDrive/NLP_Pro1/train.data.jsonl') as f:
    for line in f:
        train_data.append(json.loads(line))
with open('drive/MyDrive/NLP_Pro1/dev.data.jsonl') as f:
    for line in f:
        dev_data.append(json.loads(line))
with open('drive/MyDrive/NLP_Pro1/test.data.jsonl') as f:
    for line in f:
        test_data.append(json.loads(line))
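
The nested structure described above (a list of events, each a list of tweet dicts) can be illustrated without the real files. The sketch below is only an illustration: `read_jsonl` is a hypothetical helper, and the two toy events are invented stand-ins for lines of a `.jsonl` file.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON value per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]

# Invented toy data: each line is one event, i.e. a list of tweet dicts
toy_file = io.StringIO(
    '[{"id_str": "1", "text": "source tweet"}, {"id_str": "2", "text": "a reply"}]\n'
    '[{"id_str": "3", "text": "another source"}]\n'
)
events = read_jsonl(toy_file)

# The first tweet of every event is the source tweet
source_ids = [event[0]["id_str"] for event in events]
print(source_ids)
```

The same helper could replace the three near-identical loops above, one call per file.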

In [6]:
#%% Pre-process and convert the tweet event data to simple dataframe
'''
This function takes the Twitter event data and labels as input. It
extracts and pre-processes the text of each tweet, then saves the eventID,
pre-processed text, and label into a dataframe.
The pre-processing removes "@mention"s, lower-cases the text, and adds the [CLS] & [SEP] tokens for BERT

Input:
    data: a list of lists of dicts
    label: a dict, or None if no labels are available
Output:
    a dataframe that contains the pre-processed text, eventID, and label
'''
def dataToDf(data,label):
    data_dict = {}
    eventTexts = [] #To save the text of all events 
    eventIds = [] #To save the Id of all events
    for event in data:  #every event is a list of dicts (one dict per tweet)
        eventText = '' #To save the text from all tweets of a event
        eventId = event[0]['id_str'] #first tweet of an event is the source tweet, take out its ID
    
        for tweet in event: #first tweet is the source tweet #tweet is a dict
            tempText = re.sub(r"@\S*\s?", "", tweet['text']).strip() #remove @mentions
            #tempText = re.sub("#[\S]*[\s]?", "hashtag", tempText) #remove hashtag
            eventText = eventText +' [SEP] '+tempText.lower()
        eventText = eventText + ' [SEP]' #close the last tweet with a final [SEP]; tweets of the same event are separated by [SEP]
        eventText=eventText.strip()
        eventText=eventText.split(' ', 1)[1] #remove the first word '[SEP]'
        eventText = '[CLS]'+' '+eventText #attach the [CLS] token before the first tweet (the source tweet)
        #print(eventText)
        eventTexts.append(eventText)
        eventIds.append(eventId)
    data_dict['eventID'] = eventIds
    data_dict['eventTexts'] = eventTexts
    data_dict['label'] = 3 #initialize the label to 3; this value indicates that no label is provided
    tempDf = pd.DataFrame(data = data_dict)
    #if the labels are provided, encode them
    #0==non-rumour, 1==rumour
    if label is not None:
        for key in label.keys():
            tempDf.loc[tempDf['eventID']==key,'label'] = label[key]
        tempDf['label']=tempDf['label'].apply(lambda x: 1 if x=='rumour' else 0)
    return tempDf
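
To make the text format concrete, here is a minimal sketch of the per-event string construction on an invented two-tweet event. `build_event_text` is a hypothetical, simplified stand-in for the inner loop of `dataToDf`; it assumes every tweet still has some text left after the mentions are removed.

```python
import re

def build_event_text(tweets):
    """Join the texts of one event (source tweet first) into a BERT-style string."""
    cleaned = []
    for text in tweets:
        # Drop each @mention together with one trailing whitespace character
        text = re.sub(r"@\S*\s?", "", text).strip()
        cleaned.append(text.lower())
    # [CLS] opens the sequence; every tweet is closed by [SEP]
    return "[CLS] " + " [SEP] ".join(cleaned) + " [SEP]"

# Invented toy event: one source tweet followed by one reply
toy_event = ["Breaking: incident downtown @newsdesk", "@reporter Is this confirmed?"]
event_text = build_event_text(toy_event)
print(event_text)
# [CLS] breaking: incident downtown [SEP] is this confirmed? [SEP]
```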

In [7]:
df_train = dataToDf(train_data,train_label)
df_dev = dataToDf(dev_data,dev_label)
df_test = dataToDf(test_data,None)

In [8]:
'''
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df_dev)
'''    
display(df_dev)
display(df_test)
print(df_dev.loc[4,'eventTexts'])
print(len(df_dev.loc[4,'eventTexts']))

Unnamed: 0,eventID,eventTexts,label
0,553588913747808256,[CLS] #breaking reports: 2 brothers suspected ...,1
1,524949003834634240,[CLS] you are not alone today #ottawa - we are...,0
2,553221281181859841,"[CLS] have said it before, but needs saying ag...",0
3,580322346508124160,[CLS] germanwings #a320 plane crashes in south...,1
4,544307417677189121,[CLS] hostage situation in sydney\nto all our ...,1
...,...,...,...
575,525025279803424768,[CLS] the soldier shot dead in wednesday's ott...,1
576,552784600502915072,[CLS] charlie hebdo became well known for publ...,0
577,499696525808001024,[CLS] we got through. that's a sniper on top o...,0
578,580320612155060224,[CLS] last position of germanwings flight #4u9...,1


Unnamed: 0,eventID,eventTexts,label
0,544382249178001408,[CLS] 5 people have been able to get out of sy...,3
1,525027317551079424,[CLS] new: sources: deceased gunman who killed...,3
2,544273220128739329,[CLS] isis flag visible as gunman seizes sydne...,3
3,499571799764770816,[CLS] people of #ferguson: stop #attacking our...,3
4,552844104418091008,"[CLS] #charliehebdo editor, assassinated today...",3
...,...,...,...
576,553581227165642752,[CLS] we are hearing gunfire at the siege at t...,3
577,552816302780579840,[CLS] “i don’t feel as though i’m killing some...,3
578,580350000074457088,[CLS] we must confirm to our deepest regret th...,3
579,498584409055174656,"[CLS] protestors have blocked west florissant,...",3


[CLS] hostage situation in sydney
to all our fans and friends staying in sydney, stay safe and keep praying... http://t.co/sq62baketz [SEP] people praying is exactly what caused this situation in the first place.
#yourgodsnotrealbutmineis [SEP] what if it's an isis attack? so sorry to all those hostages, keep calm because the police will get you out. [SEP] who are you? do i even know you? stay away satan! [SEP]
414


In [None]:
#df_train = df_train[:100]

In [9]:
!pip install torch torchvision transformers


In [10]:
#load pretrained bert base model
#this is already trained on a large corpus
from transformers import BertModel

bert_model = BertModel.from_pretrained('bert-base-uncased')

print("Done loading BERT model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Done loading BERT model.


In [11]:
from transformers import BertTokenizer

#load BERT's WordPiece tokenisation model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [12]:
import torch
torch.cuda.empty_cache() #empty the cache before training to prevent memory error
torch.cuda.memory_summary(device=None, abbreviated=False)



In [13]:
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer
import pandas as pd

class myDataset(Dataset):

    def __init__(self, dataframe, maxlen): #dataframe: df_dev or df_train

        #Store the contents of the file in a pandas dataframe
        self.df = dataframe

        #Initialize the BERT tokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        self.maxlen = maxlen

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        #Selecting the sentence and label at the specified index in the data frame
        sentence = self.df.loc[index, 'eventTexts']
        label = self.df.loc[index, 'label']
        eventID = self.df.loc[index, 'eventID']
        #Preprocessing the text to be suitable for BERT
        tokens = self.tokenizer.tokenize(sentence) #Tokenize the sentence
        
        if len(tokens) < self.maxlen:
            tokens = tokens + ['[PAD]' for _ in range(self.maxlen - len(tokens))] #Padding short sentences up to maxlen
        else:
            tokens = tokens[:self.maxlen-1] + ['[SEP]'] #Pruning the list to the specified max length, ending with [SEP]

        tokens_ids = self.tokenizer.convert_tokens_to_ids(tokens) #Obtaining the indices of the tokens in the BERT Vocabulary
        tokens_ids_tensor = torch.tensor(tokens_ids) #Converting the list to a pytorch tensor

        #Obtaining the attention mask i.e. a tensor containing 1s for non-padded tokens and 0s for padded ones
        #([PAD] has id 0 in the bert-base-uncased vocabulary)
        attn_mask = (tokens_ids_tensor != 0).long()

        return tokens_ids_tensor, attn_mask, label, eventID
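
The padding and truncation rule in `__getitem__` can be checked on plain token lists, without loading the tokenizer. This is a sketch only: `pad_or_truncate` is a hypothetical helper that works on token strings, whereas the class above builds the mask from token ids.

```python
def pad_or_truncate(tokens, maxlen, pad="[PAD]", sep="[SEP]"):
    """Return exactly maxlen tokens plus a 1/0 attention mask."""
    if len(tokens) < maxlen:
        # Short sequences are padded on the right
        tokens = tokens + [pad] * (maxlen - len(tokens))
    else:
        # Long sequences keep the first maxlen-1 tokens and end with [SEP]
        tokens = tokens[:maxlen - 1] + [sep]
    mask = [0 if t == pad else 1 for t in tokens]
    return tokens, mask

short_tokens, short_mask = pad_or_truncate(["[CLS]", "stay", "safe", "[SEP]"], maxlen=6)
long_tokens, long_mask = pad_or_truncate(["[CLS]", "a", "b", "c", "d", "e", "[SEP]"], maxlen=5)
print(short_tokens, short_mask)
print(long_tokens, long_mask)
```

Both results have length `maxlen`, which is what lets the `DataLoader` stack them into a batch.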

In [14]:
from torch.utils.data import DataLoader

#Creating instances of the training, development, and test sets
#maxlen sets the maximum length a sentence can have
#any sentence longer than this length is truncated to the maxlen size
train_set = myDataset(dataframe = df_train, maxlen = 500) 
dev_set = myDataset(dataframe = df_dev, maxlen = 500)
test_set = myDataset(dataframe = df_test, maxlen = 500)

#Creating instances of training and development dataloaders
train_loader = DataLoader(train_set, batch_size = 10, num_workers = 2) #batch_size = 64
dev_loader = DataLoader(dev_set, batch_size = 10, num_workers = 2)

print("Done preprocessing training and development data.")

Done preprocessing training and development data.


In [15]:
display(train_loader)
display(dev_loader)


<torch.utils.data.dataloader.DataLoader at 0x7f75814adf50>

<torch.utils.data.dataloader.DataLoader at 0x7f7581574e10>

In [16]:
import torch
import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):

    def __init__(self):
        super(SentimentClassifier, self).__init__()
        #Instantiating BERT model object 
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
        
        #Classification layer
        #input dimension is 768 because [CLS] embedding has a dimension of 768
        #output dimension is 1 because we're working with a binary classification problem
        self.cls_layer = nn.Linear(768, 1) #initialize the layer

    def forward(self, seq, attn_masks):
        '''
        Inputs:
            -seq : Tensor of shape [B, T] containing token ids of sequences
            -attn_masks : Tensor of shape [B, T] containing attention masks to be used to avoid contribution of PAD tokens
        '''

        #Feeding the input to BERT model to obtain contextualized representations
        outputs = self.bert_layer(seq, attention_mask = attn_masks)
        cont_reps = outputs.last_hidden_state

        #Obtaining the representation of [CLS] head (the first token)
        cls_rep = cont_reps[:, 0] #for all the context, just take the first cls token

        #Feeding cls_rep to the classifier layer
        logits = self.cls_layer(cls_rep)

        return logits

In [17]:
gpu = 0 #gpu ID

print("Creating the sentiment classifier, initialised with pretrained BERT-BASE parameters...")
net = SentimentClassifier() #initialize the net
net.cuda(gpu) #Enable gpu support for the model, i.e. move the model to the GPU
print("Done creating the sentiment classifier.")

Creating the sentiment classifier, initialised with pretrained BERT-BASE parameters...


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Done creating the sentiment classifier.


In [18]:
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()  #BCE: binary cross entropy
opti = optim.Adam(net.parameters(), lr = 2e-5) #optimizer

In [19]:
import time

def train(net, criterion, opti, train_loader, dev_loader, max_eps, gpu):
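    best_f1 = 0
    st = time.time()
    train_loss_curve = []
    train_f1_curve = []
    dev_loss_curve = []
    dev_f1_curve = []
    for ep in range(max_eps):
        net.train() #evaluate() switches to eval mode, so re-enable training mode (dropout) at the start of every epoch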
        for it, (seq, attn_masks, labels, eventID) in enumerate(train_loader):
            #Clear gradients
            opti.zero_grad() #make all the gradient zero  
            #Converting these to cuda tensors
            #normal tensor is on cpu
            #cuda tensor is on gpu
            seq, attn_masks, labels = seq.cuda(gpu), attn_masks.cuda(gpu), labels.cuda(gpu)

            #Obtaining the logits from the model
            logits = net(seq, attn_masks)

            #Computing loss
            loss = criterion(logits.squeeze(-1), labels.float())

            #Backpropagating the gradients
            loss.backward()

            #Optimization step
            opti.step() #update the weight with the gradient
              
            if it % 30 == 0: #print the train loss and f1 every 30 steps
                
                f1 = get_accuracy_from_logits(logits, labels, gpu)
                print("Iteration {} of epoch {} complete. Loss: {}; F1: {}; Time taken (s): {}".format(it, ep, loss.item(), f1, (time.time()-st)))
                train_loss_curve.append(loss.item())
                train_f1_curve.append(f1)
                st = time.time()

        
        dev_f1, dev_loss = evaluate(net, criterion, dev_loader, gpu)
        dev_loss_curve.append(dev_loss)
        dev_f1_curve.append(dev_f1)
        print("Epoch {} complete! Development F1: {}; Development Loss: {}".format(ep, dev_f1, dev_loss))
        torch.save(net.state_dict(), 'drive/MyDrive/NLP_Pro1/bertcls_{}.dat'.format(ep))
        if dev_f1 > best_f1:
            print("Best development F1 improved from {} to {}, saving model...".format(best_f1, dev_f1))
            best_f1 = dev_f1
            torch.save(net.state_dict(), 'drive/MyDrive/NLP_Pro1/bertcls_{}.dat'.format(ep))
    print(train_loss_curve)
    print(train_f1_curve)
    print(dev_loss_curve)
    print(dev_f1_curve)

In [20]:
import time

def predict(net, data_set, gpu):
    data_loader = DataLoader(data_set, batch_size = 1, num_workers = 2)
    net.eval() #evaluation mode: disables dropout so predictions are deterministic
    predicted_dict = {}
    st = time.time() 
    for it, (seq, attn_masks, labels, eventID) in enumerate(data_loader):
        #Disable gradient tracking during inference
        with torch.no_grad():
        
            #Converting these to cuda tensors
            #normal tensor is on cpu
            #cuda tensor is on gpu
            seq, attn_masks, labels = seq.cuda(gpu), attn_masks.cuda(gpu), labels.cuda(gpu)
            #print(seq,attn_masks)
            #Obtaining the logits from the model
            logits = net(seq, attn_masks)
            #print(logits)
            
            probs = torch.sigmoid(logits.unsqueeze(-1))
            soft_probs = (probs > 0.5).long()
            #predictedLabel = int(soft_probs.data[0][0][0])
            predictedLabel = int(soft_probs.item())
            eventID = eventID[0]
            #acc = (soft_probs.squeeze() == labels).float().mean()
            if predictedLabel == 1:
                predicted_dict[eventID] = "rumour"
            else: 
                predicted_dict[eventID] = "non-rumour"
            '''
            print(it)
            print(eventID)
            print(logits)
            print("probs={}".format(probs))
            print("soft_probs={}".format(soft_probs))
            print(predictedLabel)
            print(" ")
            '''
    et = time.time()
    print("time spent on predict is {}".format(et-st))
    return predicted_dict
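
The decision rule inside `predict` is a sigmoid followed by a 0.5 threshold. A minimal pure-Python sketch of that rule is given below; `logit_to_label` is a hypothetical helper and the logit values are made up.

```python
import math

def logit_to_label(logit, threshold=0.5):
    """Map a raw classifier logit to the string label used in the output files."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "rumour" if prob > threshold else "non-rumour"

# Made-up logits: positive, negative, and exactly zero
labels = [logit_to_label(x) for x in (2.0, -1.0, 0.0)]
print(labels)
```

With a threshold of 0.5 this is equivalent to checking whether the logit is positive, so a logit of exactly 0 is labelled non-rumour.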


In [21]:
# This function computes the F1 score of the positive (rumour) class for one batch
def get_accuracy_from_logits(logits, labels, gpu):
    probs = torch.sigmoid(logits.unsqueeze(-1))
    soft_probs = (probs > 0.5).long() #hard 0/1 predictions
    batchSize = labels.size()[0]
    allTrue = [1] * batchSize
    allTrue = torch.FloatTensor(allTrue).cuda(gpu)
    #precision = true positives / predicted positives
    precision = (((soft_probs.squeeze() == labels)&(labels==allTrue)).float().mean()) / ((soft_probs.squeeze() == 1).float().mean())
    #recall = true positives / actual positives
    recall = (((soft_probs.squeeze() == labels)&(labels==allTrue)).float().mean()) / (labels==allTrue).float().mean()
    f1 = (2*precision*recall) / (precision+recall)
    if torch.isnan(f1): #0/0, e.g. no predicted positives or no true positives in the batch
        f1 = 0
    #print(f1)
    return f1

def evaluate(net, criterion, dataloader, gpu):
    net.eval()

    mean_f1, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        for seq, attn_masks, labels, eventID in dataloader:
            seq, attn_masks, labels = seq.cuda(gpu), attn_masks.cuda(gpu), labels.cuda(gpu)
            logits = net(seq, attn_masks)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            mean_f1 += get_accuracy_from_logits(logits, labels, gpu)
            count += 1

    return mean_f1 / count, mean_loss / count
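
Note that `evaluate` averages the per-batch F1 values, which in general differs from the F1 computed over the whole set at once. The sketch below shows the alternative of accumulating counts across batches first. It is not part of the original pipeline: `f1_from_counts`, `corpus_f1`, and the toy batches are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 of the positive class from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def corpus_f1(batches):
    """Accumulate counts over all (predictions, labels) batches, then compute one F1."""
    tp = fp = fn = 0
    for preds, labels in batches:
        for p, y in zip(preds, labels):
            tp += int(p == 1 and y == 1)
            fp += int(p == 1 and y == 0)
            fn += int(p == 0 and y == 1)
    return f1_from_counts(tp, fp, fn)

# Invented toy batches of (predictions, gold labels)
toy_batches = [([1, 1, 0], [1, 0, 0]), ([0, 0], [1, 0])]
overall = corpus_f1(toy_batches)
print(overall)
```

On these toy batches the pooled counts are 1 true positive, 1 false positive, and 1 false negative, giving an F1 of 0.5, while averaging the two per-batch scores (2/3 and 0) gives 1/3.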

In [22]:
#uncomment this part to load previously trained model.
'''
net = SentimentClassifier()
net.load_state_dict(torch.load('drive/MyDrive/NLP_Pro1/bertcls_19.dat'))
net.cuda(gpu) #Enable gpu support for the model #tell the model to move to GPU
net.eval()
'''

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SentimentClassifier(
  (bert_layer): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise

In [23]:
num_epoch = 10

#fine-tune the model
train(net, criterion, opti, train_loader, dev_loader, num_epoch, gpu)
torch.save(net.state_dict(), 'drive/MyDrive/NLP_Pro1/bertcls_{}.dat'.format(num_epoch-1))


Iteration 0 of epoch 0 complete. Loss: 0.0005321679054759443; F1: 1.0; Time taken (s): 1.0595190525054932
Iteration 30 of epoch 0 complete. Loss: 0.00017172304796986282; F1: 1.0; Time taken (s): 16.151247262954712
Iteration 60 of epoch 0 complete. Loss: 0.00024070090148597956; F1: 1.0; Time taken (s): 16.112316370010376
Iteration 90 of epoch 0 complete. Loss: 0.0012424641754478216; F1: 1.0; Time taken (s): 16.117817878723145
Iteration 120 of epoch 0 complete. Loss: 0.0031329658813774586; F1: 1.0; Time taken (s): 16.138188362121582
Iteration 150 of epoch 0 complete. Loss: 0.0011653919937089086; F1: 1.0; Time taken (s): 16.12478542327881
Iteration 180 of epoch 0 complete. Loss: 0.0002350305876461789; F1: 0; Time taken (s): 16.131574869155884
Iteration 210 of epoch 0 complete. Loss: 0.00023703357146587223; F1: 1.0; Time taken (s): 16.115753173828125
Iteration 240 of epoch 0 complete. Loss: 0.0069624572061002254; F1: 1.0; Time taken (s): 16.12736439704895
Iteration 270 of epoch 0 complete.

In [24]:
params = list(net.parameters())
display(len(params))

201

In [25]:
test_predicted_dict = predict(net, test_set ,gpu)
train_predicted_dict = predict(net, train_set, gpu)
dev_predicted_dict = predict(net, dev_set, gpu)

time spent on predict is 19.734493732452393
time spent on predict is 125.01564002037048
time spent on predict is 15.872421264648438


In [26]:
#display(test_predicted_dict)

In [27]:
# Save the predicted labels
with open("drive/MyDrive/NLP_Pro1/test-output.json", "w") as outfile: 
    json.dump(test_predicted_dict, outfile,separators=(',', ':'))
with open("drive/MyDrive/NLP_Pro1/train-output.json", "w") as outfile: 
    json.dump(train_predicted_dict, outfile,separators=(',', ':'))
with open("drive/MyDrive/NLP_Pro1/dev-output.json", "w") as outfile: 
    json.dump(dev_predicted_dict, outfile,separators=(',', ':'))

Now we apply the trained model to the COVID-19 data.

In [28]:
#%% Read covid data
covid_data = []
with open('drive/MyDrive/NLP_Pro1/covid.data.jsonl') as f:
    for line in f:
        covid_data.append(json.loads(line))

In [29]:
df_covid = dataToDf(covid_data, None)
covid_set = myDataset(dataframe = df_covid, maxlen = 500)
covid_predicted_dict = predict(net, covid_set ,gpu)
with open("drive/MyDrive/NLP_Pro1/covid-output.json", "w") as outfile: 
    json.dump(covid_predicted_dict, outfile,separators=(',', ':'))

time spent on predict is 565.2573494911194
