<a href="https://colab.research.google.com/github/dmwhang/WNUT-task2/blob/main/Dartmouth_CS_at_WNUT_2020_Task_2_Fine_tuning_BERT_for_Tweet_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dartmouth CS at WNUT 2020 Task 2: Fine-tuning BERT for Tweet classification
### Author: Dylan Whang
### Last updated: 10.8.2020

This notebook was developed for the 6th Workshop on Noisy User-generated Text (W-NUT 2020), Task 2: identification of informative COVID-19 English Tweets.

## Abstract

We describe the systems developed for the WNUT-2020 shared task 2, identification of informative COVID-19 English Tweets. BERT is a highly performant model for Natural Language Processing tasks. We improved BERT's performance on this classification task by fine-tuning it, concatenating its embeddings with Tweet-specific features, and training a Support Vector Machine (SVM) on the combined representation (henceforth called BERT+). We compared its performance to a suite of machine learning models. For the non-BERT models, we used a Twitter-specific data cleaning pipeline and word-level TF-IDF to extract features. BERT+ was the top-performing model with an F1-score of 0.8713.

## Install packages

In [None]:
pip install transformers emoji profanity-check syllables

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting emoji
  Downloading https://files.pythonhosted.org/packages/ff/1c/1f1457fe52d0b30cbeebfd578483cedb3e3619108d2d5a21380dfecf8ffd/emoji-0.6.0.tar.gz (51kB)
Collecting profanity-check
  Downloading https://files.pythonhosted.org/packages/26/dd/bdbfe61f11b328a583960ece9145a3e080082475f52f9f56795b22ab4c41/profanity_check-1.0.3-py3-none-any.whl (2.4MB)
Collecting syllables
  Downloading https://files.pythonhosted.org/packages/16/d9/81a31f640ccf405fdfd0eae8eebfc2579b438804dbf34dc03cad3e76169a/syllables-0.1.0-py2.py3-none-any.whl
Collecting sacremoses
[?25l  Downloading https://files.pythonh

## Import packages

In [None]:
# Preprocessing packages
import csv
import re
import nltk
import ssl
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from emoji.unicode_codes import UNICODE_EMOJI
import syllables
from profanity_check import predict as profanity_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Traditional ML Models packages
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import svm
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

# Bert packages
import numpy as np
import pandas as pd
import tensorflow as tf
import torch
import transformers # pytorch transformers
import time
from transformers import BertTokenizer
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler



## Pre-process: Methods

In [None]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
nltk.download('wordnet')
lmtzr = WordNetLemmatizer() 

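# Build the stop-word set once instead of reloading the list for every word.
STOP_WORDS = set(stopwords.words('english'))

def stopWords(tweet):
  return " ".join([word for word in tweet.split() if word not in STOP_WORDS])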

def lemmatize(tweet): 
  return ' '.join([lmtzr.lemmatize(word, 'v') for word in tweet.split()])

def lower(tweet):
  return tweet.lower()

def emoji(tweet):
  out = []
  for word in tweet.split():
    if word in UNICODE_EMOJI:
      word = UNICODE_EMOJI[word]
    out.append(word)
  return ' '.join(out)

def charSqueeze(tweet):
  # Cap every run of a repeated character at two copies (e.g. "sooo" -> "soo").
  squeezed = []
  prev = None
  rep = False
  for curr_char in tweet:
    if rep:
      if curr_char != prev:
        rep = False
        prev = curr_char  # start tracking the new character
        squeezed.append(curr_char)
    else:
      squeezed.append(curr_char)
      if prev == curr_char:
        rep = True
      else:
        prev = curr_char
  return ''.join(squeezed)
  # return re.sub(r'([A-z])(?=[A-z]\1)', "", tweet)

def rmurls(tweet):
  return re.sub("HTTPURL", "", tweet)

def rmUser(tweet):
  return re.sub("@USER", "", tweet)

def AlNum(tweet):
  clean = []
  for word in tweet.split():
    if word[0] == '#' or word[0] == ':' or str.isalnum(word):
      clean.append(word)
    else:
      clean.append(re.sub(r'[\W_]+', '', word, flags=re.UNICODE))
  return ' '.join(clean)

def hashtag(tweet):
  clean = []
  for word in tweet.split():
    if word[0] == '#':
      word = word[1:]
    clean.append(word)
  return ' '.join(clean)

def process(corpora, ur=True, us=True, sw=True, ch=True, lo=True, le=True, em=True, an=True, ha=True):
  clean_corpora = []
  for tweet in corpora:
    if ur:
      tweet = rmurls(tweet)
    if us:
      tweet = rmUser(tweet)
    if sw:
      tweet = stopWords(tweet)
    if ch:
      tweet = charSqueeze(tweet)
    if em:
      tweet = emoji(tweet)
    if lo:
      tweet = lower(tweet)
    if le:
      tweet = lemmatize(tweet)
    if an:
      tweet = AlNum(tweet)
    if ha:
      tweet = hashtag(tweet)
    clean_corpora.append(tweet)
  return clean_corpora

def f1_score(predictions, labels):
  TP = 0
  FP = 0
  FN = 0
  for i in range(len(predictions)):
    if predictions[i] == 1:
      if labels[i] == 1:
        TP += 1
      else:
        FP += 1
    elif labels[i] == 1:
      FN += 1
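  # Guard against division by zero when there are no positive predictions or labels.
  pre = TP/(TP+FP) if (TP+FP) > 0 else 0.0
  rec = TP/(TP+FN) if (TP+FN) > 0 else 0.0
  if pre + rec == 0:
    return 0.0
  f1 = 2*((pre*rec)/(pre+rec))
  return f1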

# SVM vector
# 1) has "HTTPURL"
# 2) # of "HTTPURL"
# 3) has '#'
# 4) # of '#'
# 5) has "@USER"
# 6) # of "@USER"
# 7) has emojis
# 8) # of emojis
# 9) # of words
# 10) average syllables per word
# 11) contains profanity
def svm_vector_generator(tweet):
  vector = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  for word in tweet.split():
    if word == "HTTPURL":
      vector[0] = 1
      vector[1] += 1
    if word[0] == '#':
      vector[2] = 1
      vector[3] += 1
    if word == "@USER":
      vector[4] = 1
      vector[5] += 1
    if word in UNICODE_EMOJI:
      vector[6] = 1
      vector[7] += 1
    vector[8] += 1
    vector[9] += syllables.estimate(word)
  vector[10] = profanity_predict([tweet])[0]
  if vector[8] > 0:  # avoid division by zero on an empty tweet
    vector[9] = vector[9]/vector[8]
  return vector


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
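
The `charSqueeze` helper above limits runs of a repeated character. As a rough alternative sketch, the same idea can be expressed with a regular-expression backreference. The name `squeeze_repeats` and the `max_run` parameter are hypothetical, and the sketch assumes the intent is to keep at most two copies of any repeated character.

```python
import re

def squeeze_repeats(text, max_run=2):
    """Collapse any run of more than `max_run` identical characters to `max_run`."""
    # (.) captures one character; \1{max_run,} matches max_run or more further copies.
    pattern = re.compile(r'(.)\1{%d,}' % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("sooooo goooood"))  # soo good
```

Words that legitimately contain a doubled letter, such as "hello", pass through unchanged because only runs of three or more are shortened.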


## Pre-process: load and process data

In [None]:
train_file = open("./train.tsv")
valid_file = open("./valid.tsv")
test_file = open("./unlabeled_test_with_noise.tsv")
next(train_file)
train_corpora = []
valid_corpora = []
test_corpora = []

train_labels = []
valid_labels = []

train_svm_vectors = [] 
valid_svm_vectors = []
test_svm_vectors = []

for raw_line in train_file:
  line = raw_line.split("\t")
  train_corpora.append(line[1])
  train_svm_vectors.append(svm_vector_generator(line[1]))
  if line[2] == "INFORMATIVE\n":
    train_labels.append(1)
  else:
    train_labels.append(0)
for raw_line in valid_file:
  line = raw_line.split("\t")
  # train_corpora.append(line[1])
  # train_svm_vectors.append(svm_vector_generator(line[1]))
  # if line[2] == "INFORMATIVE\n":
  #   train_labels.append(1)
  # else:
  #   train_labels.append(0)
  valid_corpora.append(line[1])
  valid_svm_vectors.append(svm_vector_generator(line[1]))
  if line[2] == "INFORMATIVE\n":
    valid_labels.append(1)
  else:
    valid_labels.append(0)
for raw_line in test_file:
  line = raw_line.split("\t")
  test_corpora.append(line[1]) 
  test_svm_vectors.append(svm_vector_generator(line[1]))

# clean corpora
clean_train_corpora = process(train_corpora, lo=False, le=False)
clean_valid_corpora = process(valid_corpora, lo=False, le=False)
clean_test_corpora = process(test_corpora, lo=False, le=False)




## Traditional ML models
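
The abstract states that the non-BERT models use word-level TF-IDF features, and the notebook imports `TfidfVectorizer` for that purpose. The sketch below illustrates the weighting itself in plain Python. It assumes whitespace tokenisation, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which is meant to mirror scikit-learn's defaults; `tfidf_vectors` is a hypothetical helper and not necessarily the configuration used for the reported results.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency (assumed form, see lead-in).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Toy corpus for illustration only (not taken from the task data).
vocab, rows = tfidf_vectors(["covid cases rise", "covid vaccine news", "stay home"])
```

A term that occurs in many documents (here "covid") receives a lower weight than a term that is specific to one document (here "cases"). Rows like these can then be passed to any of the scikit-learn classifiers imported earlier.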

## BERT: Loading GPU

In [None]:
# Get the GPU device name as reported by TensorFlow.
device_name = tf.test.gpu_device_name()

# On a GPU runtime the device name is '/device:GPU:0'.
if device_name == '/device:GPU:0':
  # If PyTorch can also see the GPU...
  if torch.cuda.is_available():
      # Tell PyTorch to use the GPU.
      device = torch.device("cuda")
      print('There are %d GPU(s) available.' % torch.cuda.device_count())
      print('We will use the GPU:', torch.cuda.get_device_name(0))
  # If not, fall back to the CPU.
  else:
      print('No GPU available, using the CPU instead.')
      device = torch.device("cpu")
else:
    raise SystemError('GPU device not found')

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


## BERT: Load tokenizer and tokenize data

In [None]:
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased', do_lower_case=True)
# tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

df = pd.DataFrame({0: clean_train_corpora, 1: train_labels})
tweets = df[0].values
labels = df[1].values
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []
for tweet in clean_train_corpora:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        # max_length = 64,           # Pad & truncate all sentences.
                        max_length = 128,           # Accurate length
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 6, its SVM feature vector, and its token IDs.
print('Original: ', clean_train_corpora[6])
print('SVM: ', train_svm_vectors[6])
print('Token IDs:', input_ids[6])

Loading BERT tokenizer...


Original:  number COVID19 deaths surpassed 1000 worldwide NY surpassed Italy number recorded deaths New York state coronavirus cases country worldwide
SVM:  [1, 1, 1, 1, 0, 0, 0, 0, 29, 1.7241379310344827, 0]
Token IDs: tensor([  101,  2193,  2522, 17258, 16147,  6677, 15602,  6694,  4969,  6396,
        15602,  3304,  2193,  2680,  6677,  2047,  2259,  2110, 21887, 23350,
         3572,  2406,  4969,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
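
The zeros that fill the tensor above are padding. As a minimal sketch of what `pad_to_max_length` and `return_attention_mask` produce for a single sequence, the hypothetical helper `pad_and_mask` below pads a list of token IDs and builds the matching mask. It assumes a padding ID of 0, as in the output above, and simplifies truncation: the real tokenizer keeps the closing `[SEP]` token when it truncates, whereas this sketch just cuts the list.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad `token_ids` to `max_length` and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# 101 and 102 are the [CLS] and [SEP] IDs seen in the output above.
ids, mask = pad_and_mask([101, 2193, 2522, 102], max_length=8)
print(ids)   # [101, 2193, 2522, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask tells the model to ignore the padded positions, so sequences of different lengths can share one fixed-size batch tensor.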


## BERT: Partition data into train and valid for fine-tuning

In [None]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

6,300 training samples
  700 validation samples
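
In this notebook the random seeds are set in the later fine-tuning cell, so the 90-10 split above appears to be drawn before any seed is fixed and may differ between runs. The sketch below shows one way to make such a split repeatable, using only the standard library. `split_indices` is a hypothetical helper, not part of the notebook, and the seed value is an assumption.

```python
import random

def split_indices(n_samples, train_fraction=0.9, seed=42):
    """Shuffle sample indices with a fixed seed and cut them into train/validation."""
    train_size = int(train_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # private RNG, leaves global state untouched
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = split_indices(7000)
print(len(train_idx), len(val_idx))  # 6300 700
```

With PyTorch, a comparable effect could likely be obtained by seeding before calling `random_split`; the index-based version here only illustrates the idea.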


## BERT: Initialize data loaders for tuning

In [None]:
# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

## BERT: Load BERT for sequence classification

In [None]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased",
    # "bert-large-cased",
    num_labels = 2,  
    output_attentions = False,
    output_hidden_states = True,
)

model.cuda()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1

## BERT: Show initial configuration

In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Last Transformer ====\n')

for p in params[-20:-4]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 393 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 1024)
bert.embeddings.position_embeddings.weight               (512, 1024)
bert.embeddings.token_type_embeddings.weight               (2, 1024)
bert.embeddings.LayerNorm.weight                             (1024,)
bert.embeddings.LayerNorm.bias                               (1024,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight        (1024, 1024)
bert.encoder.layer.0.attention.self.query.bias               (1024,)
bert.encoder.layer.0.attention.self.key.weight          (1024, 1024)
bert.encoder.layer.0.attention.self.key.bias                 (1024,)
bert.encoder.layer.0.attention.self.value.weight        (1024, 1024)
bert.encoder.layer.0.attention.self.value.bias               (1024,)
bert.encoder.layer.0.attention.output.dense.weight      (1024, 1024)
bert.encoder.layer.0.attention.output.dense.bias             (

## BERT: Initialize training features

In [None]:
from transformers import get_linear_schedule_with_warmup

# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# I believe the 'W' stands for 'Weight Decay fix'.
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
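
To make the schedule concrete, the sketch below reproduces the step arithmetic and the shape of a linear schedule with warmup in plain Python. `linear_schedule_lr` is a hypothetical stand-in written for illustration; it assumes the learning rate rises linearly during warmup and then decays linearly to zero, which is the behaviour the scheduler's name describes. The sample count of 6,300 comes from the split above.

```python
import math

def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = math.ceil(6300 / 32)  # 197 batches, the last one partial
total_steps = batches_per_epoch * 4       # 788 optimizer steps over 4 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

With zero warmup steps the rate starts at the full 2e-5, reaches roughly half of that midway through training, and hits zero on the final step.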

## BERT: Functions for tracking fine-tuning of model

In [None]:
import numpy as np
import time
import datetime


def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
    
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

## BERT: Fine-tuning

In [None]:
import random
import numpy as np

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

training_stats = []
total_t0 = time.time()

for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    for step, batch in enumerate(train_dataloader):
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        loss, logits = model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)[:2]

        total_train_loss += loss.item()
        loss.backward()

        # Clip the norm of the gradients to 1.0 to  prevent "exploding gradients"
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)   
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        with torch.no_grad():        
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)[:2]
            
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    # Report the final accuracy, loss, and time
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    197.    Elapsed: 0:00:49.
  Batch    80  of    197.    Elapsed: 0:01:38.
  Batch   120  of    197.    Elapsed: 0:02:28.
  Batch   160  of    197.    Elapsed: 0:03:17.

  Average training loss: 0.33
  Training epoch took: 0:04:02

Running Validation...
  Accuracy: 0.93
  Validation Loss: 0.19
  Validation took: 0:00:09

Training...
  Batch    40  of    197.    Elapsed: 0:00:49.
  Batch    80  of    197.    Elapsed: 0:01:38.
  Batch   120  of    197.    Elapsed: 0:02:27.
  Batch   160  of    197.    Elapsed: 0:03:16.

  Average training loss: 0.13
  Training epoch took: 0:04:01

Running Validation...
  Accuracy: 0.95
  Validation Loss: 0.16
  Validation took: 0:00:09

Training...
  Batch    40  of    197.    Elapsed: 0:00:49.
  Batch    80  of    197.    Elapsed: 0:01:38.
  Batch   120  of    197.    Elapsed: 0:02:27.
  Batch   160  of    197.    Elapsed: 0:03:16.

  Average training loss: 0.06
  Training epoch took: 0:04:01

Running Validation...
  Accu
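
The training loop clips the gradient norm to 1.0 before each optimizer step. As an illustrative sketch of that idea on plain Python lists, the hypothetical helper `clip_by_global_norm` below computes one norm across all gradient vectors and rescales them only when that norm exceeds the limit. The small `eps` term is an assumption added for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients so that their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients that are already small are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]])
print(norm_before)  # 5.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.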

## BERT: Evaluate and compute F1-score for valid corpora

In [None]:
import pandas as pd
import sklearn

# Load the dataset into a pandas dataframe.
df = pd.DataFrame({0: clean_valid_corpora, 1: valid_labels})

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(df.shape[0]))

# Create sentence and label lists
tweets = df[0].values
labels = df[1].values

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for tweet in tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 64,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Create the DataLoader.
prediction_data = TensorDataset(input_ids, attention_masks, labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

model.eval()
predictions , true_labels = [], []

for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

flat_true_labels = np.concatenate(true_labels, axis=0)
print(len(flat_predictions))
print('epochs:', epochs, 'batch size', batch_size)
print('f1-score', sklearn.metrics.f1_score(flat_true_labels, flat_predictions))

Number of test sentences: 1,000





1000
epochs: 4 batch size 32
f1-score 0.8701973001038421
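
For reference, the F1-score reported above is the harmonic mean of precision and recall on the INFORMATIVE class. The sketch below computes it from confusion counts. `f1_from_counts` is a hypothetical helper, and the counts in the example are made up for illustration rather than taken from this run.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: precision = recall = 0.8, so F1 is also about 0.8.
print(round(f1_from_counts(tp=8, fp=2, fn=2), 4))  # 0.8
```

The zero checks keep the function defined when a model predicts no positives at all, a case in which a direct division would fail.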


## BERT: Evaluate test corpora

In [None]:
import pandas as pd
import sklearn

# Load the dataset into a pandas dataframe.
df = pd.DataFrame({0: clean_test_corpora})
# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(df.shape[0]))

# Create sentence and label lists
tweets = df[0].values

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for tweet in tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 128,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

# Create the DataLoader.
prediction_data = TensorDataset(input_ids, attention_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

model.eval()
predictions , true_labels = [], []

for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask = batch
  
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]
  # Move logits to CPU (the test set has no labels)
  logits = logits.detach().cpu().numpy()
  # Store predictions
  predictions.append(logits)

flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()


NameError: ignored

## BERT: Generate predictions.txt

In [None]:
print(len(flat_predictions))
# Write one label per line; the context manager closes (and flushes) the file.
with open("./predictions.txt", "w") as prediction_file:
  for pred in flat_predictions:
    if pred == 1:
      result = "INFORMATIVE\n"
    else:
      result = "UNINFORMATIVE\n"
    prediction_file.write(result)

NameError: ignored
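
The two cells above turn per-class logits into label strings in two steps: an argmax over each row, then a mapping from class index to label. The sketch below shows both steps on plain Python lists. `logits_to_labels` is a hypothetical helper, the logit values are invented for illustration, and the index-to-label order follows the encoding used when the data was loaded (1 for INFORMATIVE, 0 otherwise).

```python
LABELS = ["UNINFORMATIVE", "INFORMATIVE"]  # index 0, index 1

def logits_to_labels(logits):
    """Pick the highest-scoring class in each row and map it to its label string."""
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in logits]

print(logits_to_labels([[2.1, -1.3], [-0.4, 0.9]]))
# ['UNINFORMATIVE', 'INFORMATIVE']
```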

## BERT+SVM: Generate features for SVM with fine-tuned BERT model

In [None]:
import pandas as pd
import sklearn
from sklearn import svm

# Load the dataset into a pandas dataframe.
train_df = pd.DataFrame({0: clean_train_corpora, 1: train_labels})
valid_df = pd.DataFrame({0: clean_valid_corpora, 1: valid_labels})

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(valid_df.shape[0]))

# Create sentence and label lists
train_tweets = train_df[0].values
train_labels = train_df[1].values
valid_tweets = valid_df[0].values
valid_labels = valid_df[1].values

# Tokenize all of the sentences and map the tokens to their word IDs.
train_input_ids = []
train_attention_masks = []
valid_input_ids = []
valid_attention_masks = []

# For every sentence...
print("Tokenizing")
for tweet in train_tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 64,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    train_input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    train_attention_masks.append(encoded_dict['attention_mask'])
for tweet in valid_tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 64,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    valid_input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    valid_attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
train_labels = torch.tensor(train_labels)
valid_input_ids = torch.cat(valid_input_ids, dim=0)
valid_attention_masks = torch.cat(valid_attention_masks, dim=0)
valid_labels = torch.tensor(valid_labels)


# Create the DataLoader.
train_prediction_data = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_prediction_sampler = SequentialSampler(train_prediction_data)
train_prediction_dataloader = DataLoader(train_prediction_data, sampler=train_prediction_sampler, batch_size=batch_size)
valid_prediction_data = TensorDataset(valid_input_ids, valid_attention_masks, valid_labels)
valid_prediction_sampler = SequentialSampler(valid_prediction_data)
valid_prediction_dataloader = DataLoader(valid_prediction_data, sampler=valid_prediction_sampler, batch_size=batch_size)

model.eval()
train_predictions , train_true_labels , train_last_layers = [], [], []
valid_predictions , valid_true_labels , valid_last_layers = [], [], []

print("running model - train")
for batch in train_prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  # outputs[1] is the tuple of per-layer hidden states; [-1] selects the last layer
  last_layer = outputs[1][-1]

  # Move labels and hidden states to CPU
  label_ids = b_labels.to('cpu').numpy()
  last_layer = last_layer.detach().cpu().numpy()

  # Store true labels and last-layer hidden states
  train_true_labels.append(label_ids)
  train_last_layers.append(last_layer)
train_flat_last_layers = np.concatenate(train_last_layers, axis=0)

print("running model - valid")
for batch in valid_prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]
  last_layer = outputs[1][-1]

  # Move logits, labels, and hidden states to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  last_layer = last_layer.detach().cpu().numpy()

  # Store predictions, true labels, and last-layer hidden states
  valid_predictions.append(logits)
  valid_true_labels.append(label_ids)
  valid_last_layers.append(last_layer)
valid_flat_predictions = np.concatenate(valid_predictions, axis=0)
valid_flat_last_layers = np.concatenate(valid_last_layers, axis=0)


Number of test sentences: 1,000

Tokenizing




running model - train
running model - valid
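
The tokenization loops above pad or truncate every tweet to 64 token IDs and build an attention mask that separates real tokens from padding. Here is a toy sketch of that idea in plain Python. The helper `pad_and_mask` and the token IDs are hypothetical, and the truncation is simplified: it just cuts the list, whereas the real tokenizer also keeps the special tokens in place.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token IDs to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])   # simplified truncation
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position the model should ignore.
    return ids + [pad_id] * padding, mask + [0] * padding


# Illustrative IDs only; the real values come from the tokenizer vocabulary.
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```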


BERT+SVM: train SVM and make predictions on valid corpora

In [None]:
from sklearn.metrics import f1_score

train_vec = []
valid_vec = []
# Position 0 of each sequence is the [CLS] token; print its embedding and size.
print(train_flat_last_layers[0][0])
print(len(train_flat_last_layers[0][0]))

# Concatenate each tweet's [CLS] embedding with its Tweet-specific features.
for i in range(len(train_svm_vectors)):
  temp = train_flat_last_layers[i][0].tolist()
  temp.extend(train_svm_vectors[i])
  train_vec.append(temp)
for i in range(len(valid_svm_vectors)):
  temp = valid_flat_last_layers[i][0].tolist()
  temp.extend(valid_svm_vectors[i])
  valid_vec.append(temp)

# Train the SVM on the combined vectors and score it on the validation set.
clf = svm.SVC()
clf.fit(train_vec, train_labels)
svm_predictions = clf.predict(valid_vec)

print('f1-score', f1_score(valid_labels, svm_predictions))

[-0.93971133  0.36854303  0.42245233 ...  0.05302332  1.7739592
 -0.5768178 ]
1024
f1-score 0.8713080168776373
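
The F1-score reported above is the harmonic mean of precision and recall on the positive class. Here is a small hand-rolled sketch that mirrors scikit-learn's default binary setting (positive label 1). The function name `binary_f1` and the toy labels are assumptions made for illustration.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for one positive class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(binary_f1(y_true, y_pred))  # 0.75
```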


BERT+SVM: generate features for SVM with fine-tuned BERT model (test corpora)


In [None]:
import pandas as pd
import sklearn
from sklearn import svm

# Load the dataset into a pandas dataframe.
train_df = pd.DataFrame({0: clean_train_corpora, 1: train_labels})
# valid_df = pd.DataFrame({0: clean_valid_corpora, 1: valid_labels})
test_df = pd.DataFrame({0: clean_test_corpora})

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(test_df.shape[0]))

# Create sentence and label lists
train_tweets = train_df[0].values
train_labels = train_df[1].values
# valid_tweets = valid_df[0].values
# valid_labels = valid_df[1].values
test_tweets = test_df[0].values

# Tokenize all of the sentences and map the tokens to their word IDs.
train_input_ids = []
train_attention_masks = []
# valid_input_ids = []
# valid_attention_masks = []
test_input_ids = []
test_attention_masks = []

# For every sentence...
print("Tokenizing")
for tweet in train_tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 64,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    train_input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    train_attention_masks.append(encoded_dict['attention_mask'])
# for tweet in valid_tweets:
#     encoded_dict = tokenizer.encode_plus(
#                         tweet,                      # Sentence to encode.
#                         add_special_tokens = True, # Add '[CLS]' and '[SEP]'
#                         truncation = True,
#                         max_length = 64,           # Pad & truncate all sentences.
#                         pad_to_max_length = True,
#                         return_attention_mask = True,   # Construct attn. masks.
#                         return_tensors = 'pt',     # Return pytorch tensors.
#                    )
    
#     # Add the encoded sentence to the list.    
#     valid_input_ids.append(encoded_dict['input_ids'])
#     # And its attention mask (simply differentiates padding from non-padding).
#     valid_attention_masks.append(encoded_dict['attention_mask'])
for tweet in test_tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation = True,
                        max_length = 64,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    test_input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    test_attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
train_labels = torch.tensor(train_labels)
# valid_input_ids = torch.cat(valid_input_ids, dim=0)
# valid_attention_masks = torch.cat(valid_attention_masks, dim=0)
# valid_labels = torch.tensor(valid_labels)
test_input_ids = torch.cat(test_input_ids, dim=0)
test_attention_masks = torch.cat(test_attention_masks, dim=0)

# Create the DataLoader.
train_prediction_data = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_prediction_sampler = SequentialSampler(train_prediction_data)
train_prediction_dataloader = DataLoader(train_prediction_data, sampler=train_prediction_sampler, batch_size=batch_size)
# valid_prediction_data = TensorDataset(valid_input_ids, valid_attention_masks, valid_labels)
# valid_prediction_sampler = SequentialSampler(valid_prediction_data)
# valid_prediction_dataloader = DataLoader(valid_prediction_data, sampler=valid_prediction_sampler, batch_size=batch_size)
test_prediction_data = TensorDataset(test_input_ids, test_attention_masks)
test_prediction_sampler = SequentialSampler(test_prediction_data)
test_prediction_dataloader = DataLoader(test_prediction_data, sampler=test_prediction_sampler, batch_size=batch_size)

model.eval()
train_predictions , train_true_labels , train_last_layers = [], [], []
valid_predictions , valid_true_labels , valid_last_layers = [], [], []
test_predictions , test_last_layers = [], []

print("running model - train")
for batch in train_prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  # outputs[1] is the tuple of per-layer hidden states; [-1] selects the last layer
  last_layer = outputs[1][-1]

  # Move labels and hidden states to CPU
  label_ids = b_labels.to('cpu').numpy()
  last_layer = last_layer.detach().cpu().numpy()

  # Store true labels and last-layer hidden states
  train_true_labels.append(label_ids)
  train_last_layers.append(last_layer)
train_flat_last_layers = np.concatenate(train_last_layers, axis=0)

# print("running model - valid")
# for batch in valid_prediction_dataloader:
#   # Add batch to GPU
#   batch = tuple(t.to(device) for t in batch)
#   # Unpack the inputs from our dataloader
#   b_input_ids, b_input_mask, b_labels = batch
#   with torch.no_grad():
#       # Forward pass, calculate logit predictions
#       outputs = model(b_input_ids, token_type_ids=None, 
#                       attention_mask=b_input_mask)
#   logits = outputs[0]
#   last_layer = outputs[1][-1]

#   # Move logits and labels to CPU
#   logits = logits.detach().cpu().numpy()
#   label_ids = b_labels.to('cpu').numpy()
#   last_layer = last_layer.detach().cpu().numpy()

#   # Store predictions and true labels
#   valid_predictions.append(logits)
#   valid_true_labels.append(label_ids)
#   valid_last_layers.append(last_layer)
# valid_flat_predictions = np.concatenate(valid_predictions, axis=0)
# valid_flat_last_layers = np.concatenate(valid_last_layers, axis=0)

print("running model - test")
for batch in test_prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask = batch
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
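  # outputs[1] is the tuple of per-layer hidden states; [-1] selects the last layer
  last_layer = outputs[1][-1]

  # Move hidden states to CPU
  last_layer = last_layer.detach().cpu().numpy()

  # Store last-layer hidden states (the test set has no labels)
  test_last_layers.append(last_layer)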
test_flat_last_layers = np.concatenate(test_last_layers, axis=0)


BERT+SVM: train SVM and make predictions on test corpora





In [None]:
train_vec = []
test_vec = []
# Position 0 of each sequence is the [CLS] token; print its embedding and size.
print(train_flat_last_layers[0][0])
print(len(train_flat_last_layers[0][0]))

# Concatenate each tweet's [CLS] embedding with its Tweet-specific features.
for i in range(len(train_svm_vectors)):
  temp = train_flat_last_layers[i][0].tolist()
  temp.extend(train_svm_vectors[i])
  train_vec.append(temp)
for i in range(len(test_svm_vectors)):
  temp = test_flat_last_layers[i][0].tolist()
  temp.extend(test_svm_vectors[i])
  test_vec.append(temp)

# Train the SVM on the combined vectors and predict labels for the test set.
clf = svm.SVC()
clf.fit(train_vec, train_labels)
svm_predictions = clf.predict(test_vec)

[-0.743843    0.53942937 -0.5784221  ...  0.76193064  0.39196065
 -0.05459801]
1024
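
The loops in the cell above build each SVM input by appending a tweet's hand-crafted features to its 1024-dimensional [CLS] embedding. Here is a vectorized sketch of the same concatenation with NumPy. The random arrays stand in for the real hidden states and features, and the count of 3 features per tweet is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 4 tweets, 64 token positions, 1024-dim hidden
# states, plus 3 hand-crafted features per tweet.
hidden_states = rng.normal(size=(4, 64, 1024))
tweet_features = rng.normal(size=(4, 3))

# Position 0 holds the [CLS] token, so this slice is one vector per tweet.
cls_vectors = hidden_states[:, 0, :]

# Append the hand-crafted features to each [CLS] vector.
combined = np.hstack([cls_vectors, tweet_features])
print(combined.shape)  # (4, 1027)
```

The result has the same layout as the list-based loop: the first 1024 columns are the embedding and the remaining columns are the extra features.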


BERT+SVM: output predictions for test corpora

In [None]:
# Write one label per line; the context manager closes the file when done.
with open("./predictions.txt", "w+") as prediction_file:
  for pred in svm_predictions:
    result = "INFORMATIVE\n" if pred == 1 else "UNINFORMATIVE\n"
    prediction_file.write(result)

In [None]:
# Compare the new predictions with the older predictions file, line by line.
diff = 0
with open("./predictions.txt") as new, open("./predictions 2.txt") as old:
  for i, (line, o_line) in enumerate(zip(new, old)):
    if i == 0:
      print(line, o_line)
    if o_line != line:
      diff += 1
print("total: ", diff)

INFORMATIVE
 INFORMATIVE

total:  372
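
The same comparison can be run on label lists already in memory. This is a minimal sketch: the helper `count_differences` and the sample labels are hypothetical, and the length check is an added safeguard rather than part of the original cell.

```python
def count_differences(new_labels, old_labels):
    """Count positions where two equally long label lists disagree."""
    if len(new_labels) != len(old_labels):
        raise ValueError("prediction lists differ in length")
    return sum(1 for new, old in zip(new_labels, old_labels) if new != old)


# Toy labels for illustration only.
new_labels = ["INFORMATIVE", "UNINFORMATIVE", "INFORMATIVE"]
old_labels = ["INFORMATIVE", "INFORMATIVE", "INFORMATIVE"]
print(count_differences(new_labels, old_labels))  # 1
```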


Referenced code:

Title: BERT Fine-Tuning Sentence Classification

Author: Chris McCormick and Nick Ryan

Date: March 20, 2020

Code version: 3.0

Availability: https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX?authuser=1#scrollTo=8o-VEBobKwHk