## Text classification with the *Longformer*



In a previous [post](https://jesusleal.io/2020/10/20/RoBERTA-Text-Classification/) I explored how to use the Hugging Face Transformers *Trainer* class to easily create a text classification pipeline. The code was pretty straightforward to implement, and I was able to obtain results that put the basic model at a very competitive level with a few lines of code. In that post I also briefly discussed one of the main drawbacks of the first generation of Transformers and BERT-based architectures: the sequence length is limited to a maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the input sequence length, *O(n^2)*. Given the need to process longer sequences of text, a second generation of attention-based models has been proposed. The idea behind these models is to reduce the memory footprint of the attention mechanism in order to process longer sequences of text; see this really useful analysis of transformer models that try to overcome this limitation [by researchers from Google](https://arxiv.org/pdf/2009.06732.pdf).

New models such as the [***Reformer***](https://arxiv.org/pdf/2001.04451.pdf) by Google propose a series of innovations to the traditional Transformer architecture, such as Local Self-Attention, Locality-Sensitive Hashing (LSH) Self-Attention, Chunked Feed Forward Layers, etc. (see this [post](https://huggingface.co/blog/reformer) from Hugging Face for a detailed discussion). This model can process sequences of half a million tokens with as little as 8GB of RAM. However, one big drawback of the model for downstream applications is that the authors have not released pretrained weights, and at the time of publication of this post there is no freely available model pretrained on a large corpus.
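To make the quadratic scaling of self-attention concrete, here is a minimal sketch that only counts the query-key scores of one full attention matrix. The function name `attention_entries` is made up for this illustration, and the count ignores heads, layers and batch size.

```python
def attention_entries(seq_len: int) -> int:
    """Number of query-key scores in one full self-attention matrix (n x n)."""
    return seq_len * seq_len


for n in (512, 1024, 4096):
    ratio = attention_entries(n) // attention_entries(512)
    print(f"{n:>5} tokens -> {attention_entries(n):>10,} scores ({ratio}x the 512-token cost)")
```

Going from 512 to 4,096 tokens multiplies the sequence length by 8 but the number of attention scores by 64, which is why simply raising the length limit of a BERT-like model quickly runs out of memory.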

Another very promising model, and the subject of this post, is the [***Longformer***](https://arxiv.org/pdf/2004.05150.pdf) by researchers from the Allen Institute for AI. The Longformer allows the processing of sequences of thousands of tokens without facing the memory bottleneck of BERT-like architectures, and it achieved SOTA at the time of publication on several benchmarks. The authors use a new variation of attention, called local (sliding window) attention, where every token only attends to tokens in its vicinity, defined by a window *w*: each token attends to $\frac{1}{2}\ w$ tokens to the left and to the right. To increase the receptive field, the authors also applied dilation to the local window, so the window reaches farther without additional memory cost. A dilation is simply a "hole" in the window: the attending token skips positions, which lets it reach tokens that are farther away. The performance is not hurt, since the transformer architecture has multiple attention heads across multiple layers, and the different layers and heads learn and attend to different properties of texts and tokens. In addition to the local attention, the authors also included a token that is attended globally so it can be used in downstream tasks, just like the *CLS* token of BERT. One of the interesting aspects of this model is that the authors created their own CUDA kernel to calculate the attention scores of the sliding window attention. This type of attention is more efficient since there are many zeros in the matrix; the operation is called banded matrix multiplication, but it is not implemented in PyTorch/TensorFlow. Thanks to our friends from Hugging Face, an implementation with standard CUDA kernels is available; although it does not have all the capabilities the authors of the Longformer model describe in their paper, it is suitable for finetuning on [downstream tasks](https://github.com/allenai/longformer).
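The attention pattern described above can be sketched in a few lines of plain Python. This is only an illustration of which positions a token may attend to, not the Longformer implementation: the function name `attended_positions` and its parameters are assumptions of this sketch, `window` is taken to be the total window size *w*, and `dilation=1` is taken to mean "no holes".

```python
def attended_positions(i, seq_len, window, dilation=1, global_positions=(0,)):
    """Positions that token i attends to under a (dilated) sliding window.

    Assumptions of this sketch: `window` is the total window size w (w/2 tokens on
    each side), `dilation=1` means no gaps, and `global_positions` holds the tokens
    with global attention (for example a CLS-like token at position 0).
    """
    if i in global_positions:
        # a global token attends to every position in the sequence
        return list(range(seq_len))
    half = window // 2
    # step through the window, skipping `dilation - 1` positions between neighbours
    local = {i + k * dilation for k in range(-half, half + 1)}
    local = {j for j in local if 0 <= j < seq_len}
    # every token also attends to the global tokens
    return sorted(local | set(global_positions))


print(attended_positions(10, 32, window=4))              # [0, 8, 9, 10, 11, 12]
print(attended_positions(10, 32, window=4, dilation=2))  # [0, 6, 8, 10, 12, 14]
```

With the same window size, the dilated version reaches from position 6 to 14 instead of 8 to 12, while the number of attended positions stays the same.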


The authors first evaluated the architecture as an autoregressive character-level language model on sequences of thousands of tokens, achieving state of the art on *text8* and *enwik8*. They also tested the model on downstream tasks: starting from the weights of RoBERTa, they continued masked language model (MLM) pretraining on a corpus that included one third of the Realnews dataset and one third of the Stories dataset. The authors pretrained two variations of the model, a base model (12 layers) and a large model (24 layers, following RoBERTa-large). Both models were trained for 65K gradient updates with sequences of length 4,096 and batch size 64. Once the pretraining was completed, they tested the models on downstream tasks such as question answering, coreference resolution and document classification. The model achieved SOTA results on the WikiHop and TriviaQA datasets and on the Hyperpartisan news dataset. For the IMDB dataset the authors achieved 95.7 percent accuracy, a small increase over the 95.3 percent accuracy reported for RoBERTa.

Given all these nice features I decided to try the model and see how it compares to RoBERTa on IMDB, the "iris dataset" of text classification. For this script I used the Trainer class from Hugging Face and the pretrained model offered by Allen AI, available in the Hugging Face model hub.


## Setup


In [None]:
!pip install -q -U watermark
!pip install -qq transformers
!pip install tqdm
!pip install livelossplot --quiet
!pip install tweet-preprocessor
!pip install GPUtil
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers


In [None]:
#@title Setup & Config
import transformers
from transformers import logging
logging.set_verbosity_error()
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
from transformers import DistilBertTokenizer, DistilBertModel
from transformers import LongformerConfig, LongformerModel
from transformers import RobertaTokenizer
import torch
from sklearn.model_selection import train_test_split
import warnings


from GPUtil import showUtilization as gpu_usage
from numba import cuda


import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

import pickle
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
import sklearn
import keras
from tensorflow.keras.layers import Conv2D, BatchNormalization, GlobalAveragePooling2D, \
Dense, Input, Activation, MaxPool2D
from tensorflow.keras import Model

import re
import csv
# import preprocessor as p
import math
from torch.utils.data import TensorDataset, DataLoader

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from tqdm import tqdm

from tensorflow import summary
import datetime
from torch.utils.tensorboard import SummaryWriter
from torch.cuda.amp import autocast 
%load_ext tensorboard

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from livelossplot import PlotLosses

pd.options.display.max_colwidth = 1000
pd.set_option('display.expand_frame_repr', False)

import glob
import imageio

import random
seed = 0
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

%matplotlib inline
%config InlineBackend.figure_format='retina'
torch.set_printoptions(precision=3, sci_mode=False)

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

rcParams['figure.figsize'] = 12, 8



DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#DEVICE =torch.device('cpu')

In [None]:
# declare global settings
# a batch size of 16 still fits on the GPU for BERT; 32 has not been tested yet
batch_size = 16
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#DEVICE =torch.device('cpu')

In [None]:
from google.colab import drive
drive.mount('/content/drive')
PROJECT_PATH = 'drive/MyDrive/Colab\ Notebooks/application_project/personality-prediction'
%cd $PROJECT_PATH

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/application_project/personality-prediction


## Longformer

In [None]:
import pandas as pd
#import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
#import wandb
import os
import preprocessor as p

One of the cool features of this model is that you can specify the attention sliding window separately for each layer; the authors exploited this design for the autoregressive language model, using different window sizes for different layers. If this parameter is not changed, it defaults to 512 across all layers.
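As a rough illustration of what per-layer windows buy you, the Longformer paper notes that with a fixed window *w* the receptive field at the top of an *l*-layer model is *l × w*. The sketch below generalizes this to per-layer windows by summing them, which is an assumption of this example, and the increasing window sizes are hypothetical values picked for illustration, not the ones used by the authors.

```python
def receptive_field(windows):
    """Approximate top-layer receptive field for a list of per-layer window sizes.

    With a fixed window w this reduces to num_layers * w; summing per-layer
    windows is the simple generalization assumed in this sketch.
    """
    return sum(windows)


default_windows = [512] * 12  # the default: 512 in every one of the 12 layers
# hypothetical schedule: small windows in the lower layers, larger ones at the top
increasing_windows = [64, 64, 128, 128, 256, 256, 512, 512, 1024, 1024, 2048, 2048]

print(receptive_field(default_windows))     # 6144
print(receptive_field(increasing_windows))  # 8064
```

A per-layer list like these is the same shape as the 12-element `attention_window` entry that appears in the config of the loaded model further down.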

In [None]:
config = LongformerConfig()

config

LongformerConfig {
  "attention_probs_dropout_prob": 0.1,
  "attention_window": 512,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "sep_token_id": 2,
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}


In [None]:
def preprocess_text(sentence):
    # remove hyperlinks, hashtags, smileys, emojies
    sentence = p.clean(sentence)
    # Remove hyperlinks
    sentence = re.sub(r"http\S+", " ", sentence)
    # lower case
    sentence = sentence.lower()
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z.?!,]', ' ', sentence)
    # repeat punctuation
    sentence = re.sub(r'(\.)\1+', r'\1', sentence)
    sentence = re.sub(r'(!)\1+', r'\1', sentence)
    sentence = re.sub(r'(\?)\1+', r'\1', sentence) 
    # Single character removal (except "i"; the text is already lower-cased at this point)
    sentence = re.sub(r"\s+[a-hj-z]\s+", ' ', sentence)
    # Removing multiple spaces (any "|||" separators were already turned into spaces above)
    sentence = re.sub(r"\s+", " ", sentence)

    return sentence


def load_essays_df(datafile):
    # column order in the csv: user id, essay text, then a y/n flag for each trait
    traits = ["EXT", "NEU", "AGR", "CON", "OPN"]
    rows = []
    with open(datafile, "rt") as csvf:
        csvreader = csv.reader(csvf, delimiter=",", quotechar='"')
        next(csvreader, None)  # skip the header row
        for line in csvreader:
            row = {"user": line[0], "text": line[1], "token_len": 0}
            # traits are stored as "y"/"n"; convert them to 1/0
            for offset, trait in enumerate(traits, start=2):
                row[trait] = 1 if line[offset].lower() == "y" else 0
            rows.append(row)

    # build the frame once: DataFrame.append was removed in pandas 2.0
    # and appending row by row is quadratic in the number of rows
    df = pd.DataFrame(rows, columns=["user", "text", "token_len"] + traits)

    #for trait in traits:
    #    print(trait, ": ", df[trait].value_counts())

    return df

In [None]:
def get_inputs(token_length):
    datafile = "data/essays/essays.csv"
    df = load_essays_df(datafile)

    # preprocessing
    df['text'] = df['text'].apply(preprocess_text)
    # remove empty texts
    df = df.drop(df[df.text == ''].index)
    df = df.drop(df[df.text == ' '].index)

    tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = token_length)
    # Longformer encoding: pad or truncate every essay to exactly token_length tokens
    df['encoding'] = df['text'].apply(lambda text: tokenizer.encode_plus(
              text,
              padding = 'max_length', truncation=True, max_length = token_length
          ))

    df['input_ids'] = df['encoding'].apply(lambda x: np.asarray(x.input_ids))
    df['attention_mask'] = df['encoding'].apply(lambda x: np.asarray(x.attention_mask))
    df['author_ids'] = range(1, len(df) + 1)
    df['targets'] = df[["EXT", "NEU", "AGR", "CON", "OPN"]].values.tolist()

    input_ids = df.input_ids.to_numpy()
    attention_mask = df.attention_mask.to_numpy()
    author_ids = df.author_ids.to_numpy()
    targets_arr = df.targets.to_numpy()
    targets_arr = [np.asarray(x) for x in targets_arr]

    return author_ids, input_ids, attention_mask, targets_arr


In [None]:
'''
np.save('data/essays/input_ids.npy', df.input_ids.to_numpy())
np.save('data/essays/attention_mask.npy', df.attention_mask.to_numpy())
np.save('data/essays/author_ids.npy', df.author_ids.to_numpy())
np.save('data/essays/targets.npy', df.targets.to_numpy())


input_ids = np.load(open('data/essays/input_ids.npy', 'rb'), allow_pickle=True)
attention_mask = np.load(open('data/essays/attention_mask.npy', 'rb'), allow_pickle=True)
author_ids = np.load(open('data/essays/author_ids.npy', 'rb'), allow_pickle=True)
targets_arr = np.load(open('data/essays/targets.npy', 'rb'), allow_pickle=True)
targets_arr = [np.asarray(x) for x in targets_arr]
len(input_ids[0])
'''

In [None]:
# one target
class Bert_Dataset(Dataset):
    def __init__(self, author_ids, input_ids, attention_masks, targets, trait_idx):
        input_ids = [np.asarray(x) for x in input_ids]
        attention_masks = [np.asarray(x) for x in attention_masks]
        self.author_ids = torch.from_numpy(np.array(author_ids))
        self.input_ids = torch.from_numpy(np.array(input_ids))
        self.attention_masks = torch.from_numpy(np.array(attention_masks))
        #targets = targets['EXT']
        #one_hot_encoding = tf.keras.utils.to_categorical(targets.to_numpy(), num_classes=2)
        #self.targets = torch.from_numpy(one_hot_encoding).float()
        #print(targets)
        self.targets = torch.from_numpy(targets.to_numpy())[:,trait_idx]

        self.global_attention_mask = torch.zeros_like(self.input_ids)
        # global attention on cls token
        self.global_attention_mask[:, 0] = 1
        #print(f'input_ids: {self.input_ids.size()}')
        #print(f'attention_mask: {self.attention_masks.size()}')
        #print(f'targets: {self.targets.size()}')
        

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        #return (self.author_ids[idx], self.input_ids[idx].to(DEVICE), self.attention_masks[idx].to(DEVICE), self.targets[idx].to(DEVICE))
        return {'input_ids': self.input_ids[idx], 'attention_mask': self.attention_masks[idx], 'global_attention_mask': self.global_attention_mask[idx], 'label': self.targets[idx]}

In [None]:
def get_dataloader(trait_idx, token_length):

    author_ids, input_ids, attention_mask, targets_arr = get_inputs(token_length)

    tokenized_df = pd.DataFrame(list(zip(author_ids, input_ids, attention_mask)),
                  columns =['author_ids', 'input_ids', 'attention_masks']).apply(np.asarray)
    target_df = pd.DataFrame(targets_arr, columns = ["EXT", "NEU", "AGR", "CON", "OPN"])

    df_inputs_train, df_inputs_test, df_targets_train, df_targets_test = train_test_split(tokenized_df, target_df, test_size=0.1, stratify=target_df)

    # testing
    test_on = 10
    inputs = df_inputs_train.iloc[:test_on,]
    targets = df_targets_train.iloc[:test_on,]

    inputs_train, inputs_val, targets_train, targets_val = train_test_split(inputs, targets, test_size=0.5)
    #auth = inputs_train['author_ids']
    #tar = targets_train
    #print(f'author_ids: \n{auth}\ntargets: \n{tar}')

    # dataloader
    train_dataset_small = Bert_Dataset(inputs_train['author_ids'].to_numpy(), inputs_train['input_ids'].to_numpy(), inputs_train['attention_masks'].to_numpy(), targets_train, trait_idx)
    val_dataset_small = Bert_Dataset(inputs_val['author_ids'].to_numpy(), inputs_val['input_ids'].to_numpy(), inputs_val['attention_masks'].to_numpy(), targets_val, trait_idx)
    train_dataloader_small = DataLoader(train_dataset_small, batch_size = batch_size, shuffle = False)
    val_dataloader_small = DataLoader(val_dataset_small, batch_size = batch_size)


    # normal
    inputs = df_inputs_train#.iloc[:test_on,]
    targets = df_targets_train#.iloc[:test_on,]

    inputs_train, inputs_val, targets_train, targets_val = train_test_split(inputs, targets, test_size=0.15, stratify=targets)

    # dataloader
    train_dataset = Bert_Dataset(inputs_train['author_ids'].to_numpy(), inputs_train['input_ids'].to_numpy(), inputs_train['attention_masks'].to_numpy(), targets_train, trait_idx)
    val_dataset = Bert_Dataset(inputs_val['author_ids'].to_numpy(), inputs_val['input_ids'].to_numpy(), inputs_val['attention_masks'].to_numpy(), targets_val, trait_idx)
    train_dataloader = DataLoader(train_dataset, batch_size = batch_size, shuffle = True)
    val_dataloader = DataLoader(val_dataset, batch_size = batch_size)

    return train_dataset, val_dataset, train_dataset_small, val_dataset_small
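The sizes of the two nested splits above can be checked with a little arithmetic. This sketch assumes scikit-learn's rounding rule for a fractional `test_size` (the test part is rounded up, the rest goes to training); `split_sizes` is a helper name invented for this example. The 2,220 essays used as input are simply the 1,887 training plus 333 validation examples reported in the training log further down.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test part is rounded up and the remainder is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# second split in get_dataloader: 15 percent of the remaining essays go to validation
print(split_sizes(2220, 0.15))  # (1887, 333)

# the small debugging subset: 10 essays split in half
print(split_sizes(10, 0.5))     # (5, 5)
```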


In [None]:
#dic = iter(train_dataloader_small).next()
#dic['label'].dtype == torch.long

In [None]:
'''
train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'], 
                                             cache_dir='/media/data_files/github/website_tutorials/data')
'''

In [None]:
'''
train_data
#dir(train_data)
'''


For my implementation of the model, and to save time during training, I chose a maximum length of 1024 tokens, which covers close to 98 percent of all the documents in the dataset. Before using my brand new and still pretty much impossible to find RTX 3090, I set the gradient checkpointing parameter to true. This saves a huge amount of memory and allows models such as the Longformer to train on more modest GPUs such as my old EVGA GeForce GTX 1080. Gradient checkpointing trades compute for memory: instead of keeping every intermediate activation for the backward pass, the network stores only a subset of them and recomputes the rest when they are needed, which allows massive models to run on more modest hardware at the cost of roughly a 30 percent increase in training time. The original paper discussing gradient checkpointing can be found [here](https://arxiv.org/pdf/1604.06174.pdf) and a nice [discussion of gradient checkpointing can be found here](https://qywu.github.io/2019/05/22/explore-gradient-checkpointing.html).
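To get a feel for why gradient checkpointing saves memory, here is a deliberately simplified cost model in the spirit of the paper linked above. It is an assumption of this sketch, not a measurement: one activation is kept per checkpointed segment, plus the activations of the single segment being recomputed during the backward pass. The function name `stored_activations` is made up for the example.

```python
import math


def stored_activations(num_layers, segment_size):
    """Activations held in memory when checkpointing every `segment_size` layers.

    Simplified model: one checkpoint per segment, plus the activations of the
    one segment that is currently being recomputed.
    """
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size


num_layers = 12
best = min(range(1, num_layers + 1), key=lambda k: stored_activations(num_layers, k))
print(f"no checkpointing: {num_layers} activations kept")
print(f"checkpoint every {best} layers: {stored_activations(num_layers, best)} activations kept")
```

Under this model the best segment size is close to the square root of the number of layers, which is where the sublinear memory cost discussed in the paper comes from; the price is the extra forward computation for each recomputed segment.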

Additionally, to save memory and reduce training time, I also used mixed precision training. If you want to learn more about mixed precision I recommend this [blogpost](https://jonathan-hui.medium.com/mixed-precision-in-deep-learning-67f6dce3e0f3). With the combination of mixed precision, gradient accumulation and gradient checkpointing you can set the length to 4096.
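One detail of mixed precision that is easy to miss is why the loss is scaled: very small gradients vanish in half precision. The sketch below shows this with the standard library only, by rounding Python floats to IEEE half precision through `struct`. The helper `to_fp16`, the gradient value and the scale factor of 1024 are all illustrative assumptions, not values taken from the training run.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_gradient = 1e-8
loss_scale = 1024.0  # hypothetical scale factor, chosen only for illustration

# stored directly in half precision, the gradient underflows to zero
print(to_fp16(tiny_gradient))  # 0.0

# scaled up first, it survives the trip through half precision ...
scaled = to_fp16(tiny_gradient * loss_scale)
# ... and can be unscaled again in full precision, landing close to 1e-8
print(scaled / loss_scale)
```

This is the idea behind the loss scaling used in mixed precision training: the loss is multiplied by a factor before the backward pass so that small gradients stay representable in fp16, and the gradients are divided by the same factor before the optimizer step.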

In [None]:
# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                           gradient_checkpointing=False,
                                                           attention_window = 512)
#tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = 1024)


In [None]:
model.config

LongformerConfig {
  "_name_or_path": "allenai/longformer-base-4096",
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "sep_token_id": 2,
  "transformers_version": "4.21.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

In [None]:
'''
# define a function that will tokenize the model, and will return the relevant inputs for the model
def tokenization(batched_text):
    return tokenizer(batched_text['text'], padding = 'max_length', truncation=True, max_length = 1024)

train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))
'''

In [None]:
'''
# we make sure our truncation strateging and the padding are set to the maximung length
len(train_data['input_ids'][0])
'''

Once the tokenization process is finished we can set the column names and types. One thing that is important to note is that the `LongformerForSequenceClassification` implementation by default sets the global attention on the `CLS` [token](https://huggingface.co/transformers/_modules/transformers/modeling_longformer.html#LongformerForSequenceClassification), so there is no need to further modify the inputs.

In [None]:
'''
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
'''

## Training

In [None]:
# define accuracy metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    #print(f'labels: {labels}')
    #print(f'preds: {preds}')
    # argmax(pred.predictions, axis=1)
    #pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    #print(f'acc: {acc}')
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
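To help read the metric columns in the training logs further down, here is a small sketch that computes the same four numbers from confusion-matrix counts. The helper `binary_metrics` is written for this example. The counts are not logged anywhere: they are reconstructed so that they are consistent with the first EXT epoch (333 validation essays, recall 1.0), so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# counts reconstructed from the first EXT epoch (an assumption, not logged values):
# 331 of the 333 essays are predicted positive, so no positive essay is missed
metrics = binary_metrics(tp=172, fp=159, fn=0, tn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

A recall of 1.0 combined with an accuracy barely above 50 percent means the model is predicting the positive class for almost every essay, so the F1 of about 0.68 in that row says little about what the model has learned.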

In the paper the authors trained for 15 epochs, with a batch size of 32, a learning rate of 3e-5 and linear warmup steps equal to 0.1 of the total training steps. For this quick tutorial I went for the default learning rate of the Trainer class, which is 5e-5, 5 epochs of training, a batch size of 8 with gradient accumulation of 8 steps for an effective batch size of 64, and 200 warmup steps (roughly 10 percent of total training steps). The overall training time for this implementation was 2 hours and 54 minutes.
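The relationship between batch size, gradient accumulation and the number of optimizer updates can be checked with a few lines of arithmetic. The sketch below plugs in the values from the training cell and log of this notebook (1,887 training examples, per-device batch size 2, 16 accumulation steps, 5 epochs). It assumes a single device and that leftover batches at the end of an epoch do not trigger an extra update; `optimization_steps` is a helper written for this example.

```python
import math


def optimization_steps(num_examples, per_device_batch, accumulation_steps, epochs):
    """Return (effective batch size, total optimizer updates) on a single device."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # one optimizer update per `accumulation_steps` batches
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return per_device_batch * accumulation_steps, updates_per_epoch * epochs


effective_batch, total_steps = optimization_steps(1887, 2, 16, 5)
print(effective_batch, total_steps)                        # 32 295
print(f"warmup share: {200 / total_steps:.0%} of all updates")  # 68%
```

Both numbers match the "Total train batch size" and "Total optimization steps" lines in the log. With `warmup_steps=200`, the learning rate is still warming up for about two thirds of this run, so on a dataset this small it is worth scaling the warmup down with the total number of updates.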

In [None]:

def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()                             

    torch.cuda.empty_cache()

    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)

    print("GPU Usage after emptying the cache")
    gpu_usage()

#free_gpu_cache()                           


In [None]:
trait_labels = ["EXT", "NEU", "AGR", "CON", "OPN"]
for trait_idx, trait in enumerate(trait_labels):
    #if trait_idx == 4:
     #   continue
    print(f'Trait: {trait}')
    train_dataset, val_dataset, train_dataset_small, val_dataset_small = get_dataloader(trait_idx = trait_idx, token_length = 4096)

    # define the training arguments
    training_args = TrainingArguments(
        output_dir = '/results_longformer',
        num_train_epochs = 5,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 16,    
        per_device_eval_batch_size= 2,
        evaluation_strategy = "epoch",
        disable_tqdm = False, 
        #load_best_model_at_end=True,
        warmup_steps=200,
        weight_decay=0.01,
        logging_steps = 1,
        fp16 = True,
        logging_dir='/logs_longformer',
        dataloader_num_workers = 0,
        run_name = 'longformer-classification_2_warm'
    )

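    # instantiate the trainer class and check for available devices
    # NOTE: `model` was loaded once, outside this loop, so each trait continues from the
    # weights fine-tuned on the previous trait; reload the model here to start every
    # trait from the pretrained checkpoint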
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    trainer.train()

Trait: EXT



Using cuda_amp half precision backend
***** Running training *****
  Num examples = 1887
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 16
  Total optimization steps = 295


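| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----|-----------|--------|
| 1 | 0.7031 | 0.691502 | 0.522523 | 0.683897 | 0.519637 | 1.0 |
| 2 | 0.7068 | 0.689798 | 0.555556 | 0.581921 | 0.565934 | 0.598837 |
| 3 | 0.688 | 0.686444 | 0.57958 | 0.539474 | 0.621212 | 0.476744 |
| 4 | 0.747 | 0.70751 | 0.57958 | 0.651741 | 0.569565 | 0.761628 |
| 5 | 0.3465 | 0.842962 | 0.573574 | 0.59887 | 0.582418 | 0.616279 |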


***** Running Evaluation *****
  Num examples = 333
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




Trait: NEU


loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/859f4633944e1b7e7fa301e72161388cd5903e36385d0ef2917256506bff64c3.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/af6fcabe2bf8cab6f77b20d94ba46a3dbf441ca0549e1f3c852c437b612f5224.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/93ab433997eab2709f7adf8fa46f21d4699497bf20768f3ffd25e2e73b9b93c2.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/allenai/longformer-base-4096/res

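| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----|-----------|--------|
| 1 | 0.6769 | 0.716781 | 0.486486 | 0.219178 | 0.461538 | 0.143713 |
| 2 | 0.6488 | 0.705923 | 0.495495 | 0.508772 | 0.497143 | 0.520958 |
| 3 | 0.5597 | 0.754877 | 0.471471 | 0.511111 | 0.476684 | 0.550898 |
| 4 | 0.5899 | 0.921062 | 0.486486 | 0.571429 | 0.491379 | 0.682635 |
| 5 | 0.1944 | 1.188633 | 0.486486 | 0.457143 | 0.486486 | 0.431138 |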


***** Running Evaluation *****
  Num examples = 333
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




Trait: AGR



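| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----|-----------|--------|
| 1 | 0.691 | 0.689569 | 0.54955 | 0.669604 | 0.548736 | 0.858757 |
| 2 | 0.6721 | 0.684312 | 0.60961 | 0.700461 | 0.59144 | 0.858757 |
| 3 | 0.7237 | 0.694002 | 0.573574 | 0.579882 | 0.608696 | 0.553672 |
| 4 | 0.4583 | 0.781133 | 0.516517 | 0.533333 | 0.547619 | 0.519774 |
| 5 | 0.3719 | 0.912569 | 0.525526 | 0.579787 | 0.547739 | 0.615819 |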


***** Running Evaluation *****
  Num examples = 333
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




Trait: CON


loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/859f4633944e1b7e7fa301e72161388cd5903e36385d0ef2917256506bff64c3.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/af6fcabe2bf8cab6f77b20d94ba46a3dbf441ca0549e1f3c852c437b612f5224.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/93ab433997eab2709f7adf8fa46f21d4699497bf20768f3ffd25e2e73b9b93c2.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/allenai/longformer-base-4096/res

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.682,0.69526,0.546547,0.494983,0.569231,0.43787
2,0.6401,0.71368,0.558559,0.672606,0.539286,0.893491
3,0.6623,0.781187,0.522523,0.589147,0.522936,0.674556
4,0.5987,0.851811,0.534535,0.619165,0.529412,0.745562
5,0.3023,1.203149,0.492492,0.470219,0.5,0.443787


***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




Trait: OPN


loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/859f4633944e1b7e7fa301e72161388cd5903e36385d0ef2917256506bff64c3.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/af6fcabe2bf8cab6f77b20d94ba46a3dbf441ca0549e1f3c852c437b612f5224.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/93ab433997eab2709f7adf8fa46f21d4699497bf20768f3ffd25e2e73b9b93c2.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
loading file https://huggingface.co/allenai/longformer-base-4096/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/allenai/longformer-base-4096/res

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6577,0.713682,0.504505,0.666667,0.510836,0.959302
2,0.6905,0.694453,0.540541,0.529231,0.562092,0.5
3,0.6781,0.703114,0.576577,0.59366,0.588571,0.598837
4,0.4853,0.792425,0.567568,0.595506,0.576087,0.616279
5,0.2666,0.977825,0.540541,0.553936,0.555556,0.552326


***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2
***** Running Evaluation *****
  Num examples = 333
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




In [None]:
# train the model
trainer.train()

***** Running training *****
  Num examples = 1887
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 295


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6855,0.687596,0.555556,0.6,0.560606,0.645349
2,0.5815,0.696063,0.528529,0.269767,0.674419,0.168605
3,0.6175,0.659777,0.612613,0.562712,0.674797,0.482558
4,0.6027,0.76471,0.576577,0.42915,0.706667,0.30814
5,0.3364,0.857054,0.576577,0.534653,0.618321,0.47093


***** Running Evaluation *****
  Num examples = 333
  Batch size = 4
***** Running Evaluation *****
  Num examples = 333
  Batch size = 4
***** Running Evaluation *****
  Num examples = 333
  Batch size = 4
***** Running Evaluation *****
  Num examples = 333
  Batch size = 4
***** Running Evaluation *****
  Num examples = 333
  Batch size = 4


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=295, training_loss=0.5946918003639933, metrics={'train_runtime': 2289.3531, 'train_samples_per_second': 4.121, 'train_steps_per_second': 0.129, 'total_flos': 6197394955898880.0, 'train_loss': 0.5946918003639933, 'epoch': 5.0})
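
The numbers in the training log above fit together in a simple way, and it can be useful to check them by hand. The sketch below reproduces the total batch size, the number of optimization steps and the throughput figures from the values printed in the log. It assumes a single device and that the last incomplete batch of each epoch is kept rather than dropped; `num_devices` and the other variable names are mine and are not part of the Trainer output.

```python
import math

# Values copied from the training log above
num_examples = 1887
per_device_batch_size = 4
gradient_accumulation_steps = 8
num_epochs = 5
train_runtime = 2289.3531  # seconds, from TrainOutput

# Assumption: training ran on a single device
num_devices = 1

# Examples that contribute to one optimizer update
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)

# Mini-batches per epoch (the last, smaller batch is assumed to be kept)
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))

# One optimizer update every `gradient_accumulation_steps` mini-batches
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimization_steps = updates_per_epoch * num_epochs

# Throughput figures reported in TrainOutput
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_optimization_steps / train_runtime

print(f"effective batch size: {effective_batch_size}")
print(f"optimization steps:   {total_optimization_steps}")
print(f"samples per second:   {samples_per_second:.3f}")
print(f"steps per second:     {steps_per_second:.3f}")
```

With a per-device batch of 4 and 8 accumulation steps, each update sees 32 examples, which is what lets the Longformer train on long sequences without holding a large batch in GPU memory at once.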

Once training is complete, we can evaluate the model's performance and confirm that the right model has been loaded.

In [None]:
trainer.evaluate()

{'eval_loss': 0.13697753846645355,
 'eval_accuracy': 0.9534,
 'eval_f1': 0.9535282619968887,
 'eval_precision': 0.9509109714376641,
 'eval_recall': 0.95616,
 'epoch': 4.9984}
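
As a reminder of how the accuracy, precision, recall and F1 values reported by the Trainer relate to each other, here is a minimal sketch in plain Python. The helper `binary_metrics` is hypothetical, and the confusion counts used in the example are not printed anywhere in the logs: I inferred them from the epoch 1 row of the training table above (333 evaluation examples), so treat them as an illustration rather than as reported results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts, inferred from the epoch 1 row of the table above
metrics = binary_metrics(tp=111, fp=87, fn=61, tn=74)

for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because F1 is the harmonic mean of precision and recall, a row with high precision but very low recall (such as epoch 2 in the table above) ends up with a low F1 even though its accuracy barely changes.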