Link to the notebook: https://tinyurl.com/y3l9vr67
Copy the notebook to your GDrive to edit.

# Sentiment Analysis With Transformers

In this lab, we will discuss the transformer model in the context of performing text classification. Transformers were first introduced in the paper "Attention is All You Need" ([Vaswani et al. 2017](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)), but the idea on which they are based, namely **attention**, goes back at least to 2015 ([Bahdanau et al. 2015](https://arxiv.org/pdf/1409.0473.pdf), [Luong et al. 2015](https://arxiv.org/pdf/1508.04025.pdf)). You will learn about attention in more detail in the lectures. The purpose of this lab is to demonstrate the basic structure and calculations involved in a transformer network by constructing one from scratch, followed by demonstrating how to easily fine-tune the latest pretrained language models on a downstream task.

# Download prerequisite packages

First we will install extra packages which aren't included in the default colab environment. In this case, we are installing the `bpemb` package, which will enable us to use word-pieces for tokenization, and the HuggingFace `transformers` library, which we will use to fine-tune a pretrained language model. 

The [HuggingFace transformers](https://github.com/huggingface/transformers) library provides an easy interface to using, fine-tuning, and training from scratch the latest transformer models which have been shown to perform extremely well on a wide range of tasks.

In [1]:
!pip install bpemb
!pip install transformers



Here we are just using some magic commands to make sure changes to external packages are automatically loaded and plots are displayed in the notebook.

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import torch
import random
import numpy as np
import pandas as pd

from functools import partial
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from typing import List, Tuple
from bpemb import BPEmb
from tqdm import tqdm_notebook as tqdm

## Reproducibility!

In [4]:
def enforce_reproducibility(seed=42):
    # Sets seed manually for both CPU and CUDA
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For atomic operations there is currently 
    # no simple way to enforce determinism, as
    # the order of parallel operations is not known.
    # CUDNN
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # System based
    random.seed(seed)
    np.random.seed(seed)

In [5]:
enforce_reproducibility()

# Upload the dataset 
We'll use the Stanford Sentiment Treebank dataset packaged with the [GLUE benchmark](https://gluebenchmark.com/tasks). We'll use the train a dev files as labels are not provided with the test data. 

The dataset consists of movie reviews from Rotten Tomatoes labelled for sentiment (0: negative, 1: positive).

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Read in the data

This is largely the same as the last lab

In [7]:
train_data = pd.read_csv('./train.tsv', sep='\t')
valid_data = pd.read_csv('./dev.tsv', sep='\t')
test_data = pd.read_csv('./test.tsv', sep='\t')

valid_data.head()

Unnamed: 0,sentence,label
0,it 's a charming and often affecting journey .,1
1,unflinchingly bleak and desperate,0
2,allows us to hope that nolan is poised to emba...,1
3,"the acting , costumes , music , cinematography...",1
4,"it 's slow -- very , very slow .",0


In [8]:
len(train_data), len(valid_data)

(67349, 872)

# Reading data into a model

A simple and common way that data is read in PyTorch is to use the two following classes: `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`. 

The `Dataset` class can be extended to read in and store the data you are using for your experiment. The only requirements are to implement the `__len__` and `__getitem__` methods. `__len__` simply returns the size of your dataset and `__getitem__` takes an index and returns that sample from your dataset, processed in whatever way is necessary to be input to your model.

The `DataLoader` class determines how to iterate through your `Dataset`, including how to shuffle and batch your data.



In [9]:
def text_to_batch_transformer_bpemb(text: List, tokenizer, max_seq_len: int = 512) -> Tuple[List, List]:
    """
    Creates a tokenized batch for input to a bilstm model
    :param text: A list of sentences to tokenize
    :param tokenizer: A tokenization function to use (i.e. fasttext)
    :return: Tokenized text as well as the length of the input sequence
    """
    # Some light preprocessing
    input_ids = [tokenizer.encode_ids_with_bos_eos(t)[:max_seq_len] for t in text]
    for ids in input_ids:
      ids[-1] = tokenizer.EOS

    masks = [[1] * len(i) for i in input_ids]

    return input_ids, masks

def collate_batch_transformer(pad_id, input_data: Tuple) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    input_ids = [i[0][0] for i in input_data]
    masks = [i[1][0] for i in input_data]
    labels = [i[2] for i in input_data]

    max_length = max([len(i) for i in input_ids])

    input_ids = [(i + [pad_id] * (max_length - len(i))) for i in input_ids]
    masks = [(m + [pad_id] * (max_length - len(m))) for m in masks]

    assert (all(len(i) == max_length for i in input_ids))
    assert (all(len(m) == max_length for m in masks))
    return torch.tensor(input_ids), torch.tensor(masks), torch.tensor(labels)

# This will load the dataset and process it lazily in the __getitem__ function
class ClassificationDatasetReader(Dataset):
  def __init__(self, df, tokenizer, max_seq_len=512):
    self.df = df
    self.tokenizer = tokenizer
    self.max_seq_len = max_seq_len

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    row = self.df.values[idx]
    # Calls the text_to_batch function
    input_ids,masks = text_to_batch_transformer_bpemb([row[0]], self.tokenizer, self.max_seq_len)
    label = row[1]#int(row[0] >= 0.5)
    return input_ids, masks, label

# Creating the model

Here we will create a transformer network from scratch. For an excellent explanation of each of the components of the transformer, see [this blog post](https://jalammar.github.io/illustrated-transformer/). The model will be divided into the following modules:

`TransformerEmbeddingLayer`: The initial layer with BPE pretrained embeddings.

`TransformerEncoderHead`: A single self-attention head which performs the attention calculations.

`TransformerEncoderLayer`: Concatenates the outputs of multiple self-attention heads and performs the feed-forward and layer normalization operations.

`TransformerEncoder`: Combines multiple transformer layers and outputs the final representations.

`TransformerClassifierHead`: Top level module which takes the output of the transformer encoder and applies a classifier to it. 

**Disclaimer**: PyTorch has built in modules for creating transformer networks; see [torch.nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)

In [10]:
class TransformerEmbeddingLayer(nn.Module):
  """
  The embedding layer, first layer of the transformer network
  """
  def __init__(
      self,
      num_embeddings: int,
      embedding_dim: int,
      n_positions: int,
      pretrained_embeddings: torch.tensor,
      padding_idx: int = 0
  ):
    # First thing is to call the superclass initializer
    super(TransformerEmbeddingLayer, self).__init__()

    # Create token embeddings
    self.token_embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=pretrained_embeddings.shape[0] - 1)
    
  def forward(self, inputs, attention_mask=None):

    # Get the token embeddings from the input IDs
    tok_embeds = self.token_embeddings(inputs)

    output = tok_embeds
    if attention_mask is not None:
      output *= attention_mask.unsqueeze(-1)
    
    return output

# Attention Head

## 1) Q, K, and V representations

![](https://jalammar.github.io/images/t/self-attention-matrix-calculation.png)

## 2) Self-Attention

![](https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)

Source: [Jay Alammar](https://jalammar.github.io/illustrated-transformer/)

In [11]:
class TransformerEncoderHead(nn.Module):
  """
  Creates a single transformer encoder head
  """
  def __init__(
      self,
      input_dim: int,
      d_head: int
  ):
    # First thing is to call the superclass initializer
    super(TransformerEncoderHead, self).__init__()

    self.d_head = d_head
    self.softmax = nn.Softmax(dim=-1)

    # This is for scaled dot product attention
    self.scale = nn.Parameter(torch.tensor(np.sqrt(d_head)), requires_grad=False)

    # Create the Q, K, and V matrices
    self.Q = nn.Parameter(nn.init.xavier_normal_(torch.empty(input_dim,d_head)))
    self.K = nn.Parameter(nn.init.xavier_normal_(torch.empty(input_dim,d_head)))
    self.V = nn.Parameter(nn.init.xavier_normal_(torch.empty(input_dim,d_head)))

  def forward(self, inputs, attention_mask=None):
    """
    inputs: b x sl x d
    """
    # Get head embeddings (gets b x sl x d_head)
    q = torch.matmul(inputs, self.Q.unsqueeze(0))
    k = torch.matmul(inputs, self.K.unsqueeze(0))
    v = torch.matmul(inputs, self.V.unsqueeze(0))

    # Outer product of q and k (b x sl x sl), scaled
    dot_product = torch.bmm(q, k.transpose(2,1))
    sdp = dot_product / np.sqrt(self.d_head)
    # Mask out anything that needs it
    if attention_mask is not None:
      # Get the mask in the shape needed
      mask = attention_mask.repeat(1,inputs.shape[1]).reshape(inputs.shape[0], inputs.shape[1], inputs.shape[1])
      sdp[mask == 0] = -100000

    # Attention is calculated using softmax
    attn = self.softmax(sdp)

    # Take attention over the values (b x sl x d_head)
    output = torch.bmm(attn, v) * attention_mask.unsqueeze(-1)

    return output


# Combining Heads and Producing Layer Representations

![](https://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png)

![](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)

Source: [Jay Alammar](https://jalammar.github.io/illustrated-transformer/)

In [12]:
class TransformerEncoderLayer(nn.Module):
  """
  Defines a single transformer encoder layer with a number of heads
  """
  def __init__(
      self,
      hidden_size: int,
      d_head: int,
      n_heads: int,
      dropout_prob: float = 0.1
  ):

    # First thing is to call the superclass initializer
    super(TransformerEncoderLayer, self).__init__()

    # Create the encoder heads
    self.encoder_heads = nn.ModuleList([TransformerEncoderHead(hidden_size, d_head) for j in range(n_heads)])
    # Matrix to turn encoder head size into hidden size
    self.Wo = nn.Parameter(nn.init.xavier_normal_(torch.empty(n_heads*d_head, hidden_size)))
    
    self.LayerNorm = nn.LayerNorm(hidden_size)
    self.dropout = nn.Dropout(dropout_prob)

    self.linear = nn.Linear(hidden_size, hidden_size)

    # Initialize
    self._init_weights()

  def _init_weights(self):
      all_params = list(self.linear.named_parameters())
      for n,p in all_params:
          if 'weight' in n:
              nn.init.xavier_normal_(p)
          elif 'bias' in n:
              nn.init.zeros_(p)

  def forward(self, inputs, attention_mask=None):
    # First, get all of the attention head outputs
    head_outs = [head(inputs, attention_mask) for head in self.encoder_heads]
    # Concatenate (b x sl x n_heads*d_head)
    head_outs = torch.cat(head_outs, dim=-1)
    # Reduce dim (b x sl x hidden_size)
    z = torch.matmul(head_outs, self.Wo)
    # Residual and layer normalization (b x sl x hidden_size)
    z = self.LayerNorm(z + inputs)
    # Linear layer (b x sl x hidden_size)
    lin_out = nn.GELU()(self.linear(z))
    # Layer norm + Dropout
    out = self.dropout(self.LayerNorm(lin_out + z))

    return out

The next modules simply tie the different layers of the network together and add a classifier on top.

In [13]:
class TransformerEncoder(nn.Module):
  """
  Top level transformer encoder module
  """
  def __init__(
      self,
      hidden_size: int,
      vocab_size: int,
      max_length: int,
      n_heads: int,
      d_head: int,
      n_layers: int,
      pretrained_embeddings: torch.tensor,
      dropout_prob: float = 0.1
  ):
    # First thing is to call the superclass initializer
    super(TransformerEncoder, self).__init__()

    # Create the embedding layer
    self.embedding = TransformerEmbeddingLayer(vocab_size, hidden_size, max_length, pretrained_embeddings)

    # Create the encoder layers
    self.layers = nn.ModuleList([TransformerEncoderLayer(
        hidden_size, 
        d_head, 
        n_heads, 
        dropout_prob
    ) for l in range(n_layers)])

  def forward(self, inputs, attention_mask=None):

    # Get the embeddings
    embs = self.embedding(inputs, attention_mask)

    # Run through the network
    out = embs
    for layer in self.layers:
      out = layer(out, attention_mask)

    return out

class TransformerClassifierHead(nn.Module):
  """
  Top level transformer classifier module
  """
  def __init__(
      self,
      hidden_size: int,
      vocab_size: int,
      max_length: int,
      n_heads: int,
      d_head: int,
      n_layers: int,
      pretrained_embeddings: torch.tensor,
      n_classes: int = 2,
      dropout_prob: float = 0.1
  ):
    # First thing is to call the superclass initializer
    super(TransformerClassifierHead, self).__init__()

    # Create the Transformer
    self.xformer = TransformerEncoder(
        hidden_size,
        vocab_size,
        max_length,
        n_heads,
        d_head,
        n_layers,
        pretrained_embeddings,
        dropout_prob
    )

    # Create the classifier
    self.classifier = nn.Linear(hidden_size, n_classes)

    self.dropout = nn.Dropout(dropout_prob)

    self._init_weights()

  def _init_weights(self):
      all_params = list(self.classifier.named_parameters())
      for n,p in all_params:
          if 'weight' in n:
              nn.init.xavier_normal_(p)
          elif 'bias' in n:
              nn.init.zeros_(p)

  def forward(self, inputs, attention_mask=None, labels=None):

    # Get the embeddings
    xformer_out = self.xformer(inputs, attention_mask)

    # Get the classifier embeddings (b x hidden_size)
    cls_embs = self.dropout(xformer_out[:,0,:])
    
    # Logits
    logits = self.classifier(cls_embs)

    outputs = (logits,)
    # Get loss if needed
    if labels is not None:
      xent = nn.CrossEntropyLoss()
      loss = xent(logits, labels)
      outputs = (loss,) + outputs

    return outputs


# Training and Evaluation

Here we define the main training loop and evaluation metrics. The main difference from the pytorch intro is that instead of inputing the sequence length to the model, we input an `attention_mask` which masks out pad tokens during the attention operation.

In [14]:
def accuracy(logits, labels):
  logits = np.asarray(logits).reshape(-1, len(logits[0]))
  labels = np.asarray(labels).reshape(-1)
  return np.sum(np.argmax(logits, axis=-1) == labels).astype(np.float32) / float(labels.shape[0])

def evaluate(model: nn.Module, valid_dl: DataLoader):
  """
  Evaluates the model on the given dataset
  :param model: The model under evaluation
  :param valid_dl: A `DataLoader` reading validation data
  :return: The accuracy of the model on the dataset
  """
  # VERY IMPORTANT: Put your model in "eval" mode -- this disables things like 
  # layer normalization and dropout
  model.eval()
  labels_all = []
  logits_all = []

  # ALSO IMPORTANT: Don't accumulate gradients during this process
  with torch.no_grad():
    for batch in tqdm(valid_dl, desc='Evaluation'):
      batch = tuple(t.to(device) for t in batch)
      input_ids = batch[0]
      attention_mask = batch[1]
      labels = batch[2]

      _, logits = model(input_ids, attention_mask, labels=labels)
      labels_all.extend(list(labels.detach().cpu().numpy()))
      logits_all.extend(list(logits.detach().cpu().numpy()))
    acc = accuracy(logits_all, labels_all)

    return acc

def train(
    model: nn.Module, 
    train_dl: DataLoader, 
    valid_dl: DataLoader, 
    optimizer: torch.optim.Optimizer, 
    n_epochs: int, 
    device: torch.device,
    scheduler = None
):
  """
  The main training loop which will optimize a given model on a given dataset
  :param model: The model being optimized
  :param train_dl: The training dataset
  :param valid_dl: A validation dataset
  :param optimizer: The optimizer used to update the model parameters
  :param n_epochs: Number of epochs to train for
  :param device: The device to train on
  :return: (model, losses) The best model and the losses per iteration
  """

  # Keep track of the loss and best accuracy
  losses = []
  best_acc = 0.0

  # Iterate through epochs
  for ep in range(n_epochs):

    loss_epoch = []

    #Iterate through each batch in the dataloader
    for batch in tqdm(train_dl):
      # VERY IMPORTANT: Make sure the model is in training mode, which turns on 
      # things like dropout and layer normalization
      model.train()

      # VERY IMPORTANT: zero out all of the gradients on each iteration -- PyTorch
      # keeps track of these dynamically in its computation graph so you need to explicitly
      # zero them out
      optimizer.zero_grad()

      # Place each tensor on the GPU
      batch = tuple(t.to(device) for t in batch)
      input_ids = batch[0]
      attention_mask = batch[1]
      labels = batch[2]

      # Pass the inputs through the model, get the current loss and logits
      loss, logits = model(input_ids, attention_mask, labels=labels)
      losses.append(loss.item())
      loss_epoch.append(loss.item())
      
      # Calculate all of the gradients and weight updates for the model
      loss.backward()

      # Optional: clip gradients
      #torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

      # Finally, update the weights of the model
      optimizer.step()
      if scheduler is not None:
        scheduler.step()
      #gc.collect()

    # Perform inline evaluation at the end of the epoch
    acc = evaluate(model, valid_dl)
    print(f'Validation accuracy: {acc}, train loss: {sum(loss_epoch) / len(loss_epoch)}')

    # Keep track of the best model based on the accuracy
    best_model = model.state_dict()
    if acc > best_acc:
      torch.save(model.state_dict(), 'best_model')
      best_acc = acc
        #gc.collect()

  model.load_state_dict(best_model)
  return model, losses

Now we can define hyperparameters, get the device to run on, and build the model.

In [15]:
# Define some hyperparameters
batch_size = 16
lr = 1e-4
n_epochs = 10
hidden_size = 300
n_heads = 2
d_head = 150
n_layers = 8
dropout_prob = 0.1
num_labels = 2
max_seq_len = 512

# Get the device
device = torch.device("cpu")
if torch.cuda.is_available():
  device = torch.device("cuda")

# Load english model with 25k word-pieces
bpemb_en = BPEmb(lang='en', dim=300, vs=25000)
# Extract the embeddings and add a randomly initialized embedding for our extra [PAD] token
pretrained_embeddings = np.concatenate([bpemb_en.emb.vectors, np.zeros(shape=(1,300))], axis=0)
# Extract the vocab and add an extra [PAD] token
vocabulary = bpemb_en.emb.index2word + ['[PAD]']

# Create the model
model = TransformerClassifierHead(
    hidden_size,
    len(vocabulary),
    max_seq_len,
    n_heads,
    d_head,
    n_layers,
    torch.FloatTensor(pretrained_embeddings),
    num_labels,
    dropout_prob
).to(device)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [16]:
# Create the dataset readers
# train_dataset = ClassificationDatasetReader(train_data, tokenizer)
train_dataset = ClassificationDatasetReader(train_data, bpemb_en)
# dataset loaded lazily with N workers in parallel
collate_fn = partial(collate_batch_transformer, len(vocabulary) - 1)
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn, num_workers=8)

#valid_dataset = ClassificationDatasetReader(valid_data, tokenizer)
valid_dataset = ClassificationDatasetReader(valid_data, bpemb_en)
valid_dl = DataLoader(valid_dataset, batch_size=len(valid_data), collate_fn=collate_fn, num_workers=8)

# Create the optimizer
optimizer = Adam(model.parameters(), lr=lr)

# Train
model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7431192660550459, train loss: 0.471504485269128


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7591743119266054, train loss: 0.35279364296613425


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7626146788990825, train loss: 0.2897930059228048


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7580275229357798, train loss: 0.24343581041652984


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7798165137614679, train loss: 0.22649371975473176


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7706422018348624, train loss: 0.19447458022608113


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7706422018348624, train loss: 0.19663545139379482


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.786697247706422, train loss: 0.18342957729042336


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7809633027522935, train loss: 0.19958652108808153


HBox(children=(FloatProgress(value=0.0, max=4210.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=1.0, style=ProgressStyle(description_wid…


Validation accuracy: 0.7763761467889908, train loss: 0.18584201380039728


# HuggingFace `transformers` Version

We can actually do much better than this with significantly less code. The [`transformers`](https://github.com/huggingface/transformers) project from HuggingFace is a repository of the most up to date pretrained transformer language models. They can be used on a wide range of tasks, achieving high or state-of-the-art results, and only require a few lines of code to use. You will use this library starting in week 40 for machine translation. 

In [17]:
from transformers import PreTrainedTokenizer
from transformers import RobertaTokenizer
from transformers import RobertaConfig
from transformers import RobertaForSequenceClassification
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

In [18]:
def text_to_batch_transformer(text: List, tokenizer: PreTrainedTokenizer) -> Tuple[List, List]:
    """Turn a piece of text into a batch for transformer model

    :param text: The text to tokenize and encode
    :param tokenizer: The tokenizer to use
    :return: A list of IDs and a mask
    """
    input_ids = [tokenizer.encode(t, add_special_tokens=True, truncation=True) for t in text]

    masks = [[1] * len(i) for i in input_ids]

    return input_ids, masks

class ClassificationDatasetReaderBert(Dataset):
  def __init__(self, df, tokenizer):
    self.df = df
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    row = self.df.values[idx]
    # Calls the text_to_batch function
    input_ids,masks = text_to_batch_transformer([row[0]], self.tokenizer)
    label = row[1]
    return input_ids, masks, label

In [19]:
# Set up some hyperparameters
weight_decay = 0.01
n_epochs = 2
lr = 3e-5

# Create the tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
vocabulary = tokenizer.get_vocab()

# Create the dataset readers
train_dataset = ClassificationDatasetReaderBert(train_data, tokenizer)
# dataset loaded lazily with N workers in parallel
collate_fn = partial(collate_batch_transformer, 0)
train_dl = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn, num_workers=8)

valid_dataset = ClassificationDatasetReaderBert(valid_data, tokenizer)
valid_dl = DataLoader(valid_dataset, batch_size=8, collate_fn=collate_fn, num_workers=8)

config = RobertaConfig.from_pretrained('roberta-base', num_labels=2)
model = RobertaForSequenceClassification.from_pretrained('roberta-base', config=config).to(device)

# Create the optimizer
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
      'weight_decay': weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    200,
    n_epochs * len(train_dl)
)

model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device, scheduler)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out

HBox(children=(FloatProgress(value=0.0, max=8419.0), HTML(value='')))




Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=109.0, style=ProgressStyle(description_w…


Validation accuracy: 0.9311926605504587, train loss: 0.2424312170282093


HBox(children=(FloatProgress(value=0.0, max=8419.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Evaluation', max=109.0, style=ProgressStyle(description_w…


Validation accuracy: 0.9334862385321101, train loss: 0.11953866727348335
