We're now facing the same problem we discussed in the word embedding notebooks - we may not have enough data to train a good model. This matters even more for deep learning, since neural networks perform best on large amounts of training data and training the best-performing models from scratch requires more computing power than we can access. So, as with word embeddings, let's make use of some open source pre-trained models.

You may remember that a previous notebook mentioned fine-tuning. This is the process of slightly retraining a model using our own data, instead of just using a pre-trained model for feature extraction. Hopefully this will let us benefit from both the general knowledge of the pre-trained model and the specific nuances of our training data. The most basic type of fine-tuning, which we'll try here, just adds a single layer on top of a model's existing architecture for a specific classification task. If you were taking a more complex approach to fine-tuning, you might also decide to unfreeze and retrain other layers in the model.

We'll try to fine-tune the multi-lingual version of BERT (Bidirectional Encoder Representations from Transformers), a pre-trained model produced by Google using an architecture they developed ([read more here](http://jalammar.github.io/illustrated-bert/)). The `transformers` library has become popular as a way to access large pre-trained NLP models, including BERT, so we'll use that. Significant portions of this notebook are based on [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#11-using-colab-gpu-for-training), but I've tried to include clearer explanations and modified some of the code.

We'll use `pytorch` here, but the `transformers` library has utilities for both torch and tensorflow.

Again, you'll start by enabling GPU access if you're using Google Colab.

In [0]:
import torch
import numpy as np
import pandas as pd

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    # Fall back to the CPU so that `device` is always defined.
    device = torch.device("cpu")
    print('No GPU available - maybe try again later?')

In [2]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

#train = train[train.label <= 20]
#val = val[val.label <= 20]
#test = test[test.label <= 20]

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


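BERT was trained on a few tasks involving sentence-length inputs, such as next sentence prediction. As a result, it doesn't accept inputs longer than 512 tokens.

Even if we set our max sequence length under 512, we'll get a bunch of warnings if we don't preemptively truncate our text here. Note that the slice below cuts each text to 512 characters, which is only a rough shortcut: the tokenizer counts tokens, not characters, so the real length limit is applied later through `max_length`.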
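
Slicing by characters and limiting by tokens are not the same thing, and a small sketch makes the difference visible. It uses whitespace splitting as a hypothetical stand-in for the real BERT tokenizer (which splits text into sub-word pieces and will generally produce different counts), so treat the numbers as illustrative only.

```python
def truncate_chars(text, max_chars=512):
    """Cut text to at most `max_chars` characters (what the slice does)."""
    return text[:max_chars]

def count_tokens(text):
    """Hypothetical stand-in for a tokenizer: split on whitespace."""
    return len(text.split())

sample = "word " * 300            # 1,500 characters, 300 whitespace tokens
clipped = truncate_chars(sample)  # 512 characters, far fewer than 512 tokens

print(len(sample), count_tokens(sample))
print(len(clipped), count_tokens(clipped))
```

A 512-character slice of this sample holds only about a hundred whitespace tokens, so character truncation can throw away much more text than the model's limit strictly requires.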

In [0]:
train_sentences = [sent[:512] for sent in train.text]

val_sentences = [sent[:512] for sent in val.text]

test_sentences = [sent[:512] for sent in test.text]

Now we need the `transformers` library.

In [4]:
!pip install transformers



A key appeal of the `transformers` library is that it lets you import many large, pre-trained language models using simple syntax. You import the class of the model or tokenizer you want, and then point to the specific model you choose within `from_pretrained()`.

We'll use the BERT multilingual model here - a model that was trained on many different languages including Arabic. More recent work has shown that training a BERT model with data from just one language is a better approach, unsurprisingly, but the multilingual model is still quite good.

There are cased (upper and lower case preserved) and uncased (all text lowercased) versions of the model. This doesn't really matter for Arabic, but for other languages you want to be sure the casing of your inputs matches the casing of your model.

In [0]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=False)

BERT has some idiosyncrasies in terms of its required input formats - but for us, the only relevant requirement is that input must begin with a [CLS] token and end with a [SEP] token.

The `encode_plus` method of a BertTokenizer will address this for us, and address other steps that should look familiar from our LSTM notebook, such as converting words to indices and padding inputs. So we'll wrap that in a simple function.

We'll also convert both our inputs and our labels to tensors.
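
To make those steps concrete, here is a minimal sketch of what an encoder like this does to one sentence. The vocabulary, the `toy_encode` helper, and the whitespace tokenization are all hypothetical simplifications for illustration; the real tokenizer uses a sub-word vocabulary with over 100,000 entries.

```python
def toy_encode(sentence, vocab, max_length=8):
    """Mimic the encoding steps for one sentence with a toy vocabulary."""
    # (1) Tokenize (here: whitespace) and (2) prepend [CLS].
    tokens = ["[CLS]"] + sentence.split()
    # (3) Truncate, leaving room for the closing [SEP], then append it.
    tokens = tokens[: max_length - 1] + ["[SEP]"]
    # (4) Map tokens to ids; unknown words become [UNK].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # (5) Pad to max_length and (6) mark real tokens with 1, padding with 0.
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return ids, attention_mask

# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "cat": 5, "sat": 6}

ids, mask = toy_encode("the cat sat down", toy_vocab)
print(ids)   # "down" is not in the toy vocabulary, so it maps to [UNK]
print(mask)
```

The attention mask is what lets the model ignore the padded positions at the end of each row.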

In [0]:
def bert_encoder(sentences, labels):
  # Tokenize all of the sentences and map the tokens to their word IDs.
  input_ids = []
  attention_masks = []

  # For every sentence...
  for sent in sentences:
      # `encode_plus` will:
      #   (1) Tokenize the sentence.
      #   (2) Prepend the `[CLS]` token to the start.
      #   (3) Append the `[SEP]` token to the end.
      #   (4) Map tokens to their IDs.
      #   (5) Pad or truncate the sentence to `max_length`
      #   (6) Create attention masks for [PAD] tokens.
      encoded_dict = tokenizer.encode_plus(
                          sent,                      # Sentence to encode.
                          add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                          max_length = 200,           # Pad & truncate all sentences.
                          pad_to_max_length = True,
                          return_attention_mask = True,   # Construct attn. masks.
                          return_tensors = 'pt',     # Return pytorch tensors.
                    )
      
      # Add the encoded sentence to the list.    
      input_ids.append(encoded_dict['input_ids'])
      
      # And its attention mask (simply differentiates padding from non-padding).
      attention_masks.append(encoded_dict['attention_mask'])

  # Convert the lists into tensors.
  input_ids = torch.cat(input_ids, dim=0)
  attention_masks = torch.cat(attention_masks, dim=0)
  labels = torch.tensor(labels)

  return input_ids, attention_masks, labels

In [0]:
tr_input_ids, tr_attention_masks, tr_labels = bert_encoder(train_sentences, train.label)
va_input_ids, va_attention_masks, va_labels = bert_encoder(val_sentences, val.label)
ts_input_ids, ts_attention_masks, ts_labels = bert_encoder(test_sentences, test.label)

Now we pass our inputs and our labels into a TensorDataset, a pytorch class that makes loading batches and other aspects of model training a bit easier.

In [0]:
from torch.utils.data import TensorDataset

train_dataset = TensorDataset(tr_input_ids, tr_attention_masks, tr_labels)
val_dataset = TensorDataset(va_input_ids, va_attention_masks, va_labels)
test_dataset = TensorDataset(ts_input_ids, ts_attention_masks, ts_labels)


TensorDatasets can be passed into DataLoaders for batching.
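
As a rough picture of what batching with a random or sequential sampler amounts to, here is a minimal sketch in plain Python. The `make_batches` helper is hypothetical and only handles indices; the real DataLoader also gathers the tensors for each batch.

```python
import random

def make_batches(n_samples, batch_size, shuffle, seed=0):
    """Split sample indices into batches, optionally in random order."""
    indices = list(range(n_samples))
    if shuffle:
        # Random order, as we'd want for training.
        random.Random(seed).shuffle(indices)
    # The last batch may be smaller than batch_size.
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

train_batches = make_batches(10, batch_size=4, shuffle=True)
val_batches = make_batches(10, batch_size=4, shuffle=False)

print(train_batches)  # every index appears once, in shuffled order
print(val_batches)    # indices in their original order
```

Note that 10 samples with a batch size of 4 give 3 batches, which is why the number of training steps later depends on the number of batches rather than the number of samples.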

In [0]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

`transformers` makes a number of different BERT versions available, which you can see listed [here](https://huggingface.co/transformers/model_doc/bert.html). We'll use BertForSequenceClassification, which is the most appropriate for classification tasks. A more advanced user could just use the vanilla BertModel and then build their own fine-tuning mechanisms with normal pytorch syntax.

In [10]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-uncased', 
    num_labels = 40,   
    output_attentions = False,
    output_hidden_states = False,
)

# Move the model to the selected device (the GPU, if one is available).
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

When we tell the model to run on GPU, it also prints a summary of the model - you'll see a layer called "classifier" with 40 output features at the end. The `transformers` library has added this for us, because we passed the argument `num_labels=40`, along with the dropout layer immediately above it.
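
For a sense of how small this added layer is, here is a quick back-of-the-envelope count. It assumes the classifier is a single linear layer with a bias term that maps the 768-dimensional hidden size shown in the summary to our 40 labels.

```python
hidden_size = 768   # hidden size shown in the printed model summary
num_labels = 40     # the value we passed as num_labels

weights = hidden_size * num_labels  # one weight per (input, output) pair
biases = num_labels                 # one bias per output (assumed)
classifier_params = weights + biases

print(classifier_params)
```

That is tiny next to the pre-trained model itself: the word embedding table alone, at 105879 x 768, holds over 81 million parameters.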

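Next, we need our optimizer - we'll use a version of Adam available through `transformers` and stick with the default hyperparameters by not passing any arguments. Keep in mind that the default learning rate (1e-3) is much higher than the 2e-5 to 5e-5 range suggested in the BERT paper for fine-tuning, so it's a good first thing to change if the loss barely moves.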

We'll also use a scheduler to control our learning rate.
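
A linear schedule with warmup is easy to picture as a multiplier applied to the base learning rate. The sketch below reflects my understanding of that shape (ramp up linearly during warmup, then decay linearly to zero); the function name and the toy step counts are hypothetical, and this is not the library's own code.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Fraction of the base learning rate to use at a given step."""
    if step < num_warmup_steps:
        # Warmup: ramp from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Afterwards: decay linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-3    # assumed base learning rate for illustration
total_steps = 10  # toy value; ours is len(train_dataloader) * epochs

for step in (0, 5, 10):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))
```

With zero warmup steps, as in this notebook, the learning rate starts at its full value and shrinks steadily to zero by the last training step.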

In [0]:
optimizer = AdamW(model.parameters())

In [0]:
from transformers import get_linear_schedule_with_warmup

epochs = 3

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

As in the LSTM notebook, it will be helpful to have a utility for getting a flat set of predictions.
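
To see what "flat predictions" means on a small case, here is a sketch with plain Python lists that can be traced by hand. The toy logits and labels are made up for illustration: each row holds one score per class, and the prediction is the index of the largest score.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for 4 samples and 3 classes.
toy_logits = [[0.1, 2.0, -1.0],
              [1.5, 0.2, 0.3],
              [0.0, 0.1, 0.9],
              [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 1, 0]

print(accuracy_from_logits(toy_logits, toy_labels))
```

The third row is the only miss (predicted class 2, true class 1), so three of four predictions are correct.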

In [0]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

Now it's time to train our model! This is a long cell, so I've included in-line comments rather than explaining everything up here. Again, some of this code and comments come from [this very helpful tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/).

In [14]:
import random

seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. This by itself doesn't train the model.
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns different numbers of values depending on what arguments
        # are given and what flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        # Indexing the result (rather than tuple-unpacking it) works whether
        # the library returns a plain tuple or an output object.
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        loss, logits = outputs[0], outputs[1]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            # Indexing the result works whether the library returns a plain
            # tuple or an output object.
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask,
                            labels=b_labels)
            loss, logits = outputs[0], outputs[1]
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy
        }
    )

print("")
print("Training complete!")


Training...

  Average training loss: 3.32

Running Validation...
  Accuracy: 0.15
  Validation Loss: 3.28

Training...

  Average training loss: 3.29

Running Validation...
  Accuracy: 0.15
  Validation Loss: 3.27

Training...

  Average training loss: 3.28

Running Validation...
  Accuracy: 0.15
  Validation Loss: 3.27

Training complete!


Hm, this isn't looking as great as our LSTMs, but let's evaluate on the test set. This should look very similar to the LSTM notebook and at a broad level to all of the previous notebooks. We make sure our test data is in the same format as the training data, we get our predictions, and then we compute our metrics.

In [0]:
# Create the DataLoader.
prediction_dataloader = DataLoader(test_dataset, batch_size=batch_size)

In [16]:
# Prediction on test set

print('Predicting labels for {:,} test sentences...'.format(len(test_dataset)))

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

print('    DONE.')

Predicting labels for 2,767 test sentences...
    DONE.


In [0]:
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label with the highest score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

In [18]:
from sklearn.metrics import f1_score

f1_score(test.label, flat_predictions, average = 'weighted')


0.04216713458409312
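
The weighted F1 score can sit well below accuracy, and a small hand-rolled version shows why. This is a sketch of the weighted average as I understand it (per-class F1, weighted by how often each class appears in the true labels), written without sklearn; the `weighted_f1` helper and the toy labels are made up for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score

# A toy model that always predicts class 0.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

print(weighted_f1(y_true, y_pred))
```

Here accuracy is 0.75, but class 1 contributes an F1 of zero and class 0 is penalized for its false positive, so the weighted F1 is lower. With 40 classes, a model that favors a handful of labels can be hit much harder.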

Wow, that's not very good at all! We did bump up against the computational limits of Google Colab here - in previous versions of this notebook with a longer sequence length, the GPU was running out of memory. We're not giving the model the full text of the training data to learn from, so it makes sense that performance wouldn't be great. But hopefully this still illustrates some basic principles of how to use BERT or other large models for this kind of problem.

One final note: what if you were doing error analysis, and wanted to see the original text for an item that was labeled incorrectly? The BERT tokenizer has a simple `decode()` method for this. Let's look at a few of the (many!) items that our model got wrong.

In [19]:
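# Check the first 10 test items. (Slicing the dataset returns a tuple of
# three tensors, so `len(test_dataset[:10])` would be 3, not 10.)
for i in range(10):
  if flat_predictions[i] != test.label[i]:
    # Each dataset item is (input_ids, attention_mask, label); decode the ids.
    text = tokenizer.decode(test_dataset[i][0])
    print(text,'\n\n')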

[CLS] فابيوس : الخلافات الفرنسية الروسية [UNK] [UNK] يجب [UNK] تعيق التعاون في حل [UNK] سوريا [UNK] وزير الخارجية الفرنسي لوران فابيوس [UNK] خلافات بين باريس وموسكو حول [UNK] [UNK] يجب [UNK] تعيق تعاون الجانبين في حل غيرها من القضايا بما في ذلك [UNK] السورية باريس تعلن بدء استطلاع مواقع داعش في سوريا استعدادا لقصفها هولاند سنرفع العقوبات ضد روسيا في حال استمرار التقدم في حل [UNK] [UNK] وقال فابيوس في مقابلة مع [UNK] الثلاثاء سبتمبر [UNK] [UNK] يجب عدم الخلط بين المشاكل الموجودة [UNK] [UNK] سيكون حلها مستحيلا [UNK] [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 


[CLS] [UNK] تجسس [UNK] تدنو من [UNK] الروسية [UNK] مواقع [UNK] تعنى بمراقبة حركة [UNK] العسكرية باقتراب [UNK] استطلاع استراتيجية [UNK] من نوع من [UNK] مقاطعة كالينينغراد غرب روسيا للمرة الثانية وذكرت هذه المواقع [

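Interesting! The appearance of the [UNK] token shows that BERT's tokenizer didn't recognize a number of the words in our data. This would definitely impact model performance, and would be an important area to investigate. One thing to check is the tokenizer setup: we loaded the uncased vocabulary with `do_lower_case=False`, which skips the text normalization that vocabulary expects.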
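
One place to start that investigation is Unicode normalization. As far as I know, BERT's basic tokenizer strips combining marks when lowercasing is enabled, and uncased vocabularies are built from text normalized that way. The sketch below reproduces that kind of step with the standard library only; `strip_accents` is a hypothetical helper, and whether this explains the [UNK] tokens seen here is an assumption to verify.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

word = "\u0623\u0646"             # Arabic "أن", alef with hamza above + noon
normalized = strip_accents(word)  # the hamza mark is dropped, leaving plain alef

print([hex(ord(ch)) for ch in word])
print([hex(ord(ch)) for ch in normalized])
```

If the vocabulary only contains the normalized form, a word that arrives without this normalization may not be found and would fall back to [UNK].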