<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Pretrained_Transformers_(deberta-v2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with Transformers

The most effective technique for most NLP tasks today is to take a transformer neural network, typically pretrained on massive collections of text data in an unsupervised manner, and then fine-tune it to your task. In this tutorial we'll show you how to use these techniques to train a text classifier for MSHA injury narratives.

### Overview of the Transformer
The transformer neural network was [introduced](https://arxiv.org/abs/1706.03762) in 2017 and originally designed to operate over a sequence of inputs representing words and/or subword sequences (i.e. tokens). The distinguishing operation of a transformer is self-attention. In self-attention, each input in a sequence is compared to every other input in the sequence and then aggregated to produce a new more context-aware sequence. Transformer networks typically consist of multiple blocks of self-attention, and incorporate numerous additional tricks to improve performance. Positional information about the sequence, which is necessary for many language tasks, is incorporated by adding a positional vector to each input in the sequence. 
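
To make the self-attention step concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: a real transformer first maps each input through learned query, key, and value projections and uses several attention heads, whereas this sketch uses the inputs directly and a single head. The function names `softmax` and `self_attention` are chosen for this example only.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(inputs[0])
    outputs = []
    for query in inputs:
        # compare this input to every input in the sequence (scaled dot product)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in inputs]
        weights = softmax(scores)
        # aggregate: weighted average of all inputs in the sequence
        mixed = [sum(w * value[d] for w, value in zip(weights, inputs))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three toy 2-dimensional token vectors
contextual = self_attention(tokens)
for vector in contextual:
    print([round(x, 3) for x in vector])
```

Each output vector is a weighted average of all the input vectors, with the largest weights going to the inputs that are most similar to it.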

A more extensive treatment of transformers is available [here](http://peterbloem.nl/blog/transformers).

### Self-Supervision
The second, far more important innovation we will use is self-supervised pretraining, which allows us to initialize our model with language knowledge without access to labeled data. For language tasks, self-supervised pretraining is typically accomplished by gathering huge collections of text, masking and/or corrupting portions of the text, and training the model to predict the missing or corrupted pieces. The first transformer to be pretrained in this way is known as [BERT](https://arxiv.org/abs/1810.04805), and was introduced in 2018. Because pretraining on huge collections of text typically requires a huge amount of time, we will not do that in this tutorial. Instead we will simply use one of the many models that have already been pretrained for us.
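
The masking idea can be sketched in a few lines of plain Python. This is only an illustration: `mask_tokens` is a hypothetical helper written for this example, it splits on whitespace rather than using a real tokenizer, and while the 15% default mirrors the rate reported in the BERT paper, BERT's full recipe also sometimes substitutes a random token or leaves the chosen token unchanged, which this sketch omits.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token='[MASK]', seed=0):
    """Hide a random subset of tokens; return the corrupted input and the targets to predict."""
    rng = random.Random(seed)               # seeded so the example is repeatable
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets[position] = token       # the model is trained to recover this token
        else:
            corrupted.append(token)
    return corrupted, targets

narrative = 'employee slipped on ice and injured his left knee'.split()
corrupted, targets = mask_tokens(narrative, mask_rate=0.3)
print(corrupted)   # input shown to the model
print(targets)     # {position: original token} pairs the model must predict
```

No human labels are needed: the text itself supplies both the input and the answer, which is why this kind of training scales to huge collections of text.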

Our basic approach is as follows:
1. Install the `transformers` library, which makes it easy to use pretrained transformers.
2. Download our MSHA injury data for training and evaluation.
3. Load a pretrained transformer model with an untrained classification output layer for our task.
4. Prepare our data so it's compatible with the inputs expected by the model.
5. Finetune (i.e. further train) the pretrained transformer model on our MSHA data.
6. Evaluate and use the model.

# Setup

## Install Transformers Library

In [None]:
!pip install "transformers>=4.4.2"

## Download the MSHA Data

In [None]:
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

--2021-04-21 17:35:38--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2021-04-21 17:35:38--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx’


2021-04-21 17:35:39 (97.9 MB/s) - ‘msha.xlsx’ saved [4183086/4183086]



In [None]:
import pandas as pd
from sklearn import preprocessing

df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].dt.year
labeler = preprocessing.LabelEncoder()   # Labeler that will convert our codes to indexes
labeler.fit(df['INJ_BODY_PART'])         # Calculate an index for each unique code
df['PART_INDEX'] = labeler.transform(df['INJ_BODY_PART']) # Add the code indexes to our dataframe, we will need these later
# separate training and validation by year
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy()
# show the results
print('n_classes:', len(df['INJ_BODY_PART'].unique()))
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))
df[['INJ_BODY_PART', 'PART_INDEX', 'NARRATIVE', 'ACCIDENT_YEAR']].head()

n_classes: 46
training rows: 18681
validation rows: 9032


| | INJ_BODY_PART | PART_INDEX | NARRATIVE | ACCIDENT_YEAR |
|---|---|---|---|---|
| 0 | SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA) | 35 | Cleaning out Gabion Grizzly, Rocks get Jammed... | 2010 |
| 1 | SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA) | 35 | Injured was walking in the pit area, stepped o... | 2010 |
| 2 | HIPS (PELVIS/ORGANS/KIDNEYS/BUTTOCKS) | 22 | Employee, parked s/c on grade at 16-Block #3 E... | 2012 |
| 3 | ANKLE | 1 | Contractor employee working as a carpenter mis... | 2013 |
| 4 | FINGER(S)/THUMB | 16 | The employee's finger was pinched between the ... | 2011 |
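
As a rough illustration of what the `LabelEncoder` is doing above, the sketch below rebuilds the mapping in plain Python: the unique codes are sorted and each is assigned its position in that sorted list. The helper names `fit_labels`, `transform`, and `inverse_transform` are made up for this example, and the four codes are a toy subset, so the indexes differ from the `PART_INDEX` values in the full data.

```python
def fit_labels(codes):
    """Return the sorted unique codes; a code's index is its position in this list."""
    return sorted(set(codes))

def transform(classes, codes):
    """Map each code string to its integer index."""
    lookup = {code: index for index, code in enumerate(classes)}
    return [lookup[code] for code in codes]

def inverse_transform(classes, indexes):
    """Map integer indexes back to their code strings."""
    return [classes[index] for index in indexes]

toy_codes = ['KNEE/PATELLA', 'ANKLE', 'FINGER(S)/THUMB', 'ANKLE']
classes = fit_labels(toy_codes)
indexes = transform(classes, toy_codes)
print(classes)                                          # ['ANKLE', 'FINGER(S)/THUMB', 'KNEE/PATELLA']
print(indexes)                                          # [2, 0, 1, 0]
print(inverse_transform(classes, indexes) == toy_codes) # True
```

The inverse mapping matters later, when the model's predicted indexes have to be turned back into readable codes.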


## Load a Pretrained Transformer

We will start with a pretrained transformer called `distilbert-base-uncased`, which is a carefully miniaturized version of BERT.

In loading this model we are actually doing several things.
1. We are loading the tokenizer used to train the transformer. This is important because it allows us to convert our narrative inputs into the format expected by the model.
2. We are loading the model that was pretrained on data tokenized by this tokenizer.
3. We are adapting the model to sequence classification, i.e. classification from a sequence of inputs. In practice this means chopping off the last part of the pretrained transformer and adding an untrained output layer (essentially a multinomial logistic regression), which will produce outputs for each of the codes we wish to assign. There are 46 part codes in our data, so we specify num_labels=46 when loading the model.

In [None]:
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=46)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

# Preparing the Data

The pretrained transformer expects inputs in a specific format (described in the [documentation](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification)), and the tokenizer is our primary mechanism for producing it. We illustrate its basic usage below:

In [None]:
inputs = tokenizer('Workers arm caught in grider', return_tensors='pt')
print(inputs)

{'input_ids': tensor([[ 101, 3667, 2849, 3236, 1999, 8370, 2121,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


The tokenizer expects one or more input texts. It then converts them into two tensors: one consisting of indexes corresponding to the tokens that make up the text (`input_ids`), the other marking which positions hold real tokens (1) rather than padding (0) (`attention_mask`). These are the exact inputs expected by the model, which we can now invoke as follows:

In [None]:
output = model(**inputs)
print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.0749, -0.1781,  0.1764, -0.1064, -0.1286, -0.0361, -0.1055,  0.0268,
          0.0013, -0.0534, -0.0041,  0.2025, -0.0888,  0.0731,  0.0904,  0.0598,
          0.0416, -0.0253,  0.0262,  0.1578,  0.0420,  0.1467,  0.0392,  0.1198,
          0.0584, -0.0358,  0.1646, -0.1632,  0.1824, -0.0613, -0.1047,  0.0661,
         -0.0212,  0.0690, -0.0693, -0.0369, -0.0760,  0.0401,  0.0230, -0.0529,
          0.0540,  0.0631,  0.0492,  0.0266,  0.0937,  0.1761]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


In [None]:
output['logits'].shape

torch.Size([1, 46])

The model output includes two components of interest:
1. logits: a tensor of shape [n_examples, n_labels] holding the pre-softmax outputs from our final layer, one row for each input example and one column for each part-of-body code.
2. loss: the model's loss on the inputs. Since no label was included in the inputs, no loss was calculated.

If we add the true labels to our inputs, the model will also calculate the cross-entropy loss (a log-softmax over the logits followed by the negative log likelihood loss) as shown below:

In [None]:
inputs['labels'] = torch.tensor([[3]])
outputs = model(**inputs)
print(outputs)

SequenceClassifierOutput(loss=tensor(3.9611, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0749, -0.1781,  0.1764, -0.1064, -0.1286, -0.0361, -0.1055,  0.0268,
          0.0013, -0.0534, -0.0041,  0.2025, -0.0888,  0.0731,  0.0904,  0.0598,
          0.0416, -0.0253,  0.0262,  0.1578,  0.0420,  0.1467,  0.0392,  0.1198,
          0.0584, -0.0358,  0.1646, -0.1632,  0.1824, -0.0613, -0.1047,  0.0661,
         -0.0212,  0.0690, -0.0693, -0.0369, -0.0760,  0.0401,  0.0230, -0.0529,
          0.0540,  0.0631,  0.0492,  0.0266,  0.0937,  0.1761]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


The loss allows us to calculate the parameter gradients, which in turn tells us how to update parameter values when training the model with gradient descent.

Probability-like outputs can be recovered from the model by applying the softmax to the logits:

In [None]:
torch.softmax(outputs['logits'], -1)

tensor([[0.0228, 0.0177, 0.0253, 0.0190, 0.0186, 0.0204, 0.0191, 0.0218, 0.0212,
         0.0201, 0.0211, 0.0259, 0.0194, 0.0228, 0.0232, 0.0225, 0.0221, 0.0207,
         0.0217, 0.0248, 0.0221, 0.0245, 0.0220, 0.0239, 0.0225, 0.0204, 0.0250,
         0.0180, 0.0254, 0.0199, 0.0191, 0.0226, 0.0207, 0.0227, 0.0198, 0.0204,
         0.0196, 0.0220, 0.0217, 0.0201, 0.0224, 0.0226, 0.0222, 0.0218, 0.0233,
         0.0253]], grad_fn=<SoftmaxBackward>)
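
The loss and probabilities printed above can be reproduced by hand, which may help demystify them. The sketch below copies the 46 logits from the model output into a plain Python list; `softmax` and `cross_entropy` are small helpers written for this example, not library calls. Because the printed logits are rounded to four decimals, the results should agree with the model's output only to about that precision.

```python
import math

# logits printed above for the example narrative (one value per body-part code)
logits = [
    0.0749, -0.1781, 0.1764, -0.1064, -0.1286, -0.0361, -0.1055, 0.0268,
    0.0013, -0.0534, -0.0041, 0.2025, -0.0888, 0.0731, 0.0904, 0.0598,
    0.0416, -0.0253, 0.0262, 0.1578, 0.0420, 0.1467, 0.0392, 0.1198,
    0.0584, -0.0358, 0.1646, -0.1632, 0.1824, -0.0613, -0.1047, 0.0661,
    -0.0212, 0.0690, -0.0693, -0.0369, -0.0760, 0.0401, 0.0230, -0.0529,
    0.0540, 0.0631, 0.0492, 0.0266, 0.0937, 0.1761,
]

def softmax(values):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    peak = max(values)                            # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(values, label):
    """Negative log of the probability assigned to the true label."""
    return -math.log(softmax(values)[label])

probs = softmax(logits)
loss = cross_entropy(logits, label=3)
print(round(probs[0], 4))   # 0.0228, the first probability printed above
print(round(loss, 2))       # 3.96, in line with the loss of 3.9611 printed above
```

For reference, a model that spread its probability evenly over all 46 codes would score ln(46), roughly 3.83, so a loss near 3.96 suggests the untrained output layer is essentially guessing.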

### About the Tokenizer

You might have noticed the tokenizer is creating more tokens than we have input words. There are two reasons for this:
1. It's adding special tokens that are expected by the model.
2. It's using a combination of word and subword tokens so it can represent all possible words with a very limited vocabulary.

We can see both of these effects more clearly by partially reversing the tokenization below:


In [None]:
inputs = tokenizer('Workers arm crushed by anthracite.')
print(inputs)

{'input_ids': [101, 3667, 2849, 10560, 2011, 14405, 13492, 17847, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'])

['[CLS]',
 'workers',
 'arm',
 'crushed',
 'by',
 'ant',
 '##hra',
 '##cite',
 '.',
 '[SEP]']

Note the special `[CLS]` and `[SEP]` tokens, to indicate the start and end of the sequence, and the subword tokenization of `anthracite` into `ant`, `##hra`, `##cite`. 
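
The subword split can be imitated with a greedy longest-match-first search, which is the general strategy WordPiece-style tokenizers use at inference time. The sketch below is a simplified stand-in rather than the library's implementation: `wordpiece` is a hypothetical helper, and `toy_vocab` is a tiny hand-made vocabulary (the real one has roughly 30,000 entries).

```python
def wordpiece(word, vocab, unknown='[UNK]'):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                       # try the longest remaining piece first
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate     # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]                     # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {'workers', 'arm', 'crushed', 'by', 'an', 'ant', '##t', '##hra', '##cite'}
print(wordpiece('anthracite', toy_vocab))   # ['ant', '##hra', '##cite']
print(wordpiece('arm', toy_vocab))          # ['arm']
```

Common words stay whole because they are in the vocabulary, while rare words such as `anthracite` are assembled from smaller pieces.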

# Generating Batches

Neural networks tend to be extraordinarily computationally expensive, so we must use our computing resources carefully. One of the most important ways we do this is by computing on batches of data, as it's rarely possible to compute on the entire dataset at once.

To help us assemble these batches, PyTorch provides special Datasets, Samplers, DataLoaders, and collate functions. We describe each below.

## PyTorch Dataset
The PyTorch Dataset is a representation of our data (training, validation, or test) that has a `__len__` method, so PyTorch knows how many examples it contains, and a `__getitem__` method, so PyTorch can retrieve an example by specifying its index (i.e. row number).

We create our CustomDataset below by subclassing the PyTorch Dataset and overriding these methods to fit our purposes. In particular, we write the `__getitem__` method so it produces a tokenized version of the corresponding narrative and, if label_field is specified, the integer version of the code associated with that narrative.

In [None]:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
  def __init__(self, df, tokenizer, text_field, max_len=200, label_field=None):
    self.df = df                    # dataframe containing our data
    self.tokenizer = tokenizer      # pretrained tokenizer to convert narratives to indexes
    self.text_field = text_field    # column of dataframe containing input text
    self.max_len = max_len          # optional: max tokens we will use in a text
    self.label_field = label_field  # optional: field containing example label

  def __len__(self):
    return len(self.df)

  def __getitem__(self, index):
    row = self.df.iloc[index]              # retrieve example at specified index
    inputs = self.tokenizer.encode_plus(  
                text=row[self.text_field], # specify the field to tokenize
                max_length=self.max_len,   # set the max tokens we will allow in narrative (typically < 512 because of memory limitations)
                truncation=True)           # truncate the narrative to max_len if it exceeds max_len (avoids memory errors)
    if self.label_field:                   # if label_field is specified, we include labels in each example
        inputs['labels'] = row[self.label_field]
    return dict(inputs) 

Example of converting our df_train and df_valid dataframes into PyTorch Datasets.

In [None]:
train_dataset = CustomDataset(df=df_train, tokenizer=tokenizer, max_len=200,
                              text_field='NARRATIVE', label_field='PART_INDEX')
valid_dataset = CustomDataset(df=df_valid, tokenizer=tokenizer, max_len=200,
                              text_field='NARRATIVE', label_field='PART_INDEX')

Example of retrieving a row from our dataset by specifying the row index (in this case row 0).

In [None]:
print(train_dataset[0])

{'input_ids': [101, 9344, 2041, 11721, 26282, 2078, 24665, 29266, 1010, 5749, 2131, 21601, 1998, 7861, 22086, 4402, 2038, 2000, 2593, 29198, 2030, 5245, 1996, 5749, 2041, 1997, 1996, 24665, 29266, 1012, 7904, 2001, 2478, 1037, 22889, 24225, 8691, 2000, 2131, 1996, 5749, 4558, 1998, 2766, 2242, 1999, 2010, 2157, 3244, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': 35}


## PyTorch Sampler

The job of the PyTorch Sampler is to decide which examples need to be sampled from our dataset next. For training, we typically want to do this randomly without replacement so that each example is selected an equal number of times. We can accomplish this with the RandomSampler, illustrated below. 

In [None]:
from torch.utils.data import RandomSampler

random_sampler = RandomSampler(train_dataset)
sampled_index = next(iter(random_sampler))
print(sampled_index)

14572


When training sequence models however, we can often get large speedups by sampling in such a way that each batch consists of narratives of similar length. The LengthGroupedSampler does exactly this so we will use it instead.

In [None]:
from transformers.trainer_pt_utils import LengthGroupedSampler

length_sampler = LengthGroupedSampler(train_dataset, batch_size=32)
next(iter(length_sampler))


6186
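
The idea behind length-grouped sampling can be sketched in plain Python: shuffle the example indexes, cut them into pools of several batches, and sort each pool by length so that neighbouring examples need little padding. This is a simplified illustration of the idea, not the library's exact algorithm; `length_grouped_indices` and `pool_factor` are names invented for this example.

```python
import random

def length_grouped_indices(lengths, batch_size, pool_factor=4, seed=0):
    """Shuffle indexes, then sort by length inside pools of `pool_factor` batches."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)                                   # keeps batches varied across runs
    pool_size = batch_size * pool_factor
    grouped = []
    for start in range(0, len(indices), pool_size):
        pool = indices[start:start + pool_size]
        pool.sort(key=lambda i: lengths[i], reverse=True)  # longest first within the pool
        grouped.extend(pool)
    return grouped

toy_lengths = [5, 40, 7, 38, 6, 41, 8, 39, 4, 42, 9, 37]   # token counts of 12 toy narratives
order = length_grouped_indices(toy_lengths, batch_size=2, pool_factor=3)
batches = [order[i:i + 2] for i in range(0, len(order), 2)]
print([[toy_lengths[i] for i in batch] for batch in batches])
```

Sorting inside pools rather than across the whole dataset is a compromise: batches contain similar lengths, yet their composition still changes whenever the shuffle changes.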

Once the model is trained, we typically only want to pull each example once, in order. This is accomplished using the SequentialSampler.

In [None]:
from torch.utils.data import SequentialSampler

sequential_sampler = SequentialSampler(train_dataset)
next(iter(sequential_sampler))

0

## Collate Function

Once the samples are drawn, we need to combine them into tensors of equal size. This is slightly complicated by the fact that different narratives are of different lengths. DataCollatorWithPadding solves this by padding each example in the batch to equal length using special padding tokens, and then combining the result into a tensor.   

In [None]:
from transformers.data.data_collator import DataCollatorWithPadding

pad_and_collate = DataCollatorWithPadding(tokenizer=tokenizer)
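
To show what padding and collating involve, here is a plain-Python sketch that works on lists instead of tensors. `pad_and_collate_sketch` is a hypothetical stand-in for the real collator, the two examples are toy inputs with made-up labels, and the pad id of 0 is assumed to match this tokenizer's padding token.

```python
def pad_and_collate_sketch(examples, pad_id=0):
    """Pad every example to the longest one in the batch and stack the fields."""
    longest = max(len(example['input_ids']) for example in examples)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for example in examples:
        ids = example['input_ids']
        padding = longest - len(ids)
        batch['input_ids'].append(ids + [pad_id] * padding)             # pad token ids
        batch['attention_mask'].append([1] * len(ids) + [0] * padding)  # 0 marks padding
        batch['labels'].append(example['labels'])
    return batch

examples = [
    {'input_ids': [101, 3667, 2849, 102], 'labels': 35},
    {'input_ids': [101, 3667, 2849, 10560, 2011, 102], 'labels': 16},
]
toy_batch = pad_and_collate_sketch(examples)
print(toy_batch['input_ids'])       # [[101, 3667, 2849, 102, 0, 0], [101, 3667, 2849, 10560, 2011, 102]]
print(toy_batch['attention_mask'])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Padding only to the longest example in each batch, rather than to a fixed maximum, is what makes grouping narratives of similar length worthwhile.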

## DataLoader
The job of the dataloader is to connect the dataset, sampler, and collator so we can easily generate batches of the desired size. Specifically: the dataloader uses the sampler to retrieve examples from the dataset until it reaches the number specified by batch_size. It then uses the collate function to combine these into tensors. 

As a general rule we typically set the batch size to the largest number that can fit in our GPU's memory as larger batch sizes tend to enable faster training. If you set the batch size too high however you will get memory errors when training or using the model. If that happens you should restart the runtime and try a smaller batch size or a smaller max sequence length (transformer models in particular require large amounts of memory for long narratives).

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, 
                          batch_size=32,                                              # number of training examples in each batch
                          sampler=LengthGroupedSampler(train_dataset, batch_size=32), # sample batches with similar length
                          collate_fn=pad_and_collate,
                          pin_memory=True, # optimize memory transfer from CPU to GPU
                          drop_last=True)  # if last batch is smaller than batch_size, drop it (usually best for training)

valid_loader = DataLoader(valid_dataset, 
                          batch_size=64,                                              # number of validation examples in each batch (can usually be double training)
                          sampler=LengthGroupedSampler(valid_dataset, batch_size=64), # sample batches with similar length
                          collate_fn=pad_and_collate,
                          pin_memory=True, # optimize memory transfer from CPU to GPU
                          drop_last=False) # don't drop partial batches (best for evaluation and inference)
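
Stripped of its performance features, the work a DataLoader does can be sketched as a short generator. This is an illustration only: `iter_batches` is a hypothetical function, and the toy dataset, sampler, and collate function below are plain Python stand-ins for the objects defined above.

```python
def iter_batches(dataset, sampler, batch_size, collate_fn, drop_last=False):
    """Pull indexes from the sampler, fetch examples, and collate them batch by batch."""
    batch = []
    for index in sampler:
        batch.append(dataset[index])      # the dataset turns an index into an example
        if len(batch) == batch_size:
            yield collate_fn(batch)       # the collate function turns examples into a batch
            batch = []
    if batch and not drop_last:
        yield collate_fn(batch)           # final partial batch, kept unless drop_last is set

toy_dataset = list(range(10))             # ten toy "examples"
kept = list(iter_batches(toy_dataset, range(10), batch_size=4, collate_fn=list))
dropped = list(iter_batches(toy_dataset, range(10), batch_size=4, collate_fn=list, drop_last=True))
print(kept)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(dropped)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The two print statements show the effect of `drop_last`: training discards the short final batch, while evaluation keeps it so every example receives a prediction.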

Example of pulling a batch from our dataloader:

In [None]:
batch = next(iter(train_loader))
print(batch)

{'input_ids': tensor([[  101,  2012,  2260,  ...,  1015,  1012,   102],
        [  101, 25212,  2187,  ...,     0,     0,     0],
        [  101,  3626,  2001,  ...,     0,     0,     0],
        ...,
        [  101, 25212,  2001,  ...,     0,     0,     0],
        [  101,  2006,  1017,  ...,     0,     0,     0],
        [  101,  1996,  5043,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([ 6,  4,  1,  4, 32,  3,  4, 12, 15,  9, 35,  4, 24, 21, 31,  4, 17, 30,
        17, 31, 31, 45, 42, 31, 35, 26,  3,  4,  7, 16, 24, 15])}


Note that the input_ids and attention_mask tensors have been padded to equal size (the pad_token is 0).

In [None]:
batch['input_ids'].size()

torch.Size([32, 137])

# Train the Model

The standard approach to training a neural network is mini-batch gradient descent. Specifically, over a series of epochs (i.e. passes through the training dataset) we will repeatedly:
1. Retrieve a batch of training examples (by iterating through the train_loader)
2. Calculate the model's predictions (`logits`) and `loss` on the batch
3. Calculate the gradient, i.e. how the loss changes with respect to each of the model parameters
4. Update the model parameters in the direction that reduces loss, as indicated by the gradient
5. Repeat until we reach a stopping criterion, such as validation performance no longer improving or a specified number of epochs being reached.

To monitor performance, at the end of each training epoch we will:
1. Retrieve batches from the validation data
2. Calculate the model's predictions on these batches
3. Compare these predictions to the labels to calculate the accuracy and macro-F1 score

This helps us determine how many training epochs we should use (if validation accuracy and macro-F1 stop improving, it's time to stop).
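
The stopping rule described above can be written down as a small helper. This is a sketch of one common convention (stop after `patience` epochs without a new best score), not something the training loop in this tutorial does automatically; `early_stopping_epoch` is a hypothetical function and the scores are invented values.

```python
def early_stopping_epoch(scores, patience=1):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores."""
    best_score, best_epoch, waited = float('-inf'), None, 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch                       # no improvement for too long
    return best_epoch, len(scores) - 1                         # ran out of epochs first

made_up_scores = [0.60, 0.62, 0.65, 0.64, 0.63]                # hypothetical validation macro-F1
print(early_stopping_epoch(made_up_scores, patience=1))        # (2, 3)
print(early_stopping_epoch(made_up_scores, patience=2))        # (2, 4)
```

A larger `patience` tolerates noisy epochs at the cost of extra training time; in either case the model from the best epoch is the one worth keeping.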

In [None]:
from tqdm.notebook import tqdm # provides graphical update on training progress
from torch.optim import AdamW  # PyTorch's AdamW; the transformers.AdamW class is deprecated
from sklearn.metrics import accuracy_score, f1_score

model = model.to(torch.device('cuda'))                 # transfer model parameters to GPU
optimizer = AdamW(params=model.parameters(), lr=1e-4,  # set optimizer and learning rate (will update model parameters)
                  weight_decay=0.0)                    # no weight decay, matching the transformers.AdamW default

for epoch in range(3):         # repeat 3 times
  print(f'Epoch: {epoch}')
  model.train()                # set dropout and similar layers to random mode
  for batch in tqdm(train_loader, desc='Train'):    # iterate through all the batches in training data
    batch = {k: v.cuda() for k, v in batch.items()} # transfer the batch to the GPU
    output = model(**batch)    # calculate logits (predictions) and loss on training batch
    output['loss'].backward()  # calculate gradient (change in loss with respect to parameters)
    optimizer.step()           # adjust the parameters in the direction that reduces loss as measured by gradient
    optimizer.zero_grad()      # zero out the gradient as we're now moving on to the next batch
  # at the end of each epoch, evaluate our model on the validation data
  preds = []                   # will accumulate validation predictions
  labels = []                  # will accumulate validation true labels
  with torch.no_grad():            # disable gradient tracking, its slow and not needed for validation
    model.eval()                   # change dropout and similar layers to deterministic mode   
    for batch in tqdm(valid_loader, desc='Valid'):    # iterate through batches of validation data
      batch = {k: v.cuda() for k, v in batch.items()} # transfer the batch to the GPU
      output = model(**batch)      # generate predictions on validation data batch [n_examples, n_labels]
      labels.append(batch['labels'].cpu())     # transfer labels to CPU and append to list
      preds.append(output['logits'].cpu())     # transfer logits to CPU and append to list
  all_preds = torch.cat((preds), 0).argmax(-1) # stack predicted logits and retrieve best predictions
  all_labels = torch.cat((labels), 0)          # stack true labels
  acc = accuracy_score(all_labels, all_preds)            # calculate validation accuracy
  mf1 = f1_score(all_labels, all_preds, average='macro') # calculate validation macro-f1
  print(f'accuracy: {round(acc, 3)}') 
  print(f'macro-f1: {round(mf1, 3)}')


Epoch: 0
accuracy: 0.806
macro-f1: 0.607
Epoch: 1
accuracy: 0.808
macro-f1: 0.61
Epoch: 2
accuracy: 0.811
macro-f1: 0.633
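
The gap between accuracy and the F1 score reported above comes from macro averaging, which gives every class equal weight regardless of how often it occurs. The sketch below computes both metrics by hand on toy labels (the data are invented, and the helper functions are written for this example rather than taken from scikit-learn).

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)  # F1 is 0 with no true positives
    return sum(scores) / len(scores)

toy_true = [0] * 8 + [1] * 2      # class 0 is common, class 1 is rare
toy_pred = [0] * 10               # a classifier that always predicts the common class
print(round(accuracy(toy_true, toy_pred), 3))   # 0.8
print(round(macro_f1(toy_true, toy_pred), 3))   # 0.444
```

A classifier that ignores the rare class still scores 80% accuracy here, but its macro-F1 falls to about 0.44, which is why macro-F1 is a useful check when some codes are much rarer than others.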


## Saving the Model

Training neural networks can take a very long time so we often want to save the resulting model. We can save it and reload it as follows.

In [None]:
# save the model
torch.save(model, 'path_for_my_trained_model')

## Reloading the Model

In [None]:
# reload the model
reloaded_model = torch.load('path_for_my_trained_model')

## Using the Model

Using the model to generate predictions is similar to using the model for validation. We will put the data into a PyTorch Dataset and DataLoader, and then retrieve batches, calculate predictions and probabilities on those batches, and finally save the results to a dataframe.

In [None]:
pred_loader = DataLoader(valid_dataset, 
                         batch_size=64,                            # number of validation examples in each batch (can usually be double training)
                         sampler=SequentialSampler(valid_dataset), # sample batches in order
                         collate_fn=pad_and_collate,
                         pin_memory=True, # optimize memory transfer from CPU to GPU
                         drop_last=False) # don't drop partial batches (we want predictions for all examples)

In [None]:
preds = []                    # list that will accumulate all our batch predictions
with torch.no_grad():         # disable gradient tracking, it's expensive and only useful for training
  reloaded_model.eval()       # set model to evaluation mode so dropout is deterministic
  for batch in pred_loader:   # iterate through sequential batches of validation data
    batch = {k: v.cuda() for k, v in batch.items()} # transfer the batch to the GPU
    output = reloaded_model(**batch)   # generate predictions on batch with the reloaded model
    preds.append(output['logits'].cpu())   # append the logits to preds
  all_preds = torch.cat(preds, dim=0)      # concatenate our list of preds into one big array [n_valid, n_labels]
  all_probs = torch.softmax(all_preds, -1) # calculate the probabilities associated with our logits [n_valid, n_labels]
best_preds = all_probs.argmax(dim=1)       # calculate the highest probability prediction code index for each example
best_pred_probs = all_probs.max(dim=1)[0]  # calculate the probability for each prediction

In [None]:
# convert our predictions, which are code indexes, back to their string values
df_valid['PRED_CODE'] = labeler.inverse_transform(best_preds)
df_valid['PRED_PROB'] = best_pred_probs
df_valid[['NARRATIVE', 'INJ_BODY_PART', 'PRED_CODE', 'PRED_PROB']].sample(10)

| | NARRATIVE | INJ_BODY_PART | PRED_CODE | PRED_PROB |
|---|---|---|---|---|
| 547 | The splitter operator pushed stone through jaw... | FINGER(S)/THUMB | FINGER(S)/THUMB | 0.981081 |
| 36888 | EE was cleaning out tracks on 230 excavator Jo... | KNEE/PATELLA | KNEE/PATELLA | 0.992641 |
| 23870 | DEHYDRATION | BODY SYSTEMS | BODY SYSTEMS | 0.9621 |
| 2144 | Late afternoon, cloudy, misting rain. | WRIST | NECK | 0.213622 |
| 814 | Changing hose on double boom bolter and, when ... | HAND (NOT WRIST OR FINGERS) | HAND (NOT WRIST OR FINGERS) | 0.975269 |
| 475 | Employee states that while being loaded by 106... | NECK | NECK | 0.972752 |
| 10572 | Employee states that he injured his back while... | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | 0.986079 |
| 31259 | While attempting to cut a zip tie, safety lock... | FINGER(S)/THUMB | FINGER(S)/THUMB | 0.661779 |
| 21391 | The employee was installing roof screen while ... | FINGER(S)/THUMB | FINGER(S)/THUMB | 0.984873 |
| 37041 | Ind. was clawing when he felt alot of pain in ... | KNEE/PATELLA | KNEE/PATELLA | 0.989116 |


# Training the Model using the Trainer (optional)
It is easy to make mistakes when constructing the training loop by hand, so the transformers library also provides the Trainer class, which abstracts away the training, optimization, and validation.

In [None]:
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score

def get_metrics(eval_prediction):
  """ 
  Function specifying the metrics to periodically calculate on validation data 
  Input:
    eval_prediction: evaluation object with two attributes
      label_ids: the true labels
      predictions: the model predictions
  Output:
    Dict[metric_name, metric_value]
  """
  y_true = eval_prediction.label_ids
  y_pred = torch.from_numpy(eval_prediction.predictions).softmax(-1).argmax(axis=1)
  acc = accuracy_score(y_true, y_pred)
  mf1 = f1_score(y_true, y_pred, average='macro')
  return {'accuracy': acc,
          'macro-f1': mf1}

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total # of training epochs (i.e. iterations through training dataset)
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation (can usually be about 2x train_batch_size because no gradients are calculated)
    warmup_steps=500,                # number of warmup steps for learning rate scheduler (often useful to gradually increase lr for transformers)
    weight_decay=0.01,               # strength of weight decay (decoupled, L2-style regularization)
    logging_dir='./logs',            # directory where logs are stored
    save_strategy='epoch',           # when to save checkpoints of our model (epoch means at the end of every epoch)
    save_total_limit=3,              # the maximum number of model checkpoints to save (by default they are saved every 500 steps)
    do_eval=True,                    # whether to periodically evaluate on the validation data
    evaluation_strategy='epoch',     # how often we will evaluate (epoch means at the end of each epoch)
    group_by_length=True,            # use the length sampler for training
    learning_rate=1e-4               # the optimizer's learning rate, i.e. how much we adjust the weights at each step
)

trainer = Trainer(
    model=model,                         # the model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=get_metrics          # metrics that we want computed
)
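
To give a feel for what `warmup_steps` does, the sketch below computes the learning rate at a few steps, assuming the Trainer is left on its default linear schedule (linear ramp up during warmup, then linear decay to zero). `linear_warmup_lr` is a hypothetical helper, and the total of 1168 steps is taken from the `global_step` this two-epoch run reports in its training output.

```python
def linear_warmup_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=1168):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)          # warmup: 0 up to base_lr
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)  # decay: base_lr down to 0

for step in (0, 250, 500, 834, 1168):
    print(step, linear_warmup_lr(step))
```

With these settings the learning rate peaks at step 500 and is already back to half its peak by step 834, so nearly half of this short run is spent warming up.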

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Macro-f1 | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|
| 1 | 0.4159 | 0.757202 | 0.808791 | 0.620547 | 24.9924 | 361.39 |
| 2 | 0.3481 | 0.779492 | 0.818313 | 0.626997 | 24.6488 | 366.428 |


TrainOutput(global_step=1168, training_loss=0.36934691912507356, metrics={'train_runtime': 276.8585, 'train_samples_per_second': 4.219, 'total_flos': 677350883466708.0, 'epoch': 2.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 2490368, 'train_mem_gpu_alloc_delta': 548987904, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 1982753280})

## Evaluate the Model Using the Trainer
Evaluation happens on the eval (validation) data.

In [None]:
trainer.evaluate()

{'epoch': 2.0,
 'eval_accuracy': 0.8183126660761736,
 'eval_loss': 0.7794921398162842,
 'eval_macro-f1': 0.6269974132894985,
 'eval_mem_cpu_alloc_delta': 0,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 262299648,
 'eval_runtime': 24.5903,
 'eval_samples_per_second': 367.299}

## Saving and Reloading the Trainer Model
When using the trainer, the underlying model is attached to the trainer as an attribute. We can access and save it as follows:

In [None]:
torch.save(trainer.model, 'my_torch_model')

In [None]:
my_reloaded_model = torch.load('my_torch_model')

## Resuming Training
Alternatively, you can resume training from a previous checkpoint by passing the checkpoint path to the `train` method.



In [None]:
training_args.num_train_epochs += 2      # add two more training epochs (epochs 3 and 4)

trainer = Trainer(
    model=model,                         # the model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=get_metrics          # metrics that we want computed
)

trainer.train(r'./results/checkpoint-1168')

| Epoch | Training Loss | Validation Loss | Accuracy | Macro-f1 | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|
| 3 | 0.2256 | 0.886861 | 0.812223 | 0.6507 | 24.924 | 362.382 |
| 4 | 0.1788 | 0.951256 | 0.81333 | 0.628196 | 24.7869 | 364.386 |


TrainOutput(global_step=2336, training_loss=0.08691935179984733, metrics={'train_runtime': 319.8689, 'train_samples_per_second': 7.303, 'total_flos': 1355118973466304.0, 'epoch': 4.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -10366976, 'train_mem_gpu_alloc_delta': 1083420672, 'train_mem_cpu_peaked_delta': 13570048, 'train_mem_gpu_peaked_delta': 1982733824})
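
The checkpoint directory name used above, `checkpoint-1168`, is the number of optimizer steps completed when the checkpoint was written. The arithmetic can be checked with a short sketch, assuming a single device and no gradient accumulation; `optimizer_steps` is a hypothetical helper and the row count comes from the training data summary printed earlier.

```python
import math

def optimizer_steps(n_examples, batch_size, epochs, drop_last=False):
    """Number of batches (and so optimizer steps) processed over the given epochs."""
    if drop_last:
        per_epoch = n_examples // batch_size            # the partial final batch is discarded
    else:
        per_epoch = math.ceil(n_examples / batch_size)  # the partial final batch is kept
    return per_epoch * epochs

print(optimizer_steps(18681, 32, epochs=2))                  # 1168
print(optimizer_steps(18681, 32, epochs=4))                  # 2336
print(optimizer_steps(18681, 32, epochs=1, drop_last=True))  # 583
```

The first two values match the `global_step` reported after two and four epochs, and the last matches the 583 training batches shown by the progress bar of the hand-written loop, which sets `drop_last=True`.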