<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Transformers%20for%20Text%20Classification%20from%20Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Text Classification with Transformers and Pretraining

This tutorial describes how to continue pretraining an already pretrained language model on domain-specific data, and then fine-tune it for classification.

## Install Libraries

In [None]:
!pip install transformers==3.4.0



## Download and Prepare Data

In [None]:
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

--2020-11-20 20:59:58--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2020-11-20 20:59:58--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx.2’


2020-11-20 20:59:58 (71.3 MB/s) - ‘msha.xlsx.2’ saved [4183086/4183086]



In [None]:
import pandas as pd
from sklearn import preprocessing

# read in the data
df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].dt.year
# convert part codes to numeric indicators
labeler = preprocessing.LabelEncoder()
labeler.fit(df['INJ_BODY_PART'])
df['PART_CODE'] = labeler.transform(df['INJ_BODY_PART'])
# separate pretraining, training, and validation data
# To simulate the common scenario where we have more unlabeled pretraining data
# than labeled data, use all available 2010 and 2011 data for pretraining and
# only a small sample of it for supervised training.
df_pretrain = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy().reset_index(drop=True)
df_train = df_pretrain.sample(1000).copy().reset_index(drop=True)
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy().sample(1000).reset_index(drop=True)
# show the results
print('n_classes:', len(df['INJ_BODY_PART'].unique()))
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))
df[['INJ_BODY_PART', 'PART_CODE', 'NARRATIVE', 'ACCIDENT_YEAR']].head()

n_classes: 46
training rows: 1000
validation rows: 1000


| | INJ_BODY_PART | PART_CODE | NARRATIVE | ACCIDENT_YEAR |
|---|---|---|---|---|
| 0 | SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA) | 35 | Cleaning out Gabion Grizzly, Rocks get Jammed... | 2010 |
| 1 | SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA) | 35 | Injured was walking in the pit area, stepped o... | 2010 |
| 2 | HIPS (PELVIS/ORGANS/KIDNEYS/BUTTOCKS) | 22 | Employee, parked s/c on grade at 16-Block #3 E... | 2012 |
| 3 | ANKLE | 1 | Contractor employee working as a carpenter mis... | 2013 |
| 4 | FINGER(S)/THUMB | 16 | The employee's finger was pinched between the ... | 2011 |
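The `PART_CODE` column is just an integer stand-in for each body part label. As a rough illustration of the kind of mapping `LabelEncoder` builds (distinct labels in sorted order, numbered from 0), here is a plain-Python sketch. The sample labels and the names `body_parts`, `classes`, and `label_to_code` are made up for illustration and are not part of the notebook.

```python
# Plain-Python sketch of a label-to-integer mapping (illustrative values only).
body_parts = ['ANKLE', 'FINGER(S)/THUMB', 'ANKLE', 'KNEE']  # hypothetical sample

# sort the distinct labels so that codes are assigned deterministically
classes = sorted(set(body_parts))
label_to_code = {label: code for code, label in enumerate(classes)}

# encode, then decode again to confirm the mapping is reversible
codes = [label_to_code[label] for label in body_parts]
decoded = [classes[code] for code in codes]

print(classes)
print(codes)
```

In the notebook itself, the fitted `labeler` plays the role of `classes`: `labeler.classes_` holds the ordered labels and `labeler.inverse_transform` maps codes back to label strings.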


# Domain-Specific Language Model Pretraining (Optional)

Recent research such as [ULMFiT](https://arxiv.org/abs/1801.06146) and [Don't Stop Pretraining](https://arxiv.org/abs/2004.10964) suggests that models pretrained on general-purpose language modeling can often be improved by additional task-specific language modeling. We illustrate how to do this below by further pretraining an already pretrained model on MSHA data. Note that because we are performing the language model pretraining task, we do not use the part codes assigned to the individual cases. This approach allows us to make use of large amounts of unlabeled task-specific data.

### Specify the Language Model



In [None]:
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer

#model_name = 'roberta-base'
model_name = 'distilbert-base-uncased'
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The model takes `input_ids` and `attention_mask` as inputs. Both are generated by the tokenizer:

In [None]:
tokenizer.encode_plus('a man fell while lifting a ladder')

{'input_ids': [101, 1037, 2158, 3062, 2096, 8783, 1037, 10535, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
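In the output above, 101 and 102 are the ids of the `[CLS]` and `[SEP]` special tokens that the tokenizer adds around the text. The attention mask becomes important once sequences are padded to a common length: it marks real tokens with 1 and padding with 0. The following is a simplified sketch of that padding step, not the tokenizer's actual implementation; `pad_to_length` is a hypothetical helper, and the pad id of 0 is an assumption that matches this vocabulary.

```python
def pad_to_length(input_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching mask."""
    # simplification: a real tokenizer keeps the final [SEP] when truncating
    ids = input_ids[:max_len]
    n_pad = max_len - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,          # padded token ids
        'attention_mask': [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# token ids copied from the tokenizer output above
ids = [101, 1037, 2158, 3062, 2096, 8783, 1037, 10535, 102]
padded = pad_to_length(ids, max_len=12)
print(padded)
```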

### Test the language model

A masked language model attempts to predict missing words (tokens) using the surrounding words. We can verify that our model is indeed pretrained by testing it out. One simple way is to use the `fill-mask` pipeline, as follows. In this case we are asking the model to complete the prompt `"The worker sprained his [blank]"`, where `[blank]` is the tokenizer's mask token.

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

fill_mask(f'The worker sprained his {tokenizer.mask_token}')

[{'score': 0.33903393149375916,
  'sequence': '[CLS] the worker sprained his ankle [SEP]',
  'token': 10792,
  'token_str': 'ankle'},
 {'score': 0.07981275767087936,
  'sequence': '[CLS] the worker sprained his wrist [SEP]',
  'token': 7223,
  'token_str': 'wrist'},
 {'score': 0.04735306650400162,
  'sequence': '[CLS] the worker sprained his knee [SEP]',
  'token': 6181,
  'token_str': 'knee'},
 {'score': 0.04560253396630287,
  'sequence': '[CLS] the worker sprained his foot [SEP]',
  'token': 3329,
  'token_str': 'foot'},
 {'score': 0.044954314827919006,
  'sequence': '[CLS] the worker sprained his neck [SEP]',
  'token': 3300,
  'token_str': 'neck'}]

### Convert Data into PyTorch Datasets / Samplers / Loaders

PyTorch provides four optional utilities to assist with managing data, especially when that data is large or requires expensive processing that we would like to parallelize and conduct on the fly. These are as follows:
* PyTorch Dataset - a representation of the input data accessible either by index or by iteration
* PyTorch Sampler - a mechanism for sampling indexes from the Dataset (typically either sequential or random)
* PyTorch collate_fn - a function for collating samples into batches
* PyTorch DataLoader - a mechanism that combines the other three to produce batches. With worker processes it can prepare upcoming batches while the GPU is busy, so the next batch is ready as soon as the GPU is free.

An added advantage of using these data structures is that other libraries (like Transformers and PyTorch Lightning) are designed to work with them.
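To make the division of labor between these four pieces concrete, here is a conceptual sketch in plain Python. It imitates the roles only and is not how PyTorch implements them; `random_sampler`, `collate`, and `data_loader` are hypothetical stand-ins, and the toy dataset is made up.

```python
import random

# "Dataset": anything that supports len() and lookup by index
dataset = [{'input_ids': [101, i, 102]} for i in range(10)]

def random_sampler(n, seed=0):
    """Yield every index from 0..n-1 exactly once, in shuffled order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    yield from order

def collate(rows):
    """Turn a list of per-row dicts into one dict of lists (a batch)."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def data_loader(dataset, batch_size, sampler):
    """Group sampled indices into batches and collate the matching rows."""
    indices = []
    for index in sampler:
        indices.append(index)
        if len(indices) == batch_size:
            yield collate([dataset[i] for i in indices])
            indices = []
    if indices:  # final, smaller batch
        yield collate([dataset[i] for i in indices])

batches = list(data_loader(dataset, batch_size=4,
                           sampler=random_sampler(len(dataset))))
print([len(batch['input_ids']) for batch in batches])
```

The real `DataLoader` adds features this sketch leaves out, such as worker processes and tensor handling.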

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

class CustomDataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len, input_field, target_field=None):
    self.tokenizer = tokenizer
    self.dataframe = dataframe
    self.target_field = target_field
    self.input_field = input_field
    self.max_len = max_len

  def __len__(self):
    return len(self.dataframe)

  def __getitem__(self, index):
    # look up the narrative text for this row (avoid shadowing the builtin `input`)
    text = self.dataframe[self.input_field][index]
    inputs = self.tokenizer.encode_plus(
        text, None, add_special_tokens=True, max_length=self.max_len,
        padding='max_length', truncation=True, return_token_type_ids=False)
    # if we know the code, i.e. we're using this for training, add the target
    if self.target_field:
        inputs['labels'] = torch.tensor(self.dataframe[self.target_field][index], dtype=torch.long)
    return inputs

In [None]:
train_dataset = CustomDataset(dataframe=df_pretrain, tokenizer=tokenizer, max_len=200,
                              input_field='NARRATIVE')
valid_dataset = CustomDataset(dataframe=df_valid, tokenizer=tokenizer, max_len=200,
                              input_field='NARRATIVE')

Example of retrieving a row of data from our dataset by index

In [None]:
train_dataset[0]

{'input_ids': [101, 9344, 2041, 11721, 26282, 2078, 24665, 29266, 1010, 5749, 2131, 21601, 1998, 7861, 22086, 4402, 2038, 2000, 2593, 29198, 2030, 5245, 1996, 5749, 2041, 1997, 1996, 24665, 29266, 1012, 7904, 2001, 2478, 1037, 22889, 24225, 8691, 2000, 2131, 1996, 5749, 4558, 1998, 2766, 2242, 1999, 2010, 2157, 3244, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Example of sampling an index from the dataset using the sampler

In [None]:
sampler = RandomSampler(train_dataset)
sampled_index = next(iter(sampler))
print(sampled_index)

7409


### Collator

The job of the collator is to group the sampled rows into batches. For language model pretraining we give it the added role of assembling the "targets" for prediction, i.e. the words that we want the model to predict from the context. By default this collator produces BERT-style masked language modeling targets: 15% of the tokens are chosen as prediction targets (their label is the original token id rather than -100, the value the loss ignores). Of these, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged.
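The masking rule described above can be sketched in plain Python. This is a simplified illustration, not the code used by `DataCollatorForLanguageModeling`; `mask_tokens` is a hypothetical function, and the special token ids and vocabulary size are assumptions that match the uncased BERT vocabulary used in this notebook.

```python
import random

MASK_ID, IGNORE = 103, -100   # assumed [MASK] id; label value the loss ignores
SPECIAL_IDS = {0, 101, 102}   # assumed [PAD], [CLS], [SEP] ids, never selected
VOCAB_SIZE = 30522            # assumed vocabulary size

def mask_tokens(input_ids, rng, mlm_probability=0.15):
    """Return (masked_ids, labels) following the 15% / 80-10-10 rule."""
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for i, token in enumerate(input_ids):
        # skip special tokens and the ~85% of tokens that are not selected
        if token in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue
        labels[i] = token                  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID            # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the original token in place
    return masked, labels

rng = random.Random(0)
ids = [101] + list(range(2000, 2040)) + [102] + [0] * 8
masked, labels = mask_tokens(ids, rng)
print(masked)
print(labels)
```

Positions labeled -100 are always left untouched in the input, which is why padding and special tokens never appear as prediction targets.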

In [None]:
from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
collator([train_dataset[0]])

{'input_ids': tensor([[  101,  9344,  2041, 11721, 26282,  2078, 24665, 29266, 14823,   103,
           2131, 21601,  1998,  7861,   103,  4402,  2038,  2000,  2593, 29198,
           2030,  5245,  1996,  5749,  2041,  1997,   103, 24665,   103,  1012,
           7904,  2001,  2478,   103, 22889, 24225,  8691,  2000,  2131,  1996,
           5749,  4558,  1998,  2766,   103,  1999,  2010,  2157,  3244,   103,
            102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,    

In [None]:
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, 
                          sampler=RandomSampler(train_dataset), collate_fn=collator)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, 
                          sampler=SequentialSampler(valid_dataset), collate_fn=collator)

Example of pulling a batch from our dataloader

In [None]:
batch = next(iter(train_loader))
for k, v in batch.items():
  batch[k] = v.to(torch.device('cuda'))
print(batch)

{'input_ids': tensor([[  101,  3384, 27546,  ...,     0,     0,     0],
        [  101,   103,  2001,  ...,     0,     0,     0],
        [  101, 25212,  2988,  ...,     0,     0,     0],
        ...,
        [  101,  2096,  4895,  ...,     0,     0,     0],
        [  101,  7904,   103,  ...,     0,     0,     0],
        [  101,  2096,  5094,  ...,     0,     0,     0]], device='cuda:0'), 'labels': tensor([[-100, -100, -100,  ..., -100, -100, -100],
        [-100, 6778, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        ...,
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, 2001,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100]], device='cuda:0')}


In [None]:
batch['labels'].shape

torch.Size([32, 200])
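Although the labels tensor has one entry per token position, only the positions whose label is not -100 contribute to the loss (-100 is the default `ignore_index` of PyTorch's cross-entropy loss). A minimal sketch of that averaging in plain Python, using made-up per-token loss values and a hypothetical `mean_loss` helper:

```python
IGNORE = -100

def mean_loss(per_token_loss, labels):
    """Average the loss over positions whose label is not the ignore value."""
    kept = [loss for loss, label in zip(per_token_loss, labels)
            if label != IGNORE]
    return sum(kept) / len(kept)

labels = [-100, 6778, -100, 2001, -100]
per_token_loss = [9.0, 2.0, 9.0, 1.0, 9.0]  # made-up values for illustration
print(mean_loss(per_token_loss, labels))    # only the 2.0 and 1.0 are averaged
```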

In [None]:
import transformers
from sklearn.metrics import accuracy_score, f1_score

model = model.to(torch.device('cuda'))
optimizer = transformers.AdamW(params=model.parameters(), lr=1e-4)

for epoch in range(2):
  print(f'Epoch: {epoch}')
  training_loss = []
  # set the model to training mode so things like dropout behave correctly
  model.train()
  for idx, batch in enumerate(train_loader):
    # send the batch to cuda
    for k, v in batch.items():
      batch[k] = v.to(torch.device('cuda'))
    # run the model on the training batch; because the batch includes 'labels',
    # the first output is the masked language modeling loss
    loss, pred = model(**batch)
    # calculate change in loss with respect to parameters (i.e. gradient)
    loss.backward()
    # store the loss as a plain float rather than a tensor on the GPU
    training_loss.append(loss.item())
    # adjust the parameters in the direction that reduces loss as measured by gradient
    optimizer.step()
    # zero out the gradient as we're now moving on to the next training batch
    optimizer.zero_grad()
  print(f'training_loss {torch.tensor(training_loss).mean()}')
  print('validating')
  # at the end of each training epoch, calculate the loss on the validation data
  with torch.no_grad():
    # set the model to evaluate mode so things like dropout are no longer random
    model.eval()
    valid_loss = []
    for idx, batch in enumerate(valid_loader):
      # send the batch to cuda
      for k, v in batch.items():
        batch[k] = v.to(torch.device('cuda'))
      loss, pred = model(**batch)
      valid_loss.append(loss.item())
      if idx % 5 == 0:
        print(f'Epoch: {epoch} valid loss: {torch.tensor(valid_loss[-5:]).mean()}')
  print(f'average valid loss: {torch.tensor(valid_loss).mean()}')


Epoch: 0
training_loss 2.2565858364105225
validating
Epoch: 0 valid loss: 2.022676706314087
Epoch: 0 valid loss: 2.050161600112915
Epoch: 0 valid loss: 2.109633684158325
Epoch: 0 valid loss: 2.021721363067627
Epoch: 0 valid loss: 1.707619309425354
Epoch: 0 valid loss: 1.9760822057724
Epoch: 0 valid loss: 1.8700177669525146
average valid loss: 1.961233377456665
Epoch: 1
training_loss 1.9429348707199097
validating
Epoch: 1 valid loss: 2.0210466384887695
Epoch: 1 valid loss: 1.7757422924041748
Epoch: 1 valid loss: 1.9089982509613037
Epoch: 1 valid loss: 1.9186903238296509
Epoch: 1 valid loss: 1.7960214614868164
Epoch: 1 valid loss: 1.920635461807251
Epoch: 1 valid loss: 1.9651975631713867
average valid loss: 1.8740062713623047


Verify that the language model has been further trained by examining predictions.

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)
model.to(torch.device('cpu'))
fill_mask(f'The worker sprained his {tokenizer.mask_token}')

[{'score': 0.3544308841228485,
  'sequence': '[CLS] the worker sprained his ankle [SEP]',
  'token': 10792,
  'token_str': 'ankle'},
 {'score': 0.1934514343738556,
  'sequence': '[CLS] the worker sprained his knee [SEP]',
  'token': 6181,
  'token_str': 'knee'},
 {'score': 0.11106089502573013,
  'sequence': '[CLS] the worker sprained his wrist [SEP]',
  'token': 7223,
  'token_str': 'wrist'},
 {'score': 0.07629260420799255,
  'sequence': '[CLS] the worker sprained his back [SEP]',
  'token': 2067,
  'token_str': 'back'},
 {'score': 0.05465313047170639,
  'sequence': '[CLS] the worker sprained his shoulder [SEP]',
  'token': 3244,
  'token_str': 'shoulder'}]

### Save Pretrained Language Model

In [None]:
model.save_pretrained('msha_pretrain')

# Train Model for Text Classification

## Load Pretrained Language Model for Text Classification
To use our pretrained language model for text classification we replace its language modeling head with a newly initialized classification head and then fine-tune the model on our classification task. We can easily do this in Transformers as follows:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('msha_pretrain', num_labels=len(labeler.classes_))
print(model)

Some weights of the model checkpoint at msha_pretrain were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at msha_pretrain and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'class

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Prepare Data for Text Classification
This requires only a slight modification to our preparations for language model pretraining. Instead of using the language model collator to generate masked words for prediction, we will use the "part of body" codes from the MSHA dataset as our targets.

In [None]:
train_dataset = CustomDataset(dataframe=df_train, tokenizer=tokenizer, max_len=200,
                              input_field='NARRATIVE', target_field='PART_CODE')
valid_dataset = CustomDataset(dataframe=df_valid, tokenizer=tokenizer, max_len=200,
                              input_field='NARRATIVE', target_field='PART_CODE')

In [None]:
def collate_for_classification(sampled_rows):
  keys = sampled_rows[0].keys()
  batch = {key: [] for key in keys}
  # assemble the rows into lists of tensors for each input
  for row in sampled_rows:
    for k, v in row.items():
      # as_tensor avoids the copy-construct warning when v is already a tensor (the labels)
      batch[k].append(torch.as_tensor(v, dtype=torch.long))
  # stack the list of tensors into one big tensor and move it to the GPU
  for k, v in batch.items():
    batch[k] = torch.stack(v).to(torch.device('cuda'))
  return batch

In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, 
                          sampler=RandomSampler(train_dataset), collate_fn=collate_for_classification)
valid_loader = DataLoader(valid_dataset, batch_size=16, 
                          sampler=SequentialSampler(valid_dataset), collate_fn=collate_for_classification)

In [None]:
next(iter(train_loader))


{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'),
 'input_ids': tensor([[ 101, 8430, 8479,  ...,    0,    0,    0],
         [ 101, 7904, 2001,  ...,    0,    0,    0],
         [ 101, 4542, 1998,  ...,    0,    0,    0],
         ...,
         [ 101, 7904, 2001,  ...,    0,    0,    0],
         [ 101, 7904, 2001,  ...,    0,    0,    0],
         [ 101, 7904, 2018,  ...,    0,    0,    0]], device='cuda:0'),
 'labels': tensor([16,  3, 35,  9, 37, 16, 16,  4, 31, 35, 11, 17,  8,  3, 29, 21],
        device='cuda:0')}

## Training the Model

### Option 1: Create a custom training loop

In [None]:
import transformers
from sklearn.metrics import accuracy_score, f1_score

model = model.to(torch.device('cuda'))
optimizer = transformers.AdamW(params=model.parameters(), lr=1e-4)

for epoch in range(5):
  print(f'Epoch: {epoch}')
  training_loss = []
  # set the model to training mode so things like dropout behave correctly
  model.train()
  for idx, batch in enumerate(train_loader):
    # calculate the model loss and predictions on our training batch
    loss, pred = model(**batch)
    # calculate change in loss with respect to parameters (i.e. gradient)
    loss.backward()
    # store the loss as a plain float rather than a tensor on the GPU
    training_loss.append(loss.item())
    # adjust the parameters in the direction that reduces loss as measured by the gradient
    optimizer.step()
    # zero out the gradient as we're now moving on to the next batch
    optimizer.zero_grad()
  print(f'training_loss {torch.tensor(training_loss).mean()}')
  print('validating')
  # at the end of each epoch, calculate the loss and accuracy on the validation data
  with torch.no_grad():
    # set the model to evaluate mode so things like dropout are no longer random
    model.eval()
    valid_loss = []
    pred_codes = []
    true_codes = []
    for idx, batch in enumerate(valid_loader):
      # the collate function has already moved the batch to the GPU
      loss, pred = model(**batch)
      valid_loss.append(loss.item())
      true_codes.append(batch['labels'].cpu())
      pred_codes.append(pred.cpu())
      if idx % 5 == 0:
        print(f'Epoch: {epoch} valid loss: {torch.tensor(valid_loss[-5:]).mean()}')
  print(f'average valid loss: {torch.tensor(valid_loss).mean()}')
  all_true = torch.cat(true_codes, dim=0)
  all_pred = torch.cat(pred_codes, dim=0).argmax(dim=1)
  acc = accuracy_score(all_true, all_pred)
  print(f'valid accuracy: {acc}')
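The last lines of the loop above turn the model's raw scores (logits) into predicted codes by taking the highest-scoring class for each row, then compare them with the true codes. A small plain-Python sketch of that step, with made-up logits and hypothetical helper names `argmax` and `accuracy`:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(true_codes, logits):
    """Share of rows whose highest-scoring class equals the true code."""
    predicted = [argmax(row) for row in logits]
    correct = sum(p == t for p, t in zip(predicted, true_codes))
    return correct / len(true_codes)

# made-up logits for 4 examples and 3 classes
logits = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 3.0],
          [2.2, 2.1, 0.0]]
true_codes = [1, 0, 2, 1]
print(accuracy(true_codes, logits))  # 3 of the 4 rows are predicted correctly
```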



Epoch: 0


training_loss 2.5757479667663574
validating
Epoch: 0 valid loss: 1.5770251750946045
Epoch: 0 valid loss: 1.302986979484558
Epoch: 0 valid loss: 1.8998000621795654
Epoch: 0 valid loss: 1.5334079265594482
Epoch: 0 valid loss: 1.8177725076675415
Epoch: 0 valid loss: 1.435440182685852
Epoch: 0 valid loss: 1.7725677490234375
Epoch: 0 valid loss: 1.8031257390975952
Epoch: 0 valid loss: 1.478172779083252
Epoch: 0 valid loss: 1.5638761520385742
Epoch: 0 valid loss: 1.5756371021270752
Epoch: 0 valid loss: 1.524620771408081
Epoch: 0 valid loss: 1.6226778030395508
average valid loss: 1.612066626548767
valid accuracy: 0.646
Epoch: 1
training_loss 1.3201613426208496
validating
Epoch: 1 valid loss: 0.9834232926368713
Epoch: 1 valid loss: 0.9454551935195923
Epoch: 1 valid loss: 1.4750162363052368
Epoch: 1 valid loss: 1.101564645767212
Epoch: 1 valid loss: 1.3577848672866821
Epoch: 1 valid loss: 1.0907435417175293
Epoch: 1 valid loss: 1.207595944404602
Epoch: 1 valid loss: 1.4173157215118408
Epoch: 1 

### Option 2: Use the Transformers Trainer
It is easy to make mistakes when constructing the training loop by hand, so the Transformers library also provides the Trainer class, which abstracts away the training, optimization, and validation. This makes it easier to train the model, but harder to debug or customize the training loop.

By default the trainer expects the model to produce the loss as the first in a tuple of outputs when "labels" are provided as an input. This is already the default behavior of Transformers models so no additional modifications are necessary in our case.

In [None]:
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

def get_metrics(eval_prediction):
  y_true = eval_prediction.label_ids
  # softmax does not change which class scores highest, so argmax on the logits is enough
  y_pred = eval_prediction.predictions.argmax(axis=-1)
  acc = accuracy_score(y_true, y_pred)
  return {'accuracy': acc}

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    save_total_limit=3,
    do_eval=True,
    evaluation_strategy='epoch',
    learning_rate=1e-4
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    data_collator=collate_for_classification,
    compute_metrics=get_metrics          # metrics that we want computed
)

In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.970974 | 0.758 |
| 2 | No log | 0.958075 | 0.757 |
| 3 | No log | 0.959449 | 0.756 |


TrainOutput(global_step=96, training_loss=0.22440765301386514)
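The `global_step=96` in the output can be checked by hand from the settings above, and the same arithmetic shows how the step count compares with `warmup_steps=500`. The sketch below assumes a single device and a linear warmup from zero to the peak learning rate; `warmup_lr` is a hypothetical helper written for illustration, not a Transformers function.

```python
import math

n_train, batch_size, epochs = 1000, 32, 3   # values from the cells above
warmup_steps, peak_lr = 500, 1e-4

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

def warmup_lr(step):
    """Learning rate during linear warmup (assumed schedule shape)."""
    return peak_lr * min(1.0, step / warmup_steps)

print(steps_per_epoch, total_steps)
print(warmup_lr(total_steps))
```

Under these assumptions training ends while the learning rate is still warming up, so a smaller `warmup_steps` may be worth trying on a dataset of this size.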

## Saving and Reloading the Trained Model
When using the trainer, the underlying model is attached to the trainer as an attribute. We can access and save it as follows:

In [None]:
torch.save(trainer.model, 'my_torch_model')

In [None]:
my_reloaded_model = torch.load('my_torch_model')