<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Sequence-Classification-intro" data-toc-modified-id="Sequence-Classification-intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Sequence Classification intro</a></span></li><li><span><a href="#Set-up-a-simple-dummy-training-batch" data-toc-modified-id="Set-up-a-simple-dummy-training-batch-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Set up a simple dummy training batch</a></span></li><li><span><a href="#Train-model" data-toc-modified-id="Train-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train model</a></span><ul class="toc-item"><li><span><a href="#Freeze-weight" data-toc-modified-id="Freeze-weight-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Freeze weight</a></span></li></ul></li><li><span><a href="#Hugging-face-Trainer" data-toc-modified-id="Hugging-face-Trainer-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Hugging face Trainer</a></span></li><li><span><a href="#Additional-metrics" data-toc-modified-id="Additional-metrics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Additional metrics</a></span></li></ul></div>

# Sequence Classification intro

https://huggingface.co/transformers/training.html

Let’s consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.

The library also includes a number of task-specific final layers or ‘heads’ whose weights are instantiated randomly when not present in the specified pre-trained model.

In [1]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [2]:
# Note: recent transformers releases deprecate this AdamW in favor of torch.optim.AdamW
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)

The optimizer allows us to apply different hyperparameters to specific parameter groups. For example, we can **apply weight decay to all parameters other than bias and layer normalization terms**:

In [5]:
for n,p in model.named_parameters():
    print(n,p.shape)

bert.embeddings.word_embeddings.weight torch.Size([30522, 768])
bert.embeddings.position_embeddings.weight torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight torch.Size([2, 768])
bert.embeddings.LayerNorm.weight torch.Size([768])
bert.embeddings.LayerNorm.bias torch.Size([768])
bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder

In [11]:
a=['ab','cd','ef']
b = 'a'
list(b in i for i in a)

[True, False, False]

In [12]:
no_decay = ['bias', 'LayerNorm.weight']
# group 1: weight decay of 0.01 for every parameter that is not a bias or a LayerNorm weight
# group 2: no weight decay for biases and LayerNorm weights
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
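The grouping above relies only on substring matching against parameter names. A minimal sketch of that test, using a handful of names copied from the `named_parameters()` listing above (the helper `uses_weight_decay` is a hypothetical name introduced here for illustration):

```python
# Same exclusion list as in the optimizer setup above.
no_decay = ['bias', 'LayerNorm.weight']

# A few parameter names copied from the named_parameters() output.
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.embeddings.LayerNorm.bias',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
]

def uses_weight_decay(name):
    """Hypothetical helper: True if no entry of no_decay is a substring of name."""
    return not any(nd in name for nd in no_decay)

decay_names = [n for n in names if uses_weight_decay(n)]
no_decay_names = [n for n in names if not uses_weight_decay(n)]

print("weight decay 0.01:", decay_names)
print("weight decay 0.0: ", no_decay_names)
```

Every name lands in exactly one of the two groups, which is what the two list comprehensions in the optimizer setup guarantee.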

# Set up a simple dummy training batch 

In [13]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', 
                     padding=True, truncation=True)


In [14]:
encoding

{'input_ids': tensor([[  101,  1045,  2293, 14255, 18684,  2099,  1012,   102,     0,     0,
             0,     0],
        [  101,  1045,  2123,  1005,  1056,  2729,  2005, 14255, 18684,  2099,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [15]:
for i in encoding['input_ids']:
    print(tokenizer.decode(i))

[CLS] i love pixar. [SEP] [PAD] [PAD] [PAD] [PAD]
[CLS] i don't care for pixar. [SEP]
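The encoding above pads the shorter sentence with `[PAD]` (id 0) and marks the padded positions with 0 in the attention mask. As a rough sketch of what `padding=True` produces, and not the tokenizer's actual implementation, the same result can be rebuilt in plain Python from the unpadded token ids (`pad_batch` is a hypothetical helper):

```python
PAD_ID = 0  # id of [PAD], as seen in the encoding output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded token ids of the two sentences, taken from the output above.
batch = [
    [101, 1045, 2293, 14255, 18684, 2099, 1012, 102],
    [101, 1045, 2123, 1005, 1056, 2729, 2005, 14255, 18684, 2099, 1012, 102],
]

padded_ids, mask = pad_batch(batch)
print(padded_ids[0])
print(mask[0])
```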


In [16]:
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

In [29]:
input_ids,input_ids.shape

(tensor([[  101,  1045,  2293, 14255, 18684,  2099,  1012,   102,     0,     0,
              0,     0],
         [  101,  1045,  2123,  1005,  1056,  2729,  2005, 14255, 18684,  2099,
           1012,   102]]),
 torch.Size([2, 12]))

# Train model

In [21]:
type(model)

transformers.models.bert.modeling_bert.BertForSequenceClassification

In [18]:
import torch

With the labels argument, the first returned element is the cross-entropy loss between the predictions and the passed labels.

In [22]:
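# shape (1, 2); the model flattens the labels before computing the loss,
# so this gives the same loss as torch.tensor([1, 0])
labels = torch.tensor([1,0]).unsqueeze(0)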
labels.shape

torch.Size([1, 2])

In [23]:
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

In [27]:
outputs

SequenceClassifierOutput(loss=tensor(0.6836, grad_fn=<NllLossBackward>), logits=tensor([[-0.6426, -0.3265],
        [-0.5096, -0.2707]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [32]:
outputs.logits

tensor([[-0.6426, -0.3265],
        [-0.5096, -0.2707]], grad_fn=<AddmmBackward>)

In [31]:
outputs.loss

tensor(0.6836, grad_fn=<NllLossBackward>)

In [33]:
loss = outputs.loss
loss.backward()
optimizer.step()

Equivalently, you can take only the logits from the model and compute the loss yourself.

In [None]:
# from torch.nn import functional as F
# labels = torch.tensor([1,0])
# outputs = model(input_ids, attention_mask=attention_mask)
# loss = F.cross_entropy(outputs.logits, labels)
# loss.backward()
# optimizer.step()
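To make the loss value concrete, the sketch below recomputes the mean cross entropy by hand from the logits printed above, using only the standard library. It assumes the labels `[1, 0]` and the rounded logits shown in the output, so the result agrees with the reported loss only up to rounding.

```python
import math

# Logits and labels as printed in the cells above (rounded to 4 decimals).
logits = [[-0.6426, -0.3265],
          [-0.5096, -0.2707]]
labels = [1, 0]

def cross_entropy(logits, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

loss = cross_entropy(logits, labels)
print(round(loss, 4))  # close to the 0.6836 reported by the model
```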

You can also use the library's learning rate scheduling tools.

In [None]:
# from transformers import get_linear_schedule_with_warmup
# scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps)
# ...
# loss.backward()
# optimizer.step()
# scheduler.step()
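A linear schedule with warmup scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then decays linearly back to 0. The sketch below shows that factor in plain Python; `linear_warmup_factor` is a hypothetical name and the step counts are illustrative values, not settings from this notebook.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Hypothetical helper: multiplier applied to the base learning rate at `step`."""
    if step < num_warmup_steps:
        # linear ramp from 0 up to 1 over the warmup steps
        return step / max(1, num_warmup_steps)
    # linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5            # same base learning rate as the optimizer above
num_warmup_steps = 2      # illustrative value
num_training_steps = 10   # illustrative value

for step in range(num_training_steps + 1):
    factor = linear_warmup_factor(step, num_warmup_steps, num_training_steps)
    print(step, base_lr * factor)
```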

## Freeze weight

To train only the classification head, freeze the encoder: set the requires_grad attribute to False on the encoder parameters, which can be accessed through the base_model submodule.

In [36]:
type(model.base_model)

transformers.models.bert.modeling_bert.BertModel

In [None]:
for param in model.base_model.parameters():
    param.requires_grad = False
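One way to check the effect of freezing is to list which parameters still require gradients. The sketch below uses a tiny stand-in module rather than BERT so that it runs without downloading a checkpoint; `TinyModel` and its layer sizes are assumptions for illustration, with an attribute named `base_model` to mirror the loop above.

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    """Hypothetical stand-in: an 'encoder' plus a classification head."""
    def __init__(self):
        super().__init__()
        self.base_model = nn.Linear(8, 4)   # plays the role of the encoder
        self.classifier = nn.Linear(4, 2)   # plays the role of the head

    def forward(self, x):
        return self.classifier(torch.relu(self.base_model(x)))

tiny = TinyModel()

# Same freezing loop as above, applied to the stand-in encoder.
for param in tiny.base_model.parameters():
    param.requires_grad = False

trainable = [name for name, p in tiny.named_parameters() if p.requires_grad]
num_trainable = sum(p.numel() for p in tiny.parameters() if p.requires_grad)
num_total = sum(p.numel() for p in tiny.parameters())

print(trainable)
print(num_trainable, "of", num_total, "parameters are trainable")
```

Only the head's weight and bias remain trainable, so an optimizer step leaves the frozen encoder untouched.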

# Hugging face Trainer

- Simple but feature-complete training and evaluation interface through Trainer()
- Train, fine-tune, and evaluate any 🤗 Transformers model with a wide range of training options


In [38]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TrainingArguments(
    output_dir='models/results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='models/logs',            # directory for storing logs
)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

In [39]:
# trainer = Trainer(
#     model=model,                         # the instantiated 🤗 Transformers model to be trained
#     args=training_args,                  # training arguments, defined above
#     train_dataset=train_dataset,         # training dataset
#     eval_dataset=test_dataset            # evaluation dataset
# )

Simply call trainer.train() to train and trainer.evaluate() to evaluate. 

You can use your own module as well, but the first element returned from forward must be the loss that you wish to optimize (just like the output of model(...) above).
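A minimal sketch of such a module, assuming a plain PyTorch classifier: `LossFirstClassifier`, its layer sizes, and the argument names `inputs` and `labels` are all hypothetical, and the forward argument names would need to match the keys of your dataset.

```python
import torch
from torch import nn

class LossFirstClassifier(nn.Module):
    """Hypothetical custom module whose forward returns the loss first."""
    def __init__(self, num_features=4, num_labels=2):
        super().__init__()
        self.linear = nn.Linear(num_features, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, inputs, labels=None):
        logits = self.linear(inputs)
        if labels is None:
            # no labels: nothing to optimize, return the logits only
            return (logits,)
        loss = self.loss_fn(logits, labels)
        # the loss comes first, followed by the logits
        return (loss, logits)

custom_model = LossFirstClassifier()
features = torch.randn(3, 4)          # 3 examples, 4 features each
targets = torch.tensor([0, 1, 1])     # one class index per example

loss, logits = custom_model(features, labels=targets)
print(loss.shape, logits.shape)
```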

# Additional metrics

To calculate metrics in addition to the loss, define your own compute_metrics function **and pass it to the trainer.**

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
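To show what these four numbers mean for binary labels, here is a plain-Python sketch that computes them by hand on a toy example. `binary_metrics` and the toy labels are assumptions for illustration; the function above relies on scikit-learn instead.

```python
def binary_metrics(labels, preds):
    """Hypothetical helper: accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        'accuracy': correct / len(pairs),
        'f1': f1,
        'precision': precision,
        'recall': recall,
    }

toy_labels = [1, 0, 1, 1]
toy_preds = [1, 0, 0, 1]   # one positive example is missed
metrics = binary_metrics(toy_labels, toy_preds)
print(metrics)
```

Here there are two true positives, no false positives and one false negative, so precision is 1.0, recall is 2/3, F1 is 0.8 and accuracy is 0.75.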