<font color='#3498DB'><center><h2>On Stability of Few-Sample Transformer Fine-Tuning</h2></center></font>
<br>

<font color='#3498DB'><h3>Introduction</h3></font>
Fine-tuning Transformer models tends to be unstable. Even with the same hyperparameter values (learning rate, batch size, etc.), different random seeds can lead to substantially different results. The issue is even more pronounced when fine-tuning the large Transformer variants on small datasets.

*This notebook dives into different aspects of the few-sample fine-tuning optimization process and the techniques around it. The goal is to better understand the remedies available for few-sample fine-tuning problems.*
<br>

<font color='#3498DB'><h3>Problem</h3></font>

The instability of the Transformer fine-tuning process has been known since the introduction of BERT, and since then various methods have been proposed to address it.

For example, in this competition we have only ~2.8k samples. When the data is divided into folds, each model receives only ~2.2k training examples, and the labels are noisy as well. One remedy that many of us have adopted for stability is to evaluate several times within each epoch rather than only at the end of each epoch.
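
To make the "evaluate within each epoch" idea concrete, here is a minimal sketch that works out after which batches validation would run. The fold count, batch size, and the helper name `eval_steps_within_epoch` are assumptions chosen for illustration; they are not taken from the competition setup.

```python
def eval_steps_within_epoch(num_batches, evals_per_epoch):
    """Return the 1-based batch indices after which to run validation."""
    interval = max(1, num_batches // evals_per_epoch)
    steps = list(range(interval, num_batches + 1, interval))
    if steps[-1] != num_batches:
        steps.append(num_batches)  # always evaluate at the end of the epoch
    return steps

# Assumed round numbers: ~2.8k samples, 5 folds, batch size 16.
n_samples, n_folds, batch_size = 2800, 5, 16
train_per_fold = n_samples * (n_folds - 1) // n_folds
num_batches = -(-train_per_fold // batch_size)  # ceiling division
steps = eval_steps_within_epoch(num_batches, evals_per_epoch=10)

print(train_per_fold, num_batches, steps)
```

In a training loop we would then validate (and checkpoint the best model) whenever the current batch index is in `steps`, instead of waiting for the epoch to finish.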
<br>

<font color='#3498DB'><h3>Solution</h3></font>
There are many recently proposed methods to increase few-sample fine-tuning stability and they show a significant performance improvement over simple finetuning methods. 
 - Debiasing Omission In BertADAM  
 - Re-Initializing Transformer Layers 
 - Utilizing Intermediate Layers  
 - Layer-wise Learning Rate Decay (LLRD)
 - Mixout Regularization  
 - Pre-trained Weight Decay  
 - Stochastic Weight Averaging   
 
*Note 1: These methods are independent, and using all of them at once is not recommended. Combining two or more techniques might bring an improvement, but this is not always the case.*
<br>

<font color='#3498DB'><h3>Contents</h3></font>
- [**Debiasing Omission In BertADAM**](#section1)
  - [Introduction](#section1a)
  - [Adam Pseudocode](#section1b)
  - [Implementation](#section1b)
  - [Resources and References](#section1b)
- [**Re-Initializing Transformer Layers**](#section2)
  - [Introduction](#section2a)
  - [Implementation](#section2b)
    - [Pooler Re-init](#section2a1)
    - [RoBERTa Re-init](#section2a2)
    - [XLNet Re-init](#section2a3)
    - [BART Re-init](#section2a3)
  - [Sensitivity to number of layers Re-Initialized](#section2c)
  - [Resources and References](#section2d)
- [**Utilizing Intermediate Layers**](#section3)
  - [Introduction](#section3a)
  - [Idea](#section3b)
  - [Implementation](#section3c)
      - [Weighted Layer Pooling](#section3c1)
  - [Pooling Strategy and Layer Choice](#section3d)
  - [Resources and References](#section3e)
- [**Layer-wise Learning Rate Decay (LLRD)**](#section4)
  - [Introduction](#section4a)
  - [Implementation](#section4b)
  - [Visualization](#section4c)
  - [Resources and References](#section4d)
- [**Mixout Regularization**](#section5)
  - [Introduction](#section5a)
  - [Idea](#section5b)
  - [Implementation](#section5c)
  - [Conclusion](#section5d)
  - [Resources and References](#section5e)
- [**Pre-trained Weight Decay**](#section6)
  - [Introduction](#section6a)
  - [Implementation](#section6b) 
  - [Resources and References](#section6c)
- [**Stochastic Weight Averaging**](#section7)
  - [Introduction](#section7a)
  - [Idea](#section7b)
  - [Implementation](#section7c)
  - [Resources and References](#section7d)   
- [**Ending Notes**](#section8)

<font color='#3498DB'><h3>What's New?</h3></font>
1. [SWA, Apex AMP & Interpreting Transformers in Torch](https://www.kaggle.com/rhtsingh/swa-apex-amp-interpreting-transformers-in-torch) notebook is an implementation of the Stochastic Weight Averaging technique with NVIDIA Apex on transformers using PyTorch. The notebook also implements how to interactively interpret Transformers using LIT (Language Interpretability Tool) a platform for NLP model understanding.   
It has in-depth explanations and code implementations for,
 - SWA 
 - Apex AMP
 - Weighted Layer Pooling
 - MADGRAD Optimizer
 - Grouped LLRD
 - Language Interpretability Tool
    - Attention Visualization
    - Saliency Maps
    - Integrated Gradients
    - LIME 
    - Embedding Space (UMAP & PCA)
    - Counterfactual generation
    - And many more ...


2. [Utilizing Transformer Representations Efficiently](https://www.kaggle.com/rhtsingh/utilizing-transformer-representations-efficiently) notebook will show many different ways these outputs and hidden representations can be utilized to do much more than just adding an output layer. It has code implementations and detailed explanations for all the below techniques,
 - Pooler Output  
 - Last Hidden State Output  
    - CLS Embeddings  
    - Mean Pooling  
    - Max Pooling  
    - Mean + Max Pooling  
    - Conv1D Pooling  
 - Hidden Layers Output  
    - Layerwise CLS Embeddings  
    - Concatenate Pooling  
    - Weighted Layer Pooling  
    - LSTM / GRU Pooling  
    - Attention Pooling  
    - WKPooling  
 
3. [Speeding up Transformer w/ Optimization Strategies](https://www.kaggle.com/rhtsingh/speeding-up-transformer-w-optimization-strategies) notebook explains in-depth 5 optimization strategies with code. All these techniques are promising and can improve the model performance both in terms of speed and accuracy.
   - Dynamic Padding and Uniform Length Batching
   - Gradient Accumulation
   - Freeze Embedding
   - Numeric Precision Reduction
   - Gradient Checkpointing  

In [1]:
import gc
gc.enable()

<font color='#3498DB'><a id="section1"><h2>Debiasing Omission In BERTAdam</h2></a></font>

<font color='#3498DB'><a id="section10"><h3>Introduction</h3></a></font>

BERTAdam, a modified version of the ADAM optimizer, is the most commonly used optimizer for fine-tuning Transformers.

It differs from the original ADAM algorithm (Kingma & Ba, 2014) in omitting a bias correction step. This change was introduced with BERT and subsequently made its way into common open-source libraries, including the official BERT implementation and HuggingFace Transformers.

<font color='#3498DB'><a id="section10"><h3>The Adam pseudocode</h3></a></font>

Require: $\alpha$: learning rate; $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates; $f(\theta)$: stochastic objective function with parameters $\theta$; $\theta_0$: initial parameter vector; $\lambda \in [0, 1)$: decoupled weight decay.

01: $m_0 \leftarrow 0$ (Initialize first moment vector)  
02: $v_0 \leftarrow 0$ (Initialize second moment vector)  
03: $t \leftarrow 0$ (Initialize timestep)  
04: **while** $\theta_t$ not converged **do**  
05: $t \leftarrow t + 1$  
06: $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep t)  
07: $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)  
08: $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)  
<font color='#F1948A'>09: $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)</font>  
<font color='#F1948A'>10: $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)</font>  
11: $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)  
12: **end while**  
13: **return** $\theta_t$ (Resulting parameters)  
<br>
The listing above shows the ADAM algorithm; the highlighted lines (09 and 10) are the bias-correction steps that the non-standard BERTAdam implementation omits. Omitting bias correction can lead to degenerate runs, and in few-sample settings the fine-tuned models sometimes fail to outperform the random baseline. 

Models trained with BERTAdam on small datasets tend to underfit. In short, bias correction is crucial when fine-tuning Transformers on small datasets, i.e. with fewer than 10k training samples.
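
To get a feel for what the two bias-correction lines change, here is a minimal scalar sketch of the pseudocode above, run with and without the correction. It is not the HuggingFace implementation: the function name `adam_updates`, the constant toy gradient, and the hyperparameter values are assumptions made for illustration.

```python
import math

def adam_updates(grads, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6, correct_bias=True):
    """Return the per-step parameter update of scalar Adam for a gradient sequence."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g      # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # biased second raw moment estimate
        if correct_bias:
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            m_hat, v_hat = m, v              # BERTAdam-style: skip the correction
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [1.0] * 5  # toy constant gradient, purely illustrative
corrected = adam_updates(grads, correct_bias=True)
uncorrected = adam_updates(grads, correct_bias=False)
first_step_ratio = uncorrected[0] / corrected[0]

print(f"first update, corrected:   {corrected[0]:.3e}")
print(f"first update, uncorrected: {uncorrected[0]:.3e}")
print(f"ratio: {first_step_ratio:.2f}")
```

With these toy numbers the first uncorrected update is roughly three times larger than the corrected one, because $(1 - \beta_1) / \sqrt{1 - \beta_2} \approx 3.16$. In other words, skipping the correction behaves like a larger effective learning rate in the earliest steps.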

<font color='#3498DB'><a id="section10"><h3>Implementation</h3></a></font>

Here we implement bias-corrected Adam with the HuggingFace Transformers library. This is straightforward with the HuggingFace `AdamW` optimizer: set its `correct_bias` parameter to `True`.

*Note: HuggingFace Transformers `AdamW` has `correct_bias` set to `True` by default. It is still worth knowing how important this parameter is.*

In [2]:
from transformers import (
    AdamW,
    AutoConfig,
    AutoModelForSequenceClassification
)
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_pretrained_model = 'roberta-base'
lr = 2e-5
epsilon = 1e-6
weight_decay = 0.01
use_bertadam = False

config = AutoConfig.from_pretrained(_pretrained_model)
model = AutoModelForSequenceClassification.from_pretrained(
    _pretrained_model, 
    config=config
)

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [{
    "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
    "weight_decay": weight_decay,
    "lr": lr,
},
{
    "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
    "weight_decay": 0.0,
    "lr": lr,
}]

optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=lr,
    eps=epsilon,
    correct_bias=not use_bertadam # bias correction step
)

del model, optimizer_grouped_parameters, optimizer
gc.collect();
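
The two parameter groups above are built by plain substring matching on parameter names. The sketch below applies the same `no_decay` test to a handful of sample names, so we can see which parameters receive weight decay. The names are a hand-picked sample in the style of a RoBERTa sequence-classification model, not a full listing.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Hand-picked sample of parameter names (assumed for illustration).
sample_names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.0.attention.self.query.bias",
    "roberta.encoder.layer.0.attention.output.LayerNorm.weight",
    "roberta.encoder.layer.0.attention.output.LayerNorm.bias",
    "classifier.out_proj.weight",
]

# Same predicate as in the optimizer setup: a name is excluded from decay
# if it contains any of the `no_decay` substrings.
decayed = [n for n in sample_names if not any(nd in n for nd in no_decay)]
not_decayed = [n for n in sample_names if any(nd in n for nd in no_decay)]

print("weight decay applied:", decayed)
print("no weight decay:     ", not_decayed)
```

Biases and LayerNorm parameters land in the second group. `LayerNorm.bias` is caught by the `"bias"` substring, so it does not need its own entry.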


<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [REVISITING FEW-SAMPLE BERT FINE-TUNING](https://arxiv.org/pdf/2006.05987.pdf)
 - [ON THE STABILITY OF FINE-TUNING BERT: MISCONCEPTIONS, EXPLANATIONS, AND STRONG BASELINES](https://arxiv.org/pdf/2006.04884.pdf)

<font color='#3498DB'><a id="section1"><h2>Reinitializing Transformer Layers</h2></a></font>

<font color='#3498DB'><a id="section10"><h3>Introduction</h3></a></font>

This is a very interesting technique: instead of using the pretrained weights for all layers, we re-initialize the pooler layer and the top Transformer blocks using the original Transformer initialization. Re-initializing these layers discards the pretrained knowledge stored in those specific blocks.

<font color='#3498DB'><a id="section10"><h3>Idea</h3></a></font>

The idea is motivated by transfer-learning results in computer vision, where lower pre-trained layers learn more general features while higher layers, closer to the output, specialize to the pre-training task. 
Existing work on Transformers suggests that reusing the complete pre-trained network is not always the most effective choice, and that keeping the pre-trained top layers can slow down training and hurt performance. 

<font color='#3498DB'><a id="section10"><h3>Implementation</h3></a></font>

The implementation differs between Transformers depending on their type (autoencoding, autoregressive, etc.). 

We will implement pooler reinitialization and block reinitialization for three architectures: RoBERTa, XLNet and BART.

<font color='#3498DB'><a id="section10"><h4>Pooler Reinitialization</h4></a></font>

-  We "pool" the model by simply taking the hidden state corresponding to the first token.

In [3]:
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaConfig
from transformers.models.roberta.modeling_roberta import RobertaClassificationHead

_model_type = 'roberta'
_pretrained_model = 'roberta-base'
config = RobertaConfig.from_pretrained(_pretrained_model)
add_pooler = True
reinit_pooler = True

class Net(nn.Module):
    def __init__(self, config, _pretrained_model, add_pooler):
        super(Net, self).__init__()
        self.roberta = RobertaModel.from_pretrained(_pretrained_model, add_pooling_layer=add_pooler)
        self.classifier = RobertaClassificationHead(config)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
        )
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)
        return logits
        
model = Net(config, _pretrained_model, add_pooler)

if reinit_pooler:
    print('Reinitializing Pooler Layer ...')
    encoder_temp = getattr(model, _model_type)
    encoder_temp.pooler.dense.weight.data.normal_(mean=0.0, std=encoder_temp.config.initializer_range)
    encoder_temp.pooler.dense.bias.data.zero_()
    for p in encoder_temp.pooler.parameters():
        p.requires_grad = True
    print('Done.!')
    
del model
gc.collect();

Reinitializing Pooler Layer ...
Done.!


<font color='#3498DB'><a id="section10"><h4>Layer Reinitialization - RoBERTa</h4></a></font>

- RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

- RoBERTa doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token.

*Note 1: TF version uses truncated_normal for initialization.*

*Note 2: To check whether the weights are being re-initialized, run this block of code before and after re-initialization.*

```python
for layer in model.roberta.encoder.layer[-reinit_layers:]:
    for module in layer.modules():
        if isinstance(module, nn.Linear):
            print(module.weight.data)
```
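
Instead of comparing printed tensors by eye, we can snapshot the parameters and compare them programmatically. The sketch below does this on a tiny stand-in "encoder" made of four small blocks, so that it runs without downloading a model. The block layout and the `initializer_range` value are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for an encoder: 4 tiny blocks instead of 12 real ones.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8)) for _ in range(4)]
)
reinit_layers = 2
initializer_range = 0.02  # assumed value, plays the role of config.initializer_range

# Snapshot every parameter before touching anything.
before = {name: p.detach().clone() for name, p in blocks.named_parameters()}

for block in blocks[-reinit_layers:]:
    for module in block.modules():
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

# A parameter counts as changed if any element differs from its snapshot.
changed = {
    name: not torch.equal(before[name], p.detach())
    for name, p in blocks.named_parameters()
}
# Parameter names look like "<block>.<submodule>.<param>"; keep the block index.
changed_blocks = sorted({int(name.split(".")[0]) for name, flag in changed.items() if flag})
print("blocks with changed parameters:", changed_blocks)
```

Only the last `reinit_layers` blocks should show up. In this toy setup the LayerNorm parameters are reported as unchanged because a freshly created LayerNorm already holds weight 1 and bias 0; in a pretrained model they would differ.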

In [4]:
from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'roberta'
_pretrained_model = 'roberta-base'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)  # pass config so num_labels=1 is applied

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    encoder_temp = getattr(model, _model_type)
    for layer in encoder_temp.encoder.layer[-reinit_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.Embedding):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.padding_idx is not None:
                    module.weight.data[module.padding_idx].zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
    print('Done.!')

del model
gc.collect();

Reinitializing Last 2 Layers ...
Done.!


<font color='#3498DB'><a id="section10"><h4>Layer Reinitialization - XLNet</h4></a></font>

- XLNet is one of the few models that has no sequence length limit.

- XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts.

*Note: TF version uses truncated_normal for initialization.*

In [5]:
from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
from transformers.models.xlnet.modeling_xlnet import XLNetRelativeAttention
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'xlnet'
_pretrained_model = 'xlnet-base-cased'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)  # pass config so num_labels=1 is applied

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    for layer in model.transformer.layer[-reinit_layers :]:
        for module in layer.modules():
            if isinstance(module, (nn.Linear, nn.Embedding)):
                module.weight.data.normal_(mean=0.0, std=model.transformer.config.initializer_range)
                if isinstance(module, nn.Linear) and module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
            elif isinstance(module, XLNetRelativeAttention):
                for param in [
                    module.q,
                    module.k,
                    module.v,
                    module.o,
                    module.r,
                    module.r_r_bias,
                    module.r_s_bias,
                    module.r_w_bias,
                    module.seg_embed,
                ]:
                    param.data.normal_(mean=0.0, std=model.transformer.config.initializer_range)
    print('Done.!')
    
del model
gc.collect();


Reinitializing Last 2 Layers ...
Done.!


<font color='#3498DB'><a id="section10"><h4>Layer Reinitialization - BART</h4></a></font>

 - BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

 - The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.

In [6]:
from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'bart'
_pretrained_model = 'facebook/bart-base'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)  # pass config so num_labels=1 is applied

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    for layer in model.model.decoder.layers[-reinit_layers :]:
        for module in layer.modules():
            model.model._init_weights(module)
    print('Done.!')

del model
gc.collect();


Reinitializing Last 2 Layers ...
Done.!


<font color='#3498DB'><a id="section10"><h4>Sensitivity to Number of Layers Re-initialized </h4></a></font>

Experiments show that Re-init is more robust to unfavorable random seeds. An improvement is already seen when only the pooler layer is re-initialized, and re-initializing further layers helps more. 

However, re-initializing more than the top 6 layers is not suggested: performance plateaus and even decreases, because further re-initialization destroys pre-trained layers that hold important general features. The best number of layers to re-initialize varies across datasets.

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [Investigating Transferability in Pretrained Language Models](https://arxiv.org/pdf/2004.14975.pdf)
 - [REVISITING FEW-SAMPLE BERT FINE-TUNING](https://arxiv.org/pdf/2006.05987.pdf)
 - [RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr](https://arxiv.org/pdf/2007.03349.pdf) 
 - [Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping](https://arxiv.org/pdf/2002.06305.pdf)

<font color='#3498DB'><a id="section1"><h2>Utilizing Intermediate Layers</h2></a></font>

<font color='#3498DB'><a id="section1"><h3>Introduction</h3></a></font>
This is one of the best techniques. It has been widely studied with probing methods, which show that the pre-trained features from intermediate layers are more transferable. 

In HuggingFace Transformers the model returns two main outputs, plus a third if configured, after we pass `input_ids` and `attention_mask` as input.

 - **last hidden state** (batch size, seq Len, hidden size) which is the sequence of hidden states at the output of the last layer.
 - **pooler output** (batch size, hidden size) - Last layer hidden-state of the first token of the sequence
 - **all hidden states** (n layers, batch size, seq Len, hidden size) - Hidden states for all layers and for all ids.
 
 
<font color='#3498DB'><a id="section1"><h3>Idea</h3></a></font>
As discussed in the reinitialization section, the output of the last layer may not always be the best representation of the input text when fine-tuning for downstream tasks. 

For pre-trained language models, including Transformer, the most transferable contextualized representations of input text tend to occur in the middle layers, while the top layers specialize for language modeling. Therefore, the onefold use of the last layer’s output may restrict the power of the pre-trained representation.

<font color='#3498DB'><a id="section1"><h3>Implementation</h3></a></font>
There are multiple application-dependent strategies for fetching intermediate representations, and not all of them can be covered in this notebook. Here I share the most useful one, which helps on almost any type of problem. 

**WeightedLayerPooling** - Token embeddings are the weighted mean of their different hidden layer representations.
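
As a plain-Python illustration of that weighted mean, the sketch below averages one token's vectors from four layers. The function name `weighted_layer_mean` and the toy two-dimensional vectors are assumptions for illustration; the real operation works on tensors for every token in the batch at once.

```python
def weighted_layer_mean(layer_vectors, weights):
    """Weighted mean of one token's vectors across layers: sum(w_l * h_l) / sum(w_l)."""
    total = sum(weights)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors)) / total
        for d in range(dim)
    ]

# One token, four layers, hidden size 2 (toy values).
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

equal = weighted_layer_mean(layers, [1, 1, 1, 1])      # plain average of the layers
top_only = weighted_layer_mean(layers, [0, 0, 0, 1])   # reduces to the last layer

print(equal, top_only)
```

With equal weights the result is an ordinary mean of the selected layers. When the weights are learnable, training can shift the emphasis toward whichever layers help the task.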

In [7]:
import torch
import torch.nn as nn
import pandas as pd
from transformers import (
    AutoConfig, 
    AutoModel, 
    AutoTokenizer
)

_pretrained_model = 'roberta-base'
batch_size = 16
max_seq_length = 256

train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
texts = train['excerpt'][:batch_size].tolist()

config = AutoConfig.from_pretrained(_pretrained_model)
# configure to output all hidden states as well
config.update({'output_hidden_states':True}) 
model = AutoModel.from_pretrained(_pretrained_model, config=config)
tokenizer = AutoTokenizer.from_pretrained(_pretrained_model)

features = tokenizer.batch_encode_plus( 
    texts, 
    max_length=max_seq_length,
    padding='max_length', 
    truncation=True, 
    add_special_tokens=True,
    return_attention_mask=True, 
    return_tensors='pt'
)
print(features['input_ids'].shape)


torch.Size([16, 256])


In [8]:
outputs = model(features['input_ids'], attention_mask=features['attention_mask'])
print("Total number of outputs: ", len(outputs))
print('Shape of 1st output', outputs[0].shape)
print('Shape of 2nd output', outputs[1].shape)
print('Length of 3rd output', len(outputs[2]))

Total number of outputs:  3
Shape of 1st output torch.Size([16, 256, 768])
Shape of 2nd output torch.Size([16, 768])
Length of 3rd output 13


- We can see after setting `output_hidden_states` to `True` that we now receive three different outputs. 

- We get 13 hidden-state outputs even though the model has 12 hidden layers, because the output of the embedding layer is included as well.
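
The indexing that follows from this is easy to get wrong by one, so here is a small sketch of how positions in the hidden-states tuple map to layers. The labels and the `layer_start = 9` value are assumptions chosen for illustration.

```python
num_hidden_layers = 12
layer_start = 9  # assumed value: start pooling from this index

# Index 0 is the embedding output; index i (i >= 1) is the output of the i-th layer.
names = ["embeddings"] + [f"layer_{i}" for i in range(1, num_hidden_layers + 1)]
selected = names[layer_start:]

print(len(names))   # number of hidden-state outputs
print(selected)     # entries kept when slicing from layer_start
```

Slicing from index 9 keeps four entries, i.e. the outputs of the last four Transformer layers.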

In [9]:
class WeightedLayerPooling(nn.Module):
    def __init__(self, num_hidden_layers, layer_start: int = 4, layer_weights = None):
        super(WeightedLayerPooling, self).__init__()
        self.layer_start = layer_start
        self.num_hidden_layers = num_hidden_layers
        self.layer_weights = layer_weights if layer_weights is not None \
            else nn.Parameter(
                torch.tensor([1] * (num_hidden_layers+1 - layer_start), dtype=torch.float)
            )

    def forward(self, features):
        ft_all_layers = features['all_layer_embeddings']

        all_layer_embedding = torch.stack(ft_all_layers)
        all_layer_embedding = all_layer_embedding[self.layer_start:, :, :, :]

        weight_factor = self.layer_weights.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).expand(all_layer_embedding.size())
        weighted_average = (weight_factor*all_layer_embedding).sum(dim=0) / self.layer_weights.sum()

        features.update({'token_embeddings': weighted_average})
        return features

Now we will add our hidden layers outputs to features with key `all_layer_embeddings` for convenience and pass that to WeightedLayerPooling operation. We will be using hidden states from our last 4 hidden layers. We will add the weighted layer pooling outputs to our features dict with key - `token_embeddings`.

In [10]:
layer_start = 9
pooler = WeightedLayerPooling(
    config.num_hidden_layers, 
    layer_start=layer_start, layer_weights=None
)
features.update({'all_layer_embeddings':outputs[2]})
features = pooler(features)
print("Weighted Layer Pooling Embeddings Shape: ", features['token_embeddings'].shape)

Weighted Layer Pooling Embeddings Shape:  torch.Size([16, 256, 768])


Now we have a combined final representation of the last four layers. We can now simply take the `cls` token outputs, or concatenate them with other pooled features. 
The standard pooling operation as implemented in HuggingFace Transformers for BERT, RoBERTa etc. can also be applied here. Below we simply take the `cls` token outputs and pass them through a Linear layer.

In [11]:
sequence_output = features['token_embeddings'][:, 0]
outputs = nn.Linear(config.hidden_size, 1)(sequence_output)
print("Outputs Shape: ", outputs.shape)

del model, tokenizer
gc.collect();

Outputs Shape:  torch.Size([16, 1])


<font color='#3498DB'><a id="section1"><h3>Pooling Strategy and Layer Choice</h3></a></font>

The BERT authors tested word-embedding strategies by feeding different vector combinations as input features to a BiLSTM used on a named entity recognition task and observing the resulting F1 scores.

This is partly explained by the fact that the different layers of BERT encode very different kinds of information, so the appropriate pooling strategy depends on the application.

![embedding_layers](http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png)

Han Xiao created an open-source project named bert-as-service on GitHub, which is intended to create word embeddings for your text using BERT. By default, `bert-as-service` uses the outputs from the second-to-last layer of the model.

His observations are - 
 - The embeddings start in the first layer as having no contextual information.
 - As the embeddings move deeper into the network, they pick up more and more contextual information with each layer.
 - As you approach the final layer, however, you start picking up information that is specific to BERT’s pre-training tasks (the “Masked Language Model” (MLM) and “Next Sentence Prediction” (NSP)).
    - What we want are embeddings that encode the word meaning well…
    - BERT is motivated to do this, but it is also motivated to encode anything else that would help it determine what a missing word is (MLM), or whether the second sentence came after the first (NSP).
 - The second-to-last layer is what Han settled on as a reasonable sweet spot.

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)
 - [Deepening Hidden Representations from Pre-trained Language Models](https://arxiv.org/pdf/1911.01940.pdf)
 - [WHAT DO YOU LEARN FROM CONTEXT? PROBING FOR SENTENCE STRUCTURE IN CONTEXTUALIZED WORD REPRESENTATIONS](https://openreview.net/pdf?id=SJzSgnRcKX)
 - [Linguistic Knowledge and Transferability of Contextual Representations](https://www.aclweb.org/anthology/N19-1112.pdf)
 - [BERT Word Embeddings Tutorial](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)
 - [Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT](https://github.com/UKPLab/sentence-transformers)

<font color='#3498DB'><a id="section1"><h2>LLRD - Layerwise Learning Rate Decay</h2></a></font>

<font color='#3498DB'><a id="section1"><h3>Introduction</h3></a></font>
    
LLRD  is a method that applies higher learning rates for top layers and lower learning rates for bottom layers. This is accomplished by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer-by-layer from top to bottom. 
    
The goal is to change the lower layers, which encode more general information, less than the top layers, which are more specific to the pre-training task. This method is adopted in fine-tuning several recent pre-trained models, including XLNet and ELECTRA.
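
To see what the multiplicative decay does numerically, here is a minimal sketch that lists the learning rate each group would get. It assumes one common convention: the task head takes the base rate and every step down (top block, next block, ..., embeddings) multiplies by the decay. The helper name `llrd_learning_rates` and the values `5e-5`, `0.9` and 12 layers are assumptions for illustration.

```python
def llrd_learning_rates(base_lr, decay, num_layers):
    """Map each parameter group to its learning rate, from the head down to the embeddings."""
    rates = {"head": base_lr}
    lr = base_lr
    # Walk from the top transformer block down to the embeddings.
    groups = [f"layer_{i}" for i in reversed(range(num_layers))] + ["embeddings"]
    for name in groups:
        lr *= decay
        rates[name] = lr
    return rates

rates = llrd_learning_rates(base_lr=5e-5, decay=0.9, num_layers=12)
for name, lr in rates.items():
    print(f"{name:>10}: {lr:.3e}")
```

With a decay of 0.9 over 12 blocks plus the embeddings, the embeddings end up with roughly a quarter of the head's learning rate ($0.9^{13} \approx 0.25$).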

<font color='#3498DB'><a id="section1"><h3>Implementation</h3></a></font> 
    
[Guide to HuggingFace Schedulers & Differential LRs](https://www.kaggle.com/rhtsingh/guide-to-huggingface-schedulers-differential-lrs) notebook introduces various differential learning rate strategies but not this one. We will implement official LLRD here and visualize how learning rate changes for various layers. 
    
First we import our necessary modules, define model params, optimizer params, scheduler params then, create model and config.

In [12]:
from transformers import (
    AdamW, 
    AutoConfig, 
    AutoModelForSequenceClassification,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup
)
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_model_type = 'roberta'
_pretrained_model = 'roberta-base'
# optimizer params
learning_rate = 5e-5
layerwise_learning_rate_decay = 0.9
weight_decay = 0.01
adam_epsilon = 1e-6
use_bertadam = False
# scheduler params
num_epochs = 20
num_warmup_steps = 0

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)  # pass config so num_labels=1 is applied

Below is our LLRD function. We first set the learning rate of the task-specific head. Then, moving from the top Transformer block down to the embeddings, we repeatedly multiply the `learning rate` by the `layerwise learning rate decay` and assign the result to the current block. 

As we will see, the top layers closer to the task-specific head have a higher learning rate than the bottom ones.

In [13]:
def get_optimizer_grouped_parameters(
    model, model_type, 
    learning_rate, weight_decay, 
    layerwise_learning_rate_decay
):
    no_decay = ["bias", "LayerNorm.weight"]
    # initialize lr for task specific layer
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if "classifier" in n or "pooler" in n],
            "weight_decay": 0.0,
            "lr": learning_rate,
        },
    ]
    # initialize lrs for every layer
    num_layers = model.config.num_hidden_layers
    layers = [getattr(model, model_type).embeddings] + list(getattr(model, model_type).encoder.layer)
    layers.reverse()
    lr = learning_rate
    for layer in layers:
        lr *= layerwise_learning_rate_decay
        optimizer_grouped_parameters += [
            {
                "params": [p for n, p in layer.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": weight_decay,
                "lr": lr,
            },
            {
                "params": [p for n, p in layer.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
                "lr": lr,
            },
        ]
    return optimizer_grouped_parameters

We create our grouped parameters, initialize our optimizer and scheduler.

In [14]:
grouped_optimizer_params = get_optimizer_grouped_parameters(
    model, _model_type, 
    learning_rate, weight_decay, 
    layerwise_learning_rate_decay
)
optimizer = AdamW(
    grouped_optimizer_params,
    lr=learning_rate,
    eps=adam_epsilon,
    correct_bias=not use_bertadam
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_epochs
)

<font color='#3498DB'><a id="section1"><h3>Visualization</h3></a></font> 
We will now perform `optimizer.step()` and `scheduler.step()` like any other normal training and collect our learning rate for each layer in each epoch. Then we will visualize the learning rates.

*Note: The visualization has been done using plotly and has been hidden.*
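
Each curve in the plot is a group's base rate multiplied by a schedule factor that depends only on the step. The sketch below reproduces that factor in plain Python, assuming the default half-cosine setting (`num_cycles=0.5`) of `get_cosine_schedule_with_warmup`. The helper name `cosine_warmup_factor` and the step counts are illustrative assumptions.

```python
import math


def cosine_warmup_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier applied to each param group's base learning rate at `step`."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half-cosine decay from 1 to 0 over the remaining steps.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))


# Hypothetical schedule: 2 warmup steps out of 10 total.
factors = [cosine_warmup_factor(s, 2, 10) for s in range(11)]
print([round(f, 3) for f in factors])
```

Since every group shares the same factor, the curves keep the fixed ratio set by the layer-wise decay throughout training.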

In [15]:
(learning_rates1, learning_rates2, learning_rates3, learning_rates4,
learning_rates5, learning_rates6, learning_rates7, learning_rates8,
learning_rates9, learning_rates10, learning_rates11, learning_rates12, 
learning_rates13, learning_rates14) = [[] for i in range(14)]

def collect_lr(optimizer):
    learning_rates1.append(optimizer.param_groups[0]["lr"])
    learning_rates2.append(optimizer.param_groups[2]["lr"])
    learning_rates3.append(optimizer.param_groups[4]["lr"])
    learning_rates4.append(optimizer.param_groups[6]["lr"])
    learning_rates5.append(optimizer.param_groups[8]["lr"])
    learning_rates6.append(optimizer.param_groups[10]["lr"])
    learning_rates7.append(optimizer.param_groups[12]["lr"])
    learning_rates8.append(optimizer.param_groups[14]["lr"])
    learning_rates9.append(optimizer.param_groups[16]["lr"])
    learning_rates10.append(optimizer.param_groups[18]["lr"])
    learning_rates11.append(optimizer.param_groups[20]["lr"])
    learning_rates12.append(optimizer.param_groups[22]["lr"])
    learning_rates13.append(optimizer.param_groups[24]["lr"])
    learning_rates14.append(optimizer.param_groups[26]["lr"])

collect_lr(optimizer)
for epoch in range(num_epochs):
    optimizer.step()
    scheduler.step()
    collect_lr(optimizer)

In [16]:
import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
import plotly.offline as pyo
pio.templates.default='plotly_white'

def get_default_layout(title):
    font_style = 'Courier New'
    layout = {}
    #layout['height'] = 400
    #layout['width'] = 1200
    layout['template'] = 'plotly_white'
    layout['dragmode'] = 'zoom'
    layout['hovermode'] = 'x'
    layout['hoverlabel'] = {
        'font_size': 14,
        'font_family':font_style
    }
    layout['font'] = {
        'size':14,
        'family':font_style,
        'color':'rgb(128, 128, 128)'
    }
    layout['xaxis'] = {
        'title': 'Epochs',
        'showgrid': True,
        'type': 'linear',
        'categoryarray': None,
        'gridwidth': 1,
        'ticks': 'outside',
        'showline': True, 
        'showticklabels': True,
        'tickangle': 0,
        'tickmode': 'array'
    }
    layout['yaxis'] = {
        'title': 'Learning Rate',
        'exponentformat':'none',
        'showgrid': True,
        'type': 'linear',
        'categoryarray': None,
        'gridwidth': 1,
        'ticks': 'outside',
        'showline': True, 
        'showticklabels': True,
        'tickangle': 0,
        'tickmode': 'array'
    }
    layout['title'] = {
        'text':title,
        'x': 0.5,
        'y': 0.95,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {
            'family':font_style,
            'size':14,
            'color':'black'
        }
    }
    layout['showlegend'] = True
    layout['legend'] = {
        'x':0.1,
        'y':1.1,
        'orientation':'h',
        'itemclick': 'toggleothers',
        'font': {
            'family':font_style,
            'size':14,
            'color':'black'
        }
    }
    return go.Layout(layout)

In [17]:
def build_trace(learning_rates, num_epochs, name, color):
    return go.Scatter(
        x=list(range(len(learning_rates))),  # one point per collected value (initial value + one per epoch)
        y=learning_rates, 
        texttemplate="%{y:.6f}",
        mode='markers+lines',
        name=name,
        marker=dict(color=color),
    )

trace1 = build_trace(learning_rates1, num_epochs, name='Regressor', color='#83c8d2')
trace2 = build_trace(learning_rates2, num_epochs, name='Layer 12', color='#82c9d2')
trace3 = build_trace(learning_rates3, num_epochs, name='Layer 11', color='#85c7cf')
trace4 = build_trace(learning_rates4, num_epochs, name='Layer 10', color='#88c4cc')
trace5 = build_trace(learning_rates5, num_epochs, name='Layer 9', color='#8cc1c8')
trace6 = build_trace(learning_rates6, num_epochs, name='Layer 8', color='#8fbfc5')
trace7 = build_trace(learning_rates7, num_epochs, name='Layer 7', color='#92bcc2')
trace8 = build_trace(learning_rates8, num_epochs, name='Layer 6', color='#96babe')
trace9 = build_trace(learning_rates9, num_epochs, name='Layer 5', color='#99b7bb')
trace10 = build_trace(learning_rates10, num_epochs, name='Layer 4', color='#9cb4b8')
trace11 = build_trace(learning_rates11, num_epochs, name='Layer 3', color='#a0b2b4')
trace12 = build_trace(learning_rates12, num_epochs, name='Layer 2', color='#a3afb1')
trace13 = build_trace(learning_rates13, num_epochs, name='Layer 1', color='#a7adad')
trace14 = build_trace(learning_rates14, num_epochs, name='Embeddings', color='#aaa')

layout=get_default_layout('Layer Wise Learning Rate Decay')
fig = go.Figure(
    data=[
        trace1, trace2, trace3, trace4, trace5, trace6, 
        trace7, trace8, trace9, trace10, trace11, trace12, 
        trace13, trace14
    ], 
    layout=layout.update({'showlegend':False})
)

fig.show()

del model, grouped_optimizer_params, optimizer, scheduler
gc.collect();

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [Universal language model fine-tuning for text classification](https://arxiv.org/pdf/1801.06146.pdf)
 - [Xlnet: Generalized autoregressive pretraining for language understanding](https://arxiv.org/pdf/1906.08237.pdf)
 - [ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS](https://arxiv.org/pdf/2003.10555.pdf)


<font color='#3498DB'><a id="section1"><h2>Mixout Regularization</h2></a></font>

<font color='#3498DB'><a id="section112"><h3>Introduction</h3></a></font>
    
Mixout is a stochastic regularization technique motivated by Dropout and DropConnect. At each training iteration, each model parameter is replaced with its pre-trained value with probability p. The goal is to prevent catastrophic forgetting: the Mixout paper shows that this constrains the fine-tuned model from deviating too much from the pre-trained initialization.
    
<font color='#3498DB'><a id="section112"><h3>Idea</h3></a></font>

![mixout](https://d3i71xaburhd42.cloudfront.net/7fb48d00f44771e061c34d9e83415487cf538110/2-Figure1-1.png)

<font color='#000000'>Suppose that `u` is the target model parameter and `w` is the current model parameter. The figure above shows three networks:
- (a) Vanilla network: we first memorize the parameters of the vanilla network at `u`.
- (b) Dropout network: we randomly choose an input neuron to be dropped (a dotted neuron) with a probability of `p`. That is, all outgoing parameters from the dropped neuron are eliminated (dotted connections).
- (c) Mixout(`u`) network: the parameters eliminated in (b) are replaced by the corresponding parameters in (a). In other words, the mixout(`u`) network at `w` is a mixture of the vanilla network at `u` and the dropout network at `w`, with a probability of `p`.</font>
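
The mixture can be sketched element-wise in plain Python. This is a simplified illustration, not the implementation used in this notebook: it draws an independent mask per element, whereas the `Mixout` function in the Implementation section shares one mask across the rows of a weight matrix. The helper name `mixout_vector` and the toy numbers are assumptions.

```python
import random


def mixout_vector(current, pretrained, p, rng):
    """Replace each entry of `current` with its pre-trained value with probability p.

    The result is rescaled so that its expectation equals `current`.
    """
    if p == 0:
        return list(current)
    if p == 1:
        return list(pretrained)
    mixed = []
    for w, u in zip(current, pretrained):
        keep = 1.0 if rng.random() >= p else 0.0  # keep w with probability 1 - p
        value = (1.0 - keep) * u + keep * w
        mixed.append((value - p * u) / (1.0 - p))
    return mixed


rng = random.Random(0)
w = [0.8, -0.3, 1.5]  # current (fine-tuned) parameters, toy values
u = [1.0, 0.0, 1.0]   # pre-trained parameters, toy values
print(mixout_vector(w, u, p=0.5, rng=rng))
```

After rescaling, a replaced entry equals its pre-trained value `u`, while a kept entry becomes `(w - p * u) / (1 - p)`, so the average over many draws stays at `w`.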
    
<font color='#3498DB'><a id="section112"><h3>Implementation</h3></a></font>
    
Here we will implement Mixout. The code has been taken from https://github.com/bloodwass/mixout

In [18]:
import math  # needed by MixLinear.reset_parameters
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from torch.nn import Parameter
from torch.autograd.function import InplaceFunction

class Mixout(InplaceFunction):
    @staticmethod
    def _make_noise(input):
        return input.new().resize_as_(input)

    @classmethod
    def forward(cls, ctx, input, target=None, p=0.0, training=False, inplace=False):
        if p < 0 or p > 1:
            raise ValueError("A mix probability of mixout has to be between 0 and 1," " but got {}".format(p))
        if target is not None and input.size() != target.size():
            raise ValueError(
                "A target tensor size must match with a input tensor size {},"
                " but got {}".format(input.size(), target.size())
            )
        ctx.p = p
        ctx.training = training

        if ctx.p == 0 or not ctx.training:
            return input

        if target is None:
            target = cls._make_noise(input)
            target.fill_(0)
        target = target.to(input.device)

        if inplace:
            ctx.mark_dirty(input)
            output = input
        else:
            output = input.clone()

        ctx.noise = cls._make_noise(input)
        if len(ctx.noise.size()) == 1:
            ctx.noise.bernoulli_(1 - ctx.p)
        else:
            ctx.noise[0].bernoulli_(1 - ctx.p)
            ctx.noise = ctx.noise[0].repeat(input.size()[0], 1)
        ctx.noise.expand_as(input)

        if ctx.p == 1:
            output = target
        else:
            output = ((1 - ctx.noise) * target + ctx.noise * output - ctx.p * target) / (1 - ctx.p)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        if ctx.p > 0 and ctx.training:
            return grad_output * ctx.noise, None, None, None, None
        else:
            return grad_output, None, None, None, None


def mixout(input, target=None, p=0.0, training=False, inplace=False):
    return Mixout.apply(input, target, p, training, inplace)


class MixLinear(torch.nn.Module):
    __constants__ = ["bias", "in_features", "out_features"]
    def __init__(self, in_features, out_features, bias=True, target=None, p=0.0):
        super(MixLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter("bias", None)
        self.reset_parameters()
        self.target = target
        self.p = p

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input):
        return F.linear(input, mixout(self.weight, self.target, self.p, self.training), self.bias)

    def extra_repr(self):
        type = "drop" if self.target is None else "mix"
        return "{}={}, in_features={}, out_features={}, bias={}".format(
            type + "out", self.p, self.in_features, self.out_features, self.bias is not None
        )

Above we have defined Mixout Regularization. Now we will be adding this to our model.

In [19]:
import math
from transformers import AutoModelForSequenceClassification, AutoConfig
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_pretrained_model = 'roberta-base'
mixout_prob = 0.7  # named mixout_prob so it does not shadow the mixout() function defined above

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 actually takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

if mixout_prob > 0:
    print('Initializing Mixout Regularization')
    for sup_module in model.modules():
        for name, module in sup_module.named_children():
            # turn off standard dropout while Mixout is active
            if isinstance(module, nn.Dropout):
                module.p = 0.0
            # swap each nn.Linear for a MixLinear whose target is the current (pre-trained) weight
            if isinstance(module, nn.Linear):
                target_state_dict = module.state_dict()
                bias = module.bias is not None
                new_module = MixLinear(
                    module.in_features, module.out_features, bias, target_state_dict["weight"], mixout_prob
                )
                new_module.load_state_dict(target_state_dict)
                setattr(sup_module, name, new_module)
    print('Done.!')

del model
gc.collect();

Initializing Mixout Regularization
Done.!


And we're done. The model can now be used for downstream fine-tuning, and Mixout will be applied whenever the model is in training mode.

<font color='#3498DB'><a id="section112"><h3>Conclusions</h3></a></font>
Mixout acts as an adaptive L2-regularizer toward the pre-trained parameters, in the sense that its regularization coefficient adapts along the optimization trajectory. Mixout improves the stability of fine-tuning a big, pre-trained language model even with only a few training examples of a target task. It is a well-known technique for improving stability in Transformer fine-tuning.

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [MIXOUT: EFFECTIVE REGULARIZATION TO FINETUNE LARGE-SCALE PRETRAINED LANGUAGE MODELS](https://arxiv.org/pdf/1909.11299.pdf)
 - [MIXOUT Code Implementation](https://github.com/bloodwass/mixout)

<font color='#3498DB'><a id="section1"><h2>Pre-trained Weight Decay</h2></a></font>

<font color='#3498DB'><a id="section112"><h3>Introduction</h3></a></font>
Weight decay (WD) is a common regularization technique. At each optimization iteration, λw is subtracted from the model parameters, where λ is a hyperparameter controlling the regularization strength and w denotes the model parameters. Pre-trained weight decay adapts this method to fine-tuning by subtracting λ(w − ŵ) instead, where ŵ denotes the pre-trained parameters, so the weights are pulled toward their pre-trained values rather than toward zero. The Mixout paper reports that pre-trained weight decay works better than conventional weight decay in Transformer fine-tuning and can stabilize it.
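
The difference between the two decay rules is easiest to see on a single update. Below is a minimal sketch in plain Python with the gradient step omitted. The function names and the toy values are assumptions made for illustration.

```python
def weight_decay_step(w, lr, lam):
    """Conventional decay: pull every parameter toward zero."""
    return [wi - lr * lam * wi for wi in w]


def prior_weight_decay_step(w, w_pre, lr, lam):
    """Pre-trained decay: pull every parameter toward its pre-trained value."""
    return [wi - lr * lam * (wi - ui) for wi, ui in zip(w, w_pre)]


w_pre = [1.0, -2.0]  # pre-trained parameters (toy values)
w = [1.5, -2.0]      # parameters after some fine-tuning (toy values)
print(weight_decay_step(w, lr=0.1, lam=0.5))
print(prior_weight_decay_step(w, w_pre, lr=0.1, lam=0.5))
```

The second parameter still equals its pre-trained value, so the pre-trained variant leaves it untouched while the conventional rule shrinks it toward zero.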

<font color='#3498DB'><a id="section112"><h3>Implementation</h3></a></font>
Here we will be implementing the pretrained weight decay.

In [20]:
import torch
import torch.nn as nn
from torch.optim import Optimizer
from transformers import (
    AdamW, 
    AutoConfig, 
    AutoModelForSequenceClassification,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup
)
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_model_type = 'roberta'
_pretrained_model = 'roberta-base'

# optimizer params
learning_rate = 5e-5
weight_decay = 0.01
adam_epsilon = 1e-6
use_bertadam = False
use_prior_wd = True

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 actually takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

Below is our code for pre-trained weight decay. It does exactly what we described above and acts as a wrapper around our optimizer.

In [21]:
class PriorWD(Optimizer):
    def __init__(self, optim, use_prior_wd=False, exclude_last_group=True):
        super(PriorWD, self).__init__(optim.param_groups, optim.defaults)
        self.param_groups = optim.param_groups
        self.optim = optim
        self.use_prior_wd = use_prior_wd
        self.exclude_last_group = exclude_last_group
        self.weight_decay_by_group = []
        for i, group in enumerate(self.param_groups):
            self.weight_decay_by_group.append(group["weight_decay"])
            group["weight_decay"] = 0

        self.prior_params = {}
        for i, group in enumerate(self.param_groups):
            for p in group["params"]:
                self.prior_params[id(p)] = p.detach().clone()

    def step(self, closure=None):
        if self.use_prior_wd:
            for i, group in enumerate(self.param_groups):
                for p in group["params"]:
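                    if self.exclude_last_group and i == len(self.param_groups) - 1:
                        # last group: ordinary weight decay toward zero
                        p.data.add_(p.data, alpha=-group["lr"] * self.weight_decay_by_group[i])
                    else:
                        # all other groups: decay toward the stored pre-trained values
                        p.data.add_(
                            p.data - self.prior_params[id(p)],
                            alpha=-group["lr"] * self.weight_decay_by_group[i],
                        )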
        loss = self.optim.step(closure)

        return loss

    def compute_distance_to_prior(self, param):
        assert id(param) in self.prior_params, "parameter not in PriorWD optimizer"
        return (param.data - self.prior_params[id(param)]).pow(2).sum().sqrt()

Now we create our optimizer with a simple grouped-parameter initialization and the optimizer params defined above.

In [22]:
def get_optimizer_grouped_parameters(model, learning_rate, weight_decay):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": weight_decay,
            "lr": learning_rate,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
            "lr": learning_rate,
        },
    ]
    return optimizer_grouped_parameters

optimizer_grouped_parameters = get_optimizer_grouped_parameters(model, learning_rate, weight_decay)
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=learning_rate,
    eps=adam_epsilon,
    correct_bias=not use_bertadam
)

optimizer = PriorWD(optimizer, use_prior_wd=use_prior_wd)

This can now be used directly in training and the prior weight decay will do its work.

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 
 - [MIXOUT: EFFECTIVE REGULARIZATION TO FINETUNE LARGE-SCALE PRETRAINED LANGUAGE MODELS](https://arxiv.org/pdf/1909.11299.pdf)
 - [MIXOUT Code Implementation](https://github.com/bloodwass/mixout)

<font color='#3498DB'><a id="section1"><h2>Stochastic Weight Averaging</h2></a></font>

<font color='#3498DB'><a id="section112"><h3>Introduction</h3></a></font>

Snapshot ensembling is a technique where we save snapshots of the weights while training a single network and, after training, build an ensemble of networks with the same architecture but different weights. This can improve test performance cheaply, because only one model is trained and its weights are simply saved from time to time.

In SWA (Stochastic Weight Averaging), the authors propose ensembling in weight space instead. The method combines the weights of the same network at different stages of training and uses the resulting model to make predictions. This approach has two benefits:
 - combining weights still yields a single model at the end, which speeds up predictions
 - it is not tied to a particular architecture or dataset, and the authors report good results across the ones they tested.
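
To illustrate the difference between averaging predictions and averaging weights, here is a toy sketch in plain Python. The linear `predict` function and the snapshot values are assumptions made for the example. For a linear model the two averages coincide. For a deep network they generally differ, so weight averaging is best read as a cheap approximation of the ensemble, not an exact equivalent.

```python
def predict(weights, x):
    """Hypothetical linear model: dot product of weights and input."""
    return sum(w * xi for w, xi in zip(weights, x))


snapshots = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]  # weights saved during training (toy values)
x = [0.5, -1.0]                                   # a single test input (toy values)

# Snapshot ensembling: one forward pass per snapshot, then average the predictions.
ensemble_pred = sum(predict(w, x) for w in snapshots) / len(snapshots)

# Weight averaging: average the weights once, then a single forward pass.
avg_weights = [sum(ws) / len(snapshots) for ws in zip(*snapshots)]
swa_pred = predict(avg_weights, x)

print(ensemble_pred, swa_pred)
```

The ensemble needs one forward pass per snapshot at prediction time, while the averaged model needs only one.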
 
<font color='#3498DB'><a id="section112"><h3>Idea</h3></a></font>
![swa](https://miro.medium.com/max/1766/1*_USiR_z8PKaDuIcAs9xomw.png)

The intuition for SWA comes from the empirical observation that local minima at the end of each learning rate cycle tend to accumulate at the border of areas of the loss surface where the loss value is low (points W1, W2 and W3 are at the border of the red area of low loss in the left panel of the figure above).
By taking the average of several such points, it is possible to achieve a wide, generalizable solution with even lower loss (Wswa in the left panel of the figure above).

Here is how it works. Instead of an ensemble of many models, you only need two models:
 - the first model that stores the running average of model weights (w_swa in the formula). This will be the final model after the end of the training which will be used for predictions.
 - the second model (w in the formula) that will be traversing the weight space, exploring it by using a cyclical learning rate schedule.
 
![swa2](https://miro.medium.com/max/502/1*Afu2bqxzC6p1BpIRTDWJtg.png)
 
At the end of each learning rate cycle, the current weights of the second model are used to update the weights of the running average model by taking a weighted mean of the old running average weights and the new set of weights from the second model (the formula is given in the figure above).
By following this approach, you only need to train one model and store only two models in memory during training. For prediction, you only need the running average model, and predicting with it is much faster than using the ensemble described above, where many models predict and their results are averaged.
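
The running-average update can be sketched in a few lines of plain Python. The class name `RunningAverage` and the toy weight vectors are assumptions. In practice `torch.optim.swa_utils.AveragedModel` maintains this average for you.

```python
class RunningAverage:
    """Keep the equal-weight mean of all weight vectors seen so far."""

    def __init__(self):
        self.n_models = 0
        self.w_swa = None

    def update(self, w):
        if self.w_swa is None:
            self.w_swa = list(w)
        else:
            n = self.n_models
            # w_swa <- (w_swa * n + w) / (n + 1)
            self.w_swa = [(a * n + b) / (n + 1) for a, b in zip(self.w_swa, w)]
        self.n_models += 1
        return self.w_swa


avg = RunningAverage()
for snapshot in ([1.0, 4.0], [3.0, 0.0], [2.0, 2.0]):  # end-of-cycle weights (toy values)
    avg.update(snapshot)
print(avg.w_swa)
```

Only the running average and the current weights are held at any time, which is why memory use stays at two models regardless of how many cycles are averaged.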

<font color='#3498DB'><a id="section112"><h3>Implementation</h3></a></font>

I won't be implementing it here, since it deserves its own separate kernel. The main code will look something like this:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

# placeholders: supply your own data loader, optimizer, model and loss function
loader, optimizer, model, loss_fn = ...
swa_start = 5  # epoch after which the weights start being averaged
swa_model = AveragedModel(model)  # holds the running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        # fold the current weights into the running average
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data (test_input is a placeholder)
preds = swa_model(test_input)
```

<font color='#3498DB'><a id="section112"><h3>References and Resources</h3></a></font>
 - [Jigsaw Unintended Bias in Toxicity Classification - 1st Place Solution](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/103280)
 - [Averaging Weights Leads to Wider Optima and Better Generalization](https://arxiv.org/pdf/1803.05407.pdf)
 - [Torch SWA Example](https://github.com/izmailovpavel/torch_swa_examples)
 - [Google QUEST Q&A Labeling - How to use SWA in PyTorch](https://www.kaggle.com/c/google-quest-challenge/discussion/129936)
 - [Stochastic Weight Averaging — a New Way to Get State of the Art Results in Deep Learning](https://towardsdatascience.com/stochastic-weight-averaging-a-new-way-to-get-state-of-the-art-results-in-deep-learning-c639ccf36a)
 - [PyTorch 1.6 now includes Stochastic Weight Averaging](https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/)
 - [Fast.ai SWA](https://github.com/fastai/fastai/pull/276/commits)
 

<font color='#3498DB'><a id="section6"><h2>Ending Notes</h2></a></font>

- There are many more strategies for stable training which I haven't covered and which one can research further:
    - Early Stopping
    - Training Iterations: Longer Fine-Tuning
    - Transferring via an Intermediate Task - STILTs Training
    - Weight initialization and data order
    - Mixed Precision Training
    
- I will be sharing a fine-tuning kernel with all of the above ideas and results soon.

- More comprehensive repositories for learning and implementing Transformers for various tasks can be found [here](https://notebooks.quantumstat.com/), [here](https://huggingface.co/transformers/master/community.html#community-notebooks) and [here](https://huggingface.co/transformers/notebooks.html)

- I want to acknowledge once more that this kernel has code implementations from the potpourri of best papers out there on Stable and Robust Transformer Fine-Tuning Strategies.

 - [REVISITING FEW-SAMPLE BERT FINE-TUNING](https://arxiv.org/pdf/2006.05987.pdf)
 - [ON THE STABILITY OF FINE-TUNING BERT](https://arxiv.org/pdf/2006.04884.pdf)
 - [SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models](https://arxiv.org/pdf/1911.03437.pdf)
 - [Fine-Tuning Pretrained Language Models:Weight Initializations, Data Orders, and Early Stopping](https://arxiv.org/pdf/2002.06305.pdf)
 - [MIXOUT: EFFECTIVE REGULARIZATION TO FINETUNE LARGE-SCALE PRETRAINED LANGUAGE MODELS](https://arxiv.org/pdf/1909.11299.pdf)
 - [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)
 - [Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks](https://arxiv.org/pdf/1811.01088.pdf)
 
<font color='#3498DB'><a id="section2"><h2>Thanks & Please Do Upvote!</h2></a></font>