# Transfer learning with transformers using Roberta large
by Artyom Glazunov

This notebook shows one example of transfer learning with the transformers library. It also explains how to load such models in Kaggle notebooks, because inference in some competitions does not allow an Internet connection (the solution is to attach the output of another notebook, which contains the saved models, as an input to the inference notebook). The notebook with the saved RoBERTa model is available at https://www.kaggle.com/artemglazunov1990/roberta-save and that model is fine-tuned here on our regression task. The inference example is available at https://www.kaggle.com/artemglazunov1990/inference-with-roberta
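The saving step itself lives in the linked roberta-save notebook and is not reproduced here. A minimal sketch of what such a notebook might do, assuming the `roberta-large` checkpoint name on the Hugging Face hub and the `rob` / `rob_tok` directory names of the archives used below. The helper name `save_for_offline_use` is hypothetical:

```python
import shutil


def save_for_offline_use(checkpoint="roberta-large", model_dir="rob", tok_dir="rob_tok"):
    """Download a checkpoint once (Internet required) and zip it for offline reuse.

    The checkpoint name is an assumption; the directory names mirror the
    archives that this notebook unpacks later.
    """
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import RobertaModel, RobertaTokenizer

    RobertaModel.from_pretrained(checkpoint).save_pretrained(model_dir)
    RobertaTokenizer.from_pretrained(checkpoint).save_pretrained(tok_dir)

    # Produces rob.zip and rob_tok.zip, each containing its top-level folder.
    return [
        shutil.make_archive(model_dir, "zip", root_dir=".", base_dir=model_dir),
        shutil.make_archive(tok_dir, "zip", root_dir=".", base_dir=tok_dir),
    ]
```

Anything written to the working directory of that notebook becomes its output, which can then be attached as an input to a notebook that runs without Internet access.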

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/commonlitreadabilityprize/sample_submission.csv
/kaggle/input/commonlitreadabilityprize/train.csv
/kaggle/input/commonlitreadabilityprize/test.csv
/kaggle/input/roberta-save/rob_tok.zip
/kaggle/input/roberta-save/__results__.html
/kaggle/input/roberta-save/rob.zip
/kaggle/input/roberta-save/__notebook__.ipynb
/kaggle/input/roberta-save/__output__.json
/kaggle/input/roberta-save/custom.css


Copy and unpack the pretrained RoBERTa large model and its tokenizer (they come from the output of the roberta-save notebook)

In [2]:
%%bash
cp ../input/roberta-save/rob.zip .
cp ../input/roberta-save/rob_tok.zip .
unzip rob.zip
unzip rob_tok.zip 
rm -r rob.zip rob_tok.zip 

Archive:  rob.zip
   creating: rob/
  inflating: rob/pytorch_model.bin   
  inflating: rob/config.json         
Archive:  rob_tok.zip
   creating: rob_tok/
  inflating: rob_tok/tokenizer_config.json  
  inflating: rob_tok/merges.txt      
  inflating: rob_tok/vocab.json      
  inflating: rob_tok/special_tokens_map.json  
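
The same unpacking step can also be done from Python, which avoids the `%%bash` cell magic. A minimal sketch using only the standard library; the helper name `unpack_archives` is hypothetical, and the default paths are the ones used in the cell above:

```python
import zipfile
from pathlib import Path


def unpack_archives(src_dir="../input/roberta-save", names=("rob.zip", "rob_tok.zip"), dest="."):
    """Extract each archive from the read-only input directory into `dest`."""
    extracted = []
    for name in names:
        with zipfile.ZipFile(Path(src_dir) / name) as archive:
            # Extracting straight from the input directory skips the copy
            # and the later removal of the zip files.
            archive.extractall(dest)
            extracted.extend(archive.namelist())
    return extracted
```

Calling `unpack_archives()` with the defaults should leave the `rob` and `rob_tok` folders in the working directory, as the shell commands do.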


Import useful packages

In [3]:
import numpy as np
import pandas as pd
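from transformers import RobertaTokenizer, RobertaModel, get_linear_schedule_with_warmup
import torch
# transformers.AdamW is deprecated; note that torch's AdamW defaults to weight_decay=0.01 instead of 0.0
from torch.optim import AdamW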
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split
import tqdm.notebook  # import the submodule explicitly so that tqdm.notebook.tqdm is available

Load the data

In [4]:
data = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
data.head()

Unnamed: 0,id,url_legal,license,excerpt,target,standard_error
0,c12129c31,,,When the young people returned to the ballroom...,-0.340259,0.464009
1,85aa80a4c,,,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805
2,b69ac6792,,,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676
3,dd1000b26,,,And outside before the palace a great garden w...,-1.054013,0.450007
4,37c1b32fb,,,Once upon a time there were Three Bears who li...,0.247197,0.510845


Get train and validation sets

In [5]:
data_train, data_val, y_err_train, y_err_val = train_test_split(data['excerpt'].values, data[['target', 'standard_error']].values,
                                                        test_size=0.15,
                                                        random_state=42)
data_train.shape, data_val.shape

((2408,), (426,))
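
The two shapes follow from `test_size=0.15` applied to the 2,834 rows (2408 + 426) of `train.csv`. Assuming scikit-learn rounds the validation share up and gives the remainder to the training set, the arithmetic can be checked by hand:

```python
import math

n_samples = 2408 + 426   # rows in train.csv, from the shapes printed above
test_size = 0.15

n_val = math.ceil(test_size * n_samples)   # validation share, rounded up
n_train = n_samples - n_val                # everything else goes to training

print(n_train, n_val)  # 2408 426
```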

Create Roberta tokenizer

In [6]:
tokenizer = RobertaTokenizer.from_pretrained(
    'rob_tok'
)

Encode data

In [7]:
%%time
encoded_data_train = tokenizer.batch_encode_plus(
    data_train,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicitly cut texts longer than max_length
    max_length=512,
    return_tensors='pt',
)

encoded_data_val = tokenizer.batch_encode_plus(
    data_val,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=512,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
values_train = torch.tensor(y_err_train[:, 0], dtype=torch.float)
errors_train = torch.tensor(y_err_train[:, 1], dtype=torch.float)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
values_val = torch.tensor(y_err_val[:, 0], dtype=torch.float)
errors_val = torch.tensor(y_err_val[:, 1], dtype=torch.float)

CPU times: user 5.63 s, sys: 35.8 ms, total: 5.67 s
Wall time: 5.72 s


As a result, we have PyTorch tensors with the padded lists of ids (the ids of the tokens of our texts in the pretrained RoBERTa vocabulary), the attention masks (they show the model where the padding is, since we do not want the padding to change the model's behavior), the targets, and the standard errors (they are not used here, but you can try to use them as an uncertainty level in your criterion later).
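
To make the role of the attention mask concrete, here is a simplified toy version of padding, written in plain Python instead of calling the tokenizer. The token ids are made up, and the padding id of 1 is an assumption based on the usual RoBERTa vocabulary:

```python
PAD_ID = 1  # assumed id of the <pad> token in the RoBERTa vocabulary


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad each id list to max_length and build the masks."""
    input_ids, attention_mask = [], []
    for ids in sequences:
        ids = ids[:max_length]                                # simplified truncation
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # padding on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_batch([[0, 713, 16, 2], [0, 9226, 2]], max_length=6)
print(ids)   # [[0, 713, 16, 2, 1, 1], [0, 9226, 2, 1, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```

The model receives both lists, and positions where the mask is 0 are ignored by the attention layers.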

Let's create tensor datasets and, after that, dataloaders (iterators that will be providing us with batches)

In [8]:
dataset_train = TensorDataset(input_ids_train,
                             attention_masks_train,
                             values_train,
                             errors_train)
dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val,
                            values_val,
                            errors_val)

In [9]:
batch_size = 4

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=RandomSampler(dataset_val),
    batch_size=2*batch_size
)
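
With these settings the number of batches per epoch is the dataset size divided by the batch size, rounded up, since `DataLoader` keeps the last, smaller batch by default. A quick check with the split sizes shown earlier:

```python
import math

batch_size = 4
n_train, n_val = 2408, 426   # split sizes printed after train_test_split

train_batches = math.ceil(n_train / batch_size)      # batches per training epoch
val_batches = math.ceil(n_val / (2 * batch_size))    # validation uses a doubled batch size

print(train_batches, val_batches)  # 602 54
```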

Create the device (we will load our model and batches onto it, but be careful and watch the memory)

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

Let's initialize our model class, criterion and rmse function

In [11]:
class BERTRegressor(torch.nn.Module): 
    def __init__(self, pretrained_src = 'rob'): 
        super().__init__()
        self.bert = RobertaModel.from_pretrained(pretrained_src)
        self.linear = torch.nn.Linear(1024, 1)
        self.dropout = torch.nn.Dropout(0.15)
        
    def forward(self, input_ids, attention_mask):  # token ids and padding mask of one batch
        hidden = self.bert(input_ids, 
                           attention_mask=attention_mask)[0][:, 0, :]  # last hidden state at the first (CLS-like) token
        output = self.linear(self.dropout(hidden))
        return output


class RMSELoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = torch.nn.MSELoss()
        
    def forward(self,yhat,y):
        loss = torch.sqrt(self.mse(yhat,y))
        return loss

def rmse_metric(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

Our params for training

In [12]:
warm_prop = 0.1 # we want our learning rate to grow for a while
epochs = 8
clip = 1 #we do not want too big gradients

model = BERTRegressor().to(device)
criterion = RMSELoss()
optimizer = AdamW(
    model.parameters(),
    lr= 3e-5,#the original paper:2e-5 -> 5e-5
    eps=1e-8
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(len(dataloader_train)*epochs * warm_prop),
    num_training_steps=len(dataloader_train)*epochs
)
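
The schedule scales the base learning rate by a factor that rises linearly from 0 to 1 during warm-up and then decays linearly back to 0. A plain-Python sketch of that factor, assuming 602 training batches per epoch (2408 samples with `batch_size = 4`). The function name `lr_factor` is hypothetical and only mirrors the behaviour of `get_linear_schedule_with_warmup`:

```python
def lr_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warm-up
    # linear decay to zero after the warm-up phase
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


batches_per_epoch, epochs, warm_prop, base_lr = 602, 8, 0.1, 3e-5
total_steps = batches_per_epoch * epochs       # 4816
warmup_steps = int(total_steps * warm_prop)    # 481

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, base_lr * lr_factor(step, warmup_steps, total_steps))
```

So the learning rate reaches its peak of 3e-5 after 481 steps, which is still inside the first epoch.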

Let's initialize our function for evaluation

In [13]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm.notebook.tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1]
        }
        target = batch[2]

        with torch.no_grad():        
            output = model(**inputs)
            
        loss = criterion(output, target.view(-1,1))
        loss_val_total += loss.item()  # accumulate a Python float instead of a tensor on the device

        output = output.detach().cpu().numpy()
        target = target.cpu().numpy()
        predictions.append(output)
        true_vals.append(target)
    
    loss_val_avg = loss_val_total / len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
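
Note that `loss_val_avg` is the mean of per-batch RMSE values, whereas `rmse_metric` is later applied to all predictions at once. The two numbers are generally not equal, because the square root of a mean is not the mean of square roots. A toy illustration with made-up residuals:

```python
import math


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Two hypothetical validation batches of residuals (prediction - target).
batches = [[0.1, 0.1], [0.7, 0.7]]

mean_of_batch_rmse = sum(rmse(b) for b in batches) / len(batches)
global_rmse = rmse([e for b in batches for e in b])

print(mean_of_batch_rmse)  # approximately 0.4
print(global_rmse)         # approximately 0.5
```

This is why the validation loss and the RMSE on validation can differ for the same epoch.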

Let's start our transfer learning process

In [14]:
best_val_loss = float('inf')
for epoch in tqdm.notebook.tqdm(range(epochs)):
    model.train()

    epoch_loss = 0
    for batch in tqdm.notebook.tqdm(dataloader_train):

        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1]
          }
        target = batch[2]

        optimizer.zero_grad()        

        output = model(**inputs)     
        loss = criterion(output, target.view(-1,1))      
        loss.backward()
        epoch_loss += loss.item()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)      
        optimizer.step()
        scheduler.step()     

    val_loss, predictions, true_vals = evaluate(dataloader_val)
    if val_loss < best_val_loss:
        best_val_loss = val_loss  # remember the best validation loss seen so far
        # here can be your code, if you want to save your best model
    train_loss = epoch_loss / len(dataloader_train)
    rmse_val = rmse_metric(true_vals, predictions)
    print('-------')
    print(f'Training loss: {train_loss}')
    print(f'Validation loss: {val_loss}')
    print(f"RMSE on validation: {rmse_val}")


-------
Training loss: 0.8054351467280293
Validation loss: 0.6232926249504089
RMSE on validation: 0.6422837376594543


-------
Training loss: 0.6453529352019
Validation loss: 0.5307248830795288
RMSE on validation: 0.5453047752380371


-------
Training loss: 0.5012190815902925
Validation loss: 0.521551787853241
RMSE on validation: 0.5478167533874512


-------
Training loss: 0.4042558380869833
Validation loss: 0.5879412889480591
RMSE on validation: 0.6068800687789917


-------
Training loss: 0.34241728508764524
Validation loss: 0.5271953344345093
RMSE on validation: 0.5379378199577332


-------
Training loss: 0.26550881010155347
Validation loss: 0.501842200756073
RMSE on validation: 0.5136346220970154


-------
Training loss: 0.21909108116876247
Validation loss: 0.5561699271202087
RMSE on validation: 0.5664398074150085


-------
Training loss: 0.17824170714015283
Validation loss: 0.5223321318626404
RMSE on validation: 0.5421127080917358
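
The training loop only marks the place where a checkpoint could be saved. Replaying the validation losses printed in the log shows which epoch such a checkpoint would come from. The bookkeeping is the same comparison against `best_val_loss`, here in plain Python with the logged values rounded to four decimals:

```python
# Validation losses from the log, one per epoch (rounded).
val_losses = [0.6233, 0.5307, 0.5216, 0.5879, 0.5272, 0.5018, 0.5562, 0.5223]

best_val_loss = float("inf")
best_epoch = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        # In the real loop this is where the model weights would be saved,
        # for example with torch.save(model.state_dict(), path).
        best_val_loss = val_loss
        best_epoch = epoch

print(best_epoch, best_val_loss)  # 6 0.5018
```

In this run the sixth epoch gives the lowest validation loss, while the training loss keeps falling afterwards, so saving the best model rather than the last one is worth considering.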


In [15]:
!rm -r rob rob_tok