# Quora Question Pairs - Basic Transfer Learning with BERT

In this notebook I present a basic transfer learning workflow: taking a general language model (BERT) and fine-tuning it for a specific task, paraphrase identification, that is, deciding whether two sentences have the same meaning.

In [3]:
import transformers
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification, Trainer
from transformers import TrainingArguments
from datasets import ClassLabel, Value
from transformers import DataCollatorWithPadding
import numpy as np
from datasets import load_metric

For this task I used the Hugging Face libraries, which make the job really straightforward. To save GPU budget in this demo, I chose a basic transformer model (`bert-base-uncased`).

The classes `AutoTokenizer` and `AutoModelForSequenceClassification` provide the text preprocessing steps and the model architecture, respectively. Note that because we are using `AutoModelForSequenceClassification` and BERT was pre-trained on a language-modeling task, a warning is shown saying that the last layer of our model (the classification head) is not initialized from the checkpoint. That layer starts from random weights and is learned during fine-tuning.

In [4]:

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Here we tell PyTorch to use the GPU if one is available.

In [5]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

Now we use the `load_dataset` function to load the dataset from the Hugging Face Hub (a repository of models, datasets and other resources): https://huggingface.co/datasets/quora

## Load and preprocess our data

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("quora")
raw_datasets

Downloading and preparing dataset quora/default (download: 55.48 MiB, generated: 55.46 MiB, post-processed: Unknown size, total: 110.94 MiB) to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04...


Dataset quora downloaded and prepared to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['questions', 'is_duplicate'],
        num_rows: 404290
    })
})

The output above shows the number of examples and the feature names.

In [7]:
raw_datasets['train'].features

{'questions': Sequence(feature={'id': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None)}, length=-1, id=None),
 'is_duplicate': Value(dtype='bool', id=None)}

Here are two examples of sentence pairs and their labels (whether the pair is a duplicate or not).

In [8]:
raw_datasets['train'][0:2]

{'questions': [{'id': [1, 2],
   'text': ['What is the step by step guide to invest in share market in india?',
    'What is the step by step guide to invest in share market?']},
  {'id': [3, 4],
   'text': ['What is the story of Kohinoor (Koh-i-Noor) Diamond?',
    'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?']}],
 'is_duplicate': [False, False]}

The tokenizer accepts sentence pairs as two parallel lists (all first sentences and all second sentences), so we first need to split each `questions` record into its two texts.

In [9]:
def tokenize_function(example):
    questions = example['questions']
    t1 = []
    t2 = []
    for t in questions:
        t1.append(t['text'][0])
        t2.append(t['text'][1])
    return tokenizer(t1, t2, truncation=True)
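
To make the reshaping concrete, here is a minimal, dependency-free sketch of what `tokenize_function` does before it calls the tokenizer. The `batch` dictionary is hand-written to mimic the two records printed above, and `split_pairs` is a hypothetical helper name used only for illustration.

```python
# Hand-written stand-in for raw_datasets['train'][0:2] as printed above.
batch = {
    "questions": [
        {"id": [1, 2],
         "text": ["What is the step by step guide to invest in share market in india?",
                  "What is the step by step guide to invest in share market?"]},
        {"id": [3, 4],
         "text": ["What is the story of Kohinoor (Koh-i-Noor) Diamond?",
                  "What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?"]},
    ],
    "is_duplicate": [False, False],
}


def split_pairs(questions):
    """Turn a list of {'id': [...], 'text': [a, b]} records into two parallel lists."""
    first = [q["text"][0] for q in questions]
    second = [q["text"][1] for q in questions]
    return first, second


first, second = split_pairs(batch["questions"])
print(first[0])
print(second[0])
```

The two lists line up by position, so `first[i]` and `second[i]` always belong to the same pair.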

In [10]:
tokenized_datasets = raw_datasets['train'].map(tokenize_function, batched=True)
tokenized_datasets

Dataset({
    features: ['attention_mask', 'input_ids', 'is_duplicate', 'questions', 'token_type_ids'],
    num_rows: 404290
})

Here we cast the boolean `is_duplicate` column to `ClassLabel`, drop the original `questions` column, rename `is_duplicate` to `labels`, and split the data into train (80%) and test (20%) sets.

In [11]:
new_features = tokenized_datasets.features.copy()
new_features["is_duplicate"] = ClassLabel(num_classes=2, names=['not_duplicate', 'duplicate'], names_file=None, id=None)
tokenized_datasets = tokenized_datasets.cast(new_features)
tokenized_datasets = tokenized_datasets.remove_columns('questions').rename_column('is_duplicate', 'labels')
tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.2)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 323432
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 80858
    })
})
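
As a quick sanity check on the cell above, the sketch below reproduces the label mapping and the split sizes in plain Python. It assumes that the cast maps `False` to class 0 and `True` to class 1, and that the 20% test share is a whole number of rows; `bool_to_class_id` is a hypothetical helper, not a `datasets` function.

```python
# Class names in index order, as passed to ClassLabel in the cell above.
names = ["not_duplicate", "duplicate"]


def bool_to_class_id(is_duplicate):
    """Map the boolean flag to an integer class id (assumed: False -> 0, True -> 1)."""
    return int(is_duplicate)


labels = [bool_to_class_id(flag) for flag in [False, False, True]]
label_names = [names[i] for i in labels]

# Row counts for an 80/20 split, using integer arithmetic to avoid float rounding.
total_rows = 404290
test_rows = total_rows * 20 // 100
train_rows = total_rows - test_rows
print(labels, label_names, train_rows, test_rows)
```

The computed row counts agree with the `num_rows` values reported for the `train` and `test` splits.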

Almost there! Our `tokenize_function` only used the option `truncation=True`, so the tokenized sequences still have different lengths. This is a problem because the model expects every input in a batch to have the same length (**padding**). Padding all records at once to a single length would inflate the whole dataset in memory, which is slow and resource-consuming.
So we'll use dynamic padding during collation instead. Collation is what PyTorch does to put examples together into batches; with dynamic padding, each batch is padded only to the length of its own longest sequence.
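
The idea of dynamic padding can be sketched without any library. The function below is a simplified illustration, not the actual collator code: `pad_batch` is a hypothetical name, the token ids are made up, and padding with id 0 on the right is an assumption that matches BERT's `pad_token_id`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two toy batches with made-up token ids; each is padded to its own maximum length.
short_batch = [[101, 7, 102], [101, 7, 8, 102]]
long_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]

short_ids, short_mask = pad_batch(short_batch)
long_ids, long_mask = pad_batch(long_batch)
print(len(short_ids[0]), len(long_ids[0]))
```

The short batch ends up 4 tokens wide and the long batch 7 tokens wide, so no batch pays for the longest sequence in the whole dataset.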

Conveniently, `transformers` already provides a class for that:

In [12]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Now we can sample a few examples to check that our collator works.

In [13]:
samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items()}
batch = data_collator(samples)
batch = batch.to(device)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 55]),
 'input_ids': torch.Size([8, 55]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 55])}

## Fine tuning our BERT model

Finally, we're ready to warm up our processors and do the job, using the `Trainer` API.
We need to provide a function to compute scores: the model returns logits, so we convert them to predicted classes ourselves before computing accuracy and F1. We also pass some parameters to the `TrainingArguments` class before creating our `Trainer`.

In [14]:

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
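
For readers who want to see what happens between the logits and the two scores, here is a dependency-free sketch. It is an illustration of the standard definitions (argmax, accuracy, binary F1 with class 1 as the positive class), not the code of the `glue`/`mrpc` metric itself; `accuracy_and_f1` is a hypothetical name and the logits are made up.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1 from raw logits and integer labels."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four pairs: one column per class (not_duplicate, duplicate).
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.2]]
labels = [0, 1, 0, 0]
scores = accuracy_and_f1(logits, labels)
print(scores)
```

In this toy example three of four predictions are correct (accuracy 0.75), while the single false positive pulls precision down to 0.5 and F1 to 2/3.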

In [15]:
training_args = TrainingArguments("./quora-saved-model", evaluation_strategy="epoch", save_strategy='no', 
                                  report_to='none', num_train_epochs=3, 
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [16]:
trainer.train()

***** Running training *****
  Num examples = 323432
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 30324


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2596 | 0.251683 | 0.89604 | 0.859253 |
| 2 | 0.1676 | 0.249931 | 0.907282 | 0.878782 |
| 3 | 0.0859 | 0.325111 | 0.910287 | 0.879313 |


***** Running Evaluation *****
  Num examples = 80858
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=30324, training_loss=0.19235535508299106, metrics={'train_runtime': 8040.3046, 'train_samples_per_second': 120.679, 'train_steps_per_second': 3.771, 'total_flos': 3.403727534333136e+16, 'train_loss': 0.19235535508299106, 'epoch': 3.0})
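
The step count and throughput in the log can be reproduced with a little arithmetic. This is a sketch under the assumptions visible in the log (a single device, gradient accumulation of 1) plus the assumption that the last, partially filled batch of each epoch still counts as a step.

```python
import math

# Values copied from the training log above.
num_examples = 323432
batch_size = 32
num_epochs = 3
train_runtime_seconds = 8040.3046

# Each epoch needs one step per batch; the final batch may be smaller than 32.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput: every example is seen once per epoch.
samples_per_second = num_examples * num_epochs / train_runtime_seconds
print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

The total matches the `Total optimization steps` line and `global_step` in `TrainOutput`.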

In [17]:
eval_results = trainer.evaluate()
print(eval_results)

***** Running Evaluation *****
  Num examples = 80858
  Batch size = 32


{'eval_loss': 0.325111448764801, 'eval_accuracy': 0.9102871701006703, 'eval_f1': 0.8793132133231292, 'eval_runtime': 202.3987, 'eval_samples_per_second': 399.499, 'eval_steps_per_second': 12.485, 'epoch': 3.0}


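With only 3 epochs we reached about 91% accuracy and an F1 score of about 88% on the held-out split, using a basic model that ignores letter case. Note that the validation loss rose in the third epoch while the training loss kept falling, which suggests the model was starting to overfit.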

## Test our model

Lastly, we can write a few sentence pairs of our own and test the model on them.

In [18]:
tokens = tokenizer([
    ['How can I be successful in Kaggle Competitions?', 'How can I be successful in life?'],
    ['What is the best place to eat a pizza in Italy?','What is the best restaurant in Italy?'],
    ['What are the good courses to learn pytorch?','Are there good courses to learn pytorch?']],
    truncation=True, padding=True, return_tensors='pt')

tokens.to(device)

{'input_ids': tensor([[  101,  2129,  2064,  1045,  2022,  3144,  1999, 10556, 24679,  6479,
          1029,   102,  2129,  2064,  1045,  2022,  3144,  1999,  2166,  1029,
           102,     0,     0,     0,     0,     0],
        [  101,  2054,  2003,  1996,  2190,  2173,  2000,  4521,  1037, 10733,
          1999,  3304,  1029,   102,  2054,  2003,  1996,  2190,  4825,  1999,
          3304,  1029,   102,     0,     0,     0],
        [  101,  2054,  2024,  1996,  2204,  5352,  2000,  4553,  1052, 22123,
          2953,  2818,  1029,   102,  2024,  2045,  2204,  5352,  2000,  4553,
          1052, 22123,  2953,  2818,  1029,   102]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]], device='cuda:0'), 'attentio
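
The printed tensors follow BERT's layout for sentence pairs: `[CLS]` (id 101), the first sentence, `[SEP]` (id 102), the second sentence, another `[SEP]`, then padding (id 0). The sketch below rebuilds the first row by hand from the word-piece ids shown above. `encode_pair` is a hypothetical helper, it does no truncation, and the attention mask it produces is an assumption because that part of the output is cut off.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids visible in the output above


def encode_pair(ids_a, ids_b, max_len):
    """Lay out two already-tokenized sentences as [CLS] A [SEP] B [SEP] + padding."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


# Word-piece ids of the first pair, copied from the printed tensor above.
first_question = [2129, 2064, 1045, 2022, 3144, 1999, 10556, 24679, 6479, 1029]
second_question = [2129, 2064, 1045, 2022, 3144, 1999, 2166, 1029]
encoded = encode_pair(first_question, second_question, max_len=26)
print(encoded["token_type_ids"])
```

The resulting `input_ids` and `token_type_ids` match the first row of each printed tensor, including the five padding positions at the end.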

In [19]:
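model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():  # no gradients are needed for prediction
    logits = model(**tokens).logits
logits = logits.cpu().numpy()
preds = np.argmax(logits, axis=-1)
preds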

array([0, 0, 1])
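
The predicted class ids map back to the names given to `ClassLabel` earlier, so `[0, 0, 1]` means the first two pairs are judged not duplicates and the third a duplicate. A softmax turns logits into probabilities if a confidence value is wanted. The sketch below uses made-up logits (the real ones are not printed in this notebook), chosen only so that their argmax matches the result above.

```python
import math

names = ["not_duplicate", "duplicate"]  # index order used for ClassLabel


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the three pairs; only their argmax mirrors array([0, 0, 1]).
example_logits = [[2.1, -1.9], [1.2, -0.8], [-1.5, 1.7]]

results = []
for row in example_logits:
    probs = softmax(row)
    pred = probs.index(max(probs))
    results.append((names[pred], probs[pred]))
    print(names[pred], round(probs[pred], 3))
```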

In [20]:
model.save_pretrained('/kaggle/working/quora-saved-model')
tokenizer.save_pretrained('/kaggle/working/quora-saved-model')


Configuration saved in /kaggle/working/quora-saved-model/config.json
Model weights saved in /kaggle/working/quora-saved-model/pytorch_model.bin
tokenizer config file saved in /kaggle/working/quora-saved-model/tokenizer_config.json
Special tokens file saved in /kaggle/working/quora-saved-model/special_tokens_map.json


('/kaggle/working/quora-saved-model/tokenizer_config.json',
 '/kaggle/working/quora-saved-model/special_tokens_map.json',
 '/kaggle/working/quora-saved-model/vocab.txt',
 '/kaggle/working/quora-saved-model/added_tokens.json',
 '/kaggle/working/quora-saved-model/tokenizer.json')

In [21]:
# Saving the model
model.save_pretrained('/kaggle/working/quora-saved-model')
tokenizer.save_pretrained('/kaggle/working/quora-saved-model')

# Loading the model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('/kaggle/working/quora-saved-model')
tokenizer = AutoTokenizer.from_pretrained('/kaggle/working/quora-saved-model')


Configuration saved in /kaggle/working/quora-saved-model/config.json
Model weights saved in /kaggle/working/quora-saved-model/pytorch_model.bin
tokenizer config file saved in /kaggle/working/quora-saved-model/tokenizer_config.json
Special tokens file saved in /kaggle/working/quora-saved-model/special_tokens_map.json
loading configuration file /kaggle/working/quora-saved-model/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classifi