### Fine-Tuning BERT for Sequence Classification

The model will be fine-tuned to classify sequences as either positive or negative. To train the model, we need to:

<ol>

<li>Retrieve the dataset.</li>
<li>Tokenize the dataset and pad each batch using the data collator.</li>
<li>Instantiate the trainer and begin the training loop.</li>
<li>Conduct inference.</li>
</ol>

In [None]:
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer

In [None]:
!pip install datasets evaluate



In [None]:
from datasets import load_dataset

In [None]:
checkpoint = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_dataset = load_dataset("glue", "sst2")



In [None]:
raw_dataset["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
def tokenize(example):
  return tokenizer(example["sentence"], truncation=True)

In [None]:
tokenized_dataset = raw_dataset.map(tokenize, batched=True)


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
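
To make the collator's role concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only to the length of its longest sequence. The helper `pad_batch` and the token ids are illustrative assumptions, not part of the library; `DataCollatorWithPadding` does the equivalent work (and returns tensors) using the tokenizer's own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad each token-id list to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for a short and a longer sentence.
batch = pad_batch([[101, 1188, 102], [101, 1188, 1110, 1632, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches small, which is why the tokenize step above only truncates and leaves padding to the collator.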

In [None]:
train_arguments = TrainingArguments("test-directory")

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
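import evaluate

# Load the GLUE SST-2 metric once instead of on every evaluation call.
metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_predict):
  # eval_predict unpacks into the model's logits and the reference labels.
  logits, labels = eval_predict
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)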
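
As a small illustration of what the metric function computes, the sketch below turns logits into class predictions with `np.argmax` and scores them by accuracy, which is the metric GLUE reports for SST-2. The logits and labels are made-up values chosen for the example.

```python
import numpy as np

# Hypothetical logits for four sentences; columns are (negative, positive).
logits = np.array([[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]])
labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Here three of the four predictions match the labels, so the accuracy is 0.75.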

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
trainer = Trainer(
    model,
    args = train_arguments,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)



In [None]:
trainer.train()

Step,Training Loss
500,0.4323
1000,0.3844
1500,0.3757
2000,0.387
2500,0.3401
3000,0.3365
3500,0.3238
4000,0.3178
4500,0.3182
5000,0.3124


TrainOutput(global_step=25257, training_loss=0.23319151014775144, metrics={'train_runtime': 1322.3203, 'train_samples_per_second': 152.797, 'train_steps_per_second': 19.101, 'total_flos': 3251945755383360.0, 'train_loss': 0.23319151014775144, 'epoch': 3.0})
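
The numbers in this output can be checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults (a per-device batch size of 8 and 3 epochs) and a single device, since the notebook only passes an output directory.

```python
import math

train_rows = 67349         # size of the train split in the DatasetDict printed above
batch_size = 8             # assumed: default per-device train batch size, one device
num_epochs = 3             # assumed default; consistent with 'epoch': 3.0 in the output
train_runtime = 1322.3203  # seconds, from 'train_runtime'

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Throughput figures, rounded to one decimal place.
print(round(train_rows * num_epochs / train_runtime, 1))
print(round(total_steps / train_runtime, 1))
```

Under these assumptions the total comes to 25257 steps, matching `global_step`, and the throughput values agree with `train_samples_per_second` and `train_steps_per_second`.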

In [None]:
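import os
# Makes CUDA kernel launches synchronous so errors surface at the failing call.
# This needs to be set before CUDA is initialized to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"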


In [None]:
import torch
print(torch.version.cuda)

12.1


In [None]:
!pip install torch --no-cache-dir



In [None]:
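import os
# Note: the error message asks to *compile* with TORCH_USE_CUDA_DSA, so setting it at runtime may have no effect.
os.environ['TORCH_USE_CUDA_DSA'] = '1'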

In [None]:
trainer.predict(tokenized_dataset["test"], ignore_keys=True)

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
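
One plausible cause of the assert above, offered as an assumption rather than a confirmed diagnosis: the GLUE test splits are distributed with placeholder labels of -1, and `Trainer.predict` computes a loss whenever a label column is present, so an out-of-range class index would reach the loss kernel on the GPU. The hypothetical helper below is a minimal sketch of a check that could be run on a label column before predicting.

```python
def find_invalid_labels(labels, num_labels):
    """Return (index, label) pairs whose label falls outside [0, num_labels)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < num_labels]

# Hypothetical label column: -1 stands in for a hidden test label.
labels = [1, 0, -1, -1]
invalid = find_invalid_labels(labels, num_labels=2)
print(invalid)
```

If the same check on `tokenized_dataset["test"]["label"]` reports such values, dropping the column first, for example `trainer.predict(tokenized_dataset["test"].remove_columns("label"))`, should skip the loss computation; predicting on the labeled `validation` split is another option. Note also that `ignore_keys` expects a list of strings rather than `True`.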
