## Fine-tuning a Pretrained BERT Model for Classification Using PyTorch

Install Required Packages

In [None]:
!pip install datasets
!pip install transformers

Before fine-tuning a model, we need a dataset to fine-tune it on. For this tutorial we will use the IMDb dataset to classify movie reviews as positive or negative. The raw_dataset object is a dictionary with three keys: "train", "test" and "unsupervised" (which correspond to the three splits of that dataset). We will use the "train" split for training and the "test" split for validation.

In [None]:
from datasets import load_dataset
raw_dataset = load_dataset('imdb')
print(raw_dataset)

Now we will use the AutoTokenizer from the transformers library to preprocess our data.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

We will use the map method to preprocess all splits of the data.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)
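To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: the helper `pad_or_truncate`, the `PAD_ID` value, and the token ids are hypothetical, and the real tokenizer also handles special tokens, which this sketch ignores.

```python
# Toy illustration of padding="max_length" combined with truncation=True.
# PAD_ID and the token ids below are made-up values, not BERT's vocabulary.
PAD_ID = 0

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input ids plus the matching attention mask."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    n_pad = max_length - len(kept)       # padding: how many slots remain to fill
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([11, 12, 13], max_length=6)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17, 18], max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is what allows the examples to be stacked into batches later on.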

Let's keep a small subset of the data for fine-tuning. We will use this subset throughout the tutorial.

In [None]:
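# Shuffle with a fixed seed for reproducibility, then keep the first 1000 examples.
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_test_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))
# The full splits remain available for a complete training run.
full_train_dataset = tokenized_datasets['train']
full_test_dataset = tokenized_datasets['test']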
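The shuffle-then-select pattern used for the small splits can be sketched with the standard library alone. This is only an illustration of the idea: the helper `shuffled_subset` and the toy rows are hypothetical, and the `datasets` library uses its own random number generator, so the rows it picks for seed 42 will differ from this sketch.

```python
import random

def shuffled_subset(rows, n, seed):
    """Shuffle row indices with a fixed seed, then keep the first n rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return [rows[i] for i in indices[:n]]

# Toy stand-in for a dataset split.
rows = [{"text": f"review {i}", "label": i % 2} for i in range(10)]

subset_a = shuffled_subset(rows, n=4, seed=42)
subset_b = shuffled_subset(rows, n=4, seed=42)
print(subset_a == subset_b)  # the fixed seed makes the subset reproducible
```

Shuffling before selecting matters when a split is stored in label order, since taking the first rows unshuffled could then return examples from a single class.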

In [None]:
import pandas as pd
print(small_train_dataset['label'][0:5],'\n')
df = pd.DataFrame(small_train_dataset)

In [None]:
df.head(1)

Our downstream task is classification, whereas BERT was pretrained with masked language modeling and next-sentence prediction objectives. We therefore load the model through AutoModelForSequenceClassification, which resolves to BertForSequenceClassification for this checkpoint. This removes BERT's pretraining head and replaces it with a classification head. As a result, the weights of the new head are randomly initialized and must be learned during fine-tuning.
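As a rough picture of what a freshly initialized classification head computes, the sketch below applies a random linear layer to a stand-in sentence vector. It does not use the transformers library: the random `pooled_output` merely plays the role of the encoder's output, and the initialization scale of 0.02 is an assumption made for illustration.

```python
import random

random.seed(0)
hidden_size, num_labels = 768, 2  # bert-base hidden size, two sentiment classes

# Stand-in for the sentence representation the pretrained encoder would produce.
pooled_output = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]

# New head: one weight row per label, randomly initialized (scale is an assumption).
head_weight = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)]
               for _ in range(num_labels)]
head_bias = [0.0] * num_labels

# Linear layer: one logit per label.
logits = [sum(w * x for w, x in zip(row, pooled_output)) + b
          for row, b in zip(head_weight, head_bias)]
print(len(logits))
```

Because the head starts from random weights, these logits carry no information about sentiment until fine-tuning has updated them.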

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

In [None]:
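from transformers import TrainingArguments
# 'test_trainer' is the output directory; evaluation runs at the end of every epoch.
# Note: recent transformers releases name this argument eval_strategy.
training_args = TrainingArguments('test_trainer', evaluation_strategy='epoch')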

In [None]:
import numpy as np
from datasets import load_metric

# Note: newer datasets releases drop load_metric in favor of the separate evaluate package.
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
from transformers import Trainer
# compute_metrics must already be defined when it is passed to the Trainer.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
    compute_metrics=compute_metrics,
)
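To see what `compute_metrics` reports, the following standalone sketch reproduces the same argmax-then-accuracy logic on a handful of made-up logits. The helper names and the toy values are hypothetical; the notebook itself relies on the library metric instead.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four toy examples with two classes each (0 = negative, 1 = positive).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, so the reported accuracy is 0.75.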

In [None]:
trainer.train()
trainer.evaluate()