## <font color='blue'>**BERT (Bidirectional Encoder Representations from Transformers)**</font>

https://arxiv.org/pdf/1810.04805.pdf

### <font color='red'>**Overview:** </font>
BERT is a pre-trained model that learns contextual word representations by attending to both the left and the right context of every token in a sentence. Unlike RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), which process a sequence one token at a time, BERT uses the Transformer encoder architecture, whose self-attention layers process all tokens of a sequence in parallel.

### <font color='red'>**Pretraining:** </font>
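During pretraining, BERT learns contextual representations from large-scale unlabeled text through two self-supervised tasks: masked language modeling (predicting tokens that were randomly masked out) and next sentence prediction (deciding whether one sentence follows another).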
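
The masked language modeling objective can be illustrated with a small sketch. It follows the masking recipe described in the BERT paper: about 15% of tokens are selected, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative assumptions; real pretraining operates on token IDs and skips special tokens.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for masked language modeling.

    targets[i] holds the original token at selected positions and None
    elsewhere; only the selected positions contribute to the training loss.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            targets.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            targets.append(None)
            corrupted.append(token)
    return corrupted, targets

tokens = ['president', 'trump', 'visits', 'japan', 'for', 'a', 'summit', 'meeting', '.']
toy_vocab = ['market', 'iphone', 'species', 'crash']  # hypothetical vocabulary
# mask_prob is raised to 0.5 so that a short sentence shows some masking
corrupted, targets = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(targets)
```

Positions where `targets` is `None` are left untouched; the model is asked to recover the original token everywhere else.
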
### <font color='red'>**Fine-tuning:** </font>
After pretraining, BERT can be fine-tuned on a downstream task by adding a small task-specific output layer and training the whole model on labeled data. In the original paper, fine-tuning the pretrained representations in this way gave state-of-the-art results on eleven natural language processing tasks.

![Pre-training and Fine-tuning](https://github.com/Rekha215/Pytorch-Basics/blob/main/Screenshot%202024-04-04%20130426.png?raw=true)



![Embeddings](https://github.com/Rekha215/Pytorch-Basics/blob/main/Screenshot%202024-04-04%20130513.png?raw=true)

## <font color='green'>**Tutorial components** </font>

**1. Predictions using existing fine-tuned model**

**2. Fine-tuning of bert-base-uncased using AG_NEWS dataset**

  2.1 Load the dataset from datasets

  2.2 Tokenizing samples

  2.3 Instantiating the BERT model (bert-base-uncased or bert-large-uncased)

  2.4 Creating data loaders

  2.5 Training

  2.6 Evaluation

In [1]:
!pip -q install datasets

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import tqdm
from sklearn.metrics import classification_report

## **1. Predictions using existing Fine-Tuned model on AG_NEWS dataset**

https://huggingface.co/fabriceyhc/bert-base-uncased-ag_news


In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "fabriceyhc/bert-base-uncased-ag_news"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
# Example news articles
news_articles = [
    "President Trump visits Japan for a summit meeting.",
    "The new iPhone is expected to be released next month.",
    "Scientists discover a new species of dinosaur in Argentina.",
    "Stock market experiences a major crash, causing panic among investors."
]

In [5]:
# Inspect the WordPiece tokens of each example news article
for article in news_articles:
    tokens = tokenizer.tokenize(article)
    print(tokens)

['president', 'trump', 'visits', 'japan', 'for', 'a', 'summit', 'meeting', '.']
['the', 'new', 'iphone', 'is', 'expected', 'to', 'be', 'released', 'next', 'month', '.']
['scientists', 'discover', 'a', 'new', 'species', 'of', 'dinosaur', 'in', 'argentina', '.']
['stock', 'market', 'experiences', 'a', 'major', 'crash', ',', 'causing', 'panic', 'among', 'investors', '.']


In [6]:
# Tokenize the news articles
encoded_articles = tokenizer(news_articles, truncation=True, padding=True, return_tensors='pt')

print("Encoded Articles \n", encoded_articles)

Encoded Articles 
 {'input_ids': tensor([[  101,  2343,  8398,  7879,  2900,  2005,  1037,  6465,  3116,  1012,
           102,     0,     0,     0],
        [  101,  1996,  2047, 18059,  2003,  3517,  2000,  2022,  2207,  2279,
          3204,  1012,   102,     0],
        [  101,  6529,  7523,  1037,  2047,  2427,  1997, 15799,  1999,  5619,
          1012,   102,     0,     0],
        [  101,  4518,  3006,  6322,  1037,  2350,  5823,  1010,  4786,  6634,
          2426,  9387,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
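
The structure of the encoded batch above can be reproduced by hand. The sketch below, in plain Python, wraps each ID sequence in `[CLS]` (ID 101) and `[SEP]` (ID 102), pads with `[PAD]` (ID 0) to the longest sequence, and builds the matching attention mask. The helper name `pad_batch` is hypothetical; the tokenizer does this internally when `padding=True`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the bert-base-uncased vocabulary

def pad_batch(sequences):
    """Wrap each ID sequence in [CLS] ... [SEP] and pad to the longest one."""
    wrapped = [[CLS_ID] + seq + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# WordPiece IDs of the first and last example articles, copied from the output above
article_ids = [
    [2343, 8398, 7879, 2900, 2005, 1037, 6465, 3116, 1012],
    [4518, 3006, 6322, 1037, 2350, 5823, 1010, 4786, 6634, 2426, 9387, 1012],
]
input_ids, attention_mask = pad_batch(article_ids)
print(input_ids[0])
print(attention_mask[0])
```

The first article has 9 tokens plus 2 special tokens, so it receives 3 padding positions to match the 14-token length of the longest article.
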


In [7]:
# Get model predictions
with torch.no_grad():
    outputs = model(**encoded_articles)

print("Outputs \n", outputs)

Outputs 
 SequenceClassifierOutput(loss=None, logits=tensor([[ 5.7878, -2.2241, -1.8262, -2.2474],
        [-0.7577, -4.3739, -0.8240,  4.6821],
        [ 0.3683, -4.0031, -1.6276,  4.1006],
        [ 0.8006, -4.8784,  3.0927, -1.0170]]), hidden_states=None, attentions=None)


In [8]:
# Get the predicted class indices
predicted_indices = torch.argmax(outputs.logits, dim=1)

In [9]:
print(predicted_indices)

tensor([0, 3, 3, 2])
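
The class indices are easier to read as label names with a probability attached. The sketch below applies a softmax to the logits printed above and maps each argmax to the AG News class names from the dataset card (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech). It uses plain Python so it runs without the model; whether this checkpoint's `config.id2label` carries the same names is not checked here.

```python
import math

# AG News class names in label order (per the dataset card)
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above
logits = [
    [5.7878, -2.2241, -1.8262, -2.2474],
    [-0.7577, -4.3739, -0.8240, 4.6821],
    [0.3683, -4.0031, -1.6276, 4.1006],
    [0.8006, -4.8784, 3.0927, -1.0170],
]

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    predictions.append(AG_NEWS_LABELS[best])
    print(f"{AG_NEWS_LABELS[best]:<9} p={probs[best]:.3f}")
```

With these logits the four articles map to World, Sci/Tech, Sci/Tech, and Business, which agrees with `tensor([0, 3, 3, 2])`.
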


## **2. BERT-base-uncased Fine-Tuning using ag_news dataset**

Dataset link - https://huggingface.co/datasets/ag_news

BERT paper link - https://arxiv.org/pdf/1810.04805.pdf


### **2.1 Load the dataset from datasets**

In [10]:
# Load AG News dataset
dataset = load_dataset("ag_news")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


In [11]:
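# Extract train, validation, and test data.
# Note: AG News ships only "train" and "test" splits, so the test split is
# reused here as the validation set. The validation and test metrics reported
# later are therefore computed on the same 7,600 examples.
train_texts = dataset["train"]["text"]
train_labels = dataset["train"]["label"]
val_texts = dataset["test"]["text"]
val_labels = dataset["test"]["label"]
test_texts = dataset["test"]["text"]
test_labels = dataset["test"]["label"]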
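
Because the validation and test data come from the same split here, the validation scores cannot serve as an independent check. One possible alternative, shown as a sketch and not what this notebook does, is to hold out part of the training split for validation. The helper `split_indices`, the 10% fraction, and the seed are illustrative assumptions, and the toy lists stand in for `train_texts` and `train_labels`; the `datasets` library also offers `Dataset.train_test_split` for the same purpose.

```python
import random

def split_indices(n_examples, val_fraction=0.1, seed=42):
    """Shuffle indices 0..n_examples-1 and split them into train/validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_examples * val_fraction)
    return indices[n_val:], indices[:n_val]

# Toy stand-ins for train_texts / train_labels
texts = [f"article {i}" for i in range(20)]
labels = [i % 4 for i in range(20)]

train_idx, val_idx = split_indices(len(texts), val_fraction=0.1)
new_train_texts = [texts[i] for i in train_idx]
new_train_labels = [labels[i] for i in train_idx]
new_val_texts = [texts[i] for i in val_idx]
new_val_labels = [labels[i] for i in val_idx]
print(len(new_train_texts), len(new_val_texts))
```

Training on the reduced set would change the numbers reported in the logs of this notebook, which were produced with the full 120,000 training examples.
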

### **2.2 Tokenizing samples**

In [12]:
# Define a simple dataset class for AG News
class AGNewsDataset(Dataset):
    """Tokenizes one AG News example at a time, padded/truncated to 128 tokens."""

    def __init__(self, tokenizer, texts, labels):
        self.tokenizer = tokenizer
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Calling the tokenizer directly replaces the deprecated encode_plus method
        encoded_inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            padding='max_length',
            max_length=128,
            truncation=True,
            return_tensors='pt'
        )

        # return_tensors='pt' yields shape (1, 128); drop only the batch dimension
        input_ids = encoded_inputs['input_ids'].squeeze(0)
        attention_mask = encoded_inputs['attention_mask'].squeeze(0)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'label': label
        }


### **2.3 Instantiating the BERT model (bert-base-uncased or bert-large-uncased)**

In [13]:
# Initialize tokenizer and model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **2.4 Creating data loaders**

In [14]:
# Create train dataset and dataloader
train_dataset = AGNewsDataset(tokenizer, train_texts, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Create validation dataset and dataloader
val_dataset = AGNewsDataset(tokenizer, val_texts, val_labels)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)

# Create test dataset and dataloader (batch_size=1 scores one example per step; a larger batch would evaluate faster)
test_dataset = AGNewsDataset(tokenizer, test_texts, test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
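
The number of steps per epoch follows directly from the dataset sizes and batch sizes chosen above. A small sketch of the arithmetic, assuming the `DataLoader` default `drop_last=False` so the last partial batch is kept (the helper name `num_batches` is hypothetical):

```python
import math

def num_batches(n_examples, batch_size):
    """Batches per epoch for a DataLoader that keeps the last partial batch."""
    return math.ceil(n_examples / batch_size)

print(num_batches(120_000, 64))  # training batches   -> 1875
print(num_batches(7_600, 64))    # validation batches -> 119
print(num_batches(7_600, 1))     # test batches       -> 7600
```

These counts are the totals shown by the `tqdm` progress bars in the training and test logs.
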

### **2.5 Training**

In [15]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1, eps=1e-8)
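
To make the loss concrete, the following plain-Python sketch computes what `nn.CrossEntropyLoss` returns with its default mean reduction: the batch average of `-log softmax(logits)[label]`. The function name `cross_entropy` and the example inputs are illustrative; the training loop itself uses the PyTorch criterion.

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        m = max(logits)
        # log-sum-exp computed stably by subtracting the maximum logit
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)

# Uniform logits over 4 classes: the loss equals log(4), roughly 1.386
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
# A confident, correct prediction gives a loss close to 0
print(cross_entropy([[5.7878, -2.2241, -1.8262, -2.2474]], [0]))
```

A freshly initialized 4-class head typically starts near the `log(4)` value, and the loss shrinks as the correct class receives more probability.
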

In [16]:
# Training loop
epochs = 3

for epoch in range(epochs):
    model.train()
    train_predicted_labels = []
    train_true_labels = []

    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{epochs} - Training")
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()

        train_predicted_label = torch.argmax(outputs.logits, dim=1)
        train_predicted_labels.extend(train_predicted_label.cpu().numpy())
        train_true_labels.extend(labels.cpu().numpy())

        progress_bar.set_postfix({'loss': loss.item()})

    train_classification_rep = classification_report(train_true_labels, train_predicted_labels, digits=4)
    print(f"\nEpoch {epoch + 1}/{epochs} - Training Classification Report:\n{train_classification_rep}")

    # Validation loop
    model.eval()
    val_predicted_labels = []
    val_true_labels = []

    progress_bar = tqdm(val_dataloader, desc=f"Epoch {epoch + 1}/{epochs} - Validation")
    with torch.no_grad():
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            val_predicted_label = torch.argmax(outputs.logits, dim=1)
            val_predicted_labels.extend(val_predicted_label.cpu().numpy())
            val_true_labels.extend(labels.cpu().numpy())

    val_classification_rep = classification_report(val_true_labels, val_predicted_labels, digits=4)
    print(f"\nEpoch {epoch + 1}/{epochs} - Validation Classification Report:\n{val_classification_rep}")

Epoch 1/3 - Training: 100%|██████████| 1875/1875 [11:31<00:00,  2.71it/s, loss=0.246]



Epoch 1/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9365    0.9111    0.9236     30000
           1     0.9728    0.9780    0.9754     30000
           2     0.8845    0.8832    0.8838     30000
           3     0.8836    0.9042    0.8938     30000

    accuracy                         0.9191    120000
   macro avg     0.9194    0.9191    0.9192    120000
weighted avg     0.9194    0.9191    0.9192    120000



Epoch 1/3 - Validation: 100%|██████████| 119/119 [00:20<00:00,  5.94it/s]



Epoch 1/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9559    0.9479    0.9519      1900
           1     0.9792    0.9911    0.9851      1900
           2     0.9176    0.9084    0.9130      1900
           3     0.9132    0.9189    0.9161      1900

    accuracy                         0.9416      7600
   macro avg     0.9415    0.9416    0.9415      7600
weighted avg     0.9415    0.9416    0.9415      7600



Epoch 2/3 - Training: 100%|██████████| 1875/1875 [11:29<00:00,  2.72it/s, loss=0.0788]



Epoch 2/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9671    0.9497    0.9583     30000
           1     0.9879    0.9917    0.9898     30000
           2     0.9288    0.9152    0.9219     30000
           3     0.9150    0.9414    0.9280     30000

    accuracy                         0.9495    120000
   macro avg     0.9497    0.9495    0.9495    120000
weighted avg     0.9497    0.9495    0.9495    120000



Epoch 2/3 - Validation: 100%|██████████| 119/119 [00:20<00:00,  5.88it/s]



Epoch 2/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9698    0.9463    0.9579      1900
           1     0.9879    0.9905    0.9892      1900
           2     0.9314    0.9000    0.9154      1900
           3     0.8978    0.9474    0.9219      1900

    accuracy                         0.9461      7600
   macro avg     0.9467    0.9461    0.9461      7600
weighted avg     0.9467    0.9461    0.9461      7600



Epoch 3/3 - Training: 100%|██████████| 1875/1875 [11:30<00:00,  2.71it/s, loss=0.185]



Epoch 3/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9780    0.9614    0.9697     30000
           1     0.9924    0.9953    0.9938     30000
           2     0.9463    0.9343    0.9403     30000
           3     0.9319    0.9568    0.9442     30000

    accuracy                         0.9619    120000
   macro avg     0.9621    0.9619    0.9620    120000
weighted avg     0.9621    0.9619    0.9620    120000



Epoch 3/3 - Validation: 100%|██████████| 119/119 [00:20<00:00,  5.93it/s]


Epoch 3/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9667    0.9474    0.9569      1900
           1     0.9863    0.9884    0.9874      1900
           2     0.9160    0.9121    0.9140      1900
           3     0.9104    0.9305    0.9204      1900

    accuracy                         0.9446      7600
   macro avg     0.9449    0.9446    0.9447      7600
weighted avg     0.9449    0.9446    0.9447      7600






### **2.6 Evaluation**

In [17]:
# Test set evaluation (same split as the validation set, so this report matches the epoch 3 validation report)
test_predicted_labels = []
test_true_labels = []

model.eval()
with torch.no_grad():
    for batch in tqdm(test_dataloader, desc="Test"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        test_predicted_label = torch.argmax(outputs.logits, dim=1)

        test_predicted_labels.extend(test_predicted_label.cpu().numpy())
        test_true_labels.extend(labels.cpu().numpy())

test_classification_rep = classification_report(test_true_labels, test_predicted_labels, digits=4)
print(f"\nTest Set Classification Report:\n{test_classification_rep}")

Test: 100%|██████████| 7600/7600 [01:26<00:00, 87.98it/s]


Test Set Classification Report:
              precision    recall  f1-score   support

           0     0.9667    0.9474    0.9569      1900
           1     0.9863    0.9884    0.9874      1900
           2     0.9160    0.9121    0.9140      1900
           3     0.9104    0.9305    0.9204      1900

    accuracy                         0.9446      7600
   macro avg     0.9449    0.9446    0.9447      7600
weighted avg     0.9449    0.9446    0.9447      7600
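
The columns of the classification report can be recomputed from raw predictions. The sketch below derives per-class precision, recall, and F1 plus the macro average in plain Python; the helper `per_class_metrics` and the toy labels are illustrative and are not taken from the AG News run.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, not taken from the AG News run
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

rows = [per_class_metrics(y_true, y_pred, label) for label in (0, 1)]
for label, (p, r, f1) in zip((0, 1), rows):
    print(f"class {label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

# Macro average: unweighted mean of the per-class scores
macro_f1 = sum(f1 for _, _, f1 in rows) / len(rows)
print(f"macro f1={macro_f1:.4f}")
```

Because every AG News class has the same support (1,900 test examples), the macro and weighted averages in the report above coincide.
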




