## <font color='blue'>**BERT (Bidirectional Encoder Representations from Transformers)**</font>

https://arxiv.org/pdf/1810.04805.pdf

### <font color='red'>**Overview:** </font>
BERT is a pre-trained language model that learns contextual word representations by conditioning on both the left and the right context of every token in a sentence. Unlike RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), which process a sequence one token at a time, BERT uses the Transformer encoder, whose self-attention layers process all tokens of a sequence in parallel.

### <font color='red'>**Pretraining:** </font>
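During pretraining, BERT learns contextual representations from large-scale unlabeled text through two self-supervised tasks: masked language modeling (MLM), in which randomly selected tokens are hidden and predicted from their surrounding context, and next sentence prediction (NSP), in which the model decides whether one sentence actually follows another.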
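
To make the masked language modeling objective concrete, here is a minimal pure-Python sketch of the corruption step described in the BERT paper: about 15% of token positions are selected for prediction, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative assumptions, not code from the BERT repository. As a simplification, each position is selected independently with probability `mask_prob`.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on the selected positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    for i, token in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged

    return corrupted, labels


# Toy example; mask_prob is raised so that the effect is visible on a short sentence.
tokens = ["[CLS]", "this", "is", "the", "last", "session", "[SEP]"]
toy_vocab = ["this", "is", "the", "last", "session", "course", "of"]
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on `[MASK]` always marking the positions it must predict, which matters because `[MASK]` never appears during fine-tuning.
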
### <font color='red'>**Fine-tuning:** </font>
After pretraining, BERT can be fine-tuned on a downstream task by adding a small task-specific output layer and training the whole network on labeled data. In the original paper, this recipe obtained state-of-the-art results on eleven natural language processing tasks.

![Pre-training and Fine-tuning](https://github.com/Rekha215/Pytorch-Basics/blob/main/Screenshot%202024-04-04%20130426.png?raw=true)



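**Token embeddings** encode the meaning of individual WordPiece tokens, **segment embeddings** distinguish between different segments of text (for example, sentence A and sentence B), and **position embeddings** encode the sequential order of tokens within a sequence. BERT sums the three to form the input representation of each token.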

![Embeddings](https://github.com/Rekha215/Pytorch-Basics/blob/main/Screenshot%202024-04-04%20130513.png?raw=true)
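
A minimal sketch of how the three embeddings combine, assuming tiny 4-dimensional lookup tables filled with random stand-in values (the real `bert-base-uncased` tables have 768 columns). The names `make_table` and `embed` are hypothetical helpers. In BERT the sum is followed by layer normalization and dropout, which are omitted here.

```python
import random

HIDDEN = 4  # toy size; bert-base uses 768
rng = random.Random(0)


def make_table(num_rows, dim=HIDDEN):
    """Build a toy embedding table of random values (stand-in for learned weights)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]


token_table = make_table(10)    # one row per vocabulary id
segment_table = make_table(2)   # sentence A = 0, sentence B = 1
position_table = make_table(8)  # one row per position in the sequence


def embed(input_ids, token_type_ids):
    """Sum token, segment, and position embeddings for each position."""
    embedded = []
    for position, (tok_id, seg_id) in enumerate(zip(input_ids, token_type_ids)):
        rows = (token_table[tok_id], segment_table[seg_id], position_table[position])
        embedded.append([sum(values) for values in zip(*rows)])
    return embedded


# Token id 3 appears at positions 1 and 3 but gets different input vectors,
# because the position (and here also the segment) embeddings differ.
vectors = embed(input_ids=[1, 3, 2, 3], token_type_ids=[0, 0, 1, 1])
print(len(vectors), len(vectors[0]))
print(vectors[1] != vectors[3])
```
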

In [1]:
from transformers import BertTokenizer, BertModel
import torch
import pandas as pd

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')



In [2]:
# Example text
text = "this is the last session of AI/ML course."

tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to vocabulary IDs; add_special_tokens wraps the sequence in [CLS] (101) and [SEP] (102)
input_ids = tokenizer.encode(text, add_special_tokens=True)
print(input_ids)

# Convert inputs to PyTorch tensors
input_ids = torch.tensor([input_ids])

['this', 'is', 'the', 'last', 'session', 'of', 'ai', '/', 'ml', 'course', '.']
[101, 2023, 2003, 1996, 2197, 5219, 1997, 9932, 1013, 19875, 2607, 1012, 102]
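
The ID list is two entries longer than the token list because `add_special_tokens=True` wraps the sequence in `[CLS]` (ID 101) and `[SEP]` (ID 102). The sketch below lines the two printed outputs up without calling the tokenizer; both lists are copied from the output above, and the variable names are illustrative.

```python
# Values copied from the printed output above.
printed_tokens = ['this', 'is', 'the', 'last', 'session', 'of', 'ai', '/', 'ml', 'course', '.']
printed_ids = [101, 2023, 2003, 1996, 2197, 5219, 1997, 9932, 1013, 19875, 2607, 1012, 102]

# Special-token IDs in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID = 101, 102
assert printed_ids[0] == CLS_ID and printed_ids[-1] == SEP_ID

# Re-attach the special tokens so that tokens and IDs line up one-to-one.
tokens_with_special = ['[CLS]'] + printed_tokens + ['[SEP]']
id_by_token = list(zip(tokens_with_special, printed_ids))

for token, token_id in id_by_token:
    print(f"{token:>8} -> {token_id}")
```

Note that `AI/ML` was lowercased and split into `ai`, `/`, `ml`: the uncased model normalizes case, and punctuation becomes its own token.
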


In [3]:
# Forward pass, get hidden states
with torch.no_grad():
    outputs = model(input_ids)

print(outputs)

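# outputs[0] is last_hidden_state, with shape (batch_size, seq_len, hidden_size)
hidden_states = outputs[0]
print(hidden_states)

# Drop the batch dimension: one 768-dimensional vector per token
word_representation = hidden_states[0]
print(word_representation)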

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3580, -0.0250,  0.2351,  ..., -0.3656,  0.1550,  0.3390],
         [-0.6572, -0.6982, -0.2003,  ..., -0.1571,  0.5329, -0.2065],
         [-0.6760, -0.6005,  0.4523,  ...,  0.0308,  0.1842,  0.4794],
         ...,
         [ 0.7289, -0.2130,  0.3197,  ...,  0.2856, -0.5340, -0.0213],
         [ 0.8087,  0.1905, -0.3884,  ...,  0.0236, -0.6671, -0.3784],
         [-0.5319,  0.1416,  0.2567,  ...,  0.0173, -0.3320, -0.0277]]]), pooler_output=tensor([[-0.9501, -0.5846, -0.9602,  0.8968,  0.6309, -0.2747,  0.9399,  0.5013,
         -0.7919, -1.0000, -0.3180,  0.9130,  0.9840,  0.6651,  0.9250, -0.7721,
         -0.3486, -0.6704,  0.4965, -0.7313,  0.7469,  1.0000,  0.1154,  0.5230,
          0.6156,  0.9812, -0.7208,  0.9328,  0.9745,  0.7941, -0.8171,  0.3915,
         -0.9906, -0.4003, -0.9721, -0.9973,  0.5069, -0.8072, -0.2521, -0.2444,
         -0.9312,  0.4838,  1.0000, -0.0173,  0.4456, -0.4658, -1.0000,  0.
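
The output above has two fields. `last_hidden_state` holds one 768-dimensional vector per input token (shape 1 × 13 × 768 for this sentence), while `pooler_output` is a single 768-dimensional vector per sequence, derived from the final hidden state of `[CLS]`. In the Hugging Face implementation the pooler is a dense layer followed by tanh, which is why its values lie in [-1, 1]. The symbols $W$ and $b$ below are notation introduced here for that layer's weight and bias:

```latex
\text{pooler\_output} = \tanh\!\left(W\, h_{\texttt{[CLS]}} + b\right),
\qquad h_{\texttt{[CLS]}} = \text{last\_hidden\_state}[:,\,0,\,:]
```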

In [4]:
# Convert tensor to a numpy array
word_representation_numpy = word_representation.numpy()

# Extracting tokens from tokenizer
tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().tolist())

# Convert numpy array to a pandas DataFrame
df = pd.DataFrame(word_representation_numpy, index=tokens)

# Set column names as embedding_dim_0, embedding_dim_1, etc.
df.columns = [f'embedding_dim_{i}' for i in range(word_representation_numpy.shape[1])]

# Display the DataFrame
df

token,embedding_dim_0,embedding_dim_1,embedding_dim_2,embedding_dim_3,embedding_dim_4,embedding_dim_5,embedding_dim_6,embedding_dim_7,embedding_dim_8,embedding_dim_9,...,embedding_dim_758,embedding_dim_759,embedding_dim_760,embedding_dim_761,embedding_dim_762,embedding_dim_763,embedding_dim_764,embedding_dim_765,embedding_dim_766,embedding_dim_767
[CLS],-0.357968,-0.025011,0.235098,-0.09152,-0.407983,-0.491192,0.7488,0.591719,-0.042049,0.021393,...,0.081465,0.014005,0.517787,0.069586,-0.094016,0.2946,0.140071,-0.365574,0.154976,0.33895
this,-0.657183,-0.698246,-0.200257,-0.293208,0.373259,-0.378906,0.089233,1.332639,-0.183366,0.320185,...,0.867239,-0.504051,1.009788,-0.221148,-0.401614,0.639541,0.258121,-0.157102,0.532929,-0.206522
is,-0.675955,-0.60048,0.452305,-0.117508,0.090807,-0.694865,0.470758,1.032447,-0.277027,0.106946,...,0.457346,-0.206722,0.782795,-0.300906,-0.204903,0.216077,0.605646,0.030832,0.184157,0.479357
the,-0.744336,-0.607774,0.410556,-0.093455,0.3873,-0.671389,0.27639,1.018139,-0.149591,0.286997,...,0.548211,0.082561,0.967757,-0.464343,-0.525179,0.321742,0.403398,0.163151,0.400861,-0.133964
last,-0.169711,-0.566675,1.139931,0.263556,-0.337775,-0.194454,0.809776,0.690075,0.300043,0.443654,...,0.269258,-0.563727,0.457213,-0.641326,-0.357631,0.054144,0.553309,0.103217,-0.06612,-0.27842
session,0.158174,-0.031431,0.032985,-0.081283,-0.187248,-0.753298,1.084636,0.389594,-0.335923,0.705883,...,-0.059288,0.145946,-0.03188,0.024708,-0.470789,0.59627,0.385298,0.121989,-0.494081,-0.526524
of,-0.458881,0.309886,0.091665,-0.656385,0.512449,-0.39526,1.132027,0.579112,-0.329545,0.211426,...,0.666989,0.029743,0.53182,-0.078343,-0.133534,-0.032729,0.123146,-0.481668,0.276603,-0.089457
ai,-0.453237,0.138157,0.362522,-0.473861,0.786913,0.193629,0.357376,0.115303,-0.1323,-0.510591,...,-0.058973,0.295782,0.546885,-0.263775,-0.195554,1.23878,0.690225,-0.732256,-0.188194,-0.296405
/,-0.236299,-0.330371,0.267579,-0.167829,0.692405,-0.318068,0.500881,0.70406,-0.013894,0.594625,...,0.216771,0.131904,0.274017,-0.100709,0.000178,0.901705,0.078792,-0.434113,0.109634,-0.075869
ml,-0.044216,-0.840455,0.423911,-0.240436,-0.101604,0.046052,0.675716,-0.156269,0.13949,-0.075717,...,0.376299,0.39356,0.469674,-0.182568,-0.396846,1.205963,0.522218,-0.884017,-0.515486,0.430348


## <font color='green'>**Tutorial components** </font>

**1. Predictions using existing fine-tuned model**

**2. Fine-tuning of bert-base-uncased using AG_NEWS dataset**

  2.1 Load the dataset from datasets

  2.2 Tokenizing samples

  2.3 Loading and instantiating the BERT model (bert-base-uncased or bert-large-uncased)

  2.4 Creating data loaders

  2.5 Training

  2.6 Evaluation

In [5]:
!pip -q install datasets

In [6]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import tqdm
from sklearn.metrics import classification_report

## **1. Predictions using existing Fine-Tuned model on AG_NEWS dataset**

https://huggingface.co/fabriceyhc/bert-base-uncased-ag_news


In [7]:
model_name = "fabriceyhc/bert-base-uncased-ag_news"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

In [8]:
# List all the model parameters
model_params = list(model.named_parameters())
print("Model Parameters:")
for param_name, param in model_params:
    print(f"{param_name}: shape={param.shape}, requires_grad={param.requires_grad}")

Model Parameters:
bert.embeddings.word_embeddings.weight: shape=torch.Size([30522, 768]), requires_grad=True
bert.embeddings.position_embeddings.weight: shape=torch.Size([512, 768]), requires_grad=True
bert.embeddings.token_type_embeddings.weight: shape=torch.Size([2, 768]), requires_grad=True
bert.embeddings.LayerNorm.weight: shape=torch.Size([768]), requires_grad=True
bert.embeddings.LayerNorm.bias: shape=torch.Size([768]), requires_grad=True
bert.encoder.layer.0.attention.self.query.weight: shape=torch.Size([768, 768]), requires_grad=True
bert.encoder.layer.0.attention.self.query.bias: shape=torch.Size([768]), requires_grad=True
bert.encoder.layer.0.attention.self.key.weight: shape=torch.Size([768, 768]), requires_grad=True
bert.encoder.layer.0.attention.self.key.bias: shape=torch.Size([768]), requires_grad=True
bert.encoder.layer.0.attention.self.value.weight: shape=torch.Size([768, 768]), requires_grad=True
bert.encoder.layer.0.attention.self.value.bias: shape=torch.Size([768]), r

In [9]:
# Example news articles
news_articles = [
    "President Trump visits Japan for a summit meeting.",
    "The new iPhone is expected to be released next month.",
    "Scientists discover a new species of dinosaur in Argentina.",
    "Stock market experiences a major crash, causing panic among investors."
]

In [10]:
# Show the WordPiece tokens for each example news article
for article in news_articles:
    print(tokenizer.tokenize(article))

['president', 'trump', 'visits', 'japan', 'for', 'a', 'summit', 'meeting', '.']
['the', 'new', 'iphone', 'is', 'expected', 'to', 'be', 'released', 'next', 'month', '.']
['scientists', 'discover', 'a', 'new', 'species', 'of', 'dinosaur', 'in', 'argentina', '.']
['stock', 'market', 'experiences', 'a', 'major', 'crash', ',', 'causing', 'panic', 'among', 'investors', '.']


In [11]:
# Tokenize the news articles
encoded_articles = tokenizer(news_articles, truncation=True, padding=True, return_tensors='pt')

print("Encoded Articles \n", encoded_articles)

Encoded Articles 
 {'input_ids': tensor([[  101,  2343,  8398,  7879,  2900,  2005,  1037,  6465,  3116,  1012,
           102,     0,     0,     0],
        [  101,  1996,  2047, 18059,  2003,  3517,  2000,  2022,  2207,  2279,
          3204,  1012,   102,     0],
        [  101,  6529,  7523,  1037,  2047,  2427,  1997, 15799,  1999,  5619,
          1012,   102,     0,     0],
        [  101,  4518,  3006,  6322,  1037,  2350,  5823,  1010,  4786,  6634,
          2426,  9387,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
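
With `padding=True`, every article is right-padded with ID 0 (`[PAD]`) to the length of the longest article in the batch, and `attention_mask` marks which positions are real tokens. `token_type_ids` is all zeros because each input is a single segment. The following is a minimal sketch of that padding logic; `pad_batch` is a hypothetical helper, not part of the tokenizer API.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * num_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * num_pad)
    return input_ids, attention_mask


# Unpadded IDs of the first and last article, copied from the output above.
batch = [
    [101, 2343, 8398, 7879, 2900, 2005, 1037, 6465, 3116, 1012, 102],
    [101, 4518, 3006, 6322, 1037, 2350, 5823, 1010, 4786, 6634, 2426, 9387, 1012, 102],
]
padded_ids, mask = pad_batch(batch)
print(padded_ids[0])
print(mask[0])
```
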


In [12]:
# Get model predictions
with torch.no_grad():
    outputs = model(**encoded_articles)

print("Outputs \n", outputs)

Outputs 
 SequenceClassifierOutput(loss=None, logits=tensor([[ 5.7878, -2.2241, -1.8262, -2.2474],
        [-0.7577, -4.3739, -0.8240,  4.6821],
        [ 0.3683, -4.0031, -1.6276,  4.1006],
        [ 0.8006, -4.8784,  3.0927, -1.0170]]), hidden_states=None, attentions=None)


In [13]:
# Get the predicted class indices
predicted_indices = torch.argmax(outputs.logits, dim=1)

In [14]:
print(predicted_indices)

tensor([0, 3, 3, 2])
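
The indices are easier to read as class names with probabilities. The sketch below applies a softmax to the logits printed above and maps each argmax to a label. It assumes the checkpoint uses the label order of the ag_news dataset (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech); check `model.config.id2label` before relying on it.

```python
import math

# Assumed label order, taken from the ag_news dataset's label definition.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

# Logits copied from the model output above (one row per article).
logits = [
    [5.7878, -2.2241, -1.8262, -2.2474],
    [-0.7577, -4.3739, -0.8240, 4.6821],
    [0.3683, -4.0031, -1.6276, 4.1006],
    [0.8006, -4.8784, 3.0927, -1.0170],
]


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    predictions.append((LABEL_NAMES[best], probs[best]))

for name, prob in predictions:
    print(f"{name}: {prob:.3f}")
```

Under that assumption the four articles are labeled World, Sci/Tech, Sci/Tech, and Business, which matches `tensor([0, 3, 3, 2])`.
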


## **2. BERT-base-uncased Fine-Tuning using ag_news dataset**

Dataset Link - https://huggingface.co/datasets/ag_news

BERT paper link - https://arxiv.org/pdf/1810.04805.pdf


### **2.1 Load the dataset from datasets**

In [15]:
# Load AG News dataset
dataset = load_dataset("ag_news")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


In [16]:
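# Extract train, validation, and test data.
# AG News ships only "train" and "test" splits, so the test split is reused
# here as the validation set; validation and test metrics will be identical.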
train_texts = dataset["train"]["text"]
train_labels = dataset["train"]["label"]
val_texts = dataset["test"]["text"]
val_labels = dataset["test"]["label"]
test_texts = dataset["test"]["text"]
test_labels = dataset["test"]["label"]
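
The cell above assigns the test split to both the `val_*` and `test_*` variables, so the per-epoch validation reports are computed on the same examples as the final test report. One common alternative is to hold out part of the training data for validation. The sketch below shows the idea on toy data; `split_train_validation` is a hypothetical helper, and the 10% default and the seed are arbitrary choices. The `datasets` library also provides `Dataset.train_test_split` for the same purpose.

```python
import random


def split_train_validation(texts, labels, val_fraction=0.1, seed=42):
    """Shuffle indices once and hold out a fraction of them for validation."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    num_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:num_val], indices[num_val:]

    def pick(idx):
        # Keep texts and labels aligned by selecting both with the same indices.
        return [texts[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx)


# Toy stand-in for train_texts / train_labels.
toy_texts = [f"article {i}" for i in range(20)]
toy_labels = [i % 4 for i in range(20)]

(tr_texts, tr_labels), (va_texts, va_labels) = split_train_validation(
    toy_texts, toy_labels, val_fraction=0.2
)
print(len(tr_texts), len(va_texts))
```

With a split like this, the test set is touched only once, after training has finished.
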

### **2.2 Tokenizing samples**

In [17]:
# Define a simple dataset class for AG News
class AGNewsDataset(Dataset):
    def __init__(self, tokenizer, texts, labels):
        self.tokenizer = tokenizer
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

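        # Call the tokenizer directly; encode_plus is deprecated in favor of __call__
        encoded_inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            padding='max_length',  # pad every sample to max_length
            max_length=128,
            truncation=True,       # cut samples longer than max_length
            return_tensors='pt'
        )

        # return_tensors='pt' adds a batch dimension of size 1; drop it
        input_ids = encoded_inputs['input_ids'].squeeze(0)
        attention_mask = encoded_inputs['attention_mask'].squeeze(0)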

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'label': label
        }


### **2.3 Loading and instantiating the BERT model (bert-base-uncased or bert-large-uncased)**

In [18]:
# Initialize tokenizer and model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **2.4 Creating data loaders**

In [19]:
# Create train dataset and dataloader
train_dataset = AGNewsDataset(tokenizer, train_texts, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Create validation dataset and dataloader
val_dataset = AGNewsDataset(tokenizer, val_texts, val_labels)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)

# Create test dataset and dataloader
test_dataset = AGNewsDataset(tokenizer, test_texts, test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
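
The batch sizes chosen above determine how many steps each pass over the data takes. A quick arithmetic check, using the split sizes printed in section 2.1 (`num_batches` is a hypothetical helper that mirrors how a `DataLoader` counts batches):

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


train_steps = num_batches(120_000, 64)  # training batches per epoch
val_steps = num_batches(7_600, 64)      # validation batches per epoch
test_steps = num_batches(7_600, 1)      # test batches (one example each)
print(train_steps, val_steps, test_steps)
```

The last validation batch is smaller than 64 because 7,600 is not a multiple of 64. A test batch size of 1 is the slowest option; a larger value would give the same predictions in fewer steps.
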

### **2.5 Training**

In [20]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1, eps=1e-8)
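
For reference, `nn.CrossEntropyLoss` with its default `reduction='mean'` applies a log-softmax to the logits and averages the negative log-probability of the correct class over the batch. In the notation below (introduced here, not taken from the notebook), $N$ is the batch size, $z_{i,c}$ is the logit of example $i$ for class $c$, and $y_i \in \{0, 1, 2, 3\}$ is its true label:

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log
  \frac{\exp\!\left(z_{i,\,y_i}\right)}{\sum_{c=0}^{3} \exp\!\left(z_{i,\,c}\right)}
```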

In [21]:
# Training loop
epochs = 3

for epoch in range(epochs):
    model.train()
    train_predicted_labels = []
    train_true_labels = []

    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{epochs} - Training")
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()

        train_predicted_label = torch.argmax(outputs.logits, dim=1)
        train_predicted_labels.extend(train_predicted_label.cpu().numpy())
        train_true_labels.extend(labels.cpu().numpy())

        progress_bar.set_postfix({'loss': loss.item()})

    train_classification_rep = classification_report(train_true_labels, train_predicted_labels, digits=4)
    print(f"\nEpoch {epoch + 1}/{epochs} - Training Classification Report:\n{train_classification_rep}")

    # Validation loop
    model.eval()
    val_predicted_labels = []
    val_true_labels = []

    progress_bar = tqdm(val_dataloader, desc=f"Epoch {epoch + 1}/{epochs} - Validation")
    with torch.no_grad():
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            val_predicted_label = torch.argmax(outputs.logits, dim=1)
            val_predicted_labels.extend(val_predicted_label.cpu().numpy())
            val_true_labels.extend(labels.cpu().numpy())

    val_classification_rep = classification_report(val_true_labels, val_predicted_labels, digits=4)
    print(f"\nEpoch {epoch + 1}/{epochs} - Validation Classification Report:\n{val_classification_rep}")

Epoch 1/3 - Training: 100%|██████████| 1875/1875 [11:28<00:00,  2.72it/s, loss=0.297]



Epoch 1/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9401    0.9072    0.9234     30000
           1     0.9614    0.9822    0.9717     30000
           2     0.8918    0.8722    0.8819     30000
           3     0.8757    0.9067    0.8909     30000

    accuracy                         0.9171    120000
   macro avg     0.9173    0.9171    0.9170    120000
weighted avg     0.9173    0.9171    0.9170    120000



Epoch 1/3 - Validation: 100%|██████████| 119/119 [00:20<00:00,  5.95it/s]



Epoch 1/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9574    0.9474    0.9524      1900
           1     0.9832    0.9868    0.9850      1900
           2     0.8884    0.9221    0.9050      1900
           3     0.9223    0.8937    0.9078      1900

    accuracy                         0.9375      7600
   macro avg     0.9379    0.9375    0.9375      7600
weighted avg     0.9379    0.9375    0.9375      7600



Epoch 2/3 - Training: 100%|██████████| 1875/1875 [11:27<00:00,  2.73it/s, loss=0.0762]



Epoch 2/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9665    0.9493    0.9578     30000
           1     0.9877    0.9920    0.9899     30000
           2     0.9307    0.9152    0.9229     30000
           3     0.9143    0.9419    0.9279     30000

    accuracy                         0.9496    120000
   macro avg     0.9498    0.9496    0.9496    120000
weighted avg     0.9498    0.9496    0.9496    120000



Epoch 2/3 - Validation: 100%|██████████| 119/119 [00:19<00:00,  5.97it/s]



Epoch 2/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9674    0.9379    0.9524      1900
           1     0.9900    0.9868    0.9884      1900
           2     0.9416    0.8832    0.9115      1900
           3     0.8766    0.9605    0.9166      1900

    accuracy                         0.9421      7600
   macro avg     0.9439    0.9421    0.9422      7600
weighted avg     0.9439    0.9421    0.9422      7600



Epoch 3/3 - Training: 100%|██████████| 1875/1875 [11:28<00:00,  2.72it/s, loss=0.0534]



Epoch 3/3 - Training Classification Report:
              precision    recall  f1-score   support

           0     0.9783    0.9620    0.9701     30000
           1     0.9923    0.9954    0.9938     30000
           2     0.9475    0.9337    0.9406     30000
           3     0.9316    0.9577    0.9445     30000

    accuracy                         0.9622    120000
   macro avg     0.9624    0.9622    0.9622    120000
weighted avg     0.9624    0.9622    0.9622    120000



Epoch 3/3 - Validation: 100%|██████████| 119/119 [00:19<00:00,  5.97it/s]


Epoch 3/3 - Validation Classification Report:
              precision    recall  f1-score   support

           0     0.9523    0.9563    0.9543      1900
           1     0.9899    0.9842    0.9871      1900
           2     0.9295    0.8947    0.9118      1900
           3     0.9037    0.9389    0.9210      1900

    accuracy                         0.9436      7600
   macro avg     0.9439    0.9436    0.9435      7600
weighted avg     0.9439    0.9436    0.9435      7600






### **2.6 Evaluation**

In [22]:
# Test set evaluation
test_predicted_labels = []
test_true_labels = []

model.eval()
with torch.no_grad():
    for batch in tqdm(test_dataloader, desc="Test"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        test_predicted_label = torch.argmax(outputs.logits, dim=1)

        test_predicted_labels.extend(test_predicted_label.cpu().numpy())
        test_true_labels.extend(labels.cpu().numpy())

test_classification_rep = classification_report(test_true_labels, test_predicted_labels, digits=4)
print(f"\nTest Set Classification Report:\n{test_classification_rep}")

Test: 100%|██████████| 7600/7600 [01:26<00:00, 87.99it/s]


Test Set Classification Report:
              precision    recall  f1-score   support

           0     0.9523    0.9563    0.9543      1900
           1     0.9899    0.9842    0.9871      1900
           2     0.9295    0.8947    0.9118      1900
           3     0.9037    0.9389    0.9210      1900

    accuracy                         0.9436      7600
   macro avg     0.9439    0.9436    0.9435      7600
weighted avg     0.9439    0.9436    0.9435      7600
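
To make the columns of the report concrete, here is a minimal pure-Python sketch of how per-class precision, recall, and F1 are defined. It is not scikit-learn's implementation; `per_class_metrics` is a hypothetical helper and the labels are toy values, not taken from the run above.

```python
def per_class_metrics(true_labels, predicted_labels, num_classes):
    """Compute precision, recall, F1, and support for each class."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for cls in range(num_classes):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly predicted as cls
        fp = sum(1 for t, p in pairs if t != cls and p == cls)  # wrongly predicted as cls
        fn = sum(1 for t, p in pairs if t == cls and p != cls)  # cls examples that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": tp + fn}
    return metrics


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]
report = per_class_metrics(y_true, y_pred, num_classes=4)
for cls, m in report.items():
    print(cls, m)
```

The same formula reproduces the report above: for class 2, 2 · 0.9295 · 0.8947 / (0.9295 + 0.8947) ≈ 0.9118.
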




