We fine-tune a pre-trained BERT model for sentiment analysis on the IMDb dataset.
The script is written to run on any local machine.

This notebook records the query (Q), key (K), and value (V) projections of BERT's self-attention during the training loop of a BERT-based sentiment classifier.
These projections produce the vectors that the attention mechanism uses to compute attention scores.

Query vectors: In the attention mechanism, a token's query vector is compared against the key vectors of every position to determine how much attention that token pays to each position in the input sequence. In BERT, each self-attention layer has its own learned query projection, so the query vectors differ from layer to layer.

Key vectors: A position's key vector is what the queries are matched against. The scaled dot product of a query and a key gives the attention score for that pair of positions. Like the queries, the keys in each self-attention layer of BERT come from that layer's own learned projection.

Value vectors: The value vector carries the information at each position in the input sequence. The attention output for a token is the attention-weighted sum of the value vectors. Each self-attention layer in BERT produces its own value vectors through a learned projection.
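
The roles of the three vectors can be sketched for a single attention head as follows. This is a minimal illustration under stated assumptions: the sizes are toy values rather than BERT's, and `W_q`, `W_k`, and `W_v` are hypothetical random matrices standing in for the projections that BERT learns in each layer.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model = 4, 8             # illustrative sizes, not BERT's
x = torch.randn(seq_len, d_model)   # one row per input position

# Hypothetical projection matrices; in BERT these are learned per layer
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

# Project the same input into query, key, and value vectors
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: scaled dot product of every query with every key
scores = Q @ K.T / math.sqrt(d_model)

# Each row becomes a probability distribution over positions
weights = torch.softmax(scores, dim=-1)

# Each output row is an attention-weighted sum of the value vectors
output = weights @ V

print(weights.shape, output.shape)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the rows of `V`.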

In the code, a **hook function** (hook_fn) is registered on the **self-attention module of the first layer** of the BERT model (encoder.layer[0].attention.self). This module covers all 12 attention heads of that layer, not a single head. The hook is called on every forward pass. As written, it reads the weight matrices of the query, key, and value projections (each 768 × 768 for bert-base-uncased) rather than the per-token Q, K, and V activations. The matrices are appended to the corresponding lists (Q_vectors, K_vectors, and V_vectors), so each training iteration adds one entry to each list.
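
To record the per-token query, key, and value activations, and to slice out a single head, one option is to hook the projection layers themselves. The sketch below is hedged accordingly: `TinySelfAttention` and `capture_projections` are hypothetical names, and the stand-in module only mimics the `query`, `key`, and `value` linear layers that BERT's self-attention module exposes, so the example runs without downloading a model.

```python
import torch
from torch import nn


class TinySelfAttention(nn.Module):
    """Hypothetical stand-in exposing query/key/value linear layers."""

    def __init__(self, hidden_size=8, num_heads=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        b, s, _ = hidden_states.shape

        def split(t):
            # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
            return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        q = split(self.query(hidden_states))
        k = split(self.key(hidden_states))
        v = split(self.value(hidden_states))
        scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5
        context = torch.softmax(scores, dim=-1) @ v
        return context.transpose(1, 2).reshape(b, s, -1)


def capture_projections(attn_module):
    """Attach hooks that record the outputs of the three projection layers."""
    captured = {'query': [], 'key': [], 'value': []}
    handles = []
    for name in captured:
        def hook(module, inputs, output, name=name):
            captured[name].append(output.detach().cpu())
        handles.append(getattr(attn_module, name).register_forward_hook(hook))
    return captured, handles


torch.manual_seed(0)
attn = TinySelfAttention(hidden_size=8, num_heads=2)
captured, handles = capture_projections(attn)

hidden_states = torch.randn(2, 5, 8)  # (batch, seq_len, hidden_size)
attn(hidden_states)

for handle in handles:
    handle.remove()  # detach the hooks once the data has been collected

q_all_heads = captured['query'][0]                       # all heads together
q_first_head = q_all_heads.view(2, 5, 2, 4)[:, :, 0, :]  # head 0 only
print(q_all_heads.shape, q_first_head.shape)
```

With the notebook's model, the same helper could presumably be pointed at classifier.bert.encoder.layer[0].attention.self, with the reshape using 12 heads of size 64 for bert-base-uncased.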

In [25]:
import torch
from transformers import BertTokenizer, BertModel
from torch.utils.data import DataLoader, Dataset
from torch import nn, optim

In [26]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=64):  # Adjust max_length as needed
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize the text using the provided tokenizer
        tokenized_text = self.tokenizer(
            text,
            padding='max_length',  # Pad to the specified max_length
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt'
        )

        # Extract relevant tensors
        input_ids = tokenized_text['input_ids'].squeeze(0)  # Remove only the batch dimension
        attention_mask = tokenized_text['attention_mask'].squeeze(0)

        return {'input_ids': input_ids, 'attention_mask': attention_mask, 'label': label}

In [27]:
# Two toy sentences standing in for the IMDb sentiment dataset
texts = ["This movie is great!", "I didn't like the ending."]
labels = [1, 0]  # 1 for positive, 0 for negative

# Tokenize and prepare the dataset
dataset = SentimentDataset(texts, labels, tokenizer)

In [28]:
# Fine-tuning the BERT model
class SentimentClassifier(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert = bert_model
        self.fc = nn.Linear(768, 2)  # 768 is the size of BERT's hidden layers

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        pooled_output = last_hidden_state[:, 0, :]  # Use the [CLS] token representation
        logits = self.fc(pooled_output)
        return logits

In [29]:
# Initialize the sentiment classifier
classifier = SentimentClassifier(bert_model)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(classifier.parameters(), lr=2e-5)

In [30]:
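# Store the query, key, and value projection weights seen at each forward pass
Q_vectors = []
K_vectors = []
V_vectors = []

def hook_fn(module, inputs, output):
    # module.<name>.weight is the (768, 768) projection matrix.
    # copy() takes a snapshot; without it every stored array would share
    # memory with the live parameter and change as the optimizer updates it.
    Q_vectors.append(module.query.weight.detach().cpu().numpy().copy())
    K_vectors.append(module.key.weight.detach().cpu().numpy().copy())
    V_vectors.append(module.value.weight.detach().cpu().numpy().copy())

# Register the hook on the self-attention module of the first layer (all heads)
classifier.bert.encoder.layer[0].attention.self.register_forward_hook(hook_fn)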

# Training loop
num_epochs = 10
device = torch.device('cpu')  # The script runs on CPU
loader = DataLoader(dataset, batch_size=2, shuffle=True)

classifier.to(device)
classifier.train()  # from_pretrained() returns BERT in eval mode; re-enable dropout

for epoch in range(num_epochs):
    for batch in loader:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        logits = classifier(input_ids, attention_mask)
        loss = criterion(logits, labels)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch + 1}/{num_epochs} completed')

Epoch 1/10 completed
Epoch 2/10 completed
Epoch 3/10 completed
Epoch 4/10 completed
Epoch 5/10 completed
Epoch 6/10 completed
Epoch 7/10 completed
Epoch 8/10 completed
Epoch 9/10 completed
Epoch 10/10 completed


In [31]:
len(Q_vectors)

10

In [34]:
len(V_vectors[0])

768
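
The value 768 is the first dimension of the stored array. A quick check of where it comes from, assuming the stored entry is the weight of a 768-to-768 linear layer as in bert-base-uncased (`value_proj` is a hypothetical stand-in, not the notebook's model):

```python
import torch
from torch import nn

# Stand-in for one projection layer of bert-base-uncased (hidden size 768)
value_proj = nn.Linear(768, 768)

# Same extraction pattern as in the hook, applied to the stand-in layer
stored = value_proj.weight.detach().cpu().numpy().copy()

print(stored.shape)  # (768, 768)
print(len(stored))   # 768, the number of rows
```

So `len` counts the rows of a weight matrix here. It does not depend on the batch size or on the sequence length.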