# Masked Language Modeling (MLM)
Masked Language Modeling (MLM) is a self-supervised learning task used during the pre-training of language models like BERT. In MLM, a fraction of the tokens in the input text (15% in the original BERT setup) is masked, typically by replacing each one with a special [MASK] token, and the model is trained to predict the original tokens from the surrounding context.
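
The objective can be written compactly. The notation here is assumed for illustration and does not come from the notebook: let $x = (x_1, \dots, x_n)$ be the original token sequence, $M$ the set of masked positions, $\tilde{x}$ the corrupted input, and $\theta$ the model parameters. A common formulation minimizes the negative log-likelihood of the original tokens at the masked positions only:

$$
\mathcal{L}_{\text{MLM}}(\theta) = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
$$

Positions outside $M$ contribute nothing to the loss, which is why labels for unmasked positions are usually set to an ignore value in practice.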

In [3]:
! pip install transformers





In [4]:
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example input sentence (unmasked, so the original token is available as a label)
sentence = "I love playing soccer."

# Tokenize the input sentence
# Note: tokenize() does not add the [CLS] and [SEP] special tokens
tokens = tokenizer.tokenize(sentence)

# Keep the original IDs before masking; these supply the labels
original_ids = tokenizer.convert_tokens_to_ids(tokens)

# Mask one token (fixed position here; real pre-training samples positions at random)
masked_index = 3
tokens[masked_index] = tokenizer.mask_token

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Convert input IDs to tensors
input_ids_tensor = torch.tensor([input_ids])

# Create a mask tensor indicating the masked positions
mask_tensor = (input_ids_tensor == tokenizer.mask_token_id)

# Prepare the labels tensor: original ID at masked positions,
# -100 (ignored by PyTorch's cross-entropy loss) everywhere else
labels_tensor = torch.tensor([original_ids])
labels_tensor[~mask_tensor] = -100

# Print the input IDs, the mask, and the token the model must recover
print("Input IDs:", input_ids_tensor)
print("Mask:", mask_tensor)
print("Masked token:", tokenizer.convert_ids_to_tokens(labels_tensor[mask_tensor].tolist()))

Input IDs: tensor([[1045, 2293, 2652,  103, 1012]])
Mask: tensor([[False, False, False,  True, False]])
Masked token: ['soccer']


In the above example, we start with an input sentence and tokenize it using a BERT tokenizer. We select one token to mask and replace it with the special [MASK] token; the position is fixed here for simplicity, whereas real pre-training samples the masked positions at random. We then convert the tokens to input IDs and create a mask tensor that marks the masked positions. Finally, we prepare the labels tensor, which holds the original values of the masked tokens. Together, the input IDs, mask, and labels tensors form one training example for the MLM task during the pre-training of a language model like BERT.
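
To make the random selection concrete, here is a minimal, framework-free sketch of the corruption scheme described in the BERT paper: each selected position is replaced with [MASK] 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens`, the `IGNORE_LABEL` marker, and the toy vocabulary are assumptions made for illustration; they are not part of the notebook or of any library.

```python
import random

MASK_TOKEN = "[MASK]"
IGNORE_LABEL = None  # hypothetical marker meaning "no loss at this position"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            # Position selected for prediction: remember the original token
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            # Position not selected: input unchanged, no label
            corrupted.append(token)
            labels.append(IGNORE_LABEL)
    return corrupted, labels


tokens = ["i", "love", "playing", "soccer", "."]
vocab = ["i", "love", "playing", "soccer", ".", "music", "chess"]  # toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

With a five-token sentence and a 15% rate, many draws mask nothing at all, so short examples are easier to inspect with a larger `mask_prob`.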



# How to use it in training 

In [10]:
import torch
import torch.nn as nn
from transformers import BertForMaskedLM, BertTokenizer

# Load pre-trained BERT model and tokenizer
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.train()  # from_pretrained returns the model in eval mode

# Create the optimizer before the training step (a small learning rate is typical when updating BERT)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Example training data
input_sequence = "I love [MASK] soccer."
target_sequence = "playing"

# Encode the input; the tokenizer adds [CLS] and [SEP] and returns tensors
inputs = tokenizer(input_sequence, return_tensors='pt')
input_ids_tensor = inputs['input_ids']

# Convert the target word to its ID (assumes it is a single WordPiece token)
target_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(target_sequence))[0]

# Build labels with the same shape as the input:
# the target ID at the [MASK] position, -100 everywhere else
labels_tensor = torch.full_like(input_ids_tensor, -100)
labels_tensor[input_ids_tensor == tokenizer.mask_token_id] = target_id

# Forward pass
outputs = model(**inputs)
logits = outputs.logits

# Compute the loss (CrossEntropyLoss ignores positions labeled -100 by default)
loss_function = nn.CrossEntropyLoss()
loss = loss_function(logits.view(-1, model.config.vocab_size), labels_tensor.view(-1))

# Backward pass and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
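
The loss in this step should only count masked positions. As a rough illustration of what that means numerically, the sketch below computes a masked cross-entropy by hand in plain Python. The function `masked_cross_entropy` and the toy logits are hypothetical and written for this example; the only borrowed detail is the sentinel value -100, which PyTorch's `CrossEntropyLoss` ignores by default.

```python
import math

IGNORE_INDEX = -100  # same sentinel that PyTorch's CrossEntropyLoss ignores by default


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target indices, or ignore_index for unmasked positions
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # unmasked position: contributes nothing to the loss
        # Numerically stable log-sum-exp, then negative log-softmax at the label
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    if not losses:
        raise ValueError("no masked positions to score")
    return sum(losses) / len(losses)


# Toy example: 3 positions, vocabulary of 4 entries, only position 1 is masked
logits = [
    [2.0, 0.1, 0.3, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 0.2, 0.1],
]
labels = [IGNORE_INDEX, 2, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(loss)  # uniform logits over 4 entries give log(4)
```

Changing the logits at positions 0 or 2 leaves the result untouched, which is the behavior the -100 labels are meant to produce.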


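# Next word prediction
Recall that next word prediction trains a model to predict the word that follows a given context, using only the words to its left.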

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the training data: the context words and the word that follows them
input_sentence = "I love playing soccer."
context_words = "I love"
target_word = "playing"

# Define vocabulary (sorted so the word-to-index mapping is deterministic)
vocab = sorted(set(input_sentence.split()))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Convert the context and the target word to tensors
input_tensor = torch.tensor([word_to_idx[word] for word in context_words.split()], dtype=torch.long)
target_tensor = torch.tensor(word_to_idx[target_word], dtype=torch.long)

# Define the next word prediction model
class NextWordPredictionModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x holds word indices with shape (seq_len,) or (batch, seq_len)
        embedded = self.embedding(x)
        # Average the context embeddings into one vector per sequence
        pooled = embedded.mean(dim=-2)
        out = torch.relu(self.fc(pooled))
        # One score per vocabulary word
        out = self.out(out)
        return out

# Define model hyperparameters
vocab_size = len(vocab)
embedding_dim = 10
hidden_dim = 20

# Create an instance of the next word prediction model
model = NextWordPredictionModel(vocab_size, embedding_dim, hidden_dim)

# Define the loss function
loss_function = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(100):
    optimizer.zero_grad()

    # Forward pass: scores over the vocabulary for the next word
    output = model(input_tensor)

    # Calculate the loss (add a batch dimension to scores and target)
    loss = loss_function(output.unsqueeze(0), target_tensor.unsqueeze(0))

    # Backward pass
    loss.backward()
    optimizer.step()

    # Print the loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item()}")

# Example usage
test_input = torch.tensor([word_to_idx[word] for word in "I love".split()], dtype=torch.long)
test_output = model(test_input.unsqueeze(0))
predicted_index = torch.argmax(test_output).item()
predicted_word = vocab[predicted_index]
print(f"Predicted Word: {predicted_word}")
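
The cell above trains on a single (context, next word) pair. As a sketch of how more pairs could be derived from the same sentence, the hypothetical helper below slides a fixed-size window over the words; the name `make_next_word_pairs` and the `context_size` parameter are assumptions for illustration, not part of the notebook.

```python
def make_next_word_pairs(sentence, context_size=2):
    """Pair each window of `context_size` words with the word that follows it."""
    words = sentence.split()
    pairs = []
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]
        target = words[i + context_size]
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs("I love playing soccer.", context_size=2)
for context, target in pairs:
    print(context, "->", target)
# ['I', 'love'] -> playing
# ['love', 'playing'] -> soccer.
```

Each pair could be converted to index tensors with `word_to_idx` and fed through the same training loop, one pair per step.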


# BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model introduced by Google in 2018. It is pre-trained on a large corpus of text and achieved state-of-the-art results on a range of NLP tasks when it was released. BERT has two main features that set it apart:

Bidirectional Context: BERT's self-attention lets every token attend to all other tokens in the input at once, so each representation draws on context from both the left and the right. This is different from a left-to-right language model, which sees only the preceding tokens. The bidirectional context gives BERT richer information about word meaning and sentence structure.

Pre-training and Fine-tuning: BERT is pre-trained on large amounts of unlabeled text data using two tasks: masked language modeling and next sentence prediction. After pre-training, BERT can be fine-tuned on specific downstream tasks with labeled data. Fine-tuning allows BERT to adapt its learned representations to the specific task, achieving high performance even with limited labeled data.
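
One way to picture the difference between bidirectional and left-to-right context is as a visibility matrix, where row i lists which positions token i may attend to. The sketch below is purely illustrative: the functions `bidirectional_mask` and `causal_mask` are hypothetical helpers, and in practice the `attention_mask` passed to a BERT model is used to mark padding rather than to encode this pattern.

```python
def bidirectional_mask(seq_len):
    """Every position may attend to every position (encoder-style, as in BERT)."""
    return [[1] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    """Position i may attend only to positions 0..i (left-to-right language model)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["hello", ",", "how", "are", "you", "?"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the bidirectional case the row for "how" is all ones, so its representation can depend on "are you ?" as well as on "hello ,"; in the causal case the same row stops at "how".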

In [11]:
import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Example input sentence
sentence = "Hello, how are you?"

# Tokenize the input sentence
# Note: tokenize() does not add the [CLS] and [SEP] special tokens that BERT
# saw during pre-training; tokenizer(sentence, return_tensors='pt') would add them.
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([token_ids])

# Pass the input through the BERT model (no gradients needed for inference)
with torch.no_grad():
    outputs = model(input_ids)

# Get the contextualized representations of the tokens (shape: 1 x num_tokens x 768)
token_embeddings = outputs.last_hidden_state

# Print the contextualized embeddings
for i, token in enumerate(tokens):
    embedding = token_embeddings[0][i]
    print(f"Token: {token}, Embedding: {embedding}")





Token: hello, Embedding: tensor([-3.9924e-01, -3.4411e-02, -3.3965e-01, -2.8517e-01, -3.9406e-02,
        -1.9003e-01,  2.5025e-01,  1.4996e-02,  5.8878e-02,  2.9270e-01,
        -5.7878e-01, -4.9853e-01, -1.1511e+00, -5.2913e-01, -7.0896e-01,
         6.5086e-01,  5.5685e-01,  1.1373e-01, -5.8340e-02, -4.7555e-02,
         7.7435e-01,  2.7698e-01,  1.3954e-01, -4.0489e-01, -4.8637e-02,
        -1.2433e-01,  7.6382e-01,  5.3296e-01, -3.3520e-01, -7.7987e-01,
        -1.0209e-01,  4.1059e-02,  2.5635e-01, -2.1217e-01,  1.8521e-01,
         1.4251e-01,  6.3174e-01, -5.1141e-01,  5.1691e-01,  1.3027e-01,
         1.7753e-01, -5.0273e-01,  2.2023e-01,  2.8469e-01,  3.0471e-01,
        -1.1238e+00, -1.7980e-01,  5.7457e-01, -4.3325e-01, -1.1746e-01,
        -5.7595e-01,  2.7140e-01,  1.2023e-01,  7.2939e-01,  2.1635e-01,
         4.7397e-01,  2.4393e-01, -2.3373e-01,  3.1348e-01, -6.3376e-02,
        -6.8917e-01, -4.3387e-01, -3.5467e-01, -6.3114e-01,  1.1886e+00,
         1.1391e-01, -9.45