# The Masked Language Model (MLM)

***Masked language modeling (MLM) is one of the objectives BERT is pre-trained on: some tokens in a sentence are replaced with a `[MASK]` token, and the model learns to predict them from the context on both sides. This lets the model learn relationships between words and improves its language comprehension.***
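The masking step itself can be sketched without loading any model. The snippet below is an illustration only, not BERT's actual preprocessing code: it assumes the 15% selection rate and the 80/10/10 replacement split described in the original BERT paper, and it selects each token independently for simplicity. The function name `mask_tokens`, the toy vocabulary, and the seed are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token wherever the model must make a
    prediction, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        # Most tokens are left alone and carry no training signal.
        if rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
corrupted, labels = mask_tokens(tokens, vocab)
print(corrupted)
print(labels)
```

During pre-training the loss is computed only at positions where `labels` is not `None`; at inference time, as in the cells below, the `[MASK]` token is placed by hand.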

# The BERT-base-uncased model 


***BERT-base-uncased is a pre-trained transformer model developed by Google. It is widely used for natural language understanding tasks such as text classification, sentiment analysis, and question answering. "Uncased" means the input text is lowercased before tokenization, so the model does not distinguish "English" from "english". BERT-base has 12 layers, a hidden size of 768, and 12 attention heads, which makes it the smaller and cheaper-to-run of the two original BERT sizes.***

## BERT-base has 12 layers, 768 hidden units, and 12 attention heads

![BERT Model](model.png)

# BERT base model (uncased)

In [3]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Define a function to predict masked words
def predict_masked_words(sentence):
    # Tokenize the input sentence ([CLS] and [SEP] are added by default)
    tokenized_input = tokenizer(sentence, return_tensors="pt")

    # Get the position of the masked token
    masked_index = torch.where(tokenized_input["input_ids"] == tokenizer.mask_token_id)[1]

    # Predict the masked token
    with torch.no_grad():
        outputs = model(**tokenized_input)

    # Get the logits for the masked token
    predictions = outputs.logits[0, masked_index, :]

    # Get the single most likely token for the first [MASK] in the sentence
    top_indices = torch.topk(predictions, 1, dim=1).indices[0].tolist()

    # Convert token IDs to actual words
    predicted_tokens = [tokenizer.decode([index]) for index in top_indices]

    return predicted_tokens

# Example sentence with a masked word
input_sentence = "I want to go [MASK]."

# Predict masked words
predicted_words = predict_masked_words(input_sentence)

print("Predicted words:", predicted_words)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Predicted words: ['home']
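The cell above returns only the winning token, not how confident the model was. A minimal sketch of turning raw logits into ranked probabilities follows; it uses plain Python and made-up scores so it runs without the model, and `top_k_probabilities`, `toy_vocab`, and `toy_logits` are hypothetical names and values. With the real model, the equivalent step would be applying `torch.softmax(predictions, dim=-1)` before `torch.topk`.

```python
import math

def top_k_probabilities(logits, vocab, k=5):
    """Apply softmax to raw scores and return the k most probable (token, prob) pairs."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up scores for illustration; these are not real BERT outputs.
toy_vocab = ["home", "back", "there", "out", "now"]
toy_logits = [9.1, 7.4, 6.8, 6.5, 5.9]

for token, prob in top_k_probabilities(toy_logits, toy_vocab, k=3):
    print(f"{token}: {prob:.3f}")
```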


# BERT large model (uncased)

***The BERT-large-uncased model has the following configuration: 24 layers, 1024 hidden dimensions, 16 attention heads, and 336 million parameters. This larger version of BERT generally scores higher than BERT-base on NLP benchmarks, at the cost of more memory and compute. The "uncased" model does not differentiate between uppercase and lowercase letters. It is a reasonable choice when accuracy on harder language understanding tasks matters more than speed.***


## 24 layers, 1024 hidden dimensions, 16 attention heads, and 336 million parameters.
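The parameter count follows from the configuration by straightforward arithmetic. The sketch below counts the weights and biases of the encoder only (embeddings, transformer layers, and pooler). The vocabulary size of 30,522, the 512 positions, and the feed-forward sizes of 3072 and 4096 are taken from the published `bert-base-uncased` and `bert-large-uncased` configurations and are assumptions as far as this notebook is concerned; `bert_encoder_params` is a hypothetical helper. The number of attention heads does not appear because it changes how the hidden size is split, not how many weights there are.

```python
def bert_encoder_params(vocab_size, hidden, layers, intermediate,
                        max_positions=512, type_vocab=2):
    """Count weights and biases in a BERT encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # one scale and one shift vector
    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_encoder_params(vocab_size=30522, hidden=768, layers=12, intermediate=3072)
large = bert_encoder_params(vocab_size=30522, hidden=1024, layers=24, intermediate=4096)
print(f"BERT-base encoder:  {base:,}")   # 109,482,240
print(f"BERT-large encoder: {large:,}")  # 335,141,888
```

The BERT-large result lands close to the 336 million quoted above; the exact figure depends on which task heads are included in the count.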

In [4]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-large-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Define a function to predict masked words
def predict_masked_words(sentence):
    # Tokenize the input sentence ([CLS] and [SEP] are added by default)
    tokenized_input = tokenizer(sentence, return_tensors="pt")

    # Get the position of the masked token
    masked_index = torch.where(tokenized_input["input_ids"] == tokenizer.mask_token_id)[1]

    # Predict the masked token
    with torch.no_grad():
        outputs = model(**tokenized_input)

    # Get the logits for the masked token
    predictions = outputs.logits[0, masked_index, :]

    # Get the top 5 predictions
    top_5_indices = torch.topk(predictions, 5, dim=1).indices[0].tolist()

    # Convert token IDs to actual words
    predicted_tokens = [tokenizer.decode([index]) for index in top_5_indices]

    return predicted_tokens

# Example sentence with a masked word
input_sentence = "Are you going to [MASK]."
# Predict masked words
predicted_words = predict_masked_words(input_sentence)

print("Predicted words:", predicted_words)


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Predicted words: ['?', 'die', 'sleep', '...', 'stay']
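Two of the five candidates above are punctuation rather than words, which is common when the mask sits at the end of a sentence. One possible post-processing step, sketched under the assumption that only whole-word candidates are wanted, is to drop punctuation-only tokens and WordPiece continuation pieces (those starting with `##`). `keep_word_candidates` is a hypothetical helper, not part of `transformers`.

```python
import string

def keep_word_candidates(tokens):
    """Drop candidates that are pure punctuation or WordPiece continuation pieces."""
    kept = []
    for token in tokens:
        if token.startswith("##"):
            continue  # subword continuation, not a standalone word
        if all(ch in string.punctuation for ch in token):
            continue  # e.g. '?' or '...'
        kept.append(token)
    return kept

predicted = ['?', 'die', 'sleep', '...', 'stay']
print(keep_word_candidates(predicted))  # ['die', 'sleep', 'stay']
```

Because filtering shrinks the list, it may help to request more than five candidates from `torch.topk` and truncate after filtering.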


# BERT multilingual base model (uncased)

***The BERT multilingual base model (uncased), loaded below as `bert-base-multilingual-uncased`, was pre-trained on Wikipedia text in 102 languages, making it usable for NLP applications across many languages. It has 12 layers, 768 hidden dimensions, 12 attention heads, and 110 million parameters. The "uncased" version lowercases its input, so it does not distinguish between uppercase and lowercase letters; a separate cased checkpoint, `bert-base-multilingual-cased`, preserves case and covers 104 languages.***

## 12 layers, 768 hidden dimensions, 12 attention heads, and 110 million parameters
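The practical difference between cased and uncased checkpoints is the text normalization applied before tokenization, which matters more for multilingual text because accents are affected too. The sketch below approximates what an uncased BERT tokenizer does to its input, assuming the usual lowercasing plus accent stripping; `uncased_normalize` is a hypothetical helper written for illustration, not a `transformers` function.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase the text and strip combining accent marks."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Unicode category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Café MÜNCHEN"))  # cafe munchen
```

A cased checkpoint skips this step, so "Café" and "cafe" reach the model as different inputs.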

In [5]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-multilingual-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Define a function to predict masked words
def predict_masked_words(sentence):
    # Tokenize the input sentence ([CLS] and [SEP] are added by default)
    tokenized_input = tokenizer(sentence, return_tensors="pt")

    # Get the position of the masked token
    masked_index = torch.where(tokenized_input["input_ids"] == tokenizer.mask_token_id)[1]

    # Predict the masked token
    with torch.no_grad():
        outputs = model(**tokenized_input)

    # Get the logits for the masked token
    predictions = outputs.logits[0, masked_index, :]

    # Get the top 5 predictions
    top_5_indices = torch.topk(predictions, 5, dim=1).indices[0].tolist()

    # Convert token IDs to actual words
    predicted_tokens = [tokenizer.decode([index]) for index in top_5_indices]

    return predicted_tokens

# Example sentence with a masked word
input_sentence = "What is your favorite [MASK]?"
# Predict masked words
predicted_words = predict_masked_words(input_sentence)

print("Predicted words:", predicted_words)



Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Predicted words: ['game', 'food', 'thing', 'song', 'movie']
