# Generating word embeddings with BERT

Embedding model: BERT

BERT is a popular choice for generating word, sentence, and document embeddings.

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based natural language processing model developed by Google. It is
pretrained on large amounts of unlabeled text so that the representation of each word draws on the context to both its left and its right. Key
components of the BERT embedding model include the following:

Tokenization: BERT tokenizes input text into subword tokens using WordPiece tokenization. This allows BERT to handle out-of-vocabulary words and
capture morphological variations.

Transformer architecture: BERT utilizes a transformer architecture consisting of multiple layers of self-attention mechanisms and feedforward neural 
networks. This architecture enables BERT to capture contextual information from the left and right contexts of each word in a sentence.

Pretraining: BERT is pretrained on large text corpora using two self-supervised tasks: masked language modeling (MLM) and next sentence prediction
(NSP). In MLM, BERT predicts masked words in a sentence based on the context of the surrounding words. In NSP, BERT predicts whether two sentences
appear consecutively in the original text.

Embedding generation: Pretraining teaches BERT to produce a contextualized embedding for each token in the input text. Because each embedding depends
on the surrounding words, the same word can receive different vectors in different sentences.

Fine-tuning: After pretraining, BERT can be fine-tuned on downstream tasks such as text classification, named entity recognition, and sentiment 
analysis. Fine-tuning adapts BERT’s parameters to the specific task, improving its performance.
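
The MLM task described above can be illustrated with a small masking routine. This is a sketch, not BERT's training code: the function name `mask_tokens`, the toy vocabulary, and the fixed seed are assumptions made for illustration. It follows the commonly described BERT recipe, in which about 15% of tokens are selected and, of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged. A higher masking rate is used in the usage lines only so that the short example is likely to show a change.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required, else None.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        # Special tokens are never selected for prediction
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append("[MASK]")           # most selected tokens are hidden
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # some are swapped for a random token
        else:
            masked.append(token)              # the rest are left as they are
    return masked, labels

tokens = ["[CLS]", "we", "had", "a", "picnic", "on", "the", "river", "bank", "[SEP]"]
toy_vocab = ["river", "bank", "picnic", "account", "tom"]  # hypothetical mini vocabulary
masked, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3)
print(masked)
print(labels)
```

During pretraining, the model only has to predict the positions where `labels` is not None.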

Pretrained BERT model (bert-base-uncased)
    
The bert-base-uncased model is a pretrained version of BERT developed by Google. It is a powerful tool for generating high-quality embeddings that
capture contextual information and semantic meaning in natural language text. We’ll use it to generate word, sentence, and document embeddings.

Model size: bert-base-uncased is the base version of BERT, with 12 transformer layers, a hidden size of 768 in each layer, and about 110 million
parameters in total. It is considerably smaller than the bert-large variant.

Uncased variant: The “uncased” variant indicates that the model is trained on uncased text, where all text is converted to lowercase during 
tokenization. This variant is suitable for tasks where case sensitivity is not crucial.

Pretrained weights: The bert-base-uncased model has pretrained weights learned from large-scale text corpora, such as Wikipedia and BookCorpus. 
These weights capture rich semantic information and contextual relationships between words in natural language text.

Usage: After loading the pretrained bert-base-uncased model, it can be fine-tuned on downstream tasks using task-specific datasets and can also be 
used as a feature extractor to generate word, sentence, or document embeddings for various natural language processing tasks.
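
The figure of roughly 110 million parameters can be checked with a little arithmetic. The sketch below assumes the standard bert-base configuration: a vocabulary of 30,522 tokens and 512 positions (both appear in the tokenizer output later in this notebook), 2 segment types, a feedforward inner size of 3,072, and a pooling layer on top of the encoder. The variable names are illustrative.

```python
hidden, layers, vocab, positions, segments, ffn = 768, 12, 30522, 512, 2, 3072

layer_norm = 2 * hidden  # one scale and one bias vector

# Token, position, and segment embedding tables plus one LayerNorm
embeddings = (vocab + positions + segments) * hidden + layer_norm

# Query, key, value, and output projections (weights + biases) plus LayerNorm
attention = 4 * (hidden * hidden + hidden) + layer_norm

# Two dense layers (hidden -> ffn -> hidden) plus LayerNorm
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

# Dense layer applied to the [CLS] position
pooler = hidden * hidden + hidden

total = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

That is about 109.5 million, which is usually rounded to 110 million.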



In [1]:
# Sample data
Text_sequences = ["We had a picnic on the river bank.",
"Tom deposited his paycheck into his savings bank account."
]

# Step 1: Data preprocessing 

The first step is to preprocess the text. Preprocessing involves lowercasing, tokenization, removing stop words and punctuation, and lemmatization.
We import word_tokenize, stopwords, and WordNetLemmatizer from the NLTK library, and string to remove punctuation.

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Convert text sequence into individual words called tokens
    print("Original text:\n", text, "\n")
    tokens = word_tokenize(text.lower())
    print("Original text tokens:\n", tokens, "\n")
    # Drop stop words and punctuation from the tokens and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token not in string.punctuation]
    print("Tokens we have after removing stop words, punctuation and lemmatization:\n", tokens, "\n")
    # Join the lemmatized tokens back into a single string
    return ' '.join(tokens)

preprocessed_Text_sequences = [(preprocess(text_sequence)) for text_sequence in Text_sequences]
print("Preprocessed text: ", preprocessed_Text_sequences)

Original text:
 We had a picnic on the river bank. 

Original text tokens:
 ['we', 'had', 'a', 'picnic', 'on', 'the', 'river', 'bank', '.'] 

Tokens we have after removing stop words, punctuation and lemmatization:
 ['picnic', 'river', 'bank'] 

Original text:
 Tom deposited his paycheck into his savings bank account. 

Original text tokens:
 ['tom', 'deposited', 'his', 'paycheck', 'into', 'his', 'savings', 'bank', 'account', '.'] 

Tokens we have after removing stop words, punctuation and lemmatization:
 ['tom', 'deposited', 'paycheck', 'saving', 'bank', 'account'] 

Preprocessed text:  ['picnic river bank', 'tom deposited paycheck saving bank account']


# Step 2: BERT tokenization

The preprocessed text is further tokenized by a BERT tokenizer, converting the data into a format the BERT model takes as input to generate embeddings
for each BERT-generated token. To do this, we import BertTokenizer from the Hugging Face Transformers library. In the following code:

We load the BERT tokenizer from the pretrained bert-base-uncased model.

We tokenize our preprocessed text sequence using the BERT tokenizer. It returns PyTorch tensors, which we will later provide to the BERT model to
generate embeddings for each token in the inputs.



In [6]:
from transformers import BertTokenizer

# Load pre-trained BERT tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print("BERT Tokenizer:\n", tokenizer)

def BERT_tokenize(preprocessed_text_sequence):

  print("Preprocessed text sequence:\n", preprocessed_text_sequence)
  # Encoding the preprocessed text in the format expected by BERT. 'pt' stands for PyTorch tensor, for tensorflow tensor, this value would be 'tf'
  inputs = tokenizer(preprocessed_text_sequence, return_tensors='pt')
  print("BERT tokenized input tensor that will later be given as input to the BERT model:\n", inputs)
  # Visualizing the tokens generated by BERT from preprocessed text sequence
  bert_tokenized_text = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
  print("BERT tokens:\n", bert_tokenized_text)

BERT Tokenizer:
 BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}




# Analyzing output

The BERT tokenizer adds a classification token '[CLS]' at the beginning of each sequence and a separator token '[SEP]' at the end.

In BERT tokenization, the ## prefix marks a subword that is not at the beginning of a word. In the second example, “paycheck” is divided into
'pay', '##che', and '##ck'. The tokens '##che' and '##ck' are continuation pieces that should be joined to the preceding token ('pay') to re-form
the original word “paycheck”. This is how WordPiece tokenization handles out-of-vocabulary and rare words: it breaks them into subword units
that are in the vocabulary.

Note that the split depends on the whole word. Inside “paycheck”, the “check” portion becomes '##che' and '##ck', but this does not imply that the
standalone word “check” is split the same way. Frequent words usually have their own vocabulary entry and stay whole.
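
The greedy longest-match-first idea behind WordPiece can be sketched in a few lines. This is a simplified illustration, not the Hugging Face implementation: the function `wordpiece_tokenize` and the toy vocabulary are hypothetical, and the real bert-base-uncased vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"pay", "##che", "##ck", "bank", "saving", "##s"}
print(wordpiece_tokenize("paycheck", toy_vocab))  # ['pay', '##che', '##ck']
print(wordpiece_tokenize("savings", toy_vocab))   # ['saving', '##s']
print(wordpiece_tokenize("river", toy_vocab))     # ['[UNK]']
```

With this toy vocabulary, "river" falls back to [UNK] only because none of its pieces are listed. In the real vocabulary it is a single token, as the tokenizer output shows.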

BERT special tokens:

Classification token [CLS]
Purpose: This is added at the beginning of every input sequence.

Function: The output embedding corresponding to this token is used as the aggregate sequence representation for classification tasks. For example, 
in sentiment analysis, the [CLS] token’s representation is used to determine the sentiment of the entire input sentence.

Separator token [SEP]
Purpose: This is used to separate two sentences or segments in a single input sequence.

Function: This helps the model understand the boundaries between different parts of the input. This is particularly useful in tasks like question
answering, where the input consists of a question and a passage, and each is separated by a [SEP] token.

Padding token [PAD]
Purpose: This is used to pad shorter sequences in a batch to ensure they are all the same length.

Function: This ensures uniform input size for efficient batch processing. The attention mask marks these positions with 0 so that the model ignores
them during training and inference.

Masking token [MASK]
Purpose: This is used during the BERT pretraining phase for the masked language model task.

Function: Certain tokens in the input are replaced with [MASK] tokens, and the model is trained to predict the original token. This helps the model 
learn contextual relationships between words and predict missing words in a sentence, aiding in understanding word contexts and improving the 
model’s language understanding capabilities.
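
To make the roles of [CLS], [SEP], and [PAD] concrete, the following sketch assembles a padded batch by hand. It is a simplified illustration of the structure the tokenizer produces, not its implementation: `encode_pair` and `pad_batch` are hypothetical helpers, and the token IDs are copied from the tokenizer outputs in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as listed in the tokenizer output

def encode_pair(ids_a, ids_b=None):
    """Build input_ids and token_type_ids for one segment or a pair of segments."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # second segment is marked with 1
    return input_ids, token_type_ids

def pad_batch(encoded):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids, _ in encoded)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ids, types in encoded:
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [PAD_ID] * pad)
        batch["token_type_ids"].append(types + [0] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)  # 0 = ignore
    return batch

single = encode_pair([12695, 2314, 2924])              # picnic river bank
pair = encode_pair([12695, 2314, 2924], [2924, 4070])  # picnic river bank / bank account
batch = pad_batch([single, pair])
for name, rows in batch.items():
    print(name, rows)
```

The shorter sequence is filled with [PAD] IDs on the right, and its attention mask is 0 at exactly those positions.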



In [9]:
BERT_tokenize(preprocessed_Text_sequences[0])

Preprocessed text sequence:
 picnic river bank
BERT tokenized input tensor that will later be given as input to the BERT model:
 {'input_ids': tensor([[  101, 12695,  2314,  2924,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
BERT tokens:
 ['[CLS]', 'picnic', 'river', 'bank', '[SEP]']


In [10]:
BERT_tokenize(preprocessed_Text_sequences[1])

Preprocessed text sequence:
 tom deposited paycheck saving bank account
BERT tokenized input tensor that will later be given as input to the BERT model:
 {'input_ids': tensor([[  101,  3419, 14140,  3477,  5403,  3600,  7494,  2924,  4070,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
BERT tokens:
 ['[CLS]', 'tom', 'deposited', 'pay', '##che', '##ck', 'saving', 'bank', 'account', '[SEP]']
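
Because BERT works on tokens rather than words, it can be useful to know which tokens belong to the same word, for example to average their vectors into one word-level embedding later. The helper below is a hypothetical sketch (`group_subwords` is not part of the Transformers library) that merges ## continuation pieces and records their positions.

```python
def group_subwords(tokens):
    """Return (word, token_indices) pairs, merging ## continuation pieces."""
    words = []
    for index, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens do not belong to any word
        if token.startswith("##") and words:
            word, indices = words[-1]
            words[-1] = (word + token[2:], indices + [index])
        else:
            words.append((token, [index]))
    return words

tokens = ['[CLS]', 'tom', 'deposited', 'pay', '##che', '##ck', 'saving', 'bank', 'account', '[SEP]']
for word, indices in group_subwords(tokens):
    print(word, indices)
```

Here "paycheck" maps to token positions 3, 4, and 5, while every other word maps to a single position.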


# Step 3: Generate embeddings

We generate word embeddings by feeding the inputs to the BERT model.
    
We import BertModel from transformers.

We load the pretrained bert-base-uncased model, which expects lowercased text.

We run a forward pass of the BERT model with gradient tracking disabled and store the result in outputs. The last hidden state, outputs.last_hidden_state, holds one contextual embedding per token.

The shape of last_hidden_state is (batch_size, sequence_length, hidden_size). For a single text sequence, batch_size is 1, so we squeeze the first dimension to get (sequence_length, hidden_size).

We return the word embeddings as a NumPy array.


In [12]:
import torch
from transformers import BertModel
import numpy as np

# Load the pretrained BERT model
model = BertModel.from_pretrained('bert-base-uncased')

def generate_embedding(inputs):
    bert_tokenized_text = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    print("BERT tokens:\n", bert_tokenized_text)
    print("Number of BERT tokens:\n", len(bert_tokenized_text))
    # Forward pass without gradient tracking; we only need the hidden states
    with torch.no_grad():
        outputs = model(**inputs)
    # The last hidden state holds one embedding per token; drop the batch dimension of size 1
    word_embeddings = outputs.last_hidden_state.squeeze(0)
    print("Word embeddings shape: ", word_embeddings.shape)
    print("BERT word embeddings: ", word_embeddings)
    return word_embeddings.numpy()

# Output analysis

A 768-dimensional embedding vector is generated for each BERT token. The first embedding in both examples belongs to the [CLS] token and can be
used as an embedding of the complete text sequence.
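
Besides the [CLS] vector, a common alternative for a sequence-level embedding is to average the token vectors. The sketch below shows both options on a tiny made-up matrix so that it runs without the model. `cls_embedding` and `mean_pool` are illustrative names, and the 3-dimensional rows stand in for the 768-dimensional rows returned by generate_embedding. Whether the special tokens are included in the average is a design choice; here they are excluded by default.

```python
def cls_embedding(token_embeddings):
    """Use the first row, which belongs to [CLS], as the sequence embedding."""
    return list(token_embeddings[0])

def mean_pool(token_embeddings, skip_special=True):
    """Average the token vectors, optionally dropping the [CLS] and [SEP] rows."""
    rows = token_embeddings[1:-1] if skip_special else token_embeddings
    return [sum(column) / len(rows) for column in zip(*rows)]

# Toy stand-in for a (sequence_length, hidden_size) array: [CLS], two words, [SEP]
toy = [
    [0.0, 0.0, 1.0],  # [CLS]
    [1.0, 2.0, 3.0],  # first word
    [3.0, 4.0, 5.0],  # second word
    [9.0, 9.0, 9.0],  # [SEP]
]
print(cls_embedding(toy))  # [0.0, 0.0, 1.0]
print(mean_pool(toy))      # [2.0, 3.0, 4.0]
```

With a NumPy array such as the one returned by generate_embedding, `embeddings[1:-1].mean(axis=0)` computes the same average in one step.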


In [17]:
generate_embedding(tokenizer(preprocessed_Text_sequences[0], return_tensors='pt'))

BERT tokens:
 ['[CLS]', 'picnic', 'river', 'bank', '[SEP]']
Number of BERT tokens:
 5
Word embeddings shape:  torch.Size([5, 768])
BERT word embeddings:  tensor([[-0.2253, -0.0168,  0.0024,  ..., -0.3422,  0.0240,  0.2583],
        [ 0.2572,  0.3250, -0.6357,  ..., -0.1652, -0.1414, -0.5714],
        [ 0.3935,  0.2023, -0.8208,  ...,  0.1791, -0.5660, -0.4970],
        [ 0.1709, -0.6051, -0.9341,  ..., -0.4851, -0.9701,  0.2262],
        [ 0.7009,  0.0398, -0.4535,  ...,  0.1339, -0.5551, -0.2387]])


array([[-0.22530818, -0.01683594,  0.00243193, ..., -0.34223026,
         0.02404874,  0.25833204],
       [ 0.2572179 ,  0.325045  , -0.63566685, ..., -0.16516595,
        -0.14136916, -0.57144696],
       [ 0.39351892,  0.20234175, -0.8208232 , ...,  0.17907247,
        -0.5659587 , -0.49696052],
       [ 0.17093086, -0.6051165 , -0.9341149 , ..., -0.48505035,
        -0.9701358 ,  0.22616962],
       [ 0.7009372 ,  0.03977758, -0.4534922 , ...,  0.13394572,
        -0.55510193, -0.23867923]], dtype=float32)

In [18]:
generate_embedding(tokenizer(preprocessed_Text_sequences[1], return_tensors='pt'))

BERT tokens:
 ['[CLS]', 'tom', 'deposited', 'pay', '##che', '##ck', 'saving', 'bank', 'account', '[SEP]']
Number of BERT tokens:
 10
Word embeddings shape:  torch.Size([10, 768])
BERT word embeddings:  tensor([[-0.1539, -0.0143,  0.2241,  ..., -0.3247, -0.0813,  0.2300],
        [ 0.2860, -0.0930,  0.4270,  ..., -0.5211,  0.3296, -0.2689],
        [-0.2934, -0.1110,  0.3552,  ..., -0.7213, -0.1483, -0.4355],
        ...,
        [ 0.3683, -0.0957,  0.4680,  ..., -0.6037, -0.6312, -0.2820],
        [-0.2391, -0.4368, -0.0191,  ...,  0.0445, -0.5492,  0.1968],
        [ 0.6700,  0.3728,  0.0954,  ...,  0.1696, -0.5201, -0.2818]])


array([[-0.1539274 , -0.01426611,  0.22410263, ..., -0.32470435,
        -0.08126129,  0.23003566],
       [ 0.28599662, -0.09298298,  0.42699027, ..., -0.52109075,
         0.32956743, -0.26891032],
       [-0.2933597 , -0.11097818,  0.3552085 , ..., -0.7213431 ,
        -0.14829345, -0.43545598],
       ...,
       [ 0.36828735, -0.09568151,  0.46795404, ..., -0.6036686 ,
        -0.63117397, -0.28203726],
       [-0.23909116, -0.43680143, -0.01907284, ...,  0.04451707,
        -0.5491625 ,  0.19682671],
       [ 0.6699899 ,  0.3728311 ,  0.09542982, ...,  0.16961138,
        -0.5200647 , -0.28180757]], dtype=float32)
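
A natural next step is to check whether the two occurrences of "bank" received different vectors. The sketch below defines a cosine similarity helper and shows, in comments, how it could be applied to the two arrays returned above. `cosine_similarity`, `emb_river`, and `emb_money` are illustrative names, and the positions 3 and 7 are read off the token lists printed above. The toy vectors are made up so that the snippet runs on its own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, only to show the helper's behavior
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0

# With the notebook's arrays (assumed variable names):
# emb_river = generate_embedding(tokenizer(preprocessed_Text_sequences[0], return_tensors='pt'))
# emb_money = generate_embedding(tokenizer(preprocessed_Text_sequences[1], return_tensors='pt'))
# cosine_similarity(emb_river[3], emb_money[7])  # "bank" is token 3 and token 7
```

A value clearly below 1 would indicate that BERT encodes the two senses of "bank" differently, which is the point of contextual embeddings.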