In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

In [2]:
# Load pre-trained BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Example input text
text = "Hello, how are you doing today?"


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
# Tokenize input text and add special tokens
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

In [4]:
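# Generate BERT embeddings (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(input_ids)
    # outputs[0] is the last hidden state, shape (batch, num_tokens, hidden_size);
    # the second [0] selects the only sequence in the batch
    embeddings = outputs[0][0]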

In [5]:
# Print embeddings
print(embeddings)

tensor([[ 0.0050, -0.0445, -0.2543,  ..., -0.4674,  0.1839,  0.4307],
        [ 0.1518, -0.3346,  0.0500,  ..., -0.0397,  0.9728,  0.3340],
        [-0.5165,  0.2028,  0.5212,  ..., -0.5615,  0.4370,  0.1435],
        ...,
        [-0.2533, -0.7244, -0.7809,  ..., -0.1901, -0.0070, -0.6356],
        [-0.3863, -0.8991, -0.8409,  ...,  0.0749,  0.1376, -0.1122],
        [ 0.5416, -0.1070, -0.3464,  ...,  0.1287, -0.4294, -0.2204]])


In [None]:
"""
write code in python to create embeddings using bert


To create embeddings using BERT in Python, you can use the transformers library, which provides pre-trained BERT models and an easy-to-use interface for generating embeddings. Here's an example code snippet:

python
from transformers import AutoTokenizer, AutoModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Example input text
text = "Hello, how are you doing today?"

# Tokenize input text and add special tokens
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

# Generate BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs[0][0]

# Print embeddings
print(embeddings)
In this example, we first load the pre-trained BERT tokenizer and model using the AutoTokenizer and AutoModel classes from the transformers library. Then, we define an example input text and tokenize it using the tokenizer, adding special tokens for BERT. Next, we generate BERT embeddings for the tokenized input text by passing it through the model and extracting the output embeddings. Finally, we print the embeddings.

Note that the above code generates embeddings for a single input text. To generate embeddings for multiple input texts, you can batch the input token IDs and pass them through the BERT model all at once, like this:

python
# Example input texts
texts = ["Hello, how are you doing today?", "I am doing well, thank you."]

# Tokenize input texts and add special tokens
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True) for text in texts])

# Generate BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs[0]

# Print embeddings
print(embeddings)
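In this example, we define a list of example input texts and tokenize each one, adding special tokens as before. Then, we stack the token IDs into one tensor and pass it through the BERT model in a single forward pass, generating embeddings for every input text. Finally, we print the embeddings for all input texts. Note that tokenizer.encode does not pad, so torch.tensor accepts the nested list only when every text tokenizes to the same number of tokens, as these two sentences happen to. For texts of different lengths, call tokenizer(texts, padding=True, return_tensors='pt') instead and pass the resulting attention_mask to the model along with input_ids.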





Arturo Devesa
which part of the code is the batching?
In the second code snippet, the batching is done in the following lines:

python
# Example input texts
texts = ["Hello, how are you doing today?", "I am doing well, thank you."]

# Tokenize input texts and add special tokens
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True) for text in texts])

# Generate BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs[0]
Here, we first define a list of example input texts called texts. Then, we tokenize each input text using the BERT tokenizer and add special tokens as before. This is done using a list comprehension:

python
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True) for text in texts])
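The resulting input_ids tensor has shape (num_texts, seq_length), where num_texts is the number of input texts and seq_length is the number of tokens per text, including the special tokens. Because tokenizer.encode does not pad, every text must tokenize to the same length for this tensor to be built.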
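
When texts tokenize to different lengths, the per-text ID lists no longer form a rectangle and torch.tensor raises an error. The usual fix is padding, and a minimal sketch of it follows. The token IDs are made-up placeholders (only 101 for [CLS] and 102 for [SEP] match bert-base-uncased), and pad_id = 0 is an assumption that matches that model's [PAD] token. With the real tokenizer, tokenizer(texts, padding=True, return_tensors='pt') returns padded input_ids together with an attention_mask.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-ID lists of different lengths (placeholder values).
ragged_ids = [
    [101, 7, 8, 9, 102],
    [101, 7, 102],
]
pad_id = 0  # assumed padding ID

# Pad every sequence on the right to the length of the longest one.
input_ids = pad_sequence(
    [torch.tensor(ids) for ids in ragged_ids],
    batch_first=True,
    padding_value=pad_id,
)

# 1 marks a real token, 0 marks padding that the model should ignore.
attention_mask = (input_ids != pad_id).long()

print(input_ids.shape)  # torch.Size([2, 5])
print(attention_mask)
```

Both tensors would then go to the model together, for example model(input_ids, attention_mask=attention_mask).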

Finally, we pass the input_ids tensor through the BERT model all at once to generate embeddings for all input texts in a single forward pass:

python
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs[0]
The resulting embeddings tensor has shape (num_texts, seq_length, hidden_size), where hidden_size is the size of the BERT model's hidden layer (768 for bert-base-uncased). Each entry along the first dimension, embeddings[i], holds the token embeddings for one input text, with one row per token.
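
To make the three dimensions concrete, the sketch below indexes a random stand-in tensor with the same layout. The sizes (2 texts, 10 tokens, hidden size 768) are assumptions chosen to mirror bert-base-uncased, and no model is loaded.

```python
import torch

# Random stand-in for the model output: (num_texts, seq_length, hidden_size).
num_texts, seq_length, hidden_size = 2, 10, 768
embeddings = torch.randn(num_texts, seq_length, hidden_size)

first_text = embeddings[0]      # all token vectors of the first text
first_token = embeddings[0, 0]  # vector at position 0 ([CLS] in BERT inputs)

print(first_text.shape)   # torch.Size([10, 768])
print(first_token.shape)  # torch.Size([768])
```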





Arturo Devesa
I don't understand the with torch.no_grad(): part of the code
The with torch.no_grad(): statement is used in PyTorch to temporarily disable gradient tracking. Inside the block, PyTorch does not record the operations it would need for backpropagation. This is useful when you don't need gradients, such as during inference or evaluation, because it saves memory and can speed up computation.

In the context of generating embeddings using BERT, we wrap the forward pass in with torch.no_grad(): because we are only reading the model's outputs, not training it, so the bookkeeping for gradients would be wasted work.

Here's how it works in the code:

python
# Generate BERT embeddings
with torch.no_grad():  # Disable gradient computation
    outputs = model(input_ids)  # Forward pass through BERT model
    embeddings = outputs[0]  # Extract output embeddings

# Print embeddings
print(embeddings)
Inside the with torch.no_grad(): block, we perform a forward pass through the BERT model using the model(input_ids) statement. Because gradient tracking is disabled, the resulting tensors carry no autograd history. The model simply computes the embeddings for the input texts using the pre-trained weights of the BERT model.

Once we have the output embeddings, we extract them using the outputs[0] statement and store them in the embeddings variable. Then, we print the embeddings to the console.

So, in summary, the with torch.no_grad(): statement is used to disable gradient computation during the forward pass through the BERT model, which can save memory and speed up computation during inference or evaluation.
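
The effect can be observed directly on a small stand-in layer. This is only an illustrative sketch: the torch.nn.Linear layer and its sizes are assumptions standing in for BERT, which behaves the same way in this respect.

```python
import torch

# Small stand-in layer; its parameters require gradients by default.
layer = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

y_tracked = layer(x)        # autograd records this operation
with torch.no_grad():
    y_untracked = layer(x)  # nothing is recorded inside the block

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
```

The two results hold the same values; only the gradient bookkeeping differs.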





Arturo Devesa
sounds good, thanks. and why do we call outputs[0] at index 0?
In the Hugging Face transformers library, calling a model returns an output object that can be indexed like a tuple. The specific contents depend on the model and its configuration.

In the case of the BERT model with default settings, outputs contains two elements:

outputs[0]: The last-layer hidden states of the model, i.e., the output embeddings. Also available as outputs.last_hidden_state.
outputs[1]: The pooled output of the model, i.e., a summary representation of the input sequence that can be used for classification tasks. Also available as outputs.pooler_output.
Since we want to extract the output embeddings from the BERT model, we access outputs at index 0, which corresponds to the last-layer hidden states. This returns a tensor containing the output embeddings for the input text(s).

So, the line embeddings = outputs[0] extracts the output embeddings from outputs and assigns them to the embeddings variable, which we then print to the console.

Note that the shape of outputs[0] is (batch_size, num_tokens, hidden_size), where batch_size is the number of input texts, num_tokens is the number of tokens in each input sequence, and hidden_size is the size of the BERT model's hidden layer. In the single-text snippet, outputs[0][0] drops the batch dimension, leaving shape (num_tokens, hidden_size), where each row is the embedding of one token.
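
A quick way to check the indexing without downloading any weights is to build a tiny, randomly initialised BertModel. The configuration sizes below are arbitrary assumptions, so the values are meaningless; only the structure of the output is of interest.

```python
import torch
from transformers import BertConfig, BertModel

# Deliberately tiny, randomly initialised model (assumed sizes).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()

# 2 texts of 7 random token IDs each.
input_ids = torch.randint(0, config.vocab_size, (2, 7))

with torch.no_grad():
    outputs = model(input_ids)

# Index 0 and 1 give the same tensors as the named attributes.
print(torch.equal(outputs[0], outputs.last_hidden_state))  # True
print(torch.equal(outputs[1], outputs.pooler_output))      # True
print(outputs[0].shape)  # torch.Size([2, 7, 32])
print(outputs[1].shape)  # torch.Size([2, 32])
```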





Arturo Devesa
ah ok, I didn't know that, thanks. And why do we call the variable input_ids with the word ids? I understand why use the word input, but why do we use the word ids in the variable name?
In NLP, when we represent text using numerical values, we often use IDs (short for "identifiers") to refer to individual tokens. Each ID is an integer: the index of that token in the model's vocabulary.

In the case of BERT, the tokenizer.encode method returns a list of token IDs for the tokenized input text. BERT's WordPiece tokenizer may split a rare word into several subword tokens, so one word can map to more than one ID. These IDs are the numerical form of the text that is passed to the BERT model for computation.

So, in the code, we use the variable name input_ids to refer to the tensor of token IDs that represent the input text(s). This is a common convention in NLP, where the suffix "ids" marks a numerical representation of text, and it is also the name of the corresponding parameter of the BERT model's forward method.





Arturo Devesa
so are input_ids kind of like saying token ids?
Yes, that's a good way to think about it. In the context of BERT, the input_ids variable represents the token IDs of the input text(s). More specifically, it is a tensor of shape (batch_size, sequence_length) where each element is an integer that corresponds to the ID of a token in the BERT vocabulary.

Note that the input_ids variable is generated by the tokenizer.encode method, which tokenizes the input text(s) and maps the resulting tokens to their corresponding IDs in the vocabulary. So, in essence, input_ids is a numerical representation of the input text(s) that is based on the BERT vocabulary.

In summary, input_ids represents the token IDs of the input text(s), and is a numerical representation of the text that is based on the BERT vocabulary.
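
The token-to-ID lookup can be sketched with a toy vocabulary. Everything below is hypothetical: the vocabulary, its ID values, and the encode helper are illustrative stand-ins, not BERT's real vocabulary or tokenizer.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, ",": 5, "how": 6, "are": 7, "you": 8}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    # Look up each token (unknown tokens map to [UNK]),
    # then wrap the sequence in [CLS] ... [SEP].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

input_ids = encode(["hello", ",", "how", "are", "you", "today"])
print(input_ids)  # [2, 4, 5, 6, 7, 8, 1, 3]
print([id_to_token[i] for i in input_ids])
```

With the real tokenizer, tokenizer.convert_ids_to_tokens(ids) performs the reverse lookup from IDs back to token strings.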
"""