# Transformers - Embeddings

This notebook experiments with embeddings produced by Transformer models.

# Setup Notebook

## Imports

In [2]:
# Import Standard Libraries
import numpy as np
import json
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer


# Experimentations

## AutoTokenizer with BERT

### Tokenization

In [2]:
# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

- The tokenizer converts raw text into tokens, and then into the numerical inputs the model expects
<br>

**The Process:**
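1) Split the text into sub-word units with WordPiece tokenization (e.g., "playing" &rarr; ["play", "##ing"]; this split is illustrative, since a word that is already in the vocabulary stays whole)
2) Map each token to its numerical ID in BERT's vocabulary
3) Add the special tokens (`[CLS]` at the start and `[SEP]` at the end)
4) Pad or truncate the sequence to fit the model's input length
5) Create the attention mask (`1` for real tokens, `0` for padding)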
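
The steps above can be sketched in plain Python. This is a toy illustration only: `TOY_VOCAB`, its IDs, `wordpiece` and `encode` are hypothetical names invented for this example, and the real BERT tokenizer does more (text normalization, punctuation splitting, a vocabulary of roughly 30k entries).

```python
# Toy vocabulary with made-up IDs (NOT BERT's real vocabulary)
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "play": 6, "##ing": 7, "data": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length=8):
    """Tokenize, add special tokens, map to IDs, pad, and build the attention mask."""
    # Step 1: lowercase, split on whitespace, apply WordPiece to each word
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    # Steps 3 and 4 (truncation): keep room for the two special tokens
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    # Step 2: map tokens to IDs
    input_ids = [vocab[t] for t in tokens]
    # Step 5: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    # Step 4 (padding): fill up to max_length
    n_pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return tokens, input_ids, attention_mask


toy_tokens, toy_ids, toy_mask = encode("I love playing", TOY_VOCAB)
print(toy_tokens)  # ['[CLS]', 'i', 'love', 'play', '##ing', '[SEP]']
print(toy_ids)     # [2, 4, 5, 6, 7, 3, 0, 0]
print(toy_mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
```

The code adds the special tokens before the ID lookup only so that a single lookup covers every token; the result is the same as following the numbered order.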

In [9]:
text = "I love Data Science"
tokens = tokenizer(text, return_tensors="pt")  # Convert to PyTorch tensors

print('Token Object Shape:', len(tokens), "- Token Keys:", tokens.keys())
print('Number of Tokens:', len(tokens.input_ids[0]), "- Remember Special Tokens [101] and [102]")
print('Tokens:' , tokens.input_ids)
print('Attention Mask:', tokens.attention_mask)

Token Object Shape: 3 - Token Keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
Number of Tokens: 6 - Remember Special Tokens [101] and [102]
Tokens: tensor([[ 101, 1045, 2293, 2951, 2671,  102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1]])


- `101` = `[CLS]` token (added at the start of the sequence)
- `1045` = "i", `2293` = "love", `2951` = "data", `2671` = "science" (the uncased model lowercases the text first)
- `102` = `[SEP]` token (added at the end of the sequence)

### Embeddings

The model processes the tokenized input sequence and represents each token as a dense vector called an "embedding". Unlike the token IDs, these vectors are contextual: the vector of each token depends on the other tokens in the input sequence.



In [10]:
# Instantiate the model
model = AutoModel.from_pretrained("bert-base-uncased")

In [11]:
# Pass token IDs into the Transformer model
outputs = model(**tokens)

# Extract the last hidden state (one contextual embedding per token)
embeddings = outputs.last_hidden_state

print(embeddings.shape)  # (batch_size, sequence_length, hidden_size)

torch.Size([1, 6, 768])


Each of our 6 tokens gets a 768-dimensional vector that captures its meaning in context.
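
To make "in context" concrete, here is a minimal NumPy sketch, not BERT itself. It assumes random toy vectors for three hypothetical words and a single unparameterised scaled dot-product self-attention step (queries, keys and values are all the raw input, with no learned weights). The same static vector for "bank" ends up as two different output vectors depending on its neighbour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical static (context-free) vectors, 4 dimensions each
static = {word: rng.normal(size=4) for word in ["river", "bank", "money"]}


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x  # each output row mixes all input rows


seq_a = np.stack([static["river"], static["bank"]])
seq_b = np.stack([static["money"], static["bank"]])

bank_in_a = self_attention(seq_a)[1]
bank_in_b = self_attention(seq_b)[1]

print(np.allclose(seq_a[1], seq_b[1]))    # True: the input vector for "bank" is identical
print(np.allclose(bank_in_a, bank_in_b))  # False: the output vector depends on the neighbour
```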

## Sentence Transformers - AutoTokenizer

### Basic Usage

In [2]:
# Define the sentences
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk"
]

In [3]:
# Define tokenizer and the model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [4]:
# Create tokens
tokens = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

In [5]:
# Compute embeddings
embeddings = model(**tokens).last_hidden_state

In [6]:
print(embeddings.shape)

torch.Size([3, 9, 384])


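- The shape is `(number of sentences, number of tokens, embedding dimension)`
- The 9 tokens come from the longest sentence (7 word tokens plus `[CLS]` and `[SEP]`); shorter sentences are padded to that length
- Therefore, each token in the input sequence is represented by a 384-dimensional vector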

### Mean Pooling

Mean pooling averages the token embeddings of a sentence, so that each sentence is represented by a single embedding vector instead of one vector per token.
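
A compact way to write this down, using notation introduced here only for illustration: let $h_t$ be the embedding of the token at position $t$, $m_t \in \{0, 1\}$ its attention-mask value, and $T$ the padded sequence length. The pooled sentence vector is then

$$
\bar{h} = \frac{\sum_{t=1}^{T} m_t \, h_t}{\max\left(\sum_{t=1}^{T} m_t,\; 10^{-9}\right)}
$$

The $10^{-9}$ floor in the denominator corresponds to the `torch.clamp(..., min=1e-9)` call in the `mean_pooling` function of this section.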

In [7]:
def mean_pooling(model_output, attention_mask):
    """Average the token embeddings of each sentence, ignoring padding, to get one vector per sentence"""
    # Retrieve embeddings
    token_embeddings = model_output.last_hidden_state

    # Use the attention mask in order to not include the padding tokens into the mean pooling
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

**Code Analysis**:

```python
# Use the attention mask in order to not include the padding tokens into the mean pooling
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
```

- The `attention_mask` is produced by the tokenizer and tells the model which tokens are real and which are padding
- The goal is to apply the `attention_mask` to the `token_embeddings` so that padding tokens are excluded from the average. To do this, the `attention_mask` must have the same shape as `token_embeddings`, so that `token_embeddings * input_mask_expanded` can be computed later on
- The `attention_mask` has shape `(batch_size, sequence_length)`, where `1` indicates a real token and `0` a padding token
- `unsqueeze(-1)` adds a trailing dimension to `attention_mask` &rarr; `(batch_size, sequence_length, 1)`
- `expand()` broadcasts the mask along that trailing dimension so that its shape matches `token_embeddings` &rarr; `(batch_size, sequence_length, hidden_size)`

```python
torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
```

- Both `token_embeddings` and `input_mask_expanded` have the same shape &rarr; `(batch_size, sequence_length, hidden_size)`
- `token_embeddings * input_mask_expanded` &rarr; Masks out padding tokens by setting their embeddings to zero
- `torch.sum(..., 1)` &rarr; Sums along the `sequence_length` dimension (axis=1) &rarr; Computes the sum of the embeddings of the real tokens only, since padding positions now contribute zero
- `input_mask_expanded.sum(1)` &rarr; Sums the mask values along `sequence_length` &rarr; Computes the number of real tokens
- `torch.clamp(..., min=1e-9)` &rarr; If the whole sequence were padding, the count would be zero and the division would produce `nan` &rarr; Clamping prevents this by replacing zero with a very small number (`1e-9`)
- The final division of the summed embeddings by the number of real tokens is the actual **mean pooling of the embeddings** (the average over the real token embeddings only, with padding ignored)
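
The analysis above can be checked on a tiny hand-made batch. The sketch below restates the same arithmetic in NumPy rather than PyTorch (`[..., None]` plus `np.broadcast_to` stand in for `unsqueeze(-1).expand(...)`, and `np.clip` for `torch.clamp`). All `toy_*` values are invented for illustration; the padding position is given a large value on purpose, to show what a naive mean would get wrong.

```python
import numpy as np

# Toy batch: 2 sequences, 3 positions, hidden size 2
toy_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
toy_attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

# (2, 3) -> (2, 3, 1) -> (2, 3, 2), matching the embeddings' shape
toy_mask_expanded = np.broadcast_to(
    toy_attention_mask[..., None], toy_embeddings.shape
).astype(float)

summed = (toy_embeddings * toy_mask_expanded).sum(axis=1)    # sum over real tokens only
counts = np.clip(toy_mask_expanded.sum(axis=1), 1e-9, None)  # number of real tokens
pooled = summed / counts

naive = toy_embeddings.mean(axis=1)  # averages padding as well

print(pooled)    # [[2. 3.] [4. 2.]]
print(naive[0])  # about [34.67 35.33], distorted by the padding position
```

For the second sequence, which has no padding, the masked and the naive mean coincide.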

In [8]:
# Define the sentences
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk"
]

In [9]:
# Define tokenizer and the model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [10]:
# Create tokens
tokens = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# Compute model output
output = model(**tokens)

In [16]:
# Apply mean pooling
sentence_embeddings = mean_pooling(output, tokens.attention_mask)

In [12]:
print(sentence_embeddings.shape)

torch.Size([3, 384])


### Similarity

In [17]:
# Detach from the computation graph (no gradient tracking) and convert to NumPy
sentence_embeddings = sentence_embeddings.detach().numpy()

# Initialise the score matrix 3 x 3
scores = np.zeros((sentence_embeddings.shape[0], sentence_embeddings.shape[0]))

# Compute the scores
for index in range(sentence_embeddings.shape[0]):
    scores[index, :] = cosine_similarity([sentence_embeddings[index]], sentence_embeddings)[0]

In [18]:
scores

array([[1.        , 0.17021172, 0.82909292],
       [0.17021172, 1.00000012, 0.17396861],
       [0.82909292, 0.17396861, 1.        ]])

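- As expected, `I took my dog for a walk` and `I took my cat for a walk` are very similar to each other (score of about 0.83), while both are far from `Today is going to rain` (about 0.17)
- The diagonal is 1 because each sentence is compared with itself; the `1.00000012` entry is a float32 rounding artifact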
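
For reference, the cosine similarity of two vectors $a$ and $b$ is $a \cdot b / (\lVert a \rVert \, \lVert b \rVert)$. The sketch below computes the full pairwise matrix by hand in NumPy: normalize each row to unit length, then take a matrix product. The helper `cosine_similarity_matrix` and the `toy` vectors are hypothetical, written for this illustration; it is not scikit-learn's implementation.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T                    # dot products of unit vectors


toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_scores = cosine_similarity_matrix(toy)
print(np.round(toy_scores, 4))
# [[1.     0.     0.7071]
#  [0.     1.     0.7071]
#  [0.7071 0.7071 1.    ]]
```

The matrix is symmetric with ones on the diagonal, the same structure as the `scores` array above. Calling scikit-learn's `cosine_similarity` with a single array argument should also return the whole pairwise matrix in one call, which would make the explicit loop unnecessary.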

## Sentence Transformers - SDK