### Using Transformers

This notebook is adapted from Sebastian Raschka's excellent blog post [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html).

The Transformer architecture largely replaced RNNs for sequence modeling, mainly because of its self-attention mechanism. What is self-attention? It is a mechanism that lets a model capture not only the content of each element in an input sequence, but also the context in which that element appears. As Raschka states: "This is especially important for language processing tasks, where the meaning of a word can change based on its context within a sentence or document."

Or, as linguist John Firth said in 1957: "You shall know a word by the company it keeps."

While there are many variants of self-attention, this tutorial focuses on the original scaled dot-product attention mechanism (referred to here as self-attention). [Here is an overview of the scaled dot-product attention.](https://machinelearningmastery.com/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras/)

#### Embedding an Input Sentence

##### In

In [2]:
sentence = 'Life is short, eat dessert first.'

##### Out

In [3]:
# Create a dictionary comprehension of words and their indices in the sentence, sorted by word, and print it
dc = {word:index for index, word in enumerate(sorted(sentence.replace(',', '').split()))} 
print(dc)

{'Life': 0, 'dessert': 1, 'eat': 2, 'first.': 3, 'is': 4, 'short': 5}


Use the dictionary to assign an integer index to each word.

##### In

In [4]:
import torch

sentence_int = torch.tensor([dc[word] for word in sentence.replace(',', '').split()])



##### Out

In [5]:
print(sentence_int)

tensor([0, 4, 5, 2, 1, 3])


Now we create the embedding layer using the integer-vector representation of the sentence. We use a 16-dimensional embedding, which means each input word is represented by a 16-dim vector. 

In [6]:
torch.manual_seed(123) # set the random seed
embed = torch.nn.Embedding(6, 16) # 6 words in vocab, 16 dimensional embeddings
embedded_sentence = embed(sentence_int).detach() # returns a new tensor, detached from the computation graph

In [7]:
print(embedded_sentence)
print(embedded_sentence.shape)

tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
          0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692],
        [ 0.5146,  0.9938, -0.2587, -1.0826, -0.0444,  1.6236, -2.3229,  1.0878,
          0.6716,  0.6933, -0.9487, -0.0765, -0.1526,  0.1167,  0.4403, -1.4465],
        [ 0.2553, -0.5496,  1.0042,  0.8272, -0.3948,  0.4892, -0.2168, -1.7472,
         -1.6025, -1.0764,  0.9031, -0.7218, -0.5951, -0.7112,  0.6230, -1.3729],
        [-1.3250,  0.1784, -2.1338,  1.0524, -0.3885, -0.9343, -0.4991, -1.0867,
          0.8805,  1.5542,  0.6266, -0.1755,  0.0983, -0.0935,  0.2662, -0.5850],
        [-0.0770, -1.0205, -0.1690,  0.9178,  1.5810,  1.3010,  1.2753, -0.2010,
          0.4965, -1.5723,  0.9666, -1.1481, -1.1589,  0.3255, -0.6315, -2.8400],
        [ 0.8768,  1.6221, -1.4779,  1.1331, -1.2203,  1.3139,  1.0533,  0.1388,
          2.2473, -0.8036, -0.2808,  0.7697, -0.6596, -0.7979,  0.1838,  0.2293]])
torch.Size([6, 16])
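
An embedding layer can be read as a lookup table: each integer index selects one row of a learnable weight matrix. The following is a minimal sketch of that idea, reusing the sizes from the cell above; the variable names `looked_up` and `indexed` are introduced here for illustration only.

```python
import torch

torch.manual_seed(123)
embed = torch.nn.Embedding(6, 16)                # 6 tokens in the vocabulary, 16 dimensions each
sentence_int = torch.tensor([0, 4, 5, 2, 1, 3])  # indices of the example sentence

looked_up = embed(sentence_int).detach()         # forward pass through the layer
indexed = embed.weight[sentence_int].detach()    # plain row indexing into the weight matrix

print(embed.weight.shape)                        # one row per vocabulary entry
print(torch.equal(looked_up, indexed))           # both routes select the same rows
```

Because the rows are ordinary parameters, they would be updated during training like any other weight; here they stay at their random initial values.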


#### Defining the weight matrices

There are three weight matrices in self-attention (also known as scaled dot-product attention): W<sub>q</sub>, W<sub>k</sub>, and W<sub>v</sub>. As Raschka outlines, "these matrices serve to project the inputs into query, key, and value components of the sequence, respectively."

Q, K, and V are "obtained via matrix multiplication between weight matrices W and the embedded inputs **x**":

* Query sequence: **q**<sup>(i)</sup> = **W**<sub>q</sub>**x**<sup>(i)</sup> for *i* ∈ [1, *T*]
* Key sequence: **k**<sup>(i)</sup> = **W**<sub>k</sub>**x**<sup>(i)</sup> for *i* ∈ [1, *T*]
* Value sequence: **v**<sup>(i)</sup> = **W**<sub>v</sub>**x**<sup>(i)</sup> for *i* ∈ [1, *T*]

The index *i* refers to the token index position in the input sentence, which has length *T*.

Initialize the projection matrices:

##### In

In [8]:
torch.manual_seed(123) # set the random seed

d = embedded_sentence.shape[1] # dimension of each word vector
d_q, d_k, d_v = 24, 24, 28 # dimensions of query, key, and value vectors

# Projection matrices for queries, keys, and values. Each has d = 16 columns (one per embedding dimension) and d_q, d_k, or d_v rows.
W_query = torch.rand(d_q, d) # randomly initialize query weights
W_key = torch.rand(d_k, d) # randomly initialize key weights
W_value = torch.rand(d_v, d) # randomly initialize value weights

#### Computing the Unnormalized Attention Weights

Compute the query, key, and value vectors for the second input element. These are the ingredients for its attention (context) vector.

##### In

In [9]:
x_2 = embedded_sentence[1] # get the second word vector from the embedded sentence

query_2 = W_query.matmul(x_2) # calculate the query for the second word vector. The matmul() function performs matrix multiplication.
key_2 = W_key.matmul(x_2) # calculate the key for the second word vector
value_2 = W_value.matmul(x_2) # calculate the value for the second word vector

In [10]:
print(query_2.shape)
print(key_2.shape)
print(value_2.shape)

torch.Size([24])
torch.Size([24])
torch.Size([28])


Now, generalize this to compute the key and value vectors for all inputs. We will need them in the next steps for computing the unnormalized attention weights *w*:

##### In

In [11]:
keys = W_key.matmul(embedded_sentence.T).T # calculate the keys for all word vectors in the sentence
values = W_value.matmul(embedded_sentence.T).T # calculate the values for all word vectors in the sentence

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 24])
values.shape: torch.Size([6, 28])


Now we have all the required keys and values! We can compute *w*<sub>ij</sub> as the dot product between the query and key sequences, *w*<sub>ij</sub>=**q**<sup>(i)<sup>T</sup></sup>**k**<sup>(j)</sup>.

##### In

In [12]:
omega_2 = query_2.matmul(keys.T) # unnormalized attention weights of the second token against every key
print(omega_2)

tensor([ 8.5808, -7.6597,  3.2558,  1.0395, 11.1466, -0.4800])
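
The same dot products can be computed for every query at once, which gives a *T* × *T* matrix whose entry (i, j) is *w*<sub>ij</sub>. The sketch below assumes a random stand-in `x` for the embedded sentence and freshly drawn weight matrices, so its numbers will not match the output above; only the shapes and the row-by-row agreement are the point.

```python
import torch

torch.manual_seed(123)
T, d, d_q, d_k = 6, 16, 24, 24

x = torch.randn(T, d)            # stand-in for embedded_sentence (assumption)
W_query = torch.rand(d_q, d)
W_key = torch.rand(d_k, d)

queries = x @ W_query.T          # (T, d_q): one query per token
keys = x @ W_key.T               # (T, d_k): one key per token
omega = queries @ keys.T         # (T, T): omega[i, j] = q_i . k_j

# Row 1 of the matrix should agree with the single-token computation.
query_2 = W_query @ x[1]
omega_2 = query_2 @ keys.T

print(omega.shape)
print(torch.allclose(omega[1], omega_2, rtol=1e-4, atol=1e-3))
```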


#### Compute the attention scores

Here, we are going to use the softmax function to obtain the normalized attention weights. We do this by applying the softmax function to the unnormalized attention weights computed above.

What is the softmax function? From [deepai.org's Softmax Function article](https://deepai.org/machine-learning-glossary-and-terms/softmax-layer):

*The softmax function is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1.*
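
To make the definition concrete, here is a small standard-library sketch of softmax. The helper name `softmax` and the input values are chosen for illustration; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(values):
    """Map a list of real numbers to positive numbers that sum to 1."""
    m = max(values)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in values]   # exponentiate every shifted input
    total = sum(exps)
    return [e / total for e in exps]           # normalize so the outputs sum to 1

probs = softmax([2.0, -1.0, 0.0, 5.0])
print([round(p, 4) for p in probs])
print(sum(probs))
```

The largest input (5.0) receives the largest probability and the negative input the smallest, as the quoted description says.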

There is also a scaling step for **w** before the softmax function.

As Raschka writes: "The scaling by **d<sub>k</sub>** ensures that the Euclidean length of the weight vectors will be approximately in the same magnitude. This helps prevent the attention weights from becoming too small or too large, which could lead to numerical instability or affect the model’s ability to converge during training." In practice, **w** is divided by √d<sub>k</sub> (here √24) before the softmax is applied.
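
One common way to motivate the √d<sub>k</sub> factor is to look at the spread of the dot products: if the components of a query and a key are independent with unit variance, their dot product has variance of roughly d<sub>k</sub>, and dividing by √d<sub>k</sub> brings it back to roughly 1. The sketch below checks this empirically under that assumption (standard normal vectors, which is not how the matrices in this notebook are initialized); the `results` dictionary is introduced here for illustration.

```python
import torch

torch.manual_seed(0)
n = 10_000                                   # number of random query/key pairs
results = {}

for d_k in (4, 24, 256):
    q = torch.randn(n, d_k)                  # assumed: unit-variance query components
    k = torch.randn(n, d_k)                  # assumed: unit-variance key components
    dots = (q * k).sum(dim=1)                # one dot product per pair
    scaled = dots / d_k ** 0.5               # the scaling used in attention
    results[d_k] = (dots.var().item(), scaled.var().item())

for d_k, (raw_var, scaled_var) in results.items():
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:8.2f}  var(q.k / sqrt(d_k))={scaled_var:.2f}")
```

Without the scaling, larger d<sub>k</sub> would push the softmax inputs further apart and make the output closer to one-hot.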

##### In

In [13]:
import torch.nn.functional as F

attention_weights_2 = F.softmax(omega_2 / d_k**0.5, dim=0) # scale by sqrt(d_k), then normalize with softmax
print(attention_weights_2) # print the attention weights

tensor([0.2912, 0.0106, 0.0982, 0.0625, 0.4917, 0.0458])


The last step is to compute the context vector **z**<sup>(2)</sup>. This is an attention-weighted version of the original query input **x**<sup>(2)</sup>, including all the other input elements as its context via the attention weights:

##### In

In [14]:
context_vector_2 = attention_weights_2.matmul(values) # calculate the context vector for the second word vector

print(context_vector_2.shape)
print(context_vector_2)

torch.Size([28])
tensor([-1.5993,  0.0156,  1.2670,  0.0032, -0.6460, -1.1407, -0.4908, -1.4632,
         0.4747,  1.1926,  0.4506, -0.7110,  0.0602,  0.7125, -0.1628, -2.0184,
         0.3838, -2.1188, -0.8136, -1.5694,  0.7934, -0.2911, -1.3640, -0.2366,
        -0.9564, -0.5265,  0.0624,  1.7084])


The output vector has more dimensions (d<sub>v</sub> = 28) than the original input vector (d=16).
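
The per-token steps above can be collected into one function that produces context vectors for all *T* tokens at once. This is a minimal sketch, not Raschka's code: the function name `self_attention` is hypothetical, and the input `x` is a random stand-in for the embedded sentence.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_query, W_key, W_value):
    """Scaled dot-product self-attention for a (T, d) input."""
    queries = x @ W_query.T                    # (T, d_q)
    keys = x @ W_key.T                         # (T, d_k)
    values = x @ W_value.T                     # (T, d_v)
    d_k = keys.shape[-1]
    omega = queries @ keys.T                   # (T, T) unnormalized weights
    attention_weights = F.softmax(omega / d_k ** 0.5, dim=-1)  # each row sums to 1
    return attention_weights @ values, attention_weights

torch.manual_seed(123)
x = torch.randn(6, 16)                         # stand-in for embedded_sentence (assumption)
W_query = torch.rand(24, 16)
W_key = torch.rand(24, 16)
W_value = torch.rand(28, 16)

context, weights = self_attention(x, W_query, W_key, W_value)
print(context.shape)                           # one d_v-dimensional context vector per token
print(weights.sum(dim=-1))                     # row sums of the attention matrix
```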

#### Multi-Head Attention!

The scaled dot-product attention mechanism is used in the multi-head attention blocks. From the blog post:

![Multi-head attention modules](images/scaled-dot-product.png)


*"In the scaled dot-product attention, the input sequence was transformed using three matrices representing the query, key, and value. These three matrices can be considered as a single attention head in the context of multi-head attention."*
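
As a rough sketch of how several heads fit together, the example below stacks one (query, key, value) weight triple per head along a leading dimension and runs scaled dot-product attention for all heads in parallel. The number of heads `h = 3` and the random input are assumptions for illustration, and the final output projection used in the original Transformer is left out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
T, d, d_k, d_v, h = 6, 16, 24, 28, 3          # h = number of heads (assumed)

x = torch.randn(T, d)                          # stand-in for embedded_sentence (assumption)
W_query = torch.rand(h, d_k, d)                # one query matrix per head
W_key = torch.rand(h, d_k, d)                  # one key matrix per head
W_value = torch.rand(h, d_v, d)                # one value matrix per head

queries = torch.einsum('hkd,td->htk', W_query, x)   # (h, T, d_k)
keys = torch.einsum('hkd,td->htk', W_key, x)        # (h, T, d_k)
values = torch.einsum('hvd,td->htv', W_value, x)    # (h, T, d_v)

omega = queries @ keys.transpose(1, 2)         # (h, T, T) unnormalized weights per head
weights = F.softmax(omega / d_k ** 0.5, dim=-1)
heads = weights @ values                       # (h, T, d_v) context vectors per head

# Concatenate the heads for each token: (T, h * d_v)
multi_head = heads.permute(1, 0, 2).reshape(T, h * d_v)
print(heads.shape, multi_head.shape)
```

Each head has its own weights, so each can attend to different relationships between tokens.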

### Using Transformers - Part Two of the Hugging Face NLP Course

##### Standard approach when using the Pipeline function

In [15]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", 
     "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

##### Build an approach using Tokenizer, Model, and Post Processing

Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

* Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
* Mapping each token to an integer
* Adding additional inputs that may be useful to the model

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

In [16]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english' # the name of the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load the tokenizer

Transformer models only accept tensors as inputs, so we ask the tokenizer to return PyTorch tensors with `return_tensors="pt"`.

In [17]:
raw_inputs = ["I've been waiting for a HuggingFace course my whole life.",
              "I hate this so much!",]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") # tokenize the inputs

print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
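
The zeros at the end of the second row come from padding, and the attention mask marks which positions hold real tokens. The sketch below reproduces that bookkeeping in plain Python. `pad_batch` is a hypothetical helper, not the tokenizer's implementation, and the pad id of 0 is read off the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the tokenizer output above, without the padding.
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)
    print(mask)
```

The model uses the mask to ignore padded positions, so the shorter sentence is not affected by the extra zeros.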


##### Now the model!

In [18]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english' # the name of the checkpoint.
model = AutoModel.from_pretrained(checkpoint) # load the model

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
outputs = model(**inputs) # forward pass the inputs through the model
print(outputs.last_hidden_state.shape) # print the size of the outputs tensor

torch.Size([2, 16, 768])


For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

In [20]:
from transformers import AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english' # the name of the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint) # load the model
outputs = model(**inputs) # forward pass the inputs through the model

If we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [21]:
print(outputs.logits.shape) # print the size of the logits tensor

torch.Size([2, 2])


In [22]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


These are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [23]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) # pass the logits tensor through the softmax function
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
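
The remark about fusing the last activation with the loss can be illustrated with PyTorch's `F.cross_entropy`, which takes raw logits and applies the log-softmax internally. The sketch below compares it with an explicit softmax followed by a negative log-likelihood loss. The logits are copied from the output above, and the target labels are an assumption made for this example (1 = POSITIVE, 0 = NEGATIVE).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.5607, 1.6123],
                       [ 4.1692, -3.3464]])   # values copied from the output above
labels = torch.tensor([1, 0])                 # assumed targets: POSITIVE, NEGATIVE

# Fused version: softmax and negative log-likelihood in one call on the logits.
fused = F.cross_entropy(logits, labels)

# Two-step version: explicit softmax, then log, then negative log-likelihood.
two_step = F.nll_loss(torch.log(F.softmax(logits, dim=-1)), labels)

print(fused.item(), two_step.item())
```

Both routes give the same loss up to floating-point error, which is why a model can return logits and leave the softmax to the loss function during training.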


In [24]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
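
Putting the last two steps together gives the post-processing that turns logits into labelled predictions. This is a sketch of the idea rather than the pipeline's actual code: the logits and the label mapping are copied from the outputs above, and `results` is a name introduced here.

```python
import torch

id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}     # mapping shown above
logits = torch.tensor([[-1.5607, 1.6123],
                       [ 4.1692, -3.3464]])   # logits shown above

probs = torch.softmax(logits, dim=-1)         # convert logits to probabilities
scores, indices = probs.max(dim=-1)           # best score and its class index per sentence

results = [
    {'label': id2label[i.item()], 'score': s.item()}
    for i, s in zip(indices, scores)
]
print(results)
```

The labels and scores should line up with what the `pipeline` call returned at the start of this section.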

### Try It Out!

In [25]:
from transformers import pipeline

# Create a classifier pipeline
classifier = pipeline('sentiment-analysis')
text = ["The Philadelpha Eagles won the Superbowl.", 
             "The Philadelpha Eagles lost the Superbowl.",]

# Classify two sample sentences
classifier(text)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.999765932559967},
 {'label': 'NEGATIVE', 'score': 0.9984251260757446}]

#### Manual steps: Tokenizer, Model, Post Processing

In [26]:
# Set up the tokenizer

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english' # the name of the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load the tokenizer

In [28]:
# Create the inputs and tokenization

raw_inputs = ["The Philadelpha Eagles won the Superbowl.",
                "The Philadelpha Eagles lost the Superbowl.",]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") # tokenize the inputs
print(inputs)


{'input_ids': tensor([[  101,  1996,  6316,  9648, 14277,  3270,  8125,  2180,  1996, 21688,
          5004,  2140,  1012,   102],
        [  101,  1996,  6316,  9648, 14277,  3270,  8125,  2439,  1996, 21688,
          5004,  2140,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [30]:
# Set up the model

from transformers import AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english' # the name of the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint) # load the model
outputs = model(**inputs) # forward pass the inputs through the model

In [31]:
# print the logits
print(outputs.logits)

tensor([[-4.0651,  4.2943],
        [ 3.5058, -2.9462]], grad_fn=<AddmmBackward0>)


In [32]:
# print the softmax of the logits
print(torch.nn.functional.softmax(outputs.logits, dim=-1))

tensor([[2.3413e-04, 9.9977e-01],
        [9.9843e-01, 1.5749e-03]], grad_fn=<SoftmaxBackward0>)


In [33]:
# make the predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) # pass the logits tensor through the softmax function
print(predictions)

tensor([[2.3413e-04, 9.9977e-01],
        [9.9843e-01, 1.5749e-03]], grad_fn=<SoftmaxBackward0>)


In [34]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

### Diving into Models

The AutoModel class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It’s a clever wrapper as it can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture.

In [35]:
# Build a BERT model from a default config: the weights are randomly initialized, not pretrained

from transformers import BertConfig, BertModel

# Build the config
config = BertConfig() # initialize a configuration object. 

# Build the model from the config
model = BertModel(config) # initialize a model from the configuration

In [36]:
# Check the config
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
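
The configuration values determine how large the model is. As a back-of-the-envelope sketch, the function below estimates the parameter count from the values printed above. It assumes the standard BERT layout (three embedding tables with a LayerNorm, four attention projections and two feed-forward layers per encoder block, each with biases and LayerNorms, plus a pooler); `count_bert_parameters` is a hypothetical helper, not part of the `transformers` library.

```python
config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
}

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def count_bert_parameters(cfg):
    h = cfg["hidden_size"]
    layer_norm = 2 * h                                   # scale and shift vectors
    embeddings = (
        cfg["vocab_size"] * h                            # token embeddings
        + cfg["max_position_embeddings"] * h             # position embeddings
        + cfg["type_vocab_size"] * h                     # segment embeddings
        + layer_norm
    )
    attention = 4 * linear(h, h) + layer_norm            # query, key, value, output projections
    feed_forward = (
        linear(h, cfg["intermediate_size"])
        + linear(cfg["intermediate_size"], h)
        + layer_norm
    )
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = linear(h, h)
    return embeddings + encoder + pooler

total = count_bert_parameters(config)
print(f"{total:,} parameters")
```

Under these assumptions the default config comes out at about 109.5 million parameters, in line with the roughly 110M usually quoted for BERT-base. Most of them sit in the 12 encoder layers.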



In [37]:
# Load a BERT model with pretrained weights instead of random ones
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased') # Use the pretrained transformer model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
