# Explore Transformers

We will use the `transformers` package from `HuggingFace`, which provides a unified interface to a variety of Transformer models.

In [1]:
from transformers import AutoModel, AutoTokenizer



Pretrained models can be downloaded directly from the HuggingFace repository.

We also need the tokenizer that matches the given model, since each model may tokenize text differently.

In [2]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Here we load the base BERT model in its `uncased` variant, i.e. the one in which all tokens are lowercased.

In [3]:
model = AutoModel.from_pretrained('bert-base-uncased') 

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
wps_ids = tokenizer.encode("Hypatia was a mathematician")
wps_ids

[101, 1044, 22571, 10450, 2050, 2001, 1037, 13235, 102]

In [5]:
wordpieces = tokenizer.convert_ids_to_tokens(wps_ids)

See what they are:

In [6]:
wordpieces

['[CLS]', 'h', '##yp', '##ati', '##a', 'was', 'a', 'mathematician', '[SEP]']
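
The `##` prefix marks a piece that continues the preceding one within the same word. As a rough illustration of how a rare word such as "Hypatia" ends up split this way, here is a minimal sketch of greedy longest-match-first splitting. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and not part of the `transformers` API; the real tokenizer also handles normalization, punctuation and the special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration), just large enough for this sentence.
toy_vocab = {"h", "##yp", "##ati", "##a", "was", "a", "mathematician"}
words = "Hypatia was a mathematician".lower().split()
pieces = [p for w in words for p in wordpiece_tokenize(w, toy_vocab)]
print(pieces)
```

With this toy vocabulary the sketch yields the same pieces as the cell above, minus `[CLS]` and `[SEP]`, which the tokenizer adds around the sentence.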

Convert the ids to a tensor, wrapping them in a list to add the batch dimension:

In [7]:
import torch
wps_tensor = torch.tensor([wps_ids])

In [8]:
outputs = model(wps_tensor, output_hidden_states=True, output_attentions=True)

In [9]:
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
hidden_states = outputs.hidden_states
attentions = outputs.attentions

`last_hidden_state` is the sequence of hidden-states at the output of the last layer of the model.

shape: `(batch_size, sequence_length, hidden_size)`

In [10]:
last_hidden_state[0]

tensor([[-0.1904, -0.2317, -0.4896,  ..., -0.4366,  0.5167,  0.5166],
        [ 0.7274,  0.4021, -0.4051,  ..., -0.9010,  1.4341,  0.2526],
        [-0.2553, -0.1450, -0.4257,  ..., -0.8199,  1.1043,  0.2408],
        ...,
        [ 0.1278,  0.1691, -0.8889,  ..., -0.6259,  0.1597,  0.8265],
        [-0.8941, -0.1419, -0.6008,  ..., -0.1861,  0.8240,  0.4556],
        [ 0.7600,  0.0256, -0.5482,  ...,  0.2768, -0.5651, -0.2150]],
       grad_fn=<SelectBackward>)

In [11]:
last_hidden_state[0].shape

torch.Size([9, 768])

`pooler_output` is the last-layer hidden state of the first token of the sequence (the **classification token**), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained on the next sentence prediction (classification) objective during pretraining.

shape: `(batch_size, hidden_size)`

In [12]:
len(pooler_output[0])

768
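
To make the pooling step concrete, the following is a minimal sketch of that computation in plain PyTorch. It assumes a randomly initialised `torch.nn.Linear` layer and a random stand-in tensor named `fake_last_hidden_state` (both hypothetical), so the numbers will not match the pretrained model; only the shapes and the sequence of operations are meant to carry over.

```python
import torch

torch.manual_seed(0)
hidden_size = 768

# Random stand-in for last_hidden_state: (batch_size, sequence_length, hidden_size).
fake_last_hidden_state = torch.randn(1, 9, hidden_size)

# Randomly initialised layer; in BERT these weights come from pretraining.
dense = torch.nn.Linear(hidden_size, hidden_size)

cls_vector = fake_last_hidden_state[:, 0]   # hidden state of the first token, [CLS]
pooled = torch.tanh(dense(cls_vector))      # shape: (batch_size, hidden_size)

print(pooled.shape)
```

Because of the Tanh, every component of the pooled vector lies between -1 and 1.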

`hidden_states` is a tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer).

shape of each layer: `(batch_size, sequence_length, hidden_size)`

In [13]:
len(hidden_states)

13

Look at the last layer, which has the same values as `last_hidden_state`:

In [14]:
hidden_states[-1][0]

tensor([[-0.1904, -0.2317, -0.4896,  ..., -0.4366,  0.5167,  0.5166],
        [ 0.7274,  0.4021, -0.4051,  ..., -0.9010,  1.4341,  0.2526],
        [-0.2553, -0.1450, -0.4257,  ..., -0.8199,  1.1043,  0.2408],
        ...,
        [ 0.1278,  0.1691, -0.8889,  ..., -0.6259,  0.1597,  0.8265],
        [-0.8941, -0.1419, -0.6008,  ..., -0.1861,  0.8240,  0.4556],
        [ 0.7600,  0.0256, -0.5482,  ...,  0.2768, -0.5651, -0.2150]],
       grad_fn=<SelectBackward>)

Look at the one for the first wordpiece (skipping `[CLS]`):

In [31]:
hidden_states[-1][0][1]

tensor([ 7.2743e-01,  4.0206e-01, -4.0513e-01,  3.0209e-01,  6.8818e-02,
         1.0866e-01, -1.2657e-01,  1.4593e+00, -4.8246e-01,  2.5372e-01,
         1.5767e-01, -6.2645e-01,  5.0674e-02,  2.6216e-01, -1.9618e-01,
        -3.6497e-01, -4.1162e-01, -3.6958e-01, -3.2668e-01, -6.8400e-02,
         6.4011e-01,  1.2157e-01, -5.9686e-02,  3.3221e-01,  1.6070e-01,
         1.2944e-01,  7.5560e-02,  6.3244e-01, -3.3727e-01,  1.7607e-01,
         1.6163e-02, -9.3616e-01,  2.2640e-02, -7.7795e-01,  1.1385e-01,
         6.0136e-01, -8.5157e-01,  2.3465e-01,  5.7237e-01,  1.1199e-01,
        -5.7691e-01, -4.6522e-01,  2.8443e-01, -2.1060e-01, -2.7226e-02,
         1.4974e-01, -5.0491e-01, -9.0769e-01,  3.3629e-02, -3.4374e-01,
        -7.1083e-01,  7.9934e-02, -1.1527e+00,  6.2862e-01,  5.6249e-01,
         1.5110e+00, -5.6712e-01, -6.9483e-01, -3.1478e-01,  7.7170e-02,
         1.3527e-01, -2.1106e-01, -3.6059e-01,  4.1817e-01,  8.0597e-01,
         5.8128e-01, -5.5423e-01,  6.9554e-01,  7.5

`attentions` is a tuple with one tensor per layer, holding the attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

shape: `(batch_size, num_heads, sequence_length, sequence_length)`

In [16]:
attentions[0].shape

torch.Size([1, 12, 9, 9])
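
As a rough sketch of where a tensor of this shape comes from, the snippet below computes scaled dot-product attention weights from random queries and keys. The tensors `queries` and `keys` are random stand-ins (an assumption for illustration), not the projections learned by BERT; the sizes mirror the 12 heads and 9 wordpieces seen above, with a per-head dimension of 64 (768 / 12).

```python
import math
import torch

torch.manual_seed(0)
batch_size, num_heads, seq_len, head_dim = 1, 12, 9, 64

# Random stand-ins for the per-head query and key projections.
queries = torch.randn(batch_size, num_heads, seq_len, head_dim)
keys = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Scaled dot products between every pair of positions, then a softmax per row.
scores = queries @ keys.transpose(-1, -2) / math.sqrt(head_dim)
weights = torch.softmax(scores, dim=-1)

print(weights.shape)               # (batch_size, num_heads, seq_len, seq_len)
print(weights[0, 0].sum(dim=-1))   # each row of a head's matrix sums to 1
```

In the model output, `attentions` is indexed first by layer, and each layer tensor is then indexed by batch element, head, query position and key position, in that order.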

Select the first (and only) batch element of the first layer, which leaves one 9 x 9 matrix per head:

In [17]:
attentions[0][0].shape

torch.Size([12, 9, 9])

Attention weights of the last head of the first layer:

In [18]:
attentions[0][0][-1].shape

torch.Size([9, 9])

In [20]:
attentions[0][0][-1]

tensor([[8.9986e-01, 8.9166e-03, 3.4554e-03, 8.5585e-04, 2.9731e-03, 4.7226e-03,
         3.5215e-02, 4.7056e-03, 3.9298e-02],
        [1.2391e-01, 8.4317e-02, 3.2262e-01, 1.5241e-01, 4.7131e-02, 6.6522e-02,
         5.5258e-02, 2.4437e-02, 1.2340e-01],
        [2.3238e-03, 9.2049e-01, 2.0366e-03, 4.4564e-02, 5.7512e-03, 9.3530e-04,
         1.4572e-02, 6.2075e-03, 3.1225e-03],
        [2.4987e-01, 2.4486e-02, 2.8669e-01, 1.5228e-02, 2.8864e-02, 4.3187e-02,
         5.5294e-02, 8.2267e-02, 2.1411e-01],
        [7.1093e-02, 8.7373e-02, 4.1576e-01, 2.2153e-01, 8.9381e-03, 1.1020e-02,
         3.6452e-02, 2.6599e-02, 1.2124e-01],
        [2.0167e-01, 1.2511e-01, 1.4100e-01, 4.7089e-02, 4.2112e-02, 1.3234e-01,
         9.5376e-02, 1.6339e-01, 5.1919e-02],
        [2.1234e-01, 1.6746e-02, 1.0519e-01, 1.7828e-02, 6.2450e-02, 2.4318e-01,
         5.8937e-02, 1.3822e-01, 1.4510e-01],
        [2.9630e-01, 3.4920e-02, 5.3029e-02, 1.3975e-01, 1.0845e-02, 1.8033e-02,
         9.1013e-02, 8.4441e-0

# Visualize model

In [22]:
from bertviz import model_view

In [23]:
def show_model_view(model, tokenizer, sentence_a, sentence_b=None, hide_delimiter_attn=False, display_mode="dark"):
    """Visualize the attention weights produced by model on sentence_a.
    If sentence_b is provided, the two sentences are concatenated with separator token in between.
    """
    # return_tensors='pt' means to return values as PyTorch tensors.
    inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
    input_ids = inputs['input_ids']
    if sentence_b:
        token_type_ids = inputs['token_type_ids'] # 0 for first sentence, 1 for second
        attention = model(input_ids, token_type_ids=token_type_ids, output_attentions=True).attentions
        sentence_b_start = token_type_ids[0].tolist().index(1)
    else:
        attention = model(input_ids, output_attentions=True).attentions
        sentence_b_start = None
    input_id_list = input_ids[0].tolist() # Batch index 0
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)  
    if hide_delimiter_attn:
        for i, t in enumerate(tokens):
            if t in ("[SEP]", "[CLS]"):
                for layer_attn in attention:
                    layer_attn[0, :, i, :] = 0
                    layer_attn[0, :, :, i] = 0
    model_view(attention, tokens, sentence_b_start, display_mode=display_mode)

In [24]:
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
show_model_view(model, tokenizer, sentence_a, sentence_b, hide_delimiter_attn=False, display_mode="light")

<IPython.core.display.Javascript object>

# Explore similarity

Consider the pooler output for two sentences and compute their cosine similarity:

In [32]:
def pooler_similarity(sentence_a, sentence_b):
    tokens_a = tokenizer.encode_plus(sentence_a, return_tensors='pt', add_special_tokens=True)
    tokens_b = tokenizer.encode_plus(sentence_b, return_tensors='pt', add_special_tokens=True)
    outputs_a = model(**tokens_a)
    outputs_b = model(**tokens_b)
    return torch.cosine_similarity(outputs_a.pooler_output[0], outputs_b.pooler_output[0], dim=0)

In [33]:
pooler_similarity(sentence_a, sentence_b)

tensor(0.9177, grad_fn=<DivBackward0>)

In [34]:
pooler_similarity(sentence_a, sentence_a)

tensor(1., grad_fn=<DivBackward0>)
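
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends only on their direction and equals 1 for identical vectors. A minimal plain-Python sketch follows; the helper `cosine_similarity` and the short example vectors are hypothetical and stand in for the 768-dimensional pooler outputs.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u, twice the length
w = [-3.0, 0.0, 1.0]   # orthogonal to u

print(cosine_similarity(u, u))
print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

The first two calls give 1 up to floating-point rounding, and the last gives 0, since `u` and `w` are orthogonal.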

In [35]:
pooler_similarity("Who is Boris Johnson?", "The British prime minister.")

tensor(0.9899, grad_fn=<DivBackward0>)

In [36]:
pooler_similarity("Who is Boris Johnson?", "I don't know")

tensor(0.5651, grad_fn=<DivBackward0>)

However, adding a period to the second sentence changes the score considerably:

In [37]:
pooler_similarity("Who is Boris Johnson?", "I don't know.")

tensor(0.9862, grad_fn=<DivBackward0>)