# Week 8: Using Transformer Models


## Getting started
If working on your own machine, make sure the Hugging Face `transformers` package is installed:

`conda install -c huggingface transformers`

or

`pip install transformers`

Of course, if working on Google Colab, you won't need to do this.  Whatever environment you are using, check that the following code runs.  It should output a `NEGATIVE` label with a high score!


In [1]:
from transformers import pipeline
print(pipeline('sentiment-analysis')('I hate you'))

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


The following is adapted from the Hugging Face Transformers quickstart tutorial: https://huggingface.co/transformers/quickstart.html
We will be looking at the BERT introduction (but feel free to have a look at GPT-2 etc. as well!)

First of all we need some key imports.  We are going to be using the pre-trained bert-base-uncased model, so this cell instantiates a tokenizer for this model.  Logging is also switched on so we can see more of what's going on.  The first time you run it, the tokenizer vocabulary will be downloaded and cached.  The cached version will be used on subsequent runs, if it is available (it will not be in a fresh Google Colab session).

In [2]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



Now we are going to tokenize some text.  This will demonstrate the 'wordpiece' vocabulary used by BERT as well as the fact that we need to introduce special `[CLS]` and `[SEP]` tokens in the input.

In [3]:
# Tokenize input
text = "[CLS] Who was elected as British prime minister in 1951? [SEP] Sir Winston Leonard Spencer Churchill was a British politician, statesman, army officer and writer, who was Prime Minister of the United Kingdom from 1940 to 1945 and again from 1951 to 1955. [SEP]"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'who', 'was', 'elected', 'as', 'british', 'prime', 'minister', 'in', '1951', '?', '[SEP]', 'sir', 'winston', 'leonard', 'spencer', 'churchill', 'was', 'a', 'british', 'politician', ',', 'statesman', ',', 'army', 'officer', 'and', 'writer', ',', 'who', 'was', 'prime', 'minister', 'of', 'the', 'united', 'kingdom', 'from', '1940', 'to', '1945', 'and', 'again', 'from', '1951', 'to', '1955', '.', '[SEP]']


In [4]:
# Tokenize input
text = "[CLS] What are igneous rocks? [SEP] Igneous rocks form when hot , molten rock crystallizes and solidifies. [SEP] "
tokenized_text= tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


Note that the tokenizer is not breaking down all words according to their morphology -- only rare words.  Reasonably frequent words such as `elected` are left as whole words.  Rarer words such as `solidifies` are broken down.
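
To make the splitting behaviour concrete, here is a minimal sketch of greedy longest-match-first splitting, which is how WordPiece vocabularies are applied to a word at tokenization time.  The function `wordpiece` and the tiny `toy_vocab` are made up for illustration; the real bert-base-uncased vocabulary has around 30,000 entries and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary: frequent words are stored whole, rare words only as pieces.
toy_vocab = {"elected", "solid", "##ifies", "ign", "##eous", "crystal", "##li", "##zes"}

print(wordpiece("elected", toy_vocab))
print(wordpiece("solidifies", toy_vocab))
print(wordpiece("crystallizes", toy_vocab))
```

Because `elected` is in the vocabulary as a whole word it is returned unchanged, whereas `solidifies` is only covered by the pieces `solid` and `##ifies`.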

Now we are going to mask out one of the words in the text.  For the purposes of this demonstration, I have chosen the token at index 11 (`form`, counting from 0) but you could try different tokens.  Remember that during training the tokens to mask are chosen randomly.


In [5]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 11
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', '[MASK]', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']
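
For comparison with the hand-picked index above, here is a rough sketch of random masking.  In the original BERT pre-training setup, 15% of the tokens are selected for prediction; of those, 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged.  The sketch below is a simplification that always substitutes `[MASK]`, and the helper name `random_mask` is made up for this example.

```python
import random


def random_mask(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of non-special tokens with [MASK] (simplified)."""
    rng = random.Random(seed)
    # Never mask the special tokens.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    n_to_mask = max(1, round(mask_rate * len(candidates)))
    positions = sorted(rng.sample(candidates, n_to_mask))
    masked = list(tokens)  # copy, so the original list is left untouched
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions


tokens = ['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]',
          'ign', '##eous', 'rocks', 'form', 'when', 'hot', '[SEP]']
masked, positions = random_mask(tokens)
print(positions)
print(masked)
```

Changing `seed` gives a different choice of positions, which is a quick way to generate several masked versions of the same sentence.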


In [6]:
print(len(tokenized_text))

25


We are now going to try to use the masked language model to predict this word.

First we need to convert the input into a list of word index ids.

In [7]:
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(indexed_tokens)

[101, 2054, 2024, 16270, 14769, 5749, 1029, 102, 16270, 14769, 5749, 103, 2043, 2980, 1010, 23548, 2600, 6121, 3669, 11254, 1998, 5024, 14144, 1012, 102]


We need segment ids to define whether a token is in the first or second sentence.

In [8]:
def make_segment_ids(list_of_tokens):
    #this function assumes that up to and including the first '[SEP]' is the first segment, anything afterwards is the second segment
    current_id=0
    segment_ids=[]
    for token in list_of_tokens:
        segment_ids.append(current_id)
        if token == '[SEP]':
            current_id=1
    return segment_ids

segment_ids=make_segment_ids(tokenized_text)
print(segment_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [9]:
# Convert inputs to PyTorch tensors
# This just wraps things up in multi-dimensional tensors rather than flat lists.
# The extra [ ] adds a batch dimension, so each tensor has shape (1, sequence length).
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segment_ids])
print(tokens_tensor)
print(segments_tensors)

tensor([[  101,  2054,  2024, 16270, 14769,  5749,  1029,   102, 16270, 14769,
          5749,   103,  2043,  2980,  1010, 23548,  2600,  6121,  3669, 11254,
          1998,  5024, 14144,  1012,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])


Now we need to encode the input using the bert-base-uncased model


In [10]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, uncomment the next three lines to put everything on cuda - otherwise leave them commented out to run on CPU
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Recent versions of transformers return a ModelOutput object, which can also be indexed like a tuple.
    # See the models docstrings for the detail of all the outputs
    # In our case, outputs[0] is the output of the last layer of the BERT model (one vector per token).
    # outputs[1] is a "pooled_output" representation of the [CLS] token only (rather than all tokens).
    # It involves an extra layer, which is why it is not the same as outputs[0][0, 0]!
    encoded_layers = outputs[0]


In [11]:
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
print(encoded_layers.shape)

torch.Size([1, 25, 768])


In [12]:
encoded_layers

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [13]:
#outputs[1] is a representation of the CLS token of shape (batch size, model hidden dimension)
outputs[1].shape

torch.Size([1, 768])

In [14]:
outputs[1]

tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
         -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
         -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
          0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
         -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
         -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
         -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.9274,  0.9106, -0.8983,
          0.6171,  0.6578, -0.7522, -0.9329, -0.8695,  0.8373, -0.9920, -0.9770,
          0.9985,  0.9901, -0.7740, -0.8020, -0.7395,  0.4467,  0.9907,  0.7470,
         -0.6924, -0.9460,  0.9862,  0.7678, -0.8415,  1.0000, -0.9650, -0.9968,
          0.9861,  0.9893,  0.8187, -0.9405,  0.9565, -1.0000,  0.9414, -0.6060,
         -0.9974,  0.7915,  0.9118, -0.7666,  0.9441,  0.8653, -0.8785, -0.9049,
         -0.8647, -0.9939, -
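
Notice that every value in `outputs[1]` lies between -1 and 1.  This is because, in the Hugging Face BERT implementation, the pooler passes the last-layer `[CLS]` vector through a dense layer followed by a tanh.  The sketch below mimics that computation with plain lists; the 3-dimensional vector, weights and bias are invented numbers, not values from the model.

```python
import math


def pooled_output(cls_hidden, weight, bias):
    """Compute tanh(W h + b) for one [CLS] hidden vector, using plain lists."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy 3-dimensional stand-ins; bert-base-uncased uses 768 dimensions.
cls_hidden = [-0.67, 0.95, -0.50]
weight = [[2.0, -1.0, 0.5],
          [0.0, 3.0, 1.0],
          [-1.5, 0.5, 2.0]]
bias = [0.1, -0.2, 0.0]

pooled = pooled_output(cls_hidden, weight, bias)
print(pooled)
```

The result is a different vector from `cls_hidden`, squashed into the range (-1, 1), which matches what we see when comparing `outputs[0][0,0]` with `outputs[1]`.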

We can also predict the masked token as follows.  `BertForMaskedLM` puts a language-modelling head on top of the last layer of the BERT model, so the output contains a score for every vocabulary item at every position.  We then find the token id which maximises the score at the position of the masked token.

In [15]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, uncomment the next three lines to put everything on cuda
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

        
# find the token id which maximises the prediction for the masked token and then convert this back to a word
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


form
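
Taking the argmax only shows the single best candidate.  It can be more informative to look at the top few candidates and their probabilities.  Here is a small sketch of that idea on made-up scores: the five-word vocabulary and the logits are hypothetical, and `top_k_tokens` is a helper written for this example.

```python
import math


def top_k_tokens(logits, id_to_token, k=3):
    """Return the k highest-scoring tokens with their softmax probabilities."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Hypothetical 5-word vocabulary and scores for the masked position.
id_to_token = ["form", "melt", "occur", "rock", "the"]
logits = [6.1, 3.2, 4.0, 0.5, -1.0]

for token, prob in top_k_tokens(logits, id_to_token):
    print(f"{token}\t{prob:.3f}")
```

With the real model, something like `torch.topk(predictions[0, masked_index], k=5)` should give the top scores and their token ids directly, which can then be passed to `tokenizer.convert_ids_to_tokens`.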


### Exercise 0
Mask each token in turn and see what BERT predicts.   How accurate are its predictions?  As an extension, you could look at masking multiple words in the sequence.

In [16]:
# Tokenize input
text = "[CLS] What are igneous rocks? [SEP] Igneous rocks form when hot , molten rock crystallizes and solidifies. [SEP] "
tokenized_text= tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [17]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
def get_masked_tokens(text, masked_index):
    tokenized_text = tokenizer.tokenize(text)
    tokenized_text[masked_index] = '[MASK]'
    return tokenized_text

In [18]:
masked_1 = get_masked_tokens(text, 1)
print(masked_1)

['[CLS]', '[MASK]', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [19]:
def get_tensors(tokens):
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids=make_segment_ids(tokens)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segment_ids])
    return tokens_tensor, segments_tensors

In [20]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

def predict_mask(model, tokens, index):
    tokens_tensor, segments_tensors = get_tensors(tokens)
    # Predict all tokens
    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensors)
        predictions = outputs[0]
    # find the token id which maximises the prediction for the masked token and then convert this back to a word
    predicted_index = torch.argmax(predictions[0, index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    return predicted_token


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
masked_1 = get_masked_tokens(text, 1)
print(masked_1)

['[CLS]', '[MASK]', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [22]:
predicted_1 = predict_mask(model, masked_1, 1)
print(predicted_1)

what


In [23]:
masked_2 = get_masked_tokens(text, 2)
print(masked_2)

['[CLS]', 'what', '[MASK]', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [24]:
predicted_2 = predict_mask(model, masked_2, 2)
print(predicted_2)

about


In [25]:
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [26]:
for i in range(1, len(tokenized_text)):
    masked_tokens = get_masked_tokens(text, i)
    predicted = predict_mask(model, masked_tokens, i)
    print(tokenized_text[i], predicted)

what what
are about
ign ign
##eous ##eous
rocks rocks
? ?
[SEP] "
ign ign
##eous ##eous
rocks rocks
form form
when when
hot solid
, ,
molten hard
rock rock
crystal crystal
##li ##li
##zes ##zes
and and
solid solid
##ifies ##ifies
. .
[SEP] "
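
To put a number on "how accurate", we can count how often the prediction matches the original token.  The pairs below are copied from the output above; the helper `accuracy` is just for this example.  Since `[SEP]` is a special token rather than a word, accuracy is reported both with and without it.

```python
# (original token, predicted token) pairs copied from the output above
results = [
    ("what", "what"), ("are", "about"), ("ign", "ign"), ("##eous", "##eous"),
    ("rocks", "rocks"), ("?", "?"), ("[SEP]", '"'), ("ign", "ign"),
    ("##eous", "##eous"), ("rocks", "rocks"), ("form", "form"), ("when", "when"),
    ("hot", "solid"), (",", ","), ("molten", "hard"), ("rock", "rock"),
    ("crystal", "crystal"), ("##li", "##li"), ("##zes", "##zes"), ("and", "and"),
    ("solid", "solid"), ("##ifies", "##ifies"), (".", "."), ("[SEP]", '"'),
]


def accuracy(pairs):
    """Fraction of pairs where the prediction equals the original token."""
    return sum(orig == pred for orig, pred in pairs) / len(pairs)


no_special = [(o, p) for o, p in results if o != "[SEP]"]
print(f"all tokens:      {accuracy(results):.3f}")
print(f"excluding [SEP]: {accuracy(no_special):.3f}")
```

Bear in mind that this sentence is an easy case: each masked word piece usually appears elsewhere in the input (for example `ign` and `##eous` occur in both segments), so BERT can often copy the answer from the context.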


## Representing Sentential Meaning
We are going to be looking at different strategies for representing sentential meaning:
* CLS token representation
* centroid/sum of output embeddings

The file `examples.txt` contains some example sentences.

### Exercise 1
Read in the sentences and store them as a list of sentences.  Add `[CLS]` and `[SEP]` tokens to the beginning and end of each and then pass them through the bert-base-uncased tokenizer.

In [27]:
sentences = []
with open('examples.txt') as f:
    sentences = ['[CLS] ' +line.rstrip()+ ' [SEP]'  for line in f]
print(sentences)

['[CLS] The boy kicks the ball. [SEP]', '[CLS] The ball kicks the boy. [SEP]', '[CLS] The child kicks the ball. [SEP]', '[CLS] The ball is kicked by the boy. [SEP]', '[CLS] The ball is kicked. [SEP]', '[CLS] The boy kicks. [SEP]', '[CLS] The child kicks. [SEP]', '[CLS] The boy kicks a round object. [SEP]', '[CLS] The male child kicks the ball. [SEP]', '[CLS] The boy is playing football. [SEP]', '[CLS] The boy hits the ball. [SEP]', '[CLS] The ball hits the boy. [SEP]', '[CLS] The boy is hit by the ball. [SEP]', '[CLS] The ball is hit by the boy. [SEP]', '[CLS] The female child kicks the ball. [SEP]', '[CLS] The girl kicks the ball. [SEP]', '[CLS] The child plays with dolls. [SEP]', '[CLS] The female child plays with dolls. [SEP]', '[CLS] The male child plays with dolls. [SEP]', '[CLS] The girl plays with dolls. [SEP]', '[CLS] The boy plays with dolls. [SEP]', '[CLS] The boy is kicking the ball. [SEP]', '[CLS] The boy is not kicking the ball. [SEP]', '[CLS] All boys kick balls. [SEP]', 

In [28]:
sentences_tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
print(sentences_tokens)

[['[CLS]', 'the', 'boy', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'kicks', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'kicked', 'by', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'kicked', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'kicks', '.', '[SEP]'], ['[CLS]', 'the', 'child', 'kicks', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'kicks', 'a', 'round', 'object', '.', '[SEP]'], ['[CLS]', 'the', 'male', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'is', 'playing', 'football', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'hits', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'hits', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'is', 'hit', 'by', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'hit', 'by', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'female', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'gi

In [29]:
sentences_indexed_tokens = [tokenizer.convert_tokens_to_ids(tokens) for tokens in sentences_tokens]
sentences_segment_ids= [make_segment_ids(tokens) for tokens in sentences_tokens]
sentences_tokens_tensor = [torch.tensor([indexed_tokens]) for indexed_tokens in sentences_indexed_tokens]
sentences_segments_tensors = [torch.tensor([segment_ids]) for segment_ids in sentences_segment_ids]

When encoding sentences, it is actually more typical to pool the hidden states of one of the intermediate layers (at depth n) rather than the output layer.  We can access the hidden states of every layer of the model by passing `output_hidden_states=True`.

In [30]:
model = BertModel.from_pretrained('bert-base-uncased')


model.eval()

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors,output_hidden_states=True)
   
    
    

In [31]:
outputs.to_tuple()

(tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
          [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
          [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
          ...,
          [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
          [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
          [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]]),
 tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
          -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
          -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
           0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
          -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
          -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
          -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.

In [32]:
print(len(outputs))
for i in range(len(outputs)):
    try:
        print(outputs[i].shape)
    except AttributeError:
        # outputs[i] is a tuple of tensors (no .shape), so print its length instead
        print(len(outputs[i]))

3
torch.Size([1, 25, 768])
torch.Size([1, 768])
13


Here:
* outputs[0] contains the output representation of each token
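* outputs[1] is a representation of the first token (after being put through an additional layer)
* outputs[2] is a tuple of 13 tensors: the output of the embedding layer followed by the hidden states of each of the 12 transformer layers.  Each element is the hidden layer at depth n.  If we want the last layer then we need outputs[2][-1]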


In [33]:
#outputs[2][-1] is the last hidden layer also output as outputs[0]
outputs[2][-1]

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [34]:
#so if you want the penultimate hidden layer you need outputs[2][-2]
outputs[2][-2]

tensor([[[-3.4760e-01,  6.1722e-01, -6.2986e-01,  ..., -5.5267e-02,
          -4.7414e-01,  1.1962e+00],
         [ 2.4667e-01, -1.1385e-01, -3.2224e-01,  ...,  3.9890e-01,
          -4.0355e-01, -1.5206e+00],
         [-1.5780e-01,  2.0670e-01, -3.2949e-01,  ..., -7.0685e-01,
          -8.9301e-02,  1.3601e+00],
         ...,
         [ 3.1460e-01,  4.2031e-01,  1.2168e-01,  ..., -4.4048e-01,
          -5.6069e-01, -1.8098e-01],
         [ 4.6541e-02,  1.3745e-02, -3.7530e-02,  ...,  1.9264e-02,
          -1.4171e-02, -4.5845e-03],
         [ 4.5347e-02,  8.7523e-03, -3.7831e-02,  ...,  1.8386e-02,
          -1.5910e-02, -1.1673e-03]]])

In [51]:
outputs[0]

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [52]:
outputs[0][0]

tensor([[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
        [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
        [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
        ...,
        [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
        [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
        [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]])

In [39]:
outputs[0][0,0]

tensor([-6.6741e-01,  9.4580e-01, -5.0362e-01,  7.6330e-02, -1.2090e+00,
         1.2795e-01,  9.3474e-01,  1.1574e+00, -1.9790e-01,  9.3847e-02,
        -7.1620e-01, -4.7296e-01, -2.1994e-01,  1.0530e+00,  3.7979e-01,
        -1.6461e-01, -2.0667e-01,  1.0136e+00,  3.5675e-01, -2.8986e-02,
         1.4404e-01, -2.0457e-01, -1.0594e-01,  9.7713e-03,  9.4189e-03,
        -5.7983e-01, -1.4214e-01, -4.3860e-01,  6.5073e-01, -4.3919e-01,
        -2.3003e-01,  1.1294e+00, -5.0243e-01, -1.7202e-01,  3.5664e-01,
        -6.0078e-02,  1.5558e-01, -5.6304e-02,  3.3006e-01,  1.5681e-01,
        -1.7278e-01,  1.3906e-01,  5.0966e-01,  3.3143e-01, -1.8095e-01,
        -3.5913e-01, -1.8010e+00, -2.8397e-01, -3.3440e-01, -3.4394e-01,
         1.7959e-01, -6.9788e-02,  8.8394e-01,  4.0845e-01, -3.3890e-01,
         1.4026e+00, -9.4726e-01,  3.8425e-01,  1.7327e-01,  9.4517e-01,
        -6.2579e-02, -7.5702e-02, -6.6680e-01, -4.6924e-01,  4.4900e-01,
         9.2037e-01, -2.3056e-01,  8.5764e-01, -8.6

In [40]:
outputs[1]

tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
         -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
         -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
          0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
         -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
         -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
         -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.9274,  0.9106, -0.8983,
          0.6171,  0.6578, -0.7522, -0.9329, -0.8695,  0.8373, -0.9920, -0.9770,
          0.9985,  0.9901, -0.7740, -0.8020, -0.7395,  0.4467,  0.9907,  0.7470,
         -0.6924, -0.9460,  0.9862,  0.7678, -0.8415,  1.0000, -0.9650, -0.9968,
          0.9861,  0.9893,  0.8187, -0.9405,  0.9565, -1.0000,  0.9414, -0.6060,
         -0.9974,  0.7915,  0.9118, -0.7666,  0.9441,  0.8653, -0.8785, -0.9049,
         -0.8647, -0.9939, -

### Exercise 2
* Encode each sentence using the output representation for its CLS token - note that you do not need to mask the CLS token.  We are just interested in the output layer embedding for this token.  You can use outputs[0][0,0] or outputs[1] as a representation of the CLS token - but you will get different results as outputs[1] has gone through an additional layer (trained for next sentence prediction during pre-training, and for classification IF the model has been fine-tuned).
* Use cosine similarity to compute the similarity of every pair of sentences.
* Identify the 10 most similar pairs of sentences using this sentence encoding.
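
As a reminder of what cosine similarity measures, here is a minimal pure-Python version: the dot product of the two vectors divided by the product of their lengths.  It is only a sketch for intuition (it does not guard against zero vectors); the helper name `cosine_similarity` is ours.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The value depends only on the angle between the vectors, not on their lengths, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).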

In [41]:
encoded_layers.size()

torch.Size([1, 25, 768])

In [117]:
## this is a handy way of finding the cosine similarity between two tensors
# see https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

#you use this as:
print(encoded_layers[0,0],encoded_layers[0,1])
output=cos(encoded_layers[0,0],encoded_layers[0,1])
print(output.item())

tensor([-6.6741e-01,  9.4580e-01, -5.0362e-01,  7.6330e-02, -1.2090e+00,
         1.2795e-01,  9.3474e-01,  1.1574e+00, -1.9790e-01,  9.3847e-02,
        -7.1620e-01, -4.7296e-01, -2.1994e-01,  1.0530e+00,  3.7979e-01,
        -1.6461e-01, -2.0667e-01,  1.0136e+00,  3.5675e-01, -2.8986e-02,
         1.4404e-01, -2.0457e-01, -1.0594e-01,  9.7713e-03,  9.4189e-03,
        -5.7983e-01, -1.4214e-01, -4.3860e-01,  6.5073e-01, -4.3919e-01,
        -2.3003e-01,  1.1294e+00, -5.0243e-01, -1.7202e-01,  3.5664e-01,
        -6.0078e-02,  1.5558e-01, -5.6304e-02,  3.3006e-01,  1.5681e-01,
        -1.7278e-01,  1.3906e-01,  5.0966e-01,  3.3143e-01, -1.8095e-01,
        -3.5913e-01, -1.8010e+00, -2.8397e-01, -3.3440e-01, -3.4394e-01,
         1.7959e-01, -6.9788e-02,  8.8394e-01,  4.0845e-01, -3.3890e-01,
         1.4026e+00, -9.4726e-01,  3.8425e-01,  1.7327e-01,  9.4517e-01,
        -6.2579e-02, -7.5702e-02, -6.6680e-01, -4.6924e-01,  4.4900e-01,
         9.2037e-01, -2.3056e-01,  8.5764e-01, -8.6

In [118]:
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    sentences_outputs = [model(tokens_tensor, token_type_ids=segments_tensors)
                         for (tokens_tensor, segments_tensors) in zip(sentences_tokens_tensor, sentences_segments_tensors)]
    sentences_cls = [output[0][0,0] for output in sentences_outputs]

In [119]:
len(sentences_cls)
sentences_cls[0]

tensor([ 1.1590e-01,  6.5324e-01, -2.0935e-01, -2.5611e-02, -3.8921e-01,
        -6.2618e-01,  3.0078e-01,  5.4555e-01, -2.0464e-03, -6.0749e-01,
        -4.2318e-01, -1.9440e-01,  4.5724e-01,  7.6823e-02,  3.3098e-01,
        -5.0582e-01,  1.5183e-01,  1.0898e-01, -1.2321e-01, -5.4610e-01,
        -5.0400e-02, -5.3258e-01, -3.2013e-01,  2.9357e-01,  4.9870e-01,
         5.0016e-02,  1.4715e-01, -1.0674e-01, -2.2981e-01, -1.5368e-01,
         2.7195e-01,  1.0101e+00, -1.6937e-01, -1.3404e-01,  8.9786e-02,
        -5.1196e-01,  2.3494e-01, -9.3254e-02,  4.7672e-01,  1.4907e-01,
        -3.0858e-01,  1.5049e-01, -8.5428e-02, -4.1467e-01,  6.8358e-01,
        -9.3506e-01, -2.8060e+00, -3.7640e-01, -4.8633e-01, -4.4476e-01,
         1.5088e-01,  1.3590e-02,  1.5590e-02, -2.3847e-02, -6.3058e-01,
        -3.2194e-02, -1.7782e-01,  4.7192e-01,  5.1622e-03,  7.3014e-02,
        -1.0110e-01,  4.8777e-01, -2.3079e-01, -1.0471e-01,  1.7462e-01,
        -2.1536e-01, -5.5140e-01,  2.9230e-01,  4.2

In [120]:
cos_sims = [[cos(cls_a, cls_b).item() for cls_b in sentences_cls] for cls_a in sentences_cls]
print(cos_sims)

[[1.0, 0.9455853700637817, 0.9646619558334351, 0.8514178991317749, 0.8286489248275757, 0.9401548504829407, 0.9227976202964783, 0.9353488087654114, 0.9003757238388062, 0.8708136081695557, 0.9646222591400146, 0.9289807677268982, 0.8759903907775879, 0.870566725730896, 0.88913494348526, 0.9722212553024292, 0.8038939833641052, 0.7822104096412659, 0.7883208990097046, 0.8124408721923828, 0.8199416399002075, 0.8976421356201172, 0.875326931476593, 0.8014339208602905, 0.8331339955329895, 0.8581059575080872, 0.815019965171814, 0.8298882246017456, 0.8301887512207031, 0.8170695304870605, 0.7577216029167175, 0.7726884484291077, 0.7968733310699463, 0.8189309239387512, 0.8209186792373657], [0.9455853700637817, 1.0, 0.9564514756202698, 0.8817493915557861, 0.8290425539016724, 0.9268245697021484, 0.9358603358268738, 0.8994765281677246, 0.91685551404953, 0.8586716651916504, 0.958881139755249, 0.9730361700057983, 0.9118578433990479, 0.9189797639846802, 0.9177204966545105, 0.9352853298187256, 0.799350023269

In [45]:
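# Use a different loop variable name here: calling it `cos` would overwrite the CosineSimilarity module defined above
for row in cos_sims:
    print(row)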

[1.0, 0.9455853700637817, 0.9646619558334351, 0.8514178991317749, 0.8286489248275757, 0.9401548504829407, 0.9227976202964783, 0.9353488087654114, 0.9003757238388062, 0.8708136081695557, 0.9646222591400146, 0.9289807677268982, 0.8759903907775879, 0.870566725730896, 0.88913494348526, 0.9722212553024292, 0.8038939833641052, 0.7822104096412659, 0.7883208990097046, 0.8124408721923828, 0.8199416399002075, 0.8976421356201172, 0.875326931476593, 0.8014339208602905, 0.8331339955329895, 0.8581059575080872, 0.815019965171814, 0.8298882246017456, 0.8301887512207031, 0.8170695304870605, 0.7577216029167175, 0.7726884484291077, 0.7968733310699463, 0.8189309239387512, 0.8209186792373657]
[0.9455853700637817, 1.0, 0.9564514756202698, 0.8817493915557861, 0.8290425539016724, 0.9268245697021484, 0.9358603358268738, 0.8994765281677246, 0.91685551404953, 0.8586716651916504, 0.958881139755249, 0.9730361700057983, 0.9118578433990479, 0.9189797639846802, 0.9177204966545105, 0.9352853298187256, 0.79935002326965

In [46]:
import numpy as np
np_cos_sims = np.array(cos_sims)
print(np_cos_sims.shape)
np_cos_sims_flatten = np_cos_sims.flatten()
print(np_cos_sims_flatten.shape)
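# The 35 diagonal entries (each sentence compared with itself) should be the largest values,
# and every other pair appears twice because the matrix is symmetric.  So we take the 55
# largest entries in descending order and keep the last 20, i.e. the top 10 pairs twice over.
top_10 = np_cos_sims_flatten.argsort()[-55:][::-1][-20:]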

(35, 35)
(1225,)


In [47]:
print(top_10)
unraveled_idx = np.unravel_index(top_10, (35,35))
print(unraveled_idx)

[ 979 1217  498  294  613  647  685  719  978 1182  467  433 1189 1223
 1143  837  611  577  580  716]
(array([27, 34, 14,  8, 17, 18, 19, 20, 27, 33, 13, 12, 33, 34, 32, 23, 17,
       16, 16, 20], dtype=int64), array([34, 27,  8, 14, 18, 17, 20, 19, 33, 27, 12, 13, 34, 33, 23, 32, 16,
       17, 20, 16], dtype=int64))


In [48]:
print(np_cos_sims_flatten[979])

0.9951815605163574


In [49]:
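final_top_10 = []
i = 0
# Step through two at a time: this assumes the two copies of each pair, (a, b) and (b, a),
# are next to each other in the sorted order, as they are in the output above
while i < len(unraveled_idx[0]):
    idx_1 = unraveled_idx[0][i]   # row index of the pair
    idx_2 = unraveled_idx[1][i]   # column index of the same pair
    final_top_10.append(((idx_1, idx_2), np_cos_sims_flatten[idx_1*35+idx_2]))
    i = i + 2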

In [50]:
final_top_10




[((27, 34), 0.9951815605163574),
 ((14, 8), 0.9927845597267151),
 ((17, 18), 0.9904729127883911),
 ((19, 20), 0.988221287727356),
 ((27, 33), 0.9856245517730713),
 ((13, 12), 0.9816993474960327),
 ((33, 34), 0.9796444773674011),
 ((32, 23), 0.9780699014663696),
 ((17, 16), 0.9780153632164001),
 ((16, 20), 0.9770835041999817)]
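
The selection above works by slicing off the diagonal and skipping mirrored duplicates.  A more direct alternative is to score only the pairs with i < j and take the k largest.  The sketch below does this on a toy 4 x 4 matrix; `top_k_pairs` and `toy_sims` are names invented for this example.

```python
import heapq
from itertools import combinations


def top_k_pairs(sim_matrix, k=10):
    """Return the k most similar (i, j) pairs with i < j, best first."""
    n = len(sim_matrix)
    # combinations(range(n), 2) yields each unordered pair once and never (i, i)
    scored = (((i, j), sim_matrix[i][j]) for i, j in combinations(range(n), 2))
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy 4 x 4 symmetric matrix standing in for cos_sims.
toy_sims = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.3, 0.8],
    [0.2, 0.3, 1.0, 0.5],
    [0.4, 0.8, 0.5, 1.0],
]

for pair, score in top_k_pairs(toy_sims, k=3):
    print(pair, score)
```

The same function should work on the nested list `cos_sims` from earlier, e.g. `top_k_pairs(cos_sims, k=10)`, and it does not rely on the diagonal values being the largest or on duplicates being adjacent.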

### Exercise 3
a) Repeat exercise 2 but use the centroid of all of the output embeddings as the representation of a sentence.

b) Experiment with pooling different layers of the hidden state embeddings.  The penultimate layer (-2) is often felt to be a good choice as it is far enough away from the original uncontextualised word embeddings but also not too close to the output predictions.  See here for a discussion: https://github.com/hanxiao/bert-as-service#q-what-are-the-available-pooling-strategies
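
As a rough sketch of what "pooling a layer" means, the example below averages the token vectors of a chosen layer into one sentence vector.  It uses a toy nested list in place of the real hidden states (3 layers, 2 tokens, 2 dimensions, with the batch dimension dropped); `mean_pool_layer` and `toy_hidden_states` are made-up names.

```python
def mean_pool_layer(hidden_states, layer=-2):
    """Average the token vectors of one layer into a single sentence vector."""
    token_vectors = hidden_states[layer]
    n_tokens = len(token_vectors)
    # zip(*token_vectors) groups the values dimension by dimension
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


# Toy stand-in: 3 layers, 2 tokens, 2 dimensions (batch dimension dropped).
toy_hidden_states = (
    [[0.0, 0.0], [1.0, 1.0]],   # embedding output
    [[1.0, 2.0], [3.0, 4.0]],   # penultimate layer in this toy example
    [[5.0, 6.0], [7.0, 8.0]],   # last layer
)

print(mean_pool_layer(toy_hidden_states, layer=-2))
print(mean_pool_layer(toy_hidden_states, layer=-1))
```

With the real model (called with `output_hidden_states=True`), the equivalent for a single sentence would be something like `outputs[2][-2][0].mean(dim=0)`, which gives a vector of size 768.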

In [124]:
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    sentences_outputs = [model(tokens_tensor, token_type_ids=segments_tensors)
                         for (tokens_tensor, segments_tensors) in zip(sentences_tokens_tensor, sentences_segments_tensors)]
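    # Sum the output embeddings over the token dimension (dim 1), then drop the batch dimension.
    # This is the sum rather than the mean, which makes no difference to cosine similarity.
    sentences_centroids = [torch.sum(output[0], 1)[0] for output in sentences_outputs]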

In [125]:
print(len(sentences_centroids))
print(sentences_centroids[0].size())
sentences_centroids[0]

35
torch.Size([768])


tensor([ 1.9850e+00,  1.3161e+00,  6.1223e-01,  6.2846e-01, -1.0020e+00,
        -3.8386e+00,  1.1196e+00,  8.2367e+00, -1.8715e-01, -6.2047e+00,
        -2.9274e+00, -4.6410e+00,  1.6850e+00, -2.5634e-01,  8.2823e-01,
        -2.7640e+00,  4.9652e+00, -1.8406e+00, -3.3004e+00,  3.5776e-01,
        -1.6112e+00, -8.6485e-01, -2.7241e+00,  3.1237e+00,  7.0547e+00,
         1.2566e+00, -6.9189e-01,  2.0653e+00, -4.4272e+00, -2.0778e+00,
         1.7964e+00,  5.3366e+00,  5.7195e-01, -5.9054e-01, -2.6131e+00,
        -3.6601e+00,  2.9636e-01, -1.5845e+00, -5.5964e-01, -6.0556e-01,
        -5.7638e+00,  5.1215e-01, -3.3417e+00, -2.0679e+00,  7.7346e+00,
        -5.3294e+00,  1.2208e+00, -4.0233e+00, -2.7871e+00, -2.2031e+00,
        -3.5953e+00,  4.1838e+00, -1.0607e+00, -4.8845e+00, -5.8304e+00,
         6.3068e-01, -3.9480e-01, -2.5270e+00, -3.0434e+00, -3.9312e-01,
        -1.0354e+00,  2.8725e+00, -1.4134e+00, -3.2419e+00,  1.2876e+00,
        -1.4771e-01, -4.6510e+00,  1.8880e+00,  4.7

In [142]:
cos_sims_centroids = [[cos(centroid_a, centroid_b).item() for centroid_b in sentences_centroids] for centroid_a in sentences_centroids]
print(cos_sims_centroids)
len(cos_sims_centroids)

[[1.0, 0.8996008634567261, 0.9198457598686218, 0.7202247381210327, 0.6473896503448486, 0.8561151623725891, 0.7878742814064026, 0.8632704615592957, 0.779762864112854, 0.7260465621948242, 0.9300801157951355, 0.8763684034347534, 0.7349972724914551, 0.7374476194381714, 0.7547682523727417, 0.9308681488037109, 0.5955303311347961, 0.5602136850357056, 0.5564670562744141, 0.6455003023147583, 0.666740357875824, 0.806885838508606, 0.7635465264320374, 0.5910668969154358, 0.6878949999809265, 0.7097204327583313, 0.6448606252670288, 0.6957005262374878, 0.6904869675636292, 0.6500993967056274, 0.5376498103141785, 0.5208662748336792, 0.5528550148010254, 0.635760486125946, 0.6670063138008118], [0.8996008634567261, 1.0, 0.8855577111244202, 0.7331383228302002, 0.6599642038345337, 0.8448490500450134, 0.8175795078277588, 0.7907476425170898, 0.7839623093605042, 0.6682948470115662, 0.8859336376190186, 0.9299957156181335, 0.7837681770324707, 0.7793893814086914, 0.7832338809967041, 0.8372676372528076, 0.58686482

35

In [143]:
from itertools import product
cos_sims_centroids_2 = [cos(centroid_a, centroid_b).item() for centroid_a, centroid_b in product(sentences_centroids, sentences_centroids)]
print(cos_sims_centroids_2)
len(cos_sims_centroids_2)

[1.0, 0.8996008634567261, 0.9198457598686218, 0.7202247381210327, 0.6473896503448486, 0.8561151623725891, 0.7878742814064026, 0.8632704615592957, 0.779762864112854, 0.7260465621948242, 0.9300801157951355, 0.8763684034347534, 0.7349972724914551, 0.7374476194381714, 0.7547682523727417, 0.9308681488037109, 0.5955303311347961, 0.5602136850357056, 0.5564670562744141, 0.6455003023147583, 0.666740357875824, 0.806885838508606, 0.7635465264320374, 0.5910668969154358, 0.6878949999809265, 0.7097204327583313, 0.6448606252670288, 0.6957005262374878, 0.6904869675636292, 0.6500993967056274, 0.5376498103141785, 0.5208662748336792, 0.5528550148010254, 0.635760486125946, 0.6670063138008118, 0.8996008634567261, 1.0, 0.8855577111244202, 0.7331383228302002, 0.6599642038345337, 0.8448490500450134, 0.8175795078277588, 0.7907476425170898, 0.7839623093605042, 0.6682948470115662, 0.8859336376190186, 0.9299957156181335, 0.7837681770324707, 0.7793893814086914, 0.7832338809967041, 0.8372676372528076, 0.58686482906

1225
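
Both cells above call `cos` once per pair, which is 1225 calls for 35 sentences. A sketch of an equivalent vectorised computation, assuming the sentence vectors are stacked into one array: normalise each row to unit length and multiply the matrix by its own transpose. The helper `cosine_matrix` and the random data are illustrative stand-ins.

```python
import numpy as np

def cosine_matrix(vectors):
    """All-pairs cosine similarity for an (n, d) array of row vectors."""
    X = np.asarray(vectors, dtype=float)
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    return X_unit @ X_unit.T                                # dot products of unit vectors

# Random stand-in for the 35 sentence centroids
rng = np.random.default_rng(0)
fake_centroids = rng.normal(size=(35, 768))

toy_sims = cosine_matrix(fake_centroids)
print(toy_sims.shape, toy_sims.flatten().shape)
```

Flattening the matrix row by row gives the same ordering as the `product(...)` version, because `product` varies its second argument fastest.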

In [135]:
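np_cos_sims_centroids = np.array(cos_sims_centroids)
print(np_cos_sims_centroids.shape)
np_cos_sims_centroids_flatten = np_cos_sims_centroids.flatten()
print(np_cos_sims_centroids_flatten.shape)
# The 35 largest values are the self-similarities on the diagonal (all ~1.0), so take the
# top 55, reverse into descending order and keep the last 20, i.e. ranks 36 to 55.
# Each pair appears twice, as (i, j) and (j, i), so these 20 indices cover 10 distinct pairs.
top_10_centroids = np_cos_sims_centroids_flatten.argsort()[-55:][::-1][-20:]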

(35, 35)
(1225,)


In [136]:
print(top_10_centroids)
unraveled_centroids_idx = np.unravel_index(top_10_centroids, (35,35))
print(unraveled_centroids_idx)



[ 613  647  719  685  979 1217  498  294  611  577  580  716  978 1182
  646  578  681  579  467  433]
(array([17, 18, 20, 19, 27, 34, 14,  8, 17, 16, 16, 20, 27, 33, 18, 16, 19,
       16, 13, 12], dtype=int64), array([18, 17, 19, 20, 34, 27,  8, 14, 16, 17, 20, 16, 33, 27, 16, 18, 16,
       19, 12, 13], dtype=int64))


In [137]:
final_top_10_centroids = []
i = 0
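# Each pair appears twice, as (i, j) and (j, i), in neighbouring positions, so step by 2.
# Row indices live in unraveled_centroids_idx[0] and column indices in unraveled_centroids_idx[1].
while i < len(unraveled_centroids_idx[0]):
    idx_1 = unraveled_centroids_idx[0][i]
    idx_2 = unraveled_centroids_idx[1][i]
    final_top_10_centroids.append(((idx_1, idx_2), np_cos_sims_centroids_flatten[idx_1*35+idx_2]))
    i = i + 2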

In [138]:
final_top_10_centroids

[((17, 18), 0.9890540242195129),
 ((20, 19), 0.9832317233085632),
 ((27, 34), 0.9824605584144592),
 ((14, 8), 0.9780564904212952),
 ((17, 16), 0.9682672023773193),
 ((16, 20), 0.9671179056167603),
 ((27, 33), 0.9647222757339478),
 ((18, 16), 0.96048903465271),
 ((19, 16), 0.9594876170158386),
 ((13, 12), 0.954054057598114)]

In [132]:
final_top_10

[((27, 34), 0.9951815605163574),
 ((14, 8), 0.9927845597267151),
 ((17, 18), 0.9904729127883911),
 ((19, 20), 0.988221287727356),
 ((27, 33), 0.9856245517730713),
 ((13, 12), 0.9816993474960327),
 ((33, 34), 0.9796444773674011),
 ((32, 23), 0.9780699014663696),
 ((17, 16), 0.9780153632164001),
 ((16, 20), 0.9770835041999817)]
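
To compare the two rankings more systematically than by eye, one option is to treat each pair as unordered and look at the set overlap. The sketch below copies the index pairs from the two outputs above. The variable names are illustrative.

```python
# Index pairs copied from final_top_10 (exercise 2) and final_top_10_centroids
exercise2_pairs = [(27, 34), (14, 8), (17, 18), (19, 20), (27, 33),
                   (13, 12), (33, 34), (32, 23), (17, 16), (16, 20)]
centroid_pairs = [(17, 18), (20, 19), (27, 34), (14, 8), (17, 16),
                  (16, 20), (27, 33), (18, 16), (19, 16), (13, 12)]

def as_unordered(pairs):
    """Normalise pairs so that (20, 19) and (19, 20) count as the same pair."""
    return {tuple(sorted(pair)) for pair in pairs}

ex2_set = as_unordered(exercise2_pairs)
centroid_set = as_unordered(centroid_pairs)

shared = ex2_set & centroid_set
only_exercise2 = ex2_set - centroid_set
only_centroid = centroid_set - ex2_set

print(len(shared), sorted(only_exercise2), sorted(only_centroid))
```

Eight of the ten pairs are shared. The two rankings differ only in that the exercise 2 list contains (33, 34) and (23, 32), while the centroid list contains (16, 18) and (16, 19) instead.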

In [None]:
sims = run(sentences, poolinglayer=-2)

In [None]:
# sims = run(sentences, method="cls")

In [None]:
print(sims[0])

In [None]:
interested = [2, 1, 21, 7, 8, 22, 3, 25]
for i in interested:
    print(sentences[i], sims[0][i])

### Extension 1
The MRPC.zip file contains training, dev and test splits of the Microsoft Research Paraphrase Corpus.  In this corpus a quality label of '1' indicates that the two sentences are considered to be paraphrases and '0' indicates that they are not.

Can you build a classifier on top of the pre-trained BERT model, trained on the training split of MRPC, which predicts whether two sentences are paraphrases or not?

Note that this does not require you to fine-tune BERT itself: you can use the outputs from BERT as the input to a separate classifier.  I would suggest a single neural layer, built using scikit-learn or torch, which takes the sentence representations from exercise 2 or 3 as input.
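
To make the shape of such a classifier concrete, here is a rough sketch of a single sigmoid layer trained on pair features. Everything in it is an assumption made for illustration: the data is synthetic rather than MRPC, the vectors are random rather than BERT outputs, and the feature combination (absolute difference plus element-wise product) is just one common choice. It is written in plain NumPy so it runs anywhere; scikit-learn's `LogisticRegression` or a `torch.nn.Linear` layer would play the same role.

```python
import numpy as np

def pair_features(u, v):
    """Combine two batches of sentence vectors into one feature vector per pair."""
    return np.concatenate([np.abs(u - v), u * v], axis=1)

def make_toy_pairs(n, dim, rng):
    """Synthetic stand-in for MRPC: label 1 pairs are noisy copies, label 0 pairs are unrelated."""
    labels = rng.integers(0, 2, size=n)
    u = rng.normal(size=(n, dim))
    near_copy = u + 0.1 * rng.normal(size=(n, dim))
    unrelated = rng.normal(size=(n, dim))
    v = np.where(labels[:, None] == 1, near_copy, unrelated)
    return pair_features(u, v), labels.astype(float)

def train_single_layer(X, y, lr=0.1, epochs=500):
    """Fit one sigmoid unit (logistic regression) by batch gradient descent."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
        error = probs - y                        # gradient of the log loss w.r.t. the logits
        weights -= lr * (X.T @ error) / len(y)
        bias -= lr * error.mean()
    return weights, bias

def predict(X, weights, bias):
    """Predict 1 (paraphrase) when the logit is positive, otherwise 0."""
    return (X @ weights + bias > 0).astype(float)

rng = np.random.default_rng(0)
X_train, y_train = make_toy_pairs(400, 16, rng)
X_test, y_test = make_toy_pairs(200, 16, rng)

weights, bias = train_single_layer(X_train, y_train)
accuracy = float((predict(X_test, weights, bias) == y_test).mean())
print(f"toy test accuracy: {accuracy:.2f}")
```

For the real task you would replace `make_toy_pairs` with code that reads the MRPC splits and encodes each sentence with BERT, then tune on the dev split and report on the test split. Expect the real problem to be much harder than this toy data.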