# Week 8: Using Transformer Models


## Getting started
If working on your own machine, make sure the huggingface transformers package is installed:

`conda install -c huggingface transformers`

or

`pip install transformers`

Of course, if working on Google Colab, you won't need to do this.  Whatever environment you are using, check that the following code runs.  It should output a negative label with a high score!


In [None]:
from transformers import pipeline
print(pipeline('sentiment-analysis')('I hate you'))

The following is adapted from the huggingface transformers quickstart tutorial (https://huggingface.co/transformers/quickstart.html).
We will be looking at the BERT introduction (but feel free to have a look at GPT2 etc. as well!)

First of all we need some key imports.  We are going to be using the pre-trained bert-base-uncased model, so this cell instantiates a tokenizer for this model.  Logging is also switched on so we can see more of what's going on.  The first time you run it, the tokenizer files will be downloaded and cached.  The cached version will be used on subsequent runs, if it is available (not on Google Colab).

In [1]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



Now we are going to tokenize some text.  This will demonstrate the 'wordpiece' vocabulary used by BERT as well as the fact that we need to introduce special `[CLS]` and `[SEP]` tokens in the input.

In [2]:
# Tokenize input
text = "[CLS] Who was elected as British prime minister in 1951? [SEP] Sir Winston Leonard Spencer Churchill was a British politician, statesman, army officer and writer, who was Prime Minister of the United Kingdom from 1940 to 1945 and again from 1951 to 1955. [SEP]"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'who', 'was', 'elected', 'as', 'british', 'prime', 'minister', 'in', '1951', '?', '[SEP]', 'sir', 'winston', 'leonard', 'spencer', 'churchill', 'was', 'a', 'british', 'politician', ',', 'statesman', ',', 'army', 'officer', 'and', 'writer', ',', 'who', 'was', 'prime', 'minister', 'of', 'the', 'united', 'kingdom', 'from', '1940', 'to', '1945', 'and', 'again', 'from', '1951', 'to', '1955', '.', '[SEP]']


In [3]:
# Tokenize input
text = "[CLS] What are igneous rocks? [SEP] Igneous rocks form when hot , molten rock crystallizes and solidifies. [SEP] "
tokenized_text= tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


Note that the tokenizer is not breaking down all words according to their morphology -- only rare words.  Reasonably frequent words such as `elected` are left as whole words.  Rarer words such as `solidifies` are broken down.
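To make the splitting behaviour concrete, here is a simplified sketch of greedy longest-match-first splitting, which is the general idea behind WordPiece tokenization.  The function name `wordpiece` and the tiny `toy_vocab` are assumptions made up for illustration (the real bert-base-uncased vocabulary has roughly 30,000 entries), and the real tokenizer also handles lowercasing and punctuation, which are skipped here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical): a tiny subset chosen to mirror the output above.
toy_vocab = {"elected", "rocks", "solid", "##ifies", "ign", "##eous"}

print(wordpiece("elected", toy_vocab))
print(wordpiece("solidifies", toy_vocab))
print(wordpiece("igneous", toy_vocab))
```

A word that is in the vocabulary as a whole is returned unchanged; anything else is covered by the longest pieces that can be found, working from left to right.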

Now we are going to mask out one of the words in the text.  For the purposes of this demonstration, I have chosen the token at index 11 (`form`), but you could try different tokens.  Remember that during training the tokens to mask are chosen randomly.


In [4]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 11
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', '[MASK]', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [5]:
print(len(tokenized_text))

25


We are now going to try to use the masked language model to predict this word.

First we need to convert the input into a list of word index ids.

In [6]:
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(indexed_tokens)

[101, 2054, 2024, 16270, 14769, 5749, 1029, 102, 16270, 14769, 5749, 103, 2043, 2980, 1010, 23548, 2600, 6121, 3669, 11254, 1998, 5024, 14144, 1012, 102]


We need segment ids to define whether a token is in the first or second sentence.

In [7]:
def make_segment_ids(list_of_tokens):
    #this function assumes that up to and including the first '[SEP]' is the first segment, anything afterwards is the second segment
    current_id=0
    segment_ids=[]
    for token in list_of_tokens:
        segment_ids.append(current_id)
        if token == '[SEP]':
            current_id=1
    return segment_ids

segment_ids=make_segment_ids(tokenized_text)
print(segment_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [8]:
# Convert inputs to PyTorch tensors
#this just wraps things up in multi-dimensional tensors rather than as flat lists.
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segment_ids])
print(tokens_tensor)
print(segments_tensors)

tensor([[  101,  2054,  2024, 16270, 14769,  5749,  1029,   102, 16270, 14769,
          5749,   103,  2043,  2980,  1010, 23548,  2600,  6121,  3669, 11254,
          1998,  5024, 14144,  1012,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])


Now we need to encode the input using the bert-base-uncased model


In [9]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda - otherwise comment this out to run on CPU
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # The model returns an output object that can be indexed like a tuple.
    # See the model docstrings for the detail of all the outputs.
    # outputs[0] is the output of the last layer of the BERT model (all tokens).
    # outputs[1] is a "pooled_output" representation of the [CLS] token only (not all tokens);
    # it passes through an extra layer, which is why it differs from the [CLS] vector in outputs[0].
    encoded_layers = outputs[0]


In [10]:
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
print(encoded_layers.shape)

torch.Size([1, 25, 768])


In [11]:
encoded_layers

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [12]:
#outputs[1] is a representation of the CLS token of shape (batch size, model hidden dimension)
outputs[1].shape

torch.Size([1, 768])

In [13]:
outputs[1]

tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
         -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
         -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
          0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
         -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
         -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
         -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.9274,  0.9106, -0.8983,
          0.6171,  0.6578, -0.7522, -0.9329, -0.8695,  0.8373, -0.9920, -0.9770,
          0.9985,  0.9901, -0.7740, -0.8020, -0.7395,  0.4467,  0.9907,  0.7470,
         -0.6924, -0.9460,  0.9862,  0.7678, -0.8415,  1.0000, -0.9650, -0.9968,
          0.9861,  0.9893,  0.8187, -0.9405,  0.9565, -1.0000,  0.9414, -0.6060,
         -0.9974,  0.7915,  0.9118, -0.7666,  0.9441,  0.8653, -0.8785, -0.9049,
         -0.8647, -0.9939, -

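We can also predict the masked token as follows.  This time we load `BertForMaskedLM`, which puts a language modelling head on top of the last layer of the BERT model, so that outputs[0] holds a score for every vocabulary item at every position.  We then find the token id which maximises the score at the masked position.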

In [14]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

        
# find the token id which maximises the prediction for the masked token and then convert this back to a word
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


form
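Taking the argmax shows only the single best guess.  It is often more informative to look at the top few candidates and their probabilities.  The sketch below shows the arithmetic in plain Python on a hypothetical handful of scores (the values in `toy_scores` are invented, not model output); with the real `predictions` tensor, `torch.topk` plays the same role as the sort.

```python
import math


def top_k(scores, k=3):
    """Return the k highest-scoring tokens with their softmax probabilities."""
    biggest = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {tok: math.exp(s - biggest) for tok, s in scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)[:k]
    return [(tok, exps[tok] / total) for tok in ranked]


# Hypothetical scores for the masked position (invented for illustration).
toy_scores = {"form": 4.0, "occur": 2.5, "melt": 1.0, "are": 0.5}
for token, prob in top_k(toy_scores):
    print(token, round(prob, 3))
```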


### Exercise 0
Mask each token in turn and see what BERT predicts.   How accurate are its predictions?  As an extension, you could look at masking multiple words in the sequence.

In [15]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

text = "[CLS] What are igneous rocks? [SEP] Igneous rocks form when hot , molten rock crystallizes and solidifies. [SEP] "


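correct=0
n=0
# tokenize once here so that the loop does not depend on tokenized_text left over from an earlier cell
base_tokens=tokenizer.tokenize(text)
for i in range(len(base_tokens)):
    tokenized_text=list(base_tokens)
    masked_index=i
    gold=tokenized_text[masked_index]
    tokenized_text[masked_index]='[MASK]'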
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segment_ids=make_segment_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segment_ids])

    # Predict all tokens
    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensors)
        predictions = outputs[0]

        
    # find the token id which maximises the prediction for the masked token and then convert this back to a word
    predicted_index = torch.argmax(predictions[0, masked_index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    print(gold,predicted_token)
    n+=1
    if predicted_token==gold:
        correct+=1
print("Accuracy: {}".format(str(correct/n)))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[CLS] .
what what
are about
ign ign
##eous ##eous
rocks rocks
? ?
[SEP] "
ign ign
##eous ##eous
rocks rocks
form form
when when
hot solid
, ,
molten hard
rock rock
crystal crystal
##li ##li
##zes ##zes
and and
solid solid
##ifies ##ifies
. .
[SEP] "
Accuracy: 0.76
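Note that the figure above also counts the `[CLS]` and `[SEP]` positions, which are not ordinary words.  As a quick check, the sketch below recomputes the accuracy from the (gold, predicted) pairs printed above, once over all positions and once skipping the special tokens.  The helper name `accuracy` and the `skip` parameter are just illustrative choices.

```python
# (gold, predicted) pairs copied from the output above.
pairs = [
    ("[CLS]", "."), ("what", "what"), ("are", "about"), ("ign", "ign"),
    ("##eous", "##eous"), ("rocks", "rocks"), ("?", "?"), ("[SEP]", '"'),
    ("ign", "ign"), ("##eous", "##eous"), ("rocks", "rocks"), ("form", "form"),
    ("when", "when"), ("hot", "solid"), (",", ","), ("molten", "hard"),
    ("rock", "rock"), ("crystal", "crystal"), ("##li", "##li"), ("##zes", "##zes"),
    ("and", "and"), ("solid", "solid"), ("##ifies", "##ifies"), (".", "."),
    ("[SEP]", '"'),
]
SPECIAL = {"[CLS]", "[SEP]"}


def accuracy(pairs, skip=frozenset()):
    """Fraction of positions predicted correctly, ignoring gold tokens in skip."""
    kept = [(gold, pred) for gold, pred in pairs if gold not in skip]
    return sum(gold == pred for gold, pred in kept) / len(kept)


print("All positions:", accuracy(pairs))
print("Without special tokens:", round(accuracy(pairs, skip=SPECIAL), 3))
```

Skipping the three special-token positions leaves 22 tokens, of which 19 are predicted correctly.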


## Representing Sentential Meaning
We are going to be looking at different strategies for representing sentential meaning:
* CLS token representation
* centroid/sum of output embeddings

The file `examples.txt` contains some example sentences.

### Exercise 1
Read in the sentences and store them as a list of sentences.  Add `[CLS]` and `[SEP]` tokens to the beginning and end of each and then pass them through the bert-base-uncased tokenizer.

In [16]:
filename='lab8resources/examples.txt'
sentences=[]
with open(filename, 'r') as instream:
    for line in instream:
        sentences.append(line.rstrip())
#print(sentences)

inputsents=['[CLS] '+sent+' [SEP]' for sent in sentences]

tokenized_sents=[tokenizer.tokenize(sent) for sent in inputsents]
print(tokenized_sents)

[['[CLS]', 'the', 'boy', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'kicks', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'kicked', 'by', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'kicked', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'kicks', '.', '[SEP]'], ['[CLS]', 'the', 'child', 'kicks', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'kicks', 'a', 'round', 'object', '.', '[SEP]'], ['[CLS]', 'the', 'male', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'is', 'playing', 'football', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'hits', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'hits', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'boy', 'is', 'hit', 'by', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'ball', 'is', 'hit', 'by', 'the', 'boy', '.', '[SEP]'], ['[CLS]', 'the', 'female', 'child', 'kicks', 'the', 'ball', '.', '[SEP]'], ['[CLS]', 'the', 'gi

In [17]:
for i,sent in enumerate(inputsents):
    print(i,sent)

0 [CLS] The boy kicks the ball. [SEP]
1 [CLS] The ball kicks the boy. [SEP]
2 [CLS] The child kicks the ball. [SEP]
3 [CLS] The ball is kicked by the boy. [SEP]
4 [CLS] The ball is kicked. [SEP]
5 [CLS] The boy kicks. [SEP]
6 [CLS] The child kicks. [SEP]
7 [CLS] The boy kicks a round object. [SEP]
8 [CLS] The male child kicks the ball. [SEP]
9 [CLS] The boy is playing football. [SEP]
10 [CLS] The boy hits the ball. [SEP]
11 [CLS] The ball hits the boy. [SEP]
12 [CLS] The boy is hit by the ball. [SEP]
13 [CLS] The ball is hit by the boy. [SEP]
14 [CLS] The female child kicks the ball. [SEP]
15 [CLS] The girl kicks the ball. [SEP]
16 [CLS] The child plays with dolls. [SEP]
17 [CLS] The female child plays with dolls. [SEP]
18 [CLS] The male child plays with dolls. [SEP]
19 [CLS] The girl plays with dolls. [SEP]
20 [CLS] The boy plays with dolls. [SEP]
21 [CLS] The boy is kicking the ball. [SEP]
22 [CLS] The boy is not kicking the ball. [SEP]
23 [CLS] All boys kick balls. [SEP]
24 [CLS] Ev

When encoding sentences, it is actually more typical to pool the hidden states of an intermediate layer (at depth n) rather than those of the output layer.  We can access the hidden states of every layer by passing `output_hidden_states=True`.

In [18]:
model = BertModel.from_pretrained('bert-base-uncased')


model.eval()

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors,output_hidden_states=True)
   
    
    

In [19]:
outputs.to_tuple()

(tensor([[[-0.6770,  0.9460, -0.5043,  ..., -0.5411,  0.2880,  0.6554],
          [ 0.3894, -0.1123,  0.0627,  ...,  0.2348, -0.2624, -0.7901],
          [ 0.3382,  0.4643,  0.5102,  ..., -0.3201,  0.0523,  0.6370],
          ...,
          [ 0.3637,  0.2680,  0.4165,  ..., -0.1186, -0.3448,  0.1381],
          [ 0.6523, -0.0097, -0.2530,  ...,  0.0553, -0.6688, -0.4759],
          [ 0.1958,  0.3457,  0.2455,  ..., -0.0109, -0.0633,  0.1034]]]),
 tensor([[-0.9894, -0.8646, -0.9975,  0.9868,  0.9707, -0.8055,  0.9864,  0.7454,
          -0.9937, -1.0000, -0.9569,  0.9987,  0.9958,  0.8907,  0.9904, -0.9714,
          -0.9318, -0.9199,  0.7476, -0.8990,  0.9496,  1.0000, -0.6833,  0.8039,
           0.8966,  0.9999, -0.9735,  0.9875,  0.9899,  0.8809, -0.9644,  0.7739,
          -0.9986, -0.6912, -0.9968, -0.9994,  0.9149, -0.9360, -0.6885, -0.6479,
          -0.9761,  0.8580,  1.0000,  0.3837,  0.8634, -0.7817, -1.0000,  0.7828,
          -0.9685,  0.9987,  0.9940,  0.9951,  0.8279,  0.

In [20]:
print(len(outputs))
for i in range(len(outputs)):
    try:
        print(outputs[i].shape)
    except AttributeError:
        # outputs[2] is a tuple of tensors, so it has no .shape; report its length instead
        print(len(outputs[i]))

3
torch.Size([1, 25, 768])
torch.Size([1, 768])
13


Here:
* outputs[0] contains the output representation of each token
* outputs[1] is a representation of the first token (after being put through an additional layer)
* outputs[2] is a tuple of 13 tensors: the embedding output followed by the hidden states of each of the 12 layers.  If we want the last layer then we need outputs[2][-1]
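The 13 reported for outputs[2] above is the embedding output plus one entry for each of the 12 layers in bert-base-uncased, so negative indices count back from the final layer.  The small helper below (a hypothetical name, `describe_hidden_state`, written only to illustrate the indexing) spells out which entry a given index refers to.

```python
def describe_hidden_state(index, num_layers=12):
    """Name the entry of the hidden-states tuple at a (possibly negative) index."""
    size = num_layers + 1  # embedding output + one entry per layer
    if not -size <= index < size:
        raise IndexError("index {} is out of range for {} entries".format(index, size))
    position = index % size  # turn a negative index into its positive equivalent
    if position == 0:
        return "embedding output"
    return "layer {} of {}".format(position, num_layers)


for idx in (0, 1, -2, -1):
    print(idx, "->", describe_hidden_state(idx))
```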


In [21]:
#outputs[2][-1] is the last hidden layer also output as outputs[0]
outputs[2][-1]

tensor([[[-0.6770,  0.9460, -0.5043,  ..., -0.5411,  0.2880,  0.6554],
         [ 0.3894, -0.1123,  0.0627,  ...,  0.2348, -0.2624, -0.7901],
         [ 0.3382,  0.4643,  0.5102,  ..., -0.3201,  0.0523,  0.6370],
         ...,
         [ 0.3637,  0.2680,  0.4165,  ..., -0.1186, -0.3448,  0.1381],
         [ 0.6523, -0.0097, -0.2530,  ...,  0.0553, -0.6688, -0.4759],
         [ 0.1958,  0.3457,  0.2455,  ..., -0.0109, -0.0633,  0.1034]]])

In [22]:
#so if you want the penultimate hidden layer you need outputs[2][-2]
outputs[2][-2]

tensor([[[-0.3338,  0.5738, -0.5777,  ..., -0.1215, -0.3504,  1.1074],
         [ 0.4044, -0.2317, -0.2335,  ...,  0.4218, -0.3578, -1.5020],
         [-0.2562,  0.3409, -0.2989,  ..., -0.5693,  0.0474,  1.3045],
         ...,
         [ 0.8100, -0.0503,  0.3418,  ..., -0.1121, -0.2673,  0.5875],
         [ 0.0457,  0.0066, -0.0341,  ...,  0.0187, -0.0104, -0.0071],
         [ 0.4002,  0.4515,  0.5819,  ...,  0.0999, -0.6354,  0.0041]]])

### Exercise 2
* Encode each sentence using the output representation for its CLS token - note that you do not need to mask the CLS token.  We are just interested in the output layer embedding for this token.  You can use outputs[0][0, 0] or outputs[1] as a representation of the CLS token, but you will get different results because outputs[1] has gone through an additional layer (trained for next sentence prediction during pre-training, and for classification if the model has been fine-tuned).
* Use cosine similarity to compute the similarity of every pair of sentences.
* Identify the 10 most similar pairs of sentences using this sentence encoding
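As a reminder of what cosine similarity computes, here is a plain-Python sketch: the dot product of the two vectors divided by the product of their lengths.  The function name `cosine` and the `eps` guard against division by zero are illustrative choices; in the lab itself you would normally apply a library routine to the tensors directly.

```python
import math


def cosine(u, v, eps=1e-6):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps only matters when one of the vectors is (almost) all zeros
    return dot / max(norm_u * norm_v, eps)


print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Because the measure depends only on direction, scaling a sentence vector (for example, using a sum rather than a centroid) does not change its cosine similarity to other vectors.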

In [23]:
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

encoded=[]
for sent in tokenized_sents:
    indexed_tokens = tokenizer.convert_tokens_to_ids(sent)
    segment_ids=make_segment_ids(sent)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segment_ids])

    # Predict all tokens
    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensors)
        predictions = outputs[0]
        
    cls=predictions[0,0]
    encoded.append(cls)

print(encoded)

[tensor([ 1.1590e-01,  6.5325e-01, -2.0935e-01, -2.5612e-02, -3.8921e-01,
        -6.2618e-01,  3.0078e-01,  5.4555e-01, -2.0468e-03, -6.0749e-01,
        -4.2319e-01, -1.9440e-01,  4.5724e-01,  7.6824e-02,  3.3098e-01,
        -5.0582e-01,  1.5184e-01,  1.0898e-01, -1.2320e-01, -5.4610e-01,
        -5.0400e-02, -5.3258e-01, -3.2013e-01,  2.9357e-01,  4.9870e-01,
         5.0015e-02,  1.4715e-01, -1.0674e-01, -2.2981e-01, -1.5368e-01,
         2.7195e-01,  1.0101e+00, -1.6937e-01, -1.3404e-01,  8.9785e-02,
        -5.1196e-01,  2.3494e-01, -9.3253e-02,  4.7672e-01,  1.4907e-01,
        -3.0858e-01,  1.5049e-01, -8.5429e-02, -4.1467e-01,  6.8358e-01,
        -9.3506e-01, -2.8060e+00, -3.7640e-01, -4.8633e-01, -4.4476e-01,
         1.5088e-01,  1.3589e-02,  1.5590e-02, -2.3848e-02, -6.3058e-01,
        -3.2195e-02, -1.7782e-01,  4.7192e-01,  5.1612e-03,  7.3015e-02,
        -1.0110e-01,  4.8777e-01, -2.3079e-01, -1.0471e-01,  1.7462e-01,
        -2.1536e-01, -5.5140e-01,  2.9230e-01,  4.

In [24]:
## this is a handy way of finding the cosine similarity between two tensors
# see https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

#you use this as:
print(encoded_layers[0,0],encoded_layers[0,1])
output=cos(encoded_layers[0,0],encoded_layers[0,1])
print(output.item())

tensor([-6.6741e-01,  9.4580e-01, -5.0362e-01,  7.6330e-02, -1.2090e+00,
         1.2795e-01,  9.3474e-01,  1.1574e+00, -1.9790e-01,  9.3847e-02,
        -7.1620e-01, -4.7296e-01, -2.1994e-01,  1.0530e+00,  3.7979e-01,
        -1.6461e-01, -2.0667e-01,  1.0136e+00,  3.5675e-01, -2.8986e-02,
         1.4404e-01, -2.0457e-01, -1.0594e-01,  9.7714e-03,  9.4192e-03,
        -5.7983e-01, -1.4214e-01, -4.3860e-01,  6.5073e-01, -4.3919e-01,
        -2.3003e-01,  1.1294e+00, -5.0243e-01, -1.7202e-01,  3.5664e-01,
        -6.0077e-02,  1.5558e-01, -5.6304e-02,  3.3006e-01,  1.5681e-01,
        -1.7278e-01,  1.3906e-01,  5.0966e-01,  3.3143e-01, -1.8095e-01,
        -3.5913e-01, -1.8010e+00, -2.8397e-01, -3.3440e-01, -3.4394e-01,
         1.7959e-01, -6.9789e-02,  8.8394e-01,  4.0845e-01, -3.3890e-01,
         1.4026e+00, -9.4726e-01,  3.8425e-01,  1.7327e-01,  9.4517e-01,
        -6.2579e-02, -7.5702e-02, -6.6680e-01, -4.6924e-01,  4.4900e-01,
         9.2037e-01, -2.3056e-01,  8.5764e-01, -8.6

In [25]:
sims=[]
for posA in encoded:
    this_sims=[]
    for posB in encoded:
        
        output=cos(posA,posB)
        this_sims.append(output.item())
    sims.append(this_sims)
print(sims)

[[1.0, 0.9455852508544922, 0.9646618962287903, 0.8514177203178406, 0.8286486864089966, 0.9401546716690063, 0.9227973222732544, 0.9353486895561218, 0.900375485420227, 0.8708134293556213, 0.9646220803260803, 0.9289805889129639, 0.8759901523590088, 0.8705665469169617, 0.8891347646713257, 0.9722211956977844, 0.8038936257362366, 0.7822102308273315, 0.7883206605911255, 0.8124406933784485, 0.8199414610862732, 0.8976418972015381, 0.8753265738487244, 0.8014339208602905, 0.8331336379051208, 0.8581057786941528, 0.8150197267532349, 0.8298880457878113, 0.830188512802124, 0.8170691728591919, 0.7577212452888489, 0.772688090801239, 0.7968730330467224, 0.8189307451248169, 0.8209183812141418], [0.9455852508544922, 1.0, 0.9564514756202698, 0.8817494511604309, 0.8290426135063171, 0.9268247485160828, 0.9358603358268738, 0.8994766473770142, 0.9168555736541748, 0.8586715459823608, 0.9588810801506042, 0.9730362296104431, 0.9118579030036926, 0.918979823589325, 0.9177206754684448, 0.9352852702140808, 0.79934996

In [26]:
from operator import itemgetter
bestmatches=[]
for i,this_sims in enumerate(sims):
    withindex=[(j,sim) for j,sim in enumerate(this_sims)]
    sortedsims=sorted(withindex,key=itemgetter(1),reverse=True)
    # sortedsims[0] is normally the sentence itself (similarity 1.0), so take the next one
    mostsim=sortedsims[1][0]
    bestmatches.append((i,mostsim))
    
print(bestmatches)
    

[(0, 15), (1, 11), (2, 10), (3, 13), (4, 3), (5, 6), (6, 5), (7, 0), (8, 14), (9, 21), (10, 11), (11, 10), (12, 13), (13, 12), (14, 8), (15, 0), (16, 17), (17, 18), (18, 17), (19, 20), (20, 19), (21, 22), (22, 21), (23, 32), (24, 23), (25, 27), (26, 30), (27, 34), (28, 27), (29, 21), (30, 32), (31, 32), (32, 23), (33, 27), (34, 27)]
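The list above gives each sentence's nearest neighbour, so the same pair can appear twice, for example (5, 6) and (6, 5).  To rank distinct pairs instead, as the exercise asks, one option is to score every unordered pair once and keep the k largest.  The sketch below does this on a hypothetical 4x4 matrix; `top_pairs` and `toy_sims` are names and values invented for illustration.

```python
import heapq
from itertools import combinations


def top_pairs(sims, k=10):
    """Return the k highest-scoring (i, j, similarity) triples with i < j."""
    candidates = ((i, j, sims[i][j]) for i, j in combinations(range(len(sims)), 2))
    return heapq.nlargest(k, candidates, key=lambda triple: triple[2])


# Toy similarity matrix (hypothetical values, symmetric with 1.0 on the diagonal).
toy_sims = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.3, 0.8],
    [0.2, 0.3, 1.0, 0.5],
    [0.4, 0.8, 0.5, 1.0],
]
for i, j, sim in top_pairs(toy_sims, k=3):
    print(i, j, sim)
```

With the real `sims` matrix, `top_pairs(sims, k=10)` would give the indices to look up in `inputsents`.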


In [27]:
for a,b in bestmatches:
    print(inputsents[a]+" : "+inputsents[b])

[CLS] The boy kicks the ball. [SEP] : [CLS] The girl kicks the ball. [SEP]
[CLS] The ball kicks the boy. [SEP] : [CLS] The ball hits the boy. [SEP]
[CLS] The child kicks the ball. [SEP] : [CLS] The boy hits the ball. [SEP]
[CLS] The ball is kicked by the boy. [SEP] : [CLS] The ball is hit by the boy. [SEP]
[CLS] The ball is kicked. [SEP] : [CLS] The ball is kicked by the boy. [SEP]
[CLS] The boy kicks. [SEP] : [CLS] The child kicks. [SEP]
[CLS] The child kicks. [SEP] : [CLS] The boy kicks. [SEP]
[CLS] The boy kicks a round object. [SEP] : [CLS] The boy kicks the ball. [SEP]
[CLS] The male child kicks the ball. [SEP] : [CLS] The female child kicks the ball. [SEP]
[CLS] The boy is playing football. [SEP] : [CLS] The boy is kicking the ball. [SEP]
[CLS] The boy hits the ball. [SEP] : [CLS] The ball hits the boy. [SEP]
[CLS] The ball hits the boy. [SEP] : [CLS] The boy hits the ball. [SEP]
[CLS] The boy is hit by the ball. [SEP] : [CLS] The ball is hit by the boy. [SEP]
[CLS] The ball is h

### Exercise 3
a) Repeat exercise 2 but use the centroid of all of the output embeddings as the representation of a sentence.

b) Experiment with using different pooling layers from the hidden state embeddings.  Typically, using the penultimate layer (-2) is felt to be optimal as it is far enough away from the original uncontextualised word embeddings but also not too close to the output predictions.  See here for a discussion: https://github.com/hanxiao/bert-as-service#q-what-are-the-available-pooling-strategies
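To be clear about what "centroid" means here, it is the element-wise mean of the token vectors.  One design decision is whether the `[CLS]` and `[SEP]` positions should contribute.  The plain-Python sketch below shows both options on hypothetical two-dimensional vectors; the function name `centroid`, its `skip` parameter and the toy numbers are assumptions for illustration only.

```python
def centroid(tokens, vectors, skip=("[CLS]", "[SEP]")):
    """Element-wise mean of the vectors whose token is not in skip."""
    kept = [vec for tok, vec in zip(tokens, vectors) if tok not in skip]
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Hypothetical token vectors (real BERT vectors have 768 dimensions).
toy_tokens = ["[CLS]", "the", "boy", "[SEP]"]
toy_vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [7.0, 7.0]]

print(centroid(toy_tokens, toy_vectors))           # special tokens skipped
print(centroid(toy_tokens, toy_vectors, skip=()))  # every position included
```

Note that the divisor is the number of vectors actually kept, so skipping tokens changes both the sum and the count.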

In [34]:


def encode(tokenized_sents, method="sum",poolinglayer=-1):
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()
    encoded=[]
    blacklist=['[CLS]','[SEP]']
    for sent in tokenized_sents:
        indexed_tokens = tokenizer.convert_tokens_to_ids(sent)
        segment_ids=make_segment_ids(sent)
        tokens_tensor = torch.tensor([indexed_tokens])
        segments_tensors = torch.tensor([segment_ids])

        # Predict all tokens
        with torch.no_grad():
            outputs = model(tokens_tensor, token_type_ids=segments_tensors,output_hidden_states=True)
            if poolinglayer==0:
                # here 0 means "use the model output", which is the same as the last hidden layer (-1)
                predictions = outputs[0]
            else:
                predictions = outputs[2][poolinglayer]
                
        if method=="sum":
            rep=sum(predictions[0])
        elif method=="cls":
            rep=predictions[0][0]
        elif method=="centroid":
            rep=sum(predictions[0])
            rep=rep/len(predictions[0])
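        elif method=="centroid-":
            # centroid over the real tokens only, skipping [CLS] and [SEP];
            # dividing by the number of tokens kept (the original divided by a count that included [SEP])
            keep=[pred for tok,pred in zip(sent,predictions[0]) if tok not in blacklist]
            rep=sum(keep)/len(keep)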
                    
        
        else:
            rep=predictions[0][0]
        encoded.append(rep)
        
    return encoded

def allpairssims(encoded):
    sims=[]
    for posA in encoded:
        this_sims=[]
        for posB in encoded:
        
            output=cos(posA,posB)
            this_sims.append(output.item())
        sims.append(this_sims)
    return sims

def match(inputsents,sims):
    bestmatches=[]
    for i,this_sims in enumerate(sims):
        withindex=[(j,sim) for j,sim in enumerate(this_sims)]
        sortedsims=sorted(withindex,key=itemgetter(1),reverse=True)
        # sortedsims[0] is normally the sentence itself (similarity 1.0), so take the next one
        mostsim=sortedsims[1][0]
        bestmatches.append((i,mostsim))
    
    for a,b in bestmatches:
        print(inputsents[a]+" : "+inputsents[b])
        
def run(sentences,method="sum",poolinglayer=-1):
    inputsents=['[CLS] '+sent+' [SEP]' for sent in sentences]

    tokenized_sents=[tokenizer.tokenize(sent) for sent in inputsents]
    encoded=encode(tokenized_sents,method=method,poolinglayer=poolinglayer)
    sims=allpairssims(encoded)
    match(sentences,sims)
    return sims
    

In [29]:
sims=run(sentences,poolinglayer=-2)

The boy kicks the ball. : The boy hits the ball.
The ball kicks the boy. : The ball hits the boy.
The child kicks the ball. : The male child kicks the ball.
The ball is kicked by the boy. : The ball is hit by the boy.
The ball is kicked. : The ball is kicked by the boy.
The boy kicks. : The child kicks.
The child kicks. : The boy kicks.
The boy kicks a round object. : The boy kicks the ball.
The male child kicks the ball. : The female child kicks the ball.
The boy is playing football. : The boy is kicking the ball.
The boy hits the ball. : The boy kicks the ball.
The ball hits the boy. : The ball kicks the boy.
The boy is hit by the ball. : The ball is hit by the boy.
The ball is hit by the boy. : The boy is hit by the ball.
The female child kicks the ball. : The male child kicks the ball.
The girl kicks the ball. : The boy kicks the ball.
The child plays with dolls. : The boy plays with dolls.
The female child plays with dolls. : The male child plays with dolls.
The male child plays w

In [30]:
#sims=run(sentences,method="cls")

In [31]:
print(sims[0])

[1.0, 0.9320710301399231, 0.9389090538024902, 0.7984751462936401, 0.7116036415100098, 0.8858177065849304, 0.8407012224197388, 0.8538014888763428, 0.8429251909255981, 0.7730696201324463, 0.9567531943321228, 0.9111838936805725, 0.817293107509613, 0.8164535760879517, 0.8308355808258057, 0.9517408609390259, 0.6803529262542725, 0.6679407954216003, 0.6689137816429138, 0.7107057571411133, 0.7194034457206726, 0.8586256504058838, 0.8101323843002319, 0.6889369487762451, 0.735599160194397, 0.7480183243751526, 0.6996849179267883, 0.7375574707984924, 0.7541757822036743, 0.7557987570762634, 0.6476711630821228, 0.6235708594322205, 0.6821606755256653, 0.705179750919342, 0.7103111743927002]


In [32]:
interested=[2,1,21,7,8,22,3,25]
for i in interested:
    print(sentences[i],sims[0][i])

The child kicks the ball. 0.9389090538024902
The ball kicks the boy. 0.9320710301399231
The boy is kicking the ball. 0.8586256504058838
The boy kicks a round object. 0.8538014888763428
The male child kicks the ball. 0.8429251909255981
The boy is not kicking the ball. 0.8101323843002319
The ball is kicked by the boy. 0.7984751462936401
There is a boy kicking a ball. 0.7480183243751526


In [33]:
sims=run(sentences,poolinglayer=-1)
for i in interested:
    print(sentences[i],sims[0][i])

The boy kicks the ball. : The girl kicks the ball.
The ball kicks the boy. : The ball hits the boy.
The child kicks the ball. : The boy kicks the ball.
The ball is kicked by the boy. : The ball is hit by the boy.
The ball is kicked. : The ball is kicked by the boy.
The boy kicks. : The child kicks.
The child kicks. : The boy kicks.
The boy kicks a round object. : The boy kicks the ball.
The male child kicks the ball. : The female child kicks the ball.
The boy is playing football. : The boy is kicking the ball.
The boy hits the ball. : The boy kicks the ball.
The ball hits the boy. : The ball kicks the boy.
The boy is hit by the ball. : The ball is hit by the boy.
The ball is hit by the boy. : The boy is hit by the ball.
The female child kicks the ball. : The male child kicks the ball.
The girl kicks the ball. : The boy kicks the ball.
The child plays with dolls. : The female child plays with dolls.
The female child plays with dolls. : The male child plays with dolls.
The male child pla

In [35]:
sims=run(sentences,poolinglayer=0)
for i in interested:
    print(sentences[i],sims[0][i])

The boy kicks the ball. : The girl kicks the ball.
The ball kicks the boy. : The ball hits the boy.
The child kicks the ball. : The boy kicks the ball.
The ball is kicked by the boy. : The ball is hit by the boy.
The ball is kicked. : The ball is kicked by the boy.
The boy kicks. : The child kicks.
The child kicks. : The boy kicks.
The boy kicks a round object. : The boy kicks the ball.
The male child kicks the ball. : The female child kicks the ball.
The boy is playing football. : The boy is kicking the ball.
The boy hits the ball. : The boy kicks the ball.
The ball hits the boy. : The ball kicks the boy.
The boy is hit by the ball. : The ball is hit by the boy.
The ball is hit by the boy. : The boy is hit by the ball.
The female child kicks the ball. : The male child kicks the ball.
The girl kicks the ball. : The boy kicks the ball.
The child plays with dolls. : The female child plays with dolls.
The female child plays with dolls. : The male child plays with dolls.
The male child pla

In [36]:
sims=run(sentences,method="cls",poolinglayer=0)
for i in interested:
    print(sentences[i],sims[0][i])

The boy kicks the ball. : The girl kicks the ball.
The ball kicks the boy. : The ball hits the boy.
The child kicks the ball. : The boy hits the ball.
The ball is kicked by the boy. : The ball is hit by the boy.
The ball is kicked. : The ball is kicked by the boy.
The boy kicks. : The child kicks.
The child kicks. : The boy kicks.
The boy kicks a round object. : The boy kicks the ball.
The male child kicks the ball. : The female child kicks the ball.
The boy is playing football. : The boy is kicking the ball.
The boy hits the ball. : The ball hits the boy.
The ball hits the boy. : The boy hits the ball.
The boy is hit by the ball. : The ball is hit by the boy.
The ball is hit by the boy. : The boy is hit by the ball.
The female child kicks the ball. : The male child kicks the ball.
The girl kicks the ball. : The boy kicks the ball.
The child plays with dolls. : The female child plays with dolls.
The female child plays with dolls. : The male child plays with dolls.
The male child plays 

### Extension 1
The MRPC.zip file contains a training, dev and test split for the Microsoft Research paraphrase corpus.  In this corpus a quality label of '1' indicates that the two sentences are considered to be paraphrases and '0' indicates that they are not.

Can you build a classifier on top of the BERT pre-trained model, trained on the training split of MRPC, which predicts whether 2 sentences are paraphrases or not?

Note this does not require you to fine-tune the BERT model.  You can use outputs from BERT as input to your separate classifier.  I would suggest a single neural layer which uses the representation from exercise 2 or 3 as input, built using scikit-learn or torch.   
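A classifier needs a single input vector per sentence pair.  One common way to build it (an assumption here, not something the lab prescribes) is to concatenate the two sentence vectors with their element-wise absolute difference and product.  The sketch below shows this on hypothetical two-dimensional vectors; `pair_features` is an illustrative name, and the resulting lists would be stacked into the input matrix for your scikit-learn or torch classifier.

```python
def pair_features(u, v):
    """Combine two sentence vectors into one feature vector: [u, v, |u - v|, u * v]."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


# Hypothetical sentence vectors (real ones would come from exercise 2 or 3).
sent_a = [1.0, 2.0]
sent_b = [3.0, 0.5]
features = pair_features(sent_a, sent_b)
print(len(features), features)
```

The difference and product terms give a simple linear layer direct access to how the two sentences relate, rather than leaving it to discover this from the raw concatenation.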