# Calculating Similarity

First, let's create some embeddings.

In [1]:
sentences = [
    "Today's consumers want options that help them take control of their financial lives on the go. CardValet from Fiserv is a mobile app that allows financial institutions to deliver the experiences consumers expect by enabling them to define when, where and how their cards are used.",
    "Consumer’s want options that let them take control of their financial lives on the go. CardValet from Fiserv is a mobile app that helps banks deliver consumer experiences by enabling them to define when, where and how their cards are used.",
    "CardValet from Fiserv is a mobile app that consumers expect them to enable their cards and define when their cards are used. CardValet can be used as A branded app that can be integrated via single sign-on with your mobile banking platform giving cardholders a seamless banking experience",
    "Standing on one's head at job interviews forms a lasting impression."   
]


In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize sentence and append to dictionary lists
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

Some weights of the model checkpoint at sentence-transformers/bert-base-nli-mean-tokens were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
tokens['input_ids'].shape

torch.Size([4, 128])

We process these tokens through our model:

In [4]:
outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The dense vector representations of our sentences (one vector per token) are contained in the `last_hidden_state` tensor of `outputs`, which we access like so:

In [5]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-2.9276e-01, -2.6861e-02,  1.2145e+00,  ..., -8.5463e-01,
          -1.3567e+00,  5.0336e-01],
         [-9.2307e-01,  2.2581e-01,  1.4719e+00,  ..., -1.3455e+00,
          -1.0756e+00,  4.2485e-02],
         [-3.9946e-01, -2.4946e-01,  1.9602e+00,  ..., -8.4858e-01,
          -1.1761e+00,  2.4284e-01],
         ...,
         [-2.5001e-01, -2.5791e-01,  8.8925e-01,  ..., -4.5543e-01,
          -1.1812e+00,  1.9731e-01],
         [-1.3332e-01, -1.6209e-01,  8.7017e-01,  ..., -4.4412e-01,
          -1.1705e+00,  2.5546e-01],
         [-1.1183e-01, -2.8377e-01,  8.6390e-01,  ..., -4.0031e-01,
          -1.2038e+00,  2.4586e-01]],

        [[-1.5972e-01,  1.4743e-01,  7.8997e-01,  ..., -9.1521e-01,
          -1.6520e+00,  5.4084e-01],
         [-3.3597e-01,  2.4327e-01,  1.1927e+00,  ..., -1.0822e+00,
          -1.4735e+00,  1.3521e-01],
         [ 3.1547e-01,  5.4656e-02,  9.1885e-01,  ..., -8.4436e-01,
          -1.1599e+00,  3.6451e-01],
         ...,
         [-1.3240e-01, -1

In [6]:
embeddings.shape

torch.Size([4, 128, 768])

After we have produced our dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens are ignored.
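Written as a formula, the pooling described above looks roughly like this. The symbols are introduced here for illustration only: $\mathbf{h}_i$ is the embedding of token $i$, $m_i \in \{0, 1\}$ is its attention mask value, $T$ is the padded sequence length (128 in this notebook), and $\mathbf{s}$ is the resulting sentence embedding. The small constant in the denominator mirrors the clamp used later in the code to avoid dividing by zero.

```latex
\[
\mathbf{s} \;=\; \frac{\sum_{i=1}^{T} m_i \, \mathbf{h}_i}
                      {\max\!\left(\sum_{i=1}^{T} m_i,\; 10^{-9}\right)}
\]
```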

To perform this operation, we first expand our `attention_mask` tensor so that it matches the shape of `embeddings`:

In [7]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([4, 128])

In [8]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([4, 128, 768])
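To see what `unsqueeze(-1).expand(...)` does, here is a minimal sketch on a toy batch. The sizes (2 sequences, 3 token positions, hidden size 4) and the all-ones `embeddings` tensor are assumptions chosen purely for illustration; they are not the notebook's real data.

```python
import torch

# Toy batch (illustrative sizes): 2 sequences, 3 token positions, hidden size 4.
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 0, 0]])
embeddings = torch.ones(2, 3, 4)

# unsqueeze(-1) adds a trailing axis:      [2, 3]    -> [2, 3, 1]
# expand(...) broadcasts it to hidden dim: [2, 3, 1] -> [2, 3, 4]
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
print(mask.shape)  # torch.Size([2, 3, 4])

# Multiplying zeroes out every hidden value that belongs to a padding token.
masked = embeddings * mask
print(masked[0])   # last row (the padding position) is all zeros
```

Each 0 or 1 in the original mask is simply repeated across the hidden dimension, so an element-wise multiplication with the embeddings becomes possible.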

In [9]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 

Each vector above represents the attention mask of a single token: every token now has a vector of size 768 holding its `attention_mask` value (1 for real tokens, 0 for padding). We then multiply the two tensors to apply the attention mask:

In [10]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([4, 128, 768])

In [11]:
masked_embeddings

tensor([[[-0.2928, -0.0269,  1.2145,  ..., -0.8546, -1.3567,  0.5034],
         [-0.9231,  0.2258,  1.4719,  ..., -1.3455, -1.0756,  0.0425],
         [-0.3995, -0.2495,  1.9602,  ..., -0.8486, -1.1761,  0.2428],
         ...,
         [-0.0000, -0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [-0.0000, -0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [-0.0000, -0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000]],

        [[-0.1597,  0.1474,  0.7900,  ..., -0.9152, -1.6520,  0.5408],
         [-0.3360,  0.2433,  1.1927,  ..., -1.0822, -1.4735,  0.1352],
         [ 0.3155,  0.0547,  0.9188,  ..., -0.8444, -1.1599,  0.3645],
         ...,
         [-0.0000, -0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [-0.0000, -0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000]],

        [[-0.1937,  0.3952,  0.5603,  ..., -0.7847, -1.2638,  0.4324],
         [-0.2207, -0.4856,  1.4179,  ..., -0

Then we sum the remaining (unmasked) embeddings along axis `1`, the token axis:

In [12]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([4, 768])

Then we count how many tokens receive attention in each position of the tensor, clamping the count to a tiny positive minimum so that the division in the next step can never be by zero:

In [13]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([4, 768])

In [14]:
summed_mask

tensor([[57., 57., 57.,  ..., 57., 57., 57.],
        [52., 52., 52.,  ..., 52., 52., 52.],
        [61., 61., 61.,  ..., 61., 61., 61.],
        [16., 16., 16.,  ..., 16., 16., 16.]])

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [15]:
mean_pooled = summed / summed_mask

In [16]:
mean_pooled

tensor([[-0.3085,  0.1475,  1.3810,  ..., -0.8949, -1.3626,  0.4667],
        [-0.1856,  0.2541,  1.1287,  ..., -0.9815, -1.5834,  0.5147],
        [-0.0906,  0.4047,  0.8533,  ..., -0.5820, -1.4343,  0.3572],
        [-0.0132,  0.9773,  1.4516,  ..., -0.8462, -1.4004, -0.4118]],
       grad_fn=<DivBackward0>)

And that is how we calculate our sentence embeddings: one dense vector per sentence, which we can now compare for similarity.
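The pooling steps can also be collected into one helper. This is a minimal sketch rather than a library function: the name `mean_pool` and the toy tensors are hypothetical, and the body only repeats the operations already shown in this notebook.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings per sequence, ignoring padding positions.

    `mean_pool` is a hypothetical helper name used only for illustration.
    """
    # Broadcast the [batch, seq_len] mask to [batch, seq_len, hidden].
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # Sum the embeddings of real tokens only.
    summed = torch.sum(last_hidden_state * mask, dim=1)
    # Count real tokens; the clamp guards against division by zero.
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts


# Toy check: the third token is padding, so its large values must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attn = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, attn)
print(pooled)  # tensor([[2., 3.]])
```

With the notebook's variables, the equivalent call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.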

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

Let's calculate the cosine similarity between sentence `0` and each of the remaining sentences:

In [18]:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

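array([[0.9675461 , 0.90401113, 0.4043299 ]], dtype=float32)

The three scores compare sentence `0` with sentences `1`, `2` and `3` respectively. The close paraphrase (sentence `1`) scores highest at about 0.97, sentence `2`, which covers the same product, scores about 0.90, and the unrelated sentence `3` scores only about 0.40.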
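For intuition about what `cosine_similarity` computes, here is a small pure-Python sketch of the underlying formula: the dot product of two vectors divided by the product of their lengths. The helper name `cosine` and the two-dimensional example vectors are assumptions made for illustration; the real sentence embeddings have 768 dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 0.0]
b = [1.0, 1.0]   # 45 degrees away from a
c = [0.0, 2.0]   # perpendicular to a

print(round(cosine(a, b), 4))  # 0.7071
print(cosine(a, c))            # 0.0
```

Because the score depends only on the angle between vectors and not on their lengths, vectors pointing in the same direction score 1 and perpendicular vectors score 0.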