## Embeddings

In this notebook, we will create sentence embeddings with **sentence-transformers**.

Throughout this notebook, we will learn:
- What embeddings are
- What the applications of embeddings are

The code in the notebook is loosely based on [hackerllama's blog post](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/).

## Setup

In [None]:
%%capture
!pip install sentence-transformers==3.4.1

## Sentence transformers

Let's look at one kind of embedding through the Python package `sentence-transformers`.

### Downloading a pretrained model

We will download the `all-MiniLM-L6-v2` model. You can find other models by [selecting the `sentence-similarity` task on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=sentence-similarity).

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


## What problems are we trying to solve with embeddings?

Computers cannot consume text as is: we must find a numeric representation of the text in order to process it.

We also want to answer questions such as:

* How similar are two pieces of text?
* How can we find neighbors? For example, which texts are most relevant to the one in question?
    - semantic search (vector search) as opposed to keyword search
    - finding the top k neighbors amounts to ranking

By turning texts into vectors, we also get the following benefits:

* We can do vector algebra on texts!
* We can turn unstructured text data into structured features for other models, e.g. for predictive modeling
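
The similarity and nearest-neighbor questions above usually come down to comparing vectors with cosine similarity and sorting by score. The following is a minimal pure-Python sketch of that idea. The 3-dimensional vectors and the names `cosine_similarity`, `top_k`, `toy_corpus` and `toy_query` are made up for illustration; they are not real model outputs, which would have 384 dimensions for `all-MiniLM-L6-v2`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
toy_corpus = {
    "The weather today is beautiful": [0.9, 0.1, 0.0],
    "It's raining!": [0.7, 0.3, 0.1],
    "Dogs are awesome": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]  # hypothetical embedding of a weather-related query

ranked = top_k(toy_query, toy_corpus, k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

With real embeddings the same ranking logic applies; only the vectors change.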

## Two ways of producing sentence embeddings

1. Through the `sentence-transformers` library
2. Through the `transformers` library, pooling the output ourselves

In [6]:
# util provides helpers such as cosine similarity, used later in this notebook
from sentence_transformers import util

In [49]:
sentences = ["The weather today is beautiful", "It's raining!", "Dogs are awesome"]
# Encode only the first sentence; the result is a 1-D NumPy vector
embeddings = model.encode(sentences[0])
embeddings.shape

(384,)

In [50]:
embeddings

array([ 5.05578984e-03,  1.10878065e-01,  1.68528676e-01,  6.66609406e-02,
        3.34140696e-02, -3.65244597e-02,  5.89507632e-02, -1.01468720e-01,
       -4.48783115e-02,  9.43513028e-03, -4.34322357e-02, -7.85938837e-03,
        2.71036495e-02,  1.10919038e-02,  2.93265730e-02,  5.26361838e-02,
       -1.59959644e-02, -1.64193697e-02, -2.30436828e-02,  4.32102941e-02,
       -1.31010517e-01,  7.62632489e-02, -8.19776952e-02,  7.98825026e-02,
       -1.11006061e-02,  6.51216283e-02,  6.73978310e-03,  4.09533828e-02,
        2.07324177e-02, -1.83034968e-02, -3.09522133e-02,  1.08262142e-02,
        2.86188927e-02, -3.21798585e-02, -3.74149270e-02, -2.83376630e-02,
       -1.94936022e-02, -1.68013453e-01, -6.04415908e-02,  5.96985407e-02,
       -1.75287295e-02, -4.80048321e-02, -2.60541197e-02,  5.32719865e-03,
        5.67740994e-03, -3.82829159e-02, -4.57524434e-02,  4.04658094e-02,
        1.25757933e-01, -8.11213907e-03, -6.56668842e-02,  2.46861242e-02,
       -5.86222373e-02, -

In [63]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer2 = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model2 = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

In [64]:
encoded = tokenizer2(sentences[0], return_tensors="pt")
embeddings2 = model2(**encoded)

In [88]:
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings: (batch, tokens, hidden)
    token_embeddings = model_output[0]
    # Broadcast the mask over the hidden dimension so padded tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over tokens, then divide by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
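
To see why the attention mask matters in mean pooling, here is a minimal pure-Python sketch with toy numbers. The single sentence used in this notebook has no padding (its attention mask is all ones), so the mask only makes a difference once sentences of different lengths are batched together. The function `masked_mean_pool` and the toy values are illustrative assumptions, not part of either library.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    n_real = max(sum(attention_mask), 1)  # guard against an all-padding row
    return [total / n_real for total in totals]


# Toy 2-dimensional "token embeddings"; the last row plays the role of padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = masked_mean_pool(toy_tokens, toy_mask)                   # padding ignored
naive = [sum(col) / len(toy_tokens) for col in zip(*toy_tokens)]  # padding included
print(pooled)
print(naive)
```

The naive average is pulled toward the padding row, while the masked version only reflects the real tokens.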

In [90]:
result2 = F.normalize(mean_pooling(embeddings2, encoded['attention_mask']))
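
`F.normalize` rescales each row to unit L2 norm by default, so any positive constant factor applied to the pooled vector beforehand cancels out. The sketch below illustrates this with toy numbers in plain Python; `l2_normalize` is a hypothetical helper written for this example, not a library function.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


token_sum = [3.0, 4.0]                   # toy sum over two token vectors
token_mean = [x / 2 for x in token_sum]  # the corresponding mean

unit_from_sum = l2_normalize(token_sum)
unit_from_mean = l2_normalize(token_mean)

print(unit_from_sum)
print(unit_from_mean)
```

Both calls point in the same direction and have length 1, which is why a summed and an averaged pooling output become indistinguishable after normalization.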

In [96]:
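# With a single unpadded sentence the attention mask is all ones,
# so a plain mean over the token dimension gives the same pooled vector
result2 = F.normalize(torch.mean(embeddings2.last_hidden_state, dim=1))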

In [97]:
result1 = torch.from_numpy(embeddings)
torch.equal(result2, result1[None,:])

False

In [101]:
result1[:10]

tensor([ 0.0051,  0.1109,  0.1685,  0.0667,  0.0334, -0.0365,  0.0590, -0.1015,
        -0.0449,  0.0094])

In [104]:
result2[0,:10]

tensor([ 0.0051,  0.1109,  0.1685,  0.0667,  0.0334, -0.0365,  0.0590, -0.1015,
        -0.0449,  0.0094], grad_fn=<SliceBackward0>)

In [110]:
torch.allclose(result2, result1[None,:],rtol=1e-4)

True
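
The difference between the two checks is that `torch.equal` requires bit-identical values, while `torch.allclose` accepts elementwise differences up to `atol + rtol * |other|`. Here is a small pure-Python sketch of that tolerance rule. The helper `all_close` is written for this example only, and the sample numbers are the first three entries of the two `embeddings` printouts in this notebook.

```python
def all_close(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise |x - y| <= atol + rtol * |y|, the rule torch.allclose applies."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# First three values from the two `embeddings` printouts in this notebook.
first_run = [5.05578984e-03, 1.10878065e-01, 1.68528676e-01]
second_run = [5.05583361e-03, 1.10878065e-01, 1.68528691e-01]

exactly_equal = first_run == second_run
close_enough = all_close(first_run, second_run, rtol=1e-4)

print(exactly_equal)
print(close_enough)
```

Small discrepancies in the last few digits are typical of floating-point computation, so a tolerance-based comparison is usually the more meaningful check.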

In [111]:
# for all practical purposes they are the same
util.pytorch_cos_sim(result1, result2)

tensor([[1.0000]], grad_fn=<MmBackward0>)

In [44]:
encoded

{'input_ids': tensor([[ 101, 1996, 4633, 2651, 2003, 3376,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [58]:
embeddings2 = model2(**encoded)

In [59]:
embeddings2.last_hidden_state.shape

torch.Size([1, 7, 384])

In [60]:
embeddings2[0]

tensor([[[-0.1274,  0.5167,  0.6653,  ...,  0.1483, -0.3038, -0.0985],
         [-0.0433,  0.4315,  0.5239,  ..., -0.0653, -0.2210,  0.2986],
         [-0.4177,  0.9717,  1.6749,  ...,  0.1541, -1.2018,  0.7160],
         ...,
         [ 0.0998,  0.4436,  0.8043,  ..., -0.0972, -0.1734,  0.0320],
         [ 0.2365,  0.6064,  0.9382,  ...,  0.3798,  0.6692, -0.5391],
         [ 0.0573,  0.5393,  0.4884,  ..., -0.2348, -0.4291,  0.1329]]],
       grad_fn=<NativeLayerNormBackward0>)

In [61]:
embeddings2.last_hidden_state

tensor([[[-0.1274,  0.5167,  0.6653,  ...,  0.1483, -0.3038, -0.0985],
         [-0.0433,  0.4315,  0.5239,  ..., -0.0653, -0.2210,  0.2986],
         [-0.4177,  0.9717,  1.6749,  ...,  0.1541, -1.2018,  0.7160],
         ...,
         [ 0.0998,  0.4436,  0.8043,  ..., -0.0972, -0.1734,  0.0320],
         [ 0.2365,  0.6064,  0.9382,  ...,  0.3798,  0.6692, -0.5391],
         [ 0.0573,  0.5393,  0.4884,  ..., -0.2348, -0.4291,  0.1329]]],
       grad_fn=<NativeLayerNormBackward0>)

In [56]:
import torch

In [57]:
torch.mean(embeddings2.last_hidden_state,1)

tensor([[ 2.6720e-02,  5.8597e-01,  8.9065e-01,  3.5229e-01,  1.7659e-01,
         -1.9303e-01,  3.1154e-01, -5.3625e-01, -2.3717e-01,  4.9863e-02,
         -2.2953e-01, -4.1536e-02,  1.4324e-01,  5.8619e-02,  1.5499e-01,
          2.7817e-01, -8.4536e-02, -8.6773e-02, -1.2178e-01,  2.2836e-01,
         -6.9237e-01,  4.0304e-01, -4.3324e-01,  4.2217e-01, -5.8665e-02,
          3.4416e-01,  3.5619e-02,  2.1643e-01,  1.0957e-01, -9.6731e-02,
         -1.6358e-01,  5.7214e-02,  1.5125e-01, -1.7007e-01, -1.9773e-01,
         -1.4976e-01, -1.0302e-01, -8.8792e-01, -3.1942e-01,  3.1550e-01,
         -9.2637e-02, -2.5370e-01, -1.3769e-01,  2.8153e-02,  3.0004e-02,
         -2.0232e-01, -2.4179e-01,  2.1386e-01,  6.6461e-01, -4.2871e-02,
         -3.4704e-01,  1.3046e-01, -3.0981e-01, -4.9183e-01, -1.5027e-01,
          3.9937e-01,  1.3859e-02, -4.2963e-01,  4.3013e-01, -2.8581e-03,
          1.8555e-01,  5.1034e-01, -1.5708e-01,  2.0041e-01,  4.8996e-01,
         -1.8586e-01, -4.0722e-01,  1.

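Question: why are the two results not exactly identical? The vectors agree to within floating-point tolerance (`torch.allclose` returns `True` with `rtol=1e-4`) even though `torch.equal` returns `False`. See https://github.com/UKPLab/sentence-transformers/issues/2261 for a related discussion.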

In [76]:
# Note that the tokenizer output (and therefore the attention_mask) includes the [CLS] and [SEP] special tokens
tokenizer2.convert_ids_to_tokens(tokenizer2(sentences[0])['input_ids'])

['[CLS]', 'the', 'weather', 'today', 'is', 'beautiful', '[SEP]']

In [70]:
[_ for _ in dir(tokenizer2) if 'convert' in _]

['_convert_encoding',
 '_convert_id_to_token',
 '_convert_token_to_id_with_added_voc',
 'convert_added_tokens',
 'convert_ids_to_tokens',
 'convert_tokens_to_ids',
 'convert_tokens_to_string']

In [5]:
embeddings[0]

array([ 5.05583361e-03,  1.10878065e-01,  1.68528691e-01,  6.66609332e-02,
        3.34140547e-02, -3.65244001e-02,  5.89507520e-02, -1.01468772e-01,
       -4.48782966e-02,  9.43512842e-03, -4.34322618e-02, -7.85935670e-03,
        2.71035992e-02,  1.10919140e-02,  2.93265451e-02,  5.26362360e-02,
       -1.59960240e-02, -1.64193064e-02, -2.30436642e-02,  4.32102829e-02,
       -1.31010488e-01,  7.62632415e-02, -8.19776654e-02,  7.98824579e-02,
       -1.11006787e-02,  6.51216283e-02,  6.73982082e-03,  4.09533717e-02,
        2.07324307e-02, -1.83034819e-02, -3.09522282e-02,  1.08261872e-02,
        2.86188740e-02, -3.21798660e-02, -3.74149270e-02, -2.83376593e-02,
       -1.94936134e-02, -1.68013498e-01, -6.04416355e-02,  5.96984737e-02,
       -1.75287910e-02, -4.80048172e-02, -2.60541737e-02,  5.32722427e-03,
        5.67738712e-03, -3.82828526e-02, -4.57523502e-02,  4.04657796e-02,
        1.25757813e-01, -8.11220519e-03, -6.56668395e-02,  2.46861074e-02,
       -5.86223155e-02, -