# Contextual Embeddings and Semantic Search Engines with Transformers

# Contextual Embeddings in Transformers

Contextual embeddings differ from plain (static) word embeddings, where each word always maps to the same vector.

Because of self-attention, each word's embedding is influenced by the words around it.

For example:

- They are going to **fire** him
- He is sitting by the **fire**

With a transformer model, the word **fire** gets a different embedding vector in each sentence because the surrounding words differ.

You can also average the word embeddings in a document to get an overall document embedding.
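As a minimal sketch of that averaging step, assuming made-up 4-dimensional word vectors rather than real model output, mean pooling is just an average over the word axis:

```python
import numpy as np

# Toy 4-dimensional word vectors (made-up values, for illustration only).
word_vectors = np.array([
    [1.0, 0.0, 2.0, 0.0],   # "the"
    [0.0, 2.0, 0.0, 2.0],   # "cat"
    [2.0, 1.0, 1.0, 1.0],   # "runs"
])

# Mean pooling: average over the word axis to get one vector per document.
doc_vector = word_vectors.mean(axis=0)
print(doc_vector)  # [1. 1. 1. 1.]
```

The document vector has the same dimensionality as each word vector, whatever the number of words.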

Let's start by taking a few different documents.

In [1]:
import numpy as np

In [2]:
docs = ['The cat is running on the stairs',
        'A woman is eating a piece of bread.',
        'A man is eating a pasta.']

docs

['The cat is running on the stairs',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

## Load a pre-trained BERT Transformer model

Here we will use a pre-trained transformer model to get contextual word embeddings and average them to get document embeddings.

![](https://i.imgur.com/4uYtfkQ.png)

Then we can compute the cosine similarity between the document embeddings for each pair of the above documents.

In [3]:
from transformers import AutoTokenizer, AutoModel

# model details: https://huggingface.co/google-bert/bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

## Tokenizing Documents

Transformers are trained on tokenized text, so each model comes with its own trained tokenizer, which breaks a document into individual tokens (whole words or subword pieces). Each token has a corresponding embedding.
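BERT's tokenizer uses WordPiece, which splits a word it does not know as a whole into smaller known pieces. The sketch below illustrates the greedy longest-match-first idea in plain Python. The helper `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical stand-ins, not the library's implementation; the real `bert-base-uncased` vocabulary has about 30k entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"em", "##bed", "##ding", "##s", "cat"}

print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("cat", toy_vocab))         # ['cat']
print(wordpiece_tokenize("dog", toy_vocab))         # ['[UNK]']
```

Common words usually survive as a single token, while rarer words are split into several pieces, each with its own embedding.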

REMEMBER: We are not training any models here. We take a model that has already been pre-trained on a lot of data and get embeddings for our documents by passing them through it.

In [4]:
docs[0]

'The cat is running on the stairs'

In [5]:
token_ids = tokenizer(docs[0], return_tensors='pt')
token_ids

{'input_ids': tensor([[ 101, 1996, 4937, 2003, 2770, 2006, 1996, 5108,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The key elements of the tokenizer output above are:

#### Input IDs
The input IDs are often the only required parameters to be passed to the model as input. They are token indices: numerical representations of the tokens that make up the sequence the model will use as input. __In most cases, this is the element you need to care about.__

<br/>

#### Attention mask
The attention mask is an optional argument used when batching sequences together. It tells the model which tokens should be attended to (1) and which should be ignored (0), typically the padding tokens added so that every sequence in a batch has the same length.
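The mask also matters when averaging token embeddings for a padded batch, since padded positions should not contribute to the mean. A minimal sketch, assuming NumPy arrays standing in for model output; `masked_mean_pool` is a hypothetical helper, not a library function:

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                     # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences padded to length 3, embedding size 2 (made-up values).
token_embeddings = np.array([
    [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]],   # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],   # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # first row is [2. 4.], second row is [4. 4.]
```

In this notebook each document is passed through the model on its own, so every mask is all ones and a plain mean gives the same result.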

<br/>

#### Token Type IDs
Some models’ purpose is to do sequence classification or question answering. These require two different sequences to be joined in a single “input_ids” entry, which is usually done with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. The token type IDs then mark which of the two sequences each token belongs to (0 for the first, 1 for the second); with a single sequence, as above, they are all 0. For example, the BERT model builds its two-sequence input as such:

```
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```

https://huggingface.co/transformers/glossary.html#token-type-ids
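To make the two-sequence layout concrete, here is a small plain-Python sketch. `build_pair_input` is a hypothetical helper that works on already-split tokens; in practice the tokenizer builds this for you when given two texts.

```python
def build_pair_input(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair_input(
    ["where", "is", "the", "cat"],
    ["on", "the", "stairs"],
)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```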

In [6]:
token_ids

{'input_ids': tensor([[ 101, 1996, 4937, 2003, 2770, 2006, 1996, 5108,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [7]:
[tokenizer.decode(token_id) for token_id in token_ids['input_ids'][0]]

['[CLS]', 'the', 'cat', 'is', 'running', 'on', 'the', 'stairs', '[SEP]']

In [8]:
# index 0 of the model output is the last hidden state: one vector per token
contextual_embeddings = model(**token_ids)[0]
contextual_embeddings

tensor([[[-0.0452,  0.3938,  0.0324,  ..., -0.2268,  0.2357,  0.3701],
         [-0.1922, -0.1978, -0.0488,  ..., -0.3804,  0.9541, -0.4337],
         [-0.2737,  0.0130,  0.5933,  ..., -0.4942,  0.6265,  0.8316],
         ...,
         [-0.1404, -0.4727, -0.1985,  ..., -0.2306,  0.0506, -0.9043],
         [ 0.1981,  0.2794, -0.0747,  ...,  0.2561,  0.0174, -0.9420],
         [ 0.7610,  0.5504, -0.0924,  ..., -0.0937, -0.3253, -0.3800]]],
       grad_fn=<NativeLayerNormBackward0>)

In [9]:
contextual_embeddings.shape

torch.Size([1, 9, 768])
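The `grad_fn=...` in the output above shows that PyTorch recorded an autograd graph during the forward pass. Since nothing is being trained here, inference is commonly wrapped in `torch.no_grad()`. The sketch below shows the difference, using a single linear layer (`toy_model`, a hypothetical stand-in) instead of the pre-trained model:

```python
import torch

# Stand-in for a pre-trained model: a single linear layer (hypothetical, for illustration).
toy_model = torch.nn.Linear(4, 2)
toy_input = torch.ones(1, 3, 4)  # (batch, tokens, features)

tracked = toy_model(toy_input)          # autograd graph is recorded
with torch.no_grad():
    untracked = toy_model(toy_input)    # no graph: less memory, no grad_fn

print(tracked.grad_fn is not None)   # True
print(untracked.grad_fn is None)     # True
print(untracked.numpy().shape)       # (1, 3, 2)
```

Tensors produced under `torch.no_grad()` can be converted with `.numpy()` directly, whereas tensors that track gradients need `.detach()` first.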

In [10]:
import torch

# average over the token axis (dim=1), then take the single item in the batch
doc_embedding = torch.mean(contextual_embeddings, dim=1)[0]
doc_embedding

tensor([ 4.1116e-02, -7.1035e-03,  1.0905e-01, -2.1075e-01,  3.9004e-01,
        -2.8752e-01,  1.3906e-01,  4.6047e-01, -6.5605e-01, -2.0786e-01,
         5.2294e-02, -3.6576e-01, -1.3718e-01,  1.6120e-01, -9.5343e-02,
        -1.4518e-01,  2.6929e-01,  7.3588e-02, -2.1862e-01,  1.2493e-01,
        -1.5276e-01, -4.5335e-02, -1.4030e-01,  3.0250e-01,  3.0361e-01,
        -2.3097e-01,  8.7902e-02,  1.5944e-01, -2.9857e-03, -3.7663e-01,
         4.1330e-01,  1.6108e-01, -2.8776e-02, -1.8680e-01,  1.4490e-01,
         3.0399e-01,  2.6641e-01, -2.9455e-01, -4.8678e-01,  1.2412e-01,
        -3.9100e-01, -2.1390e-02, -1.7335e-01,  2.9154e-01,  1.9289e-01,
        -3.2460e-01,  4.0478e-01, -9.4743e-01,  4.5585e-01, -4.8945e-01,
        -2.3037e-01,  4.3064e-01, -2.6569e-02,  7.2616e-02, -1.1596e-01,
         4.7561e-01,  5.9488e-03, -2.4535e-01, -1.6376e-01, -2.1947e-01,
        -1.0907e-01, -1.6344e-01,  1.9193e-02, -3.6330e-01, -3.1556e-02,
         9.6432e-02,  3.4471e-02,  3.5154e-01, -4.9

In [11]:
doc_embedding.shape

torch.Size([768])

In [12]:
docs

['The cat is running on the stairs',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

In [13]:
doc_token_ids = [tokenizer(doc, return_tensors='pt') for doc in docs]
doc_token_ids

[{'input_ids': tensor([[ 101, 1996, 4937, 2003, 2770, 2006, 1996, 5108,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 {'input_ids': tensor([[ 101, 1037, 2450, 2003, 5983, 1037, 3538, 1997, 7852, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 {'input_ids': tensor([[  101,  1037,  2158,  2003,  5983,  1037, 24857,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}]

In [14]:
doc_contextual_embeddings = [model(**token_ids)[0] for token_ids in doc_token_ids]
doc_contextual_embeddings

[tensor([[[-0.0452,  0.3938,  0.0324,  ..., -0.2268,  0.2357,  0.3701],
          [-0.1922, -0.1978, -0.0488,  ..., -0.3804,  0.9541, -0.4337],
          [-0.2737,  0.0130,  0.5933,  ..., -0.4942,  0.6265,  0.8316],
          ...,
          [-0.1404, -0.4727, -0.1985,  ..., -0.2306,  0.0506, -0.9043],
          [ 0.1981,  0.2794, -0.0747,  ...,  0.2561,  0.0174, -0.9420],
          [ 0.7610,  0.5504, -0.0924,  ..., -0.0937, -0.3253, -0.3800]]],
        grad_fn=<NativeLayerNormBackward0>),
 tensor([[[ 0.1430,  0.5433, -0.7770,  ...,  0.0536,  0.1999,  0.0249],
          [ 0.2386,  0.3652, -0.9208,  ..., -0.3724, -0.3153,  0.0562],
          [ 0.1720, -0.5397, -0.2436,  ..., -0.7923,  0.0735,  0.1566],
          ...,
          [ 0.4039,  0.0170, -0.3467,  ...,  0.0282, -0.3911, -0.5297],
          [-0.3470, -0.4788, -0.5005,  ...,  0.4639,  0.0966, -0.5437],
          [ 0.6441,  0.3791, -0.5053,  ..., -0.2455, -0.5596, -0.2091]]],
        grad_fn=<NativeLayerNormBackward0>),
 tensor([[[-

In [15]:
doc_embeddings = [torch.mean(contextual_embeddings, dim=1)[0] for contextual_embeddings in doc_contextual_embeddings]
doc_embeddings

[tensor([ 4.1116e-02, -7.1035e-03,  1.0905e-01, -2.1075e-01,  3.9004e-01,
         -2.8752e-01,  1.3906e-01,  4.6047e-01, -6.5605e-01, -2.0786e-01,
          5.2294e-02, -3.6576e-01, -1.3718e-01,  1.6120e-01, -9.5343e-02,
         -1.4518e-01,  2.6929e-01,  7.3588e-02, -2.1862e-01,  1.2493e-01,
         -1.5276e-01, -4.5335e-02, -1.4030e-01,  3.0250e-01,  3.0361e-01,
         -2.3097e-01,  8.7902e-02,  1.5944e-01, -2.9857e-03, -3.7663e-01,
          4.1330e-01,  1.6108e-01, -2.8776e-02, -1.8680e-01,  1.4490e-01,
          3.0399e-01,  2.6641e-01, -2.9455e-01, -4.8678e-01,  1.2412e-01,
         -3.9100e-01, -2.1390e-02, -1.7335e-01,  2.9154e-01,  1.9289e-01,
         -3.2460e-01,  4.0478e-01, -9.4743e-01,  4.5585e-01, -4.8945e-01,
         -2.3037e-01,  4.3064e-01, -2.6569e-02,  7.2616e-02, -1.1596e-01,
          4.7561e-01,  5.9488e-03, -2.4535e-01, -1.6376e-01, -2.1947e-01,
         -1.0907e-01, -1.6344e-01,  1.9193e-02, -3.6330e-01, -3.1556e-02,
          9.6432e-02,  3.4471e-02,  3.

In [16]:
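# store the rounded NumPy copy under a new name so doc_embeddings keeps the original tensors
rounded_doc_embeddings = [np.round(doc_embedding.detach().numpy(), 3) for doc_embedding in doc_embeddings]
rounded_doc_embeddings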

[array([ 4.100e-02, -7.000e-03,  1.090e-01, -2.110e-01,  3.900e-01,
        -2.880e-01,  1.390e-01,  4.600e-01, -6.560e-01, -2.080e-01,
         5.200e-02, -3.660e-01, -1.370e-01,  1.610e-01, -9.500e-02,
        -1.450e-01,  2.690e-01,  7.400e-02, -2.190e-01,  1.250e-01,
        -1.530e-01, -4.500e-02, -1.400e-01,  3.020e-01,  3.040e-01,
        -2.310e-01,  8.800e-02,  1.590e-01, -3.000e-03, -3.770e-01,
         4.130e-01,  1.610e-01, -2.900e-02, -1.870e-01,  1.450e-01,
         3.040e-01,  2.660e-01, -2.950e-01, -4.870e-01,  1.240e-01,
        -3.910e-01, -2.100e-02, -1.730e-01,  2.920e-01,  1.930e-01,
        -3.250e-01,  4.050e-01, -9.470e-01,  4.560e-01, -4.890e-01,
        -2.300e-01,  4.310e-01, -2.700e-02,  7.300e-02, -1.160e-01,
         4.760e-01,  6.000e-03, -2.450e-01, -1.640e-01, -2.190e-01,
        -1.090e-01, -1.630e-01,  1.900e-02, -3.630e-01, -3.200e-02,
         9.600e-02,  3.400e-02,  3.520e-01, -4.970e-01,  2.800e-01,
        -9.900e-02, -3.170e-01,  1.890e-01, -1.7

In [20]:
doc_embeds = [tensor.detach().numpy() for tensor in doc_embeddings]
doc_embeds[0]

array([ 4.11159582e-02, -7.10348971e-03,  1.09050065e-01, -2.10750714e-01,
        3.90038848e-01, -2.87515521e-01,  1.39057666e-01,  4.60469842e-01,
       -6.56053305e-01, -2.07862318e-01,  5.22938184e-02, -3.65757495e-01,
       -1.37175635e-01,  1.61198154e-01, -9.53429043e-02, -1.45177171e-01,
        2.69286305e-01,  7.35878125e-02, -2.18623519e-01,  1.24928027e-01,
       -1.52758420e-01, -4.53345850e-02, -1.40297145e-01,  3.02497655e-01,
        3.03614914e-01, -2.30969965e-01,  8.79023820e-02,  1.59441903e-01,
       -2.98574241e-03, -3.76628131e-01,  4.13295984e-01,  1.61082000e-01,
       -2.87761893e-02, -1.86801612e-01,  1.44900113e-01,  3.03987294e-01,
        2.66413897e-01, -2.94546932e-01, -4.86776710e-01,  1.24124609e-01,
       -3.91000450e-01, -2.13896967e-02, -1.73348904e-01,  2.91543603e-01,
        1.92887455e-01, -3.24603766e-01,  4.04782146e-01, -9.47434127e-01,
        4.55847740e-01, -4.89449233e-01, -2.30367526e-01,  4.30635810e-01,
       -2.65689660e-02,  

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(doc_embeds)

array([[1.        , 0.6412004 , 0.6671486 ],
       [0.6412004 , 0.9999995 , 0.9237186 ],
       [0.6671486 , 0.9237186 , 0.99999994]], dtype=float32)
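In the matrix above, the two "eating" sentences score about 0.92 with each other, while the cat sentence scores about 0.64 to 0.67 against either of them. The diagonal entries that fall slightly below 1 are float32 rounding. What `cosine_similarity` computes can be sketched in NumPy as follows; `pairwise_cosine` is a hypothetical helper and the vectors are made up:

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows in a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms          # scale each row to length 1
    return unit_rows @ unit_rows.T      # dot products of unit vectors

vectors = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
])
sims = pairwise_cosine(vectors)
print(np.round(sims, 3))
```

Here the first two rows are orthogonal (similarity 0), and the third sits at 45 degrees to both (similarity about 0.707). Vector length does not affect the score, only direction.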

In [22]:
docs

['The cat is running on the stairs',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

# Finding semantic similarity with Pre-trained Embeddings

Here we will leverage pre-trained deep learning embedding models to extract embeddings from sentences and measure their semantic similarity.

Models we will look at:

- Transformers

# Semantic Search

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. Unlike traditional search engines, which only find documents based on lexical matches, semantic search can also find documents that use synonyms or different wording.


The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.

![](https://i.imgur.com/FyUCkG5.png)

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

# Fun with Embeddings: Simple Search Engine!

Let's create a corpus of documents to run text searches against.

In [23]:
documents = [
  "Quantum mechanics describes the behavior of very small particles.",
  "Photosynthesis is the process by which green plants make food using sunlight.",
  "Shakespeare's plays are a testament to English literature.",
  "Artificial Intelligence aims to create machines that can think and learn.",
  "The pyramids of Egypt are historical monuments that have stood for thousands of years.",
  "Biology is the study of living organisms and their interactions with the environment.",
  "Music therapy can aid in the mental well-being of individuals.",
  "The Milky Way is just one of billions of galaxies in the universe.",
  "Economic theories help understand the distribution of resources in society.",
  "Yoga is an ancient practice that involves physical postures and meditation."
]

## Building Robust Semantic Search Engines with Transformers

## Load Dependencies

In [24]:
from sentence_transformers import SentenceTransformer, util
import torch

## Load Pre-trained Transformer Model

In [25]:
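# Model loaded below: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
# It is a sentence-embedding model fine-tuned from https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base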

model = SentenceTransformer('all-MiniLM-L12-v2')

In [26]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
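The printout shows three stages: the transformer, mean pooling over tokens, and a final `Normalize()` step that scales each embedding to unit length. One consequence, sketched below with random vectors standing in for real embeddings, is that the dot product of two normalized embeddings equals their cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)   # random stand-ins for two 384-dimensional embeddings
b = rng.normal(size=384)

# Normalize step: scale each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))  # True
```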

## Create a corpus of documents

In [27]:
documents = ['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
 'Biology is the study of living organisms and their interactions with the environment.',
 'Music therapy can aid in the mental well-being of individuals.',
 'The Milky Way is just one of billions of galaxies in the universe.',
 'Economic theories help understand the distribution of resources in society.',
 'Yoga is an ancient practice that involves physical postures and meditation.']

In [28]:
len(documents)

10

## Get document embeddings

In [29]:
document_embeddings = model.encode(documents)

In [30]:
document_embeddings.shape

(10, 384)

In [31]:
document_embeddings, document_embeddings.shape

(array([[ 0.01569201,  0.02793595,  0.01044164, ...,  0.05586387,
         -0.06863528,  0.04651126],
        [-0.03118527,  0.0311649 ,  0.01921584, ..., -0.03559605,
          0.04873832,  0.0257752 ],
        [ 0.01772625, -0.03875764,  0.1274249 , ...,  0.06566846,
          0.05715034, -0.05221867],
        ...,
        [ 0.09988892, -0.0422441 ,  0.0405912 , ..., -0.05507796,
          0.03537546, -0.07602308],
        [-0.01405636, -0.00320507, -0.01274007, ...,  0.01621081,
          0.00849567,  0.00209842],
        [-0.04375401, -0.00201986, -0.00192908, ...,  0.08587869,
         -0.00045797, -0.02675952]], dtype=float32),
 (10, 384))

In [32]:
documents[0]

'Quantum mechanics describes the behavior of very small particles.'

In [33]:
document_embeddings[0].shape

(384,)

In [34]:
document_embeddings[0]

array([ 1.56920142e-02,  2.79359464e-02,  1.04416404e-02,  5.38145676e-02,
       -6.64291605e-02, -1.10787347e-01,  3.04780640e-02,  2.54000183e-02,
       -5.17279468e-03,  3.58387381e-02,  3.20614353e-02,  5.82041638e-03,
       -8.15961044e-03,  6.18610978e-02,  5.40908948e-02,  7.69704534e-03,
        6.66749179e-02, -4.58203107e-02, -4.35077511e-02, -6.01347489e-03,
       -3.49232852e-02,  3.26351300e-02, -5.32207489e-02,  1.72806419e-02,
       -1.03645690e-01, -4.15866934e-02,  3.19175906e-02, -2.92218048e-02,
        2.84545645e-02, -1.93595197e-02, -6.80248486e-03,  3.82997915e-02,
        2.14443356e-02,  5.28521873e-02, -9.52145457e-03, -3.56447101e-02,
        2.97932196e-02, -7.39673898e-02, -5.50745428e-02,  1.35463271e-02,
        3.38974223e-02,  4.48159985e-02,  2.14401409e-02,  6.01723790e-02,
        1.87074672e-02, -1.51426755e-02,  1.95753239e-02, -1.35566257e-02,
       -5.95715865e-02, -3.25450189e-02, -1.14724822e-01,  1.16387075e-02,
       -3.54244448e-02,  

## Let's try to find the most similar document for one query

### New Query

In [35]:
new_text = 'What is AI?'
new_text

'What is AI?'

### Get Embedding for New Query

In [36]:
new_text_embedding = model.encode(new_text)
new_text_embedding.shape

(384,)

### Get Cosine Similarity Scores of Document Embeddings Compared to New Query Embedding

In [37]:
cos_scores = util.pytorch_cos_sim(new_text_embedding, document_embeddings)[0]
cos_scores

tensor([ 5.4059e-02, -4.1397e-02,  6.0208e-02,  6.1713e-01,  4.6174e-04,
         1.1132e-01,  6.8011e-02, -4.1013e-02,  1.3222e-01,  6.8518e-02])

### Get Most Similar Document ID

In [38]:
top_results = torch.topk(cos_scores, k=1)
top_results

torch.return_types.topk(
values=tensor([0.6171]),
indices=tensor([3]))

In [39]:
idx = top_results.indices[0]
idx

tensor(3)

### Get Most Similar Document

In [40]:
documents[idx]

'Artificial Intelligence aims to create machines that can think and learn.'

## Create a function to return the top similar document based on any query

In [41]:
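def semantic_search_engine(query, embedder_model):
  # relies on the global `documents` list and its precomputed `document_embeddings`
  query_embedding = embedder_model.encode(query)
  # cosine similarity between the query and every document embedding
  cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
  # index of the highest-scoring document
  top_results = torch.topk(cos_scores, k=1)
  idx = top_results.indices[0]
  return documents[idx]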

## Try out the function

In [42]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [43]:
new_sentence = 'Do you know about the pyramids?'
semantic_search_engine(new_sentence, model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [44]:
new_sentence = 'How do plants survive?'
semantic_search_engine(new_sentence, model)

'Photosynthesis is the process by which green plants make food using sunlight.'

In [45]:
new_sentence = 'How do animals survive?'
semantic_search_engine(new_sentence, model)

'Biology is the study of living organisms and their interactions with the environment.'
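`semantic_search_engine` returns only the single best match and reads the corpus from global variables. A possible variant, sketched here in NumPy with made-up 2-dimensional embeddings, takes the corpus as arguments and returns the top `k` matches with their scores. `top_k_search`, `toy_docs` and `toy_embeddings` are hypothetical names, not part of the notebook or of `sentence_transformers`:

```python
import numpy as np

def top_k_search(query_embedding, doc_embeddings, docs, k=3):
    """Return the k most similar documents with their cosine scores, best first."""
    doc_unit = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query_unit = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_unit @ query_unit
    k = min(k, len(docs))                     # never ask for more results than documents
    best = np.argsort(-scores)[:k]            # indices sorted by descending score
    return [(docs[i], float(scores[i])) for i in best]

# Made-up corpus and embeddings, for illustration only.
toy_docs = ["alpha", "beta", "gamma"]
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

results = top_k_search(np.array([1.0, 0.2]), toy_embeddings, toy_docs, k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

With real data, `doc_embeddings` would be the array from `model.encode(documents)` and `query_embedding` the result of `model.encode(query)`. Returning scores alongside the documents also makes it possible to discard weak matches below a chosen threshold.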