# Contextual Embeddings and Semantic Search Engines with Transfer Learning (Pre-trained Embeddings)

![](https://i.imgur.com/7SXKckD.png)

Transfer learning means taking models that have already been trained and tuning or adapting them to our own downstream tasks.

# Finding semantic similarity with Pre-trained Embeddings

Here we use pre-trained embedding models (deep learning models) to extract embeddings from sentences and measure their semantic similarity.

Models we will look at:

1. Pre-trained Word2Vec Embeddings from Google
2. Universal Sentence Encoders
3. Transformers

# Semantic Search

Semantic search seeks to improve search accuracy by understanding the meaning of the search query. Unlike traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and related phrasings.
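To make the contrast concrete, here is a minimal sketch of a purely lexical score: the Jaccard overlap of lowercase word sets. The helper `jaccard_overlap` and the example texts are illustrative assumptions, not part of any search library.

```python
import re

def jaccard_overlap(query, document):
    """Share of distinct lowercase words that two texts have in common."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", document.lower()))
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

query = "physician salary"
doc_lexical = "Salary tables for every physician grade"   # shares both query words
doc_synonym = "How much do doctors earn"                   # same topic, no shared words

lexical_score = jaccard_overlap(query, doc_lexical)
synonym_score = jaccard_overlap(query, doc_synonym)
print(lexical_score, synonym_score)
```

The second document is about the same topic as the query but shares no words with it, so a purely lexical score cannot rank it. This is the gap that embedding-based search tries to close.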


The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
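The lookup step can be sketched with plain NumPy. The vectors below are made-up 3-dimensional stand-ins for real embeddings, and `top_k_cosine` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=2):
    """Return (indices, scores) of the k corpus rows closest to query_vec."""
    # Normalise to unit length so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]   # highest score first
    return order, scores[order]

# Toy "embeddings"; real models produce hundreds of dimensions.
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.05, 0.0])

indices, scores = top_k_cosine(query_vec, corpus_vecs, k=2)
print(indices, scores)
```

The corpus vectors only need to be computed once; at query time the work is one embedding call plus a similarity computation against the stored matrix.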

![](https://i.imgur.com/jjkX9h8.png)

# Create a sample corpus

In [10]:
sentences = ['A woman is playing violin.',
             'A monkey is playing drums.',
             'A woman is eating a piece of bread.',             
             'A man is eating a pasta.']

Expected similarity for each pair of sentences, by index:

```
[0, 1] = quite similar
[0, 2] = not very similar
[0, 3] = not very similar
[1, 2] = not very similar
[1, 3] = not very similar
[2, 3] = very similar
```

## Semantic Similarity with pre-trained Word2Vec Embeddings

The word2vec model takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

Google has published pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

The archive is available at the link below:

Source: https://code.google.com/archive/p/word2vec/

![](https://i.imgur.com/l26L0pP.png)

In [2]:
!pip install gdown --ignore-install


In [1]:
!gdown '1VPwziUXxRukY8qvbLaVfITav9JBjFXKH'

Downloading...
From: https://drive.google.com/uc?id=1VPwziUXxRukY8qvbLaVfITav9JBjFXKH
To: /content/GoogleNews-vectors-negative300.bin.gz
100% 1.65G/1.65G [00:30<00:00, 54.1MB/s]


In [2]:
!gunzip GoogleNews-vectors-negative300.bin

In [3]:
!ls -l --block-size=MB

total 3645MB
-rw-r--r-- 1 root root 3645MB Jun 15 13:38 GoogleNews-vectors-negative300.bin
drwxr-xr-x 1 root root    1MB Jun  1 13:50 sample_data


### Load Word2Vec Embeddings in a Word2Vec Model

In [4]:
import gensim

gensim.__version__

'3.6.0'

In [5]:
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [6]:
w2v_model

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fe76542a0d0>

### Sample embedding from already trained Word2Vec Model

In [7]:
w2v_model['man']

array([ 0.32617188,  0.13085938,  0.03466797, -0.08300781,  0.08984375,
       -0.04125977, -0.19824219,  0.00689697,  0.14355469,  0.0019455 ,
        0.02880859, -0.25      , -0.08398438, -0.15136719, -0.10205078,
        0.04077148, -0.09765625,  0.05932617,  0.02978516, -0.10058594,
       -0.13085938,  0.001297  ,  0.02612305, -0.27148438,  0.06396484,
       -0.19140625, -0.078125  ,  0.25976562,  0.375     , -0.04541016,
        0.16210938,  0.13671875, -0.06396484, -0.02062988, -0.09667969,
        0.25390625,  0.24804688, -0.12695312,  0.07177734,  0.3203125 ,
        0.03149414, -0.03857422,  0.21191406, -0.00811768,  0.22265625,
       -0.13476562, -0.07617188,  0.01049805, -0.05175781,  0.03808594,
       -0.13378906,  0.125     ,  0.0559082 , -0.18261719,  0.08154297,
       -0.08447266, -0.07763672, -0.04345703,  0.08105469, -0.01092529,
        0.17480469,  0.30664062, -0.04321289, -0.01416016,  0.09082031,
       -0.00927734, -0.03442383, -0.11523438,  0.12451172, -0.02

### Document Embeddings from Averaging Word Embeddings

In [8]:
import numpy as np

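def average_word_vectors(words, model, vocabulary, num_features):
    """Average the vectors of all in-vocabulary items in `words`.

    `words` should be a list of tokens; a raw string would be iterated
    character by character.
    """
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])

    # Leave the zero vector untouched if nothing was found in the vocabulary
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    """Return one averaged vector per entry of `corpus` (tokenized sentences)."""
    # `index2word` is the gensim 3.x attribute; gensim 4 renamed it to `index_to_key`
    vocabulary = set(model.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)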

### Observe Similarity based on Word2Vec Embeddings

In [11]:
w2v_vectors = averaged_word_vectorizer(sentences, model=w2v_model, num_features=300)

In [12]:
sentences 

['A woman is playing violin.',
 'A monkey is playing drums.',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

In [13]:
w2v_vectors.shape

(4, 300)

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

similarity_matrix = cosine_similarity(w2v_vectors)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.962555 | 0.929097 | 0.9199 |
| 1 | 0.962555 | 1.0 | 0.95769 | 0.951351 |
| 2 | 0.929097 | 0.95769 | 1.0 | 0.943735 |
| 3 | 0.9199 | 0.951351 | 0.943735 | 1.0 |


```
[0, 1] = quite similar 
[0, 2] = not very similar
[0, 3] = not very similar
[1, 2] = not very similar
[1, 3] = not very similar
[2, 3] = very similar
```
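One caveat about the Word2Vec call above: `averaged_word_vectorizer` loops over each corpus entry with `for word in words`, and `sentences` holds raw strings, so the loop visits single characters rather than words. That may be one reason every pair scores above 0.9. The sketch below reproduces the effect with a hypothetical toy vocabulary (`toy_vectors`) in place of the Google News model; `average_vectors` and `simple_tokenize` are illustrative names.

```python
import numpy as np

# Hypothetical 2-d vectors; note that single letters are valid keys too.
toy_vectors = {
    "man":  np.array([1.0, 0.0]),
    "eats": np.array([0.0, 1.0]),
    "m":    np.array([5.0, 5.0]),
    "a":    np.array([-5.0, 5.0]),
}

def average_vectors(words, vectors, num_features):
    """Average the vectors of every item in `words` that is a key of `vectors`."""
    total = np.zeros(num_features)
    count = 0
    for word in words:          # iterating a str yields single characters
        if word in vectors:
            total += vectors[word]
            count += 1
    return total / count if count else total

def simple_tokenize(text):
    """Lowercase, split on whitespace, strip basic punctuation."""
    return [w.strip(".,!?") for w in text.lower().split()]

text = "man eats"
char_level = average_vectors(text, toy_vectors, 2)                    # raw string
word_level = average_vectors(simple_tokenize(text), toy_vectors, 2)   # token list
print(char_level, word_level)
```

With the real model, the equivalent change would be to pass `[simple_tokenize(s) for s in sentences]` as the corpus; the similarity values would then differ from the table shown above. The Google News vocabulary is case-sensitive, so whether to lowercase is a further choice to make.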

## Semantic Similarity with Universal Sentence Embeddings

![](https://i.imgur.com/k8vqvcp.png)

### Load USE Pre-trained Model

In [16]:
import tensorflow_hub as hub

use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

### Get Embeddings

In [17]:
embeddings = use_model(sentences)
embeddings.shape

TensorShape([4, 512])

### Compute Similarity

In [18]:
similarity_matrix = cosine_similarity(embeddings)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.404262 | 0.488585 | 0.261118 |
| 1 | 0.404262 | 1.0 | 0.248893 | 0.213846 |
| 2 | 0.488585 | 0.248893 | 1.0 | 0.580541 |
| 3 | 0.261118 | 0.213846 | 0.580541 | 1.0 |


```
[0, 1] = quite similar 
[0, 2] = not very similar
[0, 3] = not very similar
[1, 2] = not very similar
[1, 3] = not very similar
[2, 3] = very similar
```

In [19]:
sentences

['A woman is playing violin.',
 'A monkey is playing drums.',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

## Pre-trained Transformer Embeddings (BERT)

![](https://i.imgur.com/dDd5ZbP.png)

In [20]:
!pip install transformers


### Load Pre-trained BERT Model

In [21]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


## Sample Encoding and Feature Extraction

In [23]:
token_ids = tokenizer.encode("Hello, how are you?")
token_ids

[101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

In [24]:
["<"+tokenizer.decode([item])+">" for item in token_ids]

['<[CLS]>', '<hello>', '<,>', '<how>', '<are>', '<you>', '<?>', '<[SEP]>']

In [25]:
token_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]
token_ids

<tf.Tensor: shape=(1, 8), dtype=int32, numpy=
array([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]],
      dtype=int32)>

In [26]:
token_embeddings = model(np.array(token_ids))[0]

In [27]:
token_embeddings

<tf.Tensor: shape=(1, 8, 768), dtype=float32, numpy=
array([[[-0.1143714 ,  0.19371377,  0.12495909, ..., -0.38269046,
          0.21065864,  0.54070836],
        [ 0.53082436,  0.3207484 ,  0.36645943, ..., -0.00360714,
          0.7578602 ,  0.03884368],
        [-0.48765177,  0.88492495,  0.42556435, ..., -0.6976217 ,
          0.44583377,  0.12309451],
        ...,
        [-0.7002785 , -0.1815068 ,  0.32969713, ..., -0.4837932 ,
          0.06802306,  0.89008516],
        [-1.035463  , -0.25667778, -0.03165283, ...,  0.31974316,
          0.39990166,  0.17954731],
        [ 0.6079923 ,  0.26097032, -0.31307247, ...,  0.03109752,
         -0.62827134, -0.19942416]]], dtype=float32)>

# BERT Tokenization of Sequences of Text

#### Input IDs
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

<br/>

#### Attention mask
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not.

<br/>

#### Token Type IDs
Some models are used for sequence-pair tasks such as sequence classification or question answering. These require two different sequences to be joined in a single `input_ids` entry, which is usually done with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. The token type IDs then mark which of the two sequences each token belongs to (0 for the first, 1 for the second). For example, the BERT model builds its two-sequence input as follows:

```
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```

https://huggingface.co/transformers/glossary.html#token-type-ids
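As a rough illustration of the layout, the sketch below assembles a sequence pair by hand. `build_pair_inputs` is a hypothetical helper, not a tokenizer API; in practice the Hugging Face tokenizer returns these fields itself when it is called with two texts. The ids are taken from the tokenizer outputs shown earlier.

```python
CLS_ID, SEP_ID = 101, 102   # ids of [CLS] and [SEP] seen in the outputs above

def build_pair_inputs(ids_a, ids_b):
    """Join two id sequences as [CLS] A [SEP] B [SEP] with matching type ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

ids_a = [7592, 1010, 2129, 2024, 2017, 1029]   # hello , how are you ?
ids_b = [2026, 3899, 2003, 10140]              # my dog is cute

input_ids, token_type_ids = build_pair_inputs(ids_a, ids_b)
print(input_ids)
print(token_type_ids)
```

For single-sentence inputs, such as the corpus used in this notebook, every token belongs to the first segment, so the token type IDs are all zeros.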

In [28]:
tokenizer(sentences)

{'input_ids': [[101, 1037, 2450, 2003, 2652, 6710, 1012, 102], [101, 1037, 10608, 2003, 2652, 3846, 1012, 102], [101, 1037, 2450, 2003, 5983, 1037, 3538, 1997, 7852, 1012, 102], [101, 1037, 2158, 2003, 5983, 1037, 24857, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [29]:
sentences

['A woman is playing violin.',
 'A monkey is playing drums.',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

In [30]:
bert_token_ids = tokenizer(sentences)['input_ids']
bert_token_ids

[[101, 1037, 2450, 2003, 2652, 6710, 1012, 102],
 [101, 1037, 10608, 2003, 2652, 3846, 1012, 102],
 [101, 1037, 2450, 2003, 5983, 1037, 3538, 1997, 7852, 1012, 102],
 [101, 1037, 2158, 2003, 5983, 1037, 24857, 1012, 102]]
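The four sequences above have different lengths (8, 8, 11 and 9 ids), which is why the notebook later feeds them to the model one at a time. To batch them, they would need padding plus an attention mask. The sketch below does this by hand; `pad_batch` is a hypothetical helper, and the Hugging Face tokenizer can produce the same fields via its `padding=True` option.

```python
PAD_ID = 0  # id of the [PAD] token in the bert-base-uncased vocabulary

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

bert_token_ids = [
    [101, 1037, 2450, 2003, 2652, 6710, 1012, 102],
    [101, 1037, 10608, 2003, 2652, 3846, 1012, 102],
    [101, 1037, 2450, 2003, 5983, 1037, 3538, 1997, 7852, 1012, 102],
    [101, 1037, 2158, 2003, 5983, 1037, 24857, 1012, 102],
]

padded_ids, attention_mask = pad_batch(bert_token_ids)
for ids, mask in zip(padded_ids, attention_mask):
    print(ids, mask)
```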

# Types of BERT Embeddings

There are mainly two types of embeddings we can get from BERT:

1. Embeddings of each token (which we can later combine using a strategy such as taking the mean)
2. An overall pooled embedding, a fixed-size 1-D vector (derived from the output of the first token, i.e. [CLS], at the top layer)

```
[CLS] => [.....] # 768 sized flat vector (embedding) => pooled repr of the whole sentence 

h12_1 => [.....] 768 sized embedding representation for the word w1 after passing through the 12 encoder layers

12: [CLS] [h12_1, h12_2, h12_3....] [SEP]
            ...
            ...
2:  [CLS] [h2_1, h2_2, h2_3....] [SEP]

1:  [CLS] [h1_1, h1_2, h1_3....] [SEP]

S:  [CLS] [w1, w2, w3....] [SEP]
```

## Pooled Embedding

The BERT encoder produces a sequence of hidden states. For classification tasks, this sequence ultimately needs to be reduced to a single vector. There are multiple ways of converting this sequence to a single vector representation of a sentence. One is max/mean pooling. Another is applying attention. The authors, however, opt for a much simpler method: taking the hidden state corresponding to the first token.

To make this pooling scheme work, BERT prepends a [CLS] token (short for "classification") to the start of each sentence (this is essentially like a start-of-sentence token).

![](https://i.imgur.com/LODFglb.png)

[Source](https://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)


__If you really want to dive into the details, you can check the source code of BERT in [these lines here](https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/modeling.py#L224-L232) which show how this happens__
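In the linked BERT source, the first-token hidden state is additionally passed through a dense layer with a tanh activation before it is returned as the pooled output. The NumPy sketch below mimics that step under stated assumptions: the hidden states and the pooler weights `W` and `b` are random stand-ins, not pre-trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 2, 8, 768

# Stand-in for the top-layer hidden states returned by the encoder.
sequence_output = rng.normal(size=(batch_size, seq_len, hidden_size))

# Randomly initialised pooler parameters (illustrative only).
W = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

first_token = sequence_output[:, 0, :]          # hidden state at the [CLS] position
pooled_output = np.tanh(first_token @ W + b)    # dense layer followed by tanh

print(first_token.shape, pooled_output.shape)
```

With the `TFBertModel` loaded earlier, the pre-trained version of this vector is the second element of the model output, next to the per-token hidden states used elsewhere in this notebook.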

## Token Embeddings

We can get the 1-D Embeddings for each and every token in our sentences. Typically this can be visualized as follows.

![](https://i.imgur.com/ckzQGKC.png)

[Source](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a)

We can average these token embeddings to get a single document embedding. Other strategies exist too, such as max pooling or attention-based pooling.
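When sentences are padded into one batch, a plain mean would also average the padding positions. A common remedy is to weight the mean by the attention mask. The sketch below uses toy values; `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, counting only positions where mask == 1."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, 2-d vectors.
# The last position of the second sentence is padding.
token_embeddings = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

sentence_vectors = masked_mean_pool(token_embeddings, attention_mask)
print(sentence_vectors)
```

The same function can also exclude the [CLS] and [SEP] positions by setting their mask entries to 0, which matches the `[:, 1:-1]` slicing used in the cells that follow.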

## Semantic Similarity based on Averaged BERT Token Embeddings

In [31]:
sentences

['A woman is playing violin.',
 'A monkey is playing drums.',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

In [32]:
token_emb = [model(np.array([tokens]))[0] for tokens in bert_token_ids]

In [33]:
token_emb[1]

<tf.Tensor: shape=(1, 8, 768), dtype=float32, numpy=
array([[[-0.11781539,  0.44128385, -0.58142006, ...,  0.0184275 ,
          0.3107761 ,  0.06316176],
        [ 0.10769126,  0.44274443, -1.2093672 , ..., -0.10495304,
          0.09872227, -0.08223685],
        [ 0.4246171 ,  0.2918639 , -1.0349988 , ..., -0.4293397 ,
          0.16894542, -0.00826068],
        ...,
        [ 0.9256045 ,  0.20301542,  0.04375648, ..., -0.1834942 ,
         -0.14454445, -1.5779029 ],
        [-0.47451797, -0.51808405, -0.31593698, ...,  0.45416337,
          0.40323186, -0.618767  ],
        [ 0.85682994,  0.27821812, -0.22241437, ..., -0.01036618,
         -0.46961913, -0.1822662 ]]], dtype=float32)>

In [34]:
token_emb[1][:,1:-1]

<tf.Tensor: shape=(1, 6, 768), dtype=float32, numpy=
array([[[ 0.10769126,  0.44274443, -1.2093672 , ..., -0.10495304,
          0.09872227, -0.08223685],
        [ 0.4246171 ,  0.2918639 , -1.0349988 , ..., -0.4293397 ,
          0.16894542, -0.00826068],
        [ 0.14515851,  0.23817   , -0.8344765 , ...,  0.12461176,
         -0.03070291, -0.06715855],
        [ 0.26153895, -0.04653043, -0.93738604, ...,  0.05589517,
         -0.28230473, -0.33846936],
        [ 0.9256045 ,  0.20301542,  0.04375648, ..., -0.1834942 ,
         -0.14454445, -1.5779029 ],
        [-0.47451797, -0.51808405, -0.31593698, ...,  0.45416337,
          0.40323186, -0.618767  ]]], dtype=float32)>

In [35]:
token_emb[1][:,1:-1][0]

<tf.Tensor: shape=(6, 768), dtype=float32, numpy=
array([[ 0.10769126,  0.44274443, -1.2093672 , ..., -0.10495304,
         0.09872227, -0.08223685],
       [ 0.4246171 ,  0.2918639 , -1.0349988 , ..., -0.4293397 ,
         0.16894542, -0.00826068],
       [ 0.14515851,  0.23817   , -0.8344765 , ...,  0.12461176,
        -0.03070291, -0.06715855],
       [ 0.26153895, -0.04653043, -0.93738604, ...,  0.05589517,
        -0.28230473, -0.33846936],
       [ 0.9256045 ,  0.20301542,  0.04375648, ..., -0.1834942 ,
        -0.14454445, -1.5779029 ],
       [-0.47451797, -0.51808405, -0.31593698, ...,  0.45416337,
         0.40323186, -0.618767  ]], dtype=float32)>

In [36]:
token_emb_flat = np.array([np.mean(item[:,1:-1][0], axis=0) for item in token_emb])
token_emb_flat.shape

(4, 768)

In [37]:
similarity_matrix = cosine_similarity(token_emb_flat)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.890194 | 0.85782 | 0.85535 |
| 1 | 0.890194 | 1.0 | 0.784939 | 0.822796 |
| 2 | 0.85782 | 0.784939 | 1.0 | 0.913372 |
| 3 | 0.85535 | 0.822796 | 0.913372 | 1.0 |


In [38]:
sentences

['A woman is playing violin.',
 'A monkey is playing drums.',
 'A woman is eating a piece of bread.',
 'A man is eating a pasta.']

```
[0, 1] = quite similar 
[0, 2] = not very similar
[0, 3] = not very similar
[1, 2] = not very similar
[1, 3] = not very similar
[2, 3] = very similar
```

# Fun with Embeddings: Simple Search Engine!

Let's create a corpus of documents to serve as the source for our text searches.

In [39]:
database = np.array(['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ])

## Similar document search using Word2Vec Embeddings

In [40]:
w2v_vectors = averaged_word_vectorizer(database, model=w2v_model, num_features=300)

In [41]:
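def get_w2v_similar_docs(new_sentence, w2v_model, num_features, corpus_vectors):
  # Embed the query the same way as the corpus, then rank corpus rows by cosine similarity
  ns_w2v = averaged_word_vectorizer([new_sentence], model=w2v_model, num_features=num_features)
  cs = cosine_similarity(ns_w2v, corpus_vectors)
  # argsort on the negated scores sorts in descending order; keep the two best indices
  top2_idx = np.argsort(-cs)[0][:2]
  print('Top 2 most similar to:', new_sentence)
  # Note: looks up results in the module-level `database` array
  print(database[top2_idx])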

In [42]:
new_sentence = 'A man is eating a pasta'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A cheetah is running behind its prey.']


In [43]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: Someone in a gorilla costume is playing a set of drums.
['A man is riding a white horse on an enclosed ground.'
 'A man is eating food.']


In [44]:
new_sentence = 'A cheetah chases prey on across a field.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is eating a piece of bread.']


## Similar document search using Universal Sentence Embeddings

In [45]:
database_embeddings = use_model(database)

In [46]:
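def get_usemodel_similar_docs(new_sentence, use_model, corpus_vectors):
  # The USE model takes a list of raw sentences and returns one 512-d vector per sentence
  ns_se = use_model([new_sentence])
  cs = cosine_similarity(ns_se, corpus_vectors)
  # Indices of the two highest-scoring corpus entries
  top2_idx = np.argsort(-cs)[0][:2]
  print('Top 2 most similar to (USE Model):', new_sentence)
  print(database[top2_idx])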

In [47]:
new_sentence = 'A man is eating a pasta'
get_usemodel_similar_docs(new_sentence, use_model, corpus_vectors=database_embeddings)

Top 2 most similar to (USE Model): A man is eating a pasta
['A man is eating a piece of bread.' 'A man is eating food.']


In [48]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
get_usemodel_similar_docs(new_sentence, use_model, corpus_vectors=database_embeddings)

Top 2 most similar to (USE Model): Someone in a gorilla costume is playing a set of drums.
['A monkey is playing drums.' 'A woman is playing violin.']


In [49]:
new_sentence = 'A cheetah chases prey on across a field.'
get_usemodel_similar_docs(new_sentence, use_model, corpus_vectors=database_embeddings)

Top 2 most similar to (USE Model): A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is riding a white horse on an enclosed ground.']


## Similar document search using BERT Embeddings

In [50]:
bert_token_ids = tokenizer(list(database))['input_ids']
token_emb = [model(np.array([tokens]))[0] for tokens in bert_token_ids]
token_emb_flat = np.array([np.mean(item[:,1:-1][0], axis=0) for item in token_emb])

In [51]:
def get_bert_similar_docs(new_sentence, bert_tokenizer, bert_model, token_corpus_vectors_flat):
  # Token ids for the query, including the [CLS] and [SEP] special tokens
  tokens = bert_tokenizer([new_sentence])['input_ids']
  # Per-token hidden states from the top layer, shape (1, seq_len, 768) per sentence
  token_ns_emb = np.array([bert_model(np.array([token]))[0] for token in tokens])
  # Drop [CLS] and [SEP] with [:, 1:-1], then average the remaining token vectors
  token_ns_emb_flat = np.array([np.mean(item[:,1:-1][0], axis=0) for item in token_ns_emb])

  cs = cosine_similarity(token_ns_emb_flat, token_corpus_vectors_flat)
  top2_idx = np.argsort(-cs)[0][:2]
  print('[Avg Token Embeddings] Top 2 most similar to:', new_sentence)
  print(database[top2_idx])

In [52]:
new_sentence = 'A man is eating a pasta'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A man is eating a piece of bread.']


In [53]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: Someone in a gorilla costume is playing a set of drums.
['A monkey is playing drums.' 'A woman is playing violin.']


In [54]:
new_sentence = 'A cheetah chases prey on across a field.'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is riding a white horse on an enclosed ground.']


## Building Robust Semantic Search Engines with Transformers

In [55]:
!pip install -U sentence-transformers


In [56]:
from sentence_transformers import SentenceTransformer, util
import torch

In [57]:
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
# all-MiniLM-L12-v2 is a sentence-transformers model fine-tuned from this MiniLM checkpoint

embedder = SentenceTransformer('all-MiniLM-L12-v2')

In [58]:
database_embeddings = embedder.encode(database, convert_to_tensor=True)

In [59]:
def semantic_search_engine(query, embedder_model):
  # Note: reads the module-level `database` and `database_embeddings`

  query_embedding = embedder_model.encode(query, convert_to_tensor=True)
  # We use cosine-similarity and torch.topk to find the highest 2 scores
  cos_scores = util.cos_sim(query_embedding, database_embeddings)[0]
  top_results = torch.topk(cos_scores, k=2)
  # Move the indices to the CPU before using them to index the NumPy array
  return database[top_results.indices.cpu()]

In [60]:
new_sentence = 'A man is eating pasta.'
semantic_search_engine(new_sentence, embedder)

array(['A man is eating food.', 'A man is eating a piece of bread.'],
      dtype='<U52')

In [61]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
semantic_search_engine(new_sentence, embedder)

array(['A monkey is playing drums.', 'A woman is playing violin.'],
      dtype='<U52')

In [62]:
new_sentence = 'A cheetah chases prey on across a field.'
semantic_search_engine(new_sentence, embedder)

array(['A cheetah is running behind its prey.',
       'A man is riding a white horse on an enclosed ground.'],
      dtype='<U52')