## Various methods to generate sentence embeddings

Suppose we want to compare two queries: 'Best Italian restaurant in Delhi' and 'Top Italian food in Delhi'. Our simple method, i.e. word-based embedding generation, would fail to detect the similarity between 'Best' and 'Top' or between 'food' and 'restaurant'.
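
To make the failure concrete, here is a minimal sketch that scores the two queries with a plain bag-of-words cosine, taken here as one simple word-level method. The helper name `bow_cosine` is illustrative and not part of any library. Only exact token matches count, so 'best'/'top' and 'restaurant'/'food' contribute nothing to the score.

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both strings contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)

q1 = "Best Italian restaurant in Delhi"
q2 = "Top Italian food in Delhi"

shared = set(q1.lower().split()) & set(q2.lower().split())
print(sorted(shared))      # the only words that count towards the score
print(bow_cosine(q1, q2))  # roughly 0.6: three shared words out of five each
```

The score comes entirely from 'italian', 'in' and 'delhi'; replacing 'Top' with an unrelated word would leave it unchanged.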

Sentence embedding techniques represent entire sentences and their semantic information as vectors. This helps the machine in understanding the context, intention, and other nuances in the entire text.

In this notebook we will cover the following techniques:

1. Doc2Vec
2. Sentence-BERT
3. InferSent (powered by GloVe)
4. Universal Sentence Encoder




## We will first import basic libraries and define a function for cosine similarity

In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
sentences = ["I ate dinner.", 
       "We had a three-course meal.", 
       "Brad came to dinner with us.",
       "He loves fish tacos.",
       "In the end, we all felt like we ate too much.",
       "We all agreed; it was a magnificent evening."]

In [3]:
# Tokenization of each document
tokenized_sent = []

In [4]:
for s in sentences:
  tokenized_sent.append(word_tokenize(s.lower()))

In [5]:
tokenized_sent

[['i', 'ate', 'dinner', '.'],
 ['we', 'had', 'a', 'three-course', 'meal', '.'],
 ['brad', 'came', 'to', 'dinner', 'with', 'us', '.'],
 ['he', 'loves', 'fish', 'tacos', '.'],
 ['in',
  'the',
  'end',
  ',',
  'we',
  'all',
  'felt',
  'like',
  'we',
  'ate',
  'too',
  'much',
  '.'],
 ['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.']]

In [6]:
def cosine(u, v):
  return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
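
As a quick sanity check of this function, the sketch below evaluates it on three hand-picked vector pairs. Cosine similarity depends only on the angle between the vectors, not on their lengths, so parallel vectors score close to 1, orthogonal vectors 0, and opposite vectors close to -1. The function is repeated so that the snippet runs on its own; the example vectors are made up.

```python
import numpy as np

def cosine(u, v):
    # Same definition as in the cell above.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction, different length.
parallel = cosine(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
# At right angles to each other.
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Pointing in opposite directions.
opposite = cosine(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))

print(parallel, orthogonal, opposite)
```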

## 1. Doc2Vec
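Doc2Vec extends Word2Vec by learning a vector for each document (here, each sentence) alongside the word vectors, and it remains one of the most popular techniques for embedding text.
We will use Gensim to show an example of how to use Doc2Vec.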

In [7]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [8]:
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]

In [9]:
tagged_data

[TaggedDocument(words=['i', 'ate', 'dinner', '.'], tags=[0]),
 TaggedDocument(words=['we', 'had', 'a', 'three-course', 'meal', '.'], tags=[1]),
 TaggedDocument(words=['brad', 'came', 'to', 'dinner', 'with', 'us', '.'], tags=[2]),
 TaggedDocument(words=['he', 'loves', 'fish', 'tacos', '.'], tags=[3]),
 TaggedDocument(words=['in', 'the', 'end', ',', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much', '.'], tags=[4]),
 TaggedDocument(words=['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.'], tags=[5])]

In [10]:
'''
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
epochs = Number of passes over the training corpus.
'''

model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)

In [11]:
# Gensim 3.x attribute; in Gensim 4+ use model.wv.key_to_index instead.
model.wv.vocab

{',': <gensim.models.keyedvectors.Vocab at 0x7f7273907bd0>,
 '.': <gensim.models.keyedvectors.Vocab at 0x7f7273907750>,
 ';': <gensim.models.keyedvectors.Vocab at 0x7f7273907d90>,
 'a': <gensim.models.keyedvectors.Vocab at 0x7f7273907890>,
 'agreed': <gensim.models.keyedvectors.Vocab at 0x7f7273907d50>,
 'all': <gensim.models.keyedvectors.Vocab at 0x7f7273907c10>,
 'ate': <gensim.models.keyedvectors.Vocab at 0x7f72739076d0>,
 'brad': <gensim.models.keyedvectors.Vocab at 0x7f7273907810>,
 'came': <gensim.models.keyedvectors.Vocab at 0x7f7273907910>,
 'dinner': <gensim.models.keyedvectors.Vocab at 0x7f7273907710>,
 'end': <gensim.models.keyedvectors.Vocab at 0x7f7273907b90>,
 'evening': <gensim.models.keyedvectors.Vocab at 0x7f7273907e90>,
 'felt': <gensim.models.keyedvectors.Vocab at 0x7f7273907c50>,
 'fish': <gensim.models.keyedvectors.Vocab at 0x7f7273907a90>,
 'had': <gensim.models.keyedvectors.Vocab at 0x7f7273907850>,
 'he': <gensim.models.keyedvectors.Vocab at 0x7f7273907a10>,
 'i

In [12]:
test_doc = word_tokenize("I had pizza and pasta".lower())
# infer_vector returns the embedding of the unseen test sentence.
test_doc_vector = model.infer_vector(test_doc)
# Rank the training sentences by cosine similarity to the inferred vector.
# (Gensim 3.x API; in Gensim 4+ the attribute is model.dv rather than model.docvecs.)
model.docvecs.most_similar(positive = [test_doc_vector])

[(3, 0.5244523286819458),
 (4, 0.41902583837509155),
 (0, 0.1829233467578888),
 (5, 0.17090129852294922),
 (1, 0.10256706178188324),
 (2, -0.018988102674484253)]
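
The list above pairs each document tag with its cosine similarity to the inferred vector, sorted from most to least similar. The ranking step itself can be sketched in NumPy as follows. This is not Gensim's implementation: `rank_by_cosine` is a hypothetical helper, and the 3-dimensional vectors are made up to stand in for the trained document vectors.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (index, similarity) pairs sorted by descending cosine similarity."""
    # Scale every vector to unit length so that a dot product equals the cosine.
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    sims = doc_unit @ query_unit
    order = np.argsort(-sims)  # indices from most to least similar
    return [(int(i), float(sims[i])) for i in order]

# Toy stand-ins for three document vectors and one inferred query vector.
doc_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])
query_vec = np.array([2.0, 1.0, 0.0])

ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
```

With these toy vectors, document 2 ranks first because it points in nearly the same direction as the query.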

## 2. Sentence-BERT
Sentence-BERT uses a Siamese-network-like architecture that takes two sentences as input. Each sentence is passed through the same BERT model, followed by a pooling layer, to produce a fixed-size embedding. The embeddings of the two sentences are then compared using cosine similarity.
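
The pooling step can be sketched in NumPy. Assuming mean pooling (which the model name `bert-base-nli-mean-tokens` loaded later in this section suggests), the sentence vector is the average of the token vectors, ignoring padding positions. The array values and the function name `mean_pool` are illustrative only and do not come from the sentence-transformers library.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Two toy "sentences" with 3 token slots and 2-dimensional token vectors.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last slot is padding
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)
```

Because both sentences go through the same encoder and pooling, their vectors live in the same space and can be compared directly with the `cosine` function defined earlier.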

In [14]:
!pip install sentence-transformers


Next, we load a pre-trained Sentence-BERT model. Many other pre-trained models are available; you can find the full list [here](https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md).

In [15]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

In [17]:
sentence_embeddings = sbert_model.encode(sentences)

In [18]:
query = "I had pizza and pasta"
query_vec = sbert_model.encode([query])[0]

In [21]:
# Reuse the embeddings computed above instead of re-encoding each sentence.
for sent, sent_vec in zip(sentences, sentence_embeddings):
  sim = cosine(query_vec, sent_vec)
  print("Sentence = ", sent, "; similarity = ", sim)

Sentence =  I ate dinner. ; similarity =  0.71734625
Sentence =  We had a three-course meal. ; similarity =  0.6371339
Sentence =  Brad came to dinner with us. ; similarity =  0.5897908
Sentence =  He loves fish tacos. ; similarity =  0.62239355
Sentence =  In the end, we all felt like we ate too much. ; similarity =  0.419805
Sentence =  We all agreed; it was a magnificent evening. ; similarity =  0.180816


## 3. InferSent by Facebook AI
There are 2 versions of InferSent. Version 1 uses GloVe while version 2 uses fastText vectors. You can choose to work with either model. I have used the version 2 encoder here, although the cells below pair it with the GloVe word vectors.

Next, we download the InferSent model and the pre-trained word vectors. Before doing so, please save the models.py file from [here](https://github.com/facebookresearch/InferSent) and store it in your working directory.

In [22]:
!mkdir encoder

In [23]:
!curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146M  100  146M    0     0  22.5M      0  0:00:06  0:00:06 --:--:-- 26.3M


In [24]:
!mkdir GloVe

In [25]:
!curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   315    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0   352    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2075M  100 2075M    0     0  5173k      0  0:06:50  0:06:50 --:--:-- 5728k


In [26]:
!unzip GloVe/glove.840B.300d.zip -d GloVe/

Archive:  GloVe/glove.840B.300d.zip
  inflating: GloVe/glove.840B.300d.txt  


In [29]:
from models import InferSent
import torch

In [30]:
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
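
The `'pool_type': 'max'` setting means the encoder builds the sentence vector by taking, for each feature, the maximum over all time steps of the LSTM outputs. Below is a hedged NumPy sketch of that pooling step only. Random numbers stand in for the hidden states, and the tiny dimensions are assumptions chosen for readability, not the real model sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the LSTM outputs of one sentence: one row per token.
seq_len, hidden_dim = 5, 8
hidden_states = rng.normal(size=(seq_len, hidden_dim))

# Max pooling over time: keep the largest value of each feature across tokens.
sentence_vec = hidden_states.max(axis=0)

print(sentence_vec.shape)  # one value per hidden feature, independent of seq_len
```

The result has a fixed size regardless of how many tokens the sentence contains, which is what makes sentences of different lengths comparable.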

In [31]:
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = '/content/GloVe/glove.840B.300d.txt'
model.set_w2v_path(W2V_PATH)

In [32]:
model.build_vocab(sentences, tokenize=True)

Found 36(/36) words with w2v vectors
Vocab size : 36


In [33]:
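query = "I had pizza and pasta"
# encode expects a list of sentences; a bare string would be iterated character by character.
# (The outputs recorded below come from the original call, model.encode(query).)
query_vec = model.encode([query])[0]
query_vec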


array([ 0.02459561,  0.04943122, -0.15705208, ...,  0.07534433,
       -0.03941801,  0.05388858], dtype=float32)

In [34]:
similarity = []
for sent in sentences:
  sim = cosine(query_vec, model.encode([sent])[0])
  print("Sentence = ", sent, "; similarity = ", sim)

Sentence =  I ate dinner. ; similarity =  0.6868881
Sentence =  We had a three-course meal. ; similarity =  0.504327
Sentence =  Brad came to dinner with us. ; similarity =  0.55740434
Sentence =  He loves fish tacos. ; similarity =  0.590714
Sentence =  In the end, we all felt like we ate too much. ; similarity =  0.57681197
Sentence =  We all agreed; it was a magnificent evening. ; similarity =  0.5049965


## 4. Universal Sentence Encoder by Google AI
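The Universal Sentence Encoder is one of the best-performing general-purpose sentence embedding techniques right now. Its key feature is that it is trained with multi-task learning, so the same embeddings can be reused across tasks such as semantic similarity and classification.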

In [35]:
!pip3 install --upgrade tensorflow-gpu
# Install TF-Hub.
!pip3 install tensorflow-hub

Collecting tensorflow-gpu
  Downloading tensorflow_gpu-2.5.0-cp37-cp37m-manylinux2010_x86_64.whl (454.3 MB)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-2.5.0




In [36]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

In [37]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model = hub.load(module_url)
print ("module %s loaded" % module_url)

INFO:absl:Using /tmp/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
INFO:absl:Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [38]:
sentence_embeddings = model(sentences)
query = "I had pizza and pasta"
query_vec = model([query])[0]

In [39]:
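# Reuse the embeddings computed above instead of re-encoding each sentence.
for sent, sent_vec in zip(sentences, sentence_embeddings):
  sim = cosine(query_vec, sent_vec)
  print("Sentence = ", sent, "; similarity = ", sim)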

Sentence =  I ate dinner. ; similarity =  0.4686642
Sentence =  We had a three-course meal. ; similarity =  0.35643068
Sentence =  Brad came to dinner with us. ; similarity =  0.2033895
Sentence =  He loves fish tacos. ; similarity =  0.16515438
Sentence =  In the end, we all felt like we ate too much. ; similarity =  0.14987423
Sentence =  We all agreed; it was a magnificent evening. ; similarity =  0.058435913
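
Beyond scoring one query at a time, a matrix of embeddings allows all pairs of sentences to be compared in one step: normalise each row to unit length and multiply the matrix by its transpose. The sketch below uses small made-up vectors standing in for the encoder outputs, and `pairwise_cosine` is a hypothetical helper rather than part of TensorFlow Hub.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of rows in an (n, dim) array."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Entry (i, j) is the cosine similarity between row i and row j.
    return unit @ unit.T

# Toy stand-ins for three sentence embeddings.
emb = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])

sim_matrix = pairwise_cosine(emb)
print(np.round(sim_matrix, 3))
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle needs to be inspected when looking for the most similar pair.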
