# Text Representation and Semantic Meaning with Transfer Learning (Pre-trained Embeddings)

![](https://i.imgur.com/7SXKckD.png)

Transfer learning lets us take models that have already been trained and tune or adapt them to our own downstream tasks.

# Finding semantic similarity with Pre-trained Embeddings

Here we use pre-trained embedding models and deep learning models to extract embeddings from sentences and measure their semantic similarity.

Models we will look at:

1. Pre-trained Word2Vec Embeddings from Google
2. BERT

# Create a sample corpus

In [77]:
sentences = ['He is sitting near the river bank',
             'He is sitting in the bank to get some cash',
             'The elephant is sitting near the bank of the river',
             'The bank is closed so he cannot get any money today']

# Get Pre-trained Google Word2Vec Embeddings

The word2vec model takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns a vector representation for each word. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

Google has published pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.
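As a rough sanity check on what we are about to download, the figures above can be turned into a size estimate. This is only a back-of-the-envelope sketch: it assumes each value is stored as a 32-bit float, and it ignores the space taken by the words themselves.

```python
# Rough size estimate for the Google News vectors.
num_vectors = 3_000_000   # words and phrases in the model
dimensions = 300          # length of each vector
bytes_per_value = 4       # assumption: values stored as 32-bit floats

size_bytes = num_vectors * dimensions * bytes_per_value
size_gb = size_bytes / 1e9
print(f"{size_gb:.1f} GB")
```

So the uncompressed vectors should need a few gigabytes of disk space and memory.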

The archive is available in the link below: 

Source: https://code.google.com/archive/p/word2vec/

In [78]:
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2020-09-17 23:04:05--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.78.78
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.78.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-09-17 23:05:51 (14.9 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [6]:
!gunzip GoogleNews-vectors-negative300.bin

In [8]:
!ls -l --block-size=MB

total 3645MB
-rw-r--r-- 1 root root 3645MB Mar  5  2015 GoogleNews-vectors-negative300.bin
drwxr-xr-x 1 root root    1MB Sep 16 16:29 sample_data


# Load Word2Vec Embeddings in a Word2Vec Model

In [81]:
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


In [82]:
w2v_model

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7faf132ae438>

# Sample embedding from already trained Word2Vec Model

In [83]:
w2v_model['bank']

array([ 2.19726562e-02,  1.34765625e-01, -5.78613281e-02,  5.56640625e-02,
        9.91210938e-02, -1.40625000e-01, -3.03649902e-03,  1.87988281e-02,
        2.53906250e-01, -4.88281250e-02, -1.63574219e-02, -1.33666992e-02,
        6.25000000e-02,  6.07910156e-02, -9.22851562e-02,  3.12500000e-01,
        1.38282776e-04, -1.34765625e-01, -4.32128906e-02,  1.16699219e-01,
        2.22656250e-01, -9.81445312e-02,  4.51660156e-02, -2.23388672e-02,
        5.17578125e-02, -2.41210938e-01, -1.11328125e-01,  9.71679688e-02,
        2.28515625e-01, -1.08642578e-02, -4.02832031e-02, -1.83105469e-02,
        3.10546875e-01,  3.88183594e-02, -2.85156250e-01, -2.06054688e-01,
        3.69140625e-01, -5.24902344e-02,  1.30859375e-01,  1.51367188e-01,
        1.59179688e-01, -2.36328125e-01,  7.47070312e-02, -5.54199219e-02,
       -8.64257812e-02, -2.28515625e-01,  2.44140625e-03,  8.11767578e-03,
       -1.62109375e-01,  1.46484375e-01,  1.40625000e-01, -3.82995605e-03,
        1.09375000e-01,  

# Document Embeddings from Averaging Word Embeddings

In [18]:
import numpy as np

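def average_word_vectors(words, model, vocabulary, num_features):
    # `words` must be a list of tokens; a raw string would be iterated character by character
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    # Sum the vectors of all in-vocabulary words
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])

    # Divide by the number of words found (the vector stays all zeros if none were found)
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    # index2word is the gensim 3.x attribute name (gensim 4 renamed it to index_to_key)
    vocabulary = set(model.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)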
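
The helper above averages the vectors of the words it finds in the vocabulary, so each document should reach it as a sequence of word tokens. The sketch below illustrates this with a hypothetical three-dimensional toy model (a plain dict standing in for the real word vectors; `toy_model` and `average_tokens` are illustrative names, not part of gensim). It also shows what happens when a raw string is passed instead of a token list: Python iterates over the string character by character, so only single-character entries can match.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a real model.
toy_model = {
    "river": np.array([1.0, 0.0, 0.0]),
    "bank": np.array([0.0, 1.0, 0.0]),
    "a": np.array([0.0, 0.0, 1.0]),
}

def average_tokens(tokens, model, num_features):
    """Average the vectors of the tokens that the model knows."""
    known = [model[t] for t in tokens if t in model]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

sentence = "river bank"

# Tokenized input: the average of the 'river' and 'bank' vectors.
as_words = average_tokens(sentence.split(), toy_model, 3)

# Raw string input: iterates over characters, so only 'a' is found.
as_chars = average_tokens(sentence, toy_model, 3)

print(as_words)
print(as_chars)
```

A simple whitespace split is used here only for illustration; a real tokenizer may also lowercase text and strip punctuation.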

# Observe Similarity based on Word2Vec Embeddings

In [84]:
w2v_vectors = averaged_word_vectorizer(sentences, model=w2v_model, num_features=300)

In [86]:
sentences

['He is sitting near the river bank',
 'He is sitting in the bank to get some cash',
 'The elephant is sitting near the bank of the river',
 'The bank is closed so he cannot get any money today']

In [85]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

similarity_matrix = cosine_similarity(w2v_vectors)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.980478 | 0.989551 | 0.95627 |
| 1 | 0.980478 | 1.0 | 0.979031 | 0.967277 |
| 2 | 0.989551 | 0.979031 | 1.0 | 0.96975 |
| 3 | 0.95627 | 0.967277 | 0.96975 | 1.0 |
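
The similarity scores in this matrix are cosine similarities: the dot product of two vectors divided by the product of their lengths, so only the angle between the vectors matters. The following is a minimal NumPy sketch of that formula on made-up 2-D vectors (`cosine_sim` is an illustrative helper, not the scikit-learn function used above).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1.0, 0.0]
v = [1.0, 1.0]
w = [-1.0, 0.0]

sim_same = cosine_sim(u, u)       # same direction
sim_45 = cosine_sim(u, v)         # 45 degrees apart
sim_opposite = cosine_sim(u, w)   # opposite directions

print(sim_same, sim_45, sim_opposite)
```

Scores range from -1 (opposite directions) to 1 (same direction), which is why every sentence has a similarity of 1.0 with itself on the diagonal.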


# Pre-trained Transformer Embeddings (BERT)

![](https://i.imgur.com/dDd5ZbP.png)

In [23]:
!pip install transformers

# BERT for Feature Extraction

![](https://i.imgur.com/4uYtfkQ.png)

# Load Pre-trained BERT Model

In [24]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


# Sample Encoding and Feature Extraction

In [28]:
token_ids = tokenizer.encode("Hello, how are you?")
token_ids

[101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

In [30]:
["<"+tokenizer.decode([item])+">" for item in token_ids]

['<[CLS]>', '<hello>', '<,>', '<how>', '<are>', '<you>', '<?>', '<[SEP]>']

In [31]:
tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]

<tf.Tensor: shape=(1, 8), dtype=int32, numpy=
array([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]],
      dtype=int32)>

In [48]:
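# Index the outputs by position: 0 = per-token embeddings, 1 = pooled embedding
outputs = model(np.array([token_ids]))
token_emb, pooled_emb = outputs[0], outputs[1]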

In [49]:
token_emb[0].shape, pooled_emb[0].shape

(TensorShape([8, 768]), TensorShape([768]))

# BERT Tokenization of Sequences of Text

#### Input IDs
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

<br/>

#### Attention mask
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not.
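
To make the role of the mask concrete, here is a small sketch of how two sequences of different length could be padded into one batch. It is an illustration in plain NumPy rather than the tokenizer's own code; `pad_batch` is a hypothetical helper, and the padding id of 0 is assumed to match the `[PAD]` token of `bert-base-uncased`.

```python
import numpy as np

PAD_ID = 0  # assumption: id of the [PAD] token in bert-base-uncased

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = np.full((len(batch_ids), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(batch_ids), max_len), dtype=np.int64)
    for row, ids in enumerate(batch_ids):
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1   # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [
    [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],  # [CLS] hello , how are you ? [SEP]
    [101, 7592, 102],                                 # [CLS] hello [SEP]
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The second row is padded to eight positions, and its mask marks only the first three as tokens the model should attend to.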

<br/>

#### Token Type IDs
Some models are built for sequence classification or question answering. These require two different sequences to be joined in a single “input_ids” entry, which is usually done with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two-sequence input as follows:

```
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```

Source: https://huggingface.co/transformers/glossary.html#token-type-ids
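
The layout above can be sketched in a few lines. This is an illustration, not the tokenizer's implementation: `build_pair` is a hypothetical helper, and the ids 101 and 102 are the `[CLS]` and `[SEP]` ids seen in the sample encoding earlier. Token type ids are 0 for the first segment (including `[CLS]` and the first `[SEP]`) and 1 for the second.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] from the sample encoding

def build_pair(ids_a, ids_b):
    """Join two id sequences as [CLS] A [SEP] B [SEP] with segment ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

ids_a = [2129, 2024, 2017]  # how are you
ids_b = [7592]              # hello

pair_ids, pair_types = build_pair(ids_a, ids_b)
print(pair_ids)
print(pair_types)
```

For single sentences, as in the rest of this notebook, every token belongs to the first segment, so the token type ids are all 0.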

In [50]:
tokenizer(sentences)

{'input_ids': [[101, 2002, 2003, 3564, 2379, 1996, 2314, 2924, 102],
               [101, 2002, 2003, 3564, 1999, 1996, 2924, 2000, 2131, 2070, 5356, 102],
               [101, 1996, 10777, 2003, 3564, 2379, 1996, 2924, 1997, 1996, 2314, 102],
               [101, 1996, 2924, 2003, 2701, 2061, 2002, 3685, 2131, 2151, 2769, 2651, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [87]:
bert_token_ids = tokenizer(sentences)['input_ids']
bert_token_ids

[[101, 2002, 2003, 3564, 2379, 1996, 2314, 2924, 102],
 [101, 2002, 2003, 3564, 1999, 1996, 2924, 2000, 2131, 2070, 5356, 102],
 [101, 1996, 10777, 2003, 3564, 2379, 1996, 2924, 1997, 1996, 2314, 102],
 [101, 1996, 2924, 2003, 2701, 2061, 2002, 3685, 2131, 2151, 2769, 2651, 102]]

# Types of BERT Embeddings

There are two main types of embeddings we can get from BERT:

1. Embeddings of each token (which we can later combine using a strategy such as the mean)
2. An overall pooled embedding: a fixed-size 1-D vector computed from the top-layer hidden state of the first token, i.e. [CLS], passed through a dense layer with a tanh activation

## Pooled Embedding

The BERT encoder produces a sequence of hidden states. For classification tasks, this sequence ultimately needs to be reduced to a single vector. There are multiple ways of converting this sequence to a single vector representation of a sentence. One is max or mean pooling. Another is applying attention. The authors, however, opt for a much simpler method: taking the hidden state corresponding to the first token.

To make this pooling scheme work, BERT prepends a [CLS] token (short for "classification") to the start of each sentence (this is essentially like a start-of-sentence token).

![](https://i.imgur.com/LODFglb.png)

[Source](https://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)


__If you really want to dive into the details, you can check the source code of BERT in [these lines here](https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/modeling.py#L224-L232) which show how this happens__
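
In the source lines linked above, the pooled output is obtained by taking the hidden state of the first token and passing it through a dense layer with a tanh activation. The sketch below mirrors that computation in NumPy. The hidden states and pooler weights are random stand-ins (assumptions for illustration), so the numbers are meaningless; only the shapes and the order of operations are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768   # hidden size of bert-base
seq_len = 8         # number of tokens, including [CLS] and [SEP]

# Stand-ins for real values: random hidden states and random pooler weights.
hidden_states = rng.normal(size=(seq_len, hidden_size))
pooler_weights = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
pooler_bias = np.zeros(hidden_size)

# Take the hidden state of the first ([CLS]) token ...
first_token = hidden_states[0]

# ... and pass it through a dense layer followed by tanh.
pooled = np.tanh(first_token @ pooler_weights + pooler_bias)

print(pooled.shape)
```

Because of the tanh, every value of the pooled vector lies between -1 and 1.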

## Token Embeddings

We can also get a 1-D embedding for every token in our sentences. This can be visualized as follows.

![](https://i.imgur.com/ckzQGKC.png)

[Source](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a)

We can average these token embeddings to get a document embedding. Other strategies, such as max pooling or attention-based pooling, are also possible.
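
When several sentences are padded into one batch, the average should skip the padding positions. The following is a minimal sketch of such a masked mean in NumPy, using made-up 2-dimensional embeddings; `mean_pool` is an illustrative helper and not part of the transformers library. The notebook itself encodes one sentence at a time, so no padding occurs there and a plain mean is enough.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)             # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 positions each, 2-dimensional embeddings.
token_embeddings = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],   # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
attention_mask = np.array([[1, 1, 0],
                           [1, 1, 1]])

doc_emb = mean_pool(token_embeddings, attention_mask)
print(doc_emb)
```

The padded position in the first sentence does not affect its average, which comes out as the mean of the first two vectors only.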

In [88]:
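# Run the model once per sentence; sentences differ in length, so keep the outputs in lists
outputs = [model(np.array([tokens])) for tokens in bert_token_ids]
token_emb = [out[0].numpy() for out in outputs]    # each of shape (1, n_tokens, 768)
pooled_emb = [out[1].numpy() for out in outputs]   # each of shape (1, 768)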

In [89]:
pooled_emb = np.array([item[0] for item in pooled_emb])
pooled_emb.shape

(4, 768)

# Semantic Similarity based on Pooled BERT Embeddings

In [91]:
sentences

['He is sitting near the river bank',
 'He is sitting in the bank to get some cash',
 'The elephant is sitting near the bank of the river',
 'The bank is closed so he cannot get any money today']

In [90]:
similarity_matrix = cosine_similarity(pooled_emb)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.756986 | 0.932169 | 0.947001 |
| 1 | 0.756986 | 1.0 | 0.895952 | 0.893017 |
| 2 | 0.932169 | 0.895952 | 1.0 | 0.9728 |
| 3 | 0.947001 | 0.893017 | 0.9728 | 1.0 |


# Semantic Similarity based on Averaged BERT Token Embeddings

In [157]:
sentences

['He is sitting near the river bank',
 'He is sitting in the bank to get some cash',
 'The elephant is sitting near the bank of the river',
 'The bank is closed so he cannot get any money today']

In [92]:
token_emb_flat = np.array([np.mean(item[0], axis=0) for item in token_emb])
token_emb_flat.shape

(4, 768)

In [93]:
similarity_matrix = cosine_similarity(token_emb_flat)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.753337 | 0.829573 | 0.716291 |
| 1 | 0.753337 | 1.0 | 0.674486 | 0.857112 |
| 2 | 0.829573 | 0.674486 | 1.0 | 0.686655 |
| 3 | 0.716291 | 0.857112 | 0.686655 | 1.0 |


# Fun with Embeddings: Simple Search Engine!

Let's create a corpus of documents to serve as the source against which we run text searches.

In [96]:
corpus = np.array(['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ])

## Similar document search using Word2Vec Embeddings

In [97]:
w2v_vectors = averaged_word_vectorizer(corpus, model=w2v_model, num_features=300)

In [145]:
def get_w2v_similar_docs(new_sentence, w2v_model, num_features, corpus_vectors):
  ns_w2v = averaged_word_vectorizer([new_sentence], model=w2v_model, num_features=num_features)
  cs = cosine_similarity(ns_w2v, corpus_vectors)
  top2_idx = np.argsort(-cs)[0][:2]
  print('Top 2 most similar to:', new_sentence)
  print(corpus[top2_idx])
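
The search function above boils down to two steps: score every document vector against the query vector, then sort the scores in descending order and keep the first few indices. The sketch below shows that pattern on made-up 2-D vectors so it can run without any model; `top_k_similar` is an illustrative helper, not a library function.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=2):
    """Return indices and cosine scores of the k most similar corpus vectors."""
    query = query_vec / np.linalg.norm(query_vec)
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = corpus_norm @ query            # cosine similarity per document
    top_idx = np.argsort(-scores)[:k]       # negate to sort in descending order
    return top_idx, scores[top_idx]

# Toy document vectors and a query that points mostly along the first axis.
corpus_vecs = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
query_vec = np.array([2.0, 0.1])

top_idx, top_scores = top_k_similar(query_vec, corpus_vecs, k=2)
print(top_idx, top_scores)
```

Negating the scores before `np.argsort` is the same trick used in the function above to get the highest similarities first.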

In [146]:
new_sentence = 'A man is eating a pasta'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A cheetah is running behind its prey.']


In [147]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: Someone in a gorilla costume is playing a set of drums.
['A man is riding a white horse on an enclosed ground.'
 'A man is eating food.']


In [148]:
new_sentence = 'A cheetah chases prey on across a field.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is eating a piece of bread.']


## Similar document search using BERT Embeddings

In [149]:
bert_token_ids = tokenizer(list(corpus))['input_ids']
# Run the model once per document and reuse both outputs
outputs = [model(np.array([tokens])) for tokens in bert_token_ids]
token_emb = [out[0].numpy() for out in outputs]                  # each of shape (1, n_tokens, 768)
pooled_emb = np.array([out[1].numpy()[0] for out in outputs])    # shape (n_docs, 768)
token_emb_flat = np.array([np.mean(item[0], axis=0) for item in token_emb])  # shape (n_docs, 768)

In [151]:
def get_bert_similar_docs(new_sentence, bert_tokenizer, bert_model, pooled_corpus_vectors, token_corpus_vectors_flat):
  tokens = bert_tokenizer([new_sentence])['input_ids']
  # Run the model once on the single query sentence and reuse both outputs
  outputs = bert_model(np.array(tokens))
  token_ns_emb_flat = np.mean(outputs[0].numpy(), axis=1)   # averaged token embeddings, shape (1, 768)
  pooled_ns_emb = outputs[1].numpy()                        # pooled embedding, shape (1, 768)

  cs = cosine_similarity(pooled_ns_emb, pooled_corpus_vectors)
  top2_idx = np.argsort(-cs)[0][:2]
  print('[Pooled Embedding] Top 2 most similar to:', new_sentence)
  print(corpus[top2_idx])
  print()

  cs = cosine_similarity(token_ns_emb_flat, token_corpus_vectors_flat)
  top2_idx = np.argsort(-cs)[0][:2]
  print('[Avg Token Embeddings] Top 2 most similar to:', new_sentence)
  print(corpus[top2_idx])

In [152]:
new_sentence = 'A man is eating a pasta'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      pooled_corpus_vectors=pooled_emb, 
                      token_corpus_vectors_flat=token_emb_flat)

[Pooled Embedding] Top 2 most similar to: A man is eating a pasta
['A cheetah is running behind its prey.' 'A man is eating food.']

[Avg Token Embeddings] Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A man is eating a piece of bread.']


In [154]:
new_sentence = 'Someone in a gorilla costume is playing a set of drums.'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      pooled_corpus_vectors=pooled_emb, 
                      token_corpus_vectors_flat=token_emb_flat)

[Pooled Embedding] Top 2 most similar to: Someone in a gorilla costume is playing a set of drums.
['A monkey is playing drums.' 'A man is riding a horse.']

[Avg Token Embeddings] Top 2 most similar to: Someone in a gorilla costume is playing a set of drums.
['A monkey is playing drums.' 'A woman is playing violin.']


In [155]:
new_sentence = 'A cheetah chases prey on across a field.'
get_bert_similar_docs(new_sentence, 
                      bert_tokenizer=tokenizer, 
                      bert_model=model, 
                      pooled_corpus_vectors=pooled_emb, 
                      token_corpus_vectors_flat=token_emb_flat)

[Pooled Embedding] Top 2 most similar to: A cheetah chases prey on across a field.
['A man is riding a white horse on an enclosed ground.'
 'A man is riding a horse.']

[Avg Token Embeddings] Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is riding a white horse on an enclosed ground.']
