
<center><a target="_blank" href="https://academy.constructor.org/"><img src="https://jobtracker.ai/static/media/constructor_academy_colour.b86fa87f.png" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center>Constructor Academy, 2024</center>

# Text Representation and Semantic Meaning with Transfer Learning (Pre-trained Embeddings)



![](https://i.imgur.com/7SXKckD.png)

Transfer learning means taking a model that has already been trained and tuning or adapting it to our own downstream task.

## Finding semantic similarity with Pre-trained Embeddings

Here we use pre-trained embedding models and deep learning models to extract embeddings from sentences and measure their semantic similarity.

Models we will look at:

1. Pre-trained Word2Vec Embeddings from Google
2. BERT

In [None]:
!pip install gdown --ignore-install

Collecting gdown
  Downloading gdown-4.7.3-py3-none-any.whl (16 kB)
Collecting filelock (from gdown)
  Downloading filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting requests[socks] (from gdown)
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Collecting six (from gdown)
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting tqdm (from gdown)
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Collecting beautifulsoup4 (from gdown)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->gdown)
  Downloading soupsieve-2.5-py3-none-any.whl (36 k

## Create a sample corpus

In [None]:
sentences = ['He is sitting by the fire',
             'He was sitting in the office till they decided to fire him']
sentences

['He is sitting by the fire',
 'He was sitting in the office till they decided to fire him']

## Get Pre-trained Google Word2Vec Embeddings

The word2vec tool takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns a vector representation for each word. The resulting word vectors can be used as features in many natural language processing and machine learning applications.

Google has published pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

The archive is available at the link below:

Source: https://code.google.com/archive/p/word2vec/

## Download the Pre-trained Vectors

In [None]:
!gdown '1VPwziUXxRukY8qvbLaVfITav9JBjFXKH'

Downloading...
From (original): https://drive.google.com/uc?id=1VPwziUXxRukY8qvbLaVfITav9JBjFXKH
From (redirected): https://drive.google.com/uc?id=1VPwziUXxRukY8qvbLaVfITav9JBjFXKH&confirm=t&uuid=a95a21d1-2277-4c2a-b832-3ef36aa52134
To: /content/GoogleNews-vectors-negative300.bin.gz
100% 1.65G/1.65G [00:29<00:00, 56.6MB/s]


In [None]:
!ls -l

total 1608452
-rw-r--r-- 1 root root 1647046227 Jan 17 14:21 GoogleNews-vectors-negative300.bin.gz
drwxr-xr-x 1 root root       4096 Jan 12 19:20 sample_data


In [None]:
!gunzip GoogleNews-vectors-negative300.bin.gz

In [None]:
!ls -l --block-size=MB

total 3645MB
-rw-r--r-- 1 root root 3645MB Jan 17 14:21 GoogleNews-vectors-negative300.bin
drwxr-xr-x 1 root root    1MB Jan 12 19:20 sample_data


## Load Word2Vec Embeddings in a Word2Vec Model

In [None]:
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
w2v_model

<gensim.models.keyedvectors.KeyedVectors at 0x7ed91fd80a00>

## Sample embedding from already trained Word2Vec Model

In [None]:
w2v_model['fire'].shape

(300,)

In [None]:
w2v_model['fire']

array([ 3.55468750e-01,  1.83593750e-01,  1.49414062e-01, -9.37500000e-02,
        1.78710938e-01, -8.39843750e-02, -6.39648438e-02, -2.57812500e-01,
       -1.17187500e-01,  1.36718750e-01,  2.27539062e-01, -2.55859375e-01,
       -1.85546875e-01, -2.08007812e-01, -1.70898438e-01,  2.56347656e-02,
       -1.20117188e-01, -1.08398438e-01, -6.34765625e-02,  1.51977539e-02,
        1.50390625e-01, -2.05078125e-01,  2.02148438e-01, -5.51757812e-02,
        1.75781250e-02,  1.55273438e-01,  2.42919922e-02,  1.37695312e-01,
        3.22265625e-01, -1.06445312e-01, -6.49414062e-02, -3.84765625e-01,
       -3.04687500e-01, -2.38281250e-01, -1.69921875e-01, -1.78710938e-01,
       -1.34765625e-01,  1.03027344e-01, -1.31835938e-01,  2.35595703e-02,
        4.51660156e-02, -1.52343750e-01,  1.43554688e-01, -2.57812500e-01,
       -8.59375000e-02, -2.57812500e-01, -1.60156250e-01, -1.34765625e-01,
       -1.41601562e-01,  1.92382812e-01,  2.81250000e-01,  1.43554688e-01,
        2.08007812e-01, -

## Document Embeddings from Averaging Word Embeddings

In [None]:
import numpy as np

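def average_word_vectors(words, model, vocabulary, num_features):
    """Average the vectors of all items in `words` that are in `vocabulary`.

    `words` should be a list of tokens. A raw string is iterated character
    by character, so single characters would be looked up instead of words.
    """
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0

    for word in words:
        if word in vocabulary:
            nwords += 1
            feature_vector += model[word]

    # Leave the zero vector untouched when no token was found in the vocabulary.
    if nwords:
        feature_vector /= nwords

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    """Return a (len(corpus), num_features) array with one averaged vector per document."""
    vocabulary = set(model.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)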
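
The averaging step is easy to check by hand on a toy vocabulary. The sketch below is not part of the original notebook: the three-dimensional vectors and the helper `average_tokens` are made up for illustration. It shows how the result differs depending on whether a document is passed as a list of tokens or as a raw string (which Python iterates character by character).

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for this illustration only.
toy_vectors = {
    "he": np.array([1.0, 0.0, 0.0]),
    "fire": np.array([0.0, 1.0, 0.0]),
    "e": np.array([0.0, 0.0, 1.0]),  # a single character that happens to be in the vocabulary
}

def average_tokens(tokens, vectors, num_features):
    """Average the vectors of the items in `tokens` that exist in `vectors`."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(num_features)
    return np.mean(found, axis=0)

sentence = "he fire"
word_avg = average_tokens(sentence.split(), toy_vectors, 3)  # iterates over words
char_avg = average_tokens(sentence, toy_vectors, 3)          # iterates over characters

print(word_avg)  # mean of the "he" and "fire" vectors
print(char_avg)  # mean of the vectors for the character "e" only
```

A simple whitespace split is used here as an assumption; a real pipeline may also want lowercasing and punctuation handling, depending on how the embedding vocabulary was built.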

## Observe Similarity based on Word2Vec Embeddings

In [None]:
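# Caution: `sentences` holds raw strings, so the vectorizer iterates over characters
# and averages the vectors of single characters found in the vocabulary.
# Pass token lists (e.g. [s.split() for s in sentences]) to average word vectors.
w2v_vectors = averaged_word_vectorizer(sentences, model=w2v_model, num_features=300)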

In [None]:
sentences

['He is sitting by the fire',
 'He was sitting in the office till they decided to fire him']

In [None]:
w2v_vectors.shape

(2, 300)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

similarity_matrix = cosine_similarity(w2v_vectors)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

          0         1
0  1.000000  0.986952
1  0.986952  1.000000
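
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch of that formula follows; the helper name `cosine_sim` and the toy vectors are assumptions for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two non-zero 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1.0, 0.0]
v = [1.0, 1.0]
sim = cosine_sim(u, v)  # the vectors are 45 degrees apart
print(sim)
```

Because the score depends only on the angle between the vectors, scaling either vector leaves it unchanged.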


## Pre-trained Transformer Embeddings (BERT)

![](https://i.imgur.com/dDd5ZbP.png)

In [None]:
!pip install transformers



## BERT for Feature Extraction

## Load Pre-trained BERT Model

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

## Sample Encoding and Feature Extraction

In [None]:
sentences[0]

'He is sitting by the fire'

In [None]:
token_ids = tokenizer.encode(sentences[0])
token_ids

[101, 2002, 2003, 3564, 2011, 1996, 2543, 102]

In [None]:
["<"+tokenizer.decode([item])+">" for item in token_ids]

['<[CLS]>', '<he>', '<is>', '<sitting>', '<by>', '<the>', '<fire>', '<[SEP]>']

In [None]:
np.array([token_ids]), np.array([token_ids]).shape

(array([[ 101, 2002, 2003, 3564, 2011, 1996, 2543,  102]]), (1, 8))

In [None]:
model(np.array([token_ids]))[0]

<tf.Tensor: shape=(1, 8, 768), dtype=float32, numpy=
array([[[-0.00656033,  0.21889031,  0.04123305, ..., -0.2512958 ,
          0.5053678 ,  0.30240256],
        [ 0.16671468, -0.14512278,  0.21866603, ...,  0.00584966,
          0.90117884, -0.50590026],
        [ 0.10758889, -0.1345629 ,  0.08733959, ...,  0.03014014,
          0.3650271 ,  0.2472037 ],
        ...,
        [ 0.20182168, -0.28290182, -0.6742688 , ...,  0.21596226,
          0.26962116, -0.73235154],
        [ 0.18315123,  0.27861282, -0.37248605, ...,  0.44760212,
          0.6359492 , -0.75997436],
        [ 0.46276826,  0.1875516 , -0.14534219, ..., -0.11311157,
         -0.39454335, -0.5757332 ]]], dtype=float32)>

## BERT Tokenization of Sequences of Text

#### Input IDs
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

<br/>

#### Attention mask
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model which tokens should be attended to, and which should not.

<br/>

#### Token Type IDs
Some models’ purpose is to do sequence classification or question answering. These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two sequence input as such:

```
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```

https://huggingface.co/transformers/glossary.html#token-type-ids
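
To make the three inputs concrete, here is a plain-Python sketch that builds them by hand, without calling the tokenizer. It uses the token ids the tokenizer produces for the two sample sentences. The helpers `pad_batch` and `pair_token_type_ids` are hypothetical names for this illustration, and the padding id 0 is assumed to be `[PAD]` as in the `bert-base-uncased` vocabulary.

```python
PAD_ID = 0  # assumed id of [PAD] in the bert-base-uncased vocabulary

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

def pair_token_type_ids(len_a, len_b):
    """Token type ids for [CLS] A [SEP] B [SEP]: 0 for the first segment, 1 for the second."""
    return [0] * (len_a + 2) + [1] * (len_b + 1)

batch = [
    [101, 2002, 2003, 3564, 2011, 1996, 2543, 102],
    [101, 2002, 2001, 3564, 1999, 1996, 2436, 6229, 2027, 2787, 2000, 2543, 2032, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
print(pair_token_type_ids(2, 3))
```

In practice the tokenizer can do the padding itself, for example with `tokenizer(sentences, padding=True)`; the manual version is only meant to show what the returned fields contain.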

In [None]:
sentences

['He is sitting by the fire',
 'He was sitting in the office till they decided to fire him']

In [None]:
tokenizer(sentences)

{'input_ids': [[101, 2002, 2003, 3564, 2011, 1996, 2543, 102], [101, 2002, 2001, 3564, 1999, 1996, 2436, 6229, 2027, 2787, 2000, 2543, 2032, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [None]:
bert_token_ids = tokenizer(sentences)['input_ids']
bert_token_ids

[[101, 2002, 2003, 3564, 2011, 1996, 2543, 102],
 [101,
  2002,
  2001,
  3564,
  1999,
  1996,
  2436,
  6229,
  2027,
  2787,
  2000,
  2543,
  2032,
  102]]

## Semantic Similarity based on Averaged BERT Token Embeddings

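BERT returns one 768-dimensional embedding for every token in a sentence.

Averaging these token embeddings gives a single fixed-size document embedding. Note that this plain mean also includes the special `[CLS]` and `[SEP]` tokens. Other pooling strategies can work better, such as using only the `[CLS]` vector or using models trained specifically for sentence embeddings (see the sentence transformers pointer at the end of this notebook).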
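
One such alternative, relevant once sentences are padded and processed as a batch, is a mean that ignores padding positions. The sketch below is an illustration under stated assumptions, not the method used in this notebook: `masked_mean_pool` is a hypothetical helper, and random numbers stand in for real BERT outputs.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Mean over the token axis, counting only positions where the mask is 1.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len) with 0/1 entries
    """
    mask = np.asarray(attention_mask, dtype=float)[..., None]   # (batch, seq_len, 1)
    summed = (np.asarray(token_embeddings) * mask).sum(axis=1)  # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # avoid division by zero
    return summed / counts

# Random stand-in for model output: 2 sentences, 4 token positions, hidden size 3.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 0, 0],   # first sentence has 2 real tokens and 2 padding positions
                 [1, 1, 1, 1]])  # second sentence has no padding
pooled = masked_mean_pool(emb, mask)
print(pooled.shape)
```

When each sentence is run through the model on its own, as in this notebook, there is no padding and the masked mean equals the plain mean.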

In [None]:
sentences

['He is sitting by the fire',
 'He was sitting in the office till they decided to fire him']

In [None]:
bert_token_ids

[[101, 2002, 2003, 3564, 2011, 1996, 2543, 102],
 [101,
  2002,
  2001,
  3564,
  1999,
  1996,
  2436,
  6229,
  2027,
  2787,
  2000,
  2543,
  2032,
  102]]

In [None]:
token_emb = [model(np.array([tokens]))[0] for tokens in bert_token_ids]

In [None]:
token_emb[0].shape # sentence 1

TensorShape([1, 8, 768])

In [None]:
token_emb[1].shape # sentence 2

TensorShape([1, 14, 768])

In [None]:
token_emb_flat = np.array([np.mean(item[0], axis=0) for item in token_emb])
token_emb_flat.shape

(2, 768)

In [None]:
similarity_matrix = cosine_similarity(token_emb_flat)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

          0         1
0  1.000000  0.616723
1  0.616723  1.000000


## Fun with Embeddings: Simple Search Engine!

Let's create a small corpus of documents to run text searches against.

In [None]:
corpus = np.array(['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ])

## Similar document search using Word2Vec Embeddings

In [None]:
w2v_vectors = averaged_word_vectorizer(corpus, model=w2v_model, num_features=300)

In [None]:
w2v_vectors.shape

(9, 300)

In [None]:
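def get_w2v_similar_docs(new_sentence, w2v_model, num_features, corpus_vectors):
  # Embed the query with the same vectorizer that was used for the corpus.
  ns_w2v = averaged_word_vectorizer([new_sentence], model=w2v_model, num_features=num_features)
  # Cosine similarity between the query (one row) and every corpus document.
  cs = cosine_similarity(ns_w2v, corpus_vectors)
  # Indices of the two highest-scoring documents.
  top2_idx = np.argsort(-cs)[0][:2]
  print('Top 2 most similar to:', new_sentence)
  # Note: `corpus` is the global array defined above.
  print(corpus[top2_idx])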
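
The ranking step works the same way for any embedding model: score the query against every document vector, then sort. Here is a minimal sketch on toy two-dimensional vectors; `top_k_similar` and the vectors are assumptions for illustration, and all vectors are assumed to be non-zero.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=2):
    """Return the indices and cosine scores of the k corpus vectors closest to the query."""
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity of the query against every row of the corpus matrix.
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    # Negate so that argsort (ascending) yields the highest scores first.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

toy_corpus_vecs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
idx, scores = top_k_similar([1.0, 0.1], toy_corpus_vecs, k=2)
print(idx, scores)
```

Returning the indices and scores instead of printing them keeps the function independent of any global corpus variable and makes the result easy to reuse.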

In [None]:
new_sentence = 'A man is eating a pasta'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A cheetah is running behind its prey.']


In [None]:
new_sentence = 'A gorilla is playing a set of drums.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A gorilla is playing a set of drums.
['A monkey is playing drums.'
 'A man is riding a white horse on an enclosed ground.']


In [None]:
new_sentence = 'A cheetah chases prey on across a field.'
get_w2v_similar_docs(new_sentence, w2v_model, num_features=300, corpus_vectors=w2v_vectors)

Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is eating a piece of bread.']


## Similar document search using BERT Embeddings

In [None]:
bert_token_ids = tokenizer(list(corpus))['input_ids']
token_emb = [model(np.array([tokens]))[0] for tokens in bert_token_ids]
token_emb_flat = np.array([np.mean(item[0], axis=0) for item in token_emb])

In [None]:
token_emb_flat.shape

(9, 768)

In [None]:
def get_bert_similar_docs(new_sentence, bert_tokenizer, bert_model, token_corpus_vectors_flat):
  # Token ids for the query: a list containing one sequence of ids.
  batch_ids = bert_tokenizer([new_sentence])['input_ids']
  # Token embeddings per sequence, each of shape (1, num_tokens, hidden_size).
  token_ns_emb = [bert_model(np.array([ids]))[0] for ids in batch_ids]
  # Average over the token axis to get one vector per sequence.
  token_ns_emb_flat = np.array([np.mean(item[0], axis=0) for item in token_ns_emb])

  cs = cosine_similarity(token_ns_emb_flat, token_corpus_vectors_flat)
  top2_idx = np.argsort(-cs)[0][:2]
  print('[Avg Token Embeddings] Top 2 most similar to:', new_sentence)
  # Note: `corpus` is the global array defined above.
  print(corpus[top2_idx])

In [None]:
new_sentence = 'A man is eating a pasta'
get_bert_similar_docs(new_sentence,
                      bert_tokenizer=tokenizer,
                      bert_model=model,
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: A man is eating a pasta
['A man is eating food.' 'A man is eating a piece of bread.']


In [None]:
new_sentence = 'A gorilla is playing a set of drums.'
get_bert_similar_docs(new_sentence,
                      bert_tokenizer=tokenizer,
                      bert_model=model,
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: A gorilla is playing a set of drums.
['A monkey is playing drums.' 'A woman is playing violin.']


In [None]:
new_sentence = 'A cheetah chases prey on across a field.'
get_bert_similar_docs(new_sentence,
                      bert_tokenizer=tokenizer,
                      bert_model=model,
                      token_corpus_vectors_flat=token_emb_flat)

[Avg Token Embeddings] Top 2 most similar to: A cheetah chases prey on across a field.
['A cheetah is running behind its prey.'
 'A man is riding a white horse on an enclosed ground.']


### Use Sentence Transformers for Transformer Embeddings

The notebook linked below should be useful for exercise 2; see its section 'Building Robust Semantic Search Engines with Transformers'.

https://nbviewer.org/github/dipanjanS/adv_nlp_workshop_odsc_europe22/blob/main/04_NLP_Applications_Contextual_Embeddings_and_Search_Engines.ipynb