## <center>Word Vectors (a.k.a. Word Embeddings)</center>

### 1.1 Word2Vec
 - Vector representations of words (i.e. word vectors) learned using a neural network
   - e.g. "apple": [0.35, -0.2, 0.4, ...], "mango": [0.32, -0.18, 0.5, ...]
   - Interesting properties of word vectors:
    * **Words with similar semantics have close word vectors**
    <img src="https://www.kdnuggets.com/images/cartoon-espresso-word2vec.jpg" width="50%">
    source: https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
    * **Composition**: e.g. vector("woman") + vector("king") - vector("man") $\approx$ vector("queen")
 - Models:
   - **CBOW** (Continuous Bag of Words): predict a target word from its context words
     - e.g. the fox jumped over the lazy dog
     - Assuming a symmetric context window of size 3 (the target word plus one word on each side), this sentence yields the training samples:
       - ([-, fox], the) 
       - ([the, jumped], fox) 
       - ([fox, over], jumped)
       - ([jumped, the], over) 
       - ...
       
       <img src="cbow.png" width="50%">
       source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
   - **Skip-gram**: predict the context words from a target word
   
        <img src="skip_gram.png" width="50%">
        source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
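
The difference between the two models is easiest to see in the training samples they generate. The following is a minimal sketch in plain Python, not the gensim implementation; the function names and the `half_window` parameter are hypothetical. It uses one context word on each side, as in the CBOW example above, and simply omits a missing neighbour at the sentence boundary instead of using a `-` placeholder.

```python
def context_window(tokens, i, half_window):
    """Return the words within `half_window` positions of tokens[i]."""
    left = tokens[max(0, i - half_window):i]
    right = tokens[i + 1:i + 1 + half_window]
    return left + right


def cbow_samples(tokens, half_window=1):
    """CBOW: one (context words, target word) sample per position."""
    return [(context_window(tokens, i, half_window), target)
            for i, target in enumerate(tokens)]


def skipgram_samples(tokens, half_window=1):
    """Skip-gram: one (target word, context word) sample per context word."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_window(tokens, i, half_window)]


tokens = "the fox jumped over the lazy dog".split()
print(cbow_samples(tokens)[:4])
print(skipgram_samples(tokens)[:5])
```

CBOW produces one sample per word, while skip-gram produces one sample per (target, context) pair, so the same sentence yields more skip-gram samples.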

In [1]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Exercise 1.1 Train your word vector

import pandas as pd
import nltk,string

# Load data
data=pd.read_csv('amazon_review_large.csv')
data.columns=['label','text']
data

# tokenize each document into a list of unigrams
# strip punctuation and leading/trailing spaces from each unigram
# keep only unigrams with 2 or more characters
sentences=[ [token.strip(string.punctuation).strip() \
             for token in nltk.word_tokenize(doc.lower()) \
                 if token not in string.punctuation and \
                 len(token.strip(string.punctuation).strip())>=2]\
             for doc in data["text"]]
print(sentences[0:2])

Unnamed: 0,label,text
0,2,This is a little longer and more detailed than...
1,1,Only Michelle Branch save this album!!!!All gu...
2,2,"A surprisingly good book, given its inherently..."
3,2,"This is a wonderful, quiet and relaxing CD tha..."
4,1,The lights that I received are absolutely not ...
...,...,...
19995,1,This was suppose to be such a great movie and ...
19996,2,I gave this poncho as a birthday gift to my hu...
19997,1,These diapers absorb MUCH less than Swaddlers....
19998,2,"""Moonstruck"" is one of the best movies of the ..."


[['this', 'is', 'little', 'longer', 'and', 'more', 'detailed', 'than', 'the', 'first', 'two', 'books', 'in', 'the', 'series', 'however', 'have', 'enjoyed', 'each', 'new', 'aspect', 'of', 'the', 'exciting', 'fantasy', 'universe'], ['only', 'michelle', 'branch', 'save', 'this', 'album', 'all', 'guys', 'play', 'along', 'with', 'unenthusiastic', 'beat', 'even', 'karl']]


In [3]:
# Train your own word vectors using gensim

# gensim.models is the package for word2vec
# check https://radimrehurek.com/gensim/models/word2vec.html
# for detailed description

from gensim.models import word2vec
import logging
import pandas as pd

# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)

# min_count: words with total frequency lower than this are ignored
# vector_size: the dimension of each word vector
# window: context window, i.e. the maximum distance 
#         between the current and predicted word 
#         within a sentence
# workers: number of parallel threads used in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences, \
            min_count=5,vector_size=200, \
            window=5, workers=4 )

2023-04-12 20:13:53,054 : INFO : collecting all words and their counts
2023-04-12 20:13:53,055 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-04-12 20:13:53,218 : INFO : PROGRESS: at sentence #10000, processed 711991 words, keeping 36968 word types
2023-04-12 20:13:53,351 : INFO : collected 55241 word types from a corpus of 1424289 raw words and 20000 sentences
2023-04-12 20:13:53,352 : INFO : Creating a fresh vocabulary
2023-04-12 20:13:53,447 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 12133 unique words (21.96375880233884%% of original 55241, drops 43108)', 'datetime': '2023-04-12T20:13:53.420065', 'gensim': '4.1.2', 'python': '3.7.6 (default, Jan  8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.6.0-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2023-04-12 20:13:53,450 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1361999 word corpus (95.62658982832838%% of orig
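
The log above reports that `min_count=5` keeps only about 22% of the distinct words but about 96% of the running words. The sketch below reproduces that kind of bookkeeping on a toy corpus; it is an illustration in plain Python, not gensim's code, and `prune_vocab` and `toy_sentences` are hypothetical names.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(token for sent in sentences for token in sent)
    kept = {word: c for word, c in counts.items() if c >= min_count}
    return counts, kept


# Toy corpus (made up for illustration).
toy_sentences = [["good", "movie"], ["good", "sound"],
                 ["bad", "movie"], ["good", "plot"]]

counts, kept = prune_vocab(toy_sentences, min_count=2)

# Share of distinct words retained vs. share of running words retained.
retained_types = len(kept) / len(counts)
retained_tokens = sum(kept.values()) / sum(counts.values())
print(sorted(kept), retained_types, retained_tokens)
```

Rare words make up most of the vocabulary but only a small part of the text, which is why pruning them shrinks the model considerably while discarding little training data.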

In [4]:
# test word2vec model

print("Top 5 words similar to word 'sound'")
wv_model.wv.most_similar('sound', topn=5)

print("Top 5 words similar to word 'sound' but not relevant to 'film'")
wv_model.wv.most_similar(positive=['sound','music'], \
                         negative=['film'], topn=5)

print("Similarity between 'movie' and 'film':")
wv_model.wv.similarity('movie','film') 

print("Similarity between 'movie' and 'city':")
wv_model.wv.similarity('movie','city') 

print("Word does not match with others in the list of \
['sound', 'music', 'graphics', 'actor', 'book']:")
wv_model.wv.doesnt_match(["sound", "music", \
                          "graphics", "actor", "book"])

print("Word vector for 'movie':")
wv_model.wv['movie']

Top 5 words similar to word 'sound'


[('metal', 0.7418868541717529),
 ('beats', 0.7396437525749207),
 ('production', 0.7243618965148926),
 ('vocals', 0.7202281951904297),
 ('sounds', 0.7157590389251709)]

Top 5 words similar to word 'sound' but not relevant to 'film'


[('rock', 0.7720798254013062),
 ('pop', 0.7414670586585999),
 ('beats', 0.727301836013794),
 ('songs', 0.7210579514503479),
 ('lyrics', 0.7147101759910583)]

Similarity between 'movie' and 'film':


0.918801

Similarity between 'movie' and 'city':


-0.008146924

Word does not match with others in the list of ['sound', 'music', 'graphics', 'actor', 'book']:


'book'

Word vector for 'movie':


array([-1.27586794e+00, -1.55183077e-01, -8.59031141e-01, -2.40213901e-01,
       -1.13054955e+00,  4.17942643e-01,  6.60929918e-01,  4.77608591e-02,
       -6.73678994e-01, -5.54073811e-01,  7.63609484e-02,  1.98808778e-02,
        2.29297429e-01,  1.05217195e+00, -1.33877146e+00,  4.67828572e-01,
        1.33434558e+00, -1.54373336e+00, -1.68750668e+00, -2.82854867e+00,
        1.92898560e+00,  1.67880177e-01, -3.53898197e-01,  5.58144569e-01,
       -5.66379607e-01,  7.45268464e-01, -1.72002172e+00, -6.55121148e-01,
        1.52349007e+00,  4.76395994e-01,  8.02174568e-01,  5.40708244e-01,
       -3.88299286e-01, -5.14961779e-02, -1.37928200e+00,  1.69840622e+00,
       -2.23905230e+00, -8.48303214e-02, -1.46513450e+00,  4.11255807e-01,
        1.32939541e+00, -1.60527551e+00,  4.35094446e-01, -1.34233892e+00,
       -1.94738835e-01,  1.57682508e-01, -3.29971492e-01,  2.64493167e-01,
        2.07258016e-01,  1.80324465e-01, -7.95697212e-01, -1.29068136e+00,
        6.08371496e-01,  
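
The scores returned by `similarity` and `most_similar` are cosine similarities between word vectors. A minimal plain-Python sketch of that measure follows; the three-dimensional vectors are just the first components of the illustrative "apple" and "mango" vectors from section 1.1, not values from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


apple = [0.35, -0.2, 0.4]
mango = [0.32, -0.18, 0.5]

# Vectors pointing in similar directions score close to 1,
# orthogonal vectors score 0.
print(cosine_similarity(apple, mango))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on direction, it ranges from -1 to 1 regardless of vector length, which matches the near-zero score for 'movie' and 'city' above.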

### 1.2. Pretrained Word Vectors
- Google published pre-trained 300-dimensional vectors for 3 million words and phrases, trained on the Google News dataset (about 100 billion words): https://code.google.com/archive/p/word2vec/
- GloVe (Global Vectors for Word Representation): pretrained word vectors from different data sources, provided by Stanford: https://nlp.stanford.edu/projects/glove/
- FastText by Facebook: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
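
Pretrained vectors are often distributed as plain text, one word per line followed by its vector components. GloVe files use this layout directly, and the word2vec text format adds a header line with the vocabulary size and dimension. The sketch below parses such content from an in-memory toy file; `load_text_vectors` is a hypothetical helper, and the two vectors are the illustrative values from section 1.1, not real pretrained data.

```python
import io


def load_text_vectors(handle, has_header=False):
    """Read `word v1 v2 ...` lines into a dict of word -> list of floats."""
    vectors = {}
    if has_header:
        next(handle)  # skip the "<vocab_size> <dimension>" line
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# GloVe-style content (no header line).
toy_file = io.StringIO("apple 0.35 -0.2 0.4\nmango 0.32 -0.18 0.5\n")
vectors = load_text_vectors(toy_file)

# word2vec text-style content (with header line).
toy_w2v = io.StringIO("2 3\napple 0.35 -0.2 0.4\nmango 0.32 -0.18 0.5\n")
vectors_w2v = load_text_vectors(toy_w2v, has_header=True)
print(vectors["mango"], vectors == vectors_w2v)
```

For real files, a library loader such as the gensim call used in the next cell is the practical choice; this sketch only shows what the file contains.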

In [6]:
# Exercise 1.2: Use pretrained word vectors

# download the bin file for pretrained word vectors
# from above links, e.g. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# Warning: the bin file is very big (over 2G)
# You need a powerful machine to load it

import gensim

model = gensim.models.KeyedVectors.\
load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) 




2023-04-12 20:14:46,630 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin
2023-04-12 20:15:14,863 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2023-04-12T20:15:14.863445', 'gensim': '4.1.2', 'python': '3.7.6 (default, Jan  8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.6.0-x86_64-i386-64bit', 'event': 'load_word2vec_format'}


In [7]:
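# Analogy query: vector("woman") + vector("king") - vector("man").
# Both arguments are passed as lists of words.
# (The output recorded below was produced by the earlier call
#  positive=['women','king'], negative='man'.)
model.most_similar(positive=['woman','king'], \
                      negative=['man'])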

[('queen', 0.4827326238155365),
 ('queens', 0.466781347990036),
 ('kumaris', 0.4653734564781189),
 ('kings', 0.4558638632297516),
 ('womens', 0.422832190990448),
 ('princes', 0.4176960587501526),
 ('Al_Anqari', 0.41725507378578186),
 ('concubines', 0.4011078476905823),
 ('monarch', 0.3962482810020447),
 ('monarchy', 0.39430150389671326)]
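
The arithmetic behind such an analogy query can be sketched in plain Python. This is an illustration under simplifying assumptions, not gensim's implementation: it adds and subtracts the raw vectors (gensim works with normalized vectors), and the three-dimensional toy vectors, whose axes loosely stand for "royalty", "male" and "female", are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [(w, cosine(query, vec))
              for w, vec in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: [royalty, male, female].
toy = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

best = analogy(toy, positive=["woman", "king"], negative=["man"])
print(best)
```

Excluding the query words from the candidates matters: otherwise "king" or "woman" would often rank first simply because they contribute to the query vector.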

### BERT

In [15]:
#!pip install transformers
from transformers import pipeline
import torch
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Artificial Intelligence [MASK] take over the world.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'artificial intelligence can take over the world.',
  'score': 0.3182401657104492,
  'token': 2064,
  'token_str': 'can'},
 {'sequence': 'artificial intelligence will take over the world.',
  'score': 0.18299691379070282,
  'token': 2097,
  'token_str': 'will'},
 {'sequence': 'artificial intelligence to take over the world.',
  'score': 0.05600118637084961,
  'token': 2000,
  'token_str': 'to'},
 {'sequence': 'artificial intelligences take over the world.',
  'score': 0.04519489035010338,
  'token': 2015,
  'token_str': '##s'},
 {'sequence': 'artificial intelligence would take over the world.',
  'score': 0.04515336453914642,
  'token': 2052,
  'token_str': 'would'}]

In [16]:
# Load the BERT tokenizer.
from transformers import BertTokenizer, BertModel
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


Loading BERT tokenizer...


In [17]:
text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print (tokenized_text)
# Words that are not in the vocabulary are split into smaller subwords or single characters.
# The two hash signs (##) preceding a subword mark it as the continuation of a larger word,
# i.e. it is preceded by another subword of the same word.
# This way some of the meaning of the original word is retained.

['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']
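
The split of "embeddings" into `em`, `##bed`, `##ding`, `##s` comes from WordPiece's greedy longest-match-first strategy: at each position, take the longest vocabulary entry that matches, then continue from where it ends. The sketch below illustrates the idea on a single word; `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and stand in for the real tokenizer and its vocabulary of about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"em", "##bed", "##ding", "##s", "want", "for"}
print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("want", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```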


In [18]:
# check out contents of BERT’s vocabulary
list(tokenizer.vocab.keys())[5000:5020]

['knight',
 'lap',
 'survey',
 'ma',
 '##ow',
 'noise',
 'billy',
 '##ium',
 'shooting',
 'guide',
 'bedroom',
 'priest',
 'resistance',
 'motor',
 'homes',
 'sounded',
 'giant',
 '##mer',
 '150',
 'scenes']

In [19]:
# use our data
data=data.iloc[:100]
sentences=data["text"].values
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  This is a little longer and more detailed than the first two books in the series. However, I have enjoyed each new aspect of the exciting fantasy universe.
Tokenized:  ['this', 'is', 'a', 'little', 'longer', 'and', 'more', 'detailed', 'than', 'the', 'first', 'two', 'books', 'in', 'the', 'series', '.', 'however', ',', 'i', 'have', 'enjoyed', 'each', 'new', 'aspect', 'of', 'the', 'exciting', 'fantasy', 'universe', '.']
Token IDs:  [2023, 2003, 1037, 2210, 2936, 1998, 2062, 6851, 2084, 1996, 2034, 2048, 2808, 1999, 1996, 2186, 1012, 2174, 1010, 1045, 2031, 5632, 2169, 2047, 7814, 1997, 1996, 10990, 5913, 5304, 1012]


In [20]:
input_ids = []
#attention_masks = []
max_len =50

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,      # Target length for every sentence.
                        padding = 'max_length',    # Pad shorter sentences to `max_length`.
                        truncation = True,         # Truncate longer sentences to `max_length`.
                        #return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )

    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    #attention_masks.append(encoded_dict['attention_mask'])


# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
#attention_masks = torch.cat(attention_masks, dim=0)

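
Steps (2), (3), (5) and (6) listed in the cell above can be sketched in plain Python. This is an illustration, not the tokenizer's implementation; `encode_fixed_length` is a hypothetical helper. The constants 101, 102 and 0 are the ids of `[CLS]`, `[SEP]` and `[PAD]` in the `bert-base-uncased` vocabulary, and 2023, 2003, 1037 are the ids of 'this', 'is', 'a' from the output above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def encode_fixed_length(token_ids, max_len):
    """Add [CLS]/[SEP], truncate or pad to `max_len`, and build the mask."""
    body = token_ids[:max_len - 2]          # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)         # 1 = real token
    padding = max_len - len(ids)
    ids = ids + [PAD_ID] * padding
    attention_mask = attention_mask + [0] * padding  # 0 = padding
    return ids, attention_mask


# A short sentence is padded ...
short_ids, short_mask = encode_fixed_length([2023, 2003, 1037], max_len=8)
print(short_ids, short_mask)

# ... and a long one is truncated, keeping [SEP] as the last token.
long_ids, long_mask = encode_fixed_length(list(range(2000, 2020)), max_len=8)
print(long_ids, long_mask)
```

The attention mask is what lets a model ignore `[PAD]` positions; passing it to the model alongside `input_ids` keeps padding from influencing the embeddings of real tokens.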


In [21]:
# Load pre-trained model (weights)
bert_model = BertModel.from_pretrained('bert-base-uncased',
                                    #output_attentions = False, # Whether the model returns attentions weights.
                                    output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode: dropout is disabled, so this is a plain feed-forward pass.
bert_model.eval()
with torch.no_grad():

    outputs = bert_model(input_ids)  
    # The third item holds the hidden states from all layers
    # (returned because output_hidden_states=True).
    hidden_states = outputs[2]
    
    

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [22]:
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
# The second dimension is the batch size, i.e. the number of sentences submitted to the model at once
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))



Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 100
Number of tokens: 50
Number of hidden units: 768


### Which vector works best as a contextualized embedding?
- See the discussion in The Illustrated BERT: http://jalammar.github.io/illustrated-bert/
<img src="bert layer.webp" width ="100%">

In [23]:


# get the last four layers
token_embeddings = torch.stack(hidden_states[-4:], dim=0) 
print(token_embeddings.size())

# reorder axes to (sentence, token, layer, hidden unit)
token_embeddings = token_embeddings.permute(1,2,0,3)
print(token_embeddings.size())

# take the mean of the last 4 layers
token_embeddings = token_embeddings.mean(axis=2)
print(token_embeddings.size())

torch.Size([4, 100, 50, 768])
torch.Size([100, 50, 4, 768])
torch.Size([100, 50, 768])
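
The averaging in the cell above can be written compactly. The notation is an assumption introduced here for illustration: $h^{(l)}_{s,t}$ denotes the hidden state of token $t$ in sentence $s$ at layer $l$, where $l = 0$ is the initial embedding layer and $l = 1, \dots, 12$ are the BERT layers, so the last four entries of `hidden_states` are layers 9 to 12.

$$
e_{s,t} = \frac{1}{4} \sum_{l=9}^{12} h^{(l)}_{s,t}, \qquad e_{s,t} \in \mathbb{R}^{768}
$$

With 100 sentences of 50 tokens each, stacking all $e_{s,t}$ gives the tensor of shape $100 \times 50 \times 768$ printed above.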


In [24]:
# averaged embedding of the first token ([CLS]) of the first sentence
token_embeddings[0,0,:]

tensor([-1.1674e-01, -5.8241e-01,  4.6175e-02, -2.3729e-01, -5.1516e-01,
        -1.8817e-01,  3.6290e-01,  2.7166e-01,  1.7761e-01, -8.4111e-01,
        -1.9935e-01,  3.3611e-01,  2.5841e-01,  5.8463e-01,  7.7076e-01,
         4.6129e-01, -1.8080e-01,  5.4306e-01,  3.8711e-01, -2.4097e-01,
         3.7831e-01,  4.6802e-01,  8.4769e-01,  1.0741e-01,  1.6989e-01,
        -2.8467e-01, -5.7680e-01, -1.5120e-01, -4.5195e-01,  7.4452e-02,
        -4.1504e-01,  6.1159e-02,  3.2210e-02,  2.4624e-01,  6.4040e-01,
        -2.9038e-01, -1.0278e-02, -2.5516e-01,  1.7088e-01,  3.1873e-02,
        -4.9831e-01, -2.2552e-01,  4.3180e-01, -2.0207e-01,  3.9851e-01,
        -3.9309e-01, -4.3959e+00,  1.9462e-01, -2.2815e-01,  2.0750e-01,
         2.5107e-01, -3.9300e-01,  6.9135e-02,  2.4547e-01,  5.4014e-01,
         8.6356e-01, -4.0756e-01, -4.7488e-01,  5.7281e-01,  1.6537e-04,
         3.8241e-01,  2.6872e-01, -3.5878e-01, -1.2718e-01, -2.0059e-01,
         1.6424e-01,  5.5433e-01, -1.9884e-01, -3.5

### 1.3. How to use word vectors in classification?

`Convolutional Neural Network`
<img src="CNN.png" width ="100%">

`Recurrent Neural Network`

<img src="https://raw.githubusercontent.com/graviraja/100-Days-of-NLP/master/assets/images/applications/sentiment/simple.gif" width = "90%">
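
Both architectures take a sentence as a sequence of word vectors, typically arranged as a fixed-size matrix with one row per token. The sketch below shows that lookup step in plain Python. It is a simplified illustration: `sentence_to_matrix` is a hypothetical helper, the toy vectors are made up, and mapping unknown words and padding positions to a zero vector is an assumption (other conventions exist).

```python
def sentence_to_matrix(tokens, vectors, max_len, dim):
    """Look up each token's vector; pad or truncate to `max_len` rows."""
    pad = [0.0] * dim
    # Unknown tokens fall back to the zero vector (an assumption of this sketch).
    rows = [list(vectors.get(tok, pad)) for tok in tokens[:max_len]]
    rows += [list(pad) for _ in range(max_len - len(rows))]
    return rows


# Hypothetical 3-dimensional word vectors.
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "movie": [0.2, 0.8, 0.1],
}

matrix = sentence_to_matrix(["good", "movie", "overall"], toy_vectors,
                            max_len=5, dim=3)
for row in matrix:
    print(row)
```

The resulting `max_len` x `dim` matrix is the kind of input a convolutional layer slides its filters over, or a recurrent layer reads one row at a time.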

<img src="https://www.kdnuggets.com/images/cartoon-machine-learning-vacation.jpg" width='60%'>
