## <center>Word Vector (a.k.a. Word Embedding)</center>

### 1.1 Word2Vec
 - Vector representations of words (i.e. word vectors) learned using a neural network
   - e.g. "apple": [0.35, -0.2, 0.4, ...], "mango": [0.32, -0.18, 0.5, ...]
   - Interesting properties of word vectors:
    * **Words with similar semantics have close word vectors**
    <img src="https://www.kdnuggets.com/images/cartoon-espresso-word2vec.jpg" width="50%">
    https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
    * **Composition**: e.g. vector("woman") + vector("king") - vector("man") $\approx$ vector("queen")
 - Models:
   - **CBOW** (Continuous Bag of Words): Predict a target word based on context
     - e.g. "the fox jumped over the lazy dog"
     - Assuming a symmetric context window of size 3 (the target word plus one word on each side), this sentence yields training samples of the form (context, target):
       - ([-, fox], the) 
       - ([the, jumped], fox) 
       - ([fox, over], jumped)
       - ([jumped, the], over) 
       - ...
       
       <img src="cbow.png" width="50%">
       source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
   - **Skip-gram**: Predict the context words based on a target word
   
        <img src="skip_gram.png" width="50%">
        source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
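
The two sampling schemes above can be sketched in a few lines of plain Python. This is only an illustration of how training pairs are formed, not how gensim builds them internally; the helper name `context_windows` and its `half_window` parameter are made up for this example, and edge words simply get a shorter context instead of the `-` padding shown in the list.

```python
def context_windows(tokens, half_window=1):
    """Yield (context, target) pairs with up to `half_window` words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        yield left + right, target


tokens = "the fox jumped over the lazy dog".split()

# CBOW: the whole context is used to predict the target word
cbow_samples = list(context_windows(tokens))

# Skip-gram: the target word is used to predict each context word separately
skipgram_samples = [(target, c)
                    for context, target in cbow_samples
                    for c in context]

print(cbow_samples[:4])
print(skipgram_samples[:5])
```

Each CBOW sample holds one target with all of its neighbours, while skip-gram turns the same window into several (target, context word) pairs.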

In [1]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [11]:
# Exercise 1.1 Train your word vector

import pandas as pd
import nltk,string

# Load data
data=pd.read_csv('amazon_review_large.csv')
data.columns=['label','text']
data.head()

# tokenize each document into a list of lowercase unigrams
# strip punctuation and leading/trailing spaces from each unigram
# keep only unigrams with 2 or more characters
sentences = [[token.strip(string.punctuation).strip()
              for token in nltk.word_tokenize(doc.lower())
              if token not in string.punctuation and
              len(token.strip(string.punctuation).strip()) >= 2]
             for doc in data["text"]]
print(sentences[0:2])

Unnamed: 0,label,text
0,2,This is a little longer and more detailed than...
1,1,Only Michelle Branch save this album!!!!All gu...
2,2,"A surprisingly good book, given its inherently..."
3,2,"This is a wonderful, quiet and relaxing CD tha..."
4,1,The lights that I received are absolutely not ...


[['this', 'is', 'little', 'longer', 'and', 'more', 'detailed', 'than', 'the', 'first', 'two', 'books', 'in', 'the', 'series', 'however', 'have', 'enjoyed', 'each', 'new', 'aspect', 'of', 'the', 'exciting', 'fantasy', 'universe'], ['only', 'michelle', 'branch', 'save', 'this', 'album', 'all', 'guys', 'play', 'along', 'with', 'unenthusiastic', 'beat', 'even', 'karl']]


In [12]:
# Train your own word vectors using gensim

# gensim.models is the package for word2vec
# check https://radimrehurek.com/gensim/models/word2vec.html
# for detailed description

from gensim.models import word2vec
import logging
import pandas as pd

# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)

# min_count: words with total frequency lower than this are ignored
# vector_size: the dimension of each word vector
#              (this parameter was named `size` before gensim 4.0)
# window: context window, i.e. the maximum distance
#         between the current and predicted word
#         within a sentence
# workers: number of parallel worker threads used in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences,
                             min_count=5, vector_size=200,
                             window=5, workers=4)

2021-11-26 21:12:51,359 : INFO : collecting all words and their counts
2021-11-26 21:12:51,361 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-11-26 21:12:51,492 : INFO : PROGRESS: at sentence #10000, processed 711991 words, keeping 36968 word types
2021-11-26 21:12:51,657 : INFO : collected 55241 word types from a corpus of 1424289 raw words and 20000 sentences
2021-11-26 21:12:51,658 : INFO : Creating a fresh vocabulary
2021-11-26 21:12:51,729 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 12133 unique words (21.96375880233884%% of original 55241, drops 43108)', 'datetime': '2021-11-26T21:12:51.729906', 'gensim': '4.1.2', 'python': '3.7.6 (default, Jan  8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.6.0-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2021-11-26 21:12:51,730 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1361999 word corpus (95.62658982832838%% of orig
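
The log above reports that `min_count=5` keeps only about 22% of the word types but still covers about 96% of the tokens in the corpus. A rough sketch of that pruning step on a toy corpus is shown below; the sentences and the threshold of 2 are invented for illustration, and gensim's real vocabulary building does more than this.

```python
from collections import Counter

# toy corpus (hypothetical), already tokenized like `sentences` above
toy_sentences = [
    ["great", "sound", "great", "album"],
    ["great", "movie", "bad", "sound"],
    ["bad", "movie", "great", "cast"],
]
min_count = 2  # toy threshold; the exercise above uses min_count=5

# count how often each word type occurs in the whole corpus
counts = Counter(token for sentence in toy_sentences for token in sentence)

# keep word types whose total frequency is at least min_count
vocab = {word for word, n in counts.items() if n >= min_count}
dropped = set(counts) - vocab

kept_tokens = sum(n for word, n in counts.items() if word in vocab)
total = sum(counts.values())

print(sorted(vocab), sorted(dropped))
print(f"{len(vocab)} of {len(counts)} word types kept, "
      f"covering {kept_tokens / total:.0%} of tokens")
```

Rare words contribute few tokens, so dropping them shrinks the vocabulary a lot while removing little of the training text.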

In [13]:
# test word2vec model

print("Top 5 words similar to word 'sound'")
wv_model.wv.most_similar('sound', topn=5)

print("Top 5 words similar to word 'sound' but not relevant to 'film'")
wv_model.wv.most_similar(positive=['sound','music'], \
                         negative=['film'], topn=5)

print("Similarity between 'movie' and 'film':")
wv_model.wv.similarity('movie','film') 

print("Similarity between 'movie' and 'city':")
wv_model.wv.similarity('movie','city') 

print("Word does not match with others in the list of \
['sound', 'music', 'graphics', 'actor', 'book']:")
wv_model.wv.doesnt_match(["sound", "music", \
                          "graphics", "actor", "book"])

print("Word vector for 'movie':")
wv_model.wv['movie']

Top 5 words similar to word 'sound'


[('band', 0.7466944456100464),
 ('metal', 0.7309845089912415),
 ('production', 0.7275912165641785),
 ('vocals', 0.7178143858909607),
 ('beats', 0.7157869338989258)]

Top 5 words similar to word 'sound' but not relevant to 'film'


[('pop', 0.7688503265380859),
 ('rock', 0.7571867108345032),
 ('guitar', 0.7289978265762329),
 ('songs', 0.7192723751068115),
 ('invisible', 0.7177398800849915)]

Similarity between 'movie' and 'film':


0.92379814

Similarity between 'movie' and 'city':


0.036855552

Word does not match with others in the list of ['sound', 'music', 'graphics', 'actor', 'book']:


'book'

Word vector for 'movie':


array([-1.802832  ,  0.24956171, -0.11756705,  0.04693254, -1.5093597 ,
        0.5086073 ,  0.3910569 , -0.89214   ,  0.23388013, -1.2573065 ,
        0.6647229 ,  1.7413628 ,  0.81334066, -0.9406785 , -1.178873  ,
        0.68965447,  1.0267875 , -1.4296671 , -1.1222117 , -1.606071  ,
        1.7382059 ,  0.56719065,  0.04122837,  0.22017027, -0.3461662 ,
       -0.13749823, -0.9189834 , -1.0497375 ,  0.7365495 ,  0.09684153,
        0.30657044,  0.7274297 ,  0.5823868 ,  0.03434692, -1.6505245 ,
        2.875775  , -1.5738666 , -0.4178427 , -1.1819825 , -0.4639121 ,
        0.55049515, -1.1630489 ,  0.54714614, -0.92406774,  0.7725745 ,
        0.2697898 , -0.9988751 , -0.14581019, -0.04652194,  0.31032822,
        0.29798335, -0.36748606,  0.33740562, -0.04714237, -1.3902302 ,
       -1.067852  , -0.7876221 , -2.2594395 , -1.5850363 ,  1.1617929 ,
        1.4588189 , -0.04304981,  0.71149564, -0.55700463, -2.5268452 ,
       -1.3418169 ,  0.26387188,  0.10013806, -0.73470455,  0.39
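
The scores returned by `similarity` and `most_similar` are cosine similarities between word vectors. The sketch below computes that measure by hand on hypothetical 3-dimensional vectors (the numbers are invented, not taken from the trained model) to show why 'movie' and 'film' score high while 'movie' and 'city' score near zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical toy vectors, for illustration only
toy_vectors = {
    "movie": [0.9, 0.1, 0.0],
    "film":  [0.8, 0.2, 0.1],
    "city":  [0.0, 0.1, 0.9],
}

sim_movie_film = cosine_similarity(toy_vectors["movie"], toy_vectors["film"])
sim_movie_city = cosine_similarity(toy_vectors["movie"], toy_vectors["city"])

print(round(sim_movie_film, 3), round(sim_movie_city, 3))
```

Vectors pointing in nearly the same direction give a value close to 1, and nearly orthogonal vectors give a value close to 0, regardless of their lengths.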

### 1.2. Pretrained Word Vectors
- Google published pre-trained 300-dimensional vectors for 3 million words and phrases, trained on the Google News dataset (about 100 billion words): https://code.google.com/archive/p/word2vec/
- GloVe (Global Vectors for Word Representation): Pretrained word vectors from different data sources, provided by Stanford: https://nlp.stanford.edu/projects/glove/
- FastText by Facebook: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

In [15]:
# Exercise 1.2: Use pretrained word vectors

# download the bin file for pretrained word vectors
# from above links, e.g. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# Warning: the bin file is very big (over 2 GB)
# You need a machine with plenty of memory to load it

import gensim

# load the vectors stored in the binary word2vec format
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)




2021-11-26 21:16:27,886 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin
2021-11-26 21:17:01,517 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2021-11-26T21:17:01.517006', 'gensim': '4.1.2', 'python': '3.7.6 (default, Jan  8 2020, 13:42:34) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.6.0-x86_64-i386-64bit', 'event': 'load_word2vec_format'}


In [18]:
model.most_similar(positive=['women','king'], \
                      negative='man')

[('queen', 0.4827326238155365),
 ('queens', 0.466781347990036),
 ('kumaris', 0.4653734564781189),
 ('kings', 0.4558638632297516),
 ('womens', 0.422832190990448),
 ('princes', 0.4176960587501526),
 ('Al_Anqari', 0.41725507378578186),
 ('concubines', 0.4011078476905823),
 ('monarch', 0.3962482810020447),
 ('monarchy', 0.39430150389671326)]
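
The composition property from section 1.1 can be sketched with plain vector arithmetic. The example below uses hypothetical 2-dimensional toy vectors and a made-up helper `analogy`; it is a simplified stand-in for `most_similar`, which also normalizes the vectors before combining them. Note that the toy call passes both `positive` and `negative` as lists and uses 'woman', as in the analogy in section 1.1, whereas the recorded query above used 'women' and a bare string for `negative`, so its scores may differ from what the list form would give.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


# hypothetical toy vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}


def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # the query words themselves are excluded from the ranking
    exclude = set(positive) | set(negative)
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = analogy(["woman", "king"], ["man"], toy)
print(ranking)
```

With these toy values the query vector is [0.9, -1.0], which lies closest to the vector for 'queen'.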

### 1.3. How to use word vectors in classification?

`Convolutional Neural Network`
<img src="CNN.png" width ="100%">

`Recurrent Neural Network`

<img src="https://raw.githubusercontent.com/graviraja/100-Days-of-NLP/master/assets/images/applications/sentiment/simple.gif" width = "90%">
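
Both architectures above consume a document as a sequence of word vectors. A minimal sketch of that preparation step follows, assuming a small hypothetical vocabulary, invented 2-dimensional vectors, and made-up `<pad>` and `<unk>` entries for padding and unknown words; in practice the rows would come from a trained or pretrained model and the matrix would initialize an embedding layer.

```python
# hypothetical toy vectors standing in for trained word vectors
toy_vectors = {"good": [0.8, 0.1], "movie": [0.2, 0.9], "bad": [-0.7, 0.1]}
dim = 2

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens, mapped to zero vectors

# row i of the embedding matrix is the vector of the word with index i
index_to_word = [PAD, UNK] + sorted(toy_vectors)
word_to_index = {word: i for i, word in enumerate(index_to_word)}
embedding_matrix = ([[0.0] * dim, [0.0] * dim] +
                    [toy_vectors[word] for word in sorted(toy_vectors)])


def encode(tokens, max_len):
    """Map tokens to indices, truncating or padding to exactly max_len."""
    ids = [word_to_index.get(t, word_to_index[UNK]) for t in tokens][:max_len]
    return ids + [word_to_index[PAD]] * (max_len - len(ids))


ids = encode(["good", "movie", "overall"], max_len=5)

# the network input: one word vector per position in the padded sequence
sequence = [embedding_matrix[i] for i in ids]

print(ids)
print(sequence)
```

Every document becomes a fixed-length list of vectors, which is the shape a convolutional or recurrent layer expects as input.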

<img src="https://www.kdnuggets.com/images/cartoon-machine-learning-vacation.jpg" width='60%'>
