<a href="https://colab.research.google.com/github/fabnancyuhp/DEEP-LEARNING/blob/main/NOTEBOOKS/word_embedding_word_vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.<br>
Some NLP tasks include the following:
* Speech recognition: converting voice data into text data
* Part-of-speech tagging, also called grammatical tagging
* Named entity recognition, or NER: NER identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.
* Sentiment analysis: attempts to extract sentiment from text data.
* ....

When we deal with NLP in this course, we mostly handle text data. To apply algorithms such as random forests or neural networks to text data, we need to convert the text into a vector representation. The different types of word vector representation can be broadly classified into two categories:
* Frequency-based vectors
* Prediction-based vectors, called word embeddings

We show four types of frequency-based vectors:
* One-hot vector and count vector
* TF-IDF Vector
* Co-Occurrence Vector

**Frequency-based methods yield  sparse matrices.**<br>

In this chapter, we show several kinds of word embedding:
* Embedding Layer
* Word2Vec : Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:
    * Continuous Bag-of-Words, or CBOW model.
    * Continuous Skip-Gram Model.
* GloVe : Global Vectors for Word Representation

**Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space.**

# Corpus, dictionary and one-hot encoding
A corpus is a set of documents. A document could be a review, a tweet, a newspaper article, etc. The dictionary is the set of the unique words (tokens) that appear in the corpus.<br>
For example, 
* here we have a corpus with 3 documents.
   * D1 'le beagle est il un bon chien de compagnie'
   * D2 'le tour de france 2021 est maintenue'
   * D3 "l'euro 2020 se joue dans plusieurs pays"
* The dictionary is ['le', 'est', 'de', 'beagle', 'il', 'un', 'bon', 'chien', 'compagnie', 'tour', 'france', '2021', 'maintenue', "l'euro", '2020', 'se', 'joue', 'dans', 'plusieurs', 'pays']. 
* The vocabulary size is 20

For one-hot encoding, each word, or token, in the dictionary corresponds to one vector component. $O$ stands for one-hot. Below are some one-hot vectors. Since our dictionary has 20 words, each one-hot vector has 20 components.
* $O_{le}=O_{1}=[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]^{T}$
* $O_{ beagle}=O_{4}=[0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]^{T}$
* $O_{compagnie} = O_{9}=[0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]^{T}$
* $O_{pays} = O_{20} = [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]^{T}$


Here we list some drawbacks of one-hot encoding:
* High dimensionality: the number of dimensions is equal to the number of unique words in the corpus
* Sparsity: each vector has only one non-zero value
* One-hot encoding does not capture the meaning of words
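
As an illustration, the one-hot vectors listed above can be built by hand. This is a minimal sketch in plain Python; the helper name `one_hot_vector` is introduced here for illustration only, and the word order of the dictionary is the one given in the example above.

```python
# Dictionary from the example above (the order fixes the index of each word)
dictionary = ['le', 'est', 'de', 'beagle', 'il', 'un', 'bon', 'chien', 'compagnie',
              'tour', 'france', '2021', 'maintenue', "l'euro", '2020', 'se', 'joue',
              'dans', 'plusieurs', 'pays']

def one_hot_vector(word, dictionary):
    """Return the one-hot vector of `word` as a list of 0/1 values."""
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector

o_le = one_hot_vector('le', dictionary)
o_pays = one_hot_vector('pays', dictionary)
print(o_le)
print(o_pays)
# Each vector has as many components as the dictionary and a single non-zero value
print(len(o_pays), sum(o_pays))
```

Every vector has 20 components and only one of them is non-zero, which is exactly the high dimensionality and sparsity mentioned above.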

In [None]:
Corpus = ['le beagle est il un bon chien de compagnie',\
          'le tour de france 2021 est maintenue',\
         "l'euro 2020 se joue dans plusieurs pays"]

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Build the dictionary (word -> integer index) from the corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(Corpus)
# Convert documents into sequences of integer indices
sequence = tokenizer.texts_to_sequences(['le tour de france 2021 est maintenue','le beagle est il un bon chien de compagnie'])
print('dictionary : ',tokenizer.word_index ,'\n','\n', 'The vocabulary size is :', len(tokenizer.word_index.keys()),'\n')
print('sequence:', sequence)

# Count-Vector Matrix

Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus C. The N tokens form our dictionary, and the count vector matrix M has size D × N. Row i of the matrix M contains the frequency of each token in document di.

Consider the following corpus of 3 documents:

* D1 'Boxer is a German dog', 
* D2 'Bulldog is an English dog',
* D3 'Stellantis is a merger between PSA and FCA' 

This corpus has 13 unique tokens of at least two characters (the single-character word 'a' is dropped by the default tokenizer of scikit-learn's `CountVectorizer`, which is used below). They form the following dictionary: ['Boxer', 'Bulldog', 'English', 'FCA', 'German', 'PSA', 'Stellantis', 'an', 'and', 'between', 'dog', 'is', 'merger'].<br><br>
The count matrix M of size 3 × 13 is:
$$
\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
  \hline
   & Boxer & Bulldog & English &  FCA & German &  PSA& Stellantis& an& and& between& dog& is& merger\\
  \hline
  doc1 & 1& 0& 0& 0& 1& 0& 0& 0& 0& 0& 1& 1& 0 \\
  \hline
  doc2 & 0& 1& 1& 0& 0& 0& 0& 1& 0& 0& 1& 1& 0 \\
  \hline
  doc3 & 0& 0& 0& 1& 0& 1& 1& 0& 1& 1& 0& 1& 1\\
  \hline
\end{array}
$$

The result of a count-vector process on a corpus is a sparse matrix. Consider a corpus with 20,000 unique words: a single short document of, say, 40 words is represented by a row of 20,000 entries (one for each unique word) with at most 40 non-zero elements (and potentially far fewer if some words are repeated within those 40 words). This leaves a lot of zeros, and storing these sparse representations densely can take a large amount of memory.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

Corpus = ['Boxer is a German dog',\
         'Bulldog is an English dog',\
         'Stellantis is a merger between PSA and FCA']

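# min_df=1 (the default) keeps every token; recent scikit-learn versions reject an integer min_df=0
vectorizer = CountVectorizer(min_df=1, lowercase=False)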
vectorizer.fit(Corpus)
print('Dictionary:',vectorizer.vocabulary_,'\n')

Dictionary: {'Boxer': 0, 'is': 11, 'German': 4, 'dog': 10, 'Bulldog': 1, 'an': 7, 'English': 2, 'Stellantis': 6, 'merger': 12, 'between': 9, 'PSA': 5, 'and': 8, 'FCA': 3} 



In [2]:
#Count-Vector Matrix
vectorizer.transform(Corpus).toarray()

array([[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]])

# TF-IDF vectorization
TF-IDF stands for Term Frequency-Inverse Document Frequency.
TF-IDF is another frequency-based method, but it differs from count vectorization in that it takes into account the occurrence of a word not just in a single document but in the entire corpus.<br><br>
Common words like ‘is’, ‘the’, ‘a’, etc. tend to appear much more frequently than the words that are important to a document. For example, a document A on Lionel Messi is going to contain more occurrences of the word “Messi” than other documents. But common words like “the” are also going to be present with high frequency in almost every document.<br><br>
Ideally, we want to down-weight the common words occurring in almost all documents and give more importance to words that appear only in a subset of documents.<br>
TF-IDF works by penalising these common words by assigning them lower weights while giving importance to words like Messi in a particular document.

Consider the sample table below, which gives the count of terms (tokens/words) in two documents.
$$
\begin{array}{|c|c|c|c|}
  \hline
  \text{Document1 term} & \text{count} & \text{Document2 term} & \text{count} \\
  \hline
  this & 1 & this & 1 \\
  is & 1 & is & 2 \\
  about & 2 & about & 1 \\
  Messi & 4 & \text{TF-IDF} & 1 \\
  \hline
\end{array}
$$

Now, let us define a few terms related to TF-IDF. Document1 contains 8 terms in total (1 + 1 + 2 + 4) and Document2 contains 5 (1 + 2 + 1 + 1).
* TF = (number of times term t appears in a document) / (number of terms in the document)
* TF(This, Document2) = 1/5
* IDF = log(D/n), where D is the number of documents and n is the number of documents in which the term t appears. The logarithm is taken in base 10 here.
* So, IDF(This) = log(2/2) = 0.
* Let us compute IDF for the word ‘Messi’. IDF(Messi) = log(2/1) = 0.301.
* TF-IDF(This, Document1) = TF(This, Document1) * IDF(This) = (1/8) * 0 = 0
* TF-IDF(This, Document2) = TF(This, Document2) * IDF(This) = (1/5) * 0 = 0
* TF-IDF(Messi, Document1) = TF(Messi, Document1) * IDF(Messi) = (4/8) * 0.301 = 0.15



As you can see, for Document1 the TF-IDF method heavily penalises the word ‘This’ but assigns a greater weight to ‘Messi’. In other words, ‘Messi’ is an important word for Document1 in the context of the entire corpus.<br>
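
The worked example can be checked with a few lines of plain Python. This is a minimal sketch: the function names `tf`, `idf` and `tf_idf` are chosen here for illustration, the logarithm is base 10 to match the value 0.301, and `idf` assumes the term appears in at least one document.

```python
import math

# Term counts taken from the sample table (terms lowercased)
doc1 = {'this': 1, 'is': 1, 'about': 2, 'messi': 4}
doc2 = {'this': 1, 'is': 2, 'about': 1, 'tf-idf': 1}
corpus = [doc1, doc2]

def tf(term, doc):
    """Term frequency: count of the term divided by the number of terms in the document."""
    return doc.get(term, 0) / sum(doc.values())

def idf(term, corpus):
    """Inverse document frequency with a base-10 logarithm (term must occur somewhere)."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print('TF(this, Document2)       =', tf('this', doc2))
print('TF-IDF(this, Document1)   =', tf_idf('this', doc1, corpus))
print('TF-IDF(messi, Document1)  =', round(tf_idf('messi', doc1, corpus), 3))
```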


Consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus C. The N tokens form our dictionary = [word1, word2, ..., wordN]. The TF-IDF matrix M has size D × N, with
$$M(i,j)=\text{TF-IDF}(word_{j},Document_{i})=TF(word_{j},Document_{i})\times IDF(word_{j})=\frac{\text{number of times } word_{j} \text{ appears in } Document_{i}}{\text{number of words in } Document_{i}}\times\log\left(\frac{D}{n}\right)$$<br>
where n is the number of documents in which wordj appears.<br><br>

**The result of a TF-IDF process on a large corpus is a sparse matrix.**

Below is an example of computing a TF-IDF matrix using Python:

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

Corpus = ['Boxer is a German dog',\
         'Bulldog is an English dog',\
         'Stellantis is a merger between PSA and FCA']

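# min_df=1 (the default) keeps every token; recent scikit-learn versions reject an integer min_df=0
vectorizer = TfidfVectorizer(min_df=1, lowercase=False)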
vectorizer.fit(Corpus)
print('Dictionary:',vectorizer.vocabulary_,'\n')
print('TF-IDF matrix: ')
vectorizer.transform(Corpus).toarray()

Dictionary: {'Boxer': 0, 'is': 11, 'German': 4, 'dog': 10, 'Bulldog': 1, 'an': 7, 'English': 2, 'Stellantis': 6, 'merger': 12, 'between': 9, 'PSA': 5, 'and': 8, 'FCA': 3} 

TF-IDF matrix: 


array([[0.5844829 , 0.        , 0.        , 0.        , 0.5844829 ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.44451431, 0.34520502, 0.        ],
       [0.        , 0.50461134, 0.50461134, 0.        , 0.        ,
        0.        , 0.        , 0.50461134, 0.        , 0.        ,
        0.38376993, 0.29803159, 0.        ],
       [0.        , 0.        , 0.        , 0.39687454, 0.        ,
        0.39687454, 0.39687454, 0.        , 0.39687454, 0.39687454,
        0.        , 0.2344005 , 0.39687454]])
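
The values above do not follow the textbook formula given earlier. With its default settings, scikit-learn's `TfidfVectorizer` uses the raw count as TF, a smoothed IDF equal to ln((1 + D) / (1 + n)) + 1, and then rescales each row to unit Euclidean norm. The sketch below reproduces the matrix in plain Python under those assumptions; the helper names `smooth_idf` and `tfidf_row` are introduced here for illustration.

```python
import math

corpus = ['Boxer is a German dog',
          'Bulldog is an English dog',
          'Stellantis is a merger between PSA and FCA']

# Keep tokens of at least two characters, as the default scikit-learn tokenizer does
docs = [[tok for tok in doc.split() if len(tok) >= 2] for doc in corpus]
vocabulary = sorted({tok for doc in docs for tok in doc})
D = len(docs)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + D) / (1 + n)) + 1, with n the document frequency."""
    n = sum(1 for doc in docs if term in doc)
    return math.log((1 + D) / (1 + n)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, followed by L2 normalisation of the row."""
    raw = [doc.count(term) * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print([round(v, 4) for v in row])
```

Because of the normalisation, the weights of a row are only comparable within that row: the word 'is' receives the smallest non-zero weight in each document since it appears in all three.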

# Vector Representation/ Featurized representation


One-hot encoding has clear drawbacks:
* The inner product of two different one-hot vectors is 0
* The cosine similarity between the one-hot vectors of any two different words is 0
* The Euclidean distance between two different one-hot vectors is always the same ($\sqrt{2}$)

As a consequence:
* Since the cosine similarity between the one-hot vectors of any two different words is 0, it is difficult to use one-hot vectors to accurately represent the similarity between different words.
* One-hot encoding and bag-of-words models (i.e. using dummy variables to represent the presence or absence of a word in an observation such as a sentence) do not capture information about a word's meaning or context.
* One-hot encodings do not capture syntactic (structure) and semantic (meaning) relationships across collections of words and, therefore, represent language in a very naive way.<br>
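
These three properties are easy to verify numerically. The sketch below uses plain Python lists; the word indices (3, 7 and 9) are arbitrary positions in a 20-word dictionary, chosen only for illustration.

```python
import math

def one_hot(index, size):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

size = 20
o_a, o_b, o_c = one_hot(3, size), one_hot(7, size), one_hot(9, size)

print('inner product     :', dot(o_a, o_b))
print('cosine similarity :', cosine(o_a, o_b))
# The distance is the same whichever pair of distinct words is compared
print('distances         :', euclidean(o_a, o_b), euclidean(o_a, o_c))
```

Whatever pair of distinct words is chosen, the result is identical, so one-hot vectors carry no information about which words are related.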

In contrast, word vectors represent words as multidimensional vectors of continuous floating-point numbers, where semantically similar words are mapped to nearby points in geometric space. This means that words such as wheel and engine should have word vectors similar to that of the word car (because of the similarity of their meanings), whereas the word banana should be quite distant. The beauty of representing words as vectors is that they lend themselves to mathematical operators. **Word vectors are simply vectors of numbers that represent the meaning of a word**. For example, we can add and subtract vectors; the canonical example is that, using word vectors, we can determine that:

* **king - man + woman = queen**

<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/WORDEMBEDING/kingandkeen.png" title="Title Tag Goes Here" height="200" width="250" border="1px">
</center>

The numbers in the word vector represent the word's distributed weight across dimensions. In a simplified sense, each dimension represents a meaning and the word's numerical weight on that dimension captures the closeness of its association with and to that meaning. Thus, the semantics of the word are embedded across the dimensions of the vector.

A word vector is a featurized representation of a word. Here, we represent the words Man, Woman, King, Queen, Apple, Orange with the features Gender, Royal, Age, Food.

$$\begin{array}{c|cccccc|}
\hline
&Man & Woman & King & Queen & Apple & Orange\\
\hline
Gender & -1 & 1  &-0.95  & 0.97  &0.00  &0.01    \\
\hline
Royal &0.01 & 0.02 & 0.93 & 0.95 & -0.01 & 0.00\\
\hline
Age &  0.03 &0.02  &0.7  &0.69  &0.03 &-0.02      \\
\hline 
Food & 0.04 &0.01 &0.02 & 0.01 &0.95 & 0.97 \\
\hline
\end{array}$$
In the above matrix each column is the vector representation of a word. This matrix is called an embedding matrix, and each column vector is a word embedding.<br><br>
**Given the vectors of two words, we can determine their similarity**.
Apple and Orange are both fruits, so the cosine similarity between their word vectors should be close to 1. The word vectors of Apple and Orange are respectively $e_{Apple}=[0.00,-0.01,0.03,0.95]^{T}$ and $e_{Orange}=[0.01,0.00,-0.02,0.97]^{T}$, where $e$ stands for embedding. We have
$$sim(Apple,Orange)=\frac{e_{Apple}^{T}e_{Orange}}{\left\lVert e_{Apple} \right\rVert \left\lVert e_{Orange} \right\rVert }=0.99853041412809$$
Apple should be quite distant from King. We compute the cosine similarity between these two words:
$$sim(Apple,King)=\frac{e_{Apple}^{T}e_{King}}{\left\lVert e_{Apple} \right\rVert \left\lVert e_{King} \right\rVert }=0.021494708641488013$$
These two word vectors are close to orthogonal: the angle between them is about 88.8 degrees.

In [8]:
import numpy as np
ea = np.array([0.0, -0.01,0.03,0.95])
eo = np.array([0.01,0.00, -0.02,0.97])
print('sim(Apple,Orange)=',np.sum(ea*eo)/np.sqrt((np.sum(ea*ea)*np.sum(eo*eo))))
ek = np.array([-0.95,0.93, 0.7,0.02])
print('sim(Apple,King)=',np.sum(ea*ek)/np.sqrt((np.sum(ea*ea)*np.sum(ek*ek))))
print('angle between Apple and King=',np.arccos(np.sum(ea*ek)/np.sqrt((np.sum(ea*ea)*np.sum(ek*ek))))*180/np.pi,' degree')

sim(Apple,Orange)= 0.99853041412809
sim(Apple,King)= 0.021494708641488013
angle between Apple and King= 88.76834905881994  degree
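
The analogy king - man + woman can also be tried on the toy embedding matrix of this section. The sketch below is only an illustration on six hand-made vectors, not a trained embedding; excluding the three input words from the candidates is a common convention assumed here.

```python
import math

# Columns of the toy embedding matrix (features: Gender, Royal, Age, Food)
embeddings = {
    'Man':    [-1.00,  0.01,  0.03, 0.04],
    'Woman':  [ 1.00,  0.02,  0.02, 0.01],
    'King':   [-0.95,  0.93,  0.70, 0.02],
    'Queen':  [ 0.97,  0.95,  0.69, 0.01],
    'Apple':  [ 0.00, -0.01,  0.03, 0.95],
    'Orange': [ 0.01,  0.00, -0.02, 0.97],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed component by component
query = [k - m + w for k, m, w in zip(embeddings['King'], embeddings['Man'], embeddings['Woman'])]

# Rank the remaining words by cosine similarity to the query vector
candidates = [word for word in embeddings if word not in ('King', 'Man', 'Woman')]
best = max(candidates, key=lambda word: cosine(query, embeddings[word]))
print('king - man + woman is closest to:', best)
```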


# What Are Word Embeddings?
We give several definitions of word embedding:
* Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.
* A word embedding is a learned representation for text where words that have the same meaning have a similar representation.
* Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network.

Key to the approach is the idea of using a dense distributed representation for each word. Conceptually it involves the mathematical embedding from space with many dimensions per word to a continuous vector space with a much lower dimension.

# Embedding Layer : From one-hot to word embedding
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/<br>

In this section we use supervised learning to get vector representations of words from a corpus. We consider a corpus C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus C. Each document is a review, labeled as 0 (negative) or 1 (positive). The N tokens form our dictionary = {word1, ..., wordN}. The one-hot encoding vector $O_{i}$ of wordi is the N-dimensional vector whose components are all 0 except the ith one, which is equal to 1. We want to represent each dictionary word with a K-component vector:
* word1 vector is $e_{1}=e_{word1}=[w_{11},\dots,w_{K1}]^{T}$
* word2 vector is $e_{2}=e_{word2}=[w_{12},\dots,w_{K2}]^{T}$
* wordN vector is $e_{N}=e_{wordN}=[w_{1N},\dots,w_{KN}]^{T}$

Generally $K<N$ and the K-dimensional vector space is dense. For example, we could have N=10000 and K=300.
Each word is represented with K features corresponding to the K vector components. We introduce the K × N embedding matrix E:
$$\begin{array}{c|ccc|}
& word1&\dots \dots & wordN\\
\hline
feature1 & w_{11}&\dots \dots & w_{1N}\\
\vdots & \vdots & \vdots & \vdots \\
\vdots & \vdots & \vdots & \vdots \\
featureK & w_{K1}&\dots \dots & w_{KN}\\
\hline
\end{array}$$
The $w_{ij}$ are trainable parameters.

Let $\{O_{1},\dots,O_{N} \}$ be the set of the one-hot encodings of the dictionary. The one-hot encoding vector $O_{i}$ of wordi is the N-dimensional vector whose components are all 0 except the ith one, which is equal to 1. The matrix product of E and $O_{i}$ is equal to the embedding vector of wordi:
$$EO_{i}=e_{i} $$
The matrix E is randomly initialized. We train a neural network to determine whether or not a review is positive. The coefficients of the embedding matrix are fitted during the training stage of this neural network. After the neural network is trained, we get the word embeddings as a side effect. The review classification problem is therefore almost a pretext task: what we really care about is the word embeddings, in other words the matrix E.<br>
In this neural network, we vertically stack all the word embedding vectors of a document into a single vector. This step is called flattening. The flattening step produces a document vector.<br> 
The documents in the corpus do not all have the same length. In other words, the number of tokens differs from one document to another. To cope with this problem, we use a padding step to make all document vectors the same size.<br>
If the output layer is a softmax, $\hat{Y}$ is an $\mathbb{R}^{2}$ vector since we deal with a binary classification problem (equivalently, a single sigmoid unit can output the probability of the positive class, as in the Keras example below). For a multiclass classification problem, $\hat{Y}$ is an $\mathbb{R}^{n}$ vector where n is the number of class labels.
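
To make the steps above concrete, here is a minimal sketch in plain Python of the product $EO_{i}=e_{i}$, followed by padding and flattening. The matrix values are arbitrary, indices are 0-based (unlike the 1-based notation of the text), and reserving index 0 for padding is an assumption made for this illustration.

```python
# Toy embedding matrix E with K = 2 features (rows) and N = 4 words (columns);
# the values are arbitrary and only serve the illustration
E = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
K, N = len(E), len(E[0])

def one_hot(i, size):
    return [1 if j == i else 0 for j in range(size)]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def embed(i):
    """Direct lookup of column i of E, equivalent to E times the one-hot vector of i."""
    return [E[k][i] for k in range(K)]

# Multiplying E by the one-hot vector of word index 2 selects column 2 of E
e_2 = matvec(E, one_hot(2, N))
print(e_2, embed(2))

# A document is a sequence of word indices, padded to a common length, then flattened
document = [2, 3]
max_length = 3
padded = document + [0] * (max_length - len(document))
flattened = [value for i in padded for value in embed(i)]
print(padded)
print(flattened)  # document vector of size K * max_length
```

Since the product with a one-hot vector only selects a column, an embedding layer can store integer word indices and perform a lookup instead of an actual matrix multiplication.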

**Conclusion**<br>
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.

Below is an example of an embedding layer with Keras:

In [9]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding

In [10]:
reviews = ['nice food',
          'amazing restaurant',
          'too good',
          'just loved it!',
          'will go again',
          'horrible food',
          'never go there',
          'poor service',
          'poor quality',
          'needs improvement']

sentiment = np.array([1,1,1,1,1,0,0,0,0,0])  

In [11]:
vocab_size = 30
print("on_hot example",one_hot("amazing restaurant",vocab_size))
encoded_review = [one_hot(d,vocab_size) for d in reviews]
print(encoded_review)

#padding
max_length = 3
padding_reviews = pad_sequences(encoded_review ,maxlen=max_length,padding='post')
#print(padding_reviews)

#Embedding
embeded_vector_size = 4
model = Sequential()
model.add(Embedding(vocab_size,embeded_vector_size,input_length=max_length,name="embedding"))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))

on_hot example [19, 8]
[[12, 17], [19, 8], [18, 24], [13, 13, 8], [18, 27, 19], [2, 17], [5, 27, 12], [5, 10], [5, 26], [14, 8]]


In [12]:
X = padding_reviews
Y = sentiment

#Compile model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 4)              120       
                                                                 
 flatten (Flatten)           (None, 12)                0         
                                                                 
 dense (Dense)               (None, 1)                 13        
                                                                 
Total params: 133
Trainable params: 133
Non-trainable params: 0
_________________________________________________________________


In [13]:
model.fit(X,Y,epochs=50,verbose=0)

<keras.callbacks.History at 0x7f165b1108d0>

In [14]:
loss, accuracy = model.evaluate(X,Y)
accuracy



1.0

We have an interest in the embedding matrix. So we want to get the embedding layer weights.

In [16]:
#Embedding_matrix model.get_layer('embedding').get_weights()[0]
model.get_layer('embedding').get_weights()[0]

array([[-0.0385818 ,  0.01196041,  0.08330416, -0.04809228],
       [ 0.04928363,  0.03091284, -0.00218043, -0.03546967],
       [ 0.02897159, -0.08566714,  0.06147001, -0.06250183],
       [-0.0457347 ,  0.02563734, -0.00191243,  0.00735519],
       [-0.03447334,  0.04667535,  0.0179052 ,  0.04737124],
       [ 0.06568251, -0.05336633,  0.05199532, -0.00136059],
       [-0.02487051,  0.03086315,  0.04834208,  0.00569927],
       [-0.01147396,  0.0064138 ,  0.01900108, -0.03045193],
       [ 0.0212338 , -0.08549266, -0.09383223,  0.01544772],
       [ 0.0222734 ,  0.03541424, -0.04154212,  0.02096004],
       [ 0.03723327,  0.09083919, -0.07665687,  0.03710005],
       [ 0.01954024,  0.01567096, -0.01021005,  0.01254454],
       [-0.01008093,  0.02703698, -0.04052323,  0.01923491],
       [-0.01212789, -0.06453195, -0.06806654, -0.07130087],
       [ 0.02827326, -0.08486904,  0.09327632, -0.06985669],
       [ 0.04131815, -0.03515949, -0.0072787 ,  0.00013392],
       [ 0.01054634, -0.

In [17]:
len(model.get_layer('embedding').get_weights()[0])

30

**Word embedding of each word**: we pair each unique word of the reviews with its embedding vector.

In [18]:
# Unique words of the reviews
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews)
print("unique tokens:",tokenizer.word_index.keys())

unique tokens: dict_keys(['food', 'go', 'poor', 'nice', 'amazing', 'restaurant', 'too', 'good', 'just', 'loved', 'it', 'will', 'again', 'horrible', 'never', 'there', 'service', 'quality', 'needs', 'improvement'])


In [19]:
#Hashed integer index (Keras one_hot) of each unique word
one_hot_distinct_word = [one_hot(d,vocab_size) for d in tokenizer.word_index.keys()]
print(one_hot_distinct_word )
one_hot_distinct_word_bis = [o[0] for o in one_hot_distinct_word]
print(one_hot_distinct_word_bis) 

[[17], [27], [5], [12], [19], [8], [18], [24], [13], [13], [8], [18], [19], [2], [5], [12], [10], [26], [14], [8]]
[17, 27, 5, 12, 19, 8, 18, 24, 13, 13, 8, 18, 19, 2, 5, 12, 10, 26, 14, 8]
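
Despite its name, the Keras `one_hot` function returns integer indices obtained by hashing each word into `vocab_size` buckets, so two different words can receive the same index and therefore share the same row of the embedding matrix. The sketch below only regroups the words and indices printed above to make these collisions visible; it does not call Keras, and the indices may differ from one session to another since they depend on the hash function.

```python
from collections import defaultdict

# Words and hashed indices copied from the two outputs above
unique_tokens = ['food', 'go', 'poor', 'nice', 'amazing', 'restaurant', 'too', 'good',
                 'just', 'loved', 'it', 'will', 'again', 'horrible', 'never', 'there',
                 'service', 'quality', 'needs', 'improvement']
hashed_indices = [17, 27, 5, 12, 19, 8, 18, 24, 13, 13, 8, 18, 19, 2, 5, 12, 10, 26, 14, 8]

# Group the words that fall into the same bucket
buckets = defaultdict(list)
for word, index in zip(unique_tokens, hashed_indices):
    buckets[index].append(word)

collisions = {index: words for index, words in buckets.items() if len(words) > 1}
print(len(unique_tokens), 'words ->', len(buckets), 'distinct indices')
for index, words in sorted(collisions.items()):
    print(index, words)
```

This explains why 'just' and 'loved' get identical vectors in the output that follows. Choosing a `vocab_size` much larger than the number of unique words, or using the indices of `tokenizer.word_index`, are two possible ways to reduce or avoid such collisions.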


In [20]:
unique_token = list(tokenizer.word_index.keys())
Embedding_matrix = model.get_layer('embedding').get_weights()[0]
[(unique_token[i],Embedding_matrix[one_hot_distinct_word_bis[i]]) for i in range(0,len(one_hot_distinct_word_bis))]

[('food',
  array([-0.02545296,  0.0028329 ,  0.05190564, -0.07803503], dtype=float32)),
 ('go',
  array([ 0.07698561,  0.04366833, -0.06087789,  0.01205637], dtype=float32)),
 ('poor',
  array([ 0.06568251, -0.05336633,  0.05199532, -0.00136059], dtype=float32)),
 ('nice',
  array([-0.01008093,  0.02703698, -0.04052323,  0.01923491], dtype=float32)),
 ('amazing',
  array([-0.00072147,  0.00133491, -0.00295607,  0.01325878], dtype=float32)),
 ('restaurant',
  array([ 0.0212338 , -0.08549266, -0.09383223,  0.01544772], dtype=float32)),
 ('too',
  array([-0.06896301,  0.0536689 , -0.05900615,  0.06012291], dtype=float32)),
 ('good',
  array([-0.05770757, -0.00297162,  0.05293114, -0.0314495 ], dtype=float32)),
 ('just',
  array([-0.01212789, -0.06453195, -0.06806654, -0.07130087], dtype=float32)),
 ('loved',
  array([-0.01212789, -0.06453195, -0.06806654, -0.07130087], dtype=float32)),
 ('it',
  array([ 0.0212338 , -0.08549266, -0.09383223,  0.01544772], dtype=float32)),
 ('will',
  arra

# Word2Vec Embedding : CBOW model and Skip-Gram Model

Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus. Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:
* **Continuous Bag-of-Words, or CBOW model.**
* **Continuous Skip-Gram Model.**

In the following image, the word in the blue box is called the target word and the words in the white boxes are called context words in a size 5 window.<br>

<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/WORDEMBEDING/Capture-de%CC%81cran-2020-09-18-a%CC%80-09.27.50.png.webp" title="Title Tag Goes Here" height="300" width="450" border="1px">
</center>

* The **CBOW model** learns the embedding by predicting the current word based on its context.

* The continuous **skip-gram** model learns by predicting the surrounding words given a current word.

<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/WORDEMBEDING/Word2Vec-Training-Models.webp" title="Title Tag Goes Here" height="300" width="450" border="1px">
</center>
<center>
<title> Word2Vec Training Models<br>
Taken from “Efficient Estimation of Word Representations in Vector Space”, 2013 </title>          
</center>

**Skip-Gram model mathematical description**<br>
We have a corpus that leads to a dictionary of size N. In other words, we have N words in our vocabulary.<br>
We now describe in more detail the skip-gram algorithm where the context is just one randomly picked nearby word.<br>

Consider the following sentence:
* Machiavelli was an Italian Renaissance politician

First we randomly pick a word to be the context; let's say we choose the word politician. We then randomly choose another word within a window (± n words around the context word) to be the target; let's say we choose the word Italian. We set c = politician and t = Italian. Repeating this procedure gives new context-target pairs:
$$\begin{array}{c|c}
Context=c & Target=t\\
\hline
Italian  & Politician\\
Italian  & Renaissance\\
Renaissance & was\\
politician & Machiavelli
\end{array}$$

We set up a supervised learning problem where, given the context word, we are asked to predict the target word. The goal of setting up this supervised learning problem is not to do well on the problem itself. **In fact, we want to use this learning problem to learn good word embeddings.**
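
The construction of context-target pairs can be sketched as follows in plain Python. The function name `skipgram_pairs` and the window of 5 words are choices made for this illustration; the function enumerates every possible pair, and a training procedure would then sample from this list.

```python
import random

sentence = 'Machiavelli was an Italian Renaissance politician'
tokens = sentence.lower().split()

def skipgram_pairs(tokens, window):
    """All (context, target) pairs where the target lies within `window` words of the context."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

pairs = skipgram_pairs(tokens, window=5)
print(len(pairs), 'possible pairs')

random.seed(0)  # fixed seed so that the sample is reproducible
for context, target in random.sample(pairs, 4):
    print(context, '->', target)
```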

We want to learn a mapping from some context to some target: $Context=c\longrightarrow  Target=t$. Here are the details of the model:<br>

* $o_{c}\longrightarrow E\longrightarrow Eo_{c}=e_{c}\longrightarrow \underbrace{\theta}_{Softmax}\longrightarrow \hat{y}$
    * $o_{c}$ is the one-hot encoding of the input context word
    * $E$ is an embedding matrix
    * $e_{c}$ is the embedding vector for the input context word
    * $\hat{y}$ is a probability vector of size N, where N is the vocabulary size ($\hat{y}\in \mathbb{R}^{N}$)
    
   
The softmax model estimates the probability of  different target words t given the input context word c as
$$P(t|c)=\frac{\exp(\theta_{t}^{T}e_{c})}{\sum_{j=1}^{N}\exp(\theta_{j}^{T}e_{c})}$$
where $\theta_{t}$ is the parameter associated with the output target word t. $P(t|c)$ is the chance of a particular word t being the label given the context word c.<br>
The loss function for the softmax is the negative log-likelihood  
$$\mathcal{L}(\hat{y},y)=-\sum_{i=1}^{N}y_{i}\log(\hat{y}_{i})$$
where $\hat{y}_{i}=\frac{\exp(\theta_{i}^{T}e_{c})}{\sum_{j=1}^{N}\exp(\theta_{j}^{T}e_{c})}$ and $y$ is a one_hot vector.
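
The softmax probabilities and the loss can be computed on a tiny example. In the sketch below, N = 4 and K = 2, and the values of `E` and `theta` are arbitrary numbers chosen for illustration (in practice both are learned). Subtracting the maximum score before exponentiating is a standard numerical precaution that does not change the probabilities.

```python
import math

# Toy setting: N = 4 words, K = 2 embedding dimensions (arbitrary illustrative values)
E = [[0.2, -0.4,  0.1, 0.5],   # K x N embedding matrix, column c is e_c
     [0.7,  0.3, -0.6, 0.0]]
theta = [[ 0.1,  0.3],         # one K-dimensional parameter vector theta_j per output word
         [-0.2,  0.8],
         [ 0.5, -0.5],
         [ 0.0,  0.4]]

def softmax_probabilities(c):
    """Return the vector of P(t | c) for every target word t."""
    e_c = [row[c] for row in E]
    scores = [sum(t * e for t, e in zip(theta_j, e_c)) for theta_j in theta]
    m = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def loss(c, t):
    """Negative log-likelihood when y is the one-hot vector of the target word t."""
    return -math.log(softmax_probabilities(c)[t])

y_hat = softmax_probabilities(c=0)
print('y_hat =', [round(p, 4) for p in y_hat], 'sum =', round(sum(y_hat), 6))
print('loss(c=0, t=2) =', loss(c=0, t=2))
```

Since $y$ is one-hot, the sum in the loss reduces to the single term $-\log(\hat{y}_{t})$. The denominator sums over all N words, which is costly for a large vocabulary.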

# GloVe word vectors
GloVe stands for Global Vectors for Word Representation. GloVe is based on the word-word co-occurrence matrix denoted by $X$.
* $X_{i,j}$ tabulates the number of times word j occurs in the context of word i.
* $X_{i}=\sum_{k}X_{i,k}$ is the number of times any word appears in the context of word i.

The probability that word j appears in the context of word i is $P_{i,j}=P(j|i)=\frac{X_{i,j}}{X_{i}}$

The idea behind the word-word co-occurrence matrix is that similar words tend to occur together and to have similar contexts. For example, apple and mango are fruits, so they tend to appear in similar contexts, i.e. around fruit-related words. We have to clarify two concepts: co-occurrence and context.<br><br>
**Co-occurrence:** For a given corpus, the co-occurrence of a pair of words is the number of times they have appeared together in a Context.<br>
<br> **Context:**  
In the following sentence, the green words are in a size 5 context window for the word ‘Fox’ and for calculating the co-occurrence only these words will be counted.
<h1><font color="green">Quick Brown</font><font color="red"> Fox </font><font color="green">Jump Over </font>The Lazy Dog</h1>


Let us see context window for the word ‘Over’.

<h1>Quick Brown<font color="green"> Fox Jump</font><font color="red"> Over </font><font color="green">The Lazy </font>Dog</h1>

Now, let us take an example corpus to calculate a word-word co-occurrence matrix. Corpus = He is not lazy. He is intelligent. He is smart.
$$\begin{array}{c|c|c|c|c|c|c|}
\hline
 &He & is & not & lazy & intelligent & smart\\
 \hline
 He & 0 & 4 & 2 & 1 & 2 & 1 \\
 \hline
 is & 4& 0 & 1 & 2 & 2 & 1\\
 \hline
 not & 2& 1 & 0 & 1 &0 &0\\
 \hline
 lazy &1 &2 &1 & 0 &0 &0\\
 \hline
 intelligent &2 &2 & 0 & 0 & 0 & 0 \\
 \hline
 smart & 1& 1 &0 &0 & 0& 0 \\
 \hline
\end{array}
$$
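
The matrix above can be reproduced with a short script. The sketch below assumes a context of two words on each side of the current word and ignores sentence boundaries, which is the setting that yields the counts of the table; the function name `cooccurrence` is chosen for illustration.

```python
from collections import defaultdict

corpus = 'He is not lazy. He is intelligent. He is smart.'
tokens = corpus.replace('.', '').split()

def cooccurrence(tokens, window):
    """X[i][j] = number of times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[word][tokens[j]] += 1
    return X

X = cooccurrence(tokens, window=2)
vocabulary = ['He', 'is', 'not', 'lazy', 'intelligent', 'smart']
for word in vocabulary:
    print(word.ljust(12), [X[word][other] for other in vocabulary])

# P(j | i) = X_ij / X_i, for instance the probability that 'is' appears in the context of 'He'
X_he = sum(X['He'].values())
print('P(is | He) =', X['He']['is'] / X_he)
```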

**Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and
that we are free to exchange the two roles.**

The ratio $\frac{P_{i,k}}{P_{j,k}}$ depends on three words i, j, and k, so the most general model takes the form
$$F(w_{i},w_{j},\tilde{w}_{k})=\frac{P_{i,k}}{P_{j,k}}$$
where $w \in \mathbb{R}^{d}$ are word vectors and $\tilde{w} \in \mathbb{R}^{d}$ are separate context word vectors. In this equation, the right-hand side is extracted from the corpus, and F may depend on some as-of-yet unspecified parameters. We would like F to encode the information present in the ratio $\frac{P_{i,k}}{P_{j,k}}$ in the word vector space. Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. This leads to the following equation:
$$F(w_{i},w_{j},\tilde{w}_{k})=F(w_{i}-w_{j},\tilde{w}_{k})=\frac{P_{i,k}}{P_{j,k}}$$
Next, we note that the arguments of F are vectors while the right-hand side is a scalar. We can therefore first
take the dot product of the arguments:
$$F(w_{i}-w_{j},\tilde{w}_{k})=F\left((w_{i}-w_{j})^{T}\tilde{w}_{k}\right)=\frac{P_{i,k}}{P_{j,k}}$$


Next, note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and
that we are free to exchange the two roles. To do so consistently, we must not only exchange $w\leftrightarrow \tilde{w}$ but also $X\leftrightarrow  X^{T}$ . We require that F be a homomorphism between the groups $(\mathbb{R},+)$ and $(\mathbb{R}_{>0},\times)$,
$$F\left((w_{i}-w_{j})^{T}\tilde{w}_{k}\right)=\frac{F(w_{i}^{T}\tilde{w}_{k})}{F(w_{j}^{T}\tilde{w}_{k})}=\frac{P_{i,k}}{P_{j,k}}$$
then 
$$F(w_{i}^{T}\tilde{w}_{k})=P_{i,k}=\frac{X_{i,k}}{X_{i}}$$
Since $\exp((w_{i}-w_{j})^{T}\tilde{w}_{k})=\frac{\exp(w_{i}^{T}\tilde{w}_{k})}{\exp(w_{j}^{T}\tilde{w}_{k})}$, we conclude that $F=\exp$ and
$$\exp(w_{i}^{T}\tilde{w}_{k})=P_{i,k}=\frac{X_{i,k}}{X_{i}} \Leftrightarrow w_{i}^{T}\tilde{w}_{k}=\ln(P_{i,k})=\ln(X_{i,k})-\ln(X_{i})$$

We note that the above equation would exhibit the exchange symmetry if not for the $\ln(X_{i})$ on the right-hand side. However, this term is independent of k so it can be absorbed into a bias $b_{i}$ for $w_{i}$. Finally, adding an additional bias $\tilde{b_{k}}$ for $\tilde{w_{k}}$ restores the symmetry,
$$w_{i}^{T}\tilde{w}_{k}+b_{i}+\tilde{b_{k}}=\ln(X_{i,k}) $$

We can cast the above equation as a least squares problem and introduce a weighting function $f(X_{i,j})$ into the cost function:
$$J=\sum_{i,j=1}^{V}f(X_{i,j})\left(w_{i}^{T}\tilde{w}_{j}+b_{i}+\tilde{b_{j}}-\ln(X_{i,j})\right)^{2} $$
* The goal of the weighting function f is to keep very frequent co-occurrences from dominating the cost; since $f(0)=0$, pairs that never co-occur (for which $\ln(X_{i,j})$ is undefined) do not contribute
* V is the size of the vocabulary

We define $f$ as follows:
$$f(x)=\left\{\begin{array}{cc}
(x/x_{max})^{\alpha} & \text{if $x<x_{max}$}\\
1 & \text{otherwise}
\end{array}
\right.$$

In conclusion, the GloVe algorithm finds a set of word vectors $\{w_{1},\dots,w_{V}\}$ (together with the context vectors $\tilde{w}_{j}$ and the biases) that minimize the cost function $\sum_{i,j=1}^{V}f(X_{i,j})\left(w_{i}^{T}\tilde{w}_{j}+b_{i}+\tilde{b_{j}}-\ln(X_{i,j})\right)^{2}$.
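
As an illustration, here is a minimal sketch of the weighting function and of the weighted least squares cost, in which each squared error is multiplied by $f(X_{i,j})$. The values $x_{max}=100$ and $\alpha=3/4$ are the ones reported in the GloVe paper and are taken here as an assumption; the 2 × 2 matrix `X` and the zero-initialised vectors are toy values. Only the cost is evaluated, no training is performed.

```python
import math

X_MAX = 100.0   # assumption: value reported in the GloVe paper
ALPHA = 0.75    # assumption: value reported in the GloVe paper

def f(x):
    """Weighting function: grows as (x / x_max)^alpha and is capped at 1."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def glove_cost(X, w, w_tilde, b, b_tilde):
    """Weighted least squares cost over all pairs with a non-zero co-occurrence count."""
    cost = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # f(0) = 0, so zero counts contribute nothing (and ln(0) is undefined)
                prediction = sum(a * c for a, c in zip(w[i], w_tilde[j])) + b[i] + b_tilde[j]
                cost += f(x_ij) * (prediction - math.log(x_ij)) ** 2
    return cost

X = [[0, 4],
     [4, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all parameters at zero, every non-zero entry contributes f(X_ij) * ln(X_ij)^2
cost_zero = glove_cost(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])
# If the biases alone reproduce ln(4), the fit is perfect and the cost vanishes
cost_fit = glove_cost(X, zeros, zeros, [math.log(4), math.log(4)], [0.0, 0.0])
print(f(0), f(50), f(100), f(200))
print(cost_zero, cost_fit)
```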

# Exercise

In [25]:
import pandas as pd
url = "https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/DATA/text_for_embedding.parquet.brotli"
text_for_embedding = pd.read_parquet(url)

text_for_embedding.head(7)

Unnamed: 0,class,text
12775,1,Common sense is prevailing in Brexit negotiati...
930,1,"Paul Manafort, the indicted former campaign ma..."
4467,1,U.S. Representative Mark Walker said after a m...
8653,1,Hungarian Prime Minister Viktor Orban on Satur...
21544,0,Way to go Granny! Perfect timing for your anno...
13398,0,The DNC Action Committee announced on Facebook...
14975,0,The old reliable stone most commonly used by a...


We lowercase the text and remove URLs, e-mail addresses, punctuation and digits.

In [37]:
import re

def preprocess_text(x):
    # Lowercase first so that the lowercase-only e-mail pattern also matches upper-case addresses
    new_text = x.lower()
    # Remove URLs and e-mail addresses before punctuation; otherwise '://', '.' and '@' are already gone
    new_text = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , new_text)
    new_text = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", new_text)
    # Remove punctuation, then digits
    new_text = re.sub(r'[^\w\s]', '', new_text)
    new_text = re.sub(r'[0-9]', '', new_text)
    return new_text

text_for_embedding['text'] = text_for_embedding['text'].apply(preprocess_text)
