***Reference:***

***Ganegedara, Thushan. Natural Language Processing with TensorFlow: The definitive NLP book to implement the most sought-after machine learning models and tasks, 2nd Edition. Packt Publishing.***

# **Chapter 3: Word2Vec - Word Embeddings**

Word2Vec is a technique for learning numerical representations (vectors) of the words/tokens in a text corpus. These vectors capture the semantic and contextual information that each word carries.

For example, the words *forest* and *oven* should have very different vector representations, as they are rarely used in similar contexts, while the vectors for *forest* and *jungle* should be very similar.

This chapter covers this information through the following main topics:
- What is a word representation or meaning?
- Classical approaches to learning word representations
- Word2vec — a neural network-based approach to learning word representation
- The skip-gram algorithm
- The Continuous Bag-of-Words algorithm

## 1. What is a word representation or meaning?

What is *word meaning*? *Meaning* is the idea conveyed by, or some representation associated with, a word.

To achieve this, we will use algorithms that can analyze a given text corpus and come up with good numerical representations of words (that is, word embeddings) such that words that fall within similar contexts (for example, one and two, I and we) will have similar numerical representations compared to words that are unrelated (for example, cat and volcano).

## 2. Classical approaches to learning word representations

- One-Hot Encoding
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Co-occurrence matrix

### 2.1 One-Hot Encoding
One-hot encoding is also known as a localist representation (as opposed to a distributed representation), because the feature representation is decided by the activation of a single element in the vector.
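
To make the "single active element" idea concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper name `one_hot` are assumptions made up for this illustration; they do not come from the book.

```python
def one_hot(word, vocabulary):
    """Return a list with a single 1 at the index of `word` and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ["cat", "forest", "jungle", "oven"]

forest_vec = one_hot("forest", vocabulary)
jungle_vec = one_hot("jungle", vocabulary)

# The dot product of two different one-hot vectors is always 0,
# so this representation carries no notion of similarity between words.
similarity = sum(a * b for a, b in zip(forest_vec, jungle_vec))
print(forest_vec, jungle_vec, similarity)
```

Even though *forest* and *jungle* are related, their one-hot vectors are as unrelated as any other pair, which is the limitation the later sections try to address.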

### 2.3 TF-IDF Method

- **TF-IDF is a frequency-based method that takes into account the frequency with which a word appears in a corpus. This is a word representation in the sense that it represents the importance of a specific word in a given document. Intuitively, the higher the frequency of the word, the more important that word is in the document.**

    - For example, in a document about cats, the word cats will appear more often than in a document that isn't about cats. 
    
    - However, just calculating the frequency would not work, because words such as *this* and *is* are very frequent in documents but do not contribute much information. TF-IDF takes this into consideration and gives values near zero for such common words.
    

- Again, **TF** stands for **term frequency** and **IDF** stands for **inverse document frequency:**

    - $TF(w_i) = \large{\frac{\text{No. of times } w_i \text{ appears in the document}}{\text{Total No. of words in the document}}}$
    
    - $IDF(w_i) = \large{\log\left(\frac{\text{Total No. of documents}}{\text{No. of documents with } w_i \text{ in them}}\right)}$
    
    - $\text{TF-IDF}(w_i) = TF(w_i) \times IDF(w_i)$
    
- Under this scoring, the word **"cats"** is informative, while **"this"** is not. This is the behavior we want when measuring the importance of words.
<div align="center">
    <img src="images/tfidf.png"/>
</div>
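
The TF-IDF computation can be sketched in a few lines of plain Python. This sketch assumes the common variant in which IDF is the logarithm of the document ratio, so that a word present in every document scores exactly 0. The two toy documents and the function names are hypothetical and are not taken from the book.

```python
import math

def tf(word, document):
    """Term frequency: share of the document's tokens equal to `word`."""
    return document.count(word) / len(document)

def idf(word, documents):
    """Inverse document frequency, using the log of the document ratio.

    Assumes `word` occurs in at least one document.
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    """TF-IDF score of `word` within one `document` of the collection."""
    return tf(word, document) * idf(word, documents)

# Hypothetical, already tokenized toy documents.
doc_1 = "this is about cats cats and kittens".split()
doc_2 = "this is about dogs".split()
documents = [doc_1, doc_2]

cats_score = tf_idf("cats", doc_1, documents)
this_score = tf_idf("this", doc_1, documents)
print(cats_score, this_score)
```

In this toy setup *this* appears in both documents, so its IDF (and therefore its TF-IDF) is 0, while *cats* appears in only one document and gets a positive score.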

### 2.4 Co-occurrence matrix

A co-occurrence matrix, unlike the one-hot encoded representation, encodes the context information of words, but requires maintaining a $V \times V$ matrix, where $V = \text{vocabulary size}$.

To understand the co-occurrence matrix, let's take two sentences:
- *Jerry and Mary are friends.*
- *Jerry buys flowers for Mary.*

The co-occurrence matrix will look like the following. Note that it is symmetric:
<div align="center">
    <img src="images/co_occ_matrix.png"/>
</div>
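
A co-occurrence matrix for the two sentences can be built with a short counting loop. The sketch below assumes a context window of one word on each side (the notes do not state the window size behind the figure) and a simple whitespace tokenizer; the function name `co_occurrence` is hypothetical.

```python
from collections import defaultdict

def co_occurrence(sentences, window=1):
    """Count how often each ordered (word, neighbour) pair falls within `window`."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.rstrip(".").split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(target, tokens[j])] += 1
    return dict(counts)

sentences = ["Jerry and Mary are friends.", "Jerry buys flowers for Mary."]
counts = co_occurrence(sentences, window=1)

# Vocabulary in order of first appearance.
vocab = []
for sentence in sentences:
    for token in sentence.rstrip(".").split():
        if token not in vocab:
            vocab.append(token)

# V x V matrix: rows and columns follow the order of `vocab`.
matrix = [[counts.get((row, col), 0) for col in vocab] for row in vocab]
for word, row in zip(vocab, matrix):
    print(f"{word:>8}", row)
```

The matrix has one row and one column per vocabulary word, so its size grows quadratically with the vocabulary, which is the maintenance cost mentioned above.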

## 3. Word2Vec - Intuition

**Syntax is the grammatical structure of the text, whereas Semantics is the meaning being conveyed.**

To understand the distinction, see:
- [Semantic & Syntactic Analysis - Blog-1](https://www.gnani.ai/resources/blogs/semantic-analysis-v-s-syntactic-analysis-in-nlp/)
- [Syntactic & Semantic Analysis - Blog-2](https://builtin.com/data-science/introduction-nlp)

- **Word2vec is a groundbreaking approach that allows computers to learn the meaning of words without any human intervention. Also, Word2vec learns numerical representations of words by looking at the words surrounding a given word.**

    - The statement above can be understood through the following example: "Mary is a very stubborn child. Her *pervicacious* nature always gets her in trouble."

    - We might not know what *pervicacious* means, but by looking at the words that surround it, such as *stubborn*, *nature*, and *trouble*, we can infer that *pervicacious* in fact means the state of being stubborn.

### 3.1 Basics of Word2vec

- As already mentioned, **Word2vec learns the meaning of a given word by looking at its context and representing it numerically.**

    - **context** means a fixed number of words in front of and behind the word of interest.
    

- Now, if we want to find a good algorithm that is capable of learning word meanings, **given a word, our algorithm should be able to predict the context words correctly.** 

    - This means that given a word $w_i$, the probability of the *surrounding/context* words should be **high**: $$\large{P(w_{i-m}, \cdots, w_{i-1}, w_{i+1}, \cdots, w_{i+m} \mid w_i) = \prod_{\substack{j=i-m \\ j \neq i}}^{i+m} P(w_j \mid w_i)}$$
    
    - To arrive at the right-hand side of the equation, we need to assume that given the target word $(w_i)$, the context words are independent of each other (for example, $w_{i-2}$ and $w_{i-1}$ are independent). Though not entirely true, this approximation makes the learning problem practical and works well in practice. 
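
As a concrete instance of the factorization above, taking $m = 2$ (a value chosen here only for illustration), the product expands into four terms, one per context position:

$$P(w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} \mid w_i) = P(w_{i-2} \mid w_i)\, P(w_{i-1} \mid w_i)\, P(w_{i+1} \mid w_i)\, P(w_{i+2} \mid w_i)$$

Because the logarithm is monotonic, maximizing this product is equivalent to maximizing the sum of the corresponding log-probabilities, $\sum_{j \neq i} \log P(w_j \mid w_i)$, which is usually the more convenient form to optimize.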
    
Let's go through an example to understand the computations.

**Exercise: does "queen = king - he + she"?** See the book for the explanation.


**In short, maximizing the above probability leads to finding good meanings (or representations) of words, i.e., the semantic structure.**

## 4. The Skip-gram Algorithm

**The skip-gram algorithm is an algorithm that exploits the context of the words in a written text to learn good word embeddings.**

### 4.1 Data Prep.: From raw text to semi-structured text

First, we need to design a mechanism to extract a dataset that can be fed to our learning model. **Such a dataset should be a set of tuples of the format (target, context)**. Moreover, this needs to be created in an unsupervised manner. 

In summary, the data preparation process should do the following:
- Capture the surrounding words (context) of a given word
- Run in an unsupervised manner

The skip-gram model uses the following approach to design a dataset:
1. For a given word $w_i$, a context window size of $m$ is assumed.

    - By **context window size**, we mean the number of words considered as context on either side of the target word.
    
    - So, for a word $w_i$, the context window (including the target word $w_i$) will be of size $2m+1$: $[w_{i-m}, \cdots, w_{i-1}, w_i, w_{i+1}, \cdots, w_{i+m}]$.
    
2. Next, **(target, context)** tuples are formed as: $[\cdots, (w_i, w_{i-m}), \cdots, (w_i, w_{i-1}), (w_i, w_{i+1}), \cdots, (w_i, w_{i+m}), \cdots]$; here, $m+1 \leq i \leq N-m$, and $N$ is the number of words in the text corpus.

For example, with a context window size of $m = 1$ and the sentence:
> The dog barked at the mailman.

For this example, the dataset would be as follows:
> [(dog, The), (dog, barked), (barked, dog), (barked, at), ..., (the, at), (the, mailman)]

Once the data is in the (target, context) format, we can use a neural network to learn the word embeddings.
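
The dataset construction described above can be sketched as a small Python function. It follows the constraint $m+1 \leq i \leq N-m$, so only words with a full window on both sides become targets. The function name `skip_gram_pairs` and the whitespace tokenization are assumptions made for this illustration.

```python
def skip_gram_pairs(tokens, window_size=1):
    """Build (target, context) tuples for targets that have a full window."""
    pairs = []
    # 0-based positions window_size .. N - window_size - 1 correspond to
    # the 1-based range m+1 <= i <= N-m used in the text.
    for i in range(window_size, len(tokens) - window_size):
        for offset in range(-window_size, window_size + 1):
            if offset != 0:
                pairs.append((tokens[i], tokens[i + offset]))
    return pairs

tokens = "The dog barked at the mailman".split()
pairs = skip_gram_pairs(tokens, window_size=1)
print(pairs)
```

No labels or manual annotation are involved: the pairs come directly from word positions in the raw text, which is what makes the procedure unsupervised.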

### 4.2 Understanding Skip-Gram Algorithm

#### Variables and Notations to learn the word embeddings


- To store the embeddings, we need two $V \times D$ matrices, where $V$ is the vocabulary size and $D$ is the dimensionality of the word embeddings (i.e., the number of elements in the vector that represents a single word).

- **D** is a hyperparameter. The higher **D** is, the more expressive the word embeddings learned will be. 

- **We need two matrices, one to represent the context words and one to represent the target words.** 
    - These matrices will be referred to as the **context embedding space (or context embedding layer)** and,
    - the **target embedding space (or target embedding layer)**, or in general as the embedding space (or the embedding layer).


Each word will be represented with a unique ID in the range [1, V + 1]. These IDs are passed to the embedding layer to look up the corresponding vectors. To generate these IDs, we will use a special object called a `Tokenizer` that's available in TensorFlow.

- Let's refer to an example target-context tuple $(w_i, w_j)$, where the target word ID is $w_i$, and one of the context words is $w_j$.

- The corresponding target embedding of $w_i$ is $t_i$, and the corresponding context embedding of $w_j$ is $c_j$. 

- Each target-context tuple is accompanied by a label (0 or 1), denoted by $y_i$, 
    - where true target-context pairs will get a label of 1, and
    
    - negative (or false) target-context candidates will get a label of 0. 
    - It is easy to generate negative target-context candidates by sampling a word that does not appear in the context of a given target as the context word. We will talk about this in more detail later.
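
A rough sketch of how labeled data points could be assembled is shown below. It assumes negatives are drawn uniformly at random from words that are not observed contexts of the target; the notes defer the actual sampling strategy to a later discussion, so this is only one possible choice. The helper `negative_candidates` and the toy vocabulary are hypothetical.

```python
import random

def negative_candidates(target, true_contexts, vocabulary, k, rng):
    """Sample k (target, word) pairs where word is not an observed context."""
    pool = [w for w in vocabulary if w != target and w not in true_contexts]
    return [(target, rng.choice(pool)) for _ in range(k)]

vocabulary = ["The", "dog", "barked", "at", "the", "mailman"]
true_contexts = {"dog", "at"}   # contexts of "barked" with window size 1
rng = random.Random(0)          # fixed seed so runs are repeatable

# True pairs get label 1, sampled negative candidates get label 0.
positives = [(("barked", context), 1) for context in sorted(true_contexts)]
negatives = [
    (pair, 0)
    for pair in negative_candidates("barked", true_contexts, vocabulary, k=2, rng=rng)
]
dataset = positives + negatives
print(dataset)
```

Each element of `dataset` has the form `((target, context), label)`, matching the labeled target-context tuples described above.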
    
* **

At this point, we have defined the necessary variables. 

- Next, for each input $w_i$, we look up the corresponding embedding vector in the target embedding layer. This operation provides us with $t_i$, which is a D-sized vector (i.e., a D-long embedding vector). 

- We do the same for the input $w_j$, using the context embedding space to retrieve $c_j$.

- Afterward, we calculate the prediction output for $(w_i, w_j)$ using the following transformation:
$$\large{logit(w_i, w_j) = t_i \cdot c_j}$$

$$\large{\hat{y}_{ij} = sigmoid(logit(w_i, w_j))}$$

- Here, $logit(w_i, w_j)$ represents the unnormalized score (i.e., the logit),

- $\hat{y}_{ij}$ is a single-valued predicted output (representing the probability of the context word belonging in the context of the target word).


<div align="center">
    <img src="images/skipgram_1.png"/>
</div>
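
The lookup, dot product, and sigmoid steps can be sketched in plain Python. This sketch assumes the target vector is read from the target embedding matrix and the context vector from the context embedding matrix, matching the definitions of $t_i$ and $c_j$ given earlier. The sizes `V = 6` and `D = 4`, the random initialization, and the use of 0-based row indices in place of tokenizer IDs are simplifications chosen only for illustration.

```python
import math
import random

def sigmoid(x):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(target_id, context_id, target_emb, context_emb):
    """Return (logit, probability) for one (target, context) pair."""
    t = target_emb[target_id]      # D-sized target vector
    c = context_emb[context_id]    # D-sized context vector
    logit = sum(a * b for a, b in zip(t, c))
    return logit, sigmoid(logit)

V, D = 6, 4                        # toy vocabulary size and embedding size
rng = random.Random(0)

# Two V x D matrices with small random values, one per embedding space.
target_emb = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
context_emb = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]

logit, y_hat = predict(2, 1, target_emb, context_emb)
print(logit, y_hat)
```

Before training, the embeddings are random, so `y_hat` carries no meaning yet; learning adjusts both matrices so that true pairs score close to 1 and negative candidates close to 0.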

* **

- Using both the existing and derived entities, we can now use the cross-entropy loss function to calculate the loss for a given data point $[(w_i, w_j), y_i]$.

- **The Conceptual Skip-gram Model**

<div align="center">
    <img src="images/skipgram_2.png"/>
</div>

- **The implementation of the skip-gram model**

<div align="center">
    <img src="images/skipgram_3.png"/>
</div>
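
The cross-entropy loss mentioned earlier for a single labeled data point can be written out directly. The sketch below uses the standard binary cross-entropy, $-[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$; the function name and the small clipping constant `eps` are assumptions added here to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one data point with label y_true (0 or 1) and prediction y_pred."""
    # Clip the prediction away from exactly 0 or 1 so the logs stay finite.
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

loss_good = binary_cross_entropy(1, 0.9)   # true pair, confident and correct
loss_bad = binary_cross_entropy(1, 0.1)    # true pair, confident and wrong
loss_neg = binary_cross_entropy(0, 0.1)    # negative candidate, correct
print(loss_good, loss_bad, loss_neg)
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is what pushes the embeddings of true pairs together and those of negative candidates apart.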

## 5. Implementing Skip-gram with TensorFlow