# [KDNUGGETS](https://www.kdnuggets.com/2021/11/guide-word-embedding-techniques-nlp.html)<br>
### TF-IDF — Term Frequency-Inverse Document Frequency: 
The input text can be a single document or a collection of documents (a corpus). TF-IDF combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).<br><br>
TF measures how often a word occurs in a document. With i a word and j a document, so that Frequency(i, j) is the number of occurrences of i in j and TotalNumber(j) is the total number of words in j, TF can be written as follows:
$$TF(i) =   \frac{\log(Frequency(i,j))}{\log(TotalNumber(j))}$$
<br><br>
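IDF measures how rare a word is across the corpus. Words that are rarely used in the corpus may hold significant information. With d the documents in the corpus and i a word, so that TotalNumber(d) is the number of documents and Frequency(d, i) is the number of documents that contain i, IDF can be written as follows: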

$$IDF (i) = \log (\frac{TotalNumber(d)}{Frequency(d,i)})$$
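
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy corpus, and the choice to multiply TF by IDF to get the final score are assumptions made for the example. Note that under the log-scaled TF used here, a word that occurs exactly once in a document gets a TF of 0.

```python
import math
from collections import Counter


def tf(word, doc_tokens):
    """Log-scaled term frequency: log(count of word) / log(words in document)."""
    count = Counter(doc_tokens)[word]
    if count == 0 or len(doc_tokens) < 2:
        return 0.0  # avoid log(0) and division by log(1) = 0
    return math.log(count) / math.log(len(doc_tokens))


def idf(word, corpus):
    """log(number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        return 0.0  # assumption: unseen words get a score of 0
    return math.log(len(corpus) / containing)


def tf_idf(word, doc_tokens, corpus):
    """Combine the two metrics by multiplication (the usual convention)."""
    return tf(word, doc_tokens) * idf(word, corpus)


# Hypothetical toy corpus, already tokenized.
corpus = [
    "the movie is great great".split(),
    "the movie is boring".split(),
    "the plot is thin".split(),
]

for word in ("great", "the"):
    print(word, tf_idf(word, corpus[0], corpus))
```

A word such as "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF score vanishes, while "great" appears in only one document and scores higher.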

### Word2Vec:
Word2Vec embeddings are compared using the cosine similarity metric. If the angle between two word vectors is 0° (cosine value 1), the words overlap in meaning. If the angle is 90° (cosine value 0), the words are independent or share no contextual similarity. Word2Vec offers two neural network-based variants: Continuous Bag of Words (CBOW) and Skip-gram.
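
As a rough sketch of the cosine metric described above, the following computes the cosine of the angle between two vectors in plain Python. The function name and the example vectors are made up for illustration; real Word2Vec vectors typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional "embeddings".
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # angle of 90 degrees
print(same_direction, orthogonal)
```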

In CBOW, the neural network model takes various words as input and predicts the target word that is closely related to the context of the input words. On the other hand, the Skip-gram architecture takes one word as input and predicts its closely related context words.
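
The difference between the two variants shows up in how training examples are cut from a sentence. The sketch below is only an illustration of that idea, assuming a symmetric context window; the function name and the `window` parameter are hypothetical.

```python
def training_examples(tokens, window=2):
    """Build CBOW and Skip-gram training examples from one tokenized sentence.

    CBOW:      (list of context words) -> target word
    Skip-gram: (center word) -> one context word, one pair per context word
    """
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        cbow.append((context, center))
        skipgram.extend((center, word) for word in context)
    return cbow, skipgram


cbow, skipgram = training_examples("the movie is great".split(), window=1)
print(cbow[1])       # context words around "movie", paired with "movie"
print(skipgram[:3])  # (center, context) pairs
```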

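Skip-gram <br>

Given a center word $w_c$ with vector $v_c$, the probability of a context word $w_o$ with vector $u_o$ is a softmax over all words $i$ in the vocabulary $\mathcal{V}$:

$$ P (w_o | w_c) = \frac{e^{u_o^\intercal v_c}}{\sum_{i \in \mathcal{V}} e^{u_i^\intercal v_c}}$$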
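
The Skip-gram probability above is a softmax over dot products between the center vector $v_c$ and every context vector $u_i$. A minimal sketch, assuming tiny hand-written vectors (the names and values below are invented for the example):

```python
import math


def skipgram_probability(target, center_vec, context_vecs):
    """P(target | center) as a softmax over dot products u_i . v_c."""
    scores = {
        word: sum(u * v for u, v in zip(vec, center_vec))
        for word, vec in context_vecs.items()
    }
    # Subtracting the max score keeps exp() numerically stable
    # without changing the resulting probabilities.
    max_score = max(scores.values())
    exp_scores = {word: math.exp(s - max_score) for word, s in scores.items()}
    return exp_scores[target] / sum(exp_scores.values())


# Hypothetical context vectors u_i for a three-word vocabulary.
context_vecs = {
    "movie": [1.0, 0.5],
    "great": [0.8, 0.9],
    "drink": [-1.0, 0.2],
}
center_vec = [0.9, 0.7]  # hypothetical center vector v_c

probs = {w: skipgram_probability(w, center_vec, context_vecs) for w in context_vecs}
print(probs)
```

The denominator sums over the whole vocabulary, which is what makes the full softmax expensive for large vocabularies.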

<br>
Word2Vec captures only the local context of words, i.e. the words inside its context window.

![CBOW vs Skip-gram](https://www.kdnuggets.com/wp-content/uploads/nlp-cbow-skip-gram.jpg)

Although one-hot word vectors are easy to construct, they are usually not a good choice. The main reason is that one-hot word vectors cannot express the similarity between different words: the cosine similarity between the one-hot vectors of any two different words is always 0.
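
A quick way to see the limitation is to build one-hot vectors for a tiny vocabulary. Because each one-hot vector has length 1, its dot product with another one-hot vector equals their cosine similarity. The vocabulary and helper name below are assumptions for the example.

```python
def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's index in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]


vocab = ["movie", "film", "drink"]  # hypothetical vocabulary

movie, film = one_hot("movie", vocab), one_hot("film", vocab)

# Unit-length vectors, so the dot product is the cosine similarity.
similarity = sum(a * b for a, b in zip(movie, film))
print(similarity)  # "movie" and "film" look unrelated despite similar meaning
```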


### GloVe — Global Vectors for Word Representation
GloVe considers the entire corpus and builds a large matrix that captures the co-occurrence of words within the corpus. It combines two word-vector learning methods: matrix factorization and the local context window method. GloVe uses a simpler least-squares cost (error) function, which reduces the computational cost of training the model. The resulting word embeddings reflect both global co-occurrence statistics and local context.
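
The co-occurrence matrix is the starting point for GloVe. The sketch below only counts co-occurrences within a symmetric window and stores them sparsely in a dictionary; it does not train embeddings. The function name, the `window` parameter, and the toy sentences are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair appears within the window."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(center, tokens[j])] += 1.0
    return counts


# Hypothetical two-sentence corpus.
sentences = [
    "ice is cold".split(),
    "steam is hot".split(),
]
counts = cooccurrence_counts(sentences, window=1)
for pair, value in sorted(counts.items()):
    print(pair, value)
```

The counts are accumulated over the whole corpus, which is how global statistics enter the model, while the window keeps each individual count local.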

![GloVe](https://www.kdnuggets.com/wp-content/uploads/nlp-glove-embedding-example.jpg)


### BERT — Bidirectional Encoder Representations from Transformers
BERT-Base has 110 million parameters, and BERT-Large has 340 million parameters. During training, embeddings are refined as they pass through each BERT encoder layer. For each word, the attention mechanism captures word associations based on the words to its left and the words to its right. Word embeddings are also positionally encoded to keep track of the position of each word in a sentence. The Google search engine uses BERT.

BERT further improved the state of the art on eleven natural language processing tasks under the broad categories of (i) single text classification (e.g., sentiment analysis), (ii) text pair classification (e.g., natural language inference), (iii) question answering, and (iv) text tagging (e.g., named entity recognition). Context-sensitive ELMo and task-agnostic GPT and BERT were all proposed in 2018; their conceptually simple yet empirically powerful pretraining of deep representations for natural languages has revolutionized solutions to various natural language processing tasks.


#### Masked Language Modeling
A conventional language model predicts a token using only the context on its left. To encode context bidirectionally for representing each token, BERT randomly masks tokens and uses tokens from the bidirectional context to predict the masked tokens in a self-supervised fashion. This task is referred to as a masked language model. In this pretraining task, 15% of tokens are selected at random for prediction, and each selected token is replaced with:

- a special “`<mask>`” token for 80% of the time (e.g., “this movie is great” becomes “this movie is `<mask>`”);
- a random token for 10% of the time (e.g., “this movie is great” becomes “this movie is drink”);
- the unchanged label token for 10% of the time (e.g., “this movie is great” becomes “this movie is great”).

Note that a random token is inserted only 10% of 15% of the time, i.e. for about 1.5% of all tokens. This occasional noise encourages BERT to be less biased towards the masked token (especially when the label token remains unchanged) in its bidirectional context encoding.
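
The replacement rule can be sketched as follows. This is a simplified illustration rather than BERT's actual preprocessing code: it works on whole words instead of WordPiece tokens, and the function name, the `"<mask>"` string, and the replacement vocabulary are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Select ~15% of positions and apply the 80/10/10 replacement rule.

    Returns the corrupted token list and a dict {position: original token}
    holding the labels the model has to predict.
    """
    num_preds = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), num_preds)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = "<mask>"           # 80%: special mask token
        elif draw < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: leave the original token unchanged
    return masked, labels


rng = random.Random(0)  # fixed seed so the example is repeatable
vocab = ["drink", "tree", "blue", "run"]  # hypothetical replacement vocabulary
tokens = "this movie is great and the plot is clever too".split()
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

The labels always hold the original token, including the cases where the input is left unchanged or replaced with a random word.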

#### Next Sentence Prediction
Although masked language modeling is able to encode bidirectional context for representing words, it does not explicitly model the logical relationship between text pairs. To help understand the relationship between two text sequences, BERT considers a binary classification task, next sentence prediction, in its pretraining. When generating sentence pairs for pretraining, for half of the time they are indeed consecutive sentences with the label “True”; while for the other half of the time the second sentence is randomly sampled from the corpus with the label “False”.
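
Generating the sentence pairs could look roughly like this. It is a sketch under simple assumptions: the corpus is a list of paragraphs, each a list of sentences, and the function name is hypothetical.

```python
import random


def make_nsp_pairs(paragraphs, rng):
    """Build (first sentence, second sentence, is_next) training triples."""
    all_sentences = [s for paragraph in paragraphs for s in paragraph]
    pairs = []
    for paragraph in paragraphs:
        for i in range(len(paragraph) - 1):
            first = paragraph[i]
            if rng.random() < 0.5:
                # Half of the time: the true next sentence, label True.
                pairs.append((first, paragraph[i + 1], True))
            else:
                # Other half: a sentence sampled at random from the corpus,
                # label False. In this simple sketch the sample can
                # occasionally coincide with the true next sentence.
                pairs.append((first, rng.choice(all_sentences), False))
    return pairs


# Hypothetical corpus of two short paragraphs.
paragraphs = [
    ["this movie is great", "i like it", "the plot is clever"],
    ["the weather is nice", "we went for a walk"],
]
for pair in make_nsp_pairs(paragraphs, random.Random(0)):
    print(pair)
```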

# Hierarchical Softmax

![image.png](https://d2l.ai/_images/hi-softmax.svg)