# Natural language processing & Word embeddings

## Introduction to word embeddings

* word representation
    * dictionary & one-hot vector
        * downstream algorithm cannot learn from word-similarity (apple/orange juice)
            * inner product of any two distinct one-hot vectors is zero, so every pair of words looks equally dissimilar
    * feature representation
        * embeddings (embedding a word into high-dimensional space)
        * visualization in 2D space through t-SNE
* application
    * leveraging unlabeled text (download from internet) in downstream tasks through embeddings
    * transfer learning
        * learn word embeddings from large corpus (or get pre-trained embeddings)
        * transfer embedding to new task with smaller training set
        * optional -> continue to fine-tune the embeddings with new data
* word embeddings capture analogies (man -> woman, king -> queen)
    * $e_{man} - e_{woman} \approx e_{king} - e_?$; pick the word $w$ that maximizes $sim(e_w, e_{king} - e_{man} + e_{woman})$
    * helps build intuition about how the embeddings work internally
    * similarities
        * cosine $sim(u,v) = \frac{u^Tv}{\|u\|_2 \|v\|_2}$
            * equals 1 when the angle between $u$ and $v$ is zero, 0 when they are orthogonal, and -1 when they point in opposite directions
            * works well for analogy reasoning tasks
        * squared distance $sim(u,v) = - \|u-v\|^2$
            * works, but is used less often
    * embedding matrix $E$ (rows: embedding dimensions, columns: vocabulary words)
        * $E$ (300, 10k) $\cdot\ o_{6257}$ (10k, 1) = $e_{6257}$ (300, 1): the embedding of word 6257 from the dictionary, i.e. column 6257 of $E$
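
The points above about one-hot vectors, the lookup $E \cdot o$, cosine similarity, and analogies can be sketched in a few lines. This is a minimal illustration, not a trained model: the six-word vocabulary, the 3-dimensional embedding values, and the helper names (`one_hot`, `embed`, `analogy`) are all made up for the example.

```python
import math

# Toy vocabulary with hand-picked 3-d embeddings (made-up values, not learned).
# Assumed feature axes: [gender, royalty, food].
vocab = ["man", "woman", "king", "queen", "apple", "orange"]
embeddings = {
    "man":    [-1.0, 0.0, 0.0],
    "woman":  [1.0, 0.0, 0.0],
    "king":   [-1.0, 1.0, 0.0],
    "queen":  [1.0, 1.0, 0.0],
    "apple":  [0.0, 0.0, 1.0],
    "orange": [0.0, 0.0, 0.9],
}


def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity u^T v / (||u||_2 ||v||_2)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Embedding matrix E: one row per embedding dimension, one column per word.
E = [[embeddings[w][d] for w in vocab] for d in range(3)]


def embed(index):
    """Compute E . o_index, which just selects column `index` of E."""
    o = one_hot(index, len(vocab))
    return [dot(row, o) for row in E]


def analogy(a, b, c):
    """Solve a : b :: c : ? by maximizing sim(e_w, e_c - e_a + e_b)."""
    target = [ec - ea + eb for ea, eb, ec in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Distinct one-hot vectors have zero inner product: no similarity signal.
print(dot(one_hot(4, len(vocab)), one_hot(5, len(vocab))))
# The same two words are close in embedding space.
print(cosine(embeddings["apple"], embeddings["orange"]))
# The matrix-vector product recovers the stored embedding of "king".
print(embed(vocab.index("king")))
print(analogy("man", "woman", "king"))  # queen
```

In practice the product $E \cdot o$ is never computed explicitly; the column is read directly, which is what an embedding lookup layer does.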

**Learning embedding representation**
* neural language model
    * [paper](https://papers.nips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf)
    * $o$ -> $E$ -> $e$ -> FC (W,b) -> softmax (W,b) -> predicting the next word in the sentence
    * a fixed-size window of preceding words is used for the prediction
* other context/target pairs
    * predict a missing word from a window of words to its left and right
    * predict the next word from the single previous word
    * predict one nearby word (skip-gram)
* word2vec
    * [paper](https://arxiv.org/pdf/1301.3781)
    * context-target pairs
        * randomly pick a context word, randomly pick a target within a window (+-5)
    * learn a mapping from a context word $c$ (e.g. "orange") to a target word $t$ (e.g. "juice")
    * $o_c$ -> $E$ -> $e_c = E\cdot o_c$ -> softmax layer -> $\hat{y}$
        * softmax $p(t|c) = \frac{e^{\Theta_t^T e_c}}{\sum_{j=1}^{10000} e^{\Theta_j^T e_c}}$
        * vocab size 10000, $\Theta_t$ -> parameters associated with output t
    * $L(\hat{y},y) = - \sum_{i=1}^{10000} y_i \log \hat{y}_i$
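    * the softmax is computationally expensive, because the denominator sums over the whole vocabulary
        * hierarchical softmax: a tree of binary classifiers (the tree may be unbalanced), computational cost $O(\log |V|)$
    * sample the context word smartly rather than uniformly (i.e. reduce the probability of sampling very frequent words such as a, the, or, and, ...)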
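
As a rough sketch of the skip-gram pieces listed above, the following code samples a context-target pair within a window, computes $p(t|c)$ with a softmax over $\Theta_j^T e_c$, and evaluates the cross-entropy loss. The function names, the tiny sentence, and the parameter values are assumptions made for illustration; no training loop is included.

```python
import math
import random


def sample_pair(tokens, window=5, rng=random):
    """Pick a random context word, then a random target within +-window."""
    c = rng.randrange(len(tokens))
    lo = max(0, c - window)
    hi = min(len(tokens) - 1, c + window)
    t = rng.choice([i for i in range(lo, hi + 1) if i != c])
    return tokens[c], tokens[t]


def softmax(scores):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_probs(theta, e_c):
    """p(t | c) for every vocabulary word t: softmax over theta_t^T e_c.

    theta holds one parameter vector per output word, so this costs
    O(vocabulary size) per example, which is the bottleneck noted above.
    """
    scores = [sum(a * b for a, b in zip(theta_t, e_c)) for theta_t in theta]
    return softmax(scores)


def cross_entropy(probs, target_index):
    """With a one-hot y, -sum_i y_i log(yhat_i) reduces to -log(yhat_target)."""
    return -math.log(probs[target_index])


tokens = "I want a glass of orange juice".split()
context, target = sample_pair(tokens, window=2, rng=random.Random(0))
print(context, target)

# Hypothetical parameters for a 4-word vocabulary with 2-d embeddings.
theta = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5], [-0.2, 0.1]]
e_c = [1.0, 2.0]
probs = skipgram_probs(theta, e_c)
print(probs, cross_entropy(probs, target_index=2))
```
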
* negative sampling
    * [paper](https://arxiv.org/pdf/1310.4546)
    * new learning problem -> given a context-target pair, predict whether it is a real pair from the text (positive example) or a randomly generated one (negative example)
    * pick a context word and a target word from the text to create one positive example; then, k times, keep the same context word and pair it with a random word from the dictionary to create k negative examples
    * supervised problem -> do those terms appear together?
    * for small datasets k = 5-20, for large ones k = 2-5
    * binary classification $P(y=1| c,t) = \sigma (\Theta^T_t e_c)$
    * 10000 binary classifiers in total, but only k+1 of them are trained on each example (k randomly sampled words + the actual target)
    * negative words are sampled from a distribution between the observed frequency and the uniform distribution: $P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j=1}^{10000}f(w_j)^{3/4}}$
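
The negative-sampling recipe above might look roughly like this in code. It is a sketch under stated assumptions: the word counts, vectors, and function names are invented for the example, and the per-example loss shown is the standard binary cross-entropy for $P(y=1|c,t) = \sigma(\Theta_t^T e_c)$ rather than anything specified in these notes.

```python
import math
import random


def negative_sampling_distribution(counts, power=0.75):
    """P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def make_examples(context, target, dist, k, rng=random):
    """One positive (context, target, 1) plus k negatives (context, word, 0).

    This sketch does not filter out the case where a sampled word
    happens to equal the true target.
    """
    words = list(dist)
    negatives = rng.choices(words, weights=[dist[w] for w in words], k=k)
    return [(context, target, 1)] + [(context, w, 0) for w in negatives]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def example_loss(theta_t, e_c, label):
    """Binary cross-entropy for P(y=1 | c, t) = sigmoid(theta_t^T e_c)."""
    p = sigmoid(sum(a * b for a, b in zip(theta_t, e_c)))
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


# Made-up word counts: "the" dominates the corpus.
counts = {"the": 1000, "juice": 10, "orange": 10, "glass": 5}
dist = negative_sampling_distribution(counts)
print(dist)  # "the" is still most likely, but less so than its raw frequency

examples = make_examples("orange", "juice", dist, k=4, rng=random.Random(0))
print(examples)  # 1 positive followed by k = 4 negatives

# Only the k + 1 classifiers touched by these examples would be updated.
print(example_loss([0.3, -0.1], [1.0, 2.0], label=1))
```

Raising the counts to the power 3/4 flattens the distribution: frequent words are still sampled more often than rare ones, but by a smaller margin than their raw frequency would give.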
