# Global Vectors for Word Representations
## Skip-Gram Recap
Let's recap the Skip-Gram model.

1. Go through each word in the corpus. Let `W` be the vocabulary size, i.e. the number of distinct words in the corpus.

2. Predict the surrounding words of each center word (the word in the middle of the window). The sum in the denominator is computationally expensive because it runs over the entire vocabulary.

$$
P(o \mid c) = \frac{\exp(u_{o}^{T} v_{c})}{\sum_{w=1}^{W} \exp(u_{w}^{T} v_{c})}
$$

3. Take gradients at each such window and use them for an SGD update.
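As a rough illustration of why the denominator is costly, the sketch below computes $P(o \mid c)$ in plain Python. The function name `softmax_prob`, the toy vectors, and the choice to store each word's vector as one inner list are assumptions made for this example only.

```python
import math

def softmax_prob(o, c, U, V):
    """Return P(o | c) under the Skip-Gram softmax.

    Assumed layout for this sketch: U[w] is the outside vector u_w and
    V[w] is the center vector v_w, each a list of D floats.
    """
    v_c = V[c]
    # One dot product per vocabulary word: this loop is the expensive part.
    scores = [sum(a * b for a, b in zip(U[w], v_c)) for w in range(len(U))]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[o] / sum(exps)

# Toy vocabulary of 3 words with 2-dimensional vectors (made-up numbers).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
probs = [softmax_prob(o, 1, U, V) for o in range(3)]
```

Every call touches all `W` outside vectors, which is cheap for this toy vocabulary but becomes the bottleneck when the vocabulary is large.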

There are two matrices, `V` and `U`. We use uppercase letters for matrices and lowercase letters for vectors, and `D` for the dimension of a word vector.

`V` is the center word matrix. Each column is the center-word vector for one word in the vocabulary.

$$
V = \begin{bmatrix} 
V[0]_{0} & V[0]_{1} & \cdots & V[0]_{W-1} \\
V[1]_{0} & V[1]_{1} & \cdots & V[1]_{W-1} \\
\vdots & \vdots & \ddots & \vdots \\
V[D-1]_{0} & V[D-1]_{1} & \cdots & V[D-1]_{W-1}
\end{bmatrix}
$$

`U` is the context (outside) word matrix. As with `V`, each column is the vector for one word in the vocabulary.

$$
U = \begin{bmatrix} 
U[0]_{0} & U[0]_{1} & \cdots & U[0]_{W-1} \\
U[1]_{0} & U[1]_{1} & \cdots & U[1]_{W-1} \\
\vdots & \vdots & \ddots & \vdots \\
U[D-1]_{0} & U[D-1]_{1} & \cdots & U[D-1]_{W-1}
\end{bmatrix}
$$
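To make the column layout concrete, here is a minimal sketch that builds both matrices as nested lists with `D` rows and `W` columns and reads one word vector out of each. The sizes, the random initialization range, and the helper name `column` are hypothetical choices for illustration.

```python
import random

D, W = 3, 5  # hypothetical vector dimension and vocabulary size
rng = random.Random(0)

# D rows by W columns, so column w holds the vector for word w.
V = [[rng.uniform(-0.5, 0.5) for _ in range(W)] for _ in range(D)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(W)] for _ in range(D)]

def column(M, w):
    """Return column w of matrix M as a list of length D."""
    return [row[w] for row in M]

v_c = column(V, 2)  # center vector for word index 2
u_o = column(U, 4)  # outside vector for word index 4
score = sum(a * b for a, b in zip(u_o, v_c))  # the dot product u_o^T v_c
```

Many implementations store the transpose instead (one row per word), because looking up a row is a contiguous read; the math is unchanged either way.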

**Objective**: We want to maximize the probability for each outside word, given a center word.

## Distributed Representations of Words and Phrases and their Compositionality
The sum in the denominator of the probability expression is expensive and wasteful: for a given center word, most words in the vocabulary are irrelevant to it, and their terms contribute very little to the sum.

*The trick here is to train binary logistic regressions for a true pair (center word and word in its context window) versus a couple of noise pairs (the center word paired with a random word.)*

### Objective Function
`T` is the total number of windows that fit in the corpus for a given window size.

$$
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_{t}(\theta)
$$

The simplified cost function for a single window (slightly different from the one in the previous lecture) is

$$
J_{t}(\theta) = \log \sigma(u_{o}^{T} v_{c}) + \sum_{j \sim P(w)} \left[ \log \sigma(-u_{j}^{T} v_{c}) \right]
$$

1. The first term uses a sigmoid instead of the softmax probability because it is much cheaper to compute: it needs no sum over the vocabulary. The two objectives are not identical, but maximizing this one is intended to lead to similarly useful word vectors.
2. The second term is negative sampling. We draw `k` negative examples, i.e. random words sampled from `P(w)`, which are unlikely to appear with the center word.
3. Maximizing $J_{t}$ raises the probability that the real outside word appears and lowers the probability that the random words appear around the center word.
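The per-window objective can be sketched in a few lines of plain Python. The names `sigmoid`, `dot`, and `window_objective`, as well as the toy vectors, are assumptions made for this example; the negative vectors are passed in directly rather than sampled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def window_objective(u_o, v_c, negatives):
    """J_t = log sigma(u_o . v_c) + sum_j log sigma(-u_j . v_c), to be maximized."""
    positive = math.log(sigmoid(dot(u_o, v_c)))
    negative = sum(math.log(sigmoid(-dot(u_j, v_c))) for u_j in negatives)
    return positive + negative

v_c = [1.0, 0.0]

# True pair aligned with v_c, noise words pointing away: high objective.
good = window_objective([2.0, 0.0], v_c, [[-2.0, 0.0], [-1.0, 1.0]])

# True pair pointing away, noise words aligned: low objective.
bad = window_objective([-2.0, 0.0], v_c, [[2.0, 0.0], [1.0, 1.0]])
```

The cost per window depends only on the `k` negatives plus the one true pair, not on the vocabulary size `W`.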

**Math Trick**
$$
\sigma(-x) = 1 - \sigma(x)
$$
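For completeness, the identity follows in two steps, assuming the standard logistic definition $\sigma(x) = \frac{1}{1 + e^{-x}}$ (the definition itself is not spelled out in these notes):

$$
1 - \sigma(x) = \frac{1 + e^{-x} - 1}{1 + e^{-x}} = \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{e^{x} + 1} = \sigma(-x)
$$

Applied to the second term of the cost function, this gives $\log \sigma(-u_{j}^{T} v_{c}) = \log\left(1 - \sigma(u_{j}^{T} v_{c})\right)$, which can be read as the log-probability that the pair of center word and word `j` is classified as noise.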

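We sample the random words from the unigram distribution `U(w)` raised to the $\frac{3}{4}$ power and renormalized so that the probabilities sum to one, where `Z` is the normalizing constant.

$$
P(w) = \frac{U(w)^{3/4}}{Z}, \qquad Z = \sum_{w'=1}^{W} U(w')^{3/4}
$$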

The $\frac{3}{4}$ power flattens the distribution, so less frequent words are sampled more often than their raw frequency would suggest.
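A small sketch of this sampling scheme is shown below. The word counts are made up, and the function name `noise_distribution` is a hypothetical helper; the weights are normalized explicitly so that they form a probability distribution.

```python
import random

def noise_distribution(counts, power=0.75):
    """Raise each word count to `power` and normalize to sum to one."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

# Made-up corpus counts: one very frequent word, one rare word.
counts = {"the": 900, "cat": 90, "aardvark": 10}
n = sum(counts.values())
unigram = {w: c / n for w, c in counts.items()}
P = noise_distribution(counts)

# Draw k = 5 negative samples from P(w).
rng = random.Random(0)
negatives = rng.choices(list(P), weights=list(P.values()), k=5)
```

In this toy example the rare word "aardvark" gets a larger share under `P` than under the raw unigram distribution, while "the" gets a smaller one.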