# Word Vector Representation
## Word Meaning
> "You shall know a word by the company it keeps"

Words can take on different meanings depending on the context in which they appear. For example, the word `baby` in the following sentence can be read as either a noun or an adjective.
> The pope's baby steps on gays.

We can interpret `baby` either as a modifier of `steps` (as in "baby steps") or as a noun referring to an actual baby. The same ambiguity applies to the word `steps`, which can be a noun or a verb. Although this sentence is inherently difficult to interpret, it illustrates the idea that the meaning of a word comes largely from its context.

### Distributional Meaning
We extract meaning by representing a word through its neighbors. Meaning is defined in terms of vectors: we build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.
```python
linguistics = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542]
```

We define a model that predicts the context words of a center word $w_{t}$ in terms of word vectors.
$$
P(\text{Context}\mid w_{t}) = ... 
$$

which has a loss function, e.g. (writing $w_{-t}$ for the context words around $w_{t}$)
$$
J = 1 - P(w_{-t}\mid w_{t})
$$

We look at many positions `t` in a big language corpus. We keep adjusting the vector representation of words to minimize this loss.

## Skip-Gram Model: Word2Vec
For each position `t = 1 ... T`, predict the surrounding words within a window of radius `m` around the center word $w_{t}$. The objective is to maximize the probability of every context word given the current center word.
$$
J^{\prime}(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{t+j} \mid w_{t}; \theta)
$$
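The double product runs over every (center, context) pair in the corpus. As a minimal sketch of that enumeration, assuming a tokenized sentence and a hypothetical helper name `skipgram_pairs` (not from any library), positions that would fall outside the text are simply skipped:

```python
def skipgram_pairs(tokens, m):
    """Collect (center, context) pairs for every position t and
    every offset j with -m <= j <= m and j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

# Toy sentence (made up for illustration) with window radius m = 1.
tokens = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(tokens, m=1)
```

Each pair contributes one factor $P(w_{t+j} \mid w_{t}; \theta)$ to the objective.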

### Cost Function
It's easier to work with sums than with products, so we take the log of the probabilities. People tend to prefer minimization over maximization, so we put a negative sign in front, and we average over the `T` positions. This is the formal cost function that we will use.
$$
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j} \mid w_{t}; \theta)
$$
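To see that the log form carries the same information as the product form, here is a small sketch. The probabilities and the value of `T` are made-up numbers standing in for $P(w_{t+j} \mid w_{t}; \theta)$ on a tiny corpus; they do not come from a trained model.

```python
import math

# Hypothetical values of P(w_{t+j} | w_t) for four (center, context) pairs.
pair_probs = [0.5, 0.25, 0.125, 0.5]
T = 2  # assumed number of center positions in this toy corpus

# Product form: the likelihood we would like to maximize.
likelihood = math.prod(pair_probs)

# Sum-of-logs form: the averaged negative log-likelihood we minimize.
cost = -sum(math.log(p) for p in pair_probs) / T

# The log of a product is the sum of the logs, so both routes give the same cost.
cost_from_product = -math.log(likelihood) / T
```

Because the log is monotonic, the parameters that maximize the product also minimize this cost.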

### Prediction
We predict the surrounding words in a window of radius `m` around every word.

For $P(w_{t+j} \mid w_{t})$, the simplest first formulation is
$$
P(o\mid c) = \frac{\exp(u_{o}^{T}v_{c})}{\sum_{w=1}^{W} \exp(u_{w}^{T}v_{c})}
$$

where $o$ is the outside word index, $c$ is the center word index, and $v_{c}$ and $u_{o}$ are the center and outside vectors for indices $c$ and $o$. We use the center word to get the softmax probabilities for each outside word. **IMPORTANT**: There are actually two vector representations for each word, hence the notation $u$ and $v$: one vector for when the word is the center word and one for when it is a context word.

**Softmax**
Note that softmax is a standard way to map a set of real numbers to a probability distribution. It works for all real numbers, including negatives, because the exponential of any real number is positive. Dividing by the sum of the exponentials then makes the values add up to one.
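As a minimal sketch of the softmax step, the snippet below computes $P(o \mid c)$ for a toy vocabulary of three words. The vectors are made-up values, and subtracting the maximum score before exponentiating is a common numerical-stability trick added here, not something the formula above requires.

```python
import math

def softmax(scores):
    """Map real-valued scores to a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy parameters (assumed for illustration): 3 words, 2-dimensional vectors.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # outside vectors u_w
v_c = [0.5, -0.5]                            # center vector v_c

# probs[o] is P(o | c); negative scores still yield positive probabilities.
probs = softmax([dot(u_w, v_c) for u_w in U])
```

The word whose outside vector has the largest dot product with $v_{c}$ receives the highest probability.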


![skip_gram](./assets/skip_gram.png) 

Terminology
* $\theta$ is a long vector that defines the set of all parameters in the model.
* `D` is the dimension of our word vector.
* `V` is the total number of words in our vocabulary (written as $W$ in the softmax sum above).

### Gradient Derivation
We are trying to make updates to $v_{c}$, the word vector of our center word. We will take the gradient of loss with respect to $v_{c}$ and then do an update step on it.

Let's ignore the $-\frac{1}{T}$ constant for a moment and expand the probability expression into exponentials. We focus on one center word for now and denote it as $c$.
$$
\frac{\partial J}{\partial v_{c}} = \frac{\partial}{\partial v_{c}} \left[ \log \exp(u_{o}^{T}v_{c}) - \log \sum_{w=1}^{W} \exp(u_{w}^{T}v_{c}) \right]
$$

**NOTE**: The capital `W` in the sum is the vocabulary size (`V` in the terminology above), so the normalization runs over every word in the vocabulary, not over the window. The window only determines which $(c, o)$ pairs we visit. If we have a 40-billion-word corpus, computing the gradient over all `T` positions before each update is extremely inefficient for naive gradient descent, which is why we use stochastic gradient descent and update after each window.

Now we have two pieces to differentiate. The first piece is easy:
$$
\frac{\partial}{\partial v_{c}} \log \exp(u_{o}^{T}v_{c}) = \frac{\partial}{\partial v_{c}} u_{o}^{T}v_{c} = u_{o}
$$

The second piece requires the chain rule.
$$
\frac{\partial}{\partial v_{c}} \log \sum_{w=1}^{W} \exp(u_{w}^{T}v_{c}) = \frac{1}{\sum_{w=1}^{W} \exp(u_{w}^{T}v_{c})}\left[ \sum_{x=1}^{W} \exp(u_{x}^{T}v_{c}) u_{x} \right]
$$

**DO NOT** think that the two summations cancel each other out. Think of the fraction term as a constant instead. So let's reorganize it.
$$
\sum_{x=1}^{W} \frac{\exp(u_{x}^{T}v_{c})}{\sum_{w=1}^{W} \exp(u_{w}^{T}v_{c})} u_{x}
$$

Now, what does that look like? Each fraction is exactly $P(x \mid c)$, so the sum is the expected value of $u_{x}$ under the model's distribution. Let's combine everything together.
$$
\frac{\partial J}{\partial v_{c}} = u_{o} - \sum_{x=1}^{W} P(x \mid c) u_{x}
$$
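One way to sanity-check this result is to compare it against a finite-difference approximation of $\log P(o \mid c)$. The sketch below does so on made-up toy vectors; the helper names are hypothetical and the step size `eps` is an assumed value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_prob(o, U, v_c):
    """log P(o | c) = u_o . v_c - log sum_w exp(u_w . v_c)."""
    scores = [dot(u, v_c) for u in U]
    return scores[o] - math.log(sum(math.exp(s) for s in scores))

def grad_log_prob(o, U, v_c):
    """Analytic gradient from the derivation: u_o - sum_x P(x | c) u_x."""
    scores = [dot(u, v_c) for u in U]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    dims = range(len(v_c))
    expected_u = [sum(p * u[d] for p, u in zip(probs, U)) for d in dims]
    return [U[o][d] - expected_u[d] for d in dims]

def numeric_grad(o, U, v_c, eps=1e-6):
    """Central finite differences, one coordinate of v_c at a time."""
    grad = []
    for d in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[d] += eps
        minus[d] -= eps
        grad.append((log_prob(o, U, plus) - log_prob(o, U, minus)) / (2 * eps))
    return grad

# Toy parameters (assumed): 3 outside vectors and one center vector.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
v_c = [0.5, -0.5]
analytic = grad_log_prob(0, U, v_c)
numeric = numeric_grad(0, U, v_c)
```

If the derivation is right, `analytic` and `numeric` agree up to the finite-difference error.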

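Restoring the $-\frac{1}{T}$ constant and summing over all positions and windows, the whole form is
$$
\frac{\partial}{\partial v_{c}} J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \left[ u_{w_{t+j}} - \sum_{x=1}^{W} P(x \mid w_{t}) u_{x} \right]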
$$
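The update step on $v_{c}$ mentioned at the start of this section can be sketched for a single (center, outside) pair as follows. This is an illustrative sketch, not a full Word2Vec trainer: the learning rate `lr`, the toy vectors, and the helper names are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def probs_given_center(U, v_c):
    """P(x | c) for every word x in the toy vocabulary."""
    scores = [dot(u, v_c) for u in U]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

def sgd_step(o, U, v_c, lr):
    """One gradient-descent update of v_c on the loss -log P(o | c)."""
    probs = probs_given_center(U, v_c)
    new_v = []
    for d in range(len(v_c)):
        expected = sum(p * u[d] for p, u in zip(probs, U))
        grad_loss = -(U[o][d] - expected)  # gradient of -log P(o | c) w.r.t. v_c[d]
        new_v.append(v_c[d] - lr * grad_loss)
    return new_v

# Toy parameters (assumed): the observed outside word has index 1.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
v_c = [0.5, -0.5]
before = probs_given_center(U, v_c)[1]
v_c = sgd_step(1, U, v_c, lr=0.1)
after = probs_given_center(U, v_c)[1]
```

After the step, the model assigns a higher probability to the observed outside word, which is the direction the cost function asks for.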