# Word2Vec: A prediction-based method

Let us recall the main idea for this week: we have to put information about contexts into numerical vectors.

While count-based methods take this idea quite literally, Word2Vec uses it in a different manner: it learns word vectors by teaching them to predict contexts.

Word2Vec is a model whose parameters are word vectors. These parameters are optimized iteratively for a certain objective. The objective forces word vectors to "know" contexts a word can appear in: the vectors are trained to predict possible contexts of the corresponding words. As you remember from the distributional hypothesis, if vectors "know" about contexts, they "know" word meaning.

Word2Vec is an iterative method. Its main idea is as follows:

1. take a huge text corpus;
2. go over the text with a sliding window, moving one word at a time. At each step, there is a central word and context words (other words in this window);
3. for the central word, compute probabilities of context words;
4. adjust the vectors to increase these probabilities.
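Step 2 above can be sketched in a few lines of Python. This is only an illustration: the function name `center_context_pairs`, the toy sentence, and the whitespace tokenization are assumptions made for the example, not part of Word2Vec itself.

```python
def center_context_pairs(tokens, m):
    """Yield (center, context) pairs for a window of m words on each side."""
    for t, center in enumerate(tokens):
        # Positions t-m .. t+m, clipped to the text boundaries, skipping t itself.
        for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if j != t:
                yield center, tokens[j]


# Toy corpus (made up for illustration), split on whitespace.
tokens = "i saw a cute grey cat".split()
pairs = list(center_context_pairs(tokens, m=2))
print(pairs[:4])
```

Each pair produced here is one training example: steps 3 and 4 compute the probability of the context word given the central word and then adjust the vectors.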

### Objective Function: Negative Log Likelihood

For each position $t=1,...,T$ in a text corpus, Word2Vec predicts the context words within a window of size $m$ given the central word $w_{t}$:

$$Likelihood = L(\theta) = \prod_{t=1}^{T} \prod_{-m\le j \le m,j\neq 0} P(w_{t+j}| w_{t}, \theta)$$

where $\theta$ are all variables to be optimized. The objective function (aka loss function or cost function) $J(\theta)$ is the average negative log-likelihood:

$$Loss = J(\theta) = -\frac{1}{T}log{L(\theta)} = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m\le j \le m,j\neq 0} log{P(w_{t+j}| w_{t}, \theta)}$$

In simpler terms, our objective function goes over the text with a sliding window and, at each position, adds up the negative log-probabilities of the context words given the central word; the total is then averaged over all positions.
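The loss can be computed directly from its definition. The sketch below assumes a hypothetical callable `prob(context, center)` that returns $P(w_{t+j}| w_{t}, \theta)$; here it is replaced by a uniform distribution, which is roughly what an untrained model would give. Window positions that fall outside the text are skipped, which is one common convention and an assumption of this example.

```python
import math


def average_nll(tokens, m, prob):
    """Average negative log-likelihood J(theta) over a tokenized text."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            # Skip the central word itself and positions outside the text.
            if j == 0 or not 0 <= t + j < T:
                continue
            total += math.log(prob(tokens[t + j], tokens[t]))
    return -total / T


tokens = "i saw a cute grey cat".split()
vocab = sorted(set(tokens))


def uniform(context, center):
    # Assumed baseline: every word in the vocabulary is equally likely.
    return 1.0 / len(vocab)


loss = average_nll(tokens, m=2, prob=uniform)
print(loss)
```

Training should push the loss below this uniform baseline, because the model learns to give real context words more than their equal share of probability.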

### How do we calculate $P(w_{t+j}|w_{t}, \theta)$?

For each word $w$ we will have two vectors:
* $v_{w}$ when it is a central word;
* $u_{w}$ when it is a context word.

(Once the vectors are trained, we usually throw away the context vectors $u_{w}$ and use only the word vectors $v_{w}$.)

Then for the central word $c$ (c - central) and the context word $o$ (o - outside word), the probability of the context word is

$$P(o|c) = \frac{exp(u_{o}^{T}v_{c})}{\sum_{w \in V_{oc}} exp(u_{w}^{T}v_{c})}$$

The dot product $u_{o}^{T}v_{c}$ measures the similarity of $o$ and $c$: the larger the dot product, the greater the similarity. Dividing by the sum over the whole vocabulary then normalizes these scores into a probability distribution.
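A minimal sketch of this probability, assuming tiny hand-picked two-dimensional vectors (the words and numbers are made up for illustration). The dictionaries `U` and `V` hold the context vectors $u_{w}$ and central vectors $v_{w}$; subtracting the largest score before exponentiating does not change the result and only guards against overflow.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def context_probability(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)."""
    scores = {w: dot(U[w], V[c]) for w in U}
    shift = max(scores.values())  # numerical stability only
    exp_scores = {w: math.exp(s - shift) for w, s in scores.items()}
    return exp_scores[o] / sum(exp_scores.values())


# Hypothetical toy vectors: V holds central vectors, U holds context vectors.
V = {"cat": [1.0, 0.0], "cute": [0.8, 0.2], "car": [0.0, 1.0]}
U = {"cat": [0.5, 0.0], "cute": [2.0, 0.0], "car": [0.0, 2.0]}

p_cute = context_probability("cute", "cat", U, V)
p_car = context_probability("car", "cat", U, V)
print(p_cute, p_car)
```

With these vectors, `cute` has the larger dot product with the central vector of `cat`, so it receives the larger probability.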

This is the softmax function applied to the scores $x_{w} = u_{w}^{T}v_{c}$:

$$softmax(x_{i}) = \frac{exp(x_{i})}{\sum_{j=1}^{n} exp(x_{j})}$$

It is called softmax because:
* "soft": all probabilities are non-zero;
* "max": the largest $x_{i}$ gets the largest probability $p_{i}$.
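Both properties are easy to check numerically. The snippet below is a small sketch with made-up scores; the `softmax` helper is written by hand here rather than taken from any library.

```python
import math


def softmax(xs):
    """Softmax of a list of scores, shifted by the max for numerical stability."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, -1.0]  # arbitrary example scores
probs = softmax(scores)

print(probs)
print(all(p > 0 for p in probs))  # "soft": no probability is zero
print(probs.index(max(probs)) == scores.index(max(scores)))  # "max"
```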

### Training by Gradient Descent, one word at a time

Let us recall that our parameters $\theta$ are vectors $v_{w}$ and $u_{w}$ for all words in the vocabulary. These vectors are learned by optimizing the training objective via gradient descent (with some learning rate $\alpha$):

$$\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)$$
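To see the update rule in isolation, here is a toy example that has nothing to do with Word2Vec yet: it minimizes the assumed one-dimensional objective $J(\theta) = (\theta - 3)^{2}$, whose gradient is $2(\theta - 3)$. The starting point, learning rate and number of steps are arbitrary choices.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0   # arbitrary starting point
alpha = 0.1   # learning rate

for _ in range(100):
    # theta_new = theta_old - alpha * gradient
    theta = theta - alpha * grad_J(theta)

print(theta)  # close to the minimizer theta = 3
```

Word2Vec applies the same rule, only with $\theta$ being the collection of all word vectors instead of a single number.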

We make these updates one at a time: each update is for a single pair of a center word and one of its context words. Look again at the loss function:

$$Loss = J(\theta) = -\frac{1}{T}log{L(\theta)} = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m\le j \le m,j\neq 0} log{P(w_{t+j}| w_{t}, \theta)}$$

For the center word $w_{t}$, the loss contains a distinct term $J_{t,j}(\theta) = -log{P(w_{t+j}|w_{t}, \theta)}$ for each of its context words $w_{t+j}$. Let us look in more detail at just this one term and try to understand how to make an update for this step.

$$J_{t,j}(\theta) = -log{P(w_{t+j}|w_{t}, \theta)} = -log{\frac{exp(u_{j}^{T}v_{t})}{\sum_{w \in V_{oc}} exp(u_{w}^{T}v_{t})}} = -u_{j}^{T}v_{t} + log{\sum_{w \in V_{oc}} exp(u_{w}^{T}v_{t})}$$

Here $v_{t}$ is the central vector of $w_{t}$ and $u_{j}$ is the context vector of $w_{t+j}$.

Note which parameters are present at this step:
* from vectors for central words, only $v_{t}$;
* from vectors for context words, all $u_{w}$ (for every word $w$ in the vocabulary).

Only these parameters will be updated at the current step.
Let us now calculate their gradients.
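A short derivation, written out as a sketch. It starts from the loss term $J_{t,j}(\theta) = -u_{j}^{T}v_{t} + log{\sum_{w \in V_{oc}} exp(u_{w}^{T}v_{t})}$, where $u_{j}$ is the context vector of $w_{t+j}$, and uses the shorthand $P(w|w_{t}, \theta) = \frac{exp(u_{w}^{T}v_{t})}{\sum_{w' \in V_{oc}} exp(u_{w'}^{T}v_{t})}$. Differentiating with respect to the central vector gives

$$\frac{\partial J_{t,j}(\theta)}{\partial v_{t}} = -u_{j} + \sum_{w \in V_{oc}} P(w|w_{t}, \theta)\, u_{w}$$

and differentiating with respect to each context vector gives

$$\frac{\partial J_{t,j}(\theta)}{\partial u_{w}} = \left(P(w|w_{t}, \theta) - [w = w_{t+j}]\right) v_{t}$$

where $[w = w_{t+j}]$ equals 1 for the observed context word and 0 for every other word. Intuitively, the update pulls $v_{t}$ towards the observed context vector and away from the average context vector the model currently expects. Plugging these gradients into the gradient descent rule gives the updates: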

$$v_{t} := v_{t} - \alpha \frac{\partial J_{t,j}(\theta)}{\partial v_{t}}$$

$$u_{w} := u_{w} - \alpha \frac{\partial J_{t,j}(\theta)}{\partial u_{w}} \quad \forall w \in V_{oc}$$
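One such update can be sketched in plain Python. The gradients used are the standard softmax ones: the gradient for the central vector is $-u_{j} + \sum_{w} P(w|w_{t}, \theta) u_{w}$, and the gradient for each context vector $u_{w}$ is $(P(w|w_{t}, \theta) - 1) v_{t}$ for the observed context word and $P(w|w_{t}, \theta) v_{t}$ otherwise. The toy vocabulary, the vector values, the learning rate and the helper names `loss_term` and `sgd_step` are all assumptions of this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(center, U, V):
    """P(w | center) for every word w, using context vectors U and central vectors V."""
    scores = {w: dot(U[w], V[center]) for w in U}
    shift = max(scores.values())  # numerical stability only
    exps = {w: math.exp(s - shift) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def loss_term(center, context, U, V):
    """J_{t,j} = -log P(context | center)."""
    return -math.log(softmax_probs(center, U, V)[context])


def sgd_step(center, context, U, V, alpha):
    """Update v_center and every u_w in place for one (center, context) pair."""
    v_c = V[center]
    probs = softmax_probs(center, U, V)

    # Gradient w.r.t. v_c: -u_context + sum over w of P(w | center) * u_w.
    grad_v = [-x for x in U[context]]
    for w, p in probs.items():
        grad_v = [g + p * x for g, x in zip(grad_v, U[w])]

    # Gradient w.r.t. u_w: (P(w | center) - 1[w == context]) * v_c, using the old v_c.
    for w, p in probs.items():
        coeff = p - (1.0 if w == context else 0.0)
        U[w] = [u - alpha * coeff * x for u, x in zip(U[w], v_c)]

    V[center] = [v - alpha * g for v, g in zip(v_c, grad_v)]


# Hypothetical toy parameters (small made-up numbers).
V = {"cat": [0.1, -0.2], "cute": [0.0, 0.3], "car": [-0.1, 0.1]}
U = {"cat": [0.2, 0.1], "cute": [-0.1, 0.2], "car": [0.3, -0.3]}

before = loss_term("cat", "cute", U, V)
sgd_step("cat", "cute", U, V, alpha=0.1)
after = loss_term("cat", "cute", U, V)
print(before, after)
```

Note that the sum over the whole vocabulary makes every step touch all context vectors, which is why this plain version becomes expensive for large vocabularies.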

Next week we will implement Word2Vec in PyTorch.

If you find anything hard to understand, please look at the paper: https://arxiv.org/pdf/1411.2738.pdf

Resources:

* https://web.stanford.edu/class/cs224n/readings/cs224n_winter2023_lecture1_notes_draft.pdf