# Likelihood
For each position $t = 1, 2, \dots, T$, we predict the context words within a window of fixed size $m$, given the center word $w_{t}$.
$$
L( \theta ) =\prod ^{T}_{t=1}\prod _{-m\leqslant j\leqslant m;\ j\neq 0} P( w_{t+j} |w_{t} ;\theta )
$$

The symbols and their meaning:
1. $\theta$: The parameters to be learned. Here they are the vector embeddings of every word.
2. $t$: The index of a word position in the corpus. The corpus contains $T$ words.
3. $m$: The size of the context window.
4. $j$: The offset of a context word relative to the center word.

The likelihood measures how well our model does: how likely our probabilistic model is to predict the right context words given the center word.
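
The double product over $t$ and $j$ simply visits every (center, context) pair in the corpus. A minimal sketch of that enumeration is given here; the function name `context_pairs` and the choice to drop window positions that fall outside the corpus are assumptions, since the text does not say how the corpus boundaries are handled.

```python
def context_pairs(tokens, m):
    """Yield (center, context) pairs for a window of size m around each position."""
    T = len(tokens)
    for t in range(T):                  # t = 1..T in the text, 0-based here
        for j in range(-m, m + 1):      # -m <= j <= m
            if j == 0:
                continue                # j != 0: skip the center word itself
            if 0 <= t + j < T:          # assumption: ignore positions outside the corpus
                yield tokens[t], tokens[t + j]


pairs = list(context_pairs(["the", "quick", "brown", "fox"], m=1))
for center, context in pairs:
    print(center, "->", context)
```

Each pair contributes one factor $P( w_{t+j} |w_{t} ;\theta )$ to the likelihood.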

# Objective Function
The objective function is taken to be the average negative log likelihood.

Average: The average gives us the mean error, so the scale of the objective is independent of the size of the corpus.

Negative: Optimizers conventionally minimize, so we negate the log likelihood. Decreasing this error increases the likelihood.

Log: Products are transformed into sums.

$$
J( \theta ) =-\frac{1}{T}\log L( \theta )\\
\Longrightarrow J( \theta ) =-\frac{1}{T}\sum ^{T}_{t=1}\sum _{-m\leqslant j\leqslant m;\ j\neq 0}\log P( w_{t+j} |w_{t} ;\theta )
$$
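
As a rough illustration, the objective can be computed for any model that returns $P( w_{t+j} |w_{t})$. The sketch below uses a hypothetical `uniform` model that assigns the same probability to every vocabulary word; the names `average_nll` and `uniform`, and the handling of corpus boundaries, are assumptions made for the example.

```python
import math


def average_nll(tokens, m, prob):
    """Average negative log likelihood J for a model prob(context, center)."""
    T = len(tokens)
    log_likelihood = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip the center word and out-of-corpus positions
            # log turns the product of probabilities into a sum
            log_likelihood += math.log(prob(tokens[t + j], tokens[t]))
    return -log_likelihood / T  # negate and average over the corpus


tokens = ["the", "quick", "brown", "fox"]
vocab_size = len(set(tokens))


def uniform(context, center):
    # hypothetical baseline: every word is equally likely as context
    return 1.0 / vocab_size


J = average_nll(tokens, m=1, prob=uniform)
print(J)
```

A model that assigns higher probability to the observed context words yields a smaller $J$, which is what the optimizer is after.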

# Probability
So far, so good. The probability model and the objective function are intuitive enough. Word vectors would indeed make a lot of sense if they could be used to predict the context words. The only open question is the probability itself: how do we model the probability of a context word given a center word?

We keep two vector representations for each word in the vocabulary. For a word $w$, the two vectors are:
1. $v_w$ - when the word is the **center** word
2. $u_w$ - when the word is the **context** word

$$
\boxed{P( w_{o} |w_{c}) =\frac{\exp u_{o}^{T} v_{c}}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}}
$$

This looks like the **softmax function**, doesn't it? Let us break this equation down.

In the numerator we have the $\exp u_{o}^{T} v_{c}$ term. The exponent is the dot product between the center word vector and the context word vector, which measures how similar the two words are. Exponentiating the dot product has a nice effect: the result is always positive, and large dot products are amplified while small ones shrink toward zero. Hence the numerator tells us how close (similar) the center and context words are.

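In the denominator we have the normalization term $\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}$, which sums the same quantity over every word in the vocabulary. Dividing by it turns the similarity scores into a probability distribution over context words: each value lies between 0 and 1, and together they sum to 1.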

The formula directly gives the probability of a context word given the center word in question.
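
The softmax over dot products can be sketched in a few lines. The vectors below are made-up toy values, and the names `context_distribution`, `context_vecs` and `center_vecs` are hypothetical. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the resulting probabilities unchanged.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_distribution(c, context_vecs, center_vecs):
    """Return [P(w_i | w_c) for every word i] using the softmax of u_i . v_c."""
    scores = [dot(u_i, center_vecs[c]) for u_i in context_vecs]
    shift = max(scores)                       # stability trick (assumption, see lead-in)
    exps = [math.exp(s - shift) for s in scores]
    normalizer = sum(exps)                    # the denominator of the softmax
    return [e / normalizer for e in exps]


# toy 2-dimensional vectors for a 3-word vocabulary (made-up values)
context_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # u_w
center_vecs = [[2.0, 0.0], [0.0, 0.5], [0.0, 0.0]]     # v_w

p = context_distribution(0, context_vecs, center_vecs)
print(p, sum(p))
```

With word 0 as the center, the dot products are 2, 0 and -2, so word 0 receives most of the probability mass and word 2 the least.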

# Optimization
Note that $\theta$ denotes the parameters of the model: the vector representations of the words in the vocabulary. We consider a vocabulary of size $V$, and each vector is $d$-dimensional. Because we keep two vector representations for each word ($u$ and $v$), the parameter vector has size $2dV$.
$$
\theta =\begin{bmatrix}
u_{the}\\
u_{quick}\\
\vdots \\
u_{end}\\
v_{the}\\
v_{quick}\\
\vdots \\
v_{end}
\end{bmatrix} \in \Re ^{2dV}
$$
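
To make the layout concrete, here is a small sketch that stacks all context vectors followed by all center vectors into one flat list of length $2dV$. The toy vocabulary, the dimension `d = 4`, the uniform initialisation range and the helper names are all assumptions for illustration.

```python
import random

vocab = ["the", "quick", "end"]    # toy vocabulary (hypothetical)
d = 4                              # embedding dimension (hypothetical)
rng = random.Random(0)             # fixed seed so the sketch is reproducible


def random_vector(dim):
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


context_vecs = [random_vector(d) for _ in vocab]   # u_w for every word
center_vecs = [random_vector(d) for _ in vocab]    # v_w for every word

# theta = [u_the, u_quick, ..., u_end, v_the, v_quick, ..., v_end] flattened
theta = [x for u in context_vecs for x in u] + [x for v in center_vecs for x in v]


def context_vector(theta, i, d):
    """Slice u_i back out of the flat parameter vector."""
    return theta[i * d:(i + 1) * d]


def center_vector(theta, i, d, vocab_size):
    """Slice v_i back out; center vectors start after all V context vectors."""
    start = (vocab_size + i) * d
    return theta[start:start + d]


print(len(theta))  # 2 * d * len(vocab)
```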

With this vector in place, we would like to optimize every one of these vectors so that we descend the **loss landscape** and model the meaning of each word. To do so, we have to compute $\nabla _{\theta } J( \theta )$, that is, the partial derivative of the objective with respect to every entry of the vector $\theta$.

# Loss landscape

Here I will try to derive the gradients of the loss with respect to the parameters of the model. The parameters include $u_{w}$ and $v_{w}$ for each word $w$ in the vocabulary. We initialise the model with random $u_{w}$ and $v_{w}$ and then update their values according to the gradient of the objective function $J(\theta)$.
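
One common way to "change the values in accordance with the gradient" is plain gradient descent: move every parameter a small step against its gradient. The text does not name a specific optimizer, so the update rule below and the `learning_rate` parameter are assumptions, and the numbers are made up.

```python
def gradient_step(theta, grad, learning_rate):
    """Return theta - learning_rate * grad, element by element."""
    return [p - learning_rate * g for p, g in zip(theta, grad)]


theta = [1.0, -2.0]      # two toy parameters
grad = [0.5, -1.0]       # their (made-up) gradients of J
theta = gradient_step(theta, grad, learning_rate=0.1)
print(theta)
```

A parameter with a positive gradient is decreased and one with a negative gradient is increased, which is the direction that lowers $J(\theta)$ for a small enough step.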

# $\frac{\partial J(\theta)}{\partial v_{c}}$

In this section we will look into the derivation of the gradient of the objective function with respect to the vector representation of the center word.
$$
\frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =\frac{\partial }{\partial v_{c}}\log\frac{\exp u_{o}^{T} v_{c}}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =\frac{\partial }{\partial v_{c}}\log\exp u_{o}^{T} v_{c} -\frac{\partial }{\partial v_{c}}\log\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =\frac{\partial }{\partial v_{c}} u_{o}^{T} v_{c} -\frac{\partial }{\partial \sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}\log\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c} \cdot \frac{\partial }{\partial v_{c}}\sum ^{V}_{x=1}\exp u_{x}^{T} v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =u_{o} -\frac{1}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}} \cdot \sum ^{V}_{x=1}\left(\exp u_{x}^{T} v_{c}\right) u_{x}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =u_{o} -\frac{\sum ^{V}_{x=1}\left(\exp u_{x}^{T} v_{c}\right) u_{x}}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =u_{o} -\sum ^{V}_{x=1}\frac{\exp u_{x}^{T} v_{c}}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}} u_{x}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial v_{c}} =u_{o} -\sum ^{V}_{x=1} P( u_{x} |v_{c}) u_{x}\\
\boxed{\frac{\partial J( \theta )}{\partial v_{c}} =-\frac{1}{T}\sum _{t:\ w_{t} =c}\ \sum _{-m\leqslant j\leqslant m;\ j\neq 0}\left[ u_{t+j} -\sum ^{V}_{x=1} P( u_{x} |v_{c}) u_{x}\right]}
$$

Here $u_{t+j}$ is the context vector of the word at position $t+j$. The outer sum runs only over the positions $t$ whose center word is $c$, since the terms at all other positions do not depend on $v_{c}$.

The last equation gives an intuitive picture of the gradient (slope): we subtract the expected context word vector ($\sum ^{V}_{x=1} P( u_{x} |v_{c}) u_{x}$) from the observed context vector ($u_o$).
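
A quick way to gain confidence in the "observed minus expected" result is to compare it against a numerical derivative. The sketch below does this for a single (center, context) pair with made-up toy vectors; the function names and the central-difference check are assumptions added for illustration, not part of the derivation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probs(context_vecs, v_c):
    """Softmax P(u_x | v_c) over all context vectors."""
    exps = [math.exp(dot(u, v_c)) for u in context_vecs]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(o, context_vecs, v_c):
    return math.log(probs(context_vecs, v_c)[o])


def grad_v(o, context_vecs, v_c):
    """Analytic gradient of log P(u_o | v_c) w.r.t. v_c: observed minus expected."""
    p = probs(context_vecs, v_c)
    expected = [sum(p[x] * context_vecs[x][k] for x in range(len(context_vecs)))
                for k in range(len(v_c))]
    return [context_vecs[o][k] - expected[k] for k in range(len(v_c))]


context_vecs = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]  # toy u vectors
v_c = [0.3, -0.2]                                       # toy center vector
o = 1                                                   # index of the observed context word

analytic = grad_v(o, context_vecs, v_c)

# central finite differences, one coordinate of v_c at a time
eps = 1e-6
numeric = []
for k in range(len(v_c)):
    plus, minus = list(v_c), list(v_c)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((log_prob(o, context_vecs, plus) - log_prob(o, context_vecs, minus)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, numeric, max_err)
```

The two gradients should agree up to the small error of the finite-difference approximation.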
# $\frac{\partial J(\theta)}{\partial u_{o}}$

In this section we will look into the derivation of the gradient of the objective function with respect to the vector representation of the context word.
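$$
\frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =\frac{\partial }{\partial u_{o}}\log\frac{\exp u_{o}^{T} v_{c}}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =\frac{\partial }{\partial u_{o}}\log\exp u_{o}^{T} v_{c} -\frac{\partial }{\partial u_{o}}\log\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =\frac{\partial }{\partial u_{o}} u_{o}^{T} v_{c} -\frac{\partial }{\partial \sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}}\log\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c} \cdot \frac{\partial }{\partial u_{o}}\sum ^{V}_{x=1}\exp u_{x}^{T} v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =v_{c} -\frac{1}{\sum ^{V}_{i=1}\exp u_{i}^{T} v_{c}} \cdot \left(\exp u_{o}^{T} v_{c}\right) v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =v_{c} -P( u_{o} |v_{c}) v_{c}\\
\Longrightarrow \frac{\partial \log P( u_{o} |v_{c})}{\partial u_{o}} =\left[ 1-P( u_{o} |v_{c})\right] v_{c}
$$

Only the $x=o$ term of the sum over $x$ depends on $u_{o}$, which is why a single term survives the differentiation. For any other context vector $u_{x}$ with $x\neq o$, the first term vanishes and the same steps give $-P( u_{x} |v_{c}) v_{c}$. Combining both cases and summing over the corpus:

$$
\boxed{\frac{\partial J( \theta )}{\partial u_{x}} =-\frac{1}{T}\sum ^{T}_{t=1}\sum _{-m\leqslant j\leqslant m;\ j\neq 0}\left[\mathbf{1}[ w_{t+j} =x] -P( u_{x} |v_{t})\right] v_{t}}
$$

Here $\mathbf{1}[ w_{t+j} =x]$ is 1 when the word at position $t+j$ is $x$ and 0 otherwise, and $v_{t}$ is the center vector of the word at position $t$.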
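
For a single (center, context) pair, differentiating $\log P( u_{o} |v_{c})$ with respect to any context vector $u_{x}$ gives $\left(\mathbf{1}[ x=o] -P( u_{x} |v_{c})\right) v_{c}$. The sketch below checks this against a numerical derivative, using made-up toy vectors; the function names and the central-difference check are assumptions added for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probs(context_vecs, v_c):
    """Softmax P(u_x | v_c) over all context vectors."""
    exps = [math.exp(dot(u, v_c)) for u in context_vecs]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(o, context_vecs, v_c):
    return math.log(probs(context_vecs, v_c)[o])


def grad_u(o, context_vecs, v_c):
    """Analytic gradient of log P(u_o | v_c) w.r.t. every context vector u_x."""
    p = probs(context_vecs, v_c)
    return [[((1.0 if x == o else 0.0) - p[x]) * v_k for v_k in v_c]
            for x in range(len(context_vecs))]


context_vecs = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]  # toy u vectors
v_c = [0.3, -0.2]                                       # toy center vector
o = 1                                                   # index of the observed context word

analytic = grad_u(o, context_vecs, v_c)

# central finite differences, one coordinate of one u_x at a time
eps = 1e-6
max_err = 0.0
for x in range(len(context_vecs)):
    for k in range(len(v_c)):
        plus = [row[:] for row in context_vecs]
        minus = [row[:] for row in context_vecs]
        plus[x][k] += eps
        minus[x][k] -= eps
        numeric = (log_prob(o, plus, v_c) - log_prob(o, minus, v_c)) / (2 * eps)
        max_err = max(max_err, abs(analytic[x][k] - numeric))

print(analytic, max_err)
```

The observed context vector is pulled toward $v_{c}$, while every other context vector is pushed away in proportion to the probability the model currently assigns to it.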