<img src="https://drive.google.com/uc?export=view&id=1I0h5HOcIt-Ga-4Y1ZfSlM09s2e6fQagn">
Trick: 
1. copy the web address: < img src="https://drive.google.com/uc?export=view&id=XXX">
2. go to the google drive and find the file you want to insert ([details](https://stackoverflow.com/questions/15557392/how-do-i-display-images-from-google-drive-on-a-website))
3. replace XXX in step 1 with the ID of the file

## Chapter 2 Word Vectors 2 and Word Senses

###  Review
- How do we realize the word2vec model? **Gradient Descent**

Gradient descent is an algorithm for minimizing $J(\theta)$. The idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

The update equation in matrix notation is:
\begin{equation}
\theta^{new}=\theta^{old}-\alpha \nabla_{\theta}J(\theta)
\end{equation}

where $\alpha$ is the step size or the learning rate. 

The update equation for single parameter is:

\begin{equation}
\theta_{j}^{new}=\theta_{j}^{old}-\alpha \frac{\partial}{\partial \theta_{j}^{old}}J(\theta)
\end{equation}

The algorithm is:

In [0]:
while True:
  theta_grad = evaluate_gradient(J, corpus, theta)
  theta = theta - alpha * theta_grad
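
As a minimal runnable sketch of the loop above, assume a toy objective $J(\theta) = (\theta - 3)^2$ (not from the lecture) so that the gradient has a closed form. The function names and the fixed number of steps are likewise illustrative choices.

```python
def evaluate_gradient(theta):
    # Gradient of the assumed toy objective J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, alpha=0.1, num_steps=100):
    # Repeat: take a small step in the direction of the negative gradient.
    for _ in range(num_steps):
        theta_grad = evaluate_gradient(theta)
        theta = theta - alpha * theta_grad
    return theta


theta_min = gradient_descent(theta=0.0)
print(theta_min)  # close to 3, the minimizer of the toy objective
```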

### 1. Stochastic Gradient Descent

The obvious <u>problem</u> associated with regular gradient descent is that $J(\theta)$ is a function of all windows in the corpus, and the number of windows could be in the billions, depending on the size of the corpus. In that case $\nabla_{\theta}J(\theta)$ is very expensive to compute: we would have to process every one of those windows before making even a single update to $\theta$.

The basic  <u> solution</u> to that is: **<font size=3, color='color'>Stochastic Gradient Descent</font>**

Instead of computing the gradient over the entire set of windows, we select a single window or a small sample (minibatch) of windows. We repeatedly sample windows and update the parameters after each one. (Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples. Using a few examples rather than a single one has two benefits: first, it reduces the variance in the parameter update and can lead to more stable convergence; second, it allows the computation to take advantage of highly optimized matrix operations in a well vectorized computation of the cost and gradient. [More on this](http://ruder.io/optimizing-gradient-descent/))

The code for this is:



In [0]:
for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad

That is, in each epoch we first shuffle the data set and then step through it one example at a time, computing the gradient on that example alone and updating the parameters immediately. The next epoch shuffles again and repeats. Drawing random samples in this way is loosely similar to the idea of the bootstrap, and over many updates $\theta$ tends to converge towards a specific value ([more on this](http://deeplearning.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/)).
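
To see the shuffle-then-update loop run end to end, here is a small sketch on an assumed toy problem (not from the lecture): estimating the mean of a few numbers by minimizing the per-example loss $\frac{1}{2}(\theta - x_i)^2$. The function name `sgd_mean` and the hyperparameter values are illustrative.

```python
import random


def sgd_mean(data, learning_rate=0.01, nb_epochs=200, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    data = list(data)           # copy, so the caller's list is not shuffled
    theta = 0.0
    for _ in range(nb_epochs):
        rng.shuffle(data)
        for example in data:
            # Gradient of the per-example loss 0.5 * (theta - example)^2.
            grad = theta - example
            theta = theta - learning_rate * grad
    return theta


theta_sgd = sgd_mean([1.0, 2.0, 3.0, 4.0, 5.0])
print(theta_sgd)  # hovers near the data mean, 3.0
```

With a constant learning rate the estimate keeps fluctuating slightly around the minimizer instead of settling on it exactly, which is the usual price of the noisy single-example gradient.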

Now suppose we take the gradient at each sampled window for Stochastic Gradient Descent. For example, one window could be [I like <font color='color'>deep</font> learning and NLP] with "deep" as the *center* word. Each window contains only a small number of words, 6 in this case (at most $2m+1$ in general). If we calculate $\nabla_{\theta}J(\theta)$ on this window, it is very sparse: only the vectors of the words that actually appear in the window receive a non-zero gradient, and the entries for every other word in the vocabulary are zero:

\begin{equation}
\nabla_{\theta}J(\theta)=\begin{bmatrix}
0\\ \vdots \\ \nabla_{v_{like}} \\ \vdots \\0\\\nabla_{v_{I}}  \\ \vdots \\\nabla_{v_{learning}}  \\ \vdots
\end{bmatrix} \in \mathbb{R}^{2dV}
\end{equation}

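In the vector $\nabla_{\theta}J(\theta)$ almost all entries are 0. If we build and apply the full $2dV$-dimensional update at every step, we waste memory and computation on entries that do not change (and, in distributed training, we send huge updates around). We would rather update only the word vectors that actually appear in the window.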

There are two solutions to the problem.
- Use sparse matrix update operations to only update certain rows of full embedding matrices $U$ and $V$.
- Keep around a hash for word vectors, where the values are the vectors and the keys are the word strings. 
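
The second option can be sketched with a plain Python dictionary. This is an illustration under assumptions: the names `word_vectors`, `sparse_grads`, and `sparse_sgd_update`, and all numbers, are made up for the example. The point is that only the words present in the window are touched.

```python
def sparse_sgd_update(word_vectors, sparse_grads, alpha):
    """Update only the vectors of words that received a gradient."""
    for word, grad in sparse_grads.items():
        vec = word_vectors[word]
        word_vectors[word] = [v - alpha * g for v, g in zip(vec, grad)]


# Keys are word strings, values are the word vectors (here 2-dimensional).
word_vectors = {
    "I": [0.1, 0.2],
    "like": [0.3, -0.1],
    "deep": [0.0, 0.5],
    "banana": [0.7, 0.7],   # not in the current window
}

# Gradients exist only for words that appear in the current window.
sparse_grads = {"like": [1.0, 0.0], "deep": [0.0, -2.0]}

sparse_sgd_update(word_vectors, sparse_grads, alpha=0.1)
print(word_vectors["banana"])  # unchanged: [0.7, 0.7]
```

The cost of one update is proportional to the number of words in the window, not to the vocabulary size.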

More practice can be found in Assignment 1. 


### 2. Word2vec: Models and Training Methods

Word2vec has two model variants that differ in what is predicted: <font color='color'>Skip-grams (SG)</font> and <font color='green'>Continuous Bag of Words (CBOW)</font>.

#### 2.1 Skip-grams (SG)

SG predicts the context (“outside”) words (position independent) given the center word, while Continuous Bag of Words (CBOW) does the opposite: it predicts the center word from a bag of context words.

Let’s look at the Skip-Gram model in detail. We represent the one hot input vector of the center word with $x$ (since there is only one), and the one hot output vectors of the context words with $y^{(j)}$.

We create two matrices, $V \in \mathbb{R}^{n \times |V|}$ and $U \in \mathbb{R}^{|V| \times n}$, where $n$ is an arbitrary size which defines the size of our embedding space. $V$ is the input word matrix such that the $i$-th column of $V$ is the $n$-dimensional embedded vector for word $w_{i}$ when it is an input to this model. We denote this $n \times 1$ vector as $v_{i}$. Similarly, $U$ is the output word matrix. The $j$-th row of $U$ is an $n$-dimensional embedded vector for word $w_{j}$ when it is an output of the model. We denote this row of $U$ as $u_{j}$. Note that we do in fact learn two vectors for every word $w_{i}$ (i.e. input word vector $v_{i}$ and output word vector $u_{i}$). $m$ is the context (window) size, so each window contains $2m + 1$ words.

We break down the way this model works in these steps:

1. We generate our one hot input vector $x \in \mathbb{R}^{|V|}$ of the center word.
2. We get our embedded word vector for the center word $v_{c} = Vx \in \mathbb{R}^{n}$.
3. Generate a score vector $z = Uv_{c}$.
4. Turn the score vector into probabilities, $\hat{y} = softmax(z)$. Note that $\hat{y}_{c-m}, ..., \hat{y}_{c-1}, \hat{y}_{c+1}, ..., \hat{y}_{c+m}$ are the probabilities of observing each context word.
5. We desire our probability vector generated to match the true probabilities, which are $y_{c-m}, ..., y_{c-1}, y_{c+1}, ..., y_{c+m}$, the one hot vectors of the actual output.
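
The steps above can be sketched in plain Python. The matrices below are tiny made-up values ($n = 2$, $|V| = 3$), and the helper names `matvec`, `softmax`, and `skipgram_forward` are illustrative, not from the lecture.

```python
import math


def matvec(M, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def softmax(z):
    # Subtract the max score for numerical stability.
    top = max(z)
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(V, U, center_index):
    vocab_size = len(U)
    # Step 1: one hot vector of the center word.
    x = [1.0 if i == center_index else 0.0 for i in range(vocab_size)]
    # Step 2: embedded center word, v_c = V x (V is n x |V|).
    v_c = matvec(V, x)
    # Step 3: score vector, z = U v_c (U is |V| x n).
    z = matvec(U, v_c)
    # Step 4: probabilities over the vocabulary.
    return softmax(z)


V = [[0.1, 0.4, -0.2],
     [0.3, -0.1, 0.5]]
U = [[0.2, 0.1],
     [-0.3, 0.4],
     [0.0, 0.6]]

y_hat = skipgram_forward(V, U, center_index=1)
print(y_hat)
```

The same $\hat{y}$ is compared against every context position, which is why the model is position independent.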

We need an objective function to evaluate the model. A key step is that we invoke a Naive Bayes assumption to break out the probabilities. It is a strong (naive) conditional independence assumption: given the center word, all output words are completely independent.


\begin{equation}
\begin{aligned}
minimize\, J & = -\, logP(w_{c−m}, ..., w_{c−1}, w_{c+1}, . . . , w_{c+m}|w_{c})\\
& = -\,log \prod^{2m} _{j=0,\, j\neq m}P(w_{c−m+j}|w_{c})\\
& =-\,log \prod^{2m} _{j=0,\, j\neq m}P(u_{c−m+j}|v_{c})\\
& =-\,log \prod^{2m} _{j=0,\, j\neq m}\frac{exp(u^{T}_{c−m+j}v_{c})}{\sum_{k=1}^{|V|}exp(u^{T}_{k}v_{c})}\\
& =- \sum^{2m} _{j=0,\, j\neq m}u^{T}_{c−m+j}v_{c} + 2m\, log \sum_{k=1}^{|V|}exp(u^{T}_{k}v_{c})
\end{aligned}
\end{equation}

With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via Stochastic Gradient Descent.

Note that 

\begin{equation}
\begin{aligned}
 J& =-\sum^{2m} _{j=0,\, j\neq m}log\,P(u_{c-m+j}|v_{c})\\
 & = \sum^{2m} _{j=0,\, j\neq m}H(\hat{y}, y_{c-m+j})
\end{aligned}
\end{equation}

where $H(\hat{y}, y_{c−m+j})$ is the cross-entropy between the probability
vector $\hat{y}$ and the one-hot vector $y_{c−m+j}$.

___
**Example** 

Suppose we have a single sentence, "I  really like you Daniel", so the vocabulary has 5 words. Stacking the one hot vectors of these 5 words gives a $5 \times 5$ input matrix:
\begin{equation}
V=
\begin{bmatrix}
1 & 0&  0& 0&  0\\
0 & 1&  0& 0&  0\\
0& 0&  1& 0&  0\\
0 & 0&  0& 1&  0\\
0 & 0&  0& 0&  1
\end{bmatrix}
\end{equation}

where  $I=\begin{bmatrix} 1 & 0&  0& 0&  0\\ \end{bmatrix}$,  $really=\begin{bmatrix} 0 & 1&  0& 0&  0\\ \end{bmatrix}$,  $like=\begin{bmatrix} 0 & 0&  1& 0&  0\\ \end{bmatrix}$...

However, this one hot form of $V$ is only for illustration; in practice $V$ holds learned dense vectors and does not have to be represented in this way.

Then we generate the one hot input vector $x=\begin{bmatrix} 0 \\ 0\\  1\\ 0\\  0\\ \end{bmatrix}$ of the center word we want, say "<font color='color'>like</font>", and extract the vector of this center word as $Vx=v_{c}=\begin{bmatrix} 0 \\ 0\\  1\\ 0\\  0\\ \end{bmatrix}$ (the input $x$ must be a one hot vector).

Consider a window of size 2; then the context word matrix is
\begin{equation}
U=
\begin{bmatrix}
0 & 1&  1& 0&  0\\
1 & 0&  1& 1&  0\\
1& 1&  0& 1&  1\\
0 & 1&  1& 0&  1\\
0 & 0&  1& 1&  0
\end{bmatrix}
\end{equation}
Again, this binary form of $U$ (row $i$ marks the words that fall inside the window of word $i$) is only for illustration; in practice it does not have to be represented in this way.

Next we generate the score vector 

\begin{equation}
U\cdot v_{c}=
\begin{bmatrix}
0 & 1&  1& 0&  0\\
1 & 0&  1& 1&  0\\
1& 1&  0& 1&  1\\
0 & 1&  1& 0&  1\\
0 & 0&  1& 1&  0
\end{bmatrix}
\cdot \begin{bmatrix} 0 \\ 0\\  1\\ 0\\  0\\ \end{bmatrix}
= \begin{bmatrix} 1 \\ 1\\  0\\ 1\\  1\\ \end{bmatrix}
\end{equation}
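
The score vector for this example can be checked with a few lines of Python. The helper name `matvec` is illustrative; the matrix and the one hot vector are the ones from the example.

```python
def matvec(M, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


words = ["I", "really", "like", "you", "Daniel"]

# Context word matrix for a window of size 2.
U = [[0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 0, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 0, 1, 1, 0]]

x = [0, 0, 1, 0, 0]   # one hot vector for "like"
v_c = x               # V is the identity here, so Vx = x

z = matvec(U, v_c)
print(dict(zip(words, z)))
# {'I': 1, 'really': 1, 'like': 0, 'you': 1, 'Daniel': 1}
```

Every word inside the window of "like" gets score 1 and "like" itself gets 0, so after the softmax the four context words share most of the probability mass.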

A more general representation of the steps:

<img src="https://drive.google.com/uc?export=view&id=1L6YNcMz5yAJ92PZmCJQCj9IwygIkTNc6">

___

#### 2.2 Continuous Bag of Words Model (CBOW)

The way this model works is very similar to that of Skip-Gram. We break down the way this model works in these steps:

1. We generate our one hot word vectors for the input context of size  $m \, :\, (x^{(c-m)}, ..., x^{(c-1)}, x^{(c+1)}, ..., x^{(c+m)}\in \mathbb{R}^{|V|})$

2. We get our embedded word vectors for the context ($v_{c-m} = Vx^{(c-m)}, v_{c-m+1} = Vx^{(c-m+1)}, ..., v_{c+m} = Vx^{(c+m)}\in \mathbb{R}^{n}$)

3. Average these vectors to get $\hat{v}=\frac{v_{c-m}+v_{c-m+1}+...+v_{c+m}}{2m} \in \mathbb{R}^{n}$

4. Generate a score vector $z = U\hat{v}\in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.

5. Turn the score vector into probabilities, $\hat{y} = softmax(z) \in \mathbb{R}^{|V|}$. 

6. We desire our probability vector generated, $\hat{y} \in \mathbb{R}^{|V|}$, to match the true probabilities, $y\in \mathbb{R}^{|V|}$, which also happens to be the one hot vector of the actual word.
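
A minimal sketch of the CBOW steps, again with tiny made-up matrices ($n = 2$, $|V| = 3$) and illustrative helper names. The only structural difference from Skip-Gram is the averaging of the context vectors.

```python
import math


def matvec(M, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def softmax(z):
    top = max(z)
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(V, U, context_indices):
    n = len(V)
    # Steps 1-2: multiplying V by a one hot vector selects a column of V,
    # so we read the column of each context word directly.
    context_vectors = [[V[r][i] for r in range(n)] for i in context_indices]
    # Step 3: average the context vectors.
    v_hat = [sum(vec[r] for vec in context_vectors) / len(context_vectors)
             for r in range(n)]
    # Steps 4-5: scores and probabilities over the vocabulary.
    return softmax(matvec(U, v_hat))


V = [[0.1, 0.4, -0.2],
     [0.3, -0.1, 0.5]]
U = [[0.2, 0.1],
     [-0.3, 0.4],
     [0.0, 0.6]]

y_hat = cbow_forward(V, U, context_indices=[0, 2])
print(y_hat)
```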

Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of the distance between two distributions. Here, we use a popular choice of distance/loss
measure, cross entropy $H(\hat{y}, y)$:

\begin{equation}
\begin{aligned}
 H(\hat{y}, y)& = \sum^{|V|} _{j=1}-\,y_{j} log(\hat{y}_{j})
\end{aligned}
\end{equation}

___
**Aside:** This loss comes from information theory (entropy). Since $y$ is a one hot vector, only the term for the correct word $c$ survives, so $H(\hat{y}, y) = -log(\hat{y}_{c})$. Consider the case where our prediction was perfect and thus $\hat{y}_{c} = 1$. We can then calculate $H(\hat{y}, y) = -1\, log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss. Now consider the opposite case where our prediction was very bad and thus $\hat{y}_{c} = 0.01$. As before, we can calculate our loss to be $H(\hat{y}, y) = -1\, log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance.
___
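
The two cases in the aside can be reproduced numerically. The function name `cross_entropy` and the 3-word vocabulary are assumptions made for the example.

```python
import math


def cross_entropy(y_hat, y):
    # H(y_hat, y) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 are skipped.
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)


y = [0.0, 1.0, 0.0]                 # one hot vector of the true word

loss_perfect = cross_entropy([0.0, 1.0, 0.0], y)
loss_bad = cross_entropy([0.495, 0.01, 0.495], y)

print(loss_perfect)        # zero loss for a perfect prediction
print(round(loss_bad, 3))  # 4.605
```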

#### 2.3 Negative Sampling
Let's go back to the probability function for a center word *c* and a context word *o*:

\begin{equation}
P(o\,|\,c)=\frac{exp(u_{o}^{T}v_{c})}{\sum_{w \in V}exp(u_{w}^{T}v_{c})}
\end{equation}

Computing the numerator is not a big problem. The issue lies in the denominator. Imagine that there are 20,000 words in your vocabulary: then for each window you have to compute 20,000 inner products, exponentiate them, and sum them up. It also turns out that this doesn't teach the model much, because most words in the vocabulary don't co-occur with the center word. For example, "I really like you Daniel" has "like" as center word, which usually won't co-occur with "heart attack", so most of this work is wasted.

The main idea of *negative sampling* is to break $P(o\,|\,c)$ into 2 parts:
-  <font color='color'>We keep the numerator as the true pair (center word and a word in its context window) and we try to maximize the inner product between these words</font>.
- <font color='green'>At the same time, instead of going through the entire vocabulary, we keep some noise pairs by randomly selecting some words from the rest of the corpus that don't co-occur with the center word. We try to minimize the inner products of these noise pairs</font>.

Mikolov et al. present Negative Sampling in *Distributed Representations of Words and Phrases and their Compositionality*. While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let’s denote by $P(D = 1|w, c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D = 0|w, c)$ will be the probability that $(w, c)$ did not come from the corpus data. First, let’s model $P(D = 1|w, c)$ with the sigmoid function:
\begin{equation}
P(D = 1|w, c, \theta)=\sigma(v^{T}_{c}v_{w})=\frac{1}{1+e^{(-v^{T}_{c}v_{w})}}
\end{equation}

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach to these two probabilities.
\begin{equation}
\begin{aligned}
 \theta&=argmax_{\theta}\prod _{(w,c)\in D}P(D = 1|w, c, \theta)\prod _{(w,c)\in \tilde{D}}P(D = 0|w, c, \theta)\\ 
&=argmax_{\theta}\prod _{(w,c)\in D}P(D = 1|w, c, \theta)\prod _{(w,c)\in \tilde{D}}(1-P(D = 1|w, c, \theta)) \\
&=argmax_{\theta}\sum _{(w,c)\in D}logP(D = 1|w, c, \theta)+\sum _{(w,c)\in \tilde{D}}log(1-P(D = 1|w, c, \theta))\\
&=argmax_{\theta}\sum _{(w,c)\in D}log\frac{1}{1+e^{(-v^{T}_{c}v_{w})}}+\sum _{(w,c)\in \tilde{D}}log(1-\frac{1}{1+e^{(-v^{T}_{c}v_{w})}})\\
&=argmax_{\theta}\underbrace{\sum _{(w,c)\in D}log\frac{1}{1+e^{(-v^{T}_{c}v_{w})}}+\sum _{(w,c)\in \tilde{D}}log\frac{1}{1+e^{(v^{T}_{c}v_{w})}}}_{J(\theta)}
\end{aligned}
\end{equation}

For a single center word $c$ and true context word $o$, with negative words $j$ sampled from $P(w)$, this takes the form:

\begin{equation}
J(\theta)=log\,\sigma(u^{T}_{o}v_{c})+\sum _{j \sim P(w)}log\, \sigma(-u^{T}_{j}v_{c})
\end{equation}

Now look at $J(\theta)$. If a word is more likely to co-occur with the center word, the inner product $v^{T}_{c}v_{w}$ is larger, so $e^{(-v^{T}_{c}v_{w})}$ is smaller and $log\,\sigma(v^{T}_{c}v_{w})$ is greater. If a word is less likely to co-occur with the center word, the inner product is smaller (more negative), so $e^{(v^{T}_{c}v_{w})}$ is smaller and $log\, \sigma(-v^{T}_{c}v_{w})$ is greater. Maximizing $J(\theta)$ therefore pulls true pairs together and pushes noise pairs apart.

Note that $\tilde{D}$ is a "false" or "negative" corpus, and that maximizing the likelihood is the same as minimizing the negative log likelihood:

$$J(\theta)=-\sum _{(w,c)\in D}log\frac{1}{1+e^{(-v^{T}_{c}v_{w})}}-\sum _{(w,c)\in \tilde{D}}log\frac{1}{1+e^{(v^{T}_{c}v_{w})}}$$

For skip-gram, our new objective function for observing the context word $c - m + j$ given the center word $c$ would be
$-log\,\sigma(u^{T}_{c-m+j}v_{c})-\sum _{k=1}^{K}log\, \sigma(-\tilde{u}^{T}_{k}v_{c})$

For CBOW, our new objective function for observing the center word $u_{c}$ given the context vector $\hat{v}=\frac{v_{c-m}+v_{c-m+1}+...+v_{c+m}}{2m}$ would be $-log\,\sigma(u^{T}_{c}\hat{v})-\sum _{k=1}^{K}log\, \sigma(-\tilde{u}^{T}_{k}\hat{v})$

In both cases $\tilde{u}_{k}$, $k = 1, ..., K$, are the output vectors of the $K$ sampled negative words.
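
A small sketch of the skip-gram negative sampling loss for one (center, context) pair and $K$ negative samples. The function names and the example vectors are made up for illustration; for CBOW the same function applies with $\hat{v}$ passed in place of $v_{c}$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def neg_sampling_loss(v_c, u_o, negative_us):
    # -log sigma(u_o . v_c): reward a large inner product for the true pair.
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    # -sum_k log sigma(-u_k . v_c): reward small inner products for noise pairs.
    for u_k in negative_us:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


v_c = [1.0, 0.0]

# True pair aligned with v_c, negative sample pointing away: low loss.
loss_good = neg_sampling_loss(v_c, u_o=[2.0, 0.0], negative_us=[[-2.0, 0.0]])
# The reverse situation: high loss.
loss_bad = neg_sampling_loss(v_c, u_o=[-2.0, 0.0], negative_us=[[2.0, 0.0]])

print(loss_good, loss_bad)
```

Each evaluation costs $K + 1$ inner products instead of one per vocabulary word, which is the whole point of replacing the softmax denominator.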

The "irrelevant" (negative) words are sampled from the unigram distribution raised to the 3/4 power, $P(w)=U(w)^{\frac{3}{4}}/Z$, where $U(w)$ is the unigram distribution and $Z$ normalizes the result. If we sampled from the raw unigram distribution, we would mostly draw very frequent words like "the" and "a" and would almost never draw the rarer irrelevant words. So **the power makes the less frequent words be sampled** more often. 
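
The effect of the 3/4 power can be seen on a toy vocabulary. The word counts below are invented for the example, and `negative_sampling_distribution` is an illustrative name.

```python
import random


def negative_sampling_distribution(counts, power=0.75):
    # P(w) = U(w)^power / Z; using raw counts for U(w) gives the same result
    # because the normalizer Z absorbs the constant factor.
    weights = {w: c ** power for w, c in counts.items()}
    Z = sum(weights.values())
    return {w: wt / Z for w, wt in weights.items()}


counts = {"the": 1000, "like": 100, "aardvark": 1}   # assumed toy counts

unigram = negative_sampling_distribution(counts, power=1.0)
smoothed = negative_sampling_distribution(counts, power=0.75)

for w in counts:
    print(w, round(unigram[w], 4), round(smoothed[w], 4))

# Draw K = 5 negative samples from the smoothed distribution.
rng = random.Random(0)
negatives = rng.choices(list(smoothed), weights=list(smoothed.values()), k=5)
print(negatives)
```

The rare word's probability rises and the frequent word's probability falls relative to the raw unigram distribution, while the ranking of the words stays the same.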