# Word2Vec Example

**(C) 2018 by [Damir Cavar](http://damir.cavar.me/)**

**Version:** 1.0, January 2018

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Fall 2018 at Indiana University.

## Introduction

Here we briefly discuss the methods needed to understand the Word2Vec algorithm. We will use Numpy for the basic computations.

In [13]:
import numpy as np

### Using One-Hot Vectors

We can create a one-hot vector that selects the 3rd row:

In [14]:
x = np.array([0, 0, 1, 0])
x

array([0, 0, 1, 0])

Let us create a matrix $A$ of four rows:

In [15]:
A = np.array([[1,   2,  3,  4],
              [5,   6,  7,  8],
              [9,  10, 11, 12],
              [13, 14, 15, 16]])
A

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

We can use the one-hot vector $x$ to select a row of matrix $A$. Multiplying $A$ by $x$ from the left, that is computing $xA$, returns the third row:

In [16]:
x.dot(A)

array([ 9, 10, 11, 12])
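The selection above can also be checked against plain indexing. The following is a minimal sketch, assuming a hypothetical helper `one_hot` that is not part of the original notebook; it builds the one-hot vector for a given position and confirms that the product with $A$ returns the same row as `A[2]`:

```python
import numpy as np

A = np.array([[1,   2,  3,  4],
              [5,   6,  7,  8],
              [9,  10, 11, 12],
              [13, 14, 15, 16]])

def one_hot(index, size):
    """Hypothetical helper: vector of length `size` with a 1 at `index`."""
    v = np.zeros(size, dtype=int)
    v[index] = 1
    return v

x = one_hot(2, 4)        # same vector as in the cell above
selected = x.dot(A)      # row selection by multiplication

# The product with a one-hot vector matches direct indexing of the row.
assert np.array_equal(selected, A[2])
```

In practice, indexing is the cheaper operation. The one-hot product is useful mainly because it expresses the lookup as a matrix multiplication.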

### Computing Softmax

Assume that we have scores for an observation belonging to one of the mutually exclusive classes $C = [ c_1, c_2, c_3 ]$, stored as the elements of the vector $y$ below:

In [21]:
y = np.array([4.0, 2.5, 1.1])

To convert the scores into a probability distribution over the classes, we can compute the Softmax of the vector:

$$
p(c_n) = \frac{\exp(X_n / \theta)}{\sum_{i=1}^N{\exp(X_i / \theta)}}
$$

Here $X_n$ is the score for class $c_n$ and $N$ is the number of classes. Exponentiation makes larger values disproportionately larger. The parameter $\theta$ scales this effect: dividing the scores by a $\theta$ greater than $1$ flattens the distribution and increases the probabilities assigned to lower values. With $\theta = 1$ we get the standard Softmax. The *softmax* implementation below takes $\theta$ as a parameter.

In Python we can write this using Numpy's *exp* and *sum* functions. The *axis* parameter specifies that the sum is computed along the first axis, which for a one-dimensional vector means over all of its elements:

In [43]:
def softmax1(y):
    return np.exp(y) / np.sum(np.exp(y), axis=0)

In [44]:
softmax1([4.0, 4.0, 2.0])

array([0.46831053, 0.46831053, 0.06337894])
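As a check on this output, the computation can be written out by hand for the input $[ 4.0, 4.0, 2.0 ]$, with the exponentials rounded to three decimals:

$$
\begin{aligned}
\exp(4.0) &\approx 54.598, \qquad \exp(2.0) \approx 7.389 \\
\sum_{i=1}^{3} \exp(X_i) &\approx 54.598 + 54.598 + 7.389 = 116.585 \\
p(c_1) = p(c_2) &\approx \frac{54.598}{116.585} \approx 0.4683 \\
p(c_3) &\approx \frac{7.389}{116.585} \approx 0.0634
\end{aligned}
$$

The three probabilities sum to $1$, and the two equal scores receive equal probability.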

We can add the parameter $\theta$ to the function to scale up the probability of low values. The larger $\theta$, the higher the probability assigned to lower values. In the *softmax* definition the parameter is called *t*, and we set its default to $1.0$:

In [46]:
def softmax(y, t=1.0):
    return np.exp(y / t) / np.sum(np.exp(y / t), axis=0)

For a vector of values $[ 4.0, 4.0, 2.0 ]$, we get the following probability distribution given a default $\theta$ of $1.0$:

In [47]:
softmax(np.array([4.0, 4.0, 2.0]))

array([0.46831053, 0.46831053, 0.06337894])

If we double $\theta$, the probability assigned to the third scalar more than doubles, from about $0.06$ to about $0.16$:

In [48]:
softmax(np.array([4.0, 4.0, 2.0]), 2.0)

array([0.4223188, 0.4223188, 0.1553624])
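For large scores, `np.exp` can overflow. A common remedy is to subtract the maximum score before exponentiating, which leaves the result unchanged because the Softmax is invariant to adding a constant to all scores. The following is a minimal sketch for one-dimensional vectors, assuming a hypothetical function name `softmax_stable` that is not part of the original notebook:

```python
import numpy as np

def softmax_stable(y, t=1.0):
    """Softmax with scaling parameter t for a 1-D vector of scores.

    Hypothetical variant of `softmax` that shifts the scores by their
    maximum, so the largest exponent is 0 and np.exp cannot overflow.
    """
    scaled = np.asarray(y, dtype=float) / t
    shifted = scaled - np.max(scaled)   # shifting does not change the result
    e = np.exp(shifted)
    return e / np.sum(e)

p1 = softmax_stable([4.0, 4.0, 2.0])          # same as softmax with t=1.0
p2 = softmax_stable([4.0, 4.0, 2.0], 2.0)     # same as softmax with t=2.0
big = softmax_stable([1000.0, 1000.0, 998.0]) # the unshifted version would overflow here
```

Because only the differences between the scores matter, `big` has the same distribution as `p1`.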

## Computing Word2Vec

Assume that we have trained vectors for center words and their context.