# Word2Vec - The Skip-Gram Model

The idea of word2vec is quite simple. We train a 2-layer neural network on a fake task, a task whose predictions we do not actually care about. Once training is done, the weights of the hidden layer become the embedding vectors for the words in our vocabulary.

## Fake Task

Given a specific word in the middle of a sentence (the center word), find the probability for every word in our vocabulary of being a *nearby word*.

For example, take the following sentence.

> Live as if you were to die tomorrow and learn as if you were to live forever.

Suppose I pick *tomorrow* as my center word. The neural network should tell me the probability that *die* is a nearby word of *tomorrow*. If the network is properly trained, that probability should be high compared with most of the vocabulary, although not literally one, because the probability mass is shared among all the words that appear near *tomorrow*. Conversely, if I ask for the probability that *live* is a nearby word of *tomorrow*, it should be close to zero.
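
To make "nearby" concrete, here is a minimal sketch that lists the (center, context) pairs for the sentence above. The helper `skipgram_pairs` and the window size of 2 are assumptions made for illustration; the text does not fix how many positions count as nearby.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in tokens.

    `window` is a hypothetical parameter: how many positions to the
    left and right of the center word count as "nearby".
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


sentence = "Live as if you were to die tomorrow and learn as if you were to live forever."
tokens = sentence.lower().rstrip(".").split()
pairs = skipgram_pairs(tokens, window=2)

# Context words paired with the center word "tomorrow"
print([context for center, context in pairs if center == "tomorrow"])
# ['to', 'die', 'and', 'learn']
```

With this window, *die* is a context word of *tomorrow*, while *live* never is, which matches the example above.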

## Architecture

![skip-gram-model](./assets/skip-gram-model.png)

### Input

The input is a word that gets translated into a one-hot encoding. The encoding vector has length `V`, where `V` is the vocabulary size. For example, if I have 10 unique words in my vocabulary, one of the words will be encoded as follows.

    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
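
A small sketch of how such a vector could be built. The 10-word vocabulary and the names `vocab`, `word_to_index`, and `one_hot` are hypothetical, chosen only so that the result matches the vector above.

```python
import numpy as np

# Hypothetical 10-word vocabulary, used only for illustration.
vocab = ["and", "as", "die", "forever", "if", "learn", "live", "to", "tomorrow", "you"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a length-V vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec


print(one_hot("die").tolist())  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```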
    
### Output

The output is a vector of the same length `V`. Each entry is the probability that the corresponding vocabulary word is a nearby word of the input center word, so every entry is a float between 0 and 1. With a softmax output layer, as in the implementation below, the entries also sum to 1. For example:

    [0.02, 0.05, 0.30, 0.10, 0.00, 0.08, 0.05, 0.25, 0.05, 0.10]
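
Assuming the output layer ends in a softmax, as the implementation in this document does, the following sketch shows how raw output scores turn into such a probability vector. The `scores` values are made up for illustration.

```python
import numpy as np

# Hypothetical raw scores from the output layer for a 5-word vocabulary.
scores = np.array([1.0, 2.0, 0.5, 3.0, -1.0])

# Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
shifted = scores - scores.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

print(probs)                # every entry lies between 0 and 1
print(probs.sum())          # the entries sum to 1, up to floating-point rounding
print(int(probs.argmax()))  # index of the most likely nearby word
```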

### Lookup Table

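Our neural network has two layers and two sets of weights. Suppose we have 1000 words in our vocabulary and 300 is our feature dimension. The first set of weights, which maps the input to the hidden layer, is then a 1000 x 300 matrix, and the second set, which maps the hidden layer to the output, is a 300 x 1000 matrix. After we finish training, the first matrix becomes our word vector lookup table: row `i` is the vector for word `i`.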

![word2vec-lookup-table](./assets/word2vec-lookup-table.png)
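
The reason the first weight matrix acts as a lookup table is that multiplying a one-hot vector by it simply selects one row. A minimal sketch, with random weights standing in for trained ones and a hypothetical word index `idx`:

```python
import numpy as np

V, D = 1000, 300                  # vocabulary size and feature dimension from the text
rng = np.random.default_rng(0)
w1 = rng.standard_normal((V, D))  # stand-in for the trained input-to-hidden weights

idx = 42                          # hypothetical index of some word
one_hot = np.zeros(V)
one_hot[idx] = 1.0

hidden = one_hot @ w1             # full matrix product, shape (D,)
row = w1[idx]                     # direct row lookup, shape (D,)

print(np.allclose(hidden, row))   # True
```

In practice this means the matrix product can be skipped entirely and replaced by an index lookup.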

### Implementation

Now we are ready to create the model for this task. Let's define our word vector feature length to be `D=100`.

    import numpy as np


    corpus = """
    The Porsche Boxster is a mid-engined two-seater roadster. It was Porsche's first road vehicle to be
    originally designed as a roadster since the 550 Spyder. The first-generation Boxster was introduced in
    late 1996; it was powered by a 2.5-litre flat six-cylinder engine. The design was heavily influenced by the 1992
    Boxster Concept. In 2000, the base model was upgraded to a 2.7-litre engine and the new Boxster S variant was
    introduced with a 3.2-litre engine. In 2003, styling and engine output was upgraded on both variants.
    """


    def softmax(x):
        """Numerically stable softmax over the last axis (works for 1-D and 2-D input)."""
        x = x - np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


    class Model:
        def __init__(self, corpus, feature_dim):
            self.add_corpus(corpus)
            self.V = len(self.lookup_table)            # vocabulary size
            self.D = feature_dim                       # word vector length
            self.w1 = np.random.randn(self.V, self.D)  # input -> hidden (word vectors)
            self.w2 = np.random.randn(self.D, self.V)  # hidden -> output

        def add_corpus(self, corpus):
            # Map every unique whitespace-separated token to an index.
            # Sorting keeps the mapping identical from run to run.
            tokens = sorted(set(corpus.split()))
            self.lookup_table = {word: idx for idx, word in enumerate(tokens)}

        def forward(self, word):
            idx = self.lookup_table.get(word)
            if idx is None:
                return None  # word is not in the vocabulary

            # One-hot encode the center word as a (1, V) row vector.
            wordvec = np.zeros((1, self.V))
            wordvec[0, idx] = 1

            hidden = np.dot(wordvec, self.w1)  # (1, D)
            scores = np.dot(hidden, self.w2)   # (1, V)
            return softmax(scores)             # probabilities over the vocabulary


    model = Model(corpus, 100)
    model.forward('Porsche')

The call returns a `(1, V)` array of probabilities. The weights are randomly initialized and the model has not been trained, so the values are meaningless and differ from run to run. The output of one run begins as follows (truncated):

    array([[  7.10131056e-13,   2.47594653e-06,   2.53290408e-12,
              2.20996587e-06,   2.30141869e-01,   9.49824842e-02,
              3.22236345e-05,   1.60437264e-02,   1.27058846e-05,
              1.88410823e-01,   8.26042585e-13,   4.57722072e-08,
              5.02238406e-08,   5.09131038e-05,   3.05286306e-08,
              6.70482585e-18,   7.26135416e-10,   4.76799976e-03,
              1.43687960e-12,   4.89209703e-03,   4.98483157e-06,
              9.29606849e-03,   1.55444970e-17,   9.37455733e-10,
              2.82616201e-12,   1.36136316e-04,   5.78575661e-14,
              2.58469383e-02,   1.24313812e-12,   8.67508145e-14,
              1.77377765e-22,   7.86738822e-15,   9.52655197e-14,
              7.63781212e-13,   1.46723086e-02,   3.13115174e-13,
              2.83290437e-04,   1.83244196e-05,   1.28662427e-16,
              2.75454297e-15,   1.47283303e-15,   1.07521354e-12,
              3.80998775e-03,   7.74221786e-11,   1.26412695e-04,
              3.57877099e-06,   3.96060206e-01,   7.15494708e-11,
              ...


## Training

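Suppose we have a giant corpus with one million unique words in the vocabulary and a feature dimension of 100. Each of the two weight matrices then has 100 million entries, which is about 800 MB per matrix when stored as 64-bit floats. That is a heavy memory load. Every forward pass also needs a matrix multiplication and a softmax over all one million outputs, so we will have serious trouble with the computation as well.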
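
The arithmetic behind this estimate can be checked with a few lines. The 8 bytes per entry assumes 64-bit floats, which is NumPy's default and what `np.random.randn` returns; a smaller dtype would shrink the numbers proportionally.

```python
vocab_size = 1_000_000   # unique words in the vocabulary
feature_dim = 100        # word vector length
bytes_per_entry = 8      # assumption: float64

entries_per_matrix = vocab_size * feature_dim
mb_per_matrix = entries_per_matrix * bytes_per_entry / 1e6
mb_total = 2 * mb_per_matrix  # input-to-hidden and hidden-to-output matrices

print(entries_per_matrix)  # 100000000
print(mb_per_matrix)       # 800.0
print(mb_total)            # 1600.0
```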