# Word2Vec Embedding

Word2Vec distinguishes two roles that a word can play when generating embedding vectors. The word we are looking at is the *Focus Word*, and the words surrounding it are the *Context Words*. Word2Vec can be trained using two methods: Skip-Gram and Continuous Bag-of-Words (CBOW).

`CBOW Model:` The core idea of CBOW is: given the context words, can we predict the focus word? Let's understand how CBOW works:
1. Construct a vocabulary of size _v_.
2. Represent each word using one-hot encoding, so that each word corresponds to a *v*-dimensional binary vector.
3. Pass the *v*-dimensional context vectors as input to a neural network with an *N*-dimensional hidden layer, and generate an output vector that again has *v* dimensions.
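
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how gensim implements CBOW: the toy vocabulary, the random weights, and the names `W_in`, `W_out` and `cbow_forward` are all hypothetical, and the context vectors are combined by averaging, which is one common formulation.

```python
import math
import random

# Toy vocabulary (hypothetical example words, chosen only for illustration).
vocab = ["the", "king", "loves", "his", "queen"]
V = len(vocab)  # vocabulary size v
N = 3           # hidden-layer size N
word_to_idx = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Return the V-dimensional one-hot vector for a word."""
    vec = [0.0] * V
    vec[word_to_idx[word]] = 1.0
    return vec


random.seed(0)
# Randomly initialised (untrained) weights: W_in is V x N, W_out is N x V.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def cbow_forward(context_words):
    """Average the context vectors, project to N dims, then softmax over V."""
    # Average of the one-hot context vectors (V-dimensional).
    columns = zip(*(one_hot(w) for w in context_words))
    x = [sum(col) / len(context_words) for col in columns]
    # Linear hidden layer: h = x . W_in (N-dimensional).
    h = [sum(x[i] * W_in[i][j] for i in range(V)) for j in range(N)]
    # Output scores: u = h . W_out (V-dimensional).
    u = [sum(h[j] * W_out[j][k] for j in range(N)) for k in range(V)]
    # Softmax turns the scores into probabilities over the vocabulary.
    m = max(u)
    exps = [math.exp(s - m) for s in u]
    total = sum(exps)
    return [e / total for e in exps]


# Context words around the focus word "king" in "the king loves".
probs = cbow_forward(["the", "loves"])
```

Training would adjust `W_in` and `W_out` so that the probability assigned to the true focus word increases; after training, the rows of `W_in` serve as the word embeddings.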

`Skip-Gram Model:` In skip-gram the behaviour is flipped, i.e. it predicts the context words given the focus word. With *k* context words, both CBOW and skip-gram have *k+1* sets of *N×V* weights. However, CBOW has only 1 softmax to train while skip-gram has *k* softmaxes to train, so skip-gram takes more time to train.
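
To make the difference in prediction direction concrete, the sketch below builds training examples for both models from one tokenised sentence. The helper `training_examples`, the window size of 2, and the sample sentence are assumptions made for illustration only.

```python
def training_examples(tokens, window=2):
    """Yield (focus, context_words) for each position in a tokenised sentence."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield focus, left + right


tokens = ["to", "be", "or", "not", "to", "be"]

# CBOW: one example per position, mapping all context words to the focus word.
cbow_examples = [(context, focus) for focus, context in training_examples(tokens)]

# Skip-gram: one example per (focus, context word) pair, mapping focus to context.
skipgram_examples = [
    (focus, c) for focus, context in training_examples(tokens) for c in context
]
```

Skip-gram produces one prediction target per context word, which is why it ends up with more training examples than CBOW for the same text.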

In both cases the neurons of the hidden layer use a _linear_ activation function, while the output layer uses a *softmax* activation function. All the layers are fully connected. In practice, skip-gram has been found to work well with small amounts of data and to represent rare words well, while CBOW is faster to train and gives better representations for more frequent words.

![](./../assets/embedding/w2v.jpg)

Word2Vec embeddings can be implemented using the `gensim` library. The gensim library has a `Word2Vec` class that lets us work with the word2vec model easily. Let's begin by installing the required package:

`pip3 install gensim`

In [1]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

## Preprocessing

In [2]:
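def read_file(path):
    """Read a text file and return a list of tokenised paragraphs."""
    with open(path, 'r', encoding='utf-8') as f:
        data = f.read()
    sent = list()
    # Paragraphs are separated by blank lines; each one is treated as a sentence.
    for line in data.split('\n\n'):
        # simple_preprocess lowercases the text and splits it into tokens.
        sent.append(simple_preprocess(line))
    return sent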

In [3]:
sentences = read_file('./../data/shakespeare.txt')

## Building the Vocab

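`Word2Vec():` We initialize the word2vec model with the argument *sentences*, which is a list of tokenized sentences (each sentence is itself a list of words). Two parameters have a great influence on the performance of the model: *min_count*, which ignores all words whose total frequency is lower than the given value, and *sample*, the threshold that controls how strongly high-frequency words are randomly downsampled. Since the *sg* parameter is not set, gensim trains a CBOW model by default.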

In [4]:
model = Word2Vec(sentences=sentences, min_count=20, sample=6e-5)
model.save('./../models/w2v.model')
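
As a rough illustration of what *min_count* does to the vocabulary, the sketch below drops every word that occurs fewer than `min_count` times in a toy corpus. The function `filter_rare_words` and the toy sentences are hypothetical; this mimics the effect of the parameter and is not gensim's internal implementation.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop every word whose total frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word for word, count in counts.items() if count >= min_count}
    filtered = [[w for w in sentence if w in kept] for sentence in sentences]
    return filtered, kept


toy_sentences = [
    ["the", "king", "and", "the", "queen"],
    ["the", "king", "walks"],
]
filtered, kept_vocab = filter_rare_words(toy_sentences, min_count=2)
```

With `min_count=2`, only "the" and "king" survive here. A high value such as 20 on a small corpus can shrink the vocabulary considerably, so it is worth checking that the words you plan to query are still present.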

## Exploring the model

In [5]:
model = Word2Vec.load('./../models/w2v.model')

Let's see how our model lists the top *n* most similar words to a given input word.

In [6]:
model.wv.most_similar(positive=["smile"], topn=3)

[('lest', 0.9987840056419373),
 ('whom', 0.9987515211105347),
 ('clothes', 0.9987297654151917)]

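Let's try an analogy query, which combines word vectors by addition and subtraction.

`For example:` Which word is to man as king is to queen? The query below computes *man + king - queen* and returns the words closest to the result.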

In [7]:
model.wv.most_similar(positive=['man', 'king'], negative=['queen'], topn=3)

[('humour', 0.9938604831695557),
 ('earth', 0.9937192797660828),
 ('cade', 0.9928563833236694)]
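
To show the vector arithmetic behind such a query, here is a minimal sketch using hand-made two-dimensional vectors. The vectors, the `cosine` helper and the `analogy` function are hypothetical and are not taken from the trained model; the sketch only mirrors the idea of adding the `positive` vectors, subtracting the `negative` ones, and ranking the remaining words by cosine similarity. gensim's own implementation differs in details such as normalisation.

```python
import math

# Hypothetical embeddings: axis 0 encodes "royalty", axis 1 encodes "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "crown": [1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(toy_vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, toy_vectors[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in toy_vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(target, toy_vectors[w]), reverse=True)
    return ranked[:topn]


# Classic analogy: king - man + woman
result = analogy(positive=["king", "woman"], negative=["man"])
```

With these toy vectors the nearest remaining word is "queen". A model trained on a small corpus may not capture such regularities, so results on real data can look far less tidy.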

We can also check how similar two words are:

In [8]:
model.wv.similarity('king', 'queen')

0.9858703
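
The score returned by `similarity` is the cosine similarity of the two word vectors. The short sketch below computes it by hand for a few made-up vectors (the helper name `cosine_similarity` and the example values are assumptions for illustration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Vectors pointing in the same direction score close to 1, unrelated (orthogonal) vectors score close to 0, and opposite vectors score close to -1. The measure ignores vector length and depends only on direction.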

Let's ask our model to pick out the word that does not belong in the list!

In [9]:
model.wv.doesnt_match(['king', 'queen', 'prince', 'walk'])

'walk'
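
One way to picture how an odd-one-out query can work: normalise each word vector to unit length, average them, and return the word whose vector agrees least with that average. The sketch below follows this idea with made-up two-dimensional vectors; the values and the function name `odd_one_out` are hypothetical, and gensim's implementation may differ in detail.

```python
import math

# Hypothetical embeddings: three "royalty" words and one unrelated word.
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "prince": [0.85, 0.15],
    "walk": [0.1, 0.9],
}


def unit(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(words):
    """Return the word whose unit vector is least aligned with the mean vector."""
    units = {w: unit(toy_vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = [sum(u[d] for u in units.values()) / len(units) for d in range(dims)]
    # Dot product with the mean measures how central each word is to the group.
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)


outlier = odd_one_out(["king", "queen", "prince", "walk"])
```

Here "walk" points in a different direction from the three royalty words, so it has the lowest score and is returned as the outlier.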