## Implementation

To implement the core word2vec model in TensorFlow, we need the following components. As we will see, these components can be seen as layers:

1. **The input word embeddings (weights)**: These weights, represented as a matrix $\boldsymbol{W_i}$, transform the one-hot encoded word representation $\boldsymbol{i}$ into the distributed word-vector representation $\boldsymbol{v}$.
 
2. **The output word embeddings (set of weights)**: This set of matrices $\boldsymbol{W_{o_1}}, \boldsymbol{W_{o_2}}, \ldots$ holds the embeddings of words when they appear in the context, i.e., as outputs. The number of matrices depends on the size of the context that we choose for the model.

3. **Softmax**: Transforms the dot-product scores into probabilities.

4. **Negative log-likelihood / cross-entropy loss**: The loss function that the optimizer minimizes.

![Model](word2vec_model_structure.png)
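
Written out for a single training example, the four components above amount to the following. This is a sketch under two assumptions that the text does not state explicitly: $\boldsymbol{i}$ and the one-hot context words $\boldsymbol{o_k}$ are treated as row vectors of length $V$ (the vocabulary size), and $\boldsymbol{s_k}$, $p_{k,j}$, $c_k$ and $L$ are names introduced here for illustration.

$$
\boldsymbol{v} = \boldsymbol{i}\,\boldsymbol{W_i}, \qquad
\boldsymbol{s_k} = \boldsymbol{v}\,\boldsymbol{W_{o_k}}, \qquad
p_{k,j} = \frac{\exp(s_{k,j})}{\sum_{l=1}^{V} \exp(s_{k,l})}
$$

$$
L = -\sum_{k} \sum_{j=1}^{V} o_{k,j} \log p_{k,j} = -\sum_{k} \log p_{k,c_k}
$$

Here $c_k$ is the vocabulary index of the $k$-th context word. Because $\boldsymbol{i}$ is one-hot, the product $\boldsymbol{i}\,\boldsymbol{W_i}$ simply picks out one row of $\boldsymbol{W_i}$, which is why that matrix can be read as a table of word embeddings.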

In [3]:
# Note: this cell uses the TensorFlow 1.x graph API (tf.placeholder, tf.get_variable)
import tensorflow as tf

vocab_size = 1000
embd_dim = 10

# Placeholders for the one-hot input word and the two one-hot context (output) words
I = tf.placeholder(tf.float32, shape=(None, vocab_size))
O1 = tf.placeholder(tf.float32, shape=(None, vocab_size))
O2 = tf.placeholder(tf.float32, shape=(None, vocab_size))

# Input and output embeddings
Wi = tf.get_variable("Wi", shape=(vocab_size, embd_dim))
Wo1 = tf.get_variable("Wo1", shape=(embd_dim, vocab_size))
Wo2 = tf.get_variable("Wo2", shape=(embd_dim, vocab_size))

# Create the model: embed the input word, then score it against each output embedding
Ei = tf.matmul(I, Wi)
So1 = tf.matmul(Ei, Wo1)
So2 = tf.matmul(Ei, Wo2)

# The softmax is applied inside the loss op, so explicit probabilities are not needed here
# Po1 = tf.nn.softmax(So1)
# Po2 = tf.nn.softmax(So2)
loss1 = tf.nn.softmax_cross_entropy_with_logits(logits=So1, labels=O1, name="loss1")
loss2 = tf.nn.softmax_cross_entropy_with_logits(logits=So2, labels=O2, name="loss2")
loss = tf.add(loss1, loss2, name="total_loss")
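
To see what this graph computes for a single example, the same arithmetic can be traced in plain Python without TensorFlow. This is only an illustrative sketch: the helpers `matvec`, `softmax`, `one_hot` and `example_loss`, the toy sizes, and the random initial weights are assumptions made here, not part of the notebook.

```python
import math
import random


def matvec(row, matrix):
    """Multiply a row vector (list) by a matrix (list of rows)."""
    return [sum(row[k] * matrix[k][j] for k in range(len(row)))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at index."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def example_loss(i, o1, o2, Wi, Wo1, Wo2):
    """Summed cross-entropy loss for one (input, output1, output2) triple."""
    Ei = matvec(i, Wi)  # with a one-hot input this selects one row of Wi
    loss = 0.0
    for Wo, o in ((Wo1, o1), (Wo2, o2)):
        probs = softmax(matvec(Ei, Wo))
        # cross-entropy against a one-hot label
        loss += -sum(t * math.log(p) for t, p in zip(o, probs))
    return loss


# Toy sizes (assumed for illustration), far smaller than in the cell above
vocab_size, embd_dim = 6, 3
rng = random.Random(0)
Wi = [[rng.uniform(-0.1, 0.1) for _ in range(embd_dim)] for _ in range(vocab_size)]
Wo1 = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(embd_dim)]
Wo2 = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(embd_dim)]

i = one_hot(2, vocab_size)
loss = example_loss(i, one_hot(1, vocab_size), one_hot(3, vocab_size), Wi, Wo1, Wo2)
print(loss)
```

With small initial weights each softmax is nearly uniform, so the printed loss should be close to `2 * math.log(vocab_size)`; training would then lower it by raising the scores of the true context words.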


## Preprocessing input and output
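
For a model with one input word and two output words, preprocessing comes down to sliding a window of length 3 over the token list and splitting each window into its center word and its two neighbours. The following is a minimal sketch of that idea, assuming a toy token list and the hypothetical helpers `training_pairs` and `one_hot`; it is not the notebook's own pipeline.

```python
def training_pairs(tokens, window_len=3):
    """Yield (center, context_words) for every full window in tokens."""
    if window_len % 2 == 0:
        raise ValueError("window_len must be odd so the window has a center word")
    if len(tokens) < window_len:
        raise ValueError("Corpus length cannot be smaller than window length")
    pad = window_len // 2  # number of context words on each side
    for i in range(pad, len(tokens) - pad):
        window = tokens[i - pad:i + pad + 1]
        yield window[pad], window[:pad] + window[pad + 1:]


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at index."""
    return [1.0 if j == index else 0.0 for j in range(size)]


# Toy stand-in corpus (assumed for illustration)
tokens = "the climate of india comprises a wide range".split()
vocab = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocab)}

pairs = list(training_pairs(tokens))

# Encode the first pair as rows that could be fed to the I, O1 and O2 placeholders
center, context = pairs[0]
I_row = one_hot(word_to_id[center], len(vocab))
O1_row = one_hot(word_to_id[context[0]], len(vocab))
O2_row = one_hot(word_to_id[context[1]], len(vocab))
```

Each pair maps onto one row of the model's input and output placeholders; stacking many such rows gives a training batch.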

In [11]:
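from math import floor
from pathlib import Path

data_dir = Path('./data')
files = ['wiki_' + country + '_climate' for country in ['india', 'Australia', 'US']]
strings = []
for file in files:
    # read each article; the context manager closes the file afterwards
    with open(data_dir.joinpath(file).with_suffix('.txt')) as f:
        strings.append(f.read())

corpus = (' '.join(strings)).split()

# create input and output samples
# window iterator
def windows(corpus, window_len=3):
    """Yield the window of tokens centred on each position (pad words on either side)."""
    corpus_len = len(corpus)
    if corpus_len < window_len:
        raise ValueError("Corpus length cannot be smaller than window length")
    pad = int(floor(window_len / 2))  # number of context words on each side
    for i in range(pad, corpus_len - pad):
        # assumed completion of the elided slice: the window centred on position i
        yield corpus[i - pad:i + pad + 1]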

['the',
 'climate',
 'of',
 'india',
 'comprises',
 'a',
 'wide',
 'range',
 'of',
 'weather',
 'conditions',
 'across',
 'a',
 'vast',
 'geographic',
 'scale',
 'and',
 'varied',
 'topography',
 'making',
 'generalisations',
 'difficult',
 'based',
 'on',
 'the',
 'system',
 'india',
 'hosts',
 'six',
 'major',
 'climatic',
 'subtypes',
 'ranging',
 'from',
 'arid',
 'desert',
 'in',
 'the',
 'west',
 'alpine',
 'tundra',
 'and',
 'glaciers',
 'in',
 'the',
 'north',
 'and',
 'humid',
 'tropical',
 'regions',
 'supporting',
 'rainforests',
 'in',
 'the',
 'southwest',
 'and',
 'the',
 'island',
 'territories',
 'many',
 'regions',
 'have',
 'starkly',
 'different',
 'microclimates',
 'the',
 'country',
 'meteorological',
 'department',
 'follows',
 'the',
 'international',
 'standard',
 'of',
 'four',
 'climatological',
 'seasons',
 'with',
 'some',
 'local',
 'adjustments',
 'winter',
 'december',
 'january',
 'and',
 'february',
 'summer',
 'march',
 'april',
 'and',
 'may',
 'a',
 