## Contents

1.) Challenges with text representation <br>
2.) What is a Context? <br>
3.) Transforming Discrete Space to Continuous Space <br>
4.) Word2Vec model <br>


### Challenges with text representation

1.) Given a collection of words in a sentence, how do we represent a word in a computer? <br>
2.) Images have pixel values, and a greyscale pixel value of 0.67 is close to 0.68. In text, how do we know that "appple" is just a misspelt "apple", so that the two should be close to each other? <br>

### Exploiting the context

"You shall know a word by the company it keeps" - Firth 1957 <br>

1.) The context of a word can be the other words in the same sentence, paragraph or document.  <br>
2.) It can be the word immediately before and after a particular word.  <br>
3.) It can be the 'K' words before and after a particular word.  <br>

Below is a representation that uses the 2 words before and after a word as its context. Each cell records how often the column word appears in that window around the row word. <br>
<br>
<font color=green>A bottle of __tezguino__ is on the table. <br>
__Tezguino__ makes you drunk.<br>
... <br>
I had a fancy bottle of __wine__ and got drunk last night! <br>
The terrible __wine__ is on the table. <br>

</font>
           
|          |  bottle |  table  |   you   | terrible |
|----------|---------|---------|---------|----------|
| tezguino |    1    |    0    |    1    |     0    | 
|   wine   |    1    |    0    |    0    |     1    | 
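
The table can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming a crude tokenizer (lowercase, letters only) and a symmetric window of 2 words; the helper name `cooccurrence_counts` is made up for illustration.

```python
import re
from collections import Counter, defaultdict

# The four example sentences from above.
sentences = [
    "A bottle of tezguino is on the table.",
    "Tezguino makes you drunk.",
    "I had a fancy bottle of wine and got drunk last night!",
    "The terrible wine is on the table.",
]

def cooccurrence_counts(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Crude tokenizer (an assumption): lowercase and keep runs of letters only.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = cooccurrence_counts(sentences, window=2)
columns = ["bottle", "table", "you", "terrible"]
for word in ["tezguino", "wine"]:
    print(word, [counts[word][c] for c in columns])
```

"table" gets a count of 0 for both words because it sits more than 2 positions away from them in every sentence.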



### Transforming Discrete Space to Continuous Space

As we saw above, the idea is to exploit neighbourhood co-occurrences and learn the probability of a word given the $k$ words on either side of it: <br>
$$p(x_i \mid x_{i-k}, x_{i-(k-1)}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+k})$$

<img src="cbow.png">
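
To make the conditioning set in the formula concrete, here is a small sketch that builds (context, target) training pairs with up to $k$ words on each side. The function name `cbow_pairs` and the example sentence are chosen for illustration only, and words near the sentence boundary simply get a shorter context.

```python
def cbow_pairs(tokens, k=2):
    """Return (context, target) pairs using up to k words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - k):i]      # x_{i-k} ... x_{i-1}
        right = tokens[i + 1:i + 1 + k]     # x_{i+1} ... x_{i+k}
        pairs.append((left + right, target))
    return pairs

tokens = "the terrible wine is on the table".split()
pairs = cbow_pairs(tokens, k=2)
for context, target in pairs[:3]:
    print(context, "->", target)
```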

In [3]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".lower().split()

In [13]:
# We should tokenize the input properly, but we will ignore that for now.
# Build a list of tuples. Each tuple is ([word_i-2, word_i-1], target word),
# i.e. the two preceding words are the context for the word that follows them.
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

[(['when', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [31]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F

class CBOWLanguageModeler(nn.Module):

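    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOWLanguageModeler, self).__init__()
        # context_size is accepted but not used: the context embeddings are
        # summed, so the model works for any number of context words.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: 1-D LongTensor of context word indices
        embeds = self.embeddings(inputs)       # (num_context_words, embedding_dim)
        out = embeds.sum(dim=0)                # (embedding_dim,)
        out = self.linear1(out).view(1, -1)    # (1, vocab_size) unnormalized scores
        return out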
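
The forward pass above is just "look up, sum, apply one linear layer". The following plain-Python sketch mirrors that computation without torch so the arithmetic is visible. All parameter values are made-up toy numbers, and `cbow_forward` and `softmax` are illustrative helpers, not part of the model code.

```python
import math

def cbow_forward(context_ids, embeddings, weight, bias):
    """Sum the context embeddings, then apply one linear layer to get vocabulary scores."""
    dim = len(embeddings[0])
    # Sum of the embedding rows selected by the context indices.
    hidden = [sum(embeddings[i][d] for i in context_ids) for d in range(dim)]
    # One score (logit) per vocabulary word: weight[v] . hidden + bias[v].
    return [sum(w * h for w, h in zip(weight[v], hidden)) + bias[v]
            for v in range(len(weight))]

def softmax(scores):
    """Turn scores into probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up toy parameters: vocabulary of 3 words, embedding dimension 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

scores = cbow_forward([0, 1], embeddings, weight, bias)
probs = softmax(scores)
print(scores)
print(probs)
```

The model itself returns only the scores; the softmax step is shown here because the loss function applies it internally during training.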

In [32]:
losses = []
loss_function = nn.CrossEntropyLoss() # LogSoftMax + NLLLoss(classification loss)
model = CBOWLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    total_loss = 0
    for context, target in trigrams:
    
        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in a tensor)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = torch.tensor(context_idxs, dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients.  Before passing in a new instance,
        # you need to zero out the gradients from the old instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting unnormalized scores (logits) over next words
        logits = model(context_var)

        # Step 4. Compute the loss. CrossEntropyLoss applies log-softmax itself, so it
        # takes the raw logits and the index of the target word
        target_var = torch.tensor([word_to_ix[target]], dtype=torch.long)
        loss = loss_function(logits, target_var)

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if epoch % 10 == 0:
        # Prints the loss of the last training example seen in this epoch
        print("Loss at epoch:{0} is {1} ".format(epoch, loss.item()))
    losses.append(total_loss)


Loss at epoch:0 is 4.895758152008057 
Loss at epoch:10 is 4.779870986938477 
Loss at epoch:20 is 4.668769359588623 
Loss at epoch:30 is 4.561984539031982 
Loss at epoch:40 is 4.45906400680542 
Loss at epoch:50 is 4.3595757484436035 
Loss at epoch:60 is 4.263116359710693 
Loss at epoch:70 is 4.169312953948975 
Loss at epoch:80 is 4.077834129333496 
Loss at epoch:90 is 3.988384485244751 
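
As the comment in the training cell notes, the loss combines a log-softmax with a negative log-likelihood. A minimal plain-Python sketch of that quantity for a single example is shown below; `cross_entropy` here is an illustrative helper, not the torch function. One useful reference point: a model that scores all $V$ words equally has a loss of $\log V$.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    # log-sum-exp, shifted by the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # equals log(4)
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], target=0)  # much smaller
print(uniform_loss, confident_loss)
```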


### Semantic space learned by Embeddings

In [1]:
# We will load a pre-trained model trained on the Google News corpus, as the power of these
# representations comes from a large amount of data.

# To run this, first download the embeddings from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [7]:
word_vectors.most_similar(positive=['apple'])

[('apples', 0.720359742641449),
 ('pear', 0.6450697183609009),
 ('fruit', 0.6410146951675415),
 ('berry', 0.6302294731140137),
 ('pears', 0.6133959889411926),
 ('strawberry', 0.605826199054718),
 ('peach', 0.6025872230529785),
 ('potato', 0.5960936546325684),
 ('grape', 0.5935863256454468),
 ('blueberry', 0.5866668224334717)]

In [6]:
word_vectors.most_similar(positive=['dinner'])

[('dinners', 0.7902063727378845),
 ('brunch', 0.7900512218475342),
 ('Dinner', 0.7639395594596863),
 ('supper', 0.7596098184585571),
 ('luncheon', 0.7099569439888),
 ('banquet', 0.7032414674758911),
 ('breakfast', 0.7007029056549072),
 ('buffet_dinner', 0.6914125084877014),
 ('meal', 0.6843624114990234),
 ('lunch', 0.6815704703330994)]

In [8]:
# woman + king - man = queen

word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192315101624),
 ('monarch', 0.6189672946929932),
 ('princess', 0.5902429819107056),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377322435379028),
 ('kings', 0.5236843824386597),
 ('Queen_Consort', 0.5235944986343384),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098592638969421),
 ('monarchy', 0.5087411999702454)]
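
The scores returned by `most_similar` are cosine similarities between a query vector built from the positive and negative words and every other word vector. The sketch below imitates that idea on made-up 2-D vectors. It is a simplification under stated assumptions: the `toy` vectors are invented, `analogy` is a hypothetical helper, and vectors are added and subtracted without the normalization gensim applies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    exclude = set(positive) | set(negative)  # the input words are not candidates
    ranked = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

result = analogy(["woman", "king"], ["man"], toy)
print(result)
```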

### Alternative Approach : Skip-Gram

<img src="skip.png">

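Skip-gram reverses the CBOW setup: instead of predicting a word from its context, it predicts each context word from the centre word. It often gives better representations, particularly for infrequent words, but takes longer to train, since every centre word produces several predictions and each one requires a softmax over the entire vocabulary. Tricks such as negative sampling are used to make this faster.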
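
A rough sketch of the data side of skip-gram with negative sampling follows. It generates (centre, context) pairs and, for one pair, draws a few "noise" words with probability proportional to the word count raised to the power 0.75, the noise distribution described by Mikolov et al. The helper names and the toy sentence are assumptions for illustration; a real implementation samples from the whole corpus vocabulary and trains a classifier to tell true pairs from noise pairs.

```python
import random
from collections import Counter

def skipgram_pairs(tokens, k=2):
    """Return (centre, context) pairs for every context word within k positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def sample_negatives(tokens, exclude, n, rng):
    """Draw n noise words with probability proportional to count ** 0.75."""
    counts = Counter(t for t in tokens if t not in exclude)
    words = list(counts)
    weights = [counts[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=n)

tokens = "the terrible wine is on the table".split()
pairs = skipgram_pairs(tokens, k=1)
print(pairs[:4])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
centre, context = pairs[2]
negatives = sample_negatives(tokens, {centre, context}, n=3, rng=rng)
print(centre, context, negatives)
```

Each training step then only touches the true context word and the handful of sampled noise words instead of the full vocabulary.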

### References

[Mikolov et al.] Distributed Representations of Words and Phrases and their Compositionality <br>
[Mikolov et al.] Efficient Estimation of Word Representations in Vector Space

