# Deep Learning for NLP WS17/18
## Exercise Sheet 4 - PyTorch WordEmbeddings
This exercise sheet is due on 28.11.17 11:59 pm. There is a total of 5
points for this exercise sheet. Please send your solution in a
suitable format to [beroth@cis.uni-muenchen.de](mailto:beroth@cis.uni-muenchen.de). Please submit a
completed version of this file in Python 3. You may submit in teams of
2 or 3 students.


You will have to complete the code wherever it is marked with ***TODO***.

Please rename the file to pytorch_wordEmbeddings_last_names.ipynb

### Setup

Please refer to the last exercise sheet if you have trouble installing PyTorch. This time you will also have to install nltk (e.g. with pip).

In [None]:
import random
import numpy as np
import nltk
from collections import Counter
from nltk.corpus import brown
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.optim as optim
import torch.utils.data as data_utils


The nltk library provides datasets that can be used to dry-run an approach or to verify a hypothesis. We will use the Brown corpus. If you have never used the Brown corpus on your machine before, you will have to uncomment the download statement.

While developing and debugging your code, you can work with only the first tokens of the corpus. Training on the full corpus may take up to one hour on a standard laptop. Use only the first 1000 words until your code works; afterwards you can switch to the whole corpus. Note that the words used for human evaluation at the very end might not be part of this short list!

***TODO***

In [None]:
#nltk.download("brown")
#brown_word_list = [w.lower() for w in brown.words()] #TODO
brown_word_list = [w.lower() for w in brown.words()[:1000]]

Next we create the vocabulary. The Brown corpus contains 56057 unique lower-cased words/tokens. For the sake of computation time, we will only use the 10 000 most common ones.

In [None]:
# creating vocabulary
word_counts = Counter(brown_word_list)
vocab_size = 10000  # keep only the most common words/tokens
vocab = {w[0]: idx for idx, w in enumerate(word_counts.most_common(vocab_size))}
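
To make the mapping concrete, here is a minimal sketch of the same vocabulary construction on a toy token list. The names toy_tokens, toy_counts and toy_vocab are hypothetical and only serve this illustration; the cutoff of 2 stands in for vocab_size.

```python
from collections import Counter

# Toy token list standing in for brown_word_list (hypothetical data).
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

toy_counts = Counter(toy_tokens)
# Keep only the 2 most frequent tokens; ids follow the frequency rank.
toy_vocab = {word: idx for idx, (word, _count) in enumerate(toy_counts.most_common(2))}

print(toy_vocab)
# Tokens outside the vocabulary are simply absent from the mapping.
print("sat" in toy_vocab)
```

The most frequent token receives id 0, and every token below the cutoff is missing from the dictionary, which is why later code has to check membership before looking up an id.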

We define some hyper-parameters. If it is not clear what they do, read up on them or ask your tutor.

In [None]:
# Hyper-parameters of the algorithm
window_size = 1 # window size
# TODO: You can change the number of negative samples to a higher value, e.g. 7. This will give you better results, but will take longer to train.
neg_samples_factor = 3 # negative samples multiple
dims = 64 # embedding dimension
learning_rate = 0.01
batch_size = 4096

The following function takes a list of strings (words) and returns a generator of word tuples of the following form:
    
(center_word, context_word, True)

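where center_word is the word at a certain position, and context_word is at most max_distance tokens away.

"True" is the label of a pair that actually co-occurs in the text. For each such positive tuple, the function additionally yields neg_samples_factor negative tuples, in which the context word is drawn uniformly at random from the vocabulary and the label is "False".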

Words are represented by integers (rather than by string) denoting their id.

Only pairs where both words are in the vocabulary are considered.

Note: co-occurrence only holds between words in different positions, not between a position and itself.

    :param tokens: list of strings (words)
    :param max_distance: max distance of context word to target word
    :param neg_samples_factor: number of sampled negative tuples for each positive tuple
    :param vocab_to_id: dictionary (string to int) mapping each word to its id (= row in the embedding matrices).
    :return: generator over tuples of the form (center_word:int, context_word:int, label:boolean)

In [None]:
def positive_and_negative_cooccurrences(tokens, max_distance, neg_samples_factor, vocab_to_id):
    for center_position in range(len(tokens)):
        center_word = tokens[center_position]
        if center_word not in vocab_to_id:
            continue
        context_start = max(0, center_position - max_distance)
        context_end = min(len(tokens), center_position + max_distance + 1)
        for context_position in range(context_start, context_end):
            if context_position != center_position:
                context_word = tokens[context_position]
                if context_word not in vocab_to_id:
                    continue
                yield (vocab_to_id[center_word], vocab_to_id[context_word], True)
                for i in range(neg_samples_factor):
                    yield (vocab_to_id[center_word], random.randint(0, len(vocab_to_id) - 1), False)
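
As an illustration of the windowing logic, the following sketch collects only the positive pairs for a toy sentence. The helper toy_positive_pairs and the toy vocabulary are hypothetical names introduced for this example, and negative sampling is deliberately left out so that the output is deterministic.

```python
def toy_positive_pairs(tokens, max_distance, vocab_to_id):
    """Positive (center_id, context_id) pairs only; negative sampling omitted."""
    pairs = []
    for center_position, center_word in enumerate(tokens):
        if center_word not in vocab_to_id:
            continue
        start = max(0, center_position - max_distance)
        end = min(len(tokens), center_position + max_distance + 1)
        for context_position in range(start, end):
            context_word = tokens[context_position]
            # Skip the center position itself and out-of-vocabulary context words.
            if context_position != center_position and context_word in vocab_to_id:
                pairs.append((vocab_to_id[center_word], vocab_to_id[context_word]))
    return pairs

toy_vocab = {"the": 0, "cat": 1, "sat": 2}
toy_pairs = toy_positive_pairs(["the", "cat", "sat", "down"], 1, toy_vocab)
print(toy_pairs)
```

With max_distance = 1, the word "down" never appears in a pair because it is not in the toy vocabulary. In the full function above, each of these positive pairs would be followed by neg_samples_factor tuples with a random context id and the label False.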


Next we build our model. As you have seen in the last exercise, you define a model by writing a class that inherits from torch.nn.Module and defines the init and forward methods.

#### init method

***TODO*** You will have to create the embeddings for the vocabulary as context and target/center words. Hint: There is an Embedding layer in PyTorch which should be used here. (1p)
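
For orientation, this is a generic sketch of how an nn.Embedding layer maps integer ids to vectors. The sizes (5 entries, 3 dimensions) and the names toy_embedding, toy_ids and toy_vectors are hypothetical and are not the values needed for the exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy sizes: 5 vocabulary entries, 3-dimensional vectors.
toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

toy_ids = torch.tensor([0, 4, 4])      # a batch of three word ids (int64)
toy_vectors = toy_embedding(toy_ids)   # one row of the weight matrix per id

print(toy_embedding.weight.shape)  # weight matrix: 5 x 3
print(toy_vectors.shape)           # looked-up batch: 3 x 3
```

The layer is essentially a trainable lookup table: the same id always returns the same row of the weight matrix.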

#### forward method

***TODO*** What is the shape of center_context_idxs? (1p)

***TODO*** What are the shapes of cntr_idxs and ctxt_idxs? (1p)

***TODO*** ctxt_vecs must have the correct shape for batch-wise matrix multiplication (use variable.view(...)). Check the documentation for torch.bmm(...) to see which shape is required. (1p)

***TODO*** What is the shape of the returned Variable? (1p)
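
The shape rule of torch.bmm can be tried out on small tensors before touching the model. The sketch below uses hypothetical tensors a and b with made-up sizes; it only demonstrates that a (batch x 1 x n) tensor multiplied with a (batch x n x 1) tensor yields one dot product per batch entry.

```python
import torch

batch, n = 4, 3
a = torch.arange(batch * n, dtype=torch.float32).view(batch, 1, n)  # batch x 1 x n
b = torch.ones(batch, n, 1)                                          # batch x n x 1

out = torch.bmm(a, b)  # batch x 1 x 1: one dot product per batch entry
print(out.shape)
print(out.view(-1, 1).shape)  # flattened to batch x 1
```

Because b consists of ones, each entry of out is simply the sum of the corresponding row of a, which makes the result easy to check by hand.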

In [None]:
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(Word2Vec, self).__init__()
        self.embedding_size = embedding_size
        # TODO: create embeddings for vocabulary as context and target words (1p)
        self.embeddings_center = None # TODO
        self.embeddings_context = None #TODO

    def forward(self, center_context_idxs):
        # TODO: What is the shape of center_context_idxs? (1p)
        cntr_idxs = center_context_idxs[:, 0]
        ctxt_idxs = center_context_idxs[:, 1]
        # TODO: What are the shapes of cntr_idxs and ctxt_idxs? (1p)

        cntr_vecs = self.embeddings_center(cntr_idxs).view(-1, 1, self.embedding_size) # resulting shape: batch_size x 1 x embedding_size
       
        ctxt_vecs = self.embeddings_context(ctxt_idxs) #TODO
        # TODO: ctxt_vecs must have the correct shape for batch-wise matrix multiplication (use variable.view(...) ) ...
        # TODO: ...Check the documentation for torch.bmm(...) to see what is the required shape. (1p)
       
        scores = torch.bmm(cntr_vecs, ctxt_vecs) # Batch-wise matrix multiplication. Resulting shape: batch_size x 1 x 1
        return scores.view(-1,1) # TODO: What is the shape of the returned Variable? (1p)

    def center_sims(self, word_idx):
        m = self.embeddings_center.weight
        v = m[word_idx]
        return F.cosine_similarity(m, v.expand(m.size())).data.numpy()

    def context_sims(self, word_idx):
        m = self.embeddings_context.weight
        v = m[word_idx]
        return F.cosine_similarity(m, v.expand(m.size())).data.numpy()
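
The two similarity helpers compare one embedding row against all rows of the same matrix. A minimal sketch on a hand-made matrix (toy_matrix and query are hypothetical names) shows what F.cosine_similarity returns in that setting:

```python
import torch
import torch.nn.functional as F

toy_matrix = torch.tensor([[1.0, 0.0],
                           [0.0, 2.0],
                           [3.0, 0.0]])
query = toy_matrix[0]

# Repeat the query for every row, as the helpers do via expand, and compare row-wise.
sims = F.cosine_similarity(toy_matrix, query.expand(toy_matrix.size()), dim=1)
print(sims)
```

Rows pointing in the same direction as the query get a similarity of 1 regardless of their length, and orthogonal rows get 0.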


From this point on there is nothing left for you to do. Please read through the following code and consult the documentation to see what each line does.

Understanding how the model learns and predicts will be essential for your upcoming projects.

In [None]:
w2v_model = Word2Vec(vocab_size, dims)

criterion = nn.BCEWithLogitsLoss()

optimizer = optim.Adam(w2v_model.parameters(), lr=learning_rate)

pos_neg_list = list(positive_and_negative_cooccurrences(brown_word_list, window_size, neg_samples_factor, vocab))
data_size=len(pos_neg_list)
train_data = np.asarray(pos_neg_list)

num_epochs = 5

data_tensor = torch.LongTensor(train_data[:, 0:2].tolist())
target_tensor = torch.FloatTensor(train_data[:, 2].tolist()).view(-1,1)

train = data_utils.TensorDataset(data_tensor, target_tensor)
train_loader = data_utils.DataLoader(train, batch_size=batch_size, shuffle=True)

for epoch_nr in range(num_epochs):
    loss_accum = 0.0
    print("epoch", epoch_nr)
    for ctxt_tgt_idxs, labels in train_loader:
        optimizer.zero_grad()
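        output = w2v_model(Variable(ctxt_tgt_idxs))  # calling the module invokes forward()
        loss = criterion(output, Variable(labels))
        loss_accum += loss.item()  # .item() extracts the Python float; loss.data[0] fails in current PyTorch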
        loss.backward()
        optimizer.step()
    print("current loss:", loss_accum)
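
The training loop feeds the raw scores of the model into nn.BCEWithLogitsLoss. As a small sketch with made-up scores and labels (toy_scores and toy_labels are hypothetical), the following shows that this criterion matches applying a sigmoid first and then the plain binary cross-entropy:

```python
import torch
import torch.nn as nn

toy_scores = torch.tensor([[2.0], [-1.0], [0.0]])  # raw dot products (logits)
toy_labels = torch.tensor([[1.0], [0.0], [0.0]])   # 1 = positive pair, 0 = negative sample

loss_from_logits = nn.BCEWithLogitsLoss()(toy_scores, toy_labels)
loss_two_step = nn.BCELoss()(torch.sigmoid(toy_scores), toy_labels)

print(loss_from_logits.item(), loss_two_step.item())
```

This is why the forward method can return unnormalized scores: the sigmoid is applied inside the loss.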


Finally, we retrieve the words whose embeddings are most similar to a few example words. For each example word, the first list holds the results obtained when the word is treated as a center word, the second when it is treated as a context word.

In [None]:

sorted_words = sorted(vocab.keys(), key=vocab.get)
def top_words_for(word, n=10, for_center=True):
    print(word)
    query_index = vocab[word]
    if for_center:
        sims = list(zip(w2v_model.center_sims(query_index), sorted_words))
    else:
        sims = list(zip(w2v_model.context_sims(query_index), sorted_words))
    sims.sort(key=lambda x: -x[0])
    return sims[:n]

print("word".ljust(20), "similarity")
for ww in ["before", "president", "city", "man", "government","1","two"]:
    print("="*35)
    for dist, nw in top_words_for(ww, for_center=True):
        print(nw.ljust(20), dist)
    for dist, nw in top_words_for(ww, for_center=False):
        print(nw.ljust(20), dist)