<a href="https://colab.research.google.com/github/dgromann/SemComp_WS2018/blob/master/Tutorial6/Tutorial6_model_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Lesson 0.0.0: Store this notebook! 

Go to "File" and make sure you save a copy of this file to either GitHub or your Google Drive. If you do not have a Google account and do not want to create one, please check Option C below. 

Option A) Google Drive WITH collaboration

If you want to work in a collaborative manner where each of you in the group can see each other's contributions, one of you needs to store the notebook in Google Drive and share it with the others. You share it by clicking on the SHARE button at the top right of this page and sending the link, with the "everyone who receives this link can edit" option enabled, to the other team members by e-mail, Skype, or any other way you prefer.

If you work with others, remember to always copy the code before you edit it and always add your name as a comment (e.g. #Dagmar ) in the cell so that it is clear who wrote which part. I also recommend creating a new code cell for your contributions.

Option B) GitHub without collaboration

Collaborative functions are not available when storing the notebook in GitHub; you will see your own work but not that of others.


Option C) Download this notebook as ipynb (Jupyter notebook) or py (Python file)

Running either of these on your local machine requires installing the necessary programs, which for the first tutorial are Python and NLTK. More will be needed as we continue on to machine learning (requiring sklearn) and deep learning (requiring tensorflow and/or pytorch). In Google Colab all of these are provided and do not need to be installed locally.


# Lesson 0.1: Pytorch tutorial - basics

In order to get started with deep learning and to code up neural networks in practice, we need to familiarize ourselves with the packages available for this purpose. There are two basic open source machine learning frameworks to choose from: 

*   TensorFlow (Google)
*   Torch (Facebook, Google DeepMind, Twitter)

Since these are high-level core libraries, it is easier to use a framework that builds on top of them and adds some usability and documentation. For once we are not going to use the Google solution, but will go with the Facebook solution, Pytorch. This first part of today's tutorial introduces some core concepts of Pytorch before we start working with embeddings.





In [0]:
# Let's first install pytorch 
!pip3 install torch torchvision

The most basic and important concept in Pytorch is that of a **Tensor**. To speed up computation and offer more flexibility, Pytorch replaces numpy arrays with tensors.

In [0]:
import torch
import numpy 

# Seed for random number generator to ensure reproducibility of
# random initializations
torch.manual_seed(1)

#Difference between tensor and numpy array
a = torch.ones(5)
print("Tensor: ")
print(a)
print("Numpy array: ")
print(a.numpy(), "\n")


# This creates a randomly initialized 5 x 3 matrix
rand = torch.rand(5,3)
print("Randomly initialized tensor: ", rand, "\n")

# This creates a 5 x 3 matrix filled with zeros and of dtype long
# torch.long is a signed 64-bit integer; the other available dtypes are listed at
# https://pytorch.org/docs/stable/tensors.html
zeros = torch.zeros(5,3, dtype=torch.long)
print("Tensor initialized with zeros: ", zeros, "\n")

# Directly initialize a tensor with data 
data = torch.tensor([5.5, 3])
print("Tensor initilialized with data: ", data, "\n")

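# new_ones creates a new tensor of ones that inherits properties such as dtype
# and device from an existing tensor unless they are overridden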
redefined = rand.new_ones(5, 3, dtype=torch.double)
print("Redefined randomly initialized tensor: ", redefined, "\n")

redefined_too = torch.rand_like(redefined, dtype=torch.float)
print("Initializing randomly based on the size of redefined: ", redefined_too, "\n")


# Get the size - this is actually a tuple that supports tuple operations 
print(redefined_too.size(), "\n")

Tensor: 
tensor([1., 1., 1., 1., 1.])
Numpy array: 
[1. 1. 1. 1. 1.] 

Randomly initialized tensor:  tensor([[0.7576, 0.2793, 0.4031],
        [0.7347, 0.0293, 0.7999],
        [0.3971, 0.7544, 0.5695],
        [0.4388, 0.6387, 0.5247],
        [0.6826, 0.3051, 0.4635]]) 

Tensor initialized with zeros:  tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]) 

Tensor initilialized with data:  tensor([5.5000, 3.0000]) 

Redefined randomly initialized tensor:  tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64) 

Initializing randomly based on the size of redefined:  tensor([[0.4550, 0.5725, 0.4980],
        [0.9371, 0.6556, 0.3138],
        [0.1980, 0.4162, 0.2843],
        [0.3398, 0.5239, 0.7981],
        [0.7718, 0.0112, 0.8100]]) 

torch.Size([5, 3]) 
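One detail that the conversion with `a.numpy()` above does not show: for CPU tensors, the tensor and the NumPy array share the same memory, so a change to one is visible in the other. The following is a minimal sketch of this behaviour; the variable names are chosen only for illustration.

```python
import numpy as np
import torch

# A CPU tensor and the NumPy array obtained from it share memory
t = torch.ones(3)
arr = t.numpy()
t[0] = 5.0           # change the tensor ...
print(arr)           # ... and the change is visible through the array

# The same holds in the other direction with torch.from_numpy
source = np.zeros(3)
shared = torch.from_numpy(source)
source[0] = 7.0
print(shared)

# torch.tensor(...) copies the data instead of sharing it
copied = torch.tensor(source)
source[1] = 9.0
print(copied)
```

If you need an independent copy, `torch.tensor(...)` (or `.clone()` on a tensor) is the safer choice.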



Tensors in torch also support basic operations: 

In [0]:
# Addition of tensors matching in size
x = torch.rand(5, 3)
y = torch.rand(5, 3)
print("Addition: ", x + y, "\n")
print("Addition alternative syntax: ", torch.add(x, y), "\n")

# Addition that writes the result into a preallocated output tensor
result = torch.empty(5, 3)
print("Addition with tensor as argument: ", torch.add(x, y, out=result), "\n")


All operations that mutate a tensor in place are marked with a trailing underscore _ as in the example below:

In [0]:
# add x to y
y.add_(x)
print("Adding x to y: ", y)

# you can also add a number 
print("Adding 1: ", y.add_(1))
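To make the difference concrete, the small sketch below contrasts `add`, which returns a new tensor, with `add_`, which overwrites the tensor it is called on. The values are chosen purely for illustration.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

# Out-of-place: returns a new tensor, a is left unchanged
c = a.add(b)
a_after_add = a.clone()
print(a_after_add)
print(c)

# In-place: a itself is overwritten with the sum
a.add_(b)
print(a)
```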


Indexing operations of tensors follow the numpy standard:

In [0]:
# Indexing
print("X: ", x, "\n")
print("Element at index one of each row of the matrix ", x[:, 1], "\n")

Resizing: if you wish to change the shape of a tensor, you can use its `view` method (`Tensor.view`):

In [0]:
# Exercise: resizing: what effect do the following
# resize operations have on the tensors
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)
print("Resizing: ")
print("Original")
print(x)
print("Resized view(16)", y, "\n")
print("Resized view(-1, 8)", z, "\n")
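As a hedged illustration of what happens in this exercise, the sketch below uses a tensor with known values instead of random ones. It shows that `-1` lets Pytorch infer the missing dimension and that a view shares its data with the original tensor.

```python
import torch

x = torch.arange(16).view(4, 4)   # values 0..15 arranged as 4 x 4
flat = x.view(16)                 # one dimension with 16 elements
wide = x.view(-1, 8)              # -1 is inferred: 16 / 8 = 2 rows

print(flat.shape)
print(wide.shape)

# A view shares the underlying data with the original tensor
flat[0] = 100
print(x[0, 0])
```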

**Gradients and Backpropagation**

If you set the attribute ```.requires_grad``` of a ```torch.Tensor``` to ```True```, Pytorch tracks all operations on it so that gradients can later be computed by backpropagation, which is central to training neural networks. 

When you have finished all computations on your tensor, you can call ```.backward()``` to have all gradients computed automatically. The gradient is then accumulated in the attribute ```.grad```. 

If you wish to exclude a specific tensor from this tracking, you can call ```.detach()```, which returns a tensor that is detached from the computation history so that future computations on it are not tracked. Alternatively, you can wrap a code block in the context manager ```with torch.no_grad():```, which disables tracking for all operations inside the block. This is particularly helpful when evaluating a model whose trainable parameters have ```requires_grad=True``` but for which we do not need gradients during evaluation. 

There’s one more class which is very important for autograd implementation - a `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph that encodes the complete history of the computation. Each tensor has a `.grad_fn` attribute that references the `Function` that created it (except for tensors created by the user, whose `grad_fn` is `None`).

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you do not need to pass any arguments to `backward()`; if it has more elements, you need to pass a `gradient` argument that is a tensor of matching shape.



In [0]:
# Tensor that requires gradients = operations on it are being tracked 
x = torch.ones(2, 2, requires_grad=True)
print(x)

# Let's do some operation 
y = x + 2 
print(y)

# y was created by an operation, so it has a grad_fn 
print(y.grad_fn)

# Some more operations 
z = y * y * 3
out = z.mean()

print(z, out)


# Gradients
# Let's calculate the gradient d(out)/dx and print it
out.backward()
print(x.grad)

# Stop autograd from tracking 
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)
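The cell above only covers the scalar case. The following sketch, under the same setup, checks the gradient by hand and adds the two cases described in the text but not shown: calling `backward()` on a non-scalar tensor with a `gradient` argument, and cutting a tensor off from the graph with `detach()`. The tensors `v`, `w` and `d` are illustrative names that are not part of the tutorial code.

```python
import torch

# Scalar output: out = mean(3 * (x + 2)^2), so d(out)/dx = 1.5 * (x + 2)
x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()
print(x.grad)                      # every entry is 1.5 * 3 = 4.5

# Non-scalar output: backward() needs a gradient tensor of matching shape
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = v * 2
w.backward(gradient=torch.ones_like(w))
print(v.grad)                      # dw/dv = 2 for every entry

# detach() returns a tensor that is no longer part of the graph
d = w.detach()
print(d.requires_grad)
```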
    

# Lesson 1:  Word Embeddings - first steps

We will start by looking at embeddings in Pytorch and then play with existing embeddings. We will cover training in Pytorch next week. 

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Again random number generator to ensure reproducibility
torch.manual_seed(1)

<torch._C.Generator at 0x7fb9fc9caa70>

Here is a mini-example of how to initialize the layer randomly and with only two words.

In [0]:
# Map each word to an integer index that is used to look up its embedding
word_to_ix = {"hello": 0, "world": 1}

# Initialize the embedding layer (nn = neural network) with the size of the
# vocabulary and the dimensionality of the vectors
# here: two words, vectors of 5 dimensions as output
embeds = nn.Embedding(2, 5) 

# Create a lookup tensor holding the indices of all words
lookup_tensor = torch.tensor(list(word_to_ix.values()), dtype=torch.long)

# Retrieve the randomly initialized embeddings for these indices
embeddings = embeds(lookup_tensor)
print(embeddings)

tensor([[ 0.5088, -0.3628, -0.2213,  1.1521,  0.2863],
        [-0.2330, -0.0370,  0.3126,  1.3573,  0.4418]],
       grad_fn=<EmbeddingBackward>)
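To look up the vector of a single word rather than the whole vocabulary, you can wrap just that word's index in a tensor. The sketch below rebuilds the two-word example so that it runs on its own and also shows that the lookup simply selects a row of the layer's weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)

# Look up a single word: wrap its index in a tensor of dtype long
hello_idx = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_vec = embeds(hello_idx)
print(hello_vec.shape)

# The lookup selects the corresponding row of the weight matrix
same_row = torch.equal(hello_vec[0], embeds.weight[0])
print(same_row)
```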


Let's train our first embeddings. What do SGD and lr mean in the code below? What happens if you increase lr and increase the number of iterations?

The implementation below is only a toy example. For a better version, see [this word2vec implementation in Pytorch](https://adoni.github.io/2017/11/08/word2vec-pytorch/)

In [0]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

# deduplicate 
vocab = set(test_sentence)

# generate the word index
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)

# Exercise: What do SGD and lr mean? What happens if you change them?
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses) # The loss decreased every iteration over the training data!
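As a hint for the exercise: SGD stands for stochastic gradient descent and lr is the learning rate, i.e. the step size of each parameter update. The sketch below is a deliberately simplified stand-in for the n-gram model. It minimises the toy function f(w) = (w - 3)^2, and the helper `minimise` is a hypothetical name introduced only for this illustration.

```python
import torch
import torch.optim as optim


def minimise(lr, steps=50):
    """Run plain SGD on f(w) = (w - 3)^2 and return the final w."""
    w = torch.zeros(1, requires_grad=True)
    optimizer = optim.SGD([w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()        # clear the accumulated gradients
        loss = ((w - 3) ** 2).sum()
        loss.backward()              # df/dw = 2 * (w - 3)
        optimizer.step()             # w <- w - lr * df/dw
    return w.item()


for lr in (0.001, 0.1, 1.1):
    print(lr, minimise(lr))
```

With a very small learning rate, w moves only slowly towards the minimum at 3, so more iterations are needed. With a moderate learning rate it gets close within 50 steps. With a learning rate that is too large, every step overshoots and the values diverge.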

# Lesson 2: Use existing embeddings

The code below exemplifies how to load a trained embedding model in the gensim library. 

In [2]:
# Let's first load a small subset of word2vec embeddings that have been trained on a 
# large corpus of news documents  
!wget https://github.com/dgromann/SemanticComputing/raw/master/tutorial6/word2vec_embeddings.bin
!wget https://raw.githubusercontent.com/dgromann/SemComp_WS2018/master/Tutorial6/analogy.txt
!pip3 install gensim

--2018-12-21 10:49:43--  https://github.com/dgromann/SemanticComputing/raw/master/tutorial6/word2vec_embeddings.bin
Resolving github.com (github.com)... 140.82.118.3, 140.82.118.4
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dgromann/SemanticComputing/master/tutorial6/word2vec_embeddings.bin [following]
--2018-12-21 10:49:43--  https://raw.githubusercontent.com/dgromann/SemanticComputing/master/tutorial6/word2vec_embeddings.bin
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96769269 (92M) [application/octet-stream]
Saving to: ‘word2vec_embeddings.bin’


2018-12-21 10:49:45 (161 MB/s) - ‘word2vec_embeddings.bin’ saved [96769269/96769269]

--2018-

In [0]:
import gensim
from sklearn.decomposition import PCA
from matplotlib import pyplot

import warnings
warnings.filterwarnings('ignore')

In [0]:
# Let's load the model
model = gensim.models.KeyedVectors.load_word2vec_format("word2vec_embeddings.bin", binary=True)

# Print the size of the whole vocabulary
# (gensim >= 4.0; older versions used len(model.vocab) instead)
print(len(model.key_to_index))

# Print the embedding of a specific word 
print(model["good"])

# Get the 10 most similar words of "good"
most_similar = model.most_similar("good", topn=10)

# Check whether our embeddings are good at the analogy task
analogy = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

In [5]:
# Exercise: get the length and datatype of the vector - 
# what does the length tell us and how would it influence our parameters when training? 
print("Number of dimensions of the embeddings when training", len(model["good"]))
print("Data type ", type(model["good"]))

# Exercise: implement your own version of the analogy check without using a 
# built in function 
analogy_2 = model.similar_by_vector((model["woman"]+model["king"]) - model["man"], topn=10)


Number of dimensions of the embeddings when training 300
Data type  <class 'numpy.ndarray'>
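To see what the vector arithmetic behind the analogy does, here is a toy sketch with hand-made two-dimensional vectors. These are not real word2vec values, and the helpers `cosine` and `solve_analogy` are illustrative names rather than gensim functions. It computes `king - man + woman` and picks the remaining word with the highest cosine similarity.

```python
import numpy as np

# Hand-made 2-d toy vectors (royalty, gender); NOT real word2vec values
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(target, vec)
                  for w, vec in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)


print(solve_analogy("man", "king", "woman", toy))  # queen
```

Excluding the input words matters: otherwise one of them is often the nearest neighbour of the combined vector.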


Evaluate the loaded subset of the embeddings on the analogy.txt file from the GitHub repository. Each type of analogy is marked by a header line starting with ": " at the beginning of its subset. 

For instance, the analogy of capitals and countries is marked by ": capital-common-countries" before the data start. Provide an accuracy measure for each type of analogy and compare on which it performs best.
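One possible way to organise the per-category bookkeeping is sketched below, independently of the embedding model. It assumes the file layout described above, and `predict` is a hypothetical callable that returns candidate words for "a is to b as c is to ?". A stub with made-up data stands in for the model here.

```python
from collections import defaultdict


def category_accuracy(lines, predict):
    """Return {category: accuracy} for analogy lines grouped by ': ' headers."""
    correct = defaultdict(int)
    total = defaultdict(int)
    category = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            category = line[2:]      # a header line starts a new category
            continue
        a, b, c, answer = line.split()
        total[category] += 1
        if answer in predict(a, b, c):
            correct[category] += 1
    return {name: correct[name] / total[name] for name in total}


# Tiny stand-in data and a stub predictor, for illustration only
sample = [": family", "boy girl king queen", "boy girl man woman",
          ": capital", "Athens Greece Oslo Norway"]
stub = {("boy", "girl", "king"): ["queen"], ("boy", "girl", "man"): ["lady"]}
result = category_accuracy(sample, lambda a, b, c: stub.get((a, b, c), []))
print(result)  # {'family': 0.5, 'capital': 0.0}
```

With the loaded embeddings, `predict` could wrap `model.most_similar(positive=[b, c], negative=[a], topn=3)` and return only the words from the resulting (word, score) pairs.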

In [11]:
# Exercise: Evaluate the embeddings on the analogy text file (already downloaded for you)
# To do so you need to load the file and test the analogy task on each line apart from 
# the header lines marked with ": "
# If you want, it would also be interesting to get accuracies for the different types of 
# analogies that are separated by a header line marked with ": "

analogy = open("analogy.txt", "r")

corr = 0
line_counter = 0
final_results = {}
line_count = {}
overall_count = 0
overall_line_counter = 0
for line in analogy.readlines():
    if not ": " in line:
        line_counter += 1
        overall_line_counter += 1
        first = line.split()[0]
        second = line.split()[1]
        third = line.split()[2]
        answer = line.split()[3]
        # Solution checking only the single closest word (Accuracy approx.: 0.756)
        # results = model.most_similar(positive=[second, third], negative=[first], topn=1)
        #if answer in results[0][0]:

        #Solution with checking the top three results (Accuracy see below)
        results = model.most_similar(positive=[second, third], negative=[first], topn=3)
        if answer in results[0][0] or answer in results[1][0] or answer in results[2][0]:
            corr += 1
            overall_count += 1
    else:
      if corr == 0:
        name = line.split(" ")[1].replace("\n", "")
      else: 
        final_results[name] = corr
        line_count[name] = line_counter
        corr = 0
        line_counter = 0
        name = line.split(" ")[1].replace("\n", "")

print(final_results)

print("Overall accuracy in top three results : ", str(overall_count/overall_line_counter))


for key, value in final_results.items():
  print("Accuracy for ", key, ": \n", final_results[key]/line_count[key], "\n")

analogy.close()

{'capital-common-countries': 198, 'capital-world': 308, 'city-in-state': 969, 'family': 175, 'gram1-adjective-to-adverb': 372, 'gram2-opposite': 318, 'gram3-comparative': 1291, 'gram4-superlative': 263, 'gram5-present-participle': 897, 'gram6-nationality-adjective': 613, 'gram7-past-tense': 1207, 'gram8-plural': 646}
Overall accuracy in top three results :  0.8620344635908839
Accuracy for  capital-common-countries : 
 0.9428571428571428 

Accuracy for  capital-world : 
 0.9746835443037974 

Accuracy for  city-in-state : 
 0.8769230769230769 

Accuracy for  family : 
 0.9615384615384616 

Accuracy for  gram1-adjective-to-adverb : 
 0.458128078817734 

Accuracy for  gram2-opposite : 
 0.6284584980237155 

Accuracy for  gram3-comparative : 
 0.9692192192192193 

Accuracy for  gram4-superlative : 
 0.9669117647058824 

Accuracy for  gram5-present-participle : 
 0.9042338709677419 

Accuracy for  gram6-nationality-adjective : 
 0.9668769716088328 

Accuracy for  gram7-past-tense : 
 0.90615