In [None]:
library(tokenizers)
library(stringr)
library(neuralnet)
library(testthat)
options(warn=-1)

## The Word2Vec Assignment

In the assignment for the semantic models chapter, we are going to apply word2vec to a somewhat larger corpus of text drawn from the children's book "Green Eggs and Ham" by Dr. Seuss. Again we use a children's book because the text is repetitive, so the model has common contexts to work with but doesn't take too long to train.

Note that I have removed quotation marks because they break the rules for field names in R, and I have changed question marks to the token QUESTION for the same reason.

In [None]:
text = "DO YOU LIKE GREEN EGGS AND HAM QUESTION

I DO NOT LIKE THEM SAM I AM .
I DO NOT LIKE GREEN EGGS AND HAM .

WOULD YOU LIKE THEM HERE OR THERE QUESTION

I WOULD NOT LIKE THEM HERE OR THERE .
I WOULD NOT LIKE THEM ANYWHERE .
I DO NOT LIKE GREEN EGGS AND HAM .
I DO NOT LIKE THEM SAM I AM .

WOULD YOU LIKE THEM IN A HOUSE QUESTION
WOULD YOU LIKE THEM WITH A MOUSE QUESTION

I DO NOT LIKE THEM IN A HOUSE .
I DO NOT LIKE THEM WITH A MOUSE .
I DO NOT LIKE THEM HERE OR THERE .
I DO NOT LIKE THEM ANYWHERE .
I DO NOT LIKE GREEN EGGS AND HAM .
I DO NOT LIKE THEM SAM I AM .

WOULD YOU EAT THEM IN A BOX QUESTION
WOULD YOU EAT THEM WITH A FOX QUESTION

NOT IN A BOX . NOT WITH A FOX .
NOT IN A HOUSE . NOT WITH A MOUSE .
I WOULD NOT EAT THEM HERE OR THERE .
I WOULD NOT EAT THEM ANYWHERE .
I WOULD NOT EAT GREEN EGGS AND HAM .
I DO NOT LIKE THEM SAM I AM .

WOULD YOU QUESTION COULD YOU QUESTION IN A CAR QUESTION
EAT THEM EAT THEM HERE THEY ARE .

I WOULD NOT COULD NOT IN A CAR .

YOU MAY LIKE THEM . YOU WILL SEE .
YOU MAY LIKE THEM IN A TREE

I WOULD NOT COULD NOT IN A TREE .
NOT IN A CAR YOU LET ME BE .
I DO NOT LIKE THEM IN A BOX .
I DO NOT LIKE THEM WITH A FOX .
I DO NOT LIKE THEM IN A HOUSE .
I DO NOT LIKE THEM WITH A MOUSE .
I DO NOT LIKE THEM HERE OR THERE .
I DO NOT LIKE THEM ANYWHERE .
I DO NOT LIKE GREEN EGGS AND HAM .
I DO NOT LIKE THEM SAM I AM ."

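# Lowercase the text and split it into tokens (Penn Treebank rules)
corpus = tokenize_ptb(tolower(text))
corpus
T = length(corpus[[1]])   # number of tokens; note this masks R's T shorthand for TRUE
print(paste0("corpus size = ", T))
# Sanity check: print the position of any stray "1" token
for (i in 1:T){
    if (corpus[[1]][i] == "1"){
        print(i)
    }
}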

As in the tutorial, we create the vocabulary (vocab) and the index (I) that maps words into unique dimensions in preparation for creating the training patterns.

In [None]:
vocab = unique(str_sort(corpus[[1]]))
V = length(vocab)
I = 1:V
names(I) = vocab

print(paste0("vocabulary size = ", V))
vocab
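
To see what the index does, here is a minimal sketch on a toy sentence. The names `toy`, `toy_vocab` and `toy_I` are made up for this illustration, and base R's `sort` stands in for `str_sort`. Each distinct word gets one position, and looking a word up by name returns its dimension.

```r
# Toy corpus, not part of the assignment data
toy = c("i", "do", "not", "like", "them", "i", "am")

toy_vocab = unique(sort(toy))   # sorted, duplicates dropped
toy_I = seq_along(toy_vocab)    # one dimension per distinct word
names(toy_I) = toy_vocab

toy_vocab
toy_I[["like"]]                 # dimension assigned to "like"
```

The repeated word "i" appears only once in `toy_vocab`, so the vocabulary is smaller than the corpus.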


Now we can create the dataframe that contains our patterns. Note that we set the context size (C) to 1 in this case. Later, we will explore how changing the context size changes the output of the model. To do that, you will need to adjust C in this cell and regenerate the dataframe.

In [None]:
# Column names: one context unit and one target unit per vocabulary word
contextvars = paste0(vocab, "C")
targetvars = paste0(vocab, "T")

C = 1 # context size
df = data.frame(matrix(0, T, V*2))

colnames(df) = c(contextvars, targetvars)

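for (i in 1:T){
    # Switch on the target unit for the word at position i
    target = corpus[[1]][i]
    df[i, I[[target]]+V] = 1
    # Switch on a context unit for each neighbour within C positions
    for (j in (i-C):(i+C)){
        if (j >= 1 && j <= T && i != j){
            context = corpus[[1]][j]
            df[i, I[[context]]] = 1
        }
    }
}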

targetvars = paste(targetvars, collapse="+")
contextvars = paste(contextvars, collapse="+")
df
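
The effect of the context size is easier to see on a tiny corpus. The sketch below repeats the pattern-building logic in a hypothetical helper, `make_patterns` (not part of the assignment code), and counts how many context units are switched on in each row for C = 1 and C = 2. It uses a plain matrix and `match` instead of the dataframe and index above, purely to keep the example short.

```r
# Hypothetical helper: one row per token, context columns first, then target columns
make_patterns = function(tokens, C) {
    vocab = unique(sort(tokens))
    V = length(vocab)
    n = length(tokens)
    m = matrix(0, n, V * 2)
    colnames(m) = c(paste0(vocab, "C"), paste0(vocab, "T"))
    for (i in 1:n) {
        m[i, match(tokens[i], vocab) + V] = 1        # target unit
        for (j in (i - C):(i + C)) {
            if (j >= 1 && j <= n && i != j) {
                m[i, match(tokens[j], vocab)] = 1    # context unit
            }
        }
    }
    m
}

toy = c("sam", "i", "am", "not")               # toy corpus with four distinct words
rowSums(make_patterns(toy, C = 1)[, 1:4])      # context units on per row: 1 2 2 1
rowSums(make_patterns(toy, C = 2)[, 1:4])      # context units on per row: 2 3 3 2
```

Words at the edges of the corpus have fewer neighbours, so their rows have fewer context units switched on. Note also that the units are binary: a word that occurs twice inside the window still only sets its unit to 1.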


### CBoW Model

Now we are ready to train the CBoW version of the model. This code also outputs the sum of squared errors (SSE) that the model produces and the number of steps that it took to reach convergence. As the name suggests, to calculate the SSE we take the differences between the teacher and the activation of each output unit for all patterns, square them and then add. The network is reproducing the teacher patterns well when the SSE is close to zero.

In [None]:
NumHidden = 5
set.seed(9)
strformula = paste(targetvars, "~", contextvars)
nn = neuralnet(as.formula(strformula), data=df, hidden=NumHidden, act.fct="logistic", linear.output=FALSE)
print("Error:")
print(nn$result.matrix[[1]])    # first entry of result.matrix is the error
print("Number of steps:")
print(nn$result.matrix[[3]])    # third entry is the number of training steps
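
As a worked illustration of the SSE calculation described above, the sketch below uses made-up teacher patterns and output activations (two patterns, three output units). These numbers are invented for the example and do not come from the model.

```r
# Made-up teacher patterns and output activations (2 patterns, 3 output units)
teacher = matrix(c(1, 0, 0,
                   0, 1, 0), nrow = 2, byrow = TRUE)
output  = matrix(c(0.9, 0.2, 0.1,
                   0.3, 0.6, 0.1), nrow = 2, byrow = TRUE)

# Difference per output unit, squared, then summed over all units and patterns
sse = sum((teacher - output)^2)
sse   # 0.06 for the first pattern plus 0.26 for the second, 0.32 in total
```

As far as I know, the "sse" error function that neuralnet uses by default includes a factor of one half, so the value it reports would be half of a sum computed this way. Treat that as an assumption and check the package documentation before comparing numbers directly.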

And now we can plot the hierarchical cluster diagram of the weights.

In [None]:
# Hidden-to-output weights (bias row dropped), transposed so each row is a word
ws = data.frame(t(nn$weights[[1]][[2]][2:(NumHidden+1), 1:V]))
rownames(ws) = vocab
options(repr.plot.width=8, repr.plot.height=15)
dist_mat <- dist(ws, method = 'euclidean')
hclust_avg <- hclust(dist_mat, method = 'average')
plot(as.dendrogram(hclust_avg), horiz=TRUE)
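
To make the clustering step concrete, here is a small sketch with made-up weight vectors for four words (the values and the names `toy_ws` and `toy_hc` are invented for illustration). Words with similar weight vectors are a short Euclidean distance apart and are merged first.

```r
# Made-up weight vectors for four words (2 hidden units each)
toy_ws = data.frame(h1 = c(0.9, 1.0, -0.8, -0.9),
                    h2 = c(0.1, 0.2,  0.7,  0.8))
rownames(toy_ws) = c("box", "house", "do", "would")

# Same distance measure and linkage method as in the cell above
toy_hc = hclust(dist(toy_ws, method = "euclidean"), method = "average")

# Cut the tree into two groups and show which group each word falls in
cutree(toy_hc, k = 2)
```

In this toy case "box" and "house" end up in one group and "do" and "would" in the other, because their weight vectors were chosen to be close. In the dendrogram, the same information shows up as words joined by short branches.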

### Skipgram Model

Now we will do the same for the skipgram version of the model.

In [None]:
strformula = paste(contextvars, "~", targetvars)
NumHidden = 5
set.seed(9)
nnSkipgram = neuralnet(as.formula(strformula), data=df, hidden=NumHidden, act.fct="logistic", linear.output=FALSE)
print("Error:")
print(nnSkipgram$result.matrix[[1]])    # first entry of result.matrix is the error
print("Number of steps:")
print(nnSkipgram$result.matrix[[3]])    # third entry is the number of training steps

And now we can plot the hierarchical cluster diagram for the skipgram version of the model.

In [None]:
# Input-to-hidden weights (bias row dropped); each row is a word
ws = data.frame(nnSkipgram$weights[[1]][[1]][2:(V+1),1:NumHidden])
rownames(ws) = vocab
options(repr.plot.width=8, repr.plot.height=15)
dist_mat <- dist(ws, method = 'euclidean')
hclust_avg <- hclust(dist_mat, method = 'average')
plot(as.dendrogram(hclust_avg), horiz=TRUE)


### Exercise 1

A major part of the work in applying neural networks to a problem is deciding which parameters to use. We will focus on two main ones for this assignment: the context size and the number of hidden units.

The context size determines how many words to the left and right of the target word the model considers. The number of hidden units determines how many units are in the middle layer that maps the inputs to the outputs. Changing these parameters affects the performance of the network.

Using the code above, run the CBoW and Skipgram models with different values of the context size and different numbers of hidden units. Specifically, consider context sizes of 1 or 2 and 5, 10 or 15 hidden units. Tabulate the error and the number of steps taken to converge.
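
One possible way to keep track of the runs is sketched below. It builds an empty table with one row per combination and fills it in as you go. The table name `results` and the helper `record_run` are made up for this illustration; any layout that shows all twelve combinations is fine.

```r
# Empty table: one row per model / context size / hidden-unit combination
results = expand.grid(model = c("CBoW", "Skipgram"),
                      C = c(1, 2),
                      NumHidden = c(5, 10, 15),
                      stringsAsFactors = FALSE)
results$error = NA_real_
results$steps = NA_real_

# Hypothetical helper: store the error and step count printed by one training run
record_run = function(tab, model, C, NumHidden, error, steps) {
    row = tab$model == model & tab$C == C & tab$NumHidden == NumHidden
    tab$error[row] = error
    tab$steps[row] = steps
    tab
}

nrow(results)   # 12 combinations to fill in
```

After each run, pass in the values the training cell printed, for example `results = record_run(results, "CBoW", 1, 5, nn$result.matrix[[1]], nn$result.matrix[[3]])`.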


YOUR ANSWER HERE

### Exercise 2

Why are the CBoW error values different from the Skipgram values?


YOUR ANSWER HERE

### Exercise 4

How do the error values typically change as a function of the number of hidden units? Why would that be the case?


YOUR ANSWER HERE

### Exercise 5

In general, the more hidden units you have the lower the error becomes. Does that mean that more hidden units are always better? 

Compare the hierarchical cluster diagrams of the CBoW model with a context size of 1, once with 10 hidden units and once with 100 hidden units. Pay particular attention to where "tree" is placed relative to the other nouns like "box" and "house". Which makes more sense? When will fewer hidden units be better?


YOUR ANSWER HERE