## Skip Gram Assignment

Team member names:  Max Ruby, Carlos Salinas

Complete all of the sections as described below. Then run all cells, print to PDF using Chrome, and submit on Gradescope, marking the start of each part of the assignment in your submission and selecting your team members.

## TODO:
Complete the programming tasks below.  Then answer these questions briefly.  

Q1:  What are the differences in the untrained words similar to 'film' between the CBOW example and the Skip Gram example?  Do you think that's because of CBOW vs Skip Gram or the neural network architecture?

Q2:  Why do you think the Skip Gram example produces so many words close to each target word?  How could you improve the performance of this example?  

In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

In [2]:
from keras.datasets import imdb

V = 5000 # vocabulary size
num_reviews = 5000 # number of reviews to use during training
num_test = 100 # number of reviews to use during testing and validation
dim = 20 # embedding dimension
window_size = 2


(train_data_full, train_labels_full), (test_data_full, test_labels_full) = imdb.load_data(num_words=V)
train_data = train_data_full[0:num_reviews]
test_data = test_data_full[0:num_test]
val_data = test_data_full[num_test:2*num_test]


The argument `num_words=V` means that we keep only the `V` most frequently occurring words in the training data. Rarer words are replaced by a reserved "unknown" index, which decodes to '?' below. This allows us to work with vector data of manageable size.

The variables `train_data` and `test_data` are lists of reviews, each review being a list of word indices (encoding a sequence of words). 
`train_labels_full` and `test_labels_full` are lists of 0s and 1s, where 0 stands for "negative" and 1 stands for "positive". The labels are not used in this assignment. 
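
To make the index convention concrete, here is a minimal sketch of how a review ends up as a list of indices. It assumes the Keras defaults for `imdb.load_data` (`start_char=1`, `oov_char=2`, `index_from=3`); the `toy_ranks` dictionary and the `encode` helper are hypothetical stand-ins for illustration and are not part of the Keras API.

```python
# Hypothetical frequency ranks (1 = most frequent), standing in for imdb.get_word_index()
toy_ranks = {"the": 1, "film": 2, "was": 3, "brilliant": 4, "zeitgeist": 5}

START_CHAR, OOV_CHAR, INDEX_FROM = 1, 2, 3  # assumed Keras defaults


def encode(words, num_words):
    """Encode words as indices; any index >= num_words becomes OOV_CHAR."""
    encoded = [START_CHAR]
    for word in words:
        index = toy_ranks[word] + INDEX_FROM
        encoded.append(index if index < num_words else OOV_CHAR)
    return encoded


review = ["the", "film", "was", "brilliant", "zeitgeist"]
encoded_review = encode(review, num_words=8)
print(encoded_review)
```

With `num_words=8` the rarest toy word falls outside the vocabulary and is stored as the "unknown" index, which is why decoded reviews contain '?' in place of rare words.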

Here's some code to decode back to English words:

In [3]:
class WordIndexManager:
  def __init__(self, word_index=None):
    # Avoid a mutable default argument; an empty index decodes every word to '?'
    self.word_index = word_index if word_index is not None else {}
    # Reverse the index, mapping integer indices to words
    self.reverse_word_index = {value: key for (key, value) in self.word_index.items()}

  def ind_to_string(self, word_ind):
    # Decode a word; note that our indices were offset by 3
    # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
    return self.reverse_word_index.get(word_ind - 3, '?')

  def inds_to_string(self, word_inds):  
    # Put a list of decoded words into a string
    decoded_review = ' '.join([self.ind_to_string(i) for i in word_inds])
    return decoded_review
  
  def create_word_list(self, word_inds):
    word_list = []
    for ind in word_inds:
      word_list.append(self.ind_to_string(ind))
    return word_list

# word_index is a dictionary mapping words to an integer index
# We create an instance of a class to manage this index
word_index = imdb.get_word_index()
WIM = WordIndexManager(word_index)



Here are some examples of how to use the word index manager.

In [5]:
# print the first review
print(WIM.inds_to_string(train_data[0]))

# print the second word
print(WIM.ind_to_string(train_data[0][1]))


? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly ? was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little ? that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big ? for the whole film but these children are amazing and should be ? for what they have done don't you think the whole story was

## TODO:
Determine the maximum word length, and for each length from 1 to max, print the number of unique words of that length.  

In [7]:
import numpy as np

# ** YOUR CODE HERE ** to define max_word_len and word_lens, which is an array 
# so that word_lens[i] is the length of the word with index i
max_word_len = 0
word_lens = np.zeros(V)
for i in range(0,V):
    word = WIM.ind_to_string(i)
    word_lens[i] = len(word)
    max_word_len = max(max_word_len,len(word))

print('Max word length is ' + str(max_word_len))

for i in range(max_word_len, 0, -1):
    print('Num of words of length ' + str(i) + ' is ' + str(np.sum(word_lens == i)))



Max word length is 16
Num of words of length 16 is 2
Num of words of length 15 is 2
Num of words of length 14 is 7
Num of words of length 13 is 29
Num of words of length 12 is 55
Num of words of length 11 is 130
Num of words of length 10 is 221
Num of words of length 9 is 363
Num of words of length 8 is 527
Num of words of length 7 is 781
Num of words of length 6 is 886
Num of words of length 5 is 876
Num of words of length 4 is 710
Num of words of length 3 is 283
Num of words of length 2 is 87
Num of words of length 1 is 41
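
The same tally can be sketched with `collections.Counter`, which avoids allocating an array of lengths. The `toy_words` list is a hypothetical stand-in for the decoded vocabulary; passing a `set` of words instead would count each distinct word once.

```python
from collections import Counter

# Hypothetical stand-in for the decoded vocabulary
toy_words = ["a", "i", "the", "and", "film", "movie", "brilliant"]

# Map each word length to the number of words with that length
length_counts = Counter(len(word) for word in toy_words)
max_len = max(length_counts)

for length in range(max_len, 0, -1):
    # Counter returns 0 for lengths that never occur
    print('Num of words of length', length, 'is', length_counts[length])
```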


Determine the set of unique characters in the text. We add '?' to the list to allow for the '?' used in place of unknown words.  

In [9]:
# List of unique characters in the corpus
chars = set()
for word in word_index:
  chars |= set(word)  # add every character that appears in this word
chars = ['?'] + sorted(chars)

print(chars)

['?', '\x08', '\x10', "'", '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\x80', '\x84', '\x85', '\x8d', '\x8e', '\x91', '\x95', '\x96', '\x97', '\x9a', '\x9e', '\xa0', '¡', '¢', '£', '¤', '¦', '§', '¨', '«', '\xad', '®', '°', '³', '´', '·', 'º', '»', '½', '¾', '¿', 'À', 'Á', 'Ã', 'Ä', 'Å', 'È', 'É', 'Ê', 'Õ', 'Ø', 'Ü', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ō', '–', '‘', '’', '“', '”', '…', '₤', '\uf0b7']


## TODO:
Define a dictionary called char_indices that maps each character in the list above to an index into new_chars (below), so that a-z are preserved, '?' is preserved, and all other characters are mapped to '*'. The output below should be 
```
a
z
*
?
```
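
One way to build such a mapping is a dictionary comprehension with a fallback to the position of '*'. This is only a sketch: `toy_new_chars`, `toy_chars`, and `toy_char_indices` are hypothetical names chosen so they do not clash with the variables in the assignment.

```python
import string

# Same layout as new_chars: padding '.', catch-all '*', unknown '?', then a-z
toy_new_chars = ['.', '*', '?'] + list(string.ascii_lowercase)

# Only '?' and the letters keep their own position; everything else falls back to '*'
kept = {ch: i for i, ch in enumerate(toy_new_chars) if i >= 2}
star = toy_new_chars.index('*')

toy_chars = ['?', "'", '3', 'a', 'z', 'é', '\x80']  # hypothetical sample of the character list
toy_char_indices = {ch: kept.get(ch, star) for ch in toy_chars}

decoded = [toy_new_chars[toy_char_indices[ch]] for ch in toy_chars]
print(decoded)
```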

In [23]:
new_chars = ['.', '*', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# ** YOUR CODE HERE ** to define char_indices
char_indices = dict()
for char in chars:
    char_indices[char] = 1

char_indices['?'] = 2

for i in range(26):
    char_indices[chr(ord('a') + i)] = i + 3  # 'a'..'z' occupy positions 3..28 of new_chars

print(new_chars[char_indices['a']])
print(new_chars[char_indices['z']])
print(new_chars[char_indices['\x80']])
print(new_chars[char_indices['?']])



a
z
*
?


## Preparing the data


We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. This is done with a data generator.  To use the data generator, call the function to get an instance of the generator, then iterate on that instance.  E.g.,

```
data_gen = generate_data(input_data, window_size, vocab_size, batch_size)
for x,y in data_gen:
    do something
```
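
Keras expects a generator passed to `fit_generator` to loop over its data indefinitely, so the consumer decides when to stop. The sketch below shows that pattern on plain lists; `toy_batches` is a hypothetical helper, not the generator required by the assignment.

```python
import itertools


def toy_batches(items, batch_size):
    """Yield consecutive full batches forever, restarting when the data runs out."""
    while True:
        for start in range(0, len(items) - batch_size + 1, batch_size):
            yield items[start:start + batch_size]


batch_gen = toy_batches([10, 20, 30, 40, 50], batch_size=2)

# Take only the first three batches; without islice (or a break) the loop never ends
first_three = list(itertools.islice(batch_gen, 3))
print(first_three)
```

The trailing element `50` never appears because it cannot fill a complete batch, which mirrors the fixed batch shape a network expects.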

## TODO:
Fill in the code below to build one-hot letter by letter encodings from word indices and to generate training batches.  

For build_words, each word index produces an array of size (max_word_len, len(new_chars)), where each vector [i,:] is a one-hot encoding of the ith letter (starting from index 0). 
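
As an illustration of this layout, the following sketch encodes a single word with nested Python lists instead of a NumPy array. The names `toy_alphabet` and `one_hot_word` are hypothetical, and the alphabet is assumed to have the same ordering as new_chars.

```python
import string

# Assumed to match new_chars: padding '.', catch-all '*', unknown '?', then a-z
toy_alphabet = ['.', '*', '?'] + list(string.ascii_lowercase)
lookup = {ch: i for i, ch in enumerate(toy_alphabet) if i >= 2}


def one_hot_word(word, max_len):
    """Return max_len rows; row j is the one-hot encoding of letter j, or all zeros as padding."""
    rows = [[0] * len(toy_alphabet) for _ in range(max_len)]
    for j, ch in enumerate(word[:max_len]):
        rows[j][lookup.get(ch, 1)] = 1  # unknown characters go to '*'
    return rows


encoded = one_hot_word("film", max_len=6)
for row in encoded:
    print(row)
```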

For generate_data, each batch should consist of 

```
x: (batch_size,  max_word_len, len(new_chars))
y: (batch_size, V)
```
x can be constructed from build_words using the indices for the words at the center of each sample. y is a probability distribution over the vocabulary using the context words within window_size before and after the central word for a given sample.  

E.g., if the indices in a passage are 45, 13, 98, 6, 9, 18, then one sample would construct x[0,:,:] from 98, and y[0,:] would have entries 0.25 in each of indices 45, 13, 6, 9.  The next sample would construct x[1,:,:] from 6, and y[1,:] would have entries 0.25 in each of indices 13, 98, 9, 18.  If a context word is repeated, then it gets proportionally more weight.  
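
The worked example above can be checked with a few lines of plain Python. The helper `context_distribution` is hypothetical, and clipping the window at the ends of the passage is an assumption made here; the assignment text does not spell out how boundaries are handled.

```python
from collections import Counter


def context_distribution(passage, pos, window_size):
    """Return {word_index: probability} for the context words around passage[pos]."""
    start = max(0, pos - window_size)                  # assumption: clip at the start
    stop = min(len(passage), pos + window_size + 1)    # assumption: clip at the end
    context = [passage[k] for k in range(start, stop) if k != pos]
    counts = Counter(context)
    return {word: count / len(context) for word, count in counts.items()}


passage = [45, 13, 98, 6, 9, 18]
dist_98 = context_distribution(passage, 2, window_size=2)
dist_6 = context_distribution(passage, 3, window_size=2)

# A repeated context word receives proportionally more weight
repeated = context_distribution([7, 7, 98, 6, 9], 2, window_size=2)
print(dist_98, dist_6, repeated)
```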

In [0]:
import numpy as np
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
import keras.backend as K
from keras.preprocessing import sequence


# Take in a list of word indices
# Convert each word to an array of size (max_word_len, len(new_chars))
# Use one-hot encoding on each letter and pad with 0s to fill up max_word_len
def build_words(word_inds, max_word_len=max_word_len, char_list=new_chars, char_indices=char_indices):
    # Builtin bool replaces np.bool, which newer NumPy releases no longer provide
    output = np.zeros((len(word_inds), max_word_len, len(char_list)), dtype=bool)
    # ** YOUR CODE HERE ** to build the output    
    # Decode to a list of words; inds_to_string would return one joined string
    words = WIM.create_word_list(word_inds)
    for i, word in enumerate(words):
        for j, char in enumerate(word[:max_word_len]):
            output[i, j, char_indices[char]] = 1
    return output
      
# Print the words generated from build_words  
def print_words(center_words, char_map=new_chars):
  for word in center_words:
    for i in range(word.shape[0]):
      cur_ind = np.argwhere(word[i,:] > 0)  # Find the index of the current letter
      if len(cur_ind) > 0:
        print(char_map[cur_ind[0][0]], end='')  # Print this letter using the character map
      else: 
        print('')
        break
    
# Generate training samples:  
# Training input is batch_size x max_word_len x len(new_chars)
# Each element in a batch is encoded using build_words - one hot encoding of letters in a word
# Each label is a probability distribution on the vocabulary: equal weighting for each context word
def generate_data(corpus, window_size, V, batch_size=16):
    # ** YOUR CODE HERE ** to generate batches
    # Assumption: the window is clipped at review boundaries, so the first and
    # last words of a review simply have fewer context words.
    while True:
        centers, labels = [], []
        for review in corpus:  # iterate over the corpus passed in, not the global num_reviews
            for pos, center in enumerate(review):
                start = max(0, pos - window_size)
                stop = min(len(review), pos + window_size + 1)
                context = [review[k] for k in range(start, stop) if k != pos]
                if not context:
                    continue
                target = np.zeros(V)
                for word in context:
                    target[word] += 1.0 / len(context)  # repeated words get more weight
                centers.append(center)
                labels.append(target)
                if len(centers) == batch_size:
                    yield build_words(centers), np.array(labels)
                    centers, labels = [], []

Test your generator.  The final line of context words should be 


```
the 0.25 they 0.25 played 0.25 suited 0.25
```
and the final line of center words should be


```
part
```




In [0]:

window_size = 2
train_gen = generate_data(train_data, window_size, V)
val_gen = generate_data(val_data, window_size, V)
test_gen = generate_data(test_data, window_size, V)

for center_words, contexts in train_gen:
  print('** Context words **')
  for vector in contexts:
    inds = np.argwhere(vector > 0)
    for j in range(inds.shape[0]):
      cur_ind = inds[:,0][j]
      print(WIM.ind_to_string(cur_ind) + ' ' + str(vector[cur_ind]) + ' ' , end='')
    print('')
  print('\n** Center words **')
  print_words(center_words)
  break

## TODO:
Construct the model in two stages.  First construct an LSTM layer called word_embedding.  This should use an embedding dimension of dim and take training samples from your generator.

Then create the skip_gram model by using the word_embedding model followed by a dense layer with softmax activation onto the vocabulary.  
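
As a sanity check on the two summaries, the expected parameter counts can be computed by hand. This sketch assumes that word_embedding is a single standard LSTM layer with bias, with dim units over inputs of len(new_chars) = 29 features, and that skip_gram adds one dense softmax layer directly on top; other architectures will give different numbers. The helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units


num_chars, dim, V = 29, 20, 5000  # values used in this notebook

embedding_params = lstm_param_count(num_chars, dim)
softmax_params = dense_param_count(dim, V)
print(embedding_params, softmax_params, embedding_params + softmax_params)
```

Under these assumptions almost all of the trainable parameters sit in the softmax layer rather than in the embedding itself.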

In [0]:

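from keras.layers import LSTM

# ** YOUR CODE HERE ** for word_embedding
# One possible answer (a sketch, not the official solution): a single LSTM that
# reads a word letter by letter and returns its final state as the embedding.
word_embedding = Sequential()
word_embedding.add(LSTM(dim, input_shape=(max_word_len, len(new_chars))))

word_embedding.summary()

# ** YOUR CODE HERE ** for skip_gram
# Sketch: the embedding model followed by a softmax over the vocabulary.
skip_gram = Sequential()
skip_gram.add(word_embedding)
skip_gram.add(Dense(V, activation='softmax'))

skip_gram.compile(loss='categorical_crossentropy', optimizer='rmsprop')
skip_gram.summary()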

Here we define a function to save the embedding weights.  In this case we need some separate code to get the weights from the model.  
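
The file is later read by gensim's `load_word2vec_format` with `binary=False`, which expects the word2vec text format: a header line with the number of vectors and their dimension, followed by one line per word. The sketch below writes that format to an in-memory buffer; `write_word2vec_text` and the toy vectors are hypothetical and only illustrate the layout.

```python
import io


def write_word2vec_text(handle, words, vectors):
    """Write a 'count dim' header followed by one 'word v1 v2 ...' line per word."""
    dim = len(vectors[0])
    handle.write('{} {}\n'.format(len(words), dim))
    for word, vec in zip(words, vectors):
        handle.write('{} {}\n'.format(word, ' '.join(map(str, vec))))


buffer = io.StringIO()
write_word2vec_text(buffer, ['film', 'movie'], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
text = buffer.getvalue()
print(text)
```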

In [0]:

def save_weights(weights, vocab_size=V, dim=dim, filename='vectorsSG.txt'):
  # Write a "count dim" header, then one word and its vector per line
  with open(filename, 'w') as f:
    f.write('{} {}\n'.format(vocab_size-1, dim))
    for i in range(1, vocab_size):
      str_vec = ' '.join(map(str, list(weights[i, :])))
      word = WIM.ind_to_string(i)
      f.write('{} {}\n'.format(word, str_vec))
  


Here we use the LSTM to get the embedding for all of the words in the vocabulary and then save the corresponding weights.

In [0]:
# Construct word embedding dictionary
def get_weights(inds):
  cur_words = build_words(inds)
  return word_embedding.predict(cur_words)


weightsUT = get_weights(range(V))
print(weightsUT.shape)
save_weights(weightsUT, filename='untrainedSG.txt')

Train the model.

In [0]:

val_steps = 100

history = skip_gram.fit_generator(train_gen,
                              steps_per_epoch=1500,
                              epochs=6,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Plot the training and validation loss.

In [0]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Save the trained word embeddings.

In [0]:
all_words = build_words(range(V))
weights = word_embedding.predict(all_words)
print(weights.shape)
save_weights(weights, filename='vectorsSG.txt')

Load the word embeddings and compare trained and untrained embeddings.  
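
The similarities reported by gensim's `most_similar` and `similarity` are cosine similarities between embedding vectors. A minimal sketch of that computation, using hypothetical three-dimensional `toy_vectors` in place of the saved embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: 'movie' points the same way as 'film', 'plot' is orthogonal
toy_vectors = {'film': [1.0, 2.0, 0.0], 'movie': [2.0, 4.0, 0.0], 'plot': [0.0, 0.0, 3.0]}


def nearest(word):
    """Return the other word whose vector has the highest cosine similarity to word."""
    others = [w for w in toy_vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(toy_vectors[word], toy_vectors[w]))


print(nearest('film'), cosine_similarity(toy_vectors['film'], toy_vectors['plot']))
```

Because cosine similarity ignores vector length, two words count as close whenever their embeddings point in a similar direction, regardless of magnitude.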

In [0]:

import gensim
w2vUT = gensim.models.KeyedVectors.load_word2vec_format('./untrainedSG.txt', binary=False)
w2vT = gensim.models.KeyedVectors.load_word2vec_format('./vectorsSG.txt', binary=False)

def print_similarities(word, w2vUT=w2vUT, w2vT=w2vT):
  print('Nearest words and similarities to "' + word + '" ')
  print('Untrained similarities\tTrained similarities\n')
  for item1, item2 in zip(w2vUT.most_similar(positive=[word]), w2vT.most_similar(positive=[word])):
    print("{:10s}".format(item1[0]) + ', ' + "{:.2f}".format(item1[1]) + '\t' 
          + "{:10s}".format(item2[0]) + ', ' + "{:.2f}".format(item2[1]))
  print(' ')

print_similarities('movie')

In [0]:
print_similarities('film')

In [0]:
print_similarities('role')

In [0]:
print('Word pair similarity')
print('\t\t\tUntrained\tTrained')
print('film and movie: \t' + "{:.2f}".format(w2vUT.similarity('film', 'movie')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('film', 'movie')))
print('man and woman:   \t' + "{:.2f}".format(w2vUT.similarity('man', 'woman')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('man', 'woman')))
print('plot and talent: \t' + "{:.2f}".format(w2vUT.similarity('plot', 'talent')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('plot', 'talent')))

print(' ')

Use t-SNE to project the embeddings of the first 1000 words into two dimensions and plot them.

In [0]:
from sklearn.manifold import TSNE
import plotly.offline as py
import plotly.graph_objs as go

number_of_words = 1000

X_embedded = TSNE(n_components=2).fit_transform(weights[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Trained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)


In [0]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
  
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)

In [0]:
X_embedded = TSNE(n_components=2).fit_transform(weightsUT[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Untrained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)


In [0]:
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)