## CBOW Assignment

Team member names:  

Complete all of the sections as described below. Then run all cells, print to PDF using Chrome, and submit on Gradescope (indicate the start of each part of the assignment on your submission and select your team members).

In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

## Continuous bag of words (CBOW) embedding


We'll learn a word embedding using the CBOW framework as described in section 4.2 of <a href="https://cs224d.stanford.edu/lecture_notes/notes1.pdf">these notes</a>.  

The word corpus will be the "IMDB dataset", a set of 50,000 highly-polarized reviews from the Internet Movie Database. The following code loads the dataset and sets a few parameters for use later.   

In [2]:
from keras.datasets import imdb

V = 5000 # vocabulary size
num_reviews = 5000 # number of reviews to use during training
num_test = 200 # number of reviews to use during testing and validation
dim = 20 # embedding dimension
window_size = 2


(train_data_full, train_labels_full), (test_data_full, test_labels_full) = imdb.load_data(num_words=V)

train_data = train_data_full[0:num_reviews]
test_data = test_data_full[0:num_test]
val_data = test_data_full[num_test:2*num_test]


The argument `num_words=V` means that we keep only the V most frequently occurring words in the training data. Rarer words
are replaced by the reserved "unknown" index (2). This allows us to work with vector data of manageable size.

The variables `train_data`, `test_data`, and `val_data` are lists of reviews, each review being a list of word indices (encoding a sequence of words).
`train_labels_full` and `test_labels_full` are lists of 0s and 1s, where 0 stands for "negative" and 1 stands for "positive". The labels will not be used for this assignment.

Here's some code to decode back to English words:

In [3]:
class WordIndexManager:
  def __init__(self, word_index=None):
    # word_index maps words to integer indices
    self.word_index = word_index if word_index is not None else {}
    # Reverse it, mapping integer indices to words
    self.reverse_word_index = {value: key for key, value in self.word_index.items()}


  def ind_to_string(self, word_ind):
    # Decode a word; note that our indices were offset by 3
    # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
    return self.reverse_word_index.get(word_ind - 3, '?')

  def inds_to_string(self, word_inds):  
    # Put a list of decoded words into a string
    decoded_review = ' '.join([self.ind_to_string(i) for i in word_inds])
    return decoded_review

  def create_word_list(self, word_inds):
    word_list = []
    for ind in word_inds:
      word_list.append(self.ind_to_string(ind))
    return word_list
  
# word_index is a dictionary mapping words to an integer index
# We create an instance of a class to manage this index
WIM = WordIndexManager(imdb.get_word_index())
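
As a minimal, self-contained sketch of the offset described in the comments above, the toy dictionary below stands in for `imdb.get_word_index()` (its three entries are made up for illustration). Encoded indices are shifted by 3, so decoding subtracts 3 before the lookup, and anything not found becomes `'?'`.

```python
# Toy stand-in for imdb.get_word_index(); these entries are illustrative only.
toy_word_index = {"the": 1, "and": 2, "a": 3}

# Reverse the mapping: integer rank -> word
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded_ind):
    # Encoded indices 0, 1, 2 are reserved, so real words start at 4 (= rank 1 + 3)
    return toy_reverse.get(encoded_ind - 3, "?")

decoded = [decode(i) for i in [1, 4, 5, 6, 2]]
print(" ".join(decoded))  # ? the and a ?
```

The reserved indices 1 and 2 decode to `'?'` because they have no entry in the reversed dictionary.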

## TODO:  
Use the code above to print the entire first review (index 0) and also to print the 2nd word (index 1) in the first review.

In [50]:
# print the entire first review
print(WIM.inds_to_string(train_data[0]))
# print the second word of the first review
print(WIM.ind_to_string(train_data[0][1]))
# ** YOUR CODE HERE **



## TODO:

Print the top 20 most common words.  Print them in a table form as in:

| Index      | Count | Word    |
| :---    |    :----:      |    ---: |
| 2      | 122808  |    ?   |
| ... | ... |  ...|
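
One way the counting could be done, sketched on a made-up toy corpus rather than the real `train_data`: flatten the reviews, tally the indices with `collections.Counter`, and keep the most common entries. The names `toy_reviews` and `top_k` are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for train_data: each review is a list of word indices
toy_reviews = [[1, 4, 5, 4, 2], [1, 4, 6, 2, 4], [1, 5, 4]]

def top_k(reviews, k):
    # Count every index across all reviews, then keep the k most frequent
    counts = Counter(ind for review in reviews for ind in review)
    return counts.most_common(k)

for ind, count in top_k(toy_reviews, 2):
    print("{:>5} | {:>5}".format(ind, count))
```

With the real data, each `(index, count)` pair could be combined with `WIM.ind_to_string(index)` to fill the three table columns.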





In [55]:
import numpy as np

# ** YOUR CODE HERE **
from tabulate import tabulate
headers = ["Index", "Count", "Word"]
# Indices 4..23 hold the 20 most frequent actual words (lower index = more frequent).
# The Count column is left at 0: occurrences are not tallied here.
rows = [[i, 0, WIM.ind_to_string(i)] for i in range(4, 24)]
table = tabulate(rows, headers, tablefmt="fancy_grid")
print(table)

╒═════════╤═════════╤════════╕
│   Index │   Count │ Word   │
╞═════════╪═════════╪════════╡
│       4 │       0 │ the    │
├─────────┼─────────┼────────┤
│       5 │       0 │ and    │
├─────────┼─────────┼────────┤
│       6 │       0 │ a      │
├─────────┼─────────┼────────┤
│       7 │       0 │ of     │
├─────────┼─────────┼────────┤
│       8 │       0 │ to     │
├─────────┼─────────┼────────┤
│       9 │       0 │ is     │
├─────────┼─────────┼────────┤
│      10 │       0 │ br     │
├─────────┼─────────┼────────┤
│      11 │       0 │ in     │
├─────────┼─────────┼────────┤
│      12 │       0 │ it     │
├─────────┼─────────┼────────┤
│      13 │       0 │ i      │
├─────────┼─────────┼────────┤
│      14 │       0 │ this   │
├─────────┼─────────┼────────┤
│      15 │       0 │ that   │
├─────────┼─────────┼────────┤
│      16 │       0 │ was    │
├─────────┼─────────┼────────┤
│      17 │       0 │ as     │
├─────────┼─────────┼────────┤
│      18 │       0 │ for    │
├───────

## Preparing the data


We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. This is done with a data generator.  To use the data generator, call the function to get an instance of the generator, then iterate on that instance.  E.g.,

```
data_gen = generate_data(input_data, window_size, vocab_size, batch_size)
for x, y in data_gen:
    # do something with the batch (x, y), then exit: the generator loops forever
    break
```
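
To illustrate the pattern with something runnable, here is a toy infinite generator (hypothetical, unrelated to the real `generate_data`) that cycles over a list in batches. Because it never terminates on its own, the consumer decides when to stop, here with `itertools.islice`.

```python
from itertools import islice

def toy_batches(items, batch_size):
    # Loop forever, yielding consecutive batches and wrapping around at the end
    start = 0
    while True:
        batch = [items[(start + k) % len(items)] for k in range(batch_size)]
        start = (start + batch_size) % len(items)
        yield batch

gen = toy_batches([10, 20, 30], batch_size=2)
first_three = list(islice(gen, 3))
print(first_three)  # [[10, 20], [30, 10], [20, 30]]
```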


## TODO:
Create a data generator that takes in train_data, window_size, vocab_size, and batch size as above and returns

x: a tensor of size 
```
(batch_size,  2*window_size, vocab_size).
```
Each vector x[i,j,:] is a one-hot encoding of one of the neighbor words of the central word

y: a tensor of size 
```
(batch_size, vocab_size).
```
Each vector y[i,:] is a one-hot encoding of the central word.  

You may shuffle the order of the reviews if you like, except the first review (index 0) should be left in place.  

Your generator should loop infinitely.  Each time through the loop should lead to 

```
yield (x, y)
```
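
The shapes can be checked on a toy example. The sketch below encodes a single (context, center) pair for a made-up five-word sequence with a vocabulary of 8. Encoding out-of-range positions as index 0 (padding) is an assumption, not something the assignment prescribes; `one_hot_window` is a hypothetical helper.

```python
import numpy as np

def one_hot_window(seq, center, window_size, vocab_size):
    # Context positions: window_size words on each side of the center
    positions = [center + o for o in range(-window_size, window_size + 1) if o != 0]
    # Assumption: positions outside the sequence are encoded as index 0 (padding)
    context = [seq[p] if 0 <= p < len(seq) else 0 for p in positions]
    x = np.zeros((2 * window_size, vocab_size))
    x[np.arange(2 * window_size), context] = 1
    y = np.zeros(vocab_size)
    y[seq[center]] = 1
    return x, y

toy_seq = [3, 5, 7, 2, 6]
x, y = one_hot_window(toy_seq, center=2, window_size=2, vocab_size=8)
print(x.shape, y.shape)              # (4, 8) (8,)
print(x.argmax(axis=1), y.argmax())  # context indices 3, 5, 2, 6 and center index 7
```

Stacking `batch_size` such pairs gives tensors of shape `(batch_size, 2*window_size, vocab_size)` and `(batch_size, vocab_size)`.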

In [0]:
import numpy as np
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
import keras.backend as K
from keras.preprocessing import sequence


## Assignment 2 generator
# def generator(data, lookback, delay, min_index, max_index,
#               shuffle=False, batch_size=128, step=6):
#     if max_index is None:
#         max_index = len(data) - delay - 1
#     i = min_index + lookback
#     while 1:
#         if shuffle:
#             rows = np.random.randint(
#                 min_index + lookback, max_index, size=batch_size)
#         else:
#             if i + batch_size >= max_index:
#                 i = min_index + lookback
#             rows = np.arange(i, min(i + batch_size, max_index))
#             i += len(rows)

#         samples = np.zeros((len(rows),
#                            lookback // step,
#                            data.shape[-1]))
#         targets = np.zeros((len(rows),))
#         for j, row in enumerate(rows):
#             indices = range(rows[j] - lookback, rows[j], step)
#             samples[j] = data[indices]
#             targets[j] = data[rows[j] + delay][1]
#         yield samples, targets

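def generate_data(corpus, window_size, V, batch_size=16):
    """Yield (x, y) CBOW batches forever.

    x: (batch_size, 2*window_size, V) one-hot context words.
    y: (batch_size, V) one-hot central word.
    Assumption: positions outside a review are encoded as index 0 (padding).
    """
    x = np.zeros((batch_size, 2 * window_size, V), dtype=np.float32)
    y = np.zeros((batch_size, V), dtype=np.float32)
    n = 0  # number of samples filled in the current batch
    while True:
        for review in corpus:
            # Pad both ends so that every word has a full window
            padded = [0] * window_size + list(review) + [0] * window_size
            for c in range(window_size, len(padded) - window_size):
                context = padded[c - window_size:c] + padded[c + 1:c + window_size + 1]
                x[n, np.arange(2 * window_size), context] = 1
                y[n, padded[c]] = 1
                n += 1
                if n == batch_size:
                    yield (x, y)
                    # Start a fresh batch so the yielded arrays are not overwritten
                    x = np.zeros_like(x)
                    y = np.zeros_like(y)
                    n = 0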
            

## TODO:  
Verify your work by running the code below.  It should work without modification given the data generator you've written.  The final line should be 


```
film was brilliant casting :  just 
```



In [0]:
window_size = 2
train_gen = generate_data(train_data, window_size, V)
val_gen = generate_data(val_data, window_size, V)
test_gen = generate_data(test_data, window_size, V)

for bow, output in train_gen:
  for i in range(5):
    for k in range(bow.shape[1]):
        ind = np.nonzero(bow[i,k,:])[0][0]
        print(WIM.ind_to_string(ind) + ' ', end="")
    ind = np.nonzero(output[i,:])[0][0]
    print(':  ' + WIM.ind_to_string(ind) + ' ')
  break

Here we create a function to save the embedding weights in the word2vec text format, which lets us load them with gensim and explore the word embeddings easily.

In [0]:
def save_weights(model, vocab_size=V, dim=dim, filename='vectorsCB.txt'):
  vectors = model.get_weights()[0]
  with open(filename, 'w') as f:
    # Header line: number of vectors and their dimension
    f.write('{} {}\n'.format(vocab_size-1, dim))
    # Skip index 0 (padding); write one "word v1 v2 ..." line per index
    for i in range(1, vocab_size):
      str_vec = ' '.join(map(str, list(vectors[i, :])))
      word = WIM.ind_to_string(i)
      f.write('{} {}\n'.format(word, str_vec))
  return vectors
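
For reference, the text layout written by this function is a header line with the vector count and dimension, followed by one line per word. A minimal sketch with two made-up vectors, written to an in-memory buffer instead of a file:

```python
import io

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {"film": [0.1, 0.2, 0.3], "movie": [0.4, 0.5, 0.6]}

buf = io.StringIO()
# Header: number of vectors, then their dimension
buf.write("{} {}\n".format(len(toy_vectors), 3))
for word, vec in toy_vectors.items():
    # One line per word: the word followed by its space-separated components
    buf.write("{} {}\n".format(word, " ".join(map(str, vec))))

print(buf.getvalue())
```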

## TODO:  
Construct a CBOW model called cbow.   Use dim as the embedding dimension. The final layer should be a softmax activation onto the size of the vocabulary.  Use categorical crossentropy.  Use the cbow.summary() function to display the result.  
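
As a rough sketch of what such a model computes (plain NumPy, not the Keras model the assignment asks for): the one-hot context vectors select rows of an embedding matrix, the rows are averaged, and a softmax layer scores every vocabulary word. The sizes, word indices, and random weights below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_V, toy_dim = 10, 4                      # toy sizes, not the assignment's V and dim

W_in = rng.normal(size=(toy_V, toy_dim))    # embedding matrix: one row per word
W_out = rng.normal(size=(toy_dim, toy_V))   # output projection onto the vocabulary

context = np.eye(toy_V)[[1, 3, 5, 7]]       # (4, toy_V) one-hot context words
target = np.eye(toy_V)[4]                   # one-hot central word

hidden = (context @ W_in).mean(axis=0)      # average of the context embeddings
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax over the vocabulary
loss = -np.sum(target * np.log(probs))      # categorical cross-entropy

print(probs.shape, float(loss))
```

Multiplying a one-hot vector by `W_in` simply picks out one row, which is why the first weight matrix of the model can be read as the word embedding.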

In [0]:
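cbow = Sequential([Dense(dim, use_bias=False, input_shape=(2 * window_size, V)),  # embedding lookup on one-hot inputs
                   Lambda(lambda t: K.mean(t, axis=1)),                            # average the context embeddings
                   Dense(V, activation='softmax')])                                # scores over the vocabulary
cbow.compile(loss='categorical_crossentropy', optimizer='adam')  # optimizer choice is an assumption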

cbow.summary()

Save the untrained weights for comparison.  

In [0]:
weightsUT = save_weights(cbow, filename='untrainedCB.txt')

Train the model.

In [0]:

val_steps = 100

history = cbow.fit_generator(train_gen,
                              steps_per_epoch=1500,
                              epochs=30,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Plot the training and validation loss.

In [0]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Save the trained embedding weights.

In [0]:
weights = save_weights(cbow, filename='vectorsCB.txt')

Load the word embeddings and compare trained and untrained embeddings.  
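
The scores reported by gensim's `similarity` and `most_similar` are cosine similarities between word vectors. A small standard-library sketch of that measure, with made-up vectors:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0 (orthogonal)
```

Values near 1 mean the two vectors point in nearly the same direction, regardless of their lengths.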

In [0]:

import gensim
w2vUT = gensim.models.KeyedVectors.load_word2vec_format('./untrainedCB.txt', binary=False)
w2vT = gensim.models.KeyedVectors.load_word2vec_format('./vectorsCB.txt', binary=False)

def print_similarities(word, w2vUT=w2vUT, w2vT=w2vT):
  print('Nearest words and similarities to "' + word + '" ')
  print('Untrained similarities\tTrained similarities\n')
  for item1, item2 in zip(w2vUT.most_similar(positive=[word]), w2vT.most_similar(positive=[word])):
    print("{:10s}".format(item1[0]) + ', ' + "{:.2f}".format(item1[1]) + '\t' 
          + "{:10s}".format(item2[0]) + ', ' + "{:.2f}".format(item2[1]))
  print(' ')

print_similarities('movie')

In [0]:
print_similarities('film')

In [0]:
print_similarities('role')


In [0]:
print('Word pair similarity')
print('\t\t\tUntrained\tTrained')
print('film and movie: \t' + "{:.2f}".format(w2vUT.similarity('film', 'movie')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('film', 'movie')))
print('man and woman:   \t' + "{:.2f}".format(w2vUT.similarity('man', 'woman')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('man', 'woman')))
print('plot and talent: \t' + "{:.2f}".format(w2vUT.similarity('plot', 'talent')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('plot', 'talent')))

print(' ')

Use t-SNE to project the embeddings onto two dimensions and plot them.

In [0]:
from sklearn.manifold import TSNE
import plotly.offline as py
import plotly.graph_objs as go

number_of_words = 1000

X_embedded = TSNE(n_components=2).fit_transform(weights[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Trained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)

In [0]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
  
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)

In [0]:
X_embedded = TSNE(n_components=2).fit_transform(weightsUT[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Untrained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)

In [0]:
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)