# Learning Word2Vec with Nathaniel Tucker

This notebook follows along with the excellent notebook created by Nathaniel Tucker:
    
https://github.com/knathanieltucker/tf-keras-tutorial/blob/master/WordRepresentations.ipynb

## Good old word vectors

Let's start off with something classic: word vectors, a learned representation of the meanings of words.

This matters because I want to show you how to do representation learning in a couple of ways. The first way is to solve a prediction problem, which is often the most common and most useful approach. It applies when we want to solve a prediction problem and the features we have are qualitative and interact with one another, e.g. language. Another common example is recommending movies using collaborative filtering.

We don't have an easy or intuitive way to represent qualitative features, so we learn the representations instead. As a byproduct, we get representations that are geared toward our prediction problem.

In [1]:
# we initialize some hyperparams
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
INDEX_FROM = 3
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

We will try to predict whether movie reviews are good or bad, so we will use the IMDB dataset (it might take some time to download the reviews):

In [2]:
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS, index_from=INDEX_FROM)

Using TensorFlow backend.


In [3]:
ls ~/.keras/datasets/imdb*

/Users/dazconap/.keras/datasets/imdb.npz
/Users/dazconap/.keras/datasets/imdb_full.pkl
/Users/dazconap/.keras/datasets/imdb_word_index.json


In [15]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((25000, 1000), (25000,), (25000,), (25000,))

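Notice that the data has already been converted to a form that we like: each word is now an index (we will talk about why this is so important later). Judging by its cell number (In [15]), the shape check above was run after the padding step further down, which would explain why x_train already has a fixed length of 1000 while x_test is still a one-dimensional array of variable-length lists.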

In [5]:
x_train[0][:6]

[1, 33, 86, 168, 7, 4]

In [16]:
y_train[0]

0

That said, we can convert it back to text by using the word index dictionary. Notice that the first three indices in the dictionary are reserved: padding, start character, and unknown word (like a proper noun).

In [6]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
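The same dictionary also works in the other direction, from text to indices. Here is a minimal sketch of that round trip, assuming a tiny made-up vocabulary: `toy_word_to_id`, `encode`, and `decode` are hypothetical names for illustration, and only the three reserved indices mirror the ones used above.

```python
# Hypothetical toy vocabulary; only the reserved indices (0, 1, 2) mirror the notebook.
PAD_ID, START_ID, UNK_ID = 0, 1, 2
toy_word_to_id = {"<PAD>": PAD_ID, "<START>": START_ID, "<UNK>": UNK_ID,
                  "the": 4, "film": 5, "was": 6, "awful": 7}
toy_id_to_word = {v: k for k, v in toy_word_to_id.items()}

def encode(text, vocab):
    """Map a whitespace-tokenized string to indices, starting with <START>."""
    ids = [START_ID]
    for token in text.lower().split():
        ids.append(vocab.get(token, UNK_ID))  # words outside the vocabulary become <UNK>
    return ids

def decode(ids, inverse_vocab):
    """Map indices back to a space-separated string of tokens."""
    return " ".join(inverse_vocab[i] for i in ids)

encoded = encode("The film was truly awful", toy_word_to_id)
decoded = decode(encoded, toy_id_to_word)
print(encoded)
print(decoded)
```

The word "truly" is not in the toy vocabulary, so it comes back as `<UNK>`, which is the same thing that happens to words outside the top MAX_NB_WORDS in the real dataset.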

In [8]:
# invert the mapping, then decode the first review (avoid shadowing the builtin id)
id_to_word = {value: key for key, value in word_to_id.items()}
print(' '.join(id_to_word[i] for i in x_train[0]))

<START> at first look of the plot tagline i figured it could have been a decent film could i have ever been more wrong the beginning of the film makes it look like a bunch of freaks got together and decided to make a low budget film for the first 10 minutes you don't notice the cheesy acting horrible sound and god awful special effects but then it gets worse just about 20 minutes into it i was asking myself what was the plot again i could only ask that question when i wasn't busted out laughing from the sheer <UNK> of this film the main actor has one setting for emotions and he sticks to it throughout the entire film even though he was supposed to go through love and hate and everything in between the flashback scene almost made me vomit because it made me re live one extra minute of footage from earlier in the movie now we hit the middle of the film where they are obviously trying to rip off <UNK> from the matrix although he is doing just a horrible job the actor's talking about star 

In [9]:
from keras.preprocessing.sequence import pad_sequences

x_train = pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH) # same length now

In [10]:
x_train

array([[   0,    0,    0, ...,  106,   14,   20],
       [   0,    0,    0, ...,   40,    4, 3196],
       [   0,    0,    0, ..., 4276,    7,  265],
       ..., 
       [   0,    0,    0, ...,  418,    7,  595],
       [   0,    0,    0, ...,   81,   67,   12],
       [   0,    0,    0, ..., 2171,   47,  421]], dtype=int32)
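The zeros on the left of each row come from padding. The following is a rough sketch of that behaviour in plain NumPy, assuming the Keras defaults of padding and truncating at the start of the sequence. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) lists of ints to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # an empty review stays all padding
        trunc = seq[-maxlen:]             # keep only the last maxlen tokens
        out[row, -len(trunc):] = trunc    # right-align; padding stays on the left
    return out

toy = [[1, 33, 86], [1, 7, 4, 168, 9, 12]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Short sequences gain leading zeros (the `<PAD>` index), while long ones lose their earliest tokens, so every row ends up the same length.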

Our labels will just be 0 or 1 (bad or good):

In [11]:
y_train

array([0, 0, 1, ..., 0, 1, 1])

Finally, we will build our network. Check out the embedding layer. The really cool thing about the embedding layer is that it is just one big weight matrix in which each row is the embedding of one index, or in this case one word. That is why the input needs to be an index: it tells us which row to look up in the weight matrix!
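To make the lookup idea concrete, here is a small NumPy sketch with toy sizes and random weights (an illustration, not the Keras layer itself). It shows that indexing rows of the matrix gives the same result as multiplying one-hot vectors by it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3             # toy sizes, not the notebook's
weights = rng.normal(size=(vocab_size, embedding_dim))  # the "one big weight matrix"

token_ids = np.array([1, 4, 4, 2])           # a toy sequence of word indices

# Lookup: pick one row of the matrix per index.
looked_up = weights[token_ids]               # shape (4, 3)

# Equivalent but wasteful formulation: one-hot vectors times the matrix.
one_hot = np.eye(vocab_size)[token_ids]      # shape (4, 6)
via_matmul = one_hot @ weights

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Repeated indices (the two 4s here) map to the same row, so a word has the same vector wherever it appears in the review.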

We then feed the embeddings into a CNN to predict whether the review is good or bad.

In [12]:
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding

embedding_layer = Embedding(MAX_NB_WORDS,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

# define a 1D convnet; the last max-pooling layer spans the whole remaining sequence
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)
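The pool size of 35 in the last pooling layer follows from the sequence length. The sketch below traces the length through the stack, assuming the Keras defaults ('valid' padding and stride 1 for Conv1D, stride equal to the pool size for MaxPooling1D). The helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    """Output length of a 'valid', stride-1 1D convolution."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size):
    """Output length of 'valid' max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
layers = [("conv", 5), ("pool", 5), ("conv", 5),
          ("pool", 5), ("conv", 5), ("pool", 35)]
for kind, size in layers:
    if kind == "conv":
        length = conv1d_len(length, size)
    else:
        length = maxpool1d_len(length, size)
    trace.append(length)

print(trace)
```

Under those assumptions, the last convolution leaves 35 time steps, so a pool of 35 collapses them to one and acts as a global max pool. Flatten then passes 128 values to the dense layer.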

There is one final thing we need to do before training the model, and it is specific to TensorFlow and thus Keras. TensorFlow allows us to visualize embeddings, but it needs a little more information about the embeddings that it will visualize: specifically, which index is which word. That is what we output below:

In [13]:
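import csv
import os

# Python 3: the csv module needs a text-mode file (not 'wb'), and rows are plain str
os.makedirs('word_reps/data', exist_ok=True)
with open('word_reps/data/word_metadata.csv', 'w', newline='', encoding='utf8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    # write one row per embedding row so that line i labels index i;
    # index 3 has no word, so it gets a placeholder (label chosen here)
    for idx in range(MAX_NB_WORDS):
        writer.writerow([id_to_word.get(idx, '<UNUSED>')])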

In [14]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model(sequence_input, preds)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: 'data/word_metadata.csv'
}

model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_split=VALIDATION_SPLIT,
          callbacks=[TensorBoard(log_dir='word_reps', embeddings_freq=1, embeddings_metadata=embedding_metadata)])

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 3584/20000 [====>.........................] - ETA: 4:40 - loss: 0.6936 - acc: 0.5128

KeyboardInterrupt: 
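The "20000 / 5000" split in the log comes from validation_split. As a rough sketch of the arithmetic, assuming the documented Keras behaviour of holding out the last fraction of the samples before any shuffling:

```python
import numpy as np

VALIDATION_SPLIT = 0.2
n_samples = 25000

x = np.arange(n_samples)                              # stand-in for the sample indices
split_at = int(n_samples * (1.0 - VALIDATION_SPLIT))  # first index of the held-out tail
x_fit, x_val = x[:split_at], x[split_at:]

print(len(x_fit), len(x_val))
```

Because the held-out part is simply the tail of the arrays, this only gives a fair validation set if the data is not sorted by label.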

That's it. Let's check out TensorBoard!

Launch TensorBoard:
> $ tensorboard --logdir=word_reps/