## Side Notes for Lesson 3

### 1. Simple word embeddings with nltk and gensim

Following: https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest

In [1]:
import nltk
from nltk.corpus import brown
from nltk.data import find

import gensim

import numpy as np

Define a couple of helper functions for cosine similarities: one deriving similarity between two words in the context of a model, the other for two vectors directly:

In [2]:
def cossim_words(vec_model, a, b):
    """
    arguments: gensim model, word a, word b
    return: cosine similarity between the model vectors associated with a and b
    """
    
    vec_a = vec_model[a]
    vec_b = vec_model[b]
    
    return np.dot(vec_a, vec_b)/np.sqrt(np.dot(vec_a, vec_a))/np.sqrt(np.dot(vec_b, vec_b))

def cossim_vecs(vec_a, vec_b):
    """
    arguments: vector a, vector b
    return: cosine similarity between the two vectors
    """
    
    return np.dot(vec_a, vec_b)/np.sqrt(np.dot(vec_a, vec_a))/np.sqrt(np.dot(vec_b, vec_b))
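
As a quick sanity check of the formula used by both helpers, cos(a, b) = a·b / (|a| |b|), here is a minimal sketch on hand-made toy vectors. The helper name `cosine_similarity` and the vectors are illustrative only and not part of the notebook; `np.linalg.norm(v)` computes the same quantity as `np.sqrt(np.dot(v, v))`.

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two 1-D vectors (hypothetical helper, for illustration)."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# Toy vectors chosen by hand to cover the three extreme cases.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))

# Expect values close to 1, 0 and -1 (up to floating-point rounding).
print(same_direction, orthogonal, opposite)
```

Cosine similarity ignores vector length: `[1, 2]` and `[2, 4]` point in the same direction, so they count as maximally similar.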

Download NLTK's sample word2vec embeddings:

In [3]:
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/joachim/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


True

In [4]:
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

Load the embeddings into a gensim model:

In [6]:
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

What is the size of the vocabulary?

In [7]:
len(model.vocab)

43981

What is the **embedding dimension**?

In [8]:
len(model['university'])

300

Let's play with cosine similarities:

In [9]:
cossim_words(model, 'university', 'school')

0.5080747

Now... let's try to 'construct' the embedding vector for queen:

In [10]:
model_queen = model['king'] + (model['woman'] - model['man'])

In [11]:
cossim_vecs(model['queen'], model_queen)

0.71181935

For comparison, how similar is 'queen' to 'king' itself?

In [12]:
cossim_words(model, 'queen', 'king')

0.65109587

Not bad! Our reconstructed vector is closer to 'queen' (0.71) than 'king' itself is (0.65), so it seems to be decent.
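
To make the vector arithmetic concrete, here is a small sketch with made-up 3-dimensional vectors (the numbers and the names `toy_vectors` and `cosine` are assumptions for illustration, not values from the word2vec sample). It builds `king + (woman - man)` and looks for the closest remaining word, skipping the three query words. On the real model, gensim's `KeyedVectors.most_similar(positive=['king', 'woman'], negative=['man'])` performs a related search.

```python
import numpy as np

# Hand-made toy embeddings: roughly (royalty, gender, fruit-ness).
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, -0.8, 0.1]),
    'man':   np.array([0.1, 0.9, 0.0]),
    'woman': np.array([0.1, -0.9, 0.0]),
    'apple': np.array([-0.5, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same construction as in the notebook, on the toy vectors.
target = toy_vectors['king'] + (toy_vectors['woman'] - toy_vectors['man'])

# Rank all words except the ones used to build the target.
query_words = {'king', 'woman', 'man'}
candidates = {word: cosine(target, vec)
              for word, vec in toy_vectors.items()
              if word not in query_words}
best_word = max(candidates, key=candidates.get)

print(best_word)  # queen
```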

### 2. Simple BOW Classification using Word Embeddings in Keras

This should roughly implement the model on slide 41 in a toy setting.

In [13]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

In [14]:
len(model.vocab.keys())

43981

Ok, now we know the number of words that have an embedding. Let's build the embedding matrix from the model:

In [15]:
EMBEDDING_DIM = len(model['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(model.vocab.keys()) + 1, EMBEDDING_DIM))       
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(model.vocab.keys()):
    embedding_vector = model[word]
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i



What's the shape?

In [16]:
embedding_matrix.shape

(43982, 300)

Correct? Looks right: 43981 words plus one extra row that stays all-zero, each with 300 dimensions.
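
The same construction on a tiny, made-up vocabulary may make the bookkeeping easier to follow. This is only a sketch: `toy_vectors`, `toy_matrix` and `toy_vocab` are hypothetical stand-ins for the gensim model, `embedding_matrix` and `vocab_dict`.

```python
import numpy as np

# Hypothetical 3-word vocabulary with 4-dimensional vectors.
toy_vectors = {
    'great': np.array([0.1, 0.2, 0.3, 0.4]),
    'terrible': np.array([-0.1, -0.2, -0.3, -0.4]),
    'this': np.array([0.0, 0.5, 0.0, 0.5]),
}
toy_dim = 4

# One row per word, plus one spare row that is never assigned.
toy_matrix = np.zeros((len(toy_vectors) + 1, toy_dim))
toy_vocab = {}

for i, (word, vector) in enumerate(toy_vectors.items()):
    toy_matrix[i] = vector   # row i holds the vector of this word
    toy_vocab[word] = i      # remember which row belongs to the word

print(toy_matrix.shape)  # (4, 4)
print(toy_matrix[-1])    # the spare last row is still all zeros
```

Because `enumerate` starts at 0, the words occupy rows 0 to 2 here and the extra row sits at the end, which mirrors what happens in the full-size matrix.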

Let's build the embedding layer:

In [17]:
MAX_SEQUENCE_LENGTH = 5  # Keras' embedding layer expects a specific input length. Padding is often needed here.

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],          ## note: deprecated!
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

W0908 21:51:02.776325 4572186048 deprecation.py:506] From /anaconda3/envs/tf1_14/lib/python3.7/site-packages/tensorflow/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Note the `trainable=False` flag. **Q: What would happen if we had set it to `True`?**

Using this embedding layer, let's use **Keras' Functional API** as opposed to the sequential model we looked at last week. The format is a bit different, but not very much.

Start by defining the input, then add the layers one by one, each acting on the output of the previous one:

In [18]:
# Input layer:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# apply the Embedding layer to the input layer
embedded_sequences = embedding_layer(sequence_input)

# add up all of the word embeddings along the sequence axis
sum_embeddings =  K.sum(embedded_sequences, axis=1)       # for future reference: Lambda layers are good here...

hidden = Dense(100, activation='relu')(sum_embeddings)

preds = Dense(1, activation='sigmoid')(hidden)

W0908 21:51:03.280385 4572186048 deprecation.py:506] From /anaconda3/envs/tf1_14/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


**Q: What are the dimensions of the layers?**

Next: we build the model, defining input and output:

In [19]:
bow_model = Model(sequence_input, preds)

Let's see whether our dimension discussion was correct. Print a model summary:

In [20]:
bow_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 5)]               0         
_________________________________________________________________
embedding (Embedding)        (None, 5, 300)            13194600  
_________________________________________________________________
tf_op_layer_Sum (TensorFlowO [(None, 300)]             0         
_________________________________________________________________
dense (Dense)                (None, 100)               30100     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 13,224,801
Trainable params: 30,201
Non-trainable params: 13,194,600
_________________________________________________________________
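
The numbers in the summary can be re-derived by hand. The sketch below assumes the sizes used above (43982 embedding rows, 300 dimensions, 5 tokens per example, 100 hidden units) and uses a dummy array of ones in place of real embeddings; the variable names are illustrative.

```python
import numpy as np

vocab_rows, embedding_dim, seq_len, hidden_units = 43982, 300, 5, 100

# Parameter counts per layer.
embedding_params = vocab_rows * embedding_dim                 # frozen lookup table
hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                          # weights + bias

print(embedding_params, hidden_params, output_params)  # 13194600 30100 101

# Shapes: a batch of 2 examples, each with 5 tokens of 300 dimensions.
fake_embedded = np.ones((2, seq_len, embedding_dim))
summed = fake_embedded.sum(axis=1)   # collapse the sequence axis
print(summed.shape)                  # (2, 300)
```

Only the two dense layers are trainable, which gives 30100 + 101 = 30201 trainable parameters; the embedding table accounts for all of the non-trainable ones.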


Like last week... let's compile the model, i.e., define the optimizer, loss function, etc.

In [21]:
bow_model.compile(optimizer='adam', loss='binary_crossentropy')

W0908 21:51:03.352510 4572186048 deprecation.py:323] From /anaconda3/envs/tf1_14/lib/python3.7/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Almost there... let's create some fake training and test data.

In [22]:
train_sentences = ['this is really absolutely great', 'this is really absolutely terrible']
train_labels = [[1], [0]]

test_sentences = ['never seen anything this stupid', 'never seen anything this fantastic']
test_labels = [[0], [1]]


def sents_to_ids(sentences):
    """
    converting a list of strings to a list of lists of word ids
    """
    text_ids = []
    for sentence in sentences:
        example = []
        for word in sentence.split(' '):
            example.append(vocab_dict[word])
        text_ids.append(example)

    return text_ids


train_input = np.array(sents_to_ids(train_sentences))
train_labels = np.array(train_labels)

test_input = np.array(sents_to_ids(test_sentences))
test_labels = np.array(test_labels)
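
Note that `sents_to_ids` raises a `KeyError` for any word without an embedding and only works here because every toy sentence has exactly `MAX_SEQUENCE_LENGTH` words. A possible variant, sketched under assumptions, maps unknown words and padding to one reserved index and truncates longer sentences. The names `sents_to_padded_ids`, `pad_id` and `toy_vocab` are hypothetical; in the notebook, the reserved index could be `len(vocab_dict)`, the row of `embedding_matrix` that is never assigned and therefore stays all-zero.

```python
def sents_to_padded_ids(sentences, vocab, max_len, pad_id):
    """
    Convert a list of strings to a list of fixed-length lists of word ids.
    Unknown words and padding positions both get pad_id.
    """
    padded = []
    for sentence in sentences:
        ids = [vocab.get(word, pad_id) for word in sentence.split()]
        ids = ids[:max_len]                      # truncate long sentences
        ids += [pad_id] * (max_len - len(ids))   # pad short sentences
        padded.append(ids)
    return padded


toy_vocab = {'this': 0, 'is': 1, 'great': 2}
toy_pad_id = len(toy_vocab)   # 3, one past the last real word id

toy_ids = sents_to_padded_ids(
    ['this is great', 'this is not so great at all'],
    toy_vocab, max_len=5, pad_id=toy_pad_id)

print(toy_ids)  # [[0, 1, 2, 3, 3], [0, 1, 3, 3, 2]]
```

Since the model sums the embeddings, positions that point at an all-zero row add nothing to the sum, so padding this way does not change the bag-of-words representation.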

**Q: Before we start... should this come out ok-ish?**

Next: let's look at the initial predictions, before any training. They should be random-ish. Are they?

In [23]:
print(bow_model.predict(train_input))
print(bow_model.predict(test_input))

[[0.4580708]
 [0.430571 ]]
[[0.42396268]
 [0.4473657 ]]


Yup.

Finally... let's train!

In [24]:
bow_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=20)

Train on 2 samples, validate on 2 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1a20d2e208>

Looks good!

What are the training predictions now?

In [25]:
bow_model.predict(train_input)

array([[0.6886179 ],
       [0.31423044]], dtype=float32)

Yay! But we obviously cheated here with the choice of sentences. Nevertheless, the idea should be clear.
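
To put a number on the improvement, the binary cross-entropy can be recomputed by hand from the training predictions printed before and after training. This is a plain-Python sketch of the textbook formula; the function name is illustrative, and Keras additionally clips probabilities for numerical stability, so its reported loss may differ slightly.

```python
import math

def binary_crossentropy(labels, probs):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

labels = [1, 0]   # 'great' sentence, 'terrible' sentence

# Training predictions copied from the outputs above.
loss_before = binary_crossentropy(labels, [0.4580708, 0.430571])
loss_after = binary_crossentropy(labels, [0.6886179, 0.31423044])

print(loss_before, loss_after)
```

The loss on the two training sentences drops from roughly 0.67 to roughly 0.38, consistent with the predictions moving towards their labels.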