## Side Notes for Lesson 3

### 1. Simple word embeddings with nltk and gensim

Following: https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest

In [1]:
import nltk
from nltk.corpus import brown
from nltk.data import find

import gensim

import numpy as np

Define a couple of helper functions for cosine similarities: one deriving similarity between two words in the context of a model, the other for two vectors directly:

In [2]:
def cossim_words(vec_model, a, b):
    """
    arguments: embedding model vec_model, word a, word b
    return: cosine similarity between the model vectors associated with a and b
    """
    
    vec_a = vec_model[a]
    vec_b = vec_model[b]
    
    return np.dot(vec_a, vec_b)/np.sqrt(np.dot(vec_a, vec_a))/np.sqrt(np.dot(vec_b, vec_b))

def cossim_vecs(vec_a, vec_b):
    """
    arguments: vector a, vector b
    return: cosine similarity between the two vectors
    """
    
    return np.dot(vec_a, vec_b)/np.sqrt(np.dot(vec_a, vec_a))/np.sqrt(np.dot(vec_b, vec_b))
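
As a quick sanity check of the formula, here is a minimal sketch on hand-made toy vectors. The vectors are made up for illustration, and `cosine_similarity` is a stand-in name that should be equivalent to `cossim_vecs` above, just written with `np.linalg.norm`:

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# Toy vectors, made up for illustration only.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))

print(same_direction, orthogonal, opposite)
```

The three values should come out close to 1, 0 and -1: the measure depends only on the direction of the vectors, not on their length.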

Download NLTK's sample word2vec embeddings:

In [3]:
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /home/mhbutler/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


True

In [4]:
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

Load the embeddings into a gensim model:

In [5]:
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

What is the size of the vocabulary? [Use model.vocab... Note that gensim 4.x replaced this attribute with model.key_to_index.]

In [6]:
len(model.vocab)

43981

Ok, 43981 words in vocab.

What is the **embedding**? [model['word']...]

In [7]:
model['great']

array([ 3.06035e-02,  8.86877e-02, -1.21269e-02,  7.61965e-02,
        5.66269e-02, -4.24702e-02,  4.10129e-02, -4.97567e-02,
       -3.64328e-03,  6.32889e-02, -1.42608e-02, -7.91111e-02,
        1.74877e-02, -3.83064e-02,  9.26433e-03,  2.95626e-02,
        7.70293e-02,  9.49334e-02, -4.28866e-02, -2.95626e-02,
        4.45244e-05,  6.82854e-02,  1.73836e-02,  3.14363e-02,
        6.53708e-02,  2.89380e-02, -4.39275e-02,  1.78000e-02,
        1.82164e-02, -4.70503e-02, -2.85216e-02,  1.79041e-02,
        1.06592e-01,  9.07696e-02,  6.78690e-02,  6.16755e-03,
       -2.08187e-02,  5.95936e-03,  1.51586e-03,  8.95205e-02,
        6.49544e-02, -3.12281e-02,  9.24351e-02, -2.45661e-02,
       -1.21269e-02, -1.53538e-03,  6.49544e-02, -1.12421e-02,
        9.10819e-03, -6.45380e-02,  4.43439e-02,  1.35738e-01,
       -7.91111e-02,  1.57181e-02, -4.72585e-02, -1.35322e-02,
       -4.33029e-02, -5.16304e-02,  1.37404e-01, -3.12281e-02,
       -6.49544e-02,  1.14087e-01, -6.41217e-02, -5.246

Dimension as expected?

In [8]:
model['great'].shape

(300,)

Let's play with cosine similarities:

In [9]:
cossim_words(model, 'nice', 'great')

0.6454657

In [10]:
cossim_words(model, 'nice', 'bad')

0.39996934

Cool, as expected.

Now... word vectors are supposed to capture meaningful linguistic relationships. So let's try to 're-construct' the embedding vector for the word 'son' via

model['son']  $\sim$ model['boy'] - model['girl'] + model['daughter']

In [11]:
model_son = model['boy'] - model['girl'] + model['daughter']

How close is this constructed vector to the actual embedding vector for 'son'?

In [12]:
cossim_vecs(model['son'], model_son)

0.9217184

Close! And the constructed vector is closer to the embedding of 'son' than other words in the family (in a double sense) are:

In [13]:
cossim_words(model, 'son', 'brother')

0.83795315

In [14]:
cossim_words(model, 'son', 'daughter')

0.8468295

So the approximate relationship model['son']  $\sim$ model['boy'] - model['girl'] + model['daughter'] seems valid.
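
To see why this kind of vector arithmetic can work at all, here is an idealized toy sketch. The 2-d vectors are hypothetical and hand-made (one axis loosely standing for gender, the other for 'generic child' versus 'family relation'); real embeddings are much noisier, as the 0.92 above shows:

```python
import numpy as np

# Hypothetical 2-d "embeddings", made up for illustration only.
toy = {
    "boy":      np.array([ 1.0,  1.0]),
    "girl":     np.array([-1.0,  1.0]),
    "son":      np.array([ 1.0, -1.0]),
    "daughter": np.array([-1.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same construction as in the notebook: boy - girl + daughter.
constructed = toy["boy"] - toy["girl"] + toy["daughter"]

# Rank every toy word by its similarity to the constructed vector.
ranking = sorted(toy, key=lambda w: cosine(toy[w], constructed), reverse=True)
print(ranking[0])
```

In this toy setting the constructed vector coincides with the one for 'son', so 'son' ranks first. Subtracting 'girl' from 'boy' isolates the gender direction, and adding it to 'daughter' moves along that direction only.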

### 2. Simple BOW Classification using Word Embeddings in Keras

This section roughly implements the model on slide 41 in a toy setting.

In [15]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

Ok, now we know the number of words that have an embedding. Let's build the embedding matrix from the model:

In [16]:
EMBEDDING_DIM = len(model['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(model.vocab.keys()) + 1, EMBEDDING_DIM))       
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(model.vocab.keys()):
    embedding_vector = model[word]
    if embedding_vector is not None:
        # every vocab word has a vector; only the extra last row stays all-zeros.
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i



What's the shape?

In [17]:
embedding_matrix.shape

(43982, 300)

Correct? Looks right: 43981 vocabulary words plus one extra all-zero row, each with 300 dimensions.
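
The same construction on a tiny scale may make the roles of the matrix and the word-to-id map clearer. This is a minimal sketch: `toy_vectors` is a hypothetical stand-in for the gensim model, and all numbers are made up:

```python
import numpy as np

# Hypothetical stand-in for the gensim model: a plain dict of word -> vector.
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "bad":  np.array([-0.8, 0.2, 0.1]),
    "film": np.array([0.0, 0.7, 0.7]),
}
dim = 3

# One row per word plus one extra all-zero row at the end.
toy_matrix = np.zeros((len(toy_vectors) + 1, dim))
toy_vocab = {}
for i, (word, vector) in enumerate(toy_vectors.items()):
    toy_matrix[i] = vector
    toy_vocab[word] = i

# Looking up word ids is plain row indexing into the matrix.
ids = [toy_vocab["good"], toy_vocab["film"]]
looked_up = toy_matrix[ids]
print(toy_matrix.shape, looked_up.shape)
```

Conceptually, an embedding layer performs this row lookup for every word id in its input.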

Let's build the embedding layer:

In [18]:
MAX_SEQUENCE_LENGTH = 5  # Keras' embedding layer expects a specific input length. Padding is often needed here.

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [19]:
try:
    del tf_model      # drop any model left over from an earlier run
except NameError:
    pass

Note the 'trainable=False' flag...

Now let's build the model, again as a **Sequential Model**: 

In [20]:
tf_model = tf.keras.Sequential()

tf_model.add(embedding_layer)                                        # embedding layer
tf_model.add(tf.keras.layers.Lambda(lambda x: K.mean(x, axis=1)))    # average of embedding vectors
tf_model.add(Dense(100, activation='relu'))                          # hidden layer
tf_model.add(Dense(1, activation='sigmoid'))                         # classification layer

**Q: What are the dimensions of the layers?**

Next: we build the model, defining input and output:

Let's see whether our dimension discussion was correct. Print a model summary:

In [21]:
tf_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 5, 300)            13194600  
_________________________________________________________________
lambda (Lambda)              (None, 300)               0         
_________________________________________________________________
dense (Dense)                (None, 100)               30100     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 13,224,801
Trainable params: 30,201
Non-trainable params: 13,194,600
_________________________________________________________________
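
The parameter counts in the summary can be reproduced by hand. A short sketch of the arithmetic, using only the sizes shown above (the variable names are just for illustration):

```python
vocab_rows = 43982       # rows of embedding_matrix (43981 words + 1 extra row)
embedding_dim = 300
hidden_units = 100

embedding_params = vocab_rows * embedding_dim                 # frozen lookup table
hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                          # weights + bias

trainable = hidden_params + output_params
total = embedding_params + trainable
print(embedding_params, hidden_params, output_params, trainable, total)
```

The averaging layer has no parameters, and because the embedding layer is frozen, only the two dense layers contribute trainable parameters.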


Like last week... let's compile the model, i.e., define optimizer, loss function, etc.

In [22]:
tf_model.compile(optimizer='adam', loss='BinaryCrossentropy')

Almost there... let's create some fake training and test data.

In [23]:
train_sentences = ['this is really absolutely great', 'this is really absolutely terrible']
train_labels = [[1], [0]]

test_sentences = ['never seen anything this stupid', 'never seen anything this fantastic']
test_labels = [[0], [1]]

... and then do some formatting gymnastics:

In [24]:
def sents_to_ids(sentences):
    """
    converting a list of strings to a list of lists of word ids
    """
    text_ids = []
    for sentence in sentences:
        example = []
        for word in sentence.split(' '):
            example.append(vocab_dict[word])
        text_ids.append(example)

    return  text_ids   


train_input = np.array(sents_to_ids(train_sentences))
train_labels = np.array(train_labels)

test_input = np.array(sents_to_ids(test_sentences))
test_labels = np.array(test_labels)

So the model inputs are word ids in the vocab:

In [25]:
train_input

array([[35029, 16908, 34554,  7427, 35058],
       [35029, 16908, 34554,  7427, 37254]])
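
Note that `sents_to_ids` raises a `KeyError` for any word missing from `vocab_dict` and relies on every sentence having exactly `MAX_SEQUENCE_LENGTH` words. A possible variant is sketched below. It is an assumption, not part of the original notebook: unknown words and padding positions are both mapped to a `pad_id`, which in this notebook could be `len(vocab_dict)`, the index of the spare all-zero row of the embedding matrix. The function name and the toy vocabulary are hypothetical:

```python
def sents_to_padded_ids(sentences, vocab, pad_id, max_len):
    """Convert sentences to fixed-length id lists.

    Unknown words and padding positions both get pad_id.
    """
    text_ids = []
    for sentence in sentences:
        ids = [vocab.get(word, pad_id) for word in sentence.split()]
        ids = ids[:max_len]                          # truncate long sentences
        ids = ids + [pad_id] * (max_len - len(ids))  # pad short ones
        text_ids.append(ids)
    return text_ids

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"this": 0, "is": 1, "great": 2}
padded = sents_to_padded_ids(
    ["this is great", "this is utterly great indeed today"],
    toy_vocab, pad_id=3, max_len=5)
print(padded)
# → [[0, 1, 2, 3, 3], [0, 1, 3, 2, 3]]
```

One caveat with this design: the model averages over all positions, so the all-zero padding rows still count in the mean and shrink the averaged vector for short sentences.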

Next: let's get the start predictions. Should be random-ish. Are they?

In [26]:
print(tf_model.predict(train_input))
print(tf_model.predict(test_input))

[[0.49760967]
 [0.49496353]]
[[0.5114079]
 [0.5123441]]


Yup, looks quite random.

Finally... let's train!

In [27]:
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=1)
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=150, verbose=0)
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7f319431c070>

Looks good!

What are the test predictions now?

In [28]:
tf_model.predict(test_input)

array([[0.1635555],
       [0.7503134]], dtype=float32)

Yay! But we obviously cheated here with the choice of sentences. Nevertheless, the idea should be clear.

**Questions for the class for joint live in-class exercises**:

1) Can you relate the value for the validation loss to the prediction for the test set?

2) What do you think happens if you change the 'trainable' flag in the embedding layer from 'False' to 'True'?

3) Let's look into the model and inspect some weights. (Use tf_model.layers. We can get weights of individual layers through  tf_model.layers[<layer_num>].weights):
   - Related to Q2, depending on the 'trainable' flag, did the embedding matrix change?
   
   
LET'S TRY IT!!
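
As a possible starting point for question 1, here is a minimal sketch that computes the mean binary cross-entropy by hand from the test labels and the test predictions printed above. The helper `binary_cross_entropy` is written here for illustration; Keras additionally clips probabilities for numerical stability, so its reported value may differ slightly:

```python
import math

def binary_cross_entropy(labels, predictions):
    """Mean binary cross-entropy over (label, predicted probability) pairs."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

# Test labels and the test predictions printed above.
test_labels = [0, 1]
test_predictions = [0.1635555, 0.7503134]

val_loss = binary_cross_entropy(test_labels, test_predictions)
print(round(val_loss, 4))
```

The result, roughly 0.23, should be comparable to the validation loss reported in the last training epoch, since the predictions were made right after training finished.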