## Live Notes for Lesson 5: CNNs



In [2]:
import nltk
from nltk.corpus import brown
from nltk.data import find

import gensim

import numpy as np

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model


In [1]:
!pip install gensim



### 1. Simple CNNs Math with numpy 

Let's start by defining a simple 1D convolution operation that implements the formulas from the slides:

In [3]:
def convolution(w, x):
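    """
    Minimal convolution at a single window position, valuing clarity over compactness.

    w: kernel weights (1-D or 2-D); x: input window with the same shape as w.
    """

    if len(w.shape) == 1:                              # 1-D kernel: plain dot product
        return w.dot(x)
    
    elif len(w.shape) == 2:                            # 2-D kernel: trace(w @ x.T) = sum of element-wise products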
        return np.trace(np.matmul(w, np.transpose(x)))
    
    else:
        raise ValueError('you have to define the proper formula')
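
As a quick sanity check of the 2-D branch: the trace of `w @ x.T` is the sum of all element-wise products of `w` and `x`, which is what a convolution window computes. The sketch below verifies this with NumPy. The arrays `w` and `x` and their values are arbitrary illustrations, not taken from the slides.

```python
import numpy as np

# illustrative kernel and input window of matching shape (values are arbitrary)
w = np.array([[1., 0., -1.], [2., 1., 0.]])
x = np.array([[3., 4., 5.], [-1., 2., 0.]])

trace_form = np.trace(np.matmul(w, np.transpose(x)))   # formula used in convolution()
elementwise_form = np.sum(w * x)                       # sum of element-wise products

print(trace_form, elementwise_form)
```

Both expressions give the same number, so the trace form is just a compact way of writing "multiply matching entries and add everything up".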

Next, choose weights and bias for our (one) convolution layer to mimic the edge detector in the slides: 

In [4]:
convolution_weights = np.array([-1, 2, -1])
convolution_bias = 0

Consider the following input and then calculate the convolution. Do we locate the edges/transitions?

In [5]:
input_vals = np.array([0,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0])

In [6]:
convolution_shape = convolution_weights.shape 
input_format = input_vals.shape 

convolution_results = []

# slide window across input
for i in range(1 + input_format[-1] - convolution_shape[-1]):
    
    # convolution results at position i
    conv_val_i = convolution(convolution_weights,input_vals[i:i + convolution_shape[0]]) + convolution_bias
    
    # apply poor-man's relu
    if conv_val_i < 0:
        conv_val_i = 0
    convolution_results.append(conv_val_i)
    
print(np.array(convolution_results))

[0 0 1 0 1 0 0 0 1 0 0 0 1 0]


Looks right! We don't use any padding (i.e., 'valid' padding), hence we miss the first and last positions: 16 inputs and a kernel of width 3 give 16 - 3 + 1 = 14 outputs.

**Question:** How would the output dimension change with a larger stride?
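
One way to explore this question is to compute the output length directly. The helper below is a minimal sketch: the function name `conv_output_length` is made up for illustration, and it assumes no dilation. It uses the standard formulas for 'valid' padding (no padding) and for 'same' padding as Keras defines it.

```python
import math

def conv_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Number of positions produced by a 1-D convolution (no dilation)."""
    if padding == "valid":
        # the window must fit entirely inside the input
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        # input is zero-padded so that only the stride shrinks the output
        return math.ceil(input_length / stride)
    raise ValueError("padding must be 'valid' or 'same'")

# our 16-value input with the width-3 edge detector
for stride in (1, 2, 3):
    print(stride, conv_output_length(16, 3, stride))
```

With stride 1 this reproduces the 14 values printed above; larger strides skip positions and shorten the output accordingly.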

Next, imagine that our input has 2 dimensions. (This could for example represent an array of word embeddings or the output of a previous CNN layer with multiple filters.)

So we define a 2-d input and choose arbitrary weights to illustrate the point. (These parameters are the values from the example in the last part of the slides.)

In [7]:
convolution_weights = np.array([[1, 3, -1], [2,1,-1]])
convolution_bias = 0

input_vals = np.transpose(np.array([[-3,-5], [-2,2],[3,1],[4,3], [-1,-1]]))

In [8]:
input_vals

array([[-3, -2,  3,  4, -1],
       [-5,  2,  1,  3, -1]])

In [9]:
convolution_weights

array([[ 1,  3, -1],
       [ 2,  1, -1]])

In [10]:
convolution_results = []

input_format = input_vals.shape 
convolution_shape = convolution_weights.shape 

# slide window across input
for i in range(1 + input_format[-1] - convolution_shape[-1]):
    
    # convolution results at position i
    conv_val_i = convolution(convolution_weights,
                                   input_vals[:,i:i + convolution_shape[-1]]) + convolution_bias
    
    # let's  choose tanh for the activation 
    conv_val_i = np.tanh(conv_val_i)
    convolution_results.append(conv_val_i)
    
print(np.array(convolution_results))

[-1.         0.9999092  1.       ]


Do the dimensions look right?
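
To double-check the dimensions, here is a vectorized sketch of the same computation. It assumes NumPy 1.20 or newer (for `sliding_window_view`); the variable names `weights`, `inputs`, and `windows` are chosen here for illustration. An input of shape (2, 5) with a kernel of width 3 and no padding should give 5 - 3 + 1 = 3 output positions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

weights = np.array([[1, 3, -1], [2, 1, -1]])                 # (channels, kernel_width)
inputs = np.array([[-3, -2, 3, 4, -1], [-5, 2, 1, 3, -1]])   # (channels, length)

# all windows of width 3 along the length axis: shape (channels, positions, kernel_width)
windows = sliding_window_view(inputs, weights.shape[-1], axis=1)

# for each position p, sum weights * window over channels c and kernel offsets k
pre_activation = np.einsum('cpk,ck->p', windows, weights)
outputs = np.tanh(pre_activation)

print(windows.shape, pre_activation, outputs.shape)
```

The pre-activation values are -21, 5 and 22, which tanh squashes to the (nearly) saturated results printed above.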

### 2. Simple CNN Classification using Word Embeddings in Keras

Here we apply CNN classification to the **toy** example from the previous session's lesson notebook. I.e., we use the same gensim/word2vec model, define the embedding matrix, and then create the embedding layer for our Keras model:

In [11]:
nltk.download('word2vec_sample')
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /home/mhbutler/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


Ok, now we know the number of words that have an embedding. Let's build the embedding matrix from the model:

In [12]:
EMBEDDING_DIM = len(model['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map
# (gensim 3.x API; in gensim 4.x `model.vocab` was replaced by `model.key_to_index`):
embedding_matrix = np.zeros((len(model.vocab.keys()) + 1, EMBEDDING_DIM))
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(model.vocab.keys()):
    embedding_vector = model[word]
    if embedding_vector is not None:
        # the extra last row is never assigned and stays all-zeros
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i



In [13]:
MAX_SEQUENCE_LENGTH = 5  # Keras' embedding layer expects a specific input length. Padding is often needed here.

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [14]:
try:
    del tf_model
except NameError:      # tf_model does not exist yet (e.g., first run)
    pass

Now let's build the model (again as a **Sequential Model**). This time we replace the concatenation with a 1D CNN layer followed by a max-pooling operation. Let's choose 10 filters.

In [15]:
tf_model = tf.keras.Sequential()

tf_model.add(embedding_layer)                                        # embedding layer

tf_model.add(tf.keras.layers.Conv1D(
    filters=10, 
    kernel_size=3, 
    strides=1, 
    padding='same', 
    activation='relu', 
    use_bias=True,
    kernel_initializer='glorot_uniform', 
    bias_initializer='zeros')
            )    

tf_model.add(tf.keras.layers.GlobalMaxPooling1D())


tf_model.add(Dense(100, activation='relu'))                          # hidden layer
tf_model.add(Dense(1, activation='sigmoid'))                         # classification layer

Let's look at dimensions and parameters of the model.

**Question**: does this make sense?

In [16]:
tf_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 5, 300)            13194600  
_________________________________________________________________
conv1d (Conv1D)              (None, 5, 10)             9010      
_________________________________________________________________
global_max_pooling1d (Global (None, 10)                0         
_________________________________________________________________
dense (Dense)                (None, 100)               1100      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 13,204,811
Trainable params: 10,211
Non-trainable params: 13,194,600
_________________________________________________________________
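
The parameter counts in the summary can be reproduced by hand. The sketch below is plain arithmetic; the number of embedding rows (43982) is not stated in the notebook and is inferred here by dividing the embedding parameter count by 300.

```python
vocab_rows, embedding_dim = 43982, 300   # 43982 inferred from 13,194,600 / 300
kernel_size, filters = 3, 10
hidden_units = 100

embedding_params = vocab_rows * embedding_dim                    # frozen lookup table
conv_params = kernel_size * embedding_dim * filters + filters    # kernel weights + one bias per filter
dense_params = filters * hidden_units + hidden_units             # pooled vector (10) -> hidden (100)
output_params = hidden_units * 1 + 1                             # hidden (100) -> sigmoid unit

trainable = conv_params + dense_params + output_params
total = trainable + embedding_params

print(embedding_params, conv_params, dense_params, output_params)
print(trainable, total)
```

Note that each filter spans all 300 embedding dimensions, which is why the convolution layer has 3 * 300 * 10 weights rather than 3 * 10. Global max-pooling has no parameters and collapses the 5 positions into one value per filter.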


Like last week... let's compile the model, i.e., define the optimizer, loss function, etc.

In [17]:
tf_model.compile(optimizer='adam', loss='binary_crossentropy')

Almost there... let's use the same crazy-simple fake training and test data.

In [18]:
train_sentences = ['this is really absolutely great', 'this is really absolutely terrible']
train_labels = [[1], [0]]

test_sentences = ['never seen anything this stupid', 'never seen anything this fantastic']
test_labels = [[0], [1]]

... and then do some formatting gymnastics:

In [19]:
def sents_to_ids(sentences):
    """
    Convert a list of strings to a list of lists of word ids.
    Raises a KeyError for any word that is not in vocab_dict.
    """
    text_ids = []
    for sentence in sentences:
        example = []
        for word in sentence.split(' '):
            example.append(vocab_dict[word])
        text_ids.append(example)

    return text_ids


train_input = np.array(sents_to_ids(train_sentences))
train_labels = np.array(train_labels)

test_input = np.array(sents_to_ids(test_sentences))
test_labels = np.array(test_labels)
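
The toy sentences all have exactly five in-vocabulary words, so no padding or unknown-word handling is needed here. For real text, a variant along the following lines could be used. This is only a sketch: `sents_to_padded_ids`, `toy_vocab`, and `fallback_id` are hypothetical names, and the idea of reusing one id for both padding and unknown words assumes that id points at an all-zero embedding row (such as the unused last row of `embedding_matrix` above).

```python
def sents_to_padded_ids(sentences, vocab, max_len, fallback_id):
    """Map words to ids (unknown words -> fallback_id), then truncate/pad to max_len."""
    batch = []
    for sentence in sentences:
        ids = [vocab.get(word, fallback_id) for word in sentence.split()]
        ids = ids[:max_len]                              # truncate long sentences
        ids += [fallback_id] * (max_len - len(ids))      # pad short sentences
        batch.append(ids)
    return batch

toy_vocab = {'this': 0, 'is': 1, 'great': 2}   # hypothetical stand-in for vocab_dict
fallback = len(toy_vocab)                      # mirrors the extra all-zero last row

padded = sents_to_padded_ids(
    ['this is great', 'this is really absolutely not great at all'],
    toy_vocab, max_len=5, fallback_id=fallback)
print(padded)
```

Every row now has the same length, so the result converts cleanly to a 2-D NumPy array for the embedding layer.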

So the model inputs are word ids in the vocab:

In [20]:
train_input

array([[35029, 16908, 34554,  7427, 35058],
       [35029, 16908, 34554,  7427, 37254]])

Next: let's get the initial predictions. They should be random-ish. Are they?

In [21]:
print(tf_model.predict(train_input))
print(tf_model.predict(test_input))

[[0.5088843]
 [0.5097769]]
[[0.5144767 ]
 [0.51471424]]


Yup, looks quite random.

Finally... let's train!

In [22]:
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=1)
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=150, verbose=0)
tf_model.fit(train_input, train_labels, validation_data=(test_input, test_labels), epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7fd3681073a0>

Looks good! Actually better than last week... but don't make much of that, given this crazy-simple data set.

What are the test predictions now?

In [23]:
tf_model.predict(test_input)

array([[0.1841006 ],
       [0.86312157]], dtype=float32)

Yay! But we obviously cheated here with the choice of sentences. Nevertheless, the idea should be clear.

**Questions for the class for joint live in-class exercises**:

1) Can you relate the value of the validation loss to the predictions for the test set?

2) What do you think happens if you change the 'trainable' flag in the embedding layer from 'False' to 'True'?   

3) What do you need to change in the model if you want more filters of the same kernel size?    
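
As a starting point for question 1, binary cross-entropy can be computed by hand from the test predictions printed above. The sketch below uses only NumPy and the numbers shown in this notebook; the validation loss Keras reports after the last epoch should be close to this mean, assuming the weights did not change between that evaluation and the `predict` call.

```python
import numpy as np

test_labels = np.array([[0.], [1.]])
test_predictions = np.array([[0.1841006], [0.86312157]])   # copied from the output above

# binary cross-entropy per example: -[y*log(p) + (1-y)*log(1-p)]
per_example = -(test_labels * np.log(test_predictions)
                + (1 - test_labels) * np.log(1 - test_predictions))
mean_loss = per_example.mean()

print(per_example.ravel(), mean_loss)
```

The more confident and correct a prediction is, the smaller its contribution to the loss.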

**Note/Question:** What would you need to change if you wanted to add CNN layers (at the same position) with different kernel sizes? That gets us to the Keras Functional API...
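
A rough sketch of what such a model might look like with the Functional API follows. It is one possible design, not the version from the lecture: the function name `build_multi_kernel_cnn`, the kernel sizes (2, 3, 4), and the small randomly initialized embedding (instead of the frozen word2vec matrix) are all assumptions made to keep the example self-contained.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_multi_kernel_cnn(seq_len=5, vocab_size=1000, embed_dim=300,
                           kernel_sizes=(2, 3, 4), filters=10):
    """Parallel Conv1D branches with different kernel sizes, merged after pooling."""
    token_ids = layers.Input(shape=(seq_len,), dtype="int32")
    embedded = layers.Embedding(vocab_size, embed_dim)(token_ids)   # placeholder embedding

    pooled = []
    for k in kernel_sizes:
        # each branch sees the same embeddings but looks at windows of k words
        conv = layers.Conv1D(filters, k, padding="same", activation="relu")(embedded)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    merged = layers.Concatenate()(pooled)            # (batch, filters * number of branches)
    hidden = layers.Dense(100, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid")(hidden)
    return tf.keras.Model(inputs=token_ids, outputs=output)

multi_kernel_model = build_multi_kernel_cnn()
multi_kernel_model.summary()
```

A Sequential model cannot express this because the branches run in parallel on the same input; the Functional API lets several layers consume one tensor and then merges their outputs.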