# Convolutional Neural Networks

Perhaps the most important advantage of the deep neural network is that it allows us to add layers of abstraction to our model between the input and the output. This in turn gives us great flexibility in specifying how our inputs might relate to our outputs. The assumptions we build into a model in this way, sometimes called inductive biases, have proven to be very helpful in modeling complex phenomena such as imagery, speech, and language.

So what sorts of biases might exist in our data? It depends on the task, of course, but two that are very common in nature are locality and compositionality. In other words, things that are close to each other are more likely to be related to each other, and big things can often be described more succinctly as a collection of smaller things. Consider, for example, an image. In its raw form, an image is simply a matrix of pixels. But these pixels are not equally likely to be related to each other. Pixels that are next to each other are more likely to belong to the same object or shape. And objects and shapes that are next to each other, in turn, are more likely to form larger objects and shapes. A densely connected neural network layer has no such bias: it allows every pixel to interact with every other pixel to form some output. Can we do better by constraining our model to better reflect locality and compositionality? The answer is often yes, and one popular way to do this is with a convolutional layer.

In its simplest form, a convolutional layer is one or more densely connected layers that are applied to subsets of the input. By only allowing these densely connected layers to operate on small subsets of the input, we are essentially forcing them to learn to recognize smaller structures that may exist. Information about these structures can then be passed to later layers (potentially also convolutional) to form a larger understanding of the input. This idea has enjoyed enormous success in computer vision, and it is now used extensively in essentially every vision processing task. But imagery is not the only input that exhibits locality and compositionality. Language exhibits both as well. For example, words that occur next to each other in a sentence are far more likely to be related to each other than words that are far apart. The same is true for letters. And if we want to think about larger structures, words are made of letters, phrases are made of words, sentences are made of phrases, and documents are made of sentences. Some of the same insights from vision therefore also apply to language.

Consider, for example, the sentence "the man fell on his left side." If we were to represent each of the words in this sentence by a vector, we could represent the entire input by the concatenation of these vectors. We could then perform a convolutional operation across each 3 word sub-sequence in this sentence as follows:

![Images](Images/numeric_1d_conv_animation.gif)

Note that in this particular convolutional layer we have two dense layers (also known as filters). Like any dense layer, each filter has weights, which control how each input contributes to the output, and a bias. As each filter is applied to each contiguous 3-word sequence in our sentence, it generates an output. The final result is that, for just this one sentence, each filter has generated 6 outputs corresponding to the 6 contiguous 3-word sequences found in our sentence. Because there tends to be lots of redundant information in the output of each filter, it is common to aggregate this information. One approach is max pooling, i.e. simply taking the highest value produced by each filter. The result is a single output vector containing 2 values, corresponding to the highest values produced by each of our filters. The entire computation is illustrated below:

![Images](Images/one_dim_conv_anim_continuous.gif)
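
The same computation can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes, not the Keras implementation: the 8-token input length is a hypothetical value chosen so that a width-3 window yields 6 positions, and the 4-dimensional word vectors and random weights are likewise made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: 8 tokens, 4-dimensional word vectors,
# 2 filters, each looking at 3 consecutive tokens.
n_tokens, embed_dim, n_filters, kernel_size = 8, 4, 2, 3

sentence = rng.normal(size=(n_tokens, embed_dim))  # one row per token
weights = rng.normal(size=(kernel_size * embed_dim, n_filters))
bias = np.zeros(n_filters)

# Slide a window over the sentence. Each window is flattened and passed
# through the same dense layer (one column of `weights` per filter).
n_windows = n_tokens - kernel_size + 1
conv_out = np.empty((n_windows, n_filters))
for start in range(n_windows):
    window = sentence[start:start + kernel_size].reshape(-1)
    conv_out[start] = window @ weights + bias

# Max pooling over positions keeps the strongest response of each filter.
pooled = conv_out.max(axis=0)

print(conv_out.shape)  # (6, 2): 6 window positions, 2 filters
print(pooled.shape)    # (2,): one value per filter
```

Because the same `weights` are reused at every window position, the number of parameters does not depend on the length of the sentence.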

We implement this model below, in Keras. 

### Preparing the Data

The first step is to prepare the input data, and here we diverge from our previous approach. In the past we used the bag-of-words approach, discarding all information about the order in which words appear. Now that we're working with convolutions, we need to preserve this information. We will accomplish this by using the Keras Tokenizer to map each word to a unique integer, and then representing the sequence of words in each of our narratives by the corresponding sequence of integers. Although this ends up happening behind the scenes, this is equivalent to representing each word with a one-hot encoding and stacking the one-hot encodings sequentially.
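
One way to see this equivalence is with a toy example. The sketch below uses a hypothetical three-word vocabulary and a random matrix standing in for the weights of the first layer. It shows that multiplying stacked one-hot rows by a weight matrix selects exactly the rows that a direct integer lookup would.

```python
import numpy as np

# Hypothetical toy vocabulary. Index 0 is left unused, mirroring the
# Keras Tokenizer, which numbers words starting from 1.
word_index = {"the": 1, "man": 2, "fell": 3}
sequence = [word_index[w] for w in "the man fell".split()]

vocab_size = len(word_index) + 1        # +1 for the unused index 0
one_hot = np.eye(vocab_size)[sequence]  # one one-hot row per word

# A toy weight matrix with one row per vocabulary index (assumed sizes).
rng = np.random.default_rng(0)
weight_matrix = rng.normal(size=(vocab_size, 5))

# Both routes produce the same (3, 5) array of word vectors.
via_one_hot = one_hot @ weight_matrix
via_lookup = weight_matrix[sequence]

print(np.allclose(via_one_hot, via_lookup))
```

The integer lookup is simply a much cheaper way of computing the same result, which is why the one-hot matrix never needs to be built explicitly.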

In [1]:
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer

import pandas as pd

# read in our training data
df_train = pd.read_hdf('Data/msha_2010-2011.h5')
# read in our validation data
df_valid = pd.read_hdf('Data/msha_2012.h5')

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train['NARRATIVE'])
X_train_seq = tokenizer.texts_to_sequences(df_train['NARRATIVE'])
X_valid_seq = tokenizer.texts_to_sequences(df_valid['NARRATIVE'])

# the categorical_crossentropy loss used below expects
# one-hot encoded training labels, so we encode them here
label_encoder = LabelBinarizer().fit(df_train['INJ_BODY_PART'])
y_train = label_encoder.transform(df_train['INJ_BODY_PART'])
y_valid = label_encoder.transform(df_valid['INJ_BODY_PART'])
n_codes = len(label_encoder.classes_)

Using TensorFlow backend.


In [2]:
print(X_train_seq[0])

[244, 29, 7152, 1570, 764, 213, 970, 4, 3198, 139, 5, 1924, 424, 223, 610, 1, 764, 29, 10, 1, 1570, 9, 3, 64, 2, 490, 110, 5, 213, 1, 764, 813, 4, 164, 317, 11, 6, 15, 54]


As you can see, the Keras tokenizer has converted our narrative into a list of numbers, each corresponding to a word. There is, however, one more modification we need to make. Because each narrative contains a different number of words, but all our neural network layers contain a fixed number of weights, we need to figure out what to do with the mismatch. The simplest approach is to pad each narrative to the same length with special "blank" words (represented by the number 0). We accomplish this using the pad_sequences function from Keras, padding each narrative to 200 words (or truncating it to 200 words, if it is longer).

In [None]:
from keras.preprocessing import sequence

X_train_seq = sequence.pad_sequences(X_train_seq, maxlen=200)
X_valid_seq = sequence.pad_sequences(X_valid_seq, maxlen=200)

print(X_train_seq[0])
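
To make the padding behaviour concrete, here is a pure-Python sketch of what happens to a single sequence. It assumes the default settings of pad_sequences, which, as far as we are aware, both pad and truncate at the start of the sequence. The helper name `pad_or_truncate` is hypothetical and not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    if len(seq) >= maxlen:
        # too long: drop items from the front
        return list(seq[len(seq) - maxlen:])
    # too short: add "blank" entries at the front
    return [value] * (maxlen - len(seq)) + list(seq)


short = [244, 29, 7152]
long = [1, 2, 3, 4, 5, 6, 7, 8]

print(pad_or_truncate(short, maxlen=5))  # [0, 0, 244, 29, 7152]
print(pad_or_truncate(long, maxlen=5))   # [4, 5, 6, 7, 8]
```

After this step every narrative has exactly the same length, so the whole dataset can be stored as a single rectangular array.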

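We're now ready to specify the convolutional model. Each word index is first mapped to a 300-dimensional vector by an embedding layer. We then use a single convolutional layer with 100 filters, each operating over 3-word subsets of the input. The filter outputs are max pooled and passed through a dense layer to a softmax output.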

In [3]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Concatenate
from keras.optimizers import Adam

input_text = Input(shape=(200,), dtype='int32')
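# index 0 is reserved for padding and word indices start at 1,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(len(tokenizer.word_index) + 1,
                      300,
                      input_length=200)(input_text)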
dropout = Dropout(0.1)(embedding)
convolution = Conv1D(filters=100, 
                     kernel_size=3,
                     padding='valid',
                     strides=1,
                     activation='relu')(dropout)
pool = GlobalMaxPooling1D()(convolution)
dense = Dense(100, activation='relu')(pool)
# apply dropout to the dense layer and feed the result to the output layer
dropout = Dropout(0.5)(dense)
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

conv_model = Model(inputs=input_text, outputs=output)

conv_model.compile(optimizer='adam', 
                  loss='categorical_crossentropy', 
                  metrics=['accuracy'])
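
As a quick sanity check on the convolutional layer above, the short calculation below works out how many window positions it visits and how many trainable parameters it has. It is plain arithmetic based on the sizes used in the model (200 words, 300-dimensional embeddings, 100 filters of width 3). The variable names are ours.

```python
seq_len, embed_dim = 200, 300
n_filters, kernel_size = 100, 3

# With padding='valid' and strides=1 the window only visits positions
# where it fits entirely inside the sequence.
n_positions = seq_len - kernel_size + 1

# Each filter has one weight per (word in window, embedding dimension)
# pair, plus one bias.
conv_params = n_filters * (kernel_size * embed_dim + 1)

print(n_positions)  # 198
print(conv_params)  # 90100
```

Global max pooling then reduces the 198 x 100 grid of filter outputs to a single vector of 100 values, one per filter.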

In [4]:
conv_model.fit(x=X_train_seq, y=y_train,
               validation_data=(X_valid_seq, y_valid),
               batch_size=32, epochs=5)

Train on 18681 samples, validate on 9032 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8bc121f358>

There's nothing magical about convolving over 3-word subsequences. An alternative is to have multiple convolutional layers operating in parallel, each over subsequences of a different length. Here, we create convolutional layers for 2, 3, 4, and 5 word subsequences. The resulting outputs are then concatenated before being fed to subsequent layers.

In [5]:
input_text = Input(shape=(200,), dtype='int32')
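# index 0 is reserved for padding and word indices start at 1,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(len(tokenizer.word_index) + 1,
                      300,
                      input_length=200)(input_text)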
dropout = Dropout(0.1)(embedding)
pooled_convolutions = []
for kernel_size in [2, 3, 4, 5]:
    convolution = Conv1D(filters=20, 
                         kernel_size=kernel_size,
                         padding='valid',
                         strides=1,
                         activation='relu')(dropout)
    pool = GlobalMaxPooling1D()(convolution)
    pooled_convolutions.append(pool)
concatenated = Concatenate()(pooled_convolutions)
dropout = Dropout(0.5)(concatenated)
dense = Dense(100, activation='relu')(dropout)
# apply dropout to the dense layer and feed the result to the output layer
dropout = Dropout(0.5)(dense)
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

conv_model = Model(inputs=input_text, outputs=output)

conv_model.compile(optimizer='adam', 
                  loss='categorical_crossentropy', 
                  metrics=['accuracy'])
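
The effect of concatenating the pooled branches can be sketched in NumPy. This is a rough illustration rather than what Keras does internally: the weights are random, biases are omitted, the embeddings are a toy 8 dimensions instead of 300, and the helper `conv_and_pool` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv_and_pool(x, kernel_size, n_filters, rng):
    """Random width-`kernel_size` convolution with ReLU, then global max pooling."""
    seq_len, embed_dim = x.shape
    weights = rng.normal(size=(kernel_size * embed_dim, n_filters))
    # one flattened window per valid position
    windows = np.stack([x[i:i + kernel_size].reshape(-1)
                        for i in range(seq_len - kernel_size + 1)])
    activations = np.maximum(windows @ weights, 0.0)  # ReLU
    return activations.max(axis=0)                    # shape (n_filters,)


x = rng.normal(size=(200, 8))  # 200 positions, toy 8-dimensional embeddings
pooled = [conv_and_pool(x, k, n_filters=20, rng=rng) for k in [2, 3, 4, 5]]
features = np.concatenate(pooled)

print(features.shape)  # (80,): 4 branches x 20 filters each
```

Each branch sees a different number of window positions, but global max pooling gives every branch the same output size, which is what makes the concatenation straightforward.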

In [6]:
conv_model.fit(x=X_train_seq, y=y_train,
               validation_data=(X_valid_seq, y_valid),
               batch_size=32, epochs=5)

Train on 18681 samples, validate on 9032 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8bbe8a76d8>