<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Convolutional_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

--2019-06-17 21:26:10--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 13.229.188.59
Connecting to github.com (github.com)|13.229.188.59|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2019-06-17 21:26:11--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx.1’


2019-06-17 21:26:11 (80.5 MB/s) - ‘msha.xlsx.1’ saved [4183086/4183086]



# Convolutional Neural Networks

In our previous class we introduced the densely connected neural network with a single hidden layer. I pointed out that there has been some theoretical work showing such a model can approximate any relationship between the inputs and outputs to an arbitrary degree of accuracy. Nonetheless, when we tried it out on MSHA coding we saw no significant improvements. Why?

The problem is that it is not enough for a model to be theoretically capable of representing any relationship; it must also be able to learn that relationship from the training data that is available. Here, there are no guarantees. The more flexible our model is, the more possibilities it has to rule out based on the training data. Conversely, the more correctly we can specify the model up front, the less burden we place on the training data. The assumptions we build into a model in this way are sometimes called its "inductive bias." One of the big advantages of deep neural networks is that they give us a lot of flexibility in specifying inductive biases.

### Locality
So what sorts of biases might exist in our data? Obviously it depends on the task, but some biases seem to be very common in nature. One of these is locality. Locality simply means that things that are close to each other are much more likely to interact with each other. This occurs at all sorts of levels in all sorts of ways. If you're an antelope on the savannah, a lion right next to you is a serious threat, while a lion 10,000 miles away is completely irrelevant. If you're a particle hurtling through space, the fact that there's a planet a million miles away is much less relevant to your flight path than the fact that there's another particle just in front of you that you're about to collide with. If you're a pixel in an image, the pixel right next to you is much more likely to be similar to you than the one on the other side of the image. If you're a word in a document, your meaning is much more likely to be affected by other words in the same sentence than by words in some distant paragraph.

Locality is basically everywhere in the natural world, yet our densely connected neural network has no such bias. It simply allows every input to interact with every other input, and all of these interactions are treated initially as equally likely possibilities. So an important question is: can we constrain our model to better reflect the bias toward locality? The answer is yes. One way to do this is with a convolutional layer.

### Convolutional Layer
In its simplest form, a convolutional layer is simply one or more densely connected artificial neurons that are applied to multiple subsets of the input. Typically these subsets are contiguous (locally connected), reflecting a locality bias. By constraining our neurons in this way we force them to consider only local interactions that exist in these subsets. Information about these interactions can then be passed to later layers (potentially also convolutional) to form a larger understanding of the overall input. 
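To make the idea concrete, here is a minimal pure-Python sketch of one neuron being applied to every contiguous window of a short input. The function names, the input values, and the weights are all hypothetical and chosen only for illustration; real libraries implement the same idea far more efficiently.

```python
def neuron(inputs, weights, bias):
    """A single densely connected artificial neuron with a ReLU activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)


def convolve_1d(sequence, weights, bias):
    """Apply the same neuron to every contiguous window of the sequence."""
    width = len(weights)
    return [neuron(sequence[i:i + width], weights, bias)
            for i in range(len(sequence) - width + 1)]


signal = [1.0, 2.0, 0.0, -1.0, 3.0]   # hypothetical input values
weights = [0.5, -1.0, 0.5]            # hypothetical weights, window width 3
bias = 0.0

outputs = convolve_1d(signal, weights, bias)
print(outputs)  # [0.0, 0.5, 2.5]
```

Note that the same three weights are reused at every position, so the neuron can only ever respond to interactions among neighbouring inputs.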

This idea has enjoyed enormous success in computer vision and is now used extensively, but imagery is not the only input that exhibits locality. Language exhibits it too. For example, words that occur next to each other in a sentence are far more likely to be related to each other than words that are far apart. The same is true for letters. Some of the same insights from vision also apply to language.

### Word Convolutions
Consider, for example, the sentence "the man fell on his left side." If we were to represent each of the words in this sentence by a vector, we could represent the entire input by the concatenation of these vectors. We could then perform a convolutional operation across each 3-word sub-sequence in this sentence as follows:

![Images](https://github.com/ameasure/colab_tutorials/blob/master/Images/1d_conv_nopool_loop.gif?raw=1)

Note that this particular convolutional layer has two artificial neurons. Neurons in a convolutional layer are sometimes called `filters` because you can think of them as "filtering" the input for particular patterns. Like any artificial neuron, each filter has weights, controlling how each input contributes to the final output, and a bias. As each filter is applied to each contiguous 3-word sequence in our sentence, it generates an output. The final result is that for just this one sentence, each filter has generated 6 outputs corresponding to the 6 contiguous 3-word sequences found in our sentence. Because there tends to be lots of redundant information in the output of each filter, it is common to aggregate this information. One approach is `max pooling`, i.e. simply take the highest value produced by each filter. The result is a single output vector containing 2 values, corresponding to the highest values produced by each of our filters. The entire computation is illustrated below:

![Images](https://github.com/ameasure/colab_tutorials/blob/master/Images/1d_conv_withpool_loop.gif?raw=1)
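The same computation can be sketched in a few lines of plain Python. Everything numeric here is made up for illustration: the 2-dimensional word vectors, the filter weights, and the biases are hypothetical. The sketch also assumes that the trailing period is treated as its own token, which gives 8 tokens and therefore 6 windows of 3.

```python
def relu(x):
    return max(0.0, x)


def apply_filter(window, weights, bias):
    """Treat a window of word vectors as one flat input to a dense neuron."""
    flat = [value for vector in window for value in vector]
    return relu(sum(v * w for v, w in zip(flat, weights)) + bias)


# Assumption: the trailing period counts as a token, giving 8 tokens in total.
tokens = ["the", "man", "fell", "on", "his", "left", "side", "."]

# Hypothetical 2-dimensional word vectors (made-up values).
word_vectors = {
    "the": [0.1, 0.0], "man": [0.9, 0.2], "fell": [0.3, 1.0], "on": [0.0, 0.1],
    "his": [0.1, 0.1], "left": [0.4, 0.7], "side": [0.5, 0.8], ".": [0.0, 0.0],
}

# Two hypothetical filters: 3 words x 2 dimensions = 6 weights each, plus a bias.
filters = [
    ([1.0, 0.0, 0.0, 1.0, 0.0, 0.0], -0.5),
    ([0.0, 0.0, 0.0, 0.5, 0.5, 0.5], 0.0),
]

width = 3
vectors = [word_vectors[token] for token in tokens]
windows = [vectors[i:i + width] for i in range(len(vectors) - width + 1)]

# One list of outputs per filter, one output per window.
feature_maps = [[apply_filter(window, weights, bias) for window in windows]
                for weights, bias in filters]

# Max pooling keeps only the largest response of each filter.
pooled = [max(feature_map) for feature_map in feature_maps]

print(len(windows))  # number of 3-token windows
print(pooled)        # one value per filter
```

However long the sentence is, the pooled vector always has one entry per filter, which is what lets later layers work with a fixed-size input.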

We implement this model below, in Keras. 

### Preparing the Data

The first step is to prepare the input data, and here we diverge from our previous approach. In the past we used the bag-of-words approach, discarding all information about the order in which words appear. Now that we're working with convolutions, we need to preserve this information. We will accomplish this by using the Keras Tokenizer to map each word to a unique number, and then representing the sequence of words in each of our narratives by the corresponding sequence of numbers. Although the details are handled behind the scenes, this is equivalent to representing each word with a one-hot encoding and stacking the one-hot encodings sequentially.
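As a rough illustration of the word-to-number mapping, here is a simplified pure-Python sketch. It is not the Keras implementation: the function names and the two example narratives are hypothetical, and it only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation. Like Keras, it numbers words from 1 in order of decreasing frequency, leaving 0 free for padding.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign each word an integer, starting at 1, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: index
            for index, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [[word_index[word] for word in text.lower().split() if word in word_index]
            for text in texts]


# Hypothetical narratives, for illustration only.
texts = ["employee slipped on ice", "employee fell on ice and slipped"]

word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)

print(word_index)
print(sequences)
```

Because each narrative becomes a list of integers in the original word order, the ordering information that bag-of-words discarded is preserved.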

In [2]:
from keras.preprocessing.text import Tokenizer
import pandas as pd


df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
df['ACCIDENT_YEAR'].value_counts()
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy()
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train['NARRATIVE'])
X_train_seq = tokenizer.texts_to_sequences(df_train['NARRATIVE'])
X_valid_seq = tokenizer.texts_to_sequences(df_valid['NARRATIVE'])

Using TensorFlow backend.


training rows: 18681
validation rows: 9032


In [3]:
print(X_train_seq[0])

[244, 29, 7152, 1570, 764, 213, 970, 4, 3198, 139, 5, 1924, 424, 223, 610, 1, 764, 29, 10, 1, 1570, 9, 3, 64, 2, 490, 110, 5, 213, 1, 764, 813, 4, 164, 317, 11, 6, 15, 54]


As you can see, the Keras tokenizer has converted our narrative into a list of numbers, each corresponding to a word. There is, however, one more modification we need to make. Because each narrative contains a different number of words, but all our neural network layers contain a fixed number of weights, we need to figure out what to do with the mismatch. The simplest approach is to pad each narrative to the same length with special "blank" words (represented by the number 0). We accomplish this using the `pad_sequences` function from Keras, padding each narrative to 200 words (or truncating it to 200 words, if it is longer).
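The padding step itself is simple enough to sketch by hand. The helper below is a hypothetical stand-in, not the Keras function; it assumes the `pad_sequences` defaults, which add zeros at the front of short sequences and drop items from the front of long ones.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_sequence([5, 3, 8], maxlen=5)
long = pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 3, 8]
print(long)   # [3, 4, 5, 6, 7]
```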

In [4]:
from keras.preprocessing import sequence

X_train_seq = sequence.pad_sequences(X_train_seq, maxlen=200)
X_valid_seq = sequence.pad_sequences(X_valid_seq, maxlen=200)

print(X_train_seq[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0  244   29 7152 1570  764  213  970
    4 3198  139    5 1924  424  223  610    1  764   29   10    1 1570
    9    3   64    2  490  110    5  213    1  764  813    4  164  317
   11 

Our training inputs are now ready; we just need to prepare the training outputs. The `categorical_crossentropy` loss we use below expects these to be in a one-hot encoding, which we produce using sklearn's LabelBinarizer.
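For readers who have not seen a one-hot encoding before, here is a minimal pure-Python sketch of the idea. The function names and the example labels are hypothetical and are not taken from the MSHA data; the sketch only mirrors the general behaviour of sorting the distinct classes and putting a single 1 in the column of each row's class.

```python
def fit_classes(labels):
    """Collect the distinct labels in sorted order."""
    return sorted(set(labels))


def one_hot(labels, classes):
    """Encode each label as a row with a single 1 in its class's column."""
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        rows.append(row)
    return rows


# Hypothetical labels, for illustration only.
labels = ["HAND", "BACK", "KNEE", "BACK"]

classes = fit_classes(labels)
encoded = one_hot(labels, classes)

print(classes)  # ['BACK', 'HAND', 'KNEE']
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```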

In [5]:
from sklearn.preprocessing import LabelBinarizer
# keras only accepts a one-hot encoding of the training labels
# we do that here
label_encoder = LabelBinarizer().fit(df_train['INJ_BODY_PART'])
y_train = label_encoder.transform(df_train['INJ_BODY_PART'])
y_valid = label_encoder.transform(df_valid['INJ_BODY_PART'])
n_codes = len(label_encoder.classes_)
print(y_train[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0]


We're now ready to specify the convolutional model. Here we use a single convolutional layer with 100 filters, each operating over 3-word subsets of the input.
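Before building the model, it can help to work out the sizes involved. The sketch below uses two hypothetical helper functions together with the sizes from this notebook (200 padded positions, 300-dimensional embeddings, 100 filters of width 3). It assumes no padding is added inside the convolution, which is what `padding='valid'` means, and it counts one bias per filter.

```python
def conv1d_output_length(sequence_length, kernel_size, stride=1):
    """Number of windows produced by a 1D convolution with no padding."""
    return (sequence_length - kernel_size) // stride + 1


def conv1d_parameter_count(kernel_size, input_channels, filters):
    """Weights (kernel_size x input_channels per filter) plus one bias per filter."""
    return kernel_size * input_channels * filters + filters


positions = conv1d_output_length(sequence_length=200, kernel_size=3)
parameters = conv1d_parameter_count(kernel_size=3, input_channels=300, filters=100)

print(positions)   # 198 windows of 3 words in a 200-word narrative
print(parameters)  # 90100 trainable parameters in the convolutional layer
```

After global max pooling, the 198 x 100 grid of filter outputs collapses to a single vector of 100 values, one per filter.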

In [6]:
from keras.models import Model
from keras.layers import Dense, Input, Dropout
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Concatenate
from keras.optimizers import Adam

input_text = Input(shape=(200,), dtype='int32')
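# word_index starts at 1 and 0 is reserved for padding,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(len(tokenizer.word_index) + 1,
                      300,
                      input_length=200)(input_text)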
dropout = Dropout(0.1)(embedding)
convolution = Conv1D(filters=100, 
                     kernel_size=3,
                     padding='valid',
                     strides=1,
                     activation='relu')(dropout)
pool = GlobalMaxPooling1D()(convolution)
dense = Dense(100, activation='relu')(pool)
dropout = Dropout(0.5)(dense)
# feed the dropout output (not `dense`) so the 0.5 dropout is actually applied
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

conv_model = Model(inputs=input_text, outputs=output)

conv_model.compile(optimizer='adam', 
                  loss='categorical_crossentropy', 
                  metrics=['accuracy'])

In [7]:
conv_model.fit(x=X_train_seq, y=y_train,
               validation_data=(X_valid_seq, y_valid),
               batch_size=32, epochs=5)


Train on 18681 samples, validate on 9032 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efd19554b38>

There's nothing magical about convolving over 3-word subsequences. An alternative is to have multiple convolutional layers, each operating over subsequences of a different length. Here, we create convolutional layers for 2-, 3-, 4-, and 5-word subsequences. The resulting outputs are then concatenated before being fed to subsequent layers.
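A quick bit of arithmetic shows what the concatenation produces. This sketch assumes the sizes used in this notebook (200 padded positions, 20 filters per window width) and no padding inside the convolutions; the variable names are illustrative only.

```python
sequence_length = 200
filters_per_size = 20
kernel_sizes = [2, 3, 4, 5]

# Each width sees a slightly different number of windows...
windows_per_size = {k: sequence_length - k + 1 for k in kernel_sizes}

# ...but global max pooling reduces each branch to one value per filter,
# so the concatenated vector has a fixed width regardless of window counts.
concatenated_width = filters_per_size * len(kernel_sizes)

print(windows_per_size)    # {2: 199, 3: 198, 4: 197, 5: 196}
print(concatenated_width)  # 80
```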

In [0]:
input_text = Input(shape=(200,), dtype='int32')
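# word_index starts at 1 and 0 is reserved for padding,
# so the embedding needs len(word_index) + 1 rows
embedding = Embedding(len(tokenizer.word_index) + 1,
                      300,
                      input_length=200)(input_text)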
dropout = Dropout(0.1)(embedding)
pooled_convolutions = []
for kernel_size in [2, 3, 4, 5]:
    convolution = Conv1D(filters=20, 
                         kernel_size=kernel_size,
                         padding='valid',
                         strides=1,
                         activation='relu')(dropout)
    pool = GlobalMaxPooling1D()(convolution)
    pooled_convolutions.append(pool)
concatenated = Concatenate()(pooled_convolutions)
dropout = Dropout(0.5)(concatenated)
dense = Dense(100, activation='relu')(dropout)
dropout = Dropout(0.5)(dense)
# feed the dropout output (not `dense`) so the 0.5 dropout is actually applied
output = Dense(len(label_encoder.classes_), activation='softmax')(dropout)

conv_model = Model(inputs=input_text, outputs=output)

conv_model.compile(optimizer='adam', 
                  loss='categorical_crossentropy', 
                  metrics=['accuracy'])

In [9]:
conv_model.fit(x=X_train_seq, y=y_train,
               validation_data=(X_valid_seq, y_valid),
               batch_size=32, epochs=5)

Train on 18681 samples, validate on 9032 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efd04071588>

# Next Lesson
[Recurrent Neural Networks](https://colab.research.google.com/drive/1UI85j2DIkMXMgiHKduboCIRVasRZDwrJ)