[View in Colaboratory](https://colab.research.google.com/github/ameasure/try_git/blob/master/ELMO.ipynb)

# Setup


In [1]:
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'
!pip install xlrd



Redirecting output to ‘wget-log.1’.


# Load and Split Training and Validation Data

In [2]:
import pandas as pd

df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
df['ACCIDENT_YEAR'].value_counts()
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy()
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

training rows: 18681
validation rows: 9032


In [0]:
from sklearn.preprocessing import LabelBinarizer

label_encoder = LabelBinarizer().fit(df_train['INJ_BODY_PART'])
y_train = label_encoder.transform(df_train['INJ_BODY_PART'])
y_valid = label_encoder.transform(df_valid['INJ_BODY_PART'])
n_classes = len(label_encoder.classes_)
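
To make the encoding step concrete, here is a minimal pure-Python sketch of the one-hot layout that `LabelBinarizer` produces when there are more than two classes. It is an illustration only, not scikit-learn's implementation: the three body-part labels are made up, and `one_hot` and `decode` are hypothetical helpers.

```python
# Hypothetical labels standing in for INJ_BODY_PART values.
train_labels = ['HAND', 'BACK', 'EYE', 'BACK']

# Keep the distinct classes in sorted order, one output column per class.
classes = sorted(set(train_labels))


def one_hot(label):
    """Return a row with a 1 in the column of `label` and 0 elsewhere."""
    return [1 if label == c else 0 for c in classes]


def decode(row):
    """Map a row of scores back to the class with the highest score."""
    best_column = max(range(len(row)), key=lambda i: row[i])
    return classes[best_column]


y = [one_hot(label) for label in train_labels]
n_classes = len(classes)

print(classes)                   # ['BACK', 'EYE', 'HAND']
print(y[0])                      # [0, 0, 1]
print(decode([0.1, 0.2, 0.7]))   # HAND
```

Each row has one column per class, which is the shape a softmax output layer with `n_classes` units is trained against. In this sketch, a label that never appeared in `train_labels` encodes to an all-zero row.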

# ELMO embeddings

ELMO (Embeddings from Language Models) is a pretrained bidirectional LSTM language model. A copy of it is documented and available on [tensorflow_hub](https://tfhub.dev/google/elmo/2). We can use the tensorflow_hub module to load the model as follows:

In [4]:
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
from keras.models import Model
from keras.layers import Dense, Input, Lambda, GlobalMaxPooling1D
from keras.optimizers import Adam


# Read in the pre-trained elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

# Create a function that accepts a raw string input x, passes it into the elmo
# model, and returns the sequence of vectors ELMO uses to represent the input.
# Putting this in a function allows us to connect the elmo_model to a Keras
# model using a Lambda layer.
def get_elmo_embedding(x):
    # x has shape (batch_size, 1); squeeze only axis 1 so that a batch holding
    # a single narrative is not collapsed to a scalar.
    return elmo_model(tf.squeeze(tf.cast(x, tf.string), axis=1),
                      signature='default',
                      as_dict=True)['elmo']

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.


Using TensorFlow backend.
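
The squeeze inside `get_elmo_embedding()` matters because the function receives inputs of shape `(batch_size, 1)`, while the `default` signature of the hub module takes a one-dimensional batch of strings. The sketch below shows the shape arithmetic with NumPy as a stand-in, on the assumption that `np.squeeze` follows the same shape rules as `tf.squeeze`. The narrative text is made up.

```python
import numpy as np

# A batch of two narratives and a batch of one, both shaped (batch_size, 1).
batch_of_two = np.array([['employee cut finger'], ['slipped on ice']])
batch_of_one = np.array([['strained lower back']])

# Squeezing only axis 1 always leaves a one-dimensional batch.
two_flat = np.squeeze(batch_of_two, axis=1)
one_flat = np.squeeze(batch_of_one, axis=1)

# Squeezing every size-1 axis collapses a single-example batch to a scalar.
one_collapsed = np.squeeze(batch_of_one)

print(two_flat.shape)       # (2,)
print(one_flat.shape)       # (1,)
print(one_collapsed.shape)  # ()
```

Naming the axis keeps the batch dimension intact even when a batch happens to contain a single example.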


# Define and Link the models

We can now load and link this pretrained model with a new model of our design as follows:

In [7]:
# Specify the input: each example is a single string (an injury narrative),
# so the input has shape (1,) and dtype string.
text_input = Input(shape=(1,), dtype='string')
# A Lambda layer performs the computation defined by the function it receives.
# In this case that function is get_elmo_embedding(), defined above, which feeds
# data into the ELMO model and returns the output.
# output_shape tells Keras the output of this layer is a variable length (None)
# sequence of 1024 dimensional vectors.
elmo_embedding = Lambda(get_elmo_embedding, 
                        output_shape=(None, 1024))(text_input)
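# Collapse the sequence into a single 1024 dimensional vector by taking the
# maximum of each dimension across all positions in the narrative.
max_pooling = GlobalMaxPooling1D()(elmo_embedding)
# Feed the pooled ELMO output into the output layer.
# The output layer will predict part_of_body probabilities.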
output = Dense(units=n_classes, activation='softmax', name='output')(max_pooling)

# tell Keras which layers are the inputs and outputs of our model
model = Model(inputs=text_input, outputs=[output])
# optimizer - the algorithm for calculating the optimal weights (ADAM is a
#   variant of gradient descent)
# loss - the loss function we will attempt to minimize through gradient descent (cross_entropy)
# metrics - the validation metrics we will calculate after each epoch (accuracy)
adam = Adam(lr=.001)
model.compile(optimizer=adam, 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
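
To show what happens between the ELMO output and the predicted probabilities, here is a small plain-Python sketch of global max pooling followed by a softmax output layer. Everything in it is illustrative: the sequence uses 4 dimensions per word instead of 1024, the weights are made up rather than learned, and `global_max_pool` and `dense_softmax` are hypothetical helpers, not Keras functions.

```python
import math

# A made-up "embedding" of a 3-word narrative, 4 dimensions per word.
sequence = [
    [0.1, -0.5, 0.3, 0.0],
    [0.7, 0.2, -0.1, 0.4],
    [-0.2, 0.9, 0.0, 0.1],
]


def global_max_pool(seq):
    """Take the maximum of each dimension across all positions."""
    return [max(column) for column in zip(*seq)]


def dense_softmax(vector, weights, biases):
    """Compute one linear score per class, then normalize with softmax."""
    scores = [sum(w * v for w, v in zip(row, vector)) + b
              for row, b in zip(weights, biases)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights for 2 classes (one row per class).
weights = [[1.0, 0.0, 0.5, 0.0],
           [0.0, 1.0, 0.0, 0.5]]
biases = [0.0, 0.0]

pooled = global_max_pool(sequence)
probabilities = dense_softmax(pooled, weights, biases)

print(pooled)         # [0.7, 0.9, 0.3, 0.4]
print(probabilities)  # two values that sum to 1
```

The pooled vector has the same length regardless of how many words the narrative contains, which is what lets a fixed-size output layer sit on top of a variable-length sequence.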


# Train the model

In [0]:
# .values returns the underlying NumPy array (.as_matrix() is deprecated in pandas)
model.fit(x=df_train['NARRATIVE'].values, y=y_train,
          validation_data=(df_valid['NARRATIVE'].values, y_valid),
          batch_size=64, epochs=20)

Train on 18681 samples, validate on 9032 samples
Epoch 1/20