The code in this notebook is based on [Richard Liao's implementation of hierarchical attention networks](https://github.com/richliao/textClassifier/blob/master/textClassifierHATT.py) and a related [Google group discussion](https://groups.google.com/forum/#!topic/keras-users/IWK9opMFavQ). The notebook also includes code from the [Keras documentation](https://keras.io/) and [blog](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html), as well as from this [word2vec tutorial](http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/).

To enable Theano to run on a single GPU:

* Install the `pygpu` dependency:

  `conda install pygpu`

* Replace `$HOME/.theanorc` with this:
```
[global]
floatX = float32
device = gpu0
[lib]
gpuarray.preallocate=1
```

In [1]:
import os 
os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu0'
os.environ['PATH'] = os.environ['PATH'] + ':/usr/local/cuda-8.0/bin'
import theano
print(theano.config.device) 

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29



gpu0


Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5110)


In [2]:
import numpy as np
import pandas as pd
from collections import defaultdict
import os 
os.environ['KERAS_BACKEND'] = 'theano'
import subprocess
import time

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.optimizers import SGD

from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed
from keras.models import Model, load_model

from keras import backend as K
from keras.engine.topology import Layer, InputSpec
from keras import initializers, regularizers, optimizers
from keras.callbacks import History, CSVLogger

Using Theano backend.


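Download the data from Azure Machine Learning. The first cell below, which loads the Amazon book reviews, is disabled by wrapping it in a string literal. The second cell loads the happiness/sadness data set (`dfe_happysad_utf.csv`), which is the one used in the rest of this notebook.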

In [24]:
"""
from azureml import Workspace
ws = Workspace(
    workspace_id='817780d9ee0d4a878e25f8c9deb3b866',
    authorization_token='6df8a52943bd49eba6e57446bc73f5fc',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['Book Reviews from Amazon']
all_data = ds.to_dataframe()
all_data.rename(columns={0: 'rating', 1: 'text'}, inplace=True)
all_data.loc[:, 'rating'] = all_data['rating'] - 1           # reindex ratings to start from 0
"""

In [3]:
from azureml import Workspace
ws = Workspace(
    workspace_id='817780d9ee0d4a878e25f8c9deb3b866',
    authorization_token='6df8a52943bd49eba6e57446bc73f5fc',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['dfe_happysad_utf.csv']
all_data = ds.to_dataframe()
all_data.rename(columns={'features': 'text', 'label': 'rating'}, inplace=True)
all_data.replace({'rating': {'sadness': 0, 'happiness': 1}}, inplace=True)

Split data into a training and a test set. 

In [4]:
n_tr = 7500

ind_range = np.arange(all_data.shape[0])
tr_ind = np.random.choice(ind_range, n_tr, replace=False)

train_data = all_data.iloc[tr_ind, :]
test_data = all_data.iloc[np.setdiff1d(ind_range, tr_ind), :]
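
The split above draws the training rows at random without a seed, so the two sets change from run to run. A minimal sketch of a reproducible variant follows, assuming a fixed seed; the seed value, the toy sizes and the name `te_ind` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical toy sizes; the notebook uses n_tr = 7500 on the full data set.
n_total, n_tr = 10, 7

rng = np.random.RandomState(0)      # fixed seed (an assumption) for repeatability
perm = rng.permutation(n_total)     # shuffled row positions
tr_ind, te_ind = perm[:n_tr], perm[n_tr:]

# The two index sets are disjoint and together cover every row.
print(sorted(tr_ind.tolist()), sorted(te_ind.tolist()))
```

The resulting index arrays can be passed to `all_data.iloc[...]` in the same way as `tr_ind` above.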

Set the dimensions of the input and the embedding. Because of the hierarchical nature of the network, the input has to be a 3-dimensional tensor of fixed size (sample_size x MAX_SENTS x MAX_SENT_LENGTH).

MAX_SENT_LENGTH : the maximum number of words in each sentence.

MAX_SENTS : the maximum number of sentences in each document.

Longer documents and sentences are truncated; shorter ones are padded with zeros.

MAX_NB_WORDS : the vocabulary size (the number of most frequent words to keep)

EMBEDDING_DIM : the dimensionality of the word embedding

In [5]:
MAX_SENT_LENGTH = 100
MAX_SENTS = 30
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 200
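
To illustrate the truncation and zero padding described above, here is a minimal sketch on a toy document. The helper `pad_document` and the small limits are hypothetical and only serve the example; the notebook builds the full tensor in a later cell.

```python
import numpy as np

def pad_document(doc, max_sents, max_sent_length):
    """Truncate/zero-pad a list of sentences (lists of word ids) to a fixed 2-D shape."""
    out = np.zeros((max_sents, max_sent_length), dtype='int32')
    for j, sent in enumerate(doc[:max_sents]):
        ids = sent[:max_sent_length]
        out[j, :len(ids)] = ids
    return out

# Toy document with 3 sentences; the limits here are 2 sentences of 4 words each.
doc = [[5, 2, 9], [7, 1, 3, 8, 4, 6], [2, 2]]
padded = pad_document(doc, max_sents=2, max_sent_length=4)
print(padded)
```

The first sentence is padded with one zero, the second is cut to four words, and the third is dropped.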

Fit a Keras tokenizer to the most frequent words, using the entire training data set as the corpus.
Then create the training data in the required 3-D format.

In [6]:
import nltk 

nltk.download('punkt')

reviews = []
labels = []
texts = []

for idx in range(train_data.shape[0]):
    text = train_data['text'].iloc[idx]
    texts.append(text)
    sentences = nltk.tokenize.sent_tokenize(text)
    reviews.append(sentences)
    labels.append(train_data['rating'].iloc[idx])

[nltk_data] Downloading package punkt to /home/anargyri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
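
The fitted tokenizer exposes a `word_index` that ranks words by decreasing frequency, with ids starting at 1 so that 0 stays free for padding. The sketch below mimics that ranking in plain Python. It is a simplified stand-in (whitespace splitting only, no punctuation filtering), and `build_word_index` is a hypothetical helper, not a Keras function.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; ids start at 1 so that 0 can be used for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ["happy happy day", "sad day", "happy"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```

Since `word_index` lists every word seen during fitting, the next cell checks each id against `MAX_NB_WORDS` explicitly.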

In [8]:
data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
doc_lst = []

# keep the MAX_NB_WORDS most frequent words and replace the rest with 'UNK'
# truncate to the first MAX_SENTS sentences per doc and MAX_SENT_LENGTH words per sentence

for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences[:MAX_SENTS]):
        words_in_sent = []
        for k, word in enumerate(text_to_word_sequence(sent)[:MAX_SENT_LENGTH]):
            word_id = tokenizer.word_index.get(word)
            if word_id is not None and word_id < MAX_NB_WORDS:
                data[i, j, k] = word_id
                words_in_sent.append(word)
            else:
                # out-of-vocabulary words share the extra index MAX_NB_WORDS
                data[i, j, k] = MAX_NB_WORDS
                words_in_sent.append('UNK')
        doc_lst.append(words_in_sent)

Convert the ratings to one-hot categorical labels.

In [9]:
word_index = tokenizer.word_index
print('Total %s unique tokens.' % len(word_index))

y_train = to_categorical(np.asarray(labels))
x_train = data

print('Shape of data tensor:', x_train.shape)
print('Shape of label tensor:', y_train.shape)

Total 14725 unique tokens.
Shape of data tensor: (7500, 30, 100)
Shape of label tensor: (7500, 2)
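
For reference, the one-hot conversion can be sketched in NumPy as follows. The helper `one_hot` is a hypothetical stand-in written for this illustration; the notebook itself relies on `to_categorical`.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype='int64')
    encoded = np.zeros((labels.shape[0], n_classes), dtype='float32')
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

toy_labels = [0, 1, 1, 0]            # 0 = sadness, 1 = happiness, as in the notebook
y_toy = one_hot(toy_labels, n_classes=2)
print(y_toy.shape)
```

Taking `argmax` along axis 1 recovers the integer labels, which is how the predictions are decoded at the end of the notebook.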


In [10]:
n_classes = y_train.shape[1]

Train word2vec on the training documents in order to initialize the word embedding. Ignore rare words (min_count=6). Use skip-gram as the training algorithm (sg=1).

In [11]:
# train word2vec on the sentences to initialize the word embedding 
import gensim, logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# use skip-gram
word2vec_model = gensim.models.Word2Vec(doc_lst, min_count=6, size=EMBEDDING_DIM, sg=1, workers=os.cpu_count())

2017-09-15 17:37:26,241 : INFO : collecting all words and their counts
2017-09-15 17:37:26,242 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-09-15 17:37:26,262 : INFO : PROGRESS: at sentence #10000, processed 78426 words, keeping 12124 word types
2017-09-15 17:37:26,270 : INFO : collected 14725 word types from a corpus of 104038 raw words and 13287 sentences
2017-09-15 17:37:26,270 : INFO : Loading a fresh vocabulary
2017-09-15 17:37:26,281 : INFO : min_count=6 retains 1585 unique words (10% of original 14725, drops 13140)
2017-09-15 17:37:26,282 : INFO : min_count=6 leaves 85487 word corpus (82% of original 104038, drops 18551)
2017-09-15 17:37:26,287 : INFO : deleting the raw counts dictionary of 14725 items
2017-09-15 17:37:26,288 : INFO : sample=0.001 downsamples 73 most-common words
2017-09-15 17:37:26,289 : INFO : downsampling leaves estimated 62674 word corpus (73.3% of prior 85487)
2017-09-15 17:37:26,290 : INFO : estimated required memory for 
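
With `sg=1`, word2vec learns to predict the context words around each center word, after words rarer than `min_count` have been dropped. The sketch below only enumerates such (center, context) pairs on a toy corpus with a fixed window. It is an illustration under simplifying assumptions: `skipgram_pairs` is a hypothetical helper, and gensim's own training additionally varies the window size and downsamples frequent words.

```python
from collections import Counter

def skipgram_pairs(sentences, window=2, min_count=1):
    """Enumerate (center, context) pairs after dropping words rarer than min_count."""
    counts = Counter(w for sent in sentences for w in sent)
    pairs = []
    for sent in sentences:
        kept = [w for w in sent if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

toy_sentences = [["i", "feel", "happy"], ["i", "feel", "sad"]]
pairs = skipgram_pairs(toy_sentences, window=1, min_count=2)
print(pairs)
```

In this toy corpus "happy" and "sad" occur only once, so `min_count=2` removes them before any pair is formed.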

Create the initial embedding matrix from the output of word2vec.

In [12]:
embeddings_index = {}

for word in word2vec_model.wv.vocab:
    coefs = np.asarray(word2vec_model.wv[word], dtype='float32')
    embeddings_index[word] = coefs

print('Total %s word vectors.' % len(embeddings_index))

Total 1585 word vectors.


In [13]:
# Initial embedding
embedding_matrix = np.zeros((MAX_NB_WORDS + 1, EMBEDDING_DIM))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and i < MAX_NB_WORDS:
        embedding_matrix[i] = embedding_vector

# row MAX_NB_WORDS in data corresponds to 'UNK'; word2vec only has a vector
# for it if 'UNK' occurred at least min_count times in doc_lst
if 'UNK' in embeddings_index:
    embedding_matrix[MAX_NB_WORDS] = embeddings_index['UNK']

Define the network.
The mask_zero option determines whether masking is performed, i.e. whether the layers ignore the padded zeros in shorter documents.

In [52]:
# building Hierarchical Attention network

REG_PARAM = 1e2
l2_reg = regularizers.l2(REG_PARAM)

embedding_layer = Embedding(MAX_NB_WORDS + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SENT_LENGTH,
                            trainable=True,
                            mask_zero=True,
                            embeddings_regularizer=l2_reg,
                            weights=[embedding_matrix])

Define a custom layer implementing the attention mechanism.

In [53]:
CONTEXT_DIM = 100

class AttLayer(Layer):
    def __init__(self, regularizer=None, **kwargs):
        self.regularizer = regularizer
        self.supports_masking = True
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3        
        self.W = self.add_weight(name='W', shape=(input_shape[-1], CONTEXT_DIM), initializer='normal', trainable=True, 
                                 regularizer=self.regularizer)
        self.b = self.add_weight(name='b', shape=(CONTEXT_DIM,), initializer='normal', trainable=True, 
                                 regularizer=self.regularizer)
        self.u = self.add_weight(name='u', shape=(CONTEXT_DIM,), initializer='normal', trainable=True, 
                                 regularizer=self.regularizer)        
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

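    def call(self, x, mask=None):
        eij = K.dot(K.tanh(K.dot(x, self.W) + self.b), self.u)
        ai = K.exp(eij)
        if mask is not None:
            # zero out the padded positions before normalizing, so that
            # the attention weights over the real inputs sum to one
            ai *= K.cast(mask, K.floatx())
        # epsilon avoids division by zero for fully padded sequences
        alphas = ai / (K.sum(ai, axis=1).dimshuffle(0, 'x') + K.epsilon())
        weighted_input = x * alphas.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)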

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])
    
    def get_config(self):
        config = {}
        base_config = super(AttLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_mask(self, inputs, mask):
        return None
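
A minimal NumPy sketch of this kind of attention pooling may help in following the tensor shapes. It is an illustration under stated assumptions rather than the layer itself: the function `attention_pool` and the toy dimensions are made up, and the mask is applied before normalization so that the weights over the real (unpadded) steps sum to one.

```python
import numpy as np

def attention_pool(h, W, b, u, mask=None):
    """Attention-weighted sum over the time axis.

    h: (batch, steps, dim) hidden states; W: (dim, ctx); b, u: (ctx,)
    mask: optional (batch, steps) array, 1 for real steps and 0 for padding.
    """
    scores = np.tanh(h @ W + b) @ u                        # (batch, steps)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    if mask is not None:
        weights = weights * mask                           # zero out padded steps
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (h * weights[:, :, None]).sum(axis=1), weights

rng = np.random.RandomState(0)
h = rng.randn(2, 4, 6)
W, b, u = rng.randn(6, 3), rng.randn(3), rng.randn(3)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]], dtype='float64')
pooled, alphas = attention_pool(h, W, b, u, mask)
print(pooled.shape, alphas.sum(axis=1))
```

The pooled output has one vector per sequence, which matches `compute_output_shape` in the layer above.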

GRU_UNITS is the dimensionality of each GRU output (the number of GRU units). GPU_IMPL = 2 selects a matricized RNN implementation, which is more appropriate for training on a GPU.

There are two levels of models in the definition. The sentence model `sentEncoder` is shared across all sentences in the input document.

In [54]:
GPU_IMPL = 2          
GRU_UNITS = 50        

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(GRU_UNITS, return_sequences=True, kernel_regularizer=l2_reg, implementation=GPU_IMPL))(embedded_sequences)
l_att = AttLayer(regularizer=l2_reg)(l_lstm)            
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(GRU_UNITS, return_sequences=True, kernel_regularizer=l2_reg, implementation=GPU_IMPL))(review_encoder)
l_att_sent = AttLayer(regularizer=l2_reg)(l_lstm_sent)       
preds = Dense(n_classes, activation='softmax', kernel_regularizer=l2_reg)(l_att_sent)
model = Model(review_input, preds)

In [55]:
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=0.01, momentum=0.9),
              metrics=['acc'])

In [56]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_10 (InputLayer)        (None, 30, 100)           0         
_________________________________________________________________
time_distributed_5 (TimeDist (None, 30, 100)           4085700   
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 30, 100)           45300     
_________________________________________________________________
att_layer_10 (AttLayer)      (None, 100)               10200     
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 202       
=================================================================
Total params: 4,141,402
Trainable params: 4,141,402
Non-trainable params: 0
_________________________________________________________________
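
The parameter counts in the summary can be checked by hand. The sketch below assumes each GRU direction has three gates, each with an input kernel, a recurrent kernel and one bias vector; under that assumption the arithmetic reproduces the numbers printed above. The helper names are hypothetical.

```python
def gru_params(input_dim, units):
    """Parameters of one GRU direction: 3 gates x (input kernel + recurrent kernel + bias)."""
    return 3 * (input_dim * units + units * units + units)

def attention_params(input_dim, context_dim):
    """W (input_dim x context_dim) plus b and u (context_dim each)."""
    return input_dim * context_dim + 2 * context_dim

MAX_NB_WORDS, EMBEDDING_DIM, GRU_UNITS, CONTEXT_DIM, N_CLASSES = 20000, 200, 50, 100, 2

embedding = (MAX_NB_WORDS + 1) * EMBEDDING_DIM
word_gru = 2 * gru_params(EMBEDDING_DIM, GRU_UNITS)      # bidirectional, word level
sent_gru = 2 * gru_params(2 * GRU_UNITS, GRU_UNITS)      # bidirectional, sentence level
attention = attention_params(2 * GRU_UNITS, CONTEXT_DIM)
dense = 2 * GRU_UNITS * N_CLASSES + N_CLASSES

sentence_encoder = embedding + word_gru + attention      # the TimeDistributed row
total = sentence_encoder + sent_gru + attention + dense
print(sentence_encoder, sent_gru, attention, dense, total)
```

Almost all of the parameters sit in the embedding matrix, which has (20000 + 1) x 200 = 4,000,200 entries.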


In [57]:
ref_str = 'tweets'
history = History()
csv_logger = CSVLogger('./hatt_model_' + str(REG_PARAM) + '_' + ref_str + '.log',
                       separator=',',
                       append=True)

Order the training data by the number of sentences in each document (as suggested in the [Yang et al.] paper), so that each batch groups documents of similar length.

In [20]:
doc_lengths = [len(r) for r in reviews]
ind = np.argsort(doc_lengths)
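
The effect of this ordering on the batches can be sketched in plain Python. The helper `length_sorted_batches` and the toy lengths are hypothetical; the sketch assumes batches are taken as consecutive slices of the sorted order, which is what happens when training runs with `shuffle=False`.

```python
def length_sorted_batches(doc_lengths, batch_size):
    """Group document positions into consecutive batches after sorting by length."""
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i])
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]

toy_lengths = [3, 1, 7, 2, 8, 1]      # sentences per document (made-up values)
batches = length_sorted_batches(toy_lengths, batch_size=2)
print(batches)
```

Each batch now contains documents with similar sentence counts, so short documents are not grouped with long ones.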

In [58]:
t1 = time.time()

print("model fitting - Hierarchical attention network")
model.fit(x_train[ind,:,:], y_train[ind,:], epochs=10, batch_size=64, shuffle=False, 
          callbacks=[history, csv_logger], verbose=2)

t2 = time.time()

model fitting - Hierarchical attention network
Epoch 1/10
98s - loss: 26514.2522 - acc: 0.5324
Epoch 2/10
97s - loss: 0.8006 - acc: 0.5104
Epoch 3/10
97s - loss: 0.6942 - acc: 0.5104
Epoch 4/10
98s - loss: 0.6942 - acc: 0.5104
Epoch 5/10
97s - loss: 0.6942 - acc: 0.5104
Epoch 6/10
97s - loss: 0.6942 - acc: 0.5104
Epoch 7/10
97s - loss: 0.6942 - acc: 0.5104
Epoch 8/10
97s - loss: 0.6942 - acc: 0.5104
Epoch 9/10
98s - loss: 0.6942 - acc: 0.5104
Epoch 10/10
97s - loss: 0.6942 - acc: 0.5104


In [59]:
# save model
model.save('./hatt_model_{0}_{1}.h5'.format(REG_PARAM, ref_str))

In [60]:
np.savetxt('./hatt_model_{0}_{1}_time.txt'.format(REG_PARAM, ref_str), [REG_PARAM, (t2-t1) / 3600])
with open('./hatt_model_{0}_{1}_history.txt'.format(REG_PARAM, ref_str), "w") as res_file:
    res_file.write(str(history.history))

In [61]:
test_reviews = []
test_labels = []
test_texts = []

for idx in range(test_data.shape[0]):
    text = test_data['text'].iloc[idx]
    test_texts.append(text)
    sentences = nltk.tokenize.sent_tokenize(text)
    test_reviews.append(sentences)
    test_labels.append(test_data['rating'].iloc[idx])

In [62]:
data2 = np.zeros((len(test_texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

for i, sentences in enumerate(test_reviews):
    for j, sent in enumerate(sentences[:MAX_SENTS]):
        for k, word in enumerate(text_to_word_sequence(sent)[:MAX_SENT_LENGTH]):
            word_id = tokenizer.word_index.get(word)
            if word_id is not None and word_id < MAX_NB_WORDS:
                data2[i, j, k] = word_id
            else:
                # out-of-vocabulary words share the extra index MAX_NB_WORDS
                data2[i, j, k] = MAX_NB_WORDS

In [63]:
y_test = to_categorical(np.asarray(test_labels))
x_test = data2

In [64]:
from sklearn.metrics import accuracy_score

In [68]:
preds = model.predict(x_test)
accuracy_score(test_labels, preds.argmax(axis=1))

0.51397624039133472
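
The accuracy computation above amounts to taking the most probable class for each row and comparing it with the true label. A minimal pure-Python sketch on made-up probabilities follows; `accuracy_from_probs` is a hypothetical helper, not the scikit-learn function.

```python
def accuracy_from_probs(y_true, probs):
    """Fraction of rows whose highest-probability class equals the true label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probs]
    correct = sum(int(p == t) for p, t in zip(predicted, y_true))
    return correct / len(y_true)

toy_true = [0, 1, 1, 0]
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
toy_acc = accuracy_from_probs(toy_true, toy_probs)
print(toy_acc)   # 0.75
```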