The code in this notebook is based on the [Keras documentation](https://keras.io/) and [blog](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) as well as this [word2vec tutorial](http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/).

In [1]:
import numpy as np
import os
import pandas as pd
import pickle
import time

os.environ['KERAS_BACKEND']='cntk'
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.models import Sequential, load_model
from keras import regularizers
from keras.optimizers import SGD
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.callbacks import History, CSVLogger
from keras.utils import to_categorical

Using CNTK backend


Download the book reviews data from Azure Machine Learning.

In [2]:
from azureml import Workspace
ws = Workspace(
    workspace_id='817780d9ee0d4a878e25f8c9deb3b866',
    authorization_token='6df8a52943bd49eba6e57446bc73f5fc',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['Book Reviews from Amazon']
all_data = ds.to_dataframe()
all_data.rename(columns={0: 'rating', 1: 'text'}, inplace=True)
all_data.loc[:, 'rating'] = all_data['rating'] - 1           # reindex ratings to start from 0

In [2]:
"""
from azureml import Workspace
ws = Workspace(
    workspace_id='817780d9ee0d4a878e25f8c9deb3b866',
    authorization_token='6df8a52943bd49eba6e57446bc73f5fc',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['dfe_happysad_utf.csv']
all_data = ds.to_dataframe()
all_data.rename(columns={'features': 'text', 'label': 'rating'}, inplace=True)
all_data.replace({'rating': {'sadness': 0, 'happiness': 1}}, inplace=True)
"""

Split the data into a training set and a test set: n_tr randomly chosen rows are used for training and the remaining rows for testing.

In [3]:
n_tr = 7500

ind_range = np.arange(all_data.shape[0])
tr_ind = np.random.choice(ind_range, n_tr, replace=False)

train_data = all_data.iloc[tr_ind, :]
test_data = all_data.iloc[np.setdiff1d(ind_range, tr_ind), :]
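
The split above draws a new random sample on every run, so the membership of the training and test sets changes between executions. A minimal sketch of a repeatable variant, assuming a fixed seed is acceptable; `split_indices` and the seed value are hypothetical and not part of the original notebook:

```python
import numpy as np

def split_indices(n_rows, n_train, seed=0):
    """Return disjoint (train, test) position arrays drawn with a fixed seed."""
    rng = np.random.RandomState(seed)   # seeded generator, so the split is repeatable
    perm = rng.permutation(n_rows)      # random ordering of all row positions
    return perm[:n_train], np.sort(perm[n_train:])

# Toy sizes for illustration; the notebook uses n_tr = 7500 rows for training.
tr_ind, te_ind = split_indices(n_rows=10, n_train=7)
```

The two arrays can be passed to `all_data.iloc[...]` in the same way as `tr_ind` in the cell above.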

Set the dimensions of the input and the embedding. 

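MAX_DOC_LEN : the size of the input, i.e. the number of words in the document. Longer documents are truncated and shorter ones are padded with zeros; with the default pad_sequences settings used in this notebook, both are applied at the start of the sequence.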

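VOCAB_SIZE : the size of the word encoding, i.e. the number of most frequent words to keep in the vocabulary. Index 0 is reserved for padding, so the Keras tokenizer actually keeps the VOCAB_SIZE - 1 most frequent words.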

EMBEDDING_DIM : the dimensionality of the word embedding

In [4]:
MAX_DOC_LEN = 300
VOCAB_SIZE = 6000
EMBEDDING_DIM = 200
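
To make the effect of MAX_DOC_LEN concrete, here is a small pure-Python sketch of the padding and truncation that `pad_sequences` performs with its default settings (`padding` and `truncating` both default to `'pre'`, so zeros are added and tokens are dropped at the start). `pad_pre` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq`, left-padding with `value` if it is shorter."""
    trimmed = seq[-maxlen:]                              # truncate from the front
    return [value] * (maxlen - len(trimmed)) + trimmed   # pad at the front

short_doc = pad_pre([5, 8, 2], maxlen=5)
long_doc = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_doc)  # [0, 0, 5, 8, 2]
print(long_doc)   # [3, 4, 5, 6, 7]
```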

In [5]:
TEXT_COL = 'text'
LABEL_COL = 'rating'

Fit a Keras tokenizer on the entire training data set, keeping only the most frequent words, then convert the training and test documents to padded sequences of word indices.

In [6]:
# tokenize, create seqs, pad
tok = Tokenizer(num_words=VOCAB_SIZE, lower=True, split=" ")
tok.fit_on_texts(train_data[TEXT_COL])
train_seq = tok.texts_to_sequences(train_data[TEXT_COL])
train_seq = sequence.pad_sequences(train_seq, maxlen=MAX_DOC_LEN)
test_seq = tok.texts_to_sequences(test_data[TEXT_COL])
test_seq = sequence.pad_sequences(test_seq, maxlen=MAX_DOC_LEN)
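
The Keras `Tokenizer` assigns integer indices by descending word frequency, starting from 1, and `texts_to_sequences` drops words whose index is `num_words` or higher. The sketch below imitates that behaviour in plain Python on a toy corpus. `build_index` and `to_sequence` are hypothetical helpers, and the plain whitespace split is a simplification of the tokenizer's punctuation filtering.

```python
from collections import Counter

def build_index(texts):
    """Map each word to a rank (1 = most frequent), as a frequency-ordered vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequence(text, word_index, num_words):
    """Replace words by their index, dropping unknown words and ranks >= num_words."""
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

corpus = ["good book good read", "good bad book"]
word_index = build_index(corpus)          # "good" -> 1, "book" -> 2, rarer words -> 3 and 4
seq = to_sequence("good bad book unknown", word_index, num_words=3)
print(seq)  # [1, 2]
```

With `num_words=3` only two words survive, which mirrors how VOCAB_SIZE bounds the indices that reach the embedding layer.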

Convert the ratings to one-hot categorical labels.

In [7]:
labels = to_categorical(np.asarray(train_data[LABEL_COL]))
labels = labels.astype('float32')

In [8]:
n_classes = labels.shape[1]
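
When no class count is given, `to_categorical` infers it as the largest label plus one, which is why the labels need to start from 0. A dependency-free sketch of the same encoding; `one_hot` is a hypothetical helper, not the Keras function:

```python
def one_hot(ratings, n_classes=None):
    """Map integer class ids to one-hot rows; n_classes defaults to max id + 1."""
    if n_classes is None:
        n_classes = max(ratings) + 1
    return [[1.0 if c == r else 0.0 for c in range(n_classes)] for r in ratings]

encoded = one_hot([0, 2, 1])
print(encoded)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```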

Train word2vec on the training documents in order to initialize the word embedding. Ignore rare words, i.e. words that occur fewer than 6 times (min_count=6). Use skip-gram as the training algorithm (sg=1).

In [53]:
import nltk 

nltk.download('punkt')

sent_lst = []

for doc in train_data[TEXT_COL]:
    sentences = nltk.tokenize.sent_tokenize(doc)
    for sent in sentences:
        word_lst = [w for w in nltk.tokenize.word_tokenize(sent) if w.isalnum()]
        sent_lst.append(word_lst)

[nltk_data] Downloading package punkt to /home/anargyri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [54]:
import gensim, logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# use skip-gram
word2vec_model = gensim.models.Word2Vec(sentences=sent_lst, min_count=6, size=EMBEDDING_DIM, sg=1, workers=os.cpu_count())

2017-09-15 11:21:16,427 : INFO : collecting all words and their counts
2017-09-15 11:21:16,428 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-09-15 11:21:16,451 : INFO : PROGRESS: at sentence #10000, processed 75804 words, keeping 13063 word types
2017-09-15 11:21:16,459 : INFO : collected 15977 word types from a corpus of 100883 raw words and 13257 sentences
2017-09-15 11:21:16,460 : INFO : Loading a fresh vocabulary
2017-09-15 11:21:16,470 : INFO : min_count=6 retains 1613 unique words (10% of original 15977, drops 14364)
2017-09-15 11:21:16,471 : INFO : min_count=6 leaves 80706 word corpus (79% of original 100883, drops 20177)
2017-09-15 11:21:16,476 : INFO : deleting the raw counts dictionary of 15977 items
2017-09-15 11:21:16,478 : INFO : sample=0.001 downsamples 65 most-common words
2017-09-15 11:21:16,479 : INFO : downsampling leaves estimated 59121 word corpus (73.3% of prior 80706)
2017-09-15 11:21:16,479 : INFO : estimated required memory for 
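
With sg=1, word2vec trains on skip-gram pairs: each word is used to predict the words inside a window around it. The sketch below only enumerates such (center, context) pairs for a fixed window. It illustrates the idea and is not gensim's implementation, which also downsamples frequent words and varies the effective window size. `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(sentence, window=2):
    """List (center, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(["a", "great", "book"], window=1)
print(pairs)
# [('a', 'great'), ('great', 'a'), ('great', 'book'), ('book', 'great')]
```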

Create the initial embedding matrix from the output of word2vec.

In [55]:
embeddings_index = {}

for word in word2vec_model.wv.vocab:
    coefs = np.asarray(word2vec_model.wv[word], dtype='float32')
    embeddings_index[word] = coefs

print('Total %s word vectors.' % len(embeddings_index))

# Initial embedding
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))

for word, i in tok.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and i < VOCAB_SIZE:
        embedding_matrix[i] = embedding_vector

Total 1613 word vectors.
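
Only 1613 words received a word2vec vector, while the embedding has VOCAB_SIZE = 6000 rows, so most rows of the matrix start at zero. The sketch below repeats the same lookup on toy data and counts how many rows are actually initialized. It also shows one possible source of misses: the Keras tokenizer lower-cases words, whereas the NLTK tokens fed to word2vec keep their case, so a capitalised word2vec entry does not match. `init_embedding` and the toy dictionaries are hypothetical.

```python
import numpy as np

def init_embedding(word_index, vectors, vocab_size, dim):
    """Copy known vectors into a zero matrix and report how many rows were filled."""
    matrix = np.zeros((vocab_size, dim), dtype="float32")
    hits = 0
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None and i < vocab_size:
            matrix[i] = vec
            hits += 1
    return matrix, hits

toy_index = {"good": 1, "book": 2, "rare": 3}                 # tokenizer-style, lower-cased
toy_vectors = {"good": np.ones(4), "Book": np.full(4, 2.0)}   # "Book" is capitalised on purpose
matrix, hits = init_embedding(toy_index, toy_vectors, vocab_size=3, dim=4)
print(hits)  # 1
```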


LSTM_DIM is the dimensionality of each LSTM output (the number of LSTM units).
The mask_zero option determines whether masking is performed, i.e. whether the layers ignore the padded zeros in shorter documents. It is set to False in the model below, so the padded positions are processed like any other input.
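
As a rough illustration of what masking means, the sketch below averages per-timestep values while skipping positions whose token id is 0. It is a toy stand-in for how a mask-aware layer ignores padding, not Keras code; `masked_mean` is a hypothetical helper.

```python
def masked_mean(token_ids, values):
    """Average `values` over positions whose token id is non-zero (i.e. not padding)."""
    kept = [v for t, v in zip(token_ids, values) if t != 0]
    return sum(kept) / len(kept) if kept else 0.0

token_ids = [0, 0, 7, 3]           # two padded positions, two real tokens
values = [9.0, 9.0, 1.0, 3.0]      # made-up per-timestep outputs
print(masked_mean(token_ids, values))  # 2.0  (padding ignored)
print(sum(values) / len(values))       # 5.5  (padding included)
```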

In [56]:
BATCH_SIZE = 100
NUM_EPOCHS = 10
LSTM_DIM = 100
# Note: momentum is left at its default of 0, so nesterov=True has no effect here.
OPTIMIZER = SGD(lr=0.01, nesterov=True)

In [57]:
def lstm_create_train(reg_param, ref_str):
    l2_reg = regularizers.l2(reg_param)

    # model init
    embedding_layer = Embedding(VOCAB_SIZE,
                                EMBEDDING_DIM,
                                input_length=MAX_DOC_LEN,
                                trainable=True,
                                mask_zero=False,
                                embeddings_regularizer=l2_reg,
                                weights=[embedding_matrix])

    lstm_layer = LSTM(units=LSTM_DIM, kernel_regularizer=l2_reg)
    dense_layer = Dense(n_classes, activation='softmax', kernel_regularizer=l2_reg)

    model = Sequential()
    model.add(embedding_layer)
    model.add(Bidirectional(lstm_layer))
    model.add(dense_layer)

    model.compile(loss='categorical_crossentropy',
                  optimizer=OPTIMIZER,
                  metrics=['acc'])

    history = History()
    csv_logger = CSVLogger('./lstm_model_wvec_{0}_{1}.log'.format(reg_param, ref_str),
                           separator=',',
                           append=True)

    print("Training model with regularization parameter = {}".format(reg_param))
    t1 = time.time()
    # model fit
    model.fit(train_seq,
              labels.astype('float32'),
              batch_size=BATCH_SIZE,
              epochs=NUM_EPOCHS,
              callbacks=[history, csv_logger],
              verbose=2)
    t2 = time.time()
    print("\n")
    
    # save model
    model.save('./lstm_wvec_{0}_{1}_model.h5'.format(reg_param, ref_str))
    np.savetxt('./lstm_wvec_{0}_{1}_time.txt'.format(reg_param, ref_str), 
               [reg_param, (t2-t1) / 3600])
    with open('./lstm_wvec_{0}_{1}_history.txt'.format(reg_param, ref_str), "w") as res_file:
        res_file.write(str(history.history))
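
The `reg_param` argument is passed to `regularizers.l2`, which adds `reg_param * sum(w ** 2)` over the regularized weights to the training loss. A small sketch of that penalty, to give a feel for the range of values swept in the next cell; `l2_penalty` and the toy weights are hypothetical.

```python
def l2_penalty(weights, reg_param):
    """L2 penalty: reg_param times the sum of squared weights."""
    return reg_param * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]   # made-up weights; sum of squares is 5.25
for rp in [1e-10, 1e-4, 1e2]:
    print(rp, l2_penalty(weights, rp))
```

The loss reported during training includes this penalty, so loss values are not directly comparable across very different values of `reg_param`.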

In [58]:
for rp in [1e-10, 1e-7, 1e-4, 1e-1, 1e2]:
    lstm_create_train(rp, 'tweets')

Training model with regularization parameter = 1e-10
Epoch 1/10
33s - loss: 0.6922 - acc: 0.5447
Epoch 2/10
32s - loss: 0.6916 - acc: 0.5557
Epoch 3/10
32s - loss: 0.6910 - acc: 0.5652
Epoch 4/10
32s - loss: 0.6905 - acc: 0.5893
Epoch 5/10
32s - loss: 0.6899 - acc: 0.5828
Epoch 6/10
32s - loss: 0.6893 - acc: 0.5827
Epoch 7/10
32s - loss: 0.6887 - acc: 0.5943
Epoch 8/10
32s - loss: 0.6882 - acc: 0.5901
Epoch 9/10
32s - loss: 0.6876 - acc: 0.5933
Epoch 10/10
32s - loss: 0.6870 - acc: 0.5905


Training model with regularization parameter = 1e-07
Epoch 1/10
32s - loss: 0.6944 - acc: 0.4816
Epoch 2/10
32s - loss: 0.6936 - acc: 0.4899
Epoch 3/10
32s - loss: 0.6929 - acc: 0.5145
Epoch 4/10
32s - loss: 0.6924 - acc: 0.5299
Epoch 5/10
32s - loss: 0.6917 - acc: 0.5463
Epoch 6/10
32s - loss: 0.6911 - acc: 0.5524
Epoch 7/10
32s - loss: 0.6905 - acc: 0.5656
Epoch 8/10
32s - loss: 0.6899 - acc: 0.5616
Epoch 9/10
32s - loss: 0.6893 - acc: 0.5647
Epoch 10/10
32s - loss: 0.6888 - acc: 0.5716


Training

In [63]:
from sklearn.metrics import accuracy_score

for rp in [1e-10, 1e-7, 1e-4, 1e-1, 1e2]:
    model = load_model('./lstm_wvec_{0}_{1}_model.h5'.format(rp, 'tweets'))
    preds = model.predict_classes(test_seq, verbose=0)
    print((rp, accuracy_score(test_data[LABEL_COL], preds)))

(1e-10, 0.59364081062194274)
(1e-07, 0.57092941998602376)
(0.0001, 0.57477288609364086)
(0.1, 0.50454227812718377)
(100.0, 0.56533892382948991)
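
`predict_classes` returns the index of the largest predicted probability for each document, and `accuracy_score` is the fraction of those indices that equal the true labels. A dependency-free sketch of both steps with made-up probabilities; `argmax_classes` and `accuracy` are hypothetical helpers:

```python
def argmax_classes(probs):
    """Index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]   # made-up softmax outputs
y_true = [0, 1, 1, 1]
y_pred = argmax_classes(probs)
print(y_pred)                    # [0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
```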
