# Introduction
I originally started this notebook to take part in a Kaggle competition: https://www.kaggle.com/c/msk-redefining-cancer-treatment. Unfortunately, I didn't start until two days before the deadline, so I couldn't make a submission in time, let alone tailor the base model to the interesting nuances of the competition data: the same text can map to different classes because several genes are studied in one paper, and machine-generated fake data means not everything can be taken as ground truth with certainty.

I enjoyed the incentive to become more familiar with Pandas and Tensorflow. I look forward to studying the Kaggle kernels (e.g. https://www.kaggle.com/reiinakano/basic-nlp-bag-of-words-tf-idf-word2vec-lstm) to see what I may not have known and get a better sense of how else I could have approached the problem.

Here's looking forward to the next competition!

-DK

In [1]:
import numpy as np
import pandas as pd
import pickle
from nltk.tokenize import sent_tokenize, word_tokenize
import random
import tensorflow as tf
import re
import os
from collections import Counter
import pdb

# Pandas section

In [2]:
train_text = "training_text_mod_top" #changed top to allow for easy read using pd.read_csv
train_csv = "training_variants"
glove_fp = "glove.6B.300d.txt"

In [3]:
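# a multi-character (regex) separator is only supported by the Python parsing engine
text_df = pd.read_csv(train_text, sep = r'\|\|', engine = 'python')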
labels_df = pd.read_csv(train_csv, sep = ',')
# combine to one DataFrame
merged_traindev_df = pd.merge(text_df, labels_df, on = 'ID')

num_samples = merged_traindev_df.shape[0]



In [4]:
#randomize df order, then split into train_df and dev_df
shuffled_traindev_df = merged_traindev_df.sample(frac=1)
frac_train = 0.8 # fraction of samples used for training; the rest go to dev
num_train = int(frac_train*num_samples)
train_df, dev_df = shuffled_traindev_df.iloc[:num_train], shuffled_traindev_df.iloc[num_train:]
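
The shuffle above is different on every run, so the train/dev membership changes each time the notebook is executed. A minimal sketch of a repeatable split, assuming a fixed `random_state` is acceptable; the toy DataFrame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for merged_traindev_df (the column values here are made up).
toy_df = pd.DataFrame({"ID": range(10), "Class": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]})

# Shuffle with a fixed seed so the split is repeatable, then cut at 80%.
shuffled = toy_df.sample(frac=1, random_state=0)
num_train = int(0.8 * len(shuffled))
toy_train, toy_dev = shuffled.iloc[:num_train], shuffled.iloc[num_train:]

print(len(toy_train), len(toy_dev))
```

Using `.iloc` for both halves keeps the slicing purely positional, which matters after a shuffle because the index labels are no longer in order.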

In [5]:
# presume we have arrays of inputs and targets from, e.g., Pandas DataFrame
# X is size [num_features x num_samples]
# Y is size [num_samples]

# in this case, tokens are our "features" and are replaced by 300-D representations
# so num_features should be replaced by [representation_dim x num_tokens]
#
# we'll emulate the varying number of tokens with np.random.choice
# and presume we've already done a search-and-replace of "token" to "representation of token"
# in the original input (or initialized unknown words with small random values)
#
# we'll also pad X with window_radius rows of zeros at the beginnings and ends of the sentences
# so that the beginning and end of the paragraph are valid
# X = np.random.random(size = (num_samples, representation_dim, np.random.choice(100)))
# y = np.random.randint(0, high = num_classes, size = (num_samples))
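
The zero-padding idea described in the comments above can be sketched with `np.pad`. This is only an illustration under assumed toy sizes; `window_radius = 3` mirrors the commented-out setting further down, and the array of ones stands in for real token representations:

```python
import numpy as np

representation_dim = 4   # toy size; the comments above mention 300-D vectors
window_radius = 3        # hypothetical value, matching the commented-out setting later on
num_tokens = 5

# One toy "paragraph": a (num_tokens, representation_dim) array of token vectors.
x = np.ones((num_tokens, representation_dim))

# Pad window_radius rows of zeros before and after the token axis only.
x_padded = np.pad(x, pad_width=((window_radius, window_radius), (0, 0)), mode="constant")

print(x_padded.shape)  # (11, 4)
```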

In [6]:
def make_token2index(df, frac = 0.5, reverse = False):
    """
    Return a dictionary mapping the n most common tokens in the dataframe to an index [0, ..., n-1],
    where n = int(frac * number of distinct tokens).
    If frac is outside [0, 1], returns a mapping of all tokens.
    If reverse is set to True, the same n most common tokens are indexed in reverse order (least common first).
    Information about how token2index was created is stored in key 'CREATION INFO'.
    """
    c = Counter()
    for index, row in df.iterrows():
        c.update(word_tokenize(row.loc['Text']))
    
    if frac < 0 or frac > 1:
        num = None
    else:
        num = int(frac * len(c)) # c is the Counter of distinct tokens built above
    
    if reverse:
        token2index = {token[0]:i for i,token in enumerate(reversed(c.most_common(num)))}
    else:
        token2index = {token[0]:i for i,token in enumerate(c.most_common(num))}
    
    # 'CREATION INFO' will not override any actual tokens as this key contains whitespace
    token2index['CREATION INFO'] = ('frac = {0}'.format(frac), 'reverse = {0}'.format(reverse))
    
    with open('token2index.p', mode = 'wb') as f:
        pickle.dump(token2index, f, protocol = pickle.HIGHEST_PROTOCOL)
    
    return token2index
    

In [7]:
def make_token2index(df):
    "Return a dictionary mapping every token in df['Text'] to its frequency rank (0-indexed)."
    c = Counter()
    for index, row in df.iterrows():
        c.update(word_tokenize(row.loc['Text']))    
    token2index = {token[0]:i for i,token in enumerate(c.most_common())}
    
    with open('token2index.p', mode = 'wb') as f:
        pickle.dump(token2index, f, protocol = pickle.HIGHEST_PROTOCOL)
    
    return token2index
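
To make the rank mapping concrete, here is a small dependency-free sketch. It assumes `str.split` as a stand-in for `word_tokenize`, and the two toy documents are invented:

```python
from collections import Counter

# Toy documents; str.split stands in for nltk's word_tokenize here (an assumption
# made only to keep the example dependency-free).
docs = ["the gene the mutation", "the mutation kinase"]

counts = Counter()
for doc in docs:
    counts.update(doc.split())

# most_common() sorts by descending count, so enumerate() yields the frequency rank.
toy_token2index = {token: rank for rank, (token, _) in enumerate(counts.most_common())}
print(toy_token2index)
```

The most frequent token gets index 0, the next most frequent gets index 1, and so on.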
    

In [8]:
def make_text_counter(df):
    "Return a Counter mapping every token in df['Text'] to its number of occurrences."
    c = Counter()
    for index, row in df.iterrows():
        c.update(word_tokenize(row.loc['Text']))    
    
    with open('text_counter.p', mode = 'wb') as f:
        pickle.dump(c, f, protocol = pickle.HIGHEST_PROTOCOL)
    
    return c

In [9]:
# text_counter = make_text_counter(merged_traindev_df)

In [10]:
# conveniently, already done this time around
with open('text_counter.p', mode = 'rb') as f:
    text_counter = pickle.load(f)

In [11]:
# now, my computer can't handle a ton of ~300k dimensional vectors floating around, even if they're one-hot
# (~300k is the actual number of different tokens)
# so we'll cut down to near, say, the ~30th percentile and ignore all <unk>'s encountered in the training data
lexicon_size = int(0.1*len(text_counter))

ranked_tokens = text_counter.most_common()
window_start = int(0.3*len(text_counter) - 0.5 * lexicon_size)
window_end = int(0.3*len(text_counter) + 0.5 * lexicon_size)
token2index = {token[0]:i for i, token in enumerate(ranked_tokens[window_start:window_end])}
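
The slice above keeps a window of ranks centered on the 30% mark of the frequency-sorted vocabulary, with a width of 10% of the vocabulary. A toy sketch of the same arithmetic, using an invented 20-token counter:

```python
from collections import Counter

# Toy counter with 20 distinct tokens: "t0" is most frequent, "t19" least.
toy_counter = Counter({"t{}".format(i): 100 - i for i in range(20)})

vocab = len(toy_counter)
toy_lexicon_size = int(0.1 * vocab)              # window width: 10% of the vocabulary
center = 0.3 * vocab                             # rank at the 30% mark
start = int(center - 0.5 * toy_lexicon_size)
stop = int(center + 0.5 * toy_lexicon_size)

ranked = toy_counter.most_common()
toy_token2index = {token: i for i, (token, _) in enumerate(ranked[start:stop])}
print(start, stop, toy_token2index)
```

Note that a window placed this way excludes the very top of the ranking, so the most frequent tokens are treated as out-of-lexicon along with the rare ones.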

In [12]:
lexicon_size = len(token2index)
num_classes = 10

# Tensorflow section

Major thanks to the tutorial series starting at
https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767

(Though I am still slightly confused about a couple things. Oh well, time for empiricism!)

In [13]:
tf.reset_default_graph()
sess = tf.Session()
tf.VariableScope(True)

<tensorflow.python.ops.variable_scope.VariableScope at 0x1e75d90de10>

In [14]:
# # neighborhood around which to update token representation (toward context and toward class labels)
# # total context goes from [-window_radius + i, window_radius + i] for token at index i
# # so total context size is 2*window_radius + 1
# window_radius = 3
# batch_size = 2*window_radius + 1

batch_size = 8

# how many timesteps forward are taken from initial batch location
# (will need to ensure we aren't going past the end of the sequence!) (???)
truncated_backprop_length = 16

consec_token_length = 200
representation_dim = 100
num_layers = 3
state_size = 20
num_classes = 10

In [15]:
# we build the one-hot vectors with tf.one_hot() and evaluate them to NumPy arrays that are fed into tf.placeholder()
# (largely because encoding such arrays directly in NumPy is either inelegant or memory-intensive)
def choose_batch(df, n = batch_size, num_tokens = 50):
    '''
    Input:
    df - DataFrame containing both input text ('Text') and labels ('Class').
    n - batch size
    num_tokens - number of lexicon tokens taken from each sampled text
    Outputs:
    X_arr - (batch_size, num_tokens, lexicon_size) NumPy array of one-hot vectors 
    corresponding to a string of num_tokens tokens held in token2index and located in the sample.
    These tokens are *not* necessarily immediately consecutive -- some intervening tokens not in the lexicon
    will have been removed.
    Y_arr - (batch_size, num_classes) NumPy array of one-hot vectors corresponding to correct class
    '''
    # rows are sampled at random inside the loop below until the batch is full
    # now we extract out rows
    X, Y = [], []
    while len(X) < n:
        try:
            for index, row in df.sample(n).iterrows():
                text, target = row.loc['Text'], row.loc['Class']
                index_list = [token2index[token] for token in word_tokenize(text) if token in token2index]
        #         pdb.set_trace()
                rand_i = random.randint(0, len(index_list) - num_tokens - 1)

                tokens_sample = index_list[rand_i:rand_i + num_tokens]
                X.append(tf.one_hot(tokens_sample, lexicon_size)), Y.append(tf.one_hot(target, num_classes))
                if len(X) == n:
                    break
        except ValueError: #not enough recognized tokens in a sample
            pass # try another sample
    # stack along batch axis (chosen to be axis=0) before returning
    X_tensor, Y_tensor = tf.stack(X, axis = 0, name = 'batch_X'), tf.stack(Y, axis = 0, name = 'batch_Y')
    with tf.Session(): #evaluate into NumPy arrays
        X_arr, Y_arr = X_tensor.eval(), Y_tensor.eval()
    # the np.bool alias was removed in NumPy 1.24; the builtin bool is equivalent
    return X_arr.astype(bool), Y_arr.astype(bool)
#     return X_tensor, Y_tensor
        
        
# test_X, test_Y = choose_batch(train_df, n = 100, num_tokens = 100)
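
As an alternative to building the one-hot window with `tf.one_hot` and a throwaway session, the same encoding can be sketched in plain NumPy. The helper name `one_hot_window` and the toy indices are hypothetical and not part of the notebook:

```python
import random
import numpy as np

def one_hot_window(index_list, num_tokens, lexicon_size):
    """Return a (num_tokens, lexicon_size) boolean one-hot array for a random window."""
    if len(index_list) < num_tokens:
        raise ValueError("not enough recognized tokens in this sample")
    # randint is inclusive on both ends, so the last valid start is len - num_tokens.
    start = random.randint(0, len(index_list) - num_tokens)
    window = index_list[start:start + num_tokens]
    # Allocate only the window itself and set one True per row.
    out = np.zeros((num_tokens, lexicon_size), dtype=bool)
    out[np.arange(num_tokens), window] = True
    return out

toy_indices = [4, 0, 2, 2, 1, 3]   # hypothetical token indices for one document
x = one_hot_window(toy_indices, num_tokens=3, lexicon_size=5)
print(x.shape)  # (3, 5)
```

Keeping this step in NumPy avoids adding new ops to the TensorFlow graph on every batch, at the cost of allocating the boolean array directly.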

In [16]:

# def choose_batch(df, n = batch_size):
#     '''
#     Input:
#     df - DataFrame containing both input text ('Text') and labels ('Class').
#     n - batch size
#     Outputs:
#     X - (batch_size, num_words, lexicon_size) tf.Tensor of one-hot vectors corresponding to words in sample
#     y - (batch_size, num_classes) tf.Tensor of one-hot vectors corresponding to correct class
#     '''
#     # first we sample from all entries at random
#     sample = df.sample(n)
#     # now we extract out rows
#     word_vecs = []
#     X, Y = [], []
#     for index, row in sample.iterrows():
#         text, target = row.loc['Text'], row.loc['Class']
#         for word in word_tokenize(text):
#             word_vecs.append(eye_word[word2index[word]])
#         x = np.array(word_vecs)
#         X.append(x), Y.append(eye_class[target])
#     # return as NumPy arrays for feed dict
#     # X is (batch_size, num_words, lexicon_size) and Y is (batch_size, num_classes)
#     return np.array(X), np.array(Y)

In [17]:
# # placeholders

# # accepts one-hot representation of sample sentences
# # batch_X = tf.placeholder(tf.float32, shape = (batch_size, *(list(X.shape)[1:])), name = 'batch_X')
# batch_X = tf.placeholder(tf.float32, shape = (batch_size, None, lexicon_size), name = 'batch_X')

# # accepts one-hot representation of correct class
# # batch_Y = tf.placeholder(tf.int32, shape = (batch_size, ), name = 'batch_Y')
# batch_Y = tf.placeholder(tf.int32, shape = (batch_size, num_classes), name = 'batch_Y')

In [18]:
batch_X = tf.placeholder(tf.float32, shape = (batch_size, consec_token_length, lexicon_size), name = 'batch_X')
batch_Y = tf.placeholder(tf.int8, shape = (batch_size, num_classes), name = 'batch_Y')
# batch_Y = tf.placeholder(tf.bool, shape = (batch_size,), name = 'batch_Y') #for sparse_softmax optimizer

In [19]:
# word representation
embeddings = tf.Variable(np.random.random(size = (lexicon_size, representation_dim)), \
                         dtype = tf.float32, name = 'word_embeddings')

In [20]:
def batch_mult(a_batch, b, axis = 0):
    '''Given a batched rank-(R+1) tf.Tensor a_batch and rank-R tf.Tensor b, return a rank-(R+1) tf.matmul product.'''
    a_list = tf.unstack(a_batch, axis = axis)
    prod_list = [tf.matmul(a, b) for a in a_list]
    return tf.stack(prod_list, axis = axis)

In [21]:
# X_embed = tf.matmul(batch_X, embeddings, name = 'batch_embeddings')
X_embed = batch_mult(a_batch = batch_X, b = embeddings)
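
Multiplying one-hot vectors by the embedding matrix simply selects rows of that matrix. The NumPy sketch below, with made-up toy sizes, checks that equivalence; in TensorFlow the row-selection form corresponds to `tf.nn.embedding_lookup`, which would take integer token ids rather than one-hot inputs.

```python
import numpy as np

rng = np.random.RandomState(0)
lexicon_size, representation_dim = 6, 3          # toy sizes
embeddings = rng.random_sample((lexicon_size, representation_dim))

token_ids = np.array([[2, 5, 0], [1, 1, 4]])     # (batch_size, num_tokens)
one_hot = np.eye(lexicon_size)[token_ids]        # (batch_size, num_tokens, lexicon_size)

via_matmul = one_hot @ embeddings                # what batch_mult computes, per batch element
via_lookup = embeddings[token_ids]               # plain row indexing

print(np.allclose(via_matmul, via_lookup))  # True
```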

In [22]:
# make "master" state, with all layers, both c and h states, for each of the batches, and each of the states
lstm_state = tf.Variable(initial_value=tf.zeros((num_layers, 2, batch_size, state_size)))
# now divide it down to a list of 2-tuples of (batch_size, state_size) to feed into cell one at a time
state_per_layer_list = tf.unstack(lstm_state, axis = 0)
# rnn_tuple_state = tuple([state_per_layer_list[i][0], state_per_layer_list[i][1]] for i in range(num_layers))
rnn_tuple_state = tuple(tf.nn.rnn_cell.LSTMStateTuple(state_per_layer_list[i][0], state_per_layer_list[i][1]) for i in range(num_layers))
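
The shape bookkeeping for the "master" state can be followed with a NumPy stand-in. Here `StateTuple` is a hypothetical substitute for `tf.nn.rnn_cell.LSTMStateTuple`, which is likewise a `(c, h)` named tuple:

```python
from collections import namedtuple
import numpy as np

# Stand-in for tf.nn.rnn_cell.LSTMStateTuple, which is also a (c, h) named tuple.
StateTuple = namedtuple("StateTuple", ["c", "h"])

num_layers, batch_size, state_size = 3, 8, 20
master_state = np.zeros((num_layers, 2, batch_size, state_size))

# Axis 0 indexes the layer; axis 1 holds the cell state (c) then the hidden state (h).
per_layer = tuple(StateTuple(c=layer[0], h=layer[1]) for layer in master_state)

print(len(per_layer), per_layer[0].c.shape)  # 3 (8, 20)
```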


# form stacked (multi-layer) RNN from LSTM cells; dynamic_rnn below runs in the forward direction only

# fixing "ValueError: Trying to share variable rnn/multi_rnn_cell/cell_0/lstm_cell/kernel, but specified shape (20, 40) and found shape (60, 40)." 
# as per https://stackoverflow.com/questions/44615147/valueerror-trying-to-share-variable-rnn-multi-rnn-cell-cell-0-basic-lstm-cell-k

def lstm_cell():
    cell = tf.nn.rnn_cell.LSTMCell(state_size)
    # include output dropout as "ensembling" method
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.5)
    return cell
# stack LSTMs on top of each other
cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell() for _ in range(num_layers)])
cell_outputs, cell_states = tf.nn.dynamic_rnn(cell, X_embed, initial_state = rnn_tuple_state)

In [23]:
W = tf.Variable(np.random.rand(state_size, num_classes), dtype = tf.float32)
b = tf.Variable(np.zeros((1, num_classes)), dtype = tf.float32)

# perform final scoring and calculate loss function
# we multiply W with the 'h' states of our final LSTM layer
logits = tf.add(tf.matmul(cell_states[-1][1], W), b, name = 'logits')
# loss function/optimizer
loss_fun = tf.losses.softmax_cross_entropy(batch_Y, logits)

lr = 0.001 # TODO: schedule learning rate over number of epochs
optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss_fun)
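
A useful reference point when reading the loss values: with 10 classes, a model that predicts a uniform distribution has a cross-entropy of ln(10) ≈ 2.30. The NumPy sketch below recomputes that baseline; it is my own reimplementation of a mean softmax cross-entropy, not TensorFlow's code:

```python
import numpy as np

def softmax_cross_entropy(one_hot_labels, logits):
    """Mean cross-entropy between one-hot labels and softmax(logits), computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

num_classes, batch_size = 10, 8
labels = np.eye(num_classes)[np.arange(batch_size) % num_classes]
uniform_logits = np.zeros((batch_size, num_classes))

print(softmax_cross_entropy(labels, uniform_logits))
print(np.log(num_classes))
```

Both printed values should agree, so losses that hover around 2.3 to 2.4 indicate predictions that are no better than chance.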

In [None]:


# # summarizing for Tensorboard

# with tf.name_scope('summaries'):
#     tf.summary.scalar(loss_fun.name, loss_fun)
#     summarizer = tf.summary.merge_all()

#     logdir = os.path.join(os.getcwd(), 'logs_train')

#     train_writer = tf.summary.FileWriter(logdir, graph = sess.graph)

In [None]:
n_iter = 101 # this one is beefy...

init = tf.global_variables_initializer()

sess.run(init)

# ugly code... clean up...

for i in range(n_iter):
    
    with tf.name_scope("train"):
        # batch inputs/outputs
        X, Y = choose_batch(train_df, n = batch_size, num_tokens = consec_token_length)
        feeds = {batch_X: X, batch_Y: Y}
        _, loss = sess.run([optimizer, loss_fun], feed_dict = feeds)
#         _, loss, summary = sess.run([optimizer, loss_fun, summarizer], feed_dict = feeds)
#         train_writer.add_summary(summary, i)
        if i % 1 == 0:
            print("Training Loss at Iteration {0}: {1}".format(i, loss))
            
    with tf.name_scope("dev"):
        X, Y = choose_batch(dev_df, n = batch_size, num_tokens = consec_token_length)
        feeds = {batch_X: X, batch_Y: Y}
        loss = sess.run(loss_fun, feed_dict = feeds)
#         loss, summary = sess.run([loss_fun, summarizer], feed_dict = feeds)
        if i % 1 == 0:
            print("Testing Loss at Iteration {0}: {1}".format(i, loss))

Training Loss at Iteration 0: 2.2990963459014893
Testing Loss at Iteration 0: 2.3722589015960693
Training Loss at Iteration 1: 2.4208621978759766
Testing Loss at Iteration 1: 2.432000160217285
Training Loss at Iteration 2: 2.414440631866455
Testing Loss at Iteration 2: 2.369863271713257
Training Loss at Iteration 3: 2.367913246154785
Testing Loss at Iteration 3: 2.3556129932403564
Training Loss at Iteration 4: 2.383305549621582
Testing Loss at Iteration 4: 2.4195165634155273
Training Loss at Iteration 5: 2.3732433319091797
Testing Loss at Iteration 5: 2.3215577602386475
Training Loss at Iteration 6: 2.3476362228393555
Testing Loss at Iteration 6: 2.3816590309143066
Training Loss at Iteration 7: 2.3711276054382324
Testing Loss at Iteration 7: 2.3617382049560547
Training Loss at Iteration 8: 2.369445562362671
Testing Loss at Iteration 8: 2.332627773284912
Training Loss at Iteration 9: 2.3643910884857178
Testing Loss at Iteration 9: 2.219759464263916
Training Loss at Iteration 10: 2.40820