## **Most of the code is self-explanatory; where it is not, I will add comments and notes wherever possible.**

# Text Data and LSTM 

The following is the high-level data flow used in i-Tagger:
1. Preprocessing the data
2. Creating the vocab 
    - Word/Tokens vocab which includes **word -> id** & **id -> word**
    - Char vocab which includes **char -> id** & **id -> char**
3. Preparing the features
    - Padding words/characters with a custom function (or) 
    - Using Tensorflow APIs
4. Model
    - Word Embeddings
    - Word Level BiLSTM encoding ignoring the padded words
    - Character Embedding
    - Char Level BiLSTM encoding ignoring the padded characters
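
The first three steps of this flow can be sketched in plain Python before any TensorFlow is involved. This is only a toy illustration: `build_vocab` and `encode` are hypothetical helper names, not functions from i-Tagger, and the ids they produce are not the ones printed later in this notebook.

```python
# Toy sketch of steps 1-3; helper names are made up for illustration.
PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(docs):
    """Map each token to an id, reserving 0 for padding and 1 for unknowns."""
    vocab = {PAD: 0, UNK: 1}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(doc, vocab, max_len):
    """Convert a document into a fixed-length list of word ids."""
    ids = [vocab.get(token, vocab[UNK]) for token in doc.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

docs = ["Some title", "A longer title"]
word2id = build_vocab(docs)                      # word -> id
id2word = {i: w for w, i in word2id.items()}     # id -> word
features = [encode(d, word2id, max_len=4) for d in docs]
print(features)
```

The padded id lists are what the embedding and BiLSTM layers in step 4 consume.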


## You can't initialize graph components twice; if you encounter an error, restart the notebook

In [1]:
import tensorflow as tf
from tensorflow.contrib import lookup
from tensorflow.python.platform import gfile
import numpy as np
from tqdm import tqdm

tf.reset_default_graph()




### Constants


In [2]:


# Normally this is set to the mean number of words per document in the dataset
MAX_DOCUMENT_LENGTH = 7

# Padding word that is used when a document has fewer words than MAX_DOCUMENT_LENGTH
PAD_WORD = '<PAD>'
PAD_WORD_ID = 0

PAD_CHAR = "<P>"
PAD_CHAR_ID = 0

UNKNOWN_WORD = "<UNKNOWN>"
UNKNOWN_WORD_ID = 1

UNKNOWN_CHAR = "<U>"
UNKNOWN_CHAR_ID = 1

EMBEDDING_SIZE = 3
WORD_EMBEDDING_SIZE = 3
CHAR_EMBEDDING_SIZE = 3
WORD_LEVEL_LSTM_HIDDEN_SIZE = 3
CHAR_LEVEL_LSTM_HIDDEN_SIZE = 3

### MISC

In [3]:
value = [0, 1, 2, 3, 4, 5, 6, 7]
init = tf.constant_initializer(value)

tf.reset_default_graph()
with tf.Session() :
    x = tf.get_variable('x', shape = [3, 4], initializer = init)
    x.initializer.run()
    print(x.eval())

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 7.  7.  7.  7.]]
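
Note the last row of the output: only eight values were supplied for a 3x4 variable, and the remaining entries were filled with the last value (7). A NumPy sketch of that filling rule, with the hypothetical helper name `fill_with_last_value` (this mimics the behaviour seen above, it is not TensorFlow's code):

```python
import numpy as np

def fill_with_last_value(values, shape):
    """Reuse the last supplied value for every missing entry of the target shape."""
    flat = np.asarray(values, dtype=np.float32).ravel()
    needed = int(np.prod(shape))
    if flat.size > needed:
        raise ValueError("more values than the target shape can hold")
    tail = np.full(needed - flat.size, flat[-1], dtype=np.float32)
    return np.concatenate([flat, tail]).reshape(shape)

x = fill_with_last_value([0, 1, 2, 3, 4, 5, 6, 7], shape=(3, 4))
print(x)
```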


In some places `tf.get_variable` is used without an initializer; in that case TensorFlow falls back to its default initializer (in TF 1.x this is `glorot_uniform_initializer`, unless the enclosing variable scope sets a different one).

In [4]:
tf.reset_default_graph()
with tf.Session() as sess:
    W = tf.get_variable("W", dtype=tf.float32, shape=[5,5])
    b = tf.get_variable("b", shape=[12],  dtype=tf.float32, initializer=tf.zeros_initializer())
    sess.run(tf.global_variables_initializer())
    print(sess.run(W))
    print(b.eval())

[[ 0.38060057  0.50141478  0.40662122  0.72093308 -0.29107267]
 [-0.2623179   0.30131912 -0.55878854  0.63164377 -0.46682838]
 [ 0.75094438  0.61107099 -0.59864926  0.54592323 -0.48808187]
 [-0.43495807 -0.13385278  0.19299984  0.512218    0.28124976]
 [ 0.29379117 -0.02249491  0.35649407  0.51151133 -0.64783204]]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
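
As far as I know, the default that TF 1.x picks for a float variable without an initializer is Glorot (Xavier) uniform, which samples from `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`. A pure-Python sketch under that assumption (`glorot_uniform` here is my own helper, not the TensorFlow function):

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out matrix uniformly from [-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    matrix = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
              for _ in range(fan_in)]
    return matrix, limit

W_sketch, limit = glorot_uniform(5, 5)
print("limit:", round(limit, 4))
```

For the 5x5 variable `W` the limit is about 0.7746, and every value printed for `W` above falls inside that range.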


**Dynamic Sequence Lengths**

In [5]:
#TODO make this independent of dimensions
def get_sequence_length(sequence_ids, pad_word_id=0, axis=1):
    '''
    Returns the sequence length, dropping all the padded tokens if the sequence is padded
    
    :param sequence_ids: Tensor(shape=[batch_size, doc_length])
    :param pad_word_id: 0 is default
    :return: Array of Document lengths of size batch_size
    '''
    flag = tf.greater(sequence_ids, pad_word_id)
    used = tf.cast(flag, tf.int32)
    length = tf.reduce_sum(used, axis)
    length = tf.cast(length, tf.int32)
    return length

In [6]:
# MAX_DOC_LENGTHS = 4
# rand_array = np.random.randint(1,MAX_DOC_LENGTHS, size=(3,5,4))

# Zeros (PAD_WORD_ID) are treated as padding
rand_array = np.array([[ 2,  0,  0,  0,  0,  0],
 [ 3,  4,  0,  0,  0,  0],
 [ 5,  6,  4,  0,  0,  0],
 [ 7,  8,  6,  4,  0,  0],
 [ 9, 10,  6, 11, 12, 13],
 [ 0,  0,  0,  0,  0,  0]])

rand_array


array([[ 2,  0,  0,  0,  0,  0],
       [ 3,  4,  0,  0,  0,  0],
       [ 5,  6,  4,  0,  0,  0],
       [ 7,  8,  6,  4,  0,  0],
       [ 9, 10,  6, 11, 12, 13],
       [ 0,  0,  0,  0,  0,  0]])

In [7]:
with tf.Session() as sess:
        length = get_sequence_length(rand_array, axis=1, pad_word_id=PAD_WORD_ID)
        print("Get dynamic sequence lengths: ", sess.run(length))

Get dynamic sequence lengths:  [1 2 3 4 6 0]
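
The same computation in NumPy, handy for checking the TensorFlow result outside a session. `sequence_length_np` is a hypothetical name. Like the graph version, it counts every id greater than the pad id, so it assumes padding only appears at the end of a row.

```python
import numpy as np

def sequence_length_np(sequence_ids, pad_word_id=0, axis=1):
    """Count the non-padded ids along `axis`."""
    ids = np.asarray(sequence_ids)
    return (ids > pad_word_id).sum(axis=axis).astype(np.int32)

batch = np.array([[2, 0, 0, 0, 0, 0],
                  [3, 4, 0, 0, 0, 0],
                  [5, 6, 4, 0, 0, 0],
                  [7, 8, 6, 4, 0, 0],
                  [9, 10, 6, 11, 12, 13],
                  [0, 0, 0, 0, 0, 0]])
lengths = sequence_length_np(batch)
print(lengths)
```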


In [8]:
# data = np.random.randint(1,6, size=(3,MAX_DOCUMENT_LENGTH,4))

# print(data)
# with tf.Session() as sess:
#         length = get_sequence_length(data, axis=1)
#         print("Get dynamic sequence lengths: ", sess.run(length))
    

In [9]:
def _pad_sequences(sequences, pad_tok, max_length):
    """
    Args:
        sequences: a generator of list or tuple
        pad_tok: the token to pad with
        max_length: every sequence is truncated or padded to this length

    Returns:
        a tuple (padded sequences, original lengths capped at max_length)
    """
    sequence_padded, sequence_length = [], []

    for seq in sequences:
        seq = list(seq)
        seq_ = seq[:max_length] + [pad_tok]*max(max_length - len(seq), 0)
        sequence_padded +=  [seq_]
        sequence_length += [min(len(seq), max_length)]

    return sequence_padded, sequence_length


def pad_sequences(sequences, pad_tok, nlevels, MAX_WORD_LENGTH=6):
    """
    Args:
        sequences: a generator of list or tuple
        pad_tok: the char to pad with
        nlevels: "depth" of padding, for the case where we have characters ids

    Returns:
        a list of list where each sublist has same length

    """
    if nlevels == 1:
        sequence_padded = []
        sequence_length = []
        max_length = max(map(lambda x : len(x.split(" ")), sequences))
        # sequence_padded, sequence_length = _pad_sequences(sequences,
        #                                                   pad_tok, max_length)
        #breaking the code to pad the string instead on its ids
        for seq in sequences:
            current_length = len(seq.split(" "))
            diff = max_length - current_length
            pad_data = pad_tok * diff
            sequence_padded.append(seq + pad_data)
            sequence_length.append(max_length) #assumed

        # print_info(sequence_length)
    elif nlevels == 2:
        # max_length_word = max([max(map(lambda x: len(x), seq))
        #                        for seq in sequences])
        sequence_padded, sequence_length = [], []
        for seq in tqdm(sequences):
            # all words are same length now
            sp, sl = _pad_sequences(seq, pad_tok, MAX_WORD_LENGTH)
            sequence_padded += [sp]
            sequence_length += [sl]

        max_length_sentence = max(map(lambda x : len(x), sequences))
        sequence_padded, _ = _pad_sequences(sequence_padded,
                                            [pad_tok]*MAX_WORD_LENGTH,
                                            max_length_sentence) #TODO revert -1 to pad_tok
        sequence_length, _ = _pad_sequences(sequence_length, 0,
                                            max_length_sentence)

    return sequence_padded, sequence_length
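
For the `nlevels == 2` case, the idea is: pad every word to a fixed number of characters, then pad every document with all-pad words. A stripped-down sketch of just that idea, with hypothetical helper names (`pad_to`, `pad_char_ids`) and without the length bookkeeping or `tqdm`:

```python
def pad_to(seq, length, pad):
    """Truncate or right-pad a flat sequence to exactly `length` items."""
    seq = list(seq)[:length]
    return seq + [pad] * (length - len(seq))

def pad_char_ids(docs, max_words, max_chars, pad_id=0):
    """Pad documents of char-id lists to shape [len(docs), max_words, max_chars]."""
    padded_docs = []
    for doc in docs:
        words = [pad_to(word, max_chars, pad_id) for word in doc][:max_words]
        # Build a fresh all-pad word each time to avoid aliasing one list.
        words += [[pad_id] * max_chars for _ in range(max_words - len(words))]
        padded_docs.append(words)
    return padded_docs

docs = [[[7, 16]], [[2], [3, 4, 5, 6]], []]
padded = pad_char_ids(docs, max_words=2, max_chars=3)
print(padded)
```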

## 1. Data Cleaning / Preprocessing

In [10]:
# Assume each line is a document
lines = ['Simple',
         'Some title', 
         'A longer title', 
         'An even longer title', 
         'This is longer than doc length isnt',
          '']

## 2. Extracting Vocab from the Corpus

In [11]:
! rm vocab_test.tsv

In [12]:
tf.reset_default_graph()


print ('TensorFlow Version: ', tf.__version__)


# Create vocabulary
# min_frequency -> keep a word only if it occurs at least the given number of times
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH, 
                                                                     min_frequency=0)
vocab_processor.fit(lines)

word_vocab = []

#Create a file and store the words
with gfile.Open('vocab_test.tsv', 'wb') as f:
    f.write("{}\n".format(PAD_WORD))
    word_vocab.append(PAD_WORD)
    for word, index in vocab_processor.vocabulary_._mapping.items():
        word_vocab.append(word)
        f.write("{}\n".format(word))
        
VOCAB_SIZE = len(vocab_processor.vocabulary_) + 1
print ('{} words into vocab.tsv'.format(VOCAB_SIZE))


id_2_word = {id:word for id, word in enumerate(word_vocab)}
id_2_word

TensorFlow Version:  1.4.0
15 words into vocab.tsv


{0: '<PAD>',
 1: '<UNK>',
 2: 'Simple',
 3: 'Some',
 4: 'title',
 5: 'A',
 6: 'longer',
 7: 'An',
 8: 'even',
 9: 'This',
 10: 'is',
 11: 'than',
 12: 'doc',
 13: 'length',
 14: 'isnt'}

In [13]:
! cat vocab_test.tsv

<PAD>
<UNK>
Simple
Some
title
A
longer
An
even
This
is
than
doc
length
isnt


In [14]:
def naive_vocab_creater(lines, out_file_name):
    final_vocab = ["<PAD>", "<UNK>"]
    vocab = [word for line in lines for word in line.split(" ")]
    vocab = set(vocab)

    try:
        vocab.remove("<UNK>")
    except KeyError:
        print("No <UNK> token found")

    vocab = list(vocab)
    final_vocab.extend(vocab)
    return final_vocab
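
Because `naive_vocab_creater` goes through a `set`, the word order (and therefore the ids) can change between runs. If stable ids matter, a first-seen-order variant is one option. This is only a sketch; `ordered_vocab` is a hypothetical name and is not used elsewhere in this notebook.

```python
def ordered_vocab(lines, specials=("<PAD>", "<UNK>")):
    """Build a vocab list: special tokens first, then words in first-seen order."""
    vocab = list(specials)
    seen = set(vocab)
    for line in lines:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

sample_lines = ['Simple',
                'Some title',
                'A longer title',
                'An even longer title',
                'This is longer than doc length isnt',
                '']
sample_vocab = ordered_vocab(sample_lines)
print(len(sample_vocab), sample_vocab)
```

On the sample lines this gives the same word order as the `id_2_word` mapping shown earlier.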

In [15]:
def get_char_vocab(words_vocab):
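    '''
    Collects the unique characters that occur in the vocabulary words.

    :param words_vocab: List of words
    :return: Sorted list of unique characters. Note that `words_chars_vocab`
             (with '<P>' and '<U>') is built but not returned, so ids 0 and 1
             are not reserved for padding/unknown characters.
    '''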
    words_chars_vocab = ['<P>', '<U>']
    chars = set()
    for word in words_vocab:
        for char in word:
            chars.add(str(char))
    words_chars_vocab.extend(chars)
    return sorted(chars)


chars = get_char_vocab(word_vocab)
# Create char2id map
char_2_id_map = {c:i for i,c in enumerate(chars)}

CHAR_VOCAB_SIZE = len(char_2_id_map)
char_2_id_map

{'<': 0,
 '>': 1,
 'A': 2,
 'D': 3,
 'K': 4,
 'N': 5,
 'P': 6,
 'S': 7,
 'T': 8,
 'U': 9,
 'a': 10,
 'c': 11,
 'd': 12,
 'e': 13,
 'g': 14,
 'h': 15,
 'i': 16,
 'l': 17,
 'm': 18,
 'n': 19,
 'o': 20,
 'p': 21,
 'r': 22,
 's': 23,
 't': 24,
 'v': 25}
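
In the map above, id 0 belongs to the character `<`, while padding also uses 0. That is harmless for the sample lines (none of them contains `<`), but in general the pad and unknown ids should be reserved. A sketch of one way to do that, with the hypothetical helper names `char_vocab_with_specials` and `encode_word`; the ids it produces differ from the ones used in the rest of this notebook.

```python
def char_vocab_with_specials(words, pad="<P>", unk="<U>"):
    """Return a char -> id map with id 0 reserved for pad and id 1 for unknown."""
    chars = sorted({c for word in words for c in word})
    return {tok: i for i, tok in enumerate([pad, unk] + chars)}

def encode_word(word, char2id, unk="<U>"):
    """Map each character to its id, falling back to the unknown id."""
    return [char2id.get(c, char2id[unk]) for c in word]

char2id = char_vocab_with_specials(["Some", "title"])
print(char2id)
print(encode_word("Some", char2id), encode_word("Sox", char2id))
```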

## 3. Preparing the features

In [16]:
list_char_ids = []
char_ids_feature2 = []

for line in lines:
    for word in line.split():
        word_2_char_ids = [char_2_id_map.get(c, 0) for c in word]
        list_char_ids.append(word_2_char_ids)
    char_ids_feature2.append(list_char_ids)
    list_char_ids = []

In [17]:
char_ids_feature2

[[[7, 16, 18, 21, 17, 13]],
 [[7, 20, 18, 13], [24, 16, 24, 17, 13]],
 [[2], [17, 20, 19, 14, 13, 22], [24, 16, 24, 17, 13]],
 [[2, 19], [13, 25, 13, 19], [17, 20, 19, 14, 13, 22], [24, 16, 24, 17, 13]],
 [[8, 15, 16, 23],
  [16, 23],
  [17, 20, 19, 14, 13, 22],
  [24, 15, 10, 19],
  [12, 20, 11],
  [17, 13, 19, 14, 24, 15],
  [16, 23, 19, 24]],
 []]

In [18]:
char_ids_feature2, char_seq_length = pad_sequences(char_ids_feature2, nlevels=2, pad_tok=0)
char_ids_feature2 = np.array(char_ids_feature2)
char_ids_feature2.shape

100%|██████████| 6/6 [00:00<00:00, 34192.70it/s]


(6, 7, 6)

In [19]:
char_ids_feature2

array([[[ 7, 16, 18, 21, 17, 13],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 7, 20, 18, 13,  0,  0],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 2,  0,  0,  0,  0,  0],
        [17, 20, 19, 14, 13, 22],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 2, 19,  0,  0,  0,  0],
        [13, 25, 13, 19,  0,  0],
        [17, 20, 19, 14, 13, 22],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 8, 15, 16, 23,  0,  0],
      

In [20]:
print("Character IDs shape: ", char_ids_feature2.shape)
print("Number of sentences: ", len(lines))
print("MAX_DOC_LENGTH: ", max([len(line.split()) for line in lines]))
print("MAX_WORD_LENGTH: ", char_ids_feature2.shape[2])

Character IDs shape:  (6, 7, 6)
Number of sentences:  6
MAX_DOC_LENGTH:  7
MAX_WORD_LENGTH:  6


## 4. Model

In [21]:
tf.reset_default_graph()

#### Reference Link: https://www.tensorflow.org/api_docs/python/tf/contrib/lookup/index_table_from_file

In [22]:
id_2_word

{0: '<PAD>',
 1: '<UNK>',
 2: 'Simple',
 3: 'Some',
 4: 'title',
 5: 'A',
 6: 'longer',
 7: 'An',
 8: 'even',
 9: 'This',
 10: 'is',
 11: 'than',
 12: 'doc',
 13: 'length',
 14: 'isnt'}

In [23]:
# can use the vocabulary to convert words to numbers
table = lookup.index_table_from_file(
  vocabulary_file='vocab_test.tsv', 
    num_oov_buckets=0, 
    vocab_size=None, 
    default_value=UNKNOWN_WORD_ID) # out-of-vocabulary words map to <UNK> (id 1); <PAD> is id 0

In [24]:
# string operations
# Array of Docs -> Split it into Tokens/words 
#               -> Convert it into a Dense Tensor, appending PAD_WORD
#               -> Table lookup 
#               -> Slice it to MAX_DOCUMENT_LENGTH
data = tf.constant(lines)
words = tf.string_split(data)

densewords = tf.sparse_tensor_to_dense(words, default_value=PAD_WORD)
word_ids = table.lookup(densewords)
print(word_ids)


## The following extra steps are already handled above: 'sparse_tensor_to_dense'
## pads with PAD_WORD, and 'table.lookup' maps PAD_WORD to id 0
# now pad out with zeros and then slice to constant length
# padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
# # this takes care of documents with zero length also
# padded = tf.pad(word_ids, padding)

# if you want to clip documents to the maximum size, it can be done here!
# sliced = tf.slice(word_ids, [0,0], [-1, MAX_DOCUMENT_LENGTH])

Tensor("hash_table_Lookup:0", shape=(?, ?), dtype=int64)


In [25]:

word2ids = table.lookup(tf.constant(lines[1].split()))
word2ids_1 = table.lookup(tf.constant("Some unknown title".split()))


with tf.Session() as sess:
    # Tables need to be initialized before they are used
    tf.tables_initializer().run()
    print ("{} --> {}".format(lines[1], word2ids.eval()))
    print ("{} --> {}".format("Some unknown title", word2ids_1.eval()))
    print(sess.run(densewords))


Some title --> [3 4]
Some unknown title --> [3 1 4]
[[b'Simple' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>']
 [b'Some' b'title' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>']
 [b'A' b'longer' b'title' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>']
 [b'An' b'even' b'longer' b'title' b'<PAD>' b'<PAD>' b'<PAD>']
 [b'This' b'is' b'longer' b'than' b'doc' b'length' b'isnt']
 [b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>' b'<PAD>']]
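
Conceptually the table lookup behaves like a dictionary lookup with a default value for out-of-vocabulary words. A plain-Python sketch of that behaviour (it is not how TensorFlow implements the table; `lookup` and the shortened vocab are for illustration only):

```python
vocab = ["<PAD>", "<UNK>", "Simple", "Some", "title"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def lookup(tokens, default_id=1):
    """Return the id of each token, or default_id for unseen tokens."""
    return [word_to_id.get(token, default_id) for token in tokens]

print(lookup("Some title".split()))
print(lookup("Some unknown title".split()))
```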


In [26]:
seq_length= get_sequence_length(word_ids)

In [27]:
with tf.Session() as sess:
    # Tables need to be initialized before they are used
    tf.tables_initializer().run()
    print(sess.run(word_ids))
    print("==========================")
    print(sess.run(seq_length))

[[ 2  0  0  0  0  0  0]
 [ 3  4  0  0  0  0  0]
 [ 5  6  4  0  0  0  0]
 [ 7  8  6  4  0  0  0]
 [ 9 10  6 11 12 13 14]
 [ 0  0  0  0  0  0  0]]
[1 2 3 4 7 0]


### Embed Layer

In [28]:
with tf.device('/cpu:0'), tf.name_scope("embed-layer"):  

    # layer to take the words and convert them into vectors (embeddings)
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into
    # [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE].
    word_embeddings = tf.contrib.layers.embed_sequence(word_ids,
                                              vocab_size=VOCAB_SIZE,
                                              embed_dim=WORD_EMBEDDING_SIZE,
                                                   initializer=tf.contrib.layers.xavier_initializer(
                                                                   seed=42))

In [29]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Tables need to be initialized before they are used
    tf.tables_initializer().run()
    print("word_embeddings : [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE] ", sess.run(word_embeddings).shape)
    print("<===============================================>")
    print(sess.run(word_embeddings))


word_embeddings : [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE]  (6, 7, 3)
[[[-0.36255243 -0.44535902 -0.18911475]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]]

 [[ 0.2578851  -0.3242403   0.41261792]
  [ 0.37403101  0.11017311 -0.57392919]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]]

 [[-0.29184508  0.00701374 -0.15982357]
  [-0.52557528  0.54521036  0.37919033]
  [ 0.37403101  0.11017311 -0.57392919]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]
  [ 0.52223814  0.20485282  0.34100413]]

 [[-0.09862986  0.11739373 -0.18522915]
  [ 0.08441174 -0.30454588  0.20182
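
An embedding lookup is just row indexing into a `[vocab_size, embed_dim]` matrix, which is why every padded position in the output above carries the same vector (the row for id 0). A NumPy sketch with a random table standing in for the Xavier-initialized one (the values are not meant to match the output above):

```python
import numpy as np

rng = np.random.RandomState(42)
# Stand-in embedding matrix: 15 words, 3 dimensions each.
embedding_table = rng.uniform(-1.0, 1.0, size=(15, 3)).astype(np.float32)

toy_word_ids = np.array([[2, 0, 0],
                         [3, 4, 0]])
# Fancy indexing picks one table row per id: [batch, doc_len] -> [batch, doc_len, 3]
embedded = embedding_table[toy_word_ids]
print(embedded.shape)
```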

In [30]:
with  tf.name_scope("word_level_lstm_layer"):
    # Create an LSTM cell with hidden size WORD_LEVEL_LSTM_HIDDEN_SIZE.
    d_rnn_cell_fw_one = tf.nn.rnn_cell.LSTMCell(WORD_LEVEL_LSTM_HIDDEN_SIZE,
                                                state_is_tuple=True)
    #https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMStateTuple
    d_rnn_cell_bw_one = tf.nn.rnn_cell.LSTMCell(WORD_LEVEL_LSTM_HIDDEN_SIZE,
                                                state_is_tuple=True)

    (fw_output_one, bw_output_one), output_states = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=d_rnn_cell_fw_one,
        cell_bw=d_rnn_cell_bw_one,
        dtype=tf.float32,
        sequence_length=seq_length,
        inputs=word_embeddings,
        scope="encod_sentence")

    # [BATCH_SIZE, MAX_SEQ_LENGTH, 2*WORD_LEVEL_LSTM_HIDDEN_SIZE] TODO check MAX_SEQ_LENGTH?
    encoded_sentence = tf.concat([fw_output_one,
                                  bw_output_one], axis=-1)

    tf.logging.info('encoded_sentence =====> {}'.format(encoded_sentence))

INFO:tensorflow:encoded_sentence =====> Tensor("word_level_lstm_layer/concat:0", shape=(?, ?, 6), dtype=float32)


In [31]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Tables need to be initialized before they are used
    tf.tables_initializer().run()
    
    tf.logging.info('fw_output_one \t=====> {}'.format(fw_output_one.get_shape()))
    tf.logging.info('bw_output_one \t=====> {}'.format(bw_output_one.get_shape()))
    tf.logging.info("=================================================================")
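    # Each state is an LSTMStateTuple (c, h): index 0 is the cell state c, index 1 is the output state h
    tf.logging.info('forward hidden state \t=====> {}'.format(output_states[0][0].get_shape()))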
    tf.logging.info('forward out state \t=====> {}'.format(output_states[0][1].get_shape()))
    tf.logging.info('backward hidden state \t=====> {}'.format(output_states[1][0].get_shape()))
    tf.logging.info('backward out state \t=====> {}'.format(output_states[1][1].get_shape()))
    tf.logging.info("=================================================================")
    tf.logging.info('encoded_sentence \t=====> {}'.format(encoded_sentence.get_shape()))
    tf.logging.info("=================================================================")
    encoded_senence_out =  encoded_sentence.eval()
    # check for zeros in the encoded sentence, where it omits padded words
    tf.logging.info('encoded_senence_out \t=====> {}'.format(encoded_senence_out.shape))
    tf.logging.info("=================================================================")
    print("encoded_senence_out:\n" , encoded_senence_out)
    

INFO:tensorflow:fw_output_one 	=====> (?, ?, 3)
INFO:tensorflow:bw_output_one 	=====> (?, ?, 3)
INFO:tensorflow:forward hidden state 	=====> (?, 3)
INFO:tensorflow:forward out state 	=====> (?, 3)
INFO:tensorflow:backward hidden state 	=====> (?, 3)
INFO:tensorflow:backward out state 	=====> (?, 3)
INFO:tensorflow:encoded_sentence 	=====> (?, ?, 6)
INFO:tensorflow:encoded_senence_out 	=====> (6, 7, 6)
encoded_senence_out:
 [[[-0.01976291 -0.00923283  0.07085376  0.01306459  0.03344805 -0.04371169]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]]

 [[-0.04772531  0.11881609 -0.03189922 
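
The zero rows above come from `sequence_length`: time steps at or beyond a sequence's length produce all-zero outputs. The effect on the concatenated forward/backward outputs can be imitated in NumPy. This is a sketch of the masking only, with made-up constant "outputs"; `zero_past_length` is a hypothetical helper.

```python
import numpy as np

def zero_past_length(outputs, lengths):
    """Zero every time step at or beyond each sequence's length.

    outputs: [batch, max_time, hidden], lengths: [batch]
    """
    max_time = outputs.shape[1]
    mask = np.arange(max_time)[None, :] < np.asarray(lengths)[:, None]
    return outputs * mask[:, :, None]

fw = np.ones((2, 4, 3), dtype=np.float32)        # pretend forward outputs
bw = np.full((2, 4, 3), 2.0, dtype=np.float32)   # pretend backward outputs
lengths = [1, 3]
encoded = np.concatenate([zero_past_length(fw, lengths),
                          zero_past_length(bw, lengths)], axis=-1)
print(encoded.shape)
```

The last axis doubles from 3 to 6 because the two directions are concatenated.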

In [32]:
print(char_ids_feature2.shape)
char_ids_feature2

(6, 7, 6)


array([[[ 7, 16, 18, 21, 17, 13],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 7, 20, 18, 13,  0,  0],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 2,  0,  0,  0,  0,  0],
        [17, 20, 19, 14, 13, 22],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 2, 19,  0,  0,  0,  0],
        [13, 25, 13, 19,  0,  0],
        [17, 20, 19, 14, 13, 22],
        [24, 16, 24, 17, 13,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0]],

       [[ 8, 15, 16, 23,  0,  0],
      

In [33]:
with tf.variable_scope("char_embed_layer"):
    
        char_ids = tf.convert_to_tensor(char_ids_feature2, np.int64)
        s = tf.shape(char_ids)
        # flatten [batch, doc_length, word_length] into [batch * doc_length, word_length]
        char_ids_reshaped = tf.reshape(char_ids, shape=(s[0] * s[1], s[2])) # s[2] (6 here) is the char dim
        
        char_embeddings = tf.contrib.layers.embed_sequence(char_ids,
                                                           vocab_size=CHAR_VOCAB_SIZE,
                                                           embed_dim=CHAR_EMBEDDING_SIZE,
                                                           initializer=tf.contrib.layers.xavier_initializer(
                                                               seed=42))
        word_lengths = get_sequence_length(char_ids_reshaped)

        #[BATCH_SIZE, MAX_SEQ_LENGTH, MAX_WORD_LENGTH, CHAR_EMBEDDING_SIZE]
        tf.logging.info('char_ids_reshaped =====> {}'.format(char_ids_reshaped))
        tf.logging.info('char_embeddings =====> {}'.format(char_embeddings))

INFO:tensorflow:char_ids_reshaped =====> Tensor("char_embed_layer/Reshape:0", shape=(?, ?), dtype=int64)
INFO:tensorflow:char_embeddings =====> Tensor("char_embed_layer/EmbedSequence/embedding_lookup:0", shape=(6, 7, 6, 3), dtype=float32)
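
The reshape in the cell above folds the document axis into the batch axis so that every word becomes its own sequence of characters. A NumPy sketch of the same index arithmetic on a small made-up array:

```python
import numpy as np

n_docs, n_words, n_chars = 2, 3, 4
toy_char_ids = np.arange(n_docs * n_words * n_chars).reshape(n_docs, n_words, n_chars)

# [n_docs, n_words, n_chars] -> [n_docs * n_words, n_chars]
flat = toy_char_ids.reshape(-1, n_chars)

# Word `w` of document `d` ends up in row d * n_words + w.
d, w = 1, 2
print(flat.shape, flat[d * n_words + w], toy_char_ids[d, w])
```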


In [34]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Tables need to be initialized before they are used
    tf.tables_initializer().run()
    tf.logging.info('char_ids =====> {}'.format(char_ids_reshaped.get_shape()))
    tf.logging.info("=================================================================")
    res = char_ids_reshaped.eval()
    tf.logging.info('char_ids_reshaped shape =====> {}\n'.format(res.shape))
    tf.logging.info('char_ids_reshaped =====> {}\n'.format(res))
    tf.logging.info('word_lengths =====> {}\n'.format(word_lengths.eval()))
    tf.logging.info("=================================================================")
    tf.logging.info('char_embeddings =====> {}'.format(char_embeddings.shape))
    tf.logging.info("=================================================================")
    char_embeddings_out = char_embeddings.eval()
    print(char_embeddings_out.shape)
    tf.logging.info("=================================================================")
    char_embeddings_out = char_embeddings.eval()
    print("char_embeddings_out shape\n", char_embeddings_out.shape)
    print("char_embeddings_out \n", char_embeddings_out)

INFO:tensorflow:char_ids =====> (?, ?)
INFO:tensorflow:char_ids_reshaped shape =====> (42, 6)

INFO:tensorflow:char_ids_reshaped =====> [[ 7 16 18 21 17 13]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 7 20 18 13  0  0]
 [24 16 24 17 13  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 2  0  0  0  0  0]
 [17 20 19 14 13 22]
 [24 16 24 17 13  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 2 19  0  0  0  0]
 [13 25 13 19  0  0]
 [17 20 19 14 13 22]
 [24 16 24 17 13  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 8 15 16 23  0  0]
 [16 23  0  0  0  0]
 [17 20 19 14 13 22]
 [24 15 10 19  0  0]
 [12 20 11  0  0  0]
 [17 13 19 14 24 15]
 [16 23 19 24  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 

In [35]:
with tf.variable_scope("chars_level_bilstm_layer"):
        # put the time dimension on axis=1
        shape = tf.shape(char_embeddings)

        BATCH_SIZE = shape[0]
        MAX_DOC_LENGTH = shape[1]
        CHAR_MAX_LENGTH = shape[2]

        # [BATCH_SIZE, MAX_SEQ_LENGTH, MAX_WORD_LENGTH, CHAR_EMBEDDING_SIZE]  ===>
        #      [BATCH_SIZE * MAX_SEQ_LENGTH, MAX_WORD_LENGTH, CHAR_EMBEDDING_SIZE]
        char_embeddings_reshaped = tf.reshape(char_embeddings,
                                     shape=[BATCH_SIZE * MAX_DOC_LENGTH, CHAR_MAX_LENGTH,
                                            CHAR_EMBEDDING_SIZE],
                                     name="reduce_dimension_1")

        tf.logging.info('reshaped char_embeddings =====> {}'.format(char_embeddings))

        # word_lengths = get_sequence_length_old(char_embeddings) TODO working
        word_lengths = get_sequence_length(char_ids_reshaped)

        tf.logging.info('word_lengths =====> {}'.format(word_lengths))

        # bi lstm on chars
        cell_fw = tf.contrib.rnn.LSTMCell(CHAR_LEVEL_LSTM_HIDDEN_SIZE,
                                          state_is_tuple=True)
        cell_bw = tf.contrib.rnn.LSTMCell(CHAR_LEVEL_LSTM_HIDDEN_SIZE,
                                          state_is_tuple=True)

        _output = tf.nn.bidirectional_dynamic_rnn(
            cell_fw=cell_fw,
            cell_bw=cell_bw,
            dtype=tf.float32,
            sequence_length=word_lengths,
            inputs=char_embeddings_reshaped,
            scope="encode_words")

        # read and concat output
        (char_fw_output_one, char_bw_output_one) , output_state = _output
        ((hidden_fw, output_fw), (hidden_bw, output_bw)) = output_state
        encoded_words = tf.concat([output_fw, output_bw], axis=-1)
        
        char_encoded = tf.concat([char_fw_output_one,
                                  char_bw_output_one], axis=-1)
        lstm_out_encoded_words = encoded_words
        # [BATCH_SIZE, MAX_SEQ_LENGTH, 2 * CHAR_LEVEL_LSTM_HIDDEN_SIZE]
        encoded_words = tf.reshape(encoded_words,
                                   shape=[BATCH_SIZE, MAX_DOC_LENGTH, 2 *
                                          CHAR_LEVEL_LSTM_HIDDEN_SIZE])

        tf.logging.info('encoded_words =====> {}'.format(encoded_words))


    
    
    

INFO:tensorflow:reshaped char_embeddings =====> Tensor("char_embed_layer/EmbedSequence/embedding_lookup:0", shape=(6, 7, 6, 3), dtype=float32)
INFO:tensorflow:word_lengths =====> Tensor("chars_level_bilstm_layer/Sum:0", shape=(?,), dtype=int32)
INFO:tensorflow:encoded_words =====> Tensor("chars_level_bilstm_layer/Reshape:0", shape=(?, ?, 6), dtype=float32)
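
The word encoding is built from the final LSTM states, one per flattened word, and then reshaped back to `[batch, doc_length, ...]`. My understanding is that, for the forward direction, the final output state equals the output at the last valid character and stays at the zero initial state for all-pad words. A NumPy sketch under that assumption, for a single direction and with made-up outputs (`last_valid_output` is a hypothetical helper):

```python
import numpy as np

def last_valid_output(outputs, lengths):
    """Pick outputs[n, lengths[n] - 1] per row; rows of length 0 stay zero.

    outputs: [n_words_total, max_chars, hidden], lengths: [n_words_total]
    """
    n, _, hidden = outputs.shape
    lengths = np.asarray(lengths)
    result = np.zeros((n, hidden), dtype=outputs.dtype)
    valid = lengths > 0
    result[valid] = outputs[np.nonzero(valid)[0], lengths[valid] - 1]
    return result

# 4 flattened words (2 docs x 2 words), 3 chars each, hidden size 2.
toy_outputs = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
states = last_valid_output(toy_outputs, lengths=[3, 1, 0, 2])
toy_encoded_words = states.reshape(2, 2, 2)   # back to [batch, doc_length, hidden]
print(toy_encoded_words)
```

In the notebook the forward and backward states are concatenated first, so the last axis is `2 * CHAR_LEVEL_LSTM_HIDDEN_SIZE`.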


In [36]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.tables_initializer().run()
    
    tf.logging.info('char_embeddings =====> {}'.format(char_embeddings.shape))
    
    tf.logging.info('char_encoded =====> {}'.format(char_encoded.get_shape()))
    print("char_encoded:\n", char_encoded.eval())
    
    tf.logging.info('char_fw_output_one =====> {}'.format(char_fw_output_one.get_shape()))
    tf.logging.info('char_bw_output_one =====> {}'.format(char_bw_output_one.get_shape()))
    
    tf.logging.info('char hidden_fw =====> {}'.format(hidden_fw.get_shape()))
    tf.logging.info('char output_fw =====> {}'.format(output_fw.get_shape()))
    tf.logging.info('char hidden_bw =====> {}'.format(hidden_bw.get_shape()))
    tf.logging.info('char output_bw =====> {}'.format(output_bw.get_shape()))
    
    tf.logging.info('lstm_out_encoded_words =====> {}'.format(lstm_out_encoded_words.get_shape()))
    # check for zeros in the encoded words, where it omits padded characters
    tf.logging.info('lstm_out_encoded_words =====> {}\n'.format(lstm_out_encoded_words.eval()))
    tf.logging.info('=====================================================================')
    tf.logging.info('encoded_words =====> {}'.format(encoded_words.get_shape()))
    tf.logging.info('encoded_words =====> {}\n'.format(encoded_words.eval()))
    

INFO:tensorflow:char_embeddings =====> (6, 7, 6, 3)
INFO:tensorflow:char_encoded =====> (?, ?, 6)
char_encoded:
 [[[ 0.01289693  0.03215854  0.0049089   0.02896848  0.01639003  0.01760693]
  [-0.00175623 -0.0293929   0.0057649  -0.00010866 -0.02562934  0.02795034]
  [-0.02921278 -0.06415376 -0.00687395 -0.00772599 -0.02570577 -0.00648446]
  [-0.00780411  0.04103209 -0.00176386  0.03748152  0.03362147 -0.01367662]
  [-0.05726055 -0.03375445 -0.02974403 -0.02291263 -0.04749316 -0.00541486]
  [-0.06904165  0.03736177 -0.03910471  0.06656165  0.05170256  0.00199127]]

 [[ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.       

# References: 
- https://medium.com/towards-data-science/how-to-do-text-classification-using-tensorflow-word-embeddings-and-cnn-edae13b3e575
- https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/blogs/textclassification

