This is part 2 of text classification based on Recurrent Neural Networks (RNNs).

### Core concepts in part 2:
- **Multiple RNN layers** with **`tf.nn.dynamic_rnn`**, which copies the cell state through and outputs zero tensors past each sequence's actual length.
- **Variable sequence length**
    - For model input, to minimize padding, I first sort the data by length, then pad the sequences in each batch to that batch's longest sequence. Note: **the less padding, the faster the training** [1].
    - For model output, the hidden state at each sequence's last actual step should be extracted.
- **Exploding and vanishing gradients**
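
The effect of sorting on padding can be illustrated with a small sketch in plain Python. The helper `count_padding` and the toy lengths are assumptions made for illustration only; they are not part of the notebook's code.

```python
def count_padding(lengths, batch_size):
    """Total PAD tokens needed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Toy sequence lengths (hypothetical data): short and long articles mixed together.
lengths = [3, 50, 4, 48, 5, 47, 2, 49]

unsorted_padding = count_padding(lengths, batch_size=4)
sorted_padding = count_padding(sorted(lengths), batch_size=4)

print('padding without sorting:', unsorted_padding)
print('padding with sorting:', sorted_padding)
```

With mixed lengths, every short sequence is padded up to a long neighbour; after sorting, each batch holds sequences of similar length, so far fewer padded steps are processed.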

### Variable sequence length
- **Input**:
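    - Pass the actual sequence lengths to the `sequence_length` argument of `tf.nn.dynamic_rnn`. They are computed with `actual_batch_len = tf.cast(tf.reduce_sum(tf.sign(self.input_X), 1), tf.int32)` [1], which counts the non-zero ids in each row and therefore relies on `PAD` having index 0.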
- **Output**:
    - We should extract the hidden state at the last actual step. This can be done with the following code [2]:
    ```
    batch_range = tf.range(tf.shape(outputs)[0])
    indices = tf.stack([batch_range, actual_batch_len - 1], axis=1)
    last_output = tf.gather_nd(outputs, indices)
    ```
- **Trick**
    - **Bucketing** can be used to **accelerate** training, but it has no effect on model accuracy. Each bucket has its own `placeholder` and corresponding sub-graph, while all buckets share the rest of the computation graph [3]. Not implemented yet.
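
The length computation and the last-step extraction described above can be mimicked in plain Python. This is only a sketch: `actual_lengths`, `last_relevant`, and the toy `outputs` are hypothetical stand-ins for the TensorFlow ops, and it assumes that `PAD` is index 0 and appears only at the end of a sequence.

```python
PAD = 0  # assumption: index 0 is reserved for padding, as in this notebook


def actual_lengths(batch):
    """Count non-PAD ids per row, mirroring reduce_sum(sign(x), axis=1) for ids >= 0."""
    return [sum(1 for idx in row if idx != PAD) for row in batch]


def last_relevant(outputs, lengths):
    """Pick outputs[b][lengths[b] - 1] for every row b, like gather_nd on (b, length - 1)."""
    return [outputs[b][n - 1] for b, n in enumerate(lengths)]


# Three padded sequences of word ids.
batch = [
    [5, 8, 2, 0, 0],
    [7, 1, 0, 0, 0],
    [4, 9, 3, 6, 1],
]
# Stand-in for RNN outputs: one distinct value per (sequence, step).
outputs = [[10 * b + t for t in range(5)] for b in range(3)]

lengths = actual_lengths(batch)
last = last_relevant(outputs, lengths)
print(lengths)
print(last)
```

Note that a row consisting only of `PAD` would get length 0 and index `-1`, so empty sequences need to be filtered out beforehand.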

### Exploding and vanishing gradients
- **Vanishing gradients**
    - Symptom: the weights of the last layers change much more than those of the first layers.
    - Why [LSTM / GRU prevents vanishing gradients](https://www.quora.com/How-does-LSTM-help-prevent-the-vanishing-and-exploding-gradient-problem-in-a-recurrent-neural-network)?

- **Exploding gradients**
    - `tf.clip_by_global_norm` is used to avoid exploding gradients:
    ```
    variables = tf.trainable_variables()
    gradients = tf.gradients(ys=self.cost, xs=variables)
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables), global_step=self.global_step)
    ```
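
Both effects, and what global-norm clipping does to the gradients, can be illustrated numerically. The sketch below is plain Python under simplifying assumptions: `backprop_factor` models a linear scalar recurrence rather than a real LSTM, and `clip_by_global_norm` here is a hypothetical re-implementation of the scaling rule `clip_norm / max(global_norm, clip_norm)`, not the TensorFlow function itself.

```python
import math


def backprop_factor(weight, steps):
    """Gradient scale after `steps` steps of the linear recurrence h_t = weight * h_{t-1}."""
    return weight ** steps


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


vanishing = backprop_factor(0.9, 100)   # shrinks toward zero
exploding = backprop_factor(1.1, 100)   # grows very large
print(vanishing, exploding)

# Two "variables" with gradients [3, 4] and [12]: global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm, clipped_norm)
```

Because all gradients are scaled by the same factor, clipping changes the step size but not the update direction; gradients whose global norm is already below `clip_norm` are left unchanged.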


References:  
[1] https://danijar.com/variable-sequence-lengths-in-tensorflow/  
[2] https://stackoverflow.com/questions/36817596/get-last-output-of-dynamic-rnn-in-tensorflow  
[3] https://www.zhihu.com/question/52200883/answer/136317118  
[4] https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html

In [1]:
import os
import codecs
import itertools
from collections import Counter
from random import shuffle
import tensorflow as tf
import numpy as np
import random
import math

Class `DataGenerator` reads the input files, converts words to indices, and generates batches of training or testing data.

Two extra words are introduced: `PAD` for padding shorter sequences and `OOV` for representing out-of-vocabulary words.

The data is sorted by length and then padded within each batch to minimize padding.
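
As a rough illustration of the `PAD` / `OOV` scheme, the following plain-Python sketch builds a tiny vocabulary and encodes one sentence. The helpers `build_vocab` and `encode` and the toy sentences are assumptions made for this example; they follow the same index layout as the class below (0 for `PAD`, `vocab_size - 1` for `OOV`) but are not the notebook's code.

```python
from collections import Counter


def build_vocab(sentences, vocab_size):
    """Map the most frequent words to ids 1..vocab_size-2; 0 is PAD, vocab_size-1 is OOV."""
    counts = Counter(word for sentence in sentences for word in sentence)
    most_common = counts.most_common(vocab_size - 2)
    word2idx = {word: i + 1 for i, (word, _) in enumerate(most_common)}
    word2idx['PAD'] = 0
    word2idx['OOV'] = vocab_size - 1
    return word2idx


def encode(sentence, word2idx, length):
    """Convert words to ids (unknown words become OOV) and right-pad to `length`."""
    ids = [word2idx.get(word, word2idx['OOV']) for word in sentence]
    return ids + [word2idx['PAD']] * (length - len(ids))


# Toy corpus (hypothetical): 'a' and 'b' are the two most frequent words.
sentences = [['a', 'b', 'a'], ['a', 'c'], ['b', 'a', 'd', 'a']]
word2idx = build_vocab(sentences, vocab_size=4)
encoded = encode(['a', 'b', 'z'], word2idx, length=5)
print(encoded)
```

Reserving index 0 for `PAD` matters later on: the model recovers each sequence's actual length by counting the non-zero ids in a row.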

In [2]:
class DataGenerator():
    """
    Reads the training and testing files and generates batch data.
    """
    
    def __init__(self, args):
        self.folder_path = args.FOLDER_PATH
        self.batch_size = args.BATCH_SIZE
        self.vocab_size = args.VOCAB_SIZE
        self.num_epoch = args.NUM_EPOCH
        self.read_build_input()
        self.label_dict = {0:'auto', 1:'business', 2:'IT', 3:'health', 4:'sports', 5:'yule'}
        
        
    def one_hot_encode(self, x, n_classes=6):
        return np.eye(n_classes)[[x]][0]

    def read_build_input(self):
        training_src = []
        testing_src = []
        training_article_len = []

        for cur_category in range(1, 7):
            
            print('parsing file >>>>>>>>>>>>>>> ', cur_category)
            print('-'*100)
            
            training_path = os.path.join(self.folder_path, 'training_' + str(cur_category) + '.cs')
            # `with` closes the file even if parsing fails.
            with codecs.open(filename=training_path, mode='r', encoding='utf-8') as training_input_file:
                for tmp_line in training_input_file:
                    words = tmp_line.split()
                    # To keep only short articles, add a filter such as: if len(words) < 50
                    training_src.append((words, cur_category-1))
                    training_article_len.append(len(words))

            testing_path = os.path.join(self.folder_path, 'testing_' + str(cur_category) + '.cs')
            with codecs.open(filename=testing_path, mode='r', encoding='utf-8') as testing_input_file:
                for tmp_line in testing_input_file:
                    testing_src.append((tmp_line.split(), cur_category-1))

        print('='*100)
        print('Size of training data:', len(training_src))
        print('Size of testing data:', len(testing_src))
            
        self.TRAINING_SIZE = len(training_src)
        
        training_X_src = [pair[0] for pair in training_src]
        testing_X_src = [pair[0] for pair in testing_src]
        all_data = list(itertools.chain.from_iterable(training_X_src))
        word_counter = Counter(all_data).most_common(self.vocab_size-2)
        del all_data
        
        print('='*100)
        print('top 10 frequent words:')
        print(word_counter[0:10])
        self.word2idx = {val[0]: idx+1 for idx, val in enumerate(word_counter)}
        self.word2idx['PAD'] = 0 # padding word
        self.word2idx['OOV'] = self.vocab_size - 1 # out-of-vocabulary
        self.idx2word = dict(zip(self.word2idx.values(), self.word2idx.keys()))
        print('Total vocabulary size:{}'.format(len(self.word2idx)))
        
        
        self.training = self.generate_batch_data(training_src)
        self.testing = self.generate_batch_data(testing_src)       
        
    
    def generate_batch_data(self, data):
        sorted_data = sorted(data, key=lambda x: len(x[0]))
        num_batches = int(math.floor(len(data) / self.batch_size))
        rtn_data = []
        for i in range(num_batches):
            rtn_data.append(self.pad_data(sorted_data[i*self.batch_size: (i+1)*self.batch_size]))
            
            '''
            # print test data
            if i<200 and i>190:
                print('*'*100)
                aaa = rtn_data[i]
                for tmp in aaa:
                    print('-'*10)
                    print(tmp[0])
                    print(', '.join(self.idx2word[x] for x in tmp[0]))
                    print(tmp[1])
            '''          
        return rtn_data
    
    def pad_data(self, batch_data):
        max_batch_len = max([len(tmp[0]) for tmp in batch_data])
        rtn_batch = []
        for tmp in batch_data:
            tmp_sen = [self.word2idx[w] if w in self.word2idx else self.word2idx['OOV'] for w in tmp[0]]
            rtn_batch.append(((tmp_sen + [self.word2idx['PAD']] * (max_batch_len - len(tmp_sen))), tmp[1]))
        return rtn_batch
    
    def next_batch_training(self):
        random.shuffle(self.training)
        for i in range(len(self.training)):
            batch_X = []
            batch_y = []
            for tmp in self.training[i]:
                batch_X.append(tmp[0])
                batch_y.append(self.one_hot_encode(tmp[1]))
            yield np.array(batch_X, dtype=np.int32), np.array(batch_y, dtype=np.int32)
           
    def next_batch_testing(self):
        random.shuffle(self.testing)
        for i in range(len(self.testing)):
            batch_X = []
            batch_y = []
            for tmp in self.testing[i]:
                batch_X.append(tmp[0])
                batch_y.append(self.one_hot_encode(tmp[1]))
            yield np.array(batch_X, dtype=np.int32), np.array(batch_y, dtype=np.int32)

In [3]:
class Arguments:
    """
    main hyper-parameters
    """
    NUM_LAYERS = 3
    MAX_NORM = 5.0
    # MAX_SEQ_LENGTH = 150 # unused: the sequence length varies per batch and is determined by the data
    EMBED_SIZE = 128 # embedding dimensions
    BATCH_SIZE = 64
    VOCAB_SIZE = 300000 # vocabulary size
    NUM_CLASSES = 6 # number of classes
    FOLDER_PATH = 'sogou_corpus'
    NUM_EPOCH = 7
    KEEP_PROB = 0.8 # keep probability for the rnn cell outputs (dropout rate = 1 - KEEP_PROB)
    RNN_TYPE = 'LSTM' # LSTM or GRU
    CHECKPOINTS_DIR = 'text_classification_LSTM_model_part2'
    LOGDIR = 'text_classification_LSTM_logdir_part2'

Helper function for organizing the TensorFlow model structure.

From https://danijar.com/structuring-your-tensorflow-models

In [4]:
import functools

def lazy_property(function):
    """
    helper function from https://danijar.com/structuring-your-tensorflow-models
    """
    attribute = '_cache_' + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator
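
A short usage sketch of the decorator, under the assumption that the point of caching is to build each part of the graph only once. The `Demo` class and its `calls` counter are hypothetical and exist only for this illustration; the decorator is repeated so that the snippet runs on its own.

```python
import functools


def lazy_property(function):
    # Same decorator as in the cell above, repeated so this sketch is self-contained.
    attribute = '_cache_' + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class Demo:
    """Hypothetical class; `calls` counts how often the property body actually runs."""

    def __init__(self):
        self.calls = 0

    @lazy_property
    def node(self):
        self.calls += 1
        return object()  # stands in for a graph node such as a placeholder or an op


demo = Demo()
first = demo.node
second = demo.node
print(first is second, demo.calls)
```

The body runs on the first access only; later accesses return the cached object, so accessing the same model property repeatedly does not add duplicate nodes to the graph.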

Class `TextClassificationModel` defines the main model, which includes `input_output`, `RNNs_model`, `score`, `cost` and `optimizer`.

A multi-layer RNN with `tf.nn.dynamic_rnn` is used.

### `outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None)`

- `inputs` has shape `[batch_size, num_steps, embedding_size]`.
- `outputs` has shape `[batch_size, num_steps, num_units]`.
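
What `sequence_length` changes can be sketched with a toy recurrence in plain Python. This is not how TensorFlow implements it: `dynamic_rnn_sketch`, the single-unit tanh cell, and the weights `w_x` and `w_h` are assumptions chosen only to show the zeroed outputs and the copied-through state.

```python
import math


def dynamic_rnn_sketch(inputs, sequence_length, w_x=0.5, w_h=0.8):
    """Toy single-unit tanh RNN over a batch of scalar sequences.

    Past a sequence's actual length the output is 0.0 and the state is
    copied through unchanged.
    """
    outputs, last_state = [], []
    for seq, length in zip(inputs, sequence_length):
        state, row = 0.0, []
        for t, x in enumerate(seq):
            if t < length:
                state = math.tanh(w_x * x + w_h * state)
                row.append(state)
            else:
                row.append(0.0)  # zero output on padded steps; state stays as it was
        outputs.append(row)
        last_state.append(state)
    return outputs, last_state


# Two padded sequences with actual lengths 2 and 4 (hypothetical values).
inputs = [[1.0, 2.0, 0.0, 0.0], [0.5, -1.0, 1.5, 2.0]]
outputs, last_state = dynamic_rnn_sketch(inputs, sequence_length=[2, 4])
print(outputs[0])
print(last_state)
```

For this simple cell the final state equals the output at the last actual step, which is why indexing `outputs` at `length - 1` gives a usable sentence representation, whereas indexing at the final time step would return zeros for padded sequences.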

In [5]:
class TextClassificationModel:
    """
    Model class.
    """
    def __init__(self, args, is_training=True):
        self.num_units = args.EMBED_SIZE
        self.num_layers = args.NUM_LAYERS
        self.batch_size = args.BATCH_SIZE
        self.rnn_type = args.RNN_TYPE
        self.is_training = is_training
        self.clip_norm = args.MAX_NORM
        self.keep_prob = args.KEEP_PROB
        
        if self.is_training:
            self.batch_size = args.BATCH_SIZE
        else:
            # `Arguments` defines no TESTING_SIZE, so fall back to BATCH_SIZE.
            self.batch_size = getattr(args, 'TESTING_SIZE', args.BATCH_SIZE)
            
        self.num_classes = args.NUM_CLASSES
        self.vocab_size = args.VOCAB_SIZE
        self.best_accuracy = tf.Variable(initial_value=0.0, dtype=tf.float32, trainable=False, name='best_accuracy')
        self.global_step = tf.Variable(initial_value=0, dtype=tf.int32, trainable=False, name='global_step')

        self.input_output
        self.model
        self.score
        self.cost
        self.optimizer
        
    @lazy_property
    def input_output(self):
        with tf.name_scope('input_output'):
            input_X = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, None], name='input_X')
            #input_X_len = tf.placeholder(dtype=tf.int32, shape=[self.batch_size], name='input_X_len')
            output_y = tf.placeholder(dtype=tf.int32, shape = [self.batch_size, self.num_classes], name='output_y')
        return (input_X, output_y)
                
        
    @lazy_property
    def model(self):        
        with tf.name_scope('RNNs_model'):
            with tf.variable_scope('embedding'):
                with tf.device('/cpu:0'):
                    embedding_matrix = tf.get_variable(name='embedding_matrix', shape=[self.vocab_size, self.num_units])
                    # inputs shape: (self.batch_size, self.num_steps, self.num_units)
                    inputs = tf.nn.embedding_lookup(params=embedding_matrix, ids=self.input_output[0], name='embed')

            def build_cell():
                # Build a separate cell object for each layer. `[cell] * num_layers`
                # repeats one object, which in newer TensorFlow 1.x releases can raise
                # a variable-scope error or silently share weights across layers.
                if self.rnn_type == 'GRU':
                    cell = tf.contrib.rnn.GRUCell(num_units=self.num_units)
                elif self.rnn_type == 'LSTM':
                    cell = tf.contrib.rnn.BasicLSTMCell(num_units=self.num_units)
                else:
                    raise ValueError('The input rnn type is undefined.')
                return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=self.keep_prob)

            cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(self.num_layers)])
            
            initial_state = cell.zero_state(batch_size=self.batch_size, dtype=tf.float32)           
            
            print('='*100)           
            print('dynamic_rnn inputs type:', type(inputs)) # Tensor
            print('dynamic_rnn inputs shape:', inputs.get_shape()) # [self.batch_size, self.num_steps, self.num_units]
            print('='*100)            
            
             
            # calculate the actual length of each sequence in the batch (PAD has index 0)
            actual_batch_len = tf.cast(tf.reduce_sum(tf.sign(self.input_output[0]), 1), tf.int32)
            
            outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=actual_batch_len, initial_state=initial_state)
            
            print('dynamic_rnn output type:', type(outputs)) # Tensor
            print('dynamic_rnn output shape:', outputs.get_shape()) # [self.batch_size, self.num_steps, self.num_units]         
            print('dynamic_rnn last_state type:', type(last_state)) # tuple(LSTMStateTuple or Tensor)
            print('last_state:', last_state)
            print('='*100)
            
            # method 1: obtain the last_output for the last word of each sequence
            # from https://danijar.com/variable-sequence-lengths-in-tensorflow/
            last_word_idx = tf.range(0, self.batch_size) * tf.shape(outputs)[1] + (actual_batch_len - 1)
            last_output = tf.gather(tf.reshape(outputs, [-1, self.num_units]), last_word_idx)
            
            # method 2: obtain the last_output for the last word of each sequence
            batch_range = tf.range(tf.shape(outputs)[0])
            indices = tf.stack([batch_range, actual_batch_len - 1], axis=1)
            last_output_2 = tf.gather_nd(outputs, indices)
            
            # element-wise comparison of the two methods; every entry should be True
            self.check_euqal = tf.equal(last_output, last_output_2)
            
            
            print('='*100)
            print('dynamic_rnn last_output type:', type(last_output)) # Tensor
            print('dynamic_rnn last_output shape:', last_output.get_shape()) # [self.batch_size, self.num_units]         
            
        return (last_output_2, last_state)
    
    @lazy_property
    def score(self):
        
        with tf.variable_scope('score'):
            softmax_weights = tf.get_variable(name='softmax_weights', shape=[self.num_units, self.num_classes])
            softmax_bias = tf.get_variable(name='softmax_bias', shape=[self.num_classes])        
            logits = tf.matmul(self.model[0], softmax_weights) + softmax_bias
            probs = tf.nn.softmax(logits)
            prediction = tf.argmax(probs, 1)
            accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.cast(prediction, tf.int32), tf.cast(tf.argmax(self.input_output[1], 1), tf.int32)), tf.float32))
            tf.summary.scalar(name='accuracy', tensor=accuracy)
            
        return (logits, accuracy)
    
    @lazy_property
    def cost(self):            
        with tf.name_scope('cost'):
            cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.score[0], labels=self.input_output[1]))
            cost_check = tf.check_numerics(cost, 'Cost is Nan')
            with tf.control_dependencies([cost_check]):
                tf.summary.scalar(name='loss', tensor=cost)
                tf.summary.histogram(name='histogram_loss', values=cost)
                self.summary_op = tf.summary.merge_all()
        return cost
    
    @lazy_property
    def optimizer(self):
        with tf.name_scope('optimizer'):
            variables = tf.trainable_variables()
            gradients = tf.gradients(ys=self.cost, xs=variables)
            clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
            
            #grad_check = tf.check_numerics(clipped_gradients, 'Gradients is Nan')
            #with tf.control_dependencies([grad_check]):
            #starter_learning_rate = 0.05
            #decay_rate = tf.train.exponential_decay(starter_learning_rate, self.global_step, 1000, 0.96)
            optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
            optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables), global_step=self.global_step)
            return optimize    

In [6]:
def train(data, model, args):
    saver = tf.train.Saver()
    with tf.Session() as sess:
        train_writer = tf.summary.FileWriter(logdir=args.LOGDIR + '/train', graph=sess.graph)
        test_writer = tf.summary.FileWriter(logdir=args.LOGDIR + '/test', graph=sess.graph)
        
        sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(checkpoint_dir=args.CHECKPOINTS_DIR)
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess=sess, save_path=ckpt.model_checkpoint_path)
            print(ckpt)
        
        initial_step = model.global_step.eval()
        idx = 1
        for loop in range(args.NUM_EPOCH):
            for batch_X, batch_y in data.next_batch_training():

                feed_dict = {model.input_output: (batch_X, batch_y)}
                aaa, tmp_accuracy, tmp_cost, _, tmp_summary = sess.run([model.check_euqal, model.score[1], model.cost, model.optimizer, model.summary_op], feed_dict=feed_dict)
                train_writer.add_summary(summary=tmp_summary, global_step=model.global_step.eval())
               
                
                if idx % 50 == 0:
                    #print(aaa)
                    print('='*100)
                    print('Step:{}, training accuracy:{:4f}'.format(model.global_step.eval(), tmp_accuracy))
                    print('loop / idx: {} / {}, loss:{:4f}, accuracy:{:4f}'.format(loop, idx, tmp_cost, tmp_accuracy))
                    print('='*100)

                if idx % 1000 == 0:
                    test_cost = 0.0
                    test_accuracy = 0.0
                    cc = 0
                    for test_batch_X, test_batch_y in data.next_batch_testing():
                        test_feed_dict = {model.input_output:(test_batch_X, test_batch_y)}
                        test_tmp_cost, test_tmp_accuracy, test_tmp_summary = sess.run([model.cost, model.score[1], model.summary_op], feed_dict=test_feed_dict)
                        test_writer.add_summary(summary=test_tmp_summary, global_step=model.global_step.eval())
                        test_cost += test_tmp_cost
                        test_accuracy += test_tmp_accuracy
                        cc += 1                    
                    print('-'*100)
                    test_accuracy = test_accuracy / cc
                    test_cost = test_cost / cc
                    print('Step:{}, average testing cost:{:4f}, average testing accuracy:{:4f}'.format(model.global_step.eval(), test_cost, test_accuracy))
                    print('-'*100)
                    
                    if test_accuracy >= sess.run(model.best_accuracy):
                        print('Best model accuracy: {:4f}'.format(test_accuracy))
                        sess.run(model.best_accuracy.assign(test_accuracy))
                        saver.save(sess=sess, save_path=os.path.join(args.CHECKPOINTS_DIR, 'text_classification_lstm.ckpt'), global_step=model.global_step.eval())
                                        
                idx += 1

In [7]:
if __name__ == '__main__':
    args = Arguments()
    data = DataGenerator(args)

    model = TextClassificationModel(args)
    train(data, model, args)    

parsing file >>>>>>>>>>>>>>>  1
----------------------------------------------------------------------------------------------------
parsing file >>>>>>>>>>>>>>>  2
----------------------------------------------------------------------------------------------------
parsing file >>>>>>>>>>>>>>>  3
----------------------------------------------------------------------------------------------------
parsing file >>>>>>>>>>>>>>>  4
----------------------------------------------------------------------------------------------------
parsing file >>>>>>>>>>>>>>>  5
----------------------------------------------------------------------------------------------------
parsing file >>>>>>>>>>>>>>>  6
----------------------------------------------------------------------------------------------------
Size of training data: 90000
Size of testing data: 18000
top 10 frequent words:
[('系列', 88969), ('月', 77879), ('中', 70570), ('年', 64838), ('产品', 61786), ('日', 58050), ('英寸', 57891), ('华硕', 56191), ('屏幕尺

----------------------------------------------------------------------------------------------------
Step:1000, average testing cost:0.375240, average testing accuracy:0.897409
----------------------------------------------------------------------------------------------------
Best model accuracy: 0.897409
Step:1050, training accuracy:0.921875
loop / idx: 0 / 1050, loss:0.147111, accuracy:0.921875
Step:1100, training accuracy:0.750000
loop / idx: 0 / 1100, loss:0.786429, accuracy:0.750000
Step:1150, training accuracy:0.906250
loop / idx: 0 / 1150, loss:0.379448, accuracy:0.906250
Step:1200, training accuracy:0.796875
loop / idx: 0 / 1200, loss:0.600435, accuracy:0.796875
Step:1250, training accuracy:0.937500
loop / idx: 0 / 1250, loss:0.328683, accuracy:0.937500
Step:1300, training accuracy:0.953125
loop / idx: 0 / 1300, loss:0.213153, accuracy:0.953125
Step:1350, training accuracy:0.968750
loop / idx: 0 / 1350, loss:0.196135, accuracy:0.968750
Step:1400, training accuracy:0.953125
loo

Step:2350, training accuracy:0.984375
loop / idx: 1 / 2350, loss:0.071001, accuracy:0.984375
Step:2400, training accuracy:1.000000
loop / idx: 1 / 2400, loss:0.001901, accuracy:1.000000
Step:2450, training accuracy:0.968750
loop / idx: 1 / 2450, loss:0.113970, accuracy:0.968750
Step:2500, training accuracy:0.984375
loop / idx: 1 / 2500, loss:0.062683, accuracy:0.984375
Step:2550, training accuracy:0.953125
loop / idx: 1 / 2550, loss:0.162526, accuracy:0.953125
Step:2600, training accuracy:0.953125
loop / idx: 1 / 2600, loss:0.170751, accuracy:0.953125
Step:2650, training accuracy:0.953125
loop / idx: 1 / 2650, loss:0.118053, accuracy:0.953125
Step:2700, training accuracy:0.968750
loop / idx: 1 / 2700, loss:0.133201, accuracy:0.968750
Step:2750, training accuracy:0.968750
loop / idx: 1 / 2750, loss:0.230817, accuracy:0.968750
Step:2800, training accuracy:0.984375
loop / idx: 1 / 2800, loss:0.038225, accuracy:0.984375
Step:2850, training accuracy:0.953125
loop / idx: 2 / 2850, loss:0.132

KeyboardInterrupt: 