# RNN Variable Length Text Classifier with Bucketing

For the network in ``RNN_VariableLength_Text_Classifier.ipynb``, we used a batch_size of 256. But each example in the batch had a different length ranging from 5 to 30. As the maximum length for each batch is usually very close to 30, short sequences required a lot of padding (e.g., all sequences of length 5 in the batch are padded with up to 25 zeros).

This leads to a lot of excess computation, and we can improve upon it by “bucketing” our training samples. If we select our batches such that the lengths of the samples in each batch are within, say, 5 of each other, then the amount of padding in a batch of 256 is bounded by 256 * 5 = 1280. This would make our worst case outcome more than twice as good as previous average case outcome.

To take advantage of bucketing, we simply modify our DataIterator. There are many ways one might implement this, but the key point to keep in mind is that we should not “bias” the order in which different sequence lengths are sampled any more than necessary to achieve bucketing. E.g., sorting our data by sequence length might seem like a good solution, but then each epoch would be trained on short sequences before longer sequences, which could harm results.

The new code is between marks like:

---
# BUCKETING
# END BUCKETING
---

<a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html#improving-training-speed-using-bucketing">[Ref]</a>

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import math

import blogs_data #available at https://github.com/spitis/blogs_data

## Hyperparameters

In [2]:
DATA_PCT_LOADED = 0.06

STATE_SIZE = 64

BATCH_SIZE = 256

NUM_EPOCHS = 10

## Read data

In [20]:
df = blogs_data.loadBlogs().sample(frac=DATA_PCT_LOADED).reset_index(drop=True)
df.head(3)

Unnamed: 0,post_id,gender,age_bracket,string,as_numbers,length
0,169696,0,2,a day later they were all headed to florida an...,"[7, 94, 281, 44, 88, 37, 1172, 5, 2296, 6, 4, ...",16
1,39870,1,1,they decided that i should give the rest to hi...,"[44, 305, 9, 3, 145, 227, 4, 398, 5, 70, 1841, 2]",12
2,65938,0,0,i recommend you go online and look at some of ...,"[3, 2578, 15, 71, 602, 6, 186, 35, 67, 8, 4, 6...",17


In [25]:
type(df['as_numbers'][0])

list

In [4]:
vocab,reverse_vocab = blogs_data.loadVocab()
train_len, test_len = math.floor(len(df)*0.8), math.floor(len(df)*0.2)
train_len,test_len

(73382, 18345)

In [5]:
train = df.iloc[:train_len-1]
test = df.iloc[train_len:train_len + test_len]

In [6]:
train.head(2)

Unnamed: 0,post_id,gender,age_bracket,string,as_numbers,length
0,47107,0,0,google plans <UNK> <UNK> <UNK> search engine g...,"[1670, 1077, 0, 0, 0, 1200, 3234, 2150, 1077, ...",30
1,148011,1,0,the only person who cares about you is a half ...,"[4, 99, 211, 74, 1587, 47, 15, 14, 7, 351, 0, ...",18


In [7]:
test.head(2)

Unnamed: 0,post_id,gender,age_bracket,string,as_numbers,length
73382,27131,1,1,we were so busy seeing a prospect in each that...,"[32, 88, 27, 701, 550, 7, 6159, 11, 273, 9, 32...",17
73383,156099,1,1,<UNK> does not <UNK> war ’s <UNK> with its mor...,"[0, 149, 34, 0, 482, 206, 0, 26, 132, 6160, 18...",24


In [8]:
df = None

## Manage data

### Data iterator

In [9]:
class SimpleDataIterator():
    def __init__(self,df):
        self.df = df
        self.size = len(self.df)
        self.epochs = 0
        self.shuffle()
        
    def shuffle(self):
        self.df = self.df.sample(frac=1).reset_index(drop=True)
        self.cursor = 0
        
    def next_batch(self,n):
        if(self.cursor+n-1 > self.size):
            self.epochs += 1
            self.shuffle()
        res = self.df.iloc[self.cursor:self.cursor+n]
        self.cursor += n
        return res['as_numbers'],res['gender']*3 + res['age_bracket'],res['length']

In [10]:
data = SimpleDataIterator(train)
d = data.next_batch(3)
print('Input sequences\n', d[0], end='\n\n')
print('Target values\n', d[1], end='\n\n')
print('Sequence lengths\n', d[2])

Input sequences
 0    [41, 30, 2637, 26, 19, 1, 6, 3, 123, 59, 231, ...
1       [0, 3, 64, 29, 14, 90, 5, 30, 7, 506, 1021, 2]
2       [43, 4, 1257, 60, 30, 124, 40, 106, 29, 2, 39]
Name: as_numbers, dtype: object

Target values
 0    1
1    4
2    2
dtype: int64

Sequence lengths
 0    17
1    12
2    11
Name: length, dtype: int64


---

# BUCKETING

### Data padding

In [26]:
class BucketedDataIterator(SimpleDataIterator):
    
    def __init__(self,df,num_buckets=5):
        df = df.sort_values('length').reset_index(drop=True)
        self.size = len(df) // num_buckets
        self.dfs = []
        for bucket in range(num_buckets):
            self.dfs.append(df.iloc[bucket*self.size:(bucket+1)*self.size - 1])
        self.num_buckets = num_buckets
        
        #cursor[i] will be the cursor for the ith bucket
        self.cursor = np.array([0] * num_buckets)
        self.shuffle()
        self.epochs = 0
        
    def shuffle(self):
        #sorts dataframe by sequence length, but keeps it random within the same length
        for i in range(self.num_buckets):
            self.dfs[i] = self.dfs[i].sample(frac=1).reset_index(drop=True)
            self.cursor[i] = 0
    
    def next_batch(self,n):
        
        if(np.any(self.cursor+n+1 > self.size)):
            self.epochs += 1
            self.shuffle()
        
        i = np.random.randint(0,self.num_buckets)
    
        res = self.dfs[i].iloc[self.cursor[i]:self.cursor[i]+n]
        self.cursor[i] += n
        
        #Pad sequences with 0s so they are all the same length
        maxlen = max(res['length'])
        x = np.zeros([n,maxlen],dtype=np.int32)
        for i,x_i in enumerate(x):
            x_i[:res['length'].values[i]] = res['as_numbers'].values[i]
        
        return x,res['gender']*3 + res['age_bracket'], res['length']

In [27]:
data = BucketedDataIterator(train)
d = data.next_batch(3)
print('Input sequences\n',d[0],end='\n\n')
print('Target sequences\n',d[1],end='\n\n')

Input sequences
 [[   3  123 3131  222   46   90    5  586   35    0   49 4977    2    8
  2050    2    0    0]
 [ 973    8  493 2553   10   17   69  486   10   80  113  133   16 4420
     6 2530    2    0]
 [  32   42   11 4382 6975    6   57 5130  624    1 1883    1   10   14
     7  173  192    2]]

Target sequences
 0    3
1    3
2    5
dtype: int64



# END BUCKETING

---

## Model

In [13]:
def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()
    
def build_graph(vocab_size = len(vocab), state_size = 64, batch_size = 256, num_classes = 6):
    
    reset_graph()
    
    #Placeholders
    x = tf.placeholder(tf.int32,[batch_size,None]) #[batch_size, num_steps]
    seqlen = tf.placeholder(tf.int32,[batch_size])
    y = tf.placeholder(tf.int32,[batch_size])
    keep_prob = tf.placeholder(1.0,name='keep_prob')
    
    #Embedding layer
    embeddings = tf.get_variable('embedding_matrix',[vocab_size,state_size])
    rnn_inputs = tf.nn.embedding_lookup(embeddings,x)
    
    #RNN
    cell = tf.nn.rnn_cell.GRUCell(state_size)
    init_state = tf.get_variable('init_state',[1,state_size],initializer=tf.constant_initializer(0.0))
    init_state = tf.tile(init_state,[batch_size,1])
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell,rnn_inputs,sequence_length=seqlen,initial_state=init_state)
    rnn_outputs = tf.nn.dropout(rnn_outputs,keep_prob) #Dropout
    
    #Last revelant output
    last_rnn_output = tf.gather_nd(rnn_outputs,tf.stack([tf.range(batch_size),seqlen-1],axis=1))
    
    #Softmax layer - Prediction
    with tf.variable_scope('softmax'):
        W = tf.get_variable('W',[state_size,num_classes])
        b = tf.get_variable('b',[num_classes],initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(last_rnn_output,W) + b
    preds = tf.nn.softmax(logits)
    correct = tf.equal(tf.cast(tf.argmax(preds,1),tf.int32),y)
    accuracy = tf.reduce_mean(tf.cast(correct,tf.float32))
    
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=y))
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
    
    ret_dict = {'x':x,'seqlen':seqlen,'y':y,'dropout':keep_prob,'loss':loss,'ts':train_step,'preds':preds,'accuracy':accuracy}
    
    return ret_dict

## Train function

In [14]:
def train_graph(graph,batch_size = 256, num_epochs = 10, iterator=BucketedDataIterator):
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        tr = iterator(train)
        te = iterator(test)
        
        step,accuracy = 0,0
        tr_losses,te_losses = [],[]
        current_epoch = 0
        
        while current_epoch < num_epochs:
            step += 1
            batch = tr.next_batch(batch_size)
            feed = {g['x']:batch[0],g['y']:batch[1],g['seqlen']:batch[2],g['dropout']:0.6}
            
            accuracy_,_ = sess.run([g['accuracy'],g['ts']],feed_dict = feed)
            accuracy += accuracy_
            
            if(tr.epochs > current_epoch):
                current_epoch += 1
                tr_losses.append(accuracy/step)
                step,accuracy = 0,0
                
                #eval test set
                te_epoch = te.epochs
                while (te.epochs == te_epoch):
                    step =+ 1
                    batch = te.next_batch(batch_size)
                    feed = {g['x']:batch[0],g['y']:batch[1],g['seqlen']:batch[2],g['dropout']:1.0}
                    accuracy_ = sess.run([g['accuracy']],feed_dict = feed)[0]
                    accuracy += accuracy_
                    
                te_losses.append(accuracy/step)
                step,accuracy = 0,0
                print('Accuracy after epoch',current_epoch," - tr:", tr_losses[-1]," -te:", te_losses[-1])
                
    return tr_losses,te_losses

## Train & Test

In [15]:
g = build_graph(state_size=STATE_SIZE,batch_size=BATCH_SIZE)

In [16]:
train_graph(g,batch_size=BATCH_SIZE,num_epochs=NUM_EPOCHS)

Accuracy after epoch 1  - tr: 0.226151315789  -te: 16.52734375
Accuracy after epoch 2  - tr: 0.250947960251  -te: 14.78125
Accuracy after epoch 3  - tr: 0.298364542802  -te: 18.37109375
Accuracy after epoch 4  - tr: 0.310604079498  -te: 18.13671875
Accuracy after epoch 5  - tr: 0.317354210251  -te: 12.8359375
Accuracy after epoch 6  - tr: 0.323270789749  -te: 13.93359375
Accuracy after epoch 7  - tr: 0.327715355805  -te: 13.06640625
Accuracy after epoch 8  - tr: 0.337642840485  -te: 15.76953125
Accuracy after epoch 9  - tr: 0.344046875  -te: 15.80078125
Accuracy after epoch 10  - tr: 0.352512184633  -te: 14.1875


([0.22615131578947367,
  0.25094796025104604,
  0.29836454280155644,
  0.31060407949790797,
  0.31735421025104604,
  0.32327078974895396,
  0.32771535580524347,
  0.33764284048507465,
  0.34404687499999997,
  0.35251218463302753],
 [16.52734375,
  14.78125,
  18.37109375,
  18.13671875,
  12.8359375,
  13.93359375,
  13.06640625,
  15.76953125,
  15.80078125,
  14.1875])