#### Sentence Classification

We will download the corpus from [http://cogcomp.org/Data/QA/QC/](http://cogcomp.org/Data/QA/QC/).


In [1]:
import os
from urllib.request import urlretrieve
import shutil

url = 'http://cogcomp.org/Data/QA/QC/'
train_file_name = 'train_1000.label'
test_file_name = 'TREC_10.label'

def maybe_download(url, file_name):
    if os.path.exists(file_name):
        print('Requested file', file_name, 'exists locally, no download will be performed')
    else:
        file_url = url + file_name
        print('Downloading from', file_url)
        local_tmp_file, _ = urlretrieve(file_url)
        shutil.move(local_tmp_file, file_name)
        print('Remote file successfully downloaded')

maybe_download(url, train_file_name)
maybe_download(url, test_file_name)

Requested file train_1000.label exists locally, no download will be performed
Requested file TREC_10.label exists locally, no download will be performed


Each line of the file has the following format:
``<class>:<subclass> <question>?``

The taxonomy of the classes and subclasses can be found [here](http://cogcomp.org/Data/QA/QC/definition.html).
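
As a minimal sketch of this format, the snippet below splits one line into its class, subclass, and question tokens. The helper name `parse_line` and both sample lines are illustrative assumptions and are not read from the corpus files. Splitting on the first colon only keeps any colon that appears inside the question text.

```python
def parse_line(line):
    """Split '<class>:<subclass> <question>' into its three parts."""
    # Split on the first colon only, so colons inside the question survive.
    coarse, rest = line.split(':', 1)
    # The subclass is the first whitespace-separated token after the colon.
    fine, *tokens = rest.lower().split()
    return coarse, fine, tokens


# Hypothetical sample lines written in the corpus format.
sample = 'DESC:manner How did serfdom develop in and then leave Russia ?'
with_colon = 'ENTY:cremat What is the film titled Example : Part Two ?'

print(parse_line(sample))
print(parse_line(with_colon))
```
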


In [2]:
def read_questions(in_file):
    # Returns:
    #   question_class: list with the class of every question in the file
    #   question_subclass: list with the subclass of every question in the file
    #   splits: list of questions, each lower-cased and split on whitespace into tokens
    #   max_len: maximum number of tokens in any question
    question_class = []
    question_subclass = []
    splits = []
    max_len = 0
    with open(in_file, 'r', encoding = 'latin-1') as f:
        for line in f:
            # Split on the first colon only, so colons inside the question are kept
            head, tail = line.split(':', 1)
            tail = tail.lower().split()
            question_class.append(head)
            question_subclass.append(tail[0])
            splits.append(tail[1:])
            max_len = max(max_len, len(tail) - 1)
    
    return question_class, question_subclass, splits, max_len

In [3]:
train_question_class, _ , train_questions, train_max_len =  read_questions(train_file_name)
test_question_class, _ , test_questions, test_max_len =  read_questions(test_file_name)
print('Max num of words in train question corpus is', train_max_len)
print('Max num of words in test question corpus is', test_max_len)

Max num of words in train question corpus is 32
Max num of words in test question corpus is 17


Let's look at the first few question categories and their tokens.

In [4]:
for i in range(3):
    print('Question category is', train_question_class[i])
    print('\tQuestion tokens are', train_questions[i])    

Question category is DESC
	Question tokens are ['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?']
Question category is ENTY
	Question tokens are ['what', 'films', 'featured', 'the', 'character', 'popeye', 'doyle', '?']
Question category is DESC
	Question tokens are ['how', 'can', 'i', 'find', 'a', 'list', 'of', 'celebrities', "'", 'real', 'names', '?']


We will pad every question with the padding string ``PAD`` so that all questions have the same length.

In [5]:
def pad_questions(unpadded_questions, max_len):
    padded_questions = []
    for up in unpadded_questions:
        q = ['PAD'] * max_len
        padded_questions.append(q)
        for i, token in enumerate(up):
            q[i] = token
        
    return padded_questions

max_len = max(train_max_len, test_max_len)
padded_train_set = pad_questions(train_questions, max_len)
padded_test_set = pad_questions(test_questions, max_len)

print('Length of padded questions is', max_len)
print('Sample padded training question is\n\t', padded_train_set[0])

Length of padded questions is 32
Sample padded training question is
	 ['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']


---

We will create the following four data structures from the padded questions:

- dictionary: mapping from each word in the corpus to its word_id
- reverse_dictionary: mapping from each word_id back to its word
- count: list of (word, count) tuples ordered by number of occurrences, most frequent first
- data: the questions with every word replaced by its id

In [6]:
import collections


def prepare_dataset(padded_question_set):
    all_words = [token for q in padded_question_set for token in q]
    counts = collections.Counter(all_words).most_common()
    dictionary = {}
    reverse_dictionary = {}
    
    for i, (word, _) in enumerate(counts):
        dictionary[word] = i
        reverse_dictionary[i] = word
    
    data = [[dictionary[w] for w in q] for q in padded_question_set]
    return dictionary, reverse_dictionary, data, counts

all_questions = list(padded_train_set)
all_questions.extend(padded_test_set)
dictionary, reverse_dictionary, dataset, counts = prepare_dataset(all_questions)
print('counts(top 5)', counts[:5])
print('Number of unique words in corpus are', len(counts))
print('Sample question(0) is', dataset[0])
print('Reversed sample question(0) is', [reverse_dictionary[i] for i in dataset[0]])
unique_labels = set(train_question_class)
unique_labels.update(test_question_class)
print('Unique labels are', unique_labels)

counts(top 5) [('PAD', 34407), ('?', 1454), ('the', 999), ('what', 963), ('is', 587)]
Number of unique words in corpus are 3349
Sample question(0) is [9, 15, 982, 983, 6, 23, 984, 985, 518, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Reversed sample question(0) is ['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']
Unique labels are {'ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM'}
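
To make the id assignment concrete, here is a small sketch of the same frequency-ordered scheme on a toy corpus. The two toy questions and the names `word_to_id` and `id_to_word` are made up for illustration.

```python
import collections

# Toy corpus of already padded, tokenized questions (illustrative data only).
toy_questions = [
    ['what', 'is', 'nlp', '?', 'PAD'],
    ['what', 'is', 'a', 'cnn', '?'],
]

# Count every token and order by frequency, most frequent first.
toy_counts = collections.Counter(
    token for q in toy_questions for token in q).most_common()

# Frequent words get small ids; the reverse map recovers the words.
word_to_id = {word: i for i, (word, _) in enumerate(toy_counts)}
id_to_word = {i: word for word, i in word_to_id.items()}

encoded = [[word_to_id[w] for w in q] for q in toy_questions]
decoded = [[id_to_word[i] for i in q] for q in encoded]
print(toy_counts[:3])
print(encoded)
```

Note that ``PAD`` receives id 0 in the notebook only because it is the most frequent token there; nothing in the scheme reserves that id.
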


---

We will now define a ``BatchGenerator`` that generates batches from the given data. Each batch consists of two values:

- an input of dimension (batch_size, max_sent_length, embedding_size). In our case the embedding size equals the vocabulary size, and each word is a one-hot encoded vector
- the labels, one-hot encoded vectors whose size equals the number of labels

In [7]:
import numpy as np

class BatchGenerator(object):
    
    def __init__(self, batch_size, dataset, labels, word2emb, unique_labels):
        #
        # batch_size: Batch size
        # dataset: is the prepared dataset from previous step
        # labels: class label of every question in dataset
        # word2emb: function that generates embedding from given word id
        # unique_labels: collection of all distinct class labels
        #
        self.dataset = dataset
        self.word2emb = word2emb
        self.current_idx = 0
        self.labels = labels
        self.batch_size = batch_size
        # Sort so that the label order does not depend on set iteration order
        self.unique_labels = sorted(unique_labels)
    
    
    def __shape_question__(self, padded_question):
        return [self.word2emb(w) for w in padded_question]

    def reset(self):
        self.current_idx = 0
    
    def generate_batch(self):
        batch = []
        labels = []
        for _ in range(self.batch_size):
            batch.append(self.__shape_question__(self.dataset[self.current_idx]))
            c_label = [0] * len(self.unique_labels)
            c_label[self.unique_labels.index(self.labels[self.current_idx])] = 1
            labels.append(c_label)
            self.current_idx += 1
            # Wrap around using this generator's own dataset, not the global one
            self.current_idx %= len(self.dataset)
            
        return np.array(batch), np.array(labels)
    

def to_one_hot_encoding(word):
    one_hot = [0] * len(dictionary)
    one_hot[word] = 1
    return one_hot


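# The combined dataset holds the training questions first, then the test questions
num_train = len(padded_train_set)
prepared_train_dataset = dataset[:num_train]
prepared_test_dataset = dataset[num_train:]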
the_one_hot = to_one_hot_encoding(dictionary['the'])
print('One hot for "the" has element number', the_one_hot.index(1), 'set')
batch_size = 16

train_batch_generator = BatchGenerator(
                            batch_size, 
                            prepared_train_dataset, 
                            train_question_class, 
                            to_one_hot_encoding, 
                            unique_labels)

input_batch, input_labels = train_batch_generator.generate_batch()
print('shape of input_batch is', input_batch.shape, ', input_labels has shape', input_labels.shape)
print('first label has value', input_labels[0], ' labels are', train_batch_generator.unique_labels)
print('Labels for first batch are', 
      np.argmax(input_labels, axis = 1), 'first 16 labels are', train_question_class[0:16])

One hot for "the" has element number 2 set
shape of input_batch is (16, 32, 3349) , input_labels has shape (16, 6)
first label has value [0 1 0 0 0 0]  labels are ['ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM']
Labels for first batch are [1 2 1 2 0 3 3 3 1 3 5 1 3 3 2 4] first 16 labels are ['DESC', 'ENTY', 'DESC', 'ENTY', 'ABBR', 'HUM', 'HUM', 'HUM', 'DESC', 'HUM', 'NUM', 'DESC', 'HUM', 'HUM', 'ENTY', 'LOC']
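
The same one-hot batch can also be built in a vectorized way. The sketch below is an alternative to the list-based generator above, not the notebook's implementation; it indexes an identity matrix with the word ids. The sizes and ids are toy values chosen for illustration.

```python
import numpy as np

vocab_size = 6     # assumed toy vocabulary size
num_classes = 3    # assumed toy number of labels

# Two padded questions of length 4, already mapped to word ids (id 0 = PAD).
question_ids = np.array([[3, 1, 5, 0],
                         [2, 4, 0, 0]])
label_ids = np.array([1, 2])

# Indexing an identity matrix with integer ids yields one-hot rows.
one_hot_batch = np.eye(vocab_size, dtype=np.float32)[question_ids]
one_hot_labels = np.eye(num_classes, dtype=np.float32)[label_ids]

print(one_hot_batch.shape)   # (batch_size, max_sent_length, vocab_size)
print(one_hot_labels.shape)  # (batch_size, num_classes)
```
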


---

Now that we have the necessary infrastructure in place, let us set up the CNN and train the network.

First we will define the input and label placeholders.

In [8]:
import tensorflow as tf

#Each sentence has max_len words, with PAD at the end to fill up the blanks. Each word is one hot encoded
#Thus each sentence is a max_len X embedding_size matrix
#We then have a batch of such matrices
embedding_size = len(dictionary) #Since we use one hot encoded words, the size is same as vocabulary size
num_labels = len(unique_labels)

tf.reset_default_graph()
batch_size = 32
x = tf.placeholder(shape = [batch_size, max_len, embedding_size], dtype = tf.float32, name = 'input')

#Output is one hot encoded for the label types
y = tf.placeholder(shape = [batch_size, num_labels], dtype = tf.float32, name = 'labels')

print('Shape of input and labels are', x.shape, 'and', y.shape, 'respectively')

Shape of input and labels are (32, 32, 3349) and (32, 6) respectively


---

We now define 1-D convolutions of different sizes over the input. These convolution filters are applied in parallel to the sentence input.

Next, an activation function is applied to the result of each convolution, followed by max pooling over the sentence, which gives one scalar value per filter. By default the input to the convolution layer must have the dimensions ``[batch_size, input_width, input_channels]``, since the default value of ``data_format`` is `NWC`.

In our case the number of input channels is the embedding size and the input width is the sentence length.

The 1-D convolution filter has the dimensions ``[filter_size, input_channels, output_channels]``. In our case ``output_channels`` is 1 and ``input_channels`` is the vocabulary size.
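
To make these shapes concrete, here is a NumPy sketch of a stride-1 convolution with `SAME` padding, followed by ReLU and max pooling over the sentence. It is a simplified stand-in for illustration, not TensorFlow's implementation; the helper `conv1d_same` and all sizes are assumptions.

```python
import numpy as np

def conv1d_same(x, kernel, bias):
    """Cross-correlate x [batch, width, in_ch] with kernel [f, in_ch, out_ch]."""
    f = kernel.shape[0]
    # 'SAME' padding with stride 1: pad so the output width equals the input width.
    pad_left = (f - 1) // 2
    pad_right = (f - 1) - pad_left
    padded = np.pad(x, ((0, 0), (pad_left, pad_right), (0, 0)), mode='constant')
    width = x.shape[1]
    # Each output position sees a window of f consecutive word vectors.
    windows = np.stack([padded[:, i:i + f, :] for i in range(width)], axis=1)
    # windows: [batch, width, f, in_ch]; contract over f and in_ch.
    return np.einsum('bwfc,fco->bwo', windows, kernel) + bias

rng = np.random.default_rng(0)
batch, width, in_ch = 2, 8, 5            # toy sizes, not the notebook's
x_toy = rng.normal(size=(batch, width, in_ch))

pooled = []
for f in (3, 5, 7):
    k = rng.normal(scale=0.02, size=(f, in_ch, 1))
    conv = conv1d_same(x_toy, k, bias=0.0)   # [batch, width, 1]
    act = np.maximum(conv, 0.0)              # ReLU
    pooled.append(act.max(axis=1))           # max over the sentence: [batch, 1]

features = np.concatenate(pooled, axis=1)    # [batch, num_filters]
print(features.shape)
```

With one output channel per filter size, each sentence is reduced to a single number per filter, so three filter sizes give a feature vector of length 3.
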


In [9]:
filter_sizes = [3, 5, 7] 

kernels = [tf.Variable(
                tf.truncated_normal(shape = [f, embedding_size, 1], stddev = 0.02, dtype = tf.float32), 
                name = 'W' + str(i)) for i, f in enumerate(filter_sizes)]

bias = [tf.Variable(
                tf.random_normal(shape = [1], mean = 0, stddev = 0.01, dtype = tf.float32), 
                name = 'b' + str(i)) for i, f in enumerate(filter_sizes)]


conv_layer = [ tf.nn.conv1d(x, k, stride = 1, padding = 'SAME') + b for k, b in zip(kernels, bias)]

activations = [tf.nn.relu(c) for c in conv_layer]

pooling = [tf.reduce_max(a, axis = 1) for a in activations]

print('Convolutions produce tensor of dimension', conv_layer[0].shape)
print('Max Pooling produce tensor of dimension', pooling[0].shape)

Convolutions produce tensor of dimension (32, 32, 1)
Max Pooling produce tensor of dimension (32, 1)



---

Now we will define a dense layer of size ``[num_filters, num_labels]`` to compute the logits.


In [10]:
conv_stack = tf.concat(pooling, axis=1)

W_dense = tf.Variable(
            tf.truncated_normal(shape = [len(filter_sizes), num_labels], stddev = 0.5, dtype = tf.float32),
            name = 'W_dense')

b_dense = tf.Variable(
            tf.random_normal(shape = [num_labels], mean = 0, stddev = 0.01, dtype = tf.float32),
            name = 'b_dense')

logits = tf.matmul(conv_stack, W_dense) + b_dense

---

Define the loss and optimizer


In [11]:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = y))
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)

predictions = tf.argmax(tf.nn.softmax(logits),axis=1)
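
For reference, the loss above is the mean softmax cross-entropy over the batch. The NumPy sketch below computes the same quantity under the assumption of one-hot labels; the function name and toy inputs are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Only the log-probability of the true class contributes for one-hot labels.
    return -(one_hot_labels * log_probs).sum(axis=1).mean()

toy_logits = np.zeros((4, 6))            # a model that prefers no class
toy_labels = np.eye(6)[[0, 1, 2, 3]]
print(softmax_cross_entropy(toy_logits, toy_labels))  # ln(6), about 1.79
```

A model that assigns equal probability to all six classes has a loss of ln 6, roughly 1.79, which is a useful baseline when reading the training log further down. Since softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predictions.
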


Start the training


In [12]:
train_batch_generator = BatchGenerator(
                            batch_size, 
                            prepared_train_dataset, 
                            train_question_class, 
                            to_one_hot_encoding, 
                            unique_labels)

test_batch_generator = BatchGenerator(
                            batch_size, 
                            prepared_test_dataset, 
                            test_question_class, 
                            to_one_hot_encoding, 
                            unique_labels)


num_epochs = 50


def accuracy(labels, preds):
    return np.sum(np.argmax(labels,axis=1)==preds)/labels.shape[0]

with tf.Session() as session:
    tf.global_variables_initializer().run()
    
    
    for e in range(num_epochs):
        total_loss = []
        train_accuracy = []
        train_batch_generator.reset()
        test_batch_generator.reset()
        for _ in range((len(prepared_train_dataset)//batch_size)-1):
            input_batch, input_labels = train_batch_generator.generate_batch()
            feed_dict = {x: input_batch, y: input_labels}
            l, _ , pred= session.run([loss, optimizer, predictions], feed_dict)
            train_accuracy.append(accuracy(input_labels, pred))
            total_loss.append(l)
            
        print('Epoch', e, ', mean training loss', np.mean(total_loss))
        
        test_accuracy = []
        for _ in range((len(prepared_test_dataset)//batch_size)-1):
            input_batch, input_labels = test_batch_generator.generate_batch()
            l, pred = session.run([logits, predictions], feed_dict = {x: input_batch})
            test_accuracy.append(accuracy(input_labels, pred))
            #print(l[0], pred[0], input_labels[0])
            
        print('Epoch', e, ', mean (train, test) accuracy', 
              (np.mean(train_accuracy) * 100, np.mean(test_accuracy) * 100))
            

Epoch 0 , mean training loss 1.76074
Epoch 0 , mean (train, test) accuracy (21.25, 27.901785714285715)
Epoch 1 , mean training loss 1.70561
Epoch 1 , mean (train, test) accuracy (21.979166666666668, 16.741071428571427)
Epoch 2 , mean training loss 1.68196
Epoch 2 , mean (train, test) accuracy (21.5625, 16.741071428571427)
Epoch 3 , mean training loss 1.67022
Epoch 3 , mean (train, test) accuracy (24.895833333333332, 16.741071428571427)
Epoch 4 , mean training loss 1.66354
Epoch 4 , mean (train, test) accuracy (24.166666666666668, 16.741071428571427)
Epoch 5 , mean training loss 1.65905
Epoch 5 , mean (train, test) accuracy (24.270833333333332, 16.741071428571427)
Epoch 6 , mean training loss 1.65434
Epoch 6 , mean (train, test) accuracy (25.520833333333332, 16.741071428571427)
Epoch 7 , mean training loss 1.64804
Epoch 7 , mean (train, test) accuracy (28.125, 16.964285714285715)
Epoch 8 , mean training loss 1.64093
Epoch 8 , mean (train, test) accuracy (29.6875, 17.1875)
Epoch 9 , mean

In [13]:
#test_batch_generator.reset()
#x, y = test_batch_generator.generate_batch()
#def mat_2_sent(mat):
#    words = [reverse_dictionary[i] for i in np.argmax(mat, axis = 1)]
#    return " ".join(words)

#print([mat_2_sent(x[i,:,:]) for i in range(x.shape[0])])
#labs = list()
#print([unique_labels[i] for i in np.argmax(y, axis = 1)])
#print([reverse_dictionary[i] for i in dataset[1000]])

---

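Looks like we are overfitting: after epoch 8 the training accuracy has crept up to about 30% while the test accuracy stays near 17%. The accuracy isn't great overall.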

Next steps (TODO)

- Read the paper on CNNs for sentence classification and implement the model as described there. The paper is available [here](https://arxiv.org/pdf/1408.5882.pdf). Using pretrained word vectors rather than one-hot encoded words may give better results
- Go through [this](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/) post and re-implement the model
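
As a rough sketch of the word-vector idea, the snippet below shows that looking up rows of an embedding matrix by word id gives the same result as multiplying one-hot vectors by that matrix, while avoiding the vocabulary-sized input. The random `embedding_matrix` is only a stand-in for real pretrained vectors, and all sizes are toy values.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3     # toy sizes chosen for illustration
rng = np.random.default_rng(0)

# Stand-in for a pretrained embedding matrix; real vectors would be loaded from disk.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

question_ids = np.array([[3, 1, 5, 0]])        # one padded question

# Dense lookup: pick one row per word id.
dense = embedding_matrix[question_ids]         # [1, 4, embedding_dim]

# Equivalent one-hot route: multiply one-hot rows by the embedding matrix.
one_hot = np.eye(vocab_size)[question_ids]     # [1, 4, vocab_size]
via_one_hot = one_hot @ embedding_matrix

print(dense.shape, np.allclose(dense, via_one_hot))
```
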

