# Text Convolutional Neural Network with TF-Slim
*by Marvin Bertin*
<img src="../images/tensorflow.png" width="400">

**Convolutional Neural Networks (CNNs)** are typically used in computer vision, where they have been responsible for major breakthroughs in image classification. However, CNNs can also be used to solve problems in **Natural Language Processing**.

**Text data**

Text data, as opposed to static images, is sequential in nature and therefore has temporal dependencies, whereas images have spatial dependencies.

**Recurrent Neural Networks**

Traditionally, text data has been modeled with **Recurrent Neural Networks (RNNs)**, where connections between units form a directed cycle. This gives the network an **internal state**, which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process **arbitrary sequences of inputs**.

<img src="../images/RNN.png" width="600">
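
As a rough illustration of this internal state, the sketch below runs a single-unit recurrent cell over a sequence in plain Python. The weights `w_x`, `w_h`, and `b` are arbitrary values chosen for the example, not trained parameters, and the function name `run_rnn` is hypothetical.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit tanh RNN over a sequence and return every hidden state."""
    h = 0.0  # initial internal state
    states = []
    for x in inputs:
        # the new state depends on the current input and on the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# the same cell handles sequences of any length
short = run_rnn([1.0, 0.0])
longer = run_rnn([1.0, 0.0, 0.0, 0.0])
print(len(short), len(longer))
```

Because each step needs the state from the previous step, the loop over the sequence cannot be run in parallel across time steps.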

The problem with RNNs is that they are computationally heavy compared to CNNs and therefore slow to train on high-dimensional data like text. CNNs, on the other hand, are fast and highly optimized on GPUs.

## Redefining Sequential Data to Fit CNNs
In this notebook, we will see how to recast a text classification task so that the input looks like an image, which lets us apply a CNN while still achieving good classification performance.

**Image vector representation**

- each pixel is a different feature/dimension
- pixels are represented as floating-point numbers, typically centered at zero
- 3D tensor: (vertical number of pixels) x (horizontal number of pixels) x (number of channels, 3 for color images)

**Text vector representation**

- text is tokenized at the word or character level
- tokens are represented as vector embeddings of floating point numbers (or one-hot vectors)
- 3D tensor: (text sequence length) x (token embedding size) x (1 channel, similar to grayscale images)

**Image 2D convolutions**

- filters (kernels) slide both horizontally and vertically over the image, producing feature maps

**Text 1D convolutions**

- filters (kernels) only slide vertically, along the text sequence dimension.
- filters always have a horizontal dimension equal to the token embedding length.
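
To make the sliding-window idea concrete, here is a minimal pure-Python sketch of a single text filter followed by max-over-time pooling. The toy embeddings and filter weights are made up for illustration, the helper names `conv1d_text` and `max_over_time` are hypothetical, and bias and activation are omitted.

```python
def conv1d_text(text, kernel):
    """Slide `kernel` down `text` (VALID padding, stride 1).

    text:   list of seq_length rows, each an embedding of size embedding_size
    kernel: list of filter_size rows, each of size embedding_size
    Returns a list of seq_length - filter_size + 1 scalar activations.
    """
    seq_length, filter_size = len(text), len(kernel)
    outputs = []
    for start in range(seq_length - filter_size + 1):
        window = text[start:start + filter_size]
        # the filter spans the full embedding width, so each window yields one number
        total = sum(
            w * x
            for k_row, t_row in zip(kernel, window)
            for w, x in zip(k_row, t_row)
        )
        outputs.append(total)
    return outputs

def max_over_time(feature_map):
    """Keep only the strongest response along the sequence dimension."""
    return max(feature_map)

# 5 tokens with 2-dimensional embeddings, and one filter covering 3 tokens
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

feature_map = conv1d_text(text, kernel)
print(feature_map)                 # one activation per window position
print(max_over_time(feature_map))  # a single value for this filter
```

Since the filter is as wide as the embedding, there is nowhere to slide horizontally, which is why the operation is effectively one-dimensional.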


## Text CNN Model
<img src="../images/text-cnn.png" width="400">

**Want to learn more?**
- [Understanding Convolutional Neural Networks for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)
- [Deep Learning, NLP, and Representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
- [Visualizing Representations: Deep Learning and Human Beings](http://colah.github.io/posts/2015-01-Visualizing-Representations/)

## Import TensorFlow Slim

In [1]:
import sys  
sys.path.append("../") 

import tensorflow as tf
slim = tf.contrib.slim

%load_ext autoreload
%autoreload 2

## Text CNN Model
```
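embedding_size = 20   # width of each token embedding
num_filters = 10      # number of filters per branch
seq_length = 100      # number of tokens per document

# this excerpt assumes `inputs`, `is_training` and `self.output_dim`
# are defined by the surrounding model class
with slim.arg_scope([slim.conv2d, slim.max_pool2d]):
    branches = []
    for i, filter_size in enumerate([3, 4, 5]):
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            # convolution: each filter covers filter_size tokens and the full embedding width
            net = slim.conv2d(inputs, num_filters, [filter_size, embedding_size],
                              stride=[1, 1], padding="VALID", scope='1D-conv_%d' % (i + 1))
            # max-pool over every remaining sequence position, leaving one value per filter
            net = slim.max_pool2d(net, [seq_length - filter_size + 1, 1], stride=[1, 1],
                                  scope="1D-pool-%d" % (i + 1))
            # append branch to stack
            branches.append(net)

    # concatenate the 3 branches along the channel axis
    # (argument order for TensorFlow >= 1.0; earlier releases used tf.concat(3, branches))
    net = tf.concat(branches, 3)

    # dropout
    net = slim.dropout(net, 0.5, is_training=is_training, scope='dropout4')

    # fully connected layer, implemented as a 1x1 convolution
    net = slim.conv2d(net, self.output_dim, [1, 1],
                      activation_fn=None,
                      normalizer_fn=None,
                      scope='prediction')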
```
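
The layer shapes this block produces follow from simple arithmetic: with VALID padding and stride 1, a window of a given size yields `input size - window size + 1` positions. The sketch below recomputes those shapes in plain Python, without TensorFlow. The helper `branch_shapes` is a hypothetical name used only for this illustration.

```python
def branch_shapes(seq_length, embedding_size, num_filters, filter_size):
    """Return (conv_shape, pool_shape) for one branch, without the batch axis."""
    # VALID padding, stride 1: output size = input size - window size + 1
    conv_height = seq_length - filter_size + 1
    conv_width = embedding_size - embedding_size + 1  # filter spans the full width
    conv_shape = (conv_height, conv_width, num_filters)

    # the pooling window covers every remaining position of the sequence
    pool_height = conv_height - conv_height + 1
    pool_shape = (pool_height, conv_width, num_filters)
    return conv_shape, pool_shape

seq_length, embedding_size, num_filters = 100, 20, 10
filter_sizes = [3, 4, 5]

shapes = {fs: branch_shapes(seq_length, embedding_size, num_filters, fs)
          for fs in filter_sizes}
for fs in filter_sizes:
    print(fs, shapes[fs])

# concatenating the pooled branches along the channel axis stacks their filters
concat_shape = (1, 1, num_filters * len(filter_sizes))
print(concat_shape)
```

The per-branch values agree with the layer shapes printed by `examine_model_structure()` in the next section; the concatenated shape is derived here and is not part of that printout.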

## Load Text CNN Model

In [4]:
from utils.slim_models import CNNClassifier
# [in_height, in_width, in_channels] (seq_length, embedding_size, channel)
text_tensor_shape = (100, 20, 1)
num_class = 15

CNN_model = CNNClassifier("dbpedia", text_tensor_shape , num_class)
CNN_model.examine_model_structure()

Layers
name = CNN_dbpedia_text_classifier/conv-maxpool-3/1D-conv_1/Relu:0 shape = (?, 98, 1, 10)
name = CNN_dbpedia_text_classifier/conv-maxpool-3/1D-pool-1/MaxPool:0 shape = (?, 1, 1, 10)
name = CNN_dbpedia_text_classifier/conv-maxpool-4/1D-conv_2/Relu:0 shape = (?, 97, 1, 10)
name = CNN_dbpedia_text_classifier/conv-maxpool-4/1D-pool-2/MaxPool:0 shape = (?, 1, 1, 10)
name = CNN_dbpedia_text_classifier/conv-maxpool-5/1D-conv_3/Relu:0 shape = (?, 96, 1, 10)
name = CNN_dbpedia_text_classifier/conv-maxpool-5/1D-pool-3/MaxPool:0 shape = (?, 1, 1, 10)
name = CNN_dbpedia_text_classifier/prediction/squeezed:0       shape = (?, 15)


Parameters
name = CNN_dbpedia_text_classifier/1D-conv_1/weights:0         shape = (3, 20, 1, 10)
name = CNN_dbpedia_text_classifier/1D-conv_1/biases:0          shape = (10,)
name = CNN_dbpedia_text_classifier/1D-conv_2/weights:0         shape = (4, 20, 1, 10)
name = CNN_dbpedia_text_classifier/1D-conv_2/biases:0          shape = (10,)
name = CNN_dbpedia_text_class

## Helper Functions for Text Data

Text data cannot be fed directly into the neural network.
A number of preprocessing steps need to happen first.

- Tokenize the text into words and remove words with very low frequency
- Transform the training and test data into sequences of token ids
- Truncate every sequence to a fixed length, or pad shorter sequences with zeros
- Generate a trainable and regularized vector embedding for each token in your vocabulary
- Load the sample sequences in batches by using an embedding lookup matrix
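
The first three steps can be sketched in plain Python as follows. This is only an illustration of the idea, not how `VocabularyProcessor` is implemented: the whitespace tokenizer, the helper names, and the choice to let id 0 stand for both padding and out-of-vocabulary tokens are assumptions of the sketch, and the exact frequency threshold used by the library may differ.

```python
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer; real tokenizers also handle punctuation."""
    return text.lower().split()

def build_vocab(documents, min_count=2):
    """Map each sufficiently frequent token to an id; id 0 is reserved."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(kept, start=1)}

def to_ids(text, vocab, max_length):
    """Convert text to exactly max_length ids (truncate, or right-pad with 0)."""
    ids = [vocab.get(tok, 0) for tok in tokenize(text)][:max_length]
    return ids + [0] * (max_length - len(ids))

docs = ["the cat sat", "the dog sat down", "a cat"]
vocab = build_vocab(docs, min_count=2)
print(vocab)

# a long document is truncated, a short one is padded
print(to_ids("the cat ran away quickly", vocab, max_length=4))
print(to_ids("dog", vocab, max_length=4))
```

Fitting the vocabulary on the training documents only, and then reusing it to transform the test documents, keeps test-set words from leaking into the vocabulary.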

## Process vocabulary

In [5]:
import numpy as np

def preprocess_vocabulary(x_train, x_test, max_document_length = 100):

    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(
        max_document_length, min_frequency=2)
    x_train = np.array(list(vocab_processor.fit_transform(x_train)))
    x_test = np.array(list(vocab_processor.transform(x_test)))
    n_words = len(vocab_processor.vocabulary_)
    print('Total words: %d' % n_words)
    return x_train, x_test 

## Load and Transform Text Data

In [6]:
from utils.datasets import text_datasets
from collections import defaultdict
import pandas as pd

def get_dbpedia_dataset(size='small'):
    data = text_datasets.load_dbpedia(size=size)
    
    x_train = pd.DataFrame(data.train.data)[1]
    y_train = pd.Series(data.train.target)
    x_test = pd.DataFrame(data.test.data)[1]
    y_test = pd.Series(data.test.target)
    
    x_train, x_test = preprocess_vocabulary(x_train, x_test, max_document_length = 100)
    
    dataset = defaultdict(dict)
    dataset['train']['X'] = x_train
    dataset['train']['y'] = y_train
    dataset['test']['X'] = x_test
    dataset['test']['y'] = y_test
    return dataset

## Build an Embedding Lookup Matrix

In [7]:
def get_batch_inputs(data, data_type, n_words, weight_decay=0.005):
    # embed every token id: (n_samples, seq_length) -> (n_samples, seq_length, 20)
    word_vectors = tf.contrib.layers.embed_sequence(
        data[data_type]['X'], vocab_size=n_words, embed_dim=20,
        initializer=tf.truncated_normal_initializer(),
        regularizer=slim.l2_regularizer(weight_decay),
        trainable=True)

    # sample 32 documents; the indices must range over the samples, not the vocabulary
    n_samples = len(data[data_type]['X'])
    batch_indices = np.random.choice(n_samples, 32, replace=False)
    inputs = tf.nn.embedding_lookup(word_vectors, batch_indices)
    return inputs

In [8]:
# load and transform data
dataset = get_dbpedia_dataset()

Total words: 1457


In [9]:
# Text representation in terms of token ids
dataset['train']['X'][10]

array([   0,    0, 1044,    4,    6,   66,   37,    0, 1280, 1321,    5,
       1265, 1321,  941,  188, 1280,    5, 1265,    0,    0,  220,    0,
         42,  921,  298,    9,  852,    5,    0,   14,    0,    5,    0,
          0,    0,    0,    8,  226,    2,    0,    2,  278,    0,    0,
         12,    4,    1,  190,   81,   64,  386,    5,  909, 1354,    3,
       1265,    5, 1280, 1321,    0,  127,   22,   32,    0,  908,    2,
        155,  304,  359,  310,  607,    5,    1,  832,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0])

## Generate a Sample Batch Input for Text CNN

In [10]:
inputs = get_batch_inputs(dataset, 'train', 1457)

# Shape of input tensor into text CNN
# (batch size, sequence length, word embedding size)
inputs.get_shape()

TensorShape([Dimension(32), Dimension(100), Dimension(20)])
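
The batch above is a 3D tensor of shape (32, 100, 20), whereas the model was built with `text_tensor_shape = (100, 20, 1)`, so a trailing channel axis presumably still has to be added before the batch reaches the network. In TensorFlow, `tf.expand_dims(inputs, -1)` is one way to do that; how `CNNClassifier` handles it internally is not shown in this notebook. The pure-Python sketch below mimics the lookup and the extra axis on toy data, with nested lists standing in for tensors. All helper names and sizes are hypothetical.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Random table with one row per token id (toy stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

def embed_batch(batch_ids, table):
    """Look up every token id, then wrap each value in a length-1 channel axis."""
    embedded = [[table[token_id] for token_id in doc] for doc in batch_ids]
    return [[[[value] for value in row] for row in doc] for doc in embedded]

def shape(nested):
    """Shape of a regular nested list, e.g. (batch, seq_length, embed_dim, 1)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

table = make_embedding_table(vocab_size=6, embed_dim=4)
batch_ids = [[3, 1, 0, 0, 0],   # id 0 marks padding in this toy example
             [5, 2, 4, 1, 0]]

batch = embed_batch(batch_ids, table)
print(shape(batch))  # (batch size, sequence length, embedding size, channels)
```

With the real data, the same two steps would turn a (32, 100) batch of token ids into a (32, 100, 20, 1) tensor.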

<img src="../images/divider.png" width="100">