# Text Classification and Data Sets

Text classification is a common task in natural language processing: it maps a text sequence of arbitrary length to a category.

## Outline

- Text Sentiment Classification Data
- Classification using a Bag of Context Free Embeddings
- Classification using Convolutional Neural Networks (textCNN)
- Classification using Recurrent Neural Networks
- Classification using Recurrent Neural Networks with Self-Attention

## Text Sentiment Classification Data

We use Stanford's Large Movie Review Dataset as the data set for text sentiment classification.
- It contains a training set and a test set, each with 25,000 movie reviews downloaded from IMDb.
- In each set, the numbers of reviews labeled "positive" and "negative" are equal.

In [1]:
from utils import load_data_imdb

batch_size = 64

train_iter, test_iter, vocab = load_data_imdb(batch_size)
print(vocab)

Downloading ../data/aclImdb_v1.tar.gz from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz...
Vocab(size=49342, unk="<unk>", reserved="['<pad>', '<bos>', '<eos>']")
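
The internals of `load_data_imdb` are not shown here. As a rough sketch of the kind of preprocessing such a loader typically performs, the following maps tokens to indices with an unknown-word fallback and pads or truncates each review to a fixed length. The helper names, the toy vocabulary, and the length of 8 are all assumptions made for illustration.

```python
# Hypothetical helpers; not the actual implementation in utils.load_data_imdb.
def build_vocab(tokenized_reviews, reserved=('<unk>', '<pad>')):
    """Assign an integer index to every distinct token, after reserved ones."""
    idx_to_token = list(reserved)
    for tokens in tokenized_reviews:
        for tok in tokens:
            if tok not in idx_to_token:
                idx_to_token.append(tok)
    return {tok: i for i, tok in enumerate(idx_to_token)}


def encode(tokens, token_to_idx, max_len):
    """Map tokens to indices, then truncate or pad to exactly max_len."""
    unk, pad = token_to_idx['<unk>'], token_to_idx['<pad>']
    ids = [token_to_idx.get(tok, unk) for tok in tokens][:max_len]
    return ids + [pad] * (max_len - len(ids))


reviews = [['this', 'movie', 'is', 'great'], ['this', 'movie', 'is', 'bad']]
token_to_idx = build_vocab(reviews)
# 'film' is not in the toy vocabulary, so it maps to the <unk> index 0
encoded = encode('this film is great'.split(), token_to_idx, max_len=8)
print(encoded)  # [2, 0, 4, 5, 1, 1, 1, 1]
```

Fixed-length index sequences are what allow reviews of different lengths to be stacked into one mini-batch.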


In [2]:
import gluonnlp as nlp
import mxnet as mx
import d2l
from mxnet import gluon, nd
from mxnet.gluon import nn, rnn

# Using a Bag of Context Free Embeddings

In [3]:
class ContinuousBagOfWords(nn.HybridBlock):
    def __init__(self, vocab_size, embed_size, **kwargs):
        super(ContinuousBagOfWords, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.encoder = nn.HybridLambda(lambda F, x: F.mean(x, axis=1))
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The shape of inputs is (batch size, number of words).
        embeddings = self.embedding(inputs)
        encoding = self.encoder(embeddings)
        outputs = self.decoder(encoding)
        return outputs
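
To see why this model is called a "bag" of embeddings, here is a minimal NumPy sketch of the same encoder for a single sentence. The toy embedding table and token indices are made up for illustration. Because the encoder only averages the word vectors, shuffling the words leaves the encoding unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4
embedding = rng.normal(size=(vocab_size, embed_size))  # toy embedding table


def bag_of_embeddings(token_ids):
    """Average the embedding vectors of the tokens in one sentence."""
    return embedding[token_ids].mean(axis=0)


sentence = np.array([3, 1, 4, 1, 5])
shuffled = np.array([5, 1, 1, 4, 3])  # same tokens, different order

enc_a = bag_of_embeddings(sentence)
enc_b = bag_of_embeddings(shuffled)
print(enc_a.shape)                # (4,)
print(np.allclose(enc_a, enc_b))  # True: word order is ignored
```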

In [4]:
embed_size, ctx = 100, [mx.gpu(0)]
net = ContinuousBagOfWords(len(vocab), embed_size)
net.hybridize()
net.initialize(mx.init.Xavier(), ctx=ctx)

Next we load the pre-trained word vectors,

In [5]:
glove_embedding = nlp.embedding.create('glove', source='glove.6B.100d')
idx_to_vec = glove_embedding[vocab.idx_to_token]
idx_to_vec.shape

(49342, 100)

and use these word vectors as feature vectors for each word in the reviews. 

In [6]:
net.embedding.weight.set_data(idx_to_vec)
net.embedding.collect_params().setattr('grad_req', 'null')

### Train and Evaluate the Model



In [7]:
from utils import train

lr, num_epochs = 0.01, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train(net, train_iter, test_iter, loss, trainer, num_epochs, ctx)

loss 0.679, train acc 0.586
loss 0.665, train acc 0.623
loss 0.654, train acc 0.642
loss 0.646, train acc 0.652
loss 0.638, train acc 0.661
loss 0.603, train acc 0.706
loss 0.597, train acc 0.712
loss 0.592, train acc 0.714
loss 0.590, train acc 0.717
loss 0.587, train acc 0.719
loss 0.568, train acc 0.731
loss 0.565, train acc 0.731
loss 0.565, train acc 0.733
loss 0.564, train acc 0.735
loss 0.562, train acc 0.736
loss 0.554, train acc 0.745
loss 0.554, train acc 0.745
loss 0.550, train acc 0.746
loss 0.549, train acc 0.745
loss 0.547, train acc 0.746
loss 0.536, train acc 0.756
loss 0.537, train acc 0.751
loss 0.539, train acc 0.750
loss 0.538, train acc 0.750
loss 0.537, train acc 0.753
loss 0.537, train acc 0.752, test acc 0.750
32206.6 exampes/sec on [gpu(0)]


We define a prediction function:

In [8]:
def predict_sentiment(net, vocab, sentence):
    sentence = nd.array(vocab[sentence.split()], ctx=ctx[0])
    label = nd.argmax(net(sentence.reshape((1, -1))), axis=1)
    return 'positive' if label.asscalar() == 1 else 'negative'

In [9]:
predict_sentiment(net, vocab, 'this movie is so great')

'positive'

In [10]:
predict_sentiment(net, vocab, 'this movie is so bad')

'negative'

# Using Convolutional Neural Networks (textCNN)

Idea: treat text as a one-dimensional "image".

Then we can use one-dimensional convolutional neural networks to capture associations between adjacent words.

This section describes a groundbreaking approach to applying
convolutional neural networks to text analysis: textCNN (Kim, 2014).

## One-dimensional Convolutional Layer

![One-dimensional cross-correlation operation. The shaded parts are the first output element as well as the input and kernel array elements used in its calculation: $0\times1+1\times2=2$. ](../img/conv1d.svg)

Before introducing the model, let us explain how a one-dimensional convolutional layer works. Like a two-dimensional convolutional layer, a one-dimensional convolutional layer uses a cross-correlation operation, here in one dimension. The convolution window starts at the leftmost side of the input array and slides across it from left to right. At each position, the input subarray in the window and the kernel array are multiplied elementwise and summed to give the element at the corresponding location in the output array. In the figure above, the input is a one-dimensional array with a width of 7 and the kernel array has a width of 2. The output width is therefore $7-2+1=6$, and the first output element is obtained by multiplying the leftmost input subarray of width 2 elementwise with the kernel array and summing the results.
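
The operation described above can also be written as a formula. The zero-based indexing and the symbols $n$ (input width) and $w$ (kernel width) are notation chosen here for illustration:

$$Y_i = \sum_{j=0}^{w-1} X_{i+j}\, K_j, \qquad i = 0, 1, \ldots, n - w.$$

There are $n-w+1$ output elements. With $X = (0, 1, \ldots, 6)$ and $K = (1, 2)$, this gives $Y_0 = 0\times1+1\times2=2$ and $Y_1 = 1\times1+2\times2=5$.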


Next, we implement one-dimensional cross-correlation in the `corr1d` function. It accepts the input array `X` and kernel array `K` and outputs the array `Y`.

In [11]:
def corr1d(X, K):
    w = K.shape[0]
    Y = nd.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y

Let's reproduce the results of the one-dimensional cross-correlation operation shown in the figure above.

In [12]:
X, K = nd.array([0, 1, 2, 3, 4, 5, 6]), nd.array([1, 2])
corr1d(X, K)


[ 2.  5.  8. 11. 14. 17.]
<NDArray 6 @cpu(0)>
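
As a sanity check, the same result can be reproduced with NumPy. This is only a sketch: NumPy arrays stand in for the `nd` arrays used in this notebook, and `corr1d_np` is an illustrative name. In `'valid'` mode, `np.correlate` computes the same sliding dot product.

```python
import numpy as np


def corr1d_np(X, K):
    """One-dimensional cross-correlation via explicit sliding windows."""
    w = K.shape[0]
    return np.array([(X[i:i + w] * K).sum()
                     for i in range(X.shape[0] - w + 1)])


X = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
K = np.array([1, 2], dtype=float)

Y = corr1d_np(X, K)
# np.correlate in 'valid' mode slides K over X without flipping it
Y_ref = np.correlate(X, K, mode='valid')
print(Y)  # [ 2.  5.  8. 11. 14. 17.]
```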

## One-dimensional Convolutional Layer with Multiple Input Channels


![One-dimensional cross-correlation operation with three input channels. The shaded parts are the first output element as well as the input and kernel array elements used in its calculation: $0\times1+1\times2+1\times3+2\times4+2\times(-1)+3\times(-3)=2$. ](../img/conv1d-channel.svg)

The one-dimensional cross-correlation operation for multiple input channels is similar to its two-dimensional counterpart. On each channel, it performs the one-dimensional cross-correlation operation on the kernel and its corresponding input, then adds the results across channels to get the output. The figure above shows a one-dimensional cross-correlation operation with three input channels.

Now, we reproduce the results of the one-dimensional cross-correlation operation with multiple input channels.

In [13]:
def corr1d_multi_in(X, K):
    return nd.add_n(*[corr1d(x, k) for x, k in zip(X, K)])

In [14]:
X = nd.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = nd.array([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)


[ 2.  8. 14. 20. 26. 32.]
<NDArray 6 @cpu(0)>

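This is equivalent to a two-dimensional cross-correlation with a single input channel in which the kernel height equals the input height, so the kernel can only slide along the width.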

![Two-dimensional cross-correlation operation with a single input channel. The highlighted parts are the first output element and the input and kernel array elements used in its calculation: $2\times(-1)+3\times(-3)+1\times3+2\times4+0\times1+1\times2=2$. ](../img/conv1d-2d.svg)

We can obtain multiple output channels by applying the cross-correlation multiple times with different kernels.
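
Both statements can be checked numerically. The NumPy sketch below uses illustrative function names, with NumPy standing in for `nd`. It compares the multi-input-channel one-dimensional operation with a full-height two-dimensional one, then stacks two kernels to produce two output channels. The second kernel, `K + 1`, is an arbitrary choice.

```python
import numpy as np


def corr1d_multi_in(X, K):
    """Sum the per-channel 1D cross-correlations. X: (c_in, n), K: (c_in, w)."""
    return sum(np.correlate(x, k, mode='valid') for x, k in zip(X, K))


def corr2d_full_height(X, K):
    """2D cross-correlation where the kernel height equals the input height,
    so the window can only slide along the width."""
    w = K.shape[1]
    return np.array([(X[:, i:i + w] * K).sum()
                     for i in range(X.shape[1] - w + 1)])


def corr1d_multi_in_out(X, Ks):
    """Apply one (c_in, w) kernel per output channel. Ks: (c_out, c_in, w)."""
    return np.stack([corr1d_multi_in(X, K) for K in Ks])


X = np.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]], dtype=float)
K = np.array([[1, 2], [3, 4], [-1, -3]], dtype=float)

Y_1d = corr1d_multi_in(X, K)
Y_2d = corr2d_full_height(X, K)
Y_multi = corr1d_multi_in_out(X, np.stack([K, K + 1]))  # two output channels
print(Y_1d)                     # [ 2.  8. 14. 20. 26. 32.]
print(np.allclose(Y_1d, Y_2d))  # True
print(Y_multi.shape)            # (2, 6)
```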

## Max-Over-Time Pooling Layer

Similarly, we have a one-dimensional pooling layer. The max-over-time pooling layer used in TextCNN corresponds to a one-dimensional global maximum pooling layer. Assuming that the input contains multiple channels, and each channel consists of values at different time steps, the output of each channel is the largest value over all time steps in that channel. Therefore, the input of the max-over-time pooling layer can have a different number of time steps on each channel.

To improve computing performance, we often combine sequence examples of different lengths into a mini-batch and make all examples in the batch the same length by appending special characters (such as 0) to the end of shorter examples. Naturally, the added special characters have no intrinsic meaning. Because the main purpose of the max-over-time pooling layer is to capture the most important features in the sequence, it usually allows the model to be unaffected by the manually added characters.
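
A small NumPy sketch of max-over-time pooling follows, with made-up feature values. The second half is a simplification: it pads a feature channel directly with zeros. In the actual model, padding is applied to the input tokens, and the features computed over padded positions are not guaranteed to be zero.

```python
import numpy as np


def max_over_time(channels):
    """Return the maximum of each channel; channels may differ in length."""
    return np.array([np.max(c) for c in channels])


# Two feature channels with different numbers of time steps
channels = [np.array([0.2, 1.5, 0.7, 0.1]),
            np.array([0.9, 0.3])]
pooled = max_over_time(channels)
print(pooled)  # [1.5 0.9]

# Appending zeros leaves the maximum of a non-negative channel (for example,
# one produced by a ReLU activation) unchanged.
padded = np.concatenate([channels[1], np.zeros(2)])
print(padded.max() == channels[1].max())  # True
```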

## The TextCNN Model

![TextCNN design. ](../img/textcnn.svg)

Kim, Yoon. "Convolutional neural networks for sentence classification." EMNLP 2014.

TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling layer. Suppose the input text sequence consists of $n$ words, and each word is represented by a $d$-dimensional word vector. Then the input example has a width of $n$, a height of 1, and $d$ input channels. The calculation of textCNN can be divided into the following main steps:

1. Define multiple one-dimensional convolution kernels and use them to perform convolution calculations on the inputs. Convolution kernels with different widths may capture the correlation of different numbers of adjacent words.
2. Perform max-over-time pooling on all output channels, and then concatenate the pooling output values of these channels in a vector.
3. The concatenated vector is transformed into the output for each category through the fully connected layer. A dropout layer can be used in this step to deal with overfitting.

The figure above gives an example to illustrate textCNN. The input here is a sentence with 11 words, with each word represented by a 6-dimensional word vector. Therefore, the input sequence has a width of 11 and 6 input channels. We assume there are two one-dimensional convolution kernels with widths of 2 and 4, and with 4 and 5 output channels, respectively. Therefore, after the one-dimensional convolution calculation, the width of the four output channels is $11-2+1=10$, while the width of the other five channels is $11-4+1=8$. Even though the channels have different widths, we can still perform max-over-time pooling on each channel and concatenate the pooling outputs of the 9 channels into a 9-dimensional vector. Finally, we use a fully connected layer to transform the 9-dimensional vector into a 2-dimensional output: positive sentiment and negative sentiment predictions.
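
The shape bookkeeping in this example can be traced with a short NumPy sketch. All weights are random placeholders, and the helper `conv1d` is an illustrative stand-in for a convolutional layer with ReLU, not the Gluon implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_dim = 11, 6
x = rng.normal(size=(embed_dim, n_words))  # (input channels, width)


def conv1d(x, kernels):
    """Multi-channel 1D cross-correlation followed by ReLU.
    x: (c_in, n), kernels: (c_out, c_in, w) -> (c_out, n - w + 1)."""
    w = kernels.shape[2]
    out = np.array([[(x[:, i:i + w] * k).sum()
                     for i in range(x.shape[1] - w + 1)] for k in kernels])
    return np.maximum(out, 0)


kernels_a = rng.normal(size=(4, embed_dim, 2))  # width 2, 4 output channels
kernels_b = rng.normal(size=(5, embed_dim, 4))  # width 4, 5 output channels

feat_a, feat_b = conv1d(x, kernels_a), conv1d(x, kernels_b)
print(feat_a.shape, feat_b.shape)  # (4, 10) (5, 8)

# Max-over-time pooling per channel, then concatenation
encoding = np.concatenate([feat_a.max(axis=1), feat_b.max(axis=1)])
print(encoding.shape)  # (9,)

# Fully connected layer mapping the 9-dimensional vector to 2 outputs
W, b = rng.normal(size=(2, 9)), np.zeros(2)
logits = W @ encoding + b
print(logits.shape)  # (2,)
```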

Next, we will implement a textCNN model. Compared with the bag-of-embeddings model in the previous section, besides replacing the mean-pooling encoder with one-dimensional convolutional layers, here we use two embedding layers, one with a fixed weight and another that participates in training.

In [15]:
class TextCNN(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # This second embedding layer does not participate in training
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(2)
        # The max-over-time pooling layer has no weight, so it can share an
        # instance
        self.pool = nn.GlobalMaxPool1D()
        # Create multiple one-dimensional convolutional layers
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))

    def forward(self, inputs):
        # Concatenate the output of two embedding layers with shape of
        # (batch size, number of words, word vector dimension) by word vector
        embeddings = nd.concat(
            self.embedding(inputs), self.constant_embedding(inputs), dim=2)
        # According to the input format required by Conv1D, the word vector
        # dimension, that is, the channel dimension of the one-dimensional
        # convolutional layer, is transformed into the previous dimension
        embeddings = embeddings.transpose((0, 2, 1))
        # For each one-dimensional convolutional layer, after max-over-time
        # pooling, an NDArray with the shape of (batch size, channel size, 1)
        # can be obtained. Use the flatten function to remove the last
        # dimension and then concatenate on the channel dimension
        encoding = nd.concat(*[nd.flatten(
            self.pool(conv(embeddings))) for conv in self.convs], dim=1)
        # After applying the dropout method, use a fully connected layer to
        # obtain the output
        outputs = self.decoder(self.dropout(encoding))
        return outputs

Create a TextCNN instance. It has 3 convolutional layers with kernel widths of 3, 4, and 5, all with 100 output channels.

In [16]:
embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)
net.initialize(mx.init.Xavier(), ctx=ctx)

### Load Pre-trained Word Vectors

As in the previous section, load pre-trained 100-dimensional GloVe word vectors and initialize the embedding layers `embedding` and `constant_embedding`. Here, the former participates in training while the latter has a fixed weight.
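
To make the distinction between the two layers concrete, here is a minimal NumPy sketch of one plain SGD step. The table of "pre-trained" vectors, the gradient, and the learning rate are placeholders chosen for illustration. Both tables start from the same vectors, but only the trainable one moves.

```python
import numpy as np

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(5, 3))    # stand-in for GloVe vectors

embedding = pretrained.copy()           # trainable copy
constant_embedding = pretrained.copy()  # frozen copy

grad = np.ones_like(pretrained)         # hypothetical gradient
lr = 0.1

# A plain SGD step: only the trainable table is updated
embedding -= lr * grad
# constant_embedding is skipped, which is the effect of grad_req='null'

print(np.allclose(constant_embedding, pretrained))  # True
print(np.allclose(embedding, pretrained))           # False
```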

In [17]:
net.embedding.weight.set_data(idx_to_vec)
net.constant_embedding.weight.set_data(idx_to_vec)
net.constant_embedding.collect_params().setattr('grad_req', 'null')

### Train and Evaluate the Model

Now we can train the model.

In [18]:
lr, num_epochs = 0.001, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train(net, train_iter, test_iter, loss, trainer, num_epochs, ctx)

loss 0.841, train acc 0.600
loss 0.704, train acc 0.656
loss 0.634, train acc 0.690
loss 0.593, train acc 0.713
loss 0.561, train acc 0.730
loss 0.380, train acc 0.832
loss 0.370, train acc 0.836
loss 0.365, train acc 0.839
loss 0.361, train acc 0.840
loss 0.357, train acc 0.844
loss 0.282, train acc 0.888
loss 0.272, train acc 0.889
loss 0.265, train acc 0.891
loss 0.267, train acc 0.890
loss 0.264, train acc 0.892
loss 0.166, train acc 0.939
loss 0.162, train acc 0.941
loss 0.159, train acc 0.941
loss 0.163, train acc 0.939
loss 0.164, train acc 0.939
loss 0.084, train acc 0.973
loss 0.086, train acc 0.971
loss 0.089, train acc 0.969
loss 0.089, train acc 0.969
loss 0.090, train acc 0.969
loss 0.090, train acc 0.969, test acc 0.868
2538.9 exampes/sec on [gpu(0)]


Below, we use the trained model to classify the sentiments of two simple sentences.

In [19]:
predict_sentiment(net, vocab, 'this movie is so great')

'positive'

In [20]:
predict_sentiment(net, vocab, 'this movie is so bad')

'negative'

## Summary

* We can use one-dimensional convolution to process and analyze sequence data.
* A one-dimensional cross-correlation operation with multiple input channels can be regarded as a two-dimensional cross-correlation operation with a single input channel.
* The input of the max-over-time pooling layer can have different numbers of time steps on each channel.
* TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling layer.

# Using Recurrent Neural Networks

In this section, we will apply
pre-trained word vectors and bidirectional recurrent neural networks with
multiple hidden layers:

Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011.

## Bidirectional RNNs

![BiRNN](../img/birnn.svg)


## Use a Recurrent Neural Network Model

In this model, each word first obtains a feature vector from the embedding
layer. Then, we further encode the feature sequence using a bidirectional
recurrent neural network to obtain sequence information. Finally, we transform
the encoded sequence information to output through the fully connected
layer. Specifically, we can concatenate the hidden states of the bidirectional
long short-term memory (LSTM) at the initial and final time steps and pass the
result to the output layer as the encoded feature sequence information. In
the `BiRNN` class implemented below, the `Embedding` instance is the embedding
layer, the `LSTM` instance is the hidden layer for sequence encoding, and the
`Dense` instance is the output layer that generates the classification results.
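
The concatenation step can be illustrated with shapes alone. In the NumPy sketch below, a random array stands in for the LSTM output, and the sizes are arbitrary choices for illustration.

```python
import numpy as np

num_steps, batch_size, num_hiddens = 7, 3, 5
rng = np.random.default_rng(0)

# Stand-in for the output of a bidirectional LSTM: at every time step the
# forward and backward hidden states are concatenated along the last axis.
outputs = rng.normal(size=(num_steps, batch_size, 2 * num_hiddens))

# Concatenate the first and last time steps along the feature axis
encoding = np.concatenate([outputs[0], outputs[-1]], axis=1)
print(encoding.shape)  # (3, 20), i.e. (batch size, 4 * num_hiddens)
```

Using both ends is a natural choice: at the last time step the forward direction has read the whole sequence, and at the first time step the backward direction has.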

In [25]:
class BiRNN(nn.Block):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Set bidirectional to True to get a bidirectional recurrent neural
        # network
        self.encoder = rnn.LSTM(num_hiddens, num_layers=num_layers,
                                bidirectional=True, input_size=embed_size)
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The shape of inputs is (batch size, number of words). Because LSTM
        # needs to use sequence as the first dimension, the input is
        # transformed and the word feature is then extracted. The output shape
        # is (number of words, batch size, word vector dimension).
        embeddings = self.embedding(mx.nd.transpose(inputs))
        # Since the input (embeddings) is the only argument passed into
        # rnn.LSTM, it only returns the hidden states of the last hidden layer
        # at each time step (outputs). The shape of outputs is
        # (number of words, batch size, 2 * number of hidden units).
        outputs = self.encoder(embeddings)
        # Concatenate the hidden states of the initial time step and final
        # time step to use as the input of the fully connected layer. Its
        # shape is (batch size, 4 * number of hidden units)
        encoding = mx.nd.concat(outputs[0], outputs[-1])
        outs = self.decoder(encoding)
        return outs

Create a bidirectional recurrent neural network with two hidden layers.

In [26]:
embed_size, num_hiddens, num_layers, ctx = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)
net.initialize(mx.init.Xavier(), ctx=ctx)

### Load Pre-trained Word Vectors

Because the training data set for sentiment classification is not very large, in order to deal with overfitting, we will directly use word vectors pre-trained on a larger corpus as the feature vectors of all words. Here, we load a 100-dimensional GloVe word vector for each word in the dictionary `vocab`.

In [27]:
net.embedding.weight.set_data(idx_to_vec)
net.embedding.collect_params().setattr('grad_req', 'null')

### Train and Evaluate the Model

Now, we can start training.

In [28]:
lr, num_epochs = 0.01, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train(net, train_iter, test_iter, loss, trainer, num_epochs, ctx)

loss 0.693, train acc 0.531
loss 0.689, train acc 0.539
loss 0.668, train acc 0.576
loss 0.632, train acc 0.618
loss 0.602, train acc 0.651
loss 0.444, train acc 0.797
loss 0.429, train acc 0.805
loss 0.419, train acc 0.811
loss 0.412, train acc 0.816
loss 0.404, train acc 0.820
loss 0.343, train acc 0.850
loss 0.357, train acc 0.843
loss 0.356, train acc 0.844
loss 0.355, train acc 0.843
loss 0.353, train acc 0.846
loss 0.299, train acc 0.873
loss 0.311, train acc 0.868
loss 0.307, train acc 0.870
loss 0.307, train acc 0.869
loss 0.310, train acc 0.868
loss 0.267, train acc 0.892
loss 0.274, train acc 0.887
loss 0.279, train acc 0.881
loss 0.282, train acc 0.881
loss 0.281, train acc 0.881
loss 0.281, train acc 0.881, test acc 0.862
913.5 exampes/sec on [gpu(0)]


In [29]:
predict_sentiment(net, vocab, 'this movie is so great')

'positive'

In [30]:
predict_sentiment(net, vocab, 'this movie is so bad')

'negative'