# Kim's CNN Model for Sentence Classification PyTorch Tutorial

This tutorial guides new PyTorch users through implementing, training, and testing a simple CNN model for sentence classification. More specifically, users will implement the CNN model described in [this paper](http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf) and use it on the [SST-1 dataset](https://nlp.stanford.edu/sentiment/), a popular dataset for sentence sentiment classification.

## Introduction

In learning the basics of PyTorch, I frequently consulted the [docs](http://pytorch.org/docs/master/). I found it most illuminating to use example code as a reference and to write the project without copying and pasting. This way, I could understand the role of every function, instead of treating the entire model as opaque.

The paper itself is simple to read and understand. However, it glosses over some details, mainly about preprocessing, such as the mechanics of preprocessing SST-1 and using a word2vec model. I will try to address as many of these details as possible.

## Architecture

How would we translate the model to code? The paper mentions four different variants, but they differ mostly in the input layer. This suggests that we modularize as follows:

```
input_module = rand | static | non-static | multi-channel
full_model = conv_module(input_module)
```

In fact, upon closer examination, we realize that `rand`, `static`, and `non-static` have only a single channel and differ only in initialization and parameter update. Thus, we can further let `input_module = single-channel | multi-channel`.

Here is the fully documented single-channel input module code:

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as nn_func


class SingleChannelWordModel(nn.Module):
    """
    The input layer module for rand, static, and non-static.
    """
    def __init__(self, id_dict, weights, unknown_vocab=[], static=True):
        """
        Creates the model.

        Args:
            id_dict: the id-to-word dictionary
            weights: the word vector matrix that maps id to a word vector
            unknown_vocab: the list of words that are not in id_dict. These will be randomly initialized.
            static: if True, do not compute gradients and update weights.
        """
        super().__init__()
        vocab_size = len(id_dict) + len(unknown_vocab)
        self.n_channels = 1
        self.lookup_table = id_dict

        # Build and initialize the weight matrix, concatenating random vectors for unknown words
        last_id = max(id_dict.values())
        for word in unknown_vocab:
            last_id += 1
            self.lookup_table[word] = last_id
        # Do the concatenation of random weights for unknown words
        self.weights = np.concatenate((weights, np.random.rand(len(unknown_vocab), weights.shape[1]) - 0.5))
        self.dim = self.weights.shape[1] # the size of the embedding

        # The word embedding layer. Note that any nn.Module that __setattr__'s to another module becomes part of 
        # the other module. That is, self.embedding = nn.Embedding(..) effectively adds all of self.embedding's 
        # parameters to the self parameter list, since nn.Embedding subclasses nn.Module.
        self.embedding = nn.Embedding(vocab_size, self.dim, padding_idx=2)
        # Copy the weights to initialize the embedding
        self.embedding.weight.data.copy_(torch.from_numpy(self.weights))

        # Turn off gradient computation if static
        if static:
            self.embedding.weight.requires_grad = False

    @classmethod
    def make_random_model(cls, id_dict, unknown_vocab=[], dim=300):
        # Creates a rand model as specified in the paper. All weights are randomly initialized.
        weights = np.random.rand(len(id_dict), dim) - 0.5
        return cls(id_dict, weights, unknown_vocab, static=False)

    # This function gets run on each forward pass of the model. x is the input tensor or variable.
    def forward(self, x):
        # self.embedding(x) looks up the embedding for x, where x is a tensor of indices.
        # That is, if self.embedding.weight contains [[1, 2, 3], [4, 5, 6]], then self.embedding([0, 1, 0])
        # returns [[1, 2, 3], [4, 5, 6], [1, 2, 3]].
        batch = self.embedding(x)
        # We expand the channel dimension, so batch has shape (batch, channel, sent length, embed dim)
        return batch.unsqueeze(1)

    # Converts each word in sentences to a suitable list of embedding indices.
    def lookup(self, sentences):
        indices_list = []
        max_len = 0
        for sentence in sentences:
            indices = []
            for word in str(sentence).split():
                try:
                    index = self.lookup_table[word]
                    indices.append(index)
                except KeyError:
                    continue
            indices_list.append(indices)
            if len(indices) > max_len: # Find the maximum length of the sentence to pad to.
                max_len = len(indices)
        for indices in indices_list:
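            # Pad with index 2, matching padding_idx=2 in nn.Embedding's constructor, which keeps the gradient of
            # that row at zero. The row itself holds whatever weights[2] contains, because the constructor copies
            # weights over the embedding. There seems to be a bug with negative pad values...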
            indices.extend([2] * (max_len - len(indices))) 
        return indices_list
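
To see the padding step of `lookup` in isolation, here is a minimal sketch in plain Python. The helper name `pad_to_longest` is hypothetical and not part of the tutorial's code; the pad id of 2 simply mirrors the `padding_idx=2` used above.

```python
def pad_to_longest(indices_list, pad_id=2):
    """Right-pad every index list with pad_id up to the longest list's length."""
    max_len = max((len(indices) for indices in indices_list), default=0)
    return [indices + [pad_id] * (max_len - len(indices)) for indices in indices_list]


batch = pad_to_longest([[7, 8, 9], [5], []])
print(batch)  # [[7, 8, 9], [5, 2, 2], [2, 2, 2]]
```

Because every row now has the same length, the nested list can be turned into a rectangular array or tensor.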

It's clear that the `multi-channel` input module is just a pair of `static` and `non-static` input modules:

In [3]:
class MultiChannelWordModel(nn.Module):
    """
    All *ChannelWordModels have attributes n_channels, dim, and lookup(sentences). They will be used
    in the CNN module code later.
    """
    def __init__(self, id_dict, weights, unknown_vocab=[]):
        super().__init__()
        self.n_channels = 2 # We have 2 channels now
        self.non_static_model = SingleChannelWordModel(id_dict, weights, unknown_vocab, static=False) # Non-static
        self.static_model = SingleChannelWordModel(id_dict, self.non_static_model.weights) # Static
        self.dim = self.static_model.dim

    def forward(self, x):
        # One input batch comes from the static model
        batch1 = self.static_model(x)
        # The other comes from non-static, as specified in paper
        batch2 = self.non_static_model(x)
        return torch.cat((batch1, batch2), dim=1) # Concatenate batch1 and batch2 along the channel dimension

    def lookup(self, sentences):
        # Delegate lookup to one of the child models
        return self.static_model.lookup(sentences)
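
The shapes involved are easier to follow on a toy example. The sketch below assumes a made-up vocabulary of 10 words with 4-dimensional embeddings (both numbers are chosen only for illustration) and uses two plain `nn.Embedding` layers as stand-ins for the static and non-static channels.

```python
import torch
import torch.nn as nn

# Toy sizes, assumed for illustration only
vocab_size, embed_dim = 10, 4
static_embed = nn.Embedding(vocab_size, embed_dim)
static_embed.weight.requires_grad = False  # frozen channel
non_static_embed = nn.Embedding(vocab_size, embed_dim)  # trainable channel

# A batch of 2 sentences, each padded to 3 word indices
x = torch.tensor([[1, 5, 2], [3, 2, 2]])

# unsqueeze(1) adds the channel dimension after the batch dimension
single = static_embed(x).unsqueeze(1)
multi = torch.cat((static_embed(x).unsqueeze(1), non_static_embed(x).unsqueeze(1)), dim=1)

print(single.shape)  # torch.Size([2, 1, 3, 4])
print(multi.shape)   # torch.Size([2, 2, 3, 4])
```

The only difference downstream is the number of input channels the convolutional layers must accept, which is why both input modules expose `n_channels`.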

We now build the CNN module:

In [6]:
class KimCNN(nn.Module):
    def __init__(self, word_model, **config):
        """
        Creates a CNN sentence classification module as described by Yoon Kim.

        Args:
            word_model: the input layer model to use. Can be rand, static, non-static, or multi-channel, as represented by
                SingleChannelWordModel and MultiChannelWordModel.
            **config: A dictionary of configuration settings. If blank, it defaults to those recommended in Kim's paper.
        """
        # In PyTorch, we typically initialize all the trainable modules/layers in __init__ and then use them in forward()
        super().__init__()
        n_fmaps = config.get("n_feature_maps", 100)
        weight_lengths = config.get("weight_lengths", [3, 4, 5]) # the sizes of the convolutional kernel
        embedding_dim = word_model.dim

        # By doing self.word_model = word_model, word_model is now a sub-module of KimCNN: all its parameters are now
        # part of KimCNN's parameters.
        self.word_model = word_model
        n_c = word_model.n_channels

        # The convolutional layers, 3 of 3x300, 4x300, 5x300 by default. (300 is the embedding size)
        self.conv_layers = [nn.Conv2d(n_c, n_fmaps, (w, embedding_dim), padding=(w - 1, 0)) for w in weight_lengths]
        for i, conv in enumerate(self.conv_layers):
            self.add_module("conv{}".format(i), conv) # since conv_layers is a list, we need to add the modules manually
        self.dropout = nn.Dropout(config.get("dropout", 0.5)) # a dropout layer
        # Finally linearly combine all conv layers to form a logits output for softmax with cross entropy loss...
        # There are 5 sentiments in SST by default: very negative, negative, neutral, positive, very positive
        self.fc = nn.Linear(len(self.conv_layers) * n_fmaps, config.get("n_labels", 5)) 

    def preprocess(self, sentences):
        # Preprocess the string sentences for input to the model. In other words, takes a list of string sentences and outputs
        # a tensor of their embedding indices, padded to a common length
        return torch.from_numpy(np.array(self.word_model.lookup(sentences)))

    def forward(self, x):
        # Runs x through the current word input model, which is one of rand, static, non-static, and multi-channel
        x = self.word_model(x) # shape: (batch, channel, sent length, embed dim)
        # Perform convolution followed by rectified linear units, as used in the paper
        x = [nn_func.relu(conv(x)).squeeze(3) for conv in self.conv_layers] # squeeze(3) to get rid of the extraneous dimension
        # max-pool over time as mentioned in the paper
        x = [nn_func.max_pool1d(c, c.size(2)).squeeze(2) for c in x]
        # Concatenate along the second dimension:
        x = torch.cat(x, 1)
        # Apply dropout
        x = self.dropout(x)
        # Return logits
        return self.fc(x)
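
The tensor sizes in `forward` follow from the usual output-length formula for a stride-1 convolution. As a rough worked example in plain Python (the function names and the sentence length of 7 are assumptions made for illustration):

```python
def conv_output_length(sent_len, kernel_height, padding):
    """Output length of a stride-1 convolution along the sentence axis."""
    return sent_len + 2 * padding - kernel_height + 1


def kim_cnn_feature_size(weight_lengths=(3, 4, 5), n_feature_maps=100):
    """Size of the concatenated vector fed to the final linear layer."""
    return len(weight_lengths) * n_feature_maps


sent_len = 7  # assumed sentence length, for illustration
for w in (3, 4, 5):
    # padding=(w - 1, 0) as in KimCNN, so the output length is sent_len + w - 1
    print(w, conv_output_length(sent_len, w, padding=w - 1))
# 3 9
# 4 10
# 5 11
print(kim_cnn_feature_size())  # 300
```

Since the kernel spans the full embedding width, the last dimension collapses to 1 and is squeezed away. Max-pooling over time then reduces each feature map to a single value, so the vector passed to `self.fc` has 300 entries regardless of sentence length.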

## Loading data

The problem of constructing `id_dict`, `weights`, and `unknown_vocab` for the input models still remains. Conceptually, `id_dict` maps words to their row indices in `weights`; for example, if `id_dict = {"a": 0, "hello": 1}`, then rows `0` and `1` of `weights` contain the word embeddings for "a" and "hello", respectively. `unknown_vocab` is simply the list of words that are absent from `id_dict`; each of them receives a randomly initialized word embedding.
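
As a small illustration of how these three pieces relate, consider the sketch below. The words are toy data, and unlike the input module above, which extends `id_dict` in place, this version works on a copy.

```python
# Toy data, assumed for illustration
id_dict = {"a": 0, "hello": 1}
sentence_words = ["hello", "brave", "new", "world", "a", "new"]

# Collect words missing from id_dict, keeping first-seen order and skipping duplicates
unknown_vocab = []
seen = set()
for word in sentence_words:
    if word not in id_dict and word not in seen:
        seen.add(word)
        unknown_vocab.append(word)

# Give the unknown words the next free ids, i.e. the rows appended to weights
extended = dict(id_dict)
next_id = max(id_dict.values()) + 1
for offset, word in enumerate(unknown_vocab):
    extended[word] = next_id + offset

print(unknown_vocab)  # ['brave', 'new', 'world']
print(extended)       # {'a': 0, 'hello': 1, 'brave': 2, 'new': 3, 'world': 4}
```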

I've provided the `weights` and `id_dict` data in the **data** directory in my [repository](https://github.com/daemon/kim-cnn). In principle, `weights` can be any NumPy array and `id_dict` can be any pickled dictionary, both with the aforementioned format. The SST data is there as well; Peng has already preprocessed that data in [his implementation](https://github.com/Impavidity/kim_cnn), assigning the sentiment labels in the TSV files.

Please clone my repository for the data.

In [9]:
import numpy as np
import os
import pickle
import random

def load_sst_sets(dirname="data", fmt="stsa.fine.{}.tsv"):
    set_names = ["phrases.train", "dev", "test"] # the pre-determined set names for training/dev/test
    def read_set(name):
        data_set = []
        with open(os.path.join(dirname, fmt.format(name))) as f:
            for line in f.readlines():
                sentiment, sentence = line.replace("\n", "").split("\t")
                data_set.append((sentiment, sentence))
        return np.array(data_set) # data_set is of form [[0, "hello world!"], ...], i.e. label-to-sentence
    return [read_set(name) for name in set_names]

def load_embed_data(dirname="data", weights_file="embed_weights.npy", id_file="word_id.dat"):
    id_file = os.path.join(dirname, id_file)
    weights_file = os.path.join(dirname, weights_file)
    train_file = os.path.join(dirname, "stsa.fine.phrases.train.tsv")
    with open(id_file, "rb") as f:
        id_dict = pickle.load(f)
    with open(weights_file, "rb") as f:
        weights = np.load(f)
    unk_vocab = set()
    unk_vocab_list = []
    with open(train_file) as f:
        for line in f.readlines():
            words = line.split("\t")[1].replace("\n", "").split()
            for word in words:
                if word not in id_dict and word not in unk_vocab:
                    unk_vocab.add(word)
                    unk_vocab_list.append(word)
    return (id_dict, weights, unk_vocab_list)
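
Both loaders rely on each line of the TSV files holding a label, a tab, and a sentence. A minimal sketch of that parsing step, using two made-up lines rather than real SST data, could look like this. Passing `maxsplit=1` is my own variation; it tolerates stray tabs inside a sentence.

```python
import io

# Two made-up lines in the "label<TAB>sentence" layout (assumed sample data)
sample = io.StringIO("3\ta gorgeous film\n0\tdull and lifeless\n")

rows = []
for line in sample:
    # maxsplit=1 keeps any later tab characters inside the sentence
    sentiment, sentence = line.rstrip("\n").split("\t", 1)
    rows.append((int(sentiment), sentence))

print(rows)  # [(3, 'a gorgeous film'), (0, 'dull and lifeless')]
```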

We need a way to convert the dataset returned by `load_sst_sets` into something PyTorch can use (i.e. a variable or tensor):

In [8]:
def convert_dataset(model, dataset):
    model_in = dataset[:, 1].reshape(-1) # grab all the sentences
    model_out = dataset[:, 0].flatten().astype(np.int64) # grab all the output labels
    model_out = torch.autograd.Variable(torch.from_numpy(model_out)).cuda() # .cuda() moves it to the GPU
    model_in = model.preprocess(model_in) # turn the sentences into a tensor of embedding indices
    model_in = torch.autograd.Variable(model_in.cuda()) # move to GPU and turn into variable, which is needed for backprop
    return (model_in, model_out)

We also need to regularize the weights by scaling them to have L2 norm <= 3 as mentioned in the paper:

In [10]:
def clip_weights(parameter, s=3):
    norm = parameter.weight.data.norm()
    if norm < s:
        return
    parameter.weight.data.mul_(s / norm)
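
A quick sanity check of this rule on a weight whose norm is known in advance may help. The sketch repeats `clip_weights` so that it runs on its own, and uses a tiny `nn.Linear` layer as a stand-in for a convolutional layer (an assumption made purely for illustration).

```python
import torch
import torch.nn as nn


def clip_weights(layer, s=3):
    # Same rule as above: rescale only when the L2 norm is at least s
    norm = layer.weight.data.norm()
    if norm < s:
        return
    layer.weight.data.mul_(s / norm)


layer = nn.Linear(2, 1, bias=False)
layer.weight.data.copy_(torch.tensor([[3.0, 4.0]]))  # L2 norm is 5

clip_weights(layer)
print(layer.weight.data)         # approximately [[1.8, 2.4]]
print(layer.weight.data.norm())  # approximately 3
```

The direction of the weight vector is preserved; only its length shrinks to `s`.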

Finally, we can begin training and evaluating our model. The code should be fairly self-explanatory:

In [16]:
import model

def main():
    torch.cuda.set_device(1) # Use the GPU with index 1; change this to 0 on a single-GPU machine
    id_dict, weights, unk_vocab_list = load_embed_data()
    # Uncomment one of the following for rand/static/non-static/multi-channel, respectively
    #word_model = model.SingleChannelWordModel.make_random_model(id_dict, unk_vocab_list)
    word_model = model.SingleChannelWordModel(id_dict, weights, unk_vocab_list)
    #word_model = model.SingleChannelWordModel(id_dict, weights, unk_vocab_list, static=False)
    #word_model = model.MultiChannelWordModel(id_dict, weights, unk_vocab_list)
    word_model.cuda()
    kcnn = model.KimCNN(word_model) # Create the model
    kcnn.cuda()
    criterion = nn.CrossEntropyLoss()
    parameters = filter(lambda p: p.requires_grad, kcnn.parameters()) # Only retrieve parameters that are affected
    # Adadelta works well but gives non-deterministic results on the GPU. Please use Adam or SGD if you need determinism.
    optimizer = torch.optim.Adadelta(parameters, lr=0.001, weight_decay=0.)

    train_set, dev_set, test_set = load_sst_sets()
    for epoch in range(30):
        kcnn.train()
        np.random.shuffle(train_set)
        mbatch_size = 50
        i = 0
        while i + mbatch_size < len(train_set):
            mbatch = train_set[i:i + mbatch_size]
            train_in, train_out = convert_dataset(kcnn, mbatch)

            optimizer.zero_grad() # Reset gradients every step; otherwise they accumulate across mini-batches
            scores = kcnn(train_in)
            loss = criterion(scores, train_out)
            loss.backward() # Computes the gradients on all pertinent variables
            optimizer.step()
            for conv_layer in kcnn.conv_layers:
                clip_weights(conv_layer)
            i += mbatch_size

            if i % 3000 == 0:
                kcnn.eval()
                dev_in, dev_out = convert_dataset(kcnn, dev_set)
                scores = kcnn(dev_in)
                n_correct = (torch.max(scores, 1)[1].view(len(dev_set)).data == dev_out.data).sum().item() # Python int
                accuracy = n_correct / len(dev_set)
                print("Dev set accuracy: {}".format(accuracy))
                kcnn.train()
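
The `filter` call that builds `parameters` matters for the static variant, where the embedding weights are frozen. A minimal sketch of the effect, using a tiny stand-in model with assumed sizes rather than `KimCNN` itself:

```python
import torch
import torch.nn as nn

# Tiny stand-in model (assumed sizes): a frozen embedding followed by a linear layer
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 5))
model[0].weight.requires_grad = False  # mimic the static word model

trainable = [p for p in model.parameters() if p.requires_grad]
print(len(list(model.parameters())))  # 3: embedding weight, linear weight, linear bias
print(len(trainable))                 # 2: only the linear layer's weight and bias

# Only the trainable parameters are handed to the optimizer
optimizer = torch.optim.Adadelta(trainable)
```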

After choosing the best model based on dev set accuracy, you may run the model on the test set. Please refer to the [README](https://github.com/daemon/kim-cnn) for the results I obtained. They're 0.5-2 percentage points lower than those of the paper, but that's probably because I'm using a smaller word embedding model.
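
One possible way to score the test set is sketched below. The `accuracy` helper is hypothetical and not part of the repository, and the toy linear model and random inputs merely stand in for a trained `KimCNN` and a converted test set.

```python
import torch
import torch.nn as nn


def accuracy(model, inputs, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    model.eval()  # disable dropout for evaluation
    with torch.no_grad():  # no gradients are needed when only scoring
        predictions = model(inputs).argmax(dim=1)
    return (predictions == labels).float().mean().item()


# Stand-ins for a trained model and an already converted test set (both assumed)
toy_model = nn.Linear(3, 5)
toy_in = torch.randn(8, 3)
toy_out = toy_model(toy_in).argmax(dim=1)  # labels chosen so every prediction is correct
print(accuracy(toy_model, toy_in, toy_out))  # 1.0
```

With the objects from this tutorial, the call would presumably be `accuracy(kcnn, *convert_dataset(kcnn, test_set))`.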

Please feel free to extend anything in this tutorial.