# SM CNN Model PyTorch Walkthrough

The purpose of this notebook is to show new PyTorch users how to implement the SM CNN Model in PyTorch. We recommend the following prerequisites before reading this walkthrough:

* Be familiar with Convolutional Neural Networks. If you are not, these slides are helpful: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt
* Read the SM Model paper: http://dl.acm.org/citation.cfm?id=2767738

This tutorial implements a slightly modified version of the SM CNN architecture, shown below. It omits the bilinear similarity modeling component present in the original model by Severyn and Moschitti. The following paper found that removing this component actually improved answer selection effectiveness:

Jinfeng Rao, Hua He, and Jimmy Lin. Experiments with Convolutional Neural Network Models for Answer Selection. *Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017)*, August 2017, Tokyo, Japan.

![caption](files/nn-architecture.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In the block below we define the model, with detailed explanations in the comments. This model differs slightly from the one in model.py to keep the tutorial straightforward (e.g. it ignores GPU code).

In [2]:
class QAModel(nn.Module):
    """
    All PyTorch models should subclass nn.Module, the base class for neural network modules.
    """

    def __init__(self, input_n_dim, filter_width, conv_filters=100,
                 no_ext_feats=False, ext_feats_size=4, n_classes=2):
        """
        :param input_n_dim: the dimension of each word vector
        :param filter_width: the width of each convolution filter
        :param conv_filters: the number of convolution filters
        :param no_ext_feats: no additional external features
        :param ext_feats_size: number of external features to use
        :param n_classes: number of label classes
        """
        super(QAModel, self).__init__()

        self.no_ext_feats = no_ext_feats

        # self.conv_channels specifies the dimension of the output of the convolution,
        # i.e. the number of convolution feature maps
        self.conv_channels = conv_filters
        # the elements in the hidden layer consist of an equal number of inputs from the query and the document
        # (hence the 2*) and, optionally, the additional features (ext_feats_size)
        n_hidden = 2*self.conv_channels + (0 if no_ext_feats else ext_feats_size)

        # define the convolution for the question/query - 1D convolution followed by tanh nonlinear activation
        # modules (nn.Conv1d and nn.Tanh) will be added in the order presented to the nn.Sequential container
        self.conv_q = nn.Sequential(
            # the first parameter specifies the input dimension, the second parameter specifies the output dimension
            nn.Conv1d(input_n_dim, self.conv_channels, filter_width, padding=filter_width-1),
            # tanh activation is used to allow the network to learn non-linear decision boundaries
            nn.Tanh()
        )

        # define the convolution for the answer/document
        self.conv_a = nn.Sequential(
            nn.Conv1d(input_n_dim, self.conv_channels, filter_width, padding=filter_width-1),
            nn.Tanh()
        )

        # combining the features from the question, answer, and external features if any into a single vector
        # note PyTorch nn classes follow a similar signature - the first parameter specifies the input dimension,
        # the second parameter specifies the output dimension
        # nn.Linear applies a linear transformation: Ax + b, where A and b are learned parameters.
        self.combined_feature_vector = nn.Linear(2*self.conv_channels + \
            (0 if no_ext_feats else ext_feats_size), n_hidden)

        # define the other layers used in the network; note they are not linked with each other yet
        # tanh is a non-linear activation function
        self.combined_features_activation = nn.Tanh()
        # dropout is used to prevent overfitting and is only active during training
        # each element is zeroed with probability 0.5 and the remaining elements are scaled by 1/(1 - 0.5) = 2
        self.dropout = nn.Dropout(0.5)
        # hidden layer is used to capture additional interactions between the components of the intermediate representation
        self.hidden = nn.Linear(n_hidden, n_classes)
        # log-softmax turns the class scores into log-probabilities; dim=1 normalizes over the classes
        self.logsoftmax = nn.LogSoftmax(dim=1)


    def forward(self, question, answer, ext_feats):
        """
        Defines the forward pass of the network. When the model is called, e.g. model(*args) the args
        are actually passed to the forward method.
        The question and answer tensors are 3-dimensional. The first dimension indexes the sentence within
        the batch - it can be larger than 1 since multiple sentences can be batched together in one forward pass.
        The second and third dimensions are the dimension of the word vectors and the number of tokens respectively.
        
        :param question: the sentence matrices of questions (queries). Note the plural form - this is explained above.
        :param answer: the sentence matrices of answers (documents). Note the plural form - this is explained above.
        :param ext_feats: the external features for the question-answer pairs.
        :returns: the log-probabilities of each question-answer pair belonging to each class.
        """
        # feed the question sentence matrices through the conv_q layers.
        # IMPORTANT: the second dimension of the question MUST match the first argument of
        # the Conv1d instance created above (input_n_dim). The first dimension of the question
        # specifies the batch size (number of questions).
        q = self.conv_q.forward(question)
        # max pool using q.size()[2] as the window size, which is the length of each convolution feature map
        q = F.max_pool1d(q, q.size()[2])
        # reshape max pooled elements into a vector of length equal to the number of feature maps
        # the max pooling takes one value (the max) out of each convolution feature map
        q = q.view(-1, self.conv_channels)

        # feed the answer sentence matrices through the conv_a layers, similar to the previous part for the question.
        a = self.conv_a.forward(answer)
        a = F.max_pool1d(a, a.size()[2])
        a = a.view(-1, self.conv_channels)

        # concatenate the outputs of the conv_q and conv_a layers, and optionally the ext_feats,
        # along dimension 1 (the feature dimension; dimension 0 is the batch)
        x = None
        if self.no_ext_feats:
            x = torch.cat([q, a], 1)
        else:
            x = torch.cat([q, a, ext_feats], 1)

        # feed the concatenated feature vector through the rest of the network (starting with join layer in figure)
        x = self.combined_feature_vector.forward(x)
        x = self.combined_features_activation.forward(x)
        x = self.dropout(x)
        x = self.hidden(x)
        x = self.logsoftmax(x)

        return x
    
    @staticmethod
    def load(model_fname):
        return torch.load(model_fname)

We now load a pre-trained model with one input and see what the model actually does. For this, you'll need to clone the `data` and `models` projects in https://github.com/castorini.

The cell below contains some bootstrapping code to prepare the data, load the model, etc. You do not need to understand it to see how the SM CNN model itself works.

In [3]:
import os
import sys

import numpy as np

from train import Trainer
import utils

torch.manual_seed(1234)
np.random.seed(1234)

# cache word embeddings
word_vectors_file = '../../data/word2vec/aquaint+wiki.txt.gz.ndim=50.bin'
cache_file = os.path.splitext(word_vectors_file)[0] + '.cache'
utils.cache_word_embeddings(word_vectors_file, cache_file)

vocab_size, vec_dim = utils.load_embedding_dimensions(cache_file)

# loading a pre-trained model
trained_model = QAModel.load('../../models/sm_model/sm_model.TrecQA.TRAIN-ALL.2017-04-02.castor')
evaluator = Trainer(trained_model, 0.001, 0.0, False, vec_dim)

evaluator.load_input_data('../../data/TrecQA', cache_file, None, None, 'raw-dev')

questions, sentences, labels, maxlen_q, maxlen_s, ext_feats = evaluator.data_splits['raw-dev']
word_vectors = evaluator.embeddings
pair_idx = 100  # particular question/answer pair we are interested in
batch_inputs, batch_labels = evaluator.get_tensorized_inputs(
    questions[pair_idx:pair_idx + 1],
    sentences[pair_idx:pair_idx + 1],
    labels[pair_idx:pair_idx + 1],
    ext_feats[pair_idx:pair_idx + 1],
    word_vectors, vec_dim
)

xq, xa, x_ext_feats = batch_inputs[0]



The question we want to compute similarity for and the dimensions of its sentence matrix are shown below. The first dimension is the batch size, which is one in this case, so the first index selects the single sentence matrix in the batch. Within each sentence matrix, each column is the word vector for the corresponding word/token in the sentence (5 in total).

In [4]:
print(questions[pair_idx])
print(xq.size())

where was durst born ?
torch.Size([1, 50, 5])
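
To make the layout of a sentence matrix concrete, here is a minimal sketch in plain Python. The helper `sentence_matrix`, the 3-dimensional toy embeddings, and the choice to map unknown tokens to a zero vector are all assumptions made for illustration; they are not how `utils` or `Trainer` build the real 50-dimensional input.

```python
def sentence_matrix(tokens, embeddings, dim):
    """Return a dim x len(tokens) matrix whose columns are the tokens' word vectors."""
    # assumption for this sketch: tokens without an embedding get a zero vector
    columns = [embeddings.get(token, [0.0] * dim) for token in tokens]
    # transpose so that rows index embedding dimensions and columns index tokens
    return [[column[d] for column in columns] for d in range(dim)]


toy_dim = 3
toy_embeddings = {  # hypothetical values, not the real word2vec vectors
    'where': [0.1, 0.2, 0.3],
    'was': [0.0, -0.1, 0.4],
    'born': [0.5, 0.5, -0.2],
}

toy_tokens = 'where was durst born ?'.split()
toy_matrix = sentence_matrix(toy_tokens, toy_embeddings, toy_dim)
toy_batch = [toy_matrix]  # a batch holding a single sentence matrix

print(len(toy_batch), len(toy_batch[0]), len(toy_batch[0][0]))  # 1 3 5
```

The printed sizes follow the same (batch, vector dimension, tokens) order as `xq.size()` above, with 3 in place of 50.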


The answer we want to compute similarity for and its sentence matrix is:

In [5]:
print(sentences[pair_idx])
print(xa.size())

born in jacksonville , fla . , durst grew up in gastonia , n.c . , where his love of hip-hop music and break dancing made him an outcast .
torch.Size([1, 50, 30])


We won't use any external features for this example, so `x_ext_feats` is a vector of zeros.

In [6]:
x_ext_feats

Variable containing:
 0  0  0  0
[torch.FloatTensor of size 1x4]

Now let us step through the `forward` method of the model. Normally we would just call `trained_model(xq, xa, x_ext_feats)`, but to illustrate the steps we'll copy the lines here again and see what happens under the hood.

First, we want to compute convolutional feature maps for the question. We just call `forward` to make a forward pass. We get 100 convolutional feature maps of length 9 each. We have 5 tokens with a padding of 4 on each side, for a total width of 5+2*4 = 13. Our convolution filter width is 5. Hence, we have 13 - 5 + 1 = 9 total positions for the "sliding window".

In [7]:
q = trained_model.conv_q.forward(xq)
q.size()

torch.Size([1, 100, 9])
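
The length of each feature map can be checked with a few lines of plain Python. This is a minimal sketch of the standard output-length formula for a 1D convolution, assuming a dilation of 1 and the same padding on both ends; `conv1d_output_length` is a hypothetical helper name, not part of PyTorch.

```python
def conv1d_output_length(n_tokens, filter_width, padding, stride=1):
    """Number of sliding-window positions, i.e. the length of each feature map."""
    padded_length = n_tokens + 2 * padding
    return (padded_length - filter_width) // stride + 1


filter_width = 5
padding = filter_width - 1  # the same padding choice as in QAModel

question_map_length = conv1d_output_length(5, filter_width, padding)
answer_map_length = conv1d_output_length(30, filter_width, padding)
print(question_map_length, answer_map_length)  # 9 34
```

Padding by `filter_width - 1` means every window that overlaps at least one real token is evaluated, so each feature map has `n_tokens + filter_width - 1` entries.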

Then, we want to max-pool the convolutional feature maps. We take the maximum element of every convolutional feature map of length 9, getting back 100 elements.

In [8]:
print('q.size()[2]:', q.size()[2])
# max pool using q.size()[2] as the window size, which is the length of each convolution feature map
q = F.max_pool1d(q, q.size()[2])
q.size()

q.size()[2]: 9


torch.Size([1, 100, 1])
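
Using the full feature-map length as the pooling window keeps exactly one value per feature map. The sketch below mirrors that idea in plain Python on hypothetical toy values; `max_over_time` is an illustrative helper, not a PyTorch function.

```python
def max_over_time(feature_maps):
    """Keep the largest value of each feature map (one inner list per filter)."""
    return [max(feature_map) for feature_map in feature_maps]


# hypothetical toy input: 3 feature maps of length 4
toy_feature_maps = [
    [0.1, -0.4, 0.7, 0.2],
    [-0.9, -0.3, -0.5, -0.8],
    [0.0, 0.6, 0.6, -0.1],
]

toy_pooled = max_over_time(toy_feature_maps)
print(toy_pooled)  # [0.7, -0.3, 0.6]
```

Because the result has one entry per filter, its length does not depend on the number of tokens, which is what lets questions and answers of different lengths feed the same fully connected layers.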

Next, we reshape `q` into a 1 x 100 matrix, i.e. one vector of length 100 per question in the batch. Passing -1 lets PyTorch infer the size of that dimension (here, the batch size) from the remaining ones.

In [9]:
q = q.view(-1, trained_model.conv_channels)
q.size()

torch.Size([1, 100])

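Similarly, we want to compute the max-pooled convolutional feature maps for the answer. The answer has 30 tokens, so each feature map has length 30 + 2*4 - 5 + 1 = 34 before pooling; after pooling we again get a vector of length 100.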

In [10]:
a = trained_model.conv_a.forward(xa)
a = F.max_pool1d(a, a.size()[2])
a = a.view(-1, trained_model.conv_channels)
a.size()

torch.Size([1, 100])

Next, we join the max-pooled results together with the external features. Note the pre-trained model was trained with external features so we must run the code path with external features, but we can use 0 as the inputs since this is only for demonstration purposes.

In [11]:
x = torch.cat([q, a, x_ext_feats], 1)
x.size()

torch.Size([1, 204])
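
The size 204 is simply 100 + 100 + 4. As a rough plain-Python analogue of concatenating along dimension 1, the sketch below joins the per-pair feature lists row by row; `concat_features` and the zero-filled toy rows are hypothetical stand-ins for the real tensors.

```python
def concat_features(q_rows, a_rows, ext_rows):
    """Join the question, answer and external features of each pair into one row."""
    return [q_row + a_row + ext_row
            for q_row, a_row, ext_row in zip(q_rows, a_rows, ext_rows)]


batch_size, conv_channels, ext_feats_size = 1, 100, 4

# hypothetical placeholder values; only the sizes matter here
toy_q = [[0.0] * conv_channels for _ in range(batch_size)]
toy_a = [[0.0] * conv_channels for _ in range(batch_size)]
toy_ext = [[0.0] * ext_feats_size for _ in range(batch_size)]

toy_joined = concat_features(toy_q, toy_a, toy_ext)
print(len(toy_joined), len(toy_joined[0]))  # 1 204
```

The batch dimension is untouched; only the feature dimension grows.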

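Next, we pass the features through the join layer. The pre-trained model produces 201 values, which are the inputs to the hidden layer. Note that the `QAModel` class defined above would instead produce `n_hidden` = 2*100 + 4 = 204 values with these settings, so the pre-trained model appears to have been saved with a slightly different hidden size.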

In [12]:
x = trained_model.combined_feature_vector.forward(x)
x.size()

torch.Size([1, 201])
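
The join layer is an `nn.Linear` module, which applies an affine map to its input. Here is a minimal sketch of that computation in plain Python, followed by the tanh activation that the next cell applies. The helper `linear` and the toy weights, bias, and input are hypothetical and are not taken from the pre-trained model.

```python
import math


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector x.

    weight has one row per output unit, matching the (out_features, in_features)
    layout of nn.Linear's weight.
    """
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# hypothetical toy parameters: 3 inputs -> 2 outputs
toy_weight = [[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]]
toy_bias = [0.0, 1.0]
toy_input = [2.0, 4.0, 6.0]

toy_output = linear(toy_input, toy_weight, toy_bias)
print(toy_output)  # [-4.0, 7.0]

# tanh is applied element-wise, so the length is unchanged and values lie in (-1, 1)
toy_activated = [math.tanh(value) for value in toy_output]
print(len(toy_activated))  # 2
```

The output length is set by the number of rows in the weight matrix, which is the second argument passed to `nn.Linear`.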

In [13]:
x = trained_model.combined_features_activation.forward(x)
x.size()

torch.Size([1, 201])

Note that the activation doesn't change the dimensions. After the activation we pass the result through the dropout layer, although this has no effect here since we are not training the model.

In [14]:
print('First 10 elements before Dropout', x[0, :10].data.numpy())
x = trained_model.dropout(x)
print('First 10 elements after Dropout', x[0, :10].data.numpy())
x.size()

First 10 elements before Dropout [-0.12813647  0.01474741 -0.12794048 -0.13291343 -0.24393715 -0.00718142
 -0.08802623 -0.08587593  0.23123664  0.02877411]
First 10 elements after Dropout [-0.12813647  0.01474741 -0.12794048 -0.13291343 -0.24393715 -0.00718142
 -0.08802623 -0.08587593  0.23123664  0.02877411]


torch.Size([1, 201])
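
To see why the values come out unchanged, here is a minimal sketch of inverted dropout in plain Python: during training each element is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while outside training the input passes through untouched. The `dropout` function below and its `training` flag are illustrative assumptions (with `p` < 1), not PyTorch's implementation.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats (assumes 0 <= p < 1)."""
    if not training:
        return list(values)  # identity when not training
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else value * scale for value in values]


toy_values = [-0.128, 0.015, -0.128, -0.133]  # hypothetical activations
toy_rng = random.Random(1234)

toy_eval = dropout(toy_values, training=False)
toy_train = dropout(toy_values, training=True, rng=toy_rng)

print(toy_eval == toy_values)  # True
print(toy_train)  # each entry is either 0.0 or twice the original value
```

Scaling the survivors during training keeps the expected value of each element the same in both modes, so no rescaling is needed at evaluation time.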

Next, we pass the elements through the hidden layer, outputting just 2 elements, one per class.

In [15]:
x = trained_model.hidden(x)
x.size()

torch.Size([1, 2])

Finally, we find the log-probabilities.

In [16]:
x = trained_model.logsoftmax(x)
x

Variable containing:
-0.0051 -5.2729
[torch.FloatTensor of size 1x2]

Get the actual probabilities by using `exp`.

In [17]:
torch.exp(x)

Variable containing:
 0.9949  0.0051
[torch.FloatTensor of size 1x2]

Hence, the probability of label 0 is 0.9949 while the probability of label 1 is 0.0051.
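
The relationship between scores, log-probabilities, and probabilities can be sketched in plain Python. The `log_softmax` helper and the toy scores below are hypothetical; the last two lines only re-check the numbers printed by the notebook above.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of class scores."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    log_sum = shift + math.log(sum(math.exp(score - shift) for score in scores))
    return [score - log_sum for score in scores]


toy_scores = [2.0, -1.0]  # hypothetical class scores
toy_log_probs = log_softmax(toy_scores)
toy_probs = [math.exp(log_prob) for log_prob in toy_log_probs]
print(abs(sum(toy_probs) - 1.0) < 1e-12)  # True: the probabilities sum to 1

# the log-probabilities printed by the notebook, converted back with exp
notebook_log_probs = [-0.0051, -5.2729]
print([round(math.exp(log_prob), 4) for log_prob in notebook_log_probs])  # [0.9949, 0.0051]
```

Working with log-probabilities inside the model is the usual choice because they combine directly with a negative log-likelihood loss and avoid underflow for very small probabilities.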