# SM CNN Model PyTorch Walkthrough

The purpose of this notebook is to show new PyTorch users how to implement the SM CNN Model in PyTorch. Here are the recommended prerequisites before reading this walkthrough:

* Have a working knowledge of Convolutional Neural Networks. If not, these slides are helpful: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt.
* Read the SM Model paper: http://dl.acm.org/citation.cfm?id=2767738

The following is a slightly modified version of the SM CNN architecture presented in the paper that will be implemented in this tutorial.

![caption](files/nn-architecture.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In the block below we define the model, with detailed explanations in the comments. To keep the tutorial straightforward, this model differs slightly from the one in model.py (e.g. it omits the GPU code).

In [2]:
class QAModel(nn.Module):
    """
    All PyTorch models should subclass nn.Module, the base class for neural network modules.
    """

    def __init__(self, input_n_dim, filter_width, conv_filters=100,
                 no_ext_feats=False, ext_feats_size=4, n_classes=2):
        """
        :param input_n_dim: the dimension of each word vector
        :param filter_width: the width of each convolution filter
        :param conv_filters: the number of convolution filters
        :param no_ext_feats: whether or not to use additional features X_feat
        :param ext_feats_size: number of external features to use
        :param n_classes: number of label classes
        """
        super(QAModel, self).__init__()

        self.no_ext_feats = no_ext_feats

        # self.conv_channels specifies the number of output channels of the convolution,
        # i.e. the number of convolution feature maps
        self.conv_channels = conv_filters
        # the hidden layer takes an equal number of inputs from the query and the document (hence the 2*),
        # plus, optionally, the additional features (ext_feats_size)
        n_hidden = 2*self.conv_channels + (0 if no_ext_feats else ext_feats_size)

        # define the convolution for the question/query - 1D convolution followed by tanh nonlinear activation
        self.conv_q = nn.Sequential(
            # the first parameter specifies the input dimension, the second parameter specifies the output dimension
            nn.Conv1d(input_n_dim, self.conv_channels, filter_width, padding=filter_width-1),
            nn.Tanh()
        )

        # define the convolution for the answer/document
        self.conv_a = nn.Sequential(
            nn.Conv1d(input_n_dim, self.conv_channels, filter_width, padding=filter_width-1),
            nn.Tanh()
        )

        # combining the features from the question, answer, and external features if any into a single vector
        # note PyTorch nn classes follow a similar signature - the first parameter specifies the input dimension,
        # the second parameter specifies the output dimension
        self.combined_feature_vector = nn.Linear(n_hidden, n_hidden)

        # define the other layers used in the network; note that they are not yet linked with each other
        self.combined_features_activation = nn.Tanh()
        # dropout is used to prevent overfitting and only used during training
        self.dropout = nn.Dropout(0.5)
        self.hidden = nn.Linear(n_hidden, n_classes)
        self.logsoftmax = nn.LogSoftmax()


    def forward(self, question, answer, ext_feats):
        """
        Defines the forward pass of the network.
        The question and answer tensors are 3-dimensional. The first dimension specifies the sentence - it
        can be larger than 1 since multiple sentences can be batched together in one forward pass.
        The second and third dimensions specify the dimension of the word vector and the number of tokens respectively.
        
        :param question: the sentence matrices of questions (queries). Note the plural form - this is explained above.
        :param answer: the sentence matrices of answers (documents). Note the plural form - this is explained above.
        :param ext_feats: the external features for the question-answer pairs.
        :returns: the log-probability of each question-answer pair belonging to each class.
        """
        # feed the question sentence matrices through the conv_q layers.
        # IMPORTANT: the second dimension of the question MUST match the first argument
        # passed to the Conv1d constructor (input_n_dim). The first dimension of the question
        # specifies the batch size (number of questions).
        q = self.conv_q.forward(question)
        # max pool using q.size()[2] as the window size, which is the length of each convolution feature map
        q = F.max_pool1d(q, q.size()[2])
        # reshape max pooled elements into a vector of length equal to the number of feature maps
        # the max pooling takes one value (the max) out of each convolution feature map
        q = q.view(-1, self.conv_channels)

        # feed the answer sentence matrices through the conv_a layers, similar to the previous part for the question.
        a = self.conv_a.forward(answer)
        a = F.max_pool1d(a, a.size()[2])
        a = a.view(-1, self.conv_channels)

        # concatenate the outputs of the conv_q and conv_a layers, optionally together
        # with ext_feats, along dimension 1 (the feature dimension, not the batch dimension)
        x = None
        if self.no_ext_feats:
            x = torch.cat([q, a], 1)
        else:
            x = torch.cat([q, a, ext_feats], 1)

        # feed the concatenated feature vector through the rest of the network (starting with join layer in figure)
        x = self.combined_feature_vector.forward(x)
        x = self.combined_features_activation.forward(x)
        x = self.dropout(x)
        x = self.hidden(x)
        x = self.logsoftmax(x)

        return x
    
    @staticmethod
    def load(model_fname):
        return torch.load(model_fname)
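
The effect of `padding=filter_width-1` in the convolution layers is easiest to see from tensor shapes. The following is a minimal sketch with randomly initialized weights; the sizes used (50-dimensional word vectors, 9 tokens, filter width 5) are assumptions chosen for illustration, not values read from the trained model.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not taken from the trained model).
batch_size, vec_dim, n_tokens = 1, 50, 9
filter_width, conv_filters = 5, 100

# Same construction as conv_q / conv_a above, with untrained weights.
conv = nn.Conv1d(vec_dim, conv_filters, filter_width, padding=filter_width - 1)

# A random stand-in for a sentence matrix: (batch, word vector dimension, tokens).
sentence = torch.randn(batch_size, vec_dim, n_tokens)
feature_maps = torch.tanh(conv(sentence))

# Padding filter_width - 1 zeros on both sides gives each feature map
# n_tokens + filter_width - 1 positions.
print(feature_maps.shape)  # torch.Size([1, 100, 13])
```

With this padding, every token is seen by the filter at each of its `filter_width` offsets, including the tokens at the sentence boundaries.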

We now load a pre-trained model, feed it a single input, and see what the model actually does. For this, you'll need to clone the `data` and `models` projects in https://github.com/castorini.

The cell below contains some bootstrapping code to prepare the data, load the model, etc. You do not need to understand it to see how the SM CNN model itself works.

In [3]:
import os
import sys

import numpy as np

from train import Trainer
import utils

torch.manual_seed(1234)
np.random.seed(1234)

# cache word embeddings
word_vectors_file = '../../data/word2vec/aquaint+wiki.txt.gz.ndim=50.bin'
cache_file = os.path.splitext(word_vectors_file)[0] + '.cache'
utils.cache_word_embeddings(word_vectors_file, cache_file)

vocab_size, vec_dim = utils.load_embedding_dimensions(cache_file)

trained_model = QAModel.load('../../models/sm_model/sm_model.TrecQA.TRAIN-ALL.2017-04-02.castor')
trained_model_weights = list(trained_model.parameters())
evaluator = Trainer(trained_model, 0.001, 0.0, False, vec_dim)

evaluator.load_input_data('../../data/TrecQA', cache_file, None, None, 'raw-dev')

questions, sentences, labels, maxlen_q, maxlen_s, ext_feats = evaluator.data_splits['raw-dev']
word_vectors = evaluator.embeddings
batch_inputs, batch_labels = evaluator.get_tensorized_inputs(
    questions[0:1],
    sentences[0:1],
    labels[0:1],
    ext_feats[0:1],
    word_vectors, vec_dim
)

xq, xa, x_ext_feats = batch_inputs[0]



The question we want to compute similarity for and its sentence matrix are shown below. Each column of the matrix is the word vector for the corresponding word/token in the sentence (9 in total).

In [4]:
print(questions[0])
print(xq)

what ethnic group / race are crip members ?
Variable containing:
(0 ,.,.) = 
  0.0151  0.0496 -0.0434 -0.5739  0.2806  0.0424  0.2552  0.3129  0.3887
  0.3431  0.2241 -0.0209 -0.7470 -0.0753  0.4269 -0.0267  0.1465 -0.7898
 -0.1953 -0.4663 -0.3478  0.7483  0.2263 -0.3437 -0.3074 -0.3011 -0.0599
  0.0161 -0.2898 -0.2497  0.2319 -0.1071 -0.1151 -0.1103 -0.4594  0.8386
 -0.6012 -0.3394 -0.1846 -0.6559 -0.3885 -0.0970 -0.0578 -0.3267 -0.4596
 -0.1988  0.2961 -0.0193 -0.4824 -0.0110 -0.1166 -0.2493  0.0772 -0.0718
  0.0492  0.0412  0.0854 -0.3839  0.0192 -0.1304 -0.2810  0.1978  0.0469
 -0.0640 -0.0597  0.0677 -0.0215  0.1749  0.0468  0.0171  0.1196  0.6734
 -0.0178 -0.3511 -0.1803  0.3700  0.0912 -0.1384  0.0203 -0.3293  0.2773
 -0.1661 -0.5140 -0.3630 -0.4209 -0.1508  0.2184 -0.2679 -0.7618 -0.0724
 -0.1613  0.4295  0.1512  0.5132 -0.2926 -0.1213  0.1927  0.1640  1.5271
  0.0671 -0.1575 -0.2356  0.4252 -0.5078 -0.1085 -0.5529  0.1896 -0.4746
 -0.3540 -0.0366  0.0548  0.1675  0.2479 -0.001

The answer we want to compute similarity for and its sentence matrix are:

In [5]:
print(sentences[0])
print(xa)

prison gangs have a de facto negotiation system to defuse potential conflicts , black gang members said .
Variable containing:
(0 ,.,.) = 

Columns 0 to 5 
   2.5708e-01  3.8999e-01 -2.3996e-02  5.0112e-02  2.4804e-01  5.0402e-02
  4.8300e-03 -1.0852e-01  3.0625e-01 -2.6673e-01 -3.4403e-01  3.0642e-01
 -9.3015e-02 -3.4896e-01 -4.5291e-01 -4.3777e-01  8.9621e-01  8.6556e-01
 -4.9385e-01 -5.1353e-01 -3.3375e-01  7.5828e-02  2.8502e-02 -3.9073e-01
 -1.5353e-01 -4.6417e-01 -2.4324e-01  3.2717e-01  3.2242e-01  1.5353e-02
 -3.0698e-02  2.0278e-01 -1.3350e-01 -1.2179e-01 -2.7394e-01 -2.4425e-01
  1.5015e-01 -3.9379e-01 -3.0872e-03  3.4965e-01  6.4216e-02  5.7407e-02
 -1.3859e-01 -1.8899e-01 -3.3805e-02 -2.1679e-01  1.6352e-02 -2.0942e-01
 -4.4011e-02 -2.0412e-01 -1.2718e-02 -2.9922e-01 -2.2560e-01 -5.0424e-01
 -4.7677e-01 -2.1772e-01  1.6759e-01 -2.2158e-01  3.2218e-01 -2.8547e-01
  1.9762e-01  1.8645e-01 -2.6878e-01  6.7237e-02  2.2902e-01  2.7136e-01
 -9.6843e-02 -2.8648e-01  4.8062e-02 -9.

We'll not use any external features for this example, so `x_ext_feats` is a vector of zeros.
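
To make the input shapes concrete, here is a minimal sketch of how a sentence matrix and a zero external-feature vector could be laid out. It is not the code behind `get_tensorized_inputs`; the toy vocabulary and the 4-dimensional embeddings are made up for illustration.

```python
import torch

# Hypothetical 4-dimensional embeddings for a toy vocabulary (made-up numbers).
toy_embeddings = {
    "what": [0.1, 0.2, 0.3, 0.4],
    "is": [0.0, -0.1, 0.2, 0.5],
    "pytorch": [0.7, 0.1, -0.3, 0.2],
}
tokens = "what is pytorch".split()

# Stack one word vector per row, then transpose so each column is a word vector.
matrix = torch.tensor([toy_embeddings[t] for t in tokens]).t()

# Add a leading batch dimension: (batch, word vector dimension, tokens).
batch = matrix.unsqueeze(0)
print(batch.shape)  # torch.Size([1, 4, 3])

# Placeholder external features: one row of zeros per question-answer pair.
# The width 4 matches the default ext_feats_size of QAModel.
ext_feats = torch.zeros(1, 4)
print(ext_feats.shape)  # torch.Size([1, 4])
```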

Now let us step through the `forward` method of the model. Normally we would just call `trained_model(xq, xa, x_ext_feats)`, but to illustrate the steps we copy the lines here and see what happens under the hood.

First, we want to compute the max-pooled convolutional feature maps for the question. This is a vector of length 100.

In [6]:
q = trained_model.conv_q.forward(xq)
# max pool using q.size()[2] as the window size, which is the length of each convolution feature map
q = F.max_pool1d(q, q.size()[2])
# reshape max pooled elements into a vector of length equal to the number of feature maps
# the max pooling takes one value (the max) out of each convolution feature map
q = q.view(-1, trained_model.conv_channels)
q

Variable containing:

Columns 0 to 9 
 0.3809  0.1988  0.2360  0.1194  0.3365  0.1115  0.1986  0.1210  0.2921  0.2780

Columns 10 to 19 
 0.3465  0.1391  0.2252  0.2028  0.2566  0.2121  0.1145  0.2566 -0.0210  0.0501

Columns 20 to 29 
 0.2870  0.2066  0.1883  0.2801  0.3078  0.1150  0.2241  0.2031  0.3592  0.2815

Columns 30 to 39 
 0.1696  0.5523  0.3096  0.1780  0.1016  0.1626  0.0875  0.1396  0.1777  0.3225

Columns 40 to 49 
 0.2883  0.3741  0.1806  0.3296  0.2664  0.3498  0.1885  0.2927  0.1897  0.1204

Columns 50 to 59 
 0.3060  0.4954  0.1754 -0.0252  0.2306  0.1556  0.2742  0.2156  0.2779  0.2621

Columns 60 to 69 
 0.2128  0.1529  0.1572  0.3329  0.2616  0.3895  0.5503  0.1169  0.1668  0.1846

Columns 70 to 79 
 0.3015  0.1883  0.1910  0.0919  0.0889  0.2142  0.2740  0.2514  0.2368  0.3541

Columns 80 to 89 
 0.3160  0.2832  0.2037 -0.0011  0.2704  0.1965  0.2841  0.0728  0.1713  0.2122

Columns 90 to 99 
 0.0745  0.3086  0.2942  0.2640  0.3694  0.3202  0.2400  0.1647  0.1133
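
Using the full feature-map length as the pooling window amounts to taking one maximum per feature map. The sketch below checks this on a random tensor; the shape `(1, 100, 13)` is an assumed stand-in for the convolution output, not the actual activations of the trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for a convolution output: (batch, feature maps, positions).
feature_maps = torch.randn(1, 100, 13)

# Window size equals the full length, so each feature map yields a single value.
pooled = F.max_pool1d(feature_maps, feature_maps.size(2))  # shape (1, 100, 1)
pooled = pooled.view(-1, 100)                               # shape (1, 100)

# The same result, computed as a plain maximum along dimension 2.
direct_max = feature_maps.max(dim=2)[0]
print(pooled.shape, torch.equal(pooled, direct_max))
```

Because the pooled vector has a fixed length regardless of sentence length, questions and answers with different token counts can feed the same downstream layers.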

Similarly, we want to compute the max-pooled convolutional feature maps for the answer. This is a vector of length 100.

In [7]:
a = trained_model.conv_a.forward(xa)
a = F.max_pool1d(a, a.size()[2])
a = a.view(-1, trained_model.conv_channels)
a

Variable containing:

Columns 0 to 9 
 0.1825  0.2509  0.1841  0.2155  0.1540  0.0712  0.0202  0.0729  0.2625  0.2019

Columns 10 to 19 
 0.2617  0.1551  0.0740  0.0866  0.1213  0.1076  0.1244  0.0700  0.1704  0.2703

Columns 20 to 29 
 0.3295  0.1632  0.2424  0.3351  0.2256  0.2852  0.2369  0.1707  0.0588  0.2356

Columns 30 to 39 
 0.2519  0.0645  0.2570  0.1386  0.2201  0.1745  0.2735  0.3343  0.0153 -0.0033

Columns 40 to 49 
 0.0237  0.1205  0.3471  0.2018  0.2501  0.2470  0.1938  0.3186  0.3065  0.1581

Columns 50 to 59 
 0.2928  0.1028  0.2428  0.4409  0.0721  0.1606  0.3391  0.1829  0.0680  0.1236

Columns 60 to 69 
 0.3346  0.1593  0.3588  0.1548  0.1571  0.3544  0.1718  0.2322  0.2835  0.2238

Columns 70 to 79 
 0.0263  0.1436  0.0468  0.2739  0.2501  0.1943  0.1942  0.1571  0.3216  0.1812

Columns 80 to 89 
 0.0562  0.1824  0.0222  0.0820  0.2011  0.3240  0.0599  0.2776  0.2596  0.4003

Columns 90 to 99 
 0.0814  0.1553  0.0909  0.1959 -0.0631  0.1912  0.1817  0.5493  0.3086

Next, we join the max-pooled results together with the external features. Note that the pre-trained model was trained with external features, so we must run the code path that includes them, but we can use zeros as their values since this is only for demonstration purposes.

In [8]:
x = torch.cat([q, a, x_ext_feats], 1)
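
As a shape check for this step, the following sketch concatenates random stand-ins for `q` and `a` with a zero external-feature vector. The sizes assume the defaults from `QAModel` (100 feature maps, 4 external features); the tensors are placeholders, not the values computed above.

```python
import torch

# Random placeholders with the shapes produced by the pooling step.
q_demo = torch.randn(1, 100)
a_demo = torch.randn(1, 100)
ext_demo = torch.zeros(1, 4)

# Concatenate along dimension 1, so the batch dimension (dimension 0) is preserved.
joined = torch.cat([q_demo, a_demo, ext_demo], 1)
print(joined.shape)  # torch.Size([1, 204])
```

The width 204 equals `2*conv_channels + ext_feats_size`, which is the `n_hidden` value computed in `__init__`.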

We then run the joined vector through the rest of the model, which outputs the log-probability of each class at the very end.

In [9]:
x = trained_model.combined_feature_vector.forward(x)
x = trained_model.combined_features_activation.forward(x)
x = trained_model.dropout(x)
x = trained_model.hidden(x)
x = trained_model.logsoftmax(x)
x

Variable containing:
-0.0002 -8.6184
[torch.FloatTensor of size 1x2]
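
One detail worth keeping in mind when stepping through the layers by hand: `nn.Dropout` only acts as the identity when the module is in evaluation mode. The notebook does not show which mode the loaded model is in, so the sketch below uses a standalone dropout layer to illustrate the difference; calling `trained_model.eval()` before the forward pass is the usual way to make sure dropout is disabled.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.5)
ones = torch.ones(1, 8)

# Training mode: each entry is zeroed with probability 0.5,
# and the survivors are scaled by 1 / (1 - 0.5) = 2.
dropout.train()
train_out = dropout(ones)

# Evaluation mode: dropout passes its input through unchanged.
dropout.eval()
eval_out = dropout(ones)

print(train_out)
print(eval_out)
```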

In [10]:
torch.exp(x)

Variable containing:
 0.9998  0.0002
[torch.FloatTensor of size 1x2]

Hence, the probability of label 0 is 0.9998 while the probability of label 1 is 0.0002.
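
The relationship between the two outputs above can be reproduced with plain Python. This sketch exponentiates the rounded log-probabilities printed earlier, and also includes a small `log_softmax` helper (written here for illustration, not taken from the model code) to show that exponentiated log-softmax outputs sum to 1.

```python
import math

# Rounded log-probabilities as printed in the output above.
log_probs = [-0.0002, -8.6184]
probs = [math.exp(v) for v in log_probs]
print([round(p, 4) for p in probs])  # [0.9998, 0.0002]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


# Hypothetical raw scores for two classes, chosen only for illustration.
demo = log_softmax([4.0, -4.6])
print(sum(math.exp(v) for v in demo))
```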