![PyTorch Packages](figures/pytorch-logo.png)

# PyTorch Tutorial - CNN for Sentiment Analysis

Based on the PyTorch tutorial from WS17 by Glorianna Jagfeld.

This notebook provides code to train and evaluate a CNN-based model for sentiment analysis using PyTorch.

In order to do so, we will follow these steps:

1. Inspect task and data and decide on model

2. Implement model/computation graph (forward pass)

3. Read in and preprocess the data

4. Training code

5. Evaluation code

## 1) Task and Data

The task is to assign a binary sentiment label (0 - negative, 1 - positive) to statements taken from film reviews.

The data comes from the Keras IMDB dataset and was preprocessed as in our Keras tutorial.

## 2) Model
- We represent each word in a sentence by a number (index), which is later mapped to a dense vector from an embedding matrix.
- The sentence is then represented by a dense matrix in which each row is a word embedding.
- We extract features from the trigrams of the sentence by using a CNN with filter height 3 and filter width equal to the embedding size.
- We apply (2,1) max pooling, which keeps the larger of every two neighbouring activations in each feature map and thereby halves its length.
- To classify the sentiment, we feed the CNN output through a fully connected layer with 2 output units and apply a softmax to obtain a probability for each sentiment label; the label with the higher probability is the prediction.
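
The first two points can be illustrated with a tiny lookup. The sketch below uses a made-up vocabulary of 5 entries with 4-dimensional vectors (`toy_W` and all sizes are hypothetical; the notebook itself uses 300-dimensional word2vec vectors) and shows how a sequence of word indices turns into a sentence matrix with one embedding per row.

```python
import torch
import torch.nn as nn

# Hypothetical toy embedding matrix: 5 vocabulary entries, 4 dimensions each.
# Row 0 plays the role of the padding vector (all zeros).
toy_W = torch.arange(20, dtype=torch.float32).reshape(5, 4)
toy_W[0] = 0.0

embedding = nn.Embedding(5, 4)
embedding.weight = nn.Parameter(toy_W)

# One "sentence" of three word indices followed by one padding index
sentence = torch.LongTensor([[3, 1, 4, 0]])
sentence_matrix = embedding(sentence)

# shape is [batch, seq_len, embedding_size]
print(sentence_matrix.shape)
```

Row `i` of the sentence matrix is simply row `sentence[i]` of the embedding matrix, so the padding position maps to the all-zero vector.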

The figure below illustrates the model.
Our model differs in the following points from the illustration:
- Our sentence matrix is of size max_sentence_length x 300 (300 is the embedding size of word2vec)
- We use only one region size / filter height (3)
- We use 100 filters
- We do not apply 1-max pooling but (2,1) max pooling on the CNN output

![Model](figures/model.png)
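
To see which output sizes these choices lead to, the following sketch applies the standard output-size arithmetic for a stride-1 convolution and for non-overlapping pooling. The helper names are hypothetical, and the values assume the padded review length of 350 used later in this notebook together with a zero padding of `region_size - 1` rows on each side, as in the model code.

```python
def conv_output_height(seq_len, region_size, padding):
    """Height after a stride-1 convolution with `padding` zero rows on each side."""
    return seq_len + 2 * padding - region_size + 1


def pool_output_height(height, pool_size=2):
    """Height after non-overlapping max pooling (an incomplete last window is dropped)."""
    return height // pool_size


seq_len = 350      # padded review length used in this notebook
region_size = 3    # trigrams

conv_h = conv_output_height(seq_len, region_size, padding=region_size - 1)
pool_h = pool_output_height(conv_h)

print(conv_h, pool_h)
```

Each of the 100 feature maps therefore contributes `pool_h` values, which is the `seq_len/2 + 1` that the model uses to size its dense layer.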

In [1]:
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional, init
import random

# CNN Model definition (1 convolutional layer)
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, seq_len, num_classes, num_filters, region_size, W):
        """
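        :param vocab_size: number of entries in the vocabulary (rows of W)
        :param embedding_size: dimensionality of the word embeddings (columns of W)
        :param seq_len: number of tokens (must be constant for all batches)
        :param num_classes: number of output labels
        :param num_filters: number of convolution filters (feature maps)
        :param region_size: n-gram size, i.e. the filter height
        :param W: pretrained embedding matrix, float32 numpy array of shape (vocab_size, embedding_size)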
        """
        super(CNN, self).__init__()
        
        # Embedding Layer
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # Load pretrained embeddings, word2vec in this case
        self.embedding.weight = nn.Parameter(torch.from_numpy(W))
        
        # filter width = embedding_size -> extract features from n-grams of size region_size
        self.filter_size = (region_size, embedding_size)
        # zero-pad the height by region_size-1 rows on both sides, so that n-grams overlapping the sentence boundaries are covered as well
        self.padding = (region_size-1, 0)

        # pack together several subsequent operations in one layer
        # layer input: [batch, 1, seq_len, embedding_size]
        # output: [batch, num_filters, H_{out}, W_{out}]
        # H_{out} = seq_len/2+1 for region_size = 3: the padded convolution yields seq_len+2 rows, max pooling halves them
        # W_{out} = 1 because kernel width = embedding_size
        self.layer = nn.Sequential(
            # in_channels=1, out_channels=num_filters
            nn.Conv2d(1, num_filters, kernel_size=self.filter_size, padding=self.padding),
            nn.ReLU(),
            nn.MaxPool2d((2,1)) # half size of each feature map
        )
        
        self.max_pool_out_size = int(seq_len/2 +1)

        # dense layer
        self.fc = nn.Linear(self.max_pool_out_size * num_filters, num_classes)

    def forward(self, x):
        #[batch, seq_len] -> [batch, seq_len, embedding_size]
        embedded = self.embedding(x)
        # add channel dimension:
        # [batch, seq_len, embedding_size] -> [batch, 1, seq_len, embedding_size]
        embedded = embedded.unsqueeze(1)

        # [batch, 1, seq_len, embedding_size] -> [batch, num_filters, (seq_len/2)+1, 1]
        out = self.layer(embedded)

        # [batch,num_filters, (seq_len/2)+1, 1] -> [batch, num_filters * ((seq_len/2) +1)]
        out = out.view(out.size(0), -1)

        # [batch, num_filters * ((seq_len/2) +1)] -> [batch, num_classes]
        out = self.fc(out)
        return out

## 3) Data reading and preprocessing
We have to read in the movie review sentences (IMDB) and their corresponding labels from keras datasets.

For the review sentences we have to take care of the following:
- The tokens in the sentences are already given as indices of a fixed vocabulary.
- All sentences have to be padded to the same length, such that the CNN layer always yields output vectors of the same size to the dense layer.
- Word2vec embeddings are extracted for the words of the vocabulary.
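
The padding step can be sketched in a few lines of plain Python. `pad_or_truncate` is a hypothetical helper, not the Keras implementation; it mimics `padding='post', truncating='post'` with padding index 0, and the example reviews are made up.

```python
def pad_or_truncate(seq, max_len, pad_idx=0):
    """Cut `seq` to at most `max_len` tokens, then fill up with `pad_idx` at the end."""
    seq = list(seq)[:max_len]
    return seq + [pad_idx] * (max_len - len(seq))


short_review = [1, 14, 22, 16]
long_review = [1, 13, 28, 5, 9, 7, 3]

padded_short = pad_or_truncate(short_review, max_len=6)
padded_long = pad_or_truncate(long_review, max_len=6)

print(padded_short)
print(padded_long)
```

After this step every review has the same number of tokens, so the reviews can be stacked into one matrix.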

In [2]:
# Keras preprocessing code (Python 2 syntax), kept in a string for reference and not executed here
_ = ''' from keras.datasets import imdb

# Loading the IMDB dataset
# Selecting the 2000 most frequent words
(x_train_org, y_train), (x_test_org, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=2000,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=2)

# Loading the vocabulary
import numpy as np

vocab = imdb.get_word_index(path="./imdb_word_index.json")
print "Number of unique words:", len(vocab)

INDEX_FROM = 2 

# Dict {word:id}
word_to_id = {x:vocab[x]+INDEX_FROM for x in vocab if vocab[x]<=2000}
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2


# Dict {id:word}
id_to_word = {word_to_id[x]:x for x in word_to_id}

# Array of ordered words by their frequency + special characters
vocab_list = np.array(["<PAD>"]+[id_to_word[x] for x in range(1,2001)])

def get_W(word_vecs, k=300):
    """
    Get word matrix. W[i] is the vector for word indexed by i
    """
    vocab_size = len(word_vecs)
    word_idx_map = dict()
    W = np.zeros(shape=(vocab_size + 1, k), dtype='float32')
    W[0] = np.zeros(k, dtype='float32')
    i = 1
    for word in word_vecs:
        W[i] = word_vecs[word]
        word_idx_map[word] = i
        i += 1
    return W, word_idx_map


def load_bin_vec(fname, vocab):
    """
    Loads 300x1 word vecs from Google (Mikolov) word2vec
    """
    word_vecs = {}
    with open(fname, "rb") as f:
        header = f.readline()
        # ~ print header
        vocab_size, layer1_size = map(int, header.split())
        binary_len = np.dtype('float32').itemsize * layer1_size
        # print(vocab_size)
        for line in range(vocab_size):
            # print(line)
            word = []
            while True:
                ch = f.read(1)
                if ch == ' ':
                    word = ''.join(word)
                    break
                if ch != '\n':
                    word.append(ch)
            # print(word)
            if word in vocab:
                # print(word)
                word_vecs[word] = np.frombuffer(f.read(binary_len), dtype='float32')
            else:
                f.read(binary_len)

    return word_vecs

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.    
    0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)
            
w2v_file = "/projekte/slu/share/GoogleNews-vectors-negative300.bin"
w2v = load_bin_vec(w2v_file, word_to_id)
print("num words found:"+ str(len(w2v)))
add_unknown_words(w2v, word_to_id, k=300)
W, word_idx_map = get_W(w2v, k=300)


# Padding the input data
from keras.preprocessing import sequence
input_length = 350 # average length 

X_train = sequence.pad_sequences(x_train_org, maxlen=input_length, padding='post', truncating='post')
X_dev = sequence.pad_sequences(x_test_org, maxlen=input_length, padding='post', truncating='post')
'''

In [3]:
import pickle

# load the preprocessed data; the file is closed automatically after reading
with open('data.p', 'rb') as f:
    X_train, y_train, X_dev, y_dev, max_len, W = pickle.load(f, encoding="bytes")
vocab_size, embedding_size = W.shape

print(X_train.shape, X_dev.shape)
print(max_len)
print(W.shape)


(25000, 350) (25000, 350)
350
(2003, 300)


In [4]:
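PAD_IDX = 0  # index 0 is reserved for <PAD> (see vocab_list in the Keras code above)

def pad(inputs, max_len):
    """
    :param inputs: list of list of tokens
    :param max_len: length to which all inputs will be padded
    :return: padded inputs, original length of unpadded inputs
    """
    lengths = [len(x) for x in inputs]

    # we do not want to shorten the sentences
    assert max_len >= max(lengths)

    # append the padding index until each sequence has max_len tokens
    # (a no-op for the data loaded above, which is already padded to max_len)
    for seq in inputs:
        for i in range(0, max_len - len(seq)):
            seq.append(PAD_IDX)
    return inputs, lengths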

def get_minibatches(inputs, targets, batch_size, max_len, shuffle=False):
    """
    yields padded mini batches (lists of examples)
    :param inputs:
    :param targets:
    :param batch_size:
    :param max_len:
    :param shuffle: whether to randomize the order of the input-target pairs (important for training)
    :return:
    """
    assert len(inputs) == len(targets)
    examples = list(zip(inputs, targets))

    if shuffle:
        random.shuffle(examples)

    # take steps of size batch_size, take at least one step;
    # examples that do not fill a complete last batch are skipped
    for start_idx in range(0, max(batch_size, len(inputs) - batch_size + 1), batch_size):
        batch_examples = examples[start_idx:start_idx + batch_size]
        batch_inputs, batch_targets = zip(*batch_examples)

        # pad the inputs
        batch_inputs, batch_lengths = pad(batch_inputs, max_len)

        yield list(batch_inputs), list(batch_lengths), list(batch_targets)
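
The stepping logic of `get_minibatches` is easiest to understand by looking at the start offsets it produces. The sketch below isolates the `range` expression in a hypothetical helper `batch_starts` and evaluates it for two toy sizes and for the 25000 training reviews with a batch size of 32 (the value used for training later on).

```python
def batch_starts(num_examples, batch_size):
    """Start offsets produced by the range expression used in get_minibatches."""
    return list(range(0, max(batch_size, num_examples - batch_size + 1), batch_size))


# 10 examples, batches of 4: two full batches, the last 2 examples are skipped
starts_small = batch_starts(10, 4)

# fewer examples than one batch: a single, shorter batch is still produced
starts_tiny = batch_starts(3, 4)

# the training set of this notebook: 25000 reviews, batch size 32
starts_imdb = batch_starts(25000, 32)
covered = len(starts_imdb) * 32

print(starts_small, starts_tiny, len(starts_imdb), covered)
```

With these settings one epoch consists of 781 batches and leaves out the 8 examples at the end of the (shuffled) list.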

## 4) Training
To train the model, we need a loss function; here we use the cross-entropy loss.
We use the Adam optimizer to update the parameters.
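
As a reminder of what the cross-entropy loss measures, here is a minimal pure-Python sketch for a single example: the negative log-probability that a softmax over the model scores assigns to the correct label. The function and the scores are hypothetical illustrations; in the notebook this computation is done by `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    shifted = [z - max(logits) for z in logits]            # for numerical stability
    log_norm = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[label] - log_norm)


# Hypothetical scores for [negative, positive]
confident_right = cross_entropy([-2.0, 2.0], label=1)
confident_wrong = cross_entropy([-2.0, 2.0], label=0)
undecided = cross_entropy([0.0, 0.0], label=1)

print(confident_right, undecided, confident_wrong)
```

The loss is close to 0 for a confident correct prediction, equals log(2) when both labels get the same score, and grows when the model is confidently wrong.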

The train function loops over the training dataset for a provided number of training epochs.
One epoch corresponds to running the model once on each training example.

Each training step consists of running the forward function of the model, then the backward function and then taking an optimization step.
Note that we run the model on _batches_ of training examples, so each training step contains the information of multiple training examples.
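
The order of the calls in one training step can be sketched on a stand-in model. The linear layer and the random batch below are hypothetical placeholders, chosen only so that the snippet runs on its own; the calls themselves are the same ones the train function uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data, only to show the order of the calls
model = nn.Linear(4, 2)
inputs = torch.randn(8, 4)            # one batch of 8 examples
labels = torch.randint(0, 2, (8,))    # one label per example

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

weights_before = model.weight.detach().clone()

optimizer.zero_grad()                 # clear gradients from the previous step
outputs = model(inputs)               # forward pass
loss = criterion(outputs, labels)     # a single scalar for the whole batch
loss.backward()                       # backward pass: compute gradients
optimizer.step()                      # update the parameters

weights_changed = not torch.equal(weights_before, model.weight.detach())
print(loss.item(), weights_changed)
```

Because the loss is averaged over the batch, a single optimizer step uses the information of all 8 examples at once.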

In [5]:
def count_correct_predictions(outputs, labels):
    """
    :param outputs: unnormalized scores (logits) over labels from model
    :param labels: annotation
    :return: number of examples in batch for which label was correctly predicted
    """
    # predicted label = label with highest probability
    # topk(int) yields tuple: values, indices
    pred = nn.functional.softmax(outputs, dim=1).topk(1)[1].squeeze()

    correct = pred.eq(labels)
    
    if torch.cuda.is_available():
        correct = correct.cpu()

    num_correct = torch.sum(correct, dim=0)
    return num_correct.numpy()

def train(model, X_train, y_train, learning_rate, num_epochs, batch_size, sent_len):
    """
    train model on given training examples X_train with correct labels y_train for num_epochs
    :param model: model of type torch.nn.Module
    :param X_train:
    :param y_train:
    :param learning_rate:
    :param num_epochs:
    :param batch_size:
    :param sent_len: maximum sentence length to pad all inputs in X_train
    :return: model with trained parameters
    """
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss() # contains softmax layer and cross entropy loss, averages over examples in batch
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # train the model
    for epoch in range(num_epochs):
        train_correct = 0.0
        train_correct_checkpoint = 0.0
        train_loss = 0.0
        for i, (inputs, _, labels) in enumerate(get_minibatches(X_train, y_train, batch_size, sent_len, shuffle=True)):
            
            # the inputs and labels also have to be moved to the GPU
            if torch.cuda.is_available():
                inputs = Variable(torch.cuda.LongTensor(inputs))
                labels = Variable(torch.cuda.LongTensor(labels))
            else:
                inputs = Variable(torch.LongTensor(inputs))
                labels = Variable(torch.LongTensor(labels))

            # Forward
            optimizer.zero_grad()
            outputs = model(inputs)
            train_correct += count_correct_predictions(outputs, labels)
            
            # Backward
            loss = criterion(outputs, labels)
            loss.backward()

            # Optimize model parameters
            optimizer.step()

            train_loss += loss.item()

        print('Epoch [%d/%d], Train accuracy: %.2f, Train loss: %.2f'
                       % (epoch + 1, num_epochs, train_correct*1.0/len(X_train), train_loss/(len(X_train)/batch_size)))
    return model

In [6]:
import os  
os.environ["CUDA_VISIBLE_DEVICES"]="3"

# model configuration
#emb_size = 50
num_filters = 100
# extract features from tri-grams
region_size = 3
num_classes = 2

# build model
net = CNN(vocab_size, embedding_size, max_len, num_classes, num_filters, region_size, W)

# place all tensors in model on GPU
if torch.cuda.is_available():
    net = net.cuda()
    


# initialize model parameters
# note: the loop covers embedding.weight too, so the pretrained word2vec
# vectors loaded in the constructor are overwritten with random values
print("initialize model parameters")
for (name, param) in net.named_parameters():
    print(name)
    nn.init.normal_(param, std=0.01)

# training configuration
num_epochs = 5
learning_rate = 0.001
batch_size = 32
trained_net = train(net, X_train, y_train, learning_rate, num_epochs, batch_size, max_len)

initialize model parameters
embedding.weight
layer.0.weight
layer.0.bias
fc.weight
fc.bias
Epoch [1/5], Train accuracy: 0.82, Train loss: 0.39
Epoch [2/5], Train accuracy: 0.89, Train loss: 0.28
Epoch [3/5], Train accuracy: 0.91, Train loss: 0.22
Epoch [4/5], Train accuracy: 0.94, Train loss: 0.14
Epoch [5/5], Train accuracy: 0.98, Train loss: 0.07


## 5) Evaluation
When evaluating the model on the development set, we only run the forward pass and compute the loss and accuracy of the model.
We do not run the backward pass and do not optimize the parameters.
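
A forward-only evaluation step can be sketched as follows. The stand-in model and batch are hypothetical; `torch.no_grad()` and `reduction="sum"` are one way to express "no backward pass" and "sum the loss over the batch", and are assumptions of this sketch, not necessarily what the evaluation code of this notebook uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and batch
model = nn.Linear(4, 2)
inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

# sum (not mean) over the batch, so a final division by the dataset size is exact
criterion = nn.CrossEntropyLoss(reduction="sum")

with torch.no_grad():                  # no computation graph is recorded in this block
    outputs = model(inputs)
    batch_loss = criterion(outputs, labels).item()
    predictions = outputs.argmax(dim=1)
    num_correct = (predictions == labels).sum().item()

print(outputs.requires_grad, batch_loss, num_correct)
```

Summing the per-batch losses and correct predictions and dividing once at the end gives the average loss and accuracy over the whole development set.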

In [7]:
def evaluate(model, X, y, sent_len):
    """
    :return: accuracy and average loss of model on examples X with labels y
    """
    # sum the loss over each batch; the average over all examples is taken at the end
    criterion = nn.CrossEntropyLoss(reduction='sum')

    # evaluate
    dev_loss = 0.0
    dev_correct = 0.0
    # no gradients are needed for evaluation
    with torch.no_grad():
        # we do not need to shuffle the dataset; batch_size is the global training setting
        for i, (inputs, _, labels) in enumerate(get_minibatches(X, y, batch_size, sent_len, shuffle=False)):
            inputs = Variable(torch.LongTensor(inputs))
            labels = Variable(torch.LongTensor(labels))
            if torch.cuda.is_available():
                inputs = inputs.cuda()
                labels = labels.cuda()

            outputs = model(inputs)

            dev_loss += criterion(outputs, labels).item()

            dev_correct += count_correct_predictions(outputs, labels)

    accuracy = dev_correct / len(X)
    avg_loss = dev_loss / len(X)
    return accuracy, avg_loss

# evaluate model on development set after training
accuracy, avg_loss = evaluate(trained_net, X_dev, y_dev, max_len)

print('Accuracy: %.2f, Loss: %.2f' %(accuracy, avg_loss))



Accuracy: 0.85, Loss: 0.53


Evaluation on the development set can also be included in the training loop to monitor the performance of the model during training.
This is useful for early stopping and for detecting overfitting.
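
A simple form of early stopping can be sketched in plain Python: stop once the development loss has not improved for a given number of epochs. The function name and the `patience` parameter are hypothetical; the loss values are the per-epoch development losses printed by the final run of this notebook.

```python
def early_stopping_epoch(dev_losses, patience):
    """
    Return (stop_epoch, best_epoch), both 1-based.
    Training stops once the dev loss has not improved for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(dev_losses), best_epoch


# dev losses that rise while the model keeps fitting the training set
dev_losses = [0.31, 0.33, 0.35, 0.45, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(dev_losses, patience=2)

print(stop_epoch, best_epoch)
```

With a patience of 2, training would stop after the third epoch and the parameters from the first epoch would be the ones to keep.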

In [8]:
def train_and_evaluate(model, X_train, y_train, X_dev, y_dev, learning_rate, num_epochs, batch_size, sent_len):
    """
    train model on given training examples X_train with correct labels y_train for num_epochs
    evaluate on development set after each epoch
    :param model: model of type torch.nn.Module
    :param X_train:
    :param y_train:
    :param X_dev:
    :param y_dev:
    :param learning_rate:
    :param num_epochs:
    :param batch_size:
    :param sent_len: maximum sentence length to pad all inputs in X_train
    :return: model with trained parameters
    """
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss() # contains softmax layer and cross entropy loss, averages over examples in batch
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # train the model
    for epoch in range(num_epochs):
        train_correct = 0.0
        train_correct_checkpoint = 0.0
        train_loss = 0.0
        for i, (inputs, _, labels) in enumerate(get_minibatches(X_train, y_train, batch_size, sent_len, shuffle=True)):
            inputs = Variable(torch.LongTensor(inputs))
            labels = Variable(torch.LongTensor(labels))
            
            if torch.cuda.is_available():
                inputs = inputs.cuda()
                labels = labels.cuda()

            # Forward
            optimizer.zero_grad()
            outputs = model(inputs)
            train_correct += count_correct_predictions(outputs, labels)
            
            # Backward
            loss = criterion(outputs, labels)
            loss.backward()

            # Optimize model parameters
            optimizer.step()

            train_loss += loss.item()
        
        train_accuracy = train_correct/len(X_train)
        train_loss = train_loss/(len(X_train)/batch_size)
        dev_accuracy, dev_loss = evaluate(model, X_dev, y_dev, sent_len)

        print('Epoch [%d/%d], Train accuracy: %.2f, Train loss: %.2f, Dev accuracy: %.2f, Dev loss: %.2f'
                       % (epoch + 1, num_epochs, train_accuracy, train_loss, dev_accuracy, dev_loss))

    return model

# build model
net = CNN(vocab_size, embedding_size, max_len, num_classes, num_filters, region_size, W)

if torch.cuda.is_available():
    net = net.cuda()

# initialize model parameters
print("initialize model parameters")
for (name, param) in net.named_parameters():
    print(name)
    nn.init.normal_(param, std=0.01)
    
trained_net = train_and_evaluate(net, X_train, y_train, X_dev, y_dev, learning_rate, num_epochs, batch_size, max_len)

initialize model parameters
embedding.weight
layer.0.weight
layer.0.bias
fc.weight
fc.bias




Epoch [1/5], Train accuracy: 0.81, Train loss: 0.39, Dev accuracy: 0.87, Dev loss: 0.31
Epoch [2/5], Train accuracy: 0.89, Train loss: 0.28, Dev accuracy: 0.86, Dev loss: 0.33
Epoch [3/5], Train accuracy: 0.91, Train loss: 0.22, Dev accuracy: 0.86, Dev loss: 0.35
Epoch [4/5], Train accuracy: 0.95, Train loss: 0.14, Dev accuracy: 0.85, Dev loss: 0.45
Epoch [5/5], Train accuracy: 0.98, Train loss: 0.06, Dev accuracy: 0.85, Dev loss: 0.55
