**CNN FOR SENTENCE CLASSIFICATION**

Author: Brian Yan

The task of this model is to classify sentences into 16 different genres, ranging from video games to music.

The approach is to use a convolution layer to extract n-gram features of varying lengths. These features are then passed through a fully connected layer. Altogether, the CNN has 2 layers and, when ensembled, reaches about 88% accuracy on the validation set.

Individual models trained this way reach 84-86% accuracy with early stopping. Ensembling several of them with a majority voting scheme adds a further 2-4%.

This implementation references Yoon Kim's paper as a baseline: https://arxiv.org/pdf/1408.5882.pdf

A second source, for the CNN model code in PyTorch, is Graham Neubig's course sample code: https://github.com/neubig/nn4nlp-code/tree/master/05-cnn-pytorch

In [0]:
## Run this code for Google Colab to ensure a high ram environment is provisioned

# a = []
# while(1):
#     a.append('1')

In [0]:
version = '/m5'

**SETUP**

Packages, drive mounting, and data loading

In [0]:
import numpy as np
import torch
import sys
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils import data

import matplotlib.pyplot as plt
import time

import pickle

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
root_path = '/content/gdrive/My Drive/CNN for Sentence Classification'  #change dir to your project folder

In [0]:
def txt_to_npy(file):
    with open(file, 'r') as f:
        lines = f.readlines()
        x = []
        y = []
        for line in lines:
            tmp = line.split(' ||| ')
            x_tmp = tmp[1].replace(' @-@ ', '-')
            x_tmp = x_tmp.replace(' @.@ ', '.')
            x_tmp = x_tmp.replace(' @,@ ', ',')
            x.append(x_tmp.strip('\n'))
            y.append(tmp[0])
    return np.array(x), np.array(y)
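
To make the expected file format concrete, here is a minimal sketch of what `txt_to_npy` does to one line. The helper name `parse_line` and the sample line are hypothetical, made up for illustration; only the `label ||| sentence` layout and the three `@`-escaped punctuation marks come from the code above.

```python
def parse_line(line):
    # Split "label ||| sentence" into its two fields, as txt_to_npy does.
    tmp = line.split(' ||| ')
    label, text = tmp[0], tmp[1]
    # Undo the escaped punctuation: ' @-@ ' -> '-', ' @.@ ' -> '.', ' @,@ ' -> ','.
    for escaped, plain in ((' @-@ ', '-'), (' @.@ ', '.'), (' @,@ ', ',')):
        text = text.replace(escaped, plain)
    return text.strip('\n'), label

# Hypothetical sample line, not taken from the dataset.
sample = "Music ||| The band released a self @-@ titled album in 1 @,@ 000 copies .\n"
text, label = parse_line(sample)
print(label)  # Music
print(text)   # The band released a self-titled album in 1,000 copies .
```
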

In [0]:
val_x, val_y = txt_to_npy(root_path+"/topicclass_valid.txt")
print(len(val_x), len(val_y))
train_x, train_y = txt_to_npy(root_path+"/topicclass_train.txt")
print(len(train_x), len(train_y))
test_x, _ = txt_to_npy(root_path+"/topicclass_test.txt")
print(len(test_x))

643 643
253909 253909
697


In [0]:
cuda = torch.cuda.is_available()
cuda

True

**WORD EMBEDDINGS**

Pre-trained word embeddings from FastText (a Facebook project) are used in the CNN. They are kept static during training to improve generalization: the task-specific corpus is narrower, and training word embeddings on it resulted in poorer performance.

Note: an unexplored option is to combine static and newly trained word embeddings, either through a multiplication filter or through addition.

In [0]:
## This code is used to load the full FastText word embeddings from text, which is a 2 GB file available on their website
## Do not need to run this code if the myftvec.p pickle file is available; that is a smaller version of the relevant words to this task

# def load_vocab(file_list):
#     vocab = set()
#     for file in file_list:
#         with open(file, 'r') as f:
#             lines = f.readlines()
#             for line in lines:
#                 tmp = line.split(' ||| ')
#                 words = tmp[1].split(" ")
#                 for w in words:
#                     vocab.add(w)
#     return vocab

# vocab = load_vocab([root_path+"/topicclass_test.txt", root_path+"/topicclass_train.txt", root_path+"/topicclass_valid.txt"])

In [0]:
# def load_vectors(file_name):
#     ft_vectors = {}
#     with open(file_name, 'r') as f:
#         metadata = f.readline().split(' ')          #n lines, vec size
#         for i in range(int(metadata[0]) // 2):      #read half of the file
#             line = f.readline().split(' ')
#             word = line[0]
#             vec = np.array(line[1:],dtype=float)
#             ft_vectors[word] = vec
#     return ft_vectors

# ft_vectors = load_vectors(root_path+"/wiki-news-300d-1M.vec")
# print(len(ft_vectors))

In [0]:
# my_ft_vectors = {}
# for word in vocab:
#     if word in ft_vectors.keys():
#         my_ft_vectors[word] = ft_vectors[word]

# pickle.dump(my_ft_vectors, open(root_path+"/myftvec.p", "wb"))

In [0]:
# print(len(my_ft_vectors))

In [0]:
## Run from here if already saved pickle previously

my_ft_vectors = pickle.load(open(root_path+"/myftvec.p", "rb"))

In [0]:
we_len = my_ft_vectors['the'].shape[0]
print(we_len)

300


In [0]:
## unknown word resolution. Ultimately, use 0 vector if the word is unrecognized

def unk_we(word, vecs):
    if word.lower() in vecs:
        return vecs[word.lower()]
    elif word.upper() in vecs:
        return vecs[word.upper()]
    else:
        return np.zeros(we_len) 

**DATA LOADER**

In [0]:
from torch.utils.data import DataLoader, Dataset, TensorDataset

In [0]:
def to_tensor(numpy_array):
    return torch.from_numpy(numpy_array).float()

In [0]:
## This code is used to generate the list of classes below

# classes = set()
# for c in val_y:
#     classes.add(c)
# print(len(classes))
# for c in train_y:
#     classes.add(c)
# print(len(classes))
# classes = list(classes)
# classes.sort()
# print(classes)

In [0]:
classes = ['Agriculture, food and drink', 'Art and architecture', 
           'Engineering and technology', 'Geography and places', 
           'History', 'Language and literature', 
           'Mathematics', 'Media and drama', 
           'Miscellaneous', 'Music', 
           'Natural sciences', 'Philosophy and religion', 
           'Social sciences and society', 'Sports and recreation', 
           'Video games', 'Warfare']

In [0]:
## class number given the string

def class_id(y):
    return classes.index(y)

In [0]:
## word embedding vector given the string

def word_embedding(x, vecs):
    words = x.split(' ')
    we = np.empty((len(words), we_len))
    for i, word in enumerate(words):
        if word in vecs:
            we[i] = vecs[word]
        else: 
            we[i] = unk_we(word, vecs)
    return we
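
The lookup order used by `word_embedding` and `unk_we` (exact match, then lower case, then upper case, then a zero vector) can be checked on a toy vocabulary. This is a sketch under assumptions: `toy_vecs`, `lookup`, `embed` and the 4-dimensional vectors are hypothetical stand-ins for `my_ft_vectors` and the 300-dimensional FastText vectors.

```python
import numpy as np

DIM = 4  # toy dimension; the notebook's FastText vectors have 300
# Hypothetical toy vocabulary standing in for my_ft_vectors.
toy_vecs = {"the": np.ones(DIM), "NASA": np.full(DIM, 2.0)}

def lookup(word, vecs):
    # Exact match first, then lower case, then upper case, else zeros.
    for candidate in (word, word.lower(), word.upper()):
        if candidate in vecs:
            return vecs[candidate]
    return np.zeros(DIM)

def embed(sentence, vecs):
    # One row per space-separated token.
    return np.stack([lookup(w, vecs) for w in sentence.split(' ')])

m = embed("The nasa zzzz", toy_vecs)
print(m.shape)   # (3, 4)
print(m[:, 0])   # [1. 2. 0.]
```

The unknown token `zzzz` becomes an all-zero row, which is indistinguishable from the zero padding added later by the collate function.
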

In [0]:
## custom dataset class

class myDataset(Dataset):
    def __init__(self, x, y):
        self.x_list = x
        self.y_list = y

    def __getitem__(self, idx):
        xi = to_tensor(word_embedding(self.x_list[idx], my_ft_vectors))
        xi.requires_grad = False
        yi = -1 if self.y_list is None else class_id(self.y_list[idx])
        return xi, yi

    def __len__(self):
        return len(self.x_list)

In [0]:
## custom collate fxn, which pads sentences in batch to the same length; necessary for CNN operations

from torch.nn.utils.rnn import pad_sequence
def collate(batch):
    x_batch = [item[0] for item in batch]
    y_batch = [item[1] for item in batch]
    return pad_sequence(x_batch, batch_first=True, padding_value=0.0), np.array(y_batch)
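
As an illustration of what the padding step produces, the sketch below re-creates the effect of `pad_sequence(..., batch_first=True, padding_value=0.0)` in plain NumPy. The function `pad_batch` is a hypothetical stand-in written for this example, not a replacement for the PyTorch call.

```python
import numpy as np

def pad_batch(seqs, pad_value=0.0):
    # seqs: list of (n_words_i, emb_dim) arrays with differing n_words_i.
    max_len = max(s.shape[0] for s in seqs)
    emb_dim = seqs[0].shape[1]
    out = np.full((len(seqs), max_len, emb_dim), pad_value)
    for i, s in enumerate(seqs):
        out[i, :s.shape[0]] = s   # copy the real rows; the tail stays padded
    return out

# Two "sentences" of 2 and 5 words with 3-dimensional embeddings.
batch = pad_batch([np.ones((2, 3)), np.ones((5, 3))])
print(batch.shape)      # (2, 5, 3)
print(batch[0, :, 0])   # [1. 1. 0. 0. 0.]
```
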

In [0]:
train_data = myDataset(train_x, train_y)

In [0]:
val_data = myDataset(val_x, val_y)

In [0]:
test_data = myDataset(test_x, None)

In [0]:
## data loader class definitions

num_workers = 8 if cuda else 0
train_loader_args = dict(shuffle=True, batch_size=64, num_workers=num_workers, pin_memory=True, collate_fn=collate) if cuda\
                    else dict(shuffle=True, batch_size=64, collate_fn=collate)
train_loader = data.DataLoader(train_data, **train_loader_args)

In [0]:
val_loader_args = dict(shuffle=True, batch_size=64, num_workers=num_workers, pin_memory=True, collate_fn=collate) if cuda\
                    else dict(shuffle=True, batch_size=64, collate_fn=collate)
val_loader = data.DataLoader(val_data, **val_loader_args)

In [0]:
## testing of speed; data loading can be a training time bottleneck where GPU waits for CPU

%%timeit
for epoch in range(1):
    #print("Epoch", epoch)
    for x_batch, y_batch in val_loader:
          # print(len(x_batch), len(x_batch[0]))
          # print(len(y_batch), y_batch)
          break

The slowest run took 21.23 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 454 ms per loop


In [0]:
%%timeit
val_data.__getitem__(1)

The slowest run took 44.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 41.2 µs per loop


**MODEL**

The model has 2 layers: convolution and fully connected. The convolution layer is configurable, so that different filter sizes and numbers of filters can be used. The outputs for the different filter sizes pass through ReLU, are max-pooled over the sentence, and are then concatenated.

While batch norm is always on, dropout can be toggled. Max pooling is chosen as the pooling operation.
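
Because the model below fixes `padding=2` for every window size, the length of each convolution output depends on both the sentence length and the window. The sketch below applies the output-length formula documented for `torch.nn.Conv1d`; the helper name `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(n_words, kernel_size, padding=2, stride=1, dilation=1):
    # Output-length formula documented for torch.nn.Conv1d.
    return (n_words + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Rows: window size; columns: sentences of 1, 5 and 20 words.
for k in (3, 6, 9):
    print(k, [conv1d_out_len(n, k) for n in (1, 5, 20)])
# 3 [3, 7, 22]
# 6 [0, 4, 19]
# 9 [-3, 1, 16]
```

A non-positive length means the padded input is shorter than the window, which `Conv1d` cannot process. With padding fixed at 2, a window of 6 needs at least 2 words and a window of 9 (the largest that the search space in `model_generator` can produce) needs at least 5. Within a batch the relevant length is that of the longest sentence, so the loaders with `batch_size=1` are the ones most exposed to this.
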

In [0]:
save_path = root_path + version+ ".pt"

In [0]:
import torch.nn as nn
import torch.nn.functional as F

class CNNclass(torch.nn.Module):
    def __init__(self, emb_size, num_filters, window_sizes, ntags, dropout=False):
        super(CNNclass, self).__init__()

        self.dropout = dropout
        self.n_windows = len(window_sizes)

        #Convolution filters by length
        self.convs = nn.ModuleList(nn.Sequential(nn.Conv1d(in_channels=emb_size, out_channels=num_filters, kernel_size=win,
                                                stride=1, padding=2, dilation=1, groups=1, bias=True),
                                              nn.BatchNorm1d(num_filters),
                                              torch.nn.ReLU()) for win in window_sizes)

        self.projection_layer = torch.nn.Linear(in_features=num_filters*self.n_windows, out_features=ntags, bias=True)
        torch.nn.init.xavier_uniform_(self.projection_layer.weight)

    def forward(self, emb):
        emb = emb.permute(0, 2, 1)

        # Convolutions, batch norm, relu
        h_list = [conv(emb) for conv in self.convs]     # batch x num_filters x nwords

        # Do max pooling over the word dimension
        h_list = [h.max(dim=2)[0] for h in h_list]      # batch x num_filters

        h = torch.cat(h_list, dim=1)                    # batch x (n_windows x num_filters)

        if self.dropout:
            # pass training=self.training so dropout is disabled under model.eval()
            h = F.dropout(h, p=.2, training=self.training)

        out = self.projection_layer(h)                  # size(out) = batch x ntags

        return out

In [0]:
# # This is the baseline model from the paper: https://arxiv.org/pdf/1408.5882.pdf
# # Reaches 81-84%, with early stopping

# EMB_SIZE = we_len
# N_FILTERS = 100
# WIN_SIZES = [2, 3, 4]
# N_TAGS = len(classes)
# DROP = True

# # initialize the model
# model = CNNclass(EMB_SIZE, N_FILTERS, WIN_SIZES, N_TAGS, DROP)
# model.cuda()
# device = torch.device("cuda" if cuda else "cpu")
# model.to(device)
# criterion = torch.nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters())

# print(model, device)

CNNclass(
  (convs): ModuleList(
    (0): Sequential(
      (0): Conv1d(300, 1000, kernel_size=(3,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (1): Sequential(
      (0): Conv1d(300, 1000, kernel_size=(3,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (2): Sequential(
      (0): Conv1d(300, 1000, kernel_size=(3,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
  )
  (projection_layer): Linear(in_features=3000, out_features=16, bias=True)
) cuda


In [0]:
## uncomment this to load a saved model

#model.load_state_dict(torch.load(save_path))

**TRAIN**

In [0]:
## training procedure

def train_model(model, loader, criterion, optimizer, device):
    
    # Perform training
    model.train()

    running_loss = 0.0
    train_correct = 0.0
    start = time.time()

    for batch_idx, (words, target) in enumerate(loader): 
        words = words.to(device)
        target = to_tensor(target).to(device)
        
        outputs = model(words)
        loss = criterion(outputs, target.long())
        running_loss += loss.item()
        
        # Do back-prop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    end = time.time()
    running_loss /= len(loader)   # average over the loader passed in, not the global train_loader
    print('Training Loss: ', running_loss, 'Time: ', end - start, 's')
    
    #torch.save(model.state_dict(), save_path)
    return running_loss

In [0]:
## testing procedure

def test_model(model, loader, criterion, device):
    with torch.no_grad():
        model.eval()

        running_loss = 0.0
        total_predictions = 0.0
        correct_predictions = 0.0

        for batch_idx, (words, target) in enumerate(loader):   
            words = words.to(device)
            target = to_tensor(target).to(device)

            outputs = model(words)

            _, predicted = torch.max(outputs.data, 1)
            total_predictions += target.size(0)
            correct_predictions += (predicted == target).sum().item()

            loss = criterion(outputs, target.long())
            running_loss += loss.item()


        running_loss /= len(loader)
        acc = (correct_predictions/total_predictions)*100.0
        print('Testing Loss: ', running_loss)
        print('Testing Accuracy: ', acc, '%')
        return running_loss, acc

In [0]:
## Basic training for 1 model. Commented out in favor of ensembled training, which follows below.

# for ITER in range(10):
#     train_model(model, train_loader, criterion, optimizer, device)
#     test_model(model, val_loader, criterion, device)
#     print('='*20)

**TEST**

This section is used to generate accuracies and labels for single models. Commented out in favor of ensembled method below.

In [0]:
## Prediction for a single model

# def predict(model, loader, out_file):
#     with torch.no_grad():
#         model.eval()

#         with open(out_file, "w") as f:
#             for batch_idx, (words, target) in enumerate(loader):
#                 outputs = model(words.to(device))
#                 _, predicted = torch.max(outputs.data, 1)
#                 f.write(classes[predicted.cpu().numpy()[0]]+'\n')

In [0]:
# pred_loader_args = dict(shuffle=False, batch_size=1, num_workers=0, pin_memory=True, collate_fn=collate) if cuda\
#                     else dict(shuffle=False, batch_size=1, collate_fn=collate)
# pred_loader = data.DataLoader(val_data, **pred_loader_args)

In [0]:
# predict(model, pred_loader, root_path+"/val.txt")

**ENSEMBLING**

The ensembling method that follows is this project's main contribution beyond the baseline in the Yoon Kim paper.

The procedure relies on a hyperparameter controller, which makes random perturbations to the model architecture within a defined search space: number of filters, kernel sizes, number of kernel types, and dropout. Each model must pass a defined validation-accuracy threshold, set at 84% (the run shown below passes 85), to be included in the majority voting ensemble.

The result is a significant boost in accuracy beyond the baseline implementation. One key to this method is early stopping: training terminates after the first epoch that decreases the validation accuracy, and the model parameters from the previous epoch (which achieved the maximum observed accuracy) are used.
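
The early stopping rule can be sketched independently of the training code. The function `early_stop` and the accuracy curve are hypothetical, written only to illustrate the rule as it appears in `ensemble_controller` (stop on the first strict decrease, keep the last saved checkpoint).

```python
def early_stop(val_accs):
    # Walk through per-epoch validation accuracies; stop at the first drop.
    best_acc, best_epoch = 0.0, None
    for epoch, acc in enumerate(val_accs):
        if acc < best_acc:
            break                            # accuracy fell: keep the earlier checkpoint
        best_acc, best_epoch = acc, epoch    # the checkpoint would be saved here
    return best_epoch, best_acc

# Hypothetical accuracy curve, made up for illustration.
print(early_stop([82.1, 84.6, 85.9, 85.2, 86.4]))  # (2, 85.9)
```

Since training stops at the first decrease, the later value of 86.4 in this made-up curve is never reached. That is the trade-off of stopping with no patience.
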

In [0]:
import random

# Hyper-param perturbations, randomly generated within a defined search space
# Returns the model and a string describing its parameters
# 1. number of window sizes
# 2. number of filters
# 3. window sizes
# 4. dropout
def model_generator(device):
    EMB_SIZE = we_len
    N_FILTERS = random.randint(150, 400)
    WIN_TYPES = random.choice([1, 2, 3, 4])
    WIN_START = random.choice([1, 2, 3])
    WIN_SIZES = [WIN_START + i + random.choice([0,1,2,3]) for i in range(WIN_TYPES)]
    N_TAGS = len(classes)
    DROP = random.choice([True, False])

    print('*'*40)
    params = "N_FILTERS=" + str(N_FILTERS) + " | WIN_SIZES=" + str(WIN_SIZES) + " | DROP=" + str(DROP)
    print(params)

    # initialize the model
    model = CNNclass(EMB_SIZE, N_FILTERS, WIN_SIZES, N_TAGS, DROP)
    model.to(device)   # also works on CPU; a separate model.cuda() call is not needed
    
    print(model, device)
    print('*'*40)

    return model, params

# Ensemble n of N models which pass the baseline accuracy
def ensemble_controller(num_models, baseline_acc, path):
    ensemble_models = []
    for i in range(num_models):
        model_name = "e"+str(i)
        print("THIS IS: "+ model_name)
        save_path = path + model_name

        device = torch.device("cuda" if cuda else "cpu")
        model, params = model_generator(device)

        acc = 0

        # Create the loss and optimizer once per model so Adam keeps its state across epochs
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters())

        for ITER in range(10):

            train_loss = train_model(model, train_loader, criterion, optimizer, device)
            test_loss, test_acc = test_model(model, val_loader, criterion, device)
            
            # early stopping condition
            if test_acc < acc:
                break
            else:
                acc = test_acc
                torch.save(model.state_dict(), save_path)
            print('='*20)

        model.load_state_dict(torch.load(save_path))
        if acc > baseline_acc:
            ensemble_models.append((model, acc, params, save_path))
            print(model_name+" has passed the threshold, achieving: " + str(acc) + "%")
            with open(path+'ensemble_log.txt', 'a') as f:
                f.write(str(i) + " | " + str(acc) + " | " + params + '\n')

        print("-")

    return ensemble_models

In [0]:
ensemble_group = "/models3/"

In [0]:
ensemble_models = ensemble_controller(15, 85, root_path + ensemble_group)

THIS IS: e0
****************************************
N_FILTERS=271 | WIN_SIZES=[3, 3, 6] | DROP=False
CNNclass(
  (convs): ModuleList(
    (0): Sequential(
      (0): Conv1d(300, 271, kernel_size=(3,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(271, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (1): Sequential(
      (0): Conv1d(300, 271, kernel_size=(3,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(271, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (2): Sequential(
      (0): Conv1d(300, 271, kernel_size=(6,), stride=(1,), padding=(2,))
      (1): BatchNorm1d(271, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
  )
  (projection_layer): Linear(in_features=813, out_features=16, bias=True)
) cuda
****************************************
Training Loss:  0.7978225373813221 Time:  31.0920672416687 s
Testing Loss:  0.5085972520438108
Testing Accura

In [0]:
print("Found " + str(len(ensemble_models)) + " models that passed the threshold.")

for i, (model, acc, params, path) in enumerate(ensemble_models):
    print("Model " + str(i) + " is " + str(acc) + "% accurate.")
    print("Model " + str(i) + " is saved at: " + path)
    print('-'*20)

Found 7 models that passed the threshold.
Model 0 is 85.69206842923795% accurate.
Model 0 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e0
--------------------
Model 1 is 86.31415241057543% accurate.
Model 1 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e2
--------------------
Model 2 is 85.8475894245723% accurate.
Model 2 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e4
--------------------
Model 3 is 85.53654743390358% accurate.
Model 3 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e7
--------------------
Model 4 is 85.53654743390358% accurate.
Model 4 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e9
--------------------
Model 5 is 86.93623639191291% accurate.
Model 5 is saved at: /content/gdrive/My Drive/CNN for Sentence Classification/models4/e12
--------------------
Model 6 is 86.93623639191291% accurate.
Model 6 is saved a

In [0]:
## majority voting scheme for the ensembled models

def ensemble_test(ensemble_models, loader, criterion, device, test=False):
    with torch.no_grad():
        total_predictions = 0.0
        correct_predictions = 0.0
        pred = []

        for batch_idx, (words, target) in enumerate(loader):   
            words = words.to(device)
            target = to_tensor(target).to(device)
            label_votes = np.zeros(len(classes))
            #ensembled_outputs = torch.zeros((target.size(0), len(classes)), device=device)

            # Majority voting
            for (model, acc, params, path) in ensemble_models:
                model.eval()
                output = model(words)
                _, predicted = torch.max(output.data, 1)
                label_votes[predicted.cpu().numpy()[0]] += 1
                #ensembled_outputs += model(words)

            label = np.argmax(label_votes)
            target = target.cpu().numpy()[0]
            #_, predicted = torch.max(ensembled_outputs.data, 1)
            if not test:
                total_predictions += 1
                correct_predictions += (label == target)

            pred.append(classes[label])

        if not test:
            acc = (correct_predictions/total_predictions)*100.0
            print('Testing Accuracy: ', acc, '%')
        return pred

In [0]:
val_pred_loader_args = dict(shuffle=False, batch_size=1, num_workers=0, pin_memory=True, collate_fn=collate) if cuda\
                    else dict(shuffle=False, batch_size=1, collate_fn=collate)
val_pred_loader = data.DataLoader(val_data, **val_pred_loader_args)

In [0]:
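# criterion and device are otherwise only defined in the commented-out baseline cell
device = torch.device("cuda" if cuda else "cpu")
criterion = torch.nn.CrossEntropyLoss()
predictions = ensemble_test(ensemble_models, val_pred_loader, criterion, device)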
print(len(predictions), len(val_x))

Testing Accuracy:  87.55832037325038 %
643 643


In [0]:
out_path = root_path + ensemble_group + "ensembled_val_labels.txt"
with open(out_path, 'w') as f:
    for pred in predictions:
        f.write(pred+"\n")

In [0]:
test_pred_loader_args = dict(shuffle=False, batch_size=1, num_workers=0, pin_memory=True, collate_fn=collate) if cuda\
                    else dict(shuffle=False, batch_size=1, collate_fn=collate)
test_pred_loader = data.DataLoader(test_data, **test_pred_loader_args)

In [0]:
predictions = ensemble_test(ensemble_models, test_pred_loader, criterion, device, True)
print(len(predictions), len(test_x))

697 697


In [0]:
out_path = root_path + ensemble_group + "ensembled_test_labels.txt"
with open(out_path, 'w') as f:
    for pred in predictions:
        f.write(pred+"\n")

**GENERATING LABELS**

The above ensembling procedure was run 3x, achieving ~88% accuracy each time. Those 3 runs were then combined below with another majority voting procedure. This final ensembling did not change the validation accuracy materially.
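
The voting rule is simple enough to sketch in plain Python. `majority_vote` and `toy_classes` are hypothetical names used only here. The tie-breaking behaviour mirrors `np.argmax`, which returns the first maximum.

```python
def majority_vote(labels, classes):
    # Tally one vote per label, indexed in the order of `classes`.
    votes = [0] * len(classes)
    for label in labels:
        votes[classes.index(label)] += 1
    # Like np.argmax, list.index(max(...)) returns the first maximum,
    # so ties go to the class that appears earliest in `classes`.
    return classes[votes.index(max(votes))]

toy_classes = ['History', 'Music', 'Warfare']   # shortened, illustrative list
print(majority_vote(['Music', 'Warfare', 'Music'], toy_classes))    # Music
print(majority_vote(['Warfare', 'Music', 'History'], toy_classes))  # History
```

With three voters, a three-way disagreement is resolved by class order (alphabetical in this notebook), not by model accuracy.
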

In [0]:
val_file = "ensembled_val_labels.txt"
val_files = [root_path+"/models"+str(i)+"/"+val_file for i in range(1,4)]
file_data = {file_name : open(file_name, 'r') for file_name in val_files}

with open(root_path + "/val_labels.txt", 'w') as out_file:
    for row in range(len(val_x)):
        label_votes = np.zeros(len(classes))
        # Majority vote
        for f in file_data.values():
            label_votes[class_id(f.readline().strip('\n'))] += 1
        vote_result = np.argmax(label_votes)
        out_file.write(classes[vote_result] + '\n')

for f in file_data.values():
    f.close()

In [0]:
test_file = "ensembled_test_labels.txt"
test_files = [root_path+"/models"+str(i)+"/"+test_file for i in range(1,4)]
file_data = {file_name : open(file_name, 'r') for file_name in test_files}

with open(root_path + "/test_labels.txt", 'w') as out_file:
    for row in range(len(test_x)):
        label_votes = np.zeros(len(classes))
        # Majority vote
        for f in file_data.values():
            label_votes[class_id(f.readline().strip('\n'))] += 1
        vote_result = np.argmax(label_votes)
        out_file.write(classes[vote_result] + '\n')

for f in file_data.values():
    f.close()