# Using a Very Deep Convolutional Network

To show why deeper architectures matter, this recipe uses a very deep convolutional neural network for text classification. We will study and implement the model from the research paper Very Deep Convolutional Networks for Text Classification by Alexis Conneau and coworkers at Facebook AI Research. The paper claims that a network with 29 convolutional layers beats the previously reported state-of-the-art techniques.

The deep convolution network goes up to 49 layers deep, and the state-of-the-art configuration on text classification tasks is reached at a depth of 29 convolutional layers. I chose this model for a specific reason: it is organised into repeated blocks, with an optional shortcut connection between them, so it gives a good sense of how modern networks go deeper by modifying a traditional architecture. In the next recipe, we will go one step further and look at the changes to the network that promise training beyond 100 layers. The entire network is shown below.

![](figures/deep_conv_model.png)
Figure: Architecture of the Very Deep Convolutional Network for Text Classification as designed by Alexis Conneau and coworkers.


The model takes character-level encoding as input. Each text is converted to a sequence of integer character indices, so with a maximum length of 1024 characters the input to the model has shape [batch_size, 1024]. The embedding layer, shown as a lookup operation in the diagram above, maps every index to a 16-dimensional vector, giving [batch_size, 1024, 16]. This is transposed to [batch_size, 16, 1024] because a 1D convolution expects the channel dimension before the length. A 1D convolution with a filter size of 3, input dimension = 16 and output dimension = 64 is applied to the embedding output, and its result is passed to the convolution blocks. A convolution block has the following layers:

- A 1D convolution layer
- A batch normalization layer after each convolution layer
- A ReLU activation applied to the output of the batch normalization layer
- These layers are repeated twice to produce the output of the block
- If the residual flag is True, the input given to the block is added to the output (a residual connection)

This convolution block is repeated with different numbers of input and output channels: the network contains blocks with 64, 128, 256 and 512 channels. Depending on the depth of the network, each kind of block is used a different number of times.
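
To make the shapes concrete, here is a minimal sketch of the first three steps (lookup, transpose, first convolution). The batch size, the vocabulary size of 69 and the variable names are illustrative assumptions, not values fixed by the paper.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 4, 1024        # illustrative values
vocab_size, embedding_dim = 69, 16   # assumed: characters plus one padding index

# Integer character indices, one row per text: [batch_size, seq_len]
indices = torch.randint(0, vocab_size, (batch_size, seq_len))

embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
first_conv = nn.Conv1d(embedding_dim, 64, kernel_size=3, padding=1)

embedded = embed(indices)                   # [batch_size, seq_len, embedding_dim]
channels_first = embedded.transpose(1, 2)   # [batch_size, embedding_dim, seq_len]
features = first_conv(channels_first)       # [batch_size, 64, seq_len]

print(embedded.shape, channels_first.shape, features.shape)
```

With `padding=1` and a kernel size of 3 the convolution keeps the sequence length unchanged, so only the channel dimension grows from 16 to 64.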

# Importing Requirements

In [None]:
import csv
import sys

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from tqdm import tqdm
from sklearn import metrics
from tensorboardX import SummaryWriter
from torch.nn.init import kaiming_normal_
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
csv.field_size_limit(sys.maxsize)


# Pre-processing
-  Defining character set
-  Defining other constants
-  Using `Dataset` from `torch.utils.data` to create data iterator
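
Before the real dataset class, a small sketch of the character encoding idea may help: every known character maps to a positive index, unknown characters are dropped, and the sequence is truncated or padded with 0 to a fixed length. The toy character set and the `encode` helper are hypothetical and only for illustration.

```python
# A tiny character set for illustration only; the recipe uses a larger one.
toy_vocabulary = list("abc ")
# Index 0 is reserved for padding, so real characters start at 1.
char_to_index = {ch: i + 1 for i, ch in enumerate(toy_vocabulary)}

def encode(text, max_length):
    """Map known characters to indices, drop unknown ones, then truncate or pad with 0."""
    ids = [char_to_index[ch] for ch in text if ch in char_to_index]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

print(encode("a cab!", max_length=8))  # [1, 4, 3, 1, 2, 0, 0, 0]
```

The `!` is not in the toy character set, so it is skipped, and the three trailing zeros are padding.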

In [None]:
split = 0.80
batch_size  = 128
vocabulary = list("""abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}""")
max_length =1024

In [None]:
class MyDataset(Dataset):
    def __init__(self, data_path, max_length=1014):
        self.data_path = data_path
        self.vocabulary = list(vocabulary)
        texts, labels = [], []
        with open(data_path) as csv_file:
            reader = csv.reader(csv_file, quotechar='"',delimiter = "\t")
            for idx, line in enumerate(reader):
                if idx>0:
                    text = ""
                    for tx in line[2:]:
                        text += tx
                        text += " "
                    label = int(line[1])
                    texts.append(text)
                    labels.append(label)
        self.texts = texts
        self.labels = labels
        self.max_length = max_length
        self.length = len(self.labels)
        self.num_classes = len(set(self.labels))

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        raw_text = self.texts[index]
        data = [self.vocabulary.index(i) + 1 for i in list(raw_text) if i in self.vocabulary]
        if len(data) > self.max_length:
            data = data[:self.max_length]
        elif len(data) < self.max_length:
            data += [0] * (self.max_length - len(data))
        label = self.labels[index]
        return np.array(data, dtype=np.int64), label

**Creating test and train iterator**

In [None]:
training_params = {"batch_size": batch_size,"shuffle": True,"num_workers": 0}
test_params = {"batch_size": batch_size,"shuffle": False,"num_workers": 0}

The IMDB movie review dataset, also known as the Large Movie Review Dataset, contains 25000 movie reviews with binary labels.


In [None]:
training_set = MyDataset("data/imdb_train.tsv",1024)
test_set = MyDataset("data/imdb_test.tsv", 1024)
training_generator = DataLoader(training_set, **training_params)
test_generator = DataLoader(test_set, **test_params)

# Defining Model 

## Convolution block 
The convolution block is defined by the `ConvBlock` class. The same structure is shared by every convolution block shown in the diagram above. It stacks the following layers:

- A Convolution 1D layer
- A batch normalization layer
- An activation function

These layers are repeated twice. If the shortcut parameter of the `__init__` method of this class is True, the input to the block is added to the output of the second batch normalization layer, just before the final activation. When the number of channels changes between input and output, the downsampling module passed to the block projects the input so that the two tensors can be added.
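
The shortcut with a channel change can be sketched in isolation as follows. This is a simplified illustration with a single convolution on the main path and assumed sizes, not the block used in the recipe.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 128)  # [batch, channels, length], illustrative sizes

# Main path: changes the channel count from 64 to 128.
main_path = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm1d(128),
)
# Shortcut path: a 1x1 convolution projects the input to 128 channels
# so that it can be added to the main path element by element.
projection = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=1, bias=False),
    nn.BatchNorm1d(128),
)

out = torch.relu(main_path(x) + projection(x))
print(out.shape)  # torch.Size([2, 128, 128])
```

Without the projection, adding a 64-channel tensor to a 128-channel tensor would fail, which is why the blocks that widen the network receive a downsampling module.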

In [None]:
class ConvBlock(nn.Module):

    def __init__(self, input_dim=128, n_filters=256, kernel_size=3, padding=1, stride=1, shortcut=False,
                 downsampling=None):
        super(ConvBlock, self).__init__()

        self.downsampling = downsampling
        self.shortcut = shortcut
        self.conv1 = nn.Conv1d(input_dim, n_filters, kernel_size=kernel_size, padding=padding, stride=stride)
        self.batchnorm1 = nn.BatchNorm1d(n_filters)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv1d(n_filters, n_filters, kernel_size=kernel_size, padding=padding, stride=stride)
        self.batchnorm2 = nn.BatchNorm1d(n_filters)
        self.relu2 = nn.ReLU()

    def forward(self, input):

        residual = input
        output = self.conv1(input)
        output = self.batchnorm1(output)
        output = self.relu1(output)
        output = self.conv2(output)
        output = self.batchnorm2(output)

        if self.shortcut:
            if self.downsampling is not None:
                residual = self.downsampling(input)
            output += residual

        output = self.relu2(output)

        return output

## Very deep CNN blocks

This network is constructed a little differently from the ones we have built so far. We start with an empty list, layers = [], and append all the layers required for the specified depth. For example, a network of depth 9 gets one ConvBlock for each of the channel sizes 64, 128, 256 and 512. As the depth increases, more blocks of each size are appended and the network is constructed accordingly. This approach works well for networks whose number of layers and architecture change with the choice of parameters. At the end, the list holding all the required layers is passed to nn.Sequential(*layers). The fully connected layers at the end of the network are constructed the same way.
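
The list-then-`nn.Sequential` pattern can be sketched on a stripped-down stack. The helpers `conv_unit`, `build_stack` and `count_convs` are hypothetical names, and the sketch leaves out shortcuts and the classifier; it only shows how the number of extra blocks per stage determines the depth.

```python
import torch.nn as nn

def conv_unit(in_ch, out_ch):
    """Stand-in for one convolution block: two Conv1d layers, each with BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm1d(out_ch), nn.ReLU(),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm1d(out_ch), nn.ReLU(),
    )

def build_stack(extra_blocks, widths=(64, 128, 256, 512), embedding_dim=16):
    """Append layers to a list stage by stage, then wrap the list in nn.Sequential."""
    layers = [nn.Conv1d(embedding_dim, widths[0], kernel_size=3, padding=1)]
    in_ch = widths[0]
    for stage, (width, extra) in enumerate(zip(widths, extra_blocks)):
        layers.append(conv_unit(in_ch, width))                        # block present at every depth
        layers.extend(conv_unit(width, width) for _ in range(extra))  # depth-dependent blocks
        if stage < len(widths) - 1:
            layers.append(nn.MaxPool1d(kernel_size=3, stride=2, padding=1))
        in_ch = width
    return nn.Sequential(*layers)

def count_convs(module):
    """Count the Conv1d layers anywhere inside the module."""
    return sum(isinstance(m, nn.Conv1d) for m in module.modules())

for extra in ([0, 0, 0, 0], [1, 1, 1, 1], [4, 4, 1, 1], [7, 7, 4, 2]):
    print(extra, count_convs(build_stack(extra)))  # 9, 17, 29 and 49 convolutions
```

Each block contributes two convolutions and the first layer contributes one, so the depth is 1 + 2 × (number of blocks).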

In [None]:
class VDCNN(nn.Module):

    def __init__(self, n_classes=2, num_embedding=69, embedding_dim=16, depth=9, n_fc_neurons=2048, shortcut=False):
        super(VDCNN, self).__init__()

        layers = []
        fc_layers = []
        base_num_features = 64

        self.embed = nn.Embedding(num_embedding, embedding_dim, padding_idx=0, max_norm=None,
                                  norm_type=2, scale_grad_by_freq=False, sparse=False)
        layers.append(nn.Conv1d(embedding_dim, base_num_features, kernel_size=3, padding=1))

        if depth == 9:
            num_conv_block = [0, 0, 0, 0]
        elif depth == 17:
            num_conv_block = [1, 1, 1, 1]
        elif depth == 29:
            num_conv_block = [4, 4, 1, 1]
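        elif depth == 49:
            num_conv_block = [7, 7, 4, 2]
        else:
            # Fail early instead of hitting an undefined num_conv_block below
            raise ValueError("depth must be one of 9, 17, 29 or 49")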

        layers.append(ConvBlock(input_dim=base_num_features, n_filters=base_num_features, kernel_size=3, padding=1,
                                shortcut=shortcut))
        for _ in range(num_conv_block[0]):
            layers.append(ConvBlock(input_dim=base_num_features, n_filters=base_num_features, kernel_size=3, padding=1,
                                    shortcut=shortcut))
        layers.append(nn.MaxPool1d(kernel_size=3, stride=2, padding=1))

        ds = nn.Sequential(nn.Conv1d(base_num_features, 2 * base_num_features, kernel_size=1, stride=1, bias=False),
                           nn.BatchNorm1d(2 * base_num_features))
        layers.append(
            ConvBlock(input_dim=base_num_features, n_filters=2 * base_num_features, kernel_size=3, padding=1,
                      shortcut=shortcut, downsampling=ds))
        for _ in range(num_conv_block[1]):
            layers.append(
                ConvBlock(input_dim=2 * base_num_features, n_filters=2 * base_num_features, kernel_size=3, padding=1,
                          shortcut=shortcut))
        layers.append(nn.MaxPool1d(kernel_size=3, stride=2, padding=1))

        ds = nn.Sequential(nn.Conv1d(2 * base_num_features, 4 * base_num_features, kernel_size=1, stride=1, bias=False),
                           nn.BatchNorm1d(4 * base_num_features))
        layers.append(
            ConvBlock(input_dim=2 * base_num_features, n_filters=4 * base_num_features, kernel_size=3, padding=1,
                      shortcut=shortcut, downsampling=ds))
        for _ in range(num_conv_block[2]):
            layers.append(
                ConvBlock(input_dim=4 * base_num_features, n_filters=4 * base_num_features, kernel_size=3, padding=1,
                          shortcut=shortcut))
        layers.append(nn.MaxPool1d(kernel_size=3, stride=2, padding=1))

        ds = nn.Sequential(nn.Conv1d(4 * base_num_features, 8 * base_num_features, kernel_size=1, stride=1, bias=False),
                           nn.BatchNorm1d(8 * base_num_features))
        layers.append(
            ConvBlock(input_dim=4 * base_num_features, n_filters=8 * base_num_features, kernel_size=3, padding=1,
                      shortcut=shortcut, downsampling=ds))
        for _ in range(num_conv_block[3]):
            layers.append(
                ConvBlock(input_dim=8 * base_num_features, n_filters=8 * base_num_features, kernel_size=3, padding=1,
                          shortcut=shortcut))

        layers.append(nn.AdaptiveMaxPool1d(8))
        fc_layers.extend([nn.Linear(8 * 8 * base_num_features, n_fc_neurons), nn.ReLU()])
        fc_layers.extend([nn.Linear(n_fc_neurons, int(n_fc_neurons/2)), nn.ReLU()])
        fc_layers.extend([nn.Linear(int(n_fc_neurons/2), n_classes)])

        self.layers = nn.Sequential(*layers)
        self.fc_layers = nn.Sequential(*fc_layers)
        self.__init_weights()

    def __init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, input):

        output = self.embed(input)
        output = output.transpose(1, 2)
        output = self.layers(output)
        output = output.view(output.size(0), -1)
        output = self.fc_layers(output)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return output
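
As a quick check on the shapes inside the network, the sequence length after each layer can be computed by hand. This sketch assumes an input length of 1024 and uses the standard output-length formula for `Conv1d` and `MaxPool1d` with dilation 1; `out_length` is a hypothetical helper.

```python
def out_length(length, kernel_size, stride=1, padding=0):
    """Output length of Conv1d / MaxPool1d with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

length = 1024                                            # assumed input length
length = out_length(length, kernel_size=3, padding=1)    # convolutions keep the length
for _ in range(3):                                       # three max-pooling layers halve it
    length = out_length(length, kernel_size=3, stride=2, padding=1)
print(length)                       # 128

pooled_length, channels = 8, 8 * 64  # AdaptiveMaxPool1d(8) output and 512 channels
print(pooled_length * channels)      # 4096
```

The final product, 4096, is the number of input features of the first fully connected layer, `8 * 8 * base_num_features`.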

**Initializing the Network**

In [None]:
model = VDCNN(n_classes=2, num_embedding=len(list(vocabulary)) + 1, embedding_dim=16,
                  depth=9, n_fc_neurons=2048, shortcut=False)
model.to(device)

**Defining Loss and Optimizer**

In [None]:
criterion = nn.CrossEntropyLoss()
criterion.to(device)      
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
best_loss = 1e5
best_epoch = 0
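
`nn.CrossEntropyLoss` expects raw, unnormalised scores (logits) and applies log-softmax itself. The following small check, on made-up logits and labels, illustrates that it matches a manual log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)             # raw scores for 4 samples and 2 classes
labels = torch.tensor([0, 1, 1, 0])

loss_from_logits = nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(loss_from_logits, manual))  # True
```

For this reason the model output should be passed to the criterion without an extra softmax.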

# Training

In [None]:
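def get_evaluation(y_true, y_prob, list_metrics=("accuracy",)):
    """Return the accuracy; y_prob holds one row of class scores per sample."""
    if 'accuracy' not in list_metrics:
        raise ValueError("only the 'accuracy' metric is supported")
    # The predicted class is the one with the highest score
    y_pred = np.argmax(y_prob, -1)
    return metrics.accuracy_score(y_true, y_pred)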
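
The accuracy computation can be followed on a small example with made-up scores; the arrays below are illustrative only.

```python
import numpy as np

# One row of class scores per sample (2 classes)
scores = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.0]])
y_true = np.array([0, 1, 0, 0])

y_pred = np.argmax(scores, axis=-1)           # highest-scoring class per row
accuracy = float((y_pred == y_true).mean())   # fraction of correct predictions
print(y_pred, accuracy)                       # [0 1 1 0] 0.75
```

Three of the four predictions match the labels, which gives an accuracy of 0.75.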

In [None]:
def test(model):
    """Evaluate the model on the whole test set and return its accuracy."""
    LABELS = []
    PREDICTIONS = []
    model.eval()  # batch normalization uses its running statistics
    with torch.no_grad():  # no gradients are needed for evaluation
        for feature, label in test_generator:
            feature = feature.to(device)
            predictions = model(feature)
            LABELS.extend(label.cpu().numpy())
            PREDICTIONS.extend(predictions.cpu().numpy())
    test_metrics = get_evaluation(np.array(LABELS), np.array(PREDICTIONS), list_metrics=["accuracy"])
    return test_metrics

In [None]:
writer = SummaryWriter()
# Per-epoch history, used for the performance plot
training_accuracy_list, train_loss_list, test_accuracy_list = [], [], []
for epoch in tqdm(range(20)):
    training_accuracy = []
    train_loss = []
    model.train()  # batch normalization uses batch statistics while training
    for step, batch in enumerate(training_generator):
        feature, label = batch
        feature = feature.to(device)
        label = label.to(device)
        optimizer.zero_grad()
        predictions = model(feature)
        loss = criterion(predictions, label)
        loss.backward()
        optimizer.step()
        train_loss.append(loss.item())
        training_metrics = get_evaluation(label.cpu().numpy(), predictions.cpu().detach().numpy(),
                                          list_metrics=["accuracy"])
        training_accuracy.append(training_metrics)
        writer.add_scalar('Train/Loss', loss.item(), epoch * len(training_generator) + step)
        writer.add_scalar('Train/Accuracy', training_metrics, epoch * len(training_generator) + step)
    train_loss_list.append(sum(train_loss) / len(train_loss))
    training_accuracy_list.append(sum(training_accuracy) / len(training_accuracy))
    print("Train Epoch : ", epoch, " Train loss : ", train_loss_list[-1],
          " Train accuracy : ", training_accuracy_list[-1])

    test_accuracy = []
    test_loss = []
    model.eval()  # batch normalization uses its running statistics
    with torch.no_grad():  # no gradients are needed for evaluation
        for step, batch in enumerate(test_generator):
            feature, label = batch
            feature = feature.to(device)
            label = label.to(device)
            predictions = model(feature)
            loss = criterion(predictions, label)
            test_loss.append(loss.item())
            test_metrics = get_evaluation(label.cpu().numpy(), predictions.cpu().numpy(),
                                          list_metrics=["accuracy"])
            test_accuracy.append(test_metrics)
            writer.add_scalar('Test/Loss', loss.item(), epoch * len(test_generator) + step)
            writer.add_scalar('Test/Accuracy', test_metrics, epoch * len(test_generator) + step)
    test_accuracy_list.append(sum(test_accuracy) / len(test_accuracy))
    print("Test Epoch : ", epoch, " Test loss : ", sum(test_loss) / len(test_loss),
          " Test accuracy : ", test_accuracy_list[-1])

# Plotting Performance

The deep convolution network performs well on the IMDB dataset. It achieved 99% accuracy on the train set and 80% accuracy on the test set; the gap between the two indicates that the model overfits the training data. The accuracy/loss vs. epoch plot is given below:

![](figures/deep_conv.png)

Figure: Increase in train and test accuracy and decrease in train loss with epoch when the deep convolution network is trained on the IMDB movie review data.

In [None]:
plt.plot(training_accuracy_list , label = "Train Accuracy")
plt.plot(train_loss_list , label = "Train Loss")
plt.plot(test_accuracy_list, label = "Test Accuracy")
plt.ylabel("Accuracy / Loss")
plt.xlabel("Epoch")
plt.legend(loc='upper left')
plt.show()