# Ensembling By Taking Snapshot

Ensemble techniques are widely used to combine many weak classifiers into a stronger one. Ensemble methods have traditionally been used to produce state-of-the-art results in well-known competitions such as ImageNet. There are three main kinds of ensembling techniques: 1) bagging, 2) boosting and 3) stacking.

As the number of features increases, the number of local minima or maxima increases exponentially. There is no sure-shot way to find the global minimum or maximum. Optimizers often get stuck in a local minimum and produce a model with high variance. To understand this, let us visualize the feature space.

![](figures/learning_rate_snapshot.png)

Figure. Left: a standard learning rate schedule converges to a single minimum and yields one model. Right: a snapshot ensemble uses a cyclic learning rate to yield a different model for each minimum it visits.

The diagram on the left shows a typical loss (energy) landscape for a model with only 3 features. In machine learning we try to decrease the loss of the model; the point in the space where the loss is lowest is known as the global minimum, and ideally the model should converge to it. A local minimum, by contrast, is a point where the loss is lower than anywhere in its close vicinity but not necessarily the lowest overall. There is no sure-shot method for finding the global minimum, and the number of local minima increases exponentially with the number of features, so the task is difficult. Stochastic gradient descent often gets stuck in a local minimum and produces a poor result.

The Snapshot Ensembles technique addresses this problem by exploiting how convergence depends on the learning rate. When the learning rate is high, the updates overshoot and the model escapes from a minimum; when the learning rate is low, the model converges into a local minimum. The authors of the Snapshot Ensembles paper propose obtaining M models from a single training run. The T training epochs are divided into M cycles. Each cycle starts with a high learning rate that decreases monotonically to ensure convergence into a local minimum. One model is collected at the end of each of the M cycles, and the final decision is the average of the predictions of all these models. The monotonically decreasing function is given as:

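$$ α(t) = \frac{α_0}{2} \left( \cos\left( \frac{π \, \mathrm{mod}(t-1, \lceil T/M \rceil)}{\lceil T/M \rceil} \right) + 1 \right) $$

where $α(t)$ is the learning rate at epoch $t$, $α_0$ is the initial learning rate to which every cycle restarts, $T$ is the total number of epochs and $M$ is the total number of learning rate oscillation cycles.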
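
The schedule can be tried out in isolation. The following is a minimal sketch, assuming that the learning rate restarts at its initial value at the start of every cycle and that epochs are counted from 1; the function name `snapshot_lr` and the settings (12 epochs, 3 cycles, initial rate 0.1) are illustrative and not part of the recipe.

```python
import math


def snapshot_lr(t, total_epochs, cycles, alpha_0):
    """Learning rate for the 1-indexed epoch t under cyclic cosine annealing."""
    cycle_len = math.ceil(total_epochs / cycles)
    position = (t - 1) % cycle_len  # epochs already completed in the current cycle
    return alpha_0 / 2 * (math.cos(math.pi * position / cycle_len) + 1)


# Hypothetical settings: 12 epochs split into 3 cycles of 4 epochs each
schedule = [snapshot_lr(t, total_epochs=12, cycles=3, alpha_0=0.1) for t in range(1, 13)]
for t, lr in enumerate(schedule, start=1):
    print(f"epoch {t:2d}: lr = {lr:.4f}")
```

Within each cycle the printed rate falls monotonically, and it jumps back to the initial value at the first epoch of the next cycle.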



# Importing Requirements 

In [None]:
import json
import math
import os
import random

import numpy as np
import pandas as pd
import torch
from tensorboardX import SummaryWriter
from torchtext import data
from torchtext import vocab
from tqdm import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Preprocessing

## Dataset

In chapter 4, Applying CNN In NLP Tasks, we already visited the recipe Word Level CNN For Text Classification. To keep the implementation simple and easy to understand, I am building the snapshot ensemble on top of that Word Level CNN For Text Classification model. We will be using the same dataset, the Large Movie Review Dataset, with the same loss function and optimizer.

In [None]:
split = 0.80
data_block = []
data_from_file = pd.read_csv("../Ch5/data/imdb.tsv", sep="\t")
data_block = data_from_file[["review", "sentiment"]].values


## Preprocessing
We will be using the IMDB Large Movie Review Dataset. This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Let's use TorchText to preprocess our data. The pre-processing involves:

1. Splitting the data into two parts, train and test
2. Reading the data using TorchText and applying pre-processing operations such as tokenization and padding
3. Defining data fields
4. Generating the vocabulary
5. Making train and test data iterators
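
Independently of TorchText, the tokenization, padding and vocabulary steps listed above can be sketched in plain Python. This is only an illustration under assumptions: the helper names (`pad_or_truncate`, `build_vocab`), the whitespace tokenizer, the `<unk>`/`<pad>` ordering and the tiny `MAX_LEN` are hypothetical and chosen to keep the example readable.

```python
from collections import Counter

MAX_LEN = 6  # hypothetical; the recipe uses a much longer fixed length


def tokenize(text):
    # lower-case the review and split it on whitespace
    return text.lower().split()


def pad_or_truncate(tokens, max_len=MAX_LEN, pad_token="<pad>"):
    # short reviews are padded, long reviews are cut to max_len tokens
    if len(tokens) < max_len:
        return tokens + [pad_token] * (max_len - len(tokens))
    return tokens[:max_len]


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    # special tokens come first, then the remaining tokens by frequency
    counts = Counter(tok for tokens in token_lists for tok in tokens if tok not in specials)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}


reviews = ["A truly great movie", "Boring and far too long to sit through"]
padded = [pad_or_truncate(tokenize(r)) for r in reviews]
vocab_index = build_vocab(padded)
encoded = [[vocab_index.get(tok, vocab_index["<unk>"]) for tok in tokens] for tokens in padded]
print(padded)
print(encoded)
```

Every review ends up as a list of exactly `MAX_LEN` integer indices, which is the shape a batch iterator needs.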

In [None]:
# np.random.shuffle permutes the rows safely; random.shuffle can duplicate rows of a 2-D NumPy array
np.random.shuffle(data_block)
split_index = int(len(data_block) * split)

# write both splits into the directory that TabularDataset.splits reads from
with open('../Ch5/data/train.json', 'w') as train_file:
    for review, sentiment in data_block[:split_index]:
        train_file.write(json.dumps({'review': review, 'label': sentiment}) + "\n")
with open('../Ch5/data/test.json', 'w') as test_file:
    for review, sentiment in data_block[split_index:]:
        test_file.write(json.dumps({'review': review, 'label': sentiment}) + "\n")


In [None]:
def tokenize(reviews):
    return reviews
def pad_to_equal(x):
    # pad short reviews to 200 tokens and truncate longer ones to 200 tokens
    if len(x) < 200:
        return x + ['<pad>'] * (200 - len(x))
    else:
        return x[:200]
def to_categorical(x):
    x = int(x)
    if x == 1:
        return [0, 1]
    if x == 0:
        return [1, 0]


In [None]:
# defining data fields
REVIEW = data.Field(sequential=True , preprocessing = pad_to_equal , use_vocab = True, lower=True,batch_first=True)
LABEL = data.Field(is_target=True,use_vocab = False, sequential=False, preprocessing =to_categorical)
fields = {'review': ('review', REVIEW), 'label': ('label', LABEL)}

# constructing tabular dataset
train_data , test_data = data.TabularDataset.splits(
                            path = '../Ch5/data/',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields)

# constructing vocabulary
REVIEW.build_vocab(train_data, test_data)
LABEL.build_vocab(train_data, test_data)

# making iterator
train_iter, test_iter = data.Iterator.splits(
        (train_data, test_data), sort_key=lambda x: len(x.review),
        batch_sizes=(32,len(test_data)), device=device)

# Downloading Embeddings

For this experiment, I will be using 100-dimensional GloVe vectors trained on the "Wikipedia + Gigaword 5 (6B)" dataset. I will be using chakin to download the GloVe word vectors. Once the vectors are downloaded, the vocabulary of our train and test splits is mapped to the GloVe vectors using the snippet given below. Remember this method, because we will be using this shortcut in many places in this chapter.
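
To make the mapping concrete, here is a minimal plain-Python sketch of how a vocabulary can be aligned with GloVe-style vectors. It is an illustration only: the helper names `load_vectors` and `build_embedding_matrix`, the made-up 3-dimensional vectors and the choice of a zero vector for tokens without a pre-trained entry are all assumptions.

```python
import io


def load_vectors(handle):
    """Parse GloVe-style text: each line is a token followed by its float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(itos, vectors, dim):
    """Row i holds the vector of itos[i]; tokens without a vector get zeros (an assumption)."""
    matrix = []
    for tok in itos:
        if tok in vectors:
            matrix.append(vectors[tok])
        else:
            matrix.append([0.0] * dim)
    return matrix


# Tiny made-up 3-dimensional "GloVe" file, for illustration only
sample = io.StringIO("movie 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")
vectors = load_vectors(sample)
itos = ["<unk>", "<pad>", "great", "movie"]  # index-to-token list of a hypothetical vocabulary
matrix = build_embedding_matrix(itos, vectors, dim=3)
print(matrix)
```

The resulting matrix has one row per vocabulary entry, in vocabulary order, which is the layout an embedding layer expects when its weights are initialized from pre-trained vectors.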

In [None]:
import zipfile

import chakin

embed_exists = os.path.isfile('../embeddings/glove.6B.zip')
if not embed_exists:
    print("Downloading GloVe embeddings; if the download fails, delete `embeddings/glove.6B.zip` and retry")
    chakin.search(lang='English')
    chakin.download(number=16, save_dir='../embeddings')
    # extract the downloaded archive into the embeddings directory
    with zipfile.ZipFile("../embeddings/glove.6B.zip", 'r') as zip_ref:
        zip_ref.extractall("../embeddings/")

## Constructing vocab

In [None]:
vec = vocab.Vectors(name = "glove.6B.100d.txt",cache = "../embeddings/glove.6B/")
REVIEW.build_vocab(train_data, test_data, max_size=100000, vectors=vec)

In [None]:
review_vocab = REVIEW.vocab

# The Model

The model is the same as the one discussed in Chapter 5 under **Using Word Level CNN**. It is an implementation of "Convolutional Neural Networks for Sentence Classification" by Yoon Kim. According to this paper, with pre-trained embeddings one can achieve excellent results using just a few CNN layers.

![](../Ch5/figures/Using_Word_Level_CNN_for_Text_Classsification.png)

Figure: Architecture of the model, which takes word-level features and performs text classification


To keep the implementation simple and easy to understand, the snapshot ensemble is built on top of the Word Level CNN For Text Classification model, with the same dataset, loss function and optimizer. We will be incorporating 3 additional mechanisms:

1. A function to decrease the learning rate monotonically within each cycle
2. A way to record a snapshot at the end of each cycle
3. A way to use the recorded snapshots for prediction


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_Text(nn.Module):    
    def __init__(self, embed_num, embed_dim, class_num, kernel_num, kernel_sizes, dropout, static, stride):
        super(CNN_Text, self).__init__() 
        self.embed_num = embed_num
        self.embed_dim = embed_dim 
        self.class_num = class_num 
        self.kernel_num = kernel_num
        self.kernel_sizes  = kernel_sizes 
        self.dropout = dropout
        self.static = static
        
        self.embedding = nn.Embedding(embed_num, embed_dim)
        self.embedding.weight.data.copy_(review_vocab.vectors)
        self.embedding.weight.requires_grad = True
        self.convs1 = nn.ModuleList([nn.Conv2d(in_channels = 1, out_channels=kernel_num, kernel_size= K,stride= stride) for K in kernel_sizes])
        '''
        self.conv13 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 3,stride= 100)
        self.conv14 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 4,stride= 100)
        self.conv15 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 5,stride= 100)
        '''
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(len(kernel_sizes)*kernel_num, class_num)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3) 
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        x = self.embedding(x)  # (N, W, D)
        x = x.unsqueeze(1)  # (N, Ci, W, D)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W), ...]*len(Ks)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)
        x = torch.cat(x, 1)
        '''
        x1 = self.conv_and_pool(x,self.conv13) 
        x2 = self.conv_and_pool(x,self.conv14) 
        x3 = self.conv_and_pool(x,self.conv15)
        x = torch.cat((x1, x2, x3), 1)
        '''
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = F.relu(self.fc1(x))   # (N, C)
        logit  = torch.softmax(logit, dim=1)
        return logit

# Training helpers

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.argmax(preds, dim=1)
    correct = (rounded_preds == torch.argmax(y, dim=1)).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc

In [None]:
def train(model, iterator, optimizer, criterion):
    """
    To train the model
    """
    epoch_loss = []
    epoch_acc = []
    model.train()
    
    for batch in iterator:
        feature, target = batch.review, batch.label
        optimizer.zero_grad()
        predictions = model(feature)            
        loss = criterion(predictions.type(torch.FloatTensor), target.type(torch.FloatTensor))
        loss.backward()
        optimizer.step()
        acc = binary_accuracy(predictions.type(torch.FloatTensor), target.type(torch.FloatTensor))
        epoch_loss.append(loss.item())
        epoch_acc.append(acc.item())
        
    return model, sum(epoch_loss) / len(epoch_loss), sum(epoch_acc) / len(epoch_acc)

In [None]:
def test_accuracy_calculator(model, test_iterator):
    """
    To calculate test accuracy
    """
    epoch_acc = []
    model.eval()  # disable dropout while evaluating
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch in test_iterator:
            feature, target = batch.review, batch.label
            predictions = model(feature)
            acc = binary_accuracy(predictions.type(torch.FloatTensor), target.type(torch.FloatTensor))
            epoch_acc.append(acc.item())
    return sum(epoch_acc) / len(epoch_acc)

## Defining Model Parameters

In [None]:
embed_num = len(REVIEW.vocab)
class_num = len(LABEL.vocab) - 1
kernel_sizes = [int(k) for k in '2,3,4,5'.split(',')]
embed_dim = 100
stride = 100
kernel_num  = 8
dropout = 0.2
static = False

In [None]:
cnn = CNN_Text( embed_num, embed_dim, class_num, kernel_num, kernel_sizes, dropout, static, stride)
cnn = cnn.to(device)

## Defining optimizer, losses and training loops

In [None]:
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.01, momentum=0.9)
criterion = nn.BCELoss()
criterion = criterion.to(device)

**The learning rate modifier:** The function given below uses a cyclic annealing schedule that lowers the learning rate quickly so that the model converges into the nearest local minimum.

Figure. Cyclic changes in the learning rate. The X-axis shows the cycles and the Y-axis shows the learning rate. Each cycle starts with a high learning rate that then decreases quickly and monotonically so that training converges into a local minimum; the model at this local minimum is kept as one of the snapshot models.

This functionality is implemented in the Python function `proposed_lr`.



$$α(t) = \frac{α_0}{2} \left( \cos\left( \frac{π \, \mathrm{mod}(t-1, \lceil T/M \rceil)}{\lceil T/M \rceil} \right) + 1 \right)$$


In [None]:
def proposed_lr(initial_lr, iteration, epoch_per_cycle):
    return initial_lr * (math.cos(math.pi * iteration / epoch_per_cycle) + 1) / 2
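
As a worked example, assume an initial learning rate of 0.1 and 5 epochs per cycle, the values used for training later in this notebook. Calling `proposed_lr(0.1, j, 5)` for $j = 0, \dots, 4$ then gives approximately:

$$
\begin{aligned}
j = 0 &: \quad \tfrac{0.1}{2}\left(\cos 0 + 1\right) = 0.1 \\
j = 1 &: \quad \tfrac{0.1}{2}\left(\cos \tfrac{π}{5} + 1\right) \approx 0.0905 \\
j = 2 &: \quad \tfrac{0.1}{2}\left(\cos \tfrac{2π}{5} + 1\right) \approx 0.0655 \\
j = 3 &: \quad \tfrac{0.1}{2}\left(\cos \tfrac{3π}{5} + 1\right) \approx 0.0345 \\
j = 4 &: \quad \tfrac{0.1}{2}\left(\cos \tfrac{4π}{5} + 1\right) \approx 0.0095
\end{aligned}
$$

The rate therefore falls by roughly a factor of ten within a cycle before the next cycle starts again from 0.1.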

# Training

**Recording Snapshots:** The model itself does not change; only the training schedule does. We define the total number of epochs and the number of cycles, and divide the epochs equally among the cycles by dividing the total epochs by the number of cycles. An initial learning rate is fixed. Here we have fixed the total at 300 epochs, and training is carried out over 60 cycles, so each cycle has 300/60 = 5 epochs. In each cycle, the learning rate is lowered quickly using the `proposed_lr` function so that the loss can decrease and converge. In total, 60 model snapshots are collected, one per cycle, with `snapshots = []` as the model accumulator. Each snapshot holds the weights of one individual model. In PyTorch the weights of a model can be accessed by calling its `state_dict()` method, so we get the weights of our model using `cnn.state_dict()`.
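
The bookkeeping can be seen on a toy problem before it is applied to the CNN. The sketch below is an illustration under assumptions, not the notebook's training code: it fits a single weight to the made-up relation y = 2x with plain gradient descent, uses a hypothetical `cosine_lr` helper for the schedule, and stores the parameters in a small dictionary that stands in for a model's state.

```python
import copy
import math


def cosine_lr(initial_lr, epoch_in_cycle, epochs_per_cycle):
    # cosine annealing within one cycle, starting from initial_lr
    return initial_lr * (math.cos(math.pi * epoch_in_cycle / epochs_per_cycle) + 1) / 2


# Toy data (hypothetical): samples of y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
params = {"w": [0.0]}  # mutable container standing in for a model's weights

epochs, cycles, initial_lr = 12, 3, 0.05
epochs_per_cycle = epochs // cycles
snapshots = []

for cycle in range(cycles):
    for j in range(epochs_per_cycle):
        lr = cosine_lr(initial_lr, j, epochs_per_cycle)  # restarts from initial_lr every cycle
        for x, y in data:
            grad = 2 * (params["w"][0] * x - y) * x  # derivative of (w*x - y)^2 with respect to w
            params["w"][0] -= lr * grad
    # deep copy, so that later updates to params do not overwrite this snapshot
    snapshots.append(copy.deepcopy(params))

print([s["w"][0] for s in snapshots])
```

Without the deep copy, every entry of `snapshots` would point at the same mutable object, and all of them would silently end up equal to the final weights.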

In [None]:
epochs  = 300
cycles = 60
snapshots = []
_lr_list, _loss_list = [], []
count = 0
initial_lr = 0.1
epochs_per_cycle = epochs // cycles

In [None]:
import copy

writer = SummaryWriter()
total_iterations = 0
cnn.train()
for i in range(cycles):
    for j in tqdm(range(epochs_per_cycle)):
        _epoch_loss = 0
        # every cycle restarts from initial_lr and anneals within the cycle
        lr = proposed_lr(initial_lr, j, epochs_per_cycle)
        # update the optimizer in place; editing optimizer.state_dict() only changes a copy
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        for batch in train_iter:
            feature, target = batch.review, batch.label
            optimizer.zero_grad()
            predictions = cnn(feature.to(device))
            loss = criterion(predictions.type(torch.FloatTensor), target.type(torch.FloatTensor))
            loss.backward()
            optimizer.step()
            _epoch_loss = _epoch_loss + loss.item()
        writer.add_scalar('epoch_loss', _epoch_loss, total_iterations)
        writer.add_scalar('learning_rate', lr, total_iterations)
        total_iterations = total_iterations + 1
    # deep copy, because state_dict() returns references to the live parameter tensors
    snapshots.append(copy.deepcopy(cnn.state_dict()))

# Test, Combining all Snapshots

Now we have 60 snapshots, and we can combine them to make predictions.

**Predicting using Snapshots:** Prediction using the snapshots accumulated above is implemented in the `test_snapshot_model` function. This function takes the following parameters:

1. `Model`: The original PyTorch model class
2. `weights`: All the snapshots, each holding different weights
3. `num_last_model`: Number of last models to be used for the prediction
4. `test_iter`: Test data iterator
5. `model_param`: Parameters required to construct the model before loading the weights
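
The averaging step at the heart of this function can be sketched in plain Python. The helper names `average_predictions` and `predicted_classes` and the probability values are made up for illustration; in the notebook the per-model outputs come from the CNN snapshots.

```python
def average_predictions(per_model_probs):
    """per_model_probs[m][n][c] is the probability from model m for sample n and class c."""
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    return [
        [sum(per_model_probs[m][n][c] for m in range(n_models)) / n_models for c in range(n_classes)]
        for n in range(n_samples)
    ]


def predicted_classes(probs):
    # index of the largest averaged probability for every sample
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Made-up softmax outputs of three snapshots for two reviews
snapshot_outputs = [
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.4, 0.6], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]
averaged = average_predictions(snapshot_outputs)
labels = predicted_classes(averaged)
print(averaged, labels)
```

In this made-up case the second snapshot alone would assign the first review to class 1, while the averaged prediction assigns it to class 0, which is the kind of disagreement the ensemble is meant to smooth out.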

In [None]:
def test_snapshot_model(Model, weights, num_last_model, test_iter, model_param):
    # keep only the last `num_last_model` snapshots
    weights = weights[len(weights) - num_last_model:]

    # build one model per snapshot and load the snapshot weights into it
    model_list = []
    for weight in weights:
        model = Model(**model_param)
        model.load_state_dict(weight)
        model.to(device)
        model.eval()  # disable dropout for prediction
        model_list.append(model)

    # predict with every model and average the predictions
    batch_loss, batch_acc = [], []
    with torch.no_grad():
        for batch in test_iter:
            feature, target = batch.review.to(device), batch.label.to(device)
            output_list = torch.stack([model(feature) for model in model_list])
            output = torch.mean(output_list, 0)
            batch_loss.append(criterion(output.float(), target.float()).item())
            batch_acc.append(binary_accuracy(output.float(), target.float()).item())
    metrices = {"Accuracy": 100 * sum(batch_acc) / len(batch_acc),
                "Test_Loss": sum(batch_loss) / len(batch_loss)}
    return metrices

The function can be invoked as given below:

In [None]:
model_param = {
    "embed_num" : embed_num, 
    "embed_dim" : embed_dim, 
    "class_num" : class_num, 
    "kernel_num" : kernel_num, 
    "kernel_sizes" : kernel_sizes, 
    "dropout" : dropout, 
    "static" : static, 
    "stride" : stride
}

metrices = test_snapshot_model(CNN_Text,snapshots, 10,test_iter, model_param)
print(metrices)

Using snapshot ensembling, the accuracy was found to be 75.8% and the minimum binary cross-entropy loss was found to be 1.35. The figure below shows the change in the loss as the cycles progress.

![](figures/epoch_loss_snapshot.png)

Figure. Decrease in the loss over the epochs with multiple learning rate cycles

Snapshot ensembling is generally applied to models with millions of parameters. For the purpose of illustration, I have applied it to a smaller model. When it is applied to a bigger model, the loss will fluctuate as the learning rate changes.