## Compressing a multilayer perceptron

In this notebook, we compress a multilayer perceptron (MLP) with the 784-500-300-10 neuron architecture considered in [1].

The notebook contains the following experiments:

    1. We begin by training a baseline model (Table 1, row : base).
    2. Magnitude-based pruning with different criteria
        a. Table 1, row : random
        b. Table 1, row : L2
        c. Table 1, row : L1
    3. Cluster pruning
        - Table 1, row : CUP (manual)
    4. Plot of accuracy vs compression for input/output/both features (Fig 6 b) 
 
---
    
 [1] Liu, Zhuang, et al. "Learning efficient convolutional networks through network slimming." Proceedings of the IEEE International Conference on Computer Vision. 2017.
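
For orientation, the 784-500-300-10 architecture can be sketched in PyTorch as below. The notebook's actual model is `src.model.ANN`, which is not shown here; the class name `MLP`, the ReLU activations, and the `nn.Sequential` layout are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Hypothetical 784-500-300-10 perceptron; the real `ANN` class may differ."""

    def __init__(self, sizes=(784, 500, 300, 10)):
        super().__init__()
        layers = []
        for i, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
            layers.append(nn.Linear(fan_in, fan_out))
            # Assumed: ReLU after every layer except the output layer.
            if i < len(sizes) - 2:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Flatten 1x28x28 MNIST images into 784-dimensional vectors.
        return self.net(x.flatten(start_dim=1))


model = MLP()
num_params = sum(p.numel() for p in model.parameters())
logits = model(torch.zeros(2, 1, 28, 28))
print(num_params, tuple(logits.shape))
```

With biases included, this layout has 545,810 trainable parameters, which is the quantity that pruning reduces.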

In [1]:
import sys; sys.argv=['']; del sys
%load_ext autoreload
%autoreload 2
import matplotlib.pyplot as plt
%matplotlib inline

import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

import numpy as np
import random
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

from src.utils import plot_tsne,fancy_dendrogram,save_obj,load_obj
from src.model import ANN,load_model
from src.prune_model import prune_model
from src.cluster_model import cluster_model
from src.train_test import train,test,adjust_learning_rate

### 1. Train the baseline model (Table 2, row : base)

We follow the same training hyperparameters as in [1]

In [2]:
# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=128, metavar='N',
                    help='input batch size for training (default: 128)')
parser.add_argument('--test-batch-size', type=int, default=256, metavar='N',
                    help='input batch size for testing (default: 256)')
parser.add_argument('--epochs', type=int, default=30, metavar='N',
                    help='number of epochs to train (default: 30)')
parser.add_argument('--lr', type=float, default=0.1, metavar='LR',
                    help='learning rate (default: 0.1)')
parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
                    help='SGD momentum (default: 0.9)')
parser.add_argument('--weight_decay', type=float, default=1e-4, metavar='WD',
                    help='weight decay (default: 0.0001)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=12346, metavar='S',
                    help='random seed (default: 12346)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='how many batches to wait before logging training status')
parser.add_argument('--checkpoint_path', type=str, default='./checkpoints/ann.pth', metavar='S',
                    help='path to store model training checkpoints')


#set device to CPU or GPU
args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

#set all seeds for reproducibility
def set_random_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.backends.cudnn.deterministic = True

set_random_seed(args.seed)


kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor()#,
                       #transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor()#,
                       #transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.test_batch_size, shuffle=True, **kwargs)


ann = ANN().to(device)
        
optimizer = optim.SGD(ann.parameters(),lr=args.lr,momentum=args.momentum,weight_decay=args.weight_decay,nesterov=False)

if not os.path.isfile(args.checkpoint_path):
    for epoch in range(1, args.epochs + 1):
        adjust_learning_rate(args,optimizer,epoch)
        train(args, ann, device, train_loader, optimizer, epoch)
        test_loss,test_accuracy = test(args, ann, device, test_loader)
        
          
    torch.save({
                'epoch': epoch,
                'model_state_dict': ann.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': test_loss,
            }, args.checkpoint_path, pickle_protocol=4)


Test set: Average loss: 0.1124, Accuracy: 9624/10000 (96%)


Test set: Average loss: 0.0925, Accuracy: 9721/10000 (97%)


Test set: Average loss: 0.0813, Accuracy: 9753/10000 (98%)


Test set: Average loss: 0.0731, Accuracy: 9774/10000 (98%)


Test set: Average loss: 0.0703, Accuracy: 9788/10000 (98%)


Test set: Average loss: 0.0662, Accuracy: 9785/10000 (98%)


Test set: Average loss: 0.0640, Accuracy: 9803/10000 (98%)


Test set: Average loss: 0.0688, Accuracy: 9797/10000 (98%)


Test set: Average loss: 0.0659, Accuracy: 9812/10000 (98%)

Changing Learning Rate to 0.010000000000000002

Test set: Average loss: 0.0511, Accuracy: 9857/10000 (99%)


Test set: Average loss: 0.0503, Accuracy: 9853/10000 (99%)


Test set: Average loss: 0.0495, Accuracy: 9857/10000 (99%)


Test set: Average loss: 0.0493, Accuracy: 9860/10000 (99%)


Test set: Average loss: 0.0489, Accuracy: 9861/10000 (99%)


Test set: Average loss: 0.0489, Accuracy: 9862/10000 (99%)


Test set: Average loss: 0.0486, Accur


Test set: Average loss: 0.0485, Accuracy: 9863/10000 (99%)


Test set: Average loss: 0.0484, Accuracy: 9863/10000 (99%)


Test set: Average loss: 0.0484, Accuracy: 9863/10000 (99%)


Test set: Average loss: 0.0484, Accuracy: 9863/10000 (99%)
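
The log above shows the learning rate dropping from 0.1 to roughly 0.01 after nine evaluations. `adjust_learning_rate` lives in `src.train_test` and is not shown, so the following is only a sketch of a step-decay schedule that would produce such a drop. The function name `step_decay_lr`, the `milestones` argument, and the factor `gamma=0.1` are assumptions; the single milestone at epoch 10 is inferred from the log, and any later milestones are not visible in the truncated output.

```python
def step_decay_lr(base_lr, epoch, milestones, gamma=0.1):
    """Return base_lr scaled by gamma once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** drops)


# Hypothetical schedule: a single drop at epoch 10 (epochs are 1-indexed).
milestones = (10,)
schedule = [step_decay_lr(0.1, epoch, milestones) for epoch in range(1, 31)]

# To apply a value to a torch optimizer, one would set it on every param group:
#     for group in optimizer.param_groups:
#         group['lr'] = lr
print(schedule[8], schedule[9])
```

The floating-point product `0.1 * 0.1` is not exactly 0.01, which is consistent with the value `0.010000000000000002` printed in the log.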



### 2. Test magnitude-based pruning from [2]

[2] Li, Hao, et al. "Pruning filters for efficient convnets." arXiv preprint arXiv:1608.08710 (2016).
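
Magnitude-based pruning scores each unit by the norm of its weight vector and removes the lowest-scoring units. The sketch below illustrates the idea for one hidden layer of an MLP using NumPy. It is not the code in `src.prune_model`; the function name `prune_neurons` and its arguments are hypothetical, and scoring on incoming weights only is an assumption.

```python
import numpy as np


def prune_neurons(W_in, b_in, W_out, num_prune, criterion="l1", rng=None):
    """Remove `num_prune` hidden units between two dense layers.

    W_in:  (hidden, fan_in) weights that produce the hidden layer
    b_in:  (hidden,) biases of the hidden layer
    W_out: (fan_out, hidden) weights that consume the hidden layer
    """
    hidden = W_in.shape[0]
    if criterion == "l1":
        scores = np.abs(W_in).sum(axis=1)
    elif criterion == "l2":
        scores = np.sqrt((W_in ** 2).sum(axis=1))
    elif criterion == "random":
        if rng is None:
            rng = np.random.default_rng(0)
        scores = rng.random(hidden)
    else:
        raise ValueError("unknown criterion: {}".format(criterion))

    # Drop the units with the smallest scores; keep the rest in original order.
    keep = np.sort(np.argsort(scores)[num_prune:])
    # Removing a hidden unit deletes a row of W_in and a column of W_out.
    return W_in[keep], b_in[keep], W_out[:, keep], keep


W_in = np.array([[1.0, 1.0], [0.1, 0.1], [3.0, 0.0], [0.0, 0.5]])
b_in = np.zeros(4)
W_out = np.ones((3, 4))
W_in_p, b_in_p, W_out_p, keep = prune_neurons(W_in, b_in, W_out, num_prune=2)
print(keep, W_in_p.shape, W_out_p.shape)
```

Note that every pruned unit shrinks two weight matrices at once, which is why removing 80% of the hidden units removes more than 80% of the parameters between the two hidden layers.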

#### Load pre-trained model

In [7]:
set_random_seed(args.seed)
ann,optimizer = load_model('ann','sgd',args)
test(args, ann, device, test_loader)

loading state from epoch 30 and test loss 0.04843035659790039

Test set: Average loss: 0.0484, Accuracy: 9863/10000 (99%)



(0.04843035700321197, 98.63)

#### Prune 80% of the neurons (filters) from each hidden layer

In [8]:
pruning_args = {
    'criterion' : 'random',
    'use_bias' : True,
    'prune_layers' : {1:400, 3:240},
    'conv_feature_size' : 4
}

model_modifier = prune_model(ann,pruning_args)
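
As a quick sanity check on what an 80% reduction means for model size, the arithmetic below counts parameters before and after pruning. It assumes that the values in `prune_layers` are the number of units removed from the 500- and 300-unit hidden layers (400 of 500 and 240 of 300 are both 80%); the helper `mlp_param_count` is hypothetical.

```python
def mlp_param_count(sizes, use_bias=True):
    """Count weights (and optionally biases) of a fully connected network."""
    return sum(
        fan_in * fan_out + (fan_out if use_bias else 0)
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    )


base_sizes = [784, 500, 300, 10]
pruned_sizes = [784, 500 - 400, 300 - 240, 10]  # assumed: 400 and 240 units removed

base_params = mlp_param_count(base_sizes)      # 545810
pruned_params = mlp_param_count(pruned_sizes)  # 85170
compression = base_params / pruned_params      # roughly 6.4x fewer parameters
print(base_params, pruned_params, round(compression, 2))
```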

#### a. row: random (Table 2)

In [9]:
set_random_seed(args.seed)

path = args.checkpoint_path[:-4] + '_small_random.pth'
pruned_ann = model_modifier.prune_model('random')
pruned_ann = pruned_ann.to(device)

val_loss_no_retrain, val_accuracy_no_retrain = test(args, pruned_ann, device, test_loader,verbose=False)
optimizer = optim.SGD(pruned_ann.parameters(),lr=args.lr,momentum=args.momentum,weight_decay=args.weight_decay,nesterov=False)

best_val_accuracy_retrain = 0

if not os.path.isfile(path):    
    for epoch in range(1, args.epochs+1):
        adjust_learning_rate(args,optimizer,epoch)
        train(args, pruned_ann, device, train_loader, optimizer, epoch)
        val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader)

        if val_accuracy_retrain > best_val_accuracy_retrain:  
            torch.save(pruned_ann, path, pickle_protocol=4)            
            best_val_accuracy_retrain = val_accuracy_retrain   
else:
    pruned_ann = torch.load(path)
    val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader,verbose=False)
    best_val_accuracy_retrain = val_accuracy_retrain
        
print('Accuracy post pruning : {} (without retraining), {} (with retraining)'.format(val_accuracy_no_retrain,best_val_accuracy_retrain))    

Pruning using :  random
Accuracy post pruning : 41.59 (without retraining), 98.45 (with retraining)


#### b. row: L2 (Table 2)

In [10]:
set_random_seed(args.seed)

path = args.checkpoint_path[:-4] + '_small_l2.pth'
pruned_ann = model_modifier.prune_model('l2')
pruned_ann = pruned_ann.to(device)

val_loss_no_retrain, val_accuracy_no_retrain = test(args, pruned_ann, device, test_loader,verbose=False)
optimizer = optim.SGD(pruned_ann.parameters(),lr=args.lr,momentum=args.momentum,weight_decay=args.weight_decay,nesterov=False)

best_val_accuracy_retrain = 0

if not os.path.isfile(path):    
    for epoch in range(1, args.epochs+1):
        adjust_learning_rate(args,optimizer,epoch)
        train(args, pruned_ann, device, train_loader, optimizer, epoch)
        val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader)

        if val_accuracy_retrain > best_val_accuracy_retrain:  
            torch.save(pruned_ann, path, pickle_protocol=4)            
            best_val_accuracy_retrain = val_accuracy_retrain   
else:
    pruned_ann = torch.load(path)
    val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader,verbose=False)
    best_val_accuracy_retrain = val_accuracy_retrain
        
print('Accuracy post pruning : {} (without retraining), {} (with retraining)'.format(val_accuracy_no_retrain,best_val_accuracy_retrain))    

Pruning using :  l2
Accuracy post pruning : 79.59 (without retraining), 98.47 (with retraining)


#### c. row: L1 (Table 2)

In [11]:
set_random_seed(args.seed)

path = args.checkpoint_path[:-4] + '_small_l1.pth'
pruned_ann = model_modifier.prune_model('l1')
pruned_ann = pruned_ann.to(device)

val_loss_no_retrain, val_accuracy_no_retrain = test(args, pruned_ann, device, test_loader, verbose=False)
optimizer = optim.SGD(pruned_ann.parameters(),lr=args.lr,momentum=args.momentum,weight_decay=args.weight_decay,nesterov=False)

best_val_accuracy_retrain = 0

if not os.path.isfile(path):    
    for epoch in range(1, args.epochs+1):
        adjust_learning_rate(args,optimizer,epoch)
        train(args, pruned_ann, device, train_loader, optimizer, epoch)
        val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader)

        if val_accuracy_retrain > best_val_accuracy_retrain:  
            torch.save(pruned_ann, path, pickle_protocol=4)            
            best_val_accuracy_retrain = val_accuracy_retrain   
else:
    pruned_ann = torch.load(path)
    val_loss_retrain, val_accuracy_retrain = test(args, pruned_ann, device, test_loader, verbose=False)
    best_val_accuracy_retrain = val_accuracy_retrain
        
print('Accuracy post pruning : {} (without retraining), {} (with retraining)'.format(val_accuracy_no_retrain,best_val_accuracy_retrain))    

Pruning using :  l1
Accuracy post pruning : 80.15 (without retraining), 98.38 (with retraining)


### 3. Test proposed cluster pruning (CUP)

#### Load pre-trained model

In [12]:
set_random_seed(args.seed)
ann,optimizer = load_model('ann','sgd',args)
test(args, ann, device, test_loader)

loading state from epoch 30 and test loss 0.04843035659790039

Test set: Average loss: 0.0484, Accuracy: 9863/10000 (99%)



(0.04843035700321197, 98.63)

#### Prune 80% of filters

In [13]:
cluster_args = {
    'cluster_layers' : {1:400, 3:240},
    'conv_feature_size' : 1,
    'reshape_exists' : False,
    'features' : 'both',
    'channel_reduction' : 'fro',
    'use_bias' : False,
    'linkage_method' : 'ward',
    'distance_metric' : 'euclidean',
    'cluster_criterion' : 'hierarchical_trunc',
    'distance_threshold' : 1.60,
    'merge_criterion' : 'max_l2_norm',    
    'verbose' : False
}

path = args.checkpoint_path[:-4] + '_small_cup.pth' 
model_modifier = cluster_model(ann,cluster_args)
compressed_model = model_modifier.cluster_model()
compressed_model = compressed_model.to(device)

val_loss_no_retrain, val_accuracy_no_retrain = test(args, compressed_model, device, test_loader,verbose=False)

args.lr = 0.1
args.epochs = 30
optimizer = optim.SGD(compressed_model.parameters(),lr=args.lr,momentum=args.momentum,weight_decay=args.weight_decay,nesterov=False)

best_val_accuracy_retrain = 0

if not os.path.isfile(path):    
    for epoch in range(1, args.epochs+1):
        adjust_learning_rate(args,optimizer,epoch)
        train(args, compressed_model, device, train_loader, optimizer, epoch)
        val_loss_retrain, val_accuracy_retrain = test(args, compressed_model, device, test_loader)

        if val_accuracy_retrain > best_val_accuracy_retrain:  
            torch.save(compressed_model, path, pickle_protocol=4)            
            best_val_accuracy_retrain = val_accuracy_retrain   
else:
    compressed_model = torch.load(path)
    val_loss_retrain, val_accuracy_retrain = test(args, compressed_model, device, test_loader, verbose=False)
    best_val_accuracy_retrain = val_accuracy_retrain
        
print('Accuracy post pruning : {} (without retraining), {} (with retraining)'.format(val_accuracy_no_retrain,best_val_accuracy_retrain))    

Accuracy post pruning : 85.37 (without retraining), 98.63 (with retraining)
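
The `cluster_args` above select Ward linkage with Euclidean distance, features built from `both` incoming and outgoing weights, and a `max_l2_norm` merge rule. The sketch below illustrates that general idea with SciPy: cluster the hidden units by their weight vectors, then keep one representative per cluster. It is not the implementation in `src.cluster_model`. The function name `cluster_prune` is hypothetical, and it cuts the dendrogram with SciPy's `maxclust` criterion instead of reproducing the notebook's `hierarchical_trunc` criterion or `distance_threshold`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_prune(W_in, W_out, num_keep, features="both"):
    """Return indices of hidden units to keep, one per cluster.

    W_in:  (hidden, fan_in) incoming weights of the hidden layer
    W_out: (fan_out, hidden) outgoing weights of the hidden layer
    """
    if features == "incoming":
        feats = W_in
    elif features == "outgoing":
        feats = W_out.T
    elif features == "both":
        feats = np.concatenate([W_in, W_out.T], axis=1)
    else:
        raise ValueError("unknown features: {}".format(features))

    # Agglomerative clustering of the units, cut into at most `num_keep` clusters.
    Z = linkage(feats, method="ward", metric="euclidean")
    labels = fcluster(Z, t=num_keep, criterion="maxclust")

    # Assumed merge rule: keep the member with the largest L2 norm per cluster.
    norms = np.linalg.norm(feats, axis=1)
    keep = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        keep.append(members[np.argmax(norms[members])])
    return np.sort(np.array(keep))


W_in = np.array([[1.0, 0.0], [1.1, 0.0], [0.0, 5.0], [0.0, 5.2]])
W_out = np.zeros((3, 4))
keep = cluster_prune(W_in, W_out, num_keep=2)
print(keep)
```

In the toy example, units 0 and 1 are near-duplicates, as are units 2 and 3, so each pair collapses to a single unit. Switching `features` between `incoming`, `outgoing`, and `both` changes only which weights define similarity.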


### 4. Plot of accuracy vs compression for input/output/both features (Fig 6 b) 

In [15]:
incoming_loss,incoming_acc = [],[]
outgoing_loss,outgoing_acc = [],[]
both_loss,both_acc = [],[]

cluster_args = {
    'cluster_layers' : {1:400, 3:240},
    'conv_feature_size' : 1,
    'reshape_exists' : False,
    'features' : 'both',
    'channel_reduction' : 'fro',
    'use_bias' : False,
    'linkage_method' : 'ward',
    'distance_metric' : 'euclidean',
    'cluster_criterion' : 'hierarchical_trunc',
    'distance_threshold' : 1.60,
    'merge_criterion' : 'max_l2_norm',    
    'verbose' : False
}

    

for drop_percentage in np.linspace(0.6,0.9,31):
    
    # number of nodes to remove from the 500- and 300-unit hidden layers
    num_drop_nodes = [int(num_nodes * drop_percentage) for num_nodes in [500,300]]
    
    cluster_args['cluster_layers'] = {1:num_drop_nodes[0], 3:num_drop_nodes[1]}
        
    set_random_seed(args.seed)
    ann,optimizer = load_model('ann','sgd',args)
    cluster_args['features'] = 'incoming'
    model_modifier = cluster_model(ann,cluster_args)
    compressed_model = model_modifier.cluster_model()
    compressed_model = compressed_model.to(device)
    loss,acc = test(args, compressed_model, device, test_loader,verbose=False)
    incoming_loss.append(loss)
    incoming_acc.append(acc)
    
    set_random_seed(args.seed)
    ann,optimizer = load_model('ann','sgd',args)
    cluster_args['features'] = 'outgoing'
    model_modifier = cluster_model(ann,cluster_args)
    compressed_model = model_modifier.cluster_model()
    compressed_model = compressed_model.to(device)
    loss,acc = test(args, compressed_model, device, test_loader,verbose=False)
    outgoing_loss.append(loss)
    outgoing_acc.append(acc)
    
    set_random_seed(args.seed)
    ann,optimizer = load_model('ann','sgd',args)
    cluster_args['features'] = 'both'
    model_modifier = cluster_model(ann,cluster_args)
    compressed_model = model_modifier.cluster_model()
    compressed_model = compressed_model.to(device)
    loss,acc = test(args, compressed_model, device, test_loader,verbose=False)
    both_loss.append(loss)
    both_acc.append(acc)

loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading state from epoch 30 and test loss 0.04843035659790039
loading 
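
The sweep above maps 31 evenly spaced drop fractions between 0.6 and 0.9 to per-layer node counts. The sketch below tabulates those counts on their own. It uses `round()` where the notebook's loop uses `int()`, which is a deliberate difference: values produced by `np.linspace` may sit slightly below their nominal value, and `int()` would then truncate to one node fewer. The variable names are illustrative.

```python
import numpy as np

hidden_sizes = [500, 300]
sweep = []
for drop_fraction in np.linspace(0.6, 0.9, 31):
    # Round to the nearest whole node instead of truncating.
    num_drop = [int(round(n * drop_fraction)) for n in hidden_sizes]
    num_keep = [n - d for n, d in zip(hidden_sizes, num_drop)]
    sweep.append((float(drop_fraction), num_drop, num_keep))

first, last = sweep[0], sweep[-1]
print(first[1], first[2])  # nodes dropped / kept at the 0.6 end of the sweep
print(last[1], last[2])    # nodes dropped / kept at the 0.9 end of the sweep
```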

In [14]:
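from matplotlib import rcParams

plt.rc('font', family='serif')
plt.rc('xtick', labelsize='large')
plt.rc('ytick', labelsize='large')
plt.rcParams.update({'font.size': 16})
# figure.autolayout applies tight_layout to every new figure, so no explicit
# plt.tight_layout() call is needed (calling it here would create an empty figure)
rcParams.update({'figure.autolayout': True})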

plt.figure(figsize=(6,5))
plt.plot(np.linspace(0.6,0.9,31)[:],incoming_acc[:],color='blue',linewidth=2.5)
plt.plot(np.linspace(0.6,0.9,31)[:],outgoing_acc[:],color='green',linewidth=2.5)
plt.plot(np.linspace(0.6,0.9,31)[:],both_acc[:],color='red',linewidth=2.5)
plt.legend(['incoming','outgoing','both'])
plt.title('Test accuracy vs Percent pruned')
plt.xlabel('Percent pruned')
plt.ylabel('Test accuracy')
plt.grid(True)
# plt.show()
os.makedirs('figures', exist_ok=True)  # make sure the output directory exists
plt.savefig('figures/features_acc_vs_compression.png')




plt.figure(figsize=(6,5))
plt.plot(np.linspace(0.6,0.90,31)[:],incoming_loss[:])
plt.plot(np.linspace(0.6,0.90,31)[:],outgoing_loss[:])
plt.plot(np.linspace(0.6,0.90,31)[:],both_loss[:])

plt.legend(['incoming + cluster + maxnorm','outgoing + cluster + maxnorm','both + cluster + maxnorm'])

plt.title('test loss vs percentage compression')
plt.xlabel('percentage compression')
plt.ylabel('test loss')
plt.grid(True)

# plt.savefig('figures/features_loss_vs_compression.png')
plt.show()

NameError: name 'incoming_acc' is not defined
