# The Lottery Ticket Hypothesis - Conv-4 CNN for CIFAR-10 _winning ticket_ verification:

The Conv-4 convolutional neural network has the following architecture:

1. __Convolutional Layers:__ 64, 64, pool
1. __Convolutional Layers:__ 128, 128, pool
1. __Dense Layers:__ 256, 256, 10

Each convolutional layer uses a 3 x 3 kernel with a stride of 1 and `same` padding.

Each max-pooling layer uses a 2 x 2 window with a stride of 2.
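
As a rough check on the architecture described above, the parameter count of each layer can be worked out by hand. The sketch below is plain Python, not part of the notebook's training code; the helper names `conv_params` and `dense_params` are introduced here only for illustration.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


# Two 2 x 2 max-pools halve the 32 x 32 input twice, leaving 8 x 8 maps
# with 128 channels before the first dense layer.
flat_units = (32 // 2 // 2) ** 2 * 128

layer_params = [
    conv_params(3, 64),
    conv_params(64, 64),
    conv_params(64, 128),
    conv_params(128, 128),
    dense_params(flat_units, 256),
    dense_params(256, 256),
    dense_params(256, 10),
]
total_params = sum(layer_params)

print("Flattened units:", flat_units)
print("Parameters per layer:", layer_params)
print("Total trainable parameters:", total_params)
```

The total should agree with the number of trainable weights that the notebook reports for the unpruned model.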

This CNN is used to check whether the sub-network (the winning ticket) found by the iterative pruning rounds described in _The Lottery Ticket Hypothesis_ paper trains as well as the original, unpruned network.

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import math
import tensorflow_model_optimization as tfmot
from tensorflow_model_optimization.sparsity import keras as sparsity
from tensorflow.keras.layers import AveragePooling2D, Conv2D, MaxPooling2D, ReLU
from tensorflow.keras import models, layers, datasets
from tensorflow.keras.layers import Dense, Flatten, Reshape, Input, InputLayer
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.initializers import RandomNormal
from sklearn.metrics import accuracy_score, precision_score, recall_score



In [2]:
tf.__version__

'2.0.0'

In [3]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=1

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=1


In [4]:
batch_size = 60
num_classes = 10
num_epochs = 100

In [5]:
# Data preprocessing and cleaning:
# input image dimensions
img_rows, img_cols = 32, 32

# Load CIFAR-10 dataset-
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

In [6]:
if tf.keras.backend.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 3, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 3, img_rows, img_cols)
    input_shape = (3, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 3)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)

print("\n'input_shape' which will be used = {0}\n".format(input_shape))


'input_shape' which will be used = (32, 32, 3)



In [7]:
# Convert datasets to floating point types-
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize the training and testing datasets-
X_train /= 255.0
X_test /= 255.0

In [8]:
# convert class vectors/target to binary class matrices or one-hot encoded values-
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)
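
For readers unfamiliar with one-hot encoding, the following NumPy sketch shows what the conversion above produces for a few labels. It is an illustrative equivalent, not the Keras implementation; the function name `one_hot` and the sample labels are assumptions made for this example.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (illustrative helper)."""
    # CIFAR-10 labels are loaded with shape (N, 1), so flatten them first.
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[labels]


# Three hypothetical labels: classes 3, 0 and 9.
example = one_hot([[3], [0], [9]], num_classes=10)
print(example.shape)
print(example)
```

Each row contains a single 1 at the index of its class, which is the target format that categorical cross-entropy expects.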

In [9]:
print("\nDimensions of training and testing sets are:")
print("X_train.shape = {0}, y_train.shape = {1}".format(X_train.shape, y_train.shape))
print("X_test.shape = {0}, y_test.shape = {1}".format(X_test.shape, y_test.shape))


Dimensions of training and testing sets are:
X_train.shape = (50000, 32, 32, 3), y_train.shape = (50000, 10)
X_test.shape = (10000, 32, 32, 3), y_test.shape = (10000, 10)


### Prepare CIFAR-10 dataset for _GradientTape_ training:

In [10]:
# Create training and testing datasets-
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [11]:
train_dataset = train_dataset.shuffle(buffer_size = 20000, reshuffle_each_iteration = True).batch(batch_size = batch_size, drop_remainder = False)

In [12]:
test_dataset = test_dataset.batch(batch_size=batch_size, drop_remainder=False)

In [13]:
# Choose an optimizer and loss function for training-
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0003)

In [14]:
# Select metrics to measure the error & accuracy of model.
# These metrics accumulate values over the batches of an epoch
# (they are reset at the start of each epoch) and then report
# the overall result-
train_loss = tf.keras.metrics.Mean(name = 'train_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(name = 'train_accuracy')

test_loss = tf.keras.metrics.Mean(name = 'test_loss')
test_accuracy = tf.keras.metrics.CategoricalAccuracy(name = 'test_accuracy')

In [15]:
# The model is first trained without any pruning for 'num_epochs' epochs-
epochs = num_epochs

num_train_samples = X_train.shape[0]

end_step = np.ceil(1.0 * num_train_samples / batch_size).astype(np.int32) * epochs

print("'end_step parameter' for this dataset =  {0}".format(end_step))

'end_step parameter' for this dataset =  83400
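
The `end_step` value above follows from simple arithmetic on the dataset size, batch size and epoch count. A minimal standalone sketch, with the values copied from the earlier cells:

```python
import math

num_train_samples = 50000  # CIFAR-10 training images
batch_size = 60
num_epochs = 100

# The final, partial batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size
end_step = steps_per_epoch * num_epochs

print("Steps per epoch:", steps_per_epoch)
print("Size of the last batch:", last_batch_size)
print("end_step:", end_step)
```

Because `drop_remainder` is `False` in the dataset pipeline, the last batch of each epoch is smaller than `batch_size` but is still processed.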


In [16]:
# Specify the parameters to be used for layer-wise pruning, NO PRUNING is done here:
pruning_params_unpruned = {
    'pruning_schedule': sparsity.ConstantSparsity(
        target_sparsity=0.0, begin_step=0,
        end_step = end_step, frequency=100
    )
}

In [17]:
l = tf.keras.layers

In [18]:
def pruned_nn(pruning_params_conv, pruning_params_fc, pruning_params_op):
    """
    Function to define the architecture of a neural network model
    following Conv-4 architecture for CIFAR-10 dataset and using
    provided parameters which are used to prune the model.
    
    Conv-4 architecture-
    64, 64, pool  -- convolutions
    128, 128, pool -- convolutions
    256, 256, 10  -- fully connected layers
    
    Inputs: 'pruning_params_conv', 'pruning_params_fc' and 'pruning_params_op' are
    Python 3 dictionaries containing the pruning parameters for the convolutional,
    fully connected and output layers respectively
    Output: Returns designed and compiled neural network model
    """
    
    pruned_model = Sequential()
    
    pruned_model.add(sparsity.prune_low_magnitude(
        Conv2D(
            filters = 64, kernel_size = (3, 3),
            activation='relu', kernel_initializer = tf.initializers.GlorotUniform(),
            strides = (1, 1), padding = 'same',
            input_shape=(32, 32, 3)
        ),
        **pruning_params_conv)
    )
        
    pruned_model.add(sparsity.prune_low_magnitude(
        Conv2D(
            filters = 64, kernel_size = (3, 3),
            activation='relu', kernel_initializer = tf.initializers.GlorotUniform(),
            strides = (1, 1), padding = 'same'
        ),
        **pruning_params_conv)
    )
    
    pruned_model.add(sparsity.prune_low_magnitude(
        MaxPooling2D(
            pool_size = (2, 2),
            strides = (2, 2)
        ),
        **pruning_params_conv)
    )
    
    pruned_model.add(sparsity.prune_low_magnitude(
        Conv2D(
            filters = 128, kernel_size = (3, 3),
            activation='relu', kernel_initializer = tf.initializers.GlorotUniform(),
            strides = (1, 1), padding = 'same'
        ),
        **pruning_params_conv)
    )

    pruned_model.add(sparsity.prune_low_magnitude(
        Conv2D(
            filters = 128, kernel_size = (3, 3),
            activation='relu', kernel_initializer = tf.initializers.GlorotUniform(),
            strides = (1, 1), padding = 'same'
        ),
        **pruning_params_conv)
    )

    pruned_model.add(sparsity.prune_low_magnitude(
        MaxPooling2D(
            pool_size = (2, 2),
            strides = (2, 2)
        ),
        **pruning_params_conv)
    )

    
    pruned_model.add(Flatten())
    
    pruned_model.add(sparsity.prune_low_magnitude(
        Dense(
            units = 256, activation='relu',
            kernel_initializer = tf.initializers.GlorotUniform()
        ),
        **pruning_params_fc)
    )
    
    pruned_model.add(sparsity.prune_low_magnitude(
        Dense(
            units = 256, activation='relu',
            kernel_initializer = tf.initializers.GlorotUniform()
        ),
        **pruning_params_fc)
    )
    
    pruned_model.add(sparsity.prune_low_magnitude(
        Dense(
            units = 10, activation='softmax'
        ),
        **pruning_params_op)
    )
    

    # Compile pruned CNN-
    pruned_model.compile(
        loss=tf.keras.losses.categorical_crossentropy,
        optimizer=tf.keras.optimizers.Adam(learning_rate = 0.0003),
        metrics=['accuracy']
    )
    
    
    return pruned_model


In [19]:
# Initialize a CNN model-
orig_model = pruned_nn(pruning_params_unpruned, pruning_params_unpruned, pruning_params_unpruned)



In [19]:
'''
import os

# Change to where the winning ticket is saved-
os.chdir("Run_2/Conv_4_CIFAR10/")
'''

'\nimport os\n\n# Change to where the winning ticket is saved-\nos.chdir("Run_2/Conv_4_CIFAR10/")\n'

In [20]:
# Load weights from before-
orig_model.load_weights("Conv_4_CIFAR10_Winning_Ticket_Distribution_94.6035541009015.h5")

In [21]:
# Strip model of its pruning parameters-
orig_model_stripped = sparsity.strip_pruning(orig_model)

In [22]:
# Get stripped defined model summary-
orig_model_stripped.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 32, 32, 64)        1792      
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 32, 32, 64)        36928     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 16, 16, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 16, 128)       73856     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 16, 16, 128)       147584    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 128)         0         
_________________________________________________________________
flatten (Flatten)            (None, 8192)              0

### Create mask using winning ticket:

In [23]:
# Instantiate a new neural network model for which, the mask is to be created,
# according to the paper-
mask_model = pruned_nn(pruning_params_unpruned, pruning_params_unpruned, pruning_params_unpruned)

In [24]:
# Load weights of PRUNED model-
# mask_model.set_weights(orig_model.get_weights())
mask_model.load_weights("Conv_4_CIFAR10_Winning_Ticket_Distribution_94.6035541009015.h5")

In [25]:
# Strip the model of its pruning parameters-
mask_model_stripped = sparsity.strip_pruning(mask_model)

In [26]:
# For each layer, leave every weight that is 0 unchanged, and
# set every weight that survived pruning to ONE (1)-
for wts in mask_model_stripped.trainable_weights:
    wts.assign(tf.where(tf.equal(wts, 0.), 0., 1.))
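
The mask construction above can be illustrated on a small array. This is a NumPy sketch of the same zero/non-zero rule, using a made-up kernel; the array values are hypothetical.

```python
import numpy as np

# A hypothetical pruned kernel: zeros mark weights removed by pruning.
pruned_kernel = np.array(
    [[0.0, -0.42, 0.0],
     [0.17, 0.0, 0.0]],
    dtype=np.float32,
)

# Surviving (non-zero) weights map to 1, pruned weights stay 0.
mask = np.where(pruned_kernel == 0.0, 0.0, 1.0).astype(np.float32)
sparsity_fraction = 1.0 - mask.mean()

print(mask)
print("Fraction of weights pruned:", sparsity_fraction)
```

One caveat of this rule is that any surviving weight that happens to be exactly zero would also be masked out, since the mask is derived from the values rather than from the pruning bookkeeping.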


In [28]:
# Count number of mask parameters-
mask_sum_params = 0

for layer in mask_model_stripped.trainable_weights:
    mask_sum_params += tf.math.count_nonzero(layer, axis = None).numpy()

print("\nNumber of mask parameters = {0}\n".format(mask_sum_params))


Number of mask parameters = 130097



In [29]:
# Count number of trainable and non-trainable parameters-

import tensorflow.keras.backend as K


trainable_wts = np.sum([K.count_params(w) for w in orig_model.trainable_weights])
non_trainable_wts = np.sum([K.count_params(w) for w in orig_model.non_trainable_weights])

print("\nNumber of trainable weights = {0} and non-trainable weights = {1}\n".format(
    trainable_wts, non_trainable_wts
))
print("Total number of parameters = {0}\n".format(trainable_wts + non_trainable_wts))



Number of trainable weights = 2425930 and non-trainable weights = 2425040.0

Total number of parameters = 4850970.0



In [30]:
# Count number of non-zero parameters in winning ticket-1-
pruned_sum_params = 0
    
for layer in orig_model_stripped.trainable_weights:
    # print(tf.math.count_nonzero(layer, axis = None).numpy())
    pruned_sum_params += tf.math.count_nonzero(layer, axis = None).numpy()


In [31]:
print("\nNumber of non-zero parameters in Conv-4 winning ticket (CIFAR-10) = {0} with {1:.4f}% of weights pruned\n".format(pruned_sum_params, 100 - (pruned_sum_params / trainable_wts) * 100))


Number of non-zero parameters in Conv-4 winning ticket (CIFAR-10) = 130097 with 94.6372% of weights pruned
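
The pruning percentage reported above can be reproduced from the two counts alone. A short worked calculation, using the numbers printed by the notebook; the compression ratio is an extra derived figure added here for context.

```python
total_weights = 2425930    # trainable weights of the unpruned Conv-4 model
nonzero_weights = 130097   # non-zero weights in the winning ticket

remaining_pct = nonzero_weights / total_weights * 100
pruned_pct = 100 - remaining_pct
compression_ratio = total_weights / nonzero_weights

print("Weights remaining: {0:.4f}%".format(remaining_pct))
print("Weights pruned:    {0:.4f}%".format(pruned_pct))
print("Compression ratio: {0:.2f}x".format(compression_ratio))
```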



In [32]:
@tf.function
def train_one_step(model, mask_model, optimizer, x, y):
    '''
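    Function to compute one step of gradient descent optimization.
    Gradients are multiplied element-wise by the binary mask so that
    pruned (zero) weights receive no update and remain zero.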
    '''
    with tf.GradientTape() as tape:
        # Make predictions using defined model-
        y_pred = model(x)

        # Compute loss-
        loss = loss_fn(y, y_pred)
            
    # Compute gradients wrt defined loss and weights and biases-
    grads = tape.gradient(loss, model.trainable_variables)
    
    # type(grads)
    # list

    # List to hold element-wise multiplication between-
    # computed gradient and masks-
    grad_mask_mul = []
    
    # Perform element-wise multiplication between computed gradients and masks-
    for grad_layer, mask in zip(grads, mask_model.trainable_weights):
        grad_mask_mul.append(tf.math.multiply(grad_layer, mask))
    
    # Apply computed gradients to model's weights and biases-
    optimizer.apply_gradients(zip(grad_mask_mul, model.trainable_variables))

    # Compute accuracy-
    train_loss(loss)
    train_accuracy(y, y_pred)

    return None
    
    
@tf.function
def test_step(model, optimizer, data, labels):
    """
    Function to test model performance
    on testing dataset
    """
    
    predictions = model(data)
    t_loss = loss_fn(labels, predictions)

    test_loss(t_loss)
    test_accuracy(labels, predictions)

    return None
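
To see why multiplying gradients by the mask keeps the winning ticket sparse, consider the toy update below. It is a simplified sketch that swaps in plain SGD for the Adam optimizer used in the notebook; the function name `masked_sgd_step`, the learning rate and all array values are assumptions for illustration.

```python
import numpy as np


def masked_sgd_step(weights, grads, mask, learning_rate=0.1):
    """One plain SGD step in which masked-out gradients are zeroed."""
    return weights - learning_rate * (grads * mask)


weights = np.array([0.5, 0.0, -0.3, 0.0])  # zeros are pruned weights
mask = np.array([1.0, 0.0, 1.0, 0.0])      # 1 = surviving, 0 = pruned
grads = np.array([0.2, 0.7, -0.1, 0.4])    # raw gradients are non-zero everywhere

updated = masked_sgd_step(weights, grads, mask)
print(updated)
print("Non-zero weights after the step:", np.count_nonzero(updated))
```

The pruned positions receive a zero gradient and therefore stay at zero, which is consistent with the constant non-zero parameter count printed after every epoch of training.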


In [33]:
# User input parameters for Early Stopping in manual implementation-
minimum_delta = 0.001
patience = 3

In [34]:
# Train winning ticket using 'GradientTape' to observe its training behavior-
    
# Initialize parameters for Early Stopping manual implementation-
best_val_loss = 100
loc_patience = 0
    
for epoch in range(num_epochs):
    
    # print("\n\nEpoch: {0}\n\n".format(epoch + 1))
    
    if loc_patience >= patience:
        print("\n'EarlyStopping' called!\n")
        break
        
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()
        
    
    for x, y in train_dataset:
        train_one_step(orig_model_stripped, mask_model_stripped, optimizer, x, y)

    for x_t, y_t in test_dataset:
        test_step(orig_model_stripped, optimizer, x_t, y_t)

    template = 'Epoch {0}, Loss: {1:.4f}, Accuracy: {2:.4f}, Test Loss: {3:.4f}, Test Accuracy: {4:4f}'
    
    print(template.format(epoch + 1, 
                          train_loss.result(), train_accuracy.result()*100,
                          test_loss.result(), test_accuracy.result()*100))
        
    # Count number of non-zero parameters in each layer and in total-
    # print("layer-wise manner model, number of nonzero parameters in each layer are: \n")

    model_sum_params = 0
    
    for layer in orig_model_stripped.trainable_weights:
        # print(tf.math.count_nonzero(layer, axis = None).numpy())
        model_sum_params += tf.math.count_nonzero(layer, axis = None).numpy()
    
    print("Total number of trainable parameters = {0}\n".format(model_sum_params))

    
    # Code for manual Early Stopping:
    if (test_loss.result() < best_val_loss) and (np.abs(test_loss.result() - best_val_loss) >= minimum_delta):
        # update 'best_val_loss' variable to lowest loss encountered so far-
        best_val_loss = test_loss.result()
        
        # reset 'loc_patience' variable-
        loc_patience = 0
        
    else:  # there is no improvement in monitored metric 'val_loss'
        loc_patience += 1  # number of epochs without any improvement

    

Epoch 1, Loss: 1.3280, Accuracy: 54.7720, Test Loss: 1.0256, Test Accuracy: 64.450005
Total number of trainable parameters = 130097

Epoch 2, Loss: 0.8605, Accuracy: 70.5660, Test Loss: 0.8211, Test Accuracy: 72.050003
Total number of trainable parameters = 130097

Epoch 3, Loss: 0.6581, Accuracy: 77.4980, Test Loss: 0.7663, Test Accuracy: 73.979996
Total number of trainable parameters = 130097

Epoch 4, Loss: 0.5323, Accuracy: 81.8320, Test Loss: 0.7427, Test Accuracy: 74.959999
Total number of trainable parameters = 130097

Epoch 5, Loss: 0.4317, Accuracy: 85.2220, Test Loss: 0.7322, Test Accuracy: 76.430000
Total number of trainable parameters = 130097

Epoch 6, Loss: 0.3509, Accuracy: 88.0640, Test Loss: 0.7600, Test Accuracy: 76.560005
Total number of trainable parameters = 130097

Epoch 7, Loss: 0.2912, Accuracy: 90.0380, Test Loss: 0.8234, Test Accuracy: 76.380005
Total number of trainable parameters = 130097

Epoch 8, Loss: 0.2387, Accuracy: 92.0400, Test Loss: 0.8644, Test Acc
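
The manual early-stopping rule used in the training loop can be replayed on the logged test losses to confirm when training halts. The sketch below is a standalone re-statement of that rule in plain Python; the function name `epochs_until_stop` is hypothetical, and the losses are copied from the epoch log above.

```python
def epochs_until_stop(val_losses, patience=3, min_delta=0.001):
    """Return how many epochs run before patience is exhausted."""
    best = float("inf")
    stale_epochs = 0
    for epochs_run, loss in enumerate(val_losses, start=1):
        if best - loss >= min_delta:
            # Improvement of at least min_delta: record it and reset patience.
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            return epochs_run
    return len(val_losses)


# Test losses logged for epochs 1 to 8.
logged_test_losses = [1.0256, 0.8211, 0.7663, 0.7427, 0.7322, 0.7600, 0.8234, 0.8644]
stopped_after = epochs_until_stop(logged_test_losses)
print("Training stops after epoch", stopped_after)
```

The lowest test loss occurs at epoch 5, and the three following epochs show no improvement, so with a patience of 3 the loop ends once epoch 8 has completed.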

In [35]:
import pickle, numpy as np

In [36]:
# Load Python 3 dictionary containing training metrics-
with open("Conv4_history_main_Winning_Ticket_Distribution_Experiment_2_Random_Weights_Experiment_2.pkl", "rb") as f:
    hm = pickle.load(f)

In [42]:
epoch_length = len(hm[1]['val_accuracy'])

print("\nConv-4 val_accuracy = {0:.4f}% using {1} epochs containing {2:.2f}% of weights\n".format(
    hm[1]['val_accuracy'][epoch_length - 1], epoch_length, 100
))



Conv-4 val_accuracy = 75.6000% using 8 epochs containing 100.00% of weights



### Observation:

1. The over-parameterized, original Conv-4 CNN needed 8 epochs to reach a validation accuracy of 75.6%.
1. The winning ticket, with __94.6372%__ of its weights pruned, also needs 8 epochs and reaches a __higher validation accuracy of 76.29%__.

This result supports _The Lottery Ticket Hypothesis_ for the Conv-4 CNN on the CIFAR-10 dataset: the sparse sub-network reaches at least the accuracy of the dense network in the same number of epochs.