<a href="https://colab.research.google.com/github/btcain44/Applied_Deep_Learning/blob/main/Small_Net_Generalist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Bi-Weekly #3
### Brian Cain
#### Small_Net_Generalist.ipynb

In a sense, this notebook is a continuation of Generalist_Specialist.ipynb: it tries to move a step closer to improving the Generalist model from the Generalist/Specialist model, with the goal of eventually getting better CIFAR-100 results. 

My belief is that for the Generalist model to perform better, it needs not only data augmentation but also a deeper and smarter architecture. Here I will try to make room for more layers/convolutions in the generalist model by trimming down the model size. I will attempt to do this with some novel ideas inspired by the small networks module covered in class. 

It is also important to note that, since I am trying some novel ideas here, I am not necessarily expecting stellar accuracy, but I do expect to observe some new behaviors from the generalist model. The first idea is a revised version of batch normalization intended to reduce the memory a network must carry while training. The second is a new take on quantization that quantizes inputs rather than weights in order to reduce floating point operations. 

First I will load the CIFAR-100 dataset and the respective coarse (super-class) labels for the data. 

In [None]:
##Import necessary packages
import numpy as np
import tensorflow as tf

##Load the CIFAR-100 dataset fine labels
from tensorflow import keras
(x_train, y_train_fine), (x_test, y_test_fine) = keras.datasets.cifar100.load_data(label_mode='fine') 
(_, y_train_coarse), (_, y_test_coarse) = keras.datasets.cifar100.load_data(label_mode='coarse') ##Load once and keep only the coarse labels

##Format the coarse labels so we have shape (50000,) and (10000,) respectively 
y_train_coarse = np.array([i[0] for i in y_train_coarse])
y_test_coarse = np.array([i[0] for i in y_test_coarse])

##We don't need any fine label information for this task, so we'll free up memory here
del y_train_fine
del y_test_fine

#### Min-Max Architecture: 

Replacing Batch Normalization with a Potentially Less Intensive Normalization Technique. 

One drawback of batch normalization is that it adds extra computation to a network, which can slow down predictions once a model is deployed. In the architecture below, I have forgone batch normalization and implemented a more naive normalization method that is applied after every convolutional layer. It is essentially min-max normalization; here is the mathematical formulation:

Let $X$ be the input tensor for the $i$-th convolutional layer, and let $X'$ be the modified input tensor for that layer.

We compute $X' = \frac{X-\min(X)}{\max(X)-\min(X)}$

The obvious benefit is that we do not have to keep a running average of $\mu$ and $\sigma^{2}$ in memory during training. Let's now build the architecture and explore the results. 
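As a small illustration of the formula, the following NumPy sketch applies min-max normalization to a toy tensor. The function name `min_max_normalize` and the `eps` guard against division by zero are assumptions added for this example; they are not part of the architecture in this notebook, which divides by $\max(X)-\min(X)$ directly.

```python
import numpy as np

def min_max_normalize(x, eps=1e-12):
    """Rescale all elements of x to [0, 1] using the global min and max.

    eps is a hypothetical safeguard for a constant tensor (max == min).
    """
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + eps)

# Toy 2x2 "tensor": the smallest value maps to 0 and the largest to (about) 1
toy = np.array([[2.0, 4.0], [6.0, 10.0]])
normalized = min_max_normalize(toy)
print(normalized)
```

Note that the minimum and maximum are taken over the whole tensor, so a single extreme value rescales every other element.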

In [None]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Conv2D, BatchNormalization, GlobalAveragePooling2D, Dropout, MaxPooling2D

##Create class for a model 
class GeneralistModel(Model):
    
    def __init__(self):
        
        super(GeneralistModel, self).__init__()
        self.conv1 = Conv2D(32,(1,1),activation='relu')
        
        self.conv2 = Conv2D(64,(3,3),activation='relu')
        
        self.conv3 = Conv2D(128,(3,3),activation='relu')

        self.conv4 = Conv2D(256,(3,3),activation='relu')
        
        self.globAvgPool = GlobalAveragePooling2D() ##Aids in regularization
        
        self.d1 = Dense(128, activation='relu')
        self.drop = Dropout(.5)
        self.d2 = Dense(20, activation='softmax')
        
    def call(self, x):

        x = tf.cast(x, tf.float32) ##Cast to float32 to avoid a datatype issue that was occurring
        ##Note: self.conv1 is defined in __init__ but is not applied in this forward pass
        x = (x - tf.math.reduce_min(x))/(tf.math.reduce_max(x)-tf.math.reduce_min(x)) ##Normalizes the raw input; may not be needed and could be removed in a future experiment
        x = self.conv2(x)
        x = (x - tf.math.reduce_min(x))/(tf.math.reduce_max(x)-tf.math.reduce_min(x))
        x = self.conv3(x)
        x = (x - tf.math.reduce_min(x))/(tf.math.reduce_max(x)-tf.math.reduce_min(x))
        x = self.conv4(x)
        x = (x - tf.math.reduce_min(x))/(tf.math.reduce_max(x)-tf.math.reduce_min(x))
        x = self.globAvgPool(x)
        x = self.d1(x)
        x = self.drop(x)
        return self.d2(x)

##Create an instance of the model
generalist = GeneralistModel()

Here I will compile the model. I've typically been using the Adam optimizer, but I want to try something new, so I'll use classic stochastic gradient descent (SGD). 

In [None]:
generalist.compile(optimizer='sgd',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

Split the training data to create a validation set.

In [None]:
##Split training data so we have a validation set
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train_coarse, test_size=0.25, random_state=42)

##We must also normalize the data (including the validation set, so all splits share the same scale)
x_train, x_val, x_test = x_train/255, x_val/255, x_test/255

So far, data augmentation has been beneficial, specifically, I think, for preventing some of the disastrous over-fitting that can occur on an image dataset like CIFAR-100. So I will once again perform mixup on this dataset to boost model performance. 

In [None]:
##Define a function that performs mixup on two images 
def mixup(image1, image2, label1, label2, beta_params):
    
    ##Sample the mixing coefficient lambda from a beta distribution 
    lambda_val = np.random.beta(beta_params[0], beta_params[1])
    
    ##Perform mix-up operation 
    newImg = lambda_val*image1 + (1-lambda_val)*image2
    newLabel = lambda_val*label1 + (1-lambda_val)*label2

    ##Assign the new label to whichever original label it is closest to
    if np.abs(newLabel-label1) >= np.abs(newLabel-label2):
        newLabel = label2
    else:
        newLabel = label1
    
    return tf.cast(newImg, tf.float32), newLabel

##Define new random images
np.random.seed(1) ##Set a different random seed than when we did RGB pixel alteration
rand_images = np.random.randint(0, 37500, size=int(37500*.2), dtype=int)

##Create training dataset using mixup
x_train_mixup = []
y_train_mixup = []
for i in rand_images:
    mixup_result = (mixup(x_train[i],x_train[i-1],y_train[i],y_train[i-1],[.2,.2]))
    x_train_mixup.append(mixup_result[0])
    y_train_mixup.append(mixup_result[1])

##Translate training data into numpy arrays
x_train_mixup = np.array(x_train_mixup)
y_train_mixup = np.array(y_train_mixup)

##Concatenate new data onto existing training data
x_train = np.concatenate((x_train, x_train_mixup), axis=0)
y_train = np.concatenate((y_train, y_train_mixup),axis=None)
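Because `mixup` snaps the blended label back to whichever original label is closer, the rule amounts to keeping `label1` when the mixing coefficient is above 0.5 and `label2` otherwise (ties go to `label2`). The sketch below checks this in plain NumPy; `nearest_label` is a hypothetical helper written only for this illustration, mirroring the comparison inside `mixup`, and the labels 3 and 17 are arbitrary.

```python
import numpy as np

def nearest_label(label1, label2, lambda_val):
    """Blend two integer labels and return the original label nearest to the blend."""
    new_label = lambda_val * label1 + (1 - lambda_val) * label2
    if np.abs(new_label - label1) >= np.abs(new_label - label2):
        return label2
    return label1

# Try a few mixing coefficients with two arbitrary coarse labels
for lam in (0.1, 0.5, 0.9):
    print(lam, nearest_label(3, 17, lam))
```

Since a Beta(0.2, 0.2) distribution is U-shaped, most sampled coefficients fall near 0 or 1, so each mixed image is usually dominated by one of the two source images and takes that image's label.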

Since I am trying out a couple of novel approaches to make the network smaller, I only trained this Min-Max Normalization architecture for 10 epochs. We can see the results below. (Also, since I trained so many specialist models in Generalist_Specialist.ipynb, I'm not sure whether there is a point where Colab will limit my GPU.) 

In [None]:
min_max_normalized_generalist = generalist.fit(x_train, y_train, batch_size=32, epochs=10,
                                         validation_data=(x_val, y_val),verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now let's compute the test accuracy:

In [None]:
##Evaluate the testing accuracy of the model
print('Test Accuracy of Generalist Model:')
generalist.evaluate(x_test,  y_test_coarse, verbose=0)[1]

Test Accuracy of Generalist Model:


0.2312999963760376

#### Min-Max Architecture Takeaways

The biggest immediate takeaway from these results is that no over-fitting is occurring with this new proposed architecture. The test accuracy of 23.13% is pretty much in line with the validation accuracy of 23.77%. Although training only ran for 10 epochs, this is a good sign, as one of the main drawbacks of the generalist model in Generalist_Specialist.ipynb was a decent amount of over-fitting when comparing the testing and validation accuracy. 

One idea for Bi-weekly Report #4 is to construct an architecture search space that includes this Min-Max normalization and then use AutoML to see whether it would be selected as part of an optimal architecture. 

#### Relaxed Quantization Architecture

The next architecture I am going to try relates to the concepts of quantization covered in the Small Networks module. Specifically, I am referring to the "Trained Quantization and Weight Sharing" section in our PowerPoint slides. 

My proposal is that, rather than worrying about quantizing the weights in the convolution, we just round each element of the input tensor to the nearest tenth. For example, if a tensor has an element of value .67889, it becomes .7. Although we are not rounding to integers, the intent is that this still reduces the floating point operations necessary when computing convolutions and should reduce the model size. One caveat is that I don't want to zero out values entirely, so I clip the input tensor to a minimum value of .1 and a maximum value of 1; in doing so we might also reap some normalization benefits. 
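To make the round-and-clip rule concrete, here is a minimal NumPy sketch of the same idea applied to a handful of values. The helper name `round_and_clip` and its default arguments are assumptions for this illustration; the notebook itself uses a TensorFlow function for this step.

```python
import numpy as np

def round_and_clip(x, decimals=1, low=0.1, high=1.0):
    """Round to the given number of decimals, then clip into [low, high]."""
    x = np.asarray(x, dtype=np.float64)
    multiplier = 10 ** decimals
    rounded = np.round(x * multiplier) / multiplier
    return np.clip(rounded, low, high)

# 0.67889 -> 0.7; values below 0.1 are raised to 0.1; values above 1 are capped at 1
values = np.array([0.67889, 0.02, 0.449, 3.2, -1.5])
quantized = round_and_clip(values)
print(quantized)

# Count how many distinct levels survive for inputs spread over [0, 1]
levels = np.unique(round_and_clip(np.linspace(0, 1, 1001)))
print(levels)
```

With one decimal place and the clip to [.1, 1], every activation is mapped onto one of ten levels (.1, .2, ..., 1), which is what makes this a coarse form of input quantization.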

In [None]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Conv2D, BatchNormalization, GlobalAveragePooling2D, Dropout, MaxPooling2D

##Define a function that will round tensor elements to the nearest tenth, help from here: https://stackoverflow.com/questions/46688610/tf-round-to-a-specified-precision 
def my_tf_round(x, decimals):
    multiplier = tf.constant(10**decimals, dtype=x.dtype)
    new_x = tf.round(x * multiplier) / multiplier
    final_x = tf.clip_by_value(new_x, clip_value_min=.1, clip_value_max=1)
    return final_x

##Create class for a model 
class GeneralistModel(Model):
    
    def __init__(self):
        
        super(GeneralistModel, self).__init__()
        self.conv1 = Conv2D(32,(1,1),activation='relu')
        self.batch1 = BatchNormalization()
        self.conv2 = Conv2D(64,(3,3),activation='relu')
        self.batch2 = BatchNormalization()
        self.conv3 = Conv2D(128,(3,3),activation='relu')
        self.batch3 = BatchNormalization()
        self.conv4 = Conv2D(256,(3,3),activation='relu')
        self.batch4 = BatchNormalization()
        self.globAvgPool = GlobalAveragePooling2D() ##Aids in regularization
        
        self.d1 = Dense(128, activation='relu')
        self.drop = Dropout(.5)
        self.d2 = Dense(20, activation='softmax')
        
    def call(self, x):

        x = tf.cast(x, tf.float32) ##Cast to float32 to avoid a datatype issue that was occurring
        x = my_tf_round(x, 1)
        x = self.conv1(x)
        x = self.batch1(x)
        x = my_tf_round(x, 1)
        x = self.conv2(x)
        x = self.batch2(x)
        x = my_tf_round(x, 1)
        x = self.conv3(x)
        x = self.batch3(x)
        x = my_tf_round(x, 1)
        x = self.conv4(x)
        x = self.batch4(x)
        x = my_tf_round(x, 1)
        x = self.globAvgPool(x)
        x = self.d1(x)
        x = self.drop(x)
        return self.d2(x)

##Create an instance of the model
generalist = GeneralistModel()

Again I use stochastic gradient descent. 

In [None]:
generalist.compile(optimizer='sgd',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

Train the model

In [None]:
quantized_generalist = generalist.fit(x_train, y_train, batch_size=32, epochs=10,
                                validation_data=(x_val, y_val),verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Well, this method was not as successful as the last one. When I kicked off training, I got a warning that seems to indicate a vanishing gradient issue. This could be a result of forcing values to a single decimal place, which doesn't allow for much flexibility. I would say that classical quantization of the convolutional weights is probably a better route than this proposed input quantization. 
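One possible explanation for that warning, offered here as an assumption rather than a diagnosis of this particular run: rounding is piecewise constant, so its derivative is zero almost everywhere, and gradients flowing back through a rounding step can vanish. The NumPy sketch below estimates the slope of a round-and-clip function with a central finite difference; `quantize` and `finite_difference` are hypothetical helpers for this illustration only.

```python
import numpy as np

def quantize(x, decimals=1):
    """Round to one decimal place and clip into [0.1, 1], as in the proposal above."""
    multiplier = 10 ** decimals
    return np.clip(np.round(x * multiplier) / multiplier, 0.1, 1.0)

def finite_difference(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx at each point of x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Points chosen away from the rounding boundaries (0.25, 0.55, ...)
points = np.array([0.23, 0.51, 0.68, 0.87])
slopes = finite_difference(quantize, points)
identity_slopes = finite_difference(lambda x: x, points)

print("slope through quantization:", slopes)
print("slope without quantization:", identity_slopes)
```

A small change in the input leaves the quantized output unchanged at these points, so the estimated slope through the quantization step is zero, whereas the unquantized path has slope one.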

### Final Thoughts

Well, given that the architecture design and creation process for Generalist_Specialist.ipynb took the bulk of my time, this very small experiment with some novel small network techniques was a fun change of pace. I am interested in playing around more with replacing batch normalization with Min-Max Normalization, especially since no over-fitting occurred over the epochs, which suggests it could be a decent regularization technique. As for my relaxed quantization method, I will end experimentation here due to its lack of success and possibly just resort to traditional quantization. 

Now that I have tried out some new things, I think I will continue to run some of these "novel" experiments based on what we have learned in class to see whether any new ideas could be useful. Especially since the field is moving so rapidly, every new idea has to be tried for a first time anyway, so I might as well play around with some class-inspired novel concepts during the Bi-weekly reports. 