In [1]:
# https://keras.io/
!pip install -q keras
import keras

Using TensorFlow backend.


# Importing libraries

In [0]:
import keras
from keras.datasets import cifar10
from keras.models import Model, Sequential
from keras.layers import Dense, Dropout, Flatten, Input, AveragePooling2D, merge, Activation,GlobalAveragePooling2D
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.layers import Concatenate
from keras.optimizers import Adam
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
import math

In [0]:
# This part prevents TensorFlow from allocating all the available GPU memory
# backend
import tensorflow as tf
from keras import backend as k

# Don't pre-allocate memory; allocate as-needed
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Create a session with the above options specified.
k.tensorflow_backend.set_session(tf.Session(config=config))

# CNN Architecture: Densely Connected Convolutional Networks

Source: https://arxiv.org/abs/1608.06993

![alt text](https://cdn-images-1.medium.com/max/2000/1*_Y7-f9GpV7F93siM1js0cg.jpeg)

# Major insights from the paper:

    1. To further improve the information flow between layers inside a dense block, a different connectivity pattern is introduced: direct connections from any layer to all subsequent layers.
    2. Each layer in a dense block applies a composite function of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3×3 convolution (Conv).
    3. The transition layers used in the paper's experiments consist of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.
    4. An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. Here k is the growth rate, which represents the number of feature maps produced by each convolution layer.
    5. A 1×1 convolution is introduced as a bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency.
    6. To further improve model compactness, the number of feature-maps at transition layers is reduced using a compression factor θ. If a dense block contains m feature-maps, the following transition layer generates θ*m output feature-maps.
    7. At the end of the last dense block, i.e., in the output block, a global average pooling is performed and then a softmax classifier is attached.
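
As a rough illustration of points 4 and 6, the feature-map bookkeeping can be sketched in a few lines of Python. The function names and the example values (24 input maps, 16 layers) are assumptions chosen for illustration; the floor is taken so that the channel count stays an integer.

```python
import math

def dense_block_channels(k0, num_layers, growth_rate):
    """Channels after a dense block: every layer appends `growth_rate` maps."""
    return k0 + num_layers * growth_rate

def transition_channels(m, theta):
    """Channels after a transition layer with compression factor theta."""
    return math.floor(theta * m)

# Illustrative values: 24 input maps, 16 layers, growth rate k = 12
m = dense_block_channels(k0=24, num_layers=16, growth_rate=12)
print(m)                             # 216
print(transition_channels(m, 0.5))   # 108
```

Because each layer only adds k maps, the width of a block grows linearly with its depth, which is why a small growth rate is enough.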


# Hyperparameters

In [0]:
# Hyperparameters
batch_size = 128
num_classes = 10
epochs = 100
l = 40
num_filter = 24
compression = 0.5
dropout_rate = 0.2

# About Data:

The **CIFAR-10** dataset (Canadian Institute For Advanced Research) is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains **60,000 32x32 color images in 10 different classes**. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class. CIFAR-10 is a labeled subset of the 80 million tiny images dataset.

Source: https://en.wikipedia.org/wiki/CIFAR-10

![alt text](https://cdn-images-1.medium.com/max/356/1*QN007xhxgDTPBdNT0pnZ2g.png)

# Importing data and converting output variable to one-hot vector form

In [5]:
# Load CIFAR10 Data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
img_height, img_width, channel = x_train.shape[1],x_train.shape[2],x_train.shape[3]

# convert to one-hot encoding
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
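
For readers unfamiliar with one-hot encoding, here is a minimal NumPy sketch of what the conversion does. The helper `one_hot` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (n,) or (n, 1) into an (n, num_classes) matrix."""
    labels = np.asarray(labels).reshape(-1)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0   # one 1.0 per row, at the class index
    return encoded

y = np.array([[3], [0], [9]])    # CIFAR-10 labels are loaded with shape (n, 1)
encoded = one_hot(y, 10)
print(encoded.shape)             # (3, 10)
print(encoded[0])
```

Each row contains a single 1 at the position of its class, which is the target format expected by the categorical cross-entropy loss used later.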


# Normalising Data

The data must be normalized so that all features lie on a scale between 0 and 1, which helps the network train. The pixel values must also be converted to float before normalization, because the integer type the images are stored in cannot represent fractional values.

**Caution:** If the data type is not converted to float and an integer division is used, the values are truncated to whole numbers instead of being scaled.
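
The difference can be seen on a tiny array. This is a sketch with made-up pixel values, assuming the images are stored as `uint8`:

```python
import numpy as np

pixels = np.array([0, 127, 128, 255], dtype="uint8")

# Integer division: every value except 255 collapses to 0
floored = pixels // 255

# Cast first, then divide: intermediate values are preserved
scaled = pixels.astype("float32") / 255.0

print(floored)          # [0 0 0 1]
print(scaled.dtype)     # float32
print(scaled)
```

Casting to `float32` explicitly also keeps the array at half the memory of the `float64` default.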


In [0]:
# normalize inputs from 0-255 to 0.0-1.0
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train = x_train / 255.0
x_test = x_test / 255.0


# Data Augmentation

Data augmentation is a great method to **indirectly acquire more data** from the already available data. It works by applying horizontal or vertical flips, rotations, random crops and many other image-processing operations such as whitening, scaling and channel modifications.
This helps the model generalize.
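
To make two of these transformations concrete, here is a small NumPy sketch of a horizontal flip and a horizontal shift. The helper names are hypothetical, and the shift fills the exposed columns with zeros for simplicity, whereas `ImageDataGenerator` fills with the nearest pixel by default.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror a (height, width, channels) image left to right."""
    return image[:, ::-1, :]

def shift_right(image, pixels):
    """Shift an image right by `pixels` columns, filling the gap with zeros."""
    shifted = np.zeros_like(image)
    if pixels > 0:
        shifted[:, pixels:, :] = image[:, :-pixels, :]
    else:
        shifted[:] = image
    return shifted

image = np.arange(2 * 3 * 1).reshape(2, 3, 1)   # tiny stand-in for a 32x32x3 image
print(horizontal_flip(image)[..., 0])           # [[2 1 0], [5 4 3]]
print(shift_right(image, 1)[..., 0])            # [[0 0 1], [0 3 4]]
```

The label of the image is unchanged by either operation, so every transformed copy acts as an additional training sample.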


In [0]:
#data augmentation
datagen = ImageDataGenerator(
    featurewise_center=False,
    samplewise_center=False,
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=False
    )
datagen.fit(x_train)

# Dense Block

This is the actual feature extractor block.

    1. The reference architecture has **3 dense blocks**; the model built in this notebook stacks four.
    2. Batch Normalization helps in **equalising the outcome values** from convolution.
    3. Activation function: **ReLU**
    4. **Dropout** is done after the 3x3 convolution for **regularization purposes**, where 20% of activations are randomly shut down to avoid overfitting on the training data. In this implementation dropout is also applied after the 1x1 convolution.
    5. **1x1 Convolution** is done to reduce the number of channels before passing them to the next layer, which lowers the number of parameters and reduces model complexity.
    6. Standard **3x3 Convolution** follows.
    7. The number of filters used in the 1x1 convolution is **four times (4x) the number of kernels** in the 3x3 convolution, as advised by the above-mentioned paper. This helps avoid losing too much information when reducing the number of channels.
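
The parameter saving from the 1x1 bottleneck can be checked with simple arithmetic. The sketch below counts the weights of bias-free convolutions; the input width of 216 channels and the growth of 12 are illustrative assumptions.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Weight count of a bias-free 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels

def layer_weights(in_channels, growth, use_bottleneck):
    """Weights of one dense layer, with or without the 1x1 bottleneck."""
    if use_bottleneck:
        mid = 4 * growth   # the 1x1 convolution uses 4x the 3x3 filter count
        return conv_weights(in_channels, mid, 1) + conv_weights(mid, growth, 3)
    return conv_weights(in_channels, growth, 3)

in_channels, growth = 216, 12
print(layer_weights(in_channels, growth, use_bottleneck=False))   # 23328
print(layer_weights(in_channels, growth, use_bottleneck=True))    # 15552
```

The saving only appears once the input is wide enough. With 24 input channels the bottleneck version is actually larger (6336 versus 2592 weights), so the benefit comes from the later layers of a block, where many feature maps have been concatenated.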



In [0]:
# Dense Block
def add_denseblock(input, num_filter = 16, dropout_rate = 0.25):
    # Uses the module-level `l` (layers per block) and `compression` settings
    temp = input
    for _ in range(l):
        # Bottleneck: BN -> ReLU -> 1x1 convolution with 4x the 3x3 filter count
        BatchNorm = BatchNormalization()(temp)
        reluD_LAYER = Activation('relu')(BatchNorm)
        Conv2D_1_1 = Conv2D(int(4*num_filter*compression), (1,1), use_bias=False, padding='same')(reluD_LAYER)
        if dropout_rate > 0:
            Conv2D_1_1 = Dropout(dropout_rate)(Conv2D_1_1)

        # BN -> ReLU -> 3x3 convolution that produces the new feature maps
        BatchNorm = BatchNormalization()(Conv2D_1_1)
        reluD_LAYER = Activation('relu')(BatchNorm)
        Conv2D_3_3 = Conv2D(int(num_filter*compression), (3,3), use_bias=False, padding='same')(reluD_LAYER)
        if dropout_rate > 0:
            Conv2D_3_3 = Dropout(dropout_rate)(Conv2D_3_3)

        # Concatenate the new feature maps with everything produced so far
        temp = Concatenate(axis=-1)([temp, Conv2D_3_3])

    return temp

# Transition Block

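This is a transition block between two dense blocks.

    1. A 1x1 convolution reduces the number of feature maps passed on to the next dense block. The code names this layer `Conv2D_BottleNeck`; in the reference paper the term bottleneck refers to the 1x1 convolution inside the dense block, while the reduction at the transition is the compression step.
    2. **Average pooling** reduces the feature map size from N to N/2, as the pooling size is 2x2, i.e., a stride of 2.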
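
The effect of 2x2 average pooling with stride 2 can be reproduced in NumPy. This is a sketch with a hypothetical helper, assuming an even height and width:

```python
import numpy as np

def average_pool_2x2(feature_map):
    """2x2 average pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = feature_map.shape
    # Split each spatial axis into (blocks, 2) and average inside every 2x2 block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.mean(axis=(1, 3))

fm = np.arange(16, dtype="float32").reshape(4, 4, 1)
pooled = average_pool_2x2(fm)
print(pooled.shape)       # (2, 2, 1)
print(pooled[..., 0])     # [[ 2.5  4.5], [10.5 12.5]]
```

The channel count is untouched by pooling; only the spatial resolution is halved.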



In [0]:
def add_transition(input, num_filter = 16, dropout_rate = 0.25):
    # Uses the module-level `compression` setting
    # BN -> ReLU -> 1x1 convolution that reduces the channel count
    BatchNorm = BatchNormalization()(input)
    relu = Activation('relu')(BatchNorm)
    Conv2D_BottleNeck = Conv2D(int(4*num_filter*compression), (1,1), use_bias=False, padding='same')(relu)
    if dropout_rate > 0:
        Conv2D_BottleNeck = Dropout(dropout_rate)(Conv2D_BottleNeck)
    # 2x2 average pooling halves the height and width
    avg = AveragePooling2D(pool_size=(2,2))(Conv2D_BottleNeck)

    return avg

# Output Layer

This is the final layer of the model.

    1. **Global average pooling** is done here as advised by the reference paper. It takes the **spatial average of each feature map** of the previous layer. Another advantage is that it has no parameters of its own to optimize.

    2. **A dense layer** with **softmax activation** is added at the end, and the result is brought to a vector of 10 values as required. Softmax gives **probability-like values**.
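
The two steps can be sketched in NumPy as follows. The feature-map shape and the random weights are hypothetical stand-ins, used only to show the shapes involved.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel of an (H, W, C) array down to a length-C vector."""
    return feature_maps.mean(axis=(0, 1))

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 240))       # hypothetical final feature maps
weights = rng.normal(size=(240, 10)) * 0.1    # hypothetical dense-layer weights
probs = softmax(global_average_pool(features) @ weights)
print(probs.shape)    # (10,)
print(probs.sum())    # approximately 1.0
```

The outputs are non-negative and sum to one, which is what makes them usable as class scores for the cross-entropy loss.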



In [0]:
def output_layer(input):
    # BN -> ReLU -> global average pooling -> softmax classifier
    BatchNorm = BatchNormalization()(input)
    relu = Activation('relu')(BatchNorm)
    GblAvgPooling = GlobalAveragePooling2D()(relu)
    #flat = Flatten()(GblAvgPooling)
    output = Dense(num_classes, activation='softmax')(GblAvgPooling)

    return output

# Model implementation

In [0]:

input = Input(shape=(img_height, img_width, channel,))
First_Conv2D = Conv2D(num_filter, (3,3), use_bias=False ,padding='same')(input)


l = 16
First_Block = add_denseblock(First_Conv2D, num_filter, dropout_rate)
First_Transition = add_transition(First_Block, num_filter, dropout_rate)

l = 16
Second_Block = add_denseblock(First_Transition, num_filter, dropout_rate)
Second_Transition = add_transition(Second_Block, num_filter, dropout_rate)

l = 16
Third_Block = add_denseblock(Second_Transition, num_filter, dropout_rate)
Third_Transition = add_transition(Third_Block, num_filter, dropout_rate)

#l = 16
#Forth_Block = add_denseblock(Third_Transition, num_filter, dropout_rate)
#Forth_Transition = add_transition(Forth_Block, num_filter, dropout_rate)

#Fifth_Block = add_denseblock(Forth_Transition, num_filter, dropout_rate)
#Fifth_Transition = add_transition(Fifth_Block, num_filter, dropout_rate)

Last_Block = add_denseblock(Third_Transition,  num_filter, dropout_rate)
output = output_layer(Last_Block)
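
As a sanity check on the model assembled above, the expected tensor shapes can be traced with plain arithmetic. This sketch assumes the hyperparameters used in this notebook (`num_filter = 24`, `compression = 0.5`, 16 layers per block) and mirrors the filter counts in `add_denseblock` and `add_transition`; it does not build any Keras layers.

```python
num_filter, compression, layers_per_block = 24, 0.5, 16

growth = int(num_filter * compression)               # channels added per dense layer
transition_out = int(4 * num_filter * compression)   # filters of the transition 1x1 conv

size, channels = 32, num_filter                      # after the first 3x3 convolution
trace = []
for block in range(4):
    channels += layers_per_block * growth            # concatenation inside the dense block
    trace.append((f"dense block {block + 1}", size, channels))
    if block < 3:                                    # no transition after the last block
        size, channels = size // 2, transition_out
        trace.append((f"transition {block + 1}", size, channels))

for name, s, c in trace:
    print(f"{name:15s} {s}x{s}x{c}")
```

Under these assumptions the last dense block ends at 4x4 with 240 channels, which is the tensor that global average pooling reduces to a 240-value vector.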


In [12]:
model = Model(inputs=[input], outputs=[output])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 32, 32, 3)    0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 32, 32, 24)   648         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 32, 32, 24)   96          conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 32, 32, 24)   0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
conv2d_2 (

In [0]:
import tensorflow as tf
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom = True)


In [14]:
!rm -rf clr_callback.py*
!wget https://github.com/bckenstler/CLR/raw/master/clr_callback.py
from clr_callback import *

--2018-10-27 05:54:25--  https://github.com/bckenstler/CLR/raw/master/clr_callback.py
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/bckenstler/CLR/master/clr_callback.py [following]
--2018-10-27 05:54:25--  https://raw.githubusercontent.com/bckenstler/CLR/master/clr_callback.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5326 (5.2K) [text/plain]
Saving to: ‘clr_callback.py’


2018-10-27 05:54:26 (59.1 MB/s) - ‘clr_callback.py’ saved [5326/5326]



# Different learning rate decay algorithms 

## 1. Learning Rate Scheduler from keras
This reduces the learning rate according to the formula given below. The rate stays constant for `epochs_drop` epochs and is then multiplied by `drop`, so it decreases in steps rather than at every epoch.

# LR = initial_LR * drop ** floor((1 + epoch) / epochs_drop)

In [0]:
# learning rate schedule
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.7
    epochs_drop = 20.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate

In [0]:
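# ModelCheckpoint is imported explicitly because the checkpoint cell further down uses it
from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau, ModelCheckpoint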
lrate = LearningRateScheduler(step_decay)
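
To see what this schedule produces, the function can be evaluated on its own at a few epochs. The sketch below repeats the same formula with the constants exposed as default arguments so that it runs without Keras.

```python
import math

def step_decay(epoch, initial_lrate=0.1, drop=0.7, epochs_drop=20.0):
    """Same formula as the schedule above, with the constants as parameters."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

for epoch in (0, 18, 19, 38, 39):
    print(epoch, round(step_decay(epoch), 4))
# 0 0.1
# 18 0.1
# 19 0.07
# 38 0.07
# 39 0.049
```

Because of the `1 + epoch` term, the first drop happens at epoch index 19, that is, at the start of the 20th epoch.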

## 2. Reduce LR on Plateau

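This lowers the learning rate when the validation loss stops improving. With the settings below, the rate is multiplied by 0.2 after 3 epochs without improvement and is never reduced below 0.005.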

In [0]:
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=3, min_lr=0.005)
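
The behaviour can be approximated with a short simulation. This is a simplified sketch of the idea, not the Keras implementation (it ignores options such as cooldown and the minimum-improvement threshold), and the validation losses are made up.

```python
def simulate_plateau(val_losses, lr=0.1, factor=0.2, patience=3, min_lr=0.005):
    """Return the learning rate in effect after each epoch (simplified logic)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:          # plateau: shrink the rate, respect the floor
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

losses = [1.0, 0.8, 0.81, 0.82, 0.83, 0.79, 0.80, 0.80, 0.80, 0.80, 0.80, 0.80]
history = simulate_plateau(losses)
print(history)
```

Starting from 0.1, the first plateau brings the rate to about 0.02 and the second one to the floor of 0.005, after which further plateaus have no effect.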

In [0]:
# determine Loss function and Optimizer
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

In [0]:
# Saving the best model based on validation accuracy

In [0]:
# checkpoint
filepath="Assignment_4.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

In [0]:
epochs = 50

In [28]:
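# steps_per_epoch is cast to an integer; the 1.8 factor draws about 1.8 passes of augmented batches per epoch
model.fit_generator(
    datagen.flow(x_train, y_train, batch_size=batch_size),
    epochs=epochs,
    verbose=1,
    steps_per_epoch=int(1.8 * x_train.shape[0]) // batch_size,
    callbacks=[reduce_lr, checkpoint],
    validation_data=(x_test, y_test))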

Epoch 1/50

Epoch 00001: val_acc improved from -inf to 0.51970, saving model to Assignment_4.hdf5
Epoch 2/50

Epoch 00002: val_acc improved from 0.51970 to 0.59740, saving model to Assignment_4.hdf5
Epoch 3/50

Epoch 00003: val_acc improved from 0.59740 to 0.64460, saving model to Assignment_4.hdf5
Epoch 4/50

Epoch 00004: val_acc improved from 0.64460 to 0.68840, saving model to Assignment_4.hdf5
Epoch 5/50

Epoch 00005: val_acc improved from 0.68840 to 0.69460, saving model to Assignment_4.hdf5
Epoch 6/50

Epoch 00006: val_acc improved from 0.69460 to 0.72940, saving model to Assignment_4.hdf5
Epoch 7/50

Epoch 00007: val_acc improved from 0.72940 to 0.73160, saving model to Assignment_4.hdf5
Epoch 8/50

Epoch 00008: val_acc improved from 0.73160 to 0.79670, saving model to Assignment_4.hdf5
Epoch 9/50

Epoch 00009: val_acc improved from 0.79670 to 0.82820, saving model to Assignment_4.hdf5
Epoch 10/50

Epoch 00010: val_acc did not improve from 0.82820
Epoch 11/50

Epoch 00011: val_a

KeyboardInterrupt: ignored

In [34]:
# Test the model
score = model.evaluate(x_test, y_test, verbose=1)
print('Test loss:', score[0])

print('Test accuracy:', score[1])

Test loss: 0.3867398366123438
Test accuracy: 0.8945


# Best Validation Accuracy: 89.45%

In [30]:
# Save the trained weights in .h5 format
model.save_weights("DNST_weigts_Best.h5")
print("Saved model to disk")

Saved model to disk


In [31]:
# Save the trained model in .h5 format
model.save("DNST_model_Best.h5")
print("Saved model to disk")

Saved model to disk


# Remarks

### I had the same model without the dropout after the 1x1 convolution in the dense block, which gave 91% accuracy. Unfortunately I modified it and had no time left before the deadline.

### I'm confident that removing the dropout after the 1x1 convolution in the dense block and training for 50 epochs will give better results.

### Among Cyclic LR, LR Scheduler and ReduceLROnPlateau, the last one gave the best results.

### Cyclic LR gave the best initial learning among the three.
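
For reference, the triangular cyclic learning rate policy mentioned in these remarks can be sketched as below. The values of `base_lr`, `max_lr` and `step_size` are illustrative assumptions, not the settings used in the experiments above.

```python
import math

def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Learning rate of the triangular cyclic policy at a given iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(triangular_clr(it), 6))
# 0 0.001
# 1000 0.0035
# 2000 0.006
# 3000 0.0035
# 4000 0.001
```

The rate climbs linearly from `base_lr` to `max_lr` over `step_size` iterations and then falls back. Reaching a high rate early in training is one plausible reason for quick initial progress.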