# PLAN:

### 1. Plain vanilla AlexNet on CIFAR-10: Gluon warm-up
### 2. Fine-tuning ResNet-152
### 3. Building embeddings




#### Code will be available starting Wednesday, Dec 6th at
https://github.com/asemyanov

## Learning the representations

Another way to cast the state of affairs is that 
the most important part of the pipeline was the representation.
And up until 2012, this part was done mechanically, based on some hard-fought intuition.
In fact, engineering a new set of feature functions, improving results, and writing up the method was a prominent genre of paper.

Another group of researchers had other plans. They believed that features themselves ought to be learned. 
Moreover, they believed that, to be reasonably complex, the features ought to be composed hierarchically.
These researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng, Shun-ichi Amari, and Juergen Schmidhuber, believed that by jointly training many layers of a neural network, they might come to learn hierarchical representations of data. 
In the case of an image, the lowest layers might come to detect edges, colors, and textures. 


<img src="../img2/filters.png" alt="Drawing" style="width: 600px;"/>

Higher layers might build upon these representations to represent larger structures, like eyes, noses, blades of grass, and so on. 
Yet higher layers might represent whole objects like people, airplanes, dogs, or frisbees. 
And ultimately, before the classification layer, the final hidden state might be a compact representation of the image that summarizes its contents in a space where data belonging to different categories would be linearly separable.

# Warm-up: classic AlexNet on classic CIFAR-10

#### Based on:
https://github.com/zackchase/mxnet-the-straight-dope

In [1]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
import numpy as np
mx.random.seed(1)



In [2]:
ctx = mx.gpu()

## Load up a dataset

Now let's load up a dataset. This time we're going to use gluon's new `vision` package and import the CIFAR-10 dataset. CIFAR-10 is a much smaller color dataset than ImageNet: its images are only 32x32 pixels. It contains 50,000 training and 10,000 test images, which belong in equal quantities to 10 categories. While this dataset is considerably smaller than the 1M-image, 1k-category, 256x256 ImageNet dataset, we'll use it here to demonstrate the model because we don't want to assume that you have a license for the ImageNet dataset or a machine that can store it comfortably. To give you some sense of the proportions of working with ImageNet data, we'll upsample the images to 224x224 (the size used in the original AlexNet).
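To make the cost of upsampling concrete, here is a rough back-of-the-envelope sketch. It assumes 32x32 RGB source images, `float32` pixels (4 bytes each), and a batch of 150 images; the helper name `tensor_bytes` is ours and not part of any library.

```python
# Back-of-the-envelope arithmetic for upsampling CIFAR-10 to 224x224.
# Assumptions: 32x32 RGB source images, float32 pixels (4 bytes), batch of 150.

def tensor_bytes(batch, channels, height, width, bytes_per_value=4):
    """Bytes needed to hold a dense batch of images."""
    return batch * channels * height * width * bytes_per_value

native = tensor_bytes(150, 3, 32, 32)
upsampled = tensor_bytes(150, 3, 224, 224)

# (224 / 32) ** 2 = 49, so every image carries 49 times as many values.
scale = upsampled // native
print("values per image grow by a factor of %d" % scale)
print("one upsampled batch needs about %.1f MiB" % (upsampled / 2.0 ** 20))
```

So each batch that reaches the network is dozens of times larger than the raw CIFAR-10 data, which is exactly the kind of pressure ImageNet-sized inputs put on memory.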

In [3]:
def transformer(data, label):
    data = mx.image.imresize(data, 224, 224)
    data = mx.nd.transpose(data, (2,0,1))
    data = data.astype(np.float32)
    return data, label


In [4]:
batch_size = 150
train_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10('./data', train=True, transform=transformer),
    batch_size=batch_size, shuffle=True, last_batch='discard')

test_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10('./data', train=False, transform=transformer),
    batch_size=batch_size, shuffle=False, last_batch='discard')

Downloading ./data/cifar-10-binary.tar.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/cifar10/cifar-10-binary.tar.gz...
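Because both loaders use `last_batch='discard'`, a few examples never make it into a batch. The short sketch below works out how many, assuming the 50,000/10,000 split and the batch size of 150 used here; `full_batches` is a hypothetical helper name.

```python
# How many full batches does last_batch='discard' leave per epoch?
# Assumptions: 50,000 training / 10,000 test images, batch size 150.

def full_batches(num_examples, batch_size):
    """Return (number of full batches, examples left over and dropped)."""
    return divmod(num_examples, batch_size)

train_batches, train_dropped = full_batches(50000, 150)
test_batches, test_dropped = full_batches(10000, 150)

print("train: %d batches, %d examples dropped" % (train_batches, train_dropped))
print("test:  %d batches, %d examples dropped" % (test_batches, test_dropped))
```

With shuffling enabled, a different handful of training examples is dropped each epoch. The test loader does not shuffle, so under these assumptions accuracy is always measured on the same 9,900 of the 10,000 test images.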


In [5]:
for d, l in train_data:
    break

In [6]:
print(d.shape, l.shape)

(150, 3, 224, 224) (150,)


In [7]:
d.dtype

numpy.float32

## The AlexNet architecture

This model has some notable features. 
First, in contrast to the relatively tiny LeNet, 
AlexNet contains 8 layers of transformations,
five convolutional layers followed by two fully connected hidden layers and an output layer.

The convolutional kernels in the first convolutional layer are reasonably large at $11 \times 11$; in the second they are $5\times5$, and thereafter they are $3\times3$. Moreover, the first, second, and fifth convolutional layers are each followed by overlapping pooling operations with pool size $3\times3$ and stride $2\times2$. 
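The pooling is called overlapping because the stride (2) is smaller than the window (3), so neighboring windows share inputs. A minimal one-dimensional sketch, assuming no padding and that windows which do not fit are dropped (`pooling_windows` is a hypothetical helper for illustration):

```python
# Which input positions does each pooling window cover along one axis?
# Assumptions: no padding; windows that do not fit entirely are dropped.

def pooling_windows(length, pool_size=3, stride=2):
    """List the input indices covered by each window along a 1-D axis."""
    starts = range(0, length - pool_size + 1, stride)
    return [list(range(s, s + pool_size)) for s in starts]

windows = pooling_windows(7)
for w in windows:
    print(w)
# [0, 1, 2]
# [2, 3, 4]
# [4, 5, 6]
```

Each window shares one position with its neighbor. With a stride equal to the pool size, the windows would tile the input without any overlap.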

Following the convolutional layers, the original AlexNet had two fully connected hidden layers with 4096 nodes each. Using `gluon.nn.Sequential()`, we can define the entire AlexNet architecture in just 14 lines of code. Besides the specific architectural choices and the data preparation, we can recycle all of the code we'd used for LeNet verbatim. 

[**right now relying on a different data pipeline (the new gluon.vision). Sync this with the other chapter soon and commit to one data pipeline.**]

[add dropout once we are 100% final on API]

In [10]:
alex_net = gluon.nn.Sequential()
with alex_net.name_scope():
    #  First convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=96, kernel_size=11, strides=(4,4), activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))    
    #  Second convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=192, kernel_size=5, activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=(2,2)))            
    # Third convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, activation='relu'))
    # Fourth convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, activation='relu')) 
    # Fifth convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=256, kernel_size=3, activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, strides=2))    
    # Flatten and apply fully connected layers
    alex_net.add(gluon.nn.Flatten())
    alex_net.add(gluon.nn.Dense(4096, activation="relu"))
    alex_net.add(gluon.nn.Dense(4096, activation="relu"))
    alex_net.add(gluon.nn.Dense(10))
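To see what these layers do to a 224x224 input, we can trace the spatial size by hand. This is a sketch in plain Python, not a call into Gluon. It assumes zero padding, stride 1 for convolutions unless stated, and floor rounding, which as far as we know are the defaults for `Conv2D` and `MaxPool2D`. The helper `out_size` is hypothetical.

```python
# Trace the spatial size of a 224x224 input through the layers defined above.
# Assumptions: zero padding, stride 1 unless stated, floor rounding.

def out_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

layers = [
    ("conv1 11x11 / 4", 11, 4),
    ("pool1 3x3 / 2", 3, 2),
    ("conv2 5x5", 5, 1),
    ("pool2 3x3 / 2", 3, 2),
    ("conv3 3x3", 3, 1),
    ("conv4 3x3", 3, 1),
    ("conv5 3x3", 3, 1),
    ("pool3 3x3 / 2", 3, 2),
]

size = 224
sizes = []
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes.append(size)
    print("%-16s -> %d x %d" % (name, size, size))

# 256 channels leave the fifth convolutional layer.
flattened = 256 * size * size
print("features entering the first Dense layer: %d" % flattened)
```

Under these assumptions the last pooling layer leaves a 1x1 map, so the first `Dense` layer receives only 256 features. The original AlexNet used padding and ended with a 6x6x256 map instead, so this variant is noticeably lighter in its fully connected part.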

## Initialize parameters

In [11]:
alex_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

## Optimizer

In [12]:
trainer = gluon.Trainer(alex_net.collect_params(), 'sgd', {'learning_rate': .001})

## Softmax cross-entropy loss

In [14]:
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
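For intuition about the numbers this loss produces, here is a minimal plain-Python sketch of softmax cross-entropy for a single example. It is not the Gluon implementation, and `softmax_ce_scalar` is a hypothetical name chosen to avoid shadowing the loss object above.

```python
import math

def softmax_ce_scalar(logits, label):
    """Cross-entropy of softmax(logits) against one integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Ten equal logits: the model is guessing uniformly among 10 classes.
chance_loss = softmax_ce_scalar([0.0] * 10, label=3)
print("loss with uniform predictions: %.4f" % chance_loss)

# One large logit on the correct class: the loss is close to zero.
confident_loss = softmax_ce_scalar([0.0] * 9 + [10.0], label=9)
print("loss with a confident correct prediction: %.4f" % confident_loss)
```

A uniform guess over 10 classes gives a loss of ln(10), about 2.30, which is a useful reference point when reading the training log: a loss below that value means the network is doing better than chance.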

## Evaluation loop

In [15]:
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for d, l in data_iterator:
        data = d.as_in_context(ctx)
        label = l.as_in_context(ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]

## Training loop

In [16]:
###########################
#  Only a few epochs so the notebook runs quickly; increase this variable for a full run
###########################
epochs = 5
smoothing_constant = .01


for e in range(epochs):
    for i, (d, l) in enumerate(train_data):
        data = d.as_in_context(ctx)
        label = l.as_in_context(ctx)
        with autograd.record():
            output = alex_net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        
        ##########################
        #  Keep a moving average of the losses
        ##########################
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0)) 
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
            
    test_accuracy = evaluate_accuracy(test_data, alex_net)
    train_accuracy = evaluate_accuracy(train_data, alex_net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy))    

Epoch 0. Loss: 2.0490405666, Train_acc 0.294334334334, Test_acc 0.293434343434
Epoch 1. Loss: 1.83530748758, Train_acc 0.382422422422, Test_acc 0.37898989899


KeyboardInterrupt: 
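The `Loss` column above is not the loss of the last batch but an exponential moving average over batches. The sketch below isolates that update in plain Python; `moving_average` is a hypothetical helper, and the toy inputs are made up for illustration.

```python
def moving_average(losses, smoothing_constant=0.01):
    """Exponentially smoothed losses, seeded with the first value."""
    smoothed = None
    history = []
    for loss in losses:
        if smoothed is None:
            smoothed = loss
        else:
            smoothed = ((1 - smoothing_constant) * smoothed
                        + smoothing_constant * loss)
        history.append(smoothed)
    return history

# A large smoothing constant makes the effect easy to follow by hand.
history = moving_average([2.0, 1.0, 1.0], smoothing_constant=0.5)
print(history)  # [2.0, 1.5, 1.25]
```

With the smoothing constant of 0.01 used in the training loop, each new batch contributes only 1% of the reported value, so the printed loss reflects roughly the last hundred batches and lags behind the current batch loss.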

## Next


In [None]:
#a =gluon.data.vision.CIFAR10('./data', train=True, transform=transformer)