## Convolutional Neural Networks https://goo.gl/L0lDDp
#### Computational Exercise - Bioelectronics and Biosensors FS 2016

As seen in the lecture, convolutional neural networks are a powerful tool for, amongst other things, classifying images. The most famous dataset for this purpose is MNIST, which we are going to work with. There are many libraries for training neural networks, each with its own advantages and disadvantages. One that is powerful (it is used by Facebook, among others) yet simple to understand is Torch, a machine learning framework based on the scripting language Lua. If you are not familiar with Lua, don’t worry, it is very straightforward to learn. After solving this exercise you will have a basic understanding of convolutional neural networks and their applications to vision problems. You will also learn a lot about Torch, and you can use this knowledge later for real-world machine learning problems.

__Part 0: Requirements/OS__


If you are using an OS that is not Unix-based (e.g. Windows), do the following: download and install Cygwin, and select curl under the "Net" category in the Cygwin package manager.

__Part 1: Setup__

First things first. Set up Torch and get it running by following the instructions here: http://torch.ch/docs/getting-started.html#_

If you don’t want to mess up your current environment, please use a virtual environment, as described here:
http://docs.python-guide.org/en/latest/dev/virtualenvs/

The next thing we need for easy scripting and visualisation is iTorch. Please set it up by following this link:
https://github.com/facebook/iTorch

Well done! Now you are ready to do some fancy machine learning with torch!

__Part 2: Getting started__

Now, to get familiar with Lua and Torch, you should follow this tutorial:
https://github.com/soumith/cvpr2015/blob/master/Deep%20Learning%20with%20Torch.ipynb

Well done! You are now very close to tackling the problem.

Please keep this link in mind: https://github.com/torch/torch7/wiki/Cheatsheet
It is a reference for the whole Torch framework, and it lists useful functions for many of the following tasks.

__Part 3: MNIST__ 

As mentioned earlier, we now want to start on the (former) gold standard of image classification, the MNIST dataset. As in the Torch tutorial, we will start off by implementing the LeNet-5 network. It was the first breakthrough in using convolutional neural networks for image classification. If you are interested, have a look at the paper, which explains many details (including the architecture) in depth: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf



We start by loading the MNIST dataset in 32x32 geometry. As mentioned in the tutorial, you should normalize your data so that it has a mean of 0 and a standard deviation of 1. This is already done for you.

In [1]:
require 'torch'
require 'paths'

mnist = {}

-- download the dataset, 32x32 geometry
mnist.path_remote = 'https://s3.amazonaws.com/torch7/data/mnist.t7.tgz'
mnist.path_dataset = 'mnist.t7'
mnist.path_trainset = paths.concat(mnist.path_dataset, 'train_32x32.t7')
mnist.path_testset = paths.concat(mnist.path_dataset, 'test_32x32.t7')

function mnist.download()
   if not paths.filep(mnist.path_trainset) or not paths.filep(mnist.path_testset) then
      local remote = mnist.path_remote
      local tar = paths.basename(remote)
      os.execute('wget ' .. remote .. '; ' .. 'tar xvf ' .. tar .. '; rm ' .. tar)
   end
end

function mnist.loadTrainSet(maxLoad, geometry)
   return mnist.loadDataset(mnist.path_trainset, maxLoad, geometry)
end

function mnist.loadTestSet(maxLoad, geometry)
   return mnist.loadDataset(mnist.path_testset, maxLoad, geometry)
end

function mnist.loadDataset(fileName, maxLoad)
   mnist.download()

--     read dataset
   local f = torch.load(fileName, 'ascii')
   local data = f.data:type(torch.getdefaulttensortype())
   local labels = f.labels

   -- restrict the dataset to the first maxLoad examples, if requested
   local nExample = f.data:size(1)
   if maxLoad and maxLoad > 0 and maxLoad < nExample then
      nExample = maxLoad
      print('<mnist> loading only ' .. nExample .. ' examples')
   end
   data = data[{{1,nExample},{},{},{}}]
   labels = labels[{{1,nExample}}]
   print('<mnist> done')

   local dataset = {}
   dataset.data = data
   dataset.labels = labels
    
    
   -- normalise dataset, like in the tutorial

   -- per-feature normalisation: each pixel position gets zero mean and unit std
   function dataset:normalize(mean_, std_)
      -- flat shares its storage with data, so in-place edits change data too
      local flat = data:view(data:size(1), -1)
      local mean = mean_ or flat:mean(1)
      local std = std_ or flat:std(1, true)
      for i = 1, flat:size(2) do
         flat:select(2, i):add(-mean[1][i])
         if std[1][i] > 0 then
            flat:select(2, i):mul(1/std[1][i])
         end
      end
      return mean, std
   end

   function dataset:normalizeGlobal(mean_, std_)
      local std = std_ or data:std()
      local mean = mean_ or data:mean()
      data:add(-mean)
      data:mul(1/std)
      return mean, std
   end

   function dataset:size()
      return nExample
   end

   local labelvector = torch.zeros(10)

   -- set metatable so that dataset[i] returns {input, one-hot label}
   -- note: the label tensor is one shared buffer that is overwritten on every access
   setmetatable(dataset, {__index = function(self, index)
      local input = self.data[index]
      local class = self.labels[index]
      local label = labelvector:zero()
      label[class] = 1
      local example = {input, label}
      return example
   end})

   return dataset
end

In [2]:
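-- nbTrainingPatches, nbTestingPatches and geometry are not set here (nil),
-- so the complete training and test sets are loaded
-- create training set and normalize it, keeping the statistics
trainData = mnist.loadTrainSet(nbTrainingPatches, geometry)
mean, std = trainData:normalizeGlobal()
-- create test set and normalize it with the training statistics
testData = mnist.loadTestSet(nbTestingPatches, geometry)
testData:normalizeGlobal(mean, std)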

<mnist> done	


<mnist> done	


Try to familiarize yourself with the data structure before you go ahead.
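
One detail of the data structure that is easy to miss is how `dataset[i]` works: the loader attaches an `__index` metamethod, which Lua only calls for keys that are not stored in the table itself. The following toy version uses plain Lua tables instead of tensors to show the idea; the name `toyDataset` and its contents are made up for illustration and are not part of the exercise code.

```lua
-- Toy stand-in for the MNIST dataset table (hypothetical data).
toyDataset = {data = {"image A", "image B"}, labels = {2, 1}}

-- __index is only consulted for missing keys such as 1 or 2;
-- toyDataset.data and toyDataset.labels are found directly.
setmetatable(toyDataset, {__index = function(self, index)
   local onehot = {0, 0}
   onehot[self.labels[index]] = 1
   return {self.data[index], onehot}
end})

local example = toyDataset[1]
print(example[1])      -- the input
print(example[2][2])   -- the one-hot entry of class 2
```

Unlike this sketch, the loader above reuses a single label tensor for every access, so copy the label if you need to keep it beyond the next lookup.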

Now you have to define and train your neural network. A good starting point is the architecture from the paper mentioned above (LeCun). Try to keep it simple at this point, since you will have time for further improvements later in this exercise.

In [None]:
require 'nn'

The error given by the training function is based on the training dataset. It indicates how well the network parameters fit the training data. As you know, this is only half the story: with enough parameters, one could fit the training data almost perfectly, which is called overfitting. How well your network actually classifies is defined by its performance on the test data. Therefore, write a function that evaluates the classification error of your CNN on the test dataset.
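
As a starting point, the bookkeeping part of such an evaluation can be sketched in plain Lua, independent of Torch. The sketch assumes you have already collected the network outputs as one array of class scores per example; the names `argmax` and `errorRate` are hypothetical, and running the network on the test data is left to you.

```lua
-- Index of the largest entry in an array of scores.
function argmax(scores)
   local best, bestIdx = -math.huge, nil
   for i, s in ipairs(scores) do
      if s > best then
         best, bestIdx = s, i
      end
   end
   return bestIdx
end

-- Fraction of examples whose highest-scoring class differs from the true label.
function errorRate(allScores, labels)
   local wrong = 0
   for i, scores in ipairs(allScores) do
      if argmax(scores) ~= labels[i] then
         wrong = wrong + 1
      end
   end
   return wrong / #labels
end

-- Three made-up examples with two classes each; the last one is misclassified.
print(errorRate({{0.1, 0.9}, {0.8, 0.2}, {0.3, 0.7}}, {2, 1, 1}))
```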

___Reminder___

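Lua passes tables and userdata, which includes Torch tensors and networks, to functions by reference, so changes made inside a function are visible to the caller. Keep that in mind for the rest of the exercise!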

Also the following expressions are equivalent:

```lua
foo:bar()
foo.bar(foo)
```
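
To make the reminder concrete, here is a small plain-Lua illustration of both points. All names in it (`increment`, `stats`, `tryIncrement`, `foo`) are made up for this example.

```lua
-- Tables are passed by reference: the function modifies the caller's table.
function increment(counter)
   counter.count = counter.count + 1
end

stats = {count = 0}
increment(stats)
print(stats.count)   -- 1

-- Numbers are passed by value: the caller's variable is unchanged.
function tryIncrement(n)
   n = n + 1
   return n
end

x = 5
tryIncrement(x)
print(x)             -- 5

-- The colon is shorthand for passing the object as the first argument (self).
foo = {value = 42}
function foo:bar()
   return self.value
end
print(foo:bar() == foo.bar(foo))   -- true
```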

Also, keep in mind that the function headers are only given to you for orientation. If you want to define them in a different way, that is perfectly fine as long as you get your results ;) 

In [4]:
function eval(network, testData)
    
end

___Visualization___

Now, to get some intuition about what is happening in the different layers, people usually visualize different sets of parameters of the network. This makes most sense for the convolutional layers, since that is where the most important computation happens. On images it is also possible to see what kind of feature each convolutional layer is looking for. First, try to write a function that visualizes the convolutional kernels of each convolutional layer.

If you scale any images, be careful about which type of interpolation you use.

In [5]:
require 'image'

function visWeights(network, layers)
    
end

To explore further what those kernels mean, implement a function that visualizes the convolution of each kernel with its input.

In [6]:
function visActivation(network, example)

end

If you did everything right, you should now observe something similar to this:

![alt text](http://localhost:8888/files/1st%20milestone.png)

___Adjust your network___

If you want to adjust the sizes of your layers, you need to make sure that the output of the n-th layer can be used as an input for the (n+1)-th layer. In a CNN the size of your feature maps is influenced by many parameters, such as the kernel size, the stride, the size of the padding, and the size of the pooling window.

Try to derive a formula that computes the width/height of the output image of a convolutional layer and of a pooling layer.
Then write a function that returns the output size for a given input size, for both convolutional and pooling layers.

Calculate the size of the feature maps for the following given layers:

```
convolutional layer
    input:400,400
    kernel:15,15
    stride:1
    padding:0
pooling layer:
    input: (from conv layer)
    kernel:2,2
    stride:2
    
convolutional layer
    input:300,300
    kernel:7,7
    stride:2
    padding:0
pooling layer:
    input: (from conv layer)
    kernel:4,4
    stride:4
    
convolutional layer
    input:200,200
    kernel:5,5
    stride:1
    padding:0
```
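
Whatever closed-form formula you derive, it can be useful to check it against a brute-force count. The sketch below simply slides a window over a (padded) row and counts the positions where it fits completely. It assumes that windows which would stick out over the edge are dropped, which I believe matches the default behaviour of the Torch layers; the name `countWindowPositions` is hypothetical.

```lua
-- Count how many times a window of size kernelSize fits into a row of
-- inpWidth pixels (plus padding on both sides) when moved by stride.
function countWindowPositions(inpWidth, kernelSize, stride, padding)
   local padded = inpWidth + 2 * padding
   local count = 0
   local first = 1                        -- first pixel covered by the window
   while first + kernelSize - 1 <= padded do
      count = count + 1
      first = first + stride
   end
   return count
end

-- First convolution of LeNet-5: 32 pixels wide input, 5x5 kernel, stride 1.
print(countWindowPositions(32, 5, 1, 0))   -- 28
-- Followed by 2x2 pooling with stride 2.
print(countWindowPositions(28, 2, 2, 0))   -- 14
```

The same counter works for pooling layers, since a pooling window moves over the image in the same way as a convolution kernel.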

In [7]:
function spatialDimension(inpWidth, kernelSize, stride, padding) 

end

function poolingDimension(inpWidth, kernelSize, stride)

end

Now, with that information in mind, you can go ahead and define your own network, then train and evaluate it!

___Large or small networks___


One might argue that one way to prevent overfitting would be to use a smaller network, since less representational power leads to better generalization. However, in practice it turns out that instead of reducing the network size, it is almost always better to work with a large network and use regularization, Dropout, or injected input noise.

The reasoning behind this is that smaller networks are harder to train with gradient descent: their loss functions have relatively few local minima, and many of these are easy to converge to but have a high loss. The classification performance therefore depends strongly on the random initialization of the weights. Big neural networks, on the other hand, contain significantly more local minima, and these turn out to be better in terms of their loss. As a result, big networks rely less on a lucky random initialization.
If you are interested in a more in-depth discussion, please read the following paper:
http://arxiv.org/pdf/1412.0233v3.pdf

Try out for yourself how small networks (few layers, few feature maps) compare to large networks. Implement a single-layer CNN and test it.

___GPU computing___

As you have probably experienced, training CNNs is computationally very expensive. GPUs are known to speed up training substantially, so it might be a good idea to train on the GPU from now on (if you have a good one :wink:).
If you want to try that, start by getting the newest NVIDIA CUDA from: http://docs.nvidia.com/cuda/index.html
Afterwards you might also need to install some more Torch modules via luarocks, e.g. `luarocks install cunn`. Don't forget to refresh your environment variables in case luarocks doesn't work immediately:

```bash
# On Linux with bash
source ~/.bashrc
# On Linux with zsh
source ~/.zshrc
# On OSX or in Linux with none of the above.
source ~/.profile
```

Alternatively, if you are running Torch in a virtual environment, you can run the following:

```bash
.../torch/install/bin/torch-activate
```

___Receptive field size___

An interesting question one might ask is why successful convolutional networks have a deep, stacked architecture.
Imagine having 3 convolutional layers with 3x3 filters stacked onto each other (with non-linearities in between).
A neuron in the first layer would see a 3x3 region of the input volume. A neuron in the second layer would see 5x5 and one in the last layer would see 7x7 of the input volume. Knowing that, one could ask: why not use a single convolutional layer with a 7x7 filter? How would you answer? You can read up on that topic in the following paper. Try to give a verbal explanation.

 Bengio Y (2009) Learning Deep Architectures for AI.
 Found Trends Mach Learn 2: 1–127.
 doi:10.1561/2200000006.
 
 link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
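
The receptive field sizes quoted above (3x3, 5x5 and 7x7) can be checked with a few lines of plain Lua. The sketch below is only meant to confirm these numbers and does not answer the question itself. It assumes each layer is described by its kernel size and stride; the name `receptiveField` is hypothetical.

```lua
-- Receptive field (in input pixels, along one axis) of a neuron on top of
-- a stack of convolutional layers, each given as {kernel = k, stride = s}.
function receptiveField(layers)
   local field = 1   -- a single input pixel sees only itself
   local jump = 1    -- distance in input pixels between neighbouring neurons
   for _, layer in ipairs(layers) do
      field = field + (layer.kernel - 1) * jump
      jump = jump * layer.stride
   end
   return field
end

local conv3 = {kernel = 3, stride = 1}
print(receptiveField({conv3}))                 -- 3
print(receptiveField({conv3, conv3}))          -- 5
print(receptiveField({conv3, conv3, conv3}))   -- 7
```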
 

___Further improvements___

OK, now we are ready for some optimization to get a better performance on the test data. We will perform a couple of steps to improve our performance. This is a rather iterative process, and you may have to go back and forth a couple of times to find out which adjustments work well together.


The following section gives you the opportunity to improve your network using state-of-the-art tricks. Try to implement them. You can build on the network you defined earlier (which you can also edit).

To have an overview of how different classifiers performed on MNIST visit: http://yann.lecun.com/exdb/mnist/

You are strongly advised to use as many routines from the Torch framework as possible, rather than implementing everything from scratch.

Furthermore, you should achieve at least ___98 %___ accuracy on the test data in order to receive points for this exercise.

___Regularization___

As mentioned earlier, the network might overfit the training data. To avoid this and to generalize better to test datasets, you now need to add regularization to your training. The classical way to do so is to add an L2 regularization term to the loss function, which penalizes large weights and thereby discourages an overly exact fit to the training data. A more recent, very promising approach is called Dropout, as you heard in the lecture.

Implement both and compare the performance improvements.
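
To illustrate what the L2 term does, here is a minimal sketch in plain Lua, with weights and gradients stored in ordinary arrays rather than tensors. The function name `addL2Penalty` and the parameter `lambda` (the regularization strength) are assumptions made for this example. In your network you would rather rely on what Torch provides, for instance the `weightDecay` option that, as far as I know, the `optim` optimizers accept.

```lua
-- Adds the penalty 0.5 * lambda * sum(w^2) to the loss and the matching
-- term lambda * w to each gradient entry (grads is modified in place).
function addL2Penalty(loss, weights, grads, lambda)
   local sumSquares = 0
   for i, w in ipairs(weights) do
      sumSquares = sumSquares + w * w
      grads[i] = grads[i] + lambda * w
   end
   return loss + 0.5 * lambda * sumSquares
end

local weights = {1, -2}
local grads = {0, 0}
local loss = addL2Penalty(1.0, weights, grads, 0.5)
print(loss, grads[1], grads[2])
```

Note how the extra gradient always points away from zero, so every update step pulls the weights a little towards zero.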


___Optimization___

An integral part of training your neural network is the optimization algorithm you are using. There are a couple of options you can try out. To do so, please use the optim package of torch.

- Stochastic Gradient Descent with

    - learning rate decay
    - momentum

- Adam: http://arxiv.org/pdf/1412.6980.pdf
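
To see how learning rate decay and momentum interact, the following plain-Lua sketch minimizes a one-dimensional toy function. The configuration field names are chosen to resemble those of `optim.sgd`, but the update rule is the textbook form and may differ in detail from what the `optim` package does; `sgdMomentum` is a hypothetical name.

```lua
-- Textbook SGD with momentum and learning rate decay on a scalar weight.
-- gradFn(w) returns the gradient of the loss at w.
function sgdMomentum(gradFn, w, config)
   local velocity = 0
   for step = 0, config.steps - 1 do
      -- the learning rate shrinks as training progresses
      local rate = config.learningRate / (1 + step * config.learningRateDecay)
      -- momentum keeps a decaying sum of past gradients
      velocity = config.momentum * velocity + gradFn(w)
      w = w - rate * velocity
   end
   return w
end

-- Toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
local result = sgdMomentum(
   function(w) return 2 * (w - 3) end,
   0,
   {learningRate = 0.1, learningRateDecay = 0.01, momentum = 0.5, steps = 200}
)
print(result)   -- close to 3
```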

In [None]:
require 'optim'

___Batch normalization___

As mentioned in the beginning, your training data is normalized to have a mean of 0 and a standard deviation of 1. The problem is that the distribution of the inputs to the inner layers of your network shifts as learning progresses and the weights change. Please read the following paper on why that is a problem, and then implement Batch Normalization to counteract it.

http://arxiv.org/pdf/1502.03167v3.pdf
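
The core computation of the training-time forward pass can be sketched in plain Lua for a mini-batch of scalar activations. This is only an illustration under simplifying assumptions: it ignores the running averages used at test time and the backward pass, and `batchNorm`, `gamma`, `beta` and `eps` are names chosen for this example. For your network, the `nn` package should already offer ready-made batch normalization modules.

```lua
-- Normalize a mini-batch of scalars to zero mean and unit variance,
-- then scale by gamma and shift by beta (the learnable parameters).
function batchNorm(batch, gamma, beta, eps)
   eps = eps or 1e-5
   local n = #batch
   local mean = 0
   for _, x in ipairs(batch) do mean = mean + x end
   mean = mean / n
   local variance = 0
   for _, x in ipairs(batch) do variance = variance + (x - mean)^2 end
   variance = variance / n   -- biased estimate over the mini-batch
   local out = {}
   for i, x in ipairs(batch) do
      out[i] = gamma * (x - mean) / math.sqrt(variance + eps) + beta
   end
   return out
end

local normalized = batchNorm({2, 4, 6, 8}, 1, 0)
print(normalized[1], normalized[2], normalized[3], normalized[4])
```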

___Data augmentation (Optional)___

In most machine learning problems a rule of thumb is "the more data the better". Accordingly, we always strive for a bigger dataset, and we can get one by augmenting our data. For that we use modifications that are sensible in the context of the data. MNIST is, as you know, a dataset of hand-written digits. Therefore, we want to implement augmentations that could realistically occur for hand-written digits. This means operations like

- rotation
- scaling
- translation
- distortions
- noise (speckles, lines, etc.)
- and much more (be creative)
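
As an example of one such operation, here is a plain-Lua sketch of a translation on an image stored as an array of rows. The name `translate` and the `fill` parameter are made up for this illustration. On real tensors you would more likely use the routines of the `image` package, which, to my knowledge, provides functions for translating, rotating and scaling images.

```lua
-- Shift an image by dx pixels to the right and dy pixels down.
-- Pixels that move in from outside the image are set to fill.
function translate(img, dx, dy, fill)
   fill = fill or 0
   local height, width = #img, #img[1]
   local out = {}
   for y = 1, height do
      out[y] = {}
      for x = 1, width do
         local srcY, srcX = y - dy, x - dx
         if srcY >= 1 and srcY <= height and srcX >= 1 and srcX <= width then
            out[y][x] = img[srcY][srcX]
         else
            out[y][x] = fill
         end
      end
   end
   return out
end

local img = {{1, 2, 3},
             {4, 5, 6},
             {7, 8, 9}}
local shifted = translate(img, 1, 0)
print(table.concat(shifted[1], " "))   -- 0 1 2
```

The label of a digit does not change under small shifts, which is what makes this a sensible augmentation for MNIST.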