# Convolutional Neural Networks (LeNet)

:label:`sec_lenet`


We now have all the ingredients required to assemble
a fully-functional convolutional neural network.
In our first encounter with image data,
we applied a multilayer perceptron (:numref:`sec_mlp_scratch`)
to pictures of clothing in the Fashion-MNIST dataset.
To make this data amenable to multilayer perceptrons,
we first flattened each image from a $28\times28$ matrix
into a fixed-length $784$-dimensional vector,
and thereafter processed these vectors with fully-connected layers.
Now that we have a handle on convolutional layers,
we can retain the spatial structure in our images.
As an additional benefit of replacing dense layers with convolutional layers, 
we will enjoy more parsimonious models (requiring far fewer parameters).
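
To make the parameter savings concrete, here is a minimal sketch in plain Java that compares one dense layer with one convolutional layer. The class `ParamCount` and its methods are hypothetical helpers written for this illustration, not part of DJL, and both counts assume every layer has a bias term.

```java
// Hypothetical helper for illustration only (not part of DJL).
final class ParamCount {
    // Fully-connected layer: one weight per (input, output) pair
    // plus one bias per output (bias assumed present).
    static long dense(long inputs, long outputs) {
        return inputs * outputs + outputs;
    }

    // Convolutional layer: one kH x kW kernel per (input channel, output channel)
    // pair plus one bias per output channel. The image size does not appear.
    static long conv2d(long inChannels, long outChannels, long kH, long kW) {
        return inChannels * outChannels * kH * kW + outChannels;
    }

    public static void main(String[] args) {
        // A dense layer mapping a flattened 28x28 image to 120 units
        System.out.println("dense 784 -> 120: " + dense(784, 120));
        // A 5x5 convolution mapping 1 input channel to 6 output channels
        System.out.println("conv 5x5, 1 -> 6: " + conv2d(1, 6, 5, 5));
    }
}
```

The dense layer needs 94,200 parameters while the convolutional layer needs 156, because the same small kernel is reused at every spatial position.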

In this section, we will introduce LeNet,
among the first published convolutional neural networks
to capture wide attention for its performance on computer vision tasks.
The model was introduced by (and named for) Yann LeCun,
then a researcher at AT&T Bell Labs,
for the purpose of recognizing handwritten digits in images 
[LeNet5](http://yann.lecun.com/exdb/lenet/).
This work represented the culmination
of a decade of research developing the technology.
In 1989, LeCun published the first study to successfully
train convolutional neural networks via backpropagation. 


At the time, LeNet achieved outstanding results 
matching the performance of Support Vector Machines (SVMs),
then a dominant approach in supervised learning.
LeNet was eventually adapted to recognize digits 
for processing deposits in ATMs.
To this day, some ATMs still run the code 
that Yann and his colleague Leon Bottou wrote in the 1990s!


## LeNet

At a high level, LeNet consists of two parts:
(i) a convolutional encoder consisting of two convolutional layers; and
(ii) a dense block consisting of three fully-connected layers.
The architecture is summarized in :numref:`img_lenet`.

![Data flow in LeNet 5. The input is a handwritten digit, the output a probability over 10 possible outcomes.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lenet.svg)

:label:`img_lenet`


The basic units in each convolutional block 
are a convolutional layer, a sigmoid activation function,
and a subsequent average pooling operation.
Note that while ReLUs and max-pooling tend to work better,
these discoveries had not yet been made in the 1990s. 
Each convolutional layer uses a $5\times 5$ kernel
and a sigmoid activation function.
These layers map spatially arranged inputs
to a number of 2D feature maps, typically 
increasing the number of channels.
The first convolutional layer has 6 output channels,
while the second has 16.
Each $2\times2$ pooling operation (stride 2) 
reduces dimensionality by a factor of $4$ via spatial downsampling.
The convolutional block emits an output with size given by
(batch size, channel, height, width).
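
The pooling step described above can be sketched in plain Java on a single 2D feature map. `AvgPoolSketch` is a hypothetical class written for this illustration; it assumes a $2\times2$ window, stride 2, no padding, and an input whose height and width are even.

```java
// Hypothetical illustration of 2x2 average pooling with stride 2.
final class AvgPoolSketch {
    // Averages each non-overlapping 2x2 window of one feature map.
    // Assumes x is rectangular with even height and width.
    static float[][] avgPool2x2(float[][] x) {
        int h = x.length / 2;
        int w = x[0].length / 2;
        float[][] out = new float[h][w];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w; j++) {
                float sum = x[2 * i][2 * j] + x[2 * i][2 * j + 1]
                          + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1];
                out[i][j] = sum / 4f;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] x = {
            {1, 2, 3, 4},
            {5, 6, 7, 8},
            {9, 10, 11, 12},
            {13, 14, 15, 16}
        };
        float[][] y = avgPool2x2(x);
        // 16 input values become 4 output values: a factor-of-4 reduction
        System.out.println(y.length + " x " + y[0].length);
    }
}
```

Each output entry summarizes four input entries, which is where the factor of $4$ comes from.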

In order to pass output from the convolutional block
to the fully-connected block, 
we must flatten each example in the minibatch.
In other words, we take this 4D input and transform it
into the 2D input expected by fully-connected layers:
as a reminder, the 2D representation that we desire
uses the first dimension to index examples in the minibatch
and the second to give the flat vector representation of each example.
LeNet's fully-connected layer block has three fully-connected layers,
with 120, 84, and 10 outputs, respectively.
Because we are still performing classification,
the 10-dimensional output layer corresponds
to the number of possible output classes.
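
As a rough illustration of the flattening step, the following plain-Java sketch turns a (batch size, channel, height, width) array into a (batch size, features) array in row-major order. `FlattenSketch` is a hypothetical helper, not a DJL API; in DJL the same effect is requested on the first `Linear` layer.

```java
// Hypothetical illustration of flattening a 4D minibatch into 2D.
final class FlattenSketch {
    // Maps x[b][c][i][j] to out[b][(c * height + i) * width + j].
    static float[][] flatten(float[][][][] x) {
        int n = x.length;
        int channels = x[0].length;
        int height = x[0][0].length;
        int width = x[0][0][0].length;
        float[][] out = new float[n][channels * height * width];
        for (int b = 0; b < n; b++) {
            for (int c = 0; c < channels; c++) {
                for (int i = 0; i < height; i++) {
                    for (int j = 0; j < width; j++) {
                        out[b][(c * height + i) * width + j] = x[b][c][i][j];
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 16 channels of 5x5 feature maps per example, as produced by LeNet's encoder
        float[][] flat = flatten(new float[2][16][5][5]);
        System.out.println(flat.length + " x " + flat[0].length);
    }
}
```

With 16 channels of $5\times5$ feature maps, each example becomes a 400-dimensional vector, which is the input size seen by the first fully-connected layer.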

While getting to the point where you truly understand
what is going on inside LeNet may have taken a bit of work,
hopefully the following code snippet will convince you
that implementing such models with modern deep learning libraries 
is remarkably simple. 
We need only instantiate a `SequentialBlock` 
and chain together the appropriate layers.

In [1]:
%mavenRepo snapshots https://oss.sonatype.org/content/repositories/snapshots/

%maven ai.djl:api:0.7.0-SNAPSHOT
%maven ai.djl:model-zoo:0.7.0-SNAPSHOT
%maven ai.djl:basicdataset:0.7.0-SNAPSHOT
%maven org.slf4j:slf4j-api:1.7.26
%maven org.slf4j:slf4j-simple:1.7.26
%maven ai.djl.mxnet:mxnet-engine:0.7.0-SNAPSHOT
%maven ai.djl.mxnet:mxnet-native-auto:1.7.0-a

In [2]:
%%loadFromPOM
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-jsplot</artifactId>
    <version>0.30.4</version>
</dependency>

In [3]:
%load ../utils/plot-utils.ipynb

In [4]:
import ai.djl.Device;
import ai.djl.Model;
import ai.djl.basicdataset.FashionMnist;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.DataType;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.Activation;
import ai.djl.nn.Block;
import ai.djl.nn.SequentialBlock;
import ai.djl.nn.convolutional.Conv2D;
import ai.djl.nn.core.Linear;
import ai.djl.nn.pooling.Pool;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.GradientCollector;
import ai.djl.training.ParameterStore;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;
import ai.djl.training.evaluator.Accuracy;
import ai.djl.training.initializer.NormalInitializer;
import ai.djl.training.initializer.XavierInitializer;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.training.optimizer.Optimizer;
import ai.djl.training.optimizer.learningrate.LearningRateTracker;
import tech.tablesaw.api.*;
import tech.tablesaw.plotly.api.*;
import tech.tablesaw.plotly.components.*;
import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.components.Figure;
import org.apache.commons.lang3.ArrayUtils;

In [5]:
NDManager manager = NDManager.newBaseManager();

SequentialBlock block = new SequentialBlock();

Block block1 = Conv2D.builder()
                .setKernel(new Shape(5, 5))
                .optPad(new Shape(2, 2))
                .optBias(false)
                .setNumFilters(6).build();
block1.setInitializer(new NormalInitializer());
block1.initialize(manager, DataType.FLOAT32, new Shape(1, 1, 28, 28));
block.add(block1);
block.add(Activation::sigmoid);

Block block2 = Pool.avgPool2DBlock(new Shape(2, 2), new Shape(2, 2), new Shape(0, 0));
block2.initialize(manager, DataType.FLOAT32, new Shape(1, 6, 28, 28));
block.add(block2);

Block block3 = Conv2D.builder()
                .setKernel(new Shape(5, 5))
                .setNumFilters(16).build();
block.add(block3);
block.add(Activation::sigmoid);

Block block4 = Pool.avgPool2DBlock(new Shape(2, 2), new Shape(2, 2), new Shape(0, 0));
block4.initialize(manager, DataType.FLOAT32, new Shape(1, 16, 10, 10));
block.add(block4);

block.add(Linear.builder().optFlatten(true).setOutChannels(120).build());
block.add(Activation::sigmoid);

block.add(Linear.builder().setOutChannels(84).build());
block.add(Activation::sigmoid);

block.add(Linear.builder().setOutChannels(10).build());
block.setInitializer(new XavierInitializer());

We took a small liberty with the original model,
removing the Gaussian activation in the final layer.
Other than that, this network matches
the original LeNet5 architecture.

By passing a single-channel (black and white)
$28 \times 28$ image through the net
and printing the output shape at each layer,
we can inspect the model to make sure 
that its operations line up with 
what we expect from :numref:`img_lenet_vert`.

In [6]:
NDArray X = manager.randomUniform(0f, 1.0f, new Shape(1, 1, 28, 28));
block.initialize(manager, DataType.FLOAT32, X.getShape());

ParameterStore parameterStore = new ParameterStore(manager, false);
for (int i = 0; i < block.getChildren().size(); i++) {

        X = block.getChildren().get(i).getValue().forward(parameterStore, new NDList(X), false).singletonOrThrow();
        System.out.println(block.getChildren().get(i).getKey());
        System.out.println("layer " + (i) + " output : " + X.getShape());
}

Note that the height and width of the representation
shrink as we move through the convolutional block.
The first convolutional layer uses $2$ pixels of padding 
to compensate for the reduction in height and width
that would otherwise result from using a $5 \times 5$ kernel.
In contrast, the second convolutional layer forgoes padding, 
and thus the height and width are both reduced by $4$ pixels.
As we go up the stack of layers,
the number of channels increases layer-over-layer
from 1 in the input to 6 after the first convolutional layer
and 16 after the second layer.
Meanwhile, each pooling layer halves the height and width.
Finally, each fully-connected layer reduces dimensionality,
ultimately emitting an output whose dimension 
matches the number of classes.
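
The spatial sizes can be checked by hand with the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$. The sketch below is plain Java with a hypothetical class `LeNetShapeTrace`; it assumes the $5\times5$ convolutions and the $2\times2$, stride-2 pooling windows described in the text.

```java
// Hypothetical helper that traces the height (equal to the width) through LeNet's encoder.
final class LeNetShapeTrace {
    // Output size along one spatial axis; integer division acts as floor
    // because the numerator is non-negative here.
    static int outSize(int n, int kernel, int pad, int stride) {
        return (n + 2 * pad - kernel) / stride + 1;
    }

    // Returns the size after conv1, pool1, conv2, and pool2, in that order.
    static int[] trace(int input) {
        int conv1 = outSize(input, 5, 2, 1); // 5x5 kernel, padding 2
        int pool1 = outSize(conv1, 2, 0, 2); // 2x2 window, stride 2
        int conv2 = outSize(pool1, 5, 0, 1); // 5x5 kernel, no padding
        int pool2 = outSize(conv2, 2, 0, 2); // 2x2 window, stride 2
        return new int[] {conv1, pool1, conv2, pool2};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(trace(28)));
    }
}
```

For a $28\times28$ input the sizes are 28, 14, 10, and 5, so the encoder emits 16 feature maps of size $5\times5$.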

![Compressed notation for LeNet5](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/lenet-vert.svg)

:label:`img_lenet_vert`


## Data Acquisition and Training

Now that we have implemented the model,
let's run an experiment to see how LeNet fares on Fashion-MNIST.

In [7]:
int batchSize = 256;

FashionMnist trainIter = FashionMnist.builder()
                            .optUsage(Dataset.Usage.TRAIN)
                            .setSampling(batchSize, true)
                            .build();


FashionMnist testIter = FashionMnist.builder()
                            .optUsage(Dataset.Usage.TEST)
                            .setSampling(batchSize, true)
                            .build();
                            
trainIter.prepare();
testIter.prepare();

While convolutional networks have few parameters,
they can still be more expensive to compute
than similarly deep multilayer perceptrons
because each parameter participates in many more
multiplications.
If you have access to a GPU, this might be a good time
to put it into action to speed up training.
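
A back-of-the-envelope count shows why convolutions can be costly despite having few parameters. The following plain-Java sketch uses a hypothetical class `MultiplyCount` and counts only multiplications in the forward pass for a single example, ignoring biases, activations, and pooling.

```java
// Hypothetical helper: forward-pass multiplication counts for one example.
final class MultiplyCount {
    // Every output position of every output channel needs inC * k * k multiplications.
    static long conv(int inC, int outC, int k, int outH, int outW) {
        return (long) outC * outH * outW * inC * k * k;
    }

    // A dense layer uses each weight exactly once per example.
    static long dense(int inputs, int outputs) {
        return (long) inputs * outputs;
    }

    public static void main(String[] args) {
        System.out.println("conv1 (150 weights):    " + conv(1, 6, 5, 28, 28));
        System.out.println("conv2 (2400 weights):   " + conv(6, 16, 5, 10, 10));
        System.out.println("dense1 (48000 weights): " + dense(400, 120));
    }
}
```

Under these assumptions the second convolutional layer performs 240,000 multiplications with only 2,400 weights, whereas the first dense layer performs 48,000 multiplications with 48,000 weights.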

For evaluation, we define an `accuracy` function 
that takes the model's predictions and the ground-truth labels
and returns the number of correct predictions.

In [8]:
public static float accuracy(NDArray yHat, NDArray y) {
    // yHat has shape (batch, classes): if axis 1 holds more than one entry,
    // each row contains per-class scores
    if (yHat.getShape().size(1) > 1) {
        // argMax along axis 1 picks the predicted class for each example
        // Convert predictions and labels to the same data type (int32)
        // Sum the matches to count the correct predictions
        return yHat.argMax(1).toType(DataType.INT32, false).eq(y.toType(DataType.INT32, false))
                .sum().toType(DataType.FLOAT32, false).getFloat();
    }
    return yHat.toType(DataType.INT32, false).eq(y.toType(DataType.INT32, false))
            .sum().toType(DataType.FLOAT32, false).getFloat();
}
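
For readers who want to see the same logic without NDArray operations, here is a minimal plain-Java sketch. `AccuracySketch` and its methods are hypothetical names used only for this illustration; like the function above, it returns a count of correct predictions rather than a fraction.

```java
// Hypothetical plain-Java counterpart of the accuracy computation.
final class AccuracySketch {
    // Index of the largest score in a row (first index wins on ties).
    static int argMax(float[] row) {
        int best = 0;
        for (int i = 1; i < row.length; i++) {
            if (row[i] > row[best]) {
                best = i;
            }
        }
        return best;
    }

    // Number of examples whose highest-scoring class equals the label.
    static int countCorrect(float[][] yHat, int[] y) {
        int correct = 0;
        for (int i = 0; i < y.length; i++) {
            if (argMax(yHat[i]) == y[i]) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        float[][] yHat = {{0.1f, 0.7f, 0.2f}, {0.8f, 0.1f, 0.1f}, {0.2f, 0.3f, 0.5f}};
        int[] y = {1, 0, 1};
        System.out.println(countCorrect(yHat, y) + " of " + y.length + " correct");
    }
}
```

Dividing the accumulated count by the number of examples gives the accuracy as a fraction.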

The training function `trainChapter6` is similar 
to `trainChapter3` defined in :numref:`sec_softmax_scratch`. 
Since we will be implementing networks with many layers 
going forward, we will rely primarily on DJL.
The following training function operates on the DJL block 
defined above rather than on hand-written parameters. 
We initialize the model parameters 
on the block using the Xavier initializer.
Just as with MLPs, our loss function is cross-entropy,
and we minimize it via minibatch stochastic gradient descent. 
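
As a reminder of what these two ingredients compute, here is a minimal plain-Java sketch for a single example. `CrossEntropySgdSketch` is a hypothetical class for illustration; DJL's `Loss.softmaxCrossEntropyLoss()` and `Optimizer.sgd()` operate on whole minibatches and handle gradients automatically.

```java
// Hypothetical illustration of softmax cross-entropy and one SGD update.
final class CrossEntropySgdSketch {
    // -log(softmax(logits)[label]), using the log-sum-exp trick for numerical stability.
    static double softmaxCrossEntropy(double[] logits, int label) {
        double max = Double.NEGATIVE_INFINITY;
        for (double z : logits) {
            max = Math.max(max, z);
        }
        double sumExp = 0.0;
        for (double z : logits) {
            sumExp += Math.exp(z - max);
        }
        return Math.log(sumExp) + max - logits[label];
    }

    // In-place update: w <- w - lr * grad.
    static void sgdStep(double[] w, double[] grad, double lr) {
        for (int i = 0; i < w.length; i++) {
            w[i] -= lr * grad[i];
        }
    }

    public static void main(String[] args) {
        // Two equally scored classes: the loss equals log(2)
        System.out.println(softmaxCrossEntropy(new double[] {0.0, 0.0}, 0));
    }
}
```

In minibatch training, the gradient passed to the update is averaged over the examples in the batch.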

In [9]:
int numEpochs = 10;
float lr = 0.9f;

double[] trainLoss;
double[] testAccuracy;
double[] epochCount;
double[] trainAccuracy;

trainLoss = new double[numEpochs];
trainAccuracy = new double[numEpochs];
testAccuracy = new double[numEpochs];
epochCount = new double[numEpochs];

In [10]:
public void trainChapter6(){
    
    Loss loss = Loss.softmaxCrossEntropyLoss();

    LearningRateTracker lrt = LearningRateTracker.fixedLearningRate(lr);
    Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();

    DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
                    .optOptimizer(sgd) // Optimizer (minibatch SGD)
                    .addEvaluator(new Accuracy()) // Model Accuracy
                    .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging

    Model model = Model.newInstance(Device.defaultDevice().toString());
    model.setBlock(block);
    Trainer trainer = model.newTrainer(config);
    
    float epochLoss = 0f;
    float accuracyVal = 0f;

        for(int epoch=1; epoch <= numEpochs; epoch++){

            for (Batch batch : trainer.iterateDataset(trainIter)) {

                NDArray X = batch.getData().head();
                NDArray y = batch.getLabels().head();

                try (GradientCollector gc = trainer.newGradientCollector()) {

                    NDArray yHat = block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
                    NDArray lossVal = loss.evaluate(new NDList(y), new NDList(yHat));
                    epochLoss += lossVal.mul(batchSize).getFloat();
                    accuracyVal += accuracy(yHat, y);
                    gc.backward(lossVal);
                }

                trainer.step();

                batch.close();
            }

            trainLoss[epoch-1] = epochLoss/trainIter.size();
            trainAccuracy[epoch-1] = accuracyVal/trainIter.size();
            epochLoss = 0f;
            accuracyVal = 0f;

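            for (Batch batch : testIter.getData(manager)) {

                NDArray X = batch.getData().head();
                NDArray y = batch.getLabels().head();

                // Evaluation only: run the forward pass with the training flag off
                NDArray yHat = block.forward(parameterStore, new NDList(X), false).singletonOrThrow();
                accuracyVal += accuracy(yHat, y);

                // Release the native memory held by this batch
                batch.close();
            }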
            testAccuracy[epoch-1] = accuracyVal/testIter.size();
            epochCount[epoch-1] = epoch;
            accuracyVal = 0f;
        }
}

Now let us train the model.

In [11]:
trainChapter6();

In [12]:
String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];

Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
                trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");

Table data = Table.create("Data").addColumns(
            DoubleColumn.create("epochCount", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
            DoubleColumn.create("loss", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
            StringColumn.create("lossLabel", lossLabel)
);

render(LinePlot.create("", data, "epochCount", "loss", "lossLabel"),"text/html");

## Summary

* A ConvNet is a network that employs convolutional layers.
* In a ConvNet, we interleave convolutions, nonlinearities, and (often) pooling operations.
* These convolutional blocks are typically arranged so that they gradually decrease the spatial resolution of the representations, while increasing the number of channels.
* In traditional ConvNets, the representations encoded by the convolutional blocks are processed by one (or more) dense layers prior to emitting output.
* LeNet was arguably the first successful deployment of such a network.

## Exercises

1. Replace the average pooling with max pooling. What happens?
1. Try to construct a more complex network based on LeNet to improve its accuracy.
    * Adjust the convolution window size.
    * Adjust the number of output channels.
    * Adjust the activation function (ReLU?).
    * Adjust the number of convolution layers.
    * Adjust the number of fully connected layers.
    * Adjust the learning rates and other training details (initialization, epochs, etc.)
1. Try out the improved network on the original MNIST dataset.
1. Display the activations of the first and second layer of LeNet for different inputs (e.g., sweaters, coats).


## [Discussions](https://discuss.mxnet.io/t/2353)
