
In [0]:
!pip install mxnet



### Create a neural network

Creating a neural network with ```MXNet``` is quite simple. In this tutorial, we will go through the basics of setting up a neural net in ```MXNet```.

### Importing ```mxnet.gluon.nn```

The module required to create a neural net is ```mxnet.gluon.nn```. The following import statements bring in this module, along with ```nd```, which we will use to create input arrays.

In [0]:
from mxnet import nd
from mxnet.gluon import nn

### Layers

Neural networks (NNs) usually consist of many layers working together. Here we will examine how to create one layer first before creating a sequence of layers. 

A ```Dense``` layer is a common type of layer that you will encounter, especially when you are starting out with machine learning. It is also known as a fully-connected (FC) layer.

The code block below defines a dense layer with 2 output units.

In [0]:
layer = nn.Dense(2)
layer

Dense(None -> 2, linear)

Next, we will initialise its weights with the default initialisation method, which draws random values uniformly from $[-0.07, 0.07]$.

In [0]:
layer.initialize()
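
As a rough illustration of what this initialisation amounts to, the sketch below draws a weight matrix uniformly from $[-0.07, 0.07]$, the range of MXNet's default `Uniform` initialiser (`scale=0.07`). NumPy stands in for MXNet here, and the names `scale`, `rng`, `W_sketch` and `b_sketch` are illustrative only, not part of the Gluon API.

```python
import numpy as np

scale = 0.07                          # default scale of MXNet's Uniform initialiser
rng = np.random.default_rng(seed=0)   # seeded so the sketch is reproducible

# A dense layer with 2 output units and 4 input features stores a (2, 4) weight matrix
W_sketch = rng.uniform(low=-scale, high=scale, size=(2, 4))

# Dense biases start at zero by default
b_sketch = np.zeros(2)

print(W_sketch)
print(b_sketch)
```

Every entry of `W_sketch` lies within the same range as the weights printed by ```layer.weight.data()``` later in this tutorial.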

We will now create a $3 \times 4$ feature matrix, $X$, with values drawn uniformly at random between $-1$ and $1$.

In [0]:
X = nd.random.uniform(-1, 1, (3, 4))
display(X)


[[ 0.09762704  0.18568921  0.43037868  0.6885315 ]
 [ 0.20552671  0.71589124  0.08976638  0.6945034 ]
 [-0.15269041  0.24712741  0.29178822 -0.23123658]]
<NDArray 3x4 @cpu(0)>

Now, we will feed the feature matrix, $X$, into our layer to compute the output.

In [0]:
layer(X)


[[-0.02524132 -0.00874885]
 [-0.06026538 -0.01308061]
 [ 0.02468396 -0.02181557]]
<NDArray 3x2 @cpu(0)>

We can see that the output is a $3 \times 2$ matrix computed from our $3 \times 4$ input. Note that we did not specify the input size of ```layer``` (although we could have done so with the argument ```in_units=4```). The system infers it the first time we feed in data, and only then creates and initialises the weights.
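
To make the idea of deferred shape inference concrete, here is a minimal NumPy sketch. `LazyDense` is a hypothetical stand-in written for this tutorial, not Gluon's actual implementation. As a simplification, it always allocates its weights on the first call, even when `in_units` is supplied.

```python
import numpy as np

class LazyDense:
    """Hypothetical stand-in for a dense layer that allocates weights lazily."""

    def __init__(self, units, in_units=None, scale=0.07):
        self.units = units
        self.in_units = in_units
        self.scale = scale
        self.weight = None   # not allocated until the input width is known
        self.bias = None

    def __call__(self, x):
        if self.weight is None:
            # First call: read the input width from the data, then allocate
            if self.in_units is None:
                self.in_units = x.shape[1]
            rng = np.random.default_rng(0)
            self.weight = rng.uniform(-self.scale, self.scale,
                                      size=(self.units, self.in_units))
            self.bias = np.zeros(self.units)
        return x @ self.weight.T + self.bias

lazy = LazyDense(2)
weight_before = lazy.weight           # still None: no data seen yet
out = lazy(np.ones((3, 4)))           # input width 4 is inferred here
print(weight_before, lazy.weight.shape, out.shape)
```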

We can inspect the weights with the following line of code.

In [0]:
W = layer.weight.data()
W


[[-0.00873779 -0.02834515  0.05484822 -0.06206018]
 [ 0.06491279 -0.03182812 -0.01631819 -0.00312688]]
<NDArray 2x4 @cpu(0)>

Note that the shape of the weight matrix, $W$, is $(2,4)$. 

Recall that in a matrix-matrix product, the number of columns of the first matrix must equal the number of rows of the second. Our input has shape $(3, 4)$ and our weight has shape $(2, 4)$, so the product $X W$ is not defined.

The resulting matrix has shape $(3, 2)$, so the layer must compute $X W^T$, where $W^T$ is the transpose of $W$ with shape $(4, 2)$. A ```Dense``` layer also adds a bias vector, but it is initialised to zero by default, so it does not affect the result here.

We can check this by manually using the ```nd.dot()``` function.

In [0]:
print(layer(X))
display(nd.dot(X,W.T))


[[-0.02524132 -0.00874885]
 [-0.06026538 -0.01308061]
 [ 0.02468396 -0.02181557]]
<NDArray 3x2 @cpu(0)>



[[-0.02524132 -0.00874885]
 [-0.06026538 -0.01308061]
 [ 0.02468396 -0.02181557]]
<NDArray 3x2 @cpu(0)>
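
The same check can be repeated outside MXNet using the values printed above. The NumPy sketch below copies $X$, $W$ and the layer output from the earlier cells and computes $X W^T + b$. The zero bias `b_doc` is an assumption based on the default ```Dense``` bias initialisation, and the `_doc` and `_printed` names are illustrative.

```python
import numpy as np

# Feature matrix X and weight matrix W, copied from the printed outputs above
X_doc = np.array([[ 0.09762704, 0.18568921, 0.43037868,  0.6885315 ],
                  [ 0.20552671, 0.71589124, 0.08976638,  0.6945034 ],
                  [-0.15269041, 0.24712741, 0.29178822, -0.23123658]])
W_doc = np.array([[-0.00873779, -0.02834515,  0.05484822, -0.06206018],
                  [ 0.06491279, -0.03182812, -0.01631819, -0.00312688]])
b_doc = np.zeros(2)   # assumption: bias is still at its default value of zero

# A dense layer without activation computes X W^T + b
Y_manual = X_doc @ W_doc.T + b_doc

# Layer output as printed by layer(X)
Y_printed = np.array([[-0.02524132, -0.00874885],
                      [-0.06026538, -0.01308061],
                      [ 0.02468396, -0.02181557]])

print(Y_manual)
print(np.allclose(Y_manual, Y_printed, atol=1e-6))
```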

### Chaining layers into a neural network

We can create a chain of layers by adding different layer types into ```nn.Sequential()```. You can think of ```nn.Sequential()``` as the entire model which contains all the layers needed to form our network and each layer is run sequentially, hence the name.
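
The idea behind a sequential container can be sketched in a few lines of plain Python. `TinySequential` below is a hypothetical illustration, not Gluon's implementation: it stores callables in order and passes the output of each one to the next.

```python
class TinySequential:
    """Hypothetical container that applies its layers one after another."""

    def __init__(self):
        self.layers = []

    def add(self, *layers):
        # Layers run in the order in which they were added
        self.layers.extend(layers)

    def __getitem__(self, index):
        return self.layers[index]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)   # the output of one layer is the input of the next
        return x

pipeline = TinySequential()
pipeline.add(lambda x: x + 1, lambda x: x * 2)   # two toy "layers"
result = pipeline(3)                              # computes (3 + 1) * 2
print(result)
```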

The code below implements a network modelled on the famous [LeNet](http://yann.lecun.com/exdb/lenet/) architecture.

In [0]:
net = nn.Sequential()

net.add(
    nn.Conv2D(channels=6, kernel_size=5, activation="relu", in_channels=3),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Conv2D(channels=16, kernel_size=3, activation="relu"),
    nn.MaxPool2D(pool_size=2, strides=2),
    nn.Dense(120, activation="relu"),
    nn.Dense(84, activation="relu"),
    nn.Dense(10)
)

net

Sequential(
  (0): Conv2D(3 -> 6, kernel_size=(5, 5), stride=(1, 1), Activation(relu))
  (1): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
  (2): Conv2D(None -> 16, kernel_size=(3, 3), stride=(1, 1), Activation(relu))
  (3): MaxPool2D(size=(2, 2), stride=(2, 2), padding=(0, 0), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
  (4): Dense(None -> 120, Activation(relu))
  (5): Dense(None -> 84, Activation(relu))
  (6): Dense(None -> 10, linear)
)

In [0]:
net.initialize()
# Input shape is (batch_size, color_channels, height, width)
X = nd.random.uniform(shape=(4, 3, 28, 28))
Y = net(X)
Y.shape

(4, 10)
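
To see where the $(4, 10)$ output shape comes from, the sketch below traces the spatial size through each layer using the standard output-size formula for unpadded convolution and pooling. The helper names `conv_out` and `pool_out` are illustrative, and the sketch assumes the default stride of 1 and padding of 0 for the convolutions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, pool, stride):
    """Output length of an unpadded max-pool (floor mode) along one axis."""
    return (size - pool) // stride + 1

side = 28                                   # input height and width
trace = [("input", 3, side)]

side = conv_out(side, kernel=5)             # Conv2D(channels=6, kernel_size=5)
trace.append(("conv 5x5", 6, side))
side = pool_out(side, pool=2, stride=2)     # MaxPool2D(pool_size=2, strides=2)
trace.append(("pool 2x2", 6, side))
side = conv_out(side, kernel=3)             # Conv2D(channels=16, kernel_size=3)
trace.append(("conv 3x3", 16, side))
side = pool_out(side, pool=2, stride=2)     # MaxPool2D(pool_size=2, strides=2)
trace.append(("pool 2x2", 16, side))

# The first Dense layer flattens (channels, height, width) into one axis
flat_features = 16 * side * side

for name, channels, s in trace:
    print(f"{name:10s} -> (batch, {channels}, {s}, {s})")
print(f"flattened  -> (batch, {flat_features}) -> 120 -> 84 -> 10")
```

The batch dimension of 4 passes through unchanged, so the final output has shape $(4, 10)$.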

We can index the network, ```net```, for a particular layer with the ```[]``` operator. Below we will examine the weight of the 1st layer and the bias of the 6th layer.

In [0]:
print(f"First layer's weight: {net[0].weight.data().shape}")
print(f"Sixth layer's bias: {net[5].bias.data().shape}")

First layer's weight: (6, 3, 5, 5)
Sixth layer's bias: (84,)


Recall how we defined our first layer.

```nn.Conv2D(channels=6, kernel_size=5, activation="relu", in_channels=3),```

The shape tuple ```(6, 3, 5, 5)``` contains four values which represent, respectively:

1. the number of output channels (filters)
2. the number of input channels to this layer
3. the height of the kernel
4. the width of the kernel

Now recall how we defined our 6th layer.

```nn.Dense(84, activation="relu")```

The shape tuple ```(84,)``` corresponds to the dimensionality of the output space: the layer has one bias value per output unit.
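
The same reasoning gives the shape of every parameter in the network. The sketch below lists them and counts the values they hold. The dictionary keys are labels chosen for this example, not Gluon parameter names, and the 400 input features of the first dense layer assume the $28 \times 28$ input used above ($16 \times 5 \times 5$ after the second pooling layer).

```python
from math import prod

# Weight shapes follow (out_channels, in_channels, kernel_h, kernel_w) for
# convolutions and (units, in_units) for dense layers; biases have one value
# per output channel or unit.
param_shapes = {
    "layer0 conv weight":  (6, 3, 5, 5),
    "layer0 conv bias":    (6,),
    "layer2 conv weight":  (16, 6, 3, 3),
    "layer2 conv bias":    (16,),
    "layer4 dense weight": (120, 400),   # 400 = 16 * 5 * 5 flattened features
    "layer4 dense bias":   (120,),
    "layer5 dense weight": (84, 120),
    "layer5 dense bias":   (84,),
    "layer6 dense weight": (10, 84),
    "layer6 dense bias":   (10,),
}

for name, shape in param_shapes.items():
    print(f"{name:20s} {str(shape):16s} {prod(shape):6d} values")

total_params = sum(prod(shape) for shape in param_shapes.values())
print("total:", total_params)
```

The pooling layers do not appear because they have no learnable parameters.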



### Creating a neural network

In this section, we will construct a neural network with a flexible forward function. The network will be a subclass of ```nn.Block```.

We will override two methods of ```nn.Block``` to define our neural network:
1. ```__init__()``` method which creates the layers
2. ```forward()``` method which defines the forward function

In [0]:
class MixMLP(nn.Block):
    def __init__(self, **kwargs):
        # Run `nn.Block`'s init method
        super(MixMLP, self).__init__(**kwargs)
        self.block = nn.Sequential()
        self.block.add(nn.Dense(3, activation="relu"),
                     nn.Dense(4, activation="relu"))
        self.dense = nn.Dense(5)
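    def forward(self, x):
        # `block` already ends in ReLU, so this extra ReLU leaves its output unchanged
        y = nd.relu(self.block(x))
        # Show the intermediate activations before the final dense layer
        print(y)
        return self.dense(y)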

net = MixMLP()
net

MixMLP(
  (block): Sequential(
    (0): Dense(None -> 3, Activation(relu))
    (1): Dense(None -> 4, Activation(relu))
  )
  (dense): Dense(None -> 5, linear)
)

The above neural network consists of a sequential layer named ```block``` and a dense layer named ```dense```. 

The ```block``` layer has 2 dense layers which have ReLU as the activation function and are executed sequentially, as explained above.

The ```dense``` layer has no activation function and is executed after ```block``` **because we defined it as such in the forward function**. If we did not call the ```dense``` layer in the forward function, no computation would happen on that layer, and the network would effectively use only the ```block``` layer.
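
The point that only layers called in the forward function take part in the computation can be shown with a small plain-Python sketch. All class names here (`TinyBlock`, `Recorder`, `WithDense`, `WithoutDense`) are hypothetical. The sketch assumes only that calling a block dispatches to its `forward` method, and it omits everything else a real ```nn.Block``` does.

```python
class TinyBlock:
    """Hypothetical base class: calling the object runs its forward method."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError

class Recorder(TinyBlock):
    """Toy layer that logs its name whenever it is executed."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def forward(self, x):
        self.log.append(self.name)
        return x

class WithDense(TinyBlock):
    def __init__(self, log):
        self.block = Recorder("block", log)
        self.dense = Recorder("dense", log)

    def forward(self, x):
        return self.dense(self.block(x))

class WithoutDense(WithDense):
    def forward(self, x):
        # self.dense still exists as an attribute but is never called
        return self.block(x)

log_with, log_without = [], []
WithDense(log_with)(1.0)
WithoutDense(log_without)(1.0)
print(log_with)
print(log_without)
```

Creating a layer in `__init__` only registers it. Whether it runs, and in what order, is decided entirely by `forward`.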

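We will now initialise the network, create a feature matrix, $X$, with random values between 0 and 1, and feed it into the network. The first array printed below is the intermediate output ```y``` from the ```print``` call in the forward function; the second is the network output, $Y$.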

In [0]:
net.initialize()
X = nd.random.uniform(shape=(2,2))
Y = net(X)
Y


[[0.00091355 0.         0.00028848 0.00129108]
 [0.0015378  0.         0.         0.00164204]]
<NDArray 2x4 @cpu(0)>



[[-4.2858930e-05  1.1725605e-04  9.9832926e-07  6.2495194e-05
   2.2238225e-06]
 [-8.8140361e-05  1.6045298e-04  2.3532573e-06  1.0022008e-04
   1.5071701e-05]]
<NDArray 2x5 @cpu(0)>

### Manual computation

To gain a deeper understanding of the network we just defined, we will reproduce its output, $Y$, by hand.

As we need to know what the weights of each layer are, we will grab them from our network above and store them in three separate variables.

In [0]:
W1 = net.block[0].weight.data()
W2 = net.block[1].weight.data()
W3 = net.dense.weight.data()

As explained previously, the output of each dense layer is the matrix product of its input and the transpose of its weight matrix, $W_i^T$. The biases are initialised to zero, so we can ignore them here.

As such, we will perform the steps in the code below and yield the same $Y$ as the network.

In [0]:
print(X)

X = nd.relu(nd.dot(X, W1.T))
X = nd.relu(nd.dot(X, W2.T))
Y = nd.dot(X, W3.T)

Y


[[0.8268704  0.18115096]
 [0.5093421  0.7885455 ]]
<NDArray 2x2 @cpu(0)>



[[-4.2858930e-05  1.1725605e-04  9.9832926e-07  6.2495194e-05
   2.2238225e-06]
 [-8.8140361e-05  1.6045298e-04  2.3532573e-06  1.0022008e-04
   1.5071701e-05]]
<NDArray 2x5 @cpu(0)>
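
One detail is worth checking: the forward function applies an extra ReLU to the output of ```block```, while the manual computation does not. The two agree because ReLU is idempotent: applying it to a value that is already non-negative changes nothing. The NumPy sketch below demonstrates this with randomly drawn stand-in weights (`W1_s`, `W2_s`, `W3_s` are illustrative, not the weights of the network above) and with biases assumed to be zero.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
X_in = rng.uniform(0, 1, size=(2, 2))          # same shape as X in the tutorial
W1_s = rng.uniform(-0.07, 0.07, size=(3, 2))   # stand-in for block[0] weights
W2_s = rng.uniform(-0.07, 0.07, size=(4, 3))   # stand-in for block[1] weights
W3_s = rng.uniform(-0.07, 0.07, size=(5, 4))   # stand-in for dense weights

# Output of the two ReLU-activated dense layers in `block`
hidden = relu(relu(X_in @ W1_s.T) @ W2_s.T)

Y_forward = relu(hidden) @ W3_s.T   # mirrors forward(), with the extra ReLU
Y_manual = hidden @ W3_s.T          # mirrors the manual steps above

print(Y_manual.shape)
print(np.array_equal(Y_forward, Y_manual))
```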