# Neural Networks

Since our aim is to build machine learning models (whether statistical or neural network) to understand IoT data, let's begin by building some simple models in Python. 

In this lecture we look at implementing neural networks in Python. For the most part we will be using Keras, a high-level model-based library for implementing highly complex and sophisticated deep networks. 

But we will start small, and actually code a learning law on our own. We start by implementing a simple Single Layer Perceptron (SLP) to learn the Lorry/Van Classification Problem from the lecture. The table below shows the data we have. We use -1 to mean "Lorry" and 1 to mean "Van":

|Mass    |Length     |Class   |
|:------:|:---------:|:------:|
|10      |6          |-1      |
|20      |5          |-1      |
|5       |4          |1       |
|2       |5          |1       |
|3       |6          |-1      |
|10      |7          |-1      |
|5       |9          |-1      |
|2       |5          |1       |
|2.5     |5          |1       |
|20      |5          |-1      |

Let's begin by importing NumPy and defining a function that creates our dataset from the table. Each input example contains the mass and length of the vehicle, and the labels are -1 for lorry (truck) and 1 for van. The make_dataset function returns the inputs as a 10x1x2 array (ten 1x2 row vectors, one per vehicle) and the labels as a vector of 10 elements.


In [1]:
import numpy as np

# Create our dataset
def make_dataset():
    train_data = np.array([[[10, 6]], [[20, 5]], [[5, 4]], 
                           [[2, 5]], [[3, 6]], [[10, 7]], 
                           [[5, 9]], [[2, 5]], [[2.5, 5]], 
                           [[20, 5]]])
    train_labels = np.array([-1, -1, 1, 1, -1, -1, -1, 1, 1, -1])
    
    return (train_data, train_labels)
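
To see exactly what shapes come out of make_dataset, we can rebuild the same arrays and inspect them. This is only a quick check that repeats the literals from the cell above; nothing new is assumed.

```python
import numpy as np

# Same literals as make_dataset above: each sample is wrapped in an extra
# pair of brackets, so it is stored as a 1x2 row vector.
train_data = np.array([[[10, 6]], [[20, 5]], [[5, 4]],
                       [[2, 5]], [[3, 6]], [[10, 7]],
                       [[5, 9]], [[2, 5]], [[2.5, 5]],
                       [[20, 5]]])
train_labels = np.array([-1, -1, 1, 1, -1, -1, -1, 1, 1, -1])

print(train_data.shape)     # (10, 1, 2)
print(train_labels.shape)   # (10,)
print(train_data[0].shape)  # (1, 2): one sample at a time
```

Iterating over train_data therefore yields one 1x2 row vector per vehicle.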


Let's now initialize the SLP. We store the SLP as a dictionary with the following layout:

slp = {

"inputs":<1x3 input vector>,

"weights":<3x1 weights>,

"output":<1x1 output>

}

Although we have only 2 inputs, our input is defined as 1x3 as we need to include the bias. There is only one output, and thus the weights will be a 3x1 matrix of random numbers:

In [2]:
# Initialize the SLP:
# We store our SLP as a dictionary. There are 3 inputs since we have Mass,
# Length, and a bias which is always 1.0. There are 3 weights to connect
# the 3 inputs to the output, and a single output

def init_slp(slp):
    slp['inputs'] = np.array([0.0, 0.0, 1.0])
    slp['weights'] = np.random.randn(3, 1)
    slp['output'] = np.array(0)
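
Because the weights come from np.random.randn, every run starts from a different point and the error figures will differ from run to run. If you want repeatable results, one option is to seed NumPy's random generator first. The sketch below uses a hypothetical helper init_weights and an arbitrary seed of 42; neither is part of the original code.

```python
import numpy as np

def init_weights(seed):
    # Hypothetical helper: seed NumPy's global generator, then draw a
    # 3x1 weight matrix of the same shape that init_slp uses.
    np.random.seed(seed)
    return np.random.randn(3, 1)

w_a = init_weights(42)
w_b = init_weights(42)
print(np.array_equal(w_a, w_b))  # True: same seed, same weights
```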
    

Now we come to the meat of the SLP: The feedforward and learn functions. The feedforward function is defined as:

$$
f(in, w) = \tanh\left(\sum_{i=0}^{n-1}in_i \times w_{i,0}\right)
$$

Since we have defined our input as a $1\times3$ matrix and the weights as a $3\times1$ matrix, the feedforward is simply a matrix multiply. We use a tanh transfer function since this maps us between -1 and 1. The learn function runs the feedforward, takes the error between the target and the output, and adjusts each weight in proportion to that error and to the input feeding it. We set a parameter *alpha* to control the speed of learning. The learn function returns the absolute error, which we will later use to compute the mean absolute error (MAE) across all samples:

In [3]:
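# Compute the feedforward
def feed_forward(slp, inputs):
    # Copy mass and length into the first two slots; the third slot
    # keeps the bias value of 1.0
    slp["inputs"][0:2] = inputs
    # Take dot-product of the inputs and the weights, then apply tanh
    slp["output"] = np.tanh(np.matmul(slp["inputs"], slp["weights"]))
    return slp["output"]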

def learn(slp, alpha, inputs, target):
    feed_forward(slp, inputs)
    
    # Find error
    E = target - slp['output']
    slp["weights"] = np.add(slp["weights"], (alpha * E[0] 
                                          * slp['inputs'].reshape(3,1)))
    return abs(E[0])
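
Reading the learn function above, the update it applies appears to be $w_{i,0} \leftarrow w_{i,0} + \alpha\,(t - y)\,in_i$, where $t$ is the target and $y$ is the feedforward output. To make that concrete, here is one update step worked through by hand. The starting weights are made up for illustration so that the numbers are easy to follow.

```python
import numpy as np

alpha = 0.1
inputs = np.array([10.0, 6.0, 1.0])          # mass, length, bias
weights = np.array([[-0.1], [0.05], [0.2]])  # made-up starting weights
target = -1.0                                # lorry

# Forward pass: the weighted sum is -1.0 + 0.3 + 0.2 = -0.5
output = np.tanh(np.matmul(inputs, weights))  # shape (1,)

# Error and weight update, mirroring learn()
error = target - output[0]
weights = weights + alpha * error * inputs.reshape(3, 1)

print(output[0])        # about -0.462
print(error)            # about -0.538
print(weights.ravel())  # about [-0.638 -0.273  0.146]
```

Note that the mass weight moves the most, simply because mass is the largest input for this sample.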

Finally we can create our SLP and train it. We make 601 passes over the training data (iterations 0 to 600) and print the MAE every 50 iterations.

In [4]:
slp = {}
init_slp(slp)
feed_forward(slp, np.array([[20.0, 5.0]]))

(train_in, train_out) = make_dataset()
for i in range(601):
    ctr = 0
    E = 0
    for j, data in enumerate(train_in):
        ctr = ctr + 1
        E = E + learn(slp, 0.1, data, train_out[j])
    
    if (i % 50) == 0:
        print("Iteration %d, Average Absolute Error: %3.2f" % (i, E / ctr))



Iteration 0, Average Absolute Error: 1.36
Iteration 50, Average Absolute Error: 0.45
Iteration 100, Average Absolute Error: 0.44
Iteration 150, Average Absolute Error: 0.44
Iteration 200, Average Absolute Error: 0.43
Iteration 250, Average Absolute Error: 0.43
Iteration 300, Average Absolute Error: 0.42
Iteration 350, Average Absolute Error: 0.42
Iteration 400, Average Absolute Error: 0.41
Iteration 450, Average Absolute Error: 0.39
Iteration 500, Average Absolute Error: 0.01
Iteration 550, Average Absolute Error: 0.01
Iteration 600, Average Absolute Error: 0.01
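
For reference, the "Average Absolute Error" printed above corresponds to the following quantity, with $N = 10$ samples, $t_j$ the target for sample $j$ and $y_j$ the output the SLP produced for it. The symbols $N$, $t_j$ and $y_j$ are our own notation here, not from the lecture:

$$
\text{MAE} = \frac{1}{N}\sum_{j=0}^{N-1}\left|t_j - y_j\right|
$$

Since the weights are updated after every sample, each $y_j$ is computed with the weights as they stood just before sample $j$ was learnt, so this is a running average over the pass rather than a score for one fixed set of weights.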


We can see that the MAE settles at a decent value of 0.01. Now we can try three sample inputs:

|Mass    |Length     |
|:------:|:---------:|
|12      |7          |
|3       |5          |
|15      |12         |



In [5]:
test_inputs = np.matrix([[12, 7], [3, 5], [15, 12]])

print("Mass\tLength\tClass")
print("-----\t------\t-----")

for x in test_inputs:
    y = feed_forward(slp = slp, inputs = x)
    veh_type = 'truck' if y<=0.0 else 'van'
    print("%3.1f\t%3.1f\t%s"% (x[0,0], x[0,1], veh_type))
    

Mass	Length	Class
-----	------	-----
12.0	7.0	truck
3.0	5.0	van
15.0	12.0	truck


Since we didn't put aside some of the training data for testing (there are only 10 samples), we don't have a "gold standard" to evaluate how good this SLP is. That's alright, since our main aim was to see how to implement the learning law. In any case the outputs here seem consistent with the training data (large mass, length -> truck, otherwise it's a van).

## Keras Models

In this course we will use the Keras library to implement our neural networks. Keras is a convenient high-level library that is built on top of Google's TensorFlow project, which is in turn a very large and complex library for handling vector arithmetic.

TensorFlow can be configured to run on graphics processing units (GPUs) or Tensor Processing Units (TPUs), but you will need to install GPU or TPU versions separately from what is already provided in this course.

Full documentation on Keras can be found at [Keras docs](https://keras.io/api/)

Keras is much easier to learn than TensorFlow, and older standalone releases could also run on other deep learning backends like Microsoft CNTK and Theano. It is however less flexible and powerful than using TensorFlow directly. The latest versions of TensorFlow include Keras, and that is the version we use here: the code below imports everything from tensorflow.keras rather than from a separately installed Keras package.

Keras is also definitely much more convenient to use than NumPy. ;)

While TensorFlow is centred around the idea of transformations to tensors (basically vectors and matrices) as they flow through the system (hence the name TensorFlow), Keras is centred around higher level models. In Keras there are two ways to build a model:

    1. Sequential Model
    
    In the Sequential Model neural network layers are just simply stacked together. Keras infers how to connect the layers together based on tensor sizes between the outputs of one layer and the inputs of the next layer. The Sequential Model is easy to understand but restricts you to a simple stacking model.
    
  ![Stacked neural network layers](./Images/fully-connected.png)
    
    2. The Functional API
    
    The Functional API is a bit more complicated in that it requires you to specify how to connect one layer to another. However it allows you to build much more sophisticated architectures, like neural networks that take inputs further inside (as opposed to only the input layer), or incorporate outputs from various other neural networks to produce a single consolidated output.
    
![More complex network](./Images/complex.png)
    
In this document we will look at how to create the simpler Sequential model, and see how to use the Functional API in a future session. Let's begin with building a simple Multi-Layer Perceptron using the Sequential Model to recognize handwritten digits from the famous MNIST dataset.

The MNIST dataset consists of 28x28 grayscale images of handwritten digits:

![MNIST set](./Images/mnist.jpg)

Our job then is to build a classifier that takes a 28x28 image and classifies it as one of the 10 digits.

## Imports

We begin by importing:

    - Dense: The Keras implementation for a layer of normal fully-connected NN nodes
    - Dropout: A special layer to control overfitting. More on this in later lectures.
    - Sequential: The Keras Sequential Model.
    - SGD: Stochastic Gradient Descent, a modified version of the gradient descent method that you saw in the lecture.
    


In [6]:
from tensorflow.keras.models import Sequential

# Import Dense and Dropout
from tensorflow.keras.layers import Dense, Dropout

# Bring in the MNIST dataset
from tensorflow.keras.datasets import mnist

# This is a utility function that generates "one-hot" vectors. E.g. if 
# there are four categories, and the samples have the following output
# labels: (0, 3, 1, 2), this will be converted to the following labels:
# 0: [1, 0, 0, 0]
# 3: [0, 0, 0, 1]
# 1: [0, 1, 0, 0]
# 2: [0, 0, 1, 0]

from tensorflow.keras.utils import to_categorical

# Finally we bring in the optimizer
from tensorflow.keras.optimizers import SGD
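
The one-hot conversion described in the comments above is easy to reproduce by hand. The sketch below uses plain NumPy rather than to_categorical, purely to show what the encoding looks like for the same four labels.

```python
import numpy as np

labels = np.array([0, 3, 1, 2])
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i, so
# indexing the identity matrix by the labels encodes them all at once.
one_hot = np.eye(num_classes)[labels]
print(one_hot)
```

Each row has a single 1 in the column that matches its label, which is the same pattern as the four vectors listed in the comments.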


## Designing and Building the Model

Now let's begin building our model. The weights of the neural networks are called "parameters", and these are decided upon using an optimization algorithm. However we ourselves need to decide on "hyperparameters", which refer to the design of the NN:

    - The size and shape of the input
    - Encoding for the input
    - # of hidden layers
    - Size of each hidden layer
    - Transfer functions
    - Size of the output
    
Some of these are easy to determine. Since our inputs are 28x28 images and it's easier to work with a single dimension vector, we will reshape them into a single 784 element input. Hence the input layer will consist of 784 input nodes. We will scale all inputs to between 0 and 1 for performance reasons. There are 10 digits and thus 10 output nodes.

For the rest we apply a combination of two well respected design techniques called "intuition" and "guesswork" and produce the following design:

    - # of input nodes: 784
    - Encoding: Scale between 0 and 1
    - # of hidden layers: 2
    - Size of hidden layer 1: 1024 nodes
    - Transfer function: ReLU (see below)
    - Size of hidden layer 2: 256 nodes
    - Transfer function: ReLU
    - Size of output: 10 nodes
    - Transfer function: Softmax
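
It is worth getting a feel for how many parameters this design implies. The short calculation below assumes every layer is fully connected with one bias per node, which is the default for a Keras Dense layer.

```python
# Layer sizes from the design above: input, hidden 1, hidden 2, output
layer_sizes = [784, 1024, 256, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Each fully-connected layer has one weight per input-output pair
    # plus one bias per output node.
    params = n_in * n_out + n_out
    total += params
    print("%4d -> %4d: %7d parameters" % (n_in, n_out, params))

print("Total: %d parameters" % total)  # Total: 1068810 parameters
```

Most of the parameters sit in the first hidden layer, because it is connected to all 784 input pixels.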
    
The ReLU, Sigmoid (similar to Softmax) and other transfer functions are shown below. We saw these in the lecture:

![Transfer Functions](./Images/transfer.png)
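
The transfer functions used in our design can each be written in a line or two of NumPy. This is a minimal sketch for intuition only; Keras has its own implementations, and the sample vector z is made up.

```python
import numpy as np

def relu(x):
    # Pass positive values through, clamp negatives to zero
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squash each value independently into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability, then normalise so the
    # outputs are positive and sum to 1
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))
print(softmax(z))
```

Sigmoid treats every value on its own, while softmax makes the outputs compete: they always add up to 1, which is why it suits a 10-way classification output.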

We also add a "dropout" layer which randomly drops a percentage of the nodes for training, to reduce overfitting. We will look at this in more detail in a later lecture.
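
To give a rough idea of what a dropout layer does during training, here is a small NumPy sketch. The function name dropout and the seed are our own choices for illustration. The scaling by 1/(1 - rate) follows the behaviour described in the Keras documentation, where the surviving values are scaled up so the expected sum is unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each node with probability (1 - rate), zero the rest
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors up so the expected sum is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)  # seed chosen arbitrarily
layer_output = np.ones(10)
dropped = dropout(layer_output, 0.3, rng)
print(dropped)  # each entry is either 0 or 1/0.7
```

Dropout is only active while training; when the network is evaluated or used for prediction, all nodes are kept.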

Let's build our network. We first create a Sequential model then add the layers. We then create the optimizer and call compile to finalize the model:


In [8]:
model = Sequential()

# First hidden layer
model.add(Dense(1024, input_shape = (784, ), activation = 'relu'))

# Randomly drop 30% of this layer for training
model.add(Dropout(0.3))

# Add the second hidden layer. No need to specify the input shape since
# Keras can infer it from the previous layer (1024 nodes)
# As before we randomly drop 30% of the nodes for training.
model.add(Dense(256, activation = 'relu'))
model.add(Dropout(0.3))

# Finally our output with softmax activation
model.add(Dense(10, activation = 'softmax'))

# Create a Stochastic Gradient Descent optimizer with a learning rate of
# 0.01 and a momentum of 0.9, which helps control "overshoot"
sgd  = SGD(learning_rate = 0.01, momentum = 0.9)

# Now compile the model. We use a "categorical cross entropy" loss function
# which is more sophisticated than the simple mean-squared loss
# function in the lecture and well suited for classification problems.
# We will look at it again at a later lecture.
model.compile(loss = 'categorical_crossentropy', optimizer = sgd,
             metrics = ['accuracy'])
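
To get a feel for what the learning rate and momentum do, here is a tiny sketch of a momentum update on a toy problem. The update rule is the one given in the Keras SGD documentation for plain (non-Nesterov) momentum; the function name sgd_momentum_step and the toy function f(w) = w^2 are assumptions made for illustration.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.9):
    # velocity accumulates a decaying sum of past gradient steps
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# Toy problem (an assumption for illustration): minimise f(w) = w^2,
# whose gradient is 2w and whose minimum is at w = 0.
w, velocity = 5.0, 0.0
for step in range(300):
    w, velocity = sgd_momentum_step(w, velocity, grad=2.0 * w)

print(abs(w) < 1e-3)  # True
```

In the real network the same kind of update is applied to every weight, with the gradient coming from the loss over each batch.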



## Loading the Dataset

This is literally it! We've built the network! Now let's bring in the MNIST dataset. We will reshape each image from 28x28 to a single 784-element vector, then convert it to 32-bit floats and divide by 255 to scale the pixel values to between 0 and 1. We will also convert the labels to one-hot vectors.

In [9]:
# Load the data and reshape it. train_x.shape[0] contains the number
# of rows.
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x = train_x.reshape(train_x.shape[0], 784)
test_x = test_x.reshape(test_x.shape[0], 784)

# Convert to 32-bit floats
train_x = train_x.astype('float32')
test_x = test_x.astype('float32')

# Now scale the pixel values from the range 0-255 to the range 0-1:
train_x = train_x / 255.0
test_x = test_x / 255.0

# Now convert the labels to one-hot vectors
print("Before: train_y: ", train_y)
train_y = to_categorical(train_y, 10)
print("After: train_y: ", train_y)
test_y = to_categorical(test_y, 10)



Before: train_y:  [5 0 4 ... 5 6 8]
After: train_y:  [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]]
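
If you want to check the reshape and scaling steps without downloading MNIST, you can try them on a small stand-in array. The random "images" below are an assumption made so that the sketch runs offline; only the shapes and value ranges matter.

```python
import numpy as np

# Stand-in for MNIST (an assumption so the sketch runs offline):
# 5 random "images" of 28x28 pixels with 8-bit intensities.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

flat = images.reshape(images.shape[0], 784)  # one row per image
scaled = flat.astype('float32') / 255.0      # pixel values now in [0, 1]

print(images.shape, "->", scaled.shape)      # (5, 28, 28) -> (5, 784)
print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)
```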


## Start Training

Now that we have built the network, and loaded and properly encoded the data, let's start training. Here we will call the "fit" method, and shuffle the data around. We will train for 10 epochs in batches of 60 samples. "Batches" are useful for controlling memory usage, especially when you are working in memory limited environments like GPUs.
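
To make the batching described above concrete, here is a rough sketch of how one epoch could be split into shuffled batches. The helper iterate_batches is hypothetical and is not how Keras implements it internally; it only illustrates the idea. The MNIST training set has 60000 images.

```python
import numpy as np

def iterate_batches(num_samples, batch_size, rng):
    # Shuffle the sample indices once per epoch, then slice them
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)  # seed chosen arbitrarily
batches = list(iterate_batches(60000, 60, rng))
print(len(batches))     # 1000 weight updates per epoch
print(len(batches[0]))  # 60 samples in each batch
```

Only one batch of 60 samples needs to be processed at a time, which is what keeps the memory usage under control.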

In [10]:
model.fit(x = train_x, y = train_y, shuffle = True, batch_size = 60, 
          epochs = 10, validation_data = (test_x, test_y))

print("Done training. Now evaluating:")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Done training. Now evaluating:


## Evaluation

Finally once training is over we evaluate the network for performance:


In [11]:
loss, acc = model.evaluate(x = test_x, y = test_y)
print("Final loss is %3.2f, accuracy is %3.2f." % (loss, acc))

Final loss is 0.06, accuracy is 0.98.
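
To see what the two numbers reported above mean, here is a hand-rolled version of both in NumPy. This is a sketch for intuition, not the Keras implementation; the function names, the small eps used for clipping, and the three sample predictions are all made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true is one-hot, so only the true class's probability contributes
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    # A prediction is correct when the most probable class is the true one
    hits = np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)
    return float(np.mean(hits))

# Made-up softmax outputs for three samples and three classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print("loss %3.2f, accuracy %3.2f" % (loss, acc))
```

The third sample is misclassified, and it is also the one that contributes most to the loss, because the network gave its true class a probability of only 0.2.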
