### MA755 Machine Learning - Chapter 10. Introduction to Artificial Neural Networks - 2 May 2017 

These notes are based on, and include images from, [_Hands-On Machine Learning with Scikit-Learn and TensorFlow_](http://shop.oreilly.com/product/0636920052289.do) by Aurélien Géron, published by O'Reilly Media, Inc., 2017

In [1]:
!pip install tensorflow --quiet

Collecting tensorflow
  Downloading tensorflow-1.1.0-cp35-cp35m-manylinux1_x86_64.whl (31.0MB)
Collecting werkzeug>=0.11.10 (from tensorflow)
  Downloading Werkzeug-0.12.1-py2.py3-none-any.whl (312kB)
Collecting protobuf>=3.2.0 (from tensorflow)
  Downloading protobuf-3.2.0-cp35-cp35m-manylinux1_x86_64.whl (5.6MB)
Installing collected packages: werkzeug, protobuf, tensorflow
Successfully installed protobuf-3.2.0 tensorflow-1.1.0 werkzeug-0.12.1
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip ins

In [5]:
import tensorflow as tf
import numpy      as np

The basic building block of a neural network is a _linear threshold unit_ (LTU).

In practice the step function is not the Heaviside function, but it is included below to demonstrate the origins of the LTU.

###  Linear Threshold Unit (LTU)

A _linear threshold unit_ is defined with:

- _weights_: $\textbf{w}=(w_1, w_2, ..., w_m)$
- _input_: $\textbf{x}=(x_1, x_2, ..., x_m)$ (a vector of independent variable values, for a row)
- _output_: $\operatorname{step}(\textbf{w}^{T}\cdot\textbf{x})=\operatorname{step}(w_1 x_1 + w_2 x_2 + \cdots + w_m x_m)$

where the most common step function is the Heaviside function:
\begin{align}
\operatorname{heaviside}(z) = 0 & \text{ if } z < 0
\\                          = 1 & \text{ if } z \ge 0
\end{align}

![alt](LTU.png)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

__Notice that the weights are associated with the edges (or arrows).__
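
The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the book: the function names `heaviside` and `ltu` and the example weights and inputs are made up for this sketch.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 0 if z < 0, 1 if z >= 0."""
    return np.where(z >= 0, 1, 0)

def ltu(w, x):
    """Output of a linear threshold unit: step(w^T x)."""
    return heaviside(np.dot(w, x))

# Hypothetical weights and two hypothetical input rows
w = np.array([0.5, -1.0, 2.0])
x_a = np.array([1.0, 2.0, 0.5])   # weighted sum = 0.5 - 2.0 + 1.0 = -0.5
x_b = np.array([1.0, 0.0, 0.0])   # weighted sum = 0.5

print(int(ltu(w, x_a)), int(ltu(w, x_b)))
```

The first input gives a negative weighted sum, so the unit outputs `0`; the second gives a positive sum, so it outputs `1`.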

### Perceptron

> A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs. 

Each neuron produces one output, so a Perceptron with several LTUs has several outputs.

The perceptron `fit` method:
- input: a 2D numpy array of shape `(n_samples, n_features)`
- input: a 1D numpy array of shape `(n_samples,)`
- output: multiple binary values (multioutput)

The following code is a mini-example of the `Perceptron` class.

In [9]:
from sklearn.datasets     import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # Iris Setosa? (np.int was removed in NumPy 1.24)

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
y_pred

array([1])

The perceptron model is graphically displayed below.

![alt](perceptron.png)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

### Perceptron learning rule (weight update)

The network is a flow chart which takes input values to output values. The arrows and their weights determine the output(s) for each set of inputs. 

Below the weight $w_{i,j}$ is the arrow between the $i$-th input neuron and the $j$-th output neuron. This and other details are listed below. 

![alt](perceptronlearningrule.png)

Where: 
- $w_{i, j}$ is the connection weight between the $i$-th input neuron and the $j$-th output neuron
- $x_i$ is the $i$-th input value of the current training instance
- $\hat{y}_j$ is the predicted output of the $j$-th output neuron for the current training instance
- $y_j$ is the target output of the $j$-th output neuron for the current training instance
- $\eta$ is the learning rate

Assuming that the learning rate is positive (in the text it is set to `1`), I believe the expression $\hat{y}_j - y_j$ from the text (above) should be reversed to $y_j - \hat{y}_j$.

Exercise: Check this. See also [Perceptron at Wikipedia](https://en.wikipedia.org/wiki/Perceptron).
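
One way to check the sign is to run the rule on a tiny problem. The sketch below uses the $y_j - \hat{y}_j$ form, as in the Wikipedia article, with a single output neuron. The function names, the bias term `b`, the starting weights of zero, and the AND-gate data are all assumptions made for this illustration; they are not taken from the text.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, n_epochs=10):
    """Train a single LTU with the update w_i <- w_i + eta * (y - y_hat) * x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias term (an addition of this sketch)
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            error = y_i - y_hat          # target minus prediction
            w += eta * error * x_i
            b += eta * error
    return w, b

def perceptron_predict(X, w, b):
    """Apply the Heaviside step to the weighted sums."""
    return (X @ w + b >= 0).astype(int)

# Hypothetical, linearly separable data: the logical AND of two inputs
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w_and, b_and = perceptron_train(X_and, y_and)
print(perceptron_predict(X_and, w_and, b_and))
```

With this sign, a prediction that is too high lowers the weights of the active inputs and a prediction that is too low raises them, which is what lets the rule settle on this small example.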

### Replacing the step function

The step function returns only two values (`0` and `1`), and its derivative is zero everywhere except at $z = 0$, where it is undefined, so the gradient is not useful.

For these reasons it is replaced with an __activation function__, such as one of the logistic/sigmoid, hyperbolic tangent or ReLU functions.

Sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Hyperbolic tangent function:
$$\tanh(z) = 2\sigma(2z) - 1
$$

ReLU function:
$$\operatorname{ReLU}(z) = \max(0,z)
$$

In practice the ReLU function is used most often.

These three functions and their derivatives are displayed below.

![alt](derivatives.png)
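
The three functions and their derivatives can also be written out in NumPy. This is a minimal sketch; the function names are chosen for this illustration, and taking the ReLU derivative at $z = 0$ to be `0` is a convention assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    # Derivative is 0 for z < 0 and 1 for z > 0; 0 is used at z == 0 by convention
    return (z > 0).astype(float)

z = np.linspace(-5, 5, 11)

# Check the identity tanh(z) = 2 * sigmoid(2z) - 1 numerically
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))
```

Unlike the step function, all three have a nonzero derivative over at least part of their domain, which is what Gradient Descent needs.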

### Softmax

When a neural network is used for classification and the number of classes predicted is greater than two, then the softmax function can provide, for each instance, probabilities for each class that add to `1`.

The softmax function has $K$ input values $z_j$ for $j=1, ..., K$ and $K$ output values
$$
    \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}
$$
for $j=1, ..., K$ where $K$ is usually the number of classes of the target variable.

These output values are interpreted as probabilities. 

See [Softmax at Wikipedia](https://en.wikipedia.org/wiki/Softmax_function).
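
The formula translates directly into NumPy. The sketch below is an illustration with made-up input scores; subtracting the maximum before exponentiating is a common numerical-stability step added here, and it does not change the result because the factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1D array of K scores; the K outputs sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift by the max to avoid overflow
    return e / e.sum()

# Hypothetical scores for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score receives the largest probability, and equal scores receive equal probabilities.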

![alt](softmax.png)

### Multi-Layer Perceptron (MLP)

An MLP consists of multiple layers of LTUs. See the graphic below.

![alt](mlp.png)

> For each training instance, 
1. the algorithm feeds it to the network and computes the output of every neuron in each consecutive layer (this is the forward pass, just like when making predictions). 
1. Then it measures the network’s output error (i.e., the difference between the desired output and the actual output of the network), and it computes how much each neuron in the last hidden layer contributed to each output neuron’s error.
1. It then proceeds to measure how much of these error contributions came from each neuron in the previous hidden layer—and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward in the network (hence the name of the algorithm).

### Training Model

> For each training instance the _backpropagation algorithm_ first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).

1. For each instance, feed its input values into the network and successively calculate each layer. The last layer calculates the prediction.
1. Measure the error at each layer (in reverse order of the previous step)
1. Modify the weights to decrease the error (gradient descent)
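
The three steps above can be sketched for a very small network in NumPy. Everything specific here is an assumption made for illustration: one hidden layer of 4 sigmoid neurons, a mean squared error loss, the XOR data, the learning rate, and the use of the whole data set in each step (rather than one instance at a time).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Step 1: compute each layer in turn; the last layer is the prediction."""
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    return h, y_hat

def backward(X, y, h, y_hat, W2):
    """Step 2: propagate the error backward to get the loss gradients."""
    n = X.shape[0]
    delta2 = (2.0 / n) * (y_hat - y) * y_hat * (1 - y_hat)  # output layer
    delta1 = (delta2 @ W2.T) * h * (1 - h)                  # hidden layer
    return X.T @ delta1, delta1.sum(axis=0), h.T @ delta2, delta2.sum(axis=0)

# Hypothetical data: XOR of two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.RandomState(42)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

eta = 1.0  # learning rate (an assumption of this sketch)
losses = []
for step in range(5000):
    h, y_hat = forward(X, W1, b1, W2, b2)
    losses.append(np.mean((y_hat - y) ** 2))
    dW1, db1, dW2, db2 = backward(X, y, h, y_hat, W2)
    # Step 3: move each weight a little against its gradient
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2

print(losses[0], losses[-1])
```

Comparing the first and last entries of `losses` shows whether the weight updates reduced the error on this toy problem.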

### Training Model - Backpropagation

For a nicely worked-out example, with code and visualizations, see:
- https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

### The `sklearn` implementation of MLP

- http://scikit-learn.org/stable/modules/neural_networks_supervised.html
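
As a rough illustration of that implementation, the sketch below fits `MLPClassifier` from `sklearn.neural_network` to the iris data. The choice of data set, the single hidden layer of 10 neurons, `max_iter=2000`, and the 80/20 split are assumptions of this example rather than recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Scale the features using statistics from the training set only
scaler = StandardScaler().fit(X_train)

mlp_clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=42)
mlp_clf.fit(scaler.transform(X_train), y_train)

test_accuracy = mlp_clf.score(scaler.transform(X_test), y_test)
print(test_accuracy)
```

After fitting, `mlp_clf.coefs_` holds one weight matrix per layer transition (input to hidden, hidden to output).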

### Training an MLP with TensorFlow’s High-Level API

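Use the `DNNClassifier` class from the `tensorflow.contrib.learn` module. (This module is part of TensorFlow 1.x, the version installed above; `tf.contrib` was removed in TensorFlow 2.x.)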

In [11]:
import pickle
# Load the pickled MNIST dataset; the context manager closes the file afterwards
with open("mnist.p", "rb") as f:
    mnist = pickle.load(f)

In [12]:
print("mnist.data.shape  ", mnist.data.shape)
print("mnist.target.shape", mnist.target.shape)

mnist.data.shape   (70000, 784)
mnist.target.shape (70000,)


In [13]:
from sklearn.preprocessing   import StandardScaler
mnist_data_scaled = StandardScaler().fit_transform(mnist.data.astype(float))

In [16]:
from sklearn.model_selection import train_test_split

mnist_data_scaled = StandardScaler().fit_transform(mnist.data.astype(float))
mnist_target      = mnist.target.astype(int)

(train_data,        test_data, 
 train_target,      test_target, 
) = train_test_split(mnist_data_scaled,
                     mnist_target,
                     test_size=0.2, 
                     random_state=42)

(train_data.shape,        test_data.shape, 
 train_target.shape,      test_target.shape,
)

((56000, 784), (14000, 784), (56000,), (14000,))

In [19]:
import tensorflow as tf

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(train_data)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100],
                                         n_classes=10,
                                         feature_columns=feature_columns)
dnn_clf.fit(x=train_data, y=train_target, batch_size=50, steps=40000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_environment': 'local', '_num_ps_replicas': 0, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_model_dir': None, '_num_worker_replicas': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1fea8f10b8>, '_keep_checkpoint_every_n_hours': 10000, '_is_chief': True, '_tf_random_seed': None, '_save_summary_steps': 100, '_evaluation_master': '', '_task_type': None, '_keep_checkpoint_max': 5, '_master': '', '_task_id': 0}
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpncrcrp1x/model.ckpt.
INFO:tensorflow:step = 1, loss = 2.87816
INFO:tensorflow:global_step/sec: 128.948
INFO:tensorflow:step = 101, loss = 0.202792 (0.777 sec)
INFO:tensorflow:global_step/sec: 148.674
INFO:tensorflow:step = 201, loss = 0.196656 (0.673 sec)
INFO:tensorflow:global_step/sec: 177.962
INFO:tensorflow:step = 301, loss = 0.343907 (0.561 sec)
INFO:tensorflow:global_step/sec: 204.302
INFO:tensorflow:step = 401, loss = 0.281701 (0.490 sec)
INFO:tensorflow:global_step/sec: 205.828
INFO:tensorflow:step = 501, loss = 0.314509 (0.486 sec)
INFO:tensorflow:global_ste

DNNClassifier(params={'gradient_clip_norm': None, 'input_layer_min_slice_size': None, 'dropout': None, 'hidden_units': [300, 100], 'feature_columns': (_RealValuedColumn(column_name='', dimension=784, default_value=None, dtype=tf.float64, normalizer=None),), 'activation_fn': <function relu at 0x7f201e5d29d8>, 'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x7f1fea8f11d0>, 'embedding_lr_multipliers': None, 'optimizer': None})

In [23]:
from sklearn.metrics import accuracy_score
predict_target = list(dnn_clf.predict_classes(test_data))
accuracy_score(test_target, predict_target)

INFO:tensorflow:Restoring parameters from /tmp/tmpncrcrp1x/model.ckpt-40000




0.97714285714285709

In [20]:
from sklearn.metrics import accuracy_score
predict_target = list(dnn_clf.predict(test_data))
accuracy_score(test_target, predict_target)

Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.


INFO:tensorflow:Restoring parameters from /tmp/tmpncrcrp1x/model.ckpt-40000


0.97714285714285709

### Hyperparameters

__Number of hidden layers__

> deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train

> Real-world data is often ... hierarchical. 
- earlier hidden layers model low-level structures
- intermediate hidden layers combine these low-level structures to model intermediate-level structures 
- later hidden layers (and the output layer) combine these intermediate structures to model high-level structures

> In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer.

__Number of neurons per layer__

> a common practice is to size them ... with fewer and fewer neurons at each layer

It is also common practice to use the same number of neurons in each layer.
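
To get a feel for how layer sizes translate into model size, the sketch below counts the weights and biases of a fully connected network. The helper name `count_parameters` is made up for this illustration; the layer sizes are those of the 784-input, `[300, 100]`-hidden, 10-class network trained above.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_params = count_parameters([784, 300, 100, 10])
print(n_params)  # 235500 + 30100 + 1010 = 266610
```

Most of the parameters sit between the input layer and the first hidden layer, so the size of the first hidden layer dominates the total.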

__Activation functions__

ReLU is a good and common choice for the hidden layers.

> For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.

