# Intro to Artificial Neural Networks with Keras

ANNs are at the core of **Deep Learning**.

### Why this wave of interest in ANNs is unlikely to die out like those of the 1960s and 1980s
* ANNs frequently outperform other ML techniques on very large and complex problems;
* The increase in computing power since the 1990s, along with cloud platforms, has made training large neural networks accessible;
* The training algorithms have been improved since the 1990s;
* ANNs seem to have entered a virtuous circle of funding and progress: as new products based on ANNs are launched, they attract more attention and funding.

## Logical Computations with Neurons

A simple model of an artificial neuron has one or more binary inputs and one binary output. The artificial neuron activates its output when more than a certain number of its inputs are active.

*Assumption: a neuron is activated when at least two inputs are active*

### Identity function
$C = A$

$A \Rightarrow C$

*If* A is activated, *then* C is activated as well (since it receives two input signals from A).

### AND
$C = A \land B$

$A \rightarrow C \leftarrow B$

Neuron C is activated *if and only if* both A *and* B are activated.

### OR
$C = A \lor B$

$A \Rightarrow C \Leftarrow B$

Neuron C is activated *if at least one* of neurons A *or* B is activated.

### When an input connection can inhibit the neuron's activity
$C = A \land \neg B$

$A \Rightarrow C \leftarrow \neg B$

Neuron C is activated *only if* A is activated *and* B is deactivated.
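
The four circuits above can be checked with a few lines of Python. This is a minimal sketch under the stated assumption that a neuron fires when at least two of its incoming signals are active. The `neuron` helper and the choice to model an active inhibitory connection as a signal of -1 are illustrative assumptions, not part of the original notes.

```python
def neuron(*signals, threshold=2):
    """Return 1 when the incoming signals add up to at least `threshold`.

    Each signal is +1 for an active excitatory connection, -1 for an active
    inhibitory connection (an assumption of this sketch), and 0 otherwise.
    """
    return int(sum(signals) >= threshold)


def identity(a):
    # A is connected to C twice, so an active A delivers two signals.
    return neuron(a, a)


def logical_and(a, b):
    # One connection from A and one from B: both must be active.
    return neuron(a, b)


def logical_or(a, b):
    # Two connections from A and two from B: either neuron is enough.
    return neuron(a, a, b, b)


def a_and_not_b(a, b):
    # Two excitatory connections from A and one inhibitory connection from B.
    return neuron(a, a, -b)


# Print the truth table of the three two-input circuits.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```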

## The Perceptron
The Perceptron is one of the simplest ANN architectures. It is based on a slightly different artificial neuron called a *threshold logic unit* (TLU) or *linear threshold unit* (LTU). The inputs and the output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs
$$z = w_1x_1+w_2x_2+\cdots+w_nx_n = \mathbf{X}^{\top}\mathbf{W}$$
then applies a step function to that sum and outputs the result
$$h_{\mathbf{W}}(\mathbf{X}) = step(z)$$

The most common step functions used in Perceptrons are the Heaviside step function and the sign function:

$$
\operatorname{heaviside}(z) =
\begin{cases}
0 & \quad \text{if } z < t\\
1 & \quad \text{if } z \geq t
\end{cases}
$$


$$
\operatorname{sgn}(z)=
\begin{cases}
-1 & \quad \text{if } z < t\\
0 & \quad \text{if } z = t\\
+1 & \quad \text{if } z > t
\end{cases}
$$

where $t$ is the threshold.

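A single TLU can be used for simple linear binary classification, much like a Logistic Regression or linear SVM classifier: it computes a linear combination of the inputs and outputs the positive class if the result reaches the threshold. Training a TLU in this case means finding the right values for $\mathbf{W}$.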
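
As a rough illustration, a single TLU can be written in NumPy as a weighted sum followed by a Heaviside step. The function names, weights, threshold, and sample values below are made up for the example (they are hand-picked, not trained).

```python
import numpy as np


def heaviside(z, t=0.0):
    """Heaviside step with threshold t: 1 where z >= t, else 0."""
    return (z >= t).astype(int)


def tlu(X, w, t=0.0):
    """Weighted sum of each row of X, followed by the step function."""
    z = X @ w
    return heaviside(z, t)


# Three hypothetical instances with two features each.
X = np.array([[1.4, 0.2],
              [4.7, 1.4],
              [5.1, 1.9]])
w = np.array([-1.0, -1.0])  # hand-picked weights (assumption)
t = -3.0                    # hand-picked threshold (assumption)

tlu_outputs = tlu(X, w, t)
print(tlu_outputs)  # only the first instance has z >= t
```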

### Composition

A **Perceptron** is composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a *fully connected layer* or *dense layer*.

The inputs of the Perceptron are fed to special passthrough neurons called input neurons: they output whatever input they are fed. In addition, an extra bias feature is generally added ($x_0=1$). It is represented using a special neuron called a *bias neuron*, which outputs 1 all the time.

$$h_{\mathbf{W}, \mathbf{b}}(\mathbf{X})=\phi(\mathbf{XW}+\mathbf{b})$$
Where:  
$\mathbf{X}$: matrix ($m\times n$) of input features, with one row per instance and one column per feature.  
$\mathbf{W}$: matrix ($n\times j$) of connection weights, with one row per input neuron and one column per artificial neuron in the layer.  
$\mathbf{b}$: bias vector (length $j$) containing all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.

The function $\phi$ is called the activation function; when the artificial neurons are TLUs, it is a step function.
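
The layer equation maps directly onto a matrix product. The sketch below assumes a step activation with threshold 0 and uses hand-picked values for $\mathbf{X}$, $\mathbf{W}$, and $\mathbf{b}$ ($m=3$ instances, $n=2$ features, $j=3$ neurons); the name `dense_forward` is hypothetical.

```python
import numpy as np


def step(z):
    """Step activation with threshold 0."""
    return (z >= 0).astype(int)


def dense_forward(X, W, b, activation):
    """Compute phi(XW + b) for a whole batch of instances at once."""
    return activation(X @ W + b)


X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [-1.0, 0.0]])        # shape (m, n) = (3, 2)
W = np.array([[1.0, -1.0, 0.5],
              [0.0, 1.0, -0.5]])   # shape (n, j) = (2, 3)
b = np.array([-0.25, 0.0, 0.25])   # shape (j,) = (3,)

outputs = dense_forward(X, W, b, step)
print(outputs.shape)  # one row per instance, one column per neuron
print(outputs)
```

Note that `b` is broadcast across the rows of `X @ W`, so every instance receives the same bias terms.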

### How is a Perceptron trained?
Hebb's rule: the connection weight between two neurons tends to increase when they fire simultaneously.

A variant of the rule takes into account the error made by the network when making a prediction. **The Perceptron learning rule reinforces connections that help reduce the error**.

$$w_{i, j}^{\text{(next step)}}=w_{i, j}+\eta(y_j-\hat{y}_j)x_i$$

Where:  
$w_{i, j}$ is the connection weight between the $i^{th}$ input neuron and the $j^{th}$ output neuron.  
$x_i$ is the $i^{th}$ input value of the current training instance.  
$\hat{y}_j$ is the output of the $j^{th}$ output neuron for the current training instance.  
$y_j$ is the target output of the $j^{th}$ output neuron for the current training instance.  
$\eta$ is the learning rate.  

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns. However, if the training instances are linearly separable, the algorithm converges to a solution (*Perceptron convergence theorem*).
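
The learning rule can be sketched in NumPy for a single output neuron. The assumptions here are a Heaviside step with threshold 0, a bias term updated like a weight on a constant input of 1, and a toy AND dataset (which is linearly separable); `train_perceptron` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Apply the Perceptron learning rule one training instance at a time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(x_i @ w + b >= 0)  # current prediction
            error = y_i - y_hat            # (y - y_hat) is -1, 0, or +1
            w += eta * error * x_i         # w <- w + eta * (y - y_hat) * x
            b += eta * error               # same rule with input x_0 = 1
    return w, b


# Toy linearly separable problem: logical AND.
X_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_and, y_and)
predictions = (X_and @ w + b >= 0).astype(int)
print(w, b, predictions)
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the parameters stop moving.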


In [1]:
# Imports
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

import numpy as np
import os

In [2]:
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

In [3]:
iris = load_iris()
X = iris.data[:, (2, 3)] # petal length and petal width
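y = (iris.target == 0).astype(int) # binary target: 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)

# Predict the class of a flower with petal length 2 cm and petal width 0.5 cm
y_pred = per_clf.predict([[2, 0.5]])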

`Perceptron` in scikit-learn is equivalent to using an `SGDClassifier` with the following hyperparameters:  
`loss='perceptron'`  
`learning_rate='constant'`  
`eta0=1`  
`penalty=None`  

*Contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they make predictions based on a hard threshold. This is one reason to **prefer** Logistic Regression over Perceptrons.*

**Perceptrons are incapable of solving some trivial problems, such as the *Exclusive OR (XOR)* classification problem.** However, some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons; the resulting ANN is called a *Multilayer Perceptron (MLP)*.
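
To make the XOR point concrete, here is a small sketch of two stacked layers of TLUs that compute XOR. The weights are set by hand (not learned) and are only one of many possible choices: the hidden layer computes AND and OR, and the output neuron fires when OR is active but AND is not.

```python
import numpy as np


def step(z):
    """Step activation with threshold 0."""
    return (z >= 0).astype(int)


X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: neuron 0 computes AND, neuron 1 computes OR (hand-picked).
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])

# Output neuron: OR minus AND, i.e. "exactly one input is active".
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])

hidden = step(X_xor @ W_hidden + b_hidden)
xor_pred = step(hidden @ W_out + b_out).ravel()
print(xor_pred)
```

No single TLU can produce this output, because the two positive instances and the two negative instances cannot be separated by one straight line.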

## The Multilayer Perceptron and Backpropagation
An MLP is composed of one input layer, one or more layers of TLUs (the hidden layers), and one final layer of TLUs called the output layer.

The layers close to the input are called *lower layers*, and those close to the output are called *upper layers*. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

**Note**: The signal flows in only one direction (from the inputs to the outputs), so this architecture is an example of a *feedforward neural network (FNN)*.

**The backpropagation** training algorithm is, in short, Gradient Descent combined with an efficient technique for computing the gradients automatically. In just two passes through the network (one forward, one backward), backpropagation computes the gradient of the network's error with regard to every single model parameter. In other words, it finds out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it performs a regular Gradient Descent step, and the whole process is repeated until the network converges to a solution.

**Note**: Automatically computing gradients is called *automatic differentiation*, or *autodiff*. There are various autodiff techniques; the one used by backpropagation is called *reverse-mode autodiff*.
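
The idea behind reverse-mode autodiff can be illustrated with a toy scalar implementation. The `Var` class below is a hypothetical teaching sketch that supports only addition and multiplication; real frameworks such as TensorFlow are far more general. It records the local gradient of each operation during the forward pass, then walks the graph backward once to accumulate all partial derivatives.

```python
class Var:
    """A scalar node in a computation graph (toy sketch)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Build a topological order so each node is processed only after
        # every node that depends on it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad  # chain rule


x, y = Var(3.0), Var(4.0)
f = x * x * y + y + Var(2.0)  # f(x, y) = x^2 * y + y + 2
f.backward()
print(f.value, x.grad, y.grad)  # df/dx = 2xy, df/dy = x^2 + 1
```

A single backward traversal yields the derivative with respect to every input, which is why this mode suits networks with many parameters and a single scalar loss.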

### The algorithm
* The algorithm handles one mini-batch at a time, and it goes through the full training set multiple times. Each pass is called an **epoch**.
* The weights must be initialized randomly; if all weights and biases start at the same value, all neurons in a given layer stay identical and training cannot break the symmetry.
* The algorithm computes the output of all neurons in each layer, from the first hidden layer to the output layer (**forward pass**), and all intermediate results are preserved because they are needed for the backward pass.
* The algorithm measures the network's output error (using a loss function).
* It then computes how much each output connection contributed to the error (using the chain rule), then how much of these error contributions came from each connection in the layer below, and so on until it reaches the input layer. This measures the error gradient across all the connection weights in the network by propagating the error gradient backward (**backward pass**).
* Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
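
The steps above can be sketched for a tiny MLP in NumPy. This is an illustrative example under several assumptions that are not in the original notes: one sigmoid hidden layer, a linear output layer, an MSE loss, random data standing in for a mini-batch, and a learning rate of 0.01. One analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, X):
    """Forward pass; the hidden activations are kept for the backward pass."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)
    Y_hat = A1 @ W2 + b2
    return A1, Y_hat


def mse(Y_hat, Y):
    return np.mean((Y_hat - Y) ** 2)


def backward(params, X, Y, A1, Y_hat):
    """Backward pass: apply the chain rule from the output to the input."""
    W1, b1, W2, b2 = params
    dY = 2.0 * (Y_hat - Y) / Y.size   # d(loss) / d(Y_hat)
    dW2 = A1.T @ dY
    db2 = dY.sum(axis=0)
    dA1 = dY @ W2.T                   # error propagated to the hidden layer
    dZ1 = dA1 * A1 * (1.0 - A1)       # sigmoid'(z) = a * (1 - a)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]


rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))           # one mini-batch: 8 instances, 3 features
Y = rng.normal(size=(8, 1))
params = [rng.normal(size=(3, 4)), np.zeros(4),   # random weight init
          rng.normal(size=(4, 1)), np.zeros(1)]

A1, Y_hat = forward(params, X)
loss_before = mse(Y_hat, Y)
grads = backward(params, X, Y, A1, Y_hat)

# Finite-difference check of d(loss) / d(W1[0, 0]).
eps = 1e-6
W1_plus, W1_minus = params[0].copy(), params[0].copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric_grad = (mse(forward([W1_plus] + params[1:], X)[1], Y)
                - mse(forward([W1_minus] + params[1:], X)[1], Y)) / (2 * eps)

# One Gradient Descent step.
eta = 0.01
params = [p - eta * g for p, g in zip(params, grads)]
loss_after = mse(forward(params, X)[1], Y)
print(grads[0][0, 0], numeric_grad, loss_before, loss_after)
```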

**Gradient Descent needs a well-defined, nonzero derivative to make progress at every step. The step function has only flat segments (its derivative is zero wherever it is defined), so it was initially replaced with the sigmoid (logistic) function:**
$$\sigma(z)=\frac{1}{1+e^{-z}}$$
**Other choices:**
$$\tanh(z)=2\sigma(2z)-1$$
Unlike the sigmoid, its output ranges from $-1$ to $1$ (instead of $0$ to $1$). That range tends to make each layer's output more or less centered around $0$ at the beginning of training, which often helps speed up convergence.
$$\operatorname{ReLU}(z)=\max(0,z)$$
ReLU is not differentiable at $z=0$, and its derivative is $0$ for $z<0$, but in practice it works well and is fast to compute, so it has become the default.
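
A short NumPy sketch can confirm these properties numerically. The helper names and sample points are arbitrary choices for the example; derivatives are estimated with central finite differences rather than computed analytically.

```python
import numpy as np


def step(z):
    return (z >= 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(0.0, z)


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


z = np.array([-2.0, -0.5, 0.5, 2.0])  # sample points away from z = 0

# tanh can be expressed through the sigmoid.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

step_grad = numerical_derivative(step, z)        # flat everywhere: no signal
sigmoid_grad = numerical_derivative(sigmoid, z)  # nonzero everywhere
relu_grad = numerical_derivative(relu, z)        # 0 for z < 0, 1 for z > 0

print(np.allclose(tanh_from_sigmoid, np.tanh(z)))
print(step_grad, sigmoid_grad, relu_grad)
```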

**A large enough DNN with nonlinear activations can theoretically approximate any continuous function**

## Regression MLPs

When building an MLP for regression, you generally do not want to use any activation function for the output neurons, so they are free to output any range of values. To guarantee positive outputs, use the *ReLU* activation function or *softplus* ($\log(1+\exp(z))$) in the output layer.

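**TIP:** The Huber loss is quadratic when the error is smaller than a threshold $\delta$ (typically 1) but linear when the error is larger than $\delta$. The linear part makes it less sensitive to outliers than the mean squared error, and the quadratic part lets it converge faster and more precisely than the mean absolute error.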
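
For illustration, the per-instance Huber loss can be written in NumPy as follows. This sketch uses the common convention of $\frac{1}{2}e^2$ for the quadratic part and $\delta(|e| - \frac{1}{2}\delta)$ for the linear part, so the two pieces meet at $|e| = \delta$; the function name and sample values are chosen for the example.

```python
import numpy as np


def huber(y_true, y_pred, delta=1.0):
    """Per-instance Huber loss: quadratic for small errors, linear for large."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear)


y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 3.0])    # small, borderline, and large error

huber_losses = huber(y_true, y_pred)
squared_losses = 0.5 * (y_true - y_pred) ** 2
print(huber_losses)    # grows linearly for the large error
print(squared_losses)  # grows quadratically for the large error
```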

### Typical regression MLP architecture

|**Hyperparameter**|**Typical value**|
|-|-|
|Input neurons|One per feature|
|Hidden layers|Typically 1 to 5|
|Neurons per hidden layer|Typically 10 to 100|
|Output neurons|1 per prediction dimension|
|Hidden activation|ReLU or SELU|
|Output activation|None, or ReLU/softplus (if positive outputs) or logistic/tanh (if bounded outputs)|
|Loss function|MSE or MAE/Huber|

## Classification MLPs
* For binary classification: a single output neuron using the logistic activation function. The output is a number between 0 and 1, which can be interpreted as the estimated probability of the positive class.  
* For multilabel binary classification: one output neuron per label, each using the logistic activation function.  
* For multiclass classification: one output neuron per class, with a softmax activation function over the whole output layer.

Regarding the loss function, cross-entropy (log loss) is usually a good choice, since the objective is to predict probability distributions.
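
As a sketch of the multiclass case, the softmax activation and the cross-entropy loss can be computed in NumPy as shown below. The logits and labels are made-up values, and the labels are assumed to be integer class indices.

```python
import numpy as np


def softmax(logits):
    """Turn each row of scores into a probability distribution."""
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probas, labels):
    """Mean negative log-probability assigned to the true class."""
    m = len(labels)
    return -np.mean(np.log(probas[np.arange(m), labels]))


logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])  # 2 instances, 3 classes (made up)
labels = np.array([0, 1])             # true class index of each instance

probas = softmax(logits)
loss = cross_entropy(probas, labels)

# A model that always predicts a uniform distribution scores log(n_classes).
uniform_loss = cross_entropy(softmax(np.zeros((2, 3))), labels)
print(probas.sum(axis=1), loss, uniform_loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.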

### Typical classification MLP architecture

|**Hyperparameter**|**Binary**|**Multilabel Binary**|**Multiclass**|
|-|-|-|-|
|Input neurons and hidden layers|Same as regression|Same as regression|Same as regression|
|Output neurons|1|1 per label|1 per class|
|Output activation|Logistic|Logistic|Softmax|
|Loss function|Cross-entropy|Cross-entropy|Cross-entropy|

## Implementing MLPs with Keras

Docs: [Keras](https://keras.io/)



In [1]:
import tensorflow as tf
from tensorflow import keras



In [4]:
tf.__version__

'2.6.0'

In [5]:
keras.__version__

'2.6.0'

### Building an Image Classifier Using the Sequential API


In [6]:
fashion_mnist = keras.datasets.fashion_mnist

(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()



In [7]:
X_train_full.shape

(60000, 28, 28)

In [8]:
X_train_full.dtype

dtype('uint8')

In [9]:
# Create a validation set and scale the pixel intensities from 0-255 down to the 0-1 range

X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

# Class names
class_names = ['T-shirt/top', 'Trouser', 'pullover', 'dress',
              'coat', 'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']

In [10]:
class_names[y_train[0]]

'coat'

In [None]:
# Create the model
