# Intro to Artificial Neural Networks with Keras

ANNs are at the core of **Deep Learning**.

### Why this wave of interest in ANNs is unlikely to die out like those of the 1960s and 1980s
* ANNs frequently outperform other ML techniques on very large and complex problems;
* The increase in computing power since the 1990s, together with cloud platforms, has made training large neural networks accessible;
* The training algorithms have been improved since the 1990s;
* ANNs seem to have entered a virtuous circle of funding and progress: as new products based on ANNs are launched, more attention is drawn towards them.

## Logical Computations with Neurons

A simple model of an artificial neuron has one or more binary inputs and one binary output. The artificial neuron activates its output when more than a certain number of its inputs are active.

*Assumption: a neuron is activated when at least two inputs are active*

### Identity function
$C = A$

$A \Rightarrow C$

*If* A is activated, *then* C is activated as well (since it receives two input signals from A).

### AND
$C = A \land B$

$A \rightarrow C \leftarrow B$

Neuron C is activated *if and only if* both A *and* B are activated.

### OR
$C = A \lor B$

$A \Rightarrow C \Leftarrow B$

Neuron C is activated *if at least one* of neurons A *or* B is activated.

### When an input connection can inhibit the neuron's activity
$C = A \land \neg B$

$A \Rightarrow C \leftarrow \neg B$

Neuron C is activated *only if* A is activated *and* B is deactivated.
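
The four networks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `neuron` and the idea of encoding each input as an `(is_active, connections)` pair are hypothetical, a double arrow is modelled as two connections, and an inhibitory connection is modelled as a negative count.

```python
def neuron(*signals, threshold=2):
    """Fire when the net number of active input signals reaches the threshold.

    Each signal is an (is_active, connections) pair. A negative `connections`
    value stands for an inhibitory connection (a modelling assumption).
    """
    total = sum(connections for is_active, connections in signals if is_active)
    return total >= threshold


def identity(a):
    return neuron((a, 2))            # A reaches C through two connections


def logical_and(a, b):
    return neuron((a, 1), (b, 1))    # one connection from each input


def logical_or(a, b):
    return neuron((a, 2), (b, 2))    # two connections from each input


def a_and_not_b(a, b):
    return neuron((a, 2), (b, -1))   # B's connection inhibits C


# Print the truth table for the two-input networks
for a in (False, True):
    for b in (False, True):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

With the threshold fixed at two active inputs, the only thing that changes from one logical function to the next is how the connections are wired.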

## The Perceptron
The Perceptron is one of the simplest ANN architectures. It is based on a slightly different artificial neuron called a *threshold logic unit* (TLU) or *linear threshold unit* (LTU). The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs
$$z = w_1x_1+w_2x_2+\cdots+w_nx_n = \mathbf{X}^{\top}\mathbf{W}$$
then applies a step function to that sum and outputs the result
$$h_{\mathbf{W}}(\mathbf{X}) = step(z)$$

The most common step functions used in Perceptrons are the Heaviside step function and the sign function:

$$
\operatorname{heaviside}(z) =
  \begin{cases}
    0 & \quad \text{if } z < t\\
    1 & \quad \text{if } z \geq t
  \end{cases}
$$


$$
\operatorname{sgn}(z) =
\begin{cases}
-1 & \quad \text{if } z < t\\
0 & \quad \text{if } z = t\\
+1 & \quad \text{if } z > t
\end{cases}
$$

where $t$ is the threshold.
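
As a minimal sketch, the two step functions can be written directly from their piecewise definitions. The function names and the default threshold `t=0.0` are assumptions made for this example.

```python
def heaviside(z, t=0.0):
    """Return 0 if z < t and 1 if z >= t."""
    return 1 if z >= t else 0


def sgn(z, t=0.0):
    """Return -1 if z < t, 0 if z == t, and +1 if z > t."""
    if z < t:
        return -1
    if z > t:
        return 1
    return 0


# Evaluate both functions on a few sample values
for z in (-1.5, 0.0, 2.0):
    print(z, heaviside(z), sgn(z))
```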

A single TLU can be used for simple linear binary classification, much like a Logistic Regression or linear SVM classifier. Training a TLU in this case means finding the right values for $\mathbf{W}$.
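
To make the idea concrete, here is a small sketch of a single TLU used as a linear classifier. The weights and the threshold are hypothetical values picked by hand (not learned) so that the unit behaves like a logical AND of two inputs.

```python
def tlu(x, w, t=0.0):
    """Weighted sum of the inputs followed by a Heaviside step at threshold t."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if z >= t else 0


# Hand-picked (assumed) parameters: the weighted sum reaches 1.5 only when both inputs are 1
w = [1.0, 1.0]
t = 1.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, tlu(x, w, t))
```

Training replaces the hand-picking step: the learning algorithm searches for weights that place the linear decision boundary correctly.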

### Composition

A **Perceptron** is composed of a single layer of TLUs, with each TLU connected to all the inputs. (When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a *fully connected layer* or *dense layer*.)

The inputs of the Perceptron are fed to special passthrough neurons called input neurons: they output whatever input they are fed. In addition, an extra bias feature is generally added ($x_0=1$). It is represented using a special neuron called a *bias neuron*, which outputs 1 all the time.

$$h_{\mathbf{W}, \mathbf{b}}(\mathbf{X})=\phi(\mathbf{XW}+\mathbf{b})$$
Where:  
$\mathbf{X}$: matrix ($m\times n$) of input features, with one row per instance and one column per feature.  
$\mathbf{W}$: matrix ($n\times j$) of connection weights, with one row per input neuron and one column per artificial neuron in the layer.  
$\mathbf{b}$: bias vector (of length $j$) containing all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.

The function $\phi$ is called the activation function; when the artificial neurons are TLUs, it is a step function.
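
The layer formula maps directly onto a matrix product in NumPy. The sketch below assumes a Heaviside step with threshold 0 as $\phi$, and the weights and biases are illustrative values chosen so that the first neuron acts like OR and the second like AND; none of these numbers come from the text.

```python
import numpy as np


def step(z):
    """Heaviside step with threshold 0, applied element-wise."""
    return (z >= 0).astype(int)


def dense_layer(X, W, b, phi):
    """Compute phi(XW + b) for a whole batch of instances at once."""
    return phi(X @ W + b)


# Illustrative shapes: m = 3 instances, n = 2 features, j = 2 neurons
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # one column of weights per neuron
b = np.array([-0.5, -1.5])      # one bias term per neuron

outputs = dense_layer(X, W, b, step)
print(outputs.shape)            # one row per instance, one column per neuron
print(outputs)
```

The bias vector is broadcast across the rows of `X @ W`, so every instance gets the same per-neuron bias added.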

### How is a Perceptron trained?
Hebb's rule: the connection weight between two neurons tends to increase when they fire simultaneously.

A variant of the rule takes into account the error made by the network when making a prediction. **The Perceptron learning rule reinforces connections that help reduce the error**.

$$w_{i, j}^{\text{next step}}=w_{i, j}+\eta(y_j-\hat{y}_j)x_i$$

Where:  
$w_{i, j}$ is the connection weight between the $i^{th}$ input neuron and the $j^{th}$ output neuron.  
$x_i$ is the $i^{th}$ input value of the current training instance.  
$\hat{y}_j$ is the output of the $j^{th}$ output neuron for the current training instance.  
$y_j$ is the target output of the $j^{th}$ output neuron for the current training instance.  
$\eta$ is the learning rate.  
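
One application of the learning rule to a single training instance could look like the sketch below. The function name, the Heaviside step at 0 used for predictions, and the sample values are assumptions for illustration. The bias is updated with the same rule, treating it as the weight of the constant input $x_0=1$.

```python
import numpy as np


def perceptron_update(W, b, x, y, eta=0.1):
    """Apply one step of the Perceptron learning rule for one training instance."""
    y_hat = (x @ W + b >= 0).astype(float)   # current output of each output neuron
    error = y - y_hat                        # (y_j - y_hat_j) for every neuron j
    W_next = W + eta * np.outer(x, error)    # w_ij + eta * (y_j - y_hat_j) * x_i
    b_next = b + eta * error                 # same rule with the bias input x_0 = 1
    return W_next, b_next


# Two inputs, one output neuron, all weights initially zero (assumed starting point)
W = np.zeros((2, 1))
b = np.zeros(1)
x = np.array([1.0, 0.0])
y = np.array([0.0])                          # target output for this instance

W_next, b_next = perceptron_update(W, b, x, y, eta=0.5)
print(W_next)
print(b_next)
```

Only the weight attached to the active input changes: the neuron wrongly fired, so that weight and the bias are pushed down, while the weight of the inactive input stays as it was.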

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns. However, if the training instances are linearly separable, the algorithm converges to a solution (*Perceptron convergence theorem*).
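
Both statements can be checked on toy data. The sketch below is a hedged illustration, not a reference implementation: `train_perceptron`, the epoch cap, and the learning rate are assumptions. It trains a single-output Perceptron on the AND function (linearly separable) and on the XOR function (not linearly separable), and reports whether a full pass without errors was ever reached.

```python
import numpy as np


def train_perceptron(X, y, eta=0.5, max_epochs=100):
    """Train a single-output Perceptron.

    Returns (weights, bias, epoch), where epoch is the first pass over the data
    with no misclassified instance, or None if that never happens.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if x_i @ w + b >= 0 else 0
            update = eta * (y_i - y_hat)     # zero when the prediction is correct
            w += update * x_i
            b += update
            errors += int(update != 0)
        if errors == 0:
            return w, b, epoch
    return w, b, None


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

w_and, b_and, epochs_and = train_perceptron(X, y_and)
_, _, epochs_xor = train_perceptron(X, y_xor)

print("AND: first error-free epoch =", epochs_and)
print("XOR: first error-free epoch =", epochs_xor)
```

For XOR the loop always runs out of epochs, because no single linear boundary classifies all four points correctly.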
