<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

-----

# Introduction to Deep Learning



**Deep learning** is one of the leading tools in data analysis, and **Keras** is one of the most commonly used frameworks for it.

Deep learning allows computational models that are composed of multiple processing **layers** to learn representations of data with multiple levels of abstraction.

These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. 

<img src="assets/images/dl_overview.png" >

Credits: Yam Peleg ([@Yampeleg](https://twitter.com/yampeleg))


<h2><a>Building Blocks: Artificial Neural Networks (ANN)</a></h2>


In machine learning and cognitive science, an artificial neural network (ANN) is a network inspired by biological neural networks. ANNs are used to estimate or approximate functions that can depend on a large number of inputs and that are generally unknown.

An ANN is based on a collection of connected units or nodes called artificial neurons which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. 

An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.



In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. 

The connections between artificial neurons are called 'edges'. 



Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. 

Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. 

Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. 

An early version of an ANN, built from a single node, was called the **Perceptron**.

<img src="assets/images/Perceptron.png" width="45%">

The Perceptron is an algorithm for supervised learning of binary classifiers - functions that can decide whether an input (represented by a vector of numbers) belongs to one class or another.

Much like in logistic regression, each input is multiplied by its weight, the products are summed up, and the sum is fed into the activation function.
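The weighted sum followed by an activation can be sketched in a few lines of plain Python. This is only an illustration: the bias term, the hand-picked weights, and the function names `net_input`, `step`, and `perceptron_predict` are assumptions made for this example and do not come from the text above.

```python
def net_input(weights, bias, x):
    """Weighted sum of the inputs plus a bias term (the bias is an assumption here)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def step(z):
    """Threshold activation: class 1 if the net input is non-negative, else class 0."""
    return 1 if z >= 0.0 else 0


def perceptron_predict(weights, bias, x):
    """Binary prediction of a single-node Perceptron."""
    return step(net_input(weights, bias, x))


# Hypothetical hand-picked weights that make the node behave like a logical AND.
weights = [1.0, 1.0]
bias = -1.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_predict(weights, bias, x))
```

Only the input `[1, 1]` pushes the net input above zero, so it is the only one assigned to class 1.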

<h2><a>Single Layer Neural Network</a></h2>




<img src="assets/images/single_layer.png" width="65%" />

_(Source: Python Machine Learning, S. Raschka)_

### Weights Update Rule

- We use a **gradient descent** optimization algorithm to learn the _weight coefficients_ of the model.
<br><br>
- In every **epoch** (pass over the training set), we update the weight vector $w$ using the following update rule:

$$
w = w + \Delta w, \text{where } \Delta w = - \eta \nabla J(w)
$$

<br><br>

In other words, we compute the gradient based on the whole training set and update the weights of the model by taking a step in the **opposite direction** of the gradient $\nabla J(w)$.

In order to find the **optimal weights of the model**, we minimize an objective function, the cost function $J(w)$ (e.g. the Sum of Squared Errors (SSE)).

Furthermore, we multiply the gradient by a factor, the learning rate $\eta$ , which we choose carefully to balance the **speed of learning** against the risk of overshooting the global minimum of the cost function.
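To make the role of the learning rate concrete, here is a small sketch on a toy one-dimensional cost. The cost $J(w) = (w - 3)^2$, the starting point, and the two values of $\eta$ are assumptions chosen purely for illustration; they are not part of the model discussed here.

```python
def grad_J(w):
    """Gradient of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def descend(w, eta, steps):
    """Apply the update rule w = w + delta_w with delta_w = -eta * grad J(w)."""
    for _ in range(steps):
        delta_w = -eta * grad_J(w)
        w = w + delta_w
    return w


w_small = descend(0.0, eta=0.1, steps=50)   # small steps: approaches the minimum
w_large = descend(0.0, eta=1.1, steps=50)   # steps too large: overshoots further each time

print("eta=0.1 ->", w_small)
print("eta=1.1 ->", w_large)
```

With the smaller learning rate the weight settles near 3, while the larger one jumps across the minimum by a growing amount on every step and moves away from it.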

### Gradient Descent

In **gradient descent optimization**, we update all the **weights simultaneously** after each epoch. Taking the SSE cost with the conventional factor of one half, $J(w) = \frac{1}{2} \sum_{i} ( y^{(i)} - a^{(i)} )^2$, the _partial derivative_ for each weight $w_j$ in the weight vector $w$ is:

$$
\frac{\partial}{\partial w_j} J(w) = - \sum_{i} ( y^{(i)} - a^{(i)} )  x^{(i)}_j
$$

**Note**: _The superscript $(i)$ refers to the i-th sample. The subscript $j$ refers to the j-th dimension/feature_


Here $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the **activation** of the neuron (which is a linear function of the net input in the single-layer model considered here; the classic _Perceptron_ applies a threshold to it instead).

We define the **activation function** $\phi(\cdot)$ as follows:

$$
\phi(z) = z = a = \sum_{j} w_j x_j = \mathbf{w}^T \mathbf{x}
$$
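Putting the update rule, the partial derivative, and the linear activation together, batch gradient descent for a single linear neuron might look like the sketch below. It assumes the cost $J(w) = \frac{1}{2} \sum_i (y^{(i)} - a^{(i)})^2$; the toy data, the learning rate, the number of epochs, and the constant `1.0` input that plays the role of a bias are all assumptions made for the example.

```python
def activation(w, x):
    """Linear activation: phi(z) = z = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def sse(w, X, y):
    """Cost J(w) = 1/2 * sum_i (y_i - a_i)^2."""
    return 0.5 * sum((yi - activation(w, xi)) ** 2 for xi, yi in zip(X, y))


def gradient(w, X, y):
    """Partial derivatives dJ/dw_j = -sum_i (y_i - a_i) * x_ij, one per weight."""
    errors = [yi - activation(w, xi) for xi, yi in zip(X, y)]
    return [-sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(len(w))]


# Toy data generated from y = 1 + 2 * x; the leading 1.0 acts as a bias input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w = [0.0, 0.0]
eta = 0.05
for epoch in range(500):
    g = gradient(w, X, y)                         # gradient over the whole training set
    w = [wj - eta * gj for wj, gj in zip(w, g)]   # update all weights simultaneously

print("weights:", w)
print("cost:", sse(w, X, y))
```

Because the data are exactly linear, the weights should approach `[1.0, 2.0]` and the cost should approach zero.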


<h2><a>Introducing the multi-layer neural network architecture</a></h2>


Now we will see how to connect **multiple single neurons** to a **multi-layer feedforward neural network**; this special type of network is also called a **multi-layer perceptron** (MLP). 



<img src="assets/images/multi-layers-1.png" width="50%" />

The figure shows the concept of an **MLP** consisting of three layers: one _input_ layer, one _hidden_ layer, and one _output_ layer. 

The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer.

If such a network has **more than one hidden layer**, we also call it a **deep artificial neural network**.


MLPs are examples of a fully connected NN.

In a fully connected layer, each neuron is connected to every neuron in the previous layer, and each connection has its own weight.

This is a totally general purpose connection pattern and makes no assumptions about the features in the data. 

It's also very expensive in terms of memory (weights) and computation (connections).
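A quick way to see the memory cost is to count the weights. The helper below is a sketch: the function name `count_parameters`, the optional bias per unit, and the 784-100-10 layer sizes are assumptions chosen for illustration.

```python
def count_parameters(layer_sizes, use_bias=True):
    """Count the weights (and optionally biases) of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out      # one weight per connection
        if use_bias:
            total += n_out         # one bias per unit in the receiving layer
    return total


# Hypothetical MLP: 784 inputs, one hidden layer of 100 units, 10 outputs.
print(count_parameters([784, 100, 10]))
print(count_parameters([784, 100, 10], use_bias=False))
```

Even this small network carries tens of thousands of weights, almost all of them in the connections between the input and the hidden layer.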

<img src="assets/images/multi-layers-2.png" width="50%" />



## Forward Propagation

<img src="assets/images/neural_net_flowchart.png" width="40%">

* Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.



* Based on the network's output, we calculate the error that we want to minimize using a cost function that we will describe later.



* We backpropagate the error, find its derivative with respect to each weight in the network, and update the model.
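The first of these steps, the forward pass, can be sketched as repeated application of a fully connected layer. The example assumes sigmoid activations in every layer, a bias per unit, and arbitrary hypothetical weights for a 2-3-1 network; none of these specifics come from the text.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation of every unit."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(W, b, inputs):
    """One fully connected layer: each row of W holds the weights of one unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(W, b)
    ]


def forward(layers, x):
    """Propagate an input vector through the layers, from input to output."""
    activations = x
    for W, b in layers:
        activations = dense_forward(W, b, activations)
    return activations


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output unit.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.2, 0.4]], [0.05])

y_hat = forward([hidden, output], [1.0, 2.0])
print(y_hat)
```

The output of each layer becomes the input of the next, which is all that "forward propagation" means in this setting.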

### Sigmoid Activation

<img src="assets/images/logistic_function.png" width="50%" />

Neural networks are somewhat related to logistic regression. Basically, we can think of logistic regression as a one layer neural network.



<img src="assets/images/schematic_log_reg.png" width="50%" />

In fact, it is very common to use logistic sigmoid functions as activation functions in the hidden layer of a neural network – like the schematic above but without the threshold function.

It’s fine to use the threshold function in the output layer if we have a binary classification task (in this case, you’d only have one sigmoid unit in the output layer). 
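As a sketch of this view, a single sigmoid unit followed by a threshold behaves like a logistic regression classifier. The weights, the bias, and the cut-off of 0.5 below are assumptions picked for the example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes the net input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Output of a single sigmoid unit for the input vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def predict_label(w, b, x, threshold=0.5):
    """Threshold function applied on top of the sigmoid output."""
    return 1 if predict_proba(w, b, x) >= threshold else 0


# Hypothetical weights for a two-feature binary problem.
w, b = [2.0, -1.0], 0.0
print(predict_proba(w, b, [1.0, 0.5]), predict_label(w, b, [1.0, 0.5]))
print(predict_proba(w, b, [0.0, 2.0]), predict_label(w, b, [0.0, 2.0]))
```

Dropping `predict_label` and keeping only `predict_proba` gives the kind of unit that is used in a hidden layer.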

In the case of multi-class classification, we can use a generalization of the One-vs-All approach; i.e., we encode the target class labels via one-hot encoding.

For example, we would encode the three class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as the vectors $[1, 0, 0]$, $[0, 1, 0]$, and $[0, 0, 1]$.

Then, for the prediction step after learning the model, we just return the “argmax,” the index in the output vector with the highest value as the class label. 
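Both steps, encoding the targets and reading off the prediction, fit in a few lines. The helper names `one_hot` and `argmax` and the example output vector are assumptions made for this sketch.

```python
def one_hot(label, num_classes):
    """Vector of zeros with a single 1 at the position of the class label."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


iris_classes = ["Setosa", "Versicolor", "Virginica"]
targets = [one_hot(label, len(iris_classes)) for label in (0, 1, 2)]
print(targets)

# Hypothetical activations of the three output units for one sample.
output = [0.1, 0.7, 0.4]
predicted = argmax(output)
print(predicted, iris_classes[predicted])
```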

That’s fine if we are only interested in the class label prediction. 

Now, if we want “meaningful” class probabilities, that is, class probabilities that sum up to 1, we could use the softmax function (aka “multinomial logistic regression”). 

In softmax, the probability that a particular sample with net input $z$ belongs to the $i$-th class is computed with a normalization term in the denominator: the sum of the exponentials of all $M$ linear functions (one net input per class).
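Written out, the softmax probability for class $i$ is $P(y = i \mid z) = e^{z_i} / \sum_{m=1}^{M} e^{z_m}$. A minimal sketch in Python follows; subtracting the largest net input before exponentiating is a common numerical safeguard that is assumed here and does not change the result, and the example net inputs are made up.

```python
import math


def softmax(z):
    """Turn a list of M net inputs into M probabilities that sum to 1."""
    shift = max(z)                           # guards against overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)                        # normalization term
    return [e / total for e in exps]


# Hypothetical net inputs for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest net input still receives the largest probability, so taking the argmax of the softmax output gives the same class label as taking the argmax of the raw net inputs.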

## Backward Propagation

The weights of each neuron are learned by **gradient descent**, where the error is differentiated with respect to each of the neuron's weights.

The gradients are computed one layer at a time, moving backward from the output layer and reusing the error terms of the layer processed just before, in a technique known as **backpropagation**.

<img src="assets/images/backprop.png" width="50%">
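The sketch below works through backpropagation for a deliberately tiny network and compares the result with a finite-difference estimate. Everything specific in it is an assumption for illustration: the 2-2-1 architecture, sigmoid activations, the absence of biases, the squared-error loss for a single sample, and the weight values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, W2, x):
    """Forward pass: returns the hidden activations and the output activation."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, out


def loss(W1, W2, x, y):
    """Squared error for a single sample: 1/2 * (y - out)^2."""
    _, out = forward(W1, W2, x)
    return 0.5 * (y - out) ** 2


def backprop(W1, W2, x, y):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    h, out = forward(W1, W2, x)
    # Output layer error: dL/dz_out = (out - y) * sigmoid'(z_out)
    delta_out = (out - y) * out * (1.0 - out)
    grad_W2 = [delta_out * hi for hi in h]
    # Hidden layer error: the output error is sent back through W2
    delta_h = [delta_out * W2[k] * h[k] * (1.0 - h[k]) for k in range(len(h))]
    grad_W1 = [[dk * xi for xi in x] for dk in delta_h]
    return grad_W1, grad_W2


def numeric_grad_W1(W1, W2, x, y, k, j, eps=1e-6):
    """Central finite-difference estimate of dL/dW1[k][j], used as a check."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[k][j] += eps
    minus[k][j] -= eps
    return (loss(plus, W2, x, y) - loss(minus, W2, x, y)) / (2.0 * eps)


# Hypothetical weights and a single training sample.
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [0.5, -0.6]
x, y = [1.0, 2.0], 1.0

grad_W1, grad_W2 = backprop(W1, W2, x, y)
print("backprop:", grad_W1[0][1])
print("numeric: ", numeric_grad_W1(W1, W2, x, y, 0, 1))
```

The two printed values should agree closely, which is a common sanity check when implementing backpropagation by hand.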


<img src="assets/images/neural_net_flowchart.png" width="40%">

One of the nice properties of logistic regression is that the logistic cost function (or max-entropy) is convex, and thus we are guaranteed to find the global cost minimum. 

But, once we stack logistic activation functions in a multi-layer neural network, we’ll lose this convexity. 



Looking only at a single weight / model coefficient, we can picture the cost function in a multi-layer perceptron as a rugged landscape with multiple local minima that can trap the optimization algorithm:

<img src="assets/images/unconvex.png" width="50%">

However, in practice, backpropagation works quite well for neural networks with 1 or 2 layers (and there are deep learning algorithms, such as autoencoders, to help with deeper architectures).

Even if you are likely to converge to a local minimum, you often still end up with a powerful predictive model.
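The effect of the starting point on a non-convex cost can be reproduced with a one-dimensional toy example. The cost $J(w) = w^4 - 3w^2 + w$, the learning rate, and the two starting points are assumptions chosen so that the curve has one global and one local minimum; this is not the cost of an actual neural network.

```python
def J(w):
    """Toy non-convex cost with two minima."""
    return w ** 4 - 3.0 * w ** 2 + w


def grad_J(w):
    """Derivative of the toy cost."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, eta=0.01, steps=2000):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w = w - eta * grad_J(w)
    return w


w_left = descend(-2.0)    # starts on the left slope
w_right = descend(2.0)    # starts on the right slope

print("from -2.0:", w_left, "cost", J(w_left))
print("from +2.0:", w_right, "cost", J(w_right))
```

The two runs settle in different valleys: the run started on the right stops at a local minimum with a higher cost than the one found from the left, even though both satisfy the same stopping condition of a vanishing gradient.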


<h2><a>Recap</a></h2>

<img src="assets/images/neural_net_flowchart.png" width="50%">

Training a neural network revolves around the following objects:
- Layers, which are combined into a network (or model)
- The input data and corresponding targets
- The loss function, which defines the feedback signal used for learning
- The optimizer, which determines how learning proceeds




The network, composed of layers that are chained together, maps the input data to predictions. 

The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network’s predictions match what was expected.

The optimizer uses this loss value to update the network’s weights.