<font size = 7> Deep Learning </font>

*Alejandro Madriñán Fernández*

![](https://www.etsit.upm.es/fileadmin/documentos/laescuela/la_escuela/galerias_fotograficas/Servicios/generales/logos/LOGO_ESCUELA/LOGO_ESCUELA.png)

# Introduction

*Deep Learning* is a subfield of *Machine Learning*.
When traditional machine learning techniques are not powerful enough, deep learning is an excellent alternative. The flexibility of its models makes them well suited to complicated tasks.

Models in the machine learning field are built to mimic a certain relationship between a **response** and a set of **predictors**. Assuming that a true relationship between the two, the function $f(·)$, exists, the model builds a function $\hat{f}(·)$ that approximates $f(·)$ using an internal method that depends on the model.

While traditional machine learning approaches involve making certain assumptions about the model and working with a set of hand-prepared features, the deep learning approach is to learn the form of $\hat{f}(·)$ without making specific assumptions or being restricted to a given shape. In the process, deep learning models are also able to perform the task known as *feature extraction*, that is, transforming the original predictors to fit the needs of the problem. Their downside is that the computations become more expensive as the complexity increases, which is to be expected when facing a complex challenge.

This report aims to present the structure of some of the basic deep learning architectures, which underlie more complicated state-of-the-art models.

## Neural Network

One of the big claims of deep learning is that its models can reach a level of cognition comparable to that of the human mind (for a given task) because, in a sense, they try to mimic the brain's behaviour. In fact, deep learning models are often referred to as *neural networks*. As the name suggests, the basic learning unit in a deep learning model is a neuron.

A deep learning neuron, much like a biological neuron, receives impulses (in this case not electrical ones) that can then activate it. Unlike a biological neuron, a deep learning neuron first transforms the incoming signal and only afterwards decides whether to pass it on or stay inactive.

<img src="https://www.researchgate.net/profile/Mohammad_Bataineh5/publication/335190001/figure/fig1/AS:792206994587652@1565888274951/Schemes-for-human-brain-neuron-and-an-artificial-neural-network-ANN.png" width="500"/>

Specifically, a neuron applies an activation function to a linear combination of its input signals $x_i$. It can be expressed as

$$ y = h \left( \sum_i (x_i w_i) + b \right) = h ( \mathbf{x}^T \mathbf{w} + b ) $$

where $h(·)$ is the activation function, whose input is the inner product of the vector of inputs, $\mathbf{x}$, and the vector of *weights* given to those inputs, $\mathbf{w}$. Additionally, a bias $b$ is added to the signal before the activation function is applied.
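The expression above can be sketched in a few lines of plain Python. This is only an illustration: the function name `neuron` and the numeric values are assumptions, and `math.tanh` is used as the activation $h(·)$ simply because it is available in the standard library.

```python
import math

def neuron(x, w, b, h):
    """Return h(x . w + b) for an input list x, a weight list w and a bias b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Linear combination of the inputs plus the bias
    z = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
    # The activation function decides what is passed on
    return h(z)

# Illustrative values only, not taken from any real model
y = neuron([1.0, 2.0], [0.5, -0.25], 0.1, math.tanh)
```

Here the weighted inputs cancel out (0.5 - 0.5), so the neuron outputs the activation of the bias alone.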

There are many activation functions with different behaviours. Nevertheless, most of them are variations of the following three:

- *Sigmoid* function. Traditional activation function, widely used for its attractive computational properties. Its output $y$ satisfies $y \in (0, 1)$.

$$ h_{sigmoid}(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} = \sigma(x) $$

- *Rectified Linear Unit* function. Commonly known as ReLU, its major advantage is that it deactivates the neuron (outputs zero) whenever the input is not positive.

$$ h_{relu}(x) = \bigg\{ \begin{matrix} x & x \gt 0 \\ 0 & x \leq 0 \end{matrix} = \max(0, x) $$

- *Hyperbolic Tangent* function. It is similar to the sigmoid function, but its output satisfies $y \in (-1, 1)$, so it can also take negative values, giving more freedom to the learning process.

$$ h_{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{\sinh(x)}{\cosh(x)} = \tanh(x) $$
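The three activation functions can be written directly from their formulas. The following is a minimal sketch using only the standard library. The sigmoid is split by the sign of its argument, an implementation choice assumed here so that `math.exp` never receives a large positive value.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, e^x / (1 + e^x) = 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative here
    return e / (1.0 + e)

def relu(x):
    """Rectified Linear Unit, max(0, x)."""
    return max(0.0, x)

def tanh(x):
    """Hyperbolic tangent, sinh(x) / cosh(x)."""
    return math.tanh(x)

# Evaluate the three functions at a few sample points
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), relu(x), tanh(x))
```

The similarity between the sigmoid and the hyperbolic tangent is exact up to scaling: $\tanh(x) = 2\sigma(2x) - 1$.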

<!-- matplotlib plot -->
<!-- <img src="" alt="" align="left" width="" style="padding: 0 15px; float: left;"/> -->

Neurons are connected to each other, forming a network, and the impulses change as they travel through it.


## Layered architecture

Neural networks are built as a stack of layers, each containing a defined number of neurons.
The first and last layers are called the input and output layers, and every layer in between is a hidden layer.

The name **deep** learning comes from the large number of hidden layers that these models sometimes have, since the depth of a model is measured by its number of layers.
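As a rough sketch of the layered structure, the code below passes an input through a list of layers, one after another. The helper names (`dense_layer`, `forward`), the fully connected layout and the weights of the small example network are all assumptions chosen for illustration.

```python
def dense_layer(x, weights, biases, h):
    """One layer: each row of `weights` holds the weights of one neuron."""
    return [h(sum(x_i * w_i for x_i, w_i in zip(x, w_row)) + b)
            for w_row, b in zip(weights, biases)]

def forward(x, layers):
    """Pass x through the layers in order; each layer is (weights, biases, h)."""
    for weights, biases, h in layers:
        x = dense_layer(x, weights, biases, h)
    return x

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

# Hypothetical network: 2 inputs -> hidden layer of 3 neurons -> 1 output
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0], relu),
    ([[1.0, 1.0, 1.0]], [0.5], identity),
]
output = forward([2.0, -1.0], layers)
```

Adding more tuples to `layers` makes the network deeper without changing `forward`.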

### Softmax layer

As in the rest of machine learning, deep learning problems can be divided into classification and regression, depending on the type of the response. It is also common practice to turn regression problems into classification problems by restricting a continuous response to $N$ possible values.
For classification settings, the output layer becomes a *softmax* layer. This layer's function is to express its outputs $y_i$ as probabilities. To do that, it computes

$$ y_i = \frac{e^{x_i}}{\sum_{j = 1}^n e^{x_j}} $$

where $n$ is the total number of inputs and each input $x_i$ is converted to its probability $y_i = p(x_i)$.
These outputs satisfy the following two properties:

1. The probabilities sum to one. $\sum_{i = 1}^n y_i = 1$

1. Each probability lies between zero and one. $y_i \in [0, 1] \ \forall i$
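A minimal softmax sketch follows. Subtracting the largest input before exponentiating is an assumption added here for numerical stability. It does not change the result, because the common factor cancels between numerator and denominator.

```python
import math

def softmax(x):
    """Softmax of a list of scores, shifted by max(x) to avoid overflow."""
    m = max(x)
    exps = [math.exp(x_i - m) for x_i in x]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores; larger scores receive larger probabilities
probs = softmax([1.0, 2.0, 3.0])
```

Both properties listed above can be checked on `probs`: its elements add up to one (up to rounding) and each lies between zero and one.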

## Optimization

Deep learning problems can be formulated as *optimization* problems. The optimization strategies are based on minimizing a loss function or maximizing a gain function. Depending on the problem we have:

- Classification: *maximum likelihood estimation* (MLE) or *cross-entropy*. The two are equivalent, since the cross-entropy is the negative logarithm of the likelihood $\mathcal{L}(\theta)$ that MLE maximizes

$$ CE = - \log \mathcal{L}(\theta) $$

- Regression: MSE (mean squared error) or MAE (mean absolute error), the average error based on the squared $\ell_2$ norm or the $\ell_1$ norm, respectively.
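The loss functions above can be sketched as follows. For classification, the sketch assumes a single sample with a one-hot label, in which case the cross-entropy reduces to the negative logarithm of the probability assigned to the true class. The function names and argument layouts are illustrative.

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy of one sample: -log of the probability of the true class."""
    return -math.log(probs[true_class])

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only
ce = cross_entropy([0.25, 0.75], true_class=1)
squared = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
absolute = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the error is squared, MSE penalizes the single large error in the example more heavily than MAE does.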

### Algorithms

There are many alternatives for learning the optimal set of weights for each neuron and, as a general rule, no algorithm outperforms all the others. Depending on the problem, there may not even be a single best choice, and similar results can be achieved with different algorithms.

As a general rule, these methods have in common that they move in the direction opposite to the gradient of the cost function ($-\frac{dy}{dx}$). Several aspects have to be taken into account when moving towards the minimum. In fact, most of the time it is not possible to ensure that the algorithm will reach the **global** minimum rather than a **local** one (unless the cost function is *convex*). Here, we will just mention

- the size of the step $\alpha$. It controls the distance travelled by the algorithm in each iteration. An overly small value makes the algorithm slow, while a large value may cause it to diverge. A usual compromise is to make its size adaptive, shrinking with each iteration $i$.

$$ \alpha_i = \frac{\alpha}{n \sqrt{i}} $$

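- *momentum*. Tries to imitate the effect of physical momentum: each update keeps part of the previous update, so the algorithm tends to keep moving in the direction it was already following.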
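The two ingredients above can be combined in a short gradient descent sketch. It assumes a one-dimensional cost function whose derivative is known. The name `gradient_descent`, the momentum coefficient `beta`, the example cost $(x - 3)^2$ and all default values are assumptions made for illustration. The decaying step uses $\alpha / \sqrt{i}$, that is, the formula above with $n = 1$, and the momentum update shown is one common formulation among several.

```python
import math

def gradient_descent(grad, x0, alpha=0.1, beta=0.9, steps=200, decay=False):
    """Minimize a 1-D function given its derivative `grad`.

    beta is the momentum coefficient (beta=0 gives plain gradient descent).
    With decay=True the step at iteration i is alpha / sqrt(i).
    """
    x, velocity = x0, 0.0
    for i in range(1, steps + 1):
        step = alpha / math.sqrt(i) if decay else alpha
        # Keep a fraction of the previous update and move against the gradient
        velocity = beta * velocity - step * grad(x)
        x += velocity
    return x

# Hypothetical convex cost f(x) = (x - 3)^2, whose derivative is 2 (x - 3)
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Since this cost is convex, every variant approaches the global minimum at $x = 3$. With a non-convex cost the same code may stop at a local minimum.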

### Backpropagation

Optimization algorithms, as a general rule, need to know the gradient of the loss (or gain) function. *Backpropagation* offers an efficient method to compute it. It is based on the chain rule for partial derivatives.

$$ 
\frac{\partial g}{\partial a} = 
\frac{\partial g}{\partial z} \frac{\partial z}{\partial y}\frac{\partial y}{\partial x} \cdots 
\frac{\partial b}{\partial a} 
$$

First, the algorithm sweeps forward through the network, from the inputs to the output. Then it computes the goal function $g(·)$ and makes its way back, using the chain rule to compute the gradients at each step.

The main idea is that the derivatives computed first (those closest to the output) can be reused along the different paths that lead back to the beginning.
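The forward and backward sweeps can be illustrated on the smallest possible network: a single sigmoid neuron. The squared-error goal function, the variable names and the numeric values are assumptions chosen for the example. The backward pass computes $\partial g / \partial z$ once and reuses it for both the weight and the bias, and a finite-difference estimate serves as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b, t):
    """Forward sweep: z = w x + b, y = sigmoid(z), g = (y - t)^2."""
    z = w * x + b
    y = sigmoid(z)
    g = (y - t) ** 2
    return z, y, g

def backward(x, y, t):
    """Backward sweep: chain rule from g back to w and b."""
    dg_dy = 2.0 * (y - t)
    dy_dz = y * (1.0 - y)   # derivative of the sigmoid
    dg_dz = dg_dy * dy_dz   # computed once, reused for both parameters
    dg_dw = dg_dz * x       # dz/dw = x
    dg_db = dg_dz * 1.0     # dz/db = 1
    return dg_dw, dg_db

# Illustrative input x, weight w, bias b and target t
x, w, b, t = 1.5, 0.8, -0.3, 1.0
z, y, g = forward(x, w, b, t)
dg_dw, dg_db = backward(x, y, t)

# Central finite difference as an independent estimate of dg/dw
eps = 1e-6
numeric_dw = (forward(x, w + eps, b, t)[2] - forward(x, w - eps, b, t)[2]) / (2 * eps)
```

In a deeper network the same reuse happens layer by layer, which is what keeps the cost of computing all the gradients comparable to that of a forward pass.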

## Size of the data

In deep learning the size of the data usually matters more than in other fields of machine learning. This is to be expected, since the number of parameters the model needs to fit grows quickly with the size of the network and, in most cases, the available data is not enough to reach the optimal behaviour of the model. With more data we could increase the performance.

A roundabout way of improving the performance without increasing the amount of data is *transfer learning*. It consists of taking a network already trained on a similar or more generic task and tuning just the upper layers (the ones close to the output) with our problem-specific data. This can be thought of as leaving the feature extraction stage as it is, because it has already been trained with plentiful data, and tweaking only the final prediction stage.

An alternative way of gathering more data is generating it from the original samples. This is called *data augmentation*. It involves preprocessing the original data to add noise or perform transformations that generate new samples. It has the additional advantage of increasing the robustness of the model.
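As a small illustration of data augmentation by adding noise, the sketch below generates several noisy copies of one sample. The function name `augment`, the Gaussian noise model, its standard deviation and the fixed seed are all assumptions. Which transformations are appropriate depends on the kind of data.

```python
import random

def augment(sample, n_copies=3, noise_std=0.05, seed=0):
    """Return n_copies noisy versions of `sample` (a list of floats)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [[v + rng.gauss(0.0, noise_std) for v in sample]
            for _ in range(n_copies)]

# Illustrative sample with three predictors
original = [0.2, 0.5, 0.9]
augmented = augment(original)
```

Each new sample keeps the response of the original one, so the training set grows without any new measurements being collected.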

<!-- # Feed Forward Neural Networks -->

<!-- ## Architecture -->

<!-- ## Hyperparameters -->

<!-- ## Backpropagation -->

<!-- # Convolutional Neural Networks -->

<!-- ## Architecture -->

<!-- ### Pooling layer -->

<!-- ### Flatten layer -->

<!-- ## Hyperparameters -->

<!-- # Recurrent Neural Networks -->

<!-- ## Architecture -->

<!-- ## Backpropagation -->

<!-- ### Exploding / Vanishing Gradients -->

<!-- ## LSTMN -->

<!-- ## NLP -->

# References