In [1]:
%matplotlib inline

# Introduction to deep learning - an overview of a published paper
---

## Introduction

***Author:*** Atanas Kuzmanov

***Date:*** 2022-February-20

*This paper is a retelling and an overview of the original paper "Introduction to deep learning" by Lihi Shiloh-Perl and Raja Giryes.*

Published: `[v1] Sat, 29 Feb 2020 14:52:28 UTC (6,344 KB)`

Reference:

Shiloh-Perl, L. and Giryes, R., 2020. Introduction to deep learning. arXiv preprint arXiv:2003.03253.

School of Electrical Engineering, Tel Aviv University, e-mail: {lihishiloh@mail ,raja@tauex}.tau.ac.il

[[Reference]](#Introduction-to-deep-learning)

[[Reference]](#Introduction-to-deep-learning-PDF)

[[Reference]](#Introduction-to-deep-learning-ARXIV)

---

*This is an article developed as a scientific notebook for an exam project assignment for a Deep Learning course from an Artificial Intelligence module.*

*One of the aims of this article is to understand some Deep Learning (DL) basics, more specifically to understand Neural Networks (NNs) and how to improve them, so we can create models, train them, test them and extract predictions and information we might be interested in.*

---

## Notes

### References

_Any and all references, citations, resources or other materials used to understand and explain, provide examples, and build this article have been referenced in order to give credit where credit is due and avoid plagiarism._
_If a citation is the bigger part of a section, and has been edited, added to, modified, etc. the reference to that section would be at the end of it, separated with a horizontal line, like this example:_

> ---
> [[Example Reference]](#ExampleReference)

_If a citation has been inserted and is relatively short, the relevant reference will be at the end of the sentence or paragraph, for example:_

> Example. [[Example Reference]](#ExampleReference)

_In case a reference is missed due to human error, all references can be found in the [References](#References) section._

---

## Abstract

Deep Learning (DL) has made a major impact on data science in the last
decade. This chapter introduces the basic concepts of this field. It includes both the
basic structures used to design deep neural networks and a brief survey of some of
its popular use cases.

## 1 General overview

Neural Networks (NNs) have revolutionized modern day-to-day life. Their significant
impact is present even in our most basic actions, such as ordering products
online via Amazon’s Alexa or passing the time with online video games against
computer agents. The NN effect is evident on many more occasions. For example,
in medical imaging NNs are used for lesion detection and segmentation [40, 5],
and tasks such as text-to-speech [38, 120] and text-to-image [101] have seen remarkable
improvements thanks to this technology. In addition, the advancements they have
caused in fields such as natural language processing (NLP) [24, 144, 77], optics
[114, 42], image processing [110, 143] and computer vision (CV) [10, 34] are astonishing,
creating a leap forward in technologies such as autonomous driving [13, 79],
face recognition [109, 134, 23], anomaly detection [64], text understanding [54] and
art [35, 53], to name a few. Their influence is powerful and continues to grow.

The NN journey began in the mid 1960’s with the publication of the Perceptron
[105]. Its development was motivated by the formulation of the human neuron
activity [80] and research regarding the human visual perception [49]. However,
quite quickly, the field experienced a deceleration that lasted for almost
three decades. This was mainly the result of a lack of theory regarding the
possibility of training the (single-layer) perceptron and of a series of theoretical results that emphasized its limitations, the most notable being its inability to
learn the XOR function [82].

This _NN ice age_ came to a halt in the mid 1980’s, mainly with the introduction
of the multi-layer perceptron (MLP) and the backpropagation algorithm [107]. Furthermore,
the revolutionary convolutional layer was presented [68], where one of its
notable achievements was successfully recognizing hand-written digits [67].

While some other significant developments happened in the following
decade, such as the development of the long short-term memory (LSTM) [46],
the field experienced another deceleration. Questions were arising with no adequate
answers, especially with respect to the non-convex nature of the optimization objectives
used, overfitting the training data, and the challenge of vanishing gradients. These
difficulties led to a second _NN winter_, which lasted two decades. In the meantime,
classical machine learning techniques were developed and attracted much academic
and industry attention. One of the prominent algorithms was the newly proposed
Support Vector Machine (SVM) [17], which defined a convex optimization problem
with a clear mathematical interpretation [16]. These properties increased its
popularity and usage in various applications.

The 21st century began with some advancements in neural networks in the areas
of speech processing and Natural Language Processing (NLP). Hinton _et al._ [45]
proposed a method for layer-wise initial training of neural networks, which alleviated
some of the challenges in training networks with several layers. However, the great
NN _tsunami_ truly hit the field with the publication of _AlexNet_ in 2012 [62]. In this
paper, Krizhevsky _et al._ presented a neural network that achieved state-of-the-art
performance on the ImageNet [22] challenge, where the goal is to classify images
into 1000 categories using 1.2 million images for training and 150,000 images for
testing. The improvement over the runner-up, which relied on hand-crafted features
and one of the best classification techniques of that time, was notable: more than
10%. This made the whole research community realize that neural networks
are far more powerful than previously thought and that they bear great potential for
many applications. It triggered a myriad of research works that applied NNs in
various fields, showing their great advantage.

Nowadays, it is safe to say that almost every research field has been affected
by this NN _tsunami_ wave, experiencing significant improvements in abilities and
performance. Many of the tools used today are very similar to the ones used in the
previous phase of NN. Indeed, some new regularization techniques such as batch-normalization
[50] and dropout [121] have been proposed. Yet, the key enablers of
the current success are the large amounts of data available today, which are essential for
large NN training, and the developments in GPU computation that shorten the
training time significantly (sometimes even leading to a $\times 100$ speed-up compared to
training on a conventional CPU). The advantages of NNs are remarkable, especially
at large scales. Thus, having large amounts of data and the appropriate hardware to
process them is vital for their success.

A major example of a tool that did not exist before is the Generative Adversarial
Network (GAN [39]). In 2014, Goodfellow _et al._ published this novel framework for
learning data distribution. The framework is composed of two models, a generator and a discriminator, trained as adversaries. The generator is trained to capture the
data distribution, while the discriminator is trained to differentiate between generated
(“fake”) data and real data. The goal is to let the generator synthesize data, which the
discriminator fails to discriminate from the real one. The GAN architecture has been used
in more and more applications since its introduction in 2014. One such application is
the rendering of real scene images, where GANs have proved very successful [36, 151].
For example, Brock _et al._ introduced the BigGAN [7] architecture that exhibited impressive
results in creating high-resolution images, shown in Fig. 1. While most GAN
techniques learn from a set of images, it has recently been demonstrated
that one may even train a GAN using just one image [112]. Other GAN applications
include inpainting [73, 145], retargeting [115], 3D modeling [1], semi-supervised
learning [31], domain adaptation [47] and more.

![BigGAN_example1.png](2003.03253v1-resources/20031.03253v1-pics/BigGAN_example1.png)

Fig. 1: Class-conditional samples generated by a GAN, [7].

While neural networks are very successful, the theoretical understanding behind
them is still lacking. In this respect, there are research efforts that try to provide a
mathematical formulation that explains various aspects of NNs. For example, these efforts
study NN properties such as their optimization [124], generalization [52] and expressive
power [108, 88].

The rest of the chapter is organized as follows. In Section 2 the basic structure
of a NN is described, followed by details regarding popular loss functions and
metric learning techniques used today (Section 3). We continue with an introduction
to the NN training process in Section 4, including a mathematical derivation of
backpropagation and training considerations. Section 5 elaborates on the different
optimizers used during training, after which Section 6 presents a review of common
regularization schemes. Section 7 details advanced NN architectures with state-of-the-art
performance and Section 8 concludes the chapter by highlighting some
current important NN challenges.

## 2 Basic NN structure

The basic building block of a NN consists of a linear operation followed by a non-linear
function. Each building block has a set of parameters, termed weights
and biases (sometimes the term weights also includes the biases), that are updated
in the training process with the goal of minimizing a pre-defined loss function.

Given input data $\mathbf{x}\in \mathbb{R}^{d_0}$, the output of the building block is of the form $\psi (\mathbf{W}\mathbf{x}+\mathbf{b})$, where $\psi(\cdot )$ is a non-linear function, $\mathbf{W}\in \mathbb{R}^{d_1 \times d_0}$ is the linear operation and $\mathbf{b}\in \mathbb{R}^{d_1}$ is the bias. See Fig. 2 for an illustration of a single building block.

![building_block.png](2003.03253v1-resources/20031.03253v1-pics/building_block.png)

Fig. 2: An NN building block consists of a linear and a non-linear element. The weights
**W** and biases **b** are the parameters of the layer.

![NN_illustraion.png](2003.03253v1-resources/20031.03253v1-pics/NN_illustraion.png)

Fig. 3: NN layered structure: concatenation of N building blocks, i.e., model layers.

To form an NN model, such building blocks are concatenated one to another in a
layered structure that allows the input data to be gradually processed as it propagates
through the network. Such a process is termed the (feed-)forward pass. Following it,
during training, a backpropagation process is used to update the NN parameters, as
elaborated in Section 4.1. At inference time, only the forward pass is used.

Fig. 3 illustrates the concatenation of $K$ building blocks, i.e., layers. The intermediate
output at the end of the model (before the “task driven block”) is termed the
_network embedding_ and it is formulated as follows:

$$\Phi(\mathbf{x}, \mathbf{W}^{(1)}, ..., \mathbf{W}^{(K)}, \mathbf{b}^{(1)}, ..., \mathbf{b}^{(K)})=\psi(\mathbf{W}^{(K)}...\psi(\mathbf{W}^{(2)}\psi(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})+\mathbf{b}^{(2)})...+\mathbf{b}^{(K)}). \quad \quad (1)$$
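The nested form of Eq. (1) is simply a loop over the layers. The following is a minimal NumPy sketch under stated assumptions: ReLU is chosen as $\psi$ for illustration, and the layer sizes, the random weights and the helper name `embedding` are hypothetical, not taken from the paper.

```python
import numpy as np

def relu(z):
    # Element-wise non-linearity psi; ReLU is an illustrative choice.
    return np.maximum(z, 0.0)

def embedding(x, weights, biases, psi=relu):
    """Apply psi(W x + b) once per building block, as in Eq. (1)."""
    z = x
    for W, b in zip(weights, biases):
        z = psi(W @ z + b)
    return z

rng = np.random.default_rng(0)
dims = [4, 5, 3]  # hypothetical sizes d_0, d_1, d_2, i.e. K = 2 blocks
weights = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(2)]
biases = [rng.standard_normal(dims[k + 1]) for k in range(2)]
x = rng.standard_normal(dims[0])

phi = embedding(x, weights, biases)  # network embedding of x
```

Each weight matrix has shape $d_{k} \times d_{k-1}$, so the output of one block is a valid input to the next.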

The final output (prediction) of the network is estimated from the network embedding
of the input data using an additional task driven layer. A popular example is the case
of classification, where this block is usually a linear operation followed by the
_cross-entropy_ loss function (detailed in Section 3).

When approaching the analysis of data with varying length, such as sequential
data, a variant of the aforementioned approach is used. A very popular example for
such a neural network structure is the Recurrent Neural Network (RNN [51]). In a
vanilla RNN model, the network receives at each time step just a single input but
with a feedback loop calculated using the result of the same network in the previous
time-step (see an illustration in Fig. 4). This enables the network to "remember"
information, to support multiple inputs, and to produce one or more outputs.
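The feedback loop can be sketched as a recurrence over time steps. The update rule below, $\mathbf{h}_t = \tanh(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{b})$, is one common formulation of a vanilla RNN and is an assumption here, as are the names `W_x`, `W_h` and the chosen sizes.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0):
    """Run a vanilla RNN over a sequence; the same weights are reused at every step."""
    h = h0
    states = []
    for x_t in xs:
        # The previous state h feeds back into the current step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6  # hypothetical input size, state size, sequence length
W_x = rng.standard_normal((d_h, d_in)) * 0.5
W_h = rng.standard_normal((d_h, d_h)) * 0.5
b = np.zeros(d_h)
xs = rng.standard_normal((T, d_in))

states = rnn_forward(xs, W_x, W_h, b, h0=np.zeros(d_h))  # one state per time step
```

Because the loop runs once per element of `xs`, the same function handles sequences of any length.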

More complex RNN structures include performing bi-directional calculations or
adding gating to the feedback and the input received by the network. The best-known
complex RNN architecture is the Long Short-Term Memory (LSTM) [46, 37], which
adds gates to the RNN. These gates decide what information from the current input
and the past will be used to calculate the output and the next feedback, as well as
what information to mask (i.e., causing the network to forget). This enables an easier
combination of past and present information. It is commonly used for time-series
data in domains such as NLP and speech processing.

![RNN_series.png](2003.03253v1-resources/20031.03253v1-pics/RNN_series.png)

Fig. 4: Recurrent NN (RNN) illustration for time series data. The feedback loop
introduces time dependent characteristics to the NN model using an element-wise
function. The weights are the same along all time steps.

Another common network structure is the _Encoder-Decoder_ architecture. The
first part of the model, the encoder, reduces the dimensions of the input to a compact
feature vector. This vector functions as the input to the second part of the model, the
decoder. The decoder increases its dimension, usually, back to the original input size.
This architecture essentially learns to compress (encode) the input to an efficiently
small vector and then decode the information from its compact representation. In
the context of regular feedforward NNs, this model is known as an autoencoder [119]
and is used for several tasks such as image denoising [102], image captioning [133],
feature extraction [132] and segmentation [2]. In the context of sequential data, it is
used for tasks such as translation, where the decoder generates a translated sentence
from a vector representing the input sentence [126, 14].

### 2.1 Common linear layers

A common basic NN building block is the Fully Connected (FC) layer. A network
composed of a concatenation of such layers is termed a Multi-Layer Perceptron
(MLP) [106]. The FC layer connects every neuron in one layer to every neuron in
the following layer, i.e., the matrix **W** is dense. It enables information propagation
from all neurons to all the ones following them. However, it may not maintain spatial
information. Figure 5 illustrates a network with FC layers.

The convolutional layer [66, 68] is another very common layer. We discuss here
the 2D case; the extension to other dimensions is straightforward. This layer
applies one or multiple convolution filters to its input with kernels of size $W \times H$.
The output of the convolution layer is commonly termed a _feature map_.

Each neuron in a feature map receives inputs from a set of neurons from the
previous layer, located in a small neighborhood defined by the kernel size. If we
apply this relationship recursively, we can find the part of the input that affects each
neuron at a given layer, i.e., the area of visible context that each neuron sees from
the input. The size of this part is called the _receptive field_. It impacts the type and
size of visual features each convolution layer may extract, such as edges, corners
and even patterns. Since convolution operations maintain spatial information and are
translation equivariant, they are very useful, especially in image processing and CV.
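The recursive definition of the receptive field can be turned into a short calculation. The sketch below works along one spatial axis and assumes plain (non-dilated) convolution or pooling layers, each described by a kernel size and a stride; the function name and the layer lists are illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    size = 1  # a single input element sees only itself
    jump = 1  # distance, in input elements, between neighbouring neurons of the current layer
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size

three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
with_stride = receptive_field([(3, 1), (3, 2), (3, 1)])
```

Under these assumptions three stacked $3 \times 3$ layers with stride 1 see a $7 \times 7$ input window, and giving the middle layer a stride of 2 enlarges this to $9 \times 9$.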

If the input to a convolution layer has some arbitrary third dimension, for example
3 channels in an RGB image ($C = 3$) or some $C > 1$ channels from an output of a
hidden layer in the model, the kernel of the matching convolution layer should be
of size $W \times H \times C$. This corresponds to applying a different convolution for each
input channel separately, and then summing the outputs to create one feature map.
The convolution layer may create a multi-channel feature map by applying multiple
filters to the input, i.e., using a kernel of size $W \times H \times C_{in} \times C_{out}$, where $C_{in}$ and $C_{out}$
are the number of channels at the input and output of the layer respectively.
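To make the kernel shapes concrete, here is a deliberately naive NumPy sketch of a multi-channel convolution layer. It assumes no padding, a stride of 1, and the cross-correlation convention (no kernel flip) that deep learning libraries typically use; the array layout and the sizes are illustrative choices.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid, stride-1 2D convolution (cross-correlation convention).

    x:      (rows, cols, c_in)
    kernel: (k_rows, k_cols, c_in, c_out)
    """
    rows, cols, c_in = x.shape
    k_rows, k_cols, k_in, c_out = kernel.shape
    assert c_in == k_in, "kernel depth must match the number of input channels"
    out = np.zeros((rows - k_rows + 1, cols - k_cols + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k_rows, j:j + k_cols, :]
            # Sum over the spatial window and over all input channels.
            out[i, j, :] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))        # RGB-like input, C_in = 3
kernel = rng.standard_normal((3, 3, 3, 16))   # 16 filters, C_out = 16
feature_map = conv2d(image, kernel)           # shape (6, 6, 16)
```

Each of the 16 output channels is the sum of three per-channel 2D filterings, which is exactly the "filter each input channel, then sum" description above.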

### 2.2 Common non-linear functions

The non-linear functions defined for each layer are of great interest since they
introduce the non-linear property to the model and can keep the propagating gradient
from vanishing or exploding (see Section 4).

Non-linear functions that are applied element-wise are known as _activation functions_.
Common activation functions are the Rectified Linear Unit (ReLU [20]), leaky
ReLU [141], Exponential Linear Unit (ELU) [15], hyperbolic tangent (tanh) and sigmoid.
There is no universal rule for choosing a specific activation function; however,
ReLUs and ELUs are currently more popular for image processing and CV while
sigmoid and tanh are more common in speech and NLP. Fig. 6 presents the response
of the different activation functions and Table 1 their mathematical formulation.

![MLP.png](2003.03253v1-resources/20031.03253v1-pics/MLP.png)

Fig. 5: Fully-connected layers.

![activation_functions.jpg](2003.03253v1-resources/20031.03253v1-pics/activation_functions.jpg)
Fig. 6: Different activation functions. Leaky ReLU with $\alpha = 0.1$, ELU with $\alpha = 1$.

Table 1: Mathematical expressions for non-linear activation functions.

<table cellspacing="1" cellpadding="2" valign="middle" style="border-collapse: collapse; border: none;">
    <tbody>
        <tr style="border: none;">
            <td style="border: none;">
Function
            </td>
            <td style="border: none;">
Formulation $s(x)$
            </td>
            <td style="border: none;">
Derivative $\frac{ds(x)}{dx}$
            </td>
            <td style="border: none;">
Function output
            </td>
        </tr>
        <tr style="border: none;">
            <td style="border: none;">
ReLU
            </td>
            <td style="border: none;">
$$\begin{cases}
0, & \text{for } x<0\\ 
x, & \text{for } x\geq 0
\end{cases}$$
              </td>
            <td style="border: none;">
$$\begin{cases}
0, & \text{for } x<0\\ 
1, & \text{for } x\geq 0
\end{cases}$$
            </td>
            <td style="border: none;">
$$[0,\infty )$$
            </td>
        </tr>
        <tr style="border: none;">
            <td style="border: none;">
Leaky ReLU
            </td>
            <td style="border: none;">
$$\begin{cases}
\alpha x, & \text{for } x<0\\ 
x, & \text{for } x\geq 0
\end{cases}$$
            </td>
            <td style="border: none;">
$$\begin{cases}
\alpha, & \text{for } x<0\\ 
1, & \text{for } x\geq 0
\end{cases}$$
            </td>
            <td style="border: none;">
$$(-\infty ,\infty )$$
            </td>
        </tr>
        <tr style="border: none;">
            <td style="border: none;">
ELU
            </td>
            <td style="border: none;">
$$\begin{cases}
\alpha(\mathrm{e}^{x}-1), & \text{for } x<0\\ 
x, & \text{for } x\geq 0
\end{cases}$$
            </td>
            <td style="border: none;">
$$\begin{cases}
\alpha \mathrm{e}^{x}, & \text{for } x<0\\ 
1, & \text{for } x\geq 0
\end{cases}$$
            </td>
            <td style="border: none;">
$$(-\alpha ,\infty )$$
            </td>
        </tr>
        <tr style="border: none;">
            <td style="border: none;">
Sigmoid
            </td>
            <td style="border: none;">
$$\frac{1}{1+\mathrm{e}^{-x}}$$
            </td>
            <td style="border: none;">
$$\frac{\mathrm{e}^{-x}}{(1+\mathrm{e}^{-x})^2}$$
            </td>
            <td style="border: none;">
$$(0,1)$$
            </td>
        </tr>
        <tr style="border: none;">
            <td style="border: none;">
tanh
            </td>
            <td style="border: none;">
$$\tanh(x)=\frac{\mathrm{e}^{2x}-1}{\mathrm{e}^{2x}+1}$$
            </td>
            <td style="border: none;">
$$1-\tanh^2(x)$$
            </td>
            <td style="border: none;">
$$(-1,1)$$
            </td>
        </tr>
    </tbody>
</table>
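The entries of Table 1 translate directly into NumPy. The sketch below implements each function and its derivative and compares the analytic derivatives with a central finite difference. The default values of `alpha` follow the caption of Fig. 6; the helper names and the test points are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.where(x < 0, 0.0, x)

def leaky_relu(x, alpha=0.1):
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on the branch that is discarded.
    return np.where(x < 0, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0), x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Analytic derivatives from Table 1.
def d_relu(x):
    return np.where(x < 0, 0.0, 1.0)

def d_leaky_relu(x, alpha=0.1):
    return np.where(x < 0, alpha, 1.0)

def d_elu(x, alpha=1.0):
    return np.where(x < 0, alpha * np.exp(np.minimum(x, 0.0)), 1.0)

def d_sigmoid(x):
    return np.exp(-x) / (1.0 + np.exp(-x)) ** 2

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def numeric_derivative(f, x, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # avoid x = 0, where the ReLU family has a kink
pairs = [(relu, d_relu), (leaky_relu, d_leaky_relu), (elu, d_elu),
         (sigmoid, d_sigmoid), (np.tanh, d_tanh)]
max_errors = [np.max(np.abs(numeric_derivative(f, x) - df(x))) for f, df in pairs]
```

The values in `max_errors` should all be close to zero, which confirms that the derivative column matches the formulation column away from the kink at $x = 0$.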

Other common non-linear operations in a NN model are the pooling functions.
They are aggregation operations that reduce dimensionality while keeping dominant
features. Assume a pooling size of $q$ and an input vector to a hidden layer of size
$d$, $\mathbf{z}=[z_1, z_2, ..., z_d]$. For every $m \in [1, d-q+1]$, the subset of the input vector $\mathbf{\tilde{z}}=[z_m, z_{m+1}, ..., z_{m+q-1}]$, which holds $q$ consecutive entries, may undergo one of the following popular pooling operations:

Max pooling: $g(\mathbf{\tilde{z}})=\max_i \tilde{z}_i$

Mean pooling: $g(\mathbf{\tilde{z}})=\frac{1}{q}\sum_{i=m}^{m+q-1}z_i$

$\ell_p$ pooling: $g(\mathbf{\tilde{z}})=\sqrt[p]{\sum_{i=m}^{m+q-1} z^p_i}$

All pooling operations are characterized by a stride, $s$, that effectively defines the output dimensions. Applying pooling with a stride $s$, is equivalent to applying the pooling with no stride (i.e., $s=1$) and then sub-sampling by a factor of $s$. It is common to add zero padding to $\mathbf{z}$ such that its length is divisible by $s$.
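The equivalence between strided pooling and dense pooling followed by sub-sampling can be checked numerically. This is a minimal 1D sketch that assumes each window holds exactly $q$ consecutive entries, uses 0-based indexing, and takes absolute values inside the $\ell_p$ pooling so that it is defined for negative inputs; the function names and data are illustrative.

```python
import numpy as np

def pool1d(z, q, stride=1, op=np.max):
    """Apply `op` to windows of q consecutive entries, moving `stride` entries at a time."""
    starts = range(0, len(z) - q + 1, stride)
    return np.array([op(z[m:m + q]) for m in starts])

def lp_norm(window, p=2):
    # Absolute values are an added assumption, see the note above.
    return np.sum(np.abs(window) ** p) ** (1.0 / p)

z = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 7.0, 6.0])
q, s = 2, 2

max_strided = pool1d(z, q, stride=s)
max_dense_then_subsampled = pool1d(z, q, stride=1)[::s]  # s = 1, then keep every s-th output
mean_pooled = pool1d(z, q, stride=s, op=np.mean)
l2_pooled = pool1d(z, q, stride=s, op=lp_norm)
```

Here `max_strided` and `max_dense_then_subsampled` hold the same values, and the length of `z` is already divisible by `s`, so no zero padding is needed.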

Another very common non-linear function is the $\textit{softmax}$, which normalizes vectors into probabilities. The output of the model, the embedding, may undergo an additional linear layer to transform it to a vector of size $1 \times N$, termed $\textit{logits}$, where $N$ is the number of classes. The logits, here denoted as $\mathbf{v}$, are the input to the softmax operation defined as follows: 
\begin{equation}\label{eq:softmax}
    \text{softmax}(v_i)=\frac{\mathrm{e}^{v_i}}{\sum_{j=1}^{N}\mathrm{e}^{v_j}}, ~~~~~ i\in[1,...,N]. \quad \quad (2)
\end{equation}
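Eq. (2) can be implemented in a few lines. The sketch below subtracts the largest logit before exponentiating, a standard trick that leaves the result unchanged but avoids overflow; the example logits are made up.

```python
import numpy as np

def softmax(v):
    """Eq. (2), with the maximum subtracted first for numerical stability."""
    shifted = v - np.max(v)
    e = np.exp(shifted)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for N = 3 classes
probs = softmax(logits)             # non-negative entries that sum to 1
```

Adding the same constant to every logit does not change `probs`, which is why the shift by the maximum is safe.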



## 3 Loss functions

The choice of the model's loss function, denoted as $\mathcal{L}$, is critical and is usually based on the characteristics of the dataset and the task at hand.
Though datasets can vary, tasks performed by NN models can be divided into two coarse groups: (1) regression tasks and (2) classification tasks.

A regression problem aims at approximating a mapping function from input variables to one or more continuous output variables.
For NN tasks, the output of the network should predict a continuous value of interest.
Common NN regression problems include image denoising [148], deblurring [84], inpainting [142] and more.
In these tasks, it is common to use the Mean Squared Error (MSE), Structural SIMilarity (SSIM) or $\ell_1$ loss as the loss function.
The MSE ($\ell_2$ error) imposes a larger penalty for larger errors, compared to the $\ell_1$ error, which is more robust to outliers in the data.
The SSIM, and its multiscale version [149], help improve the perceptual quality.
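The different sensitivity of the MSE and the $\ell_1$ loss to outliers is easy to see on a toy example. The numbers below are invented for illustration: five small prediction errors, and the same five with one of them replaced by a large outlier.

```python
import numpy as np

def mse_loss(pred, target):
    return np.mean((pred - target) ** 2)

def l1_loss(pred, target):
    return np.mean(np.abs(pred - target))

target = np.zeros(5)
small_errors = np.array([0.1, -0.1, 0.1, -0.1, 0.1])
with_outlier = np.array([0.1, -0.1, 0.1, -0.1, 10.0])

# How much does a single outlier inflate each loss?
mse_ratio = mse_loss(with_outlier, target) / mse_loss(small_errors, target)
l1_ratio = l1_loss(with_outlier, target) / l1_loss(small_errors, target)
```

In this example the single outlier multiplies the MSE by roughly 2000 but the $\ell_1$ loss by only about 21, so a model trained with the MSE is pulled much harder towards fitting the outlier.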

In the _classification_ task, the goal is to identify the correct class of a given
sample from pre-defined $N$ classes. A common loss function for such tasks is the _cross-entropy_  loss. 
It is implemented based on a normalized vector of probabilities corresponding to a list of potential outcomes. This normalized vector is calculated by the softmax non-linear function (Eq. (2)). The cross-entropy loss is defined as:

\begin{equation}\label{eq:cross-entropy}
\mathcal{L}_{CE}=-\sum_{i=1}^{N}y_i\log(p_i), \quad \quad (3)
\end{equation}

where $y_i$ is the ground-truth probability (the label) of the input to belong to class $i$ and $p_i$ is the  model prediction score for this class. The label is usually binary, i.e., it contains $1$ in a single index (corresponding to the true class). This type of representation is known as _one-hot encoding_. The class is predicted in the network by selecting the largest probability and the log-loss is used to increase this probability. 
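With a one-hot label, Eq. (3) reduces to the negative log of the probability assigned to the true class. A minimal sketch follows; the probabilities are made up and the small `eps` that guards against $\log(0)$ is an implementation detail that is not part of Eq. (3).

```python
import numpy as np

def cross_entropy(probs, one_hot, eps=1e-12):
    """Eq. (3); eps only protects against log(0)."""
    return -np.sum(one_hot * np.log(probs + eps))

probs = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output for N = 3 classes
label = np.array([1.0, 0.0, 0.0])   # one-hot label: the true class is index 0

loss = cross_entropy(probs, label)          # equals -log(0.7) up to eps
predicted_class = int(np.argmax(probs))     # class with the largest probability
```

Raising the probability of the true class lowers the loss, which is what training with this objective pushes the network to do.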

Notice that a network may provide multiple outputs per input data-point. For example, in the problem of image semantic segmentation, the network predicts a class for each pixel in the image. In the task of object detection, the network outputs a list of objects, where each is defined by a bounding box (found using a regression loss) and a class (found using a classification loss). Section 7.1 details these different tasks.
Since the labeled data are imbalanced in some problems, one may use a weighted softmax (which gives more weight to less frequent classes) or the focal loss [72].

### 3.1 Metric Learning

An interesting property of the log-loss function used for classification is that it implicitly clusters classes in the network embedding space during training. However, for a clustering task, these vanilla distance criteria often produce unsatisfactory performance, as different class clusters can be positioned closely in the embedding space, which may cause misclassification of samples that do not reside in the specific training set distribution.

Therefore, different metric learning techniques have been developed to produce an embedding space that brings intra-class samples closer together and increases inter-class distances. This results in better accuracy and robustness of the network. It allows the network to decide whether two data samples are from the same class just by comparing their embeddings, even if their classes were not present at training time.

Metric learning is very useful for tasks such as face recognition and identification, where the number of subjects to be tested is not known at training time and new identities that were not present during training should also be identified/recognized (e.g., given two images the network should decide whether these correspond to the same or different persons).

An example for a popular metric loss is the _triplet loss_ [109]. It enforces a margin between instances of the same class and other classes in the embedding feature space. This approach increases performance accuracy and robustness due to the large separation between class clusters in the embedding space.
The triplet loss can be used in various tasks, such as detection, classification, recognition and other tasks with an unknown number of classes.

In this approach, three instances are used in each training step $i$: an anchor $\mathbf{x}_i^a$, another instance $\mathbf{x}_i^p$ from the same class as the anchor (positive sample), and a sample $\mathbf{x}_i^n$ from a different class (negative sample).
They are required to obey the following inequality:


\begin{equation}
    \left\Vert \Phi(\mathbf{x}_i^a)-\Phi(\mathbf{x}_i^p) \right\Vert_2^2+\alpha<\left\Vert \Phi(\mathbf{x}_i^a)-\Phi(\mathbf{x}_i^n)\right\Vert_2^2, \quad \quad (4)
\end{equation}
where $\alpha>0$ enforces the wanted margin from other classes.
Thus, the triplet loss is defined by:
\begin{equation}
    \mathcal{L}=\sum_i\left[\left\Vert \Phi(\mathbf{x}_i^a)-\Phi(\mathbf{x}_i^p)\right\Vert_2^2-\left\Vert \Phi(\mathbf{x}_i^a)-\Phi(\mathbf{x}_i^n)\right\Vert_2^2+\alpha\right]_+, \quad \quad (5)
\end{equation}
where $[u]_+=\max(u,0)$, so triplets that already satisfy Eq. (4) do not contribute to the loss.
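The triplet loss can be sketched on pre-computed embeddings. The version below assumes the hinge form (a maximum with zero) that is commonly used with this loss, so that triplets already satisfying Eq. (4) contribute nothing; the margin value and the 2D embeddings are made up for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-form triplet loss on embeddings of shape (batch, embedding_dim)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distance
    return np.sum(np.maximum(d_pos - d_neg + margin, 0.0))

anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [1.0, 0.0]])
negative = np.array([[1.0, 0.0], [0.5, 0.0]])

loss = triplet_loss(anchor, positive, negative)
```

The first triplet satisfies the margin and adds zero. The second one has its negative closer to the anchor than its positive, so it violates Eq. (4) and accounts for the entire loss of 0.95.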

Fig. 7 presents a  schematic illustration of the triplet loss influence on samples in the embedding space. 
This illustration also exhibits a specific triplet example, where the positive examples are relatively far from the anchor while negative examples are relatively near the anchor. Finding such examples that violate the triplet condition is desirable during training. They may be found by on-line or off-line searches known as _hard negative mining_. A preprocessing step over the instances in the embedding space is performed to find violating examples for training the network.

Finding the "best" instances for training can, evidently, aid in achieving improved convergence. However, searching for them is often time consuming and therefore alternative techniques are being explored. 

![triplet.png](2003.03253v1-resources/20031.03253v1-pics/triplet.png)

Fig. 7: Triplet loss: minimizes the distance between two similar class examples (anchor and positive), and maximizes the distance between two different class examples
(anchor and negative).

An intriguing metric learning approach relies on 'classification'-type loss functions, where the network is trained given a fixed number of classes. Yet, these losses are designed to create a good embedding space with a margin between classes, which in turn provides a good prediction of similarity between two inputs. Popular examples include the Cos-loss [134], Arc-loss [23] and SphereFace [76].

## 4 Neural network training

Given a loss function, the weights of the neural network are updated to minimize it for a given training set. The training process of a neural network requires a large database due to the nature of the network (structure and number of parameters), as well as GPUs for an efficient training implementation.

In general, training methods can be divided into supervised and unsupervised training. The former relies on labeled data, which are usually very expensive and time consuming to obtain, whereas the latter does not assume known ground-truth labels, which is the more common situation.
However, supervised training usually achieves significantly better network performance compared to the unsupervised case. Therefore, a lot of resources are invested in labeling datasets for training. Thus, we focus here mainly on the supervised setting.
 
In neural networks, regardless of the model task, all training phases have the same goal: to minimize a pre-defined error function, also denoted as the loss/cost function.
This is done in two stages: (a) a feed-forward pass of the input data through all the network layers, calculating the error using the predicted outputs and their ground-truth labels (if available); followed by (b) backpropagation of the errors through the network to update its weights, from the last layer to the first.
This process is repeated iteratively to find optimized values for the weights of the network.

The backpropagation algorithm provides the gradients of the error with respect to the network weights. These gradients are used to update the weights of the network. Calculating them based on the whole input data is computationally demanding and therefore, the common practice is to use subsets of the training set, termed _mini-batches_, and cycle over the entire training set multiple times. Each cycle of training over the whole dataset is termed an _epoch_ and in every cycle the data samples are used in a random order to avoid biases.
The training process ends when convergence in the loss function is obtained. Since most NN problems are not convex, an optimal solution is not assured. We now turn to describing the training process using backpropagation in more detail.
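The mini-batch and epoch structure can be illustrated with a small training loop. To keep the sketch short it uses a linear model with an MSE loss as a toy stand-in for a network, so the gradient is written by hand instead of being obtained by backpropagation; the data, learning rate, batch size and number of epochs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X = rng.standard_normal((n_samples, n_features))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w  # noiseless targets keep the example easy to check

w = np.zeros(n_features)
learning_rate, batch_size, n_epochs = 0.1, 20, 50
epoch_losses = []

for epoch in range(n_epochs):
    order = rng.permutation(n_samples)  # new random order in every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]  # indices of one mini-batch
        residual = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient of the batch MSE
        w -= learning_rate * grad
    # Track the loss over the full training set once per epoch.
    epoch_losses.append(np.mean((X @ w - y) ** 2))
```

Each epoch visits every sample exactly once, in 10 mini-batches of 20 samples, and `epoch_losses` can be inspected to decide when the loss has converged.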

### 4.1 Backpropagation

![example.png](2003.03253v1-resources/20031.03253v1-pics/example.png)

The backpropagation process is performed to update all the parameters of the model, with the goal of decreasing the loss function value. 
The process starts with a feed-forward pass of input data, $\mathbf{x}$, through all the network layers, after which the loss function value is calculated and denoted as $\mathcal{L}(\mathbf{x},{\bf W})$, where ${\bf W}$ are the model parameters (including the model weights and biases, for formulation convenience).
Then the backpropagation is initiated by computing the value of $\frac{\partial \mathcal{L}}{\partial {\bf W}}$, followed by the update of the network weights. All the weights are updated recursively by calculating the gradients of every layer, from the final one to the input layer, using the chain rule.

Denote the output of layer $l$ as ${\bf z}^{(l)}$. Following the chain rule, the gradients of the loss at a given layer $l$ with parameters ${\bf W}^{(l)}$ with respect to its input ${\bf z}^{(l-1)}$ are:

\begin{equation}
    \frac{\partial \mathcal{L}}{\partial {\bf z}^{(l-1)}}=\frac{\partial \mathcal{L}}{\partial {\bf z}^{(l)}}\cdot\frac{\partial {\bf z}^{(l)}({\bf W}^{(l)},{\bf z}^{(l-1)})}{\partial {\bf z}^{(l-1)}}, \quad (6)
\end{equation}
and the gradients with respect to the parameters are:
\begin{equation}
    \frac{\partial \mathcal{L}}{\partial {\bf W}^{(l)}}=\frac{\partial \mathcal{L}}{\partial {\bf z}^{(l)}}\cdot\frac{\partial {\bf z}^{(l)}({\bf W}^{(l)},{\bf z}^{(l-1)})}{\partial {\bf W}^{(l)}}. \quad (7)
\end{equation}
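
As a hedged illustration of Eqs. (6) and (7), the following sketch applies them to a small two-layer network and compares the result with finite differences. The architecture (a `tanh` layer followed by a linear layer), the squared-error loss, the layer sizes, and all function names are assumptions made for this example; the original paper does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # network input, z^(0)
y = rng.normal(size=2)            # regression target (illustrative)
W1 = rng.normal(size=(4, 3))      # parameters of layer 1
W2 = rng.normal(size=(2, 4))      # parameters of layer 2

def forward(W1, W2, x):
    z1 = np.tanh(W1 @ x)          # z^(1), output of layer 1
    z2 = W2 @ z1                  # z^(2), the network output
    return z1, z2

def loss(W1, W2, x, y):
    _, z2 = forward(W1, W2, x)
    return 0.5 * np.sum((z2 - y) ** 2)

def backward(W1, W2, x, y):
    z1, z2 = forward(W1, W2, x)
    dL_dz2 = z2 - y                       # gradient of the loss at the output
    dL_dW2 = np.outer(dL_dz2, z1)         # Eq. (7) for layer 2
    dL_dz1 = W2.T @ dL_dz2                # Eq. (6) for layer 2: gradient w.r.t. its input z^(1)
    dL_da1 = dL_dz1 * (1.0 - z1 ** 2)     # derivative of tanh, using z1 = tanh(a1)
    dL_dW1 = np.outer(dL_da1, x)          # Eq. (7) for layer 1
    return dL_dW1, dL_dW2

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

dL_dW1, dL_dW2 = backward(W1, W2, x, y)
num_dW1 = numerical_grad(lambda W: loss(W, W2, x, y), W1)
num_dW2 = numerical_grad(lambda W: loss(W1, W, x, y), W2)
print(np.max(np.abs(dL_dW1 - num_dW1)), np.max(np.abs(dL_dW2 - num_dW2)))
```

The gradient flows from the final layer back to the first one, reusing $\frac{\partial \mathcal{L}}{\partial {\bf z}^{(l)}}$ at each step, which is the recursion described above. The finite-difference comparison is a common sanity check and is not part of backpropagation itself.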

## References from original paper <a id="ReferencesOGPaper"></a>

1. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3D point clouds. In: J. Dy, A. Krause (eds.) Proceedings of the 35th
International Conference on Machine Learning, Proceedings of Machine Learning Research,
vol. 80, pp. 40–49. PMLR, Stockholmsmässan, Stockholm, Sweden (2018)
2. Atlason, H.E., Love, A., Sigurdsson, S., Gudnason, V., Ellingsen, L.M.: Unsupervised brain
lesion segmentation from mri using a convolutional autoencoder. In: Medical Imaging 2019:
Image Processing, vol. 10949, p. 109491H. International Society for Optics and Photonics
(2019)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align
and translate. In: 3rd International Conference on Learning Representations, ICLR (2015)
4. Bar, Y., Diamant, I., Wolf, L., Greenspan, H.: Deep learning with non-medical training used
for chest pathology identification. In: Medical Imaging 2015: Computer-Aided Diagnosis,
vol. 9414, pp. 215 – 221. International Society for Optics and Photonics, SPIE (2015)
5. Ben-Cohen, A., Diamant, I., Klang, E., Amitai, M., Greenspan, H.: Fully convolutional
network for liver segmentation and lesions detection. In: Deep learning and data labeling for
medical applications, pp. 77–85. Springer (2016)
6. Boscaini, D., Masci, J., Melzi, S., Bronstein, M.M., Castellani, U., Vandergheynst, P.: Learning class-specific descriptors for deformable shapes using localized spectral convolutional
networks. Comput. Graph. Forum 34, 13–23 (2015)
7. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural
image synthesis. In: International Conference on Learning Representations (ICLR) (2019)
8. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot
video object segmentation. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 221–230 (2017)
9. Castro, F.M., Marín-Jiménez, M.J., Guil, N., Schmid, C., Alahari, K.: End-to-end incremental
learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248 (2018)
10. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous
separable convolution for semantic image segmentation. In: Proceedings of the European
conference on computer vision (ECCV), pp. 801–818 (2018)
11. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous
separable convolution for semantic image segmentation. In: ECCV (2018)
12. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive darts: Bridging the optimization gap for nas
in the wild. arXiv preprint arXiv:1912.10952 (2019)
13. Chen, Z., Zhang, J., Tao, D.: Progressive lidar adaptation for road detection. IEEE/CAA
Journal of Automatica Sinica 6(3), 693–702 (2019)
14. Cho, K., van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine
translation: Encoder–decoder approaches. In: Workshop on Syntax, Semantics and Structure
in Statistical Translation, pp. 103–111. Association for Computational Linguistics (2014)
15. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by
exponential linear units (elus). CoRR (2015)
16. Cortes, C., Vapnik, V.: Support vector networks. Machine Learning 20, 273–297 (1995)
17. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press (2000). DOI 10.1017/CBO9780511801389
18. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation strategies from data. In: The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2019)
19. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. arXiv (2019)
20. Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving deep neural networks for lvcsr using
rectified linear units and dropout. In: ICASSP, pp. 8609–8613. IEEE (2013)
21. Dauphin, Y.N., de Vries, H., Chung, J., Bengio, Y.: Rmsprop and equilibrated adaptive
learning rates for non-convex optimization. CoRR (2015)
22. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. CVPR (2009)
23. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face
recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2019)
24. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding (2018)
25. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding. In: NAACL-HLT (2019)
26. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with
cutout. arXiv (2017)
27. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A
deep convolutional activation feature for generic visual recognition. In: E.P. Xing, T. Jebara
(eds.) Proceedings of the 31st International Conference on Machine Learning, Proceedings
of Machine Learning Research, vol. 32, pp. 647–655. PMLR, Bejing, China (2014)
28. Dovrat, O., Lang, I., Avidan, S.: Learning to sample. In: The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2019)
29. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and
stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
30. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. J. Mach. Learn.
Res. 20, 55:1–55:21 (2018)
31. van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Machine Learning (2019). DOI 10.1007/s10994-019-05855-6. URL https://doi.org/10.1007/s10994-019-05855-6
32. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: C. Cortes, N.D. Lawrence, D.D. Lee,
M. Sugiyama, R. Garnett (eds.) Advances in Neural Information Processing Systems 28,
pp. 2962–2970. Curran Associates, Inc. (2015). URL http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
33. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. arXiv
preprint arXiv:1409.7495 (2014)
34. Gao, C., Gu, D., Zhang, F., Yu, Y.: Reconet: Real-time coherent video style transfer network.
In: Asian Conference on Computer Vision, pp. 637–653. Springer (2018)
35. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423 (2016)
36. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2414–2423
(2016)
37. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with lstm.
ICANN (1999)
38. Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., Zhou, Y.: Deep
voice 2: Multi-speaker neural text-to-speech. In: Advances in neural information processing
systems, pp. 2962–2970 (2017)
39. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y.: Generative adversarial nets. In: Z. Ghahramani, M. Welling, C. Cortes, N.D.
Lawrence, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 27,
pp. 2672–2680. Curran Associates, Inc. (2014)
40. Greenspan, H., van Ginneken, B., Summers, R.M.: Guest editorial deep learning in medical
imaging: Overview and future promise of an exciting new technique. CVPR 35, 1153 – 1159
(2016)
41. Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., Wang, J.: Long text generation via adversarial training with leaked information. In: Thirty-Second AAAI Conference on Artificial Intelligence
(2018)
42. Haim, H., Elmalem, S., Giryes, R., Bronstein, A.M., Marom, E.: Depth estimation from a
single image using deep learned phase coded mask. IEEE Transactions on Computational
Imaging 4(3), 298–310 (2018)
43. Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., Cohen-Or, D.: Meshcnn: A
network with an edge. ACM Transactions on Graphics (TOG) 38(4), 90 (2019)
44. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask r-cnn. IEEE International Conference
on Computer Vision (ICCV) pp. 2980–2988 (2017)
45. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural
Comput. 18(7), 1527–1554 (2006). DOI 10.1162/neco.2006.18.7.1527. URL http://dx.doi.org/10.1162/neco.2006.18.7.1527
46. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997). DOI 10.1162/neco.1997.9.8.1735
47. Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A., Darrell, T.:
CyCADA: Cycle-consistent adversarial domain adaptation. In: J. Dy, A. Krause (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine
Learning Research, vol. 80, pp. 1989–1998. PMLR, Stockholmsmässan, Stockholm, Sweden
(2018)
48. Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., Markham, A.: Randla-net:
Efficient semantic segmentation of large-scale point clouds. arXiv preprint arXiv:1911.11236
(2019)
49. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurons in the cat’s striate cortex. Journal
of Physiology 148, 574–591 (1959)
50. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In: Proceedings of the 32nd International Conference on Machine
Learning, vol. 37, pp. 448–456 (2015)
51. Jain, L.C., Medsker, L.R.: Recurrent Neural Networks: Design and Applications, 1st edn.
CRC Press, Inc., Boca Raton, FL, USA (1999)
52. Jakubovitz, D., Giryes, R., Rodrigues, M.R.D.: Generalization Error in Deep Learning, pp.
153–193. Springer International Publishing, Cham (2019)
53. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, pp. 694–711. Springer (2016)
54. Kadlec, R., Schmid, M., Bajgar, O., Kleindienst, J.: Text understanding with the attention sum
reader network. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 908–918. Association for Computational
Linguistics, Berlin, Germany (2016). DOI 10.18653/v1/P16-1086
55. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. Journal of
artificial intelligence research 4, 237–285 (1996)
56. Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S.: 3d shape segmentation with projective
convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 6630–6639 (2016)
57. Karlinsky, L., Shtok, J., Harary, S., Schwartz, E., Aides, A., Feris, R., Giryes, R., Bronstein,
A.M.: Repmet: Representative-based metric learning for classification and few-shot object
detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2019)
58. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved
quality, stability, and variation. In: International Conference on Learning Representations
(2018). URL https://openreview.net/forum?id=Hk99zCeAb
59. Kemker, R., McClure, M., Abitino, A., Hayes, T.L., Kanan, C.: Measuring catastrophic
forgetting in neural networks. In: Thirty-second AAAI conference on artificial intelligence
(2018)
60. Keskar, N.S., Socher, R.: Improving generalization performance by switching from adam to
sgd. arXiv preprint arXiv:1712.07628 (2017)
61. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR (2014)
62. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (eds.) Advances in
Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
63. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: J.E.
Moody, S.J. Hanson, R.P. Lippmann (eds.) Advances in Neural Information Processing Systems 4, pp. 950–957. Morgan-Kaufmann (1992). URL http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf
64. Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J.: A survey of deep learning-based network anomaly detection. Cluster Computing 22(1), 949–961 (2019). DOI 10.1007/s10586-017-1117-8
65. Lample, G., Conneau, A., Denoyer, L., Ranzato, M.: Unsupervised machine translation using
monolingual corpora only. arXiv preprint arXiv:1711.00043 (2017)
66. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.:
Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989). DOI 10.1162/neco.1989.1.4.541. URL http://dx.doi.org/10.1162/neco.1989.1.4.541
67. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel,
L.D.: Hand-written digit recognition with a back-propagation network. NIPS (1990)
68. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. In: Proceedings of the IEEE, pp. 2278–2324 (1998)
69. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: Convolution on x-transformed
points. In: NeurIPS (2018)
70. Li, Y., Pirk, S., Su, H., Qi, C.R., Guibas, L.J.: Fpnn: Field probing neural networks for 3d data. In: D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon,
R. Garnett (eds.) Advances in Neural Information Processing Systems 29, pp. 307–315. Curran Associates, Inc. (2016). URL http://papers.nips.cc/paper/6416-fpnn-field-probing-neural-networks-for-3d-data.pdf
71. Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast autoaugment. In: Advances in Neural
Information Processing Systems (NeurIPS) (2019)
72. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In:
Proceedings of the IEEE international conference on computer vision, pp. 2980–2988 (2017)
73. Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B.: Image inpainting
for irregular holes using partial convolutions. In: The European Conference on Computer
Vision (ECCV) (2018)
74. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. In: International
Conference on Learning Representations (2019)
75. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C.Y., Berg, A.C.: Ssd: Single
shot multibox detector. In: ECCV (2016)
76. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding
for face recognition. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 6738–6746 (2017)
77. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer,
L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692 (2019)
78. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)
79. Ma, W.C., Wang, S., Hu, R., Xiong, Y., Urtasun, R.: Deep rigid instance scene flow. In:
CVPR (2019)
80. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The
bulletin of mathematical biophysics 5(4), 115–133 (1943). DOI 10.1007/BF02478259
81. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: C.J.C.
Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.)
Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013). URL http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
82. Minsky, M., Papert, S.: Perceptrons: An Introduction to Computational Geometry. MIT Press,
Cambridge, MA, USA (1969)
83. Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., Bronstein, M.M.: Geometric
deep learning on graphs and manifolds using mixture model cnns. In: IEEE Conference on
Computer Vision and Pattern Recognition, CVPR, pp. 5425–5434 (2017). DOI 10.1109/CVPR.2017.576
84. Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for dynamic
scene deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3883–3891 (2017)
85. Nesterov, Y.E.: A method for solving the convex programming problem with convergence
rate O(1/k^2). In: Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547 (1983)
86. Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In:
Proceedings of the 33rd International Conference on International Conference on Machine
Learning - Volume 48, ICML’16, pp. 2014–2023. JMLR.org (2016). URL http://dl.acm.org/citation.cfm?id=3045390.3045603
87. Noy, A., Nayman, N., Ridnik, T., Zamir, N., Doveh, S., Friedman, I., Giryes, R., Zelnik-Manor,
L.: Asap: Architecture search, anneal and prune. arXiv preprint arXiv:1904.04123 (2019)
88. Ongie, G., Willett, R., Soudry, D., Srebro, N.: A function space view of bounded norm
infinite width ReLU nets: The multivariate case. In: International Conference on Learning
Representations (ICLR) (2020)
89. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner,
N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. In: Arxiv (2016).
URL https://arxiv.org/abs/1609.03499
90. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component
analysis. IEEE Transactions on Neural Networks 22(2), 199–210 (2010)
91. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks.
In: S. Dasgupta, D. McAllester (eds.) Proceedings of the 30th International Conference on
Machine Learning, Proceedings of Machine Learning Research, vol. 28, pp. 1310–1318.
PMLR, Atlanta, Georgia, USA (2013)
92. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep
contextualized word representations. In: Proc. of NAACL (2018)
93. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 77–85 (2016)
94. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point
sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)
95. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Networks
12(1), 145–151 (1999)
96. Radford, A., Sutskever, I.: Improving language understanding by generative pre-training. In:
arxiv (2018)
97. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. In: International
Conference on Learning Representations (ICLR) (2018)
98. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) pp. 779–788 (2015)
99. Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 6517–6525 (2016)
100. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. ArXiv abs/1804.02767
(2018)
101. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text
to image synthesis. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1060–1069. JMLR.org
(2016)
102. Remez, T., Litany, O., Giryes, R., Bronstein, A.M.: Class-aware fully convolutional gaussian
and poisson denoising. IEEE Transactions on Image Processing 27(11), 5707–5722 (2018)
103. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
region proposal networks. In: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, R. Garnett
(eds.) Advances in Neural Information Processing Systems 28, pp. 91–99. Curran Associates,
Inc. (2015)
104. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image
segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI),
LNCS, vol. 9351, pp. 234–241. Springer (2015)
105. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review 65(6), 386–408 (1958)
106. Ruck, D.W., Rogers, S.K.: Feature Selection Using a Multilayer Perceptron. Journal of Neural
Network Computing 2(July 1993), 40–48 (1990)
107. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-propagating
Errors. Nature 323(6088), 533–536 (1986). DOI 10.1038/323533a0
108. Safran, I., Eldan, R., Shamir, O.: Depth separations in neural networks: What is actually being
separated? In: Conference on Learning Theory (COLT), pp. 2664–2666. PMLR (2019)
109. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition
and clustering. CVPR pp. 815–823 (2015)
110. Schwartz, E., Giryes, R., Bronstein, A.M.: Deepisp: Toward learning an end-to-end image
processing pipeline. IEEE Transactions on Image Processing 28(2), 912–923 (2019). DOI
10.1109/TIP.2018.2872858
111. Schwartz, E., Karlinsky, L., Shtok, J., Harary, S., Marder, M., Kumar, A., Feris, R., Giryes,
R., Bronstein, A.: Delta-encoder: an effective sample synthesis method for few-shot object
recognition. In: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.) Advances in Neural Information Processing Systems 31, pp. 2845–2855. Curran
Associates, Inc. (2018)
112. Shaham, T.R., Dekel, T., Michaeli, T.: Singan: Learning a generative model from a single
natural image. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
113. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang,
Y., Skerry-Ryan, R., Saurous, R.A., Agiomyrgiannakis, Y., Wu, Y.: Natural tts synthesis
by conditioning wavenet on mel spectrogram predictions. In: International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783 (2018)
114. Shiloh, L., Eyal, A., Giryes, R.: Efficient processing of distributed acoustic sensing data using
a deep learning approach. J. Lightwave Technol. 37(18), 4755–4762 (2019)
115. Shocher, A., Bagon, S., Isola, P., Irani, M.: Ingan: Capturing and retargeting the "dna" of a
natural image. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
116. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning.
Journal of Big Data 6(1), 60 (2019). DOI 10.1186/s40537-019-0197-0
117. Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain
adaptation. In: 6th International Conference on Learning Representations, ICLR (2018)
118. Singh, S., Krishnan, S.: Filter response normalization layer: Eliminating batch dependence
in the training of deep neural networks. arXiv (2019)
119. Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational
autoencoders. In: Advances in neural information processing systems, pp. 3738–3746 (2016)
120. Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A.C., Bengio, Y.:
Char2wav: End-to-end speech synthesis. In: ICLR (2017)
121. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A
simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
122. Such, F.P., Sah, S., Domínguez, M., Pillai, S., Zhang, C., Michael, A., Cahill, N.D., Ptucha,
R.W.: Robust spatial filtering with graph convolutional neural networks. IEEE Journal of
Selected Topics in Signal Processing 11, 884–896 (2017)
123. Sun, Q., Liu, Y., Chua, T.S., Schiele, B.: Meta-transfer learning for few-shot learning. In: The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
124. Sun, R.: Optimization for deep learning: theory and algorithms. arXiv preprint
arXiv:1912.08957 (2019)
125. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare:
Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1199–1208 (2018)
126. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In:
Advances in Neural Information Processing Systems, pp. 3104–3112. Curran Associates, Inc.
(2014)
127. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT press (2018)
128. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning.
In: V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, I. Maglogiannis (eds.) Artificial
Neural Networks and Machine Learning – ICANN 2018, pp. 270–279. Springer International
Publishing, Cham (2018)
129. Tetko, I.V., Livingstone, D.J., Luik, A.I.: Neural network studies. 1. comparison of overfitting
and overtraining. Journal of chemical information and computer sciences 35(5), 826–833
(1995)
130. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2962–2971
(2017)
131. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u.,
Polosukhin, I.: Attention is all you need. In: I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, R. Garnett (eds.) Advances in Neural Information Processing
Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017). URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
132. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust
features with denoising autoencoders. In: Proceedings of the 25th international conference
on Machine learning, pp. 1096–1103 (2008)
133. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption
generator. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015)
134. Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., Liu, W.: Cosface: Large
margin cosine loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition (2018)
135. Wang, X., Shrivastava, A., Gupta, A.: A-fast-rcnn: Hard positive generation via adversary for
object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2606–2615 (2017)
136. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph
cnn for learning on point clouds. ACM Transactions on Graphics (TOG) (2019)
137. Wang, Y., Yao, Q.: Generalizing from a few examples: A survey on few-shot learning. ArXiv
(2019)
138. Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. In: arxiv (2018)
139. Wu, Y., He, K.: Group normalization. In: The European Conference on Computer Vision
(ECCV) (2018)
140. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep
representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) pp. 1912–1920 (2014)
141. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. ArXiv (2015)
142. Yang, C., Lu, X., Lu, Z., Shechtman, E., Wang, O., Li, H.: High-resolution image inpainting
using multi-scale neural patch synthesis. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6721–6729 (2017)
143. Yang, W., Zhang, X., Tian, Y., Wang, W., Xue, J., Liao, Q.: Deep learning for single image
super-resolution: A brief review. IEEE Transactions on Multimedia 21(12), 3106–3121
(2019). DOI 10.1109/TMM.2019.2919431
144. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: Xlnet: Generalized
autoregressive pretraining for language understanding (2019)
145. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated
convolution. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
146. Zeiler, M.D.: Adadelta: An adaptive learning rate method. ArXiv abs/1212.5701 (2012)
147. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk
minimization. In: International Conference on Learning Representations (2018). URL
https://openreview.net/forum?id=r1Ddp1-Rb
148. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual
learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26(7),
3142–3155 (2017)
149. Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for image restoration with neural
networks. IEEE Transactions on Computational Imaging 3, 47–57 (2017)
150. Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection.
In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), ICCV
’19. IEEE Computer Society (2019)
151. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision
(ICCV) pp. 2242–2251 (2017)

## References <a id="ReferencesSection"></a>


### Introduction to deep learning
<https://paperswithcode.com/paper/introduction-to-deep-learning>

### Introduction to deep learning PDF
<https://arxiv.org/pdf/2003.03253v1.pdf>

### Introduction to deep learning ARXIV
<https://arxiv.org/abs/2003.03253v1>

---

