# Introduction to Deep Learning

&nbsp;

### Andrey Ustyuzhanin<sup>1,2</sup>

&nbsp;

###### ICCUB School, 2016-10, Institute of Cosmos Sciences, Barcelona

&nbsp;

#### <sup>1</sup> Yandex School of Data Analysis,
#### <sup>2</sup> Higher School of Economics
<img src="imgs/YSDA_logo.png" height=20>

## whoami, Yandex

- A Dutch company (according to NASDAQ)
- The leading web search engine in Russia
- Image search
- Speech recognition
- Car traffic prediction
- Mail and spam filtering
- Natural language translation
- Yandex Data Factory - data science for business
- Yandex School of Data Analysis (YSDA)

<!-- img src="http://www.marekrei.com/blog/wp-content/uploads/2016/01/CYh2GMnWkAELDTL.jpg" -->

## whoami, YSDA

- Education:
    - Curricula in Data & Computer Science
    - Free tuition
    - No employment obligations on the part of the students (yet many go to Yandex)
    - 500+ students graduated since 2007
- Research
    - Organizes Machine Learning Conference
    - Interest in interdisciplinary research (eScience)
    - A full member of LHCb and SHiP

## Goals

- understand Deep Learning landscape
- understand basics, terminology
- understand toolboxes/frameworks
- get hands-on experience with some problems
- know where to look for more

## Why deep learning

  - image recognition
  - text recognition
  - voice recognition
  - Go

## Physics examples

- Higgs boson exotic decay (ATLAS, simulated dataset)
- Jet tagging (CMS)
- muon track identification (CRAYFIS)

## Not so long ago, 1965
<img src=https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2015/12/GMDH-network.png>
<small>The architecture of the first known deep network, which was trained by Alexey Grigorevich Ivakhnenko in 1965. The feature selection steps after every layer lead to an ever-narrowing architecture, which terminates when no further improvement can be achieved by adding another layer. Image of Prof. Alexey Ivakhnenko courtesy of Wikipedia.
</small>

## Short history, continued

- Fukushima, 1979: the earliest convolutional networks
- Linnainmaa, 1970: backpropagation first derived in its modern form
- Rumelhart, Hinton, and Williams, 1985: showed that backpropagation in neural networks could yield interesting representations
- LeCun, 1989: convolutional networks + backpropagation = LeNet

In [1]:
from IPython.display import YouTubeVideo
# YouTubeVideo expects the video ID, not the full URL
YouTubeVideo("FwFduRA_L6Q")

## a few more milestones

- Schmidhuber, 1992: pretraining approaches for RNNs
- Hochreiter and Schmidhuber, 1997: LSTM (Long Short-Term Memory)
- Ciresan et al., 2011-2012: won character recognition, traffic sign, and medical imaging competitions with convolutional architectures
- Krizhevsky, Sutskever, and Hinton, 2012: CNN + ReLU + dropout = AlexNet, which won the ImageNet competition
- Google, Facebook, and Microsoft made major acquisitions of deep learning startups in 2012-2014

## References

- https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/

# Basic concepts

## Logistic regression

$p_i = \sigma(\sum_k X_{ik} w_k)$

$\text{llh}=\sum_i \left[ y_i \log{p_i} + (1-y_i)\log{(1 - p_i)} \right]\qquad$  (here $y_i \in \{0, 1\}$)

$\text{loss} = -\text{llh}, \qquad \text{loss} \to \min$
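
The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, the learning rate, and the helper names `sigmoid` and `logistic_loss` are assumptions made for the example. The gradient used in the loop, $X^T (p - y)$, follows from differentiating the loss with respect to $w$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X, y, w):
    """Negative log-likelihood for labels y in {0, 1}."""
    p = sigmoid(X @ w)  # p_i = sigma(sum_k X_ik w_k)
    llh = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -llh

# Toy data, made up for illustration: 4 samples, 2 features.
X = np.array([[0.5, 1.0], [-1.0, 0.3], [2.0, -0.5], [-0.7, -1.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)

# With w = 0 every p_i is 0.5, so the loss is 4 * log(2).
loss_at_zero = logistic_loss(X, y, w)

# Plain gradient descent: d(loss)/dw = X^T (p - y).
learning_rate = 0.1  # assumed value, chosen small enough to be stable here
for _ in range(100):
    p = sigmoid(X @ w)
    w -= learning_rate * X.T @ (p - y)

loss_after = logistic_loss(X, y, w)
print(loss_at_zero, loss_after)
```
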


## Artificial neuron, unit

<img  src="imgs/Perceptron.png"/>

## Activation function

<img  src="imgs/activation.png" width="80%"/>

## Popular activation functions

* **Sigmoid**:

    $f(x) = \frac{1}  {1+e^{-x}}$


* **ReLU (rectified linear unit)**

    In the context of artificial neural networks, the rectifier is an activation function defined as

    $f(x) = \max(0, x)$

    It has been argued to be more biologically plausible than the widely used logistic sigmoid, which is inspired by probability theory (see logistic regression). It is the most popular activation function for deep neural networks.

* **Softplus**
    A smooth approximation to the rectifier is the analytic function

    $f(x) = \ln(1 + e^x)$

    which is called the softplus function.
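
As a rough sketch, the three functions above can be written with NumPy as follows. The function names and the sample input are chosen for this example only. Softplus is computed with `np.logaddexp(0, x)`, which equals $\ln(1 + e^x)$ but avoids overflow for large $x$.

```python
import numpy as np

def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softplus(x):
    """f(x) = ln(1 + exp(x)), computed as log(exp(0) + exp(x))."""
    return np.logaddexp(0.0, x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid :", sigmoid(x))
print("relu    :", relu(x))
print("softplus:", softplus(x))
```

Softplus stays strictly above ReLU and approaches it as $|x|$ grows, which is what makes it a smooth approximation.
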

## Layer

A layer is a building block for NNs:
- “Dense layer”: $f(x) = Wx+b$
- “Nonlinearity layer”: $f(x) = \sigma(x)$
- Input layer, output layer
- A few more that we will cover later


<img  src="imgs/MLP.png" width="80%"/>
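
To make the layer idea concrete, here is a minimal NumPy sketch that stacks dense and nonlinearity layers into a small multilayer perceptron. The layer sizes (3 inputs, 4 hidden units, 1 output), the random initialization, and the helper names are assumptions for illustration; they are not taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Dense layer: f(x) = W x + b."""
    return W @ x + b

def sigmoid(x):
    """Nonlinearity layer, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.2, -1.0, 0.5])           # input layer
hidden = sigmoid(dense(x, W1, b1))       # dense + nonlinearity
output = sigmoid(dense(hidden, W2, b2))  # output layer
print(hidden.shape, output.shape)
```
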

## Backpropagation (aka backprop)

A method for computing the gradient of the error with respect to the weights of a neural network. The gradient tells us how the network's error changes as its weights change.
<img  src="imgs/backprop-step.png"/>

<img  src="imgs/backprop.png"/>
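
A minimal sketch of backpropagation for a tiny two-layer network, assuming a sigmoid hidden layer, a linear output, and a squared-error loss (these choices, the sizes, and the function names are assumptions for this example). The chain rule is applied layer by layer from the output back to the input, and one gradient entry is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)  # hidden activations
    out = W2 @ h         # linear output, shape (1,)
    return h, out

def loss_fn(x, y, W1, W2):
    _, out = forward(x, W1, W2)
    return 0.5 * np.sum((out - y) ** 2)

def backprop(x, y, W1, W2):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    h, out = forward(x, W1, W2)
    d_out = out - y               # dL/d(out)
    grad_W2 = np.outer(d_out, h)  # dL/dW2
    d_h = W2.T @ d_out            # propagate the error to the hidden layer
    d_z = d_h * h * (1.0 - h)     # sigmoid'(z) = h * (1 - h)
    grad_W1 = np.outer(d_z, x)    # dL/dW1
    return grad_W1, grad_W2

x = np.array([0.2, -1.0, 0.5])
y = np.array([1.0])
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

grad_W1, grad_W2 = backprop(x, y, W1, W2)

# Central finite difference on a single weight, W1[0, 0].
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(x, y, W1_plus, W2) - loss_fn(x, y, W1_minus, W2)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```
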

## Optimizers

<img  src="imgs/optimizers.gif" width="80%"/>

<small> Behavior of different methods to accelerate gradient descent on a saddle point. Saddle points are thought to be the main difficulty in optimizing deep networks. Image by Alec Radford.</small>
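
As a rough illustration of the caption, the sketch below compares plain gradient descent with classical momentum on the saddle $f(x, y) = x^2 - y^2$. The test function, starting point, and hyperparameters are assumptions for this example and are not the ones used in the animation. Starting almost on the ridge, both methods shrink $x$, but momentum moves away from the saddle along $y$ faster.

```python
import numpy as np

def grad(p):
    """Gradient of the saddle f(x, y) = x**2 - y**2."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def run_gd(p0, lr=0.01, steps=50):
    """Plain gradient descent."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= lr * grad(p)
    return p

def run_momentum(p0, lr=0.01, beta=0.9, steps=50):
    """Gradient descent with classical (heavy-ball) momentum."""
    p = np.array(p0, dtype=float)
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - lr * grad(p)  # accumulate a velocity
        p += v
    return p

start = (1.0, 1e-3)  # almost on the ridge y = 0
p_gd = run_gd(start)
p_mom = run_momentum(start)
print("gradient descent:", p_gd)
print("momentum        :", p_mom)
```
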

#### Theoretical Motivations for depth

>The depth of neural nets has been studied extensively. It has been shown both mathematically [1] and empirically that convolutional neural networks benefit from depth!

[1] Cohen et al., 2015, "On the Expressive Power of Deep Learning: A Tensor Analysis"