### You can also run the notebook in [COLAB](https://colab.research.google.com/github/deepmipt/DeepPavlov/blob/master/examples/super_convergence_tutorial.ipynb).

!pip3 install deeppavlov

# Super Convergence in DeepPavlov

In [their paper](https://arxiv.org/abs/1708.07120), Leslie N. Smith and Nicholay Topin introduced a phenomenon called "super-convergence", where
  * <font color='green'>neural networks can be trained</font> an order of magnitude <font color='green'>faster</font> than with standard training methods,
  * there is <font color='green'>a greater boost in performance</font> relative to standard training <font color='green'>when the amount of labeled training data is limited</font>.

### Tutorial Plan:

0. [What is Super Convergence?](#0.-What-is-Super-Convergence?)
1. [DeepPavlov learning rate schedules](#1.-Learning-rate-schedules)
     * [LRScheduledTFModel](#LRScheduledTFModel) [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/lr_scheduled_tf_model.py)
     * [DecayType.NO](#DecayType.NO)
     * [DecayType.LINEAR](#DecayType.LINEAR)
     * [DecayType.COSINE](#DecayType.COSINE)
     * [DecayType.EXPONENTIAL](#DecayType.EXPONENTIAL)
     * [DecayType.POLYNOMIAL](#DecayType.POLYNOMIAL)
     * [DecayType.ONECYCLE](#DecayType.ONECYCLE)
     * [DecayType.TRAPEZOID](#DecayType.TRAPEZOID)
     

2. [DeepPavlov learning rate search](#2.-Optimal-learning-rate-search)
3. [DeepPavlov Super Convergence](#3.-Super-Convergence)

### Useful materials
   * Original Super Convergence Paper ["Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" by Leslie N. Smith, Nicholay Topin](https://arxiv.org/abs/1708.07120)
   * Post by Sylvain Gugger on ["How do you find a good learning rate"](https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html)
   * [1cycle policy overview](https://sgugger.github.io/the-1cycle-policy.html#the-1cycle-policy)
   * Post by fast.ai with results on CIFAR10, ["Training Imagenet in 3 hours for 25dollars; and CIFAR10 for 0.26dollars"](https://www.fast.ai/2018/04/30/dawnbench-fastai/)

### 0. What is Super Convergence?

Put simply, super-convergence is a method that helps train complex neural models faster.

For example, it lets you train a ResNet-56 on CIFAR-10 to the same or better accuracy than the authors of the original paper reported, but in far fewer iterations.

By training with high learning rates, you can reach a model with 93% accuracy in 70 epochs, which is fewer than 7k iterations (as opposed to the 64k iterations, roughly 360 epochs, in the original paper).

![cs_loss_comparison.png](img/sc_loss_comparison.png)

One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. A primary insight that allows super-convergence training is that large learning rates regularize the training, hence requiring a reduction of all other forms of regularization in order to preserve an optimal regularization balance.

Experiments demonstrate super-convergence for Cifar-10/100, MNIST and Imagenet datasets, and resnet, wide-resnet, densenet, and inception architectures.

### 1. Learning rate schedules

#### LRScheduledTFModel

`class LRScheduledTFModel` [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/lr_scheduled_tf_model.py):
  * initializes optimizer
  * updates learning rate and momentum according to a schedule configured in config
  * can search for an optimal learning rate
  
That means your model doesn't need to handle learning rate and momentum placeholders or initialize the optimizer. Just inherit your class from `LRScheduledTFModel`:

```python
from deeppavlov.core.models.lr_scheduled_tf_model import LRScheduledTFModel

class MyModel(LRScheduledTFModel):
    ...  # define your model here; learning rate, momentum and optimizer handling are inherited
```

Examples of models built on `LRScheduledTFModel`:
   * Goal-Oriented Bot [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/go_bot/network.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/go_bot)
   * Named Entity recognizer [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/ner/network.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/ner)
   * SQUAD model [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/models/squad/squad.py) [[configs]](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/squad)

#### LRScheduledKerasModel 

`class LRScheduledKerasModel` [[source]](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/models/keras_model.py):
  * updates learning rate and momentum according to a schedule configured in config
  * can search for an optimal learning rate
  
That means your model doesn't need to handle learning rate and momentum placeholders; it only needs to initialize the optimizer and compile the model. Just inherit your class from `LRScheduledKerasModel`:

```python
from deeppavlov.core.models.keras_model import LRScheduledKerasModel

class MyModel(LRScheduledKerasModel):
    ...  # define your model here; you still initialize the optimizer and compile the model yourself
```


#### Optimizer

You can set the optimizer in the config:

```json
{
    "class_name": "my_model",
    ...
    "optimizer": "tf.train:AdadeltaOptimizer"
}
```

If `optimizer` is not specified, `tf.train:AdamOptimizer` is used.

#### DecayType.NO

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": 0.1,
    "learning_rate_decay": "no"
}
```

or just

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": 0.1
}
```

corresponds to the following learning rate update schedule:

![cs_ner_lr_no.png](img/sc_ner_lr_no.png)

#### DecayType.LINEAR

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.01, 0.1],
    "learning_rate_decay": "linear",
    "learning_rate_decay_batches": 800
}
```
corresponds to:

![cs_ner_lr_linear.png](img/sc_ner_lr_linear.png)

Or reverse the `learning_rate` values to go from a larger learning rate to a smaller one:

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": "linear",
    "learning_rate_decay_batches": 800
}
```

![cs_ner_lr_linear2.png](img/sc_ner_lr_linear2.png)
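
The linear schedule can be sketched as plain interpolation between the two `learning_rate` values. This is a minimal illustration, not DeepPavlov's code: `linear_schedule` is a hypothetical helper, and it assumes the value stays at the final rate once `learning_rate_decay_batches` updates have passed.

```python
def linear_schedule(start, end, num_batches, step):
    """Linearly interpolate from `start` to `end` over `num_batches` updates."""
    # Clamp the progress so the value stays at `end` after the schedule finishes
    # (an assumption of this sketch).
    pct = min(step, num_batches) / num_batches
    return start + pct * (end - start)


# Increasing schedule, as with "learning_rate": [0.01, 0.1]
for step in (0, 400, 800, 1000):
    print(step, round(linear_schedule(0.01, 0.1, 800, step), 4))
```

Swapping the first two arguments gives the decreasing variant.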

#### DecayType.COSINE

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": "cosine",
    "learning_rate_decay_batches": 800
}
```

![cs_ner_lr_cosine.png](img/sc_ner_lr_cosine.png)
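
A cosine schedule of this kind is commonly written as half a cosine wave stretched between the two rates. The sketch below assumes that standard form; `cosine_schedule` is a hypothetical helper and may differ in detail from the library's implementation.

```python
import math


def cosine_schedule(start, end, num_batches, step):
    """Follow half a cosine wave from `start` to `end` over `num_batches` updates."""
    pct = min(step, num_batches) / num_batches
    # `weight` falls smoothly from 1 (at step 0) to 0 (at the last step).
    weight = (1 + math.cos(math.pi * pct)) / 2
    return end + (start - end) * weight


for step in (0, 200, 400, 600, 800):
    print(step, round(cosine_schedule(0.1, 0.01, 800, step), 4))
```

Compared with the linear schedule, the rate changes slowly near both ends and fastest in the middle.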

#### DecayType.EXPONENTIAL

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": "exponential",
    "learning_rate_decay_batches": 800
}
```
corresponds to:

![cs_ner_lr_exponential.png](img/sc_ner_lr_exponential.png)
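
One way to picture exponential decay between two rates is as a constant multiplicative factor per batch. The following sketch assumes that interpretation; `exponential_schedule` is a hypothetical name, not part of DeepPavlov.

```python
def exponential_schedule(start, end, num_batches, step):
    """Interpolate geometrically: the rate is multiplied by a fixed factor each batch."""
    pct = min(step, num_batches) / num_batches
    return start * (end / start) ** pct


for step in (0, 400, 800):
    print(step, round(exponential_schedule(0.1, 0.01, 800, step), 5))
```

Halfway through, the value is the geometric mean of the two rates rather than their arithmetic mean, so the early drop is steeper than in the linear case.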

#### DecayType.POLYNOMIAL

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": ["polynomial", 1.0],
    "learning_rate_decay_batches": 800
}
```
corresponds to:

![cs_ner_lr_polynomial.png](img/sc_ner_lr_polynomial.png)

Polynomial decay has a "decay power" parameter (which was equal to `1.0` in the previous example).

Let's try a decay power of `0.1`:

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": ["polynomial", 0.1],
    "learning_rate_decay_batches": 800
}
```

![cs_ner_lr_polynomial1.png](img/sc_ner_lr_polynomial1.png)

And a decay power of `10`:

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.1, 0.01],
    "learning_rate_decay": ["polynomial", 10],
    "learning_rate_decay_batches": 800
}
```

![cs_ner_lr_polynomial2.png](img/sc_ner_lr_polynomial2.png)
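
The effect of the decay power can be sketched with the usual polynomial decay formula, `end + (start - end) * (1 - t) ** power`. This is an illustration under that assumption; `polynomial_schedule` is a hypothetical helper, and the library's exact formula may differ.

```python
def polynomial_schedule(start, end, num_batches, step, power=1.0):
    """Polynomial decay from `start` to `end`; `power` controls the curve's shape."""
    pct = min(step, num_batches) / num_batches
    return end + (start - end) * (1 - pct) ** power


# Compare the three decay powers used above at the midpoint of the schedule.
for power in (1.0, 0.1, 10):
    midpoint = polynomial_schedule(0.1, 0.01, 800, 400, power)
    print(power, round(midpoint, 4))
```

Under this formula, a power of `1.0` is identical to linear decay, a power below 1 keeps the rate close to its starting value and drops late, and a power above 1 drops early and then flattens out.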

#### DecayType.ONECYCLE

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.01, 0.1],
    "learning_rate_decay": "onecycle",
    "learning_rate_decay_batches": 800
}
```
corresponds to:

![cs_ner_lr_onecycle.png](img/sc_ner_lr_onecycle.png)
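
The core of a one-cycle schedule is a single rise to the maximum rate followed by a fall back to the minimum. The sketch below assumes a symmetric triangle over the whole schedule; `onecycle_schedule` is a hypothetical helper, and the phase lengths or any final annealing phase used by DeepPavlov are not reproduced here.

```python
def onecycle_schedule(low, high, num_batches, step):
    """Rise linearly from `low` to `high` in the first half, then fall back to `low`."""
    step = min(step, num_batches)
    half = num_batches / 2
    if step <= half:
        return low + (high - low) * step / half
    return high - (high - low) * (step - half) / half


for step in (0, 200, 400, 600, 800):
    print(step, round(onecycle_schedule(0.01, 0.1, 800, step), 4))
```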

#### DecayType.TRAPEZOID

```json
{
    "class_name": "my_model",
    ...
    "learning_rate": [0.01, 0.1],
    "learning_rate_decay": "trapezoid",
    "learning_rate_decay_batches": 800
}
```
corresponds to:

![cs_ner_lr_trapezoid.png](img/sc_ner_lr_trapezoid.png)
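
A trapezoid schedule adds a plateau at the maximum rate between the warm-up and the decay. In the sketch below, the phase fractions `warmup_frac` and `decay_frac` are hypothetical parameters chosen for illustration; the config above does not expose them, and the library's actual proportions may differ.

```python
def trapezoid_schedule(low, high, num_batches, step,
                       warmup_frac=0.1, decay_frac=0.25):
    """Warm up from `low` to `high`, hold `high`, then decay linearly back to `low`."""
    step = min(step, num_batches)
    warmup_end = warmup_frac * num_batches         # assumed warm-up length
    decay_start = (1 - decay_frac) * num_batches   # assumed start of the decay
    if step < warmup_end:
        return low + (high - low) * step / warmup_end
    if step <= decay_start:
        return high
    return high - (high - low) * (step - decay_start) / (num_batches - decay_start)


for step in (0, 40, 80, 300, 600, 700, 800):
    print(step, round(trapezoid_schedule(0.01, 0.1, 800, step), 4))
```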

### 2. Optimal learning rate search

You can also tune the learning rate on data before training.

Add `fit_on` and `fit_batch_size` to your component along with the desired `learning_rate_decay` (and `learning_rate_decay_batches`), and the `learning_rate` parameter will be set automatically.

For example,

```json
{
    "class_name": "my_model",
    ...
    "learning_rate_decay": "trapezoid",
    "learning_rate_decay_batches": 800,
    
    "fit_batch_size": 16,
    "fit_on": ["x0", "x1", "x2", "y"]
}
```

will find an optimal `learning_rate` for your trapezoid update schedule.

`DecayType.NO`, `DecayType.LINEAR`, `DecayType.POLYNOMIAL`, `DecayType.EXPONENTIAL`, `DecayType.ONECYCLE`, `DecayType.TRAPEZOID` are all supported in learning rate search mode.
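
A common way to search for a learning rate is the range test described in the posts under "Useful materials": sweep the rate exponentially from very small to very large, record the loss at each step, stop when the loss blows up, and pick a rate somewhat below the one with the lowest loss. The toy sketch below applies that idea to a one-parameter quadratic. Whether DeepPavlov's search follows exactly these steps is an assumption; `lr_range_test`, `quadratic` and the "divide by 10" rule of thumb are illustrative choices, not library API.

```python
def lr_range_test(loss_and_grad, w, lr_min=1e-4, lr_max=10.0,
                  num_steps=100, divergence_factor=4.0):
    """Sweep the learning rate exponentially and record (lr, loss) until divergence."""
    history = []
    best_loss = float("inf")
    for i in range(num_steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (num_steps - 1))
        loss, grad = loss_and_grad(w)
        if loss > divergence_factor * best_loss:
            break  # the loss has exploded, so larger rates are not worth testing
        best_loss = min(best_loss, loss)
        history.append((lr, loss))
        w = w - lr * grad  # plain gradient descent step with the current rate
    return history


def quadratic(w):
    """Toy objective: loss = 2 * w**2, gradient = 4 * w."""
    return 2.0 * w * w, 4.0 * w


history = lr_range_test(quadratic, w=1.0)
best_lr, _ = min(history, key=lambda pair: pair[1])
suggested_lr = best_lr / 10  # rule of thumb: stay well below the divergence point
print(f"lowest loss at lr={best_lr:.3f}, suggested lr={suggested_lr:.4f}")
```

On real models the recorded losses are noisy, so the loss curve is usually smoothed before picking a value.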

### 3. Super Convergence

Super Convergence is then equivalent to the following config parameters:

```json
{
    "class_name": "my_model",
    ...
    "learning_rate_decay": "onecycle",
    "learning_rate_decay_batches": 1000, #hyperparameter
    
    "fit_batch_size": 16, #hyperparameter
    "fit_on": ["x0", "x1", "x2", "y"],
    
    "momentum": [0.95, 0.85],
    "momentum_decay": "onecycle",
    "momentum_decay_batches": 1000 #hyperparameter
}
```
for any optimizer. This results in learning rate and momentum update schedules similar to the following:

![cs_ner_lr_sc.png](img/sc_ner_lr_sc.png)
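
The interplay between the two schedules can be sketched as two triangles running in opposite directions: momentum starts at `0.95`, dips to `0.85` at mid-cycle, and returns, while the learning rate rises and falls. The momentum bounds come from the config above; the learning rate bounds `0.01` and `0.1` are placeholders (in the config they are found by the search), and `triangle` is a hypothetical helper rather than DeepPavlov code.

```python
def triangle(first, peak, num_batches, step):
    """Move linearly from `first` to `peak` at mid-cycle, then back to `first`."""
    half = num_batches / 2
    # Distance from the middle of the cycle: 1 at both ends, 0 at the midpoint.
    distance = abs(min(step, num_batches) - half) / half
    return peak + (first - peak) * distance


for step in (0, 250, 500, 750, 1000):
    lr = triangle(0.01, 0.1, 1000, step)         # placeholder learning rate bounds
    momentum = triangle(0.95, 0.85, 1000, step)  # "momentum": [0.95, 0.85]
    print(step, round(lr, 4), round(momentum, 4))
```

Momentum is lowest exactly when the learning rate is highest, so the two move in opposite directions throughout the cycle.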

For `tf.train:AdamOptimizer`, it is recommended to use the trapezoid update schedule:

```json
{
    "class_name": "my_model",
    ...
    "optimizer": "tf.train:AdamOptimizer",
    "learning_rate_decay": "trapezoid",
    "learning_rate_decay_batches": 1000, #hyperparameter
    
    "fit_batch_size": 16, #hyperparameter
    "fit_on": ["x0", "x1", "x2", "y"],
    
    "momentum": [0.95, 0.85],
    "momentum_decay": "trapezoid",
    "momentum_decay_batches": 1000 #hyperparameter
}
```

![cs_ner_lr_sc1.png](img/sc_ner_lr_sc1.png)