In [None]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": None,
        "transition": None,
        "start_slideshow_at": "selected",
        "leap_motion": {
            "naturalSwipe"  : True,     # Invert swipe gestures
            "pointerOpacity": 0.5,      # Set pointer opacity to 0.5
            "pointerColor"  : "#d80000" # Red pointer
        },
        "header": "<h3>Francisco Perez-Sorrosal</h3>",
        "footer": "<h3>Machine Learning/Deep Learning</h3>",
        "scroll": True,
        "enable_chalkboard": True
     }
)

In [None]:
%pip install --upgrade emoji

In [None]:
import emoji
print(emoji.emojize('Presenting stuff is easy!!! :thumbs_up:'))

In [None]:
# Emojis http://getemoji.com/

# Brain-inspired replay for continual learning with artificial neural networks
[Nature Communications (13 August 2020)](https://www.nature.com/articles/s41467-020-17866-2)
## Gido M. van de Ven & Hava T. Siegelmann & Andreas S. Tolias


## With Excerpts from:

### "Three Scenarios for Continual Learning" [https://arxiv.org/pdf/1904.07734.pdf](https://arxiv.org/pdf/1904.07734.pdf)
#### Gido M. van de Ven & Andreas S. Tolias
### "Generative Replay with Feedback Connections as a General Strategy for Continual Learning"  [https://arxiv.org/pdf/1809.10635v2.pdf](https://arxiv.org/pdf/1809.10635v2.pdf)
#### Gido M. van de Ven & Andreas S. Tolias





---

Francisco Perez-Sorrosal | 14 Nov 2020


# Relationships Among the Papers

1. Generative Replay with Feedback Connections as a General Strategy for Continual Learning
2. Three Scenarios for Continual Learning
3. Brain-inspired replay for continual learning with artificial neural networks


### Relationships:

* 1 : Describes a new *scalable generative replay* method to avoid catastrophic forgetting in lifelong learning
* 1 -> 2 : As part of that work, the authors propose a new framework for **fair evaluation of catastrophic forgetting** consisting of 3 different scenarios for Incremental Learning
* 1 + 2 + extensions -> 3  : Extends the *scalable generative replay* from 1. with additional brain-inspired techniques taken from neuroscience, and applies it to the scenarios described in 2.


# Context:

- Catastrophic Forgetting: The papers cite the psychology-based work of McCloskey and Ratcliff (around 1990) on the problems the connectionist paradigm has in avoiding "memory loss"

    "New learning may interfere catastrophically with old learning when networks are trained sequentially. The analysis of the causes of interference implies that at least some interference will occur whenever new learning may alter weights involved in representing old learning, and the simulation results demonstrate only that interference is catastrophic in some specific networks."

    -- McCloskey et al.

## Main contributions:
Across the three papers, the main contributions are:

💡 Make the comparison of continual learning scenarios easier and more rigorous
  - Identifies 3 distinct scenarios depending on whether or not the task identity is provided at test time (and, if it is not, whether it must be inferred).

💡 Compare recently proposed methods on continual learning

💡 Propose a new fast, scalable and competitive (performance-wise) generative replay approach that behaves well in the 3 scenarios proposed

💡 Extend the new generative replay approach proposed with recent techniques extracted from neuroscience applied to the structure of the ANN


# Continual Learning Scenarios

💡 **KEY**: *WHETHER OR NOT* the model is required to infer the identity of the task it has to solve at test time

  - e.g. N. Masse et al.'s "Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization" reports a great improvement over SotA, but it assumes task identity is always available.


##  Scenarios:
 
 1. Models are **always** informed about which task they are solving (task-incremental learning, Task-IL)
 
 
 2. Models **do not know** the task identity at test time (domain-incremental learning, Domain-IL)
   - However, the model does not need to infer the task, only solve it
 
 
 3. Models need to both **solve each task seen and infer which task** they are currently being evaluated on (class-incremental learning, Class-IL)
 
 ![3Scenarios for CL](images/3scenarios.png)
    


# [Three Scenarios for Continual Learning (pdf)](https://arxiv.org/pdf/1904.07734.pdf)

## Gido M. van de Ven & Andreas S. Tolias

### NeurIPS 2019


Extended description of the scenarios addressing the following:

💡 The scenarios for Continual Learning Analysis

💡 Strategies for Continual Learning

💡 Experiments based on the two points above

# Scenarios

## Task-IL

- The easiest continual learning scenario
- Models are always informed about which task needs to be performed
- It's possible to train models with task-specific components. 
- Typical network architecture in this scenario has a “multi-headed” output layer

## Domain-IL

- Task identity is not available at test time
- BUT models only need to **solve the task**, not infer which task it is
- Task structure is always the same, **BUT** the input-distribution changes
- Example: agents that have to survive in different environments without previous knowledge of the environment itself

## Class-IL

- Models have to solve each task seen so far AND infer which task they are currently presented with
- The name refers to the fact that the model has to incrementally learn to identify new classes of objects, as infants do in the real world.
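
One way to make the difference between the three scenarios concrete is to look at which output units a model is allowed to choose from at test time. The sketch below is only an illustration under stated assumptions: the function names `active_units` and `predict`, and the layout in which task `t` owns the units `[t * classes_per_task, (t + 1) * classes_per_task)`, are hypothetical and not taken from the papers' code.

```python
def active_units(scenario, classes_per_task, n_tasks_seen, task_id=None):
    """Return the indices of the output units a model may choose from at test time."""
    if scenario == "task":
        # Task identity is given: restrict the choice to that task's own "head".
        if task_id is None:
            raise ValueError("Task-IL requires the task identity")
        start = task_id * classes_per_task
        return list(range(start, start + classes_per_task))
    if scenario == "domain":
        # Task identity is unknown but not needed: every task shares the same units.
        return list(range(classes_per_task))
    if scenario == "class":
        # Task identity is unknown and must be inferred: choose among all classes seen so far.
        return list(range(classes_per_task * n_tasks_seen))
    raise ValueError(f"unknown scenario: {scenario}")


def predict(logits, units):
    """Pick the highest-scoring unit among the active ones."""
    return max(units, key=lambda i: logits[i])


# Three tasks with two classes each, so six output units in total.
logits = [0.1, 0.3, 2.0, 0.5, 0.2, 0.9]
print(predict(logits, active_units("task", 2, 3, task_id=2)))  # only units 4 and 5 compete
print(predict(logits, active_units("class", 2, 3)))            # all six units compete
```

With the same logits, the prediction changes only because the set of competing units changes, which is why the Class-IL scenario is the hardest of the three.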

# Single Headed vs Multi-Headed Methods

Classical view of the continual learning:

💡 A multi-headed layout requires task identity to be known (Task-IL)

💡 A single-headed layout **DOES NOT** require the task identity to be known (Domain-IL/Class-IL)

This distinction is made **based on the architectural layout of a network’s output layer**. 

- **BUT** although using a separate output layer for each task is the most common way to use task identity information, **it is not the only way**

- Although a single-headed layout might by itself not require task identity to be known, it is still possible for the model to use task identity in other ways (See "Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization" by N. Masse et al.)

## Major differences/advantages of the 3 scenarios proposal vs the Single/Multi Headed classical view...

1. The scenarios proposed in the paper reflect more generally the conditions under which a model is evaluated.

2. The scenarios extend upon the multi-headed vs single-headed split by recognizing that:
    - when task identity is not provided, there is a further distinction depending on whether the network is explicitly required to infer task identity
    - the two scenarios resulting from this additional split substantially differ in difficulty



# Example Task Protocols

💡 Goal is twofold: 

1. Show that any task protocol can be performed according to each scenario
  -  Exercised through two different task protocols for all three scenarios
2. Demonstrate the difference between the three continual learning scenarios

## Protocols

### Sequentially learning to classify MNIST-digits


![Split MNIST](images/splitmnist.png)



* Demonstrates:



1. The Task-IL scenario
  - It is sometimes referred to as ‘multi-headed split MNIST’
  
2. The Class-IL scenario
  - It is referred to as ‘single-headed split MNIST’
  
3. It could also be performed under the Domain-IL scenario


![Split MNIST](images/splitmnistscenarios.png)
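
As a rough illustration of how the same split MNIST protocol reads under each scenario, the sketch below maps a digit to what the model is told and what it must predict. It assumes the usual split into five two-digit tasks, (0, 1), (2, 3), (4, 5), (6, 7), (8, 9); the function name `split_mnist_target` is hypothetical.

```python
def split_mnist_target(digit, scenario):
    """Map an MNIST digit to (task_id_given_to_model, target) under each scenario.

    Assumes five tasks of two consecutive digits: (0,1), (2,3), (4,5), (6,7), (8,9).
    """
    # position: 0 = first class of the task, 1 = second class of the task
    task_id, position = divmod(digit, 2)
    if scenario == "task":
        return task_id, position   # task is given; choose between its two classes
    if scenario == "domain":
        return None, position      # task is hidden; still only "first or second class"
    if scenario == "class":
        return None, digit         # task is hidden; choose among all ten digits
    raise ValueError(f"unknown scenario: {scenario}")


for scenario in ("task", "domain", "class"):
    print(scenario, split_mnist_target(7, scenario))
```

For the digit 7, Task-IL gives the model task 3 and asks for "second class", Domain-IL asks for "second class" without naming the task, and Class-IL asks for the digit itself.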
 

### Permuted MNIST

- Each task involves classifying all ten MNIST-digits but with a different permutation applied to the pixels for every new task (Figure 2). 

![Permuted MNIST](images/permutedmnist.png)


* Demonstrates:


1. Most naturally, the Domain-IL scenario
2. But it can also be performed according to the other two scenarios

![Permuted MNIST](images/permutedmnistscenarios.png)
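
A minimal sketch of how permuted MNIST tasks could be built: one fixed random pixel permutation per task, applied to every flattened image of that task. The helper names and the default of 784 pixels (28 x 28) are assumptions for illustration; some implementations leave the first task unpermuted, which this sketch does not do.

```python
import random


def make_permutations(n_tasks, n_pixels=784, seed=0):
    """Create one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(n_tasks):
        perm = list(range(n_pixels))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations


def permute_image(flat_image, perm):
    """Reorder the pixels of a flattened image according to `perm`."""
    return [flat_image[i] for i in perm]


# Toy example with 16-"pixel" images instead of real MNIST digits.
perms = make_permutations(n_tasks=3, n_pixels=16, seed=1)
image = list(range(16))
print(permute_image(image, perms[0]))
```

The labels are untouched, so every task is still ten-way digit classification; only the input distribution changes, which is why this protocol fits Domain-IL so naturally.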

# Task Boundaries


## Important Assumption: 

**In training, there are clear and well-defined boundaries between the tasks to be learned**

* **Objective of well-defined boundaries?**
  - Make the continual learning process more structured
  - Without that structure, the scenarios described become blurry

## Implications:

* Without such boundaries, **training with randomly-sampled minibatches** and **multiple passes over each task’s training data**, among other things, are no longer possible. 
* For continual learning without well-defined task boundaries, see **Task agnostic continual learning using online variational bayes** by *C. Zeno et al.*

# Methods Compared

## [XdG](https://www.pnas.org/content/115/44/E10467)

- Inherited from the concept of context-dependent gating in neuroscience
- Task switching disinhibits nonoverlapping sets of sparse dendritic branches
    - Intersection of changes ~ 0: minimal interference of the synaptic changes for one task with the synaptic changes that occurred for previous tasks
- A simplified version of XdG:
    - The algorithm sends an additional signal, unique for each task, which is projected onto all hidden neurons
    - For each task, X% of the units in each hidden layer are fully gated (i.e., their activations are set to zero)
        - X is treated as a hyperparameter (set by grid search)
    - In summary: **For each task, a random subset of the nodes in the ANN is assigned a priori to be involved in that task**
- Small computational impact
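
The gating idea can be sketched in a few lines: for every task, draw a fixed random 0/1 mask per hidden layer and multiply the layer's activations by it. This is a toy illustration, not the authors' implementation; the names `make_gates` and `apply_gate` and the use of plain Python lists are assumptions.

```python
import random


def make_gates(n_tasks, layer_sizes, gated_fraction, seed=0):
    """For each task and hidden layer, draw a fixed 0/1 mask; 0 means the unit is gated."""
    rng = random.Random(seed)
    gates = []
    for _ in range(n_tasks):
        task_gates = []
        for size in layer_sizes:
            n_gated = round(gated_fraction * size)
            gated = set(rng.sample(range(size), n_gated))
            task_gates.append([0.0 if i in gated else 1.0 for i in range(size)])
        gates.append(task_gates)
    return gates


def apply_gate(activations, mask):
    """Zero the activations of gated units and leave the others unchanged."""
    return [a * m for a, m in zip(activations, mask)]


# Two tasks, two hidden layers of 10 and 20 units, 80% of the units gated per task.
gates = make_gates(n_tasks=2, layer_sizes=[10, 20], gated_fraction=0.8)
hidden = [1.0] * 10
print(apply_gate(hidden, gates[0][0]))  # only 2 of the 10 units stay active for task 0
```

Because the mask is selected by task identity, this kind of gating can only be used when the task is known at test time.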


# Methods Compared: Regularization-based

- Add a regularization term to the loss
- The impact of regularization is controlled by a hyperparameter $\lambda$
- $L_{total} = L_{current} + \lambda \cdot L_{regularization}$

## [EWC, Elastic Weight Consolidation](https://arxiv.org/abs/1612.00796)

- Main difference with XdG:
    - A different part of the network is trained for each task, but the entire network is always used for execution
- HOW? 
    - Regularize the ANN params while training each new task
    - For every param in the ANN, it is estimated how important it is for the previously learned tasks
        - Depending on this importance, future changes to the param are penalized
        - This is equivalent to slowing down learning in the parts of the network that are supposedly "remembering" the previous tasks
- Suitable for reinforcement learning scenarios
- Favours more the initial tasks
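
A minimal sketch of the EWC-style quadratic penalty, assuming the parameters, the parameters stored after the previous task, and the per-parameter importance (in EWC, the diagonal of the Fisher information matrix) are given as flat lists of floats. The function names are hypothetical, and estimating the importance values themselves is left out.

```python
def ewc_penalty(params, old_params, fisher):
    """Quadratic penalty: 0.5 * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * sum(f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher))


def total_loss(current_loss, params, old_params, fisher, lam):
    """L_total = L_current + lambda * L_regularization."""
    return current_loss + lam * ewc_penalty(params, old_params, fisher)


old = [0.0, 2.0]
importance = [4.0, 10.0]
# Moving the first (less important) parameter by 1.0 costs 0.5 * 4.0 * 1.0 = 2.0 ...
print(ewc_penalty([1.0, 2.0], old, importance))
# ... while the same move on the second (more important) parameter costs 5.0.
print(ewc_penalty([0.0, 3.0], old, importance))
```

The larger the importance of a parameter, the more expensive it is to move it away from its old value, which is how learning gets slowed down in the parts of the network that matter for earlier tasks.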

## [online EWC](https://arxiv.org/pdf/1805.06370.pdf)

- **EWC criticism** from [this paper](https://arxiv.org/abs/1712.03847): 
    - For two tasks, EWC is ~ a diagonalized Laplace approximation, with a new hyperparameter $\lambda_A$ that tries to express the importance of the earlier task
    - Basically, the criticism states that, when more than two tasks are considered, the quadratic penalties in EWC are inconsistent with this derivation (the number of penalty terms grows linearly with the number of tasks) and might lead to double-counting data from earlier tasks.

- Online EWC is a modification of EWC focused on scalability
- Tends to favour the most recent past
- Tries to tame the growth of the computational cost of the regularization
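
The scalability idea can be sketched as keeping a single running importance estimate instead of one penalty per past task. The update below follows my reading of the Progress & Compress formulation, where the accumulated importance is decayed by a factor `gamma` (at most 1) before the estimate for the task just finished is added. The function name and the list-based representation are assumptions.

```python
def update_running_fisher(running_fisher, new_fisher, gamma):
    """Decay the accumulated importance and add the estimate from the task just finished."""
    return [gamma * old + new for old, new in zip(running_fisher, new_fisher)]


running = [1.0, 2.0]      # importance accumulated over all earlier tasks
latest = [0.5, 0.5]       # importance estimated on the task just finished
running = update_running_fisher(running, latest, gamma=0.5)
print(running)            # memory use stays constant no matter how many tasks are seen
```

With `gamma` below 1, the contribution of older tasks fades over time, which matches the observation that online EWC tends to favour the most recent past.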


## [SI, Synaptic Intelligence](https://arxiv.org/abs/1703.04200)

- **NOT DIRECTLY related to biological mechanisms, just observations:** "While we make no claim that biological synapses behave like the intelligent synapses of our model, a wealth of experimental data in neurobiology suggests that biological synapses act in much more complex ways than the artificial scalar synapses that dominate current machine learning models. *In essence, whether synaptic changes occur, and whether they are made permanent, or left to ultimately decay, can be controlled by many different biological factors.*"

- Similar to EWC in the sense that part of the ANN is trained per task, but the full ANN is used for execution
    - Paper states: "The regularization penalty is similar to EWC as recently introduced by Kirkpatrick et al. (2017)."
- In contrast to EWC, SI computes the per-synapse consolidation strength:
    - Online (similar to what is proposed in online EWC)
    - Over the entire learning trajectory in parameter space
- Conjectures that individual synapses do not correspond simply to single scalar synaptic weights, and implements them to behave as higher-dimensional dynamical systems
- The high-dimensional state of these SI synapses makes it possible to:
    - Accumulate task-relevant information more efficiently during training
    - Retain a memory of previous parameter values
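
A rough sketch of the bookkeeping that gives each SI synapse its extra state: a running path integral accumulated at every training step, a consolidated importance updated at the end of each task, and a reference value of the parameter. It follows my understanding of the SI paper's formulas; the function names, the flat-list representation, and the default damping value `xi=0.1` are illustrative assumptions.

```python
def accumulate_path_integral(omega, grads, deltas):
    """After each optimizer step: omega_k += -g_k * delta_theta_k."""
    return [w - g * d for w, g, d in zip(omega, grads, deltas)]


def consolidate(importance, omega, params_start, params_end, xi=0.1):
    """At the end of a task: Omega_k += omega_k / ((total change in theta_k)^2 + xi)."""
    return [
        big + w / ((end - start) ** 2 + xi)
        for big, w, start, end in zip(importance, omega, params_start, params_end)
    ]


def si_penalty(params, params_ref, importance):
    """Surrogate loss term: sum_k Omega_k * (theta_k - theta_ref_k)^2."""
    return sum(o * (p - r) ** 2 for p, r, o in zip(params, params_ref, importance))


# One parameter, one training step: gradient -2.0, parameter moved by +0.5.
omega = accumulate_path_integral([0.0], grads=[-2.0], deltas=[0.5])
importance = consolidate([0.0], omega, params_start=[0.0], params_end=[0.5], xi=0.75)
print(omega, importance, si_penalty([2.0], [0.0], importance))
```

Parameters whose updates contributed most to lowering the loss along the training trajectory end up with the largest importance, and are therefore the most protected while later tasks are learned.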



# Methods Compared: Replay-based methods

- Add a loss-term for the replayed data. 
- A hyperparameter could be avoided
    - The loss for current/replayed data can be weighted according to the # of tasks the model has been trained on so far
    - $L_{total} = \frac{1}{N_{\text{tasks so far}}} \cdot L_{current} + \left(1 - \frac{1}{N_{\text{tasks so far}}}\right) \cdot L_{replay}$
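
The weighting above is simple enough to write down directly. The function name `replay_weighted_loss` is hypothetical; the formula is the one given in the last bullet.

```python
def replay_weighted_loss(loss_current, loss_replay, n_tasks_so_far):
    """Weight the current-task loss by 1/N and the replay loss by 1 - 1/N."""
    if n_tasks_so_far < 1:
        raise ValueError("n_tasks_so_far must be at least 1")
    w = 1.0 / n_tasks_so_far
    return w * loss_current + (1.0 - w) * loss_replay


print(replay_weighted_loss(2.0, 5.0, 1))  # first task: nothing to replay, only the current loss counts
print(replay_weighted_loss(2.0, 4.0, 2))  # second task: current and replayed data weigh the same
print(replay_weighted_loss(4.0, 8.0, 4))  # fourth task: replay stands for three of the four tasks
```

As more tasks accumulate, the replayed data represents a growing share of everything learned so far, so its weight grows accordingly without any extra hyperparameter.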
    

## [LwF, Learning Without Forgetting](https://arxiv.org/abs/1606.09282)

- Considered by the authors a replay-based method
- Train a model M1 for task A with labeled data -> Label the input data for task B with M1 -> Use the resulting input-target pairs as replayed pseudo-data while training on task B.
- Inputs to be replayed labeled with “hard targets” + “soft targets”
    - “hard targets” - the most likely category according to the previous tasks’ model
    - “soft targets” - previous tasks’ model predicted probabilities for all target classes
- Goal for the replayed data: to match the probabilities predicted by the model being trained to these target probabilities. Similar to distillation.

![LWF1](images/lwf1.png)![LWF2](images/lwf2.png)![LWF3](images/lwf3.png)
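
A small sketch of the two kinds of targets and of a distillation-style loss for the replayed data, written with the standard library only. The function names and the temperature of 2.0 are assumptions chosen for illustration; real implementations work on batches of tensors rather than single lists.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target(teacher_logits):
    """Most likely class according to the previous tasks' model."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's predictions."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))


teacher = [1.0, 2.0, 3.0]
print(hard_target(teacher))                          # index of the most likely class
print(distillation_loss([1.0, 2.0, 3.0], teacher))   # student agrees with the teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))   # student disagrees: larger loss
```

The soft targets carry more information than the hard target alone, since they also tell the student how the previous model ranked the classes it did not pick.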


## [DGR](http://arxiv.org/abs/1705.08690)
- Two models:
    - **Generative model** -- creates data to be replayed, sequentially trained on all tasks (according to each task's input data distribution)
    - **Main model** -- used to solve the tasks and evaluate performance (classification, etc.)
- Generated input samples are paired with “hard targets” provided by the main model


## [DGR+distill](https://arxiv.org/abs/1802.00853)

- Combination of LwF and DGR
- Separate generative model trained to generate images to be replayed, but these were then paired with soft targets (LwF) instead of hard targets (DGR)
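
The only difference between DGR and DGR+distill, as described above, is how the generated inputs are labeled. The sketch below makes that explicit. Both callables are hypothetical stand-ins: `sample_inputs(n)` plays the role of the trained generator and `previous_model(x)` the copy of the main model stored after the previous task.

```python
def make_replay_batch(sample_inputs, previous_model, n_samples, soft=False):
    """Pair generated inputs with targets from the previous version of the main model.

    sample_inputs(n)   -> list of n generated inputs          (hypothetical generator API)
    previous_model(x)  -> list of class probabilities for x   (hypothetical model API)
    """
    batch = []
    for x in sample_inputs(n_samples):
        probs = previous_model(x)
        if soft:
            target = probs  # DGR+distill: keep the full probability vector
        else:
            target = max(range(len(probs)), key=lambda i: probs[i])  # DGR: most likely class
        batch.append((x, target))
    return batch


# Toy stand-ins for the generator and the previous main model.
def toy_generator(n):
    return [[float(i)] for i in range(n)]


def toy_model(x):
    return [0.2, 0.8] if x[0] > 0 else [0.9, 0.1]


print(make_replay_batch(toy_generator, toy_model, 2))             # hard targets
print(make_replay_batch(toy_generator, toy_model, 2, soft=True))  # soft targets
```

In both variants no data from earlier tasks is stored; everything that is replayed comes from the generator and the previous model.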


# Bonus: Continual Learning Methods Classification

Source: [A continual learning survey: Defying forgetting in classification tasks (2020)](https://arxiv.org/pdf/1909.08383.pdf)

![continual_learning_methods](images/contlearningmethods.png)

# Results

![Split MNIST (Nature)](images/results_p3_smnist.png)



![Split MNIST](images/results_p1_smnist.png)

# Brain-inspired modifications to GR

### Motivation: Replay does not need to be perfect BUT it cannot be LOW QUALITY either
  - __Simple approach__: Use recent progress in generative modelling with DNNs
    - Drawbacks: Training those models is complex and expensive
  - __Adopted approach__: Follow brain inspiration
  
### Techniques used


# Takeaways

💡 Ensure

# Reflections/Open Questions

- I share the following belief with Terrence J. Sejnowski: “My belief in [artificial] neural networks was based on my intuition that if nature had solved these problems [vision, speech & language], we should be able to learn from nature how to solve them, too.”

- The paper does not compare its method with [FearNet (Ronald Kemker and Christopher Kanan), ICLR 2018](https://openreview.net/forum?id=SJ1Xmf-Rb), a memory-efficient generative model that does not store previous examples and that apparently achieved SotA performance at incremental class learning on CIFAR-100

- Fun Fact: Neither of the two latest big surveys about Continual Learning includes any reference to these authors:

    - [A continual learning survey: Defying forgetting in classification tasks (2020)](https://arxiv.org/pdf/1909.08383.pdf)
    - [Continual Lifelong Learning with Neural Networks: A Review (2019)](https://arxiv.org/pdf/1802.07569.pdf)


- Nice set of papers applying some of the recent advances in neuroscience/learning sciences
    - In particular, episodic memory and the brain structures involved (hippocampus and neocortex)
    - They show how the current methods applying generative replay do not exactly mimic the real communication happening in the brain when learning
    - Apparently [great source code](https://github.com/GMvandeVen/continual-learning) is available


- Can generative replay (or a similar technique) be applied to Transformer-based models?

    - In particular to:
        - Incremental tasks in text classification
        - Hierarchical text classification tasks when adding new categories
    - Would it be really necessary to have an extra generator? 
        - Could it be based merely on sampling a small set of examples from previous tasks (e.g. applying zero-shot learning techniques)? 
        - Or maybe it could be extended with a masked language model like RoBERTa to generate examples automatically, in a similar way to some of the test case generation described in [Beyond Accuracy: Behavioral Testing of NLP models with CheckList](https://arxiv.org/abs/2005.04118)

 
- Is the controversy introduced in [1], which says that generative replay _shifts the catastrophic forgetting problem to the training of the generative model_, really overcome by this work? Or does this paper show that the claim is not true, and that a small amount of good-enough replay generated by the model itself avoids most of the catastrophic forgetting occurring in the different scenarios?

- I agree, as it is also stated in [3], that "In essence, in machine learning, in addition to adding depth to our networks, we may need to add intelligence to our synapses."

- Observing and learning how the brain works, and translating the advances of neuroscience/learning science/psychology/psychiatry into the machine learning field, has proven very effective, especially in the last few years, in which computing power and data have grown almost exponentially. However, might obsessively trying to mimic how the brain works limit/overshadow the potential of what additional artificial structures can add to enhance the learning capabilities of current algorithms/techniques? As several theories propose (see R. Kurzweil for example), maybe the changes we should pursue to advance the capacity of artificial "brains" should be similar to the structural changes caused by the evolution of the neocortex [2] with regard to other species, and especially within the primates.




[1] Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. [Progress & compress: A scalable framework for continual learning (ICML 2018)](https://arxiv.org/pdf/1805.06370.pdf)

[2] Jon H. Kaas (Prog Brain Res. 2012 ; 195: 91–102. doi:10.1016/B978-0-444-53860-4.00005-20) [The evolution of neocortex in primates](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3787901/pdf/nihms516054.pdf)

[3] Friedemann Zenke, Ben Poole, Surya Ganguli. [Continual Learning Through Synaptic Intelligence](https://arxiv.org/abs/1703.04200)


# PDF Papers

In [None]:
from IPython.display import IFrame
IFrame("Ven et al. - 2020 - Brain-inspired replay for continual learning with.pdf", width=1500, height=1200)

In [None]:

IFrame("Generative replay with feedback connections as a general strategy for continual learning.pdf", width=1500, height=1200)

In [None]:
IFrame("Three Scenarios for Continual Learning.pdf", width=1500, height=1200)