In [None]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": None,
        "transition": None,
        "start_slideshow_at": "selected",
        "leap_motion": {
            "naturalSwipe"  : True,     # Invert swipe gestures
            "pointerOpacity": 0.5,      # Set pointer opacity to 0.5
            "pointerColor"  : "#d80000" # Red pointer
        },
        "header": "<h3>Francisco Perez-Sorrosal</h3>",
        "footer": "<h3>Machine Learning/Deep Learning</h3>",
        "scroll": True,
        "enable_chalkboard": True
     }
)

In [None]:
%pip install emoji --upgrade

In [None]:
import emoji
print(emoji.emojize('Presenting stuff is easy!!! :thumbs_up:'))

In [None]:
# Emojis http://getemoji.com/

# Brain-inspired Replay for Continual Learning with Artificial Neural Networks
[Nature (13 August 2020)](https://www.nature.com/articles/s41467-020-17866-2)
## Gido M. van de Ven & Hava T. Siegelmann & Andreas S. Tolias


### With Excerpts from:

#### "Three Scenarios for Continual Learning" [https://arxiv.org/pdf/1904.07734.pdf](https://arxiv.org/pdf/1904.07734.pdf)
##### Gido M. van de Ven & Andreas S. Tolias
#### "Generative Replay with Feedback Connections as a General Strategy for Continual Learning"  [https://arxiv.org/pdf/1809.10635v2.pdf](https://arxiv.org/pdf/1809.10635v2.pdf)
##### Gido M. van de Ven & Andreas S. Tolias





---

Francisco Perez-Sorrosal | 7 Dec 2020


# Context

__Avoid Catastrophic Forgetting in Artificial Neural Networks__

- Like other papers in the field, the authors cite the seminal psychology papers by *McCloskey* and *Ratcliff* (late 80s and early 90s) on the difficulty connectionist models have in avoiding so-called *memory loss*:

    *_"New learning may interfere catastrophically with old learning when networks are trained sequentially. The analysis of the causes of interference implies that at least some interference will occur whenever new learning may alter weights involved in representing old learning, and the simulation results demonstrate only that interference is catastrophic in some specific networks."_*

    -- McCloskey et al.

# Relationships Among the Papers

1. _Generative Replay with Feedback Connections as a General Strategy for Continual Learning_
2. _Three Scenarios for Continual Learning_
3. _Brain-inspired replay for continual learning with artificial neural networks_


### Relationships:

* 1 : Describes a new *scalable generative replay* method to avoid catastrophic forgetting in lifelong learning
* 1 $\longrightarrow$ 2 : As part of that work, the authors propose a new framework for **fair evaluation of catastrophic forgetting**, consisting of 3 different scenarios for incremental learning
* 1 + 2 + extensions $\longrightarrow$ 3  : Extends the *scalable generative replay* from 1. with additional brain-inspired techniques drawn from neuroscience, and evaluates it on the scenarios described in 2.


# Main Goal and Contributions

__Naive Description of the Papers' Goal__: Avoid always retraining models on the data of all classes seen so far, while still avoiding catastrophic forgetting.

All along the three papers, the __main contributions__ are:

💡 Make the comparison of continual learning scenarios easier and more rigorous
  - Identify 3 distinct scenarios, depending on whether task identity is provided at test time and, if not, whether it must be inferred.

💡 Compare recently proposed SotA methods on continual learning

💡 Recognize Generative Replay (GR) [1] as the only SotA method that is competitive in all 3 scenarios identified
  - Choose GR as the base method on which to build new techniques

💡 Propose a new fast, scalable and competitive (performance-wise) GR approach that behaves well in the 3 scenarios identified

💡 Extend the proposed GR approach with recent advances from neuroscience applied to ANNs

💡 Generate new perspectives and hypotheses about the computational role and possible implementations of replay in the brain


[1] Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems (eds. Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S. & Garnett, R.) 2994–3003 (Curran Associates, Inc., Long Beach, 2017)

# Continual Learning Scenarios


## Overall Problem framed as Classification: An Example

- Imagine an agent

- The agent is trained to learn a simple task **A**: _"classify cats and dogs"_

- After the initial training on **A**, the agent is trained to learn another simple task **B**: _"classify cows and horses"_

The problem is that, in this setup, researchers may hold different expectations, assumptions or interpretations:

1. On the one hand, some may expect the agent only to solve the exact classification tasks it was trained on
2. On the other hand, distinguishing between classes from different learning episodes is also a reasonable expectation if the whole scenario is seen as two sequential events

- __That is, naively we would expect that the agent should now also be able to distinguish between cats and cows__

- The difference between the two expectations described above turns out to dramatically affect the difficulty of a continual learning problem, hence the __necessity to distinguish between (at least) these two scenarios__

- Most of the __SotA ML algorithms for CL fail in the second scenario__, even on seemingly simple toy examples
  - Only Generative Replay performs well in the second scenario


# Continual Learning Scenarios

💡 **KEY**: *WHETHER OR NOT* the model is required to identify, at test time, which task it has to solve

  - e.g. N. Masse et al.'s "Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization", which assumes task identity is always available, reports a large improvement over SotA.


##  Scenarios:
 
 1. Models are **always** informed about which task they are solving (task-incremental learning, Task-IL)


 2. Models **do not know** the task identity at test time (domain-incremental learning, Domain-IL)
   - However, the model does not need to infer the task, only solve it


 3. Models need to both **solve each task seen so far and infer which task** they are currently being evaluated on (class-incremental learning, Class-IL)
 
 ![3Scenarios for CL](images/3scenarios.png)
    


# [Three Scenarios for Continual Learning (pdf)](https://arxiv.org/pdf/1904.07734.pdf)

## Gido M. van de Ven & Andreas S. Tolias

### NeurIPS Continual Learning Workshop 2018 (arXiv 2019)


Extended description of the scenarios addressing the following:

💡 The scenarios for Continual Learning Analysis

💡 Strategies for Continual Learning

💡 Experiments based on the previous scenarios presented

# Scenarios

## Task-IL

- The easiest continual learning scenario
- Models are always informed about which task needs to be performed
- It's possible to train models with task-specific components 
- Typical network architecture in this scenario has a “multi-headed” output layer

## Domain-IL

- Task identity is not available at test time
- BUT... models only need to **solve the task**, not infer which task it is
- Task structure is always the same, **BUT** the input distribution changes
- Example: agents that have to survive in different environments without prior knowledge of the environment itself

## Class-IL

- Models have to solve each task seen so far AND infer which task they are presented with
- The name refers to the fact that the model has to incrementally learn to identify new classes of objects, as infants do in the real world
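
To make the difference between the three scenarios concrete, here is a minimal sketch of what a model is given and what it must output for one test digit. It assumes a split-MNIST-style protocol with five tasks of two digits each; the function name `target_for_scenario` and the dictionary keys are hypothetical, chosen only for illustration.

```python
# Assumed protocol: five tasks with two digits each (task 0 -> digits 0 and 1, ...).
DIGITS_PER_TASK = 2


def target_for_scenario(digit, scenario):
    """Return what the model is told and what it must predict for one digit."""
    task_id, within_task = divmod(digit, DIGITS_PER_TASK)
    if scenario == "task":
        # Task-IL: task identity is provided; choose between the task's two classes.
        return {"task_given": task_id, "target": within_task}
    if scenario == "domain":
        # Domain-IL: no task identity; still only "first or second class of the task".
        return {"task_given": None, "target": within_task}
    if scenario == "class":
        # Class-IL: no task identity; choose among all ten digits.
        return {"task_given": None, "target": digit}
    raise ValueError(f"unknown scenario: {scenario!r}")


for name in ("task", "domain", "class"):
    print(name, target_for_scenario(7, name))
```

The input is the same in all three cases; only the information provided and the size of the output space change, which is what makes Class-IL the hardest of the three.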

# Single Headed vs Multi-Headed Methods

Classical view of the continual learning:

💡 A multi-headed layout requires task identity to be known (Task-IL)

💡 A single-headed layout **DOES NOT** require the task identity to be known (Domain-IL/Class-IL)

This distinction is made **based on the architectural layout of a network’s output layer**. 

- **BUT** although using a separate output layer for each task is the most common way to exploit task identity information, **it is not the only way**

- Likewise, although a single-headed layout might by itself not require task identity to be known, the model can still use task identity in other ways (see [1])

## Major differences/advantages of the 3 scenarios proposal vs the Single/Multi Headed classical view...

1. The scenarios proposed in the paper reflect more generally the conditions under which a model is evaluated.

2. The scenarios extend upon the multi-headed vs single-headed split by recognizing that:
    - when task identity is not provided, there is a further distinction depending on whether the network is explicitly required to infer task identity
    - the two scenarios resulting from this additional split differ substantially in difficulty

[1] *Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization*, by N. Masse et al.

# Example Task Protocols

💡 Goal is twofold: 

1. Show that any task protocol can be performed according to each scenario
  -  Exercised through two different task protocols for all three scenarios
2. Demonstrate the difference between the three continual learning scenarios

## Protocols

### Sequentially learning to classify MNIST-digits


![Split MNIST](images/splitmnist.png)



Demonstrates:



1. The Task-IL scenario
  - It is sometimes referred to as ‘multi-headed split MNIST’
  

2. The Class-IL scenario
  - It is referred to as ‘single-headed split MNIST’


3. It could also be performed under the Domain-IL scenario


![Split MNIST](images/splitmnistscenarios.png)
 

### Permuted MNIST

- Each task involves classifying all ten MNIST-digits but with a different permutation applied to the pixels for every new task (Figure 2). 

![Permuted MNIST](images/permutedmnist.png)
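
A minimal sketch of how such a protocol could be generated, assuming images arrive as a NumPy array of shape `(batch, 28, 28)`. The helper names `make_permutations` and `permute_images` are hypothetical, and leaving the first task unpermuted is an assumption made here for illustration.

```python
import numpy as np


def make_permutations(n_tasks, n_pixels=28 * 28, seed=0):
    """One fixed pixel permutation per task; task 0 keeps the original order."""
    rng = np.random.default_rng(seed)
    perms = [np.arange(n_pixels)]
    perms += [rng.permutation(n_pixels) for _ in range(n_tasks - 1)]
    return perms


def permute_images(images, perm):
    """Apply the same pixel permutation to every image in the batch."""
    flat = images.reshape(len(images), -1)
    return flat[:, perm].reshape(images.shape)


# Stand-in data instead of real MNIST digits.
images = np.arange(2 * 28 * 28, dtype=np.float32).reshape(2, 28, 28)
perms = make_permutations(n_tasks=3)
task1_images = permute_images(images, perms[1])
```

Every task keeps the same ten labels; only the input distribution changes, which is why this protocol maps most naturally onto Domain-IL.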


Demonstrates:


1. Naturally the Domain-IL scenario
2. But it can also be performed according to the other scenarios.

![Permuted MNIST](images/permutedmnistscenarios.png)

# Task Boundaries


## Important Assumption: 

**In training, there are clear and well-defined boundaries between the tasks to be learned**

* **Objective of well defined boundaries?**
  - Make the continual learning process more structured
  - Without that structure, the scenarios described become blurry

## Implications:

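* Without well-defined task boundaries, the continual learning problem becomes less structured and potentially much harder: among others, **training with randomly-sampled minibatches** and **multiple passes over each task’s training data** are no longer possible. 
* For the setting without well-defined task boundaries, see **Task agnostic continual learning using online variational bayes** by *C. Zeno et al.*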

# Methods Compared: Task-specific

## [XdG, Context Dependent Gating](https://www.pnas.org/content/115/44/E10467)

- Inherited from the [Context-dependent gating](https://en.wikipedia.org/wiki/Synaptic_gating) concept in neuroscience
- Task switching disinhibits nonoverlapping sets of sparse dendritic branches
    - Intersection of changes ~ 0: minimal interference between the synaptic changes for one task and the synaptic changes made for previous tasks
- Simplified version of XdG used here:
    - The algorithm sends an additional signal, unique to each task, which is projected onto all hidden neurons
    - For each task, X% of the units in each hidden layer are fully gated (i.e., their activations are set to zero)
        - X is treated as a hyperparameter (set by grid search)
    - In summary: **each node in the ANN is assigned (randomly and a priori) to the tasks it takes part in**
- Small computational impact
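
The simplified gating described above can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the names `make_gates` and `gated_forward`, the layer sizes and the gated fraction of 0.75 are all assumptions.

```python
import numpy as np


def make_gates(n_tasks, layer_sizes, gated_fraction, seed=0):
    """Draw, once and a priori, a random binary mask per task and hidden layer."""
    rng = np.random.default_rng(seed)
    gates = []
    for _ in range(n_tasks):
        task_gates = []
        for size in layer_sizes:
            mask = np.ones(size)
            n_gated = int(round(gated_fraction * size))
            gated_units = rng.choice(size, size=n_gated, replace=False)
            mask[gated_units] = 0.0  # fully gated: activation forced to zero
            task_gates.append(mask)
        gates.append(task_gates)
    return gates


def gated_forward(x, weights, task_gates):
    """ReLU MLP forward pass where each hidden layer is multiplied by its task mask."""
    h = x
    for W, mask in zip(weights, task_gates):
        h = np.maximum(h @ W, 0.0) * mask
    return h


rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
gates = make_gates(n_tasks=3, layer_sizes=[8, 8], gated_fraction=0.75)
out = gated_forward(np.ones((2, 4)), weights, gates[0])
```

Selecting the mask requires the task identity, which is why this form of gating only applies directly in the Task-IL scenario.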


# Methods Compared: Regularization-based

- Add a regularization term to the loss
- The impact of regularization is controlled by a hyperparameter $\lambda$
- $L_{total} = L_{current} + \lambda \, L_{regularization}$

## [EWC, Elastic Weight Consolidation](https://arxiv.org/abs/1612.00796)

- Main difference with XdG:
    - Train a different part of the network for each task, but always use the entire network for execution
- HOW? 
    - Regularize the ANN params while training each new task
    - For every parameter in the ANN, estimate how important it is for the previously learned tasks
        - Depending on this importance, changes to the parameter are penalized during later training
        - This is equivalent to slowing down learning in those parts of the network that are supposedly "remembering" the previous tasks
- Suitable for reinforcement learning scenarios
- Favours the initial tasks more
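
As a rough illustration of the penalty, here is a minimal NumPy sketch of a quadratic EWC-style regularizer with a diagonal Fisher estimate, plugged into $L_{total} = L_{current} + \lambda L_{regularization}$. The function names are hypothetical, the numbers are made up, and the way the Fisher diagonal is estimated (typically from squared gradients of the log-likelihood) is left out.

```python
import numpy as np


def ewc_penalty(params, old_params, fisher_diag):
    """0.5 * sum_i F_i * (theta_i - theta*_i)^2 for one previous task."""
    return 0.5 * np.sum(fisher_diag * (params - old_params) ** 2)


def total_loss(current_loss, params, consolidated, lam):
    """consolidated: list of (old_params, fisher_diag), one entry per previous task."""
    reg = sum(ewc_penalty(params, p, f) for p, f in consolidated)
    return current_loss + lam * reg


params = np.array([1.0, 2.0])
task_a = (np.array([0.0, 2.0]), np.array([4.0, 1.0]))  # (theta*, Fisher diagonal)
penalty = ewc_penalty(params, *task_a)
loss = total_loss(0.3, params, [task_a], lam=10.0)
```

Parameters with a large Fisher value are expensive to move, which is the "slowed-down learning" effect described in the list above. Note that one `(old_params, fisher_diag)` pair is stored per previous task.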

## [online EWC](https://arxiv.org/pdf/1805.06370.pdf)

- **EWC criticism** from [this paper](https://arxiv.org/abs/1712.03847): 
    - For two tasks, EWC is approximately a diagonalized Laplace approximation, with an added hyperparameter $\lambda_A$ that expresses the importance of the earlier task
    - In essence, the criticism is that, when more than two tasks are considered, the quadratic penalties in EWC are inconsistent with this derivation (their number grows linearly with the number of tasks) and might lead to double-counting data from earlier tasks.

- Online EWC is a modification of EWC focused on scalability
- Favours the most recent past more
- Tries to tame the growth of the computational cost of the regularization
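
A minimal sketch of the idea, assuming a single running Fisher estimate that is decayed by a factor `gamma` each time a task is consolidated. The class name `OnlineEWC`, the default `gamma` and the toy numbers are assumptions for illustration only.

```python
import numpy as np


class OnlineEWC:
    """Keeps one quadratic penalty, however many tasks have been seen."""

    def __init__(self, n_params, gamma=0.9):
        self.gamma = gamma                 # gamma < 1 down-weights older tasks
        self.fisher = np.zeros(n_params)   # running (decayed) Fisher diagonal
        self.anchor = np.zeros(n_params)   # parameters after the latest task

    def consolidate(self, params, task_fisher):
        """Call once after finishing a task."""
        self.fisher = self.gamma * self.fisher + task_fisher
        self.anchor = params.copy()

    def penalty(self, params):
        return 0.5 * np.sum(self.fisher * (params - self.anchor) ** 2)


reg = OnlineEWC(n_params=3, gamma=0.5)
reg.consolidate(np.zeros(3), np.ones(3))   # after task 1
reg.consolidate(np.ones(3), np.ones(3))    # after task 2
value = reg.penalty(np.full(3, 2.0))
```

Memory and compute stay constant in the number of tasks, and the decay is what makes the method favour the most recent past.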


## [SI, Synaptic Intelligence](https://arxiv.org/abs/1703.04200)

- **NOT DIRECTLY related to biological mechanisms, just observations:** "While we make no claim that biological synapses behave like the intelligent synapses of our model, a wealth of experimental data in neurobiology suggests that biological synapses act in much more complex ways than the artificial scalar synapses that dominate current machine learning models. *In essence, whether synaptic changes occur, and whether they are made permanent, or left to ultimately decay, can be controlled by many different biological factors.*"

- Similar to EWC in that part of the ANN is trained per task, but the full ANN is used for execution
    - Paper states: "The regularization penalty is similar to EWC as recently introduced by Kirkpatrick et al. (2017)."
- In contrast to EWC, SI computes the per-synapse consolidation strength:
    - Online (Similar to what is proposed in online EWC)
    - Over the entire learning trajectory in parameter space
- Conjectures that individual synapses do not correspond simply to single scalar synaptic weights, and implements them so that they behave as higher-dimensional dynamical systems
- The high-dimensional state of these SI synapses makes it possible to:
    - Accumulate task relevant information more efficiently during training
    - Retain a memory of previous parameter values
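
The online, trajectory-based importance estimate can be sketched on a toy problem. The example below assumes plain gradient descent on a quadratic loss and a damping term `xi`; the function name, the hyperparameter values and the toy loss are assumptions for illustration, not the setup used in the papers.

```python
import numpy as np


def train_task_with_si(theta, grad_fn, n_steps=100, lr=0.1, xi=0.1):
    """Gradient descent that also accumulates a per-parameter importance."""
    start = theta.copy()
    omega = np.zeros_like(theta)   # running path integral, one value per parameter
    for _ in range(n_steps):
        g = grad_fn(theta)
        step = -lr * g
        omega += -g * step         # loss decrease credited to each parameter
        theta = theta + step
    total_change = theta - start
    # Normalize by how far each parameter moved; xi avoids division by zero.
    importance = omega / (total_change ** 2 + xi)
    return theta, importance


curvature = np.array([5.0, 0.1])   # toy loss: 0.5 * sum(curvature * theta**2)


def toy_grad(theta):
    return curvature * theta


theta, importance = train_task_with_si(np.array([1.0, 1.0]), toy_grad)
```

In this toy run the parameter along the steep direction contributes far more to the loss decrease and ends up with the larger importance, so it would be penalized more strongly when a later task tries to change it.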



# Methods Compared: Replay-based methods

- Loss function consists of:
    - A term for the data of the current task and...
    - A term for the replayed data

- The loss for current/replayed data can be weighted according to the # of tasks the model has been trained on so far
    - $L_{total} = \frac{1}{N_{\text{tasks so far}}} \, L_{current} + \left(1 - \frac{1}{N_{\text{tasks so far}}}\right) L_{replay}$
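
A small worked example of this weighting. The function name `replay_weighted_loss` and the loss values are made up for illustration.

```python
def replay_weighted_loss(loss_current, loss_replay, n_tasks_so_far):
    """Weight the current-task loss by 1/N and the replay loss by 1 - 1/N."""
    if n_tasks_so_far < 1:
        raise ValueError("n_tasks_so_far must be at least 1")
    w_current = 1.0 / n_tasks_so_far
    return w_current * loss_current + (1.0 - w_current) * loss_replay


first_task = replay_weighted_loss(0.8, 0.2, n_tasks_so_far=1)
fourth_task = replay_weighted_loss(0.8, 0.2, n_tasks_so_far=4)
```

On the first task there is nothing to replay, so the replay term gets weight zero. By the fourth task the replayed data carries three quarters of the total weight, reflecting that it stands in for three of the four tasks.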
    

## [LwF, Learning Without Forgetting](https://arxiv.org/abs/1606.09282)

- Considered by the authors a replay-based method
- Train a model M1 for task A with labeled data -> Label the input data of task B with M1 -> Use the resulting input-target pairs as pseudo-data representing task A while training on task B.
- Inputs to be replayed labeled with “hard targets” + “soft targets”
    - “hard targets” - the most likely category according to the previous tasks’ model
    - “soft targets” - previous tasks’ model predicted probabilities for all target classes
- Goal for the replayed data: to match the probabilities predicted by the model being trained to these target probabilities. Similar to distillation.

![LWF1](images/lwf1.png)![LWF2](images/lwf2.png)![LWF3](images/lwf3.png)
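
The difference between hard and soft targets, and the distillation-style loss on the replayed inputs, can be sketched as follows. This is an illustrative NumPy version; the function names and the temperature of 2 are assumptions, and some implementations additionally scale the loss by the squared temperature.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def hard_targets(old_logits):
    """Most likely class according to the previous tasks' model."""
    return old_logits.argmax(axis=-1)


def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between softened old predictions (soft targets) and new ones."""
    soft = np.exp(log_softmax(old_logits, temperature))
    log_p = log_softmax(new_logits, temperature)
    return float(-(soft * log_p).sum(axis=-1).mean())


old = np.array([[2.0, 0.5, -1.0]])        # previous model's logits for one input
same = distillation_loss(old, old)
different = distillation_loss(np.array([[-1.0, 0.5, 2.0]]), old)
```

The loss is smallest when the model being trained reproduces the old model's probabilities, which is what "matching the target probabilities" means above.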


## [DGR (Deep Generative Replay or just GR)](http://arxiv.org/abs/1705.08690)
- Two models:
    - **Generative model** -- creates data to be replayed, sequentially trained on all tasks (according to each task's input data distribution)
    - **Main model** -- used to evaluate task performance (classification, etc.)
- Input samples were paired with “hard targets” provided by the main model
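
A minimal sketch of how a replay batch could be assembled from the two models. Both models are passed in as plain callables; `build_replay_batch` and the toy stand-ins below are hypothetical and only illustrate the data flow, not a real generator or classifier.

```python
import numpy as np


def build_replay_batch(prev_generator, prev_main_model, batch_size):
    """Sample inputs from the stored generator and label them with the stored main model."""
    x_replay = prev_generator(batch_size)
    y_replay = prev_main_model(x_replay).argmax(axis=1)   # hard targets
    return x_replay, y_replay


# Toy stand-ins: a random "generator" and a linear "main model" with 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))


def toy_generator(n):
    return rng.normal(size=(n, 4))


def toy_main_model(x):
    return x @ W


x_r, y_r = build_replay_batch(toy_generator, toy_main_model, batch_size=8)
```

During training on a new task, such a batch would be mixed with real data from the current task, and the generator itself would be trained on the same mixture so that it keeps covering all tasks seen so far.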


## [DGR+distill](https://arxiv.org/abs/1802.00853)

- Combination of LwF and DGR
- Separate generative model trained to generate images to be replayed, but these were then paired with soft targets (LwF) instead of hard targets (DGR)


## [iCaRL](https://arxiv.org/abs/1611.07725)

- Only the training data for a small number of classes has to be present at the same time and new classes can be added progressively
- Can learn many classes incrementally over a long period of time where other strategies quickly fail
- It can only be applied in the Class-IL scenario
  - However, two components of iCaRL are suitable for all scenarios:
    - the use of exemplars for classification 
    - and the replay of stored data during training
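
The use of exemplars for classification can be sketched as a nearest-class-mean rule in feature space. This simplified NumPy version assumes features have already been extracted and omits details such as feature normalization and exemplar selection; the function names and toy features are made up.

```python
import numpy as np


def class_means(exemplar_features):
    """exemplar_features: dict mapping class label -> (n_exemplars, d) feature array."""
    labels = sorted(exemplar_features)
    means = np.stack([exemplar_features[c].mean(axis=0) for c in labels])
    return labels, means


def classify(features, labels, means):
    """Assign each feature vector to the class with the nearest exemplar mean."""
    dists = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=2)
    return [labels[i] for i in dists.argmin(axis=1)]


exemplars = {
    "cat": np.array([[0.0, 0.0], [0.0, 2.0]]),
    "dog": np.array([[10.0, 10.0], [12.0, 10.0]]),
}
labels, means = class_means(exemplars)
predictions = classify(np.array([[1.0, 1.0], [9.0, 9.0]]), labels, means)
```

Adding a class only requires storing a few exemplars and computing one more mean, which is what allows classes to be added progressively.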

# Results

__Baselines__:
- None: Model sequentially trained on all tasks in the standard way (fine-tuning)
  - Lower bound
- Offline: Model trained using the data of all tasks so far (joint training)
  - Upper bound

## Split MNIST

![3 Scenarios Split MNIST](images/3smnist.png)

## Permuted MNIST

![3 Scenarios Permuted MNIST](images/3pmnist.png)

# Take-Aways

* In the Class-IL scenario (i.e., when task identity must be inferred), __only replay-based methods are capable of producing acceptable results__
  - Replay might be an unavoidable tool

* __Limitation of the current study__: MNIST-images are relatively easy to generate
  - *Open question*: Will GR still be as successful for task protocols with more complex input distributions?

* __Intuition__:
  - Even if the quality of replayed samples is not perfect, they could still be very helpful
  
* Alternative/complement to replaying generated samples:
  - Store examples from previous tasks and replay them

# Bonus: Continual Learning Methods Classification

Source: [A continual learning survey: Defying forgetting in classification tasks (2020)](https://arxiv.org/pdf/1909.08383.pdf)

![continual_learning_methods](images/contlearningmethods.png)

# ["Generative Replay with Feedback Connections as a General Strategy for Continual Learning" (pdf)](https://arxiv.org/pdf/1809.10635v2.pdf)

## Gido M. van de Ven & Andreas S. Tolias

### arXiv 2018


Early version of the Nature paper, laying the groundwork:

💡 Also presents the 3 scenarios for Continual Learning Analysis

💡 Focuses on GR as the base on which to build brain-inspired techniques

💡 Focuses mainly on the Replay-through-Feedback (RtF) technique as a potential advantage

💡 Results already show the advantage of applying RtF over vanilla GR

# Memory vs Generative Replay

![Memory vs Generative Replay](images/memoryvsgenerativereplay.png)

- GR is the only method capable of performing well in the Class-IL scenario without storing data
- __GR drawback__: scaling it up to more challenging problems has been reported to be problematic

# Results

## Split MNIST task protocol
![Split MNIST](images/results_p3_smnist.png)


### Task-IL & Class-IL (Nature)
- Compared with the original paper, EWC methods are doing better here
![Split MNIST (Nature)](images/results_p1_smnist.png)


## Permuted MNIST task protocol

![Permuted MNIST](images/results_p3_pmnist.png)

### Domain-IL (Nature)
![Permuted MNIST (Nature)](images/braininspiredeval.png)

# Brain-inspired modifications to GR

### Current GR Approaches

![Current GR approach](images/currentGR.png)

### Motivation: Replay does not need to be perfect, BUT it must NOT be LOW QUALITY either
  - __Simple/Naive approach__: Use recent progress in generative modelling with DNNs
    - Drawback: Training those models is complex and expensive
  - __Adopted approach__: Follow brain inspiration
  
### Brain inspired Techniques used

![Nature GR approach](images/brainGR.png)

 1. __Replay Through Feedback__
  - Replay in brain originates in the hippocampus and propagates to the cortex
  - This fact is ignored by current GR methods
  - __Proposal__: merge the generator into the main model, adding generative backward/feedback connections
  - Implemented as VAE with added softmax classification layer to the top layer of its encoder
  
 2. __Conditional Replay__
  - Human brain has control over what memories are recalled, but VAEs don't
  - __Proposal__: Allow the VAE to generate examples of a particular class
  - Implementation: the standard normal prior is replaced by a Gaussian mixture with a separate mode for each class
 
 3. __Context Dependent Gating__
  - In the human brain, contextual cues (e.g., odours, sounds) bias what memories are replayed
  - XdG inhibits a different, randomly selected subset of neurons in each hidden layer depending on the task.
  - Approach not valid for Domain/Class-IL scenarios
  - __Proposal__: Do gating based on internal context
  - Implementation: The internal context conditioned on is the specific task/class to be generated or reconstructed during the generative backward pass.

 4. __Internal replay__
  - The human brain does not replay memories all the way down to the input level (e.g. a mental image is not propagated all the way back to the retina)
  - __Proposal__: Replay representations of previously learned classes not all the way down to the input level (e.g., pixel level), but internally, at the 'hidden level'
   - Avoid or allow only very limited changes to the first few layers that are not being replayed
   - Consistent with observations in neuroscience experiments
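
Two of the modifications above, conditional replay and gating based on internal context, can be sketched together in a toy decoder. This is only an illustration under assumptions: all sizes are arbitrary, the weights and prior modes are random stand-ins for learned parameters, and names such as `prior_mean`, `gates` and `generate` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, z_dim, hidden, out_dim = 3, 2, 6, 4

# Conditional replay: one Gaussian mode per class instead of a single standard normal prior.
prior_mean = rng.normal(size=(n_classes, z_dim))
prior_logvar = np.zeros((n_classes, z_dim))

# Gating on internal context: a fixed random mask per class for the decoder's hidden layer.
gates = (rng.random((n_classes, hidden)) > 0.5).astype(float)

W1 = rng.normal(size=(z_dim, hidden))
W2 = rng.normal(size=(hidden, out_dim))


def generate(class_id, n):
    """Generate n replay samples for the chosen class (the 'internal context')."""
    std = np.exp(0.5 * prior_logvar[class_id])
    z = prior_mean[class_id] + std * rng.normal(size=(n, z_dim))
    h = np.maximum(z @ W1, 0.0) * gates[class_id]   # gate the generative backward pass
    return h @ W2


samples = generate(class_id=1, n=5)
```

Because the context is the class being generated rather than an externally supplied task label, this kind of gating remains usable when task identity is not available at test time. With internal replay, the output of `generate` would be a hidden-layer representation rather than pixels.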
   
### ML-inspired Techniques used

 5. __Soft Labels (Embedded Distillation)__
  - __Proposal__: Avoid labeling generated data strictly with hard labels.
  - Generated data is labeled with the predicted probabilities for all possible classes (soft labels)
   - When the quality of the generated data is low, it might be harmful to label ambiguous inputs (e.g., ones that lie between two or more classes) as belonging to a single class

# Brain-Inspired Techniques Evaluation

## Context

- Permuted MNIST
- Domain-IL scenario (i.e., no task labels available at test time)
- Reported is average test accuracy based on all permutations so far
- Displayed are the means over 5 repetitions, shaded areas are ±1 SEM (Standard Error of the Mean)
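
For reference, the mean and ±1 SEM band reported in such plots can be computed as below. The accuracy values are made-up placeholders, not results from the paper, and `mean_and_sem` is a hypothetical helper name.

```python
import numpy as np


def mean_and_sem(accuracies):
    """accuracies: (n_repetitions, n_tasks) average test accuracy after each task."""
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(number of repetitions).
    sem = acc.std(axis=0, ddof=1) / np.sqrt(acc.shape[0])
    return mean, sem


mean, sem = mean_and_sem([[0.9, 0.8], [0.7, 0.6]])
lower, upper = mean - sem, mean + sem   # edges of the shaded area
```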

![Brain Inspired Evaluation](images/braininspiredeval.png)

## Take Aways

1. GR outperforms SI for the first 10 tasks __but its performance rapidly degrades after ~15 tasks__
2. Brain-inspired replay (BI-R) outperforms the already strong performance of SI (__achieving SotA performance__)
3. Combining BI-R with SI results in a further boost in performance
4. LwF performs badly on this task protocol because between tasks, the inputs are completely uncorrelated

## Context

- CIFAR 100
- Task IL: Choice only between classes within given task
- Class IL: Choice between all classes seen so far

![Brain Inspired CIFAR](images/braininspiredCIFAR.png)

## Take Aways

- Task IL scenario
 1. BI-R outperforms EWC, SI and LwF
 2. BI-R almost fully mitigated catastrophic forgetting 

- Class-IL scenario
 1. BI-R outperforms the other methods
 2. Performance still remained substantially under the ‘upper bound’
 3. However, BI-R is the best method without storing data
 4. Combining BI-R with SI results in a further boost in performance

# Addition and Ablation Experiments

- Standard GR with individual modifications added ('+', left)
- BI-R with individual modifications removed ('−', right)
- Mean results over 5 (permuted MNIST) or 10 (CIFAR-100) repetitions
- Dotted grey lines indicate chance level
- Solid black lines show performance when the base network is trained only on the final task/episode 
- Techniques:
 - rtf: replay-through-feedback
 - con: conditional replay
 - gat: gating based on internal context
 - int: internal replay
 - dis: distillation

# Permuted MNIST

![Permuted MNIST](images/ablationpmnist.png)

- The gain in performance obtained by combining all components is larger than the sum of the effects of adding each of them in isolation
- None of the individual modifications were sufficient to achieve the performance of BI-R
 - while all of them (with the exception of RtF) were necessary

# CIFAR-100 Task-IL

![CIFAR-100 Task-IL](images/ablationcifar100t.png)

- None of the individual components were necessary for the Task-IL scenario
 - Preventing catastrophic forgetting is easier here


# CIFAR-100 Class-IL

- __Pay attention to the scale of the Y-Axis here__

![CIFAR-100 Class-IL](images/ablationcifar100c.png)

- The gain in performance obtained by combining all components is larger than the sum of the effects of adding each of them in isolation
- None of the individual modifications were sufficient to achieve the performance of BI-R
 - while all of them (with the exception of RtF) were necessary


## Take Aways

- __Most influential__: _Internal replay_ had the largest effect on performance
- __RtF increases efficiency (i.e., removing the need for a separate generative model) without substantially hurting performance__
 - Best overall performance was in fact obtained withOUT RtF <- I think this is a typo in the Nature paper



# Summary and Take Aways

💡Biological NNs are superior to their ANN counterparts when it comes to continual learning
 - The brain has inspired recent attempts to alleviate catastrophic forgetting in ANNs:
  - Regularization-based methods such as EWC and SI model the complexity of biological synapses
  - XdG models the brain's ability to process stimuli differently depending on context (helps in Task-IL scenarios)
 - These methods are successful in scenarios where tasks must be learned incrementally, __BUT__ they are unable to incrementally learn new classes
 
💡Authors show how simple, easy-to-implement and efficient brain-inspired modifications can enable GR to successfully scale to problems with many tasks or complex inputs

💡GR facilitates incrementally learning a generative model
 - and the __brain-inspired modifications improve the quality__ of the learned generative model
 
💡In semi/unsupervised settings, the conditional replay and the gating based on internal context components would need to be modified as their current implementation depends on the availability of class labels during training

💡Drawbacks:
 1. Other input modalities:
  - Require different pre-processing layers, and it remains to be confirmed whether replaying internal representations will work with those
  - Work to be done on this
  - Analogous to the separate sensory processing areas in the brain
 2. Rigid, pre-trained convolutional layers likely restrict the ability of the model to learn out-of-distribution inputs
  - e.g., images without natural image statistics
  - Similar for the brain though
 3. CL performance was only quantified by the average accuracy over all tasks or classes seen so far
  - This is a measure that mainly reflects the extent to which a method suffers from catastrophic forgetting
  - Critical aspects of CL such as forward and backward transfer or compressibility were not addressed
   - Especially affects Task-IL scenarios (where catastrophic forgetting can be prevented by simply training a different network for each task to be learned)
 4. Preventing catastrophic forgetting in Class-IL is still an unsolved problem
  - Justifies authors' focus on the average accuracy measure
  
  
💡 Generate new perspectives and hypotheses about the computational role and possible implementations of replay in the brain
 - Provides evidence that replay might indeed be a feasible way for the brain to combat catastrophic forgetting
 - Postulates replay in the brain to be a generative process
 - Representations replayed in the brain do not directly reflect experiences; instead, they might be samples from a learned model of the world [1,2,3,4]
 - Current models assume that examples for all categories are either observed together or that they can be directly stored in memory (e.g., exemplar-, prototype- and rule- based models all rely on this assumption)
  - GR could be a biologically plausible way to extend these models to the more natural case in which the different categories to be learned are only available sequentially
 - Missing aspects of the proposed model:
  - *Temporal structure*: replay-events in the brain consist of sequences of neuronal activity that reflect the temporal order of the actual experiences

💡 Why is GR so much more effective for Class-IL than regularization-based methods such as EWC and SI?
 - How is memory stored?
  - GR stores it in the function or output space of the network
  - Reg-based methods store and maintain the memory of previous classes entirely in the parameter space of the network
   - In Task-IL the memory to be stored is simpler, because only the features important for the specific task learned at that time need to be remembered  
   - In Class-IL this might be challenging, since all information about previous classes must be kept, as it is unknown what the future classes will be like
  - Reg-based methods can't solve Class-IL by themselves __BUT__ provide a unique contribution when combined with GR
   - __Hypothesis__: maintaining memories in function space and maintaining them in parameter space each come with their own, separate challenges:
    - GR: the challenge is to learn a generative network that captures enough of the essence of the previous tasks/classes
    - Reg-based: the challenge is to correctly assign credit to the parameters of the network
   - Neuroscience observation: regularization (or metaplasticity) and replay are complementary mechanisms, which is consistent with empirical observations that the brain uses both strategies side by side to protect its memories [5]
 
[1] Gupta, A. S., van der Meer, M. A., Touretzky, D. S. & Redish, A. D. Hippocampal replay is not a simple function of experience. Neuron 65, 695–705 (2010).

[2] Ólafsdóttir, H. F., Barry, C., Saleem, A. B., Hassabis, D. & Spiers, H. J. Hippocampal place cells construct reward related sequences through unexplored space. Elife 4, e06063 (2015).

[3] Liu, Y., Dolan, R. J., Kurth-Nelson, Z. & Behrens, T. E. Human replay spontaneously reorganizes experience. Cell 178, 640–652 (2019).

[4] Foster, D. J. Replay comes of age. Annu. Rev. Neurosci. 40, 581–602 (2017).

[5] Genzel, L. & Wixted, J. T. Cellular and systems consolidation of declarative memory in Cognitive Neuroscience of Memory Consolidation (eds. Axmacher, N. & Rasch, B.) 3–16 (Springer, Switzerland, 2017).


# Reflections/Open Questions

- I share the following view with Terrence J. Sejnowski: “My belief in [artificial] neural networks was based on my intuition that if nature had solved these problems [vision, speech & language], we should be able to learn from nature how to solve them, too.”

- The paper does not compare its method with [FearNet (Ronald Kemker and Christopher Kanan), ICLR 2018](https://openreview.net/forum?id=SJ1Xmf-Rb), a memory-efficient generative model that does not store previous examples and that reportedly achieved SotA performance at incremental class learning on CIFAR-100

- Fun Fact: Neither of the two latest big surveys on Continual Learning includes any reference to these authors:

    - [A continual learning survey: Defying forgetting in classification tasks (2020)](https://arxiv.org/pdf/1909.08383.pdf)
    - [Continual Lifelong Learning with Neural Networks: A Review (2019)](https://arxiv.org/pdf/1802.07569.pdf)


- Nice set of papers applying some of the recent advances in neuroscience/learning sciences
    - In particular episodic memory and the brain structures involved (hippocampus and neocortex)
    - Show how the current methods applying generative replay do not mimic exactly the real communication happening in the brain when learning
    - Apparently [great source code](https://github.com/GMvandeVen/continual-learning) available


- Can generative replay (or a similar technique) be applied to Transformer-based models?

    - Language learning/processing probably uses a different mechanism in the brain [0]. Investigate the brain structures related to language. Summary sentence:
    "Our words are bound by an invisible grammar which is embedded in the brain." - Jonah Lehrer, in Proust Was a Neuroscientist.
    - In particular to:
        - Incremental tasks in text classification
        - Hierarchical text classification tasks when adding new categories
    - Would an extra generator really be necessary? 
        - Could replay be based merely on sampling a small set of examples from previous tasks (e.g. applying zero-shot learning techniques)? 
        - Or could that set be expanded with a masked language model like RoBERTa to generate examples automatically, similarly to some of the test-case generation described in [Beyond Accuracy: Behavioral Testing of NLP models with CheckList](https://arxiv.org/abs/2005.04118)

 
- Is the concern raised in [1], that generative replay _shifts the catastrophic forgetting problem to the training of the generative model_, really overcome by this work? Or does this paper show that the concern does not hold, and that a small amount of good-enough replay generated by the model itself avoids most of the catastrophic forgetting occurring in the different scenarios?

- I agree with the statement in [3] that "In essence, in machine learning, in addition to adding depth to our networks, we may need to add intelligence to our synapses."

- Observing and learning how the brain works, and translating advances in neuroscience/learning science/psychology/psychiatry into machine learning, has proven very effective, especially in the last few years, in which computing power and data have grown almost exponentially. However, might obsessively trying to mimic how the brain works limit or overshadow what additional artificial structures could add to enhance the learning capabilities of current algorithms/techniques? As several theories propose (see R. Kurzweil, for example), maybe the changes we should pursue to advance the capacity of artificial "brains" should be similar to the structural changes brought about by the evolution of the neocortex [2] relative to other species, and especially within the primates.



[0] Angela D. Friederici. [The Brain Basis of Language Processing: From Structure to Function (2011)](https://journals.physiology.org/doi/full/10.1152/physrev.00006.2011)

[1] Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. [Progress & compress: A scalable framework for continual learning (ICML 2018)](https://arxiv.org/pdf/1805.06370.pdf)

[2] Jon H. Kaas (Prog Brain Res. 2012 ; 195: 91–102. doi:10.1016/B978-0-444-53860-4.00005-20) [The evolution of neocortex in primates](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3787901/pdf/nihms516054.pdf)

[3] Friedemann Zenke, Ben Poole, Surya Ganguli. [Continual Learning Through Synaptic Intelligence](https://arxiv.org/abs/1703.04200)


# PDF Papers

In [None]:
from IPython.display import IFrame
IFrame("Ven et al. - 2020 - Brain-inspired replay for continual learning with.pdf", width=1500, height=1200)

In [None]:

IFrame("Generative replay with feedback connections as a general strategy for continual learning.pdf", width=1500, height=1200)

In [None]:
IFrame("Three Scenarios for Continual Learning.pdf", width=1500, height=1200)