# Paper Summary 

This is a summary of the paper **Variational Continual Learning** by **Cuong V. Nguyen**, **Yingzhen Li**, **Thang D. Bui** and **Richard E. Turner**.

# 1. Introduction

* **Continual Learning**

>* Data arrive continuously, possibly in a non-i.i.d. way
>* Tasks may change over time (e.g. new classes may be discovered)
>* Entirely new tasks can emerge ([Schlimmer & Fisher, 1986](references/Schlimmer&Fisher_1986.pdf); [Sutton & Whitehead, 1993](references/Sutton&Whitehead_1993.pdf); [Ring, 1997](references/Ring_1997.pdf))

* **Challenge for Continual Learning**

>* Balance between **adapting to new data** vs. **retaining existing knowledge**
>  * Too much plasticity $\rightarrow$ **catastrophic forgetting** ([McCloskey & Cohen, 1989](https://www.sciencedirect.com/science/article/pii/S0079742108605368); [Ratcliff, 1990](references/Ratcliff_1990.pdf); [Goodfellow et al., 2014a](references/Goodfellow_2014a.pdf))
>  * Too much stability $\rightarrow$ inability to adapt
>* **Approach 1:** train individual models on each task $\rightarrow$ carry out a second stage of training to combine them
>  * ([Lee et al., 2017](references/Lee_2017.pdf))
>* **Approach 2:** maintain a single model and use a single type of regularized training that prevents drastic changes in the parameters with a large influence on prediction, but allows other parameters to change more freely
>  * ([Li & Hoiem, 2016](references/Li&Hoiem_2016.pdf); [Kirkpatrick et al., 2017](references/Kirkpatrick_2017.pdf); [Zenke et al., 2017](references/Zenke_2017.pdf))

* **Variational Continual Learning**

>* Merge **online VI** ([Ghahramani & Attias, 2000](references/Ghahramani&Attias_2000.pdf); [Sato, 2001](references/Sato_2001.pdf); [Broderick et al., 2013](references/Broderick_2013.pdf))
>* with **Monte Carlo VI for NN** ([Blundell et al., 2015](references/Blundell_2015.pdf))
>* and include a **small episodic memory** ([Bachem et al., 2015](references/Bachem_2015.pdf); [Huggins et al., 2016](references/Huggins_2016.pdf))


# 2. Continual Learning by Approximate Bayesian Inference

* **Online updating, derived from Bayes' rule**

>$$p(\boldsymbol{\theta}|\mathcal{D}_{1:T}) \propto p(\boldsymbol{\theta}) \prod^T_{t=1} \prod^{N_t}_{n_t=1} p(y_t^{(n_t)}|\boldsymbol{\theta},x_t^{(n_t)}) = p(\boldsymbol{\theta}) \prod^T_{t=1} p(\mathcal{D}_t|\boldsymbol{\theta}) \propto p(\boldsymbol{\theta}|\mathcal{D}_{1:T-1}) p(\mathcal{D}_T|\boldsymbol{\theta})$$

>* Posterior after $T$th dataset $\propto$ Posterior after $(T-1)$th dataset $\times$ Likelihood of the $T$th dataset
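
As a minimal sketch of this recursion, assume a conjugate Beta-Bernoulli model (not used in the paper; chosen here only because its posterior is available in closed form). The helper `update_beta` and the toy datasets are illustrative. Updating the posterior one dataset at a time gives the same result as conditioning on all data at once.

```python
def update_beta(alpha, beta, data):
    """Return Beta posterior parameters after observing Bernoulli outcomes (0 or 1)."""
    successes = sum(data)
    failures = len(data) - successes
    return alpha + successes, beta + failures


prior = (1.0, 1.0)  # Beta(1, 1), i.e. a uniform prior (an assumption of this sketch)
datasets = [[1, 0, 1], [1, 1], [0, 0, 1, 1]]  # toy datasets D_1, D_2, D_3

# Online: the posterior after D_{1:t-1} acts as the prior for D_t.
posterior = prior
for d in datasets:
    posterior = update_beta(*posterior, d)

# Batch: condition on all datasets at once.
all_data = [y for d in datasets for y in d]
batch_posterior = update_beta(*prior, all_data)

print(posterior, batch_posterior)
```

In this conjugate case no projection is needed; the approximation discussed next only becomes necessary when the posterior is intractable.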

* **Projection Operation: approximation for intractable posterior** (recursive)

>\begin{align}
p(\boldsymbol{\theta}|\mathcal{D}_1) \approx q_1(\boldsymbol{\theta}) &= \text{proj}(p(\boldsymbol{\theta})p(\mathcal{D}_1|\boldsymbol{\theta})) \\
p(\boldsymbol{\theta}|\mathcal{D}_{1:T}) \approx q_T(\boldsymbol{\theta}) &= \text{proj}(q_{T-1}(\boldsymbol{\theta})p(\mathcal{D}_T|\boldsymbol{\theta}))
\end{align}

>|Projection Operation|Inference Method|References|
|-|-|-|
|Laplace's approximation    |Laplace propagation      |[Smola et al., 2004](references/Smola_2004.pdf)|
|Variational KL minimization|Online VI|[Ghahramani & Attias, 2000](references/Ghahramani&Attias_2000.pdf); [Sato, 2001](references/Sato_2001.pdf)|
|Moment matching            |Assumed density filtering|[Maybeck, 1982](references/Maybeck_1982.pdf)|
|Importance sampling        |Sequential Monte Carlo   |[Liu & Chen, 1998](references/Liu&Chen_1998.pdf)|

>* This paper uses **Online VI**, as it outperforms the other methods for complex models in the static setting ([Bui et al., 2016](references/Bui_2016.pdf))
>* <font color='red'>**Q. Try building VCL with a different projection operation?**</font>

## 2.1. VCL and Episodic Memory Enhancement

* **Projection Operation: KL Divergence Minimization**

>$$q_t(\boldsymbol{\theta}) = \underset{q \in \mathcal{Q}}{\text{argmin}} \text{KL} 
\left( q(\boldsymbol{\theta}) || \frac{1}{Z_t} q_{t-1}(\boldsymbol{\theta}) p(\mathcal{D}_t|\boldsymbol{\theta}) \right)$$

>* $q_0(\boldsymbol{\theta}) = p(\boldsymbol{\theta})$
>* $Z_t$: normalizing constant (not required when computing the optimum)
>* VCL becomes exact Bayesian inference if $p(\boldsymbol{\theta}|\mathcal{D}_{1:t}) \in \mathcal{Q} \;\forall\; t$
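
>* Why $Z_t$ can be ignored: expanding the KL divergence (a standard identity, written out here for convenience) isolates $\log Z_t$ as a term that does not depend on $q$

>\begin{align}
\text{KL} \left( q(\boldsymbol{\theta}) \;\big|\big|\; \frac{1}{Z_t} q_{t-1}(\boldsymbol{\theta}) p(\mathcal{D}_t|\boldsymbol{\theta}) \right)
&= \mathbb{E}_{q(\boldsymbol{\theta})} \left[ \log q(\boldsymbol{\theta}) - \log q_{t-1}(\boldsymbol{\theta}) - \log p(\mathcal{D}_t|\boldsymbol{\theta}) \right] + \log Z_t \\
&= \text{KL} (q(\boldsymbol{\theta})||q_{t-1}(\boldsymbol{\theta})) - \mathbb{E}_{q(\boldsymbol{\theta})} \left[ \log p(\mathcal{D}_t|\boldsymbol{\theta}) \right] + \log Z_t
\end{align}

>* Minimizing the KL over $q$ is therefore equivalent to maximizing $\mathbb{E}_{q(\boldsymbol{\theta})} \left[ \log p(\mathcal{D}_t|\boldsymbol{\theta}) \right] - \text{KL} (q(\boldsymbol{\theta})||q_{t-1}(\boldsymbol{\theta}))$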

* **Potential Problems**

>* Errors from repeated approximation $\rightarrow$ forget old tasks
>* Minimization at each step is also approximate $\rightarrow$ information loss

* **Solution: Coreset**

>* **Coreset:** small representative set of data from previously observed tasks
>  * Analogous to **episodic memory** ([Lopez-Paz & Ranzato, 2017](references/Lopez-Paz&Ranzato_2017.pdf))
>* **Coreset VCL:** equivalent to a message-passing implementation of VI in which the coreset data point updates are scheduled after updating the other data

>* <font color='red'>**Q. Implement coreset VCL using message passing?**</font>

>* $C_t$: updated using $C_{t-1}$ and selected data points from $\mathcal{D}_t$ (e.g. random selection, K-center algorithm, ...)
>  * K-center algorithm: return K data points that are spread throughout the input space ([Gonzalez, 1985](references/Gonzalez_1985.pdf))
>  * <font color='red'>**Q. What is the optimal way of selecting the coreset?**</font>
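
The K-center selection mentioned above can be sketched with the greedy farthest-point heuristic. This is a minimal illustration, not the paper's code: the function name `k_center`, the Euclidean distance, and the choice of starting index `first` are all assumptions of this sketch.

```python
import math


def k_center(points, k, first=0):
    """Greedy farthest-point selection.

    Returns the indices of k points that are spread out over the input space.
    `first` is the (arbitrary) index of the starting point.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [first]
    # Distance from every point to its nearest selected center so far.
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # Pick the point that is farthest from all current centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


# Toy inputs: two tight clusters plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
coreset_idx = k_center(points, k=3)
print(coreset_idx)
```

Random selection would instead draw K indices uniformly; the greedy rule trades a little computation for better coverage of the input space.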
 

* **Algorithm**

>* **Step 1:** Observe $\mathcal{D}_t$
>* **Step 2:** Update $C_t$ using $C_{t-1}$ and $\mathcal{D}_t$
>* **Step 3:** Update $\tilde{q}_t$ (used for **propagation**)

>\begin{align}
\tilde{q}_t(\boldsymbol{\theta}) &= \text{proj} \left( \tilde{q}_{t-1}(\boldsymbol{\theta}) p(\mathcal{D}_t \cup C_{t-1} \setminus C_t | \boldsymbol{\theta}) \right) \\
&= \underset{q \in \mathcal{Q}}{\text{argmin}} \; \text{KL} 
\left( q(\boldsymbol{\theta})  \;\big|\big|\; \frac{1}{\tilde{Z}} \tilde{q}_{t-1}(\boldsymbol{\theta}) p(\mathcal{D}_t \cup C_{t-1} \setminus C_t |\boldsymbol{\theta}) \right)
\end{align}

>* **Step 4:** Update $q_t$ (used for **prediction**)

>\begin{align}
q_t(\boldsymbol{\theta}) &= \text{proj} \left( \tilde{q}_{t}(\boldsymbol{\theta}) p(C_t | \boldsymbol{\theta}) \right) \\
&= \underset{q \in \mathcal{Q}}{\text{argmin}} \; \text{KL} 
\left( q(\boldsymbol{\theta})  \;\big|\big|\; \frac{1}{Z} \tilde{q}_t (\boldsymbol{\theta}) p(C_t |\boldsymbol{\theta}) \right)
\end{align}

>* **Step 5:** Perform prediction

>$$p(y^*|\boldsymbol{x}^*, \mathcal{D}_{1:t}) = \int q_t(\boldsymbol{\theta}) p(y^*|\boldsymbol{\theta},\boldsymbol{x}^*) d\boldsymbol{\theta}$$
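
The five steps can be sketched on a toy model where the projection is exact. The assumptions here are not from the paper: a scalar Gaussian mean $\theta$ with known noise variance, a Gaussian $q$ stored as (precision, precision $\times$ mean), random coreset selection, and a coreset that only grows, so that $\mathcal{D}_t \cup C_{t-1} \setminus C_t$ reduces to the non-selected points of $\mathcal{D}_t$. The names `absorb`, `coreset_vcl` and `predict` are hypothetical. For neural networks, `absorb` would be replaced by the approximate KL minimization.

```python
import random


def absorb(q, data, noise_var=1.0):
    """Multiply Gaussian q = (precision, precision * mean) by the likelihood of `data`.

    The product stays Gaussian, so the projection is exact in this toy model.
    """
    prec, prec_mean = q
    return prec + len(data) / noise_var, prec_mean + sum(data) / noise_var


def predict(q, noise_var=1.0):
    """Step 5: mean and variance of the predictive distribution of a new observation."""
    prec, prec_mean = q
    return prec_mean / prec, 1.0 / prec + noise_var


def coreset_vcl(tasks, prior=(1.0, 0.0), coreset_per_task=2, seed=0):
    rng = random.Random(seed)
    q_tilde = prior  # propagated distribution; it never absorbs coreset points
    coreset = []
    posteriors = []
    for data in tasks:  # Step 1: observe D_t
        # Step 2: C_t = C_{t-1} plus a random subset of D_t.
        k = min(coreset_per_task, len(data))
        picked = set(rng.sample(range(len(data)), k))
        coreset = coreset + [y for i, y in enumerate(data) if i in picked]
        non_coreset = [y for i, y in enumerate(data) if i not in picked]
        # Step 3: update q_tilde with the non-coreset data only.
        q_tilde = absorb(q_tilde, non_coreset)
        # Step 4: absorb the whole coreset to obtain the prediction distribution q_t.
        posteriors.append(absorb(q_tilde, coreset))
    return posteriors


tasks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
posteriors = coreset_vcl(tasks)
print(posteriors[-1], predict(posteriors[-1]))
```

Because the projection is exact here, $q_t$ matches the true posterior whichever points end up in the coreset. The coreset only changes the result when the projection is approximate.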

# 3. VCL in Deep Discriminative Models

* **Multi-head Networks**

>* Standard architecture used for multi-task learning ([Bakker & Heskes, 2003](references/Bakker&Heskes_2003.pdf))
>* Share parameters close to the inputs / Separate heads for each output
>* **More advanced model structures:**
>  * for continual learning ([Rusu et al., 2016](references/Rusu_2016.pdf))
>  * for multi-task learning in general ([Swietojanski & Renals, 2014](references/Swietojanski&Renals_2014.pdf); [Rebuffi et al., 2017](references/Rebuffi_2017.pdf))
>  * **automatic continual model building:** adding new structure as new tasks are encountered
>  * <font color='red'>**Q. Implement VCL for different model architectures?**</font>
>* This paper assumes that the model structure is known *a priori*

* **Formulation**

><img src = 'images/summary_1.png' width=350>

>* Model parameter $\boldsymbol{\theta} = \{ \boldsymbol{\theta}^H_{1:T}, \boldsymbol{\theta}^S \} \in \mathbb{R}^D$
>  * **Shared parameters:** updated constantly
>  * **Head parameters:** $q(\boldsymbol{\theta}^H_K) = p(\boldsymbol{\theta}^H_K)$ at the beginning, then updated incrementally as each task emerges

>* For simplicity, use **Gaussian mean-field approximate posterior** $q_t(\boldsymbol{\theta}) = \prod^D_{d=1} \mathcal{N} (\theta_{t,d} ; \mu_{t,d}, \sigma^2_{t,d})$

* **Network Training**

>* Maximize the negative online variational free energy (equivalently, the variational lower bound to the online marginal likelihood) $\mathcal{L}^t_{VCL}$ with respect to the variational parameters $\{\mu_{t,d},\sigma_{t,d}\}^D_{d=1}$

>$$\mathcal{L}^t_{VCL} (q_t(\boldsymbol{\theta})) = \sum^{N_t}_{n=1} \mathbb{E}_{\boldsymbol{\theta} \sim q_t(\boldsymbol{\theta})} \left[ \log p(y_t^{(n)}|\boldsymbol{\theta},\mathbf{x}^{(n)}_t) \right] - \text{KL} (q_t(\boldsymbol{\theta})||q_{t-1}(\boldsymbol{\theta}))$$

>* $\text{KL} (q_t(\boldsymbol{\theta})||q_{t-1}(\boldsymbol{\theta}))$: tractable in closed form, since both distributions are Gaussian; the prior $q_0(\boldsymbol{\theta})$ is set to a multivariate Gaussian ([Graves, 2011](references/Graves_2011.pdf); [Blundell et al., 2015](references/Blundell_2015.pdf))
>* $\mathbb{E}_{\boldsymbol{\theta} \sim q_t(\boldsymbol{\theta})} [\cdot]$: intractable $\rightarrow$ approximate by employing simple Monte Carlo and using the **local reparameterization trick** to compute the gradients ([Salimans & Knowles, 2013](references/Salimans&Knowles_2013.pdf); [Kingma & Welling, 2014](references/Kingma&Welling_2014.pdf); [Kingma et al., 2015](references/Kingma_2015.pdf))
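
The two terms of $\mathcal{L}^t_{VCL}$ can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it samples the weights directly with the basic reparameterization $\theta = \mu + \sigma \epsilon$ rather than the local reparameterization trick, it computes no gradients, and the names `kl_diag_gauss`, `mc_expected_loglik`, `vcl_objective` and the toy likelihood are hypothetical.

```python
import math
import random


def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two fully factorized Gaussians."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def mc_expected_loglik(mu, sigma, loglik, n_samples=100, seed=0):
    """Simple Monte Carlo estimate of E_{theta ~ q}[loglik(theta)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Reparameterization: theta = mu + sigma * eps, with eps ~ N(0, 1).
        theta = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        total += loglik(theta)
    return total / n_samples


def vcl_objective(mu, sigma, mu_prev, sigma_prev, loglik, **mc_kwargs):
    """Estimated expected log-likelihood of the task data minus KL(q_t || q_{t-1})."""
    expected = mc_expected_loglik(mu, sigma, loglik, **mc_kwargs)
    return expected - kl_diag_gauss(mu, sigma, mu_prev, sigma_prev)


# Toy task: y ~ N(theta, 1) with a single parameter (an assumption of this sketch).
data = [0.9, 1.1, 1.0]


def loglik(theta):
    """Log-likelihood of all points in `data`, i.e. the sum over n in the objective."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta[0]) ** 2 for y in data)


bound = vcl_objective([1.0], [0.5], [0.0], [1.0], loglik, n_samples=500)
print(bound)
```

In practice the same objective would be written in an automatic-differentiation framework so that the estimate can be maximized with respect to $\mu$ and $\sigma$.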

# 4. VCL in Deep Generative Models

* **Deep Generative Models**

>* Can generate realistic images, sounds, and video sequences ([Chung et al., 2015](references/Chung_2015.pdf); [Kingma et al., 2016](references/Kingma_2016.pdf); [Vondrick et al., 2016](references/Vondrick_2016.pdf))
>* Standard batch learning assumes that the observations are i.i.d. and all available at the same time
>* This paper applies the VCL framework to **variational auto-encoders** ([Kingma & Welling, 2014](references/Kingma&Welling_2014.pdf); [Rezende et al., 2014](references/Rezende_2014.pdf))

>* <font color='red'>**Q. Apply VCL to GANs ([Goodfellow et al., 2014b](references/Goodfellow_2014b.pdf))?**</font> - Initial attempt to apply continual learning ([Seff et al., 2017](references/Seff_2017.pdf))

* **Formulation**

><img src = 'images/summary_2.png' width=350>

>* Batch setting: maximize the variational lower bound with respect to the generative model parameters $\boldsymbol{\theta}$ and the variational (encoder) parameters $\phi$

>$$\mathcal{L}_{\text{VAE}} (\boldsymbol{\theta},\phi) = \sum^N_{n=1} \mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}
\left[ \log \frac{p(\mathbf{x}^{(n)}|\mathbf{z}^{(n)},\boldsymbol{\theta})p(\mathbf{z}^{(n)})} {q_\phi (\mathbf{z}^{(n)}|\mathbf{x}^{(n)})} \right]$$
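
>* A standard rewriting of each summand (not specific to this paper) separates a reconstruction term from a KL regularizer on the latent code

>$$\mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})}
\left[ \log \frac{p(\mathbf{x}^{(n)}|\mathbf{z}^{(n)},\boldsymbol{\theta})p(\mathbf{z}^{(n)})} {q_\phi (\mathbf{z}^{(n)}|\mathbf{x}^{(n)})} \right]
= \mathbb{E}_{q_\phi(\mathbf{z}^{(n)}|\mathbf{x}^{(n)})} \left[ \log p(\mathbf{x}^{(n)}|\mathbf{z}^{(n)},\boldsymbol{\theta}) \right]
- \text{KL} \left( q_\phi (\mathbf{z}^{(n)}|\mathbf{x}^{(n)}) \,||\, p(\mathbf{z}^{(n)}) \right)$$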

>* Continual learning setting: maximizing the **full** variational lower bound with respect to $q_t$ and $\phi$

>$$\mathcal{L}^t_{\text{VAE}} (q_t(\boldsymbol{\theta}),\phi) = 
\mathbb{E}_{q_t(\boldsymbol{\theta})}\left\{
\sum^{N_t}_{n=1} \mathbb{E}_{q_\phi(\mathbf{z}_t^{(n)}|\mathbf{x}_t^{(n)})}
\left[ \log \frac{p(\mathbf{x}_t^{(n)}|\mathbf{z}_t^{(n)},\boldsymbol{\theta})p(\mathbf{z}_t^{(n)})} {q_\phi (\mathbf{z}_t^{(n)}|\mathbf{x}_t^{(n)})} \right] \right\}
-\text{KL}(q_t(\boldsymbol{\theta})||q_{t-1}(\boldsymbol{\theta}))$$

# 5. Related Work