# Using Fast Weights to Attend to the Recent Past

* Babelfish / Babelvoice: Part 1 - Deep NLP [1]
* 김무성

# Contents
* Abstract
* 1 Introduction
* 2 Evidence from physiology that temporary memory may not be stored as neural activities
* 3 Fast Associative Memory
    - 3.1 Layer normalized fast weights
* 4 Experimental results
    - 4.1 Associative retrieval
    - 4.2 Integrating glimpses in visual attention models
    - 4.3 Facial expression recognition
    - 4.4 Agents with memory
* 5 Conclusion

# Abstract

#### restriction
* Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: 
    - Neural activities that 
        - represent the current or recent input and
    - weights that 
        - learn to capture regularities among inputs, outputs and payoffs. 
       
#### different time-scales
* Synapses 
    - have dynamics at many different time-scales and 
* this suggests that artificial neural networks might benefit 
    - from variables that change <font color="red">slower than activities</font> but 
    - much <font color="red">faster than the standard weights</font>. 

#### fast weights
* These “fast weights” can be used to 
    - <font color="red">store temporary memories of the recent past</font> and 
* they provide a neurally plausible way of 
    - <font color="red">implementing the type of attention</font> to the past 
        - that has recently proved very helpful in sequence-to-sequence models. 
* By using fast weights 
    - we can <font color="red">avoid</font> 
        - the <font color="red">need to store copies</font> 
            - of <font color="red">neural activity patterns</font>.

# 1 Introduction

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" width=600 />

#### RNN's memory capacity
* Ordinary recurrent neural networks typically have 
    - two types of memory that 
        - have very different time scales, 
        - very different capacities and 
        - very different computational roles. 
* The history of the sequence currently being processed is 
    - stored 
        - in the hidden activity vector, 
            - which acts as 
                - a short-term memory that is 
                    - updated at every time step. 
* The capacity of this memory is
    - <font color="red">O(H)</font> 
        - where H is the number of hidden units. 
* Long-term memory 
    - about how to convert the current input and hidden vectors 
        - into the next hidden vector and 
        - a predicted output vector 
    - is stored 
        - in the weight matrices connecting 
            - the hidden units to themselves and 
            - to the inputs and outputs. 
* These matrices are 
    - typically updated at the end of a sequence and 
    - their capacity is 
        - <font color="red">O(H<sup>2</sup>) + O(IH) + O(HO)</font> 
            - where I and O are the numbers of input and output units.
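
As a rough worked example of these orders of magnitude, the sketch below counts stored numbers as a proxy for capacity. The function name and the layer sizes are assumptions chosen for illustration, not values from the paper.

```python
def memory_capacities(H, I, O):
    """Count stored values for each kind of memory.

    H, I, O are the numbers of hidden, input and output units.
    """
    short_term = H                      # hidden activity vector: O(H)
    long_term = H * H + I * H + H * O   # slow weight matrices: O(H^2) + O(IH) + O(HO)
    return short_term, long_term


# Illustrative sizes only (assumed for this example).
short_term, long_term = memory_capacities(H=100, I=50, O=10)
print(short_term, long_term)  # 100 16000
```

Even for this small network the weights hold far more numbers than the hidden vector, which is the gap the notes point at.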

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png" width=600 />

#### LSTM
* LSTMs are still limited 
    - to a short-term memory capacity of 
        - <font color="red">O(H)</font> 
            - for the history of the current sequence.

#### Third form of memory (fast weights)
* Several researchers [Hinton and Plaut, 1987, Schmidhuber, 1992] have suggested that 
    - neural networks could benefit 
        - from a third form of memory that 
            - has <font color="red">much higher storage capacity</font> 
                - than <font color="blue">the neural activities</font> but 
            - <font color="red">much faster dynamics</font>
                - than <font color="blue">the standard “slow” weights</font>
* This memory could 
    - store 
        - information 
            - specific to the history of the current sequence
    - so that this information is 
        - available 
            - to <font color="red">influence</font> 
                - the <font color="red">ongoing processing</font> 
                    - <font color="blue">without</font> 
                        - using up the memory capacity of 
                            - the <font color="blue">hidden activities</font>

Until recently, however, there was surprisingly little investigation of other forms of memory in recurrent nets despite strong psychological evidence that it exists and obvious computational reasons why it was needed.

# 2 Evidence from physiology that temporary memory may not be stored as neural activities

* The brain implements a variety of short-term plasticity mechanisms that operate on intermediate timescales.
* These plasticity mechanisms are all synapse-specific. Thus they are more accurately modeled by a memory with O(H<sup>2</sup>) capacity than by the O(H) capacity of standard recurrent artificial neural nets and LSTMs.

# 3 Fast Associative Memory
* 3.1 Layer normalized fast weights

One of the main preoccupations of neural network research in the 1970s and early 1980s [Willshaw et al., 1969, Kohonen, 1972, Anderson and Hinton, 1981, Hopfield, 1982] was the idea that <font color="red">memories were not stored by somehow keeping copies of patterns of neural activity</font>. <font color="blue">Instead, these patterns were reconstructed when needed from information stored in the weights of an associative network</font>, and the very same weights could store many different memories.
* An auto-associative memory that
    - has $N^2$ weights cannot be expected to store more than 
        - N real-valued vectors 
            - with N components each. 

<font color="red">A fast associative memory has several advantages</font> when compared with the type of memory assumed by a Neural Turing Machine (NTM) [Graves et al., 2014], Neural Stack [Grefenstette et al., 2015], or Memory Network [Weston et al., 2014]. 
* First, 
    - it is not at all clear how a real brain would implement the more exotic structures in these models e.g., the tape of the NTM, 
    - whereas it is clear that the brain could 
        - implement a fast associative memory in synapses with the appropriate dynamics. 
* Second, 
    - in a fast associative memory 
        - there is no need to decide 
            - where or when to write to memory and 
            - where or when to read from memory.

Every time the input changes there is a transition to a new hidden state which is determined by a <font color="red">combination of three sources of information</font> :
* The new input via the slow input-to-hidden weights, 
    - C, 
* the previous hidden state via the slow transition weights, 
    - W, and 
* the recent history of hidden state vectors via the fast weights, 
    - A.
* The effect of the first two sources of information 
    - on the new hidden state 
    - can be computed once and 
    - then maintained 
        - as a <font color="red">sustained boundary condition</font> 
            - for a brief iterative settling process 
                - which allows the fast weights 
                    - to influence the new hidden state.
* <font color="blue">Assuming that the fast weights decay exponentially</font>, 
    - we now show that 
        - the effect of the fast weights 
            - on the hidden vector 
            - during an iterative settling phase 
        - is to provide an additional input 
            - that is proportional to the sum over all recent hidden activity vectors 
                - of the scalar product of that recent hidden vector 
                    - with the current hidden activity vector, 
                        - with each term in this sum being weighted 
                            - by the decay rate raised to the power of how long ago that hidden vector occurred.
* So fast weights 
    - <font color="blue">act like a kind of attention to the recent past</font> 
    - but with the <font color="blue">strength of the attention</font> 
        - being <font color="blue">determined</font> 
            - by the <font color="blue">scalar product</font> 
                - between the <font color="blue">current hidden vector and the earlier hidden vector</font> 
                    - rather than being determined by a separate parameterized computation of the type used in neural machine translation models

<img src="figures/cap1.png" width=600 />

#### Update rule: fast memory weight matrix, A

The update rule for the fast memory weight matrix, A, is simply to multiply the current fast weights by a decay rate, λ, and add the outer product of the hidden state vector, h(t), multiplied by a learning rate, η:

<img src="figures/cap2.png" width=600 />
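
The rule described above can be sketched in NumPy as follows. The function name and the example vector are assumptions made for illustration. The default decay rate and learning rate are the values these notes quote in section 3.1.

```python
import numpy as np


def update_fast_weights(A, h, lam=0.95, eta=0.5):
    """One step of the rule: A(t) = lam * A(t-1) + eta * h(t) h(t)^T."""
    return lam * A + eta * np.outer(h, h)


H = 4                          # number of hidden units (illustrative)
A = np.zeros((H, H))           # fast weights start at zero
h = np.array([1.0, 0.0, -1.0, 0.5])

A = update_fast_weights(A, h)  # after one step, A = eta * h h^T
A = update_fast_weights(A, h)  # the older contribution is scaled by lam
```

Because every term is an outer product of a vector with itself, A stays symmetric, and older hidden states fade geometrically with the decay rate.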

#### The next vector of hidden activities

The next vector of hidden activities, h(t + 1), is computed in two steps.
* The “preliminary” vector h<sub>0</sub>(t + 1) 
    - is determined 
        - by the combined effects of 
            - the input vector x(t) and 
            - the previous hidden vector: 
                - h<sub>0</sub>(t + 1) = f(Wh(t) + Cx(t)), 
                    - where W and C are slow weight matrices and 
                    - f(.) is the nonlinearity used by the hidden units. 
* The preliminary vector is then used to initiate an “inner loop” iterative process 
    - which runs for S steps and 
    - progressively changes the hidden state into 
        - h(t + 1) = h<sub>S</sub>(t + 1)

<img src="figures/cap3.png" width=600 />

where the terms in square brackets are the sustained boundary conditions.
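
A minimal sketch of this two-step computation, assuming the inner-loop update has the form `f([W h(t) + C x(t)] + A(t) h_s(t+1))`. The function name, the choice of `tanh` for f, and the random weights are assumptions for illustration only.

```python
import numpy as np


def next_hidden(h, x, W, C, A, S=1, f=np.tanh):
    """Compute h(t+1) from h(t) and x(t) with an S-step inner loop.

    W and C are slow weights; A is the fast weight matrix A(t).
    """
    boundary = W @ h + C @ x          # computed once, held fixed while settling
    h_s = f(boundary)                 # preliminary vector h_0(t+1)
    for _ in range(S):
        h_s = f(boundary + A @ h_s)   # fast weights add their input at each step
    return h_s                        # h_S(t+1)


rng = np.random.default_rng(0)
H, I = 5, 3                           # illustrative sizes
W = 0.1 * rng.standard_normal((H, H))
C = 0.1 * rng.standard_normal((H, I))
lam, eta = 0.95, 0.5

h = np.zeros(H)
A = np.zeros((H, H))
for x in rng.standard_normal((4, I)):
    A = lam * A + eta * np.outer(h, h)   # A(t) from A(t-1) and h(t)
    h = next_hidden(h, x, W, C, A, S=1)  # h(t+1) uses A(t)
```

With A set to zero the inner loop changes nothing and the function reduces to an ordinary RNN step.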

<img src="figures/cap1.png" width=600 />

Assuming A is 0 at the beginning of the sequence,

<img src="figures/cap4.png" width=600 />

The term in square brackets is just the scalar product of an earlier hidden state vector, h(τ), with the current hidden state vector, h<sub>s</sub>(t + 1), during the iterative inner loop. 
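
This equivalence can be checked numerically. The sketch below assumes A starts at zero and is built with the outer-product rule. The random vectors are stand-ins for hidden states, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 6, 5
lam, eta = 0.95, 0.5

hs = rng.standard_normal((T, H))   # stand-ins for h(1), ..., h(T)
query = rng.standard_normal(H)     # stand-in for the current h_s(T+1)

# Route 1: build A(T) with the outer-product rule, then multiply.
A = np.zeros((H, H))
for h in hs:
    A = lam * A + eta * np.outer(h, h)
via_matrix = A @ query

# Route 2: attend to the stored hidden vectors directly.
# hs[i] stands for h(tau) with tau = i + 1, so its decay exponent is T - tau.
decay = lam ** (T - 1 - np.arange(T))
scores = hs @ query                          # scalar products h(tau) . h_s
via_attention = eta * (decay * scores) @ hs  # decayed, score-weighted sum of h(tau)

print(np.allclose(via_matrix, via_attention))
```

Route 2 needs every past hidden vector kept in memory, whereas route 1 only keeps the H x H matrix A, which is the saving the fast weights provide.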

#### inner loop
* So at each iteration of the inner loop, 
    - the fast weight matrix is 
        - <font color="red">exactly equivalent to attending to past hidden vectors</font> 
            - in proportion to their scalar product 
                - with the current hidden vector, 
                - weighted by a decay factor.
* During the inner loop iterations, 
    - attention will become 
        - more focussed on 
            - past hidden states that 
                - manage to attract the current hidden state.

## 3.1 Layer normalized fast weights

* A potential problem with fast associative memory is that 
    - the scalar product of two hidden vectors could 
        - vanish or explode 
            - depending on 
                - the norm of the hidden vectors. 
* Recently, layer normalization [Ba et al., 2016] has been shown to be very effective at stabilizing the hidden state dynamics in RNNs and reducing training time.

<img src="figures/cap5.png" width=600 />

* We found that applying layer normalization on each iteration of the inner loop makes the fast associative memory more robust to the choice of learning rate and decay hyper-parameters.
* For the rest of the paper, fast weight models are trained using layer normalization and the outer product learning rule with fast learning rate of 0.5 and decay rate of 0.95, unless otherwise noted.
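
A minimal sketch of layer normalization applied inside the inner loop. It assumes the normalization wraps the summed input before the nonlinearity and omits the learned gain and bias. The function names, `tanh`, and the random values are illustrative assumptions.

```python
import numpy as np


def layer_norm(z, eps=1e-5):
    """Normalize a vector to zero mean and unit variance across its units.

    The learned gain and bias of full layer normalization are omitted here.
    """
    return (z - z.mean()) / np.sqrt(z.var() + eps)


def next_hidden_ln(h, x, W, C, A, S=1, f=np.tanh):
    """Inner loop with layer normalization on every iteration."""
    boundary = W @ h + C @ x
    h_s = f(boundary)                             # preliminary vector
    for _ in range(S):
        h_s = f(layer_norm(boundary + A @ h_s))   # normalize before f
    return h_s


rng = np.random.default_rng(0)
H, I = 5, 3
W = rng.standard_normal((H, H))
C = rng.standard_normal((H, I))
h = rng.standard_normal(H)
x = rng.standard_normal(I)
A = 100.0 * np.outer(h, h)      # deliberately large fast weights

pre = W @ h + C @ x + A @ np.tanh(W @ h + C @ x)
normed = layer_norm(pre)        # scale of A no longer dominates
h_next = next_hidden_ln(h, x, W, C, A, S=1)
```

Whatever the norm of the fast-weight contribution, the vector passed to the nonlinearity has roughly unit variance, which is one way to read the robustness claim above.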

# 4 Experimental results
* 4.1 Associative retrieval
* 4.2 Integrating glimpses in visual attention models
* 4.3 Facial expression recognition
* 4.4 Agents with memory

## 4.1 Associative retrieval

<img src="figures/cap6.png" />

where ‘?’ is the token to separate the query from the key-value pairs.

We generated 100,000 training examples, 10,000 validation examples and 20,000 test examples.
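
A hypothetical generator for examples of this kind is sketched below. It assumes lowercase letters as keys, single digits as values, and a `??` separator before the query. The function name, the number of pairs, and the exact string layout are assumptions, so the format in the figure may differ.

```python
import random
import string


def make_example(num_pairs=4, separator="??", rng=random):
    """Return (input_string, target_digit) for one retrieval example."""
    keys = rng.sample(list(string.ascii_lowercase), num_pairs)  # distinct keys
    values = [rng.choice(string.digits) for _ in keys]          # one digit per key
    query = rng.choice(keys)                                    # key to look up
    pairs = "".join(k + v for k, v in zip(keys, values))
    target = values[keys.index(query)]
    return pairs + separator + query, target


rng = random.Random(0)          # seeded for reproducibility
dataset = [make_example(rng=rng) for _ in range(5)]
for text, target in dataset:
    print(text, "->", target)
```

The model sees the whole string and must output the digit that followed the queried key, so it has to hold all the pairs in temporary memory until the query arrives.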

Figure 1 and Table 1 show that when the number of recurrent units is small, the fast associative memory significantly outperforms the LSTMs with the same number of recurrent units. The result fits with our hypothesis that the fast associative memory allows the RNN to use its recurrent units more effectively. In addition to having higher retrieval accuracy, the model with fast weights also converges faster than the LSTM models.

<img src="figures/cap7.png" width=600 />

## 4.2 Integrating glimpses in visual attention models

<img src="figures/cap8.png" width=600 />

<img src="figures/cap9.png" />

## 4.3 Facial expression recognition

<img src="figures/cap10.png" width=600 />

<img src="figures/cap11.png" width=600 />

## 4.4 Agents with memory

<img src="figures/cap12.png" width=600 />

# 5 Conclusion

# References
* [1] (paper) Using Fast Weights to Attend to the Recent Past
* [2] (code) ajarai/fast-weights - https://github.com/ajarai/fast-weights
* [3] https://theneuralperspective.com/2016/12/04/implementation-of-using-fast-weights-to-attend-to-the-recent-past/