In [24]:
import math, time, os, datetime, shutil, pickle

import numpy as np

import torch
from torch import nn
import torch.nn.functional as F

import import_ipynb
from MoveData import *
from Elements import MultiHeadAttention, Norm, FeedForward
from EncoderDecoder import Encoder, Decoder
from Talk import *
from Trainer import *

importing Jupyter notebook from Trainer.ipynb


## Relational Memory Core

From a big-picture perspective, the transformer we have built so far only allows us to map an input sequence to an output sequence. All the layers we put together serve this action -> reaction task. When you are talking to someone, though, your responses are not based only on what they said last. They are also based on what you said earlier in the conversation, your memory of past conversations, your current impression of the person, and other knowledge about this person and their relationship with the world. This requires an internal state that persists through time: a memory. In your brain, you have just neurons and the cells that support those neurons. Here we are building a form of neural memory. 

In the work by [Santoro, A. et al](https://arxiv.org/pdf/1806.01822.pdf) from DeepMind, Santoro and colleagues built on past work on adding memory to neural networks and devised what they called the Relational Memory Core (RMC). The image below summarizes this type of neural memory. 

<img src="../saved/images/RMC_Overview.png" height=400 width=600>

The image above shows a high level graphic of how the memory, in the form of a matrix called Previous Memory, is combined with current experience in the form of an input vector, using attention, A, into a Next or Updated memory. 

The RMC has three high level steps, with a residual and normalization layer between each one. They are 1. Attention, 2. Multi-Layer-Perceptron (MLP) and 3. Gating. 

The cell below creates a blank initial memory matrix, which an agent will fill with its memories. The memory matrix has `mem_slots` rows and `mem_size` columns. 

In [3]:
teaching = False

def initial_memory(mem_slots, mem_size, batch_size):
    """Creates the initial memory.
    We should ensure each row of the memory is initialized to be unique,
    so initialize the matrix to be the identity. We then pad or truncate
    as necessary so that init_state is of size(mem_slots, mem_size).
    Args:
      mem_slots: rows in memory matrix
      mem_size: columns in memory matrix
      batch_size: batch size
    Returns:
      init_state: A truncated or padded identity matrix of size (batch_size,mem_slots, mem_size)
    """
    with torch.no_grad():
        init_state = torch.stack([torch.eye(mem_slots) for _ in range(batch_size)])

    # Pad the matrix with zero columns.
    if mem_size > mem_slots:
        difference = mem_size - mem_slots
        pad = torch.zeros((batch_size, mem_slots, difference))
        init_state = torch.cat([init_state, pad], -1)
    # Truncate: keep the first `mem_size` columns.
    elif mem_size < mem_slots:
        init_state = init_state[:, :, :mem_size]
    return init_state


if teaching:
    mem_slots=4
    mem_size=8
    batch_size=1
    memory = initial_memory(mem_slots=mem_slots,mem_size=mem_size,batch_size=batch_size)
    print("Initial Memory")
    print(memory, memory.shape)
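
As a side note, `torch.eye` accepts a second argument for the number of columns, so the pad-or-truncate logic can be sketched more compactly. The helper name `initial_memory_compact` is hypothetical and not part of this notebook. It is only meant to show that the result matches the padded or truncated identity described above.

```python
import torch

def initial_memory_compact(mem_slots, mem_size, batch_size):
    # torch.eye(n, m) returns an n x m matrix with ones on the main diagonal,
    # i.e. the identity padded with zero columns (m > n) or truncated (m < n).
    eye = torch.eye(mem_slots, mem_size)
    # Add a batch dimension and copy the matrix once per batch element.
    return eye.unsqueeze(0).repeat(batch_size, 1, 1)

wide = initial_memory_compact(mem_slots=4, mem_size=8, batch_size=2)
tall = initial_memory_compact(mem_slots=4, mem_size=3, batch_size=2)
print(wide.shape)  # torch.Size([2, 4, 8])
print(tall.shape)  # torch.Size([2, 4, 3])
```

Either way, every memory slot starts out as a distinct row, which is the property the docstring asks for.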

In the image below, notice the 4 x 6 matrix of grey dots labeled memory. This matrix stores information from the past: past memories. The light grey vector labeled input represents new information from the current time point that we wish to incorporate into our past memories so that it is saved for the future. 

<img src="../saved/images/RMC_MHDPA.png"  height=600 width=800>

The lower panel describes a multi-headed attention mechanism that turns the past memories into the updated memories. The sequence of operations is the same as in the multi-headed attention we have already learned in the Transformer. 

In the Transformer we used this type of attention 3 ways:

1. For Encoder-Encoder Attention in which the source sequence attends to itself

2. For Self-Attention in which the Decoder Output attends to the Decoder Inputs so far

3. For Decoder-Encoder Attention in which the Decoder Output attends to the Encoder Outputs

Recall the forward method of class MultiHeadAttention:

`def forward(self, q, k, v, mask=None, explain=False):`

`q` has shape (batch size, q_sequence length, embedding dimensions), and the output of the MultiHeadAttention has the same shape as `q`. In the RMC, `q` is our previous memory of shape (batch size, mem_slots, mem_size). The previous memory attends to `mem_plus_input`, a matrix that contains the previous memory plus one extra row: the input vector that represents the current experience to be incorporated into memory. `mem_plus_input` has shape (batch size, mem_slots + 1, mem_size), and the updated memory that comes out of the attention has the same shape as the previous memory.
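
The shape bookkeeping can be checked with a short sketch. It uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in, since the notebook's own `MultiHeadAttention` class lives in `Elements.ipynb`. The choice of 2 heads is an arbitrary assumption made only so that the embedding size divides evenly.

```python
import torch
from torch import nn

batch_size, mem_slots, mem_size = 1, 4, 8

memory = torch.eye(mem_slots, mem_size).unsqueeze(0)   # previous memory, (1, 4, 8)
input_vector = torch.randn(batch_size, mem_size)       # current experience, (1, 8)

# Append the input as one extra row below the memory slots.
mem_plus_input = torch.cat([memory, input_vector.unsqueeze(1)], dim=-2)  # (1, 5, 8)

# Stand-in attention layer: PyTorch's built-in module, not the notebook's class.
attention = nn.MultiheadAttention(embed_dim=mem_size, num_heads=2, batch_first=True)
new_memory, weights = attention(memory, mem_plus_input, mem_plus_input)

print(new_memory.shape)  # torch.Size([1, 4, 8]), same shape as the query (memory)
print(weights.shape)     # torch.Size([1, 4, 5]), one weight per (slot, row) pair
```

The output keeps the query's shape no matter how many rows the key and value matrix has, which is why adding the input as an extra row does not change the size of the memory.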

Using Decoder-Encoder Attention as an analogy, the previous memory plays the role of the sequence that the q projection is derived from, the Decoder Input. The concatenation of the previous memory with the input as a new row is analogous to the sequence that the k and v projections are derived from, the Encoder Output. 

The <font color='green'>weights matrix (q_seq_len, k_seq_len)</font> is analogous to the score matrix in the Transformer. w1,2 in the diagram is the amount of attention that the 1st slot of the previous memory should pay to the 2nd row of `mem_plus_input`. The scores are normalized just as in the Transformer, by dividing by the quantity below and then applying a softmax:

$$\sqrt{\text{memory size}}$$

Looking at the <font color='green'>Normalized Weights</font> in the diagram, notice that in the bottom row, the row selected by the grey rectangle, the first green dot is the most <font color='green'>green</font>. The dot product of this row of length 5 with each column of the yellow <font color='yellow'>Values</font> matrix gives each element of the bottom row of the new updated memory matrix. Because the first green dot is the most green, the first row of the yellow <font color='yellow'>Values</font> matrix receives the largest weight, so the bottom row of the updated memory will be most similar to that first row.  
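
The weighted sum described above can be sketched by hand for a single attention head. The `queries`, `keys` and `values` tensors below are random stand-ins for the learned projections, which is an assumption made to keep the example short.

```python
import math
import torch

torch.manual_seed(0)
mem_slots, mem_size = 4, 8

queries = torch.randn(mem_slots, mem_size)      # stand-in for projected previous memory
keys = torch.randn(mem_slots + 1, mem_size)     # stand-in for projected mem_plus_input
values = torch.randn(mem_slots + 1, mem_size)   # stand-in for projected mem_plus_input

# Raw weights: entry (i, j) is how much slot i attends to row j of mem_plus_input.
weights = queries @ keys.T / math.sqrt(mem_size)   # (4, 5)
normalized = torch.softmax(weights, dim=-1)        # each row sums to 1

# Every row of the updated memory is a weighted average of the rows of `values`.
updated = normalized @ values                      # (4, 8)

# The bottom row, computed explicitly as a weighted sum of the value rows.
bottom_row = sum(normalized[-1, j] * values[j] for j in range(mem_slots + 1))
print(torch.allclose(updated[-1], bottom_row, atol=1e-6))
```

Whichever entry of a normalized row is largest decides which row of `values` dominates the corresponding row of the updated memory.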

The cell below performs this attention step. After the Attention step notice the line `new_mem_norm = NormalizeMemory1(new_memory + memory)`. This line is both the normalization step and also the residual. Residual is simply adding a vector to the version of itself before it was modified, in this case, adding the memory matrix to the version of itself before attention was applied. 

In [11]:
if teaching:
    print("Input vector that represents your current experience")
    input_vector = torch.randn((batch_size,mem_size))
    print(input_vector ,input_vector.shape)
    memory_plus_input = torch.cat([memory, input_vector.unsqueeze(1)], dim=-2) 
    print("--------------------------------------------------------------")
    print("Previous Memory with the new input as the bottom row")
    print(memory_plus_input, memory_plus_input.shape)
    updatememory = MultiHeadAttention(num_heads=3, emb_dim=8, dim_k=4, dropout=0.0)
    new_memory, scores = updatememory(memory, memory_plus_input, memory_plus_input)
    print("--------------------------------------------------------------")
    print("Next Memory after Attention Step but Before MLP and Gating Steps")
    print(new_memory, new_memory.shape)
    NormalizeMemory1 = Norm(emb_dim=8)
    new_mem_norm = NormalizeMemory1(new_memory + memory)
    new_mem_norm.shape

Input vector that represents your current experience
tensor([[ 0.2986, -0.4072,  1.3784,  0.8294,  1.2003,  0.1381, -1.1695,  0.2096]]) torch.Size([1, 8])
--------------------------------------------------------------
Previous Memory with the new input as the bottom row
tensor([[[ 1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
           0.0000],
         [ 0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
           0.0000],
         [ 0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,  0.0000,
           0.0000],
         [ 0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,
           0.0000],
         [ 0.2986, -0.4072,  1.3784,  0.8294,  1.2003,  0.1381, -1.1695,
           0.2096]]]) torch.Size([1, 5, 8])
--------------------------------------------------------------
Next Memory after Attention Step but Before MLP and Gating Steps
tensor([[[-0.0562,  0.0282,  0.0104,  0.4376,  0.0601, -0.2478,  0.2248,
           0.4245],
         [-0.0606, 
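
The residual plus normalization step can also be looked at in isolation. This is a minimal sketch that assumes the notebook's `Norm` class behaves like PyTorch's `nn.LayerNorm`. That assumption is not verified here, and `attended` is a random stand-in for the attention output.

```python
import torch
from torch import nn

torch.manual_seed(0)
memory = torch.eye(4, 8).unsqueeze(0)    # previous memory, (1, 4, 8)
attended = torch.randn(1, 4, 8)          # stand-in for the attention output

layer_norm = nn.LayerNorm(8)             # assumed to behave like the notebook's Norm
residual = attended + memory             # residual: add back the unmodified memory
normalized = layer_norm(residual)        # normalize each slot over its 8 features

print(normalized.mean(dim=-1))                 # close to 0 for every slot
print(normalized.std(dim=-1, unbiased=False))  # close to 1 for every slot
```

The residual keeps a direct path from the old memory to the new one, and the normalization keeps every slot on a comparable scale before the next step.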

In [65]:
class Memory(nn.Module):
    def __init__(self, in_vocab_size, out_vocab_size, emb_dim, 
                 n_layers, num_heads, mem_slots, dropout):
        
        super().__init__() 
        
        self.mem_slots = mem_slots
        self.mem_size = emb_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.dim_k = self.mem_size // self.num_heads
        self.batch_size = None 
        
        with torch.no_grad():
            self.memory = torch.eye(self.mem_slots)
        if self.mem_size > self.mem_slots:
          difference = self.mem_size - self.mem_slots
          pad = torch.zeros((self.mem_slots, difference))
          self.memory = torch.cat([self.memory, pad], -1)
        elif self.mem_size < self.mem_slots:
          self.memory = self.memory[:, :self.mem_size]
        
        mem_mask = np.ones((self.mem_slots,self.mem_slots)).astype('uint8')
        self.mem_mask =  torch.from_numpy(mem_mask) == 1
        
        self.encoder = Decoder(in_vocab_size, emb_dim, n_layers, num_heads, dropout)
        self.decoder = Decoder(out_vocab_size, emb_dim, n_layers, num_heads, dropout)
        self.out = nn.Linear(emb_dim, out_vocab_size)
        
        #self.rem_vec =  nn.Parameter(torch.randn(1, 1, emb_dim))
        #self.register_parameter("remember_vector", self.remember)
        
        self.memory_update = MultiHeadAttention(self.num_heads,self.mem_size,
                                                self.dim_k,self.dropout)
        
    def batch_memory(self,src_seq):
        self.batch_size = src_seq.size(0)
        self.memory = torch.stack([self.memory for _ in range(self.batch_size)])
        self.mem_mask = torch.stack([self.mem_mask for _ in range(self.batch_size)])
        #self.rem_vec = torch.stack([self.rem_vec for _ in range(self.batch_size)])
        
    def forward(self, src_seq, trg_seq, src_mask, trg_mask):
        # add the batch dimension to our memory related tensors
        if self.batch_size is None: self.batch_memory(src_seq)
        print(src_seq.shape,self.memory.shape,src_mask.shape,self.mem_mask.shape)
        e_output = self.encoder(src_seq,self.memory,src_mask,self.mem_mask)
        
        # re-represent the source sequence in context of the memory
        #e_m_output = torch.cat([e_output, self.rem_vec], dim=-2) 
        #m_output = e_m_output[:,:-1,:]
        #memory_vector = e_m_output[:,-1,:]
    
        d_output = self.decoder(trg_seq, e_output, src_mask, trg_mask)
        output = self.out(d_output)
        
        mem_plus_dialogue = torch.cat([self.memory,e_output,d_output], dim=-2) 
        self.memory, scores = self.memory_update(self.memory,
                                                 mem_plus_dialogue,
                                                 mem_plus_dialogue)
    
        return output

In [66]:
opt = Options(batchsize=2, device = torch.device("cpu"), epochs=5, lr=0.01, 
              max_len = 25, save_path = '../saved/weights/memory_weights')

data_iter, infield, outfield, opt = json2datatools(path='../saved/pairs.json', opt=opt)

emb_dim, n_layers, num_heads, mem_slots, dropout = 32, 3, 8, 4, 0.01 

chloe = Memory(len(infield.vocab), len(outfield.vocab), 
               emb_dim, n_layers, num_heads, mem_slots, dropout)

In [67]:
conversation_list = [
    {"listen":"my name is fluffy", "reply":"hello fluffy!"},
    {"listen":"what is my name?", "reply":"its fluffy silly"},
    {"listen":"my name is snuggles", "reply":"hello snuggles!"},
    {"listen":"what is my name?", "reply":"its snuggles silly"},
                    ]
def convo_trainer(conversation_list, model, options):

    # use the function arguments rather than the global `chloe` and `opt`
    optimizer = torch.optim.Adam(model.parameters(), lr=options.lr, betas=(0.9, 0.98), eps=1e-9)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.5, patience=5)

    sos_tok = torch.LongTensor([[outfield.vocab.stoi['<sos>']]]) 
    eos_tok = torch.LongTensor([[outfield.vocab.stoi['<eos>']]]) 

    model.train()
    start = time.time()
    best_loss = 100
    for epoch in range(options.epochs):
        total_loss = 0
        for i in range(len(conversation_list)):
            listen_sequence = string2tensor(conversation_list[i]["listen"], infield)
            # replies feed the decoder, so index them with the output vocabulary
            reply_sequence = string2tensor(conversation_list[i]["reply"], outfield)
            decoder_input = torch.cat((sos_tok,reply_sequence,eos_tok), dim=1)
            decoder_target = torch.cat((reply_sequence,eos_tok), dim=1).contiguous().view(-1)
            src_mask, trg_mask = create_masks(listen_sequence, decoder_input, options)
            
            output = model(listen_sequence, decoder_input, src_mask, trg_mask) 

    return model

In [68]:
chloe = convo_trainer(conversation_list, chloe, options=opt)

torch.Size([1, 4]) torch.Size([1, 4, 32]) torch.Size([1, 1, 4]) torch.Size([1, 4, 4])
torch.Size([1, 5]) torch.Size([1, 4, 32]) torch.Size([1, 1, 5]) torch.Size([1, 4, 4])


RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 3

In [58]:
#
#chloe = trainer(chloe, data_iter, opt, optimizer, scheduler)

Returning to the step-by-step walkthrough of the RMC, the next step applies the same Feed Forward Neural Network to each memory slot of the updated memory and performs the 2nd residual + normalization.

In [14]:
if teaching:
    MLP = FeedForward(emb_dim=8, ff_dim=16, dropout=0.2)
    mem_mlp = MLP(new_mem_norm)
    NormalizeMemory2 = Norm(emb_dim=8)
    new_mem_norm2 = NormalizeMemory2(mem_mlp + new_mem_norm)
    print(new_mem_norm2)

tensor([[[ 1.8800, -0.6799, -0.9192,  0.6317, -0.5448, -1.0849,  0.2395,
           0.4777],
         [-0.7467,  1.6685, -0.8662,  0.8525, -0.5407, -1.2059,  0.2150,
           0.6234],
         [-0.4988, -0.7131,  1.3087,  0.8573, -0.7955, -1.5029,  0.6804,
           0.6638],
         [-0.3512, -0.3280, -0.6274,  2.0999, -0.7908, -0.9259,  0.2931,
           0.6304]]], grad_fn=<AddBackward0>)
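
To see what "the same network applied to each memory slot" means, the sketch below compares one pass over the whole memory with a slot-by-slot pass. The `nn.Sequential` network is a stand-in for the notebook's `FeedForward` class, with dropout left out so the comparison is deterministic. Both choices are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
mem_slots, mem_size, ff_dim = 4, 8, 16

# Stand-in for the notebook's FeedForward block (dropout omitted on purpose).
mlp = nn.Sequential(nn.Linear(mem_size, ff_dim), nn.ReLU(), nn.Linear(ff_dim, mem_size))

memory = torch.randn(1, mem_slots, mem_size)
all_at_once = mlp(memory)                         # (1, 4, 8)

# Apply the same network to each slot separately and stack the results.
slot_by_slot = torch.stack([mlp(memory[:, i]) for i in range(mem_slots)], dim=1)

print(torch.allclose(all_at_once, slot_by_slot, atol=1e-6))
```

The slots do not interact in this step. Any mixing of information between slots happens in the attention step before it.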


The last step is the Gating step. This step is inspired by the Gated Recurrent Unit (Cho et al., 2014) and the [LSTM](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Hochreiter & Schmidhuber, 1997). In recurrent neural networks, gates are used mainly to counter the [vanishing gradient problem](https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem). Here we use the gate (z_t) for a slightly different reason: it makes an element-wise decision about how much of the previous state to replace, so that whatever fraction is removed from the previous memory is added back from the updated memory. 

$$z_t = \sigma(W_z \cdot [m_{t - 1}, x_t])$$

$$m_{t} = (1 - z_t) \circ m_{t - 1} + z_t \circ \tilde{m}_{t}$$

Here $\tilde{m}_{t}$ is the candidate memory produced by the attention and MLP steps, and $\circ$ is element-wise multiplication. *z_t* is a matrix that is the same shape as the memory matrix. Since it comes out of a [sigmoid function](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6), its values are all between 0 and 1. 

Suppose element ij of z_t is 0.2. This means that for that element, (1 - z_t) = 0.8 of the value will come from the previous memory m_(t - 1), and z_t = 0.2 of the value will come from the candidate memory.
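
A tiny worked example of this blend, using made-up numbers: the previous memory is all ones, the newly computed memory is all zeros, and the gate values are picked by hand rather than produced by a learned layer.

```python
import torch

previous = torch.tensor([[1.0, 1.0], [1.0, 1.0]])    # previous memory
candidate = torch.tensor([[0.0, 0.0], [0.0, 0.0]])   # newly computed memory
z_t = torch.tensor([[0.2, 0.5], [0.0, 1.0]])         # hand-picked gate values in [0, 1]

# Element-wise blend: (1 - z) of the old value plus z of the new value.
blended = (1 - z_t) * previous + z_t * candidate
print(blended)
```

The element with a gate of 0.2 ends up at 0.8, keeping 80% of the old value. A gate of 0 leaves the old value untouched, and a gate of 1 overwrites it completely.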

In [15]:
if teaching:
    print(memory.shape, input_vector.shape)
    input_stack = torch.stack([input_vector for _ in range(mem_slots)], dim=1)
    print(input_stack.shape)
    h_old_x = torch.cat([memory, input_stack], dim = -1)
    print(h_old_x.shape)
    ZGATE = nn.Linear(mem_size*2, mem_size)
    z_t = torch.sigmoid(ZGATE(h_old_x)) # (batch size, memory slots, memory size)
    print(z_t.shape)
    print(ZGATE.weight.shape)
    new_memory = (1 - z_t)*memory + z_t*new_mem_norm2

torch.Size([1, 4, 8]) torch.Size([1, 8])
torch.Size([1, 4, 8])
torch.Size([1, 4, 16])
torch.Size([1, 4, 8])
torch.Size([8, 16])


## The Relational Memory Core *class*
Putting it all together to create our Relational Memory Core (RelMemCore) class.

In [85]:
class RelMemCore(nn.Module):
    
    def __init__(self, mem_slots, mem_size, num_heads, dim_k=None, dropout=0.1):
        super(RelMemCore, self).__init__()
        self.mem_slots = mem_slots
        self.mem_size = mem_size
        self.num_heads = num_heads
        self.dropout = dropout
        self.dim_k = dim_k if dim_k else self.mem_size // num_heads
        self.attn_mem_update = MultiHeadAttention(self.num_heads,self.mem_size,
                                                  self.dim_k,self.dropout)
        self.normalizeMemory1 = Norm(self.mem_size)
        self.normalizeMemory2 = Norm(self.mem_size)
        self.MLP = FeedForward(self.mem_size, ff_dim=self.mem_size*2, dropout=dropout)
        self.ZGATE = nn.Linear(self.mem_size*2, self.mem_size)
        
    def initial_memory(self, batch_size):
        """Creates the initial memory.
        To ensure each row of the memory is initialized to be unique,
        initialize the matrix as the identity, then pad or truncate
        so that each memory is of size (mem_slots, mem_size).
        Args:
          batch_size: number of copies of the initial memory to create
        Returns:
          init_mem: A truncated or padded identity matrix of size
            (batch_size, mem_slots, mem_size)
        """
        with torch.no_grad():
            init_mem = torch.stack([torch.eye(self.mem_slots) for _ in range(batch_size)])
            
        # Pad the matrix with zero columns.
        if self.mem_size > self.mem_slots:
            difference = self.mem_size - self.mem_slots
            pad = torch.zeros((batch_size, self.mem_slots, difference))
            init_mem = torch.cat([init_mem, pad], -1)
        # Truncate: keep the first `self.mem_size` columns.
        elif self.mem_size < self.mem_slots:
            init_mem = init_mem[:, :, :self.mem_size]
        
        return init_mem
        
    def update_memory(self, input_vector, prev_memory):
        '''
        inputs
         input_vector (batch_size, mem_size)
         prev_memory - previous or past memory (batch_size, mem_slots, mem_size)
        output
         next_memory - updated memory (batch_size, mem_slots, mem_size)
        '''
        mem_plus_input = torch.cat([prev_memory, input_vector.unsqueeze(1)], dim=-2) 
        new_mem, scores = self.attn_mem_update(prev_memory, mem_plus_input, mem_plus_input)
        new_mem_norm = self.normalizeMemory1(new_mem + prev_memory)
        mem_mlp = self.MLP(new_mem_norm)
        new_mem_norm2 = self.normalizeMemory2(mem_mlp + new_mem_norm)
        input_stack = torch.stack([input_vector for _ in range(self.mem_slots)], dim=1)
        h_old_x = torch.cat([prev_memory, input_stack], dim = -1)
        z_t = torch.sigmoid(self.ZGATE(h_old_x)) # (batch size, memory slots, memory size)
        next_memory = (1 - z_t)*prev_memory + z_t*new_mem_norm2
        return next_memory

In [98]:
if teaching:
    
    rmc = RelMemCore(mem_slots=4, mem_size=8, num_heads=3)
    cur_mem = rmc.initial_memory(batch_size=1)
    input_vector = torch.randn((batch_size,mem_size))
    new_memory = rmc.update_memory(input_vector, cur_mem)
    print(cur_mem, cur_mem.shape)
    print("------------------------------------------")
    print(new_memory, new_memory.shape)

tensor([[[1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 1., 0., 0., 0., 0.]]]) torch.Size([1, 4, 8])
------------------------------------------
tensor([[[-0.1236,  0.0346, -1.3370, -0.3710,  0.1886,  0.3484,  1.3409,
           1.0054]]], grad_fn=<RepeatBackward>) torch.Size([1, 1, 8])
------------------------------------------
tensor([[[ 1.6770, -0.6598,  0.0461, -0.1969, -0.3158,  0.1949, -0.2053,
          -0.0955],
         [ 0.5045,  1.3225, -0.1530, -0.3141, -0.7260,  0.2137, -0.2168,
          -0.2624],
         [ 0.3497, -0.8006,  1.4461, -0.0663, -0.1956,  0.1075, -0.1415,
          -0.2532],
         [ 0.3222, -0.7927,  0.0070,  1.6306, -0.1580,  0.1044, -0.3018,
          -0.2089]]], grad_fn=<AddBackward0>) torch.Size([1, 4, 8])


## Chloe, but with memory

The way that our new model will take into account memory is by re-representing the encoding based on this memory using . . . you guessed it, attention. 

`m_output, m_scores = self.mem_encoder(e_output,self.current_memory,self.current_memory)`

We still need to decide when in the conversation to update the memory, and then implement that update.
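
The shapes involved in this re-representation can be sketched as follows. PyTorch's built-in `nn.MultiheadAttention` is again used as a stand-in for the notebook's `MultiHeadAttention`, and `e_output` is random data standing in for a real encoder output. Both are assumptions for illustration.

```python
import torch
from torch import nn

batch_size, seq_len, mem_slots, emb_dim = 1, 6, 4, 32

e_output = torch.randn(batch_size, seq_len, emb_dim)          # stand-in encoder output
current_memory = torch.eye(mem_slots, emb_dim).unsqueeze(0)   # initial memory, (1, 4, 32)

# The encoding is the query; the memory supplies the keys and the values.
mem_attention = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=8, batch_first=True)
m_output, m_scores = mem_attention(e_output, current_memory, current_memory)

print(m_output.shape)  # torch.Size([1, 6, 32]), same shape as the encoding
print(m_scores.shape)  # torch.Size([1, 6, 4]), one weight per (token, memory slot) pair
```

Because the output has the same shape as the encoder output, it can be passed to the decoder in place of `e_output` without changing the decoder.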

In [87]:
class MemoryTransformer(nn.Module):
    def __init__(self, in_vocab_size, out_vocab_size, emb_dim, n_layers, num_heads, mem_slots, dropout):
        
        super(MemoryTransformer, self).__init__() 
        
        self.mem_slots = mem_slots
        self.mem_size = emb_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.dim_k = self.mem_size // self.num_heads
        
        self.encoder = Encoder(in_vocab_size, emb_dim, n_layers, num_heads, dropout)
        self.rmc = RelMemCore(mem_slots, mem_size=emb_dim, num_heads=num_heads)
        
        self.current_memory = self.rmc.initial_memory(batch_size=1)
        
        self.mem_encoder = MultiHeadAttention(num_heads,self.mem_size,self.dim_k,dropout)
        self.decoder = Decoder(out_vocab_size, emb_dim, n_layers, num_heads, dropout)
        self.out = nn.Linear(emb_dim, out_vocab_size)
             
    def forward(self, src_seq, trg_seq, src_mask, trg_mask):
        e_output = self.encoder(src_seq, src_mask)
        m_output, m_scores = self.mem_encoder(e_output,self.current_memory,self.current_memory)
        d_output = self.decoder(trg_seq, m_output, src_mask, trg_mask)
        output = self.out(d_output)
        return output

In [94]:
if teaching:
    opt = Options(batchsize=2, device = torch.device("cpu"), epochs=25, lr=0.01, 
                  beam_width=3, max_len = 25, save_path = '../saved/weights/model_weights')

    data_iter, infield, outfield, opt = json2datatools(path='../saved/pairs.json', opt=opt)

    emb_dim, n_layers, num_heads, mem_slots, dropout = 32, 3, 8, 4, 0.01 
    chloe = MemoryTransformer(len(infield.vocab), len(outfield.vocab), 
                              emb_dim, n_layers, num_heads, mem_slots, dropout)

In [100]:
def load_subset_weights(whole_model, opt):
    '''
    This function loads weights from a saved model whose parameters are a subset of your model's.
    It looks for the named parameters that match and copies those, and it will not crash on
    parameters that don't have a matching name in the saved file.
    '''
    subset_model_dict = torch.load(opt.save_path)
    whole_model_dict = whole_model.state_dict() 
    for name, param in whole_model_dict.items(): 
        if name in subset_model_dict:
            whole_model_dict[name].copy_(subset_model_dict[name])

if teaching:
    # This function allows you to load saved weights from a saved model that is a subset of your model
    load_subset_weights(chloe, opt)
    # talk_to_chloe only uses the encoder and decoder, it does not use the memory encoder 
    print(talk_to_chloe("how?", chloe, opt, infield, outfield))

meowci beaucoup
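
An alternative way to get the same effect is to filter the saved dictionary and call `load_state_dict` with `strict=False`. The sketch below is only an illustration: `SmallModel` and `BigModel` are hypothetical toy classes, and an in-memory `state_dict()` stands in for `torch.load(opt.save_path)`.

```python
import torch
from torch import nn

class SmallModel(nn.Module):          # hypothetical model whose weights were saved
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)

class BigModel(nn.Module):            # hypothetical model with one extra component
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.memory = nn.Linear(4, 4)

small, big = SmallModel(), BigModel()
saved = small.state_dict()            # stands in for torch.load(opt.save_path)

# Keep only entries whose name and shape match, then load non-strictly.
target = big.state_dict()
matching = {name: tensor for name, tensor in saved.items()
            if name in target and tensor.shape == target[name].shape}
result = big.load_state_dict(matching, strict=False)

print(result.missing_keys)            # parameters of `big` that were not loaded
print(torch.equal(big.encoder.weight, small.encoder.weight))
```

The returned `missing_keys` list makes it easy to see which parts of the larger model, here the memory component, still have their freshly initialized weights.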


The next lesson is about teaching our neural network to work towards a goal. Open `Reinforce.ipynb` for the next part in our intellectual adventure.


## How can I help you or get help from you?

[Support *ChloeRobotics* on Patreon](https://www.patreon.com/chloerobotics)

email chloe.the.robot [at] gmail [dot] com 