### DeepPavlov sequence-to-sequence tutorial

In this tutorial we are going to implement a sequence-to-sequence model [[original paper]](https://arxiv.org/abs/1409.3215) in DeepPavlov.

Sequence-to-sequence is the concept of mapping an input sequence to a target sequence. Sequence-to-sequence models consist of two main components: an encoder and a decoder. The encoder encodes the input sequence into a dense representation, and the decoder uses this dense representation to generate the target sequence.

![sequence-to-sequence](img/seq2seq.png)

Here, the input sequence is ABC, and the special token <EOS\> (end of sequence) signals the decoder to start generating the target sequence WXYZ.

To implement this model in DeepPavlov we have to code some DeepPavlov abstractions:
* **DatasetReader** to read the data
* **DatasetIterator** to generate batches
* **Vocabulary** to convert words to indexes
* **Model** to train it and then use it
* and some other components for pre- and postprocessing

In [2]:
%load_ext autoreload
%autoreload 2

import deeppavlov
import json
import numpy as np
import tensorflow as tf

from itertools import chain
from pathlib import Path

### Download & extract dataset

In [2]:
from deeppavlov.core.data.utils import download_decompress
download_decompress('http://lnsigo.mipt.ru/export/datasets/personachat_v2.tar.gz', './personachat')

2018-07-09 21:27:10.661 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): lnsigo.mipt.ru
2018-07-09 21:27:12.364 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://lnsigo.mipt.ru:80 "GET /export/datasets/personachat_v2.tar.gz HTTP/1.1" 200 223217972
2018-07-09 21:27:12.370 INFO in 'deeppavlov.core.data.utils'['utils'] at line 65: Downloading from http://lnsigo.mipt.ru/export/datasets/personachat_v2.tar.gz to /Users/yaroina-kente/personachat/personachat_v2.tar.gz
100%|██████████| 223M/223M [1:02:56<00:00, 59.1kB/s] 
2018-07-09 22:30:08.427 INFO in 'deeppavlov.core.data.utils'['utils'] at line 149: Extracting personachat/personachat_v2.tar.gz archive into personachat


### DatasetReader

DatasetReader is used to read and parse data from files. Here, we define a new PersonaChatDatasetReader which reads the [PersonaChat dataset](https://arxiv.org/abs/1801.07243). The PersonaChat dataset consists of dialogs and user personalities.

User personality is described by four sentences, e.g.:

    i like to remodel homes.
    i like to go hunting.
    i like to shoot a bow.
    my favorite holiday is halloween.
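
To make the file layout concrete, here is a minimal sketch of how a single raw line can be classified. The sample lines are made up for illustration; the layout they follow (leading line index, `your persona: ` prefix, tab-separated dialog fields, candidates joined by `|`) is only inferred from the parsing code in this tutorial, and `parse_line` is a hypothetical helper, not part of DeepPavlov.

```python
# Hypothetical sample lines that mimic the layout expected by the reader.
SAMPLE_LINES = [
    "1 your persona: i like to remodel homes.",
    "2 your persona: i like to go hunting.",
    "3 hi , how are you ?\tgood , you ?\t\tno thanks .|good , you ?|i like cats .",
]

PERSONA_PREFIX = "your persona: "


def parse_line(raw_line):
    """Classify one raw line as a persona sentence or a dialog turn."""
    # Drop the leading line index, keep everything after the first space.
    _, _, line = raw_line.strip().partition(" ")
    if line.startswith(PERSONA_PREFIX):
        return ("persona", line[len(PERSONA_PREFIX):])
    # Dialog turn: x <TAB> y <TAB> (unused field) <TAB> candidates joined by '|'.
    x, y, _, candidates = line.split("\t")
    candidates = candidates.split("|")
    return ("turn", {"x": x, "y": y, "candidates": candidates,
                     "y_idx": candidates.index(y)})


parsed = [parse_line(line) for line in SAMPLE_LINES]
for kind, value in parsed:
    print(kind, value)
```

The reader below additionally keeps track of the current persona and of the dialog history across lines.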

In [4]:
from deeppavlov.core.commands.train import build_model_from_config
from deeppavlov.core.data.dataset_reader import DatasetReader
from deeppavlov.core.data.utils import download_decompress
from deeppavlov.core.common.registry import register

@register('personachat_dataset_reader')
class PersonaChatDatasetReader(DatasetReader):
    """
    PersonaChat dataset from
    Zhang S. et al. Personalizing Dialogue Agents: I have a dog, do you have pets too?
    https://arxiv.org/abs/1801.07243
    Also, this dataset is used in ConvAI2 http://convai.io/
    This class reads the dataset into the following format:
    [{
        'persona': [list of persona sentences],
        'x': input utterance,
        'y': output utterance,
        'dialog_history': [list of previous utterances],
        'candidates': [list of candidate utterances],
        'y_idx': index of y utt in candidates list
      },
       ...
    ]
    """
    def read(self, dir_path: str, mode='self_original'):
        dir_path = Path(dir_path)
        dataset = {}
        for dt in ['train', 'valid', 'test']:
            dataset[dt] = self._parse_data(dir_path / '{}_{}.txt'.format(dt, mode))

        return dataset

    @staticmethod
    def _parse_data(filename):
        examples = []
        print(filename)
        curr_persona = []
        curr_dialog_history = []
        persona_done = False
        with filename.open('r') as fin:
            for line in fin:
                line = ' '.join(line.strip().split(' ')[1:])
                your_persona_pref = 'your persona: '
                if line[:len(your_persona_pref)] == your_persona_pref and persona_done:
                    curr_persona = [line[len(your_persona_pref):]]
                    curr_dialog_history = []
                    persona_done = False
                elif line[:len(your_persona_pref)] == your_persona_pref:
                    curr_persona.append(line[len(your_persona_pref):])
                else:
                    persona_done = True
                    x, y, _, candidates = line.split('\t')
                    candidates = candidates.split('|')
                    example = {
                        'persona': curr_persona,
                        'x': x,
                        'y': y,
                        'dialog_history': curr_dialog_history[:],
                        'candidates': candidates,
                        'y_idx': candidates.index(y)
                    }
                    curr_dialog_history.extend([x, y])
                    examples.append(example)

        return examples

In [5]:
data = PersonaChatDatasetReader().read('./personachat')

personachat/train_self_original.txt
personachat/valid_self_original.txt
personachat/test_self_original.txt


#### Let's check dataset size

In [6]:
for k in data:
    print(k, len(data[k]))

train 65719
valid 7801
test 7512


In [7]:
data['train'][0]

{'persona': ['i like to remodel homes.',
  'i like to go hunting.',
  'i like to shoot a bow.',
  'my favorite holiday is halloween.'],
 'x': 'hi , how are you doing ? i am getting ready to do some cheetah chasing to stay in shape .',
 'y': 'you must be very fast . hunting is one of my favorite hobbies .',
 'dialog_history': [],
 'candidates': ['my mom was single with 3 boys , so we never left the projects .',
  'i try to wear all black every day . it makes me feel comfortable .',
  'well nursing stresses you out so i wish luck with sister',
  'yeah just want to pick up nba nfl getting old',
  'i really like celine dion . what about you ?',
  'no . i live near farms .',
  'i wish i had a daughter , i am a boy mom . they are beautiful boys though still lucky',
  'yeah when i get bored i play gone with the wind my favorite movie .',
  'hi how are you ? i am eating dinner with my hubby and 2 kids .',
  'were you married to your high school sweetheart ? i was .',
  'that is great to hear !

### Dataset iterator

The dataset iterator generates batches from the parsed dataset (the output of DatasetReader). Let's extract only *x* and *y* from the parsed dataset and use them to predict sentence *y* from sentence *x*.

In [8]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('personachat_iterator')
class PersonaChatIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        for dt in ['train', 'valid', 'test']:
            setattr(self, dt, self._to_tuple(getattr(self, dt)))

    @staticmethod
    def _to_tuple(data):
        """
        Returns:
            list of (x, y) tuples
        """
        return list(map(lambda x: (x['x'], x['y']), data))

Let's look at the data in batches:

In [9]:
iterator = PersonaChatIterator(data)
batch = [el for el in iterator.gen_batches(5, 'train')][0]
for x, y in zip(*batch):
    print('x:', x)
    print('y:', y)
    print('----------')

x: it certainly does . i love my daily walks with toto in the countryside
y: toto your dog ? yeah i use to live in kansas .
----------
x: i am the supervisor at a power plant so it really sets the tone .
y: i work from home . i am deaf so its just easier for me .
----------
x: i am doing ok i just got done fishing
y: sounds cool , i am reading about traveling to cancun
----------
x: hello , how are you ? where do you live ? i am in california .
y: hi i am mad about work but will be ok how are you
----------
x: i am a teacher already , what do you do ?
y: inbetween jobs now , i am just enjoying sports right now .
----------
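
The batching step itself can be sketched in plain Python. This is only an illustration of the idea (shuffle, cut into chunks, transpose pairs into separate `xs` and `ys`), not DeepPavlov's actual implementation; the function name `gen_batches` is reused for readability, while its signature and the `seed` default are assumptions.

```python
import random


def gen_batches(pairs, batch_size, shuffle=True, seed=1337):
    """Yield (xs, ys) tuples built from a list of (x, y) pairs."""
    order = list(range(len(pairs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [pairs[i] for i in order[start:start + batch_size]]
        # zip(*chunk) turns [(x1, y1), (x2, y2)] into (x1, x2), (y1, y2).
        xs, ys = zip(*chunk)
        yield xs, ys


# Toy (x, y) pairs standing in for the parsed PersonaChat examples.
toy_pairs = [("x{}".format(i), "y{}".format(i)) for i in range(5)]
batches = list(gen_batches(toy_pairs, batch_size=2))
for xs, ys in batches:
    print(xs, ys)
```

Note that the last batch may be smaller than `batch_size` when the number of examples is not divisible by it.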


### Tokenizer

Tokenizer is used to extract tokens from utterance.

In [10]:
from deeppavlov.models.preprocessors.lazy_tokenizer import LazyTokenizer
tokenizer = LazyTokenizer()
tokenizer(['Hello my friend'])

[['Hello', 'my', 'friend']]

### Vocabulary

The vocabulary builds a mapping from tokens to token indexes. It uses the train data to build this mapping.

We will implement DialogVocab (inherited from SimpleVocabulary), which adds all tokens from *x* and *y* utterances to the vocabulary.

In [55]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

@register('dialog_vocab')
class DialogVocab(SimpleVocabulary):
    def fit(self, *args):
        tokens = chain(*args)
        super().fit(tokens)

    def __call__(self, batch, **kwargs):
        indices_batch = []
        for utt in batch:
            tokens = [self[token] for token in utt]
            indices_batch.append(tokens)
        return indices_batch



Let's create an instance of DialogVocab. We define the save and load paths, the minimal frequency of tokens that are added to the vocabulary, and the set of special tokens.

Special tokens are:
* <PAD\> - padding
* <BOS\> - beginning of sequence
* <EOS\> - end of sequence
* <UNK\> - unknown token, i.e. a token that is not present in the vocabulary

Then we fit it on tokens from *x* and *y*.

In [56]:
vocab = DialogVocab(
    save_path='/Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict', # change this path
    load_path='/Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict', # change this path
    min_freq=2,
    special_tokens=('<PAD>','<BOS>', '<EOS>', '<UNK>',),
    unk_token='<UNK>'
)
    
vocab.fit(tokenizer(iterator.get_instances(data_type='train')[0]) + tokenizer(iterator.get_instances(data_type='train')[1]))
vocab.save()

2018-07-11 00:49:16.961 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 83: [saving vocabulary to /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]


Top 10 most frequent tokens in train dataset:

In [57]:
vocab.freqs.most_common(10)

[('i', 103487),
 ('.', 101599),
 ('you', 48296),
 ('?', 43771),
 (',', 39500),
 ('a', 34214),
 ('to', 32105),
 ('do', 30574),
 ('is', 28579),
 ('my', 26953)]

Number of tokens in vocabulary:

In [58]:
len(vocab)

11595

Let's use built vocabulary to encode some tokenized sentence.

In [59]:
vocab([['<BOS>', 'hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this', '<EOS>', '<PAD>']])

[[1, 70, 13, 240, 3, 3, 2, 0]]
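
The behaviour seen above (special tokens get the first indexes, rare or unseen tokens map to <UNK\>) can be sketched without DeepPavlov. This is a simplified illustration under stated assumptions: `build_vocab` and `encode` are hypothetical helpers, and the exact index order produced by SimpleVocabulary may differ.

```python
from collections import Counter
from itertools import chain

SPECIAL_TOKENS = ("<PAD>", "<BOS>", "<EOS>", "<UNK>")


def build_vocab(tokenized_utterances, min_freq=2):
    """Map tokens to ids: special tokens first, then frequent tokens."""
    freqs = Counter(chain.from_iterable(tokenized_utterances))
    token_to_id = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for token, count in freqs.most_common():
        if count >= min_freq and token not in token_to_id:
            token_to_id[token] = len(token_to_id)
    return token_to_id


def encode(tokens, token_to_id):
    """Replace every token with its id, falling back to <UNK>."""
    unk_id = token_to_id["<UNK>"]
    return [token_to_id.get(tok, unk_id) for tok in tokens]


toy_corpus = [["hello", "my", "friend"], ["hello", "my", "dog"], ["hello"]]
toy_vocab = build_vocab(toy_corpus, min_freq=2)
encoded = encode(["<BOS>", "hello", "my", "cat", "<EOS>"], toy_vocab)
print(toy_vocab)
print(encoded)
```

With `min_freq=2`, the tokens `friend` and `dog` occur only once and are left out, so they would be encoded as <UNK\> just like the unseen token `cat`.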

### Padding

To feed sequences of token indexes to a neural model we should make their lengths equal. If a sequence is too short, we add <PAD\> symbols to its end. If a sequence is too long, we just cut it.

SentencePadder implements this behavior. It also adds <BOS\> and <EOS\> tokens.

In [60]:
from deeppavlov.core.models.component import Component

@register('sentence_padder')
class SentencePadder(Component):
    def __init__(self, length_limit, pad_token_id=0, start_token_id=1, end_token_id=2, *args, **kwargs):
        self.length_limit = length_limit
        self.pad_token_id = pad_token_id
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = batch[i][:self.length_limit]
            batch[i] = [self.start_token_id] + batch[i] + [self.end_token_id]
            batch[i] += [self.pad_token_id] * (self.length_limit + 2 - len(batch[i]))
        return batch



In [61]:
padder = SentencePadder(length_limit=6)
vocab(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']])))

[['<BOS>', 'hello', 'my', 'friend', '<UNK>', '<UNK>', '<EOS>', '<PAD>']]
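
Once sequences are padded, a model usually needs to know which positions hold real tokens and how long each sequence actually is. A minimal sketch, assuming <PAD\> has index 0 as in the special tokens above (`padding_mask` and `lengths` are hypothetical helper names):

```python
PAD_ID = 0  # assumption: <PAD> has index 0


def padding_mask(batch, pad_id=PAD_ID):
    """Return a 0/1 mask with 1 for real tokens and 0 for padding."""
    return [[int(token_id != pad_id) for token_id in seq] for seq in batch]


def lengths(mask):
    """Number of non-padding positions in every sequence."""
    return [sum(row) for row in mask]


padded_batch = [
    [1, 70, 13, 240, 2, 0, 0, 0],
    [1, 70, 2, 0, 0, 0, 0, 0],
]
pad_mask = padding_mask(padded_batch)
print(pad_mask)
print(lengths(pad_mask))
```

The same idea can be expressed on tensors by comparing token ids with the padding index.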

### Seq2Seq Model
The model consists of two main components: an encoder and a decoder. We can implement them independently and then put them together in one Seq2Seq model.

#### Encoder
The encoder builds a hidden representation of the input sequence.

In [62]:
def encoder(inputs, inputs_len, embedding_matrix, cell_size, keep_prob=1.0):
    # inputs: tf.int32 tensor with shape bs x seq_len with token ids
    # inputs_len: tf.int32 tensor with shape bs
    # embedding_matrix: tf.float32 tensor with shape vocab_size x vocab_dim
    # cell_size: hidden size of recurrent cell
    # keep_prob: dropout keep probability
    with tf.variable_scope('encoder'):
        # embed every token of the input sequence and apply dropout
        encoder_inputs_embedded = tf.nn.embedding_lookup(embedding_matrix, inputs)
        encoder_inputs_embedded = tf.nn.dropout(encoder_inputs_embedded, keep_prob=keep_prob)

        # define recurrent cell (GRU or LSTM)
        encoder_cell = tf.contrib.rnn.LSTMCell(cell_size)

        # encode the input sequence; sequence_length makes the RNN stop at the
        # actual length of each sequence instead of running over padding
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
            encoder_cell, encoder_inputs_embedded,
            sequence_length=inputs_len,
            dtype=tf.float32, time_major=False)

    return encoder_outputs, encoder_state

Check your encoder implementation. The outputs of the next cell should have shapes 32 x 10 x 100 (encoder outputs) and 32 x 100 (each part of the LSTM state):

In [63]:
tf.reset_default_graph()
vocab_size = 100
hidden_dim = 100
inputs = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32) # bs x seq_len
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len
inputs_len = tf.reduce_sum(mask, axis=1)
embedding_matrix = tf.random_uniform(shape=[vocab_size, hidden_dim])

encoder(inputs, inputs_len, embedding_matrix, hidden_dim)

(<tf.Tensor 'encoder/rnn/transpose_1:0' shape=(32, 10, 100) dtype=float32>,
 LSTMStateTuple(c=<tf.Tensor 'encoder/rnn/while/Exit_3:0' shape=(32, 100) dtype=float32>, h=<tf.Tensor 'encoder/rnn/while/Exit_4:0' shape=(32, 100) dtype=float32>))

#### Decoder
The decoder uses the encoder outputs and the encoder state to produce the output sequence.

Here, you should write code:
* define your decoder_cell (GRU or LSTM)

This will be your baseline seq2seq model.


And, to improve the model:
* add Teacher Forcing
* add Attention Mechanism

In [66]:
def decoder(encoder_outputs, encoder_state, embedding_matrix, mask,
            cell_size, max_length, y_ph,
            start_token_id=1, keep_prob=1.0,
            teacher_forcing_rate_ph = None,
            use_attention = False, is_train=False):    
    # decoder
    # encoder_outputs: tf.float32 tensor with shape bs x seq_len x encoder_cell_size
    # encoder_state: tf.float32 tensor with shape bs x encoder_cell_size
    # embedding_matrix: tf.float32 tensor with shape vocab_size x vocab_dim
    # mask: tf.int32 tensor with shape bs x seq_len with zeros for masked sequence elements
    # cell_size: hidden size of recurrent cell
    # max_length: max length of output sequence
    # start_token_id: id of <BOS> token in vocabulary
    # keep_prob: dropout keep probability
    # teacher_forcing_rate_ph: rate of using teacher forcing on each decoding step
    # use_attention: use attention on encoder outputs or use only encoder_state
    # is_train: is it training or inference? at inference time we can't use teacher forcing
    with tf.variable_scope('decoder'):
        # define decoder recurrent cell with cell_size
        decoder_cell = tf.contrib.rnn.LSTMCell(cell_size)
        # initial value of output_token on the previous step is start_token
        output_token = tf.ones(shape=(tf.shape(encoder_outputs)[0],), dtype=tf.int32) * start_token_id
        # get embeddings_dim from embedding_matrix
        embeddings_dim = embedding_matrix.get_shape()[1]
        # let's define initial value of decoder state with encoder_state
        decoder_state = encoder_state

        pred_tokens = []
        logits = []

        # use for loop to sequentially call recurrent cell
        for i in range(max_length):
            # teacher forcing: during training, with probability teacher_forcing_rate_ph,
            # feed the ground truth token from the previous step instead of the model's own output
            input_token = tf.cond(tf.logical_and(is_train, tf.logical_and(tf.greater(i, 0),
                                                 tf.less(tf.random_uniform(shape=()), teacher_forcing_rate_ph))),
                                  lambda: y_ph[:, i-1], lambda: tf.cast(output_token, tf.int32))

            input_token_emb = tf.nn.embedding_lookup(embedding_matrix, input_token)

            if use_attention:
                # attend over encoder outputs using the current decoder hidden state
                # (decoder_state is an LSTMStateTuple; for a GRU cell pass decoder_state itself)
                att = dot_attention(encoder_outputs, decoder_state.h, mask)
                input_token_emb = tf.concat([input_token_emb, att], axis=1)

            input_token_emb = tf.nn.dropout(input_token_emb, keep_prob=keep_prob)
            # call recurrent cell
            decoder_outputs, decoder_state = decoder_cell(input_token_emb, decoder_state)
            decoder_outputs = tf.nn.dropout(decoder_outputs, keep_prob=keep_prob)
            # project decoder output to embeddings dimension
            output_proj = tf.layers.dense(decoder_outputs, embeddings_dim, activation=tf.nn.tanh,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          name='proj', reuse=tf.AUTO_REUSE)
            # compute logits
            output_logits = tf.matmul(output_proj, embedding_matrix, transpose_b=True)

            logits.append(output_logits)
            output_probs = tf.nn.softmax(output_logits)
            output_token = tf.argmax(output_probs, axis=-1)
            pred_tokens.append(output_token)
        
        # ids of output tokens, they will be used as model prediction
        y_pred_tokens = tf.transpose(tf.stack(pred_tokens, axis=0), [1, 0])
        # output logits of predicted tokens on each step, they will be used to compute loss function
        y_logits = tf.transpose(tf.stack(logits, axis=0), [1, 0, 2])
        
    return y_pred_tokens, y_logits

The outputs of the next cell should have shapes:

    32 x 10
    32 x 10 x 100

In [67]:
tf.reset_default_graph()
vocab_size = 100
hidden_dim = 100
inputs = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32) # bs x seq_len
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len
inputs_len = tf.reduce_sum(mask, axis=1)
embedding_matrix = tf.random_uniform(shape=[vocab_size, hidden_dim])

teacher_forcing_rate = tf.random_uniform(shape=())
y = tf.cast(tf.random_uniform(shape=[32, 10]) * vocab_size, tf.int32)

encoder_outputs, encoder_state = encoder(inputs, inputs_len, embedding_matrix, hidden_dim)
decoder(encoder_outputs, encoder_state, embedding_matrix, mask, hidden_dim, max_length=10,
        y_ph=y, teacher_forcing_rate_ph=teacher_forcing_rate)

(<tf.Tensor 'decoder/transpose:0' shape=(32, 10) dtype=int64>,
 <tf.Tensor 'decoder/transpose_1:0' shape=(32, 10, 100) dtype=float32>)

#### Model

The Seq2Seq model should inherit from the TFModel class and implement the following methods:
* train_on_batch - this method is called in training phase
* \_\_call\_\_ - this method is called to make predictions

In [68]:
import tensorflow as tf
from deeppavlov.core.models.tf_model import TFModel

@register('seq2seq')
class Seq2Seq(TFModel):
    def __init__(self, **kwargs):
        # hyperparameters
        
        # dimension of word dense representation
        self.embeddings_dim = kwargs.get('embeddings_dim', 100)
        # size of recurrent cell in encoder and decoder
        self.cell_size = kwargs.get('cell_size', 200)
        # dropout keep_probability
        self.keep_prob = kwargs.get('keep_prob', 0.8)
        # learning rate
        self.learning_rate = kwargs.get('learning_rate', 3e-04)
        # max length of output sequence
        self.max_length = kwargs.get('max_length', 20)
        self.grad_clip = kwargs.get('grad_clip', 5.0)
        self.start_token_id = kwargs.get('start_token_id', 1)
        self.vocab_size = kwargs.get('vocab_size', 11595)
        self.teacher_forcing_rate = kwargs.get('teacher_forcing_rate', 0.0)
        self.use_attention = kwargs.get('use_attention', True)
        
        # create tensorflow session to run computational graph in it
        self.sess_config = tf.ConfigProto(allow_soft_placement=True)
        self.sess_config.gpu_options.allow_growth = True
        self.sess = tf.Session(config=self.sess_config)
        
        self.init_graph()
        
        # define train op
        self.train_op = self.get_train_op(self.loss, self.lr_ph,
                                          optimizer=tf.train.AdamOptimizer,
                                          clip_norm=self.grad_clip)
        # initialize graph variables
        self.sess.run(tf.global_variables_initializer())
        
        super().__init__(**kwargs)
        # load saved model if there is one
        if self.load_path is not None:
            self.load()
        
    def init_graph(self):
        # create placeholders
        self.init_placeholders()

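        # masks: 1 for real tokens, 0 for <PAD> (assumes <PAD> has index 0)
        self.x_mask = tf.cast(tf.not_equal(self.x_ph, 0), tf.int32)
        self.y_mask = tf.cast(tf.not_equal(self.y_ph, 0), tf.int32)

        self.x_len = tf.reduce_sum(self.x_mask, axis=1)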
        
        # create embeddings matrix for tokens
        self.embeddings = tf.Variable(tf.random_uniform((self.vocab_size, self.embeddings_dim), -0.1, 0.1, name='embeddings'), dtype=tf.float32)

        encoder_outputs, encoder_state = encoder(self.x_ph, self.x_len, self.embeddings, self.cell_size, self.keep_prob_ph)

        self.y_pred_tokens, y_logits = decoder(encoder_outputs, encoder_state, self.embeddings, self.x_mask,
                                                      self.cell_size, self.max_length,
                                                      self.y_ph, self.start_token_id, self.keep_prob_ph,
                                                      self.teacher_forcing_rate_ph, self.use_attention, self.is_train_ph)
        
        # loss
        self.y_ohe = tf.one_hot(self.y_ph, depth=self.vocab_size)
        self.y_mask = tf.cast(self.y_mask, tf.float32)
        self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.y_ohe, logits=y_logits) * self.y_mask
        self.loss = tf.reduce_sum(self.loss) / tf.reduce_sum(self.y_mask)
    
    def init_placeholders(self):
        # placeholders for inputs
        self.x_ph = tf.placeholder(shape=(None, None), dtype=tf.int32, name='x_ph')
        
        # y_ph is a node of the computational graph even at inference time (the teacher forcing branch reads it),
        # so we give it a dummy default value; this dummy value is not actually used at inference
        self.y_ph = tf.placeholder_with_default(tf.zeros_like(self.x_ph), shape=(None, None), name='y_ph')

        # placeholders for model parameters
        self.lr_ph = tf.placeholder(dtype=tf.float32, shape=[], name='lr_ph')
        self.keep_prob_ph = tf.placeholder_with_default(1.0, shape=[], name='keep_prob_ph')
        self.is_train_ph = tf.placeholder_with_default(False, shape=[], name='is_train_ph')
        self.teacher_forcing_rate_ph = tf.placeholder_with_default(0.0, shape=[], name='teacher_forcing_rate_ph')
            
    def _build_feed_dict(self, x, y=None):
        feed_dict = {
            self.x_ph: x,
        }
        if y is not None:
            feed_dict.update({
                self.y_ph: y,
                self.lr_ph: self.learning_rate,
                self.keep_prob_ph: self.keep_prob,
                self.is_train_ph: True,
                self.teacher_forcing_rate_ph: self.teacher_forcing_rate,
            })
        return feed_dict
    
    def train_on_batch(self, x, y):
        feed_dict = self._build_feed_dict(x, y)
        loss, _ = self.sess.run([self.loss, self.train_op], feed_dict=feed_dict)
        return loss
    
    def __call__(self, x):
        feed_dict = self._build_feed_dict(x)
        y_pred = self.sess.run(self.y_pred_tokens, feed_dict=feed_dict)
        return y_pred



Let's create a model with random weights and default parameters. Change the path to the model, otherwise it will be stored in the deeppavlov/download folder:

In [69]:
s2s = Seq2Seq(
    save_path='/Users/yaroina-kente/DeepPavlov/download/vocabs/model',
    load_path='/Users/yaroina-kente/DeepPavlov/download/vocabs/model'
)

Here, we first run all preprocessing steps and call the seq2seq model, and then convert token indexes back to tokens. As a result we should get some random sequence of words.

In [70]:
vocab(s2s(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']]))))

[['itunes',
  'limit',
  'trades',
  'chats',
  'fresh',
  'hilton',
  'cousins',
  'cousins',
  'can',
  'precious',
  'simpsons',
  'eighth',
  'eighth',
  'cattle',
  'watson',
  'tank',
  'seas',
  'studio',
  'hippy',
  'whoops']]

#### Attention mechanism
The attention mechanism [[paper](https://arxiv.org/abs/1409.0473)] aggregates information from a "memory" according to the current state. By aggregation we mean a weighted sum of the "memory" items, where the weight of each memory item depends on the current state.

![attention](img/attention.png)

One of the simplest ways to compute attention weights (*a_ij*) is to take the dot product between each memory item and the state and then apply the softmax function. Other ways of computing *multiplicative* attention can be found in this [paper](https://arxiv.org/abs/1508.04025).

We also need a mask to skip some sequence elements such as <PAD\>. To make the weights of undesired memory items close to zero, we can add a large negative value to their logits (the results of the dot product) before applying softmax.
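
As a small worked example of the steps just described (dot product, masking, softmax, weighted sum), here is a plain Python sketch on a single sequence. It is an illustration only, independent of the TensorFlow code in this tutorial; `masked_dot_attention` is a hypothetical name and the toy numbers are made up.

```python
import math


def masked_dot_attention(memory, state, mask):
    """Weighted sum of memory items; weights come from dot products with state.

    memory: list of seq_len vectors, state: one vector, mask: list of 0/1 flags.
    """
    # Dot product between every memory item and the state.
    logits = [sum(m * s for m, s in zip(item, state)) for item in memory]
    # Push masked positions towards minus infinity before the softmax.
    logits = [logit if keep else logit - 1e30 for logit, keep in zip(logits, mask)]
    # Numerically stable softmax.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of memory items.
    dim = len(state)
    context = [sum(w * item[d] for w, item in zip(weights, memory)) for d in range(dim)]
    return weights, context


toy_memory = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_state = [1.0, 0.0]
toy_mask = [1, 1, 0]  # the third item is treated as padding
weights, context = masked_dot_attention(toy_memory, toy_state, toy_mask)
print(weights)
print(context)
```

Although the third memory item has the largest dot product with the state, the mask drives its weight to zero, so only the first two items contribute to the result.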

In [64]:
def softmax_mask(values, mask):
    # adds big negative to masked values
    INF = 1e30
    return -INF * (1 - tf.cast(mask, tf.float32)) + values

In [65]:
def dot_attention(memory, state, mask, scope="dot_attention"):
    # memory: bs x seq_len x hidden_dim
    # state: bs x hidden_dim (for an LSTMStateTuple its hidden state h is used)
    # mask: bs x seq_len
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        if isinstance(state, tf.contrib.rnn.LSTMStateTuple):
            state = state.h
        # dot product between each item in memory and state:
        # (bs x seq_len x hidden_dim) @ (bs x hidden_dim x 1) -> bs x seq_len x 1
        logits = tf.matmul(memory, tf.expand_dims(state, axis=2))

        # apply mask to logits
        logits = softmax_mask(tf.squeeze(logits, [2]), mask)

        # apply softmax to logits
        att_weights = tf.expand_dims(tf.nn.softmax(logits), axis=2)

        # compute weighted sum of items in memory
        att = tf.reduce_sum(att_weights * memory, axis=1)

        return att

Check your implementation. The output should have shape 32 x 100:

In [31]:
tf.reset_default_graph()
memory = tf.random_normal(shape=[32, 10, 100]) # bs x seq_len x hidden_dim
state = tf.random_normal(shape=[32, 100]) # bs x hidden_dim
mask = tf.cast(tf.random_uniform(shape=[32, 10]) * 2, tf.int32) # bs x seq_len, random 0/1 mask
dot_attention(memory, state, mask)

<tf.Tensor 'dot_attention/Sum:0' shape=(32, 100) dtype=float32>

#### Teacher forcing

We have implemented a decoder that takes its own output as input during both training and inference. But at early stages of training it can be hard for the model to produce long sequences while depending on its own close-to-random output. Teacher forcing can help with this: instead of feeding the model's output we can feed ground truth tokens. This helps the model at training time, but at inference time we can still rely only on its own output.


Using model's output:

<img src="img/sampling.png" alt="sampling" width=50%/>

Teacher forcing:

<img src="img/teacher_forcing.png" alt="teacher_forcing" width=50%/>

It is not necessary to feed ground truth tokens at every time step: we can randomly choose, with some rate, whether to feed the ground truth token or the token predicted by the model.
The *teacher_forcing_rate* parameter of the seq2seq model controls this behavior.

More details about teacher forcing can be found in the Deep Learning Book, [Chapter 10.2.1](http://www.deeplearningbook.org/contents/rnn.html).
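
The random choice between ground truth and predicted tokens can be sketched in plain Python. This is a simplified illustration: in a real decoder each prediction depends on the previous inputs, whereas here `predictions` is a fixed list so that the selection logic is easy to follow. The function name and its arguments are hypothetical.

```python
import random


def choose_decoder_inputs(targets, predictions, rate, start_token_id=1, seed=0):
    """Pick the token fed to the decoder at every step.

    targets: ground truth ids, predictions: ids the model produced,
    rate: probability of feeding the ground truth token from the previous step.
    """
    rng = random.Random(seed)
    inputs = []
    for step in range(len(targets)):
        if step == 0:
            inputs.append(start_token_id)         # nothing was generated yet
        elif rng.random() < rate:
            inputs.append(targets[step - 1])      # teacher forcing
        else:
            inputs.append(predictions[step - 1])  # model's own output
    return inputs


toy_targets = [11, 12, 13, 14]
toy_predictions = [21, 22, 23, 24]
no_forcing = choose_decoder_inputs(toy_targets, toy_predictions, rate=0.0)
full_forcing = choose_decoder_inputs(toy_targets, toy_predictions, rate=1.0)
mixed = choose_decoder_inputs(toy_targets, toy_predictions, rate=0.5)
print(no_forcing)
print(full_forcing)
print(mixed)
```

A rate of 0 reproduces pure sampling from the model, a rate of 1 feeds ground truth at every step, and values in between mix the two.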

### Postprocessing

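In the postprocessing step we are going to remove all <PAD\>, <BOS\>, <EOS\> tokens: the utterance is cut at the first <EOS\>, and any remaining <PAD\> and <BOS\> tokens are dropped.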

In [71]:
@register('postprocessing')
class SentencePostprocessor(Component):
    def __init__(self, pad_token='<PAD>', start_token='<BOS>', end_token='<EOS>', *args, **kwargs):
        self.pad_token = pad_token
        self.start_token = start_token
        self.end_token = end_token

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = ' '.join(self._postproc(batch[i]))
        return batch
    
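    def _postproc(self, utt):
        # cut everything starting from the first <EOS> token
        if self.end_token in utt:
            utt = utt[:utt.index(self.end_token)]
        # drop remaining <PAD> and <BOS> tokens
        return [token for token in utt if token not in (self.pad_token, self.start_token)]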

In [72]:
postprocess = SentencePostprocessor()

In [73]:
postprocess(vocab(s2s(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']])))))

['itunes limit trades chats fresh hilton cousins cousins can precious simpsons eighth eighth cattle watson tank seas studio hippy whoops']

### Create config file
Let's put it all together in one config file.

In [75]:
config = {
  "dataset_reader": {
    "name": "personachat_dataset_reader",
    "data_path": "/Users/yaroina-kente/personachat/" # change this path
  },
  "dataset_iterator": {
    "name": "personachat_iterator",
    "seed": 1337,
    "shuffle": True
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["x"],
        "out": ["x_tokens"]
      },
      {
        "name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["y"],
        "out": ["y_tokens"]
      },
      {
        "name": "dialog_vocab",
        "id": "vocab",
        "save_path": "/Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict", # change this path
        "load_path": "/Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict", # change this path
        "min_freq": 2,
        "special_tokens": ["<PAD>","<BOS>", "<EOS>", "<UNK>"],
        "unk_token": "<UNK>",
        "fit_on": ["x_tokens", "y_tokens"],
        "in": ["x_tokens"],
        "out": ["x_tokens_ids"]
      },
      {
        "ref": "vocab",
        "in": ["y_tokens"],
        "out": ["y_tokens_ids"]
      },
      {
        "name": "sentence_padder",
        "id": "padder",
        "length_limit": 20,
        "in": ["x_tokens_ids"],
        "out": ["x_tokens_ids"]
      },
      {
        "name": "sentence_padder",
        "id": "y_padder",
        "length_limit": 10,
        "in": ["y_tokens_ids"],
        "out": ["y_tokens_ids"]
      },
      {
        "name": "seq2seq",
        "id": "s2s",
        "max_length": "#y_padder.length_limit+2",
        "cell_size": 250,
        "embeddings_dim": 50,
        "vocab_size": 11595,
        "keep_prob": 0.8,
        "learning_rate": 3e-04,
        "teacher_forcing_rate": 0.2, # change this parameter to use teacher forcing
        "use_attention": True, # change this parameter to use attention
        "save_path": "/Users/yaroina-kente/DeepPavlov/download/vocabs/model", # change this path
        "load_path": "/Users/yaroina-kente/DeepPavlov/download/vocabs/model", # change this path
        "in": ["x_tokens_ids"],
        "in_y": ["y_tokens_ids"],
        "out": ["y_predicted_tokens_ids"],
      },
      {
        "ref": "vocab",
        "in": ["y_predicted_tokens_ids"],
        "out": ["y_predicted_tokens"]
      },
      {
        "name": "postprocessing",
        "in": ["y_predicted_tokens"],
        "out": ["y_predicted_tokens"]
      }
    ],
    "out": ["y_predicted_tokens"]
  },
  "train": {
    "log_every_n_batches": 100,
    "val_every_n_epochs":0,
    "batch_size": 64,
    "validation_patience": 0,
    "epochs": 20,
    "metrics": ["bleu"],
  }
}
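
The `in` and `out` names wire the pipe together: each component reads keys produced by earlier steps and writes new ones. The sketch below checks that wiring. It is not part of DeepPavlov: `trace_pipe` is a hypothetical helper, the `pipe` list is a hand-abbreviated copy of the config above that keeps only the wiring keys, and it assumes `x_tokens` and `y_tokens` are already available when the vocabulary step runs.

```python
def trace_pipe(pipe, available):
    """Return (component, name) pairs for inputs that no earlier step has produced."""
    available = set(available)
    missing = []
    for step in pipe:
        # Label a step by its name, its id, or the id it refers to.
        label = step.get("name") or step.get("id") or "ref:" + step.get("ref", "?")
        for name in step.get("in", []) + step.get("in_y", []):
            if name not in available:
                missing.append((label, name))
        available.update(step.get("out", []))
    return missing


# Abbreviated stand-in for the pipe above (wiring keys only).
pipe = [
    {"id": "vocab", "in": ["x_tokens"], "out": ["x_tokens_ids"]},
    {"ref": "vocab", "in": ["y_tokens"], "out": ["y_tokens_ids"]},
    {"name": "sentence_padder", "in": ["x_tokens_ids"], "out": ["x_tokens_ids"]},
    {"name": "sentence_padder", "in": ["y_tokens_ids"], "out": ["y_tokens_ids"]},
    {"name": "seq2seq", "in": ["x_tokens_ids"], "in_y": ["y_tokens_ids"],
     "out": ["y_predicted_tokens_ids"]},
    {"ref": "vocab", "in": ["y_predicted_tokens_ids"], "out": ["y_predicted_tokens"]},
    {"name": "postprocessing", "in": ["y_predicted_tokens"], "out": ["y_predicted_tokens"]},
]

# Assumption: the tokenized inputs exist before the vocabulary step.
missing = trace_pipe(pipe, available=["x_tokens", "y_tokens"])
print(missing)  # an empty list means every input name is produced upstream
```

A check like this catches a misspelled key before a long training run starts.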

### Interact with model using config

The model has not been trained yet (only the vocabulary is loaded), so its replies at this point are meaningless.

In [76]:
from deeppavlov.core.commands.infer import build_model_from_config
model = build_model_from_config(config)

2018-07-11 01:07:24.64 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 94: [loading vocabulary from /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]


In [77]:
model(['Hi, how are you?', 'Any ideas my dear friend?'])

['fin fin es es canadian mail clark haven mermaid price price price',
 'fin fin es es lifelong insta insta kool kool aggression aggression aggression']

### Train model


Train the model with and without attention, and with and without teacher forcing, by changing `use_attention` and `teacher_forcing_rate` in the config. Adjust other config parameters if needed.
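
One way to organize these runs is to derive a separate config for each combination. The following is a minimal sketch under several assumptions: `make_variants` is a hypothetical helper, the pipe is assumed to live under `config["chainer"]["pipe"]`, the small `base` dict is a stand-in for the full config, and a `teacher_forcing_rate` of `0.0` is assumed to switch teacher forcing off in this tutorial's model.

```python
import copy
from itertools import product


def make_variants(base_config, model_id="s2s"):
    """Build one config per (use_attention, teacher_forcing_rate) combination."""
    variants = {}
    for use_attention, tf_rate in product([True, False], [0.0, 0.2]):
        cfg = copy.deepcopy(base_config)  # never mutate the original config
        tag = f"att{int(use_attention)}_tf{tf_rate}"
        for component in cfg["chainer"]["pipe"]:
            if component.get("id") == model_id:
                component["use_attention"] = use_attention
                component["teacher_forcing_rate"] = tf_rate
                # Separate checkpoints so that runs do not overwrite each other.
                component["save_path"] = f"{component['save_path']}_{tag}"
                component["load_path"] = f"{component['load_path']}_{tag}"
        variants[tag] = cfg
    return variants


# Stand-in for the full config; the path "model" is a placeholder.
base = {
    "chainer": {
        "pipe": [
            {"name": "seq2seq", "id": "s2s", "use_attention": True,
             "teacher_forcing_rate": 0.2, "save_path": "model", "load_path": "model"}
        ]
    }
}

variants = make_variants(base)
print(sorted(variants))
```

Each variant can then be written to its own JSON file and trained in the same way as `seq2seq.json`.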

In [78]:
from deeppavlov.core.commands.train import train_evaluate_model_from_config

In [79]:
# Use a context manager so the file is flushed and closed before training reads it.
with open('seq2seq.json', 'w') as f:
    json.dump(config, f)

In [80]:
train_evaluate_model_from_config('seq2seq.json')

/Users/yaroina-kente/personachat/train_self_original.txt
/Users/yaroina-kente/personachat/valid_self_original.txt
/Users/yaroina-kente/personachat/test_self_original.txt


2018-07-11 01:08:25.768 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 94: [loading vocabulary from /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]
2018-07-11 01:09:01.76 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 83: [saving vocabulary to /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram 

{"train": {"epochs_done": 0, "batches_seen": 100, "examples_seen": 6400, "metrics": {"bleu": 0.0}, "time_spent": "0:02:57", "loss": 9.323235940933227}}
{"train": {"epochs_done": 0, "batches_seen": 200, "examples_seen": 12800, "metrics": {"bleu": 0.0}, "time_spent": "0:05:58", "loss": 9.223115634918212}}
{"train": {"epochs_done": 0, "batches_seen": 300, "examples_seen": 19200, "metrics": {"bleu": 0.0}, "time_spent": "0:08:47", "loss": 9.20089792251587}}
{"train": {"epochs_done": 0, "batches_seen": 400, "examples_seen": 25600, "metrics": {"bleu": 0.0}, "time_spent": "0:11:30", "loss": 9.208646421432496}}
{"train": {"epochs_done": 0, "batches_seen": 500, "examples_seen": 32000, "metrics": {"bleu": 0.0}, "time_spent": "0:14:41", "loss": 9.18961205482483}}
{"train": {"epochs_done": 0, "batches_seen": 600, "examples_seen": 38400, "metrics": {"bleu": 0.0}, "time_spent": "0:18:06", "loss": 9.210968952178956}}
{"train": {"epochs_done": 0, "batches_seen": 700, "examples_seen": 44800, "metrics": 

2018-07-11 03:59:39.796 INFO in 'deeppavlov.core.commands.train'['train'] at line 367: Stopped training
2018-07-11 03:59:39.798 INFO in 'deeppavlov.core.commands.train'['train'] at line 370: Saving model
2018-07-11 03:59:39.800 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 49: [saving model to /Users/yaroina-kente/DeepPavlov/download/vocabs/model]
2018-07-11 03:59:40.657 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 94: [loading vocabulary from /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]
2018-07-11 03:59:47.595 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 40: [loading model from /Users/yaroina-kente/DeepPavlov/download/vocabs/model]


INFO:tensorflow:Restoring parameters from /Users/yaroina-kente/DeepPavlov/download/vocabs/model


2018-07-11 03:59:47.626 INFO in 'tensorflow'['tf_logging'] at line 116: Restoring parameters from /Users/yaroina-kente/DeepPavlov/download/vocabs/model
2018-07-11 03:59:47.812 INFO in 'deeppavlov.core.commands.train'['train'] at line 174: Testing the best saved model


{"valid": {"eval_examples_count": 7801, "metrics": {"bleu": 0.0}, "time_spent": "0:00:33"}}
{"test": {"eval_examples_count": 7512, "metrics": {"bleu": 0.0}, "time_spent": "0:00:28"}}


In [81]:
model = build_model_from_config(config)
model(['hi, how are you?', 'any ideas my dear friend?', 'okay, i agree with you', 'good bye!'])

2018-07-11 04:00:48.65 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 94: [loading vocabulary from /Users/yaroina-kente/DeepPavlov/download/vocabs/vocab.dict]
2018-07-11 04:00:54.577 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 40: [loading model from /Users/yaroina-kente/DeepPavlov/download/vocabs/model]


INFO:tensorflow:Restoring parameters from /Users/yaroina-kente/DeepPavlov/download/vocabs/model


2018-07-11 04:00:54.600 INFO in 'tensorflow'['tf_logging'] at line 116: Restoring parameters from /Users/yaroina-kente/DeepPavlov/download/vocabs/model


['oh oh am am am just . . .',
 'oh oh am am like . . .',
 'oh oh am am like just . . .',
 'oh oh am am like mostly . . .']
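
The `bleu: 0.0` scores and the "0 counts of 2-gram overlaps" warnings in the training log follow from how BLEU is defined: it combines the 1- to 4-gram precisions through a geometric mean, so a single order with no matches drives the whole score to zero. The sketch below computes clipped n-gram precisions in pure Python. It is an illustration, not the metric implementation used during training: the brevity penalty is omitted and the reference sentence is made up.

```python
from collections import Counter
from math import exp, log


def ngram_precisions(reference, hypothesis, max_n=4):
    """Clipped n-gram precision of `hypothesis` against one `reference`, n = 1..max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        precisions.append(matched / total if total else 0.0)
    return precisions


def geometric_mean(values):
    """Geometric mean; zero as soon as any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return exp(sum(log(v) for v in values) / len(values))


hypothesis = "oh oh am am am just . . .".split()  # one of the replies printed above
reference = "i am doing just fine .".split()      # hypothetical reference, illustration only

precisions = ngram_precisions(reference, hypothesis)
print(precisions)                  # some unigrams match, no higher-order n-gram does
print(geometric_mean(precisions))  # 0.0
```

Some unigrams match here, but the repeated tokens produce no matching bigrams, which is the situation the warnings describe.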