# [Convolutional Sequence-to-Sequence Learning (2017)](https://arxiv.org/pdf/1705.03122)

- Authors: Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
- Presenter: Jaemin Cho

## Key References
- Encoder-Decoder for NMT
    - [Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014](https://arxiv.org/abs/1406.1078)
- Dot product Attention
    - [Luong et al., Effective approaches to attention-based neural machine translation, 2015](https://arxiv.org/abs/1508.04025)
- Gated Linear Unit (GLU)
    - [Dauphin et al., Language modeling with gated linear units, 2016](https://michaelauli.github.io/papers/gcnn.pdf)
- CNN Encoder for NMT
    - [Gehring et al., A Convolutional Encoder Model for Neural Machine Translation, 2016](https://arxiv.org/abs/1611.02344)
- Multi-Step Attention
    - [Sukhbaatar et al., End-To-End Memory Networks, 2015](https://arxiv.org/abs/1503.08895)
- Residual Connection
    - [He et al., Deep Residual Learning for Image Recognition, 2015](https://arxiv.org/abs/1512.03385)
- Byte-Pair Encoding (BPE) for subword vocabulary
    - [Sennrich et al., Neural Machine Translation of Rare Words with Subword Units, 2016](https://arxiv.org/pdf/1508.07909)
- Vocabulary Selection
    - [Mi et al., Vocabulary Manipulation for Neural Machine Translation, 2016](https://arxiv.org/abs/1605.03209)
- Weight normalization
    - [Salimans and Kingma, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016](https://arxiv.org/pdf/1602.07868.pdf)
- Weight initialization focusing on variances of activations
    - [Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010](http://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Fall.2016/pdfs/1111/AISTATS2010_Glorot.pdf)


In [1]:
import tensorflow as tf
import inspect
import numpy as np
from viz_utils import show_graph

## Brief Intro to seq2seq with RNN

<img src="images/seq2seq.png" style="width:1000px"/>

In [2]:
source_vocab = {
    'le': 1, # 2,
    'chat': 2, # 85,
    'est': 3, # 3,
    'noir': 4, # 12,
    '<EOS>': 0, # 99,
}
target_vocab = {
    '<SOS>': 0,
    'the': 1, # 42,
    'cat': 2, # 82,
    'is': 3, # 16,
    'black': 4,
}

In [3]:
source_sentence = 'le chat est noir <EOS>'
target_sentence = '<SOS> the cat is black'

In [4]:
encoded_source_sentence = [source_vocab[word]
                           for word in source_sentence.split()]
encoded_target_sentence = [target_vocab[word]
                           for word in target_sentence.split()]
print('source sentence:', encoded_source_sentence)
print('target sentence:', encoded_target_sentence)

source sentence: [1, 2, 3, 4, 0]
target sentence: [0, 1, 2, 3, 4]


In [5]:
source_reverse_vocab = {source_vocab[key]: key
                        for key in source_vocab}
target_reverse_vocab = {target_vocab[key]: key
                        for key in target_vocab}
print(source_reverse_vocab)
print(target_reverse_vocab)

{1: 'le', 2: 'chat', 3: 'est', 4: 'noir', 0: '<EOS>'}
{0: '<SOS>', 1: 'the', 2: 'cat', 3: 'is', 4: 'black'}


In [6]:
tf.reset_default_graph()

learning_rate = 0.1
rnn_dimension = 10
batch_size = 1

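# Word Embedding
# Vocabulary size is the number of entries in each vocabulary,
# not the sentence length; the two only coincide in this toy example.
source_vocab_size = len(source_vocab) # 5
target_vocab_size = len(target_vocab) # 5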

source_word_embedding = tf.get_variable('source_word_embedding',
    dtype=tf.float32,
    shape=[source_vocab_size, rnn_dimension])
target_word_embedding = tf.get_variable('target_word_embedding',
    dtype=tf.float32,
    shape=[target_vocab_size, rnn_dimension])

encoded_source_sentence = tf.constant([encoded_source_sentence], name='source_sentence')
encoded_target_sentence = tf.constant([encoded_target_sentence], name='target_sentence')
print(encoded_source_sentence)
print(encoded_target_sentence)

Tensor("source_sentence:0", shape=(1, 5), dtype=int32)
Tensor("target_sentence:0", shape=(1, 5), dtype=int32)


In [7]:
# Encoder
with tf.variable_scope('Encoder') as scope:
    encoder_cell = tf.contrib.rnn.BasicRNNCell(rnn_dimension)
    encoder_outputs = []
    state = encoder_cell.zero_state(batch_size, tf.float32)
    for i in range(5):
        input_ = encoded_source_sentence[:, i]
        # Word Embedding
        # dimension: vocabulary size => RNN dimension
        input_ = tf.nn.embedding_lookup(source_word_embedding, input_)       
        if i > 0:
            scope.reuse_variables()
        output, state = encoder_cell(input_, state)
        encoder_outputs.append(output) # => Not used!
    encoder_last_state = state

In [8]:
# Decoder
with tf.variable_scope('Decoder') as scope:
    decoder_cell = tf.contrib.rnn.BasicRNNCell(rnn_dimension)
    decoder_outputs = []
    predictions = []
    state = encoder_last_state
    input_ = [target_vocab['<SOS>']] # index 0

    # RNN dimension => target vocabulary size
    output_projection = tf.get_variable('output_projection',
        dtype=tf.float32, shape=[rnn_dimension, target_vocab_size])

    for i in range(5):
        input_ = tf.nn.embedding_lookup(target_word_embedding, input_)
        if i > 0:
            scope.reuse_variables()
        output, state = decoder_cell(input_, state)
        logit = tf.matmul(output, output_projection)
        prediction = tf.argmax(logit, axis=-1)
        input_ = prediction
        decoder_outputs.append(logit)
        predictions.append(prediction)

In [9]:
# Train
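with tf.name_scope('Loss'):
    # Stack the per-step logits into [batch, time, vocab] to match the one-hot labels
    logits = tf.stack(decoder_outputs, axis=1)
    loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.one_hot(encoded_target_sentence, target_vocab_size),
        logits=logits)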
with tf.name_scope('Optimizer'):
    opt = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(f'Raw prediction: {np.squeeze(sess.run(predictions))}')
    for i in range(10):
        # Run one gradient descent step
        sess.run(opt)
        raw_prediction = np.squeeze(sess.run(predictions))
        # index => word
        word_predictions = [target_reverse_vocab[idx] for idx in raw_prediction]
        if word_predictions == ['<SOS>', 'the', 'cat', 'is', 'black']:
            print(f'Step {i+1}: {word_predictions} => Correct!')
        else:
            print(f'Step {i+1}: {word_predictions}')

Raw prediction: [0 2 3 1 2]
Step 1: ['<SOS>', 'the', 'cat', 'is', 'the']
Step 2: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 3: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 4: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 5: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 6: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 7: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 8: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 9: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!
Step 10: ['<SOS>', 'the', 'cat', 'is', 'black'] => Correct!


In [10]:
show_graph(tf.get_default_graph().as_graph_def())

## Notes
- The encoder outputs are not used; only the last encoder state reaches the decoder => Attention mechanism
- Results depend on initialization => Random initialization within upper bound on activation
- For more concise APIs: [my tutorial codes](https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq)

## Attention Mechanism
- 'Luong' style **dot-product** similarity score

<img src="images/luong_attention.jpg" style="width: 1000px;"/>
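
As a rough illustration of the dot-product score, here is a minimal NumPy sketch (not the notebook's TensorFlow code). The function name `dot_product_attention` and the toy shapes are assumptions made for this example: the decoder state is scored against every encoder output with a dot product, the scores are normalized with a softmax, and the context vector is the weighted sum of encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: [dim], encoder_outputs: [source_len, dim] (hypothetical shapes)."""
    scores = encoder_outputs @ decoder_state   # [source_len], one score per source position
    weights = softmax(scores)                  # attention distribution over source positions
    context = weights @ encoder_outputs        # [dim], weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 10))     # 5 source tokens, RNN dimension 10
decoder_state = rng.normal(size=10)
context, weights = dot_product_attention(decoder_state, encoder_outputs)
print('attention weights:', np.round(weights, 3))
```

This is how the otherwise unused `encoder_outputs` from the RNN example could feed into each decoder step.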

## Multiple-hop Attention
- End-to-End Memory Networks

<img src="images/mem_N2N.png" style="width: 1000px;"/>
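
The idea of multiple hops can be sketched in NumPy as repeated attention reads, where each read is added to the query before the next hop. This is a simplified sketch: it uses one shared `memory` matrix for both addressing and reading, whereas End-to-End Memory Networks use separate input and output embeddings. The name `multi_hop_attention` and the shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multi_hop_attention(query, memory, n_hops=3):
    """query: [dim], memory: [n_slots, dim]; returns the final query and per-hop weights."""
    weights_per_hop = []
    for _ in range(n_hops):
        weights = softmax(memory @ query)   # address the memory with the current query
        read = weights @ memory             # read a weighted sum of memory slots
        query = query + read                # the next hop attends with the updated query
        weights_per_hop.append(weights)
    return query, weights_per_hop

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 10))
query = rng.normal(size=10)
final_query, hop_weights = multi_hop_attention(query, memory, n_hops=3)
for hop, w in enumerate(hop_weights, start=1):
    print(f'hop {hop}:', np.round(w, 3))
```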

## What about Convolutional Encoder?
- Convolutional encoders have performed fairly well even in NLP!

- CNN encoder
<img src="images/cnn_encoder.png"/>

In [11]:
n_filters = 8
filter_size = 3
embedding_dim = 128

In [12]:
# conv2d expects a 4-D input: [batch, seq_len, embedding_dim, channels]
demo_embedded_words = tf.ones(shape=[1, 10, embedding_dim, 1])

def cnn_layer(embedded_words, activation_fn=tf.nn.relu):
    with tf.variable_scope('conv-%s' % filter_size):
        # Convolution layer: each filter spans filter_size words and the full embedding
        filter_shape = [filter_size, embedding_dim, 1, n_filters]
        conv_W = tf.get_variable('filter',
            initializer=tf.truncated_normal(filter_shape, stddev=0.1))
        conv_b = tf.get_variable('bias',
            initializer=tf.constant(0.1, shape=[n_filters]))
        conv = tf.nn.conv2d(
            input=embedded_words,
            filter=conv_W,
            strides=[1,1,1,1],
            padding='VALID',
            name='conv')

        pre_activation = tf.nn.bias_add(conv, conv_b)
        # Allow a linear layer (no activation), as used inside residual blocks
        if activation_fn is None:
            return pre_activation
        a_fn_name = 'ReLU' if activation_fn == tf.nn.relu else None
        return activation_fn(pre_activation, name=a_fn_name)

In [13]:
# cnn_layer(demo_embedded_words)

- Residual Connection
<img src="images/residual_connection.png"/>

In [14]:
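def residual_block(x, activation_fn=tf.nn.relu):
    # Sketch: assumes cnn_layer preserves x's shape so that the sum is valid
    with tf.variable_scope('residual_conv_1'):
        residual = cnn_layer(x, activation_fn=tf.nn.relu)
    with tf.variable_scope('residual_conv_2'):
        # Feed the first layer's output (not x) into the second layer
        residual = cnn_layer(residual, activation_fn=None)
    return activation_fn(residual + x)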

## GLU (Gated Linear Units)
- Let's try different activation functions **other than ReLU or tanh**

$$
v([A\ B]) = A \otimes \sigma(B)
$$

In [15]:
def GLU(tensor):
    """
    1. Split given tensor as two tensors by last dimension
    2. Return A * sigmoid(B)
    """
    last_axis = len(tensor.get_shape().as_list()) - 1
    
    A_B = tf.split(tensor, num_or_size_splits=2, axis=last_axis)
    A = A_B[0]
    B = A_B[1]
    
    # Element-wise product
    return tf.multiply(A, tf.nn.sigmoid(B))

In [16]:
tf.InteractiveSession()
demo_tensor=tf.random_normal([3, 8])
GLU(demo_tensor).eval()

array([[-0.47210816,  0.24534498, -1.25452566,  0.01262983],
       [ 0.7525444 , -0.51608807, -0.00512831, -0.00344222],
       [-0.30986243, -0.64715064,  0.49930477,  0.17770895]], dtype=float32)

## GLU Initialization (Appendix A., B.)

- Upper Bound on Squared sigmoid
<img src="images/upper_bound_sigmoid.png" style="width: 500px;"/>

- Output variance
$$
y_l =W_{l}x_{l}+ b_l
$$
$$
Var[y_l] = n_{l}Var[w_{l}x_{l}]
$$
- $n_l$: the number of inputs to the layer.

- With $w_l$ and $x_l$ independent of each other and normally distributed with zero mean, this amounts to
<img src="images/weight_init_3.png" style="width: 500px;"/>

- Approximate GLU output variance
<img src="images/weight_init_2.png" style="width: 500px;"/>

- Initialize $W_l$ from $\mathcal{N}(0, \sqrt{4/n_l})$
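
The effect of this initialization can be checked empirically. The NumPy sketch below is an illustration under simplifying assumptions (unit-variance Gaussian inputs, a single linear layer followed by a GLU, no bias, no dropout); the helper names `glu` and `glu_output_variance` are made up for this example. It compares the output variance for weights with standard deviation $\sqrt{4/n_l}$ against the smaller $\sqrt{1/n_l}$.

```python
import numpy as np

def glu(y):
    # Split the last axis into halves A and B, return A * sigmoid(B)
    a, b = np.split(y, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def glu_output_variance(weight_std, n_inputs=256, n_outputs=256,
                        n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(n_samples, n_inputs))             # unit-variance input
    W = rng.normal(0.0, weight_std, size=(n_inputs, 2 * n_outputs))  # 2x outputs for A and B
    return glu(x @ W).var()

n = 256
var_scaled = glu_output_variance(np.sqrt(4.0 / n), n_inputs=n)  # std = sqrt(4 / n_l)
var_naive = glu_output_variance(np.sqrt(1.0 / n), n_inputs=n)   # std = sqrt(1 / n_l)
print(f'input variance 1.0 -> GLU output variance with sqrt(4/n): {var_scaled:.3f}')
print(f'input variance 1.0 -> GLU output variance with sqrt(1/n): {var_naive:.3f}')
```

With the larger standard deviation the output variance stays on the order of the input variance, while the smaller one shrinks it noticeably. Stacked over many layers, that shrinkage is what the initialization is meant to avoid.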

## Weight Normalization

- **Decouples** the **length of weight vectors** from their **direction**
<img src="images/weight_norm.png" style="width: 500px;"/>
- "We also use weight normalization for **all layers except for lookup tables**"
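
A minimal NumPy sketch of the reparameterization $w = g \cdot v / \lVert v \rVert$ from Salimans and Kingma: the direction comes from `v` and the length from the scalar `g`, one per output unit. The function name `weight_norm` and the column-per-output-unit layout are assumptions for this example.

```python
import numpy as np

def weight_norm(v, g):
    """v: [n_in, n_out] direction parameters, g: [n_out] length parameters."""
    norms = np.linalg.norm(v, axis=0, keepdims=True)  # one norm per output unit (column)
    return g * v / norms                              # each column of w has norm g

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
g = np.array([0.5, 1.0, 2.0])
w = weight_norm(v, g)
print('column norms of w:', np.linalg.norm(w, axis=0))
```

Rescaling `v` leaves `w` unchanged, so the optimizer can adjust the length (`g`) and the direction (`v`) separately.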

## Convolutional Decoder
<img src="images/decoding.gif" style="width: 800px;"/>
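
A decoder that generates left to right must not let position $t$ see tokens after $t$. One common way to get this with convolutions is to zero-pad only on the left by `k - 1` positions. The NumPy sketch below illustrates that idea; it is an assumption about the mechanism, not the authors' code, and `causal_conv1d` is a made-up name.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """x: [seq_len, dim_in], kernel: [k, dim_in, dim_out]; output t depends on x[:t + 1] only."""
    k = kernel.shape[0]
    # Left-pad with k - 1 zero vectors so the window ending at t never reaches past t
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack([
        np.einsum('kd,kde->e', padded[t:t + k], kernel)
        for t in range(x.shape[0])
    ])

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
kernel = rng.normal(size=(3, 4, 5))
y = causal_conv1d(x, kernel)

# Changing tokens at positions >= 3 must not change outputs at positions < 3
x_changed = x.copy()
x_changed[3:] += 1.0
y_changed = causal_conv1d(x_changed, kernel)
print('earlier outputs unchanged:', np.allclose(y[:3], y_changed[:3]))
```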

## Overview
<img src="images/overview.png" style="height: 900px;"/>

## Results
<img src="images/results.png" style="width: 800px"/>
<img src="images/results_1.png" style="width: 800px"/>

## Additional NLP tricks
- Byte pair encoding
- Vocabulary Selection

### Byte pair encoding

Translation of **rare and unseen words** by representing them via **subword units**
- Reduce Vocabulary size

In [17]:
from bpe_utils import get_stats, merge_vocab
from pprint import pprint

In [18]:
vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6, 'w i d e s t </w>':3}
vocab

{'l o w </w>': 5,
 'l o w e r </w>': 2,
 'n e w e s t </w>': 6,
 'w i d e s t </w>': 3}

In [19]:
bigram_pairs = get_stats(vocab)
bigram_pairs

defaultdict(int,
            {('d', 'e'): 3,
             ('e', 'r'): 2,
             ('e', 's'): 9,
             ('e', 'w'): 6,
             ('i', 'd'): 3,
             ('l', 'o'): 7,
             ('n', 'e'): 6,
             ('o', 'w'): 7,
             ('r', '</w>'): 2,
             ('s', 't'): 9,
             ('t', '</w>'): 9,
             ('w', '</w>'): 5,
             ('w', 'e'): 8,
             ('w', 'i'): 3})
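
The `bpe_utils` module is not shown in this notebook. The following is a sketch of what `get_stats` and `merge_vocab` might look like, modeled on the reference snippet in Sennrich et al. and consistent with the bigram counts printed above. The actual helpers may differ.

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Return a new vocabulary with every occurrence of `pair` merged into one symbol."""
    bigram = re.escape(' '.join(pair))
    # Match the pair only at symbol boundaries (whitespace or string edges)
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in v_in.items()}

demo_vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
              'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
demo_pairs = get_stats(demo_vocab)
demo_best = max(demo_pairs, key=demo_pairs.get)   # ties resolve to the first pair seen
demo_merged = merge_vocab(demo_best, demo_vocab)
print(demo_best)
print(demo_merged)
```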

## BPE 10 steps

In [20]:
initial_vocab = vocab
num_merges = 10
for i in range(num_merges):
    print('current vocab')
    print(vocab)
    pairs = get_stats(vocab)
    print('All bigrams')
    pprint(list(pairs.items()))
    best = max(pairs, key=pairs.get)
    print('most frequent pair:', best)
    vocab = merge_vocab(best, vocab)
    print('merged vocab')
    print(vocab)
    print('\n')
print('='*20)
print('initial vocab')
print(initial_vocab)
print('final vocab')
print(vocab)

current vocab
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
All bigrams
[(('l', 'o'), 7),
 (('o', 'w'), 7),
 (('w', '</w>'), 5),
 (('w', 'e'), 8),
 (('e', 'r'), 2),
 (('r', '</w>'), 2),
 (('n', 'e'), 6),
 (('e', 'w'), 6),
 (('e', 's'), 9),
 (('s', 't'), 9),
 (('t', '</w>'), 9),
 (('w', 'i'), 3),
 (('i', 'd'), 3),
 (('d', 'e'), 3)]
most frequent pair: ('e', 's')
merged vocab
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}


current vocab
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
All bigrams
[(('l', 'o'), 7),
 (('o', 'w'), 7),
 (('w', '</w>'), 5),
 (('w', 'e'), 2),
 (('e', 'r'), 2),
 (('r', '</w>'), 2),
 (('n', 'e'), 6),
 (('e', 'w'), 6),
 (('w', 'es'), 6),
 (('es', 't'), 9),
 (('t', '</w>'), 9),
 (('w', 'i'), 3),
 (('i', 'd'), 3),
 (('d', 'es'), 3)]
most frequent pair: ('es', 't')
merged vocab
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}


### Vocabulary Selection

In order to capture rich language phenomena, NMT models have to use a **large vocabulary size**
- high computing time & large memory usage.

For each sentence or batch, **only predict the target words in its sentence-level or batch-level vocabulary**.

For each sentence x, we have a target vocabulary $V_x$

$$
V_x = V^D_x ∪ V^P_x ∪ V^T_x ∪ V^R_x
$$

- $V^D_x$: directly translated words from dictionary

- $V^P_x$: all possible sub-sequences from dictionary

- $V^T_x$: top n common words from target dictionary

- $V^R_x$: all words in target sentences (training only)
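
The union above can be sketched with plain Python sets. Everything in this example is hypothetical: the function name `select_vocabulary`, the toy word and phrase dictionaries, and the list of common words are made up to illustrate how the four components combine. It is not the implementation from Mi et al.

```python
def select_vocabulary(source_words, word_dict, phrase_dict, common_words,
                      target_words=None):
    """Build the sentence-level target vocabulary V_x as a union of four sets."""
    # V^D_x: direct word-to-word translations of each source word
    v_d = {t for w in source_words for t in word_dict.get(w, ())}
    # V^P_x: target words of every contiguous source sub-sequence found in the phrase dictionary
    v_p = set()
    n = len(source_words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            phrase = ' '.join(source_words[i:j])
            for target_phrase in phrase_dict.get(phrase, ()):
                v_p.update(target_phrase.split())
    # V^T_x: the most common target words
    v_t = set(common_words)
    # V^R_x: words of the reference translation (available during training only)
    v_r = set(target_words) if target_words is not None else set()
    return v_d | v_p | v_t | v_r

# Toy, made-up dictionaries for the running example
word_dict = {'le': ['the'], 'chat': ['cat'], 'noir': ['black', 'dark']}
phrase_dict = {'le chat': ['the cat']}
common_words = ['the', 'a']
source = 'le chat est noir'.split()

inference_vocab = select_vocabulary(source, word_dict, phrase_dict, common_words)
training_vocab = select_vocabulary(source, word_dict, phrase_dict, common_words,
                                   target_words='the cat is black'.split())
print(sorted(inference_vocab))
print(sorted(training_vocab))
```

The output softmax is then computed only over the selected words rather than the full target vocabulary, which is where the savings in time and memory come from.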