# Learn to calculate with a seq2seq model

In this assignment, you will learn how to use neural networks to solve sequence-to-sequence prediction tasks. Seq2Seq models are very popular these days because they achieve great results in Machine Translation, Text Summarization, Conversational Modeling and more.

Using sequence-to-sequence modeling, you are going to build a calculator for evaluating arithmetic expressions: the neural network takes an equation as its input and produces the answer as its output.

The solution to this problem will be based on state-of-the-art approaches for sequence-to-sequence learning, and you should be able to easily adapt it to other tasks. However, if you want to train your own machine translation system or intelligent chatbot, it would be useful to have access to compute resources such as a GPU, and to be patient, because training such systems is usually time-consuming.

### Libraries

For this task you will need the following libraries:
 - [TensorFlow](https://www.tensorflow.org) — an open-source software library for Machine Intelligence.
 
In this assignment, we use TensorFlow 1.15.0. You can install it with pip:

    !pip install tensorflow==1.15.0
     
 - [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
 
If you have never worked with TensorFlow, you will probably want to read some tutorials while working on this assignment. For example, the [Neural Machine Translation](https://www.tensorflow.org/tutorials/seq2seq) tutorial deals with a very similar task and explains some of the concepts.

In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    ! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.setup_week4()

### Data

One benefit of this task is that you don't need to download any data: you will generate it yourself! We will use two operators (addition and subtraction) and work with non-negative integers in some range. Here are examples of correct inputs and outputs:

    Input: '1+2'
    Output: '3'
    
    Input: '0-99'
    Output: '-99'

*Note that there are no spaces between operators and operands.*


Now you need to implement the function *generate_equations*, which will be used to generate the data.

In [2]:
import random

In [3]:
def generate_equations(allowed_operators, dataset_size, min_value, max_value):
    """Generates pairs of equations and solutions to them.
    
       Each equation has a form of two integers with an operator in between.
       Each solution is an integer with the result of the operation.
    
        allowed_operators: list of strings, allowed operators.
        dataset_size: an integer, number of equations to be generated.
        min_value: an integer, min value of each operand.
        max_value: an integer, max value of each operand.

        result: a list of tuples of strings (equation, solution).
    """
    sample = []
    # range() excludes its upper bound, so add 1 to make max_value inclusive.
    op1s = random.choices(range(min_value, max_value + 1), k=dataset_size)
    op2s = random.choices(range(min_value, max_value + 1), k=dataset_size)
    for x, y in zip(op1s, op2s):
        operator = random.choice(allowed_operators)
        question = str(x)+operator+str(y)
        sample.append((question, str(eval(question))))
    return sample

To check the correctness of your implementation, use the *test_generate_equations* function:

In [4]:
def test_generate_equations():
    allowed_operators = ['+', '-']
    dataset_size = 10
    for (input_, output_) in generate_equations(allowed_operators, dataset_size, 0, 100):
        if not (type(input_) is str and type(output_) is str):
            return "Both parts should be strings."
        if eval(input_) != int(output_):
            return "The (equation: {!r}, solution: {!r}) pair is incorrect.".format(input_, output_)
    return "Tests passed."

In [5]:
print(test_generate_equations())

Tests passed.


Finally, we are ready to generate the train and test data for the neural network:

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
allowed_operators = ['+', '-']
dataset_size = 100000
data = generate_equations(allowed_operators, dataset_size, min_value=0, max_value=9999)

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

## Prepare data for the neural network

The next stage of data preparation is creating mappings of the characters to their indices in some vocabulary. Since in our task we already know which symbols will appear in the inputs and outputs, generating the vocabulary is a simple step.

#### How to create dictionaries for other tasks

First of all, you need to decide what the basic unit of the sequence is in your task. In our case, we operate on symbols and the basic unit is a symbol. The number of symbols is small, so we don't need to think about filtering/normalization steps. However, in other tasks, the basic unit is often a word, and in this case the mapping would be *word $\to$ integer*. The number of words might be huge, so it would be reasonable to filter them, for example by frequency, keeping only the frequent ones. Other strategies that you should consider are: data normalization (lowercasing, tokenization, handling of punctuation marks), separate vocabularies for input and output (e.g. for machine translation), and any specifics of the task.
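For a word-level task, the frequency filtering described above could look roughly like the following. This is a minimal sketch and not part of the assignment: `build_vocab`, `min_count`, and the `<unk>` token are hypothetical names chosen for illustration, and lowercasing plus whitespace tokenization is an assumed normalization.

```python
from collections import Counter

def build_vocab(sentences, min_count=2, specials=('#', '^', '$', '<unk>')):
    """Builds a word -> id mapping, keeping words seen at least min_count times.

    Hypothetical helper for illustration; lowercasing and whitespace
    tokenization are assumed normalization steps.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    vocab = {symbol: i for i, symbol in enumerate(specials)}
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

corpus = ["The cat sat", "the cat ran", "A dog ran"]
vocab = build_vocab(corpus, min_count=2)

# Words filtered out of the vocabulary fall back to the '<unk>' id at lookup time.
ids = [vocab.get(w, vocab['<unk>']) for w in "the dog ran".split()]
print(vocab)
print(ids)
```

Reserving the first ids for special symbols mirrors the character-level vocabulary used in this notebook, where the padding symbol gets id 0.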

In [8]:
word2id = {symbol:i for i, symbol in enumerate('#^$+-1234567890')}
id2word = {i:symbol for symbol, i in word2id.items()}

#### Special symbols

In [9]:
start_symbol = '^'
end_symbol = '$'
padding_symbol = '#'

You may have noticed that we have added 3 special symbols: '^', '\$' and '#':
- '^' symbol will be passed to the network to indicate the beginning of the decoding procedure. We will discuss this one later in more detail.
- '\$' symbol will be used to indicate the *end of a string*, both for input and output sequences. 
- '#' symbol will be used as a *padding* character to make lengths of all strings equal within one training batch.

Conventions for special symbols in encoder-decoder networks vary, so don't be confused if you come across other variants in the tutorials you read.

#### Padding

Once the vocabulary is ready, we need to be able to convert a sentence to a list of vocabulary indices and back. At the same time, let's take care of padding. We are going to preprocess each sequence from the input (and the ground-truth output) so that:
- it has a predefined length *padded_len*
- it is truncated or padded with the *padding symbol* '#' as needed
- it *always* ends with the *end symbol* '\$'

We will treat the original characters of the sequence **and the end symbol** as the valid part of the input. We will store *the actual length* of the sequence, which includes the end symbol, but does not include the padding symbols. 

Now implement the function *sentence_to_ids*, which performs the preprocessing described above.

In [10]:
def sentence_to_ids(sentence, word2id, padded_len):
    """ Converts a sequence of symbols to a padded sequence of their ids.
    
      sentence: a string, input/output sequence of symbols.
      word2id: a dict, a mapping from original symbols to ids.
      padded_len: an integer, a desirable length of the sequence.

      result: a tuple of (a list of ids, an actual length of sentence).
    """
    sent_ids = [word2id[x] for x in sentence]
    if(len(sent_ids) >= padded_len):
        sent_ids = sent_ids[:padded_len]
        sent_ids[-1] = word2id['$']
        sent_len = len(sent_ids)
    else:
        sent_ids.append(word2id['$'])
        sent_len = len(sent_ids)
        sent_ids += [word2id['#']]*(padded_len-len(sent_ids))
    return sent_ids, sent_len

Check that your implementation is correct:

In [11]:
def test_sentence_to_ids():
    sentences = [("123+123", 7), ("123+123", 8), ("123+123", 10)]
    expected_output = [([5, 6, 7, 3, 5, 6, 2], 7), 
                       ([5, 6, 7, 3, 5, 6, 7, 2], 8), 
                       ([5, 6, 7, 3, 5, 6, 7, 2, 0, 0], 8)] 
    for (sentence, padded_len), (sentence_ids, expected_length) in zip(sentences, expected_output):
        output, length = sentence_to_ids(sentence, word2id, padded_len)
        if output != sentence_ids:
            return("Conversion of '{}' for padded_len={} to {} is incorrect.".format(
                sentence, padded_len, output))
        if length != expected_length:
            return("Conversion of '{}' for padded_len={} has incorrect actual length {}.".format(
                sentence, padded_len, length))
    return("Tests passed.")

In [12]:
print(test_sentence_to_ids())

Tests passed.


We also need to be able to get back from indices to symbols:

In [13]:
def ids_to_sentence(ids, id2word):
    """ Converts a sequence of ids to a sequence of symbols.
    
          ids: a list, indices for the padded sequence.
          id2word:  a dict, a mapping from ids to original symbols.

          result: a list of symbols.
    """
 
    return [id2word[i] for i in ids] 
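A quick round trip shows how the two conversions fit together. The sketch below is self-contained, so it repeats the vocabulary and uses a compact helper, `encode` (a hypothetical name), that is assumed to follow the same contract as *sentence_to_ids*: truncate or pad to `padded_len` and always end with '$'.

```python
# Compact copies of the notebook's vocabulary so the sketch runs on its own.
word2id = {symbol: i for i, symbol in enumerate('#^$+-1234567890')}
id2word = {i: symbol for symbol, i in word2id.items()}

def encode(sentence, padded_len):
    """Same contract as sentence_to_ids: truncate or pad, always end with '$'."""
    ids = [word2id[ch] for ch in sentence][:padded_len - 1] + [word2id['$']]
    length = len(ids)  # actual length, including the end symbol
    return ids + [word2id['#']] * (padded_len - length), length

ids, length = encode('12+34', padded_len=8)

# Map ids back to symbols; the padding is still present after decoding.
decoded = ''.join(id2word[i] for i in ids)

# The stored actual length lets us cut the padding off again.
valid_part = decoded[:length]

print(ids, length)
print(decoded, valid_part)
```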

#### Generating batches

The final step of data preparation is a function that transforms a batch of sentences to a list of lists of indices. 

In [14]:
def batch_to_ids(sentences, word2id, max_len):
    """Prepares batches of indices. 
    
       Sequences are padded to match the longest sequence in the batch
       (plus one position for the end symbol), capped at max_len.

        sentences: a list of strings, original sequences.
        word2id: a dict, a mapping from original symbols to ids.
        max_len: an integer, max len of sequences allowed.

        result: a list of lists of ids, a list of actual lengths.
    """
    
    max_len_in_batch = min(max(len(s) for s in sentences) + 1, max_len)
    batch_ids, batch_ids_len = [], []
    for sentence in sentences:
        ids, ids_len = sentence_to_ids(sentence, word2id, max_len_in_batch)
        batch_ids.append(ids)
        batch_ids_len.append(ids_len)
    return batch_ids, batch_ids_len

The function *generate_batches* yields batches of a given size from the samples.

In [15]:
def generate_batches(samples, batch_size=64):
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        yield X, Y
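To see how *generate_batches* splits the samples, including the smaller final batch, consider this small sketch. It repeats the function definition so that it runs on its own; the five sample pairs are made up for illustration.

```python
def generate_batches(samples, batch_size=64):
    """Copy of the notebook's generator: yields (inputs, outputs) lists."""
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        # Leftover samples form a final, smaller batch.
        yield X, Y

samples = [('1+2', '3'), ('4-1', '3'), ('0-99', '-99'), ('7+8', '15'), ('5-5', '0')]
batches = list(generate_batches(samples, batch_size=2))
sizes = [len(X) for X, _ in batches]
print(sizes)
print(batches[-1])
```

Because the last batch can be smaller than `batch_size`, downstream code should read the batch dimension from the data rather than assume a fixed size.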

To illustrate the result of the implemented functions, run the following cell:

In [16]:
sentences = train_set[0]
ids, sent_lens = batch_to_ids(sentences, word2id, max_len=10)
print('Input:', sentences)
print('Ids: {}\nSentences lengths: {}'.format(ids, sent_lens))

Input: ('7200-4366', '2834')
Ids: [[11, 6, 14, 14, 4, 8, 7, 10, 10, 2], [6, 12, 7, 8, 2, 0, 0, 0, 0, 0]]
Sentences lengths: [10, 5]


## Encoder-Decoder architecture

Encoder-Decoder is a successful architecture for Seq2Seq tasks with different lengths of input and output sequences. The main idea is to use two recurrent neural networks, where the first neural network *encodes* the input sequence into a real-valued vector and then the second neural network *decodes* this vector into the output sequence. While building the neural network, we will specify some particular characteristics of this architecture.
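One way to hand the encoded vector to the decoder, and the one the model in this notebook uses, is to repeat it along the time axis and concatenate it with the embedded decoder inputs at every step. The NumPy sketch below only illustrates the shapes involved; the random arrays are stand-ins for real encoder outputs and embeddings, and the batch size and number of steps are arbitrary.

```python
import numpy as np

# emb_dim and hidden match the notebook's settings; batch and steps are arbitrary.
batch, steps, emb_dim, hidden = 2, 6, 16, 512

rng = np.random.default_rng(0)
thought = rng.normal(size=(batch, hidden))           # stand-in for the encoder output
embedded = rng.normal(size=(batch, steps, emb_dim))  # stand-in for embedded decoder inputs

# Repeat the encoded vector along a new time axis:
# (batch, hidden) -> (batch, steps, hidden)
tiled = np.repeat(thought[:, np.newaxis, :], steps, axis=1)

# Each decoder step sees its own input embedding plus the full input summary.
decoder_inputs = np.concatenate([embedded, tiled], axis=-1)
print(decoder_inputs.shape)
```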

### Modified section

This section uses TensorFlow 2.x instead of 1.x, which does away with all the session and placeholder mechanics.


<img src="encoder-decoder-pic.png" style="width: 500px;">

In [17]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
import tensorflow as tf
import logging
tf.get_logger().setLevel(logging.ERROR)
tf.autograph.set_verbosity(1)
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:  # skip on CPU-only machines, where the list is empty
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, GRU, Dense, Concatenate, Dropout
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.utils import to_categorical


In [18]:
batch_size = 512
n_epochs = 32
dropout_drop_probability = 0.25
max_len = 20
recurrent_dims = 512
common_embedding = Embedding(len(word2id), 16)
learning_rate = 0.005
max_decoder_steps = 7

class Encoder(Model):
    def __init__(self, dropout_prob=0.0):
        super(Encoder, self).__init__()
        self.embedding = common_embedding
        self.gru = GRU(recurrent_dims)
        self.dropout = Dropout(dropout_prob)

    def call(self, inputs, training=False, mask=None):
        embeddeds = self.embedding(inputs)
        outs = self.gru(embeddeds)
        if training:
            outs = self.dropout(outs)
        return outs


class Decoder(Model):
    def __init__(self, dropout_prob=0.0):
        super(Decoder, self).__init__()
        self.embedding = common_embedding
        self.concatter = Concatenate()
        self.gru = GRU(recurrent_dims, return_sequences=True, return_state=True)
        self.dropout = Dropout(dropout_prob)
        self.head = Dense(len(word2id), activation='softmax')

    def call(self, inputs, training=False, mask=None):
        sequence, thought_vector, hidden_states = inputs
        embeddeds = self.embedding(sequence)
        concatted = self.concatter([embeddeds, thought_vector])
        gru_outs, hidden_state = self.gru(concatted, initial_state=hidden_states)
        if training:
            gru_outs = self.dropout(gru_outs)
        final_outs = self.head(gru_outs)
        return final_outs, hidden_state



encoder = Encoder(dropout_drop_probability)
decoder = Decoder(dropout_drop_probability)

optimizer = tf.keras.optimizers.Adam(learning_rate)
loss_op = CategoricalCrossentropy(reduction='none')


def loss_fn(outputs, targets):
    mask = tf.not_equal(targets, 0)
    mask = tf.cast(mask, tf.float32)
    targets = to_categorical(targets, len(word2id), dtype=np.float32)
    loss_ = loss_op(targets, outputs)
    loss_ = loss_ * mask
    loss = tf.reduce_mean(loss_)
    return loss


def train_step(x_batch, y_batch):
    new_column = np.asarray([word2id['^']] * y_batch.shape[0])
    y_batch_in = np.insert(y_batch, 0, new_column, 1)[:, :-1]
    with tf.GradientTape() as tape:
        thought_vectors = encoder(x_batch, training=True)
        thought_vectors = tf.expand_dims(thought_vectors, 1)
        thought_vectors = tf.repeat(thought_vectors, y_batch.shape[1], 1)
        outs, _ = decoder((y_batch_in, thought_vectors, np.zeros((y_batch.shape[0], recurrent_dims))), training=True)
        loss = loss_fn(outs, targets=y_batch)
    all_variables = common_embedding.trainable_variables + encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, all_variables)
    optimizer.apply_gradients(zip(grads, all_variables))


def predict(x_batch):
    thought_vectors = encoder(x_batch)
    hidden_state = np.zeros((x_batch.shape[0], recurrent_dims))
    thought_vectors = tf.expand_dims(thought_vectors, 1)
    generated_ids = [np.asarray([[1]] * x_batch.shape[0])]
    while len(generated_ids) < max_decoder_steps:
        outs, hidden_state = decoder((generated_ids[-1], thought_vectors, hidden_state))
        generated_ids.append(tf.argmax(outs, -1))
    return np.asarray(generated_ids)[:, :, 0].T[:, 1:]


def predict_with_loss(x_batch, y_batch):
    thought_vectors = encoder(x_batch)
    hidden_state = np.zeros((x_batch.shape[0], recurrent_dims))
    thought_vectors = tf.expand_dims(thought_vectors, 1)
    generated_ids = [np.asarray([[1]] * x_batch.shape[0])]
    generated_softs = []
    while len(generated_ids) < max_decoder_steps:
        outs, hidden_state = decoder((generated_ids[-1], thought_vectors, hidden_state))
        generated_softs.append(outs)
        generated_ids.append(tf.argmax(outs, -1))
    generated_softs = np.concatenate(generated_softs, 1)[:, :y_batch.shape[1]]
    loss = loss_fn(generated_softs, y_batch)
    return np.asarray(generated_ids)[:, :, 0].T[:, 1:], loss

def remove_symbols(ls):
    lsnew = []
    for i in ls:
        if i == '$':
            break
        if i not in ['^', '#', '$']:
            lsnew.append(i)
    return lsnew
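*train_step* and *loss_fn* rely on two small array manipulations: shifting the targets right by one position to build the decoder inputs (teacher forcing), and masking padded positions in the loss. The NumPy sketch below reproduces both on a toy batch. The per-position loss values are made up, and the "mean over valid positions" variant is shown only for comparison; it is not what the notebook computes.

```python
import numpy as np

start_id, pad_id = 1, 0  # ids of '^' and '#' in word2id

# Two padded targets: '2834$#' and '-569$#'
y_batch = np.array([[6, 12, 7, 8, 2, 0],
                    [4, 9, 10, 13, 2, 0]])

# Teacher forcing: the decoder input is the target shifted right by one,
# with the start symbol in front and the last position dropped.
start_column = np.full(y_batch.shape[0], start_id)
y_batch_in = np.insert(y_batch, 0, start_column, axis=1)[:, :-1]

# Padding mask: 1.0 for real symbols (including '$'), 0.0 for padding.
mask = (y_batch != pad_id).astype(np.float32)

# Made-up per-position losses, only to show the effect of the mask.
per_position_loss = np.ones_like(mask)

# Mean over all positions after masking, as loss_fn does with tf.reduce_mean.
masked_mean = (per_position_loss * mask).mean()

# Alternative for comparison: mean over valid positions only.
valid_mean = (per_position_loss * mask).sum() / mask.sum()

print(y_batch_in)
print(masked_mean, valid_mean)
```

With the first variant, batches with more padding report a smaller loss even if every real position is equally wrong, which is worth keeping in mind when comparing loss values across batches.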

In [19]:

n_step = int(len(train_set) / batch_size)


invalid_number_prediction_counts = []
all_model_predictions = []
all_ground_truth = []

print('Start training... \n')

for epoch in range(n_epochs):
    random.shuffle(train_set)
    random.shuffle(test_set)

    print('Train: epoch', epoch + 1)
    for n_iter, (X_batch, Y_batch) in enumerate(generate_batches(train_set, batch_size=batch_size)):
        ######################################
        ######### YOUR CODE HERE #############
        ######################################
        # prepare the data (X_batch and Y_batch) for training
        # using function batch_to_ids
        ques_ids, _ = batch_to_ids(X_batch, word2id, 20)
        ans_ids, _ = batch_to_ids(Y_batch, word2id, 20)
        x_batch = np.asarray(ques_ids)
        y_batch = np.asarray(ans_ids)
        train_step(x_batch, y_batch)
        predictions, loss = predict_with_loss(x_batch, y_batch)

        if n_iter % 50 == 0:
            print("Epoch: [%d/%d], step: [%d/%d], loss: %f" % (epoch + 1, n_epochs, n_iter + 1, n_step, loss))

    X_sent, Y_sent = next(generate_batches(test_set, batch_size=batch_size))
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    # prepare test data (X_sent and Y_sent) for predicting
    # quality and computing value of the loss function
    # using function batch_to_ids
    ques_ids, _ = batch_to_ids(X_sent, word2id, 20)
    ans_ids, _ = batch_to_ids(Y_sent, word2id, 20)
    x_batch = np.asarray(ques_ids)
    y_batch = np.asarray(ans_ids)
    predictions, loss = predict_with_loss(x_batch, y_batch)
    print('Test: epoch', epoch + 1, 'loss:', loss, )
    for x, y, p in list(zip(x_batch, y_batch, predictions))[:3]:
        print('X:', ''.join(ids_to_sentence(x, id2word)))
        print('Y:', ''.join(ids_to_sentence(y, id2word)))
        print('O:', ''.join(ids_to_sentence(p, id2word)))
        print('')

    model_predictions = []
    ground_truth = []
    invalid_number_prediction_count = 0
    # For the whole test set, compute the ground-truth values and the predicted
    # values (both as integers) so that metrics can be calculated later.
    # If the string generated by the model is not a valid number (e.g. '1-1'),
    # increase invalid_number_prediction_count and skip both the prediction
    # and the corresponding ground-truth value.
    for X_batch, Y_batch in generate_batches(test_set, batch_size=batch_size):
        ######################################
        ######### YOUR CODE HERE #############
        ######################################
        ques_ids, _ = batch_to_ids(X_batch, word2id, 20)
        ans_ids, _ = batch_to_ids(Y_batch, word2id, 20)
        x_batch = np.asarray(ques_ids)
        y_batch = np.asarray(ans_ids)
        predictions = predict(x_batch)
        for row, rowt in zip(predictions, y_batch):
            sent = ids_to_sentence(row, id2word)
            sent = remove_symbols(sent)
            sent = ''.join(sent)
            try:
                intval = int(sent)
                model_predictions.append(intval)
                ground_truth.append(int(''.join([str(x) for x in remove_symbols(ids_to_sentence(rowt, id2word))])))
            except Exception:
                invalid_number_prediction_count += 1
    all_model_predictions.append(model_predictions)
    all_ground_truth.append(ground_truth)
    invalid_number_prediction_counts.append(invalid_number_prediction_count)

print('\n...training finished.')


Start training... 

Train: epoch 1
Epoch: [1/32], step: [1/156], loss: 2.350093
Epoch: [1/32], step: [51/156], loss: 1.928630
Epoch: [1/32], step: [101/156], loss: 1.898332
Epoch: [1/32], step: [151/156], loss: 1.661202
Test: epoch 1 loss: tf.Tensor(1.6845379, shape=(), dtype=float32)
X: 9504+6066$
Y: 15570$
O: 12666$

X: 9044-2933$
Y: 6111$#
O: 1006$$

X: 7559-8128$
Y: -569$#
O: -100$$

Train: epoch 2
Epoch: [2/32], step: [1/156], loss: 1.678942
Epoch: [2/32], step: [51/156], loss: 1.762639
Epoch: [2/32], step: [101/156], loss: 1.519660
Epoch: [2/32], step: [151/156], loss: 1.483795
Test: epoch 2 loss: tf.Tensor(1.4485499, shape=(), dtype=float32)
X: 3368-6131$
Y: -2763$
O: -2699$

X: 2705+9339$
Y: 12044$
O: 11111$

X: 8898+6002$
Y: 14900$
O: 14855$

Train: epoch 3
Epoch: [3/32], step: [1/156], loss: 1.504634
Epoch: [3/32], step: [51/156], loss: 1.423163
Epoch: [3/32], step: [101/156], loss: 1.362841
Epoch: [3/32], step: [151/156], loss: 1.394143
Test: epoch 3 loss: tf.Tensor(1.447626

## Evaluate results

Because our task is simple and the output is straightforward, we will use the [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) metric to evaluate the trained model across epochs. Compute the value of the metric for the output from each epoch.
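As a reminder of what the metric measures, here is a plain-Python sketch of MAE applied to the three sample predictions printed for the first test epoch above. The helper name `mae` is hypothetical; the notebook itself uses scikit-learn's `mean_absolute_error`.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |truth - prediction| over all pairs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Values taken from the sample outputs of the first test epoch.
ground_truth = [15570, 6111, -569]
predictions = [12666, 1006, -100]

# Absolute errors are 2904, 5105 and 469, so the mean is 8478 / 3.
print(mae(ground_truth, predictions))
```

Note that MAE is computed only over predictions that parse as integers, so it should be read together with the count of invalid numbers.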

In [21]:
from sklearn.metrics import mean_absolute_error
for i, (gts, predictions, invalid_number_prediction_count) in enumerate(zip(all_ground_truth,
                                                                            all_model_predictions,
                                                                            invalid_number_prediction_counts), 1):
    mae = mean_absolute_error(gts, predictions)
    print("Epoch: %i, MAE: %f, Invalid numbers: %i" % (i, mae, invalid_number_prediction_count))

Epoch: 1, MAE: 1917.066650, Invalid numbers: 0
Epoch: 2, MAE: 621.299050, Invalid numbers: 0
Epoch: 3, MAE: 532.709900, Invalid numbers: 0
Epoch: 4, MAE: 308.023800, Invalid numbers: 0
Epoch: 5, MAE: 263.563950, Invalid numbers: 0
Epoch: 6, MAE: 172.975450, Invalid numbers: 0
Epoch: 7, MAE: 126.614850, Invalid numbers: 0
Epoch: 8, MAE: 114.709650, Invalid numbers: 0
Epoch: 9, MAE: 121.010000, Invalid numbers: 0
Epoch: 10, MAE: 115.920200, Invalid numbers: 0
Epoch: 11, MAE: 132.741350, Invalid numbers: 0
Epoch: 12, MAE: 104.583650, Invalid numbers: 0
Epoch: 13, MAE: 33.266050, Invalid numbers: 0
Epoch: 14, MAE: 52.793900, Invalid numbers: 0
Epoch: 15, MAE: 93.571250, Invalid numbers: 0
Epoch: 16, MAE: 35.929950, Invalid numbers: 0
Epoch: 17, MAE: 22.361354, Invalid numbers: 3
Epoch: 18, MAE: 25.786239, Invalid numbers: 1
Epoch: 19, MAE: 90.178709, Invalid numbers: 1
Epoch: 20, MAE: 25.050900, Invalid numbers: 0
Epoch: 21, MAE: 20.665667, Invalid numbers: 2
Epoch: 22, MAE: 33.821250, Inv