# Learn to calculate with seq2seq model

In this assignment, you will learn how to use neural networks to solve sequence-to-sequence prediction tasks. Seq2Seq models are very popular these days because they achieve great results in Machine Translation, Text Summarization, Conversational Modeling and more.

Using sequence-to-sequence modeling you are going to build a calculator for evaluating arithmetic expressions, by taking an equation as an input to the neural network and producing an answer as it's output.

The resulting solution for this problem will be based on state-of-the-art approaches for sequence-to-sequence learning and you should be able to easily adapt it to solve other tasks. However, if you want to train your own machine translation system or intellectual chat bot, it would be useful to have access to compute resources like GPU, and be patient, because training of such systems is usually time consuming. 

### Libraries

For this task you will need the following libraries:
 - [TensorFlow](https://www.tensorflow.org) — an open-source software library for Machine Intelligence.
 
In this assignment, we use Tensorflow 1.15.0. You can install it with pip:

    !pip install tensorflow==1.15.0
     
 - [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
 
If you have never worked with TensorFlow, you will probably want to read some tutorials during your work on this assignment, e.g. [Neural Machine Translation](https://www.tensorflow.org/tutorials/seq2seq) tutorial deals with very similar task and can explain some concepts to you. 

In [1]:
try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False

if IN_COLAB:
    ! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.setup_week4()

### Data

One benefit of this task is that you don't need to download any data — you will generate it on your own! We will use two operators (addition and subtraction) and work with positive integer numbers in some range. Here are examples of correct inputs and outputs:

    Input: '1+2'
    Output: '3'
    
    Input: '0-99'
    Output: '-99'

*Note, that there are no spaces between operators and operands.*


Now you need to implement the function *generate_equations*, which will be used to generate the data.

In [1]:
import random
import numpy as np
import torch

In [2]:
def generate_equations(allowed_operators, dataset_size, min_value, max_value):
    """Generates pairs of equations and solutions to them.
    
       Each equation has a form of two integers with an operator in between.
       Each solution is an integer with the result of the operaion.
    
        allowed_operators: list of strings, allowed operators.
        dataset_size: an integer, number of equations to be generated.
        min_value: an integer, min value of each operand.
        max_value: an integer, max value of each operand.

        result: a list of tuples of strings (equation, solution).
    """
    sample = []
    for _ in range(dataset_size):
        x, y = np.random.randint(min_value, max_value, 2)
        operator = allowed_operators[np.random.randint(0, len(allowed_operators))]
        if operator == '+':
            result = x + y
        elif operator == '-':
            result = x - y
        sample.append((str(x) + operator + str(y),
                       str(result)))
    return sample

To check the correctness of your implementation, use *test_generate_equations* function:

In [3]:
def test_generate_equations():
    allowed_operators = ['+', '-']
    dataset_size = 10
    for (input_, output_) in generate_equations(allowed_operators, dataset_size, 0, 100):
        if not (type(input_) is str and type(output_) is str):
            return "Both parts should be strings."
        if eval(input_) != int(output_):
            return "The (equation: {!r}, solution: {!r}) pair is incorrect.".format(input_, output_)
    return "Tests passed."

In [38]:
print(test_generate_equations())

Tests passed.


Finally, we are ready to generate the train and test data for the neural network:

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
allowed_operators = ['+', '-']
dataset_size = 100000
data = generate_equations(allowed_operators, dataset_size, min_value=0, max_value=9999)

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [457]:
train_set[0]

('3537-8098', '-4561')

## Prepare data for the neural network

The next stage of data preparation is creating mappings of the characters to their indices in some vocabulary. Since in our task we already know which symbols will appear in the inputs and outputs, generating the vocabulary is a simple step.

#### How to create dictionaries for other task

First of all, you need to understand what is the basic unit of the sequence in your task. In our case, we operate on symbols and the basic unit is a symbol. The number of symbols is small, so we don't need to think about filtering/normalization steps. However, in other tasks, the basic unit is often a word, and in this case the mapping would be *word $\to$ integer*. The number of words might be huge, so it would be reasonable to filter them, for example, by frequency and leave only the frequent ones. Other strategies that your should consider are: data normalization (lowercasing, tokenization, how to consider punctuation marks), separate vocabulary for input and for output (e.g. for machine translation), some specifics of the task.

In [5]:
word2id = {symbol:i for i, symbol in enumerate('#^$+-1234567890')}
id2word = {i:symbol for symbol, i in word2id.items()}

#### Special symbols

In [6]:
start_symbol = '^'
end_symbol = '$'
padding_symbol = '#'

You could notice that we have added 3 special symbols: '^', '\$' and '#':
- '^' symbol will be passed to the network to indicate the beginning of the decoding procedure. We will discuss this one later in more details.
- '\$' symbol will be used to indicate the *end of a string*, both for input and output sequences. 
- '#' symbol will be used as a *padding* character to make lengths of all strings equal within one training batch.

People have a bit different habits when it comes to special symbols in encoder-decoder networks, so don't get too much confused if you come across other variants in tutorials you read. 

#### Padding

When vocabularies are ready, we need to be able to convert a sentence to a list of vocabulary word indices and back. At the same time, let's care about padding. We are going to preprocess each sequence from the input (and output ground truth) in such a way that:
- it has a predefined length *padded_len*
- it is probably cut off or padded with the *padding symbol* '#'
- it *always* ends with the *end symbol* '$'

We will treat the original characters of the sequence **and the end symbol** as the valid part of the input. We will store *the actual length* of the sequence, which includes the end symbol, but does not include the padding symbols. 

 Now you need to implement the function *sentence_to_ids* that does the described job. 

In [7]:
def sentence_to_ids(sentence, word2id, padded_len, is_target_input, is_target_output):
    """ Converts a sequence of symbols to a padded sequence of their ids.
    
      sentence: a string, input/output sequence of symbols.
      word2id: a dict, a mapping from original symbols to ids.
      padded_len: an integer, a desirable length of the sequence.

      result: a tuple of (a list of ids, an actual length of sentence).
    """
    pad_add_len = padded_len - len(sentence) - 1
    sent_len = len(sentence) + 1
    if not (is_target_input or is_target_output):
        sent_ids = [word2id[el] for el in sentence] + [word2id['$']] + [word2id['#']] * pad_add_len
    elif is_target_input:
        sent_ids = [word2id['^']] + [word2id[el] for el in sentence] + [word2id['#']] * pad_add_len
    elif is_target_output:
        sent_ids = [word2id[el] for el in sentence] + [word2id['$']] + [word2id['#']] * pad_add_len
    
    return sent_ids, sent_len

Check that your implementation is correct:

In [9]:
def test_sentence_to_ids():
    sentences = [("123+123", 7), ("123+123", 8), ("123+123", 10)]
    expected_output = [([5, 6, 7, 3, 5, 6, 7, 2], 8), 
                       ([5, 6, 7, 3, 5, 6, 7, 2], 8), 
                       ([5, 6, 7, 3, 5, 6, 7, 2, 0, 0], 8)] 
    for (sentence, padded_len), (sentence_ids, expected_length) in zip(sentences, expected_output):
        output, length = sentence_to_ids(sentence, word2id, padded_len)
        if output != sentence_ids:
            return("Convertion of '{}' for padded_len={} to {} is incorrect.".format(
                sentence, padded_len, output))
        if length != expected_length:
            return("Convertion of '{}' for padded_len={} has incorrect actual length {}.".format(
                sentence, padded_len, length))
    return("Tests passed.")

In [67]:
print(test_sentence_to_ids())

Tests passed.


We also need to be able to get back from indices to symbols:

In [8]:
def ids_to_sentence(ids, id2word):
    """ Converts a sequence of ids to a sequence of symbols.
    
          ids: a list, indices for the padded sequence.
          id2word:  a dict, a mapping from ids to original symbols.

          result: a list of symbols.
    """
 
    return [id2word[i] for i in ids] 

#### Generating batches

The final step of data preparation is a function that transforms a batch of sentences to a list of lists of indices. 

In [9]:
def batch_to_ids(sentences, word2id, max_len, is_target_input, is_target_output):
    """Prepares batches of indices. 
    
       Sequences are padded to match the longest sequence in the batch,
       if it's longer than max_len, then max_len is used instead.

        sentences: a list of strings, original sequences.
        word2id: a dict, a mapping from original symbols to ids.
        max_len: an integer, max len of sequences allowed.

        result: a list of lists of ids, a list of actual lengths.
    """
    
    max_len_in_batch = min(max(len(s) for s in sentences) + 1, max_len)
    batch_ids, batch_ids_len = [], []
    for sentence in sentences:
        ids, ids_len = sentence_to_ids(sentence, word2id, max_len_in_batch, is_target_input, is_target_output)
        batch_ids.append(ids)
        batch_ids_len.append(ids_len)
    return batch_ids, batch_ids_len

The function *generate_batches* will help to generate batches with defined size from given samples.

In [10]:
def transform_input_data(X, Y, word2id, max_len):
    X, X_lens = batch_to_ids(X, word2id, max_len=max_len, is_target_input=False, is_target_output=False)
    Y_input, Y_lens = batch_to_ids(Y, word2id, max_len=max_len, is_target_input=True, is_target_output=False)
    Y_output, Y_lens = batch_to_ids(Y, word2id, max_len=max_len, is_target_input=False, is_target_output=True)
    return torch.from_numpy(np.array(X)), torch.from_numpy(np.array(Y_input)), \
            torch.from_numpy(np.array(Y_output)), torch.from_numpy(np.array(X_lens)), torch.from_numpy(np.array(Y_lens))


def generate_batches(samples, batch_size=64):
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield transform_input_data(X, Y, word2id, max_len=20)
            X, Y = [], []
    if X and Y:
        yield transform_input_data(X, Y, word2id, max_len=20)

In [72]:
sentences

('3537-8098', '-4561')

To illustrate the result of the implemented functions, run the following cell:

In [71]:
sentences = train_set[0]
ids, sent_lens = batch_to_ids(sentences, word2id, max_len=10)
print('Input:', sentences)
print('Ids: {}\nSentences lengths: {}'.format(ids, sent_lens))

Input: ('3537-8098', '-4561')
Ids: [[7, 9, 7, 11, 4, 12, 14, 13, 12, 2], [4, 8, 9, 10, 5, 2, 0, 0, 0, 0]]
Sentences lengths: [10, 6]


## Encoder-Decoder architecture

Encoder-Decoder is a successful architecture for Seq2Seq tasks with different lengths of input and output sequences. The main idea is to use two recurrent neural networks, where the first neural network *encodes* the input sequence into a real-valued vector and then the second neural network *decodes* this vector into the output sequence. While building the neural network, we will specify some particular characteristics of this architecture.

Let us use TensorFlow building blocks to specify the network architecture.

First, we need to create [placeholders](https://www.tensorflow.org/api_guides/python/io_ops#Placeholders) to specify what data we are going to feed into the network during the execution time. For this task we will need:
 - *input_batch* — sequences of sentences (the shape will equal to [batch_size, max_sequence_len_in_batch]);
 - *input_batch_lengths* — lengths of not padded sequences (the shape equals to [batch_size]);
 - *ground_truth* — sequences of groundtruth (the shape will equal to [batch_size, max_sequence_len_in_batch]);
 - *ground_truth_lengths* — lengths of not padded groundtruth sequences (the shape equals to [batch_size]);
 - *dropout_ph* — dropout keep probability; this placeholder has a predifined value 1;
 - *learning_rate_ph* — learning rate.

### I tried to write on tensroflow 2.5 but I got Out of Memory every time in the middle of epoch. I decided to switch to PyTorch.

#### by the way in folder there is the file with tensroflow realization of architecture

Now, let us specify the layers of the neural network. First, we need to prepare an embedding matrix. Since we use the same vocabulary for input and output, we need only one such matrix. For tasks with different vocabularies there would be multiple embedding layers.
- Create embeddings matrix with [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable). Specify its name, type (tf.float32), and initialize with random values.
- Perform [embeddings lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) for a given input batch.

#### Encoder

The first RNN of the current architecture is called an *encoder* and serves for encoding an input sequence to a real-valued vector. Input of this RNN is an embedded input batch. Since sentences in the same batch could have different actual lengths, we also provide input lengths to avoid unnecessary computations. The final encoder state will be passed to the second RNN (decoder), which we will create soon. 

- TensorFlow provides a number of [RNN cells](https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cells_for_use_with_TensorFlow_s_core_RNN_methods) ready for use. We suggest that you use [GRU cell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/GRUCell), but you can also experiment with other types. 
- Wrap your cells with [DropoutWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper). Dropout is an important regularization technique for neural networks. Specify input keep probability using the dropout placeholder that we created before.
- Combine the defined encoder cells with [Dynamic RNN](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn). Use the embedded input batches and their lengths here.
- Use *dtype=tf.float32* everywhere.

In [11]:
import torch.nn as nn

In [12]:
class EncoderLayer(nn.Module):
    def __init__(self, vocab_size, emb_dim, gru_dim, dropout):
        super(EncoderLayer, self).__init__()

        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, gru_dim, dropout=dropout)

    def forward(self, x, input_lengths):

        embeddings = self.embedding(x)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embeddings, input_lengths,
                                                         enforce_sorted=False)
        outputs, state = self.gru(embeddings)

        return outputs[-1], state.squeeze()

#### Decoder

The second RNN is called a *decoder* and serves for generating the output sequence. In the simple seq2seq arcitecture, the input sequence is provided to the decoder only as the final state of the encoder. Obviously, it is a bottleneck and [Attention techniques](https://www.tensorflow.org/tutorials/seq2seq#background_on_the_attention_mechanism) can help to overcome it. So far, we do not need them to make our calculator work, but this would be a necessary ingredient for more advanced tasks. 

During training, decoder also uses information about the true output. It is feeded in as input symbol by symbol. However, during the prediction stage (which is called *inference* in this architecture), the decoder can only use its own generated output from the previous step to feed it in at the next step. Because of this difference (*training* vs *inference*), we will create two distinct instances, which will serve for the described scenarios.

The picture below illustrates the point. It also shows our work with the special characters, e.g. look how the start symbol `^` is used. The transparent parts are ignored. In decoder, it is masked out in the loss computation. In encoder, the green state is considered as final and passed to the decoder. 

<img src="encoder-decoder-pic.png" style="width: 500px;">

Now, it's time to implement the decoder:
 - First, we should create two [helpers](https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Dynamic_Decoding). These classes help to determine the behaviour of the decoder. During the training time, we will use [TrainingHelper](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/TrainingHelper). For the inference we recommend to use [GreedyEmbeddingHelper](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/GreedyEmbeddingHelper).
 - To share all parameters during training and inference, we use one scope and set the flag 'reuse' to True at inference time. You might be interested to know more about how [variable scopes](https://www.tensorflow.org/programmers_guide/variables) work in TF. 
 - To create the decoder itself, we will use [BasicDecoder](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BasicDecoder) class. As previously, you should choose some RNN cell, e.g. GRU cell. To turn hidden states into logits, we will need a projection layer. One of the simple solutions is using [OutputProjectionWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/OutputProjectionWrapper).
 - For getting the predictions, it will be convinient to use [dynamic_decode](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/dynamic_decode). This function uses the provided decoder to perform decoding.

In [13]:
class DecoderLayer(nn.Module):
    def __init__(self, vocab_size, emb_size, emb_layer, gru_dim, start_token, end_token,
                 dropout=0.1, maximum_iterations=50):
        super(DecoderLayer, self).__init__()

        self.start_token, self.end_token = start_token, end_token
        self.vocab_size = vocab_size
        self.embedding = emb_layer
        self.gru_cell = nn.GRUCell(emb_size + gru_dim, gru_dim)
        self.dropout_layer = nn.Dropout(dropout)
        self.output_layer=nn.Linear(gru_dim, vocab_size)
        self.softmax = nn.Softmax(dim=-1)
        self.maximum_iterations = maximum_iterations
    

    def forward(self, encoder_outputs, state, y):
        batch_size = encoder_outputs.size()[0]

        if self.training is True:
            embeddings = self.embedding(y)
            result = torch.empty([0, batch_size, self.vocab_size])
            for symbol_slice in embeddings:
                inputs = torch.cat([symbol_slice, encoder_outputs], dim=-1)
                state = self.gru_cell(inputs, state)
                outputs = self.output_layer(self.dropout_layer(state))
                result = torch.cat([result, torch.unsqueeze(outputs, dim=0)])
            result = result.permute(1, 2, 0)
        else:
            start_tokens = torch.full([batch_size], self.start_token)
            end_tokens = torch.full([batch_size], self.end_token)
            iteration = 0
            result = torch.empty([0, batch_size])
            finished = torch.equal(start_tokens, end_tokens)
            while not finished and iteration < self.maximum_iterations:
                embeddings = self.embedding(start_tokens)
                inputs = torch.cat([embeddings, encoder_outputs], dim=-1)
                state = self.gru_cell(inputs, state)
                outputs = self.output_layer(self.dropout_layer(state))
                start_tokens = torch.argmax(self.softmax(torch.unsqueeze(outputs, dim=0)), dim=-1)
                result = torch.cat([result, start_tokens])
                start_tokens = torch.squeeze(start_tokens)
                finished = torch.equal(start_tokens, end_tokens)
                iteration += 1

        return result

In [14]:
class Seq2SeqModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, encoder_dim, start_token,
                 end_token, dropout=0.1, maximum_iterations=50):
        super(Seq2SeqModel, self).__init__()
        
        self.encoder = EncoderLayer(vocab_size, emb_dim, encoder_dim, dropout)
        decoder_dim = encoder_dim
        self.decoder = DecoderLayer(vocab_size, emb_dim, self.encoder.embedding, decoder_dim,
                                    start_token, end_token, dropout, maximum_iterations)


    def forward(self, x, y, inputs_mask):
        encoder_outputs, state = self.encoder(x, inputs_mask)
        outputs = self.decoder(encoder_outputs, state, y)
        return outputs

In [25]:
model = Seq2SeqModel(len(word2id), 100, 512,
                     start_token=word2id['^'],
                     end_token=word2id['$'],
                     dropout=0.2)

In this task we will use [sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss), which is a weighted cross-entropy loss for a sequence of logits. Take a moment to understand, what is your train logits and targets. Also note, that we do not want to take into account loss terms coming from padding symbols, so we will mask them out using weights.  

The last thing to specify is the optimization of the defined loss. 
We suggest that you use [optimize_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/optimize_loss) with Adam optimizer and a learning rate from the corresponding placeholder. You might also need to pass global step (e.g. as tf.train.get_global_step()) and clip gradients by 1.0.

In [26]:
import torch.optim as optim

loss_fn = criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


def train_step(X, Y_input, Y_output, X_lens, Y_lens):
    optimizer.zero_grad()
    predictions = model(X, Y_input, X_lens)
#     mask = torch.arange(torch.max(Y_lens))[None, :] < Y_lens[:, None]
#     print(predictions.shape, mask.shape)
#     masked = torch.masked_select(predictions, mask)
    
#     print(Y_output.shape)
    loss = loss_fn(predictions, Y_output)
    loss.backward()
    optimizer.step()
    return loss

Congratulations! You have specified all the parts of your network. You may have noticed, that we didn't deal with any real data yet, so what you have written is just recipies on how the network should function.
Now we will put them to the constructor of our Seq2SeqModel class to use it in the next section. 

## Train the network and predict output

[Session.run](https://www.tensorflow.org/api_docs/python/tf/Session#run) is a point which initiates computations in the graph that we have defined. To train the network, we need to compute *self.train_op*. To predict output, we just need to compute *self.infer_predictions*. In any case, we need to feed actual data through the placeholders that we defined above. 

## Run your experiment

Create *Seq2SeqModel* model with the following parameters:
 - *vocab_size* — number of tokens;
 - *embeddings_size* — dimension of embeddings, recommended value: 20;
 - *max_iter* — maximum number of steps in decoder, recommended value: 7;
 - *hidden_size* — size of hidden layers for RNN, recommended value: 512;
 - *start_symbol_id* — an index of the start token (`^`).
 - *end_symbol_id* — an index of the end token (`$`).
 - *padding_symbol_id* — an index of the padding token (`#`).

Set hyperparameters. You might want to start with the following values and see how it works:
- *batch_size*: 128;
- at least 10 epochs;
- value of *learning_rate*: 0.001
- *dropout_keep_probability* equals to 0.5 for training (typical values for dropout probability are ranging from 0.1 to 1.0); larger values correspond smaler number of dropout units;
- *max_len*: 20.

### I did different experiments with emb_dim, encoder_dim, dropout

Best params for me: emb_dim = 100, encoder_dim = 512, dropout = 0.2

In [33]:
from tensorflow.keras.utils import Progbar
from tqdm.notebook import tqdm


num_epochs = 10
batch_size = 64
num_training_samples = len(train_set) // batch_size
epoch = 1
model.train()

for epoch in range(1, num_epochs + 1):
    print("\nepoch {}/{}".format(epoch,num_epochs))
    pbar = tqdm(
            total=num_training_samples,
            smoothing=0.6,
            bar_format=(
                "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, "
                "{rate_inv_fmt}{postfix}]"
                    )
        )

    i = 0
    for el in generate_batches(train_set, batch_size):
        loss = train_step(torch.transpose(el[0], 1, 0), torch.transpose(el[1], 1, 0), el[2],
                              el[3], el[4])
        i += 1
        pbar.update(1)
        pbar.set_postfix_str(f'Loss: {loss}')


epoch 1/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 2/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 3/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 4/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 5/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 6/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 7/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 8/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 9/10


  0%|          | 0/1250 [00:00<?, ?s/it]


epoch 10/10


  0%|          | 0/1250 [00:00<?, ?s/it]

In [34]:
print('Loss after {} epoch: {}'.format(epoch, loss))

Loss after 10 epoch: 0.44169700145721436


In [417]:
torch.save(model.state_dict(), 'checkpoint')

Finally, we are ready to run the training! A good indicator that everything works fine is decreasing loss during the training. You should account on the loss value equal to approximately 2.7 at the beginning of the training and near 1 after the 10th epoch.

## Evaluate results

Because our task is simple and the output is straight-forward, we will use [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) metric to evaluate the trained model during the epochs. Compute the value of the metric for the output from each epoch.

In [29]:
from sklearn.metrics import mean_absolute_error

In [30]:
def convert_str_to_number(s):
    el = s.split('$')[0]
    is_minus = True if el[0] == '-' else False
    if is_minus:
        return -int(el[1:])
    return el

In [35]:
model.eval()

invalid_number_prediction_count = 0
y_true, y_pred = [], []
for el in generate_batches(test_set, batch_size):
    preds = model(torch.transpose(el[0], 1, 0), torch.transpose(el[1], 1, 0), el[3])
    pred_answers, real_answers = [], []
    for res, real_answer in zip(torch.transpose(preds, 1, 0).numpy().astype(np.int32), el[2].numpy().astype(np.int32)):
        pred_answers.append(''.join(ids_to_sentence(res, id2word)))
        real_answers.append(''.join(ids_to_sentence(real_answer, id2word)))
        
    for pred_answer, real_answer in zip(pred_answers, real_answers):
        try:
            int_pred_answer = convert_str_to_number(pred_answer)
        except Exception as e:
            invalid_number_prediction_count += 1
            continue
        int_real_answer = convert_str_to_number(real_answer)
        y_true.append(int_real_answer)
        y_pred.append(int_pred_answer)
        

In [36]:
mae = mean_absolute_error(y_true, y_pred)
print("Epoch: %i, MAE: %f, Invalid numbers: %i" % (epoch, mae, invalid_number_prediction_count))

Epoch: 10, MAE: 33.105850, Invalid numbers: 0


  return f(*args, **kwargs)
  return f(*args, **kwargs)
