
## Building a deep learning calculator

In this HW we will use seq2seq for building a calculator. The input will be an equation and the solution will be generated by the network.

### Data Generation

In this task we will generate our own data! We will use three operators (addition, multiplication and subtraction) and work with positive integer numbers in some range. Here are examples of correct inputs and outputs:

    Input: '1+2'
    Output: '3'
    
    Input: '0-99'
    Output: '-99'

*Note, that there are no spaces between operators and operands.*




In [121]:
import random
import numpy as np
import torch.nn as nn
import torch
import time
from torch import optim
import torch.nn.functional as F
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [122]:
def generate_equations(allowed_operators, dataset_size, min_value, max_value):
    """Generates pairs of equations and solutions to them.
    
       Each equation has a form of two integers with an operator in between.
       Each solution is an integer with the result of the operaion.
    
        allowed_operators: list of strings, allowed operators.
        dataset_size: an integer, number of equations to be generated.
        min_value: an integer, min value of each operand.
        max_value: an integer, max value of each operand.

        result: a list of tuples of strings (equation, solution).
    """
    sample = []
    for _ in range(dataset_size):
        left_operand = str(random.randint(min_value, max_value))
        right_operand = str(random.randint(min_value, max_value))
        operator = random.choice(allowed_operators)
        operation = left_operand+operator+right_operand
        sample.append((operation, str(eval(operation))))
    return sample

In [123]:
from sklearn.model_selection import train_test_split
allowed_operators = ['+', '-','*']
dataset_size = 100000
data = generate_equations(allowed_operators, dataset_size, min_value=0, max_value=10000)

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [124]:
train_set[0:10]

[('3254*1672', '5440688'),
 ('392-4831', '-4439'),
 ('1640+2270', '3910'),
 ('604-6924', '-6320'),
 ('6183-5508', '675'),
 ('3570*4747', '16946790'),
 ('545*7237', '3944165'),
 ('7994*1627', '13006238'),
 ('6407*3606', '23103642'),
 ('7627+9222', '16849')]

### Building vocabularies and tokenization function

We now need to build vocabularies that map strings to token ids and vice versa. We're gonna need these  when we feed training data into  the model or convert output matrices into words. 

Pay a close attention to the special characters you need to add for the vocabulary:


*    End of equation / solution token
*    Begining of equation / solution token
*    Padding token

Please note that in the exercise we do not need the <UNK> token



In [125]:
class Vocab():

  def __init__(self):
    self.chars = ['0','1','2','3','4','5','6','7','8','9','+','-','*','END','BEGIN','PAD']

    #build a vocabulary  string --> tokenId
    self.char_to_ix = { ch:i for i,ch in enumerate(self.chars) }

    #build a reverse vocabulary  tokenId --> string
    self.ix_to_char = { i:ch for i,ch in enumerate(self.chars) }

    self.eos_ix = self.char_to_ix['END']
    self.bos_ix = self.char_to_ix['BEGIN']
    self.pad_ix = self.char_to_ix['PAD']
    self.max_len = 13

  #build a tokenizer (i.e. a function which takes a string and returns tokenids)
  def tokenize(self, data):
    
    data = list(data)
    tokens = [self.char_to_ix[tok] for tok in data]
    tokens = [self.bos_ix] + tokens + [self.eos_ix]
    pads = []
    padLen = self.max_len - len(tokens)

    for p in range(padLen):
      pads.append(self.pad_ix)
    
    return torch.tensor(tokens + pads, dtype=torch.long, device=device).view(-1, 1)
  
  def __len__(self):
        return len(self.chars)

  def tensorsFromPair(self, pair):
    input_tensor = self.tokenize(pair[0])
    target_tensor = self.tokenize(pair[1])
    return (input_tensor, target_tensor)

vocab = Vocab()

### Encoder-decoder model

The code below contains a template for a simple encoder-decoder model: single GRU encoder/decoder.
**Please note that some places require change and your implementation.**

In [126]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [127]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### Training loop

Training encoder-decoder models isn't that different from any other models: sample batches, compute loss, backprop and update.

For training loss we will use ***torch.nn.NLLLoss*** please note that the loss should not be calculated with the padding token. (For ignoring specific labels please look at ***torch.nn.NLLLoss*** documentation 


In [128]:
teacher_forcing_ratio = 0.5
MAX_LENGTH = 13

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH, attention=False):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[vocab.bos_ix]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            
            if attention:
              decoder_output, decoder_hidden, decoder_attention = decoder(
                  decoder_input, decoder_hidden, encoder_outputs)
            
            else:
              decoder_output, decoder_hidden = decoder(
                  decoder_input, decoder_hidden)
              
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            
            if attention:
              decoder_output, decoder_hidden, decoder_attention = decoder(
                  decoder_input, decoder_hidden, encoder_outputs)
              
            else:
              decoder_output, decoder_hidden = decoder(
                  decoder_input, decoder_hidden)

            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == vocab.eos_ix:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

In [129]:
def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [130]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01, attention=False):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [vocab.tensorsFromPair(random.choice(train_set))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion, attention=attention)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

In [131]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH, attention=False):
    with torch.no_grad():
        input_tensor = vocab.tokenize(sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[vocab.bos_ix]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):

            if attention:
              decoder_output, decoder_hidden, decoder_attention = decoder(
                  decoder_input, decoder_hidden, encoder_outputs)
              decoder_attentions[di] = decoder_attention.data
              
            else:
              decoder_output, decoder_hidden = decoder(
                  decoder_input, decoder_hidden)
              
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == vocab.eos_ix:
                decoded_words.append('END')
                break
            else:
                decoded_words.append(vocab.ix_to_char[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

In [132]:
def evaluateRandomly(encoder, decoder, n=10, attention=False):
    for i in range(n):
        pair = random.choice(test_set)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0],attention=attention)
        output_words = output_words[1:-1]
        output_sentence = ''.join(output_words)
        print('<', output_sentence)
        print('')

In [133]:
hidden_size = 256
encoder1 = EncoderRNN(len(vocab), hidden_size).to(device)
decoder1 = DecoderRNN(hidden_size, len(vocab)).to(device)

trainIters(encoder1, decoder1, 90000, print_every=5000)

1m 18s (- 22m 16s) (5000 5%) 1.0297
2m 33s (- 20m 26s) (10000 11%) 1.0133
3m 48s (- 19m 2s) (15000 16%) 1.0023
5m 3s (- 17m 43s) (20000 22%) 0.9830
6m 20s (- 16m 29s) (25000 27%) 0.9689
7m 37s (- 15m 15s) (30000 33%) 0.9413
8m 54s (- 14m 0s) (35000 38%) 0.9261
10m 10s (- 12m 43s) (40000 44%) 0.9078
11m 26s (- 11m 26s) (45000 50%) 0.9022
12m 42s (- 10m 10s) (50000 55%) 0.8917
13m 58s (- 8m 53s) (55000 61%) 0.8775
15m 13s (- 7m 36s) (60000 66%) 0.8685
16m 29s (- 6m 20s) (65000 72%) 0.8663
17m 44s (- 5m 4s) (70000 77%) 0.8678
19m 0s (- 3m 48s) (75000 83%) 0.8527
20m 15s (- 2m 31s) (80000 88%) 0.8470
21m 31s (- 1m 15s) (85000 94%) 0.8374
22m 47s (- 0m 0s) (90000 100%) 0.8353


In [134]:
evaluateRandomly(encoder1, decoder1)

> 6883+3527
= 10410
< 10138

> 3780+4380
= 8160
< 9988

> 6690+7299
= 13989
< 14395

> 8885+9846
= 18731
< 17878

> 9056+9473
= 18529
< 17758

> 9880+7500
= 17380
< 17758

> 4770-3946
= 824
< 208

> 1337+3664
= 5001
< 4988

> 9037*2303
= 20812211
< 11423959

> 1399*2948
= 4124252
< 4407588



## Adding Attention Layer
Here you will have to implement a layer that computes a simple additive attention:

Given encoder sequence $ h^e_0, h^e_1, h^e_2, ..., h^e_T$ and a single decoder state $h^d$,

* Compute logits with a 2-layer neural network
$$a_t = linear_{out}(tanh(linear_{e}(h^e_t) + linear_{d}(h_d)))$$
* Get probabilities from logits, 
$$ p_t = {{e ^ {a_t}} \over { \sum_\tau e^{a_\tau} }} $$

* Add up encoder states with probabilities to get __attention response__
$$ attn = \sum_t p_t \cdot h^e_t $$



In [135]:
#<implement attention layer>

class AdditiveAttention(torch.nn.Module):	 	 
    def __init__(self, encoder_dim=256, decoder_dim=256):	 	 
        super().__init__()	 	 

        self.encoder_dim = encoder_dim	 	 
        self.decoder_dim = decoder_dim	 	 
        self.v = torch.nn.Parameter(torch.rand(self.decoder_dim))	 	 
        self.W_1 = torch.nn.Linear(self.decoder_dim, self.decoder_dim)	 	 
        self.W_2 = torch.nn.Linear(self.encoder_dim, self.decoder_dim)	 	 

    def forward(self, query, values):	 	 
        weights = self._get_weights(query, values) 
        weights = torch.nn.functional.softmax(weights, dim=0)
        return weights @ values, weights 

    def _get_weights(self, 	 	 
      query, # [decoder_dim]	 	 
      values # [seq_length, encoder_dim]	 	 
    ):	
        query = query.repeat(values.size(0), 1) # [seq_length, decoder_dim]	 	 
        weights = self.W_1(query) + self.W_2(values) # [seq_length, decoder_dim]	 	 
        return torch.tanh(weights) @ self.v # [seq_length]

In [136]:
#<add attention layer for your seq2seq model>

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        self.attention = AdditiveAttention()
        self.dropout_p = dropout_p
        self.dropout = nn.Dropout(self.dropout_p)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # compute context vector using attention mechanism
        query = hidden.squeeze()
        context, attn_probs = self.attention(
            query=query, values=encoder_outputs)
        embedded = embedded.to(device)
        embedded = embedded.squeeze().unsqueeze(0)
        context = context.unsqueeze(0)
        output = torch.cat((embedded, context), 1)
        output = self.attn_combine(output)
        output = F.relu(output)
        output, hidden = self.gru(output.unsqueeze(0), hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden, attn_probs

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [137]:
#<Train the two models (with/without attention and compare the results)

hidden_size = 256
encoder2 = EncoderRNN(len(vocab), hidden_size).to(device)
decoder2 = AttnDecoderRNN(hidden_size, len(vocab)).to(device)

trainIters(encoder2, decoder2, 90000, print_every=5000, attention=True)

2m 3s (- 34m 57s) (5000 5%) 1.0301
4m 3s (- 32m 31s) (10000 11%) 1.0089
6m 3s (- 30m 19s) (15000 16%) 0.9755
8m 4s (- 28m 17s) (20000 22%) 0.9402
10m 5s (- 26m 14s) (25000 27%) 0.9280
12m 7s (- 24m 14s) (30000 33%) 0.9093
14m 9s (- 22m 15s) (35000 38%) 0.8841
16m 10s (- 20m 13s) (40000 44%) 0.8656
18m 11s (- 18m 11s) (45000 50%) 0.8550
20m 12s (- 16m 10s) (50000 55%) 0.8446
22m 13s (- 14m 8s) (55000 61%) 0.8371
24m 15s (- 12m 7s) (60000 66%) 0.8407
26m 16s (- 10m 6s) (65000 72%) 0.8295
28m 17s (- 8m 4s) (70000 77%) 0.8299
30m 17s (- 6m 3s) (75000 83%) 0.8145
32m 18s (- 4m 2s) (80000 88%) 0.8220
34m 19s (- 2m 1s) (85000 94%) 0.8128
36m 19s (- 0m 0s) (90000 100%) 0.8039


In [138]:
evaluateRandomly(encoder2, decoder2, attention=True)

> 8365+6290
= 14655
< 14399

> 7367*850
= 6261950
< 6630880

> 9356-797
= 8559
< 8889

> 1917+7788
= 9705
< 9036

> 6783+1076
= 7859
< 7768

> 454*9207
= 4179978
< 4330088

> 6857-6053
= 804
< 166

> 8382*2620
= 21960840
< 20000800

> 8804*1283
= 11295532
< 12896998

> 6125*4505
= 27593125
< 20055680



## Good Luck!
Thanks :) We can see some improvement with using attention