<a href="https://colab.research.google.com/github/denocris/MHPC-DeepLearning-Lectures/blob/master/lec3_char_based_check_corrector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Seq2seq Spell-Checker (with Attention) </center>

Some refs about RNN, LSTM and GRU details

* [RNN math](https://hackernoon.com/tutorial-3-what-is-seq2seq-for-text-summarization-and-why-68ebaa644db0)

* [LSTM and GRU math](https://hackernoon.com/multilayer-bidirectional-lstm-gru-for-text-summarization-made-easy-tutorial-4-a63db108b44f)

* [more about LSTM and GRU](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

* [Attention Mechanism](https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f)

  
  <center>  <img src=https://drive.google.com/uc?id=1Cg9zObxtfGvsf3zzuk--2UTiCpxIbUDd width="700">  </center>
  

  

### What are we going to learn?


* Some theory behind RNN
  
* Seq2Seq Encoder Decoder architecture

* Data Preprocessing and Text Preparation
  
* Attention Mechanism

In [0]:
!pip install tensorflow-gpu==2.0.0-alpha0
!pip uninstall -y pytest
!pip install pytest
#

Collecting tensorflow-gpu==2.0.0-alpha0
  Downloading https://files.pythonhosted.org/packages/1a/66/32cffad095253219d53f6b6c2a436637bbe45ac4e7be0244557210dc3918/tensorflow_gpu-2.0.0a0-cp36-cp36m-manylinux1_x86_64.whl (332.1MB)
Collecting tb-nightly<1.14.0a20190302,>=1.14.0a20190301 (from tensorflow-gpu==2.0.0-alpha0)
  Downloading https://files.pythonhosted.org/packages/a9/51/aa1d756644bf4624c03844115e4ac4058eff77acd786b26315f051a4b195/tb_nightly-1.14.0a20190301-py3-none-any.whl (3.0MB)
Collecting tf-estimator-nightly<1.14.0.dev2019030116,>=1.14.0.dev2019030115 (from tensorflow-gpu==2.0.0-alpha0)
  Downloading https://files.pythonhosted.org/packages/13/82/f16063b4eed210dc2ab057930ac1da4fbe1e91b7b051a6c8370b401e6ae7/tf_estimator_nightly-1.14.0.dev2019030115-py2.py3-none-any.whl (411kB)

## Some Theory 

### Recurrent Neural Networks

* **convolutional neural networks (CNN)** can efficiently process spatial information

*  **recurrent neural networks (RNN)** are designed to better handle sequential information. 

The latter introduce state variables (**hidden_state**) that store past information and, together with the current input, determine the current output.

<center>  <img src=https://drive.google.com/uc?id=18kJbtRzVhGaSJxyjOGd78dC3GyN29FHl width="700">  </center>
  
  
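We describe the process of carrying memory forward mathematically by

$$h_t = \phi( W_{hx} x_t + W_{hh} h_{t-1} + b),$$

where $x_t$ is the input at time step $t$, $h_{t-1}$ is the previous hidden state, and $\phi$ is a nonlinear activation such as $\tanh$.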
  
The weight matrices $W_{hx}$ and $W_{hh}$ act as filters that determine how much importance to give to the present input and to the past hidden state. The error they produce is propagated back through the network via backpropagation and used to adjust the weights until the loss stops decreasing.
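
As a minimal sketch of the recurrence above, the following NumPy snippet unrolls a single-layer RNN over a toy sequence. It assumes $\phi = \tanh$; the sizes, the random weights and the helper name `rnn_step` are illustrative choices and are not part of the notebook.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, b):
    """One recurrence step: h_t = tanh(W_hx x_t + W_hh h_prev + b)."""
    return np.tanh(W_hx @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 3, 5   # illustrative sizes

# Untrained random weights, only meant to show the shapes involved
W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                    # initial hidden state
states = []
for x_t in xs:
    # the same weights are reused at every time step
    h = rnn_step(x_t, h, W_hx, W_hh, b)
    states.append(h)

states = np.stack(states)
print(states.shape)  # (5, 3): one hidden state per time step
```

The hidden state `h` is the only thing carried from one step to the next, which is what lets the network keep a memory of earlier inputs.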

### Seq2seq Encoder-Decoder 

Why a sequence to sequence structure?

**“Je ne suis pas le chat noir” → “I am not the black cat”**

Most of the words in the input sentence have a direct translation in the output sentence, but are in slightly different orders, e.g. “chat noir” and “black cat”. Because of the “ne/pas” construction there is also one more word in the input sentence. It would be difficult to produce a correct translation directly from the sequence of input words.

As humans, we read the entire source sentence, understand its meaning, and then produce a translation. A Seq2seq Encoder-Decoder (as used in Neural Machine Translation, NMT) mimics that!

<center>  <img src=https://drive.google.com/uc?id=1NvHVcWBmB1FIOoHgbDdWUPgYQ0s0sRsP  width="700">  </center> 
  
An encoder converts a source sentence into a meaning vector which is passed through a decoder to produce a translation.
  
  
  <center>  <img src=  https://drive.google.com/uc?id=1P8vlcD4CDPCNFJUk6K-5elVW7isnGJA3 width="400">  </center> 
  
At the bottom layer, the encoder and decoder RNNs receive as input the following: first, the source sentence, then a boundary marker $<s>$ which indicates the transition from the encoding to the decoding mode, and the target sentence. For training, we will feed the system with the following tensors, which are in time-major format and contain word indices:

* **encoder_inputs** [max_encoder_time, batch_size]: source input words.
*  **decoder_inputs** [max_decoder_time, batch_size]: target input words.
*  **decoder_outputs** [max_decoder_time, batch_size]: target output words, these are decoder_inputs shifted to the left by one time step with an end-of-sentence tag appended on the right.
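
The relation between **decoder_inputs** and **decoder_outputs** can be sketched on a single toy example. The vocabulary and ids below are hypothetical, and the sketch works on characters with `<` and `>` as start and end markers (as the rest of this notebook does) rather than on words.

```python
# Hypothetical toy vocabulary; the ids are made up for illustration
vocab = {'<': 1, '>': 2, 'c': 3, 'i': 4, 'a': 5, 'o': 6}

def encode(text):
    """Map each character of `text` to its integer id."""
    return [vocab[ch] for ch in text]

target = 'ciao'
decoder_inputs = encode('<' + target)    # start marker, then the target
decoder_outputs = encode(target + '>')   # the target, then the end marker

print(decoder_inputs)   # [1, 3, 4, 5, 6]
print(decoder_outputs)  # [3, 4, 5, 6, 2]

# decoder_outputs is decoder_inputs shifted left by one step,
# with the end marker appended on the right
assert decoder_outputs[:-1] == decoder_inputs[1:]
```

At each time step the decoder reads one element of `decoder_inputs` and is trained to predict the element of `decoder_outputs` at the same position.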



In [0]:
from __future__ import absolute_import, division, print_function

import tensorflow as tf

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time

In [0]:
print(tf.__version__)

2.0.0-alpha0


## Dataset and Text Preprocessing

Each line of the dataset holds a tab-separated pair: the *correct sentence* followed by its *misspelled version*.


In [0]:
!head -n 10 noisy_sentences_v3_50k.txt

non vado a costruire edifici in islanda	non vado a costruire edifici in islanda 
sembrate molto occupati	esmbrate mojto occupati 
dove andranno domani in corsica	dvoe andranno aomani in corspca 
perché non lo chiedi a tom	perché non lo cihedi a tom 
non schiacciare un riccio col piede nudo	non schiacciare un ricciio col ipede nudo 
state tutti in fila indiana	statee tutti in ifla indiana 
tom è un giovane promettente attore	tom è un givoane promettentte attorx 
a me non piacerebbe essere un giudice	a me non piacerebbe esseere un tiudice 
voi siete così anziani	voi siete csoì anziani 
non siete ancora arrivati a metà	non siette ancora arrivati a mevà 


In [0]:
! wc -l noisy_sentences_v3_50k.txt

### Text Preprocessing

In [0]:
path_to_file = './noisy_sentences_v3_50k.txt'

In [0]:
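# Unicode helpers. They are currently unused: preprocess_sentence keeps accented characters as they are.
# Older version: with ACCENT=False every combining mark (category 'Mn') is stripped.
def unicode_to_ascii_old(s, ACCENT = True):

    if ACCENT:
      # NOTE: these are Unicode character *names*, so the membership test below never matches a character
      accents=('COMBINING GRAVE TONE MARK', 'COMBINING ACUTE TONE MARK')
      #accents = set(map(unicodedata.lookup, accents))
      return ''.join(c for c in unicodedata.normalize('NFD', s) if c not in accents )
    else:      
      return ''.join(c for c in unicodedata.normalize('NFD', s)
          if unicodedata.category(c) != 'Mn')
    
# NOTE: the ACCENT argument is not used here. NFD decomposes 'è' into 'e' + COMBINING GRAVE ACCENT (U+0300),
# not the TONE MARK characters looked up below, so in practice this returns the NFD form with accents left in place.
def unicode_to_ascii(s, ACCENT = True):
    accents=('COMBINING GRAVE TONE MARK', 'COMBINING ACUTE TONE MARK')
    accents = set(map(unicodedata.lookup, accents))
    return ''.join(c for c in unicodedata.normalize('NFD', s) if c not in accents )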



def preprocess_sentence(w, ACCENT = True):
    #w = unicode_to_ascii(w.lower().strip())
    w = w.lower().strip()
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    if ACCENT:
      accs='àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸ'
      # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",". "'") and accents in accs
      w = re.sub(r"[^a-zA-Z?.!,'¿"+ accs +"]+", " ", w)
    else:
      # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",". "'") 
      w = re.sub(r"[^a-zA-Z?.!,'¿]+", " ", w)
      
    w = w.strip()
    
    # adding a start ('<') and an end ('>') marker to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<' + w + '>'
    return w
  
#preprocess_sentence('à è é à ù')

In [0]:
# 1. Read the tab-separated file
# 2. Clean each sentence with preprocess_sentence (accents are kept)
# 3. Return sentence pairs in the format: (clean sentences, noisy sentences)
def create_dataset(path, num_examples):
  with io.open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')

  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
  # * is used to unzip the list
  return zip(*word_pairs)




In [0]:
cln, nsy = create_dataset(path_to_file, None)

for i in range(50):
  print(cln[i], nsy[i])

<non vado a costruire edifici in islanda> <non vado a costruire edifici in islanda>
<sembrate molto occupati> <esmbrate mojto occupati>
<dove andranno domani in corsica> <dvoe andranno aomani in corspca>
<perché non lo chiedi a tom> <perché non lo cihedi a tom>
<non schiacciare un riccio col piede nudo> <non schiacciare un ricciio col ipede nudo>
<state tutti in fila indiana> <statee tutti in ifla indiana>
<tom è un giovane promettente attore> <tom è un givoane promettentte attorx>
<a me non piacerebbe essere un giudice> <a me non piacerebbe esseere un tiudice>
<voi siete così anziani> <voi siete csoì anziani>
<non siete ancora arrivati a metà> <non siette ancora arrivati a mevà>
<sei fantastica> <sei fantastics>
<tom è davvero gentile> <tom è davvero gentlle>
<bisogna cogliere le buone occasioni come vengono> <bisognaa cogliere le bunoe occasioni come vengono>
<è la mia risposta definitiva> <è la mia risposta definntiva>
<quanto dovrebbero essere pagati> <quanto dovrebbero sesere paga

### Tokenization

In [0]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='', char_level=True)
  #keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

  # fit_on_texts: with char_level=True this method creates the vocabulary index based on character frequency.
  # Every character gets a unique integer value, and a lower integer means a more frequent character
  lang_tokenizer.fit_on_texts(lang)
  #print( lang_tokenizer)
  tensor = lang_tokenizer.texts_to_sequences(lang)
  #print( tensor)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')
  
  return tensor, lang_tokenizer


def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    cln, nsy = create_dataset(path, num_examples)
    print(cln)  
    print(nsy) 
    print('------------------')
    input_tensor, inp_lang_tokenizer = tokenize(nsy)
    target_tensor, targ_lang_tokenizer = tokenize(cln)
    print(target_tensor, targ_lang_tokenizer)   # each sentence becomes a row of character ids, zero-padded on the right
    print( input_tensor, inp_lang_tokenizer)  # same encoding for the noisy sentences, with their own tokenizer
    print('------------------')
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
  
load_dataset(path_to_file, 2)

# every character is transformed into a number

('<non vado a costruire edifici in islanda>', '<sembrate molto occupati>')
('<non vado a costruire edifici in islanda>', '<esmbrate mojto occupati>')
------------------
[[12  5  3  5  1 17  4  9  3  1  4  1  6  3 10  7 11 13  2 11  8  1  8  9
   2 18  2  6  2  1  2  5  1  2 10 14  4  5  9  4 15]
 [12 10  8 16 19 11  4  7  8  1 16  3 14  7  3  1  3  6  6 13 20  4  7  2
  15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]] <keras_preprocessing.text.Tokenizer object at 0x7f4b979ea1d0>
[[12  5  3  5  1 16  4  9  3  1  4  1  6  3 10  7 11 13  2 11  8  1  8  9
   2 17  2  6  2  1  2  5  1  2 10 18  4  5  9  4 14]
 [12  8 10 15 19 11  4  7  8  1 15  3 20  7  3  1  3  6  6 13 21  4  7  2
  14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]] <keras_preprocessing.text.Tokenizer object at 0x7f4b979ea2b0>
------------------


(array([[12,  5,  3,  5,  1, 16,  4,  9,  3,  1,  4,  1,  6,  3, 10,  7,
         11, 13,  2, 11,  8,  1,  8,  9,  2, 17,  2,  6,  2,  1,  2,  5,
          1,  2, 10, 18,  4,  5,  9,  4, 14],
        [12,  8, 10, 15, 19, 11,  4,  7,  8,  1, 15,  3, 20,  7,  3,  1,
          3,  6,  6, 13, 21,  4,  7,  2, 14,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32),
 array([[12,  5,  3,  5,  1, 17,  4,  9,  3,  1,  4,  1,  6,  3, 10,  7,
         11, 13,  2, 11,  8,  1,  8,  9,  2, 18,  2,  6,  2,  1,  2,  5,
          1,  2, 10, 14,  4,  5,  9,  4, 15],
        [12, 10,  8, 16, 19, 11,  4,  7,  8,  1, 16,  3, 14,  7,  3,  1,
          3,  6,  6, 13, 20,  4,  7,  2, 15,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32),
 <keras_preprocessing.text.Tokenizer at 0x7f4b979ea2b0>,
 <keras_preprocessing.text.Tokenizer at 0x7f4b979ea1d0>)

In [0]:

def max_length(tensor):
    return max(len(t) for t in tensor)

# Try experimenting with the size of that dataset
num_examples = 10000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

('<non vado a costruire edifici in islanda>', '<sembrate molto occupati>', '<dove andranno domani in corsica>', '<perché non lo chiedi a tom>', '<non schiacciare un riccio col piede nudo>', '<state tutti in fila indiana>', '<tom è un giovane promettente attore>', '<a me non piacerebbe essere un giudice>', '<voi siete così anziani>', '<non siete ancora arrivati a metà>', '<sei fantastica>', '<tom è davvero gentile>', '<bisogna cogliere le buone occasioni come vengono>', '<è la mia risposta definitiva>', '<quanto dovrebbero essere pagati>', '<ci sono molti turisti a roma>', '<avreste dovuto baciarla>', '<lei soffre di una malattia contagiosa>', "<preferisce l'aragosta o il granchio>", '<tom ricevette un messaggio di testo>', '<stasera farà un rapporto>', "<cos'è successo a tom a boston>", '<quanto spesso vengono qua tom e mary>', '<tom non aveva una fidanzata>', '<lui era abituato a frequentare casa mia>', '<non siete ancora arrabbiati con tom>', '<tom sta cercando di confonderti>', '<di

In [0]:
# Creating training and validation sets using a 95-5 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.05)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

train_size = len(input_tensor_train)
val_size = len(input_tensor_val)

In [0]:
def convert(lang, tensor):
  for t in tensor:
    if t!=0:
      print ("%d ----> %s" % (t, lang.index_word[t]))
      
def convert_from_tensor(lang, tensor):
    #print(tensor)
    ret = [lang.index_word[t] for t in tensor if t!=0]
    ret = ''.join(ret)
    return ret
  

print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

convert_from_tensor(inp_lang, input_tensor_train[1])


Input Language; index to word mapping
12 ----> <
3 ----> o
8 ----> r
2 ----> a
1 ---->  
16 ----> d
4 ----> e
18 ----> v
3 ----> o
1 ---->  
15 ----> u
8 ----> r
3 ----> o
18 ----> v
2 ----> a
8 ----> r
10 ----> l
5 ----> i
13 ----> >

Target Language; index to word mapping
12 ----> <
3 ----> o
8 ----> r
2 ----> a
1 ---->  
16 ----> d
4 ----> e
18 ----> v
3 ----> o
1 ---->  
7 ----> t
8 ----> r
3 ----> o
18 ----> v
2 ----> a
8 ----> r
10 ----> l
5 ----> i
13 ----> >


'<noi siamo uomyni>'

In [0]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 16
units = 128
drop_rate = 0.2
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

dataset_val = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val))
dataset_val = dataset_val.batch(1, drop_remainder=True)


In [0]:
for (inp, targ) in dataset_val.take(5):
  #print(inp[0],targ[0])
  print(convert_from_tensor(inp_lang, np.array(inp[0])))
  print(convert_from_tensor(targ_lang, np.array(targ[0])))
  print('-------')


<la maggior parte dei cestisti sonoo apti>
<la maggior parte dei cestisti sono alti>
-------
<non so che ocsa faremo>
<non so che cosa faremo>
-------
<non vanno a costruire fabbriche in ltiuania>
<non vanno a costruire fabbriche in lituania>
-------
<era davveor buono>
<era davvero buono>
-------
<vi piace il circoo bulgaro>
<vi piace il circo bulgaro>
-------


In [0]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 61]), TensorShape([64, 58]))

In [0]:
example_input_val, example_target_val = next(iter(dataset_val))
example_input_val.shape, example_target_val.shape

(TensorShape([1, 61]), TensorShape([1, 58]))

In [0]:
convert_from_tensor(inp_lang, np.array(example_input_batch[0]))

'<syete mai statx in australia>'

### Encoder

The encoder of a seq2seq network is a GRU that processes the input sentence one token at a time (here, one character at a time). For every input token the encoder outputs a vector and a hidden state, and it uses the hidden state for the next input token.

In [0]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz, drop_rate):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.drop_rate = drop_rate
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    # return_sequences=True returns the full sequence and not just the last output in the output sequence
    # return_state=True returns the last state in addition to the output
    self.gru = tf.keras.layers.GRU(self.enc_units, 
                                   return_sequences=True, 
                                   return_state=True, 
                                   dropout=self.drop_rate,
                                   recurrent_initializer='glorot_uniform')
#     self.gruCell = tf.keras.layers.GRUCell(self.enc_units, 
#                                    #return_sequences=True, 
#                                    #return_state=True, 
#                                    dropout=self.drop_rate,
#                                    recurrent_initializer='glorot_uniform')
    
    #sizes = np.array([self.enc_units, self.enc_units])
    sizes = np.array([self.enc_units])
#     self.gruCells = [tf.keras.layers.GRUCell(size, dropout=self.drop_rate,
#                                    recurrent_initializer='glorot_uniform') for size in sizes]
#     self.rnn_gru_cells = tf.keras.layers.RNN(self.gruCells, return_sequences=True,
#                                     return_state=True)
#     self.stack_gru_cells = tf.keras.layers.StackedRNNCells(self.gruCells)
  
  
  def call(self, x, hidden):
    #print('x', x.shape)
    x = self.embedding(x)
    #print('emb(x)', x.shape)
    #print('hidden', hidden.shape)

    output, state = self.gru(x, initial_state = hidden)
    #print(output.shape)
    #print(state.shape)
    return output, state
  
#   def call(self, x, hidden):
#     print('x', x.shape)
#     x = self.embedding(x)
#     print('emb(x)', x.shape)
#     print('hidden', hidden.shape)
#     #cells = self.gruCells
#     #x = tf.keras.layers.StackedRNNCells(cells)(inputs)
#     #print(self.stack_gru_cells(x, initial_state=None))
#     x, y= self.stack_gru_cells(x, initial_state=None)
#     #print('out', state.shape)
#     return x, y

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
    #return tf.zeros(list(([64, 128])))

In [0]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE, drop_rate)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
#x, y = encoder(example_input_batch, sample_hidden)
#sample_output = encoder(example_input_batch, sample_hidden)
sample_output.shape, sample_hidden.shape
#print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
#print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

(TensorShape([64, 61, 128]), TensorShape([64, 128]))

### Simple Decoder (no Attention)

In the simplest seq2seq decoder we use only the last output of the encoder. This last output is sometimes called the **context vector** as it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and a hidden state. The initial input token is the start-of-string $<SOS>$ token (the `<` character in this notebook), and the first hidden state is the context vector (the encoder’s last hidden state).
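
The decoding loop described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it uses a plain $\tanh$ recurrent cell instead of a GRU, random untrained weights, and a hypothetical five-symbol vocabulary, so the decoded ids carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ['<pad>', '<', '>', 'a', 'b']      # hypothetical toy vocabulary
START, END = vocab.index('<'), vocab.index('>')
emb_dim, units, max_len = 4, 6, 10         # illustrative sizes

# Randomly initialised (untrained) parameters, for illustration only
E = rng.normal(size=(len(vocab), emb_dim))          # embedding table
W_x = rng.normal(scale=0.5, size=(units, emb_dim))  # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(units, units))    # hidden-to-hidden weights
W_o = rng.normal(size=(len(vocab), units))          # hidden-to-vocabulary logits

def decode_greedy(context_vector):
    hidden = context_vector        # the context vector is the first hidden state
    token = START                  # the first input is the start marker
    output_ids = []
    for _ in range(max_len):
        hidden = np.tanh(W_x @ E[token] + W_h @ hidden)
        token = int(np.argmax(W_o @ hidden))   # greedy choice of the next token
        output_ids.append(token)
        if token == END:           # stop as soon as the end marker is produced
            break
    return output_ids

context = rng.normal(size=units)   # stands in for the encoder's last hidden state
ids = decode_greedy(context)
print(ids)
```

Note that the encoder influences the decoder only through `context`: everything the decoder knows about the input has to fit into that one vector.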

### Attention Mechanism

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence. Attention allows the decoder network to “focus” on a different part of the encoder’s outputs at every step of the decoder’s own outputs. The result is a more specific context vector that, through a weighted average, takes all the encoder states into account.

  <center>  <img src= https://drive.google.com/uc?id=16f2JfE_u4OAAsL81J6sM7Qe60e57w5Jv width="700">  </center> 

The *Attention Mechanism* is a way to deal with the *alignment problem* (see figure above). In simple words, it suggests to the decoder, at each step, which words of the input sentence are most relevant.

In the example in the figure, given that English and French are fairly well-aligned languages, you can see that the decoder attends to the input mostly sequentially, except for the phrase “European Economic Zone”, which gets translated to “zone économique européenne”.

Each decoder output now depends not just on the last decoder state, but on a weighted combination of all the input states.



In [0]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
  
  def call(self, query, values):
    # query: dec_hidden (h_t)
    # values: enc_output (\bar(h_s))
    
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying self.V to the score
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights

 <center> 
  <p>
  <img src=https://drive.google.com/uc?id=1XEC7N3DPhH9yhxNhJ0sV-wtajkAlFr5l width="400">   
  <img src=https://drive.google.com/uc?id=1KMECLBmZ4nQcpir9FMbQAr5UzFX2pn7Y width="500"> 
   </p>
  </center> 
   
The attention weight $\alpha_{ts}$ measures how well each encoder output $\bar{h}_s$ matches the current decoder state $h_t$. The attention weights change at each decoder step, and the model learns to “focus” on the relevant parts of the input.
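
To make the shapes concrete, here is a minimal NumPy sketch of the same additive (Bahdanau-style) computation. It is a simplification under stated assumptions: the random matrices `W1`, `W2` and `v` stand in for the trained `Dense` layers (biases are omitted), and the sizes are illustrative.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, src_len, hidden, att_units = 2, 5, 8, 4   # illustrative sizes

enc_output = rng.normal(size=(batch, src_len, hidden))  # encoder outputs \bar{h}_s
dec_hidden = rng.normal(size=(batch, hidden))           # decoder state h_t

# Untrained random projections standing in for the Dense layers W1, W2 and V
W1 = rng.normal(size=(hidden, att_units))
W2 = rng.normal(size=(hidden, att_units))
v = rng.normal(size=(att_units, 1))

# score shape: (batch, src_len, 1); the decoder state is broadcast over source positions
score = np.tanh(enc_output @ W1 + (dec_hidden @ W2)[:, None, :]) @ v
# attention weights: softmax over the source positions (axis 1)
alpha = softmax(score, axis=1)
# context vector: weighted sum of the encoder outputs, shape (batch, hidden)
context = (alpha * enc_output).sum(axis=1)

print(alpha.shape, context.shape)  # (2, 5, 1) (2, 8)
```

Because the softmax runs over the source axis, the weights for each example sum to one, so the context vector is a weighted average of the encoder outputs.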

In [0]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (64, 128)
Attention weights shape: (batch_size, sequence_length, 1) (64, 61, 1)


### Decoder

In [0]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz, drop_rate):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.drop_rate = drop_rate
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units, 
                                   return_sequences=True, 
                                   return_state=True, 
                                   dropout = self.drop_rate,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)
    ###### IF NO ATTENTION, hidden goes directly in self.gru() without passing through self.attention()
    ###### output, state = self.gru(x,hidden)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state , attention_weights

In [0]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE, drop_rate)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)), 
                                      sample_hidden, sample_output)


print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (64, 37)


### Loss Function and Optimizer

In [0]:
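optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
# reduction='none' keeps one loss value per example, so that the padding mask in loss_function can be applied
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')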

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask
  
  return tf.reduce_mean(loss_)

In [0]:
checkpoint_dir = './training_checkpoints_test'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [0]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0
        
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<']] * BATCH_SIZE, 1)       

    # Teacher forcing - feeding the target as the next input:
    # at time step t the decoder receives the ground-truth token from step t-1 instead of its own previous prediction.
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      # NOTE: without attention the decoder would simply receive (dec_input, dec_hidden) as inputs and not enc_output
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)


      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))
  
  return batch_loss

In [0]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    
    sentence = preprocess_sentence(sentence)

    #inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = [inp_lang.word_index[i] for i in list(sentence)] 

    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], 
                                                           maxlen=max_length_inp, 
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<']], 0)
    
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, 
                                                             dec_hidden, 
                                                             enc_out)

        
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        if predicted_id!=0:
          result += targ_lang.index_word[predicted_id] #+ ' '

        if predicted_id!=0 and targ_lang.index_word[predicted_id] == '>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence , attention_plot
  
  
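import matplotlib.ticker as ticker

# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    # one tick per row/column, so that each label lines up with its cell
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()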
    
    
def translate(sentence, SHOW_PLOT=False):
    result, sentence, attention_plot = evaluate(sentence)
        
    print('Input: %s' % sentence)
    print('Predicted translation: {}'.format(result))
    
    if SHOW_PLOT: 
      # the model works at character level, so rows and columns of the plot are characters
      attention_plot = attention_plot[:len(result), :len(sentence)]
      plot_attention(attention_plot, list(sentence), list(result))

In [0]:
for (inp, targ) in dataset_val.take(2):
  #print(inp[0],targ[0])
  print(convert_from_tensor(inp_lang, np.array(inp[0])), convert_from_tensor(targ_lang, np.array(targ[0])))
  print('-------')

<la maggior parte dei cestisti sonoo apti> <la maggior parte dei cestisti sono alti>
-------
<non so che ocsa faremo> <non so che cosa faremo>
-------


In [0]:
def pic_one_validation_sentence(num):
  # Randomly pick one sentence pair among the first `num` examples of the validation dataset
  rnd_idx = np.random.choice(num)
  val_sent = [(convert_from_tensor(inp_lang, np.array(i[0])),convert_from_tensor(targ_lang, np.array(t[0]))) for (i,t) in dataset_val.take(num)]
  noisy, clean = val_sent[rnd_idx]
  noisy, clean = noisy[1:-1], clean[1:-1]
  return noisy, clean

#noisy, clean = valSentence(5)
#print(noisy, clean)

# despite its name, this returns 1 when the two words differ (one error) and 0 when they match
stringMatch = lambda s1,s2: 1 if s1 != s2 else 0


def matchSentences(s1, s2):
  ws1 = s1.split()
  ws2 = s2.split()
  ws = list(zip(ws1,ws2))
  
  if ws: 
    # list not empty
    errors = [stringMatch(w1,w2) for (w1,w2) in ws] 
    # if ws1 and ws2 have different lengths, count the extra words as errors
    errors = sum(errors) + abs(len(ws1) - len(ws2)) 
  else:
    # empty list
    errors = max(len(ws1),len(ws2))
  
  return errors

def computeErrorsInValidationSet(num=100):
  val_sent = [(convert_from_tensor(inp_lang, np.array(i[0]))[1:-1],convert_from_tensor(targ_lang, np.array(t[0]))[1:-1]) for (i,t) in dataset_val.shuffle(val_size).take(num) ]
  
  tot_errors = 0
  tot_len_sent = 0
  tot_correct = 0
  for nsy,cln in val_sent:
    err = matchSentences(nsy,cln)
    if err == 0:
      tot_correct += 1
    tot_errors += err
    tot_len_sent += len(cln.split())
  
  return tot_errors, tot_errors/float(tot_len_sent), tot_correct

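def evaluateValidationSet(num=100):
  # word errors left in the model output (first element returned by evaluate)
  # returns (total errors, errors per target word, number of error-free sentences)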
  val_sent = [(evaluate(convert_from_tensor(inp_lang, np.array(i[0]))[1:-1])[0],convert_from_tensor(targ_lang, np.array(t[0]))[1:-1]) for (i,t) in dataset_val.shuffle(val_size).take(num) ]
  
  tot_errors = 0
  tot_len_sent = 0
  tot_correct = 0
  for nsy,cln in val_sent:
    err = matchSentences(nsy,cln)
    if err == 0:
      tot_correct += 1
    tot_errors += err
    tot_len_sent += len(cln.split())
  
  return tot_errors, tot_errors/float(tot_len_sent), tot_correct
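
To make the word-level metric concrete, here is a minimal, self-contained sketch of the same counting logic as `matchSentences`, applied to two validation pairs shown earlier. The name `count_word_errors` is hypothetical and not part of the notebook.

```python
def count_word_errors(predicted, target):
    """Count position-wise word mismatches plus the difference in word count."""
    pred_words = predicted.split()
    targ_words = target.split()
    # zip stops at the shorter sentence, so unmatched trailing words are
    # accounted for separately through the length difference.
    mismatches = sum(1 for p, t in zip(pred_words, targ_words) if p != t)
    return mismatches + abs(len(pred_words) - len(targ_words))


pairs = [
    ("la maggior parte dei cestisti sonoo apti",
     "la maggior parte dei cestisti sono alti"),
    ("non so che ocsa faremo", "non so che cosa faremo"),
]
for noisy, clean in pairs:
    # prints the number of wrong words and the length of the clean sentence
    print(count_word_errors(noisy, clean), len(clean.split()))
```

Because words are compared by position, a single dropped or inserted word shifts every following word and is counted as several errors, so this metric is rather strict on outputs whose word boundaries change.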

In [0]:
val_sent = [(convert_from_tensor(inp_lang, np.array(i[0]))[1:-1],convert_from_tensor(targ_lang, np.array(t[0]))[1:-1]) for (i,t) in dataset_val.take(4)]

evaluation_val_test = [(evaluate(convert_from_tensor(inp_lang, np.array(i[0]))[1:-1])[0],convert_from_tensor(targ_lang, np.array(t[0]))[1:-1]) for (i,t) in dataset_val.take(4)]

for nsy,cln in val_sent:
  print(matchSentences(nsy,cln), len(cln.split()))
  
print(val_sent)
print(evaluation_val_test)

for out,cln in evaluation_val_test:
  print(matchSentences(out,cln), len(cln.split()))

2 7
1 5
1 7
1 3
[('la maggior parte dei cestisti sonoo apti', 'la maggior parte dei cestisti sono alti'), ('non so che ocsa faremo', 'non so che cosa faremo'), ('non vanno a costruire fabbriche in ltiuania', 'non vanno a costruire fabbriche in lituania'), ('era davveor buono', 'era davvero buono')]
[('vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vx', 'la maggior parte dei cestisti sono alti'), ('vatceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 'non so che cosa faremo'), ('vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vxi<vx', 'non vanno a costruire fabbriche in lituania'), ('vatceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 'era davvero buono')]
7 7
5 5
7 7
3 3


In [0]:
# print(convert_from_tensor(inp_lang, np.array(inp[0][0])), convert_from_tensor(targ_lang, np.array(inp[1][0])))
# print(convert_from_tensor(targ_lang, np.array(targ[0][0])), convert_from_tensor(targ_lang, np.array(targ[1][0])))


## Training

In [0]:
EPOCHS = 10

print('Input Errors: ', computeErrorsInValidationSet())

num_errors = []

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    #print('input shape', inp.shape)
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, 
                                                     batch,
                                                     batch_loss.numpy()))
    if batch == 1: #or (batch % 200) == 0:  
          # qualitative check: correct one validation sentence with the current weights
          noisy, clean = pic_one_validation_sentence(BATCH_SIZE)
          result, _, _ = evaluate(noisy)
          print('Input : %s' % (noisy))
          print('Cleaned: {}'.format(result))
          print('Target:  {}'.format(clean))
          
  tot_errors_epoch, ratio, tot_correct = evaluateValidationSet()
  num_errors.append(tot_errors_epoch)
  print('Output Errors: %d, %.3f, %d' %(tot_errors_epoch, ratio, tot_correct))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Input Errors:  (122, 0.22467771639042358, 25)
Epoch 1 Batch 0 Loss 1.9515
Input : ho bisognoo di una ciotola più grandee
Cleaned:  
Target:  ho bisogno di una ciotola più grande
Epoch 1 Batch 100 Loss 1.5908


KeyboardInterrupt: ignored

## Inference

In [0]:
sentence=u'doev andiamo a mangiare'
#list(sentence)
inputs = [inp_lang.word_index[char] for char in list(sentence)]

inputs

[14, 4, 5, 19, 24, 3, 8, 14, 6, 3, 13, 4, 24, 3, 24, 13, 3, 8, 18, 6, 3, 7, 5]
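
The lookup above maps every character of the sentence to an integer id. As a rough illustration, the sketch below builds a toy character vocabulary in plain Python and encodes a noisy sentence with it. `build_char_index`, `encode` and `decode` are hypothetical helpers, and the ids they produce are not expected to match those of the notebook's `inp_lang` tokenizer.

```python
def build_char_index(sentences):
    """Assign ids (starting at 1, 0 is kept for padding) in order of first appearance."""
    char_index = {}
    for sentence in sentences:
        for char in sentence:
            if char not in char_index:
                char_index[char] = len(char_index) + 1
    return char_index


def encode(sentence, char_index):
    """Map each character to its id; raises KeyError on unseen characters."""
    return [char_index[char] for char in sentence]


def decode(ids, char_index):
    """Invert the mapping, skipping the padding id 0."""
    index_char = {idx: char for char, idx in char_index.items()}
    return "".join(index_char[idx] for idx in ids if idx != 0)


# '<' and '>' play the role of the start and end markers used in the notebook
corpus = ["<dove andiamo a mangiare>"]
char_index = build_char_index(corpus)

ids = encode("<doev andiamo a mangiare>", char_index)
print(ids)
print(decode(ids + [0, 0], char_index))  # padding is ignored when decoding
```

A swapped pair of letters such as `doev` only changes the order of two ids, which is the kind of local perturbation a character-level encoder can observe directly.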

In [0]:
output, input_nsy, _ = evaluate(u'dove volgiamo andrae')
output, input_nsy

('dove voliamo andra>', '<dove volgiamo andrae>')

In [0]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
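    # place one tick per cell so that every label lines up with its row/column
    ax.set_xticks(range(len(sentence)))
    ax.set_yticks(range(len(predicted_sentence)))
    ax.set_xticklabels(sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels(predicted_sentence, fontdict=fontdict)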

    plt.show()
    
    
def translate(sentence, SHOW_PLOT=False):
    result, sentence, attention_plot = evaluate(sentence)
        
    print('Input: %s' % (sentence).encode('utf-8'))
    print('Predicted translation: {}'.format(result))
    
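    if SHOW_PLOT: 
      # character-based model: one attention row per output character and one
      # column per input character, so slice and label the plot by character
      attention_plot = attention_plot[:len(result), :len(sentence)]
      plot_attention(attention_plot, list(sentence), list(result))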

In [0]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fc31db96780>

In [0]:
evaluate(u'continuiamo così che dovrnne anae bene')

('conti uiamo così che dovrnne ande bene>',
 '<continuiamo così che dovrnne anae bene>',
 array([[1.57178938e-02, 6.99134290e-01, 1.67461991e-01, ...,
         2.24874511e-05, 2.29274974e-05, 2.33147966e-05],
        [3.64358239e-05, 1.50078427e-04, 9.99574006e-01, ...,
         3.04932577e-08, 3.09802530e-08, 3.13915649e-08],
        [1.20356253e-05, 6.89217904e-06, 1.49742083e-03, ...,
         1.19090320e-08, 1.20261987e-08, 1.21203820e-08],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]))
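
The third element returned by `evaluate` is the attention matrix: each row appears to correspond to one decoding step and each column to one input position, with the all-zero rows at the bottom presumably being steps that were never reached. The sketch below shows one way to read such a matrix, by picking for every output character the input character with the largest weight. `strongest_alignment` is a hypothetical helper and the weights are toy values invented for illustration, not model output.

```python
def strongest_alignment(attention, input_chars, output_chars):
    """For each output character, return (output char, most attended input char, weight)."""
    pairs = []
    # zip stops at the last output character, so trailing unused rows are skipped
    for row, out_char in zip(attention, output_chars):
        weights = row[:len(input_chars)]
        best = max(range(len(weights)), key=weights.__getitem__)
        pairs.append((out_char, input_chars[best], weights[best]))
    return pairs


# Toy weights (rows: output steps, columns: input positions), illustration only
toy_attention = [
    [0.90, 0.05, 0.05, 0.0],
    [0.10, 0.20, 0.70, 0.0],
    [0.05, 0.80, 0.15, 0.0],
    [0.00, 0.00, 0.00, 0.0],  # unused decoder step, left at zero
]

alignment = strongest_alignment(toy_attention, list("cai"), list("cia"))
for out_char, in_char, weight in alignment:
    print(out_char, "<-", in_char, weight)
```

In this toy case the second and third output characters attend to swapped input positions, which is the pattern one would hope to see when the model fixes a transposition.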

In [0]:
translate(u'andiammo a casa a mangiaree qualcossa')

Input: b'<andiammo a casa a mangiaree qualcossa>'
Predicted translation: andiamo a casa a mangiare qualcosa>


In [0]:
s0_nsy = 'e la sorflla maggiore di tom'
s0_cln = ' '

s1_nsy = 'e la sorflla maggiore di tom'
s1_cln = 'he so so so so so so so so so so so so so so so so so so s'

s2_nsy = 'ciao'
s2_cln = 'ciao'
