#### Contents taken from [tensorflow-tutorial](https://www.tensorflow.org/tutorials/text/nmt_with_attention)

In [4]:
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time

## Download and process the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```
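
Each line of the file holds an English sentence and its Spanish translation separated by a tab character. A minimal sketch of splitting one such line, assuming exactly two tab-separated fields per line (the helper name `split_pair` is hypothetical and not part of the tutorial):

```python
def split_pair(line):
    """Split one 'english<TAB>spanish' line into its two sentences."""
    english, spanish = line.rstrip("\n").split("\t")
    return english, spanish


line = "May I borrow this book?\t\u00bfPuedo tomar prestado este libro?"
english, spanish = split_pair(line)
print(english)
print(spanish)
```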

There are a variety of languages available, but we'll use the English-Spanish dataset. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a Vocabulary with word index (mapping from word → id) and reverse word index (mapping from id → word).
4. Pad each sentence to a maximum length. (Why? The recurrent encoder and decoder process batches of fixed-length sequences, so every sentence in a batch must have the same length.)

In [11]:
# Download the file

# tf.keras.utils.get_file downloads a file from a URL if it is not already in the cache.
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [5]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w

In [6]:
## Check the preprocessing on example input-output text pair
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'
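
The accent stripping in `unicode_to_ascii` relies on Unicode NFD normalization, which splits an accented character into a base letter followed by a combining mark (category `Mn`). A small standard-library sketch of that idea, using a sample word chosen only for illustration:

```python
import unicodedata

word = "m\u00fasico"  # "músico", sample word chosen for illustration

# NFD splits "ú" into "u" followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", word)

# Dropping the combining marks (category "Mn") leaves only plain letters.
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(len(word), len(decomposed))
print(stripped)
```

The same step also turns `ñ` into `n`, since the tilde is a combining mark as well.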


In [7]:
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]

# The file contains text pairs as - english_sentence \t(separator) spanish_sentence

def create_dataset(path, num_examples):
  # path : path to the spa.txt file
  # num_examples : limit on the number of sentence pairs, for faster training (None uses the full data)
  with io.open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')
  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

  # zip(*word_pairs) groups all English sentences in one tuple and all Spanish sentences in another
  return zip(*word_pairs)

In [12]:
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
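
`create_dataset` returns `zip(*word_pairs)`, which transposes a list of `[english, spanish]` pairs into one sequence per language. A minimal sketch of that transposition, using two made-up pairs:

```python
# Two made-up, already preprocessed sentence pairs: [english, spanish].
word_pairs = [
    ["<start> go . <end>", "<start> ve . <end>"],
    ["<start> hi . <end>", "<start> hola . <end>"],
]

# zip(*pairs) yields one tuple per column: all English, then all Spanish.
en, sp = zip(*word_pairs)

print(en)
print(sp)
```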


In [17]:
# We'll create the vocabulary now. Thankfully tf.keras.preprocessing provides all the tools which are necessary for this. 


def tokenize(lang):
  # lang = list of sentences in a language
  
  # print(len(lang), "example sentence: {}".format(lang[0]))
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>')
  lang_tokenizer.fit_on_texts(lang)

  ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts each string (w1, w2, w3, ......, wn)
  ## to a list of the corresponding integer ids of its words (id_w1, id_w2, id_w3, ...., id_wn)
  tensor = lang_tokenizer.texts_to_sequences(lang)

  ## tf.keras.preprocessing.sequence.pad_sequences takes a list of integer id sequences
  ## and pads them (at the end, since padding='post') to the length of the longest sequence in the input
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

  return tensor, lang_tokenizer

In [20]:
## Test your tokenize function
text = ["what are you doing ?", 
        "where have you been ?",
        "I love the mother nature .",
        "Wow !"]

sample_tensor, sample_tokenizer = tokenize(text)
print("text sentence: {}".format(text[0]))
print("integer_ids for the same sequence: {}".format(sample_tensor[0]))
print("whole input tensor is : \n", sample_tensor)

text sentence: what are you doing ?
integer_ids for the same sequence: [4 5 2 6 3 0]
whole input tensor is : 
 [[ 4  5  2  6  3  0]
 [ 7  8  2  9  3  0]
 [10 11 12 13 14 15]
 [16 17  0  0  0  0]]


Since the longest sentence in our list, "*I love the mother nature .*", has 6 tokens, the tokenize function pads every sequence to length 6. In the above example, text[0] has only 5 tokens, so there is a trailing '0' in the corresponding integer_ids sequence. Let's look at the vocabulary:
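
The padding behaviour can be reproduced in plain Python. The sketch below is only an illustration of post-padding to the longest sequence, not the Keras implementation; `pad_post` is a hypothetical helper, and the ids are copied from the output above.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


# Integer ids from the example above, before padding.
ids = [
    [4, 5, 2, 6, 3],
    [7, 8, 2, 9, 3],
    [10, 11, 12, 13, 14, 15],
    [16, 17],
]

padded = pad_post(ids)
for row in padded:
    print(row)
```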

In [19]:
sample_tokenizer.get_config()

{'char_level': False,
 'document_count': 4,
 'filters': '',
 'index_docs': '{"4": 1, "6": 1, "5": 1, "2": 2, "3": 2, "8": 1, "9": 1, "7": 1, "10": 1, "13": 1, "14": 1, "12": 1, "11": 1, "15": 1, "16": 1, "17": 1}',
 'index_word': '{"1": "<OOV>", "2": "you", "3": "?", "4": "what", "5": "are", "6": "doing", "7": "where", "8": "have", "9": "been", "10": "i", "11": "love", "12": "the", "13": "mother", "14": "nature", "15": ".", "16": "wow", "17": "!"}',
 'lower': True,
 'num_words': None,
 'oov_token': '<OOV>',
 'split': ' ',
 'word_counts': '{"what": 1, "are": 1, "you": 2, "doing": 1, "?": 2, "where": 1, "have": 1, "been": 1, "i": 1, "love": 1, "the": 1, "mother": 1, "nature": 1, ".": 1, "wow": 1, "!": 1}',
 'word_docs': '{"what": 1, "doing": 1, "are": 1, "you": 2, "?": 2, "have": 1, "been": 1, "where": 1, "i": 1, "mother": 1, "nature": 1, "the": 1, "love": 1, ".": 1, "wow": 1, "!": 1}',
 'word_index': '{"<OOV>": 1, "you": 2, "?": 3, "what": 4, "are": 5, "doing": 6, "where": 7, "have": 8,

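The dictionary *word_index* maps each token to the *integer_id* assigned to it. Id 1 belongs to the out-of-vocabulary token, and id 0 is never assigned to a token, which is why it can be used for padding.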
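
A rough sketch of how such an index can be built in plain Python: count the words, order them by descending frequency (ties keep first-seen order), and give id 1 to the out-of-vocabulary token. It reproduces the ids shown above for the sample sentences, but it is an illustration rather than the Keras code, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Map each lowercased, whitespace-separated token to an integer id."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    vocab = [oov_token] + [word for word, _ in ordered]
    # Ids start at 1; id 0 stays free for padding.
    return {word: i for i, word in enumerate(vocab, start=1)}


text = ["what are you doing ?",
        "where have you been ?",
        "I love the mother nature .",
        "Wow !"]

word_index = build_word_index(text)
# Reverse mapping: id -> word.
index_word = {i: word for word, i in word_index.items()}

print(word_index["you"], word_index["what"], index_word[17])
```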

In [21]:
def load_dataset(path, num_examples=None):
  # creating cleaned (target, input) pairs: English is the target, Spanish is the input
  targ_lang, inp_lang = create_dataset(path, num_examples)

  input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
  target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

## Now let's prepare the input and output tensors on our data

### Limit the size of the dataset to experiment faster (optional)

Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset to 30,000 sentences (of course, translation quality degrades with less data):

In [22]:
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]
max_length_targ, max_length_inp

(11, 16)

In [23]:
def convert(lang, tensor):
  # Print every id in the sequence next to its word, skipping the padding id 0
  for t in tensor:
    if t != 0:
      print("%d ----> %s" % (t, lang.index_word[t]))

In [25]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

24000 24000 6000 6000
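
The key property of the split is that input and target rows are shuffled and divided with the same indices, so each pair stays aligned. A minimal plain-Python sketch of that idea follows; `split_pairs` is a hypothetical helper, the data is a stand-in, and the code does not claim to reproduce `train_test_split` internals.

```python
import random


def split_pairs(inputs, targets, test_size=0.2, seed=0):
    """Shuffle indices once, then slice both lists with the same indices."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, val_idx),
            pick(targets, train_idx), pick(targets, val_idx))


# Stand-in data: 30,000 aligned (input, target) rows.
inputs = list(range(30000))
targets = [-i for i in inputs]

in_train, in_val, tg_train, tg_val = split_pairs(inputs, targets)
print(len(in_train), len(tg_train), len(in_val), len(tg_val))
```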


In [26]:
print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
2 ----> <start>
5 ----> tom
43 ----> tiene
49 ----> su
1554 ----> propio
3374 ----> dormitorio
4 ----> .
3 ----> <end>

Target Language; index to word mapping
2 ----> <start>
6 ----> tom
52 ----> has
62 ----> his
433 ----> own
166 ----> room
4 ----> .
3 ----> <end>
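
`convert` prints one id per line. A compact variant that joins the words back into a sentence could look like the sketch below; `decode` is a hypothetical helper, the ids and words are copied from the target-language output above, and three trailing zeros pad the sequence to the target length of 11.

```python
def decode(ids, index_word):
    """Turn a padded id sequence back into a sentence, skipping padding (0)."""
    return " ".join(index_word[i] for i in ids if i != 0)


# Ids and words copied from the target-language output above.
index_word = {2: "<start>", 6: "tom", 52: "has", 62: "his",
              433: "own", 166: "room", 4: ".", 3: "<end>"}
ids = [2, 6, 52, 62, 433, 166, 4, 3, 0, 0, 0]

print(decode(ids, index_word))
```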


### Create a tf.data dataset

We'll use this dataset in our modeling notebooks. The *tf.data.Dataset* API supports writing descriptive and efficient input pipelines. Dataset usage follows a common pattern:

* Create a source dataset from your input data.
* Apply dataset transformations to preprocess the data.
* Iterate over the dataset and process the elements.

In [29]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64

# One epoch = one full pass over the training data
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE 

# Word embedding dimensions
embedding_dim = 256

# units in our recurrent cells
units = 1024

# Total number of tokens in input and target vocabulary (+1 because id 0 is reserved for padding)
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [30]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 16]), TensorShape([64, 11]))
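
What shuffling followed by batching with `drop_remainder=True` amounts to can be sketched in plain Python. This is a stand-in for the `tf.data` pipeline, not its implementation: `batches` is a hypothetical helper and the data is dummy zeros with the same shapes as above (16 input ids and 11 target ids per pair).

```python
import random


def batches(pairs, batch_size, seed=0):
    """Shuffle the pairs, then yield full batches only."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Ignore the trailing pairs that do not fill a whole batch.
    full = len(shuffled) // batch_size * batch_size
    for start in range(0, full, batch_size):
        chunk = shuffled[start:start + batch_size]
        inputs = [inp for inp, _ in chunk]
        targets = [tgt for _, tgt in chunk]
        yield inputs, targets


# Stand-in data: 24,000 pairs with 16 input ids and 11 target ids each.
pairs = [([0] * 16, [0] * 11) for _ in range(24000)]
all_batches = list(batches(pairs, batch_size=64))
first_inputs, first_targets = all_batches[0]

print(len(all_batches))
print(len(first_inputs), len(first_inputs[0]), len(first_targets[0]))
```

With 24,000 training pairs and a batch size of 64 there are exactly 375 full batches, which matches `steps_per_epoch`, so nothing is dropped in this particular configuration.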