# Task
Understand how the translation model works (without the attention mechanism) and run it for Russian-to-English translation.

-------------------------

In [1]:
import numpy as np
import unicodedata
import re
import os
from pathlib import Path
import io
import time
import warnings
warnings.filterwarnings('ignore')

In [2]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('ggplot')
plt.rcParams['font.family'] = 'Times New Roman'

In [3]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

*****************************************

# LOAD Data. Preprocessing

In [4]:
# !wget http://www.manythings.org/anki/rus-eng.zip # http://www.manythings.org/anki/ - other languages
# !mkdir rus-eng
# !unzip rus-eng.zip -d rus-eng/

In [5]:
path_to_file = os.path.join(Path(os.getcwd()), "rus-eng", "rus.txt")

In [6]:
def preprocess_sentence(w):
    w = w.lower().strip()

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    w = re.sub(r"([?.!,])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

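    # replacing everything with space except Latin and Cyrillic letters, ".", "?", "!", ",", "'"
    # note: "ё"/"Ё" lie outside the а-я/А-Я ranges, so they are replaced by a space as well
    w = re.sub(r"[^a-zA-Zа-яА-Я?.!,']+", " ", w)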
    w = w.strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w
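
The character class `а-яА-Я` covers contiguous Unicode ranges that do not include `ё`/`Ё`, so a word containing that letter is split in two by the cleaning step. A minimal sketch of the effect, using only the standard `re` module; the helper name `clean_cyrillic` and the widened class are illustrative assumptions, not part of the notebook:

```python
import re

# Hypothetical helper: keeps only the characters listed in `allowed`
# and replaces every other run of characters with a single space.
def clean_cyrillic(text, allowed):
    return re.sub(r"[^" + allowed + r"]+", " ", text).strip()

narrow = "а-яА-Я"    # Cyrillic part of the class used in preprocess_sentence
wide = "а-яА-ЯёЁ"    # assumption: same ranges with "ё"/"Ё" added explicitly

print(clean_cyrillic("учётом", narrow))  # "ё" is replaced by a space
print(clean_cyrillic("учётом", wide))    # the word stays intact
```

Widening the class would change the input vocabulary, so the model would need to be retrained after such a change.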

In [7]:
preprocess_sentence("I can't go!!!")

"<start> i can't go ! ! ! <end>"

In [8]:
# 1. Split each line on tabs and keep the first two fields
# 2. Clean the sentences with preprocess_sentence (accents are not removed)
# 3. Return sentence pairs in the format: [ENG, RUS]
def create_dataset(path, num_examples):
    # the context manager closes the file once it has been read
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[:2]]
                  for l in lines[:num_examples]]

    return zip(*word_pairs)

In [9]:
en, ru = create_dataset(path_to_file, None)
print(en[-1])
print(ru[-1])

<start> doubtless there exists in this world precisely the right woman for any given man to marry and vice versa but when you consider that a human being has the opportunity of being acquainted with only a few hundred people , and out of the few hundred that there are but a dozen or less whom he knows intimately , and out of the dozen , one or two friends at most , it will easily be seen , when we remember the number of millions who inhabit this world , that probably , since the earth was created , the right man has never yet met the right woman . <end>
<start> несомненно , для каждого мужчины в этом мире где то есть подходящая женщина , которая может стать ему женой , обратное верно и для женщин . но если учесть , что у человека может быть максимум несколько сотен знакомых , из которых лишь дюжина , а то и меньше , тех , кого он знает близко , а из этой дюжины у него один или от силы два друга , то можно легко увидеть , что с уч том миллионов живущих на земле людей , ни один подходящи

In [10]:
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
        filters='')
    lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post')

    return tensor, lang_tokenizer
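
To make the tokenization step concrete, here is a rough pure-Python approximation of what `Tokenizer(filters='')` followed by `pad_sequences(padding='post')` produces: ids are assigned by descending word frequency starting at 1, and 0 is reserved for padding. The function names `build_word_index` and `encode_and_pad` are hypothetical, and the tie-breaking between equally frequent words is an assumption of this sketch.

```python
from collections import Counter

def build_word_index(sentences):
    # count space-separated tokens over the whole corpus
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split(' '))
    # ids start at 1; id 0 is left free for padding
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(sentences, word_index):
    seqs = [[word_index[w] for w in s.split(' ')] for s in sentences]
    max_len = max(len(s) for s in seqs)
    # "post" padding: zeros are appended after the real tokens
    return [s + [0] * (max_len - len(s)) for s in seqs]

sentences = ["<start> we're wealthy . <end>", "<start> hi . <end>"]
word_index = build_word_index(sentences)
padded = encode_and_pad(sentences, word_index)
print(word_index)
print(padded)
```

Every row ends up with the length of the longest sentence, which is why the tensor's second dimension can later be read off as the maximum sentence length.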

In [11]:
def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [12]:
len(en), len(ru)

(431097, 431097)

In [13]:
# Try experimenting with the size of that dataset
num_examples = 100000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]
print(f" max_length of the target tensors: eng-{max_length_targ}, ru-{max_length_inp}")

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

 max_length of the target tensors: eng-11, ru-15


In [14]:
def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))

In [15]:
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
13 ----> мы
1601 ----> богаты
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
63 ----> we're
1098 ----> wealthy
3 ----> .
2 ----> <end>


******

# Create a tf.data dataset. Model build

In [16]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 300
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [17]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 15]), TensorShape([64, 11]))
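
The relationship between `BATCH_SIZE`, `drop_remainder=True` and `steps_per_epoch` can be checked with a small plain-Python sketch. `make_batches` is a hypothetical stand-in for `dataset.batch`, and the figure of 80,000 training rows assumes the 80-20 split of `num_examples = 100000` used in this notebook.

```python
def make_batches(items, batch_size, drop_remainder=True):
    # slice the data into consecutive chunks of batch_size
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # optionally discard a final, shorter chunk
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

n_train = 80_000                      # assumed size of the training split
steps = n_train // 64                 # same arithmetic as steps_per_epoch
print(steps)                          # 1250

toy = list(range(10))
print(len(make_batches(toy, 4)))                         # 2 full batches
print(len(make_batches(toy, 4, drop_remainder=False)))   # 3, the last one is short
```

Dropping the remainder keeps every batch at exactly `BATCH_SIZE` rows, which matters here because the encoder's initial hidden state is created with a fixed batch dimension.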

## ENCODER

In [18]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super().__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=False,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

In [19]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

In [20]:
# sample input
sample_hidden = encoder.initialize_hidden_state()
print(f'Encoder Hidden state shape: batch size {sample_hidden.shape[0]}, units: {sample_hidden.shape[1]}')

Encoder Hidden state shape: batch size 64, units: 1024


## DECODER

In [21]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super().__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, hidden):
        # x shape on input == (batch_size, 1)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # no attention here: the encoder's final state arrives as `hidden`
        # and is used as the GRU's initial state
        # output shape == (batch_size, 1, hidden_size)
        output, state = self.gru(x, initial_state=hidden)

        # output shape after reshape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state

In [22]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

In [23]:
decoder_sample_x, decoder_sample_h = decoder(
    tf.random.uniform((BATCH_SIZE, 1)), sample_hidden)

decoder_sample_x.shape, decoder_sample_h.shape

(TensorShape([64, 7260]), TensorShape([64, 1024]))

In [24]:
optimizer = tf.keras.optimizers.Adam()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
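
`loss_function` zeroes the loss at padded positions, but `tf.reduce_mean` still divides by the full batch size, so a time step where many sentences have already ended reports a smaller value. The sketch below reproduces that arithmetic in plain Python and contrasts it with averaging over non-padding tokens only. The second variant is an illustrative alternative, not what the notebook does, and all helper names are hypothetical.

```python
import math

def sparse_ce_from_logits(label, logits):
    # numerically stable log-sum-exp minus the logit of the true class
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def masked_losses(labels, batch_logits, pad_id=0):
    # per-example loss, forced to 0 where the label is padding
    per_example = [0.0 if y == pad_id else sparse_ce_from_logits(y, z)
                   for y, z in zip(labels, batch_logits)]
    n_real = sum(1 for y in labels if y != pad_id)
    mean_over_batch = sum(per_example) / len(labels)       # as in loss_function
    mean_over_tokens = sum(per_example) / max(n_real, 1)   # alternative normalisation
    return mean_over_batch, mean_over_tokens

# one real token (label 1) and one padded position (label 0)
over_batch, over_tokens = masked_losses([1, 0], [[0.0, 0.0], [5.0, 1.0]])
print(over_batch, over_tokens)
```

With uniform logits over two classes the single real token costs ln 2, so the batch mean is half of the token mean in this toy case.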

# TRAIN

In [25]:
checkpoint_dir = './training_nmt_checkpoints'

checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [26]:
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden

        dec_input = tf.expand_dims(
            [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing the previous decoder state (initially the encoder state) to the decoder
            predictions, dec_hidden = decoder(dec_input, dec_hidden)

            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / targ.shape[1])

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss
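
The teacher-forcing loop in `train_step` pairs each decoder input with the label it is scored against: at step `t` the decoder receives the true token `t-1` and is compared with token `t`. A minimal sketch with word strings instead of ids; `teacher_forcing_pairs` is a hypothetical helper used only for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    # at step t the decoder is fed the true token t-1 and scored against token t
    return [(target_tokens[t - 1], target_tokens[t])
            for t in range(1, len(target_tokens))]

example = ["<start>", "we're", "wealthy", ".", "<end>"]
for fed, expected in teacher_forcing_pairs(example):
    print(f"input: {fed:8s} -> label: {expected}")
```

The model's own prediction is never fed back during training; that only happens at inference time.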

In [27]:
EPOCHS = 50
verbose = 1
for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

        if batch % 200 == 0 and verbose == 2:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 5 epochs and after the last epoch
    if (epoch + 1) % 5 == 0 or (epoch + 1) == EPOCHS:
        # keep the prefix directly inside checkpoint_dir so that
        # tf.train.latest_checkpoint(checkpoint_dir) can find the saved state
        checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_" + str(epoch))
        checkpoint.save(file_prefix=checkpoint_prefix)

    if verbose == 1:
        print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
        print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Loss 1.4261
Time taken for 1 epoch 164.70597124099731 sec

Epoch 2 Loss 0.6991
Time taken for 1 epoch 153.70558643341064 sec

Epoch 3 Loss 0.3813
Time taken for 1 epoch 147.8915092945099 sec

Epoch 4 Loss 0.2301
Time taken for 1 epoch 146.7943503856659 sec

Epoch 5 Loss 0.1571
Time taken for 1 epoch 150.3565366268158 sec

Epoch 6 Loss 0.1220
Time taken for 1 epoch 154.02417874336243 sec

Epoch 7 Loss 0.1029
Time taken for 1 epoch 149.26695156097412 sec

Epoch 8 Loss 0.0935
Time taken for 1 epoch 144.3778805732727 sec

Epoch 9 Loss 0.0861
Time taken for 1 epoch 143.12361431121826 sec

Epoch 10 Loss 0.0817
Time taken for 1 epoch 142.26895809173584 sec

Epoch 11 Loss 0.0788
Time taken for 1 epoch 140.5196988582611 sec

Epoch 12 Loss 0.0768
Time taken for 1 epoch 140.44386196136475 sec

Epoch 13 Loss 0.0730
Time taken for 1 epoch 140.3862063884735 sec

Epoch 14 Loss 0.0719
Time taken for 1 epoch 140.53888821601868 sec

Epoch 15 Loss 0.0704
Time taken for 1 epoch 142.35179018974304 sec
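
To my understanding, `checkpoint.save` writes its checkpoint state file into the directory part of `file_prefix`, and `tf.train.latest_checkpoint(directory)` looks for that file inside `directory`. The sketch below compares two prefix layouts using `posixpath`, chosen here only so the result is the same on every platform; `state_dir` is a hypothetical helper.

```python
import posixpath

def state_dir(*prefix_parts):
    # directory part of the joined prefix
    return posixpath.dirname(posixpath.join(*prefix_parts))

checkpoint_dir = './training_nmt_checkpoints'

# prefix nested one level deeper than checkpoint_dir
print(state_dir(checkpoint_dir, "ckpt_", "4"))
# prefix placed directly inside checkpoint_dir
print(state_dir(checkpoint_dir, "ckpt_4"))
```

Only the second layout puts the state file where a later `tf.train.latest_checkpoint(checkpoint_dir)` call would look for it.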

In [28]:
def evaluate(sentence):
    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden = decoder(dec_input, dec_hidden)

        # greedy decoding: take the highest-scoring token at each step
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence
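
The inference loop in `evaluate` is a greedy decoder: it picks the highest-scoring token, feeds it back as the next input, and stops at `<end>` or after a maximum number of steps. A framework-free sketch of that control flow follows. `greedy_decode` and `toy_step` are hypothetical, with `toy_step` standing in for the trained decoder.

```python
def greedy_decode(step_fn, state, start_id, end_id, max_len):
    """step_fn(token_id, state) -> (scores, new_state); stands in for the decoder."""
    token, out = start_id, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        # argmax over the vocabulary scores
        token = max(range(len(scores)), key=scores.__getitem__)
        out.append(token)
        if token == end_id:
            break
    return out

# Toy "decoder": always prefers the id following the current one (capped at 4).
def toy_step(token, state):
    scores = [0.0] * 5
    scores[min(token + 1, 4)] = 1.0
    return scores, state

print(greedy_decode(toy_step, None, start_id=1, end_id=4, max_len=10))
```

Greedy search commits to one token per step and never revisits it; the notebook uses the same strategy.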

In [29]:
def translate(sentence):
    result, sentence = evaluate(sentence)

    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

# TEST Translation

In [30]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x14e8777bca0>

In [33]:
test_text_list = ['Вы не опоздали?',
                  'Приведите вашу жену',
                  'Позвоните мне в половине третьего.',
                  'Я рассчитываю на вашу помощь',
                  'Я сделал это на прошлой неделе',
                  'Я очень хорошо это сделал'  
]

In [34]:
for text in test_text_list:
    translate(text)

Input: <start> вы не опоздали ? <end>
Predicted translation: aren't you late ? <end> 
Input: <start> приведите вашу жену <end>
Predicted translation: bring your wife . <end> 
Input: <start> позвоните мне в половине третьего . <end>
Predicted translation: call me at . <end> 
Input: <start> я рассчитываю на вашу помощь <end>
Predicted translation: i count your number . <end> 
Input: <start> я сделал это на прошлой неделе <end>
Predicted translation: i did it last week . <end> 
Input: <start> я очень хорошо это сделал <end>
Predicted translation: i did it very well . <end> 
