# Создание модели рекуррентной нейронной сети для проверки орфорграфии

Обучаемые данные - книги сервиса [Project Gutenberg](http://www.gutenberg.org/ebooks/search/?sort_order=downloads). 

Содержание:
- Загрузка исходных данных
- Подготовка данных
- Создание модели
- Обучение модели
- Тестрирование

Необходимые библиотеки

In [0]:
import time
import re
import pandas as pd
import numpy as np
import tensorflow as tf
import os
from os import listdir
from os.path import isfile, join
from collections import namedtuple
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
from sklearn.model_selection import train_test_split

## Загрузка исходных данных

In [2]:
!unzip books.zip -d input_data/

Archive:  books.zip
   creating: input_data/books/
  inflating: input_data/books/A Doll's House a play by Henrik Ibsen.txt  
  inflating: input_data/books/A Tale of Two Cities by Charles Dickens.txt  
  inflating: input_data/books/Adventures of Huckleberry Finn by Mark Twain.txt  
  inflating: input_data/books/Alice's Adventures in Wonderland by Lewis Carroll.txt  
  inflating: input_data/books/Beowulf An Anglo-Saxon Epic Poem by J. Lesslie Hall.txt  
  inflating: input_data/books/Crime and Punishment by Fyodor Dostoyevsky.txt  
  inflating: input_data/books/Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley.txt  
  inflating: input_data/books/Great Expectations by Charles Dickens.txt  
  inflating: input_data/books/Heart of Darkness by Joseph Conrad.txt  
  inflating: input_data/books/Jane Eyre An Autobiography by Charlotte Bronte.txt  
  inflating: input_data/books/Les MisВrables by Victor Hugo.txt  
  inflating: input_data/books/Metamorphosis by Franz Kafka.txt  

Получение имен всех файлов книг

In [0]:
path = 'input_data/books/'
book_files = [f for f in listdir(path) if isfile(join(path, f))]
book_files = book_files[1:]

In [0]:
def load_book(path):
    """Функция для загразки книги из файла"""
    input_file = os.path.join(path)
    with open(input_file, 'rb') as f:
        book = f.read()
    return book

Загрузка книг из файлов

In [0]:
books = []
for book in book_files:
    books.append(load_book(path+book))

Подсчет количества слов в каждой книге

In [6]:
for i in range(len(books)):
    print("{} слов содержится в {}.".format(len(books[i].split()), book_files[i]))

187425 слов содержится в Great Expectations by Charles Dickens.txt.
124592 слов содержится в Pride and Prejudice by Jane Austen.txt.
78098 слов содержится в Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley.txt.
9103 слов содержится в The Yellow Wallpaper by Charlotte Perkins Gilman.txt.
81980 слов содержится в The Picture of Dorian Gray by Oscar Wilde.txt.
191974 слов содержится в The Romance of Lust A classic Victorian erotic novel by Anonymous.txt.
114217 слов содержится в Adventures of Huckleberry Finn by Mark Twain.txt.
23731 слов содержится в The Importance of Being Earnest A Trivial Comedy for Serious People by Oscar Wilde.txt.
107602 слов содержится в The Adventures of Sherlock Holmes by Arthur Conan Doyle.txt.
566310 слов содержится в War and Peace by graf Leo Tolstoy.txt.
42089 слов содержится в Beowulf An Anglo-Saxon Epic Poem by J. Lesslie Hall.txt.
29482 слов содержится в A Doll's House a play by Henrik Ibsen.txt.
29465 слов содержится в Alice's Advent

Проверка текста

In [7]:
books[0][10000:10500]

b'he sky was just a row of long\r\nangry red lines and dense black lines intermixed. On the edge of the\r\nriver I could faintly make out the only two black things in all the\r\nprospect that seemed to be standing upright; one of these was the beacon\r\nby which the sailors steered,--like an unhooped cask upon a pole,--an\r\nugly thing when you were near it; the other, a gibbet, with some chains\r\nhanging to it which had once held a pirate. The man was limping on\r\ntowards this latter, as if he were the pirat'

## Подготовка данных

In [0]:
def clean_text(text):
    '''Функция очистки текста от ненужных символов'''
    text = text.decode('UTF-8', 'ignore')
    text = re.sub(r'\n', ' ', text) 
    text = re.sub(r'[{}@_*>()\\#%+=\[\]]','', text)
    text = re.sub('a0','', text)
    text = re.sub('\'92t','\'t', text)
    text = re.sub('\'92s','\'s', text)
    text = re.sub('\'92m','\'m', text)
    text = re.sub('\'92ll','\'ll', text)
    text = re.sub('\'91','', text)
    text = re.sub('\'92','', text)
    text = re.sub('\'93','', text)
    text = re.sub('\'94','', text)
    text = re.sub('\.','. ', text)
    text = re.sub('\r','', text)
    text = re.sub('\!','! ', text)
    text = re.sub('\?','? ', text)
    text = re.sub(' +',' ', text)
    return text

Очистка текста книг

In [0]:
clean_books = []
for book in books:
    clean_books.append(clean_text(book))

Проверка текста

In [10]:
clean_books[0][10000:10500]

's if he were the pirate come to life, and come down, and going back to hook himself up again. It gave me a terrible turn when I thought so; and as I saw the cattle lifting their heads to gaze after him, I wondered whether they thought so too. I looked all round for the horrible young man, and could see no signs of him. But now I was frightened again, and ran home without stopping. Chapter II My sister, Mrs. Joe Gargery, was more than twenty years older than I, and had established a great reputat'

Создания словаря для перевода символов в числовые эквиваленты (индексы)

In [0]:
vocab_to_int = {}
count = 0
for book in clean_books:
    for character in book:
        if character not in vocab_to_int:
            vocab_to_int[character] = count
            count += 1

Добавление специальных токенов в словарь. 
- Токен "GO" служит для указания начала предложения. 
- Токен "EOS" - для указания конца предложения. 
- Токеном "PAD" дополняют предложения для создания одиннаковых по длине тренировочных партий. 

In [0]:
codes = ['<PAD>','<EOS>','<GO>']
for code in codes:
    vocab_to_int[code] = count
    count += 1

Проверка словаря

In [13]:
vocab_size = len(vocab_to_int)
print("Словарь содержит {} символа.".format(vocab_size))
print(sorted(vocab_to_int))

Словарь содержит 133 символа.
[' ', '!', '"', '$', '&', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<EOS>', '<GO>', '<PAD>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '\xa0', '£', '§', '½', 'À', 'Á', 'Æ', 'Ç', 'È', 'É', 'Ú', 'Ü', 'Þ', 'à', 'á', 'â', 'ä', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'í', 'î', 'ï', 'ð', 'ñ', 'ó', 'ô', 'ö', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'Œ', 'œ', 'η', 'ο', 'ς', 'τ', 'ϰ', 'ו', 'ח', '—', '‘', '’', '“', '”', '…', '\ufeff']


Создания словаря для обратного преобразования числовых индексов в символы

In [0]:
int_to_vocab = {}
for character, value in vocab_to_int.items():
    int_to_vocab[value] = character

Разделение текста книг на отдельные предложения

In [15]:
sentences = []
for book in clean_books:
    for sentence in book.split('. '):
        sentences.append(sentence + '.')
print("Обучающие данные содержат {} предложений.".format(len(sentences)))

Обучающие данные содержат 161258 предложений.


Проверка корректности разделения текстов на предложения

In [16]:
sentences[100:115]

['She made it a powerful merit in herself, and a strong reproach against Joe, that she wore this apron so much.',
 'Though I really see no reason why she should have worn it at all; or why, if she did wear it at all, she should not have taken it off, every day of her life.',
 "Joe's forge adjoined our house, which was a wooden house, as many of the dwellings in our country were,--most of them, at that time.",
 'When I ran home from the churchyard, the forge was shut up, and Joe was sitting alone in the kitchen.',
 'Joe and I being fellow-sufferers, and having confidences as such, Joe imparted a confidence to me, the moment I raised the latch of the door and peeped in at him opposite to it, sitting in the chimney corner.',
 '“Mrs.',
 'Joe has been out a dozen times, looking for you, Pip.',
 "And she's out now, making it a baker's dozen.",
 "” “Is she? ” “Yes, Pip,” said Joe; “and what's worse, she's got Tickler with her.",
 '” At this dismal intelligence, I twisted the only button on my

Преобразование предложений в наборы числовых индексов

In [0]:
int_sentences = []

for sentence in sentences:
    int_sentence = []
    for character in sentence:
        int_sentence.append(vocab_to_int[character])
    int_sentences.append(int_sentence)

Подсчет длины каждого предложения

In [0]:
lengths = []
for sentence in int_sentences:
    lengths.append(len(sentence))
lengths = pd.DataFrame(lengths, columns=["counts"])

In [19]:
lengths.describe()

Unnamed: 0,counts
count,161258.0
mean,102.96296
std,103.045955
min,1.0
25%,36.0
50%,77.0
75%,140.0
max,5578.0


Выбор наиболее предпочтительных наборов предложений (от 10 до 100 символов)

In [20]:
# Limit the data we will use to train our model
max_length = 100
min_length = 10

good_sentences = []

for sentence in int_sentences:
    if len(sentence) <= max_length and len(sentence) >= min_length:
        good_sentences.append(sentence)

print("Количество предложений входящих в обучаемые данные = {}".format(len(good_sentences)))

Количество предложений входящих в обучаемые данные = 82320


Создание обучаемой и тестовой выборки

In [21]:
training, testing = train_test_split(good_sentences, test_size = 0.15, random_state = 2)

print("Колличество предложений в обучаемом наборе:", len(training))
print("Колличество предложений в тестовом наборе:", len(testing))

Колличество предложений в обучаемом наборе: 69972
Колличество предложений в тестовом наборе: 12348


Сортировка предложений по длине, необходимо для более быстрого обучения модели

In [0]:
training_sorted = []
testing_sorted = []

for i in range(min_length, max_length+1):
    for sentence in training:
        if len(sentence) == i:
            training_sorted.append(sentence)
    for sentence in testing:
        if len(sentence) == i:
            testing_sorted.append(sentence)

Создание орфографических ошибок в обучаемых данных. Имитация ошибок человека в письме (пропуск букв, добавление не тех)

In [0]:
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]

def noise_maker(sentence, threshold):
    '''Функция для создания орфографических ошибок'''
    
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0,1,1)
        # Для высокого значение порога threshold предложения будут верны
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0,1,1)
            # В ~33% символы в слове поменяются местами
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # Если это последняя буква в слове, она не будет введена
                    continue
                else:
                    # Иначе рядом стоящие буквы поменяются местами
                    noisy_sentence.append(sentence[i+1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            # В ~33% в конец будет добалена дополнительная случайная буква нижнего регистра
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(vocab_to_int[random_letter])
                noisy_sentence.append(sentence[i])
            # В ~33% ничего не будет изменено
            else:
                pass     
        i += 1
    return noisy_sentence

Проверка верного создания имитаций ошибок в предложениях

In [24]:
threshold = 0.9
for sentence in training_sorted[:5]:
    print(sentence)
    print(noise_maker(sentence, threshold))

[51, 3, 9, 10, 23, 7, 13, 4, 43, 34]
[51, 3, 9, 10, 24, 23, 7, 13, 43, 34]
[17, 6, 3, 20, 18, 19, 20, 24, 10, 34]
[17, 6, 3, 20, 19, 20, 24, 10, 34]
[58, 3, 6, 26, 4, 30, 3, 28, 28, 34]
[58, 3, 26, 4, 30, 3, 28, 28, 34]
[11, 30, 3, 13, 31, 7, 28, 3, 13, 34]
[30, 3, 13, 31, 7, 28, 18, 3, 13, 34]
[1, 3, 28, 28, 4, 10, 2, 3, 32, 34]
[1, 3, 28, 28, 4, 10, 2, 3, 32, 34]


# Создание модели

Создание placeholders для входных данных модели

In [0]:
def model_inputs():
    '''Функция для создания placeholders для входных данных модели'''
    
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
    with tf.name_scope('targets'):
        targets = tf.placeholder(tf.int32, [None, None], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    inputs_length = tf.placeholder(tf.int32, (None,), name='inputs_length')
    targets_length = tf.placeholder(tf.int32, (None,), name='targets_length')
    max_target_length = tf.reduce_max(targets_length, name='max_target_len')

    return inputs, targets, keep_prob, inputs_length, targets_length, max_target_length

Создание слоев кодирования и декодирования LSTM 

Модель seq2seq

<img src="images/seq2seq.png" width="1275px">

Модель sequence to sequence состит из двух рекуррентных сетей: кодировщика и декодировщика. Кодировщик строит представление входной последовательности слов.  Полученное представление (последние выход и значение ячейки сети) копируются в декодировщик. По полученному представлению декодировщик пытается восстановить целевую последовательность слов.
На вход декодировщику на первом такте подается специальный символ GO, затем на каждом такте подается сгенерированное в предыдущую итерацию слово. Генерация ответа продолжается до тех пор, пока не будет сгенерировано специальное слово – маркер конца строки EOL (end of line). Во время обучения в качестве сгенерированного символа на следующий такт передается целевой символ, а распределение на предсказанных символах передается в функцию потерь.

In [0]:
def process_encoding_input(targets, vocab_to_int, batch_size):
    '''Функия для удаления последнего слова и добавление токена GO в начало каждого batch'''
    
    with tf.name_scope("process_encoding"):
        ending = tf.strided_slice(targets, [0, 0], [batch_size, -1], [1, 1])
        dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input

In [0]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob, direction):
    '''Функция для создания LSTM уровня кодирования'''
    
    if direction == 1:
        with tf.name_scope("RNN_Encoder_Cell_1D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{}'.format(layer)):
                    lstm = tf.contrib.rnn.LSTMCell(rnn_size)

                    drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                         input_keep_prob = keep_prob)

                    enc_output, enc_state = tf.nn.dynamic_rnn(drop, 
                                                              rnn_inputs,
                                                              sequence_length,
                                                              dtype=tf.float32)

            return enc_output, enc_state
        
        
    if direction == 2:
        with tf.name_scope("RNN_Encoder_Cell_2D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{}'.format(layer)):
                    cell_fw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                            input_keep_prob = keep_prob)

                    cell_bw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                            input_keep_prob = keep_prob)

                    enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                            cell_bw, 
                                                                            rnn_inputs,
                                                                            sequence_length,
                                                                            dtype=tf.float32)
            enc_output = tf.concat(enc_output,2)
            return enc_output, enc_state[0]

In [0]:
def training_decoding_layer(dec_embed_input, targets_length, dec_cell, initial_state, output_layer, 
                            vocab_size, max_target_length):
    '''Функция для создания logits предсказаний обучения (потерь)'''
    
    with tf.name_scope("Training_Decoder"):
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                            sequence_length=targets_length,
                                                            time_major=False)

        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                           training_helper,
                                                           initial_state,
                                                           output_layer) 

        training_logits, _ , _  = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                               output_time_major=False,
                                                               impute_finished=True,
                                                               maximum_iterations=max_target_length)
        return training_logits

In [0]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_target_length, batch_size):
    '''Функция для создания выходных logits предсказаний'''
    
    with tf.name_scope("Inference_Decoder"):
        start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')

        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                    start_tokens,
                                                                    end_token)

        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                            inference_helper,
                                                            initial_state,
                                                            output_layer)

        inference_logits, _ , _  = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                                output_time_major=False,
                                                                impute_finished=True,
                                                                maximum_iterations=max_target_length)

        return inference_logits

In [0]:
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, inputs_length, targets_length, 
                   max_target_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers, direction):
    '''Функция для создания LSTM слоя декодирования'''
    
    with tf.name_scope("RNN_Decoder_Cell"):
        for layer in range(num_layers):
            with tf.variable_scope('decoder_{}'.format(layer)):
                lstm = tf.contrib.rnn.LSTMCell(rnn_size)
                dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                         input_keep_prob = keep_prob)
    
    output_layer = Dense(vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                  enc_output,
                                                  inputs_length,
                                                  normalize=False,
                                                  name='BahdanauAttention')
    
    with tf.name_scope("Attention_Wrapper"):
        dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell,attn_mech,rnn_size)
    
    initial_state = dec_cell.zero_state(batch_size=batch_size, dtype=tf.float32).clone(cell_state=enc_state)

    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, 
                                                  targets_length, 
                                                  dec_cell, 
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size, 
                                                  max_target_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,  
                                                    vocab_to_int['<GO>'], 
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell, 
                                                    initial_state, 
                                                    output_layer,
                                                    max_target_length,
                                                    batch_size)

    return training_logits, inference_logits

Создание модели seq2seq

<img src="images/seq2seqlogits.png" width="451px">
Модель имеет 4 основных компонента:
- Слой внедрения
- Слой кодирования
- Слой декодирования
- Оптимизатор для обновления весов

In [0]:
def seq2seq_model(inputs, targets, keep_prob, inputs_length, targets_length, max_target_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size, embedding_size, direction):
    '''Функция создания модели seq2seq (подготовка logits обучения и выхода)'''
    
    enc_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    enc_embed_input = tf.nn.embedding_lookup(enc_embeddings, inputs)
    enc_output, enc_state = encoding_layer(rnn_size, inputs_length, num_layers, 
                                           enc_embed_input, keep_prob, direction)
    
    dec_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    dec_input = process_encoding_input(targets, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        dec_embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        inputs_length, 
                                                        targets_length, 
                                                        max_target_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers,
                                                        direction)
    
    return training_logits, inference_logits

Подготовка batch'ей

In [0]:
def pad_sentence_batch(sentence_batch):
    """Функция привдения batch к одиннаковой длине, добавление токена PAD"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]

In [0]:
def get_batches(sentences, batch_size, threshold):
    """Функция для создания batch"""
    
    for batch_i in range(0, len(sentences)//batch_size):
        start_i = batch_i * batch_size
        sentences_batch = sentences[start_i:start_i + batch_size]
        
        sentences_batch_noisy = []
        for sentence in sentences_batch:
            sentences_batch_noisy.append(noise_maker(sentence, threshold))
            
        sentences_batch_eos = []
        for sentence in sentences_batch:
            sentence.append(vocab_to_int['<EOS>'])
            sentences_batch_eos.append(sentence)
            
        pad_sentences_batch = np.array(pad_sentence_batch(sentences_batch_eos))
        pad_sentences_noisy_batch = np.array(pad_sentence_batch(sentences_batch_noisy))
        
        # Need the lengths for the _lengths parameters
        pad_sentences_lengths = []
        for sentence in pad_sentences_batch:
            pad_sentences_lengths.append(len(sentence))
        
        pad_sentences_noisy_lengths = []
        for sentence in pad_sentences_noisy_batch:
            pad_sentences_noisy_lengths.append(len(sentence))
        
        yield pad_sentences_noisy_batch, pad_sentences_batch, pad_sentences_noisy_lengths, pad_sentences_lengths

Параметры обучения

In [0]:
epochs = 20
batch_size = 128
num_layers = 2
rnn_size = 512
embedding_size = 128
learning_rate = 0.0005
direction = 2
threshold = 0.95
keep_probability = 0.75

In [0]:
def build_graph(keep_prob, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction):

    tf.reset_default_graph()
    
    # Загрузка входных данных модели    
    inputs, targets, keep_prob, inputs_length, targets_length, max_target_length = model_inputs()

    # Создание logits обучения и выхода
    training_logits, inference_logits = seq2seq_model(tf.reverse(inputs, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      inputs_length,
                                                      targets_length,
                                                      max_target_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size,
                                                      embedding_size,
                                                      direction)

    # Создание тензоров для logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')

    with tf.name_scope('predictions'):
        predictions = tf.identity(inference_logits.sample_id, name='predictions')
        tf.summary.histogram('predictions', predictions)

    # Создание весов функции потерь
    masks = tf.sequence_mask(targets_length, max_target_length, dtype=tf.float32, name='masks')
    
    with tf.name_scope("cost"):
        # Функция потерь
        cost = tf.contrib.seq2seq.sequence_loss(training_logits, 
                                                targets, 
                                                masks)
        tf.summary.scalar('cost', cost)

    with tf.name_scope("optimze"):
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Отсечение градиента
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)

    # Подсчет всех сумм
    merged = tf.summary.merge_all()    
    
    export_nodes = ['inputs', 'targets', 'keep_prob', 'cost', 'inputs_length', 'targets_length',
                    'predictions', 'merged', 'train_op','optimizer']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])

    return graph

## Обучение модели

In [0]:
def train(model, epochs, log_string):
    '''Функция для обучения модели RNN'''
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Используется для прекращения обучения на ранней стадии
        testing_loss_summary = []

        # Проверка итерации обучения на batch
        iteration = 0
        
        display_step = 100 # Вывод прогресса каждые 100 batch'ей
        stop_early = 0 
        stop = 10 # Если batch_loss_testing не уменьшается в течении 10 проверок, обучение прекращается
        per_epoch = 2 # Тестирование модели дважды в эпоху
        testing_check = (len(training_sorted)//batch_size//per_epoch)-1

        print()
        print("Обучение модели: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/1/train/{}'.format(log_string), sess.graph)
        test_writer = tf.summary.FileWriter('./logs/1/test/{}'.format(log_string))

        for epoch_i in range(1, epochs+1): 
            batch_loss = 0
            batch_time = 0
            
            for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(
                    get_batches(training_sorted, batch_size, threshold)):
                start_time = time.time()

                summary, loss, _ = sess.run([model.merged,
                                             model.cost, 
                                             model.train_op], 
                                             {model.inputs: input_batch,
                                              model.targets: target_batch,
                                              model.inputs_length: input_length,
                                              model.targets_length: target_length,
                                              model.keep_prob: keep_probability})


                batch_loss += loss
                end_time = time.time()
                batch_time += end_time - start_time

                # Сохранение прогресса обучения
                train_writer.add_summary(summary, iteration)

                iteration += 1

                if batch_i % display_step == 0 and batch_i > 0:
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(training_sorted) // batch_size, 
                                  batch_loss / display_step, 
                                  batch_time))
                    batch_loss = 0
                    batch_time = 0

                #### Тестирование ####
                if batch_i % testing_check == 0 and batch_i > 0:
                    batch_loss_testing = 0
                    batch_time_testing = 0
                    for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(
                            get_batches(testing_sorted, batch_size, threshold)):
                        start_time_testing = time.time()
                        summary, loss = sess.run([model.merged,
                                                  model.cost], 
                                                     {model.inputs: input_batch,
                                                      model.targets: target_batch,
                                                      model.inputs_length: input_length,
                                                      model.targets_length: target_length,
                                                      model.keep_prob: 1})

                        batch_loss_testing += loss
                        end_time_testing = time.time()
                        batch_time_testing += end_time_testing - start_time_testing

                        # Запись результата тестирования
                        test_writer.add_summary(summary, iteration)

                    n_batches_testing = batch_i + 1
                    print('Testing Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(batch_loss_testing / n_batches_testing, 
                                  batch_time_testing))
                    
                    batch_time_testing = 0

                    # Если batch_loss_testing показывает новый минимум, сохраняем модель
                    testing_loss_summary.append(batch_loss_testing)
                    if batch_loss_testing <= min(testing_loss_summary):
                        print('Новый рекорд!') 
                        stop_early = 0
                        checkpoint = "./{}.ckpt".format(log_string)
                        saver = tf.train.Saver()
                        saver.save(sess, checkpoint)

                    else:
                        print("Нет улучшений.")
                        stop_early += 1
                        if stop_early == stop:
                            break

            if stop_early == stop:
                print("Прекращение обучения.")
                break

In [37]:
for keep_probability in [0.75]:
    for num_layers in [2]:
        for threshold in [0.95]:
            log_string = 'kp={},nl={},th={}'.format(keep_probability,
                                                    num_layers,
                                                    threshold) 
            model = build_graph(keep_probability, rnn_size, num_layers, batch_size, 
                                learning_rate, embedding_size, direction)
            train(model, epochs, log_string)

Instructions for updating:
Colocations handled automatically by placer.

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Обучение модели: kp=0.75,nl=2,th=0.95
Epoch   1/20 Batch  100/546 - Loss:  2.067, Seconds: 16.83
Epoch   1/20 Batch  200/546 - Loss:  0.554, Seconds: 27.79
Testing Loss:  3.641, S

## Тестирование (исправление текста)

In [0]:
def text_to_ints(text):
    '''Функция подготовки текста на вход модели'''
    
    text = clean_text(text)
    return [vocab_to_int[word] for word in text]

In [41]:
# Ввод предложения с ошибками для проверки моделью
#text = b"Spellin is difficult, whch is wyh you need to study everyday."
#text = text_to_ints(text)

# Использование предложения из набора обучения
random = np.random.randint(0,len(testing_sorted))
text = testing_sorted[random]
text = noise_maker(text, 0.95)

checkpoint = "./kp=0.75,nl=2,th=0.95.ckpt"

model = build_graph(keep_probability, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction) 

with tf.Session() as sess:
    # Загрузка модели
    saver = tf.train.Saver()
    saver.restore(sess, checkpoint)
    
    # Умножение на batch_size, приведение предложения к соответствуюещей длине на вход модели 
    answer_logits = sess.run(model.predictions, {model.inputs: [text]*batch_size, 
                                                 model.inputs_length: [len(text)]*batch_size,
                                                 model.targets_length: [len(text)+1], 
                                                 model.keep_prob: [1.0]})[0]

# Удаление токенов PAD из сгенерированного текста моделью
pad = vocab_to_int["<PAD>"] 

print('\nВвод пользователя')
print('  {}'.format([i for i in text]))
print('  {}'.format("".join([int_to_vocab[i] for i in text])))

print('\nРезультат исправления')
print('  {}'.format([i for i in answer_logits if i != pad]))
print('  {}'.format("".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from ./kp=0.75,nl=2,th=0.95.ckpt

Ввод пользователя
  [51, 2, 3, 4, 2, 31, 20, 4, 3, 13, 15, 20, 15, 3, 4, 20, 4, 24, 23, 10, 10, 23, 13, 15, 36, 6, 7, 7, 25, 32, 4, 20, 18, 13, 31, 4, 7, 12, 6, 4, 28, 12, 13, 9, 2, 4, 20, 30, 20, 23, 10, 3, 31, 4, 12, 24, 4, 12, 22, 13, 7, 4, 10, 2, 3, 4, 31, 10, 20, 14, 28, 3, 34, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131]
  She hda engage a sitting-roo,m aknd our lunch awaited us upno the dtable.<EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS>

Результат исправления
  [51, 2, 3, 4, 2, 20, 31, 4, 3, 13, 15, 20, 15, 3, 4, 20, 4, 24, 23, 10, 10, 23, 13, 15, 36, 6, 7, 7, 32, 25, 4, 20, 13, 31, 4, 7, 12, 6, 4, 28, 12, 13, 9,

Примеры корректных исправлений орфографических ошибок:
- Spellin is difficult, whch is wyh you need to study everyday.
- Spelling is difficult, which is why you need to study everyday.


- The first days of her existence in th country were vrey hard for Dolly. 
- The first days of her existence in the country were very hard for Dolly.


- Thi is really something impressiv thaat we should look into right away! 
- This is really something impressive that we should look into right away!