<a href="https://colab.research.google.com/github/egorzhukov-it/medium_study/blob/main/RNN_study_zhe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Practice notebook for studying RNNs, following the article:

Medium: https://medium.com/@annikabrundyn1/the-beginners-guide-to-recurrent-neural-networks-and-text-generation-44a70c34067f

GitHub:
https://github.com/annikabrundyn/rnn_text_generation

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.utils import np_utils
from tensorflow.keras.callbacks import ModelCheckpoint  # same Keras implementation as the tf.keras model below
import requests

In [None]:
# fetch the text file from GitHub
!git clone "https://github.com/egorzhukov-it/medium_study.git"

# read the whole book and lowercase it to keep the vocabulary small
with open("medium_study/wonderland.txt") as f:
    text = f.read().lower()

Cloning into 'medium_study'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 15 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (15/15), done.


In [None]:
# build the character vocabulary

characters = sorted(set(text))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}

vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  42
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
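
The two dictionaries are inverse lookups: `char_to_n` encodes text as integers and `n_to_char` decodes it back. A minimal sketch of that round trip on a toy string (the `toy_*` names are made up for this illustration and are not part of the notebook):

```python
# Toy text standing in for the book; the notebook itself uses wonderland.txt.
toy_text = "hello world"

# Same construction as above: sorted unique characters, then two lookups.
toy_characters = sorted(set(toy_text))
toy_char_to_n = {char: n for n, char in enumerate(toy_characters)}
toy_n_to_char = {n: char for n, char in enumerate(toy_characters)}

# Encode to integers and decode back to text.
encoded = [toy_char_to_n[char] for char in toy_text]
decoded = "".join(toy_n_to_char[n] for n in encoded)

print(toy_characters)
print(encoded)
print(decoded)
```

Because the characters are sorted before numbering, the mapping is the same on every run, which matters when saved weights are reloaded later.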


In [None]:
X = []   # extracted sequences
Y = []   # targets: the character that follows each sequence in X
length = len(text)
seq_length = 100

for i in range(0, length - seq_length, 1):
    sequence = text[i:i + seq_length]
    label = text[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 143452
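
The loop slides a window of `seq_length` characters over the text one step at a time, so a text of length `n` yields `n - seq_length` (sequence, next character) pairs. A small sketch of the same idea on a short string, using a hypothetical helper `make_windows` and leaving out the integer encoding for readability:

```python
def make_windows(text, seq_length):
    """Return (sequence, next_char) pairs for every window position in text."""
    return [
        (text[i:i + seq_length], text[i + seq_length])
        for i in range(len(text) - seq_length)
    ]

# A 9-character text with windows of length 4 gives 9 - 4 = 5 pairs.
pairs = make_windows("alice was", seq_length=4)
for sequence, label in pairs:
    print(repr(sequence), "->", repr(label))
```

By the same arithmetic, the 143452 sequences reported above with `seq_length = 100` correspond to a text of 143552 characters.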


In [None]:
# reshape the input into a tensor of shape (batch_size, sequence_length, feature_size)

X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))   # scale the indices to [0, 1)
Y_modified = tf.keras.utils.to_categorical(Y, num_classes=vocab_size)   # one-hot targets
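
To make the two transformations concrete, here is a dependency-free sketch on a tiny made-up batch. Plain nested lists stand in for the NumPy arrays, and the `toy_*` names and the `one_hot` helper are illustrative only:

```python
toy_vocab_size = 4
toy_X = [[0, 1, 2], [1, 2, 3]]   # two encoded sequences of length 3
toy_Y = [3, 0]                   # index of the next character for each sequence

# Shape (batch_size, sequence_length, 1): one scaled feature per time step.
toy_X_modified = [[[n / float(toy_vocab_size)] for n in seq] for seq in toy_X]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

# Shape (batch_size, vocab_size): one row per target character.
toy_Y_modified = [one_hot(y, toy_vocab_size) for y in toy_Y]

print(toy_X_modified)
print(toy_Y_modified)
```

The inputs end up as a single number per time step, while the targets become probability-style vectors that match the softmax output of the network.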

In [None]:
# network architecture

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(700, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True, dropout=0.2),
    tf.keras.layers.LSTM(700, return_sequences=True, dropout=0.2),
    tf.keras.layers.LSTM(700, dropout=0.2),
    tf.keras.layers.Dense(Y_modified.shape[1], activation=tf.nn.softmax),
])


In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 100, 700)          1965600   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 700)          3922800   
_________________________________________________________________
lstm_2 (LSTM)                (None, 700)               3922800   
_________________________________________________________________
dense (Dense)                (None, 42)                29442     
Total params: 9,840,642
Trainable params: 9,840,642
Non-trainable params: 0
_________________________________________________________________
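
The parameter counts in the summary can be checked by hand. An LSTM layer has four gate blocks, each with input weights, recurrent weights, and a bias, which gives `4 * (units * (input_dim + units) + units)` parameters. The helper functions below are written only for this check:

```python
def lstm_params(input_dim, units):
    # four gate blocks, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input-output pair plus one bias per output
    return input_dim * units + units

layer_params = [
    lstm_params(1, 700),     # first LSTM: one feature per time step
    lstm_params(700, 700),   # second LSTM: fed by 700 units
    lstm_params(700, 700),   # third LSTM
    dense_params(700, 42),   # softmax layer over 42 characters
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

The values agree with the `Param #` column and with the total of 9,840,642.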


In [None]:
# # optionally resume from previously saved network weights
# filename = "medium_study/baseline-improvement-06-0.9927.hdf5"
# model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# define how model checkpoints are saved
filepath = "model_weights/gigantic-improvement-ctd20-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
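
The `{epoch:02d}` and `{loss:.4f}` placeholders in `filepath` are ordinary Python format fields, which the callback fills in with the epoch number and the logged loss. A quick sketch with plain `str.format`, using an epoch number and loss value chosen only for illustration:

```python
filepath = "model_weights/gigantic-improvement-ctd20-{epoch:02d}-{loss:.4f}.hdf5"

# epoch is zero-padded to two digits, loss is rounded to four decimals
example_name = filepath.format(epoch=2, loss=2.70172)
print(example_name)
```

Since `save_best_only=True` monitors the training loss, a new file is written only for epochs in which the loss improves.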

In [None]:
model.fit(X_modified, Y_modified, epochs=10, batch_size=128, callbacks=callbacks_list)

Epoch 1/10

Epoch 00001: loss improved from inf to 2.94795, saving model to model_weights/gigantic-improvement-ctd20-01-2.9480.hdf5
Epoch 2/10

Epoch 00002: loss improved from 2.94795 to 2.70172, saving model to model_weights/gigantic-improvement-ctd20-02-2.7017.hdf5
Epoch 3/10

Epoch 00003: loss improved from 2.70172 to 2.48213, saving model to model_weights/gigantic-improvement-ctd20-03-2.4821.hdf5
Epoch 4/10

Epoch 00004: loss improved from 2.48213 to 2.32818, saving model to model_weights/gigantic-improvement-ctd20-04-2.3282.hdf5
Epoch 5/10

Epoch 00005: loss improved from 2.32818 to 2.20672, saving model to model_weights/gigantic-improvement-ctd20-05-2.2067.hdf5
Epoch 6/10

Epoch 00006: loss improved from 2.20672 to 2.11618, saving model to model_weights/gigantic-improvement-ctd20-06-2.1162.hdf5
Epoch 7/10

Epoch 00007: loss improved from 2.11618 to 2.03471, saving model to model_weights/gigantic-improvement-ctd20-07-2.0347.hdf5
Epoch 8/10

Epoch 00008: loss improved from 2.03471 

<tensorflow.python.keras.callbacks.History at 0x7f850c0436d8>
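
One way to read these loss values: categorical cross-entropy is the average negative natural-log probability assigned to the correct next character. A model that guessed uniformly over the 42 characters would score ln(42), and `exp(loss)` can be read as the effective number of characters the model is still choosing between. A rough check using the epoch 7 training loss from the log (variable names are illustrative):

```python
import math

n_classes = 42          # vocabulary size reported earlier
epoch7_loss = 2.03471   # training loss logged for epoch 7

uniform_loss = math.log(n_classes)   # loss of a uniform guess
perplexity = math.exp(epoch7_loss)   # effective number of candidate characters

print(round(uniform_loss, 4), round(perplexity, 2))
```

These are training-set figures measured with dropout active, so they only indicate the trend, not how well the model generalizes.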

In [None]:
start = 10   # index of the seed sequence in X (fixed here, not random)
string_mapped = list(X[start])
full_string = [n_to_char[value] for value in string_mapped]

# generate 400 characters, one at a time
for i in range(400):
    # same reshape and scaling as the training input
    x = np.reshape(string_mapped, (1, len(string_mapped), 1))
    x = x / float(len(characters))

    # greedy decoding: take the most likely next character
    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])

    # slide the window: append the prediction, drop the oldest character
    string_mapped.append(pred_index)
    string_mapped = string_mapped[1:]
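
The loop above always takes the most likely character (`np.argmax`), which tends to produce repetitive text. A common alternative, not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is one possible version; `sample_index` and its parameters are hypothetical names, and `probs` stands for the softmax vector the model returns for one input:

```python
import math
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature."""
    # Work in log space; the small constant guards against log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

toy_probs = [0.1, 0.7, 0.2]   # made-up distribution over three characters
rng = random.Random(0)

# A very low temperature behaves like argmax; higher values add variety.
cold_samples = [sample_index(toy_probs, temperature=0.01, rng=rng) for _ in range(20)]
warm_samples = [sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(20)]
print(cold_samples)
print(warm_samples)
```

In the generation loop this would replace the `np.argmax(...)` call, for example `pred_index = sample_index(model.predict(x, verbose=0)[0], temperature=0.5)`, where the temperature value is something to tune by eye.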

In [None]:
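# string_mapped holds only the final 100-character window;
# the seed plus all generated characters are kept in full_string
result = "".join(characters[i] for i in string_mapped)
result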

"sl eddk  ionliryend i tonele tonele ty ha \n'ionlhry,' sa dna tonesy toc'\n\n'iorl tonel ons teresesege"