# Character-level recurrent sequence-to-sequence model

https://keras.io/examples/nlp/lstm_seq2seq/

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2017/09/29<br>
**Last modified:** 2020/04/26<br>
**Description:** Character-level recurrent sequence-to-sequence model.

**Original:** A ten-minute introduction to sequence-to-sequence learning in Keras

* https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
* https://github.com/rstudio/keras/blob/main/vignettes/examples/lstm_seq2seq.py
* https://github.com/CosineP/cw-lstm/blob/master/lstm_seq2seq.py

## Introduction

This example demonstrates how to implement a basic character-level
recurrent sequence-to-sequence model. We apply it to translating
short English sentences into short French sentences,
character-by-character. Note that it is fairly unusual to
do character-level machine translation, as word-level
models are more common in this domain.

**Summary of the algorithm**

1. We start with input sequences from a domain (e.g. English sentences)
    and corresponding target sequences from another domain
    (e.g. French sentences).

2. An encoder LSTM turns each input sequence into two state vectors
    (we keep the final hidden and cell states and discard the outputs).

3. A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.

4. In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.
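
The one-step offset described in step 3 can be made concrete with a single target string. The sketch below is illustrative only: the sentence is made up, and `"\t"` and `"\n"` are the start and end markers this notebook adopts later on.

```python
# Illustrative only: a made-up target sentence wrapped in the start ("\t")
# and end ("\n") markers used later in this notebook.
target = "\t" + "Salut !" + "\n"

# Teacher forcing: at step t the decoder is fed the true character t
# and is trained to predict the true character t + 1.
decoder_input = target[:-1]   # begins with the start marker
decoder_target = target[1:]   # same text, shifted one step ahead

for fed, expected in zip(decoder_input, decoder_target):
    print(f"fed {fed!r:>6} -> predict {expected!r}")
```

The notebook's own vectorization differs slightly: it keeps the end marker in the decoder input and pads both arrays with spaces up to the maximum length instead of trimming.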

**Data download:**

* English to French sentence pairs.
    - https://www.manythings.org/anki/fra-eng.zip

* Lots of neat sentence pairs datasets can be found at:
    - https://www.manythings.org/anki/

## Setup


In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_10_chollet_lstm.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn-intelex
watermark

Overwriting setup/requirements_10_chollet_lstm.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
if IS_COLAB:
    from sklearnex import patch_sklearn
    patch_sklearn()

In [6]:
%%writefile setup/chp10_chollet_lstm_imports.py
import locale
import os
import pprint
import random
import warnings

import numpy as np
import seaborn as sns
from tensorflow import keras
from tqdm.auto import tqdm
from watermark import watermark

Overwriting setup/chp10_chollet_lstm_imports.py


In [7]:
!isort setup/chp10_chollet_lstm_imports.py --sl
!cat setup/chp10_chollet_lstm_imports.py

import locale
import os
import pprint
import random

import numpy as np
import seaborn as sns
from tensorflow import keras
from tqdm.auto import tqdm
from watermark import watermark


In [8]:
import locale
import os
import pprint
import random
import warnings

import numpy as np
import seaborn as sns
from tensorflow import keras
from tqdm.auto import tqdm
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
random.seed(23)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
seaborn: 0.12.1
numpy  : 1.23.5
keras  : 2.9.0



## Download the data


In [10]:
data_fra_eng_dir = 'data/data_fra_eng'
if not os.path.exists(data_fra_eng_dir):
    os.makedirs(data_fra_eng_dir)
        
data_fra_eng_file = 'fra-eng.zip'
data_fra_eng_src = f"{data_fra_eng_dir}/{data_fra_eng_file}"
data_fra_eng_path = f"{data_fra_eng_dir}/fra.txt"

print(data_fra_eng_dir)
print(data_fra_eng_file)
print(data_fra_eng_src)
print(data_fra_eng_path)
HR()

!wget -nc -O {data_fra_eng_src} "https://www.manythings.org/anki/fra-eng.zip"
!ls -l {data_fra_eng_dir}

data/data_fra_eng
fra-eng.zip
data/data_fra_eng/fra-eng.zip
data/data_fra_eng/fra.txt
----------------------------------------
File ‘data/data_fra_eng/fra-eng.zip’ already there; not retrieving.
total 75608
-rw-r--r--  1 gb  staff      1441 Feb 20 22:16 _about.txt
-rw-r--r--  1 gb  staff   7155035 Feb 21 08:53 fra-eng.zip
-rw-r--r--  1 gb  staff  31547877 Feb 20 22:16 fra.txt


In [11]:
print(data_fra_eng_src)
print(data_fra_eng_path)

if not os.path.exists(data_fra_eng_path):
    !unzip {data_fra_eng_src} -d {data_fra_eng_dir}

data/data_fra_eng/fra-eng.zip
data/data_fra_eng/fra.txt


In [12]:
model_fra_eng_dir = 'models/model_fra_eng'
if not os.path.exists(model_fra_eng_dir):
    os.makedirs(model_fra_eng_dir)

## Configuration


In [13]:
batch_size = 64  # Batch size for training.
# epochs = 100  # Value used in the original example.
epochs = 1  # Number of epochs to train for (reduced here for a quick run).
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10_000  # Number of samples to train on.

## Prepare the data


In [14]:
# Read the sentence pairs and build the character vocabularies.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

with open(data_fra_eng_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Grow the character vocabularies (sets ignore duplicates).
    input_characters.update(input_text)
    target_characters.update(target_text)

In [15]:
line[:50]

'Did I wake you?\tVous ai-je réveillés ?\tCC-BY 2.0 ('
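
As the output above shows, each line of `fra.txt` holds three tab-separated fields: the English sentence, the French sentence, and an attribution string. The three-way unpacking in the loading loop raises `ValueError` on any line with a different number of fields. A more tolerant parse might look like the sketch below, where `parse_pair` is a hypothetical helper (not part of the original example) and the attribution text is a placeholder.

```python
def parse_pair(line):
    """Return (english, french) from a tab-separated line, or None.

    Hypothetical helper, not part of the original example.
    """
    fields = line.split("\t")
    if len(fields) < 2:
        return None  # blank or malformed line
    return fields[0], fields[1]


# The attribution text here is a placeholder, not real dataset content.
sample = "Did I wake you?\tVous ai-je réveillés ?\tattribution placeholder"
print(parse_pair(sample))
print(parse_pair(""))
```

Keeping only the first two fields means the parse does not depend on whether the attribution column is present.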

In [16]:
input_characters = sorted(input_characters)
target_characters = sorted(target_characters)
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts)

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
HR()

print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)
HR()

input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

print(type(encoder_input_data))
print(type(decoder_input_data))
print(type(decoder_target_data))

Number of samples: 10000
Number of unique input tokens: 71
Number of unique output tokens: 93
----------------------------------------
Max sequence length for inputs: 15
Max sequence length for outputs: 59
----------------------------------------
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [17]:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    # Pad the remaining encoder timesteps with the space character.
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_input_data holds the full target, start character included.
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data is ahead by one timestep
            # and does not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    # Pad the remaining decoder timesteps with the space character.
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0
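
To see what this loop produces, the same one-hot and padding scheme can be applied to a tiny vocabulary. Everything in this sketch (the four-character vocabulary, `max_len`, and the function name `one_hot_padded`) is made up for illustration.

```python
import numpy as np

# Toy vocabulary and length limit, made up for illustration.
chars = sorted(set(" abc"))
index = {ch: i for i, ch in enumerate(chars)}
max_len = 5


def one_hot_padded(text):
    """One-hot encode `text`, padding the remaining timesteps with spaces."""
    out = np.zeros((max_len, len(chars)), dtype="float32")
    for t, ch in enumerate(text):
        out[t, index[ch]] = 1.0
    out[len(text):, index[" "]] = 1.0  # same padding rule as the loop above
    return out


encoded = one_hot_padded("cab")
print(encoded.shape)           # (5, 4)
print(encoded.argmax(axis=1))  # [3 1 2 0 0]
```

Each row contains exactly one `1.0`, and the rows past the end of the text all select the space character.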

## Build the model


In [18]:
# Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens), name="Encoder_Input")
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens), name="Decoder_Input")

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True, name="Decoder_LSTM")

decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax", name="DecoderOutput")
decoder_outputs = decoder_dense(decoder_outputs)

HR()

print(f"encoder_inputs:\n\t{encoder_inputs}")
HR()
print(f"decoder_inputs:\n\t{decoder_inputs}")
HR()
print(f"decoder_outputs:\n\t{decoder_outputs}")

----------------------------------------
encoder_inputs:
	KerasTensor(type_spec=TensorSpec(shape=(None, None, 71), dtype=tf.float32, name='Encoder_Input'), name='Encoder_Input', description="created by layer 'Encoder_Input'")
----------------------------------------
decoder_inputs:
	KerasTensor(type_spec=TensorSpec(shape=(None, None, 93), dtype=tf.float32, name='Decoder_Input'), name='Decoder_Input', description="created by layer 'Decoder_Input'")
----------------------------------------
decoder_outputs:
	KerasTensor(type_spec=TensorSpec(shape=(None, None, 93), dtype=tf.float32, name=None), name='DecoderOutput/Softmax:0', description="created by layer 'DecoderOutput'")


## Train the model


In [19]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(
    optimizer="rmsprop", 
    loss="categorical_crossentropy", 
    metrics=["accuracy"]
)

model.summary()

hist = model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    verbose=2
)

# Save model
model.save(f"{model_fra_eng_dir}/s2s")

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Encoder_Input (InputLayer)     [(None, None, 71)]   0           []                               
                                                                                                  
 Decoder_Input (InputLayer)     [(None, None, 93)]   0           []                               
                                                                                                  
 lstm (LSTM)                    [(None, 256),        335872      ['Encoder_Input[0][0]']          
                                 (None, 256),                                                     
                                 (None, 256)]                                                     
                                                                                              



INFO:tensorflow:Assets written to: models/model_fra_eng/s2s/assets



**Note**

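The `validation_split=0.2` argument only holds out 20% of the training samples for validation. There is no separate test dataset here, so we don't run `model.evaluate()`.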
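
The `335872` parameters reported for the encoder `lstm` layer in the model summary can be reproduced by hand. The sketch below assumes a standard LSTM layer with bias terms; the function name `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases.

    The layer holds four weight sets (input, forget, and output gates
    plus the cell candidate). Each set has an input kernel
    (input_dim x units), a recurrent kernel (units x units),
    and a bias (units).
    """
    return 4 * (input_dim * units + units * units + units)


latent_dim = 256
print(lstm_param_count(71, latent_dim))  # encoder, 71 input tokens: 335872
print(lstm_param_count(93, latent_dim))  # decoder, 93 output tokens
```

Because the input dimension only enters through the input kernel, the decoder LSTM is slightly larger than the encoder purely because of the bigger French character set.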

## Run inference (sampling)

1. Encode the input and retrieve the initial decoder state.
2. Run one step of the decoder with this initial state and a "start of sequence" token as the target. The output is the next target token.
3. Repeat with the current target token and the current states.

In [20]:
print(model.layers[0].output)
print(model.layers[1].output)
HR()
print(model.layers[2].output)  # encoder LSTM (auto-named 'lstm')
HR()
print(model.layers[3].output)
print(model.layers[4].output)

KerasTensor(type_spec=TensorSpec(shape=(None, None, 71), dtype=tf.float32, name='Encoder_Input'), name='Encoder_Input', description="created by layer 'Encoder_Input'")
KerasTensor(type_spec=TensorSpec(shape=(None, None, 93), dtype=tf.float32, name='Decoder_Input'), name='Decoder_Input', description="created by layer 'Decoder_Input'")
----------------------------------------
[<KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>, <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>, <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>]
----------------------------------------
[<KerasTensor: shape=(None, None, 256) dtype=float32 (created by layer 'Decoder_LSTM')>, <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'Decoder_LSTM')>, <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'Decoder_LSTM')>]
KerasTensor(type_spec=TensorSpec(shape=(None, None, 93), dtype=tf.float32, name=None), name='DecoderOutput

In [21]:
# Define sampling models
# Restore the model and construct the encoder and decoder.
model = keras.models.load_model(f"{model_fra_eng_dir}/s2s")

In [22]:
test = model.layers[2].output
test

[<KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>,
 <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>,
 <KerasTensor: shape=(None, 256) dtype=float32 (created by layer 'lstm')>]

In [23]:
encoder_inputs = model.input[0]  # Encoder_Input
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # encoder LSTM ('lstm')
encoder_states = [state_h_enc, state_c_enc]

encoder_model = keras.Model(encoder_inputs, encoder_states)

In [24]:
decoder_inputs = model.input[1]  # Decoder_Input
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)

decoder_states = [state_h_dec, state_c_dec]

decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)

# The decoder model takes the one-hot input plus the two state tensors and
# returns the softmax output plus the updated states. Each side is built by
# concatenating lists, e.g. [decoder_inputs] + [h, c].
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = {i: char for char, i in input_token_index.items()}
reverse_target_char_index = {i: char for char, i in target_token_index.items()}

In [25]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(
        input_seq,
        verbose=0
    )

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value,
            verbose=0
        )

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence
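
`decode_sequence` expects a one-hot array with a leading batch dimension of one. To decode a sentence that is not already a row of `encoder_input_data`, it would first need to be encoded the same way. The following is a minimal sketch, assuming a hypothetical helper `encode_sentence`; the small `token_index` and `max_len` stand in for the notebook's `input_token_index` and `max_encoder_seq_length`.

```python
import numpy as np

# Stand-ins for the notebook's `input_token_index` and
# `max_encoder_seq_length`; the values here are made up.
token_index = {" ": 0, ".": 1, "G": 2, "o": 3}
max_len = 6


def encode_sentence(text, token_index, max_len):
    """Hypothetical helper: one-hot encode `text` as a batch of one."""
    unknown = sorted(set(text) - set(token_index))
    if unknown:
        raise ValueError(f"characters not seen in training: {unknown}")
    if len(text) > max_len:
        raise ValueError(f"sentence longer than {max_len} characters")
    seq = np.zeros((1, max_len, len(token_index)), dtype="float32")
    for t, ch in enumerate(text):
        seq[0, t, token_index[ch]] = 1.0
    seq[0, len(text):, token_index[" "]] = 1.0  # pad with spaces, as in training
    return seq


input_seq = encode_sentence("Go.", token_index, max_len)
print(input_seq.shape)  # (1, 6, 4)
```

In the notebook this would be called as `decode_sequence(encode_sentence("Go.", input_token_index, max_encoder_seq_length))`. Rejecting unseen characters and over-long sentences is a design choice of this sketch: the one-hot vocabulary has no slot for unknown characters, and padding to the training length keeps the input similar to what the encoder saw during training.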

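You can now decode sentences as shown below. With `epochs = 1` the model is barely trained, so the decoded output is mostly padding spaces; raising `epochs` (the commented-out setting is 100) is expected to give more sensible translations.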


In [26]:
# for seq_index in range(50):
for seq_index in range(10):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
    HR()

Input sentence: Go.
Decoded sentence: Je s                                                        
----------------------------------------
Input sentence: Go.
Decoded sentence: Je s                                                        
----------------------------------------
Input sentence: Go.
Decoded sentence: Je s                                                        
----------------------------------------
Input sentence: Go.
Decoded sentence: Je s                                                        
----------------------------------------
Input sentence: Hi.
Decoded sentence: Je                                                          
----------------------------------------
Input sentence: Hi.
Decoded sentence: Je                                                          
----------------------------------------
Input sentence: Run!
Decoded sentence: Jess                                                        
----------------------------------------
Input sentence: Run