# Text classification with an RNN

References: https://www.tensorflow.org/tutorials/text/text_classification_rnn

__Dataset__:

* The IMDB large movie review dataset is a binary classification dataset.

```python
import tensorflow_datasets as tfds
import tensorflow as tf

BATCH_SIZE = 64

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
VOCAB_SIZE = info.features['text'].encoder.vocab_size  # used by the Embedding layers below
train_dataset = dataset['train'].shuffle(10000).padded_batch(BATCH_SIZE)
test_dataset = dataset['test'].padded_batch(BATCH_SIZE)
```

__Model__:

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
```

* Shapes:
    * inputs: (batch_size, seq_length)
    * after Embedding(): (batch_size, seq_length, 64)
    * after Bidirectional(): (batch_size, 128)
    * after Dense(): (batch_size, 64)
    * after Dense(): (batch_size, 1)
    
```python
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10, validation_data=test_dataset, validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)
```

* The output of `model.predict()` is a logit and is not restricted to the range between 0 and 1, since the final dense layer of the model does not use an activation function such as `sigmoid`.

* If `model.predict()` > 0, the predicted label is 1; otherwise it is 0.
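
The thresholding rule follows from the sigmoid being monotonic with sigmoid(0) = 0.5. A minimal NumPy sketch, assuming made-up logit values rather than real outputs of the model:

```python
import numpy as np

def sigmoid(z):
    """Map raw logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits standing in for model.predict() outputs.
logits = np.array([-2.3, -0.1, 0.0, 0.4, 3.1])

probs = sigmoid(logits)
labels_from_logits = (logits > 0).astype(int)   # threshold the logit at 0
labels_from_probs = (probs > 0.5).astype(int)   # threshold the probability at 0.5
```

Both thresholds give the same labels, so applying a sigmoid is only needed when a probability is wanted.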

__Model__ (using two LSTM layers):

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
```

* Shapes:
    * inputs: (batch_size, seq_length)
    * after Embedding: (batch_size, seq_length, 64)
    * after Bidirectional: (batch_size, seq_length, 128)
    * after Bidirectional: (batch_size, 64)
    * after Dense: (batch_size, 64)
    * after Dense: (batch_size, 1)
    
* The first LSTM layer uses `return_sequences=True` so that its output preserves the axis of timesteps and the second LSTM layer can be used.

* We can stack multiple RNN layers in such a way.
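
The effect of `return_sequences` on shapes can be sketched with a toy tanh RNN in NumPy. This is an illustration only: `simple_rnn` and its random weights are hypothetical stand-ins, not the Keras LSTM.

```python
import numpy as np

def simple_rnn(x, w_x, w_h, return_sequences=False):
    """Toy tanh RNN. x has shape (batch, timesteps, features)."""
    batch, timesteps, _ = x.shape
    h = np.zeros((batch, w_h.shape[0]))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # (batch, timesteps, units)
    return h                               # (batch, units): last timestep only

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))
w_x1, w_h1 = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
w_x2, w_h2 = rng.normal(size=(4, 6)), rng.normal(size=(6, 6))

seq = simple_rnn(x, w_x1, w_h1, return_sequences=True)   # (2, 5, 4)
last = simple_rnn(seq, w_x2, w_h2)                       # (2, 6)
```

The second layer needs a 3-D input, which is why the first layer must keep the timestep axis.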

# Text generation with an RNN

References: https://www.tensorflow.org/tutorials/text/text_generation

```python
import os

import numpy as np
import tensorflow as tf
```

__Dataset__:

```python
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

text = open(path_to_file, 'rb').read().decode(encoding='utf-8')    # str

vocab = sorted(set(text))
vocab_size = len(vocab)     # 65

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])             # entries are integers ranging from 0 to 64.


seq_length = 100
BATCH_SIZE = 64

dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
dataset = dataset.batch(seq_length+1, drop_remainder=True)
dataset = dataset.map(lambda chunk: (chunk[:-1], chunk[1:]))
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
```

* The shape of each batch is ((64, 100), (64, 100)).
* Entries of a batch are integers ranging from 0 to vocab_size-1.
* Note that we did not use `dataset.shuffle()` before batching the dataset.
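
How the chunking and the one-character shift produce (input, target) pairs can be sketched on a toy string. The text `"hello world"` and `seq_length = 4` are assumptions chosen for illustration; the reshape mimics `batch(seq_length+1, drop_remainder=True)`.

```python
import numpy as np

text = "hello world"   # toy text standing in for the Shakespeare corpus
vocab = sorted(set(text))
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

seq_length = 4
# Keep only full chunks of seq_length + 1 characters (drop the remainder).
n_chunks = len(text_as_int) // (seq_length + 1)
chunks = text_as_int[: n_chunks * (seq_length + 1)].reshape(n_chunks, seq_length + 1)

# The target is the input shifted by one character.
inputs, targets = chunks[:, :-1], chunks[:, 1:]
pairs = [("".join(idx2char[i]), "".join(idx2char[t])) for i, t in zip(inputs, targets)]
# pairs == [('hell', 'ello'), (' wor', 'worl')]
```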


__Model__:

```python
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])

embedding_dim = 256
rnn_units = 1024

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

history = model.fit(dataset, epochs=10, callbacks=[checkpoint_callback], shuffle=False)
```


* Shapes through the model:
    * inputs: (batch_size, x)
    * after Embedding: (batch_size, x, embedding_dim)
    * after GRU: (batch_size, x, rnn_units)
    * after Dense: (batch_size, x, vocab_size)
    * Here x is set to `seq_length` during training, but it can be set to any number for a test.
* Roughly, if the input is a sequence of m character ids, then `model(input)` returns m vectors of logits, one next-character prediction for each input position.

* Note that `batch_input_shape` is set as `[batch_size, None]`. Moreover, `drop_remainder=True` was used in building the dataset.

* `batch_size` is fixed, but `seq_length` can be variable. We can rebuild the trained model later and set a different value for `batch_size`.

* Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

* `model.layers[1].states[0].shape` is (batch_size, rnn_units).

* Note also `stateful=True` in `GRU`. The following is from the source of recurrent.py:

Note on using statefulness in RNNs:
    You can set RNN layers to be 'stateful', which means that the states
    computed for the samples in one batch will be reused as initial states
    for the samples in the next batch. This assumes a one-to-one mapping
    between samples in different successive batches.
    To enable statefulness:
      - Specify `stateful=True` in the layer constructor.
      - Specify a fixed batch size for your model, by passing
        If sequential model:
          `batch_input_shape=(...)` to the first layer in your model.
        Else for functional model with 1 or more Input layers:
          `batch_shape=(...)` to all the first layers in your model.
        This is the expected shape of your inputs
        *including the batch size*.
        It should be a tuple of integers, e.g. `(32, 10, 100)`.
      - Specify `shuffle=False` when calling fit().
    To reset the states of your model, call `.reset_states()` on either
    a specific layer, or on your entire model.
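
What statefulness buys can be sketched with a toy tanh RNN in NumPy: carrying the state from one chunk to the next reproduces the result of processing the whole sequence at once. `run_rnn` and the random weights are hypothetical stand-ins, not the Keras GRU.

```python
import numpy as np

def run_rnn(x, w_x, w_h, h0):
    """Toy tanh RNN over x of shape (batch, timesteps, features); returns the final state."""
    h = h0
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h)
    return h

rng = np.random.default_rng(0)
batch, timesteps, features, units = 2, 6, 3, 4
x = rng.normal(size=(batch, timesteps, features))
w_x = rng.normal(size=(features, units))
w_h = rng.normal(size=(units, units))
zeros = np.zeros((batch, units))

# Whole sequence in one pass.
full = run_rnn(x, w_x, w_h, zeros)

# "Stateful": the second chunk starts from the state left by the first chunk.
state = run_rnn(x[:, :3], w_x, w_h, zeros)
stateful = run_rnn(x[:, 3:], w_x, w_h, state)

# "Stateless": the second chunk starts from zeros again.
stateless = run_rnn(x[:, 3:], w_x, w_h, zeros)
```

Here `stateful` matches `full` while `stateless` does not. Sample i of one batch must continue sample i of the previous batch, which is why the batch size is fixed and shuffling is disabled.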
    


__Rebuild the model__ (batch_size=1):

```python
model1 = build_model(vocab_size, embedding_dim, rnn_units, 1)
model1.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model1.build(tf.TensorShape([1,None]))
```

__Generate text__:

```python
start_string = u"ROMEO: "
num_generate = 1000

input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)

text_generated = []
model1.reset_states()
for i in range(num_generate):
    predictions = model1(input_eval)                                 # (1, len(input_eval), vocab_size)
    predictions = tf.squeeze(predictions, 0)                         # (len(input_eval), vocab_size)
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
    input_eval = tf.expand_dims([predicted_id], 0)
    text_generated.append(idx2char[predicted_id])

result = start_string + ''.join(text_generated)
```

* When i=0, the state of the RNN layer is updated by using start_string.
* When i>0, `input_eval` has shape (1, 1): only the last predicted character is fed to the model.
* The state of the RNN layer is updated at each iteration.
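
`tf.random.categorical` treats each row as unnormalized log-probabilities and samples an index from the corresponding distribution, rather than taking the argmax. A minimal NumPy sketch of the same idea, assuming made-up logits for a 4-character vocabulary (`sample_from_logits` is a hypothetical helper):

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index from the categorical distribution softmax(logits)."""
    shifted = logits - np.max(logits)                     # for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)

# Hypothetical logits for a 4-character vocabulary.
logits = np.array([2.0, 1.0, 0.1, -1.0])
samples = [sample_from_logits(logits, rng) for _ in range(2000)]
counts = np.bincount(samples, minlength=4)   # index 0 is drawn most often, but not always

# A strongly peaked distribution behaves almost like argmax.
peaked = np.array([0.0, 50.0, 0.0, 0.0])
peaked_samples = [sample_from_logits(peaked, rng) for _ in range(100)]
```

Sampling keeps the generated text from looping on the single most likely character.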


__Customized training__:

```python
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(inp, target):
    with tf.GradientTape() as tape:
        predictions = model(inp)
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(target, predictions, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

EPOCHS = 10
for epoch in range(EPOCHS):
    model.reset_states()
    for (batch_n, (inp, target)) in enumerate(dataset):
        loss = train_step(inp, target)
        if batch_n % 100 == 0:
            print('Epoch {} Batch {} Loss {}'.format(epoch+1, batch_n, loss))
    if (epoch+1) % 5 == 0:
        model.save_weights(checkpoint_prefix.format(epoch=epoch))
    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))

model.save_weights(checkpoint_prefix.format(epoch=epoch))
```

* `sparse_categorical_crossentropy()` returns a tensor of shape (batch_size, seq_length).
* `tf.reduce_mean(x)` is a tensor having the value `x.numpy().flatten().mean()`.
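
The two shape remarks can be checked with a NumPy re-implementation of the loss. This is a sketch under the assumption of random logits and targets; `sparse_categorical_crossentropy_from_logits` is a hypothetical helper, not the Keras function.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(targets, logits):
    """targets: (batch, seq) ints; logits: (batch, seq, vocab). Returns (batch, seq)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target class at every position.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return -picked.squeeze(-1)

rng = np.random.default_rng(0)
batch_size, seq_length, vocab_size = 2, 3, 5
logits = rng.normal(size=(batch_size, seq_length, vocab_size))
targets = rng.integers(0, vocab_size, size=(batch_size, seq_length))

per_position = sparse_categorical_crossentropy_from_logits(targets, logits)  # (2, 3)
loss = per_position.mean()                                                   # scalar

# With all-zero logits every class is equally likely, so the loss is log(vocab_size).
uniform = sparse_categorical_crossentropy_from_logits(targets, np.zeros_like(logits))
```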

# Translation with Attention

References: https://www.tensorflow.org/tutorials/text/nmt_with_attention

Language translation from Spanish to English.

```python
import tensorflow as tf

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
```

__Dataset__:

```python
path_to_zip = tf.keras.utils.get_file('spa-eng.zip',
                                      origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
                                      extract=True)
path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

lines = io.open(path_to_file, encoding='UTF-8').read().strip().split('\n')
```

* The format of each line is 'eng_expression\tspa_expression'. For example, `lines[0]` is 'Go.\tVe.'.

```python
def preprocess_sentence(w):
    w = w.lower().strip() 
    w = ''.join(c for c in unicodedata.normalize('NFD', w) if unicodedata.category(c) != 'Mn') 
    w = re.sub(r"[^\w?.!,¿]+", " ", w)
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'\s+', " ", w).strip()
    w = '<start> ' + w + ' <end>'
    return w

num_examples = None

word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
en_texts, sp_texts = zip(*word_pairs)
```

* `NFD` stands for Normalization Form D (canonical decomposition).
* `Mn` is the Unicode general category "Mark, nonspacing" (for example, combining accents).
* `en_texts` and `sp_texts` are tuples of strings.
* `en_texts[0]` is `'<start> go . <end>'` and `sp_texts[0]` is `'<start> ve . <end>'`.
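
The accent-stripping step can be tried in isolation. `strip_accents` is a hypothetical helper that isolates the second line of `preprocess_sentence`; the sample sentence is made up.

```python
import unicodedata

def strip_accents(s):
    """Decompose characters (NFD) and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

example = strip_accents('¿Cómo estás?')   # '¿Como estas?'
```

NFD splits `ó` into `o` followed by a combining acute accent (U+0301), and the filter removes the accent.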

```python
def tokenize(texts):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)  # a list of lists of positive integers 
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post') # a 2d NumPy array
    return sequences, tokenizer

inp_sequences, inp_tokenizer = tokenize(sp_texts)
targ_sequences, targ_tokenizer = tokenize(en_texts)

inp_train, inp_val, targ_train, targ_val = train_test_split(inp_sequences, targ_sequences, test_size=0.2)

```
* `inp_sequences` is a numpy array of shape (118964, 53) consisting of integers.
* `targ_sequences` is a numpy array of shape (118964, 51) consisting of integers.
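
The second dimension of each array is the length of the longest tokenized sentence, because `padding='post'` appends zeros to shorter ones. A pure-Python sketch of that behavior, with made-up token ids (`pad_post` is a hypothetical helper, not the Keras function):

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [value] * (max_len - len(s)) for s in sequences]

# Hypothetical token ids; 1 and 2 stand for <start> and <end>.
padded = pad_post([[1, 5, 2], [1, 7, 8, 9, 2], [1, 2]])
# padded == [[1, 5, 2, 0, 0], [1, 7, 8, 9, 2], [1, 2, 0, 0, 0]]
```

The id 0 is never assigned to a word by the tokenizer, so it can serve as the padding marker later on.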

```python
BUFFER_SIZE = 10000
BATCH_SIZE = 64

dataset = tf.data.Dataset.from_tensor_slices((inp_train, targ_train))\
.shuffle(BUFFER_SIZE)\
.batch(BATCH_SIZE, drop_remainder=True)
```

__Model__: 

See `Learn_Models/Models_ANNs.ipynb` for more information on the encoder-decoder model with attention.

```python
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units, 
                                       return_sequences=True, 
                                       return_state=True, 
                                       recurrent_initializer='glorot_uniform')
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
    

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, query, values):
        score = self.V(tf.nn.tanh(self.W1(tf.expand_dims(query,1)) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
    
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units, 
                                       return_sequences=True, 
                                       return_state=True, 
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.dec_units)
        
    def call(self, x, hidden, enc_output):
        context_vector, attention_weights = self.attention(hidden, enc_output)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        output = self.fc(output)
        return output, state, attention_weights  
```
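
The shapes inside `BahdanauAttention.call` can be traced with a NumPy sketch. Assumptions: the weight matrices `W1`, `W2`, `v` are random stand-ins for the three `Dense` layers, and the bias terms of those layers are omitted.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(query, values, W1, W2, v):
    """query: (batch, dim); values: (batch, T, dim)."""
    # (batch, 1, units) + (batch, T, units) broadcasts to (batch, T, units).
    hidden = np.tanh(query[:, None, :] @ W1 + values @ W2)
    score = hidden @ v                          # (batch, T, 1)
    weights = softmax(score, axis=1)            # (batch, T, 1), sums to 1 over T
    context = (weights * values).sum(axis=1)    # (batch, dim)
    return context, weights

rng = np.random.default_rng(0)
batch, T, dim, units = 2, 5, 4, 3
query = rng.normal(size=(batch, dim))        # decoder hidden state
values = rng.normal(size=(batch, T, dim))    # encoder outputs
W1 = rng.normal(size=(dim, units))
W2 = rng.normal(size=(dim, units))
v = rng.normal(size=(units, 1))

context, weights = additive_attention(query, values, W1, W2, v)
```

The context vector has the same last dimension as `values`, since it is a weighted average of the encoder outputs.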

```python
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_tokenizer.word_index)+1
vocab_targ_size = len(targ_tokenizer.word_index)+1
seq_length_inp = inp_sequences.shape[1]
seq_length_targ = targ_sequences.shape[1]

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_targ_size, embedding_dim, units, BATCH_SIZE)
```

__Optimizer and Loss function__:

```python
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    loss_ = loss_object(real, pred)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)

```

* `reduction`: Default value is `AUTO`. `AUTO` indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to `SUM_OVER_BATCH_SIZE`.

* We set `reduction` to `none`, since the target vector was padded with 0s so that we need to mask the padding before we compute the loss.
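
The masking step can be sketched in NumPy with made-up numbers. The last line shows an alternative normalization that is not used in these notes: `loss_function` averages over all positions, padded ones included.

```python
import numpy as np

# Hypothetical per-position losses, as returned with reduction='none'.
per_position_loss = np.array([0.5, 1.5, 2.0, 3.0])
# Target token ids for the same positions; 0 is the padding id.
real = np.array([7, 0, 12, 0])

mask = (real != 0).astype(per_position_loss.dtype)   # [1, 0, 1, 0]
masked = per_position_loss * mask                    # [0.5, 0, 2.0, 0]

loss_as_in_notes = masked.mean()                     # 0.625: divides by all 4 positions
loss_over_real_tokens = masked.sum() / mask.sum()    # 1.25: divides by the 2 real tokens
```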


__Training__:

```python
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
        
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:,t], predictions)
            dec_input = tf.expand_dims(targ[:,t], 1)

    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss


EPOCHS = 10
steps_per_epoch = len(inp_train)//BATCH_SIZE

for epoch in range(EPOCHS):
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch_index, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        
        if batch_index % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch_index, batch_loss.numpy()))
            
    if (epoch+1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
```

* At the start of each epoch, `enc_hidden` is set to `encoder.initialize_hidden_state()`, which is simply a zero tensor of shape (batch_size, enc_units). The encoder state computed inside `train_step` is never written back to it, so the `enc_hidden` passed to `train_step(inp, targ, enc_hidden)` is always a zero tensor. We could therefore place `enc_hidden = encoder.initialize_hidden_state()` before `for epoch in range(EPOCHS):`.

# Image Captioning

References: https://www.tensorflow.org/tutorials/text/image_captioning

__Dataset__:

* `img_name_vector` is a list of image path names. For example, 
```
img_name_vector[0]
: '/content/train2014/COCO_train2014_000000324909.jpg'
```
* `train_captions` is a list of image captions. For example,
```
train_captions[0]
: '<start> A skateboarder performing a trick on a skateboard ramp. <end>'
```


__Model extracting features__:

```python
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input, image_model.layers[-1].output)
```
* input.shape: (batch_size, 299, 299, 3)
* output.shape: (batch_size, 8, 8, 2048)


__Cache image features to disk__:

```python
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299,299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

# Deduplicate first: some image names appear more than once.
image_dataset = tf.data.Dataset.from_tensor_slices(sorted(set(img_name_vector)))\
    .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)\
    .batch(16)

for batch_img, batch_path in image_dataset:
    batch_features = image_features_extract_model(batch_img)
    batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, batch_path):
        np.save(p.numpy().decode("utf-8"), bf.numpy())
```

__Preprocess and tokenize captions__:

```python
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k, 
                                                  oov_token="<unk>", 
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

train_seqs = tokenizer.texts_to_sequences(train_captions)
max_length = max(len(t) for t in train_seqs)                # length of the longest caption

cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')  # every row is padded to max_length
```

__Split the data into training and testing__:

```python
img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,
                                                                    cap_vector,
                                                                    test_size=0.2,
                                                                    random_state=0)
```                                                                    

__Create a tf.data dataset for training__:

```python
def map_func(img_name, cap):
    return np.load(img_name.decode('utf-8')+'.npy'), cap
    
BATCH_SIZE = 64
BUFFER_SIZE = 1000

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))\
    .map(lambda img_name, cap: tf.numpy_function(map_func, [img_name, cap], [tf.float32, tf.int32]), 
         num_parallel_calls=tf.data.experimental.AUTOTUNE)\
    .shuffle(BUFFER_SIZE)\
    .batch(BATCH_SIZE)\
    .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
```

* Each batch has shape ((BATCH_SIZE, 64, 2048), (BATCH_SIZE, max_length))
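
The 64 comes from flattening the 8x8 spatial grid of the InceptionV3 feature map. A NumPy sketch of that reshape, using an `arange` array as a stand-in for real features:

```python
import numpy as np

batch_size = 2
# Stand-in for a feature map of shape (batch_size, 8, 8, 2048).
feature_map = np.arange(batch_size * 8 * 8 * 2048, dtype=np.float32).reshape(batch_size, 8, 8, 2048)

# Merge the two spatial axes; the channel axis is left untouched.
flat = feature_map.reshape(batch_size, -1, feature_map.shape[3])   # (2, 64, 2048)

# Grid cell (row 3, column 5) becomes feature number 3 * 8 + 5 = 29.
same = np.array_equal(flat[0, 3 * 8 + 5], feature_map[0, 3, 5])
```

Each of the 64 features is therefore the 2048-dimensional descriptor of one image region.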


__Model__:

* Encoder:
    * inputs: (batch_size, 64, 2048); 64 features; each feature is a 2048-dimensional vector. 
    * outputs: (batch_size, 64, 256); each feature is encoded to a 256-dimensional vector.
    * 64 and 256 play the roles of `enc_seq_length` and `enc_units` in the encoder used in Translation with Attention.
    * It does not have an embedding layer or a GRU layer.
    
```python
class CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))
```
    
* Decoder:
    * The decoder is similar to the one used in Translation with Attention.
    * inputs: (batch_size, 1)
    * outputs: (batch_size, vocab_size)
    
* In Translation with Attention, enc_units is equal to dec_units. But in this example, enc_units is 256 (embedding_dim) and dec_units is 512 (units).

* In Translation with Attention, the decoder's hidden state is initialized with the encoder's hidden state at its last timestep. In this example, however, the decoder's hidden state is simply initialized to a zero tensor of shape (batch_size, units). In language translation, it is meaningful to use the encoder's state at the last timestep. Here the encoder processes images: its 64 features correspond to image regions rather than timesteps, so there is no "last timestep" whose state could be passed on.

```python
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.units)
    
    def call(self, x, features, hidden):
        # x.shape: (batch_size, 1)
        # features.shape: (batch_size, 64, embedding_dim)
        # hidden.shape: (batch_size, units)
        
        context_vector, attention_weights = self.attention(hidden, features)
        x = self.embedding(x)                                              # x: (batch_size, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)     # x: (batch_size, 1, 2 * embedding_dim)
        output, state = self.gru(x)
        x = self.fc1(output)                       # (batch_size, 1, units)
        x = tf.reshape(x, (-1, x.shape[2]))        # (batch_size, units)
        x = self.fc2(x)                            # (batch_size, vocab_size)
        return x, state, attention_weights
        
    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```

__Training__:

```python
embedding_dim = 256
units = 512
vocab_size = top_k + 1

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    loss_ = loss_object(real, pred)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    return tf.reduce_mean(loss_ * tf.cast(mask, dtype=loss_.dtype))
    
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
                           
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)
```

```python
@tf.function
def train_step(img_tensor, target):
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:,i], predictions)
            dec_input = tf.expand_dims(target[:,i], 1)
            
    batch_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return batch_loss
    

num_steps = len(img_name_train) // BATCH_SIZE    
EPOCHS = 20
loss_plot = []
for epoch in range(start_epoch, EPOCHS):
    total_loss = 0
    for (batch_no, (img_tensor, target)) in enumerate(dataset):
        batch_loss = train_step(img_tensor, target)
        total_loss += batch_loss
        
        if batch_no % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch_no, batch_loss))

    sofar_loss = total_loss / num_steps
    loss_plot.append(sofar_loss)
    
    if epoch % 5 == 0:
        ckpt_manager.save()

    print('Epoch {} Loss {:.6f}'.format(epoch + 1, sofar_loss))
```

__Captioning__:

```python
def evaluate(image):
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)           # shape: (1, 8, 8, 2048)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
                                                                        # shape: (1, 64, 2048)
    features = encoder(img_tensor_val)
    hidden = decoder.reset_state(batch_size=1)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == '<end>':
            return result
        dec_input = tf.expand_dims([predicted_id], 0)    # feed the predicted token back in
            
    return result
```