# **CHAPTER 16**
# **Natural Language Processing with RNNs and Attention**

**Generating Text Using Character RNNs**

This subchapter introduces the concept of generating text by training a Recurrent Neural Network (RNN) at the character level. Instead of predicting whole words, the model predicts the next character in a sequence, allowing it to learn spelling, punctuation, and writing style directly from raw text data.
The process begins by loading a text corpus and converting it into a numerical representation. Each character is mapped to an integer ID, enabling the neural network to process the data. The training sequences are built using sliding windows, where each input sequence predicts the next character in the text. This approach allows the model to learn sequential dependencies over time.
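The mapping and windowing steps described above can be sketched on a toy string. This is a minimal illustration in plain NumPy, not the notebook's pipeline: the `text` value, `char_to_id`, and the window length of 4 are assumptions chosen for readability.

```python
import numpy as np

text = "hello world"  # toy corpus standing in for the real text (assumption)

# Map each distinct character to an integer ID (sorted for determinism).
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
encoded = np.array([char_to_id[ch] for ch in text])

# Sliding windows: each run of n_steps characters is one input,
# and the character immediately after it is the target.
n_steps = 4
X = np.array([encoded[i:i + n_steps] for i in range(len(encoded) - n_steps)])
y = encoded[n_steps:]

print(X.shape, y.shape)  # (7, 4) (7,)
```

Consecutive windows overlap by `n_steps - 1` characters, which is why even a short corpus yields many training examples.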
The RNN architecture used here is an embedding layer, two stacked LSTM layers, and a dense output layer with softmax activation. During training, the model minimizes sparse categorical cross-entropy (the targets are integer character IDs rather than one-hot vectors), gradually learning to produce coherent sequences. Once trained, the model can generate new text by repeatedly sampling a character from the predicted probability distribution and appending it to the input.
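The sampling step can be sketched without a trained model. The helper below is hypothetical (`sample_next_id` is not part of Keras), and the `temperature` parameter is an assumption added for illustration: values below 1 make the choice more conservative, values above 1 make it more random.

```python
import numpy as np

def sample_next_id(probs, temperature=1.0, rng=None):
    """Draw one ID from a probability vector, reshaped by a temperature."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Rescale log-probabilities; a small epsilon avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    rescaled = np.exp(logits)
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

# Stand-in for one softmax output of the model (assumption)
probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
draws = [sample_next_id(probs, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts)  # the middle ID should dominate
```

In a generation loop, the sampled ID would be converted back to a character, appended to the text, and the extended text fed to the model again.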


In [3]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [4]:
filepath = keras.utils.get_file(
    "shakespeare.txt",
    "https://homl.info/shakespeare"
)
with open(filepath, "r") as f:
    shakespeare_text = f.read()

In [5]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)
max_id = len(tokenizer.word_index)

In [6]:
encoded = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
encoded = encoded.flatten()

In [7]:
dataset_size = len(encoded)
train_size = int(dataset_size * 0.1)  # change to 0.9 to train on most of the data
train_data = encoded[:train_size]

In [8]:
n_steps = 50
batch_size = 32

In [9]:
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

train_generator = TimeseriesGenerator(
    train_data, train_data, length=n_steps, batch_size=batch_size
)

In [14]:
model = keras.models.Sequential([
    keras.layers.Embedding(max_id, 16),
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.LSTM(128),  # return_sequences=False
    keras.layers.Dense(max_id, activation="softmax")
])


In [15]:
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

**Stateful RNNs**

This section explains stateful RNNs, which differ from stateless RNNs by preserving hidden states across batches. This is particularly useful when modeling very long sequences that cannot fit into memory at once.
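Instead of resetting the hidden state at each batch, a stateful RNN uses the final state of each sequence in one batch as the initial state of the sequence at the same position in the next batch, allowing it to maintain long-term context across batch boundaries. This only works if the batch size is fixed and the batches are aligned, so that row i of one batch continues the text exactly where row i of the previous batch stopped. The states must also be reset manually between epochs, when the text starts over.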
Stateful RNNs are useful in streaming text generation and continuous sequence modeling but are more complex to manage compared to stateless models.
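The batch alignment requirement is easier to see on a toy stream. The generator below is a hypothetical helper in plain NumPy, one of several possible ways to arrange the data: it splits the encoded text into `batch_size` contiguous chunks and walks through all chunks in lockstep, so each batch row picks up where the same row ended in the previous batch.

```python
import numpy as np

def stateful_batches(encoded, batch_size, n_steps):
    """Yield (X, y) batches where row i continues row i of the previous batch."""
    # Reserve one extra element so targets can be shifted by one step.
    steps_per_row = (len(encoded) - 1) // batch_size
    n_batches = steps_per_row // n_steps
    # Each batch row reads from its own contiguous chunk of the stream.
    starts = np.arange(batch_size) * steps_per_row
    for b in range(n_batches):
        offset = b * n_steps
        X = np.stack([encoded[s + offset: s + offset + n_steps] for s in starts])
        y = np.stack([encoded[s + offset + 1: s + offset + n_steps + 1] for s in starts])
        yield X, y

encoded = np.arange(41)  # toy "text": 0, 1, 2, ... makes continuity easy to check
batches = list(stateful_batches(encoded, batch_size=2, n_steps=5))
for X, y in batches[:2]:
    print(X.tolist())
# [[0, 1, 2, 3, 4], [20, 21, 22, 23, 24]]
# [[5, 6, 7, 8, 9], [25, 26, 27, 28, 29]]
```

The batches must be fed in this order without shuffling, otherwise the carried-over state no longer matches the text that follows.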


In [28]:
import tensorflow as tf
from tensorflow import keras

batch_size = 32
n_steps = 100

In [29]:
inputs = keras.layers.Input(batch_shape=(batch_size, n_steps))

In [30]:
x = keras.layers.Embedding(input_dim=max_id, output_dim=16)(inputs)

In [31]:
x = keras.layers.LSTM(128, return_sequences=True, stateful=True)(x)
x = keras.layers.LSTM(128, return_sequences=True, stateful=True)(x)

In [32]:
outputs = keras.layers.Dense(max_id, activation="softmax")(x)

In [33]:
model = keras.Model(inputs, outputs)

model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

model.summary()

**Sentiment Analysis**

This subchapter introduces sentiment analysis, a fundamental NLP task where the goal is to classify text based on emotional polarity (e.g., positive or negative). The IMDB movie reviews dataset is used as a benchmark.
The text data is preprocessed by converting words into integer sequences, keeping only the most frequent words (the vocabulary size) and padding or truncating every review to a fixed length. The model architecture consists of an embedding layer followed by an LSTM layer and a sigmoid output for binary classification.
Sentiment analysis demonstrates how RNNs can extract semantic meaning from text sequences and learn contextual word representations.
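To make the fixed-length preprocessing concrete, here is a small hand-written imitation of post-padding and post-truncation. `pad_post` is a hypothetical helper, not the Keras function; it only illustrates what happens to short and long sequences.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to maxlen and right-pad it with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # keep only the first maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Toy reviews as lists of word IDs (assumption)
reviews = [[5, 8, 2], [7, 1, 9, 4, 3, 6], [11]]
result = pad_post(reviews, maxlen=4)
print(result)
# [[5, 8, 2, 0], [7, 1, 9, 4], [11, 0, 0, 0]]
```

With post-padding, a short review ends in a run of zeros, so an RNN that does not ignore padding processes those zeros last, right before producing its output.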


In [35]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [36]:
vocab_size = 10000
maxlen = 200  # maximum sequence length
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

X_train = pad_sequences(X_train, maxlen=maxlen, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=maxlen, padding='post', truncating='post')

model = keras.models.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=16, input_length=maxlen),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

In [38]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 16s 20ms/step - accuracy: 0.5124 - loss: 0.6922
Test Accuracy: 0.513


**Masking**

Masking is introduced to handle variable-length sequences. Since sequences are padded to a fixed length, padding tokens must be ignored during training to prevent them from influencing the model.
Keras provides built-in masking support via the Embedding layer with mask_zero=True. The layer then treats index 0 as padding and passes a mask to the layers that follow, so the LSTM skips the padded timesteps instead of processing them.
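The idea behind a mask can be shown in plain NumPy. This is only a conceptual sketch, not what Keras does internally: it derives a boolean mask from the padded IDs and compares an average that includes the padded positions with one that ignores them. The `features` values are made up for the example.

```python
import numpy as np

# Two padded sequences of word IDs; 0 marks padding
padded = np.array([[5, 8, 2, 0, 0],
                   [7, 1, 9, 4, 3]])
mask = padded != 0  # True where a real token sits

# Stand-in per-token features; the 9.0 values sit at padded positions (assumption)
features = np.array([[1.0, 2.0, 3.0, 9.0, 9.0],
                     [1.0, 2.0, 3.0, 4.0, 5.0]])

naive_mean = features.mean(axis=1)                              # padding leaks in
masked_mean = (features * mask).sum(axis=1) / mask.sum(axis=1)  # padding ignored

print(naive_mean)   # [4.8 3. ]
print(masked_mean)  # [2. 3.]
```

Only the first sequence contains padding, so only its result changes when the mask is applied.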


In [46]:
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, 16, mask_zero=True),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid")
])


**Reusing Pretrained Embeddings**

This section discusses leveraging pretrained word embeddings such as GloVe to improve model performance and reduce training time. Instead of learning embeddings from scratch, pretrained vectors provide rich semantic representations learned from massive corpora.
The embeddings are loaded and injected into the embedding layer, which can be frozen or fine-tuned during training.
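Before working with the real GloVe file, the injection step can be sketched on a toy vocabulary. Everything here is hypothetical: the two-dimensional `pretrained` vectors, the `vocab` mapping, and the out-of-vocabulary word are made up to keep the example small.

```python
import numpy as np

# Hypothetical pretrained vectors (real GloVe 6B vectors have 50 to 300 dimensions)
pretrained = {
    "good": np.array([0.9, 0.1], dtype=np.float32),
    "bad": np.array([-0.8, 0.2], dtype=np.float32),
}
# Hypothetical vocabulary: word -> row index, with row 0 reserved for padding
vocab = {"good": 1, "bad": 2, "zzyzx": 3}

embedding_dim = 2
matrix = np.zeros((len(vocab) + 1, embedding_dim), dtype=np.float32)
for word, idx in vocab.items():
    vector = pretrained.get(word)
    if vector is not None:
        matrix[idx] = vector  # words without a pretrained vector stay all-zero

# Fraction of the vocabulary that received a pretrained vector
coverage = sum(word in pretrained for word in vocab) / len(vocab)
print(matrix.shape, round(coverage, 2))  # (4, 2) 0.67
```

The key requirement is that row `idx` of the matrix corresponds to the same integer ID that appears in the training sequences. Checking the coverage is a cheap way to catch a mismatch between the two vocabularies.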


In [53]:
import os
import urllib.request
import zipfile

# URL of the GloVe archive containing the 100d vectors
url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip = "glove.6B.zip"

# Download the archive if it is not already present
if not os.path.exists(glove_zip):
    print("Downloading GloVe embeddings...")
    urllib.request.urlretrieve(url, glove_zip)
    print("Download complete!")

# Extract only the file that is needed
with zipfile.ZipFile(glove_zip, 'r') as zip_ref:
    zip_ref.extract("glove.6B.100d.txt", ".")
    print("Extraction complete!")

# The file glove.6B.100d.txt is now in the current directory
glove_path = "glove.6B.100d.txt"


Downloading GloVe embeddings...
Download complete!
Extraction complete!


In [47]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [48]:
vocab_size = 10000  # number of most frequent words to keep
maxlen = 200        # maximum sequence length

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

X_train = pad_sequences(X_train, maxlen=maxlen, padding="post")
X_test = pad_sequences(X_test, maxlen=maxlen, padding="post")

In [54]:
embedding_dim = 100  # must match the GloVe file being used
embeddings_index = {}
glove_path = "glove.6B.100d.txt"  # make sure the file has already been downloaded

with open(glove_path, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype="float32")
        embeddings_index[word] = vector

print(f"Loaded {len(embeddings_index)} word vectors.")

Loaded 400000 word vectors.


In [55]:
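word_index = imdb.get_word_index()  # dict: word -> rank (1 = most frequent)
index_from = 3  # imdb.load_data() shifts word IDs by 3 (0 = padding, 1 = start, 2 = unknown)
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in word_index.items():
    idx = i + index_from  # row that matches the IDs stored in X_train
    if idx < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector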

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [56]:
embedding_layer = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    input_length=maxlen,
    trainable=False
)

In [58]:
model = keras.models.Sequential([
    embedding_layer,
    keras.layers.LSTM(128, return_sequences=False),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

In [60]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 73s 93ms/step - accuracy: 0.5121 - loss: 0.6927
Test accuracy: 0.5051


**Encoder–Decoder Networks**

This subchapter introduces sequence-to-sequence (seq2seq) models, commonly used in machine translation. The encoder compresses the input sequence into a context vector (in the code below, the LSTM's final hidden and cell states), while the decoder generates the output sequence one token at a time, starting from that context.
This architecture enables the transformation of variable-length input sequences into variable-length outputs, forming the basis for modern NLP systems.
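One detail the description leaves implicit is how the decoder's inputs and targets relate during training. A common arrangement, assumed here and not stated in the text above, is teacher forcing: the decoder receives the target sentence shifted right by one position and learns to predict the unshifted version. The `SOS` and `EOS` IDs and the helper name are hypothetical.

```python
SOS, EOS = 1, 2  # hypothetical start- and end-of-sequence token IDs

def make_decoder_pair(target_ids):
    """Return (decoder_input, decoder_target) for one target sentence."""
    decoder_input = [SOS] + list(target_ids)   # what the decoder sees at each step
    decoder_target = list(target_ids) + [EOS]  # what it should predict at each step
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pair([14, 27, 9])
print(dec_in)   # [1, 14, 27, 9]
print(dec_out)  # [14, 27, 9, 2]
```

At inference time there is no target sentence to shift, so the decoder would instead be fed its own previous prediction at each step.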


In [51]:
encoder_inputs = keras.layers.Input(shape=[None])
encoder_embedding = keras.layers.Embedding(vocab_size, 16)(encoder_inputs)
encoder_outputs, state_h, state_c = keras.layers.LSTM(512, return_state=True)(
    encoder_embedding
)
encoder_state = [state_h, state_c]


**Attention Mechanisms**

The final subchapter introduces attention, a mechanism that allows the decoder to focus on different parts of the input sequence during generation. Instead of relying on a single context vector, attention computes weighted combinations of encoder outputs.
Attention significantly improves performance on long sequences, where a single fixed-size context vector becomes a bottleneck, and it is a foundational idea behind Transformers.
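The weighted combination can be written out in a few lines of NumPy. This is a minimal sketch of unscaled dot-product attention under assumed shapes (10 encoder steps, 8 decoder steps, 4 features), using random arrays in place of real LSTM outputs. The function name is hypothetical.

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """query: (Tq, d), keys: (Tk, d), values: (Tk, dv) -> context (Tq, dv), weights (Tq, Tk)."""
    scores = query @ keys.T                       # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over encoder steps
    return weights @ values, weights

rng = np.random.default_rng(42)
encoder_outputs = rng.normal(size=(10, 4))  # stand-in for encoder LSTM outputs
decoder_outputs = rng.normal(size=(8, 4))   # stand-in for decoder LSTM outputs

context, weights = dot_product_attention(decoder_outputs, encoder_outputs, encoder_outputs)
print(context.shape, weights.shape)  # (8, 4) (8, 10)
```

Each row of `weights` sums to 1 and describes how much one decoder step draws from each encoder step. The resulting `context` plays the role of the attention output that is concatenated with the decoder outputs.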


In [64]:
import numpy as np
from tensorflow import keras

In [66]:
batch_size = 32
timesteps_encoder = 10
timesteps_decoder = 8
input_dim = 20   # features per timestep of the encoder input
output_dim = 15

In [67]:
encoder_input_data = np.random.rand(batch_size, timesteps_encoder, input_dim).astype(np.float32)
decoder_input_data = np.random.rand(batch_size, timesteps_decoder, output_dim).astype(np.float32)
decoder_target_data = np.random.randint(0, output_dim, size=(batch_size, timesteps_decoder))

In [68]:
encoder_inputs = keras.Input(shape=(None, input_dim))
encoder_lstm = keras.layers.LSTM(128, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]


In [70]:
decoder_inputs = keras.Input(shape=(None, output_dim))
decoder_lstm = keras.layers.LSTM(128, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

In [71]:
# Attention
attention_layer = keras.layers.Attention()
attention_output = attention_layer([decoder_outputs, encoder_outputs])

# Concatenate the attention output with the decoder outputs
decoder_combined = keras.layers.Concatenate()([decoder_outputs, attention_output])

# Output layer
decoder_dense = keras.layers.Dense(output_dim, activation='softmax')
decoder_outputs_final = decoder_dense(decoder_combined)

# Model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs_final)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()

# Try fitting on the dummy data
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, epochs=1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - loss: 2.7262



**Conclusion**

Chapter 16 provides a comprehensive overview of NLP using RNNs, covering text generation, sentiment analysis, sequence modeling, pretrained embeddings, encoder–decoder architectures, and attention mechanisms. These concepts form the foundation for advanced NLP models and modern architectures such as Transformers.
