# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the book's text into a single text source.  
- Split into training (80%) and validation (20%).  

In [18]:
!wget -O book.txt https://www.gutenberg.org/cache/epub/64317/pg64317.txt

with open("book.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("Loaded text length:", len(text))

split_index = int(len(text) * 0.8)
train_text = text[:split_index]
test_text = text[split_index:]

print("Train text length:", len(train_text))
print("Test text length:", len(test_text))

--2025-04-20 10:06:45--  https://www.gutenberg.org/cache/epub/64317/pg64317.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 306594 (299K) [text/plain]
Saving to: ‘book.txt’


2025-04-20 10:06:45 (2.50 MB/s) - ‘book.txt’ saved [306594/306594]

Loaded text length: 290077
Train text length: 232061
Test text length: 58016
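Project Gutenberg plain-text files usually wrap the book in a license header and footer, which the cell above leaves in the training data. The following is a minimal sketch, not part of the original solution, that trims this boilerplate before the 80/20 split. The helper names `strip_gutenberg_boilerplate` and `split_text` are hypothetical, and the marker prefixes `*** START OF` and `*** END OF` are an assumption that should be checked against the downloaded file.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start_marker = "*** START OF"  # assumed marker prefix; verify in book.txt
    end_marker = "*** END OF"      # assumed marker prefix; verify in book.txt
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1 or end <= start:
        return text  # markers not found: leave the text unchanged
    # The body begins on the line after the START marker line.
    body_start = text.find("\n", start)
    if body_start == -1 or body_start > end:
        return text
    return text[body_start + 1:end].strip()


def split_text(text, train_fraction=0.8):
    """Split the text by position into a training part and a validation part."""
    split_index = int(len(text) * train_fraction)
    return text[:split_index], text[split_index:]


sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one. It was a quiet morning.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
train_part, val_part = split_text(body)
print(body)
print(len(train_part), len(val_part))
```

If the markers are missing, the function returns the input unchanged, so it is safe to call on any text file.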


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
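The solution below tokenizes by characters. For comparison, here is a minimal word-level sketch of the same steps (lowercase, keep only `.`, `!`, `?` as punctuation, build a vocabulary). The function names `tokenize_words` and `build_vocab`, the `<unk>` token, and the `min_count` parameter are illustrative assumptions, not part of the assignment.

```python
import re
from collections import Counter


def tokenize_words(text):
    """Lowercase the text and split it into words; . ! ? become separate tokens."""
    # [^\W_]+ matches runs of letters and digits (including accented letters).
    return re.findall(r"[^\W_]+|[.!?]", text.lower())


def build_vocab(tokens, min_count=1):
    """Map each sufficiently frequent token to an integer ID; ID 0 is reserved for <unk>."""
    counts = Counter(tokens)
    vocab = {"<unk>": 0}
    for token, count in sorted(counts.items()):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


tokens = tokenize_words("The cat sat. The dog sat!")
word2id = build_vocab(tokens)
ids = [word2id.get(t, word2id["<unk>"]) for t in tokens]
print(tokens)
print(ids)
```

A word-level vocabulary is much larger than a character-level one, so the output layer and the achievable perplexity differ considerably between the two choices.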

In [31]:
import numpy as np

def preprocess(raw_text):
    raw_text = raw_text.lower()
    allowed_punctuation = ['.', '!', '?']
    clean_text = ""
    for c in raw_text:
        if c.isalnum() or c.isspace() or c in allowed_punctuation:
            clean_text += c
    return clean_text


train_text = preprocess(train_text)
test_text = preprocess(test_text)


chars = sorted(set(train_text))
vocab_size = len(chars)

char2id = {ch: idx for idx, ch in enumerate(chars)}
id2char = {idx: ch for ch, idx in char2id.items()}

train_ids = [char2id[c] for c in train_text]
test_ids = [char2id.get(c, char2id[' ']) for c in test_text]

seq_length = 100

def create_sequences(token_ids, seq_length):
    X = []
    y = []
    for i in range(len(token_ids) - seq_length):
        X.append(token_ids[i:i + seq_length])
        y.append(token_ids[i + seq_length])
    return np.array(X), np.array(y)

X_train, y_train = create_sequences(train_ids, seq_length)
X_test, y_test = create_sequences(test_ids, seq_length)

print("Char vocab size:", vocab_size)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


Char vocab size: 46
X_train shape: (223547, 100)
y_train shape: (223547,)
X_test shape: (56059, 100)
y_test shape: (56059,)
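The shapes above follow from the sliding window: a sequence of `n` IDs yields `n - seq_length` input/target pairs. The toy example below mirrors `create_sequences` in plain Python (the name `create_windows` is only used here to avoid clashing with the notebook's function) so the windows can be inspected directly.

```python
def create_windows(token_ids, seq_length):
    """Return (inputs, targets): each input is seq_length IDs, the target is the next ID."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_length):
        inputs.append(token_ids[i:i + seq_length])
        targets.append(token_ids[i + seq_length])
    return inputs, targets


toy_ids = [0, 1, 2, 3, 4, 5]
X_toy, y_toy = create_windows(toy_ids, seq_length=3)
for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

By the same arithmetic, 223547 training windows of length 100 imply 223647 training characters after preprocessing.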


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=seq_length   # length of each input sequence
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
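Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary entry. The sketch below illustrates only the lookup step in plain Python; it is not how Keras implements the layer, and the names `make_embedding_table` and `embed` as well as the initialization range are assumptions made for the illustration.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create a vocab_size x embedding_dim table of small random values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]


def embed(table, token_ids):
    """Replace each integer ID with its row from the table."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=5, embedding_dim=4)
vectors = embed(table, [3, 0, 3])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Identical IDs always map to the same vector; during training, the rows of the table are updated like any other weights.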

In [20]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, GRU
from tensorflow.keras.layers import Embedding



model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=256, input_length=seq_length),  # match the 100-character windows
    GRU(128),
    Dropout(0.2),
    Dense(vocab_size, activation='softmax')
])
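As a rough size check, the trainable parameters of this model can be counted by hand. The sketch below assumes the vocabulary size of 46 reported above and the Keras GRU default `reset_after=True`, which uses two bias vectors per gate; the helper names are hypothetical, and the result should be compared against `model.summary()` rather than taken as authoritative.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def gru_params(input_dim, units, reset_after=True):
    """Three gates, each with input weights, recurrent weights and bias terms."""
    bias_sets = 2 if reset_after else 1  # assumption: reset_after=True adds a second bias
    return 3 * (input_dim * units + units * units + bias_sets * units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


emb = embedding_params(46, 256)
gru = gru_params(256, 128)
out = dense_params(128, 46)
print("Embedding:", emb)
print("GRU:", gru)
print("Dense:", out)
print("Total:", emb + gru + out)
```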

## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)`) or your custom 1D CNN.
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [21]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model reports cross-entropy loss `H` (computed with the natural logarithm, as in Keras), then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
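The definition above can be checked numerically. The sketch below computes perplexity from the probabilities a model assigned to the correct tokens; the function name `perplexity` and the example probabilities are illustrative. A model that guesses uniformly over a vocabulary of size `V` has perplexity exactly `V`, which gives a useful baseline: with the 46-character vocabulary used here, any value below 46 already beats random guessing.

```python
import math


def perplexity(target_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


uniform = perplexity([1 / 46] * 10)          # uniform guess over 46 characters
better = perplexity([0.5, 0.25, 0.5, 0.25])  # a model that is more confident
from_loss = math.exp(1.7871)                 # perplexity from a cross-entropy loss

print(f"Uniform baseline: {uniform:.2f}")
print(f"More confident model: {better:.2f}")
print(f"From loss 1.7871: {from_loss:.2f}")
```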

In [22]:
import math
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.callbacks import EarlyStopping

class PerplexityCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        train_loss = logs.get('loss')
        val_loss = logs.get('val_loss')
        # Compare against None so that a loss of exactly 0.0 is still reported.
        if train_loss is not None:
            print(f"Train Perplexity: {math.exp(train_loss):.2f}")
        if val_loss is not None:
            print(f"Validation Perplexity: {math.exp(val_loss):.2f}")


early_stop = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)

model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=128,
    validation_data=(X_test, y_test),
    callbacks=[PerplexityCallback(), early_stop]
)



Epoch 1/30
Train Perplexity: 8.50
Validation Perplexity: 7.18
1747/1747 ━━━━━━━━━━━━━━━━━━━━ 797s 454ms/step - accuracy: 0.3104 - loss: 2.3878 - val_accuracy: 0.4213 - val_loss: 1.9712
Epoch 2/30
Train Perplexity: 6.24
Validation Perplexity: 6.38
1747/1747 ━━━━━━━━━━━━━━━━━━━━ 794s 454ms/step - accuracy: 0.4380 - loss: 1.8657 - val_accuracy: 0.4481 - val_loss: 1.8529
Epoch 3/30
Train Perplexity: 5.66
Validation Perplexity: 5.97
1747/1747 ━━━━━━━━━━━━━━━━━━━━ 790s 448ms/step - accuracy: 0.4693 - loss: 1.7479 - val_accuracy: 0.4666 - val_loss: 1.7871
Epoch 4

<keras.src.callbacks.history.History at 0x7c4afa9385d0>
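The step count in the log follows from the training set size and the batch size, and the per-step time gives a rough budget for the full run. The sketch below uses the numbers reported above (223547 training windows, batch size 128, about 454 ms per step); the variable names are illustrative, and the timing is specific to the machine that produced this log.

```python
import math

num_samples = 223547      # rows in X_train
batch_size = 128
seconds_per_step = 0.454  # taken from the progress bar above
max_epochs = 30

steps_per_epoch = math.ceil(num_samples / batch_size)
epoch_seconds = steps_per_epoch * seconds_per_step
total_hours = max_epochs * epoch_seconds / 3600

print("Steps per epoch:", steps_per_epoch)
print(f"Seconds per epoch: {epoch_seconds:.0f}")
print(f"Hours for {max_epochs} epochs: {total_hours:.1f}")
```

At this speed, all 30 epochs would take well over the suggested two hours unless early stopping ends the run sooner, so fewer epochs or a shorter `seq_length` may be worth considering.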

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [28]:
def generate_text(model, seed_text, gen_length=300, temperature=1.0):
    model_input = [char2id.get(c, char2id[' ']) for c in seed_text.lower()]
    input_seq = model_input[-seq_length:]

    generated = seed_text
    for _ in range(gen_length):
        pad_input = np.array([input_seq[-seq_length:]])
        preds = model.predict(pad_input, verbose=0)[0]

        preds = np.asarray(preds).astype("float64")
        preds = np.log(preds + 1e-9) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)

        next_id = np.random.choice(len(preds), p=preds)
        next_char = id2char[next_id]

        generated += next_char
        input_seq.append(next_id)

    return generated
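The temperature step inside `generate_text` rescales the predicted distribution before sampling: values below 1.0 sharpen it toward the most likely character, and values above 1.0 flatten it. The standalone sketch below reproduces that rescaling in plain Python on a made-up three-way distribution; the function name `apply_temperature` is hypothetical, and subtracting the maximum is an added numerical-stability step that does not change the result.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by dividing its log-probabilities by temperature."""
    logits = [math.log(p + 1e-9) / temperature for p in probs]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]
cold = apply_temperature(probs, 0.5)  # sharper: favors the top choice
hot = apply_temperature(probs, 2.0)   # flatter: closer to uniform

print("T=0.5:", [round(p, 3) for p in cold])
print("T=2.0:", [round(p, 3) for p in hot])
```

The samples below use `temperature=0.8`, which slightly favors likely characters while still allowing variation.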


In [30]:
# Sample 1
print("Sample 1: 'once upon a time'")
print(generate_text(model, "once upon a time", gen_length=300, temperature=0.8))

# Sample 2
print("\n Sample 2: 'The world was'")
print(generate_text(model, "the world was", gen_length=300, temperature=0.8))


Sample 1: 'once upon a time'
once upon a time i had on another one she said by a little girl of his flast the barnowly confident could contended from a
greats anyhe asked at the
truet up that they was with all way after the bown conversed came and just which she spoke and he was bening surpy mothed the room. is had sense. she was comperman and

 Sample 2: 'The world was'
the world was holassed tom. ashes the esterge and or the longer on mys of the money. i had no back and the some wellspets betcented back a group that the rook of unfilleg his voice he said i think in the afternoon
of clampare on my bendain faster as if he see it
was from conversias in it as he demanded turning
r
