# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the book's text into a single text source.  
- Split into training (80%) and validation (20%).  

In [5]:
# download txt from webpage
!wget https://www.gutenberg.org/cache/epub/67098/pg67098.txt

--2025-04-23 20:03:12--  https://www.gutenberg.org/cache/epub/67098/pg67098.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 153227 (150K) [text/plain]
Saving to: ‘pg67098.txt.1’


2025-04-23 20:03:13 (1.90 MB/s) - ‘pg67098.txt.1’ saved [153227/153227]



In [6]:
import re
from collections import Counter
import tensorflow as tf
import numpy as np

# Step 1: downloaded Winnie the Pooh (pg67098.txt)

with open("pg67098.txt", "r", encoding="utf-8") as f:
  full_text = f.read()

# strip the Project Gutenberg header and footer

# this copy of Winnie-the-Pooh contains the start marker twice,
#   so start after its last occurrence
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK WINNIE-THE-POOH ***"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK WINNIE-THE-POOH ***"

start_idx = full_text.rfind(start_marker)
end_idx = full_text.find(end_marker)

# fail loudly instead of slicing from a bogus index when a marker is missing
if start_idx == -1 or end_idx == -1:
  raise ValueError("Project Gutenberg start/end marker not found")

book_text = full_text[start_idx + len(start_marker):end_idx].strip()


# split into 80% train, 20% val
train_size = int(0.8 * len(book_text))
train_data = book_text[:train_size]
val_data = book_text[train_size:]

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
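
The four steps above can be tried on a single sentence before running them on the whole book. This is a minimal sketch, not the notebook's actual pipeline: the `preprocess` helper is hypothetical, and a regular expression stands in for a real word tokenizer so that the example has no dependencies. The `[UNK]` entry is an assumed convention for words that are missing from the vocabulary.

```python
import re

def preprocess(text):
    """Lowercase, drop punctuation except . ? !, and split into word tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s.?!]", "", text)
    # Regex stand-in for a tokenizer: a run of word characters, or one delimiter.
    return re.findall(r"\w+|[.?!]", text)

tokens = preprocess("Oh, bother! said Pooh.")
print(tokens)  # ['oh', 'bother', '!', 'said', 'pooh', '.']

# Map each unique token to an integer ID, with one extra ID for unknown words.
vocabulary = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
word_to_id["[UNK]"] = len(word_to_id)

# "honey" was never seen, so it falls back to the [UNK] ID.
ids = [word_to_id.get(t, word_to_id["[UNK]"]) for t in preprocess("Oh, honey!")]
print(ids)  # [3, 6, 0]
```

Building the vocabulary from the training split only, as in this sketch, means that validation words outside it are all mapped to the same `[UNK]` ID.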

In [7]:
import nltk
import numpy as np
import re
from nltk.tokenize import word_tokenize
from collections import Counter
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [8]:
# lowercase
train_data = train_data.lower()
val_data = val_data.lower()

# remove punctuation, keep basic sentence delimiters (.!?)
train_data = re.sub(r"[^\w\s.?!]", "", train_data)
val_data = re.sub(r"[^\w\s.?!]", "", val_data)

# tokenize by words
train_tokens = word_tokenize(train_data)
val_tokens = word_tokenize(val_data)

# build vocab
vocabulary = sorted(set(train_tokens))

word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
word_to_id["[UNK]"] = len(word_to_id)
vocab_size = len(word_to_id)

id_to_word = {idx: word for word, idx in word_to_id.items()}

train_ids = [word_to_id.get(word, word_to_id["[UNK]"]) for word in train_tokens]
val_ids = [word_to_id.get(word, word_to_id["[UNK]"]) for word in val_tokens]

# X and y using a sliding window of length seq_len
X_train = []
y_train = []
X_val = []
y_val = []

seq_len = 5

# each window of seq_len IDs is an input; the ID that follows it is the target
for i in range(len(train_ids) - seq_len):
    X_train.append(train_ids[i:i+seq_len])
    y_train.append(train_ids[i+seq_len])

for i in range(len(val_ids) - seq_len):
    X_val.append(val_ids[i:i+seq_len])
    y_val.append(val_ids[i+seq_len])

X_train = np.array(X_train)
y_train = np.array(y_train)
X_val = np.array(X_val)
y_val = np.array(y_val)

print(vocab_size)

1896
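
The sliding-window construction used in the cell above is easy to check on a toy sequence. The sketch below assumes a hypothetical helper `make_windows` and made-up token IDs; it only illustrates how each window of `seq_len` IDs is paired with the ID that follows it.

```python
import numpy as np

def make_windows(ids, seq_len):
    """Pair every window of seq_len IDs with the ID that comes right after it."""
    n_windows = len(ids) - seq_len
    X = [ids[i:i + seq_len] for i in range(n_windows)]
    y = [ids[i + seq_len] for i in range(n_windows)]
    return np.array(X), np.array(y)

# Seven made-up token IDs and a window of five give two training pairs.
X, y = make_windows([10, 11, 12, 13, 14, 15, 16], seq_len=5)
print(X.tolist())  # [[10, 11, 12, 13, 14], [11, 12, 13, 14, 15]]
print(y.tolist())  # [15, 16]
```

A sequence of `n` IDs therefore yields `n - seq_len` examples, and `X` has shape `(n - seq_len, seq_len)`.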


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
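
Conceptually, an embedding layer is a trainable lookup table indexed by word ID. The NumPy sketch below is an illustration of that idea only, not Keras code: the sizes are toy values and `table` is a random stand-in for the weights that the layer would learn.

```python
import numpy as np

vocab_size, embed_dim, seq_len = 10, 4, 5   # toy sizes, chosen for illustration

rng = np.random.default_rng(0)
# One row per vocabulary entry; a real layer would learn these values.
table = rng.normal(size=(vocab_size, embed_dim))

# A batch of two integer-encoded sequences, shape (batch, seq_len).
batch = np.array([[1, 2, 3, 4, 5],
                  [5, 4, 3, 2, 1]])

# Indexing the table with the ID array replaces every ID with its vector.
embedded = table[batch]
print(embedded.shape)  # (2, 5, 4)
```

The output gains one trailing axis of size `embed_dim`, which is the 3D input shape `(batch, seq_len, embed_dim)` that recurrent and 1D convolutional layers expect.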

In [9]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=seq_len
)



## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)`, `GRU(256)`, or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.
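
To get a feel for how large such a model is, the parameter count can be worked out by hand. The sketch below assumes the vocabulary of 1896 tokens printed in the preprocessing step, a 128-dimensional embedding, a single `LSTM(256)`, and a dense softmax output; `lstm_params` is a hypothetical helper that uses the standard LSTM formulation with four gates and one bias vector per gate.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

vocab_size, embed_dim, units = 1896, 128, 256

embedding = vocab_size * embed_dim        # one vector per vocabulary entry
lstm = lstm_params(embed_dim, units)
dense = units * vocab_size + vocab_size   # softmax weights plus biases
total = embedding + lstm + dense

print(embedding, lstm, dense, total)
# 242688 394240 487272 1124200
```

Under these assumptions the output layer is the largest component, because its size scales with the vocabulary.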


In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

model = Sequential([
    embedding_layer,
    LSTM(256),
    Dropout(0.5),
    # LSTM(256),
    # Dropout(0.4),
    # GRU(256),
    # Dropout(0.5),
    # GRU(256),
    # Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model reports a cross-entropy loss `H` (in nats), then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
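
The relation between loss and perplexity can be made concrete with a few lines of plain Python. This is a minimal sketch: `perplexity` is a hypothetical helper that takes the probabilities a model assigned to the correct tokens, and the vocabulary size of 1896 is the one printed in the preprocessing step.

```python
import math

def perplexity(probs_of_true_tokens):
    """exp of the average negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in probs_of_true_tokens]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly over the vocabulary has perplexity equal
# to the vocabulary size, which gives a baseline to compare against.
vocab_size = 1896
uniform = perplexity([1 / vocab_size] * 10)
print(round(uniform))  # 1896

# A perplexity target of 50 corresponds to a cross-entropy loss of ln(50).
target_loss = math.log(50)
print(round(target_loss, 3))  # 3.912
```

Read the other way round, a validation loss has to fall below roughly 3.91 before the perplexity drops under 50.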

In [12]:
network_history = model.fit(X_train, y_train,
                            validation_data=(X_val, y_val),
                            batch_size=64,  # 64, 318
                            epochs=10,
                            verbose=1,
                            callbacks=[es])

val_loss, val_acc = model.evaluate(X_val, y_val)
val_perplexity = np.exp(val_loss)

print("Validation Perplexity: ", val_perplexity)

Epoch 1/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 20s 55ms/step - accuracy: 0.0517 - loss: 6.3861 - val_accuracy: 0.0506 - val_loss: 5.8831
Epoch 2/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 16s 52ms/step - accuracy: 0.0588 - loss: 5.6638 - val_accuracy: 0.0582 - val_loss: 5.8409
Epoch 3/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 22s 57ms/step - accuracy: 0.0758 - loss: 5.4169 - val_accuracy: 0.0926 - val_loss: 5.7071
Epoch 4/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 19s 52ms/step - accuracy: 0.1119 - loss: 5.1502 - val_accuracy: 0.1074 - val_loss: 5.5923
Epoch 5/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 16s 52ms/step - accuracy: 0.1333 - loss: 4.8700 - val_accuracy: 0.1200 - val_loss: 5.4964
Epoch 6/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 21s 53ms/step - accuracy: 0.1560 - loss: 4.6450 - val_accuracy: 0.1263 - val_loss: 5.4832
Epoch 7/10

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [13]:
# function for generating text

def generate_text(seed_phrase, model, word_to_id, seq_len=5, min_tokens=50):
    unk_id = word_to_id["[UNK]"]
    # invert the vocabulary here instead of relying on a global id_to_word
    id_to_word = {idx: word for word, idx in word_to_id.items()}

    # preprocess and tokenize the seed phrase the same way as the training text
    seed_phrase = re.sub(r"[^\w\s.?!]", "", seed_phrase.lower())
    tokens = word_tokenize(seed_phrase)
    token_ids = [word_to_id.get(token, unk_id) for token in tokens]

    # left-pad with [UNK] up to the model's input length
    if len(token_ids) < seq_len:
        token_ids = [unk_id] * (seq_len - len(token_ids)) + token_ids

    # generate tokens until the output holds min_tokens tokens
    generated_tokens = tokens.copy()
    while len(generated_tokens) < min_tokens:
        # the model input is the last seq_len token IDs
        input_sequence = np.array(token_ids[-seq_len:]).reshape(1, -1)

        # predict the next-word distribution and sample from it
        predicted_probs = model.predict(input_sequence, verbose=0)[0].astype("float64")
        predicted_probs /= predicted_probs.sum()  # guard against float32 rounding drift
        predicted_id = int(np.random.choice(len(predicted_probs), p=predicted_probs))

        # map predicted ID to word
        generated_tokens.append(id_to_word[predicted_id])

        # update token_ids for the next prediction
        token_ids.append(predicted_id)

    return ' '.join(generated_tokens)

In [14]:
generate_text("love is", model, word_to_id)

'love is is coming . thats way you as a good just green left out of i said christopher robin read . dozen you kanga without so lines flying they could just and down to myself and i so asked christopher robin . that only baby pooh looking on it'

In [15]:
generate_text("time will", model, word_to_id)

'time will you always found a heffalump without . i quite remember he long bees after bang . i didnt be something in a week . he was the top behind him of his moment . so please to say what whats the song . piglet ! is him .'