<a href="https://colab.research.google.com/github/fangzhongfionaxu/DeepLearning04/blob/main/Deep_Learning_Assigment_6_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purpuses.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [12]:
import requests


Mounted at /content/drive


In [15]:

# read file from url
url = "https://www.gutenberg.org/cache/epub/55/pg55.txt"
response = requests.get(url)

# Make sure the request was successful
if response.status_code == 200:
    text = response.text
    print(text[:500])  # print first 500 characters
else:
    print(f"Failed to download file. Status code: {response.status_code}")


﻿The Project Gutenberg eBook of The Wonderful Wizard of Oz
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before usi


In [16]:
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***"

start_index = text.find(start_marker) + len(start_marker)
end_index = text.find(end_marker)

# only keep the contents of the book, strip away header and footer
if start_index != -1 and end_index != -1:
    text = text[start_index:end_index].strip()

In [7]:
train = 0.8
val = 0.2

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [18]:
import re
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [46]:
text = text.lower()

#remove everything except for character 'w', white space 's', basic sent delimiters '?.!'
text_clean = re.sub(r'[^\w\s\.\!\?]', '', text)

# # Tokenize by words or characters (your choice).
text_tokens = text_clean.split()
text_tokens.append("<unk>")
text_tokens.append("<pad>")

#- Build a vocabulary (map each unique word to an integer ID).
vocab = list(set(text_tokens))
vocab_size = len(vocab)

print(vocab[:20])
word2id = {word: i for i, word in enumerate(vocab)}
print(word2id['exact'])
print(word2id)
text_input = [word2id[word] for word in text_tokens]
print(text_input[:30])

['bondage', 'poppies.', 'inconvenient', 'cost', 'time.', 'homes.', 'vain', 'wedding', 'witch.', 'giant.', 'license', 'took', 'reach', 'noises', 'pricked', 'carved.', 'slapped', 'woke', 'fool', 'eating']
515
{'bondage': 0, 'poppies.': 1, 'inconvenient': 2, 'cost': 3, 'time.': 4, 'homes.': 5, 'vain': 6, 'wedding': 7, 'witch.': 8, 'giant.': 9, 'license': 10, 'took': 11, 'reach': 12, 'noises': 13, 'pricked': 14, 'carved.': 15, 'slapped': 16, 'woke': 17, 'fool': 18, 'eating': 19, 'young': 20, 'praying': 21, 'pavement': 22, 'manage': 23, 'surface': 24, 'polishedand': 25, 'gives': 26, 'stop': 27, 'here?': 28, 'weeping': 29, 'disappear': 30, 'delighted': 31, 'other': 32, 'snatched': 33, 'shouldered': 34, 'goodnatured': 35, 'leads': 36, 'invited': 37, 'beetle': 38, 'gloom': 39, 'diamonds.': 40, 'dwells': 41, 'goodbye!': 42, 'be.': 43, 'marry': 44, 'gracious!': 45, 'hush': 46, 'share': 47, 'tint': 48, 'loud': 49, 'stiffen': 50, 'coat': 51, 'white': 52, 'walk': 53, '1.f.': 54, 'robes': 55, 'advan

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.

In [47]:
from tensorflow.keras.layers import Embedding
vocab_size = len(vocab)
sequence_length = 20

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)



## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [36]:
from tensorflow.keras.layers import Dense,LSTM ,Dropout
from tensorflow.keras.models import Sequential
import numpy as np
from sklearn import preprocessing

from tensorflow.keras.utils import to_categorical
import tensorflow.keras.backend as K

In [40]:

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(vocab_size, activation='softmax'))

prepare X and Y

In [30]:

window_size = sequence_length

data = np.array(text_input)

X = []
y = []
for i in range(len(data)-window_size):
    X.append(data[i:i+window_size])
    y.append(data[i+window_size])
X = np.asarray(X)
y = np.asarray(y)
print("X.shape" , X.shape)
print("y.shape" , y.shape)

split_index = int(train*len(X))

X_train = X[:split_index]
X_val = X[split_index:]

y_train = y[:split_index]
y_val = y[split_index:]

print("X_train.shape", X_train.shape)
print("X_val.shape", X_val.shape)
print("y_train.shape", y_train.shape)
print("y_val.shape", y_val.shape)


assert X_train.shape[0] == y_train.shape[0]
assert X_val.shape[0] == y_val.shape[0]

X.shape (42657, 20)
y.shape (42657,)
X_train.shape (34125, 20)
X_val.shape (8532, 20)
y_train.shape (34125,)
y_val.shape (8532,)


In [42]:
def perplexity(y_true, y_pred):
    """
    Perplexity = e^cross_entropy
    """
    y_true = tf.cast(y_true, tf.int32)

    cross_entropy = K.mean(K.sparse_categorical_crossentropy(y_true, y_pred))
    perplexity = K.exp(cross_entropy)  # Exponentiate to get perplexity
    return perplexity

In [32]:
import tensorflow.keras.optimizers as optimizers

opt = optimizers.Adam(learning_rate = 0.0001)

In [50]:
from tensorflow.keras.optimizers import Adam
# Compile model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer = Adam(learning_rate = 0.0001),
    metrics=[perplexity]
)


# X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)  # Adding a new dimension for embedding
# X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], 1)  # Adding a new dimension for embedding

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=64,
    epochs=20
)
# # Train model and monitor validation loss
# history = model.fit(
#     X_train,
#     y_train,
#     epochs=15,
#     batch_size=32,
#     verbose=1,
#     validation_data=(X_val, y_val)  # This enables val_loss and val_perplexity monitoring
# )

Epoch 1/20


InvalidArgumentError: Graph execution error:

Detected at node sequential_2_1/lstm_4_1/while/TensorListPushBack_14 defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py", line 37, in <module>

  File "/usr/local/lib/python3.11/dist-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelapp.py", line 712, in start

  File "/usr/local/lib/python3.11/dist-packages/tornado/platform/asyncio.py", line 205, in start

  File "/usr/lib/python3.11/asyncio/base_events.py", line 608, in run_forever

  File "/usr/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once

  File "/usr/lib/python3.11/asyncio/events.py", line 84, in _run

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 499, in process_one

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 730, in execute_request

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/ipkernel.py", line 383, in do_execute

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/zmqshell.py", line 528, in run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code

  File "<ipython-input-50-da21cfa129c7>", line 13, in <cell line: 0>

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 371, in fit

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 113, in one_step_on_data

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 77, in train_step

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/core/function/capture/capture_container.py", line 154, in capture_by_value

Tried to append a tensor with incompatible shape to a list. Op element shape: [0] list shape: [256,1024]
	 [[{{node sequential_2_1/lstm_4_1/while/TensorListPushBack_14}}]] [Op:__inference_multi_step_on_iterator_28541]

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If you have higher value (which is possible) try to draw conclusions, why doesn't it decrease to a lower value.

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [48]:
id2word = {idx: word for word, idx in word2id.items()}
if "<unk>" not in word2id:
    word2id["<unk>"] = len(word2id)
    id2word[len(id2word)] = "<unk>"
if "<pad>" not in word2id:
    word2id["<pad>"] = len(word2id)
    id2word[len(id2word)] = "<pad>"

def generate(seed, model, word2id, id2word, window_size=20, num_words=50):
    seed = seed.lower()
    seed_tokens = re.sub(r'[^\w\s\.\!\?]', '', seed).split()
    seed_ids = [word2id.get(word, word2id["<unk>"]) for word in seed_tokens]  # Use <unk> for unknowns

    # Pad or truncate seed to match window size
    while len(seed_ids) < window_size:
        seed_ids.insert(0, word2id["<pad>"])  # add pading to fit windowsize
    seed_ids = seed_ids[-window_size:]

    generated = seed_tokens.copy()

    for _ in range(num_words):
        input_seq = np.array(seed_ids).reshape(1, -1)  # shape (1, window_size)

        # Predict next word
        preds = model.predict(input_seq, verbose=0)
        next_id = np.argmax(preds[0])  # Greedy sampling, can also try temperature sampling

        next_word = id2word.get(next_id, "<unk>")
        generated.append(next_word)

        # Update seed_ids
        seed_ids.append(next_id)
        seed_ids = seed_ids[1:]

    return " ".join(generated)

In [49]:
generate("love is", model, word2id, id2word, window_size=20, num_words=50)

'love is created id double accusations recovering refusal dog whereabouts damaged damaged portfolio rumour millions rumour ventured. chair. estate waiting schraeders michaelis catholic emotion consciously consciously win. fashion convinced country country ivory. gutenbergs monkey. ticket slight now nod threw matters. kissed waiter unprotected flow located mastered mastered mastered wasnt hope. suggested suggested'