<a href="https://colab.research.google.com/github/alberttang35/DL-6/blob/main/Tang_09_Assigment_6_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purpuses.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine all sonnets into a single text source.  
- Split into training (80%) and validation (20%).  

In [99]:
import csv
from collections import Counter
from io import BytesIO, StringIO
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import preprocessing
import tensorflow as tf
import requests

GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/84/pg84.txt"

res = requests.get(GUTENBERG_CSV_URL)

text = res.content.decode("utf-8")

GUTENBERG_TEXT = "PROJECT GUTENBERG EBOOK "

def strip_headers(text):
    in_text = False
    output = []

    for line in text.splitlines():
        if GUTENBERG_TEXT in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()

stripped_text = strip_headers(text)

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words.  
- Build a vocabulary (map each unique word to an integer ID).

In [106]:
lowercase = stripped_text.lower()

lowercase = lowercase.replace(",", "")
lowercase = lowercase.replace('”', "")
lowercase = lowercase.replace("“", "")
lowercase = lowercase.replace(";", "")
lowercase

sequence_length = len(lowercase.split())

tokens = lowercase.split()

word_ids = {}
vocab_size = 0
for word in tokens:
    if word not in word_ids:
        word_ids[word] = vocab_size
        vocab_size += 1

id_to_word = {id: word for word, id in word_ids.items()}

token_ids = [word_ids[word] for word in tokens]

# split into training and test

window_size = 10

X = np.array( [ token_ids[start:start+window_size] for start in range(0,len(token_ids)-window_size)] ).astype(np.float32)

Y = np.array(token_ids[window_size:]).astype(np.float32)


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

BATCH_SIZE = 64
BUFFER_SIZE = 10000

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
val_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test)).batch(BATCH_SIZE, drop_remainder=True)

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.

In [101]:
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from keras_hub.metrics import Perplexity

## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [102]:
lstm_model = Sequential()
lstm_model.add(Embedding(vocab_size, 128, input_length=window_size))
lstm_model.add(LSTM(256))
# lstm_model.add(LSTM(256))
lstm_model.add(Dense(vocab_size, activation='softmax'))



## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.

In [103]:
def custom_reshape(x, y):
  return x, tf.expand_dims(y, -1)

train_dataset = train_dataset.map(custom_reshape)
val_dataset = val_dataset.map(custom_reshape)

In [104]:
perplexity = Perplexity(name="perplexity", from_logits=False)

lstm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=[perplexity])
lstm_model.summary()

In [105]:
lstm_model.fit(x=train_dataset, validation_data=val_dataset, epochs=5)

Epoch 1/5
[1m937/937[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m70s[0m 73ms/step - loss: 7.0318 - perplexity: 1327.2151 - val_loss: 6.4664 - val_perplexity: 643.1888
Epoch 2/5
[1m937/937[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m68s[0m 73ms/step - loss: 6.1622 - perplexity: 474.6641 - val_loss: 6.2882 - val_perplexity: 538.1757
Epoch 3/5
[1m937/937[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m80s[0m 71ms/step - loss: 5.7746 - perplexity: 322.3044 - val_loss: 6.2477 - val_perplexity: 516.8079
Epoch 4/5
[1m937/937[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 70ms/step - loss: 5.4095 - perplexity: 223.7734 - val_loss: 6.2653 - val_perplexity: 526.0123
Epoch 5/5
[1m937/937[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 71ms/step - loss: 5.0475 - perplexity: 155.8407 - val_loss: 6.3876 - val_perplexity: 594.4019


<keras.src.callbacks.history.History at 0x7b403b732810>

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [110]:
seed1 = "my travels were long and the sufferings i endured intense"
seed1 = [word_ids[word] for word in seed1.split()]

while len(seed1) < 50:
  seed1.append(lstm_model.predict(np.array(seed1[-window_size:]).reshape(1, window_size)).argmax())

out1 = [id_to_word[id] for id in seed1]
print(out1)



seed2 = "to be or not to be that is the question"
seed2 = [word_ids[word] for word in seed2.split()]
while len(seed2) < 59:
  seed2.append(lstm_model.predict(np.array(seed2[-window_size:]).reshape(1, window_size)).argmax())

out2 = [id_to_word[id] for id in seed2]
print(out2)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27

## 7. Submission
- A Jupyter Notebook (or script) showing:
  - **Data loading** and **preprocessing**.
  - **Model definition** and **training process**.
  - **Validation perplexity** calculation.
  - **Two generated text samples** (each >50 tokens).
- Ensure your notebook/script **runs end-to-end without errors**.
