
# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the full text of the book into a single text source.  
- Split into training (80%) and validation (20%).  

In [36]:
# prompt: from this link with the entire pride and prejudice text, https://www.gutenberg.org/cache/epub/1342/pg1342.txt

import requests

# Downloading text from gutenberg website
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
response = requests.get(url)

# assign to single string
pride_and_prejudice_text = response.text

# extract the book text between the Project Gutenberg start and end markers
start_idx = pride_and_prejudice_text.find("*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***")
end_idx = pride_and_prejudice_text.find("*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***")
pride_and_prejudice_text = pride_and_prejudice_text[start_idx:end_idx].strip()

train_val_split_idx = int(len(pride_and_prejudice_text)*.8)


# train and val split
train_data = pride_and_prejudice_text[:train_val_split_idx]
val_data = pride_and_prejudice_text[train_val_split_idx:]
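
Slicing at a raw character index can cut a word in half at the train/validation boundary. The sketch below shows one way to move the cut forward to the next whitespace character. The helper name `split_at_whitespace` is hypothetical and not part of the assignment.

```python
def split_at_whitespace(text, train_fraction=0.8):
    """Split text near train_fraction, moving the cut to the next whitespace."""
    cut = int(len(text) * train_fraction)
    # Advance until the cut sits on a whitespace character (or the end of the text).
    while cut < len(text) and not text[cut].isspace():
        cut += 1
    return text[:cut], text[cut:]


sample = "it is a truth universally acknowledged"
train_part, val_part = split_at_whitespace(sample, train_fraction=0.5)
print(repr(train_part), repr(val_part))
```

For a full-length book the boundary moves by at most one word, so the 80/20 proportion is essentially unchanged.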



## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
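
The four steps above can be sketched on a toy sentence. This is a minimal illustration that assumes a simple regex tokenizer instead of NLTK's `word_tokenize`; the helper names `preprocess`, `build_vocab`, and `encode` are hypothetical. Sorting the unique tokens before numbering them is a design choice that keeps the IDs reproducible across runs, which iterating over a plain `set` does not guarantee.

```python
import re


def preprocess(text):
    """Lowercase, drop punctuation except . ! ?, and split into tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s.!?]", "", text)
    # Words become tokens; each sentence delimiter is its own token.
    return re.findall(r"\w+|[.!?]", text)


def build_vocab(tokens, unk_token="[UNK]"):
    """Map each unique token to an integer ID, with [UNK] as the last ID."""
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    vocab[unk_token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="[UNK]"):
    """Convert tokens to IDs, sending unseen tokens to [UNK]."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


tokens = preprocess("It is a truth, universally acknowledged!")
vocab = build_vocab(tokens)
ids = encode(preprocess("It is a lie."), vocab)
print(tokens)
print(ids)
```

Here "lie" and "." never appear in the toy training sentence, so both map to the `[UNK]` ID.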

In [3]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [59]:
import re
from nltk.tokenize import word_tokenize

# to lowercase
train_data = train_data.lower()
val_data = val_data.lower()

# remove all punctuation except the sentence delimiters . ! ?
train_data = re.sub(r"[^\w\s.!?]", "", train_data)
val_data = re.sub(r"[^\w\s.!?]", "", val_data)

train_tokens = word_tokenize(train_data)
val_tokens = word_tokenize(val_data)

unique_tokens = set(train_tokens)
vocab = {word: idx for idx, word in enumerate(unique_tokens)}
unk_token = "[UNK]"
vocab[unk_token] = len(vocab)

train_ids = [vocab[word] for word in train_tokens]

val_ids = []

for token in val_tokens:
    if token in vocab:
        val_ids.append(vocab[token])
    else:
        val_ids.append(vocab["[UNK]"])


In [60]:
import numpy as np

train_X = []
train_y = []

for i in range(len(train_ids) - 3):
    train_X.append(train_ids[i:i+3])       # 3-word input
    train_y.append(train_ids[i+3])

val_X = []
val_y = []

for i in range(len(val_ids) - 3):
    val_X.append(val_ids[i:i+3])       # 3-word input
    val_y.append(val_ids[i+3])

train_X = np.array(train_X)
train_y = np.array(train_y)
val_X = np.array(val_X)
val_y = np.array(val_y)

print(train_X.shape)
print(train_y.shape)
print(val_X.shape)
print(val_y.shape)

(106174, 3)
(106174,)
(27360, 3)
(27360,)
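
The windowing above pairs every run of 3 consecutive IDs with the ID that follows it. A minimal sketch of the same idea with the context length as a parameter is shown below; the helper name `make_windows` is hypothetical, and it returns plain lists that could be wrapped in `np.array` as in the cell above.

```python
def make_windows(ids, context=3):
    """Return (inputs, targets): each input is `context` IDs, the target is the next ID."""
    inputs, targets = [], []
    for i in range(len(ids) - context):
        inputs.append(ids[i:i + context])
        targets.append(ids[i + context])
    return inputs, targets


X, y = make_windows([10, 11, 12, 13, 14], context=3)
print(X, y)
```

A sequence of `n` IDs yields `n - context` training pairs, which matches the row counts printed above.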


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.
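
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary ID. The pure-Python sketch below illustrates only the lookup step, under the assumption of a randomly initialised table; it is not how Keras implements the layer, and `make_embedding_table` and `embed` are hypothetical names.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids, table):
    """Replace each integer ID with its row (vector) from the table."""
    return [table[i] for i in ids]


table = make_embedding_table(vocab_size=5, dim=4)
vectors = embed([2, 0, 2], table)
print(len(vectors), len(vectors[0]))
```

The same ID always maps to the same vector, and during training those vectors are updated along with the rest of the model's weights.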

In [61]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim= len(vocab),     # size of the vocabulary
    output_dim= 256,           # embedding vector dimension
    input_length= 3
)



## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [62]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, GRU
from tensorflow.keras.callbacks import EarlyStopping

# implement early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

model = Sequential()
model.add(embedding_layer)
model.add(GRU(256))
model.add(Dropout(0.5))
# model.add(LSTM(256))
# model.add(Dropout(0.3))
model.add(Dense(len(vocab), activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
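
The relationship between loss and perplexity can be sketched as below, assuming the cross-entropy is measured in nats (natural logarithm), as Keras' `sparse_categorical_crossentropy` reports it. The function names are hypothetical.

```python
import math


def perplexity(cross_entropy):
    """Perplexity is e raised to the average cross-entropy (in nats)."""
    return math.exp(cross_entropy)


def loss_for_perplexity(target_perplexity):
    """Inverse mapping: the loss needed to reach a given perplexity."""
    return math.log(target_perplexity)


target_loss = loss_for_perplexity(50)
print(round(target_loss, 3), round(perplexity(target_loss), 3))
```

Under this assumption, a validation perplexity under 50 requires a validation loss below roughly 3.91.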

In [63]:
model.fit(train_X, train_y, validation_data= (val_X, val_y), epochs=10, batch_size=64, callbacks=[early_stop])

Epoch 1/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 176s 103ms/step - accuracy: 0.0543 - loss: 6.5251 - val_accuracy: 0.1186 - val_loss: 5.4713
Epoch 2/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 199s 101ms/step - accuracy: 0.1258 - loss: 5.4451 - val_accuracy: 0.1350 - val_loss: 5.2597
Epoch 3/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 164s 99ms/step - accuracy: 0.1544 - loss: 5.1063 - val_accuracy: 0.1455 - val_loss: 5.1894
Epoch 4/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 202s 99ms/step - accuracy: 0.1762 - loss: 4.8267 - val_accuracy: 0.1515 - val_loss: 5.1736
Epoch 5/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 167s 101ms/step - accuracy: 0.1938 - loss: 4.6014 - val_accuracy: 0.1480 - val_loss: 5.1944
Epoch 6/10
1659/1659 ━━━━━━━━━━━━━━━━━━━━ 168s 101ms/step - accuracy: 0.2145 - loss: 4.3429 - val_accuracy: 0.1514 - val_loss: 5

<keras.src.callbacks.history.History at 0x7995925b0910>

In [65]:
val_loss, val_acc = model.evaluate(val_X, val_y)

print("Val Perplexity: ", np.exp(val_loss))

855/855 ━━━━━━━━━━━━━━━━━━━━ 14s 16ms/step - accuracy: 0.1510 - loss: 5.1841
Val Perplexity:  176.54989624584874


## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).
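
One optional way to make the two samples more or less adventurous is temperature sampling, which rescales the predicted distribution before drawing the next word. This technique is not required by the assignment; the sketch below is an assumption-laden illustration in pure Python, with hypothetical helper names `apply_temperature` and `sample_index`.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution; a lower temperature sharpens it."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.7, 0.2, 0.1]
sharpened = apply_temperature(probs, temperature=0.5)
choice = sample_index(probs, temperature=0.5, rng=random.Random(0))
print([round(p, 3) for p in sharpened], choice)
```

A temperature of 1.0 leaves the model's distribution unchanged, values below 1.0 favour the most likely words, and values above 1.0 flatten the distribution.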

In [76]:
# generate text from a seed phrase by repeatedly sampling the next word
def generate_text(seed_phrase, model, vocab, min_tokens=50):
    seed_tokens = word_tokenize(seed_phrase.lower())
    # map out-of-vocabulary seed words to [UNK] (the vocab has no [PAD] entry)
    seed_ids = [vocab.get(word, vocab["[UNK]"]) for word in seed_tokens]
    # left-pad short seeds with [UNK] so the model always receives 3 input IDs
    if len(seed_ids) < 3:
        seed_ids = [vocab["[UNK]"]] * (3 - len(seed_ids)) + seed_ids
    generated_tokens = seed_tokens.copy()

    # build the reverse lookup once instead of scanning the vocab at every step
    id_to_word = {idx: word for word, idx in vocab.items()}

    while len(generated_tokens) < min_tokens:
        input_sequence = np.array([seed_ids[-3:]])
        predicted_probs = model.predict(input_sequence, verbose=0)[0]
        # sample from the predicted distribution rather than taking the argmax
        predicted_word_idx = np.random.choice(len(predicted_probs), p=predicted_probs)
        generated_tokens.append(id_to_word[predicted_word_idx])
        seed_ids.append(predicted_word_idx)

    generated_text = " ".join(generated_tokens)
    return generated_text



In [77]:
generate_text("love is", model, vocab)

'love is done my dear she saw your marrying so disgracing him known nor his collection made her remembered her husband . for he do said to only have and allowed him to believe the appearance of being manner among publicly matrimony ten weeks from my aunt . dear lydia'

In [78]:
generate_text("time will", model, vocab)

'time will at present i can have actually hearty houses on goodhumour to their own family . _that_ state to prevent their pretended from colonel fitzwilliam then perceptible i am exceedingly ? but it she can a great sooner one could ever be end to ah mr. wickham to herself'