
# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine all of the book's text into a single text source.  
- Split into training (80%) and validation (20%).  

In [1]:
import requests

url = "https://www.gutenberg.org/cache/epub/514/pg514.txt"
response = requests.get(url)
text = response.text
len(text)


1047545

In [2]:

split_index = int(0.8 * len(text))
train_text = text[:split_index]
val_text = text[split_index:]

print(f"Training text length: {len(train_text)}")
print(f"Validation text length: {len(val_text)}")

Training text length: 838036
Validation text length: 209509


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
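
As a small illustration of these four steps, the sketch below runs them on a made-up sentence. The sample string and the names `tokenize` and `vocab_demo` are assumptions for illustration only; the regular expressions keep word characters and the delimiters `.`, `!`, and `?`.

```python
import re

def tokenize(text):
    """Lowercase, drop punctuation except . ! ?, and split into tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s.!?]", " ", text)        # keep words, whitespace, . ! ?
    return re.findall(r"\b\w+\b|[.!?]", text)      # words and delimiters as tokens

sample = "Hello, world! Is it you?"                # hypothetical sample text
tokens = tokenize(sample)

# Map each unique token to an integer ID (sorted for a reproducible order)
vocab_demo = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

print(tokens)
print(vocab_demo)
```

Sorting the unique tokens before enumerating them makes the ID assignment deterministic across runs, which matters if the vocabulary is rebuilt after the model has been trained.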

In [3]:
import re
import numpy as np
from tensorflow.keras.utils import to_categorical

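def data_preprocess(text):
    # Lowercase, then replace everything except word characters,
    # whitespace, and the sentence delimiters . ! ? with a space
    text = text.lower()
    text = re.sub(r'[^\w\s.!?]', ' ', text)
    # Split into word tokens; each delimiter becomes its own token
    tokens = re.findall(r'\b\w+\b|[.!?]', text)
    return tokens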

train_tokens = data_preprocess(train_text)
val_tokens = data_preprocess(val_text)

vocab = {word: idx for idx, word in enumerate(sorted(set(train_tokens)))}
vocab_size = len(vocab)

def making_X_and_Y(token_list, vocab, window_size):
    # Tokens missing from the training vocabulary are silently dropped
    token_ids = [vocab[token] for token in token_list if token in vocab]
    X = []
    Y = []
    for i in range(len(token_ids) - window_size):
        X.append(token_ids[i:i + window_size])
        Y.append(token_ids[i + window_size])

    return np.array(X), np.array(Y)

sequence_length = 6
X_train, Y_train = making_X_and_Y(train_tokens, vocab, sequence_length)
X_val, Y_val = making_X_and_Y(val_tokens, vocab, sequence_length)

print(f"X_train shape: {X_train.shape}, Y_train shape: {Y_train.shape}")
print(f"X_val shape: {X_val.shape}, Y_val shape: {Y_val.shape}")
print(X_train.shape, Y_train.shape)
print("Example X:", X_train[0])
print("Example Y:", Y_train[0])


X_train shape: (164886, 6), Y_train shape: (164886,)
X_val shape: (38484, 6), Y_val shape: (38484,)
(164886, 6) (164886,)
Example X: [8705 6673 3884 2735 5857 5021]
Example Y: 9656
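
To make the shapes above easier to read, here is a minimal sketch of the same sliding-window idea on a toy list of IDs. The helper name `make_windows` and the ID values are hypothetical; the logic mirrors the loop in `making_X_and_Y`.

```python
import numpy as np

def make_windows(token_ids, window_size):
    """Return (X, Y): each row of X holds window_size IDs, Y the ID that follows."""
    n_examples = len(token_ids) - window_size
    X = [token_ids[i:i + window_size] for i in range(n_examples)]
    Y = [token_ids[i + window_size] for i in range(n_examples)]
    return np.array(X), np.array(Y)

ids = [10, 11, 12, 13, 14]                 # hypothetical token IDs
X_demo, Y_demo = make_windows(ids, window_size=3)

print(X_demo)   # windows [10 11 12] and [11 12 13]
print(Y_demo)   # next-token targets 13 and 14
```

A list of `n` IDs yields `n - window_size` training examples, which is why the row counts of `X` and `Y` always match.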


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
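
Conceptually, an embedding layer is a lookup into a trainable table with one row per vocabulary ID. The NumPy sketch below illustrates only that lookup and the resulting shape; it is not the Keras implementation, and the toy sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size_demo, embed_dim, seq_len = 10, 4, 6     # toy sizes (assumptions)

# One row per vocabulary ID, standing in for the layer's weight matrix
table = rng.normal(size=(vocab_size_demo, embed_dim))

# A batch of 2 integer-encoded sequences
batch = rng.integers(0, vocab_size_demo, size=(2, seq_len))

# Indexing the table with the ID array replaces every ID by its row
embedded = table[batch]

print(embedded.shape)   # (batch, sequence_length, output_dim)
```

The recurrent or convolutional layer that follows therefore receives a 3D tensor, one dense vector per token position.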

## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)`) or your custom 1D CNN.
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# The model appears to overfit, so stop once val_loss stops improving and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)


model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=sequence_length))
model.add(LSTM(256))
model.add(Dropout(0.3))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()
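
As a rough cross-check on what `model.summary()` reports, the trainable parameter counts of this architecture can be worked out by hand. The sketch below assumes default Keras layers with bias terms and a hypothetical vocabulary size of 10,000, since the actual `vocab_size` is not printed above; the helper names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

V = 10_000  # hypothetical vocabulary size, not the value from this notebook
counts = {
    "embedding": embedding_params(V, 128),
    "lstm": lstm_params(128, 256),
    "dense": dense_params(256, V),
}
total = sum(counts.values())   # Dropout adds no parameters

print(counts)
print(total)
```

With a word-level vocabulary, the embedding and softmax layers hold most of the weights, so the vocabulary size drives both model size and training time.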



## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model outputs cross-entropy loss `H` (in nats, as Keras reports it), then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
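
A minimal sketch of this relationship, assuming the loss is measured in nats: a model that guesses uniformly over `V` words has cross-entropy `ln(V)` and therefore perplexity `V`, which gives a useful baseline. The function name and the value of `V` are illustrative assumptions.

```python
import numpy as np

def perplexity(cross_entropy_nats):
    """Convert an average cross-entropy loss (natural log) to perplexity."""
    return float(np.exp(cross_entropy_nats))

V = 1000                                   # hypothetical vocabulary size
uniform_loss = -np.log(1.0 / V)            # loss of a uniform guess over V words

print(uniform_loss)                        # equals ln(V)
print(perplexity(uniform_loss))            # recovers V up to rounding
```

Perplexity can be read as the effective number of equally likely words the model is choosing between at each step, so lower is better.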

In [19]:
model.fit(X_train, Y_train, batch_size=64, epochs=10, validation_data=(X_val, Y_val), callbacks=[early_stopping])

Epoch 1/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 314s 120ms/step - accuracy: 0.0564 - loss: 6.5421 - val_accuracy: 0.1071 - val_loss: 5.7509
Epoch 2/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 319s 119ms/step - accuracy: 0.1192 - loss: 5.5853 - val_accuracy: 0.1270 - val_loss: 5.5200
Epoch 3/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 305s 118ms/step - accuracy: 0.1453 - loss: 5.2243 - val_accuracy: 0.1368 - val_loss: 5.4321
Epoch 4/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 303s 117ms/step - accuracy: 0.1631 - loss: 4.9573 - val_accuracy: 0.1404 - val_loss: 5.4121
Epoch 5/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 317s 115ms/step - accuracy: 0.1786 - loss: 4.7125 - val_accuracy: 0.1441 - val_loss: 5.4288
Epoch 6/10
2577/2577 ━━━━━━━━━━━━━━━━━━━━ 325s 117ms/step - accuracy: 0.1916 - loss: 4.4833 - val_accuracy: 0.1451 - val_loss:

<keras.src.callbacks.history.History at 0x7e917f1daa90>

In [20]:
val_loss, val_acc = model.evaluate(X_val, Y_val)

print("Val Perplexity: ", np.exp(val_loss))

1203/1203 ━━━━━━━━━━━━━━━━━━━━ 27s 22ms/step - accuracy: 0.1468 - loss: 5.2749
Val Perplexity:  224.09803801008024


## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [21]:
# prompt: make a function that generates at least 50 tokens of text from a seed phrase and samples from the predicted distribution, instead of always picking the highest-probability token with argmax, to avoid repetition

import numpy as np

def generate_text(model, seed_text, vocab, vocab_inv, num_tokens_to_generate=50):
    """Extend seed_text by sampling num_tokens_to_generate tokens from the model."""
    generated_text = seed_text
    # Tokenize the seed with the same preprocessing used for training
    seed_tokens = data_preprocess(seed_text)
    token_ids = [vocab[token] for token in seed_tokens if token in vocab]

    for _ in range(num_tokens_to_generate):
        # Use the last sequence_length tokens as context
        padded_input = token_ids[-sequence_length:]
        if len(padded_input) < sequence_length:
            # Left-pad short contexts with ID 0 (note: 0 is also a real token ID here)
            padded_input = [0]*(sequence_length - len(padded_input)) + padded_input

        input_sequence = np.array([padded_input])
        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        # Sample from the predicted distribution
        predicted_id = np.random.choice(vocab_size, p=predicted_probs)

        # Avoid emitting the same token twice in a row
        # This is not perfect but helps reduce repetition
        if token_ids and predicted_id == token_ids[-1]:
            # Fall back to the token with the second-highest probability
            predicted_id = np.argsort(predicted_probs)[-2]

        token_ids.append(predicted_id)
        generated_text += " " + vocab_inv[predicted_id]

    return generated_text

# Create reverse vocabulary for lookup
vocab_inv = {idx: word for word, idx in vocab.items()}


In [22]:
generate_text(model, "time will", vocab, vocab_inv)

'time will us your coffee ? asked her mother somewhat saturday under the window poor mind that s sake this idea softly so well i got your soft in this tossed and threateningly delighted and too for a king s glove and seized the good ambition of mine . i get very'

In [23]:
generate_text(model, "love is", vocab, vocab_inv)

'love is snodgrass a row down and tomorrow here when they had all toward them then at white tumbling saying pride to bed the old lady made the shining four loving neglecting which experiences flew beautifully therefore she danced years of cherished the german bill rocking fastening for a mother with all'
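
The difference between greedy decoding and sampling can be checked in isolation on a toy distribution. The sketch below is an illustration under assumptions, not the notebook's code: `sample_id` and the three-way distribution are hypothetical, and renormalizing in float64 is only a cheap precaution against float32 softmax outputs that do not sum exactly to 1.

```python
import numpy as np

def sample_id(probs, rng):
    """Draw one token ID at random according to the given probabilities."""
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum()                        # renormalize after the dtype change
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(42)            # seeded for reproducibility
probs = np.array([0.7, 0.2, 0.1], dtype=np.float32)   # toy distribution

greedy = int(np.argmax(probs))             # always picks the same ID
samples = [sample_id(probs, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=len(probs))

print(greedy)
print(counts)                              # roughly proportional to probs
```

Greedy decoding returns the same ID every time for a given context, which tends to produce loops, while sampling visits lower-probability tokens in proportion to their probability.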