<a href="https://colab.research.google.com/github/ashleybrea/04-AIT-HW/blob/main/Copy_of_09_Assigment_6_text_generation_AshleyBrea.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purpuses.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine all sonnets into a single text source.  
- Split into training (80%) and validation (20%).  

In [135]:
import requests

url = "https://www.gutenberg.org/cache/epub/514/pg514.txt"
response = requests.get(url)
text = response.text
len(text)


1047545

In [136]:

split_index = int(0.8 * len(text))
train_text = text[:split_index]
val_text = text[split_index:]

print(f"Training text length: {len(train_text)}")
print(f"Validation text length: {len(val_text)}")

Training text length: 838036
Validation text length: 209509


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [138]:
import re
import numpy as np
from tensorflow.keras.utils import to_categorical

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\wls.!?]', '', text)
    tokens = re.findall(r'\b\w+\b|[.!?]', text)
    return tokens

train_tokens = preprocess(train_text)
val_tokens = preprocess(val_text)
vocab = {word: idx for idx, word in enumerate(sorted(set(train_tokens)))}
vocab_size = len(vocab)

def generate_xy(token_list, vocab, window_size) :
    token_ids = [vocab[token] for token in token_list if token in vocab]
    X = []
    Y = []
    for i in range(len(token_ids) - window_size + 1):
        X.append(token_ids[i: + window_size - 1])
        Y.append(token_ids[i + window_size - 1])
        return np.array(X), np.array(Y)

sequence_length = 6
X_train, Y_train = generate_xy(train_tokens, vocab, sequence_length)
X_val, Y_val = generate_xy(val_tokens, vocab, sequence_length)

print(f"X_train shape: {X_train.shape}, Y_train shape: {Y_train.shape}")
print(f"X_val shape: {X_val.shape}, Y_val shape: {Y_val.shape}")

X_train shape: (1, 5), Y_train shape: (1,)
X_val shape: (1, 5), Y_val shape: (1,)


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.

## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [139]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=(sequence_length,)))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()



In [140]:

model.fit(x_train, y_train, batch_size=64, epochs=5, validation_data=(x_val, y_val))


Epoch 1/5
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 803ms/step - accuracy: 0.0789 - loss: 8.9924 - val_accuracy: 0.0857 - val_loss: 8.5371
Epoch 2/5
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 544ms/step - accuracy: 0.1109 - loss: 7.6819 - val_accuracy: 0.0857 - val_loss: 6.2020
Epoch 3/5
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 826ms/step - accuracy: 0.0986 - loss: 4.6105 - val_accuracy: 0.0857 - val_loss: 4.9363
Epoch 4/5
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 558ms/step - accuracy: 0.1198 - loss: 3.2557 - val_accuracy: 0.0857 - val_loss: 5.3233
Epoch 5/5
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 545ms/step - accuracy: 0.1384 - loss: 3.1588 - val_accuracy: 0.0857 - val_loss: 5.5113


<keras.src.callbacks.history.History at 0x799d886c5890>

In [None]:
def create_sequences(encoded_text, seq_length):
    inputs = []
    targets = []
    for i in range(0, len(encoded_text) - seq_length):
        inputs.append(encoded_text[i:i + seq_length])
        targets.append(encoded_text[i + 1:i + seq_length + 1])  # shifted by 1
    return torch.tensor(inputs), torch.tensor(targets)

sequence_length = 100
input_seqs, target_seqs = create_sequences(encoded_text, sequence_length)

print(f"Input shape: {input_seqs.shape}, Target shape: {target_seqs.shape}")


Input shape: torch.Size([563, 100]), Target shape: torch.Size([563, 100])


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).