# Day-75: Text Generation using RNN

In the last few days, we explored RNNs, LSTMs, GRUs, and Bidirectional Networks, learning how these models understand sequential data like text or time series.

Today, we’ll take it one step further — and build something creative:
A Text Generator using RNNs!

We’ll train an RNN on a children’s stories corpus and then have it generate new story text word by word. This is the same predict-the-next-token loop that ChatGPT and other AI writers run, although those systems are built on Transformers rather than RNNs.

## Topics Covered

- About the Dataset

- Revisiting NLP Concepts We’ll Use

- Preparing Data for RNN Text Generation

- Building the RNN Model

- Generating Text Word-by-Word

- Evaluation & Experimentation

## The Dataset: Children Stories Text Corpus (Kaggle)

Guys, for any project, the first step is always the data! Our dataset is a fantastic collection of children's stories. Why stories? Because they have a sequence and a context.

`Analogy`: 
- Think of this dataset as a massive library of bedtime stories. 
- When a kid reads a lot of stories, they learn the structure: "Once upon a time..." is often followed by a character introduction. 
- Our model is going to "read" this library and learn the grammar, the sentence structure, and the word-to-word dependencies to tell its own story.

We’ll be using the Children Stories Text Corpus from Kaggle: https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus/data
It contains hundreds of story texts written for kids — simple grammar, repetitive sentence structures, and rich vocabulary — perfect for language generation tasks!

This kind of dataset helps our RNN learn storytelling patterns like:

- Sentence structure (subject → verb → object)

- Repetitive story motifs ("Once upon a time", "and then", etc.)

- Predictable transitions ("The next day...", "Suddenly...")

## Concepts from NLP we will use

We aren't starting from scratch! We'll leverage the power of foundational NLP concepts we covered previously.

| **NLP Concept**                        | **Day Covered**              | **How We Use It in Text Generation**                                                        | **Analogy**                                                                                                                  |
| :------------------------------------- | :--------------------------- | :------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------------- |
| Tokenization                           | Day 50                       | We break the stories into words or characters, creating a vocabulary.                       | Breaking a long book into individual words to check the frequency of each word.                                              |
| Lowercasing & Cleaning                 | Day 50                       | We normalize text to ensure uniformity — “Cat” and “cat” are treated the same.              | Making sure everyone wears the same uniform before a group photo.                                                            |
| Stopwords                              | Day 50                       | Usually, we keep stopwords here as they help maintain the flow of sentences.                | Keeping connecting words like “and” or “but” to ensure the story makes sense.                                                |
| Word Embeddings (e.g., Word2Vec/GloVe) | Day 52                       | We convert each token into a dense vector that captures semantic meaning. Here the vectors are learned from scratch by a Keras `Embedding` layer rather than loaded from Word2Vec/GloVe. | Giving each word a unique ID card that also contains information about what the word means and what words are similar to it. |
| Sequence Data                          | General RNN Concept          | RNNs are designed to handle ordered data, so word order defines meaning in text generation. | A detective trying to solve a crime — the order of events is crucial to understanding the full story.                        |
| Padding                                | (General Preprocessing)      | Ensures all input sequences are of the same length before feeding them into RNN.            | Making all sentences the same length by adding blank spaces at the start.                                                    |
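| One-hot / Categorical Encoding         | (Used before model training) | Treats the target (next word) as one class out of the whole vocabulary. In the code we keep targets as integer IDs and use sparse categorical cross-entropy, which gives the same loss as one-hot targets without building huge vectors. | Giving each possible next word its own “slot” in the prediction list. |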
| Text Generation Loop                   | (Core to RNN Workflow)       | Repeatedly predicts the next word and appends it to the input sequence.                     | Like a storyteller who keeps adding one word at a time until the story ends.                                                 |


So yes — we’re reusing everything from Day 50–55!
Only this time, we’re not classifying or clustering text — we’re generating new text.
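
To make the first rows of the table concrete, here is a minimal pure-Python sketch of tokenization, vocabulary building, and integer encoding. The helper names `tokenize` and `build_vocab` and the sample sentence are made up for illustration; they only mimic the convention Keras `TextVectorization` follows (id 0 for padding, id 1 for unknown words, then words ordered by frequency).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, then split into words (keeping in-word apostrophes) and punctuation marks."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*|[^\sa-z0-9]", text.lower())

def build_vocab(tokens):
    """Hypothetical helper: id 0 = padding, id 1 = unknown, then words by frequency."""
    counts = Counter(tokens)
    vocab = ["", "[UNK]"] + [word for word, _ in counts.most_common()]
    word_to_id = {word: i for i, word in enumerate(vocab)}
    return vocab, word_to_id

tokens = tokenize("Once upon a time, the cat didn't sleep. The cat sang!")
vocab, word_to_id = build_vocab(tokens)

# Encode: every token becomes its integer id; unseen words fall back to 1 ([UNK]).
ids = [word_to_id.get(tok, 1) for tok in tokens]

print(tokens)
print(ids)
print(word_to_id.get("dragon", 1))  # "dragon" never appeared, so it maps to [UNK]
```

Notice that "Cat"/"cat" style duplicates disappear after lowercasing, and that punctuation becomes its own token, so the model can learn where sentences end.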

## Preparing Data for RNN Text Generation

### 1. Load the corpus


In [1]:
import numpy as np
import os
import kagglehub

# Download latest version
path = kagglehub.dataset_download("edenbd/children-stories-text-corpus")

print("Path to dataset files:", path)
DATA_FILE = "cleaned_merged_fairy_tales_without_eos.txt"
CORPUS_PATH = os.path.join(path, DATA_FILE)  # portable across OSes; avoids the invalid "\{" escape
print(CORPUS_PATH)
assert os.path.exists(CORPUS_PATH), f'Could not find {CORPUS_PATH}.'

with open(CORPUS_PATH, 'r', encoding='utf-8', errors='ignore') as f:
    raw_text = f.read()

print('Number of characters:', len(raw_text))
print('Sample preview (first 500 chars):\n')
print(f'{raw_text[:500]}\n')

# tiny cleanup
text = " ".join(raw_text.split())
text = text.lower()
print('After Cleanup:\n')
print('Number of characters:', len(text))
print('Sample preview (first 500 chars):\n')
print(text[:500])




Path to dataset files: C:\Users\amey9\.cache\kagglehub\datasets\edenbd\children-stories-text-corpus\versions\1
C:\Users\amey9\.cache\kagglehub\datasets\edenbd\children-stories-text-corpus\versions\1\cleaned_merged_fairy_tales_without_eos.txt
Number of characters: 20455694
Sample preview (first 500 chars):

The Happy Prince.
HIGH above the city, on a tall column, stood the statue of the Happy Prince.  He was gilded all over with thin leaves of fine gold, for eyes he had two bright sapphires, and a large red ruby glowed on his sword-hilt.
He was very much admired indeed.  “He is as beautiful as a weathercock,” remarked one of the Town Councillors who wished to gain a reputation for having artistic tastes; “only not quite so useful,” he added, fearing lest people should think him unpractical, which h

After Cleanup:

Number of characters: 20395326
Sample preview (first 500 chars):

the happy prince. high above the city, on a tall column, stood the statue of the happy prince. he was gilded
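
The "tiny cleanup" is just two operations, and the character counts above show its effect: 20,455,694 characters go down to 20,395,326, so about 60k extra whitespace characters are removed. Here is a small sketch on a snippet modelled on the preview (the double space is added on purpose), just to show what is kept and what is lost:

```python
raw = "The Happy Prince.\nHIGH above the city,  on a tall column."

# split() with no arguments cuts on ANY run of whitespace (spaces, tabs, newlines),
# so joining with a single space collapses them all.
clean = " ".join(raw.split()).lower()

print(clean)  # the happy prince. high above the city, on a tall column.
print(len(raw), "->", len(clean))
```

The trade-off: line breaks and capitalization are gone, so the model can never learn where a title or paragraph ends. For a first experiment that is usually fine.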

### 2. Tokenize (word-level), train, and generate
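
Before the full cell, it helps to see the training setup on a toy example. Next-word prediction is framed as a sliding window: the input is `SEQ` consecutive tokens and the target is the single token that comes right after them. The sketch below uses plain words instead of integer ids so it is readable; `make_windows` is a hypothetical helper, not part of Keras.

```python
def make_windows(tokens, seq_len):
    """Pair every run of seq_len tokens with the token that immediately follows it."""
    inputs, targets = [], []
    for i in range(len(tokens) - seq_len):
        inputs.append(tokens[i:i + seq_len])   # X = t[i : i+SEQ]
        targets.append(tokens[i + seq_len])    # y = t[i+SEQ]
    return inputs, targets

words = "once upon a time there was a cat".split()
X, y = make_windows(words, seq_len=3)

for x_i, y_i in zip(X, y):
    print(x_i, "->", y_i)
```

With 8 words and a window of 3 we get 5 training pairs, starting with `['once', 'upon', 'a'] -> time`. The real cell does the same thing with `SEQ_LEN = 64` over the whole corpus.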

In [2]:
! pip install tensorflow keras





In [4]:
import io, numpy as np, tensorflow as tf
from tensorflow.keras import layers, Sequential
from tensorflow.keras.layers import TextVectorization
# 2) Keras TextVectorization with our standardizer
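def standardize_keep_apostrophes(t):
    t = tf.strings.lower(t)
    # Protect in-word apostrophes (don't, cat's) with a placeholder made only of
    # uppercase letters. It cannot collide with the lowercased text, and the
    # punctuation rule below leaves it alone because it contains only \w characters.
    # (A placeholder with symbols such as "§" would itself be split apart.)
    t = tf.strings.regex_replace(t, r"([0-9A-Za-z])'([0-9A-Za-z])", r"\1APOSTROPHE\2")
    # Put spaces around every remaining punctuation mark so it becomes its own token
    t = tf.strings.regex_replace(t, r"([^\w\s])", r" \1 ")
    # Restore the protected apostrophes
    t = tf.strings.regex_replace(t, r"APOSTROPHE", "'")
    t = tf.strings.regex_replace(t, r"\s+", " ")
    return t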

vec = TextVectorization(
    standardize=standardize_keep_apostrophes,
    split="whitespace",
    output_mode="int",
    # max_tokens=None  # (optional) cap vocab if needed
)

vec.adapt(tf.constant([text]))
vocab = vec.get_vocabulary()
vocab_size = len(vocab)

# 3) Convert whole corpus to ids
ids = vec(tf.constant([text]))  # (1, T) or ragged depending on TF
ids = ids[0] if not isinstance(ids, tf.RaggedTensor) else ids.values
tokens = ids.numpy().astype("int32")

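# 4) Sliding-window dataset: X=t[i:i+SEQ], y=t[i+SEQ]
# targets[i] is paired with the window that STARTS at index i, so the targets
# must be shifted by SEQ_LEN (not by 1) to get the token right after each window.
SEQ_LEN, BATCH = 64, 256
ds = tf.keras.utils.timeseries_dataset_from_array(
    data=tokens[:-SEQ_LEN], targets=tokens[SEQ_LEN:],
    sequence_length=SEQ_LEN, sequence_stride=1, batch_size=BATCH
).prefetch(tf.data.AUTOTUNE)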

# small val split (~2%)
steps = max(1, (len(tokens)-1-SEQ_LEN)//BATCH)
val_steps = max(1, steps//50)
val_ds = ds.take(val_steps)
train_ds = ds.skip(val_steps)

# 5) Model: Embedding → LSTM → logits
model = Sequential([
    layers.Embedding(vocab_size, 256),  # no fixed input length: generate() may feed seeds shorter than SEQ_LEN
    layers.LSTM(512),
    layers.Dense(vocab_size)  # logits
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True, monitor="val_loss"),
    tf.keras.callbacks.ModelCheckpoint("day75_lstm.keras", save_best_only=True, monitor="val_loss")
]
model.fit(train_ds, validation_data=val_ds, epochs=6, callbacks=callbacks)

# 6) Sampling: temperature + top-k
def sample_logits(logits, temperature=0.9, top_k=50):
    logits = logits / max(temperature, 1e-6)
    if top_k and top_k > 0:
        k = min(top_k, logits.shape[-1])
        top_vals, top_idx = tf.math.top_k(logits, k=k)
        probs = tf.nn.softmax(top_vals).numpy()
        return int(top_idx.numpy()[np.random.choice(k, p=probs)])
    probs = tf.nn.softmax(logits).numpy()
    return int(np.random.choice(len(probs), p=probs))

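# Encode seed using SAME standardizer+vocab
# In the TextVectorization vocabulary, id 0 is padding ('') and id 1 is '[UNK]',
# so unknown seed words should fall back to 1, not 0.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(vocab), values=tf.constant(list(range(vocab_size)), dtype=tf.int64)
    ),
    default_value=tf.constant(1, dtype=tf.int64)
)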

def encode_seed(seed_text):
    std = standardize_keep_apostrophes(tf.constant([seed_text]))
    toks = tf.strings.split(std).values
    return table.lookup(toks).numpy().astype("int32")

id_to_word = dict(enumerate(vocab))

def generate(seed_text="once upon a time", num_tokens=80, temperature=0.9, top_k=50):
    from collections import deque
    seed_ids = encode_seed(seed_text)
    window = deque(seed_ids[-SEQ_LEN:], maxlen=SEQ_LEN)
    out_ids = list(seed_ids)
    for _ in range(num_tokens):
        x = np.array([list(window)], dtype=np.int32)
        logits = model(x, training=False).numpy()[0]
        nid = sample_logits(logits, temperature, top_k)
        out_ids.append(nid); window.append(nid)
    words = [id_to_word.get(i, "") for i in out_ids]
    return " ".join(w for w in words if w)

print("\n--- SAMPLE ---\n", generate("once upon a time", 100, 0.9, 50))
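
The two knobs in `sample_logits` are worth a closer look. Temperature rescales the logits before softmax (below 1 makes the model more confident, above 1 makes it more adventurous), and top-k throws away everything except the k most likely words before sampling. Here is a dependency-free sketch of the same idea on four made-up logits; `softmax` and `top_k_probs` are illustrative helpers, not the notebook's functions.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k, temperature=1.0):
    """Keep only the k largest logits, renormalize them, and give the rest probability 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept_probs = softmax([logits[i] for i in keep], temperature)
    probs = [0.0] * len(logits)
    for i, p in zip(keep, kept_probs):
        probs[i] = p
    return probs

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-word vocabulary

for t in (0.5, 1.0, 2.0):
    print("temperature", t, [round(p, 3) for p in softmax(logits, t)])

print("top-2:", [round(p, 3) for p in top_k_probs(logits, k=2)])

# Sampling the next word id from the truncated distribution
rng = random.Random(0)
next_id = rng.choices(range(len(logits)), weights=top_k_probs(logits, k=2))[0]
print("sampled id:", next_id)
```

Try different values when you experiment: a low temperature with a small k tends to give safe, repetitive stories, while a high temperature with a large k gives more surprising (and more broken) ones.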



Epoch 1/6
   19/18038 ━━━━━━━━━━━━━━━━━━━━ 3:51:53 772ms/step - accuracy: 0.0473 - loss: 9.8959

KeyboardInterrupt: 
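
Even this interrupted run tells us something for the evaluation step. Keras reports cross-entropy in nats, so `exp(loss)` is the perplexity: roughly, how many words the model is "choosing between" at each step. A model that guesses uniformly over `V` words scores a loss of `ln(V)`. The sketch below applies that to the loss in the log above; the notebook never prints `vocab_size`, so the 20,000 used here is only an illustrative figure, not the real vocabulary size.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is exp of the average cross-entropy (natural-log based)."""
    return math.exp(cross_entropy_nats)

def uniform_baseline_loss(vocab_size):
    """Loss of a model that assigns equal probability to every word."""
    return math.log(vocab_size)

reported_loss = 9.8959        # running loss after 19 batches, from the training log
assumed_vocab_size = 20_000   # hypothetical, for comparison only

print("perplexity:", round(perplexity(reported_loss)))
print("uniform baseline loss:", round(uniform_baseline_loss(assumed_vocab_size), 4))
```

A perplexity in the tens of thousands after 19 batches simply means the model is still close to random guessing, which is expected this early. When you let training run, watch `val_loss` fall and convert it to perplexity to compare experiments (different `SEQ_LEN`, LSTM size, or a capped `max_tokens`).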