# Day-75: Text Generation using RNN

In the last few days, we explored RNNs, LSTMs, GRUs, and Bidirectional Networks, learning how these models understand sequential data like text or time series.

Today, we’ll take it one step further — and build something creative:
A Text Generator using RNNs!

We’ll train an RNN on a children’s stories corpus and then have it generate new story text word by word. This is the same predict-the-next-word loop that ChatGPT and other AI writers rely on, although those models are built on Transformers rather than RNNs.

## Topics Covered

- About the Dataset

- Revisiting NLP Concepts We’ll Use

- Preparing Data for RNN Text Generation

- Building the RNN Model

- Generating Text Word-by-Word

- Evaluation & Experimentation

## The Dataset: Children Stories Text Corpus (Kaggle)

Guys, for any project, the first step is always the data! Our dataset is a fantastic collection of children's stories. Why stories? Because they have a sequence and a context.

`Analogy`:
- Think of this dataset as a massive library of bedtime stories.
- When a kid reads a lot of stories, they learn the structure: "Once upon a time..." is often followed by a character introduction.
- Our model is going to "read" this library and learn the grammar, the sentence structure, and the word-to-word dependencies to tell its own story.

We’ll be using the [Children Stories Text Corpus](https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus/data) from Kaggle.
It contains hundreds of story texts written for kids — simple grammar, repetitive sentence structures, and rich vocabulary — perfect for language generation tasks!

This kind of dataset helps our RNN learn storytelling patterns like:

- Sentence structure (subject → verb → object)

- Repetitive story motifs ("Once upon a time", "and then", etc.)

- Predictable transitions ("The next day...", "Suddenly...")

## Concepts from NLP we will use

We aren't starting from scratch! We'll leverage the power of foundational NLP concepts we covered previously.

| **NLP Concept**                        | **Day Covered**              | **How We Use It in Text Generation**                                                        | **Analogy**                                                                                                                  |
| :------------------------------------- | :--------------------------- | :------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------------- |
| Tokenization                           | Day 50                       | We break the stories into words or characters, creating a vocabulary.                       | Breaking a long book into individual words to check the frequency of each word.                                              |
| Lowercasing & Cleaning                 | Day 50                       | We normalize text to ensure uniformity — “Cat” and “cat” are treated the same.              | Making sure everyone wears the same uniform before a group photo.                                                            |
| Stopwords                              | Day 50                       | Usually, we keep stopwords here as they help maintain the flow of sentences.                | Keeping connecting words like “and” or “but” to ensure the story makes sense.                                                |
| Word Embeddings (e.g., Word2Vec/GloVe) | Day 52                       | We convert each token into a dense vector representation to capture semantic meaning.       | Giving each word a unique ID card that also contains information about what the word means and what words are similar to it. |
| Sequence Data                          | General RNN Concept          | RNNs are designed to handle ordered data, so word order defines meaning in text generation. | A detective trying to solve a crime — the order of events is crucial to understanding the full story.                        |
| Padding                                | (General Preprocessing)      | Ensures all input sequences are of the same length before feeding them into RNN.            | Making all sentences the same length by adding blank spaces at the start.                                                    |
| One-hot / Categorical Encoding         | (Used before model training) | Converts the target (next word) into categorical vectors for training.                      | Giving each possible next word its own “slot” in the prediction list.                                                        |
| Text Generation Loop                   | (Core to RNN Workflow)       | Repeatedly predicts the next word and appends it to the input sequence.                     | Like a storyteller who keeps adding one word at a time until the story ends.                                                 |


So yes — we’re reusing everything from Day 50–55!
Only this time, we’re not classifying or clustering text — we’re generating new text.
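
The table lists one-hot encoding for the next-word target. As a minimal sketch with toy numbers (not taken from the corpus), the snippet below shows that cross-entropy against a one-hot vector equals the negative log-probability at the integer class index. That equivalence is why integer targets with a sparse loss can stand in for explicit one-hot vectors when the vocabulary is large. The names `probs` and `target_id` and the 5-word vocabulary are illustrative assumptions.

```python
import numpy as np

# Toy predicted distribution over a hypothetical 5-word vocabulary.
probs = np.array([0.10, 0.05, 0.60, 0.20, 0.05])
target_id = 2  # integer index of the true next word

# One-hot version of the same target.
one_hot = np.zeros_like(probs)
one_hot[target_id] = 1.0

# Categorical cross-entropy against the one-hot vector.
loss_one_hot = -np.sum(one_hot * np.log(probs))

# "Sparse" cross-entropy: look up the probability of the true class directly.
loss_sparse = -np.log(probs[target_id])

print(loss_one_hot, loss_sparse)
```

Both numbers match, so the only practical difference is memory: a one-hot target needs one slot per vocabulary word, while the integer target needs a single number.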

## Preparing Data for RNN Text Generation

### 1. Load the corpus


In [1]:
import numpy as np
import os
import kagglehub
import re

# Download latest version
path = kagglehub.dataset_download("edenbd/children-stories-text-corpus")

print("Path to dataset files:", path)
DATA_FILE = "cleaned_merged_fairy_tales_without_eos.txt"
CORPUS_PATH = f"{path}/{DATA_FILE}"
print(CORPUS_PATH)
assert os.path.exists(CORPUS_PATH), f'Could not find {CORPUS_PATH}.'

with open(CORPUS_PATH, 'r', encoding='utf-8', errors='ignore') as f:
    raw_text = f.read()

print('Number of characters:', len(raw_text))
print('Sample preview (first 500 chars):\n')
print(f'{raw_text[:500]}\n')

# tiny cleanup
text = " ".join(raw_text.split())
text = text.lower()
# insert <eos> (end of sentence) after sentence enders
text = re.sub(r"([.!?])", r" \1 <eos>", text)
# space out other punctuation so they become tokens
text = re.sub(r'([,;:\-—"“”‘’()\[\]])', r' \1 ', text)
text = re.sub(r'\s+', ' ', text).strip()

print('After Cleanup:\n')
print('Number of characters:', len(text))
print('Sample preview (first 500 chars):\n')
print(text[:500])
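
To see what the cleanup does without downloading the corpus, here is a small sketch that wraps the same steps in a helper and runs them on one made-up sentence. The function name `clean_text` and the sample string are assumptions for illustration. The regular expressions are copied from the cell above.

```python
import re

def clean_text(raw: str) -> str:
    """Apply the same normalization steps as the cleanup cell to a raw string."""
    text = " ".join(raw.split()).lower()                     # collapse whitespace, lowercase
    text = re.sub(r"([.!?])", r" \1 <eos>", text)            # mark sentence ends
    text = re.sub(r'([,;:\-—"“”‘’()\[\]])', r" \1 ", text)   # isolate other punctuation
    return re.sub(r"\s+", " ", text).strip()                 # tidy up extra spaces

sample = 'Once upon a time, a fox said: "Hello!"  Then it ran away.'
print(clean_text(sample))
```

Note that every `.`, `!`, or `?` receives an `<eos>`, including ones inside quotations, so the marker can land before a closing quote.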



### 2. Tokenize (word-level)

In [2]:
# ! pip install tensorflow keras

In [2]:
# tokenizing words
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
NUM_WORDS = 10000
OOV_TOKEN = "<OOV>"
FILTER = ''

tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token=OOV_TOKEN, filters=FILTER)
tokenizer.fit_on_texts([text])
word_index = tokenizer.word_index
vocab_size = (min(NUM_WORDS, len(word_index)) + 1) if NUM_WORDS else len(word_index) + 1
tokens = tokenizer.texts_to_sequences([text])[0]

print(f"Vocabulary size: {vocab_size}")
print(f"Unique words (vocab): {len(word_index)}")
print(f"Number of tokens: {len(tokens)}")
print(f"Vocab used by model (input_dim): {vocab_size}")
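
If the index assignment feels opaque, the pure-Python sketch below approximates the idea: words are ranked by frequency, index 1 is reserved for the OOV token, index 0 is left free for padding, and any word ranked at `num_words` or beyond is mapped to OOV. This is a simplified stand-in written for illustration, not the Keras implementation. The helpers `build_word_index` and `encode` and the toy text are hypothetical.

```python
from collections import Counter

def build_word_index(text: str, oov_token: str = "<OOV>") -> dict:
    """Rank words by frequency; index 1 is the OOV token, index 0 stays free for padding."""
    counts = Counter(text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text: str, word_index: dict, num_words: int, oov_token: str = "<OOV>") -> list:
    """Map words to ids; unknown words or ids >= num_words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

toy = "the cat sat . <eos> the dog sat . <eos> the end"
word_index = build_word_index(toy)
print(word_index)
print(encode("the cat ran . <eos>", word_index, num_words=5))
```

Frequent words get small ids, so capping the vocabulary keeps the common words and folds the rare ones into a single OOV id.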



### 3. Sequence Building (sliding window -> next word)

In [4]:
WIN = 60
inputs, targets = [], []
for i in range(WIN, len(tokens)):
    inputs.append(tokens[i-WIN:i])
    targets.append(tokens[i])

X = np.array(inputs, dtype=np.int32)
y = np.array(targets, dtype=np.int32)

print("X:", X.shape, " y:", y.shape)
print(f"Preview of X:\n{X[:5]}")
print(f"Preview of y:\n{y[:5]}")

X: (4806732, 60)  y: (4806732,)
Preview of X:
[[   3  312  136    5    4  358  502    3  565    2   34    9  869 4745
     2  231    3 3825    8    3  312  136    5    4   11   14 5015   39
    98   26 1122  687    8  401  315    2   24  139   11   22  106  505
     1    2    6    9  310  283 5173 5383   34   17  833   29 7979    5
     4   11   14   67]
 [ 312  136    5    4  358  502    3  565    2   34    9  869 4745    2
   231    3 3825    8    3  312  136    5    4   11   14 5015   39   98
    26 1122  687    8  401  315    2   24  139   11   22  106  505    1
     2    6    9  310  283 5173 5383   34   17  833   29 7979    5    4
    11   14   67  114]
 [ 136    5    4  358  502    3  565    2   34    9  869 4745    2  231
     3 3825    8    3  312  136    5    4   11   14 5015   39   98   26
  1122  687    8  401  315    2   24  139   11   22  106  505    1    2
     6    9  310  283 5173 5383   34   17  833   29 7979    5    4   11
    14   67  114 3073]
 [   5    4  358  502 
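
The Python loop above builds millions of small lists before converting them to an array. As a hedged alternative, the sketch below produces the same windows with NumPy's `sliding_window_view` (available in NumPy 1.20 and later) and checks the result against the loop on a toy array. The names `toy_tokens` and `win` are stand-ins for `tokens` and `WIN`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy_tokens = np.arange(10, 20, dtype=np.int32)  # stand-in for the real token ids
win = 3                                          # stand-in for WIN

# Loop version, same logic as the cell above.
loop_X = np.array([toy_tokens[i - win:i] for i in range(win, len(toy_tokens))])
loop_y = toy_tokens[win:]

# Vectorized version: every length-`win` window except the last one,
# because the last window has no "next token" to predict.
view_X = sliding_window_view(toy_tokens, win)[:-1]
view_y = toy_tokens[win:]

print(view_X[:2])
print(view_y[:2])
```

The view shares memory with the original array and is read-only, so copy it with `np.array(view_X)` if a writable array is needed.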

### 4. Safer train/val split (avoid window leakage)

In [5]:
VAL_FRAC = 0.2
split = int(len(X) * (1 - VAL_FRAC))
gap = WIN   # leave a gap of one window

X_train, y_train = X[:split-gap], y[:split-gap]
X_val,   y_val   = X[split+gap:], y[split+gap:]

print("Train:", X_train.shape, y_train.shape, " Val:", X_val.shape, y_val.shape)

Train: (3845325, 60) (3845325,)  Val: (961287, 60) (961287,)
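
To convince ourselves that the gap really prevents leakage, the sketch below works out which token positions the train and validation windows touch, using the shapes printed above. Window `i` reads tokens `i` through `i + WIN - 1` and predicts token `i + WIN`. The helper names `split_with_gap` and `token_span` are hypothetical.

```python
def split_with_gap(n_windows: int, win: int, val_frac: float = 0.2):
    """Return (train_end, val_start) window indices with a gap of `win` on each side."""
    split = int(n_windows * (1 - val_frac))
    return split - win, split + win

def token_span(first_window: int, last_window: int, win: int):
    """First and last token positions touched by the windows (inputs plus target)."""
    return first_window, last_window + win

n_windows, win = 4806732, 60   # values from the printed shapes
train_end, val_start = split_with_gap(n_windows, win)

train_tokens = token_span(0, train_end - 1, win)
val_tokens = token_span(val_start, n_windows - 1, win)

print("train token span:", train_tokens)
print("val token span:  ", val_tokens)
print("overlap:", train_tokens[1] >= val_tokens[0])
```

Because the last training token sits before the first validation token, no validation window shares text with a training window.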


## Building the RNN Model

### 1. Define LSTM model with embedding

In [6]:
import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

LSTM_UNITS = 256
EMBEDDING_DIM = 128
VAL_DROPOUT = 0.2
ACTIVATION = "softmax"
LOSS = "sparse_categorical_crossentropy"

# WIN = context length of your sequences (must match your X shape)
model = Sequential([
    Input(shape=(WIN,)),
    Embedding(vocab_size, EMBEDDING_DIM, mask_zero=True),
    LSTM(LSTM_UNITS, return_sequences=True),
    Dropout(VAL_DROPOUT),
    LSTM(LSTM_UNITS, return_sequences=True),
    Dropout(VAL_DROPOUT),
    LSTM(LSTM_UNITS),
    Dense(vocab_size, activation=ACTIVATION)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4, clipnorm=1.0),
    loss=LOSS,
    metrics=["accuracy"]
)

model.summary()
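
Since the summary output is not shown here, we can estimate the parameter counts by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. It assumes `vocab_size` is 10001 (that is, `NUM_WORDS + 1`), which only holds if the corpus has at least 10,000 distinct tokens. Treat the numbers as an estimate to compare against your own `model.summary()`.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # One vector of length `dim` per vocabulary id.
    return vocab_size * dim

def lstm_params(input_dim: int, units: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

VOCAB = 10_001          # assumption: NUM_WORDS + 1
EMB, UNITS = 128, 256   # EMBEDDING_DIM and LSTM_UNITS from the cell above

counts = {
    "embedding": embedding_params(VOCAB, EMB),
    "lstm_1": lstm_params(EMB, UNITS),
    "lstm_2": lstm_params(UNITS, UNITS),
    "lstm_3": lstm_params(UNITS, UNITS),
    "dense": dense_params(UNITS, VOCAB),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<10} {n:>10,}")
print(f"{'total':<10} {total:>10,}")
```

Under this assumption, the embedding and the final dense layer together hold most of the parameters, because both scale with the vocabulary size. Dropout layers add none.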


In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
EPOCHS = 15
BATCH  = 256
MODEL_PATH = "day75_lstm_textgen.keras"

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, min_lr=1e-5, verbose=1),
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ModelCheckpoint(MODEL_PATH, monitor="val_loss", save_best_only=True)
]

history = model.fit(
    X_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH,
    validation_data=(X_val, y_val),
    shuffle=True,
    verbose=1,
    callbacks=callbacks
)

import matplotlib.pyplot as plt
plt.plot(history.history["loss"], label="train")
plt.plot(history.history["val_loss"], label="val")
plt.legend(); plt.title("Loss"); plt.show()


Epoch 1/15
   21/15021 ━━━━━━━━━━━━━━━━━━━━ 17:48:25 4s/step - accuracy: 0.0295 - loss: 9.0414
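
The progress line shows 15021 steps per epoch at roughly 4 s/step, which works out to many hours for a single epoch. The sketch below reproduces that arithmetic and shows how the step count would shrink if we kept only every `stride`-th window. Subsampling with a stride is one possible option, not something the notebook above does. The helper `windows_with_stride` is hypothetical, and `sec_per_step` is a rough figure read off the progress bar.

```python
import math

n_train = 3_845_325      # training windows, from the split cell
batch_size = 256         # BATCH
sec_per_step = 4.0       # rough figure from the progress bar

steps_per_epoch = math.ceil(n_train / batch_size)
hours_per_epoch = steps_per_epoch * sec_per_step / 3600
print(f"{steps_per_epoch:,} steps, ~{hours_per_epoch:.1f} h/epoch")

def windows_with_stride(n_windows: int, stride: int) -> int:
    """Number of windows kept when taking every `stride`-th window."""
    return math.ceil(n_windows / stride)

for stride in (1, 5, 20):
    kept = windows_with_stride(n_train, stride)
    steps = math.ceil(kept / batch_size)
    print(f"stride={stride:>2}: {kept:>9,} windows, {steps:>6,} steps, "
          f"~{steps * sec_per_step / 3600:.1f} h/epoch")
```

Neighbouring windows overlap in 59 of their 60 tokens, so dropping some of them loses less information than the raw counts suggest.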