Project Name: **AI Storytelling Adventure**

Project Description:
`This project creates an interactive AI-powered storytelling game using a deep learning model (LSTM) trained on Alice’s Adventures in Wonderland. The AI predicts and generates the next part of the story based on user input, making the game fun and dynamic.
The model learns how words connect, ensuring smooth and meaningful storytelling. The game is built with Streamlit/Flask, so users can play it online easily.
This project highlights AI, NLP, and Deep Learning, making it a great addition to a data science portfolio.
`

Date: 13/3/2025

# Scraping the Book

In [35]:
import requests
from bs4 import BeautifulSoup

In [36]:
# URL of the book on Project Gutenberg
BOOK_URL = "https://www.gutenberg.org/files/11/11-0.txt"

In [37]:
# function to scrap and clean the text
def scrape_book(url):
    response = requests.get(url)
    response.encoding = "utf-8"  # Ensure correct text encoding

    if response.status_code == 200:
        text = response.text

        # Remove Gutenberg's header and footer
        start_marker = "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES ***"
        end_marker = "*** END OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES ***"

        start_idx = text.find(start_marker) + len(start_marker)
        end_idx = text.find(end_marker)

        if start_idx != -1 and end_idx != -1:
            cleaned_text = text[start_idx:end_idx].strip()
        else:
            cleaned_text = text  # Fallback to full text if markers not found

        # Save to a file
        with open("alice_wonderland.txt", "w", encoding="utf-8") as file:
            file.write(cleaned_text)

        print("✅ Book scraped and saved as 'alice_wonderland.txt'")
    else:
        print("❌ Failed to fetch the book. Check the URL.")

In [38]:
# run the function
scrape_book(BOOK_URL)

✅ Book scraped and saved as 'alice_wonderland.txt'


In [39]:
print("✅ Book Scraping complete! Ready for Text Preprocessing.")

✅ Book Scraping complete! Ready for Text Preprocessing.


# PreProcess The Text

In [40]:
# Import some neccessary libraries
import numpy as np
import pandas as pd

import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re

In [41]:
# load the clean text file in lower case letter
with open("alice_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read().lower()

In [42]:
# Calculate one third of the text length
# half_length = len(text) // 2

# # Slice the text to keep only the first one third
# text = text[:half_length]

# # because of ram overloading

In [43]:
# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Replace multiple spaces with a single space
text = re.sub(r'\s+', ' ', text)

# Remove numbers (if they are not relevant)
text = re.sub(r'\d+', '', text)

# Strip leading and trailing whitespace
text = text.strip()

# Remove non-ASCII characters (if any)
text = re.sub(r'[^\x00-\x7F]+', '', text)

In [44]:
# Tokenize the text
tokenizer = Tokenizer(num_words=10000)  # Limit vocabulary size
tokenizer.fit_on_texts([text])

In [45]:
tokenizer.word_counts

OrderedDict([('start', 1),
             ('of', 511),
             ('the', 1640),
             ('project', 2),
             ('gutenberg', 2),
             ('ebook', 2),
             ('illustration', 1),
             ('alices', 13),
             ('adventures', 6),
             ('in', 367),
             ('wonderland', 3),
             ('by', 57),
             ('lewis', 1),
             ('carroll', 1),
             ('millennium', 1),
             ('fulcrum', 1),
             ('edition', 1),
             ('contents', 1),
             ('chapter', 24),
             ('i', 402),
             ('down', 102),
             ('rabbithole', 4),
             ('ii', 3),
             ('pool', 11),
             ('tears', 12),
             ('iii', 2),
             ('a', 632),
             ('caucusrace', 4),
             ('and', 846),
             ('long', 33),
             ('tale', 5),
             ('iv', 2),
             ('rabbit', 44),
             ('sends', 2),
             ('little', 129),
            

In [46]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'to': 3,
 'a': 4,
 'she': 5,
 'it': 6,
 'of': 7,
 'said': 8,
 'i': 9,
 'alice': 10,
 'in': 11,
 'you': 12,
 'was': 13,
 'that': 14,
 'as': 15,
 'her': 16,
 'at': 17,
 'on': 18,
 'with': 19,
 'all': 20,
 'had': 21,
 'but': 22,
 'for': 23,
 'so': 24,
 'be': 25,
 'very': 26,
 'not': 27,
 'what': 28,
 'this': 29,
 'little': 30,
 'they': 31,
 'he': 32,
 'out': 33,
 'its': 34,
 'down': 35,
 'is': 36,
 'one': 37,
 'up': 38,
 'his': 39,
 'about': 40,
 'if': 41,
 'then': 42,
 'no': 43,
 'were': 44,
 'like': 45,
 'know': 46,
 'them': 47,
 'would': 48,
 'went': 49,
 'again': 50,
 'herself': 51,
 'do': 52,
 'have': 53,
 'when': 54,
 'could': 55,
 'or': 56,
 'there': 57,
 'thought': 58,
 'off': 59,
 'time': 60,
 'me': 61,
 'queen': 62,
 'into': 63,
 'see': 64,
 'how': 65,
 'who': 66,
 'did': 67,
 'your': 68,
 'king': 69,
 'well': 70,
 'dont': 71,
 'began': 72,
 'my': 73,
 'by': 74,
 'mock': 75,
 'an': 76,
 'im': 77,
 'now': 78,
 'turtle': 79,
 'quite': 80,
 'hatter': 81,
 'gr

In [47]:
total_vocab = len(tokenizer.word_index) + 1
total_vocab

2758

In [48]:
# # Convert words to sequences
# sequences = []
# words = text.split()
# for i in range(1, len(words)):
#     sequence = words[:i+1]  # Create input-output pairs
#     sequences.append(sequence)

In [49]:
# # Convert words to numerical values
# # Handle unknown words by replacing them with a default value (e.g., 0)
# sequences = [[tokenizer.word_index.get(word, 0) for word in seq] for seq in sequences]

In [50]:
# sequences

In [51]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize and create sequences with the tokenizer
sequences = tokenizer.texts_to_sequences([text])[0]

# Define sequence length
seq_length = 50

# Generate input-output pairs
input_sequences = []
output_words = []

for i in range(seq_length, len(sequences)):
    input_seq = sequences[i-seq_length:i]  # Sequence of words as input
    output_word = sequences[i]  # The next word to predict
    input_sequences.append(input_seq)
    output_words.append(output_word)

# Pad sequences to ensure they all have the same length
X = pad_sequences(input_sequences, maxlen=seq_length)

# Convert output words into numpy array
import numpy as np
y = np.array(output_words)


In [52]:
X.shape

(26435, 50)

In [53]:
y.shape

(26435,)

In [54]:
# Convert y to one-hot encoded format
y = tensorflow.keras.utils.to_categorical(y, num_classes=len(tokenizer.word_index) + 1)

In [55]:
y.shape

(26435, 2758)

In [56]:
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (26435, 50)
y shape: (26435, 2758)


In [57]:
print("✅ Text preprocessing complete! Ready for LSTM training.")

✅ Text preprocessing complete! Ready for LSTM training.


# Building the LSTM Model

In [58]:
import tensorflow
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,LSTM,Embedding,Input, Dropout

In [59]:
X.shape[1]

50

In [60]:
total_vocab + 1

2759

In [61]:
model = Sequential()

model.add(Input(shape=(X.shape[1],)))
model.add(Embedding(input_dim=total_vocab+1,output_dim=50))
model.add(LSTM(64, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(total_vocab,activation='softmax'))

model.summary()

In [62]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [63]:
history = model.fit(X,y,epochs=100,batch_size=32,validation_split=0.2)

Epoch 1/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 8ms/step - accuracy: 0.0512 - loss: 6.5518 - val_accuracy: 0.0847 - val_loss: 6.1618
Epoch 2/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 7ms/step - accuracy: 0.0567 - loss: 5.8821 - val_accuracy: 0.0870 - val_loss: 6.1969
Epoch 3/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 9ms/step - accuracy: 0.0629 - loss: 5.8118 - val_accuracy: 0.0912 - val_loss: 6.1324
Epoch 4/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 8ms/step - accuracy: 0.0719 - loss: 5.6496 - val_accuracy: 0.0949 - val_loss: 6.0738
Epoch 5/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 8ms/step - accuracy: 0.0869 - loss: 5.4464 - val_accuracy: 0.1037 - val_loss: 6.0432
Epoch 6/100
[1m661/661[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 8ms/step - accuracy: 0.1024 - loss: 5.2844 - val_accuracy: 0.1141 - val_loss: 6.0064
Epoch 7/100
[1m66

# Model Prediction

In [68]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, model, tokenizer, seq_length, num_words=50, words_per_line=10):
    output_text = seed_text
    generated_words = []  # Store words separately for formatting

    for _ in range(num_words):
        # Tokenize and pad the input sequence
        encoded = tokenizer.texts_to_sequences([output_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, padding='pre')

        # Predict the next word
        predicted_index = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]

        # Convert index to word
        word = tokenizer.index_word.get(predicted_index, '')  # Default to empty if not found
        if not word:
            break  # Stop if no valid word is found

        generated_words.append(word)
        output_text += ' ' + word

    # Format the output in multiple lines
    formatted_text = ""
    for i in range(0, len(generated_words), words_per_line):
        formatted_text += " ".join(generated_words[i:i+words_per_line]) + "\n"

    return formatted_text.strip()  # Remove trailing newlines

# Get input from the user
seed_text = input("Enter a starting phrase: ")

# Generate and display the output
generated_text = generate_text(seed_text, model, tokenizer, seq_length, num_words=50)

print("\n📝 **Generated Story** 📝\n")
print(generated_text)


Enter a starting phrase: Who in the world am I?"

📝 **Generated Story** 📝

can do to be no use said to the things
went on without attending to her if you had said
the hatter i can herself out of the window and
after a few minutes it was looking for the fan
and gloves and the baby was howling alternately without a


In [65]:
# User_Input Example
'''🔹
Fantasy / Wonderland Style

"Alice stepped through the door and found herself in"
"The white rabbit looked at Alice and said,"
"In the middle of the forest, Alice discovered a"
"The Mad Hatter grinned and whispered,"
"Alice picked up the strange golden key and"

🔹 Mystery / Adventure

"The moment Alice touched the mirror, she felt"
"The Queen of Hearts turned to Alice and shouted,"
"Alice followed the Cheshire Cat through the fog and"
"There was a hidden note on the teacup that said,"
"Alice opened the mysterious book and saw"

'''

'🔹 \nFantasy / Wonderland Style\n\n"Alice stepped through the door and found herself in"\n"The white rabbit looked at Alice and said,"\n"In the middle of the forest, Alice discovered a"\n"The Mad Hatter grinned and whispered,"\n"Alice picked up the strange golden key and"\n\n🔹 Mystery / Adventure\n\n"The moment Alice touched the mirror, she felt"\n"The Queen of Hearts turned to Alice and shouted,"\n"Alice followed the Cheshire Cat through the fog and"\n"There was a hidden note on the teacup that said,"\n"Alice opened the mysterious book and saw"\n\n'