# Fairytale Generator Project

We set out to build a fairytale generator based mainly on N-gram models. Our primary goal is to give parents who struggle with creative storytelling a model that generates customizable fairytales for their children every evening.

We used the tools covered in the lectures and wrote additional code to generate more complex structures.

1) First, we chose Grimms' Fairy Tales by Jacob and Wilhelm Grimm as our data set, since the book is a renowned collection of folk tales compiled in the early 19th century.

The following lines of code, which we covered in lectures, retrieve the book from Project Gutenberg's website: they open the book's URL, read the data, and decode it into a readable string.




In [2]:
#Retrieve the book using its link and decode it with the appropriate encoding.

from urllib.request import urlopen # for reading websites

url = "https://www.gutenberg.org/ebooks/2591.txt.utf-8" #declare the url
f = urlopen(url) #open the URL and get a file-like response object
grimms_fairytale = f.read().decode('utf-8-sig') #read the bytes and decode them as UTF-8 ('utf-8-sig' also strips a leading byte order mark)

2) Here we remove the Project Gutenberg header and footer so that they are not generated into the final text.

In [3]:
#Get rid of the Project Gutenberg header and footer that surround the book text.

import nltk #imports the NLTK library for text processing
nltk.download('punkt_tab') #downloads the Punkt tokenizer data used by NLTK

start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS' FAIRY TALES ***"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS' FAIRY TALES ***"

# Split the text
parts = grimms_fairytale.split(start_marker) #splits the full 'grimms_fairytale' text at 'start_marker'
if len(parts) > 1: #the start marker was found in the text
    story = parts[1].split(end_marker)[0].strip() #keeps the text between the two markers and trims surrounding whitespace
else: #fail loudly instead of leaving 'story' undefined for the later cells
    raise ValueError("Start marker not found; check the marker text against the downloaded file.")



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


3) Before writing this code block, we tried the cleanup technique from the lectures, but this approach worked better as the initial cleanup. The block splits the full text into individual tales and discards short fragments that are not stories.
It uses a regular expression to find story boundaries.

In [4]:
import re

raw_stories = re.split(r'\n[A-Z][A-Z\s\-]+\.?\n', story) #splits the story into potential individual tales using uppercase title patterns as dividers
raw_stories = [s.strip() for s in raw_stories if len(s.strip()) > 300]  #remove junk, filter out short, non-story parts
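
To see what the title pattern does, here is a small sketch on a made-up two-tale sample. The sample text is an assumption for illustration and is not taken from the book. It uses the same regular expression as the cell above, without the 300-character filter, since the toy tales are short.

```python
import re

# Same pattern as in the cell above: a line made of capital letters,
# whitespace, or hyphens (optionally ending in a period) counts as a title.
TITLE_PATTERN = r'\n[A-Z][A-Z\s\-]+\.?\n'

# Hypothetical miniature "book" with two titled tales.
sample = (
    "\nTHE GOLDEN BIRD\n\n"
    "A certain king had a garden.\n\n"
    "HANS IN LUCK\n\n"
    "Some men are born to good luck.\n"
)

# Split at the titles and drop empty pieces.
pieces = [p.strip() for p in re.split(TITLE_PATTERN, sample) if p.strip()]
print(pieces)
```

The titles themselves are discarded by the split, so only the tale bodies remain. On the real book, the length filter then removes leftovers such as the table of contents.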


4) This block performs word tokenization using NLTK.

In [5]:
import nltk
nltk.download('punkt')

def tokenize(text): # Defines a function named 'tokenize' that takes one argument, 'text'.
    return nltk.word_tokenize(text) # Uses NLTK's 'word_tokenize' function to split the input 'text' into a list of individual word tokens and returns this list.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


5) Thanks to your feedback, we used spaCy for more advanced text preprocessing. It performs named entity recognition, and we replace the recognized entities with placeholders. This was hard to do without spaCy, which is better at named entity recognition.

Also based on your feedback, we train a separate model for each setting.

In [6]:
import spacy
import re #regex for pattern matching in text
import nltk #nltk for text processing
import string #string module = collection of string constants including punctuation.

#Load spaCy model
nlp = spacy.load("en_core_web_sm") #this line loads small English model from spaCy for named entity recognition

#Hero role words to normalize
hero_types_list = ["princess", "knight", "wizard", "hunter", "peasant", "witch","king","prince","queen"] #common hero types found in fairytales, replaced by placeholders later

#Define setting keywords: a dictionary whose keys are setting names and whose values are lists of keywords associated with that setting
setting_keywords = {
    "forest": ["forest", "woods", "tree"],
    "castle": ["castle", "tower", "palace"],
    "village": ["village", "town", "hamlet"],
    "cave": ["cave", "rock", "underground"]
}

#Initialize setting-specific corpus
#here we create separate empty lists which will become the specialized corpus for each setting.
corpus_by_setting = {s: [] for s in setting_keywords.keys()}

#This function classifies a story into one of the settings defined above, based on keywords in the text
def detect_setting(text):
    text = text.lower() #converts input text to lowercase to ensure case-insensitive matching
    for setting, words in setting_keywords.items(): #iterates through each setting in dictionary order
        if any(word in text for word in words): #checks if any keyword of the current setting occurs in the text (substring match)
            return setting #returns the first setting whose keyword matches
    return "unknown" #if no keywords are found returns unknown

#Replace real entities with placeholders for customization
def replace_with_placeholders(text):
    doc = nlp(text) #processes the input text using spaCy. This creates a Doc object that holds the detected entities
    modified = text #working copy of the text in which entities are replaced by placeholders

    for ent in doc.ents: #iterates through each named entity detected in the doc object
        if ent.label_ == "PERSON": #checks if the current entity's label indicates a person
            modified = modified.replace(ent.text, "[PERSON_1]") #if it is a person, replaces that entity in modified with the placeholder [PERSON_1]
        elif ent.label_ in {"GPE", "LOC"}: #this is for the setting: checks whether the entity's label indicates a location
            modified = modified.replace(ent.text, "[SETTING]") #if it is a location, replaces that entity in modified with the placeholder [SETTING]

    for role in hero_types_list:
        modified = re.sub(rf"\b{role}\b", "[HERO_TYPE]", modified, flags=re.IGNORECASE) #uses a regex to replace all whole-word occurrences of the current 'role' in modified with the placeholder [HERO_TYPE]

    return modified #returns the modified string with all placeholders inserted

#Process stories using spaCy-based placeholder replacement
for tale in raw_stories:
    setting = detect_setting(tale) #calls detect_setting to determine the setting of the current tale
    if setting != "unknown": #checks if a specific setting was detected
        tale = replace_with_placeholders(tale) #if a setting is known, this line applies replace_with_placeholders to tale
        sents = nltk.sent_tokenize(tale) #splits modified tale into individual sentences using sentence tokenizer
        tokens = [nltk.word_tokenize(sent) for sent in sents] #splits each sentence into word tokens
        for sentence in tokens:
            cleaned = [w.lower() for w in sentence if w not in string.punctuation] #lowercases each token and drops punctuation tokens
            corpus_by_setting[setting] += ['<s>'] + cleaned + ['</s>'] #appends cleaned words of the sentence along with sentence start and end markers to the list associated with its setting
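
Because `detect_setting` matches substrings and returns the first setting that hits, a word like "street" counts as "tree" and the dictionary order decides ties. As a possible alternative (not what the cell above does), the sketch below counts whole-word keyword hits and picks the setting with the most. The function name `detect_setting_by_votes` and the example sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Same keyword lists as in the cell above.
SETTING_KEYWORDS = {
    "forest": ["forest", "woods", "tree"],
    "castle": ["castle", "tower", "palace"],
    "village": ["village", "town", "hamlet"],
    "cave": ["cave", "rock", "underground"],
}

def detect_setting_by_votes(text, keywords=SETTING_KEYWORDS):
    """Hypothetical variant: return the setting with the most whole-word keyword hits."""
    word_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {s: sum(word_counts[k] for k in kws) for s, kws in keywords.items()}
    best = max(scores, key=scores.get)  # ties go to the first setting in the dictionary
    return best if scores[best] > 0 else "unknown"

example = "The street led past a castle and a palace near the forest."
print(detect_setting_by_votes(example))
```

Here "castle" wins with two hits against one for "forest", and "street" no longer counts as "tree". Plural forms such as "trees" are not matched by this sketch, so the keyword lists would need them added.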


In [7]:
#Count occurrences of n-grams
from collections import Counter

#Initialize empty dictionaries to hold the n-gram models for each setting
unigrams_by_setting = {}   #(single words)
bigrams_by_setting = {}    #(pairs of words)
trigrams_by_setting = {}   #(triplets of words)

#Loop over each setting and its associated tokenized corpus
for setting, corpus in corpus_by_setting.items():

    # Count the frequency of each individual word (unigram) in the corpus
    unigrams_by_setting[setting] = Counter(corpus)

    # Generate a list of bigrams by pairing each word with the next
    #Count the frequency of each bigram and store it
    bigrams = [(corpus[i], corpus[i+1]) for i in range(len(corpus) - 1)]
    bigrams_by_setting[setting] = Counter(bigrams)

    # Generate a list of trigrams by pairing each word with the next two
    # Count the frequency of each trigram and store it
    trigrams = [(corpus[i], corpus[i+1], corpus[i+2]) for i in range(len(corpus) - 2)]
    trigrams_by_setting[setting] = Counter(trigrams)
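
A small sketch of the same counting on a toy token list may help to see what ends up in the counters. The helper `count_ngrams` and the toy tokens are assumptions for illustration; the helper is a compact equivalent of the index-based list comprehensions above.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams of length n by zipping the token list with shifted copies of itself."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical two-sentence corpus in the same flat format as corpus_by_setting.
toy_tokens = ['<s>', 'the', 'king', 'slept', '</s>',
              '<s>', 'the', 'king', 'woke', '</s>']

toy_bigrams = count_ngrams(toy_tokens, 2)
toy_trigrams = count_ngrams(toy_tokens, 3)

print(toy_bigrams[('the', 'king')])
print(toy_bigrams[('</s>', '<s>')])
print(toy_trigrams[('<s>', 'the', 'king')])
```

Since all sentences of a setting are stored in one flat list, n-grams that cross a sentence boundary, such as `('</s>', '<s>')`, are counted as well.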


6) Here we add placeholder tokens like PERSON1TOKEN and HEROTYPETOKEN to make the model focus on story structure instead of specific names.

We learned in class that using consistent tokens helps the model generalize better. We also added some fake lines about the character and setting throughout the text to reinforce common fairy tale patterns during training.

In [8]:

def insert_fake_tokens(text, setting):
    #Keywords used for training only
    hero_names = ["Hansel", "Gretel", "Jack", "Aria", "Finn", "Luna", "Kai", "Crabb"]
    hero_types = ["princess", "knight", "wizard", "hunter", "peasant", "witch","prince"]
    settings = ["forest", "castle", "cave", "village"]

    #Injected text with placeholders
    injected_lines = (
        f"PERSON1TOKEN the HEROTYPETOKEN lived in the {setting}. " #first placeholder sentence
        f"Every day, PERSON1TOKEN wandered the {setting}. "        #second placeholder sentence
        f"Everyone in the {setting} knew PERSON1TOKEN was a brave HEROTYPETOKEN. " #third placeholder sentence
    ) * 3 #repeats the placeholder sentences 3 times

    #Periodically insert the fake lines
    chunks = text.split("\n\n") #splits text into paragraphs
    augmented_chunks = [] #initializes list for augmented chunks
    for i, chunk in enumerate(chunks):
        if i % 5 == 0: #checks if the current chunk is every 5th one (starting with the first)
            augmented_chunks.append(injected_lines) #inserts the placeholder lines before every 5th chunk
        augmented_chunks.append(chunk) #adds the original chunk to the augmented list
    text = "\n\n".join(augmented_chunks) #joins them back into single text

    #Replace keywords with placeholders
    import re
    for name in hero_names:
        text = re.sub(rf"\b{name}\b", "PERSON1TOKEN", text, flags=re.IGNORECASE) #iterates through every hero name and replaces it with PERSON1TOKEN
    for role in hero_types:
        text = re.sub(rf"\b{role}\b", "HEROTYPETOKEN", text, flags=re.IGNORECASE) #iterates through every hero type and replaces it with HEROTYPETOKEN
    for s in sorted(settings, key=lambda s: -len(s)): #longest setting names first
        text = re.sub(rf"\b{s}\b", "SETTINGTOKEN", text, flags=re.IGNORECASE) #iterates through every setting and replaces it with SETTINGTOKEN

    return text #finally returns text with injected lines and replaced keywords

7) Here we build n-gram models (bigrams, trigrams, and 4-grams) for each setting. This step trains our model to predict the next word from the previous ones, depending on the setting of the story. The probabilities are estimated from the n-gram counts of the setting's corpus. To handle unseen word combinations we apply Lidstone smoothing, which adds a small constant (0.1 here) to all counts to avoid zero probabilities and improve generalization.

In [9]:
from nltk.util import ngrams
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, LidstoneProbDist #imports probability classes

models_by_setting = {} #dictionary to store models per setting

for setting, corpus in corpus_by_setting.items(): #loops through each setting and its tokens
    bigrams = list(ngrams(corpus, 2))
    trigrams = list(ngrams(corpus, 3))
    fourgrams = list(ngrams(corpus, 4))

    cf2 = ConditionalFreqDist((w1, w2) for w1, w2 in bigrams) #these lines count, for each context, how often each next word follows it
    cf3 = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)
    cf4 = ConditionalFreqDist(((w1, w2, w3), w4) for w1, w2, w3, w4 in fourgrams)

    # The '0.1' is the gamma parameter of Lidstone smoothing (the constant added to every count).
    cp2 = ConditionalProbDist(cf2, LidstoneProbDist, 0.1)
    cp3 = ConditionalProbDist(cf3, LidstoneProbDist, 0.1)
    cp4 = ConditionalProbDist(cf4, LidstoneProbDist, 0.1)

    models_by_setting[setting] = { #this dictionary stores all models for current setting
        "cp2": cp2,
        "cp3": cp3,
        "cp4": cp4
    }
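
To make the smoothing concrete, here is a plain-Python sketch of the Lidstone estimate, P(w | context) = (count + gamma) / (N + gamma * B), where N is the total count for the context and B is the number of bins. The helper `lidstone_probs` and the toy counts are assumptions for illustration. The sketch also assumes that B defaults to the number of distinct words seen after the context, which is how we understand NLTK's `LidstoneProbDist` to behave when no `bins` argument is given.

```python
def lidstone_probs(counts, gamma=0.1, bins=None):
    """Hypothetical helper: Lidstone-smoothed probabilities for one context.

    counts maps each observed next word to its count after the context.
    """
    total = sum(counts.values())                  # N: all observations for this context
    b = len(counts) if bins is None else bins     # B: assumed default, number of seen words
    return {w: (c + gamma) / (total + gamma * b) for w, c in counts.items()}

# Toy counts: words observed after the context "the".
counts_after_the = {"king": 3, "queen": 1}
probs = lidstone_probs(counts_after_the)

for word, p in probs.items():
    print(word, round(p, 4))
```

With these counts, "king" gets 3.1 / 4.2 and "queen" gets 1.1 / 4.2, so the smoothing moves a little probability from frequent to rare continuations.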

8) This is where we generate a fairy tale with the N-gram models, starting from a randomly chosen starter template. The function uses a back-off strategy for word prediction: it first tries the 4-gram model, then falls back to the trigram model, and then to the bigram model. As a final fallback, if no N-gram context is available, it selects a random word from the setting-specific corpus. After generating the text, it replaces placeholders such as [PERSON_1] with the user's character and setting values to produce the finished story.

In [10]:
import random
import re #used by capitalize_sentences below

def custom_fairytale_with_format(hero=None, hero_type=None, setting=None, max_len=300): #defines story generation function
    #Define multiple possible starter phrases with placeholders
    starters = [  #here we made a list of possible fairytale starter templates based on conventional fairytale starters
        ["Once", "upon", "a", "time", "the", "[HERO_TYPE]", "[PERSON_1]", "was", "in", "the", "[SETTING]"],
        ["Long", "ago", "in", "a", "[SETTING]", "there", "lived", "a", "[HERO_TYPE]", "named", "[PERSON_1]"],
        ["In", "the", "heart", "of", "the", "[SETTING]", "[PERSON_1]", "the", "[HERO_TYPE]", "awoke"],
        ["A", "[HERO_TYPE]", "called", "[PERSON_1]", "once", "journeyed", "through", "the", "[SETTING]"],
        ["Many", "years", "ago,", "a", "[HERO_TYPE]", "named", "[PERSON_1]", "guarded", "the", "[SETTING]"]
    ]
    #Choose one randomly
    starter = random.choice(starters) # here we choose one of the starters randomly

    words = starter[:] #initializes story words with starter
    current_phrase = tuple(starter[-3:]) #sets initial ngram context

    # Fetch language models and backup corpus
    cp4 = models_by_setting[setting]["cp4"] #gets 4gram model for setting
    cp3 = models_by_setting[setting]["cp3"] #gets 3gram model ""
    cp2 = models_by_setting[setting]["cp2"] #gets 2gram model ""
    fallback_corpus = corpus_by_setting[setting] #this gets full corpus for fallback

    #Generate the next word with a back-off model: try the 4-gram model first,
    #then the trigram model, then the bigram model. A lookup fails (raises an
    #exception) when there is no data for the context.

    used_trigrams = set() #keeps track of trigrams already used, to avoid immediate repetition

    def sample_from(model, context): #picks a next word in proportion to its probability
        dist = model[context]
        options = list(dist.samples()) #next word options
        probs = [dist.prob(w) for w in options] #probabilities
        return random.choices(options, weights=probs, k=1)[0]

    for _ in range(max_len - 3): #loops to generate words up to max_len
        #cp4 and cp3 use tuples as context; cp2 uses a single word (a string, not a tuple)
        attempts = [(cp4, current_phrase), (cp3, current_phrase[1:]), (cp2, current_phrase[2])]
        next_word = None
        for model, context in attempts:
            try:
                next_word = sample_from(model, context)
                break
            except Exception: #no data for this context, fall back to the smaller model
                pass
        if next_word is None: #all n-gram models failed, fall back to a random word from this setting's corpus
            next_word = random.choice(fallback_corpus)

        #Skip repeated trigrams
        trigram = (current_phrase[0], current_phrase[1], next_word) #forms trigram for the check
        if trigram in used_trigrams: #if this trigram was already used, skip the word and try again
            continue
        used_trigrams.add(trigram) #otherwise remember the trigram

        words.append(next_word)
        current_phrase = (current_phrase[1], current_phrase[2], next_word)

    # Process words into sentences
    sentences = [] #list to store finished sentences
    current_sentence = [] #list to build current sentence

    for word in words:
        if word == "<s>": #skip sentence start marker
            continue
        elif word == "</s>": #sentence end marker: close the current sentence
            if current_sentence: #only if the sentence has words
                sentence = ' '.join(current_sentence) #join words into a sentence string
                if sentence[-1] not in '.!?': #add punctuation if missing
                    sentence += '.'
                sentences.append(sentence) #add sentence to list
                current_sentence = [] #reset for next sentence
        else:
            current_sentence.append(word)

    #Convert full text
    story_text = ' '.join(sentences) #joins sentences into final text

    starter = [word.replace("[HERO_TYPE]", hero_type) #replace each in starter
                    .replace("[PERSON_1]", hero)
                    .replace("[SETTING]", setting) for word in starter]

    #Fixing broken placeholder variants
    story_text = story_text.replace("person_1", "[PERSON_1]") #tokenization and lowercasing turned the placeholders into plain lowercase words; restore each one
    story_text = story_text.replace("hero_type", "[HERO_TYPE]")
    story_text = story_text.replace("setting", "[SETTING]")

    #Substitute placeholders after generation
    #After training and generation, turns the template back into a real story
    story_text = story_text.replace("[PERSON_1]", hero) #replace each placeholder with actual names
    story_text = story_text.replace("[HERO_TYPE]", hero_type)
    story_text = story_text.replace("[SETTING]", setting)

    #Clean spacing
    story_text = story_text.replace(" .", ".").replace(" ,", ",").replace(" !", "!").replace(" ?", "?") #fix spacing
    story_text = story_text.replace(" '", "'").replace(" n't", "n't").replace(" ’", "’")


    import textwrap

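    def capitalize_sentences(text): #This function capitalizes the first letter of each sentence.
        sentences = re.split(r'(?<=[.!?]) +', text)
        #upper-case only the first character; str.capitalize() would also lower-case the rest, including the hero's name
        return ' '.join(s[:1].upper() + s[1:] for s in sentences)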

    # This part was added for a cleaner output. Since the Project Gutenberg input uses typographic quotation
    # marks, some stray apostrophes appeared in the output where they did not belong. The following two
    # lines turn typographic quotation marks into their ASCII equivalents.

    story_text = story_text.replace("‘", "'").replace("’", "'")
    story_text = story_text.replace("“", '"').replace("”", '"')

    story_text = capitalize_sentences(story_text)
    wrapped = textwrap.fill(story_text, width=80) #The final output was appearing as a single line, we fixed this issue
    #by using textwrap and setting the width to 80 characters so the output would be more readable especially in colab

    return wrapped
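
The back-off idea in the generation function can be shown in isolation. The sketch below is a simplified stand-in that uses raw counts in plain dictionaries instead of the NLTK probability objects, so it runs without the trained models. The names `build_tables` and `next_word_backoff` and the toy corpus are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """Hypothetical helper: for n = 2..max_n, map each (n-1)-word context to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def next_word_backoff(history, tables, vocabulary):
    """Try the longest context first; back off to shorter ones; finally pick a random word."""
    for n in sorted(tables, reverse=True):
        context = tuple(history[-(n - 1):])
        followers = tables[n].get(context)  # .get avoids creating empty entries
        if followers:
            words = list(followers)
            weights = list(followers.values())
            return random.choices(words, weights=weights, k=1)[0]
    return random.choice(vocabulary)

toy_corpus = "<s> the king went into the forest </s> <s> the king slept </s>".split()
toy_tables = build_tables(toy_corpus)

# The 4-gram and trigram contexts are unseen, so the bigram context ("forest",) decides.
print(next_word_backoff(["a", "b", "forest"], toy_tables, toy_corpus))
```

In this toy corpus the only word that follows "forest" is `</s>`, so the call always returns the end marker after backing off twice.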




9) This helper function returns the most likely next word based on a 4-gram model.

In [11]:
def get_next_word(w1, w2, w3, cp4): #defines function to get next word
    try: #most likely word
        return cp4[(w1, w2, w3)].max() #returns the most probable next word for this context
    except Exception: #the context was not found in the model
        return "</s>"  # fallback token

10) This block defines a generate_story function. We use it to construct a story word by word with a 4-gram model, always taking the most likely next word. It starts from a given seed and stops when the maximum length is reached, or at the first sentence end after 100 tokens.

In [12]:
def generate_story(setting, seed=("<s>", "<s>", "once"), max_len=300): #builds a story for the given setting
    story = list(seed) #initializes story with a copy of the seed
    cp4 = models_by_setting[setting]["cp4"] #gets 4gram model of the setting
    while len(story) < max_len: #loops until the story reaches max_len tokens
        w1, w2, w3 = story[-3], story[-2], story[-1]
        next_word = get_next_word(w1, w2, w3, cp4) #predicts next word
        story.append(next_word) #adds word to story
        if next_word == "</s>" and len(story) > 100:
            break #stops at a sentence end once the story has more than 100 tokens

    return " ".join(story[3:])  # skip the seed tokens and return joined story
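
Unlike the sampling function in step 8, this kind of generation always takes the most frequent continuation, so the same seed produces the same text every time. The sketch below shows that behaviour on a toy token list with plain counters. The function `greedy_generate` and the toy tokens are assumptions for illustration, and the sketch stops at the first end marker to keep the example short.

```python
from collections import Counter, defaultdict

def greedy_generate(tokens, seed, max_len=20):
    """Hypothetical helper: extend the seed with the most frequent 4-gram continuation."""
    followers = defaultdict(Counter)
    for a, b, c, d in zip(tokens, tokens[1:], tokens[2:], tokens[3:]):
        followers[(a, b, c)][d] += 1

    out = list(seed)
    while len(out) < max_len:
        options = followers.get(tuple(out[-3:]))
        if not options:  # unseen context: stop instead of guessing
            break
        word = options.most_common(1)[0][0]
        out.append(word)
        if word == "</s>":
            break
    return out

toy_tokens = "<s> <s> once there was a king </s>".split()
result = greedy_generate(toy_tokens, toy_tokens[:3])
print(" ".join(result[3:]))
```

With a single training sentence, the greedy walk reproduces that sentence exactly, which is why sampled generation tends to give more varied stories.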

11) Finally, this code block handles user interaction and story generation. It prompts the user for the hero's name, the hero's type, and the setting, then uses this input to generate a customized fairytale.

In [None]:
# Get user input
user_hero = input("Enter the hero's name: ") #each line prompts the user for a customized detail and stores it
user_hero_type = input("Enter the hero's type (e.g., princess, knight, witch, peasant, wizard, hunter): ")
user_setting = input("Enter the setting (e.g., forest, castle, cave, village): ")

generated = custom_fairytale_with_format(setting=user_setting, hero=user_hero, hero_type=user_hero_type) #generation function passing user inputs

print(generated)
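
Since the generation function looks up the models by the exact setting name, an input such as "Forest" or "beach" raises a KeyError. A minimal sketch of input normalization follows; the helper name `normalize_choice` and the `KNOWN_SETTINGS` set are assumptions for illustration, with the set mirroring the keys of `setting_keywords`.

```python
def normalize_choice(raw, allowed, label):
    """Hypothetical helper: trim and lowercase user input, then check it against the allowed values."""
    choice = raw.strip().lower()
    if choice not in allowed:
        options = ", ".join(sorted(allowed))
        raise ValueError(f"Unknown {label} '{raw}'. Choose one of: {options}")
    return choice

# Assumed to match the settings that models were trained for.
KNOWN_SETTINGS = {"forest", "castle", "village", "cave"}

print(normalize_choice("  Forest ", KNOWN_SETTINGS, "setting"))
```

The normalized value could then be passed as `setting` to the generation function, while the hero's name is best left as typed so that its capitalization is preserved.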