This notebook examines several ways of generating text. I wanted to create a plot generator: something that would generate story ideas for different genres. The notebook walks through the approaches I tried: hard-coded text combination, an LSTM deep learning model, an RNN model based on TextGenRNN, a custom Markov chain model, and a model based on Markovify. You can copy the notebook and use the files included in the repo to experiment with the different text generation methods.

Credits for all algorithms can be found in the repo's README.

At first, I tried hard-coding a function to generate text. This function combines pre-defined terms together to generate text. The theme here is "Sci-fi".

In [1]:
import random

def plot_gen(num_gen):
    i = 0
    # Note: `<=` makes this loop run num_gen + 1 times, so plot_gen(5) prints six plots
    while i <= num_gen:
        setting = random.choice(
                ["future Tokyo", "future New York", "a utopia", "a dystopia", "a virtual world", "a base on the Moon",
                 "the heart of Silicon Valley", "a city under the ocean", "a massive underground facility", "an artificial island"])
        gender = random.choice(
                ["male ", "male ", "female ", "female ", "robot ", "third gender "])
        classs = random.choice(
                ["hacker", "cyborg", "engineer", "corporate employee", "street rat", "soldier", "doctor", "detective", "pilot", "writer"])
        protagonist = gender + classs
        antagonist = random.choice(
                ["a massive corporation", "a rogue AI", "a powerful street gang", "a secret society", "a disruptive technology", "robots",
                 "internet trolls", "a virus", "a group of aliens", "a corrupt government", "new pirates/bandits"])
        conflict = random.choice(
            ["falls in love with ", "attempts to stop ", "fights against ", "flees from ", "exceeded beyond ", "defends against ", "tries to befriend ",
             "explores with ", "competes with ", "seeks revenge against "])
        print("In" + " " + setting + ", there is a" + " " + protagonist + " " + "who" + " " + conflict + antagonist + ".")
        i += 1

plot_gen(5)

In a city under the ocean, there is a male engineer who seeks revenge against internet trolls.
In the heart of Silicon Valley, there is a female soldier who exceeded beyond a corrupt government.
In a massive underground facility, there is a female writer who tries to befriend a virus.
In a virtual world, there is a robot hacker who exceeded beyond a corrupt government.
In the heart of Silicon Valley, there is a male engineer who exceeded beyond a powerful street gang.
In future Tokyo, there is a male writer who flees from a massive corporation.


The technique produced acceptable results, but it was very limited: it can only recombine the phrases I typed in. For this reason, I wanted a more robust generation function, and I turned to deep learning to accomplish this. After collecting a set of horror movie plots from the Open Movie Database, this was the code I used to train the model. I used a Long Short-Term Memory (LSTM) model, a recurrent architecture commonly used for sequence tasks such as character-level text generation.

First, the text is loaded in and two dictionaries are created. The first dictionary maps text characters to numerical values, while the second dictionary maps numerical values back to characters. This allows us to encode and decode the text data for use by the model. After this, the input sentences and the character that follows each one must be defined. Finally, the X/features and Y/labels are created for the model as one-hot arrays built from those variables.

After this, an optimizer for the model is selected. The model is then created and instantiated.
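
To make the encoding steps concrete, here is a minimal sketch on a toy string rather than the plot corpus. The `toy_` names, the window length of 4 and the step of 2 are illustrative assumptions, not the notebook's settings.

```python
import numpy as np

# Toy corpus and window settings (illustrative values, not the notebook's)
toy_text = "hello world"
toy_max_length = 4
toy_step = 2

# Character <-> index lookup tables
toy_chars = sorted(set(toy_text))
toy_char_to_num = {c: i for i, c in enumerate(toy_chars)}
toy_num_to_char = {i: c for i, c in enumerate(toy_chars)}

# Slide a window over the text; pair each window with the character after it
windows = []
targets = []
for i in range(0, len(toy_text) - toy_max_length, toy_step):
    windows.append(toy_text[i: i + toy_max_length])
    targets.append(toy_text[i + toy_max_length])

# One-hot encode: features are (window, position, character), labels are (window, character)
toy_x = np.zeros((len(windows), toy_max_length, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for idx, window in enumerate(windows):
    for pos, char in enumerate(window):
        toy_x[idx, pos, toy_char_to_num[char]] = True
    toy_y[idx, toy_char_to_num[targets[idx]]] = True

# Decode the first window back to text using the reverse lookup
decoded = "".join(toy_num_to_char[int(row.argmax())] for row in toy_x[0])

print(windows)
print(targets)
print(toy_x.shape, toy_y.shape)
print(decoded)
```

Each row of `toy_x` holds exactly one `True` per position, which is why `argmax` is enough to recover the original characters.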

In [2]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM, Dropout
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, LambdaCallback, EarlyStopping
import matplotlib.pyplot as plt
import numpy as np
import random
import sys
import warnings
warnings.filterwarnings("ignore")

text = open('horror_plots_2_correct.txt', 'r').read().lower()
print('text length', len(text))

print(text[:300])

chars = sorted(list(set(text)))
print('total chars: ', len(chars))

# Map characters to numeric values and vice-versa
char_to_num = dict((c, i) for i, c in enumerate(chars))
num_to_char = dict((i, c) for i, c in enumerate(chars))

max_length = 50
step = 3
sentences = []
# Named next_chars rather than `next` to avoid shadowing the Python builtin
next_chars = []

# Create input sentences for the model, as well as the next character in the sequence
for i in range(0, len(text) - max_length, step):
    sentences.append(text[i: i + max_length])
    next_chars.append(text[i + max_length])

print('Number of sequences:', len(sentences))

# Create features and labels for the model
# Start from boolean arrays of zeros (plain `bool`, since the `np.bool` alias was removed in newer NumPy releases)
x = np.zeros((len(sentences), max_length, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

# One-hot encode every character of each sentence, and the character that follows it
for idx, sentence in enumerate(sentences):
    for i, char in enumerate(sentence):
        x[idx, i, char_to_num[char]] = 1
    y[idx, char_to_num[next_chars[idx]]] = 1

# Declare an optimizer for the model
optim = Adam(lr=0.01)

# Create the model format
def create_model(max_len, charas, optim):
    model = Sequential()
    model.add(LSTM(128, input_shape=(max_len, len(charas)), return_sequences=True))
    model.add(LSTM(256))
    model.add(Dense(128))
    model.add(Dropout(0.2))
    model.add(Dense(64))
    model.add(Dropout(0.2))
    model.add(Dense(len(charas)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])

    return model

# Instantiate the model instance
model = create_model(max_length, chars, optim)

Using TensorFlow backend.


text length 244182
a meteor strikes a houseboat in the swamps near a southern town populated by yankees with fake accents. the people on the houseboat become zombies who feed on the alligators in the swamp. ...
"a baby alligator is flushed down a chicago toilet and survives by eating discarded laboratory rats injected
total chars:  61
Number of sequences: 81378



There are two functions needed for the generation of text. The first function takes the model's predicted probabilities for the next character, reweights them by a temperature (called "diversity" below), and samples a character index from the result. The second function prints generated text at the end of every 5th training epoch.
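
As a rough illustration of what the temperature does, the sketch below rescales a made-up three-character distribution. `reweight` and `toy_probs` are hypothetical names used only here, and the function covers just the rescaling step, not the random draw.

```python
import math

def reweight(probs, temperature):
    """Rescale a probability distribution by a temperature and renormalize it."""
    # Divide the log probabilities by the temperature
    scaled = [math.log(p) / temperature for p in probs]
    # Exponentiate and renormalize so the values sum to 1 again
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up distribution over three characters (an assumption for illustration)
toy_probs = [0.7, 0.2, 0.1]

for temperature in [0.5, 1.0, 1.25]:
    rounded = [round(p, 3) for p in reweight(toy_probs, temperature)]
    print(temperature, rounded)
```

Temperatures below 1 sharpen the distribution toward the most likely character, while temperatures above 1 flatten it, so higher diversity values tend to give more surprising (and more error-prone) text.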

In [3]:
def sample(preds, temperature=1.0):
    # Sample an index from the probability array, reweighted by temperature
    preds = np.asarray(preds).astype('float64')
    # Take the log probabilities and scale them by the temperature
    preds = np.log(preds) / temperature
    # Exponentiate and renormalize so the values sum to 1 again
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the multinomial distribution defined by preds
    probs = np.random.multinomial(1, preds, 1)
    # The draw is one-hot, so argmax returns the index that was sampled
    return np.argmax(probs)

def end_epoch(epoch, logs):
    # Called at the end of every epoch; on every 5th epoch it
    # prints text generated with the current model weights

    if epoch % 5 == 0:

        print()
        print('Generation results after epoch {}'.format(epoch))
        print()
        # Set an initial index to start our generation on by randomly selecting an integer
        initial_idx = random.randint(0, len(text) - max_length - 1)
        # Generate the characters for our list of different diversities
        for diversity in [0.5, 0.75, 1.0, 1.25]:
            print('Current diversity: {}'.format(diversity))
            # Reference the starting index and get the sentence that follows
            # This sentence is what will be used to generate the text
            text_created = ''
            sentence = text[initial_idx: initial_idx + max_length]
            print("Seed to generate from:")
            text_created += sentence
            sys.stdout.write(text_created)
            print(" ")
            print("Generated text:")
            print("___")

            for i in range(500):

                # Construct an array of zeros to fit the predictions into
                feature_pred = np.zeros((1, max_length, len(chars)))

                # Fill in the feature array with the numbers that represent characters
                for n, char in enumerate(sentence):
                    feature_pred[0, n, char_to_num[char]] = 1.

                # Use the model to predict based on the current features
                preds = model.predict(feature_pred, verbose=0)[0]
                # Sample the next character index from the predictions at the current diversity
                next_idx = sample(preds, diversity)
                # Convert the index to an actual character
                next_char = num_to_char[next_idx]

                # Add the character to the string to be text_created
                text_created += next_char
                # Move on to the next character in the sentence
                sentence = sentence[1:] + next_char

                # Write the character to the terminal
                sys.stdout.write(next_char)
                # Flush so the character appears immediately
                sys.stdout.flush()

            print(" ")
            print("___")
            print(" ")


We'll now go about defining some callbacks for the training, as well as a filepath for the saved weights. We can then fit/train our model and capture some metrics like loss and accuracy. We'll then visualize these metrics after training is done.

In [4]:
print_callback = LambdaCallback(on_epoch_end=end_epoch)

filepath = "weights.hdf5"

callbacks = [print_callback,
            ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min'),
            ReduceLROnPlateau(monitor='loss', factor=0.2, patience=1, verbose=1, mode='min', min_lr=0.00001),
            EarlyStopping(monitor= 'loss', min_delta=1e-10, patience=15, verbose=1, restore_best_weights=True)]

records = model.fit(x, y, batch_size=128, epochs=100, callbacks=callbacks)

t_loss = records.history['loss']
t_acc = records.history['acc']

# gets the number of epochs the model was actually trained for
train_length = range(1, len(t_loss) + 1)

def evaluation(model, train_length, training_loss, training_acc):

    # plot the loss across the number of epochs
    plt.figure()
    plt.plot(train_length, training_loss, label='Training Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

    plt.figure()
    plt.plot(train_length, training_acc, 'r', label='Training acc')
    plt.xlabel('Epochs')
    plt.ylabel('Acc')
    plt.title('Accuracy Over Epochs')
    plt.show()

    # evaluate on the training data (there is no separate test set here)
    # get the score/accuracy for the current model
    scores = model.evaluate(x, y, batch_size=128)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

# evaluation(model, train_length, t_loss, t_acc)


Epoch 1/100

KeyboardInterrupt: 

Finally, we can create a function to generate text, taking in the saved weights. 

In [None]:
def text_gen():

    model.load_weights("weights.hdf5")

    initial_idx = random.randint(0, len(text) - max_length - 1)
    # Generate the characters for our list of different diversities
    for diversity in [0.5, 0.75, 1.0, 1.25]:
        print('Current diversity: {}'.format(diversity))
        # Reference the starting index and get the sentence that follows
        # This sentence is what will be used to generate the text
        text_created = ''
        sentence = text[initial_idx: initial_idx + max_length]
        print("Seed to generate from:")
        text_created += sentence
        sys.stdout.write(text_created)
        print(" ")
        print("Generated text:")
        print("___")
        for i in range(500):

            # Construct an array of zeros to fit the predictions into
            feature_pred = np.zeros((1, max_length, len(chars)))

            # Fill in the feature array with the numbers that represent characters
            for n, char in enumerate(sentence):
                feature_pred[0, n, char_to_num[char]] = 1.

            # Use the model to predict based on the current features
            preds = model.predict(feature_pred, verbose=0)[0]
            # Sample the next character index from the predictions at the current diversity
            next_idx = sample(preds, diversity)
            # Convert the index to an actual character
            next_char = num_to_char[next_idx]

            # Add the character to the string to be text_created
            text_created += next_char
            # Move on to the next character in the sentence
            sentence = sentence[1:] + next_char

            # Write the character to the terminal
            sys.stdout.write(next_char)
            # Flush so the character appears immediately
            sys.stdout.flush()

        print(" ")
        print("___")
        print(" ")

The results I got from this were underwhelming. Even after 100 epochs of training and minimal loss, the generated text wasn't very coherent. More training data, more epochs, and more time spent experimenting with the model might help.

Still, I began to wonder if there was an easier way to accomplish my goal. I found out about the TextGenRNN model referenced in the README attached to the repo and tried implementing it.

In [None]:
from textgenrnn import textgenrnn

input_file = "horror_plots.txt"
epochs = 50
weights_file = "textgenrnn_weights.hdf5"

def train_generator(input_file, weights_file, epochs=1):
    textgen_model = textgenrnn()
    textgen_model.train_from_file(input_file, num_epochs=epochs)
    # Save to the path passed in rather than a hard-coded filename
    textgen_model.save(weights_file)

#train_generator(input_file, weights_file, epochs)

def gen_text(weights_file):
    textgen_model = textgenrnn()
    textgen_model.load(weights_file)
    textgen_model.generate()

gen_text(weights_file=weights_file)

The text generated by this library was of higher quality, but it still wasn't quite what I was looking for. Training also took quite a bit of time. I investigated Markov chains as a potential solution to my problem.

What follows is an implementation of a Markov chain for text generation.

In [None]:
import random
from nltk.corpus import stopwords
import unidecode
import re

class Markov(object):

    def __init__(self, order):

        # order refers to how far back the process will look or remember

        self.order = order

        # controls the actual size of the word groups to be analyzed
        self.group_size = self.order + 1

        # the training text

        self.text = None

        #graph dictionary will hold the actual information
        self.graph = {}

        return

    def train(self, text_file):
        # text_file is an open file object, not a filename
        self.text = text_file.read().split()

        # this appends the beginning of the text to the end of the text
        # so that it always has something to generate
        self.text = self.text + self.text[:self.order]

        # iterate one by one over the text, from word 0 to the last position
        # that still has `order` words of context plus a following word;
        # stopping one position earlier can leave a reachable state without a key
        for i in range(len(self.text) - self.order):

            # key is the few words that came before the value
            key = tuple(self.text[i:i + self.order])
            # value is the word that is coming up now, final word in the sequence
            # order 2 markov chain will have value be word 3
            value = self.text[i + self.order]

            # if this key has been seen before, append the value to its list of followers
            if key in self.graph:
                self.graph[key].append(value)
            # otherwise, start a new list of followers for this key
            else:
                self.graph[key] = [value]

    def generate(self, length):

        # index defines where the text generation begins, picking a random start position
        index = random.randint(0, len(self.text) - self.order)

        # result starts out as the `order` words found at that position
        result = self.text[index: index + self.order]

        for i in range(length):

            # current state is the last few words of the current result
            state = tuple(result[len(result) - self.order:])
            # next word is randomly chosen from possible values in the graph
            next_word = random.choice(self.graph[state])
            # append the value to the result
            result.append(next_word)

        print(" ".join(result[self.order:]))
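
To see what the graph dictionary ends up holding, here is a minimal sketch of an order-2 chain on a toy sentence. `build_graph` and `walk` are hypothetical helpers written for this illustration; they follow the same key/value idea as the class but are not part of it, and they assume every wrapped position is turned into a key.

```python
import random

def build_graph(words, order):
    """Map each tuple of `order` consecutive words to the list of words seen after it."""
    # Wrap the start of the text onto the end so every state has a follower
    wrapped = words + words[:order]
    graph = {}
    for i in range(len(wrapped) - order):
        key = tuple(wrapped[i:i + order])
        graph.setdefault(key, []).append(wrapped[i + order])
    return graph

def walk(graph, start, length, rng):
    """Extend `start` by `length` words, choosing each follower at random."""
    result = list(start)
    order = len(start)
    for _ in range(length):
        state = tuple(result[-order:])
        result.append(rng.choice(graph[state]))
    return result

# Toy sentence (an assumption for illustration, not the plot corpus)
toy_words = "the cat sat on the cat bed".split()
toy_graph = build_graph(toy_words, order=2)

# The pair ("the", "cat") was followed by two different words, so either can be chosen
print(toy_graph[("the", "cat")])
print(" ".join(walk(toy_graph, ("the", "cat"), 10, random.Random(0))))
```

Repeated followers stay in the list as duplicates, so `rng.choice` picks each word in proportion to how often it followed that state in the text.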

Here's the model generator. The generate function takes the number of words to generate, while the constructor takes the order, i.e. how many preceding words make up each state.

In [None]:
# The Markov class is defined in the cell above, so it is instantiated directly
generator = Markov(2)
with open("Horror_plots.txt") as markov_data:
    generator.train(markov_data)
print("Basic Markov model generated:")
generator.generate(30)
print("_______")

Finally, I hit on a solution that seemed to work well for what I was trying to accomplish: Markovify.

In [None]:
import markovify
import en_core_web_sm

input_text = open('Horror_plots.txt').read()
nlp = en_core_web_sm.load()

# regular markovify

# Build the model.
text_model = markovify.Text(input_text, state_size=2)

# Print ten randomly-generated sentences
print("Vanilla Markovify:")
print("---")
print("Sentence Gen:")
for i in range(10):
    print(text_model.make_sentence())

print(" ")
print("Short sentence gen:")
# Print ten randomly-generated sentences of no more than 140 characters
for i in range(10):
    print(text_model.make_short_sentence(140))

print("________")

#sentence_gen()

# override the default markovify word splitting and joining so that each state also carries a part-of-speech tag

class POSText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

text_model2 = POSText(input_text, state_size=3)

print("Modified Markov model:")
print("---")
print("Markov full sentence gen:")
print(" ")
for i in range(10):
    print(text_model2.make_sentence())
print("________")

print("Markov short sentence gen:")
print(" ")
for i in range(10):
    print(text_model2.make_short_sentence(150))
print("________")
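
The `word::POS` scheme used by `POSText` can be sketched without spaCy. In the snippet below the tags are hand-written stand-ins for what the tagger would return (an assumption for illustration), and `to_states` and `to_sentence` are hypothetical helpers that mirror `word_split` and `word_join`.

```python
# Hand-tagged tokens standing in for spaCy output (tags are assumed for illustration)
tagged = [("the", "DET"), ("duck", "NOUN"), ("will", "AUX"), ("duck", "VERB")]

def to_states(pairs):
    """Join each word with its tag, as word_split does with orth_ and pos_."""
    return ["::".join(pair) for pair in pairs]

def to_sentence(states):
    """Drop the tags again, as word_join does, to get plain text back."""
    return " ".join(state.split("::")[0] for state in states)

states = to_states(tagged)
print(states)
print(to_sentence(states))
```

Because the tag is part of each state, the noun "duck" and the verb "duck" are treated as different words by the chain, which is one plausible reason the tagged model produces more grammatical sentences.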