# A few simple corpus-driven approaches to narrative analysis and generation

By [Allison Parrish](http://www.decontextualize.com/)

This notebook is a fast introduction to a few techniques for working with narrative corpora. By "narrative corpora," I mean pre-existing bodies of text that mostly contain the texts of narratives. In particular, we're going to use Mark Riedl's [WikiPlots corpus](https://github.com/markriedl/WikiPlots), which has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia.

The notebook takes you through using [spaCy](http://spacy.io) to extract words, noun chunks, parts of speech and entities from the text and then sew them back together with [Tracery](http://tracery.io). It then shows how to use [Markovify](https://github.com/jsvine/markovify) to create new narratives from existing narrative text, along with a quick example of recurrent neural network text generation with [textgenrnn](https://github.com/minimaxir/textgenrnn).

The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed. Even if the notebook itself doesn't end up being useful to you, hopefully it spurs a few ideas that you can take with you into your practice as a storyteller and/or programmer.

## Loading the corpus

The first step is to get the narrative corpus into the program. Because WikiPlots is so big, we're actually going to be working with a smaller subset: only the plot summaries for romantic comedy movies. The subcorpus was made using [this notebook on creating a subcorpus of WikiPlots](https://github.com/aparrish/corpus-driven-narrative-generation/blob/master/creating-a-wikiplots-subcorpus.ipynb), which you can consult if you want to make your own with a different subset of WikiPlots.

The corpus we're working with takes the form of a TSV file ("tab separated values"), with each line containing the title of the movie, a number indicating where in the plot summary the sentence for this line occurs, the total number of sentences in the summary, and the actual text of the sentence. The following cell loads the data into a list of dictionaries:

In [91]:
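# Read the corpus: one sentence per line, four tab-separated columns.
sentences = []
with open("romcom_plot_sentences.tsv") as infile:
    for line in infile:
        line = line.strip()
        items = line.split("\t")
        sentences.append(
            {'title': items[0],
             'index': int(items[1]),  # position of this sentence in the summary
             'total': int(items[2]),  # number of sentences in the summary
             'text': items[3]})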

Just to make sure it worked, we'll print out a random sentence:

In [92]:
import random

In [93]:
random.choice(sentences)

{'index': 17,
 'text': 'Instead of writing, Freddie goes after her.',
 'title': 'Ever Since Eve',
 'total': 23}

Note: You can make your own corpus that works with the code in this notebook by exporting your data in TSV format with one line per sentence, with columns for the following:

* `title`: the title of the work that the sentence comes from
* `index`: the index of the sentence in the work
* `total`: the total number of sentences in the work
* `text`: the text of the sentence
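
Here's a minimal sketch of what producing such a file might look like. The `works` dictionary, the `corpus_rows()` and `write_corpus()` helpers, and the example sentences are all made up for illustration, and the sketch writes to an in-memory buffer rather than a real file so that it can be read straight back using the same logic as the loading cell above. It assumes the sentence index starts at zero.

```python
import io

# Hypothetical example data: each work is a title plus a list of sentences.
works = {
    "Example Movie": [
        "Two strangers meet on a train.",
        "They argue about the window seat.",
        "They fall in love anyway.",
    ],
}

def corpus_rows(works):
    """Yield (title, index, total, text) rows, one per sentence."""
    for title, texts in works.items():
        total = len(texts)
        for index, text in enumerate(texts):
            # Collapse tabs and newlines, which would break the TSV format.
            clean = " ".join(text.split())
            yield (title, index, total, clean)

def write_corpus(outfile, works):
    """Write one tab-separated line per sentence to an open file object."""
    for row in corpus_rows(works):
        outfile.write("\t".join(str(field) for field in row) + "\n")

# Write to an in-memory buffer (swap in open("my_corpus.tsv", "w") for a file).
buffer = io.StringIO()
write_corpus(buffer, works)

# Read it back the same way the notebook's loading cell does.
demo_sentences = []
for line in buffer.getvalue().splitlines():
    items = line.split("\t")
    demo_sentences.append(
        {'title': items[0],
         'index': int(items[1]),
         'total': int(items[2]),
         'text': items[3]})

print(demo_sentences[0])
```

If your sentences contain tabs or line breaks, they need to be cleaned up before writing (as `corpus_rows()` does here), since the loader splits on both.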

## Natural language processing

To get an idea of what's happening in the text of the plots, we can do a bit of Natural Language Processing. I cover just the bare essentials in this notebook. [Here's a more in-depth tutorial that I wrote](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb).

Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to install it (i.e., download the code and put it in a place where Python can find it) and download the language model. (The language model contains statistical information about a particular language that makes it possible for spaCy to do things like parse sentences into their constituent parts.)

If you're using this notebook in Binder, then spaCy has already been installed for you! Otherwise, to install spaCy, [follow the instructions here](https://spacy.io/usage/). If you're using Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type

    conda install -c conda-forge spacy

This line installs the library. You'll also need to download a language model. For that, type:

    python -m spacy download en_core_web_sm

(Replace `en` with the language code for your desired language, if there's a model available for it.) The language model contains the statistical information necessary to parse text into sentences and sentences into parts of speech. Note that the download might take a while: the small English model used here is fairly compact, but larger models can run to several hundred megabytes.

Once you've installed the library and downloaded the model, you should be able to load the model in the following cell:

In [94]:
import spacy
nlp = spacy.load('en_core_web_sm')

(This could also take a while: the model is potentially very large and your computer needs to load it from your hard drive and into memory. When you see a `[*]` next to a cell, that means that your computer is still working on executing the code in the cell.)

Right off the bat, the spaCy library gives us access to a number of interesting units of text:

* All of the sentences (`doc.sents`)
* All of the words (`doc`)
* All of the "named entities," like names of places, people, brands, etc. (`doc.ents`)
* All of the "noun chunks," i.e., nouns in the text plus surrounding matter like adjectives and articles (`doc.noun_chunks`)

In the cell below, we extract these into variables so we can play around with them a little bit. (Parsing sentences is hungry work and the following cell will take a while to execute.)

In [95]:
words = []
noun_chunks = []
entities = []
# only use 1000 sentences sampled at random by default; to use every sentence in
# the corpus, comment out the first `for...` line below and uncomment the second.
for i, sent in enumerate(random.sample(sentences, 1000)):
#for i, sent in enumerate(sentences):
    if i % 100 == 0:
        print(i, len(sentences))
    doc = nlp(sent['text'])
    words.extend([w for w in list(doc) if w.is_alpha])
    noun_chunks.extend(list(doc.noun_chunks))
    entities.extend(list(doc.ents))

0 22593
100 22593
200 22593
300 22593
400 22593
500 22593
600 22593
700 22593
800 22593
900 22593


Just to make sure it worked, print out ten random words:

In [96]:
for item in random.sample(words, 10):
    print(item.text)

provides
year
Yount
to
confesses
Magazine
the
the
first
men


Ten random noun chunks:

In [97]:
for item in random.sample(noun_chunks, 10):
    print(item.text)

he
his butler position
a man
he
Julie
an underling
they
Jr
She
Zack


Ten random entities:

In [98]:
for item in random.sample(entities, 10):
    print(item.text)

Helen
Angela
Christmas
Sierra
Katie
William Scott
Rex Harrison
Omar
Jeff Bridges
Trish


### Grammatical roles

The parser included with spaCy can also give us information about the grammatical roles in the sentence. For example, the `.root.dep_` attribute of a noun chunk tells us whether that noun chunk is the subject of the sentence ("nsubj") or a direct object ("dobj") of the sentence. (See the "Universal Dependency Labels" of spaCy's [annotation specs](https://spacy.io/api/annotation) for more possible roles.) Using this information, we can make a list of sentence subjects and sentence objects:

In [99]:
subjects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'nsubj']
objects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'dobj']

In [100]:
random.sample(subjects, 10)

[he,
 Gordon,
 her twin sister,
 him,
 Penelope,
 Beth,
 their friends,
 Amy,
 The princess' entourage,
 Natalie]

In [101]:
random.sample(objects, 10)

[Sadie,
 What,
 the side,
 a job,
 the deception,
 a coincidence,
 her tram,
 his own therapist,
 Vegas,
 the choir]

### Parts of speech

The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create five lists (`nouns`, `verbs`, `past_tense_verbs`, `adjs` and `advs`) that contain only words of the specified parts of speech. The `.pos_` attribute gives the coarse part of speech, while the `.tag_` attribute gives a finer-grained tag, which makes it easy to get only particular forms of verbs; in this case, I'm just getting verbs that are in the past tense (`VBD`). ([There's a full list of part of speech tags here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english).)

In [102]:
nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
past_tense_verbs = [w for w in words if w.tag_ == 'VBD']
adjs = [w for w in words if w.tag_ == "JJ"]
advs = [w for w in words if w.pos_ == "ADV"]

And now we can print out a random sample of any of these:

In [103]:
for item in random.sample(nouns, 12): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
    print(item.text)

crush
phone
dinner
terminal
prison
adventure
years
night
perfection
victim
marriage
pianist


### Entity types

The parser in spaCy not only identifies "entities" but also assigns them to a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:

In [104]:
people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]

And then you can print out a random sample:

In [105]:
for item in random.sample(times, 12): # change "times" to "people" or "locations" to sample those lists
    print(item.text.strip())

the night
only seconds
The next morning
morning
morning
The next morning
That night
the night
The same night
several hours
the evening
that night


### Finding the most common

We won't go too deep into text analysis in this tutorial, but it's useful to be able to do the most fundamental task in text analysis: finding the things that are most common. The code to do this task looks like the following, which gives us a way to look up how often any word occurs in the text:

In [106]:
from collections import Counter
word_count = Counter([w.text for w in words])

In [107]:
word_count['Meanwhile']

22

... and also tells us which words are most common:

In [108]:
word_count.most_common(12)

[('the', 863),
 ('to', 795),
 ('and', 684),
 ('a', 581),
 ('her', 366),
 ('of', 348),
 ('is', 347),
 ('in', 314),
 ('his', 269),
 ('with', 248),
 ('that', 239),
 ('he', 221)]

You can make a counter for any of the other lists we've worked with using the same syntax. Just make up a unique variable name on the left of the `=` sign and put the name of the list you want to count in the brackets to the right (replacing `words`). E.g., to find the most common people:

In [109]:
people_count = Counter([w.text for w in people])

In [110]:
people_count.most_common(12)

[('Mike', 14),
 ('George', 14),
 ('Joe', 13),
 ('Jack', 13),
 ('Peter', 12),
 ('Claire', 11),
 ('Robert', 11),
 ('Tom', 9),
 ('Andy', 8),
 ('Ben', 8),
 ('Kate', 8),
 ('Sarah', 7)]

The most common past-tense verbs:

In [111]:
vbd_count = Counter([w.text for w in past_tense_verbs])

In [112]:
vbd_count.most_common(12)

[('was', 40),
 ('had', 18),
 ('did', 7),
 ('were', 6),
 ('got', 5),
 ('told', 4),
 ('made', 3),
 ('began', 3),
 ('met', 3),
 ('happened', 3),
 ('thought', 3),
 ('ended', 2)]
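
One thing to keep in mind: a `Counter` treats strings that differ only in capitalization or surrounding whitespace as separate keys, so `word_count['Meanwhile']` and `word_count['meanwhile']` are counted independently. The sketch below shows one way you might normalize the strings before counting. The `sample_tokens` list is a made-up stand-in for something like `[w.text for w in words]`.

```python
from collections import Counter

# Made-up stand-in for a list of token texts from the corpus.
sample_tokens = ["The", "the", "Meanwhile", "meanwhile", " the ", "She"]

# Counting the raw strings keeps "The", "the" and " the " apart...
raw_count = Counter(sample_tokens)

# ...while stripping whitespace and lowercasing first merges them.
normalized_count = Counter(t.strip().lower() for t in sample_tokens)

print(raw_count["the"], normalized_count["the"])
print(normalized_count.most_common(2))
```

Whether to normalize is a judgment call: lowercasing is handy for counting ordinary words, but it also erases the difference between a name like "Will" and the verb "will."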

### Writing to a file

The following cell defines a function for writing data from a `Counter` object to a file. The file is in "tab-separated values" format, which you can open using most spreadsheet programs. Execute it before you continue:

In [113]:
def save_counter_tsv(filename, counter, limit=1000):
    # Write the `limit` most common items, one "key<TAB>count" row per line.
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")

Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:

In [114]:
save_counter_tsv("100_common_words.tsv", word_count, 100)

Try opening this file in Excel or Google Sheets or Numbers!

If you want to write the data from another `Counter` object to a file:

* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)
* Replace `word_count` with the name of any of the `Counter` objects we've made in this sheet
* Change the number to the number of rows you want to include in your spreadsheet.

### When do things happen in this text?

Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular "times" (durations, times of day, etc.) are mentioned in the text.

In [115]:
time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)

Do the same thing, but with people:

In [116]:
people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)

### Generating stories from a corpus and Tracery grammars

Once you've isolated entities and parts of speech, you can recombine them in interesting ways. One is to use a Tracery grammar to write sentences that include the isolated parts. Because the parts have been labelled using spaCy, you can be reasonably sure that they'll fit into particular slots in the sentence. (I used a similar technique for my [Cheap Space Nine](https://twitter.com/cheapspacenine) bot.)

In [117]:
import tracery
from tracery.modifiers import base_english

In [118]:
rules = {
    "subject": [w.text for w in subjects],
    "object": [w.text for w in objects],
    "verb": [w.text for w in past_tense_verbs],
    "adj": [w.text for w in adjs],
    "people": [w.text for w in people],
    "loc": [w.text for w in locations],
    "time": [w.text for w in times],
    "origin": "#scene#\n\n[charA:#subject#][charB:#subject#][prop:#object#]#sentences#",
    "scene": "SCENE: #loc#, #time.lowercase#",
    "sentences": [
        "#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#\n#sentence#"
    ],
    "sentence": [
        "#charA.capitalize# #verb# #prop#.",
        "#charB.capitalize# #verb# #prop#.",
        "#prop.capitalize# became #adj#.",
        "#charA.capitalize# and #charB# greeted each other.",
        "'Did you hear about #object.lowercase#?' said #charA#.",
        "'#subject.capitalize# is #adj#,' said #charB#.",
        "#charA.capitalize# and #charB# #verb# #object#.",
        "#charA.capitalize# and #charB# looked at each other.",
        "#sentence#\n#sentence#"
    ]
}

In [119]:
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)

In [120]:
for i in range(3):
    print(grammar.flatten("#origin#"))
    print()

SCENE: Earth, that night

Phil and she looked at each other.
Phil and she had nanobots.
She granted Helena.
Helena became critical.

SCENE: West, mabel quitting

He and LeBlanc looked at each other.
He and LeBlanc found a play.
'He is similar,' said LeBlanc.
'Did you hear about abigail?' said he.
Her sleeping husband became bright.

SCENE: West, only seconds

Brian saw Julia audition.
Charley and Brian looked at each other.
Charley did Julia audition.



## Markov chain text generation

Another way to produce new narratives from existing narrative text is to find statistical patterns in the text itself and then make the computer create new text that follows those statistical patterns. Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. Writing [code to implement a Markov chain generator](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) on your own is certainly possible, but Markovify comes with a lot of extra niceties that will make our lives easier.

To install Markovify on your computer, run the cell below. (You can skip this step if you're using this notebook in Binder.)

In [121]:
!pip install markovify


And then run this cell to make the library available in your notebook:

In [122]:
import markovify

We need a list of strings to train the Markov generator. For now, let's just get all of the sentences from any movie in the corpus:

In [123]:
all_text = [item['text'] for item in sentences]

The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `all_text_gen`.

In [124]:
all_text_gen = markovify.Text(all_text)

You can then call the `.make_sentence()` method to generate a sentence from the model:

In [125]:
print(all_text_gen.make_sentence())

Besotted, the Princess and Jeff decide to separate but promises to come up with his computer skills, and Zack together.


The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:

In [126]:
print(all_text_gen.make_short_sentence(50))

Meanwhile, Ludo's best friend Clem goes with them.


By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the `tries` parameter:

In [127]:
print(all_text_gen.make_short_sentence(40, tries=100))

They quarrel, but it attacks the thugs.


Or by disabling the check altogether with `test_output=False`:

In [128]:
print(all_text_gen.make_short_sentence(40, test_output=False))

While Ryan has been evicted.


### Changing the order

When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two models with different orders and compare:

In [129]:
gen_1 = markovify.Text(all_text, state_size=1)
gen_4 = markovify.Text(all_text, state_size=4)

In [130]:
print("order 1")
print(gen_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_4.make_sentence(test_output=False))

order 1
To Find Love blossoms into trouble with Leo therefore challenges him in her as a woman, working for the woman escape her apartment, where she takes the wedding ring and although she plans to see him if Pitka gets into taking off with him, particularly bright red bracelet re-engraved for the two soon leave the bowels of another attempt to expose Paolo is given up.

order 4
Christie also returns and she and Francis jointly run the vineyard while trying to reconcile their vastly different philosophies of wine production.


In general, the higher the order, the more the sentences will seem "coherent" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error.
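
To make the idea of "order" a bit more concrete, here's a bare-bones word-level Markov generator in plain Python. This is a sketch for illustration only, not how Markovify is implemented: `build_model()`, `generate_sentence()`, the `<START>`/`<END>` markers and the toy corpus are all made up for the example. The `state_size` argument plays the same role as Markovify's parameter of that name: it's the number of preceding words used to pick the next one.

```python
import random
from collections import defaultdict

def build_model(sentences, state_size=2):
    """Map each run of `state_size` words to the words seen after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        # Pad so the model knows how sentences begin and end.
        padded = ["<START>"] * state_size + words + ["<END>"]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            model[state].append(padded[i + state_size])
    return model

def generate_sentence(model, state_size=2, rng=random):
    """Walk the model from the start state until an end marker comes up."""
    state = ("<START>",) * state_size
    output = []
    while True:
        next_word = rng.choice(model[state])
        if next_word == "<END>":
            break
        output.append(next_word)
        # Slide the window forward by one word.
        state = state[1:] + (next_word,)
    return " ".join(output)

toy_corpus = [
    "Anna meets Ben at the airport.",
    "Ben meets Anna at the wedding.",
    "Anna leaves Ben at the altar.",
]

for order in (1, 3):
    model = build_model(toy_corpus, state_size=order)
    print("order", order, "->", generate_sentence(model, state_size=order))
```

With `state_size=1`, each word depends only on the word before it, so the generator can hop between sentences at any shared word. With a larger `state_size`, it can only hop where several words in a row match, so the output stays closer to the source.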

### Changing the level

Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:

In [131]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.

The following cell implements a character-level Markov text generator for the word "condescendences":

In [132]:
con_model = SentencesByChar("condescendences", state_size=2)

Execute the cell below to see the output, which should be a scrambled variation on the original word:

In [133]:
con_model.make_sentence()

'condendendes'
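
To see where output like this comes from, it can help to look at the table of transitions directly. The sketch below builds that table by hand for the same word. It's only an illustration (the `char_transitions()` function is made up for this example, and it ignores the start and end markers a real generator needs), but it shows which two-character states have more than one possible follower.

```python
from collections import defaultdict

def char_transitions(text, state_size=2):
    """Map each run of `state_size` characters to the characters after it."""
    table = defaultdict(list)
    for i in range(len(text) - state_size):
        table[text[i:i + state_size]].append(text[i + state_size])
    return dict(table)

table = char_transitions("condescendences", state_size=2)
for state, followers in sorted(table.items()):
    print(state, "->", followers)
```

States like `de`, `ce` and `en` each have two different followers, and those branch points are where the generated word can diverge from the original.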

Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from the plot sentences in `all_text`:

In [134]:
gen_char = SentencesByChar(all_text, state_size=7)

And the cell below prints out a random sentence from this generator:

In [135]:
print(gen_char.make_sentence(test_output=False))

Mark and his mysterious with her and Howard destroy any car and offers to resolved things right and rejoins the purchase water from China, 7 - year of making his offer and abusive, alcoholic middle of the school so they continues to get Mary if the dance-hall girls devises a repentant.


### Thinking about structure

It's one thing to be able to produce one plausible sentence of a plot summary using Markov chains, but another to create a sense of overall structure between sentences. Generating narratives with these kinds of long-term dependencies is still an open problem in computational creativity. The approach I'm going to suggest below relies on the intuition that sentences in a plot summary share characteristics based on their position in the summary. First sentences will generally introduce characters and present an initial situation; last sentences will generally describe how the situation was resolved; and sentences in between will describe developing action.

Following this intuition, let's create *three different Markov chains*: one for beginning sentences, one for middle sentences, and one for final sentences. We can use the `index` of each sentence in our corpus to give us this information.

First, the beginnings are lines whose index is zero (i.e., they're the first sentence for this plot):

In [136]:
beginnings = [line['text'] for line in sentences if line['index'] == 0]

In [137]:
random.sample(beginnings, 5)

['Two lovers are separated and marry other partners.',
 'King Serge IV of Molvania (Menjou) comes to a small American town, and falls in love with one of its residents, Mary Young (Love).',
 "Sabrina Watson (Paula Patton) is the only child of the wealthy Watson family; her mother Claudine  (Angela Bassett) and father Greg Watson (Brian Stokes Mitchell) live in Martha's Vineyard.",
 'Portrait painter and caricaturist David Grant (Ronald Colman), newly arrived in Greenwich Village, wishes Jean Newton (Ginger Rogers) good luck on a whim as they pass on the sidewalk.',
 'Sheila Moore (Gish) takes a job at a candy store to support her father, an out-of-work vaudevillian.']

And endings are sentences that come last in the plot (i.e., their index is one less than the total number of sentences):

In [138]:
endings = [line['text'] for line in sentences if line['index'] == line['total'] - 1]

In [139]:
random.sample(endings, 5)

['When the wedding is over Andy makes plans to go to South America and forget his sorrows, but his father talks to him and convinces him to go back to Wainwright and complete his degree.',
 "Along the way there are humorous subplots involving the office manager's violent ex-husband, Becker's attempt to find the 'Australian sound', and an odd waiter who is under the mistaken belief that Becker is a secret agent.",
 'It seems Emily must learn the hard way that love and family require sacrifice and not everybody can be happy.',
 "His mother, Sarah Lee (Angela Lansbury), wants him to follow in his father's footsteps and take over management at the Great Southern Hawaiian Fruit Company, the family business, but Chad is reluctant, so he goes to work as a tour guide at his girlfriend's agency.",
 'She devises a scheme involving amnesia to lure Jeff back to her.']

And "middles" are anything in between:

In [140]:
middles = [line['text'] for line in sentences if 0 < line['index'] < line['total'] - 1]

In [141]:
random.sample(middles, 5)

['However, John arrives, forcing Beth to hide in the back of his Land Rover.',
 "to the town's barbecue where they raise money for equipment for the Kalispell Search & Rescue team.",
 "With help from his two ex-pimp friends Pope Sweet Jesus (Eddie Griffin) and Lord Have Mercy (Katt Williams) and the other townspeople, Norbit manages to meet with Kate without Rasputia's knowledge.",
 'Inès tries to seduce Estelle by offering to be her "mirror" by telling her everything she sees, but ends up frightening her instead.',
 "She refuses to believe that they have all ended up in the room by accident and soon realizes that they have been placed together to make each other miserable; she deduces that they are to be one another's torturers."]

The following cell creates the models:

In [142]:
beginning_gen = markovify.Text(beginnings)
middle_gen = markovify.Text(middles)
ending_gen = markovify.Text(endings)

Now you can generate tiny narratives by producing a beginning sentence, a middle sentence, and an ending sentence:

In [143]:
print(beginning_gen.make_short_sentence(100))
print(middle_gen.make_short_sentence(100))
print(ending_gen.make_short_sentence(100))

Henry Hart is a baseball player for the underdog.
Millie is determined to get an order, she turns against her.
In the end, Gloria leaves her some money and his guests arrive to where Eva is.


The narratives still feel disconnected (and there are often jarring mismatches in pronoun antecedents), but the artifacts produced with this method do feel a bit narrative-like? Maybe?
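
If you want slightly longer narratives, one option is to wrap the three generators in a small helper that asks for several middle sentences and quietly retries when a generator comes back with `None`. The sketch below is one way that might look. `first_success()` and `make_story()` are names invented here, and the example uses stub functions in place of the real models so that it runs on its own.

```python
def first_success(make_sentence, attempts=5):
    """Call make_sentence() until it returns something other than None,
    giving up (and returning None) after `attempts` calls."""
    for _ in range(attempts):
        sentence = make_sentence()
        if sentence is not None:
            return sentence
    return None

def make_story(make_beginning, make_middle, make_ending, n_middles=3):
    """Return a list of sentences: one beginning, n_middles middles, one ending.
    Slots that never produce a sentence are left out."""
    makers = [make_beginning] + [make_middle] * n_middles + [make_ending]
    lines = (first_success(maker) for maker in makers)
    return [line for line in lines if line is not None]

# Stub generators standing in for the Markov models. With the models above,
# you could pass e.g. `lambda: beginning_gen.make_short_sentence(100)`.
story = make_story(
    lambda: "Two strangers meet.",
    lambda: "Complications ensue.",
    lambda: "They marry.",
    n_middles=2)
print("\n".join(story))
```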

### Combining models

Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call `.combine()` to combine them.

The code below combines models for beginning sentences, middle sentences, and ending sentences into one model:

In [144]:
combo = markovify.combine([beginning_gen, middle_gen, ending_gen], [10, 1, 10])

The bit of code `[10, 1, 10]` controls the "weights" of the models, i.e., how much to emphasize the probabilities of any model. You can change this to suit your tastes. (E.g., if you want mostly beginnings with just a bit of middles and a *soupçon* of ends, try `[10, 2, 1]`.)

Then you can create sentences using the combined model:

In [145]:
print(combo.make_short_sentence(120))

Already a beloved figure among the characters come out of the season.
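
As for what the weights are doing: as far as I understand it, the idea is that each model's transition counts get multiplied by that model's weight and then added together, so a heavily weighted model has more say in what comes next. The sketch below illustrates that idea with plain `Counter` objects. It's a simplified picture rather than Markovify's actual code, and `combine_counts()` and the two tiny tables are made up for the example.

```python
from collections import Counter

def combine_counts(tables, weights):
    """Merge transition tables, scaling each table's counts by its weight."""
    combined = {}
    for table, weight in zip(tables, weights):
        for state, followers in table.items():
            target = combined.setdefault(state, Counter())
            for word, count in followers.items():
                target[word] += count * weight
    return combined

# Made-up counts of which words follow "She" in two different models.
beginning_table = {("She",): Counter({"is": 3, "meets": 1})}
ending_table = {("She",): Counter({"leaves": 2, "is": 1})}

combined = combine_counts([beginning_table, ending_table], [10, 1])
print(combined[("She",)])
```

With weights of `[10, 1]`, the words that follow "She" in the beginning model dominate, but "leaves" from the ending model is still a (small) possibility.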


## Neural network text prediction with `textgenrnn`

Like a [Markov chain](ngrams-and-markov-chains.ipynb), a recurrent neural network (RNN) is a way to make predictions about what will come next in a sequence. For our purposes, the sequence in question is a sequence of characters, and the prediction we want to make is *which character will come next*. Both Markov models and recurrent neural networks do this by using statistical properties of text to make a *probability distribution* for what character will come next, given some information about what comes before. The two procedures work very differently internally, and we're not going to go into the gory details about implementation here. (But if you're interested in the gory details, [here's a good place to start](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).) For our purposes, the main *functional* difference between a Markov chain and a recurrent neural network is the *portion* of the sequence used to make the prediction. A Markov model uses a fixed window of history from the sequence, while an RNN (theoretically) uses the *entire history* of the sequence.

The primary benefit of an RNN over a Markov model for text generation is that an RNN takes into account *the entire history* of a sequence when generating the next character. This means that, for example, an RNN can theoretically learn how to close quotes and parentheses, which a Markov chain will never be able to reliably do (at least for pairs of quotes and parentheses longer than the n-gram of the Markov chain).

The drawback of RNNs is that they are *computationally expensive*, from both a processing and memory perspective. This is (again) a simplification, but internally, RNNs work by "squishing" information about the training data down into large matrices, and make predictions by performing calculations on these large matrices. That means that you need a lot of CPU and RAM to train an RNN, and the resulting models (when stored to disk) can be very large. Training an RNN also (usually) takes a lot of time.

Another consideration is the size of your corpus. Markov models will give interesting and useful results even for very small datasets, but RNNs require large amounts of data to train—the more data the better.

So what do you do if you *don't* have a very large corpus? Or if you don't have a lot of time to train on your corpus?

### RNN generation from pre-trained models

Fortunately for us, developer and data scientist [Max Woolf](https://github.com/minimaxir) has made a Python library called [textgenrnn](https://github.com/minimaxir/textgenrnn) that makes it really easy to experiment with RNN text generation. This library includes a model (according to the documentation) "trained on hundreds of thousands of text documents, from Reddit submissions (via BigQuery) and Facebook Pages (via my Facebook Page Post Scraper), from a very diverse variety of subreddits/Pages," and allows you to use this model as a starting point for your own training.

First, install textgenrnn with `pip`. (Again, you can skip this step if you're working with this notebook in Binder.)

In [None]:
!pip install --upgrade textgenrnn

Once it's installed, import the `textgenrnn` class from the package:

In [146]:
from textgenrnn import textgenrnn

(If you get an error at this point, you may need to skip back a step and install [TensorFlow](https://www.tensorflow.org/install). Try running these commands at the Terminal prompt instead of in the notebook itself.)

And create a new `textgenrnn` object like so. (The `name` parameter controls the filename used when automatically saving the model to disk, so pick something descriptive!)

In [147]:
textgen = textgenrnn(name="all_text")

This object has a `.generate()` method which will, by default, generate text from the pre-trained model only.

In [148]:
textgen.generate()

Zeromide: The Stephen Streamer in the Saxopilities



To train a text generator on your own text, use the `.train_on_texts()` method, passing in a list of strings. The `num_epochs` parameter allows you to indicate how many epochs (i.e., passes over the data) should be performed. The more epochs the better, especially for shorter texts, but you'll get okay results even with just a few.

Training a neural network usually takes a really long time! So it makes sense to "try out" a text before committing to the many hours it might take to train the network on the full text. The following example trains the neural network on 100 randomly sampled lines from all plot sentences, which lets you get an idea of what the output will look like when training on its entire contents. You'll notice that the `train_on_texts()` function prints output as it goes, showing what the generated text is likely to look like.

In [149]:
textgen.train_on_texts(random.sample(all_text, 100), num_epochs=3)

Training on 12,244 character sequences.
Epoch 1/3
####################
Temperature: 0.2
####################
When allows he is shire in a company and the companion to his his in her his his and his far is her to the official and him.

All the procession to his in her his his for his in her the procising and his his she she is share to his is the money to him.

The companion and the into the company of his his in her the share to his is the companion to a procession to him.

####################
Temperature: 0.5
####################
While his family of the former is him.

All the into her and alivote in the moonstone and her to a procession as a child to see him.

In a decide and the decide langer to move to his him.

####################
Temperature: 1.0
####################
That become the rush just he bize her as flucking mother to a prier in the lda's shi underegar him still has been in, spitinggugly guys.

It in a perferry girl the connection bricashains he idensions price an again
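
Because `random.sample()` picks different sentences every time the cell runs, two try-out runs aren't directly comparable. Here's a minimal sketch of one way to get the same sample on every run, using a seeded `random.Random` object from the standard library. The `all_text` list below is a made-up stand-in for the notebook's real list of sentences, and the names `rng` and `tryout` and the seed value are arbitrary choices for illustration:

```python
import random

# Stand-in for the notebook's `all_text` list of plot sentences
all_text = ["Sentence number %d." % i for i in range(1000)]

# A dedicated random number generator with a fixed seed
# (the seed value is arbitrary; any integer works)
rng = random.Random(2019)

# The same 100 sentences will be picked each time this code runs
tryout = rng.sample(all_text, 100)

print(len(tryout))
```

You could then pass `tryout` to `.train_on_texts()` in place of the `random.sample()` call.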

After training, you can generate new text using the `.generate()` method again:

In [150]:
textgen.generate()

Conspilated the popus for the cops to stealing the company with her out with her head and her heart because he was a there because he waited to find the prometent about they sleep until self-adaption attorn from the company guys to be about the prom with his far, and reports to his family and he di



The results aren't very interesting because by default the generator is very conservative in how it samples from the probability distribution. You can use the `temperature` parameter to make the sampling a bit more likely to pick improbable outcomes. The higher the value, the weirder the results. The default is 0.2, and going above 1.0 is likely to produce unacceptably strange results:

In [151]:
textgen.generate(temperature=0.5)

Seans his her father of serious as he was a new the award and is ended his father and program and give his security of her heart has been them.



In [152]:
textgen.generate(temperature=0.9)

After slip ime supposting father has felched into an endler.



In [153]:
textgen.generate(temperature=1.5)

Hobets backyard, When my Grads later lets identized to help.
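
To get a feel for what the temperature is doing, here's a rough sketch of how temperature sampling is commonly implemented: take the logarithm of each probability, divide by the temperature, and renormalize. This is a general illustration in plain Python, not textgenrnn's actual code. The four-character distribution and the `apply_temperature` function name are invented for the example:

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    # Divide each log-probability by the temperature...
    scaled = [math.log(p) / temperature for p in probs]
    # ...then turn the results back into probabilities that sum to 1
    # (subtracting the maximum first keeps the exponentials well-behaved)
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A made-up distribution over four candidate next characters
chars = ["e", "a", "q", "z"]
probs = [0.6, 0.3, 0.08, 0.02]

# Show how the distribution changes at different temperatures
for temp in [0.2, 0.5, 1.0, 1.5]:
    rescaled = apply_temperature(probs, temp)
    print(temp, [round(p, 3) for p in rescaled])

# Draw one character from the distribution rescaled at temperature 1.5
next_char = random.choices(chars, weights=apply_temperature(probs, 1.5))[0]
```

Low temperatures pile nearly all of the probability onto the most likely character, a temperature of 1.0 leaves the distribution unchanged, and higher temperatures flatten it so that unlikely characters get picked more often.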



If you pass a number `n` to the `.generate()` method as its first parameter, `.generate()` will print out `n` instances of text generation from the model. The code in the following cell prints out ten examples from the specified temperature:

In [154]:
textgen.generate(10, temperature=0.5)

 10%|█         | 1/10 [00:03<00:30,  3.43s/it]

All the motoring her despite his friends to take a car late of her is falled to a provish of the consequent with a new home to help Misageraly her to have the daughter that he wants to receive some other works and is the old himself to watch to the fellow is they be able to spend him that they read



 20%|██        | 2/10 [00:07<00:27,  3.47s/it]

Treesive When Dense to Bermal Man Katharis spend his impostion of her interviewers wants to be able to stop a prom with a paint to her his her the money and he was a girl and she share the child and dominates to the messations to the walkus of her friend she is self-ended the polustine and sick of 



 30%|███       | 3/10 [00:07<00:18,  2.70s/it]

Despite his heart to self-tend the place of her and therefore with her.



 40%|████      | 4/10 [00:11<00:17,  2.96s/it]




 50%|█████     | 5/10 [00:13<00:13,  2.65s/it]

The confirment for her despite her and and he was to help her to a train calls search for her friend and a watch to despiore to his responsion with her her.



 60%|██████    | 6/10 [00:17<00:11,  2.94s/it]

As he will see to share her to man and the prom to help herself and she replaces a themage of a country where he is seen the meeting when a company to help in them, he is take her to her option and asks her in her friend has been the recently for the character while him they find a trence in his fa



 70%|███████   | 7/10 [00:17<00:06,  2.21s/it]

They are the most with Caracit stationer.



 80%|████████  | 8/10 [00:20<00:04,  2.42s/it]

Testica and so weak the married the old that he should be about the incluse of her boyfriend who despite to the provide to go the solart of his best her trip and she speeches to Kango with her boyfriend and drive his way to be rup there.



 90%|█████████ | 9/10 [00:21<00:01,  1.93s/it]

Testain Jones (Reddenn), and female final car with his friends.



100%|██████████| 10/10 [00:24<00:00,  2.43s/it]

He is a new story his parent to a country at a week and the allegence of survee his service for her the money and them are share and he was a lot of his and and her friend is a process and after all married when he sees the formers and her healther and dogs he was the goes to successful the Brian, 






(This may take a little while.)

When you're satisfied with the results and you're ready to train on all of the sentences, just pass `all_text` directly to `.train_on_texts()` instead of the `random.sample()` call. (Given the size of the corpus we're working with in this example, the following will take a *long* time on most computers. You might consider using a machine with a GPU.)

In [None]:
textgen.train_on_texts(all_text, num_epochs=5)

The textgenrnn library automatically saves the model to disk after each epoch in the same directory as this notebook. You can load a model you've previously trained by passing its filename to the `textgenrnn` function:

In [None]:
textgen = textgenrnn("all_text_weights.hdf5")

After that, you can call the `.generate()` method on the loaded model as usual.

### Generating with shorter texts

I've found that `textgenrnn` works especially well with very short texts. For example, let's generate romantic comedy titles using the information in our corpus!

The code in the following cell makes a list of all of the titles:

In [155]:
titles = list(set([item['title'] for item in sentences]))

In [156]:
random.sample(titles, 5)

['Lucky Partners',
 'Movie Crazy',
 'The Boudoir Diplomat',
 'O Homem do Futuro',
 "Straight A's"]

And create another textgenrnn object:

In [157]:
title_gen = textgenrnn(name="titles")

Now, train the RNN on these titles. One epoch will do the trick:

In [158]:
title_gen.train_on_texts(titles, num_epochs=1)

Training on 22,247 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
The Bride Girl

The Settler Say

She Good Start

####################
Temperature: 0.5
####################
She All Brading Bearling

Story Stealtel Gary

I Love Your Story

####################
Temperature: 1.0
####################
Footlife Fool, Twating Killed

Jouned in Marring

Dru 3D: All Ladderim Dreak



Now generate a list of new titles:

In [159]:
title_gen.generate(25, temperature=0.5)

  4%|▍         | 1/25 [00:00<00:06,  3.85it/s]

The Girl Seven Manda



 12%|█▏        | 3/25 [00:00<00:05,  4.28it/s]

The Love the Heart

Kind Bride Gold



 20%|██        | 5/25 [00:00<00:03,  5.24it/s]

Mary Amber

We Leet Beautice



 24%|██▍       | 6/25 [00:01<00:03,  5.34it/s]

All the Bunder

The Boy



 32%|███▏      | 8/25 [00:01<00:02,  5.75it/s]

The Gay Settling



 36%|███▌      | 9/25 [00:01<00:03,  4.30it/s]

She's Belling Come Your German



 40%|████      | 10/25 [00:02<00:03,  4.19it/s]

The Princess Good



 48%|████▊     | 12/25 [00:02<00:02,  4.56it/s]

She Good and Stud

The Beauticion



 52%|█████▏    | 13/25 [00:02<00:02,  4.77it/s]

Straight Start



 60%|██████    | 15/25 [00:03<00:02,  4.94it/s]

I Love How to True Too

Blue Love



 64%|██████▍   | 16/25 [00:03<00:01,  5.39it/s]

A Send Slut



 68%|██████▊   | 17/25 [00:03<00:01,  4.38it/s]

Forgetting Love Blue Girls



 72%|███████▏  | 18/25 [00:03<00:01,  4.25it/s]

She's Gich vs Shark



 80%|████████  | 20/25 [00:04<00:01,  4.27it/s]

Love Brooms Bowl Tattoo Beth

She Seeking



 88%|████████▊ | 22/25 [00:04<00:00,  4.54it/s]

Ambire Civea Boob

All Groopsy Stuck



 92%|█████████▏| 23/25 [00:04<00:00,  4.49it/s]

Me All Other Bungd



100%|██████████| 25/25 [00:05<00:00,  4.16it/s]

The Blood Singer Bundle Star

Dericon Stringo






## Further reading

* [This notebook from the creator of textgenrnn](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) covers everything about the library that I covered in this tutorial—and much more, including how to start generation from a particular "seed" and how to save and load models (useful if you spent an afternoon training a model on your own corpus and don't want to have to do it again!)
* The author of textgenrnn made a similar easy-to-use [library and interface for finetuning GPT-2](https://github.com/minimaxir/gpt-2-simple). (GPT-2 is one of several recent "transformer" language models which produce token predictions with a noticeably greater level of coherence than Markov chains or RNNs.)
* Take a look at [Janelle Shane's wonderful overview of how she uses RNNs in her process](http://aiweirdness.com/faq). And then take a look at her [wonderful creative work with RNNs](http://aiweirdness.com/).
* Hayes, Brian. “Computer recreations.” Scientific American, vol. 249, no. 5, 1983, pp. 18–31. JSTOR, http://www.jstor.org/stable/24969024. (Original column from Scientific American that described how Markov chain text generation works—very readable! I can send a PDF, hit me up.)
* [A Travesty Generator for Micros](https://elmcip.net/critical-writing/travesty-generator-micros) is a follow-up to Hayes' article that has some more theory and an actual Pascal listing (which is now mostly of historical interest).
* [This notebook](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) shows how to implement a Markov chain generator from scratch in Python, if you're interested in such things!