# The Average Novel

[Allison Parrish](http://www.decontextualize.com/)

This is my project for [NaNoGenMo 2017](https://github.com/NaNoGenMo/2017/).

What I did: I took every novel in [Project Gutenberg](http://www.gutenberg.org/), converted them to arrays of word vectors, normalized their lengths to exactly 50000 (the minimum length for a qualifying NaNoGenMo novel), averaged the arrays, and then found the nearest word for each of the vectors in the resulting array. The result: *The Average Novel.*

This didn't result in something that I liked, but it's worth documenting the attempt as a guide for future research.

## Step one: Get a bunch of novels

The `gutenfetch` module here uses my idiosyncratic/undocumented dump of Project Gutenberg (contact me for more info), plus a JSON metadata file, to get metadata and text for items from the Project Gutenberg corpus. Here I'm using it to get only items that have 'fiction' in a subject identifier and that are also labelled as being written in English.

In [1]:
import gutenfetch

In [2]:
novels = gutenfetch.search(lambda x: any(['fiction' in t['identifier'].lower() \
    for t in x['subjects']]) and x['language'] == 'en')

How many texts is that?

In [3]:
len(novels)

11781
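
The filter passed to `gutenfetch.search` above can be tried out on its own. Here's a minimal sketch that applies the same test to a few hand-written records; the records are hypothetical and only mimic the `subjects` and `language` fields of the real metadata, and `is_english_fiction` is a name made up for illustration.

```python
def is_english_fiction(record):
    """True if any subject identifier mentions 'fiction' and the language is English."""
    has_fiction_subject = any(
        'fiction' in subject['identifier'].lower()
        for subject in record['subjects']
    )
    return has_fiction_subject and record['language'] == 'en'

# Hypothetical records, shaped like the metadata used in this notebook
sample_records = [
    {'language': 'en', 'subjects': [{'identifier': 'Murder -- Fiction'}]},
    {'language': 'fr', 'subjects': [{'identifier': 'Science fiction'}]},
    {'language': 'en', 'subjects': [{'identifier': 'Cooking'}]},
]

matches = [r for r in sample_records if is_english_fiction(r)]
print(len(matches))
```

Only the first record passes: the second has the wrong language and the third has no subject identifier containing "fiction".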

Here's what the metadata looks like for one...

In [12]:
novels[999]

{'audience': 'Adult',
 'authors': [{'display_name': 'Nathaniel Hawthorne',
   'family_name': 'Hawthorne',
   'lc': 'n79007728',
   'sort_name': 'Hawthorne, Nathaniel',
   'viaf': '44435463',
   'wikipedia_name': 'Nathaniel_Hawthorne'}],
 'fiction': True,
 'gutenberg_id': 2181,
 'language': 'en',
 'medium': 'Book',
 'sort_title': 'Marble Faun; Or, The Romance of Monte Beni - Volume 1, The',
 'subjects': [{'fiction': True,
   'identifier': 'Nobility -- Fiction',
   'type': 'LCSH'},
  {'fiction': True, 'identifier': 'Murder -- Fiction', 'type': 'LCSH'},
  {'fiction': True,
   'identifier': 'Love stories',
   'name': 'Love stories',
   'type': 'LCSH'},
  {'fiction': True,
   'identifier': 'Women art students -- Fiction',
   'type': 'LCSH'},
  {'fiction': True, 'identifier': 'Rome (Italy) -- Fiction', 'type': 'LCSH'},
  {'fiction': True, 'identifier': 'Artists -- Fiction', 'type': 'LCSH'},
  {'audience': 'Adult',
   'fiction': True,
   'identifier': 'PS',
   'name': 'American literature',
 

## Step two: Vectorize and zoom

In a separate step, I extracted every sentence from each of the novels matched with the above search and used gensim's Word2Vec implementation (using the default settings) to train word vectors on those sentences. [You can download the pre-trained vectors here.](https://s3.amazonaws.com/aparrish/novel-vectors-word2vec.gz) (The file is around 200MB and contains the vectors in the standard Word2Vec format: plain text, count and dimensionality on first line, word and vector values on subsequent lines.)
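
To make the file format concrete, here's a minimal sketch of a reader for the plain-text layout described above (header line with count and dimensionality, then one word and its values per line). The function name `read_word2vec_text` and the two-word sample are assumptions for illustration, not part of this project.

```python
import io
import numpy as np

def read_word2vec_text(fh):
    """Parse the plain-text Word2Vec layout: 'count dims' header, then 'word v1 v2 ...'."""
    count, dims = (int(n) for n in fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip('\n').split(' ')
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return count, dims, vectors

# Hypothetical two-word, three-dimensional sample in the same layout
sample = io.StringIO("2 3\nwhale 0.1 0.2 0.3\nship 0.4 0.5 0.6\n")
count, dims, vectors = read_word2vec_text(sample)
print(count, dims, sorted(vectors))
```

For real use, gensim's `KeyedVectors.load_word2vec_format` reads this format directly, which is likely the more convenient route if you're starting from the downloaded file.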

If you're interested in the corpus of every sentence from these novels, let me know and I can send you a link. (It's about 34 million sentences, 1.2GB.)

Below, I load the pre-trained vectors.

In [10]:
import gensim
import nltk
import numpy as np

In [11]:
w2v = gensim.models.Word2Vec.load("/gutenberg/streams/novel-vectors")

Here's what vectorization looks like. First, parse a text into words (using NLTK's `word_tokenize` function)...

In [15]:
text = nltk.word_tokenize(gutenfetch.get_tar_text(novels[15]['gutenberg_id']))

Then fetch the vector for each word, creating a new numpy array with the results. If for some reason the word isn't in the pre-trained vectors, I decided to just put in zeroes (with the hope that this would "come out in the wash").

In [18]:
vec = np.array([w2v.wv[tok] if tok in w2v.wv else np.zeros(100) for tok in text])

The shape of the resulting array (one row per token, 100 dimensions per row):

In [19]:
vec.shape

(96089, 100)
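
The zero-fill fallback is easier to see at a small scale. This sketch uses a hypothetical dictionary of 4-dimensional vectors in place of the trained model; `toy_vectors` and `vectorize` are illustrative names only.

```python
import numpy as np

# Hypothetical 4-dimensional vectors standing in for the trained model
toy_vectors = {
    'call': np.array([0.1, 0.2, 0.3, 0.4]),
    'me': np.array([0.5, 0.6, 0.7, 0.8]),
}

def vectorize(tokens, vectors, dims):
    """Look up each token, substituting a row of zeroes for unknown words."""
    return np.array(
        [vectors[tok] if tok in vectors else np.zeros(dims) for tok in tokens])

toy_vec = vectorize(['call', 'me', 'Ishmael'], toy_vectors, 4)
print(toy_vec.shape)  # (3, 4)
print(toy_vec[2])     # all zeroes: 'Ishmael' is not in toy_vectors
```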

Using scipy's `zoom` function, we can "zoom" this array such that it has an arbitrary length. (The `zoom` function interpolates the values, so when reducing the length of the text this should yield "smooth" semantic transitions between words.)

In [20]:
from scipy.ndimage import zoom

In [21]:
length_normalized = zoom(vec, (50000 / vec.shape[0], 1))
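
The same call on a tiny array shows what the zoom factors do. In this sketch (toy data, not from the corpus), the first factor rescales the number of rows and the second factor of `1` leaves the vector dimensions untouched.

```python
import numpy as np
from scipy.ndimage import zoom

# Toy "text" of 7 tokens with 3-dimensional vectors
toy = np.arange(21, dtype=float).reshape(7, 3)
target_len = 4

# Shrink along the token axis only; keep the vector width as-is
resized = zoom(toy, (target_len / toy.shape[0], 1))
print(resized.shape)  # (4, 3)
```

By default `zoom` uses cubic spline interpolation (`order=3`), so the new rows are blends of neighboring rows rather than a selection of the original ones.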

Finding the nearest word for the resulting vectors in this resampled text gives us a weird "clipped" version of *Anne of the Island*:

In [24]:
' '.join([w2v.similar_by_vector(w)[0][0] for w in length_normalized[:1000]])

"Produced Charles and Widger of ISLAND by Maud to the all the who `` more about All things late those seek them forth For in works Fate And the from worth -- Table of I Shadow Change . . . . . . . . 14 of Autumn . . . . . . . . 23 III Farewell . . . . . . . . 36 April Lady . . . . . . . . . . 46 Letters Home . . . . . . . . . 67 In Park . . . . . . . . . . . VII Again . . . . . . . . . . . VIII 's First . . . . . . . . IX An and Welcome Friend . . . X 's . . . . . . . . . . XI The of . . . . . . . . . XII `` 's '' . . . . . . . . XIII Way . . . . . . . . The . . . . . . . . . . . XV A Turned Down . . . . . . XVI anti-Administration . . . . . . . XVII A from . . . . . . . . . XVIII Josephine the . . . . An . . . . . . . . . . . Gilbert . . . . . . . . . . XXI of . . . . . . . . . XXII and Return Green . . . . XXIII Can Find Rock . . . . . XXIV Jonas . . . . . . . . . . . XLIII Charming . . . . . . . . XXVI Christine . . . . . . . . . XXIII Mutual . . . . . . . . . XXVIII June . . . . . 

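For a sense of what the nearest-word lookup is doing, here's a minimal numpy sketch of a cosine-similarity search over a toy vocabulary. The three words and their 2-dimensional vectors are made up, and `nearest_word` is an illustrative helper, not gensim's implementation.

```python
import numpy as np

# Hypothetical vocabulary with 2-dimensional vectors
vocab = ['sea', 'ship', 'garden']
matrix = np.array([[1.0, 0.0],
                   [0.8, 0.6],
                   [0.0, 1.0]])

def nearest_word(query, vocab, matrix):
    """Return the vocabulary word whose vector has the highest cosine similarity."""
    query_norm = np.linalg.norm(query)
    if query_norm == 0:
        return None  # cosine similarity is undefined for an all-zero vector
    sims = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * query_norm)
    return vocab[int(np.argmax(sims))]

print(nearest_word(np.array([0.9, 0.1]), vocab, matrix))  # sea
print(nearest_word(np.array([0.5, 0.5]), vocab, matrix))  # ship
```

The second query is the midpoint of "sea" and "garden", yet its nearest word is "ship": an interpolated vector can land closest to a word that wasn't in the original text at all.
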
## Step three: Average

So, easy enough with one text. The following cell performs the same step for *every* text in the corpus. At each step, the numpy array for the text (which should be of shape `(50000, 100)` after being resized) is added to `total`, which at the end of the loop will contain the sum of the arrays for every length-normalized novel. In this version, I include only words consisting entirely of alphabetic characters, which helps to ensure that the output consists of words and not just a bunch of punctuation (as with an earlier version of the output that I posted).
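
As a quick illustration of the alphabetic filter, this sketch runs `str.isalpha` over a hand-written token list in the style of NLTK's tokenizer output (the tokens are examples, not taken from a particular novel).

```python
# Example tokens, including punctuation and number tokens like those in the output above
tokens = ['Call', 'me', 'Ishmael', '.', '``', "'s", '14', 'anti-Administration']

# Keep only tokens made up entirely of alphabetic characters
kept = [tok for tok in tokens if tok.isalpha()]
print(kept)  # ['Call', 'me', 'Ishmael']
```

Note that the filter is fairly aggressive: clitics like `'s` and hyphenated words are dropped along with punctuation and numbers.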

The cell below will take some time to complete! (Around eight hours on my t2.large EC2 instance.)

In [31]:
normalize_to = 50000
total = np.zeros((normalize_to, 100))
novel_count = 0
for i, novel in enumerate(novels):
    if i % 25 == 0: # show progress
        print(i, novel['title'], "/", end='')
    try:
        text = nltk.word_tokenize(
            gutenfetch.get_tar_text(novel['gutenberg_id']))
    except IndexError:
        continue
    #print(len(text))
    if len(text) == 0:
        continue
    vec = np.array(
        [w2v.wv[tok] if tok in w2v.wv else np.zeros(100) for tok in text if tok.isalpha()])
    if vec.shape[0] == 0: # no alphabetic tokens; skip to avoid dividing by zero
        continue
    zoomed = zoom(vec, (normalize_to / vec.shape[0], 1))
    total += zoomed
    novel_count += 1
print()

0 Moby Dick; Or, The Whale /



25 The Adventures of Huckleberry Finn (Tom Sawyer's Comrade) /50 Tess of the d'Urbervilles: A Pure Woman /75 The Lost Continent /100 Uncle Tom's Cabin /125 Black Beauty - The Autobiography of a Horse /150 Moran of the Lady Letty /175 The Mad King /200 The Magic of Oz /225 The White People /250 Rebecca of Sunnybrook Farm /275 Jean of the Lazy A /300 The Woman in White /325 The Cricket on the Hearth /350 Burning Daylight /375 This Side of Paradise /400 Reprinted Pieces /425 Tom Swift and His Submarine Boat; Or, Under the Ocean for Sunken Treasure /450 Malbone: An Oldport Romance /475 The Mirror of Kong Ho /500 Fire-Tongue /525 South Sea Tales /550 The Heritage of the Desert: A Novel /575 The Ball at Sceaux /600 Tom Swift and His Undersea Search; Or, the Treasure on the Floor of the Atlantic /625 A Start in Life /650 Juana /675 Chance: A Tale in Two Parts /700 The Amazing Interlude /725 Little Novels /750 Honorine /775 Finished /800 The Lost House /825 The Case of the Pool of Blood in the

6025 The Submarine Boys' Lightning Cruise /6050 Diddie, Dumps & Tot /6075 His Second Wife /6100 Hearts and Masks /6125 Six Little Bunkers at Cousin Tom's /6150 None Other Gods /6175 With Wolfe in Canada: The Winning of a Continent /6200 The Helpmate /6225 Grace Harlowe's First Year at Overton College /6250 The Bridal March; One Day /6275 Raggedy Ann Stories /6300 Captured by the Navajos /6325 Captain Scraggs; Or, The Green-Pea Pirates /6350 News from the Duchy /6375 What Might Have Been Expected /6400 The Best Short Stories of 1921 and the Yearbook of the American Short Story /6425 He Walked Around the Horses /6450 Woman Triumphant /6475 Zadig /6500 Brigands of the Moon /6525 Edison's Conquest of Mars /6550 Under the Great Bear /6575 Captain Jinks, Hero /6600 Little Wizard Stories of Oz /6625 A Little Princess /6650 The Story of the Other Wise Man /6675 With Frederick the Great: A Story of the Seven Years' War /6700 The Drummer's Coat /6725 Around the World in Ten Days /6750 The Drumme

We're left with ~10k novels:

In [45]:
novel_count

10410

Finding the average by dividing the array by the number of novels:

In [33]:
average_novel = total / novel_count
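
Keeping a running sum means only the total and the current novel's array need to be in memory at once. Assuming float64 values, each `(50000, 100)` array is 50000 × 100 × 8 bytes = 40 MB, so holding all 10410 of them at the same time would take roughly 416 GB. A minimal sketch of the same accumulate-then-divide pattern on toy arrays (the values are made up):

```python
import numpy as np

target_len, dims = 5, 2

# Three toy "novels", already resized to the same shape
toy_novels = [
    np.full((target_len, dims), 1.0),
    np.full((target_len, dims), 2.0),
    np.full((target_len, dims), 3.0),
]

running_total = np.zeros((target_len, dims))
count = 0
for arr in toy_novels:
    running_total += arr  # element-wise sum, position by position
    count += 1

toy_average = running_total / count
print(toy_average[0])  # [2. 2.]
```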

The following cell saves the calculated average as a serialized numpy array, in case you don't want to have to perform the calculation step again:

In [35]:
np.save("average-novel-isalpha.npy", average_novel)
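
To pick up from a saved file later, `np.load` reads the array back. Here's a small round-trip sketch using a temporary directory and a toy array, so it doesn't touch the real output file; the file name `toy-average.npy` is just a placeholder.

```python
import os
import tempfile
import numpy as np

toy_array = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-average.npy")  # placeholder file name
    np.save(path, toy_array)
    restored = np.load(path)

print(np.array_equal(toy_array, restored))  # True
```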

The output consists of the word whose vector is closest to each row in the average novel vector.

In [39]:
output = ' '.join([w2v.similar_by_vector(w)[0][0] for w in average_novel])

Done! Now write the entire thing out to a file, text-wrapped for your reading pleasure.

In [48]:
import textwrap

In [49]:
with open("the-average-novel.txt", "w") as fh:
    fh.write(textwrap.fill(output, width=65))
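
For reference, here's what `textwrap.fill` does to a long single-line string, shown on a made-up sample rather than the real output: it breaks at spaces so that no line exceeds the given width.

```python
import textwrap

# Made-up stand-in for the one-line output string
sample_output = ' '.join(['the'] * 40)

wrapped = textwrap.fill(sample_output, width=65)
wrapped_lines = wrapped.split('\n')
print(max(len(line) for line in wrapped_lines) <= 65)  # True
```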

Done!