Language Processing *Finnegans Wake*
====================

Below we will explore some tools in the [Python Natural Language Tool Kit](http://www.nltk.org/) and see what we can reveal of what might be a shameful choice of a warke.

If you're new to *Finnegans Wake* I'll do my best to explain some of the things I'm trying to examine.

Motivation
---------------------
When James Joyce published his infamous work of obliterature, *Finnegans Wake*, he wanted "to keep the critics busy for 300 years".

It's been 75 years or so, and some [Viconian thunderclaps](http://www.yourepeat.com/watch/?v=a11DEFm0WCw&start_at=347&end_at=390) later we have new media through which we can clarify some of the obscurity of *The Wake*.

Drawbacks
---------------------
There are admittedly drawbacks to textual analysis of *Finnegans Wake*, principally that *The Wake* is meant to be [read out loud](https://www.youtube.com/watch?v=M8kFqiv8Vww). There's information (double, triple, ..., Nth-le meaning) that's only revealed when the text is heard aloud. We're not gonna access that information heare, nor will we be able to pick up puns.

Getting Started
---------------------
First we import the Python libraries we'll be using and the text of *Finnegans Wake* itself.

Note: If you're having difficulty getting these running on your machine, I recommend checking out Anaconda for OS X, which handles Python package installs relatively cleanly.

In [1]:
from __future__ import division
import numpy as np

# Plotting library
import matplotlib
import matplotlib.pyplot as plt
# Plot graphs within ipython notebook
%matplotlib inline

# Python Natural Language Tool Kit
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Import Regular Expressions
import re

# Import unicode parsing functions
import unicodedata

# Define a function to import and tokenize a book from a filename,
# Return a tuple of FULL_TEXT, TOKENIZED_TEXT
def import_text(path):
    full_text = open(path).read().decode('utf8')
    tokenized_text = nltk.Text(word_tokenize(full_text))
    # If the unicode category of a character starts with a P then it is Punctuation and we want it removed
    tokenized_text = [w.lower() for w in tokenized_text if not unicodedata.category(w[0]).startswith('P') ]
    return full_text, tokenized_text

# Import Finnegans Wake and create token list
wake, wake_tokens = import_text("res/wake.txt")

# Import a list of the most boring words in English
stopwords = stopwords.words('english')

# Print to make sure we have only lowercase text and no punctuation tokens
print(wake_tokens[0:50])

[u'finnegans', u'wake', u'by', u'james', u'joyce', u'i', u'riverrun', u'past', u'eve', u'and', u'adam', u'from', u'swerve', u'of', u'shore', u'to', u'bend', u'of', u'bay', u'brings', u'us', u'by', u'a', u'commodius', u'vicus', u'of', u'recirculation', u'back', u'to', u'howth', u'castle', u'and', u'environs', u'sir', u'tristram', u'violer', u"d'amores", u"fr'over", u'the', u'short', u'sea', u'had', u'passen-core', u'rearrived', u'from', u'north', u'armorica', u'on', u'this', u'side']
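
The punctuation filter in `import_text` relies on Unicode general categories: every category code that starts with `P` (`Po`, `Pd`, `Ps`, and so on) is some kind of punctuation. Here is a minimal standalone sketch of just that step. It uses a hypothetical helper, `drop_punctuation`, and a hand-made token list rather than NLTK's tokenizer:

```python
# -*- coding: utf-8 -*-
import unicodedata

def drop_punctuation(tokens):
    """Lowercase tokens and drop those whose first character is punctuation."""
    kept = []
    for token in tokens:
        # Categories such as 'Po' (other) and 'Pd' (dash) all start with 'P'
        if token and not unicodedata.category(token[0]).startswith('P'):
            kept.append(token.lower())
    return kept

# Hand-made sample; u"\u2014" is a dash character (category 'Pd')
sample = [u"riverrun", u",", u"past", u"Eve", u"\u2014", u"d'amores", u"."]
print(drop_punctuation(sample))
```

Only the first character is checked, so a token like `d'amores` survives, while a token that starts with an apostrophe would be dropped.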


Vocabulary Richness
---------------------
First things first, let's see just how linguistically rich *Finnegans Wake* is. A popular metric for vocabulary richness is the ratio of unique words to total words. We'll define a function that takes a text title and its tokens and prints its richness ratio.

In [2]:
def richness(title, tokens):
    total_words = len(tokens)
    print ("======" + title.upper() + "======")
    print ("Number of total words: " + str(total_words))
    total_unique_words = len(set(tokens))
    print ("Number of unique words: " + str(total_unique_words))
    richness_ratio = total_unique_words / total_words
    print ("Ratio of unique to total: " + str(richness_ratio) + "\n")
    
richness("Finnegans Wake", wake_tokens)


======FINNEGANS WAKE======
Number of total words: 219406
Number of unique words: 57525
Ratio of unique to total: 0.262185172694



Hmm, 26.2% for a 219,406-word book.

Let's see that ratio for two books of comparable length: Herman Melville's *Moby Dick* and James Joyce's *Ulysses*.


In [3]:
# Import Ulysses and create token list
ulysses, ulysses_tokens = import_text("res/ulysses.txt")

# Import Moby Dick and create token list
mobydick, mobydick_tokens = import_text("res/mobydick.txt")

richness("Ulysses", ulysses_tokens)
richness("Moby Dick", mobydick_tokens)


======ULYSSES======
Number of total words: 264435
Number of unique words: 29658
Ratio of unique to total: 0.112156106416

======MOBY DICK======
Number of total words: 211460
Number of unique words: 18279
Ratio of unique to total: 0.086441880261



*Ulysses* has 11.2% richness for a similar number of words.

*Moby Dick* has 8.6% richness for a similar number of words.

26.2% means *Finnegans Wake* is incredibly rich.
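
One caveat: the unique-to-total ratio tends to shrink as a text gets longer, so the comparison is fairest when every book is cut to the same number of tokens. Here is a minimal sketch of that idea. The helper `type_token_ratio` and its `sample_size` parameter are hypothetical and not part of the notebook above:

```python
def type_token_ratio(tokens, sample_size=None):
    """Ratio of unique tokens to total tokens, optionally on the first sample_size tokens."""
    sample = tokens if sample_size is None else tokens[:sample_size]
    if not sample:
        return 0.0
    # float() keeps the division exact under both Python 2 and Python 3
    return len(set(sample)) / float(len(sample))

toy_tokens = ["riverrun", "past", "eve", "and", "adam", "and", "eve", "past"]
print(type_token_ratio(toy_tokens))                 # whole list
print(type_token_ratio(toy_tokens, sample_size=4))  # first four tokens only
```

With the real data, one could pass `wake_tokens`, `ulysses_tokens` and `mobydick_tokens` with the same `sample_size` (say 200000) and compare the three numbers.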


*A Portrait of the Artist*
---------------------
What words occur in all of *A Portrait of the Artist as a Young Man*, *Dubliners*, *Ulysses* and *Finnegans Wake*?

We'll pull copies of all those texts in and cross-reference word occurrences. Then we'll see which words occur in all four.

In [4]:
# Import Portrait and create token list
portrait, portait_tokens = import_text("res/portrait.txt")

# Import Dubliners and create token list
dubliners, dubliners_tokens = import_text("res/dubliners.txt")

joyce_intersection = (set(wake_tokens) & set(ulysses_tokens) & set(dubliners_tokens) & set(portait_tokens)) - set(stopwords)

print("Number of words that occur in all four of Joyce's 'Novels':")
print(len(list(joyce_intersection)))

Number of words that occur in all four of Joyce's 'Novels':
2998


There are about 3000 words that are in all four of Joyce's novels. Let's take the frequency distribution of those shared words across all four books combined and list the most common ones. Below, we scan the 250 most frequent shared words and keep only those longer than 5 letters, because the short words are less interesting.

In [6]:
all_tokens = wake_tokens + ulysses_tokens + portait_tokens + dubliners_tokens
all_tokens = [w for w in all_tokens if w in joyce_intersection]
all_freq = nltk.FreqDist(all_tokens)
for w in all_freq.most_common()[0:250]:
    # We only look at words longer than 5 letters in length. Arbitrary, but more interesting this way.
    if len(w[0]) > 5: 
        w_string = w[0] + ", " + str(w[1])
        print w_string

little, 786
father, 590
street, 451
always, 385
though, 369
thought, 315
another, 306
behind, 300
course, 297
turned, 293
something, 293
mother, 291
towards, 290
without, 281
fellow, 273
looked, 256
better, 251
morning, 244
moment, 231
called, 231
nothing, 226
passed, 225
coming, 215
things, 204
looking, 204
dublin, 197
seemed, 181
second, 178
friend, 178
people, 176
brought, 174
saying, 171
perhaps, 170
remember, 169
walked, 169
together, 167
corner, 165
others, 160
slowly, 159
evening, 157
anything, 151
enough, 150
around, 150
answered, 149
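
The intersect-then-count approach above can be sketched on toy data with only the standard library. This version swaps `collections.Counter` in for `nltk.FreqDist`, and `shared_word_counts` is a hypothetical helper name:

```python
from collections import Counter

def shared_word_counts(*token_lists):
    """Count, across all lists combined, only the words present in every list."""
    # Words that appear in every token list
    shared = set(token_lists[0]).intersection(*map(set, token_lists[1:]))
    counts = Counter()
    for tokens in token_lists:
        counts.update(w for w in tokens if w in shared)
    return counts

book_a = ["father", "street", "father", "riverrun"]
book_b = ["father", "street", "dublin"]
book_c = ["street", "father", "father"]
print(shared_word_counts(book_a, book_b, book_c).most_common())
```

Filtering against a set rather than a list keeps each membership check fast, which matters once the token lists hold hundreds of thousands of words.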


HCE and ALP
---------------------
Hundreds of characters appear in *Finnegans Wake* but all of those characters are actually just manifestations or sub-manifestations of man and woman, husband and wife, mountain and river, space and time. Joyce calls them HCE and ALP.

Let's use the power of [Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression) to list all the different occurrences of the initials HCE and ALP.

In [7]:
hce = re.findall(r"\s[Hh]\S*\s[Cc]\S*\s[Ee]\S*", wake, re.U)
for n in range(len(hce)//2):
    print "%-50s %s" % (hce[n*2], hce[n*2+1])

 Haroun Childeric Eggeberth                         he calmly extensolies.
 Hic cubat edilis.                                  How Copen-hagen ended.
 happinest childher everwere.                      
How charmingly exquisite!
 Hither, craching eastuards,                        Hag Chivychas Eve,
 Here Comes Everybody.                              Habituels conspicuously emergent.
 H. C. Earwicker                                    he clearly expressed
 H. C. Earwicker,                                   He'll Cheat E'erawan
 haardly creditable edventyres                      haughty, cacuminal, erubescent
 Humpheres Cheops Exarchas,                         huge chain envelope,
 Hatches Cocks' Eggs,                               haught crested elmer,
 his corns either.                                  highly commendable exercise,
 high chief evervirens                              H2 C E3
 hagious curious encestor                           had claimed endright,
 Howforhim chirrupeth ev

In [8]:
alp = re.findall(r"\s[Aa]\S*\s[Ll]\S*\s[Pp]\S*", wake, re.U)
for n in range(len(alp)//2):
    print "%-50s %s" % (alp[n*2], alp[n*2+1])

 askes lay. Phall                                   addle liddle phifie
 Apud libertinam parvulam                           a lugly parson
 along landed Paddy                                 a lady pack
 a lilyth, pull                                     annie lawrie promises
 a lady's postscript:                               arboro, lo petrusu.
 A Laugh-able Party,                                at length presuaded
 a lane picture                                     Any lucans, please?
 acta legitima plebeia,                             any luvial peatsmoor
 Annos longos patimur                               and leadlight panes.
 areyou looking-for Pearlfar                        Amy Licks Porter
 and lited, pleaded                                 a lovely park,
 are lovely, pitounette,                            and lice, pricking
 Amnis Limina Permanent)                            any lively purliteasy:
 a lunger planner's                                 a little present
 a loose p

In [9]:
print("Number of occurrences of HCE: " + str(len(hce)))
print("Number of occurrences of ALP: " + str(len(alp)))

Number of occurrences of HCE: 124
Number of occurrences of ALP: 67
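
Each pattern above chains three copies of the same piece: a whitespace character, the initial in either case, then any run of non-whitespace characters. Here is a hedged sketch that builds such a pattern for arbitrary initials. `find_initials` is a hypothetical helper, and the sample sentence is stitched together for illustration:

```python
import re

def find_initials(text, initials):
    """Return runs of consecutive words whose first letters spell `initials`."""
    # One piece per initial: whitespace, the letter in either case, rest of the word
    pieces = [r"\s[%s%s]\S*" % (c.upper(), c.lower()) for c in initials]
    pattern = "".join(pieces)
    return [m.strip() for m in re.findall(pattern, text, re.U)]

sample = " Here Comes Everybody, said Anna Livia Plurabelle. How Copenhagen ended."
print(find_initials(sample, "hce"))
print(find_initials(sample, "alp"))
```

Because every piece starts with `\s`, a phrase at the very beginning of the text is missed, and `re.findall` only returns non-overlapping matches, so counts produced this way may slightly undercount.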


### Single ALP and HCE words
Which single words contain all of the letters A, L and P? Which contain H, C and E? And what about both sets at once?

In [29]:
a_l_p = re.findall(r"\b(\S*([alp])\S*((?!\2)([alp]))\S*((?!(\2|\3))([alp]))\S*)\b", wake, re.U)
print "Number of ALP words: " + str(len(a_l_p)) + "\n"
# There are too many words so we're only going to look at the first 200
a_l_p = a_l_p[0:200]
for n in range(len(a_l_p)//4):
    print "%-20s %-20s %-20s %s" % (a_l_p[n*4][0], a_l_p[n*4+1][0], a_l_p[n*4+2][0], a_l_p[n*4+3][0])

Number of ALP words: 2706

penisolate           oldparr              humptyhillhead       upturnpikepointandplace
cata-pelting         apeal                appalling            leap
multiplicab-les      staple               hierarchitec-titiptitoploftical baubletop
municipal            plethora             bockalips            platterplate
Shopalist            dalppling            parvulam             slaaps
Lipoleumhat          pullupon-easyan      alps                 alps
tallowscoop          plate                tallowscoop          tailoscrupp
lamp                 preealittle          pouralittle          wipealittle
pelfalittle          flapped              peacefugle           ilandiskippy
scampulars           playing              planko               place's
pall                 play                 parently             paisibly
plain                perihelygangs        paxsealing           Fleapow
pectoral             pillar               pleasurad            plaine
plage     

In [30]:
h_c_e = re.findall(r"\b(\S*([hce])\S*((?!\2)([hce]))\S*((?!(\2|\3))([hce]))\S*)\b", wake, re.U)
print "Number of HCE words: " + str(len(h_c_e)) + "\n"
# There are too many words so we're only going to look at the first 200
h_c_e = h_c_e[0:200]
for n in range(len(h_c_e)//4):
    print "%-20s %-20s %-20s %s" % (h_c_e[n*4][0], h_c_e[n*4+1][0], h_c_e[n*4+2][0], h_c_e[n*4+3][0])

Number of HCE words: 2766

thuartpeatrick       pftjschute           clashes              Whoyteboyce
chance               cashels              pharce               pentschanjeuchy
Childeric            hierarchitec-titiptitoploftical archers              cubehouse
search               crunch-bracken       cheriffs             citherers
trochees             hitched              Finiche              schlice
sundyechosies        detch                scotcher             Touchole
sheltershock         Belchum              Belchum              Belchum
Belchum              Belchum              blooches             wrothschields
whence               childher             coacher's            brooches
allmicheal           muchears             phace                Herrschuft
ketch                scentbreeched        choose               mechanics
each                 chabelshovel-ler     Mitchel              lichening
Marchessvan          quickenshoon         Ballyaughacleeagh-bally charged
selfs

Okay, so there are a ton of HCE and ALP words. So many that searching for that patterning seems kinda fruitless.

Instead, what if we look for words that contain all six of those letters?

## HCE + ALP words

We'll modify our earlier regular expression to return these words.

If you'd like to see a description of what's going on in that crazy long regex below, follow [this link](https://regex101.com/r/oH9rG9/1) and click 'Explanation'.

In [33]:
h_c_e_a_l_p = re.findall(r"\b(\S*([hcealp])\S*((?!\2)([hcealp]))\S*((?!(\2|\3))([hcealp]))\S*((?!(\2|\3|\5))([hcealp]))\S*((?!(\2|\3|\5|\8))([hcealp]))\S*((?!(\2|\3|\5|\8|\11))([hcealp]))\S*)\b", wake, re.U)

# This contains all the thunder words, we're gonna remove them
h_c_e_a_l_p = [w for w in h_c_e_a_l_p if len(w[0]) < 70]
print "Number of HCEALP words: " + str(len(h_c_e_a_l_p)) + "\n"
for n in range(len(h_c_e_a_l_p)//4):
    print "%-20s %-20s %-20s %s" % (h_c_e_a_l_p[n*4][0], h_c_e_a_l_p[n*4+1][0], h_c_e_a_l_p[n*4+2][0], h_c_e_a_l_p[n*4+3][0])

Number of HCEALP words: 87

hierarchitec-titiptitoploftical archipelago's        respunchable         andrewpaulmurphyc
pleach               pecklapitschens      epickthalamorous     epipsychidically
applecheeks          accomplished         placewheres          chapelgoers
morphomelosophopancreates pathetically         Isitachapel–Asitalukin place-hider
patchpurple          houdingplaces        chapel               blacksheep
hyperchemical        accomplished         metropoliarchialisation accomplished
palypeachum          unspeechably         chaplet              accom-plished
pupil-teacher's      cheekadeekchimple    steeplechange        pupilteachertaut
blockcheap           howdrocephalous      parishclerks         Brighten-pon-the-Baltic
stuumplecheats       Makehal-pence        Holophullopopu-lace  speechsalver's
heliotropical        fellow-chap          handcomplishies      hopeygoalucrey
parchels             accomplishment       Estchapel            halfpricers
polemarch      
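
The long regex above amounts to asking whether a word contains each of the six lowercase letters at least once, in any order. A minimal sketch of that same test using sets instead of a regex follows. `contains_all` is a hypothetical helper, and because it works on whole tokens rather than on raw text with `\b` boundaries, its results could differ from the regex in edge cases:

```python
def contains_all(word, letters):
    """True if every character in `letters` appears somewhere in `word`."""
    return set(letters) <= set(word)

candidates = ["chapel", "archipelago's", "accomplished", "apple", "Chapel"]
# Case-sensitive, like the regex: "Chapel" has no lowercase "c", so it is excluded
print([w for w in candidates if contains_all(w, "hcealp")])
```

The same helper covers the three-letter searches too, for example `contains_all(word, "alp")`.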

Bigrams and Trigrams
---------------------
HCE and ALP may be frequently occurring initials, so let's look at frequently occurring phrases and see if anything interesting arises.

Bigrams are pairs of words that occur side-by-side. Trigrams are triplets of words that occur side-by-side.
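
To make the definition concrete, here is a small standard-library sketch of n-gram extraction. The `ngrams` function here is a hypothetical stand-in written for illustration, not NLTK's implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, giving len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["hide", "seek", "hide", "seek", "hide"]
print(ngrams(tokens, 2))                           # all bigrams
print(Counter(ngrams(tokens, 3)).most_common(1))   # most frequent trigram
```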

Below we generate a list of the most frequently occurring bigrams and trigrams.

In [32]:
# Generate list of bigram tuples from wake_tokens
bgs = nltk.bigrams(wake_tokens)
# Generate list of trigram tuples from wake_tokens
tgs = nltk.trigrams(wake_tokens)

# Define a function which takes a list of ngram tuples NGRAM and
# returns that list with any tuples that contain stopwords removed
# (punctuation tokens were already dropped in import_text).
# This is how we get actually interesting phrases
def ngramsFilter(ngram):
    ngrams_filtered = []
    for tup in ngram:
        flag = True
        for w in tup:
            if w in stopwords:
                flag = False
                break
        if flag:
            ngrams_filtered.append(tup)
    return ngrams_filtered

# Map each bigram to its number of occurrences throughout the tokens
fdist_bigrams = nltk.FreqDist(ngramsFilter(bgs))
# Map each trigram to its number of occurrences throughout the tokens
fdist_trigrams = nltk.FreqDist(ngramsFilter(tgs))

print("=====BIGRAMS=====")
for tup in fdist_bigrams.most_common(30):
    print('"' + tup[0][0] + " " + tup[0][1] + '", ' + str(tup[1]))
print("\n")
print("=====TRIGRAMS=====")
for tup in fdist_trigrams.most_common(30):
    print('"' + tup[0][0] + " " + tup[0][1] + " " + tup[0][2] + '", ' + str(tup[1]))
print("\n")

=====BIGRAMS=====
"let us", 37
"wo n't", 33
"ca n't", 32
"would n't", 26
"poor old", 16
"could n't", 14
"come back", 12
"anna livia", 12
"ah ho", 12
"shaun replied", 11
"o o", 10
"ay ay", 10
"every time", 9
"tell us", 9
"grand old", 9
"n't say", 8
"hide seek", 8
"one time", 8
"would like", 8
"n't know", 8
"old man", 7
"right enough", 7
"one day", 7
"wait till", 7
"ho ho", 7
"say nothing", 7
"ever since", 7
"good old", 7
"n't forget", 6
"n't tell", 6


=====TRIGRAMS=====
"o o o", 7
"jarl von hoother", 5
"nin nin nin", 4
"seek hide seek", 4
"hide seek hide", 4
"ho ho ho", 3
"whoishe whoishe whoishe", 3
"meeter cat wife", 3
"many many many", 3
"steady steady steady", 3
"order order order", 3
"ah dearo dearo", 3
"jarl van hoother", 3
"dearo dearo dear", 3
"hear o hear", 3
"fore shalt thou", 2
"hip champouree hiphip", 2
"hoke hoke hoke", 2
"two plays punk", 2
"ting ting ting", 2
"variants katey sherratt", 2
"rally o rally", 2
"big white harse", 2
"au aue ha", 2
"hee hee hee", 2
"old matt gr

The Queen's English
---------------------
*Finnegans Wake* seems inscrutable, but many Joyceans say that the best guide to *The Wake* is just a comprehensive English dictionary. Let's see just how many words in *Finnegans Wake* are actually English.

We'll import a list of English words then see if we can find each word in that English word list.

In [14]:
# Import a list of all English Words
en_words, en_words_tokens = import_text('res/en-words.txt')

In [58]:
#EN_WAKE is the full text of Finnegans Wake with all non-English words removed
en_wake = [w for w in wake_tokens if w.lower() in en_words_tokens]
en_ratio = len(en_wake) / len(wake_tokens)
print("Ratio of English words to total words: " + str(en_ratio))

Ratio of English words to total words: 0.778632018189


So 77.9% of the words in *Finnegans Wake* are valid (in the dictionary) English words.

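That computation is pretty slow, though. `en_words_tokens` is a list, so every `in` check scans it from the start; the comprehension is effectively a nested for-loop and runs in roughly O(n<sup>2</sup>) time.

Sets give constant-time lookups on average, so let's use set intersection to check the unique words quickly!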

In [15]:
# EN_WAKE_SET is a list of every unique English word in Finnegans Wake
wake_tokens_set =  set(wake_tokens)
en_wake_set = list(set(en_words_tokens) & wake_tokens_set)
unique_en_ratio = len(en_wake_set) / len(wake_tokens_set)
print("Ratio of unique English words to total unique words: " + str(unique_en_ratio))

Ratio of unique English words to total unique words: 0.334550195567


Ah, O(n). That's nice. We'll use this again later.

Of all the different words that occur in *Finnegans Wake*, only 33.5% are plain English!
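
The slow token-level ratio from earlier can get the same speed-up without losing frequency information: keep the token list as it is and convert only the dictionary to a set. Here is a minimal sketch on toy data. `english_ratio` is a hypothetical helper and the tiny vocabulary is made up for illustration:

```python
def english_ratio(tokens, vocabulary):
    """Fraction of tokens (duplicates included) that are found in vocabulary."""
    vocab = set(vocabulary)  # built once; each lookup is then O(1) on average
    english = [w for w in tokens if w in vocab]
    return len(english) / float(len(tokens))

toy_tokens = ["riverrun", "past", "eve", "and", "adam"]
toy_vocabulary = ["past", "eve", "and", "adam"]
print(english_ratio(toy_tokens, toy_vocabulary))
```

Building the set costs one pass over the dictionary, after which the whole ratio is computed in a single pass over the tokens.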

## Most Common Non-English Words
Working with sets is nice and all, but it destroys our frequency information. So if we would like to know, for instance, which are the most commonly occurring non-English words in *The Wake*, then we have to build the inverse of the `en_wake` list from above.

Below we are only looking at words that are longer than two letters, because the two-letter words aren't so interesting.

In [16]:
#NON_EN_WAKE is the full text of Finnegans Wake with all English words removed
set_en_tokens = set(en_words_tokens)
non_en_wake = [w for w in wake_tokens if w.lower() not in set_en_tokens]

In [18]:
non_en_wake_dist = nltk.FreqDist(non_en_wake)

# We're only looking at the top 300
most_common = [w for w in non_en_wake_dist.most_common()[0:300] if len(w[0]) > 2]
for n in range(len(most_common)//4):
    print "%-20s %-20s %-20s %s" % (most_common[n*4][0] + ", " + str(most_common[n*4][1]) ,
                                    most_common[n*4+1][0] + ", " + str(most_common[n*4+1][1]), 
                                    most_common[n*4+2][0] + ", " + str(most_common[n*4+2][1]), 
                                    most_common[n*4+3][0] + ", " + str(most_common[n*4+3][1]))

n't, 289             shaun, 46            mrs, 43              sayd, 35
willingdone, 24      o'er, 24             taff, 20             kevin, 20
sagd, 19             shem, 19             mookse, 18           shee, 17
browne, 16           ing, 16              jaun, 15             hosty, 15
livia, 15            arrah, 15            yous, 15             jinnies, 14
kersse, 14           juva, 13             muta, 13             humphrey, 13
marcus, 13           sazd, 13             nolan, 13            rann, 13
talis, 12            1132, 12             earwicker, 12        lyons, 11
moy, 11              tay, 11              yur, 10              lucan, 10
thon, 10             lipoleums, 10        anny, 10             e'er, 9
blong, 9             jarl, 9              les, 9               dee, 9
est, 9               honuphrius, 9        glugg, 9             anita, 9
nin, 9               mor, 9               hoother, 9           dook, 8
efter, 8             med, 8               rere, 8        

## Full English
Some sentences in *The Wake* consist purely of English words. Let's find them.