Language Processing *Finnegans Wake*
====================

Below we will explore some tools in the [Python Natural Language Tool Kit](http://www.nltk.org/) and see what we can reveal of what might be a shameful choice of a warke.

If you're new to *Finnegans Wake* I'll do my best to explain some of the things I'm trying to examine.

Motivation
---------------------
When James Joyce published his infamous work of obliterature, *Finnegans Wake*, he wanted "to keep the critics busy for 300 years".

It's been 75 years or so, and some [Viconian thunderclaps](http://www.yourepeat.com/watch/?v=a11DEFm0WCw&start_at=347&end_at=390) later we have new media through which we can clarify some of the obscurity of *The Wake*.

Drawbacks
---------------------
There are admittedly drawbacks to textual analysis of *Finnegans Wake*, principally that *The Wake* is meant to be [read out loud](https://www.youtube.com/watch?v=M8kFqiv8Vww). There's information (double, triple, ..., Nth-le meaning) that's only revealed when the text is heard aloud. We're not gonna access that information heare, nor will we be able to pick up puns.

Getting Started
---------------------
First we import the Python libraries we'll be using and the text of *Finnegans Wake* itself.

Note: If you're having difficulty getting these running on your machine, I recommend checking out Anaconda for OSX, which handles Python package installs relatively cleanly.

In [24]:
from __future__ import division
import numpy as np

# Plotting library
import matplotlib
import matplotlib.pyplot as plt
# Plot graphs within ipython notebook
%matplotlib inline

# Python Natural Language Tool Kit
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Import Regular Expressions
import re

# Import unicode parsing functions
import unicodedata

# Define a function to import and tokenize a book from a filename,
# Return a tuple of FULL_TEXT, TOKENIZED_TEXT
def import_text(path):
    full_text = open(path).read().decode('utf8')
    tokenized_text = nltk.Text(word_tokenize(full_text))
    # If the unicode category of a character starts with a P then it is Punctuation and we want it removed
    tokenized_text = [w.lower() for w in tokenized_text if not unicodedata.category(w[0]).startswith('P') ]
    return full_text, tokenized_text

# Import Finnegans Wake and create token list
wake, wake_tokens = import_text("res/wake.txt")

# Import a list of the most boring words in English
stopwords = stopwords.words('english')

# Print to make sure we have only lowercase text and no punctuation tokens
print(wake_tokens[0:50])

[u'finnegans', u'wake', u'by', u'james', u'joyce', u'i', u'riverrun', u'past', u'eve', u'and', u'adam\u2019s', u'from', u'swerve', u'of', u'shore', u'to', u'bend', u'of', u'bay', u'brings', u'us', u'by', u'a', u'commodius', u'vicus', u'of', u'recirculation', u'back', u'to', u'howth', u'castle', u'and', u'environs', u'sir', u'tristram', u'violer', u'd\u2019amores', u'fr\u2019over', u'the', u'short', u'sea', u'had', u'passen-core', u'rearrived', u'from', u'north', u'armorica', u'on', u'this', u'side']
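
The punctuation filter in `import_text` only looks at the first character of each token, and only at Unicode categories that begin with `P`. Here is a minimal sketch of that check in isolation. `is_punctuation_token` and `has_letter_or_digit` are hypothetical helper names made up for this example, not part of the notebook or of NLTK. It shows why a token made of two backticks (a backtick is classed as a symbol, not punctuation) or a token like `mr.` slips through.

```python
import unicodedata

def is_punctuation_token(token):
    """Same test as import_text: is the first character in a 'P' category?"""
    return unicodedata.category(token[0]).startswith('P')

def has_letter_or_digit(token):
    """Hypothetical stricter test: does the token contain a letter or digit?"""
    return any(ch.isalnum() for ch in token)

# A comma, a curly apostrophe, two backticks, and two ordinary tokens
samples = [u",", u"\u2019", u"``", u"mr.", u"adam\u2019s"]
for tok in samples:
    print("%r punctuation=%s alnum=%s"
          % (tok, is_punctuation_token(tok), has_letter_or_digit(tok)))
```

Filtering on `has_letter_or_digit` instead would be one way to drop symbol-only tokens, at the cost of changing the token counts reported in the rest of this notebook.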


Vocabulary Richness
---------------------
First things first, let's see just how linguistically rich *Finnegans Wake* is. A popular metric for vocabulary richness is the ratio of unique words to total words. We'll define a function that takes a text title and its tokens and prints its richness ratio.

In [25]:
def richness(title, tokens):
    total_words = len(tokens)
    print ("======" + title.upper() + "======")
    print ("Number of total words: " + str(total_words))
    total_unique_words = len(set(tokens))
    print ("Number of unique words: " + str(total_unique_words))
    richness_ratio = total_unique_words / total_words
    print ("Ratio of unique to total: " + str(richness_ratio) + "\n")
    
richness("Finnegans Wake", wake_tokens)


Number of total words: 219038
Number of unique words: 58533
Ratio of unique to total: 0.267227604343



Hmm, 26.7% for a 219,038-word book.

Let's see that ratio for two books of roughly comparable length: Herman Melville's *Moby Dick* and James Joyce's *Ulysses*.


In [26]:
# Import Ulysses and create token list
ulysses, ulysses_tokens = import_text("res/ulysses.txt")

# Import Moby Dick and create token list
mobydick, mobydick_tokens = import_text("res/mobydick.txt")

richness("Ulysses", ulysses_tokens)
richness("Moby Dick", mobydick_tokens)


Number of total words: 264435
Number of unique words: 29658
Ratio of unique to total: 0.112156106416

Number of total words: 211460
Number of unique words: 18279
Ratio of unique to total: 0.086441880261



*Ulysses* has 11.2% richness for a similar number of words.

*Moby Dick* has 8.6% richness for a similar number of words.

26.7% means *Finnegans Wake* is incredibly rich: more than twice the ratio of *Ulysses* and about three times that of *Moby Dick*.
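
One caveat: the unique-to-total ratio tends to fall as a text gets longer, because common words keep repeating while new words arrive more slowly. Here is a rough sketch of a length-controlled comparison, assuming we simply truncate every token list to the same number of tokens. `type_token_ratio` and `sample_size` are hypothetical names introduced for this example.

```python
def type_token_ratio(tokens, sample_size=None):
    """Ratio of unique tokens to total tokens, optionally on a prefix only."""
    sample = tokens if sample_size is None else tokens[:sample_size]
    if not sample:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(set(sample)) / float(len(sample))

# Toy data; with the notebook's variables one might pass wake_tokens,
# ulysses_tokens and mobydick_tokens, all with the same sample_size.
toy = ["riverrun", "past", "eve", "and", "adam", "and", "eve"]
print(type_token_ratio(toy))                 # whole list
print(type_token_ratio(toy, sample_size=4))  # first four tokens only
```

Since the three books here differ by only about 25% in length, truncation probably would not change the ranking, but it makes the comparison fairer.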


*A Portrait of the Artist*
---------------------
What words occur in all of *A Portrait of the Artist as a Young Man*, *Dubliners*, *Ulysses* and *Finnegans Wake*?

We'll pull in copies of all those texts and cross-reference word occurrences. Then we'll see which words occur in all four.

In [27]:
# Import Portrait and create token list
portrait, portait_tokens = import_text("res/portrait.txt")

# Import Dubliners and create token list
dubliners, dubliners_tokens = import_text("res/dubliners.txt")

joyce_intersection = (set(wake_tokens) & set(ulysses_tokens) & set(dubliners_tokens) & set(portait_tokens)) - set(stopwords)

print("Number of words that occur in all four of Joyce's 'Novels':")
print(len(list(joyce_intersection)))

Number of words that occur in all four of Joyce's 'Novels':
2988


There are about 3000 words that are in all four of Joyce's books. Let's pool the tokens of all four books, which amounts to adding each book's word counts together, and look at the top 50 words across Joyce's books.

In [28]:
all_tokens = wake_tokens + ulysses_tokens + portait_tokens + dubliners_tokens
all_tokens = [w for w in all_tokens if w not in stopwords]
all_freq = nltk.FreqDist(all_tokens)
all_freq.most_common(50)

[(u'said', 2687),
 (u'one', 1768),
 (u'like', 1640),
 (u'would', 1328),
 (u'old', 1205),
 (u'``', 1105),
 (u'man', 1018),
 (u'stephen', 1000),
 (u'bloom', 994),
 (u'mr', 975),
 (u"n't", 924),
 (u'time', 875),
 (u'two', 860),
 (u'could', 852),
 (u'o', 847),
 (u'see', 838),
 (u'little', 786),
 (u'back', 727),
 (u'good', 714),
 (u'us', 703),
 (u'know', 701),
 (u'first', 697),
 (u'well', 678),
 (u'eyes', 667),
 (u'come', 639),
 (u'say', 613),
 (u'way', 592),
 (u'father', 588),
 (u'go', 573),
 (u'never', 571),
 (u'says', 568),
 (u'mr.', 564),
 (u'hand', 542),
 (u'yes', 540),
 (u'face', 533),
 (u'god', 532),
 (u'made', 529),
 (u'day', 529),
 (u'long', 523),
 (u'let', 520),
 (u'night', 518),
 (u'life', 513),
 (u'still', 495),
 (u'round', 491),
 (u'upon', 488),
 (u'ever', 482),
 (u'head', 469),
 (u'went', 462),
 (u'right', 454),
 (u'tell', 453)]
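
Concatenating the four token lists and counting once gives the same totals as counting each book separately and adding the counts. Here is a small sketch of the per-book version using `collections.Counter` from the standard library (NLTK's `FreqDist` is built on it). The `books` dictionary and its toy contents are made up for illustration.

```python
from collections import Counter

# Toy stand-ins for the real token lists
books = {
    "wake": ["riverrun", "past", "eve", "old"],
    "ulysses": ["stately", "plump", "old", "old"],
}

# One Counter per book, then add them all together
per_book = {title: Counter(tokens) for title, tokens in books.items()}
combined = sum(per_book.values(), Counter())

print(combined.most_common(1))  # [('old', 3)]
# Per-book breakdown of a single word
print(dict((title, counts["old"]) for title, counts in per_book.items()))
```

Keeping the per-book counters around makes it possible to see which book contributes most to a given word, for example how much of `bloom` comes from *Ulysses*.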

HCE and ALP
---------------------
Hundreds of characters appear in *Finnegans Wake* but all of those characters are actually just manifestations or sub-manifestations of man and woman, husband and wife, mountain and river, space and time. Joyce calls them HCE and ALP.

Let's use the power of [Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression) to list all the different occurrences of the initials HCE and ALP.

In [29]:
hce = re.findall(r"\s[Hh]\S*\s[Cc]\S*\s[Ee]\S*", wake, re.U)
for occurrence in hce:
    print(occurrence)

 Haroun Childeric Eggeberth
 he calmly extensolies.
 Hic cubat edilis.
 How Copen-hagen ended.
 happinest childher everwere.

How charmingly exquisite!
 Hither, craching eastuards,
 Hag Chivychas Eve,
 Here Comes Everybody.
 Habituels conspicuously emergent.
 H. C. Earwicker
 he clearly expressed
 H. C. Earwicker,
 He’ll Cheat E’erawan
 haardly creditable edventyres
 haughty, cacuminal, erubescent
 Humpheres Cheops Exarchas,
 huge chain envelope,
 Hatches Cocks’ Eggs,
 haught crested elmer,
 his corns either.
 highly commendable exercise,
 high chief evervirens
 H2 C E3
 hagious curious encestor
 had claimed endright,
 Howforhim chirrupeth evereach-
 Homo Capite Erectus,
 He Can Explain,
 Howke Cotchme Eye,
 Huffy Chops Eads,
 hardily curio-sing entomophilust
 heptagon crystal emprisoms
 Hwang Chang evelytime;
 hoveth chieftains evrywehr,
 hereditatis columna erecta,
 hagion chiton eraphon;
 hallucination, cauchman, ectoplasm;
 hard cash earned
 Hewitt Castello, Equerry,
 heavengendere

In [30]:
alp = re.findall(r"\s[Aa]\S*\s[Ll]\S*\s[Pp]\S*", wake, re.U)
for occurrence in alp:
    print(occurrence)

 askes lay. Phall
 addle liddle phifie
 Apud libertinam parvulam
 a lugly parson
 along landed Paddy
 a lady pack
 a lilyth, pull
 annie lawrie promises
 a lady’s postscript:
 arboro, lo petrusu.
 A Laugh-able Party,
 at length presuaded
 a lane picture
 Any lucans, please?
 acta legitima plebeia,
 any luvial peatsmoor
 Annos longos patimur
 and leadlight panes.
 areyou looking-for Pearlfar
 Amy Licks Porter
 and lited, pleaded
 a lovely park,
 are lovely, pitounette,
 and lice, pricking
 Amnis Limina Permanent)
 any lively purliteasy:
 a lunger planner’s
 a little present
 a loose past.
 and Le PŠre
 Annushka Lutetiavitch Pufflovah,
 All Ladies’ presents.
 and letters play
 apes. Lights, pageboy,
 alla ludo poker
 an litlee plads
 af liefest pose,
 AND LIBERTINE. PROPE
 a lonely peggy,
 appia lippia pluvaville,
 Art, literature, politics,
 American Lake Poetry,
 a luckybock, pledge
 Anna Lynchya Pourable
 anny livving plusquebelle,
 annapal livibel prettily
 a locally person
 a lynche

In [31]:
print("Number of occurrences of HCE: " + str(len(hce)))
print("Number of occurrences of ALP: " + str(len(alp)))

Number of occurrences of HCE: 124
Number of occurrences of ALP: 67
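
Both patterns require whitespace before the first word, so a phrase at the very start of the text would be missed, and `\S*` lets punctuation ride along with each word (hence matches like `Here Comes Everybody.`). Here is a hedged variant that uses word boundaries instead. `initials_pattern` and `find_initials` are hypothetical helpers, and the sample sentence is made up.

```python
import re

def initials_pattern(initials):
    """Build a pattern for consecutive words starting with the given letters."""
    words = [r"[%s%s]\w*" % (ch.upper(), ch.lower()) for ch in initials]
    # \b anchors the first word; \W+ is the gap between consecutive words
    return re.compile(r"\b" + r"\W+".join(words), re.U)

def find_initials(text, initials):
    return initials_pattern(initials).findall(text)

sample = "Here Comes Everybody. He can explain, said Anna Livia Plurabelle."
print(find_initials(sample, "hce"))
print(find_initials(sample, "alp"))
```

This is a tradeoff rather than a strict improvement: `\w*` stops at hyphens, so a phrase like `How Copen-hagen ended` from the list above would no longer match.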


Which words contain the letters A, L and P? Which contain H, C and E? What about both?

In [56]:
a_l_p = re.findall(r"\b(\S*([alp])+\S*((?!\2)([alp]))+\S*((?!(\2|\3))([alp]))+\S*)\b", wake, re.U)
for occurrence in a_l_p:
    print(occurrence[0])

penisolate
oldparr
humptyhillhead
upturnpikepointandplace
cata-pelting
apeal
appalling
leap
multiplicab-les
staple
hierarchitec-titiptitoploftical
baubletop
municipal
plethora
bockalips
platterplate
Shopalist
dalppling
parvulam
slaaps
Lipoleumhat
pullupon-easyan
alps
alps
tallowscoop
plate
tallowscoop
tailoscrupp
lamp
preealittle
pouralittle
wipealittle
pelfalittle
flapped
peacefugle
ilandiskippy
scampulars
playing
planko
place’s
pall
play
parently
paisibly
plain
perihelygangs
paxsealing
Fleapow
pectoral
pillar
pleasurad
plaine
plage
alp
please
allaphbed
please
please
place
plaus-ible
palm
toptypsical
leap
wallops
larpnotes
lamphouse
paly
lilipath
knavepaltry
campbells
panuncular
captol
Wolkencap
Impalpabunt
lipalip
play
swamplight
Healiopolis
Kapelavaster
players
supershillelagh
palmsweat
paddyplanters
laps
planter
pale
falconplumes
appeals
popular
mudapplication
lamp
playing
plaidboy
explain
leopard
wouldpay
deadlop
place
paroqial
archipelago’s
wicklowpattern
respunchable
occupationa

In [57]:
h_c_e = re.findall(r"\b(\S*([hce])+\S*((?!\2)([hce]))+\S*((?!(\2|\3))([hce]))+\S*)\b", wake, re.U)
for occurrence in h_c_e:
    print(occurrence[0])

thuartpeatrick
pftjschute
clashes
Whoyteboyce
chance
cashels
pharce
pentschanjeuchy
Childeric
hierarchitec-titiptitoploftical
archers
cubehouse
search
crunch-bracken
cheriffs
citherers
trochees
hitched
Finiche
schlice
sundyechosies
detch
scotcher
Touchole
sheltershock
Belchum
Belchum
Belchum
Belchum
Belchum
blooches
wrothschields
whence
childher
coacher’s
brooches
allmicheal
muchears
phace
Herrschuft
ketch
scentbreeched
choose
mechanics
each
chabelshovel-ler
Mitchel
lichening
Marchessvan
quickenshoon
Ballyaughacleeagh-bally
charged
selfstretches
choosed
childsfather
nuncheon
chorley
excheck
poached
hence
Stench
thanacestross
attachment
bitches
reaching
earthcrust
creakish
Toucheaterre
cotched
milch-camel
charter
closeth
chopwife
mischievmiss
besch
kickaheeling
niece-of-his-inlaw
punched
changers
hurricane
watercloth
bench
enfranchisable
kerchief
cheeks
toethpicks
chewed
chepped
Maccullaghmore
archgoose
headboddylwatcher
chempel
dinnerchime
cherub
cheek
schoolbelt
choose
twicenightly
Fe
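
The two patterns above appear intended to pick out words that contain all three letters of a set, in any order. Here is an alternative sketch that makes the same test with set operations instead of backreferences. `contains_all` and the sample words are assumptions for illustration, and unlike the regular expressions this version ignores case.

```python
def contains_all(word, letters):
    """True if every letter in `letters` appears somewhere in `word`."""
    return set(letters.lower()) <= set(word.lower())

words = ["penisolate", "childher", "riverrun", "Healiopolis", "chapel"]

print([w for w in words if contains_all(w, "alp")])
print([w for w in words if contains_all(w, "hce")])
# Words that carry both sets of initials
print([w for w in words if contains_all(w, "alp") and contains_all(w, "hce")])
```

Run over `set(wake_tokens)`, the last line would be one way to answer the "what about both?" question.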

Let's use the NLTK frequency distribution object to see just where these characters occur throughout the book.
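
A frequency distribution counts how often something occurs rather than where, so the sketch below swaps in `re.finditer`, whose match objects report character offsets. It is a minimal illustration, assuming that an offset divided by the text length is a good enough measure of position. `relative_positions` and `histogram` are hypothetical helpers and the sample text is made up.

```python
import re

def relative_positions(pattern, text):
    """Return each match's start offset as a fraction of the text length."""
    length = float(len(text))
    return [m.start() / length for m in re.finditer(pattern, text, re.U)]

def histogram(positions, bins=10):
    """Count how many positions fall into each of `bins` equal slices."""
    counts = [0] * bins
    for p in positions:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

# 100 characters with a single HCE phrase starting exactly halfway through
sample = "x" * 50 + " Here Comes Everybody " + "x" * 28
hce_pattern = r"\s[Hh]\S*\s[Cc]\S*\s[Ee]\S*"
print(histogram(relative_positions(hce_pattern, sample), bins=4))
```

With the notebook's data, `relative_positions(hce_pattern, wake)` could be handed to `plt.hist` to draw where in the book the initials cluster.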

Bigrams and Trigrams
---------------------
HCE and ALP may be frequently occurring initials, but let's look at frequently occurring phrases and see if anything interesting arises.

Bigrams are pairs of words that occur side-by-side. Trigrams are triplets of words that occur side-by-side.

Below we generate a list of the most frequently occurring bigrams and trigrams.

In [5]:
# Generate list of bigram tuples from wake_tokens
bgs = nltk.bigrams(wake_tokens)
# Generate list of trigram tuples from wake_tokens
tgs = nltk.trigrams(wake_tokens)

# Define a function which takes a list of n-gram tuples NGRAM and
# returns that list with any tuples that contain stopwords removed
# (punctuation tokens were already dropped by import_text).
# This is how we get actually interesting phrases.
def ngramsFilter(ngram):
    ngrams_filtered = []
    for tup in ngram:
        flag = True
        for w in tup:
            if w in stopwords:
                flag = False
                break
        if flag:
            ngrams_filtered.append(tup)
    return ngrams_filtered

# Map each bigram to its number of occurrences throughout the tokens
fdist_bigrams = nltk.FreqDist(ngramsFilter(bgs))
# Map each trigram to its number of occurrences throughout the tokens
fdist_trigrams = nltk.FreqDist(ngramsFilter(tgs))

print("=====BIGRAMS=====")
for tup in fdist_bigrams.most_common(20):
    print('"' + tup[0][0] + " " + tup[0][1] + '", ' + str(tup[1]))
print("\n")
print("=====TRIGRAMS=====")
for tup in fdist_trigrams.most_common(20):
    print('"' + tup[0][0] + " " + tup[0][1] + " " + tup[0][2] + '", ' + str(tup[1]))
print("\n")

=====BIGRAMS=====
"let us", 37
"poor old", 16
"come back", 12
"ah ho", 12
"shaun replied", 11
"anna livia", 11
"o o", 10
"ay ay", 10
"every time", 9
"tell us", 9
"grand old", 9
"hide seek", 8
"one time", 8
"would like", 8
"right enough", 7
"one day", 7
"wait till", 7
"ho ho", 7
"say nothing", 7
"ever since", 7


=====TRIGRAMS=====
"o o o", 7
"jarl von hoother", 4
"nin nin nin", 4
"seek hide seek", 4
"hide seek hide", 4
"ho ho ho", 3
"many many many", 3
"steady steady steady", 3
"whoishe whoishe whoishe", 3
"order order order", 3
"ah dearo dearo", 3
"dearo dearo dear", 3
"hear o hear", 3
"fore shalt thou", 2
"hip champouree hiphip", 2
"want bud budderly", 2
"ting ting ting", 2
"rally o rally", 2
"big white harse", 2
"hee hee hee", 2


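For readers curious what the bigram and trigram helpers produce, here is a standard-library sketch of the same idea: slide a window of n tokens along the list, then drop any window containing a stopword. The names `sliding_ngrams` and `filter_ngrams` and the tiny stopword set are stand-ins made up for illustration, not NLTK's API.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

def filter_ngrams(ngrams, stopword_set):
    """Keep only the n-grams in which no word is a stopword."""
    return [gram for gram in ngrams
            if not any(w in stopword_set for w in gram)]

tokens = ["hide", "seek", "hide", "seek", "and", "hide"]
toy_stopwords = {"and"}

bigram_counts = Counter(filter_ngrams(sliding_ngrams(tokens, 2), toy_stopwords))
print(bigram_counts.most_common(2))
```

Note the order of operations, which matches the cell above: n-grams are formed first and filtered second, so only words that are truly adjacent in the text are ever paired.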


The Queen's English
---------------------
*Finnegans Wake* seems inscrutable, but many Joyceans say that the best guide to *The Wake* is just a comprehensive English dictionary. Let's see just how many words in *Finnegans Wake* are actually English.

We'll import a list of English words then see if we can find each word in that English word list.

In [None]:
# Import a list of all English Words
en_words, eng_words_tokens = import_text('res/en-words.txt')

#ENG_WAKE is the full text of Finnegans Wake with all non-English words removed
eng_wake = [w for w in wake_tokens if w.lower() in eng_words_tokens]
eng_ratio = len(eng_wake) / len(wake_tokens)
print("Ratio of English words to total words: " + str(eng_ratio))

So 77.9% of the words in *Finnegans Wake* are valid (in the dictionary) English words.

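That above computation is pretty nasty though. `w.lower() in eng_words_tokens` scans a list, so every token in the book triggers a linear search of the dictionary. That is effectively a nested for-loop, and it runs in O(n × m) time for n tokens and m dictionary words.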

We have a set of unique words from above. Let's just use set intersection to check quickly!

In [None]:
# ENG_WAKE_SET is a list of every unique English word in Finnegans Wake
wake_tokens_set =  set(wake_tokens)
eng_wake_set = list(set(eng_words_tokens) & wake_tokens_set)
unique_eng_ratio = len(eng_wake_set) / len(wake_tokens_set)
print("Ratio of unique English words to total unique words: " + str(unique_eng_ratio))

Ah, O(n). That's nice. We'll use this again later.

Of all the different words that occur in *Finnegans Wake* only 32.5% of those words are plain English!
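
The same trick speeds up the token-level ratio too: membership tests against a `set` take roughly constant time, so the whole pass is linear in the number of tokens. Here is a sketch, with `english_token_ratio` as a hypothetical helper and toy inputs standing in for `wake_tokens` and `eng_words_tokens`.

```python
def english_token_ratio(tokens, dictionary_words):
    """Fraction of tokens (counting repeats) found in the dictionary."""
    vocabulary = set(w.lower() for w in dictionary_words)  # built once
    if not tokens:
        return 0.0
    hits = sum(1 for w in tokens if w.lower() in vocabulary)
    return hits / float(len(tokens))

# Toy data: four of the five tokens are in the toy dictionary
toy_tokens = ["riverrun", "past", "eve", "and", "adam"]
toy_dictionary = ["Adam", "and", "eve", "past"]
print(english_token_ratio(toy_tokens, toy_dictionary))  # 0.8
```

Unlike the set intersection, this keeps repeated tokens in the count, so it measures the same thing as the slow version above.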

Languages
---------------------
Many of the non-English words in *Finnegans Wake* were constructed by Joyce himself; they are English or multilingual puns and portmanteaus. Although *The Wake* uses words from many languages, let's look at some of the most popular: English, Latin, French, German and Italian.

In [3]:
ge_words, ge_words_tokens = import_text('res/ge-words.txt')
la_words, la_words_tokens = import_text('res/la-words.txt')
it_words, it_words_tokens = import_text('res/it-words.txt')
fr_words, fr_words_tokens = import_text('res/fr-words.txt')

UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 25023: invalid start byte
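
The traceback says at least one of the word lists is not valid UTF-8. Byte 0x82 is `é` in the old DOS code pages CP437 and CP850, so one plausible guess, and only a guess, is that a list was saved in such an encoding. Here is a hedged sketch that tries a sequence of candidate encodings in order. `read_text_with_fallback` and the candidate list are assumptions, and the demo writes its own temporary file rather than touching anything under `res/`.

```python
import io
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp437")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            with io.open(path, encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %s" % (encodings, path))

# Demo: a file holding "caf" followed by the single byte 0x82
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, "wb") as handle:
    handle.write(b"caf\x82\n")

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used)
```

CP437 can decode any byte sequence, so it has to come last in the list, and a successful decode does not prove the guess is right. It's worth eyeballing a few accented words from the result before trusting it.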