Language Processing *Finnegans Wake*
====================

Below we will explore some tools in the [Python Natural Language Tool Kit](http://www.nltk.org/) and see what we can reveal of what might be a shameful choice of a warke.

If you're new to *Finnegans Wake* I'll do my best to explain some of the things I'm trying to examine.

Motivation
---------------------
When James Joyce published his infamous work of obliterature, *Finnegans Wake*, he wanted "to keep the critics busy for 300 years".

It's been 75 years or so, and some [Viconian thunderclaps](http://www.yourepeat.com/watch/?v=a11DEFm0WCw&start_at=347&end_at=390) later we have new media through which we can clarify some of the obscurity of *The Wake*.

Drawbacks
---------------------
There are admittedly drawbacks to textual analysis of *Finnegans Wake*, principally that *The Wake* is meant to be [read out loud](https://www.youtube.com/watch?v=M8kFqiv8Vww). There's information, double, triple,..., Nth-le meaning, that's only revealed when heard aloud. We're not gonna access that information heare, nor will we be able to pick up puns.

Getting Started
---------------------
First we import the Python libraries we'll be using and the text of *Finnegans Wake* itself.

Note: If you're having difficulty getting these running on your machine, I recommend checking out Anaconda for OS X, which handles Python package installs relatively cleanly.

In [14]:
from __future__ import division
import numpy as np

# Plotting library
import matplotlib
import matplotlib.pyplot as plt
# Plot graphs within ipython notebook
%matplotlib inline

# Python Natural Language Tool Kit
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Import Regular Expressions
import re

# Define a function to import and tokenize a book from a filename.
# Returns a tuple of (FULL_TEXT, TOKENIZED_TEXT)
import io

def import_text(path):
    # io.open decodes the file to unicode on both Python 2 and Python 3
    with io.open(path, encoding='utf8') as f:
        full_text = f.read()
    tokenized_text = [w.lower() for w in word_tokenize(full_text)]
    return full_text, tokenized_text

# Import Finnegans Wake and create token list
wake, wake_tokens = import_text("res/wake.txt")

# Print the first 50 tokens to check that everything is lowercase
# (note that punctuation marks are still present as their own tokens)
print(wake_tokens[0:50])

[u'finnegans', u'wake', u',', u'by', u'james', u'joyce', u'i', u'riverrun', u',', u'past', u'eve', u'and', u'adam\u2019s', u',', u'from', u'swerve', u'of', u'shore', u'to', u'bend', u'of', u'bay', u',', u'brings', u'us', u'by', u'a', u'commodius', u'vicus', u'of', u'recirculation', u'back', u'to', u'howth', u'castle', u'and', u'environs', u'.', u'sir', u'tristram', u',', u'violer', u'd\u2019amores', u',', u'fr\u2019over', u'the', u'short', u'sea', u',', u'had']
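The output above still contains punctuation tokens such as `,` and `.`, and these are counted as "words" in the figures that follow. A minimal sketch of one way to drop them, assuming that any token with at least one letter or digit counts as a word; `drop_punctuation` is a hypothetical helper, not part of NLTK, and the numbers reported below were computed without it:

```python
# Hypothetical helper: keep only tokens that contain at least one
# alphanumeric character, so ',' and '.' are dropped but "adam's" survives.
def drop_punctuation(tokens):
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# A few tokens copied from the output above, plus a closing full stop
sample_tokens = [u'riverrun', u',', u'past', u'eve', u'and',
                 u'adam\u2019s', u',', u'from', u'.']
words_only = drop_punctuation(sample_tokens)
print(words_only)
```

Applying something like this to `wake_tokens` before counting would lower the total word count and could shift the ratios slightly.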


Vocabulary Richness
---------------------
First things first, let's see just how linguistically rich *Finnegans Wake* is. A popular metric for vocabulary richness is the ratio of unique words to total words. We'll define a function that takes a text title and its tokens and prints its richness ratio.

In [10]:
def richness(title, tokens):
    total_words = len(tokens)
    print ("======" + title.upper() + "======")
    print ("Number of total words: " + str(total_words))
    total_unique_words = len(set(tokens))
    print ("Number of unique words: " + str(total_unique_words))
    richness_ratio = total_unique_words / total_words
    print ("Ratio of unique to total: " + str(richness_ratio) + "\n")
    
richness("Finnegans Wake", wake_tokens)


======FINNEGANS WAKE======
Number of total words: 258468
Number of unique words: 58629
Ratio of unique to total: 0.226832722039



Hmm, 22.7% for a 258,468-word book.

Let's see that ratio for two books of roughly comparable length: Herman Melville's *Moby Dick* and James Joyce's *Ulysses*.


In [11]:
# Import Ulysses and create token list
ulysses, ulysses_tokens = import_text("res/ulysses.txt")

# Import Moby Dick and create token list
mobydick, mobydick_tokens = import_text("res/mobydick.txt")

richness("Ulysses", ulysses_tokens)
richness("Moby Dick", mobydick_tokens)


======ULYSSES======
Number of total words: 319471
Number of unique words: 30399
Ratio of unique to total: 0.0951541767484

======MOBY DICK======
Number of total words: 250542
Number of unique words: 18413
Ratio of unique to total: 0.073492667896



*Ulysses* has 9.5% richness over 319,471 tokens.

*Moby Dick* has 7.3% richness over 250,542 tokens.

At 22.7%, *Finnegans Wake* is more than twice as rich as *Ulysses* and roughly three times as rich as *Moby Dick*.
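One caveat: the ratio of unique to total words tends to fall as a text gets longer, and the three books are not exactly the same length. A minimal sketch of a like-for-like comparison, assuming we simply truncate every text to the length of the shortest one; `truncated_richness` is a hypothetical helper and the toy token lists stand in for the real books:

```python
# Hypothetical helper: richness ratio over the first n tokens only,
# so texts of different lengths are compared on equal footing.
def truncated_richness(tokens, n):
    window = tokens[:n]
    if not window:
        return 0.0
    return len(set(window)) / float(len(window))

# Toy data standing in for real token lists
toy_a = "a b c d e f a b".split()
toy_b = "a b a b a b a b a b a b".split()

# Compare both texts over the length of the shorter one
n = min(len(toy_a), len(toy_b))
ratio_a = truncated_richness(toy_a, n)
ratio_b = truncated_richness(toy_b, n)
print(ratio_a, ratio_b)
```

With the real data, `n` would be the smallest of `len(wake_tokens)`, `len(ulysses_tokens)` and `len(mobydick_tokens)`.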


HCE and ALP
---------------------
Hundreds of characters appear in *Finnegans Wake* but all of those characters are actually just manifestations or sub-manifestations of man and woman, husband and wife, mountain and river, space and time. Joyce calls them HCE and ALP.

Let's list all the different occurrences of the initials HCE and ALP, that is, every run of three consecutive words whose first letters are H, C, E or A, L, P.

In [19]:
# Three whitespace-separated words starting with h, c and e (either case)
hce = re.findall(r"\s[Hh]\S*\s[Cc]\S*\s[Ee]\S*", wake, re.U)
for occurrence in hce:
    print(occurrence)

 Haroun Childeric Eggeberth
 he calmly extensolies.
 Hic cubat edilis.
 How Copen-hagen ended.
 happinest childher everwere.

How charmingly exquisite!
 Hither, craching eastuards,
 Hag Chivychas Eve,
 Here Comes Everybody.
 Habituels conspicuously emergent.
 H. C. Earwicker
 he clearly expressed
 H. C. Earwicker,
 He’ll Cheat E’erawan
 haardly creditable edventyres
 haughty, cacuminal, erubescent
 Humpheres Cheops Exarchas,
 huge chain envelope,
 Hatches Cocks’ Eggs,
 haught crested elmer,
 his corns either.
 highly commendable exercise,
 high chief evervirens
 H2 C E3
 hagious curious encestor
 had claimed endright,
 Howforhim chirrupeth evereach-
 Homo Capite Erectus,
 He Can Explain,
 Howke Cotchme Eye,
 Huffy Chops Eads,
 hardily curio-sing entomophilust
 heptagon crystal emprisoms
 Hwang Chang evelytime;
 hoveth chieftains evrywehr,
 hereditatis columna erecta,
 hagion chiton eraphon;
 hallucination, cauchman, ectoplasm;
 hard cash earned
 Hewitt Castello, Equerry,
 heavengendere

In [20]:
# Three whitespace-separated words starting with a, l and p (either case)
alp = re.findall(r"\s[Aa]\S*\s[Ll]\S*\s[Pp]\S*", wake, re.U)
for occurrence in alp:
    print(occurrence)

 askes lay. Phall
 addle liddle phifie
 Apud libertinam parvulam
 a lugly parson
 along landed Paddy
 a lady pack
 a lilyth, pull
 annie lawrie promises
 a lady’s postscript:
 arboro, lo petrusu.
 A Laugh-able Party,
 at length presuaded
 a lane picture
 Any lucans, please?
 acta legitima plebeia,
 any luvial peatsmoor
 Annos longos patimur
 and leadlight panes.
 areyou looking-for Pearlfar
 Amy Licks Porter
 and lited, pleaded
 a lovely park,
 are lovely, pitounette,
 and lice, pricking
 Amnis Limina Permanent)
 any lively purliteasy:
 a lunger planner’s
 a little present
 a loose past.
 and Le Père
 Annushka Lutetiavitch Pufflovah,
 All Ladies’ presents.
 and letters play
 apes. Lights, pageboy,
 alla ludo poker
 an litlee plads
 af liefest pose,
 AND LIBERTINE. PROPE
 a lonely peggy,
 appia lippia pluvaville,
 Art, literature, politics,
 American Lake Poetry,
 a luckybock, pledge
 Anna Lynchya Pourable
 anny livving plusquebelle,
 annapal livibel prettily
 a locally person
 a lynche

In [21]:
print("Number of occurrences of HCE: " + str(len(hce)))
print("Number of occurrences of ALP: " + str(len(alp)))

Number of occurrences of HCE: 124
Number of occurrences of ALP: 67
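The two patterns above differ only in their letters, so they can be generated from the initials themselves. A minimal sketch, assuming the same matching rule (leading whitespace, then one word per initial, each separated by a single whitespace character); `find_initials` is a hypothetical helper and the sample sentence is made up for illustration:

```python
import re

# Hypothetical helper: build one "\s[Xx]\S*" group per initial and
# return every non-overlapping match with surrounding whitespace stripped.
def find_initials(text, initials):
    pattern = "".join(r"\s[%s%s]\S*" % (c.upper(), c.lower()) for c in initials)
    return [m.strip() for m in re.findall(pattern, text, re.U)]

# Made-up sample text, not a quotation from the book
sample = u" Here Comes Everybody. Then Anna Livia Plurabelle and a little present."
hce_sample = find_initials(sample, "hce")
alp_sample = find_initials(sample, "alp")
print(hce_sample)
print(alp_sample)
```

Because `re.findall` returns non-overlapping matches and each pattern requires leading whitespace, a phrase at the very start of the text or one that overlaps an earlier match is not counted.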


Mostly English
---------------------
*Finnegans Wake* seems inscrutable, but many Joyceans say that the best guide to *The Wake* is just a comprehensive English dictionary. Let's see just how many of the words in *Finnegans Wake* are actually English.

We'll import a list of English words then see if we can find each word in that English word list.

In [12]:
# Import a list of all English Words
en_words, eng_words_tokens = import_text('res/en-words.txt')

#ENG_WAKE is the full text of Finnegans Wake with all non-English words removed
eng_wake = [w for w in wake_tokens if w.lower() in eng_words_tokens]
eng_ratio = len(eng_wake)  / len(wake_tokens)
print("Ratio of English words to total words: " + str(eng_ratio))

KeyboardInterrupt: 
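The cell above was interrupted before it finished. A likely reason is that `w in eng_words_tokens` scans a Python list from the start for every one of the book's tokens. A minimal sketch of the same calculation using a set, where each lookup takes roughly constant time; `english_ratio` is a hypothetical helper and the toy lists stand in for `wake_tokens` and `eng_words_tokens`:

```python
# Hypothetical helper: share of tokens that appear in an English word list.
def english_ratio(tokens, english_words):
    # Build the set once; membership tests on a set do not scan every entry
    english_set = set(w.lower() for w in english_words)
    english_tokens = [t for t in tokens if t.lower() in english_set]
    return len(english_tokens) / float(len(tokens))

# Toy data standing in for the real token lists
toy_tokens = ['riverrun', 'past', 'eve', 'and', 'commodius', 'vicus']
toy_dictionary = ['past', 'eve', 'and', 'river', 'run']
toy_ratio = english_ratio(toy_tokens, toy_dictionary)
print(toy_ratio)
```

With the real data the call would be `english_ratio(wake_tokens, eng_words_tokens)`; the result depends on the word list used, so no figure is given here.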

Languages
---------------------
*Finnegans Wake* may be written mostly in plain English, but there are many non-English words as well. Many of those words were constructed by Joyce himself as English or multilingual puns and portmanteaus. Although *The Wake* draws on many languages, for this first section we're going to concern ourselves only with words that exist in six of the most common languages in *The Wake*: English, Irish, Latin, French, German and Italian.