Language Processing *Finnegans Wake*
====================

Below we will explore some tools in the [Python Natural Language Toolkit (NLTK)](http://www.nltk.org/) and see what we can reveal of what might be a shameful choice of a warke.

If you're new to *Finnegans Wake* I'll do my best to explain some of the things I'm trying to examine.

Motivation
---------------------
When James Joyce published his infamous work of obliterature, *Finnegans Wake*, he wanted "to keep the critics busy for 300 years".

It's been 75 years or so, and some [Viconian thunderclaps](http://www.yourepeat.com/watch/?v=a11DEFm0WCw&start_at=347&end_at=390) later, we have new media through which we can clarify some of the obscurity of *The Wake*.

Drawbacks
---------------------
There are admittedly drawbacks to textual analysis of *Finnegans Wake*, principally that *The Wake* is meant to be [read out loud](https://www.youtube.com/watch?v=M8kFqiv8Vww). There's information (double, triple, ..., Nth-le meaning) that's only revealed when the text is heard aloud. We're not gonna access that information heare, nor will we be able to pick up puns.

Getting Started
---------------------
First we import the Python libraries we'll be using and the text of *Finnegans Wake* itself.

Note: If you're having difficulty getting these running on your machine, I recommend checking out Anaconda for OS X, which handles Python package installs relatively cleanly.

In [12]:
from __future__ import division
import numpy as np

# Plotting library
import matplotlib
import matplotlib.pyplot as plt
# Plot graphs within ipython notebook
%matplotlib inline

# Python Natural Language Tool Kit
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Import Finnegans Wake and create token list
wake = open("res/wake.txt").read().decode('utf8')
wake_tokens = nltk.Text(word_tokenize(wake))
wake_tokens = [w.lower() for w in wake_tokens]
# Print the first 50 tokens to check that the text is lowercase
# (punctuation tokens such as ',' and '.' are NOT removed by word_tokenize)
print(wake_tokens[0:50])

[u'finnegans', u'wake', u',', u'by', u'james', u'joyce', u'i', u'riverrun', u',', u'past', u'eve', u'and', u'adam\u2019s', u',', u'from', u'swerve', u'of', u'shore', u'to', u'bend', u'of', u'bay', u',', u'brings', u'us', u'by', u'a', u'commodius', u'vicus', u'of', u'recirculation', u'back', u'to', u'howth', u'castle', u'and', u'environs', u'.', u'sir', u'tristram', u',', u'violer', u'd\u2019amores', u',', u'fr\u2019over', u'the', u'short', u'sea', u',', u'had']
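
The sample above still contains punctuation tokens such as `,` and `.`, so they get counted as "words" in the totals that follow. One possible way to drop them is sketched here. It assumes that a token counts as a word if it contains at least one letter or digit; `drop_punctuation`, `sample_tokens` and `word_tokens` are illustrative names, not part of the notebook.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# A few tokens copied from the printed sample above
sample_tokens = [u'riverrun', u',', u'past', u'eve', u'and', u'adam\u2019s', u',', u'from']

word_tokens = drop_punctuation(sample_tokens)
print(word_tokens)
```

Tokens with an internal apostrophe, like `adam’s`, survive because they still contain letters. Applying the same filter to `wake_tokens` would change the counts reported below, so treat it as an optional refinement.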


Vocabulary Richness
---------------------
First things first, let's see just how linguistically rich *Finnegans Wake* is. A popular metric for vocabulary richness is the ratio of unique words to total words.

In [10]:
total_words = len(wake_tokens)
print ("Number of total words: " + str(total_words))
total_unique_words = len(set(wake_tokens))
print ("Number of unique words: " + str(total_unique_words))
richness_ratio = total_unique_words / total_words
print ("Ratio of unique to total: " + str(richness_ratio))

Number of total words: 258468
Number of unique words: 58629
Ratio of unique to total: 0.226832722039


Hmm, 22.7% for a 258,468-word book.

Let's see that ratio for 250,000 words' worth of Herman Melville's *Moby Dick* and James Joyce's *Ulysses*.


In [13]:
# Import Ulysses and create token list
ulysses = open("res/ulysses.txt").read().decode('utf8')
ulysses_tokens = nltk.Text(word_tokenize(ulysses))
ulysses_tokens = [w.lower() for w in ulysses_tokens]
# Print the first 50 tokens to check that the text is lowercase (punctuation tokens remain)
print(ulysses_tokens[0:50])

# Import Moby Dick and create token list
mobydick = open("res/mobydick.txt").read().decode('utf8')
mobydick_tokens = nltk.Text(word_tokenize(mobydick))
mobydick_tokens = [w.lower() for w in mobydick_tokens]
# Print the first 50 tokens to check that the text is lowercase (punctuation tokens remain)
print(mobydick_tokens[0:50])

[u'the', u'project', u'gutenberg', u'ebook', u'of', u'ulysses', u',', u'by', u'james', u'joyce', u'this', u'ebook', u'is', u'for', u'the', u'use', u'of', u'anyone', u'anywhere', u'at', u'no', u'cost', u'and', u'with', u'almost', u'no', u'restrictions', u'whatsoever', u'.', u'you', u'may', u'copy', u'it', u',', u'give', u'it', u'away', u'or', u're-use', u'it', u'under', u'the', u'terms', u'of', u'the', u'project', u'gutenberg', u'license', u'included', u'with']
[u'the', u'project', u'gutenberg', u'ebook', u'of', u'moby', u'dick', u';', u'or', u'the', u'whale', u',', u'by', u'herman', u'melville', u'this', u'ebook', u'is', u'for', u'the', u'use', u'of', u'anyone', u'anywhere', u'at', u'no', u'cost', u'and', u'with', u'almost', u'no', u'restrictions', u'whatsoever', u'.', u'you', u'may', u'copy', u'it', u',', u'give', u'it', u'away', u'or', u're-use', u'it', u'under', u'the', u'terms', u'of', u'the']
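
The cell above only loads the two comparison texts. A minimal sketch of how the ratio could then be computed over equal-sized samples follows. It assumes "250,000 words' worth" means the first 250,000 tokens of each list; `richness` and `toy_tokens` are illustrative names, not part of the notebook.

```python
def richness(tokens, limit=250000):
    """Ratio of unique tokens to total tokens over the first `limit` tokens."""
    sample = tokens[:limit]
    if not sample:
        return 0.0
    # float() keeps this a true division on Python 2 as well
    return len(set(sample)) / float(len(sample))

# Toy data standing in for a real token list
toy_tokens = ['a', 'way', 'a', 'lone', 'a', 'last', 'a', 'loved', 'a', 'long', 'the']

print(richness(toy_tokens))           # whole list: 7 unique out of 11
print(richness(toy_tokens, limit=4))  # first 4 tokens: 3 unique out of 4
```

In the notebook this would be called as `richness(wake_tokens)`, `richness(ulysses_tokens)` and `richness(mobydick_tokens)`. Note that the printed samples show the Project Gutenberg header is still part of the *Ulysses* and *Moby Dick* token lists, so stripping it first would give a fairer comparison.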


HCE and ALP
---------------------
Hundreds of characters appear in *Finnegans Wake* but all of those characters are actually just manifestations or sub-manifestations of man and woman, husband and wife, mountain and river, space and time. Joyce calls them HCE and ALP.

Let's list all the different occurrences of the initials HCE and ALP.
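
One way this search might look is sketched here, assuming an "occurrence" means three consecutive tokens whose first letters spell the initials. `find_initials` and `toy_tokens` are illustrative names, not part of the notebook.

```python
def find_initials(tokens, initials):
    """Return every run of consecutive tokens whose first letters spell `initials`."""
    initials = initials.lower()
    n = len(initials)
    matches = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # w[:1] is safe even for an empty token
        if all(w[:1].lower() == c for w, c in zip(window, initials)):
            matches.append(' '.join(window))
    return matches

# Toy data standing in for wake_tokens
toy_tokens = ['here', 'comes', 'everybody', 'and', 'anna', 'livia', 'plurabelle']

print(find_initials(toy_tokens, 'hce'))
print(find_initials(toy_tokens, 'alp'))
```

Run as `find_initials(wake_tokens, 'hce')`, this strict version would miss phrases like "Howth Castle and Environs", where a stopword sits between the initials. Filtering the tokens against the `stopwords` corpus imported earlier, and dropping punctuation tokens, before searching is one way to catch those.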

Mostly English
---------------------
*Finnegans Wake* seems inscrutable, but many Joyceans say that the best guide to *The Wake* is just a comprehensive English dictionary. Let's see just how many words in *Finnegans Wake* are actually in English.

We'll import a list of English words then see if we can find each word in that English word list.

In [None]:
# Import a list of English words
en_words = open('res/en-words.txt').read().decode('utf8')
# Store the words in a set so each membership test is a hash lookup
# instead of a scan through the whole word list
eng_words_tokens = set(word_tokenize(en_words))

# eng_wake is the token list of Finnegans Wake with all non-English tokens removed
# (wake_tokens is already lowercase, so no further lowercasing is needed)
eng_wake = [w for w in wake_tokens if w in eng_words_tokens]
eng_ratio = len(eng_wake) / len(wake_tokens)
print("Ratio of English words to total words: " + str(eng_ratio))

Languages
---------------------
*Finnegans Wake* may be written mostly in plain English, but there are many non-English words as well. Many of those words were constructed by Joyce himself: English or multilingual puns and portmanteaus. *The Wake* draws on many languages, but for this first section we're going to concern ourselves only with words that exist in six of the most common languages in *The Wake*: English, Irish, Latin, French, German and Italian.
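
A rough sketch of how tokens might be matched against the six languages follows, assuming one word list per language is available as a Python set. The tiny `lexicons` here are hypothetical stand-ins for real word lists, and `languages_of` is an illustrative name, not part of the notebook.

```python
def languages_of(word, lexicons):
    """Return the sorted names of every lexicon that contains `word`."""
    return sorted(name for name, words in lexicons.items() if word in words)

# Hypothetical toy lexicons; real ones would be loaded from word-list files
lexicons = {
    'English': set(['and', 'sea']),
    'Irish': set(['agus', 'muir']),
    'Latin': set(['et', 'mare']),
    'French': set(['et', 'mer']),
    'German': set(['und', 'meer']),
    'Italian': set(['e', 'mare']),
}

for token in ['et', 'mare', 'riverrun']:
    print(token, languages_of(token, lexicons))
```

A word can belong to several languages at once, which is why the function returns a list rather than a single label. Tokens that match no lexicon, like `riverrun` here, are candidates for Joyce's own coinages.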