# Playing with Word Tokenization

One thing that turned up quickly is that a simple `str.split()`
is not necessarily the best way to tokenize titles into words.
I'd claim that _Baby Love_ by the Supremes and _Baby, I Love You_ by the Ronettes
both start with _Baby_, but `split()` by default breaks only on whitespace and treats all other characters alike.
So the former starts with "Baby" but the latter with "Baby,".
A simplistic strategy might be to look at runs of consecutive alphabetic characters,
but I'm pretty sure that _Don't Bring Me Down_ by ELO does not start with "Don".
I could play with ad hoc regexes, but this is a well-known problem,
so let's look for existing solutions.

## Load the data

The [AtoZ Playlist page](http://xpn.org/music-artist/xpn-a-z)
contains a directory of songs by first letter.
Behind the scenes, it makes a REST request to the site's backend.
Eventually I should cache the results,
so this keeps working when they change the site.
But for now, while the playlist is still going,
just build a data frame from the results.
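The caching mentioned above could look something like the following. This is only a sketch, not something the notebook does: `load_rows`, `fetch_rows`, and the cache file name are hypothetical, and it assumes the rows are plain lists of strings so they survive a JSON round trip.

```python
import json
from pathlib import Path


def load_rows(cache_path, fetch_rows):
    """Return rows from the cache file if it exists; otherwise fetch and save them.

    `fetch_rows` is a hypothetical zero-argument callable that returns a list
    of [title, artist] lists, e.g. a wrapper around the scraping loop.
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: no network request needed
        return json.loads(path.read_text(encoding="utf-8"))
    rows = fetch_rows()
    # Cache miss: store the freshly fetched rows for next time
    path.write_text(json.dumps(rows), encoding="utf-8")
    return rows
```

With something like this in place, the data frame could be built from `load_rows('playlist.json', fetch_rows)` and would keep working after the site changes, as long as the cache file is kept.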

In [41]:
%matplotlib inline
from lxml import html
import requests
import pandas as pd
from IPython.display import display, HTML

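rows = []
for letter in ['A', 'B', 'C', 'D']:
    # Fetch the list of plays for this letter from the site's backend
    page = requests.get('http://xpn.org/static/az.php?q=%s' % letter)
    tree = html.fromstring(page.content)
    # Each play is the text of an <li> element
    plays = tree.xpath('//li/text()')
    for play in plays:
        # Split on the first ' - ' only, giving [title, artist]
        rows.append(play.split(' - ', 1))
playlist = pd.DataFrame(rows, columns=('Title', 'Artist'))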

## Parse the first word of the title

### Simple String Splitting

Just use the default `str.split()`.
This has the problem that it treats adjacent punctuation as part of the word.
This turns out to be more important than one might expect,
as the pattern "word, words ..." is not uncommon.
For example, it maps `"Baby Love"` to `["Baby", "Love"]`
but maps `"Baby, I Love You"` to `["Baby,", "I", "Love", "You"]`,
making the two songs start with different words.
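A quick standalone check of that behavior, using only the two titles from the example (no playlist data needed):

```python
titles = ["Baby Love", "Baby, I Love You"]

# Default split() breaks on whitespace only, so the comma stays attached
first_words = [title.split()[0] for title in titles]
print(first_words)  # ['Baby', 'Baby,']
```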

In [42]:
simple_split = playlist['Title'].apply(lambda title: title.split()[0]).value_counts().to_frame('split()')

### NLTK Word tokenizer

[NLTK](http://www.nltk.org/), or Natural Language Toolkit,
is a popular Python package that includes most of the usual suspects
for text analysis.
It includes lots of tokenizers, and `word_tokenize` seems an obvious choice.
However, it is not that simple.
It does solve the comma problem and
maps `"Baby, I Love You"` to `["Baby", ",", "I", "Love", "You"]`.
It maps `"Do You Wanna Dance?"` to `["Do", "You", "Wan", "na", "Dance", "?"]`,
which gets "Do" right.
However, it maps `"Don't Get Me Wrong"` to `['Do', "n't", 'Get', 'Me', 'Wrong']`.
The difference between "Do" and "Don't" is kind of fundamental.
Even worse, this folds _Ca Plane Pour Moi_ in with all the songs that begin
with "Can't", which tokenize as `["Ca", "n't", ...]`.

In [43]:
from nltk.tokenize import word_tokenize
nltk_word_tokenizer = playlist['Title'].apply(lambda title: word_tokenize(title)[0]).value_counts().to_frame('nltk.word_tokenize()')

### NLTK WordPunct tokenizer

[NLTK](http://www.nltk.org) also provides a simpler tokenizer called `wordpunct_tokenize`,
which does a regex-based tokenization.
It doesn't turn "Don't" into "Do", which is good.
But by splitting on any non-alphabetic character, it turns "Don't" into "Don";
that is to say, it maps `"Don't Bring Me Down"`
to `['Don', "'", 't', 'Bring', 'Me', 'Down']`.
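To see why the apostrophe splits the word, here is a stdlib-only sketch that mimics the behavior. It assumes the pattern `\w+|[^\w\s]+`, which is the one NLTK documents for its `WordPunctTokenizer`; `wordpunct_like` is a hypothetical helper, not part of NLTK.

```python
import re

# Assumption: same pattern NLTK documents for WordPunctTokenizer.
# It matches either a run of word characters or a run of punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Hypothetical stand-in for wordpunct_tokenize using only the re module."""
    return WORDPUNCT_PATTERN.findall(text)


print(wordpunct_like("Don't Bring Me Down"))
# ['Don', "'", 't', 'Bring', 'Me', 'Down']
print(wordpunct_like("Baby, I Love You"))
# ['Baby', ',', 'I', 'Love', 'You']
```

The apostrophe is neither a word character nor whitespace, so it becomes its own token and cuts "Don't" into three pieces.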

In [44]:
from nltk.tokenize import wordpunct_tokenize
nltk_wordpunct_tokenizer = playlist['Title'].apply(lambda title: wordpunct_tokenize(title)[0]).value_counts().to_frame('nltk.wordpunct_tokenize()')

## Comparing the results

Paste all the samples together and compare the results.
At this juncture, using the `word_tokenize()` routine causes more problems than it solves.
It solves the "Baby," problem, but does an ugly job on contractions such as
"Don't" and "Ain't".

In [45]:
results = simple_split.join(nltk_word_tokenizer, how='outer')
results = results.join(nltk_wordpunct_tokenizer, how='outer')
results = results.join(results.max(axis=1).to_frame('max'))
HTML(results.sort_values('max', ascending=False).head(50).to_html())

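| First word | split() | nltk.word_tokenize() | nltk.wordpunct_tokenize() | max |
|---|---|---|---|---|
| Do | 18.0 | 79.0 | 18.0 | 79 |
| Don't | 61.0 | | | 61 |
| Don | | | 61.0 | 61 |
| All | 39.0 | 39.0 | 39.0 | 39 |
| A | 37.0 | 37.0 | 38.0 | 38 |
| Baby | 19.0 | 23.0 | 23.0 | 23 |
| Come | 22.0 | 22.0 | 22.0 | 22 |
| Can | 5.0 | 5.0 | 22.0 | 22 |
| Ai | | 20.0 | | 20 |
| Ain't | 20.0 | | | 20 |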


## Conclusions

More work is needed, but right now I can't justify moving away from the simplistic split on whitespace.
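One direction that further work could take, sketched here under assumptions and not tested against the playlist: a regex that treats an apostrophe between letters as part of the word. `first_word` and its pattern are hypothetical, cover only ASCII letters and the straight apostrophe, and fall back to whitespace splitting when a title contains no letters.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally continued by
# an apostrophe followed by more letters (as in "Don't" or "Ain't").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def first_word(title):
    """Return the first apostrophe-aware word in a title.

    Falls back to the first whitespace-delimited token when the title
    contains no letters at all.
    """
    match = WORD_PATTERN.search(title)
    if match:
        return match.group(0)
    return title.split()[0]


print(first_word("Baby, I Love You"))     # Baby
print(first_word("Don't Bring Me Down"))  # Don't
print(first_word("Do You Wanna Dance?"))  # Do
```

This keeps "Baby" and "Baby," together without splitting "Don't", but it would still need checking against accented letters, curly apostrophes, and titles that open with digits or parentheses.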