# Tokenization

Before a program can process natural language, we need to identify the _words_ that constitute a string of characters. This is important because the meaning of a text generally depends on the relations between the words in that text.

By default, text on a computer is represented through string values (`str` in Python). These values store a sequence of characters (nowadays mostly encoded in [UTF-8](http://en.wikipedia.org/wiki/UTF-8)). The first step of an NLP pipeline is therefore to split the text into smaller units corresponding to the words of the language we are considering. In the context of NLP we often refer to these units as _tokens_, and the process of extracting these units is called _tokenization_. Tokenization is considered boring by most, but it is hard to overstate its importance: it is the first step in a long pipeline of NLP processors, and if you get this step wrong, all further steps will suffer.


In Python, a simple way to tokenize a text is the `split` method, which divides a string wherever a particular separator substring is found. In the code below the separator is simply a single space character, which seems like a reasonable starting point for tokenizing English.

In [10]:
text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']
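Splitting on a literal space is fragile as soon as the input contains other whitespace. The following minimal sketch, using a made-up `sample` string that is not part of the running example, contrasts `split(" ")` with the argument-free `split()`, which splits on runs of any whitespace.

```python
# Hypothetical sample text with a doubled space and a trailing newline.
sample = "Why  doesn't he quit?\n"

# Splitting on one literal space keeps an empty string for the doubled
# space and leaves the newline attached to the last token.
by_space = sample.split(" ")

# split() without arguments splits on runs of any whitespace and
# ignores leading and trailing whitespace.
by_whitespace = sample.split()

print(by_space)       # ['Why', '', "doesn't", 'he', 'quit?\n']
print(by_whitespace)  # ['Why', "doesn't", 'he', 'quit?']
```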

### Tokenization with Regular Expressions
NLTK allows users to construct tokenizers using [regular expressions](http://en.wikipedia.org/wiki/Regular_expression) that define either the character patterns at which to split, or the patterns that constitute a token. In general, regular expressions are a powerful tool for NLP practitioners working with text, and they also come in handy with command-line tools such as [grep](http://en.wikipedia.org/wiki/Grep). In the code below we use the simple pattern `\s`, which matches any single whitespace character, to define where to split (indicated by `gaps=True`).

In [9]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s', gaps=True)  # TODO: Exercise that works with gaps=False
tokenizer.tokenize(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']
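The two ways of reading a pattern, as a description of the gaps or as a description of the tokens, can be sketched with the standard-library `re` module alone. This is an illustration rather than the NLTK API itself: `re.split` plays the role of `gaps=True`, and `re.findall` plays the role of a tokenizer whose pattern matches the tokens (with NLTK this would presumably be `RegexpTokenizer(r'\S+')`, which is not run here).

```python
import re

text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"

# View 1: the pattern describes the gaps between tokens.
tokens_from_gaps = re.split(r"\s+", text)

# View 2: the pattern describes the tokens themselves.
tokens_from_matches = re.findall(r"\S+", text)

# For this text both views yield the same 13 tokens.
print(tokens_from_gaps == tokens_from_matches)  # True
```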

One shortcoming of this tokenization is its treatment of punctuation: it considers "plan." a single token, whereas ideally we would prefer "plan" and "." to be distinct tokens (why?). To achieve this we need [lookaround](http://www.regular-expressions.info/lookaround.html) patterns (lookahead and lookbehind), which match zero-length positions before and after punctuation so that we can split there. Note how the code uses Python's [`str.format`](https://docs.python.org/3/library/stdtypes.html#str.format) to inject smaller patterns into larger ones.

In [7]:
import regex  # third-party `regex` package, not the standard-library `re`

punct = r'[.?]'
before_punct = r'(?={})'.format(punct)        # zero-width position just before punctuation
after_punct = r'(?<={})(?!\s)'.format(punct)  # just after punctuation, but not if whitespace follows
pattern = r'(\s|{}|{})'.format(before_punct, after_punct)
# tokenizer = RegexpTokenizer(pattern, gaps=True)
# tokenizer = RegexpTokenizer(before_punct, gaps=True)
# tokenizer.tokenize(text)
regex.split(after_punct, text, flags=regex.VERSION1)

["Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?", '']
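The cell above splits on `after_punct` alone, which in this text matches only at the very end of the string, so the sentence comes back in one piece. As a minimal sketch of the intended behaviour, the snippet below combines whitespace, the lookahead and the lookbehind into one pattern. It assumes Python 3.7 or later, where the standard `re.split` accepts zero-width matches. The extra `$` in the lookbehind branch and the filter for empty strings are choices made for this illustration, not part of the original code.

```python
import re

text = "Mr. Bob Dobolina is thinkin' of a master plan. Why doesn't he quit?"

punct = r"[.?]"
before_punct = r"(?={})".format(punct)
# Assumption for this sketch: also skip the position at the end of the
# string, so that no empty trailing token is produced.
after_punct = r"(?<={})(?!\s|$)".format(punct)

# Split at whitespace, or at the zero-width positions around punctuation.
pattern = r"\s+|{}|{}".format(before_punct, after_punct)

# Drop any empty strings that adjacent split points may produce.
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)
```

Note that this also separates "Mr" from its period, which is usually not what we want for abbreviations.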

In [11]:
import tensorflow as tf