In [24]:
%%capture
%load_ext autoreload
%autoreload 2
%cd ..
import statnlpbook.tokenization as tok

# Tokenization

* Identify the **words** in a string of characters. 
* Improve the input **representation** in the [structured prediction recipe](structured_prediction.ipynb).

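In Python you can tokenize text with `str.split`, but the result is suboptimal: punctuation stays attached to words, and splitting on a single space does not treat newlines as separators.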

In [7]:
text = "Mr. Bob Dobolina is thinkin' of a master plan." + \
       "\nWhy doesn't he quit?"
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']

### Regular Expressions 
Python allows users to construct tokenizers using regular expressions that define the character sequence patterns at which to split tokens.

In [13]:
import re
re.compile(r'\s').split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

Problems:
* Punctuation remains attached to words (`plan.`, `quit?`).
* It is often easier to define what a token looks like than what a gap looks like.

Let us use `findall` instead:

In [16]:
re.compile(r'\w+|[.?]').findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

Problems:
* "Mr." is split into two tokens but should be a single token.
* Apostrophes are lost: `thinkin'` becomes `thinkin`, and `doesn't` becomes `doesn` and `t`.

Both are fixed below:

In [19]:
re.compile(r"Mr\.|[\w']+|[.?]").findall(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']
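
Hard-coding `Mr.` does not scale to other abbreviations. As a minimal sketch, the pattern can be generated from a list instead. The abbreviation list and the helper name `build_tokenizer` are assumptions for illustration and are not part of `statnlpbook`:

```python
import re

# Illustrative abbreviation list (an assumption, not taken from the notebook).
ABBREVIATIONS = ["Mr.", "Mrs.", "Dr."]

def build_tokenizer(abbreviations):
    """Compile a token pattern that keeps the given abbreviations intact."""
    # Escape each abbreviation so '.' matches a literal period, and try
    # longer abbreviations first so a shorter entry cannot shadow them.
    ordered = sorted(abbreviations, key=len, reverse=True)
    escaped = [re.escape(a) for a in ordered]
    return re.compile("|".join(escaped) + r"|[\w']+|[.?]")

tokenizer = build_tokenizer(ABBREVIATIONS)
example = "Dr. Bob Dobolina doesn't quit. Why?"
print(tokenizer.findall(example))
```

Alternatives in a Python regular expression are tried from left to right, so the abbreviations have to come before the generic `[\w']+` branch.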

## Learning to Tokenize?
* For English, simple pattern matching is often sufficient. 
* In other languages (e.g. Japanese), words are not separated by whitespace.


In [21]:
jap = "彼は音楽を聞くのが大好きです"
re.compile('彼|は|く|音楽|を|聞くの|が|大好き|です').findall(jap)

['彼', 'は', '音楽', 'を', '聞くの', 'が', '大好き', 'です']
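
The pattern above only works because it lists exactly the words of this one sentence. A slightly more general baseline is greedy longest-match segmentation against a word list. The following is a minimal sketch of that idea; the toy vocabulary and the function name `max_match` are assumptions for illustration:

```python
def max_match(text, vocabulary):
    """Greedily segment text, taking the longest vocabulary entry at each position."""
    longest = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no vocabulary entry starts at position i.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy vocabulary (illustrative only).
vocabulary = {"彼", "は", "く", "音楽", "を", "聞くの", "が", "大好き", "です"}
print(max_match("彼は音楽を聞くのが大好きです", vocabulary))
```

Greedy matching can choose the wrong segmentation when vocabulary entries overlap, and it has no principled way to handle unknown words.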

Tokenization is equally complex for certain English domains (e.g. biomedical text). [TODO: Example!]

Solution: Treat tokenization as a **statistical NLP problem**! 
  * classification
  * [sequence labelling](sequence_labelling.ipynb)
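
To make the sequence labelling view concrete, here is a minimal sketch that encodes a tokenization as one label per character and decodes it again. The label set (`B` begins a token, `I` continues a token, `O` lies outside any token) and the helper names are assumptions for illustration. A statistical tokenizer would learn to predict such labels rather than derive them from a gold tokenization:

```python
def tokens_to_labels(text, tokens):
    """Label each character: 'B' begins a token, 'I' continues one, 'O' is outside."""
    labels = ["O"] * len(text)
    position = 0
    for token in tokens:
        # Locate the next occurrence of the token at or after the current position.
        start = text.index(token, position)
        labels[start] = "B"
        for i in range(start + 1, start + len(token)):
            labels[i] = "I"
        position = start + len(token)
    return labels

def labels_to_tokens(text, labels):
    """Recover the tokens from per-character labels."""
    tokens = []
    inside = False
    for char, label in zip(text, labels):
        if label == "B":
            tokens.append(char)
            inside = True
        elif label == "I" and inside:
            tokens[-1] += char
        else:
            # An 'O' (or a stray 'I' with no open token) closes the current token.
            inside = False
    return tokens

sample = "Why doesn't he quit?"
gold = ["Why", "doesn't", "he", "quit", "?"]
labels = tokens_to_labels(sample, gold)
print("".join(labels))
print(labels_to_tokens(sample, labels))
```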

# Sentence Segmentation

* Many NLP tools work sentence-by-sentence. 
* Often trivial after tokenization: split sentences at sentence-ending punctuation tokens.

In [28]:
tokens = re.compile(r"Mr\.|[\w']+|[.?]").findall(text)
# try different regular expressions
tok.sentence_segment(re.compile(r'\.'), tokens)

[['Mr.',
  'Bob',
  'Dobolina',
  'is',
  "thinkin'",
  'of',
  'a',
  'master',
  'plan',
  '.'],
 ['Why', "doesn't", 'he', 'quit', '?']]
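
The implementation of `tok.sentence_segment` is not shown here. The following is a minimal sketch of the idea, assuming a sentence ends at every token that the pattern matches from its first character. The function name `segment_sentences` is hypothetical, and its behaviour may differ from the library's:

```python
import re

def segment_sentences(boundary, tokens):
    """Group tokens into sentences, closing a sentence after each boundary token."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        # re.match anchors at the start of the token, so 'Mr.' is not a
        # boundary for the pattern r'[.?]', while '.' and '?' are.
        if boundary.match(token):
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not closed by a boundary token.
        sentences.append(current)
    return sentences

example_tokens = ["Mr.", "Bob", "is", "here", ".", "Why", "quit", "?"]
print(segment_sentences(re.compile(r"[.?]"), example_tokens))
```

Matching from the start of the token, rather than searching anywhere inside it, is what keeps an abbreviation such as `Mr.` from ending a sentence.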

# Background Reading

* Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.
* Manning, Raghavan & Schuetze, Introduction to Information Retrieval: [Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)