# Tokenization
Is it easy or hard to break up a sentence into words?

In [4]:
text = """Intraterrestrials comprise a vast still-mysterious ecosystem in Earth’s crust containing as many (or more) living microbial cells than are on Earth’s surface. We know this from scientists such as myself going out on scientific drilling ships that sample deep marine sediments or drilling deep into continental crust, laboriously counting the number of cells we find there, and extrapolating out to the rest of the world. The deepest we’ve found intraterrestrials thus far is about 5 km down."""

In [2]:
text.split()


['Intraterrestrials',
 'comprise',
 'a',
 'vast',
 'still-mysterious',
 'ecosystem',
 'in',
 'Earth’s',
 'crust',
 'containing',
 'as',
 'many',
 '(or',
 'more)',
 'living',
 'microbial',
 'cells',
 'than',
 'are',
 'on',
 'Earth’s',
 'surface.',
 'We',
 'know',
 'this',
 'from',
 'scientists',
 'such',
 'as',
 'myself',
 'going',
 'out',
 'on',
 'scientific',
 'drilling',
 'ships',
 'that',
 'sample',
 'deep',
 'marine',
 'sediments',
 'or',
 'drilling',
 'deep',
 'into',
 'continental',
 'crust,',
 'laboriously',
 'counting',
 'the',
 'number',
 'of',
 'cells',
 'we',
 'find',
 'there,',
 'and',
 'extrapolating',
 'out',
 'to',
 'the',
 'rest',
 'of',
 'the',
 'world.',
 'The',
 'deepest',
 'we’ve',
 'found',
 'intraterrestrials',
 'thus',
 'far',
 'is',
 'about',
 '5',
 'km',
 'down.']

At first glance this looks reasonable. However, some tokens still have punctuation attached:

* (or
* surface.

Also possibly problematic, because of the apostrophe:
* Earth’s
* we’ve
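
One way to spot the first kind of problem automatically is to check whether stripping punctuation from the ends of a token changes it. This is only a sketch: the excerpt and the name `flagged` are made up for illustration, and `string.punctuation` covers ASCII punctuation only.

```python
import string

# A short excerpt of the passage above, split on whitespace only.
tokens = "as many (or more) cells on Earth’s surface.".split()

# A token is flagged if stripping ASCII punctuation from its ends changes it.
flagged = [t for t in tokens if t != t.strip(string.punctuation)]
print(flagged)
```

Tokens such as Earth’s are not flagged, because the apostrophe sits inside the word rather than at either end.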

Let's tackle the first problem first.

We can use a regular expression to split on runs of spaces and punctuation characters:

In [5]:
import re

print(re.split(r'[ .!,:;()]+', text))

['Intraterrestrials', 'comprise', 'a', 'vast', 'still-mysterious', 'ecosystem', 'in', 'Earth’s', 'crust', 'containing', 'as', 'many', 'or', 'more', 'living', 'microbial', 'cells', 'than', 'are', 'on', 'Earth’s', 'surface', 'We', 'know', 'this', 'from', 'scientists', 'such', 'as', 'myself', 'going', 'out', 'on', 'scientific', 'drilling', 'ships', 'that', 'sample', 'deep', 'marine', 'sediments', 'or', 'drilling', 'deep', 'into', 'continental', 'crust', 'laboriously', 'counting', 'the', 'number', 'of', 'cells', 'we', 'find', 'there', 'and', 'extrapolating', 'out', 'to', 'the', 'rest', 'of', 'the', 'world', 'The', 'deepest', 'we’ve', 'found', 'intraterrestrials', 'thus', 'far', 'is', 'about', '5', 'km', 'down', '']




What is better, what is still not great?
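
One thing that is still not great is the empty string at the end of the list: the text ends with a period, so `re.split` finds a separator with nothing after it. A minimal sketch of filtering it out, using a shortened sample (the variable names here are illustrative):

```python
import re

sample = "The deepest we’ve found is about 5 km down."

# Splitting on the separator class leaves an empty string after the final period.
raw = re.split(r'[ .!,:;()]+', sample)

# Keeping only non-empty strings is one simple way to clean this up.
tokens = [t for t in raw if t]

print(raw)
print(tokens)
```

The hyphen and the apostrophe are not in the separator class, so tokens like we’ve stay in one piece either way.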

Besides splitting on punctuation, the re module can also find the punctuation symbols and spaces themselves, together with their positions in the text:

In [6]:
list(re.finditer(r'[ .!,:;()]+', text))

[<re.Match object; span=(17, 18), match=' '>,
 <re.Match object; span=(26, 27), match=' '>,
 <re.Match object; span=(28, 29), match=' '>,
 <re.Match object; span=(33, 34), match=' '>,
 <re.Match object; span=(50, 51), match=' '>,
 <re.Match object; span=(60, 61), match=' '>,
 <re.Match object; span=(63, 64), match=' '>,
 <re.Match object; span=(71, 72), match=' '>,
 <re.Match object; span=(77, 78), match=' '>,
 <re.Match object; span=(88, 89), match=' '>,
 <re.Match object; span=(91, 92), match=' '>,
 <re.Match object; span=(96, 98), match=' ('>,
 <re.Match object; span=(100, 101), match=' '>,
 <re.Match object; span=(105, 107), match=') '>,
 <re.Match object; span=(113, 114), match=' '>,
 <re.Match object; span=(123, 124), match=' '>,
 <re.Match object; span=(129, 130), match=' '>,
 <re.Match object; span=(134, 135), match=' '>,
 <re.Match object; span=(138, 139), match=' '>,
 <re.Match object; span=(141, 142), match=' '>,
 <re.Match object; span=(149, 150), match=' '>,
 <re.Match obj
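
Each match object can be asked for its position and its text directly, which is easier to read than the raw representation. A small sketch on a shortened sample (the sample string and the name `matches` are assumptions for illustration):

```python
import re

sample = "as many (or more) living cells"

# span() gives the (start, end) offsets of a match, group() gives the matched text.
matches = [(m.span(), m.group()) for m in re.finditer(r'[ .!,:;()]+', sample)]

for span, matched in matches:
    print(span, repr(matched))
```

Note that a space followed by a parenthesis is reported as a single match, because the pattern matches runs of one or more separator characters.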

There are tools that, rather than swallowing punctuation, split it off into separate "words". They also have rules for contractions such as "we've".

In [None]:
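!python3 -m pip install nltk
import nltk
# word_tokenize needs the Punkt tokenizer models; the resource is named "punkt"
# (recent NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")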

In [7]:
import nltk

nltk.word_tokenize(text)

['Intraterrestrials',
 'comprise',
 'a',
 'vast',
 'still-mysterious',
 'ecosystem',
 'in',
 'Earth',
 '’',
 's',
 'crust',
 'containing',
 'as',
 'many',
 '(',
 'or',
 'more',
 ')',
 'living',
 'microbial',
 'cells',
 'than',
 'are',
 'on',
 'Earth',
 '’',
 's',
 'surface',
 '.',
 'We',
 'know',
 'this',
 'from',
 'scientists',
 'such',
 'as',
 'myself',
 'going',
 'out',
 'on',
 'scientific',
 'drilling',
 'ships',
 'that',
 'sample',
 'deep',
 'marine',
 'sediments',
 'or',
 'drilling',
 'deep',
 'into',
 'continental',
 'crust',
 ',',
 'laboriously',
 'counting',
 'the',
 'number',
 'of',
 'cells',
 'we',
 'find',
 'there',
 ',',
 'and',
 'extrapolating',
 'out',
 'to',
 'the',
 'rest',
 'of',
 'the',
 'world',
 '.',
 'The',
 'deepest',
 'we',
 '’',
 've',
 'found',
 'intraterrestrials',
 'thus',
 'far',
 'is',
 'about',
 '5',
 'km',
 'down',
 '.']
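
For comparison, a single regular expression can get close to this output on our text by matching the tokens instead of the separators. This is a rough approximation, not how NLTK works internally. The pattern is an assumption chosen for this passage, and it would mishandle things like abbreviations or decimal numbers.

```python
import re

sample = "Earth’s still-mysterious crust (or more), we’ve found."

# \w+(?:-\w+)* : a word, optionally continued by hyphenated parts
# [^\w\s]      : any single character that is neither a word character nor whitespace
pattern = r"\w+(?:-\w+)*|[^\w\s]"

tokens = re.findall(pattern, sample)
print(tokens)
```

As in the NLTK output above, the typographic apostrophe ends up as a token of its own, so Earth’s and we’ve are each split into three pieces.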