# Tokenization, stemming, and sentence segmentation

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2016 Florian Leitner. All rights reserved.

Today's lab will cover some of the basic techniques to work with text: tokenization, stemming, and sentence segmentation.

First we run the standard setup seen in the preparatory notebooks:

In [129]:
%pylab inline --no-import-all

Populating the interactive namespace from numpy and matplotlib


Another (optional) preparatory step: Install the `segtok` tokenization and segmentation library created by your instructor:

- Anaconda Python: `conda install segtok`
- Stock Python: `pip3 install segtok`

Once done, the following import should work; if not, you can always skip the `segtok`-based examples.

In [130]:
import segtok

## Tokenizing text

If the next command (`import nltk`...) does not succeed, remember to install NLTK first (day 1).

In [131]:
import nltk

In [132]:
nltk.data.find('corpora/gutenberg/carroll-alice.txt')

FileSystemPathPointer('/Users/fleitner/nltk_data/corpora/gutenberg/carroll-alice.txt')

In [133]:
alice = nltk.data.load('corpora/gutenberg/carroll-alice.txt', format='text')

In [134]:
print(alice[:543], "...")

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a ...


### Simplistic tokenization: white-space

In [135]:
print(alice.split()[:100])

["[Alice's", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865]', 'CHAPTER', 'I.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank,', 'and', 'of', 'having', 'nothing', 'to', 'do:', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading,', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it,', "'and", 'what', 'is', 'the', 'use', 'of', 'a', "book,'", 'thought', 'Alice', "'without", 'pictures', 'or', "conversation?'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(as', 'well', 'as', 'she', 'could,', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid),', 'whether', 'the', 'pleasure', 'of', 'making', 'a']


Look at the above result. Can you tell if this is a good strategy or not? Why?
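
One way to approach this question is to count how many white-space tokens still carry punctuation at their edges. The following is a minimal sketch using only the standard library; the sample is copied from the opening of the text above, and the helper name `has_edge_punctuation` is made up for this illustration.

```python
import string

# A short sample copied from the beginning of the text shown above.
sample = ("Alice was beginning to get very tired of sitting by her sister on the "
          "bank, and of having nothing to do: once or twice she had peeped into the "
          "book her sister was reading, but it had no pictures or conversations in "
          "it, 'and what is the use of a book,' thought Alice 'without pictures or "
          "conversation?'")


def has_edge_punctuation(token):
    """True if the token starts or ends with an ASCII punctuation character."""
    return token[0] in string.punctuation or token[-1] in string.punctuation


tokens = sample.split()
glued = [t for t in tokens if has_edge_punctuation(t)]

print(len(glued), "of", len(tokens), "tokens have punctuation attached:")
print(glued)
```

Every token in `glued` would be counted as a different word than its clean form (`'bank,'` versus `'bank'`), which is usually not what you want when building a vocabulary.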

### Standard tokenization: letters and numbers only

In [136]:
import re

In [137]:
tokenizer = re.compile(r'\w+', re.UNICODE)

Note that this pattern keeps only runs of word characters (letters, digits, and the underscore), so all punctuation, including dots, is discarded. The refined pattern further below instead leaves dots attached to the tokens. That way, the tokenizer output can be used as input for a sentence segmentation algorithm that decides whether a dot is used as an abbreviation marker (and stays attached) or as a sentence terminal (and should be split off as its own token).

In [138]:
print(tokenizer.findall(alice)[:100])

['Alice', 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', 'CHAPTER', 'I', 'Down', 'the', 'Rabbit', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', 'thought', 'Alice', 'without', 'pictures', 'or', 'conversation', 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', 'as', 'well', 'as', 'she', 'could', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', 'whether', 'the', 'pleasure', 'of']


This clearly looks much better: we have actual words and numbers, and this is indeed useful. But some problems remain:

- over-tokenization, e.g., a time string like "19:30", or, in the above example, the hyphen or the possessive s
- loss of information (punctuation)

Both problems can be reduced, to some extent, by allowing selected punctuation inside tokens (here ASCII dots and hyphens, plus a leading apostrophe) and by returning the remaining punctuation as separate tokens:

In [139]:
tokenizer = re.compile(r'(\'?[\w.-]+)', re.UNICODE)

In [140]:
def punctuation_tokenizer(text):
    tokens = tokenizer.split(text)
    return [t for t in map(str.strip, tokens) if t]

In [141]:
print(punctuation_tokenizer(alice)[:100])

['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'without", 'pictures', 'or', 'conversation', "?'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her']
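
The tokenizer above relies on a detail of `re.split`: when the pattern contains a capturing group, the matched text is returned together with the pieces between the matches. The sketch below replays this on a short, made-up string with an equivalent pattern, so that the intermediate result becomes visible.

```python
import re

# Equivalent to the pattern above: an optional leading apostrophe followed by
# word characters, dots, or hyphens. The parentheses form a capturing group.
tokenizer = re.compile(r"('?[\w.-]+)")

text = "Alice's book, 'and"

# With a capturing group, split() returns the matches (odd positions) as well
# as the text between the matches (even positions).
raw = tokenizer.split(text)
print(raw)
# ['', 'Alice', '', "'s", ' ', 'book', ', ', "'and", '']

# Stripping white-space and dropping empty strings leaves the tokens.
tokens = [t for t in map(str.strip, raw) if t]
print(tokens)
# ['Alice', "'s", 'book', ',', "'and"]
```

Whatever the pattern does not match (here the comma) survives as its own token, which is how the punctuation is preserved.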


### Advanced tokenization: NLTK and `segtok`

In [142]:
from nltk import word_tokenize as word_tokenizer

If you want to know how the NLTK tokenizer is implemented, here are the gory details:

In [143]:
#nltk.tokenize.treebank??

A similar tokenizer with modular functionality is provided in the `segtok` package:

In [144]:
from segtok.tokenizer import web_tokenizer
from segtok.tokenizer import split_contractions

In [145]:
print(alice[:471])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day


In [146]:
print(word_tokenizer(alice)[:100]) # nltk

['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'Alice', "'without", 'pictures', 'or', 'conversation', '?', "'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot']


In [147]:
print(split_contractions(web_tokenizer(alice))[:100]) # segtok

['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day']
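
Scanning two long token lists by eye is error-prone. As a sketch, the standard library's `difflib.SequenceMatcher` can align two tokenizations and report only the spans where they disagree. The two short lists are fragments copied from the NLTK and `segtok` outputs above; the variable names are chosen for this illustration.

```python
from difflib import SequenceMatcher

# Token fragments copied from the two outputs above.
nltk_tokens = ['a', 'book', ',', "'", 'thought', 'Alice', "'without", 'pictures']
segtok_tokens = ['a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures']

# Align both sequences and keep only the non-matching spans.
matcher = SequenceMatcher(None, nltk_tokens, segtok_tokens, autojunk=False)
differences = [
    (nltk_tokens[i1:i2], segtok_tokens[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != 'equal'
]

for left, right in differences:
    print(left, "vs.", right)
```

On this fragment, the two tokenizers disagree only on how quotation marks next to other punctuation or next to words are split.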


Overall, the results look pretty similar; the differences are in the details.
Here are some really hard examples to gauge the performance of a tokenizer:

In [148]:
example = """
Chemical formulas like "[Al₂(S₁O₄)₃]²⁻" or, dates as in 31/12/01, times like in 19:30
or IP addresses such as 192.168.0.1, units like 10 m³ or money as in US$ 10,000.00
and even names as in "Mr. Lewis Carroll" can all spell trouble. And, do not forget
those pesky web and email adresses: http://www.company.com/index.htm and first.last@company.com!
"""

In [149]:
print(punctuation_tokenizer(example))

['Chemical', 'formulas', 'like', '"[', 'Al₂', '(', 'S₁O₄', ')', '₃', ']', '²', '⁻"', 'or', ',', 'dates', 'as', 'in', '31', '/', '12', '/', '01', ',', 'times', 'like', 'in', '19', ':', '30', 'or', 'IP', 'addresses', 'such', 'as', '192.168.0.1', ',', 'units', 'like', '10', 'm³', 'or', 'money', 'as', 'in', 'US', '$', '10', ',', '000.00', 'and', 'even', 'names', 'as', 'in', '"', 'Mr.', 'Lewis', 'Carroll', '"', 'can', 'all', 'spell', 'trouble.', 'And', ',', 'do', 'not', 'forget', 'those', 'pesky', 'web', 'and', 'email', 'adresses', ':', 'http', '://', 'www.company.com', '/', 'index.htm', 'and', 'first.last', '@', 'company.com', '!']


In [150]:
print(nltk.word_tokenize(example))

['Chemical', 'formulas', 'like', '``', '[', 'Al₂', '(', 'S₁O₄', ')', '₃', ']', '²⁻', "''", 'or', ',', 'dates', 'as', 'in', '31/12/01', ',', 'times', 'like', 'in', '19:30', 'or', 'IP', 'addresses', 'such', 'as', '192.168.0.1', ',', 'units', 'like', '10', 'm³', 'or', 'money', 'as', 'in', 'US', '$', '10,000.00', 'and', 'even', 'names', 'as', 'in', '``', 'Mr.', 'Lewis', 'Carroll', "''", 'can', 'all', 'spell', 'trouble', '.', 'And', ',', 'do', 'not', 'forget', 'those', 'pesky', 'web', 'and', 'email', 'adresses', ':', 'http', ':', '//www.company.com/index.htm', 'and', 'first.last', '@', 'company.com', '!']


In [151]:
print(web_tokenizer(example))

['Chemical', 'formulas', 'like', '"[', 'Al₂', '(', 'S₁O₄', ')₃]²⁻"', 'or', ',', 'dates', 'as', 'in', '31', '/', '12', '/', '01', ',', 'times', 'like', 'in', '19:30', 'or', 'IP', 'addresses', 'such', 'as', '192.168.0.1', ',', 'units', 'like', '10', 'm³', 'or', 'money', 'as', 'in', 'US', '$', '10,000.00', 'and', 'even', 'names', 'as', 'in', '"', 'Mr.', 'Lewis', 'Carroll', '"', 'can', 'all', 'spell', 'trouble.', 'And', ',', 'do', 'not', 'forget', 'those', 'pesky', 'web', 'and', 'email', 'adresses', ':', 'http://www.company.com/index.htm', 'and', 'first.last', '@', 'company.com', '!']


As you can see, even our basic tokenizer isn't all that bad, but you get more tokens than you might want.
If there is no other reason to prefer one tokenizer over another, you can use their speed as another data point to make a decision:

In [152]:
%timeit punctuation_tokenizer(example)

10000 loops, best of 3: 63.9 µs per loop


In [153]:
%timeit nltk.word_tokenize(example)

1000 loops, best of 3: 788 µs per loop


In [154]:
%timeit web_tokenizer(example)

1000 loops, best of 3: 500 µs per loop


In other words, NLTK and `segtok` are about equally fast; but if all you need is speed and you can live with over-tokenization, our basic regular expression tokenizer is roughly an order of magnitude faster than either.
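
The `%timeit` magic only exists inside IPython and Jupyter. As a rough sketch, a similar "best of 3" measurement can be made in a plain Python script with the standard `timeit` module. The regular expression tokenizer is redefined here so the block is self-contained, the sample sentence is a shortened stand-in for the example above, and the absolute numbers will depend on your machine.

```python
import re
import timeit

tokenizer = re.compile(r"('?[\w.-]+)")


def punctuation_tokenizer(text):
    tokens = tokenizer.split(text)
    return [t for t in map(str.strip, tokens) if t]


# Shortened stand-in for the example text used above.
example = "Dates as in 31/12/01, times like in 19:30 or IP addresses such as 192.168.0.1."

# Run three rounds of 1000 calls each and keep the fastest round.
loops = 1000
best = min(timeit.repeat(lambda: punctuation_tokenizer(example),
                         number=loops, repeat=3))
print("best of 3: {:.1f} µs per loop".format(best / loops * 1e6))
```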

---

## Stemming words

NLTK provides you with many different stemming algorithms; here we will only explore one as an example, but have a look at the NLTK API to find more.

In [155]:
nltk.stem.SnowballStemmer?

In [156]:
stemmer = nltk.stem.SnowballStemmer("english")

In [157]:
alice_stemmed = [stemmer.stem(t) for t in segtok.tokenizer.word_tokenizer(alice)]

In [158]:
print(alice_stemmed[:100])

['[', 'alic', 'adventur', 'in', 'wonderland', 'by', 'lewi', 'carrol', '1865', ']', 'chapter', 'i.', 'down', 'the', 'rabbit-hol', 'alic', 'was', 'begin', 'to', 'get', 'veri', 'tire', 'of', 'sit', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'have', 'noth', 'to', 'do', ':', 'onc', 'or', 'twice', 'she', 'had', 'peep', 'into', 'the', 'book', 'her', 'sister', 'was', 'read', ',', 'but', 'it', 'had', 'no', 'pictur', 'or', 'convers', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'alic', "'", 'without', 'pictur', 'or', 'convers', "?'", 'so', 'she', 'was', 'consid', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made']
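
Stems such as `pictur` or `convers` are not words. They only serve as shared keys under which the different inflected forms of a word are collected. The following toy suffix stripper illustrates that idea with the standard library only. Note that it is a hypothetical, heavily simplified stand-in and NOT the Snowball algorithm; the suffix list and the names `toy_stem` and `min_stem` are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical toy rules for illustration only; this is NOT the Snowball algorithm.
# Longer suffixes come first so that they are tried before their shorter variants.
SUFFIXES = ("ations", "ation", "ing", "es", "ed", "e", "s")


def toy_stem(token, min_stem=3):
    """Lower-case the token and strip the first matching suffix,
    provided that at least min_stem characters remain."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


words = ["Conversation", "conversations", "picture", "pictures", "peeped", "was"]

# Group the surface forms by their stem.
groups = defaultdict(list)
for w in words:
    groups[toy_stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
```

A real stemmer applies many more rules and conditions, which is why its output for other words (for example `having`) will differ from what these few rules produce.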


---

## Sentence segmentation

NLTK's current default sentence splitter (available as `nltk.sent_tokenize`) is an implementation of the [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt) with the properties discussed in class.

In [159]:
print("\n\n".join(nltk.sent_tokenize(alice)[26:33]))

How funny it'll seem to come out among the people that walk with
their heads downward!

The Antipathies, I think--' (she was rather glad
there WAS no one listening, this time, as it didn't sound at all the
right word) '--but I shall have to ask them what the name of the country
is, you know.

Please, Ma'am, is this New Zealand or Australia?'

(and
she tried to curtsey as she spoke--fancy CURTSEYING as you're falling
through the air!

Do you think you could manage it?)

'And what an
ignorant little girl she'll think me for asking!

No, it'll never do to
ask: perhaps I shall see it written up somewhere.'


This is the same sentence segmenter (the Punkt model for English) that can also be loaded explicitly like this:

In [160]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

On the other hand, the `segtok` package provides a rule-based sentence splitter.

In [161]:
from segtok.segmenter import split_multi

In [162]:
#segtok.segmenter?

In [163]:
#split_multi??

In [164]:
print("\n\n".join(list(split_multi(alice))[28:35]))

How funny it'll seem to come out among the people that walk with
their heads downward!

The Antipathies, I think--' (she was rather glad
there WAS no one listening, this time, as it didn't sound at all the
right word) '--but I shall have to ask them what the name of the country
is, you know.

Please, Ma'am, is this New Zealand or Australia?'

(and
she tried to curtsey as she spoke--fancy CURTSEYING as you're falling
through the air! Do you think you could manage it?)

'And what an
ignorant little girl she'll think me for asking!

No, it'll never do to
ask: perhaps I shall see it written up somewhere.'

Down, down, down.


In [165]:
tricky_stuff = """
Species mentions like S. lividans or H. sapiens and Mr. Name is easy.
Periods in Mr. Smith and Johann S. Bach do not mark boundaries.
i is a good variable name.
¡Good sentence segmentation is hard!
(1) First things go here.
(2) And second goes here.
(3) Last, but not least.
1. This is one.
2. And that is two.
3. Finally, three, too.
It is expected, on the basis of (Olmsted, M. C., C. F. Anderson, and
M. T. Record, Jr. 1989. Proc. Natl. Acad. Sci. USA. 100:100), to decrease sharply.
And there are many other things... Would you not think?
(How does it deal with this parenthesis?)
"It should be part of the previous sentence."
"(And the same with this one.)"
('And this one!')
"('(And (this)) '?)"
[(and this. )]
That's it, folks!
"""

In [166]:
print("\n".join(nltk.sent_tokenize(tricky_stuff.replace('\n', ' '))))

 Species mentions like S. lividans or H. sapiens and Mr. Name is easy.
Periods in Mr. Smith and Johann S. Bach do not mark boundaries.
i is a good variable name.
¡Good sentence segmentation is hard!
(1) First things go here.
(2) And second goes here.
(3) Last, but not least.
1.
This is one.
2.
And that is two.
3.
Finally, three, too.
It is expected, on the basis of (Olmsted, M. C., C. F. Anderson, and M. T. Record, Jr. 1989.
Proc.
Natl.
Acad.
Sci.
USA.
100:100), to decrease sharply.
And there are many other things... Would you not think?
(How does it deal with this parenthesis?)
"It should be part of the previous sentence."
"(And the same with this one.)"
('And this one!')
"('(And (this)) '?)"
[(and this. )]
That's it, folks! 


In [167]:
print("\n".join(split_multi(tricky_stuff.replace('\n', ' '))))

Species mentions like S. lividans or H. sapiens and Mr. Name is easy.
Periods in Mr. Smith and Johann S.
Bach do not mark boundaries.
i is a good variable name.
¡Good sentence segmentation is hard!
(1) First things go here.
(2) And second goes here.
(3) Last, but not least.
1. This is one.
2. And that is two.
3. Finally, three, too.
It is expected, on the basis of (Olmsted, M. C., C. F. Anderson, and M. T. Record, Jr. 1989. Proc. Natl. Acad. Sci. USA. 100:100), to decrease sharply.
And there are many other things...
Would you not think?
(How does it deal with this parenthesis?)
"It should be part of the previous sentence."
"(And the same with this one.)" ('And this one!')
"('(And (this)) '?)" [(and this. )] That's it, folks!



As a general rule of thumb: If you have an orthographically correct text (scientific publications, books, etc.), it probably pays off to develop a rule-based segmenter. If not, or if you lack the time to develop good rules, the Punkt algorithm gets you pretty close to a very good result with very little work (assuming you have a ready-made implementation available, like the one from NLTK used here). Plus, Punkt might pick up some special cases in your corpus that the rules did not catch (like "Johann S. Bach" above). The main downside of Punkt is that it usually comes at the cost of **under-splitting**. That *might* be an issue in combination with subsequent deep parsing, because the parser's runtime can grow exponentially with the length of the sentence (number of tokens) being analyzed, and the parser will typically introduce errors on those overly long sentences.
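
To make the idea of a rule-based segmenter more concrete, here is a minimal sketch using only the standard library. It is not how `segtok` works internally; the tiny abbreviation list, the two patterns, and the function name `rule_based_split` are all assumptions made for this illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for this illustration.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "proc.", "natl.", "acad.", "sci."}

# Candidate boundary: white-space that directly follows a sentence terminal.
CANDIDATE = re.compile(r"(?<=[.!?])\s+")
# A single capital letter followed by a dot, like the "S." in "Johann S. Bach".
INITIAL = re.compile(r"[A-Z]\.")


def rule_based_split(text):
    """Split text at sentence terminals, except after abbreviations and initials."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        before = text[start:match.start()]
        last_token = before.split()[-1]
        if last_token.lower() in ABBREVIATIONS or INITIAL.fullmatch(last_token):
            continue  # the dot belongs to the token; not a sentence boundary
        sentences.append(before)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever remains after the last boundary
    return sentences


text = ("Periods in Mr. Smith and Johann S. Bach do not mark boundaries. "
        "i is a good variable name. That's it, folks!")
for sentence in rule_based_split(text):
    print(sentence)
```

Rules like these are cheap to write and easy to debug, but each one has blind spots: this sketch will never split after a sentence that really does end in an initial or a listed abbreviation, which is exactly the kind of case where a rule set needs continued refinement.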