# Sentence segmentation

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2017 Florian Leitner. All rights reserved.

## Setup

In [1]:
import segtok
import nltk
import spacy

(If any of the above imports failed, please install the missing Python modules with `pip3 install MODULE_NAME`.)

To use `spacy`, you also need to download at least the English models. To do that, run the following command in your terminal:

```shell
python3 -m spacy download en
```

As for NLTK's data, we already saw how to download that yesterday.

## Introduction

NLTK's current default sentence splitter (available as `nltk.sent_tokenize`) is an implementation of the **unsupervised** [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt) with the properties discussed in class (day 2, first three slides).

The other two solutions we'll look at are a **supervised** model that learns to split sentences from pre-split text (SpaCy), and a **rule-based** model that your instructor maintains (`segtok`).
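To get a feeling for why sentence splitting needs more than a one-line rule, here is a deliberately naive rule-based splitter. It is only an illustrative sketch using the standard library and is not how `segtok`, Punkt, or SpaCy work internally; the names `NAIVE_BOUNDARY` and `naive_split` are made up for this example.

```python
import re

# Hypothetical toy rule: a sentence ends at ".", "!" or "?" when the
# next non-space character is an uppercase ASCII letter.
NAIVE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def naive_split(text):
    """Split text at every position matched by the toy boundary rule."""
    return NAIVE_BOUNDARY.split(text)


sample = ("You know Mr. Abbreviation I told you so. "
          "What about the med. staff here?")

for piece in naive_split(sample):
    print(piece)
```

The rule wrongly breaks after "Mr." because a capitalized word follows, while "med. staff" survives only because "staff" happens to be lowercase. Cases like these are what the three approaches above try to handle more gracefully.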

First, pre-load everything necessary for sentence splitting by each module/approach:

In [2]:
from segtok.segmenter import split_multi, split_single
from nltk import sent_tokenize
# import spacy

spacy_en = spacy.load('en')

Second, define a list of sentences that are known to be hard to split and see how the segmenters perform on those cases.

In [3]:
tricky_stuff = """One sentence per line.
And another sentence on the same line.
(How about a sentence in parenthesis?)
Or a sentence with "a quote!"
'How about those pesky single quotes?'
[And not to forget about square brackets.]
And, brackets before the terminal [2].
You know Mr. Abbreviation I told you so.
What about the med. staff here?
But the undef. abbreviation not.
And this f.e. is tricky stuff.
I.e. a little easier here.
However, e.g., should be really easy.
Three is one btw., is clear.
Their presence was detected by transformation into S. lividans.
Three subjects diagnosed as having something.
What the heck??!?!
(A) First things here.
(1) No, they go here.
[z] Last, but not least.
(vii) And the Romans, too.
Let's meet at 14.10 in N.Y..
This happened in the U.S. last week.
Brexit: The E.U. and the U.K. are separating.
Refugees are welcome in the E.U..
The U.S. Air Force was called in.
What about the E.U. High Court?
And then there is the U.K. House of Commons.
Now only this splits: the EU.
A sentence ending in U.S.
Another that won't split.
12 monkeys ran into here.
In the Big City.
How he got an A.
Mathematics . dot times.
An abbreviation at the fin..
This is a sentence terminal ellipsis...
This is another sentence terminal ellipsis....
An easy to handle G. species mention.
Am 13. Jän. 2006 war es regnerisch.
The basis for Lester B. Pearson's initials was developed later.
This model was introduced by Dr. Edgar F. Codd after criticisms.
This quote "He said it." is actually inside.
A. The first assumption.
B. The second bullet.
C. The last case.
1. This is one.
2. And that is two.
3. Finally, three, too.
Always last, a simple final sentence example."""

input_text = tricky_stuff.replace('\n', ' ')
expected_sentences = tricky_stuff.split('\n')
print("n. sentences =", len(expected_sentences))

n. sentences = 50
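Comparing sentence counts alone can hide errors, because one over-split and one under-split cancel each other out. A minimal sketch of a stricter check is to count how many expected sentences show up verbatim in a segmenter's output. The helper name `count_recovered` and the tiny hand-written lists below are assumptions for illustration only; they are not real output of any of the three libraries.

```python
def count_recovered(expected, predicted):
    """Count the expected sentences that appear verbatim in the prediction.

    Items are converted with str() so that objects such as SpaCy spans
    can be passed in directly.
    """
    predicted_set = {str(sentence).strip() for sentence in predicted}
    return sum(1 for sentence in expected
               if str(sentence).strip() in predicted_set)


# Hand-written toy data, not produced by a real segmenter.
toy_expected = ["You know Mr. Abbreviation I told you so.",
                "What the heck??!?!"]
toy_predicted = ["You know Mr.",
                 "Abbreviation I told you so.",
                 "What the heck??!?!"]

print(count_recovered(toy_expected, toy_predicted),
      "of", len(toy_expected), "sentences recovered")
```

With the variables defined above, the same helper could be called as `count_recovered(expected_sentences, sent_tokenize(input_text))`, and likewise with the output of the other two segmenters.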


## With NLTK

In [4]:
from nltk import sent_tokenize

for sentence in sent_tokenize(input_text):
    print(sentence)
    
print("\nn. sentences =",
      len(sent_tokenize(input_text)))

One sentence per line.
And another sentence on the same line.
(How about a sentence in parenthesis?)
Or a sentence with "a quote!"
'How about those pesky single quotes?'
[And not to forget about square brackets.]
And, brackets before the terminal [2].
You know Mr.
Abbreviation I told you so.
What about the med.
staff here?
But the undef.
abbreviation not.
And this f.e.
is tricky stuff.
I.e.
a little easier here.
However, e.g., should be really easy.
Three is one btw., is clear.
Their presence was detected by transformation into S. lividans.
Three subjects diagnosed as having something.
What the heck??!?!
(A) First things here.
(1) No, they go here.
[z] Last, but not least.
(vii) And the Romans, too.
Let's meet at 14.10 in N.Y..
This happened in the U.S. last week.
Brexit: The E.U.
and the U.K. are separating.
Refugees are welcome in the E.U..
The U.S. Air Force was called in.
What about the E.U.
High Court?
And then there is the U.K. House of Commons.
Now only this splits: the EU.
A se

## With SpaCy

In [5]:
spacy_doc = spacy_en(input_text)

for sentence in spacy_doc.sents:
    print(sentence)

print("\nn. sentences =",
      len(list(spacy_doc.sents)))

One sentence per line.
And another sentence on the same line.
(How about a sentence in parenthesis?)
Or a sentence with "a quote!" 'How about those pesky single quotes?' [And not to forget about square brackets.]
And, brackets before the terminal [2]. You know
Mr. Abbreviation I told you so.
What about the med.
staff here?
But the undef.
abbreviation not.
And this f.e. is tricky stuff.
I.e. a little easier here.
However, e.g., should be really easy.
Three is one btw.
, is clear.
Their presence was detected by transformation into S. lividans.
Three subjects diagnosed as having something.
What the heck??!?!
(A) First things here.
(1) No, they go here.
[z] Last, but not least.
(vii) And the Romans, too.
Let's meet at 14.10 in N.Y..
This happened in the U.S. last week.
Brexit: The E.U. and the U.K. are separating.
Refugees are welcome in the E.U..
The U.S. Air Force was called in.
What about the E.U. High Court?
And then there is the U.K. House of Commons.
Now only this splits: the EU.
A s

## With `segtok`

In [6]:
for sentence in split_multi(input_text):
    print(sentence)

print("\nn. sentences =",
      len(list(split_multi(input_text))))

One sentence per line.
And another sentence on the same line.
(How about a sentence in parenthesis?)
Or a sentence with "a quote!"
'How about those pesky single quotes?'
[And not to forget about square brackets.]
And, brackets before the terminal [2].
You know Mr. Abbreviation I told you so.
What about the med. staff here?
But the undef.
abbreviation not.
And this f.e. is tricky stuff.
I.e. a little easier here.
However, e.g., should be really easy.
Three is one btw., is clear.
Their presence was detected by transformation into S. lividans.
Three subjects diagnosed as having something.
What the heck??!?!
(A) First things here.
(1) No, they go here.
[z] Last, but not least.
(vii) And the Romans, too.
Let's meet at 14.10 in N.Y..
This happened in the U.S. last week.
Brexit: The E.U. and the U.K. are separating.
Refugees are welcome in the E.U..
The U.S. Air Force was called in.
What about the E.U. High Court?
And then there is the U.K. House of Commons.
Now only this splits: the EU.
A se

Mostly, sentence segmenters tend to over-split (abbreviations, enumerations, European dates with dots), but they also under-split, e.g., inside quotes or after sentences ending with single letters (probably "mistaken" for initials). As most language processing systems tend to work on the sentence level (because it is easier to handle), this chronic over-splitting may harm the baseline performance of your system. However, as always: before jumping at `segtok` for your sentence segmentation needs, test your assumptions on a sample corpus of the text you want to process!
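One way to run such a test on your own sample is to count over-splits and under-splits separately. The sketch below compares sentence boundaries as offsets counted in non-whitespace characters, so differences in spacing between the gold standard and the segmenter output do not matter. It assumes the segmenter returns all of the input text, only cut at different places; the function names and the toy data are made up for illustration and are not real segmenter output.

```python
def boundaries(sentences):
    """Return the sentence-end offsets, counted in non-whitespace characters."""
    offsets, position = set(), 0
    for sentence in sentences:
        position += sum(1 for ch in str(sentence) if not ch.isspace())
        offsets.add(position)
    return offsets


def split_errors(expected, predicted):
    """Count spurious (over-split) and missed (under-split) boundaries."""
    gold, guess = boundaries(expected), boundaries(predicted)
    return {"over": len(guess - gold), "under": len(gold - guess)}


# Hand-written toy data: one spurious split after "med." and one missed
# split after "A.".
gold_sentences = ["What about the med. staff here?",
                  "How he got an A.",
                  "In the Big City."]
guessed_sentences = ["What about the med.",
                     "staff here?",
                     "How he got an A. In the Big City."]

print(split_errors(gold_sentences, guessed_sentences))
```

Running `split_errors` with a hand-segmented sample of your own corpus as `expected` and each segmenter's output as `predicted` shows which kind of error dominates for your kind of text.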