# Sentence Segmentation
<div class='alert alert-info' style="margin:20px">In spaCy Basics we saw briefly how Doc objects are divided into sentences.
In this notebook we'll learn how sentence segmentation works, and how to set our own segmentation rules to break up docs into sentences.</div>

In [1]:
# Import spacy and load language library
import spacy
import en_core_web_sm
nlp=en_core_web_sm.load()


In [2]:
doc=nlp(u"This is the first sentence. This is the second sentence. This is the last sentence.")

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is the second sentence.
This is the last sentence.


### `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*: it yields sentence spans one at a time instead of returning a list. This means that, where you can print the fourth Doc token with `print(doc[3])`, you can't get the "third Doc sentence" with `print(doc.sents[2])`:

In [3]:
print(doc[3])

first


In [4]:
print(doc.sents[2])

TypeError: 'generator' object is not subscriptable
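
Two common ways around this are to build a list from the generator once, or to step through it to the position you need. The sketch below uses a plain Python generator as a stand-in for `doc.sents`; `nth_item` and `fake_sents` are hypothetical names introduced only for illustration.

```python
from itertools import islice

def nth_item(iterable, n):
    """Return the item at 0-based position n of any iterable, or None if it is too short."""
    return next(islice(iterable, n, None), None)

# Stand-in generator playing the role of doc.sents (hypothetical data)
def fake_sents():
    yield "This is the first sentence."
    yield "This is the second sentence."
    yield "This is the last sentence."

# Step through the generator to the third item without building a list
print(nth_item(fake_sents(), 2))

# Or materialize everything once, then index as usual
sentences = list(fake_sents())
print(sentences[1])
```

With a real Doc, `list(doc.sents)[2]` or `nth_item(doc.sents, 2)` should behave the same way, since both only rely on `doc.sents` being iterable.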

## Adding Rules
By default, spaCy's pretrained pipelines take sentence boundaries from the dependency parse (the separate, rule-based `sentencizer` component uses end-of-sentence punctuation instead). We can add rules of our own, but they have to be added to the pipeline *before* the Doc object is created, since the `is_sent_start` value of each token is set while the text is processed:

In [5]:
doc=nlp(u"This is the sentence i'm going to show.This is the second sentence.Last Samurai.")
for token in doc:
    print(token.is_sent_start, "-" + token.text)

True -This
False -is
False -the
False -sentence
False -i
False -'m
False -going
False -to
False -show
False -.
True -This
False -is
False -the
False -second
False -sentence
False -.
True -Last
False -Samurai
False -.
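
To see how these flags relate to the sentences yielded by `doc.sents`, here is a minimal sketch that groups token texts by their `is_sent_start` value. It works on plain `(flag, text)` pairs copied from the output above, not on a real Doc, and `group_by_sent_start` is a hypothetical helper, not spaCy's own implementation.

```python
def group_by_sent_start(pairs):
    """Group (is_sent_start, text) pairs into lists of token texts, one list per sentence."""
    sentences = []
    for is_start, text in pairs:
        # Open a new group on a sentence-start flag (or for the very first token)
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences

# The last two sentences from the output above, as (flag, text) pairs
pairs = [
    (True, "This"), (False, "is"), (False, "the"), (False, "second"),
    (False, "sentence"), (False, "."),
    (True, "Last"), (False, "Samurai"), (False, "."),
]

for tokens in group_by_sent_start(pairs):
    print(tokens)
```

Each `True` flag opens a new group, which is why changing a single token's flag is enough to split a sentence in two.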


In [6]:
for token in doc:
    print(token.i, token.text)

0 This
1 is
2 the
3 sentence
4 i
5 'm
6 going
7 to
8 show
9 .
10 This
11 is
12 the
13 second
14 sentence
15 .
16 Last
17 Samurai
18 .


In [7]:
# Default spaCy segmentation, before adding any custom rule
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc3.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


In [8]:
doc3[:-1]

"Management is doing things right; leadership is doing the right things." -Peter

In [10]:
from spacy.language import Language

# Register a custom pipeline component that adds a sentence-boundary rule
@Language.component('set_custom_boundaries')
def custom_boundaries(doc):
    # Skip the last token: nothing follows it
    for token in doc[:-1]:
        if token.text == ';':
            # The token after a semicolon starts a new sentence
            doc[token.i + 1].is_sent_start = True
    return doc

In [11]:
nlp.add_pipe('set_custom_boundaries',before='parser')
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [12]:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker
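
The same loop generalizes to more than one delimiter. The sketch below isolates that logic from spaCy so it can be run on its own: `make_boundary_setter` and `FakeToken` are hypothetical names, and `FakeToken` is only a stand-in exposing the three attributes the loop touches (`i`, `text`, `is_sent_start`).

```python
from dataclasses import dataclass

def make_boundary_setter(delimiters):
    """Build a function that starts a new sentence after any token whose text is in `delimiters`."""
    delimiters = frozenset(delimiters)

    def set_boundaries(doc):
        for token in doc[:-1]:  # skip the last token: nothing follows it
            if token.text in delimiters:
                doc[token.i + 1].is_sent_start = True
        return doc

    return set_boundaries

@dataclass
class FakeToken:  # hypothetical stand-in for a spaCy token
    i: int
    text: str
    is_sent_start: bool = False

words = ["right", ";", "leadership", "is", "doing", ":", "things"]
tokens = [FakeToken(i, w) for i, w in enumerate(words)]

# Mark the token after every ";" or ":" as a sentence start
make_boundary_setter({";", ":"})(tokens)
print([t.text for t in tokens if t.is_sent_start])
```

To use such a function in a real pipeline, it would still have to be registered under its own name with `Language.component` and added before the parser, as in the cells above.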


In [13]:
# The new rule doesn't apply to the older Doc, which was created before the rule was added
for sent in doc3.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


### Why not change the token directly?
Why not simply set the `.is_sent_start` value to `True` on existing tokens?

In [14]:
# Find the token we want to change
doc3[7]

leadership

In [15]:
doc3[7].is_sent_start=True

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

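<font color='blue'>spaCy refuses to change a token's sentence-start value after the document is parsed, to prevent inconsistencies in the data. To apply a new rule, add it to the pipeline and create a new Doc from the text.</font>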