In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [2]:
# From spaCy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


### `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*: it yields sentences one at a time rather than returning an indexable sequence. This means that, where you could print the second Doc token with `print(doc[1])`, you can't get the "second Doc sentence" with `print(doc.sents[1])`.

We can grab tokens from the doc.

In [3]:
doc[0]

This

We can't grab sentences from the doc by index.

In [4]:
doc.sents[0]

TypeError: 'generator' object is not subscriptable

To access individual sentences, first convert `doc.sents` to a list, then index into that list.

In [5]:
list(doc.sents)

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

In [7]:
type(list(doc.sents)[0])

spacy.tokens.span.Span
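
The same behaviour can be reproduced without spaCy. The sketch below uses a hypothetical `sentences()` generator as a stand-in for `doc.sents` (it is not part of spaCy) to show why indexing fails, how `list()` fixes it, and how `itertools.islice` can fetch one item without building the whole list.

```python
from itertools import islice

def sentences():
    """Hypothetical stand-in for doc.sents: yields items one at a time."""
    yield "This is the first sentence."
    yield "This is another sentence."
    yield "This is the last sentence."

# A generator has no __getitem__, so indexing raises TypeError.
gen = sentences()
try:
    gen[0]
    subscriptable = True
except TypeError:
    subscriptable = False

# Materialising the generator gives an ordinary, indexable list.
as_list = list(sentences())
second = as_list[1]

# islice pulls a single item lazily, without storing the others.
first_only = next(islice(sentences(), 0, 1))

print(subscriptable)
print(second)
print(first_only)
```

For long documents, `islice` avoids holding every sentence in memory when only one of the first few is needed.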

In [11]:
doc = nlp(u'"Management is doing right things; leadership is doing the right things." -Peter Drucker')

In [12]:
doc.text

'"Management is doing right things; leadership is doing the right things." -Peter Drucker'

In [14]:
for sent in doc.sents:
    print(sent)
    print('\n')

"Management is doing right things; leadership is doing the right things."


-Peter Drucker




# Adding a new rule to the pipeline

***There are 2 ways:***
    
1.) `Add a segmentation rule`: add an extra boundary to segment on, alongside the default rules.

2.) `Change segmentation rule`: replace the default rules entirely.

### 1.) Add a Segmentation Rule:

In [20]:
def set_custom_boundries(doc):
    '''Every token in the doc retains its index position'''
    for token in doc:
        print(token.i,"------------>",token)

In [21]:
set_custom_boundries(doc)

0 ------------> "
1 ------------> Management
2 ------------> is
3 ------------> doing
4 ------------> right
5 ------------> things
6 ------------> ;
7 ------------> leadership
8 ------------> is
9 ------------> doing
10 ------------> the
11 ------------> right
12 ------------> things
13 ------------> .
14 ------------> "
15 ------------> -Peter
16 ------------> Drucker


These are the index positions of the tokens.

In [22]:
def set_custom_boundries(doc):
    '''Every token in the doc retains its index position'''
    for token in doc[:-1]:
        print(token.i,"------------>",token)
        
set_custom_boundries(doc)

0 ------------> "
1 ------------> Management
2 ------------> is
3 ------------> doing
4 ------------> right
5 ------------> things
6 ------------> ;
7 ------------> leadership
8 ------------> is
9 ------------> doing
10 ------------> the
11 ------------> right
12 ------------> things
13 ------------> .
14 ------------> "
15 ------------> -Peter


The slice `doc[:-1]` covers every token from index 0 up to, but not including, the last one.
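
The reason for stopping one token short shows up as soon as the loop looks one position ahead. A minimal sketch on a plain Python list (the `tokens` list here is an illustrative assumption, not a spaCy `Doc`):

```python
tokens = ['"', "Management", "is", "doing", "right", "things", ";", "leadership"]

# Visiting every index and looking one ahead fails on the last item.
try:
    for i in range(len(tokens)):
        _ = tokens[i + 1]
    overran = False
except IndexError:
    overran = True

# Stopping one short keeps every look-ahead in range.
pairs = [(tok, tokens[i + 1]) for i, tok in enumerate(tokens[:-1])]

print(overran)
print(pairs[-1])
```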

In [23]:
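def set_custom_boundries(doc):
    '''Mark the token that follows each semicolon as the start of a new sentence'''
    # Skip the last token so that doc[token.i+1] never runs past the end
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

# spaCy v2 API: the function itself is passed to add_pipe.
# It runs before the parser so the parser respects the boundaries set here.
nlp.add_pipe(set_custom_boundries, before='parser')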
        
nlp.pipe_names

['tagger', 'set_custom_boundries', 'parser', 'ner']

In [24]:
doc4 = nlp(u'"Management is doing right things; leadership is doing the right things." -Peter Drucker')

In [25]:
for sent in doc4.sents:
    print(sent)

"Management is doing right things;
leadership is doing the right things."
-Peter Drucker


Here the new rule is applied: the token after each ";" now starts a new sentence, in addition to the boundaries the parser finds by default.
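
To see what the rule itself contributes, here is a library-free sketch of the same idea on a plain list of strings. The helper names `mark_starts_after` and `group_by_starts` are hypothetical and only model the custom rule; the additional split before `-Peter` in the output above comes from spaCy's parser and is not reproduced here.

```python
tokens = ['"', "Management", "is", "doing", "right", "things", ";",
          "leadership", "is", "doing", "the", "right", "things", ".", '"',
          "-Peter", "Drucker"]

def mark_starts_after(tokens, marker=";"):
    """Return one boolean per token: True where a new sentence starts."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    # Skip the last token so that i + 1 stays in range.
    for i, tok in enumerate(tokens[:-1]):
        if tok == marker:
            starts[i + 1] = True
    return starts

def group_by_starts(tokens, starts):
    """Group tokens into sentences, opening a new one at each True flag."""
    sentences = []
    for tok, is_start in zip(tokens, starts):
        if is_start:
            sentences.append([])
        sentences[-1].append(tok)
    return sentences

starts = mark_starts_after(tokens)
sentences = group_by_starts(tokens, starts)

for sent in sentences:
    print(" ".join(sent))
```

The flags play the role of `is_sent_start`: the rule only marks where sentences begin, and the grouping step turns those marks into spans.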

<br>
___________________________________________________________________________________________________________________________

***What happens if we want to change the rules completely?***

In some cases we want to replace spaCy's default sentence segmenter with our own set of rules.

## Change segmentation rule

In [26]:
# reset to the original default behaviour

nlp = spacy.load('en_core_web_sm')

In [29]:
my_string = u"This is a sentence. This is another. \n\nThis is a \nThirds sentence."

print(my_string)

This is a sentence. This is another. 

This is a 
Thirds sentence.


For a text dataset such as poetry, line breaks like these matter more than full stops (.), and you may want to define the line breaks themselves as the end of a sentence instead of the classical sentence ending, i.e. <b>' . '</b>.

In [30]:
doc = nlp(my_string)


for sen in doc.sents:
    print(sen)

This is a sentence.
This is another. 


This is a 
Thirds sentence.


The above is the default behaviour,

but we actually want <b>\n</b> to mark the end of a sentence.

In [31]:
from spacy.pipeline import SentenceSegmenter  # spaCy v2 API

def split_on_newlines(doc):
    '''Yield one span per sentence, treating newline tokens as sentence ends.'''
    start = 0
    seen_new_line = False

    for word in doc:
        if seen_new_line:
            # The previous token was a newline: yield everything from start
            # up to (not including) the current token, then begin a new span here.
            yield doc[start: word.i]
            start = word.i
            seen_new_line = False   # reset seen_new_line

        elif word.text.startswith('\n'):
            seen_new_line = True

    # Yield whatever remains after the last newline.
    yield doc[start:]

In [32]:
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)

nlp.add_pipe(sbd)

In [33]:
doc = nlp(my_string)

for sen in doc.sents:
    print(sen)

This is a sentence. This is another. 


This is a 

Thirds sentence.
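
The generator logic can be traced step by step without spaCy. The sketch below applies the same strategy to a plain list of strings; the function name `split_tokens_on_newlines` and the hand-written token list are assumptions for illustration, chosen to mirror how the string above would roughly be tokenized.

```python
def split_tokens_on_newlines(tokens):
    """Yield slices of tokens, ending each slice after a newline token."""
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline:
            # Previous token was a newline: close the current slice here.
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif tok.startswith("\n"):
            seen_newline = True
    # Whatever is left after the last newline forms the final slice.
    yield tokens[start:]

tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", ".",
          "\n\n", "This", "is", "a", "\n", "Thirds", "sentence", "."]

spans = list(split_tokens_on_newlines(tokens))
for span in spans:
    print(span)
```

Each newline token stays at the end of the slice it closes, which matches the blank lines printed after the first two sentences in the output above.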


In [2]:
mylst = [2,'apple']

mylst[1] = 'orange'

In [3]:
mylst

[2, 'orange']

In [6]:
def abc(n=1):
    if n>3:
        return
    print(n)
    abc(n+1)   # the recursive call belongs inside the function body

In [5]:
abc(3)

3
