## Sentence Segmentation / Sentence Boundary Detection

- Detection of where sentences begin and end.
- Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences.
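
To see why this is a disambiguation problem and not a simple split, here is a minimal sketch of a deliberately naive rule: break after `.`, `!` or `?` when followed by whitespace. The function name `naive_split` and the second example sentence are illustrative assumptions, not part of spaCy.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Works when every period really ends a sentence.
print(naive_split("What is good. What is bad."))

# Fails on abbreviations: the period in "Dr." is treated as a sentence end.
print(naive_split("Dr. Smith arrived. He was late."))
```

The second call splits "Dr." off as its own sentence, which is the kind of ambiguity a trained model or a hand-written boundary rule has to resolve.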

### Pipeline


- When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.



![pipeline-7a14d4edd18f3edfee8f34393bff2992.svg](attachment:pipeline-7a14d4edd18f3edfee8f34393bff2992.svg)
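
The idea of "each component returns the processed Doc, which is passed to the next component" can be sketched without spaCy. The toy below uses a plain dictionary as a stand-in for the Doc; `toy_tagger`, `toy_counter` and `run_pipeline` are hypothetical names for illustration only and do not reflect how spaCy is implemented.

```python
def toy_tagger(doc):
    # Label each token as a word or punctuation (a stand-in for a real tagger).
    doc["tags"] = ["WORD" if tok.isalnum() else "PUNCT" for tok in doc["tokens"]]
    return doc

def toy_counter(doc):
    # Relies on the tags added by the previous component.
    doc["n_words"] = doc["tags"].count("WORD")
    return doc

def run_pipeline(text, components):
    # Tokenize first (here: a naive whitespace split), then apply each component in order.
    doc = {"text": text, "tokens": text.split()}
    for component in components:
        doc = component(doc)
    return doc

result = run_pipeline("Google LLC is a company .", [toy_tagger, toy_counter])
print(result["tags"])
print(result["n_words"])
```

Order matters here for the same reason it does in a real pipeline: `toy_counter` needs the annotations that `toy_tagger` adds.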


In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [2]:
doc = nlp("Google LLC is an American multinational technology company ,that specializes in Internet-related services and products, which include online advertising technologies.")

In [3]:
doc

Google LLC is an American multinational technology company ,that specializes in Internet-related services and products, which include online advertising technologies.

In [4]:
for sentence in doc.sents:
    print(sentence)

Google LLC is an American multinational technology company ,that specializes in Internet-related services and products, which include online advertising technologies.


In [5]:
# Define a function that creates a custom sentence boundary

In [6]:
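def custom_boundary(doc):
    # Skip the last token: there is no following token to mark.
    for token in doc[:-1]:
        if token.text == ',':
            # The token after each comma starts a new sentence.
            doc[token.i+1].is_sent_start = True
    return doc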

In [7]:
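# spaCy v2 API: the function is passed directly and must run before the parser.
# (spaCy v3 expects the name of a component registered with @Language.component.)
nlp.add_pipe(custom_boundary, before='parser')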

In [8]:
doc2 = nlp("Google LLC is an American multinational technology company ,that specializes in Internet-related services and products, which include online advertising technologies.")

In [9]:
for sentence in doc2.sents:
    print(sentence)

Google LLC is an American multinational technology company ,
that specializes in Internet-related services and products,
which include online advertising technologies.
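
What `custom_boundary` does can be reproduced on a plain list of token strings: mark the token after each comma as a sentence start, then group tokens by those marks. This is a spaCy-free sketch; `mark_starts_after_commas` and `group_sentences` are hypothetical helpers, and the token list is a shortened, hand-tokenized version of the example sentence.

```python
def mark_starts_after_commas(tokens):
    # One flag per token; the first token always starts a sentence.
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True
    # Skip the last token: nothing follows it.
    for i, token in enumerate(tokens[:-1]):
        if token == ",":
            starts[i + 1] = True
    return starts

def group_sentences(tokens, starts):
    # Open a new group whenever a token is flagged as a sentence start.
    sentences = []
    for token, is_start in zip(tokens, starts):
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences

tokens = ["Google", "LLC", "is", "a", "company", ",", "that", "specializes",
          "in", "services", ",", "which", "include", "advertising", "."]
for sentence in group_sentences(tokens, mark_starts_after_commas(tokens)):
    print(" ".join(sentence))
```

As in the spaCy output above, each comma stays at the end of the sentence it closes.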


### SentenceSegmenter

- Segmentation of poetry
- Segmentation of song lyrics

#### The yield keyword

- yield is a Python keyword that hands a value back from a function without destroying the state of its local variables. Calling a function that contains yield does not run its body; it returns a generator. Each time the next value is requested, execution resumes right after the last yield statement. Any function that contains the yield keyword is called a generator function.
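
A minimal sketch of this behaviour, using a hypothetical generator function `count_up`:

```python
def count_up(limit):
    n = 1
    while n <= limit:
        yield n   # pause here and hand n to the caller
        n += 1    # execution resumes on this line at the next request

gen = count_up(3)   # creates the generator; the body has not run yet
first = next(gen)   # runs until the first yield
rest = list(gen)    # resumes after that yield and collects the remaining values

print(first)
print(rest)
```

The local variable `n` keeps its value between requests, which is what lets a splitter function remember where the current sentence started.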



In [120]:
import spacy

In [121]:
nlp = spacy.load("en_core_web_sm")

In [122]:
text = "What is good. What is bad.\nDon't think \njust move on."

In [123]:
print(text)

What is good. What is bad.
Don't think 
just move on.


In [124]:
string = nlp(text)

In [125]:
for sentence in string.sents:
    print(sentence)

What is good.
What is bad.

Don't think 
just move on.


In [126]:
# SentenceSegmenter belongs to the spaCy v2 API; it is not available in spaCy v3
from spacy.pipeline import SentenceSegmenter

In [127]:
def newline_splitter(doc):
    start = 0              # index of the first token of the current sentence
    found_newline = False  # True once a newline token has been seen

    for word in doc:
        if found_newline:
            yield doc[start:word.i]   # sentence runs from start up to, not including, this token
            start = word.i
            found_newline = False
        elif word.text.startswith('\n'):  # whitespace token that begins with a newline
            found_newline = True
    yield doc[start:]  # whatever remains is the last sentence

In [128]:
split = SentenceSegmenter(nlp.vocab, strategy=newline_splitter)  # build a segmenter that uses the strategy above

In [129]:
nlp.add_pipe(split)  # add the segmenter to the pipeline

In [130]:
doc = nlp(text)

In [131]:
for sentence in doc.sents:
    print(sentence)

What is good. What is bad.

Don't think 

just move on.
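
The flag-based logic in `newline_splitter` can also be tried without spaCy. The sketch below runs the same steps over a plain list of token strings; `split_after_newlines` is a hypothetical name, and the token list is a hand-written approximation of how the example text is tokenized, with each newline as its own token.

```python
def split_after_newlines(tokens):
    start = 0              # index of the first token of the current group
    found_newline = False  # True once a newline token has been seen
    for i, token in enumerate(tokens):
        if found_newline:
            yield tokens[start:i]   # close the group just before this token
            start = i
            found_newline = False
        elif token.startswith("\n"):
            found_newline = True
    yield tokens[start:]            # whatever remains is the last group

tokens = ["What", "is", "good", ".", "What", "is", "bad", ".", "\n",
          "Do", "n't", "think", "\n", "just", "move", "on", "."]
groups = list(split_after_newlines(tokens))
for group in groups:
    print(group)
```

Each newline token ends up as the last item of its group, which is why the spaCy output above shows a blank line after the first two sentences.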
