# Sentence segmentation

Sentence segmentation, also called sentence boundary disambiguation, is one of the most crucial steps in NLP: a corpus of text is split into its individual sentences. For languages such as English and Spanish, punctuation delimiters (such as the full stop) and capitalization give a reasonable approximation of where a sentence boundary lies.
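To make the delimiter-and-capitalization heuristic concrete, here is a minimal sketch using only Python's standard library. The name `naive_split` and the regular expression are illustrative assumptions, not part of Stanza. The second call shows where the heuristic breaks down (an abbreviation), which is why the task is a disambiguation problem rather than a simple split.

```python
import re

# Naive heuristic: a sentence ends at '.', '!' or '?' when the mark is
# followed by whitespace and then an uppercase letter.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def naive_split(text):
    """Split text into sentences using punctuation plus capitalization."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

# Works on simple input.
print(naive_split('This is a test sentence for stanza. This is another sentence.'))

# Fails on abbreviations: 'Dr.' is wrongly treated as a sentence end.
print(naive_split('Dr. Smith arrived. He was late.'))
```

A trained model such as the one behind Stanza's tokenizer is meant to handle cases like the abbreviation above, which a fixed rule cannot.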

Tokenization and sentence segmentation in Stanza are jointly performed by the TokenizeProcessor. This processor splits the raw input text into tokens and sentences, so that downstream annotation can happen at the sentence level. This processor can be invoked by the name tokenize.

In [3]:
!pip install stanza



In [4]:
import stanza

In [5]:
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
doc

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/tokenize/combined.pt:   0%|    …

2022-06-21 16:15:30 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:15:30 INFO: Use device: cpu
2022-06-21 16:15:30 INFO: Loading: tokenize
2022-06-21 16:15:31 INFO: Done loading processors!


[
  [
    {
      "id": 1,
      "text": "This",
      "start_char": 0,
      "end_char": 4
    },
    {
      "id": 2,
      "text": "is",
      "start_char": 5,
      "end_char": 7
    },
    {
      "id": 3,
      "text": "a",
      "start_char": 8,
      "end_char": 9
    },
    {
      "id": 4,
      "text": "test",
      "start_char": 10,
      "end_char": 14
    },
    {
      "id": 5,
      "text": "sentence",
      "start_char": 15,
      "end_char": 23
    },
    {
      "id": 6,
      "text": "for",
      "start_char": 24,
      "end_char": 27
    },
    {
      "id": 7,
      "text": "stanza",
      "start_char": 28,
      "end_char": 34
    },
    {
      "id": 8,
      "text": ".",
      "start_char": 34,
      "end_char": 35
    }
  ],
  [
    {
      "id": 1,
      "text": "This",
      "start_char": 36,
      "end_char": 40
    },
    {
      "id": 2,
      "text": "is",
      "start_char": 41,
      "end_char": 43
    },
    {
      "id": 3,
      "text": "another",
 

In [6]:
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: test
id: (5,)	text: sentence
id: (6,)	text: for
id: (7,)	text: stanza
id: (8,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: another
id: (4,)	text: sentence
id: (5,)	text: .
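Each `token.id` prints as a tuple such as `(1,)` rather than a plain integer. Stanza stores the id as a tuple so that a multi-word token can carry a range such as `(3, 4)`. The helper below is a hypothetical convenience (`format_token_id` is not a Stanza function) for rendering these ids compactly. It works on plain tuples, so it runs without loading a pipeline.

```python
def format_token_id(token_id):
    """Render an id tuple such as (1,) or (3, 4) as '1' or '3-4'."""
    return '-'.join(str(part) for part in token_id)

# Single-word token id, as printed in the output above.
print(format_token_id((1,)))

# Range-style id, as a multi-word token would carry.
print(format_token_id((3, 4)))
```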


In [8]:
# Get individual sentences from the doc
print([sentence.text for sentence in doc.sentences])

['This is a test sentence for stanza.', 'This is another sentence.']
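The `start_char` and `end_char` fields in the document dump earlier are offsets into the raw input string, with `end_char` exclusive. The sketch below checks this with a few values copied by hand from that dump. The list `tokens` is an illustrative stand-in for real token objects, so no Stanza import is needed.

```python
text = 'This is a test sentence for stanza. This is another sentence.'

# (text, start_char, end_char) triples copied from the dump of `doc`.
tokens = [
    ('This', 0, 4),
    ('is', 5, 7),
    ('stanza', 28, 34),
    ('.', 34, 35),
    ('This', 36, 40),
]

# Slicing the raw text with the offsets recovers each token.
for token_text, start, end in tokens:
    assert text[start:end] == token_text

# The first sentence spans from its first token's start to its last token's end.
first_sentence = text[0:35]
print(first_sentence)
```

Offsets like these make it possible to map annotations back onto the original text even when whitespace is irregular.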


In [12]:
# Tokenization without sentence segmentation

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=True)
doc = nlp('This is a sentence.This is a second.\n\nThis is a third.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:28:10 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:28:10 INFO: Use device: cpu
2022-06-21 16:28:10 INFO: Loading: tokenize
2022-06-21 16:28:11 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: sentence
id: (5,)	text: .
id: (6,)	text: This
id: (7,)	text: is
id: (8,)	text: a
id: (9,)	text: second
id: (10,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: third
id: (5,)	text: .


Note: Here, "This is a sentence." and "This is a second." should be two different sentences, but they are not segmented because of the setting "tokenize_no_ssplit=True". However, wherever a blank line ("\n\n") appears explicitly in the input, segmentation still happens (a single "\n" is not enough; two are necessary), which is why "This is a third." becomes its own sentence.
The following shows what happens with "tokenize_no_ssplit=False".

In [13]:
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=False)
doc = nlp('This is a sentence.This is a second.\n\nThis is a third.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:30:49 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:30:49 INFO: Use device: cpu
2022-06-21 16:30:49 INFO: Loading: tokenize
2022-06-21 16:30:49 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: sentence
id: (5,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: second
id: (5,)	text: .
====== Sentence 3 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: third
id: (5,)	text: .


Note: In this case, both sentence segmentation and tokenization occur.
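The blank-line rule seen with "tokenize_no_ssplit=True" can be mimicked in plain Python. This is only a sketch of that one rule, assuming a hypothetical helper `split_on_blank_lines`. It is not how Stanza's tokenizer is implemented, and it does no tokenization at all.

```python
import re

def split_on_blank_lines(text):
    """Split text into chunks wherever a blank line (two or more newlines) occurs."""
    return [chunk for chunk in re.split(r'\n\s*\n', text) if chunk.strip()]

# Same input as in the cells above: only the blank line causes a split.
print(split_on_blank_lines('This is a sentence.This is a second.\n\nThis is a third.'))

# A single newline is not enough to trigger a split.
print(split_on_blank_lines('This is a sentence.\nThis is a second.'))
```

With "tokenize_no_ssplit=False", the model additionally decides boundaries inside each chunk, which is how "This is a sentence." and "This is a second." end up separated.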

In [27]:
# Case of pre-tokenized text

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp('This is token.ization done my way! Sentence are not split if no new line! \n New line splits sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:43:15 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:43:15 INFO: Use device: cpu
2022-06-21 16:43:15 INFO: Loading: tokenize
2022-06-21 16:43:15 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: token.ization
id: (4,)	text: done
id: (5,)	text: my
id: (6,)	text: way!
id: (7,)	text: Sentence
id: (8,)	text: are
id: (9,)	text: not
id: (10,)	text: split
id: (11,)	text: if
id: (12,)	text: no
id: (13,)	text: new
id: (14,)	text: line!
====== Sentence 2 tokens =======
id: (1,)	text: New
id: (2,)	text: line
id: (3,)	text: splits
id: (4,)	text: sentence.
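With "tokenize_pretokenized=True" and a string input, the output above is consistent with treating each line as one sentence and each whitespace-separated chunk as one token. The function below is a minimal stand-in for that behavior. `parse_pretokenized` is a hypothetical name, and the code is a sketch of the observed result rather than Stanza's own implementation.

```python
def parse_pretokenized(text):
    """Treat each non-empty line as a sentence and split its tokens on whitespace."""
    sentences = []
    for line in text.split('\n'):
        tokens = line.split()
        if tokens:  # skip empty or whitespace-only lines
            sentences.append(tokens)
    return sentences

text = ('This is token.ization done my way! Sentence are not split '
        'if no new line! \n New line splits sentence.')

for i, sentence in enumerate(parse_pretokenized(text)):
    print(f'Sentence {i+1}: {sentence}')
```

Punctuation stays attached to the neighbouring word because nothing other than whitespace is used as a token boundary.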


In [26]:
# Case of not pre-tokenized text

nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=False)
doc = nlp('This is token.ization done my way! \n Sentence split, too!')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:42:19 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:42:19 INFO: Use device: cpu
2022-06-21 16:42:19 INFO: Loading: tokenize
2022-06-21 16:42:19 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: token.ization
id: (4,)	text: done
id: (5,)	text: my
id: (6,)	text: way
id: (7,)	text: !
====== Sentence 2 tokens =======
id: (1,)	text: Sentence
id: (2,)	text: split
id: (3,)	text: ,
id: (4,)	text: too
id: (5,)	text: !


Note: Here, ',' and '!' are separate tokens. In contrast, with "tokenize_pretokenized=True" the punctuation stays attached to the words, e.g. 'way!' in the previous case, and 'split,' and 'too!' in the next one.
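The contrast between attached and detached punctuation can be reproduced with two one-liners. The regular expression is a crude, assumed approximation chosen to match this one example. Stanza's tokenizer is a trained model, not a regex, so this only illustrates the difference in output shape.

```python
import re

text = 'This is token.ization done my way! \n Sentence split, too!'

# Whitespace-only splitting keeps punctuation attached to words.
whitespace_tokens = text.split()

# Crude approximation of punctuation-aware tokenization: a run of word
# characters (allowing internal dots, as in 'token.ization') or a single
# punctuation mark.
PATTERN = re.compile(r'\w+(?:\.\w+)*|[^\w\s]')
regex_tokens = PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```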

In [22]:
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp([['This', 'is', 'token.ization', 'done', 'my', 'way!'], ['Sentence', 'split,', 'too!']])
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:40:10 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-06-21 16:40:10 INFO: Use device: cpu
2022-06-21 16:40:10 INFO: Loading: tokenize
2022-06-21 16:40:10 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: token.ization
id: (4,)	text: done
id: (5,)	text: my
id: (6,)	text: way!
====== Sentence 2 tokens =======
id: (1,)	text: Sentence
id: (2,)	text: split,
id: (3,)	text: too!


# Use spaCy for Fast Tokenization and Sentence Segmentation

In [29]:
!pip install spacy


In [30]:
nlp = stanza.Pipeline(lang='en', processors={'tokenize': 'spacy'}) # spaCy tokenizer is currently only allowed in English pipeline.
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-21 16:52:28 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | spacy     |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-06-21 16:52:28 INFO: Use device: cpu
2022-06-21 16:52:28 INFO: Loading: tokenize
2022-06-21 16:52:32 INFO: Loading: pos
2022-06-21 16:52:33 INFO: Loading: lemma
2022-06-21 16:52:33 INFO: Loading: depparse
2022-06-21 16:52:33 INFO: Loading: sentiment
2022-06-21 16:52:33 INFO: Loading: constituency
2022-06-21 16:52:34 INFO: Loading: ner
2022-06-21 16:52:35 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: test
id: (5,)	text: sentence
id: (6,)	text: for
id: (7,)	text: stanza
id: (8,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: another
id: (4,)	text: sentence
id: (5,)	text: .
