## NLTK tokenize

In [15]:
from nltk.tokenize import (sent_tokenize, word_tokenize)

# https://www.nltk.org/api/nltk.tokenize.html

para = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

para.split(". ")

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 "But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."]
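One way around the single-separator limit, before reaching for a library, is a regular expression that treats several terminators at once. The block below is a minimal sketch using only the standard `re` module. The name `regex_sent_para` is introduced here for illustration, and this rule is not what NLTK does internally.

```python
import re

para = (
    "Characters like periods, exclamation point and newline char are used "
    "to separate the sentences. But one drawback with split() method, that "
    "we can only use one separator at a time! So sentence tokenization "
    "won't be foolproof with split() method."
)

# Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps
# the punctuation attached to the sentence it ends.
regex_sent_para = re.split(r"(?<=[.!?])\s+", para)

print(len(regex_sent_para))
for sentence in regex_sent_para:
    print(sentence)
```

This gives three sentences for this paragraph, but it is still a fixed rule: any period followed by a space is treated as a sentence boundary.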

In [7]:
# Tokenize text into sentences: sent_tokenize returns a sentence-tokenized copy of text,
# using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
nltk_sent_para = sent_tokenize(para, language='english')

print(f"We get {len(nltk_sent_para)} tokens (sentences) in this paragraph")

print(nltk_sent_para)

We get 3 tokens (sentences) in this paragraph
['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', "So sentence tokenization won't be foolproof with split() method."]


Source code

```python
def sent_tokenize(text, language="english"):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
    return tokenizer.tokenize(text)
```

The sent_tokenize function loads a pre-trained instance of PunktSentenceTokenizer (from the nltk.tokenize.punkt module) for the requested language. Because this instance has already been trained, and works well for many European languages, it knows which punctuation and characters mark the end of one sentence and the beginning of the next.
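To see why a trained model helps, it is useful to look at where a fixed punctuation rule goes wrong. The sketch below uses only the standard `re` module on a made-up sentence (the text and the name `pieces` are illustrative assumptions). Abbreviations such as "Dr." and "p.m." are the kind of case Punkt is designed to learn from data. The output of sent_tokenize on this string is not shown here.

```python
import re

text = "Dr. Smith arrived at 5 p.m. sharp. The talk began."

# Fixed rule: break after '.', '!' or '?' when followed by whitespace.
pieces = re.split(r"(?<=[.!?])\s+", text)

# The text has two sentences, but the rule also breaks after the
# abbreviations "Dr." and "p.m.", producing four pieces.
for piece in pieces:
    print(repr(piece))
```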

In [8]:
# Tokenize text into words: word_tokenize returns a tokenized copy of text, using NLTK’s recommended word tokenizer
# (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
nltk_word_para = word_tokenize(para, language="english", preserve_line=False)

print(f"We get {len(nltk_word_para)} tokens (words) in this paragraph")

print(nltk_word_para)

We get 49 tokens (words) in this paragraph
['Characters', 'like', 'periods', ',', 'exclamation', 'point', 'and', 'newline', 'char', 'are', 'used', 'to', 'separate', 'the', 'sentences', '.', 'But', 'one', 'drawback', 'with', 'split', '(', ')', 'method', ',', 'that', 'we', 'can', 'only', 'use', 'one', 'separator', 'at', 'a', 'time', '!', 'So', 'sentence', 'tokenization', 'wo', "n't", 'be', 'foolproof', 'with', 'split', '(', ')', 'method', '.']


Source code

```python
from nltk.tokenize.destructive import NLTKWordTokenizer

_treebank_word_tokenizer = NLTKWordTokenizer()

def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
```

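The word_tokenize() function is a wrapper. Unless preserve_line is True, it first splits the text into sentences with sent_tokenize(). It then calls tokenize() on a shared NLTKWordTokenizer instance (the improved TreebankWordTokenizer mentioned in the docstring) for each sentence and flattens the results into a single list.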
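The control flow of that wrapper can be reproduced without NLTK. In the sketch below, `split_sentences` and `split_words` are hypothetical stand-ins for sent_tokenize and the word tokenizer, not NLTK's rules. The word stand-in only detaches punctuation at the very end of its input, an assumption chosen to make the effect of `preserve_line` visible.

```python
import re

def split_sentences(text):
    # Hypothetical stand-in for sent_tokenize: break after '.', '!' or '?'
    # when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_words(sentence):
    # Hypothetical stand-in for the word tokenizer: whitespace split, then
    # detach one trailing punctuation mark from the last token only.
    tokens = sentence.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1][-1] in ".!?":
        tokens[-1:] = [tokens[-1][:-1], tokens[-1][-1]]
    return tokens

def word_tokenize_sketch(text, preserve_line=False):
    # Same shape as the wrapper above: optional sentence split, then a
    # flattened list of per-sentence tokens.
    sentences = [text] if preserve_line else split_sentences(text)
    return [token for sent in sentences for token in split_words(sent)]

text = "Hi there! Bye now."
by_sentence = word_tokenize_sketch(text)
whole_line = word_tokenize_sketch(text, preserve_line=True)
print(by_sentence)   # ['Hi', 'there', '!', 'Bye', 'now', '.']
print(whole_line)    # ['Hi', 'there!', 'Bye', 'now', '.']
```

With these stand-ins, skipping the sentence split leaves "there!" as a single token, because the word splitter never sees it at the end of its input.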

## spaCy tokenize

In [16]:
import spacy

#   nlp = spacy.load("en_core_web_md")
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(para)

doc

Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method.

In [17]:
#   Tokenizing text into sentences

spacy_sent_para = [sent.text for sent in doc.sents]

print(f"We get {len(spacy_sent_para)} tokens (sentences) in this paragraph")

print(spacy_sent_para)

We get 3 tokens (sentences) in this paragraph
['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', "So sentence tokenization won't be foolproof with split() method."]


In [18]:
#   Tokenizing text into words

spacy_word_para = [word.text for word in doc]

print(f"We get {len(spacy_word_para)} tokens (words) in this paragraph")

print(spacy_word_para)

We get 49 tokens (words) in this paragraph
['Characters', 'like', 'periods', ',', 'exclamation', 'point', 'and', 'newline', 'char', 'are', 'used', 'to', 'separate', 'the', 'sentences', '.', 'But', 'one', 'drawback', 'with', 'split', '(', ')', 'method', ',', 'that', 'we', 'can', 'only', 'use', 'one', 'separator', 'at', 'a', 'time', '!', 'So', 'sentence', 'tokenization', 'wo', "n't", 'be', 'foolproof', 'with', 'split', '(', ')', 'method', '.']


In [27]:
# https://spacy.io/api/token#attributes
# Index 12 is 'separate' in the token list above; doc[0] would be 'Characters'.
doc[12].text

'separate'