<strong>This notebook demonstrates the different tokenizers in NLTK, the Natural Language Toolkit.</strong>

Tokenizing Text into Sentences
Tokenization: A technique for breaking a piece of text into smaller units (sentences, words, or punctuation marks) according to a set of rules.

In [None]:
import nltk
#nltk.download('punkt')     # Punkt sentence tokenizer models (recent NLTK releases ask for 'punkt_tab' instead)
#nltk.download('webtext')
#nltk.download('wordnet')  # lexical database for the English language
#nltk.download('omw')       # Open Multilingual WordNet, using ISO-639 language codes.
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize # wrapper around a pretrained PunktSentenceTokenizer (defined in nltk.tokenize.punkt)
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
import nltk.data

<strong>A) Sentence Tokenizers</strong>

In [None]:
para = "This tutorial is about different forms of tokenizing in NLTK. It should be a lot of fun."

In [None]:
sent_tokenize(para) 

When tokenizing many texts, it is more efficient to load the PunktSentenceTokenizer once and then reuse it by calling its tokenize method.

In [None]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(para)

Now we can do the same thing in Spanish.

In [None]:
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola Senor. Buenos dias. Mucho gusto')

<strong>B) Tokenizing Sentences into Words</strong>

In [None]:
hw_string = 'Hello World.'
print(word_tokenize(hw_string))  # word_tokenize is a wrapper function.
# For this input, calling TreebankWordTokenizer directly (below) gives the same tokens as word_tokenize (above).
tokenizer = TreebankWordTokenizer()  # inherits from TokenizerI, uses conventions found in the Penn Treebank corpus
print(tokenizer.tokenize(hw_string))

Contractions: Words of the form "didn't", "couldn't", etc.

Notice how the different tokenizers handle contractions and punctuation.

In [None]:
game_sentence = "He didn't go to the ballgame."

In [None]:
from nltk.tokenize import RegexpTokenizer  # inherits from TokenizerI
tokenizer = RegexpTokenizer(r"[\w']+")   # Raw string avoids invalid-escape warnings. The ' inside [ ] lets a word contain an apostrophe.
tokenizer.tokenize(game_sentence)
# Notice how it drops the ending punctuation mark.

In [None]:
tokenizer1 = RegexpTokenizer(r"[\w'\.]+")   # Adding the period to the character class keeps it attached to the word.
tokenizer1.tokenize(game_sentence)
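
To see what the two patterns above match, here is a minimal sketch that applies them with Python's standard `re` module instead of NLTK. It assumes that RegexpTokenizer in its default mode (gaps=False) returns the non-overlapping matches of the pattern, which is what `re.findall` does. The names `no_period` and `with_period` are illustrative and do not come from the notebook.

```python
import re

game_sentence = "He didn't go to the ballgame."

# Word characters and apostrophes only: the final period matches nothing and is dropped.
no_period = re.findall(r"[\w']+", game_sentence)

# With the period in the character class, it stays attached to the last word.
with_period = re.findall(r"[\w'\.]+", game_sentence)

print(no_period)    # ['He', "didn't", 'go', 'to', 'the', 'ballgame']
print(with_period)  # ['He', "didn't", 'go', 'to', 'the', 'ballgame.']
```

Both patterns keep "didn't" as a single token because the apostrophe is part of the character class.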

In [None]:
from nltk.tokenize import WhitespaceTokenizer 
tokenizer = WhitespaceTokenizer()  # inherits from RegexpTokenizer
tokenizer.tokenize(game_sentence)

In [None]:
tokenizer = WordPunctTokenizer()  # inherits from RegexpTokenizer
tokenizer.tokenize(game_sentence)
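
The NLTK documentation describes WordPunctTokenizer as using the regexp `\w+|[^\w\s]+`. As a rough sketch of why the contraction is split into three pieces, the same pattern can be applied with the standard `re` module. The constant name `WORD_PUNCT_PATTERN` is illustrative, not part of NLTK.

```python
import re

game_sentence = "He didn't go to the ballgame."

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

tokens = re.findall(WORD_PUNCT_PATTERN, game_sentence)
print(tokens)  # ['He', 'didn', "'", 't', 'go', 'to', 'the', 'ballgame', '.']
```

Because the apostrophe is not a word character, "didn't" becomes 'didn', "'", and 't', and the final period becomes its own token.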

In [None]:
tokenizer = TreebankWordTokenizer()  # inherits from TokenizerI, uses conventions found in the Penn Treebank corpus
print(tokenizer.tokenize(game_sentence))

In [None]:
#nltk.download('stopwords')  # needed once before the stop word lists can be loaded
from nltk.corpus import stopwords  # corpus reader (nltk.corpus.reader.WordListCorpusReader)
english_stops = set(stopwords.words('english'))
# Called with no arguments, stopwords.words() returns the stop words for all available languages.
#print(english_stops)

In [None]:
# And here is the complete list of languages
stopwords.fileids()
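
A stop word set is typically used to filter tokens. The sketch below shows the idea without requiring the NLTK corpus: `sample_stops` is a small hand-picked stand-in for `english_stops` (the real NLTK list is much longer), and the tokens come from a plain `re.findall` call rather than an NLTK tokenizer.

```python
import re

# Hand-picked stand-in for set(stopwords.words('english')); an assumption for illustration only.
sample_stops = {"he", "to", "the"}

game_sentence = "He didn't go to the ballgame."
words = re.findall(r"[\w']+", game_sentence)

# Compare in lowercase, because the stop word entries are lowercase.
content_words = [w for w in words if w.lower() not in sample_stops]
print(content_words)  # ["didn't", 'go', 'ballgame']
```

Storing the stop words in a set, as the notebook does with `english_stops`, keeps each membership check fast.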

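The example below shows how to tokenize on whitespace using the RegexpTokenizer. With gaps=True, the pattern describes the separators between tokens rather than the tokens themselves.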

In [None]:
tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # raw string avoids invalid-escape warnings
tokenizer.tokenize(game_sentence)
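
For comparison, here is a minimal sketch of the same gap-based splitting using only the standard `re` module. It assumes that gaps=True corresponds to splitting on the pattern and discarding empty strings, which is the default behavior of RegexpTokenizer.

```python
import re

game_sentence = "He didn't go to the ballgame."

# Split on runs of whitespace, then drop empty strings
# (they appear when the text starts or ends with whitespace).
tokens = [t for t in re.split(r"\s+", game_sentence) if t]

print(tokens)                            # ['He', "didn't", 'go', 'to', 'the', 'ballgame.']
print(tokens == game_sentence.split())   # True
```

The result matches `str.split()` with no arguments: the period stays attached to "ballgame." because only whitespace is treated as a separator.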