# Natural Language Toolkit

`NLTK` is a library for statistical natural language processing. Its `tokenize` module provides several functions and classes that split text into tokens, including:

- word_tokenize
- sent_tokenize
- wordpunct_tokenize
- WhitespaceTokenizer
- TreebankWordTokenizer
- TweetTokenizer
- MWETokenizer

<p align="center">
    <img src="./../assets/tokenization/nltk.jpg"><br>
    <a href="https://udemy.com/course/python-for-data-science-and-machine-learning-bootcamp"><i>[Image source]</i></a>
</p>

As an example, we'll use the same text for all the tokenization modules covered in this repo. Let's start by installing the `nltk` library with the following command:

`pip3 install nltk`

In [1]:
#Turn off the pretty print
%pprint

Pretty printing has been turned OFF


In [2]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

Punkt is the sentence tokenizer that `sent_tokenize` and `word_tokenize` rely on. It is designed to learn its parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. A trained Punkt model knows that the periods in "Mr. Smith" and "Johann S. Bach" do not mark sentence boundaries, and that a sentence can sometimes start with a non-capitalized word.

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/arun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
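To make the role of the abbreviation list more concrete, here is a minimal sketch that builds a Punkt tokenizer from hand-written parameters instead of the downloaded model. The abbreviation set and the sample sentence are assumptions made up for illustration; a real Punkt model learns these parameters from a corpus rather than having them typed in.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hand-written parameters (illustrative assumption): register "mr" as a
# known abbreviation. A trained model learns this set from a corpus.
params = PunktParameters()
params.abbrev_types = {"mr"}

# Passing a PunktParameters object uses it directly, without any training.
tokenizer = PunktSentenceTokenizer(params)

sample = "Mr. Smith bought two muffins. He liked them."
sentences = tokenizer.tokenize(sample)
print(sentences)
```

Because `mr` is listed as an abbreviation, the period after `Mr` should not be treated as a sentence boundary.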

### word_tokenize

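`word_tokenize` splits a string into words and punctuation marks. It first splits the text into sentences, so it requires the `Punkt` sentence tokenization models to be installed.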

In [4]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
tokens

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks.🙂😍']

### sent_tokenize

`sent_tokenize` operates at the sentence level and splits the text into sentences:

In [5]:
from nltk.tokenize import sent_tokenize

tokens = sent_tokenize(text)
tokens

['Good muffins cost $3.88.', 'Please buy me two of them.', 'Thanks.🙂😍']

### wordpunct_tokenize

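`wordpunct_tokenize` splits text into runs of word characters (letters, digits and underscores) and runs of other non-whitespace characters, so punctuation is always separated from words. It is based purely on a regular expression and does not need the `Punkt` models.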

In [6]:
from nltk.tokenize import wordpunct_tokenize

tokens = wordpunct_tokenize(text)
tokens

['Good', 'muffins', 'cost', '$', '3', '.', '88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.🙂😍']
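The NLTK documentation describes this tokenizer as using the regular expression `\w+|[^\w\s]+`. As a rough sketch of what happens under the hood, the same split can be reproduced with Python's standard `re` module; the constant name `WORDPUNCT_PATTERN` is only an illustrative choice.

```python
import re

text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

# \w+      : a run of word characters (letters, digits, underscore)
# [^\w\s]+ : a run of characters that are neither word characters nor whitespace
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

tokens = WORDPUNCT_PATTERN.findall(text)
print(tokens)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.', 'Please', 'buy', 'me',
#  'two', 'of', 'them', '.', 'Thanks', '.🙂😍']
```

This also explains why `3.88` is broken into three tokens and why the final period and the emojis end up in one token: they form a single run of non-word, non-whitespace characters.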

### WhitespaceTokenizer

`WhitespaceTokenizer` splits a string on whitespace (spaces, newlines and tabs) and leaves punctuation attached to the adjacent word.

In [7]:
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)
tokens

['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.🙂😍']
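For comparison, a whitespace split like the one above can be sketched with the standard library alone. This is an approximation for illustration, not NLTK's own code: it treats every run of whitespace as a gap and drops empty strings.

```python
import re

text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

# Split on runs of whitespace and discard empty strings that would
# appear if the text started or ended with whitespace.
tokens = [tok for tok in re.split(r"\s+", text) if tok]
print(tokens)
# ['Good', 'muffins', 'cost', '$3.88.', 'Please', 'buy', 'me', 'two', 'of',
#  'them.', 'Thanks.🙂😍']
```

For this text, `text.split()` gives the same list. The NLTK class is mainly useful when you want the common tokenizer interface shared by the other tokenizers.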

### TreebankWordTokenizer

This tokenizer incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation like (?!.;,) from adjacent tokens and retains decimal numbers as a single token. Periods are split off only at the end of the input string, which is why `3.88.` and `them.` stay intact in the output below. It also contains rules for English contractions. You can find all the rules for the Treebank Tokenizer [here](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.treebank).

For example, “don’t” is tokenized as [“do”, “n’t”].

[Source](https://neptune.ai/blog/tokenization-in-nlp)

In [8]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
tokens

['Good', 'muffins', 'cost', '$', '3.88.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.🙂😍']
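The sample text contains no contractions, so the contraction rules are not visible in the output above. A small sketch with two made-up sentences shows them in action:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# "don't" is split into the stem "do" and the clitic "n't".
negation = tokenizer.tokenize("I don't know.")
print(negation)
# ['I', 'do', "n't", 'know', '.']

# "They'll" is split into "They" and "'ll".
future = tokenizer.tokenize("They'll save and invest more.")
print(future)
# ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
```

In both cases the final period is split off because it sits at the very end of the input string.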

### TweetTokenizer

`TweetTokenizer` is a rule-based tokenizer specially designed for text such as tweets. It splits emojis into separate tokens, which can be helpful for tasks like sentiment analysis.

In [9]:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
tokens

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.', '🙂', '😍']
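`TweetTokenizer` also accepts a few options that are not used in the cell above. The sketch below uses `strip_handles` and `reduce_len` on a sample tweet taken from the NLTK documentation; whether you want these options depends on your task.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles=True removes @username mentions.
# reduce_len=True shortens runs of 3 or more repeated characters to 3.
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
tokens = tokenizer.tokenize(tweet)
print(tokens)
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
```

There is also a `preserve_case` option; setting it to `False` lowercases the tokens, which can reduce vocabulary size for noisy social media text.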

### MWETokenizer

MWETokenizer stands for Multi-Word Expression Tokenizer. It provides a method `add_mwe()` that lets the user register multi-word expressions before running the tokenizer on text. Put simply, it merges each registered multi-word expression into a single token, joined by `_` by default. Note that it operates on an already tokenized list of words, not on a raw string.

In [10]:
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokens = tokenizer.tokenize(word_tokenize(text))
tokens

['Good', 'muffins', 'cost', '$', '3.88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks.🙂😍']

In [11]:
tokenizer = MWETokenizer()
tokenizer.add_mwe(('$', '3.88', '.'))
tokens = tokenizer.tokenize(word_tokenize(text))
tokens

['Good', 'muffins', 'cost', '$_3.88_.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks.🙂😍']
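Expressions can also be passed to the constructor, and the joining character can be changed with the `separator` argument. The sentence and the expressions in this sketch are made up for illustration:

```python
from nltk.tokenize import MWETokenizer

# Register one expression at construction time and join parts with a space
# instead of the default underscore.
tokenizer = MWETokenizer([("New", "York")], separator=" ")

# More expressions can still be added afterwards.
tokenizer.add_mwe(("in", "spite", "of"))

# MWETokenizer expects a list of tokens, so split the sentence first.
words = "I moved to New York in spite of the rent".split()
tokens = tokenizer.tokenize(words)
print(tokens)
# ['I', 'moved', 'to', 'New York', 'in spite of', 'the', 'rent']
```

Matching is exact and case-sensitive, so `new york` would not be merged unless it is registered separately.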