# **Tokenization**

**AIM** : Tokenize the sentence into words

**Tools** : Jupyter or any Python editor

**Library** : NLTK

**Method** : Tokenizers that ship with NLTK:

word_tokenize, sent_tokenize, TreebankWordTokenizer, wordpunct_tokenize, TweetTokenizer, MWETokenizer


In [None]:
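import nltk
from nltk.tokenize import (word_tokenize, sent_tokenize, TreebankWordTokenizer, wordpunct_tokenize, TweetTokenizer, MWETokenizer)

# word_tokenize and sent_tokenize rely on the Punkt models; if they are missing,
# run nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').
text = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"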

**Word Tokenizer**

Word tokenizers are one class of tokenizers that split a text into words. The resulting tokens can be used to build a bag-of-words representation of the text, or as input for downstream tasks such as computing TF-IDF features or training word2vec embeddings.

In [None]:
nltk_tokens = nltk.word_tokenize(text)
print(nltk_tokens)

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']
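
As a minimal sketch of the bag-of-words idea mentioned above, the tokens can be counted with `collections.Counter` from the standard library. The token list is copied from the output above so the snippet runs on its own; lowercasing before counting and the name `bag_of_words` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Tokens copied from the word_tokenize output above
tokens = ['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are',
          'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from',
          'the', 'comforts', 'of', 'their', 'drawing', 'rooms']

# Lowercase so that 'It' and 'it' would share one entry (an illustrative choice)
bag_of_words = Counter(token.lower() for token in tokens)

# Show the two most frequent tokens with their counts
print(bag_of_words.most_common(2))
```

Only 'from' and 'the' occur twice in this sentence; every other token has a count of one.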


**Sentence Tokenizer**

Sentence tokenization is the process of splitting text into individual sentences. For literature, journalism, and formal documents, the pretrained Punkt model behind NLTK's sent_tokenize performs well, since it was trained on a corpus of formal English text. It performs less well on text such as electronic health records, which feature abbreviations, medical terms, spatial measurements, and other forms not found in standard written English.

In [None]:
nltk_tokens = nltk.sent_tokenize(text)
print(nltk_tokens)

['It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms']
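
Because the sample text is a single sentence, the output above contains only one item. The sketch below uses a short two-sentence string (made up for illustration) to show an actual split. It instantiates `PunktSentenceTokenizer` directly with its default parameters, which is an assumption made here to avoid the model download; `sent_tokenize` loads a pretrained model instead and may behave differently, especially around abbreviations.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical two-sentence sample, not part of the original notebook
sample = "Tokenization is the first step. It splits text into smaller units."

# An untrained Punkt tokenizer: no abbreviation knowledge, no download needed
splitter = PunktSentenceTokenizer()
sentences = splitter.tokenize(sample)

print(sentences)
```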


**Punctuation Based Tokenizer**

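This tokenizer splits a sentence into words based on whitespace and punctuation. Every run of punctuation characters becomes its own token, so contractions such as don't are broken into three pieces.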

In [None]:
text = "What you don't want to be done to yourself, don't do it to others."

nltk_tokens = nltk.wordpunct_tokenize(text)
print(nltk_tokens)

['What', 'you', 'don', "'", 't', 'want', 'to', 'be', 'done', 'to', 'yourself', ',', 'don', "'", 't', 'do', 'it', 'to', 'others', '.']
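
To make the splitting rule concrete, here is a small sketch that uses only the standard `re` module. The pattern `\w+|[^\w\s]+` (runs of word characters, or runs of characters that are neither word characters nor whitespace) is, as far as I know, the one NLTK's `WordPunctTokenizer` is built on; treat the snippet as an approximation of the behavior rather than the library's implementation.

```python
import re

text = "What you don't want to be done to yourself, don't do it to others."

# Either a run of word characters, or a run of non-word, non-space characters
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

regex_tokens = WORD_OR_PUNCT.findall(text)
print(regex_tokens)
```

The apostrophe is neither a word character nor whitespace, which is why don't comes out as three separate tokens.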


**Treebank Word Tokenizer**

The punctuation-based tokenizer splits contractions incorrectly, for example don't into don, ', and t. The Treebank tokenizer solves this problem: it contains rules for English contractions, so don't becomes do and n't.

In [None]:
text = "What you don't want to be done to yourself, don't do it to others."

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(text))

['What', 'you', 'do', "n't", 'want', 'to', 'be', 'done', 'to', 'yourself', ',', 'do', "n't", 'do', 'it', 'to', 'others', '.']
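
The contraction rules are not limited to n't. The following sketch runs the same tokenizer on a made-up sentence containing two different contractions; the sentence and variable names are illustrative assumptions.

```python
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical sample with an "n't" and an "'ll" contraction
sample = "I can't believe they'll come."

treebank = TreebankWordTokenizer()
contraction_tokens = treebank.tokenize(sample)

print(contraction_tokens)
```

Each contraction is split into a stem and a clitic (ca + n't, they + 'll) instead of being cut at the apostrophe.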


**Tweet Tokenizer**

TweetTokenizer splits a corpus of tweets into relevant tokens.

The advantage of TweetTokenizer() over the regular word_tokenize is that tweets often contain URLs, emojis, hashtags, and mentions, which need special handling and should stay intact as single tokens.

In [None]:
text = ['https://t.co/9z2J3P33Uc',
        'laugh/cry',
        '😬😭😓🤢🙄😱',
        "world's problems",
        "@datageneral",
        "It's interesting",
        "don't spell my name right",
        'all-nighter']

tweet_tokenizer = TweetTokenizer()
tweet_tokens = []
for sent in text:
    tokens = tweet_tokenizer.tokenize(sent)  # tokenize once, then print and store
    print(tokens)
    tweet_tokens.append(tokens)

['https://t.co/9z2J3P33Uc']
['laugh', '/', 'cry']
['😬', '😭', '😓', '🤢', '🙄', '😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
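
TweetTokenizer also accepts a few normalization options. The sketch below assumes the `strip_handles` and `reduce_len` keyword arguments, which remove @-mentions and shorten long runs of a repeated character; the sample tweet is illustrative and not taken from the notebook's data.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len caps repeated characters at three
normalizing_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
normalized_tokens = normalizing_tokenizer.tokenize(tweet)

print(normalized_tokens)
```

Whether to normalize this way depends on the task: mentions and elongated words can carry signal for sentiment analysis, so stripping them is a design choice rather than a default.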


**Multi-Word Expression Tokenizer**

An MWETokenizer takes text that has already been divided into tokens and retokenizes it, merging multi-word expressions (MWEs) into single tokens using a lexicon (dictionary) of MWEs.

In [None]:
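# Seed the lexicon with two MWEs, then register a third one afterwards
tk = MWETokenizer([('C', 'U'), ('Chandigarh', 'University')])
tk.add_mwe(('who', 'are', 'you'))

text = "who are you at Chandigarh University"

# MWETokenizer expects a list of tokens, so split on whitespace first
text1 = tk.tokenize(text.split())
print(text1)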

['who_are_you', 'at', 'Chandigarh_University']
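
Two details are worth seeing in isolation: the joining character can be changed with the `separator` argument, and lexicon matching is exact, so a differently cased phrase is left untouched. The sentences below are made-up examples and the variable names are illustrative.

```python
from nltk.tokenize import MWETokenizer

# Join matched expressions with '-' instead of the default '_'
mwe_tokenizer = MWETokenizer([('Chandigarh', 'University')], separator='-')

# Exact match: the two tokens are merged
matched = mwe_tokenizer.tokenize("I study at Chandigarh University".split())

# Case differs from the lexicon entry: the tokens stay separate
unmatched = mwe_tokenizer.tokenize("I study at chandigarh university".split())

print(matched)
print(unmatched)
```

If case-insensitive merging is wanted, one option is to lowercase both the lexicon entries and the input tokens before calling `tokenize`.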
