# Tokenisation

- <b>Definition:</b> Tokenisation is the process of breaking a stream of text into smaller units called "tokens". These tokens can be words, phrases, or other meaningful sub-units such as punctuation or special characters, depending on the application.

- <b>Purpose:</b> Tokenisation simplifies text processing by converting unstructured text into structured, analysable components.

- <b>Types:</b> Word Tokenisation, Sentence Tokenisation, Tweet Tokenisation, Subword Tokenisation, N-gram Tokenisation, Regex-based Tokenisation, Whitespace Tokenisation, and others.
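
Of the types listed above, n-gram tokenisation is not demonstrated later in this notebook, so here is a minimal sketch of it in plain Python. The helper name `word_ngrams` is hypothetical (it is not an NLTK function), and the sketch assumes the words have already been separated by whitespace.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    # Slide a window of size n over the word list
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = word_ngrams("It looks like religious mania", 2)
print(bigrams)
```

Each bigram overlaps the previous one by one word, which is what lets n-grams capture some local word order.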

#### We will be using the NLTK library for basic tokenisation

### 1. Word Tokenisation

In [3]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)

At nine o'clock I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.


In [5]:
# Tokenise the above document into words using Python's built-in str.split() method
print(document.split())

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']


#### However, this method splits only on whitespace, which leaves punctuation attached to words, as in 'God.'

In [14]:
# Download punkt and punkt_tab, the pre-trained Punkt sentence tokenizer models provided by the Natural Language Toolkit (NLTK).
# NLTK's sent_tokenize and word_tokenize rely on them, and models are available for several languages.
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/nehaverma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/nehaverma/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [15]:
# Tokenise using the NLTK library
from nltk.tokenize import word_tokenize
print(word_tokenize(document))

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']


#### As we can now see, NLTK does not split only on whitespace: punctuation becomes its own token and "he'll" is split into 'he' and "'ll"
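
To make the idea of splitting on more than whitespace concrete, here is a rough sketch using only Python's `re` module. It is not how `word_tokenize` is implemented, and the pattern `TOKEN_PATTERN` is an assumption made for illustration: it keeps contractions such as "he'll" whole, whereas NLTK splits them.

```python
import re

document = "It looks like religious mania, and he'll soon think that he himself is God."

# Either a word (optionally with internal apostrophes, e.g. "he'll")
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

tokens = TOKEN_PATTERN.findall(document)
print(tokens)
```

Unlike `document.split()`, this separates the comma and the final period from the words they follow.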

### 2. Sentence Tokenisation

In [18]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(document))

["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]


In [20]:
print(sent_tokenize('Hello World! Are you enjoying Data Science? I hope you are.'))

['Hello World!', 'Are you enjoying Data Science?', 'I hope you are.']
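
For comparison, a naive rule-based splitter can be sketched with a single regular expression. The function name `naive_sent_split` is hypothetical. It handles the simple example above but breaks on abbreviations, which is the kind of case a trained model such as Punkt is designed to handle better.

```python
import re

def naive_sent_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_split("Hello World! Are you enjoying Data Science? I hope you are.")
tricky = naive_sent_split("Dr. Seward visited him at nine.")
print(simple)
print(tricky)  # the abbreviation "Dr." is wrongly treated as a sentence end
```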


### 3. Tweet Tokenisation

In [22]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

In [24]:
# Let's try the word tokenizer on the above message

print(word_tokenize(message))

['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '.', '#', 'bingewatching', '#', 'nothingtodo', '😎']


A problem with the word tokeniser is that it fails to handle text emoticons such as ':)' and '<3' and other special constructs such as hashtags. Emojis and emoticons are common these days and people use them all the time.

The word tokeniser breaks the emoticon '<3' into '<' and '3', which is not what we want. Emojis and emoticons carry their own significance in areas like sentiment analysis, where a happy or sad face alone can be a really good predictor of the sentiment. Similarly, each hashtag is broken into two tokens: '#' and the word that follows it. Hashtags are used to search for specific topics or photos in social media apps such as Instagram and Facebook, so there you want to keep the hashtag as is.

Let's use the tweet tokeniser of nltk to tokenise this message.

In [26]:
from nltk import TweetTokenizer

In [27]:
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(message))

['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':)', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<3', '.', '#bingewatching', '#nothingtodo', '😎']


As we can see, it handles all the emojis and the hashtags pretty well.
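
To see why this works, it helps to sketch the general idea: try the special patterns (hashtags, emoticons) before falling back to ordinary words and punctuation. The pattern below is a simplified illustration under that assumption, not `TweetTokenizer`'s actual rule set, and the name `SOCIAL_PATTERN` and the short emoticon list are hypothetical.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
SOCIAL_PATTERN = re.compile(
    r"""
    \#\w+            # hashtags such as #bingewatching
    | <3             # the heart emoticon
    | [:;]-?[()DP]   # a few common emoticons such as :) ;-) :D
    | \w+            # ordinary words and numbers
    | [^\w\s]        # any other single non-space character
    """,
    re.VERBOSE,
)

text = "it was gr8 <3. loved mindhunters:) #bingewatching"
tokens = SOCIAL_PATTERN.findall(text)
print(tokens)
```

Because `<3` and `:)` are matched before the single-character fallback, they survive as whole tokens, and so does the hashtag.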

NLTK also provides a tokeniser that takes a regular expression and returns the tokens that match the pattern.

Let's look at how you can use the regular expression tokeniser.

### 4. Regex Tokenisation

In [31]:
import warnings
warnings.filterwarnings('ignore')

In [32]:
from nltk import regexp_tokenize

In [35]:
# Let us extract all hashtags from the above message using regexp_tokenize

# Raw string, so that \w is passed to the regex engine instead of being read as a string escape
pattern = r'#\w+'

regexp_tokenize(message, pattern)

['#bingewatching', '#nothingtodo']
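
A regular expression can describe either the tokens themselves or the gaps between them. The sketch below shows both views with Python's built-in `re` module on a shortened, emoji-free version of the message (an assumption made to keep the example small). NLTK's `regexp_tokenize` exposes the same choice through its `gaps` parameter.

```python
import re

message = "it was gr8 <3. #bingewatching #nothingtodo"

# The pattern describes the tokens: keep only what matches.
hashtags = re.findall(r"#\w+", message)

# The pattern describes the gaps: split on what matches (runs of whitespace).
chunks = re.split(r"\s+", message)

print(hashtags)
print(chunks)
```

Matching tokens is the better fit when you want to extract one kind of item, such as hashtags, while matching gaps suits cases where everything between separators should be kept.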