### How tokenizing text, sentences, and words works
<img src="https://user-images.githubusercontent.com/32620288/166104650-bca608ed-afc3-4c56-8bf2-eebf0b52b054.png" width="400" height="1">

----

Tokenization is the process of splitting a string of text into a list of tokens. A token is one part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.

#### Key points of the article

* Tokenizing text into sentences
* Tokenizing sentences into words
* Tokenizing sentences using regular expressions


##### 1. Sentence Tokenization: splitting a paragraph into sentences

In [2]:
from nltk.tokenize import sent_tokenize
  
text = "British Foreign Secretary Liz Truss, on Thursday, appealed to the international allies to invoke stringent sanctions on Russia until it completely withdraws its forces from Ukraine. Addressing the G7 leaders in Germany, she said that Russian President Vladimir Putin has been humiliating himself on the world stage and urged her counterparts to invoke further sanctions until Moscow agrees on peace. Truss also called her fellow G7 foreign ministers for financial and technical assistance to rebuild Ukraine."
sent_tokenize(text)

['British Foreign Secretary Liz Truss, on Thursday, appealed to the international allies to invoke stringent sanctions on Russia until it completely withdraws its forces from Ukraine.',
 'Addressing the G7 leaders in Germany, she said that Russian President Vladimir Putin has been humiliating himself on the world stage and urged her counterparts to invoke further sanctions until Moscow agrees on peace.',
 'Truss also called her fellow G7 foreign ministers for financial and technical assistance to rebuild Ukraine.']

How does sent_tokenize work?

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This tokenizer is already trained, so it knows which characters and punctuation mark the beginning and end of a sentence.
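To see why a trained model matters, it can help to compare it with an untrained rule. The sketch below uses only the standard library; `naive_sent_split` is a hypothetical helper, not part of NLTK, and it simply breaks after every `.`, `!` or `?` that is followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith arrived at 5 p.m. on Friday. He left early."
sentences = naive_sent_split(sample)

for sentence in sentences:
    print(repr(sentence))
# 'Dr.'
# 'Smith arrived at 5 p.m.'
# 'on Friday.'
# 'He left early.'
```

The naive rule breaks after the abbreviations "Dr." and "p.m.", which is the kind of mistake a Punkt model is trained to avoid by learning abbreviations and typical sentence starters from text.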

##### 2. PunktSentenceTokenizer: more efficient to use directly when there are large amounts of text

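In [1]:
import nltk.data
  
# Load the pre-trained English PunktSentenceTokenizer from its pickle file
# (requires the 'punkt' resource, e.g. via nltk.download('punkt')).
# The path omits 'PY3': NLTK inserts that directory itself on Python 3.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
  
tokenizer.tokenize(text)

Note: loading 'tokenizers/punkt/PY3/english.pickle' instead fails with the error below. The doubled PY3 in the path shows that NLTK already adds this directory on its own.

OSError: No such file or directory: 'C:\\Users\\divak\\AppData\\Roaming\\nltk_data\\tokenizers\\punkt\\PY3\\PY3\\english.pickle'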

##### 3. Tokenizing sentences in other languages: load the pickle file for that language instead of the English one

In [None]:
import nltk.data
  
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
  
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

##### 4. Word Tokenization: splitting a sentence into words

In [3]:
from nltk.tokenize import word_tokenize
  
text = "Prime Minister Narendra Modi on Friday, May 13 expressed condolences over the sudden demise of the United Arab Emirates President Sheikh Khalifa bin Zayed Al Nahyan. I am deeply saddened to know about the passing away of HH Sheikh Khalifa bin Zayed, Prime Minister wrote in a twitter post. He reminisced that the late UAE president was a great statesman and visionary leader under whom India-UAE relations prospered. PM Modi sent heartfelt condolences to UAE from the people of India, saying that the Indian communities were with the people of UAE in these tough times. May his soul rest in peace,said Prime Minister Modi."
word_tokenize(text)

['Prime',
 'Minister',
 'Narendra',
 'Modi',
 'on',
 'Friday',
 ',',
 'May',
 '13',
 'expressed',
 'condolences',
 'over',
 'the',
 'sudden',
 'demise',
 'of',
 'the',
 'United',
 'Arab',
 'Emirates',
 'President',
 'Sheikh',
 'Khalifa',
 'bin',
 'Zayed',
 'Al',
 'Nahyan',
 '.',
 'I',
 'am',
 'deeply',
 'saddened',
 'to',
 'know',
 'about',
 'the',
 'passing',
 'away',
 'of',
 'HH',
 'Sheikh',
 'Khalifa',
 'bin',
 'Zayed',
 ',',
 'Prime',
 'Minister',
 'wrote',
 'in',
 'a',
 'twitter',
 'post',
 '.',
 'He',
 'reminisced',
 'that',
 'the',
 'late',
 'UAE',
 'president',
 'was',
 'a',
 'great',
 'statesman',
 'and',
 'visionary',
 'leader',
 'under',
 'whom',
 'India-UAE',
 'relations',
 'prospered',
 '.',
 'PM',
 'Modi',
 'sent',
 'heartfelt',
 'condolences',
 'to',
 'UAE',
 'from',
 'the',
 'people',
 'of',
 'India',
 ',',
 'saying',
 'that',
 'the',
 'Indian',
 'communities',
 'were',
 'with',
 'the',
 'people',
 'of',
 'UAE',
 'in',
 'these',
 'tough',
 'times',
 '.',
 'May',
 'his',

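How does word_tokenize work?

word_tokenize() is a wrapper function: it first splits the text into sentences with sent_tokenize() and then calls tokenize() on a Treebank-style word tokenizer (TreebankWordTokenizer or a close variant, depending on the NLTK version) for each sentence.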
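The outputs in sections 4 and 5 differ at sentence boundaries ('Nahyan', '.' versus 'Nahyan.'). A likely explanation is that word_tokenize runs its word tokenizer once per sentence, while the Treebank rules detach a period only at the very end of their input. The sketch below imitates that behaviour with the standard library only; `toy_word_split` and `toy_sentence_split` are hypothetical, heavily simplified stand-ins, not NLTK functions.

```python
import re

def toy_word_split(chunk):
    """Toy Treebank-like rules: detach commas, and detach a period only at the end of the input."""
    chunk = re.sub(r",", " , ", chunk)
    chunk = re.sub(r"\.\s*$", " .", chunk)
    return chunk.split()

def toy_sentence_split(text):
    """Toy sentence splitter: break after a period followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text.strip())

text = "He spoke on Friday, May 13. The crowd listened."

# One pass over the whole paragraph (similar in spirit to section 5)
one_pass = toy_word_split(text)

# Sentence split first, then word split per sentence (similar in spirit to section 4)
two_stage = [tok for sent in toy_sentence_split(text) for tok in toy_word_split(sent)]

print(one_pass)
# ['He', 'spoke', 'on', 'Friday', ',', 'May', '13.', 'The', 'crowd', 'listened', '.']
print(two_stage)
# ['He', 'spoke', 'on', 'Friday', ',', 'May', '13', '.', 'The', 'crowd', 'listened', '.']
```

In the one-pass result the inner period stays attached ('13.'), while the two-stage result separates it, mirroring the difference between the two NLTK outputs.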

##### 5. Using TreebankWordTokenizer

In [5]:
from nltk.tokenize import TreebankWordTokenizer
  
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

['Prime',
 'Minister',
 'Narendra',
 'Modi',
 'on',
 'Friday',
 ',',
 'May',
 '13',
 'expressed',
 'condolences',
 'over',
 'the',
 'sudden',
 'demise',
 'of',
 'the',
 'United',
 'Arab',
 'Emirates',
 'President',
 'Sheikh',
 'Khalifa',
 'bin',
 'Zayed',
 'Al',
 'Nahyan.',
 'I',
 'am',
 'deeply',
 'saddened',
 'to',
 'know',
 'about',
 'the',
 'passing',
 'away',
 'of',
 'HH',
 'Sheikh',
 'Khalifa',
 'bin',
 'Zayed',
 ',',
 'Prime',
 'Minister',
 'wrote',
 'in',
 'a',
 'twitter',
 'post.',
 'He',
 'reminisced',
 'that',
 'the',
 'late',
 'UAE',
 'president',
 'was',
 'a',
 'great',
 'statesman',
 'and',
 'visionary',
 'leader',
 'under',
 'whom',
 'India-UAE',
 'relations',
 'prospered.',
 'PM',
 'Modi',
 'sent',
 'heartfelt',
 'condolences',
 'to',
 'UAE',
 'from',
 'the',
 'people',
 'of',
 'India',
 ',',
 'saying',
 'that',
 'the',
 'Indian',
 'communities',
 'were',
 'with',
 'the',
 'people',
 'of',
 'UAE',
 'in',
 'these',
 'tough',
 'times.',
 'May',
 'his',
 'soul',
 'rest',
 

##### 6. WordPunctTokenizer: separates punctuation from the words

In [8]:
from nltk.tokenize import WordPunctTokenizer
  
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Elon Musk on Friday said that his deal to buy Twitter is 'temporarily on hold' the social network reported false or spam accounts comprised less than 5%")

['Elon',
 'Musk',
 'on',
 'Friday',
 'said',
 'that',
 'his',
 'deal',
 'to',
 'buy',
 'Twitter',
 'is',
 "'",
 'temporarily',
 'on',
 'hold',
 "'",
 'the',
 'social',
 'network',
 'reported',
 'false',
 'or',
 'spam',
 'accounts',
 'comprised',
 'less',
 'than',
 '5',
 '%']
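WordPunctTokenizer is documented as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. Assuming that pattern, its behaviour can be approximated with the standard library alone; `word_punct_split` is a hypothetical name used only for this sketch.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

def word_punct_split(text):
    """Approximate WordPunctTokenizer with re.findall."""
    return re.findall(WORD_PUNCT_PATTERN, text)

tokens = word_punct_split("The deal is 'temporarily on hold'; spam accounts were less than 5%.")
print(tokens)
# ['The', 'deal', 'is', "'", 'temporarily', 'on', 'hold', "';", 'spam',
#  'accounts', 'were', 'less', 'than', '5', '%.']

contraction = word_punct_split("can't")
print(contraction)
# ['can', "'", 't']
```

Two consequences are visible here: adjacent punctuation marks stay together as one token ("';" and '%.'), and contractions are split into three tokens.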

##### 7. Using Regular Expressions

In [9]:
from nltk.tokenize import RegexpTokenizer
  
# Raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")
text = "The Russia-Ukraine war has entered day 79 with Russia escalating its assault in east and southern Ukraine. Meanwhile, the US House has passed a $40bn aid package for Kyiv. On the other hand, Russian FM Sergei Lavrov accused the UN of failing to establish a political solution for Ukraine. Zelenskyy stated that the war will end for Kyiv only after Russian troops returned all occupied territories."
tokenizer.tokenize(text)

['The',
 'Russia',
 'Ukraine',
 'war',
 'has',
 'entered',
 'day',
 '79',
 'with',
 'Russia',
 'escalating',
 'its',
 'assault',
 'in',
 'east',
 'and',
 'southern',
 'Ukraine',
 'Meanwhile',
 'the',
 'US',
 'House',
 'has',
 'passed',
 'a',
 '40bn',
 'aid',
 'package',
 'for',
 'Kyiv',
 'On',
 'the',
 'other',
 'hand',
 'Russian',
 'FM',
 'Sergei',
 'Lavrov',
 'accused',
 'the',
 'UN',
 'of',
 'failing',
 'to',
 'establish',
 'a',
 'political',
 'solution',
 'for',
 'Ukraine',
 'Zelenskyy',
 'stated',
 'that',
 'the',
 'war',
 'will',
 'end',
 'for',
 'Kyiv',
 'only',
 'after',
 'Russian',
 'troops',
 'returned',
 'all',
 'occupied',
 'territories']
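The pattern decides what survives tokenization: in the output above, "$40bn" became '40bn' and "Russia-Ukraine" became two tokens. The sketch below explores this with the standard library, on the assumption that RegexpTokenizer in its default matching mode behaves roughly like `re.findall`, and with `gaps=True` roughly like `re.split` with empty strings dropped. The third pattern is an illustrative choice, not one taken from NLTK.

```python
import re

text = "The Russia-Ukraine war isn't over; the US House passed a $40bn aid package."

# Matching mode: the pattern describes the tokens themselves
word_tokens = re.findall(r"[\w']+", text)

# Gaps mode: the pattern describes the separators between tokens
gap_tokens = [tok for tok in re.split(r"\s+", text) if tok]

# Illustrative pattern that keeps a leading '$' and inner hyphens or apostrophes
custom_tokens = re.findall(r"\$?\w+(?:[-']\w+)*", text)

print(word_tokens)
# ['The', 'Russia', 'Ukraine', 'war', "isn't", 'over', 'the', 'US', 'House',
#  'passed', 'a', '40bn', 'aid', 'package']
print(gap_tokens)
# ['The', 'Russia-Ukraine', 'war', "isn't", 'over;', 'the', 'US', 'House',
#  'passed', 'a', '$40bn', 'aid', 'package.']
print(custom_tokens)
# ['The', 'Russia-Ukraine', 'war', "isn't", 'over', 'the', 'US', 'House',
#  'passed', 'a', '$40bn', 'aid', 'package']
```

Splitting on whitespace keeps symbols but leaves punctuation glued to words ('over;', 'package.'), while a more specific token pattern can keep the currency sign and the hyphenated name without that side effect.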

-----------------------------------------