
## <b>Tokenization</b>

**Tokenization** is the process of breaking down text into smaller units called **tokens**.

These **tokens** can be words or sentences, hence the terms **word tokenization** and **sentence tokenization**.

<b>*We will try to achieve tokenization in the following three ways:*</b>

1. Using Python methods
2. Using regular expressions
3. Using libraries


### <b> 1. Using Python methods</b>

In [9]:
text1= """In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.
Sounds great!
But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.
"""

In [10]:
# Word tokenization: split() with no arguments breaks the text on whitespace
word_tokens = text1.split()
print(word_tokens)

['In', 'Natural', 'Language', 'Processing', 'we', 'want', 'to', 'make', 'computer', 'programs', 'that', 'understand,', 'generate', 'and,', 'more', 'generally', 'speaking,', 'work', 'with', 'human', 'languages.', 'Sounds', 'great!', 'But', 'there’s', 'a', 'challenge', 'that', 'jumps', 'out:', 'we,', 'humans,', 'communicate', 'with', 'words', 'and', 'sentences;', 'meanwhile,', 'computers', 'only', 'understand', 'numbers.']


In [14]:
# Sentence tokenization: split the text at every full stop
sentence_tokens = text1.split('.')
display(sentence_tokens)

['In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages',
 ' \nSounds great! \nBut there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers',
 '\n']
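
Splitting on `'.'` leaves stray whitespace and newline characters in the pieces, produces a trailing `'\n'` entry, and does not break at `'!'`. A minimal sketch of tidying the result with plain string methods is shown below; the shortened text and the names `sample` and `clean_sentences` are illustrative assumptions, not part of the original notebook.

```python
# Illustrative sample, shortened from the text used above
sample = "We want programs that work with human languages. \nSounds great! \nComputers only understand numbers.\n"

# Split on '.', strip surrounding whitespace, and drop pieces that end up empty
clean_sentences = [piece.strip() for piece in sample.split('.') if piece.strip()]
print(clean_sentences)
```

Because `'!'` is not a split point here, "Sounds great!" stays attached to the sentence that follows it, which is one reason to move on to regular expressions or a library.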

### <b> 2. Using regular expressions</b>

In [16]:
# Importing regular expression library by the name of "re"
import re

In [17]:
text2= """In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.
Sounds great!
But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.
"""

In [20]:
# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
word_tokens_re = re.findall(r"\w+", text2)
print(word_tokens_re)

['In', 'Natural', 'Language', 'Processing', 'we', 'want', 'to', 'make', 'computer', 'programs', 'that', 'understand', 'generate', 'and', 'more', 'generally', 'speaking', 'work', 'with', 'human', 'languages', 'Sounds', 'great', 'But', 'there', 's', 'a', 'challenge', 'that', 'jumps', 'out', 'we', 'humans', 'communicate', 'with', 'words', 'and', 'sentences', 'meanwhile', 'computers', 'only', 'understand', 'numbers']
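
The pattern `\w+` splits "there’s" into `'there'` and `'s'` because the apostrophe is not a word character. One possible refinement, offered as a sketch rather than a definitive rule, allows a straight or curly apostrophe between runs of word characters. The pattern and the names `sample` and `contraction_pattern` are assumptions made for illustration.

```python
import re

sample = "But there’s a challenge: computers don't understand words."

# A run of word characters, optionally followed by (apostrophe + word characters) groups,
# so contractions such as "there’s" and "don't" stay in one token
contraction_pattern = re.compile(r"\w+(?:['’]\w+)*")
tokens = contraction_pattern.findall(sample)
print(tokens)
```

Punctuation is still discarded by this pattern; only the handling of contractions changes.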


In [34]:
# Split the text wherever a '.' or '!' occurs (the delimiter itself is discarded)
sentence_tokens_re = re.split(r"[.!]", text2)
sentence_tokens_re

['In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages',
 ' \nSounds great',
 ' \nBut there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers',
 '\n']
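
Splitting on `[.!]` throws away the sentence-ending punctuation and leaves an empty trailing piece. A hedged alternative is to split on the whitespace that follows a terminator, using a lookbehind so the punctuation stays with its sentence. The sample text and variable names below are illustrative assumptions.

```python
import re

sample = "We work with human languages. Sounds great! Is it easy? Computers only understand numbers."

# Split at whitespace that comes right after '.', '!' or '?';
# the lookbehind checks for the punctuation without consuming it
sentences = re.split(r"(?<=[.!?])\s+", sample.strip())
print(sentences)
```

This simple rule still breaks on abbreviations such as "Dr. Smith", which is where trained sentence tokenizers from libraries tend to do better.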

### <b> 3. Using libraries</b>

#### <b> 3.1: NLTK Library</b>

In [42]:
# Installing NLTK library
!pip install --user -U nltk



In [43]:
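# Word Tokenization using NLTK library
# Note: word_tokenize relies on NLTK's Punkt data; if it is missing, download it once with nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead)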

from nltk.tokenize import word_tokenize

In [44]:
text3= """In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.
Sounds great!
But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.
"""

In [50]:
word_tokens_nltk = word_tokenize(text3)
print(word_tokens_nltk)

['In', 'Natural', 'Language', 'Processing', 'we', 'want', 'to', 'make', 'computer', 'programs', 'that', 'understand', ',', 'generate', 'and', ',', 'more', 'generally', 'speaking', ',', 'work', 'with', 'human', 'languages', '.', 'Sounds', 'great', '!', 'But', 'there', '’', 's', 'a', 'challenge', 'that', 'jumps', 'out', ':', 'we', ',', 'humans', ',', 'communicate', 'with', 'words', 'and', 'sentences', ';', 'meanwhile', ',', 'computers', 'only', 'understand', 'numbers', '.']


In [51]:
# Sentence Tokenization using NLTK library

from nltk.tokenize import sent_tokenize

In [54]:
sentence_tokens_nltk = sent_tokenize(text3)
display(sentence_tokens_nltk)

['In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.',
 'Sounds great!',
 'But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.']

#### <b> 3.2: spaCy Library</b>

In [61]:
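# Installing dependencies
!pip install spacy
# Downloading the small English model that is loaded below
!python -m spacy download en_core_web_sm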



##### <b> <i> Word Tokenization using spaCy Library</i></b>

In [72]:
# Importing the Spacy Library into the Python environment

import spacy

In [73]:
# Loading English Language Model using Spacy Library

eng_model = spacy.load('en_core_web_sm')

In [74]:
text4= """In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.
Sounds great!
But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.
"""

In [79]:
word_tokens_spacy = eng_model(text4)

In [89]:
print([token.text for token in word_tokens_spacy])

['In', 'Natural', 'Language', 'Processing', 'we', 'want', 'to', 'make', 'computer', 'programs', 'that', 'understand', ',', 'generate', 'and', ',', 'more', 'generally', 'speaking', ',', 'work', 'with', 'human', 'languages', '.', '\n', 'Sounds', 'great', '!', '\n', 'But', 'there', '’s', 'a', 'challenge', 'that', 'jumps', 'out', ':', 'we', ',', 'humans', ',', 'communicate', 'with', 'words', 'and', 'sentences', ';', 'meanwhile', ',', 'computers', 'only', 'understand', 'numbers', '.', '\n']


##### <b> <i> Sentence Tokenization using spaCy Library</i></b>

In [84]:
import spacy
eng_model = spacy.load('en_core_web_sm')

In [85]:
text4= """In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages.
Sounds great!
But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.
"""

In [90]:
sentence_tokens_spacy = eng_model(text4)

In [93]:
display([sent.text for sent in sentence_tokens_spacy.sents])

['In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages. \n',
 'Sounds great! \n',
 'But there’s a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers.\n']

### <b> Which Tokenization Method Should We Use?</b>

In [96]:
# Import TweetTokenizer from the nltk.tokenize module

from nltk.tokenize import TweetTokenizer

In [102]:
text5 = """
Shivan teaches at Success Analytics. He has also started NLP batch! ☺️🥰💪🔥
"""

# The constructor takes configuration options, not the text; the text is passed to tokenize()
smart_tokenizer = TweetTokenizer()

print(smart_tokenizer.tokenize(text5))

['Shivan', 'teaches', 'at', 'Success', 'Analytics', '.', 'He', 'has', 'also', 'started', 'NLP', 'batch', '!', '☺', '️', '🥰', '💪', '🔥']


In [104]:
token_split_method = text5.split()
print(token_split_method)

['Shivan', 'teaches', 'at', 'Success', 'Analytics.', 'He', 'has', 'also', 'started', 'NLP', 'batch!', '☺️🥰💪🔥']


In [105]:
# \w+ keeps only word characters, so punctuation and emojis are dropped
tokens_regular_expression = re.findall(r"\w+", text5)
print(tokens_regular_expression)

['Shivan', 'teaches', 'at', 'Success', 'Analytics', 'He', 'has', 'also', 'started', 'NLP', 'batch']
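
The `\w+` pattern silently drops both punctuation and emojis. If a regular expression is preferred over a library, one possible sketch is to also match any single character that is neither a word character nor whitespace. The sample text and names here are illustrative assumptions.

```python
import re

sample = "He has also started NLP batch! 💪🔥"

# Either a run of word characters, or one character that is neither word nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)
```

This treats every non-word code point as its own token, so emojis built from several code points (for example with a variation selector) are split into their parts.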


In [106]:
tokens_nltk = word_tokenize(text5)
print(tokens_nltk)

['Shivan', 'teaches', 'at', 'Success', 'Analytics', '.', 'He', 'has', 'also', 'started', 'NLP', 'batch', '!', '☺️🥰💪🔥']


In [107]:
tokens_spacy = eng_model(text5)
print([token.text for token in tokens_spacy])

['\n', 'Shivan', 'teaches', 'at', 'Success', 'Analytics', '.', 'He', 'has', 'also', 'started', 'NLP', 'batch', '!', '☺', '️', '🥰', '💪', '🔥', '\n']


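#### <b> As the snippets above show, TweetTokenizer from the nltk.tokenize module is the best fit for this kind of text among the approaches tried here: it separates punctuation and splits the emojis into individual tokens without adding newline tokens.</b>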
