# 2.5 Tokenizing Text

A fundamental step in NLP is converting text into smaller units, a process known as tokenization. These smaller units are called tokens. Word tokenization is the most common form, where each individual word in the text becomes a token, but tokens can also be sentences, subwords or individual characters, depending on your use case.
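
To make the different granularities concrete, here is a minimal sketch using only built-in Python string operations. The whitespace split is a simplification assumed for illustration; it is not how NLTK tokenizes text.

```python
text = "Her cat's name is Luna"

# Word-level tokens: a naive whitespace split (assumption: no punctuation handling)
word_tokens = text.split()

# Character-level tokens: every character, including spaces, becomes a token
char_tokens = list(text)

print(word_tokens)
print(char_tokens[:7])
```

The same text yields 5 tokens at this naive word level and one token per character at the character level, so the choice of unit directly changes what later steps get to work with.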

Why do we do this? The meaning of a text is easier to understand if we can analyse its individual parts as well as the whole. Tokenization is also an important step before we vectorize our data, which we'll cover in more detail in the next section of this course.

Now let's look at some examples of sentence and word tokenization using the nltk package.

In [7]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Sentence tokenization

In [8]:
sentences = "Her cat's name is Luna. Her dog's name is max"
sent_tokenize_data = sent_tokenize(sentences)
print(sent_tokenize_data)

["Her cat's name is Luna.", "Her dog's name is max"]


### Word tokenization

In [9]:
sentence = "Her cat's name is Luna"
word_tokenize(sentence)

['Her', 'cat', "'s", 'name', 'is', 'Luna']

Notice how "cat's" has been split into two tokens. This may be fine for your task, but it is worth keeping in mind when you are preprocessing text data: you might want to remove punctuation or expand contractions before tokenizing.
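
As a rough sketch of those two preprocessing options, the helpers below use only the Python standard library. The function names and the small `CONTRACTIONS` table are hypothetical, made up for illustration; a real project would need a much fuller list or a dedicated library.

```python
import string

# Hypothetical lookup table for illustration only; far from complete.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}

def expand_contractions(text):
    """Replace each whitespace-separated word found in CONTRACTIONS (case-insensitive lookup)."""
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in text.split())

def remove_punctuation(text):
    """Delete every ASCII punctuation character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Her cat's name is Luna"))  # Her cats name is Luna
print(expand_contractions("I don't know"))           # I do not know
```

Note that stripping punctuation turns "cat's" into "cats", which reads as a plural rather than a possessive, so it is worth checking whether that loss matters for your task.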

In [None]:
sentence_2 = "Her cat's name is Luna and her dog's name is max"
word_tokenize(sentence_2)

['Her',
 'cat',
 "'s",
 'name',
 'is',
 'Luna',
 'and',
 'her',
 'dog',
 "'s",
 'name',
 'is',
 'max']

These tokens illustrate what we learned in the last lesson about the importance of lowercasing. The word 'her' appears twice, once capitalised and once not. The two tokens are therefore different strings and will be treated as different words in most analyses.
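
A small sketch of the effect, counting tokens before and after lowercasing. A plain whitespace split is assumed here as a stand-in for `word_tokenize` so that the example runs without any downloads.

```python
from collections import Counter

sentence_2 = "Her cat's name is Luna and her dog's name is max"

# Naive whitespace split, assumed as a stand-in for word_tokenize
raw_counts = Counter(sentence_2.split())
lower_counts = Counter(sentence_2.lower().split())

print(raw_counts["Her"], raw_counts["her"])  # 1 1
print(lower_counts["her"])                   # 2
```

Without lowercasing, the two spellings are counted separately; after lowercasing they collapse into a single token with a count of 2.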

In [None]:
# Using BertTokenizer from Hugging Face Transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



In [10]:
sentence = "Her cat's name is Luna"
tokens = tokenizer.tokenize(sentence)
print(tokens)

['her', 'cat', "'", 's', 'name', 'is', 'luna']
