# Implementing Tokenisation

- `nltk` offers comprehensive tools and resources for processing natural language text.
- `spaCy` is fast and accurate in processing large volumes of text data.
- `BertTokenizer` is specifically designed for tokenising text according to the BERT model's specification.
- `XLNetTokenizer` is tailored for tokenising text in alignment with the XLNet model's requirements.
- `torchtext` simplifies the process of working with text data and provides functionalities for data preprocessing, tokenisation, vocabulary management, and batching.

In [1]:
!pip install nltk
!pip install transformers
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install torchtext


In [10]:
!pip install -U torchtext



In [1]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer

import torchtext

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/fredjeong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/fredjeong/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


NameError: name '_C' is not defined

## Word-based tokenizer

### nltk

In [4]:
text = "This is a sample sentence for word tokenisation."
tokens = word_tokenize(text)
print(tokens)

['This', 'is', 'a', 'sample', 'sentence', 'for', 'word', 'tokenisation', '.']


Both `nltk` and `spaCy` split contractions such as "don't" and "couldn't" into two tokens (for example, `could` and `n't`).

In [5]:
text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
tokens = word_tokenize(text)
print(tokens)

['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']


In [8]:
text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

token_list = [token.text for token in doc]
print("Tokens:", token_list)

for token in doc:
    print(token.text, token.pos_, token.dep_)

Tokens: ['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']
I PRON nsubj
could AUX aux
n't PART neg
help VERB ROOT
the DET det
dog NOUN dobj
. PUNCT punct
Ca AUX aux
n't PART neg
you PRON nsubj
do VERB ROOT
it PRON dobj
? PUNCT punct
Do AUX aux
n't PART neg
be AUX ROOT
afraid ADJ acomp
if SCONJ mark
you PRON nsubj
are AUX advcl
. PUNCT punct


The problem with word-based tokenisation is that closely related word forms are assigned different IDs and are treated as entirely separate words. For example, `unicorns` and `unicorn` are treated as two separate words, potentially causing the model to miss their semantic relationship.

In [10]:
text = "Unicorns are real. I saw a unicorn yesterday."
token = word_tokenize(text)
print(token)

['Unicorns', 'are', 'real', '.', 'I', 'saw', 'a', 'unicorn', 'yesterday', '.']


Moreover, since a unique ID is assigned to each word, the model's overall vocabulary tends to be large, which in turn increases the number of parameters the model needs.
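A minimal sketch of both issues, using a hypothetical `build_word_ids` helper (not part of any library above) and a simple regular expression as a stand-in for `word_tokenize`. Each distinct surface form receives its own ID, so `Unicorns` and `unicorn` share nothing, and the vocabulary grows with every new word form.

```python
import re

def build_word_ids(text):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    # Stand-in for a word tokeniser: runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    word_ids = {}
    for token in tokens:
        if token not in word_ids:
            word_ids[token] = len(word_ids)
    return word_ids

word_ids = build_word_ids("Unicorns are real. I saw a unicorn yesterday.")
print(word_ids["Unicorns"], word_ids["unicorn"])  # two unrelated IDs
print(len(word_ids))  # one vocabulary entry per distinct word form
```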

## Character-based tokenizer

Character-based tokenisation splits text into individual characters, which keeps the vocabulary small, but it has its limitations. Single characters may not convey the same information as entire words, and the number of tokens per text increases significantly, potentially causing issues with sequence length and a loss of performance.
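As a rough illustration, a character-based tokeniser can be sketched in plain Python. The variable names are made up for this example; the point is only that the vocabulary stays tiny while the token sequence becomes much longer than the word-level one.

```python
text = "Unicorns are real."

# Every character, including spaces and punctuation, becomes one token.
char_tokens = list(text)

# The vocabulary is just the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_to_id[ch] for ch in char_tokens]

word_tokens = text.split()
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")
print("character vocabulary size:", len(char_vocab))
```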

## Subword-based tokenizer

The subword-based tokeniser allows frequently used words to remain unsplit while breaking down infrequent words into meaningful subwords.

It learns subword units from a given text corpus, identifying common prefixes, suffixes, and root words as subword tokens based on their frequency of occurrence. This approach offers the advantage of representing a broader range of words and adapting to the specific language patterns within a text corpus.

### WordPiece

`WordPiece` initialises its vocabulary to include every character present in the training data and progressively learns a specified number of merge rules. At each step it merges the pair of symbols that most increases the likelihood of the training data once the merged symbol is added to the vocabulary. In other words, it weighs what is lost by merging two symbols against what is gained, and merges only when the trade is worthwhile.

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization.")

['ibm', 'taught', 'me', 'token', '##ization', '.']
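The merge criterion can be sketched with toy counts. The score used below, the pair frequency divided by the product of the two symbols' frequencies, is the form commonly given for WordPiece in tutorials. The corpus and the helper name `wordpiece_pair_scores` are made up for illustration, and this is not the code used to train BERT's vocabulary.

```python
from collections import Counter

def wordpiece_pair_scores(word_freqs):
    """Score adjacent symbol pairs as freq(pair) / (freq(first) * freq(second))."""
    symbol_freqs = Counter()
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        # Characters after the first carry the '##' continuation prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for symbol in symbols:
            symbol_freqs[symbol] += freq
        for first, second in zip(symbols, symbols[1:]):
            pair_freqs[(first, second)] += freq
    return {
        pair: freq / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }

# Hypothetical word counts, for illustration only.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores = wordpiece_pair_scores(toy_corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

With these counts the most frequent pair is `('##u', '##g')`, yet the highest-scoring pair is `('##g', '##s')`, because `##s` never occurs without `##g` in front of it. This is the sense in which the criterion weighs what a merge costs against what it gains.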

### Unigram and SentencePiece

`Unigram` starts with a large list of candidate subwords and gradually narrows it down based on how frequently the pieces appear in the text.

`SentencePiece` is a tool that takes text, divides it into smaller, more manageable parts, assigns IDs to these segments, and ensures that it does so consistently.

Unigram and SentencePiece work together: Unigram's subword tokenisation method is implemented within the SentencePiece framework. SentencePiece handles subword segmentation and ID assignment, while Unigram's principles guide the vocabulary reduction process to create a more efficient representation of the text data.

This combination is particularly valuable for NLP tasks in which subword tokenisation can enhance the performance of language models.

In [13]:
tokeniser = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokeniser.tokenize("IBM taught me tokenisation.")

['▁IBM', '▁taught', '▁me', '▁token', 'isation', '.']
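Once a Unigram vocabulary exists, choosing a segmentation amounts to picking the split whose pieces have the highest combined probability. The sketch below shows only that selection step, using dynamic programming over a made-up table of piece probabilities. The helper `unigram_segment` and the numbers in `toy_logprobs` are assumptions for illustration, not values from the XLNet model, and the vocabulary pruning that Unigram performs during training is not shown.

```python
import math

def unigram_segment(word, piece_logprobs):
    """Return the highest-probability split of `word` into known pieces, or None."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)  # best log-probability of word[:end]
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_logprobs and best_score[start] > -math.inf:
                score = best_score[start] + piece_logprobs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None  # the pieces cannot cover the whole word
    # Walk backwards through the recorded split points.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities; '▁' marks the start of a word.
toy_logprobs = {
    "▁token": math.log(0.05),
    "isation": math.log(0.01),
    "▁to": math.log(0.04),
    "ken": math.log(0.005),
    "is": math.log(0.02),
    "ation": math.log(0.01),
}
print(unigram_segment("▁tokenisation", toy_logprobs))
```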

## Tokenisation with PyTorch

A tokeniser breaks text down into individual words or subwords. After tokenisation comes vocabulary mapping: each token is mapped to a unique integer so that the text can be fed into a neural network.

In [14]:
dataset = [
    (1, 'Introduction to NLP'),
    (2, 'Basics of PyTorch'),
    (1, 'NLP Techniques for Text Classification'),
    (3, 'Named Entity Recognition with PyTorch'),
    (3, 'Sentiment Analysis using PyTorch'),
    (3, 'Machine Translation with PyTorch'),
    (1, ' NLP Named Entity,Sentiment Analysis,Machine Translation '),
    (1, ' Mahcine Translation with NLP '),
    (1, ' Named Entity vs Sentiment Analysis  NLP ')
]

We use the `get_tokenizer` function to fetch a tokeniser by name.

In [15]:
from torchtext.data.utils import get_tokenizer

In [17]:
tokenizer = get_tokenizer("basic_english")

In [19]:
tokenizer(dataset[6][1])

['nlp',
 'named',
 'entity',
 ',',
 'sentiment',
 'analysis',
 ',',
 'machine',
 'translation']
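For readers without a working `torchtext` installation, the behaviour seen above can be roughly approximated in plain Python. The helper `simple_basic_english` is hypothetical and covers only lowercasing and a handful of punctuation marks. It is not a reimplementation of the rules behind `basic_english`.

```python
import re

def simple_basic_english(text):
    """Lowercase, put spaces around common punctuation, then split on whitespace."""
    text = text.lower()
    # Surround each listed punctuation mark with spaces so it becomes its own token.
    text = re.sub(r"([,.!?()])", r" \1 ", text)
    return text.split()

print(simple_basic_english(" NLP Named Entity,Sentiment Analysis,Machine Translation "))
```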

## Token indices

We use the `build_vocab_from_iterator` function to represent words as numbers. Its output is typically referred to as 'token indices' or simply 'indices': the numeric representation of the tokens in the vocabulary.

The `build_vocab_from_iterator` function assigns a unique index to each token based on its position in the vocabulary.

`dataset` is an iterable of tuples whose second element is the text, so we use a generator function, `yield_tokens`, to apply the `tokenizer` to each text and yield the tokenised texts one at a time.

Instead of processing the entire dataset and returning all the tokenised texts in one go, the generator function processes and yields each tokenised text individually as it is requested.

In [29]:
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

In [40]:
# Create an iterator using the generator

my_iterator = yield_tokens(dataset)

In [28]:
next(my_iterator)

['introduction', 'to', 'nlp']
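The lazy behaviour of a generator can be made visible with a small sketch. Here `str.split` stands in for the tokeniser, and the `print` call inside the hypothetical `yield_lowercase_tokens` function shows that each text is processed only when `next` asks for it.

```python
def yield_lowercase_tokens(data_iter):
    """Yield one tokenised text per request; str.split stands in for the tokeniser."""
    for _, text in data_iter:
        print(f"tokenising: {text!r}")  # runs only when an item is requested
        yield text.lower().split()

sample = [(1, "Introduction to NLP"), (2, "Basics of PyTorch")]
lazy_tokens = yield_lowercase_tokens(sample)  # nothing has been tokenised yet
first = next(lazy_tokens)   # tokenises only the first text
second = next(lazy_tokens)  # tokenises the second text on demand
print(first, second)
```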

In [72]:
dataset

[(1, 'Introduction to NLP'),
 (2, 'Basics of PyTorch'),
 (1, 'NLP Techniques for Text Classification'),
 (3, 'Named Entity Recognition with PyTorch'),
 (3, 'Sentiment Analysis using PyTorch'),
 (3, 'Machine Translation with PyTorch'),
 (1, ' NLP Named Entity,Sentiment Analysis,Machine Translation '),
 (1, ' Mahcine Translation with NLP '),
 (1, ' Named Entity vs Sentiment Analysis  NLP ')]

## Out-of-vocabulary (OOV)

In [41]:
vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

In [42]:
def get_tokenised_sentence_and_indices(iterator):
    tokenised_sentence = next(iterator)
    token_indices = [vocab[token] for token in tokenised_sentence]
    return tokenised_sentence, token_indices

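# Take the next tokenised sentence from the iterator and look up its indices
tokenised_sentence, token_indices = get_tokenised_sentence_and_indices(my_iterator)
# Advance the iterator once more; the sentence it returns is discarded
next(my_iterator)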

print("Tokenised Sentence:", tokenised_sentence)
print("Token Indices:", token_indices)

Tokenised Sentence: ['introduction', 'to', 'nlp']
Token Indices: [14, 20, 1]
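The idea behind the `<unk>` default index can be sketched without `torchtext`. The helpers `build_simple_vocab` and `lookup` are hypothetical. The ordering used here (special tokens first, then tokens by descending frequency, ties broken alphabetically) is an assumption of this sketch, and its indices will not necessarily match those produced by `build_vocab_from_iterator`.

```python
from collections import Counter

def build_simple_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    for special in specials:
        counts.pop(special, None)  # specials get their own fixed slots
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {token: index for index, token in enumerate(list(specials) + ordered)}

def lookup(stoi, tokens, unk_token="<unk>"):
    """Convert tokens to indices, sending unseen tokens to the <unk> index."""
    return [stoi.get(token, stoi[unk_token]) for token in tokens]

stoi = build_simple_vocab([["introduction", "to", "nlp"], ["basics", "of", "nlp"]])
# 'transformers' never appeared in the training tokens, so it maps to <unk>.
print(lookup(stoi, ["nlp", "to", "transformers"]))
```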


In [43]:
lines = ["IBM taught me tokenization", 
         "Special tokenizers are ready and they will blow your mind", 
         "just saying hi!"]

special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

tokenizer_en = get_tokenizer('spacy', language='en_core_web_sm')

tokens = []
max_length = 0

for line in lines:
    tokenized_line = tokenizer_en(line)
    tokenized_line = ['<bos>'] + tokenized_line + ['<eos>']
    tokens.append(tokenized_line)
    max_length = max(max_length, len(tokenized_line))

for i in range(len(tokens)):
    tokens[i] = tokens[i] + ['<pad>'] * (max_length - len(tokens[i]))

print("Lines after adding special tokens:\n", tokens)

# Build the vocabulary, placing the special symbols first
vocab = build_vocab_from_iterator(tokens, specials=special_symbols)
vocab.set_default_index(vocab["<unk>"])

# Vocabulary and token IDs
print("Vocabulary:", vocab.get_itos())
# get_stoi() returns the whole token-to-ID mapping, not only the entry for 'tokenization'
print("Token IDs for 'tokenization':", vocab.get_stoi())

Lines after adding special tokens:
 [['<bos>', 'IBM', 'taught', 'me', 'tokenization', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'Special', 'tokenizers', 'are', 'ready', 'and', 'they', 'will', 'blow', 'your', 'mind', '<eos>'], ['<bos>', 'just', 'saying', 'hi', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
Vocabulary: ['<unk>', '<pad>', '<bos>', '<eos>', '!', 'IBM', 'Special', 'and', 'are', 'blow', 'hi', 'just', 'me', 'mind', 'ready', 'saying', 'taught', 'they', 'tokenization', 'tokenizers', 'will', 'your']
Token IDs for 'tokenization': {'your': 21, 'will': 20, 'tokenizers': 19, 'taught': 16, 'saying': 15, 'mind': 13, 'blow': 9, 'are': 8, 'Special': 6, 'IBM': 5, '!': 4, 'tokenization': 18, 'ready': 14, '<eos>': 3, 'they': 17, 'hi': 10, 'and': 7, '<bos>': 2, 'me': 12, 'just': 11, '<pad>': 1, '<unk>': 0}


In [44]:
new_line = "I learned about embeddings and attention mechanisms."

# Tokenize the new line
tokenized_new_line = tokenizer_en(new_line)
tokenized_new_line = ['<bos>'] + tokenized_new_line + ['<eos>']

# Pad the new line to match the maximum length of previous lines
new_line_padded = tokenized_new_line + ['<pad>'] * (max_length - len(tokenized_new_line))

# Convert tokens to IDs and handle unknown words
new_line_ids = [vocab[token] if token in vocab else vocab['<unk>'] for token in new_line_padded]

# Example usage
print("Token IDs for new line:", new_line_ids)

Token IDs for new line: [2, 0, 0, 0, 0, 7, 0, 0, 0, 3, 1, 1]


## Exercise: Comparative text tokenisation and performance analysis

Evaluate and compare the tokenisation capabilities of four tokenisers (`nltk`, `spaCy`, `BertTokenizer`, and `XLNetTokenizer`) by analysing the frequency of tokenised words and measuring the processing time for each tool using `datetime`.
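A single call to a tokeniser often finishes in well under a millisecond, so one measurement can be noisy. As an optional alternative to `datetime`, the sketch below averages several runs with `time.perf_counter`, a monotonic high-resolution clock from the standard library. The helper `time_tokenizer` and the `repeats` parameter are assumptions of this sketch, and `str.split` stands in for a real tokeniser.

```python
from time import perf_counter

def time_tokenizer(tokenize, text, repeats=100):
    """Return the tokens and the mean seconds per call over `repeats` runs (repeats >= 1)."""
    start = perf_counter()
    for _ in range(repeats):
        tokens = tokenize(text)
    elapsed = (perf_counter() - start) / repeats
    return tokens, elapsed

tokens, seconds = time_tokenizer(str.split, "Going through the world of tokenization")
print(f"{len(tokens)} tokens, {seconds:.6f} seconds per call")
```

Any callable that takes a string can be passed in place of `str.split`, for example `nltk.word_tokenize`.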

In [45]:
text = "Going through the world of tokenization has been like walking through a huge maze made of words, symbols, and meanings. Each turn shows a bit more about the cool ways computers learn to understand our language. And while I'm still finding my way through it, the journey’s been enlightening and, honestly, a bunch of fun. Eager to see where this learning path takes me next!"

# Counting and displaying tokens and their frequency
from collections import Counter

def show_frequencies(tokens, method_name):
    print(f"{method_name} Token Frequencies: {dict(Counter(tokens))}\n")

### Step 1: Tokenisation

#### Word-based tokenizer: `nltk` and `spaCy`

In [59]:
# nltk tokenisation

# nltk splits contractions such as "don't" and "I'm".
tokens_nltk = nltk.tokenize.word_tokenize(text)
print(tokens_nltk)

['Going', 'through', 'the', 'world', 'of', 'tokenization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'Each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'And', 'while', 'I', "'m", 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', '’', 's', 'been', 'enlightening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'Eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!']


In [60]:
# spaCy tokenisation

# Load the English language pipeline
nlp = spacy.load('en_core_web_sm')

# Run the pipeline on the given text
doc = nlp(text)

# spaCy also splits contractions such as "don't" and "I'm".
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)

['Going', 'through', 'the', 'world', 'of', 'tokenization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'Each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'And', 'while', 'I', "'m", 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', '’s', 'been', 'enlightening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'Eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!']


#### Subword-based tokenizer: `BertTokenizer` and `XLNetTokenizer`

In [61]:
# BertTokenizer (WordPiece)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens_bert = tokenizer.tokenize(text)
print(tokens_bert)

['going', 'through', 'the', 'world', 'of', 'token', '##ization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'and', 'while', 'i', "'", 'm', 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', '’', 's', 'been', 'en', '##light', '##ening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!']


In [62]:
# XLNetTokenizer (Unigram and SentencePiece)

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased') # cased: upper and lower case are kept distinct
tokens_xlnet = tokenizer.tokenize(text)
print(tokens_xlnet)

# A leading ▁ marks the start of a new word

['▁Going', '▁through', '▁the', '▁world', '▁of', '▁token', 'ization', '▁has', '▁been', '▁like', '▁walking', '▁through', '▁a', '▁huge', '▁maze', '▁made', '▁of', '▁words', ',', '▁symbols', ',', '▁and', '▁meaning', 's', '.', '▁Each', '▁turn', '▁shows', '▁a', '▁bit', '▁more', '▁about', '▁the', '▁cool', '▁ways', '▁computers', '▁learn', '▁to', '▁understand', '▁our', '▁language', '.', '▁And', '▁while', '▁I', "'", 'm', '▁still', '▁finding', '▁my', '▁way', '▁through', '▁it', ',', '▁the', '▁journey', '’', 's', '▁been', '▁enlighten', 'ing', '▁and', ',', '▁honestly', ',', '▁a', '▁bunch', '▁of', '▁fun', '.', '▁E', 'ager', '▁to', '▁see', '▁where', '▁this', '▁learning', '▁path', '▁takes', '▁me', '▁next', '!']


### Step 2: Indexing

Each word is mapped to an integer so that a sentence can be represented as numbers. At this stage, special tokens such as `<unk>`, `<pad>`, `<bos>`, and `<eos>` are also added.

In [86]:
import nltk
import spacy
from transformers import BertTokenizer, XLNetTokenizer
from datetime import datetime

# NLTK Tokenization
start_time = datetime.now()
nltk_tokens = nltk.word_tokenize(text)
nltk_time = datetime.now() - start_time

# SpaCy Tokenization
nlp = spacy.load("en_core_web_sm")
start_time = datetime.now()
spacy_tokens = [token.text for token in nlp(text)]
spacy_time = datetime.now() - start_time

# BertTokenizer Tokenization
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
start_time = datetime.now()
bert_tokens = bert_tokenizer.tokenize(text)
bert_time = datetime.now() - start_time

# XLNetTokenizer Tokenization
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
start_time = datetime.now()
xlnet_tokens = xlnet_tokenizer.tokenize(text)
xlnet_time = datetime.now() - start_time
    
# Display tokens, time taken for each tokenizer, and token frequencies
print(f"NLTK Tokens: {nltk_tokens}\nTime Taken: {nltk_time} seconds\n")
show_frequencies(nltk_tokens, "NLTK")

print(f"SpaCy Tokens: {spacy_tokens}\nTime Taken: {spacy_time} seconds\n")
show_frequencies(spacy_tokens, "SpaCy")

print(f"Bert Tokens: {bert_tokens}\nTime Taken: {bert_time} seconds\n")
show_frequencies(bert_tokens, "Bert")

print(f"XLNet Tokens: {xlnet_tokens}\nTime Taken: {xlnet_time} seconds\n")
show_frequencies(xlnet_tokens, "XLNet")

NLTK Tokens: ['Going', 'through', 'the', 'world', 'of', 'tokenization', 'has', 'been', 'like', 'walking', 'through', 'a', 'huge', 'maze', 'made', 'of', 'words', ',', 'symbols', ',', 'and', 'meanings', '.', 'Each', 'turn', 'shows', 'a', 'bit', 'more', 'about', 'the', 'cool', 'ways', 'computers', 'learn', 'to', 'understand', 'our', 'language', '.', 'And', 'while', 'I', "'m", 'still', 'finding', 'my', 'way', 'through', 'it', ',', 'the', 'journey', '’', 's', 'been', 'enlightening', 'and', ',', 'honestly', ',', 'a', 'bunch', 'of', 'fun', '.', 'Eager', 'to', 'see', 'where', 'this', 'learning', 'path', 'takes', 'me', 'next', '!']
Time Taken: 0:00:00.000327 seconds

NLTK Token Frequencies: {'Going': 1, 'through': 3, 'the': 3, 'world': 1, 'of': 3, 'tokenization': 1, 'has': 1, 'been': 2, 'like': 1, 'walking': 1, 'a': 3, 'huge': 1, 'maze': 1, 'made': 1, 'words': 1, ',': 5, 'symbols': 1, 'and': 2, 'meanings': 1, '.': 3, 'Each': 1, 'turn': 1, 'shows': 1, 'bit': 1, 'more': 1, 'about': 1, 'cool': 1, 