<a href="https://colab.research.google.com/github/bashdragon/llm-discussion/blob/main/foundations/learn_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization for Transformers
### Let's look at tokenization, the foundation of NLP.
-------------------------------------------------

# Section 1: Install dependencies and set up the libraries

In [64]:
!pip install transformers nltk sentencepiece



In [65]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import AutoTokenizer

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# Section 2: Understanding Tokenization

Tokenization is the process of splitting text into smaller units, called tokens. There are different types of tokenization:
1. Word Tokenization: Splitting text into words.
2. Subword Tokenization: Breaking words into meaningful subunits.
3. Character Tokenization: Treating each character as a token.
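
To make the three types concrete, here is a minimal pure-Python sketch. The sample sentence, the `TOY_VOCAB` set, and the `toy_subword_split` helper are hypothetical and exist only for illustration. The subword step uses a simple greedy longest-match rule, which is not how the GPT-2 tokenizer used later in this notebook works.

```python
import re

sample = "Tokenizers handle unknown words."

# Word-level: runs of word characters, plus each punctuation mark on its own.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Character-level: every character, including spaces, is a token.
char_tokens = list(sample)

# Subword-level: greedy longest-match against a tiny, made-up vocabulary.
TOY_VOCAB = {"token", "izers", "un", "known", "hand", "le", "word", "s"}

def toy_subword_split(word, vocab=TOY_VOCAB):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches here: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [toy_subword_split(w.lower()) for w in word_tokens]

print("Words:     ", word_tokens)
print("Characters:", char_tokens[:10], "...")
print("Subwords:  ", subword_tokens)
```

Word tokens give short sequences but need a very large vocabulary, character tokens need a tiny vocabulary but give long sequences, and subword tokens sit between the two.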

# Section 3: Tokenization using NLTK

In [66]:
text = "Tokenization is the foundation of NLP. It helps break text into units. Do you agree?"


## Sentence level tokenization:

In [67]:
sentences = sent_tokenize(text)
print("Sentences:", sentences)

Sentences: ['Tokenization is the foundation of NLP.', 'It helps break text into units.', 'Do you agree?']


## Word level tokenization:

In [68]:
words = word_tokenize(text)
print("Words:", words)

Words: ['Tokenization', 'is', 'the', 'foundation', 'of', 'NLP', '.', 'It', 'helps', 'break', 'text', 'into', 'units', '.', 'Do', 'you', 'agree', '?']


# Section 4: Subword-level tokenization:

Transformer models use subword tokenization so that rare or unseen words can be assembled from smaller pieces that are already in the vocabulary.
This avoids out-of-vocabulary (OOV) issues and keeps the vocabulary compact, which makes training more efficient.
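
One way to see where subword pieces come from is a toy version of Byte Pair Encoding (BPE): start from single characters and repeatedly merge the most frequent adjacent pair. The sketch below is a simplified illustration, not the GPT-2 implementation (which works on bytes and pre-splits text first). The function names `learn_bpe_merges`, `apply_merge`, and `segment`, as well as the tiny word list, are assumptions made for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Split a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

merges = learn_bpe_merges(["low", "low", "lower", "lowest", "slow"], num_merges=2)
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

The word "lowly" never appears in the training list, yet it still gets a valid segmentation: the learned piece `low` plus single-character fallbacks. That is the sense in which subword tokenization sidesteps OOV words.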

In [69]:
# The GPT-2 tokenizer is a pretrained tokenizer that uses BPE (Byte Pair Encoding) for subword tokenization
tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded = tokenizer(text)

In [70]:
print(encoded.input_ids)

[30642, 1634, 318, 262, 8489, 286, 399, 19930, 13, 632, 5419, 2270, 2420, 656, 4991, 13, 2141, 345, 4236, 30]


In [71]:
print(tokenizer.convert_ids_to_tokens(encoded.input_ids))

['Token', 'ization', 'Ġis', 'Ġthe', 'Ġfoundation', 'Ġof', 'ĠN', 'LP', '.', 'ĠIt', 'Ġhelps', 'Ġbreak', 'Ġtext', 'Ġinto', 'Ġunits', '.', 'ĠDo', 'Ġyou', 'Ġagree', '?']


# Section 5: Please look up these concepts for further understanding

1.   Why is there a Ġ at the start of many of the tokens in Section 4?
2.   What is OOV?
3.   What is BPE?

