# Tokenization

## Why Do We Need Tokenization?

Computers do not understand raw text.

Before feeding text into:
- Machine Learning models
- Language Models (LLMs)
- NLP pipelines

we must split it into smaller pieces (tokens), each of which can be mapped to a numeric ID.
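As a minimal sketch of the idea, the snippet below splits a sentence on whitespace and assigns each unique word an integer ID. The names `toy_text`, `toy_vocab`, and `toy_ids` are made up for illustration, and real LLM tokenizers use learned subword vocabularies rather than this word-level scheme.

```python
# Toy word-level tokenizer (illustrative only; not how GPT tokenizers work).
toy_text = "the cat sat on the mat"

# Split into tokens: lowercase words separated by whitespace.
toy_tokens = toy_text.lower().split()

# Give each unique token an integer ID in order of first appearance.
toy_vocab = {}
for token in toy_tokens:
    if token not in toy_vocab:
        toy_vocab[token] = len(toy_vocab)

# Encode the text as a list of IDs, then decode it back to text.
toy_ids = [toy_vocab[token] for token in toy_tokens]
id_to_token = {idx: token for token, idx in toy_vocab.items()}
toy_decoded = " ".join(id_to_token[idx] for idx in toy_ids)

print(toy_tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(toy_ids)      # [0, 1, 2, 3, 0, 4]
print(toy_decoded)  # the cat sat on the mat
```

The repeated word "the" receives the same ID both times, which is what lets a model treat it as the same input feature.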

## Load GPT Tokenizer

In [6]:
import tiktoken

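# Load tokenizer for a GPT model
# You can use: "gpt-4o-mini", "gpt-4", "gpt-3.5-turbo"
# Note: gpt-4o models use the o200k_base encoding, while gpt-4 and
# gpt-3.5-turbo use cl100k_base, so the token IDs differ between them.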
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

text = "Playing unbelievably with OpenAI tokenization."

# Encode text into token IDs
token_ids = encoding.encode(text)

print("Original Text:")
print(text)

tokens = [encoding.decode([token]) for token in token_ids]

print("\nSubword Tokens:")
print(tokens)

print("\nToken IDs:")
print(token_ids)

Original Text:
Playing unbelievably with OpenAI tokenization.

Subword Tokens:
['Playing', ' unbelievably', ' with', ' Open', 'AI', ' token', 'ization', '.']

Token IDs:
[41323, 180692, 483, 7788, 17527, 6602, 2860, 13]


## Decode Back to Text

In [3]:
decoded_text = encoding.decode(token_ids)

print("Decoded Text:")
print(decoded_text)

Decoded Text:
Playing unbelievably with OpenAI tokenization.


# ===============================================================

# Stop Word Removal in NLP

Stop words are common words that usually do not add significant meaning.

Examples:
- is
- the
- in
- at
- on
- and

Removing stop words helps:
- Reduce noise
- Reduce dimensionality
- Improve NLP model efficiency


In [6]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...


In [5]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Stop word removal is an important step in Natural Language Processing."

# Tokenize
words = word_tokenize(text)

# Load stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original Tokens:")
print(words)

print("\nFiltered Tokens:")
print(filtered_words)


Original Tokens:
['Stop', 'word', 'removal', 'is', 'an', 'important', 'step', 'in', 'Natural', 'Language', 'Processing', '.']

Filtered Tokens:
['Stop', 'word', 'removal', 'important', 'step', 'Natural', 'Language', 'Processing', '.']


In [57]:
len(stop_words)

198

# ==================================================================

# Stemming vs Lemmatization Demo

In [7]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The boys are running faster than the girls. He studies better than others."

# Tokenize
words = word_tokenize(text)

print("Original Words:")
print(words)

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(word) for word in words]

print("Stemmed Words:")
print(stemmed_words)

lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Lemmatized Words (without POS):")
print(lemmatized_words)

Original Words:
['The', 'boys', 'are', 'running', 'faster', 'than', 'the', 'girls', '.', 'He', 'studies', 'better', 'than', 'others', '.']
Stemmed Words:
['the', 'boy', 'are', 'run', 'faster', 'than', 'the', 'girl', '.', 'he', 'studi', 'better', 'than', 'other', '.']
Lemmatized Words (without POS):
['The', 'boy', 'are', 'running', 'faster', 'than', 'the', 'girl', '.', 'He', 'study', 'better', 'than', 'others', '.']
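
In the output above, the lemmatizer leaves "running", "are", and "better" unchanged because NLTK's `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument is passed. The toy sketch below illustrates the two ideas without NLTK: a stemmer chops suffixes by rule, while a lemmatizer looks up a dictionary form that depends on the part of speech. `TOY_LEMMAS`, `toy_stem`, and `toy_lemmatize` are hypothetical helpers written for this example; they are not the Porter algorithm or WordNet.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("are", "v"): "be",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("boys", "n"): "boy",
    ("better", "a"): "good",
}

def toy_stem(word):
    """Strip a few common suffixes without checking the result is a real word."""
    word = word.lower()
    for suffix in ("ies", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_stem("studies"))            # stud (not a dictionary word)
print(toy_lemmatize("running"))       # running (treated as a noun by default)
print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("better", "a"))   # good
```

With the real lemmatizer, the analogous step would be passing a part of speech, for example `lemmatizer.lemmatize("running", pos="v")`.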


# =============================================================

# Bag of Words

Bag of Words is a text representation technique that:

- Converts text into numerical vectors
- Ignores grammar and word order
- Counts word frequency

Each unique word becomes a feature.

In [9]:
# Sample sentences
documents = [
    "I love NLP",
    "I love Machine Learning",
    "NLP is fun"
]

# Step 1: Build vocabulary
vocab = set()

for doc in documents:
    for word in doc.lower().split():
        vocab.add(word)

vocab = sorted(vocab)

print("Vocabulary:")
print(vocab)

Vocabulary:
['fun', 'i', 'is', 'learning', 'love', 'machine', 'nlp']


In [10]:
# Create vectors
bow_vectors = []

for doc in documents:
    word_count = []
    words = doc.lower().split()
    
    for word in vocab:
        word_count.append(words.count(word))
    
    bow_vectors.append(word_count)

# Display result
print("Bag of Words Vectors:\n")
for doc, vector in zip(documents, bow_vectors):
    print(f"{doc} -> {vector}")


Bag of Words Vectors:

I love NLP -> [0, 1, 0, 0, 1, 0, 1]
I love Machine Learning -> [0, 1, 0, 1, 1, 1, 0]
NLP is fun -> [1, 0, 1, 0, 0, 0, 1]


# ======================================================================

# N-gram Bag of Words

N-grams are contiguous sequences of N words.

Examples:
- Unigram (1-gram): single words
- Bigram (2-gram): two-word pairs
- Trigram (3-gram): three-word sequences
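
A minimal sketch of how these windows can be generated from a token list with plain Python. The helper name `make_ngrams` is an assumption made for this example.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample_tokens = "i love machine learning".split()

print(make_ngrams(sample_tokens, 1))  # ['i', 'love', 'machine', 'learning']
print(make_ngrams(sample_tokens, 2))  # ['i love', 'love machine', 'machine learning']
print(make_ngrams(sample_tokens, 3))  # ['i love machine', 'love machine learning']
```

A sentence with fewer than `n` tokens produces an empty list.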

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = [
    "I love NLP",
    "I love Machine Learning",
    "NLP is fun"
]

## Unigram (Standard BoW)

In [13]:
vectorizer_uni = CountVectorizer(ngram_range=(1,1))  # Unigram
X_uni = vectorizer_uni.fit_transform(documents)

df_uni = pd.DataFrame(X_uni.toarray(), 
                      columns=vectorizer_uni.get_feature_names_out())

print("Unigram Features:")
print(vectorizer_uni.get_feature_names_out())
df_uni

Unigram Features:
['fun' 'is' 'learning' 'love' 'machine' 'nlp']


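|   | fun | is | learning | love | machine | nlp |
|---|-----|----|----------|------|---------|-----|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 |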
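
The word "i" appears in the manual vocabulary earlier but not in these features. `CountVectorizer` lowercases text and, by default, tokenizes with the pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The sketch below applies that same pattern with the standard `re` module to show the effect.

```python
import re

# Default CountVectorizer token pattern: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentences = ["I love NLP", "I love Machine Learning", "NLP is fun"]
tokenized = [re.findall(DEFAULT_TOKEN_PATTERN, s.lower()) for s in sentences]

for tokens_in_sentence in tokenized:
    print(tokens_in_sentence)
# ['love', 'nlp']
# ['love', 'machine', 'learning']
# ['nlp', 'is', 'fun']
```

To keep single-character tokens, a pattern such as `token_pattern=r"(?u)\b\w+\b"` can be passed to `CountVectorizer`.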


## Bigram

In [14]:
vectorizer_bi = CountVectorizer(ngram_range=(2,2))  # Bigram
X_bi = vectorizer_bi.fit_transform(documents)

df_bi = pd.DataFrame(X_bi.toarray(), 
                     columns=vectorizer_bi.get_feature_names_out())

print("Bigram Features:")
print(vectorizer_bi.get_feature_names_out())
df_bi

Bigram Features:
['is fun' 'love machine' 'love nlp' 'machine learning' 'nlp is']


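|   | is fun | love machine | love nlp | machine learning | nlp is |
|---|--------|--------------|----------|------------------|--------|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 |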


## Trigram

In [15]:
vectorizer_tri = CountVectorizer(ngram_range=(3,3))  # Trigram
X_tri = vectorizer_tri.fit_transform(documents)

df_tri = pd.DataFrame(X_tri.toarray(), 
                      columns=vectorizer_tri.get_feature_names_out())

print("Trigram Features:")
print(vectorizer_tri.get_feature_names_out())
df_tri

Trigram Features:
['love machine learning' 'nlp is fun']


|   | love machine learning | nlp is fun |
|---|-----------------------|------------|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |


# ========================================

# TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF is a technique used in Natural Language Processing (NLP) to convert text into numbers based on how important a word is in a document.

TF = Number of times term appears in document / Total words in document

IDF = log(Total number of documents / Number of documents containing term)

TF-IDF = TF × IDF
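
A minimal sketch of the textbook definition in plain Python, with term frequency as count divided by document length and IDF as the natural log of total documents over documents containing the term. The function names are assumptions made for this example, tokens come from simple whitespace splitting, and `idf` assumes the term occurs in at least one document.

```python
import math

docs = [
    "i love data science".split(),
    "i love machine learning".split(),
    "data science is amazing".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total documents / documents containing the term)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "love" appears in 2 of 3 documents, "machine" in only 1,
# so "machine" gets the higher weight in the second document.
print(round(tf_idf("love", docs[1], docs), 4))     # 0.1014
print(round(tf_idf("machine", docs[1], docs), 4))  # 0.2747
```

A term that occurs in every document gets an IDF of log(1) = 0 under this definition, so it contributes nothing to the score.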

In [21]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love data science",
    "I love machine learning",
    "Data science is amazing"
]

# Create TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Convert to DataFrame
df_tfidf = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

# Display nicely
df_tfidf

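|       | amazing | data | is | learning | love | machine | science |
|-------|---------|------|----|----------|------|---------|---------|
| Doc 1 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 |
| Doc 2 | 0.0 | 0.0 | 0.0 | 0.622766 | 0.47363 | 0.622766 | 0.0 |
| Doc 3 | 0.562829 | 0.428046 | 0.562829 | 0.0 | 0.0 | 0.0 | 0.428046 |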
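
These values do not match the plain TF × IDF formula because `TfidfVectorizer` has different defaults: it uses raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reimplements those defaults in plain Python to reproduce the Doc 2 row. The helper names are assumptions for this example, and the token lists omit "i" because the default tokenizer drops single-character tokens.

```python
import math

# Tokens as the default TfidfVectorizer tokenizer would produce them.
tfidf_docs = [
    ["love", "data", "science"],
    ["love", "machine", "learning"],
    ["data", "science", "is", "amazing"],
]

def smoothed_idf(term, docs):
    """ln((1 + N) / (1 + df)) + 1, the smooth_idf=True variant."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_row(doc, docs):
    """Raw count times smoothed IDF, then L2-normalise the row."""
    weights = {t: doc.count(t) * smoothed_idf(t, docs) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

doc2_row = tfidf_row(tfidf_docs[1], tfidf_docs)
for term in sorted(doc2_row):
    print(term, round(doc2_row[term], 6))
```

The printed weights for "learning", "love", and "machine" should agree with the Doc 2 row above (0.622766, 0.47363, 0.622766).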


# ===============================================

# Part-of-Speech (POS) Tagging
POS tagging assigns grammatical labels to words.

Examples:
- NN  → Noun
- VB  → Verb
- JJ  → Adjective
- RB  → Adverb

It helps NLP systems understand grammar and structure.

In [25]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\singazq\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [27]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

text = "The boy is running fast."

words = word_tokenize(text)
tagged_words = pos_tag(words)

print(tagged_words)


[('The', 'DT'), ('boy', 'NN'), ('is', 'VBZ'), ('running', 'VBG'), ('fast', 'RB'), ('.', '.')]


# ====================================================================

# Named Entity Recognition (NER)

NER is an NLP technique that identifies and classifies important entities in text.

Common entity types:

- PERSON → Rahul, Elon Musk
- ORG → Google, TCS
- GPE / LOCATION → India, New York
- DATE → 2024, January
- MONEY → $500
- TIME → 10 AM

In [30]:
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

In [32]:
text = "Rahul works at Google in India and joined in 2022."

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "->", ent.label_)


Google -> ORG
India -> GPE
2022 -> DATE


In [43]:
text1 = "Apple is tasty"
text2 = "Apple released a new iPhone"

doc = nlp(text1)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

doc = nlp(text2)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

Apple -> ORG
Apple -> ORG


# ================================================

# BERT (Bidirectional Encoder Representations from Transformers)
BERT revolutionized NLP by using context from both directions (left and right) when representing each word.

Before BERT:
- Models read text left → right (like earlier GPT versions)
- Contextual understanding was limited

BERT:
- Reads the whole sentence at once
- Understands context bidirectionally

Example: "He went to the bank."

- "bank" could mean a financial institution or a river bank
- BERT decides which based on the surrounding words

In [58]:
from transformers import BertTokenizer, BertModel
import torch

# Load pretrained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "BERT understands context better than older models."

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get embeddings
outputs = model(**inputs)

print("Shape of last hidden state:")
print(outputs.last_hidden_state.shape)

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Shape of last hidden state:
torch.Size([1, 10, 768])


In [51]:
tokens = tokenizer.tokenize(text)
print(tokens)

['bert', 'understands', 'context', 'better', 'than', 'older', 'models', '.']


In [53]:
print(outputs.last_hidden_state)

tensor([[[-0.5566, -0.1176, -0.0310,  ..., -0.3326, -0.0759,  0.5822],
         [ 0.2180, -0.1765,  0.1108,  ..., -0.1463,  0.4253,  0.1268],
         [-0.7525,  0.4435,  0.0050,  ..., -0.7219, -0.4487,  0.3388],
         ...,
         [ 0.3207,  0.0362, -0.8701,  ..., -0.6677, -0.2056,  0.2906],
         [-0.5575, -0.7512, -0.2402,  ...,  0.3028,  0.3006, -0.2441],
         [ 0.6131,  0.3551, -0.4015,  ...,  0.6005, -0.5748, -0.2608]]],
       grad_fn=<NativeLayerNormBackward0>)


In [54]:
print(outputs.last_hidden_state.shape)

torch.Size([1, 10, 768])


In [55]:
print("Input IDs length:", len(inputs["input_ids"][0]))
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

Input IDs length: 10
Tokens: ['[CLS]', 'bert', 'understands', 'context', 'better', 'than', 'older', 'models', '.', '[SEP]']
