## Text Preprocessing for Machine Learning - Tokenization 

### Tokenization - Converting a paragraph (corpus) into sentences (documents)

Tokenization is a preprocessing technique used to transform human-readable text into a format that computers can process. 

It involves breaking down text into smaller units, such as sentences or words, making the data easier for models to analyze and process.

Tokenization can be applied at different levels:

- Sentence Tokenization: Splitting a paragraph or corpus into individual sentences.

- Word Tokenization: Further breaking down each sentence into individual words.


For example, consider the following paragraph (corpus):

"The era of smartphones has revolutionized how we generate ideas. Writing is now assisted by intelligent tools that help complete our sentences."

Sentence Tokenization would split this into:

"The era of smartphones has revolutionized how we generate ideas."

"Writing is now assisted by intelligent tools that help complete our sentences."

Word Tokenization would break the first sentence into:

"The", "era", "of", "smartphones", "has", "revolutionized", "how", "we", "generate", "ideas", "."


> This structured representation of text is crucial for feeding data into a machine learning model, such as a transformer.
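
The example above can be sketched in plain Python before any library is introduced. This is only an illustration under simple assumptions: the regular expressions are naive (they would mis-split abbreviations such as "Dr."), and `split_sentences`, `split_words` and `build_vocab` are hypothetical helper names, not part of NLTK. The last step, mapping each token to an integer ID, is one simple way to show how tokens become numbers a model can consume; real transformer pipelines usually rely on subword tokenizers instead.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive splitter: runs of word characters, or a single punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def build_vocab(tokens):
    """Hypothetical helper: give each distinct token an ID by first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

paragraph = (
    "The era of smartphones has revolutionized how we generate ideas. "
    "Writing is now assisted by intelligent tools that help complete our sentences."
)

sentences = split_sentences(paragraph)                    # corpus -> sentences
tokens = [w for s in sentences for w in split_words(s)]   # sentences -> words
vocab = build_vocab(tokens)                               # word -> integer ID
token_ids = [vocab[t] for t in tokens]                    # words -> numbers

print(sentences)
print(tokens)
print(token_ids)
```

Note that this sketch emits the final period as its own token, so both periods in the paragraph share a single ID.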

## Tokenization Code Implementation

In [None]:
corpus = """
  Hello my name is Ridwan Ibidunni. I am a graduate of mathematics with a keen interest in developing AI applications
    to solve problems in our environments. I love coding and teaching people how to code.
"""  

In [2]:
type(corpus)

str

In [3]:
print(corpus)


    Hello my name is Ridwan Ibidunni. I am a graduate of mathematics with a keen interest in developing AI applications
    to solve problems in our environments. I love coding and teaching people how to code.



### Tokenizing paragraph to sentences

In [5]:
%pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2025.11.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Using cached nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading regex-2025.11.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (791 kB)
Using cached joblib-1.5.2-py3-none-any

In [6]:
import nltk # type: ignore
# Download the 'punkt_tab' data package
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/aljebra/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
# Import the function needed to split text into sentences from nltk
from nltk import sent_tokenize
document = sent_tokenize(corpus)

In [8]:
type(document)

list

In [9]:
for item in document:
    print(item)


    Hello my name is Ridwan Ibidunni.
I am a graduate of mathematics with a keen interest in developing AI applications
    to solve problems in our environments.
I love coding and teaching people how to code.
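
The output above shows that the sentences keep the leading indentation and the line break from the original triple-quoted string. A minimal sketch of one way to normalize that whitespace follows. The `document` list here is typed in by hand to mirror the printed output (an assumption, so the snippet runs without NLTK), and the exact leading whitespace may differ slightly from what `sent_tokenize` returns.

```python
# Sentences mirroring the printed output above, including the
# leading indentation and the embedded line break (hand-copied).
document = [
    "\n    Hello my name is Ridwan Ibidunni.",
    "I am a graduate of mathematics with a keen interest in developing AI applications\n"
    "    to solve problems in our environments.",
    "I love coding and teaching people how to code.",
]

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with one space collapses newlines and indentation.
cleaned = [" ".join(sentence.split()) for sentence in document]

for sentence in cleaned:
    print(sentence)
```

Each sentence now prints on a single line, which is usually easier to work with in later steps.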


### Tokenizing a paragraph into words, and sentences into words

In [10]:
from nltk import word_tokenize

# Convert the whole paragraph into a flat list of words
corpus_tokenized_word = word_tokenize(corpus)

corpus_tokenized_word 

['Hello',
 'my',
 'name',
 'is',
 'Ridwan',
 'Ibidunni',
 '.',
 'I',
 'am',
 'a',
 'graduate',
 'of',
 'mathematics',
 'with',
 'a',
 'keen',
 'interest',
 'in',
 'developing',
 'AI',
 'applications',
 'to',
 'solve',
 'problems',
 'in',
 'our',
 'environments',
 '.',
 'I',
 'love',
 'coding',
 'and',
 'teaching',
 'people',
 'how',
 'to',
 'code',
 '.']

In [11]:
# Convert each sentence into its own list of words
for sentence in document:
    print(word_tokenize(sentence))

['Hello', 'my', 'name', 'is', 'Ridwan', 'Ibidunni', '.']
['I', 'am', 'a', 'graduate', 'of', 'mathematics', 'with', 'a', 'keen', 'interest', 'in', 'developing', 'AI', 'applications', 'to', 'solve', 'problems', 'in', 'our', 'environments', '.']
['I', 'love', 'coding', 'and', 'teaching', 'people', 'how', 'to', 'code', '.']


In [12]:
# wordpunct_tokenize splits every punctuation mark off as a separate token
from nltk import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Hello',
 'my',
 'name',
 'is',
 'Ridwan',
 'Ibidunni',
 '.',
 'I',
 'am',
 'a',
 'graduate',
 'of',
 'mathematics',
 'with',
 'a',
 'keen',
 'interest',
 'in',
 'developing',
 'AI',
 'applications',
 'to',
 'solve',
 'problems',
 'in',
 'our',
 'environments',
 '.',
 'I',
 'love',
 'coding',
 'and',
 'teaching',
 'people',
 'how',
 'to',
 'code',
 '.']
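
On this corpus the result is identical to the `word_tokenize` output, because the text contains no contractions, decimals, or other embedded punctuation. To see what `wordpunct_tokenize` does differently, the sketch below reproduces its behavior with the standard `re` module. To my understanding NLTK implements it with the pattern `\w+|[^\w\s]+`; treat that as an assumption and check the NLTK source if it matters. `wordpunct_like` is a hypothetical name used only for this illustration.

```python
import re

# Assumed to match the pattern behind nltk.wordpunct_tokenize:
# a run of word characters, OR a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_like(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return re.findall(WORDPUNCT_PATTERN, text)

sample = "Don't stop, it's $3.50!"
tokens = wordpunct_like(sample)
print(tokens)
# ['Don', "'", 't', 'stop', ',', 'it', "'", 's', '$', '3', '.', '50', '!']
```

The apostrophes and the decimal point become separate tokens. `word_tokenize`, by contrast, generally keeps pieces such as `n't` and `3.50` together, so the choice between the two depends on whether contractions and numbers should survive as units.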