Tokenization breaks text into smaller units called **tokens** (words, sentences, or subwords), making it easier for computers to process in NLP tasks like classification or sentiment analysis. [tedboy.github](https://tedboy.github.io/nlps/generated/nltk.tokenize.html)

## Why Tokenization Matters
- **Preprocessing:** Splits messy text into chunks for easy handling.
- **Feature Extraction:** Tokens become building blocks for ML models.
- **Text Representation:** Tokens are mapped to numeric IDs, which models can then turn into vectors or embeddings. [geeksforgeeks](https://www.geeksforgeeks.org/nlp/nlp-how-tokenizing-text-sentence-words-works/)
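
As a rough illustration of the text-representation point, the sketch below maps tokens to integer IDs. The sample tokens and the `build_vocab` helper are assumptions made up for this example, not part of any library.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)

# Replace every token with its ID; repeated tokens share the same ID.
ids = [vocab[token] for token in tokens]
print(ids)  # [0, 1, 2, 3, 0, 4]
```

A model never sees the strings themselves, only IDs like these (or vectors looked up from them).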

## Types of Tokenization (With Examples)
### Word Tokenization
Splits on spaces/punctuation into words.
**Input:** "Tokenization is a crucial step in natural language processing (NLP). It involves breaking down text into smaller units, such as words or phrases, for analysis."
**Output tokens:** ['Tokenization', 'is', 'a', 'crucial', 'step', 'in', 'natural', 'language', 'processing', '(', 'NLP', ')', '.', 'It', 'involves', ...] [tedboy.github](https://tedboy.github.io/nlps/generated/nltk.tokenize.html)
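
A minimal sketch of the idea using only Python's standard `re` module. The `simple_word_tokenize` helper is a hypothetical name for this example and is only a rough approximation of what a full word tokenizer does.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

word_tokens = simple_word_tokenize("Tokenization is a crucial step (NLP).")
print(word_tokens)
# ['Tokenization', 'is', 'a', 'crucial', 'step', '(', 'NLP', ')', '.']
```

This pattern treats every punctuation mark as its own token, so a contraction such as "Let's" comes out as three pieces ('Let', "'", 's'). Library tokenizers usually handle such cases with extra rules.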

### Sentence Tokenization
Splits on periods, !, ? into sentences.
**Input:** "Tokenization is the first step. It breaks down text into sentences. Each sentence is then processed separately."
**Output:** ['Tokenization is the first step.', 'It breaks down text into sentences.', 'Each sentence is then processed separately.'] [nltk](https://www.nltk.org/api/nltk.tokenize.html)
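
The rule "split after `.`, `!`, or `?`" can be sketched with a single regular expression. The `simple_sent_tokenize` helper below is a hypothetical name for this example and is a deliberately naive approximation.

```python
import re

def simple_sent_tokenize(text):
    # Split at whitespace that follows ., !, or ?.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sentences = simple_sent_tokenize(
    "Tokenization is the first step. It breaks down text into sentences. "
    "Each sentence is then processed separately."
)
print(sentences)
# ['Tokenization is the first step.', 'It breaks down text into sentences.',
#  'Each sentence is then processed separately.']

# The naive rule also splits after abbreviations:
print(simple_sent_tokenize("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second call shows why a plain punctuation rule is not enough for real text and why dedicated sentence tokenizers exist.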

### Subword Tokenization
Breaks words into smaller parts (great for rare words or complex languages).
**Input:** "Unsupervised learning is a technique used in machine learning."
**Output:** ['Un', 'supervised', 'learning', 'is', 'a', 'technique', 'used', 'in', 'machine', 'learning', '.'] [towardsdatascience](https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17/)
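
One common way to get such pieces is greedy longest-match against a fixed subword vocabulary, roughly in the spirit of WordPiece. The sketch below assumes a tiny hand-written vocabulary and a hypothetical `subword_tokenize` helper. Real tokenizers learn their vocabulary from data and usually mark continuation pieces (for example with a `##` prefix), which is omitted here.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Toy vocabulary, invented for illustration only.
toy_vocab = {"un", "supervised", "learn", "ing", "token", "ization"}

print(subword_tokenize("unsupervised", toy_vocab))  # ['un', 'supervised']
print(subword_tokenize("learning", toy_vocab))      # ['learn', 'ing']
print(subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because unseen words are assembled from known pieces, the model rarely has to fall back to the unknown token.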

## NLTK Tokenizers Covered
NLTK offers these for different needs (import via `from nltk.tokenize import ...` and run `nltk.download('punkt_tab')` first; older NLTK releases use `'punkt'` instead):
- **word_tokenize():** Words (handles punctuation).
- **sent_tokenize():** Sentences (uses the pre-trained Punkt model rather than a plain split on periods, !, ?).
- **TweetTokenizer():** Tweets/hashtags/emojis/mentions.
- **RegexpTokenizer():** Custom patterns with regex.
- **WhitespaceTokenizer():** Splits only on spaces/tabs. [geeksforgeeks](https://www.geeksforgeeks.org/nlp/nlp-how-tokenizing-text-sentence-words-works/)
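
The three class-based tokenizers in this list can be tried without the Punkt download. The snippet below is a small sketch, assuming NLTK is installed; the sample strings are made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer, TweetTokenizer, WhitespaceTokenizer

text = "Hello, world! It's 2024."

# Keep only runs of word characters; punctuation is dropped entirely.
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(text)
print(regexp_tokens)  # ['Hello', 'world', 'It', 's', '2024']

# Split on whitespace only; punctuation stays attached to the words.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)  # ['Hello,', 'world!', "It's", '2024.']

# Tweet-aware splitting is designed to keep hashtags, mentions,
# and emoticons together as single tokens.
tweet_tokens = TweetTokenizer().tokenize("@user loving #NLP :)")
print(tweet_tokens)
```

Comparing the first two outputs shows the trade-off: a regex pattern gives full control over what counts as a token, while whitespace splitting is the simplest option but leaves punctuation glued to words.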

In [1]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Tokenization is key. Let's try it!"
print(word_tokenize(text))  # ['Tokenization', 'is', 'key', '.', 'Let', "'s", 'try', 'it', '!']
print(sent_tokenize(text))  # ['Tokenization is key.', "Let's try it!"]

['Tokenization', 'is', 'key', '.', 'Let', "'s", 'try', 'it', '!']
['Tokenization is key.', "Let's try it!"]


[nltk_data] Downloading package punkt_tab to /home/vinny/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
