# What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens. Tokens can be:

- Words (word-level tokenization)

- Sentences (sentence-level tokenization)

- Subwords or characters (subword/character-level)
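
To make the three granularities concrete, here is a minimal sketch in plain Python. The subword step uses a greedy longest-match rule over `TOY_VOCAB`, a hypothetical hand-picked vocabulary. Real subword tokenizers learn their vocabulary from data, so treat this as an illustration only.

```python
# Toy illustration of the three token granularities (not a production tokenizer).
text = "unhappiness hurts"

# Word level: split on whitespace.
word_tokens = text.split()

# Character level: every non-space character is a token.
char_tokens = [ch for ch in text if not ch.isspace()]

# Subword level: greedy longest-match against a small vocabulary.
# TOY_VOCAB is hypothetical; real vocabularies are learned from a corpus.
TOY_VOCAB = {"un", "happi", "ness", "hurt", "s"}


def greedy_subwords(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece fits: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, TOY_VOCAB)]

print(word_tokens)      # ['unhappiness', 'hurts']
print(char_tokens[:5])  # ['u', 'n', 'h', 'a', 'p']
print(subword_tokens)   # ['un', 'happi', 'ness', 'hurt', 's']
```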



### Common Tokenization Techniques

**Whitespace Tokenization:** Splits text on whitespace. Simple, but punctuation stays attached to the neighboring word (for example, `university!`).

**Regex Tokenization:** Uses regular expressions to define custom token patterns.

**NLTK Tokenization:** Uses NLTK's built-in tokenizers, which combine a pretrained sentence model (Punkt) with rule-based word splitting, for more accurate results than whitespace splitting.

**Subword Tokenization (BPE, WordPiece):** Breaks rare or unknown words into smaller subword units taken from a learned vocabulary.

**Sentence Tokenization:** Splits text into sentences using punctuation and linguistic rules.
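
The differences between the simpler techniques are easier to see side by side. The following sketch uses only Python's standard library. The regex patterns are illustrative choices, not the rules any particular library uses, and the naive sentence splitter is included to show why linguistic rules matter.

```python
import re

text = "Dr. Smith loves NLP. He's teaching it at the university!"

# Whitespace tokenization: punctuation stays glued to the neighboring word.
whitespace_tokens = text.split()

# Regex tokenization: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# Naive sentence tokenization: split after ., ! or ? followed by whitespace.
# This rule wrongly treats the abbreviation "Dr." as the end of a sentence.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens)
# ['Dr.', 'Smith', 'loves', 'NLP.', "He's", 'teaching', 'it', 'at', 'the', 'university!']
print(regex_tokens)
# ['Dr', '.', 'Smith', 'loves', 'NLP', '.', 'He', "'", 's', 'teaching', 'it', 'at', 'the', 'university', '!']
print(naive_sentences)
# ['Dr.', 'Smith loves NLP.', "He's teaching it at the university!"]
```

The spurious break after "Dr." is the kind of error that a trained or rule-based sentence tokenizer is meant to avoid.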


### Why Tokenize

* Makes it easier to assign a part of speech to each token
* Makes it possible to match and count common words
* Makes it possible to remove unwanted tokens, such as punctuation
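
As a rough sketch of the last two points, the snippet below tokenizes a sentence, removes unwanted tokens, and counts the words that remain. `STOP_WORDS` is a hypothetical hand-picked set for this example, and the regex is one simple choice among many.

```python
import re
from collections import Counter

text = "The cat sat on the mat. The dog sat too!"

# Lowercase, then tokenize into words and punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Removing unwanted tokens: drop punctuation and a small stop-word list.
# STOP_WORDS is a hypothetical set chosen only for this example.
STOP_WORDS = {"the", "on", "too"}
kept = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]

# Matching common words: count how often each remaining token occurs.
counts = Counter(kept)

print(kept)                   # ['cat', 'sat', 'mat', 'dog', 'sat']
print(counts.most_common(1))  # [('sat', 2)]
```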

**Example:**

Input: I don't like Sam's shoes.

Tokens: "I", "do", "n't", "like", "Sam", "'s", "shoes", "."
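
A split of this kind, where contractions and possessives become separate tokens, can be approximated with a single regular expression. The pattern below is a hand-written sketch for illustration (the name `CLITIC_PATTERN` is made up here); it is not the rule set that NLTK or any other library actually uses.

```python
import re

# Alternatives are tried left to right at each position:
#   n't          the negation clitic on its own
#   '\w+         other clitics such as 's or 're
#   \w+(?=n't)   a word stem that stops right before n't
#   \w+          any other word
#   [^\w\s]      a single punctuation mark
CLITIC_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

tokens = CLITIC_PATTERN.findall("I don't like Sam's shoes.")
print(tokens)
# ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
```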

### NLTK Tokenizers

`word_tokenize`: Splits text into individual words and punctuation marks. Contractions are split as well, so "He's" becomes "He" and "'s".

`sent_tokenize`: Divides a paragraph or document into individual sentences.

`regexp_tokenize`: Tokenizes text based on a custom regular expression pattern.

`TweetTokenizer`: Designed for tweets. It keeps hashtags, mentions, URLs, and emojis as tokens and copes with informal language.

## Example Using NLTK in Python

In [23]:
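import nltk

# Download the Punkt sentence tokenizer data used by sent_tokenize and word_tokenize.
# Newer NLTK releases look for 'punkt_tab'; older releases use 'punkt'.
# download_dir is machine-specific: adjust it, or omit it to use NLTK's default location.
nltk.download('punkt_tab', download_dir='C:/nltk_data')
nltk.download('punkt', download_dir='C:/nltk_data')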

[nltk_data] Downloading package punkt_tab to C:/nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package punkt to C:/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize, TweetTokenizer

# Sample text
text = "Dr. Smith loves NLP. He's teaching it at the university!"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWord Tokenization:")
print(words)

# Regex Tokenization (keep only runs of word characters; punctuation is dropped)
regex_words = regexp_tokenize(text, pattern=r'\w+')
print("\nRegex Tokenization:")
print(regex_words)

# Tweet Tokenization
tweet_tokenizer = TweetTokenizer()
tweet = "Loving the new features in #Python3.12! 😍🔥 Check it out: https://python.org @ThePSF"
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print("\nTweet Tokenization:")
print(tweet_tokens)


Sentence Tokenization:
['Dr. Smith loves NLP.', "He's teaching it at the university!"]

Word Tokenization:
['Dr.', 'Smith', 'loves', 'NLP', '.', 'He', "'s", 'teaching', 'it', 'at', 'the', 'university', '!']

Regex Tokenization:
['Dr', 'Smith', 'loves', 'NLP', 'He', 's', 'teaching', 'it', 'at', 'the', 'university']

Tweet Tokenization:
['Loving', 'the', 'new', 'features', 'in', '#Python3', '.', '12', '!', '😍', '🔥', 'Check', 'it', 'out', ':', 'https://python.org', '@ThePSF']
