# 📝 NLP Notes: Tokenization

## 🔹 What is Tokenization?
Tokenization is the process of breaking text into smaller pieces called **tokens**. Depending on the task, a token can be a word, a sentence, or a single character. It is usually the first step in a Natural Language Processing (NLP) pipeline.

### 🔸 Types of Tokenization:
- **Word Tokenization**: Splitting text into individual words.
- **Sentence Tokenization**: Splitting text into individual sentences.
- **Character Tokenization**: Splitting text into individual characters.
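
To make the three granularities concrete, here is a minimal sketch using only the standard library. The regular expressions are simplifying assumptions chosen for illustration, not how NLTK or spaCy tokenize internally, and they will mishandle abbreviations such as "Dr." or "e.g.".

```python
import re

text = "Natural Language Processing (NLP) is fun! Let's learn more about it."

# Word level: runs of word characters (keeping internal apostrophes),
# or any single punctuation mark.
word_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

# Sentence level: split on whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character level: every character, spaces included.
char_tokens = list(text)

print(word_tokens)
print(sentence_tokens)
print(char_tokens[:7])
```

Note that this naive word pattern keeps `Let's` as one token, whereas library tokenizers typically split contractions into separate tokens.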

Tokenization helps with:
- Text analysis (for example, counting word frequencies)
- Removing stopwords (very common words such as "the" or "is")
- Converting text into numerical features
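
As a rough illustration of the last two points, the sketch below filters stopwords from an already tokenized sentence and turns what remains into a bag-of-words count vector. The token list and the tiny `STOPWORDS` set are made up for this example; real projects usually rely on a library-provided stopword list.

```python
from collections import Counter

# Hypothetical, already lowercased tokens
tokens = ["nlp", "is", "fun", "and", "nlp", "is", "useful"]

# Hypothetical stopword list, kept tiny for illustration
STOPWORDS = {"is", "and", "the", "a"}

# Step 1: drop stopwords
content_tokens = [t for t in tokens if t not in STOPWORDS]

# Step 2: count the remaining tokens
counts = Counter(content_tokens)

# Step 3: fix a vocabulary order and read the counts off as a vector
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(content_tokens)  # ['nlp', 'fun', 'nlp', 'useful']
print(vocabulary)      # ['fun', 'nlp', 'useful']
print(vector)          # [1, 2, 1]
```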


## ✅ Tokenization using NLTK
We will use the `nltk` library to perform both word and sentence tokenization.

In [1]:
import nltk
# Recent NLTK releases load the 'punkt_tab' resource. With only 'punkt'
# installed, word_tokenize fails with:
#   LookupError: Resource punkt_tab not found.
nltk.download('punkt')      # used by older NLTK releases
nltk.download('punkt_tab')  # used by recent NLTK releases
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing (NLP) is fun! Let's learn more about it."

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)


## ✅ Tokenization using spaCy
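We can also tokenize with the `spaCy` library. Its rule-based tokenizer runs as the first step of a loaded language pipeline, and the resulting `Doc` object exposes both word tokens and sentences.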

In [None]:
import spacy

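# Load the small English pipeline (tokenizer plus tagger, parser, etc.).
# Install it once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")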

text = "Natural Language Processing (NLP) is fun! Let's learn more about it."
doc = nlp(text)

# Word tokens
print("spaCy Word Tokens:")
for token in doc:
    print(token.text)

# Sentence tokens (boundaries are set by the pipeline's dependency parser)
print("\nspaCy Sentence Tokens:")
for sent in doc.sents:
    print(sent.text)