### This file contains snippets to help you understand, practice, and implement tokenization techniques used in Natural Language Processing (NLP).
###### 1. Corpus:
###### A corpus is a large and structured collection of text data used for training and evaluating natural language processing models. It can consist of multiple documents, such as books, articles, or web pages.

###### 2. Documents:
###### A document is an individual piece of text within a corpus. It can be a single sentence, paragraph, or an entire article, depending on the granularity of the corpus.

###### 3. Tokens:
###### Tokens are the smaller units into which text is broken during tokenization. These can be words, subwords, phrases, or even individual characters, depending on the tokenization method used. Tokens are the basic building blocks for further NLP tasks.
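As a minimal sketch of how the three terms relate, the toy corpus below is a Python list of documents, and each document is split into tokens. The sample sentences and variable names are illustrative assumptions, and whitespace splitting is only a stand-in for the tokenizers used in the rest of this file.

```python
# A toy corpus: a list of documents (here, each document is one short string).
corpus = [
    "Tokenization is the first step.",
    "A corpus contains many documents.",
]

# Tokenize every document with naive whitespace splitting.
tokenized_corpus = [document.split() for document in corpus]

for document, tokens in zip(corpus, tokenized_corpus):
    print(f"{len(tokens)} tokens in document: {document!r}")

# The vocabulary is the set of distinct tokens across the whole corpus.
vocabulary = sorted({token for tokens in tokenized_corpus for token in tokens})
print("Vocabulary size:", len(vocabulary))
```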

### Word Tokenization

In [1]:
import nltk
from nltk.tokenize import word_tokenize

In [2]:
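# Download NLTK tokenizer data (only required once per machine).
# Newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'.
nltk.download('punkt')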

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
text = "Tokenization is an important step in NLP!"
tokens = word_tokenize(text)
print("Word Tokens:", tokens)


Word Tokens: ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '!']
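For comparison, a plain whitespace split (standard library only) treats the same sentence differently. This is just a baseline sketch to show what `word_tokenize` adds. The variable name `whitespace_tokens` is an assumption.

```python
text = "Tokenization is an important step in NLP!"

# str.split() only breaks on whitespace, so "!" stays glued to "NLP".
whitespace_tokens = text.split()
print("Whitespace Tokens:", whitespace_tokens)
```

The last token here is `'NLP!'`, whereas `word_tokenize` above returns `'NLP'` and `'!'` separately.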


### Sentence Tokenization (Using NLTK)

In [4]:
from nltk.tokenize import sent_tokenize

In [5]:
text = "Tokenization splits text into meaningful parts. Sentence tokenization works at a higher level."
sentences = sent_tokenize(text)
print("Sentence Tokens:", sentences)

Sentence Tokens: ['Tokenization splits text into meaningful parts.', 'Sentence tokenization works at a higher level.']
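To see why a dedicated sentence splitter is useful, here is a rough regex-based sketch. The helper `naive_sent_split` is a hypothetical name, not part of NLTK, and the rule it uses (split after `.`, `!` or `?` followed by whitespace) is a simplifying assumption.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

simple = "Tokenization splits text into meaningful parts. Sentence tokenization works at a higher level."
print(naive_sent_split(simple))

# The heuristic also splits after abbreviations such as "Dr.".
tricky = "Dr. Smith arrived. He was late."
print(naive_sent_split(tricky))
```

The first call gives the same two sentences as above. The second call returns three pieces because `Dr.` is treated as a sentence end, which is the kind of case a trained splitter such as Punkt is designed to cope with.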


### Subword Tokenization (Using Hugging Face Tokenizers)

In [6]:
from transformers import AutoTokenizer

In [7]:
# Load a tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is crucial for pre-trained models like BERT."
tokens = tokenizer.tokenize(text)
print("Subword Tokens:", tokens)

Subword Tokens: ['token', '##ization', 'is', 'crucial', 'for', 'pre', '-', 'trained', 'models', 'like', 'bert', '.']


In [8]:
# Converting tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

Token IDs: [19204, 3989, 2003, 10232, 2005, 3653, 1011, 4738, 4275, 2066, 14324, 1012]
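The `##` pieces above come from WordPiece-style matching. The sketch below is a simplified greedy longest-match-first splitter over a tiny made-up vocabulary. It is meant to illustrate the idea, not to reproduce the Hugging Face implementation. The function `wordpiece_tokenize`, the vocabulary `toy_vocab`, and its IDs are all assumptions and do not match BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the IDs are made up and do not match BERT's.
toy_vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "##ize": 3, "is": 4}

for word in ["tokenization", "tokenize", "is", "xyz"]:
    pieces = wordpiece_tokenize(word, toy_vocab)
    ids = [toy_vocab[p] for p in pieces]
    print(word, "->", pieces, ids)
```

With this toy vocabulary, `tokenization` becomes `['token', '##ization']` with IDs `[1, 2]`, and `xyz` falls back to `['[UNK]']`.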


### Character Tokenization

In [9]:
text = "Tokenization"
char_tokens = list(text)
print("Character Tokens:", char_tokens)

Character Tokens: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
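Character tokens are usually mapped to integer IDs before a model sees them. The sketch below builds such a mapping for the same string and decodes it back. The names `char_to_id`, `id_to_char`, and `char_ids` are assumptions, and sorting the characters is just one way to get a stable ID order.

```python
text = "Tokenization"
char_tokens = list(text)

# Map each distinct character to an integer ID (sorted for a stable order).
char_to_id = {ch: i for i, ch in enumerate(sorted(set(char_tokens)))}
id_to_char = {i: ch for ch, i in char_to_id.items()}

char_ids = [char_to_id[ch] for ch in char_tokens]
print("Character IDs:", char_ids)

# Decoding reverses the mapping and restores the original string.
decoded = "".join(id_to_char[i] for i in char_ids)
print("Decoded:", decoded)
```

Note that `'T'` and `'t'` get different IDs, so the vocabulary here has 9 entries for a 12-character string.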


### Custom Tokenizer Using Regular Expressions

In [10]:
import re

text = "Hello, world! This is a test #tokenization."
# Match runs of word characters, or any single non-whitespace character
tokens = re.findall(r'\w+|\S', text)
print("Custom Tokens:", tokens)

Custom Tokens: ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '#', 'tokenization', '.']
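The output above splits `#tokenization` into `'#'` and `'tokenization'`. If hashtags should stay in one piece, one option is to put a hashtag alternative first in the pattern. This is a sketch under that assumption. The variable name `hashtag_tokens` is made up for illustration.

```python
import re

text = "Hello, world! This is a test #tokenization."

# Alternatives are tried left to right: hashtag, then word, then any single symbol.
hashtag_tokens = re.findall(r"#\w+|\w+|\S", text)
print("Hashtag-aware Tokens:", hashtag_tokens)
```

This yields the same tokens as before, except that `'#tokenization'` is now a single token.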


### Handling Special Cases (Contractions, Emojis, URLs)

In [11]:
text = "Don't tokenize URLs like https://github.com and handle emojis 🙂."

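# Naive pattern: runs of word characters, or any single non-space character.
# It splits the contraction "Don't" and the URL into pieces.
tokens = re.findall(r"\w+|\S", text)
print("Tokens with Contractions:", tokens)

# Equivalent pattern: [^\w\s] matches any single symbol, including the emoji,
# so both calls return the same tokens.
emoji_tokens = re.findall(r"\w+|[^\w\s]", text)
print("Tokens with Emojis:", emoji_tokens)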

Tokens with Contractions: ['Don', "'", 't', 'tokenize', 'URLs', 'like', 'https', ':', '/', '/', 'github', '.', 'com', 'and', 'handle', 'emojis', '🙂', '.']
Tokens with Emojis: ['Don', "'", 't', 'tokenize', 'URLs', 'like', 'https', ':', '/', '/', 'github', '.', 'com', 'and', 'handle', 'emojis', '🙂', '.']
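Both outputs above break `Don't` and the URL apart. A possible way to keep them whole is to list more specific alternatives before the generic ones. The pattern below is a sketch, not a complete solution: `SPECIAL_PATTERN` and `special_tokens` are assumed names, only straight apostrophes are handled, and a URL followed directly by punctuation would absorb that punctuation.

```python
import re

text = "Don't tokenize URLs like https://github.com and handle emojis 🙂."

SPECIAL_PATTERN = re.compile(
    r"https?://\S+"       # URLs: scheme plus everything up to the next space
    r"|\w+(?:'\w+)*"      # words, including contractions such as "Don't"
    r"|[^\w\s]"           # any other single symbol (punctuation, emoji)
)

special_tokens = SPECIAL_PATTERN.findall(text)
print("Special-case Tokens:", special_tokens)
```

Here `"Don't"` and `'https://github.com'` each come back as one token, while the emoji and the final period remain separate tokens.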


In [14]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

### Tokenization with spaCy

In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Tokenization with spaCy is simple and robust!"

doc = nlp(text)
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)

spaCy Tokens: ['Tokenization', 'with', 'spaCy', 'is', 'simple', 'and', 'robust', '!']


### Multilingual Tokenization

In [16]:
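text = "これは日本語の文です。"  # Japanese text
# Note: nlp is still the English pipeline (en_core_web_sm), which does not
# segment Japanese words, so the sentence comes back as a single token.
doc = nlp(text)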

tokens = [token.text for token in doc]
print("Multilingual Tokens:", tokens)

Multilingual Tokens: ['これは日本語の文です', '。']
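The output above keeps the whole sentence as one token because Japanese has no spaces between words. As a rough, dependency-free illustration, the sketch below groups consecutive characters that share a script (hiragana, kanji, and so on), using the first word of each character's Unicode name as the script label. This is a crude heuristic and an assumption of this example, not real morphological analysis. The helpers `script_of` and `script_run_tokenize` are hypothetical names. Proper Japanese segmentation needs a dedicated tokenizer, such as a Japanese spaCy pipeline or MeCab.

```python
from itertools import groupby
import unicodedata

def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_run_tokenize(text):
    """Group consecutive characters with the same script label; drop whitespace."""
    runs = ("".join(group) for _, group in groupby(text, key=script_of))
    return [run for run in runs if not run.isspace()]

text = "これは日本語の文です。"
print("Script-run Tokens:", script_run_tokenize(text))
```

For this sentence the heuristic returns `['これは', '日本語', 'の', '文', 'です', '。']`. It happens to look reasonable here, but it will mis-segment many real sentences, for example words written entirely in hiragana.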
