# Tokenization In NLP

## Why is Tokenization Important?

**Text Segmentation**: Tokenization divides a continuous stream of text into individual units (tokens) such as words, subwords, or punctuation marks, which are far easier for a program to count, compare, and manipulate than raw strings. In English and many other languages, words are usually separated by spaces, so splitting at spaces is a natural first approximation. Languages like Chinese or Thai do not put spaces between words, so tokenizers for them segment text into meaningful chunks using language-specific rules.
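
As a rough illustration of why splitting at spaces is only a starting point, the sketch below compares `str.split()` with a simple regular expression. The pattern `\w+|[^\w\s]` is an assumption chosen for this example, not the rule set used by any particular library.

```python
import re

text = "My Name is Bond, James Bond. I love panipuri"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Illustrative pattern: a run of word characters, or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace version keeps `Bond,` and `Bond.` as single tokens, while the regex version separates the comma and the full stop into tokens of their own.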


**Vocabulary Building**: Tokenization is a crucial step in building the vocabulary of a language model. Each unique token in a corpus contributes to the vocabulary. A larger vocabulary allows a model to represent a wider range of words and concepts.
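
A minimal sketch of vocabulary building, assuming a tiny hypothetical corpus and plain whitespace tokenization. Real pipelines typically add frequency cut-offs and special tokens, which are left out here.

```python
from collections import Counter

# Hypothetical two-sentence corpus, already lowercased.
corpus = [
    "i love panipuri",
    "i love tokenization",
]

# Count how often each whitespace-separated token occurs across the corpus.
token_counts = Counter(token for sentence in corpus for token in sentence.split())

# The vocabulary is the set of unique tokens; sorting makes the order reproducible.
vocabulary = sorted(token_counts)

print(vocabulary)
print(token_counts)
```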


**Text Cleaning**: Tokenization can help in cleaning text by separating punctuation, special characters, and other noise from the main text. This simplifies the subsequent analysis and can lead to more accurate results in tasks like sentiment analysis or text classification.
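
One way this can look in practice: once punctuation is split into separate tokens, it can be filtered out with a simple check. The helper names `tokenize` and `drop_punctuation` are hypothetical, and the filter only knows about ASCII punctuation.

```python
import re
import string


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def drop_punctuation(tokens):
    """Keep only tokens that are not made up entirely of ASCII punctuation."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


raw = "Wow!!! This movie was great, really great :)"
tokens = tokenize(raw)
clean_tokens = drop_punctuation(tokens)

print(tokens)
print(clean_tokens)
```

Whether punctuation counts as noise depends on the task: for sentiment analysis, tokens such as `!` or an emoticon may carry signal and be worth keeping.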


**Feature Extraction**: In NLP, text data is typically converted into numerical vectors for machine learning models to process. After tokenization, each unique token in the vocabulary is mapped to a unique identifier (e.g., an integer), which makes it possible to convert text into numerical feature vectors. Each token becomes a feature, and its frequency or presence can be used as input for machine learning models.
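
The sketch below shows one simple form of this idea, a bag-of-words count vector. The functions `build_token_ids` and `bag_of_words` are hypothetical helpers written for this example, and the documents are assumed to be tokenized already.

```python
def build_token_ids(tokenized_docs):
    """Assign each unique token an integer id in order of first appearance."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def bag_of_words(tokens, token_to_id):
    """Return a count vector with one position per vocabulary entry."""
    vector = [0] * len(token_to_id)
    for token in tokens:
        if token in token_to_id:  # tokens outside the vocabulary are ignored
            vector[token_to_id[token]] += 1
    return vector


docs = [["i", "love", "panipuri"], ["i", "love", "love", "nlp"]]
token_to_id = build_token_ids(docs)
vectors = [bag_of_words(doc_tokens, token_to_id) for doc_tokens in docs]

print(token_to_id)
print(vectors)
```

Each position in a vector corresponds to one token id, so the second document's repeated `love` shows up as a count of 2 at that token's position.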


**Text Analysis**: Tokenization is the foundation for various NLP tasks, including:

- **Text Classification**: Assigning a category or label to a text document based on its tokens.
- **Named Entity Recognition (NER)**: Identifying and tagging entities (e.g., names of people, places, organizations) in a text.
- **Sentiment Analysis**: Analyzing the sentiment (positive, negative, neutral) expressed in a text.
- **Information Retrieval**: Finding relevant documents or passages in a large corpus based on token matches.
- **Machine Translation**: Translating a text from one language to another, often at the token level.


**Normalization**: Tokenization pipelines usually include normalization steps, such as converting all characters to lowercase (or uppercase) and handling accents, diacritics, or other spelling variations, so that different surface forms of the same word are treated as the same token.
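
A minimal sketch of such a normalization step using only the standard library. The `normalize` helper is hypothetical, and stripping accents is an assumption that suits English-centric tasks; in some languages it would merge words that should stay distinct.

```python
import unicodedata


def normalize(token):
    """Lowercase a token and strip combining accent marks."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


tokens = ["Café", "CAFE", "naïve", "Naive"]
normalized = [normalize(t) for t in tokens]

print(normalized)
```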

In [1]:
# !pip install spacy

In [2]:
import spacy 
print(spacy.__version__)

3.7.4


In [3]:
help(spacy)

Help on package spacy:

NAME
    spacy

PACKAGE CONTENTS
    __main__
    about
    attrs
    cli (package)
    compat
    displacy (package)
    errors
    git_info
    glossary
    kb (package)
    lang (package)
    language
    lexeme
    lookups
    matcher (package)
    ml (package)
    morphology
    parts_of_speech
    pipe_analysis
    pipeline (package)
    schemas
    scorer
    strings
    symbols
    tests (package)
    tokenizer
    tokens (package)
    training (package)
    ty
    util
    vectors
    vocab

FUNCTIONS
    blank(name: str, *, vocab: Union[spacy.vocab.Vocab, bool] = True, config: Union[Dict[str, Any], confection.Config] = {}, meta: Dict[str, Any] = {}) -> spacy.language.Language
        Create a blank nlp object for a given language code.
        
        name (str): The language code, e.g. "en".
        vocab (Vocab): A Vocab object. If True, a vocab is created.
        config (Dict[str, Any] / Config): Optional config overrides.
        meta (Dict[str, 

In [4]:
# !pip install --upgrade click spacy

!python -m spacy download en_core_web_sm 
# The model must be downloaded again in every new session if the environment is reset (it is deleted when the runtime expires).


Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)


In [5]:
text = "My Name is Bond, James Bond. I love panipuri"

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

print(len(doc))  # Number of tokens in the doc.

for token in doc:
    print(token.text)
    
print(type(doc))

11
My
Name
is
Bond
,
James
Bond
.
I
love
panipuri
<class 'spacy.tokens.doc.Doc'>


In [6]:
for sentence in doc.sents:  # Iterate over the sentences detected in the doc.
    print(sentence)

My Name is Bond, James Bond.
I love panipuri


# Stemming & Lemmatization

In [7]:
import nltk
from nltk.stem import PorterStemmer

In [9]:
stemmer = PorterStemmer()

In [12]:
words = ["eating", "ability", "ate", "rafting", "agility", "meeting"]
for word in words:
    print(word, "|", stemmer.stem(word))

# The stemmer only strips suffixes by rule: "ability" becomes the non-word "abil",
# and the irregular form "ate" is left unchanged.

eating | eat
ability | abil
ate | ate
rafting | raft
agility | agil
meeting | meet


In [15]:
new_doc = nlp("eating eats ability ate rafting agility meeting better")
for token in new_doc:
    # token.lemma_ is the lemma as a string; token.lemma is its integer hash ID.
    print(token, "|", token.lemma_, "|", token.lemma)

eating | eat | 9837207709914848172
eats | eat | 9837207709914848172
ability | ability | 11565809527369121409
ate | eat | 9837207709914848172
rafting | raft | 7154368781129989833
agility | agility | 4291486835428689731
meeting | meeting | 14798207169164081740
better | well | 4525988469032889948


### Stemming: Clips suffixes such as -ing from a word using fixed rules, without checking that the result is a real word.

    Ex: eating -> eat, ability -> abil

### Lemmatization: Uses knowledge of the language (vocabulary and grammar) to reduce a word to its dictionary (root) form.

    Ex: better -> good/well, ate -> eat
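
To make the contrast concrete, here is a toy sketch in plain Python. `toy_stem` is a deliberately crude suffix stripper, not the Porter algorithm, and `LEMMA_TABLE` is a hypothetical lookup table standing in for the vocabulary and rules a real lemmatizer such as spaCy's relies on.

```python
def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a real word."""
    for suffix in ("ing", "ity", "s"):
        # Require a stem of at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table; a real lemmatizer covers the whole language.
LEMMA_TABLE = {"ate": "eat", "eating": "eat", "eats": "eat", "better": "good"}


def toy_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)


for word in ["eating", "ability", "ate", "better"]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based function cannot relate `ate` to `eat` because no suffix connects them, while the lookup handles it directly. That is the trade-off in miniature: stemming is cheap and language-agnostic, lemmatization needs linguistic resources but returns real words.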