# Tokenization in NLP

## Why Is Tokenization Important?

**Text Segmentation**: Tokenization divides a continuous stream of text into individual units (tokens), making it easier for programs to analyze and manipulate. In English and many other languages, words are usually separated by spaces, so splitting on whitespace is a natural starting point. Languages such as Chinese or Thai do not put spaces between words, so their tokenizers segment text into meaningful chunks using language-specific rules, dictionaries, or statistical models.
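To make the segmentation idea concrete, here is a minimal sketch using only the Python standard library. The function names `whitespace_tokenize` and `regex_tokenize` are illustrative, not part of any library, and the regular expression is one simple choice among many.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation or symbol character.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "My name is Bond, James Bond."
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```

The whitespace version keeps `Bond,` and `Bond.` as two different tokens, while the regex version yields `Bond` twice plus separate punctuation tokens. Neither approach segments unspaced scripts such as Chinese or Thai; those need a dedicated segmenter.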


**Vocabulary Building**: Tokenization is a crucial step in building the vocabulary of a language model. Each unique token in a corpus becomes a vocabulary entry. A larger vocabulary lets a model represent a wider range of words and concepts, at the cost of more memory and sparser statistics for rare tokens.
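As a rough illustration of vocabulary building, the sketch below counts tokens in a tiny made-up corpus with `collections.Counter`. The corpus and the whitespace tokenization are assumptions chosen for brevity.

```python
from collections import Counter

# A tiny illustrative corpus (hypothetical data).
corpus = [
    "i love panipuri",
    "i love tokenization",
]

# Count every token across the corpus (whitespace tokenization for brevity).
token_counts = Counter(
    token for sentence in corpus for token in sentence.split()
)

# The vocabulary is the set of unique tokens, sorted for a stable order.
vocabulary = sorted(token_counts)

print(vocabulary)
print(token_counts.most_common(2))
```

Real pipelines often cap the vocabulary at the most frequent tokens and map everything else to a shared "unknown" token.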


**Text Cleaning**: Tokenization can help in cleaning text by separating punctuation, special characters, and other noise from the main text. This simplifies the subsequent analysis and can lead to more accurate results in tasks like sentiment analysis or text classification.
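One simple way to apply this kind of cleaning is to tokenize first and then drop the punctuation tokens. The helper name `clean_tokens` is hypothetical, and the filter only covers ASCII punctuation from `string.punctuation`.

```python
import re
import string

# ASCII punctuation only; non-ASCII symbols would pass through this filter.
PUNCTUATION = set(string.punctuation)


def clean_tokens(text):
    """Tokenize text, then drop tokens that are single punctuation marks."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token for token in tokens if token not in PUNCTUATION]


print(clean_tokens("Wow!!! This movie is great :)"))
```

Whether punctuation counts as noise depends on the task: for sentiment analysis, repeated exclamation marks or emoticons can carry signal, so this filter is a design choice rather than a rule.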


**Feature Extraction**: In NLP, text is typically converted into numerical vectors that machine learning models can process. Once text is tokenized, each unique token can be mapped to an identifier (e.g., an integer), which makes that conversion possible. In a bag-of-words representation, for example, each token is a feature whose frequency or presence is the model input.
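The mapping from tokens to integer ids and then to a count vector can be sketched as follows. The function names `build_token_ids` and `bag_of_words` and the sample documents are assumptions made for illustration.

```python
def build_token_ids(tokenized_docs):
    """Assign each unique token the next free integer id, in first-seen order."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def bag_of_words(tokens, token_to_id):
    """Return a count vector with one position per vocabulary entry."""
    vector = [0] * len(token_to_id)
    for token in tokens:
        if token in token_to_id:  # tokens outside the vocabulary are ignored
            vector[token_to_id[token]] += 1
    return vector


docs = [
    ["i", "love", "panipuri"],
    ["i", "love", "nlp", "i", "do"],
]
token_to_id = build_token_ids(docs)
print(token_to_id)
print(bag_of_words(docs[1], token_to_id))
```

Each position in the vector corresponds to one token id, so two documents can be compared directly by comparing their vectors.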


**Text Analysis**: Tokenization is the foundation for various NLP tasks, including:


**Text Classification**: Assigning a category or label to a text document based on its tokens.


**Named Entity Recognition (NER)**: Identifying and tagging entities (e.g., names of people, places, organizations) in a text.

**Sentiment Analysis**: Analyzing the sentiment (positive, negative, neutral) expressed in a text.


**Information Retrieval**: Finding relevant documents or passages in a large corpus based on token matches.

**Machine Translation**: Translating a text from one language to another, often at the token level.
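As one illustration of the tasks listed above, retrieval by token match can be sketched with an inverted index, which maps each token to the documents containing it. The helper names and sample documents are hypothetical, and the lowercase whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query token."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


documents = [
    "James Bond loves panipuri",
    "Tokenization splits text into tokens",
    "Bond orders a drink",
]
index = build_inverted_index(documents)
print(search(index, "bond"))
print(search(index, "bond panipuri"))
```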


**Normalization**: Tokenization is usually paired with normalization: converting all characters to lowercase (or uppercase) and handling accents, diacritics, or other spelling variations so that equivalent tokens are processed consistently.
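A minimal normalization sketch using the standard library's `unicodedata` module is shown below. The function name `normalize_token` is illustrative, and stripping diacritics is an assumption that suits some tasks and languages but not others.

```python
import unicodedata


def normalize_token(token):
    """Case-fold a token and strip combining accent marks."""
    # NFKD decomposition separates base characters from combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


for word in ["Café", "NAÏVE", "Bond"]:
    print(word, "->", normalize_token(word))
```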

In [1]:
!pip install spacy



In [5]:
import spacy 
print(spacy.__version__)

3.6.1


In [6]:
help(spacy)

Help on package spacy:

NAME
    spacy

PACKAGE CONTENTS
    __main__
    about
    attrs
    cli (package)
    compat
    displacy (package)
    errors
    git_info
    glossary
    kb (package)
    lang (package)
    language
    lexeme
    lookups
    matcher (package)
    ml (package)
    morphology
    parts_of_speech
    pipe_analysis
    pipeline (package)
    schemas
    scorer
    strings
    symbols
    tests (package)
    tokenizer
    tokens (package)
    training (package)
    ty
    util
    vectors
    vocab

FUNCTIONS
    blank(name: str, *, vocab: Union[spacy.vocab.Vocab, bool] = True, config: Union[Dict[str, Any], confection.Config] = {}, meta: Dict[str, Any] = {}) -> spacy.language.Language
        Create a blank nlp object for a given language code.
        
        name (str): The language code, e.g. "en".
        vocab (Vocab): A Vocab object. If True, a vocab is created.
        config (Dict[str, Any] / Config): Optional config overrides.
        meta (Dict[str, 

In [7]:
!pip install --upgrade click spacy

!python -m spacy download en_core_web_sm

Collecting spacy
  Downloading spacy-3.7.2-cp38-cp38-win_amd64.whl (12.5 MB)
Collecting weasel<0.4.0,>=0.1.0
  Downloading weasel-0.3.4-py3-none-any.whl (50 kB)
Collecting cloudpathlib<0.17.0,>=0.7.0
  Downloading cloudpathlib-0.16.0-py3-none-any.whl (45 kB)
Installing collected packages: cloudpathlib, weasel, spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.6.1
    Uninstalling spacy-3.6.1:
      Successfully uninstalled spacy-3.6.1


ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\~pacy\\attrs.cp38-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.6.0
    Uninstalling en-core-web-sm-3.6.0:
      Successfully uninstalled en-core-web-sm-3.6.0
Successfully installed en-core-web-sm-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
text = "My Name is Bond, James Bond. I love panipuri"

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

print(len(doc))  # Number of tokens in doc.

for token in doc:
    print(token.text)
    
print(type(doc))

RegistryError: [E892] Unknown function registry: 'vectors'.

Available names: architectures, augmenters, batchers, callbacks, cli, datasets, displacy_colors, factories, initializers, languages, layers, lemmatizers, loggers, lookups, losses, misc, models, ops, optimizers, readers, schedules, scorers, tokenizers
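The `RegistryError` above is consistent with a version mismatch. The earlier install log shows the spaCy upgrade failing with an access error, while the `en_core_web_sm` model was upgraded to 3.7.1, so a 3.7 model was probably being loaded by an older spaCy. The sketch below compares the installed versions with `importlib.metadata`. The helper names are hypothetical, and the rule that the major and minor versions should match is a simplifying assumption. spaCy's own `python -m spacy validate` command performs a more thorough check.

```python
from importlib import metadata


def minor_version(version):
    """Return the (major, minor) part of a version string like '3.7.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(spacy_version, model_version):
    """Assumption: a model should share its major.minor version with spaCy."""
    return minor_version(spacy_version) == minor_version(model_version)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


spacy_version = installed_version("spacy")
model_version = installed_version("en-core-web-sm")
if spacy_version and model_version:
    print("spaCy:", spacy_version, "model:", model_version)
    print("compatible:", versions_compatible(spacy_version, model_version))
else:
    print("spaCy or en_core_web_sm is not installed in this environment")
```

If the versions disagree, reinstalling spaCy with sufficient permissions (the pip error suggests the `--user` option) and restarting the kernel before calling `spacy.load` again is a reasonable next step.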

In [None]:
for sentence in doc.sents:  # Iterate over the sentences detected in doc.
    print(sentence)

# Stemming