# **Tokenization**

We will start with the first step of data pre-processing, i.e. tokenization, and then implement several methods in Python to tokenize text data.

## **Tokenize Words Using NLTK**

Let’s start with word tokenization using the NLTK library. `word_tokenize` breaks the given string into a list of tokens, splitting on whitespace and separating punctuation marks into tokens of their own.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk spacy gensim --user -q --no-warn-script-location


In [None]:
!python -m spacy download en_core_web_sm --user -q --no-warn-script-location

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
#download packages
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")  # newer NLTK releases load the Punkt data from this package

In [None]:
#Tokenize words
from nltk.tokenize import word_tokenize 
text = "Machine learning is a method of data analysis that automates analytical model building"
word_tokenize(text)

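Next, we split the text into sentences instead of words. `sent_tokenize` uses the pre-trained Punkt model to detect sentence boundaries, so it does not simply cut at every full stop (.).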

In [None]:
#Tokenize Sentence
from nltk.tokenize import sent_tokenize 
text = "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
sent_tokenize(text) 

In [None]:
#Tokenize sentences in another language (Spanish)
import nltk.data
# Load the pre-trained Punkt sentence tokenizer for Spanish (needs the "punkt" download above)
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Me llamo Ankit.'
spanish_tokenizer.tokenize(text)

## **Regular Expression**

A regular expression (regex) is a pattern used to match or find substrings in text. We will use NLTK's `RegexpTokenizer` to tokenize the words of a paragraph and Python's `re` library to split it into sentences.

In [None]:
from nltk.tokenize import RegexpTokenizer 
# Raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Machine learning is a method of data analysis that automates analytical model building"
tokenizer.tokenize(text)

In [None]:
#Split Sentences
import re
text = """Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
sentences = re.compile('[.!?] ').split(text)
sentences
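
For comparison, the same `[\w']+` pattern can be applied with the standard library alone, and a lookbehind keeps the closing punctuation attached to each sentence. This is only a minimal sketch and not part of the original notebook; the helper names `regex_words` and `regex_sentences` are made up for illustration.

```python
import re

def regex_words(text):
    # Runs of word characters and apostrophes, the same pattern given to RegexpTokenizer
    return re.findall(r"[\w']+", text)

def regex_sentences(text):
    # Split on whitespace that follows '.', '!' or '?', keeping the punctuation mark
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Machine learning automates model building. It's a branch of AI!"
print(regex_words(sample))
print(regex_sentences(sample))
```

Unlike splitting on `'[.!?] '`, the lookbehind does not consume the punctuation, so every sentence keeps its terminator.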

## **Split()**

The built-in `split()` string method breaks a string into a list of substrings at the specified separator. When no separator is given, it splits on runs of whitespace.

In [None]:
text = """Machine learning is a method of data analysis that automates analytical model building"""
# Splits at space 
text.split()
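
The difference between calling `split()` with and without a separator can be seen in a small sketch (the sample string is invented for illustration). Note that punctuation stays attached to the neighbouring word in both cases.

```python
sample = "Machine  learning, in short."   # two spaces after "Machine"

# No argument: runs of whitespace count as a single separator
default_tokens = sample.split()

# Explicit ' ': every single space is a separator, so an empty string appears
space_tokens = sample.split(" ")

print(default_tokens)
print(space_tokens)
```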

In [None]:
#Split Sentence
text = """Machine learning is a method of data analysis that automates analytical model building.It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."""
# Splits at each full stop
text.split('.')
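
Splitting on `'.'` leaves an empty string after the final full stop and does not trim surrounding whitespace. One possible way to tidy the result is sketched below; the function name `split_sentences_naive` is hypothetical.

```python
def split_sentences_naive(text, sep="."):
    # Split on the separator, strip whitespace, and drop empty pieces
    return [part.strip() for part in text.split(sep) if part.strip()]

text = "Machine learning automates model building.It is a branch of artificial intelligence."
raw_parts = text.split(".")               # the last element is an empty string
clean_parts = split_sentences_naive(text)

print(raw_parts)
print(clean_parts)
```

This remains a naive approach: abbreviations and decimal numbers such as "3.14" would still be cut at the full stop.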

## **spaCy**

spaCy is an open-source NLP library that can tokenize both words and sentences. We will load `en_core_web_sm`, a small pre-trained pipeline for the English language.

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')
sentence = sp('Machine learning is a method of data analysis that automates analytical model building.')
print(sentence)
# Collect the text of each token in the processed document
tokens = [word.text for word in sentence]
tokens

In [None]:
#Split Sentences
sentence = sp('Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.')
print(sentence)
# Collect the text of each sentence detected by spaCy
x = [sent.text for sent in sentence.sents]
x

## **Gensim**

The last method that we will cover in this article is Gensim, an open-source Python library for topic modelling and similarity retrieval over large datasets.

In [None]:
from gensim.utils import tokenize
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings."""
list(tokenize(text))

In [None]:
#Split Sentence
# Note: gensim.summarization was removed in Gensim 4.0, so this import needs Gensim 3.x
from gensim.summarization.textcleaner import split_sentences
text = """Artificial intelligence, the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience."""
split1 = split_sentences(text)

In [None]:
split1

# **Read more articles on:**

> * [Spacy Basics](https://analyticsindiamag.com/nlp-deep-learning-nlp-framework-nlp-model/)

> * [StanfordCore NLP](https://analyticsindiamag.com/how-to-use-stanza-by-stanford-nlp-group-with-python-code/)

> * [Tokenization in NLP](https://analyticsindiamag.com/hands-on-guide-to-different-tokenization-methods-in-nlp/)