What is tokenization?

We usually start an NLP project with a large body of text, called a corpus. This could be a collection of tweets, website reviews or film transcripts, for example. Before the corpus can be used in a machine learning model, we need to pre-process it to give it enough structure, and tokenization is usually the first step.

Tokenization is the process of breaking a corpus down into tokens. A typical procedure segments the text into sentences and then segments each sentence into individual words, numbers and punctuation marks. Each of these units is a token.
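
The two-stage idea described above can be sketched in plain Python with regular expressions. This is only a naive illustration, not how spaCy tokenizes: the helper names `split_sentences` and `split_tokens`, the patterns and the example corpus are all assumptions made for this sketch.

```python
import re

def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_tokens(sentence):
    """Return runs of word characters (keeping internal apostrophes)
    and single punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

# Hypothetical two-sentence corpus
corpus = "Tokenization comes first. It splits text into words, numbers and punctuation!"

for sentence in split_sentences(corpus):
    print(split_tokens(sentence))
```

Rules this simple break down quickly on abbreviations, contractions and missing spaces, which is one reason to use a dedicated tokenizer such as the one in spaCy.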

### Tokenization using spaCy

In [3]:
#Import spacy
import spacy
print(spacy.__name__,spacy.__version__)

spacy 3.7.6


In [4]:
#Load the spacy model
nlp=spacy.load("en_core_web_sm")

In [5]:
#Tokenize string
s="Noah doesn't like to run when it rains."
doc=nlp(s)

In [6]:
#Print tokens
print([t.text for t in doc])

['Noah', 'does', "n't", 'like', 'to', 'run', 'when', 'it', 'rains', '.']


The doc object is a container of tokens that can be indexed and sliced like a list. Indexing it with an integer returns a single Token, and slicing it returns a Span.

In [7]:
#Index and slice example
print(doc[0])
print(doc[0:3])

Noah
Noah doesn't


In [8]:
#Object types
print(type(doc))
print(type(doc[0]))
print(type(doc[0:3]))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.span.Span'>
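
To make the relationship between the three types more concrete, here is a small sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which contains only the rule-based tokenizer and needs no model download, instead of the `en_core_web_sm` model loaded above. The variable names are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components (assumption for this sketch)
nlp = spacy.blank("en")
doc = nlp("Noah doesn't like to run when it rains.")

first = doc[0]    # integer index -> Token
head = doc[0:3]   # slice -> Span, a view onto the same Doc

print(len(doc))                 # number of tokens in the Doc
print(head.start, head.end)     # token offsets of the Span within the Doc
print(head.doc is doc)          # the Span refers back to its parent Doc
print([t.text for t in head])   # a Span can be iterated token by token
```

A Span does not copy the text. It records where it starts and ends in the parent Doc, so slicing stays cheap even for long documents.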


In [9]:
#Token attribute examples
print(doc[3].text)
print(doc[3].lang_)
print(len(doc[3]))

like
en
4
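
Tokens also expose boolean lexical flags that are often useful for filtering. The sketch below shows `is_alpha`, `is_punct` and `like_num`. It assumes a blank English pipeline rather than `en_core_web_sm`, and the example sentence is made up for illustration.

```python
import spacy

# Blank English pipeline: these lexical attributes do not need a trained model
nlp = spacy.blank("en")
doc = nlp("Noah ran 5 km in 2024!")

for t in doc:
    # is_alpha: only alphabetic characters
    # is_punct: punctuation
    # like_num: looks like a number
    print(t.text, t.is_alpha, t.is_punct, t.like_num)

# Keep only the alphabetic tokens
words = [t.text for t in doc if t.is_alpha]
print(words)
```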


In [10]:
#Locate index of tokens
print([(t.text,t.i) for t in doc[:6]])

[('Noah', 0), ('does', 1), ("n't", 2), ('like', 3), ('to', 4), ('run', 5)]
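
The attribute `i` is a token's position within the Doc. A related attribute, `idx`, gives the character offset of the token in the original string. The following sketch contrasts the two and shows that the original text can be rebuilt from the tokens. It assumes a blank English pipeline and a shortened version of the example sentence.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)
text = "Noah doesn't like to run."
doc = nlp(text)

for t in doc:
    # t.i: token index, t.idx: character offset, t.whitespace_: trailing space if any
    print(t.i, t.idx, repr(t.text), repr(t.whitespace_))

# Tokenization is non-destructive: joining text_with_ws restores the input
rebuilt = "".join(t.text_with_ws for t in doc)
print(rebuilt == text)
```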


In [11]:
#Tokenize multiple sentences
s="Hello there! General Kenobi.You are a bold one."
doc=nlp(s)

In [12]:
#Print sentences
list(doc.sents)

[Hello there!, General Kenobi., You are a bold one.]

In [13]:
#Object type
type(list(doc.sents)[0])

spacy.tokens.span.Span

In [14]:
#Print tokens
print([t.text for t in doc])

['Hello', 'there', '!', 'General', 'Kenobi', '.', 'You', 'are', 'a', 'bold', 'one', '.']
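
Sentences and tokens can also be combined, since each sentence is a Span that can be iterated. The sketch below uses a different technique from the cells above: instead of the trained `en_core_web_sm` pipeline, it assumes a blank English pipeline with spaCy's rule-based `sentencizer` component, which splits on sentence-final punctuation and needs no model download. Its boundaries may differ from those of the trained pipeline on harder text.

```python
import spacy

nlp = spacy.blank("en")        # tokenizer only, no trained components
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection

doc = nlp("Hello there! General Kenobi. You are a bold one.")
sentences = list(doc.sents)

for sent in sentences:
    # sent.start and sent.end are token offsets into the parent Doc
    print(sent.start, sent.end, [t.text for t in sent])
```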
