# Tokenization

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">What is tokenization?</p>

We usually start an NLP project with a large body of text, called a **corpus**. This could be a collection of tweets, website reviews or film transcripts, for example. We need to **pre-process** the corpus to give it enough **structure** to be used in a machine learning model, and tokenization is the most common first step.

**Tokenization** is the process of breaking down a corpus into **tokens**. A typical procedure **segments** a piece of text into sentences and then further segments those sentences into individual **words, numbers and punctuation marks**, each of which is a token.

<center>
<img src='https://i.postimg.cc/ydcbhtkj/tokenization2.jpg' width=600>
</center>

Each token should be chosen to be as **small as possible** while still carrying **meaning on its own**. For example, `"£10"` can be split into the two tokens `"£"` and `"10"`, as each one possesses its own meaning.
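To make the `"£10"` example concrete, here is a minimal sketch of a rule-based tokenizer built on Python's standard `re` module. The pattern and the function name `simple_tokenize` are assumptions made for illustration; this is not how spaCy tokenizes text.

```python
import re

# Hypothetical pattern: a run of digits, OR a run of word characters,
# OR any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into number, word and symbol tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("The ticket costs £10."))
# ['The', 'ticket', 'costs', '£', '10', '.']
```

A rule this simple breaks down quickly (it would cut `"doesn't"` at the apostrophe, for instance), which is one reason dedicated libraries are used in practice.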

Note that researchers are still working out the best way to tokenize text. Some effective models break words down into smaller parts, such as splitting `"running"` into `["run", "-ing"]` (called morphemes), or even into individual letters `["r", "u", "n", "n", "i", "n", "g"]` (called graphemes).
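As a small sketch of these granularities, the snippet below splits `"running"` into letters with plain Python. The morpheme split is written by hand purely for comparison (producing it automatically needs a morphological analyser), and iterating over a Python string yields Unicode code points, which only coincide with graphemes for simple text like this.

```python
word = "running"

# Character-level tokens: iterating over a string yields one character at a time
graphemes = list(word)
print(graphemes)
# ['r', 'u', 'n', 'n', 'i', 'n', 'g']

# Hand-written morpheme split, included only for comparison (an assumption,
# not the output of any model)
morphemes = ["run", "-ing"]

# Smaller units mean more tokens for the same word
print(len(graphemes), len(morphemes))
# 7 2
```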

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Tokenization using spaCy</p>

<center>
<img src='https://i.postimg.cc/ryZxP111/spacy.png' width=250>
</center>
<br>

Luckily, there are **robust NLP libraries** that can perform tokenization for us. Let's see how to do this with one of the most popular, **spaCy**.

In [None]:
# Import spacy library
import spacy
print(spacy.__name__, spacy.__version__)

Next we load a small **statistical model** of the English language **trained on written web text** (such as blogs, news and comments). Its capabilities include tokenization, among other things.

In [None]:
# Load the spacy model
# (if it is not installed yet, run: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

To tokenize a string, we simply pass it in as an **argument** to the model.

In [None]:
# Tokenize string
s = "Noah doesn't like to run when it rains."
doc = nlp(s)

And we can **view the tokens** by iterating over the `doc` object.

In [None]:
# Print tokens
for token in doc:
    print(token)

*Notes:*
* `"doesn't"` gets split into two tokens: `"does"` and `"n't"`.
* the full stop `"."` gets its own token.
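For contrast, here is a minimal sketch of what a naive whitespace split does with the same sentence, using only built-in Python. It illustrates why the two behaviours noted above matter and is not part of spaCy.

```python
s = "Noah doesn't like to run when it rains."

# Naive approach: split on whitespace only
naive_tokens = s.split()
print(naive_tokens)
# ['Noah', "doesn't", 'like', 'to', 'run', 'when', 'it', 'rains.']

print(len(naive_tokens))
# 8
```

Here `"doesn't"` stays in one piece and the full stop remains glued to `"rains."`, so the negation and the punctuation are not available as separate tokens.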

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Types and attributes</p>

The `doc` object is a **container**, which can be indexed and sliced like a list.

In [None]:
# Index and slice example
print(doc[0])
print(doc[0:3])

Each entry in the `doc` object is a **token object**, and slicing a `doc` object returns a **span object**.

In [None]:
# Object types
print(type(doc))
print(type(doc[0]))
print(type(doc[0:3]))

Each token has several **attributes** such as language, length, index, etc. We will explore these in more detail in later notebooks but here are some examples.

In [None]:
# Token attribute examples
print(doc[3].text)   # text of the token
print(doc[3].lang_)  # language code of the token
print(len(doc[3]))   # number of characters in the token

We can locate the index of each token using the `.i` attribute.

In [None]:
# Locate index of tokens
for token in doc[:6]:
    print(token.text, token.i)

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Tokenizing paragraphs</p>

We can tokenize paragraphs in a very similar way.

In [None]:
# Tokenize multiple sentences
s = "Hello there! General Kenobi. You are a bold one."
doc = nlp(s)

We can iterate through the sentences using the `.sents` attribute.

In [None]:
# Print sentences
list(doc.sents)

Note that each sentence is a **span object** of the original document. 

In [None]:
# Object type
type(list(doc.sents)[0])
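To get a feel for what sentence segmentation involves, here is a minimal rule-based sketch using Python's `re` module. The rule (split on whitespace that follows `.`, `!` or `?`) is an assumption chosen for illustration; spaCy's trained pipeline does not rely on a fixed rule like this.

```python
import re

s = "Hello there! General Kenobi. You are a bold one."

# Split on whitespace that comes directly after sentence-ending punctuation
naive_sentences = re.split(r"(?<=[.!?])\s+", s)
print(naive_sentences)
# ['Hello there!', 'General Kenobi.', 'You are a bold one.']
```

This works on the example above, but it would wrongly split after abbreviations such as `"Dr."`. It also returns plain strings, whereas `doc.sents` yields span objects that stay linked to the original document.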