# NLP Basics with NLTK


## What is NLTK?

NLTK (Natural Language Toolkit) is one of the earliest and most widely used Python toolkits for NLP development. It provides convenient interfaces to over 50 corpora and lexical resources such as WordNet, the Penn Treebank Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s Dependency Thesaurus.

NLTK also includes basic statistical text-processing libraries for fundamental NLP tasks, together with basic semantic reasoning tools, including:
- tokenization
- parsing
- classification
- stemming
- tagging
- basic semantic reasoning

## How to Install NLTK?
#### Step 1 Install NLTK

**`pip install nltk`**

#### Step 2 Install NLTK Data


In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True


`nltk.download()` invokes the NLTK downloader, a separate window-based module that lets users download four different types of NLP data onto their machines: collections, corpora, models, and other NLP packages.

From this window, the related corpora, NLTK models, and packages can be installed, as shown below:

##### Corpora library
<img src="./Fig10.3.png" width = "" height = "" alt="NLTK installation" align=left />

##### NLTK models
<img src="./Fig10.4.png" width = "" height = "" alt="NLTK installation" align=left />
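The graphical downloader is not always convenient, for example on a remote server. As a minimal sketch, the same data can be fetched by name with `nltk.download(name)`. The resource names below are assumptions based on what this notebook uses later; recent NLTK releases may expect differently named resources (for example `punkt_tab`), and the `LookupError` raised when a resource is missing names the one to download.

```python
# Resource names are assumptions; adjust them to your NLTK version.
NLTK_RESOURCES = [
    "punkt",                       # models used by word_tokenize
    "wordnet",                     # lexicon used by WordNetLemmatizer
    "stopwords",                   # stop word lists
    "averaged_perceptron_tagger",  # model used by pos_tag
    "gutenberg",                   # corpus containing carroll-alice.txt
]


def download_nltk_resources(names=NLTK_RESOURCES, quiet=True):
    """Download each named resource and report which downloads succeeded."""
    import nltk  # imported lazily so the list can be inspected without NLTK

    return {name: nltk.download(name, quiet=quiet) for name in names}


# Uncomment to run; this needs network access.
# print(download_nltk_resources())
```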

## Tokenization in NLP

### What is Tokenization in NLP?

A token is one of the most fundamental concepts in NLP. A token may be a word, part of a word, or just characters such as punctuation. Tokenization is one of the most foundational NLP tasks and a difficult one, because every language has its own grammatical constructs, which are often difficult to write down as rules.

In short, tokenization is the process of splitting the utterances (sentences) in a text, document, or speech into smaller chunks called tokens. These tokens can be words, characters, or even numbers or symbols.

Text processing is one of the most important and fundamental tasks in NLP, and tokenization is its first step. With tokenization, we can build a vocabulary from a document or corpus.
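To make the idea of a vocabulary concrete, here is a minimal sketch that uses only the standard library. The regular-expression tokenizer and the two-sentence corpus are illustrative assumptions, not NLTK's behaviour: tokens are counted, then each distinct token gets an integer ID, most frequent first.

```python
import re
from collections import Counter

# Tiny illustrative corpus (the second sentence is made up for this example).
vocab_documents = [
    "Mr. Smith lives in New York City.",
    "Jane lives in New York too.",
]


def simple_tokenize(sentence):
    """Lowercase, then split into runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


# Count every token across the corpus.
token_counts = Counter(
    tok for doc in vocab_documents for tok in simple_tokenize(doc)
)

# Assign IDs by descending frequency, breaking ties alphabetically.
vocabulary = {
    tok: idx
    for idx, (tok, _) in enumerate(
        sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    )
}

print(token_counts.most_common(3))
print(vocabulary)
```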

NLTK provides an easy way to tokenize any string of text with the `word_tokenize()` function, as shown below:

In [2]:
from nltk.tokenize import word_tokenize

text = ['Jane lend $100.50 to Peter early this morning.', '@ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--', 'Mr. Smith lives in New York City.', 'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things', '¡This, is a sentence with weird» symbols… appearing everywhere¿']
tokens = []
for sentence in text:
    print('Original sentence:', sentence)
    tokens.append(word_tokenize(sentence))
    print('Tokenized sentence:', tokens[-1])
    print('-------------------------------')


Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Tokenized sentence: ['Mr.', 'Smith', 'lives', 'in', 'New', 'York', 'City', '.']
-------------------------------
Original sentence: The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
Tokenized sentence: ['The', 'https', ':', '//github.com/jonsafari/tok-tok/blob/master/tok-tok.pl', 'is', 'a', 'website', 'with/and/or', 'slashes', 'and', 'sort', 'of', 'weird', ':', 't

Tokenizing is not just splitting the text into words using spaces as separators, as shown below:

In [3]:
split_tokens = []
for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    split_tokens.append(sentence.split())
    print('Splitting the sentence:', split_tokens[-1])
    print('Tokenized sentence:', tokens[i])
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Splitting the sentence: ['Jane', 'lend', '$100.50', 'to', 'Peter', 'early', 'this', 'morning.']
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Splitting the sentence: ['@ref,', 'This', 'is', 'a', 'cooool', '#dummysmiley:', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Splitting the sentence: ['Mr.', 'Smith', 'lives', 'in', 'New', 'York', 'City.']
Tokenized sentence: ['Mr.', 'Smith', 'lives', 'in', 'New', 'York', 'City', '.']
-------------------------------
Ori

It can be challenging to find a tokenization procedure that works well for all cases and applications. For instance:

- Multi-word expressions: New York City
- Compound words: counter-intuitive, double-check, long-term
- Punctuation marks: everywhere¿, symbols…, with/and/or
- Special symbols: $100.50, #dummysmiley, :-P
- Specific expressions: https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl


For that reason, several tokenizers exist, with different properties and different sets of rules to define tokens.
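To illustrate how a set of rules defines tokens, here is a small regular-expression tokenizer written with the standard library. It is only a sketch under assumed rules (the pattern and its ordering are choices made for this example, not those of any NLTK tokenizer). Alternatives are tried from left to right, so putting URLs, hashtags, and emoticons first keeps them whole.

```python
import re

# Hypothetical rule set: earlier alternatives take priority over later ones.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+              # URLs kept whole
  | [@#]\w+                   # mentions and hashtags
  | (?:[:;]-?[)(DPp]|<3)      # a few ASCII emoticons
  | \$?\d+(?:\.\d+)?          # numbers, optionally with a dollar sign
  | \w+(?:-\w+)*              # words, including hyphenated compounds
  | [^\w\s]                   # any other single non-space symbol
    """,
    re.VERBOSE,
)


def rule_tokenize(sentence):
    """Return the tokens matched by TOKEN_PATTERN, scanning left to right."""
    return TOKEN_PATTERN.findall(sentence)


print(rule_tokenize("Jane lend $100.50 to Peter early this morning."))
print(rule_tokenize("@ref, This is a cooool #dummysmiley: :-) :-P <3"))
print(rule_tokenize("A counter-intuitive page: https://github.com/jonsafari/tok-tok"))
```

Changing the order of the alternatives changes the result: if the single-symbol rule came first, `:-)` would be split into three tokens.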

NLTK offers a wide range of tokenizers: https://www.nltk.org/api/nltk.tokenize.html

Let's try and compare some of these tokenizers:

In [4]:
from nltk.tokenize import TweetTokenizer, MWETokenizer, ToktokTokenizer, TreebankWordTokenizer


tweet_tokenizer = TweetTokenizer()
tweet_tokens = []

mwe_tokenizer = MWETokenizer([('New', 'York', 'City')], separator='_')
mwe_tokens = []

toktok_tokenizer = ToktokTokenizer()
toktok_tokens = []

trebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = []

for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    print('Basic tokenization:', tokens[i])
    tweet_tokens.append(tweet_tokenizer.tokenize(sentence))
    print('Tweet tokenization:', tweet_tokens[-1])
    mwe_tokens.append(mwe_tokenizer.tokenize(tokens[i]))
    print('MWE tokenization:', mwe_tokens[-1])
    toktok_tokens.append(toktok_tokenizer.tokenize(sentence))
    print('Toktok tokenization:', toktok_tokens[-1])
    treebank_tokens.append(trebank_tokenizer.tokenize(sentence))
    print('Treebank tokenization:', treebank_tokens[-1])
    print('-------------------------------')


Original sentence: Jane lend $100.50 to Peter early this morning.
Basic tokenization: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Tweet tokenization: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
MWE tokenization: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Toktok tokenization: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Treebank tokenization: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Basic tokenization: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
Tweet tokenization: ['@ref', ',', 'This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'ar

Some NLTK tokenizers can also produce token spans, represented as tuples of integers, which record where each token was located in the original text. This can help to recover the original text after tokenization.

In [9]:

spans = []
treebank_tokens = []
for sentence in text:
    print('Original sentence:', sentence)
    treebank_tokens.append(trebank_tokenizer.tokenize(sentence))
    spans.append(list(trebank_tokenizer.span_tokenize(sentence)))
    print('Tokenized sentence:', treebank_tokens[-1])
    print('Token spans:', spans[-1])
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Token spans: [(0, 4), (5, 9), (10, 11), (11, 17), (18, 20), (21, 26), (27, 32), (33, 37), (38, 45), (45, 46)]
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
Token spans: [(0, 1), (1, 4), (4, 5), (6, 10), (11, 13), (14, 15), (16, 22), (23, 24), (24, 35), (35, 36), (37, 38), (38, 39), (39, 40), (41, 42), (42, 44), (45, 46), (46, 47), (48, 51), (52, 56), (57, 63), (64, 65), (66, 67), (68, 69), (69, 70), (71, 72), (72, 74)]
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Tokenized sentence: ['Mr.', 'Smith', 'lives', 'in', 'New', 'Yo
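The spans printed above are `(start, end)` character offsets, so slicing the original string with them returns the tokens, and the gaps between consecutive spans show where whitespace was. The sketch below uses the first sentence and its spans exactly as printed. The `rebuild_from_spans` helper is a hypothetical name, and it assumes every gap consisted of plain spaces.

```python
span_sentence = "Jane lend $100.50 to Peter early this morning."
span_list = [(0, 4), (5, 9), (10, 11), (11, 17), (18, 20),
             (21, 26), (27, 32), (33, 37), (38, 45), (45, 46)]

# Slicing with the spans gives back the tokens.
span_tokens = [span_sentence[start:end] for start, end in span_list]


def rebuild_from_spans(tokens, spans):
    """Re-insert one space per skipped character between consecutive spans."""
    pieces = []
    cursor = 0
    for token, (start, end) in zip(tokens, spans):
        pieces.append(" " * (start - cursor))  # gap left by the tokenizer
        pieces.append(token)
        cursor = end
    return "".join(pieces)


print(span_tokens)
print(rebuild_from_spans(span_tokens, span_list) == span_sentence)
```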

<img src="./note.png" width = "40" height = "40" alt="note" align=top />

Let's have a look at the different tokenizations obtained with the different tokenizers:

1. Can we identify a best generic tokenizer?
2. Can we identify a best tokenizer for specific kinds of text?
3. What problems/challenges do we observe in these tokenizers?

## Word Normalization

### Lemmatization

Lemmatization is an important text pre-processing technique in Natural Language Processing (NLP) that reduces words to their base form known as a "lemma." For example, the lemma of "running" is "run" and "better" becomes "good."

Unlike stemming, which simply removes prefixes or suffixes, lemmatization considers the word's meaning and part of speech (POS) and ensures that the base form is a valid word. This makes lemmatization more accurate, as it avoids generating non-dictionary words.

**Advantages**:

* **Improved accuracy**: It ensures that different forms of the same word, like "running" and "ran", are treated as the same.
* **Reduced Data Redundancy**: By reducing words to their base forms, it reduces redundancy in the dataset. This leads to smaller vocabularies, which makes it easier to handle and process large amounts of text for analysis or for training machine learning models.
* **Better NLP Model Performance**: By treating all forms of a word as the same, it can improve the performance of NLP models in certain tasks by making text more consistent. For example, treating "running," "ran" and "runs" as the same word can improve the model's understanding of context and meaning.


**Disadvantages**:

* Time-consuming: It can be slower compared to other techniques such as stemming because it involves parsing the text and performing dictionary lookups or morphological analysis.

NLTK uses a predefined dictionary or lexicon such as WordNet to look up the base form of a word. This method is more accurate than purely rule-based lemmatization because it accounts for exceptions and irregular words.
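The following toy lemmatizer sketches the idea of dictionary lookup with exceptions. It is not NLTK's implementation: the exception table, suffix rules, and lexicon are tiny hand-written assumptions. It only mirrors the general approach of checking irregular forms first and accepting a suffix rule only when the result is a known word.

```python
# All three tables are illustrative assumptions, keyed by part of speech.
TOY_EXCEPTIONS = {
    "n": {"lives": "life"},
    "v": {"ran": "run", "running": "run"},
    "a": {"better": "good"},
}
TOY_SUFFIX_RULES = {
    "n": [("ves", "f"), ("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("ed", ""), ("ing", ""), ("s", "")],
    "a": [("er", ""), ("est", "")],
}
TOY_LEXICON = {
    "n": {"life", "arrow", "city", "wolf"},
    "v": {"live", "run", "lend"},
    "a": {"good"},
}


def toy_lemmatize(word, pos="n"):
    """Return a lemma for `word`, treating it as a noun unless `pos` says otherwise."""
    if word in TOY_EXCEPTIONS[pos]:          # irregular forms come first
        return TOY_EXCEPTIONS[pos][word]
    if word in TOY_LEXICON[pos]:             # already a base form
        return word
    for suffix, replacement in TOY_SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in TOY_LEXICON[pos]:  # accept only known words
                return candidate
    return word                               # unknown words are left unchanged


print(toy_lemmatize("lives"), toy_lemmatize("lives", pos="v"))
print(toy_lemmatize("arrows"), toy_lemmatize("better", pos="a"))
```

Unknown tokens such as `cooool` pass through unchanged, which matches the behaviour seen in the notebook outputs.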

**Lemmatization in NLTK**: https://www.nltk.org/api/nltk.stem.wordnet.html

In [10]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = []
for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    print('Tokenized sentence:', tokens[i])
    lemmatized_tokens.append([lemmatizer.lemmatize(token) for token in tokens[i]])
    print('Lemmatized sentence:', lemmatized_tokens[-1])
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Lemmatized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
Lemmatized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrow', '<', '>', '-', '>', '<', '--']
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Tokenized sentence: ['Mr.', 'Smith', 'lives', 'in', 'New', 'York', 'City', '.']
Lemmatized sentence: ['Mr.', 'Smith', 'life', 'in', 'New', 'York', 'City',

##### Improving Lemmatization with Part of Speech (POS) 

To improve the accuracy of lemmatization, it’s important to specify the correct Part of Speech (POS) for each word. POS tagging is the process of assigning a grammatical category (noun, verb, adjective, adverb, ...) to every word in a sentence. By default, the NLTK lemmatizer assumes that words are nouns when no POS tag is provided, which is why "lives" was lemmatized to "life" in the previous output. Supplying the correct POS tag for each word gives more accurate results.

For example:

- "running" (as a verb) should be lemmatized to "run".
- "better" (as an adjective) should be lemmatized to "good".

In [12]:
from nltk import pos_tag


def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return 'a'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('N'):
        return 'n'
    elif tag.startswith('R'):
        return 'r'
    else:
        return 'n'

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = []
lemmatized_tokens_pos = []
for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    print('Tokenized sentence:', tokens[i])
    lemmatized_tokens.append([lemmatizer.lemmatize(token) for token in tokens[i]])
    print('Lemmatized sentence:', lemmatized_tokens[-1])
    tagged_tokens = pos_tag(tokens[i])
    lemmatized_tokens_pos.append([lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tagged_tokens])
    print('Lemmatized sentence with POS:', lemmatized_tokens_pos[-1])
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Lemmatized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Lemmatized sentence with POS: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
Lemmatized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrow', '<', '>', '-', '>', '<', '--']
Lemmatized sentence with POS: ['@', 'ref', ',', 'This', 'be', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '

### Stemming

Stemming removes prefixes or suffixes from words, such as “er”, “ion”, and “ization”, in order to generate their root forms.

Take the words Computers, Computation, and Computerization as an example. Although these words have different spellings, they all share the same concept, related to “compute”, so we set “comput” as the stem of these words.
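A deliberately naive suffix stripper shows why these words end up with the same stem. The suffix list and the minimum stem length are assumptions chosen for this example only. Real stemmers such as Porter's apply staged rules with extra conditions on the remaining stem, so this sketch over-strips many ordinary words.

```python
# Hypothetical suffix list, tried in this order on every pass.
TOY_SUFFIXES = ["ization", "ation", "ers", "er", "ion", "es", "s", "e"]


def toy_stem(word, min_stem_len=4):
    """Repeatedly strip the first matching suffix while the stem stays long enough."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in TOY_SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
                stem = stem[: -len(suffix)]
                changed = True
                break  # restart from the top of the suffix list
    return stem


for example_word in ["Computers", "Computation", "Computerization", "compute"]:
    print(example_word, "->", toy_stem(example_word))
```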

**Stemming in NLTK**: https://www.nltk.org/api/nltk.stem.

NLTK includes two classical stemming algorithms:

1. Porter Stemmer: the Porter stemmer, published in 1980, is one of the earliest and most widely used stemming algorithms. Its key procedure is to remove common word endings in order to reduce words to a generic form. As this method is simple and effective, it is commonly used in many NLP applications.
2. Snowball Stemmer: compared with the Porter Stemmer, the Snowball Stemmer provides some improvements in stemming results. It also supports stemming in multiple languages.

In [16]:
from nltk.stem import PorterStemmer, SnowballStemmer

porter_stemmer = PorterStemmer()
snowball_stemmer_eng = SnowballStemmer("english")

porter_stemmed_tokens = []
snowball_stemmed_tokens = []
for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    print('Tokenized sentence:', tokens[i])
    porter_stemmed_tokens.append([porter_stemmer.stem(token) for token in tokens[i]])
    print('Stemmed sentence with Porter:', porter_stemmed_tokens[-1])
    snowball_stemmed_tokens.append([snowball_stemmer_eng.stem(token) for token in tokens[i]])
    print('Stemmed sentence with Snowball:', snowball_stemmed_tokens[-1])
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['Jane', 'lend', '$', '100.50', 'to', 'Peter', 'early', 'this', 'morning', '.']
Stemmed sentence with Porter: ['jane', 'lend', '$', '100.50', 'to', 'peter', 'earli', 'thi', 'morn', '.']
Stemmed sentence with Snowball: ['jane', 'lend', '$', '100.50', 'to', 'peter', 'earli', 'this', 'morn', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['@', 'ref', ',', 'This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
Stemmed sentence with Porter: ['@', 'ref', ',', 'thi', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-p', '<', '3', 'and', 'some', 'arrow', '<', '>', '-', '>', '<', '--']
Stemmed sentence with Snowball: ['@', 'ref', ',', 'this', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')'

In [None]:
from nltk.stem.snowball import SnowballStemmer

text_esp = "Los gatos estaban corriendo más rápido que los perros."
tokens_esp = word_tokenize(text_esp)
stemmer_esp = SnowballStemmer("spanish")
stemmed_words = [stemmer_esp.stem(word) for word in tokens_esp]

print(f"Original text: {text_esp}")
print(f"Stemmed Words: {stemmed_words}")

Original text: Los gatos estaban corriendo más rápido que los perros.
Stemmed Words: ['los', 'gat', 'estab', 'corr', 'mas', 'rap', 'que', 'los', 'perr', '.']


### Stop words Removal

A stop word is a commonly used word, such as the, is, a, an, or of, that search engines are commonly programmed to ignore.

NLTK provides the following support for handling stop words:
1. A library of stop words for each supported language, such as English.
2. An easy way to remove stop words from a string or even a whole text document.


Let's have a look at the lists of pre-defined stop words in different languages:

In [17]:
# Import NLTK stop-words 
from nltk.corpus import stopwords 
# Let's have a look at the pre-defined stop-words for some languages
print(stopwords.words('english'))
print(stopwords.words('spanish'))
print(stopwords.words('catalan'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Now let's remove the stop words from the original text:

In [19]:
from nltk.corpus import stopwords
stop_words_eng_ = stopwords.words('english')

for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    text_clean =[w for w in tokens[i] if w not in stop_words_eng_]
    print("Cleaned Sentence: ", ' '.join(text_clean))
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Cleaned Sentence:  Jane lend $ 100.50 Peter early morning .
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Cleaned Sentence:  @ ref , This cooool # dummysmiley : : - ) : -P < 3 arrows < > - > < --
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Cleaned Sentence:  Mr. Smith lives New York City .
-------------------------------
Original sentence: The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
Cleaned Sentence:  The https : //github.com/jonsafari/tok-tok/blob/master/tok-tok.pl website with/and/or slashes sort weird : things
-------------------------------
Original sentence: ¡This, is a sentence with weird» symbols… appearing everywhere¿
Cleaned Sentence:  ¡This , sentence weird » symbols… appearing everywhere¿
------------------------

#### Create Your Own Stop Words
We can add any stop word we like to the predefined list:

In [20]:
my_stop_words = stopwords.words('english')
my_stop_words.append('This')

for i, sentence in enumerate(text):
    print('Original sentence:', sentence)
    text_clean =[w for w in tokens[i] if w not in my_stop_words]
    print("Cleaned Sentence: ", ' '.join(text_clean))
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Cleaned Sentence:  Jane lend $ 100.50 Peter early morning .
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Cleaned Sentence:  @ ref , cooool # dummysmiley : : - ) : -P < 3 arrows < > - > < --
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Cleaned Sentence:  Mr. Smith lives New York City .
-------------------------------
Original sentence: The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
Cleaned Sentence:  The https : //github.com/jonsafari/tok-tok/blob/master/tok-tok.pl website with/and/or slashes sort weird : things
-------------------------------
Original sentence: ¡This, is a sentence with weird» symbols… appearing everywhere¿
Cleaned Sentence:  ¡This , sentence weird » symbols… appearing everywhere¿
-----------------------------
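In the outputs above, "The" and "¡This" survive because the comparison is case-sensitive and exact, while the predefined list is lowercase. A common workaround is to lowercase each token before the lookup. The sketch below uses a small hand-written subset of English stop words so that it runs without NLTK; in the notebook, `set(stopwords.words('english'))` could be passed instead.

```python
# Illustrative subset only; the real NLTK list is much longer.
STOP_WORDS_SUBSET = {"the", "this", "is", "a", "of", "in", "and", "to"}


def remove_stop_words(token_list, stop_words=STOP_WORDS_SUBSET):
    """Drop tokens whose lowercase form is a stop word, keeping original casing."""
    return [tok for tok in token_list if tok.lower() not in stop_words]


sample_stop_tokens = ["The", "https", "website", "is", "a", "sort", "of", "weird", "thing"]
print(remove_stop_words(sample_stop_tokens))
```

Using a set rather than a list also makes each membership test a constant-time lookup, which matters on long documents.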

## Subword Tokenization

We have seen that it can be difficult to define a configuration of tokenization + word normalization that works well in all cases. **Subword tokenization** aims at learning an ideal tokenization from the data. 

There are various standard techniques for subword tokenization:
1. Byte-Pair Encoding (BPE) 
2. WordPiece 
3. Unigram Language Modeling 

They usually consist of two components:
- A token learner that induces a vocabulary from a corpus
- A token encoder that takes a raw sentence and tokenizes it according to the learned vocabulary

### Byte-Pair Encoding


Installation of the package:  https://pypi.org/project/bpetokenizer/

In [22]:
%pip install bpetokenizer

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


To test the BPE tokenizer we will use the text of the book "Alice's Adventures in Wonderland" from the Gutenberg corpus in NLTK. 

Let's load the text:

In [9]:
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('carroll-alice.txt')
print(raw_text[:500])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


### Training the tokenizer

We will train a tokenizer with the text of the book. 

- We can add special tokens when we initialize the tokenizer to include extra symbols in the text. For instance, start and end of text or other specific separators or markers
- We can specify the size of the vocabulary of tokens
- We can save the trained tokenizer in a file 

In [13]:
from bpetokenizer import BPETokenizer


# Define special tokens for start and end of text with their corresponding IDs
special_tokens = {
    "<|endoftext|>": 1001,
    "<|startoftext|>": 1002
}

# Add special tokens to the raw text
text = '<|startoftext|> ' + raw_text + ' <|endoftext|>'

# Initialize the BPETokenizer object with the defined special tokens and train it on the text
tokenizer = BPETokenizer(special_tokens=special_tokens)
tokenizer.train(text, vocab_size=500, verbose=True)

#Save the trained tokenizer to a file in JSON format
tokenizer.save("sample_bpetokenizer", mode="json") 


merging 1/244: (32, 116) -> 256 (b' t') had 3777 frequency
merging 2/244: (104, 101) -> 257 (b'he') had 3706 frequency
merging 3/244: (32, 97) -> 258 (b' a') had 2583 frequency
merging 4/244: (32, 115) -> 259 (b' s') had 2221 frequency
merging 5/244: (105, 110) -> 260 (b'in') had 2000 frequency
merging 6/244: (256, 257) -> 261 (b' the') had 1831 frequency
merging 7/244: (32, 119) -> 262 (b' w') had 1523 frequency
merging 8/244: (111, 117) -> 263 (b'ou') had 1515 frequency
merging 9/244: (105, 116) -> 264 (b'it') had 1246 frequency
merging 10/244: (110, 100) -> 265 (b'nd') had 1175 frequency
merging 11/244: (101, 114) -> 266 (b'er') had 1133 frequency
merging 12/244: (32, 111) -> 267 (b' o') had 1118 frequency
merging 13/244: (104, 97) -> 268 (b'ha') had 1014 frequency
merging 14/244: (260, 103) -> 269 (b'ing') had 979 frequency
merging 15/244: (114, 101) -> 270 (b're') had 964 frequency
merging 16/244: (10, 10) -> 271 (b'\n\n') had 852 frequency
merging 17/244: (101, 100) -> 272 (b'ed'
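The merge log above reflects the core loop of a BPE token learner: count adjacent pairs, merge the most frequent pair into a new token ID, and repeat. The following standard-library sketch works on UTF-8 bytes, as the log suggests the package does. It is a minimal illustration, not the package's implementation; the function names are made up, and new IDs start at 256 so that `vocab_size=500` corresponds to 244 merges.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def learn_bpe(corpus, vocab_size):
    """Learn up to `vocab_size - 256` merges from the UTF-8 bytes of `corpus`."""
    ids = list(corpus.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        best_pair, frequency = pair_counts.most_common(1)[0]
        if frequency < 2:
            break  # nothing left worth merging
        merges[best_pair] = new_id
        ids = merge_pair(ids, best_pair, new_id)
    return merges, ids


toy_merges, toy_ids = learn_bpe("aaabdaaabac", vocab_size=257)
print(toy_merges)  # the pair (97, 97), i.e. "aa", is merged first
print(toy_ids)
```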

Let's visualize the trained vocabulary

In [11]:
print("Vocabulary: ", tokenizer.vocab)
print('---')
print("Merges: ", tokenizer.merges)
print('---')
print("Special tokens: ", tokenizer.special_tokens)


Vocabulary:  {0: b'\x00', 1: b'\x01', 2: b'\x02', 3: b'\x03', 4: b'\x04', 5: b'\x05', 6: b'\x06', 7: b'\x07', 8: b'\x08', 9: b'\t', 10: b'\n', 11: b'\x0b', 12: b'\x0c', 13: b'\r', 14: b'\x0e', 15: b'\x0f', 16: b'\x10', 17: b'\x11', 18: b'\x12', 19: b'\x13', 20: b'\x14', 21: b'\x15', 22: b'\x16', 23: b'\x17', 24: b'\x18', 25: b'\x19', 26: b'\x1a', 27: b'\x1b', 28: b'\x1c', 29: b'\x1d', 30: b'\x1e', 31: b'\x1f', 32: b' ', 33: b'!', 34: b'"', 35: b'#', 36: b'$', 37: b'%', 38: b'&', 39: b"'", 40: b'(', 41: b')', 42: b'*', 43: b'+', 44: b',', 45: b'-', 46: b'.', 47: b'/', 48: b'0', 49: b'1', 50: b'2', 51: b'3', 52: b'4', 53: b'5', 54: b'6', 55: b'7', 56: b'8', 57: b'9', 58: b':', 59: b';', 60: b'<', 61: b'=', 62: b'>', 63: b'?', 64: b'@', 65: b'A', 66: b'B', 67: b'C', 68: b'D', 69: b'E', 70: b'F', 71: b'G', 72: b'H', 73: b'I', 74: b'J', 75: b'K', 76: b'L', 77: b'M', 78: b'N', 79: b'O', 80: b'P', 81: b'Q', 82: b'R', 83: b'S', 84: b'T', 85: b'U', 86: b'V', 87: b'W', 88: b'X', 89: b'Y', 90: b'

### Encoding the text

In [17]:
# Load the trained tokenizer from the JSON file
tokenizer = BPETokenizer()
tokenizer.load("sample_bpetokenizer.json", mode="json")

# Encode the text using the tokenizer, including special tokens. It returns a list of token IDs corresponding to the input text.
token_ids = tokenizer.encode(text, special_tokens="all")
print(token_ids)

# Decode the list of token IDs back into the original text using the tokenizer's decode method. This should return the original text if the encoding and decoding processes are consistent.
decode_text = tokenizer.decode(token_ids)
print(decode_text)

[60, 124, 323, 295, 382, 102, 116, 101, 120, 116, 124, 62, 32, 91, 485, 357, 307, 100, 118, 343, 117, 270, 115, 301, 437, 111, 265, 266, 108, 387, 273, 121, 32, 76, 101, 119, 290, 448, 295, 408, 280, 32, 49, 56, 54, 53, 93, 271, 67, 72, 65, 80, 84, 69, 82, 316, 46, 392, 395, 261, 32, 82, 393, 98, 264, 45, 72, 111, 296, 271, 485, 321, 449, 260, 110, 269, 274, 299, 345, 399, 256, 105, 270, 100, 291, 259, 264, 116, 269, 273, 121, 330, 259, 290, 356, 325, 261, 10, 98, 314, 107, 44, 276, 291, 315, 118, 269, 371, 104, 269, 274, 390, 58, 325, 281, 472, 256, 119, 468, 300, 364, 310, 368, 112, 272, 301, 382, 261, 10, 98, 378, 330, 259, 290, 356, 321, 361, 354, 269, 44, 420, 293, 364, 459, 310, 105, 479, 117, 270, 115, 472, 275, 286, 326, 115, 302, 414, 115, 301, 10, 264, 44, 292, 387, 450, 431, 261, 344, 306, 291, 258, 273, 378, 312, 473, 327, 292, 119, 351, 336, 310, 105, 479, 117, 270, 115, 472, 10, 99, 286, 326, 115, 302, 414, 63, 324, 83, 111, 300, 321, 275, 286, 115, 277, 266, 269, 301, 33
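To show what a token encoder does with the learned merges, here is a minimal standard-library sketch. The three merges are copied from the training log earlier in this section (`' t'`, `'he'`, `' the'`); everything else, including the function names, is an assumption made for illustration and not the package's code. Merges are applied in the order they were learned, and decoding expands each ID back into bytes.

```python
# Merges taken from the training log: pair of token IDs -> new token ID.
LOG_MERGES = {(32, 116): 256, (104, 101): 257, (256, 257): 261}


def apply_merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def bpe_encode(sentence, merges):
    """Start from UTF-8 bytes and apply the earliest-learned applicable merge first."""
    ids = list(sentence.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=merges.get)  # lowest ID = learned earliest
        ids = apply_merge(ids, pair, merges[pair])
    return ids


def bpe_decode(ids, merges):
    """Rebuild the byte string for each ID, then decode it as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


encoded_ids = bpe_encode("in the hat", LOG_MERGES)
print(encoded_ids)
print(bpe_decode(encoded_ids, LOG_MERGES))
```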

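The package also includes a pretrained tokenizer with a 17k vocabulary, called "wi17k_base". In the run recorded below, loading it failed with a `FileNotFoundError`, so the encoding and decoding steps were not reached: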

In [13]:
tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)

# Encode the text using the tokenizer, including special tokens. It returns a list of token IDs corresponding to the input text.
token_ids = tokenizer.encode(text, special_tokens="all")
print(token_ids)

# Decode the list of token IDs back into the original text using the tokenizer's decode method. This should return the original text if the encoding and decoding processes are consistent.
decode_text = tokenizer.decode(token_ids)
print(decode_text)

loading tokenizer from: bpetokenizer/pretrained\wi17k_base\wi17k_base.json


FileNotFoundError: tokenizer file not found: bpetokenizer/pretrained\wi17k_base\wi17k_base.json. Please check the tokenizer name

### WordPiece


Installation of the package:  https://pypi.org/project/word-piece-tokenizer/

In [21]:
%pip install word-piece-tokenizer

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


This is a very lightweight implementation of the *WordPiece* tokenizer that can only use an already pretrained vocabulary, the one used to train the [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) model.

With this implementation we cannot train our own model. We can only tokenize a text with the pretrained vocabulary.

In [19]:
from word_piece_tokenizer import WordPieceTokenizer
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('carroll-alice.txt')
tokenizer = WordPieceTokenizer()

# Tokenize the raw text. It returns a list of token IDs corresponding to the input text. 
token_ids = tokenizer.tokenize(raw_text)
print(token_ids[:500])

# Shows the text corresponding to the token IDs. Includes special tokens used in BERT: [CLS] and [SEP]
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print('\n'.join(tokens[:500]))

# Converts the list of tokens back into a string. Because the vocabulary is uncased, the tokens are lowercased, so the result will not match the original text exactly.
decoded_text = tokenizer.convert_tokens_to_string(tokens)
print(decoded_text[:500])


[101, 1031, 5650, 1005, 1055, 7357, 1999, 20365, 2011, 4572, 10767, 6725, 1033, 3127, 1045, 1012, 2091, 1996, 10442, 1011, 4920, 5650, 2001, 2927, 2000, 2131, 2200, 5458, 1997, 3564, 2011, 2014, 2905, 2006, 1996, 2924, 1010, 1998, 1997, 2383, 2498, 2000, 2079, 1024, 2320, 2030, 3807, 2016, 2018, 21392, 5669, 2046, 1996, 2338, 2014, 2905, 2001, 3752, 1010, 2021, 2009, 2018, 2053, 4620, 2030, 11450, 1999, 2009, 1010, 1005, 1998, 2054, 2003, 1996, 2224, 1997, 1037, 2338, 1010, 1005, 2245, 5650, 1005, 2302, 4620, 2030, 4512, 1029, 1005, 2061, 2016, 2001, 6195, 1999, 2014, 2219, 2568, 1006, 2004, 2092, 2004, 2016, 2071, 1010, 2005, 1996, 2980, 2154, 2081, 2014, 2514, 2200, 17056, 1998, 5236, 1007, 1010, 3251, 1996, 5165, 1997, 2437, 1037, 10409, 1011, 4677, 2052, 2022, 4276, 1996, 4390, 1997, 2893, 2039, 1998, 8130, 1996, 18765, 14625, 1010, 2043, 3402, 1037, 2317, 10442, 2007, 5061, 2159, 2743, 2485, 2011, 2014, 1012, 2045, 2001, 2498, 2061, 2200, 9487, 1999, 2008, 1025, 4496, 2106, 5650, 

Let's go back to the text we used in the first tokenization examples and see how WordPiece handles it.

In [20]:
text = ['Jane lend $100.50 to Peter early this morning.', '@ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--', 'Mr. Smith lives in New York City.', 'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things', '¡This, is a sentence with weird» symbols… appearing everywhere¿']
tokens = []
for sentence in text:
    print('Original sentence:', sentence)
    token_ids = tokenizer.tokenize(sentence)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    print('Tokenized sentence:', tokens)
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['[CLS]', 'jane', 'lend', '$', '100', '.', '50', 'to', 'peter', 'early', 'this', 'morning', '.', '[SEP]']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['[CLS]', '@', 'ref', ',', 'this', 'is', 'a', 'co', '##oo', '##ol', '#', 'dummy', '##sm', '##ile', '##y', ':', ':', '-', ')', ':', '-', 'p', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '-', '-', '[SEP]']
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Tokenized sentence: ['[CLS]', 'mr', '.', 'smith', 'lives', 'in', 'new', 'york', 'city', '.', '[SEP]']
-------------------------------
Original sentence: The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
Tokenized sentence: ['[CLS]', 'the', 'https', ':', '/', '/', 'gi', '##th', '##ub
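
In this output, a piece starting with `##` continues the previous token, while `[CLS]` and `[SEP]` are the special tokens BERT adds around the input. The following sketch shows one way to glue the pieces back into words; the helper `merge_wordpieces` is a hypothetical name, not part of the `word_piece_tokenizer` package, and the token list is a shortened copy of the second sentence's output.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]")):
    """Rebuild words from WordPiece tokens by gluing '##' pieces to the previous token."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue  # drop BERT's sentence markers
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without the prefix
        else:
            words.append(token)
    return words


tokens = ['[CLS]', '@', 'ref', ',', 'this', 'is', 'a', 'co', '##oo', '##ol',
          '#', 'dummy', '##sm', '##ile', '##y', ':', '[SEP]']
print(merge_wordpieces(tokens))
```

Note that the merged words stay lowercase: the casing of the original text is lost when an uncased vocabulary is used.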

### Unigram LM (SentencePiece)


Installation of the package:  https://pypi.org/project/sentencepiece/

In [23]:
%pip install sentencepiece

Defaulting to user installation because normal site-packages is not writeable
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp312-cp312-win_amd64.whl.metadata (10 kB)
Downloading sentencepiece-0.2.1-cp312-cp312-win_amd64.whl (1.1 MB)
   ---------------------------------------- 0.0/1.1 MB ? eta -:--:--
   ---------------------------------------- 1.1/1.1 MB 12.8 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.2.1
Note: you may need to restart the kernel to use updated packages.


### Training the tokenizer

We only need to specify the input file containing the training text, the prefix used to name the saved model files, and the vocabulary size.

In [None]:
from sentencepiece import SentencePieceTrainer

SentencePieceTrainer.train(input='C:\\nltk_data\\corpora\\gutenberg\\carroll-alice.txt', model_prefix='sample_model', vocab_size=1000)


### Encoding the text

We can load any previously trained and saved model, and then encode the text either as token IDs or as text pieces.

In [12]:
from sentencepiece import SentencePieceProcessor
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('carroll-alice.txt')
print(raw_text[:500])

# Initialize the processor and load the model from a file
model = SentencePieceProcessor()
model.load('sample_model.model')

# Encode the text as token ids. It returns a list of token IDs corresponding to the input text.
token_ids = model.encode_as_ids(raw_text)
print(token_ids[:500])

# Encode the text as text pieces. It returns a list of tokens corresponding to the input text.
tokens_text = model.encode_as_pieces(raw_text)
print(tokens_text[:500])

# Decode the first 500 token IDs back into text. This reproduces the start of the original text, except for characters that were encoded as <unk> (ID 0).
print(model.decode_ids(token_ids[:500]))

# Decode the first 500 text pieces back into text.
print(model.decode_pieces(tokens_text[:500]))


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an
[6, 0, 761, 9, 7, 993, 23, 302, 119, 28, 50, 103, 76, 144, 377, 17, 90, 135, 291, 79, 29, 47, 56, 6, 0, 481, 101, 339, 100, 310, 20, 8, 6, 311, 90, 25, 4, 170, 59, 890, 17, 21, 24, 507, 10, 176, 62, 752, 15, 612, 144, 33, 580, 44, 4, 969, 3, 11, 15, 624, 251, 10, 91, 34, 264, 117, 853, 16, 45, 959, 22, 116, 4, 6, 491, 33, 580, 24, 331, 19, 3, 71, 14, 45, 77, 779, 7, 117, 571, 7, 23, 14, 3, 5, 76, 85, 74, 4, 371, 15, 12, 6, 491, 26, 102, 21, 5, 90, 37, 108, 47, 52, 13, 779, 7, 117, 571, 55, 241,
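
For intuition about how a unigram language model picks one segmentation among the many possible ones, here is a small dynamic-programming (Viterbi) sketch that selects the split with the highest total log-probability. It is not SentencePiece's implementation: the function `unigram_segment` and the piece probabilities in `toy_probs` are invented for illustration, whereas a real model learns its probabilities during training.

```python
import math


def unigram_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any split of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None  # the text cannot be covered with the known pieces
    # Walk back through the stored start positions to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical piece probabilities (not taken from the trained model).
toy_probs = {"▁": 0.1, "▁l": 0.01, "l": 0.01, "e": 0.03, "n": 0.02,
             "en": 0.01, "d": 0.02, "▁to": 0.05, "▁t": 0.01, "t": 0.02, "o": 0.02}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(unigram_segment("▁lend", toy_log_probs))
print(unigram_segment("▁to", toy_log_probs))
```

Longer pieces tend to win because every extra piece multiplies in another probability below 1, which is why frequent words usually end up as a single token.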

Let's explore the vocabulary of the trained model.

In [5]:
# Vocabulary size
print("Vocabulary size:", model.get_piece_size())

# Tokens of the vocabulary
print("Vocabulary tokens:", [model.id_to_piece(i) for i in range(model.get_piece_size())])


Vocabulary size: 1000
Vocabulary tokens: ['<unk>', '<s>', '</s>', ',', '▁the', "▁'", '▁', 's', '.', "'", '▁to', '▁and', '▁a', 't', '▁it', '▁of', '▁she', 'e', '▁said', 'ing', '▁I', '▁Alice', 'ed', '▁in', '▁was', 'n', ",'", '▁you', 'd', 'r', "!'", 'I', '▁that', '▁her', ':', '▁as', 'm', 'i', 'ly', 'a', '▁be', '--', ';', '▁at', '▁on', '▁had', 'y', 'o', '▁with', 'p', 'er', ".'", 'u', '▁all', '!', "?'", 'll', 're', '▁so', '-', '▁for', '▁not', '▁very', '▁little', 've', '▁out', '▁they', 'g', '▁he', '▁this', 'h', '▁but', '▁The', '▁up', '▁is', '▁down', 'and', '▁no', '▁p', 'ar', '▁like', '▁his', '▁about', '▁again', '▁know', '▁what', '▁one', 'le', 'E', '▁were', 'w', '▁do', '▁s', '▁them', '▁herself', '▁went', '▁would', '▁could', '▁me', '▁have', 'T', 'A', '▁thought', 'l', 'S', 'c', '▁Queen', '▁did', 'th', '▁time', '▁if', '▁off', 'it', 'at', '▁see', 'f', '▁into', '▁or', '▁A', 'on', '▁when', 'an', 'O', 'en', '▁w', '▁f', '▁King', '▁can', '▁*', '▁think', '▁head', '▁then', '▁there', '▁Turtle', '▁began', 

Let's test the pretrained model included in the package distribution.

In [11]:
from sentencepiece import SentencePieceProcessor
from nltk.corpus import gutenberg

raw_text = gutenberg.raw('carroll-alice.txt')
print(raw_text[:500])

# Initialize the processor and load the model from a file
model = SentencePieceProcessor()
model.load('test_model.model')

# Vocabulary size
print("Vocabulary size:", model.get_piece_size())

# Tokens of the vocabulary
print("Vocabulary tokens:", [model.id_to_piece(i) for i in range(model.get_piece_size())])

# Encode the text as token ids. It returns a list of token IDs corresponding to the input text.
token_ids = model.encode_as_ids(raw_text)
print(token_ids[:500])

# Encode the text as text pieces. It returns a list of tokens corresponding to the input text.
tokens_text = model.encode_as_pieces(raw_text)
print(tokens_text[:500])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an
Vocabulary size: 1000
Vocabulary tokens: ['<unk>', '<s>', '</s>', '\r', '▁', ',', '.', '▁the', 's', '▁I', '▁to', '▁a', 'ed', '▁and', '▁of', 't', 'e', 'd', 'ing', 'a', '▁in', 'o', '▁was', '▁"', 'n', 'i', 'm', '▁it', 'c', 'p', 'r', 'l', "'", '▁me', 'y', '-', '▁that', 'g', '▁be', '▁he', 'er', 'ly', '▁for', '▁not', '▁with', '▁my', 'k', '▁is', 'ar', '▁you', '▁as', '▁s', '▁but', 'u', 'or', 'in', 're', 'f', '."', '▁on', 'h', '▁at', '▁or', 'w', '▁had', '▁this', 'b', '▁would', '▁p', '▁The', '▁re', '▁so'

Let's go back to the text we used in the first tokenization examples and see how the SentencePiece model handles it.

In [8]:
text = ['Jane lend $100.50 to Peter early this morning.', '@ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--', 'Mr. Smith lives in New York City.', 'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things', '¡This, is a sentence with weird» symbols… appearing everywhere¿']
tokens = []
for sentence in text:
    print('Original sentence:', sentence)
    token_ids = model.encode_as_ids(sentence)
    tokens = model.encode_as_pieces(sentence)
    print('Tokenized sentence:', tokens)
    print('-------------------------------')

Original sentence: Jane lend $100.50 to Peter early this morning.
Tokenized sentence: ['▁', 'J', 'an', 'e', '▁l', 'en', 'd', '▁', '$', '1', '0', '0', '.', '5', '0', '▁to', '▁P', 'e', 'ter', '▁e', 'ar', 'ly', '▁this', '▁morning', '.']
-------------------------------
Original sentence: @ref, This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
Tokenized sentence: ['▁', '@', 're', 'f', ',', '▁This', '▁is', '▁a', '▁co', 'o', 'o', 'o', 'l', '▁', '#', 'd', 'um', 'm', 'y', 's', 'm', 'i', 'le', 'y', ':', '▁', ':', '-', ')', '▁', ':', '-', 'P', '▁', '<3', '▁and', '▁some', '▁', 'ar', 'r', 'ow', 's', '▁', '<', '▁', '>', '▁', '-', '>', '▁', '<', '--']
-------------------------------
Original sentence: Mr. Smith lives in New York City.
Tokenized sentence: ['▁M', 'r', '.', '▁S', 'm', 'i', 'th', '▁live', 's', '▁in', '▁', 'N', 'e', 'w', '▁', 'Y', 'or', 'k', '▁C', 'ity', '.']
-------------------------------
Original sentence: The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.
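
The `▁` symbol in the pieces above marks the position of a space in the original text, so the pieces can be turned back into text without any language-specific rules. The sketch below illustrates that idea on the first sentence's output; `pieces_to_text` is a hypothetical helper that only approximates what `decode_pieces` does in simple cases like this one.

```python
def pieces_to_text(pieces):
    """Concatenate SentencePiece pieces and turn the '▁' marker back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip(" ")


pieces = ['▁', 'J', 'an', 'e', '▁l', 'en', 'd', '▁', '$', '1', '0', '0', '.', '5', '0',
          '▁to', '▁P', 'e', 'ter', '▁e', 'ar', 'ly', '▁this', '▁morning', '.']
print(pieces_to_text(pieces))
```

Unlike the uncased WordPiece output shown earlier, the pieces keep the original casing, so the sentence is recovered exactly.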

### BPE implementation in SentencePiece

The SentencePiece package also includes an implementation of the BPE tokenizer, selected by passing `model_type='bpe'` to the trainer.
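
As a rough illustration of the idea behind BPE training, the sketch below repeatedly merges the most frequent pair of adjacent symbols in a toy word list. It is a simplified teaching version, not SentencePiece's implementation (which works on raw sentences and is heavily optimized); the function `learn_bpe_merges` and the word counts are assumptions made up for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: each word repeated according to an invented frequency.
words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, corpus = learn_bpe_merges(words, num_merges=3)
print(merges)
print(dict(corpus))
```

Each merge adds one new symbol to the vocabulary, so the number of merges plays the role that `vocab_size` plays when training with SentencePiece.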

In [10]:
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor
from nltk.corpus import gutenberg

SentencePieceTrainer.train(input='C:\\nltk_data\\corpora\\gutenberg\\carroll-alice.txt', model_prefix='bpe_model', vocab_size=1000, model_type='bpe')

raw_text = gutenberg.raw('carroll-alice.txt')
print(raw_text[:500])

# Initialize the processor and load the model from a file
model = SentencePieceProcessor()
model.load('bpe_model.model')

# Encode the text as token ids. It returns a list of token IDs corresponding to the input text.
token_ids = model.encode_as_ids(raw_text)
print(token_ids[:500])

# Encode the text as text pieces. It returns a list of tokens corresponding to the input text.
tokens_text = model.encode_as_pieces(raw_text)
print(tokens_text[:500])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an
[939, 0, 967, 61, 952, 947, 40, 673, 89, 566, 47, 166, 439, 950, 159, 276, 337, 705, 38, 171, 41, 153, 28, 939, 0, 912, 59, 963, 128, 136, 8, 334, 966, 970, 943, 45, 62, 65, 856, 20, 234, 134, 210, 526, 35, 717, 223, 276, 70, 6, 38, 99, 67, 8, 19, 64, 962, 955, 23, 35, 56, 489, 455, 20, 127, 972, 464, 207, 271, 48, 44, 106, 54, 653, 21, 231, 8, 19, 121, 70, 6, 38, 99, 65, 571, 17, 955, 151, 36, 106, 196, 542, 216, 566, 207, 272, 69, 842, 947, 47, 36, 955, 15, 159, 188, 161, 8, 615, 35, 5, 19, 1