# Setup

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenization
The objective of text tokenization is to break the text into smaller units which are often more linguistically meaningful.

These smaller linguistic units are usually easier to deal with computationally and semantically.

## Sentence Tokenization

- The `sent_tokenize()` function uses an instance of `PunktSentenceTokenizer` from the `nltk.tokenize.punkt` module.

- To process large amounts of data, it is recommended to load the pre-trained `PunktSentenceTokenizer` once and call its `tokenize()` method for each text.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
sents = 'Deep in the human unconscious is a pervasive need for a logical universe that makes sense. But, the real universe is always one step beyond logic.'
for s in sent_tokenize(sents):
    print(s+'\n')

Deep in the human unconscious is a pervasive need for a logical universe that makes sense.

But, the real universe is always one step beyond logic.



## Word Tokenization
Similarly, the `word_tokenize()` function is a wrapper that first splits the text into sentences and then calls the `tokenize()` method on an instance of the `TreebankWordTokenizer` class for each sentence.

In [None]:
from nltk.tokenize import word_tokenize
print(word_tokenize(sents))
print(len(word_tokenize(sents)))

['Deep', 'in', 'the', 'human', 'unconscious', 'is', 'a', 'pervasive', 'need', 'for', 'a', 'logical', 'universe', 'that', 'makes', 'sense', '.', 'But', ',', 'the', 'real', 'universe', 'is', 'always', 'one', 'step', 'beyond', 'logic', '.']
29


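To process large amounts of data, create a single instance of `TreebankWordTokenizer` and call its `tokenize()` method repeatedly, which is more efficient. The result of the following code is almost, but not exactly, the same as above: `TreebankWordTokenizer` expects one sentence at a time and only separates a period at the very end of its input, so when applied to the whole string it leaves `sense.` as a single token (28 tokens instead of 29).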

In [None]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

print(tokenizer.tokenize(sents))
print(len(tokenizer.tokenize(sents)))

['Deep', 'in', 'the', 'human', 'unconscious', 'is', 'a', 'pervasive', 'need', 'for', 'a', 'logical', 'universe', 'that', 'makes', 'sense.', 'But', ',', 'the', 'real', 'universe', 'is', 'always', 'one', 'step', 'beyond', 'logic', '.']
28
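The two outputs differ in length (29 versus 28 tokens) because `TreebankWordTokenizer` assumes its input is a single sentence. A minimal sketch, assuming the two sentences have already been separated (here by hand, standing in for `sent_tokenize()`), suggests that tokenizing sentence by sentence recovers the final period of the first sentence as its own token:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()  # create once, reuse for every sentence

# Sentences split by hand for illustration; in practice sent_tokenize() would do this.
sentences = [
    'Deep in the human unconscious is a pervasive need for a logical universe that makes sense.',
    'But, the real universe is always one step beyond logic.',
]

# Tokenize the joined text in one call, then tokenize each sentence separately.
whole_text_tokens = tokenizer.tokenize(' '.join(sentences))
per_sentence_tokens = [tok for sent in sentences for tok in tokenizer.tokenize(sent)]

print(len(whole_text_tokens), len(per_sentence_tokens))
print('sense.' in whole_text_tokens, 'sense.' in per_sentence_tokens)
```

Reusing one tokenizer instance across all sentences avoids rebuilding it on every call, which is the efficiency point made above.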


The `nltk` module also implements other, more task-oriented word tokenizers, which differ in how they handle punctuation and contractions.

### Comparing different word tokenizers

- `WordPunctTokenizer` should split all punctuation into separate tokens. Check whether that is true.
- `TreebankWordTokenizer` follows the Penn Treebank conventions for word tokenization.


In [None]:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
tbwt = TreebankWordTokenizer()

In [None]:
s = "Life is easy and beautiful, isn't?"

In [None]:
print(wpt.tokenize(s))

['Life', 'is', 'easy', 'and', 'beautiful', ',', 'isn', "'", 't', '?']


In [None]:
print(tbwt.tokenize(s))

['Life', 'is', 'easy', 'and', 'beautiful', ',', 'is', "n't", '?']


💬 Discuss the differences between the two tokenizers.
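
As a starting point for the discussion, `WordPunctTokenizer` is a regular-expression tokenizer. The sketch below approximates its behaviour with the standard `re` module, assuming the pattern `\w+|[^\w\s]+` (a run of word characters, or a run of non-space punctuation). The helper name `regex_word_punct` is made up for this illustration.

```python
import re

# Assumed pattern: a run of word characters, or a run of punctuation characters.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_word_punct(text):
    """Hypothetical helper: split text into alphanumeric runs and punctuation runs."""
    return WORD_PUNCT_PATTERN.findall(text)

s = "Life is easy and beautiful, isn't?"
print(regex_word_punct(s))
# ['Life', 'is', 'easy', 'and', 'beautiful', ',', 'isn', "'", 't', '?']
```

Because the apostrophe is not a word character, the contraction is cut into three pieces, whereas the Penn Treebank conventions keep `n't` together. Under this pattern, consecutive punctuation marks such as `?!` would stay together as one token rather than being split individually.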

## Subword Tokenization with Huggingface

BERT expects its input in a specific format, which includes:

* A special token, `[SEP]`, to mark the end of a sentence or the separation between two sentences
* A special token, `[CLS]`, at the beginning of the text. This token is used for classification tasks, but BERT expects it regardless of the application
* Tokens that conform to the fixed vocabulary used in BERT
* The token IDs for those tokens, from BERT’s tokenizer
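
A minimal sketch of what this formatting amounts to, using a tiny hand-made vocabulary rather than the real BERT one. The IDs are copied from the tokenizer output shown later in this section. The `toy_vocab` dictionary and the `encode_for_bert` function are hypothetical names used for illustration only.

```python
# Toy vocabulary: a handful of entries taken from bert-base-uncased (illustration only).
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "hello": 7592, ",": 1010, "how": 2129, "are": 2024, "you": 2017, "?": 1029,
}

def encode_for_bert(tokens, vocab):
    """Hypothetical helper: wrap tokens in [CLS] ... [SEP] and map them to IDs."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]
    return wrapped, ids

tokens, ids = encode_for_bert(["hello", ",", "how", "are", "you", "😁", "?"], toy_vocab)
print(tokens)
print(ids)
# [101, 7592, 1010, 2129, 2024, 2017, 100, 1029, 102]
```

The emoji is not in the vocabulary, so it maps to the `[UNK]` ID, which mirrors what the real tokenizer does in the output below.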


In [None]:
!pip install tokenizers

Collecting tokenizers
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 5.1 MB/s
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.3


In [None]:
# Download the vocabulary file of a pre-trained BERT tokenizer
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

--2021-11-25 13:28:00--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.198.200
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.198.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2021-11-25 13:28:00 (2.71 MB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]



In [None]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

In [None]:
output = tokenizer.encode('''Hello, y'all! How are you 😁 ? It is a great time to learn about word embeddings. I am trying very hard.''')
output

Encoding(num_tokens=33, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
print(output.tokens)

['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', 'it', 'is', 'a', 'great', 'time', 'to', 'learn', 'about', 'word', 'em', '##bed', '##ding', '##s', '.', 'i', 'am', 'trying', 'very', 'hard', '.', '[SEP]']


In [None]:
# Each ID is the index of the corresponding token in the BERT vocabulary file
print(output.ids)

[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 2009, 2003, 1037, 2307, 2051, 2000, 4553, 2055, 2773, 7861, 8270, 4667, 2015, 1012, 1045, 2572, 2667, 2200, 2524, 1012, 102]


💬 Do you see any subwords? Which are they and why?
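
One way to think about where the `##` pieces come from: WordPiece tokenizers typically split an out-of-vocabulary word greedily, taking the longest vocabulary entry that matches at each position and marking non-initial pieces with `##`. The sketch below illustrates that idea with a tiny hypothetical vocabulary (`toy_pieces`). It is not the `tokenizers` library's implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration; the real one is far larger.
toy_pieces = {"word", "learn", "em", "##bed", "##ding", "##s"}

print(wordpiece_split("embeddings", toy_pieces))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_split("word", toy_pieces))        # ['word']
print(wordpiece_split("😁", toy_pieces))          # ['[UNK]']
```

Frequent words such as `word` exist in the vocabulary as whole entries, while rarer words such as `embeddings` are assembled from smaller pieces, and symbols with no matching piece become `[UNK]`.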