## Additional Side Notes for Lesson 3

### Tokenization

What does it mean to 'tokenize' a phrase? Should we simply split on spaces? What are some of the subtleties?

Some of the issues could be:
* **Sparsity and vocabulary size:**
   - are 'swim', 'swimming', 'run', 'running' four completely independent tokens?
   - if those are fully independent then the vocabulary will need to be very large
   
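* **Missing important relationships**
   - if 'swim' and 'swimming' are unrelated tokens, nothing in the vocabulary tells the model that they share a stem; it has to learn that from data alone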
   
* **Dealing with numbers**
   - what about 'ranked place 7' vs '123 horses on the farm', vs '12765 BC was a rainy year'. Are we really reserving a token for each number that could show up?
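
To make these questions concrete, here is a minimal sketch of the naive approach of splitting on whitespace only. The helper name `whitespace_tokenize` and the two sample phrases are illustrative assumptions, not part of the lesson code.

```python
def whitespace_tokenize(phrase):
    """Naive baseline: split on runs of whitespace and nothing else."""
    return phrase.split()

# Illustrative phrases (assumed for this sketch)
phrases = [
    "I swim. She is swimming, he likes swimming!",
    "The sky is blue.So is the ocean.",
]

vocabulary = set()
for phrase in phrases:
    tokens = whitespace_tokenize(phrase)
    vocabulary.update(tokens)
    print(tokens)

# Punctuation stays glued to the words, so 'swimming,' and 'swimming!'
# end up as two separate vocabulary entries.
print(sorted(vocabulary))
```

With this baseline, every combination of a word and its trailing punctuation becomes its own vocabulary entry, and a missing space ('blue.So') fuses two words into one token.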
 
In the following, we look at three tokenizers and see how they handle these issues: the tokenizers of Google's BERT and of RoBERTa, both obtained from Hugging Face (https://huggingface.co/), and NLTK's `word_tokenize`.

(Note: Both BERT and RoBERTa are pre-trained Transformer models that generate word embeddings, and more, in a sophisticated and context-aware way. They are covered in Week 9, but it is ok to give a preview here.)

We start by importing the `transformers` library and some other useful tools:

In [1]:
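from transformers import BertTokenizer, RobertaTokenizer
# word_tokenize needs NLTK's Punkt data, e.g. nltk.download('punkt')
# (newer NLTK versions may ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize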

In [2]:
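# Load the pre-trained tokenizers (their vocabulary files are downloaded on first use)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def show_tokenizations(phrase):
    """Print the tokens that each of the three tokenizers produces for `phrase`."""
    print('BERT:\n\t', bert_tokenizer.tokenize(phrase))
    print('Roberta:\n\t', roberta_tokenizer.tokenize(phrase))
    print('NLTK:\n\t', word_tokenize(phrase))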

In [3]:
show_tokenizations("This ship was built 12345 BC")

BERT:
	 ['This', 'ship', 'was', 'built', '123', '##45', 'BC']
Roberta:
	 ['This', 'Ġship', 'Ġwas', 'Ġbuilt', 'Ġ123', '45', 'ĠBC']
NLTK:
	 ['This', 'ship', 'was', 'built', '12345', 'BC']


**Observations:**

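* Numbers are handled differently: BERT and RoBERTa both split '12345' into the pieces '123' and '45', while NLTK keeps it as a single token.
* 'Continuation markers' are handled differently: BERT prefixes a piece that continues the previous token with '##', whereas RoBERTa prefixes a token that follows a space with 'Ġ'.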
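
The two marker conventions can be made explicit by grouping the pieces back into whitespace-separated words. This is a minimal sketch that works on the token lists printed above; the helper names `group_bert_pieces` and `group_roberta_pieces` are hypothetical, and the logic assumes plain ASCII input.

```python
# Token lists copied from the output above
bert_tokens = ['This', 'ship', 'was', 'built', '123', '##45', 'BC']
roberta_tokens = ['This', 'Ġship', 'Ġwas', 'Ġbuilt', 'Ġ123', '45', 'ĠBC']

def group_bert_pieces(tokens):
    """Merge pieces marked with '##' into the token that precedes them."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]   # continuation: append without the marker
        else:
            words.append(token)      # unmarked token starts a new word
    return words

def group_roberta_pieces(tokens):
    """Start a new word at each 'Ġ' (preceding space); otherwise continue."""
    words = []
    for token in tokens:
        if token.startswith('Ġ'):
            words.append(token[1:])  # marker means "a space came before"
        elif words:
            words[-1] += token       # no marker: continues the previous word
        else:
            words.append(token)      # very first token has no marker
    return words

print(group_bert_pieces(bert_tokens))
print(group_roberta_pieces(roberta_tokens))
```

Note the opposite defaults: in BERT an unmarked token starts a new word, while in RoBERTa an unmarked token continues the previous one, so punctuation without a preceding space would be attached to the word before it.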

In [4]:
show_tokenizations("The dog sniffed the food. He bunkered it.")

BERT:
	 ['The', 'dog', 'sniffed', 'the', 'food', '.', 'He', 'bunker', '##ed', 'it', '.']
Roberta:
	 ['The', 'Ġdog', 'Ġsniff', 'ed', 'Ġthe', 'Ġfood', '.', 'ĠHe', 'Ġbun', 'kered', 'Ġit', '.']
NLTK:
	 ['The', 'dog', 'sniffed', 'the', 'food', '.', 'He', 'bunkered', 'it', '.']


**Observations:**

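* The past tense of less common verbs is handled differently: BERT keeps 'sniffed' whole but splits 'bunkered' into 'bunker' + '##ed'; RoBERTa splits both ('Ġsniff' + 'ed' and 'Ġbun' + 'kered'); NLTK keeps both whole.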

In [5]:
show_tokenizations("Serendipity is on lists of untranslatebale words")

BERT:
	 ['Ser', '##end', '##ip', '##ity', 'is', 'on', 'lists', 'of', 'un', '##tra', '##ns', '##late', '##bal', '##e', 'words']
Roberta:
	 ['S', 'ere', 'nd', 'ip', 'ity', 'Ġis', 'Ġon', 'Ġlists', 'Ġof', 'Ġunt', 'rans', 'late', 'b', 'ale', 'Ġwords']
NLTK:
	 ['Serendipity', 'is', 'on', 'lists', 'of', 'untranslatebale', 'words']


**Observations:**

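* Rare words such as 'Serendipity' and the (misspelled) 'untranslatebale' are broken into many word pieces, and the pieces differ between BERT and RoBERTa. NLTK leaves each word intact.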

In [6]:
show_tokenizations("The sky is blue.So is the ocean. Both are blue.")

BERT:
	 ['The', 'sky', 'is', 'blue', '.', 'So', 'is', 'the', 'ocean', '.', 'Both', 'are', 'blue', '.']
Roberta:
	 ['The', 'Ġsky', 'Ġis', 'Ġblue', '.', 'So', 'Ġis', 'Ġthe', 'Ġocean', '.', 'ĠBoth', 'Ġare', 'Ġblue', '.']
NLTK:
	 ['The', 'sky', 'is', 'blue.So', 'is', 'the', 'ocean', '.', 'Both', 'are', 'blue', '.']


**Observations:**

* Missing spaces matter ('... blue.So is...'): BERT and RoBERTa still separate 'blue', '.', and 'So', while NLTK keeps 'blue.So' as a single token.

In [7]:
show_tokenizations("Swimming, biking are fun. Best is swimming.")

BERT:
	 ['Swimming', ',', 'bi', '##king', 'are', 'fun', '.', 'Best', 'is', 'swimming', '.']
Roberta:
	 ['Sw', 'imming', ',', 'Ġbiking', 'Ġare', 'Ġfun', '.', 'ĠBest', 'Ġis', 'Ġswimming', '.']
NLTK:
	 ['Swimming', ',', 'biking', 'are', 'fun', '.', 'Best', 'is', 'swimming', '.']


**Observations:**

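* Different splittings: BERT keeps 'Swimming' whole but splits 'biking' into 'bi' + '##king', while RoBERTa does the opposite ('Sw' + 'imming', but 'Ġbiking').
* The capitalized first token of a sentence is handled differently: RoBERTa splits the sentence-initial 'Swimming' but keeps 'Ġswimming' whole later on, whereas BERT keeps both whole.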

In [8]:
show_tokenizations("Don't answer me!")

BERT:
	 ['Don', "'", 't', 'answer', 'me', '!']
Roberta:
	 ['Don', "'t", 'Ġanswer', 'Ġme', '!']
NLTK:
	 ['Do', "n't", 'answer', 'me', '!']


**Observations:**

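* Contractions such as "Don't" are handled differently: BERT splits it into 'Don', "'", 't'; RoBERTa into 'Don', "'t"; NLTK into 'Do', "n't".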

Does all of that matter? YES!!

* Model performance is affected
* The code needed to stitch a sentence back together from tokens depends on the tokenizer
* ....
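
As a rough illustration of the stitching problem, the sketch below rebuilds a string from the token lists printed for the 'blue.So' example. It assumes plain ASCII text, where RoBERTa's 'Ġ' stands for a single preceding space; the function names `stitch_bert` and `stitch_roberta` are hypothetical. In practice, the Hugging Face tokenizers offer `convert_tokens_to_string` for this purpose.

```python
original = "The sky is blue.So is the ocean. Both are blue."

# Token lists copied from the outputs above
bert_tokens = ['The', 'sky', 'is', 'blue', '.', 'So', 'is', 'the', 'ocean',
               '.', 'Both', 'are', 'blue', '.']
roberta_tokens = ['The', 'Ġsky', 'Ġis', 'Ġblue', '.', 'So', 'Ġis', 'Ġthe',
                  'Ġocean', '.', 'ĠBoth', 'Ġare', 'Ġblue', '.']

def stitch_bert(tokens):
    """Join tokens with spaces, then merge '##' continuation pieces."""
    return " ".join(tokens).replace(" ##", "")

def stitch_roberta(tokens):
    """Concatenate tokens and turn each 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

print(stitch_bert(bert_tokens))
print(stitch_roberta(roberta_tokens))

# Compare both reconstructions with the original phrase
print(stitch_bert(bert_tokens) == original)
print(stitch_roberta(roberta_tokens) == original)
```

For this phrase, the RoBERTa tokens carry enough information to recover the original spacing, while the BERT tokens do not record whether a space preceded a punctuation mark, so the naive join puts a space around every '.'.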