## Analysis: Tokenizer, Model and Adding New Vocabulary

The Hugging Face `transformers` library tightly couples the `Tokenizer` and `Model` objects. If you want to use a particular model (e.g., `BertModel`), you also need its accompanying tokenizer. The tokenizer holds a `{token[str]: index[int]}` vocabulary mapping whose size must match the number of rows in the model's input embedding layer. This mapping converts text (a `string`) into a tensor of integer ids, which is provided as input to the model.
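
To make the coupling concrete, here is a minimal sketch with a hypothetical five-entry vocabulary and a plain Python list standing in for the embedding layer. The names `toy_vocab`, `toy_embeddings`, and `toy_encode` are illustrative assumptions, not part of `transformers`, and the `[UNK]` fallback is a simplification of what a real tokenizer does.

```python
# Toy illustration only: a hypothetical five-entry vocabulary, not BERT's real one.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "dog": 3, "barked": 4}

# One embedding row per vocabulary entry (hidden size 4 here instead of 768).
hidden_size = 4
toy_embeddings = [[float(idx)] * hidden_size for idx in range(len(toy_vocab))]


def toy_encode(text):
    """Map whitespace-separated words to ids, falling back to [UNK]."""
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.lower().split()]


ids = toy_encode("The dog barked")
vectors = [toy_embeddings[i] for i in ids]  # the input layer acts as a row lookup
print(ids)           # [2, 3, 4]
print(len(vectors))  # 3
```

Every id the tokenizer can emit must be a valid row index in the table, which is why the two sizes have to agree.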

This notebook examines this relationship by comparing two words: a common word for which the tokenizer has a vocab entry, and an **out-of-vocabulary (OOV)** word for which there is none. We start with the following questions:

- how does the `Tokenizer` handle _known_ vs. _unknown_ words?
- what is the relationship between a tokenizer's vocab size and a model's input layer?
- how can we **add** new words to a tokenizer's vocabulary? And, how does this affect the model?

In [1]:
from typing import List, Tuple
from transformers import BertTokenizer, BertModel

### Utility Functions

These functions are used throughout the notebook. Return to this cell for reference as needed.

In [2]:
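def show_encoding(tokenizer, text: str) -> List[Tuple[int, str]]:
    """Show (token id, decoded token) pairs for an example sentence."""
    # encode the text and flatten the (1, seq_len) tensor of ids to 1-D
    encoding = tokenizer(text, return_tensors='pt')['input_ids'].flatten()
    # decode each id on its own so ids and tokens line up one-to-one
    # NOTE: in the outputs recorded below, each decoded token is shown with
    # its characters separated by spaces (e.g. 't h e' for 'the')
    return [(_enc.item(), tokenizer.decode(_enc)) for _enc in encoding]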

def check_tokenizer_model_size_compatability(tokenizer, model):
    """Check if tokenizer and model are compatible based on the
    tokenizer's vocabulary size and the model's input layer dimension."""
    vocab_size = len(tokenizer)  # includes any added tokens
    emb = model.get_input_embeddings()
    input_size = emb.num_embeddings  # number of rows in the embedding matrix

    assert vocab_size == input_size, (
        f"Tokenizer has size {vocab_size}, "
        f"but Model has input dimension of {input_size}. "
        f"Difference of ({abs(vocab_size - input_size)})."
    )
    print(f"Tokenizer vocab size and model input layer dimension match (size={vocab_size})")

### Artifacts: Checkpoint, Tokenizer, Model

In [3]:
model_checkpoint = 'bert-base-uncased'

In [4]:
tokenizer = BertTokenizer.from_pretrained(model_checkpoint)

In [5]:
bert_model = BertModel.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Comparing Known and Unknown Words in Tokenizer

In [6]:
# these are our test words; one is in vocab, the other is not.
common_word = 'dog' 
oov_word = 'gobbledygook'

In [7]:
## example of tokenizer encoding/decoding of text with common word
show_encoding(tokenizer, text=f"the {common_word} barked loudly!")

[(101, '[ C L S ]'),
 (1996, 't h e'),
 (3899, 'd o g'),
 (17554, 'b a r k e d'),
 (9928, 'l o u d l y'),
 (999, '!'),
 (102, '[ S E P ]')]

In [8]:
## example of tokenizer encoding/decoding of text with out of vocabulary word
show_encoding(tokenizer, text=f"ever heard of {oov_word}?")

[(101, '[ C L S ]'),
 (2412, 'e v e r'),
 (2657, 'h e a r d'),
 (1997, 'o f'),
 (2175, 'g o'),
 (12820, '# # b b l e d'),
 (2100, '# # y'),
 (3995, '# # g o'),
 (6559, '# # o k'),
 (1029, '?'),
 (102, '[ S E P ]')]

In [9]:
# tokenizer has a vocab entry for "dog"
show_encoding(tokenizer, common_word)

[(101, '[ C L S ]'), (3899, 'd o g'), (102, '[ S E P ]')]

In [10]:
# tokenizer does not have a vocab entry for `oov_word`
show_encoding(tokenizer, oov_word)

[(101, '[ C L S ]'),
 (2175, 'g o'),
 (12820, '# # b b l e d'),
 (2100, '# # y'),
 (3995, '# # g o'),
 (6559, '# # o k'),
 (102, '[ S E P ]')]

In [11]:
# get encoded representation of example words
# see how the oov word encoding is much longer than the common word encoding?
common_enc = tokenizer(common_word, return_tensors='pt')['input_ids']
print(f"encoding for '[{common_word}]':\n- {common_enc}\n- size: {common_enc.size()}")

oov_enc = tokenizer(oov_word, return_tensors='pt')['input_ids']
print(f"\nencoding for '[{oov_word}]':\n- {oov_enc}\n- size: {oov_enc.size()}")

encoding for '[dog]':
- tensor([[ 101, 3899,  102]])
- size: torch.Size([1, 3])

encoding for '[gobbledygook]':
- tensor([[  101,  2175, 12820,  2100,  3995,  6559,   102]])
- size: torch.Size([1, 7])
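
The split of the OOV word into several `##`-prefixed pieces comes from WordPiece tokenization, which greedily matches the longest vocabulary entry from left to right. The sketch below illustrates that idea under stated assumptions: `toy_wordpiece` is a hypothetical helper, and `toy_vocab` contains only `dog` and the pieces seen in the outputs above, not BERT's full vocabulary.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption): only the entries needed for this example.
toy_vocab = {"dog", "go", "##bbled", "##y", "##go", "##ok"}
print(toy_wordpiece("dog", toy_vocab))
print(toy_wordpiece("gobbledygook", toy_vocab))
```

A word with its own vocab entry stays whole, while a word without one is covered by the longest pieces that do exist, which is why its encoding is longer.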


### Inspection of Vocabulary Size and Input Layer Dimension

In [12]:
# number of entries in vocab, size of input layer
len(tokenizer), bert_model.get_input_embeddings()

(30522, Embedding(30522, 768, padding_idx=0))

In [13]:
check_tokenizer_model_size_compatability(tokenizer, bert_model)

Tokenizer vocab size and model input layer dimension match (size=30522)


In [14]:
# pass encoding for common word to model
# output shape is (batch size, number of tokens, hidden size) = (1, 3, 768)
common_out = bert_model(common_enc)
print('output shape:', common_out[0].size())

output shape: torch.Size([1, 3, 768])


In [15]:
# pass oov word encoding to model
# output shape follows the longer encoding: (1, 7, 768)
oov_out = bert_model(oov_enc)
print('output shape:', oov_out[0].size())

output shape: torch.Size([1, 7, 768])


### Update Tokenizer Vocabulary

The `Tokenizer` class allows _new_ words to be added to the vocabulary mapping. How does increasing the vocab size (i.e., `len(tokenizer)`) affect the input layer of the accompanying model?

In [16]:
# new vocab can be added to the tokenizer...
starting_vocab_size = len(tokenizer)
print(f"starting size of tokenizer vocab: {starting_vocab_size}")

starting size of tokenizer vocab: 30522


In [17]:
# add the `oov_word` to vocab
tokenizer.add_tokens(new_tokens=[oov_word])
print(f"updated size of tokenizer vocab: {len(tokenizer)}")

updated size of tokenizer vocab: 30523


In [18]:
assert len(tokenizer) > starting_vocab_size, "tokenizer vocab size has not changed!"

In [19]:
# check to see how tokenizer has changed with oov word
# so much better, right ... ?
show_encoding(tokenizer, oov_word)

[(101, '[ C L S ]'), (30522, 'g o b b l e d y g o o k'), (102, '[ S E P ]')]

In [20]:
# we now have a single integer (i.e., vocab entry) to represent the `oov_word`.
oov_enc_updated = tokenizer(oov_word, return_tensors='pt')['input_ids']
print(oov_enc_updated)

tensor([[  101, 30522,   102]])
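
The new id (30522) equals the vocabulary size before the addition, because ids are zero-based and the added token takes the next free index. A minimal sketch of that bookkeeping follows, using a hypothetical `toy_add_tokens` helper on a plain dict. This is an assumption for illustration and not how `transformers` stores added tokens internally.

```python
def toy_add_tokens(vocab, new_tokens):
    """Append unseen tokens to `vocab` with the next free ids; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # ids are zero-based, so len(vocab) is the next free id
            added += 1
    return added


# Hypothetical three-entry vocabulary (assumption), not BERT's real one.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "dog": 2}
size_before = len(toy_vocab)
num_added = toy_add_tokens(toy_vocab, ["gobbledygook", "dog"])  # "dog" already exists
print(num_added)
print(toy_vocab["gobbledygook"] == size_before)
```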


In [21]:
# !! How does our updated vocabulary affect the accompanying model?
bert_model(oov_enc_updated)

IndexError: index out of range in self

In [None]:
# !! the model's input layer still has 30522 rows, while the updated tokenizer now has 30523 entries
bert_model.get_input_embeddings()

Embedding(30522, 768, padding_idx=0)

In [None]:
check_tokenizer_model_size_compatability(tokenizer, bert_model)

AssertionError: Tokenizer has size 30523, but Model has input dimension of 30522. Difference of (1).
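
The `IndexError` follows directly from this mismatch: an embedding layer looks up one row per token id, and id 30522 points one row past the end of a table whose valid ids run from 0 to 30521. The sketch below reproduces the same failure with a small Python list in place of the real embedding matrix; `toy_table` and `toy_lookup` are hypothetical names used only for illustration.

```python
# Toy stand-in for an embedding layer: 5 rows (vocabulary entries) of width 3.
num_embeddings, hidden_size = 5, 3
toy_table = [[0.0] * hidden_size for _ in range(num_embeddings)]


def toy_lookup(table, token_ids):
    """Return one row per token id, as an embedding lookup would."""
    return [table[token_id] for token_id in token_ids]


valid_ids = [0, 4]        # valid ids run from 0 to num_embeddings - 1
new_id = num_embeddings   # the first added token receives the next free id

print(len(toy_lookup(toy_table, valid_ids)))
try:
    toy_lookup(toy_table, [new_id])
except IndexError as err:
    print("IndexError:", err)
```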

In [None]:
# (potential) solution: update model input layer to match updated tokenizer
bert_model.resize_token_embeddings(len(tokenizer))
bert_model.get_input_embeddings()

Embedding(30523, 768)

In [None]:
check_tokenizer_model_size_compatability(tokenizer, bert_model)

Tokenizer vocab size and model input layer dimension match (size=30523)
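
Conceptually, resizing keeps the existing embedding rows and appends freshly initialized rows for the new ids. The sketch below shows that idea on a plain list. `toy_resize` is a hypothetical helper, and the Gaussian initialization with standard deviation 0.02 is an assumption for illustration; how the library initializes the new rows depends on the model configuration and library version.

```python
import random


def toy_resize(table, new_num_embeddings, seed=0):
    """Return a table with `new_num_embeddings` rows: old rows copied, new rows random."""
    rng = random.Random(seed)
    hidden_size = len(table[0])
    # Copy the existing rows so the learned values are preserved.
    resized = [row[:] for row in table[:new_num_embeddings]]
    # Append untrained rows until the table matches the new vocabulary size.
    while len(resized) < new_num_embeddings:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(hidden_size)])
    return resized


old_table = [[float(i)] * 3 for i in range(5)]
new_table = toy_resize(old_table, 6)
print(len(old_table), len(new_table))
print(new_table[:5] == old_table)
```

The appended row makes the lookup valid, but its values carry no learned meaning yet.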


In [None]:
# try again to pass (updated) oov_word encoding to (updated) model
oov_out = bert_model(oov_enc_updated)
oov_out = oov_out[0]

In [None]:
print("output shape:", oov_out.size())

output shape: torch.Size([1, 3, 768])


### Conclusion

We have added a _new_ vocabulary word to our `Tokenizer` and resized the input layer of our `Model` to match. The output of a _forward pass_ now has the shape we would expect from the input encoding. However, what does the output tensor for the out-of-vocabulary word _mean_?

Our assumption is that this output is not yet a strong representation, because the weights of the newly added embedding row have not been trained. If we resume training on a dataset that contains many examples of the OOV term, the representation should be updated and improved.