# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are several ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokens).
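To make "assigning numbers to parts of words" concrete, here is a minimal sketch of a toy subword tokenizer. It uses greedy longest-match against a tiny vocabulary. Both the vocabulary and the matching rule are assumptions made for illustration: GPT-2's real tokenizer uses byte-pair encoding with a 50257-entry vocabulary, and `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names that are not part of any library.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# This is NOT how GPT-2 tokenizes (it uses byte-pair encoding); it only shows
# the two steps: string -> pieces, then pieces -> integer IDs.
TOY_VOCAB = {"token": 0, "iz": 1, "ation": 2, "un": 3, "believ": 4, "able": 5}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces


def toy_encode(word, vocab=TOY_VOCAB):
    """Map each piece of `word` to its integer ID."""
    return [vocab[piece] for piece in toy_tokenize(word, vocab)]


print(toy_tokenize("tokenization"))
print(toy_encode("tokenization"))
```

The same two steps appear in the real tokenizer used in this notebook: one part works only with strings, and another maps those strings to numbers.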

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

**Note**: If you're running this on the lab machines, you should **re-run the class setup script**:

In [None]:
!/home/cs/344/setup-cs344.sh #can run in shell

Then you will need to **LOG OUT AND LOG BACK IN**. (If you know what you're doing and want to avoid the log out: that added a definition of `HF_HOME` to `~/.profile`; you can set it here with `os.environ` if you want.)

Now let's install the library.

In [1]:
!pip install -q transformers[sentencepiece]



In [1]:
import torch
from torch import tensor

### Download and load the model

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id) #causal modeling

In [3]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


## Task

Consider the following phrase:

In [4]:
#phrase = "Bryan Reynolds hits baseballs farther than"
phrase = "To be or not to"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [5]:
batch = tokenizer(phrase, return_tensors='pt')  # a list of phrases can be passed to encode several at once
batch  # input_ids holds the integer token IDs as a PyTorch tensor; they are not converted back to strings yet

{'input_ids': tensor([[1675,  307,  393,  407,  284]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [6]:
batch['input_ids'].shape
# 1-by-5 tensor: a batch containing 1 phrase, which was split into 5 tokens

torch.Size([1, 5])

In [7]:
#tokenizer.convert_ids_to_tokens(batch['input_ids']) 
tokenizer.convert_ids_to_tokens(batch['input_ids'][0]) 

['ĠTo', 'Ġbe', 'Ġor', 'Ġnot', 'Ġto']

In [11]:
out = model(**batch, labels=batch['input_ids'])  # ** unpacks the batch dict into keyword arguments; passing labels makes the model also return a loss

In [12]:
vars(out).keys()  # names of the fields on the model's output object

dict_keys(['loss', 'logits', 'past_key_values', 'hidden_states', 'attentions', 'cross_attentions'])

In [13]:
out.logits.shape
# (batch size, number of input tokens, vocabulary size)
# The last dimension matches the 50257 strings in the tokenizer's vocabulary.

torch.Size([1, 5, 50257])

In [21]:
tokenizer.convert_ids_to_tokens(
    out.logits[0, 1].softmax(dim=0).topk(10).indices
)
# Position 1 is "Ġbe". Its logits score all 50257 vocabulary entries as the next token;
# softmax turns those scores into probabilities.
# These are predictions for the slot that "Ġor" occupies, since we indexed "Ġbe".
# The model is causal: it looks only at what comes before a position, never at the tokens that follow.

['Ġa',
 'Ġable',
 'Ġthe',
 'Ġsure',
 'Ġin',
 'Ġused',
 'Ġan',
 'Ġmore',
 'Ġon',
 'Ġhonest']
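The softmax-then-top-k step above can be sketched in plain Python. The five-entry vocabulary and the logit values below are made up for illustration, and `softmax` and `topk_indices` are hypothetical helpers standing in for the tensor methods `.softmax(dim=0)` and `.topk(k).indices`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_indices(values, k):
    """Indices of the k largest values, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


# Made-up vocabulary and logits; the real model scores 50257 entries per position.
toy_vocab = ["Ġa", "Ġable", "Ġthe", "Ġsure", "Ġin"]
toy_logits = [2.0, 1.5, 1.0, 0.5, -1.0]

probs = softmax(toy_logits)
top3 = topk_indices(probs, 3)
print([toy_vocab[i] for i in top3])
```

Because softmax preserves the ordering of the logits, taking the top-k of the probabilities selects the same indices as taking the top-k of the raw logits.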

In [23]:
model.forward
out = model.forward(**batch)  # labels=batch['input_ids'] is left out here, so no loss is computed
# input_ids and labels are the important parameters of forward

<bound method GPT2LMHeadModel.forward of GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inpl

In [26]:
# What the causal LM does internally: the transformer body produces hidden states, then lm_head maps them to logits
input_ids = batch['input_ids']
transformer_outputs = model.transformer(input_ids)
hidden_states = transformer_outputs[0]
lm_logits = model.lm_head(hidden_states)

In [30]:
hidden_states.shape 
model.lm_head 
model.lm_head.weight[257].shape
# lm_head is a linear layer: it takes the dot product of its input with each row of its weight matrix
# each token's logit is the dot product of that token's 768-value weight row with the 768-value hidden state

torch.Size([768])
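As a minimal sketch of that dot-product view, the toy example below uses 4-dimensional vectors instead of 768 and a 3-entry vocabulary instead of 50257. All numbers are made up, `dot` is a hypothetical helper, and any bias term is ignored.

```python
def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))


# Made-up hidden state for one position (the real one has 768 values).
hidden_state = [0.5, -1.0, 2.0, 0.0]

# Made-up weight matrix: one row per vocabulary entry (the real one has 50257 rows).
toy_lm_head_weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]

# One logit per vocabulary entry: that entry's weight row dotted with the hidden state.
toy_logits = [dot(row, hidden_state) for row in toy_lm_head_weight]
print(toy_logits)
```

Indexing a single row, as in `model.lm_head.weight[257]`, picks out the vector that is dotted with the hidden state to score that one token.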

In [22]:
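out.logits[0, -1].shape  # logits at the last position: scores for the token that would follow the phrase
next_token_logits = out.logits[0, -1]
# these are raw scores, not probabilities yet
next_token_logits.max()  # the largest logit can be negative

next_token_probs = next_token_logits.softmax(dim=0)  # softmax maps logits to probabilities; adding the same constant to every logit leaves the result unchanged
next_token_probs.max()  # the highest probability; its index (argmax) can be looked up in the token list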

tensor(0.6485, grad_fn=<MaxBackward1>)

In [56]:
# next_token_probs.argmax() would return only the index of the single most likely token
next_token_probs.topk(5).indices

tensor([ 262,   13,  257,   11, 1363])

In [57]:
next_token_probs[262]

tensor(0.1440, grad_fn=<SelectBackward0>)

In [58]:
#tokenizer.convert_ids_to_tokens([262]) 
tokenizer.convert_ids_to_tokens(next_token_probs.topk(5).indices) 

['Ġthe', '.', 'Ġa', ',', 'Ġhome']

Each position predicts the next token given only the tokens before it, for example:

p(world | for God so loved the)
p(God | for)
p(so | for God)


In [59]:
out.loss

tensor(8.9839, grad_fn=<NllLossBackward0>)
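The loss above is a cross-entropy over next-token predictions. As a sketch of the idea, and assuming the usual causal-LM convention that the logits at position t are scored against the token at position t+1, the toy function below averages the negative log-probability the model gave to each actual next token. The IDs and logits are made up, and `log_softmax` and `causal_lm_loss` are hypothetical helpers, not library functions.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one position, computed in a numerically stable way."""
    biggest = max(logits)
    log_total = biggest + math.log(sum(math.exp(x - biggest) for x in logits))
    return [x - log_total for x in logits]


def causal_lm_loss(logits_per_position, input_ids):
    """Mean negative log-probability of each actual next token."""
    losses = []
    # The last position has no following token to score, so it is skipped.
    for t in range(len(input_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[input_ids[t + 1]])
    return sum(losses) / len(losses)


# Made-up example: 3 tokens, a 3-entry vocabulary, and a model that is
# completely unsure (all logits equal), so every token gets probability 1/3.
toy_ids = [0, 2, 1]
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss = causal_lm_loss(toy_logits, toy_ids)
print(loss)
```

A model that guesses uniformly over a vocabulary of size V gets a loss of ln V, so lower values mean the model put more probability on the tokens that actually came next.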

For each of the 5 tokens we passed, the model computes a logit for every one of the 50257 entries in the vocabulary. A higher logit means a better score for that entry as the next token.
   w1    |      |
   w2    |      |
   w3    |      |
   w4    |      |
   w5    |      |
        for    God

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `convert_ids_to_tokens` and (2) using `tokenizer.decode`.

In [9]:
# using convert_ids_to_tokens
# your code here

' I visited Muskegon'

In [10]:
# your code here

' I visited Muskegon'

### Applying what you learned

5: Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input.) Call the result `output_ids`.


In [11]:
# your code here

tensor([[  314,  8672,  2629,   365, 14520,    11,   290,   314,   373,  6655,
           284,  1064,   326,   262,  1748,   550,   407,   587,  1498,   284]])

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [12]:
# your code here

' I visited Muskegon, and I was surprised to find that the city had not been able to'

Note: `generate` uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.
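To see how these options differ from greedy decoding, here is a minimal sketch of choosing one next token from made-up logits. It assumes top-k sampling keeps the k highest-scoring entries, renormalizes them with softmax, and draws from that distribution. `greedy_pick` and `sample_top_k` are hypothetical helpers, not the library's implementation.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the single highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k, rng):
    """Keep the k best tokens, renormalize, and draw one at random."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


toy_logits = [3.0, 2.5, 0.1, -1.0, 2.0]  # made-up scores for a 5-entry vocabulary
rng = random.Random(0)  # seeded so repeated runs draw the same sequence

print(greedy_pick(toy_logits))
samples = [sample_top_k(toy_logits, k=3, rng=rng) for _ in range(20)]
print(samples)
```

Greedy decoding returns the same token every time, while sampling can return any of the k kept tokens. With `k=1` the two coincide.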

## Analysis

Q1: Write a brief explanation of what a tokenizer does. Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

*your response here*

Q2: What are the smallest and largest numbers you've seen in `input_ids`? How does this relate to the number of words in the tokenizer's vocabulary? (See the `print` statement just after loading the model.)

*your response here*

Q3: What do you think the `Ġ` means? (Hint: it replaces a single well-known character.)

*your response here*

Q4: Suppose you add some personal flair to your writing by adding some extra syllables to the end of some words. Explain what this tokenizer will do with your embellished writing.

*your response here*