<a href="https://colab.research.google.com/github/daka13/HowLLMsWork/blob/main/David's_Word_prediction_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling word tokens with language models

On Wednesday we learned about tensors (multidimensional arrays) and how they represent the state of language models.

We also looked at how models represent words as "tokens", and how they generate vector representations.

Today we'll continue this and also look at conditional generation, word probability, and in-context prompting.

Start by saving a copy of this notebook. You will add results to [this shared document](https://docs.google.com/document/d/10ZcQRt-SZLr6mmzs2qMzd6ggErkFmxP8oWnifRLkbws/edit?usp=sharing).

The code currently does not use a GPU. You can stay with the default CPU runtime.

Colab notebooks do not have the Hugging Face `transformers` library by default. Use an inline `pip` call to add it to the current runtime environment. You may need to repeat this each time the runtime disconnects.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.0-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
import time
import numpy as np

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-410m" ## <- change this

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)



## Here are two functions for returning model outputs

The first runs the model once over the prompt. It returns the vector representation (hidden state) of each input token at each layer, plus the logits: a score for every vocabulary entry as the next token at each position.

The second generates a specified number of additional tokens and returns only the resulting token ids.

In [None]:
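def get_output(prompts):
  """Run one forward pass; return logits and per-layer hidden states."""

  start = time.time()
  # Reuse the end-of-sequence token for padding so several prompts can be batched
  tokenizer.pad_token = tokenizer.eos_token
  input_tensors = tokenizer(prompts, return_tensors="pt", padding=True)

  model_output = model(
      input_tensors["input_ids"],
      attention_mask=input_tensors["attention_mask"], # marks real tokens vs. padding
      output_hidden_states = True
  )

  print("running time (s):", time.time() - start)

  return model_output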

def generate(prompts, num_tokens=10):
  """Sample num_tokens additional tokens after the prompt; return token ids."""

  start = time.time()
  # Reuse the end-of-sequence token for padding so several prompts can be batched
  tokenizer.pad_token = tokenizer.eos_token
  input_tensors = tokenizer(prompts, return_tensors="pt", padding=True)

  input_shape = input_tensors["input_ids"].shape
  input_length = input_shape[1]
  print("input has ", input_length, "tokens")

  model_output = model.generate(
      input_tensors["input_ids"],
      attention_mask=input_tensors["attention_mask"], # marks real tokens vs. padding
      pad_token_id=tokenizer.eos_token_id, # set explicitly instead of relying on the fallback
      do_sample=True,
      temperature=0.9, # don't change these settings for now, we'll come back to this!
      max_length=input_length + num_tokens, # includes prefix
  )

  print("running time (s):", time.time() - start)

  return model_output

## Try some prompts here

Try creating a template where the next predicted word(s) will answer some question.

For example:

    wren: bird, corgi: dog, egret:

or

    Review: "I was riveted to my seat. I loved the characters. The director deserves an Oscar!" Is the review positive or negative?

Try different models, different tasks, and various ways of formatting. Does it help to include the strings you want to output as options?
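
One way to experiment with formats is to build the prompt from a list of solved examples plus an unanswered query. The sketch below is only an illustration: `build_prompt` and its parameters are hypothetical names, not part of this notebook, and the separators are assumptions worth varying.

```python
def build_prompt(examples, query, pair_sep=": ", item_sep=", "):
    """Join solved (item, label) pairs, then end with the query and no label."""
    solved = [f"{item}{pair_sep}{label}" for item, label in examples]
    # The prompt stops right after the query's colon, so the next predicted word is the answer
    return item_sep.join(solved + [f"{query}{pair_sep.rstrip()}"])

prompt = build_prompt([("wren", "bird"), ("corgi", "dog")], "egret")
print(prompt)  # wren: bird, corgi: dog, egret:
```

The resulting string can be passed to `generate(prompt, num_tokens=3)` like any other prompt.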

In [None]:
generated_text = generate("Nowadays, I can't even speak my mind", num_tokens=50)



input has  10 tokens
running time (s): 12.986849308013916


In [None]:
"".join(tokenizer.batch_decode(generated_text[0]))

"Nowadays, I can't even speak my mind, I'm a little kid who can't say anything. I can't even stand the sound of my own voice.\n\nI've noticed that when I'm with a stranger, I often feel that he's a really normal person who is more"

## `logits` tell us what it *could* have output

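For each token in the input, the model returns a logit (an unnormalized score) for every vocabulary entry as the next token. Higher logits mean more likely continuations, and applying a softmax to them gives the estimated probability of each next word.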

In [None]:
model_output = get_output("Nowadays, I can't even speak my mind")

running time (s): 0.4209442138671875


In [None]:
model_output.keys()

odict_keys(['logits', 'past_key_values', 'hidden_states'])

In [None]:
model_output["logits"].shape

torch.Size([1, 10, 50304])
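
The three dimensions are the batch (one prompt), the 10 input positions, and one score per output slot of the model. As a hedged illustration of how such scores become probabilities, here is a softmax written in plain Python over made-up logits. The `softmax` helper and the toy numbers are assumptions for this sketch; in the notebook the equivalent call would be `torch.softmax(model_output["logits"], dim=-1)`.

```python
import math

def softmax(logits):
    """Convert a list of unnormalized scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - biggest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a three-word vocabulary
probs = softmax(toy_logits)
print(probs)
```

A gap of 1.0 between two logits corresponds to a probability ratio of e (about 2.7), so small differences in logits can mean large differences in probability.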

## Unpack these values by comparing to the vocabulary

We want to zip the list of predicted word scores with the list of possible output strings. There are two problems:
* The tokenizer's vocabulary is organized as a map from string -> int, rather than an array mapping int -> string
* The strings in the vocabulary use a byte-level encoding in which every byte, including a fragment of a multi-byte character, is shown as a printable character (for example, Ġ stands for a space)

The next section is code to put the vocab in a nicer format for sorting words. Much of it is taken from HF code. Don't expect it to make sense.

In [None]:
## This is some code I use often to convert the vocabulary format with Ġ for spaces
## to something more human readable. -DM

## code from https://github.com/huggingface/transformers/blob/main/src/transformers/models/codegen/tokenization_codegen.py

def bytes_to_unicode():
    """
    Returns a mapping from utf-8 bytes to unicode strings. We specifically avoid mapping to whitespace/control
    characters the bpe code barfs on.

    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
    if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
    decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
    tables between utf-8 bytes and unicode strings.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

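def to_unicode(s):
  # Map each printable character back to its byte, then decode the bytes as UTF-8
  try:
    return bytes([u_to_b[c] for c in s]).decode("utf-8").replace(" ", "▁")
  except (KeyError, UnicodeDecodeError):
    # Character not in the byte table, or bytes that are not valid UTF-8 on their own
    return "😱" + s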

def convert(vocab_map, unicode=False):
  vocab_array = np.empty(len(vocab_map), dtype=object)
  for s, i in vocab_map.items():
    vocab_array[i] = to_unicode(s) if unicode else s
  return vocab_array

b_to_u = bytes_to_unicode()
u_to_b = { c:i for i,c in b_to_u.items() }


In [None]:
vocabulary = convert(tokenizer.vocab, unicode=True)
vocabulary[30:50], vocabulary[12450:12470]

(array(['=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P'], dtype=object),
 array(['▁readily', 'aya', '▁scream', '▁addresses', '▁facilitate', 'Sw',
        'UP', 'asted', 'ة', '▁1984', '}}$,', '▁nutrition', '😱å¹', 'estyle',
        '▁Lett', '▁deliber', 'gered', 'command', '▁jun', '▁Aud'],
       dtype=object))
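
The 😱 entry marks a token whose bytes are not valid UTF-8 on their own. In the listing above, `å¹` stands for the two bytes 0xE5 0xB9, which are only the start of a three-byte character. The sketch below illustrates this with 年 as one example of a character that begins with those bytes; the choice of character is an assumption for illustration, not something the tokenizer reports.

```python
full = "年".encode("utf-8")
print(full)          # b'\xe5\xb9\xb4'

fragment = full[:2]  # the two bytes that the vocabulary shows as 'å¹'
try:
    fragment.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # An incomplete multi-byte sequence cannot be decoded by itself
    decoded_ok = False

print(decoded_ok)    # False
```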

In [None]:
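def top_words(x, n=10):
  """Return the n highest-scoring (score, token string) pairs for one position."""
  score = x.detach().numpy()
  # zip stops at the shorter input, so any scores beyond the vocabulary are dropped
  sorted_words = sorted(zip(score, vocabulary), reverse=True)
  return sorted_words[:n]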

In [None]:
top_words(model_output["logits"][0,1,:])

[(17.15962, ','),
 (14.673129, '▁the'),
 (14.174997, '▁we'),
 (13.992931, '▁it'),
 (13.84141, '▁there'),
 (13.757128, '▁you'),
 (13.234147, '▁I'),
 (12.935117, '▁in'),
 (12.807119, '▁a'),
 (12.754117, '▁most')]
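
The call above reads the scores at position 1, that is, the model's guesses for the token that follows the second input token. Sorting the whole vocabulary works, but it does more than needed when only a few top entries matter. As an alternative sketch (not this notebook's method), the standard library's `heapq.nlargest` picks the top `n` pairs directly. The `top_words_heap` name is hypothetical, and the scores and words are toy values rounded from the output above.

```python
import heapq

scores = [12.8, 17.2, 14.7, 13.2]   # toy scores, rounded from the output above
words = ["▁a", ",", "▁the", "▁I"]

def top_words_heap(scores, words, n=2):
    """Return the n highest (score, word) pairs without sorting everything."""
    return heapq.nlargest(n, zip(scores, words))

print(top_words_heap(scores, words))  # [(17.2, ','), (14.7, '▁the')]
```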