# Looking Inside Transformer LLMs

# Loading the LLM

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct",
                                             device_map="cuda",
                                             torch_dtype="auto",
                                             trust_remote_code=False,
                                             )


In [2]:
# Create a pipeline
generator = pipeline("text-generation",
                     model=model,
                     tokenizer = tokenizer,
                     return_full_text = False,
                     max_new_tokens = 500,
                     do_sample = False,
                     )

Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


# The Inputs and Outputs of a Trained Transformer LLM


In [None]:
prompt = (
    "Write an email apologizing to the University of Missouri administration for the delay "
    "in submitting the research report. Explain the reasons and assure timely submission in the future.<|assistant|>"
)
output = generator(prompt)

print(output[0]['generated_text'])


 Subject: Apology for Delayed Submission of Research Report


Dear University of Missouri Administration,


I hope this message finds you well. I am writing to sincerely apologize for the delay in submitting the research report that was due on April 15, 2023.


Unfortunately, unforeseen circumstances arose that significantly impacted my ability to complete the report on time. Specifically, I encountered a critical technical issue with my computer, which resulted in the loss of a substantial portion of my work. Additionally, I was unexpectedly called away for a family emergency, which further hindered my progress.


I understand the importance of adhering to deadlines, especially in an academic setting, and I deeply regret any inconvenience this delay may have caused. Please rest assured that I have taken steps to prevent such occurrences in the future. I have now secured a backup system for my work and have developed a more robust contingency plan to manage unforeseen events.


I am co

In [None]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, 
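
The printed layer shapes are enough to estimate the model's size. The sketch below multiplies out the weight matrices listed above. It assumes that `lm_head` has no bias (the printout is cut off before that field), that each `Phi3RMSNorm` holds one weight per feature, and that the rotary embedding and dropout modules carry no trained parameters. The variable names are illustrative only.

```python
# Sizes copied from the printed architecture above
HIDDEN = 3072
VOCAB = 32064
NUM_LAYERS = 32

# Parameters in one Phi3DecoderLayer (in_features * out_features, no biases)
per_layer = {
    "qkv_proj": HIDDEN * 9216,        # fused query/key/value projection (9216 = 3 * 3072)
    "o_proj": HIDDEN * HIDDEN,
    "gate_up_proj": HIDDEN * 16384,   # fused gate and up projections (16384 = 2 * 8192)
    "down_proj": 8192 * HIDDEN,
    "input_layernorm": HIDDEN,        # assumption: one weight per feature
    "post_attention_layernorm": HIDDEN,
}

embed_tokens = VOCAB * HIDDEN
final_norm = HIDDEN
lm_head = HIDDEN * VOCAB              # assumption: bias=False (the printout is truncated)

total_params = embed_tokens + NUM_LAYERS * sum(per_layer.values()) + final_norm + lm_head
print(f"{total_params:,} parameters")
```

Under these assumptions the total comes to roughly 3.8 billion parameters, almost all of it in the 32 decoder layers.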

# Choosing a single token from the probability distribution (sampling / decoding)

In [3]:
prompt = "The capital of Japan is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input IDs to the GPU
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

We can also inspect how the tokenizer split the prompt into tokens:

In [7]:
# Show actual tokenization breakdown
print(f"Input: {prompt}")
print(f"Token IDs: {input_ids.tolist()[0]}")
print("Decoded tokens:")

for i, token_id in enumerate(input_ids.tolist()[0]):
    token = tokenizer.decode([token_id])
    print(f"  Position {i}: '{token}' → ID {token_id}")

Input: The capital of Japan is
Token IDs: [450, 7483, 310, 5546, 338]
Decoded tokens:
  Position 0: 'The' → ID 450
  Position 1: 'capital' → ID 7483
  Position 2: 'of' → ID 310
  Position 3: 'Japan' → ID 5546
  Position 4: 'is' → ID 338


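Note that `lm_head_output` holds raw scores (logits): one row of vocabulary scores for each of the five input tokens.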

In [10]:
lm_head_output

tensor([[[24.7500, 24.8750, 22.7500,  ..., 19.0000, 19.0000, 19.0000],
         [31.0000, 31.5000, 26.0000,  ..., 25.8750, 25.8750, 25.8750],
         [31.3750, 28.8750, 31.0000,  ..., 26.2500, 26.2500, 26.2500],
         [33.5000, 34.0000, 37.2500,  ..., 28.1250, 28.1250, 28.2500],
         [31.6250, 29.8750, 30.8750,  ..., 23.0000, 23.0000, 23.0000]]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)

Now take the highest-scoring token ID at the last position and decode it:

In [None]:
token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)

'Tokyo'

We can take a closer look by converting the last position's logits to probabilities and listing the most likely next tokens:

In [12]:
import torch
import torch.nn.functional as F

# lm_head_output shape: (batch_size, seq_length, vocab_size)

# Get logits for the last token position
logits = lm_head_output[0, -1]  # shape: (vocab_size,)

# Convert logits to probabilities
probs = F.softmax(logits, dim=-1)

# Get the top 6 token IDs and their probabilities
top_probs, top_token_ids = torch.topk(probs, 6)

# Print each token and its probability
for token_id, prob in zip(top_token_ids, top_probs):
    token_str = tokenizer.decode([token_id])
    print(f"{token_str}: {prob.item():.4f}")


Tokyo: 0.8359
_: 0.0197
a: 0.0197
Ky: 0.0093
...: 0.0072
known: 0.0072
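
The `argmax` call earlier always picks the single most likely token (greedy decoding, which is also what `do_sample = False` selects in the pipeline). The alternative is to sample from the distribution. The following is a minimal, standard-library sketch of both strategies. The vocabulary, the logits, and the `softmax` helper are made-up illustrations, not values or functions taken from the model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits (not model output)
vocab = ["Tokyo", "a", "Ky", "known"]
logits = [6.0, 2.25, 1.5, 1.25]

probs = softmax(logits)

# Greedy decoding: always take the most probable token
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token in proportion to its probability
rng = random.Random(0)                        # seeded so the draw is repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

print("greedy: ", greedy_token)
print("sampled:", sampled_token)
```

With the real model, the same draw could be made on the `probs` tensor with `torch.multinomial(probs.float(), num_samples=1)`.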


Also pay attention to the shape of the model output, which is (batch size, sequence length, hidden size):

In [None]:
model_output[0].shape

torch.Size([1, 5, 3072])

The `lm_head` output has the same leading dimensions, with the hidden size replaced by the vocabulary size:

In [None]:
lm_head_output.shape

torch.Size([1, 5, 32064])

# Speeding up generation by caching keys and values


In [None]:
prompt = (
  "Write an email apologizing to the University of Missouri administration for the delay "
  "in submitting the research report. Explain the reasons and assure timely submission in the future.<|assistant|>"
)

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

In [None]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


4.73 s ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

35.2 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
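
In these runs, caching cut generation time from 35.2 s to 4.73 s, a speedup of about 7x. The sketch below is a rough way to see where the saving comes from: it counts how many token positions pass through the model's layers with and without the cache. The prompt length of 40 tokens and the function name are assumptions for illustration; the actual prompt length is not shown above.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model's layers during generation."""
    total = 0
    for step in range(new_tokens):
        if use_cache and step > 0:
            # Keys and values of earlier tokens are cached, so only the newest token is processed
            total += 1
        else:
            # Without a cache (or on the first step), the whole sequence so far is processed
            total += prompt_len + step
    return total

PROMPT_LEN = 40    # hypothetical prompt length
NEW_TOKENS = 100   # matches max_new_tokens in the cells above

with_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=True)
without_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=False)
print(with_cache, without_cache)
```

The count ratio is much larger than the measured speedup. That is expected: wall-clock time also includes fixed per-step overhead, the GPU processes many positions in parallel, and each new token still attends to every cached key and value.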
