# Train multiple linear probes at once
In this notebook we find out at which layer a transformer holds the most linearly separable information for causal language modelling on wikitext data. The nice thing about *transformer_heads* is that this takes just one training run, because all probes are trained at once.

In [1]:
from transformer_heads import load_headed
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    MistralForCausalLM,
    Trainer,
    BitsAndBytesConfig,
    TrainingArguments,
    GPT2Model,
    GPT2LMHeadModel,
)
from transformer_heads.util.helpers import DataCollatorWithPadding, get_model_params
from peft import LoraConfig
from transformer_heads.config import HeadConfig
from transformer_heads.util.model import print_trainable_parameters
from transformer_heads.util.evaluate import evaluate_head_wise, get_top_n_preds
import torch

In [3]:
# Parameters
model_path = "meta-llama/Llama-2-7b-hf"
train_epochs = 1
eval_epochs = 1
logging_steps = 100

In [4]:
model_params = get_model_params(model_path)
model_class = model_params["model_class"]
hidden_size = model_params["hidden_size"]
vocab_size = model_params["vocab_size"]
print(model_params)

{'vocab_size': 32000, 'max_position_embeddings': 4096, 'hidden_size': 4096, 'intermediate_size': 11008, 'num_hidden_layers': 32, 'num_attention_heads': 32, 'num_key_value_heads': 32, 'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-05, 'pretraining_tp': 1, 'use_cache': True, 'rope_theta': 10000.0, 'rope_scaling': None, 'attention_bias': False, 'attention_dropout': 0.0, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float16', 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': False, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.

Let's define several heads in a loop. They are hooked at layers -1, -3, -5, -7, -9, -11 and -13. This uses Python-style indexing: layer -1, for example, means after the last transformer block. We also keep the transformer's original pretrained lm_head, frozen via `trainable=False`, for comparison.
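
The index arithmetic behind the hooks can be checked in isolation. The sketch below reproduces only the `name` and `layer_hook` expressions from the head definitions; the variables `hooks` and `names` are hypothetical helpers for illustration and not part of *transformer_heads*.

```python
# Reproduce the layer_hook and name expressions for i = 1..7.
hooks = [-(1 + (i - 1) * 2) for i in range(1, 8)]
names = [f"wikitext_head_{1 + (i - 1) * 2}" for i in range(1, 8)]

# Print each head name next to the layer it would be hooked at.
for name, hook in zip(names, hooks):
    print(f"{name}: layer_hook={hook}")
```

Each head name therefore carries the absolute value of its hook, which makes the evaluation output easier to read later on.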

In [5]:
head_configs = [
    HeadConfig(
        name=f"wikitext_head_{(1+(i-1)*2)}",
        layer_hook=-(1 + (i - 1) * 2),
        in_size=hidden_size,
        hidden_size=0,
        num_layers=1,
        output_activation="linear",
        is_causal_lm=True,
        loss_fct="cross_entropy",
        num_outputs=vocab_size,
        is_regression=False,
        output_bias=False,
    )
    for i in range(1, 8)
]
head_configs.append(
    HeadConfig(
        name="lm_head",
        layer_hook=-1,
        in_size=hidden_size,
        hidden_size=0,
        num_layers=1,
        output_activation="linear",
        is_causal_lm=True,
        loss_fct="cross_entropy",
        num_outputs=vocab_size,
        is_regression=False,
        output_bias=False,
        trainable=False,
    )
)

In [6]:
dd = load_dataset("wikitext", "wikitext-2-v1")

In the *tokenize_function*, we define the labels for each head. For a causal LM head, the labels are simply a copy of the input_ids, stored under the head's name.
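
To make the label layout concrete, here is a minimal sketch with a stand-in for the tokenizer output. The token ids and the two-element `head_names` list are invented for illustration; only the copy pattern mirrors the notebook.

```python
# Stand-in for a batched tokenizer output (token ids are invented).
out = {"input_ids": [[1, 15, 27], [1, 42]]}
head_names = ["wikitext_head_1", "lm_head"]  # hypothetical subset of heads

for name in head_names:
    # list.copy() is shallow: a new outer list that shares the inner lists.
    # That is sufficient in this sketch because the ids are never mutated.
    out[name] = out["input_ids"].copy()

print(sorted(out))
```

Every head gets its own label column, which is why the dataset later lists one feature per head next to `input_ids` and `attention_mask`.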

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token


def tokenize_function(examples):
    out = tokenizer(examples["text"], padding=False, truncation=True)
    for hc in head_configs:
        out[hc.name] = out["input_ids"].copy()
    return out


for split in dd.keys():
    dd[split] = dd[split].filter(function=lambda example: len(example["text"]) > 10)
    dd[split] = dd[split].map(tokenize_function, batched=True)
dd.set_format(
    type="torch",
    columns=["input_ids", "attention_mask"] + [x.name for x in head_configs],
)
for split in dd.keys():
    dd[split] = dd[split].remove_columns("text")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



In [8]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = load_headed(
    model_class,
    model_path,
    head_configs=head_configs,
    quantization_config=quantization_config,
    device_map={"": torch.cuda.current_device()},
)

Some weights of TransformerWithHeads were not initialized from the model checkpoint at meta-llama/Llama-2-7b-hf and are newly initialized: ['heads.wikitext_head_1.lins.0.weight', 'heads.wikitext_head_11.lins.0.weight', 'heads.wikitext_head_13.lins.0.weight', 'heads.wikitext_head_3.lins.0.weight', 'heads.wikitext_head_5.lins.0.weight', 'heads.wikitext_head_7.lins.0.weight', 'heads.wikitext_head_9.lins.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


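Hugging Face reports that the weights of our added heads are newly initialized, which is exactly what we expect. Note that lm_head is absent from the list: it keeps its pretrained weights.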

In [9]:
print_trainable_parameters(model)

all params: 4417916928 || trainable params: 917504000 || trainable%: 20.76779656459851
params by dtype: defaultdict(<class 'int'>, {torch.float32: 1179914240, torch.uint8: 3238002688})
trainable params by dtype: defaultdict(<class 'int'>, {torch.float32: 917504000})


Seven trainable heads, each a 4096 × 32000 linear layer without bias, add up to 917,504,000 trainable parameters, roughly a fifth of the parameters counted above.
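
The reported count can be reproduced by hand. This is a small arithmetic check, assuming each probe consists of a single bias-free weight matrix of shape hidden_size × vocab_size, as configured above; the variable names are chosen for illustration.

```python
hidden_size = 4096        # from the printed model_params
vocab_size = 32000        # from the printed model_params
num_trainable_heads = 7   # lm_head is excluded because trainable=False

# One weight matrix per head, no bias term (output_bias=False).
params_per_head = hidden_size * vocab_size
trainable = num_trainable_heads * params_per_head

# "all params" as reported by print_trainable_parameters above.
all_params = 4417916928
trainable_percent = 100 * trainable / all_params

print(trainable)
print(trainable_percent)
```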

In [10]:
dd["train"]

Dataset({
    features: ['input_ids', 'attention_mask', 'wikitext_head_1', 'wikitext_head_3', 'wikitext_head_5', 'wikitext_head_7', 'wikitext_head_9', 'wikitext_head_11', 'wikitext_head_13', 'lm_head'],
    num_rows: 23627
})

In [11]:
print(get_top_n_preds(5, model, "The historical significance of", tokenizer))

{'wikitext_head_1': ['에', 'ény', 'junto', 'Unity', 'zelf'], 'wikitext_head_3': ['Point', 'Everything', 'encode', 'nob', 'fal'], 'wikitext_head_5': ['чных', 'mlung', 'ismus', 'stress', 'подацима'], 'wikitext_head_7': ['мене', 'unsafe', 'lear', 'North', 'het'], 'wikitext_head_9': ['ște', 'rayed', 'credit', 'particul', 'marriage'], 'wikitext_head_11': ['CA', 'U', 'Ale', 'Æ', 'DOCTYPE'], 'wikitext_head_13': ['фе', 'продол', 'pag', 'attach', '改'], 'lm_head': ['the', 'this', 'a', '', 'The']}


The untrained heads predict essentially random tokens, while the pretrained lm_head already gives sensible continuations.
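
Conceptually, a top-n prediction picks the n vocabulary entries with the largest logits at the last position. The sketch below shows that selection on toy data. `top_n_tokens`, the five-word vocabulary and the logits are all hypothetical, and this is not the implementation of `get_top_n_preds`.

```python
import heapq


def top_n_tokens(logits, vocab, n):
    """Return the n tokens with the largest logits, best first."""
    best_ids = heapq.nlargest(n, range(len(logits)), key=logits.__getitem__)
    return [vocab[i] for i in best_ids]


# Toy vocabulary and logits for a single position (values are invented).
vocab = ["the", "this", "a", "Old", "D"]
logits = [4.1, 2.3, 1.9, -0.5, 0.7]

top3 = top_n_tokens(logits, vocab, 3)
print(top3)
```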

In the collator, we need to make sure that the labels for each head are padded correctly. Here we pad with -100, the default ignore_index of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.
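
The idea of per-feature padding values can be sketched in plain Python. `pad_batch` below is a hypothetical helper, and right-padding to the longest sequence is an assumption; the actual `DataCollatorWithPadding` from *transformer_heads* may behave differently and returns tensors rather than lists.

```python
def pad_batch(examples, padding_values):
    """Right-pad every feature to the longest sequence of that feature."""
    batch = {}
    for key, pad_value in padding_values.items():
        max_len = max(len(ex[key]) for ex in examples)
        batch[key] = [
            ex[key] + [pad_value] * (max_len - len(ex[key])) for ex in examples
        ]
    return batch


# Two toy examples of different length (token ids are invented).
examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "lm_head": [5, 6, 7]},
    {"input_ids": [8], "attention_mask": [1], "lm_head": [8]},
]

# Inputs are padded with a pad token id (2 here, chosen arbitrarily),
# the mask with 0, and the labels with -100 so the loss skips them.
batch = pad_batch(
    examples, {"input_ids": 2, "attention_mask": 0, "lm_head": -100}
)
print(batch["lm_head"])
```

Using a different padding value per feature is the point of the design: the model still needs a valid token id in `input_ids`, whereas the labels need a value that the loss function recognizes as "skip this position".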

In [12]:
args = TrainingArguments(
    output_dir="linear_probe_test",
    learning_rate=0.0002,
    num_train_epochs=train_epochs,
    logging_steps=logging_steps,
    do_eval=False,
    remove_unused_columns=False,
)
collator = DataCollatorWithPadding(
    feature_name_to_padding_value={
        "input_ids": tokenizer.pad_token_id,
        "attention_mask": 0,
        **{key.name: -100 for key in head_configs},
    }
)
trainer = Trainer(
    model,
    args=args,
    train_dataset=dd["train"],
    data_collator=collator,
)
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mykeller[0m ([33mchm-hci[0m). Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: wandb version 0.16.4 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


[34m[1mwandb[0m: Tracking run with wandb version 0.16.3


[34m[1mwandb[0m: Run data is saved locally in [35m[1m/raven/u/ykeller/transformer_heads/wandb/run-20240324_102422-d2nnltle[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.


[34m[1mwandb[0m: Syncing run [33mfirm-smoke-217[0m


[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/chm-hci/huggingface[0m


[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/chm-hci/huggingface/runs/d2nnltle[0m


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...




| Step | Training Loss |
|------|---------------|
| 100  | 57.6001 |
| 200  | 38.5629 |
| 300  | 33.7243 |
| 400  | 30.8404 |
| 500  | 29.6798 |
| 600  | 28.3661 |
| 700  | 27.4192 |
| 800  | 27.1187 |
| 900  | 26.3536 |
| 1000 | 25.88 |


Checkpoint destination directory linear_probe_test/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.








TrainOutput(global_step=2954, training_loss=26.48923161325216, metrics={'train_runtime': 20574.7915, 'train_samples_per_second': 1.148, 'train_steps_per_second': 0.144, 'total_flos': 3.1610522837397504e+17, 'train_loss': 26.48923161325216, 'epoch': 1.0})

In [13]:
print(evaluate_head_wise(model, dd["validation"], collator, epochs=eval_epochs))


Evaluating: 100%|██████████| 308/308 [13:32<00:00,  2.64s/it]



(23.908249817885363, {'wikitext_head_1': 2.7227502832939097, 'wikitext_head_3': 2.876460525896642, 'wikitext_head_5': 2.860558650323323, 'wikitext_head_7': 2.877980626248694, 'wikitext_head_9': 2.8937691941663815, 'wikitext_head_11': 2.950445142659274, 'wikitext_head_13': 3.052337732407954, 'lm_head': 3.673947743007115})





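Nothing super surprising here. Broadly, with each transformer block the hidden state passes through, it contains more linearly separable information for causal language modelling on wikitext data: the loss falls from 3.05 at layer -13 to 2.72 at layer -1. The trend is not strictly monotonic, since the head at layer -5 (2.861) is marginally better than the one at layer -3 (2.876). The overall picture makes a lot of sense, as the model was also pretrained for causal language modelling. The frozen pretrained lm_head scores worse (3.67) than every trained probe, presumably because the probes were fit to wikitext specifically.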
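
The numbers can be summarized with a few lines of Python. The sketch below copies the per-head losses from the evaluation output, checks that the first value of the returned tuple matches their sum up to rounding, and converts each loss to a perplexity. It assumes the reported values are mean per-token cross-entropies in nats; if they are instead averaged over batches, the perplexities are only approximate.

```python
import math

# Per-head validation losses copied from the evaluation output.
head_losses = {
    "wikitext_head_1": 2.7227502832939097,
    "wikitext_head_3": 2.876460525896642,
    "wikitext_head_5": 2.860558650323323,
    "wikitext_head_7": 2.877980626248694,
    "wikitext_head_9": 2.8937691941663815,
    "wikitext_head_11": 2.950445142659274,
    "wikitext_head_13": 3.052337732407954,
    "lm_head": 3.673947743007115,
}
reported_total = 23.908249817885363  # first element of the returned tuple

# The overall value appears to be the sum over heads, up to rounding.
total = sum(head_losses.values())
print(math.isclose(total, reported_total, abs_tol=1e-6))

# Rank heads from best to worst and convert loss to perplexity = exp(loss).
ranked = sorted(head_losses, key=head_losses.get)
for name in ranked:
    loss = head_losses[name]
    print(f"{name}: loss={loss:.3f} perplexity={math.exp(loss):.1f}")
```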

In [14]:
print(get_top_n_preds(5, model, "The historical significance of", tokenizer))

{'wikitext_head_1': ['the', '', 'D', 'this', 'Old'], 'wikitext_head_3': ['the', '', 'D', 'this', 'F'], 'wikitext_head_5': ['the', '', 'D', 'Var', 'these'], 'wikitext_head_7': ['the', '', 'D', 'Old', 'this'], 'wikitext_head_9': ['the', '', 'D', 'Old', 'Ha'], 'wikitext_head_11': ['the', '', 'O', 'D', 'this'], 'wikitext_head_13': ['the', '', 'this', 'a', 'A'], 'lm_head': ['the', 'this', 'a', '', 'The']}


The heads are now predicting more likely tokens.