# Main Tutorial

In [1]:
from easyroutine import path_to_parents
path_to_parents(1)

%load_ext autoreload
%autoreload 2

Changed working directory to: /orfeo/cephfs/home/dssc/francescortu/easyroutine


In [4]:
# You can set the logging level for the entire library using the following utility function
from easyroutine.logger import enable_debug_logging, enable_info_logging, enable_warning_logging, setup_logging

enable_info_logging()

[03/14/25 12:03:03] INFO     Info logging enabled for easyroutine.    logger.py:95


## Hooked Model
The central element of the interpretability sub-module is the `HookedModel` class, a wrapper around a HuggingFace model that adds hooks to extract intermediate representations. For now only a few models are supported, but we are working to extend the list. Check the documentation for the full list of supported models.
For this tutorial we will use the tiny two-layer transformer model `hf-internal-testing/tiny-random-LlamaForCausalLM`.

In [3]:
from easyroutine.interpretability import HookedModel

# from_pretrained accepts the usual arguments of the HuggingFace library
model = HookedModel.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM", device_map="cuda")

[03/14/25 12:01:16] INFO     Found a wrapper for LlamaAttention    manager.py:57


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message


                    INFO     HookedModel: Model loaded in 1 devices. First device: cuda:0    hooked_model.py:187
                    INFO     HookedModel: The                                                hooked_model.py:203

Let's inspect the model:

In [6]:
print(model)

HookedModel(model_name=hf-internal-testing/tiny-random-LlamaForCausalLM):
        LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 16, padding_idx=31999)
    (layers): ModuleList(
      (0-1): 2 x LlamaDecoderLayer(
        (self_attn): LlamaAttentionWrapper(
          (q_proj): Linear(in_features=16, out_features=16, bias=False)
          (k_proj): Linear(in_features=16, out_features=16, bias=False)
          (v_proj): Linear(in_features=16, out_features=16, bias=False)
          (o_proj): Linear(in_features=16, out_features=16, bias=False)
          (attention_matrix_hook): AttentionMatrixHookModule()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=16, out_features=64, bias=False)
          (up_proj): Linear(in_features=16, out_features=64, bias=False)
          (down_proj): Linear(in_features=64, out_features=16, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((16,), eps=1e-06)
       

The `HookedModel` class also loads the tokenizer automatically. To retrieve it, use:

In [7]:
tokenizer = model.get_tokenizer() 

__Multimodal Models__
For multimodal models, the `HookedModel.get_processor()` method returns the model's processor and the `HookedModel.get_text_tokenizer()` method returns the tokenizer for the text part of the model. In addition, you can select which part of the model to use with the `HookedModel.use_language_model()` and `HookedModel.use_full_model()` methods, which switch between the language backbone alone and the full model with the visual encoder. This is useful for models such as LLaVA, which always expect a visual input.

Now let's try to do a forward pass with the model.

In [None]:
text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors="pt")
print(inputs)

output = model(inputs)
print(output)

{'input_ids': tensor([[    1,   450,  4996, 17354,  1701, 29916,   432, 17204,   975,   278,
         17366, 11203]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
ActivationCache(logits: tensor([[[-0.0430, -0.0654, -0.1011,  ..., -0.1387,  0.0483,  0.0708],
         [ 0.0603, -0.0422,  0.0654,  ...,  0.0310,  0.0212, -0.0014],
         [-0.0267,  0.0522, -0.0840,  ...,  0.1367,  0.0447, -0.1011],
         ...,
         [-0.0339, -0.0522, -0.0684,  ..., -0.0957,  0.0447,  0.0189],
         [-0.0270,  0.0024,  0.0011,  ...,  0.0369,  0.0029, -0.0118],
         [-0.0222,  0.0815,  0.0247,  ...,  0.0791,  0.0762, -0.1357]]],
       device='cuda:0', dtype=torch.bfloat16), mapping_index: {'all': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})


In [18]:
# The model also has a .predict method
output = model.predict(inputs=inputs, k=10)
print(output)

{'eso': 4.315376281738281e-05, 'además': 4.267692565917969e-05, 'bright': 4.1961669921875e-05, 'ionic': 4.1961669921875e-05, 'presidente': 4.172325134277344e-05, 'iar': 4.172325134277344e-05, '/`': 4.1484832763671875e-05, 'película': 4.1484832763671875e-05, 'Brown': 4.1484832763671875e-05, 'nested': 4.1484832763671875e-05}
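
The values returned by `.predict` read like next-token probabilities: each one is of the same order as 1/32000, the uniform value for this vocabulary size, which is what one would expect from a randomly initialized model. The sketch below illustrates that idea under an assumption that is not confirmed by the tutorial, namely that the method applies a softmax to the logits of the last position and keeps the `k` largest entries. The function `top_k_probs`, the toy vocabulary and the toy logits are hypothetical and not part of easyroutine.

```python
import math


def top_k_probs(logits, vocab, k):
    """Softmax over `logits`, then return the k most probable tokens as {token: prob}.

    Hypothetical helper for illustration only; not the easyroutine implementation.
    """
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort (token, probability) pairs from most to least probable.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:k])


# Toy last-position logits over a four-token vocabulary.
vocab = ["fox", "dog", "cat", "the"]
logits = [2.0, 1.0, 0.5, -1.0]

top2 = top_k_probs(logits, vocab, k=2)
print(list(top2))  # ['fox', 'dog']
```

Dictionaries preserve insertion order in Python, so the returned mapping is already sorted from most to least probable token.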


As you can see, the forward pass returns an `ActivationCache` object, i.e. a dictionary that can also contain the hidden states of the model. Now let's see how to extract them.

In [19]:
from easyroutine.interpretability import ExtractionConfig

extraction_config = ExtractionConfig(
    extract_resid_out=True,  # extract the output of each layer
    extract_attn_in=True,  # extract the input to each attention block
)

output = model(
    inputs,
    extraction_config=extraction_config, # extract the requested activations
    target_token_positions=["last", -3],  # extract at the last and the third-to-last token positions ("all", "all-text", "all-image" and other more complex configurations are also supported)
)

Now let's see the extracted hidden states:

In [None]:
print("Activations extracted:", output.keys())
print("Resid shape (batch,target_token_positions,hidden_dim):", output["resid_out_1"].shape)

# output["resid_out_1"][0,0] is the residual output of layer 1 at the last token, and output["resid_out_1"][0,1] is the residual output of layer 1 at the third-to-last token. If you are unsure of the mapping, you can check:

print(output["mapping_index"])


Activations extracted: dict_keys(['attn_in_0', 'resid_out_0', 'attn_in_1', 'resid_out_1', 'logits', 'mapping_index'])
Resid shape (batch,target_token_positions,hidden_dim): torch.Size([1, 2, 16])
{'last': [0], -3: [1]}
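
To make the relation between `target_token_positions` and `mapping_index` concrete, here is a minimal sketch of how such a mapping could be built. It is not the library's code: `resolve_positions` is a hypothetical helper, and it only handles `"last"`, `"all"` and integer positions. For each requested target it records which absolute token indices are selected and which slots of the extracted tensor they occupy.

```python
def resolve_positions(seq_len, targets):
    """Return (absolute token indices, mapping from target to slots in the extracted tensor).

    Hypothetical helper for illustration only; not the easyroutine implementation.
    """
    absolute = []
    mapping_index = {}
    for target in targets:
        if target == "last":
            indices = [seq_len - 1]
        elif target == "all":
            indices = list(range(seq_len))
        elif isinstance(target, int):
            indices = [target % seq_len]  # negative positions count from the end
        else:
            raise ValueError(f"unsupported target: {target!r}")
        # The slots are consecutive positions along the second axis of the extracted tensor.
        start = len(absolute)
        absolute.extend(indices)
        mapping_index[target] = list(range(start, start + len(indices)))
    return absolute, mapping_index


# The example sentence above is tokenized into 12 tokens.
absolute, mapping_index = resolve_positions(12, ["last", -3])
print(absolute)       # [11, 9]
print(mapping_index)  # {'last': [0], -3: [1]}
```

With `["all"]` and the same length, the sketch yields `{'all': [0, 1, ..., 11]}`, which has the same form as the `mapping_index` printed by the plain forward pass earlier.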


To extract the hidden states for a full dataset, use the `extract_cache` method. It takes a dataloader whose elements contain the keys the model expects (always `input_ids` and `attention_mask`, and possibly `pixel_values` and `image_sizes` for multimodal models). As the shapes below show, the activations of all elements are collected in a single `ActivationCache` and stacked along the first dimension.

In [35]:
dataloader = [
    tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt"),
    tokenizer("The cat is on the table", return_tensors="pt"),
]
print(dataloader)

[{'input_ids': tensor([[    1,   450,  4996, 17354,  1701, 29916,   432, 17204,   975,   278,
         17366, 11203]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[   1,  450, 6635,  338,  373,  278, 1591]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}]


In [36]:
cache = model.extract_cache(
    dataloader=dataloader,
    extraction_config=extraction_config,
    target_token_positions=["last", -3]
)

[03/14/25 12:27:31] INFO     HookedModel: Extracting cache        hooked_model.py:1138
[03/14/25 12:27:32] INFO     HookedModel: Forward pass started    hooked_model.py:1142


Extracting cache:: 100%|██████████| 2/2 [00:00<00:00, 33.50it/s]


In [None]:
cache["resid_out_1"].shape # (2,2,16) 2 samples, 2 target_token_positions, 16 hidden dim

torch.Size([2, 2, 16])

__WARNING__: If you extract `all` positions, sequences of different lengths produce tensors of different shapes that cannot be stacked, so the `extract_cache` method returns a list of tensors, one for each element in the dataloader. However, you can instead average the hidden states over all the positions (`avg=True` in `ExtractionConfig`), which gives every element the same shape.
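
The point of the warning can be seen on a toy example. The sketch below uses nested Python lists as a dependency-free stand-in for tensors (an assumption made only to keep the example self-contained; `mean_over_positions` is a hypothetical helper). Two samples with a different number of positions cannot form one rectangular array, but after averaging over positions each sample has exactly one position and stacking is trivial.

```python
from statistics import fmean


def mean_over_positions(hidden):
    """Average a (positions x hidden_dim) nested list over the positions axis."""
    return [fmean(column) for column in zip(*hidden)]


# Two toy samples with 3 and 2 positions respectively, hidden_dim = 2.
sample_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sample_b = [[0.0, 2.0], [2.0, 4.0]]

# The numbers of positions differ, so no rectangular
# (batch, positions, hidden_dim) array can hold both samples.
print(len(sample_a), len(sample_b))  # 3 2

# After averaging, each sample has a single position: shape (2, 1, 2).
stacked = [[mean_over_positions(sample)] for sample in (sample_a, sample_b)]
print(stacked)  # [[[3.0, 4.0]], [[1.0, 3.0]]]
```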

In [43]:
cache = model.extract_cache(
    dataloader=dataloader,
    extraction_config=extraction_config,
    target_token_positions=["all"]
)
print(cache["resid_out_1"])

[03/14/25 12:30:20] INFO     HookedModel: Extracting cache        hooked_model.py:1138
                    INFO     HookedModel: Forward pass started    hooked_model.py:1142


Extracting cache::   0%|          | 0/2 [00:00<?, ?it/s]

                             shapes torch.Size([1, 12, 16]) and torch.Size([1, 7, 16]): Sizes of tensors must match except in dimension 0. Expected size 12 but got size 7 for tensor number 1 in the list.; trying torch.stack.

Extracting cache:: 100%|██████████| 2/2 [00:00<00:00, 76.81it/s]

[tensor([[[-0.0181, -0.0427,  0.0099,  0.0004,  0.0059, -0.0098, -0.0219,
          -0.0154,  0.0082, -0.0272, -0.0104, -0.0026, -0.0126,  0.0266,
           0.0142, -0.0203],
         [-0.0019,  0.0297,  0.0276,  0.0126, -0.0024, -0.0034, -0.0217,
           0.0151,  0.0077,  0.0161,  0.0309,  0.0063, -0.0049, -0.0225,
           0.0442,  0.0018],
         [-0.0129,  0.0014,  0.0141,  0.0187,  0.0045,  0.0045,  0.0051,
           0.0181, -0.0087,  0.0282,  0.0266,  0.0012,  0.0215, -0.0006,
           0.0127, -0.0133],
         [-0.0208, -0.0115, -0.0349, -0.0236,  0.0184, -0.0576, -0.0167,
          -0.0193,  0.0247,  0.0197,  0.0054, -0.0256, -0.0046, -0.0093,
           0.0077,  0.0096],
         [-0.0217, -0.0076, -0.0269, -0.0206,  0.0026, -0.0339,  0.0050,
          -0.0022,  0.0112,  0.0009,  0.0117,  0.0052,  0.0206,  0.0280,
          -0.0186,  0.0134],
         [-0.0078, -0.0295, -0.0012, -0.0271, -0.0292,  0.0150, -0.0095,
           0.0256, -0.0189,  0.0284, -0.0018, -0.00




In [None]:
cache = model.extract_cache(  # we will get a warning for the logits
    dataloader=dataloader,
    extraction_config=ExtractionConfig(
        extract_resid_out=True,  # extract the output of each layer
        extract_attn_in=True,  # extract the input to each attention block
        avg=True,  # average over the target token positions
    ),
    target_token_positions=["all"],
)
print(cache["resid_out_1"].shape)

[03/14/25 12:32:53] INFO     HookedModel: Extracting cache        hooked_model.py:1138
                    INFO     HookedModel: Forward pass started    hooked_model.py:1142


Extracting cache::   0%|          | 0/2 [00:00<?, ?it/s]

                             shapes torch.Size([1, 12, 32000]) and torch.Size([1, 7, 32000]): Sizes of tensors must match except in dimension 0. Expected size 12 but got size 7 for tensor number 1 in the list.; trying torch.stack.

Extracting cache:: 100%|██████████| 2/2 [00:00<00:00, 129.81it/s]

torch.Size([2, 1, 16])





## Save the Activation

## Interventions