In natural language processing, tokenization splits text into tokens (whole words or, in GPT-2, subword pieces), and each token is mapped to an integer index in the vocabulary. To interpret a model like GPT-2, it helps to understand its activation cache, which stores the intermediate activations from a forward pass. For each attention head in a layer, the cache holds an attention pattern: a lower-triangular matrix over token positions, because each token can attend only to itself and to earlier tokens. Each cached activation is identified by its layer and by the type of sublayer that produced it, such as attention or the MLP (multilayer perceptron).
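The two ideas above, mapping tokens to indices and the lower-triangular attention pattern, can be sketched in plain Python. This is only an illustration: the whitespace tokenizer and the uniform scores are simplifying assumptions, and the helper names `tokenize` and `causal_attention_pattern` are made up for this sketch.

```python
import math

def tokenize(text, vocab):
    """Map each whitespace-separated piece to an integer index, growing vocab as needed.

    Real GPT-2 uses byte-pair encoding and splits text into subword pieces;
    whitespace splitting is a simplification for illustration.
    """
    ids = []
    for piece in text.split():
        if piece not in vocab:
            vocab[piece] = len(vocab)
        ids.append(vocab[piece])
    return ids

def causal_attention_pattern(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix."""
    n = len(scores)
    pattern = []
    for q in range(n):
        # Query position q may only attend to key positions 0..q (the lower triangle).
        visible = scores[q][: q + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero attention.
        pattern.append([e / total for e in exps] + [0.0] * (n - q - 1))
    return pattern

vocab = {}
token_ids = tokenize("the cat saw the dog", vocab)
# Uniform scores, purely for illustration; a real head computes these from q and k.
scores = [[0.0] * len(token_ids) for _ in token_ids]
pattern = causal_attention_pattern(scores)

for row in pattern:
    print([round(p, 2) for p in row])
```

Each printed row sums to one, and everything above the diagonal is zero, which is the lower-triangular shape described above.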

To make cached activations easier to look up, the following aliases for layer types and activation names are often used:


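# Aliases for the layer type; an empty string means no sublayer name is needed.
layer_type_alias = {
    "a": "attn",
    "m": "mlp",
    "b": "",
    "block": "",
    "blocks": "",
    "attention": "attn",
}

# Aliases for the activation name within that layer.
act_name_alias = {
    "attn": "pattern",
    "attn_logits": "attn_scores",
    "key": "k",
    "query": "q",
    "value": "v",
    "mlp_pre": "pre",
    "mlp_mid": "mid",
    "mlp_post": "post",
}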
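
As a rough sketch of how such aliases might be applied, the helper below expands a short name into a full activation name. The function `resolve_act_name` is hypothetical, and the `blocks.<layer>.<type>.hook_<name>` format is an assumption made for illustration; the library's own lookup may handle more cases.

```python
layer_type_alias = {
    "a": "attn", "m": "mlp", "b": "", "block": "", "blocks": "", "attention": "attn",
}
act_name_alias = {
    "attn": "pattern", "attn_logits": "attn_scores",
    "key": "k", "query": "q", "value": "v",
    "mlp_pre": "pre", "mlp_mid": "mid", "mlp_post": "post",
}

def resolve_act_name(name, layer=None, layer_type=None):
    """Hypothetical helper: expand aliases and join the parts into one name string."""
    name = act_name_alias.get(name, name)
    if layer_type is not None:
        layer_type = layer_type_alias.get(layer_type, layer_type)
    parts = []
    if layer is not None:
        parts.append(f"blocks.{layer}")
    if layer_type:  # skipped when the alias resolves to an empty string
        parts.append(layer_type)
    parts.append(f"hook_{name}")
    return ".".join(parts)

pattern_name = resolve_act_name("attn", 0, "a")
mlp_name = resolve_act_name("mlp_post", 3, "m")
resid_name = resolve_act_name("resid_pre", 5, "block")
print(pattern_name, mlp_name, resid_name)
```

Note how the empty-string aliases ("b", "block", "blocks") drop the sublayer part entirely, so the activation is addressed directly on the block.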

Visualizing Attention Patterns and Ablation

Visualizing attention patterns with the CircuitsVis library can reveal which tokens a head links together. For example, in the word "summarization," the token "ization" attends strongly to "summar," while "supervised" attends to a couple of the tokens before it. However, a pattern alone does not explain why these links matter; for that, the model's behavior has to be tested.

Ablation involves deleting an activation (not an activation function) and observing how the model's behavior changes. Zero ablation sets the activation to zero. Mean ablation replaces it with its mean over a set of inputs; if the activations are pictured as points on a circle, the mean is the center. Random ablation replaces the activation for the current data point with the activation from another data point. However, models trained with dropout, such as GPT-2, may be fairly robust to ablation, because dropout encourages redundancy between components, so ablating a single component can understate its importance.
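
The three kinds of ablation can be sketched on a small table of activations, one row per data point and one column per component. The numbers and the function names below are made up for illustration; a real experiment would apply the same replacement inside the model's forward pass.

```python
import random

def zero_ablate(acts, index):
    """Set component `index` to zero for every example."""
    return [row[:index] + [0.0] + row[index + 1:] for row in acts]

def mean_ablate(acts, index):
    """Replace component `index` with its mean over all examples."""
    mean = sum(row[index] for row in acts) / len(acts)
    return [row[:index] + [mean] + row[index + 1:] for row in acts]

def random_ablate(acts, index, rng):
    """Replace component `index` with the value from a randomly chosen example."""
    return [row[:index] + [rng.choice(acts)[index]] + row[index + 1:] for row in acts]

# Three examples, two components each (made-up numbers).
acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rng = random.Random(0)  # seeded so the run is repeatable

zeroed = zero_ablate(acts, 0)
averaged = mean_ablate(acts, 0)
resampled = random_ablate(acts, 0, rng)

print(zeroed)
print(averaged)
print(resampled)
```

In this sketch random ablation may occasionally draw the same example it is replacing; excluding that case is a design choice left out to keep the code short.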

Activation Patching and Interpretation

Activation patching runs the model on a normal (clean) prompt and on a corrupted prompt, then copies an activation from one run into the other to see how much the output changes. For example, "The Eiffel Tower is in" (which the model should complete with "Paris") versus "The Colosseum is in." This technique works best when the two prompts have the same number of tokens, so that positions line up. A neuron may also activate on a prompt such as "Apple or axe or...", which might suggest that an attention head is tracking vowel patterns and driving that neuron.
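
A minimal sketch of the patching loop, using a toy two-stage "model" in place of a real network. Everything here is an assumption made for illustration: `layer1`, `layer2`, and `run` are invented, and the split of "Colosseum" into two pieces is not GPT-2's actual tokenization but simply keeps both prompts at five tokens.

```python
def layer1(tokens):
    """Toy first layer: one activation per position (here, the token's length)."""
    return [float(len(tok)) for tok in tokens]

def layer2(hidden):
    """Toy readout: a position-weighted sum standing in for the model's output."""
    return sum((pos + 1) * h for pos, h in enumerate(hidden))

def run(tokens, patch=None):
    """Run the toy model; optionally overwrite one position's activation."""
    hidden = layer1(tokens)
    if patch is not None:
        pos, value = patch
        hidden[pos] = value
    return layer2(hidden), hidden

clean_tokens = ["The", "Eiffel", "Tower", "is", "in"]
corrupted_tokens = ["The", "Colos", "seum", "is", "in"]  # made-up split, same length

clean_out, clean_hidden = run(clean_tokens)
corrupted_out, _ = run(corrupted_tokens)

# Patch the clean activation into the corrupted run, one position at a time,
# and record what fraction of the clean/corrupted output gap is recovered.
effects = []
for pos in range(len(clean_tokens)):
    patched_out, _ = run(corrupted_tokens, patch=(pos, clean_hidden[pos]))
    effects.append((patched_out - corrupted_out) / (clean_out - corrupted_out))

print(effects)
```

Positions where the two prompts agree recover nothing, while the positions that differ account for the whole gap, which is the kind of localization activation patching is meant to provide.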

To interpret a network further, hook functions can be defined. A hook function takes a value and a hook point. The value is the activation tensor at that point, which for per-head attention activations such as q, k, and v has the shape [batch, position, head_index, d_head], where d_head is the size of each head's vector rather than the full model dimension. utils.get_act_name takes an activation name and a layer and returns the full name string of the hook point, which is what a forward hook is attached to.
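
The shape of a hook function can be sketched without the library itself. In the example below, `ToyHookPoint` and `toy_run_with_hooks` are hypothetical stand-ins written for this illustration, nested lists replace tensors, and the name "blocks.0.attn.hook_z" is assumed as the name of a per-head activation.

```python
import copy

class ToyHookPoint:
    """Hypothetical stand-in for a hook point: it only carries a name."""
    def __init__(self, name):
        self.name = name

def zero_head_hook(value, hook, head_index=1):
    """Zero one head's vector; value is shaped [batch][position][head_index][d_head]."""
    value = copy.deepcopy(value)  # leave the caller's activation untouched
    for example in value:
        for position in example:
            position[head_index] = [0.0] * len(position[head_index])
    return value

def toy_run_with_hooks(activations, fwd_hooks):
    """Pass each named activation through the hooks registered under its name."""
    out = {}
    for name, value in activations.items():
        for hook_name, hook_fn in fwd_hooks:
            if hook_name == name:
                value = hook_fn(value, ToyHookPoint(name))
        out[name] = value
    return out

# batch=1, position=2, head_index=2, d_head=3 (made-up numbers)
z = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
      [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]]]
hook_name = "blocks.0.attn.hook_z"  # assumed name for illustration

patched = toy_run_with_hooks({hook_name: z}, [(hook_name, zero_head_hook)])
print(patched[hook_name])
```

The hook returns the modified value, and the runner uses that return value downstream; this is the contract that makes zero ablation and activation patching expressible as hooks.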

Using loss as a metric for activation patching instead of logits can also be helpful. Overall, understanding the GPT-2 cache, visualizing attention patterns, performing ablation, and using activation patching and hook functions are all useful techniques for interpreting neural networks.