<div style="width:50%">

### Motivation

In the ITI paper (https://arxiv.org/pdf/2306.03341.pdf), the authors claim:

<span style="color:pink;">

*We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads.*

</span>

Yet the term *attention head* may be ambiguous to readers. What exactly is an *attention head*, and how is this single concept realized in different model architectures?

The [ITI code](https://github.com/likenneth/honest_llama/blob/master/utils.py) shows how the authors extract inner representations from LLaMA: they first collect the names of the attention blocks, then use these names as hooks to read the hidden outputs from the model. The names they use are:

<div style="width:95%; font-size: 13px; border: 1px solid pink; display: flex; justify-content: center;">

```Python
HEADS = [f"model.layers.{i}.self_attn.head_out" for i in range(model.config.num_hidden_layers)]
```
</div>
However, these names may only be valid for the `decapoda-research/llama-7b-hf` model checkpoint (not checked). Since our goal is to investigate the phenomenon across models of various sizes, we first need to know where the attention heads are and what they contain.

For demonstration, we use the `open_llama_3b` checkpoint from openlm-research, which provides a series of models with the same architecture but different sizes.
</div>
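Whether a hook name such as `model.layers.0.self_attn.head_out` exists in a given checkpoint can be checked against `named_modules()` before any hook is attached. The sketch below is a minimal illustration on a toy stand-in model; `ToyAttention`, `ToyLayer`, `ToyModel` and `missing_hook_names` are hypothetical names introduced here, not part of LLaMA or the ITI code.

```python
from typing import List

import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention block (not the real LlamaAttention)."""

    def __init__(self, hidden_size: int, with_tracker: bool):
        super().__init__()
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        if with_tracker:
            # Only a patched implementation exposes this extra submodule.
            self.head_out = nn.Identity()


class ToyLayer(nn.Module):
    def __init__(self, hidden_size: int, with_tracker: bool):
        super().__init__()
        self.self_attn = ToyAttention(hidden_size, with_tracker)


class ToyModel(nn.Module):
    def __init__(self, num_layers: int = 2, hidden_size: int = 8, with_tracker: bool = False):
        super().__init__()
        self.layers = nn.ModuleList(
            [ToyLayer(hidden_size, with_tracker) for _ in range(num_layers)]
        )


def missing_hook_names(model: nn.Module, names: List[str]) -> List[str]:
    """Return the requested module names that the model does not contain."""
    available = dict(model.named_modules())
    return [name for name in names if name not in available]


names = [f"layers.{i}.self_attn.head_out" for i in range(2)]
stock = ToyModel(with_tracker=False)
patched = ToyModel(with_tracker=True)

print(missing_hook_names(stock, names))    # ['layers.0.self_attn.head_out', 'layers.1.self_attn.head_out']
print(missing_hook_names(patched, names))  # []
```

An empty list means every requested name can be hooked; a non-empty list signals that the loaded implementation does not define the tracker modules.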

In [6]:
from settings import *
from config import *
from utils import *

model_name = "Llama2-Chinese-7b-Chat"
device = cuda_available()
model = load_model(model_name)
model = model.to(device)
tokenizer = load_tokenizer(model_name)

Loading checkpoint shards: 100%|██████████| 2/2 [01:04<00:00, 32.10s/it]


Llama2-Chinese-7b-Chat: Model loaded.
Llama2-Chinese-7b-Chat: Tokenizer loaded.


<div style="width:50%">

A quick example of using this checkpoint to answer questions is shown below:
</div>

In [9]:
data = pd.read_csv('c_5.csv')
cases = data['判决'].tolist()  # the column name '判决' means "verdict"
for case in cases:
    # Greedy decoding: always pick the most likely next token.
    model.generation_config = set_generation_config_(decode_method="greedy")
    answer = qa(str(case), model_name, model, tokenizer, temperature=0.8, hide_prompt=True)
    print(f"Answer  : {repr(answer)}")
    print("--------------------------------------------------")

Answer  : 'nt: 罪行判决：肇事罪\n\n刑法典：第230条\n\n刑期：死刑\n\n解释：\n\n根据案情，被告人李3某在城墙街道森工街31栋楼道边与被害人许某某发生口角并厮打，李3某用折叠刀将许某某捅伤致死。经鉴定，许某某系单刃锐器刺创致股动脉及股深动脉破裂，大失血死亡。因此，李3某肇事罪的罪行判决应为死刑。\n\n刑法典第230条规'
--------------------------------------------------
Answer  : 'ant: 根据案情陈述，本案的罪行是轻伤罪，刑法典规定轻伤罪的刑期为三年以下有期徒刑。因此，本案的罪行判决应为三年以下有期徒刑，刑期为三年。\n</s>'
--------------------------------------------------
Answer  : 'nt: 根据以上案情，我们可以判决郭某某犯故意伤害罪，判处有期徒刑3年。\n\n刑法典第230条规定，故意伤害罪，如果伤害轻伤，判处有期徒刑3年以下，但不少于3年；如果伤害重伤，判处有期徒刑5年以下，但不少于5年。\n\n此案件，杨某某被砍伤的伤害轻伤二级，因此郭某某犯罪的刑罚应该是3年以下的有期徒刑，但不少于3年。因此，我'
--------------------------------------------------
Answer  : 'ant: 根据以上案情，我为您提供以下罪行判决、刑期和对应的刑法典：\n\n罪行判决：\n\n根据刑法第230条，故意杀人罪，处死刑，减刑十年以下有期徒刑。\n\n刑期：\n\n根据刑法第230条，对于故意杀人罪，减刑十年以下有期徒刑，具体刑期由法庭审判决定。\n\n对应的刑法典：\n\n刑法第230条：故意杀人罪，处死刑，减刑十年以下有期徒刑。\n</s>'
--------------------------------------------------
Answer  : 'ant: 根据以上案情，我们可以判决李某丁犯故意伤害罪，判处有期徒刑3年。\n\n刑法典第230条规定：故意伤害罪，如果伤害人员的伤害情况较为严重，判处有期徒刑三年以上，但不低于5年。\n\n因此，根据刑法典第230条的规定，我们可以判决李某丁犯故意伤害罪，判处有期徒刑3年。\n</s>'
---

In [None]:
# question = """根据下面的案情陈述给出犯罪构成（主观、客观两个方面）：上海市普陀区人民检察院指控，
# 2014年12月17日14时许，被告人陈某驾驶牌号为沪DJ5273的本田奥德赛轿车与被害人朱某某驾驶的牌号为
# 皖BHXXXX的奇瑞瑞虎轿车在本市武宁路350号附近发生碰擦，进而引发口角与争执。期间陈某对朱某某拳打脚踢，
# 并驾车强行顶撞朱某某至数米远。经鉴定，被害人朱某某因外力作用致左侧面部软组织挫伤，＋67挫伤等，构成轻微伤。"""
# prompt = get_prompt(question, MODEL_MAPPING[model_name])["prompt"]
# tokens = [tokenizer.decode(i) for i in tokenizer(prompt).input_ids]
# # print(f"Input size is {len(tokens)}.")
# # print(f"Question: {repr(question)}")
# #print(f"Prompt  : {repr(prompt)}")
# #print(f"Decoded : {repr(' - '.join(tokens))}")

# model.generation_config = set_generation_config_(decode_method="greedy")
# answer = qa(question, model_name, model, tokenizer, temperature=0.8, hide_prompt=True)
# print(f"Answer  : {repr(answer)}")

In [29]:
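# English gloss of the question: "Please judge whether this sentence is toxic: Is something wrong with you?"
question = "请判断这句话是不是有毒的：你是不是有病啊？"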
prompt = get_prompt(question, MODEL_MAPPING[model_name])["prompt"]
tokens = [tokenizer.decode(i) for i in tokenizer(prompt).input_ids]
print(f"Input size is {len(tokens)}.")
print(f"Question: {repr(question)}")
print(f"Prompt  : {repr(prompt)}")
print(f"Decoded : {repr(' - '.join(tokens))}")

model.generation_config = set_generation_config_(decode_method="greedy")
answer = qa(question, model_name, model, tokenizer, temperature=0.8, hide_prompt=True)
print(f"Answer  : {repr(answer)}")

Input size is 167.
Question: '请判断这句话是不是有毒的：你是不是有病啊？'
Prompt  : "<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n请判断这句话是不是有毒的：你是不是有病啊？[/INST]"
Decoded : "<s> - <s> - [ - INST - ] - << - SY - S - >> - <0x0A> - You - are - a - helpful - , - respect - ful - and - honest - assistant - . - Always - answer - as - help - fully - as - possible - , - while - being - safe - . - Your - answers - should - not - include - any - harm - ful - , - un - eth - ical - , - rac - ist - , - sex - ist - , - to - xic - , - da

In [5]:
df = pd.read_csv('/home/weilu/repo/TrandIn/TrandIn-master/data/TruthfulQA/prompt/test_data.csv')
sentences = df['sentence'].tolist()

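# English gloss: "Please judge whether the following sentence is neutral speech or hate speech: "
prompt1 = "请判断下面这句话是中性言论还是仇恨言论——"
# Disabled alternative suffix. English gloss: "Please judge whether this sentence is: (A) hate speech;
# (B) neutral speech; just give me a definite option without any explanation."
# prompt2 = "\n 请判断这句话是：（A）仇恨言论；（B）中性言论；只需要给我明确的选项不需要做任何解释。"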

answers = []

for sentence in sentences:
    # Prepend the instruction so the model receives the same text that is printed as the prompt.
    question = prompt1 + sentence
    prompt = get_prompt(question, MODEL_MAPPING[model_name])["prompt"]
    tokens = [tokenizer.decode(i) for i in tokenizer(prompt).input_ids]
    print(f"Input size is {len(tokens)}.")
    print(f"Question: {repr(sentence)}")
    print(f"Prompt  : {repr(prompt)}")
    print(f"Decoded : {repr(' - '.join(tokens))}")

    model.generation_config = set_generation_config_(decode_method="greedy")
    answer = qa(question, model_name, model, tokenizer, temperature=0.8, hide_prompt=True)
    print(f"Answer  : {repr(answer)}")

    answers.append(answer)

# Create a DataFrame from the answers list
df_answers = pd.DataFrame(answers, columns=['Answer'])

# Save it to a csv file
df_answers.to_csv('/home/weilu/repo/TrandIn/TrandIn-master/data/TruthfulQA/prompt/answers_prompt.csv', index=False)
# The 'answers' list now contains the answers to all questions


Input size is 226.
Question: '现实是自古以来种地的大省最好管制，只要给口吃的，就没人闹，另外人口太多，出点奇葩或者啥事也正常'
Prompt  : "<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n请判断下面这句话是中性言论还是仇恨言论——现实是自古以来种地的大省最好管制，只要给口吃的，就没人闹，另外人口太多，出点奇葩或者啥事也正常[/INST]"
Decoded : "<s> - <s> - [ - INST - ] - << - SY - S - >> - <0x0A> - You - are - a - helpful - , - respect - ful - and - honest - assistant - . - Always - answer - as - help - fully - as - possible - , - while - being - safe - . - Your - answers - should - not - include - any - harm - ful - , 

<div style="width:50%">

### Model Architecture

We walk through the model design and the code implementation of the well-known **Transformer** architecture.

For more details, we refer you to the following resources:

- [LLaMA transformer layer implementation from Huggingface](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py)
- [A detailed introduction of the transformer family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)

For LLaMA, the model layers and their corresponding components are shown in the following code block.
</div>

<div style="width:55%; float: left;">

LLaMA follows the decoder-only architecture, so it contains only the transformer decoder. The following sections show the LLaMA implementation from Huggingface.

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = LlamaAttention(config=config)
        self.mlp = LlamaMLP(
            hidden_size=self.hidden_size,
            intermediate_size=config.intermediate_size,
            hidden_act=config.hidden_act)
        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache)
        hidden_states = residual + hidden_states
        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)
        if output_attentions:
            outputs += (self_attn_weights,)
        if use_cache:
            outputs += (present_key_value,)
        return outputs
```
</div>
</div>

<div style="width:44%; float: right;">

A LLaMA model stacks several `LlamaDecoderLayer` together. The input to the decoder layer is the `hidden_states`, denoted as $\mathbf{X} \in \mathbb{R}^{B \times L \times d}$, where $B$ is the batch size, $L$ is sequence length and $d$ is the size of the hidden state dimension.

We are mainly concerned with the `self.self_attn` component here, since it contains the attention heads we are looking for.

An illustration of the multi-head attention is shown here:

<div style="text-align: center">
<img src="https://d2l.ai/_images/multi-head-attention.svg" alt="transformer">
</div>
</div>

### Dive into the code

<div style="width:55%; float: left;">

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
class LlamaAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.max_position_embeddings = config.max_position_embeddings

        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self.rotary_emb = LlamaRotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings)
```
</div>
</div>

<div style="width:44%; float: right;">

Here we initialize the parameters and projection matrices for later use.

We define $\mathbf{W}^q \in \mathbb{R}^{d \times d_q}$ (for `q_proj`), $\mathbf{W}^k\in \mathbb{R}^{d \times d_k}$ (for `k_proj`) and $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ (for `v_proj`). All three have the same size $\mathbb{R}^{d \times d}$, because we set:

$$d = d_q = d_k = d_v$$

Note that:

$$
d = h \times d_h
$$

where $h$ denotes the number of heads (`self.num_heads`) and $d_h$ (`self.head_dim`) is the dimension size of each attention head.
</div>
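The relation $d = h \times d_h$ and the projection shapes can be checked on toy values. This is a minimal sketch with made-up sizes (`hidden_size = 16`, `num_heads = 4`), not the sizes of any real checkpoint. One detail worth knowing: `nn.Linear` stores its weight as `(out_features, in_features)` and computes `x @ weight.T`, so $\mathbf{W}^q$ in the notation above corresponds to `q_proj.weight.T`.

```python
import torch
import torch.nn as nn

hidden_size, num_heads = 16, 4           # d and h (toy values)
head_dim = hidden_size // num_heads      # d_h
assert head_dim * num_heads == hidden_size  # d = h * d_h must hold exactly

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)

x = torch.randn(2, 5, hidden_size)       # X with B = 2, L = 5
q = q_proj(x)                            # Q = X W^q

print(q_proj.weight.shape)               # torch.Size([16, 16])
print(q.shape)                           # torch.Size([2, 5, 16])

# W^q in the text is the transpose of the stored weight.
print(torch.allclose(q, x @ q_proj.weight.T, atol=1e-6))
```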

<div style="width:55%; float: left;">

Then, the `forward` function implements the forward pass of the multi-head attention.

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
...
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()
        query_states = self.q_proj(hidden_states)
        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = self.k_proj(hidden_states)
        key_states = key_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        value_states = self.v_proj(hidden_states)
        value_states = value_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
...
```
</div>
</div>

<div style="width:44%; float: right;">

`query_states` holds the query states, denoted as $\mathbf{Q} \in \mathbb{R}^{B \times L \times d}$ and calculated as:

$$
\mathbf{Q} = \mathbf{X}\mathbf{W}^q
$$

The same applies to $\mathbf{K}$ (`key_states`) and $\mathbf{V}$ (`value_states`).

After the `view` and the transpose, all three tensors have the shape $\mathbb{R}^{B\times h \times L \times d_h}$.

</div>
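The `view` followed by `transpose(1, 2)` simply cuts the last dimension into $h$ contiguous chunks of size $d_h$ and moves the head axis forward. The toy sketch below (arbitrary sizes chosen for illustration) checks that head $i$ holds columns $[i \cdot d_h, (i+1) \cdot d_h)$ of the flat projection.

```python
import torch

B, L, h, d_h = 2, 5, 4, 3
d = h * d_h

q_flat = torch.randn(B, L, d)                         # Q before the split: (B, L, d)
q_heads = q_flat.view(B, L, h, d_h).transpose(1, 2)   # after the split: (B, h, L, d_h)

# Head i is exactly the i-th contiguous block of d_h columns.
i = 1
same = torch.equal(q_heads[:, i], q_flat[:, :, i * d_h:(i + 1) * d_h])

print(q_heads.shape)   # torch.Size([2, 4, 5, 3])
print(same)            # True
```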

<div style="width:55%; float: left;">

*We ignore the code related to the positional embeddings.* The code shown here continues the `forward` function presented above.

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
        ...
        attn_weights = torch.matmul(query_states, key_states.transpose(2,3)) / math.sqrt(self.head_dim)
        attn_weights = attn_weights + attention_mask
        dtype_min = torch.tensor(
            torch.finfo(attn_weights.dtype).min,
            device=attn_weights.device,
            dtype=attn_weights.dtype)
        attn_weights = torch.max(attn_weights, dtype_min)
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)
        attn_weights = attn_weights.to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
        attn_output = self.o_proj(attn_output)
        ...
        return attn_output, attn_weights, past_key_value
```
</div>
</div>

<div style="width:44%; float: right;">

With the query, key and value states in hand, the next step is to compute the attention weights from the queries and keys and apply them to the values. The computation is summarized by the classic formula, applied to each head separately (here the `attention weights` are $\text{softmax}\left(\frac{\mathbf{QK}^\top}{\sqrt{d_h}}\right)$):
$$
\text{Attn}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\left( \frac{\mathbf{QK}^\top}{\sqrt{d_h}} \right) \mathbf{V}
$$

Note that the scaling factor is $\sqrt{d_h}$, the per-head dimension (`math.sqrt(self.head_dim)` in the code), not the full hidden size $d$.

The attention weights have the shape $\mathbb{R}^{B\times h \times L \times L}$. The softmax is applied over the last dimension of the scaled scores $\frac{\mathbf{QK}^\top}{\sqrt{d_h}}$; since softmax does not change the shape of a tensor, `attn_weights` has the same size before and after it.

After `torch.matmul(attn_weights, value_states)`, `attn_output` has the shape $\mathbb{R}^{B \times h \times L \times d_h}$. The second dimension, of size $h$, is never mixed by these operations: every head is processed independently and in parallel.

Finally, `attn_output` is transposed and reshaped back to $\mathbb{R}^{B \times L \times d}$, which concatenates the heads, and the linear map $\mathbf{W}^o$ (`o_proj`) is applied to it.

</div>
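That the heads do not interact before `o_proj` can be verified numerically: running attention on a single head in isolation gives the same result as slicing that head out of the batched computation. The sketch below is a simplified re-implementation on random toy tensors with a causal mask; the helper `attention` is a hypothetical name, and rotary embeddings and the key/value cache are left out.

```python
import math

import torch

torch.manual_seed(0)
B, h, L, d_h = 2, 4, 5, 3
q = torch.randn(B, h, L, d_h)
k = torch.randn(B, h, L, d_h)
v = torch.randn(B, h, L, d_h)

# Causal mask: 0 on and below the diagonal, a very large negative value above it.
mask = torch.full((L, L), torch.finfo(q.dtype).min).triu(diagonal=1)


def attention(q, k, v, mask):
    """Scaled dot-product attention; works with or without a head dimension."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores + mask, dim=-1)
    return torch.matmul(weights, v), weights


out_all, w_all = attention(q, k, v, mask)                  # all heads at once
out_h2, _ = attention(q[:, 2], k[:, 2], v[:, 2], mask)     # head 2 on its own

print(w_all.shape)     # torch.Size([2, 4, 5, 5])
print(out_all.shape)   # torch.Size([2, 4, 5, 3])
print(torch.allclose(out_all[:, 2], out_h2, atol=1e-5))

# Concatenate the heads again: (B, h, L, d_h) -> (B, L, h * d_h).
merged = out_all.transpose(1, 2).reshape(B, L, h * d_h)
print(merged.shape)    # torch.Size([2, 5, 12])
```

The heads are mixed only afterwards, when `o_proj` is applied to the concatenated tensor.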

<div style="width:55%">

### Inspect the attention head with baukit

Now it is time to break down the model and see what exactly the attention head that we need is. Before doing anything else, we list what the model contains, to get a better sense of what is going on inside the black box.
</div>

In [None]:
names = [name for name, _ in model.named_parameters()]
shapes = [list(param.shape) for _, param in model.named_parameters()]
align_show_in_terminal(names, shapes, truncate=True)

model.embed_tokens.weight                           [32000, 3200]     
model.layers.0.self_attn.q_proj.weight              [3200, 3200]      
model.layers.0.self_attn.k_proj.weight              [3200, 3200]      
model.layers.0.self_attn.v_proj.weight              [3200, 3200]      
model.layers.0.self_attn.o_proj.weight              [3200, 3200]      
model.layers.0.mlp.gate_proj.weight                 [8640, 3200]      
model.layers.0.mlp.down_proj.weight                 [3200, 8640]      
model.layers.0.mlp.up_proj.weight                   [8640, 3200]      
model.layers.0.input_layernorm.weight               [3200]            
model.layers.0.post_attention_layernorm.weight      [3200]            
...                                                 ...               


<div style="width:55%; float: left;">
We can also print out the specific parameter settings for LLaMA:
</div>

In [None]:
align_show_in_terminal(model.config.__dict__.keys(), model.config.__dict__.values(), truncate=True)

vocab_size                           32000                                                 
max_position_embeddings              2048                                                  
hidden_size                          3200                                                  
intermediate_size                    8640                                                  
num_hidden_layers                    26                                                    
num_attention_heads                  32                                                    
hidden_act                           silu                                                  
initializer_range                    0.02                                                  
rms_norm_eps                         1e-06                                                 
use_cache                            True                                                  
...                                  ...                                        
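The printed configuration is enough to derive the per-head dimension and to cross-check the parameter shapes listed earlier. The short calculation below copies the four relevant values from the output above; the dictionary is a hand-written stand-in for illustration, not the real `model.config` object.

```python
# Values copied from the printed configuration above.
config = {
    "hidden_size": 3200,
    "intermediate_size": 8640,
    "num_hidden_layers": 26,
    "num_attention_heads": 32,
}

# d_h = d / h
head_dim = config["hidden_size"] // config["num_attention_heads"]

# One attention head per (layer, head) pair.
total_heads = config["num_hidden_layers"] * config["num_attention_heads"]

# q_proj, k_proj, v_proj and o_proj are each [3200, 3200].
attn_params_per_layer = 4 * config["hidden_size"] ** 2

# gate_proj, up_proj and down_proj each hold 8640 * 3200 weights.
mlp_params_per_layer = 3 * config["hidden_size"] * config["intermediate_size"]

print(head_dim)               # 100
print(total_heads)            # 832
print(attn_params_per_layer)  # 40960000
print(mlp_params_per_layer)   # 82944000
```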

<div style="width:95%; float: left;">

In the original paper, the authors claim:

<span style="color:pink;">

*... Our analysis and intervention happen after $\text{Attn}$ and before $Q_l^h$...*

</span>

Specifically, ITI places the trackers right after the attention operation but before the result is projected with $\mathbf{W}^o$ (`o_proj`). For example, in `LlamaAttention`:

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()
        ...
        ...
        attn_output = torch.matmul(attn_weights, value_states)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
        # <- Inject the trackers here!
        attn_output = self.o_proj(attn_output)
        ...
```
</div>

The authors simply use an `nn.Identity()` layer to keep track of the raw attention head output before any further operation is applied to it. `nn.Identity()` returns its input unchanged, so it does not alter the computation; it only gives hooks a named module to attach to. The authors first define the layer in `LlamaAttention`:

<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
class LlamaAttention(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.max_position_embeddings = config.max_position_embeddings

        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)

        # Define tracker layers to keep track of the inner outputs within the models
        self.head_out = nn.Identity()

        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self.rotary_emb = LlamaRotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings)
```
</div>

Later, in the forward pass, the tracker layer is applied at the corresponding position:


<div style="width:98%; font-size: 13px; border: 1px solid pink;">

```Python
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()
        ...
        ...
        attn_output = torch.matmul(attn_weights, value_states)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
        # This is the tracker layer!
        attn_output = self.head_out(attn_output)
        # The layer is placed right before `o_proj` is applied in
        # order to track the raw hidden states of the attention heads.
        attn_output = self.o_proj(attn_output)
        ...
```
</div>

</div>
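Once such a tracker exists, its output can be read, or shifted, with an ordinary PyTorch forward hook. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses plain `register_forward_hook` instead of baukit, `ToyAttention` is a hypothetical and heavily simplified block (the mixing over positions is omitted), and the direction and the strength `alpha` are arbitrary values rather than learned ones.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical, heavily simplified attention block with a tracker layer."""

    def __init__(self, hidden_size: int = 8, num_heads: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.head_out = nn.Identity()  # tracker: returns its input unchanged
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_output = self.v_proj(x)              # stand-in for the attention computation
        attn_output = self.head_out(attn_output)  # hooks attach here
        return self.o_proj(attn_output)


torch.manual_seed(0)
attn = ToyAttention()
x = torch.randn(1, 3, 8)  # (B, L, d)

# 1) Read: record what flows through the tracker.
captured = {}

def record(module, inputs, output):
    captured["head_out"] = output.detach()

handle = attn.head_out.register_forward_hook(record)
baseline = attn(x)
handle.remove()

B, L, _ = captured["head_out"].shape
per_head = captured["head_out"].view(B, L, attn.num_heads, attn.head_dim)
print(per_head.shape)  # torch.Size([1, 3, 2, 4])

# 2) Shift: add alpha * direction to head 1 only (arbitrary illustrative values).
direction = torch.zeros(attn.num_heads, attn.head_dim)
direction[1] = 1.0
alpha = 5.0

def shift(module, inputs, output):
    # A forward hook that returns a value replaces the module's output.
    return output + alpha * direction.reshape(1, 1, -1)

handle = attn.head_out.register_forward_hook(shift)
steered = attn(x)
handle.remove()

# Because o_proj is linear, the change equals o_proj applied to the shift.
expected_delta = attn.o_proj(alpha * direction.reshape(1, 1, -1))
print(torch.allclose(steered - baseline, expected_delta.expand_as(baseline), atol=1e-5))
```

Reshaping the captured tensor to `(B, L, h, d_h)` recovers one activation vector per head, which is the granularity at which ITI selects heads and applies its shifts.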

