The goal of this notebook is to understand what the mapping of dummy tokens from the source prompt to the target prompt should look like.

Example with max_len=5 (and ignoring start/end tokens):
```
p_src = [feroci, us, turtle]
p_tgt = [sad, turtle]
```

Weifeng implements the mapper as
$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
$$
whereas I think the `1` in the second column (second row) is wrong, and the matrix should therefore be either
$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
\end{bmatrix}
\quad\text{or}\quad
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
$$
but I'm not sure how the diagonal entries after the 1st row, which represent the dummy tokens, should be mapped.
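To make the comparison concrete, here is a minimal sketch that applies each candidate matrix to a toy attention vector. It assumes that rows index source positions and columns index target positions, so the target attention is `attn_src @ mapper`; that convention is an assumption on my part, and the reading flips if the mapper is used transposed. The names `candidates` and `sources_per_target`, the `<pad>` placeholder for dummy tokens, and the attention values are all made up for illustration.

```python
import numpy as np

# Token positions for max_len=5, ignoring start/end tokens; "<pad>" marks dummy tokens.
src_tokens = ["feroci", "us", "turtle", "<pad>", "<pad>"]
tgt_tokens = ["sad", "turtle", "<pad>", "<pad>", "<pad>"]

# The three matrices discussed above (labels are hypothetical).
candidates = {
    "weifeng": np.array([[1, 0, 0, 0, 0],
                         [1, 1, 0, 0, 0],
                         [0, 0, 1, 0, 0],
                         [0, 0, 0, 1, 0],
                         [0, 0, 0, 0, 1]]),
    "shifted": np.array([[1, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0],
                         [0, 0, 1, 0, 0],
                         [0, 0, 0, 1, 0]]),
    "diagonal": np.array([[1, 0, 0, 0, 0],
                          [1, 0, 0, 0, 0],
                          [0, 0, 1, 0, 0],
                          [0, 0, 0, 1, 0],
                          [0, 0, 0, 0, 1]]),
}

def sources_per_target(mapper):
    """List, for every target position, the source tokens that feed into it."""
    return [[src_tokens[i] for i in np.flatnonzero(mapper[:, j])]
            for j in range(mapper.shape[1])]

# Toy cross-attention weights of one pixel over the source tokens (made-up values).
attn_src = np.array([0.1, 0.2, 0.5, 0.1, 0.1])

for name, mapper in candidates.items():
    # Assumed convention: rows = source positions, columns = target positions.
    attn_tgt = attn_src @ mapper
    print(name, np.round(attn_tgt, 2))
    for tgt, srcs in zip(tgt_tokens, sources_per_target(mapper)):
        print(f"  {tgt:>7} <- {srcs}")
```

Under this assumed convention, Weifeng's matrix feeds `us` into the target `turtle`, the first alternative feeds the source `turtle` into the target `turtle`, and the second alternative leaves the target `turtle` with no input at all.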

### Setup & imports

In [1]:
!pip install -Uqq fastcore accelerate transformers diffusers

In [2]:
from PIL import Image
import numpy as np
import torch
from torch import nn, tensor
from torchvision.transforms import ToTensor
from fastcore.all import *
np.set_printoptions(precision=2, linewidth=140)
torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)

In [3]:
from P2P import *

In [4]:
device = 'cuda'
g_cpu = torch.Generator().manual_seed(2333)
prompts = ['ferocious turtle',
           'sad turtle']
NUM_DIFFUSION_STEPS = 20

In [5]:
lc = LoggedVars()  # logger for attention controller
la = LoggedVars()  # logger for attention application

In [6]:
pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", attn_logger=la)

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.


In [7]:
controller = AttentionReplace(prompts, NUM_DIFFUSION_STEPS, cross_replace_steps=0.4, self_replace_steps=0.4, tokenizer=pipe.tokenizer, device=pipe.device, logger=lc)

Now, let's run it once:

In [8]:
outputs = pipe(prompt=prompts, height=512, width=512, num_inference_steps=NUM_DIFFUSION_STEPS, controller=controller, generator=g_cpu)

In [11]:
la.keys()

dict_keys(['attn', 'hidden_states__passed', 'encoder_hidden_states__passed', 'attention_mask__passed', 'batch_size', 'sequence_length', 'attn#prepare_attention_mask', 'attention_mask', 'query__pre_h2b', 'is_cross', 'encoder_hidden_states', 'key__pre_h2b', 'value__pre_h2b', 'query', 'key', 'value', 'attention_probs__precontrol', 'attn#get_attention_scores', 'attention_probs__postcontrol', 'place_in_unet', 'hidden_states_1', 'hidden_states_2', 'hidden_states_3', 'hidden_states_4', 'attn#to_out', 'attn#head_to_batch_dim', 'attn#batch_to_head_dim'])

In [12]:
lc.keys()

dict_keys(['mapper', 'self', 'attn', 'is_cross', 'place_in_unet'])

### Okay, let's get to it

**Important Note:** I discovered that Weifeng copied the replacement part of the code from the paper authors' [repo](https://github.com/google/prompt-to-prompt/blob/main/seq_aligner.py#L189).
<br/>
So, for the implementation in Diffusers, I should assume the code is correct.

Still, I believe the mapping to be wrong. I will check that after implementing P2P in Diffusers.