# Self-Attention

Paper: `Transformer` Attention Is All You Need (NIPS 2017)

Code:
- [Official TensorFlow implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py)
- [PyTorch implementation](https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/SubLayers.py)

Reference:
- `Enzo_Mi` [Multi-Head Attention | Algorithm + Code](https://www.bilibili.com/video/BV1qo4y1F7Ep)
- `黑白` [Transformer Code and Analysis (PyTorch)](https://zhuanlan.zhihu.com/p/345993564)
- `于建民` [The Illustrated Transformer (translation)](https://blog.csdn.net/yujianmin1990/article/details/85221271)
- `Jay Alammar` [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

In [5]:
import torch
import torch.nn.functional as F
query = torch.rand((10, 32, 512))
key = query
value = query
attn = F.scaled_dot_product_attention(query, key, value)

print("query \n", query.shape)
print("attn \n", attn.shape)

query 
 torch.Size([10, 32, 512])
attn 
 torch.Size([10, 32, 512])
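
To make the call above less of a black box, here is a minimal sketch that reproduces `F.scaled_dot_product_attention` by hand as `softmax(Q K^T / sqrt(d)) V` and compares the two results. It assumes the default arguments (no mask, no dropout, non-causal) and PyTorch 2.0 or later; the names `manual`, `builtin`, and `weights`, the seed, and the tolerance `atol=1e-5` are illustrative choices, not part of the original notebook.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

query = torch.rand((10, 32, 512))  # (batch, seq_len, embed_dim)
key = value = query                # self-attention: Q, K, V share one source

# Pairwise similarity scores, scaled by sqrt of the embedding size.
d = query.size(-1)
scores = query @ key.transpose(-2, -1) / math.sqrt(d)  # (10, 32, 32)

# Each row of weights is a distribution over the 32 key positions.
weights = scores.softmax(dim=-1)

# Weighted sum of the value vectors.
manual = weights @ value  # (10, 32, 512)

builtin = F.scaled_dot_product_attention(query, key, value)

print(manual.shape)
print(torch.allclose(manual, builtin, atol=1e-5))
```

The intermediate `scores` tensor has shape `(batch, seq_len, seq_len)`, which is why the output keeps the same shape as `query`.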


In [None]:
import torch
import torch.nn.functional as F

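# Use Apple's Metal backend when available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")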

query = torch.rand((10, 32, 512), device = device)
key = query
value = query
attn = F.scaled_dot_product_attention(query, key, value)

print("query \n", query.shape)
print("attn \n", attn.shape)

In [2]:
attn

tensor([[[0.4745, 0.4078, 0.5420,  ..., 0.5213, 0.5136, 0.4905],
         [0.4223, 0.3916, 0.5045,  ..., 0.5870, 0.4890, 0.5507],
         [0.4302, 0.4916, 0.5028,  ..., 0.5220, 0.5181, 0.5106],
         ...,
         [0.3784, 0.4479, 0.4898,  ..., 0.5049, 0.5079, 0.5083],
         [0.3707, 0.4917, 0.5711,  ..., 0.5041, 0.4907, 0.6239],
         [0.4729, 0.4541, 0.4978,  ..., 0.5907, 0.5661, 0.5213]],

        [[0.5778, 0.5685, 0.4711,  ..., 0.5387, 0.4863, 0.5201],
         [0.4893, 0.4546, 0.4603,  ..., 0.4155, 0.4935, 0.5398],
         [0.5993, 0.4695, 0.5234,  ..., 0.4527, 0.5264, 0.6029],
         ...,
         [0.4905, 0.5442, 0.4694,  ..., 0.5322, 0.5586, 0.5457],
         [0.4696, 0.5609, 0.4698,  ..., 0.3902, 0.5340, 0.5222],
         [0.5904, 0.4835, 0.5434,  ..., 0.4096, 0.4629, 0.5283]],

        [[0.4654, 0.4964, 0.5914,  ..., 0.4808, 0.3459, 0.4747],
         [0.4354, 0.4594, 0.4713,  ..., 0.5216, 0.3536, 0.5848],
         [0.4499, 0.5598, 0.6143,  ..., 0.4723, 0.4049, 0.

In [6]:
import torch
import torch.nn as nn
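class SelfAttention(nn.Module):
    def __init__(self, dim, dk, dv):
        super().__init__()
        self.scale = dk ** -0.5      # 1 / sqrt(dk)
        self.q = nn.Linear(dim, dk)  # query projection
        self.k = nn.Linear(dim, dk)  # key projection
        self.v = nn.Linear(dim, dv)  # value projection

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q = self.q(x)  # (batch, seq_len, dk)
        k = self.k(x)  # (batch, seq_len, dk)
        v = self.v(x)  # (batch, seq_len, dv)

        # Scaled dot-product scores: (batch, seq_len, seq_len)
        attn = q @ k.transpose(-2, -1) * self.scale
        # Normalize over the key positions
        attn = attn.softmax(dim=-1)

        # Weighted sum of values: (batch, seq_len, dv)
        x = attn @ v
        return x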

att = SelfAttention(dim=2,dk=2,dv=3)
x = torch.rand((1,4,2))
output = att(x)
print(x, '\n', output)

tensor([[[0.6832, 0.2191],
         [0.3721, 0.6172],
         [0.1940, 0.8315],
         [0.5647, 0.7821]]]) 
 tensor([[[ 0.2901, -0.0998, -0.4378],
         [ 0.2862, -0.0993, -0.4376],
         [ 0.2842, -0.0991, -0.4374],
         [ 0.2801, -0.0991, -0.4383]]], grad_fn=<UnsafeViewBackward0>)
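
As a cross-check, the module above should agree with `F.scaled_dot_product_attention` when that function is fed the same projected `q`, `k`, and `v`. The sketch below restates the class so that it runs on its own; the seed, the names `out` and `ref`, and the tolerance are assumptions made for illustration. It relies on the built-in default scale of `1 / sqrt(E)`, where `E` is the last dimension of the query, which equals `dk ** -0.5` here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Single-head self-attention, restated from the cell above."""

    def __init__(self, dim, dk, dv):
        super().__init__()
        self.scale = dk ** -0.5
        self.q = nn.Linear(dim, dk)
        self.k = nn.Linear(dim, dk)
        self.v = nn.Linear(dim, dv)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v


torch.manual_seed(0)  # arbitrary seed, only for repeatability
att = SelfAttention(dim=2, dk=2, dv=3)
x = torch.rand((1, 4, 2))

with torch.no_grad():
    out = att(x)
    # Reuse the module's own projections as inputs to the built-in.
    ref = F.scaled_dot_product_attention(att.q(x), att.k(x), att.v(x))

print(out.shape)
print(torch.allclose(out, ref, atol=1e-5))
```

Note that the output's last dimension is `dv`, not `dim`: the value projection alone determines the output width.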


In [1]:
import torch
import torch.nn as nn

# nn.Linear applies a fully connected layer to the last dimension,
# mapping [..., 2] to [..., 5]
ll = nn.Linear(2, 5)

x_in = torch.rand((1, 4, 2))
x_out = ll(x_in)

print(x_in, '\n', x_out)
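
To see what `nn.Linear` does to the last dimension, the following sketch compares the layer's output with the explicit affine map `x @ W^T + b`, using the layer's own `weight` and `bias` parameters. The seed and the name `manual` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

ll = nn.Linear(2, 5)          # weight: (5, 2), bias: (5,)
x_in = torch.rand((1, 4, 2))  # (batch, seq_len, in_features)

x_out = ll(x_in)              # (1, 4, 5)

# The same computation written out: only the last dimension is transformed,
# and the leading dimensions are carried along by broadcasting.
manual = x_in @ ll.weight.T + ll.bias

print(x_out.shape)
print(torch.allclose(x_out, manual, atol=1e-6))
```

This is the reason the `q`, `k`, and `v` projections in `SelfAttention` can be applied directly to a `(batch, seq_len, dim)` tensor without reshaping.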