https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization  
https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
    
https://github.com/harvardnlp/annotated-transformer/  
http://nlp.seas.harvard.edu/annotated-transformer/

https://e2eml.school/transformers.html  
https://zhuanlan.zhihu.com/p/107889011  
https://kikaben.com/transformers-coding-details/

![image.png](attachment:image.png)

![image.png](attachment:image.png)
https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization

- The Transformer is a sequence-to-sequence neural network architecture.
- Input text is encoded by a tokenizer into a sequence of integers called input tokens.
- Input tokens are mapped to a sequence of vectors (word embeddings) by an embedding layer.
- Output vectors (embeddings) can be classified into a sequence of output tokens.
- Output tokens can then be decoded back into text.

## Tokenization vs Embedding
- The input is tokenized, and the tokens are then embedded.
- Output embeddings are classified back into tokens, which can then be decoded into text.
- Tokenization converts text into a list of integers.
- Embedding converts the list of integers into a list of vectors (a list of embeddings).
- Positional information about each token is added to the embeddings using positional encodings or positional embeddings.
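
The tokenize-then-embed steps above can be sketched without any library. The vocabulary, the whitespace tokenizer, and the random 4-dimensional embedding table below are all hypothetical stand-ins chosen for illustration; real tokenizers learn subword units and real embedding tables are trained.

```python
import random

# Hypothetical toy vocabulary; id 0 is reserved for unknown words.
VOCAB = ["[UNK]", "this", "is", "an", "input", "."]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    """Tokenization: text -> list of integer token ids."""
    words = text.lower().replace(".", " .").split()
    return [TOKEN_TO_ID.get(w, 0) for w in words]

def decode(ids):
    """Inverse of encode: list of token ids -> text."""
    return " ".join(VOCAB[i] for i in ids)

# Embedding table: one vector per vocabulary entry (random here, learned in practice).
EMBED_DIM = 4
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def embed(ids):
    """Embedding: list of token ids -> list of vectors (a plain table lookup)."""
    return [EMBEDDINGS[i] for i in ids]

ids = encode("This is an input.")
vectors = embed(ids)
print(ids)                            # list of integers
print(len(vectors), len(vectors[0]))  # sequence length, embedding size
print(decode(ids))                    # back to text
```

The point of the sketch is only the division of labour: tokenization produces integers, and embedding is a lookup that turns each integer into a vector.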

## Positional Encodings Add Token Order Information
- Self-attention and feed-forward layers are symmetric with respect to the order of the input tokens: permuting the input only permutes the output.
- So the model has to be given positional information about each input token.
- For this reason, positional encodings or positional embeddings are added to the token embeddings in the Transformer.
- Encodings are fixed functions chosen by hand, while embeddings are learned during training.

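
As one example of a hand-chosen encoding, the sketch below implements the sinusoidal scheme from the original Transformer paper in plain Python and adds it to token embeddings. The function names and the 4-dimensional example vector are assumptions made for illustration; the notes above do not say which encoding a given model uses.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed (not learned) position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(token_embeddings):
    """Add the position vector for index t to the embedding at index t."""
    seq_len, d_model = len(token_embeddings), len(token_embeddings[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [[x + p for x, p in zip(tok, pos)]
            for tok, pos in zip(token_embeddings, pe)]

# The same token embedding at two positions becomes distinguishable
# once the position vectors are added.
same_token = [0.5, -0.5, 0.25, 0.0]
with_positions = add_positions([same_token, same_token])
print(with_positions[0])
print(with_positions[1])
```

A learned positional embedding would replace `sinusoidal_positional_encoding` with a trainable table indexed by position; the addition step stays the same.
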
## Word Embeddings
- Embedding layers map tokens to word vectors (sequences of numbers) called word embeddings.
- The input and output embedding layers often share the same token-vector mapping.
- Embeddings contain semantic information about the word.
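
A shared token-vector mapping can be sketched as follows: the output layer scores each token by the dot product between a hidden vector and that token's input embedding. The four-word vocabulary and the hand-picked 3-dimensional vectors are hypothetical; they are normalized to unit length so that each embedding scores highest against itself.

```python
import math

# Hypothetical vocabulary with hand-picked embeddings ("cat" and "dog" are close).
VOCAB = ["cat", "dog", "car", "."]
RAW = [[1.0, 0.2, 0.0], [0.9, 0.4, 0.1], [0.0, 1.0, 0.3], [0.1, 0.0, 1.0]]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Unit-length rows: a vector's largest dot product is then with itself.
EMBEDDINGS = [normalize(v) for v in RAW]

def embed(token_id):
    """Input side: token id -> vector."""
    return EMBEDDINGS[token_id]

def output_logits(hidden):
    """Output side, tied weights: one score per token, reusing the same table."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in EMBEDDINGS]

def classify(hidden):
    """Pick the token id with the highest score."""
    logits = output_logits(hidden)
    return max(range(len(logits)), key=logits.__getitem__)

for token_id, token in enumerate(VOCAB):
    print(token, "->", VOCAB[classify(embed(token_id))])

# Similar words get similar scores: "cat" is closer to "dog" than to "car".
cat_scores = output_logits(embed(0))
print(cat_scores[1] > cat_scores[2])
```

In a trained model the hidden vector comes from the Transformer layers rather than directly from the embedding table, so the recovered token is a prediction, not a round trip.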

In [2]:
from transformers import DistilBertTokenizerFast, DistilBertModel

# Tokenize: text -> tensor of token ids (the special [CLS] and [SEP] tokens are added).
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is a input.', return_tensors='pt')
print("These are tokens!", tokens)
# Decode each id on its own to see which token it stands for.
for token in tokens[0]:
    print("This is a decoded token!", tokenizer.decode([token]))

# Embed: look up one vector per token id in the model's word-embedding table.
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
print(model.embeddings.word_embeddings(tokens))
for e in model.embeddings.word_embeddings(tokens)[0]:
    print("This is an embedding!", e)

These are tokens! tensor([[ 101, 2023, 2003, 1037, 7953, 1012,  102]])
This is a decoded token! [CLS]
This is a decoded token! this
This is a decoded token! is
This is a decoded token! a
This is a decoded token! input
This is a decoded token! .
This is a decoded token! [SEP]


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[[ 0.0390, -0.0123, -0.0208,  ...,  0.0607,  0.0230,  0.0238],
         [-0.0558,  0.0151,  0.0031,  ..., -0.0140, -0.0277,  0.0139],
         [-0.0440, -0.0236, -0.0283,  ...,  0.0053, -0.0081,  0.0170],
         ...,
         [-0.0788,  0.0202, -0.0352,  ...,  0.0119, -0.0037, -0.0402],
         [-0.0244, -0.0138, -0.0078,  ...,  0.0069,  0.0057, -0.0016],
         [-0.0199, -0.0095, -0.0099,  ..., -0.0235,  0.0071, -0.0071]]],
       grad_fn=<EmbeddingBackward0>)
This is an embedding! tensor([ 3.8952e-02, -1.2318e-02, -2.0844e-02, -5.2684e-04, -1.9758e-02,
         3.8324e-02, -2.0617e-02,  3.3877e-03, -2.2452e-02, -4.3990e-02,
         1.2990e-02, -1.6670e-02,  9.7562e-03, -1.2669e-02, -4.5170e-02,
         2.5090e-02,  4.4694e-02,  5.9726e-02, -5.0432e-03, -3.6367e-02,
        -7.3220e-03,  1.3156e-03, -2.8033e-02,  2.8017e-02,  1.1700e-02,
        -3.4161e-03, -1.7048e-02,  2.5037e-02,  2.1979e-02, -2.0812e-03,
         5.1738e-03, -1.2071e-03, -4.3487e-02,  3.1312e-02, -

In [3]:
print(model.embeddings.word_embeddings(tokens).shape)

torch.Size([1, 7, 768])


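## pytorch-lightning is a high-level model interface built on top of PyTorch
pytorch-lightning is to PyTorch what Keras is to TensorFlow.
pytorch-lightning has several notable features:
- You do not have to write a custom training loop; you only specify how the loss is computed.
- Callbacks make it easy to add checkpoint saving, early stopping, and similar functionality.
- Models can easily be trained on a single CPU, multiple CPUs, a single GPU, multiple GPUs, or even multiple TPUs.
- Common evaluation metrics such as Accuracy, AUC, and Precision can easily be added through the torchmetrics library.
- Techniques that speed up training, such as gradient accumulation over multiple batches, half-precision mixed-precision training, and automatic search for the largest batch_size, are easy to apply.
- Techniques that can improve model performance, such as SWA (stochastic weight averaging), CyclicLR (a cyclical learning-rate schedule), and auto_lr_find (automatic learning-rate finder), are easy to use.

pytorch-lightning is usually installed and imported as follows.  

#Install  
pip install pytorch-lightning  
#Import  
import pytorch_lightning as pl 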
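
Gradient accumulation over multiple batches, one of the techniques listed above, is usually requested in pytorch-lightning through the `accumulate_grad_batches` argument of `Trainer`. The sketch below does not use Lightning at all; it is a plain-Python illustration, with a made-up one-parameter model and data, of why averaging the gradients of several small batches matches the gradient of one large batch.

```python
# Toy model y = w * x with mean-squared-error loss (hypothetical data).
def mse_gradient(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

# Gradient computed on the full batch of 8 samples.
full_batch_grad = mse_gradient(w, xs, ys)

# The same gradient accumulated over 4 micro-batches of 2 samples each.
micro_batch_size = 2
num_micro_batches = len(xs) // micro_batch_size
accumulated_grad = 0.0
for k in range(num_micro_batches):
    lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
    # Dividing by the number of micro-batches keeps the scale of one large batch.
    accumulated_grad += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / num_micro_batches

print(full_batch_grad, accumulated_grad)
```

The equality holds here because all micro-batches have the same size; with uneven batches the average would need to be weighted by batch size.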

## pytorch-lightning usage example
Below we use the MNIST image classification problem as an example to demonstrate pytorch-lightning best practices.

In [1]:
import torch 
from torch import nn 
from torchvision import transforms as T
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader,random_split
import pytorch_lightning as pl 
from torchmetrics import Accuracy 

In [2]:
class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "../data/minist/", 
                 batch_size: int = 32,
                 num_workers: int =4):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.num_workers = num_workers

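    def setup(self, stage = None):
        # Convert PIL images to float tensors with values in [0, 1].
        transform = T.Compose([T.ToTensor()])
        # The 10,000-image MNIST test split is used for both testing and prediction.
        self.ds_test = MNIST(self.data_dir, train=False,transform=transform,download=True)
        self.ds_predict = MNIST(self.data_dir, train=False,transform=transform,download=True)
        # Split the 60,000 training images into 55,000 for training and 5,000 for validation.
        ds_full = MNIST(self.data_dir, train=True,transform=transform,download=True)
        self.ds_train, self.ds_val = random_split(ds_full, [55000, 5000])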

    def train_dataloader(self):
        return DataLoader(self.ds_train, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers,
                          pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.ds_val, batch_size=self.batch_size,
                          shuffle=False, num_workers=self.num_workers,
                          pin_memory=True)

    def test_dataloader(self):
        return DataLoader(self.ds_test, batch_size=self.batch_size,
                          shuffle=False, num_workers=self.num_workers,
                          pin_memory=True)

    def predict_dataloader(self):
        return DataLoader(self.ds_predict, batch_size=self.batch_size,
                          shuffle=False, num_workers=self.num_workers,
                          pin_memory=True)
    

data_mnist = MNISTDataModule()
data_mnist.setup()

for features,labels in data_mnist.train_dataloader():
    print(features.shape)
    print(labels.shape)
    break 

torch.Size([32, 1, 28, 28])
torch.Size([32])


In [7]:
net = nn.Sequential(
    nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3),
    nn.MaxPool2d(kernel_size = 2,stride = 2),
    nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
    nn.MaxPool2d(kernel_size = 2,stride = 2),
    nn.Dropout2d(p = 0.1),
    nn.AdaptiveMaxPool2d((1,1)),
    nn.Flatten(),
    nn.Linear(64,32),
    nn.ReLU(),
    nn.Linear(32,10)
)

class Model(pl.LightningModule):
    
    def __init__(self,net,learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = net
        self.train_acc = Accuracy(task="multiclass", num_classes=10)
        self.val_acc = Accuracy(task="multiclass", num_classes=10)
        self.test_acc = Accuracy(task="multiclass", num_classes=10) 
        
        
    def forward(self,x):
        x = self.net(x)
        return x
    
    
    # Define the loss
    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = nn.CrossEntropyLoss()(preds,y)
        return {"loss":loss,"preds":preds.detach(),"y":y.detach()}
    
    # Define the metrics (the *_step_end hooks belong to pytorch-lightning 1.x and were removed in 2.0)
    def training_step_end(self,outputs):
        train_acc = self.train_acc(outputs['preds'], outputs['y']).item()    
        self.log("train_acc",train_acc,prog_bar=True)
        return {"loss":outputs["loss"].mean()}
    
    # Define the optimizer and, optionally, an lr_scheduler
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = nn.CrossEntropyLoss()(preds,y)
        return {"loss":loss,"preds":preds.detach(),"y":y.detach()}

    def validation_step_end(self,outputs):
        val_acc = self.val_acc(outputs['preds'], outputs['y']).item()    
        self.log("val_loss",outputs["loss"].mean(),on_epoch=True,on_step=False)
        self.log("val_acc",val_acc,prog_bar=True,on_epoch=True,on_step=False)
    
    def test_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = nn.CrossEntropyLoss()(preds,y)
        return {"loss":loss,"preds":preds.detach(),"y":y.detach()}
    
    def test_step_end(self,outputs):
        test_acc = self.test_acc(outputs['preds'], outputs['y']).item()    
        self.log("test_acc",test_acc,on_epoch=True,on_step=False)
        self.log("test_loss",outputs["loss"].mean(),on_epoch=True,on_step=False)
    
model = Model(net)

# Inspect the model size
#model_size = pl.utilities.memory.get_model_size_mb(model)
#print("model_size = {} M \n".format(model_size))
model.example_input_array = [features]
summary = pl.utilities.model_summary.ModelSummary(model,max_depth=-1)
print(summary) 

   | Name      | Type               | Params | In sizes         | Out sizes       
----------------------------------------------------------------------------------------
0  | net       | Sequential         | 54.0 K | [32, 1, 28, 28]  | [32, 10]        
1  | net.0     | Conv2d             | 320    | [32, 1, 28, 28]  | [32, 32, 26, 26]
2  | net.1     | MaxPool2d          | 0      | [32, 32, 26, 26] | [32, 32, 13, 13]
3  | net.2     | Conv2d             | 51.3 K | [32, 32, 13, 13] | [32, 64, 9, 9]  
4  | net.3     | MaxPool2d          | 0      | [32, 64, 9, 9]   | [32, 64, 4, 4]  
5  | net.4     | Dropout2d          | 0      | [32, 64, 4, 4]   | [32, 64, 4, 4]  
6  | net.5     | AdaptiveMaxPool2d  | 0      | [32, 64, 4, 4]   | [32, 64, 1, 1]  
7  | net.6     | Flatten            | 0      | [32, 64, 1, 1]   | [32, 64]        
8  | net.7     | Linear             | 2.1 K  | [32, 64]         | [32, 32]        
9  | net.8     | ReLU               | 0      | [32, 32]         | [32, 32]       
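
The output sizes and parameter counts in the summary above can be checked by hand. The sketch below is plain arithmetic, assuming the default Conv2d and MaxPool2d settings used in `net` (no padding, dilation 1); the helper names are made up for this check.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of Conv2d / MaxPool2d: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(c_in, c_out, kernel):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + c_out

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Height/width of the feature map after each spatial layer, starting from 28x28.
sizes = [28]
sizes.append(conv_out(sizes[-1], kernel=3))            # net.0 Conv2d(1, 32, 3)
sizes.append(conv_out(sizes[-1], kernel=2, stride=2))  # net.1 MaxPool2d(2, 2)
sizes.append(conv_out(sizes[-1], kernel=5))            # net.2 Conv2d(32, 64, 5)
sizes.append(conv_out(sizes[-1], kernel=2, stride=2))  # net.3 MaxPool2d(2, 2)

# Only the convolution and linear layers have trainable parameters.
params = {
    "net.0": conv_params(1, 32, 3),
    "net.2": conv_params(32, 64, 5),
    "net.7": linear_params(64, 32),
    "net.9": linear_params(32, 10),
}
total_params = sum(params.values())

print(sizes)
print(params)
print(total_params)
```

The results agree with the table after rounding: 320, 51.3 K, and 2.1 K parameters for the listed layers, and 54.0 K in total.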



In [None]:
pl.utilities.memory.