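The Transformers library is built on top of the PyTorch framework (its TensorFlow version is not feature-complete).
Although the official documentation claims that using Transformers requires no PyTorch knowledge, in practice we still need PyTorch's
DataLoader class to load data, PyTorch optimizers to update the model parameters, and so on.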

In [None]:
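# Tensors are the foundation of deep learning: a 0-dimensional tensor is called a scalar, a 1-dimensional tensor a vector, and a 2-dimensional tensor a matrix. PyTorch is essentially a tensor-based numerical computing toolkit, and it offers several ways to create tensors: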

In [None]:
import torch
torch.empty(2, 3) # empty tensor (uninitialized), shape (2,3)

tensor([[-5.9185e-31,  1.3116e-42,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00]])

In [2]:
torch.rand(2, 3) # random tensor, each value taken from [0,1)

tensor([[0.0597, 0.7948, 0.3898],
        [0.4318, 0.8445, 0.8419]])

In [3]:
torch.randn(2, 3) # random tensor, each value taken from standard normal distribution

tensor([[ 0.2076,  0.9583, -0.4509],
        [-0.6981, -0.3623, -0.6361]])

In [4]:
torch.zeros(2, 3, dtype=torch.long) # long integer zero tensor

tensor([[0, 0, 0],
        [0, 0, 0]])

In [5]:
torch.zeros(2, 3, dtype=torch.double) # double float zero tensor

tensor([[0., 0., 0.],
        [0., 0., 0.]], dtype=torch.float64)

In [8]:
torch.arange(10)                                 # integers 0 through 9
torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])     # the same tensor built from a Python list

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [7]:
array = [[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]]
torch.tensor(array)


tensor([[1.0000, 3.8000, 2.1000],
        [8.6000, 4.0000, 2.4000]])

In [9]:
import numpy as np
array = np.array([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]])
torch.from_numpy(array)

tensor([[1.0000, 3.8000, 2.1000],
        [8.6000, 4.0000, 2.4000]], dtype=torch.float64)
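
One detail the two conversions above do not show: torch.tensor copies its input, whereas torch.from_numpy shares memory with the NumPy array. A minimal sketch (the variable names shared and copied are illustrative only):

```python
import numpy as np
import torch

array = np.array([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]])
shared = torch.from_numpy(array)  # shares the NumPy buffer
copied = torch.tensor(array)      # makes an independent copy

array[0, 0] = -1.0                # modify the NumPy array in place

print(shared[0, 0].item())  # reflects the change made to array
print(copied[0, 0].item())  # still the original value
print(shared.dtype)         # NumPy float64 is kept as torch.float64
```

If the shared buffer is not wanted, calling .clone() on the result of from_numpy gives an independent tensor.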

     PyTorch with CUDA Support 
     conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y 

In [1]:
# this code checks that PyTorch can access the GPU
# switch to the conda env pytorh_GPU_cuda first
import torch
#print(torch.cuda.is_available())
print(torch.version.cuda)

12.4


In [None]:
import torch, platform
print("PyTorch", torch.__version__) # a "+cpu" suffix (e.g. 2.8.0+cpu) is a CPU-only build; a "+cuXXX" suffix (e.g. 2.6.0+cu124) is a CUDA build
print("CUDA available", torch.cuda.is_available())
print("CUDA version", torch.version.cuda)
print("GPU", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")

PyTorch 2.6.0+cu124
CUDA available True
CUDA version 12.4
GPU NVIDIA GeForce RTX 3060 Laptop GPU


PyTorch build: 2.8.0+cu126
Python: 3.11.13 64bit
CUDA runtime (nvcc): Cuda compilation tools, release 12.6, V12.6.85

In [3]:
import torch, platform, subprocess, sys
print("PyTorch build:", torch.__version__)          # the +cuXXX suffix names the CUDA version the wheel was built against
print("Python:", sys.version.split()[0], platform.architecture()[0])
try:
    print("CUDA runtime (nvcc):", subprocess.check_output(["nvcc","--version"], text=True).split("\n")[3])
except FileNotFoundError:
    print("CUDA runtime (nvcc): not found  ← this is OK, wheels bundle their own")

PyTorch build: 2.6.0+cu124
Python: 3.11.13 64bit
CUDA runtime (nvcc): Cuda compilation tools, release 12.6, V12.6.85


In [4]:
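# Tensors created in the ways shown above are stored in main memory and computed on the CPU. To compute on the GPU, create the tensor directly on the GPU or move an existing tensor to it: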

torch.rand(2, 3).cuda()


tensor([[0.7627, 0.1800, 0.5982],
        [0.5238, 0.3279, 0.6467]], device='cuda:0')

In [None]:
import torch
print("CUDA available:", torch.cuda.is_available())
print(  "Device count:", torch.cuda.device_count())
print("Current device:", torch.cuda.current_device())
print(   "Device name:", torch.cuda.get_device_name(0))

x = torch.rand(1).cuda()
print("Random tensor on GPU:", x)

CUDA available: True
Device count: 1
Current device: 0
Device name: NVIDIA GeForce RTX 3060 Laptop GPU
Random tensor on GPU: tensor([0.1086], device='cuda:0')


In [None]:
torch.rand(2, 3, device="cuda")

tensor([[0.4040, 0.3405, 0.9690],
        [0.2856, 0.7073, 0.5478]], device='cuda:0')

In [None]:
torch.rand(2, 3).to("cuda")

tensor([[0.2300, 0.5589, 0.6929],
        [0.0508, 0.3757, 0.6551]], device='cuda:0')
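
The three calls above assume that a CUDA device is present and raise an error otherwise. A device-agnostic sketch that falls back to the CPU (the name device is only a convention):

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(2, 3, device=device)  # created directly on the device
b = torch.rand(2, 3).to(device)      # created on the CPU, then moved

c = a + b                            # both operands live on the same device
print(c.device.type)
```

Creating the tensor directly on the device avoids the extra host-to-device copy that .to(device) performs.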

In [None]:
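# A tensor generally has to be contiguous before view can be applied; call is_contiguous to check.
# If it is not contiguous, first make it so with the contiguous function.
# Alternatively, call the newer reshape function,
# which behaves almost identically to view and also handles non-contiguous tensors.

# Transpose: transpose swaps two dimensions of a tensor; its arguments are the two dimensions to swap: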

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
x


tensor([[1, 2, 3],
        [4, 5, 6]])

In [9]:

x.transpose(0, 1)


tensor([[1, 4],
        [2, 5],
        [3, 6]])
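
To make the remark about contiguity concrete: a transposed tensor is a typical non-contiguous tensor. In this sketch view raises a RuntimeError on it, while reshape and contiguous().view both succeed (view_failed, flat_a and flat_b are illustrative names):

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
xt = x.transpose(0, 1)            # shape (3, 2), same storage as x

print(x.is_contiguous())          # True
print(xt.is_contiguous())         # False

try:
    xt.view(6)                    # view cannot flatten this memory layout
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                # True

flat_a = xt.reshape(6)            # copies the data when it has to
flat_b = xt.contiguous().view(6)  # explicit copy, then view
print(flat_a)                     # tensor([1, 4, 2, 5, 3, 6])
```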

In [10]:
# Permute: unlike transpose, which swaps only two dimensions per call, permute sets a new order for all dimensions at once:

x = torch.tensor([[[1, 2, 3], [4, 5, 6]]])
print(x, x.shape)


tensor([[[1, 2, 3],
         [4, 5, 6]]]) torch.Size([1, 2, 3])


In [20]:

x = x.permute(2, 0, 1)
print(x, x.shape)


tensor([[[1, 4]],

        [[2, 5]],

        [[3, 6]]]) torch.Size([3, 1, 2])


In [None]:
x = torch.arange(1, 4) # shape (3,)
y = torch.arange(4, 6) # shape (2,)
print(x)
print(y)

tensor([1, 2, 3])
tensor([4, 5])


In [None]:
x = torch.arange(1, 4).view(3, 1) # shape (3,1) 
y = torch.arange(4, 6).view(1, 2) # shape (1,2)
print(x)
print(y)

tensor([[1],
        [2],
        [3]])
tensor([[4, 5]])


In [None]:
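# Broadcasting
# So far we have assumed that the two tensors in an operation have the same shape. In some cases, even when the shapes differ,
# the broadcasting mechanism
# replicates the elements of one or both tensors so that their shapes match, and then performs the element-wise computation.

# For example, let us create two tensors with different shapes: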

x = torch.arange(1, 4).view(3, 1) # shape (3,1) 
y = torch.arange(4, 6).view(1, 2) # shape (1,2)
print(x)
print(y)

tensor([[1],
        [2],
        [3]])
tensor([[4, 5]])


In [None]:
print(x + y)
# the result has shape (3, 2):
# x is copied twice along dim 1,
# y is copied three times along dim 0

tensor([[5, 6],
        [6, 7],
        [7, 8]])
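
The copying described above can be spelled out with expand, which makes the broadcast shape explicit. This is only a sketch: broadcasting does the same thing implicitly and without materializing the copies. The names x_big and y_big are illustrative.

```python
import torch

x = torch.arange(1, 4).view(3, 1)   # shape (3, 1)
y = torch.arange(4, 6).view(1, 2)   # shape (1, 2)

x_big = x.expand(3, 2)              # repeat x along dim 1
y_big = y.expand(3, 2)              # repeat y along dim 0

print(x_big)
print(y_big)
print(torch.equal(x_big + y_big, x + y))          # True
print(torch.broadcast_shapes(x.shape, y.shape))   # torch.Size([3, 2])
```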


In [13]:
x = torch.arange(24).view(2,3, 4)
x

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [15]:
x = torch.arange(12).view(3, 4)
x

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [None]:
x[1, 3] # element at row index 1 (second row), column index 3 (fourth column)

tensor(7)

In [17]:
x[1:3] # rows with index 1 and 2

tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [18]:
x[:, 2]

tensor([ 2,  6, 10])

In [19]:
x[:, 2:4]

tensor([[ 2,  3],
        [ 6,  7],
        [10, 11]])
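
Worth knowing when using these slices: basic indexing returns a view that shares storage with the original tensor, so writing into a slice also changes the source. A small sketch (col and safe are illustrative names):

```python
import torch

x = torch.arange(12).view(3, 4)
col = x[:, 2]            # a view of the third column
col[0] = 100             # writes through to x

print(x[0, 2])           # tensor(100)

safe = x[:, 2].clone()   # clone() gives an independent copy
safe[1] = -1
print(x[1, 2])           # tensor(6), unchanged
```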

In [None]:
# # 1. Uninstall the existing Pillow (whether it was installed with pip or conda)
# # Note by ZHEN: it is better to run these in a terminal, because the conda yes/no prompt sometimes does not show up in a Jupyter cell
# conda uninstall pillow -y 
# pip uninstall pillow  # make sure it is removed completely!

# # 2. Reinstall with conda (resolves the DLL dependencies automatically)
# conda install pillow -c conda-forge -y 

In [20]:
import platform
print(platform.architecture())  # should print ('64bit', 'WindowsPE') on 64-bit Windows

('64bit', 'WindowsPE')


In [21]:
from PIL import Image
print(Image.__version__)

11.3.0


cannot import name 'datasets' from 'torchvision' (unknown location)

In [22]:
from importlib.metadata import version  # standard-library replacement for the deprecated pkg_resources
print(version("torchvision"))
# # Install from the official pytorch channel (matches the CUDA version automatically)
#conda install torchvision -c pytorch -c nvidia
# conda update torchvision -c pytorch -y

0.21.0+cu124


In [23]:
import torchvision
print(torchvision.__version__)

0.21.0+cu124


In [26]:
from torch.utils.data       import DataLoader
from torchvision            import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

img = train_features[0].squeeze()
label = train_labels[0]
print(img.shape)
print(f"Label: {label}")

100%|██████████| 26.4M/26.4M [00:07<00:00, 3.61MB/s]
100%|██████████| 29.5k/29.5k [00:00<00:00, 235kB/s]
100%|██████████| 4.42M/4.42M [00:04<00:00, 1.00MB/s]
100%|██████████| 5.15k/5.15k [00:00<?, ?B/s]

Feature batch shape: torch.Size([64, 1, 28, 28])
Labels batch shape: torch.Size([64])
torch.Size([28, 28])
Label: 3





In [24]:
#conda activate torch_cuda
!python -c "import torch; print(torch.__version__)"

2.6.0+cu124


In [25]:
from torch.utils.data import IterableDataset, DataLoader

class MyIterableDataset(IterableDataset):

    def __init__(self, start, end):
        super().__init__()
        assert end > start
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))

ds = MyIterableDataset(start=3, end=7) # [3, 4, 5, 6]
# Single-process loading
print(list(DataLoader(ds, num_workers=0)))
# # Directly doing multi-process loading
# print(list(DataLoader(ds, num_workers=2)))

[tensor([3]), tensor([4]), tensor([5]), tensor([6])]
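
Note that with num_workers > 0 every worker process receives its own copy of an IterableDataset, so the dataset above would yield each element once per worker. One common remedy, sketched here under the hypothetical class name ShardedRangeDataset, is to split the range inside __iter__ with torch.utils.data.get_worker_info():

```python
import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedRangeDataset(IterableDataset):  # hypothetical name, for illustration
    def __init__(self, start, end):
        super().__init__()
        assert end > start
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the full range.
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: each worker takes one contiguous shard.
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

sharded_ds = ShardedRangeDataset(start=3, end=7)
single = [int(batch) for batch in DataLoader(sharded_ds, num_workers=0)]
print(single)  # [3, 4, 5, 6]
```

Only the single-process path is exercised here. With num_workers=2 the same class is meant to give worker 0 the shard [3, 4] and worker 1 the shard [5, 6].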


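Building the model
We again use the FashionMNIST dataset loaded earlier and build a neural network for image classification. As before, the model inherits from the nn.Module class: __init__() initializes the layers and parameters, and forward() defines the computation the model performs, for example: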


In [27]:

import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
            nn.Dropout(p=0.2)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

Using cuda device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=10, bias=True)
    (5): Dropout(p=0.2, inplace=False)
  )
)


In [None]:
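As we can see, the model first flattens each two-dimensional image into a one-dimensional vector with a Flatten layer, passes it through two fully connected hidden layers with ReLU activations, and finally feeds it into a classifier with 10 neurons for the 10-class task. We also add a Dropout layer just before the final output to mitigate overfitting.

The model outputs a 10-dimensional vector (one predicted value per class). As with the pipeline models introduced earlier, these outputs are logits, so a Softmax layer is still needed to obtain the final probabilities. Below we build a mini-batch of four fake two-dimensional images and run a prediction: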


In [None]:
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512), # 28*28=784 input layer size to 512 hidden layer size
            nn.ReLU(),
            nn.Linear(512, 256),   # 512 hidden layer size to 256 hidden layer size 
            nn.ReLU(),
            nn.Linear(256, 10),    # 256 hidden layer size to 10 output layer size
            nn.Dropout(p=0.2)
        )

    def forward(self, x):
        x = self.flatten(x)       # flatten input tensor from 28x28 to 784
        logits = self.linear_relu_stack(x)  # pass through the layers
        return logits 

model = NeuralNetwork().to(device) # send the model to GPU 

# create a mini-batch of 4 random  input images 
X = torch.rand(4, 28, 28, device=device) # create a mini-batch of 4 random 28x28 images on GPU




logits = model(X)

pred_probab = nn.Softmax(dim=1)(logits)
print(pred_probab.size())
y_pred = pred_probab.argmax(-1)
print(f"Predicted class: {y_pred}")

Using cuda device
torch.Size([4, 10])
Predicted class: tensor([1, 1, 4, 1], device='cuda:0')
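
Softmax only rescales the logits into probabilities that sum to 1 in each row; it does not change which class scores highest. A sketch with hand-picked logits (the values are made up for illustration):

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
probs = nn.Softmax(dim=1)(logits)

print(probs.sum(dim=1))        # each row sums to 1
print(logits.argmax(dim=-1))   # tensor([0, 2])
print(probs.argmax(dim=-1))    # tensor([0, 2])
```

So when only the predicted class is needed, argmax can be taken on the logits directly.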


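Optimizing model parameters
With the data prepared and the model built, we can start training and testing (validating) the model. As mentioned earlier, training is an iterative process: in each epoch the model makes predictions for the input samples, computes the loss of those predictions, takes the partial derivative of the loss with respect to every model parameter, and finally uses the optimizer to update all the parameters.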

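A loss function measures the difference between the predictions and the ground truth; training a model amounts to minimizing the loss function. PyTorch implements many common loss functions, such as the mean squared error nn.MSELoss for regression, the negative log-likelihood nn.NLLLoss for classification, and the cross-entropy loss nn.CrossEntropyLoss, which combines nn.LogSoftmax and nn.NLLLoss.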
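
The relationship between these classification losses can be checked numerically. In this sketch (random logits and made-up labels), nn.CrossEntropyLoss applied to raw logits matches nn.NLLLoss applied to the output of nn.LogSoftmax:

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # 4 samples, 10 classes
labels = torch.tensor([1, 0, 4, 9])    # made-up class indices

ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))  # True
```

This is why a model trained with nn.CrossEntropyLoss should return raw logits rather than Softmax outputs.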

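An optimizer applies a specific optimization algorithm (for example, stochastic gradient descent) and adjusts the model parameters at every training step so as to reduce the loss computed on one batch of samples. PyTorch implements many optimizers, such as SGD, Adam, and RMSprop.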
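
As a concrete illustration of a single optimizer step, here is a sketch with plain SGD on one made-up parameter w and the loss w**2. The gradient at w = 1 is 2, so with a learning rate of 0.1 the update is 1 - 0.1 * 2:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
optimizer.zero_grad()
loss.backward()          # d(loss)/dw = 2 * w = 2.0
optimizer.step()         # w <- w - lr * grad

print(w.grad)            # tensor([2.])
print(w.detach())        # approximately 0.8
```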

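Each epoch actually consists of two parts:

The train loop iterates over the training set and tries to converge to the best parameters;
The validation/test loop iterates over the validation/test set to check whether model performance is improving.
Specifically, inside the train loop the optimizer performs the optimization in three steps:

Call optimizer.zero_grad() to reset the gradients of the model parameters. Gradients accumulate by default, so they must be cleared in every training step before backpropagation to avoid double counting;
Call loss.backward() to backpropagate the prediction loss, i.e. to compute the partial derivative of the loss with respect to every parameter;
Call optimizer.step() to adjust the model parameters according to the gradients.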
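
The need to reset gradients can be observed directly: without a reset, a second backward() call adds to the gradient already stored in .grad. A sketch on a single made-up parameter:

```python
import torch

w = torch.tensor([3.0], requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()         # the gradient of 2*w is 2

(2 * w).sum().backward()       # no reset in between: gradients add up
accumulated = w.grad.clone()   # 2 + 2 = 4

w.grad.zero_()                 # manual reset; optimizer.zero_grad() serves the same purpose
(2 * w).sum().backward()
after_reset = w.grad.clone()   # back to 2

print(first, accumulated, after_reset)
```
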
Below we choose cross-entropy as the loss function and AdamW as the optimizer; the complete train loop and test loop are implemented as follows:


In [31]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use GPU <<====

print(f'Using {device} device')

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

learning_rate = 1e-3
batch_size = 64
epochs = 10 #3

# datasets -> dataloaders 
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader  = DataLoader(test_data, batch_size=batch_size)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
            nn.Dropout(p=0.2)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device) # send the model to GPU 

# Training loop 
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader, start=1):
        X, y = X.to(device), y.to(device)
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

# Test loop
def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(dim=-1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

Using cuda device
Epoch 1
-------------------------------
loss: 0.737276  [ 6400/60000]
loss: 0.988355  [12800/60000]
loss: 0.515212  [19200/60000]
loss: 0.957792  [25600/60000]
loss: 0.488365  [32000/60000]
loss: 0.663733  [38400/60000]
loss: 0.874533  [44800/60000]
loss: 0.499112  [51200/60000]
loss: 0.520403  [57600/60000]
Test Error: 
 Accuracy: 84.0%, Avg loss: 0.449959 

Epoch 2
-------------------------------
loss: 0.556642  [ 6400/60000]
loss: 0.709385  [12800/60000]
loss: 0.537429  [19200/60000]
loss: 0.929084  [25600/60000]
loss: 0.405274  [32000/60000]
loss: 0.591053  [38400/60000]
loss: 1.109954  [44800/60000]
loss: 0.535239  [51200/60000]
loss: 0.351561  [57600/60000]
Test Error: 
 Accuracy: 84.0%, Avg loss: 0.429587 

Epoch 3
-------------------------------
loss: 0.620890  [ 6400/60000]
loss: 0.447753  [12800/60000]
loss: 0.644886  [19200/60000]
loss: 0.663515  [25600/60000]
loss: 0.502192  [32000/60000]
loss: 0.432372  [38400/60000]
loss: 0.826991  [44800/60000]
loss: 0.

In [None]:
# span style="color: red;">This text is red.</span> set color red

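As we can see, over 3 epochs the model's loss on the training set gradually decreases and its accuracy on the test set gradually increases, which shows that the optimizer successfully adjusted the model parameters and that no overfitting occurred.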

<span style="color: red;">Note: always call model.eval() before making predictions.</span>

<span style="color: orange;">This method puts dropout and batch normalization layers into evaluation mode; </span>

<span style="color: cyan;"> otherwise the predictions will be inconsistent.</span>
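
The effect of evaluation mode on dropout is easy to observe in isolation. In this sketch a standalone nn.Dropout(p=0.5) zeroes a random subset of the inputs in training mode and scales the survivors by 1 / (1 - p) = 2, while in evaluation mode it passes the input through unchanged:

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
train_out = drop(x)   # every entry is either 0 or 2
drop.eval()
eval_out = drop(x)    # identical to x

print(train_out)
print(eval_out)
```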


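4. Saving and loading models
In an earlier article we introduced how to save and load the Model classes. However, if the pretrained model is used only as one module (for example, as an encoder), the final complete model is a custom PyTorch model, and it must be saved and loaded through PyTorch's own interfaces.

Saving and loading model weights
A PyTorch model stores all of its parameters in a state dictionary, which can be obtained with Model.state_dict(). PyTorch saves the model weights with torch.save():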



In [None]:
import torch
import torchvision.models as models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT) # pretrained=True is deprecated since torchvision 0.13
torch.save(model.state_dict(), 'model_weights.pth')


In [None]:
To load the saved weights, we first create a model instance with exactly the same architecture and then load them with Model.load_state_dict():

In [None]:
model = models.vgg16() # no pretrained weights are requested here; they come from the file below
model.load_state_dict(torch.load('model_weights.pth', weights_only=True)) # weights_only=True restricts unpickling to tensors and plain containers
model.eval()
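
The same round trip can be tried without downloading VGG16. This sketch uses a tiny nn.Linear model and a temporary file (both chosen here purely for illustration) to show what a state dictionary contains and that loading restores identical weights:

```python
import os
import tempfile

import torch
from torch import nn

source = nn.Linear(3, 2)
print(list(source.state_dict().keys()))   # ['weight', 'bias']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "linear_weights.pth")   # hypothetical file name
    torch.save(source.state_dict(), path)

    target = nn.Linear(3, 2)   # same architecture, different random weights
    target.load_state_dict(torch.load(path, weights_only=True))

print(torch.equal(source.weight, target.weight))   # True
```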

In [None]:
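Saving and loading the complete model
Saving only the weights saves space, but before loading we must build a model instance with exactly the same architecture to receive them.
To store the model architecture together with the weights, pass the whole model to torch.save():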


In [None]:

import torch
import torchvision.models as models

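model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
torch.save(model, 'model.pth')

# The whole model (weights and architecture) can then be loaded directly from the saved file.
# Since PyTorch 2.6, torch.load defaults to weights_only=True, so a fully pickled model
# has to be loaded with weights_only=False (only do this for files you trust):
model = torch.load('model.pth', weights_only=False)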