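The core of Ray Train is the Trainer class. A Trainer is made up of two main parts:
* ScalingConfig: defines the physical side, that is, how many worker processes are launched and which resources each one actually uses
* TrainingFunction: defines the logical side, that is, the network training procedure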
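
As a rough illustration of what the ScalingConfig side implies, the sketch below adds up the resources the workers request in total. It is plain Python arithmetic, not a Ray API: `total_worker_resources` is a hypothetical helper introduced here, and the numbers are the ones used in the demo (2 workers, each asking for 4 CPUs and 0.5 GPU).

```python
def total_worker_resources(num_workers, resources_per_worker):
    """Multiply each per-worker resource request by the number of workers."""
    return {
        name: num_workers * amount
        for name, amount in resources_per_worker.items()
    }


# Values taken from the demo's ScalingConfig.
total = total_worker_resources(2, {"CPU": 4, "GPU": 0.5})
print(total)  # {'CPU': 8, 'GPU': 1.0}
```

With 0.5 GPU per worker, two workers together request exactly one GPU, which is why both can be scheduled on a single-GPU machine.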

Below is a demo that trains a model with TorchTrainer, taken from the Ray documentation: https://docs.ray.io/en/latest/train/getting-started-pytorch.html
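
The demo hard-codes the `gloo` backend because NCCL is not supported on Windows. If the same script should also run on Linux, the backend could be chosen by platform. The helper below is a minimal sketch of that idea; `pick_backend` is a hypothetical name, not part of Ray or PyTorch, and it only returns the backend string.

```python
import sys


def pick_backend(platform: str = sys.platform, has_gpu: bool = True) -> str:
    """Return a torch.distributed backend name for the given platform."""
    # NCCL is unavailable on Windows and needs GPUs, so fall back to gloo.
    if platform.startswith("win") or not has_gpu:
        return "gloo"
    return "nccl"


print(pick_backend("win32"))
print(pick_backend("linux"))
```

The returned string could then be passed as the `backend` argument of `TorchConfig` in place of the literal `"gloo"`.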

In [1]:
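# * Changed the number of GPUs per worker so that both workers can run in parallel on a single GPU
# * Changed the backend, because NCCL is not supported on Windows
#   It still fails on Windows, though, with: PermissionError: [WinError 32] Failed copying 'C:/Users/xxx/model.pt' to 'C:/Users/yyy/model.pt' Detail: [Windows error 32] The process cannot access the file because it is being used by another process.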
import os
import tempfile

import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.models import resnet18
from torchvision.datasets import FashionMNIST
from torchvision.transforms import ToTensor, Normalize, Compose

import ray.train.torch

def train_func(config):
    # Model, Loss, Optimizer
    model = resnet18(num_classes=10)
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    # model.to("cuda")  # This is done by `prepare_model`
    # [1] Prepare model.
    model = ray.train.torch.prepare_model(model)
    criterion = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=0.001)

    # Data
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    data_dir = os.path.join(tempfile.gettempdir(), "data")
    train_data = FashionMNIST(root=data_dir, train=True, download=True, transform=transform)
    train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
    # [2] Prepare dataloader.
    train_loader = ray.train.torch.prepare_data_loader(train_loader)

    # Training
    for epoch in range(10):
        for images, labels in train_loader:
            # This is done by `prepare_data_loader`!
            # images, labels = images.to("cuda"), labels.to("cuda")
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # [3] Report metrics and checkpoint.
        metrics = {"loss": loss.item(), "epoch": epoch}
        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            torch.save(
                model.module.state_dict(),
                os.path.join(temp_checkpoint_dir, "model.pt")
            )
            ray.train.report(
                metrics,
                checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir),
            )
        if ray.train.get_context().get_world_rank() == 0:
            print(metrics)

# [4] Configure scaling and resource requirements.
scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    # Each worker requests half a GPU, so both workers share one physical GPU.
    resources_per_worker={
        "CPU": 4,
        "GPU": 0.5,
    },
    use_gpu=True,
)

# [5] Launch distributed training job.
trainer = ray.train.torch.TorchTrainer(
    train_func,
    scaling_config=scaling_config,
    # Use gloo because NCCL is not supported on Windows.
    torch_config=ray.train.torch.TorchConfig(backend="gloo"),
    # [5a] If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()

# [6] Load the trained model.
with result.checkpoint.as_directory() as checkpoint_dir:
    model_state_dict = torch.load(os.path.join(checkpoint_dir, "model.pt"))
    model = resnet18(num_classes=10)
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    model.load_state_dict(model_state_dict)

Current time: 2024-01-15 21:22:11
Running for: 00:00:58.21
Memory: 10.2/15.7 GiB

Trial name,# failures,error file
TorchTrainer_f470a_00000,1,C:/Users/Five/ray_results/TorchTrainer_2024-01-15_21-21-07/TorchTrainer_f470a_00000_0_2024-01-15_21-21-13\error.txt

Trial name,status,loc
TorchTrainer_f470a_00000,ERROR,127.0.0.1:3744


(RayTrainWorker pid=37000) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=37000) [W socket.cpp:663] [c10d] The client socket has failed to connect to [huya.com]:3557 (system error: 10049 - The requested address is not valid in its context.).
(TorchTrainer pid=3744) Started distributed worker processes:
(TorchTrainer pid=3744) - (ip=127.0.0.1, pid=37000) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=3744) - (ip=127.0.0.1, pid=30140) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=37000) Moving model to device: cuda:0
(RayTrainWorker pid=37000) Wrapping provided model in DistributedDataParallel.
(RayTrainWorker pid=30140) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/Five/ray_results/TorchTrainer_2024-01-15_21-21-07/TorchTrainer_f470a_00000_0_2024-01-15_21-21-13/checkpoint_000000)
(RayTrainWorker pid=30140) [W socket.cpp:663] [c10d] The client socket h

TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("C:\Users\Five\ray_results\TorchTrainer_2024-01-15_21-21-07")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.