Update: Update distributed training README.
chairc committed Jul 15, 2023
1 parent 8a66ad4 commit e7734c1
Showing 2 changed files with 53 additions and 2 deletions.
26 changes: 25 additions & 1 deletion README.md
@@ -43,6 +43,8 @@ We named this project IDDM: Industrial Defect Diffusion Model. It aims to reprod

### Training

#### Normal Training

1. Take the `landscape` dataset as an example and place the dataset files in the `datasets` folder. The overall path of the dataset should be `/your/path/datasets/landscape`, and the image files should be located at `/your/path/datasets/landscape/*.jpg`.

2. Open the `train.py` file and locate the `--dataset_path` parameter. Modify the path in the parameter to the overall dataset path, for example, `/your/path/datasets/landscape`.
@@ -70,7 +72,27 @@ We named this project IDDM: Industrial Defect Diffusion Model. It aims to reprod
```bash
python train.py --resume True --start_epoch 10 --load_model_dir 'df' --conditional False --run_name 'df' --epochs 300 --batch_size 16 --image_size 64
```

#### Distributed Training

1. The basic configuration is similar to normal training, but note that enabling distributed training requires setting `--distributed` to `True`. To prevent distributed training from being enabled arbitrarily, it is gated on several conditions, such as `args.distributed`, `torch.cuda.device_count() > 1`, and `torch.cuda.is_available()` (a minimal sketch of this gating appears after this list).

2. Set the necessary parameters, such as `--main_gpu` and `--world_size`. `--main_gpu` is usually set to the main GPU, which handles validation, testing, and weight saving, and therefore only needs to run on a single card. The value of `--world_size` corresponds to the actual number of GPUs or distributed nodes being used (a sketch of main-GPU-only checkpoint saving also follows this list).

3. There are two ways to set the parameters. One is to modify the `parser` arguments under `if __name__ == "__main__":` in the `train.py` file directly. The other is to run one of the following commands from the console in the `/your/path/Defect-Diffusion-Model/tools` directory:

**Conditional Distributed Training Command**

```bash
python train.py --conditional True --run_name 'df' --epochs 300 --batch_size 16 --image_size 64 --num_classes 10 --distributed True --main_gpu 0 --world_size 2
```

**Unconditional Distributed Training Command**

```bash
python train.py --conditional False --run_name 'df' --epochs 300 --batch_size 16 --image_size 64 --distributed True --main_gpu 0 --world_size 2
```

4. Wait for training to finish. Resuming from an interruption works the same way as in normal training.
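
The gating in step 1 and the per-process setup can be pictured with a minimal PyTorch sketch. This is purely illustrative and assumes the argument names above (`args.distributed`, `args.world_size`, `args.main_gpu`) plus a hypothetical `train_worker` function; the actual logic in `train.py` may differ:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def train_worker(rank, args):
    # Each spawned process drives one GPU; rank 0 usually plays the role of --main_gpu,
    # handling validation and weight saving.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12345")
    dist.init_process_group(backend="nccl", rank=rank, world_size=args.world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()


def launch(args):
    # Distributed training only kicks in when all three conditions from step 1 hold.
    if args.distributed and torch.cuda.device_count() > 1 and torch.cuda.is_available():
        mp.spawn(train_worker, args=(args,), nprocs=args.world_size)
    else:
        # Fall back to single-GPU training on the main GPU.
        torch.cuda.set_device(args.main_gpu)
```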
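
Along the same lines, the main-GPU role from step 2 can be sketched as rank-guarded checkpoint saving, with a `DistributedSampler` giving each process its own shard of the data. Again a rough sketch with illustrative helper names (`make_loader`, `maybe_save`), not the project's exact code:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler


def make_loader(dataset, args, rank):
    # Each process receives a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=args.world_size, rank=rank)
    return DataLoader(dataset, batch_size=args.batch_size, sampler=sampler)


def maybe_save(model, path, rank, args):
    # Only the main GPU writes checkpoints, so weights are not saved world_size times.
    if rank == args.main_gpu:
        torch.save(model.state_dict(), path)
    dist.barrier()  # keep the other ranks in step with the saver
```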

**Parameter Explanation**

@@ -84,7 +106,6 @@ We named this project IDDM: Industrial Defect Diffusion Model. It aims to reprod
| --image_size | | Input image size | int | Input image size. Adaptive input and output sizes |
| --dataset_path | | Dataset path | str | Path to the conditional dataset, such as CIFAR-10, with each class in a separate folder, or the path to the unconditional dataset with all images in one folder |
| --fp16 | | Half precision training | bool | Enable half precision training. It effectively reduces GPU memory usage but may affect training accuracy and results |
| --distributed | | Distributed training | bool | TODO |
| --optim | | Optimizer | str | Optimizer selection. Currently supports Adam and AdamW |
| --lr | | Learning rate | int | Initial learning rate. Currently only supports linear learning rate |
| --result_path | | Save path | str | Path to save the training results |
@@ -94,6 +115,9 @@ We named this project IDDM: Industrial Defect Diffusion Model. It aims to reprod
| --resume | | Resume interrupted training | bool | Set to "True" to resume interrupted training. Note: If the epoch number of interruption is outside the condition of --start_model_interval, it will not take effect. For example, if the start saving model time is 100 and the interruption number is 50, we cannot set any loading epoch points because we did not save the model. We save the xxx_last.pt file every training, so we need to use the last saved model for interrupted training |
| --start_epoch | | Epoch number of interruption | int | Epoch number where the training was interrupted |
| --load_model_dir | | Folder name of the loaded model | str | Folder name of the previously loaded model |
| --distributed | | Distributed training | bool | Enable distributed training |
| --main_gpu | | Main GPU for distributed | int | Set the main GPU for distributed training |
| --world_size | | Number of distributed nodes | int | Number of distributed nodes, corresponds to the actual number of GPUs or distributed nodes being used |
| --num_classes | | Number of classes | int | Number of classes used for classification |
| --cfg_scale | | Classifier-free guidance weight | int | Classifier-free guidance interpolation weight, used to improve generation quality |

29 changes: 28 additions & 1 deletion README_zh.md
@@ -43,6 +43,8 @@

### Training

#### Normal Training

1. Take the `landscape` dataset as an example and place the dataset files in the `datasets` folder. The overall dataset path should be `/your/path/datasets/landscape`, and the dataset image paths should be `/your/path/datasets/landscape/*.jpg`.

2. Open the `train.py` file, find the `--dataset_path` parameter, and change the path in the parameter to the overall dataset path, for example `/your/path/datasets/landscape`.
@@ -71,6 +73,29 @@
python train.py --resume True --start_epoch 10 --load_model_dir 'df' --conditional False --run_name 'df' --epochs 300 --batch_size 16 --image_size 64
```

#### Distributed Training

1. The basic configuration is similar to normal training, but note that enabling distributed training requires setting `--distributed` to `True`. To prevent distributed training from being enabled arbitrarily, we set several basic conditions for it, such as `args.distributed`, `torch.cuda.device_count() > 1`, and `torch.cuda.is_available()`.

2. Set the necessary parameters, such as `--main_gpu` and `--world_size`. `--main_gpu` is usually set to the main GPU, which handles validation, testing, and weight saving, and only needs to run on a single card. The value of `--world_size` corresponds to the actual number of GPUs or distributed nodes being used.

3. There are two ways to set the parameters. One is to modify the `parser` arguments under `if __name__ == "__main__":` in the `train.py` file directly. The other is to run one of the following commands from the console in the `/your/path/Defect-Diffusion-Model/tools` directory:

**Conditional Training Command**

```bash
python train.py --conditional True --run_name 'df' --epochs 300 --batch_size 16 --image_size 64 --num_classes 10 --distributed True --main_gpu 0 --world_size 2
```

**Unconditional Training Command**

```bash
python train.py --conditional False --run_name 'df' --epochs 300 --batch_size 16 --image_size 64 --distributed True --main_gpu 0 --world_size 2
```

4. Wait for training to finish. Resuming from an interruption works the same way as in normal training.

**Parameter Explanation**
@@ -85,7 +110,6 @@
| --image_size | | Input image size | int | Input image size. Adaptive input and output sizes |
| --dataset_path | | Dataset path | str | For a conditional dataset such as CIFAR-10, each class has its own folder and the path is the main folder; for an unconditional dataset, all images are in one folder and the path is that image folder |
| --fp16 | | Half precision training | bool | Enable half precision training. It effectively reduces GPU memory usage but cannot guarantee training accuracy and results |
| --distributed | | Distributed training | bool | TODO |
| --optim | | Optimizer | str | Optimizer selection. Currently supports Adam and AdamW |
| --lr | | Learning rate | int | Initial learning rate. Currently only supports linear learning rate |
| --result_path | | Save path | str | Path to save the training results |
@@ -95,6 +119,9 @@
| --resume | | Resume interrupted training | bool | Set to "True" to resume training. Note: if the epoch number of the interruption falls outside the --start_model_interval condition, it will not take effect. For example, if model saving starts at epoch 100 and the interruption happened at epoch 50, no loading epoch can be set because no model was saved. We save the xxx_last.pt file on every training run, so the last saved model is used to resume interrupted training |
| --start_epoch | | Epoch number of interruption | int | Epoch number where the training was interrupted |
| --load_model_dir | | Folder name of the loaded model | str | Folder name of the previously loaded model for the interrupted epoch |
| --distributed | | Distributed training | bool | Enable distributed training |
| --main_gpu | | Main GPU for distributed training | int | Set the main GPU for distributed training |
| --world_size | | Number of distributed nodes | int | Number of distributed nodes; the value corresponds to the actual number of GPUs or distributed nodes being used |
| --num_classes | | Number of classes | int | Number of classes used for classification |
| --cfg_scale | | Classifier-free guidance weight | int | Classifier-free guidance interpolation weight, used to improve generation quality |

