```
deepspeed --num_gpus 2 --num_nodes 1 torch_nccl_test.py

torchrun --nproc_per_node 2 --nnodes 1 torch_nccl_test.py

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

# accelerate: train an unconditional diffusion model
accelerate launch train_unconditional.py \
  --dataset_name="huggan/smithsonian_butterflies_subset" \
  --resolution=64 \
  --output_dir={model_name} \
  --train_batch_size=32 \
  --num_epochs=50 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision="no"
```

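- torchrun: the successor to `torch.distributed.launch` (an entry point for `torch.distributed.run`)
    - Compared with running a script directly with `python`, torchrun spawns and configures the worker processes automatically, which makes it easier to run a script in a distributed setup. It is mainly used for training with PyTorch's distributed package, `torch.distributed`.
- deepspeed: a deep learning optimization library
    - ZeRO optimization stages
- accelerate: a library from Hugging Face
    - launcher module: `accelerate.commands.launch`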
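
As a rough illustration of what torchrun's automatic setup amounts to: it exports environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process it spawns, and the script reads them. The helper below is a minimal sketch of that step. `read_dist_env` and `DistEnv` are hypothetical names, not part of PyTorch, and the single-process defaults are an assumption so the script also runs under plain `python`.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (usually the GPU index)
    world_size: int  # total number of processes


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the rank variables that torchrun exports for each worker."""
    env = os.environ if env is None else env
    # Fall back to single-process values when the variables are absent.
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    info = read_dist_env()
    print(f"rank {info.rank} of {info.world_size}, local rank {info.local_rank}")
```

Launched with `torchrun --nproc_per_node 2`, each of the two processes would see its own `RANK` and `LOCAL_RANK`.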

## deepspeed

- `which deepspeed`
    - `~/anaconda3/envs/trl/bin/deepspeed`

```
{
   "name": "Python: Debug DeepSpeed",
   "type": "python",
   "request": "launch",
   "program": "/home/nouamane/miniconda3/envs/dev/bin/deepspeed",
   "justMyCode": true,
   "console": "integratedTerminal",
   "args": [
       "--num_nodes=1",
       "--num_gpus=2",
       "/home/nouamane/projects/llm/Megatron-DeepSpeed/pretrain_gpt.py"
   ]
},
```
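
The `program` path in this entry is specific to one machine. Below is a minimal sketch of generating the same entry for the local environment, assuming the launcher can be found with `shutil.which` (the Python counterpart of `which deepspeed`). `deepspeed_debug_config` is a hypothetical helper, and the script and launcher paths in the usage lines are placeholders.

```python
import json
import shutil
from typing import Optional


def deepspeed_debug_config(script: str, num_gpus: int = 2, num_nodes: int = 1,
                           launcher: Optional[str] = None) -> dict:
    """Build a VS Code debug entry that runs the deepspeed launcher as the program."""
    # Resolve the launcher from PATH unless the caller supplies it explicitly.
    launcher = launcher or shutil.which("deepspeed")
    if launcher is None:
        raise FileNotFoundError("deepspeed launcher not found on PATH")
    return {
        "name": "Python: Debug DeepSpeed",
        "type": "python",
        "request": "launch",
        "program": launcher,
        "justMyCode": True,
        "console": "integratedTerminal",
        "args": [f"--num_nodes={num_nodes}", f"--num_gpus={num_gpus}", script],
    }


if __name__ == "__main__":
    # Placeholder launcher path so the example runs without deepspeed installed;
    # drop the `launcher` argument to resolve it from PATH instead.
    entry = deepspeed_debug_config("pretrain_gpt.py",
                                   launcher="/path/to/env/bin/deepspeed")
    print(json.dumps(entry, indent=4))
```

The printed object can be pasted into the `configurations` array of `launch.json`.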

## accelerate

- `accelerate config`

```
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch sft_llama2",
            "type": "debugpy",
            "request": "launch",
            "module": "accelerate.commands.launch",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                "${workspaceFolder}/examples/research_projects/stack_llama_2/scripts/sft_llama2.py",
                "--output_dir=./sft",
                "--max_steps=500",
                "--logging_steps=10",
                "--save_steps=10",
                "--per_device_train_batch_size=2",
                "--per_device_eval_batch_size=1",
                "--gradient_accumulation_steps=2",
                "--gradient_checkpointing=False",
                "--group_by_length=False",
                "--learning_rate=1e-4",
                "--lr_scheduler_type=cosine",
                "--warmup_steps=100",
                "--weight_decay=0.05",
                "--optim=paged_adamw_32bit",
                "--bf16=True",
                "--remove_unused_columns=False",
                "--run_name=sft_llama2",
                "--report_to=wandb"
            ]
        },
        {
            "name": "Launch dpo_llama2",
            "type": "debugpy",
            "request": "launch",
            "module": "accelerate.commands.launch",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                "${workspaceFolder}/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py",
                "--model_name_or_path=sft/final_checkpoint",
                "--output_dir=dpo"
            ]
        }
    ]
}
```
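
For reference, a `module` entry like the ones in this file corresponds to running `python -m <module> <args>` in a terminal. The sketch below reconstructs that command line from a configuration dict. `config_to_command` is a hypothetical helper, and `${workspaceFolder}` is replaced with a caller-supplied path, since only VS Code expands that variable.

```python
import shlex
from typing import Any, Mapping


def config_to_command(config: Mapping[str, Any], workspace_folder: str = ".",
                      python: str = "python") -> str:
    """Turn a `module`-style debug configuration into a shell command string."""
    # Expand the VS Code variable by hand; the shell does not know it.
    args = [arg.replace("${workspaceFolder}", workspace_folder)
            for arg in config.get("args", [])]
    # shlex.join quotes any argument that needs it.
    return shlex.join([python, "-m", config["module"], *args])


if __name__ == "__main__":
    dpo = {
        "module": "accelerate.commands.launch",
        "args": [
            "${workspaceFolder}/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py",
            "--model_name_or_path=sft/final_checkpoint",
            "--output_dir=dpo",
        ],
    }
    print(config_to_command(dpo))
```

This makes it easy to check that a debug configuration and the command used outside the debugger stay in sync.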