In [1]:
import accelerate, deepspeed

[2024-07-02 20:07:01,696] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [2]:
accelerate.__version__, deepspeed.version

('0.29.3', '0.12.5')

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, HfArgumentParser
from transformers import TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import torch

In [4]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

## loss 

- https://huggingface.co/blog/deepspeed-to-fsdp-and-back
    - issue: https://github.com/huggingface/accelerate/issues/2624
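    - Aligning Accelerate FSDP with Accelerate DeepSpeed
    - Previously, the loss could fail to converge with Accelerate FSDP; once the (mixed) precision handling was aligned, the problem went away.
        - More specifically, it was a model/optimizer precision issue.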
        - we can perform upcasting automatically for FSDP when mixed precision is enabled. We created a pull request with this change that was included in the 0.30.0 release.

### Precision Matters

- As the fp32 in the name suggests, DeepSpeed was performing **upcasting internally**, and it always keeps its **master weights in fp32 by design**. This upcasting to full precision meant that **the optimizer could converge at learning rates that it would not converge in lower precision**. 
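
To make the effect of fp32 master weights concrete, here is a minimal sketch in pure Python. It is not DeepSpeed's or FSDP's implementation: `to_bf16` and `to_fp32` are hypothetical helpers that simulate rounding with `struct`, and the weight, gradient, and step count are made-up values. Only the learning rate `1e-6` matches the launch commands in this notebook.

```python
import struct

def to_fp32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Simulate bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even on the dropped bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

lr, grad, steps = 1e-6, 1.0, 5000  # grad and steps are illustrative assumptions

w_bf16 = 1.0    # weight stored and updated purely in (simulated) bf16
w_master = 1.0  # fp32 master weight, as in an upcasting optimizer

for _ in range(steps):
    # Pure bf16 update: the tiny step is rounded away every time.
    w_bf16 = to_bf16(w_bf16 - to_bf16(lr * grad))
    # fp32 master update: the step is small but representable, so it accumulates.
    w_master = to_fp32(w_master - lr * grad)

print("bf16-only weight:   ", w_bf16)
print("fp32 master weight: ", w_master)
print("master cast to bf16:", to_bf16(w_master))
```

In this toy setting the bf16-only weight never leaves 1.0, because a step of `1e-6` is far below the bf16 spacing near 1.0, while the fp32 master weight drifts far enough that its bf16 copy eventually changes too.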

## experiments

- Settings
    - model: "mistralai/Mistral-7B-v0.1"
    - dataset: 'tatsu-lab/alpaca'

In [14]:
getattr(torch, 'bfloat16')

torch.bfloat16

```
model = AutoModelForCausalLM.from_pretrained(
    'mistralai/Mistral-7B-v0.1',
    torch_dtype=getattr(torch, 'bfloat16'), ## UPDATED
    attn_implementation='sdpa', ## UPDATED
)
```

- `transformers.models.mistral.modeling_mistral.MistralForCausalLM`

In [5]:
dataset = load_dataset('tatsu-lab/alpaca', split='train')

In [10]:
dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})

In [9]:
dataset[0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
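
The `text` field above appears to be the instruction and output joined by a fixed prompt template. The sketch below reconstructs that template for the no-input case and splits a formatted string at the `### Response:` marker, which is the boundary a completion-only loss would use. The template is inferred from this single record; `build_text`, `split_prompt_completion`, and the shortened `output` are hypothetical, and records with a non-empty `input` may use a different template that is not shown here.

```python
# Template inferred from the `text` field of the record shown above (no-input case).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
RESPONSE_MARKER = "### Response:\n"

def build_text(record):
    """Join instruction and output the way the `text` column appears to be built."""
    return PROMPT_NO_INPUT.format(instruction=record["instruction"]) + record["output"]

def split_prompt_completion(text):
    """Split at the response marker; the marker stays with the prompt part."""
    prompt, marker, completion = text.partition(RESPONSE_MARKER)
    return prompt + marker, completion

# Hypothetical record: real instruction, shortened output for readability.
record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet.",
}

text = build_text(record)
prompt, completion = split_prompt_completion(text)
print(repr(prompt))
print(repr(completion))
```

This only illustrates the split at the string level; a collator such as `DataCollatorForCompletionOnlyLM` works on token IDs, so its exact behavior depends on the tokenizer.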

In [11]:
num_data_samples = 3000

In [13]:
dataset = dataset.select(range(num_data_samples))
dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 3000
})

### Configuration 1

- `accelerate config`: use the FSDP backend
  - fsdp_transformer_layer_cls_to_wrap: `MistralDecoderLayer`
  - `accelerate launch --num_processes 1`
      - ``UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.``
  - YAML
      - Arguments passed to `accelerate launch` override the corresponding settings in the YAML file.
      - `mixed_precision`: `[no, fp8, fp16, bf16]`
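
The precedence rule noted above (command-line arguments win over the YAML file) can be pictured with a toy merge. This is only an illustration of the rule, not Accelerate's actual implementation; `resolve_launch_config` is a hypothetical helper, and the values are taken from the configs and launch command in this notebook.

```python
def resolve_launch_config(yaml_config, cli_overrides):
    """Return the effective settings: YAML values, replaced by any CLI value
    that was actually provided (None means "not passed on the command line")."""
    resolved = dict(yaml_config)
    for key, value in cli_overrides.items():
        if value is not None:
            resolved[key] = value
    return resolved

# Subset of a YAML config file.
yaml_config = {
    "distributed_type": "FSDP",
    "mixed_precision": "bf16",
    "num_processes": 1,
}

# What `accelerate launch --num_processes 2 --main_process_port 29502` would supply.
cli_overrides = {
    "num_processes": 2,
    "main_process_port": 29502,
    "mixed_precision": None,  # not passed, so the YAML value is kept
}

effective = resolve_launch_config(yaml_config, cli_overrides)
print(effective)
```

Under this rule the run would use two processes even though the YAML file says one, while `mixed_precision` stays at the YAML value.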

```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

```
accelerate launch \
    --num_processes 2 \
    --main_process_port 29502 \
    learning_rate_repro.py  \
      --num_train_epochs 5 \
      --output_dir './results' \
      --per_device_train_batch_size 1 \
      --lr_scheduler_type "linear" \
      --learning_rate 1e-6 \
      --logging_steps 1 \
      --save_strategy 'no' \
      --bf16
```

In [21]:
# num_train_epochs == 5
# per_device_train_batch_size == 6
# total number of steps
5 * 3000 / 6

2500.0

In [20]:
# number of steps per epoch
3000 / 6

500.0
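
The two calculations above can be generalized into a small helper that also accounts for the number of processes. This is a rough sketch under stated assumptions: data is split evenly across processes, gradient accumulation defaults to 1, and the last partial batch is kept (hence the `ceil`). `optimizer_steps` is a hypothetical name, not part of `transformers` or `accelerate`.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, num_epochs,
                    num_processes=1, gradient_accumulation_steps=1):
    """Return (steps_per_epoch, total_steps) for a simple data-parallel run."""
    # Samples consumed per optimizer step across all processes.
    global_batch = (per_device_batch_size * num_processes
                    * gradient_accumulation_steps)
    # Assumption: the last partial batch is kept rather than dropped.
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

# The notebook's numbers: 3000 samples, batch size 6, 5 epochs, one process.
print(optimizer_steps(3000, 6, 5))                   # (500, 2500)

# The FSDP launch command: batch size 1 per device on 2 processes.
print(optimizer_steps(3000, 1, 5, num_processes=2))  # (1500, 7500)
```

The step count matters for the linear scheduler, since the learning rate decays over the total number of steps.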

### Configuration 2

- `accelerate config`: use the DeepSpeed backend
  - fsdp_transformer_layer_cls_to_wrap: `MistralDecoderLayer`
  
    ```
    accelerate launch \
        --num_processes 2 \
        --main_process_port 29502 \
        learning_rate_repro.py  \
          --num_train_epochs 5 \
          --output_dir './results' \
          --per_device_train_batch_size 2 \
          --lr_scheduler_type "linear" \
          --learning_rate 1e-6 \
          --logging_steps 1 \
          --save_strategy 'no' \
          --bf16
    ```