In [4]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'
os.environ['NCCL_P2P_DISABLE'] = '1'
os.environ['NCCL_IB_DISABLE'] = '1'

In [1]:
from datasets import load_dataset
import torch
from trl import SFTConfig, SFTTrainer
from transformers import TrainingArguments

[2024-07-30 19:56:19,465] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




### basics & overall

- model_kwargs => model: used to instantiate the model
- TrainingArguments
    - @dataclass
- SFTConfig: `class SFTConfig(TrainingArguments)`
    - @dataclass
    - Inherits from TrainingArguments and adds some extra parameters

```
trainer = SFTTrainer(
    ...
    args=sft_config,
    ...
)
```
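
The inheritance relationship can be illustrated with plain dataclasses. This is only a minimal sketch: `BaseArgs` and `SftArgs` are hypothetical stand-ins, not the real `TrainingArguments` and `SFTConfig`, and the fields shown are a small illustrative subset.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseArgs:  # hypothetical stand-in for TrainingArguments
    output_dir: str = "./results"
    num_train_epochs: float = 3.0


@dataclass
class SftArgs(BaseArgs):  # hypothetical stand-in for SFTConfig
    # Extra SFT-specific fields added on top of the inherited ones.
    max_seq_length: Optional[int] = None
    packing: bool = False


args = SftArgs(output_dir="/tmp", max_seq_length=512)
field_names = [f.name for f in fields(args)]
print(field_names)                  # inherited fields come first
print(isinstance(args, BaseArgs))   # the subclass is accepted wherever the base is
```

Because the subclass is still an instance of the base class, a trainer that expects the base arguments can accept the extended config unchanged.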

- Watch out for compatibility between some parameters
    - dataset_num_proc (SFTConfig)
    - The number of workers to use to tokenize the data. Only used when `packing=False`. Defaults to None.
- `use_cache=False`  # set to False as we're going to use gradient checkpointing
    - Passed via model_kwargs
    - `use_cache` controls the key/value cache that speeds up autoregressive decoding at inference time; it brings no benefit during training. Gradient checkpointing discards intermediate activations and recomputes them during the backward pass, so the two are at odds and the cache should be off while training [1].
    - Gradient checkpointing re-runs parts of the forward pass during the backward pass to recompute activations, trading extra compute for lower memory use.

```
model.config.use_cache = False  # During training with checkpointing
# ... training loop ...
model.config.use_cache = True   # Re-enable for inference
```
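
One way to make the toggle harder to forget is a small context manager. The sketch below is an assumption-laden illustration: `cache_disabled` is a hypothetical helper, and `fake_model` is a stand-in object that only mimics the `model.config.use_cache` attribute path.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def cache_disabled(model):
    """Temporarily set model.config.use_cache = False, then restore it."""
    previous = model.config.use_cache
    model.config.use_cache = False
    try:
        yield model
    finally:
        # Restore the original setting even if training raises.
        model.config.use_cache = previous


# Stand-in object (hypothetical); it only mirrors the attribute path used above.
fake_model = SimpleNamespace(config=SimpleNamespace(use_cache=True))

with cache_disabled(fake_model):
    inside = fake_model.config.use_cache   # value seen during "training"
after = fake_model.config.use_cache        # value restored for inference

print(inside, after)
```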

- You can use `DataCollatorForCompletionOnlyLM` to train your model on the completions (the generated responses) only, so that prompt tokens do not contribute to the loss.
    - Note that this works only when `packing=False`.
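
The idea behind completion-only training can be sketched in a few lines. This is not TRL's implementation: `mask_prompt_labels` is a hypothetical helper and the token ids are made up. It relies only on the fact that PyTorch's cross-entropy loss ignores labels equal to `-100` by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids to labels, masking everything up to and including the
    response template so that only the completion contributes to the loss."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Design choice in this sketch: if the template is missing, ignore the example.
    return [IGNORE_INDEX] * len(input_ids)


# Toy ids: prompt = [5, 6, 7], response template = [90, 91], completion = [8, 9, 2]
ids = [5, 6, 7, 90, 91, 8, 9, 2]
labels = mask_prompt_labels(ids, [90, 91])
print(labels)
```

Masking of this kind depends on knowing where each example's response starts, which is consistent with the restriction to `packing=False` noted above.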

### wandb

```
os.environ['WANDB_DISABLED'] = 'true'
```

- Turns off wandb logging


### model kwargs

```
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16,
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model_kwargs = dict(
    attn_implementation="flash_attention_2", 
    torch_dtype="auto",
    use_cache=False, # set to False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)
```

### TrainingArguments

In [6]:
# experimental settings
training_arguments = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=2,
        optim="adamw_8bit",  # alternative: paged_adamw_8bit; plain Adam is very memory-hungry
        logging_steps=50,
        learning_rate=1e-4,  # 1e-6 ~ 1e-3: 1e-3, 5e-4, 1e-4
        eval_strategy="steps",
        do_eval=True,
        eval_steps=50,
        save_steps=100,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        num_train_epochs=3,
        weight_decay=0.0,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        gradient_checkpointing=True,
)

- `eval_strategy`: one of `no`, `steps`, `epoch`
- `lr_scheduler_type`: a predefined learning-rate schedule, intended to help avoid getting stuck in local minima
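
To make the `linear` schedule with warmup concrete, the sketch below computes the learning-rate multiplier at a given step. It is a simplified illustration, not the library's code: `linear_warmup_decay` is a hypothetical helper, and the total of 1000 steps is an assumed figure chosen so that `warmup_ratio=0.1` corresponds to 100 warmup steps.

```python
def linear_warmup_decay(step, total_steps, warmup_steps):
    """Multiplier applied to the base learning rate at a given step:
    ramps linearly from 0 to 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-4       # same value as learning_rate above
total_steps = 1000   # assumed for illustration
warmup_steps = 100   # warmup_ratio=0.1 of the assumed total

for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, total_steps, warmup_steps)
    print(step, factor, base_lr * factor)
```

The peak learning rate is reached at the end of warmup and is only held for a single step before the linear decay begins.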

### SFTConfig

- `class SFTConfig(TrainingArguments):`
    - Inherits from the TrainingArguments class, including:
        - `num_train_epochs`: default 3
        - `per_device_train_batch_size`: default 8
        - `per_device_eval_batch_size`: default 8
        - `gradient_accumulation_steps`: default 1
        - `dataloader_drop_last`: default `False`
        - `report_to`:
            - none
            - tensorboard
            - wandb
- `dataset_text_field`: must match the name of a column (field) in the dataset
- `max_seq_length`
- `output_dir='/tmp'`
- **packing=True**
    - Example packing: multiple short examples are packed into the same input sequence to increase training efficiency.
    - This lets several shorter sequences share a single training example, making fuller use of the model's context window.

#### packing vs. non-packing

- packing => ConstantLengthDataset
    - max_seq_length
    ```
    constant_length_iterator = ConstantLengthDataset(
        tokenizer,
        dataset,
        dataset_text_field=dataset_text_field,
        formatting_func=formatting_func,
        seq_length=max_seq_length,
        infinite=False,
        num_of_sequences=num_of_sequences,
        chars_per_token=chars_per_token,
        eos_token_id=tokenizer.eos_token_id,
        append_concat_token=append_concat_token,
        add_special_tokens=add_special_tokens,
    )
    ```
- non-packing
    ```
    tokenized_dataset = dataset.map(
        tokenize,
        batched=True,
        remove_columns=dataset.column_names if remove_unused_columns else None,
        num_proc=self.dataset_num_proc,
        batch_size=self.dataset_batch_size,
    )
    ```
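
The core idea of packing can be sketched without any library code. This is a deliberately simplified illustration and not how `ConstantLengthDataset` is implemented: `pack_sequences` is a hypothetical helper, the token ids are made up, and dropping the trailing remainder is a choice made for this sketch.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (each followed by EOS) into one token stream and
    cut it into chunks of exactly seq_length tokens. A trailing remainder
    shorter than seq_length is dropped in this sketch."""
    buffer = []
    for ids in tokenized_examples:
        buffer.extend(ids + [eos_token_id])
    return [
        buffer[i:i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # toy token ids
packed = pack_sequences(examples, seq_length=4, eos_token_id=0)
print(packed)
```

Note that a packed chunk can start or end in the middle of an example, which is why every chunk has the same length and no padding is needed.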

In [3]:
imdb = load_dataset('imdb', split='train')
imdb

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

```
sft_config = SFTConfig(
    dataset_text_field="text",
    max_seq_length=512,
    output_dir="/tmp",
)
```

### training steps

```
# the train split contains 20022 examples
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

args = SFTConfig(output_dir="/tmp", 
                 max_seq_length=512, 
                 num_train_epochs=2, 
                 per_device_train_batch_size=4, 
                 gradient_accumulation_steps=4,
                 gradient_checkpointing=True,
                 )

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=args,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
```

- total optimizer steps ≈ `num_train_epochs * len(dataset) / (per_device_train_batch_size * gradient_accumulation_steps * world_size)`
    - with `world_size = 2`: `20022 * 2 / (4*4*2)`

In [5]:
20022 * 2 / (4*4*2)

1251.375
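
The same arithmetic can be wrapped in a small helper. This is only a back-of-the-envelope estimate: `estimate_optimizer_steps` is a hypothetical function, and the step count actually reported by the trainer may differ slightly because of rounding at the dataloader and gradient-accumulation level.

```python
def estimate_optimizer_steps(num_examples, num_epochs, per_device_batch_size,
                             grad_accum_steps, world_size):
    """Approximate number of optimizer updates over the whole run."""
    # Examples consumed per optimizer update, across all devices.
    global_batch = per_device_batch_size * grad_accum_steps * world_size
    return num_epochs * num_examples / global_batch


steps = estimate_optimizer_steps(
    num_examples=20022,
    num_epochs=2,
    per_device_batch_size=4,
    grad_accum_steps=4,
    world_size=2,
)
print(steps)
```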