Unstable inference of DeepSpeed-fine-tuned Stanford Alpaca #3107

@XinliYu

Description

We fine-tuned Alpaca on a single node with torchrun, and on multiple nodes with DeepSpeed, following their recommended setup for inference.

The typical erroneous behavior we observed for the DeepSpeed-fine-tuned model is that it repeats the prompt and then just stops.

The following is the DeepSpeed config. We simply add a --deepspeed argument to the torchrun command line referencing the configuration below, and remove the conflicting FSDP configuration from https://github.com/tatsu-lab/stanford_alpaca.

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5,
      "weight_decay": 0.0
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}
python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=xxx --master_port=9901 train.py \
    --data_path ./alpaca_data.json \
    --output_dir ./train_ouput_02 \
    --num_train_epochs 7 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 50 \
    --tf32 True \
    --deepspeed ds_config.json
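As a side note on how the two pieces fit together: the "auto" entries in the config are resolved by the Hugging Face Trainer from the command-line arguments above (so train_micro_batch_size_per_gpu becomes 1 and gradient_accumulation_steps becomes 16). A minimal sketch of that substitution mechanism, with the config inlined for self-containment (this illustrates the mapping, not the Trainer's internal code):

```python
import json

# The DeepSpeed config from above, inlined (abridged) for a self-contained sketch.
ds_config = json.loads('''
{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "optimizer": {"type": "AdamW", "params": {"lr": 2e-5, "weight_decay": 0.0}},
  "bf16": {"enabled": true},
  "zero_optimization": {"stage": 1}
}
''')

# Values the Trainer substitutes for "auto", taken from the launch command
# (--per_device_train_batch_size 1, --gradient_accumulation_steps 16).
trainer_args = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}

resolved = {k: trainer_args.get(k, v) if v == "auto" else v
            for k, v in ds_config.items()}
print(resolved["train_micro_batch_size_per_gpu"],
      resolved["gradient_accumulation_steps"])  # → 1 16
```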

We evaluated both fine-tuned models with the above inference setup multiple times on the same prompts. The torchrun-fine-tuned model (single node) is relatively stable. However, the multi-node DeepSpeed-fine-tuned model is much less stable: its output can differ on every inference run.
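Since the outputs differ from run to run on the same prompt, one thing worth ruling out first is decoding randomness: greedy decoding (do_sample=False in transformers' generate) is deterministic for fixed weights, while sampling only repeats with an explicitly seeded RNG. A toy sketch of that distinction, using stand-in logits rather than the real model:

```python
import math
import random

# Toy next-token distribution standing in for the model's logits
# (hypothetical values; the real setup decodes from the fine-tuned Alpaca).
logits = [2.0, 1.0, 0.5, 0.1]
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Greedy decoding is argmax over the distribution, so it is
# deterministic for fixed weights:
greedy = max(range(len(probs)), key=probs.__getitem__)
assert greedy == max(range(len(probs)), key=probs.__getitem__)

# Sampling is reproducible only when the RNG is explicitly seeded:
def sample(seed):
    return random.Random(seed).choices(range(len(probs)), weights=probs, k=1)[0]

assert sample(42) == sample(42)  # same seed -> same token
print(greedy)  # → 0 (the highest-logit token)
```

If greedy decoding with a fixed seed still produces different outputs across runs of the DeepSpeed-fine-tuned model, the variance is more likely in the checkpoint or the serving stack than in the decoding settings.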


For example, the prompt is "Explain how algorithms can be used in educational institutions."
Three responses from the torchrun-fine-tuned model:

Algorithms are mathematical processes that can be used to solve problems and make decisions. In educational institutions, algorithms can be used in a variety of ways. For example, algorithms can be used to grade student work, to personalize learning experiences, to generate recommendation systems, and to detect cheating. Algorithms can also be used to analyze large amounts of data to identify patterns and trends in student performance.

Algorithms can be used in educational institutions to automate certain processes, such as grading tests and homework, providing personalized learning recommendations, and helping students find resources related to their coursework. Algorithms can also be used to track student progress, identify areas of difficulty, and provide feedback to students and teachers.

Algorithms can be used in educational institutions to help with the tracking and management of student records, providing automated feedback and assessment, personalizing learning experiences, and automating administrative tasks.

Three responses from the DeepSpeed-fine-tuned model. In the first and third responses, the output just repeats the prompt:

Explain how algorithms can be used in educational institutions.

Algorithms can be used in educational institutions to streamline processes and make them more efficient. For example, algorithms can be used to grade tests and assignments quickly and accur, accurately. Algorithms can also be used to match students with appropriate tutors and to match students with suitable learning materials.

Explain how algorithms can be used in educational institutions.

Looking forward to any helpful discussion on how to improve the DeepSpeed-fine-tuned model.
