
Error: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map #24

Open
jasel-lewis opened this issue Dec 19, 2023 · 5 comments
Labels: bug (Something isn't working), config&setup, instances, jobs, jumpstart, models, reliability&stability, studio

Comments

@jasel-lewis

Product Version

  • Amazon SageMaker Studio Classic
  • Amazon SageMaker Studio
  • Issue is not related to SageMaker Studio

Issue Description

I was using SageMaker Studio to domain-train a model (base model: huggingface-llm-mistral-7b) on an ml.g5.24xlarge instance. I left all values at their defaults other than pointing the job at specific buckets for the training data and the trained-model output, and adjusted the hyperparameters to:

  • Peft type: lora
  • Instruction-Train The Model: False
  • Epochs: 3

At just over an hour (3,909 seconds) into the training run, I received the error:

```
AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:root:Subprocess script failed with return code: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling
    subprocess.run(command, shell=shell, check=True)
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '3', '--gradient_accumulation_steps', '8', '--per_device_train_batch_siz
```

I came across this specific post, but I don't believe these are values I can adjust via SageMaker Studio.

Any thoughts on this?
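For anyone trying to reproduce or understand the failure mode: transformers refuses to load a model with `low_cpu_mem_usage=True` or a `device_map` when DeepSpeed ZeRO Stage 3 is active, because ZeRO-3 partitions parameters across GPUs itself at load time and those options conflict with that. The sketch below is a simplified illustration of that guard, not the actual transformers source; the function name and signature are hypothetical.

```python
def check_zero3_compatibility(is_deepspeed_zero3_enabled: bool,
                              low_cpu_mem_usage: bool = False,
                              device_map=None) -> None:
    """Illustrative sketch of the check that raises the error in this issue.

    ZeRO-3 shards model parameters during loading, so transformers rejects
    the meta-device loading path (`low_cpu_mem_usage=True`) and explicit
    device placement (`device_map`) when ZeRO-3 is enabled.
    """
    if is_deepspeed_zero3_enabled and (low_cpu_mem_usage or device_map is not None):
        raise ValueError(
            "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` "
            "or with passing a `device_map`."
        )

# Fine-tuning scripts typically hit this when ds_config.json sets ZeRO
# stage 3 while the model-loading call also passes one of these arguments.
```

The usual workarounds are to drop `device_map`/`low_cpu_mem_usage` from the model-loading call or to use a lower ZeRO stage; the crux of this issue is that the JumpStart fine-tuning script does not appear to expose either knob as a hyperparameter.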

Expected Behavior

Expected the model to be domain-trained successfully.

Observed Behavior

Observed the error identified in the Issue Description section.

Product Category

JumpStart

Feedback Category

Reliability and Stability

Other Details

No response

@jasel-lewis added the bug label on Dec 19, 2023
@github-actions bot added the config&setup, instances, jobs, jumpstart, latency, models, reliability&stability, and studio labels on Dec 19, 2023
@poojak13
Contributor

Hi @jasel-lewis, thanks for raising this. I will pull in someone who can answer this.

@poojak13 poojak13 removed the latency label Dec 22, 2023
@jasel-lewis
Author

jasel-lewis commented Dec 22, 2023

@poojak13 Wonderful! Any help is greatly appreciated, thank you...

FYSA @shieldsjared

@jasel-lewis
Author

jasel-lewis commented Dec 28, 2023

Update to reference a similar re:Post thread.

@jasel-lewis
Author

Update: Converted to AWS support ticket for faster resolution.

@ashwaniyadav09

Facing the same issue.
