
Error: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map #24

Open
jasel-lewis opened this issue Dec 19, 2023 · 5 comments
Labels: bug (Something isn't working), config&setup, instances, jobs, jumpstart, models, reliability&stability, studio

Comments

@jasel-lewis

Product Version

  • Amazon SageMaker Studio Classic
  • Amazon SageMaker Studio
  • Issue is not related to SageMaker Studio

Issue Description

I was using SageMaker Studio to domain-train a model (base model: huggingface-llm-mistral-7b) on an ml.g5.24xlarge instance. I left all values at their defaults other than pointing the job at specific buckets for the training data and the trained-model output, and adjusted the hyperparameters to:

  • Peft type: lora
  • Instruction-Train The Model: False
  • Epochs: 3

At just over an hour (3,909 seconds) into the training run, I received the error:

```
AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:root:Subprocess script failed with return code: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling
    subprocess.run(command, shell=shell, check=True)
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '3', '--gradient_accumulation_steps', '8', '--per_device_train_batch_siz
```

I came across this specific post, but I don't believe these are values I can adjust via SageMaker Studio.

Any thoughts on this?
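For anyone trying to reproduce or understand the failure mode: transformers refuses to load a model with `low_cpu_mem_usage=True` or a `device_map` when DeepSpeed ZeRO Stage 3 is active, because ZeRO-3 partitions parameters across GPUs itself at load time and those options conflict with that. The sketch below is a simplified illustration of that guard, not the actual transformers source; the function name and signature are hypothetical.

```python
def check_zero3_compatibility(is_deepspeed_zero3_enabled: bool,
                              low_cpu_mem_usage: bool = False,
                              device_map=None) -> None:
    """Illustrative sketch of the check that raises the error in this issue.

    ZeRO-3 shards model parameters during loading, so transformers rejects
    the meta-device loading path (`low_cpu_mem_usage=True`) and explicit
    device placement (`device_map`) when ZeRO-3 is enabled.
    """
    if is_deepspeed_zero3_enabled and (low_cpu_mem_usage or device_map is not None):
        raise ValueError(
            "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` "
            "or with passing a `device_map`."
        )

# Fine-tuning scripts typically hit this when ds_config.json sets ZeRO
# stage 3 while the model-loading call also passes one of these arguments.
```

The usual workarounds are to drop `device_map`/`low_cpu_mem_usage` from the model-loading call or to use a lower ZeRO stage; the crux of this issue is that the JumpStart fine-tuning script does not appear to expose either knob as a hyperparameter.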

Expected Behavior

Expected the model to be domain-trained successfully.

Observed Behavior

Observed the error identified in the Issue Description section.

Product Category

JumpStart

Feedback Category

Reliability and Stability

Other Details

No response

@jasel-lewis added the bug label on Dec 19, 2023
@github-actions bot added the config&setup, instances, jobs, jumpstart, latency, models, reliability&stability, and studio labels on Dec 19, 2023
@poojak13
Contributor

Hi @jasel-lewis, thanks for raising this. I will pull in someone who can answer this.

@poojak13 poojak13 removed the latency label Dec 22, 2023
@jasel-lewis
Author

jasel-lewis commented Dec 22, 2023

@poojak13 Wonderful! Any help is greatly appreciated, thank you...

FYSA @shieldsjared

@jasel-lewis
Author

jasel-lewis commented Dec 28, 2023

Update to reference a similar re:Post thread.

@jasel-lewis
Author

Update: Converted to AWS support ticket for faster resolution.

@ashwaniyadav09

Facing the same issue.
