Error: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map
#24
Labels
bug, config&setup, instances, jobs, jumpstart, models, reliability&stability, studio
Product Version
Issue Description
I was using SageMaker Studio to domain-train a model (base model: huggingface-llm-mistral-7b) on an ml.g5.24xlarge instance. I left all values at their defaults, other than pointing the job to specific buckets for the training data and the trained-model output, and adjusting the hyperparameters with:

At just over an hour (3,909 seconds) into the training run, I received the error:
DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map
I came across this specific post, but don't believe these to be values I can adjust via SageMaker Studio.
Any thoughts on this?
Expected Behavior
Expected the model to be domain-trained successfully.
Observed Behavior
Observed the error identified in the Issue Description section.
Product Category
JumpStart
Feedback Category
Reliability and Stability
Other Details
No response