Getting started "shard" model not working #70
Thank you for your interest in our project. The Apex compilation warnings are expected; I have seen these since the beginning. I can replicate your problem when following the docs as written (also using a single node with 8x A100 80GB). When I invoke docker with additional arguments, however, it runs as expected. Please try something like this. |
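The exact docker arguments referred to above were not captured in this excerpt. A plausible sketch, assuming the culprit is the container's default shared-memory limit (a common cause of "Bus error (core dumped)" when sharding large checkpoints inside docker); the image name and mount path are placeholders, and only the --shm-size / --ipc=host part is the suggested change:

# Enlarge /dev/shm inside the container (placeholder image and paths):
docker run --gpus all --shm-size=128g -v /path/to/checkpoints:/checkpoints -it <megatron-llm-image> bash
# Or, alternatively, share the host IPC namespace instead of enlarging /dev/shm:
docker run --gpus all --ipc=host -v /path/to/checkpoints:/checkpoints -it <megatron-llm-image> bash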
@AleHD please can you add this, or at least a mention that nontrivial memory is needed to shard the weights, to the "Getting Started" section? Thanks! |
Thank you @kylematoba, that solved it for me. I managed to shard the model but ran into a different issue during training.
Traceback (most recent call last):
File "/epfllm/./Megatron-LLM/finetune.py", line 249, in <module>
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/epfllm/Megatron-LLM/megatron/training.py", line 138, in pretrain
iteration = _train(args,
File "/epfllm/Megatron-LLM/megatron/training.py", line 678, in _train
train_step(forward_step_func,
File "/epfllm/Megatron-LLM/megatron/training.py", line 411, in train_step
losses_reduced = forward_backward_func(
File "/epfllm/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/epfllm/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/epfllm/./Megatron-LLM/finetune.py", line 213, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/module.py", line 186, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
lm_output = self.language_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
mlp_output, mlp_bias = self.mlp(layernorm_output)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
bias_gelu_impl(intermediate_parallel, bias_parallel)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
return bias_gelu(bias, input)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in fallback_function
@torch.jit.script
def bias_gelu(bias, y):
x = bias + y
~~~~~~~~ <--- HERE
return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
RuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self' |
Hi, I'm guessing that it's an OOM that's obfuscated by the JIT-ing. In cases like this I usually recommend commenting out the @torch.jit.script decorator so the underlying error surfaces. As far as I can see, you've not reported what sort of model you are trying to train. Did you look at https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements? Only the smallest models can fit into 8x A100 80GB. |
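If editing megatron/model/fused_bias_gelu.py is inconvenient, PyTorch also honors the PYTORCH_JIT=0 environment variable, which disables @torch.jit.script annotations so the decorated function runs as plain Python and the underlying exception is reported. This is an alternative sketch, not something proposed in the thread:

# Disable TorchScript compilation for this run only, so bias_gelu() executes
# as ordinary Python and raises the real error instead of the JIT wrapper's.
PYTORCH_JIT=0 torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS  # plus whatever finetune.py flags you normally pass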
Let me try commenting out the scripting. I am following the getting started guide, so it's Llama 2 7B, and I have 8x A100 80GB. This is my command:
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 500 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
--tensor_model_parallel_size 4 \
--pipeline_model_parallel_size 1 \
--load ${MODEL_PATH}_sharded \
--save ${MODEL_PATH}_sharded \
--tensorboard_dir ${MODEL_PATH}_sharded \
--data_path ${DATASET_PATH}/megatron_text_document \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--vocab_file=${MODEL_PATH}/tokenizer.model \
--bf16 \
--use_flash_attn \
--micro_batch_size 5 \
--global_batch_size 1000 \
--sequence_parallel \
--recompute_granularity selective \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS |
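Note that this launch relies on $MEGATRON_PATH, $MODEL_PATH, $DATASET_PATH, $COMMON_ARGS and $LLAMA_ARGS already being set (the getting-started guide defines them); if any of them are empty, they silently expand to nothing and the llama-specific settings are dropped. A generic pre-flight check, not part of the original command:

# Warn about any required variable that is unset or empty before launching.
for v in MEGATRON_PATH MODEL_PATH DATASET_PATH COMMON_ARGS LOG_ARGS TRAIN_ARGS LLAMA_ARGS; do
    [ -n "${!v}" ] || echo "WARNING: $v is empty"
done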
The error is not really more helpful...
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
mlp_output, mlp_bias = self.mlp(layernorm_output)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
bias_gelu_impl(intermediate_parallel, bias_parallel)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
return bias_gelu(bias, input)
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in bias_gelu
x = bias + y
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'

Should the getting started guide (https://epfllm.github.io/Megatron-LLM/guide/getting_started.html) work e2e? |
Hi, thanks for that. I'm pretty sure the problem is something that we overlooked early on: runs without bias terms (as in the llama models) end up in the fused bias-GeLU path, which assumes a bias tensor exists. I'll make sure this bug gets investigated in any case. |
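For anyone hitting the same crash before the fix lands, one hedged workaround (my reading of the traceback, not something confirmed in this thread) is to disable the fused bias-GeLU kernel so the unfused code path, which should tolerate a missing bias, is used instead. The flag below mirrors upstream Megatron-LM's --no-bias-gelu-fusion rendered in this fork's underscore style; check megatron/arguments.py to confirm it exists before relying on it.

# Assumed flag name; verify against megatron/arguments.py in this repository.
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    --no_bias_gelu_fusion \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS  # plus the other flags from the command above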
Adding |
Thanks @philschmid. I'll close this and we'll fix the bug I mention above shortly. |
First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration.
I am very curious and wanted to give it a try, following the Getting Started guide in the documentation. But I ran into an error during the "Model Sharding" step, resulting in a
Bus error (core dumped)
I am running on a single node with 8x A100 80GB and 1TB of memory. I followed the exact same steps as in the guide and used the container.
Below is the full error stack in case it's helpful. It includes quite a lot of weird C errors/warnings in the beginning. I installed the package with
in the container.
Error Stack