exec sh examples/LLaMA/LLaMA_13_standalone.sh error #1
Unanswered
13416157913 asked this question in Q&A
Replies: 1 comment
-
Your compilation failed due to: 'nvcc fatal : Unsupported gpu architecture 'compute_90''. Or, you may slightly modify …
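For reference, 'compute_90' targets Hopper (H100) GPUs and is only recognised by CUDA 11.8 and newer, so this usually means the CUDA toolkit doing the build is older than 11.8. Below is a minimal sketch of the usual ways out, assuming the fused-kernel build follows upstream Megatron-LM, where the -gencode targets are assembled in megatron/fused_kernels/__init__.py (that path is an assumption about this fork):
nvcc --version   # check which CUDA toolkit the fused-kernel build is picking up
# Option 1: upgrade the CUDA toolkit to 11.8+ so nvcc understands compute_90.
# Option 2: drop the Hopper target from the build, e.g. by removing the code
#           that appends '-gencode arch=compute_90,code=sm_90' in
#           megatron/fused_kernels/__init__.py (path assumed from upstream
#           Megatron-LM), then re-run:
sh examples/LLaMA/LLaMA_13_standalone.sh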
-
This is my LLaMA_13_standalone.sh:
DATASET_1=""
DATASET_2=""
DATASET_3=""
#DATASET="0.2 ${DATASET_1} 0.3 ${DATASET_2} 0.5 ${DATASET_3}"
DATASET="/home/llm-deploy/Megatron-LM/corpus_indexed/corpus_indexed_text_sentence"
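# Note (assumption: this fork follows upstream Megatron-LM): --data-path takes
# the common prefix of the .bin/.idx pair written by tools/preprocess_data.py,
# i.e. corpus_indexed_text_sentence.bin and .idx must both exist at that path.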
TP_SIZE=2
PP_SIZE=1
WORLD_SIZE=8
MICRO_BATCH_SIZE=2
# The integer is the number of micro-steps of gradient accumulation
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
GLOBAL_BATCH_SIZE=128
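# Note: with WORLD_SIZE=8, TP_SIZE=2, PP_SIZE=1 and MICRO_BATCH_SIZE=2 the
# formula above evaluates to (8*2)/(2*1)*8 = 64; the hard-coded assignment on
# the previous line overrides that to 128.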
JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"
LOAD_CHECKPOINT_PATH="/home/llm-deploy/megatron-llama-2-7b-checkpoint"
SAVE_CHECKPOINT_PATH="/home/llm-deploy/megatron-llama-2-7b-checkpoint"
TOKENIZER_PATH="/home/llm-deploy/megatron-llama-2-7b-checkpoint"
TENSORBOARD_DIR="/home/llm-deploy/Megatron-LLaMA/tensorboard_dir"
TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1
# Setting --tensorboard-queue-size to 1 significantly slows down the training
options="
--finetune
--sequence-parallel
--tensor-model-parallel-size ${TP_SIZE}
--pipeline-model-parallel-size ${PP_SIZE}
--num-layers 32
--hidden-size 4096
--num-attention-heads 32
--seq-length 2048
--max-position-embeddings 4096
--no-position-embedding
--use-rotary-position-embeddings
--swiglu
--ffn-hidden-size 13824
--disable-bias-linear
--RMSNorm
--layernorm-epsilon 1e-6
--causal-lm
--tokenizer-type PretrainedFromHF
--tokenizer-name-or-path $TOKENIZER_PATH
--make-vocab-size-divisible-by 1
--init-method-std 0.01
--micro-batch-size ${MICRO_BATCH_SIZE}
--global-batch-size ${GLOBAL_BATCH_SIZE}
--train-iters ${TRAIN_ITERS}
--lr 6.0e-5
--lr-decay-iters 10
--lr-warmup-iters 5
--min-lr 6.0e-6
--override-opt_param-scheduler
--lr-decay-style cosine
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--weight-decay 0.1
--overlapped-distributed-optimizer
--reduce-bucket-size=2e8
--no-gradient-accumulation-fusion
--dataloader-type cyclic
--data-impl mmap
--data-path ${DATASET}
--split 98,2,0
--eval-interval ${EVAL_INTERVAL}
--eval-iters ${EVAL_ITERS}
--save-interval ${SAVE_INTERVAL}
--save ${SAVE_CHECKPOINT_PATH}
--load ${LOAD_CHECKPOINT_PATH}
--no-load-optim
--log-interval ${LOG_INTERVAL}
--tensorboard-dir ${TENSORBOARD_DIR}
--tensorboard-queue-size 1000
--log-timers-to-tensorboard
--log-batch-size-to-tensorboard
--log-validation-ppl-to-tensorboard
--job-name ${JOB_NAME}
--bf16
--recompute-activations
--recompute-granularity selective
--use-flash-attn"
torchrun --nproc_per_node=8 --master_port=29500 pretrain_llama.py ${options}
ERROR:
setting number of micro-batches to constant 8
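For context, Megatron derives the number of gradient-accumulation micro-batches as global_batch_size / (micro_batch_size * data_parallel_size). A sketch of that arithmetic with the values from the script above (the standard upstream formula is assumed; note that a result of 8 matches a global batch of 64, i.e. the computed value in the script, not the 128 override):
DP_SIZE=$(( 8 / (2 * 1) ))         # WORLD_SIZE / (TP_SIZE * PP_SIZE) = 4
echo $(( 64  / (2 * DP_SIZE) ))    # global batch 64  -> 64/(2*4)  = 8 micro-batches
echo $(( 128 / (2 * DP_SIZE) ))    # global batch 128 -> 128/(2*4) = 16 micro-batches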