Hi,
SimAI_simulator crashes with "Segmentation fault" when I run it with the Mixtral_8*7 MoE workload file generated by AICB, the provided topology Spectrum-X_128g_8gps_100Gbps_A100, and the standard configuration SimAI.conf. The segfault happens right after the all-to-all forward-pass collective is issued for layer_num 7 (the first ALLTOALL row of mlp_moelayer). The last lines of output are:
chunk size is: 8192 , size is: 8192 , layer_num is: 6 , node: 0
info: all-gather forward pass collective issued for layer: mlp_moelayer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: mlp_moelayer is finished************
chunk size is: 8388608 , size is: 8388608 , layer_num is: 7 , node: 0
info: all-to-all forward pass collective issued for layer: mlp_moelayer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Segmentation fault (core dumped)
Any help is appreciated!
Workload file (first lines only):
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 vpp: 32 ga: 8 all_gpus: 512 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2338
grad_norm -1 1 ALLGATHER 3505913856 1 NONE 0 1 REDUCESCATTER 7011827712 100
moe_grad_norm1 -1 1 NONE 0 1 NONE 0 1 ALLGATHER_DP_EP 1879048192 100
moe_grad_norm2 -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER_DP_EP 3758096384 100
embedding_layer -1 1 NONE 33554432 1 NONE 33554432 1 NONE 0 100
attention_column -1 1 ALLGATHER 33554432 1 NONE 0 1 NONE 0 100
attention_row -1 1 REDUCESCATTER 33554432 1 NONE 33554432 1 NONE 0 100
mlp_moelayer -1 1 ALLGATHER 8192 1 ALLGATHER 8192 1 NONE 0 100
mlp_moelayer -1 1 ALLTOALL 8388608 1 ALLTOALL 8388608 1 NONE 0 100
mlp_moelayer -1 1 ALLTOALL_EP 8388608 1 ALLTOALL_EP 8388608 1 NONE 0 100
mlp_moelayer -1 1 ALLGATHER 8388608 1 ALLGATHER 33554432 1 NONE 0 100
mlp_moelayer -1 1 REDUCESCATTER 8388608 1 ALLGATHER 8388608 1 NONE 0 100
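One thing that may be worth checking: the workload header says all_gpus: 512, while the topology name contains "128g". I don't know whether this is related to the crash. Below is a minimal Python sketch for comparing the two. The helper names (parse_workload, topology_gpu_count) are hypothetical and not part of SimAI or AICB. The 12-column row layout and the reading of "128g" as a GPU count are assumptions taken from the excerpt above, not from the SimAI documentation.

```python
import re

# Assumed layout of a layer row (12 whitespace-separated fields):
# name, dependency, fwd compute, fwd comm type, fwd comm size,
# input-grad compute, input-grad comm type, input-grad comm size,
# weight-grad compute, weight-grad comm type, weight-grad comm size, update time


def parse_workload(text):
    """Parse a workload excerpt: header line, layer count, then layer rows."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Header: a policy name followed by "key: value" integer pairs.
    header = {k: int(v) for k, v in re.findall(r"(\w+):\s*(-?\d+)", lines[0])}
    declared_layers = int(lines[1])
    layers = []
    for row in lines[2:]:
        f = row.split()
        if len(f) != 12:
            raise ValueError(f"unexpected field count in row: {row!r}")
        layers.append({
            "name": f[0],
            "fwd": (f[3], int(f[4])),   # forward-pass collective and size
            "ig": (f[6], int(f[7])),    # input-gradient collective and size
            "wg": (f[9], int(f[10])),   # weight-gradient collective and size
        })
    return {"header": header, "declared_layers": declared_layers, "layers": layers}


def topology_gpu_count(topo_name):
    """Guess the GPU count from a name like 'Spectrum-X_128g_8gps_...' (assumption)."""
    m = re.search(r"_(\d+)g_", topo_name)
    return int(m.group(1)) if m else None


SAMPLE = """\
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 vpp: 32 ga: 8 all_gpus: 512 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2338
grad_norm -1 1 ALLGATHER 3505913856 1 NONE 0 1 REDUCESCATTER 7011827712 100
moe_grad_norm1 -1 1 NONE 0 1 NONE 0 1 ALLGATHER_DP_EP 1879048192 100
moe_grad_norm2 -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER_DP_EP 3758096384 100
embedding_layer -1 1 NONE 33554432 1 NONE 33554432 1 NONE 0 100
attention_column -1 1 ALLGATHER 33554432 1 NONE 0 1 NONE 0 100
attention_row -1 1 REDUCESCATTER 33554432 1 NONE 33554432 1 NONE 0 100
mlp_moelayer -1 1 ALLGATHER 8192 1 ALLGATHER 8192 1 NONE 0 100
mlp_moelayer -1 1 ALLTOALL 8388608 1 ALLTOALL 8388608 1 NONE 0 100
"""

if __name__ == "__main__":
    wl = parse_workload(SAMPLE)
    topo_gpus = topology_gpu_count("Spectrum-X_128g_8gps_100Gbps_A100")
    print("workload all_gpus:", wl["header"]["all_gpus"])
    print("topology GPUs (guessed from name):", topo_gpus)
    # Row index 7 is the row the log reports as layer_num 7.
    print("layer 7 forward collective:", wl["layers"][7]["fwd"])
```

If the two GPU counts differ, it might be worth regenerating either the workload or the topology so they agree, and then checking whether the segfault still occurs.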