Segmentation fault when running SimAI_simulator #182

@youknowgjb

Hi,

SimAI_simulator crashes with a segmentation fault. I am using the Mixtral_8*7 MoE workload file generated by AICB, the provided topology Spectrum-X_128g_8gps_100Gbps_A100, and the standard configuration file SimAI.conf. Layer 6 completes normally; the segfault occurs right after the all-to-all forward-pass collective is issued for layer 7:

chunk size is: 8192 , size is: 8192 , layer_num is: 6 , node: 0
info: all-gather forward pass collective issued for layer: mlp_moelayer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: mlp_moelayer is finished************
chunk size is: 8388608 , size is: 8388608 , layer_num is: 7 , node: 0
info: all-to-all forward pass collective issued for layer: mlp_moelayer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Segmentation fault (core dumped)
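
To pin down which collective was in flight when the process died, the log can be scanned for the last "issued" line that has no matching "finished" line. The following is a minimal Python sketch, not part of SimAI; the function name `last_unfinished` and the regular expressions are assumptions derived only from the log lines quoted above.

```python
import re

# Log excerpt copied from the output above.
LOG = """\
chunk size is: 8192 , size is: 8192 , layer_num is: 6 , node: 0
info: all-gather forward pass collective issued for layer: mlp_moelayer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: mlp_moelayer is finished************
chunk size is: 8388608 , size is: 8388608 , layer_num is: 7 , node: 0
info: all-to-all forward pass collective issued for layer: mlp_moelayer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Segmentation fault (core dumped)
"""

# Patterns inferred from the excerpt only; other SimAI log lines may differ.
CHUNK_RE = re.compile(r"size is: (\d+) , layer_num is: (\d+) , node: (\d+)")
ISSUED_RE = re.compile(r"info: ([\w-]+) forward pass collective issued for layer: (\w+),")
FINISHED_RE = re.compile(r"fwd pass comm collective for layer: (\w+) is finished")


def last_unfinished(log):
    """Return the last collective that was issued but never reported finished."""
    current = {}    # most recent "chunk size" line
    pending = None  # collective issued and not yet finished
    for line in log.splitlines():
        m = CHUNK_RE.search(line)
        if m:
            current = {"size": int(m.group(1)),
                       "layer_num": int(m.group(2)),
                       "node": int(m.group(3))}
            continue
        m = ISSUED_RE.search(line)
        if m:
            pending = dict(current, collective=m.group(1), layer=m.group(2))
            continue
        if FINISHED_RE.search(line):
            pending = None
    return pending


if __name__ == "__main__":
    print(last_unfinished(LOG))
```

On this excerpt the sketch reports the all-to-all on `mlp_moelayer` with `layer_num` 7 and size 8388608 as the collective that never finished.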

Any help is appreciated!

Workload file (first lines only):

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 vpp: 32 ga: 8 all_gpus: 512 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2338
grad_norm       -1      1       ALLGATHER       3505913856      1       NONE    0       1       REDUCESCATTER   7011827712      100
moe_grad_norm1  -1      1       NONE    0       1       NONE    0       1       ALLGATHER_DP_EP 1879048192      100
moe_grad_norm2  -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER_DP_EP     3758096384      100
embedding_layer -1      1       NONE    33554432        1       NONE    33554432        1       NONE    0       100
attention_column        -1      1       ALLGATHER       33554432        1       NONE    0       1       NONE    0       100
attention_row   -1      1       REDUCESCATTER   33554432        1       NONE    33554432        1       NONE    0       100
mlp_moelayer    -1      1       ALLGATHER       8192    1       ALLGATHER       8192    1       NONE    0       100
mlp_moelayer    -1      1       ALLTOALL        8388608 1       ALLTOALL        8388608 1       NONE    0       100
mlp_moelayer    -1      1       ALLTOALL_EP     8388608 1       ALLTOALL_EP     8388608 1       NONE    0       100
mlp_moelayer    -1      1       ALLGATHER       8388608 1       ALLGATHER       33554432        1       NONE    0       100
mlp_moelayer    -1      1       REDUCESCATTER   8388608 1       ALLGATHER       8388608 1       NONE    0       100
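
For cross-checking the workload against the log, here is a minimal Python sketch (not part of SimAI or AICB) that parses the header and maps each layer line to its zero-based index. It rests on two assumptions: that the fourth and fifth columns of a layer line are the forward-pass collective type and size, which agrees with the log for layers 6 and 7, and that the "128g" in the topology name means 128 GPUs. The helper names are hypothetical, and a mismatch flagged by the last check is only something to look into, not a confirmed cause of the crash.

```python
# Header and the first eight layer lines, copied from the workload excerpt above.
HEADER = ("HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 4 ep: 1 pp: 1 "
          "vpp: 32 ga: 8 all_gpus: 512 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0")

LAYERS = """\
grad_norm -1 1 ALLGATHER 3505913856 1 NONE 0 1 REDUCESCATTER 7011827712 100
moe_grad_norm1 -1 1 NONE 0 1 NONE 0 1 ALLGATHER_DP_EP 1879048192 100
moe_grad_norm2 -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER_DP_EP 3758096384 100
embedding_layer -1 1 NONE 33554432 1 NONE 33554432 1 NONE 0 100
attention_column -1 1 ALLGATHER 33554432 1 NONE 0 1 NONE 0 100
attention_row -1 1 REDUCESCATTER 33554432 1 NONE 33554432 1 NONE 0 100
mlp_moelayer -1 1 ALLGATHER 8192 1 ALLGATHER 8192 1 NONE 0 100
mlp_moelayer -1 1 ALLTOALL 8388608 1 ALLTOALL 8388608 1 NONE 0 100
"""

# Assumption: "128g" in Spectrum-X_128g_8gps_100Gbps_A100 means 128 GPUs.
TOPOLOGY_GPUS = 128


def parse_header(line):
    """Split the header into the strategy name and its 'key: value' integer pairs."""
    tokens = line.split()
    pairs = tokens[1:]
    params = {pairs[i].rstrip(":"): int(pairs[i + 1])
              for i in range(0, len(pairs), 2)}
    return tokens[0], params


def forward_collectives(text):
    """Return (index, layer name, forward collective, size) for each layer line."""
    rows = []
    for idx, line in enumerate(text.splitlines()):
        fields = line.split()
        # Assumed column layout: fields[3] = forward comm type, fields[4] = size.
        rows.append((idx, fields[0], fields[3], int(fields[4])))
    return rows


if __name__ == "__main__":
    _, params = parse_header(HEADER)
    for row in forward_collectives(LAYERS):
        print(row)
    if params["all_gpus"] != TOPOLOGY_GPUS:
        print(f"note: workload all_gpus={params['all_gpus']} "
              f"but topology is assumed to have {TOPOLOGY_GPUS} GPUs")
```

With zero-based indexing, index 7 is the `mlp_moelayer` line with `ALLTOALL` and size 8388608, which matches the last collective issued in the log before the segfault.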
