
Unexpected hard disk read amount when running NS3 simulation #5

@HeRaNO

Description

Hardware environment

  • Server: Aliyun ECS ecs.e-c1m4.large
    • CPU: Intel(R) Xeon(R) Platinum, 2 cores
    • Memory: 8 GB
  • OS: Ubuntu 22.04.5 (5.15.0-106-generic)

Reproduce

Use root user.

# Prerequisites
$ apt update
$ apt upgrade
$ apt install cmake
$ pip install pandas

# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ git clone https://github.com/aliyun/aicb.git
$ cd ./SimAI/

# Clone submodules
$ git submodule update --init --recursive
# Make sure to use the newest commit
$ git submodule update --remote

# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3

# Generate workload (use the aliyun/aicb main branch)
$ cd ~/aicb
# Note: `python` needs to be changed to `python3` in the script (one way to do this is sketched after this block)
$ sh ./scripts/megatron_workload_with_aiob.sh \
-m 7 --world_size 4096 \
--tensor_model_parallel_size 2 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 8192 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn
$ cp results/workload/gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt ~/SimAI
$ cd ~/SimAI
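
A minimal way to make that `python` -> `python3` substitution, assuming the workload script invokes a bare `python` from PATH (a hypothetical one-liner; adjust to the actual script contents):

$ sed -i 's/\<python\>/python3/g' ./scripts/megatron_workload_with_aiob.sh
# or, alternatively, make `python` resolve to python3 system-wide (running as root):
$ ln -s "$(command -v python3)" /usr/local/bin/python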

# Create network topo
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 400Gbps -nvbw 2400Gbps

# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100

Result

The simulation hung at layer num: 483. The Aliyun monitor shows CPU usage at about 25%, but the hard disk read volume is abnormally high.

[screenshot of the Aliyun monitoring panel]

Maybe the simulation needs a huge amount (>>8 GB) of memory, which causes the server to fall back to swap.
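
If that is the case, swap activity should be visible with standard Linux tools while the simulator runs (generic commands, nothing SimAI-specific; the pgrep line assumes a single simulator process):

$ free -h      # how much swap is in use and how much RAM is left
$ vmstat 5     # si/so: memory swapped in/out per second; sustained si would also explain the unexpected disk reads
$ grep VmRSS /proc/$(pgrep -f SimAI_simulator | head -n1)/status   # resident memory of the simulator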


UPD: On a 4-core, 16 GB server, the simulation stops at layer num: 599 (optimizer1), and it does not print the final collective message.
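
To confirm how much memory a full run actually needs, the peak resident set size can be recorded with GNU time (a sketch; the `time` package may need to be installed first, since the bash builtin `time` has no -v):

$ apt install time
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 /usr/bin/time -v -o time.log ./bin/SimAI_simulator -t 3 \
    -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt \
    -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
# "Maximum resident set size (kbytes)" in time.log is the peak memory of the process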

do.sh:

AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?

Command:

nohup bash do.sh 2>&1 > logs &

Tail of logs:

***** info: fwd pass comm collective for layer: cross_entropy1 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 597 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy2, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy2 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 598 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy3, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy3 is finished************
chunk size is: 4 , size is: 4 , layer_num is: 599 , node: 0
info: all-reduce forward pass collective issued for layer: optimizer1, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0

do.sh:

AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./example/microAllReduce.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?

Command:

nohup bash do.sh 2>&1 > logs &

Tail of logs:

All data received by node 125 is 939524096
sim_finish on sent,  Thread id: 140257080956416
All data sent from node 126 is 939524096
sim_finish on received,  Thread id: 140257080956416
All data received by node 126 is 939524096
sim_finish on sent,  Thread id: 140257080956416
All data sent from node 127 is 939524096
sim_finish on received,  Thread id: 140257080956416
All data received by node 127 is 939524096
0
