Description
Hardware environment
- Server: Aliyun ECS ecs.e-c1m4.large
- CPU: Intel(R) Xeon(R) Platinum 2C
- Memory: 8G
- OS: Ubuntu 22.04.5 (5.15.0-106-generic)
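For anyone comparing environments, the same information can be collected with standard commands, e.g.:
$ lscpu | grep 'Model name'
$ free -h
$ lsb_release -d && uname -r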
Reproduce
Run as the root user.
# Prerequisites
$ apt update
$ apt upgrade
$ apt install cmake
$ pip install pandas
# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ git clone https://github.com/aliyun/aicb.git
$ cd ./SimAI/
# Clone submodules
$ git submodule update --init --recursive
# Make sure the submodules are on the newest commits
$ git submodule update --remote
# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3
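A quick sanity check that the build produced the simulator binary used in the run command below:
$ ls -lh ./bin/SimAI_simulator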
# Generate workload (use aliyun/aicb main branch)
$ cd ~/aicb
# Note: `python` may need to be changed to `python3` inside the script
$ sh ./scripts/megatron_workload_with_aiob.sh \
-m 7 --world_size 4096 \
--tensor_model_parallel_size 2 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 8192 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn
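If the script finishes, the generated workload should appear under results/workload/ (path taken from the cp command below):
$ ls results/workload/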
$ cp results/workload/gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt ~/SimAI
$ cd ~/SimAI
$ # Create network topo
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 400Gbps -nvbw 2400Gbps
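The generator should write the topology file that -n points to later; a quick check, assuming it lands in the current directory (file name taken from the run commands below):
$ ls ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100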
# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
Result
It hung at layer num: 483. The Aliyun monitor shows CPU usage at ~25%, but the hard-disk read activity is abnormal.
Maybe the simulation needs a huge amount (>> 8 GB) of memory, which causes the server to fall back on swap.
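One way to check the swap hypothesis is to sample memory and swap usage while the simulator runs (the monitoring loop below is only a suggestion, not part of SimAI):
# Sample memory and swap every 5 seconds in the background
$ (while true; do free -m | grep -E 'Mem|Swap'; sleep 5; done) >> mem.log &
# Run the simulator as above, then inspect the samples
$ tail -n 20 mem.log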
UPD: On a 4C / 16 GB server, the simulation stops at layer num: 599 (optimizer1) and never prints the final collective message.
do.sh:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?
Command:
nohup bash do.sh 2>&1 > logs &
Tail of logs:
***** info: fwd pass comm collective for layer: cross_entropy1 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 597 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy2, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy2 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 598 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy3, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy3 is finished************
chunk size is: 4 , size is: 4 , layer_num is: 599 , node: 0
info: all-reduce forward pass collective issued for layer: optimizer1, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0
do.sh:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./example/microAllReduce.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?
Command:
nohup bash do.sh 2>&1 > logs &
Tail of logs:
All data received by node 125 is 939524096
sim_finish on sent, Thread id: 140257080956416
All data sent from node 126 is 939524096
sim_finish on received, Thread id: 140257080956416
All data received by node 126 is 939524096
sim_finish on sent, Thread id: 140257080956416
All data sent from node 127 is 939524096
sim_finish on received, Thread id: 140257080956416
All data received by node 127 is 939524096
0