Description
Hardware environment
- Server: Aliyun ECS ecs.e-c1m4.large
- CPU: Intel(R) Xeon(R) Platinum 2C
- Memory: 8G
- OS: Ubuntu 22.04.5 (5.15.0-106-generic)
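For anyone comparing environments, the same information can be collected with standard commands, e.g.:
$ lscpu | grep 'Model name'
$ free -h
$ lsb_release -d && uname -r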
Reproduce
Run as the root user.
# Prerequisites
$ apt update
$ apt upgrade
$ apt install cmake
$ pip install pandas
# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ git clone https://github.com/aliyun/aicb.git
$ cd ./SimAI/
# Clone submodules
$ git submodule update --init --recursive
# Make sure the submodules are on the newest commits
$ git submodule update --remote
# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3
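A quick sanity check that the build produced the simulator binary used in the run command below:
$ ls -lh ./bin/SimAI_simulator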
# Generate workload (use aliyun/aicb main branch)
$ cd ~/aicb
# Note: `python` may need to be changed to `python3` inside the script
$ sh ./scripts/megatron_workload_with_aiob.sh \
-m 7 --world_size 4096 \
--tensor_model_parallel_size 2 --pipeline_model_parallel 1 \
--frame Megatron --global_batch 8192 \
--micro_batch 1 --seq_length 4096 \
--swiglu --use_flash_attn
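If the script finishes, the generated workload should appear under results/workload/ (path taken from the cp command below):
$ ls results/workload/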
$ cp results/workload/gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt ~/SimAI
$ cd ~/SimAI
$ # Create network topo
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 400Gbps -nvbw 2400Gbps
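The generator should write the topology file that -n points to later; a quick check, assuming it lands in the current directory (file name taken from the run commands below):
$ ls ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100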
# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
Result
It hung at layer num: 483. The Aliyun monitor shows CPU usage at ~25%, but the hard-disk read activity is abnormal.
Maybe the simulation needs a huge amount (>> 8 GB) of memory, which causes the server to fall back on swap.
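One way to check the swap hypothesis is to sample memory and swap usage while the simulator runs (the monitoring loop below is only a suggestion, not part of SimAI):
# Sample memory and swap every 5 seconds in the background
$ (while true; do free -m | grep -E 'Mem|Swap'; sleep 5; done) >> mem.log &
# Run the simulator as above, then inspect the samples
$ tail -n 20 mem.log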
UPD: On a 4C / 16 GB server, the simulation stops at layer num: 599 (optimizer1) and never prints the final collective message.
do.sh:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./gpt_7B-world_size4096-tp4-pp1-ep1-gbs8192-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?
Command:
nohup bash do.sh 2>&1 > logs &
Tail of logs:
***** info: fwd pass comm collective for layer: cross_entropy1 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 597 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy2, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy2 is finished************
chunk size is: 16384 , size is: 16384 , layer_num is: 598 , node: 0
info: all-reduce forward pass collective issued for layer: cross_entropy3, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: cross_entropy3 is finished************
chunk size is: 4 , size is: 4 , layer_num is: 599 , node: 0
info: all-reduce forward pass collective issued for layer: optimizer1, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0
do.sh:
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 3 -w ./example/microAllReduce.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
echo $?
Command:
nohup bash do.sh 2>&1 > logs &
Tail of logs:
All data received by node 125 is 939524096
sim_finish on sent, Thread id: 140257080956416
All data sent from node 126 is 939524096
sim_finish on received, Thread id: 140257080956416
All data received by node 126 is 939524096
sim_finish on sent, Thread id: 140257080956416
All data sent from node 127 is 939524096
sim_finish on received, Thread id: 140257080956416
All data received by node 127 is 939524096
0