docs: improve the docs
ymjiang committed Jun 19, 2019
1 parent de4ea95 commit cfc7ec1
Showing 2 changed files with 24 additions and 11 deletions.
25 changes: 19 additions & 6 deletions docs/env.md
@@ -80,22 +80,35 @@ The most important one is the number of GPUs per PCIe switches. You should confi
export BYTEPS_PCIE_SWITCH_SIZE=x
```

You can also configure the tensor partition size. A smaller size improves BytePS pipelining, but may incur other overheads such as NCCL coordination and ZMQ message headers. The default and recommended value is 4096000 (in bytes).

```
export BYTEPS_PARTITION_BYTES=y
```

The rest do not impact performance much, but you can still experiment with them if you have time.

You can increase the number of concurrent NCCL streams used in local merging. However, this may lead to occasional hangs due to the NCCL implementation.

```
export BYTEPS_NCCL_NUM_RINGS=z
```

BytePS uses group NCCL calls to reduce NCCL invocation overhead. You can try increasing the group size:

```
export BYTEPS_NCCL_GROUP_SIZE=w
```

Servers can also be the performance bottleneck, e.g., when there is only one server but multiple workers.
You can try increasing the number of push threads on the servers (the default is 1):

```
export SERVER_PUSH_NTHREADS=v
```

Increasing the number of engine CPU threads may also improve server performance:

```
export MXNET_CPU_WORKER_NTHREADS=p
```
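Taken together, the knobs above can be collected into a single environment setup. The sketch below is illustrative only: the numeric values are placeholders to show the shape of a configuration, not tuned recommendations, and should be adjusted to your machine.

```shell
# Illustrative BytePS tuning environment; values are placeholders, not recommendations.
export BYTEPS_PCIE_SWITCH_SIZE=4        # GPUs per PCIe switch on this machine
export BYTEPS_PARTITION_BYTES=4096000   # tensor partition size (default/recommended)
export BYTEPS_NCCL_NUM_RINGS=2          # concurrent NCCL streams; >1 may occasionally hang
export BYTEPS_NCCL_GROUP_SIZE=4         # NCCL calls grouped per invocation
export SERVER_PUSH_NTHREADS=2           # server-side push threads (default is 1)
export MXNET_CPU_WORKER_NTHREADS=4      # server engine CPU threads
```

Exporting these before launching the worker or server processes is enough; BytePS reads them from the environment at startup.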
10 changes: 5 additions & 5 deletions docs/running.md
@@ -11,29 +11,29 @@ On worker 0, run:
```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
```

On worker 1, run (only DMLC_WORKER_ID is different from above):

```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
```

On the server, run (remove DMLC_WORKER_ID, and set role to server):

```
DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
```

On the scheduler, run (remove DMLC_WORKER_ID, and set role to scheduler):

```
DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
```

The order of the above commands does not matter.
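For a quick single-machine smoke test, the four roles can be started from one shell. The helper below is a hypothetical convenience wrapper, not part of BytePS; it assumes you run it from the repository root, and `YOUR_COMMAND` stands for your training command as in the examples above.

```shell
# Hypothetical helper (not part of BytePS): launch the scheduler, one server,
# and two workers on one machine. Assumes the BytePS repository root as cwd.
launch_all() {
  export DMLC_PS_ROOT_URI=127.0.0.1   # everything on localhost for the test
  export DMLC_PS_ROOT_PORT=9000
  export DMLC_NUM_WORKER=2
  export DMLC_NUM_SERVER=1

  DMLC_ROLE=scheduler python launcher/launcher.py &
  DMLC_ROLE=server    python launcher/launcher.py &
  DMLC_ROLE=worker DMLC_WORKER_ID=0 python launcher/launcher.py YOUR_COMMAND &
  DMLC_ROLE=worker DMLC_WORKER_ID=1 python launcher/launcher.py YOUR_COMMAND &
  wait   # block until all four roles exit
}
```

Calling `launch_all` starts all four roles in the background and waits for them. In a real multi-machine deployment you would instead run each command on its own host, with `DMLC_PS_ROOT_URI` set to the scheduler's IP as shown above.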
