This repository was archived by the owner on Nov 17, 2023. It is now read-only.

low acc when using multi-gpus in examples/image-classification/train_cifar10_resnet.py #2327

@tornadomeet

When using one GPU to train this example (contributed by @Answeror), it reaches about 93% accuracy, so everything works correctly. The log for the first epoch looks like this:

Node[0] Epoch[0] Batch [100]    Speed: 213.04 samples/sec   Train-accuracy=0.193594
Node[0] Epoch[0] Batch [150]    Speed: 212.40 samples/sec   Train-accuracy=0.232187
Node[0] Epoch[0] Batch [200]    Speed: 213.00 samples/sec   Train-accuracy=0.252500
Node[0] Epoch[0] Batch [250]    Speed: 211.22 samples/sec   Train-accuracy=0.276875
Node[0] Epoch[0] Batch [300]    Speed: 212.90 samples/sec   Train-accuracy=0.307188
Node[0] Epoch[0] Batch [350]    Speed: 212.46 samples/sec   Train-accuracy=0.328906

But when using multiple GPUs (>=2), the accuracy never goes up. The log for the first two epochs looks like this:

Node[0] Epoch[0] Batch [100]    Speed: 851.39 samples/sec   Train-accuracy=0.107031
Node[0] Epoch[0] Batch [150]    Speed: 867.40 samples/sec   Train-accuracy=0.101406
Node[0] Epoch[0] Batch [200]    Speed: 863.88 samples/sec   Train-accuracy=0.096875
Node[0] Epoch[0] Batch [250]    Speed: 833.26 samples/sec   Train-accuracy=0.100469
Node[0] Epoch[0] Batch [300]    Speed: 856.32 samples/sec   Train-accuracy=0.098281
Node[0] Epoch[0] Batch [350]    Speed: 684.60 samples/sec   Train-accuracy=0.100000
Node[0] Epoch[0] Resetting Data Iterator
Node[0] Epoch[0] Time cost=64.076
Node[0] Epoch[0] Validation-accuracy=0.100277
Node[0] Epoch[1] Batch [50] Speed: 883.92 samples/sec   Train-accuracy=0.094531
Node[0] Epoch[1] Batch [100]    Speed: 880.32 samples/sec   Train-accuracy=0.104844
Node[0] Epoch[1] Batch [150]    Speed: 687.25 samples/sec   Train-accuracy=0.096094
Node[0] Epoch[1] Batch [200]    Speed: 870.87 samples/sec   Train-accuracy=0.097500
Node[0] Epoch[1] Batch [250]    Speed: 865.23 samples/sec   Train-accuracy=0.094062
Node[0] Epoch[1] Batch [300]    Speed: 899.02 samples/sec   Train-accuracy=0.100312
Node[0] Epoch[1] Batch [350]    Speed: 823.62 samples/sec   Train-accuracy=0.106875

I have checked the kvstore in this example but found nothing wrong. Does anybody know what's wrong? Thanks~
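
For anyone comparing the two runs: CIFAR-10 has 10 classes, so a train accuracy hovering around 0.10 is what random guessing would give. Below is a minimal sketch for checking a log against that chance level. The helper names `parse_log` and `is_stuck_at_chance` and the `tolerance` value are my own illustrative choices, not part of the example script, and the regex only assumes the log line format shown above.

```python
import re

# Matches batch lines in the format shown in the logs above, e.g.
# Node[0] Epoch[0] Batch [100]    Speed: 851.39 samples/sec   Train-accuracy=0.107031
LINE_RE = re.compile(
    r"Epoch\[(\d+)\]\s+Batch\s+\[(\d+)\]\s+Speed:\s+([\d.]+)\s+samples/sec"
    r"\s+Train-accuracy=([\d.]+)"
)


def parse_log(text):
    """Return (epoch, batch, speed, accuracy) tuples for every batch line in the log."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append(
                (int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4)))
            )
    return rows


def is_stuck_at_chance(rows, num_classes=10, tolerance=0.02):
    """True if every logged train accuracy stays within `tolerance` of 1/num_classes.

    The tolerance is an arbitrary illustrative threshold (an assumption).
    """
    chance = 1.0 / num_classes
    return bool(rows) and all(abs(acc - chance) <= tolerance for _, _, _, acc in rows)


if __name__ == "__main__":
    # Two lines copied from the multi-GPU log above.
    sample = (
        "Node[0] Epoch[0] Batch [100]    Speed: 851.39 samples/sec   Train-accuracy=0.107031\n"
        "Node[0] Epoch[0] Batch [150]    Speed: 867.40 samples/sec   Train-accuracy=0.101406\n"
    )
    print(is_stuck_at_chance(parse_log(sample)))
```

Lines without a `Batch [...]` field, such as the `Resetting Data Iterator` and `Validation-accuracy` lines, are skipped by the regex, so the check only looks at train accuracy.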
