cannot recognize num_gpus for more than 1 gpu per instance #222

Closed
zhaoanbei opened this issue Sep 20, 2020 · 4 comments
Labels
type: feature request New feature or request

Comments

@zhaoanbei

zhaoanbei commented Sep 20, 2020

I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returned:
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS
It only recognizes 1 GPU per instance.

Did I do anything wrong?

@icywang86rui
Contributor

@zhaoanbei Could you show me how you started your training job? Which version of the PyTorch container did you use? Did you use the SageMaker Python SDK? If so, which version? Please paste the code here.

@zhaoanbei
Author

zhaoanbei commented Sep 22, 2020

Sure!

import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hvdTorch'
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        "backend": "gloo",
                    })
estimator.fit()
I changed simple.py a little:

# imports and hvd.init() as in the original simple.py, included so this
# excerpt stands alone
import argparse
import json
import os

import horovod.torch as hvd
import torch.distributed as dist

hvd.init()

parser = argparse.ArgumentParser()
parser.add_argument(
    "--backend",
    type=str,
    default=None,
    help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)

# Container environment
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

args = parser.parse_args()

use_cuda = args.num_gpus > 0
# note: print() receives the format string and the value as separate
# arguments, so the literal '%d' appears in the output below
print("Number of gpus available - %d", args.num_gpus)

# device = torch.device("cuda" if use_cuda else "cpu")

world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)

print(
    "Initialized the distributed environment: '%s' backend on %d nodes. "
    "Current host rank is %d. Number of gpus: %d",
    args.backend, dist.get_world_size(),
    dist.get_rank(), args.num_gpus
)

ARTIFACT_DIRECTORY = '/opt/ml/model/'
FILENAME = 'local-rank-%s-rank-%s.json' % (hvd.local_rank(), hvd.rank())

with open(os.path.join(ARTIFACT_DIRECTORY, FILENAME), 'w+') as file:
    info = {'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()}
    json.dump(info, file)
    print(info)

And it returned:

Number of gpus available - %d 4
Number of gpus available - %d 4
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 0 4
{'local-rank': 0, 'rank': 0, 'size': 1}
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 1 4
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 08:14:24,468 sagemaker-containers INFO Reporting training SUCCESS
2020-09-20 08:14:24,450 sagemaker-containers INFO Reporting training SUCCESS

As you can see, os.environ["SM_NUM_GPUS"] returns 4 on both instances, but hvd.size() is 1.
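(A likely explanation, sketched below assuming Horovod's documented behavior: hvd.size() counts the processes started by the launcher, e.g. mpirun, not the GPUs visible to the container, so a job launched as one process per host reports size 1 regardless of SM_NUM_GPUS.)

import horovod.torch as hvd

# Minimal sketch: run without an MPI launcher, this is a single Horovod
# process, so size() is 1 even on a 4-GPU instance; launched via
# `mpirun -np 8`, the same code would report size 8.
hvd.init()
print({'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()})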

@icywang86rui
Contributor

@zhaoanbei Sorry for the delay. We need to enable Horovod support on the Python SDK side as well. The distribution arg needs to be added to the PyTorch estimator like the TensorFlow one here - https://github.com/aws/sagemaker-python-sdk/blob/64f600d677872fe8656cdf25d68fc4950b2cd28f/doc/frameworks/tensorflow/using_tf.rst#training-with-horovod
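For reference, a sketch of what that could look like on the PyTorch estimator, modeled on the TensorFlow estimator's Horovod configuration from the linked doc (the distributions argument and its values here are illustrative; the PyTorch estimator did not accept it at the time of this comment):

from sagemaker.pytorch import PyTorch

# Hypothetical sketch modeled on the TensorFlow estimator's Horovod support;
# not supported by the PyTorch estimator when this comment was written.
estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    distributions={
                        'mpi': {
                            'enabled': True,
                            'processes_per_host': 4,  # one process per GPU on ml.p3.8xlarge
                        }
                    })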

@ajaykarpur added the type: feature request label Oct 8, 2020
@icywang86rui
Contributor

The most recent PyTorch 1.6 CPU and GPU (CUDA 11) images have the fixes to enable Horovod. The Python SDK change has been completed as well - aws/sagemaker-python-sdk#441

This should work now. Resolving. Feel free to reopen if the problem comes back after upgrading the Python SDK to the current version (2.23.2) and the PyTorch version to 1.6.
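A minimal sketch of the upgraded setup, assuming the SDK v2 renames (train_instance_count/train_instance_type became instance_count/instance_type, and the Horovod/MPI configuration is passed via the distribution argument):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.6.0',
                    py_version='py3',
                    instance_count=2,
                    instance_type='ml.p3.8xlarge',
                    # launch 4 MPI processes per host, one per GPU on ml.p3.8xlarge
                    distribution={'mpi': {'enabled': True,
                                          'processes_per_host': 4}})
estimator.fit()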
