cannot recognize num_gpus for more than 1 gpu per instance #222

Closed
zhaoanbei opened this issue Sep 20, 2020 · 4 comments
Labels
type: feature request New feature or request

Comments

@zhaoanbei

zhaoanbei commented Sep 20, 2020

I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returned:
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS
It only recognizes 1 GPU per instance.

Did I do anything wrong?

@icywang86rui
Contributor

@zhaoanbei Could you show me how you started your training job? Which version of the PyTorch container did you use? Did you use the SageMaker Python SDK? If so, which version? Please paste the code here.

@zhaoanbei
Author

zhaoanbei commented Sep 22, 2020

Sure!

import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hvdTorch'
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        "backend": "gloo",
                    })
estimator.fit()
I changed simple.py a little:

# imports and hvd.init() as in the original simple.py, included so this
# excerpt stands alone
import argparse
import json
import os

import horovod.torch as hvd
import torch.distributed as dist

hvd.init()

parser = argparse.ArgumentParser()
parser.add_argument(
    "--backend",
    type=str,
    default=None,
    help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)

# Container environment
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

args = parser.parse_args()

use_cuda = args.num_gpus > 0
# note: print() receives the format string and the value as separate
# arguments, so the literal '%d' appears in the output below
print("Number of gpus available - %d", args.num_gpus)

# device = torch.device("cuda" if use_cuda else "cpu")

world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)

print(
    "Initialized the distributed environment: '%s' backend on %d nodes. "
    "Current host rank is %d. Number of gpus: %d",
    args.backend, dist.get_world_size(),
    dist.get_rank(), args.num_gpus
)

ARTIFACT_DIRECTORY = '/opt/ml/model/'
FILENAME = 'local-rank-%s-rank-%s.json' % (hvd.local_rank(), hvd.rank())

with open(os.path.join(ARTIFACT_DIRECTORY, FILENAME), 'w+') as file:
    info = {'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()}
    json.dump(info, file)
    print(info)

And it returned:

Number of gpus available - %d 4
Number of gpus available - %d 4
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 0 4
{'local-rank': 0, 'rank': 0, 'size': 1}
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 1 4
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 08:14:24,468 sagemaker-containers INFO Reporting training SUCCESS
2020-09-20 08:14:24,450 sagemaker-containers INFO Reporting training SUCCESS

As you can see, os.environ["SM_NUM_GPUS"] returns 4 on both instances, but hvd.size() is 1.
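(A likely explanation, sketched below assuming Horovod's documented behavior: hvd.size() counts the processes started by the launcher, e.g. mpirun, not the GPUs visible to the container, so a job launched as one process per host reports size 1 regardless of SM_NUM_GPUS.)

import horovod.torch as hvd

# Minimal sketch: run without an MPI launcher, this is a single Horovod
# process, so size() is 1 even on a 4-GPU instance; launched via
# `mpirun -np 8`, the same code would report size 8.
hvd.init()
print({'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()})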

@icywang86rui
Contributor

@zhaoanbei Sorry for the delay. We need to enable Horovod support on the Python SDK side as well. The distribution arg needs to be added to the PyTorch estimator like the TensorFlow one here - https://github.com/aws/sagemaker-python-sdk/blob/64f600d677872fe8656cdf25d68fc4950b2cd28f/doc/frameworks/tensorflow/using_tf.rst#training-with-horovod
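For reference, a sketch of what that could look like on the PyTorch estimator, modeled on the TensorFlow estimator's Horovod configuration from the linked doc (the distributions argument and its values here are illustrative; the PyTorch estimator did not accept it at the time of this comment):

from sagemaker.pytorch import PyTorch

# Hypothetical sketch modeled on the TensorFlow estimator's Horovod support;
# not supported by the PyTorch estimator when this comment was written.
estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    distributions={
                        'mpi': {
                            'enabled': True,
                            'processes_per_host': 4,  # one process per GPU on ml.p3.8xlarge
                        }
                    })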

@ajaykarpur added the type: feature request label Oct 8, 2020
@icywang86rui
Contributor

The most recent PyTorch 1.6 CPU and GPU (CUDA 11) images have the fixes to enable Horovod. The Python SDK change has been completed as well - aws/sagemaker-python-sdk#441

This should work now. Resolving. Feel free to reopen if the problem comes back after upgrading the Python SDK to the current version (2.23.2) and the PyTorch version to 1.6.
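A minimal sketch of the upgraded setup, assuming the SDK v2 renames (train_instance_count/train_instance_type became instance_count/instance_type, and the Horovod/MPI configuration is passed via the distribution argument):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.6.0',
                    py_version='py3',
                    instance_count=2,
                    instance_type='ml.p3.8xlarge',
                    # launch 4 MPI processes per host, one per GPU on ml.p3.8xlarge
                    distribution={'mpi': {'enabled': True,
                                          'processes_per_host': 4}})
estimator.fit()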
