cannot recognize num_gpus for more than 1 gpu per instance #222
@zhaoanbei Could you show me how you started your training job? Which version of the PyTorch container did you use? Did you use the SageMaker Python SDK? If so, which version? Please paste the code here.
Sure!

import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hvdTorch'
role = sagemaker.get_execution_role()

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        "backend": "gloo",
                    })
estimator.fit()
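One thing to note in this configuration: the backend hyperparameter is set to "gloo" even though ml.p3.8xlarge instances carry GPUs, and NCCL is usually the faster torch.distributed backend on GPUs (the help text in simple.py below says as much). A small sketch of choosing the backend at runtime; the pick_backend helper here is hypothetical, not part of simple.py:

import torch

def pick_backend(requested=None):
    # Honor an explicit choice; otherwise prefer NCCL when GPUs are present.
    if requested:
        return requested
    return "nccl" if torch.cuda.is_available() else "gloo"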
I changed simple.py a little:
# Imports needed by the snippet below (present in the full simple.py).
import argparse
import json
import os

import horovod.torch as hvd
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument(
    "--backend",
    type=str,
    default=None,
    help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)

# Container environment
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
# argparse only applies `type` to command-line values, not to defaults, and
# environment variables are strings, so the default has to be cast explicitly.
parser.add_argument("--num-gpus", type=int, default=int(os.environ["SM_NUM_GPUS"]))
args = parser.parse_args()

use_cuda = args.num_gpus > 0
# Note: the two print() calls below pass the format string and the values as
# separate arguments, so the placeholders show up verbatim in the log output
# quoted further down.
print("Number of gpus available - %d", args.num_gpus)
# device = torch.device("cuda" if use_cuda else "cpu")

world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
print(
    "Initialized the distributed environment: '%s' backend on %d nodes. "
    "Current host rank is %d. Number of gpus: %d",
    args.backend, dist.get_world_size(),
    dist.get_rank(), args.num_gpus
)

hvd.init()  # must run before any hvd.rank()/hvd.size()/hvd.local_rank() call

ARTIFACT_DIRECTORY = '/opt/ml/model/'
FILENAME = 'local-rank-%s-rank-%s.json' % (hvd.local_rank(), hvd.rank())
with open(os.path.join(ARTIFACT_DIRECTORY, FILENAME), 'w+') as file:
    info = {'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()}
    json.dump(info, file)
print(info)

And it returned:

Number of gpus available - %d 4

As you can see, os.environ["SM_NUM_GPUS"] returns 4 on both hosts, but hvd.size() is 1.
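For context on why hvd.size() stays at 1: Horovod's size counts the worker processes the launcher started, not the GPUs present on the machine, so a single-process launch reports size 1 even on a 4-GPU instance. A minimal sketch of the per-process setup Horovod expects, assuming horovod.torch is installed in the container:

# Each of these lines runs once per launched worker process.
import horovod.torch as hvd
import torch

hvd.init()
if torch.cuda.is_available():
    # Pin each worker to one GPU, indexed by its local rank on the host.
    torch.cuda.set_device(hvd.local_rank())
print("rank %d of %d (local rank %d)" % (hvd.rank(), hvd.size(), hvd.local_rank()))

To see size 8 across two 4-GPU instances, the launcher has to start four such processes per host, for example mpirun -np 8 -H host1:4,host2:4 python simple.py; that is what the container-side Horovod support discussed below takes care of.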
@zhaoanbei Sorry for the delay. We need to enable Horovod support on the Python SDK side as well.
The most recent PyTorch 1.6 CPU and GPU (CUDA 11) images have the fixes to enable Horovod. The Python SDK change has been completed as well: aws/sagemaker-python-sdk#441. This should work now. Resolving. Feel free to reopen if the problem comes back after upgrading the Python SDK to the current version, 2.23.2, and the PT version to 1.6.
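For reference, a sketch of what the upgraded setup might look like with the v2 SDK, which renamed train_instance_count/train_instance_type to instance_count/instance_type and added the distribution parameter for launching Horovod over MPI (treat the exact values as illustrative):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='simple.py',
                    role=role,  # assumes `role` from sagemaker.get_execution_role()
                    framework_version='1.6.0',
                    py_version='py3',
                    instance_count=2,
                    instance_type='ml.p3.8xlarge',
                    # Launch one Horovod worker per GPU on each host, so
                    # hvd.size() becomes 8 across two 4-GPU instances.
                    distribution={'mpi': {'enabled': True,
                                          'processes_per_host': 4}})
estimator.fit()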
I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returned:
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS
It only recognizes 1 GPU per instance.
Is there anything I did wrong?