Local mode failure for MXNet estimator #1349

@ehsanmok

Describe the bug
I'm on ml.p3.2xlarge with the mxnet_p36 conda env, and I installed local mode support with python -m pip install "sagemaker[local]". Training with the estimator below fails in local mode (train_instance_type='local' or 'local_gpu') but works with any non-local instance type:

from sagemaker.mxnet import MXNet

# role, hyperparameters, train_output, code_location and session are defined earlier in the notebook
estimator = MXNet(entry_point='main.py',
                  source_dir='code',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='local',  # 'ml.c4.2xlarge'
                  framework_version="1.4.1",
                  py_version='py3',
                  hyperparameters=hyperparameters,
                  output_path=train_output,
                  code_location=code_location,
                  sagemaker_session=session,
                  )
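One thing that may be relevant: the traceback in the logs section ends in botocore's create_training_job, which suggests the estimator is talking to the hosted service through a regular sagemaker.Session rather than a local one. A minimal sketch of choosing the session from the instance type follows. It assumes sagemaker.local.LocalSession is available in the installed SDK version; pick_session and is_local_mode are hypothetical helper names, not part of the SDK.

```python
# Sketch only: choose the session class from the instance type.
# pick_session and is_local_mode are hypothetical helpers, not SDK functions.
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def is_local_mode(instance_type):
    """Return True when the instance type requests SageMaker local mode."""
    return instance_type in LOCAL_INSTANCE_TYPES


def pick_session(instance_type):
    """Return a LocalSession for local mode, otherwise a regular Session."""
    # Imported lazily so is_local_mode can be used without the SDK installed.
    import sagemaker
    from sagemaker.local import LocalSession

    if is_local_mode(instance_type):
        return LocalSession()
    return sagemaker.Session()
```

With this, session = pick_session('local') could be passed as sagemaker_session to the estimator. Leaving sagemaker_session out entirely for local instance types may also work, since I believe the estimator then creates its own local session, but I have not verified that on this SDK version.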

Screenshots or logs

ClientError                               Traceback (most recent call last)
<ipython-input-11-2c228015dcf6> in <module>()
     15                  )
     16 
---> 17 estimator.fit(input_data)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    460         self._prepare_for_training(job_name=job_name)
    461 
--> 462         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    463         self.jobs.append(self.latest_training_job)
    464         if wait:

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
   1008             train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
   1009 
-> 1010         estimator.sagemaker_session.train(**train_args)
   1011 
   1012         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
    567         LOGGER.info("Creating training-job with name: %s", job_name)
    568         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 569         self.sagemaker_client.create_training_job(**train_request)
    570 
    571     def process(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
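
The ValidationException above comes from the hosted CreateTrainingJob API checking resourceConfig.instanceType against its enum, so the request appears to have left the machine instead of being handled locally. As a quick sanity check, the allowed values can be pulled out of the message and compared with the requested type. This is a rough sketch: allowed_instance_types is a hypothetical helper, and the sample string is a shortened copy of the message, not the full list.

```python
import re


def allowed_instance_types(message):
    """Extract the instance types listed in a ValidationException message."""
    match = re.search(r"enum value set: \[([^\]]*)\]", message)
    if match is None:
        return set()
    return {item.strip() for item in match.group(1).split(",") if item.strip()}


# Shortened copy of the error message; most instance types are omitted here.
sample = (
    "Value 'local' at 'resourceConfig.instanceType' failed to satisfy "
    "constraint: Member must satisfy enum value set: "
    "[ml.p2.xlarge, ml.c4.2xlarge, ml.p3.2xlarge]"
)

allowed = allowed_instance_types(sample)
print("local" in allowed)          # False
print("ml.c4.2xlarge" in allowed)  # True
```

Since 'local' is never in that set, any request carrying it will be rejected by the service, which matches what I see.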

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.50.16
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
  • Framework version: 1.4.1 and 1.6.0
  • Python version: py3
  • CPU or GPU: Both
  • Custom Docker image (Y/N):
