Update train_instance_count from 2 to 1 for local mode #690



neelamgehlot commented Mar 22, 2019

Testing Done
Run notebook after change

Issue #, if available:

Description of changes:
Update train_instance_count from 2 to 1 because local mode does not use multiple instances for distributed training. It runs a local simulation of distributed training, which sometimes causes issues if the simulation is not done correctly.
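To illustrate the change, here is a minimal sketch of picking the instance count based on the instance type. The helper name `estimator_kwargs` is hypothetical and not part of the notebook or the SageMaker SDK. It only builds a plain dict using the SDK v1 argument names `train_instance_type` and `train_instance_count`. It assumes that `local` and `local_gpu` are the instance type values that select local mode.

```python
def estimator_kwargs(instance_type, requested_count):
    """Build estimator keyword arguments (hypothetical helper, SDK v1 naming)."""
    # Assumption: 'local' and 'local_gpu' select local mode, which only
    # simulates a cluster with Docker containers on a single machine.
    is_local = instance_type in ("local", "local_gpu")
    # In local mode, fall back to a single instance; otherwise honor the request.
    count = 1 if is_local else requested_count
    return {
        "train_instance_type": instance_type,
        "train_instance_count": count,
    }


local_args = estimator_kwargs("local", 2)
remote_args = estimator_kwargs("ml.c4.xlarge", 2)
print(local_args)
print(remote_args)
```

The resulting dict could then be unpacked into an estimator constructor alongside the notebook's other arguments. The notebook itself simply hard-codes the value 1.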

Stacktrace for error when simulation doesn't work correctly

RuntimeError                              Traceback (most recent call last)
<ipython-input-7-badf28ab2bd7> in <module>()
     11 for i in range(10):
---> 12
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.pyc in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
    334                 tensorboard.join()
    335         else:
--> 336             fit_super()
    338     @classmethod
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.pyc in fit_super()
    314         def fit_super():
--> 315             super(TensorFlow, self).fit(inputs, wait, logs, job_name)
    317         if run_tensorboard_locally and wait is False:
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
    234         self._prepare_for_training(job_name=job_name)
--> 236         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    237         if wait:
    238             self.latest_training_job.wait(logs=logs)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs)
    578             train_args['image'] = estimator.train_image()
--> 580         estimator.sagemaker_session.train(**train_args)
    582         return cls(estimator.sagemaker_session, estimator._current_job_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic)
    318'Creating training-job with name: {}'.format(job_name))
    319         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 320         self.sagemaker_client.create_training_job(**train_request)
    322     def compile_model(self, input_model_config, output_model_config, role,
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/local_session.pyc in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
     72         training_job = _LocalTrainingJob(container)
     73         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 74         training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
     76         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/entities.pyc in start(self, input_data_config, output_data_config, hyperparameters, job_name)
     68         self.state = self._TRAINING
---> 70         self.model_artifacts = self.container.train(input_data_config, output_data_config, hyperparameters, job_name)
     71         self.end =
     72         self.state = self._COMPLETED
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/image.pyc in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    136             # which contains the exit code and append the command line to it.
    137             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 138             raise RuntimeError(msg)
    140         artifacts = self.retrieve_artifacts(compose_data, output_data_config, job_name)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpBlelFv/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Update train_instance_count from 2 to 1 for local mode
* In local mode, multiple instances are not used for distributed training.

Testing Done
* Run notebook after change

@laurenyu laurenyu merged commit b573fe5 into awslabs:master Mar 22, 2019

@neelamgehlot neelamgehlot deleted the neelamgehlot:local-mode-distributed-training branch Mar 22, 2019
