Update train_instance_count from 2 to 1 for local mode #690

Merged

neelamgehlot commented Mar 22, 2019

Testing Done
Run notebook after change

Issue #, if available:

Description of changes:
Update train_instance_count from 2 to 1. In local mode, multiple instances are not used for actual distributed training; distributed training is only simulated locally, and this simulation sometimes causes issues when it is not done correctly.
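
For illustration only, here is a minimal sketch of the idea behind this change. It assumes the notebook configures its estimator with `train_instance_type` and `train_instance_count` keyword arguments and that local mode is selected with the instance types `local` or `local_gpu`. The helper `local_safe_instance_count` and the `estimator_kwargs` dict are hypothetical names, not part of the SageMaker SDK or of this PR.

```python
# Instance type values assumed to select local mode.
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def local_safe_instance_count(instance_type, requested_count):
    """Return the instance count to use for the given instance type.

    Hypothetical helper (not part of the SageMaker SDK): in local mode,
    fall back to a single instance so that no multi-container simulation
    of distributed training is attempted; otherwise keep the requested
    count unchanged.
    """
    if instance_type in LOCAL_INSTANCE_TYPES:
        return 1
    return requested_count


# Keyword arguments one might pass on to the notebook's estimator.
# The notebook previously requested 2 instances while running in local mode.
estimator_kwargs = {
    "train_instance_type": "local",
    "train_instance_count": local_safe_instance_count("local", 2),
}
print(estimator_kwargs["train_instance_count"])
```

The PR itself simply hard-codes the count to 1 in the notebook; a helper like this would only be worth having if the same notebook were meant to switch between local mode and hosted training.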

Stacktrace for error when simulation doesn't work correctly

RuntimeError                              Traceback (most recent call last)
<ipython-input-7-badf28ab2bd7> in <module>()
     10 
     11 for i in range(10):
---> 12     mnist_estimator.fit(inputs)
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.pyc in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
    334                 tensorboard.join()
    335         else:
--> 336             fit_super()
    337 
    338     @classmethod
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/tensorflow/estimator.pyc in fit_super()
    313 
    314         def fit_super():
--> 315             super(TensorFlow, self).fit(inputs, wait, logs, job_name)
    316 
    317         if run_tensorboard_locally and wait is False:
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
    234         self._prepare_for_training(job_name=job_name)
    235 
--> 236         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    237         if wait:
    238             self.latest_training_job.wait(logs=logs)
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs)
    578             train_args['image'] = estimator.train_image()
    579 
--> 580         estimator.sagemaker_session.train(**train_args)
    581 
    582         return cls(estimator.sagemaker_session, estimator._current_job_name)
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic)
    318         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    319         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 320         self.sagemaker_client.create_training_job(**train_request)
    321 
    322     def compile_model(self, input_model_config, output_model_config, role,
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/local_session.pyc in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
     72         training_job = _LocalTrainingJob(container)
     73         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 74         training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
     75 
     76         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/entities.pyc in start(self, input_data_config, output_data_config, hyperparameters, job_name)
     68         self.state = self._TRAINING
     69 
---> 70         self.model_artifacts = self.container.train(input_data_config, output_data_config, hyperparameters, job_name)
     71         self.end = datetime.datetime.now()
     72         self.state = self._COMPLETED
 
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/local/image.pyc in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    136             # which contains the exit code and append the command line to it.
    137             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 138             raise RuntimeError(msg)
    139 
    140         artifacts = self.retrieve_artifacts(compose_data, output_data_config, job_name)
 
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpBlelFv/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Update train_instance_count from 2 to 1 for local mode
* In local mode, multiple instances are not used for distributed training.

Testing Done
* Run notebook after change

@laurenyu laurenyu merged commit b573fe5 into awslabs:master Mar 22, 2019

@neelamgehlot neelamgehlot deleted the neelamgehlot:local-mode-distributed-training branch Mar 22, 2019
