Clean up compute target in case the cluster creation fails (#123)
* Moved default config to YAML

* Moving to YAML config

* Removing unnecessary dask.config print

* Fixing def close bug

* change default experiment_name

* change default of Jupyter to False

* first pass at documenting `AzureMLCluster` (wip)

wip

* minor edits

* capitalize 'Compute Target'

* minor edit to datastore section

* add rest of method descriptions - all other methods should be private

* add example close cluster

* remove line

* remove duplicate docstring

* add periods, optimize for markdown formatting

* missed one

* Changed return of __get_estimator to raise NotImplementedError

* Working test of whether AzureMLCluster is on the same VNET as the scheduler

* Added  used in SSH-tunnel (runs not on the same VNET)

* Group all "worker" runs as "child" run

* Update start_worker.py

* Update azureml.py

* Add documents for scale api

Add documents for scale api;
Make scale_up and scale_down as private

* Fix style to meet flake8

formatting changes

* Working SSH-tunnel

* Widget updates

* Changing Cores -> vCPUs

* Update azureml.py

* Make azureml.py after changes flake8 passing

* Fixed missing parentheses

* fix file path bug

wow PM fixing a bug

* Fixed the relative path to setup folder bug

* Changed socket.timeout to (ConnectionRefusedError, socket.timeout)

* Update azureml.py

* fix 'passwordless' typo

* Fixed the ConnectionRefusedError bug

* Fixing hardcoded 'codes' name for the code_mount_name

* Fixing somehow missing imports

* Added additional_ports parameter

* Minor cleanup

* Update azureml.py

* GPU examples working. Added mounting default store as notebook_dir

* Reverting changes in start_worker.py

* Fixed the bug in the worker startup script. General cleanup

* Notebook_dir to mounts

* Fixed the bug in widget

* cosmetic change on cluster widget - '/' to '()'

* fix datastores doc example issue

* whoops

* Fixing bug that was not printing the memory correctly for GPUs

* Removing the use_gpu flags

* Removing commented lines in the config file

* More debugging messages in start_worker.py

* Fix missing ip in start_worker.py

* remove code_store from azureml.py

* remove code_store from start_scheduler, rename to mount_point when needed

* remove codestore from cloudprovider.yaml

* remove print_links_ComputeVM method

* Fixed n_gpus_per_node print in start_worker.py

* Added additional exception handling for socket.gaierror

* Updating requirements

* Updating CI reqs

* Update tests for AzureMLCluster

* Major cleanup and making the code flake8 and black passing

* Minor changes to pass black

* Added logging

* Updating requirements.txt

* Updated logging in the azureml.py

* Typo fix

* Another typo fix

* Changing log to info for ConnectionRefusedError

* Making all files passing flake8 and black

* Minor changes to .gitignore

* Minor updates to docstrings

* Updated documentation to include AzureMLCluster

* Address comments

* Update url

* Update vm in example

* Suggest change

* Added VNET setup

* Fixed env definition in docs

* Removed adding packages to the environment

* Adding timeout to scheduler and worker processes

* Adding timeout flags in the properties of the class

* Typo in logger.debug

* Minor fixes in start_scheduler.py

* Minor fixes in start_worker.py

* Added import Run in start_worker.py

* Added awaiting for headnode worker to close

* Minor typo fix

* Added argument to start_scheduler.py

* Parsing string to int in scheduler script

* Complete before cancel change

* Removed rank check for MPI jobs since we submit one run per worker anyway

* Add timeout logic

* Update start_worker.py

* Update start_scheduler.py

* Fixing typo in scripts when starting worker

* Making flake8 and black pass

* Add metrics

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update start_scheduler.py

* Test cluster

* Update init

* correct init

* Update complete

* Extend timeout

* Use cancel

* Use ssh key path

* Update forwarding

* update print

* update caller

* Try close

* update close

* test child run

* test child run 2

* test child run3

* test child run 4

* test child run5

* Revert changes to child run

* Refactor

* Update azureml.py

* Update start_scheduler.py

* Delete start.py

* Delete azuremlssh.py

* Update error msg

* Address comments

* Formatting

* Fix: cluster creation stuck on starting worker

* Cancel the run after max retry

* Add return

* verbose

* Only use error message

* Create compute if not provided

* Make ct optional

* Update vnet check logic

* Catch expected exception

* Refactor and using kwargs

* Resolve bytes issue

* Fix timeout issue

* Delete auto created compute target after close

* Add check for compute instance

* Update the check for CI

* Clean up previous port forwarding

* Set up portforwarding to CI

* Local portforwarding

* Update portforwarding

* Enable dashboard in CI

* Remove retry message

* Refactor rpc connection

* Check if on the same vnet

* Set time delay for port open

* Remove dummy message

* update documentation for compute cluster creation on behalf of user

* finish sentences

* capitalize Studio

* add links to AP, ML Studio

* add initial_node_count to common kwargs

* remove quota line

* capitalize Workspace

* jupyter defaults to True

* datastores, reordering

* Remove ct after connection failure

* address jacob's comments for Jupyter boolean and Python example

* Address comment

* Clean up compute in case of creation failure

* Check ssh key and admin user name

* Fix docs

* Add jupyter optional

Co-authored-by: Tomek Drabas <drabas.t@gmail.com>
Co-authored-by: Tom Drabas <todrabas@microsoft.com>
Co-authored-by: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com>
Co-authored-by: Fred Li <51424245+FredLiFromMS@users.noreply.github.com>
Co-authored-by: Cody Peterson <cody.dkdc2@gmail.com>
6 people committed Aug 25, 2020
1 parent e44e1fb commit 1ca52ba
Showing 2 changed files with 27 additions and 4 deletions.
30 changes: 26 additions & 4 deletions dask_cloudprovider/providers/azure/azureml.py
@@ -201,6 +201,15 @@ def __init__(
except Exception as e:
logger.exception(e)
return
elif self.compute_target.admin_user_ssh_key is not None and (
self.admin_ssh_key is None or self.admin_username is None
):
logger.exception(
"Please provide private key and admin username to access compute target {}".format(
self.compute_target.name
)
)
return

### GPU RUN INFO
self.workspace_vm_sizes = AmlCompute.supported_vmsizes(self.workspace)
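The check added in this hunk can be read as a small precondition: if the compute target was provisioned with an SSH key, the caller must supply both a private key and an admin username. Below is a minimal standalone sketch of that condition. The function name and its parameters are hypothetical and not part of `azureml.py`, and it raises `ValueError` instead of logging and returning as the diff does, which is a swapped-in choice for illustration.

```python
from typing import Optional


def check_ssh_credentials(
    target_name: str,
    target_has_ssh_key: bool,
    admin_ssh_key: Optional[str],
    admin_username: Optional[str],
) -> None:
    """Hypothetical helper mirroring the condition in the hunk above.

    Raises ValueError when the target expects SSH access but either
    the private key or the admin username is missing.
    """
    if target_has_ssh_key and (admin_ssh_key is None or admin_username is None):
        raise ValueError(
            "Please provide private key and admin username "
            "to access compute target {}".format(target_name)
        )
```

Raising makes the failure visible to the caller, whereas an early `return` from `__init__` still hands back a partially initialised object.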
@@ -449,6 +458,8 @@ def __get_ssh_keys(self):
with open(pri_key_file, "wb") as f:
f.write(private_key)

os.chmod(pri_key_file, 0o600)

with open(pub_key_file, "r") as f:
pubkey = f.read()
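The `os.chmod(pri_key_file, 0o600)` line restricts the generated private key to its owner, which OpenSSH requires before it will use a key file. A minimal sketch of the same write-then-restrict step follows, assuming a POSIX filesystem. The helper name `write_private_key` is hypothetical and does not exist in the repository.

```python
import os
import stat
import tempfile


def write_private_key(path: str, private_key: bytes) -> None:
    """Hypothetical helper: write key bytes, then limit access to the owner."""
    with open(path, "wb") as f:
        f.write(private_key)
    # 0o600 = owner read/write only; group and others get no access.
    os.chmod(path, 0o600)


def file_mode(path: str) -> int:
    """Return only the permission bits of the file at path."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, after `write_private_key(p, b"demo")` on Linux or macOS, `oct(file_mode(p))` gives `'0o600'`. A stricter variant would create the file with `os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)` so it is never briefly readable by others, but that differs from the approach in the diff.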

@@ -556,8 +567,11 @@ async def __create_cluster(self):

if run_error:
error_message = "{} {}".format(error_message, run_error)

logger.exception(error_message)

if not self.compute_target_set:
self.__delete_compute_target()

raise Exception(error_message)

print("\n")
@@ -578,9 +592,7 @@ async def __create_cluster(self):
if self.same_vnet is None:
self.run.cancel()
if not self.compute_target_set:
### REMOVE COMPUTE TARGET
self.__delete_compute_target()

logger.exception(
"Connection error after retrying. Failed to start the AzureML cluster."
)
@@ -595,7 +607,17 @@ async def __create_cluster(self):
_scheduler = self.__prepare_rpc_connection_to_headnode()
self.scheduler_comm = rpc(_scheduler)
await self.sync(self.__setup_port_forwarding)
await self.sync(super()._start)

try:
await super()._start()
except Exception as e:
logger.exception(e)
# CLEAN UP COMPUTE TARGET
self.run.cancel()
if not self.compute_target_set:
self.__delete_compute_target()
return

await self.sync(self.__update_links)

self.__print_message("Connections established")
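The hunks above apply one pattern in several places: when startup fails, cancel the run, and delete the compute target only if the cluster created it itself (`compute_target_set` is false). The sketch below isolates that pattern in a self-contained toy class. `DemoCluster` and all of its methods are hypothetical stand-ins, not the real `AzureMLCluster` or any Azure ML SDK call.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class DemoCluster:
    """Hypothetical stand-in used only to illustrate cleanup on failure."""

    def __init__(self, compute_target_set: bool, fail: bool):
        # True when the user supplied the compute target, so it must be kept.
        self.compute_target_set = compute_target_set
        self.fail = fail
        self.run_cancelled = False
        self.compute_deleted = False

    async def _start(self) -> None:
        if self.fail:
            raise ConnectionError("simulated startup failure")

    def _cancel_run(self) -> None:
        self.run_cancelled = True

    def _delete_compute_target(self) -> None:
        self.compute_deleted = True

    async def create(self) -> bool:
        """Start the cluster; on failure, release what this object created."""
        try:
            await self._start()
        except Exception as e:
            logger.exception(e)
            self._cancel_run()
            # Only remove compute that was auto-created on the user's behalf.
            if not self.compute_target_set:
                self._delete_compute_target()
            return False
        return True
```

Running `asyncio.run(DemoCluster(compute_target_set=False, fail=True).create())` returns `False` and marks both the run and the compute target as cleaned up. With `compute_target_set=True` the user's compute target is left in place.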
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -243,6 +243,7 @@ To create cluster:
vm_size="STANDARD_DS13_V2", # Azure VM size for the Compute Target
datastores=ws.datastores.values(), # Azure ML Datastores to mount on the headnode
environment_definition=ws.environments['AzureML-Dask-CPU'], # Azure ML Environment to run on the cluster
jupyter=True, # Flag to start JupyterLab session on the headnode
initial_node_count=2, # number of nodes to start
scheduler_idle_timeout=7200 # scheduler idle timeout in seconds
)