Clean up compute target in case the cluster creation fails (#123)

* Moved default config to YAML * Moving to YAML config * Removing unnecessary dask.config pring * Fixing def close bug * cjhange default experiment_name * change default of Jupyter to False * first pass at documenting `AzureMLCluster` (wip) wip * minor edits * capitalize 'Compute Target' * minor edit to datastore section * add rest of method descriptions - all other methods shld be private * add example close cluster * remove line * remote duplicate docstring * add periods, optimize for markdown formatting * missed one * Changed return of __get_estimator to raise NotImplementedError * Working testing whether AzureMLCluster on the same VNET as the scheduler * Added used in SSH-tunnel (runs not on the same VNET) * Group all "worker" runs as "child" run Group all "worker" runs as "child" run * Update start_worker.py * Update azureml.py * Add documents for scale api Add documents for scale api; Make scale_up and scale_down as private * Fix style to meet flake8 formatting changes * Working SSH-tunnel * Widget updates * Changing Cores -> vCPUs * Update azureml.py * Make azureml.py after changes flake8 passing * Fixed missing parentheses * fix file path bug wow PM fixing a bug * Fixed the relative path to setup folder bug * Changed socket.timeout to (ConnectionRefusedError, socket.timeout) * Update azureml.py * fix 'passwordless' typo * Fixed the ConnectionRefusedError bug * Fixing hardcoded 'codes' name for the code_mount_name * Fixing somehow missing imports * Added additional_ports parameter * Minor cleanup * Update azureml.py * GPU examples working. Added mounting default store as notebook_dir * Reverting changes in start_worker.py * Fixed the bug in the worker startup script. 
General cleanup * Notebook_dir to mounts * Fixed the bug in widget * cosmetic change on cluster widget - '/' to '()' * fix datastores doc example issue * whoops * Fixing bug that was not printing the memory correctly for GPUs * Removing the use_gpu flags * Removing commented lines in the config file * More debugging messages in start_worker.py * Fix missing ip in start_worker.py * remove code_store from azureml.py * remove code_store from start_scheduler, rename to mount_point when needed * remove codestore from cloudprovider.yaml * remove print_links_ComputeVM method * Fixed n_gpus_per_node print in start_worker.py * Added additional exception handling for socket.gaierror * Updating requirements * Updating CI reqs * Update tests for AzureMLCluster * Major cleanup and making the code flake8 and black passing * Minor changes to pass black * Added logging * Updating requirements.txt * Updated logging in the azureml.py * Typo fix * Another typo fix * Changing log to info for ConnectionRefusedError * Making all files passing flake8 and black * Minor changes to .gitignore * Minor updates to docstrings * Updated documentation to include AzureMLCluster * Address comments * Update url * Update vm in example * Suggest change * Added VNET setup * Fixed env definition in docs * Removed adding packages to the environment * Adding timeout to scheduler and worker processes * Adding timeout flags in the properties of the class * Typo in logger.debug * Minor fixes in start_scheduler.py * Minor fixes in start_worker.py * Added import Run in start_worker.py * Added awaiting for headnode worker to close * Minor typo fix * Added argument to start_scheduler.py * Parsing string to int in scheduler script * Complete before cancel change * Removed rank check for MPI jobs since we submit one run per worker anyway * Add timeout logic Add timeout logic * Update start_worker.py * Update start_scheduler.py * Fixing typo in scripts when starting worker * Making flake8 and black pass * Add 
metrics * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update start_scheduler.py * Test cluster * Update init * correct init * Update complete * Extend timeout * Use cancel * Use ssh key path * Update forwarding * update print * update caller * Try close * update close * test child run * test child run 2 * test child run3 * test child run 4 * test child run5 * Rever changes to child run * Refactor * Update azureml.py * Update start_scheduler.py * Delete start.py * Delete azuremlssh.py * Update error msg * Address comments * Formating * Fix: cluster creation stuck on starting worker * Cancel the run after max retry * Add return * verbose * Only use error message * Create compute if not provided * Make ct optional * Update vnet check logic * Catch expected exception * Refactor and using kwargs * Resolve bytes issue * Fix timeout issue * Delete auto created compute target after close * Add check for compute instance * Update the check for CI * Clean up previous port forwarding * Set up portforwarding to CI * Local portforwarding * Update portforwarding * Enable dashboard in CI * Remove retry message * Refactor rpc connection * Check if on the same vnet * Set time delay for port open * Remove dummy message * update documentation for compute cluster creation on behalf of user * finish sentences * capitalize Studio * add links to AP, ML Studio * add initial_node_count to common kwargs * remove quota line * capitalize Workspace * jupyter defaults to True * datastores, reordering * Remove ct after connection failure * address jacob's comments for Jupyter boolean and Python example * Address comment * Clean up compute in case of creation failure * Check ssh key and admin user name * Fix docs * Add jupyter optional Co-authored-by: Tomek Drabas <drabas.t@gmail.com> Co-authored-by: Tom Drabas <todrabas@microsoft.com> Co-authored-by: Cody Peterson 
<54814569+lostmygithubaccount@users.noreply.github.com>
Co-authored-by: Fred Li <51424245+FredLiFromMS@users.noreply.github.com>
Co-authored-by: Cody Peterson <cody.dkdc2@gmail.com>
dask · Aug 25, 2020 · 1ca52ba
1 parent e44e1fb
commit 1ca52ba

Showing 2 changed files with 27 additions and 4 deletions.
diff --git a/dask_cloudprovider/providers/azure/azureml.py b/dask_cloudprovider/providers/azure/azureml.py
@@ -201,6 +201,15 @@ def __init__(
             except Exception as e:
                 logger.exception(e)
                 return
+        elif self.compute_target.admin_user_ssh_key is not None and (
+            self.admin_ssh_key is None or self.admin_username is None
+        ):
+            logger.exception(
+                "Please provide private key and admin username to access compute target {}".format(
+                    self.compute_target.name
+                )
+            )
+            return
 
         ### GPU RUN INFO
         self.workspace_vm_sizes = AmlCompute.supported_vmsizes(self.workspace)
@@ -449,6 +458,8 @@ def __get_ssh_keys(self):
         with open(pri_key_file, "wb") as f:
             f.write(private_key)
 
+        os.chmod(pri_key_file, 0o600)
+
         with open(pub_key_file, "r") as f:
             pubkey = f.read()
 
@@ -556,8 +567,11 @@ async def __create_cluster(self):
 
             if run_error:
                 error_message = "{} {}".format(error_message, run_error)
-
             logger.exception(error_message)
+
+            if not self.compute_target_set:
+                self.__delete_compute_target()
+
             raise Exception(error_message)
 
         print("\n")
@@ -578,9 +592,7 @@ async def __create_cluster(self):
         if self.same_vnet is None:
             self.run.cancel()
             if not self.compute_target_set:
-                ### REMOVE COMPUTE TARGET
                 self.__delete_compute_target()
-
             logger.exception(
                 "Connection error after retrying. Failed to start the AzureML cluster."
             )
@@ -595,7 +607,17 @@ async def __create_cluster(self):
         _scheduler = self.__prepare_rpc_connection_to_headnode()
         self.scheduler_comm = rpc(_scheduler)
         await self.sync(self.__setup_port_forwarding)
-        await self.sync(super()._start)
+
+        try:
+            await super()._start()
+        except Exception as e:
+            logger.exception(e)
+            # CLEAN UP COMPUTE TARGET
+            self.run.cancel()
+            if not self.compute_target_set:
+                self.__delete_compute_target()
+            return
+
         await self.sync(self.__update_links)
 
         self.__print_message("Connections established")

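The `__create_cluster` hunks above share one pattern: when startup fails, cancel the run and delete the compute target only if the cluster created it itself (`compute_target_set` is false). The following is a minimal, self-contained sketch of that pattern, not the code from `azureml.py`. The class `SketchCluster`, its methods, and the `fail_on_start` switch are hypothetical stand-ins so the flow can run without Azure ML.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class SketchCluster:
    """Hypothetical stand-in for a cluster that may auto-create its compute target."""

    def __init__(self, compute_target=None, fail_on_start=False):
        # True when the caller supplied the target; such targets are never deleted here.
        self.compute_target_set = compute_target is not None
        self.compute_target = compute_target or "auto-created-target"
        self.fail_on_start = fail_on_start  # test switch, not part of any real API
        self.run_cancelled = False
        self.target_deleted = False

    async def _start(self):
        # Stands in for the real startup call, which can fail.
        if self.fail_on_start:
            raise ConnectionError("scheduler did not come up")

    def _cancel_run(self):
        self.run_cancelled = True

    def _delete_compute_target(self):
        self.target_deleted = True

    async def create(self):
        try:
            await self._start()
        except Exception as exc:
            logger.error("Cluster start failed: %s", exc)
            # Always stop the submitted run.
            self._cancel_run()
            # Only remove resources this object created on the user's behalf.
            if not self.compute_target_set:
                self._delete_compute_target()
            return False
        return True
```

The ownership flag is what keeps a user-supplied compute target intact after a failure; only the auto-created one is removed.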
diff --git a/doc/source/index.rst b/doc/source/index.rst
@@ -243,6 +243,7 @@ To create cluster:
    vm_size="STANDARD_DS13_V2",                                 # Azure VM size for the Compute Target
    datastores=ws.datastores.values(),                          # Azure ML Datastores to mount on the headnode
    environment_definition=ws.environments['AzureML-Dask-CPU'], # Azure ML Environment to run on the cluster
+   jupyter=True,                                               # Flag to start JupyterLab session on the headnode
    initial_node_count=2,                                       # number of nodes to start 
    scheduler_idle_timeout=7200                                 # scheduler idle timeout in seconds 
    )
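A note on the `os.chmod(pri_key_file, 0o600)` line added to `__get_ssh_keys` in `azureml.py`: OpenSSH refuses private key files that other users can read, so the key file has to be restricted to its owner. The sketch below shows one way to write a key with owner-only permissions on a POSIX system. The helper names `write_private_key` and `key_mode` are hypothetical, and creating the file with `os.open` is a variation on the commit, which writes with `open()` first and calls `os.chmod` afterwards.

```python
import os
import stat
import tempfile


def write_private_key(path, private_key: bytes) -> None:
    """Write key bytes to `path` so that only the owner can read or write it."""
    # Request mode 0o600 at creation time so a new file is never
    # readable by other users, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(private_key)
    # The mode passed to os.open only applies when the file is created,
    # so chmod also covers a file that already existed with wider permissions.
    os.chmod(path, 0o600)


def key_mode(path) -> int:
    """Return the permission bits of `path`."""
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        key_path = os.path.join(tmp_dir, "id_rsa_sketch")
        write_private_key(key_path, b"not a real key")
        print(oct(key_mode(key_path)))
```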