Make fewer status requests and add more details in failure cases (#108)
* Initial commits with tests for AzureMLCluster

* Submission works but the Environment doesn't want to rebuild, so python_interpreter_path is missing

* Updating .gitignore

* Working environments. Will have to check if adding conda and pip packages works as expected

* Working cluster setup using Environments

* Fixing a bug in print_links_ComputeVM

* Splitting scheduler and worker startup scripts

* clean up

* Scale up/down code

* Logic change to get the scheduler IP and port

* Added closing the cluster method and some tidying up

* Working scaling up/down and closing the cluster. Test revealed data not available to worker

* Major API refactor started

* Started adding unittest

* Update azureml.py

Show "Complete" status for canceled runs.

* Update azureml.py

Add comment to explain why we need complete()

* Minor changes to the public API

* Fixed bug in the testing suite

* First commit for deriving from dask.distributed.deploy.cluster.Cluster

* Moving to testing RPC comms

* Handling API change in asyncio between Python 3.6 and 3.7

* Added async calls to make compliant with dask.distributed.deploy.cluster.Cluster

* Working fork from the Cluster class. Also, scaling up and down

* Added support for widget so it prints proper links

* Added custom widget information

* Fix some bugs

* Update start_scheduler.py

* Make env object required

* Update start_scheduler.py

* Namespace change

* Changing namespace to providers/azure instead of providers/azureml

* Pass datastores list

* Update start_scheduler.py

* Handle failed cluster creation

Raise an exception if cluster creation fails

* Mounting of datastores works as well as defaulting to blobstore in case no code_store is provided

* Moved default config to YAML

* Moving to YAML config

* Removing unnecessary dask.config print

* Fixing def close bug

* Change default experiment_name

* change default of Jupyter to False

* first pass at documenting `AzureMLCluster` (wip)

* minor edits

* capitalize 'Compute Target'

* minor edit to datastore section

* add rest of method descriptions - all other methods should be private

* add example close cluster

* remove line

* remove duplicate docstring

* add periods, optimize for markdown formatting

* missed one

* Changed return of __get_estimator to raise NotImplementedError

* Working test of whether AzureMLCluster is on the same VNET as the scheduler

* Added  used in SSH-tunnel (runs not on the same VNET)

* Group all "worker" runs as "child" run

* Update start_worker.py

* Update azureml.py

* Add documentation for the scale API

Make scale_up and scale_down private

* Fix style to meet flake8

formatting changes

* Working SSH-tunnel

* Widget updates

* Changing Cores -> vCPUs

* Update azureml.py

* Make azureml.py pass flake8 after the changes

* Fixed missing parentheses

* fix file path bug

wow PM fixing a bug

* Fixed the relative path to setup folder bug

* Changed socket.timeout to (ConnectionRefusedError, socket.timeout)

* Update azureml.py

* fix 'passwordless' typo

* Fixed the ConnectionRefusedError bug

* Fixing hardcoded 'codes' name for the code_mount_name

* Fixing imports that were somehow missing

* Added additional_ports parameter

* Minor cleanup

* Update azureml.py

* GPU examples working. Added mounting default store as notebook_dir

* Reverting changes in start_worker.py

* Fixed the bug in the worker startup script. General cleanup

* Notebook_dir to mounts

* Fixed the bug in widget

* cosmetic change on cluster widget - '/' to '()'

* fix datastores doc example issue

* whoops

* Fixing bug that was not printing the memory correctly for GPUs

* Removing the use_gpu flags

* Removing commented lines in the config file

* More debugging messages in start_worker.py

* Fix missing ip in start_worker.py

* remove code_store from azureml.py

* remove code_store from start_scheduler, rename to mount_point when needed

* remove codestore from cloudprovider.yaml

* remove print_links_ComputeVM method

* Fixed n_gpus_per_node print in start_worker.py

* Added additional exception handling for socket.gaierror

* Updating requirements

* Updating CI reqs

* Update tests for AzureMLCluster

* Major cleanup and making the code flake8 and black passing

* Minor changes to pass black

* Added logging

* Updating requirements.txt

* Updated logging in the azureml.py

* Typo fix

* Another typo fix

* Changing log to info for ConnectionRefusedError

* Making all files pass flake8 and black

* Minor changes to .gitignore

* Minor updates to docstrings

* Updated documentation to include AzureMLCluster

* Address comments

* Update url

* Update vm in example

* Suggest change

* Added VNET setup

* Fixed env definition in docs

* Removed adding packages to the environment

* Adding timeout to scheduler and worker processes

* Adding timeout flags in the properties of the class

* Typo in logger.debug

* Minor fixes in start_scheduler.py

* Minor fixes in start_worker.py

* Added import Run in start_worker.py

* Added awaiting for headnode worker to close

* Minor typo fix

* Added argument to start_scheduler.py

* Parsing string to int in scheduler script

* Complete before cancel change

* Removed rank check for MPI jobs since we submit one run per worker anyway

* Add timeout logic

* Update start_worker.py

* Update start_scheduler.py

* Fixing typo in scripts when starting worker

* Making flake8 and black pass

* Add metrics

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update azureml.py

* Update start_scheduler.py

* Test cluster

* Update init

* correct init

* Update complete

* Extend timeout

* Use cancel

* Use ssh key path

* Update forwarding

* update print

* update caller

* Try close

* update close

* test child run

* test child run 2

* test child run3

* test child run 4

* test child run5

* Revert changes to child run

* Refactor

* Update azureml.py

* Update start_scheduler.py

* Delete start.py

* Delete azuremlssh.py

* Update error msg

* Address comments

* Formatting

Co-authored-by: Tom Drabas <todrabas@microsoft.com>
Co-authored-by: FredLiFromMS <51424245+FredLiFromMS@users.noreply.github.com>
Co-authored-by: Tomek Drabas <drabas.t@gmail.com>
Co-authored-by: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com>
Co-authored-by: Cody Peterson <cody.dkdc2@gmail.com>
6 people committed Jun 30, 2020
1 parent a662fe2 commit 84b9d1e
Showing 1 changed file with 15 additions and 10 deletions.
25 changes: 15 additions & 10 deletions dask_cloudprovider/providers/azure/azureml.py
```diff
@@ -393,18 +393,26 @@ async def __create_cluster(self):
         run = exp.submit(estimator, tags=self.tags)
 
         self.__print_message("Waiting for scheduler node's IP")
+        status = run.get_status()
         while (
-            run.get_status() != "Canceled"
-            and run.get_status() != "Failed"
+            status != "Canceled"
+            and status != "Failed"
             and "scheduler" not in run.get_metrics()
         ):
             print(".", end="")
             logger.info("Scheduler not ready")
             time.sleep(5)
+            status = run.get_status()
 
-        if run.get_status() == "Canceled" or run.get_status() == "Failed":
-            logger.exception("Failed to start the AzureML cluster")
-            raise Exception("Failed to start the AzureML cluster.")
+        if status == "Canceled" or status == "Failed":
+            run_error = run.get_details().get("error")
+            error_message = "Failed to start the AzureML cluster."
+
+            if run_error:
+                error_message = "{} {}".format(error_message, run_error)
+
+            logger.exception(error_message)
+            raise Exception(error_message)
 
         print("\n\n")
 
@@ -714,11 +722,8 @@ def update():
         return box
 
     def close_when_disconnect(self):
-        if (
-            self.run.get_status() == "Canceled"
-            or self.run.get_status() == "Completed"
-            or self.run.get_status() == "Failed"
-        ):
+        status = self.run.get_status()
+        if status == "Canceled" or status == "Completed" or status == "Failed":
             self.scale_down(len(self.workers_list))
 
     def scale(self, workers=1):
```
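Both hunks apply the same pattern: call `run.get_status()` once per polling iteration and reuse the cached value instead of issuing several status requests per loop, and when the run ends up `Canceled` or `Failed`, pull the error out of `run.get_details()` so the raised exception says why. The sketch below restates that pattern as a standalone helper; the `wait_for_scheduler` name and the 5-second poll interval are illustrative only, and it assumes an AzureML-style `Run` object exposing the `get_status()`, `get_metrics()`, and `get_details()` methods used in the diff.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_for_scheduler(run, poll_interval=5):
    """Wait until an AzureML-style run publishes a 'scheduler' metric.

    Illustrative helper only: one get_status() request per iteration,
    and the run's error details are attached to the failure message.
    """
    status = run.get_status()  # fetch once, reuse in all checks below
    while (
        status not in ("Canceled", "Failed")
        and "scheduler" not in run.get_metrics()
    ):
        logger.info("Scheduler not ready, polling again in %s seconds", poll_interval)
        time.sleep(poll_interval)
        status = run.get_status()  # refresh the cached status once per loop

    if status in ("Canceled", "Failed"):
        # Surface whatever error the service recorded for the run.
        run_error = run.get_details().get("error")
        error_message = "Failed to start the AzureML cluster."
        if run_error:
            error_message = "{} {}".format(error_message, run_error)
        logger.exception(error_message)
        raise Exception(error_message)

    return run.get_metrics()["scheduler"]
```

In `__create_cluster` above, the metric published under the `scheduler` key is what the loop waits for (the "Waiting for scheduler node's IP" message), so this helper returns it once the run reports it.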
