Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add max retry for socket connection (#109)
* Working environments. Will have to check if adding conda and pip packages works as expected * Working cluster setup using Environments * Fixing a bug in print_links_ComputeVM * Splitting scheduler and worker startup scripts * clean up * Scale up/down code * Logic change to get the scheduler IP and port * Added closing the cluster method and some tidying up * Working scaling up/down and closing the cluster. Test revealed data not available to worker * Major API refactor started * Started adding unittest * Update azureml.py Show "Complete" status for canceled runs. * Update azureml.py Add comment to explain why we need complete() * Minor changes to the public API * Fixed bug in the testing suite * First commit for deriving from dask.distributed.deploy.cluster.Cluster * Moving to testing RPC comms * Handling API change in asyncio between Python 3.6 and 3.7 * Added async calls to make compliant with dask.distributed.deploy.cluster.Cluster * Working fork from the Cluster class. Also, scaling up and down * Added support for widget so it prints proper links * Added custom widget information * Fix some bugs Fix some bugs * Update start_scheduler.py * Make env object required Make env object required * Update start_scheduler.py * Namespace change * Changing namespace to providers/azure instead of providers/azureml * Pass datastores list * Update start_scheduler.py * Handle failed cluster creating raise exception if cluster creating fails * Mounting of datastores works as well as defaulting to blobstore in case no code_store is provided * Moved default config to YAML * Moving to YAML config * Removing unnecessary dask.config pring * Fixing def close bug * cjhange default experiment_name * change default of Jupyter to False * first pass at documenting `AzureMLCluster` (wip) wip * minor edits * capitalize 'Compute Target' * minor edit to datastore section * add rest of method descriptions - all other methods shld be private * add example close cluster * remove line * remote duplicate docstring * add periods, optimize for markdown formatting * missed one * Changed return of __get_estimator to raise NotImplementedError * Working testing whether AzureMLCluster on the same VNET as the scheduler * Added used in SSH-tunnel (runs not on the same VNET) * Group all "worker" runs as "child" run Group all "worker" runs as "child" run * Update start_worker.py * Update azureml.py * Add documents for scale api Add documents for scale api; Make scale_up and scale_down as private * Fix style to meet flake8 formatting changes * Working SSH-tunnel * Widget updates * Changing Cores -> vCPUs * Update azureml.py * Make azureml.py after changes flake8 passing * Fixed missing parentheses * fix file path bug wow PM fixing a bug * Fixed the relative path to setup folder bug * Changed socket.timeout to (ConnectionRefusedError, socket.timeout) * Update azureml.py * fix 'passwordless' typo * Fixed the ConnectionRefusedError bug * Fixing hardcoded 'codes' name for the code_mount_name * Fixing somehow missing imports * Added additional_ports parameter * Minor cleanup * Update azureml.py * GPU examples working. Added mounting default store as notebook_dir * Reverting changes in start_worker.py * Fixed the bug in the worker startup script. General cleanup * Notebook_dir to mounts * Fixed the bug in widget * cosmetic change on cluster widget - '/' to '()' * fix datastores doc example issue * whoops * Fixing bug that was not printing the memory correctly for GPUs * Removing the use_gpu flags * Removing commented lines in the config file * More debugging messages in start_worker.py * Fix missing ip in start_worker.py * remove code_store from azureml.py * remove code_store from start_scheduler, rename to mount_point when needed * remove codestore from cloudprovider.yaml * remove print_links_ComputeVM method * Fixed n_gpus_per_node print in start_worker.py * Added additional exception handling for socket.gaierror * Updating requirements * Updating CI reqs * Update tests for AzureMLCluster * Major cleanup and making the code flake8 and black passing * Minor changes to pass black * Added logging * Updating requirements.txt * Updated logging in the azureml.py * Typo fix * Another typo fix * Changing log to info for ConnectionRefusedError * Making all files passing flake8 and black * Minor changes to .gitignore * Minor updates to docstrings * Updated documentation to include AzureMLCluster * Address comments * Update url * Update vm in example * Suggest change * Added VNET setup * Fixed env definition in docs * Removed adding packages to the environment * Adding timeout to scheduler and worker processes * Adding timeout flags in the properties of the class * Typo in logger.debug * Minor fixes in start_scheduler.py * Minor fixes in start_worker.py * Added import Run in start_worker.py * Added awaiting for headnode worker to close * Minor typo fix * Added argument to start_scheduler.py * Parsing string to int in scheduler script * Complete before cancel change * Removed rank check for MPI jobs since we submit one run per worker anyway * Add timeout logic Add timeout logic * Update start_worker.py * Update start_scheduler.py * Fixing typo in scripts when starting worker * Making flake8 and black pass * Add metrics * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update azureml.py * Update start_scheduler.py * Test cluster * Update init * correct init * Update complete * Extend timeout * Use cancel * Use ssh key path * Update forwarding * update print * update caller * Try close * update close * test child run * test child run 2 * test child run3 * test child run 4 * test child run5 * Rever changes to child run * Refactor * Update azureml.py * Update start_scheduler.py * Delete start.py * Delete azuremlssh.py * Update error msg * Address comments * Formating * Fix: cluster creation stuck on starting worker * Cancel the run after max retry * Add return * verbose * Only use error message * Catch expected exception Co-authored-by: Tom Drabas <todrabas@microsoft.com> Co-authored-by: FredLiFromMS <51424245+FredLiFromMS@users.noreply.github.com> Co-authored-by: Tomek Drabas <drabas.t@gmail.com> Co-authored-by: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com> Co-authored-by: Cody Peterson <cody.dkdc2@gmail.com>
- Loading branch information