Commit
Clean up compute target in case the cluster creation fails (#123)
* Moved default config to YAML
* Moving to YAML config
* Removing unnecessary dask.config pring
* Fixing def close bug
* cjhange default experiment_name
* change default of Jupyter to False
* first pass at documenting `AzureMLCluster` (wip)
  wip
* minor edits
* capitalize 'Compute Target'
* minor edit to datastore section
* add rest of method descriptions - all other methods shld be private
* add example close cluster
* remove line
* remote duplicate docstring
* add periods, optimize for markdown formatting
* missed one
* Changed return of __get_estimator to raise NotImplementedError
* Working testing whether AzureMLCluster on the same VNET as the scheduler
* Added used in SSH-tunnel (runs not on the same VNET)
* Group all "worker" runs as "child" run
  Group all "worker" runs as "child" run
* Update start_worker.py
* Update azureml.py
* Add documents for scale api
  Add documents for scale api; Make scale_up and scale_down as private
* Fix style to meet flake8 formatting changes
* Working SSH-tunnel
* Widget updates
* Changing Cores -> vCPUs
* Update azureml.py
* Make azureml.py after changes flake8 passing
* Fixed missing parentheses
* fix file path bug
  wow PM fixing a bug
* Fixed the relative path to setup folder bug
* Changed socket.timeout to (ConnectionRefusedError, socket.timeout)
* Update azureml.py
* fix 'passwordless' typo
* Fixed the ConnectionRefusedError bug
* Fixing hardcoded 'codes' name for the code_mount_name
* Fixing somehow missing imports
* Added additional_ports parameter
* Minor cleanup
* Update azureml.py
* GPU examples working. Added mounting default store as notebook_dir
* Reverting changes in start_worker.py
* Fixed the bug in the worker startup script. General cleanup
* Notebook_dir to mounts
* Fixed the bug in widget
* cosmetic change on cluster widget - '/' to '()'
* fix datastores doc example issue
* whoops
* Fixing bug that was not printing the memory correctly for GPUs
* Removing the use_gpu flags
* Removing commented lines in the config file
* More debugging messages in start_worker.py
* Fix missing ip in start_worker.py
* remove code_store from azureml.py
* remove code_store from start_scheduler, rename to mount_point when needed
* remove codestore from cloudprovider.yaml
* remove print_links_ComputeVM method
* Fixed n_gpus_per_node print in start_worker.py
* Added additional exception handling for socket.gaierror
* Updating requirements
* Updating CI reqs
* Update tests for AzureMLCluster
* Major cleanup and making the code flake8 and black passing
* Minor changes to pass black
* Added logging
* Updating requirements.txt
* Updated logging in the azureml.py
* Typo fix
* Another typo fix
* Changing log to info for ConnectionRefusedError
* Making all files passing flake8 and black
* Minor changes to .gitignore
* Minor updates to docstrings
* Updated documentation to include AzureMLCluster
* Address comments
* Update url
* Update vm in example
* Suggest change
* Added VNET setup
* Fixed env definition in docs
* Removed adding packages to the environment
* Adding timeout to scheduler and worker processes
* Adding timeout flags in the properties of the class
* Typo in logger.debug
* Minor fixes in start_scheduler.py
* Minor fixes in start_worker.py
* Added import Run in start_worker.py
* Added awaiting for headnode worker to close
* Minor typo fix
* Added argument to start_scheduler.py
* Parsing string to int in scheduler script
* Complete before cancel change
* Removed rank check for MPI jobs since we submit one run per worker anyway
* Add timeout logic
  Add timeout logic
* Update start_worker.py
* Update start_scheduler.py
* Fixing typo in scripts when starting worker
* Making flake8 and black pass
* Add metrics
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update azureml.py
* Update start_scheduler.py
* Test cluster
* Update init
* correct init
* Update complete
* Extend timeout
* Use cancel
* Use ssh key path
* Update forwarding
* update print
* update caller
* Try close
* update close
* test child run
* test child run 2
* test child run3
* test child run 4
* test child run5
* Rever changes to child run
* Refactor
* Update azureml.py
* Update start_scheduler.py
* Delete start.py
* Delete azuremlssh.py
* Update error msg
* Address comments
* Formating
* Fix: cluster creation stuck on starting worker
* Cancel the run after max retry
* Add return
* verbose
* Only use error message
* Create compute if not provided
* Make ct optional
* Update vnet check logic
* Catch expected exception
* Refactor and using kwargs
* Resolve bytes issue
* Fix timeout issue
* Delete auto created compute target after close
* Add check for compute instance
* Update the check for CI
* Clean up previous port forwarding
* Set up portforwarding to CI
* Local portforwarding
* Update portforwarding
* Enable dashboard in CI
* Remove retry message
* Refactor rpc connection
* Check if on the same vnet
* Set time delay for port open
* Remove dummy message
* update documentation for compute cluster creation on behalf of user
* finish sentences
* capitalize Studio
* add links to AP, ML Studio
* add initial_node_count to common kwargs
* remove quota line
* capitalize Workspace
* jupyter defaults to True
* datastores, reordering
* Remove ct after connection failure
* address jacob's comments for Jupyter boolean and Python example
* Address comment
* Clean up compute in case of creation failure
* Check ssh key and admin user name
* Fix docs
* Add jupyter optional

Co-authored-by: Tomek Drabas <drabas.t@gmail.com>
Co-authored-by: Tom Drabas <todrabas@microsoft.com>
Co-authored-by: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com>
Co-authored-by: Fred Li <51424245+FredLiFromMS@users.noreply.github.com>
Co-authored-by: Cody Peterson <cody.dkdc2@gmail.com>