Add VMCluster support (#129)
* Add VMCluster class and Digital Ocean implementation

* Fix worker startup command

* Working state

* Refactoring

* Use correct images

* Add more config options, docs and a RAPIDS example

* Add Packer support

* Add autoshutdown to EC2Cluster

* Fix init order and test sync cluster

* Skip tests with missing deps

* Flake8

* Start on docs refactor

* More docs refactor

* More documentation

* Add flake8 to precommit hooks

* Fix linting

* Black

* Change worker command to module, add worker/scheduler options and refactor

* Refactor mixins and simplify code

* Refactor EC2Cluster

* Change black version

* Run black 20.8b1

* Print black version

* Force black to 20.8b1

* Shuffle AzureML docs from merge to new location

* Add GPU docs
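Taken together, this commit introduces a generic VMCluster base class with a Digital Ocean Droplet implementation, an EC2Cluster built on the same machinery, plus docs, Packer support, and auto-shutdown. A minimal sketch of the intended usage, assuming a Digital Ocean API token is configured and that the constructor accepts the option names shown in the cloudprovider.yaml diff below (n_workers is assumed as the standard Dask cluster-manager keyword):

from dask.distributed import Client
from dask_cloudprovider import DropletCluster

# Launch a small Dask cluster on Digital Ocean Droplets; option names
# mirror the cloudprovider.digitalocean config keys added below.
cluster = DropletCluster(region="nyc3", size="s-1vcpu-1gb", n_workers=2)
client = Client(cluster)

# ... submit work via the client ...

client.close()
cluster.close()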
jacobtomlinson committed Nov 2, 2020
1 parent 027bc1a commit 9cf4df8
Showing 36 changed files with 2,198 additions and 421 deletions.
3 changes: 2 additions & 1 deletion .circleci/config.yml
@@ -30,7 +30,7 @@ jobs:
             conda update -q conda
             conda env create -f ci/environment-${PYTHON}.yml --name=${ENV_NAME}
             source activate ${ENV_NAME}
-            pip install --no-deps --quiet -e .
+            pip install --no-deps --quiet -e .[all]
           fi
           conda env list
           conda list ${ENV_NAME}
@@ -44,4 +44,5 @@ jobs:
     - run:
         command: |
           /home/circleci/miniconda/envs/dask-cloudprovider-test/bin/flake8 dask_cloudprovider
+          /home/circleci/miniconda/envs/dask-cloudprovider-test/bin/black --version
           /home/circleci/miniconda/envs/dask-cloudprovider-test/bin/black --check dask_cloudprovider setup.py
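Two small but useful CI changes here: the package is now installed with the "all" extra so every provider's optional dependencies are present during tests, and black prints its version before the style check so version mismatches are easy to spot in CI logs. The same install works locally; a one-line sketch, assuming the "all" extra aggregates the per-provider requirements in setup.py:

pip install -e .[all]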
1 change: 1 addition & 0 deletions .gitignore
@@ -69,6 +69,7 @@ instance/
 # Sphinx documentation
 docs/_build/
 doc/_build/
+doc/source/_build/
 
 # PyBuilder
 target/
12 changes: 9 additions & 3 deletions .pre-commit-config.yaml
@@ -1,6 +1,12 @@
 repos:
-  - repo: https://github.com/ambv/black
-    rev: stable
+  - repo: https://github.com/psf/black
+    rev: 20.8b1
     hooks:
       - id: black
-        language_version: python3.7
+        language_version: python3
+        exclude: versioneer.py
+  - repo: https://gitlab.com/pycqa/flake8
+    rev: 3.8.4
+    hooks:
+      - id: flake8
+        language_version: python3
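The hooks now point at black's current repository (psf/black), pin it to 20.8b1 to match CI, and add flake8 as a second hook. Contributors would enable them with the usual pre-commit workflow, for example:

pre-commit install
pre-commit run --all-files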
10 changes: 5 additions & 5 deletions ci/environment-3.7.yml
@@ -10,7 +10,7 @@ dependencies:
   - flake8
   - ipywidgets
   - pytest
-  - black
+  - black >=20.8b1
   - pyyaml
   # dask dependencies
   - cloudpickle
@@ -28,7 +28,7 @@ dependencies:
   - tornado >=5
   - zict >=0.1.3
   - pip:
-    - aiobotocore
-    - git+https://github.com/dask/dask
-    - git+https://github.com/dask/distributed
-    - azureml-sdk >=1.1.5
+      - aiobotocore
+      - git+https://github.com/dask/dask
+      - git+https://github.com/dask/distributed
+      - azureml-sdk >=1.1.5
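The same environment file can be used to reproduce CI locally; a sketch, borrowing the environment name from the CircleCI job above:

conda env create -f ci/environment-3.7.yml --name dask-cloudprovider-test
conda activate dask-cloudprovider-test
pip install --no-deps --quiet -e .[all]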
13 changes: 12 additions & 1 deletion dask_cloudprovider/__init__.py
@@ -2,14 +2,25 @@

 try:
     from .providers.aws.ecs import ECSCluster, FargateCluster
+    from .providers.aws.ec2 import EC2Cluster
 except ImportError:
     pass
 try:
     from .providers.azure.azureml import AzureMLCluster
 except ImportError:
     pass
+try:
+    from .providers.digitalocean.droplet import DropletCluster
+except ImportError:
+    pass
 
-__all__ = ["ECSCluster", "FargateCluster", "AzureMLCluster"]
+__all__ = [
+    "ECSCluster",
+    "EC2Cluster",
+    "FargateCluster",
+    "AzureMLCluster",
+    "DropletCluster",
+]
 
 from ._version import get_versions

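Each provider's import is wrapped in try/except ImportError, so the package still imports cleanly when a provider's optional dependencies are absent, while the available cluster classes are re-exported from the package root. A minimal sketch:

# Requires the matching extras, e.g. pip install dask-cloudprovider[all]
from dask_cloudprovider import EC2Cluster, DropletCluster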
27 changes: 23 additions & 4 deletions dask_cloudprovider/cloudprovider.yaml
@@ -4,11 +4,11 @@ cloudprovider:
     fargate_workers: False # Use fargate mode for the workers
     scheduler_cpu: 1024 # Millicpu (1024ths of a CPU core)
     scheduler_mem: 4096 # Memory in MB
-    # scheduler_extra_args: "--tls-cert,/path/to/cert.pem,--tls-key,/path/to/cert.key,--tls-ca-file,/path/to/ca.key"
+    # scheduler_extra_args: "--tls-cert,/path/to/cert.pem,--tls-key,/path/to/cert.key,--tls-ca-file,/path/to/ca.key"
     worker_cpu: 4096 # Millicpu (1024ths of a CPU core)
     worker_mem: 16384 # Memory in MB
     worker_gpu: 0 # Number of GPUs for each worker
-    # worker_extra_args: "--tls-cert,/path/to/cert.pem,--tls-key,/path/to/cert.key,--tls-ca-file,/path/to/ca.key"
+    # worker_extra_args: "--tls-cert,/path/to/cert.pem,--tls-key,/path/to/cert.key,--tls-ca-file,/path/to/ca.key"
     n_workers: 0 # Number of workers to start the cluster with
     scheduler_timeout: "5 minutes" # Length of inactivity to wait before closing the cluster

@@ -19,7 +19,7 @@ cloudprovider:
     execution_role_arn: "" # Arn of existing execution role to use (if not set one will be created)
     task_role_arn: "" # Arn of existing task role to use (if not set one will be created)
     task_role_policies: [] # List of policy arns to attach to tasks (e.g. S3 read only access)
-    # platform_version: "LATEST" # Fargate platformVersion string like "1.4.0" or "LATEST"
+    # platform_version: "LATEST" # Fargate platformVersion string like "1.4.0" or "LATEST"
 
     cloudwatch_logs_group: "" # Name of existing cloudwatch logs group to use (if not set one will be created)
     cloudwatch_logs_stream_prefix: "{cluster_name}" # Stream prefix template
@@ -33,6 +33,19 @@ cloudprovider:
     environment: {} # Environment variables that are set within a task container
     find_address_timeout: 60 # Configurable timeout in seconds for finding the task IP from the cloudwatch logs.
     skip_cleanup: False # Skip cleaning up of stale resources
+
+  ec2:
+    region: null # AWS region to create the cluster in. Defaults to the environment or account default region.
+    bootstrap: true # It is assumed that the AMI does not have Docker and needs bootstrapping. Set this to false if using a custom AMI with Docker already installed.
+    auto_shutdown: true # Shut down instances automatically if the scheduler or worker services time out.
+    # worker_command: "dask-worker" # The command for workers to run. If the instance_type is a GPU instance, dask-cuda-worker will be used.
+    # ami: "" # AMI ID to use for all instances. Defaults to the latest Ubuntu 20.04 image.
+    instance_type: "t2.micro" # Instance type for all workers
+    # vpc: "" # VPC id for instances to join. Defaults to the default VPC.
+    # subnet_id: "" # Subnet ID for instances to join. Defaults to all subnets in the default VPC.
+    # security_groups: [] # Security groups for instances. A minimal Dask security group will be created by default.
+    filesystem_size: 40 # Default root filesystem size for scheduler and worker VMs in GB
+
   azure:
     experiment_name: "dask-cloudprovider" # default name of the Experiment to submit
     initial_node_count: 1 # Initial node count
@@ -45,5 +58,11 @@ cloudprovider:
     additional_ports: [] # list of tuples of additional ports to map/forward
     admin_username: "" # username to log in to the AzureML Training Cluster for 'local' runs
     admin_ssh_key: "" # SSH key to log in to the AzureML Training Cluster for 'local' runs
-    telemetry_opt_out: False # by default we log the version of the AzureMLCluster being used, set to True to opt-out
+    telemetry_opt_out: False # by default we log the version of the AzureMLCluster being used, set to True to opt-out
     datastores: [] # default list of datastores to mount
+
+  digitalocean:
+    token: null # API token for interacting with the Digital Ocean API
+    region: "nyc3" # Region to launch Droplets in
+    size: "s-1vcpu-1gb" # Droplet size to launch; default is 1 GB RAM, 1 vCPU
+    image: "ubuntu-20-04-x64" # Operating system image to use
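These defaults are layered into Dask's configuration system, so they can be inspected or overridden programmatically as well as in YAML, and constructor keyword arguments take precedence. A hedged sketch, assuming the defaults are registered under the "cloudprovider" namespace as in the file above:

import dask.config

# Falls back to the packaged default ("nyc3") unless overridden
region = dask.config.get("cloudprovider.digitalocean.region")

# Keyword arguments override config values, e.g. for the new EC2 defaults
from dask_cloudprovider import EC2Cluster
cluster = EC2Cluster(instance_type="t2.micro", filesystem_size=40)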
