# Introduction

This notebook is a code spike that uses the `dask-cloudprovider` package to quickly create cloud resources with Dask installed and to run EvalML on them. The primary use case is users who want to try parallelizing our AutoML algorithm but lack the expertise or time to manage cloud resources properly. Accompanying resources are in the `dask` folder of the branch `js_test_dask_cloud`.

# AutoML with Dask locally

In [2]:
from evalml import AutoMLSearch
from evalml.automl.engines import DaskEngine

dask_engine = DaskEngine()

In [3]:
import evalml

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.8)

automl = AutoMLSearch(problem_type="binary", objective="f1", max_batches=2)
automl.search(X_train, y_train, engine=dask_engine)

*****************************
* Beginning pipeline search *
*****************************

Optimizing for F1. 
Greater score is better.

Using Dask Engine to process pipelines.
Searching up to 2 batches for a total of 14 pipelines. 
Allowed model families: linear_model, extra_trees, lightgbm, catboost, decision_tree, xgboost, random_forest



FigureWidget({
    'data': [{'mode': 'lines+markers',
              'name': 'Best Score',
              'type'…

Batch 1: (1/14) Mode Baseline Binary Classification P... Elapsed:00:03
Batch 1: (2/14) Logistic Regression Classifier w/ Imp... Elapsed:00:05
Batch 1: (3/14) CatBoost Classifier w/ Imputer           Elapsed:00:05
Batch 1: (4/14) Elastic Net Classifier w/ Imputer + S... Elapsed:00:09
High coefficient of variation (cv >= 0.2) within cross validation scores. Elastic Net Classifier w/ Imputer + Standard Scaler may not perform as estimated on unseen data.
Batch 1: (5/14) Decision Tree Classifier w/ Imputer      Elapsed:00:09
Batch 1: (6/14) XGBoost Classifier w/ Imputer            Elapsed:00:09
Batch 1: (7/14) Extra Trees Classifier w/ Imputer        Elapsed:00:10
Batch 1: (8/14) LightGBM Classifier w/ Imputer           Elapsed:00:10
Batch 1: (9/14) Random Forest Classifier w/ Imputer      Elapsed:00:10
Batch 2: (10/14) CatBoost Classifier w/ Imputer           Elapsed:00:13
Batch 2: (11/14) CatBoost Classifier w/ Imputer           Elapsed:00:13
Batch 2: (12/14) CatBoost Classifier w/ Impute

# AWS Dask cluster using `dask-cloudprovider`

Since we have MFA set up on our users, I'm going to use the looking_glass AWS user. If you want the credentials, ask Jeremy Shih.

1. Set up credentials locally in `~/.aws/credentials`
2. Install `dask/dask-requirements.txt`
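
Before creating a cluster, it can help to confirm that the credentials file actually contains the profile you expect. The following is a minimal sketch using only the standard library; the helper name `has_aws_profile` is hypothetical, and it assumes the usual INI layout with `aws_access_key_id` and `aws_secret_access_key` keys.

```python
import configparser
import os


def has_aws_profile(profile, path="~/.aws/credentials"):
    """Return True if `profile` exists in the file and defines both access keys."""
    parser = configparser.RawConfigParser()
    # parser.read() silently skips files it cannot open, so a missing
    # file simply results in no sections being found.
    parser.read(os.path.expanduser(path))
    if not parser.has_section(profile):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option(profile, key) for key in required)
```

For example, `has_aws_profile("perf-test")` would check the profile that the EC2 cells later in this notebook read from. This only checks that the keys are present, not that AWS accepts them.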

## ECS-Fargate

In [1]:
from dask_cloudprovider.aws import FargateCluster
cluster = FargateCluster()



In [2]:
import time

cluster.scale(2)
time.sleep(300)

In [3]:
from evalml import AutoMLSearch
from evalml.automl.engines import DaskEngine
from dask.distributed import Client
import time 

dask_client = Client(cluster)
dask_engine = DaskEngine(dask_client=dask_client)

OSError: Timed out trying to connect to tcp://54.161.246.27:8786 after 10 s
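
The timeout above means the client could not open a TCP connection to the scheduler's port. I did not confirm the cause, but a fixed `time.sleep(300)` cannot tell "not ready yet" apart from "not reachable at all". A minimal sketch of polling the port instead, using only the standard library (the helper name `wait_for_port` and its defaults are assumptions for illustration):

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (including timeouts) when the
            # port is refused, filtered, or otherwise unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

If `wait_for_port("54.161.246.27", 8786)` never returns `True`, the problem is more likely networking (for example, a security group or a non-public address) than startup time. A reachable port does not by itself prove the scheduler is healthy.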

In [None]:
import evalml

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.8)

automl = AutoMLSearch(problem_type="binary", objective="f1", max_batches=2)
automl.search(X_train, y_train, engine=dask_engine)

### With Docker image

In [9]:
from dask_cloudprovider.aws import FargateCluster
cluster = FargateCluster(image='321459187557.dkr.ecr.us-east-1.amazonaws.com/evalml_dask')

RuntimeError: Scheduler exited unexpectedly!

In [None]:
import time

cluster.scale(2)
time.sleep(300)

In [None]:
from evalml import AutoMLSearch
from evalml.automl.engines import DaskEngine
from dask.distributed import Client
import time 

dask_client = Client(cluster)
dask_engine = DaskEngine(dask_client=dask_client)

In [None]:
import evalml

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.8)

automl = AutoMLSearch(problem_type="binary", objective="f1", max_batches=2)
automl.search(X_train, y_train, engine=dask_engine)

## EC2

Things I tried:

- [x] using looking glass docker image
- [x] using new docker image based off of default docker image
- [x] default image doesn't work as it doesn't have evalml dependencies
- [x] add entry point back to default image
- [x] Copy their docker file and just add evalml

With default docker image:

In [4]:
from dask_cloudprovider.aws import EC2Cluster
import configparser
import os
import contextlib

def get_aws_credentials():
    parser = configparser.RawConfigParser()
    parser.read(os.path.expanduser('~/.aws/config'))
    config = parser.items('default')
    parser.read(os.path.expanduser('~/.aws/credentials'))
    credentials = parser.items('perf-test')
    all_credentials = {key.upper(): value for key, value in [*config, *credentials]}
    # Rename REGION to AWS_REGION if present; return the credentials either way.
    with contextlib.suppress(KeyError):
        all_credentials["AWS_REGION"] = all_credentials.pop("REGION")
    return all_credentials

cluster = EC2Cluster(env_vars=get_aws_credentials())

Creating scheduler instance
Created instance i-0c78516c69ff89723 as dask-0f1bdcd7-scheduler
Waiting for scheduler to run
Scheduler is running


Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight! 


In [5]:
import time

cluster.scale(2)
time.sleep(300)

Creating worker instance
Creating worker instance
Created instance i-0721cb9175c6186cc as dask-0f1bdcd7-worker-65eaed9f
Created instance i-07656b4b5994a0dd3 as dask-0f1bdcd7-worker-ce76012b


In [6]:
from evalml import AutoMLSearch
from evalml.automl.engines import DaskEngine
from dask.distributed import Client
import time 

dask_client = Client(cluster)
dask_engine = DaskEngine(dask_client=dask_client)

Mismatched versions found

+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| numpy   | 1.19.4        | 1.18.1        | 1.18.1        |
| python  | 3.7.4.final.0 | 3.8.0.final.0 | 3.8.0.final.0 |
+---------+---------------+---------------+---------------+
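
The warning above is the kind of mismatch worth catching before running a search, since differing Python or NumPy versions between client and workers can break serialization. A minimal sketch of the comparison, assuming the versions have already been collected into a simple `{role: {package: version}}` dict (this shape is a simplification for illustration, not the format `distributed` uses internally):

```python
def find_version_mismatches(versions):
    """Return {package: {role: version}} for packages whose versions differ.

    `versions` maps a role name (e.g. "client") to a {package: version} dict.
    A package missing from one role counts as a mismatch (its version is None).
    """
    packages = set()
    for role_versions in versions.values():
        packages.update(role_versions)

    mismatches = {}
    for package in sorted(packages):
        by_role = {role: pkgs.get(package) for role, pkgs in versions.items()}
        if len(set(by_role.values())) > 1:
            mismatches[package] = by_role
    return mismatches


# Values copied from the warning table above.
reported = {
    "client": {"numpy": "1.19.4", "python": "3.7.4.final.0"},
    "scheduler": {"numpy": "1.18.1", "python": "3.8.0.final.0"},
    "workers": {"numpy": "1.18.1", "python": "3.8.0.final.0"},
}
print(sorted(find_version_mismatches(reported)))  # ['numpy', 'python']
```

In practice, the fix is to build the worker image from the same Python and package versions as the client environment.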


In [7]:
import evalml

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.8)

automl = AutoMLSearch(problem_type="binary", objective="f1", max_batches=2)
automl.search(X_train, y_train, engine=dask_engine)

*****************************
* Beginning pipeline search *
*****************************

Optimizing for F1. 
Greater score is better.

Using Dask Engine to process pipelines.
Searching up to 2 batches for a total of 14 pipelines. 
Allowed model families: extra_trees, lightgbm, catboost, decision_tree, linear_model, xgboost, random_forest



FigureWidget({
    'data': [{'mode': 'lines+markers',
              'name': 'Best Score',
              'type'…

Exception: No module named 'woodwork'
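
The `woodwork` error is raised because the default image does not include EvalML's dependencies. A minimal sketch for checking which modules are missing in a given environment; the helper name `missing_modules` is hypothetical, and it only handles top-level module names:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Run locally to check the client environment.
print(missing_modules(["evalml", "woodwork"]))

# Assumption: with a connected client, the same check could be run on every
# worker with `Client.run`, which executes a function on all workers:
# dask_client.run(missing_modules, ["evalml", "woodwork"])
```

Running this on the workers before starting the search would surface the missing dependency up front rather than partway through `automl.search`.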

With the Docker image that includes EvalML:

In [8]:
from dask_cloudprovider.aws import EC2Cluster
import configparser
import os
import contextlib

def get_aws_credentials():
    parser = configparser.RawConfigParser()
    parser.read(os.path.expanduser('~/.aws/config'))
    config = parser.items('default')
    parser.read(os.path.expanduser('~/.aws/credentials'))
    credentials = parser.items('perf-test')
    all_credentials = {key.upper(): value for key, value in [*config, *credentials]}
    # Rename REGION to AWS_REGION if present; return the credentials either way.
    with contextlib.suppress(KeyError):
        all_credentials["AWS_REGION"] = all_credentials.pop("REGION")
    return all_credentials

cluster = EC2Cluster(env_vars=get_aws_credentials(), docker_image='321459187557.dkr.ecr.us-east-1.amazonaws.com/evalml_dask')

Creating scheduler instance
Created instance i-0cf6699647339e1c6 as dask-c1e7775c-scheduler
Waiting for scheduler to run


KeyboardInterrupt: 

In [None]:
cluster.scale(2)

In [None]:
from evalml import AutoMLSearch
from evalml.automl.engines import DaskEngine
from dask.distributed import Client
import time 

dask_client = Client(cluster)
dask_engine = DaskEngine(dask_client=dask_client)

In [None]:
import evalml

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.8)

automl = AutoMLSearch(problem_type="binary", objective="f1", max_batches=2)
automl.search(X_train, y_train, engine=dask_engine)

## ECS

In [None]:
from dask_cloudprovider.aws import ECSCluster

cluster = ECSCluster(cluster_arn="arn:aws:ecs:us-east-1:321459187557:cluster/dask-cluster")

Same results as Fargate.

# Conclusions

As long as you have the proper permissions, the `dask_cloudprovider` package makes it easy to create cloud resources with Dask on them. This notebook explores various AWS resources, but the approach can be extended to GCP and Azure. However, once we began to customize the images deployed on the Dask scheduler and workers, we hit major problems. The package's documentation and extensibility are not yet mature enough to make customization practical. I had the most success with the EC2 cluster, as the ECS clusters gave me AWS throttling and various networking issues that I had to work around. In the end, I was still not able to load an image with EvalML and get it working.

I would recommend that users who want to take advantage of parallelization set up and manage their own cloud resources, as this gives them full control over their resource environment, dependencies, etc. This package might work for simple use cases and could improve in the future, but it does not seem like a good option now.