# Running Dask on AzureML

This notebook shows how to run a Dask cluster on an AzureML Compute cluster. 
For instructions on setting up your Python environment, please see the [Readme](../README.md).

## Starting the cluster

In [330]:
from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails
from azureml.core.runconfig import MpiConfiguration
from azureml.core import VERSION
import uuid
import time
VERSION

# some helper to generate URLs later
class DaskURLs:
    def __init__(self, bokeh_port, jupyter_port, jupyter_token):
        self.bokeh_port = bokeh_port
        self.jupyter_port = jupyter_port
        self.jupyter_token = jupyter_token
        
    def _repr_javascript_(self):
        return f'''
        var hostname = window.location.hostname
        var dot = hostname.indexOf('.')
        var first = hostname.substr(0, dot)
        var last = hostname.substr(dot)
        var bokeh = 'https://' + first +'-{self.bokeh_port}'+ last
        var jupyter = 'https://' + first +'-{self.jupyter_port}'+ last+'?token={self.jupyter_token}'
        element.html(`
            Bokeh: <a href=`+bokeh+` target='bokeh'>`+bokeh+`</a><br>
            Jupyter: <a href=`+jupyter+` target='jupyter'>`+jupyter+`</a><br>`)
        '''
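
The JavaScript in `_repr_javascript_` builds each link by inserting `-<port>` right after the first label of the notebook's hostname. The same rule is sketched below in plain Python so it can be checked outside the browser. The function name `forwarded_url` and the example hostname are made up for illustration; the sketch assumes the hostname contains at least one dot.

```python
def forwarded_url(hostname, port, token=None):
    """Insert '-<port>' after the first label of the hostname."""
    # Split 'machine.rest.of.domain' into 'machine', '.', 'rest.of.domain'
    first, dot, rest = hostname.partition('.')
    url = f'https://{first}-{port}{dot}{rest}'
    if token is not None:
        # The Jupyter link carries the login token as a query parameter
        url += f'?token={token}'
    return url


# Hypothetical Notebook VM hostname, for illustration only
example_host = 'myvm.westeurope.notebooks.azureml.net'
print(forwarded_url(example_host, 8788))
print(forwarded_url(example_host, 9999, token='abc123'))
```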

First, get the workspace and the AML compute cluster, then start the Dask cluster on it. The code below assumes that you have created a compute cluster named `dask-DS12-V2`; otherwise, change the name below accordingly. **When you created the cluster, you must have provided a username (I am using `daskuser`) and a password, and optionally an SSH key, since you will need to log in to the cluster nodes to establish the port forwarding to the Docker container.**

![create_cluster](../img/create_cluster.png)

In [331]:
ws = Workspace.from_config()

In [332]:
# VM size D12 v2: 4 vCPUs, 28 GiB RAM, 200 GiB temp storage, $0.379/hour
dask_cluster = ws.compute_targets['dask-DS12-V2']

Start the Dask cluster using an `Estimator` with `MpiConfiguration`. Make sure the cluster can scale up to 10 nodes, or change `node_count` below.

In [333]:
est = Estimator('dask', 
                compute_target=dask_cluster, 
                entry_script='startDask.py', 
                conda_dependencies_file='environment.yml', 
                script_params={'--datastore': ws.get_default_datastore()},
                node_count=10,
                distributed_training=MpiConfiguration())

run = Experiment(ws, 'dask').submit(est)

In [334]:
RunDetails(run).show()

In [335]:
from IPython.display import clear_output

print("waiting for scheduler node's ip")
while 'headnode' not in run.get_metrics():
    print('.', end="")
    time.sleep(5)

clear_output()
headnode_private_ip = run.get_metrics()['headnode']
print('Headnode has IP:', headnode_private_ip)

Headnode has IP: 10.0.0.6
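
The polling loop above waits indefinitely if the run fails before it reports the `headnode` metric. A minimal sketch of a variant with a timeout follows. `wait_for_metric` and its parameters are hypothetical names, not part of the AzureML SDK; the helper only assumes it is given a callable that returns a dict, such as `run.get_metrics`.

```python
import time


def wait_for_metric(get_metrics, key, timeout=600, poll_interval=5):
    """Poll get_metrics() until `key` appears, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        metrics = get_metrics()
        if key in metrics:
            return metrics[key]
        if time.monotonic() >= deadline:
            raise TimeoutError(f'metric {key!r} not reported within {timeout} s')
        time.sleep(poll_interval)


# Intended use in this notebook (assumes `run` from the cells above):
# headnode_private_ip = wait_for_metric(run.get_metrics, 'headnode')
```

The same helper would also cover the later wait for the `jupyter-token` metric.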


In [336]:
# let's find the public IP and ssh port of the head node

headnode_public_ip = None
headnode_ssh_port = None
for node in dask_cluster.list_nodes():
    if node['privateIpAddress'] == headnode_private_ip:
        headnode_public_ip = node['publicIpAddress']
        headnode_ssh_port = node['port']
        break
        
if headnode_public_ip is None:
    print('Headnode not found in cluster')
else:
    print(f'Headnode is at {headnode_public_ip}:{headnode_ssh_port}')

Headnode is at 51.124.89.208:50002
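
The lookup above is a linear search over the node records for a matching private IP. As a sketch, the same search can be written with `next()` and a default value. The helper `find_node` and the sample records are made up for illustration; only the dictionary keys are taken from the cell above.

```python
def find_node(nodes, private_ip):
    """Return the first node dict whose private IP matches, or None."""
    return next(
        (node for node in nodes if node.get('privateIpAddress') == private_ip),
        None,
    )


# Made-up node records using the same keys as the cell above
sample_nodes = [
    {'privateIpAddress': '10.0.0.5', 'publicIpAddress': '203.0.113.10', 'port': 50001},
    {'privateIpAddress': '10.0.0.6', 'publicIpAddress': '203.0.113.10', 'port': 50002},
]

head = find_node(sample_nodes, '10.0.0.6')
print(head['publicIpAddress'], head['port'])
```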


## Establish port forwarding from the Notebook VM to the Dask scheduler
Since the Notebook VM does not yet support VNets, you need to set up port forwarding through an SSH login.

In the prior cell, we looked up the public IP and SSH port of the cluster's head node.

Now open a terminal on the Notebook VM and run the command that the following cell prints:


In [337]:
print(f'ssh daskuser@{headnode_public_ip} -p {headnode_ssh_port} -L 8786:localhost:8786 -L 8788:{headnode_private_ip}:8787 -L 9999:localhost:8888')


ssh daskuser@51.124.89.208 -p 50002 -L 8786:localhost:8786 -L 8788:10.0.0.6:8787 -L 9999:localhost:8888


Make sure to leave the terminal tab open to keep the port forwarding running.

As you can see, you are forwarding three ports:

1. 8786 is for the scheduler and will be used to connect the client to the cluster.
2. 8788 is for the Bokeh app that shows the activity on the cluster (the head node's port 8787 is mapped to local port 8788 to avoid a conflict with the RStudio Server running on the Notebook VM).
3. 9999 is for a Jupyter instance running on the head node. You can connect to the scheduler from the Jupyter instance running on your Notebook VM or from this Jupyter instance on the head node.
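
The three forwards listed above can also be kept as a small table and turned into the `ssh` command, which makes it easier to add or remap a port. This is only a sketch: `build_ssh_command` is a hypothetical helper, and the IP addresses and ports are the example values printed earlier in this notebook.

```python
def build_ssh_command(user, host, ssh_port, forwards):
    """Build an ssh command with one -L option per forward.

    forwards: iterable of (local_port, target_host, target_port) tuples.
    """
    parts = [f'ssh {user}@{host} -p {ssh_port}']
    parts += [f'-L {local}:{target}:{port}' for local, target, port in forwards]
    return ' '.join(parts)


forwards = [
    (8786, 'localhost', 8786),  # Dask scheduler
    (8788, '10.0.0.6', 8787),   # Bokeh app, remapped to local port 8788
    (9999, 'localhost', 8888),  # Jupyter on the head node
]

cmd = build_ssh_command('daskuser', '51.124.89.208', 50002, forwards)
print(cmd)
```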

To access the Bokeh app, modify the URL of your Notebook VM by adding `-8788` right after the machine name. If you are running this notebook on a Notebook VM, you can create the URLs by executing the next cell:

In [338]:
print("waiting for jupyter token")
while 'jupyter-token' not in run.get_metrics():
    print('.', end="")
    time.sleep(5)

# this will only work when running on a Notebook VM
DaskURLs('8788', '9999', run.get_metrics()['jupyter-token'])

waiting for jupyter token


<__main__.DaskURLs at 0x7fda0d7428d0>

If everything works, you should see the following after clicking the Bokeh link and selecting 'Status':

![Bokeh](../img/bokeh.png)

If you are wondering what all this port forwarding accomplishes, the diagram below illustrates who talks to whom and how.

![Network](../img/network.png)

## Run some jobs on the cluster
If you can see the Bokeh app, it is time to use the cluster. Thanks to the port forwarding, the scheduler is reachable from the Notebook VM at `tcp://localhost:8786`. You should see 10 workers.

In [None]:
from dask.distributed import Client

c = Client('tcp://localhost:8786')
c.restart()
c
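
If the client cannot connect, a quick first check is whether anything is listening on the forwarded port at all. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name. It only tests that a TCP connection can be opened, not that a Dask scheduler is answering on the other end.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# 8786 is the local end of the scheduler forward set up above
print('scheduler port reachable:', port_is_open('localhost', 8786))
```

If this prints `False`, the SSH session with the port forwards has most likely been closed.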

Check that the cluster works:

In [None]:
import time
import numpy as np
from dask import delayed, visualize

def inc(x):
    time.sleep(abs(np.random.normal(5, 2)))
    return x + 1

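# Because inc is wrapped in delayed, each future resolves to a Delayed object;
# the actual work runs later, when .compute() is called on the combined result.
fut = [c.submit(delayed(inc), i) for i in range(10)]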

fut

In [None]:
for i in fut:
    print(i.result())

In [None]:
# Named sum_all to avoid shadowing the built-in sum
def sum_all(a):
    x = 0
    for y in a:
        x += y
    return x

results = [f.result() for f in fut]

fut2 = c.submit(sum_all, results)
fut2

In [None]:
fut2.result().compute()

In [None]:
visualize(fut2.result())

# Training on Large Datasets
(from https://github.com/dask/dask-tutorial)

Sometimes you'll want to train on a larger-than-memory dataset. `dask-ml` provides estimators that work well on Dask arrays and dataframes that may be larger than your machine's RAM.

In [None]:
from dask.distributed import Client
import joblib
import dask.array as da
import dask.delayed
from sklearn.datasets import make_blobs
import numpy as np

We'll make a small (random) dataset locally using scikit-learn.

In [None]:
n_centers = 12
n_features = 20

X_small, y_small = make_blobs(n_samples=1000, centers=n_centers, n_features=n_features, random_state=0)

centers = np.zeros((n_centers, n_features))

for i in range(n_centers):
    centers[i] = X_small[y_small == i].mean(0)
    
centers[:4]

The small dataset will be the template for our large random dataset.
We'll use `dask.delayed` to adapt `sklearn.datasets.make_blobs`, so that the actual dataset is being generated on our workers. 

In [None]:
n_samples_per_block = 200000
n_blocks = 500

delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,
                                     centers=centers,
                                     n_features=n_features,
                                     random_state=i)[0]
            for i in range(n_blocks)]
arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype='float64')
          for obj in delayeds]
X = da.concatenate(arrays)
X

In [None]:
# Check the size of the array
X.nbytes / 1e9
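
The size reported by `X.nbytes` can be cross-checked by hand from the parameters above: rows times features times 8 bytes per `float64` value. The short calculation below does this with plain Python and needs no cluster; the variable names `bytes_per_float64`, `total_rows`, `size_gb` and `block_mb` are introduced here for illustration.

```python
n_samples_per_block = 200_000
n_blocks = 500
n_features = 20
bytes_per_float64 = 8  # dtype='float64' as used in da.from_delayed above

total_rows = n_samples_per_block * n_blocks
total_bytes = total_rows * n_features * bytes_per_float64

size_gb = total_bytes / 1e9
block_mb = n_samples_per_block * n_features * bytes_per_float64 / 1e6

print(f'{total_rows:,} rows, {size_gb} GB in total, {block_mb} MB per block')
```

This works out to roughly 16 GB spread over 500 blocks of 32 MB each, which should fit into the combined memory of the 10 nodes with 28 GiB each, assuming most of that memory is available to the Dask workers.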

In [None]:
# Only run this on the cluster.
X = X.persist()  

The algorithms implemented in Dask-ML are scalable. They handle larger-than-memory datasets just fine.

They follow the scikit-learn API, so if you're familiar with scikit-learn, you'll feel at home with Dask-ML.

In [None]:
from dask_ml.cluster import KMeans
clf = KMeans(init_max_iter=3, oversampling_factor=10)

In [None]:
%time clf.fit(X)

In [None]:
clf.labels_

In [None]:
clf.labels_[:10].compute()

## Shut cluster down
To shut the cluster down, cancel the job that runs the cluster. 

In [340]:
for run in ws.experiments['dask'].get_runs():
    if run.get_status() == "Running":
        print(f'cancelling run {run.id}')
        run.cancel()

### Just for convenience, get the latest running Run

In [341]:
for run in ws.experiments['dask'].get_runs():
    if run.get_status() == "Running":
        print(f'latest running run is {run.id}')
        break