# Dask in Action

In this BE we'll deploy Dask in two different ways: on an HPC cluster, and on GCP using Kubernetes. You'll then use a Dask cluster to run a hyperparameter search, which will be the basis for your grade. This notebook is for presentation only; you will write and hand in your own notebook, based on whichever Dask installation you choose.

## Outline

+ [Dask on Rainman](#rainman)
+ [Dask on GCP with Kubernetes](#kubernetes)
+ [Testing Dask](#testing)
+ [Hyperparameter Optimization](#hyperparam)

## <a id="rainman">Dask on Rainman</a>

The first deployment we'll look at is on Rainman, the High-Performance Computing cluster for students at ISAE. It is available at `rainman.isae.fr`, but only from internal networks. The architecture looks like this:

![rainman](img/rainman.png)

Source: [Rainman hardware architecture](https://icampus.isae-supaero.fr/rainman-architecture-materielle-du-supercalculateur-etudiant?lang=fr)

To connect to rainman, run:
```bash
local$ ssh USER@rainman.isae.fr
```
replacing `USER` with your student username. You must be on the internal network.

`dask` is already installed on rainman, so to use it, do:
```bash
rainman$ module load python/3.7
rainman$ source activate dask
rainman$ jupyter-notebook --no-browser --port=PORT --ip=IP
```

For `PORT`, pick a number between 8000 and 9999; if the server fails to start because the port is already in use, try a different one. For `IP`, use `10.162.9.3` if you are connected to `rainman1` and `10.162.9.4` if you are connected to `rainman2`. You should now be able to access the Jupyter server from your machine at http://IP:PORT/.
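Rather than guessing a port, you can ask the operating system which ones are free. The following is a minimal sketch using only the Python standard library; `find_free_port` is a hypothetical helper, not part of Jupyter or Dask. It checks the loopback address by default, so on rainman you would run it on the login node and pass the node's `IP` as `host`.

```python
import socket


def find_free_port(host="127.0.0.1", start=8000, stop=9999):
    """Return the first port in [start, stop] that can be bound on `host`."""
    for port in range(start, stop + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use (or cannot be bound): try the next one
            return port
    raise RuntimeError(f"no free port between {start} and {stop}")


port = find_free_port()
print(port)
```

The port is released again as soon as the check finishes, so another user could still take it before you start Jupyter. If that happens, run the check again.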

Once the Jupyter server is working, we can use the Dask [SLURMCluster](https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html#dask_jobqueue.SLURMCluster) class to submit batch jobs, each of which starts Dask workers.

In [None]:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=16,
                       memory="60GB",
                       walltime="01:00:00",
                       dashboard_address=":8787", # dashboard port, distinct from the Jupyter PORT; change it if Dask warns that it is in use
                       job_extra=["--reservation=root_2"], # only between 9h and 19h on 11/03/2020
                       queue="p16")

To launch the workers, run `scale`. Please don't run more than 2 jobs.

In [None]:
cluster.scale(2)

Displaying the cluster object shows its status and a link to the Dask dashboard:

In [None]:
cluster

In [None]:
from dask.distributed import Client
client = Client(cluster)

If the client starts successfully and you can see the dashboard, you can move on to [Testing Dask](#testing).
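A quick way to confirm that the workers actually respond is to submit a trivial computation and check the result. The sketch below is self-contained, so it uses a small thread-based `LocalCluster` as a stand-in for the SLURM-backed cluster; on rainman you would skip the cluster creation and call `map` and `gather` on your existing `client` instead. `square` is a hypothetical helper used only for illustration.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    return x * x


# Stand-in cluster so the example runs anywhere (assumption: on rainman,
# reuse the `client` connected to the SLURMCluster instead).
local_cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                             processes=False, dashboard_address=None)
local_client = Client(local_cluster)

# One task per input value, executed on the workers.
futures = local_client.map(square, range(10))
total = sum(local_client.gather(futures))
print(total)  # 285

local_client.close()
local_cluster.close()
```

While the tasks run, they should also appear in the task stream of the dashboard.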

## <a id="kubernetes">Dask on GCP with Kubernetes</a>

A second way we can use Dask is through Kubernetes on GCP. Kubernetes is a container orchestration system which can be used to deploy many different applications.

![kubernetes](img/kubernetes.png)

Follow the instructions [here](https://zero-to-jupyterhub.readthedocs.io/en/latest/google/step-zero-gcp.html) to set up Kubernetes. Once you have verified your Helm installation (end of [this page](https://zero-to-jupyterhub.readthedocs.io/en/latest/setup-jupyterhub/setup-helm.html)), you can move on to the [Dask installation instructions](https://docs.dask.org/en/latest/setup/kubernetes-helm.html). To install dask-ml on all your workers, you will need to create a `config.yaml` file and upload it to your GCP shell (section "Configure Environment").

Once the Kubernetes cluster is running, you should have access to a JupyterHub notebook. To use Dask from that notebook, create a client:

In [None]:
from dask.distributed import Client, config
client = Client()

## <a id="testing">Testing Dask</a>

To test that our Dask installations are working, we'll look at a simple KMeans example. Copy the following code to your Dask notebook, whether on rainman or GCP.

In [None]:
import dask_ml.datasets
import dask_ml.cluster
import matplotlib.pyplot as plt

In [None]:
# Scale up: increase n_samples or n_features
X, y = dask_ml.datasets.make_blobs(n_samples=1000000,
                                   chunks=100000,
                                   random_state=0,
                                   centers=3)
X = X.persist()
X

In [None]:
km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)

When you're done with a session, make sure to close your client. If you used rainman, also close the cluster.

In [None]:
client.close()
cluster.close() # only for rainman

## <a id="hyperparam">Hyperparameter Optimization</a>

Hyperparameter optimization is well suited to parallel computing: machine learning models have many hyperparameters, and each combination of values can be trained and evaluated independently of the others.
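To make the independence concrete, here is a minimal sketch using `dask.delayed`. It is not the method from the Dask documentation example, only an illustration of the idea. The `score` function is a hypothetical stand-in for "train a model and return its validation score", and the grid values are arbitrary.

```python
import itertools

import dask


def score(learning_rate, depth):
    """Toy stand-in for training a model and returning a validation score."""
    return -(learning_rate - 0.1) ** 2 - (depth - 3) ** 2


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# One lazy task per combination; no task depends on another,
# so the scheduler is free to run them all in parallel.
tasks = [dask.delayed(score)(**params) for params in combos]
scores = dask.compute(*tasks)

best_score, best_params = max(zip(scores, combos), key=lambda pair: pair[0])
print(best_params)
```

Without a `Client`, `dask.compute` runs the tasks on local threads; once a `Client` is connected, the same code is executed on the cluster's workers.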

For this assignment, you will perform hyperparameter optimization on the [Titanic](https://www.kaggle.com/c/titanic/data) dataset. You should follow the extensive hyperparameter optimization example in the [Dask documentation](https://examples.dask.org/machine-learning/hyperparam-opt.html). You will need to create a notebook in your Dask installation; once you have it working, save the notebook and submit it on the LMS.

To get started, download the Titanic data. You first have to [accept the competition rules](https://www.kaggle.com/c/titanic/rules), and you will need to [set your Kaggle API key](https://www.kaggle.com/docs/api).

In [None]:
!pip install kaggle

In [None]:
!kaggle competitions download -c titanic

In [None]:
!unzip titanic.zip
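Once the archive is unpacked, the training data can be loaded with pandas. The sketch below assumes that the archive contains a `train.csv` file with a `Survived` target column and a `PassengerId` identifier column, as described on the Kaggle data page; check the extracted files if your copy differs. `load_titanic` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

import pandas as pd


def load_titanic(path="train.csv"):
    """Split the training file into a feature table and the target column."""
    df = pd.read_csv(path)
    target = df["Survived"]
    # PassengerId is only a row identifier, so it is dropped from the features.
    features = df.drop(columns=["Survived", "PassengerId"])
    return features, target


# Only runs when the data has been downloaded and unzipped next to the notebook.
if Path("train.csv").exists():
    features, target = load_titanic()
    print(features.shape, target.mean())
```

The remaining columns still include text and missing values, so they need to be encoded and imputed before being passed to a model.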

You will be graded on the following criteria:

+ Reproducibility - does your notebook work when I run it? Make sure to specify which deployment you used (rainman or GCP)
+ Completeness
+ Prediction performance
+ Compute performance - how fast is it?

Bonus points for:
+ Including Kaggle submission in your notebook
+ Explanation (text and code comments)