# On-demand interactive Dask-based large-scale data analysis on Andes

This notebook demonstrates on-demand, interactive use of Dask on Andes using Slurm.
The cell below documents the underlying behavior of the Dask cluster.

## Acquiring Dask cluster

### Non-conflicting dask-scheduler instance tied to a notebook

The proposed approach for gears is to spawn a single Dask cluster per notebook.  Each notebook has a corresponding Dask scheduler anchored on the node on which the notebook is running.
Here, we want to be sure the scheduler does not step on other schedulers, so it uses random ports for both the scheduler and the dashboard.  In the case of gears, the notebook can run on a login node where other users' schedulers, or even your own, could already occupy a port.

The first part of the cell below demonstrates how to do this.
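
A random draw can still land on a port that is already in use. As a minimal sketch of an alternative (this is not what the cell below does), the operating system can be asked for a port that is currently free. The helper name `find_free_port` is hypothetical, and a small window remains in which another process could take the port before the scheduler binds to it.

```python
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a TCP port that is currently unused on this node."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS pick any free port
        sock.bind((host, 0))
        return sock.getsockname()[1]


dashboard_port = find_free_port()
# Hypothetical usage with the scheduler options used in this notebook:
# cluster = SLURMCluster(
#     scheduler_options={"dashboard_address": f":{dashboard_port}"}
# )
print(f"/proxy/{dashboard_port}")
```

The returned value can be passed to `scheduler_options` in the same way as the randomly drawn port.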

### Spawning the worker pool

After you acquire the cluster and the client for the notebook, call `cluster.scale(jobs=1)` to actually request a worker pool, which is then used to execute the subsequent cells.

The recommendation is to call `cluster.scale(jobs=<up to 4>)`, do the compute, and then call `cluster.scale(jobs=0)` to release the workers and preserve node hours.

Currently, the underlying SLURMCluster object creates one Slurm job per scale unit, and each job is limited to a 1 node allocation for the worker pool.  Scaling up to 4 jobs therefore means 4 Slurm jobs.  Note that 4 is the limit on concurrently running Slurm jobs on the Andes cluster.  If more nodes are needed, each job would need to use a job launcher such as `srun`, but unfortunately the current SLURMCluster is not compatible with that.

In general, assume the notebook will be run sequentially from top to bottom, and scale up and scale down explicitly whenever possible.

If cleanup is not done explicitly, the scheduler is killed when the notebook's kernel is killed, and the on-demand workers are then bounded only by the default 1 hour walltime defined underneath.
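
One way to make the scale-up and scale-down pairing hard to forget is to wrap it in a context manager. The following is a sketch under assumptions: `scaled`, `MAX_JOBS`, and `RecordingCluster` are hypothetical names, and `RecordingCluster` is only a stand-in so the example runs without Slurm. With a real cluster, the `SLURMCluster` object would be passed instead.

```python
from contextlib import contextmanager

MAX_JOBS = 4  # concurrent Slurm job limit on Andes, as stated above


@contextmanager
def scaled(cluster, jobs=1):
    """Scale the worker pool up for a block of work, then back to zero."""
    if not 1 <= jobs <= MAX_JOBS:
        raise ValueError(f"jobs must be between 1 and {MAX_JOBS}, got {jobs}")
    cluster.scale(jobs=jobs)
    try:
        yield cluster
    finally:
        # Always release the nodes, even if the computation raised
        cluster.scale(jobs=0)


class RecordingCluster:
    """Stand-in for SLURMCluster that only records scale requests."""

    def __init__(self):
        self.requests = []

    def scale(self, jobs=0):
        self.requests.append(jobs)


demo = RecordingCluster()
with scaled(demo, jobs=4):
    pass  # the compute for this block of work would go here
print(demo.requests)
```

The `finally` clause is the design point: the request to scale to zero is issued whether the body finishes normally or raises.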

### Comments on dask-labextension's cluster usage

It is inadvisable to use dask-labextension to spawn a Dask cluster.  One of the aims is to be able to run the Jupyter notebooks in a batch environment (cells running sequentially) once development is finished.  Using dask-labextension requires copying arbitrary code into Jupyter notebooks and would not be reproducible.

In [20]:
# Standard preamble to use the Slurm cluster
import random
from dask_jobqueue import SLURMCluster
from distributed import Client

# Slurm cluster submission to the Andes cluster
# The cluster configuration is in ./etc/dask/dask.yml with sensible defaults
# Refer to the "dask.jobqueue.slurm" section in the configuration yml file
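# Pick a random dashboard port to reduce the chance of clashing with
# other schedulers running on the same node
dashboard_port = random.randint(10000, 60000)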
cluster = SLURMCluster(
    scheduler_options={"dashboard_address": f":{dashboard_port}"}
)
# We print out the address you copy into the dask-labextension
print("Dashboard address for the dask-labextension")
print(f"/proxy/{dashboard_port}")

# Create the client object
client = Client(cluster)
client

Dashboard address for the dask-labextension
/proxy/21632


Client: Scheduler: tcp://10.43.202.83:43105, Dashboard: http://10.43.202.83:21632/status
Cluster: Workers: 0, Cores: 0, Memory: 0 B


# Computation using the cluster

## Opening AAIMS datasets using dask

Below is an example of loading a dataset from the AAIMS data lake (gen150).  It is necessary to use the following options in the `read_parquet` call.

* `engine='pyarrow'`: Should be given explicitly as `pyarrow`, as there is an alternative engine, `fastparquet`.  So far, evaluations of engine performance with large datasets indicate that `pyarrow` works better.
* `index=False`: The original data has the `timestamp` column as the index, but the approach here is to load everything as regular columns and then turn one of them into the index using `set_index` as needed.
* `gather_statistics=True`: Unless the dataset consists of many thousands or millions of files, this option helps Dask identify the divisions of the partitions.
* `split_row_groups=True`: AAIMS operational data is designed to have an adequate parquet `row_group` size internal to the large files.  These `row_groups` are sized to be used as partitions and to avoid workers being killed by out-of-memory watchdogs.  Without this option set to `True`, Dask uses each file as a partition, and in some cases a file is 20GB while workers have only 32GB of memory (i.e., on the Andes cluster).
* `columns=[...]`: It is a *must* to read only the columns you need for the subsequent analysis.  This significantly reduces the I/O and memory requirements for analyzing the dataset.  More importantly, it saves your precious time as the investigator and reduces the impact on our node hours.
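
The options above can be collected in one place so that every notebook reads the data lake the same way. This is a minimal sketch: `parquet_read_options` is a hypothetical helper, and it assumes the installed Dask version accepts these keyword names, which may differ in other releases.

```python
def parquet_read_options(columns):
    """Build the keyword arguments recommended above for read_parquet."""
    columns = list(columns)
    if not columns:
        # Reading every column defeats the purpose of the columns option
        raise ValueError("select the columns you need explicitly")
    return {
        "engine": "pyarrow",
        "index": False,
        "gather_statistics": True,
        "split_row_groups": True,
        "columns": columns,
    }


options = parquet_read_options(["timestamp", "total_power", "hostname"])
# Intended usage (requires dask and access to the data lake):
# df = dd.read_parquet(DATASET, **options)
print(options)
```

Rejecting an empty column list turns the "must" in the last bullet into an error instead of a silent full read.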

In [21]:
%%time
import dask.dataframe as dd

# Scale up right before running compute
# Currently, 4 jobs is the most you can run concurrently on the Andes cluster
cluster.scale(jobs=4)

# Read a dataset - 202004*.parquet indicates 1 month of data - (560GB in space)
DATASET = '/gpfs/alpine/gen150/proj-shared/data/lake/summit_power_temp_openbmc/data/202004*.parquet'
df = dd.read_parquet(DATASET, engine='pyarrow', 
                     index=False, gather_statistics=True,
                     split_row_groups=True,
                     columns=['timestamp', 'total_power', 'hostname'])
df.npartitions
df

CPU times: user 941 ms, sys: 241 ms, total: 1.18 s
Wall time: 11.2 s


| | timestamp | total_power | hostname |
|---|---|---|---|
| npartitions=4320 | | | |
| | datetime64[ns, UTC] | int16 | object |
| | ... | ... | ... |
| ... | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
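
As a rough sanity check on the partitioning, the figures quoted in this notebook can be combined with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units, uses the on-disk size of all columns, and ignores compression, so the in-memory size of a partition holding three selected columns will differ.

```python
dataset_bytes = 560e9        # one month of data, from the comment in the cell above
npartitions = 4320           # partition count reported by Dask
largest_file_bytes = 20e9    # largest file size mentioned earlier
worker_memory_bytes = 32e9   # worker memory on Andes mentioned earlier

# Average on-disk bytes covered by one row-group-sized partition
avg_partition_mb = dataset_bytes / npartitions / 1e6

# Share of a worker's memory that a whole file would claim on disk size alone
file_share = largest_file_bytes / worker_memory_bytes

print(f"average on-disk size per partition: {avg_partition_mb:.1f} MB")
print(f"one whole file relative to worker memory: {file_share:.1%}")
```

The gap between the two numbers illustrates why row-group partitions are much easier for the workers to hold than whole files.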


## Subsequent analysis

Beyond this point, referring to [Dask best practices](https://docs.dask.org/en/latest/best-practices.html) will significantly help.  

In [22]:
%%time
# Look at the amount of records you're dealing with
# This example has about 11 billion records (see the output below)
cluster.scale(jobs=4)
value = df['total_power'].count().compute()
value

CPU times: user 8.93 s, sys: 371 ms, total: 9.3 s
Wall time: 12.2 s


11221036003

In [23]:
%%time
# Note: df is not persisted in this notebook, so this computation re-reads the Parquet files
cluster.scale(jobs=4)
value = df['total_power'].std().compute()
value

CPU times: user 10.1 s, sys: 387 ms, total: 10.5 s
Wall time: 13.5 s


267.8715734854267

In [24]:
%%time
cluster.scale(jobs=4)
value = df['total_power'].mean().compute()
value

CPU times: user 15.4 s, sys: 592 ms, total: 16 s
Wall time: 22.3 s


646.7462935479185

In [25]:
# Retrieve the cluster logs to debug the cluster
cluster.get_logs()

# Cleaning up

Scale the worker pool down, then close the client and the cluster.
This happens automatically when the kernel dies, but it is a good idea to do it explicitly.

In [26]:
cluster.scale(jobs=0)

In [27]:
client.close()
cluster.close()