# Go Big or Go Home Part 1 - Dask fully distributed <img align="right" src="../resources/csiro_easi_logo.png">

In the previous notebooks we've been using `dask.distributed.LocalCluster` to split up our tasks and run them on the same compute node that is running the notebook. We noted that this could then use all cores to run tasks in parallel, greatly speeding up loading, and thanks to _chunks_ we can also process _some_ algorithms, like our NDVI seasonal mean, on datasets larger than available RAM.

But what happens if your algorithm and dataset are such that they cannot fit in the compute node's RAM, or the result of the calculation is also massive, or it's just so big (memory and computation) that it takes hours to compute?

Well, `dask.distributed.LocalCluster` is just one member of the `dask.distributed` cluster family. There are several others but the one we will be using is Kubernetes Cluster (`KubeCluster`). Kubernetes is an excellent technology that takes advantage of modern Cloud Computing architectures to automatically provision (and remove) compute nodes on demand. It does a lot of other stuff well beyond the scope of this dask tutorial of course. The important point is that using `KubeCluster` we can dramatically expand the number of Compute Nodes to schedule our dask Workers to; potentially very dramatically.

In this notebook we'll expand our NDVI seasonal mean calculation to cover all of Tasmania and take it back through two decades of observations. Along the way we'll explore the dask data structures, how computation proceeds, and what we can do to tune the performance. At this spatial size we'll also look at how to interactively visualise a result that is larger than the Jupyter notebook can handle.

Everything we do next builds on the concepts of _chunks_, _tasks_, _data locality_, and _task graph_ covered previously.

## Dask Gateway and remote dask schedulers

Once we have multiple compute nodes to run workers on we have a lot more moving parts in the system. These parts are also fully distributed and will need to communicate with each other to pass results, perform tasks, confirm completion of tasks, ask for more work, and so on. It is important to understand what the components are and how they interact because it can impact both the _performance_ and the _stability_ of a calculation, particularly at very large scales. In addition, _data locality_ really matters when your dataset is large - you don't want to `compute()` a 1 TB result and have it brought back to the Jupyter kernel on a 32 GiB machine! There are also subtleties to be aware of in how data gets from your Jupyter notebook to the dask distributed nodes - it has to be communicated somehow.

This can all be a bit overwhelming to think about. Thankfully, you don't hit all of these at once as dask does a good job of hiding many of the details. But remember our two "laws" of dask from the first notebook:
1. The best thing about dask is it makes distributed parallel programming in the datacube easy
1. The worst thing about dask is it makes distributed parallel programming in the datacube easy

The transition point from gain to pain, and back to gain, is connected to these details. So let's define our various parts and their roles and start building our knowledge.

### Kubernetes

In a Kubernetes environment all programs that execute things run in __Pods__.  What a _pod_ is and how it works is a subject for another course and you can use an internet search to find out more. For our purposes it is sufficient to understand that the Jupyter notebook, the dask scheduler, the dask workers, and all the components that make this work are running in _Pods_.

* _Pods_ have resources - memory, cpu, gpu, storage.
* _Pods_ request resources and have resource limits.
* _Pods_ communicate to each other over a network.
* You can think of a _pod_ as being a kind of virtual PC on which you can run your programs.

_Pods_ run on _Compute Nodes_ - physical hardware with an actual CPU, GPU, memory and storage. _Compute Nodes_ can run more than one _Pod_ so long as the sum of all the requests will fit. For example, if your _Compute Node_ has 64 GiB of RAM and your _worker Pods_ request 14 GiB each, then 4 _worker Pods_ will run on 1 _Compute Node_.
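
The 64 GiB / 14 GiB example can be checked with a couple of lines of arithmetic. This is only a rough sketch: it looks at memory requests alone, and the function name `pods_per_node` is made up for illustration. A real Kubernetes scheduler also accounts for CPU requests and for memory reserved by the node itself, so treat the answer as an upper bound.

```python
def pods_per_node(node_memory_gib, pod_request_gib):
    """How many pods fit on one node, judged by memory requests alone."""
    return int(node_memory_gib // pod_request_gib)


# Figures from the example: a 64 GiB node and 14 GiB worker requests
workers = pods_per_node(64, 14)
unrequested_gib = 64 - workers * 14

print(f"{workers} worker pods fit, leaving {unrequested_gib} GiB unrequested")
```

A fifth worker would need 70 GiB in total, which is why it has to be placed on another node.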

Thankfully, you don't need to figure out what Pods get placed where as Kubernetes will do that automatically. There is more to be said about this relationship and the impacts on performance and operational cost but for now just note that _Pods_ are where your code and data lives and they have requests and limits which you can control.

This diagram shows a single user (_Joe_) running a Jupyter Notebook and connected to a single dask cluster with 5 Workers (running on 3 Worker Nodes).

![Dask Cluster](../../resources/DaskCluster-ODCandDask.svg)

The __Jupyter notebook__ is where your code is typed in. It has a Python kernel of its own and will be the __dask client__ that talks to the __dask cluster__. It is running in a Pod, as are all the components. So the Jupyter notebook is _separate_ from the other components in the system and communicates over a network.

The __dask cluster__ is the __dask scheduler__ plus the group of __dask workers__ that process the tasks in a distributed manner. The _scheduler_ and the _workers_ are all Pods, which means they are _separate_ from each other and communicate over a network. This is different to `dask.distributed.LocalCluster` in which the Jupyter notebook, dask scheduler and workers all resided on the same machine and all communicated _very_ rapidly on the local machine's communications channel. Now they can be on entirely different _compute nodes_ and are _communicating over a much slower network_. We have the benefit of more compute resources, at the cost of slower communication.

The __dask gateway__ is a new component and is used to manage __dask clusters__ (note that is plural). The __Jupyter notebook__ acts as a client to the _dask gateway_ and makes requests for a __dask cluster__ (both the _scheduler_ and the _workers_) to the _dask gateway_ to create and destroy them. The _dask gateway_ manages the lifecycle of the cluster on the user's behalf.

> __Tip__: This means the dask clusters have an independent life cycle compared with the Jupyter notebook. Quitting your Jupyter notebook will not necessarily quit your dask cluster.

Moreover, you can have more than one Jupyter notebook talking to the _same_ dask cluster simultaneously. There are some good reasons for doing this but we won't be touching on them in this course.

### Running our NDVI seasonal mean on the remote dask cluster

Let's move our NDVI seasonal mean from the `LocalCluster` to our _dask gateway_ managed cluster, and add some extra compute resources to it.

The biggest change here is simply how we start and shutdown the dask cluster. The rest of the code, to do the actual computation, is _exactly_ the same.

The first thing we need to do is create a client so we can connect to the _dask gateway_.


In [None]:
# Initialize the Gateway client
from dask.distributed import Client
from dask_gateway import Gateway

gateway = Gateway()

Easy! We now have a `gateway` client variable. Using this we can start clusters, stop clusters, ask for a list of clusters we have running, set options for our scheduler and workers (like cpu and memory requests).

Let's see what the `cluster_options` are. We don't need to guess, we can ask the `gateway`.

In [None]:
gateway.cluster_options()

That's a lot I know. The majority of these don't need to change and most users will simply tweak _worker_ parameters:  _cores_, _threads_ (probably keeping it the same as _cores_), _memory_ and the _worker group_.

We will be using the defaults for now.

### Create the Cluster

Create the cluster with default options if it doesn't already exist. If a cluster exists in your namespace, the code below will connect to the first cluster. List the available clusters with `gateway.list_clusters()`.

The cluster creation may take a little while (minutes) if a suitable _node_ isn't available for the _scheduler_. The same thing will occur for _workers_ when they start. If a _node_ does exist then this can happen in seconds.

> __Tip__: Users are often confused by the changing start up time and think something is wrong.

It can take _minutes_ for a brand new _node_ to be provisioned, please be patient. If it takes 10 minutes then yes, something is wrong and you should probably contact an administrator of the system if that problem persists.

In [None]:
clusters = gateway.list_clusters()
if not clusters:
    print('Creating new cluster. Please wait for this to finish.')
    cluster = gateway.new_cluster()
else:
    print(f'An existing cluster was found. Connecting to: {clusters[0].name}')
    cluster = gateway.connect(clusters[0].name)
cluster

### Scale the cluster

Use the GatewayCluster widget (above) to adjust the cluster size. Alternatively use the cluster API methods.

For many tasks 1 or 2 workers will be sufficient, although for larger areas or more complex tasks 5 to 10 workers may be used. If you are new to Dask, start with one worker and then scale your cluster if needed.

In this notebook we'll start with 4 workers - that's 4x the resources for workers compared to our previous `LocalCluster`. In addition the _scheduler_ is also on its own node, and so is the Jupyter notebook kernel. Lots more resources for all the components involved.

The next cell will use the cluster API to add 4 workers programmatically.

In [None]:
cluster.scale(4)

### Connect to the cluster
To connect to your cluster and start doing work, use the `get_client()` method. This step will wait until the workers are ready. You don't actually have to wait for the workers. The Jupyter notebook can be doing other things whilst the workers are coming up. We're waiting in this example so you don't end up with an unexpected wait later.

***This may take a few minutes before your workers will be ready to use.***

In [None]:
client = cluster.get_client()
client.wait_for_workers(n_workers=4)
client

The client widget provides a clickable _dask dashboard_ link, so click that and you'll see your dashboard. It works the same as before despite the fact that everything is now running in a distributed fashion. If you click the _Workers_ tab in the _dashboard_ you will see that we now have 32 cores (up from 8) made up of 4x 8-core workers. Lots of RAM too.

Go back to the _Status_, so you can watch everything run.

### Perform the computation - same as before

We don't need to change any of our code to run this now, so let's repeat the full calculation.

In [None]:
import datacube
from datacube.utils import masking

dc = datacube.Datacube()

In [None]:
# Central Tasmania (near Little Pine Lagoon)
central_lat = -42.019
central_lon = 146.615

# Set the buffer to load around the central coordinates
# The buffer is a half-width in degrees, so the bounding box spans 2x buffer in both dimensions
buffer = 0.8 ### This is the same size as the Larger than RAM example

# Compute the bounding box for the study area
study_area_lat = (central_lat - buffer, central_lat + buffer)
study_area_lon = (central_lon - buffer, central_lon + buffer)

# Data products - Landsat 8 ARD from Geoscience Australia
products = ["ga_ls8c_ard_3"]

# Set the date range to load data over
set_time = ("2021-01-01", "2021-12-31")

# Set the measurements/bands to load. None will load all of them
measurements = None

# Set the coordinate reference system and output resolution
# This choice corresponds to Aussie Albers, with resolution in metres
set_crs = "epsg:3577"
set_resolution = (-30, 30)
group_by = "solar_day"

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"fmask": "nearest", "*": "average"},
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":1, "x":2048, "y":2048}, ## No change here, chunking spatially just like before
            group_by=group_by,
        )
dataset.nbytes / 2**30

In [None]:
# Identify pixels flagged as "valid" by fmask (this mask selects only that category)
cloud_free_mask = (
    masking.make_mask(dataset.oa_fmask, fmask="valid")
)
# Apply the mask
cloud_free = dataset.where(cloud_free_mask)

# Calculate the components that make up the NDVI calculation
band_diff = cloud_free.nbart_nir - cloud_free.nbart_red
band_sum = cloud_free.nbart_nir + cloud_free.nbart_red
# Calculate NDVI (kept as its own DataArray rather than stored in the original dataset)
ndvi = None
ndvi = band_diff / band_sum

ndvi_unweighted = ndvi.groupby("time.season").mean("time")  # Calculate the seasonal mean

In [None]:
%%time
actual_result = ndvi_unweighted.compute()

As before there will be a short delay as the Jupyter kernel (_client_) is used to optimize the task graph before sending it to the _scheduler_ which will then execute _tasks_ on the _workers_.

> __Tip__: You can open a terminal and use `htop` to monitor the Jupyter notebook CPU usage. You'll see at least one core using nearly 100% cpu usage during the optimisation phase. It will then drop back to idle as the _task graph_ is sent to the _scheduler_ at which point the dashboard will show activity on the cluster.

In [None]:
actual_result.sel(season='DJF').plot()

### Exploiting our new resources - adjusting our chunk size

Before we "Go Big" let's take advantage of our new resources.

We've gained memory. The more memory we have the more data we can operate on at once _and the less we need to communicate between nodes_. Communication over a network is slow relative to local communication in a _pod_. So if we change the _chunking_ we may see an improvement in performance.

_Chunking_ will also impact the number of tasks - the fewer chunks, the fewer tasks. This in turn will impact how much _task graph optimization_ is required, how hard the _scheduler_ has to work, and how much communication of partial results goes between workers (for example, passing the partial means around to get a final mean).

Let's look at our existing chunk size - (1,2048,2048)

In [None]:
dataset.nbart_red

8 MiB in use for `nbart_red` chunks. Of course this will vary with data type and stage of computation so when tuning dask you may need to check on the _chunk size_ and _tasks_ as your computation transforms the data. We're doing a simple computation here so we can focus on this initial value. With experience it does get easier to figure out when and where this parameter needs further adjustment. We'll look at some of this later.
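
The 8 MiB figure is easy to reproduce by hand, which is handy when you are deciding on a chunk shape before loading anything. The sketch below assumes 2 bytes per element (an `int16`-style type, which is what the 8 MiB reported above implies); `chunk_mib` is a helper name made up for this example. The second call shows what the same chunk would cost if a later step turned the values into a 4-byte float type, purely as an illustration of how data type changes the number.

```python
def chunk_mib(chunk_shape, bytes_per_element):
    """Memory used by a single chunk, in MiB."""
    n_elements = 1
    for length in chunk_shape:
        n_elements *= length
    return n_elements * bytes_per_element / 2**20


# (time, y, x) chunk shape used so far, at an assumed 2 bytes per element
print(chunk_mib((1, 2048, 2048), 2))

# Same chunk shape at a hypothetical 4 bytes per element
print(chunk_mib((1, 2048, 2048), 4))
```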

For now there are some things we can observe:
1. 8 MiB is pretty small when our 4 workers have 32 Gigs each so we have room to grow even with all the temporaries to allow for.
   * _We should monitor the worker memory usage in the dashboard as we make changes to ensure none are spilling to disk as that will slow things down and is unnecessary in this case_
1. Geospatial operations - the data load, the reprojection, even the masking - may benefit from having a larger spatial area.
   * No point going too large though as the satellite paths have a finite width and we'll just have lots of empty space.
1. The computation involves a seasonal mean, which means some temporal grouping might improve performance.
   * That said, _chunks_ are a unit for communication and it may mean that we're passing more information around than is necessary if we group too much together across the seasonal boundaries.
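
To get a feel for how much room the first observation leaves us, here is a back-of-the-envelope estimate. It assumes each worker has 32 GiB and runs 8 threads (one per core, as seen in the dashboard), and it simply divides memory evenly between threads. That is a rule of thumb chosen for this sketch, not how dask actually manages memory, and it ignores temporaries entirely.

```python
worker_memory_mib = 32 * 1024   # assumption: 32 GiB per worker
threads_per_worker = 8          # assumption: one thread per core
chunk_size_mib = 8              # current nbart_red chunk size

# Even share of worker memory available to each thread
per_thread_mib = worker_memory_mib / threads_per_worker

# How many of our current chunks would fit in that share
chunks_per_thread = per_thread_mib / chunk_size_mib

print(f"{per_thread_mib:.0f} MiB per thread, room for {chunks_per_thread:.0f} chunks of {chunk_size_mib} MiB")
```

Even allowing generously for intermediate results, a chunk several times larger than 8 MiB is still a small fraction of each thread's share.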

So we have good reason to increase our chunks both spatially and temporally; just be mindful of the impact on communication of results between nodes. The mean is seasonal and Landsat 8 performs repeat passes nominally every 16 days, so let's do a small grouping in time. We'll also increase the spatial chunking slightly.

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"fmask": "nearest", "*": "average"},
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":2, "x":3072, "y":3072},
            group_by=group_by,
        )

In [None]:
dataset.nbart_red

As you can see the number of tasks has dropped and our chunks are larger at 36 MiB.

There is a small sliver along the bottom because the chunk size isn't a good fit for the actual array size: `sliver = 6253 - (2 x 3072) = 109`. Let's expand our chunk size in that direction slightly to give us a better fit.
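
The leftover strip can be worked out for any axis length and chunk length with `divmod`. The sketch below uses the 6253-pixel axis from above; `chunk_lengths` is a helper name made up for this example and only mimics how a regular chunk size is laid out along one axis.

```python
def chunk_lengths(axis_length, chunk_length):
    """Lengths of the chunks along one axis; the last one holds any remainder."""
    full_chunks, remainder = divmod(axis_length, chunk_length)
    return [chunk_length] * full_chunks + ([remainder] if remainder else [])


# Chunk length 3072 leaves a thin third chunk
print(chunk_lengths(6253, 3072))

# Chunk length 3127 covers the axis in two nearly equal chunks
print(chunk_lengths(6253, 3127))
```

Any chunk length from 3127 upwards (half of 6253, rounded up) removes the third chunk along this axis.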

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"fmask": "nearest", "*": "average"},
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":2, "x":3072, "y":3127},
            group_by=group_by,
        )
dataset.nbart_red

Notice how that small change to remove the sliver only marginally increased our chunk memory usage (by 0.64 MiB) but dramatically reduced the number of chunks (down from 210 to 140).
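
Both numbers can be sanity-checked with a little arithmetic. The sketch assumes 2 bytes per element, as implied by the 36 MiB chunk size reported earlier, and uses the fact that the number of chunks along the 6253-pixel axis drops from 3 to 2 while the other two dimensions are chunked exactly as before.

```python
BYTES_PER_ELEMENT = 2  # assumption: int16-style data, consistent with 36 MiB chunks


def chunk_mib(time, x, y):
    """Memory used by one (time, x, y) chunk, in MiB."""
    return time * x * y * BYTES_PER_ELEMENT / 2**20


before = chunk_mib(2, 3072, 3072)
after = chunk_mib(2, 3072, 3127)
print(f"chunk size: {before:.2f} MiB -> {after:.2f} MiB (+{after - before:.2f} MiB)")

# Only the chunk count along one axis changes (3 -> 2),
# so the total number of chunks scales by 2/3
chunks_before = 210
chunks_after = chunks_before * 2 // 3
print(f"chunks: {chunks_before} -> {chunks_after}")
```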

Let's see what this does to our performance. We need to re-run the code to update all the intermediate variables in our calculation and call `compute()`

In [None]:
# Identify pixels flagged as "valid" by fmask (this mask selects only that category)
cloud_free_mask = (
    masking.make_mask(dataset.oa_fmask, fmask="valid")
)
# Apply the mask
cloud_free = dataset.where(cloud_free_mask)

# Calculate the components that make up the NDVI calculation
band_diff = cloud_free.nbart_nir - cloud_free.nbart_red
band_sum = cloud_free.nbart_nir + cloud_free.nbart_red
# Calculate NDVI (kept as its own DataArray rather than stored in the original dataset)
ndvi = None
ndvi = band_diff / band_sum

ndvi_unweighted = ndvi.groupby("time.season").mean("time")  # Calculate the seasonal mean

In [None]:
%%time
actual_result = ndvi_unweighted.compute()

You will notice the time for _task graph optimization_ - the delay between executing the cell above and seeing processing in the cluster dashboard is down significantly. Fewer tasks means less time in optimization.

We've significantly decreased the computation time as well!

There is one more thing we can do before we "Go Big". We've done this before and it's simple enough. Save dask the challenge of figuring out which measurements we aren't using by telling it only to load the ones we do use.
Let's add our measurements list in.

In [None]:
measurements = [ "oa_fmask", "nbart_red", "nbart_nir"]

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"fmask": "nearest", "*": "average"},
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":2, "x":3072, "y":3127},
            group_by=group_by,
        )
# Identify pixels flagged as "valid" by fmask (this mask selects only that category)
cloud_free_mask = (
    masking.make_mask(dataset.oa_fmask, fmask="valid")
)
# Apply the mask
cloud_free = dataset.where(cloud_free_mask)

# Calculate the components that make up the NDVI calculation
band_diff = cloud_free.nbart_nir - cloud_free.nbart_red
band_sum = cloud_free.nbart_nir + cloud_free.nbart_red
# Calculate NDVI
ndvi = None
ndvi = band_diff / band_sum

ndvi_unweighted = ndvi.groupby("time.season").mean("time") # perform the seasonal mean

In [None]:
%%time
actual_result = ndvi_unweighted.compute()

This didn't make too much difference to the computation time but it has shortened the _task graph optimisation_ phase a little more. That time isn't prohibitive in this example but as we "Go Big" it will be.

# Be a good dask user - Clean up the cluster resources

Disconnecting your client is good practice, but the cluster will still be up so we need to shut it down as well.

In [None]:
client.close()

In [None]:
cluster.shutdown()