# Dask Local Cluster - Larger than memory computation <img align="right" src="../../resources/csiro_easi_logo.png">

In the ODC and Dask (LocalCluster) notebook we saw how dask can be used to speed up IO and computation by parallelising operations into _chunks_ and _tasks_, and using _delayed tasks_ and _task graph_ optimization to remove redundant tasks when results are not used.

Using _chunks_ provides one additional capability beyond parallelisation - _the ability to perform computations that are larger than available memory_.

Since dask operations are performed on _chunks_ it is possible for dask to perform operations on smaller pieces that each fit into memory. This is particularly useful if you have a large amount of data that is being reduced, say by performing a seasonal mean.

As with parallelisation, not all algorithms are amenable to being broken into smaller pieces so this won't always be possible. Dask arrays though go a long way to make this easier for a great many operations.
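
The idea can be illustrated without Dask at all. The sketch below is plain Python, not Dask's implementation: the hypothetical helper `iter_chunks` stands in for reading data piece by piece, and `chunked_mean` reduces each piece to a running sum and count so that only one chunk is held in memory at a time.

```python
from typing import Iterable, Iterator, List


def iter_chunks(n_values: int, chunk_size: int) -> Iterator[List[float]]:
    """Yield the values 0..n_values-1 one chunk at a time (a stand-in for reading from storage)."""
    for start in range(0, n_values, chunk_size):
        stop = min(start + chunk_size, n_values)
        yield [float(v) for v in range(start, stop)]


def chunked_mean(chunks: Iterable[List[float]]) -> float:
    """Reduce each chunk to a (sum, count) contribution and combine the contributions."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)  # per-chunk reduction
        count += len(chunk)
        # the chunk is no longer needed after this point; only two numbers are kept
    return total / count


result = chunked_mean(iter_chunks(n_values=1_000_000, chunk_size=10_000))
print(result)
```

A mean works well here because the per-chunk results (a sum and a count) are tiny compared with the chunk itself. An operation such as a full sort has no equally small per-chunk summary, which is why it is harder to break into pieces.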

Firstly, some initial imports...

In [None]:
# EASI tools
import git
import sys, os
os.environ['USE_PYGEOS'] = '0'
repo = git.Repo('.', search_parent_directories=True)
if repo.working_tree_dir not in sys.path: sys.path.append(repo.working_tree_dir)
from easi_tools import EasiNotebooks, notebook_utils
easi = EasiNotebooks()

We'll continue using the same algorithm as before but this time we're going to modify its memory usage to exceed the LocalCluster's available memory. This example notebook is set up to run on a compute node with 28 GiB of available memory and 8 cores for the LocalCluster. We'll make that explicit here in case you are blessed with a larger number of resources.

Let's start the cluster...

In [None]:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=4)
cluster.scale(n=2, memory="14GiB")
client = Client(cluster)
client

We can monitor memory usage on the workers using the dask dashboard URL below and the Status tab. The workers are local so this will be memory on the same compute node that Jupyter is running in.  

In [None]:
dashboard_address = notebook_utils.localcluster_dashboard(client=client,server=easi.hub)
print(dashboard_address)

---
As we will be using __Requester Pays__ buckets in AWS S3, we need to run the `configure_s3_access()` function below with the `client` option to ensure that Jupyter and the cluster have the correct permissions to be able to access the data.

In [None]:
from datacube.utils.aws import configure_s3_access
configure_s3_access(aws_unsigned=False, requester_pays=True, client=client);

In [None]:
import datacube
from datacube.utils import masking

dc = datacube.Datacube()

In [None]:
# Get the centroid of the coordinates in the default configuration
central_lat = sum(easi.latitude)/2
central_lon = sum(easi.longitude)/2

# or set your own by uncommenting and editing the following lines
# central_lat = -42.019
# central_lon = 146.615

# Set the buffer to load around the central coordinates
# The buffer is applied on each side of the centre, so the bounding box spans 2 x buffer in both dimensions
buffer = 0.05

# Compute the bounding box for the study area
study_area_lat = (central_lat - buffer, central_lat + buffer)
study_area_lon = (central_lon - buffer, central_lon + buffer)

# Data products - Landsat 8 ARD from Geoscience Australia
products = easi.product('landsat')

# Set the date range to load data over 
set_time = ("2021-01-01", "2021-12-31")

# Set the measurements/bands to load. None will load all of them
measurements = None

# Set the coordinate reference system and output resolution
set_crs = easi.crs('landsat')  # If defined, else None
set_resolution = easi.resolution('landsat')  # If defined, else None
# set_crs = "epsg:3577"
# set_resolution = (-30, 30)

group_by = "solar_day"

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"qa_pixel": "nearest", "*": "average"},  # the mask band used below is qa_pixel; it must not be averaged
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":1},
            group_by=group_by,
        )
dataset

We can check the total size of the dataset using `nbytes`. We'll divide by 2**30 to have the result display in [gibibytes](https://simple.wikipedia.org/wiki/Gibibyte).

In [None]:
print(f"dataset size (GiB) {dataset.nbytes / 2**30:.2f}")

As you can see, this ROI and temporal range (1 year) is tiny. Let's scale up by increasing our ROI.


In [None]:
buffer = 1

# Compute the bounding box for the study area
study_area_lat = (central_lat - buffer, central_lat + buffer)
study_area_lon = (central_lon - buffer, central_lon + buffer)

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"qa_pixel": "nearest", "*": "average"},
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":1},
            group_by=group_by,
        )
print(f"dataset size (GiB) {dataset.nbytes / 2**30:.2f}")

Okay, this is now larger than the memory available on our Jupyter node (which you should be able to see at the bottom of your window - probably 29.00 GB). This is a problem for calculation: we need a solution that lets us calculate the information we want without the machine running out of memory.

Dask can compute many tasks and handle large amounts of data over the course of a series of calculations. Collectively, these calculations might work on more data in total than can fit in RAM, but it is a problem if the final product is too big to fit in RAM. Below we will change the dataset so that the final result can fit in RAM and then use the `.compute()` function to run all the calculations.

Let's take a look at the memory usage for one of the bands, we'll use `red`.

In [None]:
dataset.red

You can see the year now has more time observations than in the first dataset because we've expanded the area of interest and picked up multiple satellite passes. The spatial dimensions are also much larger.

Take a note of the _Chunk Bytes_ - probably around 80 MiB. This is the smallest unit of this dataset that dask will do work on. To do an NDVI calculation, dask will need two bands, the mask, the result and a few other temporary variables in memory at once. This means whilst this value is an indicator of memory required on a worker to perform an operation it is not the total, which will depend on the operation.
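
As a rough guide, the arithmetic behind _Chunk Bytes_ can be reproduced by hand. The sketch below uses hypothetical values chosen only to land near the 80 MiB figure: a chunk shape of `(1, 6100, 6900)` and 2 bytes per pixel are assumptions, so read the real shape and dtype from the `dataset.red` summary. It also assumes, as a simplification, that the two bands, the mask and the result are all the same size as one chunk.

```python
def chunk_mib(shape, itemsize):
    """Size in MiB of one chunk with the given shape and bytes per element."""
    n_elements = 1
    for length in shape:
        n_elements *= length
    return n_elements * itemsize / 2**20


# Hypothetical values for illustration only
chunk_shape = (1, 6100, 6900)  # (time, y, x) with dask_chunks={"time": 1}
itemsize = 2                   # assumed bytes per pixel (16-bit integer)

one_chunk = chunk_mib(chunk_shape, itemsize)

# red, nir08, qa_pixel and the NDVI result: at least four chunk-sized arrays
ndvi_lower_bound = 4 * one_chunk

print(f"one chunk: {one_chunk:.1f} MiB")
print(f"NDVI working set per task: at least {ndvi_lower_bound:.1f} MiB")
```

This is a lower bound rather than an estimate: masked and divided values are floating point, which takes more bytes per pixel than the stored integers, and temporary arrays add to the total.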

We can adjust the amount of memory per chunk further by _chunking_ the spatial dimension. Let's split it into 2048x2048 size pieces.

In [None]:
dataset = None # clear results from any previous runs
dataset = dc.load(
            product=products,
            x=study_area_lon,
            y=study_area_lat,
            time=set_time,
            measurements=measurements,
            resampling={"qa_pixel": "nearest", "*": "average"},  # the mask band used below is qa_pixel; it must not be averaged
            output_crs=set_crs,
            resolution=set_resolution,
            dask_chunks =  {"time":1, "x":2048, "y":2048},  ## Adjust the chunking spatially as well
            group_by=group_by,
        )
print(f"dataset size (GiB) {dataset.nbytes / 2**30:.2f}")

As you can see the total dataset size stays the same. 

Look at the `red` data variable below. You can see the chunk size has reduced to 8 MiB, and there are now more chunks (around 700-800) - compared with around 60 previously. This will result in a higher number of Tasks for Dask to work on. This makes sense: smaller chunks, more tasks.

> __TIP__: The _relationship between tasks and chunks_ is a critical tuning parameter.

Workers have limits in memory and compute capacity. The Dask Scheduler has limits in how many tasks it can manage efficiently (and remember it is tracking all of the data variables, not just this one). The trick with Dask is to give it a good number of chunks of data which aren't too big and don't result in too many tasks. There is always a trade-off and each calculation will be different. Ideally, you want chunks to be aligned with how the data is stored, or how the data is going to be used. If those two things are different, then rechunking can result in large amounts of data needing to be held in the cluster memory, which could result in failures. In the same way, if chunks are too large, they might end up taking up too much memory, causing a crash. This is sometimes down to trial and error.
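
To get a feel for the trade-off before loading anything, the chunk count and chunk size can be tabulated for a few candidate spatial chunk sizes. This is a minimal sketch under stated assumptions: the dimensions (60 time steps, 6100 x 6900 pixels) and the 2-byte pixel size are hypothetical, and `chunk_summary` is an illustrative helper, not part of Dask or the Open Data Cube.

```python
import math

# Hypothetical dimensions, for illustration only
n_time, n_y, n_x = 60, 6100, 6900
itemsize = 2  # assumed bytes per pixel (16-bit integer)


def chunk_summary(chunk_y, chunk_x):
    """Return (number of chunks, MiB of a full-sized chunk) when time is chunked by 1."""
    n_chunks = n_time * math.ceil(n_y / chunk_y) * math.ceil(n_x / chunk_x)
    mib = chunk_y * chunk_x * itemsize / 2**20
    return n_chunks, mib


for size in (512, 1024, 2048, 4096):
    n_chunks, mib = chunk_summary(size, size)
    print(f"{size:>5} x {size:<5} {n_chunks:>6} chunks per variable, {mib:>6.1f} MiB each")
```

Halving the chunk edge quarters the memory per chunk but roughly quadruples the number of chunks, and the scheduler has to track every one of them for every data variable.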

Later, when we move to a fully remote and distributed cluster, _chunks_ also become an important element in communicating between workers over networks.

If you look carefully at the cube-like diagram in the summary below you will see some internal lines showing the chunk boundaries for the spatial dimensions. The spatial dimensions aren't exact multiples of 2048, so dask has made some chunks on the edges smaller. The specification of `chunks` is a guide: the actual data, numpy arrays in this case, are made into `chunk` sized shapes or smaller. These are called `blocks` in dask and represent the actual shape of the numpy array that will be processed.

Somewhat confusingly, the terms `blocks` and `chunks` are used interchangeably in dask literature, so you'll need to check the context to see whether a passage refers to the _specification_ or the _actual block of data_. For the moment this distinction doesn't matter, but when performing low-level custom operations it does matter that your `blocks` might be a different shape from the `chunks` you specified.
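
The difference between the requested chunk size and the resulting block sizes is simple to work out by hand. The sketch below assumes hypothetical dimension lengths of 6100 and 6900 pixels, and `block_sizes` is an illustrative helper rather than a Dask function. A Dask array reports tuples of this form in its `.chunks` attribute.

```python
def block_sizes(dim_length, chunk_size):
    """Split one dimension into blocks of at most chunk_size elements."""
    full_blocks, remainder = divmod(dim_length, chunk_size)
    sizes = (chunk_size,) * full_blocks
    if remainder:
        sizes += (remainder,)  # the smaller block on the edge
    return sizes


# Hypothetical dimension lengths, for illustration only
y_blocks = block_sizes(6100, 2048)
x_blocks = block_sizes(6900, 2048)

print(y_blocks)  # (2048, 2048, 2004)
print(x_blocks)  # (2048, 2048, 2048, 756)
```

Custom code that runs once per block therefore cannot assume a 2048 x 2048 input; in this example the corner block would be 2004 x 756.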

In [None]:
dataset.red

We won't worry too much about tuning these parameters right now and instead will focus on processing this larger dataset. As before we can exploit dask's ability to use _delayed_ tasks and apply our masking and NDVI directly to the full dataset. We'll also add an unweighted seasonal mean calculation using `groupby("time.season").mean("time")`. Dask will seek to complete the reductions (by chunk) first as they reduce memory usage.

It's worth monitoring the dask cluster memory usage via the dashboard _Workers Memory_ to see just how little RAM is actually used during this calculation despite it being performed on a large dataset.

In [None]:
print(dashboard_address)

---
We will now calculate NDVI and group the results by season:

In [None]:
# Identify pixels that don't have cloud, cloud shadow or water
from datacube.utils import masking

good_pixel_flags = {
    'nodata': False,
    'cloud': 'not_high_confidence',
    'cloud_shadow': 'not_high_confidence',
    'water': 'land_or_cloud'
}

cloud_free_mask = masking.make_mask(dataset['qa_pixel'], **good_pixel_flags)

# Apply the mask
cloud_free = dataset.where(cloud_free_mask)

# Calculate the components that make up the NDVI calculation
band_diff = cloud_free.nir08 - cloud_free.red
band_sum = cloud_free.nir08 + cloud_free.red
# Calculate NDVI as a new (lazy) DataArray
ndvi = None
ndvi = band_diff / band_sum

ndvi_unweighted = ndvi.groupby("time.season").mean("time")  # Calculate the seasonal mean

Let's check the shape of our result - it should have 4 seasons now instead of the individual dates.

In [None]:
ndvi_unweighted

Before we do the `compute()` to get our result we should make sure the final result will fit in memory for the Jupyter kernel

In [None]:
print(f"dataset size (GiB) {ndvi_unweighted.nbytes / 2**30:.2f}")

This shows that the resulting data should be around 1 GiB of data, which will fit in local memory.

If you are monitoring the cluster when you run the cell below, you might notice a delay between running the next cell and actual computation occurring. Dask performs a _task graph optimisation_ step on the _client_, not the cluster. How long this takes depends on the number of tasks and complexity of the graph. Recent Dask updates have made this step faster. We'll talk more about this later.

In the meantime, run the next cell and watch dask compute the result without running out of memory. You might notice that your cluster spills some data to disk (the grey part of the bars in the _Bytes stored per worker_ graph). This is not normally desirable and slows down the calculation (because reading and writing to/from the disk is slower than to/from RAM), but it is a mechanism used by Dask to help manage large calculations. 
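
Spilling is driven by memory thresholds expressed as fractions of each worker's memory limit, configured under `distributed.worker.memory` in Dask's configuration. The sketch below converts those fractions into GiB. It rests on two assumptions: the fractions are the defaults documented by Dask at the time of writing and may differ in your installed version or site configuration, and the per-worker limit is taken to be the 28 GiB node shared evenly by the 2 workers.

```python
# Assumed per-worker memory limit: 28 GiB shared evenly by 2 workers
memory_limit_gib = 28 / 2

# Assumed default fractions (check your own Dask configuration)
thresholds = {
    "target": 0.60,     # start moving stored results to disk
    "spill": 0.70,      # spill based on the worker's total process memory
    "pause": 0.80,      # stop starting new tasks on the worker
    "terminate": 0.95,  # the worker process is killed and restarted
}

threshold_gib = {name: fraction * memory_limit_gib for name, fraction in thresholds.items()}

for name, gib in threshold_gib.items():
    print(f"{name:<10} {gib:5.2f} GiB")
```

If the dashboard shows workers sitting near the pause or terminate levels, smaller chunks or fewer threads per worker are the usual first things to try.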

> __TIP__: Don't forget to look at your Dask Dashboard (URL a few cells above) to watch what is happening in your cluster.

In [None]:
actual_result = ndvi_unweighted.compute()

To avoid northern/southern hemisphere differences, the `season` values are represented as acronyms of the months that make them up, so:
- December, January, February = DJF
- March, April, May = MAM
- June, July, August = JJA
- September, October, November = SON
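
The mapping from month to season label can be written in a few lines. This is a minimal sketch of the grouping rule listed above, with `month_to_season` as an illustrative helper; it is not presented as the library's own code.

```python
SEASONS = ("DJF", "MAM", "JJA", "SON")


def month_to_season(month: int) -> str:
    """Map a calendar month (1-12) to its three-month season label."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Integer division groups months in threes; the modulo wraps December back to DJF
    return SEASONS[(month // 3) % 4]


labels = [month_to_season(m) for m in range(1, 13)]
print(labels)
```

Because the grouping uses only the label, a single calendar year of data (as loaded here) puts January, February and December of the same year into `DJF`, rather than one continuous summer or winter.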

Let's plot the result for `DJF`. This will take a few seconds, the image is several thousand pixels across.

In [None]:
actual_result.sel(season='DJF').plot(robust=True, size=6, aspect='equal')

Not the most useful visualisation as a small image, and a little slow. Dask can help with this too but that's a topic for another notebook. There are many other ways to work with Dask and optimize performance. This is just the beginning of how to manage large calculations.

# Be a good dask user - Clean up the cluster resources

In [None]:
client.close()

cluster.close()