
Reasoning about scaling and performance #204

Closed
TomNicholas opened this issue Jun 5, 2023 · 4 comments · Fixed by #252
Labels: documentation, scaling

Comments

TomNicholas (Collaborator) commented Jun 5, 2023

I tried to write out some thoughts about cubed's performance and scaling. Perhaps this could form the basis of another page of the docs, similar to how dask has pages on best practices and understanding performance. Obviously a lot of this will change as more optimizations are introduced / if we discover other reasons why scaling does not behave as expected.

Aim of writing this out:

  • Understand it better myself
  • Provide mental model for diagnosing performance
  • Explain to users what performance they should expect
  • Motivate asv benchmarks
  • Help us to ensure real scaling matches theoretical scaling
  • Material for docs page / future blog post

Types of scaling to explain:

  • Horizontal versus vertical scaling
  • Weak scaling versus strong scaling

Scaling of a single step

Theoretical scaling:

  • Use rechunk as an example
  • Limited by concurrent writes to Zarr
  • Assuming the serverless service provides effectively infinite workers, then...
  • Weak scaling should be totally linear
  • i.e. a larger problem completes in the same time given proportionally larger resources (see the sketch below)
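
For intuition, here is a back-of-the-envelope toy model of that claim (an illustration of the idea only, not cubed's actual scheduling logic): if every output chunk is written by its own independent task and the service supplies at least one worker per task, the wall-clock time of the step is roughly one task's duration, however big the array is.

```python
import math

# Toy model of a single embarrassingly parallel step (e.g. a rechunk where
# each output chunk is written by an independent task). Illustrative only -
# this is not how cubed's executors actually schedule work.

def step_time(n_tasks: int, n_workers: int, task_seconds: float = 5.0) -> float:
    """Idealised wall-clock time: tasks run in waves of `n_workers` at a time."""
    return math.ceil(n_tasks / n_workers) * task_seconds

# Weak scaling: problem size and worker count grow together, so the number of
# waves (and hence the wall-clock time) stays constant.
for n_chunks in [100, 1_000, 10_000]:
    print(n_chunks, step_time(n_tasks=n_chunks, n_workers=n_chunks))
# prints the same time at every size, i.e. perfectly linear weak scaling
```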

Realistic scaling considerations:

  • Achieving this actually requires parallelism, so it won't happen with the single-threaded executor
  • Weak scaling requires at least as many workers as output chunks (might need to set executor config for big problems)
  • With fewer workers than output chunks, strong scaling should be totally linear; once there are more workers than output chunks, adding more gives no further speedup
  • Stragglers can dominate the completion time - mitigate by turning on backups (see the toy model after this list)
  • Failures, once restarted, behave basically like stragglers
  • Worker start-up time would delay completion but not actually affect scaling properties, because all workers have the same start-up time regardless of problem size
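
To see why stragglers matter so much, here is a tiny simulation with made-up task durations (and a deliberately simplified notion of "backup" - this is not cubed's actual backup mechanism): a step only finishes when its slowest task does, so a single slow task sets the wall-clock time, and racing it against a backup copy restores the expected completion time.

```python
import random

# Toy illustration of stragglers and backup tasks. Made-up durations;
# not cubed's actual backup implementation.
random.seed(0)

task_times = [random.uniform(4.0, 6.0) for _ in range(1000)]  # typical tasks
task_times[42] = 120.0                                        # one straggler

# The step finishes when the slowest task finishes.
print("without backups:", max(task_times))   # ~120 s, set by the straggler

# With a backup, the straggler races a second attempt that behaves like a
# typical task, and the step uses whichever copy finishes first.
task_times[42] = min(task_times[42], random.uniform(4.0, 6.0))
print("with a backup:  ", max(task_times))   # back to ~6 s
```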

Scaling of a multi-step plan

Multiple pipelines

  • i.e. two separate arrays you ask to compute simultaneously
  • Just requires enough workers for both pipelines to keep the same weak scaling behaviour (a sketch follows this list)
  • The same logic applies to two arrays which feed into a single array (or vice versa)
  • However, independent pipelines currently won't compute in parallel on all executors (Execute tasks for independent pipelines in parallel #80)
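
A quick extension of the same toy model (again purely illustrative, not cubed's scheduler) shows why parallel execution of independent pipelines matters: with a worker pool sized for both pipelines' output chunks, running them together costs one wave, whereas running them one after the other (the current behaviour on some executors, per #80) doubles the wall-clock time.

```python
import math

# Toy model: two independent pipelines submitted together. Illustrative only.

def step_time(n_tasks: int, n_workers: int, task_seconds: float = 5.0) -> float:
    return math.ceil(n_tasks / n_workers) * task_seconds

pipeline_a, pipeline_b = 400, 600      # output chunks in each pipeline
workers = pipeline_a + pipeline_b      # pool sized for both at once

# Executed in parallel (what #80 is about): one wave, same time as either alone.
print(step_time(pipeline_a + pipeline_b, workers))                      # 5.0

# Executed serially (current behaviour on some executors): twice the time.
print(step_time(pipeline_a, workers) + step_time(pipeline_b, workers))  # 10.0
```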

Executor-specific considerations

  • Worker start-up time is much faster on some executors than others
  • Different limits on the maximum number of workers
  • If you use dask as an executor, its scheduler might do unintuitive things - does Beam have different properties?
TomNicholas added the documentation label Jun 5, 2023
tomwhite (Member) commented Jun 5, 2023

This is very useful!

Can you say what you mean by weak and strong scaling?

TomNicholas (Collaborator, Author) commented:

Can you say what you mean by weak and strong scaling?

I meant this (from Wikipedia):

  • Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
  • Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.
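
In the usual notation, writing T(N, p) for the time to solve a problem of total size N on p processors, those definitions amount to the standard textbook formulation (nothing cubed-specific):

```latex
% Strong scaling: fixed total problem size N, vary the processor count p
S_{\mathrm{strong}}(p) = \frac{T(N, 1)}{T(N, p)}, \qquad
E_{\mathrm{strong}}(p) = \frac{S_{\mathrm{strong}}(p)}{p}

% Weak scaling: fixed problem size per processor, so the total size is N \cdot p
E_{\mathrm{weak}}(p) = \frac{T(N, 1)}{T(N \cdot p, p)}
```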

By this I meant essentially:

  • Strong scaling would be Cubed's performance (i.e. total execution time) vs number of workers on a fixed-size dataset
  • Weak scaling would be Cubed's performance (i.e. total throughput) vs dataset size: if total execution time stays constant as the dataset grows, then total throughput is scaling linearly with problem size (see the worked numbers below)
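
For concreteness, some deliberately made-up numbers (not measurements of cubed) showing how a constant execution time translates into throughput growing linearly with dataset size:

```python
# Hypothetical, made-up numbers purely to illustrate the definition above -
# not measurements of cubed.
dataset_tb = [1, 10, 100]          # problem sizes (TB)
wall_clock_s = [600, 600, 600]     # perfectly linear weak scaling: flat time

for size, t in zip(dataset_tb, wall_clock_s):
    print(f"{size:>4} TB in {t} s -> {size / t:.3f} TB/s")
# throughput goes up 1x, 10x, 100x, i.e. linearly with problem size
```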

My point is that, IIUC, cubed is promising in that it might have good weak scaling properties even up to very large datasets, and therefore weak scaling is more relevant than strong scaling for analyzing cubed's performance. In other words, as a user I'm more excited by the prospect of being able to analyse extremely large datasets in a reasonable amount of time than by analyzing medium-sized datasets in a very small amount of time (which is still cool, but less important to me).

If we had perfectly linear weak scaling, it would mean I could call .compute() from my analysis notebook, and the result would take the same amount of time (and be just as likely to complete) regardless of the size of the dataset I was computing. To me this is what the ultimate goal for the pangeo-style stack should be.

The rest of my post was about me trying to reason through when and to what extent we might expect to see this linear weak scaling with cubed.

TomNicholas (Collaborator, Author) commented:

If this is at all useful then I can write it out in full sentences as a draft docs page perhaps?

tomwhite (Member) commented Jun 5, 2023

If this is at all useful then I can write it out in full sentences as a draft docs page perhaps?

Thanks for the explanation. That would be great!
