
Reasoning about scaling and performance #204

Closed
TomNicholas opened this issue Jun 5, 2023 · 4 comments · Fixed by #252
Labels: documentation, scaling

Comments

TomNicholas (Collaborator) commented Jun 5, 2023

I tried to write out some thoughts about cubed's performance and scaling. Perhaps this could form the basis of another page of the docs, similar to how dask has pages on best practices and understanding performance. Obviously a lot of this will change as more optimizations are introduced / if we discover other reasons why scaling does not behave as expected.

Aim of writing this out:

  • Understand it better myself
  • Provide mental model for diagnosing performance
  • Explain to users what performance they should expect
  • Motivate asv benchmarks
  • Help us to ensure real scaling matches theoretical scaling
  • Material for docs page / future blog post

Types of scaling to explain:

  • Horizontal versus vertical scaling
  • Weak scaling versus strong scaling

Scaling of a single step

Theoretical scaling:

  • Use rechunk as an example
  • Limited by concurrent writes to Zarr
  • Assuming the serverless service provides effectively infinite workers, then...
  • Weak scaling should be totally linear
  • i.e. a larger problem completes in the same time given proportionally larger resources (see the sketch below)
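
For intuition, here is a back-of-the-envelope toy model of that claim (an illustration of the idea only, not cubed's actual scheduling logic): if every output chunk is written by its own independent task and the service supplies at least one worker per task, the wall-clock time of the step is roughly one task's duration, however big the array is.

```python
import math

# Toy model of a single embarrassingly parallel step (e.g. a rechunk where
# each output chunk is written by an independent task). Illustrative only -
# this is not how cubed's executors actually schedule work.

def step_time(n_tasks: int, n_workers: int, task_seconds: float = 5.0) -> float:
    """Idealised wall-clock time: tasks run in waves of `n_workers` at a time."""
    return math.ceil(n_tasks / n_workers) * task_seconds

# Weak scaling: problem size and worker count grow together, so the number of
# waves (and hence the wall-clock time) stays constant.
for n_chunks in [100, 1_000, 10_000]:
    print(n_chunks, step_time(n_tasks=n_chunks, n_workers=n_chunks))
# prints the same time at every size, i.e. perfectly linear weak scaling
```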

Realistic scaling considerations:

  • Achieving this actually requires parallelism, so it won't happen with the single-threaded executor
  • Weak scaling requires at least as many workers as output chunks (might need to set executor config for big problems)
  • With fewer workers than output chunks, strong scaling should be totally linear; once there are more workers than output chunks, adding more gives no further speedup
  • Stragglers can dominate the completion time - mitigate by turning on backups (see the toy model after this list)
  • Failures, once restarted, behave basically like stragglers
  • Worker start-up time would delay completion but not actually affect scaling properties, because all workers have the same start-up time regardless of problem size
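
To see why stragglers matter so much, here is a tiny simulation with made-up task durations (and a deliberately simplified notion of "backup" - this is not cubed's actual backup mechanism): a step only finishes when its slowest task does, so a single slow task sets the wall-clock time, and racing it against a backup copy restores the expected completion time.

```python
import random

# Toy illustration of stragglers and backup tasks. Made-up durations;
# not cubed's actual backup implementation.
random.seed(0)

task_times = [random.uniform(4.0, 6.0) for _ in range(1000)]  # typical tasks
task_times[42] = 120.0                                        # one straggler

# The step finishes when the slowest task finishes.
print("without backups:", max(task_times))   # ~120 s, set by the straggler

# With a backup, the straggler races a second attempt that behaves like a
# typical task, and the step uses whichever copy finishes first.
task_times[42] = min(task_times[42], random.uniform(4.0, 6.0))
print("with a backup:  ", max(task_times))   # back to ~6 s
```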

Scaling of a multi-step plan

Multiple pipelines

  • i.e. two separate arrays you ask to compute simultaneously
  • Just requires enough workers for both pipelines to keep the same weak scaling behaviour (a sketch follows this list)
  • The same logic applies to two arrays which feed into a single array (or vice versa)
  • However, independent pipelines currently won't compute in parallel on all executors (Execute tasks for independent pipelines in parallel #80)
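
A quick extension of the same toy model (again purely illustrative, not cubed's scheduler) shows why parallel execution of independent pipelines matters: with a worker pool sized for both pipelines' output chunks, running them together costs one wave, whereas running them one after the other (the current behaviour on some executors, per #80) doubles the wall-clock time.

```python
import math

# Toy model: two independent pipelines submitted together. Illustrative only.

def step_time(n_tasks: int, n_workers: int, task_seconds: float = 5.0) -> float:
    return math.ceil(n_tasks / n_workers) * task_seconds

pipeline_a, pipeline_b = 400, 600      # output chunks in each pipeline
workers = pipeline_a + pipeline_b      # pool sized for both at once

# Executed in parallel (what #80 is about): one wave, same time as either alone.
print(step_time(pipeline_a + pipeline_b, workers))                      # 5.0

# Executed serially (current behaviour on some executors): twice the time.
print(step_time(pipeline_a, workers) + step_time(pipeline_b, workers))  # 10.0
```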

Executor-specific considerations

  • Worker start-up time is much faster on some executors than others
  • Different limits on the maximum number of workers
  • If you use dask as an executor, its scheduler might do unintuitive things - does Beam have different properties?
TomNicholas added the documentation label Jun 5, 2023
tomwhite (Member) commented Jun 5, 2023

This is very useful!

Can you say what you mean by weak and strong scaling?

TomNicholas (Collaborator, Author) commented:

Can you say what you mean by weak and strong scaling?

I meant this (from Wikipedia):

  • Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
  • Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.
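
In the usual notation, writing T(N, p) for the time to solve a problem of total size N on p processors, those definitions amount to the standard textbook formulation (nothing cubed-specific):

```latex
% Strong scaling: fixed total problem size N, vary the processor count p
S_{\mathrm{strong}}(p) = \frac{T(N, 1)}{T(N, p)}, \qquad
E_{\mathrm{strong}}(p) = \frac{S_{\mathrm{strong}}(p)}{p}

% Weak scaling: fixed problem size per processor, so the total size is N \cdot p
E_{\mathrm{weak}}(p) = \frac{T(N, 1)}{T(N \cdot p, p)}
```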

By this I meant essentially:

  • Strong scaling would be Cubed's performance (i.e. total execution time) vs number of workers on a fixed-size dataset
  • Weak scaling would be Cubed's performance (i.e. total throughput) vs dataset size: if total execution time stays constant as the dataset grows, then total throughput is scaling linearly with problem size (see the worked numbers below)
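
For concreteness, some deliberately made-up numbers (not measurements of cubed) showing how a constant execution time translates into throughput growing linearly with dataset size:

```python
# Hypothetical, made-up numbers purely to illustrate the definition above -
# not measurements of cubed.
dataset_tb = [1, 10, 100]          # problem sizes (TB)
wall_clock_s = [600, 600, 600]     # perfectly linear weak scaling: flat time

for size, t in zip(dataset_tb, wall_clock_s):
    print(f"{size:>4} TB in {t} s -> {size / t:.3f} TB/s")
# throughput goes up 1x, 10x, 100x, i.e. linearly with problem size
```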

My point is that, IIUC, cubed is promising in that it might have good weak scaling properties even up to very large datasets, and therefore weak scaling is more relevant than strong scaling for analyzing cubed's performance. In other words, as a user I'm more excited by the prospect of being able to analyse extremely large datasets in a reasonable amount of time than by analyzing medium-sized datasets in a very small amount of time (which is still cool, but less important to me).

If we had perfectly linear weak scaling, it would mean I could call .compute() from my analysis notebook, and the result would take the same amount of time (and be just as likely to complete) regardless of the size of the dataset I was computing. To me this is what the ultimate goal for the pangeo-style stack should be.

The rest of my post was about me trying to reason through when and to what extent we might expect to see this linear weak scaling with cubed.

TomNicholas (Collaborator, Author) commented:

If this is at all useful then I can write it out in full sentences as a draft docs page perhaps?

tomwhite (Member) commented Jun 5, 2023

If this is at all useful then I can write it out in full sentences as a draft docs page perhaps?

Thanks for the explanation. That would be great!
