
Automation of benchmark comparison #292

Closed · fjetter opened this issue Aug 31, 2022 · 14 comments · Fixed by #309 or #314

fjetter (Member) commented Aug 31, 2022

Requirements

  • Be able to compare two distinct git handles (commit, branch, etc.) of dask/dask and/or dask/distributed; it must also be possible to target forks
  • It is fine if it is merely possible to use a commit OR a branch OR a PR etc. instead of generic git refs
  • Tests are used from the current commit / PR of the coiled-runtime
  • Initial iteration must be extendable to compare N commits
  • One or more CI jobs are spawned that run all available benchmarks and collect a summary
  • The CI job collects all data that is also collected by the ordinary benchmark tests
  • The CI job produces a summary / report comparing these single point numbers

Nice to haves

  • Varying config options
  • Select a subset of the benchmarks to run
  • Varying environment
  • Varying test code for different git handles (if a prototype change breaks an API, for example)
  • We also want to collect memory sampling, performance reports, etc.

crusaderky (Contributor) commented Sep 1, 2022

I went through the codebase. @jrbourbeau @ncclementi @ian-r-rose please correct me if I say something wrong or impossible.

The Tests github action does the following:

  1. build the latest conda package for coiled-runtime
  2. create coiled software environments for latest, 0.1.0, and 0.0.4, in cartesian product with different versions of python
  3. run the full battery of tests against the coiled environments created at step 2
  4. collect and compare results
  5. destroy coiled software environments

If that's the case, my plan is fairly straightforward:

  1. Create additional software environments
  2. Create a new static HTML report to compare whatever was run in the battery of tests

Sample workflow

I want to test dask/distributed#6614.

I create a new branch of coiled-runtime, in my own fork: crusaderky/coiled-runtime/AB_6614

In the new branch, I create the following files. Each file is a bunch of kwargs to coiled.create_software_environment:

AB_parameters/6614_before.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@f07f384eb0b8b6bad7518e8535e42fdba106ed4b

AB_parameters/6614_inf.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@dd81b424971e81616e1a52fa09ce4698a5002d41

AB_parameters/6614_12.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@dd81b424971e81616e1a52fa09ce4698a5002d41
environ:
    DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION: "1.2"

Notes

  • I could have used the nightly build of coiled-runtime instead
  • I did not install dask 2022.8.1 using conda, because it would have conflicted with coiled-runtime=0.1.0. However, using pip allows you to bypass the constraints imposed by the conda recipe.
  • I could have added a PR that hasn't been merged yet by simply creating a branch on my fork https://github.com/crusaderky/distributed
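
For that last note, the pip requirement would simply point at a branch on the fork instead of a pinned commit; a hypothetical example (the branch name is illustrative):

pip:
    - dask==2022.8.1
    - git+https://github.com/crusaderky/distributed@my-prototype-branch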

I then have to update .github/workflows/tests.yaml:

  • In the (not yet existing) section "additional_environments", I change
matrix:
  AB_environments: []

to

matrix:
  AB_environments:
    - 6614_before
    - 6614_inf
    - 6614_12
  • In the benchmarks, runtime, and stability sections, I update the matrix to whatever I'm interested in:
    e.g. from
  benchmarks:
    ...
    strategy:
      fail-fast: false
      matrix:
        os: ["ubuntu-latest"]
        python-version: ["3.9"]
        runtime-version: ["latest", "0.0.4", "0.1.0"]

to

      matrix:
        os: ["ubuntu-latest"]
        python-version: ["3.9"]
        runtime-version: [AB_6614_before, AB_6614_inf, AB_6614_12]

Note: I would love not to have to touch tests.yaml and instead automatically generate those lists from a file. To my knowledge it's not possible, but I will investigate.
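
One avenue to investigate: GitHub Actions can build a matrix from a job output via fromJSON. A rough, untested sketch, assuming a hypothetical AB_environments.json file at the repository root:

jobs:
  list_ab_environments:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.list.outputs.environments }}
    steps:
      - uses: actions/checkout@v3
      - id: list
        # AB_environments.json (hypothetical) would contain e.g. ["6614_before", "6614_inf", "6614_12"]
        run: echo "environments=$(cat AB_environments.json)" >> $GITHUB_OUTPUT

  benchmarks:
    needs: list_ab_environments
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        runtime-version: ${{ fromJSON(needs.list_ab_environments.outputs.environments) }}
    steps:
      - run: echo "run benchmarks for ${{ matrix.runtime-version }}"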

Finally, I submit a PR named [DNM] A/B testing: distributed#6614 and read the report out of the PR's artifacts.
When I'm done I close the PR without merging.

The static report will show, side by side, the time, peak memory, and average memory for each test from each test suite that has run and records that measurement.

Further improvements, out of scope for this ticket:

Requirements

All requirements in the OP are satisfied.

Nice to haves

  • Varying config options
    Yep!
  • Select a subset of the benchmarks to run
    You can completely disable benchmarks, runtime, or stability. You can additionally cherry-pick tests by editing the lines in tests.yaml that call pytest (see the sketch after this list).
  • Varying environment
    Yep!
  • Varying test code for different git handles (if a prototype change breaks an API, for example)
    Yep!
  • We also want to collect memory sampling, performance reports, etc.
    MemorySampler timeseries are definitely a possibility for a future iteration but are out of scope for now.
  • Automatically highlight major changes
    Out of scope for now; something straightforward to implement in the future
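
For the cherry-picking point above, a hypothetical step in tests.yaml (paths and test names are illustrative) could narrow the run with a standard pytest -k expression:

- name: Benchmarking tests
  run: |
    # -k selects tests by keyword expression; the names below are made up for illustration
    python -m pytest tests/benchmarks -k "test_anom_mean or test_shuffle"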

Known issues and limitations

  • I don't know if it's straightforward, or even possible, to use the latest coiled-runtime (meaning, from the same PR) instead of the nightly build. This makes it harder to e.g. benchmark two different versions of a dependency where the pip package is not quite interchangeable with the conda one, like numpy or pandas.

fjetter (Member, Author) commented Sep 1, 2022

> The Tests github action does the following:

The tests workflow is already pretty complicated. I wouldn't mind writing something new. If it actually is easier to develop within this file, I'm OK with that.
If we want to reuse steps, we can probably factor them out into a custom GitHub action (AFAIK this can be done within the same repo) and use it from different yml files, as in the sketch below.
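
A minimal sketch of a same-repo composite action (the path, name, and steps are hypothetical):

# .github/actions/setup-benchmarks/action.yml (hypothetical)
name: setup-benchmarks
description: Setup steps shared by several workflows
runs:
  using: composite
  steps:
    - name: Install test environment
      shell: bash
      run: conda env update -n test -f ci/environment.yml  # illustrative path

# a workflow can then reuse it (after actions/checkout) with:
#   - uses: ./.github/actions/setup-benchmarks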

Would love to get @ian-r-rose thoughts on this

> I create a new branch of coiled-runtime, in my own fork: crusaderky/coiled-runtime/AB_6614

Nit: if you are running on your fork, there will be no credentials. Right now, we have to run everything on the main repo. I don't think this impacts your proposal.

> Note: I would love not to have to touch tests.yaml and instead automatically generate those lists from a file. To my knowledge it's not possible, but I will investigate.

Agreed, but I think that's an OK workflow for now.

fjetter (Member, Author) commented Sep 1, 2022

> I don't know if it's straightforward, or even possible, to use the latest coiled-runtime (meaning, from the same PR) instead of the nightly build. This makes it harder to e.g. benchmark two different versions of a dependency where the pip package is not quite interchangeable with the conda one, like numpy or pandas.

Looks like the checkout action supports checking out anything you want; it just defaults to the current commit:

https://github.com/actions/checkout
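
For reference, a minimal sketch (the repository and ref shown are illustrative; both inputs default to the current repo and commit):

- uses: actions/checkout@v3
  with:
    repository: coiled/coiled-runtime
    ref: my-feature-branch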

I suggest doing this in a follow-up.

fjetter (Member, Author) commented Sep 1, 2022

> Finally, I submit a PR named [DNM] A/B testing: distributed#6614 and read the report out of the PR's artifacts.
> When I'm done I close the PR without merging.

Nice follow-up: the CI job posts the final results to the PR and closes the PR automatically.
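
A rough sketch of such a step using the gh CLI (assuming the workflow runs on a pull_request event, the report has been written to report.md, and the job token has pull-request write permission):

- name: Post report and close the PR
  if: github.event_name == 'pull_request'
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    # gh is preinstalled on GitHub-hosted runners
    gh pr comment ${{ github.event.pull_request.number }} -R ${{ github.repository }} --body-file report.md
    gh pr close ${{ github.event.pull_request.number }} -R ${{ github.repository }}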

ian-r-rose (Contributor) commented:

> I went through the codebase. @jrbourbeau @ncclementi @ian-r-rose please correct me if I say something wrong or impossible.
>
> The Tests github action does the following:
>
> 1. build the `latest` conda package for `coiled-runtime`
>
> 2. create coiled software environments for `latest`, `0.1.0`, and `0.0.4`, in cartesian product with different versions of python

I would expect these to be much simpler once we merge #235 and live in a package_sync world. Then you would only have to worry about installing the correct environment on the CI instance, and the cluster will follow. I think it's probably not worth the time to generate new coiled software environments for every CI build right now.

> 3. run the full battery of tests against the coiled environments created at step 2
>
> 4. collect and compare results

No objection here :)

> 5. destroy coiled software environments

This would also be unnecessary in a package_sync situation.

> If that's the case, my plan is fairly straightforward:
>
> 1. Create additional software environments
>
> 2. Create a new static HTML report to compare whatever was run in the battery of tests

On this front, I pushed some WIP that I have found useful for looking at regressions in #294. You might be able to use that with minimal changes (it could probably use a bit of CLI work).

ian-r-rose (Contributor) commented:

> the tests workflow is already pretty complicated. I wouldn't mind writing something new. If it actually is easier to develop within this file, I'm OK.

I agree that it's quite complicated. However, there are two things that would simplify it dramatically:

  1. We merge Package Sync #235, which already removes several hundred lines of yaml, and
  2. We adopt the proposal in Simplify CI test workflow #279 and just have a single "test" job

Would that allay your concerns about having something in tests.yml?

ncclementi (Contributor) commented:

@ian-r-rose beat me to it, but I was about to say just that. I think once Package sync is in and we simplify CI, things should get much simpler to work with.

crusaderky changed the title from "[Draft/WIP] Automation of benchmark comparison" to "Automation of benchmark comparison" on Sep 1, 2022
crusaderky self-assigned this on Sep 1, 2022

ian-r-rose (Contributor) commented:

@crusaderky I'd recommend holding off on making major time investments into changing the software environment setup process until #235 is in. Or building on top of that PR.

crusaderky (Contributor) commented:

I started writing #296. I'll pause and wait for #235 to go in first. Do you have a rough ETA?

ian-r-rose (Contributor) commented:

> Do you have a rough ETA?

I'd like to get it in today or tomorrow. Before we merge it, do you have any concerns about using package_sync for this effort?

gjoseph92 (Contributor) commented Sep 1, 2022

> Varying test code for different git handles (if a prototype change breaks an API, for example)

How is this supported? Let's say I wanted to test dask/dask#9446 while it's still a prototype PR.

I want to run 3 cases:

  • Low cardinality, against the PR:

    def test_cardinality(small_cluster):
        ...
        result_A = dfA.groupby("x", cardinality=0.1).mean()
        result_B = dfB.groupby("x", cardinality=0.3).mean()
  • High cardinality, against the PR:

    def test_cardinality(small_cluster):
        ...
        result_A = dfA.groupby("x", cardinality=0.8).mean()
        result_B = dfB.groupby("x", cardinality=0.9).mean()
  • Default, against main (which doesn't support the cardinality argument):

    def test_cardinality(small_cluster):
        ...
        result_A = dfA.groupby("x").mean()
        result_B = dfB.groupby("x").mean()

How would you express the fact that different test code needs to run in different cases? With this approach, it seems like I'd have to add some way into my dask fork to express the two different cardinalities via environment variables.

Or another case: what if I wanted to benchmark the above code, also on 3 different cluster sizes?

gjoseph92 (Contributor) commented:

Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.

You'd create a branch for each variation you want to test. In that branch, you can change the environment.yaml however you wish to use different dependencies for that case; package_sync will take care of installing them.

You also can change all code however you need to between branches. This allows you to:

  • set different cluster environment variables/configs in different cases
  • use different cluster sizes in different cases
  • use different driver test code in different cases (covering the cardinality example above)

Pros:

  • extremely flexible
  • lightweight

Cons:

  • extremely flexible
  • lightweight

The downside, of course, is that you don't have this nice YAML file describing exactly what's different between the cases, which could be nice for dashboarding. But inevitably, that YAML is going to have to grow and grow to encompass more types of things you want to compare, and eventually require its own documentation and tests. (The alternative is to push all that complexity into the forked code you're running, which I think is a bad idea. Maintaining a branch of a branch, which adds hacks to alter behavior based on environment variables, would be brittle and frustrating.)

Making the changes in different branches could become tedious, especially with a large parameter space. I could certainly see making a script which takes a YAML configuration file and sets up all the branches for you :)

There's a question of whether we want to have thousands of branches in this repo. (We certainly wouldn't open PRs for all of them—just pushing would have to be sufficient.) Creating a shared git config to ignore fetching refspecs following a pattern (like runbench/*) would alleviate any eventual git performance issue here.


The workflow would probably be something like:

  1. Create and push your branches. Don't open PRs for them.

  2. Create another branch off main named compare/<name>, add a bench-compare.txt file containing the list of branch names (or better, commit hashes) you want to compare. Open a PR for this. Two options for what this triggers:

    1. This kicks off a workflow which runs the benchmarks for all the listed branches in parallel (in separate actions). It collects the results and renders the comparison dashboard

    2. Pushing the other branches already triggered workflows for them (by matching the runbench/* branch naming pattern). This workflow waits until they're done, collects their artifacts, and renders the comparison dashboard (see the sketch after this list).

      If it's feasible, I prefer this approach, because it lets you:

      1. quickly add new cases without re-running everything
      2. quickly fix an error in just one case
      3. instantly compare existing cases—or turn two different commits of the same case into different cases
      4. easily run the dashboard rendering script locally (just point it at the GHA URLs for your branches)
  3. Update the branches as necessary. Change the commit hashes in bench-compare.txt and push updates to compare/<name>.

  4. Close the PR without merging once satisfied.
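
A rough sketch of how the triggers for this scheme could be wired up (the workflow file names, branch pattern, and bench-compare.txt convention are all hypothetical):

# .github/workflows/runbench.yml (hypothetical): benchmark every pushed case branch
on:
  push:
    branches:
      - 'runbench/**'
---
# .github/workflows/compare.yml (hypothetical): runs on the compare/<name> PR,
# waits for the case branches listed in bench-compare.txt, then renders the dashboard
on:
  pull_request:
    paths:
      - 'bench-compare.txt'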

crusaderky (Contributor) commented Sep 2, 2022

> How would you express the fact that different test code needs to run in different cases? With this approach, it seems like I'd have to add some way into my dask fork to express the two different cardinalities via environment variables.

You'd do what you always do when running tests against different versions of a dependency:

@pytest.mark.skipif(parse(dask.__version__) < parse("<whatever is generated by your commit>"),
                    reason="needs the prototype change")  # parse() from packaging.version

> Or another case: what if I wanted to benchmark the above code, also on 3 different cluster sizes?

Cluster size is hardcoded in the tests, and it varies from test to test, so you can't. In a future iteration we can revisit the tests so that they use an environment variable whenever sensible.

> Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.

This feels extremely complicated to me. So you'd have a "lead" PR which contains a list of git handles, and get everything from git twice?
It would also mean not having the performance comparison between various coiled runtime versions (not A/B testing), which is a really nice report to have.

fjetter (Member, Author) commented Sep 2, 2022

> How is this supported? Let's say I wanted to test dask/dask#9446 while it's still a prototype PR.

E.g. use a branch in dask that hardcodes a value, or one that reads it from an environment variable / config. Solving such a problem specifically for the case at hand is much simpler than building a generic infrastructure solution that can handle everything.

> Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.
>
> You'd create a branch for each variation you want to test. In that branch, you can change the environment.yaml however you wish to use different dependencies for that case; package_sync will take care of installing them.

I'm with Guido on this. Let's keep our first iteration simple. Once it is up and running, we can evaluate what features are still missing.
