
A/B test reports #309

Merged: 11 commits into main on Sep 9, 2022

Conversation

crusaderky
Contributor

@crusaderky crusaderky commented Sep 6, 2022

In scope

  • Generate dashboard in PRs and upload it as an artifact
  • General refactoring of dashboard.py
  • Bar chart by test
  • Bar chart by runtime (absolute measures)
  • Bar chart by runtime / baseline runtime

Out of scope

[Screenshot from 2022-09-07 22-13-14]

[Screenshot from 2022-09-07 22-13-34]

The plots highlight that the performance metrics are heavily affected by noise.

In some cases this looks like high variance between runs (which can be seen in the historical plots); in others, the difference is consistent throughout the history but is not justified by a difference in runtime, so I suspect there is some hidden warm-up cost that differs between the runs.

For example, latest and 0.1.0 only differ by the addition of pynvml, but the A/B charts comparing the two (top-right plot above) show major differences all over the place - which are all noise. The memory usage plots look just as jittery. Ideally this chart would be used to make the informed decision to finalize a new release (all bars negative or close to zero), and it is also Twitter gold - but at the moment it's not usable because of the noise in the underlying data.

On one hand, I can incorporate historical data into these plots to show variance brackets for the baseline; I will do so in a follow-up PR. However, such brackets will just tell us what we can already see in the historical graphs: that most tests swing wildly and that a +/- 50% change in runtime is nothing to be surprised about.
We need to sit down and find the best way to actually reduce the variance in runtime and memory usage.
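
For illustration, a variance bracket per test could be computed along these lines (a rough sketch, not part of this PR; the "fullname" and "runtime" column names follow dashboard.py, the data is made up):

import pandas as pd

# df_history: one row per (test, historical run); illustrative data only
df_history = pd.DataFrame(
    {
        "fullname": ["benchmarks/test_foo.py::test_foo"] * 4,
        "runtime": [30.1, 44.5, 29.8, 35.2],
    }
)
# Per-test mean +/- one standard deviation, to draw as a bracket behind the bars
brackets = df_history.groupby("fullname")["runtime"].agg(["mean", "std"])
brackets["low"] = brackets["mean"] - brackets["std"]
brackets["high"] = brackets["mean"] + brackets["std"]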

@crusaderky crusaderky self-assigned this Sep 6, 2022
@ian-r-rose
Contributor

@crusaderky it looks like you were hitting some permissions issues. Is it the same as #293? I've been unable to reproduce, but you might try reverting #240 to see if that helps

for test in tests:
    if (func := getattr(mod, test, None)) and callable(func):
        # FIXME missing decorators, namely @pytest.mark.parametrize
        source[fname[len("tests/") :] + "::" + test] = inspect.getsource(func)
Contributor

I'd rather continue to be defensive here with a try/except

Contributor Author

Why? It makes a lot more sense to me to have the whole test suite crash if someone adds a new import in a test file and doesn't add it to ci/environment-dashboard.yml.

Contributor

Because I view this code-scraping as a fairly fragile hack, and weird stuff can happen on import, especially when pytest is involved. For example, a pytest.skip at the module level raises an exception, and I don't want to force everyone to write non-idiomatic pytest code on account of this static report. To me, having a missing code snippet for a test is not a big deal.
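
Something along these lines, for illustration (a minimal sketch of the defensive pattern, reusing the names tests, mod, fname, source from the snippet above; in practice the try/except would also need to cover the import of mod, which is where a module-level pytest.skip would raise, and the logging call is an assumption, not the actual code):

import inspect
import logging

for test in tests:
    try:
        func = getattr(mod, test, None)
        if func and callable(func):
            # FIXME missing decorators, namely @pytest.mark.parametrize
            source[fname[len("tests/") :] + "::" + test] = inspect.getsource(func)
    except Exception:
        # Missing snippets are acceptable; skip tests whose source can't be extracted
        logging.warning("Could not extract source of %s::%s", fname, test, exc_info=True)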

Contributor

I'm also unhappy about having to add stuff to ci/environment-dashboard.yml in the first place just to be able to use inspect. If you have ideas for getting snippets without relying on importing, I'm certainly open to them. But to me this seemed preferable to trying to parse .py files.
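
For the record, the parsing route could look roughly like this (a sketch only, not what this PR does; the tests/ prefix handling mirrors the snippet above):

import ast
from pathlib import Path


def test_sources(fname: str) -> dict[str, str]:
    # Parse the file instead of importing it, so missing dependencies in
    # ci/environment-dashboard.yml or a module-level pytest.skip don't matter.
    code = Path(fname).read_text()
    out = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test_"):
            # Note: like the inspect version, this still misses decorators
            out[fname[len("tests/") :] + "::" + node.name] = ast.get_source_segment(code, node)
    return out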

Contributor Author

Partially reverted

dashboard.py Outdated
df = df.sort_values("runtime", key=runtime_sort_key_pd)
# Altair will re-sort alphabetically, unless it's explicitly asked to sort by
# something else. Could not find an option to NOT sort.
y = altair.Y("runtime", sort=altair.EncodingSortField(field="order"))
Contributor

Does sort=None work?

Contributor Author

it works
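
i.e. the encoding simplifies to something like this (sketch; only the "runtime" shorthand is taken from the snippet above):

import altair

# sort=None keeps the row order of the dataframe instead of
# re-sorting the axis alphabetically
y = altair.Y("runtime", sort=None)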

@crusaderky
Contributor Author

crusaderky commented Sep 6, 2022

@crusaderky it looks like you were hitting some permissions issues. Is it the same as #293? I've been unable to reproduce, but you might try reverting #240 to see if that helps

No, the issue was that

  • git@github.com:crusaderky/coiled-runtime.git works for pulls and pushes
  • git@github.com:coiled/coiled-runtime.git works for pulls, but not for pushes
  • https://github.com/coiled/coiled-runtime.git also works for pushes (but annoyingly it doesn't work in Gitkraken)

Comment on lines +464 to +467
- name: Download software environment assets
  uses: actions/download-artifact@v3
  with:
    name: software-environment-py3.9
Contributor

Where are you using this? Hopefully these assets will be going away soon

Contributor Author

In the following change (line 471)

@crusaderky
Contributor Author

Ready for review (assuming CI passes)

Member

@jrbourbeau jrbourbeau left a comment

@crusaderky it looks like there is a legitimate error in the static site generation script

Discovered 36 tests
Generating static/coiled-upstream-py3.10.html
Generating static/coiled-upstream-py3.9.html
Generating static/coiled-upstream-py3.8.html
Generating static/coiled-latest-py3.10.html
Generating static/coiled-latest-py3.9.html
Generating static/coiled-latest-py3.8.html
Generating static/coiled-0.1.0-py3.10.html
Generating static/coiled-0.1.0-py3.9.html
Generating static/coiled-0.1.0-py3.8.html
Generating static/coiled-0.0.4-py3.10.html
Generating static/coiled-0.0.4-py3.9.html
Generating static/coiled-0.0.4-py3.8.html
Generating static/coiled-0.0.3-py3.9.html
Generating static/coiled-0.0.3-py3.8.html
Generating static/coiled-0.0.3-py3.7.html
Generating static/AB_by_test.html
Generating static/AB_by_runtime.html
Traceback (most recent call last):
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 507, in <module>
    main()
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 497, in main
    align_to_baseline(latest_run, baseline),
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 80, in align_to_baseline
    df_baseline.set_index("fullname")
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 961, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1147, in _getitem_tuple
    return self._multi_take(tup)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1098, in _multi_take
    d = {
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1099, in <dictcomp>
    axis: self._get_listlike_indexer(key, axis)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1330, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5796, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5859, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['stability/test_shuffle.py::test_shuffle_simple', 'stability/test_array.py::test_rechunk_in_memory', 'stability/test_shuffle.py::test_shuffle_parquet'] not in index"

@crusaderky
Contributor Author

@crusaderky it looks like there is a legitimate error in the static site generation script

@jrbourbeau fixed now

@crusaderky
Contributor Author

@ncclementi @ian-r-rose I called my latest commit test-upstream; the setup job outputs "true" in its Check upstream task; however the upstream runtime is not there. Am I missing something?

@jrbourbeau
Member

I think things are working as expected. The upstream build should only install nightly versions of dask and distributed in the latest CI builds. For example, if you look at the linux Python 3.9 latest build on your latest commit (the test-upstream one), you'll see that the dev versions of dask and distributed are installed from the dask conda channel. Does that help clarify, or am I missing something?

@ncclementi
Contributor

ncclementi commented Sep 7, 2022

@crusaderky this is a bit of a naming design problem. What happened is that when you triggered test-upstream, the "latest" builds got the nightlies - you can see that the nightlies were installed here: https://github.com/coiled/coiled-runtime/runs/8237542627?check_suite_focus=true#step:5:20

That logic is here https://github.com/coiled/coiled-runtime/blob/1614ab58128e3e3dcfa7d213db2c88e7c55ac6d4/ci/create_latest_runtime_meta.py#L41-L49

@fjetter
Member

fjetter commented Sep 8, 2022

Re: noise of the samples
I believe that if we have historic variances we can apply some statistical significance analysis to the new results. This is something one typically uses a t-test for, to verify whether the new measurement is in agreement with the old one. That's basically a very simple quantity: t = (old_measurement - new_measurement) / sqrt(historic_variance / n_historic_data_points). If that value is large, it is a true signal (depending on how we define our confidence interval). @ncclementi I believe you wrote the regression detection script; you might have more insight here than I do.
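
As a quick sketch of that quantity (illustrative names, not from the benchmark code):

import math


def t_statistic(old_mean: float, new_measurement: float, historic_variance: float, n_historic: int) -> float:
    # Large |t| means the new measurement is unlikely to be explained by the
    # historic noise alone (subject to the chosen confidence level)
    return (old_mean - new_measurement) / math.sqrt(historic_variance / n_historic)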

We can also look into running some tests multiple times. In particular, the ones with very short runtimes (<30s) are likely more prone to noise, and we can afford to run them a couple of times.

Obviously, all of this is a follow-up topic. I think this is already a great first step.

@crusaderky crusaderky mentioned this pull request Sep 8, 2022
@crusaderky crusaderky changed the title A/B test report A/B test reports Sep 8, 2022
@wence-

wence- commented Sep 8, 2022

I believe that if we have historic variances we can apply some statistical significance analysis to the new results. This is something one typically uses a t-test for, to verify whether the new measurement is in agreement with the old one. That's basically a very simple quantity: t = (old_measurement - new_measurement) / sqrt(historic_variance / n_historic_data_points).

A t-test is reasonable if your null hypothesis is that the data are normally distributed; if you don't have that guarantee (for example, a long-tailed distribution due to outliers), then you're better off with a non-parametric test such as Kolmogorov-Smirnov (available as scipy.stats.kstest).
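
For example (a sketch with made-up numbers; recent scipy versions also accept a second sample in scipy.stats.kstest, which is then equivalent to scipy.stats.ks_2samp used here):

from scipy import stats

# Illustrative samples only: historical runtimes vs. runtimes from the new build
historical_runtimes = [31.2, 29.8, 35.0, 30.1, 44.5, 30.9]
new_runtimes = [33.0, 36.4, 35.8]

# Two-sample Kolmogorov-Smirnov test: no normality assumption
result = stats.ks_2samp(historical_runtimes, new_runtimes)
print(result.statistic, result.pvalue)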

@crusaderky
Contributor Author

A t-test is reasonable if your null hypothesis is that the data are normally distributed

Data is definitely not normally distributed: #315

@crusaderky crusaderky closed this Sep 8, 2022
@crusaderky crusaderky reopened this Sep 8, 2022
@fjetter
Member

fjetter commented Sep 8, 2022

A t-test is reasonable if your null hypothesis is that the data are normally distributed; if you don't have that guarantee (for example, a long-tailed distribution due to outliers), then you're better off with a non-parametric test such as Kolmogorov-Smirnov (available as scipy.stats.kstest).

I think we should discuss this in a dedicated issue. In most situations the normal distribution is a fair assumption. My worry about more complex tests is that we don't have a lot of measurements to start with; we're not actually comparing two distributions, since we often have only a single measurement.

Member

@jrbourbeau jrbourbeau left a comment

From what I can tell this looks like it's ready for a final review. @ian-r-rose do you think you'll have time to take a look at the dashboard.py changes? Or maybe you already have

There's been some discussion around merge conflicts with #235 but from what I can tell it appears they will be relatively minimal, so I'm okay merging this PR first if needed

Member

@fjetter fjetter left a comment

I only reviewed the code at a high level. If the plots turn out this way, I'm good with this.

@crusaderky crusaderky merged commit d748681 into main Sep 9, 2022
@crusaderky crusaderky deleted the barchart2 branch September 9, 2022 14:38
crusaderky added a commit that referenced this pull request Sep 9, 2022
hendrikmakait pushed a commit that referenced this pull request Sep 14, 2022

Successfully merging this pull request may close these issues.

Automation of benchmark comparison