
A/B test reports #309

Merged: 11 commits into main on Sep 9, 2022

Conversation

crusaderky
Contributor

@crusaderky crusaderky commented Sep 6, 2022

In scope

  • Generate dashboard in PRs and upload it as an artifact
  • General refactoring of dashboard.py
  • Bar chart by test
  • Bar chart by runtime (absolute measures)
  • Bar chart by runtime / baseline runtime

Out of scope

[Screenshot from 2022-09-07 22-13-14]

[Screenshot from 2022-09-07 22-13-34]

The plots highlight that the performance metrics are heavily affected by noise.

In some cases this looks like high variance between runs (which can be seen in the historical plots); in others, the difference is consistent throughout the history but is not justified by a difference in runtime, so I suspect there is some hidden warm-up cost that differs between the runs.

For example, latest and 0.1.0 only differ by the addition of pynvml, but the A/B charts comparing the two (top-right plot above) show major differences all over the place - which are all noise. The memory usage plots look just as jittery. Ideally this chart would be used to make the informed decision to finalize a new release (all bars negative or close to zero), and it is also Twitter gold - but at the moment it's not usable because of the noise in the underlying data.

On one hand, I can incorporate historical data into these plots to show variance brackets for the baseline; I will do so in a follow-up PR. However, such brackets will just tell us what we can already see in the historical graphs: that most tests swing wildly and that a +/- 50% change in runtime is nothing to be surprised about.
We need to sit down and find the best way to actually reduce the variance in runtime and memory usage.
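
For illustration, a variance bracket per test could be computed along these lines (a rough sketch, not part of this PR; the "fullname" and "runtime" column names follow dashboard.py, the data is made up):

import pandas as pd

# df_history: one row per (test, historical run); illustrative data only
df_history = pd.DataFrame(
    {
        "fullname": ["benchmarks/test_foo.py::test_foo"] * 4,
        "runtime": [30.1, 44.5, 29.8, 35.2],
    }
)
# Per-test mean +/- one standard deviation, to draw as a bracket behind the bars
brackets = df_history.groupby("fullname")["runtime"].agg(["mean", "std"])
brackets["low"] = brackets["mean"] - brackets["std"]
brackets["high"] = brackets["mean"] + brackets["std"]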

@crusaderky crusaderky self-assigned this Sep 6, 2022
@ian-r-rose
Contributor

@crusaderky it looks like you were hitting some permissions issues. Is it the same as #293? I've been unable to reproduce, but you might try reverting #240 to see if that helps

for test in tests:
    if (func := getattr(mod, test, None)) and callable(func):
        # FIXME missing decorators, namely @pytest.mark.parametrize
        source[fname[len("tests/") :] + "::" + test] = inspect.getsource(func)
Contributor

I'd rather continue to be defensive here with a try/except

Contributor Author

Why? It makes a lot more sense to me to have the whole test suite crash if someone adds a new import in a test file and doesn't add it to ci/environment-dashboard.yml.

Contributor

Because I view this code-scraping as a fairly fragile hack, and weird stuff can happen on import, especially when pytest is involved. For example, a pytest.skip at the module level raises an exception, and I don't want to force everyone to write non-idiomatic pytest code on account of this static report. To me, having a missing code snippet for a test is not a big deal.
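
Something along these lines, for illustration (a minimal sketch of the defensive pattern, reusing the names tests, mod, fname, source from the snippet above; in practice the try/except would also need to cover the import of mod, which is where a module-level pytest.skip would raise, and the logging call is an assumption, not the actual code):

import inspect
import logging

for test in tests:
    try:
        func = getattr(mod, test, None)
        if func and callable(func):
            # FIXME missing decorators, namely @pytest.mark.parametrize
            source[fname[len("tests/") :] + "::" + test] = inspect.getsource(func)
    except Exception:
        # Missing snippets are acceptable; skip tests whose source can't be extracted
        logging.warning("Could not extract source of %s::%s", fname, test, exc_info=True)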

Contributor

I'm also unhappy about having to add stuff to ci/environment-dashboard.yml in the first place just to be able to use inspect. If you have ideas for getting snippets without relying on importing, I'm certainly open to them. But to me this seemed preferable to trying to parse .py files.
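
For the record, the parsing route could look roughly like this (a sketch only, not what this PR does; the tests/ prefix handling mirrors the snippet above):

import ast
from pathlib import Path


def test_sources(fname: str) -> dict[str, str]:
    # Parse the file instead of importing it, so missing dependencies in
    # ci/environment-dashboard.yml or a module-level pytest.skip don't matter.
    code = Path(fname).read_text()
    out = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test_"):
            # Note: like the inspect version, this still misses decorators
            out[fname[len("tests/") :] + "::" + node.name] = ast.get_source_segment(code, node)
    return out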

Contributor Author

Partially reverted

dashboard.py Outdated
df = df.sort_values("runtime", key=runtime_sort_key_pd)
# Altair will re-sort alphabetically, unless it's explicitly asked to sort by
# something else. Could not find an option to NOT sort.
y = altair.Y("runtime", sort=altair.EncodingSortField(field="order"))
Contributor

Does sort=None work?

Contributor Author

it works
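
i.e. the encoding simplifies to something like this (sketch; only the "runtime" shorthand is taken from the snippet above):

import altair

# sort=None keeps the row order of the dataframe instead of
# re-sorting the axis alphabetically
y = altair.Y("runtime", sort=None)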

@crusaderky
Contributor Author

crusaderky commented Sep 6, 2022

@crusaderky it looks like you were hitting some permissions issues. Is it the same as #293? I've been unable to reproduce, but you might try reverting #240 to see if that helps

No, the issue was that

  • git@github.com:crusaderky/coiled-runtime.git works for pulls and pushes
  • git@github.com:coiled/coiled-runtime.git works for pulls, but not for pushes
  • https://github.com/coiled/coiled-runtime.git also works for pushes (but annoyingly it doesn't work in Gitkraken)

Comment on lines +464 to +467
- name: Download software environment assets
  uses: actions/download-artifact@v3
  with:
    name: software-environment-py3.9
Contributor

Where are you using this? Hopefully these assets will be going away soon

Contributor Author

In the following change (line 471)

@crusaderky
Contributor Author

Ready for review (assuming CI passes)

Member

@jrbourbeau jrbourbeau left a comment

@crusaderky it looks like there is a legitimate error in the static site generation script

Discovered 36 tests
Generating static/coiled-upstream-py3.10.html
Generating static/coiled-upstream-py3.9.html
Generating static/coiled-upstream-py3.8.html
Generating static/coiled-latest-py3.10.html
Generating static/coiled-latest-py3.9.html
Generating static/coiled-latest-py3.8.html
Generating static/coiled-0.1.0-py3.10.html
Generating static/coiled-0.1.0-py3.9.html
Generating static/coiled-0.1.0-py3.8.html
Generating static/coiled-0.0.4-py3.10.html
Generating static/coiled-0.0.4-py3.9.html
Generating static/coiled-0.0.4-py3.8.html
Generating static/coiled-0.0.3-py3.9.html
Generating static/coiled-0.0.3-py3.8.html
Generating static/coiled-0.0.3-py3.7.html
Generating static/AB_by_test.html
Generating static/AB_by_runtime.html
Traceback (most recent call last):
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 507, in <module>
    main()
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 497, in main
    align_to_baseline(latest_run, baseline),
  File "/home/runner/work/coiled-runtime/coiled-runtime/dashboard.py", line 80, in align_to_baseline
    df_baseline.set_index("fullname")
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 961, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1147, in _getitem_tuple
    return self._multi_take(tup)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1098, in _multi_take
    d = {
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1099, in <dictcomp>
    axis: self._get_listlike_indexer(key, axis)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexing.py", line 1330, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5796, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/share/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 5859, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['stability/test_shuffle.py::test_shuffle_simple', 'stability/test_array.py::test_rechunk_in_memory', 'stability/test_shuffle.py::test_shuffle_parquet'] not in index"

@crusaderky
Contributor Author

@crusaderky it looks like there is a legitimate error in the static site generation script

@jrbourbeau fixed now

@crusaderky
Contributor Author

@ncclementi @ian-r-rose I called my latest commit test-upstream; the setup job outputs "true" in its Check upstream task; however the upstream runtime is not there. Am I missing something?

@jrbourbeau
Member

I think things are working as expected. The upstream build should only install nightly versions of dask and distributed in the latest CI builds. For example, if you look at the linux Python 3.9 latest build on your latest commit (the test-upstream one), you'll see that the dev versions of dask and distributed are installed from the dask conda channel. Does that help clarify, or am I missing something?

@ncclementi
Contributor

ncclementi commented Sep 7, 2022

@crusaderky this is a bit of a naming design problem. What happened is that when you triggered test-upstream, the "latest" builds got the nightlies - you can see that the nightlies were installed here: https://github.com/coiled/coiled-runtime/runs/8237542627?check_suite_focus=true#step:5:20

That logic is here https://github.com/coiled/coiled-runtime/blob/1614ab58128e3e3dcfa7d213db2c88e7c55ac6d4/ci/create_latest_runtime_meta.py#L41-L49

@fjetter
Member

fjetter commented Sep 8, 2022

Re: noise of the samples
I believe that if we have historic variances we can apply some statistical significance analysis to the new results. This is something one typically uses a t-test for, to verify whether the new measurement is in agreement with the old one. That's basically a very simple quantity: t = (old_measurement - new_measurement) / sqrt(historic_variance / n_historic_data_points). If that value is large, it is a true signal (depending on how we define our confidence interval). @ncclementi I believe you wrote the regression detection script; you might have more insight here than I do.
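
As a quick sketch of that quantity (illustrative names, not from the benchmark code):

import math


def t_statistic(old_mean: float, new_measurement: float, historic_variance: float, n_historic: int) -> float:
    # Large |t| means the new measurement is unlikely to be explained by the
    # historic noise alone (subject to the chosen confidence level)
    return (old_mean - new_measurement) / math.sqrt(historic_variance / n_historic)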

We can also look into running some tests multiple times. In particular, the ones with very short runtimes (<30s) are likely more prone to noise, and we can afford to run them a couple of times.

Obviously, all of this is a follow-up topic. I think this is already a great first step.

@crusaderky crusaderky mentioned this pull request Sep 8, 2022
@crusaderky crusaderky changed the title A/B test report A/B test reports Sep 8, 2022
@wence-

wence- commented Sep 8, 2022

I believe that if we have historic variances we can apply some statistical significance analysis to the new results. This is something one typically uses a t-test for, to verify whether the new measurement is in agreement with the old one. That's basically a very simple quantity: t = (old_measurement - new_measurement) / sqrt(historic_variance / n_historic_data_points).

A t-test is reasonable if your null hypothesis is that the data are normally distributed; if you don't have that guarantee (for example, a long-tailed distribution due to outliers), then you're better off with a non-parametric test such as Kolmogorov-Smirnov (available as scipy.stats.kstest).
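
For example (a sketch with made-up numbers; recent scipy versions also accept a second sample in scipy.stats.kstest, which is then equivalent to scipy.stats.ks_2samp used here):

from scipy import stats

# Illustrative samples only: historical runtimes vs. runtimes from the new build
historical_runtimes = [31.2, 29.8, 35.0, 30.1, 44.5, 30.9]
new_runtimes = [33.0, 36.4, 35.8]

# Two-sample Kolmogorov-Smirnov test: no normality assumption
result = stats.ks_2samp(historical_runtimes, new_runtimes)
print(result.statistic, result.pvalue)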

@crusaderky
Contributor Author

A t-test is reasonable if your null hypothesis is that the data are normally distributed

Data is definitely not normally distributed: #315

@crusaderky crusaderky closed this Sep 8, 2022
@crusaderky crusaderky reopened this Sep 8, 2022
@fjetter
Member

fjetter commented Sep 8, 2022

A t-test is reasonable if your null hypothesis is that the data are normally distributed; if you don't have that guarantee (for example, a long-tailed distribution due to outliers), then you're better off with a non-parametric test such as Kolmogorov-Smirnov (available as scipy.stats.kstest).

I think we should discuss this in a dedicated issue. In most situations the normal distribution is a fair assumption. My worry about more complex tests is that we don't have a lot of measurements to start with; we're not actually comparing two distributions, since we often have only a single measurement.

Member

@jrbourbeau jrbourbeau left a comment

From what I can tell this looks like it's ready for a final review. @ian-r-rose do you think you'll have time to take a look at the dashboard.py changes? Or maybe you already have

There's been some discussion around merge conflicts with #235 but from what I can tell it appears they will be relatively minimal, so I'm okay merging this PR first if needed

Member

@fjetter fjetter left a comment

I only reviewed the code at a high level. If the plots turn out this way, I'm good with this.

@crusaderky crusaderky merged commit d748681 into main Sep 9, 2022
@crusaderky crusaderky deleted the barchart2 branch September 9, 2022 14:38
crusaderky added a commit that referenced this pull request Sep 9, 2022
hendrikmakait pushed a commit that referenced this pull request Sep 14, 2022

Successfully merging this pull request may close these issues.

Automation of benchmark comparison