
Package Sync #235

Merged: 55 commits into main from irose/package-sync-as-default on Sep 15, 2022

Conversation

@ian-r-rose (Contributor) commented Aug 9, 2022

Pushing #222 a bit further to see if anything breaks. This:

  1. Removes a couple of controlling environment variables, which should simplify local development as well as CI workflows. Specifically, TEST_UPSTREAM has been folded into COILED_RUNTIME_VERSION (which can be a version number, "upstream", or "latest"), and COILED_SOFTWARE_VERSION is no longer necessary in a package_sync universe (see the sketch after this list).
  2. Removes the setup/cleanup jobs. This also means fewer artifacts are passed around between jobs, which is nice.
  3. Runs "upstream" on pull requests.

@ian-r-rose force-pushed the irose/package-sync-as-default branch from 6e357d3 to 61effde on August 10, 2022 00:01
@jrbourbeau mentioned this pull request Aug 10, 2022
@ian-r-rose (Contributor Author)

It seems that something did break: package_sync doesn't yet support direct URLs to conda packages, which we currently use to point to nightly dask/distributed builds.
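
For reference, the kind of dependency that trips this up is a direct artifact URL in an environment file, along these lines (the URL is a placeholder, not an actual pinned build):

    # environment.yml excerpt (placeholder URL, for illustration only)
    dependencies:
      - https://conda.anaconda.org/dask/label/dev/linux-64/<nightly-build>.tar.bz2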

@ian-r-rose self-assigned this Aug 11, 2022
shughes-uk and others added 5 commits August 12, 2022 13:42

…software environments. The TEST_UPSTREAM and COILED_SOFTWARE_VERSION environment variables are now gone; COILED_RUNTIME_VERSION is the only controlling environment variable, and it can take a version number, "upstream", or "latest".
@ian-r-rose force-pushed the irose/package-sync-as-default branch from 61effde to b1c4c36 on August 12, 2022 20:42
@shughes-uk (Contributor)

I'm happy with the state of package sync and can give a solid thumbs up to merging this!

@jrbourbeau (Member)

Just pushed an empty commit to rerun CI (we were getting AWS limit errors in the most recent CI run)
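
For reference, the usual incantation for that kind of no-op retrigger is:

    git commit --allow-empty -m "Retrigger CI"
    git push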

@jrbourbeau (Member)

We're getting

 E               coiled.errors.ClusterCreationError: Cluster status is error (reason: Scheduler Stopped -> Errors installing conda packages:
E               package pyarrow-8.0.0-py310hbc2c91e_3_cuda requires arrow-cpp 8.0.0 py310h1f609fe_3_cuda, but none of the providers can be installed) (cluster_id: 50389)

in this CI build cc @shughes-uk

@shughes-uk (Contributor) commented Aug 16, 2022

Having to boot up my x86 MacBook for the first time in months to solve this one. Not quite sure what's going on, and I couldn't replicate it on my ARM MacBook (using a slightly stripped-down version of the env export from CI).

@shughes-uk (Contributor)

The source seems to be a pin on pyarrow 8. With that pin, the environment generated on macOS can't solve properly for linux-64. Not sure which dep is responsible; looking now.

@shughes-uk (Contributor)

Got it working! 🥂

@ian-r-rose (Contributor Author)

> Got it working! 🥂

Woo!

@ian-r-rose force-pushed the irose/package-sync-as-default branch from 897531f to d0ad7f1 on August 16, 2022 22:22
@shughes-uk (Contributor)

I'm suspicious that's some kind of transient network error. I'll have to ensure these errors get printed out properly, as it looks like this one got eaten.

@ian-r-rose (Contributor Author) left a comment

This seems to be all green, though I would note that there was a solid amount of mamba/conda/mysterious flakiness that makes me a bit nervous.

@@ -446,7 +315,7 @@ jobs:
   static-site:
     needs: process-results
     # Always generate the site, as this can be skipped even if an indirect dependency fails (like a test run)
-    if: always() && github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
+    if: always()
@ian-r-rose (Contributor Author)

Changed things so we always generate the static site and upload it as an artifact, so that if a regression is detected it's easier to download the artifacts and inspect them (cf. #251).

@shughes-uk (Contributor)

> This seems to be all green, though I would note that there was a solid amount of mamba/conda/mysterious flakiness that makes me a bit nervous.

I'll add some changes so I can see the flakiness server-side and hopefully push fixes fast. Yell at me in the meantime.

@jrbourbeau (Member)

Thanks for the empty commit @ian-r-rose -- I think sanity checks like that are a good thing to do at this point. @shughes-uk,

ERROR    coiled:states.py:134    | Worker Process         | test_coiled-5b0e15b6-worker-3ee9bf36b8         | error      at 16:48:59 (UTC) | Errors installing conda packages:
nothing provides requested tzdata 2022rc0

is currently popping up.

@shughes-uk (Contributor)

Interesting. That package was brand new yesterday. I think you might have had the bad luck of trying things right after the package's release, when it had propagated to the CDN edge node serving CI but not to the one used by the cluster. If I'm right, re-running now should work.

@shughes-uk (Contributor)

I was not correct; it looks like an issue with the way the version gets parsed!
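
If the parsing goes through PEP 440 machinery such as packaging.version (an assumption on my part, not confirmed here), that would explain the mangled tzdata version: PEP 440 treats a trailing "c" as an alias for the "rc" pre-release marker, so tzdata's calendar-style version "2022c" doesn't round-trip:

    from packaging.version import Version

    # tzdata uses calendar-letter versions like "2022c", but PEP 440
    # normalizes a trailing "c" to the "rc" pre-release marker, so the
    # stringified version is one conda has never heard of.
    v = Version("2022c")
    print(v)                        # 2022rc0
    print(v == Version("2022rc0"))  # True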

@shughes-uk (Contributor)

> @shughes-uk is package_sync downloading and installing in parallel right now? I don't remember.

It uses micromamba, so yes.

@ncclementi (Contributor)

Not sure if this matters, but I noticed that in the package sync list we have two dask versions installed. Mentioning it for visibility:

- dask=2022.8.1
- dask=2022.8.2a220830

@ian-r-rose (Contributor Author)

> Not sure if this matters, but I noticed that in the package sync list we have two dask versions installed.

Good eye -- I think @shughes-uk pushed a fix for that yesterday afternoon; are you still seeing it?

@crusaderky mentioned this pull request Sep 8, 2022
@ian-r-rose (Contributor Author) commented Sep 9, 2022

I'm doing a census of the failures and reported regressions for this branch during the last week. I'm deliberately excluding regressions related to cluster startup time, as those are expected.

Regressions, main

  1. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
  2. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'
  3. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'
  4. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'

Regressions, package_sync

  1. Link
    • coiled-0.0.4-py3.9 - test_quickstart_parquet
    • coiled-0.0.4-py3.10 - test_shuffle_simple
    • coiled-latest-py3.10 - test_shuffle_parquet
    • coiled-upstream-py3.9 - test_read_spark_generated_data
    • coiled-0.1.0-py3.10 - test_shuffle_parquet
    • coiled-0.1.0-py3.10 - test_shuffle_simple
    • coiled-0.1.0-py3.9 - test_q7[5 GB (parquet)]
    • coiled-0.1.0-py3.9 - test_read_spark_generated_data
  2. Link
    • 'coiled-0.0.4-py3.9', name = 'test_quickstart_parquet'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_double_diff'
    • 'coiled-upstream-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[5 GB (parquet)]'
  3. Link
    • 'coiled-0.0.4-py3.9', name = 'test_jobqueue'
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (We understand this one! Related to co-assignment)
    • 'coiled-upstream-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-upstream-py3.9', name = 'test_jobqueue'
    • 'coiled-upstream-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.1.0-py3.9', name = 'test_q1[0.5 GB (csv)]'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_xgboost_distributed_training'
  4. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q9[0.5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_basic_sum'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align'
    • 'coiled-upstream-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_anom_mean'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[0.5 GB (csv)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  5. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.9', name = 'test_q4[5 GB (parquet)]'
    • 'coiled-latest-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (Again, we understand this one)
    • 'coiled-upstream-py3.9', name = 'test_jobqueue'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  6. Link -- Many test failures, possibly related to some uptime issues with s3?
  7. Link
    • 'coiled-0.0.4-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.0.4-py3.9', name = 'test_download_throughput[s3fs]'
    • 'coiled-0.0.4-py3.9', name = 'test_write_wide_data'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.9', name = 'test_write_wide_data'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (again understood)
    • 'coiled-upstream-py3.9', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.1.0-py3.9', name = 'test_read_hive_partitioned_data'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle'
    • 'coiled-0.1.0-py3.9', name = 'test_write_wide_data'
  8. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_quickstart_parquet'
    • 'coiled-latest-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.0.4-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-0.0.4-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  9. Link
    • 'coiled-0.0.4-py3.9', name = 'test_vorticity' (This is weird! Why is this one showing up in 0.0.4?)
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_q5[5 GB (parquet)]'
    • 'coiled-upstream-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_dataframe_align' (This is very weird! Why is this one showing up in 0.1.0?)
    • 'coiled-0.1.0-py3.9', name = 'test_q1[5 GB (parquet)]'
  10. Link
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_q7[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_read_hive_partitioned_data'
    • 'coiled-upstream-py3.9', name = 'test_shuffle'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_simple'
  11. Link
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_q5[5 GB (parquet)]'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_shuffle'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_anom_mean'

Extra observations

  • This failed because pyarrow couldn't be found. Did that not get sync'd successfully?
  • Why do I see activity in test_vorticity and test_dataframe_align in the non-upstream builds?

@ian-r-rose (Contributor Author)

My top observation from the above is that the package sync branch has many more detected regressions than main, and they are not confined to the upstream builds but are spread relatively evenly. Some repeat offenders jump out when they are all arrayed together:

  • test_quickstart_parquet
  • test_shuffle_simple
  • test_shuffle_parquet
  • test_read_spark_generated_data
  • test_double_diff
  • test_rechunk_in_memory
  • test_q*[5 GB (parquet)]
  • test_shuffle

Many (but not all!) of these are parquet-related. So one thing we should look at pretty closely is whether there is something different going on in the solves for boto, pyarrow, or s3fs; a quick way to compare is sketched below.
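
A sketch only; the file names are hypothetical, and it assumes a `conda list --json` dump was captured for each environment:

    import json

    # Compare solved versions of the suspect packages across two
    # `conda list --json` dumps (e.g. a main run vs a package_sync run).
    SUSPECTS = {"pyarrow", "s3fs", "boto3", "botocore"}

    def versions(path):
        with open(path) as f:
            return {p["name"]: p["version"] for p in json.load(f) if p["name"] in SUSPECTS}

    main_env = versions("main-env.json")
    sync_env = versions("package-sync-env.json")
    for name in sorted(SUSPECTS):
        if main_env.get(name) != sync_env.get(name):
            print(f"{name}: main={main_env.get(name)}, package_sync={sync_env.get(name)}")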

Comment on lines +15 to +17
worker_disk_size=64,
worker_vm_types=["t3.large"],
scheduler_vm_types=["t3.large"],
@ian-r-rose (Contributor Author)

@hendrikmakait in #280 you seemed okay with making the instances here a bit beefier. Is that still the case?

@pytest.mark.stability
@pytest.mark.parametrize("keep_around", (True, False))
def test_spilling(spill_client, keep_around):
    arr = da.random.random((64, 2**27)).persist()  # 64 GiB, ~2x memory
@ian-r-rose (Contributor Author)

@hendrikmakait same question here.

Remove duplicated code
@jrbourbeau (Member) left a comment

Thanks all for working on this! I believe there's consensus that we want to move forward with the changes here. Two things to note:

  • There are questions around a couple of spill-related tests on which @hendrikmakait has been pinged. @ian-r-rose thinks @hendrikmakait is okay with those test changes, so we're going to move forward for the time being; @hendrikmakait, let us know if you have any reservations and we can easily revert.
  • This PR starts running the upstream build as part of our normal test suite on PRs. This is nice because it runs automatically (no longer opt-in) and lets us clean up some code. The downside is that we may see more unrelated test failures on PRs due to upstream changes in dask + distributed. We've decided to try upstream as its own build on PRs and see how that feels. If things get too noisy, we'll move back to opt-in only.

@jrbourbeau merged commit f9f8d9b into main Sep 15, 2022
Labels: dx (Developer experience)