
Package Sync #235

Merged: 55 commits into main from irose/package-sync-as-default on Sep 15, 2022

Conversation

@ian-r-rose (Contributor) commented Aug 9, 2022

Pushing #222 a bit further to see if anything breaks. This:

  1. Removes a couple of controlling environment variables, which should simplify local development as well as CI workflows. Specifically, TEST_UPSTREAM has been folded into COILED_RUNTIME_VERSION (which can be a version number, "upstream", or "latest"), and COILED_SOFTWARE_VERSION is no longer necessary in a package_sync universe (see the sketch after this list).
  2. Removes the setup/cleanup jobs. This also means fewer artifacts are passed around between jobs, which is nice.
  3. Runs "upstream" on pull requests.

@ian-r-rose force-pushed the irose/package-sync-as-default branch from 6e357d3 to 61effde on August 10, 2022 00:01
@jrbourbeau mentioned this pull request Aug 10, 2022
@ian-r-rose (Contributor Author)

It seems that something did break: package_sync doesn't yet support direct URLs to conda packages, which we currently use to point to nightly dask/distributed builds.
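
For reference, the kind of dependency that trips this up is a direct artifact URL in an environment file, along these lines (the URL is a placeholder, not an actual pinned build):

    # environment.yml excerpt (placeholder URL, for illustration only)
    dependencies:
      - https://conda.anaconda.org/dask/label/dev/linux-64/<nightly-build>.tar.bz2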

@ian-r-rose self-assigned this Aug 11, 2022
shughes-uk and others added 5 commits August 12, 2022 13:42

…software environments. The TEST_UPSTREAM and COILED_SOFTWARE_VERSION environment variables are now gone; COILED_RUNTIME_VERSION is the only controlling environment variable, and it can take a version number, "upstream", or "latest".
@ian-r-rose force-pushed the irose/package-sync-as-default branch from 61effde to b1c4c36 on August 12, 2022 20:42
@shughes-uk (Contributor)

I'm happy with the state of package sync and can give a solid thumbs up to merging this!

@jrbourbeau (Member)

Just pushed an empty commit to rerun CI (we were getting AWS limit errors in the most recent CI run)
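
For reference, the usual incantation for that kind of no-op retrigger is:

    git commit --allow-empty -m "Retrigger CI"
    git push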

@jrbourbeau (Member)

We're getting

 E               coiled.errors.ClusterCreationError: Cluster status is error (reason: Scheduler Stopped -> Errors installing conda packages:
E               package pyarrow-8.0.0-py310hbc2c91e_3_cuda requires arrow-cpp 8.0.0 py310h1f609fe_3_cuda, but none of the providers can be installed) (cluster_id: 50389)

in this CI build cc @shughes-uk

@shughes-uk (Contributor) commented Aug 16, 2022

Having to boot up my x86 MacBook for the first time in months to solve this one. Not quite sure what's going on, and I couldn't replicate it on my ARM MacBook (using a slightly stripped-down version of the env export from CI).

@shughes-uk (Contributor)

The source seems to be a pin on pyarrow 8. With that pin, the environment generated on macOS can't solve properly for linux-64. Not sure which dep is responsible; looking now.

@shughes-uk (Contributor)

Got it working! 🥂

@ian-r-rose (Contributor Author)

> Got it working! 🥂

Woo!

@ian-r-rose force-pushed the irose/package-sync-as-default branch from 897531f to d0ad7f1 on August 16, 2022 22:22
@shughes-uk (Contributor)

I'm suspicious that's some kind of transient network error. I'll have to ensure these errors get printed out properly, as it looks like this one got eaten.

@ian-r-rose (Contributor Author) left a comment

This seems to be all green, though I would note that there was a solid amount of mamba/conda/mysterious flakiness that makes me a bit nervous.

@@ -446,7 +315,7 @@ jobs:
   static-site:
     needs: process-results
     # Always generate the site, as this can be skipped even if an indirect dependency fails (like a test run)
-    if: always() && github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
+    if: always()
@ian-r-rose (Contributor Author)

Changed things so we always generate the static site and upload it as an artifact, so that if a regression is detected it's easier to download the artifacts and inspect them (cf. #251).

@shughes-uk (Contributor)

> This seems to be all green, though I would note that there was a solid amount of mamba/conda/mysterious flakiness that makes me a bit nervous.

I'll add some changes so I can see the flakiness server-side and hopefully push fixes fast. Yell at me in the meantime.

@jrbourbeau (Member)

Thanks for the empty commit @ian-r-rose -- I think sanity checks like that are a good thing to do at this point. @shughes-uk,

ERROR    coiled:states.py:134    | Worker Process         | test_coiled-5b0e15b6-worker-3ee9bf36b8         | error      at 16:48:59 (UTC) | Errors installing conda packages:
nothing provides requested tzdata 2022rc0

is currently popping up.

@shughes-uk (Contributor)

Interesting. That package was brand new yesterday. I think you might have had the bad luck of trying things right after the package's release, when it had propagated to the CDN edge node serving CI but not to the one used by the cluster. If I'm right, re-running now should work.

@shughes-uk (Contributor)

I was not correct; it looks like an issue with the way the version gets parsed!
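
If the parsing goes through PEP 440 machinery such as packaging.version (an assumption on my part, not confirmed here), that would explain the mangled tzdata version: PEP 440 treats a trailing "c" as an alias for the "rc" pre-release marker, so tzdata's calendar-style version "2022c" doesn't round-trip:

    from packaging.version import Version

    # tzdata uses calendar-letter versions like "2022c", but PEP 440
    # normalizes a trailing "c" to the "rc" pre-release marker, so the
    # stringified version is one conda has never heard of.
    v = Version("2022c")
    print(v)                        # 2022rc0
    print(v == Version("2022rc0"))  # True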

@shughes-uk (Contributor)

> @shughes-uk is package_sync downloading and installing in parallel right now? I don't remember.

It uses micromamba, so yes.

@ncclementi (Contributor)

Not sure if this matters, but I noticed that in the package sync list we have two dask versions installed. Mentioning it for visibility:

- dask=2022.8.1
- dask=2022.8.2a220830

@ian-r-rose (Contributor Author)

> Not sure if this matters, but I noticed that in the package sync list we have two dask versions installed.

Good eye -- I think @shughes-uk pushed a fix for that yesterday afternoon; are you still seeing it?

@crusaderky mentioned this pull request Sep 8, 2022
@ian-r-rose (Contributor Author) commented Sep 9, 2022

I'm doing a census of the failures and reported regressions for this branch during the last week. I'm deliberately excluding regressions related to cluster startup time, as those are expected.

Regressions, main

  1. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
  2. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'
  3. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'
  4. Link
    • 'coiled-0.0.4-py3.9', name = 'test_double_diff'

Regressions, package_sync

  1. Link
    • coiled-0.0.4-py3.9 - test_quickstart_parquet
    • coiled-0.0.4-py3.10 - test_shuffle_simple
    • coiled-latest-py3.10 - test_shuffle_parquet
    • coiled-upstream-py3.9 - test_read_spark_generated_data
    • coiled-0.1.0-py3.10 - test_shuffle_parquet
    • coiled-0.1.0-py3.10 - test_shuffle_simple
    • coiled-0.1.0-py3.9 - test_q7[5 GB (parquet)]
    • coiled-0.1.0-py3.9 - test_read_spark_generated_data
  2. Link
    • 'coiled-0.0.4-py3.9', name = 'test_quickstart_parquet'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_double_diff'
    • 'coiled-upstream-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[5 GB (parquet)]'
  3. Link
    • 'coiled-0.0.4-py3.9', name = 'test_jobqueue'
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (We understand this one! Related to co-assignment)
    • 'coiled-upstream-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-upstream-py3.9', name = 'test_jobqueue'
    • 'coiled-upstream-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.1.0-py3.9', name = 'test_q1[0.5 GB (csv)]'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_xgboost_distributed_training'
  4. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q9[0.5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_basic_sum'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align'
    • 'coiled-upstream-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_anom_mean'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[0.5 GB (csv)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q9[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  5. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.9', name = 'test_q4[5 GB (parquet)]'
    • 'coiled-latest-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (Again, we understand this one)
    • 'coiled-upstream-py3.9', name = 'test_jobqueue'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  6. Link -- Many test failures, possibly related to some uptime issues with s3?
  7. Link
    • 'coiled-0.0.4-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.0.4-py3.9', name = 'test_download_throughput[s3fs]'
    • 'coiled-0.0.4-py3.9', name = 'test_write_wide_data'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.9', name = 'test_write_wide_data'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align' (again understood)
    • 'coiled-upstream-py3.9', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_download_throughput[pandas]'
    • 'coiled-0.1.0-py3.9', name = 'test_read_hive_partitioned_data'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle'
    • 'coiled-0.1.0-py3.9', name = 'test_write_wide_data'
  8. Link
    • 'coiled-0.0.4-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-0.0.4-py3.9', name = 'test_quickstart_parquet'
    • 'coiled-latest-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_read_spark_generated_data'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.0.4-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-0.0.4-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_dataframe_align'
    • 'coiled-0.1.0-py3.9', name = 'test_read_spark_generated_data'
  9. Link
    • 'coiled-0.0.4-py3.9', name = 'test_vorticity' (This is weird! Why is this one showing up in 0.0.4?)
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_q5[5 GB (parquet)]'
    • 'coiled-upstream-py3.10', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_dataframe_align' (This is very weird! Why is this one showing up in 0.1.0?)
    • 'coiled-0.1.0-py3.9', name = 'test_q1[5 GB (parquet)]'
  10. Link
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.8', name = 'test_shuffle_parquet'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_q7[5 GB (parquet)]'
    • 'coiled-upstream-py3.9', name = 'test_read_hive_partitioned_data'
    • 'coiled-upstream-py3.9', name = 'test_shuffle'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-upstream-py3.8', name = 'test_shuffle_simple'
    • 'coiled-upstream-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.8', name = 'test_rechunk_in_memory'
    • 'coiled-0.1.0-py3.8', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.10', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_q7[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_q8[5 GB (parquet)]'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_parquet'
    • 'coiled-0.1.0-py3.9', name = 'test_shuffle_simple'
  11. Link
    • 'coiled-0.0.4-py3.10', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.9', name = 'test_q3[5 GB (parquet)]'
    • 'coiled-latest-py3.9', name = 'test_q5[5 GB (parquet)]'
    • 'coiled-latest-py3.8', name = 'test_shuffle_simple'
    • 'coiled-latest-py3.10', name = 'test_shuffle_parquet'
    • 'coiled-upstream-py3.9', name = 'test_anom_mean'
    • 'coiled-upstream-py3.9', name = 'test_shuffle'
    • 'coiled-upstream-py3.9', name = 'test_shuffle_simple'
    • 'coiled-0.1.0-py3.9', name = 'test_anom_mean'

Extra observations

  • This failed because pyarrow couldn't be found. Did that not get sync'd successfully?
  • Why do I see activity in test_vorticity and test_dataframe_align in the non-upstream builds?

@ian-r-rose (Contributor Author)

My top observation from the above is that the package sync branch has many more detected regressions than main, and they are not confined to the upstream builds but are spread relatively evenly. Some repeat offenders jump out when they are all arrayed together:

  • test_quickstart_parquet
  • test_shuffle_simple
  • test_shuffle_parquet
  • test_read_spark_generated_data
  • test_double_diff
  • test_rechunk_in_memory
  • test_q*[5 GB (parquet)]
  • test_shuffle

Many (but not all!) of these are parquet-related. So one thing we should look at pretty closely is whether there is something different going on in the solves for boto, pyarrow, or s3fs; a quick way to compare is sketched below.
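
A sketch only; the file names are hypothetical, and it assumes a `conda list --json` dump was captured for each environment:

    import json

    # Compare solved versions of the suspect packages across two
    # `conda list --json` dumps (e.g. a main run vs a package_sync run).
    SUSPECTS = {"pyarrow", "s3fs", "boto3", "botocore"}

    def versions(path):
        with open(path) as f:
            return {p["name"]: p["version"] for p in json.load(f) if p["name"] in SUSPECTS}

    main_env = versions("main-env.json")
    sync_env = versions("package-sync-env.json")
    for name in sorted(SUSPECTS):
        if main_env.get(name) != sync_env.get(name):
            print(f"{name}: main={main_env.get(name)}, package_sync={sync_env.get(name)}")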

Comment on lines +15 to +17
worker_disk_size=64,
worker_vm_types=["t3.large"],
scheduler_vm_types=["t3.large"],
@ian-r-rose (Contributor Author)

@hendrikmakait in #280 you seemed okay with making the instances here a bit beefier. Is that still the case?

@pytest.mark.stability
@pytest.mark.parametrize("keep_around", (True, False))
def test_spilling(spill_client, keep_around):
    arr = da.random.random((64, 2**27)).persist()  # 64 GiB, ~2x memory
@ian-r-rose (Contributor Author)

@hendrikmakait same question here.

Remove duplicated code
@jrbourbeau (Member) left a comment

Thanks all for working on this! I believe there's consensus that we want to move forward with the changes here. Two things to note:

  • There are questions around a couple of spill-related tests on which @hendrikmakait has been pinged. @ian-r-rose thinks @hendrikmakait is okay with those test changes, so we're going to move forward for the time being; @hendrikmakait, let us know if you have any reservations and we can easily revert.
  • This PR starts running the upstream build as part of our normal test suite on PRs. This is nice because it runs automatically (no longer opt-in) and lets us clean up some code. The downside is that we may see more unrelated test failures on PRs due to upstream changes in dask + distributed. We've decided to try upstream as its own build on PRs and see how that feels. If things get too noisy, we'll move back to opt-in only.

@jrbourbeau merged commit f9f8d9b into main Sep 15, 2022
Labels: dx (Developer experience)