
Release 0.1.0 #247

Merged: 5 commits merged into main from release-0.1.0 on Aug 18, 2022

Conversation

@jrbourbeau (Member)

Updates package versions for releasing

xref #233

recipe/meta.yaml Outdated
- cloudpickle ==2.1.0
- tornado ==6.1
- toolz ==0.11.2
- python-blosc ==1.10.2
- zict ==2.2.0
- xgboost ==1.6.1
- dask-ml ==2022.5.27
- openssl >1.1.0g
@jrbourbeau (Member Author)

Note that we're removing this openssl pin here as it's no longer needed with newer versions of distributed that contain dask/distributed#6562

@ian-r-rose (Contributor) left a comment

Looks good to me, pending CI

recipe/meta.yaml Outdated
{% set version = "0.1.0" + environ.get("VERSION_SUFFIX", '') %}
{% set dask_version = environ.get("DASK_VERSION", "2022.6.0") %}
{% set distributed_version = environ.get("DISTRIBUTED_VERSION", "2022.6.0") %}
{% set version = "0.2.0" + environ.get("VERSION_SUFFIX", '') %}
@ian-r-rose (Contributor)

What's this version number?

@jrbourbeau (Member Author) commented Aug 16, 2022

This goes into the release number for nightly builds of coiled-runtime on the coiled conda channel. Thinking about it more, this should probably be updated to 0.1.1 instead of 0.2.0.
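
For reference, a minimal Python sketch of how the Jinja expression above builds the nightly version string (the printed suffix is purely illustrative):

import os

# Mirrors the recipe's `{% set version = ... %}` line: the base version plus an
# optional suffix that the nightly CI job is assumed to export as VERSION_SUFFIX.
base_version = "0.1.0"
version = base_version + os.environ.get("VERSION_SUFFIX", "")
print(version)  # "0.1.0" locally; "0.1.0<suffix>" when VERSION_SUFFIX is set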

@jrbourbeau (Member Author)

The one CI build that's failing is due to an HTTP error in one of our scheduler plugins:

Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]: Traceback (most recent call last):
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/core.py", line 481, in start
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:     await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:   File "/opt/conda/envs/coiled/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:     return await fut
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/scheduler.py", line 3420, in start_unsafe
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:     await asyncio.gather(
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]:   File "https://cloud.coiled.io/api/v2/cluster_facing/preload/scheduler", line 56, in start
Aug 15 20:50:30 ip-10-0-4-89 cloud-init[1267]: tornado.httpclient.HTTPClientError: HTTP 502: Bad Gateway

Not sure if this is a known issue or some random transient error; cc @ntabris for visibility. Regardless, I'm going to rerun CI to see if it goes away. I've also opened dask/distributed#6890 upstream to log these errors instead of failing to bring the scheduler up.
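
As an aside, here is a rough sketch of the behavior proposed upstream (hypothetical names, not the actual dask/distributed#6890 patch): catch preload start failures and log them rather than letting them abort scheduler startup.

import logging

logger = logging.getLogger(__name__)

async def start_preloads(preloads):
    # Hypothetical sketch: if a preload fails to start (e.g. the HTTP 502 above),
    # log the error and keep bringing the scheduler up instead of propagating it.
    for preload in preloads:
        try:
            await preload.start()
        except Exception:
            logger.exception("Failed to start preload %r", preload)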

@ntabris (Member) commented Aug 16, 2022

Thanks, @jrbourbeau. I think the PATCH requests we make in the scheduler preload aren't being retried on transient errors; I've opened https://github.com/coiled/platform/issues/16.
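
For illustration, a minimal sketch of the kind of retry that could help here, using tornado's AsyncHTTPClient (function name and parameters are assumptions, not the actual preload code):

import asyncio

from tornado.httpclient import AsyncHTTPClient, HTTPClientError

async def patch_with_retries(url, body, attempts=3, backoff=1.0):
    # Hypothetical sketch: retry a PATCH a few times on transient gateway errors
    # (502/503/504) with simple exponential backoff before giving up.
    client = AsyncHTTPClient()
    for attempt in range(attempts):
        try:
            return await client.fetch(url, method="PATCH", body=body)
        except HTTPClientError as exc:
            if exc.code not in (502, 503, 504) or attempt == attempts - 1:
                raise
            await asyncio.sleep(backoff * 2**attempt)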

@jrbourbeau (Member Author) commented Aug 16, 2022

Okay, so a regression was detected here:

 Exception:  Regressions detected 2: 
runtime = 'coiled-latest-py3.9', name = 'test_xgboost_distributed_training', category = 'runtime', last_three_average_memory [GiB] = (4.786573607346107, 4.703390393938337, 4.766629988147367), average_memory_threshold [GiB] = 4.534268547507949 
runtime = 'coiled-latest-py3.9', name = 'test_xgboost_distributed_training', category = 'runtime', last_three_peak_memory [GiB] = (7.5365447998046875, 7.371295928955078, 7.538898468017578), peak_memory_threshold [GiB] = 7.304825104595042 

which is great to see happening automatically. Based on the CI logs, our test_xgboost_distributed_training test in the latest Python 3.9 build is using less memory than previous runs, which is actually a nice thing to see. We should update our regression detection logic to ignore decreases in memory usage (since those are a good thing). cc @ian-r-rose
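
A hedged sketch of what a one-sided check could look like (hypothetical names; the real detection logic lives in the coiled-runtime CI scripts):

def is_regression(last_three, threshold):
    # Hypothetical sketch: only flag a regression when the recent average
    # exceeds the historical threshold, so decreases in memory usage pass.
    return sum(last_three) / len(last_three) > threshold

# Illustrative numbers only: a run that uses less memory is not flagged.
assert not is_regression((3.9, 4.0, 4.1), threshold=4.5)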

@jrbourbeau (Member Author)

Seeing what appears to be similar behavior on a different test after #250 was merged:

Exception:  Regressions detected 1: 
runtime = 'coiled-latest-py3.9', name = 'test_write_wide_data', category = 'benchmarks', last_three_peak_memory [GiB] = (100.73250961303711, 82.24337387084961, 82.08837890625), peak_memory_threshold [GiB] = 82.00863734500297 

@ian-r-rose any thoughts?

@ian-r-rose (Contributor)

I'm not sure what the cause would be, but it certainly looks like a regression in this PR:

[screenshot of the benchmark results]

@jrbourbeau (Member Author)

Ah, I see: I think I've been interpreting the message displayed in the regression exception incorrectly. Interestingly, the detected regressions haven't been consistent between CI runs (assuming all detected regressions are printed in CI). I'll rerun CI again to see how consistent things are. If this is a regression, it's okay for us to revert all the package version updates here, as this particular release can be just about getting something on PyPI.

@ian-r-rose (Contributor)

I may have some fixes to make before re-running CI works properly. I think the artifacts may be persisted between different runs, which can cause issues with how I've set things up: https://github.com/coiled/coiled-runtime/runs/7867069146?check_suite_focus=true

@ian-r-rose (Contributor)

> I think I've been interpreting the message displayed in the regression exception incorrectly

Yeah, the message can probably be improved.
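
One possible shape for a clearer message (purely illustrative; names, wording, and the averaging are assumptions):

def format_regression(runtime, name, metric, last_three, threshold):
    # Hypothetical sketch: spell out that the recent average exceeded the
    # historical threshold, so the direction of the change is unambiguous.
    recent_average = sum(last_three) / len(last_three)
    return (
        f"{name} on {runtime}: average {metric} over the last three runs "
        f"({recent_average:.2f} GiB) exceeds the threshold ({threshold:.2f} GiB)"
    )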

@jrbourbeau (Member Author)

> I may have some fixes to make before re-running CI works properly

Okay, good to know. I'll stick with pushing an empty commit, which seems to work, instead of using the rerun CI button in the GitHub UI.

@ian-r-rose (Contributor)

Looks like it reproduced (hooray?)

@jrbourbeau (Member Author)

Indeed! I've gone ahead and reverted all the package version pin updates. Assuming there are no regressions now, let's go ahead with this more minimal change so we can get coiled-runtime on PyPI. I'll open an issue with the previous recipe diff for further investigation.

@ian-r-rose (Contributor)

I wonder if it's malloc trim again, though I haven't fully tracked down the timing.
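
For context, "malloc trim" here refers to glibc returning freed heap pages to the operating system (tunable via MALLOC_TRIM_THRESHOLD_); when that doesn't happen, resident memory stays high even after Python has freed the objects. A minimal illustration of the underlying libc call (Linux/glibc only):

import ctypes

# glibc's malloc_trim(0) asks the allocator to release free heap pages back to
# the OS; without it, freed memory can still show up as resident set size.
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)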

@jrbourbeau (Member Author)

Interestingly, the regressions reported here are both with the coiled-runtime=0.0.3 release:

Exception:  Regressions detected 2: 
runtime = 'coiled-0.0.3-py3.9', name = 'test_q3[5 GB (parquet)]', category = 'benchmarks', last_three_average_memory [GiB] = (12.255300492048264, 12.516682590011262, 12.366765445336364), average_memory_threshold [GiB] = 12.186457108029044 
runtime = 'coiled-0.0.3-py3.9', name = 'test_q8[0.5 GB (parquet)]', category = 'benchmarks', last_three_peak_memory [GiB] = (3.3844871520996094, 3.4698944091796875, 3.4943313598632812), peak_memory_threshold [GiB] = 3.3457457110722024 

I'm inclined to not have this block a 0.1.0 release, but am curious what other folks think. @ian-r-rose thoughts?

@ian-r-rose (Contributor)

> I'm inclined to not have this block a 0.1.0 release

Isn't this release basically a no-op as the branch is right now? I suspect that there was an actual regression in dask or distributed that we are catching by updating latest.

@ian-r-rose (Contributor)

0.0.4: [screenshot]

latest: [screenshot]

upstream: [screenshot]

Note that upstream is both less stable and moderately higher.

@jrbourbeau (Member Author)

> Isn't this release basically a no-op as the branch is right now? I suspect that there was an actual regression in dask or distributed that we are catching by updating latest.

That's correct, the current state of this PR doesn't update any package pins. That's why I was confused by a regression being reported for the coiled-runtime=0.0.3 release. To me that seems like either (1) a false positive in our regression detection logic or (2) some indirect, non-pinned dependency was updated and is making things use more memory. Regardless, after 0.1.0 is out we can drop 0.0.3 anyway, so I'm in favor of just releasing with this PR as is.

That said, I agree the regressions which were reported earlier, when package versions were updated, do seem legitimate.

@ian-r-rose (Contributor)

LGTM

@ian-r-rose merged commit 3020616 into main on Aug 18, 2022
@jrbourbeau deleted the release-0.1.0 branch on August 18, 2022 at 19:37