
Semantic versioning in dask repository #93

Closed
JulianWgs opened this issue Aug 17, 2020 · 22 comments

Comments

@JulianWgs

Why didn't the commit 7138f470f0e55f2ebdb7638ddc4dfe2e78671403 trigger a new major version of dask, since the function read_metadata is incompatible with older versions? The commit introduced the return of 4 values, whereas the old version only returned 3. According to semantic versioning, a major version bump would have been the correct behavior.

cudf got broken because of that commit.

Code from the issue:

>>> import cudf
>>> import dask_cudf
>>> dask_cudf.from_cudf(cudf.DataFrame({'a':[1,2,3]}),npartitions=1).to_parquet('test_parquet')

>>> dask_cudf.read_parquet('test_parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 213, in read_parquet
    **kwargs,
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 234, in read_parquet
    **kwargs
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 17, in read_metadata
    meta, stats, parts, index = ArrowEngine.read_metadata(*args, **kwargs)
ValueError: not enough values to unpack (expected 4, got 3)
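
For illustration (not from the original report), a downstream engine could guard against exactly this kind of change in return arity with a small compatibility shim; the helper name here is made up:

# Hypothetical shim that tolerates both a 3-value and a 4-value return
# from an engine's read_metadata (names follow the traceback above).
def call_read_metadata(engine, *args, **kwargs):
    result = engine.read_metadata(*args, **kwargs)
    if len(result) == 4:
        meta, stats, parts, index = result
    else:
        meta, stats, parts = result
        index = None
    return meta, stats, parts, index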

dask_cudf==0.14 is only compatible with dask<=0.19. In dask_cudf==0.16 the issue is fixed.

This question was first asked here. Answered by @martindurant.

I would suggest a semantic versioning scheme for all non-internal functions (those without a leading underscore). When I look at version changes, I only read the changelogs of major version changes to find out whether they affect my code (is that behavior wrong?). For all other new versions I just expect that everything keeps working. Being unsure about breaking changes in every release does not increase trust in dask. I know that this will only happen rarely, but it's usually those rare cases which are really annoying.

Edit:

  • Pandas uses semantic versioning since version 1.0: Link
  • Numpy has a time window of at least 1 year before breaking changes: Link
  • Scipy uses a mixture of semantic versioning and what Numpy does: Link
  • Spark: Link
  • Arch Linux: Link
  • Django uses a loose form of semantic versioning: Link
  • Tensorflow uses semantic versioning for all public APIs and clearly distinguishes between public and private: Link
  • For PyTorch I did not find a document
  • I didn't find information for ray, koalas or modin, but all of them are in prerelease status

Greetings,
Julian

@martindurant
Member

( related: #84 )

@mrocklin
Member

Hi Julian,

Dask doesn't use semantic versioning. We use a rolling release cycle similar to numpy, pandas, scikit-learn, and other projects in this space.

Due to the large API surface area, pretty much every release would technically bump the major version.

@mrocklin
Member

Regarding the actual issue, cc'ing @rjzamora

@JulianWgs
Author

JulianWgs commented Aug 18, 2020

Hi Julian,

Dask doesn't use semantic versioning. We use a rolling release cycle similar to numpy, pandas, scikit-learn, and other projects in this space.

Due to the large API surface area, pretty much every release would technically bump the major version.

Pandas does use semantic versioning since 1.0 (source). I also think the comparison doesn't fit neatly, as Numpy, Scipy and Pandas release much less frequently.

Wouldn't it make more sense to use calendar versioning then, as rolling-release Linux distributions like Arch do? Or to define core and non-core modules which would or would not trigger a new major version?

If not a breaking change, what causes a major version increase? Right now the versioning scheme is not transparent to me (or others).

On another note: I do like the idea of rolling releases for operating systems, but how does this actually apply to Python software packages?

  • Both pip and conda try to mitigate dependency issues, but this implies it is sometimes useful to stick with an old version?
  • Are there tools and procedures for the CI pipeline? Usually one could just run the development branch against their own CI pipeline and there would be time to fix issues. With the plan to release weekly, this could leave less than seven days (or less than five) to react to a breaking change before the new breaking version is released.

What do you think about the idea of having a special tag for pull requests that are approved to be merged? This tag could be picked up by the CI pipelines of depending packages and run before the PR is merged. I would say a 24h time window is enough to raise an objection before merging to master and releasing in the next cycle. An objection would not necessarily mean the changes would not be merged, but would leave room for a discussion. Merging after that time period could also be automated. This process would replace code freezes before a release, which were discussed in the weekly release schedule issue.

Greetings,
Julian

@mrocklin
Member

mrocklin commented Aug 18, 2020 via email

@JulianWgs
Author

Ok, if I find time I will do that :)

May I repeat my question from the comment before:

  • What causes a major version increase?

Greetings,
Julian

@JulianWgs
Author

JulianWgs commented Aug 19, 2020

I also want to add that Numpy does not break functionality within a 1 year time window (at least two releases between warning and change for a half year release schedule).

https://github.com/rgommers/numpy/blob/master/doc/neps/nep-0023-backwards-compatibility.rst
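
For concreteness, the warn-before-change pattern that NEP 23 describes might look roughly like this (a generic sketch; the function name and message are made up):

import warnings

def read_metadata(path):
    # Sketch of a deprecation window: warn for at least two releases
    # (about one year on a half-year release schedule) before the
    # behavior actually changes or the old return shape is removed.
    warnings.warn(
        "The current return value of read_metadata is deprecated and "
        "will change in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    ...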

@mrocklin
Member

What causes a major version increase?

It's only happened a couple of times. The first time we dropped Python 2. The second time there was a large set of changes to both dask and distributed.

I also want to add that Numpy does not break functionality within a 1 year time window

They actually break functionality in every release (small things change) but larger public facing APIs are intended to remain stable.  Dask has the same objective (we very rarely intentionally break major public facing APIs).

The break that you've uncovered occurred at the intersection of Dask, Arrow, and RAPIDS. All of those projects are rapidly changing, and so breaks like this should be somewhat expected. Fortunately they're also usually rapidly fixed. It's worth pointing out that if we don't break things then we move very slowly, and new things like Dask+Arrow+RAPIDS don't end up developing.

@rjzamora
Member

rjzamora commented Aug 19, 2020

@JulianWgs - I do apologize for breaking cudf with the recent read_parquet change! I can certainly understand the frustration. With that said, a primary motivation for the Dask change was to improve the performance of dask_cudf. dask_cudf and Dask are currently being developed in lockstep, and so a specific cudf release will typically rely on the latest Dask release. We do try to avoid breaking changes to the user-facing API (like dd.read_parquet).

It is worth noting that I typically assume RAPIDS is the main/only external consumer of the internal parquet Engine API. This is why I considered it "fair game" to make a breaking Dask-RAPIDS change, since I also had a complementary cudf PR ready to merge. If there are actually other external consumers of this API, it would be great to know. Ideally, we can/should establish that there is a reasonable consensus before any breaking changes are merged.

@JulianWgs
Author

@rjzamora Thank you for the explanation. No harm done :) The only thing I would've wished for is that dask_cudf==0.14 pinned dask<=0.19.
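
For illustration, such a pin could have looked roughly like this in dask_cudf's setup.py (a sketch, not the actual packaging code; the bound just echoes the version mentioned above):

from setuptools import setup

setup(
    name="dask_cudf",
    version="0.14",
    # Sketch only: cap the dask version this release is known to work with.
    install_requires=["dask<=0.19"],
)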

What I want to propose is a document similar to what the other libraries have, where versioning and breaking changes are described. I have the feeling that right now things rely more on tight communication between stakeholders than on a clearly defined process. Don't get me wrong, I think people are more important than processes, but I worry about the scalability and future interpretation of the current approach. Also it might not be transparent to new users and contributors.

I've updated the original issue to incorporate such documents from different scientific libraries.

Is this a good idea? Who would write such a document? I don't see myself having the ability to do it.

@mrocklin

They actually break functionality in every release (small things change) but larger public facing APIs are intended to remain stable. Dask has the same objective (we very rarely intentionally break major public facing APIs).

Still, they release only twice a year, which I think is a completely different situation from the one dask is in. Are there any other libraries with a rolling release model and a high release cadence? After a quick Google search I didn't find any.

I also want to point to hypothesis which releases every few days and strictly follows semantic versioning.

@mrocklin
Member

I also want to point to hypothesis which releases every few days and strictly follows semantic versioning.

Sure, but the API surface area of hypothesis is very tight. Dask has many user facing APIs, each of which is quite large (numpy, pandas, ...)

Still only twice a year which I think is a completely different situation of which dask is in. Are there any other libraries with a rolling release model and high release cadence? After a quick Google search I didn't find any.

To adapt reasonably to the activity around Dask we need to maintain a relatively frequent release cadence (there are far more bug reports for issues that have already been fixed in master than people concerned about things moving too quickly). If we do semantic versioning then we would be on Dask version 205.0.0 or something similar. I understand and agree with your stance in principle, but I don't think that it makes sense for Dask in practice today.

I think that it could make sense if we separated out dask array, dask dataframe and so on into separate projects with separate versioning systems, but I think that that would likely create more issues than it resolves.

@JulianWgs
Author

What do you think about writing down the current state of things in the docs like the other libraries have done? That would resolve the issue for me.

After the discussion I agree that semantic versioning would not be the right choice. Calendar versioning might seem like the correct choice, but it is very unusual. I also share the same concerns about separating the repo.

@mrocklin
Member

What do you think about writing down the current state of things in the docs like the other libraries have done? That would resolve the issue for me.

You mean our general policy around releasing? I have no objection to this in general, provided it's in a suitable place. If you have time to collect a few doc pages about releasing in prominent libraries I would actually find reading through those pretty interesting.

After the discussion I agree that semantic versioning would not be the right choice. Calendar versioning might seem like the correct choice, but it is very unusual. I also share the same concerns about separating the repo.

I like CalVer in principle, especially for projects like Dask that have a wide API surface-area. It seems sensible to me. I think that the reasons to stay with our current scheme are mostly about inertia and familiarity for people, and less about precise versioning.

@JulianWgs
Author

I've updated the original issue with some examples. I like how Tensorflow does it:

All the documented Python functions and classes in the tensorflow module and its submodules, except for

  • Private symbols: any function, class, etc., whose name start with _
  • Experimental and tf.contrib symbols, see below for details.
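
As a generic illustration of that public/private split (not actual tensorflow or dask code):

# Only names without a leading underscore (and listed in __all__) are
# public and covered by compatibility guarantees; _helpers may change freely.
__all__ = ["read_metadata"]

def read_metadata(path):
    """Public API: changing this would require a deprecation cycle."""
    return _parse_footer(path)

def _parse_footer(path):
    """Private helper: free to change in any release."""
    ...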

Since Dask uses upstream libraries like Pandas and Numpy, is there a policy for which versions are supported and when versions are deprecated?

@mrocklin
Member

mrocklin commented Aug 22, 2020 via email

@JulianWgs
Author

Did you have time to look through the provided resources? Do you want more examples?

Dask maintains minimum versions in our setup.py file for dask[array] and dask[dataframe] and conda dependencies in the conda-forge recipe.

I didn't mean a definition, but a policy (a written document) describing when certain versions will be deprecated.
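
For reference, minimum versions for the optional extras are declared in setup.py roughly like this (the version numbers below are placeholders, not the real pins):

from setuptools import setup

setup(
    name="dask",
    extras_require={
        # Placeholder bounds only; the real minimums live in dask's setup.py.
        "array": ["numpy >= 1.0"],
        "dataframe": ["numpy >= 1.0", "pandas >= 1.0"],
    },
)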

@TomAugspurger
Member

Just to clarify, pandas doesn't strictly follow semver. See https://pandas.pydata.org/docs/development/policies.html#policies-version.

We're discussing Python (and NumPy) version support in #66, though there's disagreement there.

We don't have a formal policy for when we bump a pandas version. Typically when it becomes a pain to maintain older versions, though whether we can / try to faithfully match the behavior of the installed version of pandas differs from method to method. And you can back dask dataframes by cudf DataFrames, which has its own release cadence, bugs, and behaviors.

@jcrist jcrist transferred this issue from dask/dask Sep 11, 2020
@jcrist
Member

jcrist commented Sep 11, 2020

(note - I've moved this to the community repository for further discussion)

@hammer

hammer commented Sep 15, 2020

For reference, here are the issues in which major version bumps were discussed previously.

@jcrist
Member

jcrist commented Sep 15, 2020

I like CalVer in principle, especially for projects like Dask that have a wide API surface-area. It seems sensible to me. I think that the reasons to stay with our current scheme are mostly about inertia and familiarity for people, and less about precise versioning.

I also like CalVer (https://calver.org/), and think we should consider adopting it. Most projects in the Python ecosystem don't strictly follow semver; major and minor release bumps happen when the devs feel like things have changed enough for a bump, but not strictly around api changes. For a project like Dask that has lots of surface area and a rolling development cycle, release numbers seem more like a checkpoint in time, which better matches CalVer's semantics. We don't look at every change made (new api function, bugfix, etc...) and think about what version number we'll need to bump for each release; we release every couple of weeks and periodically bump the minor release number. Using a date instead removes any choice around this, and better signals to downstream projects what our release versions imply (and don't imply).

Dask is also composed of multiple projects with potentially different release schedules and versions (Dask, Distributed, Dask-Kubernetes, etc...) - adopting calver for all these projects would better group package release cycles by date rather than trying to match up which version of dask-kubernetes was released near the last release of dask, etc...
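
As a concrete illustration of "release numbers as checkpoints in time", a CalVer version can be derived purely from the release date (a sketch; the exact format shown here is an assumption, not a decided scheme):

import datetime

def calver_version(date=None):
    # Sketch: build a date-based version string such as "2020.9.15",
    # independent of how large or breaking the changes were.
    d = date or datetime.date.today()
    return f"{d.year}.{d.month}.{d.day}"

print(calver_version(datetime.date(2020, 9, 15)))  # -> 2020.9.15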

@jacobtomlinson
Member

I agree with @jcrist here. Generally I'm a huge advocate for SemVer, but with Dask's huge API surface it is hard to separate out enhancements and bug fixes, so CalVer makes more sense here.

Dask and projects like it mimic the APIs of other libraries, which means we are beholden to those libraries for their release schedules and version semantics.

In some traditional software development lifecycles, enhancements are made against a develop branch and bug fixes against a trunk branch, with the trunk being frequently merged into the develop branch. This allows minor and patch releases to be made independently of each other. Enhancement releases may be made periodically (every few months) and bug fixes can happen at any time.

In many projects in the Python ecosystem, like Dask, enhancements and bug fixes are made together on a single branch and are released frequently, so the separate semantics of minor and patch lose meaning as most releases are minor releases. Occasionally there are emergency patch releases, but they do not happen often.

In SemVer, major releases indicate breaking changes. But as we must conform to the APIs of multiple other libraries, the result is that we would have to do major releases frequently. As @mrocklin says, we would have to adopt a Chrome or Firefox approach and end up with a very high major version number.

My personal preference would be for the entire community to adopt SemVer and be disciplined about it. But given the nature of the community (it's very research heavy) this is never going to happen. Therefore having a versioning scheme which looks like SemVer but has no semantics is only causing confusion.

Switching to CalVer is a strong signal that versions are checkpoints in time that have been tested more thoroughly and created purposefully with knowledge of changes throughout the community, but nothing more.

@JulianWgs
Author

Dask switched to CalVer (#100).
