Semantic versioning in dask repository #93
(related: #84)
Hi Julian, Dask doesn't use semantic versioning. We use a rolling release cycle similar to numpy, pandas, scikit-learn, and other projects in this space. Due to the large API surface area, pretty much every release would technically bump the major version.
Regarding the actual issue, cc'ing @rjzamora
Pandas does use semantic versioning since 1.0 (source). I also think the comparison doesn't fit neatly, as Numpy, Scipy and Pandas release much less frequently. Wouldn't it make more sense to use calendar versioning, like rolling-release Linux distributions such as Arch do? Or to define core and non-core modules which would or would not trigger a new major version? If not a breaking change, what causes a major version increase? Right now the versioning scheme is not transparent to me (or others). On another note: I do like the idea of rolling releases for operating systems, but how does this actually apply to Python software packages?
- Both pip and conda try to mitigate dependency issues, but this implies it is sometimes useful to stick with an old version?
- Are there tools and procedures for the CI pipeline? Usually one could just run the development branch against their own CI pipeline and there would be time to fix it. With the plan to release weekly this could leave less than seven days (or even less than five) to react to a breaking change before the new breaking version is released.
What do you think about the idea of having a special tag for approved-to-be-merged pull requests? This tag could be picked up by the CI pipelines of depending packages and run before the pull request is merged. I would say a 24h time window is enough to raise an objection before merging to master and releasing in the next cycle. An objection would not necessarily mean the changes would not be merged, but would leave room for a discussion. Merging after that time period could also be automated. This process would replace code freezes before a release, which were discussed in the weekly release schedule issue. Greetings,
You may want to raise these thoughts at github.com/dask/community/issues/new.
We've discussed calver before and it was decided to go with what most pydata projects use. We could revisit that discussion though. I recommend that you bring this up there, or at a Dask monthly meeting.
Many of the dependent packages test against master to check for issues like this. We typically remind folks about 24 hours in advance of a release in case there are any objections. In practice very few people seem to track these announcements.
Ok, if I find time I will do that :) May I repeat my question from the comment before:
Greetings,
I also want to add that Numpy does not break functionality within a 1-year time window (at least two releases between warning and change, given a half-year release schedule). https://github.com/rgommers/numpy/blob/master/doc/neps/nep-0023-backwards-compatibility.rst
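The warn-then-remove pattern that NEP 23 describes can be sketched in a few lines. This is an illustrative example, not NumPy's actual code; the function names `old_api` and `new_api` are invented.

```python
import warnings

def new_api(x):
    # The replacement functionality.
    return x * 2

def old_api(x):
    # Deprecated entry point: it keeps working, but emits a
    # DeprecationWarning for (at least) two releases before removal.
    warnings.warn(
        "old_api is deprecated and will be removed in a future release; "
        "use new_api instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_api(x)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(21)

print(result)                       # 42
print(caught[0].category.__name__)  # DeprecationWarning
```

Downstream users thus get a full deprecation window in which their code still runs, instead of a silent break in one release.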
It's only happened a couple of times. The first time we dropped Python 2. The second time there were a large set of changes to both dask and distributed.
They actually break functionality in every release (small things change), but larger public-facing APIs are intended to remain stable. Dask has the same objective (we very rarely intentionally break major public-facing APIs). The break that you've uncovered occurred at the intersection of Dask, Arrow, and RAPIDS. All of those projects are rapidly changing, and so breaks like this should be somewhat expected. Fortunately they're also usually rapidly fixed. It's worth pointing out that if we don't break things then we move very slowly, and new things like Dask+Arrow+RAPIDS don't end up developing.
@JulianWgs - I do apologize for breaking cudf with the recent change. It is worth noting that I typically assume RAPIDS is the main/only external consumer of the internal parquet API.
@rjzamora Thank you for the explanation. No harm done :) The only thing I would've wished for is that dask_cudf==0.14 had pinned dask<=0.19. What I want to propose is a document, similar to what the other libraries have, where versioning and breaking changes are described. I have the feeling that right now things rely more on tight communication between stakeholders than on a clearly defined process. Don't get me wrong, I think people are more important than processes, but I worry about the scalability and future interpretation of the current approach. It also might not be transparent to new users and contributors. I've updated the original issue to incorporate such documents from different scientific libraries. Is this a good idea? Who would write such a document, since I don't see myself having the ability to do it?
Still only twice a year, which I think is a completely different situation from the one dask is in. Are there any other libraries with a rolling-release model and a high release cadence? After a quick Google search I didn't find any. I also want to point to hypothesis, which releases every few days and strictly follows semantic versioning.
Sure, but the API surface area of hypothesis is very tight. Dask has many user-facing APIs, each of which is quite large (numpy, pandas, ...).
To adapt reasonably to the activity around Dask we need to maintain a relatively frequent release cadence (there are far more bug reports for issues that have already been fixed in master than people concerned about things moving too quickly). If we did semantic versioning then we would be on a very high Dask major version by now.
I think that it could make sense if we separated out dask array, dask dataframe, and so on into separate projects with separate versioning systems, but I think that would likely create more issues than it resolves.
What do you think about writing down the current state of things in the docs, like the other libraries have done? That would resolve the issue for me. After the discussion I agree that semantic versioning would not be the right choice. Calendar versioning might seem like the correct choice, but it is very unusual. I also share the same concerns about separating the repo.
You mean our general policy around releasing? I have no objection to this in general, provided it's in a suitable place. If you have time to collect a few doc pages about releasing in prominent libraries I would actually find reading through those pretty interesting.
I like CalVer in principle, especially for projects like Dask that have a wide API surface area. It seems sensible to me. I think that the reasons to stay with our current scheme are mostly about inertia and familiarity for people, and less about precise versioning.
I've updated the original issue with some examples. I like how Tensorflow does it:
All the documented Python functions and classes in the tensorflow module and its submodules, except for
- Private symbols: any function, class, etc., whose name starts with _
- Experimental and tf.contrib symbols, see below for details.
Since Dask uses upstream libraries like Pandas and Numpy, is there a policy for which versions are supported and when versions are deprecated?
Dask maintains minimum versions in our setup.py file for dask[array] and dask[dataframe], and conda dependencies in the conda-forge recipe.
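A declared minimum version like the ones in setup.py can be checked against an installed version by comparing version tuples. The sketch below is illustrative only; the package names and pins are made up for the example, not dask's actual requirements.

```python
# Hypothetical minimum-version table (illustrative pins, not dask's real ones).
MINIMUM_VERSIONS = {
    "numpy": (1, 15, 1),
    "pandas": (0, 25, 0),
}

def parse_version(text):
    # Naive parse of an "X.Y.Z"-style version string into a comparable tuple.
    return tuple(int(part) for part in text.split(".")[:3])

def meets_minimum(package, installed):
    # Tuple comparison gives the usual version ordering for simple versions.
    return parse_version(installed) >= MINIMUM_VERSIONS[package]

print(meets_minimum("pandas", "1.0.0"))   # True
print(meets_minimum("numpy", "1.14.6"))   # False
```

Real tools (pip, setuptools, `packaging.version`) handle pre-releases and epochs that this naive tuple comparison does not.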
Did you have time to look through the provided resources? Do you want more examples?
I didn't mean a definition, but a policy (a written document) stating when certain versions will be deprecated.
Just to clarify, pandas doesn't strictly follow semver. See https://pandas.pydata.org/docs/development/policies.html#policies-version. We're discussing Python (and NumPy) version support in #66, though there's disagreement there. We don't have a formal policy for when we bump a pandas version; typically it's when it becomes a pain to maintain older versions, though whether we can / try to faithfully match the behavior of the installed version of pandas differs from method to method. And you can back dask dataframes by cudf DataFrames, which has its own release cadence, bugs, and behaviors.
(note - I've moved this to the …)
For reference, here are the issues in which major version bumps were discussed previously.
I also like CalVer (https://calver.org/), and think we should consider adopting it. Most projects in the Python ecosystem don't strictly follow semver; major and minor release bumps happen when the devs feel like things have changed enough for a bump, but not strictly around api changes. For a project like Dask that has lots of surface area and a rolling development cycle, release numbers seem more like a checkpoint in time, which better matches CalVer's semantics.
We don't look at every change made (new api function, bugfix, etc...) and think about what version number we'll need to bump for each release; we release every couple weeks and periodically bump the minor release number. Using a date instead removes any choice around this, and better signals to downstream projects what our release versions imply (and don't imply).
Dask is also composed of multiple projects with potentially different release schedules and versions (Dask, Distributed, Dask-Kubernetes, etc...) - adopting calver for all these projects would better group package release cycles by date, rather than trying to match up which version of dask-kubernetes was released near the last release of dask, etc...
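As a tiny illustration of the "checkpoint in time" point: a CalVer number of the common YYYY.M.MICRO form can be derived directly from the release date, so the version itself says when the release happened. This helper is a sketch, not dask's actual release tooling.

```python
from datetime import date

def calver(release_date, micro=0):
    # YYYY.M.MICRO scheme: year, non-zero-padded month, then a serial
    # number for multiple releases within the same month.
    return f"{release_date.year}.{release_date.month}.{micro}"

print(calver(date(2021, 3, 5)))      # 2021.3.0
print(calver(date(2021, 3, 5), 1))   # 2021.3.1
```

Under this scheme two projects released in the same month carry visibly matching version prefixes, which is the grouping-by-date benefit described above.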
I agree with @jcrist here. Generally I'm a huge advocate for SemVer, but with Dask's huge API surface it is hard to separate out enhancements and bug fixes, so CalVer makes more sense here. Dask and projects like it are mimicking the API of other libraries, which means we are beholden to those libraries for their release schedule and version semantics.
In some traditional software development lifecycles enhancements are made against a develop branch and bug fixes against a trunk branch, with the trunk being frequently merged into the develop branch. This allows minor and patch releases to be made independently of each other: enhancement releases may be made periodically (every few months) and bug fixes can happen at any time. In many projects in the Python ecosystem, like Dask, enhancements and bug fixes are made together on a single branch and are released frequently, so the separate semantics of minor and patch lose meaning, as most releases are minor releases. Occasionally there are emergency patch releases, but they do not happen often.
In SemVer, major releases indicate breaking changes. But as we must conform to the API of multiple other libraries, the result is that we would have to do major releases frequently. As @mrocklin says, we would have to adopt a Chrome or Firefox approach and end up with a very high major version number.
My personal preference would be for the entire community to adopt SemVer and be disciplined about it. But given the nature of the community (it's very research heavy) this is never going to happen. Therefore having a versioning scheme which looks like SemVer but has no semantics is only causing confusion. Switching to CalVer is a strong signal that versions are checkpoints in time that have been tested more thoroughly and created purposefully with knowledge of changes throughout the community, but nothing more.
Dask switched to CalVer (#100).
Why didn't the commit 7138f470f0e55f2ebdb7638ddc4dfe2e78671403 trigger a new major version of dask, since the function read_metadata is incompatible with older versions? The commit introduced the return of 4 values, where the old version only returned 3. According to semantic versioning, a major version bump would have been the correct behavior.
cudf got broken, because of that commit.
Code from the issue:
dask_cudf==0.14 is only compatible with dask<=0.19. In dask_cudf==0.16 the issue is fixed.
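The failure mode described above can be sketched with a hypothetical pair of functions (illustrative only, not the actual dask read_metadata code): a call site that unpacks three return values breaks as soon as the function starts returning four.

```python
def read_metadata_v1(path):
    # Old behavior: three return values.
    return "meta", "statistics", "parts"

def read_metadata_v2(path):
    # New behavior: four return values (a fourth element was added).
    return "meta", "statistics", "parts", "index"

# A downstream caller written against the old signature works fine:
meta, stats, parts = read_metadata_v1("data.parquet")

# The identical call site against the new version fails at runtime:
try:
    meta, stats, parts = read_metadata_v2("data.parquet")
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 3)
```

This is why changing a function's return arity is a breaking change for every caller that tuple-unpacks the result, even though the function's name and parameters are unchanged.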
This question was first asked here. Answered by @martindurant.
I would suggest a semantic versioning scheme for all non-internal functions (those without a leading underscore). If I look at version changes, I only read the changelogs of major version changes to find out whether they affect my code (is that behavior wrong?). For all other new versions I just expect that things work. Being unsure about breaking changes in every release does not increase trust in dask. I know that this will only happen rarely, but it's usually those rare cases which are really annoying.
Edit:
Greetings,
Julian