
Semantic versioning in dask repository #93

Closed
JulianWgs opened this issue Aug 17, 2020 · 22 comments

Comments

@JulianWgs

Why didn't the commit 7138f470f0e55f2ebdb7638ddc4dfe2e78671403 trigger a new major version of dask, since the function read_metadata is incompatible with older versions? The commit introduced the return of 4 values, whereas the old version only returned 3. According to semantic versioning, a major version bump would have been the correct behavior.

cudf got broken because of that commit.

Code from the issue:

>>> import cudf
>>> import dask_cudf
>>> dask_cudf.from_cudf(cudf.DataFrame({'a':[1,2,3]}),npartitions=1).to_parquet('test_parquet')

>>> dask_cudf.read_parquet('test_parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 213, in read_parquet
    **kwargs,
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 234, in read_parquet
    **kwargs
  File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 17, in read_metadata
    meta, stats, parts, index = ArrowEngine.read_metadata(*args, **kwargs)
ValueError: not enough values to unpack (expected 4, got 3)
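
For illustration (not from the original report), a downstream engine could guard against exactly this kind of change in return arity with a small compatibility shim; the helper name here is made up:

# Hypothetical shim that tolerates both a 3-value and a 4-value return
# from an engine's read_metadata (names follow the traceback above).
def call_read_metadata(engine, *args, **kwargs):
    result = engine.read_metadata(*args, **kwargs)
    if len(result) == 4:
        meta, stats, parts, index = result
    else:
        meta, stats, parts = result
        index = None
    return meta, stats, parts, index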

dask_cudf==0.14 is only compatible with dask<=0.19. In dask_cudf==0.16 the issue is fixed.

This question was first asked here. Answered by @martindurant.

I would suggest a semantic versioning scheme for all non-internal functions (those without a leading underscore). When I look at version changes, I only read the changelogs of major version changes to find out whether they affect my code (is that behavior wrong?). For all other new versions I just expect that everything keeps working. Being unsure about breaking changes in every release does not increase trust in dask. I know that this will only happen rarely, but it's usually those rare cases which are really annoying.

Edit:

  • Pandas uses semantic versioning since version 1.0: Link
  • Numpy has a time window of at least 1 year before breaking changes: Link
  • Scipy uses a mixture of semantic versioning and what Numpy does: Link
  • Spark: Link
  • Arch Linux: Link
  • Django uses a loose form of semantic versioning: Link
  • Tensorflow uses semantic versioning for all public APIs and clearly distinguishes between public and private: Link
  • For PyTorch I did not find a document
  • I didn't find information for ray, koalas or modin, but all of them are in prerelease status

Greetings,
Julian

@martindurant
Member

( related: #84 )

@mrocklin
Member

Hi Julian,

Dask doesn't use semantic versioning. We use a rolling release cycle similar to numpy, pandas, scikit-learn, and other projects in this space.

Due to the large API surface area, pretty much every release would technically bump the major version.

@mrocklin
Member

Regarding the actual issue, cc'ing @rjzamora

@JulianWgs
Author

JulianWgs commented Aug 18, 2020

Hi Julian,

Dask doesn't use semantic versioning. We use a rolling release cycle similar to numpy, pandas, scikit-learn, and other projects in this space.

Due to the large API surface area, pretty much every release would technically bump the major version.

Pandas does use semantic versioning since 1.0 (source). I also think the comparison doesn't fit neatly, as Numpy, Scipy and Pandas release much less frequently.

Wouldn't it make more sense to use calendar versioning then, as rolling-release Linux distributions like Arch do? Or to define core and non-core modules which would or would not trigger a new major version?

If not a breaking change, what causes a major version increase? Right now the versioning scheme is not transparent to me (or others).

On another note: I do like the idea of rolling releases for operating systems, but how does this actually apply to Python software packages?

  • Both pip and conda try to mitigate dependency issues, but this implies it is sometimes useful to stick with an old version?
  • Are there tools and procedures for the CI pipeline? Usually one could just run the development branch against their own CI pipeline and there would be time to fix issues. With the plan to release weekly, this could leave less than seven days (or less than five) to react to a breaking change before the new breaking version is released.

What do you think about the idea of having a special tag for pull requests that are approved to be merged? This tag could be picked up by the CI pipelines of depending packages and run before the PR is merged. I would say a 24h time window is enough to raise an objection before merging to master and releasing in the next cycle. An objection would not necessarily mean the changes would not be merged, but would leave room for a discussion. Merging after that time period could also be automated. This process would replace code freezes before a release, which were discussed in the weekly release schedule issue.

Greetings,
Julian

@mrocklin
Member

mrocklin commented Aug 18, 2020 via email

@JulianWgs
Author

Ok, if I find time I will do that :)

May I repeat my question from the comment before:

  • What causes a major version increase?

Greetings,
Julian

@JulianWgs
Author

JulianWgs commented Aug 19, 2020

I also want to add that Numpy does not break functionality within a 1 year time window (at least two releases between warning and change for a half year release schedule).

https://github.com/rgommers/numpy/blob/master/doc/neps/nep-0023-backwards-compatibility.rst
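
For concreteness, the warn-before-change pattern that NEP 23 describes might look roughly like this (a generic sketch; the function name and message are made up):

import warnings

def read_metadata(path):
    # Sketch of a deprecation window: warn for at least two releases
    # (about one year on a half-year release schedule) before the
    # behavior actually changes or the old return shape is removed.
    warnings.warn(
        "The current return value of read_metadata is deprecated and "
        "will change in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    ...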

@mrocklin
Member

What causes a major version increase?

It's only happened a couple of times. The first time we dropped Python 2. The second time there was a large set of changes to both dask and distributed.

I also want to add that Numpy does not break functionality within a 1 year time window

They actually break functionality in every release (small things change) but larger public facing APIs are intended to remain stable.  Dask has the same objective (we very rarely intentionally break major public facing APIs).

The break that you've uncovered occurred at the intersection of Dask, Arrow, and RAPIDS. All of those projects are rapidly changing, and so breaks like this should be somewhat expected. Fortunately they're also usually rapidly fixed. It's worth pointing out that if we don't break things then we move very slowly, and new things like Dask+Arrow+RAPIDS don't end up developing.

@rjzamora
Member

rjzamora commented Aug 19, 2020

@JulianWgs - I do apologize for breaking cudf with the recent read_parquet change! I can certainly understand the frustration. With that said, a primary motivation for the Dask change was to improve the performance of dask_cudf. dask_cudf and Dask are currently being developed in lockstep, and so a specific cudf release will typically rely on the latest Dask release. We do try to avoid breaking changes to the user-facing API (like dd.read_parquet).

It is worth noting that I typically assume RAPIDS is the main/only external consumer of the internal parquet Engine API. This is why I considered it "fair game" to make a breaking Dask-RAPIDS change, since I also had a complementary cudf PR ready to merge. If there are actually other external consumers of this API, it would be great to know. Ideally, we can/should establish that there is a reasonable consensus before any breaking changes are merged.

@JulianWgs
Author

@rjzamora Thank you for the explanation. No harm done :) The only thing I would've wished for is that dask_cudf==0.14 pinned dask<=0.19.
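
For illustration, such a pin could have looked roughly like this in dask_cudf's setup.py (a sketch, not the actual packaging code; the bound just echoes the version mentioned above):

from setuptools import setup

setup(
    name="dask_cudf",
    version="0.14",
    # Sketch only: cap the dask version this release is known to work with.
    install_requires=["dask<=0.19"],
)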

What I want to propose is a document similar to what the other libraries have, where versioning and breaking changes are described. I have the feeling that right now things rely more on tight communication between stakeholders than on a clearly defined process. Don't get me wrong, I think people are more important than processes, but I worry about the scalability and future interpretation of the current approach. Also it might not be transparent to new users and contributors.

I've updated the original issue to incorporate such documents from different scientific libraries.

Is this a good idea? Who would write such a document? I don't see myself having the ability to do it.

@mrocklin

They actually break functionality in every release (small things change) but larger public facing APIs are intended to remain stable. Dask has the same objective (we very rarely intentionally break major public facing APIs).

Still, they release only twice a year, which I think is a completely different situation from the one dask is in. Are there any other libraries with a rolling release model and a high release cadence? After a quick Google search I didn't find any.

I also want to point to hypothesis which releases every few days and strictly follows semantic versioning.

@mrocklin
Member

I also want to point to hypothesis which releases every few days and strictly follows semantic versioning.

Sure, but the API surface area of hypothesis is very tight. Dask has many user facing APIs, each of which is quite large (numpy, pandas, ...)

Still only twice a year which I think is a completely different situation of which dask is in. Are there any other libraries with a rolling release model and high release cadence? After a quick Google search I didn't find any.

To adapt reasonably to the activity around Dask we need to maintain a relatively frequent release cadence (there are far more bug reports for issues that have already been fixed in master than people concerned about things moving too quickly). If we do semantic versioning then we would be on Dask version 205.0.0 or something similar. I understand and agree with your stance in principle, but I don't think that it makes sense for Dask in practice today.

I think that it could make sense if we separated out dask array, dask dataframe and so on into separate projects with separate versioning systems, but I think that that would likely create more issues than it resolves.

@JulianWgs
Author

What do you think about writing down the current state of things in the docs like the other libraries have done? That would resolve the issue for me.

After the discussion I agree that semantic versioning would not be the right choice. Calendar versioning might seem like the correct choice, but it is very unusual. I also share the same concerns about separating the repo.

@mrocklin
Member

What do you think about writing down the current state of things in the docs like the other libraries have done? That would resolve the issue for me.

You mean our general policy around releasing? I have no objection to this in general, provided it's in a suitable place. If you have time to collect a few doc pages about releasing in prominent libraries I would actually find reading through those pretty interesting.

After the discussion I agree that semantic versioning would not be the right choice. Calendar versioning might seem like the correct choice, but it is very unusual. I also share the same concerns about separating the repo.

I like CalVer in principle, especially for projects like Dask that have a wide API surface-area. It seems sensible to me. I think that the reasons to stay with our current scheme are mostly about inertia and familiarity for people, and less about precise versioning.

@JulianWgs
Author

I've updated the original issue with some examples. I like how Tensorflow does it:

All the documented Python functions and classes in the tensorflow module and its submodules, except for

  • Private symbols: any function, class, etc., whose name start with _
  • Experimental and tf.contrib symbols, see below for details.
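
As a generic illustration of that public/private split (not actual tensorflow or dask code):

# Only names without a leading underscore (and listed in __all__) are
# public and covered by compatibility guarantees; _helpers may change freely.
__all__ = ["read_metadata"]

def read_metadata(path):
    """Public API: changing this would require a deprecation cycle."""
    return _parse_footer(path)

def _parse_footer(path):
    """Private helper: free to change in any release."""
    ...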

Since Dask uses upstream libraries like Pandas and Numpy, is there a policy for which versions are supported and when versions are deprecated?

@mrocklin
Member

mrocklin commented Aug 22, 2020 via email

@JulianWgs
Author

Did you have time to look through the provided resources? Do you want more examples?

Dask maintains minimum versions in our setup.py file for dask[array] and dask[dataframe] and conda dependencies in the conda-forge recipe.

I didn't mean a definition, but a policy (a written document) describing when certain versions will be deprecated.
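
For reference, minimum versions for the optional extras are declared in setup.py roughly like this (the version numbers below are placeholders, not the real pins):

from setuptools import setup

setup(
    name="dask",
    extras_require={
        # Placeholder bounds only; the real minimums live in dask's setup.py.
        "array": ["numpy >= 1.0"],
        "dataframe": ["numpy >= 1.0", "pandas >= 1.0"],
    },
)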

@TomAugspurger
Member

Just to clarify, pandas doesn't strictly follow semver. See https://pandas.pydata.org/docs/development/policies.html#policies-version.

We're discussing Python (and NumPy) version support in #66, though there's disagreement there.

We don't have a formal policy for when we bump a pandas version. Typically when it becomes a pain to maintain older versions, though whether we can / try to faithfully match the behavior of the installed version of pandas differs from method to method. And you can back dask dataframes by cudf DataFrames, which has its own release cadence, bugs, and behaviors.

@jcrist jcrist transferred this issue from dask/dask Sep 11, 2020
@jcrist
Member

jcrist commented Sep 11, 2020

(note - I've moved this to the community repository for further discussion)

@hammer

hammer commented Sep 15, 2020

For reference, here are the issues in which major version bumps were discussed previously.

@jcrist
Member

jcrist commented Sep 15, 2020

I like CalVer in principle, especially for projects like Dask that have a wide API surface-area. It seems sensible to me. I think that the reasons to stay with our current scheme are mostly about inertia and familiarity for people, and less about precise versioning.

I also like CalVer (https://calver.org/), and think we should consider adopting it. Most projects in the Python ecosystem don't strictly follow semver; major and minor release bumps happen when the devs feel like things have changed enough for a bump, but not strictly around api changes. For a project like Dask that has lots of surface area and a rolling development cycle, release numbers seem more like a checkpoint in time, which better matches CalVer's semantics. We don't look at every change made (new api function, bugfix, etc...) and think about what version number we'll need to bump for each release; we release every couple of weeks and periodically bump the minor release number. Using a date instead removes any choice around this, and better signals to downstream projects what our release versions imply (and don't imply).

Dask is also composed of multiple projects with potentially different release schedules and versions (Dask, Distributed, Dask-Kubernetes, etc...) - adopting calver for all these projects would better group package release cycles by date rather than trying to match up which version of dask-kubernetes was released near the last release of dask, etc...
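
As a concrete illustration of "release numbers as checkpoints in time", a CalVer version can be derived purely from the release date (a sketch; the exact format shown here is an assumption, not a decided scheme):

import datetime

def calver_version(date=None):
    # Sketch: build a date-based version string such as "2020.9.15",
    # independent of how large or breaking the changes were.
    d = date or datetime.date.today()
    return f"{d.year}.{d.month}.{d.day}"

print(calver_version(datetime.date(2020, 9, 15)))  # -> 2020.9.15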

@jacobtomlinson
Member

I agree with @jcrist here. Generally I'm a huge advocate for SemVer, but with Dask's huge API surface it is hard to separate out enhancements and bug fixes, so CalVer makes more sense here.

Dask and projects like it mimic the APIs of other libraries, which means we are beholden to those libraries for their release schedules and version semantics.

In some traditional software development lifecycles, enhancements are made against a develop branch and bug fixes against a trunk branch, with the trunk being frequently merged into the develop branch. This allows minor and patch releases to be made independently of each other. Enhancement releases may be made periodically (every few months) and bug fixes can happen at any time.

In many projects in the Python ecosystem, like Dask, enhancements and bug fixes are made together on a single branch and are released frequently, so the separate semantics of minor and patch lose meaning as most releases are minor releases. Occasionally there are emergency patch releases, but they do not happen often.

In SemVer, major releases indicate breaking changes. But as we must conform to the APIs of multiple other libraries, the result is that we would have to do major releases frequently. As @mrocklin says, we would have to adopt a Chrome or Firefox approach and end up with a very high major version number.

My personal preference would be for the entire community to adopt SemVer and be disciplined about it. But given the nature of the community (it's very research heavy) this is never going to happen. Therefore having a versioning scheme which looks like SemVer but has no semantics is only causing confusion.

Switching to CalVer is a strong signal that versions are checkpoints in time that have been tested more thoroughly and created purposefully with knowledge of changes throughout the community, but nothing more.

@JulianWgs
Author

Dask switched to CalVer (#100).
