Release a new version of DataFusion to crates.io #771

Closed
alamb opened this issue Jul 23, 2021 · 36 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jul 23, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current version of datafusion on crates.io is 4.0.0 (which among other things doesn't work with the newly released arrow 5.0.0)

Describe the solution you'd like
It would be great if we released a DataFusion 5.0.0 (or some other number if we want to diverge from arrow) -- doing so would likely involve porting the arrow-rs release scripts, from https://github.com/apache/arrow-rs/tree/master/dev/release and then sending it to the dev mailing list for a formal vote

Starting this ticket to gather some feedback

@houqp
Member

houqp commented Jul 27, 2021

@andygrove @jorgecarleitao I couldn't find the exact commit for the 4.0.0 version we have on crates.io. Do you know where we can find it? It would be good to retroactively create the 4.0.0 tag for changelog automation. I am guessing the commit is around ddaea81?

@jorgecarleitao do we want to bundle the Python release with the main DataFusion release as well? Or do we want to release the Python binding on its own cadence? I think if the Python binding needs a more frequent release cadence than the core, then it's better to have a separate release process for it. Otherwise it would be less work to bundle it with the core DataFusion release.

@alamb
Contributor Author

alamb commented Jul 27, 2021

I think 4.0.0 was released when datafusion was in the https://github.com/apache/arrow repo, specifically https://github.com/apache/arrow/tree/f959141ece4d660bce5f7fa545befc0116a7db79

I dug that version out of the email thread for the 4.0.0 RC3 release:

https://lists.apache.org/thread.html/rfb4e0065347235571cd0e0e71a7fa38c9652f07e27f98f89b4398feb%40%3Cdev.arrow.apache.org%3E

@andygrove
Member

We also need to consider what to do about Ballista release cadence and version numbers. My preference would be to release DataFusion and Ballista together. Ballista version is currently 0.5.0-SNAPSHOT and I think that releasing as 0.5.0 would be appropriate.

@jorgecarleitao
Member

That would be awesome, I would +1 on aligning to 0.5.0, also for Python Datafusion (i.e. bump from 0.2.1 to 0.5.0); I think we do have a problem with DataFusion, that is already 5.0.0 on crates.io?

@alamb
Contributor Author

alamb commented Jul 27, 2021

FWIW, the most recent release to crates.io for datafusion is 4.0.0: https://crates.io/crates/datafusion

Given that DataFusion doesn't have the associated arrow ecosystem, I think using an entirely different versioning scheme is fine for DataFusion and Ballista. I don't really have a strong opinion on versions for DataFusion (e.g., I would be happy to have it be 0.5.0 or something, though I don't know how we would handle the existing versions).

@alamb
Contributor Author

alamb commented Jul 27, 2021

BTW @houqp / @jimexist I am happy to help here / run whatever scripts / share my experience with doing the release process with arrow-rs -- just let me know what I can do to help

@houqp
Member

houqp commented Jul 28, 2021

I also prefer that we go with our own versioning scheme to better adhere to semantic versioning, at the very least for Ballista and the DataFusion Python binding.

For datafusion the ship has already sailed unfortunately. If we are to release datafusion and ballista together, how do we manage the version differences? Do we just name the same tarball with different names? Or do we actually produce two different release tarballs with project specific sources?

Regardless of what versioning scheme we use, the datafusion python documentation would still be needed. @jimexist perhaps you could help start building the automation for that?

@houqp
Member

houqp commented Jul 28, 2021

Here is what I think could work for us: release all project sources in a single signed repo source tarball, which gets uploaded to apache svn. Then from within that same source release, we publish datafusion crate, ballista crate and datafusion python wheels with different versions.

For example, in the upcoming release, we could have datafusion-5.0.0, ballista-0.5.0 and datafusion python 0.3.0 (or 0.5.0 if we want to align it with ballista) published from the same source tarball apache-arrow-datafusion-5.0.0.tar.gz. Name of the source tarball won't matter much here since I don't expect downstream consumers of these projects to use it directly. Instead they would use crates.io or pypi with the proper project name and version.
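
As a rough illustration of the mapping described above, here is a minimal Python sketch of one source tarball feeding several independently versioned packages. The `Artifact` class and the `plan_release` helper are hypothetical names made up for this example, and the PyPI package name is an assumption; only the tarball name and the version numbers come from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    registry: str  # where downstream users fetch the package
    name: str      # package name on that registry (assumed, not confirmed)
    version: str


def source_tarball_name(datafusion_version: str) -> str:
    # Naming pattern copied from the example in the comment above.
    return f"apache-arrow-datafusion-{datafusion_version}.tar.gz"


def plan_release(datafusion: str, ballista: str, python: str) -> dict:
    """Map one voted source tarball to the packages published from it."""
    return {
        "tarball": source_tarball_name(datafusion),
        "artifacts": [
            Artifact("crates.io", "datafusion", datafusion),
            Artifact("crates.io", "ballista", ballista),
            Artifact("pypi", "datafusion", python),
        ],
    }


plan = plan_release(datafusion="5.0.0", ballista="0.5.0", python="0.3.0")
for artifact in plan["artifacts"]:
    print(plan["tarball"], "->", artifact.registry, artifact.name, artifact.version)
```

The point of the sketch is that only the tarball carries the DataFusion version in its name, while each published package keeps its own version.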

The consequence of this is that every time we need to release a new version of datafusion, the python binding, or ballista, we would need to vote on and release a new version of the datafusion repo as a whole. The repo source release should be pretty lightweight given that we don't need to do maintenance releases in the current state of the project. All we have to do is run create-tarball.sh, send it to the dev list for a vote, and then finally run release-tarball.sh.

Changelog for each project will need to be generated separately before we propose a source release in apache svn.

WDYT?

@houqp
Member

houqp commented Jul 28, 2021

On a side note, it turns out the tree we used for the 4.0.0 release (https://github.com/apache/arrow/commits/f959141ece4d660bce5f7fa545befc0116a7db79) is not in our repo.

Should I push this tree to our repo and tag it with 4.0.0? Alternatively, the state of Rust code in that release maps to 31dd3cd in our master branch, so we could also just tag that commit as 4.0.0. But I think this might not align with Apache's release policy.

@jimexist
Member

Regardless of what versioning scheme we use, the datafusion python documentation would still be needed. @jimexist perhaps you could help start building the automation for that?

@jorgecarleitao has any previous work already been done in this area?

also there's how and where docs are supposed to be hosted, e.g.

@alamb
Contributor Author

alamb commented Jul 28, 2021

Then from within that same source release, we publish datafusion crate, ballista crate and datafusion python wheels with different versions.

I think that would work nicely and I think it makes a lot of sense.

Should I push this tree to our repo and tag it with 4.0.0?

I would recommend not doing this (because it will effectively bring along the (large) history of the arrow repo with it, effectively making the arrow-datafusion repo several times larger).

Alternatively, the state of Rust code in that release maps to 31dd3cd in our master branch, so we could also just tag that commit as 4.0.0. But I think this might not align with Apache's release policy.

I think tagging 31dd3cd as 4.0.0 is fine (and I did something similar in arrow-rs) -- the official apache release policy, at least as I understand it and as discussed on the arrow-dev mailing lists, was centered around the tarballs as the artifacts. Since the 4.0.0 release / announcement was built from https://github.com/apache/arrow/commits/f959141ece4d660bce5f7fa545befc0116a7db79, anyone who wants to know exactly what was in the release can use that reference. Tagging 31dd3cd in this repo to compute the changelog seems like it would be fine.

@houqp
Member

houqp commented Jul 28, 2021

@jimexist there was some previous discussion on how to host a website for datafusion on the dev list: https://lists.apache.org/thread.html/r0ed76cc60cdf651e8cf5c82a21cc64114c1f6d174dc5487434bd32ef%40%3Cdev.arrow.apache.org%3E.

Read the Docs is certainly the route with the least amount of work, but I am not 100% sure if it's something allowed for apache projects. The dev mailing list recommended hosting our docs as a sub path under the main arrow website. This incurs more work since it is not a turnkey solution, but it has the added benefit of letting us leverage the arrow brand for marketing Datafusion and potentially improving SEO. We might be able to reuse the same automation that's used to generate https://arrow.apache.org/docs/python/api.html for datafusion too.

@houqp
Member

houqp commented Jul 28, 2021

Quick update on my end: I have pushed the 4.0.0 tag with a reference to 31dd3cd to help with changelog generation. Next up, I will look into subproject changelog automation using PR labels while waiting for more feedback on the release process proposed in #771 (comment).
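
To make the idea of label-driven changelogs concrete, here is a small Python sketch that groups pull request titles by subproject label. It is not the automation that was actually built for this release. The label names, the `group_by_subproject` helper, and the sample pull requests are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical label names; the labels actually used in the repo may differ.
SUBPROJECT_LABELS = ("datafusion", "ballista", "python")


def group_by_subproject(pull_requests):
    """Group PR titles under each subproject label they carry.

    `pull_requests` is an iterable of (title, labels) pairs. A PR with
    several subproject labels is listed under each of them; a PR with none
    is left out of every changelog.
    """
    changelog = defaultdict(list)
    for title, labels in pull_requests:
        for label in labels:
            if label in SUBPROJECT_LABELS:
                changelog[label].append(title)
    return dict(changelog)


# Made-up sample data, not real pull requests from the repository.
prs = [
    ("Fix a planner bug", ["datafusion", "bug"]),
    ("Add an executor option", ["ballista", "enhancement"]),
    ("Update a shared dependency", ["datafusion", "ballista"]),
    ("Fix a typo in the README", ["documentation"]),
]
changelog = group_by_subproject(prs)
for subproject, titles in changelog.items():
    print(subproject, titles)
```

A change that touches two subprojects shows up in both changelogs, which is why consistent labeling of past issues and PRs matters before the changelog is generated.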

@andygrove
Member

andygrove commented Jul 30, 2021 via email

@houqp
Member

houqp commented Jul 31, 2021

I generally only have time at weekends though.

I am in a similar boat at the moment. I have gone back and tagged most of the issues with proper labels for changelog automation. I should be able to get the automation completed tomorrow and send a PR for you all to review both the code and changelog.

@andygrove I think where you will be able to help the most would be writing the announcement blog post for the release ;) But I don't think this should be a blocker for the crates.io release. In fact, if we are concerned about reserving the crate names, we could grab the name now with a placeholder release similar to https://crates.io/crates/roapi/0.1.0.

@alamb
Contributor Author

alamb commented Jul 31, 2021

I agree that writing the blog post concurrently / after the release is totally fine.

I am also happy to help draft the release blog

@andygrove
Member

andygrove commented Jul 31, 2021 via email

@houqp
Member

houqp commented Jul 31, 2021

I have prepared a preview in #801. After the PR gets merged, the next step is to create a signed tarball for voting.

Here are the generated changelogs for each subproject:

My proposed release process requires adding two new sets of tags: ballista-x.y.z and python-x.y.z. The x.y.z tag will be used to tag release versions for datafusion itself and is required for every voted datafusion release. ballista-x.y.z and python-x.y.z tags are optional and only needed if we want to release a new version of ballista or python binding together with the same datafusion release. In other words, releasing a new version of python binding or ballista requires a new release of datafusion as well, but not the other way around.
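
The tag scheme in the paragraph above can be sketched in a few lines of Python. The `release_tags` function is a hypothetical helper written for this illustration and is not part of the release scripts; only the tag formats come from the comment.

```python
def release_tags(datafusion, ballista=None, python=None):
    """Return the git tags for one voted release.

    The bare x.y.z tag is always present. The prefixed tags are added only
    when that subproject is released alongside DataFusion.
    """
    tags = [datafusion]
    if ballista is not None:
        tags.append(f"ballista-{ballista}")
    if python is not None:
        tags.append(f"python-{python}")
    return tags


print(release_tags("5.0.0", ballista="0.5.0"))
print(release_tags("5.0.0"))
```

Making the DataFusion version a required argument while the others are optional mirrors the rule that Ballista or the Python binding cannot be released without a DataFusion release, but not the other way around.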

I am going to send the proposed release setup to the dev list to gather more feedback today.

@houqp
Member

houqp commented Aug 3, 2021

I think I have all the apache release related automation completed for #801. @jimexist let me know if you need any help with python documentation automation. I am aiming to push a final update to #801 this weekend and mark it as ready for review/merge if everything goes well this week. We can also release the python doc as part of a fast follow release after the 5.0.0 release if we don't want to rush it. My current goal is to make the datafusion release as lightweight as we can so we can release more often.

@alamb
Contributor Author

alamb commented Aug 7, 2021

👍 -- @houqp let me know if there is anything I can do to help

@houqp
Member

houqp commented Aug 7, 2021

I will update #801 and mark it as ready for review tomorrow. @andygrove I am assuming we want to wait for #831 for the release?

@andygrove
Member

andygrove commented Aug 7, 2021 via email

@andygrove
Member

#831 is ready for review and I would also like to merge #833 before we release.

@houqp
Member

houqp commented Aug 7, 2021

@alamb @andygrove @Dandandan @jorgecarleitao @nevi-me @jimexist #801 is now ready for review. Once it's merged, we will be able to push tags, release the tarball and send an RC voting email to the dev list.

@houqp
Member

houqp commented Aug 7, 2021

Filed #837 to track the python doc release as a quick follow up so we don't have to block the current release.

@houqp
Member

houqp commented Aug 14, 2021

OK, the release has been approved on the dev list. I have pushed the final release tags on Github. However, I don't have write access to arrow's release directory in SVN, crates.io or PyPI, so I will need someone who has access to these resources to help me finish up the release. The steps are documented at https://github.com/houqp/arrow-datafusion/blob/qp_release_doc/dev/release/README.md#finalize-the-release.

The remaining steps are:

@alamb
Contributor Author

alamb commented Aug 14, 2021

I will run the script to upload to SVN...

@alamb
Contributor Author

alamb commented Aug 14, 2021

(@houqp the credentials I use for svn are my apache username/password). I am not sure what, if any, additional permissions are needed to upload to the release svn directory.

@alamb
Contributor Author

alamb commented Aug 14, 2021

The release files can be found here: https://dist.apache.org/repos/dist/release/arrow/arrow-datafusion-5.0.0/

I do not have permissions on the datafusion or ballista projects on crates.io. I believe @andygrove will have to grant one or both of us those permissions.

Thanks again for all this work @houqp 🚀

@andygrove
Member

I have invited @alamb to become an owner on all of the crates (I think this is currently a PMC role) and I have published the crates. @houqp Could you call the vote on the mailing list?

https://crates.io/crates/datafusion/
https://crates.io/crates/ballista/
https://crates.io/crates/ballista-core/
https://crates.io/crates/ballista-executor/
https://crates.io/crates/ballista-scheduler/

This whole process went very smoothly. Thank you @houqp and @alamb.

@houqp
Member

houqp commented Aug 14, 2021

@alamb I am double checking with asf infra on the permission, but it looks like only PMC members have write access to the release folder. There is one thing I noticed that is not right: the release tarball is signed with my signing key. I have added it to the end of https://dist.apache.org/repos/dist/dev/arrow/KEYS, but I don't have permission to add it to release/arrow/KEYS either. I believe my key should be added to the release KEYS file too so people can use it to verify the signature.
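
For context on why the key needs to be in the release KEYS file, here is a minimal Python sketch of how a downstream user might check the tarball signature with GnuPG. It assumes `gpg` is installed, that the KEYS file and tarball were already downloaded, and that the detached signature sits next to the tarball with an `.asc` suffix. The function names are hypothetical and the tarball file name is taken from the example earlier in this thread.

```python
import subprocess


def verification_commands(keys_file, tarball):
    """Build the gpg commands that check a detached signature."""
    signature = tarball + ".asc"  # assumed name of the detached signature
    return [
        ["gpg", "--import", keys_file],            # load the published keys
        ["gpg", "--verify", signature, tarball],   # check signature against tarball
    ]


def verify(keys_file, tarball):
    # check=True raises CalledProcessError if gpg reports a failure.
    for command in verification_commands(keys_file, tarball):
        subprocess.run(command, check=True)


for command in verification_commands("KEYS", "apache-arrow-datafusion-5.0.0.tar.gz"):
    print(" ".join(command))
```

If the signing key is missing from the imported KEYS file, the verify step cannot confirm who signed the tarball, which is the problem described above.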

@houqp Could you call the vote on the mailing list?

@andygrove do we still need to call a vote for the crates.io release? I thought the vote on the source tarball should have already covered the crates.io source release?

@andygrove
Member

andygrove commented Aug 14, 2021 via email

@houqp
Member

houqp commented Aug 14, 2021

OK, I have called the vote on the dev list and will add this step to the release document as well.

@houqp
Member

houqp commented Aug 15, 2021

Filed #887 to track Python PyPI release as a follow up.

I believe the only remaining item is to copy my code signing key from dev/arrow/KEYS to release/arrow/KEYS.

@alamb
Contributor Author

alamb commented Aug 15, 2021

I believe the only remaining item is to copy my code signing key from dev/arrow/KEYS to release/arrow/KEYS.

@houqp I have added your key here: https://dist.apache.org/repos/dist/release/arrow/KEYS

@houqp
Member

houqp commented Aug 15, 2021

Thank you @alamb, we are all good to close this issue then :)


5 participants