-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NEEDS IP CLEARANCE] ARROW-10228: Contribute Julia implementation #8393
Conversation
This pull request includes the core code from [Arrow.jl](https://github.com/JuliaData/Arrow.jl), a pure Julia implementation supporting the apache arrow in-memory/IPC format. This implementation supports the 1.0 version of the specification, including support for: * All primitive data types * All nested data types * Dictionary encodings and messages, including nested dict encodings * Extension types * Streaming, file, record batch, and replacement and isdelta dictionary messages It currently doesn't include support for: * Tensors or sparse tensors * Flight RPC * C data interface For testing, the original repo was setup to use a .travis.yml file to run all tests. Typical testing of a Julia package just involves running `Pkg.test("Arrow")` from the Julia REPL. I can explore setting up a github action if that's traditional for individual implementation testing (it seemed so from a quick scan of the repository). In the test folder, the `arrowjson.jl` includes code to read from the JSON-formatted arrow files. The `integrationtest.jl` file includes a command to convert from arrow to json, json to arrow, or validate between the two, based off what other language integration testing files seemed to support. I've looked a little into how exactly to update archery to run the Julia integration tests, but was also told on the mailing list that it can be a follow up. It's unclear to me whether an IP clearance form needs to be made, but am willing to do make one if required. It's also unclear exactly how the release process will work once the code is merged in. I understand there's a fixed release plan of some kind (time-based, I believe?); in terms of individual language releases, are they expected to make a corresponding release to the language package manager at the same time? For example, with the upcoming 2.0 release, and assuming the Julia code was all merged in, would it be expected that a 2.0 release of Arrow.jl be made in the [Julia General package registry](https://github.com/JuliaRegistries/General/tree/master/A/Arrow)? Or is the 2.0 tag for the arrow _format_ and implementations should release when their _support_ is up to date? I'm in communcations with the Julia package manager devs on how best to proceed with the setup here and subsequent release process.
To give a bit more light about 2.0, we're thinking of releasing ideally tomorrow or next week, so it seems a bit too tight :-) |
That's already a lot, congratulations!
There's no preference. We tend to use GHA more than Travis because it offers more execution capacity, but we're using a lot of the latter already. IMHO you can choose whichever you're comfortable with. We try to trim CI jobs when the changes are unrelated (for example, don't run the C++ tests when only Julia changes are involved, or the converse). We use CI configuration for this where possible, and we also have a script At some point it will be nice for Julia testing on Linux to use the same docker-compose harness as other implementations. That can be deferred to a later task or PR, though.
Definitely.
I think so. @wesm can comment.
|
This looks great, thanks Jacob! I can help you with the CI configuration to align with Arrow's current setup right after the 2.0 release. |
I think it would be best to conduct the IP clearance process for this codebase. The first step is to have a vote on the mailing list about accepting the code donation. We'll need to document the authors of the code (and collect CLAs from them -- note that CLAs are NOT necessary for contributions made directly to apahce/arrow) and any third party licenses involved as part of this |
The code included in this PR was done exclusively by me; I initially put the code under the MIT license, but recently changed it to the apache-2. Of the other julia package dependencies used (for compression, Tables.jl API integration, etc.), they all have the MIT license. I'm not aware of any other 3rd party licenses involved. Happy to do a CLA, just point me in the right direction.
Thanks @kszucs! Yeah, I've seen the recent activity towards 2.0, so definitely no rush here; just put up the PR here since it was mostly ready and I had a free night to dig in. |
@quinnj I can help you with the IP clearance process. To get started, here's a link to the Apache CLA: https://www.apache.org/licenses/contributor-agreements.html |
Ok, filled out the form and emailed the CLA. |
I've received a response that my CLA was received and recorded for Apache Arrow. Happy to move forward on next steps. On my end, I've worked with one of the core package ecosystem devs on the best path for moving the "Arrow.jl" julia package from its existing location to here. We're preparing a new pull request that includes the previous Julia package versions to preserve history (users can always install previous versions of a package as needed for reproducing old results of scripts/applications). The new pull request will be similar in code content, but include 7-8 commits, one for each past version of the package; if these can be rebase-merged (when things are ready to merge, obviously), that will ensure the correct package history is preserved for Julia users. Happy to discuss more details here if things aren't clear. UPDATE: new pull request: #8448 |
Ok @quinnj should we close this one in favor of 8448 then? I'll start the IP clearance paperwork this week. |
Sounds great |
Basically a reopening of #8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering. Closes #8547 from quinnj/jq/ARROW-10228 Authored-by: Jacob Quinn <quinn.jacobd@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Basically a reopening of apache#8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering. Closes apache#8547 from quinnj/jq/ARROW-10228 Authored-by: Jacob Quinn <quinn.jacobd@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This pull request includes the core code from
Arrow.jl, a pure Julia
implementation supporting the apache arrow in-memory/IPC format.
This implementation supports the 1.0 version of the specification, including support for:
It currently doesn't include support for:
For testing, the original repo was setup to use a .travis.yml file to
run all tests. Typical testing of a Julia package just involves running
Pkg.test("Arrow")
from the Julia REPL. I can explore setting up agithub action if that's traditional for individual implementation
testing (it seemed so from a quick scan of the repository).
In the test folder, the
arrowjson.jl
includes code to read from theJSON-formatted arrow files. The
integrationtest.jl
file includes acommand to convert from arrow to json, json to arrow, or validate
between the two, based off what other language integration testing files
seemed to support. I've looked a little into how exactly to update
archery to run the Julia integration tests, but was also told on the
mailing list that it can be a follow up.
It's unclear to me whether an IP clearance form needs to be made, but am
willing to do make one if required.
It's also unclear exactly how the release process will work once the
code is merged in. I understand there's a fixed release plan of some
kind (time-based, I believe?); in terms of individual language releases,
are they expected to make a corresponding release to the language
package manager at the same time? For example, with the upcoming 2.0
release, and assuming the Julia code was all merged in, would it be
expected that a 2.0 release of Arrow.jl be made in the Julia General
package
registry?
Or is the 2.0 tag for the arrow format and implementations should
release when their support is up to date? I'm in communications with
the Julia package manager devs on how best to proceed with the setup
here and subsequent release process.