Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEEDS IP CLEARANCE] ARROW-10228: Contribute Julia implementation #8393

Closed
wants to merge 2 commits into from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Oct 8, 2020

This pull request includes the core code from
Arrow.jl, a pure Julia
implementation supporting the apache arrow in-memory/IPC format.

This implementation supports the 1.0 version of the specification, including support for:

  • All primitive data types
  • All nested data types
  • Dictionary encodings and messages, including nested dict encodings
  • Extension types
  • Compression reading/writing
  • Large lists reading/writing
  • Custom alignment
  • Reading "tables" as one whole, or iterating record batches
  • Streaming, file, record batch, and replacement and isdelta dictionary messages

It currently doesn't include support for:

  • Tensors or sparse tensors
  • Flight RPC
  • C data interface

For testing, the original repo was setup to use a .travis.yml file to
run all tests. Typical testing of a Julia package just involves running
Pkg.test("Arrow") from the Julia REPL. I can explore setting up a
github action if that's traditional for individual implementation
testing (it seemed so from a quick scan of the repository).

In the test folder, the arrowjson.jl includes code to read from the
JSON-formatted arrow files. The integrationtest.jl file includes a
command to convert from arrow to json, json to arrow, or validate
between the two, based off what other language integration testing files
seemed to support. I've looked a little into how exactly to update
archery to run the Julia integration tests, but was also told on the
mailing list that it can be a follow up.

It's unclear to me whether an IP clearance form needs to be made, but am
willing to do make one if required.

It's also unclear exactly how the release process will work once the
code is merged in. I understand there's a fixed release plan of some
kind (time-based, I believe?); in terms of individual language releases,
are they expected to make a corresponding release to the language
package manager at the same time? For example, with the upcoming 2.0
release, and assuming the Julia code was all merged in, would it be
expected that a 2.0 release of Arrow.jl be made in the Julia General
package
registry
?
Or is the 2.0 tag for the arrow format and implementations should
release when their support is up to date? I'm in communications with
the Julia package manager devs on how best to proceed with the setup
here and subsequent release process.

This pull request includes the core code from
[Arrow.jl](https://github.com/JuliaData/Arrow.jl), a pure Julia
implementation supporting the apache arrow in-memory/IPC format.

This implementation supports the 1.0 version of the specification, including support for:
  * All primitive data types
  * All nested data types
  * Dictionary encodings and messages, including nested dict encodings
  * Extension types
  * Streaming, file, record batch, and replacement and isdelta dictionary messages

It currently doesn't include support for:
  * Tensors or sparse tensors
  * Flight RPC
  * C data interface

For testing, the original repo was setup to use a .travis.yml file to
run all tests. Typical testing of a Julia package just involves running
`Pkg.test("Arrow")` from the Julia REPL. I can explore setting up a
github action if that's traditional for individual implementation
testing (it seemed so from a quick scan of the repository).

In the test folder, the `arrowjson.jl` includes code to read from the
JSON-formatted arrow files. The `integrationtest.jl` file includes a
command to convert from arrow to json, json to arrow, or validate
between the two, based off what other language integration testing files
seemed to support. I've looked a little into how exactly to update
archery to run the Julia integration tests, but was also told on the
mailing list that it can be a follow up.

It's unclear to me whether an IP clearance form needs to be made, but am
willing to do make one if required.

It's also unclear exactly how the release process will work once the
code is merged in. I understand there's a fixed release plan of some
kind (time-based, I believe?); in terms of individual language releases,
are they expected to make a corresponding release to the language
package manager at the same time? For example, with the upcoming 2.0
release, and assuming the Julia code was all merged in, would it be
expected that a 2.0 release of Arrow.jl be made in the [Julia General
package
registry](https://github.com/JuliaRegistries/General/tree/master/A/Arrow)?
Or is the 2.0 tag for the arrow _format_ and implementations should
release when their _support_ is up to date? I'm in communcations with
the Julia package manager devs on how best to proceed with the setup
here and subsequent release process.
@github-actions
Copy link

github-actions bot commented Oct 8, 2020

@pitrou
Copy link
Member

pitrou commented Oct 8, 2020

To give a bit more light about 2.0, we're thinking of releasing ideally tomorrow or next week, so it seems a bit too tight :-)

@pitrou
Copy link
Member

pitrou commented Oct 8, 2020

This implementation supports the 1.0 version of the specification, including support for: [...]

That's already a lot, congratulations!

I can explore setting up a github action if that's traditional for individual implementation testing

There's no preference. We tend to use GHA more than Travis because it offers more execution capacity, but we're using a lot of the latter already. IMHO you can choose whichever you're comfortable with.

We try to trim CI jobs when the changes are unrelated (for example, don't run the C++ tests when only Julia changes are involved, or the converse). We use CI configuration for this where possible, and we also have a script ci/detect_changes.py that you should update for Julia.

At some point it will be nice for Julia testing on Linux to use the same docker-compose harness as other implementations. That can be deferred to a later task or PR, though.

I've looked a little into how exactly to update archery to run the Julia integration tests, but was also told on the mailing list that it can be a follow up.

Definitely.

It's unclear to me whether an IP clearance form needs to be made, but am willing to do make one if required.

I think so. @wesm can comment.

I understand there's a fixed release plan of some kind (time-based, I believe?); in terms of individual language releases, are they expected to make a corresponding release to the language package manager at the same time?

  1. It's loosely time-based, yes (i.e. no hard calendar constraints, but once every ~2 months is the target).
  2. yes

@kszucs
Copy link
Member

kszucs commented Oct 8, 2020

This looks great, thanks Jacob!

I can help you with the CI configuration to align with Arrow's current setup right after the 2.0 release.

@wesm wesm changed the title ARROW-10228: Contribute Julia implementation [NEEDS IP CLEARANCE] ARROW-10228: Contribute Julia implementation Oct 8, 2020
@wesm
Copy link
Member

wesm commented Oct 8, 2020

I think it would be best to conduct the IP clearance process for this codebase. The first step is to have a vote on the mailing list about accepting the code donation. We'll need to document the authors of the code (and collect CLAs from them -- note that CLAs are NOT necessary for contributions made directly to apahce/arrow) and any third party licenses involved as part of this

@quinnj
Copy link
Member Author

quinnj commented Oct 8, 2020

We'll need to document the authors of the code

The code included in this PR was done exclusively by me; I initially put the code under the MIT license, but recently changed it to the apache-2. Of the other julia package dependencies used (for compression, Tables.jl API integration, etc.), they all have the MIT license. I'm not aware of any other 3rd party licenses involved. Happy to do a CLA, just point me in the right direction.

I can help you with the CI configuration to align with Arrow's current setup right after the 2.0 release.

Thanks @kszucs! Yeah, I've seen the recent activity towards 2.0, so definitely no rush here; just put up the PR here since it was mostly ready and I had a free night to dig in.

@nealrichardson
Copy link
Member

@quinnj I can help you with the IP clearance process.

To get started, here's a link to the Apache CLA: https://www.apache.org/licenses/contributor-agreements.html

@quinnj
Copy link
Member Author

quinnj commented Oct 8, 2020

To get started, here's a link to the Apache CLA: https://www.apache.org/licenses/contributor-agreements.html

Ok, filled out the form and emailed the CLA.

@quinnj
Copy link
Member Author

quinnj commented Oct 12, 2020

I've received a response that my CLA was received and recorded for Apache Arrow. Happy to move forward on next steps.

On my end, I've worked with one of the core package ecosystem devs on the best path for moving the "Arrow.jl" julia package from its existing location to here. We're preparing a new pull request that includes the previous Julia package versions to preserve history (users can always install previous versions of a package as needed for reproducing old results of scripts/applications). The new pull request will be similar in code content, but include 7-8 commits, one for each past version of the package; if these can be rebase-merged (when things are ready to merge, obviously), that will ensure the correct package history is preserved for Julia users. Happy to discuss more details here if things aren't clear.

UPDATE: new pull request: #8448

@nealrichardson
Copy link
Member

Ok @quinnj should we close this one in favor of 8448 then?

I'll start the IP clearance paperwork this week.

@quinnj
Copy link
Member Author

quinnj commented Oct 12, 2020

Sounds great

@quinnj quinnj closed this Oct 12, 2020
@quinnj quinnj deleted the jq/ARROW-10228 branch October 12, 2020 21:10
nealrichardson pushed a commit that referenced this pull request Nov 8, 2020
Basically a reopening of #8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering.

Closes #8547 from quinnj/jq/ARROW-10228

Authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Basically a reopening of apache#8393. After further discussion, this provides the cleanest contribution in terms of IP clearances and git history, and it should be fairly straightforward to manage the Julia-side of package release registering.

Closes apache#8547 from quinnj/jq/ARROW-10228

Authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants