Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split up Arrow Crate #2594

Closed
tustvold opened this issue Aug 26, 2022 · 19 comments · Fixed by #3384 or #3754
Closed

Split up Arrow Crate #2594

tustvold opened this issue Aug 26, 2022 · 19 comments · Fixed by #3384 or #3754
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

tustvold commented Aug 26, 2022

TLDR rather than fighting entropy lets just brute-force compilation

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The arrow crate is getting rather large, and is starting to show up as a non-trivial bottleneck when compiling code, see #2170. There have been some efforts to reduce the amount of generated code, see #1858, but this is going to be a perpetual losing battle against new feature additions.

I think there are a couple of problems currently:

  1. Limited build parallelism, especially if codegen-units is set low
  2. Upstream crates have to "depend" on functionality they don't need, e.g. parquet depending on compute kernels
  3. Minor changes force large amounts of recompilation, with incremental compilation only helping marginally
  4. Codegen is rarely linear in complexity, consequently larger codegen units take longer than the same amount of code in smaller units

All these conspire to often result in an arrow shaped hole in compilation, where CPUs are left idle.

Some numbers from my local machine

  • Release with default features: 232 seconds
  • Release with default features without comparison kernels: 150 seconds
  • Release with default features without compute kernels: 70 seconds
  • Release without default features without compute kernels: 60 seconds

The vast majority of the time all bar a single core is idle.

Describe the solution you'd like

I would like to propose we split up the arrow crate, into a number of sub-crates that are then re-exported by the top-level arrow crate. Users can then choose to depend on the batteries included arrow crate, or more granular crates.

Initially I would propose the following split:

  • arrow-csv: CSV reader support
  • arrow-ipc: IPC support
  • arrow-json: JSON support (related to Make JSON support Optional via Feature Flag #2300)
  • arrow-compute: contents of compute module
  • arrow-test: arrow test_utils (not published)
  • arrow-core: everything else

There is definitely scope for splitting up the crates further after this, in particular the comparison kernels might be a good candidate to live on their own, but I think lets start small and go from there. I suspect there is a fair amount of disentangling that will be necessary to achieve this.

Describe alternatives you've considered

Feature flags are another way this can be handled, however, they have a couple of limitations:

  • It is impractical to test the full combinatorial explosion of combinations, which allows for bugs to sneak through
  • They are unified for a target which limits build parallelism, just because say DataFusion depends on arrow with CSV support, shouldn't force the parquet crate to wait for this to compile before it can start compiling
  • Poor UX:
    • Discoverability is limited, it can be hard to determine what features gate what functionality
    • Hard to determine if the feature flag set is minimal, no equivalent of cargo-udeps
    • It can be a non-trivial detective exercise to determine why a given feature is being enabled
    • Necessitate counter-intuitive hacks to play nicely in multi-crate workspaces - see workspace hack

Additional context

@jimexist recently drove an initiative to do something similar to DataFusion which has worked very well - apache/datafusion#1750

FYI @alamb @jhorstmann @nevi-me

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Aug 26, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022
tustvold added a commit that referenced this issue Aug 28, 2022
* Split out integration test plumbing (#2594) (#2300)

* Fix RAT
@alamb
Copy link
Contributor

alamb commented Aug 31, 2022

I like this idea, for what it is worth. 👍

@tustvold
Copy link
Contributor Author

tustvold commented Sep 9, 2022

I've started work on this with #2693, I think the final split will likely end up being a different from what was initially proposed based on what components can easily be separated. I next plan to split out array data, followed by the arrays themselves. This should then allow splitting out some of the heavier kernels, e.g. sort, compare, cast, etc...

@alamb
Copy link
Contributor

alamb commented Sep 9, 2022

If the history with the datafusion split is any indication, this work is likely to end up generating lots of PRs

You can see how @jimexist broke that down and we tracked it in apache/datafusion#1750 -- perhaps something similar could be applied here.

I am happy to try and review these mechanical PRs more quickly so we can get the project done more quickly

@andygrove
Copy link
Member

How feasible would it be to move the type definitions (such as DataType, Field, and Schema) into an arrow-types crate? Maybe this is a path to arrow2 being able to use a common type system.

@tustvold
Copy link
Contributor Author

tustvold commented Sep 9, 2022

I don't see an obvious reason why that would not be possible, I'm not sure how generally useful the types will be without the array definitions though...

@andygrove
Copy link
Member

The datafusion-sql crate only uses arrow::datatypes for example, so just depending on arrow-types there would presumably help with compilation times.

@andygrove
Copy link
Member

@jorgecarleitao Would arrow2 and its ecosystem benefit from having an arrow-types crate as discussed here?

@jorgecarleitao
Copy link
Member

Hey, Thanks for the ping!

I think it would not benefit arrow2 directly right now as it has different declarations for Field (e.g. we do not have dict_id on it). Arrow2 also has an extension (DataType::Extension).

With that said, imo it is still a good design - there are systems that only require DataType, Field, Schema, and functionality to read them from a file. One example is a data catalog based on arrow logical types.

I think that datafusion's logical plans could also only depend on types, but I could be wrong (it depends on how List scalars are represented there?).

@alamb
Copy link
Contributor

alamb commented Sep 12, 2022

I think that datafusion's logical plans could also only depend on types, but I could be wrong (it depends on how List scalars are represented there?).

List scalars are represented as Vec<Box<Field> so I agree it may be the case the logical plans could depend only on type

https://github.com/apache/arrow-datafusion/blob/d16457a0ba129b077935078e5cf89d028f598e0b/datafusion/common/src/scalar.rs#L81

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 12, 2022
@tustvold
Copy link
Contributor Author

I've created #2711 which splits out the schema definitions into a crate called arrow-schema. I thought this was more clearly the logical types than something called arrow-types. PTAL 😄

@maxburke
Copy link
Contributor

As a downstream user of Arrow, one of the things we find is that we need to fork Arrow-ecosystem crates to quickly integrate patches for missing features or bugs and I think one thing that I'm dreading is having to do that with an exploded Arrow crate, having to fork half a dozen Arrow packages, Parquet, all of the Datafusion crates, ..., seems like it'll be a royal pain.

@tustvold
Copy link
Contributor Author

tustvold commented Sep 14, 2022

Hi @maxburke, the intention is to follow the work that was already performed for DataFusion, and would not involve splitting the repository. So I think you shouldn't have to maintain any more or less forks following this? The [patch.crates-io] directive would need to be for every crate though

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit that referenced this issue Dec 8, 2022
* Split out arrow-string (#2594)

* Doc

* Clippy
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022
tustvold added a commit that referenced this issue Dec 8, 2022
* Split out arrow-ord (#2594)

* Make LexicographicalComparator public

* Tweak CI

* Fix SIMD

* Doc
@alamb
Copy link
Contributor

alamb commented Dec 19, 2022

I wonder is it done yet 🙏

@tustvold
Copy link
Contributor Author

tustvold commented Dec 19, 2022

arrow-row and arrow-arith then yes, will likely do tomorrow

@alamb
Copy link
Contributor

alamb commented Dec 20, 2022

giphy

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022
tustvold added a commit that referenced this issue Dec 21, 2022
* Split out arrow-row (#2594)

* Fix CI

* Fix doc

* More SortOptions to arrow_schema
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 21, 2022
tustvold added a commit that referenced this issue Dec 21, 2022
* Split out arrow-arith (#2594)

* Update CI

* Fix clippy

* Update docs

* Feature flag

* Fix CI

* Cleanup dependencies
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
5 participants