[parquet]: feature gate functionality in parquet. #4764

ritchie46 · 2023-09-03T10:05:06Z

I want to try partially moving/doubling IO to arrow-rs and start migration incrementally.

Describe the solution you'd like
Start feature gating some functionality in parquet that reduces compile time.

The text was updated successfully, but these errors were encountered:

tustvold · 2023-09-03T14:28:02Z

Really cool to see this happening. That being said, my experience with feature flags has been that the combinatorial explosion becomes very hard to test. It also has a tendency to result in poor test iteration times as different crates set different sets of features forcing recompilation.

I have instead found splitting up into sub-crates to be a better approach to this. However, the current granularity is already pretty small, is there a particular crate that is showing up as a bottleneck during compilation?

ritchie46 · 2023-09-03T19:52:23Z

Really cool to see this happening.

Gotta start somewhere. :)

It also has a tendency to result in poor test iteration times as different crates set different sets of features forcing recompilation.

This should not hurt your test times as I will ensure the default features remain the same. This will only make it possible for end-users as polars to cherry pick some stuff. For testing we use cargo hack to test every feature gate (not all combinations). But I don't think it's needed here as I don't plan to mess with the defaults.

is there a particular crate that is showing up as a bottleneck during compilation?

The compile times in polars are already way longer than we want. That's why I'd like to cherry pick on functionality. For the parquet crate I was looking add feature gating the IPC readers and take kernels out. Again this should not have any impact on the arrow-rs defaults/ repo and consumers.

tustvold · 2023-09-03T20:14:45Z

feature gating the IPC readers

The IPC readers are necessary to read the embedded arrow schema data, I'm also surprised if they are a meaningful bottleneck in compile times.

Yes arrow thrift encodes a base64-encoded flatbuffer 🤯

and take kernels out

The parquet crate doesn't make use of the take kernel, it does make use of the cast kernels though which indirectly use the cast kernel. This would be non-trivial to remove, especially for decimals

This should not hurt your test times as I will ensure the default features remain the same

Yes until someone uses disabled default features 😅 I dunno, I have found features to not really justify their downsides.

But I don't think it's needed here as I don't plan to mess with the defaults.

We still need to ensure compilation with non-default sets of feature flags? We currently already have a fairly long list of combinations in CI...

The compile times in polars are already way longer than we want. That's why I'd like to cherry pick on functionality

Perhaps we might do this empirically? If there are kernels that represent an outsize impact on compile times we then work out mitigation strategies for them? I strongly suspect that there are couple of particularly problematic sub-kernels (likely dictionaries) that have an outsize impact, and which it might be able to address this without having to resort to feature flags? I'm more than happy to help out with this. FWIW I have some ideas on how to make the take kernel less expensive. (Edit: nvm I already implemented this as #4705 I just forgot)

ritchie46 · 2023-09-04T06:12:35Z

I did some --timings --release builds on #4765

master	feature	perc
58.54	42.12	73%
42.6	33.8	79%
59.9	38.6	56%

There are some significant savings with some trivial features. Indeed as you said arrow-select and arrow-cast are the big boys. But I expect that by feature gating some code out we send much less code to the linker.

I understand your hesitation about feature gating, but I'd argue not all are the same. Some feature gates very naturally fit the code order and can gate entire modules. It can save a lot of cumulative compile times for downstream users. :) For the planet! 😉

tustvold · 2023-09-04T09:27:29Z

So long as we're being data driven and the returns are significant, I can probably grin and bear it

ritchie46 added the enhancement Any new improvement worthy of a entry in the changelog label Sep 3, 2023

ritchie46 mentioned this issue Sep 3, 2023

[parquet] feature gate some functionality #4765

Closed

This was referenced Sep 26, 2023

Support compiling remaining DataFusion crates (datafusion-core) to WASM apache/datafusion#7652

Open

Make parquet support optional apache/datafusion#7653

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[parquet]: feature gate functionality in parquet. #4764

[parquet]: feature gate functionality in parquet. #4764

ritchie46 commented Sep 3, 2023

tustvold commented Sep 3, 2023 •

edited

Loading

ritchie46 commented Sep 3, 2023

tustvold commented Sep 3, 2023 •

edited

Loading

ritchie46 commented Sep 4, 2023

tustvold commented Sep 4, 2023

[parquet]: feature gate functionality in parquet. #4764

[parquet]: feature gate functionality in parquet. #4764

Comments

ritchie46 commented Sep 3, 2023

tustvold commented Sep 3, 2023 • edited Loading

ritchie46 commented Sep 3, 2023

tustvold commented Sep 3, 2023 • edited Loading

ritchie46 commented Sep 4, 2023

tustvold commented Sep 4, 2023

tustvold commented Sep 3, 2023 •

edited

Loading

tustvold commented Sep 3, 2023 •

edited

Loading