Improve the performance of Aggregator, grouping, aggregation #4973
I'm curious about the proposed new Aggregator API. Do you know if it needs a hash table in each aggregator? I'm wondering because I'm a bit concerned about memory usage, especially for high cardinality aggregations, since the group keys would be duplicated across the per-aggregator hash tables.
Yes, see #2723 (comment). We're trading speed for slightly more memory usage here. We could probably avoid storing the actual key in the per-aggregator hash tables by either interning the key using another hash table (key -> ID) or by hashing the keys to a "wide enough" ID (like 128 bit, assuming the hash function is actually proper).
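As a rough illustration of the interning idea (hypothetical code with simple byte-slice keys, not DataFusion's actual structures):

```rust
use std::collections::HashMap;

/// Hypothetical interner: maps each distinct group key to a small dense ID,
/// so per-aggregator hash tables can store a `u64` instead of the full key bytes.
#[derive(Default)]
struct KeyInterner {
    ids: HashMap<Vec<u8>, u64>, // key bytes -> interned ID
}

impl KeyInterner {
    /// Returns the existing ID for `key`, or assigns the next dense ID.
    fn intern(&mut self, key: &[u8]) -> u64 {
        let next = self.ids.len() as u64;
        *self.ids.entry(key.to_vec()).or_insert(next)
    }
}

fn main() {
    let mut interner = KeyInterner::default();
    let a = interner.intern(b"group-a");
    let b = interner.intern(b"group-b");
    assert_eq!(a, interner.intern(b"group-a")); // duplicate keys share one ID
    assert_ne!(a, b);
}
```

The 128-bit variant would instead store a wide hash of the key directly, trading the extra interning lookup for a (very small) collision probability.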
So there will also be one copy of the keys per aggregator?
These are good questions. There is no prototype yet, but any PR must run the group-by benchmarks. That said, I am open to other suggestions. The issue is that the current system needs a redesign:
While I agree that we may need to revisit the current implementation, I view row hash as a fast lane for certain queries rather than a dead end. By efficiently executing queries through row hash, we can avoid slower or more general approaches. Furthermore, I am open to exploring other specialized aggregate implementations that offer performance benefits to handle more use cases quickly. For example, Spark provides multiple hash aggregate implementations, including vectorized hash, hash, and object hash. Similarly, ClickHouse offers different implementation options for its hash table based on key types. Utilizing these specialized implementations can optimize query performance without relying on a one-size-fits-all approach.
Regarding the proposed criteria for a fallback strategy, I believe that while it should be easy to add types and aggregations, not all internal implementations should be treated equally, or else we risk the entire engine running at the speed of UDAFs.
While it would be beneficial for the fallback strategy to work with memory accounting, it is not strictly necessary. Users can take risks when defining UDAFs, or we could even risk the less-used built-in functions.
I'm not convinced that data accesses need to be strongly typed for aggregate implementations, which are entirely internal to the engine.
I understand that you were referring to the aligned row format, but it is possible to handle variable-length fields using 'pointer swizzling'. Additionally, I don't believe that
What do people think about implementing the strategy as described in #2723 (comment), and then revisiting the question of whether we keep additional fallback approaches based on empirical evidence about their performance characteristics? If this proves to outperform all the existing implementations, there is no harm in removing the other approaches; if it does not, we can keep them.
I think it is important to highlight that the proposal in #2723 (comment) is still a row hash approach, just using a different row encoding. Early benchmarks show it to be significantly faster.
I agree this should be driven by benchmark results, and the outcome should be as maintainable as possible. Given how performance critical this code is, I think performance is a higher priority than requiring strong typing (though of course it would be great if the resulting solution is strongly typed as well as high performance). If we can't handle memory accounting for some reason, giving the user a choice between potentially unlimited memory and faster performance seems acceptable. I would be surprised, however, if we can't work a memory limit check in somehow and amortize its cost into the noise.
@tustvold There appears to be a discrepancy between your proposal and my current understanding. Can you clarify how I should interpret this?
Do you know if it needs a hash table in each aggregator? What does the internal state mean? Can you elaborate? If yes, what do you mean by "still a row hash approach"? Keys are rows, but aggregator states are not? If the proposal is to remove all (word-aligned/compact) row formats inside DataFusion, use the row format from arrow-rs, and achieve a significant performance improvement, I completely vote for the change.
To clarify the situation around the row format, @tustvold and I think:
The current compact aggregation looks roughly as follows:
Our proposal would be:
Now in our proposal you CAN still use reshuffle + a flat state buffer, but in many cases it might be cheaper to just iterate over the input and do the aggregation directly (reshuffle vs. hashing multiple times). So technically NOTHING would change when this proposal is implemented, except that the aggregation operator no longer needs to care about "is this a compact aggregator or a non-compact one" and that the buffers are owned by the aggregators instead of flying around. However, the change provides a base to simplify aggregators or to make them even faster with more hacks (if we want).
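As a minimal illustration of the "aggregator owns its state and is driven once per batch" idea (hypothetical names, not actual DataFusion types):

```rust
/// Hypothetical SUM aggregator that owns a flat per-group state buffer and is
/// called once per batch, updating row by row instead of having the operator
/// `take`-reshuffle the input per group.
struct SumAggregator {
    sums: Vec<i64>, // flat state buffer, indexed by dense group ID
}

impl SumAggregator {
    fn update_batch(&mut self, group_ids: &[usize], values: &[i64]) {
        for (&g, &v) in group_ids.iter().zip(values) {
            if g >= self.sums.len() {
                self.sums.resize(g + 1, 0); // grow state for newly seen groups
            }
            self.sums[g] += v; // direct row-by-row aggregation
        }
    }
}

fn main() {
    let mut agg = SumAggregator { sums: Vec::new() };
    // Rows with group IDs [0, 1, 0] and values [10, 20, 5]:
    agg.update_batch(&[0, 1, 0], &[10, 20, 5]);
    assert_eq!(agg.sums, vec![15, 20]);
}
```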
I take it differently here; changing one single hash table to per-aggregator hash tables is a fundamental change. After this change, we will theoretically:
If the current inefficiency comes from too much dyn dispatch, I would like to try JIT compilation.
The cache locality is actually likely better, as the aggregated values can be stored inline within the hash table because their sizes are known at compile time. Similarly, the row format stores consecutive rows contiguously in memory, so you have very good locality from that perspective as well.
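For illustration of the inline-state point above (hypothetical code, not DataFusion's hash table):

```rust
use std::collections::HashMap;

/// Per-group state whose size is known at compile time, so it can live inline
/// in the hash table entry instead of behind a separately allocated `Box`.
#[derive(Default, Clone, Copy)]
struct CountSum {
    count: u64,
    sum: i64,
}

fn main() {
    // Probing and updating touch the same table slot: no pointer chase to a
    // heap-allocated accumulator per group.
    let mut groups: HashMap<u64, CountSum> = HashMap::new();
    for (key, v) in [(1u64, 10i64), (2, 20), (1, 5)] {
        let state = groups.entry(key).or_default();
        state.count += 1;
        state.sum += v;
    }
    assert_eq!(groups[&1].sum, 15);
}
```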
I think if we can get competitive performance without having to reach for a JIT, that is a massive win for maintainability. I originally reached for a JIT when creating the row format, and came to the conclusion that it is a huge additional complexity that needs to yield order-of-magnitude performance improvements to justify itself. In the case of the row format the JIT approach was actually slower than a correctly vectorised version. That's not to say we couldn't use a JIT, just that it should definitely not be the first thing we try.
I see no reason to be fundamentalist here; we can support multiple approaches so long as they provide benefits to certain workloads. I suggest we move ahead with a POC, unless you object, and get some real performance data.
Thank you @tustvold and @crepererum for your patient and detailed explanation.
I understand your main idea now and look forward to learning more about it. I agree that JIT is complex, and we should definitely try the low-hanging fruit first.
I agree with adding this proposal as an alternative code path for aggregation. It would be great to have performance data in the future.
It is great to see the discussion on this ticket -- it is a really good read.
I'm actually working on a POC to improve the hash aggregation performance, following a very similar approach. The only difference is that I'm not using the arrow-rs row format. The approach requires quite a few API changes. I was able to see a big improvement for simple cases at least - I haven't done comprehensive benchmarks.
Have you confirmed this? In my benchmarks I've found it significantly faster to encode than the DF row format. The order-preserving part is basically free for most types; it is just a couple of bit twiddles.
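For illustration, the order-preserving trick for a signed integer amounts to flipping the sign bit and emitting big-endian bytes (a sketch of the general technique, not the arrow-rs RowConverter internals):

```rust
/// Order-preserving byte encoding for i32: flip the sign bit and emit
/// big-endian bytes, so an unsigned bytewise comparison of the encodings
/// matches the signed numeric order.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

fn main() {
    let mut values = [3, -1, 0, -100, 42];
    let mut encoded: Vec<[u8; 4]> = values.iter().map(|&v| encode_i32(v)).collect();
    values.sort();
    encoded.sort(); // plain bytewise sort
    let decoded: Vec<i32> = encoded
        .iter()
        .map(|b| (u32::from_be_bytes(*b) ^ 0x8000_0000) as i32)
        .collect();
    assert_eq!(decoded, values); // same order either way
}
```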
Aah, yeah, dictionaries are expensive as it has to compute a combined dictionary. I'm curious what the DF format is doing here - is it just hydrating all the values?
I don't think the DF format considers dictionaries 😂. In the POC, for group-by columns we basically "unpack" the dictionary arrays into plain rows while doing the columnar-to-row conversion. For aggregate columns, we initialize the buffer (using the DF format) based on the schema of the aggregates. If a dictionary type shows up in any of the aggregates, we'll remove it and replace it with its value type.
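A minimal sketch of that "unpacking" idea using the arrow `take` kernel to hydrate a dictionary column (assuming the arrow crate; this is not the POC's actual code):

```rust
use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::compute::take;
use arrow::datatypes::Int32Type;

fn main() {
    // A dictionary-encoded column: values ["a", "b"], keys [0, 0, 1, 0].
    let dict: DictionaryArray<Int32Type> = vec!["a", "a", "b", "a"].into_iter().collect();

    // "Hydrate" the dictionary: gather values[keys[i]] for every row, yielding
    // a plain array that a row encoding can consume directly.
    let plain = take(dict.values().as_ref(), dict.keys(), None).unwrap();
    let plain = plain.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(plain.value(2), "b");
}
```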
Hmm... Perhaps I could add an option to RowConverter to hydrate dictionaries, to allow trading off space in this way 🤔. The main advantage of the arrow format, aside from reducing the number of formats we need to maintain, is that aggregators can easily spill sorted state, which can then be cheaply recombined. Either way, very cool to see this operator getting some love, and encouraging that we're thinking along the same lines 👍
Agreed 👍
This sounds reasonable (at least at first thought).
@Dandandan does DF use SIMD in the hot paths of hash join? From my experience, LLVM auto-vectorization is pretty fragile and often doesn't get triggered if the logic is a bit complex. I think even the
@sunchao TBH I am not sure whether everything generates optimal SIMD instructions; however, in the hash join the hash calculation is vectorized. The above experiment is only about the conversion from using a
@Dandandan thanks, I agree that calculating hashes itself is probably not so critical to performance compared to other hash table operations. We are doing some experiments switching to a vectorized hash table to see if it can deliver better performance. So far, without SIMD, it's roughly the same as
Cool @sunchao, very interested in what comes out of those experiments :)
I did another POC based on the changes in #6657. The basic idea is to reduce the memory size of the group state. Test results:

Q17

Q18
```rust
/// For non-fixed-length group keys
pub(crate) struct NonFixedSizeGroupState {
    /// Group data
    pub group_data: Vec<u8>,
    /// Accumulator data
    pub acc_data: Vec<u8>,
}

impl NonFixedSizeGroupState {
    #[inline(always)]
    fn group_data(&self) -> &[u8] {
        &self.group_data
    }

    #[inline(always)]
    fn agg_data(&self) -> &[u8] {
        &self.acc_data
    }
}

/// For fixed-length group keys
pub(crate) struct FixedSizeGroupState {
    /// Group data and accumulator state data, stored sequentially
    pub group_states: Vec<u8>,
}

impl FixedSizeGroupState {
    #[inline(always)]
    fn group_data(&self, data_width: usize) -> &[u8] {
        &self.group_states[0..data_width]
    }

    #[inline(always)]
    fn agg_data(&self, data_width: usize) -> &[u8] {
        &self.group_states[data_width..]
    }
}
```
```rust
fn update_one_accumulator_with_native_value<T1>(
    &mut self,
    groups_addresses: &[usize],
    agg_input_array: &T1,
    acc_idx: usize,
    filter_bool_array: &[Option<&BooleanArray>],
    row_layout: Arc<RowLayout>,
) -> Result<()>
where
    T1: ArrowArrayReader,
{
    let acc = &self.row_accumulators[acc_idx];
    let filter_array = &filter_bool_array[acc_idx];
    let mut state_accessor = RowAccessor::new_from_layout(row_layout);
    if filter_array.is_none() && agg_input_array.null_count() == 0 {
        for idx in 0..groups_addresses.len() {
            unsafe {
                let group_state_ptr = &mut *(&self.aggr_state.group_states
                    [groups_addresses[idx]]
                    as *const FixedSizeGroupState
                    as *mut FixedSizeGroupState);
                state_accessor.point_to(
                    0,
                    group_state_ptr.group_states[self.data_part_width..].as_mut(),
                );
                acc.update_value::<T1::Item>(
                    Some(agg_input_array.value_at_unchecked(idx)),
                    &mut state_accessor,
                );
            }
        }
    } else {
        for idx in 0..groups_addresses.len() {
            unsafe {
                let group_state_ptr = &mut *(&self.aggr_state.group_states
                    [groups_addresses[idx]]
                    as *const FixedSizeGroupState
                    as *mut FixedSizeGroupState);
                state_accessor.point_to(
                    0,
                    group_state_ptr.group_states[self.data_part_width..].as_mut(),
                );
                let value = col_to_value(agg_input_array, filter_array, idx);
                acc.update_value::<T1::Item>(value, &mut state_accessor);
            }
        }
    }
    Ok(())
}
```

And now the update-group-state method (hash, comparison, and matching related operations) becomes the bottleneck: Total evaluate agg input elapsed: 185
Based on the test result, I would like to do another POC and implement a new Accumulator framework. The framework will follow your suggestions in the new Accumulator API proposal and will combine the normal Accumulator and the RowAccumulator.
Sounds like a great idea @mingmwang ❤️ -- Note I also got super excited about this project and I am in the middle of implementing a POC for the proposed approach -- you can see the structure I have so far in #6800. I am hoping to have some numbers ready to report today. If you also do a POC that will be great, and we can compare the results and pick the one that shows the greatest promise.
@mingmwang @Dandandan and @tustvold (and everyone else following along), here is a POC: #6800. It follows a similar pattern to the one you describe in #4973 (comment), and it also makes q17 go more than 2x faster (and has no
I would love feedback and thoughts on the approach.
ClickBench query Q32 is also a very good benchmark of high-cardinality grouping. It's also the only query that did not finish successfully when the benchmark results were last updated.
We have implemented a hash key to further reduce allocations for fixed-size group keys/states, inspired by ClickHouse: cc @mingmwang
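For illustration, the ClickHouse-style trick is to pack fixed-size key columns into a single integer and use it directly as the hash key, avoiding any per-key allocation (hypothetical sketch, not the linked implementation):

```rust
use std::collections::HashMap;

/// Hypothetical fixed-size key: when all group columns fit in 64 bits
/// (e.g. two u32 columns), pack them into one integer and hash that directly.
fn pack_key(a: u32, b: u32) -> u64 {
    ((a as u64) << 32) | (b as u64)
}

fn main() {
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for &(a, b) in &[(1u32, 2u32), (1, 2), (3, 4)] {
        *counts.entry(pack_key(a, b)).or_insert(0) += 1;
    }
    assert_eq!(counts[&pack_key(1, 2)], 2);
}
```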
I am adding this comment here instead of opening a new issue, as it looks like the above PRs may fix/improve this, but please let me know if you would like a new issue. I am experimenting with using DataFusion instead of Polars as the query engine for dply-rs, and I have noticed that a group-by median is more than ~20x slower (I ran these a few times; the timing is consistent). The Polars version:
The DataFusion version:
This is using one year of data from the nyctaxi dataset. The DataFusion code:

```rust
use anyhow::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet("./data/*.parquet", Default::default())
        .await?;

    df.aggregate(
        vec![col("payment_type")],
        vec![
            median(col("total_amount")).alias("total_amount"),
            count(lit(1)).alias("n"),
        ],
    )?
    .sort(vec![col("n").sort(true, false)])?
    .show()
    .await?;

    Ok(())
}
```

The Polars code:

```rust
use anyhow::Result;
use polars::prelude::*;

fn main() -> Result<()> {
    let lazy = LazyFrame::scan_parquet("./data/*.parquet", ScanArgsParquet::default())?;
    let df = lazy
        .groupby([col("payment_type")])
        .agg([
            col("total_amount").median(),
            col("total_amount").count().alias("n"),
        ])
        .sort("n", Default::default())
        .collect()?;
    println!("{}", df);
    Ok(())
}
```
@vincev -- thank you. Yes, I expect DataFusion will get significantly faster with the new grouping. However, the overall performance of the median aggregator will still have room to improve (someone will need to write a specialized vectorized implementation of it).
@alamb I spent a bit of time looking at the median aggregator and was able to reduce the time from ~22secs to ~2secs by reducing the number of allocations.
I have created #6837.
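For illustration, an allocation-light median accumulator can append native values to one flat buffer per group and select the middle element only at the end (a hedged sketch of the general idea; #6837 may differ in its details):

```rust
/// Minimal sketch: input values go into a single flat `Vec<f64>` (amortized
/// growth, no per-row heap values), and the median is found with an O(n)
/// selection instead of a full sort.
#[derive(Default)]
struct MedianAccumulator {
    values: Vec<f64>,
}

impl MedianAccumulator {
    fn update_batch(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }

    fn evaluate(mut self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mid = self.values.len() / 2;
        // Note: a true median would average the two middle values for
        // even-length input; elided here for brevity.
        let (_, m, _) = self.values.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
        Some(*m)
    }
}

fn main() {
    let mut acc = MedianAccumulator::default();
    acc.update_batch(&[3.0, 1.0, 2.0]);
    assert_eq!(acc.evaluate(), Some(2.0));
}
```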
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The performance of DataFusion is a very important feature, and aggregation is one of the core algorithms.
For example, it features prominently in the ClickBench queries #5404.
This ticket covers making it even faster.
Background
Prior Art
After #4924 we have a single GroupByHash implementation that uses the Arrow row format for group keys (good!) -- thank you @ozankabak and @mustafasrepo
However, the aggregator (e.g. the thing storing partial sums, counts, etc.) is non-ideal. Depending on the type of aggregate, it is either:

- a `Box<dyn RowAccumulator>` per aggregate, or
- a `Box<dyn Accumulator>` per group

The GroupByHash operator manages the mapping from group key to state and passes a `RowAccessor` to the `RowAccumulator` if needed. Groups are calculated once per batch, then a `take` kernel reshuffles (aka copies) everything, and slices are passed to the `RowAccumulator` / `Accumulator`.
.Proposal
Ditch the word-aligned row format for the state management and change the aggregator interface accordingly; a rough sketch of one possible shape is shown at the end of this section.
Hence the aggregator will be dyn-dispatched ONCE per record batch and will keep its own internal state. This moves the key -> state map from `[row_]hash.rs` into the aggregators. We will provide tooling (macros or generics) to simplify the implementation and to avoid boilerplate code as much as possible.

Note that this also removes the `take` kernel, since we think it doesn't provide any performance improvement over iterating over the hashable rows and performing the aggregation row by row. We may bring the `take` handling back (as a "perform `take` if at least one aggregator wants it" option) if we can prove (via benchmarks) that this is desirable for certain aggregators, but we leave this out for now.

This also moves the memory consumer handling into the aggregator, since the aggregator knows best how to split states and spill data.
Implementation Plan
TBD
Additional Context
Note this was broken out of #2723 and is based on the wonderful writeup from @crepererum and @tustvold in #2723 (comment).