[Format] Passing column statistics through Arrow C data interface #38837
Comments
There's no current standard convention. There are a few ways such statistics could be sent, such as via a struct array with one row and a field for each statistic, or various other configurations. Passing such statistics would be beyond the scope of the current C Data interface, IMHO. |
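To illustrate one such ad-hoc configuration, here is a minimal pyarrow sketch of a one-row struct array carrying statistics for a single column; the field names are invented for the example, which is exactly the problem with having no standard convention:

import pyarrow as pa

# One-row struct array smuggling statistics for a single column.
# Nothing about these field names is standardized.
stats = pa.array(
    [{"min": 0, "max": 99_999_999, "null_count": 0}],
    type=pa.struct([
        ("min", pa.int64()),
        ("max", pa.int64()),
        ("null_count", pa.int64()),
    ]),
)
print(stats)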
There is a related discussion at duckdb/duckdb#4636 (reply in thread) |
I would argue that it makes sense to include statistical information in the C data interface. Efficient execution of complex queries requires statistical information, and I believe that most Arrow producers possess this information, so it should be passed somehow. An alternative I can think of for the C data interface is to wrap the statistical information in top-level objects (e.g., ArrowRecordBatch in Python), but that approach is quite cumbersome and would necessitate specific implementations for every client API. |
Just to provide a clearer example of how statistics can be useful for query optimization: in DuckDB, join ordering is currently determined using heuristics based on table cardinalities, with future plans to enhance this with sample statistics. Not only join ordering is affected by statistics; even the choice of the probe side in a hash join is determined based on the expected cardinalities. One example of a query that is affected by join ordering is Q21 of TPC-H; however, its plan is too big to share easily in a GitHub Discussion. To give a simpler example of how cardinalities affect this, I've created two tables.
My example query is a simple inner join of these two tables in which we calculate a sum: SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i). Because the optimizer doesn't have any statistics from the Arrow side, it will basically pick the probe side depending on how the query is written. SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i) results in a slightly different plan, yet with significant differences in performance. As depicted in the screenshot of executing both queries, choosing the incorrect probe side for this query already results in a performance difference of an order of magnitude. For more complex queries, the variations in execution time could be not only larger but also more difficult to trace. For reference, the code I used for this example:

import duckdb
import time
import statistics
con = duckdb.connect()
# Create table with 10^8 rows
con.execute("CREATE TABLE t as SELECT * FROM RANGE(0, 100000000) tbl(i)")
# Create table with 10 rows
con.execute("CREATE TABLE t_2 as SELECT * FROM RANGE(0, 10) tbl(k)")
query_slow = '''
SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i);
'''
query_fast = '''
SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i);'''
t = con.execute("FROM t").fetch_arrow_table()
t_2 = con.execute("FROM t_2").fetch_arrow_table()
con_2 = duckdb.connect()
print("DuckDB Arrow - Query Slow")
print(con_2.execute("EXPLAIN " + query_slow).fetchall()[0][1])
execution_times = []
for _ in range(5):
    start_time = time.time()
    con_2.execute(query_slow)
    end_time = time.time()
    execution_times.append(end_time - start_time)
median_time = statistics.median(execution_times)
print(median_time)
print("DuckDB Arrow - Query Fast")
print(con_2.execute("EXPLAIN " + query_fast).fetchall()[0][1])
execution_times = []
for _ in range(5):
    start_time = time.time()
    con_2.execute(query_fast)
    end_time = time.time()
    execution_times.append(end_time - start_time)
median_time = statistics.median(execution_times)
print(median_time) |
I'm considering some approaches for this use case. This isn't complete yet, but I'll share my idea so far. Feedback is appreciated. ADBC uses the following schema to return statistics: It's designed for returning statistics of a database. We can simplify this schema because we just need to return the statistics of a record batch. For example:
TODO: How should we represent the statistic key? Should we use the ADBC style (assigning an ID to each statistic key and using it)? If we represent statistics as a record batch, we can pass statistics through the Arrow C data interface. This may be a reasonable approach. If we use it, we need to do the following:
TODO: Consider statistics related API for Apache Arrow C++. |
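To make the record-batch idea concrete, here is a minimal sketch (not the actual proposal) of a simplified, ADBC-inspired statistics record batch in pyarrow; the column names and the single int64 value column are assumptions for illustration:

import pyarrow as pa

# One row per (column, statistic) pair describing a data record batch.
statistics = pa.RecordBatch.from_pydict({
    # Index of the described column in the data record batch.
    "column_index": pa.array([0, 0, 0], pa.int32()),
    # Statistic key; how to standardize these is the open TODO above.
    "statistic_key": pa.array(["min", "max", "null_count"]),
    # A real design needs more value types than int64 (union, etc.).
    "value": pa.array([0, 99_999_999, 0], pa.int64()),
    # Whether the value is exact or estimated (e.g. HyperLogLog NDV).
    "is_approximate": pa.array([False, False, False]),
})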
I'd be curious what others think of this approach as opposed to actually making a format change to include statistics alongside the record batches in the API, particularly in the case of a stream of batches. I'm not against it; I just don't know if others would be opposed to needing an entirely separate record batch to be sent containing the statistics. |
I think the top priority should be to avoid breaking ABI compatibility. I suspect that most users of the C data interface will not want to pass statistics. We should avoid doing anything that would cause disruption for those users. |
Ah, I should have written an approach that changes the current Arrow C data interface. |
This proposal is great. Just an unrelated issue: Parquet |
Is that set of value types enough? |
At least for ADBC, the idea was that other types can be encoded in those choices. (Decimals can be represented as bytes, dates can be represented as int64, etc.) |
Some approaches that are based on the C Data interface https://arrow.apache.org/docs/format/CDataInterface.html : (1) Add
|
Do consumers want per-batch metadata anyway? I would assume that in the context of, say, DuckDB, they'd like to get statistics for the whole stream up front, without reading any data, and use that to inform their query plan. |
Ah, then we should not mix statistics and |
This is just off the cuff, but maybe we could somehow signal to an ArrowArrayStream that a next call to get_next should instead return a batch of Arrow-encoded statistics data. That wouldn't break ABI, so long as we come up with a good way to differentiate the call. Then consumers like DuckDB could fetch the stats up front (if available) and the schema would be known to them (if we standardize on a single schema for this). |
Hmm. It may be better to provide a separate API to get statistics, like the #38837 (comment) approach. |
We talked about this a little: what about approach (2) from Kou's comment above, but for now only defining table-level statistics (primarily row count)? AIUI, row count is the important statistic to have for @pdet's use case, and it is simple to define. We can wait and see on more complicated or column-level statistics. Also, for the ArrowDeviceArray there is some room for extension (lines 134 to 135 in 14c54bb):
Could that be useful here? (Unfortunately there isn't any room for extension on ArrowDeviceArrayStream.) |
Thanks for sharing the idea we talked about. I took a look at the DuckDB implementation. It seems that DuckDB uses only column-level statistics:
https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/function/table_function.hpp#L253-L255
It seems that a numeric/string column can have min/max statistics: https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/numeric_stats.hpp#L22-L31 (A string column can have more statistics, such as whether it contains Unicode and its max length.) Hmm. It seems that column-level statistics are also needed for real-world use cases. |
Hey guys, Thank you very much for starting the design of Arrow statistics! That's exciting! We are currently interested in up-front full-column statistics. Specifically:
As a clarification, we also utilize row-group min-max statistics for filtering optimizations in DuckDB tables, but these cannot benefit Arrow. In Arrow, we either push down filters to an Arrow scanner or create a filter node on top of the scanner, and we do not utilize the min-max of chunks for filter optimization. For a more detailed picture of what we hold for statistics, you can also look in our statistics folder. But I think that approximate distinct count, cardinality, and min-max are enough for a first iteration. |
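For concreteness, all three of those statistics can be computed in DuckDB itself, with the distinct count estimated rather than exact; a small illustrative script:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT * FROM range(0, 1000000) tbl(i)")

# Cardinality, HyperLogLog-based approximate NDV, and min/max.
print(con.execute("""
    SELECT count(*)                 AS cardinality,
           approx_count_distinct(i) AS approx_ndv,
           min(i)                   AS min_i,
           max(i)                   AS max_i
    FROM t
""").fetchone())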
Table cardinality would be table-level, right? But of course the others are column-level. Hmm. We didn't leave ourselves room in ArrowDeviceArrayStream... And just to be clear, "up front" means at the start of the stream, not per-batch, right? |
Exactly!
Yes, at the start of the stream. |
@lidavidm Technically, the C Device structs are still marked as experimental in the documentation and haven't been adopted by much of the ecosystem yet (as we're still adding more tooling in libarrow and the pycapsule definitions for using them), so adding room in ArrowDeviceArrayStream should still be viable without breaking people, either by adding members or an entirely separate callback function. |
I think it would be interesting to add an extra callback to get the statistics, yeah. |
It's hard though, because arguably it's kind of outside the core intent of the C Data Interface. But on the other hand, they kind of need to be here to have any hope of being supported more broadly. |
Hmmm, I don't know: should we represent this as "Approximate"? |
It seems that DuckDB uses HyperLogLog for computing distinct count: https://github.com/duckdb/duckdb/blob/d26007417b7770860ae78278c898d2ecf13f08fd/src/include/duckdb/storage/statistics/distinct_statistics.hpp#L25-L26 It may be the reason why "Approximate" is included here. |
Nice. I mean, should "distinct count" (or NDV) be "estimated" in our data interface? And if we add an "estimated" NDV, should we also add an "exact" NDV, or is "estimated" alone OK here? In (1) #38837 (comment), would adding a key here be heavy? |
The ADBC encoding allows the producer to mark statistics as exact/approximate, fwiw
See #38837 (comment) |
Exactly. In general, systems don't provide exact distinct statistics because these are rather expensive to calculate; hence, they use approximate strategies. In DuckDB's case, it is a HyperLogLog implementation. |
Thanks for sharing more information. Here is my new idea: It's based on the "(2) Add statistics to It puts all statistics into the If we have a record batch that has
We can put a
|
Since it's now per-column, maybe we can just let the type of the underlying array be one of the VALUE_SCHEMA fields? |
I think that we can't do it. |
Right, we still need multiple types. But for min/max statistics, we can include the actual type as one of the possible types, right? (In other words, a sort of |
Ah, you meant that we can add But it may be difficult to use. If we use a different schema for each column, we also need to provide an |
Hmm, the caller should still know the schema, so long as we always put the dynamically typed column (so to speak) at a fixed column index |
Ah, you're right. |
Here is the latest idea. Feedback is welcome. It's based on #38837 (comment) and #38837 (comment) . If we have a record batch that has
TODO: Should we embed
|
FYI: Here is the reason why ADBC chose dictionary encoding: apache/arrow-adbc#685 (comment)
|
Since there's no available "side channel" here, string names probably make more sense |
That would essentially be a sparse union? Though I guess if you assume the caller knows the right type for a particular kind of statistic you can save a bit on the encoding, and presumably there aren't enough different statistics for the extra allocated space to matter (as compared to a dense union) |
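For reference, a union-typed value column of the kind ADBC uses could be declared like this in pyarrow (the set of child types here is illustrative):

import pyarrow as pa

# Dense union: each row stores a type code plus an offset into the
# child array of that type, so mixed-type statistic values stay compact.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("float64", pa.float64()),
    pa.field("string", pa.string()),
    pa.field("binary", pa.binary()),
])
print(value_type)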
Just to be clear, when we say
this means the address of the ArrowArray will be encoded (as a base 10 string?) in the metadata? |
OK. Let's use strings for statistic keys.
Yes.
This idea is not about space efficiency; I thought this might be easier to use. A union may be a bit complicated. But users need to know which column is used for each statistic key, as you mentioned. (e.g. Let's use a union like ADBC does. But we have a problem with the union approach:
Yes. I should have mentioned it explicitly. |
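A very rough sketch of those mechanics using pyarrow's C data interface bindings; the "ARROW:statistics" metadata key is a hypothetical name, and _export_to_c is pyarrow's private export helper, so this is illustration only, not a spec:

import pyarrow as pa
from pyarrow.cffi import ffi

# The statistics record batch the producer wants to hand over.
statistics = pa.RecordBatch.from_pydict({"null_count": pa.array([0])})

# Export it through the C data interface.
c_array = ffi.new("struct ArrowArray*")
c_schema = ffi.new("struct ArrowSchema*")
address = int(ffi.cast("uintptr_t", c_array))
statistics._export_to_c(address, int(ffi.cast("uintptr_t", c_schema)))

# Encode the ArrowArray address as a base-10 string in the data
# schema's metadata.
data_schema = pa.schema([("i", pa.int64())]).with_metadata(
    {"ARROW:statistics": str(address)})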
If the caller doesn't import the statistics array, how will it get released? |
How about applying the "Member allocation" semantics? https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
The current … Or we can add … |
Ah, that works. I suppose we could also just make |
It's a good idea! |
@ianmcook @zeroshade how does the sketch sound? |
Describe the enhancement requested
Is there any standard or convention for passing column statistics through the C data interface?
For example, say there is a module that reads a Parquet file into memory in Arrow format then passes the data arrays to another module through the C data interface. If the Parquet file metadata includes Parquet column statistics such as distinct_count, max, min, and null_count, can the sending module pass those statistics through the C data interface, to allow the receiving module to use the statistics to perform computations more efficiently?
Component(s)

Format