
API, Spark: Fix aggregation pushdown on struct fields #9176

Merged
merged 3 commits into apache:main on Jan 31, 2024

Conversation

@amogh-jahagirdar amogh-jahagirdar commented Nov 29, 2023

Currently, aggregation pushdown on struct fields fails with casting errors because in certain cases (e.g. an optional field in a struct) the accessor visitor constructs a WrappedPositionAccessor when visiting the field. A WrappedPositionAccessor expects the element it accesses to be a StructLike, but for aggregation pushdown that assumption does not hold (the top-level element is a plain Java value).

Since there is always a single value, I think we should be able to read the evaluated value directly without performing any term evaluation.
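
For context, here is a minimal sketch of the kind of query that exercises this path. The table, column names, and the spark session are hypothetical, not taken from the PR; the point is an aggregate over an optional nested struct field that Spark can push down to Iceberg:

    // hypothetical Iceberg table with a struct column whose nested fields are optional
    spark.sql("CREATE TABLE db.items (id BIGINT, info STRUCT<weight: INT, height: INT>) USING iceberg");
    spark.sql("INSERT INTO db.items VALUES (1, named_struct('weight', 10, 'height', 180))");

    // MAX on the nested field is eligible for aggregation pushdown; before this fix, evaluating
    // the pushed-down aggregate could fail with a casting error because the bound accessor
    // expected a StructLike rather than the plain Java value read from file metadata
    spark.sql("SELECT MAX(info.height) FROM db.items").show();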

@amogh-jahagirdar amogh-jahagirdar changed the title from "API, Spark: Fix aggregation pushodwn on struct fields" to "API, Spark: Fix aggregation pushdown on struct fields" Nov 29, 2023
Comment on lines 38 to 42
-    valueStruct.setValue(evaluateRef(file));
-    return term().eval(valueStruct);
+    return (T) evaluateRef(file);
Contributor Author

Ok, so the main question here is: do we actually need to pass the single value down through term evaluation? What's a case where this would fail? All the tests currently pass without it, but I'm doubtful that this generalizes.

Contributor

Seems OK to me to just return evaluateRef(file).

I'd like to get @rdblue's opinion on this as well.

Contributor Author

I should also note that if we determine this approach is good, we can remove the SingleValueStruct class itself.

Contributor

The case that I think is missing is when the term is not a simple reference and is instead a transformed value. That's why the SingleValueStruct class is here. The idea is to evaluate the original term (which could be a transform of a field) on the value that was retrieved from the DataFile metadata.

Also, it's good to keep in mind what each eval method is for. The eval(StructLike) method is for evaluating the aggregate on individual rows. That's because this is generic and we can use the aggregate framework to calculate aggregates of normal rows. That means the change to that method isn't correct. That eval method should call the accessor that was bound to the schema of incoming rows.

This eval method, eval(DataFile), is called to produce a value for a set of rows. As I said, that value may then be transformed by the expression, so we need to run the expression term on it. To be able to run the term, we need a row that can supply the value that came from the DataFile, and that's where SingleValueStruct comes in: it returns the DataFile value for any position that is queried, so it should work with any accessor. The problem is that I didn't consider nested cases when writing it. It always returns the value, but it should return itself when the caller expects the value to be nested.

Here's an alternative fix for SingleValueStruct that keeps the original eval implementations but handles nested values:

    public <T> T get(int pos, Class<T> javaClass) {
      if (javaClass.isAssignableFrom(StructLike.class)) {
        // a nested accessor expects a struct at this position; return this wrapper so the
        // accessor can keep descending until it reaches the leaf value
        return (T) this;
      } else {
        // leaf position: return the single value read from the data file metadata
        return (T) value;
      }
    }

The tests from this PR pass with that change.

@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review November 30, 2023 18:10
@rdblue rdblue added this to the Iceberg 1.5.0 milestone Jan 2, 2024
@@ -249,6 +250,78 @@ public void testAggregateNotPushDownIfOneCantPushDown() {
assertEquals("expected and actual should equal", expected, actual);
}

@Test
public void testAggregationPushdownStructInteger() {
Contributor

An alternative to having multiple methods that effectively call the same test code is to have a single test method that is parameterized.

Contributor Author

I did some more refactoring. While we could use @Parameterized in this case, it ends up not being as readable as just being explicit, imo. I've updated the PR so that there are helpers for asserting the aggregates and the expected contents of the explain plan.
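
For illustration, a rough sketch of what such a helper could look like; the method name, parameters, and the spark field are assumptions for this sketch (not the actual code in the PR), and it assumes java.util.List, org.apache.spark.sql.Row, and org.assertj.core.api.Assertions are imported:

    // hypothetical helper: assert the pushed-down aggregates show up in the explain plan
    // and that the aggregate query returns the expected values
    private void assertAggregatePushdown(String query, List<Object> expectedValues, String... expectedFragments) {
      String explain = spark.sql("EXPLAIN " + query).collectAsList().get(0).getString(0);
      for (String fragment : expectedFragments) {
        Assertions.assertThat(explain)
            .as("explain plan should show the pushed-down aggregate")
            .contains(fragment);
      }

      Row result = spark.sql(query).collectAsList().get(0);
      for (int i = 0; i < expectedValues.size(); i++) {
        Assertions.assertThat(result.get(i))
            .as("aggregate value at position %s", i)
            .isEqualTo(expectedValues.get(i));
      }
    }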

long timestamp = System.currentTimeMillis();
long futureTimestamp = timestamp + 5000;
Timestamp expectedMax = new Timestamp(futureTimestamp / 1000 * 1000);
Timestamp expectedMin = new Timestamp(1000 * (timestamp / 1000));
@rdblue rdblue Jan 28, 2024

Why are the expressions different other than the timestamp variable?

Also, do these need to be timestamps? Or can you convert them back to millis or micros in Spark? We want to avoid using Timestamp in tests because it has wacky behavior across time zones.

Contributor Author

I was aiming to test multiple data types. The issue with converting in Spark is that the goal of the test is to verify that aggregation pushdown actually happened; wrapping the column in any conversion function eliminates the pushdown, and the test loses that value.

In the latest version, I insert the records in Spark explicitly in UTC (which is then stored in timestamp-without-timezone format in Iceberg). When the records are read back, the java.sql.Timestamp is a point in time, milliseconds since epoch. This should be deterministic across time zones unless there's some implicit session conversion that I'm missing.
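
As a rough sketch of that setup (the table name is hypothetical, and pinning the session time zone is only one way to do this, not necessarily what the PR does):

    // pin the session time zone so timestamp literals and java.sql.Timestamp values are
    // interpreted as UTC, independent of the JVM's default zone
    spark.conf().set("spark.sql.session.timeZone", "UTC");
    spark.sql("INSERT INTO db.tbl VALUES (timestamp '2024-01-01 10:00:00')");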

@rdblue rdblue commented Jan 28, 2024

@amogh-jahagirdar, I fixed the implementation in the PR above. It would be great to get this into 1.5.0 also!

@rdblue rdblue merged commit 9de693f into apache:main Jan 31, 2024
42 checks passed
@rdblue rdblue commented Jan 31, 2024

Nice work, @amogh-jahagirdar! Thanks for getting this done for the 1.5 release!

    Arrays.stream(expectedFragments)
        .forEach(
            fragment ->
                Assertions.assertThat(explainString.contains(fragment))
Contributor

This should be Assertions.assertThat(explainString).as(...).contains(fragment). We typically try to avoid isTrue() / isFalse() on assertions like these because they don't provide any contextual insight when an assertion fails.
Using assertThat(explainString).as(...).contains(fragment) instead will show the content of explainString and fragment whenever the assertion fails.
Also note that .as() needs to be specified before the final assertion; otherwise it is ignored.
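
A short sketch of the difference, reusing the names from the snippet above:

    // discouraged: a failure only reports "expected true but was false"
    Assertions.assertThat(explainString.contains(fragment)).as("explain plan").isTrue();

    // preferred: a failure prints the actual explainString and the missing fragment;
    // note that .as(...) must come before the terminal assertion or it is ignored
    Assertions.assertThat(explainString)
        .as("explain plan should contain the pushed-down aggregate")
        .contains(fragment);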
